"benchmark evaluation" Papers

100 papers found • Page 1 of 2

Conference

AAAI 2025 (3,028)COLM 2025 (418)CVPR 2025 (2,873)ICCV 2025 (2,701)ICLR 2025 (3,827)ICML 2025 (3,340)ISMAR 2025 (229)NEURIPS 2025 (5,858)AAAI 2024 (2,289)CVPR 2024 (2,716)ECCV 2024 (2,387)ICLR 2024 (2,297)ICML 2024 (2,635)

Paper Type

poster (24,624)paper (8,558)oral (1,594)spotlight (1,421)highlight (975)

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri et al.

NEURIPS 2025arXiv:2506.09038

citations

Accessing Vision Foundation Models via ImageNet-1K

Yitian Zhang, Xu Ma, Yue Bai et al.

ICLR 2025arXiv:2407.10366

citations

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

Yunjia Qi, Hao Peng, Xiaozhi Wang et al.

NEURIPS 2025spotlight

citations

AHa-Bench: Benchmarking Audio Hallucinations in Large Audio-Language Models

Xize Cheng, Dongjie Fu, Chenyuhao Wen et al.

NEURIPS 2025

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo, Karen Hambardzumyan, Martin Josifoski et al.

NEURIPS 2025spotlightarXiv:2507.02554

citations

AI-Researcher: Autonomous Scientific Innovation

Jiabin Tang, Lianghao Xia, Zhonghang Li et al.

NEURIPS 2025spotlightarXiv:2505.18705

citations

ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Ruchit Rawal, Reza Shirkavand, Heng Huang et al.

ICCV 2025arXiv:2506.07371

citations

A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks

Mucong Ding, Bang An, Tahseen Rabbani et al.

NEURIPS 2025

AutoPresent: Designing Structured Visuals from Scratch

Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou et al.

CVPR 2025arXiv:2501.00912

citations

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025arXiv:2410.18325

citations

Bag of Tricks for Inference-time Computation of LLM Reasoning

Fan LIU, Wen-Shuo Chao, Naiqiang Tan et al.

NEURIPS 2025arXiv:2502.07191

citations

Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?

Yifan Feng, Chengwu Yang, Xingliang Hou et al.

ICLR 2025arXiv:2410.10083

citations

BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Evan Antoniuk, Shehtab Zaman, Tal Ben-Nun et al.

NEURIPS 2025arXiv:2505.01912

citations

Can Knowledge Editing Really Correct Hallucinations?

Baixiang Huang, Canyu Chen, Xiongxiao Xu et al.

ICLR 2025arXiv:2410.16251

citations

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Zhibo Yang, Jun Tang, Zhaohai Li et al.

ICCV 2025arXiv:2412.02210

citations

CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Owen Queen, Harrison Zhang, James Zou

NEURIPS 2025arXiv:2510.11985

ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Peng Xu, Wei Ping, Xianchao Wu et al.

ICLR 2025arXiv:2407.14482

citations

CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark

Jian Wu, Linyi Yang, Zhen Wang et al.

ICLR 2025arXiv:2402.11924

citations

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Zihui Cheng, Qiguang Chen, Jin Zhang et al.

AAAI 2025paperarXiv:2412.12932

citations

C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto, Martin Gubri, Tommaso Green et al.

NEURIPS 2025arXiv:2506.11097

citations

DarkBench: Benchmarking Dark Patterns in Large Language Models

Esben Kran, Hieu Minh Nguyen, Akash Kundu et al.

ICLR 2025arXiv:2503.10728

citations

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025arXiv:2504.16072

citations

DGCBench: A Deep Graph Clustering Benchmark

Benyu Wu, Yue Liu, Qiaoyu Tan et al.

NEURIPS 2025

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal et al.

ICLR 2025arXiv:2407.01725

citations

Does Spatial Cognition Emerge in Frontier Models?

Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl et al.

ICLR 2025arXiv:2410.06468

citations

Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information

Yuke Zhu, Yue Zhang, Dongdong Liu et al.

ICLR 2025

citations

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Lixiong Qin, Shilong Ou, Miaoxuan Zhang et al.

NEURIPS 2025arXiv:2501.01243

citations

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang, Chenghui Lv, Xian Li et al.

CVPR 2025arXiv:2503.15024

citations

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang, Zhuofan Zhang, Ziyu Zhu et al.

NEURIPS 2025arXiv:2506.04897

citations

Generalizing Verifiable Instruction Following

Valentina Pyatkin, Saumya Malik, Victoria Graf et al.

NEURIPS 2025arXiv:2507.02833

citations

HELMET: How to Evaluate Long-context Models Effectively and Thoroughly

Howard Yen, Tianyu Gao, Minmin Hou et al.

ICLR 2025

citations

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.

ICCV 2025arXiv:2504.18406

IDEA-Bench: How Far are Generative Models from Professional Designing?

Chen Liang, Lianghua Huang, Jingwu Fang et al.

CVPR 2025arXiv:2412.11767

citations

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving of Inequalities

Haoyu Zhao, Yihan Geng, Shange Tang et al.

NEURIPS 2025arXiv:2505.12680

citations

Is Artificial Intelligence Generated Image Detection a Solved Problem?

Ziqiang Li, Jiazhen Yan, Ziwen He et al.

NEURIPS 2025arXiv:2505.12335

citations

Is Tracking really more challenging in First Person Egocentric Vision?

Matteo Dunnhofer, Zaira Manigrasso, Christian Micheloni

ICCV 2025highlightarXiv:2507.16015

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Rui Li, Zixuan Hu, Wenxi Qu et al.

NEURIPS 2025arXiv:2505.22634

citations

LIFEBENCH: Evaluating Length Instruction Following in Large Language Models

Wei Zhang, Zhenhong Zhou, Kun Wang et al.

NEURIPS 2025arXiv:2505.16234

citations

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Yuhao Wu, Ming Shan Hee, Zhiqiang Hu et al.

ICLR 2025arXiv:2409.02076

citations

Massive Sound Embedding Benchmark (MSEB)

Georg Heigold, Ehsan Variani, Tom Bagby et al.

NEURIPS 2025

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou et al.

NEURIPS 2025arXiv:2511.04703

citations

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu, Ziyi He, Haoran Hu et al.

NEURIPS 2025arXiv:2505.12371

citations

MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

Xi Jiang, Jian Li, Hanqiu Deng et al.

ICLR 2025arXiv:2410.09453

citations

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen et al.

NEURIPS 2025spotlightarXiv:2306.13394

1277

citations

MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study

Yuqing Zhang, Yue Han, Shuanghe Zhu et al.

NEURIPS 2025

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang, XINGYU FU, James Y. Huang et al.

ICLR 2025oralarXiv:2406.09411

120

citations

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NEURIPS 2025arXiv:2511.07250

citations

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025arXiv:2410.20092

citations

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Ge Zhang, Yinghao Ma et al.

NEURIPS 2025arXiv:2409.15272

citations

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou et al.

CVPR 2025arXiv:2412.07626

citations

← Previous

1 2