"benchmark construction" Papers

26 papers found

A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

Zicheng Zhang, Haoning Wu, Chunyi Li et al.

ICLR 2025 · arXiv:2406.03070
40 citations

AgentAuditor: Human-level Safety and Security Evaluation for LLM Agents

Hanjun Luo, Shenyu Dai, Chiming Ni et al.

NeurIPS 2025 · arXiv:2506.00641
18 citations

ALLVB: All-in-One Long Video Understanding Benchmark

Xichen Tan, Yuanjing Luo, Yunfan Ye et al.

AAAI 2025 · arXiv:2503.07298
6 citations

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu et al.

CVPR 2025 · arXiv:2501.03225
23 citations

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Cheng Yang, Chufan Shi, Yaxin Liu et al.

ICLR 2025 · arXiv:2406.09961
69 citations

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Chengyou Jia, Changliang Xia, Zhuohang Dang et al.

CVPR 2025 · arXiv:2411.17176
7 citations

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.

NeurIPS 2025 · arXiv:2506.18322
3 citations

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra et al.

CVPR 2025 · arXiv:2505.21755
4 citations

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.

ICLR 2025 · arXiv:2410.12784
163 citations

Linguini: A benchmark for language-agnostic linguistic reasoning

Eduardo Sánchez, Belen Alastruey, Christophe Ropers et al.

NeurIPS 2025 · arXiv:2409.12126
13 citations

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier et al.

ICLR 2025 · arXiv:2407.01509
43 citations

MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu, Hao Fei, Yuhui Zhang et al.

NeurIPS 2025 · arXiv:2509.25851

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Zijia Zhao, Haoyu Lu, Yuqi Huo et al.

ICLR 2025 (oral) · arXiv:2406.09367
15 citations

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Cunxiang Wang, Ruoxi Ning, Boqi Pan et al.

ICLR 2025 · arXiv:2403.12766
24 citations

RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu, Zhihan Zhang, Kate Lin et al.

NeurIPS 2025 · arXiv:2506.08249
2 citations

ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

Duong T. Tran, Trung-Kien Tran, Manfred Hauswirth et al.

ICCV 2025 · arXiv:2507.16403
3 citations

SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI

Yuzhou Nie, Zhun Wang, Yu Yang et al.

NeurIPS 2025

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xeron Du, Yifan Yao, Kaijing Ma et al.

NeurIPS 2025 · arXiv:2502.14739
118 citations

SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Jinyang Li, Xiaolong Li, Ge Qu et al.

NeurIPS 2025 · arXiv:2506.18951
8 citations

SysBench: Can LLMs Follow System Message?

Yanzhao Qin, Tao Zhang, Tao Zhang et al.

ICLR 2025
5 citations

The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Hong Li, Nanxi Li, Yuanjie Chen et al.

ICLR 2025 · arXiv:2410.01417
3 citations

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Yuanxin Liu, Rui Zhu, Shuhuai Ren et al.

NeurIPS 2025 · arXiv:2503.09949
3 citations

WritingBench: A Comprehensive Benchmark for Generative Writing

Yuning Wu, Jiahao Mei, Ming Yan et al.

NeurIPS 2025 · arXiv:2503.05244
46 citations

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.

CVPR 2025 (highlight) · arXiv:2503.23771
26 citations

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

Yu Du, Fangyun Wei, Hongyang Zhang

ICML 2024 · arXiv:2402.04253
85 citations

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Fangyun Wei, Xi Chen, Lin Luo

ICML 2024 · arXiv:2403.07872
13 citations