"benchmark evaluation" Papers

100 papers found • Page 2 of 2

Conference

AAAI 2025 (3,028)COLM 2025 (418)CVPR 2025 (2,873)ICCV 2025 (2,701)ICLR 2025 (3,827)ICML 2025 (3,340)ISMAR 2025 (229)NEURIPS 2025 (5,858)AAAI 2024 (2,289)CVPR 2024 (2,716)ECCV 2024 (2,387)ICLR 2024 (2,297)ICML 2024 (2,635)

Paper Type

poster (24,624)paper (8,558)oral (1,594)spotlight (1,421)highlight (975)

OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

Saihui Hou, Panjian Huang, Zengbin Wang et al.

ICCV 2025arXiv:2410.00204

citations

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025arXiv:2411.18499

citations

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Shenghai Yuan, Xianyi He, Yufan Deng et al.

NEURIPS 2025arXiv:2505.20292

citations

OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling

Zhicheng YANG, Yiwei Wang, Yinya Huang et al.

ICLR 2025arXiv:2407.09887

citations

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Rui Xu, Dakuan Lu, Zicheng Zhao et al.

NEURIPS 2025spotlightarXiv:2511.18450

citations

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Bingnan Li, Chen-Yu Wang, Haiyang Xu et al.

NEURIPS 2025arXiv:2509.19282

citations

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Junbo Niu, Yifei Li, Ziyang Miao et al.

CVPR 2025arXiv:2501.05510

citations

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Atharva Gundawar, Som Sagar, Ransalu Senanayake

NEURIPS 2025arXiv:2506.23725

citations

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.

NEURIPS 2025arXiv:2504.16074

citations

PokerBench: Training Large Language Models to Become Professional Poker Players

Richard Zhuang, Akshat Gupta, Richard Yang et al.

AAAI 2025paperarXiv:2501.08328

citations

Realistic Evaluation of Deep Partial-Label Learning Algorithms

Wei Wang, Dong-Dong Wu, Jindong Wang et al.

ICLR 2025arXiv:2502.10184

citations

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Dhruv Gautam, Spandan Garg, Jinu Jang et al.

ICLR 2025arXiv:2503.07832

citations

ReSi: A Comprehensive Benchmark for Representational Similarity Measures

Max Klabunde, Tassilo Wald, Tobias Schumacher et al.

ICLR 2025arXiv:2408.00531

citations

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025arXiv:2410.16184

110

citations

Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances

Shilin Lu, Zihan Zhou, Jiayou Lu et al.

ICLR 2025arXiv:2410.18775

citations

ScImage: How good are multimodal large language models at scientific text-to-image generation?

Leixin Zhang, Steffen Eger, Yinjie Cheng et al.

ICLR 2025arXiv:2412.02368

citations

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie, Yang Yu, Ziyang Zhang et al.

NEURIPS 2025arXiv:2505.24063

citations

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Lijun Sheng, Jian Liang, Ran He et al.

NEURIPS 2025arXiv:2506.24000

citations

This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli et al.

NEURIPS 2025arXiv:2505.14766

citations

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza, Leo Fillioux, Sofiène Boutaj et al.

NEURIPS 2025spotlightarXiv:2507.07860

citations

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.

ICLR 2025oralarXiv:2410.23266

citations

Towards Generalizable Scene Change Detection

Jae-Woo KIM, Ue-Hwan Kim

CVPR 2025arXiv:2409.06214

citations

Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

Suho Ryu, Kihyun Kim, Eugene Baek et al.

CVPR 2025highlightarXiv:2505.00502

citations

Two Causally Related Needles in a Video Haystack

Miaoyu Li, Qin Chao, Boyang Li

NEURIPS 2025arXiv:2505.19853

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Xin Xu, Jiaxin ZHANG, Tianhao Chen et al.

ICLR 2025arXiv:2501.13766

citations

UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

Qizhou Chen, Dakan Wang, Taolin Zhang et al.

NEURIPS 2025arXiv:2505.12345

citations

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie et al.

ICLR 2025arXiv:2406.03520

106

citations

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Lawrence Jang, Yinheng Li, Dan Zhao et al.

ICLR 2025arXiv:2410.19100

citations

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

Nilay Yilmaz, Maitreya Patel, Lawrence Luo et al.

ICLR 2025arXiv:2503.00043

citations

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao et al.

NEURIPS 2025arXiv:2511.22154

A Comparative Study of Image Restoration Networks for General Backbone Network Design

Xiangyu Chen, Zheyuan Li, Yuandong Pu et al.

ECCV 2024arXiv:2310.11881

citations

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark

Fangjun Li, David C. Hogg, Anthony G. Cohn

AAAI 2024paperarXiv:2401.03991

citations

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han et al.

AAAI 2024paperarXiv:2309.01431

475

citations

Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Denis Blessing, Xiaogang Jia, Johannes Esslinger et al.

ICML 2024arXiv:2406.07423

citations

CurBench: Curriculum Learning Benchmark

Yuwei Zhou, Zirui Pan, Xin Wang et al.

ICML 2024

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Jin Gao, Lei Gan, Yuankai Li et al.

ECCV 2024arXiv:2408.01091

citations

EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Shengjie Wang, Shaohuai Liu, Weirui Ye et al.

ICML 2024spotlightarXiv:2403.00564

citations

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Mingrui Wu, Jiayi Ji, Oucheng Huang et al.

ICML 2024arXiv:2406.16449

citations

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024arXiv:2310.14566

392

citations

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei et al.

ICML 2024arXiv:2401.05507

citations

LingoQA: Video Question Answering for Autonomous Driving

Ana-Maria Marcu, Long Chen, Jan Hünermann et al.

ECCV 2024

citations

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang et al.

ECCV 2024arXiv:2403.14624

498

citations

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang et al.

ICML 2024arXiv:2310.03302

168

citations

Position: Towards Implicit Prompt For Text-To-Image Models

Yue Yang, Yuqi Lin, Hong Liu et al.

ICML 2024arXiv:2403.02118

citations

Premise Order Matters in Reasoning with Large Language Models

Xinyun Chen, Ryan Chi, Xuezhi Wang et al.

ICML 2024arXiv:2402.08939

citations

RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting

Lei Shu, Liangchen Luo, Jayakumar Hoskere et al.

AAAI 2024paperarXiv:2305.15685

citations

RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

Yufan Chen, Jiaming Zhang, Kunyu Peng et al.

CVPR 2024arXiv:2403.14442

citations

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, ziniu hu, Pan Lu et al.

ICML 2024arXiv:2307.10635

181

citations

TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Jian Xie, Kai Zhang, Jiangjie Chen et al.

ICML 2024spotlightarXiv:2402.01622

319

citations

V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024arXiv:2312.14135

345

citations

← Previous

1 2