Poster "benchmark evaluation" Papers

83 papers found • Page 2 of 2

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.

NeurIPS 2025 • arXiv:2504.16074
30 citations

Realistic Evaluation of Deep Partial-Label Learning Algorithms

Wei Wang, Dong-Dong Wu, Jindong Wang et al.

ICLR 2025 • arXiv:2502.10184
9 citations

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Dhruv Gautam, Spandan Garg, Jinu Jang et al.

ICLR 2025 • arXiv:2503.07832
17 citations

ReSi: A Comprehensive Benchmark for Representational Similarity Measures

Max Klabunde, Tassilo Wald, Tobias Schumacher et al.

ICLR 2025 • arXiv:2408.00531
12 citations

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025 • arXiv:2410.16184
110 citations

Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances

Shilin Lu, Zihan Zhou, Jiayou Lu et al.

ICLR 2025 • arXiv:2410.18775
86 citations

ScImage: How good are multimodal large language models at scientific text-to-image generation?

Leixin Zhang, Steffen Eger, Yinjie Cheng et al.

ICLR 2025 • arXiv:2412.02368
5 citations

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie, Yang Yu, Ziyang Zhang et al.

NeurIPS 2025 • arXiv:2505.24063
3 citations

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Lijun Sheng, Jian Liang, Ran He et al.

NeurIPS 2025 • arXiv:2506.24000
2 citations

This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli et al.

NeurIPS 2025 • arXiv:2505.14766
13 citations

Towards Generalizable Scene Change Detection

Jae-Woo Kim, Ue-Hwan Kim

CVPR 2025 • arXiv:2409.06214
8 citations

Two Causally Related Needles in a Video Haystack

Miaoyu Li, Qin Chao, Boyang Li

NeurIPS 2025 • arXiv:2505.19853

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Xin Xu, Jiaxin Zhang, Tianhao Chen et al.

ICLR 2025 • arXiv:2501.13766
14 citations

UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

Qizhou Chen, Dakan Wang, Taolin Zhang et al.

NeurIPS 2025 • arXiv:2505.12345
4 citations

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie et al.

ICLR 2025 • arXiv:2406.03520
106 citations

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Lawrence Jang, Yinheng Li, Dan Zhao et al.

ICLR 2025 • arXiv:2410.19100
26 citations

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

Nilay Yilmaz, Maitreya Patel, Lawrence Luo et al.

ICLR 2025 • arXiv:2503.00043
1 citation

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-World Scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao et al.

NeurIPS 2025 • arXiv:2511.22154

A Comparative Study of Image Restoration Networks for General Backbone Network Design

Xiangyu Chen, Zheyuan Li, Yuandong Pu et al.

ECCV 2024 • arXiv:2310.11881
53 citations

Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Denis Blessing, Xiaogang Jia, Johannes Esslinger et al.

ICML 2024 • arXiv:2406.07423
43 citations

CurBench: Curriculum Learning Benchmark

Yuwei Zhou, Zirui Pan, Xin Wang et al.

ICML 2024

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Jin Gao, Lei Gan, Yuankai Li et al.

ECCV 2024 • arXiv:2408.01091
4 citations

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Mingrui Wu, Jiayi Ji, Oucheng Huang et al.

ICML 2024 • arXiv:2406.16449
27 citations

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024 • arXiv:2310.14566
392 citations

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei et al.

ICML 2024 • arXiv:2401.05507
98 citations

LingoQA: Video Question Answering for Autonomous Driving

Ana-Maria Marcu, Long Chen, Jan Hünermann et al.

ECCV 2024
34 citations

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang et al.

ECCV 2024 • arXiv:2403.14624
498 citations

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang et al.

ICML 2024 • arXiv:2310.03302
168 citations

Position: Towards Implicit Prompt For Text-To-Image Models

Yue Yang, Yuqi Lin, Hong Liu et al.

ICML 2024 • arXiv:2403.02118
1 citation

Premise Order Matters in Reasoning with Large Language Models

Xinyun Chen, Ryan Chi, Xuezhi Wang et al.

ICML 2024 • arXiv:2402.08939
52 citations

RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

Yufan Chen, Jiaming Zhang, Kunyu Peng et al.

CVPR 2024 • arXiv:2403.14442
13 citations

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu et al.

ICML 2024 • arXiv:2307.10635
181 citations

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024 • arXiv:2312.14135
345 citations