"language model evaluation" Papers

14 papers found

AbsenceBench: Language Models Can't See What's Missing

Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore et al.

NEURIPS 2025 · spotlight

An Auditing Test to Detect Behavioral Shift in Language Models

Leo Richter, Xuanli He, Pasquale Minervini et al.

ICLR 2025 · oral · arXiv:2410.19406
2 citations

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao, Yijun Pan, Emily Xiao et al.

NEURIPS 2025 · arXiv:2507.09424

Eliminating Position Bias of Language Models: A Mechanistic Approach

Ziqi Wang, Hanlin Zhang, Xiner Li et al.

ICLR 2025 · arXiv:2407.01100
50 citations

ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Sentences

Yuxin Wang, Xiaomeng Zhu, Weimin Lyu et al.

ICLR 2025 · arXiv:2411.05172
2 citations

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Jiayi Ye, Yanbo Wang, Yue Huang et al.

ICLR 2025 · arXiv:2410.02736
229 citations

Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Linbo Cao, Jinman Zhao

COLM 2025 · arXiv:2507.17747
3 citations

RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu, Zhihan Zhang, Kate Lin et al.

NEURIPS 2025 · arXiv:2506.08249
2 citations

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

David Heineman, Valentin Hofmann, Ian Magnusson et al.

NEURIPS 2025 · spotlight · arXiv:2508.13144
6 citations

Towards More Rigorous Evaluations of Language Models

Desi R Ivanova, Ilija Ilievski, Momchil Konstantinov

ICLR 2025

Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?

Andreas Opedal, Alessandro Stolfo, Haruki Shirakami et al.

ICML 2024 · arXiv:2401.18070
24 citations

LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction

Yucheng Li, Frank Guerin, Chenghua Lin

AAAI 2024 · arXiv:2312.12343
54 citations

Open-Domain Text Evaluation via Contrastive Distribution Methods

Sidi Lu, Hongyi Liu, Asli Celikyilmaz et al.

ICML 2024 · arXiv:2306.11879
1 citation

Task Contamination: Language Models May Not Be Few-Shot Anymore

Changmao Li, Jeffrey Flanigan

AAAI 2024 · arXiv:2312.16337
132 citations