"language model evaluation" Papers

14 papers found

AbsenceBench: Language Models Can't See What's Missing

Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore et al.

NEURIPS 2025 · spotlight

An Auditing Test to Detect Behavioral Shift in Language Models

Leo Richter, Xuanli He, Pasquale Minervini et al.

ICLR 2025 · oral · arXiv:2410.19406
2 citations

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao, Yijun Pan, Emily Xiao et al.

NEURIPS 2025 · arXiv:2507.09424

Eliminating Position Bias of Language Models: A Mechanistic Approach

Ziqi Wang, Hanlin Zhang, Xiner Li et al.

ICLR 2025 · arXiv:2407.01100
50 citations

ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Sentences

Yuxin Wang, Xiaomeng Zhu, Weimin Lyu et al.

ICLR 2025 · arXiv:2411.05172
2 citations

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Jiayi Ye, Yanbo Wang, Yue Huang et al.

ICLR 2025 · arXiv:2410.02736
229 citations

Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Linbo Cao, Jinman Zhao

COLM 2025 · arXiv:2507.17747
3 citations

RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu, Zhihan Zhang, Kate Lin et al.

NEURIPS 2025 · arXiv:2506.08249
2 citations

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

David Heineman, Valentin Hofmann, Ian Magnusson et al.

NEURIPS 2025 · spotlight · arXiv:2508.13144
6 citations

Towards More Rigorous Evaluations of Language Models

Desi R Ivanova, Ilija Ilievski, Momchil Konstantinov

ICLR 2025

Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?

Andreas Opedal, Alessandro Stolfo, Haruki Shirakami et al.

ICML 2024 · arXiv:2401.18070
24 citations

LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction

Yucheng Li, Frank Guerin, Chenghua Lin

AAAI 2024 · arXiv:2312.12343
54 citations

Open-Domain Text Evaluation via Contrastive Distribution Methods

Sidi Lu, Hongyi Liu, Asli Celikyilmaz et al.

ICML 2024 · arXiv:2306.11879
1 citation

Task Contamination: Language Models May Not Be Few-Shot Anymore

Changmao Li, Jeffrey Flanigan

AAAI 2024 · arXiv:2312.16337
132 citations