"language model evaluation" Papers
14 papers found
AbsenceBench: Language Models Can’t See What’s Missing
Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore et al.
NEURIPS 2025 (Spotlight)
An Auditing Test to Detect Behavioral Shift in Language Models
Leo Richter, Xuanli He, Pasquale Minervini et al.
ICLR 2025 (Oral) · arXiv:2410.19406
2 citations
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
Cathy Jiao, Yijun Pan, Emily Xiao et al.
NEURIPS 2025 · arXiv:2507.09424
Eliminating Position Bias of Language Models: A Mechanistic Approach
Ziqi Wang, Hanlin Zhang, Xiner Li et al.
ICLR 2025 · arXiv:2407.01100
50 citations
ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Sentences
Yuxin Wang, Xiaomeng Zhu, Weimin Lyu et al.
ICLR 2025 · arXiv:2411.05172
2 citations
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang et al.
ICLR 2025 · arXiv:2410.02736
229 citations
Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks
Linbo Cao, Jinman Zhao
COLM 2025 · arXiv:2507.17747
3 citations
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Ken Gu, Zhihan Zhang, Kate Lin et al.
NEURIPS 2025 · arXiv:2506.08249
2 citations
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
David Heineman, Valentin Hofmann, Ian Magnusson et al.
NEURIPS 2025 (Spotlight) · arXiv:2508.13144
6 citations
Towards more rigorous evaluations of language models
Desi R Ivanova, Ilija Ilievski, Momchil Konstantinov
ICLR 2025
Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
Andreas Opedal, Alessandro Stolfo, Haruki Shirakami et al.
ICML 2024 · arXiv:2401.18070
24 citations
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction
Yucheng Li, Frank Guerin, Chenghua Lin
AAAI 2024 · arXiv:2312.12343
54 citations
Open-Domain Text Evaluation via Contrastive Distribution Methods
Sidi Lu, Hongyi Liu, Asli Celikyilmaz et al.
ICML 2024 · arXiv:2306.11879
1 citation
Task Contamination: Language Models May Not Be Few-Shot Anymore
Changmao Li, Jeffrey Flanigan
AAAI 2024 · arXiv:2312.16337
132 citations