"data contamination" Papers

12 papers found

An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination

Sukanya Patra, Souhaib Ben Taieb

NEURIPS 2025spotlightarXiv:2510.21296

CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models

Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun et al.

ICLR 2025arXiv:2503.04655
1
citations

CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark

Jian Wu, Linyi Yang, Zhen Wang et al.

ICLR 2025arXiv:2402.11924
14
citations

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Yue Yang, Shuibo Zhang, Kaipeng Zhang et al.

ICLR 2025arXiv:2410.08695
17
citations

Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination’s Impact on Machine Translation

Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch et al.

ICML 2025oralarXiv:2501.18771
7
citations

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.

NEURIPS 2025arXiv:2504.16074
30
citations

Position: Benchmarking is Broken - Don't Let AI be Its Own Judge

Zerui Cheng, Stella Wohnig, Ruchika Gupta et al.

NEURIPS 2025arXiv:2510.07575
1
citations

SWE-bench Goes Live!

Linghao Zhang, Shilin He, Chaoyun Zhang et al.

NEURIPS 2025arXiv:2505.23419
25
citations

ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Shulin Huang, Linyi Yang, Yan Song et al.

NEURIPS 2025arXiv:2502.16268
15
citations

Dynamic Evaluation of Large Language Models by Meta Probing Agents

Kaijie Zhu, Jindong Wang, Qinlin Zhao et al.

ICML 2024arXiv:2402.14865
55
citations

Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

Linyuan Gong, Sida Wang, Mostafa Elhoushi et al.

ICML 2024arXiv:2403.04814
28
citations

LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time

Sensitive Test Construction - Yucheng Li, Frank Guerin, Chenghua Lin

AAAI 2024paperarXiv:2312.12343
54
citations