Poster "data contamination" Papers
9 papers found
CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models
Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun et al.
ICLR 2025 arXiv:2503.04655
1 citation
CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark
Jian Wu, Linyi Yang, Zhen Wang et al.
ICLR 2025 arXiv:2402.11924
14 citations
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang, Shuibo Zhang, Kaipeng Zhang et al.
ICLR 2025 arXiv:2410.08695
17 citations
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.
NeurIPS 2025 arXiv:2504.16074
30 citations
Position: Benchmarking is Broken - Don't Let AI be Its Own Judge
Zerui Cheng, Stella Wohnig, Ruchika Gupta et al.
NeurIPS 2025 arXiv:2510.07575
1 citation
SWE-bench Goes Live!
Linghao Zhang, Shilin He, Chaoyun Zhang et al.
NeurIPS 2025 arXiv:2505.23419
25 citations
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Shulin Huang, Linyi Yang, Yan Song et al.
NeurIPS 2025 arXiv:2502.16268
15 citations
Dynamic Evaluation of Large Language Models by Meta Probing Agents
Kaijie Zhu, Jindong Wang, Qinlin Zhao et al.
ICML 2024 arXiv:2402.14865
55 citations
Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
Linyuan Gong, Sida Wang, Mostafa Elhoushi et al.
ICML 2024 arXiv:2403.04814
28 citations