Poster "trustworthy evaluation" Papers
2 papers found
Conference
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.
ICLR 2025, arXiv:2406.07358
67 citations
Position: Benchmarking is Broken - Don't Let AI be Its Own Judge
Zerui Cheng, Stella Wohnig, Ruchika Gupta et al.
NeurIPS 2025, arXiv:2510.07575
1 citation