Papers by Cozmin Ududec
2 papers found
Conference
Establishing Best Practices in Building Rigorous Agentic Benchmarks
Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun et al.
NeurIPS 2025, arXiv:2507.02825
15 citations
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou et al.
NeurIPS 2025, arXiv:2511.04703
9 citations