"benchmark curation" Papers
2 papers found
Conference
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
Jean Park, Kuk Jin Jang, Basam Alasaly et al.
AAAI 2025, arXiv:2408.12763
16 citations
CLEVER: A Curated Benchmark for Formally Verified Code Generation
Amitayush Thakur, Jasper Lee, George Tsoukalas et al.
NeurIPS 2025, arXiv:2505.13938
13 citations