"multi-task evaluation" Papers
2 papers found
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao et al.
CVPR 2025, arXiv:2406.04264
105 citations
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
David Heineman, Valentin Hofmann, Ian Magnusson et al.
NeurIPS 2025 (spotlight), arXiv:2508.13144
6 citations