"llm-as-a-judge evaluation" Papers
3 papers found
AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Sebastian Joseph, Syed M. Husain, Stella Offner et al.
NEURIPS 2025 arXiv:2505.20538
2 citations
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.
ICCV 2025 arXiv:2505.00681
10 citations
To Code or Not To Code? Exploring Impact of Code in Pre-training
Viraat Aryabumi, Yixuan Su, Raymond Ma et al.
ICLR 2025 arXiv:2408.10914
44 citations