"automatic evaluation" Papers
7 papers found
Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge
Aparna Elangovan, Lei Xu, Jongwoo Ko et al.
ICLR 2025 · arXiv:2410.03775 · 22 citations
Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos et al.
AAAI 2025 · arXiv:2408.08781 · 39 citations
M-Prometheus: A Suite of Open Multilingual LLM Judges
José Pombal, Dongkeun Yoon, Patrick Fernandes et al.
COLM 2025 · arXiv:2504.04953 · 24 citations
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.
ICCV 2025 · arXiv:2507.21391 · 7 citations
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José Pombal, Nuno M Guerreiro, Ricardo Rei et al.
COLM 2025 · arXiv:2504.01001 · 8 citations
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Xueyu Hu, Ziyu Zhao, Shuang Wei et al.
ICML 2024 · arXiv:2401.05507 · 98 citations
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin, Shihao Liang, Yining Ye et al.
ICLR 2024 (spotlight) · arXiv:2307.16789 · 1197 citations