"automated evaluation" Papers
8 papers found
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang, Yuchang Su, Yiming Liu et al.
CVPR 2025 · arXiv:2501.03225 · 23 citations
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Peng Lai, Jianjie Zheng, Sijie Cheng et al.
NeurIPS 2025 · arXiv:2508.03550 · 3 citations
EditCLIP: Representation Learning for Image Editing
Qian Wang, Aleksandar Cvejic, Abdelrahman Eldesokey et al.
ICCV 2025 · arXiv:2503.20318
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang et al.
ICCV 2025 · arXiv:2510.16641 · 4 citations
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics
Jie Zhang, Cezara Petrui, Kristina Nikolić et al.
NeurIPS 2025 · arXiv:2505.12575 · 12 citations
xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Qingchen Yu, Zifan Zheng, Shichao Song et al.
ICLR 2025 · arXiv:2405.11874 · 15 citations
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Gauthier Guinet, Behrooz Tehrani, Anoop Deoras et al.
ICML 2024 · arXiv:2405.13622 · 33 citations
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
Hao Zhao, Maksym Andriushchenko, Francesco Croce et al.
ICML 2024 · arXiv:2402.04833 · 88 citations