"model evaluation" Papers

13 papers found

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Sangwoo Park, Matteo Zecchin, Osvaldo Simeone

NEURIPS 2025spotlightarXiv:2505.18659
4
citations

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu et al.

CVPR 2025arXiv:2501.03225
23
citations

Automated Model Discovery via Multi-modal & Multi-step Pipeline

Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin et al.

NEURIPS 2025arXiv:2509.25946

Fine-tuning can Help Detect Pretraining Data from Large Language Models

Hengxiang Zhang, Songxin Zhang, Bingyi Jing et al.

ICLR 2025arXiv:2410.10880
4
citations

Law of the Weakest Link: Cross Capabilities of Large Language Models

Ming Zhong, Aston Zhang, Xuewei Wang et al.

ICLR 2025arXiv:2409.19951
12
citations

SelectFormer in Data Markets: Privacy-Preserving and Efficient Data Selection for Transformers with Multi-Party Computation

Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji

ICLR 2025

Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting

Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly et al.

COLM 2025paper
5
citations

The Zero Body Problem: Probing LLM Use of Sensory Language

Rebecca M. M. Hicke, Sil Hamilton, David Mimno

COLM 2025paperarXiv:2504.06393
2
citations

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

ICML 2024arXiv:2406.01382
25
citations

Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan, Erik Jones, Meena Jagadeesan et al.

ICML 2024arXiv:2402.06627
60
citations

Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Guanhua Zhang, Moritz Hardt

ICML 2024oralarXiv:2405.01719
18
citations

Interplay of ROC and Precision-Recall AUCs: Theoretical Limits and Practical Implications in Binary Classification

Martin Mihelich, François Castagnos, Charles Dognin

ICML 2024

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Fangyun Wei, Xi Chen, Lin Luo

ICML 2024arXiv:2403.07872
13
citations