"model evaluation" Papers
13 papers found
Conference
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Sangwoo Park, Matteo Zecchin, Osvaldo Simeone
NEURIPS 2025spotlightarXiv:2505.18659
4
citations
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang, Yuchang Su, Yiming Liu et al.
CVPR 2025arXiv:2501.03225
23
citations
Automated Model Discovery via Multi-modal & Multi-step Pipeline
Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin et al.
NEURIPS 2025arXiv:2509.25946
Fine-tuning can Help Detect Pretraining Data from Large Language Models
Hengxiang Zhang, Songxin Zhang, Bingyi Jing et al.
ICLR 2025arXiv:2410.10880
4
citations
Law of the Weakest Link: Cross Capabilities of Large Language Models
Ming Zhong, Aston Zhang, Xuewei Wang et al.
ICLR 2025arXiv:2409.19951
12
citations
SelectFormer in Data Markets: Privacy-Preserving and Efficient Data Selection for Transformers with Multi-Party Computation
Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji
ICLR 2025
Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly et al.
COLM 2025paper
5
citations
The Zero Body Problem: Probing LLM Use of Sensory Language
Rebecca M. M. Hicke, Sil Hamilton, David Mimno
COLM 2025paperarXiv:2504.06393
2
citations
Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan
ICML 2024arXiv:2406.01382
25
citations
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
ICML 2024arXiv:2402.06627
60
citations
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Guanhua Zhang, Moritz Hardt
ICML 2024oralarXiv:2405.01719
18
citations
Interplay of ROC and Precision-Recall AUCs: Theoretical Limits and Practical Implications in Binary Classification
Martin Mihelich, François Castagnos, Charles Dognin
ICML 2024
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension
Fangyun Wei, Xi Chen, Lin Luo
ICML 2024arXiv:2403.07872
13
citations