"model evaluation" Papers

13 papers found

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Sangwoo Park, Matteo Zecchin, Osvaldo Simeone

NEURIPS 2025 · spotlight · arXiv:2505.18659

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu et al.

CVPR 2025 · arXiv:2501.03225

Automated Model Discovery via Multi-modal & Multi-step Pipeline

Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin et al.

NEURIPS 2025 · arXiv:2509.25946

Fine-tuning can Help Detect Pretraining Data from Large Language Models

Hengxiang Zhang, Songxin Zhang, Bingyi Jing et al.

ICLR 2025 · arXiv:2410.10880

Law of the Weakest Link: Cross Capabilities of Large Language Models

Ming Zhong, Aston Zhang, Xuewei Wang et al.

ICLR 2025 · arXiv:2409.19951

SelectFormer in Data Markets: Privacy-Preserving and Efficient Data Selection for Transformers with Multi-Party Computation

Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji

ICLR 2025

Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting

Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly et al.

COLM 2025 · paper

The Zero Body Problem: Probing LLM Use of Sensory Language

Rebecca M. M. Hicke, Sil Hamilton, David Mimno

COLM 2025 · paper · arXiv:2504.06393

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

ICML 2024 · arXiv:2406.01382

Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan, Erik Jones, Meena Jagadeesan et al.

ICML 2024 · arXiv:2402.06627

Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Guanhua Zhang, Moritz Hardt

ICML 2024 · oral · arXiv:2405.01719

Interplay of ROC and Precision-Recall AUCs: Theoretical Limits and Practical Implications in Binary Classification

Martin Mihelich, François Castagnos, Charles Dognin

ICML 2024

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Fangyun Wei, Xi Chen, Lin Luo

ICML 2024 · arXiv:2403.07872
