Spotlight "benchmark evaluation" Papers
8 papers found
AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios
Yunjia Qi, Hao Peng, Xiaozhi Wang et al.
NeurIPS 2025, spotlight
15 citations
AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
Edan Toledo, Karen Hambardzumyan, Martin Josifoski et al.
NeurIPS 2025, spotlight, arXiv:2507.02554
16 citations
AI-Researcher: Autonomous Scientific Innovation
Jiabin Tang, Lianghao Xia, Zhonghang Li et al.
NeurIPS 2025, spotlight, arXiv:2505.18705
13 citations
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen et al.
NeurIPS 2025, spotlight, arXiv:2306.13394
1277 citations
ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
Rui Xu, Dakuan Lu, Zicheng Zhao et al.
NeurIPS 2025, spotlight, arXiv:2511.18450
2 citations
THUNDER: Tile-level Histopathology image UNDERstanding benchmark
Pierre Marza, Leo Fillioux, Sofiène Boutaj et al.
NeurIPS 2025, spotlight, arXiv:2507.07860
3 citations
EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data
Shengjie Wang, Shaohuai Liu, Weirui Ye et al.
ICML 2024, spotlight, arXiv:2403.00564
31 citations
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Jian Xie, Kai Zhang, Jiangjie Chen et al.
ICML 2024, spotlight, arXiv:2402.01622
319 citations