SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

arXiv:2507.01001 · 10 citations · ranked #627 of 5,858 papers in NeurIPS 2025

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 44 open-source and proprietary foundation models and has collected over 19,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
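To make the SciArena-Eval protocol concrete, the sketch below shows one plausible way to score a model judge against human preference votes: for each pairwise comparison, check whether the judge's verdict matches the human vote and report the agreement rate. This is an illustrative sketch only, not the official SciArena-Eval code; the record fields (`model_a`, `model_b`, `human_winner`, `judge_winner`) are hypothetical names assumed for the example.

```python
# Illustrative sketch of a pairwise-agreement metric in the spirit of SciArena-Eval.
# Assumes each comparison records a human vote and a model judge's vote over the
# same pair of responses; field names are hypothetical.

from dataclasses import dataclass

@dataclass
class Comparison:
    model_a: str
    model_b: str
    human_winner: str   # "A", "B", or "tie"
    judge_winner: str   # "A", "B", or "tie"

def judge_accuracy(comparisons: list[Comparison]) -> float:
    """Fraction of comparisons where the model judge agrees with the human vote."""
    if not comparisons:
        return 0.0
    agree = sum(c.judge_winner == c.human_winner for c in comparisons)
    return agree / len(comparisons)

if __name__ == "__main__":
    sample = [
        Comparison("model-x", "model-y", "A", "A"),
        Comparison("model-x", "model-z", "B", "A"),
        Comparison("model-y", "model-z", "tie", "tie"),
    ]
    print(f"judge agreement accuracy: {judge_accuracy(sample):.2f}")  # 0.67
```

A higher agreement rate would indicate a judge model whose pairwise assessments track human researcher preferences more closely, which is the quantity the benchmark is designed to measure.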

Citation History

Jan 25–28, 2026: 0 citations
Feb 13, 2026: 10 citations