ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

15 citations · ranked #426 of 5,858 papers in NeurIPS 2025

Abstract

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to assess LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and provides an OOD dataset of 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most LLMs' performance is far from robust and that they exhibit a certain degree of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.
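The abstract describes a dynamic data generation method that turns existing reasoning problems into OOD variants. As a rough illustration only, the sketch below shows one way such a generator could work: resampling surface features (names, numbers) of a seed question while recomputing the ground-truth answer so the reasoning structure is preserved. The function names, the perturbation strategy, and the example question are assumptions made for this sketch, not ThinkBench's actual implementation.

```python
import random
import re

def make_ood_variant(question, answer_fn, rng):
    """Rewrite surface features of a seed question (names, numbers) so the
    wording is out-of-distribution while the reasoning structure is kept.
    Illustrative only; not the paper's method."""
    # Swap the named entity so memorized benchmark strings no longer match.
    names = ["Avery", "Bellamy", "Corin", "Noor"]
    question = re.sub(r"\bAlice\b", rng.choice(names), question)

    # Resample every integer in one pass, then recompute the gold answer
    # from the new values with a task-specific answer function.
    old_numbers = re.findall(r"\d+", question)
    new_numbers = [rng.randint(2, 99) for _ in old_numbers]
    it = iter(new_numbers)
    question = re.sub(r"\d+", lambda m: str(next(it)), question)
    return question, answer_fn(*new_numbers)

if __name__ == "__main__":
    seed_q = "Alice has 12 apples and buys 7 more. How many apples does she have?"
    variant, gold = make_ood_variant(seed_q, lambda a, b: a + b, random.Random(0))
    print(variant)              # perturbed OOD question
    print("gold answer:", gold) # recomputed ground truth
```

Comparing a model's accuracy on the seed questions versus such dynamically generated variants is one way to estimate how much of its benchmark score rests on memorization rather than robust reasoning.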

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 30, 2026: 14 (+14)
Feb 13, 2026: 15 (+1)