HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

28 citations · ranked #627 of 3827 papers in ICLR 2025

Abstract

Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce $\textbf{HARDMath}$, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on $\textbf{HARDMath-mini}$, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied-science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models perform significantly worse than they do on existing mathematics benchmarks. We additionally conduct a detailed error analysis to gain insight into the failure cases of LLMs. These results demonstrate the limitations of current LLMs on advanced graduate-level applied mathematics problems and underscore the importance of datasets like $\textbf{HARDMath}$ for advancing the mathematical abilities of LLMs.
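
The abstract does not spell out the validation step, but the idea can be illustrated with a minimal Python sketch: compare an analytical asymptotic approximation against a numerically computed ground truth. The specific polynomial $\epsilon x^5 + x - 1 = 0$, the helper names, and the tolerance below are hypothetical illustrations in the spirit of the problems HARDMath targets, not the paper's actual pipeline.

```python
import numpy as np

# Sketch of validating an asymptotic approximation against a numerical
# ground truth. Problem, helper names, and tolerance are illustrative.

def numerical_root(eps: float) -> float:
    """Numerically solve eps*x**5 + x - 1 = 0 for its unique real root."""
    coeffs = [eps, 0.0, 0.0, 0.0, 1.0, -1.0]  # highest degree first
    roots = np.roots(coeffs)
    real_roots = roots.real[np.abs(roots.imag) < 1e-9]
    # The unperturbed (eps = 0) solution is x = 1; pick the root nearest it.
    return real_roots[np.argmin(np.abs(real_roots - 1.0))]

def asymptotic_root(eps: float) -> float:
    """Leading-order regular perturbation expansion: x = 1 - eps + O(eps**2)."""
    return 1.0 - eps

eps = 1e-3
approx, truth = asymptotic_root(eps), numerical_root(eps)
rel_err = abs(approx - truth) / abs(truth)
print(f"asymptotic={approx:.8f}  numerical={truth:.8f}  rel_err={rel_err:.2e}")
# Accept the analytical solution if its error is consistent with the
# dropped O(eps**2) term; the paper's actual criterion may differ.
assert rel_err < 10 * eps**2, "approximation inconsistent with O(eps^2) error"
```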

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 28, 2026: 0
Feb 13, 2026: 28 (+28)