Papers by Michael Brenner
3 papers found
Conference
CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
Hao Cui, Zahra Shamsi, Gowoon Cheon et al.
ICLR 2025, arXiv:2503.13517
29 citations
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class
James Roggeveen, Erik Wang, David Ettel et al.
NeurIPS 2025, arXiv:2505.11774
3 citations
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Fan, Sarah Martinson, Erik Wang et al.
ICLR 2025, arXiv:2410.09988
28 citations