"reproducible evaluation" Papers
2 papers found
Conference
ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments
Hojae Han, seung-won hwang, Rajhans Samdani et al.
ICLR 2025, arXiv:2502.19852
13 citations
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Div Garg, Diego Caples, Andis Draguns et al.
NeurIPS 2025, arXiv:2504.11543
20 citations