"llm-based agents" Papers
15 papers found
Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents
Qizheng Zhang, Michael Wornow, Kunle Olukotun
NEURIPS 2025, arXiv:2506.14852
7 citations
AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios
Yunjia Qi, Hao Peng, Xiaozhi Wang et al.
NEURIPS 2025 (spotlight)
15 citations
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary et al.
ICLR 2025, arXiv:2410.13825
69 citations
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
Xiangyuan Xue, Zeyu Lu, Di Huang et al.
CVPR 2025, arXiv:2409.01392
15 citations
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
Chandler Smith, Marwa Abdulhai, Manfred Díaz et al.
NEURIPS 2025 (oral), arXiv:2512.03318
4 citations
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants
Zeyu Zhang, Quanyu Dai, Luyu Chen et al.
NEURIPS 2025, arXiv:2409.20163
14 citations
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song et al.
ICLR 2025, arXiv:2407.16741
387 citations
PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
Yimeng Chen, Piotr Piękos, Mateusz Ostaszewski et al.
NEURIPS 2025, arXiv:2507.15550
2 citations
Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Zhenfang Chen, Delin Chen, Rui Sun et al.
ICLR 2025, arXiv:2502.12130
15 citations
SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents
Yifu Guo, Jiaye Lin, Huacan Wang et al.
NEURIPS 2025, arXiv:2508.02085
22 citations
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich et al.
NEURIPS 2025, arXiv:2505.20411
33 citations
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Zimu Lu, Yunqiao Yang, Houxing Ren et al.
NEURIPS 2025 (oral), arXiv:2505.03733
19 citations
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Hao Kang, Qingru Zhang, Han Cai et al.
NEURIPS 2025 (spotlight), arXiv:2505.19481
6 citations
CompeteAI: Understanding the Competition Dynamics of Large Language Model-based Agents
Qinlin Zhao, Jindong Wang, Yixuan Zhang et al.
ICML 2024, arXiv:2310.17512
56 citations
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Xueyu Hu, Ziyu Zhao, Shuang Wei et al.
ICML 2024, arXiv:2401.05507
98 citations