Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs

ICML 2025

Abstract

Existing reinforcement learning (RL) methods for large language models (LLMs) rely on static prompt sets, where prompts are curated a priori and sampled on a fixed schedule during training, regardless of their usefulness to the RL process. We design eva, the first method that allows LLMs to adaptively prioritize and create useful prompts during RL training, guided by reward signals. In principle, eva (Evolving via Asymmetric Self-Play) casts language model training as a game between: (1) a creator, who samples and generates training prompts, and (2) a solver, who generates responses to the prompts. eva is simple, suits both offline and online RL for LLMs, and sets a new state-of-the-art on challenging benchmarks without extra human prompts: it improves gemma-2-9b-it's win-rate on Arena-Hard from 51.6% to 60.1% with DPO and from 52.6% to 62.4% with RLOO, surpassing claude-3-opus and nearing gemini-1.5-pro, both of which are orders of magnitude larger. Further ablation studies show eva can induce a meaningful learning curriculum and effectively scale RL for LLMs beyond static human prompts.
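The abstract only sketches the creator-solver loop, so the following is a minimal, hypothetical Python sketch of what one reward-guided prompt-evolving round might look like. Everything in it is an illustrative assumption rather than the paper's actual implementation: the stub names (`solve`, `reward`, `evolve`, `train_step`) are invented, and the reward-gap informativeness proxy is just one plausible way a creator could score prompts.

```python
import random

# --- Hypothetical stubs: replace with a real policy, reward model, and prompt rewriter. ---
def solve(prompt: str, n: int = 4) -> list[str]:
    """Solver: sample n candidate responses from the current policy."""
    return [f"response-{i} to {prompt!r}" for i in range(n)]

def reward(prompt: str, response: str) -> float:
    """Reward model score for a (prompt, response) pair."""
    return random.random()

def evolve(prompt: str) -> str:
    """Creator: generate a variant of an informative prompt (e.g., via an LLM rewriter)."""
    return prompt + " (variant)"

def train_step(batch: list[tuple[str, str, str]]) -> None:
    """One offline/online RL update (e.g., DPO on chosen-vs-rejected pairs)."""
    pass

def eva_round(prompt_pool: list[str], batch_size: int, n_samples: int = 4) -> list[str]:
    # 1. Score each prompt's informativeness. Assumption: the reward gap between
    #    the best and worst sampled responses proxies how much the solver can
    #    still learn from that prompt.
    scored = []
    for p in prompt_pool:
        rs = [reward(p, y) for y in solve(p, n_samples)]
        scored.append((max(rs) - min(rs), p))

    # 2. Creator: keep the most informative prompts and evolve variants of them,
    #    growing the pool beyond the static human-written set.
    scored.sort(reverse=True)
    elite = [p for _, p in scored[: batch_size // 2]]
    new_prompts = elite + [evolve(p) for p in elite]

    # 3. Solver: build preference pairs on the new prompts and update the policy.
    batch = []
    for p in new_prompts[:batch_size]:
        ys = solve(p, n_samples)
        ranked = sorted(ys, key=lambda y: reward(p, y))
        batch.append((p, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
    train_step(batch)
    return prompt_pool + new_prompts
```

Under these assumptions, the asymmetry of the self-play is visible in the code: the creator never competes with the solver on the same task, it only reshapes the prompt distribution toward regions where the solver's reward signal indicates room to improve.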
