Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning

NeurIPS 2025

Abstract

Reinforcement learning (RL) heavily depends on well-designed reward functions, which are often biased and difficult to design for complex behaviors. Preference-based RL (PbRL) addresses this by learning reward models from human feedback, but its practicality is constrained by a critical dilemma: while existing methods reduce human effort through query optimization, they neglect the preference buffer's restricted coverage — a factor that fundamentally determines the reliability of the reward model. We systematically demonstrate that this limitation creates a distributional mismatch: reward models trained on static buffers reliably assess in-distribution trajectories but falter on out-of-distribution (OOD) trajectories produced by policy exploration. Crucially, such failures in policy-proximal regions directly misguide iterative policy updates. To address this, we propose Proximal Policy Exploration (PPE) with two key components: (1) a proximal-policy extension method that expands exploration in undersampled policy-proximal regions, and (2) a mixture distribution query method that balances in-distribution and OOD trajectory sampling. By enhancing buffer coverage while preserving evaluation accuracy in policy-proximal regions, PPE enables more reliable policy updates. Experiments across continuous control tasks demonstrate that PPE improves preference-feedback utilization efficiency and RL sample efficiency over baselines, highlighting the vital role of preference buffer coverage management in PbRL.
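To make the second component concrete, the sketch below illustrates one plausible reading of a "mixture distribution query": splitting the query budget between candidate trajectory pairs that look in-distribution with respect to the preference buffer and those that look OOD. The function name, the `ood_fraction` parameter, and the use of a precomputed per-candidate OOD score are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mixture_distribution_query(candidates, ood_scores, num_queries,
                               ood_fraction=0.5, rng=None):
    """Select preference queries from a candidate pool, mixing in-distribution
    and out-of-distribution (OOD) trajectory pairs.

    candidates   : list of trajectory pairs (sigma_0, sigma_1) to choose from
    ood_scores   : per-candidate OOD score (higher = further from the current
                   preference-buffer distribution); how this score is computed
                   (e.g. ensemble disagreement or a density estimate) is an
                   assumption, not something the abstract specifies
    num_queries  : total number of pairs to send to the human annotator
    ood_fraction : share of the query budget devoted to OOD candidates
    """
    rng = rng or np.random.default_rng()

    # Rank candidates by OOD score; treat the lower half as in-distribution
    # and the upper half as OOD (a simplifying assumption for this sketch).
    order = np.argsort(ood_scores)
    ind_pool = order[: len(order) // 2]
    ood_pool = order[len(order) // 2:]

    n_ood = int(round(ood_fraction * num_queries))
    n_ind = num_queries - n_ood

    # Draw from each pool without replacement, capped by pool size.
    picked = np.concatenate([
        rng.choice(ind_pool, size=min(n_ind, len(ind_pool)), replace=False),
        rng.choice(ood_pool, size=min(n_ood, len(ood_pool)), replace=False),
    ])
    return [candidates[int(i)] for i in picked]
```

In this reading, `ood_fraction` controls the trade-off the abstract describes: a low value keeps the reward model accurate near the current policy, while a higher value extends buffer coverage into newly explored regions.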
