Poster "reward hacking mitigation" Papers
3 papers found
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
ICLR 2025, arXiv:2409.13156
50 citations
Transforming and Combining Rewards for Aligning Large Language Models
Zihao Wang, Chirag Nagpal, Jonathan Berant et al.
ICML 2024, arXiv:2402.00742
26 citations
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.
ICML 2024