Poster "reward hacking" Papers
5 papers found
Conference
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
ICLR 2025arXiv:2403.03185
25
citations
Large language models can learn and generalize steganographic chain-of-thought under process supervision
ROBERT MC CARTHY, Joey SKAF, Luis Ibanez-Lissen et al.
NEURIPS 2025arXiv:2506.01926
13
citations
Towards Federated RLHF with Aggregated Client Preference for LLMs
Feijie Wu, Xiaoze Liu, Haoyu Wang et al.
ICLR 2025arXiv:2407.03038
10
citations
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
ICML 2024arXiv:2402.06627
60
citations
ODIN: Disentangled Reward Mitigates Hacking in RLHF
Lichang Chen, Chen Zhu, Jiuhai Chen et al.
ICML 2024arXiv:2402.07319
110
citations