"reward hacking" Papers
9 papers found
Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment
Yuang Cai, Yuyu Yuan, Jinsheng Shi et al.
AAAI 2025 · arXiv:2411.09341 · 4 citations

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
ICLR 2025 · arXiv:2403.03185 · 25 citations

Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Xiaoying Xing, Avinab Saha, Junfeng He et al.
CVPR 2025 (highlight) · arXiv:2501.06481 · 4 citations

Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling et al.
NeurIPS 2025 (spotlight) · arXiv:2506.19248 · 3 citations

Large language models can learn and generalize steganographic chain-of-thought under process supervision
Robert McCarthy, Joey Skaf, Luis Ibanez-Lissen et al.
NeurIPS 2025 · arXiv:2506.01926 · 13 citations

Towards Federated RLHF with Aggregated Client Preference for LLMs
Feijie Wu, Xiaoze Liu, Haoyu Wang et al.
ICLR 2025 · arXiv:2407.03038 · 10 citations

Decoding-time Realignment of Language Models
Tianlin Liu, Shangmin Guo, Leonardo Martins Bianco et al.
ICML 2024 (spotlight) · arXiv:2402.02992 · 59 citations

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
ICML 2024 · arXiv:2402.06627 · 60 citations

ODIN: Disentangled Reward Mitigates Hacking in RLHF
Lichang Chen, Chen Zhu, Jiuhai Chen et al.
ICML 2024 · arXiv:2402.07319 · 110 citations