Spotlight "reward hacking" Papers
2 papers found
Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling et al.
NeurIPS 2025, spotlight, arXiv:2506.19248
3 citations
Decoding-time Realignment of Language Models
Tianlin Liu, Shangmin Guo, Leonardo Martins Bianco et al.
ICML 2024, spotlight, arXiv:2402.02992
59 citations