Spotlight "language model safety" Papers
2 papers found
Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling et al.
NeurIPS 2025 (Spotlight) · arXiv:2506.19248 · 3 citations
Why Do Some Language Models Fake Alignment While Others Don't?
Abhay Sheshadri, John Hughes, Julian Michael et al.
NeurIPS 2025 (Spotlight) · arXiv:2506.18032 · 5 citations