"language model safety" Papers
12 papers found
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.
ICLR 2025, arXiv:2406.07358
67 citations
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.
ICLR 2025, arXiv:2408.15313
24 citations
Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses
David Glukhov, Ziwen Han, I Shumailov et al.
ICLR 2025, arXiv:2407.02551
10 citations
Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling et al.
NeurIPS 2025 (spotlight), arXiv:2506.19248
3 citations
Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Michael Aerni, Javier Rando, Edoardo Debenedetti et al.
ICLR 2025, arXiv:2411.10242
13 citations
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind et al.
ICLR 2025, arXiv:2502.19320
4 citations
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi (Martin) Hu, Paul Röttger et al.
ICLR 2025, arXiv:2410.03415
26 citations
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar et al.
ICLR 2025, arXiv:2410.08847
51 citations
Why Do Some Language Models Fake Alignment While Others Don't?
Abhay Sheshadri, John Hughes, Julian Michael et al.
NeurIPS 2025 (spotlight), arXiv:2506.18032
5 citations
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres et al.
ICML 2024, arXiv:2401.01967
165 citations
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig et al.
ICML 2024, arXiv:2402.09631
31 citations
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavi Suau, Pieter Delobelle, Katherine Metcalf et al.
ICML 2024, arXiv:2407.12824
28 citations