Poster "language model safety" Papers

10 papers found

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.

ICLR 2025 · arXiv:2406.07358
67 citations

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.

ICLR 2025 · arXiv:2408.15313
24 citations

Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses

David Glukhov, Ziwen Han, Ilia Shumailov et al.

ICLR 2025 · arXiv:2407.02551
10 citations

Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

Michael Aerni, Javier Rando, Edoardo Debenedetti et al.

ICLR 2025 · arXiv:2411.10242
13 citations

Shh, don't say that! Domain Certification in LLMs

Cornelius Emde, Alasdair Paren, Preetham Arvind et al.

ICLR 2025 · arXiv:2502.19320
4 citations

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Xinpeng Wang, Chengzhi (Martin) Hu, Paul Röttger et al.

ICLR 2025 · arXiv:2410.03415
26 citations

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Noam Razin, Sadhika Malladi, Adithya Bhaskar et al.

ICLR 2025 · arXiv:2410.08847
51 citations

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres et al.

ICML 2024 · arXiv:2401.01967
165 citations

Representation Surgery: Theory and Practice of Affine Steering

Shashwat Singh, Shauli Ravfogel, Jonathan Herzig et al.

ICML 2024 · arXiv:2402.09631
31 citations

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Xavi Suau, Pieter Delobelle, Katherine Metcalf et al.

ICML 2024 · arXiv:2407.12824
28 citations