Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

108 citations · #26 of 3340 papers in ICML 2025 · 8 Top Authors · 4 Data Points

Abstract

We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., a security-education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?

Citation History

Jan 28, 2026
110
Feb 13, 2026
108