The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

arXiv:2502.17420
32 citations
#168 of 3340 papers in ICML 2025

Abstract

The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.

Citation History

Jan 28, 2026: 0
Feb 13, 2026: 32