"steering vectors" Papers
4 papers found
DISCO: Disentangled Communication Steering for Large Language Models
Max Torop, Aria Masoomi, Masih Eskandar et al.
NeurIPS 2025, arXiv:2509.16820
1 citation
LayerNavigator: Finding Promising Intervention Layers for Efficient Activation Steering in Large Language Models
Hao Sun, Huailiang Peng, Qiong Dai et al.
NeurIPS 2025 (oral)
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan
COLM 2025, arXiv:2502.18862
8 citations
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed, Jacopo Bonato, Mona T. Diab et al.
COLM 2025
17 citations