Steering LLMs' Behavior with Concept Activation Vectors

0 citations

#3313 of 3827 papers in ICLR 2025

Top Authors

Ruixuan HUANG, Shuai Wang

Topics

concept activation vectors, large language models, safety concepts, behavior steering, text style transfer, code generation

Abstract

Concept activation vectors have been shown to be effective on safety concepts, efficiently guiding a considerable number of open-source large language models (LLMs) to comply with malicious instructions. In this blog post, we explore the capability boundaries of concept activation vectors in guiding various LLM behaviors through more extensive experiments. Our experiments demonstrate that this reasoning technique can transfer text styles at low cost and improve performance on specific tasks such as code generation.
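
The general idea of steering with a concept direction can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified difference-of-means direction in place of a trained concept classifier, uses synthetic vectors in place of real LLM hidden states, and all names (`concept_direction`, `steer`, `alpha`) are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos_acts, neg_acts):
    """Unit vector from the negative-class mean to the positive-class mean.

    Assumption: a difference-of-means direction stands in for a concept
    activation vector obtained from a trained linear classifier.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha units along the concept direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Synthetic "activations": the positive class is shifted along dimension 0.
rng = random.Random(0)
dim = 8
pos_acts = [
    [rng.gauss(0, 1) + (2.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(50)
]
neg_acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

v = concept_direction(pos_acts, neg_acts)
h = neg_acts[0]
h_steered = steer(h, v, alpha=3.0)
```

Because `v` has unit length, steering with `alpha=3.0` raises the projection of `h` onto `v` by exactly 3. In a real model, the same shift would be applied to a chosen layer's hidden states during generation, with `alpha` controlling steering strength.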
