Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

arXiv:2309.16042 · 185 citations · ranked #174 of 2,297 papers in ICLR 2024

Abstract

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization (identifying the important model components) is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forward.
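
The core operation the abstract refers to can be sketched in a few lines of PyTorch. The sketch below, which is illustrative rather than a reproduction of the paper's setup, runs a model on a clean prompt, caches one transformer block's output, then reruns on a corrupted (counterfactual) prompt while overwriting that activation with the cached clean one. The model (GPT-2), layer index, prompts, and the logit-difference metric are all assumptions chosen for the example.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Clean prompt and a counterfactual corruption of it (illustrative choices).
clean_ids = tokenizer("The Eiffel Tower is in the city of", return_tensors="pt").input_ids
corrupt_ids = tokenizer("The Colosseum is in the city of", return_tensors="pt").input_ids

block = model.transformer.h[6]  # hypothetical choice: patch block 6's residual output
cache = {}

def cache_hook(module, args, output):
    # Clean run: store the block's output hidden states.
    cache["clean"] = output[0].detach()

def patch_hook(module, args, output):
    # Corrupted run: overwrite the final token position with the clean activation.
    patched = output[0].clone()
    patched[:, -1] = cache["clean"][:, -1]
    return (patched,) + output[1:]

with torch.no_grad():
    handle = block.register_forward_hook(cache_hook)
    model(clean_ids)  # clean run populates the cache
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    logits = model(corrupt_ids).logits[0, -1]  # corrupted run with the clean patch
    handle.remove()

# Hypothetical evaluation metric: logit difference between the two answers.
paris = tokenizer(" Paris").input_ids[0]
rome = tokenizer(" Rome").input_ids[0]
print("logit diff (Paris - Rome):", (logits[paris] - logits[rome]).item())
```

A large logit difference after patching would suggest this block carries information that restores the clean behavior; sweeping this procedure over layers and token positions is what localizes components, and the paper's point is that the corruption method and metric chosen here materially change the resulting interpretation.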

Citation History

Jan 28, 2026: 0 citations
Feb 13, 2026: 185 citations (+185)