"transformer interpretability" Papers

10 papers found

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NEURIPS 2025arXiv:2511.20273
2
citations

EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.

NEURIPS 2025arXiv:2502.06852
9
citations

Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Aliyah Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.

ICLR 2025arXiv:2407.00886
15
citations

FlowPrune: Accelerating Attention Flow Calculation by Pruning Flow Network

Shuo Xu, Yu Chen, Shuxia Lin et al.

NEURIPS 2025

Pinpointing Attention-Causal Communication in Language Models

Gabriel Franco, Mark Crovella

NEURIPS 2025
1
citations

Selective induction Heads: How Transformers Select Causal Structures in Context

Francesco D'Angelo, francesco croce, Nicolas Flammarion

ICLR 2025arXiv:2509.08184
6
citations

The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand et al.

NEURIPS 2025arXiv:2505.15807
4
citations

Transformer Layers as Painters

Qi Sun, Marc Pickett, Aakash Kumar Nain et al.

AAAI 2025paperarXiv:2407.09298
42
citations

Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Omer Sahin Tas, Royden Wagner

ICLR 2025arXiv:2406.11624
4
citations

AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer et al.

ICML 2024arXiv:2402.05602
92
citations