"transformer interpretability" Papers
10 papers found
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
NEURIPS 2025, arXiv:2511.20273
2 citations
EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.
NEURIPS 2025, arXiv:2502.06852
9 citations
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Aliyah Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.
ICLR 2025, arXiv:2407.00886
15 citations
FlowPrune: Accelerating Attention Flow Calculation by Pruning Flow Network
Shuo Xu, Yu Chen, Shuxia Lin et al.
NEURIPS 2025
Pinpointing Attention-Causal Communication in Language Models
Gabriel Franco, Mark Crovella
NEURIPS 2025
1 citation
Selective Induction Heads: How Transformers Select Causal Structures in Context
Francesco D'Angelo, Francesco Croce, Nicolas Flammarion
ICLR 2025, arXiv:2509.08184
6 citations
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand et al.
NEURIPS 2025, arXiv:2505.15807
4 citations
Transformer Layers as Painters
Qi Sun, Marc Pickett, Aakash Kumar Nain et al.
AAAI 2025, arXiv:2407.09298
42 citations
Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers
Omer Sahin Tas, Royden Wagner
ICLR 2025, arXiv:2406.11624
4 citations
AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers
Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer et al.
ICML 2024, arXiv:2402.05602
92 citations