"language model interpretability" Papers

13 papers found

From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question-Answering

Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller et al.

ICLR 2025, arXiv:2412.17701
3 citations

Monitoring Latent World States in Language Models with Propositional Probes

Jiahai Feng, Stuart Russell, Jacob Steinhardt

ICLR 2025, arXiv:2406.19501
22 citations

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Tian Ye, Zicheng Xu, Yuanzhi Li et al.

ICLR 2025, arXiv:2407.20311
100 citations

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky

COLM 2025 (paper), arXiv:2507.07186
3 citations

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Hang Chen, Jiaying Zhu, Xinyu Yang et al.

NeurIPS 2025, arXiv:2505.10039
3 citations

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa et al.

ICLR 2025, arXiv:2501.06254
9 citations

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman et al.

ICLR 2025, arXiv:2406.04093
326 citations

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Samuel Marks, Can Rager, Eric Michaud et al.

ICLR 2025, arXiv:2403.19647
263 citations

TopoLM: brain-like spatio-functional organization in a topographic language model

Neil Rathi, Johannes Mehrer, Badr AlKhamissi et al.

ICLR 2025, arXiv:2410.11516
12 citations

Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

James Oldfield, Shawn Im, Sharon Li et al.

NeurIPS 2025, arXiv:2505.21364
1 citation

Transferring Linear Features Across Language Models With Model Stitching

Alan Chen, Jack Merullo, Alessandro Stolfo et al.

NeurIPS 2025 (spotlight), arXiv:2506.06609
1 citation

Explorations of Self-Repair in Language Models

Cody Rushing, Neel Nanda

ICML 2024, arXiv:2402.15390
19 citations

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

Asma Ghandeharioun, Avi Caciularu, Adam Pearce et al.

ICML 2024, arXiv:2401.06102
173 citations