Poster "language model interpretability" Papers
11 papers found
From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question-Answering
Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller et al.
ICLR 2025, arXiv:2412.17701, 3 citations
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng, Stuart Russell, Jacob Steinhardt
ICLR 2025, arXiv:2406.19501, 22 citations
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
Tian Ye, Zicheng Xu, Yuanzhi Li et al.
ICLR 2025, arXiv:2407.20311, 100 citations
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
NeurIPS 2025, arXiv:2505.10039, 3 citations
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa et al.
ICLR 2025, arXiv:2501.06254, 9 citations
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupré la Tour, Henk Tillman et al.
ICLR 2025, arXiv:2406.04093, 326 citations
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric Michaud et al.
ICLR 2025, arXiv:2403.19647, 263 citations
TopoLM: brain-like spatio-functional organization in a topographic language model
Neil Rathi, Johannes Mehrer, Badr AlKhamissi et al.
ICLR 2025, arXiv:2410.11516, 12 citations
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
James Oldfield, Shawn Im, Sharon Li et al.
NeurIPS 2025, arXiv:2505.21364, 1 citation
Explorations of Self-Repair in Language Models
Cody Rushing, Neel Nanda
ICML 2024, arXiv:2402.15390, 19 citations
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun, Avi Caciularu, Adam Pearce et al.
ICML 2024, arXiv:2401.06102, 173 citations