Poster papers matching "sparse autoencoders"

21 papers found

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan et al.

ICML 2025 · arXiv:2502.16681 · 56 citations

ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Jinho Choi, Hyesu Lim, Steffen Schneider et al.

NeurIPS 2025 · arXiv:2510.26186

Dense SAE Latents Are Features, Not Bugs

Xiaoqing Sun, Alessandro Stolfo, Joshua Engels et al.

NeurIPS 2025 · arXiv:2506.15679 · 7 citations

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.

ICLR 2025 · arXiv:2411.14257 · 85 citations

From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Valérie Costa, Thomas Fel, Ekdeep S Lubana et al.

NeurIPS 2025 · arXiv:2506.03093 · 15 citations

Large Language Models Think Too Fast To Explore Effectively

Lan Pan, Hanbo Xie, Robert Wilson

NeurIPS 2025 · arXiv:2501.18009 · 6 citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025 · arXiv:2410.07656 · 14 citations

Not All Language Model Features Are One-Dimensionally Linear

Josh Engels, Eric Michaud, Isaac Liao et al.

ICLR 2025 · arXiv:2405.14860 · 101 citations

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Junqi Jiang, Tom Bewley, Salim I. Amoukou et al.

NeurIPS 2025 · arXiv:2506.21590 · 2 citations

Residual Stream Analysis with Multi-Layer SAEs

Tim Lawson, Lucy Farnik, Conor Houghton et al.

ICLR 2025 · arXiv:2409.04185 · 11 citations

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa et al.

ICLR 2025 · arXiv:2501.06254 · 9 citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025

SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu, Xingru Jiang, Weizheng Gu et al.

NeurIPS 2025 · arXiv:2508.08211 · 2 citations

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman et al.

ICLR 2025 · arXiv:2406.04093 · 326 citations

Scaling Sparse Feature Circuits For Studying In-Context Learning

Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy et al.

ICML 2025 · 5 citations

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Patrick Leask, Bart Bussmann, Michael Pearce et al.

ICLR 2025 · arXiv:2502.04878 · 41 citations

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo et al.

ICLR 2025 · arXiv:2412.05276 · 31 citations

The Complexity of Learning Sparse Superposed Features with Feedback

Akash Kumar

ICML 2025 · arXiv:2502.05407

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025 · arXiv:2405.08366 · 65 citations

Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Omer Sahin Tas, Royden Wagner

ICLR 2025 · arXiv:2406.11624 · 4 citations