"sparse autoencoders" Papers

26 papers found

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

David Chanin, James Wilken-Smith, Tomáš Dulka et al.

NeurIPS 2025 (oral) · arXiv:2409.14507
81 citations

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

NeurIPS 2025 (spotlight) · arXiv:2504.04072
8 citations

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan et al.

ICML 2025 · arXiv:2502.16681
56 citations

ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Jinho Choi, Hyesu Lim, Steffen Schneider et al.

NeurIPS 2025 · arXiv:2510.26186

Dense SAE Latents Are Features, Not Bugs

Xiaoqing Sun, Alessandro Stolfo, Joshua Engels et al.

NeurIPS 2025 · arXiv:2506.15679
7 citations

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.

ICLR 2025 · arXiv:2411.14257
85 citations

From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Valérie Costa, Thomas Fel, Ekdeep S Lubana et al.

NeurIPS 2025 · arXiv:2506.03093
15 citations

Large Language Models Think Too Fast To Explore Effectively

Lan Pan, Hanbo Xie, Robert Wilson

NeurIPS 2025 · arXiv:2501.18009
6 citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025 · arXiv:2410.07656
14 citations

Not All Language Model Features Are One-Dimensionally Linear

Josh Engels, Eric Michaud, Isaac Liao et al.

ICLR 2025 · arXiv:2405.14860
101 citations

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Junqi Jiang, Tom Bewley, Salim I. Amoukou et al.

NeurIPS 2025 · arXiv:2506.21590
2 citations

Residual Stream Analysis with Multi-Layer SAEs

Tim Lawson, Lucy Farnik, Conor Houghton et al.

ICLR 2025 · arXiv:2409.04185
11 citations

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.

ICLR 2025 · arXiv:2501.06254
9 citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025

SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu, Xingru Jiang, Weizheng Gu et al.

NeurIPS 2025 · arXiv:2508.08211
2 citations

SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

Aashiq Muhamed, Jacopo Bonato, Mona T. Diab et al.

COLM 2025
17 citations

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman et al.

ICLR 2025 · arXiv:2406.04093
326 citations

Scaling Sparse Feature Circuits For Studying In-Context Learning

Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy et al.

ICML 2025
5 citations

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Patrick Leask, Bart Bussmann, Michael Pearce et al.

ICLR 2025 · arXiv:2502.04878
41 citations

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo et al.

ICLR 2025 · arXiv:2412.05276
31 citations

SparseMVC: Probing Cross-view Sparsity Variations for Multi-view Clustering

Ruimeng Liu, Xin Zou, Chang Tang et al.

NeurIPS 2025 (spotlight)

The Complexity of Learning Sparse Superposed Features with Feedback

Akash Kumar

ICML 2025 · arXiv:2502.05407

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025 · arXiv:2405.08366
65 citations

Transferring Linear Features Across Language Models With Model Stitching

Alan Chen, Jack Merullo, Alessandro Stolfo et al.

NeurIPS 2025 (spotlight) · arXiv:2506.06609
1 citation

Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Omer Sahin Tas, Royden Wagner

ICLR 2025 · arXiv:2406.11624
4 citations