"interpretable features" Papers
4 papers found
Dense SAE Latents Are Features, Not Bugs
Xiaoqing Sun, Alessandro Stolfo, Joshua Engels et al.
NEURIPS 2025, arXiv:2506.15679
7 citations
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep S Lubana et al.
NEURIPS 2025, arXiv:2506.03093
15 citations
Not All Language Model Features Are One-Dimensionally Linear
Josh Engels, Eric Michaud, Isaac Liao et al.
ICLR 2025, arXiv:2405.14860
101 citations
Verbalized Representation Learning for Interpretable Few-Shot Generalization
Cheng-Fu Yang, Da Yin, Wenbo Hu et al.
ICCV 2025, arXiv:2411.18651
1 citation