"model interpretability" Papers

54 papers found • Page 1 of 2

Additive Models Explained: A Computational Complexity Approach

Shahaf Bassan, Michal Moshkovitz, Guy Katz

NEURIPS 2025 • arXiv:2510.21292
1 citation

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Fengyuan Liu, Nikhil Kandpal, Colin Raffel

ICLR 2025 • arXiv:2411.15102
15 citations

Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning

Xueqi Ma, Jun Wang, Yanbei Jiang et al.

NEURIPS 2025 • arXiv:2512.10978
3 citations

Concept Bottleneck Language Models For Protein Design

Aya Ismail, Tuomas Oikarinen, Amy Wang et al.

ICLR 2025 • arXiv:2411.06090
16 citations

Data-centric Prediction Explanation via Kernelized Stein Discrepancy

Mahtab Sarvmaili, Hassan Sajjad, Ga Wu

ICLR 2025 • arXiv:2403.15576
2 citations

Dataset Distillation for Pre-Trained Self-Supervised Vision Models

George Cazenavette, Antonio Torralba, Vincent Sitzmann

NEURIPS 2025 • arXiv:2511.16674
1 citation

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao, Yijun Pan, Emily Xiao et al.

NEURIPS 2025 • arXiv:2507.09424

Defining and Discovering Hyper-meta-paths for Heterogeneous Hypergraphs

Yaming Yang, Ziyu Zheng, Weigang Lu et al.

NEURIPS 2025

Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

Chen Qian, Dongrui Liu, Hao Wen et al.

NEURIPS 2025 • arXiv:2506.02867
24 citations

Dense SAE Latents Are Features, Not Bugs

Xiaoqing Sun, Alessandro Stolfo, Joshua Engels et al.

NEURIPS 2025 • arXiv:2506.15679
7 citations

Discovering Influential Neuron Path in Vision Transformers

Yifan Wang, Yifei Liu, Yingdong Shi et al.

ICLR 2025 • arXiv:2503.09046
4 citations

Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation

Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan et al.

AAAI 2025 • arXiv:2412.09817

Forking Paths in Neural Text Generation

Eric Bigelow, Ari Holtzman, Hidenori Tanaka et al.

ICLR 2025 • arXiv:2412.07961
20 citations

From Search to Sampling: Generative Models for Robust Algorithmic Recourse

Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi

ICLR 2025 • arXiv:2505.07351
3 citations

How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations

Siddhartha Gairola, Moritz Böhle, Francesco Locatello et al.

ICLR 2025 • arXiv:2503.00641
6 citations

I Am Big, You Are Little; I Am Right, You Are Wrong

David A Kelly, Akchunya Chanchal, Nathan Blake

ICCV 2025 • arXiv:2507.23509
3 citations

Interpreting Language Reward Models via Contrastive Explanations

Junqi Jiang, Tom Bewley, Saumitra Mishra et al.

ICLR 2025 • arXiv:2411.16502
5 citations

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Zhuo Cao, Xuan Zhao, Lena Krieger et al.

NEURIPS 2025 • arXiv:2510.14623
1 citation

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Walid Bousselham, Angie Boggust, Sofian Chaybouti et al.

ICCV 2025 • arXiv:2404.03214
25 citations

Localizing Knowledge in Diffusion Transformers

Arman Zarei, Samyadeep Basu, Keivan Rezaei et al.

NEURIPS 2025 • arXiv:2505.18832
2 citations

Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix Jedidja Binder, James Chua, Tomek Korbak et al.

ICLR 2025 (oral) • arXiv:2410.13787
44 citations

Manipulating Feature Visualizations with Gradient Slingshots

Dilyara Bareeva, Marina Höhne, Alexander Warnecke et al.

NEURIPS 2025 • arXiv:2401.06122
6 citations

Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability

Zhiyu Zhu, Zhibo Jin, Jiayu Zhang et al.

ICLR 2025 • arXiv:2502.14889
3 citations

Register and [CLS] tokens induce a decoupling of local and global features in large ViTs

Alexander Lappe, Martin Giese

NEURIPS 2025
3 citations

Self-Assembling Graph Perceptrons

Jialong Chen, Tong Wang, Bowen Deng et al.

NEURIPS 2025 (spotlight)

SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries

Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas et al.

NEURIPS 2025 • arXiv:2410.19236
3 citations

Smoothed Differentiation Efficiently Mitigates Shattered Gradients in Explanations

Adrian Hill, Neal McKee, Johannes Maeß et al.

NEURIPS 2025

Start Smart: Leveraging Gradients For Enhancing Mask-based XAI Methods

Buelent Uendes, Shujian Yu, Mark Hoogendoorn

ICLR 2025

TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

Pooyan Rahmanzadehgervi, Hung Nguyen, Rosanne Liu et al.

ICCV 2025 • arXiv:2412.18675
1 citation

The Fragile Truth of Saliency: Improving LLM Input Attribution via Attention Bias Optimization

Yihua Zhang, Changsheng Wang, Yiwei Chen et al.

NEURIPS 2025 (spotlight)

The Zero Body Problem: Probing LLM Use of Sensory Language

Rebecca M. M. Hicke, Sil Hamilton, David Mimno

COLM 2025 • arXiv:2504.06393
2 citations

Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima et al.

NEURIPS 2025 • arXiv:2506.05744
13 citations

Towards Understanding How Knowledge Evolves in Large Vision-Language Models

Sudong Wang, Yunjian Zhang, Yao Zhu et al.

CVPR 2025 • arXiv:2504.02862
3 citations

Unveiling Concept Attribution in Diffusion Models

Nguyen Hung-Quang, Hoang Phan, Khoa D Doan

NEURIPS 2025 • arXiv:2412.02542
4 citations

Accelerating the Global Aggregation of Local Explanations

Alon Mor, Yonatan Belinkov, Benny Kimelfeld

AAAI 2024 • arXiv:2312.07991
7 citations

Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Saebom Leem, Hyunseok Seo

AAAI 2024 • arXiv:2402.04563
32 citations

Attribution-based Explanations that Provide Recourse Cannot be Robust

Hidde Fokkema, Rianne de Heide, Tim van Erven

ICML 2024 • arXiv:2205.15834
22 citations

CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation

Townim Chowdhury, Kewen Liao, Vu Minh Hieu Phan et al.

CVPR 2024 • arXiv:2404.02388
3 citations

Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort

Jeeyung Kim, Ze Wang, Qiang Qiu

ECCV 2024 • arXiv:2407.08947
6 citations

Distilled Datamodel with Reverse Gradient Matching

Jingwen Ye, Ruonan Yu, Songhua Liu et al.

CVPR 2024 • arXiv:2404.14006
3 citations

Explaining Graph Neural Networks via Structure-aware Interaction Index

Ngoc Bui, Trung Hieu Nguyen, Viet Anh Nguyen et al.

ICML 2024 • arXiv:2405.14352
12 citations

Explaining Probabilistic Models with Distributional Values

Luca Franceschi, Michele Donini, Cedric Archambeau et al.

ICML 2024 (spotlight) • arXiv:2402.09947
3 citations

Exploring the LLM Journey from Cognition to Expression with Linear Representations

Yuzi Yan, Jialian Li, Yipin Zhang et al.

ICML 2024 • arXiv:2405.16964
6 citations

Improving Neural Additive Models with Bayesian Principles

Kouroche Bouchiat, Alexander Immer, Hugo Yèche et al.

ICML 2024 • arXiv:2305.16905
13 citations

Iterative Search Attribution for Deep Neural Networks

Zhiyu Zhu, Huaming Chen, Xinyi Wang et al.

ICML 2024

KernelSHAP-IQ: Weighted Least Square Optimization for Shapley Interactions

Fabian Fumagalli, Maximilian Muschalik, Patrick Kolpaczki et al.

ICML 2024

MAPTree: Beating “Optimal” Decision Trees with Bayesian Decision Trees

Colin Sullivan, Mo Tiwari, Sebastian Thrun

AAAI 2024 • arXiv:2309.15312
4 citations

MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks

Zhiyu Zhu, Huaming Chen, Jiayu Zhang et al.

AAAI 2024 • arXiv:2312.13630
14 citations

On Gradient-like Explanation under a Black-box Setting: When Black-box Explanations Become as Good as White-box

Yi Cai, Gerhard Wunder

ICML 2024 • arXiv:2308.09381
3 citations

Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities

Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh

ICML 2024 • arXiv:2406.01757
3 citations