Himabindu Lakkaraju
18 papers · 969 total citations

Papers (18)
Reliable Post hoc Explanations: Modeling Uncertainty in Explainability
NEURIPS 2021 · arXiv · 203 citations

OpenXAI: Towards a Transparent Evaluation of Model Explanations
NEURIPS 2022 · arXiv · 176 citations

Counterfactual Explanations Can Be Manipulated
NEURIPS 2021 · arXiv · 163 citations

Towards Robust and Reliable Algorithmic Recourse
NEURIPS 2021 · arXiv · 125 citations

Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post Hoc Explanations
NEURIPS 2022 · arXiv · 109 citations

Post Hoc Explanations of Language Models Can Improve Language Models
NEURIPS 2023 · arXiv · 76 citations

Understanding the Effects of Iterative Prompting on Truthfulness
ICML 2024 · arXiv · 21 citations

Learning Models for Actionable Recourse
NEURIPS 2021 · arXiv · 20 citations

Incorporating Interpretable Output Constraints in Bayesian Neural Networks
NEURIPS 2020 · arXiv · 17 citations

Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
NEURIPS 2023 · arXiv · 17 citations

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
NEURIPS 2023 · arXiv · 12 citations

Beyond Individualized Recourse: Interpretable and Interactive Summaries of Actionable Recourses
NEURIPS 2020 · arXiv · 12 citations

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
ICLR 2025 · arXiv · 10 citations

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
COLM 2025 · arXiv · 5 citations

Inference-Time Reward Hacking in Large Language Models
NEURIPS 2025 · arXiv · 3 citations

In-Context Unlearning: Language Models as Few-Shot Unlearners
ICML 2024 · 0 citations

$\mathcal{M}^4$: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models
NEURIPS 2023 · 0 citations

Efficient Training of Low-Curvature Neural Networks
NEURIPS 2022 · 0 citations