Himabindu Lakkaraju

969 total citations

papers (18)

Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post Hoc Explanations

NEURIPS 2022, arXiv

109 citations

Post Hoc Explanations of Language Models Can Improve Language Models

NEURIPS 2023, arXiv

citations

Understanding the Effects of Iterative Prompting on Truthfulness

ICML 2024, arXiv

citations

Learning Models for Actionable Recourse

NEURIPS 2021, arXiv

citations

Incorporating Interpretable Output Constraints in Bayesian Neural Networks

NEURIPS 2020, arXiv

citations

Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness

NEURIPS 2023, arXiv

citations

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability

NEURIPS 2023, arXiv

citations

Beyond Individualized Recourse: Interpretable and Interactive Summaries of Actionable Recourses

NEURIPS 2020, arXiv

citations

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

ICLR 2025, arXiv

citations

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

COLM 2025, arXiv

citations

Inference-Time Reward Hacking in Large Language Models

NEURIPS 2025, arXiv

citations

In-Context Unlearning: Language Models as Few-Shot Unlearners

ICML 2024

citations

$\mathcal{M}^4$: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models

NEURIPS 2023

citations

Efficient Training of Low-Curvature Neural Networks

NEURIPS 2022

citations


Reliable Post hoc Explanations: Modeling Uncertainty in Explainability

OpenXAI: Towards a Transparent Evaluation of Model Explanations

Counterfactual Explanations Can Be Manipulated

Towards Robust and Reliable Algorithmic Recourse

