Jacob Steinhardt
28 papers, 6,783 total citations
Papers (28)
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. ICCV 2021. arXiv. 2,156 citations.
Natural Adversarial Examples. CVPR 2021. arXiv. 1,783 citations.
Jailbroken: How Does LLM Safety Training Fail? NeurIPS 2023. arXiv. 1,501 citations.
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures. CVPR 2022. arXiv. 174 citations.
Interpreting CLIP's Image Representation via Text-Based Decomposition. ICLR 2024. arXiv. 154 citations.
Capturing Failures of Large Language Models via Human Cognitive Biases. NeurIPS 2022. arXiv. 124 citations.
Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. NeurIPS 2020. arXiv. 101 citations.
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations. ICML 2024. arXiv. 79 citations.
Language Models Learn to Mislead Humans via RLHF. ICLR 2025. arXiv. 78 citations.
How do Language Models Bind Entities in Context? ICLR 2024. arXiv. 69 citations.
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. ICML 2024. arXiv. 65 citations.
Goal Driven Discovery of Distributional Differences via Language Descriptions. NeurIPS 2023. arXiv. 65 citations.
Feedback Loops With Language Models Drive In-Context Reward Hacking. ICML 2024. arXiv. 60 citations.
Describing Differences in Image Sets with Natural Language. CVPR 2024. arXiv. 52 citations.
Mass-Producing Failures of Multimodal Systems with Language Models. NeurIPS 2023. arXiv. 45 citations.
Learning Equilibria in Matching Markets from Bandit Feedback. NeurIPS 2021. arXiv. 44 citations.
Supply-Side Equilibria in Recommender Systems. NeurIPS 2023. arXiv. 43 citations.
Forecasting Future World Events With Neural Networks. NeurIPS 2022. arXiv. 39 citations.
Which Attention Heads Matter for In-Context Learning? ICML 2025. arXiv. 37 citations.
Limitations of Post-Hoc Feature Alignment for Robustness. CVPR 2021. arXiv. 23 citations.
Monitoring Latent World States in Language Models with Propositional Probes. ICLR 2025. arXiv. 22 citations.
Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition. NeurIPS 2023. arXiv. 16 citations.
Establishing Best Practices in Building Rigorous Agentic Benchmarks. NeurIPS 2025. arXiv. 15 citations.
Eliciting Language Model Behaviors with Investigator Agents. ICML 2025. arXiv. 15 citations.
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios. NeurIPS 2022. arXiv. 11 citations.
Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts. ICML 2025. arXiv. 9 citations.
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language. ICLR 2025. arXiv. 3 citations.
Grounding Representation Similarity Through Statistical Testing. NeurIPS 2021. 0 citations.