Dylan Hadfield-Menell
9 papers, 1,065 total citations
papers (9)
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (ICLR 2025, arXiv): 750 citations
Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF (ICLR 2024, arXiv): 99 citations
Consequences of Misaligned AI (NeurIPS 2020, arXiv): 92 citations
Robust Feature-Level Adversaries are Interpretability Tools (NeurIPS 2022, arXiv): 33 citations
How to talk so AI will learn: Instructions, descriptions, and autonomy (NeurIPS 2022, arXiv): 28 citations
Diverse Preference Learning for Capabilities and Alignment (ICLR 2025, arXiv): 24 citations
Red Teaming Deep Neural Networks with Feature Synthesis Tools (NeurIPS 2023, arXiv): 21 citations
Pitfalls of Evidence-Based AI Policy (ICLR 2025, arXiv): 14 citations
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia (NeurIPS 2025, arXiv): 4 citations