"reinforcement learning from human feedback" Papers

68 papers found • Page 2 of 2

Learning Optimal Advantage from Preferences and Mistaking It for Reward

W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.

AAAI 2024 • paper • arXiv:2310.02456
16 citations

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Songyang Gao, Qiming Ge, Wei Shen et al.

ICML 2024 • arXiv:2401.11458
21 citations

MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan et al.

ICML 2024 • arXiv:2402.08925
88 citations

MusicRL: Aligning Music Generation to Human Preferences

Geoffrey Cideron, Sertan Girgin, Mauro Verzetti et al.

ICML 2024 • arXiv:2301.11325
616 citations

Nash Learning from Human Feedback

Rémi Munos, Michal Valko, Daniele Calandriello et al.

ICML 2024 • spotlight • arXiv:2312.00886
195 citations

ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu, Jiuhai Chen et al.

ICML 2024 • arXiv:2402.07319
110 citations

Position: Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback

Vincent Conitzer, Rachel Freedman, Jobst Heitzig et al.

ICML 2024

Preference Ranking Optimization for Human Alignment

Feifan Song, Bowen Yu, Minghao Li et al.

AAAI 2024 • paper • arXiv:2306.17492
337 citations

Privacy-Preserving Instructions for Aligning Large Language Models

Da Yu, Peter Kairouz, Sewoong Oh et al.

ICML 2024 • arXiv:2402.13659
36 citations

Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization

Li Ding, Jenny Zhang, Jeff Clune et al.

ICML 2024 • arXiv:2310.12103
11 citations

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Ziniu Li, Tian Xu, Yushun Zhang et al.

ICML 2024 • arXiv:2310.10505
147 citations

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban et al.

ICML 2024 • arXiv:2403.01857
20 citations

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor et al.

ICML 2024 • arXiv:2309.00267
527 citations

RLVF: Learning from Verbal Feedback without Overgeneralization

Moritz Stephan, Alexander Khazatsky, Eric Mitchell et al.

ICML 2024 • arXiv:2402.10893
14 citations

Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution

Emily McMilin

AAAI 2024 • paper • arXiv:2210.00131

WARM: On the Benefits of Weight Averaged Reward Models

Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.

ICML 2024

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Kirchner et al.

ICML 2024 • arXiv:2312.09390
406 citations

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue

Songhua Yang, Hanjie Zhao, Senbin Zhu et al.

AAAI 2024 • paper • arXiv:2308.03549
210 citations