Poster "reinforcement learning from human feedback" Papers

56 papers found • Page 1 of 2

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan, Yifan Zeng, Yue Wu et al.

ICLR 2025 • arXiv:2410.13828
5 citations

As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss

Xin Mao, Huimin Xu, Feng-Lin Li et al.

ICLR 2025 • arXiv:2410.04834
3 citations

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.

ICLR 2025 • arXiv:2410.18252
43 citations

Avoiding exp(R) scaling in RLHF through Preference-based Exploration

Mingyu Chen, Yiding Chen, Wen Sun et al.

NEURIPS 2025
3 citations

Better Estimation of the Kullback–Leibler Divergence Between Language Models

Afra Amini, Tim Vieira, Ryan Cotterell

NEURIPS 2025 • arXiv:2504.10637
4 citations

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.

ICLR 2025 • arXiv:2408.15313
24 citations

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025 • arXiv:2407.14622
53 citations

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

Cassidy Laidlaw, Shivam Singhal, Anca Dragan

ICLR 2025 • arXiv:2403.03185
25 citations

Explainable Reinforcement Learning from Human Feedback to Improve Alignment

Shicheng Liu, Siyuan Xu, Wenjie Qiu et al.

NEURIPS 2025 • arXiv:2512.13837

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.

NEURIPS 2025 • arXiv:2505.11475
38 citations

HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai et al.

ICLR 2025 • arXiv:2410.05116
3 citations

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025 • arXiv:2410.14872
58 citations

Information-Theoretic Reward Decomposition for Generalizable RLHF

Liyuan Mao, Haoran Xu, Amy Zhang et al.

NEURIPS 2025 • arXiv:2504.06020
3 citations

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 • arXiv:2409.12822
78 citations

Learning “Partner-Aware” Collaborators in Multi-Party Collaboration

Abhijnan Nath, Nikhil Krishnaswamy

NEURIPS 2025 • arXiv:2510.22462

LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging

Ke Wang, Nikos Dimitriadis, Alessandro Favero et al.

ICLR 2025 • arXiv:2410.17146
27 citations

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Pengxiang Li, Lu Yin, Shiwei Liu

ICLR 2025 • arXiv:2412.13795
26 citations

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Aaron J. Li, Satyapriya Krishna, Hima Lakkaraju

ICLR 2025 • arXiv:2404.18870
10 citations

On a Connection Between Imitation Learning and RLHF

Teng Xiao, Yige Yuan, Mingxiao Li et al.

ICLR 2025 • arXiv:2503.05079
14 citations

Online Preference Alignment for Language Models via Count-based Exploration

Chenjia Bai, Yang Zhang, Shuang Qiu et al.

ICLR 2025 • arXiv:2501.12735
20 citations

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Javier Rando, Tony Wang, Stewart Slocum et al.

ICLR 2025 • arXiv:2307.15217
750 citations

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025 • arXiv:2410.04612
18 citations

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Timo Kaufmann, Yannick Metz, Daniel Keim et al.

NEURIPS 2025 • arXiv:2512.25023

Reward Learning from Multiple Feedback Types

Yannick Metz, Andras Geiszl, Raphaël Baur et al.

ICLR 2025 • arXiv:2502.21038
5 citations

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025 • arXiv:2410.16184
110 citations

Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models

Ángela López-Cardona, Carlos Segura, Alexandros Karatzoglou et al.

ICLR 2025 • arXiv:2410.01532
8 citations

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025 • arXiv:2411.00418
19 citations

Semantic-guided Diverse Decoding for Large Language Model

Weijie Shi, Yue Cui, Yaguang Wu et al.

NEURIPS 2025 • arXiv:2506.23601
2 citations

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Heyang Zhao, Chenlu Ye, Quanquan Gu et al.

NEURIPS 2025 • arXiv:2411.04625
16 citations

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

Zexu Sun, Yiju Guo, Yankai Lin et al.

ICLR 2025
5 citations

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Qining Zhang, Lei Ying

ICLR 2025 • arXiv:2409.17401
10 citations

Active Preference Learning for Large Language Models

William Muldrew, Peter Hayes, Mingtian Zhang et al.

ICML 2024 • arXiv:2402.08114
46 citations

A Minimaximalist Approach to Reinforcement Learning from Human Feedback

Gokul Swamy, Christoph Dann, Rahul Kidambi et al.

ICML 2024 • arXiv:2401.04056
139 citations

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Gaurav Pandey, Yatin Nandwani, Tahira Naseem et al.

ICML 2024 • arXiv:2402.02479
5 citations

Dense Reward for Free in Reinforcement Learning from Human Feedback

Alexander Chan, Hao Sun, Samuel Holt et al.

ICML 2024 • arXiv:2402.00782
65 citations

Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Yihan Du, Anna Winnicki, Gal Dalal et al.

ICML 2024 • arXiv:2402.10342
17 citations

Exploring the LLM Journey from Cognition to Expression with Linear Representations

Yuzi Yan, Jialian Li, Yipin Zhang et al.

ICML 2024 • arXiv:2405.16964
6 citations

Fundamental Limitations of Alignment in Large Language Models

Yotam Wolf, Noam Wies, Oshri Avnery et al.

ICML 2024 • arXiv:2304.11082
178 citations

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

Ryan Liu, Theodore R Sumers, Ishita Dasgupta et al.

ICML 2024 • arXiv:2402.07282
28 citations

Human Alignment of Large Language Models through Online Preference Optimisation

Daniele Calandriello, Zhaohan Guo, Rémi Munos et al.

ICML 2024 • arXiv:2403.08635
88 citations

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao et al.

ICML 2024 • arXiv:2404.10719
253 citations

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Banghua Zhu, Michael Jordan, Jiantao Jiao

ICML 2024 • arXiv:2401.16335
48 citations

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint

Wei Xiong, Hanze Dong, Chenlu Ye et al.

ICML 2024 • arXiv:2312.11456
312 citations

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Songyang Gao, Qiming Ge, Wei Shen et al.

ICML 2024 • arXiv:2401.11458
21 citations

MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan et al.

ICML 2024 • arXiv:2402.08925
88 citations

MusicRL: Aligning Music Generation to Human Preferences

Geoffrey Cideron, Sertan Girgin, Mauro Verzetti et al.

ICML 2024 • arXiv:2301.11325
616 citations

ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu, Jiuhai Chen et al.

ICML 2024 • arXiv:2402.07319
110 citations

Position: Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback

Vincent Conitzer, Rachel Freedman, Jobst Heitzig et al.

ICML 2024

Privacy-Preserving Instructions for Aligning Large Language Models

Da Yu, Peter Kairouz, Sewoong Oh et al.

ICML 2024 • arXiv:2402.13659
36 citations

Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization

Li Ding, Jenny Zhang, Jeff Clune et al.

ICML 2024 • arXiv:2310.12103
11 citations