"large language model alignment" Papers

18 papers found

Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment

Yuang Cai, Yuyu Yuan, Jinsheng Shi et al.

AAAI 2025 · arXiv:2411.09341
4 citations

Ask a Strong LLM Judge when Your Reward Model is Uncertain

Zhenghao Xu, Qin Lu, Qingru Zhang et al.

NeurIPS 2025 · arXiv:2510.20369
3 citations

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

NeurIPS 2025 · arXiv:2506.20520
17 citations

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025 · arXiv:2407.14622
53 citations

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Yuheng Zhang, Dian Yu, Baolin Peng et al.

ICLR 2025 · arXiv:2407.00617
34 citations

LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin et al.

NeurIPS 2025 · arXiv:2410.01735
6 citations

Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

Jinluan Yang, Dingnan Jin, Anke Tang et al.

NeurIPS 2025 · arXiv:2502.06876
14 citations

On a Connection Between Imitation Learning and RLHF

Teng Xiao, Yige Yuan, Mingxiao Li et al.

ICLR 2025 · arXiv:2503.05079
14 citations

Rethinking Reward Modeling in Preference-based Large Language Model Alignment

Hao Sun, Yunyi Shen, Jean-Francois Ton

ICLR 2025
27 citations

RMB: Comprehensively benchmarking reward models in LLM alignment

Enyu Zhou, Guodong Zheng, Binghai Wang et al.

ICLR 2025 · arXiv:2410.09893
47 citations

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025 · arXiv:2409.13156
50 citations

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

Zexu Sun, Yiju Guo, Yankai Lin et al.

ICLR 2025
5 citations

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao et al.

ICML 2024 · arXiv:2404.10719
253 citations

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint

Wei Xiong, Hanze Dong, Chenlu Ye et al.

ICML 2024 · arXiv:2312.11456
312 citations

Nash Learning from Human Feedback

Rémi Munos, Michal Valko, Daniele Calandriello et al.

ICML 2024 (spotlight) · arXiv:2312.00886
195 citations

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Ziniu Li, Tian Xu, Yushun Zhang et al.

ICML 2024 · arXiv:2310.10505
147 citations

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor et al.

ICML 2024 · arXiv:2309.00267
527 citations

WARM: On the Benefits of Weight Averaged Reward Models

Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.

ICML 2024