"large language model alignment" Papers

18 papers found

Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment

Yuang Cai, Yuyu Yuan, Jinsheng Shi et al.

AAAI 2025 · arXiv:2411.09341
4 citations

Ask a Strong LLM Judge when Your Reward Model is Uncertain

Zhenghao Xu, Qin Lu, Qingru Zhang et al.

NeurIPS 2025 · arXiv:2510.20369
3 citations

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

NeurIPS 2025 · arXiv:2506.20520
17 citations

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025 · arXiv:2407.14622
53 citations

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Yuheng Zhang, Dian Yu, Baolin Peng et al.

ICLR 2025 · arXiv:2407.00617
34 citations

LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin et al.

NeurIPS 2025 · arXiv:2410.01735
6 citations

Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

Jinluan Yang, Dingnan Jin, Anke Tang et al.

NeurIPS 2025 · arXiv:2502.06876
14 citations

On a Connection Between Imitation Learning and RLHF

Teng Xiao, Yige Yuan, Mingxiao Li et al.

ICLR 2025 · arXiv:2503.05079
14 citations

Rethinking Reward Modeling in Preference-based Large Language Model Alignment

Hao Sun, Yunyi Shen, Jean-Francois Ton

ICLR 2025
27 citations

RMB: Comprehensively benchmarking reward models in LLM alignment

Enyu Zhou, Guodong Zheng, Binghai Wang et al.

ICLR 2025 · arXiv:2410.09893
47 citations

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025 · arXiv:2409.13156
50 citations

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

Zexu Sun, Yiju Guo, Yankai Lin et al.

ICLR 2025
5 citations

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao et al.

ICML 2024 · arXiv:2404.10719
253 citations

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint

Wei Xiong, Hanze Dong, Chenlu Ye et al.

ICML 2024 · arXiv:2312.11456
312 citations

Nash Learning from Human Feedback

Rémi Munos, Michal Valko, Daniele Calandriello et al.

ICML 2024 (spotlight) · arXiv:2312.00886
195 citations

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Ziniu Li, Tian Xu, Yushun Zhang et al.

ICML 2024 · arXiv:2310.10505
147 citations

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor et al.

ICML 2024 · arXiv:2309.00267
527 citations

WARM: On the Benefits of Weight Averaged Reward Models

Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.

ICML 2024