Poster "language model alignment" Papers

31 papers found

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan, Yifan Zeng, Yue Wu et al.

ICLR 2025 · arXiv:2410.13828
5 citations

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.

ICLR 2025 · arXiv:2410.18252
43 citations

Explainable Reinforcement Learning from Human Feedback to Improve Alignment

Shicheng Liu, Siyuan Xu, Wenjie Qiu et al.

NeurIPS 2025 · arXiv:2512.13837

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025 · arXiv:2410.14872
58 citations

Interpreting Language Reward Models via Contrastive Explanations

Junqi Jiang, Tom Bewley, Saumitra Mishra et al.

ICLR 2025 · arXiv:2411.16502
5 citations

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 · arXiv:2409.12822
78 citations

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Nguyen Phuc, Ngoc-Hieu Nguyen, Duy M. H. Nguyen et al.

NeurIPS 2025 · arXiv:2506.08681

Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

Xiaochuan Li, Zichun Yu, Chenyan Xiong

ICLR 2025 · arXiv:2410.14208
5 citations

Preference Distillation via Value based Reinforcement Learning

Minchan Kwon, Junwon Ko, Kangil Kim et al.

NeurIPS 2025 · arXiv:2509.16965

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Simon Matrenok, Skander Moalla, Caglar Gulcehre

NeurIPS 2025 · arXiv:2507.08068
1 citation

Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

Stephen Zhao, Aidan Li, Rob Brekelmans et al.

NeurIPS 2025 · arXiv:2510.21184

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025 · arXiv:2410.16184
110 citations

Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Masahiro Fujisawa, Masaki Adachi, Michael A Osborne

NeurIPS 2025 · arXiv:2505.17859
1 citation

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025 · arXiv:2411.00418
19 citations

SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.

ICLR 2025
5 citations

SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters

Teng Xiao, Yige Yuan, Zhengyu Chen et al.

ICLR 2025 · arXiv:2502.00883
26 citations

Sparta Alignment: Collectively Aligning Multiple Language Models through Combat

Yuru Jiang, Wenxuan Ding, Shangbin Feng et al.

NeurIPS 2025 · arXiv:2506.04721
4 citations

Variational Best-of-N Alignment

Afra Amini, Tim Vieira, Elliott Ash et al.

ICLR 2025 · arXiv:2407.06057
38 citations

Vector-ICL: In-context Learning with Continuous Vector Representations

Yufan Zhuang, Chandan Singh, Liyuan Liu et al.

ICLR 2025 · arXiv:2410.05629
10 citations

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.

ICLR 2025 · arXiv:2410.18640
15 citations

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Gaurav Pandey, Yatin Nandwani, Tahira Naseem et al.

ICML 2024 · arXiv:2402.02479
5 citations

Controlled Decoding from Language Models

Sidharth Mudgal, Jong Lee, Harish Ganapathy et al.

ICML 2024 · arXiv:2310.17022
118 citations

Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration

Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim et al.

ICML 2024

Human Alignment of Large Language Models through Online Preference Optimisation

Daniele Calandriello, Zhaohan Guo, Rémi Munos et al.

ICML 2024 · arXiv:2403.08635
88 citations

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Banghua Zhu, Michael Jordan, Jiantao Jiao

ICML 2024 · arXiv:2401.16335
48 citations

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Songyang Gao, Qiming Ge, Wei Shen et al.

ICML 2024 · arXiv:2401.11458
21 citations

MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan et al.

ICML 2024 · arXiv:2402.08925
88 citations

ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu, Jiuhai Chen et al.

ICML 2024 · arXiv:2402.07319
110 citations

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

ICML 2024 · arXiv:2403.00409
103 citations

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan et al.

ICML 2024 · arXiv:2401.01335
480 citations

Towards Efficient Exact Optimization of Language Model Alignment

Haozhe Ji, Cheng Lu, Yilin Niu et al.

ICML 2024 · arXiv:2402.00856
32 citations