"policy optimization" Papers

75 papers found • Page 1 of 2

$q$-exponential family for policy optimization

Lingwei Zhu, Haseeb Shah, Han Wang et al.

ICLR 2025 • arXiv:2408.07245 • 2 citations

Accelerating RL for LLM Reasoning with Optimal Advantage Regression

Kianté Brantley, Mingyu Chen, Zhaolin Gao et al.

NEURIPS 2025 • arXiv:2505.20686 • 12 citations

A Differential and Pointwise Control Approach to Reinforcement Learning

Minh Nguyen, Chandrajit Bajaj

NEURIPS 2025 • arXiv:2404.15617 • 1 citation

Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

Jifeng Hu, Sili Huang, Zhejian Yang et al.

NEURIPS 2025 • arXiv:2505.01822

Computational Hardness of Reinforcement Learning with Partial $q^{\pi}$-Realizability

Shayan Karimi, Xiaoqi Tan

NEURIPS 2025

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jingjing Jiang, Chongjie Si, Jun Luo et al.

NEURIPS 2025 • spotlight • arXiv:2505.17534 • 5 citations

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin, Mingbao Lin, Yuan Xie et al.

NEURIPS 2025 • arXiv:2503.22342 • 56 citations

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Gang Li, Ming Lin, Tomer Galanti et al.

NEURIPS 2025 • arXiv:2505.12366 • 12 citations

DPAIL: Training Diffusion Policy for Adversarial Imitation Learning without Policy Optimization

Yunseon Choi, Minchan Jeong, Soobin Um et al.

NEURIPS 2025

EconGym: A Scalable AI Testbed with Diverse Economic Tasks

Qirui Mi, Qipeng Yang, Zijun Fan et al.

NEURIPS 2025 • arXiv:2506.12110 • 4 citations

Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

Sourav Ganguly, Kishan Panaganti, Arnob Ghosh et al.

NEURIPS 2025 • arXiv:2505.19238 • 3 citations

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Xiyue Peng, Hengquan Guo, Jiawei Zhang et al.

NEURIPS 2025 • arXiv:2410.19933 • 5 citations

EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution

Zhebei Shen, Qifan Yu, Juncheng Li et al.

NEURIPS 2025

Fat-to-Thin Policy Optimization: Offline Reinforcement Learning with Sparse Policies

Lingwei Zhu, Han Wang, Yukie Nagai

ICLR 2025

How to Train Your LLM Web Agent: A Statistical Diagnosis

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza et al.

NEURIPS 2025 • arXiv:2507.04103 • 6 citations

Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning

Ke Sun, Yingnan Zhao, Enze Shi et al.

NEURIPS 2025 • arXiv:2110.03155 • 2 citations

Learning on One Mode: Addressing Multi-modality in Offline Reinforcement Learning

Mianchu Wang, Yue Jin, Giovanni Montana

ICLR 2025 • arXiv:2412.03258 • 2 citations

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Yicheng Xiao, Lin Song, Yukang Chen et al.

NEURIPS 2025 • arXiv:2505.13031 • 20 citations

MURKA: Multi-Reward Reinforcement Learning with Knowledge Alignment for Optimization Tasks

Wantong Xie, Yi-Xiang Hu, Jieyang Xu et al.

NEURIPS 2025

NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data

Haolong Qian, Xianliang Yang, Ling Zhang et al.

NEURIPS 2025

Non-convex entropic mean-field optimization via Best Response flow

Razvan-Andrei Lascu, Mateusz Majka

NEURIPS 2025 • arXiv:2505.22760 • 1 citation

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Zongkai Liu, Qian Lin, Chao Yu et al.

AAAI 2025 • paper • arXiv:2412.07639 • 8 citations

Online Reinforcement Learning in Non-Stationary Context-Driven Environments

Pouya Hamadanian, Arash Nasr-Esfahany, Malte Schwarzkopf et al.

ICLR 2025 • arXiv:2302.02182 • 3 citations

On the Convergence of Projected Policy Gradient for Any Constant Step Sizes

Jiacai Liu, Wenye Li, Dachao Lin et al.

NEURIPS 2025 • arXiv:2311.01104 • 4 citations

On the Sample Complexity of Differentially Private Policy Optimization

Yi He, Xingyu Zhou

NEURIPS 2025 • arXiv:2510.21060

Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi et al.

ICLR 2025 • arXiv:2410.02275 • 6 citations

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder, Deep Tejas Karkhanis

NEURIPS 2025 • spotlight • arXiv:2505.15201 • 28 citations

Progress Reward Model for Reinforcement Learning via Large Language Models

Xiuhui Zhang, Ning Gao, Xingyu Jiang et al.

NEURIPS 2025

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Simon Matrenok, Skander Moalla, Caglar Gulcehre

NEURIPS 2025 • arXiv:2507.08068 • 1 citation

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NEURIPS 2025 • arXiv:2506.01300 • 29 citations

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jingwen Gu et al.

NEURIPS 2025 • arXiv:2506.18896 • 26 citations

ReDit: Reward Dithering for Improved LLM Policy Optimization

Chenxing Wei, Jiarui Yu, Ying He et al.

NEURIPS 2025 • arXiv:2506.18631 • 8 citations

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025 • arXiv:2410.04612 • 18 citations

Reinforced Active Learning for Large-Scale Virtual Screening with Learnable Policy Model

Yicong Chen, Jiahua Rao, Jiancong Xie et al.

NEURIPS 2025

Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

Hanyin Wang, Zhenbang Wu, Gururaj Kolar et al.

NEURIPS 2025 • spotlight • arXiv:2505.21908 • 5 citations

Reward Dimension Reduction for Scalable Multi-Objective Reinforcement Learning

Giseung Park, Youngchul Sung

ICLR 2025 • arXiv:2502.20957

Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang et al.

NEURIPS 2025 • spotlight • arXiv:2504.05812 • 78 citations

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025 • arXiv:2409.13156 • 50 citations

SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

Wanxin Tian, Shijie Zhang, Kevin Zhang et al.

NEURIPS 2025 • arXiv:2506.21669 • 6 citations

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo, Lijie Xu, Jie Liu et al.

NEURIPS 2025 • arXiv:2505.23564 • 18 citations

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Heyang Zhao, Chenlu Ye, Quanquan Gu et al.

NEURIPS 2025 • arXiv:2411.04625 • 16 citations

Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

Eliot Xing, Vernon Luk, Jean Oh

ICLR 2025 • arXiv:2412.12089 • 13 citations

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

Zexu Sun, Yiju Guo, Yankai Lin et al.

ICLR 2025 • 5 citations

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Ruilin Luo, Zhuofan Zheng, Lei Wang et al.

NEURIPS 2025 • arXiv:2501.04686 • 31 citations

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.

ICML 2025 • arXiv:2410.01679 • 56 citations

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang et al.

ICCV 2025 • arXiv:2503.01785 • 357 citations

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Qining Zhang, Lei Ying

ICLR 2025 • arXiv:2409.17401 • 10 citations

Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning

Yen-Ju Chen, Nai-Chieh Huang, Ching-pei Lee et al.

ICML 2024 • arXiv:2310.11897 • 5 citations

Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate

Yuancheng Xu, Chenghao Deng, Yanchao Sun et al.

ICML 2024 • oral • arXiv:2309.03426 • 7 citations

Adaptive-Gradient Policy Optimization: Enhancing Policy Learning in Non-Smooth Differentiable Simulations

Feng Gao, Liangzhi Shi, Shenao Zhang et al.

ICML 2024