Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity

0citations
0
citations
#2278
in ICML 2025
of 3340 papers
4
Top Authors
1
Data Points

Abstract

*Policy gradient* (PG) methods are effective *reinforcement learning* (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level.This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\widetilde{\mathcal{O}}(\epsilon^{-5})$.Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\epsilon^{-3})$, but to the optimal stochastic (hyper)policy only, requiring stronger assumptions compared to PES.

Citation History

Jan 28, 2026
0