Near-Optimal Sample Complexity for MDPs via Anchoring

9 citations · ranked #613 of 3340 papers in ICML 2025

Abstract

We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h\|^{2}/\varepsilon^{2})$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor $\|h\|$. Although the complexity bound involves the span seminorm $\|h\|$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted for discounted MDPs.
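To make the anchoring step concrete, the following is a minimal sketch of Halpern's anchored iteration applied to a naive Monte Carlo estimate of the average-reward Bellman operator under a generative model. It is not the paper's algorithm: the recursive sampling technique, the stopping rule, and the precise anchoring weights are omitted, and the callable `sample_next_state`, the `reward` array, and all parameter names are illustrative assumptions.

```python
import numpy as np

def halpern_anchored_vi(sample_next_state, reward, num_states, num_actions,
                        num_iters=1000, samples_per_pair=16, seed=0):
    """Halpern-anchored value iteration with a naively sampled Bellman operator.

    `sample_next_state(s, a, n, rng)` is an assumed generative-model callable
    returning n sampled next states for the pair (s, a); `reward` is an
    (S, A) array of expected rewards.  Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    h0 = np.zeros(num_states)                  # anchor point h_0
    h = h0.copy()
    for k in range(num_iters):
        # Monte Carlo estimate of T(h)(s) = max_a { r(s,a) + E[h(s')] }
        Th = np.full(num_states, -np.inf)
        for s in range(num_states):
            for a in range(num_actions):
                ns = np.asarray(sample_next_state(s, a, samples_per_pair, rng))
                Th[s] = max(Th[s], reward[s, a] + h[ns].mean())
        beta = 1.0 / (k + 2)                   # Halpern anchoring weight
        h = beta * h0 + (1.0 - beta) * Th      # anchored update toward h_0
    # Greedy policy extraction with respect to the final iterate h
    policy = np.zeros(num_states, dtype=int)
    for s in range(num_states):
        q = [reward[s, a]
             + h[np.asarray(sample_next_state(s, a, samples_per_pair, rng))].mean()
             for a in range(num_actions)]
        policy[s] = int(np.argmax(q))
    return h, policy
```

In the paper, the per-iteration Monte Carlo estimate above is replaced by the recursive sampling scheme, which is what yields the $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h\|^{2}/\varepsilon^{2})$ sample complexity.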

Citation History

Jan 28, 2026: 0
Feb 13, 2026: 9