Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
Abstract
We consider gap-dependent regret bounds for episodic tabular MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $$\tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)} +\sum_{\Delta_h(s,a)=0}\frac{ H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_{\mathrm{min}}} + SAH^4 (S \lor H) \right) \log K\right),$$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Here, $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$ denotes the suboptimality gap and $\Delta_{\mathrm{min}} := \min_{\Delta_h(s,a) > 0} \Delta_h(s,a)$ is the minimum positive gap. The term $\mathtt{Var}_{\max}^{\text{c}}$ denotes the maximum conditional total variance, defined as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under policy $\pi$, conditioned on trajectories visiting state $s$ at step $h$; it characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gaps and can potentially be adapted to other algorithms. To complement this result, we establish a lower bound of $$\Omega\left( \sum_{\Delta_h(s,a)>0} \frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)}\cdot \log K\right),$$ demonstrating the necessity of the dependence on $\mathtt{Var}_{\max}^{\text{c}}$, even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.
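For concreteness, one way to write the maximum conditional total variance implied by the verbal description above is sketched below; this is a plausible reading rather than the paper's formal definition, and details such as the range of the inner sum and whether reward noise is included follow the paper's own definition of total variance.
$$\mathtt{Var}_{\max}^{\text{c}} \;=\; \max_{\pi,\, h,\, s}\ \mathbb{E}^{\pi}\!\left[\, \sum_{h'=1}^{H} \mathrm{Var}\!\left( V_{h'+1}^{\pi}(s_{h'+1}) \,\middle|\, s_{h'}, a_{h'} \right) \ \middle|\ s_h = s \right].$$
In contrast, the unconditional total variance drops the conditioning on $s_h = s$, which is why it can approach zero while $\mathtt{Var}_{\max}^{\text{c}}$ remains large on rarely visited states.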