Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs

3 citations · #2552 of 3827 papers in ICLR 2025

Abstract

We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works of Uehara et al. (2023a); Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm—whose analysis has surprisingly eluded the literature despite the algorithm's simplicity—demonstrating provable separation between model-free and model-based OPE in POMDPs.
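
To make the generic model-based recipe alluded to at the end of the abstract concrete, here is a minimal sketch, not the paper's actual algorithm or analysis: fit a POMDP model to logged trajectories by maximum likelihood, then evaluate the history-dependent target policy by simulation inside the fitted model. The tabular parameterization, the finite candidate model class, the forward-algorithm likelihood, and the Monte Carlo evaluation are all illustrative assumptions.

```python
"""Hedged sketch of model-based OPE in a tabular POMDP (illustrative only)."""
import numpy as np

rng = np.random.default_rng(0)


class TabularPOMDP:
    """Tabular POMDP: T[s, a, s'] transitions, O[s, o] emissions,
    mu[s] initial state distribution, R[s, a] rewards."""

    def __init__(self, T, O, mu, R):
        self.T, self.O, self.mu, self.R = T, O, mu, R
        self.S, self.A, _ = T.shape
        self.Obs = O.shape[1]

    def log_likelihood(self, trajectory):
        """Forward-algorithm log-likelihood of one (observations, actions) trajectory.
        Actions are treated as exogenous, so the behavior policy's probabilities
        cancel out when comparing candidate models."""
        obs, acts = trajectory
        alpha = self.mu * self.O[:, obs[0]]          # unnormalized P(s_0, o_0)
        logp = 0.0
        for t, a in enumerate(acts):
            z = alpha.sum()
            logp += np.log(z + 1e-300)
            alpha = alpha / (z + 1e-300)
            alpha = (alpha @ self.T[:, a, :]) * self.O[:, obs[t + 1]]
        logp += np.log(alpha.sum() + 1e-300)
        return logp

    def rollout_value(self, policy, horizon, n_rollouts=2000):
        """Monte Carlo value estimate of a history-dependent policy in this model.
        For simplicity the policy here conditions on the observation history only."""
        total = 0.0
        for _ in range(n_rollouts):
            s = rng.choice(self.S, p=self.mu)
            history, ret = [], 0.0
            for _ in range(horizon):
                o = rng.choice(self.Obs, p=self.O[s])
                history.append(o)
                a = policy(tuple(history))           # history-dependent choice
                ret += self.R[s, a]
                s = rng.choice(self.S, p=self.T[s, a])
            total += ret
        return total / n_rollouts


def model_based_ope(candidate_models, trajectories, target_policy, horizon):
    """Select the candidate POMDP maximizing the data log-likelihood,
    then evaluate the target policy by simulation in the selected model."""
    scores = [sum(m.log_likelihood(tr) for tr in trajectories)
              for m in candidate_models]
    best = candidate_models[int(np.argmax(scores))]
    return best.rollout_value(target_policy, horizon)
```

The point of the sketch is only the two-stage structure (estimate a model, then plan or simulate in it); the paper's contribution concerns when such an estimator admits polynomial sample-complexity guarantees that model-free approaches provably cannot match.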

Citation History

0 citations through Jan 28, 2026; 3 citations as of Feb 13, 2026.