Projected state-action balancing weights for offline reinforcement learning
Jiayi Wang, Zhengling Qi, Raymond K. W. Wong
Ann. Statist. 51(4): 1639-1665 (August 2023). DOI: 10.1214/23-AOS2302


Off-policy evaluation is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points in each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the operator related to nonparametric Q-function estimation in the off-policy setting, which characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
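To illustrate the marginal importance sampling idea the abstract builds on (not the paper's projected balancing estimator itself), here is a minimal sketch. It assumes a toy setting with two state-action pairs whose visitation probabilities under the behavior policy (`d_b`) and the target policy (`d_pi`) are known exactly; in practice these ratio weights are what must be estimated from data, which is the hard part the paper addresses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup: two state-action pairs with known visitation
# distributions under the behavior policy (d_b) and the target policy (d_pi).
d_b = np.array([0.7, 0.3])       # behavior visitation distribution
d_pi = np.array([0.4, 0.6])      # target visitation distribution
mean_reward = np.array([1.0, 2.0])

# Marginal importance weights: ratio of target to behavior visitation.
w = d_pi / d_b

# Sample transitions from the behavior distribution with noisy rewards.
n = 200_000
idx = rng.choice(2, size=n, p=d_b)
rewards = mean_reward[idx] + 0.1 * rng.standard_normal(n)

# The weighted average of observed rewards estimates the target policy's
# per-step value (for a discounted value, divide by 1 - gamma).
v_hat = np.mean(w[idx] * rewards)
v_true = d_pi @ mean_reward  # ground-truth per-step value under the target policy
```

With known weights, `v_hat` is an unbiased estimate of `v_true`; the paper's contribution lies in constructing weights with a balancing property when the visitation ratios are unknown.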

Funding Statement

Wong’s research was partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.


The authors would like to thank the anonymous referees, the Associate Editor and the Editor for their constructive comments that improved the quality of this paper.


Citation

Jiayi Wang, Zhengling Qi, Raymond K. W. Wong. "Projected state-action balancing weights for offline reinforcement learning." Ann. Statist. 51(4): 1639-1665, August 2023.


Received: 1 May 2022; Revised: 1 March 2023; Published: August 2023
First available in Project Euclid: 19 October 2023

Digital Object Identifier: 10.1214/23-AOS2302

Primary: 62G05, 62M05

Keywords: Infinite horizons, Markov decision process, policy evaluation, reinforcement learning

Rights: Copyright © 2023 Institute of Mathematical Statistics


