Optimal policy evaluation using kernel-based temporal difference methods
Yaqi Duan, Mengdi Wang, Martin J. Wainwright
Ann. Statist. 52(5): 1927-1952 (October 2024). DOI: 10.1214/24-AOS2399

Abstract

We study nonparametric methods for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We analyze the kernel-based least-squares temporal difference (LSTD) estimate, which can be understood either as a nonparametric instrumental variables method, or as a projected approximation to the Bellman fixed-point equation. Our analysis imposes no assumptions on the transition operator of the Markov chain, but rather only conditions on the reward function and the population-level kernel LSTD solutions. Using empirical process theory and concentration inequalities, we establish a nonasymptotic upper bound on the error with explicit dependence on the effective horizon H = (1 − γ)^{-1} of the Markov reward process, the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over subclasses of MRPs, which show that our guarantees are optimal in terms of the sample size n and the effective horizon H. Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.
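In its generic regularized form, a kernel LSTD estimate can be computed by solving a finite linear system over the observed transitions. The sketch below is a minimal illustration of such a fit, assuming a Gaussian RBF kernel, a ridge penalty, and a toy one-dimensional MRP; the kernel choice, regularization parameter, and data-generating process are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian RBF kernel matrix between row-stacked state arrays A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def kernel_lstd(states, next_states, rewards, gamma, reg=1e-3, bandwidth=1.0):
    """Regularized kernel LSTD fit from transitions (s_i, r_i, s'_i).

    Solves (K - gamma * K_next + n * reg * I) alpha = r, where
    K[i, j] = k(s_i, s_j) and K_next[i, j] = k(s'_i, s_j), so that the
    fitted value function is V(s) = sum_j alpha_j k(s_j, s)."""
    n = states.shape[0]
    K = rbf_kernel(states, states, bandwidth)
    K_next = rbf_kernel(next_states, states, bandwidth)
    alpha = np.linalg.solve(K - gamma * K_next + n * reg * np.eye(n), rewards)

    def value_fn(query_states):
        # Evaluate the fitted value function at arbitrary query states.
        return rbf_kernel(query_states, states, bandwidth) @ alpha

    return value_fn

# Toy example: 1-D chain drifting toward the origin, reward r(s) = -|s|.
rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(200, 1))
s_next = 0.9 * s + 0.1 * rng.standard_normal((200, 1))
r = -np.abs(s).ravel()
V = kernel_lstd(s, s_next, r, gamma=0.95)
print(V(np.array([[0.0], [0.5], [1.0]])))
```

Here the parameter reg stands in for the ridge regularization of the kernel LSTD estimate; in practice it would typically be tuned, for example by cross-validation.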

Funding Statement

This work was partially supported by NSF-DMS grant 2015454, NSF-IIS grant 1909365, NSF-FODSI grant 202350, and Office of Naval Research grant DOD-ONR N00014-21-1-2842 to MJW.

Citation

Yaqi Duan. Mengdi Wang. Martin J. Wainwright. "Optimal policy evaluation using kernel-based temporal difference methods." Ann. Statist. 52 (5) 1927 - 1952, October 2024. https://doi.org/10.1214/24-AOS2399

Information

Received: 1 May 2022; Revised: 1 July 2023; Published: October 2024
First available in Project Euclid: 20 November 2024

Digital Object Identifier: 10.1214/24-AOS2399

Subjects:
Primary: 62G05
Secondary: 62M05

Keywords: dynamic programming, Markov reward process, nonparametric estimation, policy evaluation, reinforcement learning, reproducing kernel Hilbert space, sequential decision-making, temporal difference learning

Rights: Copyright © 2024 Institute of Mathematical Statistics
