Abstract
We consider a system with a finite number of states, $1, 2, \cdots, S$. Periodically we observe the current state of the system and perform an action, $a$, from a finite set $A$ of possible actions. As a joint result of $s$, the current state, and $a$, the action performed, two things occur: (1) we receive an immediate return $r(s, a)$; and (2) the system moves to a new state $s'$ with probability $q(s' \mid s, a)$. (For several interesting and occasionally amusing concrete examples of this setup, the reader is referred to Howard's excellent book, [4].) Let $F$ be the (finite) set of functions from $S$ to $A$. A policy $\pi$ is a sequence $(\cdots, f_n, \cdots, f_1)$ of members of $F$. Using $n$ steps of the policy $\pi$ means observing the system, and upon finding $s_0$, performing action $f_n(s_0)$, observing the next state $s_1$ and performing $f_{n-1}(s_1)$, and so on, until after $n - 1$ of these steps have been completed one observes $s_{n-1}$, performs action $f_1(s_{n-1})$, and then stops. The expected (total) return using $n$ steps of $\pi$ given that the initial state is $s_0$ is denoted $v_n(s_0, \pi)$. A policy $\pi$ is optimal if $v_n(s, \pi) \geqq v_n(s, \pi')$ for any policy $\pi'$ and all $n$ and $s$. In other words, $\pi$ is optimal if the return using $n$ steps of $\pi$ cannot be exceeded by using $n$ steps of any other policy, regardless of $n$ and the initial state of the system. Obviously, $v_n(s, \pi)$ for optimal $\pi$ may be calculated by value iteration, that is, by the use of the recursion $v_{n+1}(s, \pi) = \max_a \lbrack r(s, a) + \sum_{s'} q(s' \mid s, a)v_n(s', \pi)\rbrack.$ Similarly, an optimal $\pi$ may be generated by letting $f_1(s)$ be any $a$ with $r(s, a) \geqq r(s, a')$ for all $a'$, and letting $f_{n+1}(s)$ be any $a$ for which the expression on the right side of the above equation attains its maximum.

This paper is concerned with the return of an optimal policy $\pi$. The principal results are as follows. A policy is eventually stationary if, for $m, n$ sufficiently large, $f_m = f_n$, that is, if the sequence which is the policy consists of one member of $F$ repeated infinitely often followed by a finite arbitrary sequence. In Section 3, an example is given where there is only one optimal policy and it has the form $(\cdots, f, g, f, g, f, g)$, an oscillating sequence of two members of $F$, which shows that there may not be an eventually stationary optimal policy. The author has not determined whether the example represents the worst possible behavior of optimal policies or whether there are cases where there is no periodic optimal policy. For an optimal policy $\pi$, the gain of the policy, defined by $\lim_n n^{-1}v_n(s, \pi)$, exists; it will be denoted by $v^\ast(s, \pi)$. It is shown in Lemma 3.1 that if $\pi$ is optimal then there is an $N$ such that $\lim_n \lbrack v_{nN+r}(s, \pi) - (nN + r)v^\ast(s, \pi)\rbrack$ exists for any $r$; that is, asymptotically, the return oscillates periodically around the long range average return. One of the major results of the paper is that there is a stationary policy $\sigma$, i.e., one for which $f_m = f_n$ for all $m, n$, which has the same gain as the optimal policy $\pi$; symbolically, $v^\ast(s, \sigma) = v^\ast(s, \pi)$ for all $s$. In fact a stronger result is true (Theorem 4.2), namely that there is some constant $C$ such that $v_n(s, \pi) - v_n(s, \sigma) < C$ for all $n$ and $s$.
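Read as an algorithm, the recursion above is finite-horizon value iteration with greedy extraction of $f_{n+1}$. The following Python sketch is not from the paper; the arrays `r` and `q` and the function `value_iteration` are hypothetical stand-ins for $r(s, a)$ and $q(s' \mid s, a)$, meant only to make the recursion concrete.

```python
import numpy as np

def value_iteration(r, q, n_steps):
    """Sketch of the recursion v_{k+1}(s) = max_a [ r(s,a) + sum_{s'} q(s'|s,a) v_k(s') ].
    r has shape (S, A); q has shape (S_next, S, A). Returns v_n and the n-step
    optimal policy written as (f_n, ..., f_1), f_n first."""
    S, A = r.shape
    v = np.zeros(S)                    # v_0(s) = 0: no return when no steps remain
    policy = []                        # policy[k] holds f_{k+1}, a length-S array of actions
    for _ in range(n_steps):
        # Q[s, a] = r(s, a) + sum_{s'} q(s' | s, a) * v(s')
        Q = r + np.einsum('tsa,t->sa', q, v)
        policy.append(Q.argmax(axis=1))   # f_{k+1}(s): any action attaining the maximum
        v = Q.max(axis=1)                 # v_{k+1}(s)
    return v, policy[::-1]

# Tiny hypothetical example: 2 states, 2 actions (illustrative numbers only).
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# q[s_next, s, a]: probability of moving to s_next from s under action a.
q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
v10, pi10 = value_iteration(r, q, 10)
print(v10)   # expected 10-step optimal returns v_10(s)
```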
Although it is not known whether all optimal policies are eventually periodic, it is shown (Corollary 4.9 of Theorem 4.8) that there is an eventually periodic policy whose return is an arbitrarily small amount less than that of the optimal policy. Formally, for any $\epsilon > 0$, there is a policy $\sigma$ such that, for some $N$, $f_{n+N} = f_n$ for $n$ sufficiently large, and $v_n(s, \pi) - v_n(s, \sigma) < \epsilon$ for all $n$ and any $s$. The final section contains further results for the special case that $q(s' \mid s, a) > 0$ for all $s', s, a$.
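For concreteness, the $n$-step return $v_n(s, \sigma)$ of a periodic policy, and hence an estimate of its gain $v^\ast(s, \sigma) \approx v_n(s, \sigma)/n$, can be evaluated directly by the same backward recursion with the maximization replaced by the prescribed action. The helper below is a hedged illustration reusing the hypothetical `r` and `q` from the previous sketch; it is not a construction taken from the paper.

```python
def n_step_return(r, q, cycle, n_steps):
    """v_n(s, pi) for a periodic policy with f_k = cycle[(k-1) % N], where each entry
    of cycle is a length-S array of actions; a stationary policy is the case N = 1."""
    S = r.shape[0]
    v = np.zeros(S)                                   # v_0(s) = 0
    s_idx = np.arange(S)
    for k in range(n_steps):
        f = np.asarray(cycle[k % len(cycle)])         # f_{k+1}(s)
        # v_{k+1}(s) = r(s, f(s)) + sum_{s'} q(s' | s, f(s)) v_k(s')
        v = r[s_idx, f] + np.einsum('ts,t->s', q[:, s_idx, f], v)
    return v

# Approximate gain of the stationary policy that always plays action 0,
# using lim_n v_n(s)/n with a large n (hypothetical illustration).
sigma = [np.zeros(2, dtype=int)]
n = 1000
print(n_step_return(r, q, sigma, n) / n)
```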
Citation
Barry W. Brown. "On the Iterative Method of Dynamic Programming on a Finite Space Discrete Time Markov Process." Ann. Math. Statist. 36(4): 1279–1285, August 1965. https://doi.org/10.1214/aoms/1177699999