We consider a system with a finite number $S$ of states $s$, labeled by the integers $1, 2, \ldots, S$. Periodically, say once a day, we observe the current state of the system, and then choose an action $a$ from a finite set $A$ of possible actions. As a joint result of the current state $s$ and the chosen action $a$, two things happen: (1) we receive an immediate income $i(s, a)$ and (2) the system moves to a new state $s'$ with the probability of a particular new state $s'$ given by a function $q = q(s' \mid s, a)$. Finally, there is specified a discount factor $\beta$, $0 \leqq \beta < 1$, so that the value of unit income $n$ days in the future is $\beta^n$. Our problem is to choose a policy which maximizes our total expected income. This problem, which is an interesting special case of the general dynamic programming problem, has been solved by Howard in his excellent book. The case $\beta = 1$, also studied by Howard, is substantially more difficult. We shall obtain in this case results slightly beyond those of Howard, though still not complete. Our method, which treats $\beta = 1$ as a limiting case of $\beta < 1$, seems rather simpler than Howard's.
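To make the discounted problem ($\beta < 1$) concrete, the following is a minimal sketch in Python of one standard way to compute an optimal total expected discounted income: successive approximation (value iteration). It is not the method of this paper, nor Howard's policy-improvement routine; it is offered only as an illustration of the objects $i(s, a)$, $q(s' \mid s, a)$ and $\beta$ defined above. The function name `value_iteration`, the 0-based state and action indices, the tolerance, and the two-state example data are all assumptions introduced here.

```python
from typing import List, Tuple


def value_iteration(
    income: List[List[float]],        # income[s][a] plays the role of i(s, a)
    q: List[List[List[float]]],       # q[s][a][t] plays the role of q(t | s, a)
    beta: float,                      # discount factor, 0 <= beta < 1
    tol: float = 1e-10,               # assumed stopping tolerance
    max_iter: int = 100_000,          # assumed safety cap on iterations
) -> Tuple[List[float], List[int]]:
    """Approximate the optimal discounted income and a greedy policy."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("this sketch covers only the discounted case beta < 1")
    n_states = len(income)
    n_actions = len(income[0])

    def one_step(s: int, a: int, v: List[float]) -> float:
        # Immediate income plus discounted expected value of the next state.
        return income[s][a] + beta * sum(q[s][a][t] * v[t] for t in range(n_states))

    v = [0.0] * n_states
    for _ in range(max_iter):
        new_v = [max(one_step(s, a, v) for a in range(n_actions))
                 for s in range(n_states)]
        change = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if change < tol:
            break

    # Choose, in each state, an action that is best against the final values.
    policy = [max(range(n_actions), key=lambda a, s=s: one_step(s, a, v))
              for s in range(n_states)]
    return v, policy


# Hypothetical two-state, two-action example (not taken from the paper).
# Action 0 stays in the current state; action 1 moves to the other state.
example_income = [[1.0, 0.0], [2.0, 0.0]]
example_q = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay, or move to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay, or move to state 0
]
values, policy = value_iteration(example_income, example_q, beta=0.9)
print(values, policy)
```

In this toy example, staying in state 1 forever earns $2/(1 - 0.9) = 20$, and moving from state 0 to state 1 earns $0.9 \times 20 = 18$, which beats the $1/(1 - 0.9) = 10$ obtained by staying in state 0. The sketch raises an error for $\beta = 1$, since the total income need not be finite there and that case requires the separate treatment the abstract describes.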
"Discrete Dynamic Programming." Ann. Math. Statist. 33 (2) 719 - 726, June, 1962. https://doi.org/10.1214/aoms/1177704593