A Note on Memoryless Rules for Controlling Sequential Control Processes
Cyrus Derman, Ralph E. Strauch
Ann. Math. Statist. 37(1): 276-278 (February, 1966). DOI: 10.1214/aoms/1177699618

## Abstract

We are concerned with a dynamic system which at times $t = 0, 1, \cdots$ is observed to be in one of a finite number of states. We shall denote the space of possible states by $I$. After each observation the system is "controlled" by making one of a finite number of possible decisions. We shall denote by $K_i$ the set of possible decisions when the system is in state $i$, $i \in I$. Let $\{Y_t\}, t = 0, 1, \cdots$, denote the sequence of observed states and $\{\Delta_t\}, t = 0, 1, \cdots$, the sequence of observed decisions. The fundamental assumption regarding $\{Y_t, \Delta_t\}, t = 0, 1, \cdots$, is \begin{equation*}\tag{A}P(Y_{t+1} = j \mid Y_0, \Delta_0, \cdots, Y_t = i, \Delta_t = k) = q_{ij}(k),\quad t = 0, 1, \cdots;\ j \in I;\ k \in K_i,\end{equation*} where the $q_{ij}(k)$'s are non-negative numbers satisfying $\sum_j q_{ij}(k) = 1$, $k \in K_i$, $i \in I$.

A rule for making successive decisions can be summarized in the form of a collection of non-negative functions $D_k(Y_0, \Delta_0, \cdots, \Delta_{t-1}, Y_t)$, $t = 0, 1, \cdots$, $k \in K_{Y_t}$, where in every case $\sum_k D_k(\cdot) = 1$. We set $P(\Delta_t = k \mid Y_0, \Delta_0, \cdots, \Delta_{t-1}, Y_t) = D_k(Y_0, \Delta_0, \cdots, \Delta_{t-1}, Y_t)$ for $t = 0, 1, \cdots$. Thus, given $Y_0 = i$ and any rule $R$ for making successive decisions, the sequence $\{Y_t, \Delta_t\}, t = 0, 1, \cdots$, is a stochastic process with its probability measure dependent upon the rule $R$. We refer to such a process as a sequential control process.

Let $C$ denote the class of all possible rules, and let $C'$ denote the class of all rules such that $D_k(Y_0, \Delta_0, \cdots, \Delta_{t-1}, Y_t = i) = D_{ik}$, $t = 0, 1, \cdots$, $k \in K_i$, $i \in I$. That is, $C'$ is the class of all rules such that the mechanism for making a decision at any time $t$ depends only on the state of the system at time $t$.
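The setup above can be made concrete with a small numerical sketch. The two-state transition law `q`, the decision sets, and the history-dependent rule interface below are illustrative choices of ours, not taken from the paper; `q[i][k][j]` plays the role of $q_{ij}(k)$, and a rule maps the full history $(Y_0, \Delta_0, \cdots, Y_t)$ to a distribution over $K_{Y_t}$.

```python
import random

# Toy sequential control process: states I = {0, 1}, decision sets
# K_0 = {0, 1}, K_1 = {0}. All numbers are illustrative.
# q[i][k][j] = q_{ij}(k): probability of moving to state j when
# decision k is taken in state i.
q = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5]},
}

def simulate(rule, y0, steps, rng):
    """Run the process for `steps` steps under `rule`.

    `rule` maps the history list [y0, d0, y1, d1, ..., y_t] to a dict
    {k: P(Delta_t = k | history)}, i.e. the D_k(.) of the paper.
    Returns the observed (state, decision) pairs.
    """
    history = [y0]
    path = []
    y = y0
    for _ in range(steps):
        probs = rule(history)
        k = rng.choices(list(probs), weights=list(probs.values()))[0]
        path.append((y, k))
        y = rng.choices(range(len(q[y][k])), weights=q[y][k])[0]
        history.extend([k, y])
    return path

# A rule in C': the decision distribution depends only on the current state.
def stationary_rule(history):
    y = history[-1]
    return {0: 0.5, 1: 0.5} if y == 0 else {0: 1.0}

rng = random.Random(0)
path = simulate(stationary_rule, 0, 5, rng)
```

Because `stationary_rule` ignores everything but `history[-1]`, the resulting state sequence is a Markov chain with stationary transition probabilities, as noted for rules in $C'$.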
A rule $R \in C'$ has a stationary Markovian character; indeed, when $R \in C'$ is used, the resulting process $\{Y_t\}, t = 0, 1, \cdots$, is a Markov chain with stationary transition probabilities. We let $C''$ denote the subclass of $C'$ where the $D_{ik}$'s are zero or one. Rules in $C'$ allow for randomization; the rules in $C''$ are non-randomized. For a given $R \in C$ and initial state $Y_0 = i$, let $X_{T,j,k,R}(i) = (T + 1)^{-1}\sum^T_{t=0} P_R(Y_t = j, \Delta_t = k \mid Y_0 = i)$ and let $X_{T,R}(i)$ denote the vector with components $X_{T,j,k,R}(i)$ for all $k \in K_j$ and $j \in I$. Denote by $H_R(i)$ the set of limit points of $X_{T,R}(i)$ as $T \rightarrow \infty$. Let $H(i) = \bigcup_{R\in C} H_R(i)$, $H'(i) = \bigcup_{R\in C'} H_R(i)$, $H''(i) = \bigcup_{R\in C''} H_R(i)$; and let $\bar{H}'(i)$ and $\bar{H}''(i)$ denote the convex hulls of $H'(i)$ and $H''(i)$, respectively. In  the following was proved.

Theorem 1. (a) $\bar H'(i) = \bar H''(i) \supset H(i)$. (b) If the Markov chain corresponding to $R$ is irreducible for every $R \in C''$, then $\bar H''(i) = H'(i) = H(i) = \bigcup_{i\in I} H(i)$.

Examples were given in  and  showing that $H(i)$ can be larger than $H'(i)$. In (b) the irreducibility assumption can be weakened to the condition that for each $R \in C''$ the corresponding Markov chain has at most one ergodic class. Blackwell , , and Maitra  have considered memoryless rules. By a memoryless rule we mean a rule $R$ such that $D_k(Y_0, \Delta_0, \cdots, \Delta_{t-1}, Y_t = i) = D^{(t)}_{ik}$, $t = 0, 1, \cdots$, $k \in K_i$, $i \in I$. That is, with a memoryless rule the mechanism for making a decision is a function only of the time $t$ and the state $i$ at time $t$. The memoryless rules of Blackwell and Maitra are non-randomized; i.e., $D^{(t)}_{ik} = 0$ or $1$.
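For a rule in $C'$ the averaged frequencies $X_{T,j,k,R}(i)$ can be computed exactly rather than simulated: propagate the state distribution $p_t(j) = P_R(Y_t = j \mid Y_0 = i)$ through the Markov chain induced by $R$ and average $p_t(j)\,D_{jk}$ over $t = 0, \cdots, T$. The transition law `q` and rule `D` below are illustrative examples of ours, not data from the paper.

```python
# q[i][k][j] = q_{ij}(k); D[i][k] = D_{ik} for a stationary rule R in C'.
# All numbers are illustrative.
q = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5]},
}
D = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}

def X(T, i):
    """Return {(j, k): X_{T,j,k,R}(i)} for the stationary rule D.

    X_{T,j,k,R}(i) = (T+1)^{-1} sum_{t=0}^{T} P_R(Y_t=j, Delta_t=k | Y_0=i),
    computed by forward propagation of p_t(j) = P_R(Y_t = j | Y_0 = i).
    """
    p = {j: 1.0 if j == i else 0.0 for j in q}          # p_0
    avg = {(j, k): 0.0 for j in q for k in D[j]}
    for t in range(T + 1):
        for j in q:
            for k in D[j]:
                avg[j, k] += p[j] * D[j][k] / (T + 1)
        # one step of the Markov chain induced by R:
        # p_{t+1}(j') = sum_{j,k} p_t(j) D_{jk} q_{jj'}(k)
        p = {j2: sum(p[j] * D[j][k] * q[j][k][j2]
                     for j in q for k in D[j]) for j2 in q}
    return avg

x = X(200, 0)
```

Each component of `x` is a non-negative state-action frequency and the components sum to one for every $T$; as $T \rightarrow \infty$ the vector converges here (the chain has one ergodic class), illustrating a single limit point of $X_{T,R}(i)$.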
We shall let $C^M$ denote the class of memoryless rules (both randomized and non-randomized). Thus $C \supset C^M \supset C' \supset C''$. If $R \in C^M - C'$, then $\{Y_t\}, t = 0, 1, \cdots$, is a finite state Markov chain with non-stationary transition probabilities. We remark that it is the memoryless rules (non-randomized) that are considered the rules of interest in the usual finite horizon dynamic programming problems. See Blackwell  for interesting remarks along these lines. We are concerned with optimization problems where the criteria to be optimized are functions of the points in $H(i)$. In  it was shown that one can construct problems where the optimal rule is in $C - C'$. This can occur, e.g., if the criterion to be optimized is a linear functional over the points in $H(i)$ but where a solution must also satisfy one or more linear constraints in $H(i)$. It is for the purpose of treating optimization problems of this kind that we are interested in the limit points of $X_{T,R}(i)$ for rules belonging to the various important sub-classes of $C$. Let $H^M(i) = \bigcup_{R\in C^M} H_R(i)$. The result of this note (which is similar to Theorem 4.1 of ) is

Theorem 2. $H(i) = H^M(i)$. In fact, for any $R \in C$ there exists an $R_0 \in C^M$ such that $X_{T,R_0}(i) = X_{T,R}(i)$ for all $T$.

Proof. Define $R_0$ by $D_{jk}(t) = P_{R_0}(\Delta_t = k \mid Y_t = j) = P_R(\Delta_t = k \mid Y_t = j, Y_0 = i)$. It is enough to show that for all $t$, $k \in K_j$ and $j \in I$, \begin{equation*} \tag{*} P_{R_0}(Y_t = j, \Delta_t = k \mid Y_0 = i) = P_R(Y_t = j, \Delta_t = k \mid Y_0 = i).\end{equation*} The relation $(\ast)$ holds for $t = 0$, since $P_R(Y_0 = j, \Delta_0 = k \mid Y_0 = i) = 0 = P_{R_0}(Y_0 = j, \Delta_0 = k \mid Y_0 = i)$ if $j \neq i$, and \begin{align*}P_R(Y_0 = i, \Delta_0 = k \mid Y_0 = i) &= P_R(\Delta_0 = k \mid Y_0 = i) \\ &= P_{R_0} (\Delta_0 = k \mid Y_0 = i)\end{align*} by definition.
Now assume $(\ast)$ holds for $t = 0, \cdots, T - 1$. Then $P_R(Y_T = j,\Delta_T = k \mid Y_0 = i) = P_R(Y_T = j \mid Y_0 = i)P_R(\Delta_T = k \mid Y_T = j, Y_0 = i)$. But $P_R(\Delta_T = k \mid Y_T = j, Y_0 = i) = D_{jk}(T)$ by definition, and \begin{align*}P_R(Y_T = j \mid Y_0 = i) &= \sum_{l\in I} \sum_{k\in K_l} P_R(Y_{T-1} = l, \Delta_{T-1} = k \mid Y_0 = i) q_{lj}(k) \\ &= \sum_{l\in I} \sum_{k\in K_l} P_{R_0}(Y_{T-1} = l, \Delta_{T-1} = k \mid Y_0 = i) q_{lj}(k) \\ &= P_{R_0}(Y_T = j \mid Y_0 = i)\end{align*} by the induction hypothesis. Thus \begin{align*}P_R(Y_T = j, \Delta_T = k \mid Y_0 = i) &= P_{R_0}(Y_T = j \mid Y_0 = i)D_{jk}(T) \\ &= P_{R_0}(Y_T = j, \Delta_T = k \mid Y_0 = i).\end{align*} This completes the proof.
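The construction in the proof is effective and can be checked numerically on a small example. In the sketch below (all states, decision sets, probabilities, and the history-dependent rule `R` are illustrative choices of ours), we compute the joint distributions $P_R(Y_t = j, \Delta_t = k \mid Y_0 = i)$ exactly by enumerating histories, define $D_{jk}(t) = P_R(\Delta_t = k \mid Y_t = j, Y_0 = i)$ as in the proof, and verify that the resulting memoryless rule $R_0$ reproduces the same joint marginals.

```python
# Illustrative two-state process: q[i][k][j] = q_{ij}(k).
q = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [1.0, 0.0]}}

def R(history):
    """A genuinely history-dependent rule: in state 0, favour decision 1
    iff the process has ever visited state 1."""
    visited1 = 1 in history[::2]          # states sit at even positions
    y = history[-1]
    if y == 0:
        return {0: 0.2, 1: 0.8} if visited1 else {0: 0.7, 1: 0.3}
    return {0: 0.6, 1: 0.4}

def joints(rule, i, horizon):
    """P(Y_t=j, Delta_t=k | Y_0=i) for t < horizon, by enumerating all
    histories (y_0, d_0, ..., y_t) together with their probabilities."""
    out = [dict() for _ in range(horizon)]
    paths = [([i], 1.0)]
    for t in range(horizon):
        nxt = []
        for hist, p in paths:
            y = hist[-1]
            for k, pk in rule(hist).items():
                out[t][y, k] = out[t].get((y, k), 0.0) + p * pk
                if pk > 0:
                    for j, pj in enumerate(q[y][k]):
                        if pj > 0:
                            nxt.append((hist + [k, j], p * pk * pj))
        paths = nxt
    return out

i, horizon = 0, 4
jR = joints(R, i, horizon)

# The memoryless rule of the proof: D_{jk}(t) = P_R(Delta_t=k | Y_t=j, Y_0=i).
Dt = [{j: {k: jR[t][j, k] / sum(v for (jj, _), v in jR[t].items() if jj == j)
           for (jj, k) in jR[t] if jj == j}
       for j in {jj for (jj, _) in jR[t]}}
      for t in range(horizon)]

def R0(history):
    t = (len(history) - 1) // 2           # time index = states seen - 1
    return Dt[t][history[-1]]

jR0 = joints(R0, i, horizon)
match = all(abs(jR[t].get(e, 0.0) - jR0[t].get(e, 0.0)) < 1e-12
            for t in range(horizon) for e in set(jR[t]) | set(jR0[t]))
```

Although `R0` assigns different probabilities to individual histories than `R` does (it mixes the "visited state 1" and "never visited state 1" behaviours), the joint state-action marginals agree at every time, which is exactly the relation $(\ast)$ established in the proof.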

## Information

Published: February, 1966
First available in Project Euclid: 27 April 2007

zbMATH: 0138.13604
MathSciNet: MR184778
Digital Object Identifier: 10.1214/aoms/1177699618 