Invariance, Minimax Sequential Estimation, and Continuous Time Processes

J. Kiefer

doi:10.1214/aoms/1177706874

Ann. Math. Statist. 28(3): 573-601 (September, 1957). DOI: 10.1214/aoms/1177706874

Abstract

The main purpose of this paper is to prove, by the method of invariance, that in certain sequential decision problems (discrete and continuous time) there exists a minimax procedure $\delta^\ast$ among the class of all sequential decision functions such that $\delta^\ast$ observes the process for a constant length of time. In the course of proving these results a general invariance theorem will be proved (Sec. 3) under conditions which are easy to verify in many important examples (Sec. 2). A brief history of the invariance theory will be recounted in the next paragraph. The theorem of Sec. 3 is to be viewed only as a generalization of one due to Peisakoff [1]; the more general setting here (see Sec. 2; the assumptions of [1] are discussed under Condition 2b) is convenient for many applications, and some of the conditions of Sec. 2 (and the proofs that they imply the assumptions) are new; but the method of proof used in Sec. 3 is only a slight modification of that of [1]. The form of this extension of [1] in Secs. 2 and 3, and the results of Secs. 4 and 5, are new as far as the author knows.

In 1939 Pitman [2] suggested on intuitive grounds the use of best invariant procedures in certain problems of estimation and testing hypotheses concerning scale and location parameters. In the same year Wald [3] had the idea that the theorem of Sec. 3 should be valid for certain nonsequential problems of estimating a location parameter; unfortunately, as Peisakoff points out, there seems to be a lacuna in Wald's proof. During the war Hunt and Stein [4] proved the theorem for certain problems in testing hypotheses in their famous unpublished paper whose results have been described by Lehmann in [5a], [5b].
Peisakoff's previously cited work [1] of 1950 contains a comprehensive and fairly general development of the theory and includes many topics such as questions of admissibility and consideration of vector-valued risk functions which will not be considered in the present paper (the latter could be included by using the device of taking linear combinations of the components of the risk vector). Girshick and Savage [6] at about the same time gave a proof of the theorem for the location parameter case with squared error or bounded loss function. In their book [7], Blackwell and Girshick in the discrete case prove the theorem for location (or scale) parameters. The referee has called the author's attention to a paper by H. Kudo in the Nat. Sci. Report of the Ochanomizu University (1955), in which certain nonsequential invariant estimation problems are treated by extending the method of [7]. All of the results mentioned above are nonsequential. Peisakoff [1] mentions that sequential analysis can be considered in his development, but (see Sec. 4) his considerations would not yield the results of the present paper.

A word should be said about the possible methods of proof. (The notation used here is that of Sec. 2 but will be familiar to readers of decision theory.) The method of Hunt and Stein, extended to problems other than testing hypotheses, is to consider for any decision function $\delta$ a sequence of decision functions $\{\delta_i\}$ defined by $$\delta_i(x,\Delta) = \int_{G_i} \delta(gx,g\Delta)\,\mu(dg)/\mu(G_i),$$ where $\mu$ is left Haar measure on a group $G$ of transformations leaving the problem invariant and $\{G_n\}$ is a sequence of subsets of $G$ of finite $\mu$-measure and such that $G_n \rightarrow G$ in some suitable sense. If $G$ were compact, we could take $\mu(G) = 1$ and let $G_1 = G$; it would then be clear that $\delta_1$ is invariant and that $\sup_F r_{\delta_1}(F) \leqq \sup_F r_\delta(F),$ yielding the conclusion of the theorem of Sec. 3.
If $G$ is not compact, an invariant procedure $\delta_0$ which is the limit in some sense of the sequence $\{\delta_i\}$ must be obtained (this includes proving that, in Lehmann's terminology, suitable conditions imply that any almost invariant procedure is equivalent to an invariant one) and $\sup_F r_{\delta_0}(F) \leqq \sup_F r_\delta(F)$ must be proved. Peisakoff's method differs somewhat from this, in that for each $\delta$ one considers a family $\{\delta_g\}$ of procedures obtained in a natural way from $\delta$, and shows that an average over $G_n$ of the supremum risks of the $\delta_g$ does not exceed that of $\delta$ as $n \rightarrow \infty$; there is an obvious relationship between the two methods. Similarly, in [7] the average of $r_\delta(gF_0)$ for $g$ in $G_n$ and some $F_0$ is compared with that of an optimum invariant procedure (the latter can thus be seen to be Bayes in the wide sense); the method of [6] is in part similar. In some problems it is convenient (see Example iii and Remark 7 in Sec. 2) to apply the method of Hunt and Stein to a compact group as indicated above in conjunction with the use of Peisakoff's method for a group which is not compact. The possibility of having an unbounded weight function does not arise in the Hunt-Stein work. Peisakoff handles it by two methods, only one of which is used in the present paper, namely, to truncate the loss function. The other method (which also uses a different assumption from Assumption 5) is to truncate the region of integration in obtaining the risk function. Peisakoff gives several conditions (usually of symmetry or convexity) which imply Assumption 4 of Sec. 2 or the corresponding assumption for his second method of proof in the cases treated by him, but does not include Condition 4b or 4c of Sec. 2.
Blackwell and Girshick use Condition 4b for a location parameter in the discrete case with $W$ continuous and not depending on $x$, using a method of proof wherein it is the region of integration rather than the loss function which is truncated. (The proof in [6] is similar, using also the special form of $W$ there.) It is Condition 4c which is pertinent for many common weight functions used in estimating a scale parameter, e.g., any positive power of relative error in the problem of estimating the standard deviation of a normal d.f.

The overlap of the results of Secs. 4 and 5 of the present paper with previous publications will now be described. There are now three known methods for proving the minimax character of decision functions. Wolfowitz [8] used the Bayes method for a great variety of weight functions for the case of sequential estimation of a normal distribution with unknown mean (see also [9]). Hodges and Lehmann [10] used their Cramér-Rao inequality method for a particular weight function in the case of the normal distribution with unknown mean and gamma distribution with unknown scale (as well as in some other cases not pertinent here) to obtain a slightly weaker minimax result (see the discussion in Sec. 6.1 of [12]) than that obtainable by the Bayes method. The Bayes method was used in the sequential case by Kiefer [11] in the case of a rectangular distribution with unknown scale or exponential distribution with unknown location, for a particular weight function. This method was used by Dvoretzky, Kiefer and Wolfowitz in [12] for discrete and continuous time sequential problems involving the Wiener, gamma, Poisson, and negative binomial processes, for particular classes of weight functions. The disadvantage of using the Cramér-Rao method is in the limitation of its applicability in weight function and in regularity conditions which must be satisfied, as well as in the weaker result it yields.
The Bayes method has the disadvantage that, when a least favorable a priori distribution does not exist, computations become unpleasant in proving the existence (if there is one) of a constant-time minimax procedure unless an appropriate sequence of a priori distributions can be chosen in such a way that the a posteriori expected loss at each stage does not depend on the observations (this is also true in problems where we are restricted to a fixed experimentation time or size, but it is less of a complication there); thus, the weight functions considered in [12] for the gamma distribution were only those relative to which such sequences could be easily guessed, while the proof in [11] is made messy by the author's inability to guess such a sequence, and even in [8] the computations become more involved in the case where an unsymmetric weight function is treated. (If, e.g., $\mathscr{F}$ is isomorphic to $G$, the sequence of a priori distributions obtained by truncating $\mu$ to $G_n$ in the previous paragraph would often be convenient for proving the minimax character by the Bayes method if it were not for the complication just noted.) The third method, that of invariance, has the obvious shortcoming of yielding little unless the group $G$ is large enough and/or there exists a simple sequence of sufficient statistics; however, when it applies to the extent that it does in the examples of Secs. 4 and 5, it reduces the minimax problem to a trivial problem of minimization. Several other sequential problems treated in Sec. 4 seem never to have been treated previously by any method or for any weight function; some of these involve both an unknown scale and unknown location parameter. A multivariate example is also treated in Sec. 4. In Example xv of Sec. 4 will be found some remarks which indicate when the method used there can or cannot be applied successfully. In Sec. 5, in addition to treating continuous time sequential problems in a manner similar to that of Sec. 4, we consider another type of problem where the group $G$ acts on the time parameter of the process rather than on the values of the sample function.
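
The compact-group case of the Hunt-Stein averaging described above can be illustrated on a small finite problem. The following Python sketch is not taken from the paper: the cyclic group $Z_5$, the noise distribution, the 0-1 loss, and all function names are assumptions chosen purely for illustration. It averages an arbitrary rule $\delta$ over the whole group, with normalized counting measure playing the role of $\mu$, and computes exact risks with rational arithmetic.

```python
from fractions import Fraction

M = 5  # order of the cyclic group Z_M (hypothetical toy choice)

# Illustrative noise law p(e) for the model X = theta + e (mod M).
NOISE = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8),
         Fraction(1, 16), Fraction(1, 16)]


def loss(theta, d):
    """Invariant 0-1 loss: unchanged when theta and d are shifted together."""
    return 0 if d == theta else 1


def point_rule(table):
    """Represent the nonrandomized rule x -> table[x] as a matrix delta[x][d]."""
    return [[Fraction(int(table[x] == d)) for d in range(M)] for x in range(M)]


def average_over_group(delta):
    """delta_bar(x, d) = (1/M) * sum over g of delta(x + g, d + g)."""
    return [[sum(delta[(x + g) % M][(d + g) % M] for g in range(M)) / M
             for d in range(M)] for x in range(M)]


def risk(delta, theta):
    """Exact expected loss of the (possibly randomized) rule at theta."""
    return sum(NOISE[(x - theta) % M] * delta[x][d] * loss(theta, d)
               for x in range(M) for d in range(M))


def risks(delta):
    """Risk function evaluated at every parameter value."""
    return [risk(delta, theta) for theta in range(M)]


# An arbitrary rule that is not invariant under shifts.
delta = point_rule([0, 0, 2, 3, 3])
delta_bar = average_over_group(delta)

print("risks of delta:    ", [str(r) for r in risks(delta)])
print("risks of delta_bar:", [str(r) for r in risks(delta_bar)])
```

In this toy setting the averaged rule is invariant, its risk is constant in the parameter, and that constant equals the average of the original risks, so it cannot exceed the original supremum risk. The noncompact case discussed in the text requires the limiting arguments that this sketch does not attempt.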

Citation

J. Kiefer. "Invariance, Minimax Sequential Estimation, and Continuous Time Processes." Ann. Math. Statist. 28 (3) 573 - 601, September, 1957. https://doi.org/10.1214/aoms/1177706874