On universal algorithms for classifying and predicting stationary processes

This is a survey of results on universal algorithms for classification and prediction of stationary processes. The classification problems include discovering the order of a k-step Markov chain, determining memory words in finitarily Markovian processes and estimating the entropy of an unknown process. The prediction problems cover both discrete and real valued processes in a variety of situations. Both the forward and the backward prediction problems are discussed with the emphasis being on pointwise results. This survey is just a teaser. The purpose is merely to call attention to results on classification and prediction. We will refer the interested reader to the sources. Throughout the paper we will give illuminating examples. AMS 2000 subject classifications: Primary 60G25, 60G10.


Introduction
Forty-five years ago David Bailey wrote a PhD thesis under the direction of Donald Ornstein [4] entitled "Sequential schemes for classifying and predicting ergodic processes". Even though the thesis was never published, it was very influential and gave rise to a great deal of work, and it is our purpose to survey some of the developments in this research program. To put things in a proper historical perspective we begin by reviewing the main results of that thesis.
The general problem considered there was that of extracting as much information as possible from a sequence of observations X_0, X_1, ..., X_n of a finite alphabet stationary stochastic process. He gave the first universal estimation scheme for the evaluation of the Shannon entropy, prior to the schemes which arose from the universal data compression algorithms of J. Ziv and A. Lempel [105]. He then showed that for each k there is a sequence of functions g_n which, when applied to X_0, X_1, ..., X_n, will with probability one eventually equal YES or NO according to the alternative "the process IS/IS NOT a k-step mixing Markov chain". On the other hand, he showed the nonexistence of a similar sequence of functions for deciding membership in the union over all k of these classes.
In contrast to the pioneering universal scheme of D. Ornstein [82] for estimating, in a sequential fashion, the conditional probability of X_0 given the infinite past {X_i : i < 0}, he showed the nonexistence of such a universal scheme for the forward problem of estimating the conditional probability of X_{n+1} given the observations X_0, X_1, ..., X_n.
In the first part we concentrate on discrete (finite or countably infinite) valued processes and begin by taking up the questions that relate to learning about general features of a process in a sequential fashion. We start by addressing the problem of estimating the order k of a k-step Markov chain, including countable state chains. In contrast to Bailey's negative result for two valued decision schemes, we show that there is a sequence of functions g n which when applied to the outputs X 0 , X 1 , ...X n of any ergodic process will converge with probability one to the order k if the process is k-step Markov and to infinity otherwise. We will also describe some further negative results, generalizing Bailey's, for classification of the class of processes called finitarily Markovian, where the next output depends on a finite segment of the past but the length of this segment is not bounded.
Following this we will describe some more general classification problems, giving a variety of conditions under which one can, with eventual certainty, decide between membership in two disjoint classes of processes. In the last part of this section we will describe the recent striking characterization of the Shannon entropy as essentially the only finitely observable isomorphism invariant of a process.
Most of the next section deals with estimation problems for finitarily Markovian processes (also called finite context processes or variable length Markov processes). Before continuing the introduction we pause to give an intuitive definition of this class. The memory length of a sequence of past observations {X_i : i ≤ 0} of a process is the smallest 0 ≤ K(..., X_{-1}, X_0) ≤ ∞ such that the conditional distribution of X_1 given the entire past equals the conditional distribution of X_1 given only X_{1-K}, ..., X_0. When K is finite, the same conditional distribution is obtained for any other continuation {X_j : j ≤ -K} of the past. A process is finitarily Markovian if with probability one this K is finite. If K is bounded by k then the process is a Markov chain of order at most k.
We describe universal backward schemes for the estimation of this memory length which almost surely converge to the correct value K(..., X_{-2}, X_{-1}, X_0). The forward estimation problem is that of determining the memory length at time n from the observations X_0, X_1, ..., X_n. Here there is no universal scheme. We will show that even within the class of two-step countable Markov chains one cannot successfully guess, along a sequence of stopping times of density one, whether the minimal memory length is one or two. We will also show that within the class of binary finitarily Markovian processes one cannot guess the value of K(X_0, X_1, ..., X_{λ_n}) along a sequence of stopping times λ_n with λ_n/n → 1. The last part of this section deals with the special class of binary renewal processes and the problem of estimating the residual waiting time until the next occurrence of the renewal state.
The second part of the survey is devoted to real valued processes. In his thesis, Bailey [4] showed that for finite valued processes, even though no scheme can be universally successful for forward estimation, any universal backward scheme, when used for forward prediction, will converge almost surely in Cesàro mean; cf. also Ornstein [82]. Several authors have extended this to bounded real valued processes using quantization to reduce to the finite valued case; see for example Algoet [1,3], Morvai [53], Morvai, Yakowitz and Györfi [56]. Yet another approach to sequential prediction used a weighted average of expert schemes, and with these schemes the results were extended to the general unbounded case by Nobel [80] and Györfi and Ottucsák [28] (see also the survey of Feder and Merhav [50]). However, none of these results were optimal, in the sense that moment conditions higher than those strictly necessary were assumed. We will describe some optimal results that we recently obtained for this forward prediction problem for real valued processes.
We have already mentioned the use of stopping times in devising universal schemes, and we will describe a few results of this kind in the next subsection, where we focus our attention on those processes for which the conditional distribution of X_0 given the past becomes a continuous function of the past outputs after a set of probability zero is omitted. Next we take up the case of Gaussian processes, which have been considered by Schäfer [100]. He constructed an algorithm which can estimate the conditional expectation at every time instant n for an extremely restricted class of Gaussian processes. A more general result, giving an estimate of the conditional mean along a stopping time sequence, will be described for stationary Gaussian (not necessarily ergodic) processes; it covers a much wider class than that in Schäfer [100]. The disadvantage of these estimators is the rapid growth of the stopping times. A more realistic scheme with more moderate growth will also be given.
Throughout the survey we will give specific examples to illustrate the ideas.

Discovering features of a process by sequential sampling
A stochastic process X = {X n : 0 ≤ n < ∞} is determined by the joint distributions of the random variables {X 0 , X 1 , ..., X k } for all k. We will be interested in stationary stochastic processes. These are those processes for which the joint distribution of {X t , X t+1 , ..., X t+k } is the same as that of {X 0 , X 1 , ..., X k } for all t and all k. The simplest examples are independent identically distributed random variables and stationary Markov chains. Stationary processes can be uniquely extended into the past. This means that on a possibly enlarged sample space we have random variables {X n : −∞ < n < ∞} whose distributions are stationary.
For notational convenience, we will write X_m^n = (X_m, ..., X_n), where m ≤ n, throughout this survey. We shall deal primarily with ergodic processes. These are stationary processes that cannot be decomposed into an average of stationary processes in a non-trivial fashion. Irreducible Markov chains are always ergodic. It is an easy consequence of Birkhoff's ergodic theorem that if a process {X_n} is both stationary and ergodic, then from almost every sample sequence of the process one can determine the joint distributions. Indeed, in that case, for a fixed k, with probability 1, the empirical distributions on k-tuples determined by the sample converge to the true distribution, and the knowledge of these finite dimensional distributions determines the original process X. In brief, with probability 1, a single sampling of an ergodic stationary process suffices to determine the nature of the process exactly.
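The convergence of empirical k-tuple frequencies invoked above is easy to experiment with. The following sketch (the function name is ours, not from the survey) computes the empirical distribution of k-blocks from a finite sample; for an ergodic process these frequencies converge almost surely to the true k-block probabilities.

```python
from collections import Counter

def empirical_k_block_distribution(sample, k):
    """Relative frequencies of k-tuples in a finite sample x_0, ..., x_n.

    For a stationary ergodic process the pointwise ergodic theorem
    guarantees that these frequencies converge almost surely to the
    true k-block probabilities as the sample grows."""
    total = len(sample) - k + 1
    counts = Counter(tuple(sample[i:i + k]) for i in range(total))
    return {block: c / total for block, c in counts.items()}

# For the deterministic alternating sample 0,1,0,1,... the blocks (0,1)
# and (1,0) each occur with frequency close to 1/2, and (0,0) never occurs.
dist = empirical_k_block_distribution([0, 1] * 500, 2)
```

Here `dist[(0, 1)]` and `dist[(1, 0)]` are both close to 0.5, while the block `(0, 0)` is absent from the dictionary.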
A more realistic situation is one in which, as time goes on, we are presented with more and more observations and are asked to give some information about X based on a finite sample x_0, x_1, ..., x_n, information which gets better and better as n increases. In this first section we will survey several kinds of specific problems that correspond to this general situation. We will begin with a simple problem in which we want to determine the order of a k-step Markov chain, and then go on to discuss the more basic question of determining whether or not the process that we are observing is a Markov chain of some finite order. After these more specific classes of processes we will discuss more general classification problems, and then conclude this section with a remarkable characterization of the entropy of a process as the unique finitely observable isomorphism invariant. These notions will be defined below.

Estimating the order of a Markov chain
For a stationary stochastic process {X_n} with values in some set X, finite or countably infinite, a word w ∈ X^k of length k is called a memory word if the conditional probability of X_0 given the past is constant on the cylinder set defined by X_{-k}^{-1} = w. For a formal definition we introduce some notation for the distributions and conditional distributions: let p(x_{-k}^0) denote the probability of the event X_{-k}^0 = x_{-k}^0 and let p(y|x_{-k}^0) denote the conditional probability of the event X_1 = y given that the event X_{-k}^0 = x_{-k}^0 occurred. Note that random variables are denoted by capital letters and particular realizations by lower case letters. For example, p(y|X_{-k}^0) denotes the random variable, a function of X_{-k}^0, taking the value p(y|x_{-k}^0) on the event X_{-k}^0 = x_{-k}^0. We say that the empty word ∅, of length zero, is a memory word if for all i ≥ 1, all y ∈ X and all z_{-i+1}^0 ∈ X^i such that p(z_{-i+1}^0, y) > 0 we have p(y|z_{-i+1}^0) = P(X_1 = y). If the empty word is a memory word then it is also called a minimal memory word.
If no proper suffix of w is a memory word then w is called a minimal memory word.
Note that the empty word is a memory word if and only if the stationary stochastic process is independent and identically distributed. Let W_k denote the set of memory words w_{-k+1}^0 of length k and let W* denote the set of all memory words. Note that W_0 is either the empty set or it contains exactly the empty word. Note also that if the empty word is a memory word then it is the only minimal memory word.
For example, in a k-step Markov process all words of length k are memory words. However, in general, a k-step Markov process may also have shorter memory words, cf. Bühlmann and Wyner [10]. Naturally, any left extension of a memory word is also a memory word.

Example 2.1. Consider an independent and identically distributed process {X_n} on a countable alphabet. Then the empty word is a memory word and it is the only minimal memory word. Here the length of the shortest minimal memory word is zero, and the length of the longest minimal memory word is also zero.
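For a k-step chain given explicitly by its transition kernel, whether a short word is a memory word can be checked mechanically: by the k-step Markov property it suffices to compare the next-symbol laws over all positive-probability length-k contexts ending in the word. A sketch (the function and the example kernel are ours, purely illustrative):

```python
def is_memory_word(word, k, alphabet, kernel):
    """Check whether `word` (a tuple of length <= k) is a memory word of
    a k-step Markov chain.  `kernel` maps each positive-probability
    length-k context (a tuple) to a dict of next-symbol probabilities.
    By the k-step Markov property it is enough that the next-symbol law
    agrees on every context ending in `word`."""
    m = len(word)
    laws = [kernel[ctx] for ctx in kernel if ctx[k - m:] == tuple(word)]
    if not laws:
        return False  # the word never occurs
    ref = laws[0]
    return all(abs(law.get(y, 0.0) - ref.get(y, 0.0)) < 1e-12
               for law in laws for y in alphabet)

# A binary 2-step chain in which '1' is a memory word but '0' is not:
# the two contexts ending in 1 share one law, the two ending in 0 differ.
kernel = {(0, 0): {0: 0.5, 1: 0.5}, (1, 0): {0: 0.2, 1: 0.8},
          (0, 1): {0: 0.7, 1: 0.3}, (1, 1): {0: 0.7, 1: 0.3}}
```

With this kernel, `is_memory_word((1,), 2, [0, 1], kernel)` is true and `is_memory_word((0,), 2, [0, 1], kernel)` is false.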

Example 2.2. Consider the deterministic binary Markov chain which alternates between the states 0 and 1. Choosing the uniform initial distribution, this Markov chain yields a stationary process. This stationary process is ergodic. Indeed, the process has only two possible realizations ω_{-∞}^∞, the two shifts of ...0101..., each occurring with probability 0.5, and an invariant set is either the empty set (which has probability zero) or must contain both of these realizations (in which case it has probability one). The minimal memory words are '0' and '1'; the other memory words, of length k ≥ 2, are simply the left extensions of these.

One can also construct a stationary and ergodic binary Markov chain {Z_n} of order 2 whose minimal memory words are '1', '10' and '00'. Note that in this case the length of the shortest minimal memory word is one and the length of the longest minimal memory word is two. One can further show by example that a right extension of a memory word is not necessarily a memory word.

Consider now the problem of determining the order of a Markov chain, based on sequentially observing the outputs of a single sample X_1, X_2, ..., X_n. That is to say, we would like to have a sequence of functions L_n such that L_n(X_1, X_2, ..., X_n) converges almost surely to M if the process is an M-step Markov chain but not an (M-1)-step Markov chain, and to infinity otherwise.

Early work on this problem, like that of Merhav, Gutman and Ziv [51], Finesso [19,20], Csiszár and Shields [13], Csiszár [14] and Peres and Shields [87], was restricted to finite state processes. This enabled them to use a priori rates for the convergence of empirical distributions and entropy estimators. Morvai and Weiss [63] gave the first universal order estimator for countable state Markov processes. However, in that scheme the data segment was unnecessarily divided into two parts. Later, in [67], a simpler and better scheme was given which does not divide the data segment in two. To review this scheme we begin with a formal definition of the memory length.

Definition 2.2. For a stationary time series {X_n} define the memory length K(X_{-∞}^0) as the smallest k ≥ 0 such that X_{-k+1}^0 ∈ W_k, and set K(X_{-∞}^0) = ∞ if there is no such k. The process is finitarily Markovian precisely when K(X_{-∞}^0) is finite almost surely. (For instance, for the stationary and ergodic binary renewal process {Z_n} of Example 2.5 the memory length is essentially the time elapsed since the last occurrence of the renewal state.)

The goal is now to estimate the essential supremum of the function K(X_{-∞}^0). The essential supremum of K(X_{-∞}^0) is equal to the order of the Markov chain if the process is Markov of some order, and is infinite otherwise. In other words, the essential supremum of K(X_{-∞}^0) is the smallest k ≥ 0 such that P(X_1^k ∈ W_k) = 1 if there is such a k, and infinite otherwise.
In order to describe the estimator of this essential supremum we first recall how it can be identified. For k ≥ 0 let S_k denote the support of the distribution of X_{-k}^0. In [67] a sequence of quantities Δ_k, k ≥ 0, is defined which measures the discrepancy between conditioning on the last k symbols and conditioning on longer portions of the past. If for some k, Δ_k = 0, then the process is a k-step Markov chain, and the least such k is the order of the chain.

Example 2.12. Consider the stationary and ergodic binary process {X_n} in Example 2.2. Then Δ_0 > 0 and Δ_i = 0 for i ≥ 1, so the least k with Δ_k = 0 is 1, the order of the chain.

We would like to define a statistic to estimate Δ_k. The key fact that we will use is the pointwise ergodic theorem. It follows from that theorem that, with probability one, for every fixed k the empirical distributions on k-tuples determined by the sample taken from time 0 up to time n converge, as n tends to infinity, to the true distribution. However, at any finite stage we only have a finite sample at our disposal, so we have to make sure that we have seen a specific k-block enough times to be sure that we are close to the truth. Here is the procedure in detail (cf. Morvai and Weiss [67]). We denote the usual empirical estimates of the conditional distributions p(x|z_{-k+1}^0) from the sample X_0^n by p̂_n(x|z_{-k+1}^0). (In other words, p̂_n(x|z_{-k+1}^0) is the ratio of the number of occurrences of the string (z_{-k+1}^0, x) in the observed X_0^n to the number of occurrences of the string z_{-k+1}^0 in X_0^n.) These p̂_n's are functions of X_0^n, but we suppress this dependence. As we have said, we only want to consider this statistic when the sample afforded us is sufficiently large. One kind of such restriction is the following.
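The empirical conditional probabilities can be computed exactly as described: a count of the (k+1)-block divided by a count of the k-block. A minimal sketch (the function name is ours):

```python
def empirical_conditional(sample, context):
    """Empirical estimate of p(x | context): the number of occurrences
    of `context` followed by x in the sample, divided by the number of
    occurrences of `context` that are followed by anything."""
    context = tuple(context)
    k = len(context)
    ctx_count = 0
    next_counts = {}
    for i in range(len(sample) - k):
        if tuple(sample[i:i + k]) == context:
            ctx_count += 1
            x = sample[i + k]
            next_counts[x] = next_counts.get(x, 0) + 1
    if ctx_count == 0:
        return {}  # the context was never observed
    return {x: c / ctx_count for x, c in next_counts.items()}

# In the alternating sample 0,1,0,1,... a zero is always followed by a one.
p_hat = empirical_conditional([0, 1] * 50, (0,))
```

Here `p_hat` is `{1: 1.0}`, reflecting the deterministic transition.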
For a fixed 0 < γ < 1 let S_k^n denote the set of strings of length k + 1 which appear more than n^{1-γ} times in X_0^n. These are the strings which occur sufficiently often that we can rely on their empirical distribution. Now define the empirical version Δ̂_k^n of Δ_k by restricting the discrepancy to strings in S_k^n (see [67] for the precise formulas). By ergodicity, the empirical conditional probabilities tend to the true conditional probabilities, and it is immediate that for any fixed k every fixed string of S_k belongs to S_k^n eventually almost surely. Now the key idea is that if the process is not Markov of order k then Δ̂_k^n stays bounded away from zero almost surely, while if it is, then Δ̂_k^n tends to zero with a rate. Thus define an estimate χ_n for the order from the sample X_0^n as follows. Let 0 < β < (1-γ)/2 be arbitrary. Set χ_0 = 0, and for n ≥ 1 let χ_n be the smallest 0 ≤ k < n such that Δ̂_k^n ≤ n^{-β} if there is such a k, and n otherwise. The algorithm works because if the process is not Markov of any order, or is Markov but k is smaller than the order, then Δ̂_k^n is bounded away from zero eventually almost surely, and so Δ̂_k^n exceeds n^{-β} eventually almost surely; while if k is greater than or equal to the order of the Markov chain then Δ̂_k^n tends to zero with a rate, that is, Δ̂_k^n is not greater than n^{-β} eventually almost surely. The next theorem asserts that this estimator is pointwise universally consistent.

Theorem 2.1 (Morvai and Weiss [67]). For any ergodic, stationary process {X_n} taking values in a finite or countably infinite alphabet: if the observed process is Markov then the sequence of estimators χ_n converges to the order of the Markov chain almost surely, and if the observed process is not Markov of any order then the sequence of estimators χ_n tends to infinity almost surely.

In other words, for any ergodic, stationary process {X_n} taking values in a finite or countably infinite alphabet, the sequence of estimators χ_n converges almost surely to the essential supremum of the memory function K(·). Moreover, if M > 0 is arbitrary but fixed then, over the class of all stationary and ergodic processes, χ_n < M eventually almost surely if the process is Markov with order less than M, and χ_n ≥ M eventually almost surely otherwise; cf. Morvai and Weiss [67]. A result in Morvai and Weiss [67] asserts that, even when we restrict attention to countable state second order Markov chains, there is no universal estimator for the length of the shortest memory word that converges even in probability.

For further reading on related topics see also [16] and [88].
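The order-estimation procedure just described can be sketched in code. The discrepancy used below compares empirical conditional distributions given contexts of length k and k + 1, restricted to strings seen more than n^{1-γ} times; this is a simplified variant in the spirit of [67], not the exact statistic used there, and all names are ours.

```python
from collections import defaultdict

def estimate_markov_order(sample, gamma=0.5, beta=0.2):
    """Return the smallest k whose empirical discrepancy falls below
    n**(-beta), mimicking the estimator chi_n described in the text
    (simplified: the true statistic of [67] may differ in detail)."""
    n = len(sample)
    threshold = n ** (1 - gamma)

    def cond_counts(k):
        # next-symbol counts for each length-k context in the sample
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(n - k):
            counts[tuple(sample[i:i + k])][sample[i + k]] += 1
        return counts

    for k in range(n):
        shorter = cond_counts(k)
        longer = cond_counts(k + 1)
        delta = 0.0
        for ctx, nxt in longer.items():
            total = sum(nxt.values())
            if total <= threshold:
                continue  # context not seen often enough to be reliable
            s_nxt = shorter[ctx[1:]]
            s_total = sum(s_nxt.values())
            if s_total == 0:
                continue
            for x in set(nxt) | set(s_nxt):
                delta = max(delta, abs(nxt[x] / total - s_nxt[x] / s_total))
        if delta <= n ** (-beta):
            return k
    return n

# The alternating sequence is a first-order chain; a constant one is i.i.d.
order_alt = estimate_markov_order([0, 1] * 200)
order_const = estimate_markov_order([0] * 400)
```

On these two deterministic inputs the sketch returns 1 and 0 respectively, matching the true orders.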

Classification for special processes
In this subsection we take up classification problems which seem simpler, since all that we want to do is determine whether or not our observations are coming from a certain class. Here is how to formalize the situation. Let X be a discrete (finite or countably infinite) alphabet and let {X_n} be a stationary and ergodic time series.
If G is a subclass of all stationary and ergodic binary processes, then a sequence of functions g_n : {0,1}^n → {YES, NO} is a classification for G in probability if lim_{n→∞} P(g_n(X_1, ..., X_n) = YES) = 1 for all processes in G, and lim_{n→∞} P(g_n(X_1, ..., X_n) = NO) = 1 for all processes not in G.
Similarly, g_n : {0,1}^n → {YES, NO} is a classification for G in a pointwise sense if g_n(X_1, ..., X_n) = YES eventually almost surely for all processes in G, and g_n(X_1, ..., X_n) = NO eventually almost surely for all processes not in G. Of course, if g_n is a classification in a pointwise sense then it is a classification in probability, but a classification in probability is not necessarily a classification in a pointwise sense.
For the class M_k of k-step mixing Markov chains of fixed order k, there are pointwise classifications of the type we have just described. Bailey [4] gave such a scheme for independent processes (k = 0) and indicated how to generalize the result to the class M_k. For the class M_mix = ∪_{k=0}^∞ M_k of mixing Markov chains of arbitrary order, Bailey showed that no such classification exists.

Theorem 2.2 (Bailey [4]). There is no sequence of functions g_n : {0,1}^n → {YES, NO} such that for all stationary and ergodic binary processes g_n(X_1, ..., X_n) = YES eventually almost surely if the process belongs to M_mix and g_n(X_1, ..., X_n) = NO eventually almost surely otherwise.

See Ornstein and Weiss [84] for some further results on this kind of question. For a generalization of this non-existence result of Bailey see Morvai and Weiss [61]. Now consider the class of finitarily Markovian processes. These are processes such that with probability one we will encounter a memory word as we look back into the past, but the lengths of these memory words need not be bounded. Simple examples are renewal processes: looking backwards, as soon as we see the recurrent event we have a memory word in hand.

Definition 2.3. The stationary time series {X_n} is finitarily Markovian if the memory length K(X_{-∞}^0) is finite almost surely.

In other words, the stationary and ergodic discrete process {X_n} is finitarily Markovian if and only if P(X_{-k+1}^0 ∈ W* for some k ≥ 0) = 1, where W* denotes the set of all memory words of the process. This class includes all finite order Markov chains (mixing or not) and many other processes, such as the finitarily deterministic processes of Kalikow, Katznelson and Weiss [37].
Here is another example, which includes all binary renewal processes with finite expected inter-arrival time. Let {M_n} be any stationary and ergodic first order Markov chain with finite or countably infinite state space S. Let s ∈ S be an arbitrary state with P(M_1 = s) > 0, and let X_n = I_{{M_n = s}}. The resulting binary time series {X_n} is stationary, ergodic and finitarily Markovian. In general, however, it need not be Markov of any order: for a suitable chain the conditional probability P(X_1 = 0|X_{-∞}^0) depends on whether, going backwards until the first occurrence of a one, you see an even or an odd number of zeros. A result in Morvai and Weiss [61] asserts that there is no classification for membership in the class of binary finitarily Markovian processes. The result applies both to pointwise classifications and to classifications in probability. For details see Morvai and Weiss [61].
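The instantaneous-function construction above is easy to simulate. In the sketch below (the function and the chain are ours, purely illustrative) a first-order chain {M_n} is run and the indicator X_n = 1{M_n = s} is recorded; with a deterministic 3-cycle the indicator process is periodic.

```python
import random

def indicator_of_state(n_steps, transition, start, target, seed=0):
    """Simulate a first-order Markov chain {M_n} with the given
    transition kernel (state -> dict of successor probabilities) and
    return the binary sequence X_n = 1 if M_n == target else 0."""
    rng = random.Random(seed)
    state = start
    xs = []
    for _ in range(n_steps):
        xs.append(1 if state == target else 0)
        states = list(transition[state])
        weights = [transition[state][s] for s in states]
        state = rng.choices(states, weights=weights)[0]
    return xs

# A deterministic 3-cycle 0 -> 1 -> 2 -> 0; the distinguished state is 0.
cycle = {0: {1: 1.0}, 1: {2: 1.0}, 2: {0: 1.0}}
xs = indicator_of_state(9, cycle, start=0, target=0)
```

For this chain the indicator sequence is 1, 0, 0 repeated, so looking back to the most recent 1 always yields a memory word.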
In contrast to the negative result on classification for the class of finitarily Markovian processes, one can construct a classification rule for the class of binary renewal processes (with renewal state zero), since there it is enough to check whether each of the words from the countable set {0, 01, 011, ...} is a memory word; cf. Morvai and Weiss [73]. For more results see D. Ryabko [93,94] or Morvai and Weiss [71,76].

On classifying general processes
The general problem of when one can discriminate between two classes of processes has been studied by several authors. In order to obtain positive results, the testing schemes considered are not restricted to being simply two valued, as were the schemes considered in the previous section. Some sufficient conditions for this to be possible were given by A. Dembo and Y. Peres [17] and by A. Nobel. Here is another result of this type, drawn from [103]. One of the motivations was the desire to recognize in an effective way when a process is a function of a Markov chain. These are very popular today in the mathematical biology literature under the name "Hidden Markov Models" (HMM). In [21] one can find a very nice characterization of these processes as those which can be defined by a finite number of finite dimensional stochastic matrices. Essentially the same characterization was rediscovered several years later by A. Heller in [32]. There has been much work on finding methods for determining the best HMM to fit some given data. In light of this, it is natural to ask: can one determine membership in this class by successive observations of X_1, X_2, ..., X_n? D. Bailey showed in his thesis [4] that this is not possible even for the class of all k-step Markov chains (k arbitrary, fixed number of states). In [61] we give a similar negative result for another extension of the class of all Markov chains, the finitarily Markovian processes.
On the other hand, if one restricts the order and the size of the state space then there are guessing schemes g n which will converge almost surely and test for membership, see for example [44], [13]. (In these papers there are integer valued schemes which are shown to converge to the least k such that the process is a k-step Markov chain, and with an a priori bound on the value of k this can be used to produce a two valued scheme which tests for membership in the class).
One can find such schemes for any family of ergodic processes with uniform rates in the ergodic theorem and a variant of this can be used for the class of all ergodic HMM where there is an a priori bound on the number of states in the Markov chain.
Let F denote some family of ergodic stochastic processes on a fixed state space S with a finite number of symbols. Identify these processes with the shift invariant measures on the compact space, S Z , of bi-infinite sequences of elements from S. On this space of measures put the weak* topology to obtain a compact space. Convergence in this topology coincides exactly with convergence of all finite dimensional distributions. We will be concerned mainly with ergodic measures, since by the ergodic decomposition almost every sequence produced by any stationary process is a typical sequence for some ergodic process. On the ergodic processes we take the induced topology. Thus when we speak of a closed family of ergodic processes we mean closed in this relative topology.
The estimation scheme will be based on the properties of the empirical distribution of k-blocks in n-strings over the alphabet S. Let us introduce the following notation for this empirical distribution. Let b ∈ S^k be a fixed k-block and u ∈ S^n an n-string, and define p_k(b|u) = |{1 ≤ i ≤ n - k + 1 : (u_i, ..., u_{i+k-1}) = b}| / (n - k + 1), the relative frequency of the block b in the string u.

Definition 2.5. A closed family of ergodic stochastic processes F has uniform rates if for every k ∈ N and every ε > 0 there is some n = n(k, ε) such that for every P ∈ F and every n ≥ n(k, ε), with P-probability at least 1 - ε the empirical distribution of k-blocks in X_1^n is within ε of the distribution of k-blocks under P.

With this definition, for any closed family with uniform rates, a guessing scheme with two values, {YES, NO}, can be constructed which will almost surely stabilize on YES if the process belongs to F and on NO in the contrary case. To this end let F be a family with uniform rates, and fix a sequence ε_k with Σ_k ε_k < ∞.

Let n_k = n(k, ε_k) be the sequence which the definition supplies for us, and define g_n as follows: for n in the range [n_k, n_{k+1} - 1], set g_n(x_1, ..., x_n) = YES if for some P ∈ F the empirical distribution of k-blocks in x_1^n is within ε_k of the distribution of k-blocks under P, and g_n(x_1, ..., x_n) = NO otherwise. With this definition, if the closed family of ergodic processes F has uniform rates and the g_n are defined in this way, then for almost every realization of a process P from the family F eventually g_n(x_1, x_2, ..., x_n) = YES, while for almost every realization of an ergodic process that is not in F eventually g_n(x_1, x_2, ..., x_n) = NO.
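A toy version of this guessing scheme, with a fixed block length k and tolerance ε rather than the full sequences k, ε_k, n_k (all names ours):

```python
from collections import Counter

def k_block_freqs(u, k):
    """Empirical distribution of k-blocks in the string u."""
    total = len(u) - k + 1
    counts = Counter(tuple(u[i:i + k]) for i in range(total))
    return {b: c / total for b, c in counts.items()}

def guess(u, family_k_marginals, k, eps):
    """Answer YES iff some process in the family has all of its k-block
    probabilities within eps of the empirical k-block frequencies of u.
    `family_k_marginals` lists, for each process P in the family, a dict
    mapping k-blocks to their P-probabilities."""
    emp = k_block_freqs(u, k)
    for marg in family_k_marginals:
        blocks = set(emp) | set(marg)
        if all(abs(emp.get(b, 0.0) - marg.get(b, 0.0)) <= eps for b in blocks):
            return "YES"
    return "NO"

# Family containing a single process: the fair i.i.d. coin (1-blocks only).
fair_coin = [{(0,): 0.5, (1,): 0.5}]
```

An alternating string matches the fair coin at the level of 1-blocks and is accepted; a constant string is rejected. In the full scheme the block length and tolerance are tightened as n grows, which is what forces the answer to stabilize almost surely.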
It is not hard to show that if K is a compact set of ergodic distributions then K has uniform rates.
For example, all Markov processes defined by transition matrices of a fixed size with a uniform positive lower bound on their entries have uniform rates, since this set is clearly compact and consists of ergodic processes only. We can now formulate a theorem which is sufficiently general and whose assumptions are purely topological.

Theorem 2.4 (Weiss [103]). If the family of ergodic processes E is closed (in the set of all ergodic processes) and is also σ-compact, then there are g_n such that for almost every realization of a process P from the family E eventually g_n(x_1, x_2, ..., x_n) = YES, while for almost every realization of an ergodic process that is not in E eventually g_n(x_1, x_2, ..., x_n) = NO.

Note that in contrast to Nobel's result the hypotheses refer only to the class E, and not to its complement, which would be needed to apply his theorem.
As examples of this theorem one can take all ergodic Markov processes with a fixed number of states. The σ-compactness can be seen by taking for the compact sets K_k all those ergodic Markov processes defined by transition matrices in which every nonzero entry is at least 1/k. In a similar fashion one sees that all ergodic hidden Markov models with a fixed number of states and a bound on the window size of the function satisfy the hypotheses of the theorem. For further reading on related topics see [5], [34], [17], [84], [47], [103], [91], [71], [22] and [42].

Finite observability and entropy
We can put the questions that we have been considering in a yet more general framework. For simplicity we will consider only finite valued processes in this subsection. If J is a function of ergodic processes taking values in a metric space (Ω, d), then we say that J is finitely observable (FO) if there is some sequence of functions S_n(x_1, x_2, ..., x_n) that converges to J(X) for almost every realization of the process X, for all ergodic processes. A weaker notion would involve convergence in probability of the functions S_n to J rather than convergence almost everywhere. The particular labels that a process carries play no role in the following, and so we may assume that all our processes take values in finite subsets of Z. Here the simplest example of an FO function is the mean E(X_0): by the ergodic theorem the arithmetic averages S_n(x_1, ..., x_n) = (x_1 + ... + x_n)/n converge almost surely to E(X_0) for every ergodic process. This may easily be generalized as follows. Denote by P the set of shift-invariant probability measures on Z^Z with support on a finite number of symbols, equipped with the topology of convergence of finite dimensional distributions. This means that a sequence of probability measures μ_n converges to a limiting measure μ if and only if for each finite block b the measures μ_n([b]) of the finite cylinder sets defined by the block b converge to μ([b]). Then to each finite-valued stationary process there corresponds a unique element of P, namely its distribution DIST(X). This function is also FO by the same argument, replacing the arithmetic averages of the x_i by the empirical distributions of finite blocks. Next consider the memory order L(X) of a process. This equals the minimal m such that the process is an m-step Markov process, and +∞ if no such m exists. (Note that L(X) is a number associated with the distribution of the process X.) In §2.1 it is shown that this function is FO.
A better-known example is the Shannon entropy of a process. Here, several different estimators S n are known to converge to the entropy; cf. [4,106,84,85,46]. The expected value of X 0 will clearly change if we change the labeling of our states but the Shannon entropy is not sensitive to such changes. In fact it is invariant under a very broad notion of equivalence of processes which we proceed to describe.
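Among the entropy estimators alluded to above, the simplest to state is the plug-in scheme: take the Shannon entropy of the empirical k-block distribution and divide by k. For an ergodic process this converges to the entropy rate when n → ∞ followed by k → ∞; it is one of several consistent schemes, sketched here with our own naming.

```python
import math
from collections import Counter

def plug_in_entropy_rate(sample, k):
    """Shannon entropy (in bits) of the empirical k-block distribution,
    divided by k.  For ergodic processes this converges to the entropy
    rate as the sample length, then the block length, tend to infinity."""
    total = len(sample) - k + 1
    counts = Counter(tuple(sample[i:i + k]) for i in range(total))
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values()) / k

# A constant process has zero entropy rate; the estimate is exactly 0.
h_const = plug_in_entropy_rate([0] * 100, 2)
```

For the deterministic alternating process the estimate with k = 2 is about 0.5 bits, and it decreases toward the true entropy rate 0 as k grows, illustrating why both limits are needed. Note also that relabeling the states permutes the blocks without changing the estimate, in line with the invariance discussed next.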
Processes X and X′ are isomorphic if there is a stationary invertible coding going from one to the other. More formally, a coding is a measurable map φ, defined on the space of bi-infinite sequences with values in X, which maps the probability distribution of the X random variables to that of the X′ random variables. It is stationary if almost surely φT = Tφ, where T is the shift. Finally, it is invertible if it is almost surely one-to-one. In this case it is not hard to see that the inverse mapping, where defined, yields a stationary coding from X′ to X.
While the definition of the entropy of a process was given by C. Shannon [101], it was the great insight of A. Kolmogorov [45] that it is in fact an isomorphism invariant. This enabled him to solve an outstanding problem in ergodic theory; namely, he proved that independent processes with differing entropies are not isomorphic. Since that time entropy has turned out to be fundamental in many areas of ergodic theory. It is perhaps somewhat surprising that no new invariants of that kind were discovered, and the next theorem of Ornstein and Weiss [85] explains this to some extent:

Theorem 2.5. (Ornstein and Weiss[85]) If J is a finitely observable function, defined on all ergodic finite-valued processes, that is an isomorphism invariant, then J is a continuous function of the entropy.
Note that there is no a priori assumption about the nature of the function J, such as measurability. An even stronger version of the theorem replaces isomorphism by the more restricted notion of finitary isomorphism. These are isomorphisms where the codings, in both directions, depend only on a finite (but variable) number of the variables; equivalently, they are codings that are continuous after the removal of a null set. About ten years after Kolmogorov's result D. Ornstein [83] showed the converse; namely, independent processes with the same entropy are isomorphic. This was strengthened to finitary isomorphism by M. Keane and M. Smorodinsky [41]. Finitary isomorphism is a strictly stronger notion than isomorphism, since there are many examples of processes that are isomorphic but not finitarily isomorphic.
It is natural to ask what happens when we restrict attention to smaller families of processes. That is, we now suppose that the finitely observable isomorphism invariant is only defined on a particular class, and ask whether one can find any new invariants. Y. Gutman and M. Hochman [23] have proved a rather general theorem which shows that for many natural examples of classes of processes the answer remains negative. These classes include the main classes of the various mixing types. We will content ourselves with formulating just two of their results here.
Theorem 2.6 (Gutman and Hochman [23]). If J is a finitely observable invariant on one of the following classes:
1. the Kronecker systems (the class of systems with pure point spectrum),
2. the zero entropy weakly mixing processes,
3. the zero entropy mildly mixing processes,
4. the zero entropy strongly mixing processes,
then J is constant.
For the class of irrational rotations the general problem is still open but they did obtain a partial result.

G. Morvai and B. Weiss
Theorem 2.7 (Gutman and Hochman [23]). For every finitely observable invariant J on the class of irrational rotations, there is a Borel set Θ ⊆ [0, 1) of full Lebesgue measure such that J assigns the same value to processes arising from rotations by angles in Θ. In particular, there is no complete finitely observable invariant for irrational rotations.

Estimation for finitarily Markovian processes
In this section we will concentrate on the class of finitarily Markovian processes and discuss several specific estimation problems for them. For our first problem we take up the basic question of the detection of memory words (cf. Morvai and Weiss [65]). This problem has often been discussed in the context of modelling processes, but mostly only for finite alphabet processes. We will show here how it relates to prediction questions.
To begin with, recall that K was the minimal length of the context that determines the conditional probability. Consider the problem of estimating the value of K, both in the backward sense, where we observe more and more of the past, and in the forward sense, where one observes successive values of {X_n} for n ≥ 0 and asks for the least value K such that the conditional distribution of X_{n+1} given {X_i}^n_{i=n−K+1} is the same as the conditional distribution of X_{n+1} given {X_i}^n_{i=−∞}. We will not restrict ourselves to the finite alphabet case, and we include the possibility that the process takes countably many values.
Similar questions have been studied by Bühlmann and Wyner in [10], but only for the case of finite alphabet finite order Markov chains. The possibility of countable alphabets complicates matters significantly: while for finite alphabet Markov chains empirical distributions converge exponentially fast and one can establish universal rates of convergence, for countable alphabet Markov chains no universal rates are available at all.
As for the classification problem, namely determining whether the observed process is finitarily Markovian or not, in Morvai and Weiss [61] it was shown that there is no classification rule for discriminating the class of finitarily Markovian processes from the ergodic processes that are not finitarily Markovian.
In the first subsection we will review how to determine the value of K(X^0_{−∞}) from observations of increasing length of the data segments X^0_{−n}. We will describe a universally consistent estimator which converges almost surely to the memory length K(X^0_{−∞}) for any ergodic finitarily Markovian process on a countable state space. Then we turn our attention to the forward estimation problem, that is, the attempt to determine K(X^n_{−∞}) from successive observations of X^n_0. Stationarity means that results in probability carry over automatically; however, almost sure results present serious problems, as we have already mentioned. For more results related to these questions of what can be learned about processes from forward observations see Ornstein and Weiss [84], Dembo and Peres [17], Nobel [79], and Csiszár and Talata [15].
In this last paper the authors define a finite context to be a memory word w of minimal length, that is, no proper suffix of w is a memory word. An infinite context for a process is an infinite string with all finite suffixes having positive probability but none of them being a memory word. There they treat the problem of estimating the entire context tree in case the size of the alphabet is finite. For a bounded depth context tree the process is Markovian, while for an unbounded depth context tree the universal pointwise consistency result there is obtained only for the truncated trees, which are again finite in size. This is in contrast to the results discussed here, which deal with infinite alphabet size and consistency in estimating memory words of arbitrary length. It is this generality that forces us to restrict to estimating at specially chosen times.
Finally, in the last subsection we will discuss estimating the residual waiting time in binary renewal processes. Recall that the classical binary renewal process is a stochastic process {X_n} taking values in {0, 1} where the lengths of the runs of 1's between successive zeros are independent. These arise, for example, in the study of Markov chains, since the return times to a fixed state form such a renewal process. In many applications the occurrences of a zero, which represent the failure times of some system that is renewed after each failure, are of importance, and so the problem arises of estimating when the next failure will occur. Since the waiting time is typically unbounded, this problem is rather difficult. We will discuss it at some length, deferring a detailed description of the results to the subsection itself.

Estimation of the memory length for finitarily Markovian processes
Let {X_n} be a stationary and ergodic finitarily Markovian process with finite or countably infinite alphabet X. In this subsection we will first show how to determine the value of K(X^0_{−∞}) from observations of increasing length of the data segments X^0_{−n}. We will describe a universally consistent estimator which converges almost surely to the memory length K(X^0_{−∞}) for any ergodic finitarily Markovian process on a countable state space.
In order to estimate K(X^0_{−∞}) (for the definition cf. Definition 2.2) some explicit statistics need to be defined. These will be the same as those that we used when estimating its essential supremum in finding the order of a Markov chain. For the convenience of the reader we briefly repeat their definition (cf. e.g. Morvai and Weiss [65] or [73]). The first is a measurement of the failure of w^0_{−k+1} to be a memory word. For the empty word ∅ of length zero one defines Δ_0(∅); if Δ_0(∅) = 0 then the process is independent and identically distributed. In general, for any k ≥ 1 and for any word w^0_{−k+1} one defines Δ_k(w^0_{−k+1}), which vanishes precisely when w^0_{−k+1} is a memory word.
An empirical version of this based on the observation of a finite data segment X^0_{−n} is needed. Let p̂_{−n}(x|w^0_{−k+1}) denote the usual empirical version of the conditional probability p(x|w^0_{−k+1}) from the samples X^0_{−n}. These p̂'s are functions of X^0_{−n}, but the dependence is suppressed to keep the notation manageable. For a fixed 0 < γ < 1 let L^n_k denote the set of strings of length k + 1 which appear more than n^{1−γ} times in X^0_{−n}. The empirical version Δ̂^n_0(∅) of Δ_0(∅) and, for any k ≥ 1 and any word w^0_{−k+1} ∈ X^k, the empirical version Δ̂^n_k(w^0_{−k+1}) of Δ_k are defined accordingly. By ergodicity, the ergodic theorem implies that almost surely the empirical distributions p̂ converge to the true distributions p. The key idea is that if w^0_{−k+1} is not a memory word then almost surely Δ̂^n_k(w^0_{−k+1}) stays bounded away from zero, while if w^0_{−k+1} is a memory word then Δ̂^n_k(w^0_{−k+1}) tends to zero not just almost surely, but at a definite rate. Now we review a test for w^0_{−k+1} to be a memory word. Let 0 < β < (1−γ)/2 be arbitrary. Let NTEST_n(w^0_{−k+1}) = YES if Δ̂^n_k(w^0_{−k+1}) ≤ n^{−β} and NO otherwise. Note that NTEST_n depends on X^0_{−n}. ('N' in NTEST stands for 'negative', since the data segment grows in the negative (backward) direction.) By Morvai and Weiss [65], eventually almost surely, NTEST_n(w^0_{−k+1}) = YES if and only if w^0_{−k+1} is a memory word. Now we define an estimate χ_n for K(X^0_{−∞}) from the samples X^0_{−n} as follows. Set χ_0 = 0, and for n ≥ 1 let χ_n be the smallest 0 ≤ k < n such that NTEST_n(X^0_{−k+1}) = YES if there is such a k, and n otherwise.
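The test can be sketched in code. The survey does not reproduce the formula for Δ̂^n_k, so the sketch below (ours) substitutes one plausible choice: the largest total-variation distance between the empirical next-symbol distribution given w and given any one-symbol extension of w that occurs more than n^{1−γ} times (the role played by L^n_k in the text); the thresholding against n^{−β} follows the description of NTEST above.

```python
from collections import Counter

def cond_dist(sample, ctx):
    """Empirical next-symbol distribution following context ctx,
    together with the number of occurrences of ctx in the sample."""
    n = len(sample)
    nxt = Counter(sample[i + len(ctx)]
                  for i in range(n - len(ctx))
                  if tuple(sample[i:i + len(ctx)]) == ctx)
    tot = sum(nxt.values())
    return ({a: c / tot for a, c in nxt.items()} if tot else {}), tot

def ntest(sample, w, beta=0.25, gamma=0.5):
    """Sketch of NTEST_n(w).  The discrepancy Delta-hat used here is an
    ASSUMED stand-in (max total-variation distance to frequently observed
    one-symbol extensions); the exact statistic is in Morvai-Weiss [65]."""
    n = len(sample)
    base, _ = cond_dist(sample, tuple(w))
    if not base:
        return "NO"                      # w was never observed
    delta = 0.0
    for z in set(sample):
        ext, count = cond_dist(sample, (z,) + tuple(w))
        if count > n ** (1 - gamma):     # only frequent extensions, as with L_k^n
            tv = 0.5 * sum(abs(base.get(a, 0.0) - ext.get(a, 0.0))
                           for a in set(base) | set(ext))
            delta = max(delta, tv)
    return "YES" if delta <= n ** (-beta) else "NO"

# for a deterministic 1st order chain the last symbol is a memory word
periodic = [0, 1] * 50
# for a period-3 chain a single 0 does not determine the next symbol
second_order = [0, 0, 1] * 40
```

On these toy samples, ntest declares the single symbol 0 a memory word for the alternating chain and rejects it for the period-3 chain.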

Theorem 2.8 (Morvai and Weiss [65]). Let {X n } be a stationary and ergodic finitarily Markovian process taking values from a finite or countably infinite alphabet. Then
χ_n(X^0_{−n}) = K(X^0_{−∞}) eventually almost surely.

Now we turn our attention to the forward estimation problem, where we are allowed to use growing segments of successive observations of X^n_0. Since when the word is a memory word one can use conditional independence, and hence specific rates, either going backward or forward, while if the word is not a memory word one can use the forward ergodic theorem instead of the backward one, it makes sense to define the forward version of the previous test, applied to the shifted data, where T is the left shift operator. ('P' in PTEST stands for 'positive', since the data segment grows in the positive (forward) direction.) By Morvai and Weiss [65], eventually almost surely, PTEST_n(w^0_{−k+1}) = YES if and only if w^0_{−k+1} is a memory word. PTEST tests whether a single word is a memory word or not. It is also possible to test whether all of the words on a countable list (instead of a single word) are memory words, cf. [73]. Now we shall examine how well one can estimate the local memory length for finite order Markov chains. In the case of finite alphabets this can be done with stopping times that eventually cover all time epochs (cf. Morvai and Weiss [65]). However, as soon as one passes to a countable alphabet, even if the order is known to be two and we are just trying to decide whether X_n alone is a memory word or not, there is no sequence of stopping times of density one which is guaranteed to succeed eventually, cf. Morvai and Weiss [65].

Theorem 2.9 (Morvai and Weiss [67]). There are no strictly increasing sequence of stopping times {λ_n} and estimators {h_n(X_0, . . . , X_{λ_n})} taking the values one and two, such that for all countable alphabet Markov chains of order two, lim_{n→∞} λ_n/n = 1 almost surely and h_n(X_0, . . . , X_{λ_n}) = K(X^{λ_n}_0) eventually almost surely.

Thus we cannot achieve density one in the forward memory length estimation problem even in the class of Markov chains on a countable alphabet. Now we shall show something similar in the class of binary (i.e. {0, 1} valued) finitarily Markovian processes. We will assume that we are given a sequence of estimators and stopping times (h_n, λ_n) that successfully estimate the memory length for binary Markov chains of finite order, and construct a finitarily Markovian binary process on which the scheme fails infinitely often. For a precise statement see Morvai and Weiss [65]. We emphasize that for the final counterexample process {X_n} constructed in Morvai and Weiss [65], eventually almost surely K(X^n_{−∞}) ≤ n and K(X^n_{−∞}) = K(X^n_0). For further reading cf. [73], [60] and [65].

On estimating the residual waiting time
In this subsection we investigate the possibility of giving a universal estimator at time n for the residual waiting time to the next zero in the binary renewal process {X n }.
As motivation, consider a big system, e.g. a telephone exchange or a computer system. The system can be either in a good state or in a bad state. When the system breaks down (gets into a bad state) it is restarted (renewal). We observe the sequence of states (good or bad) of the system, and having observed these states up to a certain time, we would like to estimate the residual waiting time to the next bad state / renewal. More precisely, we would like to estimate the conditional expectation of the residual waiting time until the next such renewal state without prior knowledge of the distribution.
Consider the renewal process {X n } with renewal state '0'. (For a formal definition see Morvai and Weiss [68].) We will assume that the process is stationary and ergodic. Even though our primary interest is in one sided processes, stationarity implies that there exists a two sided process with the same statistics and we will use the two sided version whenever it is convenient to do so. Note that these renewal processes are finitarily Markovian processes. Indeed, any word with positive probability from {0, 01, 011, 0111, . . . } is a memory word, though not necessarily a minimal one.
Our interest is in the waiting time to renewal (the state 0) given some previous observations, in particular given X^n_0. We introduce the notation τ(X^n_{−∞}) for the look-back time to the last zero in X^n_{−∞}. Formally, τ(X^n_{−∞}) is the t ≥ 0 such that X_{n−t} = 0 and X_i = 1 for n − t < i ≤ n.
For k = 0, 1, . . . let p_k denote the conditional probability that, given X_0 = 0, it is followed by exactly k ones until the next zero; formally, p_k = P(X_1 = 1, . . . , X_k = 1, X_{k+1} = 0 | X_0 = 0). Let σ_n denote the residual waiting time to the next zero after time n. Our goal is to estimate E(σ_n | X^n_0) without prior knowledge of the distribution of the process. In earlier works, such as [43], attention is restricted to those renewal processes which arise from Markov chains with a finite number of states. In that case the problem is much easier, since the probabilities p_k decay exponentially and one can use this information in trying to find not only the distribution but even the hidden Markov chain itself. We consider the general case, where the number of hidden states may be infinite and this exponential decay no longer holds.
For the estimator itself it is most natural to use the empirical distribution observed in the data segment X_0, X_1, . . . , X_n. However, if there have been too few occurrences of 1-blocks of length at least τ(X_0, X_1, . . . , X_n), then we cannot expect to give a good estimate. In particular, if no block of that length has occurred yet, clearly no intelligent estimate can be given. For this reason we will estimate only along stopping times.
Unfortunately, without higher moment assumptions on the p_k's, there is no strictly increasing sequence of stopping times {ξ_n} with density one and sequence of estimators {h_n(X_0, . . . , X_{ξ_n})} such that for all binary classical renewal processes the error |h_n(X_0, . . . , X_{ξ_n}) − E(σ_{ξ_n} | X_0, . . . , X_{ξ_n})| tends to zero almost surely as n tends to infinity. To obtain a positive result some higher moment assumptions on the p_k's are needed, cf. Morvai and Weiss [68]. Note also that stationarity of the process means that the first moment of the p_k's must be finite. Furthermore, in order that the expected value of σ_0, that is, E(σ_0) (not conditioned on the event that X_0 = 0), be finite, the second moment of the p_k's has to be finite. Now we describe the stopping times and the estimators. Define ψ as the position of the first zero, that is, ψ = min{t ≥ 0 : X_t = 0}, and let 0 < δ < 1 be arbitrary. The stopping times ξ_n are defined recursively: they are the successive times i at which the value t = τ(X^i_0) has already occurred enough times so that we can safely estimate the residual renewal time from the empirical distribution of the observations already made. We also fix κ_n as the index where, reading backwards from X_{ξ_n}, we will have seen for the first time at least ξ_n^{1−δ} occurrences of an i with τ(X^i_0) = τ(X^{ξ_n}_0). For n > 0 the estimator h_n(X_0, . . . , X_{ξ_n}) at time ξ_n is simply the average of the residual waiting times that we have already observed in the data segment X^{ξ_n}_{κ_n} at those times where the value of τ equaled the one seen at time ξ_n; the choice of κ_n ensures that exactly (ξ_n)^{1−δ} such occurrences are taken into account.
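A simplified rendering of this idea (ours; the precise thresholds and index bookkeeping are in [68], and the variable names here are hypothetical): at each time i we compute the look-back time τ, collect the residual waiting times observed after earlier positions with the same look-back value, and output their average only when enough of them are available.

```python
def residual_estimates(x, delta=0.5):
    """Simplified sketch of the intermittent residual-waiting-time
    estimator for a binary sequence x.  At time i let tau be the
    look-back time to the last zero; we average the residual waiting
    times observed after earlier positions j < i with the same
    look-back value, and output an estimate only when at least
    i**(1-delta) of them are available (a stand-in for the exact
    thresholds of [68])."""
    n = len(x)
    zeros = [i for i, b in enumerate(x) if b == 0]
    estimates = {}
    for i in range(n):
        past = [z for z in zeros if z <= i]
        if not past:
            continue                      # tau undefined before the first zero
        tau = i - max(past)
        residuals = []
        for j in range(i):
            pz = [z for z in zeros if z <= j]
            if not pz or j - max(pz) != tau:
                continue
            later = [z for z in zeros if z > j]
            if later:                     # residual time already observed
                residuals.append(later[0] - j)
        if len(residuals) >= max(1, i ** (1 - delta)):
            estimates[i] = sum(residuals) / len(residuals)
    return estimates

# on a deterministic alternating sequence each zero is followed by the
# next zero exactly 2 steps later
est = residual_estimates([0, 1, 0, 1, 0, 1, 0])
```

On this toy sequence the scheme stays silent at early times (too few matching occurrences) and outputs the correct residual time 2 once enough evidence has accumulated.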

The σ's, the ξ's, the ratios ξ_n/n, the κ's, and finally the h's can be computed explicitly for a concrete data segment. Note that the fact that ξ_n/n tends to one means that we are estimating on a sequence that has density one; in other words, we rarely fail to give an estimate.

Theorem 2.11 (Morvai and Weiss [68]). Assume that the p_k's possess a finite moment of order α, that is, Σ_k k^α p_k < ∞, for some α > 2. Then |h_n(X_0, . . . , X_{ξ_n}) − E(σ_{ξ_n} | X_0, . . . , X_{ξ_n})| → 0 almost surely and lim_{n→∞} ξ_n/n = 1 almost surely.
Note that both h_n and ξ_n depend on δ and so on α. We also constructed a more involved sequence of stopping times ξ*_n and estimators h*_n(X_0, . . . , X_{ξ*_n}), the constructions of which do not depend on a priori knowledge of α, and we also managed to weaken the assumption from α > 2 to α > 1, cf. Morvai and Weiss [68]. We also constructed intermittent schemes for estimating the residual waiting time to the next zero for all binary stationary and ergodic processes. The scheme consists of a sequence of stopping times λ_n and estimators f_n(X^{λ_n}_0), and it is consistent for all binary stationary and ergodic processes almost surely. If the process turns out to be a binary renewal process then lim_{n→∞} λ_n/n = 1 almost surely. Cf. Morvai and Weiss [74]. For further reading see [75] and [78].

Part II. Estimation for real valued processes
In the first part of this survey we dealt exclusively with discrete valued processes. In this part we will deal with real valued processes. If the one dimensional marginal distribution is continuous, then with probability one a finite number of observations will contain no repetitions. This means that in order to be able to use any of the methods that we were considering before, we will have to introduce quantizers which group the data so that there will be repetitions. We will discuss in this part several positive results for the forward prediction problem for real valued processes. The first of these is based on an observation of Bailey that although a backward scheme used in the forward direction need not converge pointwise, it may still converge in Cesaro mean. The last section is based on the idea of intermittent estimation: we do not predict at every time instant, but when we do predict we want to be certain that eventually our predictions are optimal.

Pointwise sequential estimation of the conditional expectation in Cesaro mean
In this section we consider the problem of estimating the conditional expectation E(X_n | X^{n−1}_0) from a single sample of length n. (For the origin of this problem cf. Cover [12].) We observe a longer and longer finite segment of the single sample path X^∞_0, and from the data segment X^{n−1}_0 we want to estimate the conditional expectation E(X_n | X^{n−1}_0). Unfortunately, this cannot be done, even for binary processes (cf. Ryabko [89] and Györfi, Morvai, and Yakowitz [27]). In his thesis, Bailey [4] constructed a backward estimator Ê_{−n}(X^{−1}_{−n}) which tries to approximate E(X_0 | X^{−1}_{−n}). It turned out that estimating the conditional expectation of the fixed random variable X_0 is possible, as the next theorem shows.

Theorem 3.2 (Bailey [4], Ornstein [82]). For the backward estimator Ê_{−n}(X^{−1}_{−n}) constructed in Bailey [4] (cf. Ornstein [82] also) and for all stationary and ergodic binary processes, Ê_{−n}(X^{−1}_{−n}) − E(X_0 | X^{−1}_{−n}) → 0 almost surely.
(Algoet [1], Morvai [53], and Morvai, Yakowitz and Györfi [56] extended this from binary processes to bounded real-valued stationary and ergodic processes. Györfi et al. [24] and Algoet [3] extended the above result further to unbounded real-valued stationary processes.) In his thesis, Bailey [4] (cf. Ornstein [82] also) indicated how Maker's (also known as Breiman's) generalized ergodic theorem can be used to turn the backward estimator into a forward estimator for which the error will tend to zero in Cesaro average.

Theorem 3.3 (Maker [49], Breiman [8, 9], Algoet [2]). Consider a stationary and ergodic dynamical system with the usual left shift operator T. Let f_n be a sequence of real valued functions such that f_n → f almost surely and E(sup_{n≥1} |f_n|) < ∞. Then lim_{n→∞} (1/n) Σ^n_{i=1} f_i(T^i ·) = E(f) almost surely.

Note that if the f_n's are bounded then the condition E(sup_{n≥1} |f_n|) < ∞ is trivially true. Combining the above theorems yields the Cesaro-mean consistency of the forward estimator, cf. Ornstein [82]. Several authors extended this from binary processes to bounded real valued processes, using quantization to reduce to the finite valued case; see for example Algoet [1, 3], Morvai [53], Morvai, Yakowitz and Györfi [56]. The extension to the unbounded case turned out to be difficult because of the requirement of the integrability of the supremum in Maker's theorem.
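A toy numerical illustration (ours, not from the text) of the averaging mechanism behind Maker's theorem: a convergent sequence of functions f_n, sampled along the orbit of the shift T, still has convergent Cesaro averages. Here the trajectory is a periodic binary sequence and f_n is the mean of the first n coordinates, so f_n converges to the frequency of ones along the orbit, which is 1/2.

```python
# T is the left shift on sequences; f_n looks at the first n coordinates
def shift(x, i):
    return x[i:]

def f(n, x):
    return sum(x[:n]) / n      # f_n -> frequency of ones along this orbit

x = [0, 1] * 1000              # a periodic binary trajectory
n = 1000
# Cesaro average of f_i evaluated along the orbit T^i x
cesaro = sum(f(i, shift(x, i)) for i in range(1, n + 1)) / n
```

The individual terms f_i(T^i x) oscillate around 1/2 (exactly 1/2 for even i), but their Cesaro average settles near the limit 1/2, as the theorem predicts.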
A different approach to sequential prediction uses a weighted average of simple estimators called 'experts', cf. e.g. Györfi and Lugosi [25]. The simple estimators can be partition-based, kernel-based etc. (cf. e.g. Györfi, Ottucsák and Walk [29]). The weight of an expert in the weighted average depends on its past performance as an estimator of the next outcome. These schemes are constructed directly as forward schemes, and with these, results were extended to the general unbounded case by Nobel [80] and Györfi and Ottucsák [28].

Theorem 3.4 (Györfi and Ottucsák [28]). Let {X_n} be a stationary and ergodic real-valued process with E|X_0|^4 < ∞. Then for the estimator Ê_n(X^{n−1}_0) defined in [28] (which is based on the idea of combining simple estimators called 'experts'), (1/n) Σ^n_{i=1} |Ê_i(X^{i−1}_0) − E(X_i | X^{i−1}_0)|^2 → 0 almost surely.

(In fact, Györfi and Ottucsák considered a somewhat more general framework in which side information is also available, cf. [28], but for simplicity we have stated their result in a slightly simpler setting.) However, none of these results were optimal, in the sense that moment conditions higher than those strictly necessary were assumed. In our work [70] we obtained optimal results by managing to prove the integrability of the supremum for the backward estimator, and it is these results that we shall now review briefly. (For the algorithm cf. Morvai, Yakowitz and Györfi [56], Algoet [3] and Morvai and Weiss [70].) Let {X_n} be a real-valued doubly infinite stationary ergodic time series.

Example 3.2.
Assume that X_0 = π and X_1 = 0. The sequences λ_{k−1}, R_{k−1} and τ_k are defined recursively for k = 1, 2, . . . . Put λ_0 = 1 and R_0 = 0. For each k ≥ 1, let τ_k be the time between the occurrence of the pattern X^{−1}_{−λ_{k−1}} at time −1 and the last occurrence of the same pattern prior to time −1. (Cf. Morvai and Weiss [70], Algoet [3] and Morvai et al. [56].)

Example 3.3. Let X^{−1}_{−9} = (X_{−9}, X_{−8}, . . . , X_{−2}, X_{−1}) = 010010010. Note that λ_0 = 1 and R_0 = 0; the τ's, the λ's, the X_{−τ}'s and the R's are then computed directly from the recursion. To obtain a fixed sample size t > 0 version, let κ_t be the maximum of the nonnegative integers k for which λ_k ≤ t, and denote the corresponding truncated estimate by R̂_{−t}.

Algoet [3] managed to prove that R̂_{−t} converges to E(X_0 | X^{−1}_{−∞}) almost surely provided that E|X_0| is finite. For a somewhat weaker result see Györfi et al. [24]. However, neither of them was able to prove the integrability of the supremum of the estimates R̂_{−t} in the case of unbounded random variables. This missing link was proved by Morvai and Weiss [70] under a moment condition only slightly stronger than finiteness of E|X_0|. (What is more, we proved that merely having E|X_0| < ∞ is not enough, cf. [70].) For t > 0 consider the estimator R̂_t given by R̂_t(ω) = R̂_{−t}(T^t ω), which is defined in terms of (X_0, . . . , X_{t−1}) in the same way as R̂_{−t}(ω) was defined in terms of (X_{−t}, . . . , X_{−1}). (T denotes the left shift operator.) The next example shows how the left shift operator T works; we will use these numerical calculations later.
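The recursion of Example 3.3 can be sketched in code. This is our rendering of the pattern-matching scheme (function and variable names are ours), applicable to discrete-valued data where exact repetitions occur: each τ_k is the recurrence time of the current pattern of length λ_{k−1}, the guesses are the values X_{−τ_k}, and the estimate is their running average.

```python
def backward_estimate(past):
    """Our rendering of the recursive pattern-matching scheme: past is
    (X_{-t}, ..., X_{-1}) with the most recent value last.  At stage k
    the current pattern is the last lambda_{k-1} values; tau_k is how
    far back the same pattern occurred previously, the guess is
    X_{-tau_k}, and lambda_k = lambda_{k-1} + tau_k.  The estimate is
    the average of the guesses collected before the data is exhausted."""
    t = len(past)
    lam, taus, guesses = 1, [], []
    while True:
        pattern = past[t - lam:]               # the block X_{-lam}^{-1}
        tau, s = None, 1
        while lam + s <= t:                    # matched block must fit the data
            if past[t - lam - s:t - s] == pattern:
                tau = s
                break
            s += 1
        if tau is None:
            break
        taus.append(tau)
        guesses.append(past[t - tau])          # the value X_{-tau_k}
        lam += tau
    est = sum(guesses) / len(guesses) if guesses else None
    return est, taus

# the data segment of Example 3.3
est, taus = backward_estimate([0, 1, 0, 0, 1, 0, 0, 1, 0])
```

For the data segment 010010010 of Example 3.3, the recursion yields recurrence times τ = (2, 3, 3) and the estimate is the average of the three corresponding guesses.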

Pointwise consistent intermittent estimation schemes
Consider the forward estimation problem for countable alphabet first order Markov chains. Ryabko [89] showed that this problem cannot be solved. (Cf. Györfi, Morvai, and Yakowitz [27] also.) If one insists on this error criterion, then there are two ways of getting around the negative results for forward estimation: intermittent schemes, where the estimates are given only at carefully chosen stopping times, and restricting to processes with special properties. In this section we first review results of this kind for the class of processes where the conditional distribution, as a function of the past, is continuous on a set of full measure. This class is more general than the processes with continuous conditional probabilities, as we shall see in an example which follows the definition.
Denote by R*_− the set of all one-sided sequences of real numbers of the form (. . . , x_{−1}, x_0). A metric d*(·, ·) is defined on pairs of such sequences (. . . , x_{−1}, x_0) and (. . . , y_{−1}, y_0); this is the metric referred to as (3.1) below. We will consider two-sided stationary real-valued processes {X_n}^∞_{n=−∞}. Note that a one-sided stationary time series {X_n}^∞_{n=0} can be extended to a two-sided stationary time series {X_n}^∞_{n=−∞}. Definition 3.1. The conditional expectation E(X_1 | X^0_{−∞}) is almost surely continuous if for some set C ⊆ R*_− which has probability one, the conditional expectation E(X_1 | X^0_{−∞}) restricted to this set C is continuous with respect to the metric d*(·, ·) in (3.1).
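The displayed formula for d* is not reproduced in the extracted text; for illustration only, the sketch below (ours) uses a standard metric of this product type, weighting coordinate differences geometrically in the distance into the past.

```python
def d_star(x, y, depth=60):
    """A product-type metric on one-sided pasts (..., x_{-1}, x_0).
    NOTE: the exact formula (3.1) is not reproduced in the text; this
    geometrically weighted form is one standard choice with the same
    kind of topology.  Here x[0] is x_0, x[1] is x_{-1}, and so on
    (most recent coordinate first)."""
    return sum(2.0 ** (-i) * min(abs(x[i] - y[i]), 1.0)
               for i in range(min(depth, len(x), len(y))))
```

Under such a metric, pasts that agree in many of their most recent coordinates are close; this is the sense in which the examples below probe the continuity of E(X_1 | X^0_{−∞}).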
Consider any stationary and ergodic finitarily Markovian process {X_n} such that the distribution of X_0 concentrates on {0, 1, 2, . . . } and E|X_0| < ∞. Then obviously E(X_1 | X^0_{−∞}) is almost surely continuous. Next, a construction based on an auxiliary chain yields a stationary and ergodic process {M_n}, and from it a binary process {X_n}. The resulting time series {X_n} will not be Markov of any order, but it will be finitarily Markovian. The conditional expectation takes values in the set {0, 9/10}: if X_0 = 1 then it is zero; otherwise its value depends solely on whether, until the first (going backwards) occurrence of a one, one sees an even or an odd number of zeros. The conditional expectation E(X_1 | X^0_{−∞}) is almost surely continuous, but it is not continuous on the whole space, since it cannot be made continuous at X^0_{−∞} = (. . . , 0, 0, 0). In the previous example X_0 was a binary random variable; in the next example X_0 will be uniformly distributed on the unit interval. A transformation S of the unit interval is used; notice that, aside from the exceptional set {0}, which has Lebesgue measure zero, τ is finite and well defined on the closed unit interval. All iterations S^k of S for −∞ < k < ∞ are well defined and invertible, with the exception of the set of dyadic rationals, which has Lebesgue measure zero. Now choose r uniformly on the unit interval, set X_0(r) = r, and put X_n(r) = S^n r. The process {X_n} is a stationary and ergodic first order Markov chain with conditional expectation E(X_1 | X_0 = x) = Sx (one observation determines the whole orbit of the process), cf. [27]. Since S is a continuous mapping apart from the set of dyadic rationals, the resulting conditional expectation is almost surely continuous. However, the conditional expectation is not continuous on the whole unit interval, since it cannot be made continuous at, e.g., 0.5. Example 3.9. Consider the binary periodic Markov chain {M_n} which alternates between the two states.

This yields a stationary and ergodic process with marginal probabilities P(M_0 = 0) = P(M_0 = 1) = 1/2.
Let Z_n be independent identically distributed with uniform distribution on (0, 1), the {Z_n} process being independent of the {M_n} process, and define {X_n} from {M_n} and {Z_n}. Clearly, the {X_n} process is also stationary and ergodic, and its conditional expectation can be written down explicitly. (The event {X_0 = 1} occurs with probability zero and can be excluded.) The conditional expectation in the next example is not almost surely continuous with respect to the metric d*(·, ·) in (3.1). Let {Z_n} be independent and identically distributed, taking values among {2^{−1}, 2^{−2}, . . . } with positive probabilities, again independent of the {M_n} process, and define {X_n} accordingly. Obviously, the {X_n} process is also stationary and ergodic.

Now we argue by contradiction. Assume there exists a set C ⊆ R*_− such that P(X^0_{−∞} ∈ C) = 1, on C the conditional expectation E(X_1 | X^0_{−∞}) is given as above, and the conditional expectation is continuous on C. Since for any k = 1, 2, . . . we have P(X_0 = 2^{−k}) > 0, and since any word formed by the letters {0, 2^{−1}, 2^{−2}, . . . } has positive probability, one can find in C a sequence of pasts converging in d* to a past at which the prescribed values of the conditional expectation fail to converge to the required limit. This is a contradiction. Thus the conditional expectation E(X_1 | X^0_{−∞}) is not almost surely continuous with respect to the metric d*(·, ·) in (3.1).
The conditional expectation in the next example will not be almost surely continuous with respect to the metric d*(·, ·) in (3.1) either. Let {Z_n} be independent and identically distributed with uniform distribution on the interval (1, 2), independent of the {M_n} process, and define {X_n} accordingly. Obviously, the {X_n} process is also stationary and ergodic.

Now we argue by contradiction. Assume there exists a set C ⊆ R*_− such that P(X^0_{−∞} ∈ C) = 1, on C the conditional expectation E(X_1 | X^0_{−∞}) is given as above, and the conditional expectation is continuous on C. For arbitrarily small k > 0 one can find pasts y^{(k)} in C with 1 < y^{(k)}_0 < 1 + k which converge in d* to a past at which the prescribed values of the conditional expectation cannot be continuous. This is a contradiction. Thus the conditional expectation E(X_1 | X^0_{−∞}) is not almost surely continuous with respect to the metric d*(·, ·) in (3.1).
The conditional expectation in the next example is not almost surely continuous with respect to the metric d * (·, ·) in (3.1). This is not immediately evident but a detailed proof can be found in our paper [62].
Let X_n = h(M_n). Since h(·) is one to one, {X_n} is also a stationary and ergodic Markov chain, and the conditional expectation E(X_1 | X^0_{−∞}) is not almost surely continuous with respect to the metric d*(·, ·) in (3.1). However, the conditional expectation in the next example is almost surely continuous with respect to the metric d*(·, ·) in (3.1). This yields a stationary and ergodic real-valued process {M_n} (the distribution of which concentrates on S, and it is a first order Markov chain). The conditional expectation is almost surely continuous with respect to the metric d*(·, ·) in (3.1), even though it is not continuous on the whole space. Now we will review an algorithm which successfully estimates the conditional expectation of the next output (at time n + 1) given the observations up to time n, at carefully selected time instances n, provided the process has almost surely continuous conditional expectations.
Fix a nested sequence of partitions {P_k}^∞_{k=0} of the real line, each P_{k+1} refining P_k.
Let [·]_k denote the quantizer that assigns to any point x the unique interval in P_k that contains x, and let [X^n_m]_k = ([X_m]_k, . . . , [X_n]_k). We define the stopping times {λ_n} along which we will estimate. Set λ_0 = 0, and for n = 1, 2, . . . define λ_n recursively. Note that λ_1 ≥ 1 and it is a stopping time on [X^∞_0]_1; the first estimate m_1 is defined as m_1 = X_1. Note that λ_2 ≥ 2 and it is a stopping time on [X^∞_0]_2, with the second estimate m_2 defined accordingly. In general, λ_n ≥ n, it is a stopping time on [X^∞_0]_n, and the n-th estimate is m_n. This estimator can be viewed as a sampled version of the predictor in Morvai et al. [56], Weiss [104], Algoet [3]. (For the discrete case cf. Morvai [54] and Morvai and Weiss [57].) Notice that the difference between the first and second statement in the theorem above is the quantization in the condition part of the conditional expectation. While the error m_n − E(X_{λ_n+1} | [X^{λ_n}_0]_n) tends to zero almost surely for all real-valued stationary time series with E(|X_0|^2) < ∞, the error m_n − E(X_{λ_n+1} | X^{λ_n}_0) does not. E.g. for the stationary and ergodic Markov chain {X_n} in Example 3.12 the error m_n − E(X_{λ_n+1} | X^{λ_n}_0) does not tend to zero with positive probability, cf. Morvai and Weiss [62]. (Of course, the conditional expectation E(X_1 | X^0_{−∞}) for this counterexample process is not almost surely continuous with respect to the metric d*(·, ·) in (3.1).) It turns out that the problem is caused by the quantization: if one knows in advance that the distribution of X_0 concentrates on a finite or countably infinite subset of the real line, then one may omit the partition P_k and the quantizer [·]_k entirely and so eliminate this problem (cf. Morvai and Weiss [62]). Example 3.14. Let X^6_0 = (X_0, X_1, . . . , X_5, X_6) = 0100101. The λ's, the X_{λ+1}'s and the m's can be computed directly from the definitions. One of the drawbacks of this scheme is that the growth of the stopping times {λ_k} is rather rapid.
where the height of the tower is k − l, l(X_0^∞) is a finite number which depends on X_0^∞, and c = 2^{H−ε}.

Remark 3.1. It is an OPEN PROBLEM whether there is a better sequence of stopping times λ̂_n with less rapid growth, together with an estimator ê_n(X_0, X_1, . . . , X_{λ̂_n}), such that for all stationary and ergodic binary processes

At the end of the present section we will review an intermittent scheme where the stopping times grow less rapidly, but that scheme is not designed to succeed for all discrete valued processes.
From the proofs of Bailey [4], Ryabko [89] and Györfi, Morvai and Yakowitz [27] it is clear that, even for the class of all stationary and ergodic binary time series with almost surely continuous conditional expectation E(X_1 | X_{−∞}^0), one cannot estimate E(X_{n+1} | X_0^n) for all n in a pointwise consistent way. However, if one considers only a very narrow class of processes then one can succeed at all time instances.
Schäfer [100] considered stationary and ergodic Gaussian processes. He constructed an algorithm which can estimate the conditional expectation at every time instance n for an extremely restricted and narrow class of Gaussian processes. Note that if one wants to estimate in time average (Cesàro average), the problem becomes much easier, cf. Györfi and Lugosi [25], Biau et al. [7].
We consider stationary Gaussian (not necessarily ergodic) processes and estimate the conditional mean along a stopping time sequence for a much wider class of processes than in Schäfer [100].
Consider a stationary Gaussian process {X_n} with autocovariance function γ(k) = E(X_{n+k} X_n) and E X_n = m. Define the following subclasses of stationary Gaussian processes. In Φ_1 we have the Gaussian processes which satisfy the condition

∑_{j=0}^∞ |γ(j)| < ∞   (3.2)

and are not Markovian of any order. In Φ_2 we have all Gaussian processes (not necessarily satisfying (3.2)) which are Markov of some order. We are going to deal with processes in Φ = Φ_1 ∪ Φ_2.
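For a concrete autocovariance function, condition (3.2) can at least be probed numerically. The sketch below is a heuristic truncation test, not a proof of summability; the tolerance and cutoff are arbitrary choices made here for illustration.

```python
def summable_autocov(gamma, tol=1e-12, k_max=10**5):
    """Heuristically probe condition (3.2): sum_{j>=0} |gamma(j)| < infinity.

    gamma: function j -> gamma(j).  Declares the series summable once a
    term falls below `tol`; gives up after `k_max` terms.  Illustrative
    only -- slow decay below `tol` would fool it.
    """
    total = 0.0
    for j in range(k_max):
        g = abs(gamma(j))
        total += g
        if g < tol:
            return True, total
    return False, total
```

For the geometrically decaying γ(j) = 2^{−j} (the shape arising, e.g., from an AR(1)-type covariance) the probe reports summability with total ≈ 2, while the slowly decaying γ(j) = 1/(j + 1) is flagged as numerically non-summable.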
Although estimating the conditional mean in the class Φ 2 is much easier, our algorithm will be valid universally for every process in Φ.

(3.2) is satisfied and {X_n} is a real-valued stationary and ergodic Gaussian process in Φ; see Hida and Hitsuda [33].
Schäfer [100] investigated the restricted model class considered in the following example. For general Gaussian processes it is hard to check condition (3.3). Two special, extremely narrow classes of Gaussian processes for which this condition is satisfied were given in Schäfer [100].
At the beginning of this section we suggested an algorithm and a sequence of stopping times along which the error tends to zero almost surely, under the condition that the conditional expectation E(X_1 | . . . , X_{−1}, X_0) is almost surely continuous. Unfortunately, in the Gaussian case the conditional expectation E(X_1 | . . . , X_{−1}, X_0) is not almost surely continuous in general, and so this result is not applicable to general Gaussian processes, cf. Molnár-Sáska and Morvai [52]. We note that for Gauss–Markov processes the conditional expectation E(X_1 | X_{−∞}^0) is continuous. Now we consider an extension of the algorithm discussed at the beginning of this section. Consider the special nested sequence of partitions P_k of the real line defined as

P_k = { [i 2^{−(k+1)^3}, (i + 1) 2^{−(k+1)^3}) : i = 0, 1, −1, . . . }.
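Under this reconstruction of P_k, the quantizer [·]_k amounts to scaling by the mesh 2^{(k+1)^3} and taking a floor, with the integer index identifying the interval. A small sketch (the mesh exponent (k+1)^3 is taken from the reconstructed definition above and should be checked against [52]):

```python
import math

def quantize(x, k):
    """Index of the unique interval of P_k that contains x, where P_k is
    the dyadic partition of the real line with mesh 2^{-(k+1)**3}.
    The partitions are nested: each level-(k+1) cell lies inside exactly
    one level-k cell, since the finer mesh divides the coarser one.
    """
    return math.floor(x * 2 ** ((k + 1) ** 3))
```

For instance, at level k = 0 the mesh is 1/2, so 0.3 and 0.7 fall into intervals 0 and 1 respectively, while at level k = 1 the mesh is 2^{−8} = 1/256 and the cells are far finer.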
The choice of P_k in this particular form is for technical reasons; see [52]. Consider the same sequence of stopping times λ and estimates m as before, using this sequence of partitions. This estimator is also consistent for (not necessarily Gaussian) stationary processes with almost surely continuous conditional expectations. For more on estimation for Gaussian processes see Györfi and Lugosi [25] and Biau et al. [7]. Note that it is still unknown whether one can estimate the conditional expectation for all n for all stationary and ergodic Gaussian processes. (Cf. Györfi, Morvai and Yakowitz [27] and Györfi and Sancetta [30].)

Now we will consider stationary real-valued (not necessarily Gaussian) processes {X_n}. We will review a sequence of stopping times which grows more slowly than the previous ones.
Let {P_k}_{k=0}^∞ denote a nested sequence of finite or countably infinite partitions of the real line by intervals. Let x → [x]_k denote a quantizer that assigns to any point x the unique interval in P_k that contains x. For a set C of real numbers let diam(C) = sup_{y,z∈C} |z − y|. We assume that

Define the stopping times as follows. Set ζ_0 = 0. For k = 1, 2, . . ., define the sequences η_k and ζ_k recursively. At each step we refine the quantization and slowly increase the block length of the next repetition, as follows: let

One denotes the estimate of E(X_{ζ_1+1} | X_0^{ζ_1}) by g_1, and defines it to be

Let η_2 = min{ t > 0 : [X_{ζ_1−(l_2−1)+t}^{ζ_1+t}]_2 = [X_{ζ_1−(l_2−1)}^{ζ_1}]_2 } and ζ_2 = ζ_1 + η_2.
One denotes the estimate of E(X_{ζ_2+1} | X_0^{ζ_2}) by g_2, and defines it to be

In general, let

One denotes the kth estimate of E(X_{ζ_k+1} | X_0^{ζ_k}) by g_k, and defines it to be

Example 3.17. Let [·]_k be the quantizer and let l_k = k. Let (X_0, X_1, . . . , X_5, X_6) = 0100101.
The ζ's and η's are: The X_{ζ+1}'s are: The g's are:

The next theorem states the strong (pointwise) consistency of the estimator. The consistency holds regardless of how the sequence l_k and the partitions are chosen, as long as l_k tends to infinity and the partitions become finer. However, the choice of these sequences has a great influence on the growth of the stopping times.
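The recursion for the stopping times (spelled out for η_2 above) can be traced in code. The sketch below computes the ζ's for a discrete sample with the identity quantizer (adequate for binary data) and l_k = k as in Example 3.17; the estimates g_k are omitted, since their displayed definition is not reproduced in the text.

```python
def intermittent_stopping_times(x, block_len=lambda k: k):
    """Stopping times of the intermittent scheme on a discrete sample x.

    With z = zeta_{k-1} and l = block_len(k):
        eta_k  = min{ t > 0 : x[z-(l-1)+t : z+1+t] == x[z-(l-1) : z+1] },
        zeta_k = zeta_{k-1} + eta_k,
    i.e. we wait until the length-l block ending at zeta_{k-1} recurs.
    Quantization is the identity here (binary data).  Returns the
    zeta_k's that can be computed before the sample is exhausted.
    """
    zetas = [0]                      # zeta_0 = 0
    k = 1
    while True:
        l = block_len(k)
        z = zetas[-1]
        if z - (l - 1) < 0:          # not enough history for a length-l block
            break
        target = x[z - (l - 1): z + 1]
        eta, t = None, 1
        while z + t < len(x):        # recurring window must end in the sample
            if x[z - (l - 1) + t: z + 1 + t] == target:
                eta = t
                break
            t += 1
        if eta is None:
            break
        zetas.append(z + eta)
        k += 1
    return zetas[1:]
```

On the sample 0100101 of Example 3.17 this recursion yields ζ_1 = 2 and ζ_2 = 5 under these assumptions: the single symbol X_0 = 0 recurs after 2 steps, and the block X_1 X_2 = 10 ending at ζ_1 recurs 3 steps later.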
From the proofs of [4], [89] and [27] it is clear that, even for the class of all stationary and ergodic binary time series with almost surely continuous conditional expectation E(X_1 | . . . , X_{−1}, X_0), one cannot estimate E(X_{n+1} | X_0^n) for all n strongly (pointwise) consistently.
The stationary processes with almost surely continuous conditional expectation generalize the processes for which the conditional expectation is actually continuous. (Cf. [36] or [40].) If one uses finite partitions then it is possible to give an upper bound on the growth of the stopping times {ζ_k}. Let P_k be a nested sequence of finite partitions of the real line by intervals. If for some ε > 0

∑_{k=1}^∞ (k + 1) 2^{−εl_k} < ∞,

then for the stopping time ζ_k we have

ζ_k < |P_k|^{l_k} 2^{εl_k} eventually almost surely

(cf. Morvai and Weiss [58], Algoet [3] and Morvai et al. [55]).

Example 3.18. Consider ε = 1, l_k = ⌈4 log_2(k + 1)⌉, and |P_k| = k + 1. Then the resulting bound on ζ_k grows a little faster than polynomially in k.
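Taking the bound at face value with ε = 1 (the placement of ε above is itself a reconstruction to be checked against [58]), the growth in Example 3.18 can be inspected on a log scale: the base-2 logarithm of |P_k|^{l_k} · 2^{l_k} behaves like 4 log_2^2(k + 1), which is superpolynomial but far below exponential.

```python
import math

def log2_bound(k):
    """log2 of the eventual upper bound |P_k|^{l_k} * 2^{l_k} in
    Example 3.18, with l_k = ceil(4*log2(k+1)), |P_k| = k+1, epsilon = 1.
    (The role of epsilon in the bound is an assumption; see text.)
    """
    l_k = math.ceil(4 * math.log2(k + 1))
    return l_k * math.log2(k + 1) + l_k
```

Since log2_bound(k) / log2(k) grows without bound, the bound exceeds any fixed-degree polynomial in k, matching the remark that the growth is "a little faster than polynomial".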
For finite alphabet processes one can achieve a slightly better upper bound. Indeed, let H denote the entropy rate associated with the stationary and ergodic finite alphabet time series {X_n}. Note that in this case no quantization is needed. Then ζ_k < 2^{l_k(H+ε)} eventually almost surely, provided that (k + 1) 2^{−εl_k} is summable. (Cf. [57], [85], [55].)