Exponential forgetting of smoothing distributions for pairwise Markov models

We consider a bivariate Markov chain $Z=\{Z_k\}_{k \geq 1}=\{(X_k,Y_k)\}_{k \geq 1}$ taking values in the product space ${\cal Z}={\cal X} \times{\cal Y}$, where ${\cal X}$ is a possibly uncountable space and ${\cal Y}=\{1,\ldots, |{\cal Y}|\}$ is a finite state space. The purpose of the paper is to find sufficient conditions that guarantee the exponential convergence of smoothing, filtering and predictive probabilities: $$\sup_{n\geq t}\|P(Y_{t:\infty}\in \cdot|X_{l:n})-P(Y_{t:\infty}\in \cdot|X_{s:n}) \|_{\rm TV} \leq K_s \alpha^{t}, \quad \mbox{a.s.}$$ Here $t\geq s\geq l\geq 1$, $K_s$ is a $\sigma(X_{s:\infty})$-measurable finite random variable and $\alpha\in (0,1)$ is fixed. In the second part of the paper, we establish two-sided versions of the above convergence. We show that the desired convergences hold under fairly general conditions. A special case of this very general model is the popular hidden Markov model (HMM). We prove that in the HMM case our assumptions are more general than similar mixing-type conditions encountered in the literature, yet relatively easy to verify.


Introduction
We consider a bivariate Markov chain Z = {Z k } k≥1 = {(X k , Y k )} k≥1 defined on a probability space (Ω, F , P) and taking values in the product space Z = X × Y, where X is a possibly uncountable space and Y = {1, . . . , |Y|} is a finite set, typically referred to as the state space. The process X = {X k } k≥1 is the observed sequence, while Y = {Y k } k≥1 is the hidden or latent variable sequence, often referred to as the signal process. The process Z is sometimes called a pairwise Markov model (PMM) [33,5,6,14] and covers many latent variable models used in practice, such as hidden Markov models (HMMs) and autoregressive regime-switching models. For a classification and general properties of pairwise models, we refer to [33,6,14]. In general, neither Y nor X is a Markov chain, although in special cases they might be. In many practical models, such as the above-mentioned HMMs and Markov switching models, the signal process Y remains a Markov chain. Moreover, for every PMM, conditionally on a realization of X (resp. Y ), the process Y (resp. X) is always an inhomogeneous Markov chain. The fact that we consider finite Y might seem restrictive at first sight. The study of such models is mainly motivated by the fact that in most applications of PMMs, especially of HMMs, the state space is finite, often rather small, so this case clearly needs special treatment. Strictly speaking, the term "hidden Markov model" refers to the case of discrete Y; models with an uncountable state space Y are often called "state-space models" (see e.g. [1]). The difference lies not only in the level of mathematical abstraction, but also in different research objectives, techniques and algorithms: finite Y allows the effective use of many classical HMM tools such as the Viterbi, forward-backward and Baum-Welch algorithms, and under finite Y all these tools remain applicable in the PMM case.
Thus the model considered in the present article can be regarded as a generalization of the standard HMM, where the state space is still finite, but the structure of the model is more involved, allowing stronger dependence between the observations. It turns out that with finite Y many abstract conditions simplify so that they are easy to apply in practice, and many general conditions can be weakened. Finite Y also allows us to employ a different technique. The observation space X , however, is very general, as it usually is in practice.
In the current paper, the main object of interest is the conditional signal process, i.e. the process Y conditioned on X. More specifically, the purpose of the present work is to study the distributions P (Y t:t+m−1 ∈ ·|X s:n ), where m ≥ 1, ∞ ≥ n ≥ t ≥ s, and where we adopt the notation a l:n for any vector (a l , . . . , a n ) with n ≤ ∞. For m = 1, the probabilities P (Y t ∈ ·|X 1:n ) are traditionally called smoothing probabilities when t < n, filtering probabilities when t = n, and predictive probabilities when t > n. In our paper, we deal with the probabilities P (Y t:t+m−1 ∈ ·|X s:n ), where m ≥ 1 and t ≤ n, and we call all these distributions (m-block) smoothing distributions even if t = n or t + m > n. Our first main result (Theorem 3.1 below) states that when Z is a positive Harris chain, then under some additional conditions, stated as A1 and A2, the following holds: there exists a constant α ∈ (0, 1) such that for every t ≥ s ≥ l ≥ 1, sup n≥t sup m≥1 P (Y t:t+m−1 ∈ ·|X l:n ) − P (Y t:t+m−1 ∈ ·|X s:n ) TV ≤ C s α t−s = K s α t , a.s., (1) where C s is a σ(X s:∞ )-measurable finite random variable, K s def = C s α −s and, for any signed measure ξ on Y, ξ TV def = Σ i∈Y |ξ(i)| denotes the total variation norm of ξ.
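For finite Y the total variation norm in (1) is just a finite sum of absolute values. The following minimal sketch illustrates the quantity being bounded; the two distributions are hypothetical stand-ins for smoothing distributions on a three-point Y.

```python
import numpy as np

def tv_norm(xi):
    """Total variation norm ||xi||_TV = sum_i |xi(i)| of a signed
    measure on the finite set Y, represented as a vector."""
    return float(np.abs(np.asarray(xi, dtype=float)).sum())

# Two hypothetical smoothing distributions on Y = {1, 2, 3}
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])

dist = tv_norm(p - q)  # approximately 0.2
```

The bound (1) asserts that this distance, computed for the two smoothing distributions with different conditioning windows, decays like α^t along almost every observation path.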
Here and in what follows, when not stated otherwise, a.s. statements are with respect to the measure P. In this case, the distribution of Z 1 is not specified. Sometimes we would like to specify it, say Z 1 ∼ π, and then we write P π -a.s. instead. In words, (1) states that the total variation distance between two smoothing distributions decreases exponentially in t. In Subsection 3.3, we shall see that a martingale convergence argument allows us to deduce from (1) the following bound (the inequality (27) below): P (Y t:∞ ∈ ·|X l:∞ ) − P (Y t:∞ ∈ ·|X s:∞ ) TV ≤ K s α t , a.s.
We also argue that the same approach (and the same assumptions) yields the inequality P π (Y t:∞ ∈ ·|X s:n ) − P π̃ (Y t:∞ ∈ ·|X s:n ) TV ≤ K s α t , P π -a.s., (2) where π and π̃ are two initial distributions of Z 1 , π̃ is absolutely continuous with respect to π, denoted by π ≻ π̃, P π and P π̃ are the distributions of Z under π and π̃ and, as previously, ∞ ≥ n ≥ t ≥ s ≥ 1. Since K s is P π -a.s. finite, the inequality (2) implies that for P π -a.e. realization of X, the difference P π (Y t:∞ ∈ ·|X s:∞ ) − P π̃ (Y t:∞ ∈ ·|X s:∞ ) TV tends to zero exponentially fast in t. The convergence to zero is sometimes referred to as the weak ergodicity of a Markov chain in a random environment [40]; we thus prove that the weak ergodicity is actually geometric. Although (2) implies (1), in the present paper we concentrate on inequalities of type (1), because they allow us to obtain two-sided generalizations. For two-sided versions of these inequalities, consider the two-sided stationary Markov chain {Z k } k∈Z . In Subsection 3.3, we shall see that lim l,s→∞ P (Y t:t+m−1 ∈ ·|X t−l:t+s ) = P (Y t:t+m−1 ∈ ·|X −∞:∞ ), a.s.
We strengthen this result by proving that under general conditions the following holds (Corollary 3.2): there exist α ∈ (0, 1) and a stationary process {C k } k∈Z , C k < ∞, such that for all t ∈ Z, m ≥ 1, and l, s ≥ 0, P (Y t:t+m−1 ∈ ·|X t−l:t+s ) − P (Y t:t+m−1 ∈ ·|X −∞:∞ ) TV ≤ C t α l∧s , a.s., (3) where ∧ denotes the minimum. The random variable C t is σ(X −∞:t , X t+m−1:∞ )-measurable and the process {C k } k∈Z is ergodic when {Z k } k∈Z is ergodic. Another result of this type (Corollary 3.3 below) states that under the same assumptions (we take m = 1, for simplicity) P (Y t ∈ ·|X 1:n ) − P (Y t ∈ ·|X −∞:∞ ) TV ≤ C 1 α t−1 + C̃ n α n−t , a.s., (4) where C 1 is σ(X 1:∞ )-measurable, C̃ n is σ(X −∞:n )-measurable and the process {C̃ k } k∈Z is ergodic when {Z k } k∈Z is. Although the constants C and C̃ in all the above-stated inequalities are random, the bounds can nevertheless be useful in various situations when pathwise limits are of interest. For example, the inequality (2) yields pathwise exponential bounds (for similar types of bounds see also [13,8,25]), and the inequality (3) is very useful when one needs to approximate the smoothing probability P (Y t:t+m−1 ∈ ·|X t−l:t+s ) by something independent of l and s. We briefly discuss the motivation for inequalities of type (3) and (4) from the point of view of segmentation theory below. The assumptions A1 and A2 are stated, discussed and interpreted in Subsection 3.1.
Relation with the previous work. The most popular type of PMMs are HMMs, where the underlying process Y is a Markov chain and, given Y n = i, the observation X n is generated according to a probability distribution attached to the state i, independently of everything else. Therefore, the vast majority of the study of smoothing and filtering probabilities has been done for HMMs, where these issues have a relatively long history dating back to the 1960s, when the well-known forward-backward recursions for calculating these probabilities (for HMMs) were developed. The forgetting properties typically refer to the convergence (as t → ∞, n ≥ t) P π (Y t ∈ ·|X 1:n ) − P π̃ (Y t ∈ ·|X 1:n ) TV → 0, (5) and the inequalities of type (2) are often referred to as exponential smoothing. For n = t, the convergence (5) is called filter stability, and it is probably the most studied convergence in the literature. For an overview of several forgetting properties and mixing-type conditions ensuring forgetting (in the HMM case), we refer to [1, Ch. 3, 4]. Some of these conditions are also restated in Subsection 4.3. The list of research articles dealing with various aspects of forgetting and filtering problems in the HMM setting is long, including [2,3,7,8,9,25,12,13,21,20,27], to mention just a few of the more prominent articles. The majority of the above-mentioned papers deal with (exponential) forgetting of filters and filter stability of state-space models, i.e. they consider the case where the state space Y of the Markov chain is very general, possibly uncountable. In these papers, various mixing conditions for filter stability and forgetting properties are stated. These conditions are often appropriate and justified for general Y, but when applied to the case of finite Y, they might become too restrictive or limited. Hence the case of finite Y needs special treatment, and so it is also quite expected that in the case of finite Y our main assumption A1, designed for discrete Y, is more general than the ones made in all the above-mentioned papers.
For many models mentioned in the literature, A1 is easy to verify, but we also provide a more practical condition, the cluster condition, which is more general than many similar assumptions encountered in the HMM literature, yet very easy to check. Since HMMs are such an important class of models, Subsection 4.3 is fully devoted to the HMM case. Besides presenting the results, Subsection 4.3 also aims to give a state-of-the-art overview of mixing-type conditions for finite-state HMMs.
Recently, a significant contribution to the study of smoothing probabilities (with continuous state space) was made by van Handel and his colleagues [38,41,40,39,4,42,36,37,34]. Again, most of these papers deal with HMMs, but in [36,37] more general PMMs are also considered. In particular, they consider a special class of PMMs, called non-degenerate PMMs. The crucial feature of non-degenerate PMMs is that by a change of measure the dynamics of the X- and Y-processes can be made independent (see Subsection 4.2 for the precise definition). While natural in the continuous-space setting, for finite Y this assumption might be restrictive, and in Subsection 4.2 we show that the assumptions in [36] imply A1 and A2. For HMMs, non-degeneracy simply means strictly positive emission densities, and in Subsection 4.3 we show several ways to relax it.
The present work generalizes and builds on the approach in [27,26], where solely the HMM case was considered. In many ways, an HMM is a technically much simpler model to handle, hence the generalization from HMMs to PMMs is far from straightforward. Moreover, our second main result, Theorem 3.2, cannot be found in the earlier papers even in the HMM case. Also, for the HMM case, the cluster condition introduced in the present paper is significantly weaker than the one in [27,26]. The proofs of our main results rely on a Markovian block-decomposition of the conditional hidden chain; A1 is used to bound from above the Dobrushin coefficients of certain block-transitions.
Applications in segmentation. The motivation for studying the inequalities (1) and the two-sided inequalities (3) and (4) (instead of just the filtering ones) comes from the so-called segmentation problem, which aims to predict or estimate the hidden underlying path y 1:n given a realization x 1:n of the observed process X 1:n . The goodness of any path s 1:n ∈ Y n is typically measured via a loss function L : Y n × Y n → [0, ∞), where L(y 1:n , s 1:n ) measures the loss when the actual state sequence is y 1:n and the estimated sequence is s 1:n . The best path is then the one that minimizes the expected loss E[L(Y 1:n , s 1:n )|X 1:n = x 1:n ] = Σ y 1:n ∈Y n L(y 1:n , s 1:n )P (Y 1:n = y 1:n |X 1:n = x 1:n )
over all possible state sequences s 1:n . A common loss function measures the similarity of the sequences entry-wise, i.e. L(y 1:n , s 1:n ) = Σ n t=1 I yt ≠st , where I yt ≠st = 0 if and only if y t = s t , and I yt ≠st = 1 otherwise. Thus L(y 1:n , s 1:n ) counts the classification errors of the path s 1:n , and the expected number of classification errors is Σ n t=1 P (Y t ≠ s t |X 1:n = x 1:n ). Now it is clear that the path ŷ 1:n that minimizes the expected loss is also the path that minimizes the expected number of classification errors, and it can be obtained by pointwise maximization of the smoothing probabilities, i.e. ŷ t = arg max i∈Y P (Y t = i|X 1:n = x 1:n ), t = 1, . . . , n. Any such path is called a pointwise maximum a posteriori (PMAP) path (see, e.g., [22,23,24]). The PMAP path is easy to calculate via forward-backward algorithms, which hold for PMMs as well as for HMMs.
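For the HMM special case of a PMM, the smoothing probabilities and the PMAP path above can be computed with the classical forward-backward recursions. The following is a minimal sketch, not the paper's construction; the transition matrix A, emission matrix B and initial distribution pi are hypothetical placeholders.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Smoothing probabilities P(Y_t = i | X_{1:n}) for a finite-state HMM.

    A[i, j] = P(Y_{k+1}=j | Y_k=i), B[i, x] = P(X_k=x | Y_k=i),
    pi[i] = P(Y_1=i); obs is a sequence of observation indices."""
    n, K = len(obs), len(pi)
    alpha = np.zeros((n, K))  # scaled forward variables
    beta = np.ones((n, K))    # scaled backward variables
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()          # rescale to avoid underflow
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                     # proportional to P(Y_t, X_{1:n})
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

def pmap_path(gamma):
    """PMAP path: maximize each smoothing marginal pointwise."""
    return gamma.argmax(axis=1)
```

For a general PMM the same two-pass scheme applies once the one-step factors B[j, obs[t]] * A[i, j] are replaced by the kernel density q(x_t, j | x_{t−1}, i), since conditionally on X the process Y is an inhomogeneous Markov chain.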
When n varies, it is convenient to divide the expected loss by n, and so we study the time-averaged expected number of classification errors of the PMAP path. This number can be considered as the (best possible) expected number of classification errors per time entry. It turns out that when Z is an ergodic process satisfying our general assumptions A1 and A2, then there exists a constant R ≥ 0 such that the time-averaged expected number of classification errors converges to R. The number R depends solely on the model and characterizes its segmentation capacity: the smaller R, the easier the segmentation. The proof of this convergence in the HMM case is given in [26,23], but it holds without changes in the more general PMM case as well. The proof relies largely on the inequality (4), thus providing an example of the use of this kind of inequality. For a discussion of the importance of the existence of the limit R, as well as other applications of inequalities of type (4) and (1) in the segmentation context, see [23,24]. These papers deal with HMMs only, but with the exponential forgetting results of the present paper, the generalization to the PMM case is possible.
It is interesting to notice that our main condition A1 is relevant not only to smoothing distributions. This condition has been used, albeit in a slightly more restricted form, in the development of the Viterbi process theory [31,29,28]. This suggests that the condition A1 is essential from many different aspects and captures the mixing properties well. When the observation space X is finite, A1 essentially becomes what is known in ergodic theory as the subpositivity of some observation string.

Preliminaries
The model and some basic notation. We now state the precise theoretical framework of the paper. We assume that the observation space X is a Polish (separable completely metrizable) space equipped with its Borel σ-field B(X ). We denote Z = X × Y, and equip Z with the product topology τ × 2 Y , where τ denotes the topology of X . Furthermore, Z is equipped with its Borel σ-field B(Z) = B(X ) ⊗ 2 Y , which is the smallest σ-field containing the sets of the form A × B, where A ∈ B(X ) and B ∈ 2 Y . Let µ be a σ-finite measure on B(X ) and let c be the counting measure on 2 Y . Finally, let q : Z × Z → [0, ∞), (z, z ′ ) → q(z ′ |z), be a measurable non-negative function such that for each z ∈ Z the function z ′ → q(z ′ |z) is a probability density function with respect to the product measure µ × c. We define the random process Z = {Z k } k≥1 as a homogeneous Markov chain on the two-dimensional space Z having the transition kernel density q(z ′ |z). This means that the transition kernel of Z is defined as follows: P (Z k+1 ∈ C|Z k = z) = ∫ C q(z ′ |z) (µ × c)(dz ′ ), C ∈ B(Z). (6) Since every C ∈ B(Z) is of the form C = ∪ j∈Y A j × {j}, where A j ∈ B(X ), the probability (6) reads P (Z k+1 ∈ C|Z k = z) = Σ j∈Y ∫ A j q(x, j|z) µ(dx). We also assume that Z 1 has a density with respect to the product measure µ × c. Then, for every n, the random vector Z 1:n has a density with respect to the measure (µ × c) n . In what follows, with a slight abuse of notation, the letter p will be used to denote the various joint and conditional densities. Thus p(z k ) = p(x k , y k ) is the density of Z k evaluated at z k = (x k , y k ), p(z 1:n ) = p(z 1 ) Π n k=2 q(z k |z k−1 ) is the density of Z 1:n evaluated at z 1:n , p(z 2:n |z 1 ) = Π n k=2 q(z k |z k−1 ) stands for the conditional density, and so on. Sometimes it is convenient to use other symbols besides x k , y k , z k as the arguments of a density; in that case we indicate the corresponding probability law using the equality sign, for example p(x 2:n , y 2:n |x 1 = x, y 1 = i) = q(x 2 , y 2 |x, i) Π n k=3 q(x k , y k |x k−1 , y k−1 ), n ≥ 3.
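To make the setup concrete, the following sketch simulates a toy PMM with Y = {0, 1} and X = R. The kernel below is hypothetical and chosen only so that the next pair (X k+1 , Y k+1 ) depends on both current coordinates, which is exactly the situation where neither marginal process is Markov on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, y):
    # Hypothetical kernel density q(z'|z): the next hidden state depends on
    # the current pair (x, y), and the next observation depends on the new
    # state and on the current observation x.
    p_stay = 0.9 if abs(x) < 1 else 0.6      # P(Y_{k+1} = y | X_k = x, Y_k = y)
    y_new = y if rng.random() < p_stay else 1 - y
    x_new = 0.5 * x + (1.0 if y_new == 1 else -1.0) + rng.normal()
    return x_new, y_new

def simulate(n):
    """Return a length-n trajectory (X_1:n, Y_1:n) of the toy PMM."""
    x, y = 0.0, 0
    xs, ys = [], []
    for _ in range(n):
        x, y = step(x, y)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

Setting p_stay independent of x would reduce this sketch to an autoregressive regime-switching model in which Y alone is Markov; the dependence on x is what makes the example a genuinely pairwise model.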
The notation P z (·) will represent the probability measure, when the initial distribution of Z is the Dirac measure on z ∈ Z (i.e. P z (A) = P (A|Z 1 = z)). For a probability measure ν on B(Z), P ν (·) denotes the probability measure, when the initial distribution of Z is ν (i.e. P ν (A) = P z (A) ν(dz)).
The marginal processes {X k } k≥1 and {Y k } k≥1 will be denoted by X and Y , respectively. It should be noted that even though Z is a Markov chain, this does not imply that either of the marginal processes X and Y is a Markov chain. However, it is not difficult to show that, conditionally on X 1:n , Y 1:n is a (generally non-homogeneous) Markov chain, and vice versa.
For any set A consisting of vectors of length r > 1 we adopt the following notation: A (k) def = {x k | x 1:r ∈ A} denotes the projection of A onto its k-th coordinate, and A(x 1 ) def = {x 2:r | x 1:r ∈ A}.
General state space Markov chains. We now recall some necessary concepts from general state space Markov chain theory. The Markov chain Z is called ϕ-irreducible for a σ-finite measure ϕ on B(Z) if ϕ(A) > 0 implies that for every z ∈ Z there exists n ≥ 1 such that P z (Z n ∈ A) > 0. Every ϕ-irreducible chain admits [32, Th. 4.0.1] a maximal irreducibility measure ψ in the sense that for any other irreducibility measure ϕ ′ the measure ψ dominates ϕ ′ , ψ ≻ ϕ ′ . The symbol ψ will be reserved to denote the maximal irreducibility measure of Z. The chain Z is called Harris recurrent when it is ψ-irreducible and ψ(A) > 0 implies P z (Z k ∈ A i.o.) = 1 for all z ∈ Z. The chain Z is called positive if its transition kernel admits an invariant probability measure. Any ψ-irreducible chain admits a cyclic decomposition [32, Th. 5.4.4]: there exist disjoint sets D 0 , . . . , D d−1 through which the chain moves cyclically and whose union has full ψ-measure, where d is the period of the chain.
Overlapping r-block process. For every r > 1, define Z k def = Z k:k+r−1 , k ≥ 1. For every set A ⊂ Z r and z 1 ∈ Z, we denote A(z 1 ) def = {z 2:r | z 1:r ∈ A}. Thus Z = {Z k } is a Markov process with state space Z r and transition kernel P (Z 2 ∈ A|Z 1 = z 1:r ) = P (Z 2:r+1 ∈ A|Z 1:r = z 1:r ) = P (Z r+1 ∈ A(z 2:r )|Z 1 = z 1 ). The following proposition (proof in the Appendix) specifies the maximal irreducibility measure of Z. Proposition 2.1 If Z is positive Harris with stationary probability measure π and maximal irreducibility measure ψ, then Z is a positive Harris chain with maximal irreducibility measure ψ r .
Exponential forgetting

The main assumptions
We shall now introduce the basic assumptions of our theory for the non-stationary case. For every n ≥ 2 and i, j ∈ Y we denote p ij (x 1:n ) def = y2:n : yn=j p(x 2:n , y 2:n |x 1 , y 1 = i) = p(x 2:n , y n = j|x 1 , y 1 = i).
We shall now introduce the basic assumptions of our theory for the non-stationary case. For every n ≥ 2 and i, j ∈ Y we denote p ij (x 1:n ) def = Σ y 2:n : y n =j p(x 2:n , y 2:n |x 1 , y 1 = i) = p(x 2:n , y n = j|x 1 , y 1 = i). For any n ≥ 2, define Y + (x 1:n ) def = {(i, j) ∈ Y × Y | p ij (x 1:n ) > 0}. Recall the definition of A (k) and A(x 1 ). Thus Y + (1) (x 1:n ) and Y + (2) (x 1:n ) stand for the sets of first and second coordinates, respectively, of the pairs in Y + (x 1:n ). Observe that it is not generally the case that Y + (x 1:n ) = Y + (1) (x 1:n ) × Y + (2) (x 1:n ). The following are the main assumptions.
A1 There exists an integer r > 1 and a set E ⊂ X r such that Y + def = Y + (x 1:r ) is non-empty, is the same for every x 1:r ∈ E, and has the product form Y + = Y + (1) × Y + (2) . The condition A1 is the central assumption of our theory. The intuitive meaning of A1 is fairly simple, because it can be considered as the "irreducibility and aperiodicity" of the conditional signal process, as follows. Suppose we have an inhomogeneous Markov chain Y = {Y t } t≥1 , with Y t being the finite state space of Y t at time t. The canonical concepts of irreducibility and aperiodicity are not defined for such a Markov chain, but a natural generalization would be the following: for every time t, there exists a time n > t such that P (Y n = j|Y t = i) > 0 for every i ∈ Y t and j ∈ Y n . If Y is homogeneous, then this property implies that Y is irreducible and aperiodic, hence also geometrically ergodic. When we fix n > t, this property concerns the set of pairs (i, j) with P (Y n = j|Y t = i) > 0. The assumption A1 generalizes that idea to the conditional signal process. Indeed, A1 states that for every x 1:r ∈ E and for every fixed t ≥ 1, the set of pairs (i, j) for which the transition from Y t = i to Y t+r−1 = j is possible given the observations x 1:r equals Y + = Y + (1) × Y + (2) . Observe that since Z is homogeneous, the set Y + (x 1:r ) (and therefore also the sets Y + (1) and Y + (2) ) is independent of t, and A1 also ensures that it is independent of x 1:r , provided x 1:r ∈ E. If now x 1:∞ is a realization of X 1:∞ such that x 1:r ∈ E and we define Y + (x 1:∞ ) as previously, just with x 1:r replaced by x 1:∞ , then the analogous statements hold. This observation makes A1 comparable with the definition of irreducibility of the conditional signal process defined by van Handel in [40]. Van Handel's definition, when adapted to our case of finite Y, states that for every t and for a.e. realization x t:∞ of X t:∞ , there exists n > t such that the measures P (Y n ∈ ·|Y t = i 1 , X t:∞ = x t:∞ ) and P (Y n ∈ ·|Y t = i 2 , X t:∞ = x t:∞ ) are not mutually singular, provided P (Y t = i k |X t:∞ = x t:∞ ) > 0 for k = 1, 2 (for the non-Markov case this condition is generalized in [37]).
This condition is weaker than A1, and it has to be, because by Theorem 2.3 in [40], for stationary X, the above-defined irreducibility condition is necessary and sufficient for the convergence P π (Y t ∈ ·|X 1:∞ ) − P π̃ (Y t ∈ ·|X 1:∞ ) TV → 0, P π -a.s. and P π̃ -a.s., where π ≻ π̃ and π corresponds to the stationary measure (weak ergodicity). On the other hand, the condition A1 is typically met and relatively easy to verify. Moreover, as already mentioned, our main result, Theorem 3.1, states that for positive Harris Z, A1 and A2 ensure not only weak ergodicity but also an exponential rate of convergence. So one possibility to generalize A1 to uncountable Y would be to replace "not mutually singular" in the definition of regularity of the conditional signal process in [40] by "having the same support".
A2 ψ(E (1) × Y + (1) ) > 0, where ψ is the maximal irreducibility measure of Z. The condition A2 ensures that X returns to the set E with appropriate regularity under certain stability conditions on Z. The conditions A1-A2 will be discussed in more detail for specific models in Section 4.
We note that under A1 and A2 we may without loss of generality assume that for some n 0 ≥ 1 the bound (9) holds. Indeed, let, for n ≥ 1, E n be defined accordingly. We would like to replace E by E n 0 . Clearly A1 holds for E n 0 as well, and we also have a positive integral. Unfortunately, the positive integral alone does not imply the required bound (otherwise the integral would be zero). Thus E ′ satisfies both A1 and A2, and so with no loss of generality we may and shall assume that (9) holds.

Bounding the Dobrushin coefficient
The Dobrushin coefficient δ(M ) of a stochastic matrix M (i, j) is defined as the maximal total variation difference over all row pairs of M divided by 2, i.e. δ(M ) def = (1/2) max i,i ′ M (i, ·) − M (i ′ , ·) TV , where · TV stands for the total variation norm. As is well known, for any two probability row vectors ξ, ξ ′ of length n and for any n × n stochastic matrix M , ξM − ξ ′ M TV ≤ δ(M ) ξ − ξ ′ TV . The Dobrushin coefficient is sub-multiplicative: for any two n × n stochastic matrices M and M ′ , δ(M M ′ ) ≤ δ(M )δ(M ′ ). A stochastic matrix M is said to satisfy the Doeblin condition if there exist a probability row vector ξ and ǫ > 0 such that each row of M is uniformly greater than ǫξ, i.e. M (i, j) ≥ ǫξ(j) for all i, j.
If M satisfies this condition, then δ(M ) ≤ 1 − ǫ. In what follows we prove our own version of the Doeblin condition. We shall consider observation sequences x 1:n , where x 1:r ∈ E and n ≥ r. For those sequences define a probability distribution λ[x r:n ] on Y as follows: λ[x r:n ](j) def = I Y + (2) (j) p(x r+1:n |x r , y r = j)/c(x r:n ), where c(x r:n ) def = Σ j∈Y + (2) p(x r+1:n |x r , y r = j) is the normalizing constant, I A denotes the indicator function of A, and the set Y + is given by A1. Define the stochastic matrix U [x 1:n ] by U [x 1:n ](i, j) def = p(y n = j|x 1:n , y 1 = i) when p(x 2:n |x 1 , y 1 = i) > 0, and U [x 1:n ](i, j) def = λ[x r:n ](j) otherwise, where i, j ∈ Y represent the row and column index of the matrix, respectively. The matrix U [x 1:n ] is a well-defined stochastic matrix for all x 1:n satisfying c(x r:n ) > 0. Lemma 3.1 Suppose A1 is satisfied. Let n ≥ r, and let x 1:n be such that x 1:r ∈ E and p(x 2:n |x 1 , y 1 = i) > 0 for some i ∈ Y. Then U [x 1:n ] satisfies the Doeblin condition (10) with ξ = λ[x r:n ]. Proof. We only consider the case n > r; the proof for n = r follows along similar, although simpler, arguments. Let x 1:n be such that x 1:r ∈ E and p(x 2:n |x 1 , y 1 = i * ) > 0 for some i * ∈ Y. We first show (11): p(x 2:n |x 1 , y 1 ) > 0 if and only if y 1 ∈ Y + (1) . Indeed, since by assumption p(x 2:n |x 1 , y 1 = i * ) > 0, there exists j * ∈ Y such that (12) p i * j * (x 1:r ) > 0 and p(x r+1:n |x r , y r = j * ) > 0. Thus (i * , j * ) ∈ Y + , which by A1 implies that (i, j * ) ∈ Y + for every i ∈ Y + (1) . This together with (12) shows that p(x 2:n |x 1 , y 1 ) > 0 for every y 1 ∈ Y + (1) , and so (11) is proved in one direction. In the other direction, if i ∉ Y + (1) , then, by the definition of Y + , p ij (x 1:r ) = 0 for every j ∈ Y, which in turn implies that p(x 2:n |x 1 , y 1 = i) = 0.
Next, note that c(x r:n ) > 0, and so U [x 1:n ] is well-defined. Indeed, we saw above that j * ∈ Y + (2) , and therefore by (12) c(x r:n ) ≥ p(x r+1:n |x r , y r = j * ) > 0. When i ∉ Y + (1) , then by (11), U [x 1:n ](i, j) = λ[x r:n ](j) for every j ∈ Y, and hence the inequality (10) is fulfilled for every j ∈ Y. Thus in what follows we assume that i ∈ Y + (1) . For such i and j ∈ Y + (2) we obtain a lower bound on U [x 1:n ](i, j) in terms of λ[x r:n ](j); together with (9) this implies the claim. Remark. Inspired by the technique in [9], one can add to the condition A1 the following: there exist t ∈ {2, . . . , r − 1} and a state l ∈ Y such that (13) holds for every x 1:r ∈ E. Then, for every i ∈ Y + (1) , the matrix of conditional probabilities satisfies the Doeblin condition with λ(j) = I {l} (j). Although formally (13) restricts A1, for many models like HMMs it is actually equivalent to A1. One advantage of (13) is that λ might be bigger than 1/n 0 2 . The condition (13) is more useful in linear state space models (continuous Y), see [9].
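The quantities of this subsection are easy to experiment with numerically. The sketch below (the matrices are hypothetical) computes the Dobrushin coefficient of a stochastic matrix and checks the contraction and sub-multiplicativity properties stated above.

```python
import numpy as np
from itertools import combinations

def dobrushin(M):
    """Dobrushin coefficient: half the maximal TV distance between rows of M."""
    return max(0.5 * np.abs(M[i] - M[j]).sum()
               for i, j in combinations(range(M.shape[0]), 2))

def tv(xi):
    """Total variation norm of a signed measure given as a vector."""
    return np.abs(xi).sum()

M = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
N = np.array([[0.4, 0.4, 0.2],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
xi = np.array([1.0, 0.0, 0.0])
xi2 = np.array([0.0, 0.5, 0.5])

# contraction: ||xi M - xi' M||_TV <= delta(M) ||xi - xi'||_TV
assert tv(xi @ M - xi2 @ M) <= dobrushin(M) * tv(xi - xi2) + 1e-12
# sub-multiplicativity: delta(M M') <= delta(M) delta(M')
assert dobrushin(M @ N) <= dobrushin(M) * dobrushin(N) + 1e-12
```

Note that every row of M is bounded below by ǫξ with ǫ = 0.6 and ξ proportional to the columnwise minima, so the Doeblin bound δ(M ) ≤ 1 − ǫ = 0.4 holds, and here it is in fact attained.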

Exponential forgetting results
Conditional transition matrices and distributions. Let now, for every m ≥ 1 and for every k ≥ 1, F k;m [x 1:n ] denote the |Y| × |Y| m matrix with entries F k;m [x 1:n ](u, v) def = P (Y 1+k:k+m = v|Y 1 = u, X 1:n = x 1:n ), u ∈ Y, v ∈ Y m . Observe that Y m is countable and so every version of the conditional probability above is regular. Note that since the process Z is homogeneous, for every 1 < s < n and x 1:n we can take F k;m [x s:n ](u, v) = P (Y s+k:s+k+m−1 = v|Y s = u, X s:n = x s:n ), where the equality follows from the Markov property. Thus F 1;1 [x 1:n ] is the one-step conditional transition matrix. The matrix F 0;1 [x 1:n ] ≡ I, where I stands for the |Y| × |Y| identity matrix, and for m > 1 the matrices are defined by (15). The definition (15) is clearly justified, since if m > 1 and u = v 1 , then the corresponding conditional probabilities factorize. Note that without loss of generality we may assume that F r−1;m is well defined. For every m ≥ 1 and 1 ≤ l ≤ t, n, define ν t l:n;m [x l:n ](v) def = P (Y t:t+m−1 = v|X l:n = x l:n ), v ∈ Y m . The notation ν t l:n;m represents the random function ν t l:n;m [X l:n ] taking values in [0, 1] |Y| m . The domain of that function is finite and so we identify the random function with a random vector. Observe that for any m, k, l ≥ 1, s ≥ l and n ≥ s + k, using (14) and the fact that ν s l:n;1 is a regular conditional distribution, it holds that ν s+k l:n;m = ν s l:n;1 F k;m [X s:n ], a.s. (16)
In order to generalize ν t l:n;m to the case m = ∞, corresponding to the conditional distribution of Y t:∞ , let F stand for the cylindrical σ-algebra on Y ∞ . Now, for every 1 ≤ l ≤ t, n, let ν t l:n;∞ be a regular version of the conditional distribution P (Y t:∞ ∈ ·|X l:n ) on the σ-algebra F .
Remark about a.s. The stochastic process Z is defined on an underlying probability space (Ω, F , P). (Regular) conditional probabilities ν and F are defined only up to P-null sets. Therefore, (in)equalities like (16) and the statement (18) of Proposition 3.1 below are all stated P-a.s. Observe also that there are countably many indices l, n, m, s, k. Therefore (16) implies: there exists Ω o ⊂ Ω such that P(Ω o ) = 1 and for every ω ∈ Ω o , (16) holds for all s, k, l, m, n. The same holds for other similar equalities like (18).

The main theorem.
In what follows, we take r ′ = r − 1; for any t > r ′ and for any string x s:t ∈ X t−s+1 , we define κ k (x s:t ) def = Σ u≥0 : s+k+(u+1)r ′ ≤t I E (x s+k+ur ′ :s+k+(u+1)r ′ ), k = 0, . . . , r ′ − 1. Thus κ 0 (x s:t ) counts the number of vectors from the set E appearing in the string x s:t at almost non-overlapping positions starting from s. Here "almost non-overlapping positions" means that the last entry of the previous occurrence and the first entry of the next one overlap. Similarly, κ k (x s:t ) counts the number of vectors from the set E in the string x s+k:t (k = 0, . . . , r ′ − 1).
Let us also define the reversed-time counterpart of κ 0 as follows: κ̄(x s:t ) def = Σ u≥0 : t−(u+1)r ′ ≥s I E (x t−(u+1)r ′ :t−ur ′ ). Thus κ̄(x s:t ) also counts the number of vectors from the set E in the string x s:t at almost non-overlapping positions; the difference with κ is that κ̄ starts counting from t. Note that with k = (t − s) mod r ′ , κ̄(x s:t ) = κ k (x s:t ).
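The counters κ k and κ̄ admit a direct implementation. A sketch for a finite alphabet follows; the word set E, the block length r and the test strings are hypothetical.

```python
def kappa(x, E, r, k=0):
    """kappa_k: count words from E occupying almost non-overlapping
    positions k, k + r', k + 2r', ... of the string x, where r' = r - 1,
    so that consecutive occurrences share exactly one entry."""
    rp = r - 1                     # r' = r - 1
    count, pos = 0, k
    while pos + r <= len(x):       # window x[pos : pos + r] has length r
        if tuple(x[pos:pos + r]) in E:
            count += 1
        pos += rp                  # step r', so windows overlap in one entry
    return count

def kappa_rev(x, E, r):
    """Reversed-time counterpart kappa-bar: count the same windows starting
    from the right end, i.e. kappa_k with offset k = (t - s) mod r'."""
    return kappa(x, E, r, k=(len(x) - 1) % (r - 1))
```

The identity κ̄ = κ k with k = (t − s) mod r ′ can be seen directly: both count the windows whose starting positions are congruent to (t − s) modulo r ′ .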
Moreover, the inequality (18) also holds when κ k (X s:t ) is replaced by κ̄(X s:t ).
Note that by (16), ν t l:n;m − ν t s:n;m = (ν s l:n;1 − ν s s:n;1 )F t−s;m [X s:n ], and so ν t l:n;m − ν t s:n;m TV ≤ ν s l:n;1 − ν s s:n;1 TV δ(F t−s;m [X s:n ]), where δ(U ) denotes the Dobrushin coefficient of the matrix U . Note that if x 1:n is such that p(x 1:n ) > 0, then for any u = 1, . . . , n − 1 there exists a state y u such that p(x u+1:n |x u , y u ) > 0, so that the assumption of Lemma 3.1 is fulfilled. Since p(X 1:n ) > 0 a.s., the bound of Lemma 3.1 applies, and so the statement follows.
Since κ̄(x s:t ) = κ k (x s:t ) for some k, we have max k κ k (X s:t ) ≥ κ̄(X s:t ), and so the second statement follows.
We are now ready to prove the first of the two main results of the paper. Recall that we do not assume any specific initial distribution π for the chain Z, hence all a.s.-statements below are with respect to the measure P on the underlying probability space.
Proof. (i) First we show that (20) holds. Recall that we denoted E(x 1 ) = {x 2:r | x 1:r ∈ E}. Here the first inequality in (20) follows from A1 and (9), and the second inequality follows from A2. Since by A2 ψ(E (1) × Y + (1) ) > 0, it follows from Lemma A.1 and (20) that X passes through E infinitely often, a.s. Assuming for the sake of concreteness that s ≥ l, there must exist T (s) ∈ {1, . . . , r ′ } such that κ 0 (X s+T :s+T +u ) → ∞ as u → ∞, a.s. Thus, as u → ∞, we have max k∈{0,...,r ′ −1} κ k (X s:u ) → ∞, and so the first part of the statement follows from Proposition 3.1.
(ii) Define Z k = Z k:k+r−1 , k ≥ 1. From Proposition 2.1 we know that Z is a positive Harris chain with maximal irreducibility measure ψ r . Recall that the chain Z admits a cyclic decomposition {D k , k = 0, . . . , d − 1}, where d denotes the period of Z. Also recall that by A2, ψ(E (1) × Y + (1) ) > 0; hence by (7) and (20), ψ r (E × Y r ) > 0. Thus with no loss of generality we may assume that E × Y r ⊂ D 0 . Let s ≥ 1, and let T (s) ≥ 0 be a σ(X s:∞ )-measurable integer-valued random variable defined as the smallest integer such that Z s+T ∈ D 0 . Since Z is Harris recurrent, T < ∞, a.s. Thus, by the strong Markov property, {Z s+T +k } k≥0 is a Markov chain with the same transition kernel as Z, hence also positive Harris. Also, by the cyclic decomposition of Z and the fact that Z s+T ∈ D 0 , the Markovian sub-process {Z s+T +kd } k≥0 can be seen as the process Z on D 0 , i.e. as a process that starts from Z s+T , whose next value is the value of Z at the next visit to D 0 , and so on. With this observation, it is easy to see that {Z s+T +kd } k≥0 is ψ r | D0 -irreducible (ψ r | D0 stands for the restriction) and positive Harris (if ψ r | D0 (A) > 0, then also ψ r (A) > 0), and so the law of large numbers applies to the indicators I E×Y r (Z s+T +kdr ′ ). Since the invariant measure P π (Z 1 ∈ ·) dominates the maximal irreducibility measure ψ r [32, Prop.
If t < s + T + U , then T + U > t − s and defining α def = ρ p and N def = T + U , we have that by Proposition 3.1 for any t ∈ {s, . . . , n} the following inequalities hold a.s.
(ii) If Z is positive Harris, then there exists a constant α ∈ (0, 1) such that the following holds: for every s ≥ 1 there exists a σ(X s:∞ )-measurable random variable C s < ∞ such that for all t ≥ s ≥ l ≥ 1, sup n≥t P (Y t:∞ ∈ ·|X l:n ) − P (Y t:∞ ∈ ·|X s:n ) TV ≤ C s α t−s , a.s.
Proof. Let A be the algebra consisting of all cylinders of Y ∞ . Thus F = σ(A). The statement (i) of Theorem 3.1 means that, P-a.s., the bound holds simultaneously for all sets in A. The proof of the second statement is the same.
By the Lévy martingale convergence theorem, for every l, t and F ∈ F , lim n P (Y t:∞ ∈ F |X l:n ) = P (Y t:∞ ∈ F |X l:∞ ), a.s.
As mentioned in the Introduction, for $n = t$ such convergences (filter stability) are studied by van Handel et al. in a series of papers [36,41,40,38,42]. If $Z$ is positive Harris, then the convergence above holds at an exponential rate, i.e. there exist an almost surely finite random variable $C_s$ and $\alpha \in (0,1)$ so that
$$\sup_{n \geq t} \|P_\pi(Y_{t:\infty} \in \cdot \mid X_{s:n}) - P_{\tilde{\pi}}(Y_{t:\infty} \in \cdot \mid X_{s:n})\|_{\rm TV} \leq K_s \alpha^{t}, \quad P_\pi\mbox{-a.s.},$$
where $K_s = C_s \alpha^{-s}$. Of course, just like in (26) and (27), we have that (29) and (30) also hold with $n = \infty$. The convergence (29) with $n = \infty$ is studied by van Handel in [40] (mostly in the HMM setting) under the name weak ergodicity of a Markov chain in random environment.

Two-sided forgetting
A1 and A2 under stationarity. In this section we consider a two-sided stationary extension of $Z$, namely $\{Z_k\}_{k \in \mathbb{Z}} = \{(X_k, Y_k)\}_{k \in \mathbb{Z}}$. As previously, the process is defined on the underlying probability space $(\Omega, \mathcal{F}, P)$, but in this section the measure $P$ is such that the process $Z$ is stationary. All a.s. statements are with respect to $P$. Denote, for $n \geq 1$ and $x_{1:n} \in \mathcal{X}^n$,
$$Y^*(x_{1:n}) \overset{\text{def}}{=} \{(y_1, y_n) \mid \exists y_{2:n-1} : p(x_{1:n}, y_{1:n}) > 0\}.$$
In the stationary case it is convenient to replace A1 and A2 with the following conditions.
A1′ There exists a set $E \subset \mathcal{X}^r$, $r > 1$, such that $Y^* \overset{\text{def}}{=} Y^*(x_{1:r}) \neq \emptyset$ is the same for any $x_{1:r} \in E$. It is easy to see that A1′ and A2′ imply $P(X_{1:r} \in E) > 0$.
Let us compare conditions A1 and A1′. Suppose $E$ is any set such that $Y^*(x_{1:r})$ is the same for any $x_{1:r} \in E$ and also $Y^+(x_{1:r})$ is the same for any $x_{1:r} \in E$. Then clearly $Y^{*(1)} \subset Y^{+(1)}$, and these sets are equal if for every $i \in Y^{+(1)}$ there exists $x_1 \in E^{(1)}$ such that $p(x_1, i) > 0$ (recall that we consider the stationary case, thus $p(x_1, i) = p(x_t, y_t = i)$ for any $t$). Therefore, if there exists $i \in Y^{+(1)} \setminus Y^{*(1)}$, then $p(x_1, i) = 0$ for every $x_1 \in E^{(1)}$. This implies that such a state $i$ almost never occurs together with an observation $x_1$ from $E^{(1)}$, and without loss of generality we can leave such states out of consideration. Indeed, recall the proof of Proposition 3.1, where for given $x_{s:n}$ we calculated, for any $k = 0, \ldots, r'-1$ and any $v \in \mathcal{Y}^m$,
$$\nu^t_{s:n;m}(v) = \sum_{i_0, i_1, \ldots, i_{\tau_k}} p(y_{s+k} = i_0 \mid x_{s:n})\, p(y_{s+k+r'} = i_1 \mid y_{s+k} = i_0, x_{s+k:n})\, p(y_{s+k+2r'} = i_2 \mid y_{s+k+r'} = i_1, x_{s+k+r':n}) \cdots p(y_{s+k+\tau_k r'} = i_{\tau_k} \mid y_{s+k+(\tau_k-1)r'} = i_{\tau_k-1}, x_{s+k+(\tau_k-1)r':n})\, p(y_{t:t+m-1} = v \mid y_{s+k+\tau_k r'} = i_{\tau_k}, x_{s+k+\tau_k r':n}).$$
The following theorem shows that, when using the reversed-time chain and the backward counterpart $\bar{\kappa}$, we obtain the inequality (32) with $C_s$ replaced by another random variable $C_0$ that is $\sigma(X_{-\infty:0})$-measurable but independent of $s$. Similarly, the random variable $\tilde{C}_s$ can be replaced by a $\sigma(X_{m-1:\infty})$-measurable random variable $\tilde{C}_{m-1}$ that is also independent of $s$ (but dependent on $m$). Because of stationarity, the assumptions A1-A2 are replaced by the (formally) weaker assumptions A1′-A2′, but as we argued, they can be considered equivalent. Throughout the section we assume that $m \geq 1$ is a fixed integer.
Proof. By stationarity the reversed-time chain $\tilde{Z}$ is positive Harris. Now we apply the proof of (ii) of Theorem 3.1 to the reversed-time block chain $\tilde{\bar{Z}}_k \overset{\text{def}}{=} \tilde{Z}_{k:k+r'} = (Z_{-k}, \ldots, Z_{-k-r'})$, $k \geq 0$. Let $f : \mathcal{X}^r \to \mathcal{X}^r$ be the mapping that reverses the ordering of a vector, $f(x_1, \ldots, x_r) = (x_r, \ldots, x_1)$, and let $\bar{E} \overset{\text{def}}{=} \{f(x_{1:r}) : x_{1:r} \in E\}$. Thus $\tilde{\bar{Z}}_k \in \bar{E} \times \mathcal{Y}^r$ if and only if $Z_{-k-r':-k} \in E \times \mathcal{Y}^r$. Now, just like in the proof of (ii) of Theorem 3.1, we define $T$ as a random variable such that $\tilde{\bar{Z}}_T \in D_0$, where $D_0, \ldots, D_{d'-1}$ is a cyclic decomposition of $\{\tilde{\bar{Z}}_k\}$. The set $D_0$ is such that $P(\tilde{\bar{Z}}_1 \in D_0 \cap (\bar{E} \times \mathcal{Y}^r)) > 0$; by (31), such a set $D_0$ exists. Therefore $S(n)/n \to p_o$ a.s., where $p_o > 0$. Denote $\tilde{X}_t \overset{\text{def}}{=} X_{-t}$, so that for any $u$, $\tilde{X}_{T:T+u} = (X_{-T}, \ldots, X_{-T-u})$, where $\bar{\kappa}$ is defined as in (17). Now everything is the same as in the proof of (ii) of Theorem 3.1: for every $0 < p < p_0/(d'r')$, there exists a finite random variable $U$ (depending on $p$) such that the required bound holds for all $k \geq 0$. By assumption all conditions of Proposition 3.1 hold, and applying it with $-l \leq -s \leq 0 \leq n$ (and with $\bar{\kappa}$), we obtain, just like in the proof of (ii) of Theorem 3.1,
$$\|\nu^0_{-l:n;m} - \nu^0_{-s:n;m}\|_{\rm TV} \leq 2\rho^{\bar{\kappa}(X_{-s:0})} \leq 2\rho^{p(s-T)I(T+U \leq s)} \leq 2\alpha^{-(U+T)} \cdot \alpha^{s}, \quad \mbox{a.s.}$$
So the statement holds with $C_0 \overset{\text{def}}{=} 2\alpha^{-(U+T)}$, and the random variable $C_0$ is independent of $s$. This proves (34). If $Z$ satisfies A1′ and A2′, then the reversed-time chain $\tilde{Z}$ satisfies A1′ and A2′ with $\bar{E}$ instead of $E$. Then the inequality (34) applied to $\tilde{Z}$ yields (35). The constants might be different, because the transition kernel of $\tilde{Z}$ might differ from that of $Z$.
Similarly, for any $s > 1$ the corresponding limit exists, and plugging (39) into (35) we obtain the analogous bound for any $s' > m - 1$. The inequalities (38) and (40) together imply the approximation inequality (41). Applying (41) to $\nu^t_{1:n;m}$, we obtain the following corollary. Corollary 3.2 Suppose the assumptions of Theorem 3.2 hold. Then there exists $\alpha_o \in (0,1)$ such that for every $n, t$ satisfying $n \geq t + m - 1$, $t \geq 1$, the corresponding bound holds with a $\sigma(X_{-\infty:t}, X_{t+m-1:\infty})$-measurable random variable $C_t$.

Corollary 3.3 is a PMM-generalization of Theorem 2.1 in
Ergodicity. In (44), we can replace 1 by any $l \in \mathbb{Z}$ and consider the stochastic process $\{C_l\}_{l \in \mathbb{Z}}$. The construction of $C_l$ reveals that for any $l$, $C_l = f(X_{l:\infty})$, where the function $f$ is independent of $l$. This means that the process $\{C_l\}_{l \in \mathbb{Z}}$ is a stationary coding of the process $Z$. Hence the process $\{C_l\}_{l \in \mathbb{Z}}$ is stationary (since $Z$ was assumed to be stationary) and, when $Z$ is an ergodic process (in the sense of ergodic theory), then so is $\{C_l\}_{l \in \mathbb{Z}}$. The same holds for the process $\{\tilde{C}_k\}_{k \in \mathbb{Z}}$. The ergodicity of these processes is the key to proving the existence of the limit $R$ in PMAP segmentation (recall the paragraph "Applications in segmentation").

Countable X
When $\mathcal{X}$ is countable, $Z$ is a Markov chain with countable state space, and $Z$ is (positive) Harris recurrent if and only if $Z$ is (positive) recurrent. If $\mathcal{X}$ is finite, then every irreducible Markov chain is positive recurrent. If $\mathcal{X}$ is countable, then A1 is fulfilled if and only if for some $r > 1$ there exists a vector $x_{1:r} \in \mathcal{X}^r$ such that $Y^+(x_{1:r}) = Y^+(x_{1:r})^{(1)} \times Y^+(x_{1:r})^{(2)} \neq \emptyset$. For irreducible $Z$, the assumption A2 automatically holds if $Y^+(x_{1:r}) \neq \emptyset$, and that is guaranteed by A1. The interpretation of A1′ in the case of countable $\mathcal{X}$ is very straightforward: for every two vectors $y_{1:r}, \bar{y}_{1:r} \in \mathcal{Y}^r$ satisfying $p(x_{1:r}, y_{1:r}) > 0$ and $p(x_{1:r}, \bar{y}_{1:r}) > 0$, there exists a third vector $\tilde{y}_{1:r} \in \mathcal{Y}^r$ such that $\tilde{y}_1 = y_1$, $\tilde{y}_r = \bar{y}_r$ and $p(x_{1:r}, \tilde{y}_{1:r}) > 0$.
In ergodic theory, this property is called the subpositivity of the word $x_{1:r}$ for the factor map $\pi : \mathcal{Z} \to \mathcal{X}$, $\pi(x, y) = x$; see [44, Def 3.1]. Thus A1′ ensures that a.e. realization of the $X$ process has infinitely many subpositive words.
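For countable $\mathcal{X}$ this criterion is purely combinatorial and can be checked mechanically once the positivity pattern of $p(x_{1:r}, y_{1:r})$ is available. The sketch below is an illustration, not part of the paper's formalism: the predicate `positive` is a hypothetical user-supplied function reporting whether $p(x_{1:r}, y_{1:r}) > 0$; subpositivity of a word then amounts to the set of admissible (first, last)-pairs being a full product set.

```python
from itertools import product

def is_subpositive(x_word, Y, positive):
    """Check subpositivity of the observation word x_word.

    `positive(x_word, y_word) -> bool` is assumed to report whether
    p(x_{1:r}, y_{1:r}) > 0.  The word is subpositive if for every two
    admissible y-words y, ybar there is an admissible ytilde with
    ytilde_1 = y_1 and ytilde_r = ybar_r, i.e. the (first, last)-pairs
    of admissible words form a full product set.
    """
    r = len(x_word)
    admissible = [y for y in product(Y, repeat=r) if positive(x_word, y)]
    ends = {(y[0], y[-1]) for y in admissible}
    firsts = {y[0] for y in admissible}
    lasts = {y[-1] for y in admissible}
    return bool(admissible) and ends == set(product(firsts, lasts))
```

For instance, with $\mathcal{Y} = \{0,1\}$ and $r = 2$, forbidding only the word $(0,1)$ destroys subpositivity, since $(0, \cdot)$ and $(\cdot, 1)$ are both admissible starts/ends but $(0,1)$ itself is not.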

Nondegenerate PMM's
In [36,37], Tong and van Handel introduce the non-degenerate PMM. When adapted to our case, the model is non-degenerate when the kernel density factorizes as
$$q(x', j \mid x, i) = p_{ij}\, r(x' \mid x)\, g(x, i, x', j), \quad (45)$$
where $P = (p_{ij})$ is a transition matrix and $r(x' \mid x)$ is the density of a transition kernel, i.e. for every $x$, $\int r(x' \mid x)\, \mu(dx') = 1$, so that $r$ defines a transition kernel on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$. The third factor $g(x, i, x', j)$ is a strictly positive measurable function. For the motivation and general properties of non-degenerate PMM's see [36]; the key property is that the function $g$ is strictly positive. Non-degeneracy does not imply that $Y$ is a Markov chain, and even if it is, its transition matrix need not be $P$. Under (45), for every $x_{1:n}$, $n \geq 2$, and $i, j \in \mathcal{Y}$,
$$p_{ij}(x_{1:n}) = p^{n-1}_{ij}\, g_n(i, j, x_{1:n}) \prod_{k=2}^{n} r(x_k \mid x_{k-1}), \quad (46)$$
where $p^{n-1}_{ij}$ stands for the $(i,j)$-element of $P^{n-1}$ and $g_n(i, j, x_{1:n}) > 0$ (see also [36, Lemma 3.1]). From (46) it immediately follows that when $P$ is primitive, i.e. for some $R \geq 1$ the matrix $P^R$ has strictly positive entries, then any $x_{1:r}$ with $r = R+1$ such that $p(x_{1:r}) > 0$ satisfies A1: $Y^+(x_{1:r}) = \mathcal{Y} \times \mathcal{Y}$. Thus, when $P$ is primitive, then A1 and A2 both hold with $E = \{x_{1:r} : p(x_{1:r}) > 0\}$.
Non-degeneracy alone is not sufficient for the primitivity of $P$. We now show that, when combined with some natural ergodicity assumptions, $P$ is primitive. Let $P_n(i,j) \overset{\text{def}}{=} P(Y_n = j \mid Y_1 = i)$. Recall that $\pi$ is a stationary measure of $Z$, and with a slight abuse of notation, let $\pi(i) \overset{\text{def}}{=} \pi(\{i\} \times \mathcal{X})$ be the marginal measure of $\pi$. Surely $\pi(i) > 0$ for every $i \in \mathcal{Y}$, and so the convergence
$$P_n(i,j) \to \pi(j), \quad \forall i, j \in \mathcal{Y} \quad (47)$$
(equivalently, $P(Y_n = j \mid Y_1 = i) \to \pi(j)$ for all $i, j \in \mathcal{Y}$) implies that $P_n$ must have all positive entries when $n$ is big enough. If $Y$ happens to be a Markov chain with transition matrix $P$, then $P$ is primitive. Otherwise observe that by (46)
$$P_n(i,j) = \int_{\mathcal{X}^n} p(x_1 \mid y_1 = i)\, p_{ij}(x_{1:n})\, \mu^n(dx_{1:n}) = p^{n-1}_{ij} \int_{\mathcal{X}^n} p(x_1 \mid y_1 = i)\, g_n(i, j, x_{1:n}) \prod_{k=2}^{n} r(x_k \mid x_{k-1})\, \mu^n(dx_{1:n}),$$
so that if there exists $n$ such that $P_n(i,j) > 0$ for every $i, j \in \mathcal{Y}$, then $P^{n-1}$ has strictly positive entries and so $P$ is primitive. Hence for non-degenerate PMM's, (47) implies A1 and A2. A stronger version of (47) (so-called marginal ergodicity) is assumed in [36] for proving the filter stability of non-degenerate PMM's [36, Th 2.10]. Thus, for finite $\mathcal{Y}$, Theorem 3.1 generalizes that result. We believe that the key assumption of strictly positive $g$ can be relaxed in the light of the cluster assumption introduced in the next subsection for HMM's.

Hidden Markov model
In the case of an HMM, the transition kernel density factorizes as $q(x', j \mid x, i) = p_{ij} f_j(x')$. Here $P = (p_{ij})$ is the transition matrix of the Markov chain $Y$ and the $f_j$ are the emission densities with respect to the measure $\mu$. Thus
$$p_{ij}(x_{1:n}) = \sum_{k_1, \ldots, k_{n-2}} p_{i k_1} f_{k_1}(x_2)\, p_{k_1 k_2} f_{k_2}(x_3) \cdots p_{k_{n-2} j} f_j(x_n).$$
The process $Z$ is irreducible (with respect to some measure) if and only if $Y$ is irreducible, and in this case the maximal irreducibility measure can be given explicitly. Since HMM's are by far the most popular PMM's in practice, it would be desirable to have a relatively easy criterion for checking the assumptions A1 and A2 for HMM's. In this subsection, we introduce a fairly general but easily verifiable assumption called the cluster assumption. Lemma 4.1 below shows that the cluster assumption implies A1 and A2. The rest of the subsection is mostly devoted to showing that the cluster assumption still generalizes many similar assumptions encountered in the literature.
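For an HMM the sum over $k_1, \ldots, k_{n-2}$ above is a product of matrices, $P D(x_2) P D(x_3) \cdots P D(x_n)$ with $D(x) = \mathrm{diag}(f_1(x), \ldots, f_{|\mathcal{Y}|}(x))$, which gives a convenient way to compute all $p_{ij}(x_{1:n})$ at once and to read off the positivity pattern relevant for $Y^+(x_{1:n})$. A minimal numpy sketch (the list `densities` of emission-density callables is our own illustrative interface):

```python
import numpy as np

def p_ij_matrix(P, densities, x):
    """Return the matrix (p_ij(x_{1:n}))_{ij} for an HMM, computed as
    P D(x_2) P D(x_3) ... P D(x_n), where D(x) = diag(f_1(x), ..., f_K(x)).
    `densities` is a list of callables, one emission density per state;
    x_1 enters the model only through p(x_1, y_1) and is skipped here."""
    K = P.shape[0]
    M = np.eye(K)
    for xk in x[1:]:                       # x_2, ..., x_n
        D = np.diag([f(xk) for f in densities])
        M = M @ P @ D
    return M
```

The set $\{(i,j) : p_{ij}(x_{1:n}) > 0\}$ can then be combined with the positivity of $p(x_1, y_1 = i)$ to determine $Y^+(x_{1:n})$ numerically.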
Surely, at least one cluster always exists. Also, it is important to observe that every state i belongs to at least one cluster. Distinct clusters need not be disjoint and a cluster can consist of a single state.
The cluster assumption states: there exists a cluster $C \subset \mathcal{Y}$ such that the sub-stochastic matrix $P_C = (p_{ij})_{i,j \in C}$ is primitive, that is, $P_C^R$ has only positive elements for some positive integer $R$.
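Primitivity depends only on the zero pattern of $P_C$, so the cluster assumption can be verified mechanically. A sketch using Wielandt's classical bound (a nonnegative $K \times K$ matrix is primitive if and only if the pattern of its $(K^2 - 2K + 2)$-th power is strictly positive); the list `clusters` of candidate clusters is assumed to be supplied by the user:

```python
import numpy as np

def is_primitive(A):
    """Check primitivity of a nonnegative square matrix A via Wielandt's
    bound: A (K x K) is primitive iff A^(K^2 - 2K + 2) > 0 entrywise.
    Only the zero pattern matters, so we work over the boolean semiring
    (overflow-safe for large exponents)."""
    K = A.shape[0]
    B = (np.asarray(A) > 0).astype(int)
    M = np.eye(K, dtype=int)
    for _ in range(K * K - 2 * K + 2):     # Wielandt's exponent
        M = ((M @ B) > 0).astype(int)      # boolean matrix product
    return bool(M.all())

def cluster_assumption_holds(P, clusters):
    """Cluster assumption: some cluster C has primitive P_C = (p_ij)_{i,j in C}."""
    return any(is_primitive(P[np.ix_(C, C)]) for C in clusters)
```

For example, the $2 \times 2$ permutation matrix is irreducible but not primitive (period 2), while any strictly positive stochastic matrix is primitive with $R = 1$.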
Lemma 4.1 Let $Z$ be a hidden Markov chain with an irreducible hidden chain $Y$. Then the cluster assumption implies A1-A2.
Proof. There must exist an integer $R \geq 1$ such that $P_C^R$ has only positive elements. Defining $E \overset{\text{def}}{=} (G_C)^{R+1}$, we see that A1 holds, and A2 also holds.
The cluster assumption was introduced in [30,29,19] for purposes other than exponential forgetting.
Later it was successfully exploited in many different setups [23,24,27]. In those earlier papers the concept of a cluster was stronger than (48). The weaker definition of cluster (48) was first introduced in [31].
We shall now show how, in the case of finite $\mathcal{Y}$, the cluster assumption naturally generalizes many existing mixing conditions encountered in the literature. The following assumption is known as the strong mixing condition (Assumption 4.3.21 in [1]): for every $x \in \mathcal{X}$, there exist a probability measure $K_x$ on $\mathcal{Y}$ and strictly positive functions $\zeta^-, \zeta^+$ on $\mathcal{X}$ such that
$$\zeta^-(x) K_x(j) \leq p_{ij} f_j(x) \leq \zeta^+(x) K_x(j), \quad \forall i, j \in \mathcal{Y}. \quad (50)$$
A stronger version of the strong mixing condition is the following: there exist positive numbers $\sigma^-$ and $\sigma^+$ and a probability measure $K$ on $\mathcal{Y}$ such that
$$\sigma^- K(j) \leq p_{ij} f_j(x) \leq \sigma^+ K(j), \quad \forall i, j \in \mathcal{Y},\ x \in \mathcal{X}. \quad (51)$$
This is Assumption 4.3.24 in [1]. It is easy to verify that under the strong mixing condition the Dobrushin coefficient of the $r$-step transition matrix $U(x_{s:n}) = F_{r-1;1}(x_{s:n})$ can be bounded above; under (51) the upper bound is $(1 - \sigma^-/\sigma^+)^{r-1}$, a constant less than 1. Now it is clear that under (51) the exponential forgetting holds with a non-random universal constant $C^*$, i.e. in the inequality (19) one can take $C_s \equiv C^*$ for every $s$.
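The contraction behind this bound is easy to observe numerically. The sketch below computes the Dobrushin coefficient $\delta(U) = \frac{1}{2}\max_{i,i'} \sum_j |u_{ij} - u_{i'j}|$ and checks the Doeblin-type bound $\delta(U) \leq 1 - \sigma^-/\sigma^+$ on a toy matrix whose rows are built to dominate $(\sigma^-/\sigma^+) K$ elementwise, as implied by the mixing condition (all numbers are illustrative, not from the paper):

```python
import numpy as np

def dobrushin(U):
    """Dobrushin contraction coefficient: half the maximal total-variation
    distance between two rows of the stochastic matrix U."""
    diffs = np.abs(U[:, None, :] - U[None, :, :]).sum(axis=2)
    return 0.5 * diffs.max()

# Toy illustration: rows of the form eps*K + (1-eps)*R_i dominate eps*K,
# hence delta(U) <= 1 - eps with eps = sigma_minus / sigma_plus.
rng = np.random.default_rng(0)
K_meas = np.array([0.3, 0.7])
sigma_minus, sigma_plus = 0.2, 1.0
eps = sigma_minus / sigma_plus
R = rng.random((2, 2))
R /= R.sum(axis=1, keepdims=True)
U = eps * K_meas + (1 - eps) * R
assert dobrushin(U) <= 1 - eps + 1e-12
```

Iterating $r-1$ such steps multiplies the coefficients, which is exactly the source of the $(1 - \sigma^-/\sigma^+)^{r-1}$ rate.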
In the book [1], Assumptions 4.3.21 and 4.3.24, as well as Assumptions 4.3.29 and 4.3.31 below, are stated for a general state space model, where $\mathcal{Y}$ is a general space, and so (50) and (51) are just the versions of these assumptions for discrete (finite or countably infinite) $\mathcal{Y}$. We now briefly argue that in the case of discrete $\mathcal{Y}$ they are rather restrictive and that our cluster assumption naturally generalizes them. Indeed, it is easy to see that (51) holds if $p_{ij} > 0$ for every $i, j$ and for every $x$ there exists $j$ such that $f_j(x) > 0$ (this is a very natural condition; otherwise leave $x$ out of $\mathcal{X}$). On the other hand, if the transition matrix is irreducible, then every row has at least one positive entry, and then (51) implies that $p_{ij} > 0$ for every $i, j$, a rather strong restriction on the transition matrix. The same holds for (50). Indeed, since for every $j$ there exists $x$ such that $f_j(x) > 0$, and for every $j$ there exists $i$ such that $p_{ij} > 0$ (implied by irreducibility), then for every $j$ there exist $x$ and $i$ such that $p_{ij} f_j(x) > 0$. Then $p_{i'j} > 0$ for every $i'$, so that all entries of the transition matrix are positive. If the entries of $P$ are all positive (as is sometimes assumed, e.g. [21]), then any cluster satisfies the requirement of the cluster assumption (with $R = 1$), so that the strong mixing condition trivially implies the cluster assumption.
In order to incorporate zero transitions, the primitivity of the one-step transition matrix $P$ could be replaced by that of the $R$-step transition matrix for some $R > 1$. An example of such a mixing assumption is the following (Assumption 4.3.29 in [1], see also [12,20]): there exist positive numbers $\sigma^-$ and $\sigma^+$, an integer $R$ and a probability measure $K$ on $\mathcal{Y}$ such that, with $p^R_{ij}$ being the $(i,j)$-element of $P^R$, 1. $\sigma^- K(j) \leq p^R_{ij} \leq \sigma^+ K(j)$ for all $i, j \in \mathcal{Y}$; 2. the emission densities are bounded away from below and above, i.e. $0 < \inf_{i,x} f_i(x) \leq \sup_{i,x} f_i(x) < \infty$. Under these conditions the constant in (19) is non-random and independent of $s$. We see that 1. relaxes the first requirement of (51), because (under irreducibility) now only the elements of $P^R$ must be positive. For an aperiodic chain such an $R$ always exists, and so 1. is not restrictive. On the other hand, the assumption on the emission densities is stronger, because they all must be strictly positive. When the densities are all positive, then there is only one cluster $C = \mathcal{Y}$; hence under 1. and 2. above, the cluster assumption holds. The assumption 2. about the positivity of the densities is often made in the literature (e.g. [2,7,8]). In particular, it is the HMM-version of the non-degeneracy assumption [40,34]. Of course, the above-mentioned articles deal with continuous state space $\mathcal{X}$, where the technique is different. However, at least in the finite state case, the mutual equivalence of the emission distributions excludes many important models and can be restrictive. The cluster assumption, however, combines zero densities and zero transitions, and is therefore applicable to a much larger class of models.
Another assumption of a similar type, originally also applied in the case of finite $\mathcal{Y}$, can be found in [25,13]: the matrix $P$ is primitive and the emission densities satisfy a joint positivity condition. This assumption relaxes the requirement of positive densities, but it implies that $\mu\{x : \min_i f_i(x) > 0\} > 0$, so that $\mathcal{Y}$ is a cluster that satisfies the cluster assumption.
Although we have seen that the cluster assumption is weaker than many mixing assumptions in the literature, it is still strictly stronger than A1 and A2. To illustrate this fact, consider the following example (a modification of Example 5.1 in [30]) of a four-state HMM for which the cluster assumption fails. For this model, $Y^+(x_{1:3}) = \mathcal{Y} \times \{3, 4\}$, and hence A1 and A2 hold.
We conclude the section with some examples of assumptions made in the literature that are weaker than the cluster assumption (or not comparable with it), but still stronger than A1 and A2. The first of them is Assumption 4.3.31 in [1]. When adapted to our case of discrete $\mathcal{Y}$, one of its main conditions (there are also some other conditions, making it stronger) is the following: there exist a function $\alpha : \mathcal{X} \to [0,1]$, not $\mu$-a.s. identically null, and a set $C \subset \mathcal{Y}$ such that a minorization involving $\alpha(x)$ and $C$ holds for all $i, j \in \mathcal{Y}$ and all $x \in \mathcal{X}$. This condition implies A1. Indeed, let $C' \subset \mathcal{Y}$ be a cluster; then there exists a set $\mathcal{X}'$ with $\mu(\mathcal{X}') > 0$ for which the condition applies, and so A1 holds. It is also implicitly assumed that $\{x \mid \alpha(x) > 0\}$ has positive $\mu$-measure, whence A2 also follows. This assumption is not comparable with the cluster assumption.
then it holds for any other $x'_2 \in F_1$ as well. Observe that, due to the assumption, $1 \in Y^{+(2)}$. Relabel the states so that $Y^{+(2)} = \{1, 2, \ldots, l\}$. Let $A_1 \subset Y^{+(1)}$ be the set of states that can be connected with 1. Formally, $i \in A_1$ if $p_{i1}(x_{1:2}) > 0$ for every $x_{1:2} \in E_1$. Clearly $A_1 \neq \emptyset$. If $A_1 = Y^{+(1)}$, then the proposition is proved: just take $E = E_1 \times F_1$ and observe that by assumption, for any state $k$ in $C_1$, $p_{1k} > 0$. Let $A_2 = Y^{+(1)} \setminus A_1$ consist of the states that cannot be connected to 1 but can be connected to 2. Thus $i \in A_2$ if and only if $p_{i2}(x_{1:2}) > 0$ but $p_{i1}(x_{1:2}) = 0$ for every $x_{1:2} \in E_1$. The set $A_2$ might be empty. Similarly define $A_j$ for $j = 3, \ldots, l$. By irreducibility there exists a path $i_1, i_2, \ldots, i_s$ with $i_1 = 2$ and $i_s = 1$ from the state 2 to the state 1. Let $C_2, \ldots, C_s$ be the corresponding clusters containing $i_2, \ldots, i_s$, and define $F_j = G_{C_j}$, $j = 2, \ldots, s$. Finally take $E_2 = F_2 \times \cdots \times F_s$. Since $p_{1 i_2} > 0$ by assumption (the first row has all non-zero entries), we have that for every $x_{2:s} \in E_2$, $p_{11}(x_{2:s}) > 0$ and $p_{21}(x_{2:s}) > 0$. Now enlarge the set $E_1$ by taking $E_1 \times E_2$ and redefine the sets $A'_1, A'_2, \ldots, A'_{l'}$. Observe: if for some $k = 1, \ldots, l$ and $x_{2:s} \in E_2$ we have $p_{k1}(x_{2:s}) > 0$, then $A_k \subset A'_1$. Therefore $A_1 \cup A_2 \subset A'_1$, and $l' < l$. If $l' > 1$, then proceed similarly by enlarging $E_1 \times E_2$ until all elements of $Y^{+(1)}$ can be connected with 1. This proves A1. The assumption A2 is trivial.

Linear Markov switching model
Let $\mathcal{X} = \mathbb{R}^d$ for some $d \geq 1$, and for each state $i \in \mathcal{Y}$ let $\{\xi_k(i)\}_{k \geq 2}$ be an i.i.d. sequence of random variables on $\mathcal{X}$ with $\xi_2(i)$ having density $h_i$ with respect to the Lebesgue measure on $\mathbb{R}^d$. We consider the linear Markov switching model, where $X$ is defined recursively by
$$X_k = F(Y_k) X_{k-1} + \xi_k(Y_k), \quad k \geq 2. \quad (52)$$
Here the $F(i)$ are $d \times d$ matrices, $Y = \{Y_k\}_{k \geq 1}$ is a Markov chain with transition matrix $(p_{ij})$, $X_1$ is some random variable on $\mathcal{X}$, and the random variables $\{\xi_k(i)\}_{k \geq 2,\, i \in \mathcal{Y}}$ are assumed to be independent and independent of $X_1$ and $Y$. For the linear switching model the measure $\mu$ is the Lebesgue measure on $\mathbb{R}^d$ and the transition kernel density is $q(x_2, j \mid x_1, i) = p_{ij} h_j(x_2 - F(j) x_1)$. When the $F(i)$ are zero matrices, the linear Markov switching model simply becomes an HMM with the $h_i$ as emission densities. When $F(i) = F$ for every $i \in \mathcal{Y}$, the model becomes an autoregressive model with correlated noise. Linear Markov switching models, sometimes also called linear autoregressive switching models, have been widely used in econometric modelling, see e.g. [16,17,18].
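The recursion defining the model is straightforward to simulate: draw the regime $Y_k$ from the row $P_{Y_{k-1}}$ and then apply the regime-dependent affine update. The sketch below (helper names and all parameter values are our own, for illustration only) follows the recursion $X_k = F(Y_k) X_{k-1} + \xi_k(Y_k)$:

```python
import numpy as np

def simulate_switching(P, F, noise, x1, y1, n, rng):
    """Simulate a linear Markov switching model
        X_k = F(Y_k) X_{k-1} + xi_k(Y_k),   k >= 2,
    where Y is a Markov chain with transition matrix P and
    noise[i](rng) draws one realization of xi(i)."""
    d = len(x1)
    X = np.empty((n, d))
    Y = np.empty(n, dtype=int)
    X[0], Y[0] = x1, y1
    for k in range(1, n):
        Y[k] = rng.choice(len(P), p=P[Y[k - 1]])       # hidden regime
        X[k] = F[Y[k]] @ X[k - 1] + noise[Y[k]](rng)   # observation
    return X, Y

# Toy two-regime example in dimension d = 1 (illustrative numbers)
rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
F = [np.array([[0.5]]), np.array([[-0.3]])]
noise = [lambda r: r.normal(size=1), lambda r: r.normal(size=1)]
X, Y = simulate_switching(P, F, noise, np.zeros(1), 0, 200, rng)
```

With Gaussian $h_i$ and a primitive $P$, such a simulated model satisfies condition (i) of the result below with $C = \mathcal{Y}$.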
The following result gives sufficient conditions for A1-A2 to hold. The analytic form of the stationary density $p(z_1)$ is usually intractable for the linear switching model, and therefore we avoid its use in the conditions. Instead, we rely solely on the notion of $\psi$-irreducibility. In what follows, let $\|\cdot\|$ denote the 2-norm on $\mathcal{X} = \mathbb{R}^d$, and for any $x \in \mathcal{X}$ and $\epsilon > 0$ let $B(x, \epsilon)$ denote the open ball in $\mathcal{X}$ with respect to the 2-norm with center $x$ and radius $\epsilon$. (i) There exist a set $C \subset \mathcal{Y}$ and $\epsilon > 0$ such that the following two conditions are satisfied: 1. for $x \in B(0, \epsilon)$, $h_i(x) > 0$ if and only if $i \in C$; 2. the sub-stochastic matrix $(p_{ij})_{i,j \in C}$ is primitive.
Together with (53) and (i), this implies that $p_{ij}(x_{1:R+2}) > 0$ if and only if $i \in \mathcal{Y}_C$ and $j \in C$. Hence $Y^+(x) = \mathcal{Y}_C \times C$ for every $x \in E$. Together with (ii), this implies that A1-A2 hold.
Note that if the densities $h_i$ are all positive around 0 (for example, Gaussian), then (i) is fulfilled with $C = \mathcal{Y}$ whenever $P$ is primitive. General conditions for the linear Markov switching model to be positive Harris and aperiodic can be found in [10].
Remark: Instead of the linear Markov switching model, we can also consider the general Markov switching model, also called the nonlinear autoregressive switching model. For this model the linear recursion in (52) is replaced by a measurable function $G : \mathcal{Y} \times \mathcal{X} \to \mathcal{X}$, i.e.
$$X_k = G(Y_k, X_{k-1}) + \xi_k(Y_k), \quad k \geq 2.$$
The statement of Lemma 4.2 holds for this model as well, if we demand that the functions $G(i, \cdot)$ satisfy the following additional conditions: $G(i, \cdot)$ is continuous at 0 and $G(i, 0) = 0$ for all $i \in \mathcal{Y}$.
If these conditions are too restrictive, a different approach is needed to prove A1-A2. For general conditions for positivity, Harris recurrence and aperiodicity of the non-linear switching model see e.g. [10,11,43].
Proof of Proposition 2.1. Clearly, if $Z$ is a stationary process, then the block process $\bar{Z}$ is stationary as well, so that the distribution of $\bar{Z}_1$ (under $\pi$) is an invariant probability measure for $\bar{Z}$.