Three Kinds of Geometric Convergence for Markov Chains and the Spectral Gap Property

In this paper we investigate three types of convergence for geometrically ergodic Markov chains (MCs) with countable state space, which in general lead to different ‘rates of convergence’. For reversible Markov chains it is shown that these rates coincide. For general MCs we show some connections between their rates and those of the associated reversed MCs. Moreover, we study the relations between these rates and a certain family of isoperimetric constants. This sheds new light on the connection of geometric ergodicity and the so-called spectral gap property, in particular for non-reversible MCs, and makes it possible to derive sharp upper and lower bounds for the spectral radius of certain non-reversible chains.


Introduction
For positive recurrent Markov chains (MCs) one of the central questions is the convergence of their transition kernels to the invariant distribution. The 'geometrically ergodic' case, in which this convergence takes place at a geometric rate, is of particular importance. A profound analysis of this subject can be found in the monographs by Meyn and Tweedie [7] and by Nummelin [8].
In this paper we are concerned with three different kinds of rates of geometric convergence. In Section 2 we present an example to illustrate the differences between the definitions; in Section 3 several connections between these rates for a MC and the corresponding rates for the reversed chain are proved. In Section 4 we show that for reversible Markov chains (under a mild condition) the different types of rates of convergence actually coincide. In Section 5 we analyze geometrically ergodic MCs by applying the concept of isoperimetric constants, which has been used in [14] to establish necessary and sufficient conditions for the spectral gap property. We show that this property and geometric ergodicity are equivalent for normal Markov chains, generalizing the results of Roberts and Tweedie [11] and Roberts and Rosenthal [12]. Moreover, it is shown how a certain sequence of isoperimetric constants can be used to obtain bounds for the rates of geometric convergence, and we prove that these bounds are sharp in some cases. In Section 6 we present an example which shows that geometric ergodicity (GE) does not imply the spectral gap property (SGP), and we calculate exact rates of geometric convergence by applying the method of isoperimetric constants.
Throughout this paper let ξ_1, ξ_2, ... be a positive recurrent MC with countable state space Ω, transition kernel p(·, ·) and invariant probability measure π. Let p*(i, j) = π(j)p(j, i)/π(i) be the transition probabilities of the reversed MC (a realization of which we denote by ξ*_1, ξ*_2, ...). We need the standard MC operators P, P* and Π, defined by

P f(i) = Σ_{j∈Ω} p(i, j) f(j), (2)

P* f(i) = Σ_{j∈Ω} p*(i, j) f(j), (3)

Π f(i) = Σ_{j∈Ω} π(j) f(j), (4)

for all real-valued functions f on Ω for which the corresponding series converge. In particular, for all f ∈ L^2(π) it easily follows from Jensen's inequality and the stationarity of π that the sums in (2), (3) and (4) converge and that P f, P* f and Π f are in L^2(π). Note that we consider Π as the operator that maps every f ∈ L^2(π) to the function constantly equal to the π-expected value of f. The scalar product on L^2(π) is of course

⟨f, g⟩_π = Σ_{i∈Ω} f(i)g(i)π(i).

It is easy to show that ⟨P f, g⟩_π = ⟨f, P* g⟩_π, so P* is the adjoint operator of P on L^2(π). We say that P has the spectral gap property (SGP) if ρ < 1, where

ρ = lim_{n→∞} ||P^n − Π||_{L^2(π)}^{1/n}. (5)

Note that the limit in (5) always exists (see e.g. [10]). The total variation distance of two probability measures µ and ν on Ω is defined by

||µ − ν||_TV = Σ_{i∈Ω} |µ(i) − ν(i)|.

We call the MC geometrically ergodic (GE) if there is a δ < 1 such that

K_δ(i) := sup_{n∈N} δ^{−n} ||p^n(i, ·) − π||_TV < ∞ for all i ∈ Ω. (6)

From [7] (Chapter 15) and [8] (Theorem 6.14 (iii)) it follows that the GE property is equivalent to the seemingly more restrictive condition

Σ_{i∈Ω} π(i) K_δ(i) < ∞ (7)

for some δ < 1, where K_δ is defined as in (6). Note that the δ in (7) may differ from the δ in (6).
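As a quick numerical sanity check (not part of the paper's development), the adjoint relation ⟨P f, g⟩_π = ⟨f, P* g⟩_π can be verified for any small chain; the 3-state kernel and the test functions below are arbitrary illustrative choices.

```python
import numpy as np

# An arbitrary irreducible 3-state transition matrix (illustrative choice,
# not from the paper).
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])

# Invariant distribution pi: normalized left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

# Reversed-chain kernel p*(i, j) = pi(j) p(j, i) / pi(i).
P_star = (P.T * pi[None, :]) / pi[:, None]

# Scalar product on L^2(pi).
def inner(f, g):
    return float(np.sum(f * g * pi))

f = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, 1.0, -1.0])

lhs = inner(P @ f, g)       # <P f, g>_pi
rhs = inner(f, P_star @ g)  # <f, P* g>_pi
```

The reversed kernel is automatically stochastic because π is invariant, and the two inner products agree up to floating-point error.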
Obviously, (7) implies that for some δ < 1 and some constant C

Σ_{i∈Ω} π(i) ||p^n(i, ·) − π||_TV ≤ C δ^n for all n ∈ N. (8)

It is certainly of interest to find the best rate of 'geometric convergence'. However, considering (6)-(8) there are three possibilities to define an optimal lower bound for this rate: Let

δ_0 = inf{δ : 0 < δ < 1 and (6) is satisfied}, (9)

and let δ_1 and δ_2 be defined analogously, with (6) replaced by (7) and (8), respectively.

Definition 1. Regarding the geometric rate of convergence we call δ_0 the optimal lower bound (OLB) in the weak sense, δ_1 the OLB in the strong sense and δ_2 the OLB in the L^1(π) sense.
It follows from the definitions that δ_0 ≤ δ_1 and δ_2 ≤ δ_1. Are these inequalities in general strict, and under which conditions do they become equalities? Moreover, are these OLBs attained? We start with an example.

Introductory example: the reversed winning streak
Let us consider the MC with state space N and transition probabilities

p(1, j) = 2^{−j} for all j ∈ N, p(i, i − 1) = 1 for i ≥ 2, (12)

the time reversal of the classical winning streak chain. Its invariant measure π is given by π(i) = 2^{−i}, i ∈ N. The crucial observation now is that p(1, ·) = π, which immediately generalizes to p^n(i, ·) = π for all n ≥ i. For arbitrary δ > 0 we conclude that

K_δ(i) = sup_{n∈N} δ^{−n} ||p^n(i, ·) − π||_TV ≤ 2(1/δ)^{i−1}.

Since this holds true for all δ > 0, we see that the OLB in the weak sense is zero, i.e., δ_0 = 0.
But of course the MC is not GE at rate zero (this rate of geometric ergodicity only occurs for MCs induced by a sequence of i.i.d. random variables); thus the infimum in (9) is not attained.
Next let us determine δ_1. Check that

K_δ(i) ≥ δ^{−(i−1)} ||p^{i−1}(i, ·) − π||_TV = δ^{−(i−1)}.

Now consider an arbitrary δ < 1 satisfying (10). Then

Σ_{i∈N} π(i) K_δ(i) ≥ Σ_{i∈N} 2^{−i} δ^{−(i−1)} < ∞,

which is of course equivalent to δ > 1/2. Hence,

δ_1 ≥ 1/2. (13)

On the other hand, if we choose δ = 1/2 + ε, we see that for any ε ∈ (0, 1/2) we have that K_δ(i) ≤ (2 − ε)^i. Moreover, a simple calculation shows that (7) is satisfied. This together with (13) implies that δ_1 = 1/2. The above reasoning implies that this MC is not GE with rate 1/2 in the strong sense. Regarding δ_2, so far we only know that δ_2 ≤ 1/2. Its exact value will be derived in the next section, where we will also see how the different rates of convergence occur in a natural way when trying to bound δ*_0, the OLB of the reversed chain in the weak sense, by the OLBs of the original MC.
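The computations of this section can be reproduced numerically on a finite truncation. The sketch below assumes, as the section title suggests, that the chain is the time reversal of the winning streak chain (a jump from state 1 according to π, deterministic descent otherwise); the truncation to N states lumps the tail mass of the first row into the last state, which introduces errors of order 2^{-N}.

```python
import numpy as np

N, n_max = 40, 15   # truncation level and time horizon

# (Assumed) reversed winning streak, truncated to N states:
# p(1, j) = 2^{-j} (tail mass lumped into state N), p(i, i-1) = 1 for i >= 2.
P = np.zeros((N, N))
P[0, :] = [2.0 ** -(j + 1) for j in range(N)]
P[0, -1] += 2.0 ** -N            # row 1 now sums to 1
for i in range(1, N):
    P[i, i - 1] = 1.0

# Invariant distribution; close to pi(i) = 2^{-i} for this truncation.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# tv[n, i-1] = ||p^n(i, .) - pi||_TV.  From state i the chain reaches
# state 1 deterministically in i-1 steps and is then exactly stationary.
tv = np.zeros((n_max + 1, N))
Pn = np.eye(N)
for n in range(1, n_max + 1):
    Pn = Pn @ P
    tv[n] = np.abs(Pn - pi).sum(axis=1)

# K_delta(i) = sup_n delta^{-n} ||p^n(i, .) - pi||_TV is finite for every
# delta > 0, with K_delta(i) <= 2 (1/delta)^{i-1}; hence delta_0 = 0.
delta = 0.3
K = np.array([max(tv[n, i] / delta ** n for n in range(1, n_max + 1))
              for i in range(N)])
```

The assertions below confirm stationarity after i steps (up to truncation error) and the bound K_δ(i) ≤ 2(1/δ)^{i−1}, here with the arbitrary choice δ = 0.3.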

The reversed chain
Assuming that a MC ξ_1, ξ_2, ... is GE, what can we say about the reversed MC ξ*_1, ξ*_2, ...? We show that the GE property is preserved under time reversal, but the behavior of the OLBs is more complicated.

Theorem 1. If a MC is GE, then the reversed MC is also GE.
Actually, we have just shown that δ*_2 = δ_2, where δ*_2 denotes the OLB of the reversed MC in the L^1(π) sense.
For every δ > δ_2 there is a constant C such that the right-hand side of (16) is at most Cδ^n for all n.
Replacing p(·, ·) by p*(·, ·) and carrying out the same calculations as in (16) completes the proof.

Let us apply Theorem 2 to the example in Section 2. The transition matrix of the reversed MC is given by

p*(i, 1) = p*(i, i + 1) = 1/2 for all i ∈ N, (17)

the classical winning streak chain. This MC has a remarkable feature: there is a central state in the sense that this state can be reached from any other one in a single step with probability 1/2. This property immediately implies that

inf_{i∈N} p*(i, 1) = 1/2. (18)

It is interesting that (18) implies the classical condition which was used by Döblin [2] in order to establish uniform geometric convergence to the invariant measure (with respect to total variation) for certain Markov chains, i.e.,

∃δ < 1 : sup_{i,j∈Ω} ||p*(i, ·) − p*(j, ·)||_TV ≤ 2δ.

Note that this is a stronger property than (6).
In [6] it is shown that (18) implies that

sup_{i∈N} ||p*^n(i, ·) − π||_TV ≤ 2 (1/2)^n for all n ∈ N (19)

(the constant 2 does not appear in [6] due to a different definition of the total variation norm). The proof is based on a coupling argument in which (18) is used to bound the expected coupling time, which in turn leads to the estimate for the total variation (see [6]). The factor 1/2 in (19) is optimal in the sense that it is as small as possible. From (19) it now follows immediately that

δ*_0 = δ*_1 = δ*_2 = 1/2. (20)

The situation is completely different from what we have seen for the original chain, for which it has been shown that δ_0 = 0 < 1/2 = δ_1. Let us determine δ_2, which had been left open at the end of Section 2. From Theorem 2 and (20) it follows that δ_2 = δ*_2 = 1/2. A closer look at the proof of Theorem 2 yields even more: the OLB in the L^1(π) sense, δ_2, is in fact attained. Recall that this was not the case for δ_0 and δ_1.
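The uniform bound (19) is easy to check numerically. The sketch below assumes the reversed chain is the classical winning streak (probability 1/2 to the central state, probability 1/2 one step up), truncated with a self-loop at the boundary; the central-state property survives the truncation, so the coupling bound should hold there as well.

```python
import numpy as np

N = 30

# (Assumed) truncated winning streak chain: from every state jump to the
# central state 1 with probability 1/2, otherwise move one step up
# (self-loop at the truncation boundary N).
Q = np.zeros((N, N))
Q[:, 0] = 0.5
for i in range(N - 1):
    Q[i, i + 1] = 0.5
Q[N - 1, N - 1] += 0.5

evals, evecs = np.linalg.eig(Q.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Coupling at the central state gives the uniform bound of type (19):
# sup_i ||q^n(i, .) - pi||_TV <= 2 (1/2)^n for all n.
Qn = np.eye(N)
sup_tv = []
for n in range(1, 21):
    Qn = Qn @ Q
    sup_tv.append(float(np.abs(Qn - pi).sum(axis=1).max()))
```

The second assertion also checks, in a crude form, that the rate 1/2 cannot be improved: starting from an interior state, the chain still sits on its deterministic up-path with probability 2^{-n}.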

Reversible Markov chains
In this section we show that for reversible MCs δ_0, δ_1 and δ_2 coincide under the (rather weak) condition that the invariant distribution π has a finite (1 + ε)-moment.

Theorem 3. If a MC is reversible, GE and its invariant distribution π has a finite (1 + ε)-moment for some ε > 0, then

δ_0 = δ_1 = δ_2,

and all these OLBs are attained.
Proof: Without loss of generality we can assume that Ω = N. Now we apply the spectral representation theorem (see e.g. [10]) with the spectral measure associated to P − Π and (δ_i/π)/||δ_i/π||_{L^2(π)}, E_λ denoting the corresponding projection operator. We obtain

From (22) and (23) it follows that

We have

where the first two inequalities follow from Cauchy-Schwarz and the identities (23)-(24), respectively. The last inequality follows from the definition of ρ. From the equivalence of (i) and (iii) in Theorem 2.1 of [12] it follows that the upper bound ρ_i for the rate in (25) is optimal in the sense that

This implies that δ_0 = sup_{j∈Ω} ρ_j.

Now let us prove (21). By (26), it is enough to show that

We obtain

From the last proof we immediately obtain

Corollary 2. For a reversible MC the following two statements are equivalent:

The estimate in (27) is the well-known ℓ^1-bound for the total variation in terms of the spectral radius. For Markov chains with finite state space this can be found in [13].
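The ℓ¹-bound just mentioned can be illustrated on a concrete reversible chain; the birth-death chain below is an arbitrary choice, and the estimate used is the classical reversible bound ||p^n(i, ·) − π||_TV ≤ sqrt((1 − π(i))/π(i)) ρ^n, with ρ the spectral radius of P − Π on L^2(π).

```python
import numpy as np

# An arbitrary reversible birth-death chain on five states (illustrative).
N, up, down = 5, 0.3, 0.2
P = np.zeros((N, N))
for i in range(N):
    if i + 1 < N:
        P[i, i + 1] = up
    if i > 0:
        P[i, i - 1] = down
    P[i, i] = 1.0 - P[i].sum()

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Reversibility: D^{1/2} P D^{-1/2} is symmetric, so the spectrum is real.
Dh = np.diag(np.sqrt(pi))
S = Dh @ P @ np.linalg.inv(Dh)
spec = np.linalg.eigvalsh((S + S.T) / 2)

# Spectral radius of P - Pi on L^2(pi): largest |eigenvalue| besides 1.
rho = max(abs(l) for l in spec if abs(l - 1.0) > 1e-8)

# Classical l^1-bound for reversible chains:
# ||p^n(i, .) - pi||_TV <= sqrt((1 - pi(i)) / pi(i)) * rho^n,
# so the total-variation rates coincide with the spectral radius rho.
bound_holds = True
Pn = np.eye(N)
for n in range(1, 31):
    Pn = Pn @ P
    tv = np.abs(Pn - pi).sum(axis=1)
    bound_holds &= bool(np.all(tv <= np.sqrt((1 - pi) / pi) * rho ** n + 1e-12))
```

Birth-death chains are always reversible, which is what makes the symmetrized matrix S exactly symmetric here.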

Geometric ergodicity and spectral theory
The following theorem, due to [11] and [12], shows the close connection between geometric ergodicity and the spectral gap property.
The original proof of this result can be found in [12]. A very short derivation of the first part was given in [14]. The key observation there was that the spectral radius of a MC can be expressed by a rescaled function of a sequence of isoperimetric constants (see Theorem 5 below). It turns out that these rescaled constants are a suitable tool for studying geometric ergodicity in the sense that they can be related to the different notions of geometric speed of convergence.
The isoperimetric constants in question are

where

The following theorem from [14] relates spectral properties to the rescaled limits of isoperimetric constants.
Theorem 5. Assume that the operator P is normal. Then the spectral radius ρ is given by

In particular, for reversible Markov chains this yields

Moreover, if P is in addition positive, we have

Based on this result, we can show

Theorem 6. If the underlying MC is GE, then

If P is in addition normal, then the MC satisfies the SGP and the spectral radius ρ can be estimated by

Proof: An easy calculation shows that

Hence, for every ε ∈ (0,

This proves the first assertion of the theorem. The first inequality in (31) follows from the second part of Theorem 4. Let us prove the second inequality. It was shown in [14] that for l < n we have

Thus, by (34) and (32),

(1 − k_{P*^l P^l}(A))

Now first letting n → ∞, then taking the supremum over all A ⊂ Ω, thereafter letting l → ∞ and applying Theorem 5 yields ρ ≤ δ_2.
From this theorem we immediately obtain

Corollary 3. If P is normal, then the following statements are equivalent:

Next we want to prove the equivalence in Corollary 3 for certain non-reversible MCs. Note that normality of the operator P is only needed to ensure that (34) holds. So it seems natural to start with a modified version of (34). Define

Corollary 4. Assume that for every A ⊂ Ω the sequence (a(n, A))_{n∈N} has a nondecreasing subsequence (a(n_k, A))_{k∈N} with n_1 = 1. Then the GE property and the SGP are equivalent and

where κ ≥ 1 is a constant which does not depend on the underlying MC.
Note that the subsequence (n_k)_{k≥2} is allowed to depend on A. The fact that κ ≥ 1 has been established in [5], from which the following definition of κ is taken: Let D denote the set of all possible distributions of pairs (X, Y) of i.i.d. random variables each having variance 1. Then

Proof: The implication SGP ⟹ GE can be derived in a similar way as (25). More precisely, in the derivation of (25) we have to take the adjoint in the inner product, i.e. to replace P^n − Π by P*^n − Π. The result follows by applying Cauchy-Schwarz in (25) and the fact that

GE ⟹ SGP follows immediately from (37), since δ_2 < 1 implies ρ < 1. So let us show (37). Since (a(n_k, A))_{k∈N} is nondecreasing, we can carry out the same calculation as in the proof of Theorem 6 with n replaced by n_k. By assumption, we have n_1 = 1 for all A ⊂ Ω. This yields

which implies that

(1 − k_{P*P})

Now (37) follows from Proposition 1 of [16].
Because of its generality, the upper bound in (37) is not sharp in most cases. In order to improve this upper bound for certain MCs we show the following generalization of Theorem 5.
We need the Hilbert space L^2_0(π) := {f ∈ L^2(π) : Π f = 0}.

Theorem 7. For a positive recurrent MC the spectral radius ρ = ρ(P) of the associated Markov operator P on L^2_0(π) is given by

Proof: Since P*^n P^n is positive and selfadjoint, Theorem 5 yields

By the Rayleigh-Ritz principle (see e.g. [5]) it follows that

Since the left-hand side in (41) equals ||P^n||^2_{L^2_0(π)}, we obtain

Now n → ∞ leads to the assertion.
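The mechanism of this proof can be checked numerically for an arbitrary (generically non-reversible) finite chain: representing P in isometric coordinates for L^2(π) and restricting to L^2_0(π), the quantity ρ(P*^n P^n) equals the squared operator norm of P^n, and its 2n-th roots approach the spectral radius. The chain below is a random illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary (generically non-reversible) 4-state chain.
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Isometric coordinates for L^2(pi): f -> diag(sqrt(pi)) f.  The constants
# correspond to v = sqrt(pi); L^2_0(pi) corresponds to the complement v-perp.
Dh = np.diag(np.sqrt(pi))
M = Dh @ P @ np.linalg.inv(Dh)
v = np.sqrt(pi)
Q, _ = np.linalg.qr(np.column_stack([v, np.eye(4)]))
B = Q[:, 1:]                    # orthonormal basis of v-perp
M0 = B.T @ M @ B                # P restricted to L^2_0(pi); adjoint = M0.T

def norm_sq(n):
    """rho(P*^n P^n) on L^2_0(pi) = largest eigenvalue of (M0^n).T M0^n."""
    Mn = np.linalg.matrix_power(M0, n)
    return np.linalg.eigvalsh(Mn.T @ Mn).max()

# Theorem 7: rho(P) = lim_n rho(P*^n P^n)^{1/(2n)}.
rho = max(abs(l) for l in np.linalg.eigvals(M0))
approx = norm_sq(60) ** (1.0 / 120.0)
```

The exact identity ρ(P*^n P^n) = ||P^n||² is pure linear algebra (largest squared singular value), while the limit statement only holds approximately at finite n.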
Corollary 5. Assume that there exists an n_0 ∈ N such that

P*^n P^n = (P* P)^n for all n ≥ n_0. (42)

Then

and

Proof: From Theorem 7 it follows that

The inequalities (43) can be shown in the same way as in the proof of Theorem 6.
The upper bound in (43) is better than that in (37). To show this, note that since we do not know the exact value of κ, the estimate (37) can only be applied with κ = 1. Therefore we have to prove that

Actually, δ_2 is smaller than the right-hand side of (37) whenever max_{δ∈[0,1]} (1 − δ)(1 + δ)^2 ≤ 8/κ. This is the case as long as κ ≤ 27/4.
Observe that normality of a MC implies condition (42). Let us again consider the example of Section 2 to show that this implication cannot be reversed. Let P and P* be given by (12) and (17), respectively. It can be readily seen that for i ≥ 2 and j ∈ N we have

(P* P)(i, j) = π(j)/2 + 1_{{i=j}}/2

and

(P P*)(i, j) = 1_{{j=1}}/2 + 1_{{i=j}}/2.

This implies that P* P ≠ P P*, so the MC is not normal. However, a short calculation shows that

P (P* P) = (P* P) P. (45)

By (45) (and its adjoint (P* P) P* = P* (P* P)),

P*^3 P^3 = P*(P*^2 P^2) P = P*(P* P)^2 P = (P* P)^2 P* P = (P* P)^3. (46)

By complete induction, it is now seen that (42) is satisfied with n_0 = 2.
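The identities of this paragraph can be checked on a finite truncation of the Section 2 chain (same assumed truncation as before; the last rows are affected by boundary effects and are excluded from the comparisons). One finds P*P ≈ (I + Π)/2 away from the boundary, a visibly different PP*, and a norm-halving property of P on L^2_0(π).

```python
import numpy as np

N = 25

# Truncated Section 2 chain (assumed form): p(1, j) = 2^{-j} with the tail
# mass lumped into state N, and p(i, i-1) = 1 for i >= 2.
P = np.zeros((N, N))
P[0, :] = [2.0 ** -(j + 1) for j in range(N)]
P[0, -1] += 2.0 ** -N
for i in range(1, N):
    P[i, i - 1] = 1.0

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

Pstar = (P.T * pi[None, :]) / pi[:, None]   # reversed kernel (winning streak)
Pi_op = np.tile(pi, (N, 1))                 # the rank-one operator Pi

A = Pstar @ P                               # P* P
B = P @ Pstar                               # P P*

# Away from the truncation boundary, P* P agrees with (I + Pi)/2, while
# P P* has a different form; hence P* P != P P* and P is not normal.
target = (np.eye(N) + Pi_op) / 2
not_normal = float(np.abs(A - B).max()) > 0.1

# Each application of P halves the squared L^2_0(pi)-norm: take f with
# pi-mean zero, supported away from the boundary.
f = np.zeros(N)
f[0], f[1] = 1.0, -2.0                      # pi(1) f(1) + pi(2) f(2) = 0
norm2 = lambda g: float(np.sum(pi * g * g))
halving = norm2(P @ f) / norm2(f)           # should be 1/2
```

The last assertion below also confirms numerically that P*P commutes with P on the interior, which is the content of (45).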
The spectral gap in this example has already been determined in [14]. We give a very short alternative derivation. From Corollary 5 it follows that ρ(P) = ρ(P* P)^{1/2}. But

P* P = (I + Π)/2,

where I denotes the identity operator, i.e., I f = f. Since P* P is selfadjoint, we obtain ρ(P* P) = 1/2 on L^2_0(π), and hence ρ(P) = (1/2)^{1/2}. Note that the inequality ρ^2 ≤ δ_2 = 1/2, which has been derived in Corollary 5, is in fact sharp! We can use this in order to obtain an estimate for κ. Inserting ρ = (1/2)^{1/2} into (37), we obtain that κ ≤ 64/9.
The computations in the proof of Theorem 3 lead to the following modification of Corollary 5:

Corollary 6. If the operator P of a geometrically ergodic MC satisfies (42) and the invariant distribution π has a finite (1 + ε)-moment for some ε > 0, then

The following result provides lower bounds for δ_0 and δ_2.

Theorem 8. If (k_{2n}(A))_{n∈N} is nondecreasing, we even have

Moreover, for every sequence (A_{2n})_{n∈N} with lim_{n→∞}

Proof: We only show the third inequality of Theorem 8, because the proofs of the others are similar. We have by assumption that, for arbitrary δ > δ_2,

≤ lim sup_{n→∞} C(δ)

Now δ → δ_2 and n_0 → ∞ yields the result.
Let us apply this result to our example. A good choice of the set A is of key importance in order to obtain a non-trivial lower bound. We try A = {2, 4, 6, 8, ...}. Then

This implies that

for all n. Applying Theorem 8 yields

By what has been shown before, this bound is again sharp. One can prove that the above choice of A is optimal in the sense that k_{2n}(A) = k_{2n}.
So we have just seen that in our example we have

It would be nice to have this relation in general, at least asymptotically, but it fails to be true. In the next section we consider an example (originally due to Häggström [3]) of a MC that is GE and satisfies k_{2n} = 0 for all n ∈ N. In this example the left-hand side in (55) is equal to one for every n, but by geometric ergodicity the right-hand side in (55) is less than one.
Kontoyiannis and Meyn [4] have proved that geometric ergodicity and the SGP are not equivalent using the same example, but with a different argument based on a Lyapunov function approach.
Häggström [3] originally used the example in order to present a sequence of random variables, connected to a geometrically ergodic MC and having finite second moments, that does not satisfy the central limit theorem. In fact, this result implies that the MC cannot satisfy the SGP, since by a theorem due to Cogburn [1] the central limit theorem holds for every sequence of random variables connected to a Markov chain satisfying the SGP and having finite second moments.
We now show that

We start from the observation

This yields that 1/2 is an upper bound for δ_0. To see that 1/2 is also a lower bound, note that

To see that 1/2 is also a lower bound for δ_2, we calculate 1 − k_{2n}(A_{2n}) for A_{2n} = {0} ∪ {(a, a) : a ∈ {1, 2, ..., 2n}}, n ≥ 2.