Geometric Ergodicity and Hybrid Markov Chains

Various notions of geometric ergodicity for Markov chains on general state spaces exist. In this paper, we review certain relations and implications among them. We then apply these results to a collection of chains commonly used in Markov chain Monte Carlo simulation algorithms, the so-called hybrid chains. We prove that under certain conditions, a hybrid chain will "inherit" the geometric ergodicity of its constituent parts.


Introduction
A question of increasing importance in the Markov chain Monte Carlo literature (Gelfand and Smith, 1990; Smith and Roberts, 1993) is the issue of geometric ergodicity of Markov chains (Tierney, 1994, Section 3.2; Meyn and Tweedie, 1993, Chapters 15 and 16; Roberts and Tweedie, 1996). However, there are a number of different notions of the phrase "geometrically ergodic", depending on perspective (total variation distance vs. in L²; with reference to a particular V-function; etc.). One goal of this paper is to review and clarify the relationships between such differing notions. We first discuss a general result (Proposition 1) giving the equivalence of a number of related ergodicity notions, involving total variation distance and V-uniform norms. Some of these equivalences follow from standard treatments of general state space Markov chains (Nummelin, 1984; Asmussen, 1987; Meyn and Tweedie, 1993), though they may not have previously been stated explicitly. Others of these equivalences are related to "solidarity properties" (Nummelin and Tweedie, 1978; Vere-Jones, 1962), regarding when a geometric rate of convergence can be chosen independently of the starting position. We then turn to L² theory, and discuss (Theorem 2) a number of equivalences of geometric ergodicity of reversible Markov chains, on L¹ and L². The essence of our analysis is the spectral theorem (e.g. Rudin, 1991; Reed and Simon, 1972; Conway, 1985) for bounded self-adjoint operators on a Hilbert space. Again, we believe that these equivalences are known, though they may not have been explicitly stated in this way. We further show that the conditions of Proposition 1 imply the conditions of Theorem 2.
We are unable to establish the converse in complete generality; however, we note that the two sets of conditions are equivalent for most chains which arise in practice. We also argue (Corollary 3) that any of these equivalent conditions is sufficient to establish a functional central limit theorem for empirical averages of L² functions. We then turn our attention (Section 3) to the geometric ergodicity of various "hybrid" Markov chains which have been suggested in the literature (Tierney, 1994, Section 2.4; Chan and Geyer, 1994; Green, 1994). After a few preliminary observations, we prove (Theorem 6) that under suitable conditions, hybrid chains will "inherit" the geometric ergodicity of their constituent chains. This suggests the possibility of establishing the geometric ergodicity of large and complicated Markov chain algorithms, simply by verifying the geometric ergodicity of the simpler chains which give rise to them. We note that there are various alternatives to considering distributional convergence properties of Markov chains, such as considering the asymptotic variance of empirical estimators (cf. Geyer, 1992; Greenwood et al., 1995); but we do not pursue that here.

Equivalences of geometric ergodicity
We now turn our attention to results for geometric convergence. We consider a φ-irreducible, aperiodic Markov chain P(x, ·) on a state space X, with stationary distribution π(·). [Recall that a chain is φ-irreducible if there is a non-zero measure φ on X, such that if φ(A) > 0, then P_x(τ_A < ∞) > 0 for all x ∈ X.] We shall sometimes assume that the underlying σ-algebra on X is countably generated; this is assumed in much of general state space Markov chain theory (cf. Jain and Jamison, 1967; Meyn and Tweedie, 1993, pp. 107 and 516), and is also necessary to ensure that the function x ↦ ‖P(x, ·) − π(·)‖_var is measurable (see Appendix). We recall that P acts to the right on functions and to the left on measures, so that

(Pf)(x) = ∫_X f(y) P(x, dy)   and   (µP)(A) = ∫_X P(x, A) µ(dx).

Call a subset S ⊆ X hyper-small if π(S) > 0 and there are δ_S > 0 and k ∈ IN such that p_k(x, y) ≥ δ_S for all x, y ∈ S, where p_k(x, y) is the density with respect to π of the absolutely continuous component of P^k(x, ·). A fundamental result of Jain and Jamison (1967), see also Orey (1971), states that on a countably generated measure space, every set of positive π-measure contains a hyper-small set. Following Meyn and Tweedie (1993, p. 385), given a positive function V : X → IR, we let L∞_V be the vector space of all functions f from X to IR for which the norm |f|_V = sup_{x∈X} |f(x)| / V(x) is finite. We denote operator norms as usual, viz. ‖A‖_{L∞_V} = sup{ |Af|_V ; |f|_V ≤ 1 }.
The following proposition borrows heavily from Meyn and Tweedie (1993) and Nummelin and Tweedie (1978). Note that the equivalence of (i) and (i') below shows that the geometric rate ρ may be chosen independently of the starting point x ∈ X, though in general (i.e. for chains that are not "uniformly ergodic"), the multiplicative constants C_x will depend on x. Most of the remaining conditions concern the existence and properties of a geometric drift function V.
Proposition 2.1 The following are equivalent, for a φ-irreducible, aperiodic Markov chain P(x, ·) on a countably generated state space X, with stationary distribution π(·).
(i) The chain is π-a.e. geometrically ergodic, i.e. there are ρ < 1 and constants C_x < ∞ for each x ∈ X, such that for π-a.e. x ∈ X,

‖P^n(x, ·) − π(·)‖_var ≤ C_x ρ^n,   n ∈ IN.

(i') There are constants ρ_x < 1 and C_x < ∞ for each x ∈ X, such that for π-a.e. x ∈ X,

‖P^n(x, ·) − π(·)‖_var ≤ C_x ρ_x^n,   n ∈ IN.

(i'') There is a hyper-small set S ⊆ X, and constants ρ_S < 1 and C_S < ∞, such that

sup_{x∈S} ‖P^n(x, ·) − π(·)‖_var ≤ C_S ρ_S^n,   n ∈ IN.

(ii) There exists a π-a.e.-finite measurable function V : X → [1, ∞], which may be taken to have π(V^j) < ∞ for any fixed j ∈ IN, such that the chain is V-uniformly ergodic, i.e. for some ρ < 1 and some fixed constant C < ∞,

‖P^n(x, ·) − π(·)‖_V ≤ C V(x) ρ^n,   x ∈ X, n ∈ IN,

where ‖µ‖_V = sup_{|f| ≤ V} |µ(f)|.

(iii) There exists a π-a.e.-finite measurable function V : X → [1, ∞], which may be taken to have π(V^j) < ∞ for any fixed j ∈ IN, and a positive integer n, such that

‖P^n − Π‖_{L∞_V} < 1,

where Π is the operator Π(x, ·) = π(·).

(iv) There exists a π-a.e.-finite measurable function V : X → [1, ∞], which may be taken to have π(V^j) < ∞ for any fixed j ∈ IN, and a positive integer n, such that

‖P^n‖_{L∞_{V,0}} < 1,

where L∞_{V,0} = {f ∈ L∞_V ; π(f) = 0}.

Remark 2.1 1. By the spectral radius formula (e.g. Rudin, 1991, Theorem 10.13), r(T) = inf_{n≥1} ‖T^n‖^{1/n}, so that the displayed conditions in (iii) and (iv) above can be restated as r(P − Π) < 1 and r(P |_{L∞_{V,0}}) < 1, respectively.
2. We may use the same function V for each of (ii), (iii), and (iv) above.

Proof: (i') ⇒ (i''): We note that the ρ_x and C_x may not be measurable as functions of x ∈ X. However, since X is countably generated, the function x ↦ ‖P^n(x, ·) − π(·)‖_var is measurable (see Appendix). Thus, we can define r_x = lim inf_{n→∞} ‖P^n(x, ·) − π(·)‖_var^{1/n} and K_x = sup_n ‖P^n(x, ·) − π(·)‖_var / r_x^n; these are π-a.e.-finite measurable functions with π(r_x < 1) = 1. Then we can find r < 1 and K < ∞ so that the set B = {x ∈ X ; r_x ≤ r, K_x ≤ K} has π(B) > 0. By Jain and Jamison (1967), there is a hyper-small subset S ⊆ B. For x ∈ S, we have ‖P^n(x, ·) − π(·)‖_var ≤ K r^n, proving (i'').

(i'') ⇒ (i): This is a solidarity property, and follows from Nummelin and Tweedie (1978), which generalizes the countable state space results of Vere-Jones (1962).

(i) ⇐⇒ (ii): Obviously (ii) implies (i). That (i) implies (ii) follows from Meyn and Tweedie (1993). Indeed, the existence of such a V follows from their Theorem 15.0.1 (i) and Theorem 5.2.2, the finiteness of E_π(V) follows from their Theorem 15.0.1 (iii) and Theorem 14.3.7 (see also Meyn and Tweedie, 1994, Proposition 4.3 (i)), and the possibility of having E_π(V^j) < ∞ follows from their Lemma 15.2.9.

(ii) ⇐⇒ (iii): Clearly, (ii) is equivalent to the existence of specified V, ρ, and C, such that ‖P^n − Π‖_{L∞_V} ≤ C ρ^n for all n ∈ IN. Taking j = 1 so that π(V) < ∞, we see that this expression will be less than 1 for sufficiently large n, proving (iii). 2

We now turn attention to results involving the spectral theory of bounded self-adjoint operators on Hilbert spaces. Given a Markov operator P with stationary distribution π, a number 1 ≤ p < ∞, and a signed measure µ on X, define ‖µ‖_{L^p(π)} by

‖µ‖_{L^p(π)} = ( ∫_X |dµ/dπ|^p dπ )^{1/p} if µ << π, and ‖µ‖_{L^p(π)} = ∞ otherwise

(note that if µ << π and p = 1, the two definitions coincide), set L^p(π) = {µ ; ‖µ‖_{L^p(π)} < ∞}, and set ‖P‖_{L^p(π)} = sup{ ‖µP‖_{L^p(π)} ; ‖µ‖_{L^p(π)} = 1 }. It is well known (see e.g. Baxter and Rosenthal, 1995, Lemma 1) that we always have ‖P‖_{L^p(π)} ≤ 1. Recall that P is reversible with respect to π if π(dx)P(x, dy) = π(dy)P(y, dx) as measures on X × X; this is equivalent to P being a self-adjoint operator on the Hilbert space L²(π), with inner product given by ⟨µ, ν⟩ = ∫_X (dµ/dπ)(dν/dπ) dπ.
If P is reversible w.r.t. π, and dµ/dπ = f, then d(µP)/dπ = Pf, as is easily checked by writing out the definitions and using reversibility. In particular, the action of P on signed measures µ ∈ L²(π) is precisely equivalent to the action of P on functions f with π(f²) < ∞. We shall apply the usual spectral theory (e.g. Rudin, 1991; Reed and Simon, 1972; Conway, 1985) to the operator P acting on measures in L²(π).
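On a finite state space these objects are concrete. The following sketch (our illustration, not from the paper) uses a hypothetical reversible two-state chain, for which the operator norm of P restricted to mean-zero elements of L²(π) is known in closed form to be the second eigenvalue |1 − a − b|; a power iteration in the π-weighted inner product recovers it.

```python
# Hypothetical illustration (not from the paper): for a reversible
# two-state chain, P is self-adjoint on L^2(pi), and its operator norm
# restricted to mean-zero elements equals the second-largest absolute
# eigenvalue, here |1 - a - b| in closed form.
a, b = 0.3, 0.2                      # P(0 -> 1) = a, P(1 -> 0) = b
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]      # stationary (and reversible) distribution

def apply_P(f):
    """(P f)(x) = sum_y P(x, y) f(y): the action of P on functions."""
    return [P[x][0] * f[0] + P[x][1] * f[1] for x in range(2)]

def l2_norm(f):
    """Norm in L^2(pi)."""
    return (pi[0] * f[0] ** 2 + pi[1] * f[1] ** 2) ** 0.5

def project_mean_zero(f):
    """Project onto the mean-zero subspace {f : pi(f) = 0}."""
    m = pi[0] * f[0] + pi[1] * f[1]
    return [f[0] - m, f[1] - m]

# Power iteration on the mean-zero subspace.  (Here that subspace is
# one-dimensional, so convergence is immediate; a larger chain would
# genuinely need the iterations.)
f = project_mean_zero([1.0, 0.0])
for _ in range(100):
    f = [v / l2_norm(f) for v in f]  # normalize in L^2(pi)
    g = apply_P(f)
    est = l2_norm(g)                 # estimate of ||P|| on mean-zero elements
    f = project_mean_zero(g)
```

With a spectral gap, est stays strictly below 1, corresponding to condition (ii) of the theorem below holding for this toy chain.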
Theorem 2.1 Let P be a Markov operator on a state space X, reversible with respect to the probability measure π (so that P is self-adjoint on L²(π)). Then the following are equivalent, and furthermore are all implied by any of the equivalent conditions of Proposition 2.1.

(i) P is L¹(π)-geometrically ergodic: for each µ ∈ L²(π), there are ρ_µ < 1 and C_µ < ∞ such that ‖µP^n − π‖_{L¹(π)} ≤ C_µ ρ_µ^n for all n ∈ IN.

(ii) The spectral radius of P, restricted to the subspace L²_0(π) = {µ ∈ L²(π) ; µ(X) = 0}, is less than 1.

(iii) P is L²(π)-geometrically ergodic: there is ρ < 1 such that for each µ ∈ L²(π) there is C_µ < ∞ with ‖µP^n − π‖_{L²(π)} ≤ C_µ ρ^n for all n ∈ IN.
Relation to Proposition 2.1: It remains to show that these conditions are implied by the equivalent conditions of Proposition 2.1. We show that condition (iii) is implied by Proposition 2.1 part (ii) (with j = 2). Indeed, for µ ∈ L²(π), using the triangle inequality and the Cauchy-Schwarz inequality, we obtain ‖µP^n − π‖_{L²(π)} ≤ C_µ ρ^n for an appropriate constant C_µ < ∞. 2

Remark 2.3 1. For most reversible chains which arise, the conditions of Proposition 2.1 and Theorem 2.1 are actually equivalent. For example, this is obviously true if the point masses δ_x are all in L²(π) (which holds for an irreducible chain on a discrete measure space), or if P(x, ·) ∈ L²(π) for all x. More generally, if the chain is of the form P(x, ·) = (1 − a_x)δ_x + a_x ν_x(·), where ν_x ∈ L²(π) and a_x > 0, then the two sets of conditions are again seen to be equivalent, since if condition (iii) of Theorem 2.1 holds, then condition (i') of Proposition 2.1 also holds. This includes most examples of Metropolis-Hastings algorithms (Metropolis et al., 1953; Hastings, 1970). However, we do not know if these conditions are equivalent in complete generality.
2. We note that a number of other mixing conditions for Markov chains are available, though we do not pursue them here. See for example Rosenblatt (1962, Sections V b and VIII d), and Carmona and Klein (1983).
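For an irreducible chain on a discrete space, where (as noted in the remark above) the two sets of conditions coincide, geometric ergodicity in the sense of condition (i) of Proposition 2.1 can be checked numerically. The following toy check (ours, with a hypothetical three-state chain) computes ‖P^n(x, ·) − π(·)‖_var directly and observes the geometric rate.

```python
# Toy numerical check (ours, not the paper's): for a hypothetical
# reversible three-state chain, compute ||P^n(x, .) - pi||_var directly
# and observe the geometric decay of condition (i) of Proposition 2.1.

def mat_mult(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# A lazy random walk on {0, 1, 2}, reversible with respect to pi.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]

def tv_dist(row):
    """Total variation distance between a row of P^n and pi."""
    return 0.5 * sum(abs(row[j] - pi[j]) for j in range(3))

Pn = [row[:] for row in P]
dists = []
for n in range(1, 21):
    dists.append(max(tv_dist(Pn[x]) for x in range(3)))
    Pn = mat_mult(Pn, P)

# The worst-case distance decays geometrically; the ratio of successive
# distances settles at the second-largest eigenvalue of P (here 1/2).
ratios = [dists[i + 1] / dists[i] for i in range(10, 19)]
```

The observed ratio 1/2 is exactly the geometric rate ρ of Proposition 2.1 for this chain, and also its L²(π) spectral radius on mean-zero measures.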
Finally, we consider the existence of central limit theorems for empirical averages of a function g : X → IR with π(g) = 0. The Markov Functional Central Limit Theorem states that if the asymptotic variance σ²_g = lim_{n→∞} (1/n) E_π[ e_g(n)² ] is finite, where e_g(S) = Σ_{j=1}^{⌊S⌋} g(X_j), then the processes n^{−1/2} e_g(nt), t ≥ 0, converge weakly to Brownian motion with variance σ²_g.
On the other hand, if the Markov operator P has spectral radius ρ < 1, as in condition (ii) of Theorem 2.1, then it is easily seen (cf. Geyer, 1992, Section 3.5) that σ²_g ≤ ((1 + ρ)/(1 − ρ)) π(g²) < ∞ whenever π(g²) < ∞, so that the above Markov Functional Central Limit Theorem applies. We thus obtain:

Corollary 2.1 Let the reversible Markov operator P satisfy any one equivalent condition of Proposition 2.1 or of Theorem 2.1 (e.g., let P be geometrically ergodic). Let g : X → IR with π(g) = 0 and π(g²) < ∞. Then σ²_g < ∞, and as n → ∞, the processes n^{−1/2} Σ_{j=1}^{⌊nt⌋} g(X_j), t ≥ 0, converge weakly (in the Skorohod topology, on any finite interval) to Brownian motion with variance σ²_g. In particular, setting t = 1, the random variable n^{−1/2} Σ_{j=1}^{n} g(X_j) converges in distribution to a normal distribution with mean 0 and variance σ²_g.

This corollary directly generalizes Chan and Geyer (1994, Theorem 2) for reversible chains; in particular, it shows that their condition π(|g|^{2+ε}) < ∞ for some ε > 0 is then unnecessary. See Chan and Geyer (1994), and references therein, for some other approaches to central limit theorems for Markov chains, including those involving drift conditions (cf. Meyn and Tweedie, 1993, Theorem 17.5.4).
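For a finite-state reversible chain, the asymptotic variance can be evaluated directly. This hypothetical two-state example (ours, not the paper's) computes σ²_g = π(g²) + 2 Σ_{k≥1} π(g · P^k g) by truncating the geometric series, and compares it with the standard spectral bound ((1 + ρ)/(1 − ρ)) π(g²) (cf. Geyer, 1992), where ρ is the spectral radius of P on mean-zero functions.

```python
# Hypothetical two-state illustration (ours): compute the asymptotic
# variance sigma^2_g = pi(g^2) + 2 * sum_{k>=1} pi(g * P^k g) by
# truncating the series, and check it against the spectral bound
# (1 + rho)/(1 - rho) * pi(g^2).
P = [[0.7, 0.3], [0.2, 0.8]]
pi = [0.4, 0.6]
g = [1.0, -2.0 / 3.0]                 # chosen so that pi(g) = 0

def apply_P(f):
    """(P f)(x) = sum_y P(x, y) f(y)."""
    return [P[x][0] * f[0] + P[x][1] * f[1] for x in range(2)]

def pi_inner(f, h):
    """Inner product <f, h> in L^2(pi)."""
    return pi[0] * f[0] * h[0] + pi[1] * f[1] * h[1]

sigma2 = pi_inner(g, g)               # the k = 0 term, pi(g^2)
Pkg = g
for k in range(1, 200):               # geometric series; converges quickly
    Pkg = apply_P(Pkg)
    sigma2 += 2.0 * pi_inner(g, Pkg)

rho = abs(1 - 0.3 - 0.2)              # second eigenvalue of this 2x2 chain
bound = (1 + rho) / (1 - rho) * pi_inner(g, g)
```

Here g happens to lie in the eigenspace with eigenvalue ρ, so the series sums exactly to the bound; for a generic g the inequality is strict.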

Remark 2.4
We note that the Markov Functional Central Limit Theorem used above applies only to a reversible Markov process with a spectral gap. For more general reversible Markov processes, the less restrictive results of Kipnis and Varadhan (1986) may be very useful. Furthermore, we note that corresponding results for non-reversible processes are an active area of study, for example in the contexts of interacting particle systems and in homogenization theory; see for example Derriennic and Lin (1996), and the recent unpublished works of Papanicolaou, Komorowski, Carmona, Xu, Landim, Olla, Yau, and Sznitman.

Geometric ergodicity of hybrid samplers
Given a probability distribution π(·) on the state space X = X_1 × X_2 × . . . × X_k, the usual deterministic-scan Gibbs sampler (DUGS) is the Markov kernel P = P_1 P_2 . . . P_k, where P_i is the Markov kernel which replaces the i-th coordinate by a draw from π(dx_i | x_{−i}), leaving x_{−i} fixed (where x_{−i} = (x_1, . . ., x_{i−1}, x_{i+1}, . . ., x_k)). This is a standard Markov chain Monte Carlo technique (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994). The random-scan Gibbs sampler (RSGS), given by P = (1/k) Σ_i P_i, is sometimes used instead. Often the full conditionals π(dx_i | x_{−i}) may be easily sampled, so that DUGS may be efficiently run on a computer. However, sometimes this is not feasible. Instead, one can define new operators Q_i which are easily sampled, such that Q_i^n converges to P_i as n → ∞. This is the method of "one-variable-at-a-time Metropolis-Hastings" or "Metropolis within Gibbs" (cf. Tierney, 1994, Section 2.4; Chan and Geyer, 1994, Theorem 1; Green, 1994). Such samplers prompt the following definition.
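As a concrete sketch (ours, not from the paper), the RSGS for a hypothetical bivariate normal target with unit variances and correlation r has full conditionals N(r · (other coordinate), 1 − r²), and can be run as follows.

```python
import random

# Minimal sketch (ours) of the random-scan Gibbs sampler P = (1/k) sum_i P_i
# for a hypothetical bivariate normal target with unit variances and
# correlation r; the full conditionals are N(r * (other coordinate), 1 - r^2).
random.seed(0)
r = 0.5
cond_sd = (1 - r * r) ** 0.5
x, y = 0.0, 0.0
samples = []
for _ in range(50000):
    if random.random() < 0.5:                 # k = 2: pick a coordinate uniformly
        x = random.gauss(r * y, cond_sd)      # exact draw from pi(dx | y)
    else:
        y = random.gauss(r * x, cond_sd)      # exact draw from pi(dy | x)
    samples.append((x, y))

n = len(samples)
mean_x = sum(s[0] for s in samples) / n       # should approach E[X] = 0
mean_xy = sum(s[0] * s[1] for s in samples) / n   # should approach E[XY] = r
```

DUGS would instead update both coordinates in a fixed order at each iteration; the stationary distribution is the same in either case.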
Definition 3.1 Let C = {Q_1, . . ., Q_k} be a collection of Markov kernels on a state space X. The random-scan hybrid sampler for C is the sampler defined by P_C = (1/k) Σ_{i=1}^k Q_i. A common example is the variable-at-a-time Metropolis-Hastings algorithms mentioned above.
For another example, if the Q_i are themselves Gibbs samplers, then the random-scan hybrid sampler would correspond to building a large Gibbs sampler out of smaller ones. Similarly, if the Q_i are themselves Metropolis-Hastings algorithms (perhaps with singular proposals, cf. Tierney, 1995), then the random-scan hybrid sampler is also a Metropolis-Hastings algorithm, but with a different proposal distribution. It is important to note that this usage of the word "hybrid" is distinct from the more specific notion of "hybrid Monte Carlo" algorithms as studied in the physics literature (cf. Duane et al., 1987; Neal, 1993, Section 5.2). We now discuss some observations about geometric convergence of these samplers. Our first result is a cautionary example.
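A minimal runnable sketch of such a hybrid sampler (our construction, with hypothetical ingredients): each Q_i below is a one-coordinate random-walk Metropolis kernel ("Metropolis within Gibbs") targeting a product of two standard normals, and the random-scan hybrid chooses between them uniformly at each step.

```python
import math
import random

# Minimal sketch (ours) of a random-scan hybrid sampler P_C = (1/k) sum_i Q_i,
# where each hypothetical Q_i is a one-coordinate random-walk Metropolis
# kernel targeting a product of two standard normals.
random.seed(1)

def log_pi(x, y):
    return -0.5 * (x * x + y * y)     # log-density, up to an additive constant

x, y = 0.0, 0.0
accepts = 0
N = 40000
xs = []
for _ in range(N):
    if random.random() < 0.5:                     # choose Q_1 or Q_2 (k = 2)
        z = x + random.gauss(0.0, 1.0)            # propose along coordinate 1
        if random.random() < math.exp(min(0.0, log_pi(z, y) - log_pi(x, y))):
            x = z
            accepts += 1
    else:
        z = y + random.gauss(0.0, 1.0)            # propose along coordinate 2
        if random.random() < math.exp(min(0.0, log_pi(x, z) - log_pi(x, y))):
            y = z
            accepts += 1
    xs.append(x)

mean_x = sum(xs) / N
var_x = sum(v * v for v in xs) / N - mean_x ** 2  # should approach Var(X) = 1
```

Each Q_i here leaves π invariant, so P_C does too; whether P_C inherits geometric ergodicity from the Q_i is exactly the question studied in this section.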
Proposition 3.1 Let X = X_1 × X_2, and let π be a probability measure on X. Let Q_1 and Q_2 be two Markov operators on X which fix the second and first coordinates, respectively. Assume that the usual RSGS for π is geometrically ergodic. Assume further that for each fixed y ∈ X_2 [resp. x ∈ X_1], the chain Q_1 [resp. Q_2] is geometrically ergodic, with stationary distribution equal to π(dx | y) [resp. π(dy | x)]. Then it is still possible that the hybrid sampler P_C = (1/2)(Q_1 + Q_2) fails to be geometrically ergodic on X.

Proof:
A result of Roberts and Tweedie (1996) states that a Metropolis algorithm is not geometrically ergodic if the infimum of its acceptance probabilities is 0. We therefore consider the following example. Let π be the bivariate density of two independent standard normal components. By independence, RSGS is geometrically ergodic (in fact uniformly ergodic, with rate 1/2). Now let Q_1 be the following random-walk Metropolis algorithm. Given (X_n, Y_n) = (x, y), let Y_{n+1} = y, and propose a candidate Z_{n+1} for X_{n+1} from a N(x, 1 + y²) distribution, accepting with probability 1 ∧ [π(Z_{n+1}, y) / π(x, y)]. Similarly define Q_2 for fixed x, by proposing a candidate for Y_{n+1} according to N(y, 1 + x²). For fixed y, Q_1 is geometrically ergodic; and for fixed x, Q_2 is geometrically ergodic (see for example Roberts and Tweedie, 1996). However, it is easy to verify that along any sequence of points (x_i, y_i) such that |x_i| → ∞ and |y_i| → ∞, the acceptance probability for each of Q_1 and Q_2 (and hence also for P_C) goes to 0. This provides our required counterexample.

2
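The vanishing acceptance probability can be checked by simulation. In this rough Monte Carlo sketch (ours; the evaluation points are chosen only for illustration, and we read the proposal's second parameter as a standard deviation), the estimated acceptance probability of Q_1 is already small at (x, y) = (10, 10), while it is large at the origin.

```python
import math
import random

# Rough Monte Carlo check (ours; illustration only): acceptance probability
# of Q_1 with proposal Z ~ N(x, sd) and acceptance 1 ^ exp((x^2 - z^2)/2),
# where the proposal scale sd grows with |y|.  (We read the second proposal
# parameter as a standard deviation.)
random.seed(2)

def mean_acceptance(x, y, trials=20000):
    sd = 1.0 + y * y                  # proposal scale grows with |y|
    total = 0.0
    for _ in range(trials):
        z = random.gauss(x, sd)
        # min(0, .) both caps the ratio at 1 and avoids overflow in exp
        total += math.exp(min(0.0, 0.5 * (x * x - z * z)))
    return total / trials

a_origin = mean_acceptance(0.0, 0.0)  # large: proposal is well matched here
a_far = mean_acceptance(10.0, 10.0)   # small: proposals overshoot the mode
```

Since the acceptance probability can be driven arbitrarily close to 0, the Roberts-Tweedie criterion cited above rules out geometric ergodicity of the hybrid chain.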
To continue, we need some notation. Let π be a probability measure on X = X_1 × . . . × X_k, and let P_i be the operator which replaces the i-th coordinate of x ∈ X by a draw from π(dx_i | x_{−i}), leaving x_{−i} unchanged. Let M(X) be the space of all σ-finite signed measures on X, and let M_0 = {µ ∈ M(X) ; µ(X) = 0}. Given a norm N on M(X) and a Markov operator P on X, set ‖P‖_{M_0,N} = sup{ N(µP) ; µ ∈ M_0, N(µ) ≤ 1 }. Say P is N-geometrically ergodic if πP = π and there is a positive integer m with ‖P^m‖_{M_0,N} < 1. (Note that, if N is total variation distance, then N-geometric ergodicity is equivalent to uniform ergodicity. On the other hand, if N(µ) = ‖µ‖_{L²(π)}, then N-geometric ergodicity is equivalent to condition (ii) of Theorem 2.1 above.) Finally, call N π-contracting if it has the property that ‖P‖_{M_0,N} ≤ 1 for any Markov operator P satisfying πP = π. Thus, the L^p(π) norms are all π-contracting (see e.g. Baxter and Rosenthal, 1995, Lemma 1), though the L∞_V norm usually is not.

Appendix

In this appendix we prove that, on a countably generated state space, the quantity ‖P^n(x, ·) − π(·)‖_var is measurable as a function of x. This fact was needed in the proof of Proposition 2.1 above.
Lemma 4.1 Let ν(·) be a finite signed measure on (X, F). Suppose that F_0 is a set algebra generating F. Then for all S ∈ F and all ε > 0, there is S_0 ∈ F_0 with |ν(S) − ν(S_0)| < ε.

Proof: Take ν to be a measure (otherwise consider its positive and negative parts separately and use the triangle inequality). It follows from Doob (1994, Chapter IV, Section 3) that we can find S_0 ∈ F_0 with ν(S ∆ S_0) arbitrarily small. The result follows. 2

Proposition 4.1 Let ν(x, ·) be a bounded signed measure on (X, F), where F is countably generated by the sequence of sets {A_1, A_2, . . .}. Assume that ν(·, A) is a measurable function for each fixed A ∈ F. Then sup_{A∈F} ν(x, A) is a measurable function of x.

Proof: The proof proceeds by demonstrating that the supremum can be written as a supremum of a countable collection of measurable functions. Therefore let F_n = σ{A_1, A_2, . . ., A_n}, and notice that ∪_n F_n is a set algebra generating F. Now fix x ∈ X, and let A* be a set achieving sup_{A∈F} ν(x, A) (possible by the Hahn decomposition, e.g. Royden, 1968). For arbitrary ε > 0, we can apply the previous lemma to find a set A ∈ ∪_n F_n with ν(x, A*) ≥ ν(x, A) ≥ ν(x, A*) − ε. Hence

sup_{A∈F} ν(x, A) = sup_n sup_{A∈F_n} ν(x, A).

But each F_n contains only finitely many sets, so sup_{A∈F_n} ν(x, A) is clearly a measurable function of x ∈ X. The result follows. 2