Spectral Bounds for Certain Two-Factor Non-Reversible MCMC Algorithms

We prove that the Markov operator corresponding to the two-variable, non-reversible Gibbs sampler has spectrum which is entirely real and non-negative, thus providing a first step towards the spectral analysis of MCMC algorithms in the non-reversible case. We also provide an extension to Metropolis-Hastings components, and connect the spectrum of an algorithm to the spectrum of its marginal chain.


Introduction
This paper is inspired by the earlier paper [23], which discusses the importance of real, non-negative spectra for MCMC algorithms, and proves this property for several different reversible cases. In this paper, we extend that result to some common non-reversible MCMC algorithms, as we shall explain.
Markov chain Monte Carlo (MCMC) algorithms, such as the Gibbs sampler [9,8] and the Metropolis-Hastings algorithm [16,10,26], are an extremely active area of modern research, with applications to numerous areas (see e.g. [3] and the references therein). Much of the mathematical analysis of these algorithms centers around their convergence rate; i.e., how long they need to be run before they produce accurate samples from the designated target probability distribution (cf. [20]). Some of this analysis uses probabilistic techniques such as coupling and minorisation conditions (e.g. [21,4]). However, much of the analysis involves considering the spectrum of the associated Markov operator (see Section 2.2). In such cases, the Markov operator is nearly always assumed to be self-adjoint, corresponding to the Markov chain being reversible (see e.g. [13,24,6,5,12]). The paradigm used is then roughly as follows:
1. Since the Markov operator is self-adjoint, its spectrum must be real (not complex), and can often be shown (or forced) to be non-negative, cf. [23].
2. The corresponding spectral gap can then be bounded away from zero using various techniques (Cheeger's inequality, quadratic forms, etc.).
3. These spectral gap bounds then imply bounds on the operator's norm, which in turn lead to bounds on the Markov chain's convergence rate.
However, if the Markov chain is not reversible, then much of this paradigm breaks down (though the spectral radius formula is still of some relevance to step 3 above; see Section 2.2 below), and the analysis becomes much more difficult (see e.g. [17]). Some authors have attempted to get around this difficulty by replacing the non-reversible Markov chain by its "reversibilisation" [7], or by some other chain which provably has the same convergence properties [19]. However, there has been very little success at directly investigating the spectral properties of non-reversible Markov chains themselves, despite the fact that many commonly used MCMC algorithms (such as the systematic-scan Gibbs sampler) are not reversible and thus not amenable to the above paradigm.
In this paper, we make a small start in this direction. We consider one of the simplest common classes of non-reversible MCMC algorithms; namely, those which are a product of two factors each of which is a reversible Markov chain. In particular, we consider the two-variable systematic-scan Gibbs sampler, and prove step 1 of the above paradigm; i.e., that a Markov operator corresponding to such a sampler must have spectrum which is real and non-negative (Theorem 1). This implies (Corollary 2) that the corresponding auto-covariances are also non-negative. We also consider a combination of a Metropolis-Hastings component and a Gibbs sampler component, and prove that the corresponding spectrum must still be real in that case (Theorem 3). Finally, we consider the relationship between the spectra of certain (non-reversible) systematic-scan chains, and their corresponding (reversible) marginal chains (Theorem 5). We hope that these results will lead to further efforts to extend the above spectral analysis paradigm to non-reversible Markov chains.

Background
We begin with some background needed for our results.

Markov Chain
A (time-homogeneous) Markov chain on a measurable space $(\mathcal X, \mathcal F)$ has a Markov kernel $P : \mathcal X \times \mathcal F \to [0,1]$, where $P(x,A)$ represents the probability that, if the chain begins in the state $x \in \mathcal X$, it will then "move" to a state in $A \in \mathcal F$ on the next iteration. Formally, for each fixed $x \in \mathcal X$, the mapping $A \mapsto P(x,A)$ is a probability measure on $(\mathcal X, \mathcal F)$, and for each fixed $A \in \mathcal F$, the mapping $x \mapsto P(x,A)$ is a measurable function on $\mathcal X$. A sequence of $\mathcal X$-valued random variables $X_0, X_1, X_2, \ldots$ is a Markov chain following the transitions $P$ if for any $n \ge 0$ and all $A \in \mathcal F$, $\mathrm{Prob}[X_{n+1} \in A \mid X_0, X_1, \ldots, X_n] = P(X_n, A)$.
In the case of MCMC algorithms, there is always a fixed probability measure $\pi$ on $(\mathcal X, \mathcal F)$ which is stationary for $P$, meaning that $(\pi P)(A) := \int_{x \in \mathcal X} \pi(dx)\, P(x,A) = \pi(A)$ for all $A \in \mathcal F$. Under mild conditions, if the Markov chain is run repeatedly, then it will converge in distribution to $\pi$. This is the main motivation for MCMC algorithms, and the speed of this convergence is of great importance (see e.g. [20]).
One condition which guarantees that π is stationary for P is that the Markov chain is reversible with respect to π; i.e., that π(dx) P (x, dy) = π(dy) P (y, dx) for all x, y ∈ X .

Markov Operator
Such a Markov kernel $P$ can also be viewed as a linear operator (see e.g. [22] for basic facts about operators), which acts on functions $f : \mathcal X \to \mathbf C$ by $(Pf)(x) := \int_{y \in \mathcal X} f(y)\, P(x, dy)$, so that $(Pf)(x)$ is the conditional expected value of $f$ when the Markov chain takes one step starting at $x$.
The stationary probability measure $\pi$ gives rise to an inner product $\langle f, g \rangle = \int_{x \in \mathcal X} f(x)\, \overline{g(x)}\, \pi(dx)$ and norm $\|f\| = \sqrt{\langle f, f \rangle}$ on the Hilbert space $L^2(\pi) = \{f : \mathcal X \to \mathbf C ;\ \|f\| < \infty\}$. Then $P$ acts on $L^2(\pi)$, and indeed it is easily seen (e.g. [2]) that we always have $\|Pf\| \le \|f\|$; i.e., $\|P\| \le 1$; i.e., $P$ is a (weak) contraction on $L^2(\pi)$. Similar comments also apply to $P$ acting on the subspace $L^2_0(\pi) = \{f \in L^2(\pi) ;\ \int f\, d\pi = 0\}$, which is more directly related to MCMC convergence (since it avoids the specific eigenvalue 1 for constant functions, corresponding to the fact that $\pi P = \pi$ since $\pi$ is a stationary distribution).
The operator $P$ is also related to the auto-covariance of the chain, which is important in understanding the accuracy of MCMC samplers (see e.g. [15]). Indeed, for $f : \mathcal X \to \mathbf R$ in $L^2(\pi)$, $E[f(X_0)\, f(X_n)] = \langle f, P^n f \rangle$, where the expected value $E$ is taken with respect to a Markov chain $\{X_n\}$ started in stationarity and following the transitions $P$.
It is easily seen that the Markov chain $P$ is reversible if and only if the operator $P$ is self-adjoint; i.e., $\langle Pf, g \rangle = \langle f, Pg \rangle$ for all $f, g \in L^2(\pi)$. An operator $P$ is positive if it is self-adjoint and also $\langle Pf, f \rangle \ge 0$ for all $f \in L^2(\pi)$. Any positive operator has a unique positive square-root; i.e., a positive operator $S := P^{1/2}$ with $S^2 = P$.
The spectrum of the operator $P$ is defined, as usual, by $\sigma(P) = \{\lambda \in \mathbf C ;\ \lambda I - P \text{ is not invertible}\}$. (Here $I$ is the identity operator on $L^2(\pi)$, and "invertible" means having an inverse within the class of all bounded (i.e., continuous) linear operators on $L^2(\pi)$.) The corresponding spectral radius is $r(P) = \sup\{|z| ;\ z \in \sigma(P)\}$. Since $\|P\| \le 1$, it follows that $r(P) \le 1$. In general, $\sigma(P)$ consists of complex numbers. However, for self-adjoint operators (corresponding to reversible Markov chains), the spectrum is well-known to contain only real numbers. And, for positive operators, the spectrum is also non-negative; i.e., contained in $[0, \infty)$.
It turns out (see e.g. [18]) that in the MCMC context, the spectral radius $r(P)$ for the operator $P$ on $L^2_0(\pi)$ is of great importance to convergence rates. In the reversible case, this is because $r(P)^n$ then equals the operator norm $\|P^n\|$, and hence provides direct bounds on $\|P^n f\| \le \|P^n\|\, \|f\| = r(P)^n\, \|f\|$. In the non-reversible case, that bound does not hold; however, by the spectral radius formula (e.g. [22], Theorem 10.13) we still have $r(P) = \lim_{n \to \infty} \|P^n\|^{1/n}$, so the bound still holds asymptotically in this sense.
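The gap between the operator norm and the spectral radius in the non-reversible case can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical three-state chain with uniform stationary distribution (the matrix `P` and the helper names are illustrative choices, not taken from the paper). It computes $\|P^n\|^{1/n}$ on $L^2_0(\pi)$ and compares it with $r(P)$.

```python
import numpy as np

# Hypothetical non-reversible chain; its columns also sum to 1, so the
# uniform distribution is stationary.
P = np.array([[0.25, 0.5, 0.25],
              [0.25, 0.5, 0.25],
              [0.50, 0.0, 0.50]])
pi = np.full(3, 1.0 / 3.0)


def norm_on_l2_zero(P, pi, n):
    """Operator norm of P^n acting on L^2_0(pi), for a finite chain."""
    # Subtracting the projection onto constants restricts P to mean-zero
    # functions; (P - Pi)^n = P^n - Pi because pi is stationary.
    Q = P - np.outer(np.ones(len(pi)), pi)
    Qn = np.linalg.matrix_power(Q, n)
    root = np.sqrt(pi)
    # Conjugating by diag(sqrt(pi)) turns the L^2(pi) norm into the
    # ordinary Euclidean operator norm.
    return np.linalg.norm(root[:, None] * Qn / root[None, :], 2)


def spectral_radius_on_l2_zero(P, pi):
    """Largest eigenvalue modulus of P restricted to L^2_0(pi)."""
    Q = P - np.outer(np.ones(len(pi)), pi)
    return max(abs(np.linalg.eigvals(Q)))


for n in (1, 5, 50):
    print(n, norm_on_l2_zero(P, pi, n) ** (1.0 / n))
print("r(P) =", spectral_radius_on_l2_zero(P, pi))
```

For this particular matrix the one-step norm is strictly larger than the spectral radius, while the $n$th roots approach it as $n$ grows, in line with the spectral radius formula.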

Gibbs Sampler
Suppose that $(\mathcal X, \mathcal F) = (\mathcal X_1 \times \cdots \times \mathcal X_d,\ \mathcal F_1 \times \cdots \times \mathcal F_d)$ is a $d$-fold product measurable space, and that $\lambda_i$ is some σ-finite reference measure on $(\mathcal X_i, \mathcal F_i)$ for each $i$, with product measure $\lambda = \lambda_1 \times \cdots \times \lambda_d$. (The most common case is where each $\lambda_i$ equals Lebesgue measure on $\mathcal X_i = \mathbf R$.) Suppose further that the stationary probability distribution $\pi$ has a density $\phi$ with respect to $\lambda$; i.e., $\pi \ll \lambda$ with $\frac{d\pi}{d\lambda} = \phi$. Then the $i$th component Gibbs sampler is the Markov kernel $G_i$ which leaves all coordinates besides $i$ unchanged, and replaces the $i$th coordinate by a draw from the full conditional distribution of $\pi$ conditional on all the other components. That is, for $x \in \mathcal X$ and $A_i \in \mathcal F_i$, if $S_{x,i,A_i} := \{y \in \mathcal X ;\ y_j = x_j \text{ for } j \ne i, \text{ and } y_i \in A_i\}$, then
$$G_i(x, S_{x,i,A_i}) = \frac{\int_{t \in A_i} \phi(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)\, \lambda_i(dt)}{\int_{t \in \mathcal X_i} \phi(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)\, \lambda_i(dt)}.$$
These single-component Gibbs samplers $G_i$ are easily seen to be reversible Markov chains corresponding to self-adjoint operators. In fact, they are projection operators, i.e. $(G_i)^2 = G_i$, so their spectra consist entirely of the values 0 and 1, and in particular their spectra are real and non-negative.
The single-component Gibbs samplers $G_i$ are then combined together to form a complete MCMC algorithm $P$. There are two main ways of doing this. The first is the systematic-scan Gibbs sampler, defined by $P = G_1 G_2 \cdots G_d$, corresponding to cycling through all of the different coordinates in order. The second is the random-scan Gibbs sampler, defined by $P = \frac{1}{d} \sum_{i=1}^{d} G_i$, corresponding to choosing a coordinate uniformly at random and updating that coordinate only. Now, it is easily seen that the random-scan Gibbs sampler is reversible, so that its spectrum can be analysed in various ways (see e.g. [23]). However, the systematic-scan Gibbs sampler is more commonly used in applications, and it is definitely not reversible. (For example, if $d = 2$ and the support of $\pi$ is an "L" shape, then with $G_1 G_2$ it is possible to move from the lower-right corner to the upper-left corner, but not to move the other way.) In this paper, we focus on the two-variable systematic-scan Gibbs sampler; i.e., the case where $d = 2$ and $P = G_1 G_2$ (equivalent to the data augmentation algorithm introduced in [25]), which is arguably the simplest common non-reversible MCMC algorithm.
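The "L"-shaped example can be made concrete on a finite state space. The sketch below assumes the smallest such support, three points with $\pi$ uniform on them (an illustrative choice; the helper `gibbs_component` is hypothetical), builds $G_1$ and $G_2$ as transition matrices, and forms $P = G_1 G_2$.

```python
import numpy as np

# States of an "L"-shaped support, written as (x1, x2); pi is assumed
# uniform on them purely for illustration.
states = [(0, 0), (1, 0), (0, 1)]   # lower-left, lower-right, upper-left
pi = np.array([1.0, 1.0, 1.0]) / 3.0


def gibbs_component(states, pi, i):
    """Matrix of G_i (i = 0 or 1): redraw coordinate i from its full
    conditional under pi, keeping the other coordinate fixed."""
    n = len(states)
    other = 1 - i
    G = np.zeros((n, n))
    for s, x in enumerate(states):
        for t, y in enumerate(states):
            if y[other] == x[other]:
                G[s, t] = pi[t]
        G[s] /= G[s].sum()          # normalise to a conditional distribution
    return G


G1 = gibbs_component(states, pi, 0)
G2 = gibbs_component(states, pi, 1)
P = G1 @ G2                          # systematic scan: update x1, then x2

lower_right, upper_left = 1, 2
print("P(lower-right -> upper-left) =", P[lower_right, upper_left])
print("P(upper-left -> lower-right) =", P[upper_left, lower_right])
print("eigenvalues of P:", np.sort(np.linalg.eigvals(P).real))
```

In this sketch each $G_i$ is idempotent, the move from the lower-right to the upper-left corner has positive probability while the reverse move has probability zero, and the eigenvalues of $P$ are nevertheless real and non-negative.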

Metropolis-Hastings Algorithm
Let $d$, $\mathcal X_i$, $\mathcal F_i$, $\lambda_i$, $\phi$ be as above. When some of the Gibbs sampler kernels $G_i$ cannot be feasibly implemented, practitioners sometimes instead use Metropolis-Hastings components, defined as follows. Let $Q_i$ be an arbitrary Markov kernel on $\mathcal X$ which leaves all coordinates besides the $i$th one unchanged; i.e., such that in the above notation $Q_i(x, S_{x,i,\mathcal X_i}) = 1$. Assume that $Q_i(x, \cdot)$ has a density $q_{i,x}(t)$ with respect to $\lambda_i$, in the sense that $Q_i(x, S_{x,i,A_i}) = \int_{t \in A_i} q_{i,x}(t)\, \lambda_i(dt)$ for all $A_i \in \mathcal F_i$. Then the $i$th component Metropolis-Hastings algorithm is the Markov kernel $M_i$ corresponding to "proposing" a new state $y \in \mathcal X$ according to $Q_i$, and then accepting this new state with probability
$$\alpha_i(x, y) := \min\left(1,\ \frac{\phi(y)\, q_{i,y}(x_i)}{\phi(x)\, q_{i,x}(y_i)}\right);$$
otherwise, with probability $1 - \alpha_i(x, y)$, the new state is rejected and the Markov chain remains at the state $x$. In terms of Markov operators, writing $x[i,t] := (x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)$, this corresponds to setting
$$(M_i f)(x) := r_i(x)\, f(x) + \int_{t \in \mathcal X_i} f(x[i,t])\, \alpha_i(x, x[i,t])\, q_{i,x}(t)\, \lambda_i(dt),$$
where $r_i(x) := 1 - \int_{t \in \mathcal X_i} \alpha_i(x, x[i,t])\, q_{i,x}(t)\, \lambda_i(dt)$ is the overall probability of rejecting the proposal.
Now, the acceptance probabilities $\alpha_i(x, y)$ have been chosen precisely (see e.g. [26,20]) to ensure that each kernel $M_i$ is reversible with respect to $\pi$, so $\pi$ is stationary for $M_i$. Hence, the operator $M_i$ is self-adjoint, though it might not be a positive operator.
Remark. It is also possible to define a full-dimensional Metropolis-Hastings algorithm, which acts on all components simultaneously. In the above notation, that corresponds to the case $d = 1$; i.e., to letting $\mathcal X_1$ be the entire state space and setting $P = M_1$. This approach is quite common, though we do not pursue it here.

Main Results
In terms of the above background, our first main result is as follows.
Theorem 1. Consider a two-variable systematic-scan Gibbs sampler $P = G_1 G_2$ as above (or any other product $P = G_1 G_2$ for any positive Markov operators $G_1$ and $G_2$). Then the spectrum of $P$ is real and non-negative, with $\sigma(P) \subseteq [0, 1]$.
As discussed in the Introduction, this theorem extends step 1 of the reversible Markov chain paradigm to a non-reversible case.
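Theorem 1 can be checked numerically on a finite state space. The following is a minimal sketch, assuming a hypothetical $3 \times 3$ joint probability table (the table entries and the helper `two_variable_gibbs` are illustrative, not from the paper). It confirms that $P = G_1 G_2$ fails detailed balance and then inspects its eigenvalues.

```python
import numpy as np


def two_variable_gibbs(table):
    """Return (pi, G1, G2) for a joint probability table on a finite grid.

    State (i, j) is stored at index i * n2 + j.
    """
    table = np.asarray(table, dtype=float)
    table = table / table.sum()
    n1, n2 = table.shape
    pi = table.reshape(-1)
    G1 = np.zeros((n1 * n2, n1 * n2))
    G2 = np.zeros((n1 * n2, n1 * n2))
    for i in range(n1):
        for j in range(n2):
            s = i * n2 + j
            for k in range(n1):      # G1 redraws x1 given x2 = j
                G1[s, k * n2 + j] = table[k, j] / table[:, j].sum()
            for l in range(n2):      # G2 redraws x2 given x1 = i
                G2[s, i * n2 + l] = table[i, l] / table[i, :].sum()
    return pi, G1, G2


# Hypothetical unnormalised target with dependence between x1 and x2.
pi, G1, G2 = two_variable_gibbs([[4, 1, 1], [1, 3, 1], [1, 1, 2]])
P = G1 @ G2
eigs = np.linalg.eigvals(P)

flow = pi[:, None] * P   # flow[s, t] = pi(s) P(s, t); symmetric iff reversible
print("reversible:", np.allclose(flow, flow.T))
print("largest |imaginary part|:", np.max(np.abs(eigs.imag)))
print("smallest real part:", np.min(eigs.real))
```

Up to floating-point error, the imaginary parts vanish and the real parts lie in $[0, 1]$, even though the chain is not reversible.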
Then, since [...] for real-valued $f$ as noted above, it follows immediately that:
Corollary 2. Let $\{X_n\}$ be a random sequence started in stationarity and following the transitions $P = G_1 G_2$ of a two-variable systematic-scan Gibbs sampler as above. Then for any real [...]
We also consider the case of a combination of a Gibbs sampler component and a Metropolis-Hastings component, as follows.

Proofs of Main Results
Our proofs rely on the following known operator theory facts, following [11].

Proposition 4. (i) Let $A$ and $B$ be two self-adjoint operators on a Hilbert space $H$, with $B$ positive. Then the spectra of the product operators $AB$ and $BA$ are equal and real; i.e., $\sigma(AB) = \sigma(BA) \subseteq \mathbf R$.
(ii) If, in addition to the above, $A$ is also positive, then the spectra of the product operators are non-negative; i.e., $\sigma(AB) = \sigma(BA) \subseteq [0, \infty)$.
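In finite dimensions, Proposition 4 is a statement about products of real symmetric matrices, and it can be spot-checked numerically. The sketch below assumes randomly generated $5 \times 5$ matrices (the seed, the size and the helper names are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)


def random_symmetric(n):
    """A symmetric matrix with no sign restriction on its eigenvalues."""
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2


def random_positive(n):
    """A symmetric positive definite matrix (identity added for conditioning)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + np.eye(n)


A = random_symmetric(5)      # self-adjoint, generally indefinite
B = random_positive(5)       # positive
ab = np.linalg.eigvals(A @ B)
ba = np.linalg.eigvals(B @ A)
print("part (i), max |imag|:", np.max(np.abs(ab.imag)))
print("part (i), spectra agree:", np.allclose(np.sort(ab.real), np.sort(ba.real)))

A_pos = random_positive(5)   # now both factors are positive
pp = np.linalg.eigvals(A_pos @ B)
print("part (ii), min eigenvalue:", np.min(pp.real))
```

The product $AB$ is not symmetric in general, so the fact that its computed eigenvalues are real is not automatic; it reflects the similarity of $AB$ to the symmetric matrix $B^{1/2} A B^{1/2}$ when $B$ is invertible.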
Remark. Theorem 1 does not extend directly to Gibbs samplers with $d > 2$ coordinates. Indeed, we have checked numerically that if $\mathcal X = \{1, 2\}^3$, with $\pi(i, j, k) \propto i + j + k$, then the corresponding three-variable systematic-scan Gibbs sampler has non-real eigenvalues $0.0002515 \pm 0.0014018\, i$, among others. In fact, it is well-known (see [1]) that even Proposition 4 does not extend to three operators. Daniel Rosenthal has pointed out a simple example: if [...] then $A$ and $B$ and $C$ are each positive matrices, but the product $ABC$ has complex eigenvalues [...].
Proof of Theorem 3. Applying Proposition 4(i) with $A = M_1$ and $B = G_2$ gives $\sigma(M_1 G_2) \subseteq \mathbf R$, and applying it with $A = M_2$ and $B = G_1$ gives $\sigma(G_1 M_2) \subseteq \mathbf R$, so either way we have $\sigma(P) \subseteq \mathbf R$. But we know that $r(P) \le 1$, whence $\sigma(P) \subseteq [-1, 1]$, as claimed.
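The three-variable example in the Remark above can be explored numerically. The following is a minimal sketch of such a check, assuming the natural encoding of $\mathcal X = \{1, 2\}^3$ as eight states (the helper `gibbs_component` is hypothetical). It prints the computed eigenvalues of $P = G_1 G_2 G_3$ so that they can be compared with the values reported in the Remark.

```python
import itertools

import numpy as np

states = list(itertools.product((1, 2), repeat=3))
weights = np.array([sum(s) for s in states], dtype=float)
pi = weights / weights.sum()         # pi(i, j, k) proportional to i + j + k


def gibbs_component(i):
    """Matrix of G_i: redraw coordinate i from its full conditional under pi."""
    G = np.zeros((8, 8))
    for s, x in enumerate(states):
        for t, y in enumerate(states):
            if all(x[j] == y[j] for j in range(3) if j != i):
                G[s, t] = pi[t]
        G[s] /= G[s].sum()
    return G


G1, G2, G3 = (gibbs_component(i) for i in range(3))
P = G1 @ G2 @ G3                     # systematic scan over all three coordinates
eigs = np.linalg.eigvals(P)
for z in np.sort_complex(eigs):
    print(z)
```

Since $P$ has rank at most 4 here, several eigenvalues are zero, and these may appear as tiny nonzero values in floating point; they should be distinguished from the pair reported in the Remark, whose modulus is of order $10^{-3}$.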

The Marginal Chain
We now consider the connection between the spectrum of $P$, and the spectrum of the marginal chain $\overline P$, defined as follows.
For the two-variable systematic-scan Gibbs sampler $P = G_1 G_2$, the Markov chain proceeds by first (via $G_1$) "replacing" the first coordinate by a fresh value depending only on the second coordinate. This means that $P(x, A)$ does not depend on the first coordinate of $x$; i.e., $P((y, x_2), A) = P((z, x_2), A)$ for all $y, z \in \mathcal X_1$. Hence, also the function $Pf$ depends only on $x_2$. That in turn implies the existence of a "marginal" Markov chain which only keeps track of the second coordinate; i.e., which has state space $(\mathcal X_2, \mathcal F_2)$, and transition kernel $\overline P$ defined by $\overline P(x_2, A_2) = P(x, \{(y_1, y_2) \in \mathcal X ;\ y_2 \in A_2\})$ for $x_2 \in \mathcal X_2$ and $A_2 \in \mathcal F_2$. (Usually, a function of a Markov chain will not itself be a Markov chain, but rather a hidden Markov model.) In this case, it turns out [15,18,12] that $\overline P$ is reversible with respect to the marginal distribution $\overline\pi$ of $\pi$ on $\mathcal X_2$, defined by $\overline\pi(A_2) = \pi\{(x_1, x_2) \in \mathcal X ;\ x_2 \in A_2\}$, and furthermore the convergence rate of $\overline P$ to $\overline\pi$ is identical to the convergence rate of $P$ to $\pi$. So, that provides a different avenue to studying convergence of two-variable Gibbs samplers, using the methodology of reversible chains.
The above facts for the two-variable Gibbs sampler also extend ([14], Section 2.4) to the case $P = G_1 M_2$ of a combination of a Gibbs sampler component followed by a Metropolis-Hastings component; i.e., it also has a marginal chain $\overline P$ which is reversible with respect to $\overline\pi$ with the same convergence rate.
The identical convergence rates of the full and the marginal chain in these cases suggest that there might be a connection between their spectra.Indeed, we have the following.
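To prove Theorem 5, we require another operator theory result.
Proposition 6. Let $A$ be an operator on a Hilbert space $H$. Suppose $\mathcal M$ is a proper closed linear subspace of $H$ which contains the range of $A$; i.e., such that $Af \in \mathcal M$ whenever $f \in H$. Let $B$ be the restriction of $A$ to $\mathcal M$; i.e., $B = A|_{\mathcal M}$. Then $\sigma(A) = \sigma(B) \cup \{0\}$.
Proof. Let $\mathcal M^\perp = \{f \in H ;\ \langle f, g \rangle = 0\ \forall g \in \mathcal M\}$ be the subspace of functions "perpendicular" to $\mathcal M$. Then the entire space $H$ can be written as the direct sum $\mathcal M \oplus \mathcal M^\perp$. Hence any operator $D$ can be decomposed in block-matrix form as
$$D = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix}.$$
With respect to this decomposition, we must have (since $\mathcal M$ contains the range of $A$) that
$$A = \begin{pmatrix} B & C \\ 0 & 0 \end{pmatrix}$$
for some operator $C : \mathcal M^\perp \to \mathcal M$. Then
$$\lambda I - A = \begin{pmatrix} \lambda I_{\mathcal M} - B & -C \\ 0 & \lambda I_{\mathcal M^\perp} \end{pmatrix},$$
where $I_{\mathcal M}$ and $I_{\mathcal M^\perp}$ are the identity operators on $\mathcal M$ and $\mathcal M^\perp$ respectively. Now, if $\lambda \ne 0$ and $\lambda \notin \sigma(B)$, then it can be checked directly that
$$(\lambda I - A)^{-1} = \begin{pmatrix} (\lambda I_{\mathcal M} - B)^{-1} & X \\ 0 & \lambda^{-1} I_{\mathcal M^\perp} \end{pmatrix},$$
where $X = (\lambda I_{\mathcal M} - B)^{-1} C (\lambda^{-1} I_{\mathcal M^\perp})$. So, $\lambda I - A$ is invertible, and hence $\lambda \notin \sigma(A)$. This shows that $\sigma(A) \subseteq \sigma(B) \cup \{0\}$. Also, since $\mathrm{range}(A) \subseteq \mathcal M$, $A$ is not surjective, and therefore $0 \in \sigma(A)$.
But $P|_{\mathcal J}$ is essentially the same as $\overline P$: if $f \in \mathcal J$, with $f(x_1, x_2) = g(x_2)$ for all $x_1$ and $x_2$, then $(\overline P g)(x_2) = (Pf)(x_1, x_2)$. More formally, let $\overline{\mathcal J} = L^2(\overline\pi)$ be the collection of square-integrable functions on $\mathcal X_2$, and $x_*$ be any fixed element of $\mathcal X_1$, and define $S : \mathcal J \to \overline{\mathcal J}$ by $(Sf)(x_2) = f(x_*, x_2)$, with inverse $S^{-1} : \overline{\mathcal J} \to \mathcal J$ by $(S^{-1} g)(x_1, x_2) = g(x_2)$. Then $\overline P = S\, P|_{\mathcal J}\, S^{-1}$, so $\overline P$ is similar to $P|_{\mathcal J}$. In particular, $\sigma(\overline P) = \sigma(P|_{\mathcal J})$. The result follows.
Remark. It is known that for the two-variable systematic-scan Gibbs sampler $P = G_1 G_2$, the marginal chain is positive and thus has positive spectrum [15]; and for the combined chain $P = G_1 M_2$, the marginal chain is reversible and thus has real spectrum [14]. Using this, Theorem 5 in turn provides an alternative proof of Theorems 1 and 3, though it also strengthens them by providing a specific description (of sorts) of the spectra $\sigma(P)$ in those two cases.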
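The relationship between the full chain and its marginal chain can be seen directly on a finite grid. The sketch below assumes a hypothetical $3 \times 3$ joint table (illustrative values only) and compares the spectrum of $P = G_1 G_2$ with that of the marginal chain on $x_2$; by Proposition 6 and the similarity argument above, the two should agree apart from the value 0.

```python
import numpy as np

# Hypothetical joint probability table for (x1, x2).
table = np.array([[4, 1, 1], [1, 3, 1], [1, 1, 2]], dtype=float)
table /= table.sum()
n1, n2 = table.shape
cond1 = table / table.sum(axis=0, keepdims=True)   # cond1[i, j] = pi(x1 = i | x2 = j)
cond2 = table / table.sum(axis=1, keepdims=True)   # cond2[i, j] = pi(x2 = j | x1 = i)

# Full chain P = G1 G2 on pairs (i, j), stored at index i * n2 + j:
# P((i, j), (k, l)) = pi(x1 = k | x2 = j) * pi(x2 = l | x1 = k), independent of i.
P = np.zeros((n1 * n2, n1 * n2))
for i in range(n1):
    for j in range(n2):
        for k in range(n1):
            for l in range(n2):
                P[i * n2 + j, k * n2 + l] = cond1[k, j] * cond2[k, l]

# Marginal chain on x2: P_bar(j, l) = sum_k pi(x1 = k | x2 = j) * pi(x2 = l | x1 = k).
P_bar = cond1.T @ cond2
pi_bar = table.sum(axis=0)

eig_full = np.sort(np.linalg.eigvals(P).real)
eig_bar = np.sort(np.linalg.eigvals(P_bar).real)
flow_bar = pi_bar[:, None] * P_bar   # symmetric iff P_bar is reversible w.r.t. pi_bar

print("marginal chain reversible:", np.allclose(flow_bar, flow_bar.T))
print("spectrum of marginal chain:", eig_bar)
print("spectrum of full chain:    ", eig_full)
```

In this example the full chain has nine eigenvalues: the three eigenvalues of the marginal chain, together with six (numerically) zero eigenvalues.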

A Self-Contained Operator Theory Proof
Our Proposition 4 above, which is essential to the proofs of Theorems 1 and 3, makes heavy use of Proposition 1 of [11]. The corresponding proof presented in [11] is brief, but it relies on several other operator theory concepts and theorems, and hence is not easily accessible to non-experts. For completeness, we provide here a self-contained proof, following [11]. We prove this Proposition using a few simple lemmas. The first was proved by Nathan Jacobson years ago; James Fulford has pointed out that there is a nice discussion of this topic at [27].
Remark. The displayed identity in the proof of Lemma 8 is suggested intuitively (see e.g. [27]) by substituting in the (unjustified) expansions $(I - DC)^{-1} = I + DC + (DC)^2 + \cdots$ and $(I - CD)^{-1} = I + CD + (CD)^2 + \cdots$.
Proof. Since $CD$ is invertible, it must be injective; i.e., if $f \ne 0$ then $(CD)f \ne 0$. Hence also $Df \ne 0$. So, $D$ is also injective. Then, since $CD$ is invertible, so is its adjoint $(CD)^*$. In particular, its adjoint must be surjective; i.e., for each $g \in H$ there is $f \in H$ with $(CD)^* f = g$. But $(CD)^* = D^* C^* = D C^*$ since $D$ is self-adjoint. So, $D(C^* f) = g$. Hence, $D$ is also surjective.
Thus, $D$ is both injective and surjective, and hence invertible as a linear mapping $H \to H$. It then follows from the Open Mapping Theorem (see e.g. Corollary 2.12(b) on page 49 of [22]) that its inverse is a continuous (i.e., bounded) linear operator; i.e., $D$ is invertible as a bounded linear operator on $H$.
The remaining claims then follow from the fact that the product of invertible operators is invertible.
Proof. Lemma 8 above shows that $\sigma(CD)$ and $\sigma(DC)$ agree except possibly for the value 0, and Lemma 9 shows that $0 \in \sigma(CD)$ if and only if $0 \in \sigma(DC)$.

Theorem 3. Consider a two-variable systematic-scan combination of a Metropolis-Hastings component and a Gibbs sampler component, of the form $P = M_1 G_2$ or $P = G_1 M_2$, with $G_i$ and $M_i$ as above (or any other positive Markov operator $G_i$ and any other reversible Markov operator $M_i$). Then the spectrum of $P$ is real, with $\sigma(P) \subseteq [-1, 1]$.
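Theorem 3 can also be checked numerically on a finite grid. The sketch below assumes a hypothetical $3 \times 3$ joint table and, as one possible choice of $Q_1$, a proposal that picks the new value of $x_1$ uniformly at random (both are illustrative assumptions, not taken from the paper). It builds $M_1$ and $G_2$ and inspects the eigenvalues of $P = M_1 G_2$.

```python
import numpy as np

# Hypothetical joint probability table for (x1, x2).
table = np.array([[4, 1, 1], [1, 3, 1], [1, 1, 2]], dtype=float)
table /= table.sum()
n1, n2 = table.shape
pi = table.reshape(-1)                   # state (i, j) has index i * n2 + j

M1 = np.zeros((n1 * n2, n1 * n2))        # Metropolis update of x1, uniform proposal
G2 = np.zeros((n1 * n2, n1 * n2))        # Gibbs update of x2
for i in range(n1):
    for j in range(n2):
        s = i * n2 + j
        for k in range(n1):
            if k != i:
                # Symmetric proposal, so the acceptance ratio is pi(y) / pi(x).
                M1[s, k * n2 + j] = min(1.0, table[k, j] / table[i, j]) / n1
        M1[s, s] = 1.0 - M1[s].sum()     # rejections and self-proposals stay put
        for l in range(n2):
            G2[s, i * n2 + l] = table[i, l] / table[i].sum()

P = M1 @ G2
eigs = np.linalg.eigvals(P)
flow = pi[:, None] * M1                  # symmetric iff M1 is reversible w.r.t. pi

print("M1 reversible:", np.allclose(flow, flow.T))
print("largest |imaginary part|:", np.max(np.abs(eigs.imag)))
print("range of real parts:", np.min(eigs.real), np.max(eigs.real))
```

Unlike in Theorem 1, nothing here forces the eigenvalues to be non-negative, since $M_1$ need not be a positive operator; the theorem only asserts that they are real and lie in $[-1, 1]$.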

Lemma 8. For any operators $C$ and $D$ on a Hilbert space $H$, the spectra $\sigma(CD)$ and $\sigma(DC)$ differ by at most $\{0\}$; i.e., if $\lambda \in \mathbf C$ and $\lambda \ne 0$, then $\lambda \in \sigma(CD)$ if and only if $\lambda \in \sigma(DC)$.
Proof. By replacing $C$ by $C/\lambda$, it suffices to assume that $\lambda = 1$. Thus, it suffices to prove that $I - DC$ is invertible if and only if $I - CD$ is invertible. But this follows from the identity $(I - DC)^{-1} = I + D(I - CD)^{-1} C$, which can be verified by multiplying $I + D(I - CD)^{-1} C$ by $I - DC$ (on either the left or the right side) and getting the result $I$.
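The identity used in the proof of Lemma 8 is easy to confirm numerically in finite dimensions. The following is a minimal sketch, assuming small random $4 \times 4$ matrices (the seed, size and scaling are arbitrary choices, made so that $I - CD$ is comfortably invertible).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
C = 0.2 * rng.standard_normal((n, n))
D = 0.2 * rng.standard_normal((n, n))
I = np.eye(n)

lhs = np.linalg.inv(I - D @ C)
rhs = I + D @ np.linalg.inv(I - C @ D) @ C

print("identity holds:", np.allclose(lhs, rhs))
# Multiplying the candidate inverse by I - DC on either side should give I.
print("left inverse: ", np.allclose(rhs @ (I - D @ C), I))
print("right inverse:", np.allclose((I - D @ C) @ rhs, I))
```

Neither $C$ nor $D$ is assumed symmetric here, matching the generality of the lemma, which holds for arbitrary bounded operators.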