Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering

We obtain upper bounds on the spectral gap of Markov chains constructed by parallel and simulated tempering, and provide a set of sufficient conditions for torpid mixing of both techniques. Combined with the results of [22], these results yield a two-sided bound on the spectral gap of these algorithms. We identify a persistence property of the target distribution, and show that it can lead unexpectedly to slow mixing that commonly used convergence diagnostics will fail to detect. For a multimodal distribution, the persistence is a measure of how "spiky", or tall and narrow, one peak is relative to the other peaks of the distribution. We show that this persistence phenomenon can be used to explain the torpid mixing of parallel and simulated tempering on the ferromagnetic mean-field Potts model shown previously. We also illustrate how it causes torpid mixing of tempering on a mixture of normal distributions with unequal covariances in R^M, a previously unknown result with relevance to statistical inference problems. More generally, any time a multimodal distribution includes both very narrow and very wide peaks of comparable probability mass, parallel and simulated tempering are shown to mix slowly.


Introduction
Parallel and simulated tempering [4; 13; 5] are Markov chain simulation algorithms commonly used in statistics, statistical physics, and computer science for sampling from multimodal distributions, where standard Metropolis-Hastings algorithms with only local moves typically converge slowly.
Tempering-based sampling algorithms are designed to allow movement between modes (or "energy wells") by successively flattening the target distribution. Although parallel and simulated tempering have distinct constructions, they are known to have closely related mixing times; Zheng [24] bounds the spectral gap of simulated tempering below by a multiple of that of parallel tempering.
Madras and Zheng [12] first showed that tempering could be rapidly mixing on a target distribution where standard Metropolis-Hastings is torpidly mixing, doing so for the particular case of the mean-field Ising model from statistical physics. "Rapid" and "torpid" here are formalizations of the relative terms "fast" and "slow", and are defined in Section 2. However, Bhatnagar and Randall [2] show that for the more general ferromagnetic mean-field Potts model with q ≥ 3, tempering is torpidly mixing for any choice of temperatures.
Woodard et al. [22] generalize the mean-field Ising example of [12] to give conditions which guarantee rapid mixing of tempering algorithms on general target distributions. They apply these conditions to show rapid mixing for an example more relevant to statistics, namely a weighted mixture of normal distributions in R^M with identity covariance matrices. In [22] the authors partition the state space into subsets on which the target distribution is unimodal. The conditions for rapid mixing of the tempering chain are that Metropolis-Hastings is rapidly mixing when restricted to any one of the unimodal subsets, that Metropolis-Hastings mixes rapidly among the subsets at the highest temperature, that the overlap between distributions at adjacent temperatures is decreasing at most polynomially in the problem size, and that an additional quantity γ (related to the persistence quantity of the current paper) is at most polynomially decreasing. These conditions follow from a lower bound on the spectral gaps of parallel and simulated tempering for general target distributions given in [22].
Here we provide complementary results, showing several ways in which the violation of these conditions implies torpid mixing of Markov chains constructed by parallel and simulated tempering. Most importantly, we identify a persistence property of distributions and show that the existence of any set with low conductance at low temperatures (e.g. a unimodal subset of a multimodal distribution) and having small persistence (as defined in Section 3 with interpretation in Section 5), guarantees tempering will mix slowly for any choice of temperatures. This result is troubling as this mixing problem will not be detected by standard convergence diagnostics (see Section 6).
We arrive at these results by deriving upper bounds on the spectral gaps of parallel and simulated tempering for arbitrary target distributions (Theorem 3.1 and Corollary 3.1). Combining with the lower bound in [22] then yields a two-sided bound.
In Section 4.2 we show that this persistence phenomenon can explain the torpid mixing of tempering techniques on the mean-field Potts model. The original result [2] uses a "bad cut" which partitions the space into two sets that have significant probability at temperature one, such that the boundary has low probability at all temperatures. We show that one of these partition sets has low persistence, also implying torpid mixing. We then show the persistence phenomenon for a mixture of normal distributions with unequal covariances in R M (Section 4.1), thereby proving that tempering is torpidly mixing on this example. In typical cases such as these, the low-conductance set is a unimodal subset of a multimodal distribution. Then the persistence measures how "spiky", or narrow, this peak is relative to the other peaks of the distribution; this is described in Section 5, where we show that whenever the target distribution includes both very narrow and very wide peaks of comparable probability mass, simulated and parallel tempering mix slowly.

Preliminaries
Let (X, ℱ, λ) be a σ-finite measure space with countably generated σ-algebra ℱ. Often X = R^M and λ is Lebesgue measure, or X is countable with counting measure λ. When we refer to an arbitrary subset A ⊂ X, we implicitly assume A ∈ ℱ. Let P be a Markov chain transition kernel on X, defined as in [19], which operates on distributions µ on the left and complex-valued functions f on the right, so that for x ∈ X,

(µP)(A) = ∫ P(z, A) µ(dz),   (Pf)(x) = ∫ f(z) P(x, dz).

If µP = µ then µ is called a stationary distribution of P. Define the inner product (f, g)_µ = ∫ f(z) ḡ(z) µ(dz), let L²(µ) be the set of functions f with (f, f)_µ < ∞, and call P µ-reversible if (Pf, g)_µ = (f, Pg)_µ for all f, g ∈ L²(µ). If P is µ-reversible, it follows that µ is a stationary distribution of P.
We will be primarily interested in distributions µ having a density π with respect to λ, in which case we define π[A] = µ(A) and define (f, g)_π, L²(π), and π-reversibility to be equal to the corresponding quantities for µ.
If P is aperiodic and φ-irreducible as defined in [16], µ-reversible, and nonnegative definite, then the Markov chain with transition kernel P converges in distribution to µ at a rate related to the spectral gap

Gap(P) = inf { E(f, f) / Var_µ(f) : f ∈ L²(µ), Var_µ(f) > 0 },

where E(f, f) = ((I − P)f, f)_µ is the Dirichlet form and Var_µ(f) = (f, f)_µ − |(f, 1)_µ|² is the variance of f. It can easily be shown that Gap(P) ∈ [0, 1] (for P not nonnegative definite, Gap(P) ∈ [0, 2]).
For any distribution µ_0 having a density π_0 with respect to µ, define the L²-norm ‖µ_0‖_2 = (π_0, π_0)_µ^{1/2}. For the Markov chain with P as its transition kernel, define the rate of convergence to stationarity as

r = inf_{µ_0} liminf_{n→∞} −(1/n) ln ‖µ_0 P^n − µ‖_2,

where the infimum is taken over distributions µ_0 that have a density π_0 with respect to µ such that π_0 ∈ L²(µ). The rate r is equal to −ln(1 − Gap(P)), where we define −ln(0) = ∞; equivalently,

‖µ_0 P^n − µ‖_2 ≤ (1 − Gap(P))^n ‖µ_0 − µ‖_2 for all n ≥ 0,

for every µ_0 that has a density π_0 ∈ L²(µ), and r is the largest quantity for which this holds for all such µ_0. These are facts from functional analysis (see e.g. [23; 11; 17]). Analogous results hold if the chain is started deterministically at x_0 for µ-a.e. x_0 ∈ X, rather than drawn randomly from a starting distribution µ_0 [17]. Therefore, for a particular such starting distribution µ_0 or fixed starting state x_0, the number of iterations n until the L²-distance to stationarity is less than some fixed ε > 0 is O(r^{−1} ln(‖µ_0 − µ‖_2 / ε)). Similarly, [11] show that the autocorrelation of the chain decays at rate r; their proof is stated for finite state spaces but applies to general state spaces as well. Therefore, informally speaking, the number of iterations of the chain required to obtain some number N_0 of approximately independent samples is of order N_0 r^{−1}.

The quantity r = −ln(1 − Gap(P)) is monotonically increasing in Gap(P); therefore lower (upper) bounds on Gap(P) correspond to lower (upper) bounds on r. In addition, −ln(1 − Gap(P))/Gap(P) approaches 1 as Gap(P) → 0. Therefore the order at which Gap(P) → 0 as a function of the problem size is equal to the order at which the rate of convergence to stationarity approaches zero. When Gap(P) (and thus r) is exponentially decreasing as a function of the problem size, we call P torpidly mixing. When Gap(P) (and thus r) is at most polynomially decreasing as a function of the problem size, we call P rapidly mixing.
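As an entirely illustrative sanity check of these relationships (not an example from the paper), the following sketch computes Gap(P) and r = −ln(1 − Gap(P)) for a two-state reversible chain, where the geometric decay of the distance to stationarity can be verified exactly:

```python
import numpy as np

# Toy two-state reversible chain: spectral gap and rate r = -ln(1 - Gap(P)).
a, b = 0.1, 0.2
P = np.array([[1 - a, a],
              [b, 1 - b]])
pi = np.array([b, a]) / (a + b)          # stationary distribution of P

# Gap(P) = 1 minus the second-largest eigenvalue of P (equals a + b here).
eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]
gap = 1.0 - eigvals[1]
r = -np.log(1.0 - gap)                   # rate of convergence to stationarity

# Distance to stationarity from a point mass decays geometrically at 1 - gap.
d = [np.abs(np.linalg.matrix_power(P, n)[0] - pi).sum() for n in (1, 5)]
```

For this chain Gap(P) = a + b, and the distance to stationarity contracts by exactly the factor 1 − Gap(P) per step.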
The rapid / torpid mixing distinction is a measure of the computational tractability of an algorithm; polynomial factors are expected to be eventually dominated by increases in computing power due to Moore's law, while exponential factors are presumed to cause a persistent computational problem.

Metropolis-Hastings
The Metropolis-Hastings algorithm provides a common way of constructing a transition kernel that is π-reversible for a specified density π on a space with measure λ. Start with a "proposal" kernel P(w, dz) having density p(w, ·) with respect to λ for all w ∈ X, and define the Metropolis-Hastings kernel as follows: draw a "proposal" move z ∼ P(w, ·) from the current state w, accept z with probability

ρ(w, z) = min{ 1, [π(z) p(z, w)] / [π(w) p(w, z)] },

and otherwise remain at w. The resulting kernel is π-reversible.
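For concreteness, here is a minimal random-walk Metropolis sketch (an illustration only, not the construction used later in the paper); the Gaussian proposal is symmetric, so p(w, z) = p(z, w) and the acceptance ratio reduces to π(z)/π(w):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_pi, x0, n_steps, step=1.0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x = np.asarray(x0, dtype=float)
    chain = [x.copy()]
    for _ in range(n_steps):
        z = x + step * rng.normal(size=x.shape)   # proposal z ~ P(w, .)
        # accept with probability min{1, pi(z)/pi(w)}
        if np.log(rng.uniform()) < log_pi(z) - log_pi(x):
            x = z
        chain.append(x.copy())
    return np.array(chain)

log_pi = lambda z: -0.5 * np.sum(z**2)            # standard normal target
chain = metropolis(log_pi, np.zeros(2), 2000)
```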

Parallel and Simulated Tempering
If the Metropolis-Hastings proposal kernel moves only locally in the space, and if π is multimodal, then the Metropolis-Hastings chain may move between the modes of π infrequently. Tempering is a modification of Metropolis-Hastings wherein the density of interest π is "flattened" in order to allow movement among the modes of π. For any inverse temperature β ∈ [0, 1] such that ∫ π(z)^β λ(dz) < ∞, define

π_β(z) = π(z)^β / ∫ π(u)^β λ(du).

For any z and w in the support of π, the ratio π_β(z)/π_β(w) monotonically approaches one as β decreases, flattening the resulting density. For any β, define T_β to be the Metropolis-Hastings chain with respect to π_β, or more generally assume that we have some way to specify a π_β-reversible transition kernel for each β, and call this kernel T_β.
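The flattening can be seen numerically. The sketch below tempers a made-up 1-D bimodal density on a grid and tracks the ratio of the density at the inter-mode valley to the density at the highest mode; the ratio approaches one as β decreases:

```python
import numpy as np

# Tempering on a grid: pi_beta is proportional to pi**beta.
# The mixture below is an illustrative 1-D example, not from the paper.
x = np.linspace(-10.0, 10.0, 4001)
pi = 0.5 * np.exp(-0.5 * (x + 3)**2) + 0.5 * np.exp(-2.0 * (x - 3)**2)
pi = pi / pi.sum()

def temper(beta):
    p = pi**beta
    return p / p.sum()

# Ratio of the density at the valley (x = 0) to the density at the highest
# mode; as beta decreases the ratio rises toward 1 and the barrier shrinks.
ratio = lambda beta: temper(beta)[np.argmin(np.abs(x))] / temper(beta).max()
ratios = [ratio(b) for b in (1.0, 0.5, 0.1)]
```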

Parallel tempering. Let B = {β ∈ (0, 1] : ∫ π(z)^β λ(dz) < ∞} denote the set of valid inverse temperatures.

The parallel tempering algorithm [4] simulates parallel Markov chains T_{β_k} at a sequence of inverse temperatures β_0 < … < β_N = 1 with β_0 ∈ B. The inverse temperatures are commonly specified in a geometric progression, and Predescu et al. [15] show an asymptotic optimality result for this choice.
Updates of individual chains are alternated with proposed swaps between temperatures, so that the process forms a single Markov chain with state x = (x^[0], …, x^[N]) on the space X_pt = X^{N+1} and stationary density π_pt(x) = ∏_{k=0}^N π_{β_k}(x^[k]). The marginal density of x^[N] under stationarity is π, the density of interest.
A holding probability of 1/2 is added to each move to guarantee nonnegative definiteness. The update move T chooses k uniformly from {0, …, N} and updates x^[k] according to T_{β_k}, leaving the remaining coordinates (x^[0], …, x^[k−1], x^[k+1], …, x^[N]) unchanged; formally, the other coordinates are held fixed using Dirac's delta function δ.
The swap move Q attempts to exchange two of the temperature levels via one of the following schemes:

PT1. Sample k, l uniformly from {0, …, N} and propose exchanging the value of x^[k] with that of x^[l]. Accept the proposed state, denoted (k, l)x, according to the Metropolis criterion preserving π_pt, i.e. with probability ρ(x, (k, l)x) = min{1, π_pt((k, l)x)/π_pt(x)}.

PT2. Sample k uniformly from {0, …, N − 1} and propose exchanging x^[k] and x^[k+1], accepting with probability ρ(x, (k, k + 1)x).
Both T and either form of Q are π_pt-reversible by construction, and nonnegative definite due to their 1/2 holding probability. Therefore the parallel tempering chain defined by P_pt = QTQ is nonnegative definite and π_pt-reversible, and so the convergence of P_pt^n to π_pt may be bounded using the spectral gap of P_pt.
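A minimal sketch of the resulting algorithm (using scheme PT2 and an illustrative bimodal target; the temperature ladder and proposal are arbitrary choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(z):                       # two well-separated modes at -5 and +5
    return np.logaddexp(-0.5 * (z + 5)**2, -0.5 * (z - 5)**2)

betas = [0.05, 0.2, 0.5, 1.0]        # beta_0 < ... < beta_N = 1
x = np.zeros(len(betas))             # x[k] is the state at level k

def update_move(x):
    # T: pick a level k uniformly, update x[k] by Metropolis w.r.t. pi^beta_k
    k = rng.integers(len(betas))
    z = x[k] + rng.normal()
    if np.log(rng.uniform()) < betas[k] * (log_pi(z) - log_pi(x[k])):
        x[k] = z

def swap_move(x):
    # Q (PT2): propose exchanging levels k and k+1; the acceptance ratio
    # pi_pt((k,k+1)x)/pi_pt(x) simplifies since the normalizers cancel
    k = rng.integers(len(betas) - 1)
    log_ratio = (betas[k] - betas[k + 1]) * (log_pi(x[k + 1]) - log_pi(x[k]))
    if np.log(rng.uniform()) < log_ratio:
        x[k], x[k + 1] = x[k + 1], x[k]

for _ in range(2000):                # one iteration of P_pt = QTQ per loop
    swap_move(x); update_move(x); swap_move(x)
```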
The above construction holds for any densities φ_k that are not necessarily tempered versions of π, by replacing T_{β_k} with any φ_k-reversible kernel T_k; the densities φ_k may be specified in any convenient way subject to φ_N = π. The resulting chain is called a swapping chain, with X_sc, λ_sc, P_sc, and π_sc denoting its state space, measure, transition kernel, and stationary density, respectively. Just as for parallel tempering, a swapping chain can be defined using swaps between adjacent levels only, or between arbitrary levels, and the two constructions will be denoted SC2 and SC1, analogously to PT2 and PT1 for parallel tempering. Although the terms "parallel tempering" and "swapping chain" are used interchangeably in the computer science literature, we follow the statistics literature in reserving "parallel tempering" for the case of tempered distributions, and use "swapping chain" to refer to the more general case.
Simulated tempering. An alternative to simulating parallel chains is to augment a single chain by an inverse temperature index k to create states (z, k) ∈ X_st = X × {0, …, N} with stationary density π_st(z, k) = φ_k(z)/(N + 1). The resulting simulated tempering chain [13; 5] alternates two types of moves: T′ samples z ∈ X according to T_k, conditional on k, while Q′ attempts to change k via one of the following schemes:

ST1. Propose a new temperature level l uniformly from {0, …, N} and accept with probability min{1, φ_l(z)/φ_k(z)}.

ST2. Propose a move to l = k − 1 or l = k + 1 with equal probability, rejecting proposals outside {0, …, N}, and accept with probability min{1, φ_l(z)/φ_k(z)}.

As before, a holding probability of 1/2 is added to both T′ and Q′; the transition kernel of simulated tempering is defined as P_st = Q′T′Q′. For lack of separate terms, we use "simulated tempering" to mean any such chain P_st, regardless of whether or not the densities φ_k are tempered versions of π.
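A matching sketch of simulated tempering with scheme ST2 follows; for simplicity the normalizing constants of the tempered densities are computed numerically on a grid, which sidesteps the weight-estimation issue that arises in practice:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_pi(z):                       # illustrative 1-D bimodal target
    return np.logaddexp(-0.5 * (z + 5)**2, -0.5 * (z - 5)**2)

betas = [0.05, 0.2, 0.5, 1.0]
grid = np.linspace(-30, 30, 20001)
dz = grid[1] - grid[0]
log_Z = [np.log(np.exp(b * log_pi(grid)).sum() * dz) for b in betas]

def log_phi(z, k):                   # log of the normalized density pi_beta_k
    return betas[k] * log_pi(z) - log_Z[k]

z, k, levels = 0.0, len(betas) - 1, []
for _ in range(2000):
    # T': Metropolis update of z with respect to phi_k
    znew = z + rng.normal()
    if np.log(rng.uniform()) < betas[k] * (log_pi(znew) - log_pi(z)):
        z = znew
    # Q' (ST2): propose l = k - 1 or k + 1, accept w.p. min{1, phi_l(z)/phi_k(z)}
    l = k + int(rng.choice([-1, 1]))
    if 0 <= l < len(betas) and np.log(rng.uniform()) < log_phi(z, l) - log_phi(z, k):
        k = l
    levels.append(k)
```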

Tempering Chains
The parallel and simulated tempering algorithms described in Section 2.2 are designed to sample from multimodal distributions. Thus when simulating these chains, it is typically assumed that if the temperature swaps between all pairs of adjacent temperatures are occurring at a reasonable rate, then the chain is mixing well. However, Bhatnagar and Randall [2] show that parallel tempering is torpidly mixing for the ferromagnetic mean-field Potts model with q ≥ 3 (Section 4.2), indicating that tempering does not work for all target distributions. It is therefore of significant practical interest to characterize properties of distributions which may make them amenable to, or inaccessible to, sampling using tempering algorithms.
In this Section we provide conditions for general target distributions π under which rapid mixing fails to hold. In particular, we identify a previously unappreciated property we call the persistence, and show that if the target distribution has a subset with low conductance for β close to one and low persistence for values of β within some intermediate β-interval, then the tempering chain mixes slowly. Somewhat more obviously, the tempering chain will also mix slowly if the inverse temperatures are spaced too far apart so that the overlap of adjacent tempered distributions is small.
Consider sets A ⊂ X that contain a single local mode of π along with the surrounding area of high density. If π has multiple modes separated by areas of low density, and if the proposal kernel makes only local moves, then the conductance of A with respect to Metropolis-Hastings will be small at low temperatures (β ≈ 1). The conductance of a set A ⊂ X with 0 < µ(A) < 1 is defined as:

Φ_P(A) = (1_A, P 1_{A^c})_µ / [µ(A) µ(A^c)] = ∫_A P(z, A^c) µ(dz) / [µ(A) µ(A^c)]

for P any µ-reversible kernel on X, where 1_A is the indicator function of A. Φ_P(A) provides an upper bound on Gap(P) [9]. Note that P reversible implies (1_A, P 1_{A^c})_µ = (1_{A^c}, P 1_A)_µ ≤ min{µ(A), µ(A^c)}, and so Φ_P(A) ≤ 2.
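The definition, and the fact that the conductance of any single set upper-bounds the spectral gap, can be checked numerically for a small discrete chain (an illustrative sketch; the bimodal target and the cut A are made-up choices):

```python
import numpy as np

# Conductance Phi_P(A) for a discrete Metropolis chain on {0,...,n-1} with a
# bimodal target, checking the Cheeger-type bound Gap(P) <= Phi_P(A).
n = 20
x = np.arange(n)
pi = np.exp(-0.5 * np.minimum((x - 4.0)**2, (x - 15.0)**2))
pi = pi / pi.sum()

# pi-reversible Metropolis kernel: +/-1 proposals plus 1/2 holding probability
P = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            P[i, j] = 0.25 * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()

A = x < 10                                      # cut through the valley
flow = pi[A] @ P[np.ix_(A, ~A)].sum(axis=1)     # integral over A of P(z, A^c)
phi = flow / (pi[A].sum() * pi[~A].sum())       # conductance of A

# spectral gap via the symmetrized kernel D P D^{-1}, D = diag(sqrt(pi))
D = np.sqrt(pi)
S = (D[:, None] * P) / D[None, :]
eig = np.linalg.eigvalsh((S + S.T) / 2)         # ascending eigenvalues
gap = 1.0 - eig[-2]
```

The bound follows from the variational characterization of Gap(P) with the test function f = 1_A.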
We will obtain upper bounds on the spectral gap of a parallel or simulated tempering chain in terms of an arbitrary subset A of X. Conceptually, the case where π|_A (the restriction of π to A) is unimodal as described above is the most insightful, but the bounds hold for all A ⊂ X such that 0 < π[A] < 1.
The bounds will involve the conductance of A under the chain T_β defined in Section 2.2, as well as the persistence of A under tempering by β. For any A ⊂ X such that 0 < π[A] < 1 and any density φ on X, we define the quantity

γ(A, φ) = φ[A] / π[A],

and define the persistence of A with respect to π_β as γ(A, π_β), also to be denoted by the shorthand γ(A, β). The persistence measures the decrease in the probability of A between π and π_β. If A has low persistence for small values of β, then a parallel or simulated tempering chain starting in A^c may take a long time to discover A at high temperatures (β near zero). If A is a unimodal subset of a multimodal distribution, then it typically also has low conductance at low temperatures (β ≈ 1), so the tempering chain may take a long time to discover A at every temperature, even when π[A] is large. A key point is that, due to the low persistence of the set, this problem does not manifest as low conductance of the high-temperature chain, which may well be rapidly mixing on π_β; nevertheless, it leads to slow mixing. This contradicts the common assumption in practice that if the highest-temperature chain is rapidly mixing and the swap acceptance rates between temperatures are high, then the tempering chain is rapidly mixing.
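The following sketch computes the persistence on a grid for a made-up 1-D mixture with one wide and one narrow peak of equal mass; the narrow peak loses mass rapidly under tempering, so γ(A, β) drops well below one:

```python
import numpy as np

# Persistence gamma(A, beta) = pi_beta[A] / pi[A] on a grid, for a 1-D mixture
# with a wide peak at -10 and a narrow ("spiky") peak at +10, each of mass 1/2.
# All numbers are illustrative choices, not from the paper.
x = np.linspace(-40.0, 40.0, 160001)
s1, s2 = 5.0, 0.2
pi = (0.5 / s1) * np.exp(-0.5 * ((x + 10) / s1)**2) \
   + (0.5 / s2) * np.exp(-0.5 * ((x - 10) / s2)**2)
pi = pi / pi.sum()

A = x > 0                              # the set containing the narrow peak

def persistence(beta):
    pb = pi**beta
    pb = pb / pb.sum()
    return pb[A].sum() / pi[A].sum()

gammas = [persistence(b) for b in (1.0, 0.5, 0.25)]
```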
Even if every subset A ⊂ X has large persistence at high temperatures, it is possible for some subset to have low persistence within an intermediate temperature interval. This causes slow mixing by creating a bottleneck in the tempering chain, since swaps between non-adjacent β and β′ typically have very low acceptance probability. The acceptance probability of such a swap in simulated tempering, given that z ∈ A, is given by the overlap of π_β and π_β′ with respect to A. The overlap of two distributions φ and φ′ with respect to a set A ⊂ X is given by [22]:

δ(A, φ, φ′) = (1/φ[A]) ∫_A min{φ(z), φ′(z)} λ(dz),

which is not symmetric. When considering tempered distributions π_β we will use the shorthand δ(A, β, β′) = δ(A, π_β, π_β′).
The most general results are given for any swapping or simulated tempering chain with a set of densities φ k not necessarily tempered versions of π. For any level k ∈ {0, . . . , N }, let γ(A, k) and δ(A, k, l) be shorthand for γ(A, φ k ) and δ(A, φ k , φ l ), respectively.
The following result, involving the overlap δ(A, k, l), the persistence γ(A, k), and the conductance Φ_{T_k}(A), is proven in the Appendix; for k* = 0, the maximum over levels below k* is vacuous and the bound is interpreted accordingly. One can obtain an alternative bound for the swapping chain by combining the bound for simulated tempering with the results of [24]. However, the alternative bound has a superfluous factor of N, so we prefer the one given here.
For the case where tempered distributions φ_k = π_{β_k} are used, the bounds in Theorem 3.1 show that the inverse temperatures β_k must be spaced densely enough to allow sufficient overlap between distributions at adjacent temperatures. If there is an A ⊂ X and a level k* such that the overlap δ(A, k, l) is exponentially decreasing in M for every pair of levels l < k* and k ≥ k*, and the conductance Φ_{T_{β_k}}(A) of A is exponentially decreasing for k ≥ k*, then the tempering chain is torpidly mixing. An example is given in Section 4.3.
The bounds in Theorem 3.1 are given for a specific choice of densities {φ_k}_{k=0}^N. When tempered densities are used, the bounds can be stated independently of the number and choice of inverse temperatures:

Corollary 3.1. Let P_pt be a parallel tempering chain using scheme PT1 or PT2, and let P_st be a simulated tempering chain using scheme ST1, with densities φ_k chosen as tempered versions of π. For any A ⊂ X such that 0 < π[A] < 1, and any β* ≥ inf{β ∈ B}, the analogous bounds hold, where for β* = inf{β ∈ B} they are interpreted as in the k* = 0 case of Theorem 3.1. This is a corollary of Theorem 3.1, verified by setting k* = min{k : β_k ≥ β*}.
Recall from Section 2 that torpid mixing of a Markov chain means that the spectral gap of the transition kernel is exponentially decreasing in the problem size M. Corollary 3.1 then implies the following result:

Corollary 3.2. Suppose there are a set A ⊂ X with 0 < π[A] < 1 and values β* ≤ β** in [0, 1] such that the persistence γ(A, β) is exponentially decreasing in M for every β ∈ B with β* ≤ β < β**, and the conductance Φ_{T_β}(A) is exponentially decreasing in M for every β ∈ B with β ≥ β**. Then parallel and simulated tempering are torpidly mixing.
The quantity γ({A_j}) defined in [22] is related to the persistence of the current paper; if φ_k[A_j] is a monotonic function of k for each j, then γ({A_j}) can be expressed in terms of the persistences γ(A_j, k). In addition, the conductance Φ_{T_k}(A) of the current paper is exactly the spectral gap of the projection matrix T̄_k of T_k with respect to the partition {A, A^c}, as defined in [22]. Since T̄_k is a 2 × 2 matrix, its spectral gap is given by the sum of its off-diagonal elements, which is precisely Φ_{T_k}(A) written in the form (3).
The lower bound given in [22] holds for any partition {A_j}_{j=1}^J of X, where T_k|_{A_j} is the restriction of the kernel T_k to the set A_j. Note that the upper and lower bounds are stated for arbitrary sets and partitions respectively, and so they also hold after optimizing over the choice of set A and partition {A_j}.

Torpid Mixing for a Mixture of Normals with Unequal Variances in R^M
Consider sampling from a target distribution given by an equally weighted mixture of two normal densities in R^M, with means −1_M and 1_M and covariance matrices σ_1² I_M and σ_2² I_M respectively, using a local proposal kernel S centered at the current state. When σ_1 = σ_2, Woodard et al. [22] have given an explicit construction of parallel and simulated tempering chains that is rapidly mixing. Here we consider the case σ_1 ≠ σ_2, assuming without loss of generality that σ_1 > σ_2.
For technical reasons, we will use a truncated approximation π̃ to π. Figure 1 shows π̃_β[A_2] as a function of β for M = 35, where A_2 is the half-space containing the mode of the narrower component. It is clear that for β < 1/2, π̃_β[A_2] is much smaller than π̃[A_2]. This effect becomes more extreme as M increases, so that the persistence of A_2 is exponentially decreasing for β < 1/2, as we will show. We will also show that the conductance of A_2 under Metropolis-Hastings for S with respect to π̃_β is exponentially decreasing for β ≥ 1/2, implying the torpid mixing of parallel and simulated tempering. The Metropolis-Hastings chains for S with respect to the densities restricted to each individual mode are rapidly mixing in M, as implied by results in Kannan and Li [8] (details are given in Woodard [21]). As we will see, however, Metropolis-Hastings for S with respect to π̃ itself is torpidly mixing in M. In addition, we will show that parallel and simulated tempering are also torpidly mixing for this target distribution for any choice of temperatures.
First, calculate π̃_β[A_2] as follows. Let F be the standard normal cumulative distribution function in one dimension. Consider any normal distribution in R^M with covariance σ² I_M for σ > 0. The probability under this normal distribution of any half-space that is Euclidean distance d from the center of the normal distribution at its closest point is F(−d/σ). This is due to the independence of the dimensions and can be shown by a rotation and scaling in R^M.
The distance between the half-space A_2 and the point −1_M is equal to √M, so π̃_β[A_2] can be computed from F. Recall the definition of B from Section 2.2; for the mixture π̃, we have B = (0, 1]. We will apply Corollary 3.2 with A = A_2, β* = 0, and β** = 1/2 to show that parallel and simulated tempering are torpidly mixing on the mixture π̃. Looking first at the persistence γ(A_2, β): for β < 1/2 it is exponentially decreasing in M, since the corresponding value of F is exponentially decreasing.
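The exponential decay of the persistence in M can also be seen in a back-of-the-envelope approximation (a sketch under the simplifying assumption that the two mixture components overlap negligibly; the σ values are made up, and the set is taken to be the region around the narrow component):

```python
import numpy as np

# Approximate persistence of the narrow-component region for an equally
# weighted mixture of N(-1_M, sigma1^2 I) and N(1_M, sigma2^2 I), ignoring
# the (exponentially small) overlap between the components.
# Tempering N(mu, s^2 I) by beta gives an unnormalized N(mu, s^2/beta I)
# whose total mass carries a factor (2 pi s^2)^{M(1-beta)/2}.
def persistence_approx(M, beta, sigma1=2.0, sigma2=1.0):
    log_m1 = 0.5 * M * (1 - beta) * np.log(2 * np.pi * sigma1**2)
    log_m2 = 0.5 * M * (1 - beta) * np.log(2 * np.pi * sigma2**2)
    frac = 1.0 / (1.0 + np.exp(log_m1 - log_m2))  # approx. pi_beta[narrow region]
    return frac / 0.5                              # divide by pi[narrow] = 1/2

vals = [persistence_approx(M, beta=0.5) for M in (10, 20, 40)]
```

At β = 1 the approximation returns 1, and for fixed β < 1 it decays geometrically in M at rate (σ_2/σ_1)^{M(1−β)}.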
Turning now to the conductance Φ_{T_β}(A_2), define the boundary ∂A_2 of A_2 with respect to the Metropolis-Hastings kernel T_β as the set of z ∈ A_2 from which it is possible to move to A_1 via one move according to T_β. Then ∂A_2 contains only z ∈ A_2 within distance M^{−1} of A_1, so Φ_{T_β}(A_2) is bounded by a multiple of π̃_β[∂A_2]/π̃_β[A_2]. For M > 1 this ratio is exponentially decreasing in M for every β ≥ 1/2: analytic integration shows for any a > 0 that F(−a) ≤ N_1(a; 0, 1)/a, where N_1(·; 0, 1) denotes the standard normal density in one dimension, and applying this bound shows that the π̃_β-probability of the thin slab ∂A_2 is exponentially small relative to π̃_β[A_2]. In particular, Φ_{T_β}(A_2) is exponentially decreasing for β = 1, so the standard Metropolis-Hastings chain is torpidly mixing. Using the facts that (7) and (9) are exponentially decreasing, Corollary 3.2 implies that parallel and simulated tempering are also torpidly mixing for any number and choice of temperatures.

Small Persistence for the Mean-Field Potts Model
The Potts model is a type of discrete Markov random field which arises in statistical physics, spatial statistics, and image processing [1; 3; 7]. We consider the ferromagnetic mean-field Potts model with q ≥ 2 colors and M sites, having distribution π(z) ∝ exp{(α/M) Σ_{1≤i<j≤M} 1[z_i = z_j]} for z ∈ {1, …, q}^M, where α > 0 is the interaction strength. We will use the proposal kernel S that changes the color of a single site, where the site and color are drawn uniformly at random. It is well known that Metropolis-Hastings for S with respect to π is torpidly mixing for α ≥ α_c [6]. Bhatnagar and Randall [2] show that parallel and simulated tempering are also torpidly mixing on the mean-field Potts model with q = 3 and α = α_c (their argument may extend to q ≥ 3 and α ≥ α_c). Here we show that this torpid mixing can be explained using the persistence phenomenon described in Section 3. We use the same cut of the state space as do Bhatnagar and Randall [2], since it has low conductance for β close to 1. Our torpid-mixing explanation will be stated for q ≥ 3 and α ≥ α_c. Our initial definitions will be given for q ≥ 2 to allow us to address the case q = 2 in Section 4.3.
Let σ_i(z) denote the number of the M sites assigned color i, and σ(z) = (σ_1(z), …, σ_q(z)). Then π can be written as a function of σ(z) alone, and the marginal distribution ρ of σ is obtained by multiplying by the number of configurations with the given color counts. For q ≥ 3 define the "critical" parameter value α_c = 2(q − 1) ln(q − 1)/(q − 2). Gore and Jerrum [6] apply (10) to rewrite ρ, up to factors polynomial in M, as exp{M f_α(σ/M)}, where f_α(x) = Σ_{i=1}^q g_α(x_i) and g_α(x) = (α/2) x² − x ln x. Observe that f_α does not depend on M. It is also shown in [6] that any local maximum of f_α is of the form m = (x, (1 − x)/(q − 1), …, (1 − x)/(q − 1)) with g′_α(x) = g′_α((1 − x)/(q − 1)), or a permutation thereof (the apostrophe denoting the first derivative). Gore and Jerrum also show that at α = α_c the local maxima occur for x = 1/q and x = (q − 1)/q. Letting m_1 = (1/q, …, 1/q), m_2 = ((q − 1)/q, 1/(q(q − 1)), …, 1/(q(q − 1))), and m_3 equal to m_2 with the first two elements permuted, note that f_{α_c}(m_1) = f_{α_c}(m_2), and that for any a, f_α(a) is invariant under permutation of the elements of a. Therefore the q + 1 local maxima of the function f_{α_c} are also global maxima (for q = 2 there is a single global maximum).
For M large enough the q+1 global maxima of f α c correspond to q+1 local maxima of ρ(σ); Figure 2 shows the 4 modes of ρ(σ) for the case q = 3.
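The equal heights of these maxima can be checked numerically for q = 3 (an illustrative verification along the slice through m_1 and m_2 only, not a full optimization over the simplex):

```python
import numpy as np

# For q = 3, alpha_c = 4 ln 2. Along the slice m(x) = (x, (1-x)/2, (1-x)/2),
# f_alpha has equal-height local maxima at x = 1/3 (the disordered mode m_1)
# and x = 2/3 (the ordered mode m_2).
q = 3
alpha_c = 2 * (q - 1) * np.log(q - 1) / (q - 2)

def g(x, alpha):
    return 0.5 * alpha * x**2 - x * np.log(x)

def f_slice(x, alpha):
    return g(x, alpha) + (q - 1) * g((1 - x) / (q - 1), alpha)

f1 = f_slice(1 / q, alpha_c)         # value at m_1 = (1/3, 1/3, 1/3)
f2 = f_slice((q - 1) / q, alpha_c)   # value at m_2 = (2/3, 1/6, 1/6)
```

Both values equal (2/3) ln 2 + ln 3, and both points are local maxima along the slice.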
We will additionally need the following results. The proofs are given in the thesis by Woodard [20]. Gore and Jerrum state this result for α = α c , but their argument can be extended in a straightforward manner; details are given in [20].
As in Bhatnagar and Randall [2], define the set A = {z : σ_1(z) > M/2}. Then we have the following two results, also shown in [20]. Now consider any q ≥ 3 and α ≥ α_c. For any β, the density π_β is equal to the mean-field Potts density with parameter αβ. Recall that T_β is the Metropolis-Hastings kernel for S with respect to π_β. Take the value of τ from Proposition 4.4 and define the inverse temperature β** = α_c/α − τ/α.
The persistence γ(A, β) for β < β** and the conductance Φ_{T_β}(A) for β ≥ β** are then exponentially decreasing in M. Therefore Corollary 3.2 can be used to show the torpid mixing of parallel and simulated tempering on the mean-field Potts model with q ≥ 3 and α ≥ α_c, for any number and choice of inverse temperatures.

Torpid Mixing on the Mean-Field Ising Model using Fixed Temperatures
Consider the mean-field Ising model, which is simply the mean-field Potts model from Section 4.2 with q = 2; recall the definitions from that section. Madras and Zheng [12] show that parallel and simulated tempering are rapidly mixing on this model when the inverse temperatures are suitably chosen. Now consider any α_1, α_2 such that α_c < α_2 and α_1 < α_2. If α_1 ≤ α_c, let x_1 = 1/2; otherwise, let x_1 be the value of x in Proposition 4.5 for α_1. Let x_2 be the value of x in Proposition 4.5 for α_2, so that x_1 < x_2. Recalling the definition of C_{α,ε} from Proposition 4.2, C_{α_1,ε} ∩ C_{α_2,ε} = ∅.

Interpretation of Persistence
As described in Section 3, tempering algorithms mix slowly when there is a set A ⊂ X which has low conductance under the low-temperature (β = 1) chain and has small persistence for some range of β values. Here small persistence means π_β[A]/π[A] near zero. To understand how the existence of such a set depends on the properties of π, we can rewrite this ratio as:

π_β[A] / π[A] = E_{π|A}(π(Z)^{β−1}) / E_π(π(Z)^{β−1}),    (12)

where π|_A is the restriction of π to A, and E_π(π(Z)^{β−1}) denotes the expected value of the random variable W = π(Z)^{β−1} where Z has distribution π.
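Identity (12) can be verified numerically on a grid; the 1-D target below, with a wide peak in A^c and a tall narrow peak in A, is a made-up example:

```python
import numpy as np

# Numerical check of identity (12): persistence equals the ratio of
# expectations of W = pi(Z)^(beta-1) under pi|A and under pi.
x = np.linspace(-20.0, 20.0, 40001)
pi = (0.6 / 3.0) * np.exp(-0.5 * ((x + 6) / 3.0)**2) \
   + (0.4 / 0.3) * np.exp(-0.5 * ((x - 6) / 0.3)**2)
pi = pi / pi.sum()
A = x > 0
beta = 0.4

# left side: the persistence, computed directly from pi^beta
pb = pi**beta
pb = pb / pb.sum()
lhs = pb[A].sum() / pi[A].sum()

# right side: E_{pi|A}[W] / E_pi[W] with W = pi(Z)^(beta-1)
w = pi**(beta - 1.0)
rhs = (pi[A] @ w[A] / pi[A].sum()) / (pi @ w)
```

Since the narrow peak in A is the taller one, the persistence is below one, consistent with the stochastic-dominance discussion that follows.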
Let Z 1 and Z 2 be random variables with distributions π| A and π, respectively, and define random variables W 1 = π(Z 1 ) β−1 and W 2 = π(Z 2 ) β−1 . One way in which the ratio (12) may be smaller than one is if W 2 stochastically dominates W 1 , or equivalently if the random variable Y 1 = π(Z 1 ) stochastically dominates Y 2 = π(Z 2 ). This means that within the set A the probability mass is concentrated in places where the density is high relative to those places where mass is concentrated for the rest of the space A c . For example, if π consists of two peaks, one in A and the other in A c , and π(A) = π(A c ), then loosely speaking the peak in A is more "spiky", or tall and narrow, than the peak in A c .
Define the set A = {z ∈ R³ : Σ_i z_i ≥ 0}, which contains almost all of the probability mass of the first component and almost no mass from the second component. Figure 3 shows the cumulative distribution functions of the random variables Y_1 and Y_2 defined above, where it can be seen that Y_1 stochastically dominates Y_2. Intuitively this is because π has two peaks, one primarily in A and the other in A^c, with the first taller and more narrow than the other. As shown above, this stochastic dominance implies that the persistence γ(A, β) is less than one for any β ∈ (0, 1).
More generally, the persistence of a set A can be less than one whenever Y_1 tends to be larger than Y_2, in the sense that the transformation W_1 = Y_1^{β−1} has a smaller expectation than W_2 = Y_2^{β−1}. As before, this occurs when the probability mass within A is concentrated in regions of high density relative to the regions where mass concentrates in A^c; if π consists of two peaks with π(A) = π(A^c), one in A and one in A^c, then informally speaking the peak in A is taller and more narrow than the peak in A^c. Now take the more interesting case where π consists of multiple peaks of comparable probability, some of which are much taller than others; then the tallest peaks are also the narrowest peaks.
Define A to contain one of these tall, narrow peaks. Since there are other peaks of the distribution that are much lower and wider, and none that are much taller and narrower, the expectation of W 1 is much smaller than that of W 2 . The persistence of A is therefore small, and since A is a set having low conductance at low temperatures, the results in Section 3 imply that parallel and simulated tempering mix slowly. Here we mean slow mixing in a relative sense, that the smaller the persistence the slower the mixing, when other factors are held constant.

Conclusion
We have seen that if the multimodal target distribution has very wide peaks and very narrow peaks of comparable probability, then parallel and simulated tempering mix slowly. This means that if the simulated tempering chain is initialized in one of the wide peaks, or for parallel tempering if every level of the tempering chain is initialized in a wide peak, then the tempering chain will take a very large number of iterations to discover the narrow peaks of the distribution.
During application of simulated or parallel tempering, the acceptance rate of swap or temperature change moves is monitored, as are standard Markov chain convergence diagnostics. If the convergence diagnostics do not detect a problem, and if the acceptance rate for swap or temperature changes is high, then the tempering chain is presumed to be mixing well among the modes of the target distribution. However, we have shown that small persistence can cause slow mixing even when the acceptance rate for swaps or temperature changes, as measured by the quantity δ, is large. Additionally, standard Markov chain convergence diagnostics will rarely detect the problem; convergence diagnostics based on the history of the chain cannot detect the fact that there are undiscovered modes, unless they take into account some specialized knowledge about the distribution.
Widely-used convergence diagnostics, such as time-series plots and autocorrelation plots, make few assumptions about the target distribution; these convergence diagnostics cannot infer features of the distribution in parts of the space that have not been explored. Even the Gelman-Rubin diagnostic, which is specifically designed to detect lack of convergence due to multimodality, works very poorly when some modes have a much smaller "basin of attraction" than others [21]. This is typically the case for the narrow peaks with which we are concerned.
When there are undiscovered modes, inferences based on samples from the tempering chain can be inaccurate. Practitioners should therefore be cautious about inferences that have been obtained using parallel and simulated tempering, just as for Metropolis-Hastings, and not presume that all the modes of the distribution have been discovered.
This slow mixing result is not surprising, since narrow peaks that have a small basin of attraction are extremely difficult to find in a large space. This has been called the "needle in a haystack" or "witch's hat" problem in the statistics literature, where it is recognized as causing difficulty for Metropolis-Hastings and Gibbs samplers [14]. We suspect that the problem of approximately sampling from a multimodal distribution containing very narrow peaks at unknown locations can be shown to be NP-complete (this question is addressed in [18]). If so, then parallel and simulated tempering fail in exactly the same situation that all other sampling methods would fail, namely for high-dimensional multimodal distributions with some very narrow peaks.