HOW TO COMBINE FAST HEURISTIC MARKOV CHAIN MONTE CARLO WITH SLOW EXACT SAMPLING

,


Background
Let π be a given probability distribution on a set S. Given a function g : S → R, we want to estimate its mean ḡ := S g(s)π(ds).As we learn in elementary statistics, one can obtain an estimate for ḡ by taking samples from π and using the sample average g-value as an estimator.But algorithms which sample exactly from π may be prohibitively slow.This is the setting for the Markov chain Monte Carlo (MCMC) method, classical in statistical physics and over the last ten years studied extensively as statistical methodology [4,7,9,12].In MCMC one designs a Markov chain on state-space S to have stationary distribution π.Then the sample average g-value over a long run of the chain is a heuristic estimator of ḡ.Diagnostics for assessing length of run required, and expressions for heuristic confidence intervals, form a substantial part of MCMC methodology [11].In general one cannot make such estimates rigorous, because one cannot eliminate the possibility that all the samples seen come from some small part of the state space which is almost disconnected from the remainder.Rigorous estimates typically require an a priori bound on some notion of the chain's mixing time (e.g. the relaxation time defined at (3)); and while there is now substantial theoretical literature on mixing times [1,3,5] it deals with settings more tractable than most statistical applications.This paper investigates the interface between rigor and heuristics in a particular (perhaps artificial) context.Suppose we have a guess τ for the mixing time of the chain, based on simulation diagnostics or heuristic estimates [6] or some non-rigorous mathematical argument.Suppose we have some separate scheme (an exact sampler) which gives independent samples exactly from π. Imagining that sampling τ steps of the chain is roughly equivalent to sampling once from π, it is natural to consider the ratio ρ = cost of one exact sample cost of τ steps of the chain where cost refers to computational time.If ρ < 1 then one would just use the exact sampler and forget MCMC.If ρ is extremely large then we might not be able to afford even one exact sample, and we are forced to rely on MCMC (this is the setting typically envisaged in MCMC).This paper addresses the remaining context, where ρ is large but not extremely large; in other words, we can afford to simulate many steps of the chain (enough to make estimates heuristically good) but can afford only a few exact samples.In the case of sampling from general d-dimensional densities, for instance, exact samplers (e.g. based on rejection sampling using some tractable comparison density) are typically feasible only for small d, and MCMC is used for large d, so there should always be a window of d-values which fits our "ρ large, but not extremely large" context.In this context, we could just use the exact sampler to get n independent samples from π. Then the sample average g-value provides an estimate of ḡ with O(n −1/2 ) error.But instead, suppose we use these n independent samples as initial states and generate n independent mstep realizations of the Markov chain.If diagnostic tests suggest that mixing occurs in τ steps then we have an "effective sample size" of (n × m/τ ) and the heuristic estimate of error (when we use the overall sample average g-value as an estimator) would be O( τ /(nm)).Our basic result, Theorem 2.1, shows that in a certain sense such error bounds can be made rigorous.This basic result requires an initial guess for mixing time; an "adaptive" variation, Theorem 2.2, eliminates that requirement.

Results
The discussion in section 1 provides conceptual context for our result, but let us now state the (rather minimal) mathematical assumptions for the result.For simplicity we assume the state space S is finite (though since our results are non-asymptotic they must extend to the general case without essential change).We assume (for reasons explained in Section 2.1) the function g : S → R is bounded, so by rescaling we may assume 0 ≤ g(•) ≤ 1. (1) We assume the Markov chain is reversible, that is to say its transition matrix K satisfies With no further assumptions, Theorem 2.1 says we can construct a conservative confidence interval I which is always valid, i.e. satisfies (4).That is, for validity there are no "implicit asymptotics", and indeed we do not even need to assume K is irreducible.The length of the confidence interval will depend on the data, i.e. the realizations of the chain, but (5) bounds the length in terms of the relaxation time of the chain, defined as Based on 2n exact samples from π and 2mn steps of the K-chain, we can construct an interval I such that and where k c α := 2 c 2/α + log(4/α) .

Discussion
Of course 1 − α is the prescribed confidence level.Choice of c reflects a trade-off between the "target length" k c α max 1 n , τ nm log 2 n of confidence interval and the probability of meeting that target.We have not tried to optimize numerical constants; at various places where we have used Chebyshev's inequality, more complicated arguments would presumably give better bounds.
To interpret Theorem 2.1 in order-of-magnitude terms, suppose we take m = nτ , so that we use 2n 2 τ steps of the chain.Then the "target length" of our confidence interval will be O(n −1 log n).The theorem guarantees a confidence interval that is always valid, and guarantees that, if τ 2 is indeed not more than τ , then the length of the confidence interval will likely not exceed the target length.This contrasts with the O(n −1/2 ) length confidence interval obtained by using the exact sampler only, and gets close to the O(n −1 ) length of the heuristic confidence interval.
Admittedly the numerical constants make our numerical error bounds too crude to be useful in practice.For instance, taking m = nτ , α = 0.32 and c = 1, the target length is about 10 n , compared to the 1.0 √ n length of the usual confidence interval based on exact samples only.So not until n ≈ 30, 000 would our interval be guaranteed to be actually shorter (provided τ/τ 2 were sufficiently large).But the order-of-magnitude result does seem of theoretical interest, in particular because of its similarity to the idea [10] of self-testing algorithms.That paper describes an algorithm for generating random self-avoiding walks.As the authors write [10] "While there are a number of Monte Carlo algorithms used to solve these problems in practice, these are heuristic and their correctness relies on unproven conjectures.In contrast, our algorithm is shown rigorously to produce answers with specific accuracy and confidence.Only the efficiency of the algorithm relies on a widely believed conjecture, and a novel feature is that this conjecture can be tested as the algorithm proceeds."In our MCMC setting, we cannot estimate rigorously the actual value of τ 2 , but we can self-justify inferences based on estimated τ 2 .On a more technical note, let us outline why Theorem 2.1 gives close to the best possible bounds on confidence interval length.Indeed, we claim that the best one could hope for is length of order The point is that there are two different "obstacles" to sharp estimation.First, consider the eigenvector g 2 associated with eigenvalue λ 2 .It is easy to estimate the variance of the overall sample average when g = g 2 , and this variance works out as order τ2 nm ; so we should not hope to have smaller estimation error than the corresponding standard deviation τ2 mn .Second, suppose some subset A of the state space, with π(A) = 1/n, is almost disconnected.Then it is not unlikely that all n exact samples, and hence the n realizations of the chains, miss A, and so a contribution E π [g(•)1 A ] to ḡ would be "invisible" to our simulations, and so this contribution is an unavoidable source of possible error when using sample averages as estimators.Our assumption (1) that g is bounded was intended as the simplest way of bounding this error -bounding it as order 1/n.So Theorem 2.1 shows that, if our initial guess τ is indeed roughly close to τ 2 , then our rigorous confidence interval's length will be roughly of the minimal order (6), up to log n terms.In Section 2.3 we give a natural "adaptive" variation in which we prescribe two numbers τ < τmax , where as before τ is a heuristic estimate of τ 2 , and where 2n 2 τmax is the maximum number of steps of the chain that we would be willing to simulate.Theorem 2.2 gives an alwaysvalid confidence interval which, if τ 2 is indeed small relative to τmax , will have length of order n −1 log n and will require order n 2 τ 2 steps of the chain.

Outline of construction and proof
Recall the "procedure" of simulating n realizations of m steps of the Markov chain, starting from n exact samples from π.The construction of the confidence interval I in Theorem 2.1 can be summarized as follows.(i) Perform this procedure once, and find the overall average g-value -call it ḡ * .(ii) Perform the procedure again, and for 1 ≤ i ≤ n let A i be the average g-value over the i'th m-step realization.(iii) Test whether for all i, where r(n, m) To analyze the validity of the confidence interval, the key point is that after observing the event " happening n times out of n, we can be confident that its probability is , and then the sample average of n truncated variables has s.d. of order log n √ r(n,m) Finally, to bound the chance of not reporting the short confidence interval we need to bound the chance of a truncation being needed, and a bound can be derived from large deviation estimates for reversible chains.

An adaptive version
In the procedure underlying Theorem 2.1 we make a single guess τ and hope that the "good event" which leads to a short confidence interval will happen; if not, we settle for a long confidence interval.A natural variation is to specify that, if the "good event" fails, then repeat the process with τ replaced by 2τ, 4τ, 8τ, . . .and continue until the "good event" happens or until we reach some predetermined limit on numbers of steps of the chain.
So we are prescribing the maximum number of steps of the chain to be 2n 2 τmax .The bound in ( 9) is less than 0.05 for n = 8 and goes to zero very rapidly as n increases (the constant 96 just emerges from the proof technique).So if τmax is indeed large compared to τ 2 then by generating a small number of exact samples one can construct a confidence interval for ḡ which will be "short" with high probability, and the number of steps of the Markov chain required will be O(n 2 (τ 2 ∨ τ )).More precisely, if we start with small τ then we are confident of needing at most 192n 2 τ 2 steps of the chain.But as discussed earlier, though the interval is "short" in the sense of being O( log n n ), the numerical constants make the interval numerically large for moderate values of n.

Proof of Theorem 2.1 3.1 Construction of the confidence interval
be a reversible Markov chain with initial state X * i0 = Z * i ; these n Markov chains are independent.Define and ḡ * is our initial guess for ḡ.Now re-run the entire simulation independently to get {Z 1 , Z 2 , . . ., Z n }, another set of n independent samples from π, and (X ij ) m−1 j=0 another independent but identically distributed family of n reversible Markov chains each with initial state X i0 = Z i .We further define ḡ * otherwise (12) where r(n, m) := min n, m τ .Let for the number of truncations, and call the event where In the next section we shall prove Proposition 3.1 For any b > 0 Replacing α by α/2 in Proposition 3.1 and setting b = 2/α, we see that the confidence interval satisfies the requirement of (4) that P(ḡ Notice that (17) is bounded by k c α max 1 n , τ nm log 2 n; here we use assumption n ≥ 2 which implies log 2 n ≥ 1.So to prove (5) and complete the proof of Theorem 2.1, it is enough to prove (in section 3.3)

Proof of Proposition 3.1
We denote the conditional probability, conditional expectation and conditional variance given ḡ * by P ḡ * , E ḡ * and Var ḡ * respectively.

Proof of Theorem 2.2
For the first part of the procedure for constructing the confidence interval I, simulate {Z * i , 1 ≤ i ≤ n} and {Z i , 1 ≤ i ≤ n} as at the start of section 3.1.This part of the procedure is not repeated.Then for k ∈ {0, 1, 2, . . ., a}, let The bound asserted in (9) follows from (32) and (33).The number of chain steps used equals 2n 2 × 2 T τ = 2n 2 × M .