Function-Specific Mixing Times and Concentration Away from Equilibrium

Slow mixing is the central hurdle in the analysis of Markov chains, especially those used for Markov chain Monte Carlo (MCMC) approximations. In many applications, it is of interest only to estimate the stationary expectations of a small set of functions, and so the usual definition of mixing based on total variation convergence may be too conservative. Accordingly, we introduce function-specific analogs of mixing times and spectral gaps, and use them to prove Hoeffding-type function-specific concentration inequalities. These results show that it is possible for empirical expectations of functions to concentrate long before the underlying chain has mixed in the classical sense. We use our techniques to derive confidence intervals that are sharper than those implied by both classical Markov chain Hoeffding bounds and Berry-Esseen-corrected CLT bounds. For applications that require testing rather than point estimation, we show similar improvements over recent sequential testing results for MCMC. We conclude by applying our framework to real MCMC data examples, providing evidence that our theory is both accurate and relevant to practice.

1. Introduction. Methods based on Markov chains play a critical role in statistical inference, where they form the basis of Markov chain Monte Carlo (MCMC) procedures for estimating intractable expectations (see, e.g., Gelman et al., 2013; Robert and Casella, 2005). In MCMC procedures, it is the stationary distribution of the Markov chain that typically encodes the information of interest. Thus, MCMC estimates are asymptotically exact, but their accuracy at finite times is limited by the convergence rate of the chain.
The usual measures of convergence rates of Markov chains, namely the total variation mixing time and the absolute spectral gap of the transition matrix (Levin, Peres and Wilmer, 2008), correspond to very strong notions of convergence and depend on global properties of the chain. Indeed, convergence of a Markov chain in total variation corresponds to uniform convergence of the expectations of all unit-bounded functions to their equilibrium values. The resulting uniform bounds on the accuracy of expectations (Chung et al., 2012; Gillman, 1998; Joulin et al., 2010; Kontorovich et al., 2014; Léon and Perron, 2004; Lezaud, 2001; Paulin, 2012; Samson et al., 2000) may be overly pessimistic, since they are not indicative of the mixing times of specific functions, such as means and variances, that are likely to be of interest in an inferential setting.
Another limitation of the uniform bounds is that they typically assume that the chain has arrived at the equilibrium distribution, at least approximately. Consequently, applying such bounds requires either assuming that the chain is started in equilibrium, which is impossible in practical applications of MCMC, or taking a burn-in period proportional to the mixing time of the chain, which is also unrealistic, if not impossible, in practical settings.
Given that the goal of MCMC is often to estimate specific expectations, as opposed to obtaining the stationary distribution, in the current paper we develop a function-specific notion of convergence with application to problems in Bayesian inference. We define a notion of "function-specific mixing time," and we develop function-specific concentration bounds for Markov chains, as well as spectrum-based bounds on function-specific mixing times. We demonstrate the utility of both our overall framework and our particular concentration bounds by applying them to examples of MCMC-based data analysis from the literature and by using them to derive sharper confidence intervals and faster sequential testing procedures for MCMC.
1.1. Preliminaries. We focus on Markov chains that are discrete in space and time, and that satisfy the usual conditions of irreducibility, aperiodicity, and reversibility. These conditions guarantee the existence of a unique stationary distribution π. The issue is then to understand how quickly the distribution of the Markov chain approaches this stationary distribution.
The classical analysis of mixing defines the convergence rate in terms of the total variation distance

d_TV(p, q) := sup_{f : [d] → [0,1]} |E_p[f] − E_q[f]|,

where the supremum ranges over all unit-bounded functions. The mixing time is then defined as the number of steps required to ensure that the chain is within total variation distance δ of the stationary distribution; it is given by

T(δ) := min{ n > 0 : max_{i ∈ [d]} d_TV(π_n^{(i)}, π) ≤ δ },

where π_n^{(i)} denotes the distribution of the chain state X_n given the starting state X_0 = i.
Total variation is a worst-case measure of distance, and the resulting notion of mixing time can therefore be overly conservative when the Markov chain is used to approximate the expectation of a fixed function, or expectations over some relatively limited class of functions. Accordingly, it is of interest to consider the following function-specific measure, which we refer to as the f-discrepancy:

d_f(p, q) := |E_p[f] − E_q[f]|.

The f-discrepancy leads naturally to a function-specific notion of mixing time:

Definition 1 (f-mixing time). For a given function f, the f-mixing time is

T_f(δ) := min{ n > 0 : max_{i ∈ [d]} d_f(π_n^{(i)}, π) ≤ δ }.

In the sequel, we also define function-specific notions of the spectral gap of a Markov chain, which can be used to bound the f-mixing time and to obtain function-specific concentration inequalities.
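To make these definitions concrete, the following sketch computes the total variation distance, the f-discrepancy, and the two associated mixing times by direct matrix iteration. The three-state chain, the function f, and the tolerance δ = 0.05 are illustrative choices, not taken from the paper.

```python
import numpy as np

def tv_distance(p, q):
    # d_TV(p, q) = (1/2) * ||p - q||_1, equivalently the supremum of
    # |E_p[f] - E_q[f]| over functions f taking values in [0, 1].
    return 0.5 * np.abs(p - q).sum()

def f_discrepancy(p, q, f):
    # d_f(p, q) = |E_p[f] - E_q[f]| for a fixed function f (vector of values).
    return abs(p @ f - q @ f)

def mixing_time(P, pi, delta, dist, n_max=10_000):
    # Smallest n such that max_i dist(row i of P^n, pi) <= delta.
    d = P.shape[0]
    Pn = np.eye(d)
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if max(dist(Pn[i], pi) for i in range(d)) <= delta:
            return n
    return None

# A hypothetical reversible 3-state chain with a slow "global" mode; the
# matrix is symmetric, so the uniform distribution is stationary.
P = np.array([[0.98, 0.02, 0.00],
              [0.02, 0.96, 0.02],
              [0.00, 0.02, 0.98]])
pi = np.full(3, 1 / 3)
f = np.array([0.0, 1.0, 0.0])   # indicator of the middle state

T = mixing_time(P, pi, 0.05, tv_distance)
T_f = mixing_time(P, pi, 0.05, lambda p, q: f_discrepancy(p, q, f))
print(T_f, "<=", T)   # the f-mixing time can be far smaller than T
```

Here f is orthogonal (after centering) to the slowly decaying eigenvector, so T_f is much smaller than the global mixing time T.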
1.2. Related work. Mixing times are a classical topic of study in Markov chain theory, and there is a large collection of techniques for their analysis (see, e.g., Aldous and Diaconis, 1986; Diaconis and Fill, 1990; Levin, Peres and Wilmer, 2008; Meyn and Tweedie, 2012; Ollivier, 2009; Sinclair, 1992). These tools, and the results based on them, generally apply only to worst-case mixing times. Relatively little is known about mixing with respect to individual functions or limited classes of functions. Similar limitations exist in studies of concentration of measure and studies of confidence intervals and other statistical functionals that depend on tail probability bounds. Existing bounds are generally uniform, or non-adaptive, and the rates that are reported include a factor that encodes the global mixing properties of the chain and that does not adapt to the function (Chung et al., 2012; Gillman, 1998; Joulin et al., 2010; Kontorovich et al., 2014; Léon and Perron, 2004; Lezaud, 2001; Paulin, 2012; Samson et al., 2000). These factors, which do not appear in classical bounds for independent random variables, are generally either some variant of the spectral gap γ of the transition matrix, or else a mixing time T(δ_0) of the chain for some absolute constant δ_0 > 0. For example, the main theorem of Léon and Perron (2004) shows that for a function f : [d] → [0, 1] and a chain started from its stationary distribution, X_0 ∼ π, we have

P[ |(1/N) Σ_{n=1}^N f(X_n) − E_π[f]| ≥ ε ] ≤ 2 exp(−γ_0 N ε²),

where γ_0 := min{1 − λ_2(P), 1}. The requirement that the chain start in equilibrium can be relaxed by adding a correction for the burn-in time (Paulin, 2012). Extensions of this and related bounds, including bounded-differences-type inequalities and generalizations to continuous Markov chains and non-Markov mixing processes, have also appeared in the literature (Kontorovich et al., 2014; Samson et al., 2000).
The concentration result has an alternative formulation in terms of the mixing time instead of the spectral gap (Chung et al., 2012). This version and its variants are weaker, since the mixing time can be lower bounded as

T(δ_0) ≥ (1/γ* − 1) · log(1/(2δ_0)),

where γ* := min{1 − λ_2, 1 − |λ_d|} ≤ γ_0 is the absolute spectral gap (Levin, Peres and Wilmer, 2008).
In terms of the minimum probability π_min := min_i π_i, the corresponding upper bound, T(δ_0) ≤ (1/γ*) · log(1/(δ_0 π_min)), is larger by an extra factor of log(1/π_min), which potentially leads to a significant gap between 1/γ_0 and T(δ_0), even for a moderate constant such as δ_0 = 1/8. Similar distinctions arise in our analysis, and we elaborate on them at the appropriate junctures.
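Both directions of this comparison can be checked numerically. The sketch below computes the exact T(δ_0) of a small hypothetical birth-death chain by direct iteration and compares it against the standard spectral sandwich (1/γ* − 1) log(1/(2δ_0)) ≤ T(δ_0) ≤ (1/γ*) log(1/(δ_0 π_min)), with δ_0 = 1/8 as in the text:

```python
import numpy as np

# Hypothetical reversible (birth-death) chain; symmetric, so pi is uniform.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
pi = np.full(3, 1 / 3)

evals = np.sort(np.linalg.eigvalsh(P))[::-1]            # real spectrum
gamma_star = 1.0 - max(abs(evals[1]), abs(evals[-1]))   # absolute spectral gap

delta0 = 1 / 8
lower = (1 / gamma_star - 1) * np.log(1 / (2 * delta0))
upper = (1 / gamma_star) * np.log(1 / (delta0 * pi.min()))

# Exact T(delta0) by direct iteration of the transition matrix.
def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

Pn, n = np.eye(3), 0
while max(tv(Pn[i], pi) for i in range(3)) > delta0:
    Pn, n = Pn @ P, n + 1
print(lower, "<=", n, "<=", upper)
```

The exact mixing time lands strictly between the two spectral estimates, illustrating the log(1/π_min) slack in the upper bound.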
1.3. Organization of the paper. In the remainder of the paper, we elaborate on these ideas and apply them to MCMC. In Section 2, we state some concentration guarantees based on function-specific mixing times, as well as some spectrum-based bounds on f -mixing times, and the spectrum-based Hoeffding bounds they imply. Section 3 is devoted to further development of these results in the context of several statistical models. More specifically, in Section 3.1, we show how our concentration guarantees can be used to derive confidence intervals that are superior to those based on uniform Hoeffding bounds and CLT-type bounds, whereas in Section 3.2, we study confidence intervals in the context of sequential testing. In Section 4, we show that our mixing time and concentration bounds improve over the nonadaptive bounds in real examples of MCMC from the literature. Finally, the bulk of our proofs are given in Section 5, with some more technical aspects of the arguments deferred to the appendices.
2. Main results. We now present our main technical contributions, starting with a set of "master" Hoeffding bounds with exponents given in terms of f -mixing times. As we explain in Section 2.3, these mixing time bounds can be converted to spectral bounds by applying various bounds on the f -mixing time in terms of the spectrum, such as those we prove in Section 2.2.
In Section 2.2, we also define and analyze the corresponding function-specific spectral quantities, introducing them as they become necessary.
2.1. Master Hoeffding bound. In this section, we present a master Hoeffding bound that provides concentration rates depending on the mixing properties of the chain only through the f-mixing time T_f. The only hypotheses needed for the bounds to hold are that the chain has been run for at least N ≥ T_f steps (essentially so that thinning is possible) and that the chain was started from a distribution π_0 whose f-discrepancy from π is small (so that the expectation of each iterate f(X_n) is close to µ), even if its total variation discrepancy from π is large. Note that the latter requirement imposes only a very mild restriction, since it can always be satisfied by first running the chain for a burn-in period of T_f steps and then beginning to record samples.
Theorem 1. Given ν := min{µ, 1 − µ} and some fixed ε > 0, assume that the initial distribution π_0 and the sample size N satisfy the conditions described above.

To gain intuition for this result, we consider a special case that applies whenever the true expectation µ lies in [1/4, 3/4]. In this case, we have the lower bound ν²/4 ≥ ε²/16. Therefore, provided that we draw the initial point X_0 from a distribution π_0 whose f-discrepancy from π is sufficiently small, the conclusion of the theorem applies. Compared to the bounds in earlier work (e.g., Léon and Perron, 2004), the bounds in equations (7) and (8) have several distinguishing features. The primary difference is that the effective sample size N/T_f(ε²/16) is a function of f, which can lead to significantly sharper bounds on the deviations of the empirical means than the uniform bounds can deliver. Further, unlike the uniform results, the bound does not require that the chain has reached equilibrium, or even approximate equilibrium, in a total variation sense. Instead, the result applies provided that the chain has equilibrated only approximately, and only with respect to f. This condition can be guaranteed to hold simply by running the chain for a burn-in period of T_f(ε²/16) steps. Adding the burn-in period does not significantly alter the bound; in fact, it merely reduces the effective sample size by 1.
The appearance of the function-specific mixing time T_f in the bounds comes with both advantages and disadvantages. A notable disadvantage, shared with the mixing time versions of the uniform bounds, is that spectrum-based bounds on the mixing time (including our f-specific ones) introduce a log(1/π_min) term that can be a significant source of looseness. On the other hand, obtaining rates in terms of mixing times comes with the advantage that any bound on the mixing time translates directly into a version of the concentration bound, with the mixing time replaced by its upper bound. Moreover, since the π_min^{-1} term is likely to be an artifact of the spectrum-based approach, and possibly even just of the proof method, it may be possible to turn the mixing-time-based bound into a stronger spectrum-based bound with a more sophisticated analysis. We go part of the way toward doing this, albeit without completely removing the π_min^{-1} term. An analysis based on mixing time also has the virtue of better capturing the non-asymptotic behavior of the rate. Indeed, for a fixed function f, there exists a function-specific spectral gap γ_f ≥ 0 such that T_f(δ) scales as (1/γ_f) log(1/δ) as δ → 0. These asymptotics can be used to turn our aforementioned theorem into a variant of the results of Léon and Perron (2004), with γ_0 replaced by a value γ_f that, under mild conditions, is at least as large as γ_0. While (9) has ε² inside T_f, we note that this is only a small disadvantage, as it introduces at worst a log(1/ε) dependence in the denominator and therefore loosens the bound by at most a log factor. However, as we explore in Section 4, such an asymptotic spectrum-based view loses a great deal of information needed to deal with practical cases, where often γ_f = γ_0 and yet T_f(δ) ≪ T(δ) even for very small values of δ > 0. There we derive more fine-grained concentration inequalities that capture this non-asymptotic behavior.
Intuitively, we interpret Theorem 1 as establishing the "effective" sample size N_eff := ⌊N/T_f(ε²/16)⌋. This intuition is backed by the Hoeffding bound derived in Corollary 1, and it is useful as a simple mental model of these bounds. On the other hand, interpreting the theorem this way effectively plugs in the asymptotic behavior of T_f and does not account for the non-asymptotic properties of the mixing time; the latter may actually be more favorable and lead to substantially better effective sample sizes than the naive asymptotic interpretation predicts. Seen from this perspective, the master bounds come with the advantage that any bound on T_f that takes advantage of favorable non-asymptotics translates directly into a stronger version of the Hoeffding bound. We investigate these issues empirically in Section 4.

2.2. Bounds on f-mixing times. We generally do not have direct access to either the mixing time T(δ) or the f-mixing time T_f(δ). Fortunately, any bound on T_f translates directly into a variant of the tail bound (7). Accordingly, this section is devoted to methods for bounding these quantities. Since mixing time bounds are equivalent to bounds on d_TV and d_f, we frame the results in terms of distances rather than times. These results can then be inverted in order to obtain mixing-time bounds in applications.
The simplest bound is a uniform bound on the total variation distance, which also yields a bound on the f-discrepancy. In particular, if the chain is started with distribution π_0, then we have

d_TV(π_n, π) ≤ (1/√π_min) · λ*^n · d_TV(π_0, π).    (12)
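This uniform bound can be verified numerically on a small chain. The sketch below iterates a hypothetical reversible birth-death chain from a point-mass start and checks the inequality at every step (λ* here is the largest non-unit eigenvalue magnitude):

```python
import numpy as np

# Hypothetical reversible (birth-death) chain; pi = (0.2, 0.4, 0.4).
P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])

# Stationary distribution: left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
lam_star = np.sort(np.abs(np.real(w)))[-2]   # largest non-unit magnitude

pi_0 = np.array([1.0, 0.0, 0.0])             # point-mass start

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

p_n, ok = pi_0.copy(), True
for n_step in range(1, 50):
    p_n = p_n @ P
    bound = pi.min() ** -0.5 * lam_star ** n_step * tv(pi_0, pi)
    ok &= tv(p_n, pi) <= bound + 1e-12       # small tolerance for round-off
print(ok)
```

The check passes at every step, with the actual distance decaying at the rate λ*^n, as the bound predicts.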
In order to improve upon this bound, we need to develop function-specific notions of the spectrum and spectral gap. The simplest approach is to consider the (left) eigenvectors to which the function is not orthogonal, and to define a spectral gap restricted to the corresponding eigenvalues.
Definition 2 (f-eigenvalues and spectral gaps). For a function f : [d] → R, we define the index set

J_f := { j ≥ 2 : q_j^T f ≠ 0 },

where q_j denotes a left eigenvector associated with λ_j. Similarly, we define the f-eigenvalue λ_f := max{ |λ_j| : j ∈ J_f } and the f-spectral gap γ_f := 1 − λ_f.

Using this notation, it is straightforward to show that if the chain is started with the distribution π_0, then

d_f(π_n, π) ≤ (1/√π_min) · λ_f^n · d_TV(π_0, π).

This bound, though useful in many cases, is also rather brittle: it requires f to be exactly orthogonal to the eigenfunctions of the transition matrix outside J_f. For example, a function f_0 with a good value of λ_f can be perturbed by an arbitrarily small amount in a way that makes the resulting perturbed function f_1 have λ_{f_1} = λ*. More broadly, the bound is of little value for functions with a small but nonzero inner product with the eigenfunctions corresponding to large eigenvalues (which is likely to occur in practice; cf. Section 4), or in scenarios where f lacks symmetry (cf. the random function example in Section 2.4).
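The quantities of Definition 2 are directly computable for small chains. The sketch below extracts J_f, λ_f, and γ_f for a hypothetical symmetric chain (for which left and right eigenvectors coincide) and an indicator function f; the tolerance for "orthogonal" is a numerical assumption:

```python
import numpy as np

# Hypothetical reversible chain; symmetric, so eigh applies and the left and
# right eigenvectors coincide. Eigenvalues are exactly 1, 0.98, and 0.94.
P = np.array([[0.98, 0.02, 0.00],
              [0.02, 0.96, 0.02],
              [0.00, 0.02, 0.98]])
evals, evecs = np.linalg.eigh(P)            # orthonormal eigenvector columns
order = np.argsort(evals)[::-1]             # sort descending
evals, evecs = evals[order], evecs[:, order]

def f_spectrum(f, tol=1e-12):
    # J_f: indices j >= 1 (0-indexed, excluding the top eigenvalue) whose
    # eigenvectors are not orthogonal to f; lambda_f is the largest eigenvalue
    # magnitude over J_f, and gamma_f := 1 - lambda_f.
    J_f = [j for j in range(1, len(evals)) if abs(evecs[:, j] @ f) > tol]
    lam_f = max((abs(evals[j]) for j in J_f), default=0.0)
    return J_f, lam_f, 1.0 - lam_f

f = np.array([0.0, 1.0, 0.0])               # indicator of the middle state
J_f, lam_f, gamma_f = f_spectrum(f)
print(J_f, lam_f, gamma_f)
```

For this f, the eigenvector of the slow mode 0.98 is orthogonal to f, so λ_f = 0.94 and γ_f = 0.06, three times larger than the global gap γ_0 = 0.02.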
In order to address these issues, we now derive a more fine-grained bound on d_f. The basic idea is to split the lower f-spectrum J_f into a "bad" piece J, whose eigenvalues are close to 1 but whose eigenvectors are approximately orthogonal to f, and a "good" piece J_f \ J, whose eigenvalues are far from 1 and which therefore do not require control on the inner products of their eigenvectors with f. More precisely, for a given set J ⊂ J_f, let us define

∆*_J := Σ_{j∈J} |q_j^T f|,   λ_J := max{ λ_j : j ∈ J },   and   λ_{−J} := max{ λ_j : j ∈ J_f \ J }.
We obtain the following bound (Lemma 1), expressed in terms of λ_{−J} and λ_J, which we generally expect to obey the relation 1 − λ_{−J} ≫ 1 − λ_J; the quantity ∆*_J controls the part of the discrepancy that comes from the J part of the spectrum. We can get a still sharper estimate by instead making use of a vector quantity h^n_J that more precisely summarizes the interactions between f and J. We call the bound based on this quantity the oracle adaptive bound, because it uses the exact value of the part of the discrepancy coming from the J eigenspaces, while using the same bound as above for the part of the discrepancy coming from J_f \ J.
Lemma 2 (Oracle f-discrepancy bound). We emphasize that, although Lemma 2 is stated in terms of the initial distribution π_0, when we apply the bound to the real examples we consider, we replace all quantities that depend on π_0 by their worst-case values in order to avoid dependence on the initialization; this results in a ‖h^n_J‖_∞ term in place of the inner product in the lemma.
2.3. Concentration bounds. The mixing time bounds from Section 2.2 allow us to translate the master Hoeffding bound into a weaker but more interpretable (and in some instances, more directly applicable) concentration bound. The first result we prove along these lines applies meaningfully only to functions f whose absolute f-spectral gap γ_f is larger than the absolute spectral gap γ*. It is a direct consequence of the master Hoeffding bound and the simple spectral mixing bound (12), and it delivers the asymptotics in N and ε promised in Section 2.1.
Deriving a Hoeffding bound using the sharper f-mixing bound given in Lemma 1 requires more care, both because of the added complexity of managing two terms in the bound and because one of those terms does not decay, meaning that the bound holds only for sufficiently large deviations ε > 0.
The following result represents one way of articulating the bound implied by Lemma 1; it can yield a great improvement over the previous two results when the contribution from the bad part of the spectrum J, that is, the part of the spectrum that brings λ_f closer to 1 than we would like, is negligible at the scale of interest. Recall that Lemma 1 expresses the contribution of J via the quantity ∆*_J.
Similar arguments can be applied to combine the master Hoeffding bounds with the oracle f -mixing bound Lemma 2, but we omit the corresponding result for the sake of brevity. The proofs for both aforementioned corollaries are in Section 5.4.

2.4. Example: Lazy random walk on C_2d. In order to illustrate the mixing time and Hoeffding bounds from Section 2.2, we analyze their predictions for various classes of functions on the 2d-cycle C_2d, identified with the integers modulo 2d. In particular, consider the Markov chain corresponding to a lazy random walk on C_2d; its transition matrix is given by

P(u, v) = 1/2 if v = u,   1/4 if v ≡ u ± 1 (mod 2d),   and 0 otherwise.

It is easy to see that the chain is irreducible, aperiodic, and reversible, and that its stationary distribution is uniform. It can be shown (Levin, Peres and Wilmer, 2008) that its mixing time scales proportionally to d². However, as we now show, several interesting classes of functions mix much faster; in fact, a "typical" function, meaning a randomly chosen one, mixes much faster than the naive mixing bound would predict.
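The lazy walk and its spectrum are easy to construct explicitly. The following sketch builds the transition matrix for an illustrative size (d = 8), confirms that the uniform distribution is stationary, and checks the known closed form (1 + cos(πj/d))/2 for the eigenvalues:

```python
import numpy as np

def lazy_cycle(d):
    # Lazy random walk on C_{2d}: hold with probability 1/2, and move to
    # each neighbour (mod 2d) with probability 1/4.
    n = 2 * d
    P = np.zeros((n, n))
    for u in range(n):
        P[u, u] = 0.5
        P[u, (u - 1) % n] = 0.25
        P[u, (u + 1) % n] = 0.25
    return P

d = 8
P = lazy_cycle(d)
pi = np.full(2 * d, 1 / (2 * d))
assert np.allclose(pi @ P, pi)   # the uniform distribution is stationary

# Spectrum: (1 + cos(pi * j / d)) / 2 for j = 0, ..., 2d - 1.
evals = np.sort(np.linalg.eigvalsh(P))[::-1]
expected = np.sort([(1 + np.cos(np.pi * j / d)) / 2 for j in range(2 * d)])[::-1]
print("spectrum matches:", np.allclose(evals, expected))
```

Note the second eigenvalue (1 + cos(π/d))/2 ≈ 1 − π²/(4d²), which is the source of the order-d² global mixing time.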
The epitome of a rapidly mixing function is the parity function, given by f_parity(u) := 1 if u is even, and 0 otherwise. It is easy to see that no matter what the choice of initial distribution π_0 is, we have E[f_parity(X_1)] = 1/2, and thus f_parity mixes in a single step. A more general class of examples arises from considering the eigenfunctions of P, which are given by g_j(u) = cos(πju/d) (see, e.g., Levin, Peres and Wilmer, 2008). We define a class of functions of varying regularity by setting f_j := (1 + g_j)/2 for 0 ≤ j ≤ d. Here we have limited 0 ≤ j ≤ d because f_j and f_{2d−j} behave analogously. Note that the parity function f_parity corresponds to f_d. Intuitively, one might expect that some of these functions mix well before d² steps have elapsed, both because each f_j is orthogonal to all of the non-top eigenvectors other than g_j, including those with eigenvalues close to 1, and because as j gets larger, the periods of f_j become smaller and smaller, meaning that their global behavior can be increasingly well determined by looking at local snapshots, which can be seen in few steps.
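The one-step mixing of the parity function is a short calculation: from any state u, the walk keeps the parity of u with probability 1/2 (by holding) and flips it with probability 1/2 (by moving either way), so the expectation is exactly 1/2 after one step. A direct check over all point-mass starts (illustrative size d = 8):

```python
import numpy as np

d = 8
n = 2 * d
P = np.zeros((n, n))
for u in range(n):
    P[u, u] = 0.5
    P[u, (u - 1) % n] = 0.25
    P[u, (u + 1) % n] = 0.25

f_parity = 1 - (np.arange(n) % 2)   # indicator of the even states

# E[f_parity(X_1) | X_0 = u] = 0.5 * f(u) + 0.5 * (1 - f(u)) = 1/2 exactly,
# for every starting state u, hence for every initial distribution pi_0.
for start in range(n):
    pi_1 = P[start]                 # law of X_1 given X_0 = start
    assert abs(pi_1 @ f_parity - 0.5) < 1e-12
print("f_parity mixes in one step")
```

Since every point-mass start equilibrates f_parity in one step, so does every mixture of them.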
Our mixing bounds allow us to make this intuition precise, and our Hoeffding bounds allow us to prove correspondingly improved concentration bounds for the estimation of µ = E_π[f_j] = 1/2. Indeed, we have

γ_{f_j} = 1 − λ_j = (1 − cos(πj/d))/2 ≳ j²/d².

Consequently, equation (12) predicts that

T_{f_j}(δ) ≲ (d²/j²) · log(2d/δ),    (19)

where we have used the trivial bound E_π[f²] ≤ 1 to simplify the inequalities. Note that this is a significant improvement over d² as soon as j ≫ log d. Moreover, the bound (19) can itself be improved, since each f_j is orthogonal to all eigenfunctions other than 1 and g_j, so that the log d factors can all be removed by a more carefully argued form of Lemma 1. It thus follows directly from the bound (18) that if we draw N + T_{f_j}(ε²/8) samples, we obtain a Hoeffding-type tail bound with burn-in time N_b = T_{f_j}(ε²/8). We point out that the sharper analysis mentioned above would allow us to remove the log 2d factors.
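The predicted d²/j² scaling is visible in exact computations. The sketch below computes f_j-mixing times on the cycle by matrix iteration, for an illustrative size (d = 16) and an arbitrary tolerance δ = 0.05:

```python
import numpy as np

d = 16
n = 2 * d
P = np.zeros((n, n))
for u in range(n):
    P[u, u] = 0.5
    P[u, (u - 1) % n] = 0.25
    P[u, (u + 1) % n] = 0.25
pi = np.full(n, 1.0 / n)
u = np.arange(n)

def f_mixing_time(f, delta, n_max=20_000):
    # Smallest k with max_i |E[f(X_k) | X_0 = i] - E_pi[f]| <= delta.
    Pk = np.eye(n)
    for k in range(1, n_max + 1):
        Pk = Pk @ P
        if np.max(np.abs(Pk @ f - pi @ f)) <= delta:
            return k
    return None

# f_j = (1 + g_j)/2 with g_j(u) = cos(pi * j * u / d).
times = {j: f_mixing_time(0.5 * (1 + np.cos(np.pi * j * u / d)), 0.05)
         for j in (1, 4, d)}
print(times)   # higher-frequency f_j mix faster; f_d (parity) mixes in 1 step
```

Since f_j lies in the span of the constant function and the single eigenfunction g_j, its discrepancy decays exactly as (1/2)λ_j^k, so the computed times track (d²/j²) log(1/(2δ)) closely.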
A more interesting example comes from considering a randomly chosen function f : C_2d → [0, 1]. Indeed, suppose that the function values are sampled i.i.d. from some distribution ν on [0, 1] whose mean is µ* = 1/2. We can then show that for any fixed δ* > 0, with high probability over the randomness of f, the f-mixing time obeys the bound stated in equation (22). For δ ≫ (log d)/√d, this reflects a drastic improvement over the global mixing time of order d² log(1/δ). The core idea behind the proof of equation (22) is to apply Lemma 1 with a suitably chosen bad set J = J_δ. It can be shown that ‖h_j‖_∞ = 1 for all 0 ≤ j < 2d, and that with high probability the inner products q_j^T f are uniformly small, simultaneously for all j ∈ J_δ, which suffices to reduce the first part of the sharper f-discrepancy bound to order δ.
In order to estimate the rate of concentration, we proceed as follows. Taking δ = c_0 ε² for a suitably chosen universal constant c_0 > 0, we show that ∆_J := ε²/32 ≥ ∆*_J. We can then set ∆ = ε²/32 and observe that, with high probability over f, ν ≥ 1/4, so that the deviation in Corollary 2 satisfies the bound 2(∆_J + ∆)/ν ≤ ε. With δ as above, we have 1 − λ_{−J} ≥ c_1 ε⁴/(d log d) for another universal constant c_1 > 0. Thus, if we are given N + T_f(ε²/16) samples, the resulting tail bound decays exponentially in N ε⁶/(d log² d), up to a universal constant c_2 > 0. Consequently, it suffices for the sample size to be of order (d log² d)/ε⁶ in order to achieve an estimation accuracy of ε. Notice that this requirement is an improvement over the d²/ε² from the uniform Hoeffding bound provided that ε ≫ (log² d / d)^{1/4}. Proofs of all these claims can be found in Appendix B.
2.5. Alternative Hoeffding bounds. Although we focus most of our attention in this paper on the master Hoeffding bound in Theorem 1 and its cousin Corollary 1, one can obtain a variation on this master Hoeffding bound, representing different tradeoffs, by making a small change in the argument used to prove Theorem 1. We believe that this alternate bound is of independent interest, and we thus state it here and briefly expound on its properties.
Intuitively, we interpret Theorem 2 as providing a corresponding effective sample size for its tail bound. Which version of the master bound to use depends on the application context. In terms of functional form, the two results represent a clear tradeoff: Theorem 2 exhibits greater conceptual complexity as a function of N in exchange for a simpler dependence on ε. These forms lead to very different arguments in theoretical analyses based on the bounds. In terms of concentration predictions, the tradeoff is less clear, since it depends on the relationship between ε and N. In the analyses presented in this paper, we found that the bounds expressed in terms of ε² were easier to apply than the bounds implied by Theorem 2, but it may be that in some scenarios the reverse will hold.
For completeness, we also state another version of the Hoeffding bound, whose relationship to Theorem 2 is the same as that of Corollary 1 to Theorem 1.
The proofs of the aforementioned theorem and corollary are in Sections 5.1 and 5.4 respectively.
3. Statistical applications. We now consider how our results apply to Markov chain Monte Carlo (MCMC) in various statistical settings. Our investigation proceeds along three connected avenues. We begin by showing, in Section 3.1, how our concentration bounds can be used to provide confidence intervals for stationary expectations that avoid the over-optimism of pure CLT predictions without incurring the prohibitive penalty of the Berry-Esseen correction, or the global mixing rate penalty associated with spectral-gap-based confidence intervals. Then, in Section 3.2, we show how our results allow us to improve on recent sequential hypothesis testing methodologies for MCMC, again replacing the dependence on the spectral gap by a dependence on the f-mixing time. Later, in Section 4, we illustrate the practical significance of function-specific mixing properties by using our framework to analyze three real-world instances of MCMC, with both the models and the datasets drawn from real examples in the literature.
3.1. Confidence intervals for posterior expectations. In many applications, a point estimate of E π f does not suffice; the uncertainty in the estimate must be quantified, for instance by providing confidence intervals. In this section, we discuss how improved concentration bounds can be used to obtain sharper confidence intervals. In all cases, we assume the Markov chain is started from some distribution π 0 that need not be the stationary distribution, meaning that the confidence intervals must account for the burn-in time required to get close to equilibrium.
We first consider a bound that is an immediate consequence of the uniform Hoeffding bound of Léon and Perron (2004, Thm. 1). As one would expect, it gives contraction at the usual Hoeffding rate, but with an effective sample size of N_eff = γ_0(N − T_0), where γ_0 = min{1 − λ_2, 1} and T_0 is the tunable burn-in parameter. Note that this means that no matter how small T_f is compared to the global mixing time T, the effective sample size incurs the penalty for a global burn-in; likewise, no matter how favorable the mixing rate of f in the regime of interest, the effective sample size remains bounded by the global spectral parameter γ_0.
Let us now state one result that can be derived using our Hoeffding bounds. The resulting confidence interval adapts to the function, both in terms of the burn-in time required, which falls from a global mixing time to an f-specific mixing time, and in terms of rate, which falls from 1/γ_0 to T_f(δ) for an appropriately chosen δ > 0. We first note that the one-sided tail bound of Theorem 1 can be written as e^{−r_N(ε)/4}, for an appropriate rate function r_N. If we wish for each tail to have probability mass less than α/2, we need to choose ε > 0 so that r_N(ε) ≥ 4 log(2/α); conversely, any such ε corresponds to a valid two-sided 1 − α confidence interval. Let us summarize our conclusions:

Theorem 3. For any half-width ε_N ∈ [r_N^{-1}(4 log(2/α)), ∞), the interval μ̂_N ± ε_N is a valid 1 − α confidence interval for µ.

In order to make the result more amenable to interpretation, first note that for any 0 < η < 1, the rate function r_N admits a more explicit lower bound r_{N,η}. Consequently, whenever r_{N,η}(ε_N) ≥ 4 log(2/α) and ε_N ≥ η, we are guaranteed that a symmetric interval of half-width ε_N is a valid (1 − α)-confidence interval. In many situations, we will not have direct access to T_f(δ), but we can often obtain an upper bound T̂_f(δ) that is valid for all δ > 0. In Section 5.5, which contains the proofs for this section, we therefore prove a strengthened form of Theorem 3 and its corollary that applies in that setting.
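Given any nondecreasing rate function r_N, the half-width r_N^{-1}(4 log(2/α)) can be found by bisection. The sketch below illustrates this inversion; the particular r_N used here, built from an effective sample size N // T_f(ε²/16) with a stand-in mixing-time bound T_f(δ) = ⌈log(1/δ)/γ_f⌉, and the values of N and γ_f, are all hypothetical, not the exact form from the theorem.

```python
import math

def ci_half_width(r_N, alpha, lo=1e-6, hi=1.0, iters=60):
    # Smallest eps with r_N(eps) >= 4 * log(2 / alpha), assuming r_N is
    # nondecreasing in eps; located by bisection on [lo, hi].
    target = 4 * math.log(2 / alpha)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r_N(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical rate function (illustrative only): Hoeffding exponent eps^2
# times an effective sample size N // T_f(eps^2 / 16).
N, gamma_f = 100_000, 0.05
T_f = lambda delta: math.ceil(math.log(1 / delta) / gamma_f)
r_N = lambda eps: eps ** 2 * (N // T_f(eps ** 2 / 16))

width = ci_half_width(r_N, alpha=0.05)
print(round(width, 4))
```

Any sharper bound on T_f plugs into the same inversion unchanged, which is how improved mixing bounds translate directly into narrower intervals.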
A popular alternative strategy for building confidence intervals using MCMC relies on the Markov central limit theorem (e.g., Flegal, Haran and Jones, 2008; Jones and Hobert, 2001; Glynn and Lim, 2009; Robert and Casella, 2005). If the Markov CLT held exactly, it would lead to appealingly simple confidence intervals of width proportional to σ_f/√N. Unfortunately, the CLT does not hold exactly, even after the burn-in period. The amount by which it fails to hold can be quantified using a Berry-Esseen bound for Markov chains, as we now discuss. Let S̄_N denote the standardized partial sum of the first N iterates f(X_n). We then have a bound on sup_s |P[S̄_N ≤ s] − Φ(s)|, where Φ is the standard normal CDF. Note that this bound accounts both for the non-stationarity error and for the non-normality error at stationarity. The non-normality term decays far more slowly than the non-stationarity term, going roughly at the rate 1/(γ_0 √N), while the latter goes at the rate e^{−γ_0 N}. While the bound (29) makes it possible to prove a corrected CLT confidence interval, the resulting bound has two significant drawbacks. The first is that it holds only for extremely large sample sizes, on the order of 1/(π_min γ_0²), compared to the order log(1/π_min)/γ_0 required by the uniform Hoeffding bound. The second, shared by the uniform Hoeffding bound, is that it is non-adaptive and therefore bottlenecked by the global mixing properties of the chain.
More precisely, to be valid, the interval requires a sample size bounded below on the order of 1/(π_min γ_0²), and, when it is valid, it has width of order σ_f/√N. It is worth pointing out that this confidence interval width contains a hidden mixing penalty. Indeed, defining the variance σ_f² := Var_π[f(X)] and an associated autocorrelation parameter ρ_f, we can rewrite the width in terms of ρ_f. Thus, for this bound, the quantity ρ_f captures the penalty due to non-independence, playing the role that γ_0 and γ_f play in the other bounds. In this sense, the CLT bound adapts to the function f, but only when it applies, which is at a sample-size scale dictated by the global mixing properties of the chain (i.e., γ_0).
3.2. Sequential testing for MCMC. For some applications, full confidence intervals may be unnecessary; instead, a practitioner may merely want to know whether µ = E_π[f] lies above or below some threshold 0 < r < 1. In these cases, we would like to develop a procedure for distinguishing between the two possibilities at a given tolerable level 0 < α < 1 of combined Type I and Type II error. The simplest approach is, of course, to choose N so large that the 1 − α confidence interval built from N MCMC samples lies entirely on one side of r, but it may be possible to do better by using a sequential test. This latter idea was recently investigated by Gyori and Paulin (2015), and we consider the same problem settings that they did:

(a) Testing with a (known) indifference region, involving a choice between H_0 : µ ≥ r + δ and H_1 : µ ≤ r − δ;
(b) Testing with no indifference region, that is, the same as above but with δ = 0.
For the first setting, we always assume 0 < δ < ν, and the algorithm is evaluated on its ability to correctly choose between H_0 and H_1 when one of them holds; it incurs no penalty for either choice when µ falls in the indifference region (r − δ, r + δ). The error err(A, f) of a procedure A can thus be defined as the worst-case probability that A selects the wrong hypothesis when µ lies outside the indifference region. The rest of this subsection is organized as follows. For the first setting above, we analyze a procedure A_fixed that makes a decision after a fixed number N := N(α) of samples. We also analyze a sequential procedure A_seq that chooses whether to reject at a sequence N_0, ..., N_k, ... of decision times. For the second, more challenging, setting, we analyze A_hard, which also rejects at a sequence of decision times. For both A_seq and A_hard, we calculate the expected stopping times of the procedures.
As mentioned above, the simplest procedure A_fixed chooses a fixed number N of samples to be collected based on the target level α. After collecting N samples, it forms the empirical average μ̂_N = (1/N) Σ_{n=1}^N f(X_n) and outputs H_0 if μ̂_N ≥ r + δ, outputs H_1 if μ̂_N ≤ r − δ, and outputs a special indifference symbol, say I, otherwise.
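The decision rule of A_fixed is a one-liner once the samples are collected. In the sketch below, the stream of f(X_n) values is replaced by an i.i.d. stand-in (an assumption purely for illustration; in practice these would be MCMC iterates), and r and δ are arbitrary:

```python
import numpy as np

def a_fixed(samples, r, delta):
    # Fixed-sample test: output "H0" if the empirical mean is >= r + delta,
    # "H1" if it is <= r - delta, and the indifference symbol "I" otherwise.
    mu_hat = np.mean(samples)
    if mu_hat >= r + delta:
        return "H0"
    if mu_hat <= r - delta:
        return "H1"
    return "I"

rng = np.random.default_rng(0)
x = rng.random(1000) * 0.4        # i.i.d. stand-in for f(X_n), mean ~ 0.2
print(a_fixed(x, r=0.5, delta=0.05))
```

With mean near 0.2 and threshold r = 0.5, the empirical average falls well below r − δ, so the procedure selects H_1.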
The sequential algorithm A_seq makes decisions as to whether to output one of the hypotheses or to continue testing at a fixed sequence of decision times N_k, defined recursively by N_k = ⌈N_0(1 + ξ)^k⌉, where M > 0 and 0 < ξ < 2/5 are parameters of the algorithm. At each time N_k for k ≥ 1, the algorithm A_seq checks whether μ̂_{N_k} ∈ (r − M/N_k, r + M/N_k). If the empirical average lies in this interval, then the algorithm continues sampling; otherwise, it outputs H_0 or H_1 accordingly in the natural way.
For the sequential algorithm A_hard, let N_0 > 0 be chosen arbitrarily, and let N_k be defined in terms of N_0 as in (32). The algorithm once again decides at each N_k for k ≥ 1 whether to output an answer or to continue sampling, depending on whether μ̂_{N_k} ∈ (r − ε_k(α), r + ε_k(α)).
When this inclusion holds, the algorithm continues; when it does not, the algorithm outputs H_0 or H_1 in the natural way.
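The decision-time schedule and the shared decision rule can be sketched in a few lines. The geometric form of the schedule is taken from the form n_k = ⌈n_0(1 + ξ)^k⌉ used in the proofs (Appendix D); the continuation interval shown is our reading of the elided display, so both should be checked against the original paper:

```python
import math

def decision_times(n0, xi, k_max):
    """Decision-time schedule N_k = ceil(N_0 * (1 + xi)^k); N_0 and
    0 < xi < 2/5 are algorithm parameters (form assumed from the proofs)."""
    return [math.ceil(n0 * (1 + xi) ** k) for k in range(k_max + 1)]

def sequential_step(mu_hat, r, margin):
    """One check of A_seq / A_hard: keep sampling while the empirical
    average sits inside (r - margin, r + margin); otherwise declare."""
    if mu_hat >= r + margin:
        return "H0"
    if mu_hat <= r - margin:
        return "H1"
    return "continue"
```

For A_seq the margin at time N_k would be M/N_k; for A_hard it would be ε_k(α).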
In order to state our main result, we first define the quantity τ_f. The following result is restricted to the stationary case; later in the section, we turn to the question of burn-in.
Theorem 4. Assume that α ≤ 2/5. For A_fixed, A_seq, and A_hard all to satisfy err(A, f) ≤ α, it suffices to choose (respectively) the parameters N, M, and ε_k(α) appropriately, where we let inf ∅ = ∞.
Our results differ from those of Gyori and Paulin (2015) in that the latter implicitly control the worst-case error of the algorithm, err(A) := sup_f err(A, f), while our analysis controls err(A, f) directly. For example, among other results, Gyori and Paulin (2015) showed that A_fixed with N = log(1/α)/(γ_0 δ²) has worst-case error at most err(A_fixed) ≤ α. The τ_f parameter plays the same role in our bounds that 1/γ_0 plays in their uniform bounds. In particular, they showed that choosing M = log(2/√(αξ))/(γ_0 δ) suffices to guarantee err(A_seq) ≤ α. Similarly, for the no-indifference case, they proved that the choice ε_k(α) = (log(1/α) + 1 + 2 log k)/(γ_0 N_k) (38) leads to a procedure with worst-case error bounded as err(A_hard) ≤ α. Once more, our improvements are of a similar nature.
As a result of this close correspondence, we easily see that our results improve on the uniform result for a fixed function f whenever f converges to its stationary expectation faster than the chain itself converges; more concretely, whenever τ_f/2 = T_f(νδ²) ≤ 1/(2γ_0). The value of the above testing results depends substantially on their computational properties. In the indifference-region case, for example, the sequential procedure is only valuable if it reduces the number of samples needed compared to the fixed-sample-size procedure. In the no-indifference case, the procedure is valuable because of its ability to test between hypotheses separated only by a point, but its utility will be greatly limited if it takes too long to run. We now turn, therefore, to the question of bounding expected stopping times.
To carry out the analysis, we introduce the true margin ∆ := |r − µ|, and we let N(A) denote the number of samples collected by A. Given a margin schedule (ε_k), we can bound the expected stopping times of A_seq and A_hard in terms of ∆ as follows.

Theorem 5. Assume that either H_0 or H_1 holds.

With minor modifications to the proofs in Gyori and Paulin (2015), one can obtain analogous bounds on the expected stopping times from their uniform analysis. In order to see how the uniform and adaptive bounds compare, it is helpful to first note that, under either H_0 or H_1, we have the lower bound ∆ ≥ δ. Thus, the dominant term in the expectations in both cases is (1 + ξ)M/∆. Consequently, the ratio between the expected stopping times is approximately equal to the ratio between the M values.
As a result, we should expect a significant improvement in the number of samples whenever the relaxation time 1/γ_0 is significantly larger than the f-mixing time T_f(νδ²/4). Framed in absolute terms, this ratio can be written out explicitly. Up to an additive term, the bound for A_hard is also qualitatively similar to the earlier ones, with 1/(δ∆) replaced by 1/∆².

4. Analyzing mixing in practice.
We analyze several examples of MCMC-based Bayesian analysis from our theoretical perspective. These examples demonstrate that convergence in discrepancy can, in practice, occur much faster than suggested by naive mixing-time bounds, and that our bounds go part of the way toward closing the gap between theoretical predictions and observed behavior. 4.1. Bayesian logistic regression. Our first example is a Bayesian logistic regression problem introduced by Robert and Casella (2005). The data consist of 23 observations of temperatures (in Fahrenheit, normalized by dividing by 100) and a corresponding binary outcome, failure (y = 1) or not (y = 0), of a certain component; the aim is to fit a logistic regressor (α, β) ∈ R² to the data, incorporating a prior and integrating over the model uncertainty to obtain future predictions. More explicitly, following the analysis in Gyori and Paulin (2012), we consider a model with an exponential prior on e^α, an improper uniform prior on β, and a logit link for prediction. As in Gyori and Paulin (2012), we target the posterior by running a Metropolis-Hastings algorithm with a Gaussian proposal with covariance matrix Σ = diag(4, 10).
Unlike in their paper, however, we discretize the state space to facilitate exact analysis of the transition matrix and to make our theory directly applicable. The resulting state space is a 17 × 17 grid centered at the MLE (α̂, β̂), with mesh ∆ = 0.1. This space has d = 17² = 289 elements, resulting in a 289 × 289 transition matrix that can easily be diagonalized. Robert and Casella (2005) analyze the probability of failure when the temperature x is 65°F; it is specified by the function f_65(α, β) = exp(α + 0.65β) / (1 + exp(α + 0.65β)).
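The construction just described can be mimicked at small scale. The sketch below is our own toy version, with a Gaussian-shaped target standing in for the logistic-regression posterior: it builds a Metropolis-Hastings transition matrix on a small 2-D grid using a discretized Gaussian proposal, with the Hastings ratio correcting for any asymmetry the discretization introduces.

```python
import numpy as np

def mh_matrix(pi, Q):
    """Metropolis-Hastings transition matrix for an unnormalized target pi
    and a row-stochastic proposal matrix Q; the Hastings ratio handles
    asymmetry introduced by discretizing the proposal."""
    A = np.minimum(1.0, (pi[None, :] * Q.T) / (pi[:, None] * Q))  # accept probs
    P = Q * A
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # rejected mass stays put
    return P

# Toy 5x5 grid and Gaussian-shaped target/proposal (NOT the paper's posterior).
ax = np.linspace(-1.0, 1.0, 5)
grid = np.array([(a, b) for a in ax for b in ax])      # 25 states
pi = np.exp(-np.sum(grid**2, axis=1))                  # unnormalized target
diff = grid[:, None, :] - grid[None, :, :]
Q = np.exp(-0.5 * np.sum(diff**2, axis=2) / 0.25)      # discretized proposal
Q /= Q.sum(axis=1, keepdims=True)

P = mh_matrix(pi, Q)
```

Because the chain is reversible by construction, P can be symmetrized and diagonalized exactly, which is what makes the spectral quantities in this section computable.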
Note that this function fluctuates significantly under the posterior, as shown in Figure 2. This function also happens to exhibit rapid mixing: before entering an asymptotic regime in which it decays exponentially at a rate 1 − γ* ≈ 0.386, the discrepancy d_{f_65} first drops from about 0.3 to about 0.01 in just 2 iterations, compared to the 10 iterations predicted by the naive bound. Note that the oracle f-discrepancy bound improves significantly over the uniform baseline, even though the non-oracle version does not. In this calculation, we took J = {2, ..., 140} to include the top half of the spectrum excluding 1, and computed ‖h_j‖_∞ directly from P for j ∈ J, and likewise for |q_j^T f_65|. The oracle bound is given by Lemma 2. As shown in Figure 4, this decay is also faster than that of the total variation distance. When J includes the whole spectrum below the top eigenvalue, the oracle bound becomes exact. Between that and J = ∅, the oracle bound becomes tighter and tighter, with the rate of tightening depending on how much power the function has in the higher versus lower eigenspaces. Figure 5 illustrates this for a few settings of J, showing that, although for this function and this chain a comparatively large J is needed to get a tight bound, the oracle bound is substantially tighter than the uniform and non-oracle f-discrepancy bounds even for small J.
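The f-discrepancy curves in this section are straightforward to compute once P is in hand; a minimal sketch (names are ours), demonstrated on a lazy 3-state chain whose decay rate is known exactly:

```python
import numpy as np

def f_discrepancy_curve(P, pi, f, pi0, n_max):
    """d_f(pi_n, pi) = |E_{pi_n}[f] - E_pi[f]| for n = 0, ..., n_max,
    computed by propagating pi_{n+1} = pi_n P."""
    mu = float(pi @ f)
    dist, curve = np.array(pi0, dtype=float), []
    for _ in range(n_max + 1):
        curve.append(abs(float(dist @ f) - mu))
        dist = dist @ P
    return np.array(curve)

# Lazy 3-state chain: eigenvalues (1, 0.25, 0.25), so d_f decays like 0.25^n.
P = 0.25 * np.eye(3) + 0.25 * np.ones((3, 3))
pi = np.full(3, 1.0 / 3.0)
f = np.array([1.0, 0.0, 0.0])
curve = f_discrepancy_curve(P, pi, f, np.array([1.0, 0.0, 0.0]), 6)
```

In the 289-state example above, the same loop with f = f_65 evaluated on the grid produces the curve plotted in Figure 4.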

4.2.
Bayesian analysis of clinical trials. The problem of missing data often necessitates Bayesian analysis, particularly in settings where uncertainty quantification is important, as in clinical trials. We illustrate how our framework applies in this context by considering a clinical-trials dataset (Berry et al., 2010; Gyori and Paulin, 2012).
The dataset consists of n = 50 patients, some of whom participated in a trial for a drug and exhibited early indicators (Y_i) and final indicators (X_i) of success or failure. Among the 50 patients, both indicator values are available for n_X = 20 patients; only early indicators are available for n_Y = 20 patients; and no indicators are available for n_0 = 10 patients. The analysis depends on a particular parameterization of the joint model. Note that, in contrast to what one might expect, p is to be interpreted as the marginal probability that X_i = 1, so that p = P(X_i = 1) unconditionally; we keep the other notation, however, for the sake of consistency with past work (Berry et al., 2010; Gyori and Paulin, 2012). Conjugate uniform (i.e., Be(1, 1)) priors are placed on all the model parameters.
The unknown variables include the parameter triple (γ_0, γ_1, p) and the unobserved X_i values for n_Y + n_0 = 30 patients, and the full sample space is therefore Ω = [0, 1]³ × {0, 1}³⁰. We cannot estimate the transition matrix for this chain, even with a discretization as coarse as ∆ = 0.1, since the number of states would be d = 10³ × 2³⁰ ≈ 10¹². We therefore make two changes to the original MCMC procedure. First, we collapse out the X_i variables to bring the state space down to [0, 1]³; while collapsing out the discrete variables analytically is impossible, we can estimate the transition probabilities of the collapsed chain numerically, by sampling the X_i variables conditional on the parameter values and forming a Monte Carlo estimate of the collapsed transition probabilities. Second, since the function of interest in the original work, namely f(γ_0, γ_1, p) = 1{p > 0.5}, depends only on p, we fix γ_0 and γ_1 to their MLE values and sample only p, restricted to the unit interval discretized with mesh ∆ = 0.01. As Figure 1 shows, eigenvalue decay occurs rapidly for this sampler, with γ* ≈ 0.86. Mixing thus occurs so quickly that none of the bounds, uniform or function-specific, gets close to the truth, due to the presence of the constant terms (and specifically the large term 1/√π_min ≈ 2.14 × 10³³). Nonetheless, this example still illustrates how the choice of target function can make a big difference in the number of iterations required for accurate estimation; indeed, if we consider the two functions f_1(p) := 1{p > 0.5} and f_2(p) := p, we see in Figure 6 that the mixing behavior differs significantly between them: whereas the discrepancy for f_2 decays at the asymptotic exponential rate from the outset, the discrepancy for f_1 decreases faster than that (by about an order of magnitude) for the first few iterations, before reaching the asymptotic rate dictated by the spectral gap.
4.3.
Collapsed Gibbs sampling for mixture models. Due to the ubiquity of clustering problems in applied statistics and machine learning, Bayesian inference for mixture models (and their generalizations) is a widespread application of MCMC (Ghahramani and Griffiths, 2005; Griffiths and Steyvers, 2004; Jain et al., 2007; Mimno, Hoffman and Blei, 2012; Neal, 2000). We consider the mixture-of-Gaussians model, applying it to a subset of the schizophrenic reaction-time data analyzed in Belin and Rubin (1995). The subset we consider consists of 10 measurements, with 5 coming from healthy subjects and 5 from subjects diagnosed with schizophrenia. Since our interest is in contexts where uncertainty is high, we chose the 5 subjects from the healthy group whose reaction times were greatest and the 5 subjects from the schizophrenic group whose reaction times were smallest. We considered a mixture with K = 2 components. We chose relatively uninformative priors, setting α_0 = α_1 = 1 and ρ = 237. Increasing the value chosen in the original analysis (Belin and Rubin, 1995), we set σ ≈ 70; we found that this was necessary to prevent the posterior from being too highly concentrated, which would be an unrealistic setting for MCMC. We ran collapsed Gibbs sampling on the indicator variables Z_i by analytically integrating out ω and µ_{0:1}.
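A collapsed Gibbs sweep of this kind can be sketched compactly. The code below is our own reconstruction in the spirit of the sampler described here, not the paper's exact model: it assumes a known shared σ, a N(m0, ρ²) prior on each component mean (m0 is our hypothetical choice), and Dirichlet(α) weights, all integrated out analytically; it is demonstrated on synthetic, well-separated data rather than the reaction-time dataset.

```python
import numpy as np

def collapsed_gibbs_sweep(x, z, rng, K=2, sigma=70.0, m0=0.0, rho=237.0, alpha=1.0):
    """One sweep of collapsed Gibbs for a K-component Gaussian mixture:
    resample each indicator z_i from its conditional given all other
    indicators, with means and weights marginalized out."""
    n = len(x)
    for i in range(n):
        logp = np.empty(K)
        for k in range(K):
            others = (z == k) & (np.arange(n) != i)
            nk, sk = int(others.sum()), float(x[others].sum())
            # posterior for mu_k given the other points currently in cluster k
            prec = 1.0 / rho**2 + nk / sigma**2
            mean = (m0 / rho**2 + sk / sigma**2) / prec
            # posterior predictive for x_i is N(mean, 1/prec + sigma^2)
            pred_var = 1.0 / prec + sigma**2
            logp[k] = (np.log(nk + alpha)
                       - 0.5 * np.log(pred_var)
                       - 0.5 * (x[i] - mean) ** 2 / pred_var)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    return z

# Demo on synthetic, well-separated data (the real data are not reproduced here).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 70.0, 5), rng.normal(2000.0, 70.0, 5)])
z = rng.integers(0, 2, size=10)
for _ in range(100):
    z = collapsed_gibbs_sweep(x, z, rng)
```

On data this well separated, the sampler settles into the two-group partition, which is exactly the ground-truth indicator function f studied next.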
As Figure 1 illustrates, the spectral gap for this chain is small, namely γ* ≈ 3.83 × 10⁻⁴, yet the eigenvalues fall off comparatively quickly after λ_2, opening up the possibility of improvement over the uniform γ*-based bounds. In more detail, consider the vector of cluster assignments in which the patient and control groups are perfectly separated (with the control group being assigned label b), and define f to be the indicator for exact recovery of this ground truth. As Figure 7 illustrates, convergence in terms of f-discrepancy occurs much faster than convergence in total variation, meaning that predictions of required burn-in times and sample sizes based on global metrics of convergence drastically overestimate the computational and statistical effort required to estimate the expectation of f accurately using the collapsed Gibbs sampler. This behavior can be explained in terms of the interaction between the function f and the eigenspaces of P. Although the pessimistic constants in the uniform bound (10) and the non-oracle function-specific bound (Lemma 1) make their predictions overly conservative, the oracle version of the function-specific bound (Lemma 2) begins to make exact predictions after just a hundred iterations when applied with J = {1, ..., 25}; this corresponds to making exact predictions of T_f(δ) for δ ≤ δ_0 ≈ 0.01, which is a realistic tolerance for estimation of E[f].
Table 1: Comparison of bounds on T_f(δ) for different values of δ. The uniform bound corresponds to the bound T_f(δ) ≤ T(δ), the latter of which can be bounded by the total variation bound. The function-specific bounds correspond to Lemmas 1 and 2, respectively. Whereas the uniform and non-oracle f-discrepancy bounds make highly conservative predictions, the oracle f-discrepancy bound is nearly sharp even for δ as large as 0.01.
The mixture setting also provides a good illustration of how the function-specific Hoeffding bounds can substantially improve on the uniform Hoeffding bound. In particular, let us compare the T_f-based Hoeffding bound (Theorem 1) to the uniform Hoeffding bound established by Léon and Perron (2004). At equilibrium, the penalty for non-independence in our bounds is 2T_f(ν²/4), compared to roughly γ*⁻¹ in the uniform bound. Importantly, however, our concentration bound applies unchanged even when the chain has not equilibrated, provided it has approximately equilibrated with respect to f. As a consequence, our bound only requires a burn-in of T_f(ν²/4), whereas the uniform Hoeffding bound does not directly apply for any finite burn-in. This issue can be addressed using the method of Paulin (2012), but at the cost of a penalty depending on the worst-case total variation distance after burn-in, d_TV(T_0) = sup_{π_0} d_TV(π_0 P^{T_0}, π), where T_0 denotes the burn-in time. Note that a matching bound holds for the lower tail.
For our experiments, we computed the tightest version of the bound (42), optimizing T_0 over the range [0, 10⁵] for each value of the deviation ε. Even given this generosity toward the uniform bound, the function-specific bound still outperforms it substantially, as Figure 8 shows. For the function-specific bound, we used the oracle bound (Lemma 2) to bound T_f(ν²/4); this nearly coincides with the true value when ε ≈ 0.01 but deviates slightly for larger values of ε.
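The worst-case total variation quantity used in this comparison is directly computable for a chain with a known transition matrix: by convexity of total variation in the starting distribution, the supremum over π_0 is attained at a point mass, so it is just a maximum over rows of P^{T_0}. A sketch with a small reversible example of our own:

```python
import numpy as np

def worst_case_tv(P, pi, t0):
    """d_TV(T0) = sup over starting distributions of TV(pi0 P^{T0}, pi);
    the sup is attained at a point mass, hence a max over rows of P^{T0}."""
    Pt = np.linalg.matrix_power(P, t0)
    return 0.5 * float(np.max(np.abs(Pt - pi[None, :]).sum(axis=1)))

# Small birth-death chain, reversible with respect to pi = (1/4, 1/2, 1/4).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])
```

Scanning t0 over a range and taking the tightest resulting bound is exactly the "optimal burn-in" computation described above.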

5. Proofs of main results.
This section is devoted to the proofs of the main results of this paper.

5.1. Proofs of master Hoeffding bounds. The proofs of our master Hoeffding bounds (Theorems 1 and 2) are based on the following more general result.

Proposition 1 (Master tail bound). Suppose that d_f(π_0, π) ≤ δ and N_0(δ) := ⌊N/T_f(δ)⌋ ≥ 1. Then the corresponding tail bound holds for every α > 0.

[Figure 8: Comparison of the (log) tail probability bounds provided by the uniform Hoeffding bound due to Léon and Perron (2004) with one version of our function-specific Hoeffding bound (Theorem 1). Plots are based on N = 10⁶ iterations, choosing the optimal burn-in for the uniform bound and a fixed burn-in of 409 ≥ T_f(10⁻⁶) iterations for the function-specific bound. The function-specific bound improves over the uniform bound by orders of magnitude.]

5.1.1. Proof of Theorem 1. Take δ = ε²ν/4 and α = ε. Proposition 1 then guarantees the claimed bound.

5.1.2. Proof of Theorem 2. Take δ = ν/N and α = ε. The bound in Proposition 1 then yields the claim.

5.2.
Proof of Proposition 1. At the heart of the proof is the following bound on the MGF of the sum of an appropriately thinned subsequence of the function values {f(X_n)}_{n=1}^∞. In particular, let us introduce the shorthand notation X_{m,t} := X_{(m−1)T_f(δ)+t}. With this notation, we have:

Lemma 3 (Master MGF bound). For 0 ≤ t < T_f(δ) and α ∈ R, the stated MGF bound holds.

See Section 5.3 for the proof of this claim.
Using Lemma 3, we now prove the proposition. Recalling the definition of X_{m,t}, we decompose the sum over the thinned subsequences, where the last inequality follows from Jensen's inequality, as applied to the exponential function. Combined with Lemma 3, we conclude that the resulting MGF bound is valid for all α > 0. The claim then follows from the exponential form of Markov's inequality applied to the MGF.
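The final step is the standard Chernoff argument; in the notation above, it reads (a textbook reconstruction, not the paper's own display):

```latex
\mathbb{P}\Bigl(\sum_{m=1}^{N_0} f(X_{m,t}) - N_0\,\mu \;\ge\; s\Bigr)
\;\le\; e^{-\alpha s}\,
\mathbb{E}\Bigl[\exp\Bigl(\alpha\sum_{m=1}^{N_0}\bigl(f(X_{m,t})-\mu\bigr)\Bigr)\Bigr],
\qquad \alpha > 0,
```

and optimizing over α > 0 after inserting the MGF bound from Lemma 3 yields the sub-Gaussian tail.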
5.3. Proof of Lemma 3. For any given scalar t ∈ [0, T_f(δ)), we again make use of the shorthand X_{m,t} := X_{(m−1)T_f(δ)+t}. We then introduce a sequence of Bernoulli variables Y_m ∼ Bern(f(X_{m,t})). With this notation, Jensen's inequality implies that it suffices to bound the MGF of the sum of the Y_m.
In order to do so, we first argue that a distributional comparison holds for all integers m = 1, ..., N_0 − 1. In order to establish this fact, note that if we let ρ(b_{1:m}) denote the distribution of X_{m+1,t} given that Y_{1:m} = b_{1:m}, then the case b_{m+1} = 1 follows by the definition of T_f(δ); a similar argument can be given for b_{m+1} = 0. With this in hand, we may compare to a new sequence {Ỹ_m}_{m=1}^{N_0} consisting of i.i.d. Bern(µ) variables. By the sub-Gaussianity of the Ỹ_m variables, we deduce the claimed MGF bound, which completes the proof.

5.4. Proofs of derived Hoeffding bounds.
In this section, we prove the derived Hoeffding bounds. 5.4.1. Proof of Corollary 1. The proof is a direct application of Theorem 1. Indeed, it suffices to note that if ε² ≤ 4λ_f/(ν√π_min), then the stated inequality holds, which yields the first bound. Turning to the second bound, note that if ε² > 4λ_f/(ν√π_min), then equation (12) implies that T_f(ε²ν/4) = 1, which establishes the claim.
5.4.2. Proof of Corollary 2. Notice first that we have set ε = 2(∆ + ∆_J)/ν. On the other hand, using the facts that λ_J ≤ 1, d_TV(π_0, π_n) ≤ 1, d_f(π_0, π_n) ≤ 1, and E_π[f²] ≤ 1, we obtain a bound on T_f(∆_J + ∆) from Lemma 1. It follows that, provided ∆ ≤ λ_{−J}√π_min, plugging into Theorem 1 gives the first part of the bound. On the other hand, if ∆ > λ_{−J}√π_min, then Lemma 1 implies that T_f(∆_J + ∆) = 1, giving the second case. 5.4.3. Proof of Corollary 3. The proof is a direct application of Theorem 2. Indeed, it suffices to bound T_f(ν/N) via equation (12). Plugging into the bound gives the first inequality, while the second follows because, under the corresponding condition relating N and λ_f, we have T_f(ν/N) = 1, which leads to the improved concentration in the second statement. 5.5. Proofs of confidence interval results. Here we provide the proof of the confidence interval corresponding to our bound (Theorem 3). Proofs of the claims (26) and (30) can be found in Appendix C.
As discussed in Section 3.1, we actually prove a somewhat stronger form of Theorem 3, in order to guarantee that the confidence interval can be straightforwardly built using an upper bound T̃_f on the f-mixing time rather than the true value. Setting T̃_f = T_f recovers the original theorem.
Specifically, suppose T̃_f : N → R_+ is an upper bound on T_f, and note that the corresponding tail bound becomes e^{−r̃_N(ε)/4}, where r̃_N is obtained from r_N by replacing T_f with T̃_f. This means that, just as before we wanted to make the rate r_N in (27) at least as large as 4 log(2/α), we now wish to do the same with r̃_N, which means choosing N with r̃_N(ε_N) ≥ 4 log(2/α). We therefore have the following result.
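When the rate is nondecreasing in N (as for the rates considered here), the smallest admissible N can be found by binary search. A small helper of our own, with the rate r̃_N passed in as a function:

```python
import math

def smallest_valid_n(rate, alpha, n_max=10**7):
    """Smallest N with rate(N) >= 4*log(2/alpha), assuming rate is
    nondecreasing in N; returns None if no N <= n_max qualifies."""
    target = 4.0 * math.log(2.0 / alpha)
    if rate(n_max) < target:
        return None
    lo, hi = 1, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if rate(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

The same helper serves the burn-in-adjusted rate r̃_{N,η} introduced below, since it is also monotone in N.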
Proposition 2. For any width ε_N ∈ [r̃_N^{−1}(4 log(2/α)), ∞), the stated interval is a valid (1 − α)-confidence interval. The corresponding lower bound leads to an analogous bound on the lower tail.
As we did with Corollary 4, we can derive a more concrete, though slightly weaker, form of this result that is more amenable to interpretation. We derive the corollary from the specialized bound by setting T̃_f = T_f.
To obtain this bound, define the lower bound r̃_{N,η} in parallel with equation (28). Since this is a lower bound, we see that whenever N ≥ η and r̃_{N,η}(ε_N) ≥ 4 log(2/α), ε_N is a valid half-width for a (1 − α)-confidence interval for the stationary mean centered at the empirical mean. More formally, we have the following.

Proposition 3. Fix η > 0 and let N and ε_N satisfy the conditions above.

Proof. By assumption, we have r̃_N(ε_N) ≥ r̃_{N,η}(ε_N) ≥ 4 log(2/α). But now Proposition 2 applies, so that we are done.
5.6. Proofs of sequential testing results. In this section, we collect various proofs associated with our analysis of the sequential testing problem. 5.6.1. Proof of Theorem 4 for A_fixed. We provide a detailed proof when H_1 is true, in which case µ ≤ r − δ; the proof for the other case is analogous. When H_1 is true, we need to control the probability P(A_fixed(X_{1:N}) = H_0). In order to do so, note that Theorem 1, with the choice of N given in the theorem statement, yields the bound P(A_fixed(X_{1:N}) = H_0) ≤ α, as claimed. 5.6.2. Proof of Theorem 4 for A_seq. The proof is nearly identical to that given by Gyori and Paulin (2015), with τ_f(δ/2) replacing 1/γ_0. We again assume that H_1 holds, so that µ ≤ r − δ. In this case, it is certainly true that err(A_seq, f) = P(∃k : A_seq(X_{1:N_k}) = H_0). It follows by Theorem 1, with ε_k = δ + M/N_k, that each such event has exponentially small probability. In order to simplify notation, for the remainder of the proof we define τ := τ_f(δ/2), β := √(αξ)/2, and ζ_k := δ²N_k/(2τ log(1/β)). In terms of this notation, we have M = 2τ log(1/β)/δ, and the error probability can be bounded by the corresponding sum over k. We now finish the proof using two small technical lemmas, whose proofs we defer to Appendix D.
Using this bound, and grouping together terms in blocks of size ⌈9/(5ξ)⌉, we can bound the error by the resulting geometric sum. Since both α and ξ are at most 2/5, we have β = √(αξ)/2 ≤ 1/5, and hence the error probability is bounded by α, as claimed. 5.6.3. Proof of Theorem 4 for A_hard. We may assume that H_1 holds, as the other case is analogous. Under H_1, let k_0 be the smallest k for which the relevant margin condition holds. By Theorem 1, and the definition of ε_k, we thus obtain the claimed bound.
5.6.4. Proof of Theorem 5 for A_seq. We may assume that H_1 holds; the other case is analogous. We begin by expressing the expected stopping time as an integral of tail probabilities. Our proof depends on the following simple technical lemma, whose proof we defer to Appendix D.3.
Given this lemma, we then follow the argument of Gyori and Paulin (2015) in order to bound the integral. To conclude, note that either r ≥ ∆ or 1 − r ≥ ∆, since 0 < µ < 1, so that min{1/r, 1/(1−r)} ≤ 1/∆. Combining the bounds yields the desired result. 5.6.5. Proof of Theorem 5 for A_hard. For concreteness, we may assume that H_1 holds, as the H_0 case is symmetric. We again begin from the integral representation of the expected stopping time. For convenience, let us introduce a piecewise shorthand for the integrand.
Applying the Hoeffding bound from Theorem 1, and combining the resulting pieces, we arrive at a bound whose crux is an infinite sum, which we pull out as a lemma for clarity.
See Appendix D.4 for the proof of this claim. Lemma 7 then implies the stated bound, and the claim now follows from equation (46).
6. Discussion. A significant obstacle to successful application of statistical procedures based on Markov chains-especially MCMC-is the possibility of slow mixing. The usual formulation of mixing is in terms of convergence in a distribution-level metric, such as the total variation or Wasserstein distance. On the other hand, algorithms like MCMC are often used to estimate equilibrium expectations over a limited class of functions. For such uses, it is desirable to build a theory of mixing times with respect to these limited classes of functions and to provide convergence and concentration guarantees analogous to those available in the classical setting, and our paper has made some steps in this direction.
In particular, we introduced the f -mixing time of a function, and showed that it can be characterized by the interaction between the function and the eigenspaces of the transition operator. Using these tools, we proved that the empirical averages of a function f concentrate around their equilibrium values at a rate characterized by the f -mixing time; in so doing, we replaced the worst-case dependence on the spectral gap of the chain, characteristic of previous Markov chain concentration bounds, by an adaptive dependence on the properties of the actual target function. Our methodology yields sharper confidence intervals, as well as better rates for sequential hypothesis tests in MCMC, and we have provided evidence that the predictions made by our theory are accurate in some real examples of MCMC and thus potentially of significant practical relevance.
Our investigation also suggests a number of further questions. In practical applications, it would be desirable to have methods for estimating or bounding the f -mixing time based on samples. It would also be interesting to study the f -mixing times of Markov chains that arise in probability theory, statistical physics, and applied statistics itself. While we have shown what can be done with spectral methods, the classical theory provides a much larger arsenal of techniques, some of which may generalize to yield sharper f -mixing time bounds. We leave these and other problems to future work.
It follows from equation (12) that the claimed bound holds.
A.2. Proof of equation (12). Let D = diag(√π). Then the matrix A = DPD⁻¹ is symmetric and so has an eigendecomposition of the form A = γ_1γ_1^T + Σ_{j=2}^d λ_j γ_j γ_j^T. Using this decomposition, we can expand P^n in terms of h_j := D⁻¹γ_j and q_j := Dγ_j. Note that the vectors {q_j}_{j=2}^d correspond to the left eigenvectors associated with the eigenvalues {λ_j}_{j=2}^d. Now, if we let π_0 be an arbitrary distribution over [d], we have d_f(π_n, π) = |π_0^T P^n f − π^T P^n f| ≤ |(π_0 − π)^T P^n f|.
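The decomposition just described is easy to verify numerically on a small reversible chain (a toy example of our own, not one from the paper):

```python
import numpy as np

# Birth-death chain, reversible with respect to pi = (1/4, 1/2, 1/4).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

D = np.diag(np.sqrt(pi))
D_inv = np.diag(1.0 / np.sqrt(pi))
A = D @ P @ D_inv              # symmetric exactly when the chain is reversible

lams, gammas = np.linalg.eigh(A)   # real spectrum; columns are the gamma_j
h = D_inv @ gammas                 # right eigenvectors h_j = D^{-1} gamma_j
q = D @ gammas                     # left  eigenvectors q_j = D gamma_j
```

For the top eigenvalue λ = 1, the eigenvector of A is √π, so h reduces to the all-ones vector and q to π itself, matching the decomposition of P_f below.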
Defining P_f := 1π^T + Σ_{j∈J_f} λ_j h_j q_j^T, we have P^n f = P_f^n f. Moreover, if we define P̃_f := Σ_{j∈J_f} λ_j h_j q_j^T and, correspondingly, Ã_f := DP̃_f D⁻¹, we then have the relation (π_0 − π)^T P̃_f = (π_0 − π)^T P_f. Consequently, the claim follows from the definition of the operator norm and its sub-multiplicativity. In order to complete the proof, let Z ∈ {0, 1}^d denote the indicator vector with Z_j = 1{X_0 = j}. Applying Lemma 1, and using the facts that λ_{J_δ} ≤ 1 and that ‖h_j‖_∞ ≤ 1, we find that the desired bound holds whenever the event E_δ holds. To conclude, we use a bound involving j_0 = 4δ√(d log d). On the other hand, we also have cos(πx) ≤ 1 − π²x²/2 + π⁴x⁴/24 ≤ 1 − π²x²/12 for |x| ≤ 1, which implies that λ_{−J_δ} ≤ 1 − 2π²δ²/(3d log d) ≤ exp(−2π²δ²/(3d log d)).
Replacing δ by δ/128 throughout, we conclude that for n ≥ 3 · 128² · d log d · log(128d/δ) / (2π²δ²) = (3 · 2¹³/π²) · d log d · log(128d/δ) / δ², we have d_f(π_n, π) ≤ δ with probability at least P(E_{δ/128}). It now suffices to prove that P(E_δ) ≥ 1 − δ√(d log d), since this will imply that P(E_{δ/128}) ≥ 1 − (δ/128)√(d log d), as required. For this, notice that, because the vectors {q_j}_{j=1}^d are rescaled versions of an orthonormal collection of eigenvectors, we have E[q_j^T f] = E_ν[µ] · q_j^T 1 = 0.

C.1. Proof of claim (26). This claim follows directly from a modified uniform Hoeffding bound due to Paulin (2012). In particular, for any integer T_0 ≥ 0, let d_TV(T_0) = sup_{π_0} d_TV(π_0 P^{T_0}, π) be the worst-case total variation distance from stationarity after T_0 steps. Using this notation, Paulin (2012) shows that a Hoeffding-type bound holds for any starting distribution π_0 and any bounded function f : [d] → [0, 1]. We now use the bound (48) to prove our claim (26). Recall that we have chosen T_0 so that d_TV(T_0) ≤ α_0/2. Therefore, the bound (48) implies the claim, as required.
C.2. Proof of the claim (30). We now use the result (29) to prove the claim (30).
It follows from equation (29) that the upper-tail bound holds, and since a matching bound holds for the lower tail, we obtain the desired result.
D. Proofs for Section 5.6. In this section, we gather the proofs of Lemmas 4-7.
In words, the quantity ℓ_k is either the smallest integer ℓ such that (1 + ξ)^ℓ is at least ζ_k (if ζ_k ≤ 1) or the largest integer ℓ such that (1 + ξ)^ℓ is at most ζ_k (if ζ_k ≥ 1).
With this definition, we see that (1 + ξ)^{ℓ_k} always lies between ζ_k and 1, so that we are guaranteed that g((1 + ξ)^{ℓ_k}) ≥ g(ζ_k), and hence Σ_{k=1}^∞ g(ζ_k) ≤ Σ_{k=1}^∞ g((1 + ξ)^{ℓ_k}).
The proof will thus be completed if we can show that at most two distinct values of k map to a single ℓ_k. Indeed, if we prove this, we will have Σ_{k=1}^∞ g(ζ_k) ≤ 2 Σ_{ℓ=−∞}^∞ g((1 + ξ)^ℓ) ≤ 4 Σ_{ℓ=0}^∞ g((1 + ξ)^ℓ).
In order to prove the required claim, note first that ℓ_k is clearly nondecreasing in k, so that we need to prove that ℓ_{k+2} > ℓ_k for all k ≥ 1. It is sufficient to show that ζ_{k+2} ≥ (1 + ξ)ζ_k, since this inequality implies that ℓ_{k+2} ≥ ℓ_k + 1. We now exploit the fact that ζ_k = a · n_k for some absolute constant a, where n_k = ⌈n_0(1 + ξ)^k⌉. For this, let b = n_0(1 + ξ)^k, so that n_k = ⌈b⌉.
Since n_{k+1} > n_k, we have ⌈(1 + ξ)b⌉ ≥ ⌈b⌉ + 1, and hence (1 + ξ)b ≥ ⌈b⌉; consequently n_{k+2} ≥ (1 + ξ)²b ≥ (1 + ξ)⌈b⌉ = (1 + ξ)n_k, as required. D.2. Proof of Lemma 5. When c = 0 and ℓ = 0, we note that the claim obviously holds with equality. On the other hand, the left-hand side is increasing in ℓ, so that the c = 0 case follows immediately.
It suffices to show that (1 + ξ)^ℓ is at least as large as the largest root of the quadratic equation z² − 2(c + 1)z + 1 = 0. This largest root is given by z* = c + 1 + √(c(c + 2)) ≤ 2(c + 1).
Differentiating the upper bound with respect to c, we find that its derivative is 5/(4(c + 1)ξ) ≤ 5/(8ξ) ≤ 9/(5ξ), so it actually suffices to verify the claim for c = 1, which can be done by checking numerically that (5 log 4)/4 ≤ 9/5.
D.3. Proof of Lemma 6. Our strategy will be to split the infinite sum into two parts: one corresponding to the range of s where h is constant and equal to 1 and the other to the range of s where h is decreasing. In terms of the N k , these two parts are obtained by splitting the sum into terms with k < k 0 and k ≥ k 0 , where k 0 ≥ 1 is minimal such that M ≤ ∆N k for k ≥ k 0 .
For convenience in what follows, let us introduce the shorthand quantities T_k and F_k. Now, if k_0 = 1, we note that h must then be decreasing for s ≥ N_1. For k < k_0, we have T_k = 1. Finally, we observe that N_{k+1} ≤ (1 + ξ)N_k + 1 + ξ, so that N_{k_0} − N_{k_0−1} ≤ ξN_{k_0−1} + 1 + ξ.
Putting together the pieces, and noting that the exponential in the definition of F_k is a decreasing function of N_k, we can further bound the overall sum.