On the Bernstein-von Mises theorem for the Dirichlet process

We establish that Laplace transforms of the posterior Dirichlet process converge to those of the limiting Brownian bridge process in a neighbourhood about zero, uniformly over Glivenko-Cantelli function classes. For real-valued random variables and functions of bounded variation, we strengthen this result to hold for all real numbers. This last result is proved via an explicit strong approximation coupling inequality.


Results
Let $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{Z_i}$ be the empirical distribution of an i.i.d. sample $Z_1,\dots,Z_n$ from a distribution $P_0$ on some measurable space $(\mathcal{X},\mathcal{A})$, and given $Z_1,\dots,Z_n$ let $P_n$ be a draw from the Dirichlet process with base measure $\nu+n\mathbb{P}_n$. Thus $\nu$ is a finite measure on the sample space and $P_n\mid Z_1,\dots,Z_n\sim\mathrm{DP}(\nu+n\mathbb{P}_n)$ for all $n$, which is the posterior distribution obtained when equipping the distribution of the observations $Z_1,Z_2,\dots,Z_n$ with a Dirichlet process prior with base measure $\nu$. The case $\nu=0$ is allowed; the process $P_n$ is then known as the Bayesian bootstrap. For full definitions and properties, see the review in Chapter 4 of [15].
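To make the definitions concrete, the posterior can be simulated directly: for $\nu=0$ (the Bayesian bootstrap) a draw of $P_n$ reweights the observations by Dirichlet$(1,\dots,1)$ weights, i.e. normalised unit-mean exponential variables. The following minimal sketch is our own illustration (the helper name bayesian_bootstrap_draws is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_bootstrap_draws(z, n_draws, g, rng=rng):
    """Posterior draws of P_n g under the Bayesian bootstrap DP(n * P_n).

    Each draw reweights the observations z_1, ..., z_n by Dirichlet(1, ..., 1)
    weights, i.e. normalised unit-mean exponential variables W_i / sum_j W_j.
    """
    n = len(z)
    w = rng.dirichlet(np.ones(n), size=n_draws)   # (n_draws, n) weight matrix
    return w @ g(z)                               # row i: sum_j w_ij g(z_j)

# Example: the posterior over the mean functional of an Exp(1) sample.
z = rng.exponential(size=500)
draws = bayesian_bootstrap_draws(z, n_draws=2000, g=lambda x: x)
print("posterior mean %.3f vs empirical mean %.3f" % (draws.mean(), z.mean()))
```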
The Dirichlet process is the standard "nonparametric prior" on the set of probability distributions on a (Polish) sample space. It was first made popular in Bayesian nonparametrics by Ferguson [13] and has subsequently been used in numerous statistical applications. The purpose of this note is to prove the following result concerning the Bernstein-von Mises theorem for the Dirichlet process posterior.

Theorem 1. Suppose $\mathcal{G}$ is a $P_0$-Glivenko-Cantelli class of measurable functions $g:\mathcal{X}\to\mathbb{R}$ with measurable envelope function $G$, such that $\nu e^{tG}\le Ce^{ct^2}$ for every $t>0$ and some $c,C>0$, and $P_0G^{2+\delta}<\infty$ for some $\delta>0$. Then there exists a neighbourhood of $0$ such that, for every $t$ in the neighbourhood,
$$\sup_{g\in\mathcal{G}}\Big|E\big(e^{t\sqrt n(P_ng-\mathbb{P}_ng)}\mid Z_1,\dots,Z_n\big)-e^{t^2P_0(g-P_0g)^2/2}\Big|\xrightarrow{as*}0.\qquad(1.1)$$

Here we write $S_n\xrightarrow{as*}0$ if there exist measurable random variables $\Delta_n$ with $|S_n|\le\Delta_n$ and $\Delta_n\to0$ for $P_0^\infty$-almost every sequence $Z_1,Z_2,\dots$. This is needed because the supremum in the theorem is not necessarily measurable in $Z_1,Z_2,\dots$, and, when it is not, convergence on a set of inner probability one may be less informative than it appears, as pointed out by a referee. (Cf. Chapter 1.9 in [35] for further discussion.) The function $t\mapsto e^{t^2\sigma^2/2}$ is the Laplace transform of the normal distribution with mean $0$ and variance $\sigma^2$. The theorem thus says that the Laplace transform of the posterior Dirichlet process, centered at the empirical measure, tends to the Laplace transform of a centered normal distribution with variance $P_0(g-P_0g)^2$, in a neighbourhood of $0$. This implies that the posterior Dirichlet process tends in distribution to a normal distribution (see Section 2.4), which is a version of the Bernstein-von Mises theorem for the Dirichlet process prior (a weak version, as the usual theorem gives the approximation in the total variation distance; see Section 12.2 of [15] for discussion). The convergence of the Laplace transform is useful for handling, for instance, moments of the posterior distribution.
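The convergence (1.1) can be probed numerically for a single bounded function $g$. The sketch below is our own illustration, assuming $\nu=0$ (Bayesian bootstrap), $g=\tanh$ (bounded and of bounded variation), and the plug-in variance $\mathbb{P}_n(g-\mathbb{P}_ng)^2$ for the Gaussian limit:

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_laplace(z, t, g, n_draws=20_000, rng=rng):
    """Monte Carlo estimate of E[exp(t sqrt(n)(P_n g - Pbar_n g)) | Z], nu = 0."""
    n = len(z)
    gz = g(z)
    w = rng.dirichlet(np.ones(n), size=n_draws)   # Bayesian bootstrap weights
    centred = np.sqrt(n) * (w @ gz - gz.mean())   # sqrt(n)(P_n g - empirical mean)
    return np.exp(t * centred).mean()

z = rng.normal(size=2000)
sigma2 = np.var(np.tanh(z))                       # plug-in for P_0 (g - P_0 g)^2
for t in (0.5, 1.0):
    lhs = posterior_laplace(z, t, np.tanh)
    print(f"t={t}: posterior Laplace {lhs:.4f} vs Gaussian limit {np.exp(t**2 * sigma2 / 2):.4f}")
```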
The main contribution of the theorem is, however, to provide uniformity in a class of functions $g$. This uniformity refers to the marginal posterior distributions of the process $(\sqrt n(P_ng-\mathbb{P}_ng):g\in\mathcal{G})$. The stronger sense of uniformity of distributional convergence of this process as a random element in the space $\ell^\infty(\mathcal{G})$ is known to be true if $\mathcal{G}$ is a Donsker class, as shown in [18] (see also [21,22]). This is a much stronger property than the Glivenko-Cantelli property assumed here.

Remark 1.
The assumption in Theorem 1 that the class $\mathcal{G}$ possesses an envelope function $G$ with sub-Gaussian tails under the base measure of the Dirichlet process prior $Q\sim\mathrm{DP}(\nu)$ ensures that the distribution of $QG$ possesses a finite moment generating function. If $G\in\mathcal{G}$, this is clearly necessary (see also Lemma 6(iii) below). It reflects the fact that, although vanishing in the limit, the prior remains present in the posterior for every $n$. In the case of the Bayesian bootstrap, which formally corresponds to taking $\nu=0$, this condition can be omitted.
Remark 2. Theorem 1 can be extended to the assertion (1.1) for a sequence $\mathcal{G}_n$ of classes of measurable functions. Inspection of the proofs below shows that if the total mass of the base measures remains bounded, then it suffices that these classes satisfy $\sup_{g\in\mathcal{G}_n\cup\mathcal{G}_n^2}|\mathbb{P}_ng-P_0g|\xrightarrow{as*}0$ and possess envelope functions $G_n$ such that $\max_{1\le i\le n}G_n(Z_i)=o(\sqrt n/\log n)$, almost surely, and $Ee^{tQG_n/\sqrt n}\le e^{ct}$ for $Q\sim\mathrm{DP}(\nu)$ and every $0<t<\sqrt n$. The last condition is implied by uniform sub-Gaussianity of the envelopes $G_n$, but it is of course also satisfied if $\|G_n\|_\infty\lesssim\sqrt n$ for every $n$. (If the total mass of the base measures increases to infinity, then this last condition must be replaced by a condition that ensures the assumption of Lemma 6(iii).) For convergence in probability in (1.1), it suffices that these conditions hold in probability. If the classes $\mathcal{G}_n$ are separable, then uniformity over $\mathcal{G}_n^2$ is implied by uniformity over $\mathcal{G}_n$, as shown by Lemma 11 of [30].
Major applications of studying posterior Laplace transforms of functionals as in (1.1) include establishing semiparametric and nonparametric Bernstein-von Mises theorems [4,5,28,31], especially for inverse problems [24,25,27], posterior contraction rates in the supremum norm [2,6,26], and convergence rates for Tikhonov-type penalised least squares estimators [24,26]. Such proofs typically require uniformity over function classes as established in (1.1) and use likelihood expansions based on local asymptotic normality (LAN) of the model. Because the Dirichlet process prior does not give probability one to a dominated set of measures, the resulting posterior distribution cannot be derived using Bayes' formula; one thus cannot use the LAN approach of the aforementioned papers to prove (1.1).
Our result is applicable when a Dirichlet process prior is assigned to some distributional component of the model, such as the covariate distribution in regression models with random design. For example, Theorem 1 has recently been applied to establish semiparametric Bernstein-von Mises results for estimating average treatment effects in causal inference problems [29,30]. Indeed, results there suggest that for estimating functionals, using a Dirichlet process prior on the covariate distribution can yield better performance than other common prior choices, especially in high-dimensional covariate settings.
Bernstein-von Mises and Donsker-type results have also been obtained for generalizations of the Dirichlet process, such as Pólya trees [3] and the Pitman-Yor process [14,18], and results in the style of Theorem 1 can be expected to hold for such priors. While the main outline of our proof should carry over, it relies on the explicit posterior representation (2.1) of the Dirichlet process for several technical computations, and these would have to be extended to the explicit posterior representations available for those other priors.
The case X = R

The proof of Theorem 1 requires uniformly bounded exponential moments of the process $(\sqrt n(P_ng-\mathbb{P}_ng):g\in\mathcal{G})$, which hold only for small $|t|$ under the moment condition $P_0G^{2+\delta}<\infty$ of the theorem (see Lemmas 3 and 4). When $\mathcal{X}=\mathbb{R}$, we can strengthen Theorem 1 to hold for all $t\in\mathbb{R}$, under significantly stronger conditions on $\mathcal{G}$.
We now assume that $Z_1,Z_2,\dots$ are i.i.d. random variables taking values in $\mathcal{X}=\mathbb{R}$. Recall that the total variation of a function $f:\mathbb{R}\to\mathbb{R}$ over an interval $[a,b]$ is $V_a^b(f)=\sup\sum_{i=1}^k|f(x_i)-f(x_{i-1})|$, where the supremum is taken over all partitions $a\le x_0<x_1<\dots<x_k\le b$.

Proposition 1. Suppose $\mathcal{G}$ is a class of measurable functions $g:\mathbb{R}\to\mathbb{R}$ whose total variations over all intervals $[a,b]$ are bounded by a common constant $V<\infty$. Then (1.1) holds for every $t\in\mathbb{R}$.

Since bounded variation balls are universal Donsker classes, this is a significantly stronger requirement than $\mathcal{G}$ being $P_0$-Glivenko-Cantelli as in Theorem 1. We prove this result by exploiting a strong approximation, which establishes a rate of convergence for representations of these random variables defined on a common probability space, and which has various applications in probability and statistics, for instance the study of distributional approximations of transformed random variables $\psi_n(\sqrt n(P_n-\mathbb{P}_n))$, where the functions $\psi_n$ depend on $n$. For an overview of the theory of strong approximations and a survey of their applications in probability and statistics, see Csörgő and Révész [8] and Csörgő and Hall [9], respectively.
An almost sure strong approximation of the posterior Dirichlet process was established by Lo [23]. He showed that on a suitable probability space, there exist random elements $F$, $K$ and $Z_1,Z_2,\dots\sim^{iid}F_0$ such that $F\mid Z_1,\dots,Z_n\sim\mathrm{DP}(\nu+n\mathbb{P}_n)$ for every $n$, and $K$ is a Kiefer process independent of $Z_1,Z_2,\dots$ such that
$$\sup_{z\in\mathbb{R}}\big|\sqrt n(F-\mathbb{F}_n)(z)-n^{-1/2}K(F_0(z),n)\big|=O\big(n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}\big),\quad\text{almost surely},\qquad(1.2)$$
where $\mathbb{F}_n$ denotes the empirical distribution function. Applications of (1.2) include studying the large sample behaviour of the Bayesian bootstrap and the smoothed Dirichlet process posterior [23], as well as receiver operating characteristic (ROC) curves [17]. We revisit this result by establishing an explicit coupling inequality, in order to make uniform the constants in (1.2). This for instance allows control of exponential moments, which is needed to prove Proposition 1. We henceforth assume that the underlying probability space is rich enough that all random variables and processes subsequently introduced may be defined on it. Since the posterior distribution is conditional on the observations $Z_1,\dots,Z_n$, it is natural for a Bayesian to index the Gaussian process in (1.2) by the empirical distribution function $\mathbb{F}_n$ to obtain a conditional Gaussian approximation. The following is the explicit coupling inequality analogue of Lemma 6.3 of [23].

Theorem 2. On a suitable probability space, there exist random variables $Z_1,Z_2,\dots\sim^{iid}F_0$ and distribution functions $F_n$, with $F_n\mid Z_1,\dots,Z_n\sim\mathrm{DP}(\nu+n\mathbb{P}_n)$, and a sequence of Brownian bridges $(B_n)$ independent of $Z_1,Z_2,\dots$, such that for all $x>0$ and $n\ge2$,
$$P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F_n-\mathbb{F}_n)(z)-B_n(\mathbb{F}_n(z))\big|>C_1\frac{x+\log n+|\nu|}{\sqrt n}\,\Big|\,Z_1,\dots,Z_n\Big)\le C_2e^{-C_3x},$$
where $C_1,C_2,C_3$ are universal constants.
This result says that one can couple the Dirichlet process posteriors to a sequence of Brownian bridges that is independent of the underlying data. The theorem could also be rephrased with the random variables $Z_1,Z_2,\dots$ replaced by arbitrary real numbers $z_1,z_2,\dots$ to emphasize this independence.
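Reproducing the coupling itself requires the quantile construction of [7], but its distributional content is easy to probe: given the data, $\sqrt n(F_n-\mathbb{F}_n)(z_0)$ should be approximately distributed as $B_n(\mathbb{F}_n(z_0))\sim N\big(0,\mathbb{F}_n(z_0)(1-\mathbb{F}_n(z_0))\big)$ at a fixed point $z_0$. A minimal sanity check, assuming $\nu=0$ (our own script, not part of the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n, n_draws, z0 = 500, 4000, 0.3
z = rng.normal(size=n)
ind = (z <= z0).astype(float)
Fn_z0 = ind.mean()                         # empirical cdf at z0

# Posterior draws of F(z0) under the Bayesian bootstrap DP(n * P_n):
w = rng.dirichlet(np.ones(n), size=n_draws)
centred = np.sqrt(n) * (w @ ind - Fn_z0)   # sqrt(n)(F_n - F_n^emp)(z0)

# Theorem 2 suggests this is close in law to B_n(F_n(z0)) ~ N(0, F(1-F)).
ks = stats.kstest(centred, "norm", args=(0.0, np.sqrt(Fn_z0 * (1 - Fn_z0))))
print("KS distance to N(0, Fn(1-Fn)):", round(float(ks.statistic), 3))
```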
For $x=x_n$ taken equal to a constant times $\log n$, the right side is summable over $n$, and hence, by the Borel-Cantelli lemma, the complements of the events at $x_n$ occur for every sufficiently large $n$, for almost every sequence $Z_1,Z_2,\dots$. Provided that $|\nu|=O(\log n)$, this yields that, for almost every sequence $Z_1,Z_2,\dots$,
$$\sup_{z\in\mathbb{R}}\big|\sqrt n(F_n-\mathbb{F}_n)(z)-B_n(\mathbb{F}_n(z))\big|=O\Big(\frac{\log n}{\sqrt n}\Big),\qquad P(\cdot\mid Z_1,Z_2,\dots)\text{-almost surely},$$
which improves on the rate $n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}$ in Lemma 6.3 of [23]. This is because we replace the KMT coupling used in [23], which involves a Kiefer process, with a direct quantile coupling due to [7] involving dependent Brownian bridges. The following is the analogous result when the Brownian bridges are related amongst themselves by tying them to a Kiefer process $K(\cdot,n)=\sum_{i=1}^nB_i$, for $(B_i)$ independent Brownian bridges.

Theorem 3. On a suitable probability space, there exist random variables $Z_1,Z_2,\dots\sim^{iid}F_0$, distribution functions $F_n$ with $F_n\mid Z_1,\dots,Z_n\sim\mathrm{DP}(\nu+n\mathbb{P}_n)$, and a Kiefer process $K$ independent of $Z_1,Z_2,\dots$, such that for all $x>0$ and $n\ge2$,
$$P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F_n-\mathbb{F}_n)(z)-n^{-1/2}K(\mathbb{F}_n(z),n)\big|>C_1\frac{(x+\log n)^{5/4}}{n^{1/4}}+C_2\frac{x+|\nu|}{\sqrt n}\,\Big|\,Z_1,\dots,Z_n\Big)\le C_3e^{-C_4x},$$
where $C_1$-$C_4$ are universal constants.
Theorem 2 does not say anything about the joint distribution in $n$ of the corresponding Brownian bridges, and thus only "in probability" or "in distribution" limit results can be proved from it. On the other hand, despite the slower convergence rate, Theorem 3 can be used to establish the almost sure limiting behaviour of statistics of interest based upon $\sqrt n(F_n-\mathbb{F}_n)(z)$, for instance a law of the iterated logarithm.
If $|\nu|=O(n^{1/4}(\log n)^{5/4})$, the above yields a $P(\cdot\mid Z_1,Z_2,\dots)$-almost sure order $n^{-1/4}(\log n)^{5/4}$, significantly slower than the rate in Theorem 2. In Theorem 3 we follow the approach of [23] in using the KMT coupling rather than a quantile coupling as in Theorem 2. Indeed, up to logarithmic factors, a better rate is not obtainable when coupling a quantile process with a Kiefer process [10], as opposed to dependent Brownian bridges. We obtain a slightly slower almost sure rate than the $n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}$ achieved in Lemma 6.3 of [23], due to the technical arguments used to make the coupling non-asymptotic.
We may also index the Brownian bridges by the true distribution function $F_0$, at the expense of a slower rate. The following is the coupling inequality analogue of Theorem 2.1 of [23].

Corollary 1. On a suitable probability space, there exist random variables $Z_1,Z_2,\dots\sim^{iid}F_0$ and $F_n$, with $F_n\mid Z_1,\dots,Z_n\sim\mathrm{DP}(\nu+n\mathbb{P}_n)$, and a sequence of Brownian bridges $(B_n)$ independent of $Z_1,Z_2,\dots$, such that for any $y>0$, the event
$$A_{n,y}=\Big\{\sup_{z\in\mathbb{R}}|\mathbb{F}_n(z)-F_0(z)|\le y/\sqrt n\Big\}$$
satisfies $P(A_{n,y})\ge1-2e^{-2y^2}$ and, on $A_{n,y}$,
$$P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F_n-\mathbb{F}_n)(z)-B_n(F_0(z))\big|>C_1\frac{x+\log n+|\nu|}{\sqrt n}+C_2\frac{\sqrt y\,(\sqrt x+\sqrt{\log n})}{n^{1/4}}\,\Big|\,Z_1,\dots,Z_n\Big)\le C_3e^{-C_4x}$$
for all $x>0$ and $n\ge2$, where $C_1$-$C_4$ are universal constants.
The Bayesian interpretation is that there are events $(A_{n,y})$ of high $P_0^n$-probability, depending only on the observations $Z_1,\dots,Z_n$, on which one can approximate the posterior Dirichlet process by a sequence of Brownian bridges independent of the underlying data. If $|\nu|=O(n^{1/4}(\log n)^{3/4})$, setting $y=\sqrt{\delta\log n}$ with $\delta>1/2$ gives, for $P_0^\infty$-almost every sequence $Z_1,Z_2,\dots$, an approximation rate of $n^{-1/4}(\log n)^{3/4}$, $P(\cdot\mid Z_1,Z_2,\dots)$-almost surely. A similar, if more complicated, expression can be proved with the Brownian bridges $(B_n)$ replaced by the Kiefer process $K$, in particular yielding an almost sure rate $O(n^{-1/4}(\log n)^{5/4})$, for $P_0^\infty$-almost every sequence $Z_1,Z_2,\dots$.

Proof of Theorem 1
For given $Z_1,Z_2,\dots$, the Dirichlet process posterior distribution can be represented in law as the convex combination
$$P_n=V_nQ+(1-V_n)\frac{\sum_{i=1}^nW_i\delta_{Z_i}}{\sum_{j=1}^nW_j},\qquad(2.1)$$
where $V_n\sim\mathrm{Beta}(|\nu|,n)$, $Q\sim\mathrm{DP}(\nu)$, and $W_1,W_2,\dots$ are i.i.d. exponential variables with mean $1$, all variables $V_n,Q,W_1,W_2,\dots$ independent. This follows for instance from Theorem 14.37 in [15] (with $\sigma=0$), or from representation properties of the Dirichlet distribution as in Proposition G.10 in the same reference. For $\nu=0$, the variables $Q$ and $V_n$ should be interpreted as $0$, and the first term vanishes. With the notation $\overline W_n=n^{-1}\sum_{i=1}^nW_i$, this yields
$$\sqrt n(P_ng-\mathbb{P}_ng)=\sqrt nV_n\Big(Qg-\frac{\sum_{i=1}^nW_ig(Z_i)}{\sum_{j=1}^nW_j}\Big)+\frac1{\overline W_n}\frac1{\sqrt n}\sum_{i=1}^n(W_i-1)\big(g(Z_i)-\mathbb{P}_ng\big).\qquad(2.2)$$
We show that the first term is negligible and that the second tends to a normal distribution.
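For readers who wish to experiment, the representation (2.1) translates directly into a sampler. The following sketch is our own illustration: it truncates the stick-breaking series for $Q\sim\mathrm{DP}(\nu)$ at a fixed number of atoms (and renormalises), and assumes access to a sampler sample_nu for the normalised base measure $\nu/|\nu|$; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_posterior_draw_g(z, g, nu_mass, sample_nu, n_sticks=200, rng=rng):
    """One draw of P_n g via (2.1): P_n = V_n Q + (1-V_n) sum_i W_i d_{Z_i}/sum_j W_j.

    Q ~ DP(nu) is drawn by stick-breaking, truncated at n_sticks atoms and
    renormalised; sample_nu(k) must return k draws from nu/|nu|.
    """
    n = len(z)
    V = rng.beta(nu_mass, n)                            # V_n ~ Beta(|nu|, n)
    betas = rng.beta(1.0, nu_mass, size=n_sticks)       # stick-breaking fractions
    sticks = betas * np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
    Qg = np.sum(sticks * g(sample_nu(n_sticks))) / sticks.sum()
    W = rng.exponential(size=n)                         # W_i i.i.d. Exp(1)
    Pg_emp = np.sum(W * g(z)) / W.sum()                 # weighted empirical part
    return V * Qg + (1 - V) * Pg_emp

z = rng.normal(size=300)
draws = [dp_posterior_draw_g(z, np.tanh, nu_mass=2.0,
                             sample_nu=lambda k: rng.normal(size=k))
         for _ in range(1000)]
print("posterior mean of P_n g:", round(float(np.mean(draws)), 3))
```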
The variable $V_n$ is of the order $1/n$, and the term in brackets in the first term on the right side of (2.2) is bounded in absolute value by $Q|g|+\max_{1\le i\le n}|g(Z_i)|$. Because $EQ|g|=\nu|g|/|\nu|<\infty$ and, by Lemma 5 (below), since $P_0g^2<\infty$,
$$\max_{1\le i\le n}|g(Z_i)|=o(\sqrt n),\qquad\text{almost surely},\qquad(2.3)$$
the first term on the right tends to zero, in distribution conditionally given almost every sequence $Z_1,Z_2,\dots$, as $n\to\infty$. The leading factor $1/\overline W_n$ of the second term on the right of (2.2) tends to $1$ almost surely, by the strong law of large numbers. If $P_0g^2<\infty$, then $\mathbb{P}_ng\to P_0g$ and $\mathbb{P}_ng^2\to P_0g^2$, almost surely. This may be used together with (2.3) to show that, for every $\varepsilon>0$, the conditional Lindeberg condition
$$\frac1n\sum_{i=1}^nE\Big[(W_i-1)^2\big(g(Z_i)-\mathbb{P}_ng\big)^2\,1\big\{|W_i-1|\,|g(Z_i)-\mathbb{P}_ng|>\varepsilon\sqrt n\big\}\,\Big|\,Z_1,\dots,Z_n\Big]\to0$$
holds. The Lindeberg central limit theorem then gives that, given almost every sequence $Z_1,Z_2,\dots$,
$$\frac1{\sqrt n}\sum_{i=1}^n(W_i-1)\big(g(Z_i)-\mathbb{P}_ng\big)\rightsquigarrow N(0,\sigma_g^2),\qquad\sigma_g^2:=P_0(g-P_0g)^2.\qquad(2.4)$$
If we knew that the moment generating functions of the variables on the left were bounded, then this would imply convergence of exponential moments, and the theorem would be proved for $\mathcal{G}=\{g\}$. The approach below is first to strengthen the preceding display to uniformity in $g$, and next to show that the exponential moments of the variables on the left are suitably bounded.
For the uniformity, we use the assumption that $\mathcal{G}$ is Glivenko-Cantelli. This is not overly strong, and may not be far from necessary. Indeed, if the variables $W_i-1$ were standard normal instead of centered exponential, the leading factor $1/\overline W_n$ were not present and $\nu=0$, then the conditional distribution of the left side of the preceding display would be $N\big(0,\mathbb{P}_n(g-\mathbb{P}_ng)^2\big)$, and convergence of these normal distributions to $N(0,\sigma_g^2)$ would imply the convergence $\mathbb{P}_n(g-\mathbb{P}_ng)^2\to\sigma_g^2$, uniformly in $g$ if the convergence in distribution were uniform. This is close to the Glivenko-Cantelli property.
To take account of possible non-measurability of the supremum in Theorem 1, we use an explicit bound on the distance between the distributions in (2.4).

Lemma 1.
For independent mean-zero random variables $X_1,\dots,X_n$, the bounded Lipschitz distance between the law of $n^{-1/2}\sum_{i=1}^nX_i$ and the mean-zero normal distribution with the same variance satisfies, for any $0<\varepsilon<1$, with $H(u)=u+u^{1/3}$ and $H_0(u)=u^{1/4}\big(1+|\log u|^{1/2}\big)$,
$$d\Big(\mathcal L\Big(\frac1{\sqrt n}\sum_{i=1}^nX_i\Big),\,N\Big(0,\frac1n\sum_{i=1}^nEX_i^2\Big)\Big)\le2\varepsilon+H\Big(\frac1{\varepsilon^2n}\sum_{i=1}^nEX_i^2\,1_{|X_i|>\varepsilon\sqrt n}\Big)+CH_0\Big(\varepsilon\Big(\frac1n\sum_{i=1}^nEX_i^2\Big)^{1/2}\Big).\qquad(2.5)$$

Proof. We combine results by [36] and [11], as outlined in the proof of Proposition A.5.2 in [35], which gives the lemma for i.i.d. variables. The first step is to note that, for $Z_i$ the variables $X_i$ truncated at the level $\varepsilon\sqrt n$ and recentered at mean zero,
$$P\Big(\Big|\frac1{\sqrt n}\sum_{i=1}^n(X_i-Z_i)\Big|>\varepsilon\Big)\le\frac1{\varepsilon^2n}\sum_{i=1}^nEX_i^2\,1_{|X_i|>\varepsilon\sqrt n}.$$
We can then use Strassen's theorem (see e.g. [12], Theorem 11.6.2) to see that the Prohorov distance between the laws of $n^{-1/2}\sum_{i=1}^nX_i$ and $n^{-1/2}\sum_{i=1}^nZ_i$ is bounded above by the maximum of $\varepsilon$ and the right side of the last display. Next, we apply Theorem B in [11] to bound the Prohorov distance between the law of $n^{-1/2}\sum_{i=1}^nZ_i$ and the mean-zero normal distribution with the same variance; since the $Z_i$ are bounded by a multiple of $\varepsilon\sqrt n$, this distance is bounded by a multiple of the third term of (2.5). Finally, the distance between this normal distribution and the normal distribution in the lemma is bounded by $\big(n^{-1}\sum_{i=1}^nEX_i^2\,1_{|X_i|>\varepsilon\sqrt n}\big)^{1/3}$, using Lemma 2.1 of [11]. These estimates use the Prohorov distance, but they are also valid for the bounded Lipschitz distance, which is bounded above by twice the Prohorov distance (see [12], Corollary 11.6.5). (Then we may also replace $u^{1/3}$ by $u^{1/2}$ in $H$.) The idea of the lemma is that the first and third terms on the right of (2.5) can be made arbitrarily small by the choice of $\varepsilon$ if $n^{-1}\sum_{i=1}^nEX_i^2$ remains bounded (as $H_0(u)\to0$ as $u\downarrow0$), while for fixed $\varepsilon>0$ the middle term tends to zero as $n\to\infty$ if the Lindeberg condition holds.
Lemma 2. Suppose $\mathcal{G}$ is a $P_0$-Glivenko-Cantelli class of measurable functions $g:\mathcal X\to\mathbb R$ with envelope function $G$ such that $\nu G<\infty$ and $P_0G^2<\infty$. Then the convergence in (2.4) is uniform in $g\in\mathcal G$. More precisely, for $d$ the bounded Lipschitz distance,
$$\sup_{g\in\mathcal G}d\Big(\mathcal L\Big(\frac1{\sqrt n}\sum_{i=1}^n(W_i-1)\big(g(Z_i)-\mathbb P_ng\big)\,\Big|\,Z_1,\dots,Z_n\Big),\,N(0,\sigma_g^2)\Big)\xrightarrow{as*}0.$$

Proof. By the square-integrability of $G$ and the preservation of the Glivenko-Cantelli property under continuous transformations (see [34]), the class $\{g^2:g\in\mathcal G\}$ is also Glivenko-Cantelli. Thus, for $\sigma^2_{n,g}=\mathbb P_n(g-\mathbb P_ng)^2$, we have $\sup_{g\in\mathcal G}|\sigma^2_{n,g}-\sigma_g^2|\xrightarrow{as*}0$.
We now apply (2.5) to the variables $X_i=(W_i-1)\big(g(Z_i)-\mathbb P_ng\big)$, conditionally given $Z_1,Z_2,\dots$, together with the fact that $\sup_{0<v\le u}H_0(v)\le CH_0(u)$, to obtain an upper bound on the bounded Lipschitz distance between the conditional law of $n^{-1/2}\sum_{i=1}^nX_i$ and $N(0,\sigma^2_{n,g})$. Here $\sup_g\sigma^2_{n,g}$ can be further bounded by $\mathbb P_nG^2$, resulting in a uniform upper bound on the distance that is a measurable function of $Z_1,Z_2,\dots$. This measurable upper bound tends to zero almost surely, as $n\to\infty$ followed by $\varepsilon\to0$, because $\max_{1\le i\le n}G(Z_i)=o(\sqrt n)$ almost surely by Lemma 5, and $P_0G^2<\infty$. Because $d\big(N(0,\sigma^2),N(0,\tau^2)\big)\le|\sigma^2-\tau^2|$ for any $\sigma,\tau>0$, the variances $\sigma^2_{n,g}$ in the left side can be replaced by their limits $\sigma^2_g$, by the display in the preceding paragraph. This shows the convergence of the second term in (2.2), without the denominator $\overline W_n$, to the claimed Gaussian limits. Here $\overline W_n\to1$, almost surely, and the first term is bounded above by $\sqrt nV_n\big(QG+\max_{1\le i\le n}G(Z_i)\big)$, which tends to zero almost surely, by the argument given before. Since $d\big(\mathcal L(\mu+\sigma X),\mathcal L(X)\big)\le|\mu|+|\sigma-1|E|X|$, for any $\mu,\sigma$ and random variable $X$, the scaling by $\overline W_n$ and the shifting by the first term do not change the Gaussian limits.

Lemma 3. Suppose that, for some $T>0$,
$$\limsup_{n\to\infty}\sup_{g\in\mathcal G}E\big(e^{T\sqrt n(P_ng-\mathbb P_ng)}\mid Z_1,\dots,Z_n\big)<\infty,\qquad\text{almost surely}.\qquad(2.6)$$
Then (1.1) holds for $0\le t<T$. Furthermore, if (2.6) holds for some $T<0$, then (1.1) holds for $T<t\le0$.
Proof. For fixed $t>0$ and $M>0$, the function $h_M(x)=e^{tx}\wedge M$ is bounded and Lipschitz. Lemma 2 thus gives that, with $E_Z$ denoting the conditional expectation given $Z_1,Z_2,\dots$,
$$\sup_{g\in\mathcal G}\Big|E_Zh_M\big(\sqrt n(P_ng-\mathbb P_ng)\big)-Eh_M(\sigma_g\xi)\Big|\xrightarrow{as*}0,\qquad\xi\sim N(0,1).$$
The remaining difference $E_Z\big(e^{t\sqrt n(P_ng-\mathbb P_ng)}-h_M(\sqrt n(P_ng-\mathbb P_ng))\big)$ is, for $t<T$ and sufficiently large $M$ and $n$, arbitrarily small, uniformly in $g\in\mathcal G$, by assumption (2.6). The proof of the assertion with $T<0$ is similar (or replace $P_n-\mathbb P_n$ by $\mathbb P_n-P_n$ in the argument).
Lemma 4. If $\mathcal G$ has envelope function $G$ such that $\int e^{tG}\,d\nu\le Ce^{ct^2}$ for every $t>0$ and some $C,c>0$, and $P_0G^{2+\delta}<\infty$ for some $\delta>0$, then (2.6) holds for every $T$ in a sufficiently small neighbourhood of $0$.
Proof. By the Cauchy-Schwarz inequality, $Ee^{T(Y_1+Y_2)}<\infty$ if $Ee^{2TY_i}<\infty$ for $i=1,2$. Thus it suffices to prove that the two terms on the right side of (2.2) separately possess finite exponential moments that are bounded in $n$.
Since $P_0G^2<\infty$, we have that $\varepsilon_n:=\max_{1\le i\le n}G(Z_i)/\sqrt n\to0$, almost surely, by Lemma 5. The absolute value of the first term of (2.2) is bounded above by $nV_n\big(n^{-1/2}QG+\varepsilon_n\big)$, where $\varepsilon_n$ tends to zero almost surely. Thus this term has bounded exponential moments, by Lemma 6.
Next consider the second term on the right side of (2.2); equivalently, assume that $\nu=0$. Its absolute value is bounded by $(1/\overline W_n)\,n^{-1/2}\big|\sum_{i=1}^n(W_i-1)(g(Z_i)-\mathbb P_ng)\big|$. Since $Ee^X=1+\int_0^\infty P(X\ge x)e^x\,dx$, it follows that, for $T>0$, the conditional exponential moment of this bound splits into two integrals, according to whether $\overline W_n\le1-3x/n$ or not. The probability in the first integral on the far right is bounded above by $e^{-3x/2}$, for every $x>0$, using (2.8); thus the integral involving this term is bounded above by $\int_0^\infty e^{-x/2}\,dx=2$. For $x$ in the integration range of the second integral on the far right, the number $1-3x/n$ is at least $1/2$ for large enough $n$. Hence the preceding expression is bounded above by a multiple of
$$E_Z\exp\Big(2T\Big|\frac1{\sqrt n}\sum_{i=1}^n(W_i-1)\big(g(Z_i)-\mathbb P_ng\big)\Big|\Big),$$
and it suffices to show that this last expectation is finite and bounded in $g\in\mathcal G$, for some $T>0$. Let $\psi_1(x)=e^x-1$, and let $\|\cdot\|_{\psi_1}$ be the corresponding Orlicz norm. Then, by Proposition A.1.6 and Lemma 2.2.2 in [35], with the norms interpreted conditionally given $Z_1,Z_2,\dots$,
$$\Big\|\frac1{\sqrt n}\sum_{i=1}^n(W_i-1)\big(g(Z_i)-\mathbb P_ng\big)\Big\|_{\psi_1}\lesssim\Big(\frac1n\sum_{i=1}^n\big(g(Z_i)-\mathbb P_ng\big)^2\Big)^{1/2}+\frac{\log n}{\sqrt n}\max_{1\le i\le n}\big|g(Z_i)-\mathbb P_ng\big|.$$
Under the condition $P_0G^{2+\delta}<\infty$, the first term on the right is bounded almost surely, by the law of large numbers, while the second tends to zero almost surely, by Lemma 5. By the definition of the Orlicz norm, $Ee^{|Y|/C}\le2$ for $C\ge\|Y\|_{\psi_1}$ and any random variable $Y$. This concludes the proof for $T>0$. For $T<0$, we copy the preceding argument, but with $W_i-1$ replaced by $1-W_i$ and $T$ by $|T|$.
Lemma 5. If $Y_1,Y_2,\dots$ are i.i.d. random variables with $E|Y_1|^r<\infty$ for some $r>0$, then $\max_{1\le i\le n}|Y_i|=o(n^{1/r})$, almost surely.

Proof. For any $y>0$ we have
$$\frac1n\max_{1\le i\le n}|Y_i|^r\le\frac{y^r}n+\frac1n\sum_{i=1}^n|Y_i|^r\,1_{|Y_i|>y}.$$
As $n\to\infty$ the first term tends to zero, for fixed $y$, while the second tends to $E|Y_1|^r1_{|Y_1|>y}$ by the law of large numbers, and this can be made arbitrarily small by the choice of $y$.
Lemma 6. Suppose that $Q\sim\mathrm{DP}(\nu)$ and $V_n\sim\mathrm{Beta}(|\nu|,n)$ are independent, and let $G$ be a nonnegative, measurable function. Then:
(i) if $t_n\to t<1$, then $Ee^{t_nnV_n}\to(1-t)^{-|\nu|}$;
(ii) if $\nu e^{tG}\le Ce^{ct^2}$ for every $t>0$ and some $c,C>0$, then $Ee^{t\sqrt nV_nQG}\to1$ for every $t$ with $ct^2<1$;
(iii) conversely, if $Ee^{t\sqrt nV_nQG}$ remains bounded in $n$ for some $t>0$, then $\nu e^{sG}\le C(\nu)e^{32s^2/t^2}$ for all sufficiently large $s$.
Proof. We have that, by the substitution $u=nv$,
$$\int_0^1e^{tnv}v^{|\nu|-1}(1-v)^{n-1}\,dv=n^{-|\nu|}\int_0^ne^{tu}u^{|\nu|-1}\Big(1-\frac un\Big)^{n-1}\,du.$$
The integrand on the right is dominated by $e^{tu}u^{|\nu|-1}e^{-u(1-1/n)}$, which is integrable, uniformly in sufficiently large $n$, for $t<1$. Therefore, for fixed $t<1$, the integral times $n^{|\nu|}$ is asymptotic to $\int_0^\infty u^{|\nu|-1}e^{-u(1-t)}\,du=(1-t)^{-|\nu|}\Gamma(|\nu|)$, by the dominated convergence theorem. By the definition of the beta distribution, the expectation $Ee^{t_nnV_n}$ is the quotient of two integrals as in the display, with $t=t_n$ and with $t=0$, respectively. This concludes the proof of (i).
For (ii), we first note that $Ee^{tQG}\le EQe^{tG}=\int e^{tG}\,d\nu/|\nu|$ by Jensen's inequality, whence $Ee^{tQG}\lesssim e^{ct^2}$ by the assumption. By the independence of $Q$ and $V_n$ and a coordinate substitution as under (i),
$$Ee^{t\sqrt nV_nQG}=\frac{\int_0^n\big(Ee^{tuQG/\sqrt n}\big)\,u^{|\nu|-1}(1-u/n)^{n-1}\,du}{\int_0^nu^{|\nu|-1}(1-u/n)^{n-1}\,du}.$$
The integrand in the numerator tends pointwise to $u^{|\nu|-1}e^{-u}$ and is dominated by a multiple of $e^{ct^2u^2/n}u^{|\nu|-1}e^{-u(1-1/n)}\le u^{|\nu|-1}e^{-du}$ on $[0,n]$, for a constant $d<1-ct^2$ and sufficiently large $n$. The denominator is as before. The integrals thus have the same limit, and the quotient tends to $1$.
For (iii), note that by the stick-breaking representation of the Dirichlet process, the variable $QG$ is stochastically larger than $WG(\theta)$, for $W\sim\mathrm{Beta}(1,|\nu|)$ independent of $\theta\sim\nu/|\nu|$. It follows that, for any $t\ge0$ and with $\psi(t)=\int e^{tG}\,d\nu$,
$$Ee^{t\sqrt nV_nQG}\ge Ee^{t\sqrt nV_nWG(\theta)}\ge\frac{\psi(t\sqrt n/4)}{|\nu|}\,P\big(V_nW\ge\tfrac14\big).$$
Then, by the preceding calculations, for sufficiently large $n$ the probability on the right is bounded below by $n^{-1}2^{-n}$, whence $\psi(t\sqrt n/4)\le C(\nu)e^n$ for large enough $n$, if the left side of the display remains bounded in $n$. For sufficiently large $s$, there exists large enough $n$ such that $t\sqrt{n-1}/4<s\le t\sqrt n/4$, and then $\psi(s)\le\psi(t\sqrt n/4)\le C(\nu)e^n\le C(\nu)e^{32s^2/t^2}$.
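A quick Monte Carlo check of assertion (i), that $Ee^{tnV_n}\to(1-t)^{-|\nu|}$ for $V_n\sim\mathrm{Beta}(|\nu|,n)$ and $t<1$ (our own script; the limit is the Laplace transform of a $\mathrm{Gamma}(|\nu|,1)$ variable, consistent with $nV_n$ converging in distribution to such a variable):

```python
import numpy as np

rng = np.random.default_rng(4)

nu_mass, t = 2.0, 0.5
for n in (10, 100, 1000):
    V = rng.beta(nu_mass, n, size=200_000)   # V_n ~ Beta(|nu|, n)
    mc = np.exp(t * n * V).mean()            # Monte Carlo E exp(t n V_n)
    print(f"n={n:4d}:  E exp(t n V_n) = {mc:.3f}   limit (1-t)^-|nu| = {(1 - t) ** -nu_mass:.3f}")
```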

Proof of Proposition 1
Proof of Proposition 1. Because by assumption the variation of a function $g\in\mathcal G$ is bounded uniformly over all intervals $[a,b]$, the limits of $g(x)$ as $x\to\pm\infty$ exist and are finite. (Indeed, the values $|g(a)|$, for $a<0$, are bounded by $|g(0)|+V_a^0(g)\le|g(0)|+V$, and hence every sequence $g(x_n)$ with $x_n\to-\infty$ has a converging subsequence. If there were two subsequences $x_n$ and $y_n$ with different limits, then these could without loss of generality be chosen alternating: $x_1\ge y_1\ge x_2\ge y_2\ge\cdots$, and the variation over the partitions containing $y_N,x_N,\dots,y_1,x_1$ would tend to infinity with $N$.) It can be seen that the variation of the extended function $g$ over $[-\infty,\infty]$ is the supremum of the variations over all intervals $[a,b]$, and hence is also finite. In particular, the functions $g-g(-\infty)$ are uniformly bounded. As shifting the functions by a constant does not change the claim of the proposition, we can assume without loss of generality that $g(-\infty)=0$, and that the class $\mathcal G$ has a uniformly bounded envelope function $G$. We can then decompose $g=g^+-g^-$, for right-continuous, nondecreasing functions $g^+,g^-:[-\infty,\infty]\to\mathbb R$, uniformly bounded by $2V$ (e.g. Section 6.3 of [32]). Let $dg=dg^+-dg^-$ be the corresponding signed (Stieltjes) measure, and $|dg|=dg^++dg^-$ its total variation.
We work on the probability space from Theorem 2. For $(B_n)$ the Brownian bridges in that theorem and $g\in\mathcal G$, set
$$W_ng=-\int B_n\big(\mathbb F_n(x)\big)\,dg(x).$$
It can be seen that, given $Z_1,Z_2,\dots$, the variable $W_ng$ possesses a $N\big(0,\|g-\mathbb P_ng\|^2_{L^2(\mathbb P_n)}\big)$-distribution, whence $W_n$ is a $\mathbb P_n$-Brownian bridge process, indexed by $\mathcal G$.
The process $F=F_n$ in Theorem 2 is the distribution function of the posterior Dirichlet process $P_n$. By partial integration, we have
$$\sqrt n(P_ng-\mathbb P_ng)=\int g\,d\big(\sqrt n(F_n-\mathbb F_n)\big)=-\int\sqrt n(F_n-\mathbb F_n)(x)\,dg(x).$$
Writing $\Delta_n=\|\sqrt n(F_n-\mathbb F_n)-B_n\circ\mathbb F_n\|_\infty$, we thus find that, for every sequence $Z_1,Z_2,\dots$,
$$\sup_{g\in\mathcal G}\big|\sqrt n(P_ng-\mathbb P_ng)-W_ng\big|=\sup_{g\in\mathcal G}\Big|\int\big(\sqrt n(F_n-\mathbb F_n)-B_n\circ\mathbb F_n\big)\,dg\Big|\le2V\Delta_n.$$
Since bounded variation balls are uniform Donsker classes, $\mathcal G$ is $P_0$-Glivenko-Cantelli. Thus, to prove the proposition, by Lemmas 2 and 3 we need show only that (2.6) holds for all $T\in\mathbb R$. Using the last display and the Cauchy-Schwarz inequality,
$$E_Ze^{T\sqrt n(P_ng-\mathbb P_ng)}\le\big(E_Ze^{4TV\Delta_n}\big)^{1/2}\big(E_Ze^{2TW_ng}\big)^{1/2}.$$
The first term converges to $1$ as $n\to\infty$, for every sequence $Z_1,Z_2,\dots$, by Lemma 8 below. The second term equals $e^{T^2\mathbb P_n(g-\mathbb P_ng)^2}$, so that
$$\sup_{g\in\mathcal G}e^{T^2\mathbb P_n(g-\mathbb P_ng)^2}\le e^{T^2\mathbb P_nG^2}\to e^{T^2P_0G^2}<\infty,\qquad P_0^\infty\text{-a.s.}$$
This establishes (2.6) and completes the proof.
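The definition of $W_ng$ and the integration-by-parts identity are easy to check by simulation. The following sketch is our own, assuming $g=\tanh$ (so $dg=\mathrm{sech}^2(x)\,dx$) and discretising both the bridge and the integral; it verifies that $-\int B_n(\mathbb F_n(x))\,dg(x)$ has conditional variance close to $\mathbb P_n(g-\mathbb P_ng)^2$:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 400
z = rng.normal(size=n)
zs = np.sort(z)
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
Fn = np.searchsorted(zs, x, side="right") / n   # empirical cdf F_n on the grid
gprime = 1.0 / np.cosh(x) ** 2                  # g = tanh, so dg = g'(x) dx

t = np.linspace(0.0, 1.0, 1001)
dt = np.diff(t, prepend=0.0)

draws = []
for _ in range(3000):
    W = np.cumsum(rng.normal(scale=np.sqrt(dt)))   # Brownian motion on the t-grid
    B = W - t * W[-1]                              # Brownian bridge B_n
    B_at_Fn = np.interp(Fn, t, B)                  # B_n(F_n(x)) along the x-grid
    draws.append(-np.sum(B_at_Fn * gprime) * dx)   # W_n g = -int B_n(F_n) dg

print("sample var of W_n g: %.4f   P_n(g - P_n g)^2: %.4f"
      % (np.var(draws), np.var(np.tanh(z))))
```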

Proofs of strong approximation results
We recall some useful facts. For a centered Gaussian process $(G_t)_{t\in T}$ with countable index set $T$ satisfying $\sup_{t\in T}|G_t|<\infty$ almost surely, Borell's inequality (Theorem 7.1 of [20]) states that, for every $x>0$,
$$P\Big(\sup_{t\in T}|G_t|\ge E\sup_{t\in T}|G_t|+x\Big)\le2e^{-x^2/(2\sigma^2)},\qquad(2.7)$$
where $\sigma^2=\sup_{t\in T}EG_t^2<\infty$. If $G$ has continuous sample paths and $T\subset\mathbb R$ is uncountable, (2.7) still holds, since we may restrict the supremum to a countable skeleton of $T$.
For $X_\theta\sim\mathrm{Gamma}(\theta,1)$, we have the pair of exponential inequalities
$$P\big(X_\theta\ge\theta+\sqrt{2\theta x}+x\big)\le e^{-x},\qquad P\big(X_\theta\le\theta-\sqrt{2\theta x}\big)\le e^{-x},\qquad(2.8)$$
for every $x>0$; see pp. 28-29 of [1]. We also denote by $P_Z$ the conditional probability given $Z_1,\dots,Z_n$.
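The two gamma tail inequalities can be verified numerically against the exact gamma distribution function (our own check, using SciPy):

```python
import numpy as np
from scipy import stats

theta = 5.0                                        # X_theta ~ Gamma(theta, 1)
for x in (0.5, 2.0, 8.0):
    up = stats.gamma.sf(theta + np.sqrt(2 * theta * x) + x, a=theta)
    lo = stats.gamma.cdf(theta - np.sqrt(2 * theta * x), a=theta)
    print(f"x={x}: upper tail {up:.2e}, lower tail {lo:.2e}, bound e^-x = {np.exp(-x):.2e}")
```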
Proof of Theorem 2. By the representation (2.1), the posterior distribution function decomposes into a prior part, controlled through $V_n\sim\mathrm{Beta}(|\nu|,n)$ and $Q\sim\mathrm{DP}(\nu)$ by the gamma inequalities (2.8), and a Bayesian bootstrap part. Combining the above and using $2\sqrt{|\nu|x}\le|\nu|+x$, for all $x>0$, yields an exponential inequality, (2.9), for the distance between $\sqrt n(F_n-\mathbb F_n)$ and the corresponding Bayesian bootstrap process. It therefore remains to show the desired exponential inequality with the Bayesian bootstrap posterior $\bar F$ in place of $F_n$.

Let $U_1,\dots,U_{n-1}\sim U(0,1)$ be i.i.d. and independent of $(Z_i)_{i\ge1}$, and denote the corresponding order statistics by $0=U_{(0)}<U_{(1)}<\cdots<U_{(n-1)}<U_{(n)}=1$. For given $Z_1,\dots,Z_n$, the Bayesian bootstrap posterior distribution can be represented in law as $\bar P_n=\sum_{i=1}^n(U_{(i)}-U_{(i-1)})\delta_{Z_{(i)}}$, where $Z_{(1)}\le\cdots\le Z_{(n)}$ are the order statistics of the sample and we have used the exchangeability of the spacings $(U_{(i)}-U_{(i-1)}:1\le i\le n)$. This gives
$$\bar F(z)=U_{(n\mathbb F_n(z))},\qquad z\in\mathbb R.\qquad(2.10)$$
Define the empirical quantile function $Q_{n-1}(t)$ of the $U_i$'s by $Q_{n-1}(t)=U_{(i)}$ for $(i-1)/(n-1)<t\le i/(n-1)$, and define $u_{n-1}(t)=\sqrt{n-1}\big(Q_{n-1}(t)-t\big)$ to be the uniform quantile process. By Theorem 1 of Csörgő and Révész [7], one can define for each $n$ a Brownian bridge $\{\bar B_n(t):0\le t\le1\}$ on the same probability space such that for all $x\ge0$,
$$P\Big(\sup_{0\le t\le1}\big|u_n(t)-\bar B_n(t)\big|>\frac{c_1\log n+x}{\sqrt n}\Big)\le c_2e^{-c_3x},\qquad(2.11)$$
where $c_1,c_2,c_3$ are universal constants. Since these Brownian bridges are constructed from $(U_i)_{i\ge1}$, which are independent of $(Z_i)_{i\ge1}$, they may also be taken to be independent of $(Z_i)_{i\ge1}$. Setting $B_n=\bar B_{n-1}$ and following [23], we can bound
$$\sup_{z\in\mathbb R}\big|\sqrt n(\bar F-\mathbb F_n)(z)-B_n(\mathbb F_n(z))\big|\le I_B+II_B+III_B,\qquad(2.12)$$
where $I_B$ is the coupling error between the (rescaled) uniform quantile process and $B_n$, the term $II_B=\max_{1\le i<n}|V_i|$ with $V_i=B_n(i/(n-1))-B_n(i/n)$ accounts for the change of time scale, and $III_B$ accounts for the rescaling of the bridge. We prove separate exponential inequalities for $I_B$-$III_B$. For $n\ge2$, the term $I_B$ is, up to a universal constant, the supremum distance between $u_{n-1}$ and $\bar B_{n-1}$, and the required inequality follows from (2.11).
Since $B_n$ is a Brownian bridge, $V_1,\dots,V_{n-1}$ are Gaussian random variables with $V_i\sim N\big(0,\frac i{n(n-1)}\big(1-\frac i{n(n-1)}\big)\big)$. Thus $\mathrm{Var}(V_i)\le1/n$ for all $i$, and so the standard Gaussian maximal inequality (Lemma 2.3.4 of [16]) yields $E\max_{1\le i<n}|V_i|\le C\sqrt{\log n/n}$ for an absolute constant $C>0$. Applying Borell's inequality (2.7) then gives, for $x>0$,
$$P\Big(II_B>C\sqrt{\frac{\log n}n}+\frac x{\sqrt n}\Big)\le2e^{-x^2/2}.\qquad(2.13)$$
For $III_B$, recall that for a Brownian bridge $B_n$ we have $P(\|B_n\|_\infty>x)\le2e^{-2x^2}$ for $x>0$ (Proposition 12.3.4 of [12]); using the mean value theorem to bound the rescaling factor $\sqrt{n/(n-1)}-1$ by a multiple of $1/n$, this yields an exponential inequality for $III_B$. Combining the exponential inequalities for $I_B$-$III_B$ via a union bound and comparing the dominating terms gives the inequality of the theorem, for all $x>0$ and universal constants $C_1,C_2,C_3>0$. Together with (2.9) this gives the result.
Proof of Theorem 3. Using the exponential inequality (2.9), we need show only the result with $\sqrt n(\bar F-\mathbb F_n)$ instead of $\sqrt n(F-\mathbb F_n)$, where $\bar F$ is defined in (2.10).
Arguing as in (2.12), we can decompose $\sup_z|\sqrt n(\bar F-\mathbb F_n)(z)-n^{-1/2}K(\mathbb F_n(z),n)|$ into terms $I_K$ and $II_K$, for which we again establish separate exponential inequalities. For $n\ge2$, the term $I_K$ splits into the coupling error between the uniform quantile process and the Kiefer process, for which we use the KMT inequality (2.14), and a remainder term; since $\{(n-1)^{-1/2}K(s,n-1):s\in[0,1]\}$ is a Brownian bridge for each $n\ge2$, we use the first inequality in Lemma 7 to deal with this second term. Together these yield an exponential inequality for $I_K$, for all $x>0$ and universal constants $C_1$-$C_4>0$.
Using again that $\{(n-1)^{-1/2}K(s,n-1):s\in[0,1]\}$ is a Brownian bridge, $II_K^{(1)}$ is equal in distribution to $II_B$, for which we use the inequality (2.13). Similarly, $II_K^{(2)}=(n-1)^{-1}\max_{1\le i<n}|n^{-1/2}K(i/n,n)|$ is equal in distribution to $III_B$ up to a universal constant factor, and hence satisfies $P(II_K^{(2)}>Cx/n)\le2e^{-2x^2}$ for a universal constant $C>0$ and all $x>0$. Together these give an exponential inequality for $II_K$, for all $x>0$. Using the exponential inequalities for $I_K$ and $II_K$, a union bound, and the bound $\sqrt x\log n\le(\log n)^2+x$, valid for all $x>0$, we obtain a two-term deviation bound with universal constants $C_1$-$C_4>0$. The first term on the right-hand side dominates if and only if $x\le D(\log n)^2/n^{1/3}$ for a universal constant $D>0$. For such $x$, the upper bound in the last display is bounded below by $C_3\exp(-C_4(\log2)^2/2^{1/3})$ for all $n\ge2$, which can be made larger than $1$ by taking $C_3$ universal and large enough. The resulting inequality is thus trivially satisfied for such $x$, which implies the bound of the theorem for all $x>0$ and (different) universal constants $C_2$-$C_4>0$. Together with (2.9) this yields the result.
Proof of Corollary 1. That $P(A_{n,y})\ge1-2e^{-2y^2}$ follows from the Dvoretzky-Kiefer-Wolfowitz-Massart inequality. Let $\{B_n:n\ge1\}$ be the Brownian bridges from Theorem 2. By the triangle inequality,
$$\big|\sqrt n(F-\mathbb F_n)(z)-B_n(F_0(z))\big|\le\big|\sqrt n(F-\mathbb F_n)(z)-B_n(\mathbb F_n(z))\big|+\big|B_n(\mathbb F_n(z))-B_n(F_0(z))\big|,$$
and the exponential inequality for the first term follows from Theorem 2. Since $\{B_n:n\ge1\}$ are independent of $(Z_i)_{i\ge1}$ by Theorem 2, applying the second inequality in Lemma 7 gives an exponential inequality for the second term, for all $x>0$ and a universal constant $K>0$. The result follows by a union bound.
Lemma 7. Let $B=\{B_t:t\in[0,1]\}$ be a Brownian bridge and let $\mathbb F_n$ be the empirical distribution function of $Z_1,\dots,Z_n\sim^{iid}F_0$. Then there exists a universal constant $K>0$ such that, for $n\ge2$ and every $x>0$,
$$P\Big(\sup_{z\in\mathbb R}\big|B(\mathbb F_n(z))-B(F_0(z))\big|>K\frac{x^{3/4}\sqrt{\log n}}{n^{1/4}}\Big)\le2e^{-x}.$$
If $B$ is independent of $Z_1,\dots,Z_n$, then there also exists $K>0$ such that, for $n\ge2$, every $x>0$ and every $y>0$, on the event $A_{n,y}=\{\sup_{z\in\mathbb R}|\mathbb F_n(z)-F_0(z)|\le y/\sqrt n\}$,
$$P\Big(\sup_{z\in\mathbb R}\big|B(\mathbb F_n(z))-B(F_0(z))\big|>K\frac{\sqrt y\,(\sqrt x+\sqrt{\log n})}{n^{1/4}}\,\Big|\,Z_1,\dots,Z_n\Big)\le2e^{-x}.$$

Proof. The intrinsic metric of the Brownian bridge is bounded above by the square root of the Euclidean distance, whence its metric entropy integral is a multiple of $\delta\mapsto\sqrt\delta\max\big(\sqrt{\log(1/\delta)},1\big)$. Therefore, by Dudley's theorem (see [16], Theorem 2.3.8), $E\sup_{s,t}|B_s-B_t|/J(|s-t|)<\infty$, for $J(\delta)=\sqrt\delta\max\big(\sqrt{\log(1/\delta)},1\big)$. Because the process $(s,t)\mapsto(B_s-B_t)/J(|s-t|)$ is centered Gaussian with uniformly bounded variance, we can apply Borell's inequality (2.7) to see that there exist constants $D,E>0$ such that, for $y>0$,
$$P\Big(\sup_{s,t}\frac{|B_s-B_t|}{J(|s-t|)}>y\Big)\le2e^{-D(y-E)^2}.$$
There exists a constant $C>0$ such that $Cy^2\le D(y-E)^2$, for $y>2E$. Then, for $y>2E$, the right side is bounded by $2e^{-Cy^2}$. By making $C$ still smaller if necessary, we can ensure that the right side is bigger than $1$ for $y\le2E$, and then the preceding inequality is valid for every $y>0$. By the Dvoretzky-Kiefer-Wolfowitz-Massart inequality, we also have, for $y>0$, $P\big(\sup_{z\in\mathbb R}|\mathbb F_n(z)-F_0(z)|>y\big)\le2e^{-2ny^2}$.
Combining these two inequalities, we see that, for every $y_1,y_2>0$,
$$P\Big(\sup_{z\in\mathbb R}\big|B(\mathbb F_n(z))-B(F_0(z))\big|>y_1J(y_2)\Big)\le2e^{-Cy_1^2}+2e^{-2ny_2^2}.$$
We choose $y_1=\sqrt{2x/C}$ and $y_2=\sqrt{x/n}$ to reduce the right side to $4e^{-2x}$, and then have $y_1J(y_2)\le K_1x^{3/4}\max\big(\sqrt{\log(n/x)},1\big)/n^{1/4}$, for some constant $K_1>0$. For $x<\log2$, we have that $2e^{-x}>1$, and hence the first inequality of the lemma is trivially satisfied. For $x\ge\log2$, we have $4e^{-2x}\le2e^{-x}$ and $\max\big(\sqrt{\log(n/x)},1\big)\le K_2\sqrt{\log n}$, for some constant $K_2>0$ and $n\ge2$. The first inequality of the lemma follows. The second inequality follows similarly, by conditioning on $Z_1,\dots,Z_n$ and applying (2.7) to the conditional Gaussian process, whose increments are controlled on the event $A_{n,y}$.
Lemma 8. Let $\Delta_n=\|\sqrt n(F_n-\mathbb F_n)-B_n\circ\mathbb F_n\|_\infty$, on the probability space of Theorem 2. Then, for every $t\in\mathbb R$ and every sequence $Z_1,Z_2,\dots$, we have $E(e^{t\Delta_n}\mid Z_1,\dots,Z_n)\to1$ as $n\to\infty$.

Proof. Suppose $t>0$. For $\alpha_n=C_1(\log n+|\nu|)/\sqrt n$, with $C_1$ the universal constant from Theorem 2, the tail bound of that theorem gives, by the change of variable $u=e^{t\alpha_n+tC_1x/\sqrt n}$,
$$E\big(e^{t\Delta_n}\mid Z_1,\dots,Z_n\big)\le e^{t\alpha_n}+\int_0^\infty C_2e^{-C_3x}\,\frac{tC_1}{\sqrt n}\,e^{t\alpha_n+tC_1x/\sqrt n}\,dx\le e^{t\alpha_n}\Big(1+\frac{2C_2tC_1}{C_3\sqrt n}\Big)\to1,$$
as $n\to\infty$, where the last inequality holds as soon as $tC_1/\sqrt n\le C_3/2$. Since $\Delta_n\ge0$, the lower bound $E(e^{t\Delta_n}\mid Z_1,\dots,Z_n)\ge1$ holds trivially, which completes the proof for $t>0$. The case $t<0$ follows by a similar argument.

Lemma 9.
If $Y_n$ are random variables with $Ee^{tY_n}\to e^{t^2\sigma^2/2}$, for every $t$ in a subset of $\mathbb R$ that contains both a strictly increasing sequence with limit $0$ and a strictly decreasing sequence with limit $0$, then $Y_n\rightsquigarrow N(0,\sigma^2)$.
Proof. Let $T$ be the given set of points, and let $a<0$ and $b>0$ be contained in $T$. Because $Ee^{tY_n}$ is bounded in $n$, for both $t=a$ and $t=b$, the sequence $Y_n$ is tight, by Markov's inequality. For every $t\in T$ strictly between $a$ and $b$, some power larger than $1$ of the variable $e^{tY_n}$ is bounded in $L_1$, and hence the sequence $e^{tY_n}$ is uniformly integrable. Consequently, if $Y$ is a weak limit point of $Y_n$, then $Ee^{tY_n}$ tends to $Ee^{tY}$ along the corresponding subsequence, for every $t\in(a,b)\cap T$. In view of the assumption of the lemma, it follows that $Ee^{tY}=e^{t^2\sigma^2/2}$ on this set. The set $(a,b)\cap T$ is infinite by assumption, with accumulation point $0$. Finiteness of $Ee^{tY}$ at $a$ and $b$ implies that the function $z\mapsto Ee^{zY}$ is analytic in the open strip $a<\mathrm{Re}\,z<b$, which contains the imaginary axis. By analytic continuation it is equal to $e^{z^2\sigma^2/2}$ there, whence $Ee^{isY}=e^{-s^2\sigma^2/2}$, for every $s\in\mathbb R$. Thus every weak limit point of the tight sequence $Y_n$ is $N(0,\sigma^2)$-distributed, and $Y_n\rightsquigarrow N(0,\sigma^2)$.
Corollary 2. If $(Y_n,Z_n)$ are random elements with $E(e^{tY_n}\mid Z_n)\to e^{t^2\sigma^2/2}$, in probability, for every $t$ in a set that contains both a strictly increasing sequence with limit $0$ and a strictly decreasing sequence with limit $0$, then $Y_n\mid Z_n\rightsquigarrow N(0,\sigma^2)$, in probability. If the convergence in the assumption is in the almost sure sense, then so is the conclusion.
Proof. For the conclusion in probability it suffices to show that every subsequence of $\{n\}$ has a further subsequence along which $d\big(\mathcal L(Y_n\mid Z_n),N(0,\sigma^2)\big)\to0$, almost surely, where $d$ is a metric generating weak convergence. From the assumption we know that every subsequence has a further subsequence along which $E(e^{tY_n}\mid Z_n)\to e^{t^2\sigma^2/2}$, almost surely. For a countable set of $t$, we can construct a single subsequence with this property for every $t$, by a diagonalization scheme. The preceding lemma then gives that $d\big(\mathcal L(Y_n\mid Z_n),N(0,\sigma^2)\big)\to0$, almost surely, along this subsequence.