The Bernstein-Von-Mises theorem under misspecification

Abstract: We prove that the posterior distribution of a parameter in misspecified LAN parametric models can be approximated by a random normal distribution. We derive from this that Bayesian credible sets are not valid confidence sets if the model is misspecified. We obtain the result under conditions that are comparable to those in the well-specified situation: uniform testability against fixed alternatives and sufficient prior mass in neighbourhoods of the point of convergence. The rate of convergence is considered in detail, with special attention for the existence and construction of suitable test sequences. We also give a lemma to exclude testable model subsets which implies a misspecified version of Schwartz’ consistency theorem, establishing weak convergence of the posterior to a measure degenerate at the point at minimal Kullback-Leibler divergence with respect to the true distribution.


Introduction
The Bernstein-Von Mises theorem asserts that the posterior distribution of a parameter in a smooth finite-dimensional model is approximately a normal distribution if the number of observations tends to infinity. Apart from having considerable philosophical interest, this theorem is the basis for the justification of Bayesian credible sets as valid confidence sets in the frequentist sense: (central) sets of posterior probability 1 − α cover the parameter at confidence level 1 − α. In this paper we study the posterior distribution in the situation that the observations are sampled from a "true distribution" that does not belong to the statistical model, i.e. the model is misspecified. Although consistency of the posterior distribution and the asymptotic normality of the Bayes estimator (the posterior mean) in this case have been considered in the literature, by Berk (1966, 1970) [2,3] and Bunke and Milhaud (1998) [5], the behaviour of the full posterior distribution appears to have been neglected. This is surprising, because in practice the assumption of correct specification of a model may be hard to justify. In this paper we derive the asymptotic normality of the full posterior distribution in the misspecified situation under conditions comparable to those obtained by Le Cam in the well-specified case.
This misspecified version of the Bernstein-Von Mises theorem has an important consequence for the interpretation of Bayesian credible sets. In the misspecified situation the posterior distribution of a parameter shrinks to the point within the model at minimum Kullback-Leibler divergence to the true distribution, a consistency property that it shares with the maximum likelihood estimator. Consequently one can consider both the Bayesian procedure and the maximum likelihood estimator as estimates of this minimum Kullback-Leibler point. A confidence region for this minimum Kullback-Leibler point can be built around the maximum likelihood estimator based on its asymptotic normal distribution, involving the sandwich covariance. One might also hope that a Bayesian credible set automatically yields a valid confidence set for the minimum Kullback-Leibler point. However, the misspecified Bernstein-Von Mises theorem shows the latter to be false.
More precisely, let $B \mapsto \Pi_n(B \mid X_1, \ldots, X_n)$ be the posterior distribution of a parameter $\theta$ based on observations $X_1, \ldots, X_n$ sampled from a density $p_\theta$ and a prior measure $\Pi$ on the parameter set $\Theta \subset \mathbb{R}^d$. The Bernstein-Von Mises theorem asserts that if $X_1, \ldots, X_n$ is a random sample from the density $p_{\theta_0}$, the model $\theta \mapsto p_\theta$ is appropriately smooth and identifiable, and the prior puts positive mass around the parameter $\theta_0$, then,
$$\sup_B \Bigl| \Pi_n(B \mid X_1, \ldots, X_n) - N_{\hat\theta_n, (n i_{\theta_0})^{-1}}(B) \Bigr| \to 0, \quad \text{in probability},$$
where $N_{x,\Sigma}$ denotes the (multivariate) normal distribution centred on $x$ with covariance matrix $\Sigma$, $\hat\theta_n$ may be any efficient estimator sequence of the parameter and $i_\theta$ is the Fisher information matrix of the model at $\theta$. It is customary to identify $\hat\theta_n$ as the maximum likelihood estimator in this context (correct under regularity conditions).
The Bernstein-Von Mises theorem implies that any random sets $\hat B_n$ such that $\Pi_n(\hat B_n \mid X_1, \ldots, X_n) = 1 - \alpha$ for each $n$ satisfy $N_{0,I}\bigl((n i_{\theta_0})^{1/2}(\hat B_n - \hat\theta_n)\bigr) \to 1 - \alpha$, in probability. In other words, such sets $\hat B_n$ can be written in the form $\hat B_n = \hat\theta_n + i_{\theta_0}^{-1/2} \hat C_n / \sqrt{n}$ for sets $\hat C_n$ that receive asymptotically probability $1 - \alpha$ under the standard Gaussian distribution. This shows that the $(1 - \alpha)$-credible sets $\hat B_n$ are asymptotically equivalent to the Wald $(1 - \alpha)$-confidence sets based on the asymptotically normal estimators $\hat\theta_n$, and consequently they are valid $1 - \alpha$ confidence sets.
In this paper we consider the situation that the posterior distribution is formed in the same way relative to a model $\theta \mapsto p_\theta$, but we assume that the observations are sampled from a density $p_0$ that is not necessarily of the form $p_{\theta_0}$ for some $\theta_0$. We shall show that the Bernstein-Von Mises theorem can be extended to this situation, in the form,
$$\sup_B \Bigl| \Pi_n(B \mid X_1, \ldots, X_n) - N_{\hat\theta_n, (n V_{\theta^*})^{-1}}(B) \Bigr| \to 0, \quad \text{in probability},$$
where $\theta^*$ is the parameter value $\theta$ minimizing the Kullback-Leibler divergence $\theta \mapsto P_0 \log(p_0/p_\theta)$ (provided it exists and is unique, see corollary 4.2), $V_{\theta^*}$ is the second derivative matrix of this map at $\theta^*$, and $\hat\theta_n$ are suitable estimators. Under regularity conditions the estimators $\hat\theta_n$ can again be taken equal to the maximum likelihood estimators (for the misspecified model), which typically satisfy that the sequence $\sqrt{n}(\hat\theta_n - \theta^*)$ is asymptotically normal with mean zero and covariance matrix given by the "sandwich formula" $\Sigma_{\theta^*} = V_{\theta^*}^{-1}\, P_0[\dot\ell_{\theta^*} \dot\ell_{\theta^*}^T]\, V_{\theta^*}^{-1}$. (See, for instance, Huber (1967) [8] or Van der Vaart (1998) [18].) The usual confidence sets for the misspecified parameter take the form $\hat\theta_n + \Sigma_{\theta^*}^{1/2} C / \sqrt{n}$ for $C$ a central set in the Gaussian distribution. Because the covariance matrix $V_{\theta^*}^{-1}$ appearing in the misspecified Bernstein-Von Mises theorem is not the sandwich covariance matrix, central posterior sets of probability $1 - \alpha$ do not correspond to these misspecified Wald sets. Although they are correctly centered, they may have the wrong width, and are in general not $(1 - \alpha)$-confidence sets. We show below by example that the credible sets may over- or under-cover, depending on the true distribution of the observations and the model, and to extreme amounts.
The first results concerning limiting normality of a posterior distribution date back to Laplace (1820) [10]. Later, Bernstein (1917) [1] and Von Mises (1931) [14] proved results to a similar extent. Le Cam used the term 'Bernstein-Von Mises theorem' in 1953 [11] and proved its assertion in greater generality. Walker (1969) [19] and Dawid (1970) [6] gave extensions to these results and Bickel and Yahav (1969) [4] proved a limit theorem for posterior means. A version of the theorem involving only first derivatives of the log-likelihood in combination with testability and prior mass conditions (compare with Schwartz' consistency theorem, Schwartz (1965) [16]) can be found in Van der Vaart (1998) [18] which copies (and streamlines) the approach Le Cam presented in [12].
Weak convergence of the posterior distribution to the degenerate distribution at θ * under misspecification was shown by Berk (1966, 1970) [2,3], while Bunke and Milhaud (1998) [5] proved asymptotic normality of the posterior mean. These authors also discuss the situation that the point of minimum Kullback-Leibler divergence may be non-unique, which obviously complicates the asymptotic behaviour considerably. Posterior rates of convergence in misspecified non-parametric models were considered in Kleijn and Van der Vaart (2006) [9].
In the present paper we address convergence of the full posterior under mild conditions comparable to those in Van der Vaart (1998) [18]. The presentation is split into two parts. In section 2 we derive normality of the posterior given that it shrinks at a √ n-rate of posterior convergence (theorem 2.1). We actually state this result for the general situation of locally asymptotically normal (LAN) models, and next specify to the i.i.d. case. Next in section 3 we discuss results guaranteeing the desired rate of convergence, where we first show sufficiency of existence of certain tests (theorem 3.1), and next construct appropriate tests (theorem 3.3). We conclude with a lemma (applicable in parametric and nonparametric situations alike) to exclude testable model subsets, which implies a misspecified version of Schwartz' consistency theorem.
In subsection 2.1 we work in a general locally asymptotically normal set-up, but in the remainder of the paper we consider the situation of i.i.d. observations, considered previously and described precisely in section 2.2.

Asymptotic normality in LAN models
Let $\Theta$ be an open subset of $\mathbb{R}^d$ parameterising statistical models $\{P^{(n)}_\theta : \theta \in \Theta\}$. For simplicity, we assume that for each $n$ there exists a single measure that dominates all measures $P^{(n)}_\theta$ as well as a "true measure" $P^{(n)}_0$, and we assume that there exist densities $p^{(n)}_\theta$ and $p^{(n)}_0$ such that the maps $(\theta, x) \mapsto p^{(n)}_\theta(x)$ are measurable.
We consider models satisfying a stochastic local asymptotic normality (LAN) condition around a given inner point $\theta^* \in \Theta$ and relative to a given norming rate $\delta_n \to 0$: there exist random vectors $\Delta_{n,\theta^*}$ and nonsingular matrices $V_{\theta^*}$ such that the sequence $\Delta_{n,\theta^*}$ is bounded in probability, and for every compact set $K \subset \mathbb{R}^d$,
$$\sup_{h \in K} \Bigl| \log \frac{p^{(n)}_{\theta^* + \delta_n h}}{p^{(n)}_{\theta^*}}(X^{(n)}) - h^T V_{\theta^*} \Delta_{n,\theta^*} + \tfrac{1}{2} h^T V_{\theta^*} h \Bigr| \to 0, \tag{2.1}$$
in (outer) $P^{(n)}_0$-probability.
The prior measure $\Pi$ on $\Theta$ is assumed to be a probability measure with Lebesgue-density $\pi$, continuous and positive on a neighbourhood of a given point $\theta^*$. Priors satisfying these criteria assign enough mass to (sufficiently small) balls around $\theta^*$ to allow for optimal rates of convergence of the posterior if certain regularity conditions are met (see section 3).
The posterior based on an observation $X^{(n)}$ is denoted $\Pi_n(\,\cdot \mid X^{(n)})$: for every Borel set $A$,
$$\Pi_n(A \mid X^{(n)}) = \int_A p^{(n)}_\theta(X^{(n)})\, \pi(\theta)\, d\theta \Bigm/ \int_\Theta p^{(n)}_\theta(X^{(n)})\, \pi(\theta)\, d\theta.$$
To denote the random variable associated with the posterior distribution, we use the notation $\vartheta$. Note that the assertion of theorem 2.1 below involves convergence in $P_0$-probability, reflecting the sample-dependent nature of the two sequences of measures converging in total-variation norm.
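Theorem 2.1. Assume that (2.1) holds for some $\theta^* \in \Theta$ and let the prior $\Pi$ be as indicated above. Furthermore, assume that for every sequence of constants $M_n \to \infty$, we have:
$$P^{(n)}_0\, \Pi_n\bigl( \|\vartheta - \theta^*\| > \delta_n M_n \bigm| X^{(n)} \bigr) \to 0. \tag{2.3}$$
Then the sequence of posteriors converges to a sequence of normal distributions in total variation:
$$\sup_B \Bigl| \Pi_n\bigl( (\vartheta - \theta^*)/\delta_n \in B \bigm| X^{(n)} \bigr) - N_{\Delta_{n,\theta^*}, V_{\theta^*}^{-1}}(B) \Bigr| \xrightarrow{P_0} 0. \tag{2.4}$$
Proof. The proof is split into two parts: in the first part, we prove the assertion conditional on an arbitrary compact set $K \subset \mathbb{R}^d$ and in the second part we use this to prove (2.4). Throughout the proof we denote the posterior for $H = (\vartheta - \theta^*)/\delta_n$ given $X^{(n)}$ by $\Pi_n$. The posterior for $H$ follows from that for $\theta$ by $\Pi_n(H \in B \mid X^{(n)}) = \Pi_n((\vartheta - \theta^*)/\delta_n \in B \mid X^{(n)})$ for all Borel sets $B$. Furthermore, we denote the normal distribution $N_{\Delta_{n,\theta^*}, V_{\theta^*}^{-1}}$ by $\Phi_n$. For a compact subset $K \subset \mathbb{R}^d$ such that $\Pi_n(H \in K \mid X^{(n)}) > 0$, we define a conditional version $\Pi^K_n$ of $\Pi_n$ by $\Pi^K_n(B \mid X^{(n)}) = \Pi_n(B \cap K \mid X^{(n)}) / \Pi_n(K \mid X^{(n)})$. Similarly we define a conditional measure $\Phi^K_n$ corresponding to $\Phi_n$.
Let $K \subset \mathbb{R}^d$ be a compact subset of $\mathbb{R}^d$. For every open neighbourhood $U \subset \Theta$ of $\theta^*$, $\theta^* + K\delta_n \subset U$ for large enough $n$. Since $\theta^*$ is an internal point of $\Theta$, for large enough $n$ the random functions $f_n : K \times K \to \mathbb{R}$,
$$f_n(g, h) = \Bigl( 1 - \frac{\phi_n(h)\, s_n(g)\, \pi_n(g)}{\phi_n(g)\, s_n(h)\, \pi_n(h)} \Bigr)_+,$$
are well-defined, with $\phi_n : K \to \mathbb{R}$ the Lebesgue density of the (randomly located) distribution $N_{\Delta_{n,\theta^*}, V_{\theta^*}^{-1}}$, $\pi_n : K \to \mathbb{R}$ the Lebesgue density of the prior for the centred and rescaled parameter $H$ and $s_n : K \to \mathbb{R}$ the likelihood quotient $s_n(h) = \bigl(p^{(n)}_{\theta^* + h\delta_n} / p^{(n)}_{\theta^*}\bigr)(X^{(n)})$. By the LAN assumption, we have for every random sequence $(h_n) \subset K$, $\log s_n(h_n) = h_n^T V_{\theta^*} \Delta_{n,\theta^*} - \tfrac{1}{2} h_n^T V_{\theta^*} h_n + o_{P_0}(1)$. For any two sequences $(h_n)$, $(g_n)$ in $K$, $\pi_n(g_n)/\pi_n(h_n) \to 1$ as $n \to \infty$, leading to,
$$\log \frac{\phi_n(h_n)\, s_n(g_n)\, \pi_n(g_n)}{\phi_n(g_n)\, s_n(h_n)\, \pi_n(h_n)} = (g_n - h_n)^T V_{\theta^*} \Delta_{n,\theta^*} + \tfrac{1}{2} h_n^T V_{\theta^*} h_n - \tfrac{1}{2} g_n^T V_{\theta^*} g_n - \tfrac{1}{2} (h_n - \Delta_{n,\theta^*})^T V_{\theta^*} (h_n - \Delta_{n,\theta^*}) + \tfrac{1}{2} (g_n - \Delta_{n,\theta^*})^T V_{\theta^*} (g_n - \Delta_{n,\theta^*}) + o_{P_0}(1) = o_{P_0}(1),$$
as $n \to \infty$. Since all functions $f_n$ depend continuously on $(g, h)$ and $K \times K$ is compact, we conclude that,
$$\sup_{g, h \in K} f_n(g, h) \xrightarrow{P_0} 0, \tag{2.5}$$
where the convergence is in outer probability. Assume that $K$ contains a neighbourhood of 0 (to guarantee that $\Phi_n(K) > 0$) and let $\Xi_n$ denote the event that $\Pi_n(K) > 0$.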
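Let $\eta > 0$ be given and based on that, define the events:
$$\Omega_n = \Bigl\{ \sup_{g, h \in K} f_n(g, h) \le \eta \Bigr\}_*,$$
where the $*$ denotes the inner measurable cover set, in case the set on the right is not measurable. Consider the inequality (recall that the total-variation norm $\|\cdot\|$ is bounded by 2):
$$P^{(n)}_0 \|\Pi^K_n - \Phi^K_n\| 1_{\Xi_n} \le P^{(n)}_0 \|\Pi^K_n - \Phi^K_n\| 1_{\Omega_n \cap \Xi_n} + 2 P^{(n)}_0(\Xi_n \setminus \Omega_n). \tag{2.6}$$
As a result of (2.5) the second term is $o(1)$ as $n \to \infty$. The first term on the r.h.s. is calculated as follows:
$$\tfrac{1}{2} P^{(n)}_0 \|\Pi^K_n - \Phi^K_n\| 1_{\Omega_n \cap \Xi_n} = P^{(n)}_0 \int \Bigl( 1 - \frac{d\Phi^K_n}{d\Pi^K_n} \Bigr)_+ d\Pi^K_n\, 1_{\Omega_n \cap \Xi_n} = P^{(n)}_0 \int_K \Bigl( 1 - \int_K \frac{s_n(g)\, \pi_n(g)\, \phi^K_n(h)}{s_n(h)\, \pi_n(h)\, \phi^K_n(g)}\, d\Phi^K_n(g) \Bigr)_+ d\Pi^K_n(h)\, 1_{\Omega_n \cap \Xi_n}.$$
Note that for all $g, h \in K$, $\phi^K_n(h)/\phi^K_n(g) = \phi_n(h)/\phi_n(g)$, since on $K$ $\phi^K_n$ differs from $\phi_n$ only by a normalisation factor. We use Jensen's inequality (with respect to the $\Phi^K_n$-expectation) for the (convex) function $x \mapsto (1 - x)_+$ to derive:
$$\tfrac{1}{2} P^{(n)}_0 \|\Pi^K_n - \Phi^K_n\| 1_{\Omega_n \cap \Xi_n} \le P^{(n)}_0 \int_K \int_K f_n(g, h)\, d\Phi^K_n(g)\, d\Pi^K_n(h)\, 1_{\Omega_n \cap \Xi_n} \le \eta.$$
Combination with (2.6) shows that for all compact $K \subset \mathbb{R}^d$ containing a neighbourhood of 0, $P^{(n)}_0 \|\Pi^K_n - \Phi^K_n\| 1_{\Xi_n} \to 0$.
Now, let $(K_m)$ be a sequence of balls centred at 0 with radii $M_m \to \infty$. For each $m \ge 1$, the above display holds, so if we choose a sequence of balls $(K_n)$ that traverses the sequence $K_m$ slowly enough, convergence to zero can still be guaranteed. Moreover, the corresponding events $\Xi_n = \{\Pi_n(K_n) > 0\}$ satisfy $P^{(n)}_0(\Xi_n) \to 1$ as a result of (2.3). We conclude that there exists a sequence of radii $(M_n)$ such that $M_n \to \infty$ and $P^{(n)}_0 \|\Pi^{K_n}_n - \Phi^{K_n}_n\| \to 0$ (where it is understood that the conditional probabilities on the l.h.s. are well-defined on sets of probability growing to one). The total variation distance between a measure and its conditional version given a set $K$ satisfies $\|\Pi - \Pi^K\| \le 2\Pi(K^c)$. Combining this with (2.3) and lemma 5.2, we conclude that $P^{(n)}_0 \|\Pi_n - \Phi_n\| \to 0$, which implies (2.4).
Condition (2.3) fixes the rate of convergence of the posterior distribution to be that occurring in the LAN property. Sufficient conditions to satisfy (2.3) in the case of i.i.d. observations are given in section 3.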

Asymptotic normality in the i.i.d. case
Consider the situation that the observation is a vector $X^{(n)} = (X_1, \ldots, X_n)$ and the model consists of $n$-fold product measures $P^{(n)}_\theta = P^n_\theta$, where the components $P_\theta$ are given by densities $p_\theta$ such that the maps $(\theta, x) \mapsto p_\theta(x)$ are measurable and $\theta \mapsto p_\theta$ is smooth (in the sense of lemma 2.1). Assume that the observations form an i.i.d. sample from a distribution $P_0$ with density $p_0$ relative to a common dominating measure. Assume that the Kullback-Leibler divergence of the model relative to $P_0$ is finite and minimized at $\theta^* \in \Theta$, i.e.:
$$-P_0 \log \frac{p_{\theta^*}}{p_0} = \inf_{\theta \in \Theta} -P_0 \log \frac{p_\theta}{p_0} < \infty.$$
In this situation we set $\delta_n = n^{-1/2}$ and use $\Delta_{n,\theta^*} = V_{\theta^*}^{-1} \mathbb{G}_n \dot\ell_{\theta^*}$ as the centering sequence (where $\dot\ell_{\theta^*}$ denotes the score function of the model $\theta \mapsto p_\theta$ at $\theta^*$ and $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P_0)$ is the empirical process).
Lemmas that establish the LAN expansion (2.1) (for an overview, see, for instance Van der Vaart (1998) [18]) usually assume a well-specified model, whereas current interest requires local asymptotic normality in misspecified situations. To that end we consider the following lemma which gives sufficient conditions.
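Lemma 2.1. If the function $\theta \mapsto \log p_\theta(X_1)$ is differentiable at $\theta^*$ in $P_0$-probability with derivative $\dot\ell_{\theta^*}(X_1)$ and:
(i) there is an open neighbourhood $U$ of $\theta^*$ and a square-integrable function $m_{\theta^*}$ such that for all $\theta_1, \theta_2 \in U$:
$$\Bigl| \log \frac{p_{\theta_1}}{p_{\theta_2}} \Bigr| \le m_{\theta^*} \|\theta_1 - \theta_2\|, \quad (P_0\text{-a.s.}), \tag{2.8}$$
(ii) the Kullback-Leibler divergence with respect to $P_0$ has a 2nd-order Taylor expansion around $\theta^*$:
$$-P_0 \log \frac{p_\theta}{p_{\theta^*}} = \tfrac{1}{2} (\theta - \theta^*)^T V_{\theta^*} (\theta - \theta^*) + o(\|\theta - \theta^*\|^2), \quad (\theta \to \theta^*), \tag{2.9}$$
then (2.1) holds with $\delta_n = n^{-1/2}$ and $\Delta_{n,\theta^*} = V_{\theta^*}^{-1} \mathbb{G}_n \dot\ell_{\theta^*}$. Furthermore, the score function is bounded as follows:
$$\|\dot\ell_{\theta^*}(X)\| \le m_{\theta^*}(X), \quad (P_0\text{-a.s.}). \tag{2.10}$$
Finally, we have:
$$P_0 \dot\ell_{\theta^*} = 0.$$
Proof. Using lemma 19.31 in Van der Vaart (1998) [18] for $\ell_\theta(X) = \log p_\theta(X)$, the conditions of which are satisfied by assumption, we see that for any sequence $(h_n)$ that is bounded in $P_0$-probability:
$$\mathbb{G}_n \Bigl( \sqrt{n} \bigl( \ell_{\theta^* + h_n/\sqrt{n}} - \ell_{\theta^*} \bigr) - h_n^T \dot\ell_{\theta^*} \Bigr) \xrightarrow{P_0} 0.$$
Hence, we see that,
$$n \mathbb{P}_n \log \frac{p_{\theta^* + h_n/\sqrt{n}}}{p_{\theta^*}} - \mathbb{G}_n h_n^T \dot\ell_{\theta^*} - n P_0 \log \frac{p_{\theta^* + h_n/\sqrt{n}}}{p_{\theta^*}} = o_{P_0}(1).$$
Using the second-order Taylor expansion (2.9):
$$P_0 \log \frac{p_{\theta^* + h_n/\sqrt{n}}}{p_{\theta^*}} = -\frac{1}{2n} h_n^T V_{\theta^*} h_n + o_{P_0}(n^{-1}),$$
and substituting the log-likelihood product for the first term, we find (2.1). The proof of the remaining assertions is standard.
Regarding the centering sequence $\Delta_{n,\theta^*}$ and its relation to the maximum-likelihood estimator, we note the following lemma concerning the limit distribution of maximum-likelihood sequences.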
Lemma 2.2. Assume that the model satisfies the conditions of lemma 2.1 with non-singular $V_{\theta^*}$. Then a sequence of estimators $\hat\theta_n$ such that $\hat\theta_n \xrightarrow{P_0} \theta^*$ and,
$$\mathbb{P}_n \log p_{\hat\theta_n} \ge \sup_{\theta} \mathbb{P}_n \log p_\theta - o_{P_0}(n^{-1}),$$
satisfies the asymptotic expansion:
$$\sqrt{n}(\hat\theta_n - \theta^*) = \frac{1}{\sqrt{n}} \sum_{i=1}^n V_{\theta^*}^{-1} \dot\ell_{\theta^*}(X_i) + o_{P_0}(1). \tag{2.13}$$
Proof. The proof of this lemma is a more specific version of the proof found in Van der Vaart (1998) [18] on page 54.
Lemma 2.2 implies that for consistent maximum-likelihood estimators (sufficient conditions for consistency are given, for instance, in theorem 5.7 of Van der Vaart (1998) [18]) the distribution of $\sqrt{n}(\hat\theta_n - \theta^*)$ has a normal limit with mean zero and covariance $V_{\theta^*}^{-1} P_0[\dot\ell_{\theta^*} \dot\ell_{\theta^*}^T] V_{\theta^*}^{-1}$. More important for present purposes, however, is the fact that according to (2.13), this sequence differs from $\Delta_{n,\theta^*}$ only by a term of order $o_{P_0}(1)$. Since the total-variational distance $\|N_{\mu,\Sigma} - N_{\nu,\Sigma}\|$ is bounded by a multiple of $\|\mu - \nu\|$ as $\mu \to \nu$, the assertion of the Bernstein-Von Mises theorem can also be formulated with the sequence $\sqrt{n}(\hat\theta_n - \theta^*)$ as the locations for the normal limit sequence. Using the invariance of total-variation under rescaling and shifts, this leads to the conclusion that:
$$\sup_B \Bigl| \Pi_n(\vartheta \in B \mid X_1, \ldots, X_n) - N_{\hat\theta_n,\, n^{-1} V_{\theta^*}^{-1}}(B) \Bigr| \xrightarrow{P_0} 0,$$
which demonstrates the usual interpretation of the Bernstein-Von Mises theorem most clearly: the sequence of posteriors resembles more-and-more closely a sequence of 'sharpening' normal distributions centred at the maximum-likelihood estimators. More generally, any sequence of estimators satisfying (2.13) (i.e. any best-regular estimator sequence) may be used to centre the normal limit sequence on.
The conditions for lemma 2.2, which derive directly from a fairly general set of conditions for asymptotic normality in parametric M -estimation (see, theorem 5.23 in Van der Vaart (1998) [18]), are close to the conditions of the above Bernstein-Von Mises theorem. In the well-specified situation the Lipschitz condition (2.8) can be relaxed slightly and replaced by the condition of differentiability in quadratic mean.
It was noted in the introduction that, because the asymptotic covariance matrix $V_{\theta^*}^{-1} P_0[\dot\ell_{\theta^*} \dot\ell_{\theta^*}^T] V_{\theta^*}^{-1}$ of the maximum-likelihood estimator differs from the limiting covariance matrix $V_{\theta^*}^{-1}$ in the Bernstein-Von Mises theorem, Bayesian credible sets are not confidence sets at the nominal level. The following example shows that both over- and under-covering may occur.
Example 2.1. Let $P_\theta$ be the normal distribution with mean $\theta$ and variance 1, and let the true distribution possess mean zero and variance $\sigma^2 > 0$. Then $\theta^* = 0$, $P_0 \dot\ell_{\theta^*}^2 = \sigma^2$ and $V_{\theta^*} = 1$. It follows that the radius of the $(1 - \alpha)$-Bayesian credible set is $z_\alpha / \sqrt{n}$, whereas a $(1 - \alpha)$-confidence set around the mean has radius $z_\alpha \sigma / \sqrt{n}$. The credible set over-covers if $\sigma^2 < 1$ and under-covers if $\sigma^2 > 1$, with coverage arbitrarily close to 1 as $\sigma^2 \to 0$ and arbitrarily close to 0 as $\sigma^2 \to \infty$.
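The coverage in Example 2.1 can be made explicit with a short numerical sketch. It assumes (our reading, not stated in the example) that $z_\alpha$ is the two-sided standard normal quantile and that the credible interval is the sample mean plus or minus $z_\alpha/\sqrt{n}$; since $\sqrt{n}$ times the sample mean is asymptotically $N(0, \sigma^2)$, the asymptotic coverage is then $2\Phi(z_\alpha/\sigma) - 1$. The function name is hypothetical.

```python
from statistics import NormalDist


def credible_interval_coverage(sigma: float, alpha: float = 0.05) -> float:
    """Asymptotic frequentist coverage of the (1 - alpha) credible interval
    in Example 2.1: model N(theta, 1), true variance sigma**2.

    Assumption: the interval is (sample mean) +/- z / sqrt(n), with z the
    two-sided standard normal quantile.
    """
    std = NormalDist()
    z = std.inv_cdf(1.0 - alpha / 2.0)
    # sqrt(n) * (sample mean - theta*) is approximately N(0, sigma^2), so the
    # interval covers theta* = 0 with probability P(|Z| <= z / sigma).
    return 2.0 * std.cdf(z / sigma) - 1.0


if __name__ == "__main__":
    for sigma in (0.25, 1.0, 4.0):
        cov = credible_interval_coverage(sigma)
        print(f"sigma = {sigma}: coverage = {cov:.3f}")
```

Only for $\sigma = 1$ (the well-specified variance) does the coverage equal the nominal level; smaller $\sigma$ gives over-coverage and larger $\sigma$ gives under-coverage.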

Asymptotic normality of point-estimators
Having discussed the posterior distributional limit, a natural question concerns the asymptotic properties of point-estimators derived from the posterior, like the posterior mean and median.
Based on the Bernstein-Von Mises assertion (2.4) alone, one sees that any functional $f : \mathcal{P} \to \mathbb{R}$, continuous relative to the total-variational norm, when applied to the sequence of posterior laws, converges to $f$ applied to the normal limit distribution. Another general consideration follows from a generic construction of point-estimates from posteriors and demonstrates that posterior consistency at rate $\delta_n$ implies frequentist consistency at rate $\delta_n$.
Theorem 2.2. Suppose that the posterior satisfies (2.3) for every sequence $M_n \to \infty$. Then there exist point-estimators $\hat\theta_n$ such that $\delta_n^{-1}(\hat\theta_n - \theta^*)$ is bounded in $P_0$-probability.
Proof. Define $\hat\theta_n$ to be the center of a smallest ball that contains posterior mass at least 1/2. Because the ball around $\theta^*$ of radius $\delta_n M_n$ contains posterior mass tending to 1, the radius of a smallest ball must be bounded by $\delta_n M_n$ and the smallest ball must intersect the ball of radius $\delta_n M_n$ around $\theta^*$ with probability tending to 1. This shows that $\|\hat\theta_n - \theta^*\| \le 2\delta_n M_n$ with probability tending to one.
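The estimator used in this proof can be sketched for a one-dimensional parameter, with the posterior represented by a finite list of draws. This is an illustration under our own assumptions (empirical posterior, $d = 1$, hypothetical function name), not a construction taken from the paper: a "ball" is an interval, and a shortest interval holding at least half of the draws can be found by sliding a window over the sorted draws.

```python
import math


def smallest_half_mass_center(draws, mass=0.5):
    """Center of a shortest interval containing at least `mass` of the draws.

    Assumption: one-dimensional parameter, posterior approximated by `draws`.
    """
    if not draws:
        raise ValueError("need at least one posterior draw")
    xs = sorted(draws)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws the interval must hold
    # Slide a window over k consecutive order statistics; keep the shortest.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return 0.5 * (xs[best] + xs[best + k - 1])


if __name__ == "__main__":
    # Most of the mass sits near 1; two far-away draws do not move the center.
    print(smallest_half_mass_center([0.0, 1.0, 2.0, 50.0, 100.0]))
```

Unlike the posterior mean, this center is insensitive to a small fraction of posterior mass far from $\theta^*$, which is why the argument needs no moment condition on the prior.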
Consequently, frequentist restrictions and notions of asymptotic optimality have implications for the posterior too: in particular, frequentist bounds on the rate of convergence for a given problem apply to the posterior rate as well.
However, these general points are more appropriate in non-parametric context and the above existence theorem does not pertain to the most widely-used Bayesian point-estimators. Asymptotic normality of the posterior mean in a misspecified model has been analysed in Bunke and Milhaud (1998) [5]. We generalize their discussion and prove asymptotic normality and efficiency for a class of point-estimators defined by a general loss function, which includes the posterior mean and median.
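Let $\ell : \mathbb{R}^k \to [0, \infty)$ be a loss-function with the following properties: $\ell$ is continuous and satisfies, for every $M > 0$,
$$\sup_{\|h\| \le M} \ell(h) \le \inf_{\|h\| \ge 2M} \ell(h),$$
with strict inequality for some $M$. Furthermore, we assume that $\ell$ is subpolynomial, i.e. for some $p > 0$,
$$\ell(h) \le 1 + \|h\|^p.$$
Define the estimators $\hat\theta_n$ as the (near-)minimizers of
$$t \mapsto \int \ell\bigl(\sqrt{n}(t - \theta)\bigr)\, d\Pi_n(\theta \mid X_1, \ldots, X_n).$$
The theorem below is the misspecified analog of theorem 10.8 in Van der Vaart (1998) [18] and is based on general methods from M-estimation, in particular the argmax theorem (see, for example, corollary 5.58 in [18]).
Theorem 2.3. Assume that the model satisfies (2.1) for some $\theta^* \in \Theta$ and that the conditions of theorem 3.1 are satisfied. Let $\ell : \mathbb{R}^k \to [0, \infty)$ be a loss-function with the properties listed and assume that $\int \|\theta\|^p\, d\Pi(\theta) < \infty$. Then under $P_0$, the sequence $\sqrt{n}(\hat\theta_n - \theta^*)$ converges weakly to the minimizer of,
$$t \mapsto Z(t) = \int \ell(t - h)\, dN_{X, V_{\theta^*}^{-1}}(h),$$
where $X \sim N(0, V_{\theta^*}^{-1} P_0[\dot\ell_{\theta^*} \dot\ell_{\theta^*}^T] V_{\theta^*}^{-1})$, provided that any two minimizers of this process coincide almost-surely. In particular, if the loss function is subconvex (e.g. $\ell(x) = \|x\|^2$ or $\ell(x) = \|x\|$, giving the posterior mean and median), then $\sqrt{n}(\hat\theta_n - \theta^*)$ converges weakly to $X$ under $P_0$.
Proof. The theorem can be proved along the same lines as theorem 10.8 in [18]. The main difference is in proving that, for any $M_n \to \infty$,
$$\int_{\|h\| > M_n} \|h\|^p\, d\Pi_n(h \mid X_1, \ldots, X_n) \xrightarrow{P_0} 0. \tag{2.16}$$
Here, abusing notation, we write $d\Pi_n(h \mid X_1, \ldots, X_n)$ to denote integrals relative to the posterior distribution of the local parameter $h = \sqrt{n}(\theta - \theta^*)$. Under misspecification a new proof is required, for which we extend the proof of theorem 3.1 below.
Once (2.16) is established, the proof continues as follows. The variable $\hat h_n = \sqrt{n}(\hat\theta_n - \theta^*)$ is the (near-)minimizer of the process $t \mapsto Z_n(t) = \int \ell(t - h)\, d\Pi_n(h \mid X_1, \ldots, X_n)$. Reasoning exactly as in the proof of theorem 10.8, we see that $\hat h_n = O_{P_0}(1)$. Fix some compact set $K$ and for given $M > 0$ define the processes
$$Z_{n,M}(t) = \int_{\|h\| \le M} \ell(t - h)\, d\Pi_n(h \mid X_1, \ldots, X_n), \quad W_{n,M}(t) = \int_{\|h\| \le M} \ell(t - h)\, dN_{\Delta_{n,\theta^*}, V_{\theta^*}^{-1}}(h), \quad W_M(t) = \int_{\|h\| \le M} \ell(t - h)\, dN_{X, V_{\theta^*}^{-1}}(h).$$
By theorem 2.1, $Z_{n,M} - W_{n,M} = o_{P_0}(1)$ in $\ell^\infty(K)$, and since $\Delta_{n,\theta^*}$ converges weakly to $X$, the continuous mapping theorem gives $W_{n,M} \rightsquigarrow W_M$ in $\ell^\infty(K)$, for every $M > 0$, and $W_M \xrightarrow{P_0} Z$ as $M \to \infty$. We conclude that there exists a sequence $M_n \to \infty$ such that $Z_{n,M_n} \rightsquigarrow Z$ under $P_0$. The limit (2.16) implies that $Z_{n,M_n} - Z_n = o_{P_0}(1)$ in $\ell^\infty(K)$ and we conclude that $Z_n \rightsquigarrow Z$ in $\ell^\infty(K)$ under $P_0$. Due to the continuity of $\ell$, $t \mapsto Z(t)$ is continuous almost surely. This, together with the assumed uniqueness of minima of these sample paths, enables the argmax theorem (see corollary 5.58 in [18]) and we conclude that $\hat h_n \rightsquigarrow \hat h$ under $P_0$, where $\hat h$ is the minimizer of $Z(t)$.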

For the proof of (2.16) we adopt the notation of theorem 3.1. The tests ω n employed there can be taken nonrandomized without loss of generality (otherwise replace them for instance by 1 ωn>1/2 ) and then U n ω n tends to zero in probability by the only fact that ω n does so. Thus (2.16) is proved once it is established that, with ǫ n = M n / √ n, We can use bounds as in the proof of theorem 3.1, but instead of at (3.5) and Π B(a n , θ * ; P 0 ) n p/2 θ − θ * p dΠ(θ), These expressions tend to zero as before.
The last assertion of the theorem follows, because for a subconvex loss function the process Z is minimized uniquely by X, as a consequence of Anderson's lemma (see, for example, lemma 8.5 in [18]).

Rate of convergence
In a Bayesian context, the rate of convergence is defined as the maximal pace at which balls around the point of convergence can be shrunk to radius zero while still capturing a posterior mass that converges to one asymptotically. Current interest lies in the fact that the Bernstein-Von Mises theorem of the previous section formulates condition (2.3) (with $\delta_n = n^{-1/2}$), for all $M_n \to \infty$. A convenient way of establishing the above is through the condition that suitable test sequences exist. As has been shown in a well-specified context in Ghosal et al. (2000) [7] and under misspecification in Kleijn and Van der Vaart (2006) [9], the most important requirement for convergence of the posterior at a certain rate is the existence of a test-sequence that separates the point of convergence from the complements of balls shrinking at said rate. This is also the approach we follow here: we show that the sequence of posterior probabilities in (2.3) converges to zero in $P_0$-probability if a test sequence exists that is suitable in the sense given above (see the proof of theorem 3.1). However, under the regularity conditions that were formulated to establish local asymptotic normality under misspecification in the previous section, more can be said: not complements of shrinking balls, but fixed alternatives are to be suitably testable against $P_0$, thus relaxing the testing condition considerably. Locally, the construction relies on score-tests to separate the point of convergence from complements of neighbourhoods shrinking at rate $1/\sqrt{n}$, using Bernstein's inequality to obtain exponential power. The tests for fixed alternatives are used to extend those local tests to the full model.
In this section we prove that a prior mass condition and suitable test sequences suffice to prove convergence at the rate required for the Bernstein-Von Mises theorem formulated in section 2. The theorem that begins the next subsection summarizes the conclusion. Throughout the section we consider the i.i.d. case, with notation as in section 2.2.

Posterior rate of convergence
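With use of theorem 3.3, we formulate a theorem that ensures $\sqrt{n}$-rate of convergence for the posterior distributions of smooth, testable models with sufficient prior mass around the point of convergence. The testability condition is formulated using measures $Q_\theta$, defined by,
$$Q_\theta(A) = P_0\Bigl( \frac{p_\theta}{p_{\theta^*}} 1_A \Bigr),$$
for all $A \in \mathcal{A}$ and all $\theta \in \Theta$. Note that all $Q_\theta$ are dominated by $P_0$ and that $Q_{\theta^*} = P_0$. Also note that if the model is well-specified, then $P_{\theta^*} = P_0$ and $Q_\theta = P_\theta$ for all $\theta$. Therefore the use of $Q_\theta$ instead of $P_\theta$ to formulate the testing condition is relevant only in the misspecified situation (see Kleijn and Van der Vaart (2006) [9] for more on this subject). The proof of theorem 3.1 makes use of Kullback-Leibler neighbourhoods of $\theta^*$ of the form:
$$B(\epsilon, \theta^*; P_0) = \Bigl\{ \theta \in \Theta : -P_0 \log \frac{p_\theta}{p_{\theta^*}} \le \epsilon^2,\ P_0 \Bigl( \log \frac{p_\theta}{p_{\theta^*}} \Bigr)^2 \le \epsilon^2 \Bigr\},$$
for some $\epsilon > 0$.
Theorem 3.1. Assume that the model satisfies the smoothness conditions of lemma 2.1, where in addition, it is required that $P_0(p_\theta/p_{\theta^*}) < \infty$ for all $\theta$ in a neighbourhood of $\theta^*$ and $P_0(e^{s m_{\theta^*}}) < \infty$ for some $s > 0$. Assume that the prior possesses a density that is continuous and positive in a neighbourhood of $\theta^*$. Furthermore, assume that $P_0 \dot\ell_{\theta^*} \dot\ell_{\theta^*}^T$ is invertible and that for every $\epsilon > 0$ there exists a sequence of tests $(\phi_n)$ such that:
$$P^n_0 \phi_n \to 0, \qquad \sup_{\{\theta : \|\theta - \theta^*\| \ge \epsilon\}} Q^n_\theta(1 - \phi_n) \to 0. \tag{3.2}$$
Then the posterior converges at rate $1/\sqrt{n}$, i.e. for every sequence $(M_n)$, $M_n \to \infty$:
$$\Pi_n\bigl( \theta \in \Theta : \|\theta - \theta^*\| \ge M_n/\sqrt{n} \bigm| X_1, X_2, \ldots, X_n \bigr) \xrightarrow{P_0} 0.$$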
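To see why the measures $Q_\theta$ differ from $P_\theta$ under misspecification, the following sketch computes the total mass $Q_\theta(\mathcal{X}) = P_0(p_\theta/p_{\theta^*})$ in the setting of Example 2.1 (model $N(\theta, 1)$, $P_0 = N(0, \sigma^2)$, $\theta^* = 0$). It assumes the definition $Q_\theta(A) = P_0\bigl((p_\theta/p_{\theta^*}) 1_A\bigr)$; the function names and the trapezoidal integration are our own illustrative choices. In this example $p_\theta/p_{\theta^*}(x) = \exp(\theta x - \theta^2/2)$, so the total mass has the closed form $\exp\bigl(\theta^2(\sigma^2 - 1)/2\bigr)$.

```python
import math


def q_theta_total_mass(theta, sigma, points=20001, span=12.0):
    """Trapezoidal approximation of P_0(p_theta / p_theta*) for the model
    N(theta, 1), P_0 = N(0, sigma^2) and theta* = 0 (assumed setting)."""
    center = theta * sigma ** 2  # the integrand is a Gaussian bump centred here
    lo = center - span * sigma
    step = 2.0 * span * sigma / (points - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(points):
        x = lo + i * step
        ratio = math.exp(theta * x - 0.5 * theta ** 2)  # p_theta / p_theta*
        p0 = math.exp(-0.5 * (x / sigma) ** 2) / norm   # density of P_0
        weight = 0.5 if i in (0, points - 1) else 1.0
        total += weight * ratio * p0 * step
    return total


def q_theta_total_mass_closed_form(theta, sigma):
    """Closed form exp(theta^2 (sigma^2 - 1) / 2) for the same setting."""
    return math.exp(0.5 * theta ** 2 * (sigma ** 2 - 1.0))


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, q_theta_total_mass(1.0, sigma),
              q_theta_total_mass_closed_form(1.0, sigma))
```

For $\sigma = 1$ the model is well-specified and $Q_\theta$ is a probability measure; for $\sigma \ne 1$ its total mass is different from one, which is why the testing condition is phrased in terms of $Q_\theta$ rather than $P_\theta$.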

Proof. Let (M n ) be given, and define the sequence (ǫ n ) by ǫ n = M n / √ n. According to theorem 3.3 there exists a sequence of tests (ω n ) and constants D > 0 and ǫ > 0 such that (3.7) holds. We use these tests to split the P n 0 -expectation of the posterior measure as follows: The first term is of order o(1) as n → ∞ by (3.7). Given a constant ǫ > 0 (to be specified later), the second term can be decomposed as: Given two constants M, M ′ > 0 (also to be specified at a later stage), we define the sequences (a n ), a n = M log n/n and (b n ), b n = M ′ ǫ n . Based on a n and b n , we define two sequences of events: , The sequence (Ξ n ) is used to split the first term on the r.h.s. of (3.3) and estimate it as follows: . . , X n ≤ P 0 (Ξ n ) + P n 0 (1 − ω n ) 1 Ω\Ξn Π θ : θ − θ * ≥ ǫ X 1 , X 2 , . . . , X n .
According to lemma 3.1, the first term is of order o(1) as n → ∞. The second term is estimated further with the use of lemmas 3.1, 3.2 and theorem 3.3: for some C > 0, Π B(a n , θ * ; P 0 ) Π θ : θ − θ * ≥ ǫ . Π B(a n , θ * ; P 0 ) for large enough n, using (3.6). A large enough choice for the constant M then ensures that the expression on the l.h.s. in the next-to-last display is of order o(1) as n → ∞.
The sequence (Ω n ) is used to split the second term on the r.h.s. of (3.3) after which we estimate it in a similar manner. Again the term that derives from 1 Ωn is of order o(1), and where we have split the domain of integration into spherical shells A n,j , (1 ≤ j ≤ J, with J the smallest integer such that (J + 1)ǫ n > ǫ): A n,j = θ : jǫ n ≤ θ−θ * ≤ (j +1)ǫ n ∧ǫ . Applying theorem 3.3 to each of the shells separately, we obtain: For a small enough ǫ and large enough n, the sets θ : θ − θ * ≤ (j + 1)ǫ n all fall within the neighbourhood U of θ * on which the prior density π is continuous. Hence π is uniformly bounded by a constant R > 0 and we see that: Π{ θ : θ − θ * ≤ (j + 1)ǫ n } ≤ RV d (j + 1) d ǫ d n , where V d is the Lebesgue-volume of the d-dimensional ball of unit radius. Combining this with (3.6), there exists a constant K ′ > 0 such that, with M ′ < D/2(1 + C): for large enough n. The series is convergent and we conclude that this term is also of order o(1) as n → ∞.
Consistent testability of the type (3.2) appears to be a weak requirement because the form of the tests is arbitrary. (It may be compared to "classical conditions" like (B3) in section 6.7, page 455, of [13], formulated in the well-specified case.) Consistent testability is of course one of Schwartz' conditions for consistency ([16]) and appears to have been introduced in the context of the (well-specified) Bernstein-Von Mises theorem by Le Cam. To exemplify its power we show in the next theorem that the tests exist as soon as the parameter set is compact and the model is suitably continuous in the parameter.
Theorem 3.2. Assume that Θ is compact and that θ * is a unique point of minimum of θ → −P 0 log p θ . Furthermore assume that P 0 (p θ /p θ * ) < ∞ for all θ ∈ Θ and that the map, is continuous at θ 1 for every s in a left neighbourhood of 1, for every θ 1 . Then there exist tests φ n satisfying (3.2). A sufficient condition is that for every θ 1 ∈ Θ the maps θ → p θ /p θ1 and θ → p θ /p θ * are continuous in L 1 (P 0 ) at θ = θ 1 .
Beyond compactness it appears impossible to give merely qualitative sufficient conditions, like continuity, for consistent testability. For "natural" parameterizations it ought to be true that distant parameters (outside a given compact) are the easy ones to test for (and a test designed for a given compact ought to be consistent even for points outside the compact), but this depends on the structure of the model. Alternatively, many models would allow a suitable compactification to which the preceding result can be applied, but we omit a discussion. The results in the next section make it possible to discard a "distant" part of the parameter space, after which the preceding results apply.
In the proof of theorem 3.1, lower bounds in probability on the denominators of posterior probabilities are needed, as provided by the following lemma.
Moreover, the prior mass of the Kullback-Leibler neighbourhoods B(ǫ, θ * ; P 0 ) can be lower-bounded if we make the regularity assumptions for the model used in section 2 and the assumption that the prior has a Lebesgue density that is well-behaved at θ * .

Lemma 3.2. Under the smoothness conditions of lemma 2.1 and assuming that the prior density $\pi$ is continuous and strictly positive at $\theta^*$, there exists a constant $K > 0$ such that the prior mass of the Kullback-Leibler neighbourhoods $B(\epsilon, \theta^*; P_0)$ satisfies:
$$\Pi\bigl( B(\epsilon, \theta^*; P_0) \bigr) \ge K \epsilon^d,$$
for small enough $\epsilon > 0$.

Suitable test sequences
In this subsection we prove that the existence of test sequences (under misspecification) of uniform exponential power for complements of shrinking balls around θ * versus P 0 (as needed in the proof of theorem 3.1), is guaranteed whenever asymptotically consistent test-sequences exist for complements of fixed balls around θ * versus P 0 and the conditions of lemmas 2.1 and 3.4 are met. The following theorem is inspired by lemma 10.3 in Van der Vaart (1998) [18]. Theorem 3.3. Assume that the conditions of lemma 2.1 are satisfied, where in addition, it is required that P 0 (p θ /p θ * ) < ∞ for all θ in a neighbourhood of θ * and P 0 (e sm θ * ) < ∞ for some s > 0. Furthermore, suppose that P 0lθ * l T θ * is invertible and for every ǫ > 0 there exists a sequence of test functions (φ n ), such that: Then for every sequence (M n ) such that M n → ∞ there exists a sequence of tests (ω n ) such that for some constants D > 0, ǫ > 0 and large enough n:
For the construction of the first sequence, a constant $L > 0$ is chosen to truncate the score function component-wise (i.e. for all $1 \le k \le d$, $(\dot\ell^L_{\theta^*})_k = 0$ if $|(\dot\ell_{\theta^*})_k| \ge L$ and $(\dot\ell^L_{\theta^*})_k = (\dot\ell_{\theta^*})_k$ otherwise) and we define $\omega_{1,n} = 1\bigl\{\|(\mathbb{P}_n - P_0)\dot\ell^L_{\theta^*}\| > \sqrt{M_n/n}\bigr\}$. Because the function $\dot\ell_{\theta^*}$ is square-integrable, we can ensure that the matrices $P_0(\dot\ell_{\theta^*}\dot\ell_{\theta^*}^T)$, $P_0(\dot\ell_{\theta^*}(\dot\ell^L_{\theta^*})^T)$ and $P_0(\dot\ell^L_{\theta^*}(\dot\ell^L_{\theta^*})^T)$ are arbitrarily close (for instance in operator norm) by a sufficiently large choice for the constant $L$. We fix such an $L$ throughout the proof.
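The component-wise truncation and the resulting test indicator can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the list-based representation of score vectors, and the arguments `centre` (standing in for the $P_0$-expectation of the truncated score, which is unknown in practice) and `threshold` are hypothetical choices made for the example.

```python
import math

def truncate_score(score, L):
    """Component-wise truncation: a component is set to 0 when its absolute value is >= L."""
    return [0.0 if abs(c) >= L else c for c in score]

def omega_1(scores, centre, L, threshold):
    """Return 1 when the centred empirical mean of the truncated scores
    exceeds `threshold` in Euclidean norm, and 0 otherwise.

    `centre` is an assumed stand-in for the expectation of the truncated
    score under the true distribution; `threshold` is supplied by the caller.
    """
    n = len(scores)
    d = len(centre)
    truncated = [truncate_score(s, L) for s in scores]
    mean = [sum(t[k] for t in truncated) / n for k in range(d)]
    deviation = math.sqrt(sum((mean[k] - centre[k]) ** 2 for k in range(d)))
    return 1 if deviation > threshold else 0
```

Passing the threshold as an argument keeps the sketch agnostic about how the critical value scales with the sample size.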
By the central limit theorem, $P_0^n\omega_{1,n} = P_0^n\bigl(\|\sqrt{n}(\mathbb{P}_n - P_0)\dot\ell^L_{\theta^*}\| > \sqrt{M_n}\bigr) \to 0$.
where we have used the notation (for all $\theta \in \Theta_1$ with small enough $\epsilon > 0$) $\tilde Q_\theta = \|Q_\theta\|^{-1} Q_\theta$ and the fact that $P_0 = Q_{\theta^*} = \tilde Q_{\theta^*}$. By straightforward manipulation, we find:
In view of lemma 3.4 and conditions (2.8), (2.9), $(P_0(p_\theta/p_{\theta^*}) - 1)$ is of order $O(\|\theta - \theta^*\|^2)$ as $\theta \to \theta^*$, which means that if we approximate the above display up to order $o(\|\theta - \theta^*\|^2)$, we can limit attention on the r.h.s. to the first term in the last factor and equate the first factor to 1. Furthermore, using the differentiability of $\theta \mapsto \log(p_\theta/p_{\theta^*})$, condition (2.8) and lemma 3.4, we see that:
Summarizing the above and combining with the remark made at the beginning of the proof concerning the choice of $L$, we find that for every $\delta > 0$, there exist choices of $\epsilon > 0$, $L > 0$ and $N \ge 1$ such that for all $n \ge N$ and all $\theta$ in $\Theta_1$:
We denote $\Delta(\theta) = (\theta - \theta^*)^T P_0(\dot\ell_{\theta^*}\dot\ell_{\theta^*}^T)(\theta - \theta^*)$ and since $P_0(\dot\ell_{\theta^*}\dot\ell_{\theta^*}^T)$ is strictly positive definite by assumption, its smallest eigenvalue $c$ is greater than zero. Hence, $-\delta\|\theta - \theta^*\|^2 \ge -(\delta/c)\,\Delta(\theta)$, and there exists a constant $r(\delta)$ (depending only on the matrix $P_0(\dot\ell_{\theta^*}\dot\ell_{\theta^*}^T)$ and with the property that $r(\delta) \to 1$ if $\delta \to 0$) such that:
for small enough $\epsilon$, large enough $L$ and large enough $n$, demonstrating that the type-II error is bounded above by the (unnormalized) tail probability $Q_\theta^n(\bar W_n \ge r(\delta)\Delta(\theta))$ of the mean of the variables
so that $\tilde Q_\theta W_i = 0$. The random variables $W_i$ are independent and bounded since:
The variance of $W_i$ under $\tilde Q_\theta$ is expressed as follows:

Using that $P_0\dot\ell_{\theta^*} = 0$ (see (2.11)), the above can be estimated like before, with the result that there exists a constant $s(\delta)$ (depending only on (the largest eigenvalue of) the matrix $P_0\dot\ell_{\theta^*}\dot\ell_{\theta^*}^T$ and with the property that $s(\delta) \to 1$ as $\delta \to 0$) such that $\mathrm{Var}_{\tilde Q_\theta}(W_i) \le s(\delta)\Delta(\theta)$, for small enough $\epsilon$ and large enough $L$. We apply Bernstein's inequality (see, for instance, Pollard (1984) [15], pp. 192-193) to obtain:
The factor $t(\delta) = r(\delta)^2\bigl(s(\delta) + \tfrac{3}{2} L\sqrt{d}\,\|\theta - \theta^*\|\,r(\delta)\bigr)^{-1}$ lies arbitrarily close to 1 for sufficiently small choices of $\delta$ and $\epsilon$. As for the $n$-th power of the norm of $Q_\theta$, we use lemma 3.4, (2.8) and (2.9) to estimate the norm of $Q_\theta$ as follows:
for some constant $u(\delta)$ such that $u(\delta) \to 1$ if $\delta \to 0$. Because $1 + x \le e^x$ for all $x \in \mathbb{R}$, we obtain, for sufficiently small $\|\theta - \theta^*\|$:
Note that $u(\delta) - t(\delta) \to 0$ as $\delta \to 0$ and $\Delta(\theta)$ is bounded above by a multiple of $\|\theta - \theta^*\|^2$. Since $V_{\theta^*}$ is assumed to be invertible, we conclude that there exists a constant $C > 0$ such that for large enough $L$, small enough $\epsilon > 0$ and large enough $n$:
Concerning the range $\|\theta - \theta^*\| > \epsilon$, an asymptotically consistent test sequence of $P_0$ versus $Q_\theta$ exists by assumption; what remains is the exponential power. The proof of lemma 3.3 demonstrates the existence of a sequence of tests $(\omega_{2,n})$ such that (3.12) holds. The sequence $(\psi_n)$ is defined as the maximum of the two sequences defined above: $\psi_n = \omega_{1,n} \vee \omega_{2,n}$ for all $n \ge 1$, in which case $P_0^n\psi_n \le P_0^n\omega_{1,n} + P_0^n\omega_{2,n} \to 0$ and:
Combination of the bounds found in (3.11) and (3.12) and a suitable choice for the constant $D > 0$ lead to (3.7).
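The role of Bernstein's inequality in the proof above can be illustrated numerically. The sketch below uses the standard one-sided form for independent, centred variables bounded by a constant, $P(\bar W_n \ge x) \le \exp\bigl(-\tfrac12 n x^2/(v + Mx/3)\bigr)$, and compares it with an exact binomial tail. The Bernoulli example, the function names and the parameter values are assumptions chosen for illustration, and the constants need not match the variant cited from Pollard.

```python
import math

def binomial_upper_tail(n, p, k):
    """Exact P(S >= k) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def bernstein_bound(n, variance, bound, x):
    """Bernstein bound on P(mean of n centred variables >= x), where each
    variable has the given variance and is bounded by `bound` in absolute value."""
    return math.exp(-0.5 * n * x * x / (variance + bound * x / 3.0))

n, p = 50, 0.3
for k in (18, 20, 25, 30):
    x = k / n - p                      # deviation of the sample mean from p
    exact = binomial_upper_tail(n, p, k)
    bound = bernstein_bound(n, p * (1 - p), max(p, 1 - p), x)
    print(k, exact, bound)             # the exact tail stays below the bound
```

The bound decays exponentially in $n x^2$ when the variance term dominates, which is the mechanism that produces exponents proportional to $n\Delta(\theta)$ in the proof.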
The following lemma shows that for a sequence of tests that separates $P_0$ from a fixed model subset $V$, there exists an exponentially powerful version without further conditions. Note that this lemma holds in non-parametric and parametric situations alike.
Lemma 3.3. Suppose that for given measurable subset $V$ of $\Theta$, there exists a sequence of tests $(\phi_n)$ such that:
Then there exists a sequence of tests $(\omega_n)$ and strictly positive constants $C$, $D$ such that:
Proof. For given $0 < \zeta < 1$, we split the model subset $V$ in two disjoint parts $V_1$ and $V_2$ defined by
Note that for every test sequence $(\omega_n)$,
Let $\delta > 0$ be given. By assumption there exists an $N \ge 1$ such that for all $n \ge N + 1$, $P_0^n\phi_n \le \delta$ and $\sup_{\theta \in V} Q_\theta^n(1 - \phi_n) \le \delta$. Every $n \ge N + 1$ can be written as an $m$-fold multiple of $N$ ($m \ge 1$) plus a remainder $1 \le r \le N$: $n = mN + r$. Given $n \ge N$, we divide the sample $X_1, X_2, \ldots, X_n$ into $(m - 1)$ groups of $N$ consecutive $X$'s and a group of $N + r$ $X$'s and apply $\phi_N$ to the first $(m - 1)$ groups and $\phi_{N+r}$ to the last group, to obtain:
$Y_{m,n} = \phi_{N+r}\bigl(X_{(m-1)N+1}, X_{(m-1)N+2}, \ldots, X_{mN+r}\bigr)$,
which are bounded, $0 \le Y_{j,n} \le 1$ for all $1 \le j \le m$ and $n \ge 1$. From that we define the test statistic $\bar Y_{m,n} = (1/m)(Y_{1,n} + \ldots + Y_{m,n})$ and the test function $\omega_n = 1\{\bar Y_{m,n} \ge \eta\}$, based on a critical value $\eta > 0$ to be chosen at a later stage. The $P_0^n$-expectation of the test function can be bounded as follows:
Proof. The function $R(x)$ defined by $e^x = 1 + x + \tfrac{1}{2}x^2 + x^2 R(x)$ increases from $-\tfrac{1}{2}$ in the limit $x \to -\infty$ to $\infty$ as $x \to \infty$, with $R(x) \to R(0) = 0$ if $x \to 0$. We also have $|R(-x)| \le R(x) \le e^x/x^2$ for all $x > 0$. The l.h.s. of the assertion of the lemma can be written as
The expectation on the r.h.s. of the above display is bounded by $P_0 m_\theta^2 R(\epsilon m_\theta)$ if $\|\theta - \theta^*\| \le \epsilon$. The functions $m^2 R(\epsilon m)$ are dominated by $e^{sm}$ for sufficiently small $\epsilon$ and converge pointwise to $m^2 R(0) = 0$ as $\epsilon \downarrow 0$. The lemma then follows from the dominated convergence theorem.
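The block-splitting construction in the proof of lemma 3.3 (apply the base test to consecutive groups of observations, average the outcomes, and compare with a critical value) can be sketched as follows. The function names, the representation of the base test as a single callable `phi` acting on a block of any length, and the example test are assumptions made for illustration.

```python
def split_blocks(xs, N):
    """Split xs (length n = m*N + r with 1 <= r <= N) into m - 1 blocks of
    length N followed by one final block of length N + r."""
    n = len(xs)
    if n < N + 1:
        raise ValueError("need at least N + 1 observations")
    m = (n - 1) // N                    # this choice guarantees 1 <= r <= N
    blocks = [xs[j * N:(j + 1) * N] for j in range(m - 1)]
    blocks.append(xs[(m - 1) * N:])     # the last block has N + r observations
    return blocks

def block_test(xs, N, phi, eta):
    """Average the base test over the blocks; reject (return 1) when the
    average is at least the critical value eta."""
    ys = [phi(block) for block in split_blocks(xs, N)]
    y_bar = sum(ys) / len(ys)
    return 1 if y_bar >= eta else 0

# Hypothetical base test: reject when the block mean exceeds 0.5.
def phi(block):
    return 1.0 if sum(block) / len(block) > 0.5 else 0.0
```

Because the blocks are disjoint, the block outcomes are independent and bounded between 0 and 1, which is what allows an exponential tail bound for their average.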

Consistency and testability
The conditions for the theorems concerning rates of convergence and limiting behaviour of the posterior distribution discussed in the previous sections include several requirements on the model involving the true distribution $P_0$. Depending on the specific model and true distribution, these requirements may be rather stringent, disqualifying for instance models in which $-P_0\log(p_\theta/p_{\theta^*}) = \infty$ for $\theta$ in neighbourhoods of $\theta^*$. To drop this kind of condition from the formulation and nevertheless maintain the current proof(s), we have to find other means to deal with 'undesirable' subsets of the model. In this section we show that if Kullback-Leibler neighbourhoods of the point of convergence receive enough prior mass and asymptotically consistent uniform tests for $P_0$ versus such subsets exist, they can be excluded from the model beforehand. As a special case, we derive a misspecified version of Schwartz' consistency theorem (see Schwartz (1965) [16]). Results presented in this section hold for the parametric models considered in previous sections, but are also valid in non-parametric situations.

Exclusion of testable model subsets
We start by formulating and proving the lemma announced above, in its most general form.
Lemma 4.1. Let $V \subset \Theta$ be a (measurable) subset of the model $\Theta$. Assume that for some $\epsilon > 0$:
and there exist constants $\gamma > 0$, $\beta > \epsilon$ and a sequence $(\phi_n)$ of test functions such that:
In many situations, (4.1) is satisfied for every ǫ > 0. In that case the construction of uniform exponentially powerful tests from asymptotically consistent tests (as demonstrated in the proof of lemma 3.3) can be used to fulfill (4.2) under the condition that an asymptotically consistent uniform test-sequence exists.
In this corollary form, the usefulness of lemma 4.1 is most apparent. All subsets V of the model that can be distinguished from P 0 based on a characteristic property (formalised by the test functions above) in a uniform manner (cf. (4.4)) may be discarded from proofs like that of theorem 3.1. Hence the properties assumed in the statement of (for instance) theorem 3.1 can be left out as conditions if a suitable test sequence exists.
Whether or not a suitable test sequence can be found depends on the particular model and true distribution in question and little can be said in any generality. The likelihood ratio test is one possibility. The following lemma is comparable to the classical condition, as in [13].
Then $\Pi(V \mid X_1, X_2, \ldots, X_n) \to 0$, $P_0$-almost-surely.
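The conclusion that the posterior mass of a testable subset vanishes can be seen in a toy simulation. The sketch below is an illustration under assumed choices only: a two-point Bernoulli model {0.2, 0.4}, a uniform prior, and a true distribution Bernoulli(0.5) that lies outside the model. The point 0.4 minimizes the Kullback-Leibler divergence with respect to the true distribution, and the subset V = {0.2} loses its posterior mass as the sample grows.

```python
import math
import random

def posterior_mass(data, thetas, prior):
    """Posterior probabilities over a finite set of Bernoulli parameters."""
    ones = sum(data)
    zeros = len(data) - ones
    log_post = [math.log(w) + ones * math.log(t) + zeros * math.log(1 - t)
                for t, w in zip(thetas, prior)]
    top = max(log_post)                      # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)                       # fixed seed so the run is reproducible
data = [1 if rng.random() < 0.5 else 0 for _ in range(500)]   # true distribution: Bernoulli(0.5)
thetas = [0.2, 0.4]                          # misspecified model: 0.5 is not included
post = posterior_mass(data, thetas, [0.5, 0.5])
print(post)                                  # the mass on 0.2 is negligible
```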

Appendix: Technical lemmas
The first lemma used in the proof of theorem 2.1 shows that given two sequences of probability measures, a sequence of balls that grows fast enough can be used conditionally to calculate the difference in total-variational distance, even when the sequences consist of random measures.
Lemma 5.1. Let $(\Pi_n)$ and $(\Phi_n)$ be two sequences of random probability measures on $\mathbb{R}^d$. Let $(K_n)$ be a sequence of subsets of $\mathbb{R}^d$ such that,
Proof. Let $K$, a measurable subset of $\mathbb{R}^d$, and $n \ge 1$ be given and assume that $\Pi_n(K) > 0$ and $\Phi_n(K) > 0$. Then for any measurable $B \subset \mathbb{R}^d$ we have: $|\Pi_n(B) - \Pi_n^K(B)| \le 2\Pi_n(\mathbb{R}^d \setminus K)$, and hence also:
As a result of the triangle inequality, we then find that the difference in total-variation distances between $\Pi_n$ and $\Phi_n$ on the one hand and $\Pi_n^K$ and $\Phi_n^K$ on the other is bounded above by the expression on the right in the above display (which is independent of $B$).
Define $A_n$, $B_n$ to be the events that $\Pi_n(K_n) > 0$, $\Phi_n(K_n) > 0$ respectively. On $\Xi_n = A_n \cap B_n$, $\Pi_n^{K_n}$ and $\Phi_n^{K_n}$ are well-defined probability measures. Assumption (5.1) guarantees that $P_0^n(\Xi_n)$ converges to 1. Restricting attention to the event $\Xi_n$ in the above upon substitution of the sequence $(K_n)$ and using (5.1) for the limit of (5.3) we find (5.2), where it is understood that the conditional probabilities on the l.h.s. are well-defined with probability growing to 1.
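The elementary bound used in the proof of lemma 5.1, namely that conditioning a probability measure on a set K changes the mass of any set B by at most twice the mass outside K, can be checked exhaustively on a small discrete example. The measure, the set K and the helper names below are hypothetical choices for illustration; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import combinations

def measure(weights, subset):
    """Mass that the discrete measure `weights` assigns to `subset`."""
    return sum(weights[x] for x in subset)

def conditioned(weights, K, subset):
    """Mass of `subset` under the measure conditioned on K."""
    return measure(weights, set(subset) & K) / measure(weights, K)

weights = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
K = {0, 1}
outside = measure(weights, set(weights) - K)          # mass outside K

# Largest discrepancy between the measure and its conditioned version over all subsets B.
points = list(weights)
worst = max(
    abs(measure(weights, B) - conditioned(weights, K, B))
    for size in range(len(points) + 1)
    for B in combinations(points, size)
)
print(worst, 2 * outside)
```

In this example the largest discrepancy equals the mass outside K, so the factor 2 in the bound leaves room to spare.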
The second lemma demonstrates that the sequence of normals satisfies the condition of lemma 5.1 when the sequence of centre points $\Delta_{n,\theta^*}$ is uniformly tight.
Lemma 5.2. Let $K_n$ be a sequence of balls centred on the origin with radii $M_n \to \infty$. Let $(\Phi_n)$ be a sequence of normal distributions (with fixed covariance matrix $V$) located respectively at the (random) points $(\Delta_n) \subset \mathbb{R}^d$. If the sequence $\Delta_n$ is uniformly tight, then $\Phi_n(\mathbb{R}^d \setminus K_n) = N_{\Delta_n,V}(\mathbb{R}^d \setminus K_n) \xrightarrow{P_0} 0$.
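The deterministic fact behind lemma 5.2 is that a normal distribution whose centre lies within distance L of the origin puts at most ε of its mass outside a ball of radius L + L', where L' is chosen so that a ball of radius L' around the centre has mass 1 − ε. A one-dimensional numerical sketch follows; the dimension, the values of `sigma`, `L` and `eps`, and the grid of centres are assumptions made for illustration.

```python
from statistics import NormalDist

def mass_outside_ball(mu, sigma, radius):
    """N(mu, sigma^2) mass outside the interval [-radius, radius] (a ball in dimension 1)."""
    dist = NormalDist(mu, sigma)
    return 1.0 - (dist.cdf(radius) - dist.cdf(-radius))

sigma, L, eps = 1.0, 3.0, 0.05
# Radius L' such that the interval of half-width L' around the centre has mass 1 - eps.
L_prime = NormalDist(0.0, sigma).inv_cdf(1.0 - eps / 2.0)
M = L + L_prime

centres = [-L + 2 * L * i / 100 for i in range(101)]         # grid of centres with |mu| <= L
worst = max(mass_outside_ball(mu, sigma, M) for mu in centres)
print(worst)                                                 # does not exceed eps
```

Once the radii exceed M, the same bound holds uniformly over all centres within distance L of the origin, and uniform tightness makes the event that the centre falls outside that range arbitrarily unlikely.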
Proof. Let $\delta > 0$ be given. Uniform tightness of the sequence $(\Delta_n)$ implies the existence of a constant $L > 0$ such that: $\sup_{n \ge 1} P_0^n(\|\Delta_n\| \ge L) \le \delta$.
For all $n \ge 1$, call $A_n = \{\|\Delta_n\| \ge L\}$. Let $\mu \in \mathbb{R}^d$ be given. Since $N(\mu, V)$ is tight, there exists for every given $\epsilon > 0$ a constant $L'$ such that $N_{\mu,V}(B(\mu, L')) \ge 1 - \epsilon$ (where $B(\mu, L')$ denotes a ball of radius $L'$ around the point $\mu$). Assuming that $\|\mu\| \le L$, $B(\mu, L') \subset B(0, L + L')$ so that with $M = L + L'$, $N_{\mu,V}(B(0, M)) \ge 1 - \epsilon$ for all $\mu$ such that $\|\mu\| \le L$. Choose $N \ge 1$ such that $M_n \ge M$ for all $n \ge N$. Let $n \ge N$ be given. Then:
Note that on the complement of $A_n$, $\|\Delta_n\| < L$, so: