Posterior asymptotics in Wasserstein metrics on the real line

In this paper, we use the class of Wasserstein metrics to study asymptotic properties of posterior distributions. Our first goal is to provide sufficient conditions for posterior consistency. In addition to the well-known Schwartz's Kullback--Leibler condition on the prior, the true distribution and most probability measures in the support of the prior are required to possess moments up to an order which is determined by the order of the Wasserstein metric. We further investigate convergence rates of the posterior distributions, for which we need stronger moment conditions. The required tail conditions are sharp in the sense that the posterior distribution may be inconsistent or contract slowly to the true distribution without these conditions. Our study involves techniques that build on recent advances in the Wasserstein convergence of empirical measures. We apply the results to density estimation with a Dirichlet process mixture prior and conduct a simulation study for further illustration.


Introduction
The Wasserstein distance originally arose in the problem of optimal transportation [43] and is often called the Kantorovich or transportation distance. We refer to [42] for the history of this metric. For two Borel probability measures P and Q on the real line, the Wasserstein metric of order p, p ∈ [1, ∞), is defined as

W_p(P, Q) = ( inf_{π ∈ C(P,Q)} ∫ |x − y|^p dπ(x, y) )^{1/p},

where C(P, Q) is the set of all couplings π of P and Q, that is, Borel probability measures on R^2 with marginals P and Q, respectively. Wasserstein metrics have a wide range of applications, e.g. Wasserstein generative adversarial networks (GAN; [1,25]), approximate Markov chain Monte Carlo (MCMC; [38]), distributionally robust optimization (DRO; [31]) and clustering ([4,32]). However, an exhaustive study of statistical properties, such as the convergence behavior of the empirical measure with respect to W_p, has been conducted only recently, see [7,18,46,15]. In particular, the great success of Wasserstein GANs in the machine learning community accelerated the study of Wasserstein metrics in the statistics community as discrepancy measures between probabilities; [41,34,5]. Recently, [3] proposed the use of the Wasserstein distance in the implementation of Approximate Bayesian Computation (ABC) to approximate the posterior distribution. In nonparametric Bayesian inference, [36,12] used Wasserstein metrics to study asymptotic properties of posterior distributions, but W_p was considered as a distance between mixing distributions rather than between the mixture densities themselves. As a result, the Wasserstein metrics in these papers yielded a stronger topology than the total variation distance on the space of density functions. In general, W_p, 1 ≤ p < ∞, metrizes the weak convergence of probability measures in a bounded metric space.
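On the real line the optimal coupling is the monotone one, so for two equal-size samples W_p reduces to an average over sorted pairs. The following sketch (the helper name `wasserstein_p` is ours, not from the paper) illustrates the definition numerically and checks it against SciPy's built-in W_1:

```python
# Minimal numerical sketch of the order-p Wasserstein distance on the real
# line.  For two samples of equal size, the optimal coupling is the monotone
# (sorted) pairing, so W_p reduces to a simple average over order statistics.
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_p(x, y, p=1.0):
    """W_p between the empirical measures of two equal-size samples."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

x = [0.0, 1.0, 2.0]
y = [0.5, 1.5, 2.5]
print(wasserstein_p(x, y, p=1))    # monotone-coupling formula: 0.5
print(wasserstein_distance(x, y))  # SciPy's W_1 agrees: 0.5
```

For p = 1 this agrees with `scipy.stats.wasserstein_distance`, which implements the same one-dimensional formula.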
Specifically, if the diameter of the underlying metric space is bounded by 1, one has the relationship d_P^2 ≤ W_1 ≤ 2d_P ≤ d_V, where d_P and d_V are the Lévy-Prokhorov and total variation distances, see [23]. In an unbounded metric space, the second and third inequalities do not hold because W_1 is not a bounded metric.
In this article, we utilize the Wasserstein distances to study the asymptotic behavior of posterior distributions under the assumption that data are generated from a fixed true distribution, and we focus on nonparametric Bayesian density estimation on the real line. To set the stage, let X_1, ..., X_n be independent and identically distributed observations from the true distribution P_0 possessing a density p_0. Let F be a collection of probability densities on R equipped with the weak topology, and let Π be a prior distribution on F. Then the posterior probability of a measurable set A ⊂ F is given by

Π(p ∈ A | X_1, ..., X_n) = [∫_A ∏_{i=1}^n {p(X_i)/p_0(X_i)} dΠ(p)] / [∫_F ∏_{i=1}^n {p(X_i)/p_0(X_i)} dΠ(p)]   (1.1)

by the Bayes formula. Throughout the paper, we allow the prior Π to depend on the sample size n, but often suppress this dependency in the notation of both prior and posterior distributions. If clarification is necessary, the prior and posterior will be denoted Π_n and Π_n(· | X_1, ..., X_n), respectively. The posterior distribution is said to be consistent with respect to a (pseudo-)metric d if Π(d(p, p_0) > ε | X_1, ..., X_n) → 0 in probability for every ε > 0, where the convergence in probability is taken with respect to the true distribution P_0. If ε is replaced by ε_n for some sequence ε_n → 0, the convergence rate of the posterior distributions is said to be at least ε_n. There is a large body of research concerning asymptotic properties of posterior distributions. We refer to the monograph [22] for the history and details of this topic.
Of key importance is the Kullback-Leibler (KL) support condition developed by [39]. A fixed prior Π is said to satisfy the KL support condition if

Π(p : K(p_0, p) < ε) > 0 for every ε > 0,   (1.2)

where K(p_0, p) = ∫ log[p_0(x)/p(x)] dP_0(x) is the KL divergence. If the prior depends on the sample size, the KL condition (1.2) can be replaced by

lim inf_{n→∞} Π_n(p : K(p_0, p) < ε) > 0 for every ε > 0.   (1.3)

In particular, it gives a suitable lower bound on the denominator in (1.1) and implies posterior consistency in the weak topology, that is, with respect to the Lévy-Prokhorov distance, see [39] and Section 6.4 of [22]. A variation of the KL support condition used to obtain a convergence rate is developed by [20]. It is formally expressed as

Π(K_n) ≥ e^{−nε_n^2} for all large enough n,   (1.4)

where K_n = {p ∈ F : ∫ log(p_0/p) dP_0 ≤ ε_n^2, ∫ [log(p_0/p)]^2 dP_0 ≤ ε_n^2}.
In the literature, studies on posterior asymptotics have focused on strong metrics such as the total variation, Hellinger and uniform metrics. For those purposes, some non-trivial conditions, such as bounded entropy or prior summability, are assumed in addition to the KL conditions, see [19,45,2,11] for example. On the other hand, it is surprising that a careful analysis of convergence rates with respect to weak metrics such as the Lévy-Prokhorov and Kolmogorov metrics is missing from the literature, considering that the KL support condition is sufficient for consistency in those metrics. [11] studied the convergence rate of the posterior distribution with respect to the Lévy-Prokhorov metric, but their rate n^{−1/4} leaves a lot of room for improvement. Furthermore, they used the Lévy-Prokhorov rate as a tool for proving consistency in total variation, and did not focus on the convergence rate itself.
Wasserstein metrics W_p, 1 ≤ p < ∞, metrize weak convergence on a bounded space, but they generate a stronger topology in general. Indeed, neither the KL support condition (1.2) nor (1.4) is sufficient for posterior consistency with respect to W_p. If P_0 is the standard Cauchy distribution, for example, W_p(P, P_0) = ∞ for every p ≥ 1 and every P with a finite pth moment. Therefore, for any prior except the one putting all its mass on P_0, the posterior distribution is inconsistent with respect to W_p. This simple example shows that tails or moments of probability measures play an important role in handling W_p.
For a sequence P_n of probability measures, it is well known that W_p(P_n, P) → 0 if and only if P_n converges to P weakly and M_p(P_n) → M_p(P), see [43], p. 212, where M_p(P) = ∫ |x|^p dP(x). Therefore, for Wasserstein consistency to hold, the posterior moment should converge to the true moment, see Theorem 2.1. However, while the moment consistency of frequentist nonparametric estimators such as the empirical distribution is straightforward, it is non-trivial to show that the posterior moment converges to the true moment even with a very popular prior such as a Dirichlet process mixture. This is mainly because the tails of probability measures in the support of the prior must be considered simultaneously.
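The role of moments can be seen in the classical counterexample behind this equivalence: P_n = (1 − 1/n)δ_0 + (1/n)δ_n converges weakly to δ_0, yet M_1(P_n) = 1 for every n, so W_1(P_n, δ_0) does not vanish. A quick numerical check, using SciPy's weighted W_1:

```python
# P_n puts mass 1 - 1/n at 0 and mass 1/n at n: it converges weakly to
# delta_0, but the first moment stays at 1, so W_1(P_n, delta_0) = 1.
from scipy.stats import wasserstein_distance

vals = []
for n in [10, 100, 1000]:
    support, weights = [0.0, float(n)], [1.0 - 1.0 / n, 1.0 / n]
    w1 = wasserstein_distance(support, [0.0], u_weights=weights, v_weights=[1.0])
    vals.append(w1)
print(vals)  # W_1 stays at 1 for every n while the weak limit is delta_0
```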
To prove posterior consistency, we build on the KL condition. We provide two different approaches which are of independent interest; see the proof of Theorem 2.2. The first one directly targets posterior moment consistency and relies on a result from [45]. The second one requires less stringent conditions, but the proof is more complicated. Specifically, we construct uniformly consistent tests based on the empirical distribution by exploiting suitable upper bounds on Wasserstein metrics. We then show that, to achieve posterior consistency with respect to W_p, moments of densities must be suitably bounded. In particular, the posterior needs to put most of its mass on distributions that possess moments up to an order determined by that of the Wasserstein metric. In practice, the posterior moment condition can be verified by establishing exponentially small prior probability on the complement set, cf. Lemma 8.2. In Section 5.2 we provide an illustration in the specific example of the Dirichlet process mixture prior.
Both approaches for posterior consistency can be extended to obtain suitable convergence rates under the KL condition (1.4). While the first approach gives the convergence rate for the moment, the second approach gives the rate with respect to W_p^p, relying on slightly stronger moment conditions, see Theorems 3.1 and 3.2. For convergence rates with the second approach, we rely on new upper bounds on Wasserstein metrics that may be of independent interest, cf. Lemma 8.7. Interestingly, the posterior moment conditions for consistency and convergence rates are nearly necessary, that is, the posterior distribution may be inconsistent or contract slowly to the true distribution when they are not satisfied. Finally, we obtain convergence rates for the case p = ∞ in Theorem 4.1, for which we need to restrict attention to probability measures on a bounded space.
To the best of our knowledge, this paper provides the first results on posterior asymptotics with respect to the Wasserstein metric. The remainder of this paper is organized as follows. Results on posterior consistency and its convergence rate with respect to W_p, for 1 ≤ p < ∞, are considered in Sections 2 and 3, respectively. Posterior asymptotics with respect to W_∞ is studied in Section 4. Section 5 considers more details with specific examples. Some numerical results complementing our theory are provided in Section 6. Concluding remarks and proofs are given in Sections 7 and 8, respectively.

Notation
Before proceeding, we introduce some further notation. For two real numbers a and b, their minimum and maximum are denoted by a ∧ b and a ∨ b, respectively. The inequality a ≲ b means that a is less than a constant multiple of b, where the constant is universal unless specified otherwise. Upper case letters such as P and Q refer to probability measures corresponding to the densities denoted by the lower case letters, and vice versa. The empirical measure based on X_1, ..., X_n is denoted P_n. For a real-valued function f, its expectation with respect to P is denoted Pf. The expectation with respect to the true distribution is often denoted Ef(X). The restriction of P to a set A is denoted P|_A.

Consistency with respect to W p
Recall that W_p(P_n, P) → 0 if and only if P_n converges weakly to P and M_p(P_n) → M_p(P), see Theorem 7.12 of [43]. Also, the KL support condition (1.2), or (1.3), guarantees posterior consistency with respect to the Lévy-Prokhorov metric, which induces the weak convergence. Therefore, it is natural under the KL support condition to guess that posterior consistency with respect to W_p is equivalent to consistency of the pth moment, that is,

Π(|M_p(P) − M_p(P_0)| > ε | X_1, ..., X_n) → 0 in probability for every ε > 0.   (2.1)

If (2.1) holds, we say that the posterior moment of order p is consistent. For p = 1, moment consistency is easily implied by W_1-consistency with the help of the duality theorem of [30], see also [17,14,44], which asserts that

W_1(P, Q) = sup_{f ∈ L} | ∫ f dP − ∫ f dQ |,

where L is the class of all Lipschitz functions whose Lipschitz constant is bounded by 1. Since the map x → |x| belongs to L, we have that |M_1(P) − M_1(Q)| ≤ W_1(P, Q). Although such an explicit bound does not exist for p > 1, one can show that posterior consistency with respect to the Wasserstein distance is equivalent to moment consistency under the KL support condition.
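The bound |M_1(P) − M_1(Q)| ≤ W_1(P, Q) can be checked numerically; since x → |x| is 1-Lipschitz, the duality forces the inequality for any pair of samples (the sample sizes and distributions below are arbitrary choices):

```python
# Numerical check of |M_1(P) - M_1(Q)| <= W_1(P, Q), a consequence of the
# Kantorovich-Rubinstein duality because x -> |x| is 1-Lipschitz.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)   # sample from P
y = rng.normal(1.0, 2.0, size=500)   # sample from Q
m_gap = abs(np.mean(np.abs(x)) - np.mean(np.abs(y)))  # |M_1(P) - M_1(Q)|
w1 = wasserstein_distance(x, y)                       # W_1(P, Q)
print(m_gap <= w1 + 1e-12)   # True, for any pair of samples
```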
Theorem 2.1. For a prior Π, suppose that the KL condition (1.3) holds. Then, consistency of the pth moment (2.1) is equivalent to

Π(W_p(P, P_0) ≥ ε | X_1, ..., X_n) → 0 in probability for every ε > 0.   (2.2)

We provide two different approaches for proving posterior consistency with respect to W_p, which are of independent interest. The first approach relies on a result from [45]; namely, that if C is a convex set of probability measures and inf_{P ∈ C} H(P_0, P) > 0, then Π(C | X_1, ..., X_n) → 0 in probability, where H(P, Q) denotes the Hellinger distance between P and Q. This approach directly uses Theorem 2.1 by establishing consistency of the pth moment (2.1). The proof based on this approach is very simple, as it only needs a single application of the Cauchy-Schwarz inequality. However, it requires the moment of order 2p to be bounded a posteriori.
The second approach constructs a uniformly consistent sequence of tests based on the convergence of the empirical distribution. The uniformity poses no problem in the compactly supported case, i.e. P_0([−1, 1]) = 1 and P([−1, 1]) = 1 for every P in the support of the prior Π. If probability measures in the support of the prior have unbounded support, however, problems may arise due to probability measures with large moments. This problem can be avoided if the moments are suitably bounded a posteriori, as expressed through condition (2.3) below. The second approach relies on a rather complicated proof, but it only needs the moment of order p + δ, for some δ > 0, to be bounded.

Theorem 2.2. Assume that the prior Π satisfies the KL condition (1.3). Furthermore, assume that there exist positive constants K and δ such that M_{p+δ}(P_0) < ∞ and

Π(M_{p+δ}(P) ≤ K | X_1, ..., X_n) → 1 in probability.   (2.3)

Then (2.2) holds, that is, the posterior distribution is consistent with respect to W_p.
It should be emphasized that the assumptions in Theorem 2.2 are nearly necessary. Certainly, M_p(P_0) < ∞ is necessary. Since consistency with respect to W_p entails consistency of the pth moment by Theorem 2.1, it is also necessary that

Π(M_p(P) ≤ K | X_1, ..., X_n) → 1 in probability for some constant K.   (2.4)

On the other hand, M_p(P_0) < ∞ and (2.4) are not sufficient for the posterior distribution to be consistent with respect to W_p, as shown in the following example.
for a constant K, a condition that seems easy to prove at first sight. However, the proof is not simple even with a well-known prior which puts all of its mass on the space of light-tailed distributions, that is, distributions with a large or infinite tail index. Here, if a distribution function F satisfies 1 − F(x) = x^{−α} L(x) for large enough x, where L(·) is a slowly varying function satisfying lim_{y→∞} L(xy)/L(y) = 1 for any x > 0, the positive constant α is called the (right) tail index of F; see [33] for Bayesian consistency of the tail index. It should be noted that a light tail, i.e. a large tail index, does not guarantee a small moment, which makes the proof of posterior consistency in W_p difficult. This is in stark contrast with the empirical distribution, whose moments can trivially be shown to be consistent. In Section 5.2, we are able to work out the case of the Dirichlet process mixture prior by using Lemma 8.2, that is, by establishing that the prior puts exponentially small mass on probability measures P with M_p(P) > K. See Theorem 5.1.

Convergence rates with respect to W p
For a given rate sequence ε_n, suppose that Π(K_n) ≥ e^{−nε_n^2} for every large enough n. Based on this condition, which is used to find a lower bound on the integrated likelihood, i.e. the denominator in (1.1), we will extend the results of Section 2 to obtain a convergence rate. The main task in this section is to find the additional assumptions required to achieve the convergence rate ε_n. An extension of the first proof of Theorem 2.2 requires the moment of order 2p, as follows.
Theorem 3.1. Assume that the prior Π satisfies the KL condition (1.4) for a sequence ε_n with ε_n → 0 and nε_n^2 → ∞. Furthermore, assume that there exists a constant K such that M_{2p}(P_0) ≤ K and

Π(P : M_{2p}(P) > K | X_1, ..., X_n) → 0 in probability.
Then Π(|M_p(P) − M_p(P_0)| > Kε_n | X_1, ..., X_n) → 0 in probability for some constant K > 0.

Note that M_p(P) is a linear functional of P for which the semi-parametric Bernstein-von Mises (BvM) theorem may hold, see [9,37]. In this case, the convergence rate of the marginal posterior distribution of M_p(P) would be the parametric rate n^{−1/2} even though the global posterior convergence rate ε_n may be slower. However, while Theorem 3.1 is very general, the semi-parametric BvM theorem holds under rather strong conditions. For example, the above-mentioned papers consider only specific priors and rely on the assumption that p_0 is compactly supported and bounded away from zero. It is sometimes possible to obtain the parametric convergence rate for a finite-dimensional parameter of interest without the semi-parametric BvM theorem. However, the proof typically relies on the LAN (locally asymptotically normal) expansion of the log-likelihood, see [6,10].
Next, we consider an extension of the testing approach. To achieve the convergence rate ε_n, we will construct a sequence of tests φ_n with P_0 φ_n → 0 and sup_{P ∈ F_n} P(1 − φ_n) decaying exponentially fast, where F_n = {P : W_p^p(P, P_0) > Kε_n} ∩ F_0. Here, F_0 will be defined as a collection of probability measures whose tails and moments are suitably bounded. Then, it will suffice for the desired result to show that Π(F_0^c | X_1, ..., X_n) → 0 in probability.
A consistent sequence of tests will be constructed based on the convergence of the empirical distribution to the true distribution. Note that there are well-known concentration inequalities of the form P(W_p^p(P_n, P) > ε_n) ≤ δ_n, where δ_n is a decaying sequence, and those inequalities might be directly used to define tests as φ_n = 1 if W_p^p(P_n, P_0) > ε_n and φ_n = 0 otherwise. However, such a simple approach does not give sharp convergence rates for the posterior distribution. For example, if we apply the concentration inequality of [18], for any P with W_p^p(P, P_0) > 2^p ε_n and M_{2p+δ}(P) < ∞, we obtain a bound (3.1) whose constants c_1 and c_2 depend on the moments of P, so it is not easy to bound (3.1) uniformly. Furthermore, the second term on the right-hand side of (3.1) is of polynomial order in nε_n^2, which decays too slowly compared to e^{−nε_n^2}. In turn, the use of φ_n would give a much slower convergence rate than ε_n. Theorem 3.2 below is our main result concerning convergence rates of the posterior distribution. The proof of Theorem 3.2 relies on the set-up of [18]. In particular, Lemmas 8.3 and 8.5 can be easily deduced from the results in [18]. We build on these two lemmas to develop some techniques whose details differ from [18]. As mentioned above, we need to construct a sequence of tests decaying at an exponential rate. As far as we know, this is not possible with the proof technique used in [18]. Given a bounded moment condition, we achieve this with the help of Lemma 8.7. The condition ε_n ≥ √((log n)/n) in Theorem 3.2 is assumed only for technical reasons. Although we did not succeed in eliminating this condition, we believe the result is valid for any ε_n ↓ 0 with nε_n^2 → ∞.
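For illustration only, the naive plug-in test φ_n = 1{W_p^p(P_n, P_0) > ε_n} discussed above can be sketched as follows for p = 1, with P_0 approximated by a large reference sample; as noted, such a test does not yield sharp posterior rates, and the threshold 0.2 and the alternative N(3, 1) are arbitrary choices of ours:

```python
# Naive plug-in test phi_n = 1{ W_1(P_n, P_0) > eps_n }, shown only to fix
# ideas.  P_0 = N(0, 1) is approximated by a large reference sample.
import numpy as np
from scipy.stats import wasserstein_distance

def phi_n(data, p0_sample, eps_n):
    """Reject (return 1) when the empirical W_1 to P_0 exceeds eps_n."""
    return int(wasserstein_distance(data, p0_sample) > eps_n)

rng = np.random.default_rng(1)
p0_ref = rng.normal(size=100_000)                       # stand-in for P_0
r_null = phi_n(rng.normal(size=2000), p0_ref, 0.2)      # data from P_0
r_alt = phi_n(rng.normal(3.0, 1.0, size=2000), p0_ref, 0.2)  # far from P_0
print(r_null, r_alt)   # the test accepts under P_0 and rejects far from it
```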
Theorem 3.2. Assume that the prior Π satisfies the KL condition (1.4) for a sequence ε_n with ε_n ↓ 0 and ε_n ≥ √((log n)/n). Furthermore, assume that there exist positive constants K and δ such that M_{2p+δ}(P_0) ≤ K and

Π(M_{2p+δ}(P) ≤ K | X_1, ..., X_n) → 1 in probability.

Then Π(W_p^p(P, P_0) > Kε_n | X_1, ..., X_n) → 0 in probability for some constant K > 0.

The assumptions in Theorem 3.2 should be understood as sufficient conditions under which Π(K_n) ≥ e^{−nε_n^2} guarantees ε_n as the posterior convergence rate for any ε_n ≳ n^{−1/2}. For the empirical measure to achieve the rate n^{−1/2} with respect to W_p^p, the same moment condition M_{2p+δ}(P_0) < ∞ is considered in [18]. They provided an example showing that this moment condition cannot be weakened in general. As illustrated in the example at the end of this section, the posterior moment condition Π(M_{2p+δ}(P) ≤ K | X_1, ..., X_n) → 1 cannot be weakened either.
When p > 1, Theorem 3.2 gives a rate ε_n with respect to W_p^p rather than W_p. This result is more closely related to [18] than to [7]. In particular, the condition M_{2p+δ}(P_0) < ∞ is the same as Eq. (3) of [18], and much weaker than condition (3.2), which is a necessary and sufficient condition in [7] for E[W_p(P_n, P_0)] ≲ n^{−1/2}. When p = 1, M_{2p+δ}(P_0) < ∞ is only slightly stronger than (3.2), which reduces to ∫ √(F_0(x)(1 − F_0(x))) dx < ∞. For p > 1, (3.2) may not be satisfied even when P_0 is compactly supported. Note that if P_0 is standard normal, (3.2) is satisfied if and only if 1 ≤ p < 2. As mentioned in [7], the rate E[W_p(P_n, P_0)] ≲ n^{−1/2} cannot be obtained under the moment-type conditions considered in Theorem 3.2. Therefore, we would need a stronger assumption such as (3.2) to replace W_p^p by W_p in Theorem 3.2. Since we focus on moment-type conditions in the present paper, we do not discuss condition (3.2) in more detail. Instead, we consider the metric W_∞ in Section 4 under a stronger assumption. Specifically, P_0 will be assumed to be supported on a bounded interval, which is necessary to obtain consistency with respect to W_∞. The result in Section 4 guarantees the rate ε_n with respect to W_p, not W_p^p, at least when the probability measures are compactly supported. Note that our approach does not guarantee the rate n^{−1/2}, which is minimax optimal and achieved by the empirical measure under some general conditions, e.g. [47,18,7]. Our approach gives the rate n^{−1/2} only if the prior puts sufficiently large mass around the KL neighborhood of p_0. This is mainly because our approach relies on the general framework of [20], in which the KL condition Π(K_n) ≳ e^{−nε_n^2} plays an important role in determining the rate. Also, note that the testing approach only gives sharp rates when the distance is compatible with the natural statistical distance of the model, the Hellinger distance in our case, see [27] for extensive discussion on this point.
In this regard, it might not be possible to obtain sharp rates based on the testing approach. Hence, a different approach would be necessary to achieve the rate n^{−1/2}, e.g. the approach in [27,48]. Another possibility would be to utilize the functional Bernstein-von Mises theorem. Specifically, the approach given in [8], combined with the Kantorovich-Rubinstein representation, might give the rate n^{−1/2}, at least for W_1, and further the limiting distribution of the posterior. Note that the above papers are limited to specific priors and to probability measures on a bounded set, while the present paper focuses on moment conditions for the posterior convergence rate. With these approaches, it would be highly interesting to investigate sufficient conditions for achieving the rate n^{−1/2}.
For p = 1, the additional logarithmic term in Theorem 3.2 can be eliminated if we assume a slightly stronger condition, which is satisfied if suitable moments are bounded a posteriori for some positive constants K and δ; see Theorem 8.9 for details. Since Theorem 8.9 relies on some technical assumptions, we defer its statement to Section 8 and provide here the simpler statement, Theorem 3.2. The proofs of these theorems are quite similar.
Finally, we note that moment conditions in Theorem 3.2 cannot be weakened to δ < 0 as shown in the following example.

Convergence rates with respect to W ∞
Since W_p(P, Q) is monotonically increasing in p, one may define W_∞(P, Q) = lim_{p→∞} W_p(P, Q) which, according to [24], corresponds to

W_∞(P, Q) = inf{ε > 0 : P(A) ≤ Q(A^ε) for every A ∈ R},

where A^ε = {x : |x − y| < ε for some y ∈ A} is the ε-enlargement of A and R is the set of all Borel subsets of R. This representation of W_∞ bears similarities with the Lévy-Prokhorov metric, which metrizes the weak convergence. The metric W_∞ induces a much stronger topology than the weak topology, even on a bounded metric space. On an unbounded space, if the tail indices of two probability measures P and Q are different, then W_∞(P, Q) is typically infinite. For example, if P and Q are Student's t-distributions with ν_1 and ν_2 degrees of freedom and ν_1 ≠ ν_2, then W_∞(P, Q) = ∞. Therefore, it is meaningless to study asymptotics with W_∞ on an unbounded space.
In this section, we assume that P_0 is supported on the unit interval [0, 1], and so are all probability measures in the support of the prior. Our benchmark assumption is inf_{x∈[0,1]} p_0(x) ≥ c_0 for some constant c_0 > 0, which is a necessary and sufficient condition for E[W_∞(P_n, P_0)] ≲ n^{−1/2}, see [7].
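For empirical measures on N equal-mass atoms the monotone pairing is optimal for every W_p, so W_∞ = lim_p W_p is simply the largest gap between order statistics. A minimal sketch under that assumption (the helper name is ours):

```python
# W_inf between two equal-size empirical measures: the monotone (sorted)
# pairing is optimal for every p, so the limit p -> infinity is the maximal
# gap between order statistics.
import numpy as np

def w_inf_empirical(x, y):
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.max(np.abs(x - y)))

x = [0.0, 0.4, 0.9]
y = [0.1, 0.5, 0.7]
print(w_inf_empirical(x, y))   # largest sorted gap: 0.2
```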

Examples
In this section, we consider the posterior moment condition (2.5) in two examples. In the first example, we illustrate the idea of a novel approach handling the second moment condition without full technical details. The approach relies on a special property of gamma distributions. The second example considers higher-order moments, and concrete posterior convergence rates are derived. Note that (2.5) holds trivially if the prior satisfies a uniform domination condition (5.1) for some K', where t_p is the density of the Student's t distribution with p degrees of freedom. Such a prior can easily be constructed by conditioning well-known priors on the event on the left-hand side of (5.1). Although the prior probability of this event would be close to 1 for most priors and large enough K', this conditioning might be unnatural in practice.

Mixture of gamma distributions
Suppose it is required to establish weak consistency alongside a functional constraint, such as

Π(∫ x^2 dP(x) ≤ K | X_1, ..., X_n) → 1 in probability,

for some finite K > 0, with the prior Π on density functions on (0, ∞). This would be for establishing W_1 consistency.
With the usual Kullback-Leibler support condition, we write the posterior as in (1.1), for some function α. Using standard arguments, the denominator is bounded below by e^{−nd}, a.s. for all large n, for any d > 0; see [39] for details about this argument. Also note that the P_0-expectation of the numerator over a set A equals Π(A). Hence, if A = {p : ∫ α^{−1} dP > K} and we construct the prior Π so that Π(A) is exponentially small in n, then, using the Markov inequality and the Borel-Cantelli lemma, Π(A | X_1, ..., X_n) → 0 a.s. We obtain the second moment result by taking α(x) = 1/x^2, and so we need to ensure for the prior that, for any ε > 0, there exists a K < ∞ such that ∫ x^2 dP > K implies ∫ x^{−2} dP < ε. Let κ_0 be the true second moment and assume we can construct the model p(x) such that for any ε > 0 there exists a K > 0 with this property.

This also implies K > κ_0. Such an example arises with the gamma distribution, so consider the gamma density p(x) with shape a and rate b, for which E[X^2] = a(1 + a)/b^2 and, for a > 2, E[X^{−2}] = b^2/((a − 1)(a − 2)). For a more general nonparametric model, consider a mixture of gamma distributions. We can assign priors to M and w, but to describe the prior for a = (a_1, ..., a_M) and b there is no loss of generality in fixing them. Now E[X^2] = Σ_{j=1}^M w_j a_j(1 + a_j)/b^2 and, if, as assumed, a_j > 2 + δ, then E[X^{−2}] = Σ_{j=1}^M w_j b^2/((a_j − 1)(a_j − 2)).
To obtain our required condition, we take the prior so that if a_j(1 + a_j)/b^2 > K for a single j, then it holds for all j. This ensures that if E[X^2] > K then a_j(1 + a_j)/b^2 > K for all j, and then it is also true that b^2/((a_j − 1)(a_j − 2)) < ξ/K for all j, for some fixed ξ < ∞. Indeed, ξ = (2 + δ)(3 + δ)/(δ(1 + δ)), so E[X^{−2}] < ξ/K. Hence, we take the prior for (a, b) to be a mixture of two components, where g_{c−} is a density on (0, c) and g_{c+} a density on (c, ∞) for some c. Here, g_{c−}(a_j | b) puts all its mass on a_j(1 + a_j) < cb^2 and g_{c+}(a_j | b) puts all its mass on a_j(1 + a_j) > cb^2. In practice, we can take c so large that the only part of the prior contributing to the posterior will be the g_{c−} component.
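The gamma moment identities underlying this construction, E[X^2] = a(1 + a)/b^2 and, for a > 2, E[X^{−2}] = b^2/((a − 1)(a − 2)), can be verified numerically (a sketch; the shape and rate values below are arbitrary):

```python
# Numerical check of the gamma moment identities for X ~ Ga(a, rate b):
# E[X^2] = a(1+a)/b^2 and, for a > 2, E[X^-2] = b^2/((a-1)(a-2)).
import numpy as np
from scipy.stats import gamma

a, b = 4.5, 2.0
dist = gamma(a, scale=1.0 / b)   # SciPy parametrizes by shape and scale
m2 = dist.expect(lambda x: x ** 2)       # numerical E[X^2]
m_neg2 = dist.expect(lambda x: x ** -2)  # numerical E[X^-2]
print(np.isclose(m2, a * (1 + a) / b ** 2))              # True
print(np.isclose(m_neg2, b ** 2 / ((a - 1) * (a - 2))))  # True
```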

Dirichlet process mixture
Consider a Dirichlet process mixture prior, p(x) = ∫ φ_σ(x − θ) dG(θ) with G ∼ DP(αH), where DP(αH) denotes the Dirichlet process with base measure αH, φ_σ(x) = σ^{−1}φ(x/σ) and φ is the standard normal density. In practice, an inverse gamma prior is usually imposed on σ^2, but we consider a fixed sequence σ = σ_n → 0 for technical convenience. Note that the sequence σ_n controls the convergence rate. Specifically, with a suitable sequence σ_n, one can prove that Π(K_n) ≥ e^{−nε_n^2}, see [21,40]. While these papers extensively studied posterior convergence rates with respect to the Hellinger metric, posterior moments have not been studied thoroughly. Only Section 8 of [21] briefly touched on the tail mass of the posterior distribution. However, their result relies on the assumption that P_0 is compactly supported, and cannot be directly used to bound the posterior moments.
Note that the posterior moment condition (2.5) can be reduced to a control of tail masses: writing B_m for dyadic tail sets of the form {x : |x| > 2^m}, for any probability measure P and p ≥ 1, (2.5) is implied by suitable bounds on P(B_m). If 2^{−pm} is larger than the convergence rate ε_n, then the posterior probability that P(B_m) ≲ 2^{−pm} will be close to 1 for large enough n with high P_0-probability. When 2^{−pm} is smaller than ε_n, however, one cannot bound P(B_m) by 2^{−pm} because the convergence rate ε_n is larger than 2^{−pm}. In this case, the prior must play a role, that is, the prior probability that P(B_m) ≳ 2^{−pm} should be small. In fact, this prior probability should be exponentially small, of order e^{−cnε_n^2} for some constant c > 0, to guarantee that the posterior probability also decays, cf. Lemma 8.2. To this aim we will make use of the fact that G(B_m) ∼ Be(αH(B_m), α(1 − H(B_m))) for every m ≥ 0, which in particular implies that the prior expectation of G(B_m) equals H(B_m). If H is a normal distribution (any H with sub-Gaussian tails would actually work), the prior expectation of G(B_m) is much smaller than 2^{−pm} for every large enough m.
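The property that the prior expectation of G(B_m) equals H(B_m) can be illustrated with a truncated stick-breaking draw from DP(αH); the truncation level, the value of α, and the tail set below are our illustrative choices, not values from the paper:

```python
# Stick-breaking sketch of G ~ DP(alpha, H): averaging G(B) over many draws
# recovers H(B), here with H = N(0, 1) and the tail set B = (1, inf).
import numpy as np

def dp_stick_breaking(alpha, h_sampler, truncation, rng):
    """Return atoms and weights of a truncated DP(alpha, H) draw."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * remaining
    weights /= weights.sum()          # renormalize the truncated weights
    return h_sampler(truncation), weights

rng = np.random.default_rng(2)
draws = []
for _ in range(2000):
    atoms, weights = dp_stick_breaking(5.0, lambda k: rng.normal(size=k), 200, rng)
    draws.append(weights[atoms > 1.0].sum())   # G(B) for this draw
est = float(np.mean(draws))
print(est)   # close to H((1, inf)) = 1 - Phi(1), about 0.159
```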
If the prior and P_0 satisfy the conditions of Theorem 5.1, the posterior distribution contracts at the rate n^{−2/5}, up to a logarithmic factor, with respect to W_p^p for p < 2. If P_0 possesses a smoother density, it is possible to prove consistency of higher-order moments, see Lemma 8.13 for more details.
If we impose a prior on σ^2, it can be deduced from the proof that the assertion of Lemma 8.13 remains valid provided that σ^2 is bounded a posteriori, that is, Π(σ^2 > K | X_1, ..., X_n) → 0 for some constant K > 0. Under mild assumptions, the posterior distribution of σ^2 will concentrate around 0 unless P_0 itself is a location mixture of normal distributions. If P_0 is a location mixture of normal distributions, the posterior probability that σ^2 > σ_0^2 + ε vanishes for every ε > 0, where σ_0^2 is the true parameter.

Numerical study
Although the theoretical results in previous sections provide reasonable sufficient conditions for Wasserstein consistency, those conditions are not easy to verify in practice. With a DP mixture prior, for example, the rate ε_n determined by Π(K_n) ≥ e^{−nε_n^2} plays an important role for consistency with respect to W_p. However, it is very difficult to find the exact rate ε_n satisfying Π(K_n) ≍ e^{−nε_n^2}. Note also that if P_0 has unbounded support, the posterior distribution is typically inconsistent with respect to W_∞. Since W_p ↑ W_∞ as p ↑ ∞, the posterior distribution will be consistent with respect to W_p only for small values of p, where the threshold value depends on ε_n. Perhaps the most interesting cases are p = 1 and p = 2, so in this section we empirically show that the posterior distribution tends to be consistent with respect to W_1 and W_2 under popularly used priors.
There are several computational algorithms for sampling from a posterior distribution based on a Dirichlet process mixture prior, see [35,29] and references therein. Unfortunately, given a posterior sample P, it is very difficult to compute the Wasserstein distance W_p(P, P_0), see Theorem 3 of [31]. Instead of directly calculating W_p(P, P_0), we can easily generate a Markov chain sample Y_1, ..., Y_N from the posterior predictive distribution ∫ p(x) dΠ(p | X_1, ..., X_n). Then, the corresponding empirical distribution P̂_N can be used as a proxy for the posterior predictive distribution. Note that the empirical distribution from an ergodic Markov chain, like that from an iid sample, contracts to the stationary distribution with respect to the Wasserstein metrics, see [18]. However, it is still not easy to compute W_p(P̂_N, P_0). To evaluate W_p(P̂_N, P_0), we first approximate P_0 by a discrete measure Q_M and find W_p(P̂_N, Q_M). If M is a multiple of N, one can easily find the exact value of W_p(P̂_N, Q_M) based on the following lemma taken from [7].

Lemma 6.1. For two given collections of real numbers x_1 ≤ · · · ≤ x_N and y_1 ≤ · · · ≤ y_N, let P and Q be the corresponding empirical measures. Then, for any p ≥ 1,

W_p^p(P, Q) = N^{−1} Σ_{i=1}^N |x_i − y_i|^p.

For a non-symmetric P_0, a similar approximation Q_M can be obtained after replacing the origin by the median. For various true distributions (standard uniform, standard normal, Laplace, and Student's t with 20, 10 and 5 degrees of freedom), the approximation error, i.e. the upper bound of W_p(P_0, Q_M), is depicted in Figure 1. When p = 1 and p = 2, the approximation of P_0 by Q_M is quite accurate in all cases. On the other hand, for p = 4 and p = 8, the approximation is not reliable unless the support of the true distribution is bounded.
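The evaluation described above can be sketched as follows: when M = kN, each atom of the N-point empirical measure is replicated k times and Lemma 6.1 applies to the two resulting M-point samples (the function names are ours):

```python
# Exact W_p^p between empirical measures via Lemma 6.1, including the case
# where one sample size is a multiple of the other.
import numpy as np

def wpp_empirical(x, y, p):
    """W_p^p between two equal-size empirical measures (Lemma 6.1)."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(x - y) ** p))

def wpp_multiple(x, y, p):
    """W_p^p when len(y) is a multiple of len(x): replicate the atoms of x."""
    k = len(y) // len(x)
    return wpp_empirical(np.repeat(np.sort(x), k), y, p)

x = [0.0, 1.0]            # N = 2 atoms
y = [0.0, 0.2, 0.9, 1.1]  # M = 4 atoms
print(wpp_multiple(x, y, p=2))   # exact W_2^2: 0.015
```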
With the above six true distributions, we generated n = 50, 100, 200, . . . , 6400 samples and obtained N = 10^4 MCMC samples from the posterior predictive distributions after 1000 burn-in iterations. Then, we evaluated the Wasserstein distance W_p(P̂_N, Q_M) between the empirical distribution P̂_N of the MCMC sample and the discrete approximation Q_M of P_0 with M = 2 × 10^5. We considered p = 1 and p = 2 only, because the approximation by Q_M is not reliable for large p. We repeated the above procedure 100 times, and the medians over the 100 repetitions are depicted in Figures 2 and 3. As can be seen, the posterior predictive distributions become closer to the approximation Q_M of the true distribution as the sample size increases. Interestingly, it seems that the location-scale mixture prior also gives consistent posterior distributions with respect to both W_1 and W_2 in all cases. Figure 4 shows similar results for a location mixture prior with a different hyperparameter H = N(0, 10^4). Note that a normal distribution with large variance is a natural choice for H in practice. The results in Figure 4 show that the posterior distribution seems to be consistent with respect to W_2, but more samples are needed to dominate the prior probabilities on the tail. This is because some posterior predictive samples may be very large when the number of observations is small, and W_2(P̂_N, Q_M) is more sensitive to these large samples than W_1(P̂_N, Q_M).
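A minimal version of this pipeline can be sketched as follows. This is our own illustration: we use direct draws from P_0 = Uniform(0, 1) in place of MCMC output, and the quantile construction of Q_M is one natural choice that may differ in detail from the paper's construction.

```python
import numpy as np

def quantile_atoms(ppf, M):
    """Discrete approximation Q_M of P_0: mass 1/M at the (i - 1/2)/M
    quantiles (one natural construction; the paper's Q_M may differ)."""
    return ppf((np.arange(M) + 0.5) / M)

def w_p_proxy(sample, atoms, p=1.0):
    """Exact W_p between the empirical measure of `sample` (size N) and
    the uniform measure on `atoms` (size M, a multiple of N), via the
    sorted matching after replicating each sample atom M/N times."""
    N, M = len(sample), len(atoms)
    x = np.repeat(np.sort(np.asarray(sample, dtype=float)), M // N)
    return float(np.mean(np.abs(x - np.sort(atoms)) ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
Q = quantile_atoms(lambda u: u, 200000)        # P_0 = Uniform(0, 1)
d_100 = w_p_proxy(rng.uniform(size=100), Q)    # proxy sample of size 100
d_1000 = w_p_proxy(rng.uniform(size=1000), Q)  # proxy sample of size 1000
```

As the proxy sample grows, W_1(P̂_N, Q_M) shrinks toward the (tiny) discretization error of Q_M, mirroring the behavior seen in Figures 2 and 3.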

Discussion
In this paper, we provided sufficient conditions for posterior consistency with respect to the Wasserstein metrics and for the convergence rate to be ε_n, in addition to the well-known KL conditions. Based on our main theorem, the posterior probability that W_p^p(P, P_0) exceeds a constant multiple of ε_n vanishes if M_{2p+δ}(P) is bounded by a constant for some δ > 0 with high posterior probability. A similar moment condition was used in [18] to show that W_p^p(P_n, P_0) is of order n^{−1/2} with high probability. The moment condition cannot be weakened in general, as illustrated in our examples. Under the stronger condition (3.2), which is a necessary and sufficient condition for W_p(P_n, P_0) to be of order n^{−1/2}, we conjecture that the posterior probability that W_p(P, P_0) exceeds a constant multiple of ε_n would vanish. We note that the asymptotic results given in this paper might be utilized to obtain posterior consistency and convergence rates with respect to strong metrics such as the total variation. For this, the key is to obtain a posterior convergence rate in the Wasserstein metric and to bound the total variation between smooth densities by a power of the Wasserstein metric. More precisely, if P and Q possess smooth Lebesgue densities p and q, one can prove that ‖p − q‖_1 is bounded by a constant multiple of W_p^α(P, Q) for some α > 0, see [13] for a sharp inequality. This is a certain reverse inequality, because the total variation generates a stronger topology than W_p on the space of all probability measures on a bounded metric space. This kind of reverse inequality and the related theory of posterior consistency, first considered in [11], are the main motivation of the present paper. We conclude by discussing an example in which total variation consistency together with a mild condition implies Wasserstein consistency; this is a non-trivial finding. For a given kernel density function k on R, consider a location mixture, often called a convolution, of the form p(x) = ∫ k(x − z) dG(z). (7.1) A prior Π on p can then be induced from a prior on the mixing distribution G.
With slight abuse of notation, we use the same notation Π for the prior of G. Suppose that the true distribution is also of the form (7.1), that is, p_0(x) = ∫ k(x − z) dG_0(z) for some probability measure G_0. In this case, posterior consistency with respect to the total variation automatically implies consistency in W_2. Suppose that k is symmetric about the origin, ∫ x^2 k(x) dx < ∞, and that k̂(t) ≠ 0 for every t ∈ R, where k̂ is the Fourier transform of k defined as k̂(t) = ∫ e^{−itx} k(x) dx. Then, EΠ(‖p − p_0‖_1 > ε | X_1, . . . , X_n) → 0 for every ε > 0 implies that EΠ(W_2(P, P_0) > ε | X_1, . . . , X_n) → 0 for every ε > 0.
Note that the condition Π(M_2(G) < ∞) = 1 is easily satisfied for well-known priors. For example, if we place a Dirichlet process prior on G, the tail of G is much lighter than that of its mean measure, see [16]. Posterior consistency with respect to the total variation can also be established easily using standard techniques.
Here, R_{B_m}P denotes the normalized restriction of P onto B_m, defined by R_{B_m}P(F) = P(F ∩ B_m)/P(B_m) or R_{B_m}P(F) = 0 according as P(B_m) > 0 or P(B_m) = 0.
To give some insight into the overall proofs, we next describe how one can obtain the consistency of the empirical distribution with respect to W_p. Suppose for a moment that P_0 is supported on [−1, 1]. Then, Lemma 8.3 implies that if |P_n(F) − P_0(F)| is sufficiently small for every F ∈ P_l and l ≤ L, where L is a large enough constant, then W_p(P_n, P_0) will also be small. Since there are various tools to bound the deviation |P_n(F) − P_0(F)|, e.g. the inequality of [26], it is not difficult to prove, with the help of Lemma 8.3, that the empirical distribution converges to P_0 in probability with respect to W_p, 1 ≤ p < ∞.
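This empirical consistency is easy to see numerically. The sketch below is our own illustration (not the paper's proof device): on the real line, W_1(P, Q) equals the L1 distance between the CDFs, so W_1(P_n, P_0) can be approximated on a fine grid for P_0 = Uniform(0, 1), whose CDF is F_0(t) = t.

```python
import numpy as np

# W_1(P, Q) = int |F_P(t) - F_Q(t)| dt on the real line, so for
# P_0 = Uniform(0, 1) we can approximate W_1(P_n, P_0) by a Riemann sum
# of |F_n(t) - t| over a fine grid of [0, 1].
def w1_vs_uniform(sample, grid_size=100001):
    t = np.linspace(0.0, 1.0, grid_size)
    # empirical CDF F_n(t) evaluated on the grid
    Fn = np.searchsorted(np.sort(sample), t, side="right") / len(sample)
    return float(np.sum(np.abs(Fn - t)) * (t[1] - t[0]))

rng = np.random.default_rng(3)
w_small = w1_vs_uniform(rng.uniform(size=50))    # n = 50
w_large = w1_vs_uniform(rng.uniform(size=5000))  # n = 5000
```

With the larger sample the distance is an order of magnitude smaller, consistent with the n^{−1/2}-type behavior discussed in the paper.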
In case P_0 has an unbounded support, Lemma 8.5 can be applied for the Wasserstein consistency of P_n. Indeed, if |P_n(π_m^{−1}(F)) − P_0(π_m^{−1}(F))| is sufficiently small for every F ∈ P_l, l ≤ L and m ≤ M, where L and M are large constants, then W_p(P_n, P_0) will be small. Note that L and M can be chosen as large but fixed constants, so the consistency of P_n can be proven similarly using a large deviation inequality such as Hoeffding's inequality. Here, the convergence of M_p(P_n) to M_p(P_0) by the law of large numbers plays an important role, because once the pth moments of P_n and P_0 are bounded, it is relatively easy to prove Wasserstein consistency; see the proof of Theorem 2.2 and Lemma 8.6 for details.

Frequently used results from the literature
The KL condition (1.4) gives a suitable lower bound on the integrated likelihood, that is, the denominator in (1.1). Once this condition holds, the posterior probability of a sequence of subsets F_n of F can be shown to converge to 1 if the prior probability of F_n^c is sufficiently small or the likelihood on F_n^c is suitably controlled; the latter can often be expressed through the existence of a certain sequence of uniformly consistent tests. Lemmas 8.1 and 8.2 are taken from [20] with slight modifications for simplicity. The rate sequence ε_n is assumed to satisfy ε_n → 0 and nε_n^2 → ∞.
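For reference, the posterior in (1.1) and the KL condition (1.4) can be written out as follows; this is the standard formulation and should match the paper's displays up to notation:

```latex
% Posterior distribution (cf. (1.1)):
\Pi(A \mid X_1,\dots,X_n)
  = \frac{\int_A \prod_{i=1}^{n} p(X_i)\, d\Pi(p)}
         {\int_{\mathcal{F}} \prod_{i=1}^{n} p(X_i)\, d\Pi(p)} .
% KL neighborhood and prior-mass (KL) condition (cf. (1.4)):
K_{\epsilon} = \Bigl\{\, p : \int p_0 \log\frac{p_0}{p} < \epsilon^2 \Bigr\},
\qquad
\Pi\bigl(K_{\epsilon_n}\bigr) \ge e^{-n\epsilon_n^2}.
```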
Lemma 8.1. Suppose that Π(K_{ε_n}) ≥ e^{−nε_n^2}, and assume that there exists a sequence of tests φ_n such that P_0 φ_n → 0 and sup_{P ∈ F_n^c} P(1 − φ_n) ≤ e^{−Cnε_n^2} for a sufficiently large constant C > 0 and sets F_n ⊂ F. Then, Π(F_n^c | X_1, . . . , X_n) → 0 in probability.
The following lemmas are taken from [18] with slight modification, see also [15,46]. Since the statement of Lemma 8.3 is slightly different from those papers, we provide a detailed proof for the reader's convenience.

Lemma 8.3. Assume that two probability measures P and Q are supported on (−1, 1]. Then,

Proof. For a Borel partition {A_k : k ≥ 1} of a Borel set A ⊂ R and two finite measures P and Q on A with equal mass, define the finite measure P′ by P′|_{A_k} = (Q(A_k)/P(A_k)) P|_{A_k} for each k. Here, P|_{A_k} and P′|_{A_k} denote the restrictions of P and P′, respectively, onto A_k. We say P′ is the {A_k : k ≥ 1}-approximation of P to Q. Then, we have the following lemma, whose proof is given explicitly in [15] (pp. 1189-1190).
Lemma 8.4. Suppose that the {A_k : k ≥ 1}-approximation P′ of P to Q is well-defined. Then, there exists a coupling ξ of P and P′ such that For l ≥ 0, let P^l be the P_l-approximation of P to Q. We only consider the case in which P^l is well-defined for all l ≥ 0; the other case can be handled with further details, see Proposition 1 of [46].
Since P^l(F) = Q(F) for F ∈ P_l, we have W_p(P^l, Q) ≤ 2^{−(l−1)} for every l ≥ 0. Furthermore, it is easy to check that, for F ∈ P_l, P^l(F) = P^{l+1}(F) and P^{l+1}|_F is the {C ∈ P_{l+1} : C ⊂ F}-approximation of P^l|_F to Q|_F. Therefore, by Lemma 8.4, there exists a coupling ξ_{l+1} of P^l and P^{l+1} such that It follows that there exist random variables Z_0, Z_1, Z_2, . . . on the same probability space, say (S, S, μ), such that and Z_l is marginally distributed as P^l. Let N = inf{l : Z_{l+1} = Z_l}, where the infimum of the empty set is taken to be infinity. Then, conditional on the event {N = l} with l < L, where L is a fixed positive integer, we have with probability one. It follows that Therefore,
Lemma 8.5. For two probability measures P and Q on R, Proof. The proof is explicitly given in [18] (pp. 714-715).

Proof of Theorem 2.1
Since the KL condition (1.3) holds, the posterior distribution is consistent with respect to the Lévy-Prokhorov metric d_P, see Theorem 6.25 of [22]. Therefore, there exists a real sequence ε_{1n} ↓ 0 such that Π(d_P(P, P_0) > ε_{1n} | X_1, . . . , X_n) → 0 in probability.

Now, suppose that (2.1) holds. Then, in a similar way, we can construct a sequence ε_{2n} ↓ 0 such that Π(F_n | X_1, . . . , X_n) → 1 in probability, where F_n = {P : d_P(P, P_0) ≤ ε_{1n}, |M_p(P) − M_p(P_0)| ≤ ε_{2n}}, and choose P_n ∈ F_n such that W_p(P_n, P_0) ≥ sup_{P ∈ F_n} W_p(P, P_0) − 1/n. Note that (P_n) is a non-random sequence of probability measures such that d_P(P_n, P_0) → 0 and M_p(P_n) → M_p(P_0). It follows that W_p(P_n, P_0) → 0. Since Π(F_n | X_1, . . . , X_n) → 1 in probability, we conclude that (2.2) holds.
Conversely, suppose that (2.2) holds. Then, similarly as before, we can construct a sequence ε_{3n} ↓ 0 such that Π(F_n | X_1, . . . , X_n) → 1 in probability, where F_n = {P : W_p(P, P_0) ≤ ε_{3n}}, and choose P_n ∈ F_n such that |M_p(P_n) − M_p(P_0)| ≥ sup_{P ∈ F_n} |M_p(P) − M_p(P_0)| − 1/n. Again, (P_n) is a non-random sequence with W_p(P_n, P_0) → 0, so we have |M_p(P_n) − M_p(P_0)| → 0. Since Π(F_n | X_1, . . . , X_n) → 1 in probability, we conclude that (2.1) holds.

Proof of Theorem 2.2
We first provide a simple proof relying on a stronger moment condition than the one in the statement of Theorem 2.2. For this, we assume that M_{2p}(P_0) ≤ K and Π(P : M_{2p}(P) ≤ K | X_1, . . . , X_n) → 1 in probability.

To this aim, it suffices to show that inf_{P ∈ C_1} H(P_0, P) > 0 and inf_{P ∈ C_2} H(P_0, P) > 0. For P ∈ C_1 ∪ C_2, the Cauchy-Schwarz inequality yields an upper bound whose integral term is itself controlled by virtue of √(p p_0) ≤ (p + p_0)/2. Hence, we get a lower bound of the form H(P_0, P) ≥ cε for a constant c > 0 depending only on K. Now, we prove Theorem 2.2 without the moment condition of order 2p.

Lemma 8.6. For positive constants ε, δ and K, assume that (8.2) and (8.3) hold, where M and L are positive integers. Then W_p^p(P, Q) is bounded by K′ε, where K′ is a constant depending only on δ, K and p.

Proof. Since W_p^p(R_{B_m}P, R_{B_m}Q) ≤ 2^p and (8.2) holds, the summation on the right-hand side of (8.1) over m > M is bounded by c_1 2^{−δM}, where c_1 is a constant depending only on δ, K and p. Therefore, W_p^p(P, Q) is bounded accordingly. It follows that the claimed bound holds, where the second inequality uses (8.2), (8.3) and the fact that the cardinality of P_l is 2^l. Here, c_2 is a constant depending only on δ, K and p.
By (5.4), we have that Π(F_0 | X_1, . . . , X_n) → 1 in probability. Suppose that a sufficiently small ε > 0 is given. We will prove that, for some function g : (0, ∞) → (0, ∞) with g(ε) ↓ 0 as ε ↓ 0,

Π(W_p^p(P, P_0) ≥ g(ε) | X_1, . . . , X_n) → 0 in probability. (8.5)

Let L and M be the largest integers less than or equal to log_2 ε^{−1} and (log_2 ε^{−1})/(2p), respectively. Then, by Lemma 8.6 and (8.6), there exists a constant c_1, depending only on δ, K and p, such that the bound defining g holds. Certainly, g(ε) ↓ 0 as ε ↓ 0. Since Π(F_0^c | X_1, . . . , X_n) → 0 in probability and the KL condition (1.3) holds, by Schwartz's theorem (see Theorem 6.25 of [22] if Π depends on n), it is sufficient for (8.5) to construct a sequence φ_n of tests such that the sum of P_0 φ_n and the supremum of the type II errors is bounded by e^{−cn} for some constant c > 0 and every large enough n. (8.7) Then, by Hoeffding's inequality, the type I error is exponentially small. Also, for P ∈ F_{m,F,+}^c, the type II error is exponentially small by Hoeffding's inequality, and similarly for P ∈ F_{m,F,−}^c. Therefore, if we define φ_n as the maximum of these tests, then, since L, M and ε do not depend on n, φ_n satisfies (8.7) for some c > 0 and every large enough n, which completes the proof.
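The engine of this test construction is Hoeffding's inequality for the cell frequencies. A quick Monte Carlo sanity check (our own illustration, not from the paper) confirms the bound that controls the error of the test φ = 1{|P_n(F) − P_0(F)| > ε}:

```python
import numpy as np

# Monte Carlo check of Hoeffding's inequality, which drives the tests
# phi = 1{ |P_n(F) - P_0(F)| > eps }:
#   P_0( |P_n(F) - P_0(F)| > eps ) <= 2 exp(-2 n eps^2).
rng = np.random.default_rng(2)
n, eps, reps = 200, 0.1, 5000
p0 = 0.3                                   # P_0(F) for a fixed Borel set F
hits = rng.random((reps, n)) < p0          # indicators 1{X_i in F}
deviations = np.abs(hits.mean(axis=1) - p0)
type1_rate = float(np.mean(deviations > eps))   # Monte Carlo error rate
hoeffding = 2.0 * np.exp(-2.0 * n * eps ** 2)   # Hoeffding upper bound
```

Here the empirical rate falls well below the bound 2e^{−2nε^2} ≈ 0.037, as the inequality guarantees for any fixed set F.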

Proof of Theorem 3.1
For a given sequence δ_n, let C_n = C_{n,1} ∪ C_{n,2}. Then, it can be shown, as in the first proof of Theorem 2.2, that the required bounds hold. For any measurable set C, let Π_n^C be the posterior distribution restricted and renormalized onto C. Then, we have the corresponding representation, where G_{n−1} is the σ-algebra generated by X_1, . . . , X_{n−1}. Since C_{n,j} is convex, we have H^2(p_0, p̄_{n−1}^{C_{n,j}}) ≥ δ_n^2/(4K) for j = 1, 2. Therefore, for all large enough n, where c_1 > 0 is a constant depending only on K, it follows that L_n(C_{n,j}) is bounded above by e^{−c_2 nδ_n^2} with probability tending to 1 for some constant c_2. Thus, if we take δ_n = K′ε_n for a large enough constant K′, the proof is complete.

Lemma 8.7. For positive constants α, δ and K, assume that (8.8) and (8.9) hold. Then the conclusion holds with a constant K′ depending only on α, K and p. If p > 1, condition (8.9) can be replaced by the slightly weaker condition (8.10).

Proof. By (8.8) and the fact that W_p^p(R_{B_m}P, R_{B_m}Q) ≤ 2^p, the summation on the right-hand side of (8.1) over m > M is bounded by a constant multiple of 2^{−(p+δ)M}.
where the inequality holds by (8.9) with l = 0. Therefore, by Lemma 8.5, the desired bound holds, where K′ is a constant depending only on α, K and p. By (8.4), the summation in the last display is suitably bounded. Since the cardinality of P_l is 2^l and Σ_{l=1}^∞ (l + 1)^{−2} < ∞, the first assertion follows from (8.8) and (8.9).

If p > 1 and (8.9) is replaced by (8.10), we obtain the same conclusion with a different constant K′.
Before proving Theorem 3.2, we state and prove a similar result. Theorem 8.9 below is devised to eliminate the logarithmic term log ε_n^{−1} in Theorem 3.2. The proofs of Theorems 3.2 and 8.9 are quite similar, so we do not provide all details, to avoid repetition. We give a detailed proof only for Theorem 8.9, because it contains the most technical part, caused by the factors (l + 1)^{−2} and (l + 1)^{−4}. These factors appear in handling the last term in (8.11). If p > 1, we need not consider these factors, by the second assertion of Lemma 8.7. For p = 1, we can avoid the technical factors (l + 1)^{−2} and (l + 1)^{−4} at the cost of an additional logarithmic factor in the rate. If we wished to eliminate the term log ε_n^{−1}, the statement would become as complicated as that of Theorem 8.9. For conciseness, we decided to include Theorem 3.2 in the main text rather than Theorem 8.9.

Theorem 8.9. Assume that the prior Π satisfies the KL condition (1.4) for a sequence ε_n with ε_n ↓ 0 and ε_n ≥ (log n)/n. Furthermore, assume that there exist positive constants K and δ such that the corresponding conditions hold, where L is the largest integer less than or equal to (log_2 ε_n^{−1})/p. Then, for some constant K′ > 0,

Π(W_p^p(P, P_0) ≥ K′ε_n | X_1, . . . , X_n) → 0 in probability. (8.12)

Proof. Let M be the largest integer less than or equal to (p + δ)^{−1} log_2 ε_n^{−1}. Let α > 0 be a sufficiently small constant such that (1 + α/2)^{2p} < 2^δ. For m ≤ M and F ∈ P_l with l ≤ L, define the sets F_n accordingly, where K_1 > 0 is a large constant described below. Then, by Lemma 8.7, membership in these sets implies that W_p^p(P, P_0) ≤ K_2 ε_n for some constant K_2. Since Π(K_{ε_n}) ≥ e^{−nε_n^2}, by Lemma 8.1 it is sufficient for (8.12) to construct a sequence φ_n of tests such that P_0 φ_n → 0 and the type II error bound (8.13) holds for every large enough n. For m ≤ M and F ∈ P_l with l ≤ L, define the corresponding tests; then Lemma 8.8 applies.
Since ε_n ≥ (log n)/n, the required bound holds as n → ∞, provided that K_1 is large enough. Also, if K_1 is sufficiently large, the corresponding bound holds for P ∈ F_{m,F,+}^c with F ∈ P_l, and φ_n satisfies (8.13) for all large enough n. This completes the proof.
Proof of Theorem 3.2 for p > 1. We first claim that if the two displayed conditions hold, then (8.12) holds for some constant K′. The proof of this claim is the same as that of Theorem 8.9 if we replace F_n by F_0 and eliminate the factors (l + 1)^{−2} and (l + 1)^{−4} from all equations, which is possible by the second assertion of Lemma 8.7.
Once we adjust the constant K, the two conditions of the claim are satisfied by (5.4). Hence the proof is complete.
Proof of Theorem 3.2 for p = 1. If (8.8) and (8.10) hold with p = 1, then the corresponding bound holds. This can be proved as in Lemma 8.7; the only difference is that the last term in (8.11) is bounded differently, with a constant K′ depending only on α, K and p.
As in the case p > 1, we next claim that the two corresponding conditions imply (8.12). Once we adjust the constant K, the two conditions of the claim are satisfied by (5.4). Hence the proof is complete.

Proof of Theorem 4.1
Let F_ε = {P ∈ F_0 : W_∞(P, P_0) ≤ ε}. We will show that, for every small enough ε ≥ K_1 √((log n)/n) and every n ≥ n_0, there exists a test φ such that P_0 φ ≤ e^{−K_2 nε^2} together with the corresponding type II error bound, (8.15) where K_1, K_2 and n_0 are constants depending only on c_0. Since Π(K_{ε_n}) ≥ e^{−nε_n^2}, (8.15) and Lemma 8.1 guarantee (4.1) for a large enough constant K > 0.
Let ε > 0 be given, and let N be the smallest integer greater than or equal to the quantity in (8.16). We first claim that, for P ∈ F_0, |P(I_{jk}) − P_0(I_{jk})| ≤ c_0 ε^2 for every j and k implies that W_∞(P_0, P) ≤ 2ε. (8.17) Therefore, we can choose constants K_1, K_2 > 0 and n_0 such that if ε ≥ K_1 √((log n)/n), then the right-hand side of (8.17) is bounded by exp(−K_2 nε^2) for every n ≥ n_0. This completes the proof of (8.15). It is easy to show that there exist constants a_1 and a_2 such that the stated bound holds for every ε ∈ (0, 1]. The assertion follows because e^{−1} ≤ e^{−x} ≤ 1 for every x ∈ (0, 1] and ∫_0^1 x^{ε−1} dx = ε^{−1}.

Lemma 8.11. Suppose that X ∼ Beta(αε, α(1 − ε)), αε ≤ 1 and α(1 − ε) ≥ 1. Then, the stated bound holds, where C_α is a constant depending only on α.
Proof. Let p be the pdf of X. By Lemma 8.10, the claimed bound follows, where c_α is a constant depending only on α.
Proof. Let C > 0 be given. For each n and m ≤ C log ε_n^{−1}, let ψ_m = 1{|P_n(B_m) − P_0(B_m)| > K′ε_n/2}, where K′ is a universal constant described below. Using Hoeffding's inequality, it is not difficult to prove the corresponding error bounds. Denote p_G(x) = ∫ k_σ(x − z) dG(z). We use a result of [36]. It is shown in the proof of Theorem 2 of [36] that W_2^2(G, G_0) ≤ C‖p_G − p_{G_0}‖_1^{2(s−2)/(1+2s)} for any s > 2 and any G with M_2(G) < ∞, where C is a constant depending only on s. Note that Theorem 2 of [36] assumed that G and G_0 are discrete probability measures with bounded supports, but finiteness of the second moment suffices, as discussed therein. The right-hand side of the last display tends to zero as ‖p_G − p_{G_0}‖_1 → 0. It follows that, for every ε > 0, Π(W_2^2(G, G_0) > ε | X_1, . . . , X_n) → 0.