Multivariate goodness-of-fit tests based on Wasserstein distance

Goodness-of-fit tests based on the empirical Wasserstein distance are proposed for simple and composite null hypotheses involving general multivariate distributions. This includes the important problem of testing for multivariate normality with unspecified mean vector and covariance matrix and, more generally, testing for elliptical symmetry with given standard radial density and unspecified location and scatter parameters. The calculation of test statistics boils down to solving the well-studied semi-discrete optimal transport problem. Exact critical values can be computed for some important particular cases, such as null hypotheses of ellipticity with given standard radial density and unspecified location and scatter; else, approximate critical values are obtained via parametric bootstrap. Consistency is established, based on a result on the convergence to zero, uniformly over certain families of distributions, of the empirical Wasserstein distance---a novel result of independent interest. A simulation study establishes the practical feasibility and excellent performance of the proposed tests.


Introduction
Wasserstein distances are metrics on spaces of probability measures with certain finite moments. They measure the distance between two such distributions by the minimal cost needed to move probability mass in order to transform one distribution into the other one. Wasserstein distances have a long history and continue to attract interest from diverse fields in statistics, machine learning and computer science, in particular image analysis; see for instance the monographs and reviews by Santambrogio (2015), Peyré and Cuturi (2019), and Panaretos and Zemel (2019).
A natural application of any meaningful distance between distributions is to the goodness-of-fit (GoF) problem, namely, the problem of testing the null hypothesis that a sample comes from a population with fully specified distribution P_0 or with unspecified distribution within some postulated parametric model M. GoF problems certainly are among the most fundamental and classical ones in statistical inference. Typically, GoF tests are based on some distance between the empirical distribution P̂_n and the null distribution P_0 or an estimated distribution in the model M. The most popular ones are the Cramér-von Mises (Cramér, 1928; von Mises, 1928) and Kolmogorov-Smirnov (Kolmogorov, 1933; Smirnov, 1939) tests, involving distances between the cumulative distribution function of P_0 and the empirical one. Originally defined for univariate distributions only, they have been extended to the multivariate case, for instance in Khmaladze (2016), who proposes a test that has nearly all properties one could wish for, including asymptotic distribution-freeness, but whose implementation is computationally quite heavy and quickly becomes intractable.
Many other distances have been considered in this context, though. Among them, distances between densities (after kernel smoothing) have attracted much interest, starting with Bickel and Rosenblatt (1973) in the univariate case. Bakshaev and Rudzkis (2015) recently proposed a multivariate extension; the choice of a bandwidth matrix, however, dramatically affects the outcome of the resulting testing procedure. Fan (1997) considers a distance between characteristic functions, which accommodates arbitrary dimensions; the idea is appealing but the estimation of the integrals involved in the distance seems tricky. McAssey (2013) proposes a heuristic test that relies on a comparison of the empirical Mahalanobis distance with a simulated one under the null. Still in a multivariate setting, Ebner, Henze and Yukich (2018) define a distance based on sums of powers of weighted volumes of kth nearest neighbour spheres.
The use of the Wasserstein distance for GoF testing has been considered mostly for univariate distributions (Munk and Czado, 1998; del Barrio et al., 1999; del Barrio et al., 2000; del Barrio, Giné and Utzet, 2005). For the multivariate case, available methods are restricted to discrete distributions (Sommerfeld and Munk, 2018) and Gaussian ones (Rippl, Munk and Sturm, 2016). Indeed, serious difficulties, both computational and theoretical, hinder the development of Wasserstein GoF tests for general multivariate continuous distributions, particularly in the case of composite null hypotheses. Composite null hypotheses are generally more realistic than simple ones. Of particular practical importance is the case of location-scale families: tests of multivariate Gaussianity, tests of elliptical symmetry with given standard radial density, etc., belong to that type. Although the asymptotic null distribution of empirical processes with estimated parameters is well known (van der Vaart, 1998, Theorem 19.23), the actual exploitation of that theory in GoF testing remains problematic because of the difficulty of computing critical values.
The aim of this paper is to explore the potential of the Wasserstein distance for GoF tests of simple (consisting of one fully specified distribution) and composite (consisting of a parametric family of distributions) null hypotheses involving continuous multivariate distributions. The tests we are proposing are based on the Wasserstein distance between P n and the distribution P 0 in the case of a simple null hypothesis and on the Wasserstein distance between P n and a model-based distribution estimate in the case of a composite null hypothesis. They are computationally feasible, have the correct size, and enjoy good power properties in comparison with other tests available in the literature.
We concentrate on the continuous case, i.e., the distributions under the null hypothesis are absolutely continuous with respect to the Lebesgue measure on R d . The test statistic involves the Wasserstein distance between P n , which is discrete, and a distribution from the null hypothesis to be tested, which is continuous. Calculating such a distance requires solving the so-called semidiscrete transportation problem, an active area of research in computer science.
In the case of a simple null hypothesis, the null distribution of the test statistic does not depend on unknown parameters. Exact critical values can be calculated with arbitrary precision via a Monte Carlo procedure, by simulating from the null distribution and computing empirical quantiles. Exact critical values can also be computed for Wasserstein tests of the GoF of a location-scatter family of elliptical distributions with known radial distribution. We handle the presence of unknown nuisance parameters by using empirically standardized data. An important and well-studied special case is that of testing for multivariate normality. Out of the many tests available in the literature, we select those of Royston (1982), Henze and Zirkler (1990) and Rizzo and Székely (2016) as benchmarks for our Wasserstein test.
For general parametric models, we rely on the bootstrap to calculate critical values. The question whether the method has the correct size under the null hypothesis remains open. A proof of that property would require knowledge of non-degenerate limit distributions of the empirical Wasserstein distance, a hard and long-standing open problem, which we briefly review in Section 1.2. Monte Carlo experiments, however, suggest that our tests have the correct asymptotic size.
In all cases, we show that our Wasserstein GoF tests are consistent against fixed alternatives, that is, the null hypothesis under such alternatives is rejected with probability tending to one. For the general parametric case, this property relies on the uniform consistency in probability of the empirical distribution with respect to the Wasserstein distance, uniformly over adequate classes of probability measures. To the best of our knowledge, this result, which is of independent interest, is new in the literature.
Measure transportation has attracted much interest in the recent statistical literature. Carlier et al. (2016), Chernozhukov et al. (2017) and del Barrio et al. (2018) propose measure transportation-based concepts of multivariate ranks, signs, and quantiles. These notions have been successfully applied by Shi, Drton and Han (2019), Deb and Sen (2019), and Ghosal and Sen (2019) in the construction of distribution-free tests in a multivariate context, and by Hallin, La Vecchia and Liu (2019) for R-estimation of VARMA models with unspecified innovation densities.
The outline of the paper is as follows. In the remainder of this introduction, we introduce the Wasserstein distance (Section 1.1), review the asymptotic theory of empirical Wasserstein distance (Section 1.2), and provide some information on the computational methods for the semi-discrete problem underlying the implementation of the Wasserstein GoF tests (Section 1.3). In Section 2, we give a formal description of the GoF test procedure for simple null hypotheses. Section 3 addresses the composite null hypothesis that the unknown distribution belongs to an elliptical family with unknown location and scatter (covariance) parameters and known radial distribution; the multivariate normal family is an important special case. Composite null hypotheses covering general parametric models are treated in Section 4. In Section 5, we conduct a simulation study to assess the finite-sample performance of the Wasserstein tests in comparison to other GoF tests, both for simple and composite null hypotheses. In Appendix A, the convergence of the empirical Wasserstein distance uniformly over certain classes of underlying distributions is stated and proved. In Appendix B, the algorithms we are using in the computation of critical values are listed and explained.

Wasserstein distance
Let P(R^d) be the set of Borel probability measures on R^d and let P_p(R^d) be the subset of such measures with a finite moment of order p ∈ [1, ∞). For P, Q ∈ P(R^d), let Γ(P, Q) be the set of probability measures γ on R^d × R^d with marginals P and Q, i.e., such that γ(A × R^d) = P(A) and γ(R^d × B) = Q(B) for all Borel sets A, B ⊆ R^d. The p-Wasserstein distance between P, Q ∈ P_p(R^d) is

W_p(P, Q) := ( inf_{γ ∈ Γ(P, Q)} ∫_{R^d × R^d} ‖x − y‖^p dγ(x, y) )^{1/p},

with ‖·‖ the Euclidean norm. In terms of random variables X ∼ P and Y ∼ Q, the p-Wasserstein distance is the smallest value of {E(‖X − Y‖^p)}^{1/p} over all possible couplings (X, Y) ∼ γ ∈ Γ(P, Q).
The p-Wasserstein distance W_p defines a metric on P_p(R^d), which, when endowed with W_p, is a complete separable metric space (Villani, 2009, Theorem 6.18 and the bibliographical notes). Convergence in the W_p metric is equivalent to weak convergence plus convergence of moments of order p; see for instance Bickel and Freedman (1981, Lemmas 8.1 and 8.3) and Villani (2009, Theorem 6.9). It is thus quite natural to consider W_p in the construction of GoF tests for multivariate distributions.
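For two discrete measures, the infimum in the definition of W_p is a finite linear program over couplings γ ∈ Γ(P, Q). The following sketch (our illustration only; the function name and the use of scipy.optimize.linprog are our choices, not part of the paper's methodology) solves that Kantorovich program directly:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(x, a, y, b, p=2):
    """W_p between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j}: minimize
    sum_ij gamma_ij * ||x_i - y_j||^p over couplings gamma with marginals a, b."""
    x = np.atleast_2d(np.asarray(x, float))
    y = np.atleast_2d(np.asarray(y, float))
    n, m = len(a), len(b)
    # cost matrix C_ij = ||x_i - y_j||^p
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p
    # marginal constraints: row sums of gamma equal a, column sums equal b
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)
```

For instance, two unit point masses at the origin and at (3, 4) are at W_2-distance 5, the Euclidean distance between the atoms.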
For univariate distributions P and Q with distribution functions F and G, the p-Wasserstein distance boils down to the L^p-distance between the respective quantile functions F^{-1} and G^{-1}. This representation considerably facilitates both the computation of the distance and the asymptotic theory of its empirical versions. Also, the optimal transport plan mapping X ∼ P to Y ∼ Q is immediate: if F has no atoms, then Y := G^{-1} ∘ F(X) ∼ Q, while the monotonicity of G^{-1} ∘ F implies the optimality of the coupling (X, Y).
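In one dimension the optimal monotone coupling matches order statistics, so for two samples of equal size the distance needs only a sort. A minimal sketch (the helper name is ours):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between the empirical measures of two equal-size samples: in d = 1
    the monotone coupling, i.e. matching order statistics, is optimal, so
    W_p^p = n^{-1} * sum_i |x_(i) - y_(i)|^p."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have equal size")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```

For example, the empirical measures of {0, 1} and {2, 3} are at W_2-distance 2, attained by the monotone matching 0 → 2, 1 → 3.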

Asymptotic theory
Let X_n = (X_1, . . . , X_n) be an independent random sample from P ∈ P(R^d). Its distribution as a random vector in (R^d)^n is the n-fold product P^n of P with itself. Let L_n : (R^d)^n → P(R^d) map x_n = (x_1, . . . , x_n) ∈ (R^d)^n to the discrete probability measure L_n(x_n) := n^{-1} Σ_{i=1}^n δ_{x_i}, with δ_x the Dirac measure at x. The empirical distribution of the sample is P̂_n := L_n(X_n) = n^{-1} Σ_{i=1}^n δ_{X_i}. We study its distribution as a random element in P(R^d).
The Wasserstein distance between the empirical distribution L_n(x_n) and a probability measure P ∈ P_p(R^d) is the value at x_n ∈ (R^d)^n of the map x_n ↦ W_p(L_n(x_n), P). Consider the distribution of this map under P^n, i.e., for an independent random sample of size n from P. In perhaps more familiar notation, the random variable of interest is the empirical Wasserstein distance W_p(P̂_n, P).
According to Bickel and Freedman (1981, Lemma 8.4), if P ∈ P_p(R^d), the empirical distribution is strongly consistent in the Wasserstein distance: for an i.i.d. sequence X_1, X_2, . . . with common distribution P, we have W_p(P̂_n, P) → 0 almost surely as n → ∞. The corresponding consistency rates have been studied intensively; see Panaretos and Zemel (2019, Section 3.3) for a review. If P is nondegenerate, then E[W_p(P̂_n, P)] is at least of the order n^{-1/2}, and if P is absolutely continuous, which is the case of interest here, the convergence rate cannot be faster than n^{-1/d}. Actually, the rate can be arbitrarily slow (Bobkov and Ledoux, 2019, Theorem 3.3). Precise rates under additional moment assumptions are given for instance in Fournier and Guillin (2015).
Asymptotic distribution results for the empirical Wasserstein distance in dimension d ≥ 2 are, however, surprisingly scarce. The one-dimensional case is well studied thanks to the link to empirical quantile processes; see for instance del Barrio, Giné and Utzet (2005). Also for discrete distributions, non-degenerate limit distributions are known (Sommerfeld and Munk, 2018; Tameling, Sommerfeld and Munk, 2019). For multivariate Gaussian distributions, a central limit theorem for the empirical Wasserstein distance between the true normal distribution and the one with estimated parameters is given in Rippl, Munk and Sturm (2016). Although interesting and useful for GoF testing (see Section 5.1 below), this result does not cover the case of the empirical distribution P̂_n. Important steps have been taken recently by del Barrio and Loubes (2019) who, quite remarkably, manage to obtain some asymptotic results under alternatives. For general P, Q ∈ P_{4+δ}(R^d) for some δ > 0, they establish a central limit theorem for n^{1/2} [W_2(P̂_n, Q) − E{W_2(P̂_n, Q)}].
Unfortunately, if Q = P, which is the case of interest here, the asymptotic variance is zero, meaning that the random fluctuations of W_2(P̂_n, P) around its mean are of order less than n^{-1/2}. Moreover, as mentioned above, E{W_p(P̂_n, P)} may converge to zero at a slower rate than n^{-1/2}. The limiting distribution of the empirical Wasserstein distance under the null thus remains an important and difficult open problem, which apparently precludes the implementation of multivariate analogues of the existing one-dimensional procedures. The most recent progress has perhaps been made by Goldfeld and Kato (2020), who obtain a central limit theorem for the empirical W_1-distance after smoothing the empirical and true distributions with a Gaussian kernel.
To construct critical values of Wasserstein GoF tests of general parametric models, we will propose in Section 4 the use of the parametric bootstrap. In general, proving consistency of the parametric bootstrap typically requires having non-degenerate limit distributions under contiguous alternatives of the statistic of interest (Beran, 1997;Capanu, 2019). As the above review shows, such results are still beyond the horizon.

Computational issues
Important numerical developments have taken place recently in the area of measure transportation and, more particularly, in the computation of the 2-Wasserstein distance between a discrete and a continuous distribution, the so-called semi-discrete optimal transportation problem; see for instance Leclaire and Rabin (2019) and the references therein. The efficiency and high accuracy of the algorithms developed by Mérigot (2011), Lévy (2015), or Kitagawa, Mérigot and Thibert (2017) make it possible to simulate from the exact null distribution of empirical Wasserstein distances. Moreover, Kitagawa, Mérigot and Thibert (2017) establish, under certain assumptions, the convergence of their algorithm. This opens the door for the implementation, based on simulated critical values, of the Wasserstein distance-based GoF tests in dimension d ≥ 1 for which asymptotic critical values remain unavailable.
Most algorithms to date rely on the dual formulation of the problem, assuming that the source continuous probability measure P admits a density f w.r.t. the Lebesgue measure on R d . We follow Santambrogio (2015, Section 6.4.2) for a brief exposition. In line with the set-up relying on the empirical measure, we work with the quadratic cost function (p = 2) and a discrete measure P n = n −1 n i=1 δ xi over n distinct atoms x 1 , . . . , x n ∈ R d , each of mass 1/n.
The semi-discrete problem requires constructing a power diagram or Laguerre-Voronoi diagram, partitioning R^d into power cells

V_ψ(i) := {x ∈ R^d : ‖x − x_i‖² − ψ_i ≤ ‖x − x_j‖² − ψ_j for all j = 1, . . . , n}, i = 1, . . . , n,

defined in terms of a vector ψ = (ψ_1, . . . , ψ_n) ∈ R^n. Each power cell V_ψ(i) corresponds to a set of linear constraints and, therefore, is a convex polyhedron. The dual to the problem of minimizing the expected transportation cost ∫ ‖x − y‖² dγ(x, y) over the couplings γ ∈ Γ(P_n, P) is then the maximisation, with respect to the vector ψ, of the objective function

F(ψ) := Σ_{i=1}^n ∫_{V_ψ(i)} (‖x − x_i‖² − ψ_i) f(x) dx + n^{-1} Σ_{i=1}^n ψ_i.

The function ψ ↦ F(ψ) is differentiable. Setting its partial derivatives ∂F/∂ψ_i = n^{-1} − P(V_ψ(i)) to zero yields the equations specifying that each power cell V_ψ(i) should receive mass P_n({x_i}) = n^{-1} under P. The optimal transport plan from P to P_n then consists in mapping all points in the interior of V_ψ(i) to x_i. The dual formulation above is the basis for the multi-scale algorithm developed in Mérigot (2011), based on the method for solving constrained least-squares assignment problems in Aurenhammer, Hoffmann and Aronov (1998). For the Monte Carlo simulation experiments, we use the implementation of that algorithm in the function semidiscrete provided by the R package transport (Schuhmacher et al., 2019).
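As an illustration of the dual formulation (and not of the multi-scale algorithm of Mérigot (2011) itself), the following sketch runs a plain gradient ascent on F(ψ), estimating the power-cell masses P(V_ψ(i)) by Monte Carlo draws from P; all names and tuning constants are ours:

```python
import numpy as np

def semidiscrete_weights(atoms, sample_p, n_mc=20000, steps=200, lr=1.0, seed=0):
    """Plain gradient ascent on the semi-discrete dual F(psi): component i of
    the gradient is 1/n - P(V_psi(i)), the cell masses P(V_psi(i)) being
    estimated from a fixed Monte Carlo sample of the continuous measure P."""
    rng = np.random.default_rng(seed)
    atoms = np.asarray(atoms, float)
    n = len(atoms)
    psi = np.zeros(n)
    pts = sample_p(rng, n_mc)  # draws from P, shape (n_mc, d)
    for _ in range(steps):
        # assign each draw to the power cell minimizing ||x - x_i||^2 - psi_i
        cost = ((pts[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2) - psi[None, :]
        mass = np.bincount(cost.argmin(axis=1), minlength=n) / n_mc
        psi += lr * (1.0 / n - mass)  # ascent step: push all cell masses to 1/n
    return psi, mass
```

With two atoms placed symmetrically under a standard bivariate normal source measure, the cells end up carrying mass roughly 1/2 each, as the optimality conditions require.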
Further improvements of the multi-scale algorithm are introduced in Lévy (2015) and Kitagawa, Mérigot and Thibert (2017). Recently, stochastic algorithms in Genevay et al. (2016) and Leclaire and Rabin (2019) have been claimed to perform even better. To the best of our knowledge, implementations of these algorithms are not yet available in R (R Core Team, 2018), the language in which we programmed the simulation experiments. Our aim in this paper is to demonstrate the feasibility of goodness-of-fit tests for multivariate distributions based on the Wasserstein distance. Advances in computational methods and software implementations can only strengthen that case.

Wasserstein GoF tests for simple null hypotheses
Let X_n = (X_1, . . . , X_n) be an independent random sample from some unknown distribution P ∈ P(R^d). For some given fixed P_0 ∈ P_p(R^d), consider testing the simple null hypothesis H^n_0 : P = P_0 against H^n_1 : P ≠ P_0 based on the observations X_n. Note that P, under the alternative, is not required to have finite moments of order p.
Consider the test statistic T_n := W_p^p(P̂_n, P_0), the pth power of the p-Wasserstein distance between the empirical distribution P̂_n = n^{-1} Σ_{i=1}^n δ_{X_i} and the distribution P_0 specified by the null hypothesis. Having bounded support, P̂_n trivially belongs to P_p(R^d), so that T_n is well-defined.
Actual computation of T_n amounts to solving the semi-discrete optimal transport problem, as reviewed in Section 1.3 for p = 2. In the simulations of Section 5, we therefore limit ourselves to p = 2; the theory in this section, however, is developed for general p ≥ 1.
For 0 < α < 1, the test φ^n_{P_0} we are proposing has the form

φ^n_{P_0} := 1{T_n > c(α, n, P_0)},   (1)

with critical value

c(α, n, P_0) := inf{t ∈ R : P^n_0[T_n > t] ≤ α},

where P^n_0 stands for the n-fold product measure of P_0 on (R^d)^n, that is, the distribution under H^n_0 of the observation X_n. By construction, the exact size of the GoF test in (1) satisfies P^n_0[T_n > c(α, n, P_0)] ≤ α, with equality if the law of T_n under P^n_0 is continuous. The risk of a false rejection is thus bounded by the nominal size α, and often equal to it.
Although the critical value c(α, n, P_0) cannot be calculated analytically, it can be approximated to any desired degree of precision via Monte Carlo simulation. To this end, draw a large number N, say, of independent random samples of size n from P_0 and compute the test statistic for each such sample. The empirical (1 − α) quantile of the N simulated test statistics thus obtained is a consistent and asymptotically normal estimator of c(α, n, P_0) as N → ∞, provided that the distribution of T_n has a continuous and positive density at c(α, n, P_0). The approximation error is of the order N^{-1/2} and can be made arbitrarily small by choosing N sufficiently large. The null distribution of T_n depends on P_0, so that c(α, n, P_0) needs to be calculated for each P_0 separately.
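In dimension d = 1, where W_2 reduces to a quantile-function distance, the Monte Carlo procedure can be sketched as follows. The midpoint-quantile approximation of T_n and all function names are our own simplifications, not the paper's semi-discrete implementation:

```python
import numpy as np
from scipy import stats

def stat_w2_sq(x, qf):
    """Approximate T_n = W_2^2(P_n, P_0) in d = 1 by transporting the i-th
    order statistic to the P_0-quantile at u = (i - 1/2)/n (midpoint rule)."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    u = (np.arange(1, n + 1) - 0.5) / n
    return float(np.mean((x - qf(u)) ** 2))

def critical_value(qf, sampler, n, alpha=0.05, N=2000, seed=0):
    """Monte Carlo estimate of c(alpha, n, P_0): empirical (1 - alpha) quantile
    of T_n over N independent samples of size n drawn from P_0."""
    rng = np.random.default_rng(seed)
    sims = [stat_w2_sq(sampler(rng, n), qf) for _ in range(N)]
    return float(np.quantile(sims, 1.0 - alpha))
```

For the null P_0 = N(0, 1), one takes qf = stats.norm.ppf and a standard normal sampler; a sample shifted away from P_0 then produces a statistic far above the simulated critical value.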
The following proposition establishes that, under the alternative hypothesis, the test rejects the null with probability tending to one, i.e., is consistent against any fixed alternative P ≠ P_0.
Proposition 1 (Consistency). For every P_0 ∈ P_p(R^d), the test φ^n_{P_0} is consistent against any P ∈ P(R^d) with P ≠ P_0:

lim_{n→∞} P^n[φ^n_{P_0} = 1] = 1.

Proof. Fix P_0 ∈ P_p(R^d). For any α > 0, the critical value c(α, n, P_0) tends to zero as n → ∞. Indeed, by Bickel and Freedman (1981, Lemma 8.4), we have T_n → 0 in P^n_0-probability and thus lim_{n→∞} P^n_0[T_n > ε] = 0 for any ε > 0. It follows that, for every α > 0 and every ε > 0, we have c(α, n, P_0) ≤ ε for all sufficiently large n.
Let P ∈ P(R^d) with P ≠ P_0. We consider two cases, according to whether P has finite moments of order p or not.
First, suppose that P ∈ P_p(R^d). Still by Bickel and Freedman (1981, Lemma 8.4), we have W_p(P̂_n, P) → 0 in P^n-probability as n → ∞. The triangle inequality for the metric W_p yields |W_p(P̂_n, P_0) − W_p(P, P_0)| ≤ W_p(P̂_n, P) → 0 in P^n-probability. Hence T_n = W_p^p(P̂_n, P_0) → W_p^p(P, P_0) in P^n-probability as n → ∞. But W_p^p(P, P_0) > 0 since P, P_0 ∈ P_p(R^d) and P ≠ P_0 by assumption. It follows that lim_{n→∞} P^n[T_n > c(α, n, P_0)] = 1, as required.
Second, suppose that P ∈ P(R^d) \ P_p(R^d). Let δ_0 denote the Dirac measure at 0 ∈ R^d. Since W_p is a metric, the triangle inequality implies W_p(P̂_n, P_0) ≥ W_p(P̂_n, δ_0) − W_p(δ_0, P_0). As the expectation of ‖X_1‖^p under P is infinite, the law of large numbers implies that W_p^p(P̂_n, δ_0) = n^{-1} Σ_{i=1}^n ‖X_i‖^p → ∞ in P^n-probability as n → ∞. But the same is then true for T_n, and thus lim_{n→∞} P^n[T_n > c(α, n, P_0)] = 1, as required.

Wasserstein GoF tests for elliptical families
Consider a random vector Z in R^d with a spherically symmetric distribution whose radial part ‖Z‖ has density f_rad satisfying ∫_0^∞ r² f_rad(r) dr = d. The distribution of Z is then in P_2(R^d) (denote it by P_{f_rad}) and Z has mean zero and covariance matrix I_d. The vector X := µ + AZ, with µ ∈ R^d and invertible A ∈ R^{d×d}, is said to be elliptical with standard radial density f_rad; the distribution of X then is in P_2(R^d), and X has mean µ and covariance matrix Σ = AA⊤. We refer to Cambanis, Huang and Simons (1981) or Fang, Kotz and Ng (1990) for details.
Let E(f_rad) denote the family of d-variate elliptical distributions with standard radial density f_rad. Such families are indexed by a location vector µ ∈ R^d and a positive definite d × d covariance matrix Σ; the choices µ = 0 and Σ = I_d yield the spherical P_{f_rad}. Common examples of elliptical families are the multivariate normal family, with f_rad the density of the square root of a χ²_d variable, and the multivariate Student t distribution with ν > 2 degrees of freedom, where f_rad is the density of the square root of a rescaled Fisher F(d, ν) variable. In general, elliptical distributions are not subject to moment constraints (Σ := AA⊤ is then a scatter rather than a covariance matrix), but here we intend to use the Wasserstein distance of order p = 2 and therefore restrict to elliptical families with finite second-order moments.
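For illustration, drawing from the spherical P_{f_rad} amounts to multiplying a uniform direction on the unit sphere by an independent radius with density f_rad; with f_rad the density of the square root of a χ²_d variable this recovers the standard d-variate normal. A sketch (function names are ours):

```python
import numpy as np

def sample_spherical(n, d, radial_sampler, rng):
    """Draw n vectors Z = R * U with U uniform on the unit sphere in R^d and
    an independent radius R from the standard radial distribution."""
    u = rng.standard_normal((n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform direction on sphere
    r = radial_sampler(rng, n)                     # radial part with density f_rad
    return u * r[:, None]
```

With the χ_d radius, the sample mean is close to zero and the sample covariance close to I_d, as the normalization ∫ r² f_rad(r) dr = d requires.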
Given an i.i.d. sample X_1, . . . , X_n from some unspecified P ∈ P_2(R^d), we wish to test the null hypothesis that P is elliptical with specified standard radial density f_rad, namely,

H^n_0 : P ∈ E(f_rad) against H^n_1 : P ∉ E(f_rad).   (3)

The location vector µ and the covariance matrix Σ of P are unknown nuisance parameters. In contrast to Section 2, the null hypothesis is thus a composite one.
Our testing strategy is to compute residuals of the form

Ẑ_{n,i} := Â_n^{-1}(X_i − μ̂_n), i = 1, . . . , n,   (4)

with estimators μ̂_n and Â_n of µ and A, yielding an empirical distribution P̂_{Ẑ_n} := n^{-1} Σ_{i=1}^n δ_{Ẑ_{n,i}}. The test statistic we propose is T_{E(f_rad),n} := W_2²(P̂_{Ẑ_n}, P_{f_rad}).
If the null distribution of (Ẑ_{n,1}, . . . , Ẑ_{n,n}) does not depend on the unknown µ and Σ, then we can define critical values for T_{E(f_rad),n} as if µ = 0 and Σ = I_d.
As in Section 2, such critical values can then be approximated with any desired accuracy via Monte Carlo random sampling from P f rad .
In the sequel, we let μ̂_n := n^{-1} Σ_{i=1}^n X_i = X̄_n and choose for Â_n the Cholesky triangle L_{n,X} ∈ R^{d×d} of the empirical covariance matrix S_{n,X} := n^{-1} Σ_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)⊤. Recall that for every symmetric positive definite matrix S ∈ R^{d×d}, there exists a unique lower triangular matrix L ∈ R^{d×d} with positive diagonal elements, called the Cholesky triangle, producing the Cholesky decomposition S = LL⊤ (Golub and Van Loan, 1996, Theorem 4.2.5). If Σ is invertible, then S_{n,X} is invertible with probability tending to one; even more, for an i.i.d. sequence X_1, X_2, . . . from P ∈ E(f_rad), with probability one, the matrix S_{n,X} is invertible for all n large enough depending on the sample. On the event that S_{n,X} is invertible, the residuals (4) are thus

Ẑ_{n,i} = L_{n,X}^{-1}(X_i − X̄_n), i = 1, . . . , n.   (5)

For completeness, on the event that S_{n,X} is not invertible, we set Ẑ_{n,i} = 0 for i = 1, . . . , n, although a precise definition is immaterial for the results to follow. Let us show that the joint distribution of the vector of residuals computed in this way does not depend on the unknown µ or Σ. The key is the following elementary property.
Lemma 1. Let µ ∈ R^d and let L ∈ R^{d×d} be lower triangular with positive diagonal elements. Given z_1, . . . , z_n ∈ R^d, put x_i := µ + Lz_i for i = 1, . . . , n, and let x̄_n and S_{n,x} denote the sample mean and empirical covariance matrix of x_1, . . . , x_n; similarly define z̄_n and S_{n,z}. Then S_{n,x} is invertible if and only if S_{n,z} is, in which case their Cholesky factors L_{n,x} and L_{n,z} are related by L_{n,x} = LL_{n,z}, which implies L_{n,x}^{-1}(x_i − x̄_n) = L_{n,z}^{-1}(z_i − z̄_n) for i = 1, . . . , n.

Proof. We have x_i = µ + Lz_i for all i = 1, . . . , n, whence x̄_n = µ + Lz̄_n and S_{n,x} = LS_{n,z}L⊤.
Since L is invertible, S_{n,x} is invertible if and only if S_{n,z} is. Suppose they both are, and let L_{n,x} and L_{n,z} denote their Cholesky factors. The matrix LL_{n,z} is lower triangular, has positive diagonal elements, and satisfies (LL_{n,z})(LL_{n,z})⊤ = LS_{n,z}L⊤ = S_{n,x}. By the uniqueness of the Cholesky decomposition, L_{n,x} = LL_{n,z}. Finally, since x_i − x̄_n = L(z_i − z̄_n), we get L_{n,x}^{-1}(x_i − x̄_n) = L_{n,z}^{-1}L^{-1}L(z_i − z̄_n) = L_{n,z}^{-1}(z_i − z̄_n).

Proposition 2. Let X_1, . . . , X_n be an i.i.d. sample from P ∈ E(f_rad) with mean µ and full-rank covariance Σ. The joint distribution of Ẑ_{n,i} in (5), for i = 1, . . . , n, does not depend on µ nor on Σ.
Proof. Let L be the Cholesky factor of Σ. In view of Lemma 1, it is sufficient to show that Z_i := L^{-1}(X_i − µ), for i = 1, . . . , n, is an independent random sample from the spherical distribution P_{f_rad} with mean zero, covariance identity, and standard radial density f_rad. By the assumption on P, there exists an invertible A ∈ R^{d×d} with AA⊤ = Σ such that X_i = µ + Aζ_i for i = 1, . . . , n, where ζ_1, . . . , ζ_n is an independent random sample from P_{f_rad}. Then Z_i = L^{-1}Aζ_i, where L^{-1}A is orthogonal since (L^{-1}A)(L^{-1}A)⊤ = L^{-1}Σ(L^{-1})⊤ = L^{-1}LL⊤(L⊤)^{-1} = I_d. It thus follows from sphericity that the common distribution of Z_1, . . . , Z_n is the same as that of ζ_1, . . . , ζ_n, that is, P_{f_rad}.
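Lemma 1 and Proposition 2 can be checked numerically: the empirically standardized residuals are invariant under the transformation x_i = µ + Lz_i. A sketch (the function name and the n^{-1} covariance normalization are our choices):

```python
import numpy as np

def residuals(X):
    """Empirically standardized residuals Zhat_i = L_{n,X}^{-1} (X_i - Xbar_n),
    with L_{n,X} the Cholesky triangle of the sample covariance matrix."""
    X = np.asarray(X, float)
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)  # n^{-1} normalization
    L = np.linalg.cholesky(S)               # lower triangular, positive diagonal
    return np.linalg.solve(L, centered.T).T
```

Applying residuals to X = µ + Z L⊤ (rows x_i = µ + L z_i) returns exactly the residuals of Z, for any lower triangular L with positive diagonal and any µ.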
For the hypothesis testing problem (3), we propose the test

φ^n_{E(f_rad)} := 1{T_{E(f_rad),n} > c_{E(f_rad)}(α, n)}

at level α ∈ (0, 1) and with critical value

c_{E(f_rad)}(α, n) := inf{t ∈ R : P^n_{f_rad}[T_{E(f_rad),n} > t] ≤ α}.   (6)

The probability in (6) is calculated under the spherical distribution with radial density f_rad, which is free of nuisances. By Proposition 2, the size of the test is at most α: for all P ∈ E(f_rad), P^n[φ^n_{E(f_rad)} = 1] ≤ α. The size of the test is equal to α if the null distribution of T_{E(f_rad),n} is continuous at the critical value. In practice, calculation of the critical value is implemented by Monte Carlo simulation; see Algorithm 2 in Appendix B.
The consistency of the test follows from a law of large numbers in 2-Wasserstein distance for the empirical distribution of the residuals defined in (5).
Proposition 3. Let P ∈ P_2(R^d) have mean vector µ ∈ R^d and invertible covariance matrix Σ ∈ R^{d×d} with Cholesky triangle L ∈ R^{d×d}. Let X_1, X_2, . . . be a sequence of i.i.d. random vectors with common distribution P. For Ẑ_{n,i} as in (5), we have W_2(P̂_{Ẑ_n}, Q_0) → 0 almost surely as n → ∞, where Q_0 denotes the distribution of L^{-1}(X_1 − µ).

Proof. The vectors Z_i := L^{-1}(X_i − µ), i = 1, 2, . . ., form an i.i.d. sequence with common distribution Q_0. By the strong law of large numbers, X̄_n → µ and S_{n,X} → Σ a.s. as n → ∞.
With probability one, S_{n,X} is invertible for n large enough (depending on the sample) and admits a unique Cholesky factor L_{n,X}. The map that sends a positive definite symmetric matrix to its Cholesky triangle is differentiable (Smith, 1995) and thus continuous. It follows that L_{n,X} → L and L_{n,X}^{-1} → L^{-1} a.s. as n → ∞.
Consider the coupling of P̂_{Ẑ_n} and P̂_{Z_n} via the discrete uniform distribution on the pairs (Ẑ_{n,i}, Z_i) for i = 1, . . . , n. From the definition of the Wasserstein distance, we have

W_2²(P̂_{Ẑ_n}, P̂_{Z_n}) ≤ n^{-1} Σ_{i=1}^n ‖Ẑ_{n,i} − Z_i‖².

For each i = 1, . . . , n, the identity X_i = µ + LZ_i yields Ẑ_{n,i} = L_{n,X}^{-1}L(Z_i − Z̄_n), whence

‖Ẑ_{n,i} − Z_i‖ ≤ ‖L_{n,X}^{-1}L − I_d‖ ‖Z_i‖ + ‖L_{n,X}^{-1}L‖ ‖Z̄_n‖,

featuring the matrix norm on R^{d×d} associated with the Euclidean norm on R^d. It follows that

n^{-1} Σ_{i=1}^n ‖Ẑ_{n,i} − Z_i‖² ≤ 2 ‖L_{n,X}^{-1}L − I_d‖² n^{-1} Σ_{i=1}^n ‖Z_i‖² + 2 ‖L_{n,X}^{-1}L‖² ‖Z̄_n‖².

The right-hand side converges to zero almost surely as n → ∞ in view of the law of large numbers for n^{-1} Σ_{i=1}^n ‖Z_i‖², the convergence L_{n,X}^{-1} → L^{-1}, and the strong law Z̄_n → 0. Since also W_2(P̂_{Z_n}, Q_0) → 0 almost surely by Bickel and Freedman (1981, Lemma 8.4), the result follows from the triangle inequality.
Proposition 4 (Consistency). The sequence of tests φ^n_{E(f_rad)} is consistent against any P ∈ P_2(R^d) \ E(f_rad) with positive definite covariance matrix: lim_{n→∞} P^n[φ^n_{E(f_rad)} = 1] = 1 for every α > 0.
Proof. Let P be as in the statement. It is sufficient to show that there exists ε > 0 such that lim_{n→∞} P^n[T_{E(f_rad),n} > ε] = 1.
By Proposition 3, we have W_2(P̂_{Ẑ_n}, Q_0) → 0 in P^n-probability, with Q_0 as in the statement of that proposition. By the triangle inequality,

W_2(P̂_{Ẑ_n}, P_{f_rad}) ≥ W_2(P_{f_rad}, Q_0) − W_2(P̂_{Ẑ_n}, Q_0).

By assumption, Q_0 ≠ P_{f_rad} and thus W_2(P_{f_rad}, Q_0) > 0, since otherwise P ∈ E(f_rad). For ε > 0 less than W_2²(P_{f_rad}, Q_0), we obtain lim_{n→∞} P^n[T_{E(f_rad),n} > ε] = 1, as required.

Wasserstein GoF tests for general parametric families
Extending the scope of Section 3, consider the problem of testing whether the unknown common distribution P of a sample of observations belongs to some parametric family M := {P_θ : θ ∈ Θ} of distributions on R^d, where the parameter space Θ is some metric space and the map θ ↦ P_θ is assumed to be one-to-one and continuous in a sense to be specified. Given an independent random sample X_n = (X_1, . . . , X_n) from some unknown P ∈ P(R^d), the goodness-of-fit problem is about testing

H^n_0 : P ∈ M against H^n_1 : P ∉ M.   (9)
Assume that every P_θ ∈ M has a finite moment of order p ∈ [1, ∞), that is, M ⊆ P_p(R^d). The test statistic we propose is T_{M,n} := W_p^p(P̂_n, P_{θ̂_n}), where θ̂_n = θ̂_n(X_n) is some consistent (under H^n_0) estimator sequence of the true parameter θ. The distribution of X_n under H^n_0 in (9) being P^n_θ for some θ ∈ Θ, we would like to take

c_M(α, n, θ) := inf{t ∈ R : P^n_θ[T_{M,n} > t] ≤ α}

as the critical value of a test with nominal size α ∈ (0, 1). This choice is infeasible, however, since the true parameter θ is unknown. Therefore, we propose to replace it by the bootstrapped quantity c_M(α, n, θ̂_n), yielding the test rejecting H^n_0 whenever T_{M,n} exceeds c_M(α, n, θ̂_n).   (11)

Given the parameter estimate θ̂_n, the proposed critical value can be approximated by resampling from the estimated distribution P_{θ̂_n}. The idea is as follows and is given in more detail in Appendix B, in particular Algorithm 3:
1. generate a large number B, say, of samples X*_{n,b} = (X*_{1,b}, . . . , X*_{n,b}) ∈ (R^d)^n, for b = 1, . . . , B, of size n from P_{θ̂_n};
2. letting P̂*_{n,b} denote the empirical distribution of X*_{n,b} and θ̂*_{n,b} the estimator computed from X*_{n,b}, compute the bootstrap statistics T*_{M,n,b} := W_p^p(P̂*_{n,b}, P_{θ̂*_{n,b}});
3. compute the empirical (1 − α) quantile c_{M,B}(α, n, θ̂_n) of T*_{M,n,1}, . . . , T*_{M,n,B}.
As B → ∞ and conditionally on the data, the Monte Carlo approximation c_{M,B}(α, n, θ̂_n) converges to the true quantile c_M(α, n, θ̂_n) of the distribution of T_{M,n} under P^n_{θ̂_n}, provided the latter distribution has a positive and continuous density at the stated limit point. In what follows, we assume we can compute c_M(α, n, θ̂_n) to any desired degree of accuracy.
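In dimension d = 1 and for the Gaussian location-scale model, the bootstrap scheme above can be sketched as follows; note that the estimator is recomputed on every bootstrap sample. The quantile-matching approximation of W_2² and all names are our own simplifications:

```python
import numpy as np
from scipy import stats

def t_stat(x):
    """T_{M,n} = W_2^2(P_n, P_{theta_hat}) for the 1-d Gaussian location-scale
    model, with theta_hat = (mean, sd) and quantile matching at midpoints."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    u = (np.arange(1, n + 1) - 0.5) / n
    return float(np.mean((x - stats.norm.ppf(u, loc=x.mean(), scale=x.std())) ** 2))

def bootstrap_test(x, alpha=0.05, B=500, seed=0):
    """Parametric bootstrap: draw B samples from P_{theta_hat}, recompute both
    the estimator and the statistic on each, and compare T_{M,n} with the
    bootstrap (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    n = len(x)
    m, s = np.mean(x), np.std(x)
    boot = [t_stat(rng.normal(m, s, n)) for _ in range(B)]
    c = float(np.quantile(boot, 1.0 - alpha))
    t = t_stat(x)
    return t > c, t, c
```

A unit-rate exponential sample, for instance, is far from every Gaussian fit, so its statistic exceeds the bootstrap critical value and the null is rejected.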
By construction, we have P^n_θ[T_{M,n} > c_M(α, n, θ)] ≤ α for all θ ∈ Θ; that is, if we could use the critical value at the true parameter θ, the risk of a false rejection of the null hypothesis would be bounded by α; it would even be equal to α if the distribution of T_{M,n} is continuous at c_M(α, n, θ). But as the true parameter θ is unknown, we use the estimated one, θ̂_n, instead, so that the risk of a false rejection is P^n_θ[T_{M,n} > c_M(α, n, θ̂_n)]. The question remains open whether under the null hypothesis the actual size of the test indeed converges to α. To prove this would require non-degenerate limit distribution theory for W_p^p(P̂_n, P_θ), not only for fixed θ ∈ Θ, but even for sequences θ_n converging to θ at certain rates. As discussed in Section 1.2, such limit results are still beyond the horizon. In the simulation study, however, we check that the proposed bootstrap method indeed produces a test with approximately the right size.
Nevertheless, against a fixed alternative, the consistency of the test (11) based on the parametric bootstrap can be established theoretically. The key is the uniform convergence in probability of the empirical Wasserstein distance treated in Appendix A. For the parameter estimator θ̂_n, we need to assume weak consistency locally uniformly in θ: if ρ denotes the metric on Θ and if K(Θ) denotes the collection of compact subsets of Θ, we will require that, for all ε > 0 and all K ∈ K(Θ), lim_{n→∞} sup_{θ∈K} P^n_θ[ρ(θ̂_n, θ) > ε] = 0. (12)
As illustrated in Remark 1 below, this condition is satisfied, for instance, for moment estimators of a Euclidean parameter under a uniform integrability condition.

Proposition 5 (Consistency). Let M = {P_θ : θ ∈ Θ} ⊆ P_p(R^d) be a model indexed by a metric space (Θ, ρ). Assume that the following conditions are satisfied: (a) the map Θ → P_p(R^d) : θ ↦ P_θ is one-to-one and W_p-continuous; (b) θ̂_n is weakly consistent locally uniformly in θ ∈ Θ, i.e., (12) holds.
Then, the following properties hold: (i) T_{M,n} → 0 in P^n_θ-probability locally uniformly in θ ∈ Θ; (iii) for every P ∈ P(R^d) \ M such that there exists K ∈ K(Θ) with P^n[θ̂_n ∈ K] → 1 as n → ∞, we have lim_{n→∞} P^n[φ^M_n = 1] = 1. Proof. (i) By the triangle inequality, W_p(P̂_n, P_{θ̂_n}) ≤ W_p(P̂_n, P_θ) + W_p(P_θ, P_{θ̂_n}) (13) for all θ ∈ Θ. For compact K ⊆ Θ, it is then sufficient to show that each of the W_p-distances on the right-hand side of (13) converges to 0 in P^n_θ-probability uniformly in θ ∈ K.
First, since K is compact and θ ↦ P_θ is W_p-continuous, the set M_K := {P_θ : θ ∈ K} is compact in P_p(R^d) equipped with the W_p-distance. By Bickel and Freedman (1981, Lemma 8.3(b)) or Villani (2009, Definition 6.8(b) and Theorem 6.9) and a subsequence argument, it follows that x ↦ ‖x‖^p is uniformly integrable with respect to M_K, i.e., lim_{r→∞} sup_{θ∈K} ∫_{‖x‖>r} ‖x‖^p dP_θ(x) = 0.
Corollary 1 then implies that W p ( P n , P θ ) → 0 in P n θ -probability as n → ∞, uniformly in θ ∈ K.
Second, by the W_p-continuity of θ ↦ P_θ on the compact K and condition (b), W_p(P_θ, P_{θ̂_n}) → 0 in P^n_θ-probability as n → ∞, uniformly in θ ∈ K.
(iii) Let P and K be as in the statement. Put c_n = sup_{θ∈K} c_M(α, n, θ). We have P^n[φ^M_n = 1] ≥ P^n[T_{M,n} > c_M(α, n, θ̂_n), θ̂_n ∈ K] ≥ P^n[T_{M,n} > c_n, θ̂_n ∈ K].
As η > 0 and lim_{n→∞} c_n = 0, the latter probability converges to one by the assumption made on K and the fact that W_p(P̂_n, P) → 0 in P^n-probability as n → ∞. Second, suppose that P ∈ P(R^d) \ P_p(R^d). Let δ_0 be the Dirac measure at 0 ∈ R^d. Since θ ↦ W_p(P_θ, δ_0) is continuous, s := sup_{θ∈K} W_p(P_θ, δ_0) is finite. By the triangle inequality, on the event {θ̂_n ∈ K}, W_p(P̂_n, P_{θ̂_n}) ≥ W_p(P̂_n, δ_0) − s, while W_p^p(P̂_n, δ_0) = n^{−1} Σ_{i=1}^n ‖X_i‖^p → ∫ ‖x‖^p dP(x) = ∞ in P^n-probability by the weak law of large numbers. It follows that lim_{n→∞} P^n[T_{M,n} > c_n, θ̂_n ∈ K] = 1.
Remark 1 (Uniform consistency). Under a mild moment condition, the uniform consistency condition (b) in Proposition 5 is satisfied for method-of-moments estimators (moment estimators, for short) of a Euclidean parameter θ ∈ Θ ⊆ R^k. In the method of moments, an estimator θ̂_n of θ is obtained by solving (with respect to θ) the equations n^{−1} Σ_{i=1}^n f_j(X_i) = E_θ[f_j(X)], j = 1, ..., k, for some given k-tuple f := (f_1, ..., f_k) of functions such that m : θ ↦ E_θ[f(X)] is a homeomorphism between Θ and m(Θ); see, for instance, van der Vaart (1998, Chapter 4). The consistency of θ̂_n = m^{−1}(n^{−1} Σ_{i=1}^n f(X_i)) uniformly in θ ∈ K for any compact K ⊆ Θ then follows from the uniform consistency over K of n^{−1} Σ_{i=1}^n f(X_i) as an estimator of E_θ[f(X)] for such θ. By van der Vaart and Wellner (1996, Proposition A.5.1), a sufficient condition for the latter is that the functions f_j are P_θ-uniformly integrable for θ ∈ K, i.e., lim_{M→∞} sup_{θ∈K} E_θ[|f_j(X)| I{|f_j(X)| > M}] = 0 for j = 1, ..., k. Since I{|f_j(X)| > M} ≤ |f_j(X)|^η / M^η for η > 0, a further sufficient condition is that there exists η > 0 such that sup_{θ∈K} E_θ[|f_j(X)|^{1+η}] < ∞ for j = 1, ..., k.
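As a concrete illustration of the moment equations in Remark 1, the following Python sketch (an illustrative assumption, not code from the paper) estimates θ = (µ, σ) in a univariate Gaussian family from the first two sample moments and checks numerically that the estimation error shrinks with the sample size.

```python
import numpy as np

def moment_estimator(x):
    # Solve n^{-1} sum_i f(X_i) = E_theta[f(X)] with f(x) = (x, x^2);
    # for the N(mu, sigma^2) family, m is invertible in closed form.
    m1, m2 = x.mean(), (x ** 2).mean()
    return m1, float(np.sqrt(max(m2 - m1 ** 2, 0.0)))

rng = np.random.default_rng(42)
mu, sigma = 1.5, 2.0
x_small = rng.normal(mu, sigma, size=100)
x_large = rng.normal(mu, sigma, size=10_000)
err = lambda est: max(abs(est[0] - mu), abs(est[1] - sigma))
e_small = err(moment_estimator(x_small))   # error at n = 100
e_large = err(moment_estimator(x_large))   # error at n = 10 000
```

Here m(θ) = (µ, µ² + σ²) is a homeomorphism onto its image, and the moment condition of Remark 1 holds on compact parameter sets since Gaussian moments of order 2(1 + η) are finite.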
Remark 2 (Parameter estimate under the alternative). In Proposition 5(iii), the condition that there exists a compact K ⊆ Θ such that lim n→∞ P n [θ n ∈ K] = 1 holds, for instance, when Θ is locally compact andθ n is consistent for a pseudoparameter θ(P) ∈ Θ. This is the case for the moment estimators of Remark 1 when Θ ⊆ R k is open and f is P-integrable with f (x) dP(x) ∈ m(Θ).
In obvious notation, we have Ẑ_{n,ij} = (Z_{ij} − Z̄_{n,j})/s_{n,j,Z}. Let θ̂_n denote a strongly consistent estimator of θ that depends on the data only through Ẑ_{n,1}, ..., Ẑ_{n,n}. Consider the empirical distributions P̂^Ẑ_n := n^{−1} Σ_{i=1}^n δ_{Ẑ_{n,i}} and P̂^Z_n := n^{−1} Σ_{i=1}^n δ_{Z_i}. To test the hypothesis H^n_0 : P ∈ M, we propose the location-scale adjusted statistic T^ls_{M,n} := W_2^2(P̂^Ẑ_n, P_{θ̂_n}). (15)
Its distribution still depends on θ but no longer on µ or σ. Critical values can thus be computed as if µ_j = 0 and σ_j = 1 for all j = 1, ..., d. For a test of size α ∈ (0, 1), we reject the null hypothesis as soon as the test statistic exceeds the critical value c^ls_M(α, n, θ̂_n), where c^ls_M(α, n, θ) denotes the (1 − α)-quantile of the distribution of T^ls_{M,n} under the model with copula parameter θ and standardized margins. (16) In practice, critical values are calculated by a parametric bootstrap procedure as before. The advantage of the estimated residuals (14) is that the critical values are a function of θ only, rather than a function of ψ = (µ, σ, θ), which greatly simplifies their computation. Under the null hypothesis, we have T^ls_{M,n} → 0 almost surely as n → ∞, since it is bounded by a multiple of a sum of three terms, each of which converges to zero almost surely. Under an alternative P ∈ P_2(R^d) \ M such that θ̂_n remains in a compact set with probability tending to one, T^ls_{M,n} remains bounded away from zero and the test is consistent by an argument similar to the proof of Proposition 5.
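The location-scale reduction can be sketched as follows. This Python fragment is an illustration only: it standardizes each margin as in Eq. (14) and, instead of the semi-discrete transport used in the paper, approximates W_2^2 by an optimal assignment between the standardized sample and an equal-size sample drawn from a (here, hypothetical independence-copula Gaussian) fitted model.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def standardized_residuals(z):
    # Z-hat_{n,ij} = (Z_ij - mean_j) / sd_j, column by column (Eq. (14)).
    return (z - z.mean(axis=0)) / z.std(axis=0, ddof=1)

def w2_squared_empirical(x, y):
    # Discrete-discrete proxy for W_2^2 between two equal-size point clouds:
    # optimal assignment under squared Euclidean cost, averaged over points.
    cost = cdist(x, y, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

rng = np.random.default_rng(0)
# raw data with nonzero marginal locations and scales
z = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(200, 2))
z_hat = standardized_residuals(z)
# sample from the fitted model, itself location-scale free
y = rng.normal(size=(200, 2))
t_ls = w2_squared_empirical(z_hat, y)
```

Because both point clouds are standardized, t_ls is invariant under marginal location-scale changes of the data, which is exactly the property exploited by T^ls_{M,n}.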

Finite-sample performance of GoF tests
This section is devoted to a numerical assessment of the finite-sample performance of the Wasserstein-based GoF tests introduced in the previous sections; whenever possible, we compare them with other tests. The case of a simple null hypothesis (Section 2) is treated in Section 5.1. The performance of various tests for multivariate normality, which is a special case of the hypothesis of elliptical symmetry considered in Section 3, is compared in Section 5.2, along with an illustration involving a Student t distribution with known degrees of freedom. Section 5.3 considers, in line with Remark 3, the more general composite null hypothesis of a parametric family indexed by marginal location and scale parameters along with a copula parameter θ. Numerical results support the validity of the bootstrap-based calculation of critical values. To the best of our knowledge, no GoF test is available in the literature for such cases except for the method described by Khmaladze (2016), whose numerical implementation, however, remains unsettled.
Throughout, we consider the Wasserstein distance of order p = 2. The level α of the tests is set to 5%, the sample size is n = 200, and the number of replicates considered in the estimation of power curves is 1000. We rely on the R package transport (Schuhmacher et al., 2019), which is why we restrict ourselves to dimension d = 2. As explained in Section 1.3, stochastic algorithms have recently been proposed to solve the semi-discrete problem in higher dimensions, but these are not yet implemented in R.

Simple null hypotheses
The setting is as in Section 2: given an independent random sample X 1 , . . . , X n from some unknown P ∈ P(R d ), we consider testing the simple null hypothesis H n 0 : P = P 0 , where P 0 ∈ P 2 (R d ) is fully specified.

Other GoF tests
Two other goodness-of-fit tests will be used as benchmarks. Rippl, Munk and Sturm (2016) consider the fully specified Gaussian null hypothesis H^n_0 : P = N_d(µ_0, Σ_0) with given mean and covariance. Recall that the squared 2-Wasserstein distance between two d-variate Gaussian distributions N_d(µ_1, Σ_1) and N_d(µ_2, Σ_2) is ‖µ_1 − µ_2‖^2 + tr{Σ_1 + Σ_2 − 2(Σ_2^{1/2} Σ_1 Σ_2^{1/2})^{1/2}}. Their test statistic is this distance evaluated at the null parameters (µ_0, Σ_0) and the plug-in estimates (X̄_n, S_{n,X}), with X̄_n and S_{n,X} the sample mean and sample covariance matrix, respectively. This test is sensitive to changes in the parameters of the Gaussian distribution but not to other types of alternatives. Calculation of the test statistic is straightforward. To compute critical values, we relied on a Monte Carlo approximation, drawing many samples from the Gaussian null distribution and taking the empirical quantile of the resulting test statistics. Khmaladze (2016) constructs empirical processes in such a way that they are asymptotically distribution-free, which facilitates their use for hypothesis testing. A special case of the construction is as follows. Let the d-variate cumulative distribution function (cdf) F be absolutely continuous with joint density f, marginal densities f_1, ..., f_d, and copula density c, and let F_1, ..., F_d denote the marginal cdfs of F. The d-variate cdf G(x) = Π_{j=1}^d F_j(x_j) has the same margins as F, but coupled via the independence copula. In terms of auxiliary functions l and κ constructed from f (whose exact form we do not reproduce here), it follows from Corollary 4 in Khmaladze (2016) that a transformed empirical process ṽ_{F,n} based on an independent random sample X_1, ..., X_n from F converges weakly to a G-Brownian bridge, i.e., has the same weak limit as the ordinary empirical process based on an independent random sample Y_1, ..., Y_n from G. The asymptotic distribution of a test statistic based on ṽ_{F,n} which is invariant with respect to coordinate-wise continuous monotone increasing transformations is thus the same as if F (or G) were the uniform distribution on [0, 1]^d.
This includes the Kolmogorov-Smirnov type statistic sup_{x∈R^d} |ṽ_{F,n}(x)|, which (with F the cdf of P_0) we consider below for comparison with our Wasserstein-based test.
In case F has independent margins, F and G coincide and the procedure reduces to a classical Kolmogorov-Smirnov test. To ensure that the test has the right size at finite sample size, we calculate critical values by Monte Carlo approximation rather than relying on the asymptotic theory.
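The Rippl-Munk-Sturm statistic relies on the closed form of the squared 2-Wasserstein distance between Gaussians recalled above. The following Python sketch (an illustration, not the authors' implementation) evaluates that closed form and the plug-in statistic with the sample mean and covariance.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussians(mu1, cov1, mu2, cov2):
    # Closed form: |mu1 - mu2|^2
    #              + tr{cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2}}.
    root2 = sqrtm(cov2)
    cross = sqrtm(root2 @ cov1 @ root2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * np.real(cross)))

# plug-in statistic: replace (mu1, cov1) by the sample mean and covariance
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
t_rms = w2_squared_gaussians(x.mean(axis=0), np.cov(x.T),
                             np.zeros(2), np.eye(2))
```

For data drawn from the null distribution N_2(0, I_2), as above, the statistic is small; a pure mean shift µ contributes exactly ‖µ‖^2 to the distance, which explains the test's sensitivity to parameter changes only.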

Results
In Figure 1, we assess the performance of the GoF tests of H^n_0 : P = P_0 where P_0 = N_2(0, I_2) is a centered bivariate Gaussian with identity covariance matrix. The alternatives P in panels (a)-(f) are as follows: (a) P = N_2((µ, µ)⊤, I_2) with location shift µ along the main diagonal (rejection frequencies plotted against µ ∈ R); (b) P = N_2(0, diag(σ^2, σ^2)) (rejection frequencies plotted against σ^2 > 0); (c) P = N_2(0, (1 ρ; ρ 1)) with correlation ρ (rejection frequencies plotted against ρ ∈ (−1, 1)); (d) P has standard normal margins but Gumbel copula with parameter θ (rejection frequencies plotted against θ ∈ [1, ∞)); (e) P has standard Gaussian margins but a bivariate Student t copula with ν = 4 degrees of freedom and correlation parameter ρ (rejection frequencies plotted against ρ ∈ (−1, 1)); (f) P is the "banana-shaped" Gaussian mixture described in Appendix C (rejection frequencies plotted against the mixing weight p ∈ (0, 1)). The Gumbel and Student t copula simulations in (d) and (e) were implemented using the R package copula (Hofert et al., 2018). Inspection of Figure 1 indicates that the Khmaladze test, as a rule, is uniformly outperformed by the Rippl-Munk-Sturm and Wasserstein tests. The Rippl-Munk-Sturm test, of course, does relatively well under the Gaussian alternatives of panels (a)-(c) where, however, the Wasserstein test is almost as powerful (while its validity, contrary to that of the Rippl-Munk-Sturm test, extends largely beyond the Gaussian null hypothesis). Against the non-Gaussian alternatives in panels (d)-(f), the Wasserstein test has higher power than the Rippl-Munk-Sturm and Khmaladze tests, with the exception of the Gumbel copula alternative in panel (d), where the Rippl-Munk-Sturm and Wasserstein tests perform equally well. For the "banana mixture" of panel (f), the Rippl-Munk-Sturm test fails to capture the change in distribution.
Figures 2 and 3 deal with non-Gaussian simple null distributions P_0, so that the Rippl-Munk-Sturm test no longer applies. In Figure 2, the null distribution is the mixture of Gaussians P_0 = 0.5 N_2(0, I_2) + 0.5 N_2((3, 0)⊤, I_2). The alternatives in both panels are (a) P = 0.5 N_2(0, I_2) + 0.5 N_2((3 + δ, 0)⊤, I_2) (rejection frequencies plotted against the location shift δ ∈ R);

Fig 1. Empirical powers of various GoF tests for the simple Gaussian null hypothesis H^n_0 : P = N_2(0, I_2). Three tests are considered: the Wasserstein-2 distance (Section 2), the Rippl-Munk-Sturm test (Rippl, Munk and Sturm, 2016), and the Khmaladze Kolmogorov-Smirnov type test (Khmaladze, 2016); see Section 5.1.1. The alternatives P in panels (a)-(f) are described in Section 5.1.2 (note that in (e), P is not Gaussian even when ρ = 0).
The Wasserstein test uniformly outperforms the Khmaladze one.
In Figure 3, P_0 has standard Gaussian margins and a Gumbel copula with parameter θ = 1.7. The alternative P is of the same form but with another value θ ≠ 1.7 of the copula parameter θ ∈ [1, ∞). Again, the Wasserstein test yields uniformly higher empirical power.

Elliptical families
If the radial density f_rad is the density of the square root of a chi-square random variable with d degrees of freedom, the elliptical family E(f_rad) corresponds to the Gaussian family. The null hypothesis in (3) then is that P is multivariate Gaussian with unknown mean vector and positive definite covariance matrix.
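For simulation purposes, an elliptical law E(f_rad) with location µ and scatter Σ = ΛΛ⊤ can be sampled via the stochastic representation X = µ + R Λ U, with U uniform on the unit sphere and R ~ f_rad independent of U. The following Python sketch (an illustration under these assumptions, not code from the paper) uses the chi radial density, recovering the Gaussian case.

```python
import numpy as np

def sample_elliptical(n, mu, scatter, radial_sampler, rng):
    # Draw X = mu + R * (Lambda @ U): U uniform on the unit sphere,
    # R ~ f_rad independent of U, scatter = Lambda @ Lambda.T.
    d = len(mu)
    g = rng.standard_normal((n, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction
    r = radial_sampler(n, rng)                         # radial part
    lam = np.linalg.cholesky(scatter)
    return mu + (r[:, None] * u) @ lam.T

# with R the square root of a chi-square_d variable, X is exactly N(mu, scatter)
rng = np.random.default_rng(0)
d = 2
chi_radial = lambda n, rng: np.sqrt(rng.chisquare(df=d, size=n))
x = sample_elliptical(5000, np.zeros(d), np.eye(d), chi_radial, rng)
```

Replacing chi_radial by the square root of a rescaled F(d, ν) variable yields the Student t elliptical family used later in this section.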
Testing multivariate normality is a well-studied problem for which many tests have been put forward. As benchmarks, we will consider here the tests proposed in Royston (1982), Henze and Zirkler (1990), and Rizzo and Székely (2016). Royston's test is a generalisation of the well-known Shapiro-Wilk test. The Henze-Zirkler test statistic is an integrated weighted squared distance between the characteristic function under the null and its empirical counterpart. Interestingly, Ramdas, García Trillos and Cuturi (2017) showed that the Wasserstein distance and the energy distance of Rizzo and Székely (2016) are connected, as the so-called entropy-penalized Wasserstein distance interpolates between the two. We borrowed the implementation of these tests from the R package MVN (Korkmaz, Goksuluk and Zararsiz, 2014). The test by Rippl, Munk and Sturm (2016) considered in Section 5.1 does not apply here, since it can only handle fully specified Gaussian distributions.

Fig 3. Empirical powers of the GoF tests (see Section 5.1.1) for the simple null hypothesis H^n_0 : P = P_0 with P_0 a bivariate distribution with standard Gaussian margins and Gumbel copula with parameter θ = 1.7; rejection frequencies are plotted against the copula parameter θ.
The alternatives in the two panels of Figure 4 are (a) P with standard normal margins and a Gumbel copula with parameter θ ranging over [1, ∞); (b) P with independent margins, one of which is standard normal while the other one is Student t with ν > 0 degrees of freedom.
Inspection of Figure 4 reveals that the Wasserstein test has the highest power against the copula alternative in panel (a), while Royston's test has no power at all. For the Student t alternative in panel (b), Royston's test comes out as most sensitive, but the Wasserstein and energy tests (Rizzo and Székely, 2016) perform quite well too. In Figure 5, we consider the bivariate Student (ν = 12 degrees of freedom) elliptical family, with radial density f_rad the density of the square root of a rescaled Fisher F(d, 12) variable. Figure 5 provides a plot of rejection frequencies under bivariate skew-t alternatives (Azzalini, 2014) with marginal skewness parameters α_1 and α_2. Simulations were based on the function rmst from the R package sn (Azzalini, 2020). In principle, the empirical process approach in Khmaladze (2016) leads to test statistics that are asymptotically distribution-free, but their numerical implementation involves a number of multiple integrals, the computation of which remains problematic.

Fig 4. Empirical power curves of various tests of the hypothesis that P is bivariate Gaussian with unknown mean vector and covariance matrix. The Wasserstein test in Section 3 is compared to three other multivariate normality tests mentioned in Section 5.2. In (a), the alternative P has a Gumbel copula with parameter θ; rejection frequencies are plotted against θ ∈ [1, ∞). In (b), one of the marginals of P is a Student t distribution with ν degrees of freedom; rejection frequencies are plotted against ν > 0.

General parametric families
We now turn to the more general example of a non-elliptical parametric model M where the parametric bootstrap procedure described in Section 4 nevertheless applies. In the notation of Remark 3, let M = {Q_ψ : ψ ∈ Ψ} consist of the bivariate distributions with Gaussian marginals and an Ali-Mikhail-Haq (AMH) copula, yielding a five-dimensional parameter vector ψ = (µ_1, σ_1, µ_2, σ_2, θ), where µ_1, µ_2 ∈ R and σ_1, σ_2 ∈ (0, ∞) are marginal location and scale parameters and θ ∈ Θ = [−1, 1] is the AMH copula parameter. We applied the method involving the location-scale reduction described in Remark 3. Following Genest, Ghoudi and Rivest (1995), the copula parameter θ was estimated via a rank-based maximum pseudo-likelihood estimator. Obviously, the componentwise ranks of the data and those of the residuals in (14) coincide, so that θ̂_n, as required, depends on the data only through the residuals. We first checked the validity of the parametric bootstrap procedure of Section 4. To do so, we simulated 1 000 independent random samples of size n = 200 from P ∈ M with θ = 0.7. For each sample, we calculated the test statistic T^ls_{M,n} in (15) and checked whether or not it exceeds the bootstrapped critical value c^ls_M(α, n, θ̂_n) for α equal to multiples of 5%. The critical value function θ ↦ c^ls_M(α, n, θ) in (16) was pre-computed by Algorithm 3, or rather a variation thereof taking into account the estimated residuals in Eq. (14). The points in Figure 6(a) show the empirical type I errors as a function of α. The diagonal line fits the points well, lending support to the validity of the parametric bootstrap method (if not proving it). Figure 6(b) similarly displays the rejection frequencies of the Wasserstein test under an alternative P whose copula belongs to the Frank family with parameter η. If η = 0, the Frank copula reduces to the independence copula, which is a member of the AMH family too. Again, the approach in Khmaladze (2016) in principle also applies, but its actual implementation is intricate and remains unsettled.

Fig 5. Empirical power of the Wasserstein test in Section 3 for the hypothesis that P is bivariate Student t with ν = 12 degrees of freedom and unknown mean vector and covariance matrix. The alternatives P are bivariate skew-t with skewness parameters α_1 and α_2.

Appendix A: Uniform convergence of the empirical Wasserstein distance

Theorem 1. Let M ⊆ P_p(R^d) be such that lim_{r→∞} sup_{P∈M} ∫_{‖x‖>r} ‖x‖^p dP(x) = 0. Then, lim_{n→∞} sup_{P∈M} E_P[W_p^p(P̂_n, P)] = 0.
The condition on M is equivalent to the statement that the closure of M in the metric space (P_p(R^d), W_p) is compact. This follows from Prohorov's theorem and the characterization of W_p-convergence in Bickel and Freedman (1981, Lemma 8.3) or Villani (2009, Theorem 6.9). The convergence rate of E_P[W_p^p(P̂_n, P)] has been studied intensively; see, for instance, Fournier and Guillin (2015, Theorem 1). However, those rates require the existence of moments of order q higher than p.
Proof of Theorem 1. The following smoothing argument is inspired by the proof of Theorem 1.1 in Horowitz and Karandikar (1994). Let U_σ denote the Lebesgue-uniform distribution on the ball {x ∈ R^d : ‖x‖ ≤ σ} in R^d with radius σ ∈ (0, ∞) and centered at the origin. Denoting by * the convolution of probability measures, we have, for any Q ∈ P_p(R^d), W_p(Q * U_σ, Q) ≤ σ. Indeed, if X and Y are independent random vectors with distributions Q and U_σ, respectively, then (X + Y, X) is a coupling of Q * U_σ and Q, so that W_p^p(Q * U_σ, Q) ≤ E‖(X + Y) − X‖^p = E‖Y‖^p ≤ σ^p. By the triangle inequality, it follows that W_p(P̂_n, P) ≤ 2σ + W_p(P̂_n * U_σ, P * U_σ).
Taking expectations and using the elementary inequality (a + b)^p ≤ 2^{p−1}(a^p + b^p) for p ≥ 1, a ≥ 0, and b ≥ 0, we obtain E_P[W_p^p(P̂_n, P)] ≤ 2^{p−1}[(2σ)^p + E_P{W_p^p(P̂_n * U_σ, P * U_σ)}]. If we can show that, for every fixed σ > 0, lim sup_{n→∞} sup_{P∈M} E_P[W_p^p(P̂_n * U_σ, P * U_σ)] = 0, (17) then it will follow that lim sup_{n→∞} sup_{P∈M} E_P[W_p^p(P̂_n, P)] ≤ 2^{2p−1} σ^p. But then, since σ > 0 is arbitrary, this latter lim sup is actually a lim and is equal to zero, as required.
Let us proceed to show (17). Fix σ > 0 for the remainder of the proof. Let f_σ denote the density function of U_σ. The measures P̂_n * U_σ and P * U_σ are absolutely continuous too, with density functions x ↦ n^{−1} Σ_{i=1}^n f_σ(x − X_i) and x ↦ ∫ f_σ(x − y) dP(y), respectively. The Wasserstein distance can be controlled by weighted total variation (Villani, 2009, Theorem 6.15): W_p^p(P̂_n * U_σ, P * U_σ) ≤ 2^{p−1} ∫_{R^d} ‖x‖^p |n^{−1} Σ_{i=1}^n f_σ(x − X_i) − ∫ f_σ(x − y) dP(y)| dx. Take expectations and apply Fubini's theorem to see that E_P[W_p^p(P̂_n * U_σ, P * U_σ)] ≤ 2^{p−1} ∫_{R^d} ‖x‖^p g_n(x; P) dx, where g_n(x; P) := E_P|n^{−1} Σ_{i=1}^n f_σ(x − X_i) − ∫ f_σ(x − y) dP(y)|. (18) Let r > σ and split the integral in (18) according to whether ‖x‖ > r or ‖x‖ ≤ r. Note that f_σ(u) = f_σ(0) if ‖u‖ ≤ σ and f_σ(u) = 0 otherwise. For any P ∈ P(R^d) and any x ∈ R^d, we have, by the Cauchy-Schwarz inequality, g_n(x; P) ≤ n^{−1/2} f_σ(0).

It follows that lim_{n→∞} sup_{P∈M} ∫_{‖x‖≤r} ‖x‖^p g_n(x; P) dx = 0. But then, in view of (18), we have lim sup_{n→∞} sup_{P∈M} E_P[W_p^p(P̂_n * U_σ, P * U_σ)] ≤ lim sup_{n→∞} sup_{P∈M} 2^{p−1} ∫_{‖x‖>r} ‖x‖^p g_n(x; P) dx.
By the triangle inequality, we also have, for all n, g_n(x; P) ≤ 2 ∫ f_σ(x − y) dP(y). Applying Fubini's theorem once more, we find that ∫_{‖x‖>r} ‖x‖^p g_n(x; P) dx ≤ 2 ∫_{y∈R^d} ∫_{‖x‖>r} ‖x‖^p f_σ(x − y) dx dP(y) = 2 ∫_{y∈R^d} ∫_{‖u+y‖>r} ‖u + y‖^p f_σ(u) du dP(y).
Since f_σ(u) = 0 whenever ‖u‖ > σ and since r > σ, the inner integral vanishes unless ‖y‖ > r − σ, in which case it is bounded by 2^{p−1}(σ^p + ‖y‖^p). Choosing r > 2σ, we get that ‖y‖ > σ for all y in the non-zero branch above, and thus, for all n, ∫_{‖x‖>r} ‖x‖^p g_n(x; P) dx ≤ 2^{p+1} ∫_{‖y‖>r−σ} ‖y‖^p dP(y).
It follows that, for every r > 2σ, lim sup_{n→∞} sup_{P∈M} E_P[W_p^p(P̂_n * U_σ, P * U_σ)] ≤ 2^{2p} sup_{P∈M} ∫_{‖y‖>r−σ} ‖y‖^p dP(y). The left-hand side does not depend on r. The condition on M implies that the right-hand side converges to zero as r → ∞. It follows that the left-hand side must be equal to zero. But this is exactly (17), as required. The proof is complete.
Corollary 1. For M as in Theorem 1, we have, for every ε > 0, lim_{n→∞} sup_{P∈M} P^n[W_p^p(P̂_n, P) > ε] = 0, i.e., W_p^p(P̂_n, P) → 0 in probability as n → ∞, uniformly in P ∈ M. Proof. By Markov's inequality, for every ε > 0 and every P ∈ P_p(R^d), we have P^n[W_p^p(P̂_n, P) > ε] ≤ ε^{−1} E_P[W_p^p(P̂_n, P)]. In view of Theorem 1, the expectation on the right-hand side converges to zero uniformly in P ∈ M.

Appendix B: Algorithms for the computation of critical values
Our test statistics involve the Wasserstein distance between an empirical measure and a continuous one. Their calculation requires solving a semi-discrete optimal transport problem (Section 1.3), for which we relied on the function semidiscrete in the R package transport (Schuhmacher et al., 2019), which implements the method of Mérigot (2011). The method starts from a discretization of the source density. The quality of the approximation can be controlled by choosing a sufficiently fine mesh and setting the tolerance parameter to a small value. The meshes considered here consisted of approximately 10^5 cells.
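In dimension one, the semi-discrete problem can even be solved exactly, since the optimal plan matches quantiles. The following Python sketch (an illustration, not the Mérigot (2011) algorithm behind the transport package) computes the semi-discrete W_2^2 between an empirical measure and a continuous law by integrating the squared deviation over the quantile cell assigned to each order statistic.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def semidiscrete_w2_1d(x, ppf):
    # Exact 1-D semi-discrete W_2^2 between the empirical measure of x and a
    # continuous law with quantile function ppf: the optimal plan sends the
    # quantile cell ((i-1)/n, i/n) of the continuous law to the i-th order
    # statistic, so
    #   W_2^2 = sum_i  int_{(i-1)/n}^{i/n} (ppf(t) - x_(i))^2 dt.
    xs = np.sort(x)
    n = len(xs)
    total = 0.0
    for i, xi in enumerate(xs):
        val, _ = quad(lambda t, xi=xi: (ppf(t) - xi) ** 2,
                      i / n, (i + 1) / n, limit=100)
        total += val
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=200)
t = semidiscrete_w2_1d(x, norm.ppf)  # statistic for H0: P = N(0, 1)
```

In higher dimensions no such quantile matching exists, which is why the paper resorts to the mesh-based semi-discrete solver described above.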
Below we provide pseudo-code algorithms to sketch the main steps in the actual computation of the critical values. We start with the case of a simple null hypothesis (Algorithm 1), then turn to elliptical families with a given generator (Algorithm 2), and finally propose the bootstrap procedure for a general parametric family (Algorithm 3). The empirical distribution associated with a sample X = (X_1, ..., X_n) ∈ (R^d)^n is denoted by P̂_n(X). The largest integer not larger than a scalar x ∈ R is denoted by ⌊x⌋.
In Algorithm 3, we first compute c_M(α, n, θ) for θ in a finite mesh Θ_1 ⊆ Θ. From these values, we reconstruct the function θ ↦ c_M(α, n, θ) by smoothing, and the actual estimate θ̂_n is then plugged into the resulting function. Further, we restrict the bootstrap parameter estimates θ̂*_{n,b} to lie in another finite mesh Θ_2 ⊆ Θ, because calculation of the bootstrapped test statistics T*_{M,n,b} requires a preliminary discretization of the density associated with θ̂*_{n,b} in order to solve the corresponding semi-discrete optimal transport problem. The first loop in Algorithm 3 discretizes the densities of P_θ for θ ∈ Θ_2. The second loop calculates c_M(α, n, θ) for θ ∈ Θ_1 by drawing B samples of size n from P_θ. The final step of the algorithm consists of reconstructing the function θ ↦ c_M(α, n, θ) by smoothing. This smoothing step is illustrated in Figure 7 for the five-parameter bivariate Gaussian-AMH model in Section 5.3, applying the location-scale reduction in Remark 3.
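The two-stage scheme (pre-compute on a mesh, then smooth) can be sketched as follows in Python. The mesh, the synthetic stand-in values for the bootstrap critical values, the quadratic fit, and the value θ̂_n = 0.37 are all illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-in): bootstrap critical values c_M(alpha, n, theta) on a
# coarse mesh Theta_1; here replaced by a smooth synthetic curve plus
# Monte Carlo noise, purely for illustration.
theta_mesh = np.linspace(-1.0, 1.0, 9)
c_raw = 0.05 + 0.01 * theta_mesh ** 2 \
        + rng.normal(scale=1e-3, size=theta_mesh.size)

# Stage 2: reconstruct theta -> c_M(alpha, n, theta) by smoothing (here a
# low-degree polynomial fit) and evaluate it at the actual estimate.
coef = np.polyfit(theta_mesh, c_raw, deg=2)
c_fun = np.poly1d(coef)
theta_hat = 0.37                       # hypothetical parameter estimate
critical_value = float(c_fun(theta_hat))
```

Smoothing averages out the Monte Carlo noise in the mesh values, so the fitted curve is typically more accurate at θ̂_n than the nearest raw mesh value would be.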
The quality of the approximate critical thresholds is ensured by choosing a large enough number of Monte Carlo replications N (Algorithms 1 and 2) or bootstrap replicates B (Algorithm 3). In the simulation experiments, we chose N between 3 000 and 10 000 depending on the time required, while B = 1 000.
Algorithm 1: Computation of c(α, n, P 0 ) in Eq. (2) Input: • A mesh that supports the source density f associated to P 0

Appendix C: A banana-shaped distribution
The "banana-shaped" distribution in Section 5.1 and Figure 1 is a mixture of three Gaussian components. Figure 8 shows a scatterplot of a random sample of size n = 500 from this distribution with mixing weight p = 0.35.