Wasserstein distance error bounds for the multivariate normal approximation of the maximum likelihood estimator

We obtain explicit $p$-Wasserstein distance error bounds between the distribution of the multi-parameter MLE and the multivariate normal distribution. Our general bounds, which hold for possibly high-dimensional, independent and identically distributed random vectors, are of the optimal $\mathcal{O}(n^{-1/2})$ order. Explicit numerical constants are given when $p\in(1,2]$, and in the case $p>2$ the bounds are explicit up to a constant factor that depends only on $p$. We apply our general bounds to derive Wasserstein distance error bounds for the multivariate normal approximation of the MLE in several settings: single-parameter exponential families, the normal distribution under canonical parametrisation, and the multivariate normal distribution under non-canonical parametrisation. In addition, we provide upper bounds with respect to the bounded Wasserstein distance when the MLE is implicitly defined.


Introduction
The asymptotic normality of the maximum likelihood estimator (MLE), under regularity conditions, is one of the most fundamental and well-known results in statistical theory. However, progress has only been made very recently on the problem of deriving error bounds for the distance between the distribution of the MLE, under general regularity conditions, and its limiting normal distribution. This is in part due to the fact that the MLE is in general a nonlinear statistic for which classical techniques for distributional approximation, such as Stein's method [41], are difficult to apply directly, although, amongst other works, [12] and [35] have obtained optimal order Berry-Esseen-type bounds for quite broad classes of nonlinear statistics.
In recent years, however, there have been a number of contributions to the problem of quantifying the closeness of the MLE to its asymptotic normal distribution. Under general regularity conditions, [4] used Stein's method to obtain an explicit $O(n^{-1/2})$ bound, where $n$ is the sample size, between the distribution of the single-parameter MLE and the normal distribution in the bounded Wasserstein metric (this and all other probability metrics mentioned in this paper will be defined in Section 2.2). In the special case that the MLE can be expressed as a suitably smooth function of a sum of independent and identically distributed (i.i.d.) observations, [3] obtained bounds that sharpen and simplify those of [4]. The results of [4] were extended by [1] to quantify the closeness between the multi-parameter MLE and its limiting multivariate normal distribution; when converted into Kolmogorov distance bounds, these are of the worse order $O(n^{-1/8})$ (see Remark 2.3). For the time being, to the best of our knowledge, the $O(n^{-1/4})$ Kolmogorov distance bounds for the multi-parameter MLE that can be deduced from our Wasserstein distance bounds have the best dependence on $n$ in the current literature.
The rest of the paper is organised as follows. In Section 2, we present the setting of the paper. This includes the notation, regularity conditions for our main results, definitions of the probability metrics used in the paper and a relationship between the 1-Wasserstein and Kolmogorov metrics; we also recall some results from the literature on Stein's method for normal and multivariate normal approximation. In Section 3, we state and prove our main results. Theorem 3.1 provides an optimal order Wasserstein distance bound on the closeness between the distribution of the multi-parameter MLE and its limiting multivariate normal distribution. We also present a simplified bound in the univariate case (Theorem 3.2). Theorem 3.3 provides $p$-Wasserstein distance analogues of the bounds of Theorems 3.1 and 3.2. In Section 4, we apply the results of Section 3 in the settings of single-parameter exponential families, the normal distribution under canonical parametrisation, and the multivariate normal distribution under non-canonical parametrisation. In addition, we provide upper bounds, with respect to the bounded Wasserstein distance, for cases where the MLE cannot be expressed analytically. In Section 4.5, we carry out a simulation study to assess the accuracy of our bounds. Some technical proofs, examples, and calculations are postponed to Appendix A.

Regularity conditions
The notation that is used throughout the paper is as follows. The parameter space is $\Theta \subset \mathbb{R}^d$, equipped with the Euclidean norm. Let $\theta = (\theta_1, \theta_2, \ldots, \theta_d)^\top$ denote a parameter from the parameter space, while $\theta_0 = (\theta_{0,1}, \theta_{0,2}, \ldots, \theta_{0,d})^\top$ denotes the true, but unknown, value of the parameter. For $X = (X_1, X_2, \ldots, X_n)$ being i.i.d. random vectors in $\mathbb{R}^t$, $t \in \mathbb{Z}^+$, we denote by $f(x_i|\theta)$ the probability density (or mass) function of $X_i$. The likelihood function is $L(\theta; x) = \prod_{i=1}^n f(x_i|\theta)$, where $x = (x_1, x_2, \ldots, x_n)$. Its natural logarithm, called the log-likelihood function, is denoted by $\ell(\theta; x) = \log L(\theta; x)$. We shall write $\nabla = \big(\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_d}\big)^\top$ to denote the gradient operator with respect to the unknown parameter vector $\theta$. A maximum likelihood estimate (not seen as a random vector) is a value in the parameter space which maximises the likelihood function. For many models, the MLE as a random vector exists and is also unique, in which case it is denoted by $\hat{\theta}_n(X)$, the MLE for $\theta_0$ based on the sample $X$. A set of assumptions that ensure existence and uniqueness of the MLE are given in [27]; this is known as the 'regular' case. However, existence and uniqueness of the MLE cannot be taken for granted; see [8] for an example of non-uniqueness. We shall write $\mathrm{E}$ to denote the expectation with respect to $\theta_0$, and $\mathrm{E}_\theta$ to denote the expectation with respect to $\theta$.
Let us now present standard regularity conditions under which asymptotic normality of the MLE holds [14]:

(R.C.1) The densities defined by any two different values of $\theta$ are distinct, and the support of $f(x|\theta)$ does not depend on $\theta$.

(R.C.2) The order of differentiation with respect to $\theta$ and integration over the sample space can be interchanged for $f(x|\theta)$ and its first two derivatives.

(R.C.3) The expected Fisher information matrix for a single random vector, $I(\theta)$, is finite and positive definite. For $r, s \in \{1, 2, \ldots, d\}$, its elements satisfy
$$[I(\theta)]_{rs} = \mathrm{E}_\theta\Big[\frac{\partial}{\partial\theta_r}\log f(X_1|\theta)\,\frac{\partial}{\partial\theta_s}\log f(X_1|\theta)\Big].$$
This condition implies that $nI(\theta)$ is the covariance matrix of $\nabla(\ell(\theta; X))$.

(R.C.4) All third order partial derivatives of the log-likelihood $\ell(\theta; x)$ with respect to $\theta$ exist and are dominated, in a neighbourhood of the true value $\theta_0$, by functions of $x$ with finite expectation.
In addition to these regularity conditions, [14] assumes that the true value $\theta_0$ of $\theta$ is interior to the parameter space $\Theta \subset \mathbb{R}^d$, which is compact. Throughout this paper, we shall instead assume that the parameter space $\Theta \subset \mathbb{R}^d$ is open. Conditions (R.C.1), (R.C.3) and (R.C.4) are stated explicitly on page 118 of [14]. We have expressed (R.C.4) slightly differently to how it is stated in [14], so that our presentation is consistent with that of the book [10] and a similar regularity condition (R.C.4') of [1], which are both referred to in our paper. Condition (R.C.2) is not stated on page 118 of [14], but is crucial to the proof and is implied by equation (4.32) on page 124 of [14], in which an interchange in the order of integration and differentiation is assumed.
The asymptotic normality of the MLE was first discussed by [16]. Here, with the above regularity conditions, we present the following statement of the asymptotic normality of the multi-parameter MLE for i.i.d. random vectors; for the independent but not necessarily identically distributed case see [22].
Theorem 2.1 (Davison [14]). Let $X_1, X_2, \ldots, X_n$ be i.i.d. random vectors with probability density (or mass) functions $f(x_i|\theta)$, where $\theta \in \Theta \subset \mathbb{R}^d$, and $\Theta$ is compact. Assume that the MLE $\hat{\theta}_n(X)$ exists and is unique and that the regularity conditions (R.C.1)-(R.C.4) hold. Let $Z \sim \mathrm{MVN}(0, I_d)$, where $0$ is the $d \times 1$ zero vector and $I_d$ is the $d \times d$ identity matrix. Then
$$\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big) \xrightarrow{d} Z, \quad n \to \infty.$$

A quantitative version of Theorem 2.1 was obtained by [1] (in the i.i.d. setting) under slightly stronger regularity conditions, these being (R.C.1)-(R.C.3) and the following condition (R.C.4'). Before presenting this condition, we introduce some notation. Let the subscript $(m)$ denote an index for which the quantity $|\hat{\theta}_n(x)_{(m)} - \theta_{0,(m)}|$ is the largest among the $d$ components: $(m) \in \{1, \ldots, d\}$ is such that $|\hat{\theta}_n(x)_{(m)} - \theta_{0,(m)}| = \max_{1 \le k \le d}|\hat{\theta}_n(x)_k - \theta_{0,k}|$.

(R.C.4') The log-likelihood $\ell(\theta; x)$ is three times differentiable with respect to the unknown vector parameter $\theta$ and the third order partial derivatives are continuous in $\theta$. In addition, for any $\theta_0 \in \Theta$ there exist $0 < \epsilon = \epsilon(\theta_0)$ and functions $M_{kjl}(x)$, $\forall k, j, l \in \{1, 2, \ldots, d\}$, that dominate the absolute values of the third order partial derivatives of $\ell(\theta; x)$ for all $\theta$ within distance $\epsilon$ of $\theta_0$, with a suitable conditional expectation of $M_{kjl}(X)$ being finite.

In Theorems 3.1 and 3.3, we shall work with the same regularity conditions as [1], but with (R.C.4') replaced by the following condition (R.C.4"(p)). Before stating condition (R.C.4"(p)), we introduce some terminology. We say that $M(\theta; x)$ is monotonic in the multivariate context if for all fixed $\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_d$ and $x$ we have that, for each $s \in \{1, 2, \ldots, d\}$,
$$\theta_s \mapsto M(\tilde{\theta}_1, \ldots, \tilde{\theta}_{s-1}, \theta_s, \tilde{\theta}_{s+1}, \ldots, \tilde{\theta}_d; x) \quad (2.2)$$
is a monotonic function.
(R.C.4"(p)) All third order partial derivatives of the log-likelihood $\ell(\theta; x)$ with respect to the unknown vector parameter $\theta$ exist. Also, for any $\theta \in \Theta$ and for $X$ denoting the support of the data, we assume that for any $j, l, q \in \{1, 2, \ldots, d\}$ there exists a function $M_{qlj}(\theta; x)$, which is monotonic in the sense defined in (2.2), such that
$$\Big|\frac{\partial^3}{\partial\theta_q\partial\theta_l\partial\theta_j}\ell(\theta; x)\Big| \le M_{qlj}(\theta; x), \quad \forall x \in X,$$
together with the integrability condition (2.3), which requires certain expectations involving $M_{qlj}$ to be finite. In the univariate $d = 1$ case we drop the subscripts and write $M(\theta; x)$.
We include reference to the variable $p$ in the name of our condition (R.C.4"(p)) to emphasise the fact that the integrability condition (2.3) depends on $p$, the order of the Wasserstein distance under consideration. In the case $p = 1$, corresponding to the classical 1-Wasserstein distance, we shall simply write (R.C.4").
Remark 2.2. For brevity, in this remark we discuss the condition (R.C.4"); similar comments apply to the more general condition (R.C.4"(p)). The motivation for introducing (R.C.4") is that in the proof of Theorem 3.1 it allows us to bound one of the remainder terms in the 1-Wasserstein metric, which would not be possible if instead working with (R.C.4) or (R.C.4'). Conditions (R.C.4), (R.C.4') and (R.C.4") each require all third order partial derivatives of $\ell(\theta; x)$ to exist. Each condition then also involves an integrability condition involving a function that dominates the absolute value of these partial derivatives in a certain way. For a given MLE, the integrability conditions in (R.C.4') and (R.C.4") each have extra difficulty compared to (R.C.4): (R.C.4') involves a conditional expectation, whilst for (R.C.4") the expectations in (2.3) involve the MLE. In Section 4, we give some examples in which the MLE takes a relatively simple form, for which the verification of (R.C.4") follows from elementary calculations, and is simpler to work with than the integrability condition involving conditional expectations in (R.C.4'). For complicated MLEs it inevitably becomes more involved to verify (R.C.4"). In Appendix A.1, we give an illustration of how (R.C.4") can be verified for more complicated MLEs using the example of the inverse gamma distribution. A comparison between (R.C.4') and (R.C.4") in the context of obtaining error bounds for the distance between the distribution of the MLE and the multivariate normal distribution is given in Remark 3.5.

In the case of univariate i.i.d. random variables we work with (R.C.4") and the following simpler regularity conditions:

(R1) The densities defined by any two different values of $\theta$ are distinct.

(R2) The density $f(x|\theta)$ is three times differentiable with respect to $\theta$, the third derivative is continuous in $\theta$, and $\int f(x|\theta)\,\mathrm{d}x$ can be differentiated three times under the integral sign.

(R3) The expected Fisher information for a single random variable, $i(\theta)$, satisfies $0 < i(\theta) < \infty$.
These regularity conditions are the same as those used in [10] and [4], with the exception that (R.C.4") takes the place of a univariate version of (R.C.4) and (R.C.4'), respectively.

Probability metrics
Let $X$ and $Y$ be $\mathbb{R}^d$-valued random vectors. Fix $p \ge 1$ and suppose that $\mathrm{E}[|X|^p] < \infty$ and $\mathrm{E}[|Y|^p] < \infty$, where $|\cdot|$ denotes the usual Euclidean norm. Then the $p$-Wasserstein distance between the distributions of $X$ and $Y$ is defined by
$$d_{W_p}(X, Y) = \inf \big(\mathrm{E}[|X' - Y'|^p]\big)^{1/p}, \quad (2.4)$$
where the infimum is taken over all pairs of jointly distributed random vectors $(X', Y')$ that have the same law as $X$ and $Y$, respectively. In the case $p = 1$, corresponding to the 1-Wasserstein distance, we shall drop the subscript 1 and write $d_W$. The infimum in (2.4) is actually a minimum, in that there exists a pair of jointly distributed random variables $(X^*, Y^*)$ with $\mathcal{L}(X^*) = \mathcal{L}(X)$ and $\mathcal{L}(Y^*) = \mathcal{L}(Y)$ attaining the infimum (see Chapter 6 of [42] and Lemma 1 of [28]). By Hölder's inequality, it follows that, if $1 \le p \le q$, then
$$d_{W_p}(X, Y) \le d_{W_q}(X, Y) \quad (2.5)$$
for all $X$ and $Y$ such that $\mathrm{E}[|X|^q] < \infty$ and $\mathrm{E}[|Y|^q] < \infty$ (see again Chapter 6 of [42] and Lemma 1 of [28]).

The 1-Wasserstein metric and several other probability metrics used in this paper can be conveniently expressed as integral probability metrics. For $\mathbb{R}^d$-valued random vectors $X$ and $Y$, integral probability metrics are of the form
$$d_{\mathcal{H}}(X, Y) = \sup_{h \in \mathcal{H}} \big|\mathrm{E}[h(X)] - \mathrm{E}[h(Y)]\big|$$
for some class of functions $\mathcal{H}$. At this stage, we introduce some notation. Here (and elsewhere) $\|\cdot\| := \|\cdot\|_\infty$ denotes the usual supremum norm of a real-valued function, and for a Lipschitz function $h : \mathbb{R}^d \to \mathbb{R}$ we denote $\|h\|_{\mathrm{Lip}} = \sup_{x \ne y} \frac{|h(x) - h(y)|}{|x - y|}$. We shall also use the basic inequality $\big(\sum_{j=1}^d a_j\big)^r \le d^{r-1}\sum_{j=1}^d a_j^r$, where $a_1, \ldots, a_d \ge 0$ and $r \ge 2$, and, for a $d \times d$ matrix $A$, the Frobenius norm $\|A\|_F = \big(\sum_{i=1}^d\sum_{j=1}^d |a_{i,j}|^2\big)^{1/2}$. With this notation in place, taking $\mathcal{H}$ to be the class of functions with $\|h\|_{\mathrm{Lip}} \le 1$ yields the 1-Wasserstein metric; indicator functions of lower-left orthants yield the Kolmogorov metric; bounded Lipschitz functions yield the bounded Wasserstein metric; and classes of three times differentiable test functions with bounded partial derivatives up to a given order yield smooth test function metrics. We denote these metrics by $d_W$, $d_K$ and $d_{bW}$, and the smooth test function metrics by $d_{1,2}$ and $d_{0,1,2,3}$, respectively. In all the above notation, we suppress the dependence on the dimension $d$.

Of the works mentioned in the Introduction, the results of [34] are given in the Kolmogorov metric, [3] and [4] work in the bounded Wasserstein metric, and [1] works in the smooth test function $d_{0,1,2,3}$ metric. It is evident that $d_{bW}$ and $d_{0,1,2,3}$ are weaker than the $d_W$ metric. We now note the following important relations between the Kolmogorov metric and the 1-Wasserstein and bounded Wasserstein metrics, respectively. Let $Y$ be any real-valued random variable and $Z \sim N(0,1)$. Then, by [40, Proposition 1.2] (see also [11, Theorem 3.3]),
$$d_K(Y, Z) \le \Big(\frac{2}{\pi}\Big)^{1/4}\sqrt{d_W(Y, Z)}, \quad (2.12)$$
and an analogous relation between $d_K$ and $d_{bW}$ is given by [33, Proposition 2.4]. For the multi-parameter case, the following generalisation of (2.12) due to [25] is available. Let $Z \sim \mathrm{MVN}(0, I_d)$, $d \ge 1$. Then, for any $\mathbb{R}^d$-valued random vector $Y$, the Kolmogorov distance $d_K(Y, Z)$ is bounded by an explicit multiple, of order $(\log(d+1))^{1/4}$, of $\sqrt{d_W(Y, Z)}$ (2.14). A similar bound with the slightly bigger multiplicative constant of $3(\log(d+1))^{1/4}$ had previously been obtained by [5]. For an analogous relationship between the 1-Wasserstein and convex distances in $\mathbb{R}^d$, see [30].
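In the univariate case, the minimising coupling in (2.4) is the quantile coupling, so the $p$-Wasserstein distance between two empirical distributions on equally many points reduces to an average over paired order statistics. A minimal R sketch (the sample size and seed are our own arbitrary choices for illustration):

```r
# Empirical p-Wasserstein distance between two samples of equal size:
# in one dimension the optimal coupling pairs order statistics.
wasserstein_p <- function(x, y, p = 1) {
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
}

set.seed(1)
n <- 1e5
x <- rexp(n) - 1              # standardised Exp(1) variables (mean 0, variance 1)
z <- rnorm(n)                 # standard normal sample
wasserstein_p(x, z, p = 1)    # estimate of d_W
wasserstein_p(x, z, p = 2)    # estimate of d_{W_2}; note d_W <= d_{W_2}, as in (2.5)
```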
Remark 2.3. In the univariate case, the same argument to that used in the proof of Corollary 4.2 of [20] can be used to show that there exists a universal constant $C$ (which can be found explicitly) such that the Kolmogorov distance is bounded by a fractional power of the smooth test function metric. Using the approach of [5] with a multivariate analogue of the smoothing function of [20] would also lead to a bound of this form in the multivariate case. Consequently, the bounds in the $d_{0,1,2,3}$ metric of [1] for the multivariate normal approximation of the multi-parameter MLE only yield $O(n^{-1/8})$ bounds in the Kolmogorov metric, whilst our $O(n^{-1/2})$ 1-Wasserstein distance bounds lead to $O(n^{-1/4})$ Kolmogorov distance bounds.

Wasserstein distance bounds by Stein's method
Optimal order $O(n^{-1/2})$ 1-Wasserstein distance bounds for the normal approximation of sums of independent random variables via Stein's method date as far back as [15]. We shall make use of the following result.
Theorem 2.4 (Reinert [38]). Let $\xi_1, \ldots, \xi_n$ be i.i.d. random variables with $\mathrm{E}[\xi_1] = 0$, $\mathrm{E}[\xi_1^2] = 1$ and $\mathrm{E}|\xi_1|^3 < \infty$, and set $W = n^{-1/2}\sum_{i=1}^n \xi_i$. Then, for $Z \sim N(0,1)$, $d_W(W, Z) \le C\,\mathrm{E}|\xi_1|^3/\sqrt{n}$, where $C$ is an explicit absolute constant.

Only very recently have optimal order Wasserstein distance bounds been obtained for multivariate normal approximation of independent random vectors. There has been quite a lot of activity on this topic over the last few years, and amongst the bounds from this literature we use a bound of [9], given in Theorem 2.5 below. This is on account of its weak conditions, simplicity, and good dependence on the dimension $d$, which is sufficient for our purposes. It should be noted, however, that there are bounds in the literature that have a better dependence on the dimension $d$; see, for example, [13], in which, in the case of the 2-Wasserstein distance, the fourth moment condition of [9] is replaced by a Poincaré inequality condition. If we were to use such a bound with improved dependence on $d$ in the derivation of our general bounds of Theorems 3.1 and 3.3, it would, however, make no difference to the overall dependence of the bound on the dimension $d$. We also note that in the univariate case, optimal order $n^{-1/2}$ $p$-Wasserstein distance bounds have been obtained for the normal approximation of sums of independent random variables without the use of Stein's method; see [39] and references therein.
The bound (2.16) below is not stated in [9], but is easily obtained from the bound (2.15) (which is given in [9]) by an application of Hölder's inequality. The bound (2.18) is also not stated in [9], but is again easily obtained from the bound (2.17) (which is given in [9]), this time by applying a basic inequality.

Theorem 2.5 (Bonis [9]). Let $\xi_1, \ldots, \xi_n$ be i.i.d. random vectors in $\mathbb{R}^d$ with $\mathrm{E}[\xi_1] = 0$, $\mathrm{E}[\xi_1\xi_1^\top] = I_d$ and $\mathrm{E}[\xi_{1,j}^4] < \infty$, where $\xi_{1,j}$ is the $j$-th component of $\xi_1$, and set $W = n^{-1/2}\sum_{i=1}^n \xi_i$. Then the 2-Wasserstein distance bounds (2.15) and (2.16) hold with explicit constants, and, for $p > 2$, there exists a constant $C_p > 0$ depending only on $p$ such that the $p$-Wasserstein distance bounds (2.17) and (2.18) hold.

Main results and proofs

For ease of presentation, let us now introduce the notation (3.19) for the quantities $W$, $\tilde{V}$, $Q_j$ and $T_{lj}$ used throughout the remainder of the paper. Notice that, using condition (R.C.3), $\mathrm{E}[T_{lj}] = 0$ for all $j, l \in \{1, 2, \ldots, d\}$.
A general 1-Wasserstein distance error bound for the multivariate normal approximation of the multi-parameter MLE is given in the following theorem.

Theorem 3.1. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. $\mathbb{R}^t$-valued, $t \in \mathbb{Z}^+$, random vectors with probability density (or mass) function $f(x_i|\theta)$, for which the true parameter value is $\theta_0$ and the parameter space $\Theta$ is an open subset of $\mathbb{R}^d$. Assume that the MLE exists and is unique and that (R.C.1)-(R.C.3), (R.C.4") are satisfied. In addition, for $\tilde{V}$ as in (3.19), assume that the relevant moments are finite. Then, for $W$ as in (3.19) and $Z \sim \mathrm{MVN}(0, I_d)$,
$$d_W(W, Z) \le K_1(\theta_0) + K_2(\theta_0) + K_3(\theta_0), \quad (3.20)$$
where $K_1(\theta_0)$ is defined in (3.21) and $K_2(\theta_0)$ and $K_3(\theta_0)$ are given in the displays that follow it.

The following theorem is a simplification of Theorem 3.1 for the single-parameter MLE.
Theorem 3.2. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. random variables with probability density (or mass) function $f(x_i|\theta)$, for which the true parameter value is $\theta_0$ and the parameter space $\Theta$ is an open subset of $\mathbb{R}$. Assume that the regularity conditions (R1)-(R3), (R.C.4") are satisfied and that the MLE, $\hat{\theta}_n(X)$, exists and is unique. Assume also that the moments $\mathrm{E}\big|\frac{\mathrm{d}}{\mathrm{d}\theta}\log f(X_1|\theta_0)\big|^k$ appearing in the bound are finite. Then the simplified 1-Wasserstein distance bound (3.22) holds.

$p$-Wasserstein distance analogues of the bounds of the above two theorems are given in the following theorem.

Theorem 3.3. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. $\mathbb{R}^t$-valued, $t \in \mathbb{Z}^+$, random vectors with probability density (or mass) function $f(x_i|\theta)$, for which the true parameter value is $\theta_0$ and the parameter space $\Theta$ is an open subset of $\mathbb{R}^d$. Assume that the MLE exists and is unique and that (R.C.1)-(R.C.3), (R.C.4"(p)) are satisfied. In addition, for $\tilde{V}$ as in (3.19), assume that the relevant moments are finite. Then, for $p \ge 2$, the $p$-Wasserstein distance bound (3.23) holds, where $C_p > 0$ is a constant depending only on $p$.
In the case $p = 2$, we have the following simpler bound with an explicit constant: the 2-Wasserstein distance bound (3.25) holds, where $K_1(\theta_0)$ is defined as in Theorem 3.1.

Remark 3.4. (1) Here and throughout the paper, $O(1)$ is understood as smaller than a constant which does not depend on $n$, but may depend on the dimension $d$. Assuming the relevant moments are finite, we have that $Q_l = O(n^{-1/2})$ for all $l = 1, 2, \ldots, d$. To see this, note that because $W$ is asymptotically standard multivariate normally distributed, it follows that, as $n \to \infty$, $\sqrt{n}\,Q_l = O(1)$ for all $l = 1, 2, \ldots, d$. Also, using condition (R.C.3) and that $X_1, X_2, \ldots, X_n$ are independent, we have that if the relevant moments of the derivatives of the log-likelihood are finite for all $j, l, q \in \{1, 2, \ldots, d\}$, then we are guaranteed that $K_1(\theta_0), K_2(\theta_0) = O(1)$. To see this, note that, by the asymptotic normality of the MLE, we have that, for all $l = 1, 2, \ldots, d$, $\mathrm{E}[(\hat{\theta}_n(X)_l - \theta_{0,l})^4] = O(n^{-2})$. Two applications of the Cauchy-Schwarz inequality then give the claim.

(2) In general, $\ell(\theta; x)$ and $\log(f(x_i|\theta_0))$ will depend on the dimension $d$ (and therefore so will $\tilde{V}_{kj}$ and $M_{qlj}(\theta; x)$, for example), and therefore it is difficult to make precise general statements regarding the dependence of the bound (3.20) of Theorem 3.1 on the dimension $d$. However, it is clear that the term $K_3(\theta_0)$ has a very poor dependence on the dimension $d$. Assuming that $M_{qlj}(\theta; x) = O(1)$ and $\tilde{V}_{kj} = O(1)$, we have that $K_3(\theta_0) = O(d^4 2^d)$. This poor dependence on the dimension is a consequence of the crude inequality (3.31) used in the proof of Theorem 3.1 below, in which, using the monotonicity assumption on $M_{qlj}$, the expectation $\mathrm{E}|Q_l Q_q M_{qlj}(\theta_0^*; X)|$ is bounded by a sum of $2^d$ expectations in which $M_{qlj}$ is evaluated at the $2^d$ vectors whose components are, in each coordinate, either the true parameter value or the MLE; this is inequality (3.26). Inequality (3.26) is useful in that the expectations in the sum are easier to bound directly than the quantity $\mathrm{E}|Q_l Q_q M_{qlj}(\theta_0^*; X)|$, but this comes at the cost of having a sum with $2^d$ terms, resulting in a poor dependence on the dimension $d$. However, as is demonstrated in the examples of Section 4, when the number of dimensions is low, inequality (3.26) (which leads to the term $K_3(\theta_0)$) is very useful, as the computation of the expectations in the sum is often straightforward.
Remark 3.5. Theorem 2.1 of [4] gives a bounded Wasserstein bound on the distance between the distribution of the single-parameter MLE and the normal distribution, and Theorem 2.1 of [1] gives a bound on the distance between the distribution of the multi-parameter MLE and the multivariate normal distribution with respect to the $d_{0,1,2,3}$ metric. Both bounds are of the optimal $O(n^{-1/2})$ order. We now give further comparisons between our bounds and those of [4] and [1].
Theorem 2.1 of [4] holds under the same regularity conditions as our Theorem 3.1, but with condition (R.C.4') instead of (R.C.4"). Condition (R.C.4') introduces a constant $\epsilon$. This causes two complications in the bound of [4]. Firstly, some additional conditional expectations (which involve $\epsilon$) must be estimated; secondly, $\epsilon$ appears in other terms in the bound, and so in applications of the bound $\epsilon$ must later be optimised. Our bound (3.22) has no such complications, and in most applications we would expect that the expectations that must be estimated in our bound are easier to work with than those of [4], and ultimately lead to better bounds (even when given in a stronger metric). Indeed, in Section 4.1 we apply Theorem 3.1 to derive 1-Wasserstein distance bounds for the normal approximation of the MLE of the exponential distribution in the canonical and non-canonical parametrisations, and we find that in both cases our bounds outperform those that were obtained by [4].
Theorem 2.1 of [1] also holds under the same regularity conditions as our Theorem 3.1, but with condition (R.C.4') instead of (R.C.4"). The bound of [1] therefore has similar complications to the bound of [4], and overall the bound of [1] takes a more complicated form than our bound (3.20) in Theorem 3.1. For small dimension $d$, we would therefore expect our bound to be preferable to that of [1] and lead to better bounds in applications. However, as noted in Remark 3.4, the term $K_3(\theta_0)$ of bound (3.20) has a very poor dependence on the dimension $d$; much worse than the bound of [1]. In applications in which the dependence on the dimension is more important than the choice of metric, the bound of [1] may therefore be preferable to our bound (3.20).
Proof of Theorem 3.1. By the triangle inequality we have that
$$d_W(W, Z) \le d_W\Big(\frac{1}{\sqrt{n}}\tilde{V}\nabla(\ell(\theta_0; X)), Z\Big) + d_W\Big(W, \frac{1}{\sqrt{n}}\tilde{V}\nabla(\ell(\theta_0; X))\Big) =: R_1 + R_2.$$
We now proceed to find upper bounds for the terms $R_1$ and $R_2$.
Let us now focus on bounding $\mathrm{E}|Q_l Q_q M_{qlj}(\theta_0^*; X)|$. As $M_{qlj}$ is a monotonic function in the sense defined in (2.2), we have that, for all $x \in X$, $M_{qlj}(\theta_0^*; x)$ is bounded above by the sum of the values of $M_{qlj}$ at the $2^d$ vectors whose components are, in each coordinate, either the true parameter value or the MLE; this is inequality (3.31). Applying inequality (3.31) to (3.30) gives a bound for $R_2$. Finally, combining our bounds for $R_1$ and $R_2$ yields inequality (3.20). $\square$

Proof of Theorem 3.2. The proof is exactly the same as that of Theorem 3.1, with the exception that the term $R_1$ in (3.27) is bounded using Theorem 2.4, rather than Theorem 2.5. $\square$

Proof of Theorem 3.3. The proof is similar to that of Theorem 3.1. Let $p \ge 2$. By the triangle inequality we have that
$$d_{W_p}(W, Z) \le d_{W_p}\Big(\frac{1}{\sqrt{n}}\tilde{V}\nabla(\ell(\theta_0; X)), Z\Big) + d_{W_p}\Big(W, \frac{1}{\sqrt{n}}\tilde{V}\nabla(\ell(\theta_0; X))\Big) =: R_{1,p} + R_{2,p}.$$
The term $R_{1,p}$ can be bounded similarly to how we bounded $R_1$ in the proof of Theorem 3.1. In the case $p = 2$ we obtain the same bound $K_1(\theta_0)$ for $R_{1,2}$, and for the case $p \ge 2$ the only way our argument changes is that we apply inequality (2.18), rather than inequality (2.16).
To bound $R_{2,p}$, we note that the random vectors $W$ and $\frac{1}{\sqrt{n}}\tilde{V}\nabla(\ell(\theta_0; X))$ are defined on the same probability space and thus provide a coupling. It therefore follows from the definition of the $p$-Wasserstein distance that
$$R_{2,p} \le \Big(\mathrm{E}\Big|W - \frac{1}{\sqrt{n}}\tilde{V}\nabla(\ell(\theta_0; X))\Big|^p\Big)^{1/p}.$$
Substituting (3.29) into this bound and using the triangle inequality now gives an estimate for $R_{2,p}$.
We now apply a further moment inequality to this estimate; in obtaining the second inequality we used the Cauchy-Schwarz inequality and a similar argument to the one used to obtain inequality (3.31). Summing up our bounds for $R_{1,p}$ and $R_{2,p}$, in the cases $p \ge 2$ and $p = 2$, yields the desired bounds (3.23) and (3.25), respectively. $\square$

Examples

In this section, we apply the general theorems of Section 3 to obtain explicit optimal $O(n^{-1/2})$ Wasserstein distance bounds for the multivariate normal approximation of the MLE in several important settings. Each of the examples given is of interest in its own right, and taken together the examples provide a useful demonstration of the application of the general theorems to derive explicit bounds for particular MLEs of interest. Our focus in this section is mostly on obtaining bounds with respect to the 1-Wasserstein metric, although we do derive some bounds with respect to the 2-Wasserstein metric. It should be noted, however, that $p$-Wasserstein ($p \ge 1$) analogues of each of the bounds derived in this section can be obtained through an application of Theorem 3.3; see Corollary 4.3 for a 2-Wasserstein distance bound for the normal approximation of the exponential distribution under canonical parametrisation. In Proposition 4.11 we provide an upper bound with respect to the bounded Wasserstein distance for cases where the MLE cannot be expressed analytically.

Single-parameter exponential families
The distribution of a random variable $X$ is said to be a single-parameter exponential family distribution if the probability density (or mass) function is of the form
$$f(x|\theta) = \exp\{k(\theta)T(x) - A(\theta) + S(x)\}, \quad x \in B,$$
where the set $B = \{x : f(x|\theta) > 0\}$ is the support of $X$ and does not depend on $\theta$; $k(\theta)$ and $A(\theta)$ are functions of the parameter; and $T(x)$ and $S(x)$ are functions only of the data. Many popular distributions are members of the exponential family, including the normal, gamma and beta distributions.
The choice of the functions $k(\theta)$ and $T(X)$ is not unique. If $k(\theta) = \theta$ we have the so-called canonical case. In this case $\theta$ and $T(X)$ are called the natural parameter and natural observation [10]. It is often of interest to work under the canonical parametrisation due to appealing theoretical properties that can, for example, simplify the theory and computational complexity in generalised linear models. In fact, as noted in Remark 4.2 below, our general bound (4.33) in Corollary 4.1 for the normal approximation of the MLE for exponential family distributions simplifies in the canonical case. Canonical parametrisations are important in, amongst other examples, Gaussian graphical models [26] and precision matrix estimation [29].
We give two examples using the exponential distribution: firstly in its canonical form, and then, in Appendix A.2, under a change of parametrisation. The example given in the appendix is for purely illustrative purposes, as an improved bound can be obtained directly by Stein's method.
In the case of $X_1, X_2, \ldots, X_n$ i.i.d. exponentially distributed $\mathrm{Exp}(\theta)$ random variables, where $\theta > 0$, the probability density function is $f(x|\theta) = \theta\mathrm{e}^{-\theta x}$, $x > 0$, which is of the above form with $k(\theta) = -\theta$, $T(x) = x$, $A(\theta) = -\log\theta$ and $S(x) = 0$ (equivalently, $k(\theta) = \theta$ and $T(x) = -x$). Hence, $\mathrm{Exp}(\theta)$ is a single-parameter canonical exponential family distribution. The MLE is unique and given by $\hat{\theta}_n(X) = 1/\bar{X}$, where $\bar{X} = n^{-1}\sum_{i=1}^n X_i$.
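As a quick illustration of the quantities involved, the standardised MLE $W = \sqrt{n\,i(\theta_0)}(\hat{\theta}_n(X) - \theta_0)$, with $i(\theta_0) = 1/\theta_0^2$, can be simulated in R and compared with the standard normal distribution; a minimal sketch with our own parameter choices:

```r
set.seed(1)
theta0 <- 2; n <- 100; reps <- 1e4
W <- replicate(reps, {
  x <- rexp(n, rate = theta0)
  theta_hat <- 1 / mean(x)                    # MLE of the rate parameter
  sqrt(n / theta0^2) * (theta_hat - theta0)   # i(theta0) = 1/theta0^2
})
c(mean(W), sd(W))  # close to (0, 1); the small positive mean reflects the bias of 1/Xbar
```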
It should be noted that the exact values of $d_W(W, Z)$ and $d_{W_2}(W, Z)$ do not depend on $\theta_0$. This is because a simple scaling argument, using the fact that $i(\theta_0) = 1/\theta_0^2$, shows that the distribution of $W = \sqrt{n\,i(\theta_0)}(\hat{\theta}_n(X) - \theta_0)$ does not involve $\theta_0$. Hence, it is a desirable feature of our bounds that they do not depend on $\theta_0$.
Proof. Straightforward steps can be followed in order to prove that the assumptions (R1)-(R3), (R.C.4") and (R.C.4"(2)) hold for this example; we will not show that here. The log-likelihood function is $\ell(\theta; x) = n\log\theta - \theta\sum_{i=1}^n x_i$, and its third derivative is given by $\ell'''(\theta; x) = 2n/\theta^3$.

For (1): Since $T(x) = x$, we have that $\mathrm{Var}(T(X_1)) = \mathrm{Var}(X_1) = 1/\theta_0^2$, and therefore the first term of the upper bound in (4.33) can be computed explicitly (4.39). According to Remark 4.2, the second term of the bound in (4.33) vanishes. Finally, we consider the third term. Recall that we can take $M(\theta; x) = 2n/\theta^3$. Since $X_i \sim \mathrm{Exp}(\theta_0)$, $i = 1, 2, \ldots, n$, we have that $\bar{X} \sim G(n, n\theta_0)$, with $G(\alpha, \beta)$ being the gamma distribution with shape parameter $\alpha$ and rate parameter $\beta$. Using now the fact that $\hat{\theta}_n(x) = 1/\bar{x}$, the results on pp. 70-73 of [24] give, for $n > 2$, exact expressions for the required moments (4.40) and (4.41). Applying the results of (4.39), (4.40) and (4.41) to (4.33), and using that $i(\theta_0) = 1/\theta_0^2$, yields result (1) of the corollary.
For (2): For the first term of the upper bound in (4.34), since $i(\theta_0) = 1/\theta_0^2$, the required quantity is computed explicitly (4.42), where we used the standard formula for the fourth central moment of the gamma distribution. The second term in (4.34) vanishes because $k''(\theta_0) = 0$. With respect to the third term, recall that $M(\theta; x) = 2n/\theta^3$. We have already mentioned that $\hat{\theta}_n(X) = 1/\bar{X}$ and $\bar{X} \sim G(n, n\theta_0)$; therefore, simple calculations yield the required expectations (4.43) and (4.44). Applying now the results of (4.42), (4.43) and (4.44) to (4.34), and using that the second term of the bound in (4.34) vanishes, yields result (2) of the corollary. Note that an elementary inequality, valid for any $n > 4$, has also been used. $\square$
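The gamma moment formulas from [24] used in this proof are easy to check by Monte Carlo; a short sketch (parameter values ours), using that for $Y \sim G(a, b)$ with shape $a$ and rate $b$, $\mathrm{E}[1/Y] = b/(a-1)$ and $\mathrm{E}[1/Y^2] = b^2/((a-1)(a-2))$:

```r
set.seed(1)
theta0 <- 2; n <- 50; reps <- 1e6
xbar <- rgamma(reps, shape = n, rate = n * theta0)  # distribution of the sample mean Xbar
c(mean(1 / xbar), n * theta0 / (n - 1))             # E[1/Xbar] = n*theta0/(n-1)
c(mean(1 / xbar^2), (n * theta0)^2 / ((n - 1) * (n - 2)))  # E[1/Xbar^2]
```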

The normal distribution under canonical parametrisation
The distribution of a random variable $X$ is said to be a canonical multi-parameter exponential family distribution if, for $\eta \in \mathbb{R}^d$, the probability density (or mass) function takes the form
$$f(x|\eta) = \exp\Big\{\sum_{j=1}^d \eta_j T_j(x) - A(\eta) + S(x)\Big\}, \quad x \in B,$$
where $B = \{x : f(x|\eta) > 0\}$, the support of $X$, does not depend on $\eta$; $A(\eta)$ is a function of the parameter $\eta$; and $T_j(x)$ and $S(x)$ are functions of only the data.
Here, we apply Theorem 3.1 in the case that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables following the $N(\mu, \sigma^2)$ distribution, an exponential family distribution. Let $\eta = (\eta_1, \eta_2)^\top = \big(\frac{1}{2\sigma^2}, \frac{\mu}{\sigma^2}\big)^\top$ be the natural parameter vector (4.45). The MLE for $\eta_0$ exists, and it is unique and equal to $\hat{\eta}(X) = \big(\frac{1}{2\hat{\sigma}^2}, \frac{\bar{X}}{\hat{\sigma}^2}\big)^\top$, where $\hat{\sigma}^2 = n^{-1}\sum_{i=1}^n (X_i - \bar{X})^2$. This can be seen from the invariance property of the MLE and the result of [14, p. 116], in which the MLEs for $\mu$ and $\sigma^2$ are given. In Corollary 4.5, we give an explicit bound on the 1-Wasserstein distance between the distribution of $\hat{\eta}(X)$ and its limiting multivariate normal distribution. As $\hat{\eta}(X)$ is a non-linear statistic, this result demonstrates the power of our general theorems of Section 3; to the best of our knowledge, no other such optimal order bounds have been given for multivariate normal approximation of non-linear statistics in the 1-Wasserstein metric.
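A minimal R sketch (ours) of computing $\hat{\eta}(X)$ from a sample; the closed form follows from the invariance property noted above:

```r
set.seed(1)
mu <- 1; sigma2 <- 1; n <- 1000
x <- rnorm(n, mean = mu, sd = sqrt(sigma2))
sigma2_hat <- mean((x - mean(x))^2)   # MLE of sigma^2 (divisor n, not n - 1)
eta_hat <- c(1 / (2 * sigma2_hat),    # MLE of eta_1 = 1/(2 sigma^2)
             mean(x) / sigma2_hat)    # MLE of eta_2 = mu/sigma^2
eta0 <- c(1 / (2 * sigma2), mu / sigma2)
rbind(eta_hat, eta0)                  # estimate vs true natural parameter
```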
Corollary 4.5. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables, and let $\eta_0$ be as in (4.45). Then, with $W = \sqrt{n}[I(\eta_0)]^{1/2}(\hat{\eta}(X) - \eta_0)$ and $Z \sim \mathrm{MVN}(0, I_2)$, the explicit 1-Wasserstein distance bound (4.46) holds.

Remark 4.6. A bound on the distance between $W$ and $Z$ in the weaker $d_{0,1,2,3}$ metric was given in [1]. Aside from being given in a stronger metric, our bound has the advantage of taking a simpler form with a better dependence on the parameters $\eta_1$ and $\eta_2$. The numerical constants in our bound and that of [1] are of the same magnitude. In deriving the bound (4.46) we made no attempt to optimise the numerical constants, and instead focused on giving a clear proof and a simple final bound.
The following lemma will be used in the proof of Corollary 4.5; its proof is given in Appendix A.3.
Proof of Corollary 4.5. The first- and second-order partial derivatives of the logarithm of the normal density function are given in (4.47) and (4.48). Therefore, the expected Fisher information matrix for one random variable, and hence the matrix $\tilde{V}$ as in (3.19), follow from simple calculations, with $\alpha$ defined as in the statement of the corollary. We now set about bounding $d_W(W, Z)$ by applying the general bound (3.20). To this end, we first note that $K_2(\eta_0) = 0$, due to the fact that $\mathrm{E}[T_{lj}^2] = 0$ for all $l, j \in \{1, 2\}$: for a canonical exponential family, the second-order partial derivatives of the log-likelihood are non-random. This follows from the definition of $T_{kj}$ in (3.19) and the results of (4.47) and (4.48).
We now focus on bounding $K_1(\eta_0)$. The required fourth moments are estimated using the inequality $(a+b)^4 \le 8(a^4 + b^4)$. In terms of the parameters $\eta_1$ and $\eta_2$, we have that $\mu = \frac{\eta_2}{2\eta_1}$ and $\sigma^2 = \frac{1}{2\eta_1}$, and a longer calculation using standard formulas for the lower order moments of the normal distribution gives the required expressions.
Substituting these formulas into (4.49) gives a bound for $R_{1,1}$. We bound $R_{1,2}$ similarly. Combining our bounds for $R_{1,1}$ and $R_{1,2}$ gives the bound (4.50) for $K_1(\eta_0)$.
We now bound $K_3(\eta_0)$, which is given by a sum of terms of the form $R^{M_{qlj}}_{q,l,j}$ (4.51). Here the superscript $M_{qlj}$ in $R^{M_{qlj}}_{q,l,j}$ emphasises the fact that the quantity depends on the choice of dominating function $M_{qlj}$. In bounding $K_3(\eta_0)$ we first note some inequalities which will simplify the final bound, and which can be seen to hold from several applications of the simple inequality $ab \le \frac{1}{2}(a^2 + b^2)$.
From the formulas in (4.47) we readily obtain bounds on the third-order partial derivatives, and therefore suitable choices of the dominating functions $M_{qlj}$. At this stage we note that $R^{M_{222}}_{2,2,2} = 0$. In order to bound each of the remaining terms, we must consider four cases: (A) $\eta = (\eta_1, \eta_2)$, (B) $\eta = (\hat{\eta}_1, \eta_2)$, (C) $\eta = (\eta_1, \hat{\eta}_2)$ and (D) $\eta = (\hat{\eta}_1, \hat{\eta}_2)$. It will be convenient to write the corresponding expectations with notation defined in the obvious manner. We first bound $R^{M_{111}}_{1,1,1}$: we consider the four cases (A), (B), (C) and (D), and bound the terms by using the Cauchy-Schwarz inequality and the bounds of Lemma 4.7. Similar calculations (which are given in Appendix A.3) bound the remaining terms. Applying these bounds to (4.51) yields the final bound of the corollary. $\square$

The multivariate normal distribution under non-canonical parametrisation

We first consider the case of a diagonal covariance matrix. Let $X_1, \ldots, X_n$ be i.i.d. $\mathrm{MVN}(\mu, \Sigma)$ random vectors, where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$, so that $\theta_0 = (\mu_1, \ldots, \mu_p, \sigma_1^2, \ldots, \sigma_p^2)^\top$ and $d = 2p$. For $1 \le j \le p$, let $\bar{X}_j$ denote the sample mean of $X_{1,j}, \ldots, X_{n,j}$. Then it is well-known that in this case the MLE is unique and equal to $\hat{\theta}_n(X) = (\bar{X}_1, \ldots, \bar{X}_p, \hat{\sigma}_1^2, \ldots, \hat{\sigma}_p^2)^\top$, where $\hat{\sigma}_j^2 = n^{-1}\sum_{i=1}^n (X_{i,j} - \bar{X}_j)^2$.
Then it is readily checked that all the assumptions of Theorem 3.1 are met, and so an application of the bound (3.20) would yield a bound of the form $d_W(W, Z) \le Cn^{-1/2}$, where $Z \sim \mathrm{MVN}(0, I_{2p})$, for some constant $C$ that does not depend on $n$. However, the term $K_3(\theta_0)$ has a very poor dependence on the dimension $d$ and would be tedious to compute. Instead, we take advantage of the particular representation of the MLE to derive a neat optimal $O(n^{-1/2})$ 1-Wasserstein distance (and 2-Wasserstein distance) bound with good dependence on the dimension. In deriving this bound we make use of Theorem 2.5. The resulting bound is stated as Theorem 4.8; the explicit bound (4.53) is of order $O(\sqrt{p/n})$.
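Since $I(\theta_0) = \mathrm{diag}(1/\sigma_1^2, \ldots, 1/\sigma_p^2, 1/(2\sigma_1^4), \ldots, 1/(2\sigma_p^4))$ is diagonal here, the vector $W$ has explicit components. A minimal R sketch (our own construction, for illustration) simulating one realisation of $W$:

```r
set.seed(1)
p <- 3; n <- 500
mu <- rep(1, p); sig2 <- rep(1, p)
# i.i.d. MVN(mu, diag(sig2)) sample, one column per coordinate
x <- sapply(1:p, function(j) rnorm(n, mean = mu[j], sd = sqrt(sig2[j])))
xbar <- colMeans(x)
sig2_hat <- colMeans((x - matrix(xbar, n, p, byrow = TRUE))^2)  # MLEs of the variances
# W = sqrt(n) [I(theta0)]^{1/2} (theta_hat - theta0):
W <- sqrt(n) * c((xbar - mu) / sqrt(sig2),
                 (sig2_hat - sig2) / (sqrt(2) * sig2))
round(W, 3)   # approximately a single MVN(0, I_{2p}) draw for large n
```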
Remark 4.9. Corollary 3.1 of [2] gave a bound in the weaker $d_{1,2}$ metric for the case that $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ random variables. Theorem 4.8 generalises the setting from $p = 1$ to $p \ge 1$ and gives a bound in the stronger 1-Wasserstein distance. Our bound shows that the MLE converges in distribution to the multivariate normal distribution even for large $p$, provided $p \ll n$. We believe that the dependence on the dimension $p$ in our bound is optimal, and this seems to be supported by the empirical results in Section 4.5.
Proof. The inequality $d_W(W, Z) \le d_{W_2}(W, Z)$ is immediate from (2.5), and the rest of the proof is devoted to bounding $d_{W_2}(W, Z)$. We begin by recalling the standard result that the expected Fisher information matrix is given by $I(\theta_0) = \mathrm{diag}(1/\sigma_1^2, \ldots, 1/\sigma_p^2, 1/(2\sigma_1^4), \ldots, 1/(2\sigma_p^4))$, and define the standardised random variables $Y_{i,j} = (X_{i,j} - \mu_j)/\sigma_j$. For $1 \le j \le p$, let $\bar{X}_j$ and $\bar{Y}_j$ denote the sample means of $X_{1,j}, \ldots, X_{n,j}$ and $Y_{1,j}, \ldots, Y_{n,j}$. A simple calculation gives a useful representation of $\hat{\sigma}_j^2$ in terms of the $Y_{i,j}$. Putting all this together gives that $W$ can be written as $W = (W_1, \ldots, W_{2p})^\top$, where, for $1 \le j \le p$, $W_j = \sqrt{n}\,\bar{Y}_j$ and $W_{p+j} = \sqrt{n/2}\,(\hat{\sigma}_j^2/\sigma_j^2 - 1)$. It will be useful to define $V = (V_1, \ldots, V_{2p})^\top$, where, for $1 \le j \le p$, $V_j$ and $V_{p+j}$ are the corresponding standardised sums. We now note that $\bar{X}_1, \ldots, \bar{X}_p, \hat{\sigma}_1^2, \ldots, \hat{\sigma}_p^2$ are mutually independent (see Section 3b.3 of [37]), from which it follows that $W_1, \ldots, W_{2p}$ are independent. As the infimum in the definition (2.4) of the 2-Wasserstein distance is attained, for each $j = 1, \ldots, 2p$ we may construct a probability space on which there exist random variables $W_j^*$ and $Z_j^*$ with $\mathcal{L}(W_j^*) = \mathcal{L}(W_j)$ and $\mathcal{L}(Z_j^*) = \mathcal{L}(Z_j)$ that attain the infimum. By independence, on taking the product of these probability spaces, we can construct random vectors $W^*$ and $Z^*$ with independent components, so that $d_{W_2}(W, Z)^2 \le \mathrm{E}|W^* - Z^*|^2 = \sum_{j=1}^{2p} d_{W_2}(W_j, Z_j)^2$.
Here $\theta_0 = (\mu_1, \ldots, \mu_p, \sigma_{1,1}, \ldots, \sigma_{1,p}, \ldots, \sigma_{p,1}, \ldots, \sigma_{p,p})^\top$. The density function here is
$$f(x|\theta) = \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}}\exp\Big\{-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big\}.$$
It is well-known that in this case the MLE is unique and equal to $\hat{\theta}_n(X) = (\bar{X}, \hat{\Sigma})$, where $\hat{\Sigma} = n^{-1}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top$. Since the covariance matrix $\Sigma$ and its MLE $\hat{\Sigma}$ are symmetric, for the purpose of presenting a multivariate normal approximation for the MLE we restrict $\theta_0$ to only include $\sigma_{i,j}$, $i \ge j$, and $\hat{\theta}_n(X)$ to only include the estimators $\hat{\sigma}_{i,j}$, $i \ge j$. This restricted MLE has $p + p(p+1)/2 = p(p+3)/2$ parameters. As in the diagonal case, we could apply Theorem 3.1 to obtain an optimal order $O(n^{-1/2})$ 1-Wasserstein distance bound, but we prefer to proceed as we did there and exploit the particular representation of the MLE in deriving our bound.
The proof of the following theorem follows a similar basic approach to that of Theorem 4.8, again making use of Theorem 2.5, although as the components of the random vector $W$ are now no longer independent our calculations are a little more involved, as we cannot reduce the problem to the univariate case as we did in proving Theorem 4.8. We defer the proof to Appendix A.4. For a matrix $A$, let $\|A\|_{\max} = \max_{i,j}|a_{i,j}|$.

Theorem 4.10. Let $X_1, \ldots, X_n$ be i.i.d. $\mathrm{MVN}(\mu, \Sigma)$ random vectors, where $\mu = (\mu_1, \ldots, \mu_p)^\top$ and $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ is positive semi-definite. Let $\hat{\theta}_n(X)$ be the MLE restricted in the manner described above. Let $W = \sqrt{n}[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)$ and $Z \sim \mathrm{MVN}(0, I_{p(p+3)/2})$. Write $\sigma_*^2 = \max_{1 \le j \le p}\sigma_{jj}$ (the largest variance in the covariance matrix $\Sigma$). Then the explicit 1-Wasserstein distance bound of order $O(n^{-1/2})$ stated in (4.54) holds, with constant depending on $p$ and $\sigma_*^2$.
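A short sketch (ours) of forming the restricted MLE vector in R, with the lower-triangular entries of $\hat{\Sigma}$ vectorised; the MASS package is used only to generate the sample:

```r
set.seed(1)
p <- 3; n <- 1000
Sigma <- diag(p) + 0.3                    # positive definite covariance matrix
mu <- rep(0, p)
x <- MASS::mvrnorm(n, mu, Sigma)
xbar <- colMeans(x)
xc <- sweep(x, 2, xbar)                   # centred observations
Sigma_hat <- crossprod(xc) / n            # MLE of Sigma (divisor n)
theta_hat <- c(xbar, Sigma_hat[lower.tri(Sigma_hat, diag = TRUE)])
length(theta_hat)                         # p + p(p+1)/2 = p(p+3)/2 parameters
```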

Implicitly defined MLEs
In order to be calculated, the general upper bound on the 1-Wasserstein distance of interest, as expressed in Theorem 3.1, requires a closed-form expression for the MLE. In this section, we explain how an upper bound on the weaker bounded Wasserstein distance can be obtained when the MLE is implicitly defined. Our strategy is split into two steps: first, put the dependence of the bound on the MLE only through the mean squared error (MSE), $\mathrm{E}[\sum_{j=1}^d Q_j^2]$, with $Q_j$ as in (3.19); and secondly, discuss how upper bounds can be obtained for the MSE. In addition to the regularity conditions needed in Theorem 3.1, in order to attain an upper bound on the bounded Wasserstein distance when the MLE is not expressed analytically, we replace assumption (R.C.4") by (Con.1) below:

(Con.1) For $\epsilon > 0$ and for all $\theta_0 \in \Theta$,
$$\sup_{\theta:\,|\theta_q - \theta_{0,q}| < \epsilon\ \forall q \in \{1,2,\ldots,d\}}\Big|\frac{\partial^3}{\partial\theta_k\partial\theta_j\partial\theta_i}\ell(\theta; x)\Big| \le M_{kji}, \quad (4.56)$$
where $M_{kji} = M_{kji}(\theta_0)$ only depends on $\theta_0$.

Proposition 4.11. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. $\mathbb{R}^t$-valued, $t \in \mathbb{Z}^+$, random vectors with probability density (or mass) function $f(x_i|\theta)$, for which the true parameter value is $\theta_0$ and the parameter space $\Theta$ is an open subset of $\mathbb{R}^d$. Assume that the MLE exists and is unique, but cannot be expressed in closed form, and that (R.C.1)-(R.C.3) and (Con.1) are satisfied. In addition, for $\tilde{V}$ as in (3.19), assume that the relevant moments are finite. Then, for $\epsilon > 0$ being a positive constant, as in (Con.1), that need not depend on the sample size $n$, and with $W$ as in (3.19), the bounded Wasserstein distance bound (4.57) holds, where $K_1(\theta_0)$ is as in (3.21).
Remark 4.12. There is a well-developed theory to verify bounds of the form $\mathrm{E}[|\hat{\theta}_n(X) - \theta_0|^p] = O(n^{-p/2})$ for any $p > 0$ in general settings (see Chapter III, Sections 1 and 3 of [23], and Sections 3-4 of [43]). Using such results, we can deduce that the bound (4.57) is of the optimal order $O(n^{-1/2})$; notice that the positive constant $\epsilon$ need not depend on $n$, and its choice could be optimised in examples. In addition, we note that the bound (4.57) has a better dependence on the dimension $d$ than the 1-Wasserstein distance bound of Theorem 3.1. To be more precise, assuming that $\tilde{V}_{lk} = O(1)$ and $M_{kmi} = O(1)$, it can be seen that (4.57) is of order $O(d^5)$, while the 1-Wasserstein distance bound (3.20) is of the much larger order $O(d^4 2^d)$.
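To illustrate the setting of Proposition 4.11, here is a minimal sketch (ours) of computing an implicitly defined MLE numerically, for the gamma distribution discussed in Remark 4.13 below: profiling out $\beta = \alpha/\bar{x}$ reduces the score equations to a single equation in $\alpha$, solved by Newton's method.

```r
set.seed(1)
alpha0 <- 2; beta0 <- 3; n <- 1000
x <- rgamma(n, shape = alpha0, rate = beta0)
# Profiled score: g(alpha) = log(alpha) - log(mean(x)) - digamma(alpha) + mean(log(x))
g  <- function(a) log(a) - log(mean(x)) - digamma(a) + mean(log(x))
dg <- function(a) 1 / a - trigamma(a)      # derivative of the profiled score
a <- 1                                     # initial guess
for (i in 1:50) a <- a - g(a) / dg(a)      # Newton iterations
c(alpha_hat = a, beta_hat = a / mean(x))   # numerically computed MLE
```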
Remark 4.13. Condition (Con.1) in (4.56) is non-restrictive and is satisfied by various distributions for which the MLE of the parameters cannot be expressed analytically. Here, we give two examples.

1. Gamma distribution: With $\alpha, \beta > 0$ and $\theta = (\alpha, \beta)^\top$ being the vector parameter, the probability density function is
$$f(x|\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha - 1}\mathrm{e}^{-\beta x}, \quad x > 0.$$
The third-order partial derivatives of the log-likelihood (4.58) involve the polygamma function: for any $z \in \mathbb{C}\setminus\{0, -1, -2, \ldots\}$, the polygamma function $\psi_m(z)$ is defined by $\psi_m(z) = \frac{\mathrm{d}^m}{\mathrm{d}z^m}\psi(z)$, with $\psi = \psi_0$ denoting the digamma function. The polygamma function has the series representation
$$\psi_m(z) = (-1)^{m+1}\,m!\sum_{k=0}^\infty \frac{1}{(z+k)^{m+1}} \quad (4.59)$$
(differentiate both sides of formula 5.15.1 of [31]), which holds for any $m \ge 1$ and any $z \in \mathbb{C}\setminus\{0, -1, -2, \ldots\}$. It is easy to see that, for $x > 0$, $|\psi_2(x)|$ is a decreasing function of $x$, and, using (4.58), (Con.1) is satisfied with $M_{112} = 0$ and with the remaining constants $M_{kji}$ obtained from the monotonicity of $|\psi_2|$ and of the derivatives in $\beta$; for example, $M_{111} = n|\psi_2(\alpha_0 - \epsilon)|$.

2. Beta distribution:
The probability density function is
$$f(x|\theta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)},$$
with $\alpha, \beta > 0$ and $x \in (0, 1)$. Hence, for $j, k \in \{1, 2\}$, the third-order partial derivatives of the log-likelihood can be expressed in terms of $\psi_2(\alpha)$, $\psi_2(\beta)$ and $\psi_2(\alpha + \beta)$, where, as in the case of the gamma distribution, $\psi_j(\cdot)$ is the polygamma function defined in (4.59). (Con.1) is again satisfied, with the constants $M_{kji}$ obtained from the monotonicity of $|\psi_2|$, exactly as in the gamma case.

Proof of Proposition 4.11. With $\tilde{V}$ and $W$ as in (3.19), we obtain, through the method of proof of Theorem 3.1, the decomposition (4.61). For the first quantity on the right-hand side of (4.61), we obtain, using Theorem 2.5, the bound $K_1(\theta_0)$ (4.62). With respect to the second term in (4.61), we use the representation (4.63).

For $h \in H_{bW}$ and with $\tilde{V}$ and $Q_j$ as in (3.19), for ease of presentation let us denote by $D_1$ and $D_2$ the two terms in the resulting decomposition, where $\theta_0^*$ is as in (3.28). Using the above notation and the triangle inequality, it suffices to bound $\mathrm{E}|D_1|$ and $\mathrm{E}|D_2|$. Since $W$ is as in (3.19), then, for $A_{[j]}$ denoting the $j$-th row of a matrix $A$, a first-order multivariate Taylor expansion applies. Using (3.29) component-wise and the Cauchy-Schwarz inequality, we obtain a bound involving $T_{kj}$, as in (3.19). Since $\mathrm{E}[T_{kj}] = 0$, $\forall j, k \in \{1, 2, \ldots, d\}$, we can bound the second moment of $T_{kj}$ by a variance term, where the inequality trivially holds since the variance of a random variable is always non-negative. Now, using that $\big(\sum_{j=1}^d \alpha_j\big)^2 \le d\sum_{j=1}^d \alpha_j^2$ for $\alpha_j \in \mathbb{R}$, taking square roots on both sides of the resulting inequality, and applying this inequality to (4.67), yields a bound for $\mathrm{E}|D_1|$.

To bound now $\mathrm{E}|D_2|$, with $D_2$ as in (4.64), we need to take into account that the third-order partial derivatives of $\ell(\theta; x)$ at $\theta = \theta_0^*$ are in general not uniformly bounded, and there is a positive probability that the MLE will be outside an $\epsilon$-neighbourhood of the true value of the parameter. For $\epsilon > 0$, the law of total expectation and Markov's inequality yield the required estimate, and using (Con.1) we conclude. $\square$

Empirical results
In this section, we investigate, through a simulation study, the accuracy of our bounds given in Sections 4.1-4.3. We carried out the study using R. For the exponential distribution with $\theta = 1$ under canonical and non-canonical parametrisation (this bound is given in Appendix A.2), and for the normal distribution under canonical parametrisation with $\eta = (1, 1)^\top$, we calculated our bound and estimated the true value of $d_W(W, Z)$ for sample sizes $n = 10^j$, $j = 1, 2, 3, 4$ (Tables 1-3). For the multivariate normal distribution under non-canonical parametrisation with diagonal covariance matrix, we studied the dependence of $d_W(W, Z)$ on the dimension $p$ with $n = 1000$ fixed and $\mu_k = \sigma_k^2 = 1$ for all $1 \le k \le p$ (Figure 1). Calculating our bounds is straightforward, but estimating the 1-Wasserstein distance $d_W(W, Z)$ is more involved. For a given example and given sample size $n$, we simulated $N$ realisations of the distributions of $W$ and $Z$ to obtain the empirical distribution functions of both distributions. We then used the R package transport to compute the 1-Wasserstein distance between these two empirical distributions. As we simulated the distributions, we only obtained an estimate of the 1-Wasserstein distance $d_W(W, Z)$, although this estimate improves as $N$ increases. To mitigate the random effects from the simulations, we repeated this $K = 100$ times and then took the sample mean to obtain our estimate $\hat{d}_W(W, Z)$. We used $N = 10^4$ for all simulations, except for the multivariate normal distribution under non-canonical parametrisation, for which we used $N = 10^3$ on account of the many simulations for the 99 values of the dimension $p$. From the tables we see that, at each step we increase the sample size by a factor of ten, the value of the upper bound drops by approximately a factor of $\sqrt{10}$, which is expected as our bounds are of order $O(n^{-1/2})$. The simulated 1-Wasserstein distances $\hat{d}_W(W, Z)$ do not decrease by a factor of roughly $\sqrt{10}$ for larger sample sizes, because the approximation errors resulting from taking a finite value of $N$ become more noticeable when the value of $\hat{d}_W(W, Z)$ decreases.
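A condensed sketch of this estimation procedure for the Exp(1) example under canonical parametrisation ($N$ and $K$ are reduced here for speed; the transport package is the one named above):

```r
library(transport)
set.seed(1)
n <- 100; N <- 1e4; K <- 10
dW <- replicate(K, {
  W <- replicate(N, {
    x <- rexp(n)                       # Exp(1) sample, so theta0 = 1 and i(theta0) = 1
    sqrt(n) * (1 / mean(x) - 1)        # W = sqrt(n i(theta0)) (theta_hat - theta0)
  })
  Z <- rnorm(N)
  wasserstein1d(W, Z, p = 1)           # 1-Wasserstein distance of the empirical laws
})
mean(dW)                               # estimate of d_W(W, Z)
```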
Our bounds for the exponential distribution perform reasonably well, particularly in the canonical parametrisation case. In Table 2, for the exponential distribution under non-canonical parametrisation, we also provide the bound obtained from a direct application of Theorem 2.4 (this is inequality (A.76)), which, as expected, is an order of magnitude better than our bound resulting from the general approach. The bounds for the normal distribution under canonical parametrisation are much bigger than for the exponential distribution. This is a result of the increased complexity of this example and the fact that we sacrificed best possible constants in favour of a simpler proof and compact final bound.
Figure 1 shows the behaviour of the simulated 1-Wasserstein distance $\hat{d}_W(W, Z)$ for the multivariate normal distribution with diagonal covariance matrix with $\mu_k = \sigma_k^2 = 1$, $1 \le k \le p$, when the dimension $p$ varies from 2 up to 100. Here our focus was on the dependence on the dimension for fixed $n$, so we chose a small sample size $n = 1000$ to reduce the computational complexity of the simulations. Figure 1 also contains a log-log plot. Across all 99 data points there is clearly not a straight line fit, but after the value 3.8 for $\log(p)$ (the 45th data point), we start to see some stabilisation towards a straight line. We obtained a slope of 0.576 between the 70th and 99th data points, which reduced to 0.569 between the 90th and 99th data points. The results from these simulations suggest that the slope is converging down to 0.5, which would be consistent with the theoretical $O(p^{1/2})$ scaling of our bound (4.53).

A Further examples, proofs and calculations

A.1 Verifying (R.C.4") for the inverse gamma distribution

Let $X_1, X_2, \ldots, X_n$ be i.i.d. inverse gamma random variables with parameters $\alpha > 0$ and $\beta > 0$ and probability density function
$$f(x|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha - 1}\mathrm{e}^{-\beta/x}, \quad x > 0.$$
In this appendix, we verify condition (R.C.4") for the single-parameter MLE for the inverse gamma distribution (fixed $\alpha$ or fixed $\beta$). The purpose is to give an illustration of how (R.C.4") can be verified for more complicated MLEs than those considered in Section 4. To keep the calculations manageable, we focus on the single-parameter case.
For the moment, let $\theta$ denote the unknown parameter, either $\alpha$ or $\beta$. Recall that in the single-parameter case condition (R.C.4") requires a monotonic dominating function $M(\theta; x)$ together with the integrability condition (2.3). We shall verify a stronger (and, in this case, simpler to verify) condition in which the two factors in (2.3) are controlled separately, which implies (R.C.4") by the Cauchy-Schwarz inequality. It should be noted that, provided $\mathrm{E}[(\hat{\theta}_n(X) - \theta_0)^4] < \infty$, the argument of part (1) of Remark 3.4 shows that this quantity is of order $O(n^{-2})$. In verifying that the expectations involving the monotonic dominating function $M$ are finite, we shall see, as expected, that these expectations are of order $O(n^2)$. Therefore the final term in bound (3.22) of Theorem 3.2 is of the desired order $O(1)$. An application of Theorem 3.2, and further calculations to bound the other (simpler) terms, would confirm that we obtain a Wasserstein distance bound with $O(n^{-1/2})$ convergence rate.
A.2 The exponential distribution under non-canonical parametrisation

This example is given for purely illustrative purposes, as an improved bound can be obtained directly by Stein's method. Define $S = \sqrt{n}(\bar{X} - \theta_0)/\theta_0$, a standardised sum of i.i.d. zero mean and unit variance random variables. Therefore, by Theorem 2.4, the bound (A.76) holds. However, in order to apply Stein's method directly, we require the quantity $W = \sqrt{n\,i(\theta_0)}(\hat{\theta}_n(X) - \theta_0)$ to be a sum of independent random variables. The general theorems obtained in this paper are, however, applicable whatever the form of the MLE, as long as the regularity conditions are met.
(2) Like the bound of Corollary 4.3 for the exponential distribution under canonical parametrisation, the bound (A.75) of Corollary A.1 is of order $O(n^{-1/2})$ and does not depend on $\theta_0$. These features are shared by the bound (A.76) obtained by a direct application of Stein's method. A bound with these features was also obtained by [4] in the weaker bounded Wasserstein metric. Despite being given in a stronger metric, our bound has numerical constants that are an order of magnitude smaller.
Proof. It is straightforward to show that $\hat{\theta}_n(X) = \bar{X}$ and that the conditions (R1)-(R3), (R.C.4") are satisfied for this specific example. The log-likelihood function is $\ell(\theta; x) = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^n x_i$. Its third derivative is a decreasing function with respect to $\theta$, and therefore condition (R.C.4") is satisfied with $M(\theta; x)$ taken to be the absolute value of this derivative. In addition, since $T(x) = x$, we have that $\mathrm{Var}(T(X_1)) = \mathrm{Var}(X_1) = \theta_0^2$, and the first term of the upper bound in (4.33) follows. Now, consider the second term. The quantity $\mathrm{E}[(\bar{X} - \theta_0)^2]$ is calculated using the results on p. 73 and equation (3.38) on p. 70 of [24], along with the fact that $\hat{\theta}_n(X) = \bar{X} \sim G(n, \frac{n}{\theta_0})$. We obtain that $\mathrm{E}[(\bar{X} - \theta_0)^2] = \frac{\theta_0^2}{n}$. We also have that $i(\theta_0) = 1/\theta_0^2$, and the second term follows. Finally, we work on the third term. Since $\bar{X} \sim G(n, \frac{n}{\theta_0})$ and $\frac{1}{\bar{X}} \sim \mathrm{Inv\text{-}G}(n, \frac{n}{\theta_0})$ (where $\mathrm{Inv\text{-}G}$ denotes the inverse gamma distribution), the required expectations can be computed exactly. $\square$

A.3 Further calculations from the proof of Corollary 4.5

Proof of Lemma 4.7. Let us first note the standard result that $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ are independent, which follows from Basu's theorem. We also have that $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$ and $\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 \sim \chi^2_{(n-1)}$, the chi-square distribution with $n - 1$ degrees of freedom. We therefore have that $\hat{\eta}_1 =_d \frac{n}{2\sigma^2}V$ and $\hat{\eta}_2 =_d \frac{n}{\sigma^2}UV$, where $U \sim N(\mu, \frac{\sigma^2}{n})$ and $V \sim \mathrm{Inv}\text{-}\chi^2_{(n-1)}$ are independent. All expectations as given in the lemma can therefore be computed exactly using standard moment formulas for $U$ and $V$, and then expressing the resulting expression in terms of the canonical parametrisation $(\eta_1, \eta_2) = (\frac{1}{2\sigma^2}, \frac{\mu}{\sigma^2})$. (Here the expectations $\mathrm{E}[V^k]$ and $\mathrm{E}[V^{-k}]$ follow from the standard formula that, for $Y \sim \chi^2_{(r)}$, $\mathrm{E}[Y^m] = \frac{2^m\Gamma(m + r/2)}{\Gamma(r/2)}$, $r > 0$, $m > -\frac{r}{2}$, and the identity $\Gamma(x+1) = x\Gamma(x)$.) As an example, one obtains an exact expression for $\mathrm{E}[Q_1^2]$ involving the factor $\eta_1^2\frac{(2n+15)}{(n-3)(n-5)}$.
To obtain the compact bound for $\mathrm{E}[Q_1^2]$ as stated in the lemma, we note that $f(n) := \frac{n(2n+15)}{(n-3)(n-5)}$ is a decreasing function of $n$ for $n > 9$, with $f(10) = 10$. Similar calculations show that analogous monotonicity properties hold for the other expectations for $n > 9$. From these formulas we are able to obtain compact bounds, valid for $n \ge 10$, for all expectations given in the lemma, using a similar argument to the one we used to bound $\mathrm{E}[Q_1^2]$. $\square$
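The $\chi^2$ moment formula used in this proof is simple to check numerically; a short Monte Carlo sanity check in R (values of $r$ and $m$ are our own choices):

```r
set.seed(1)
r <- 9; m <- -2                    # valid for m > -r/2
y <- rchisq(1e6, df = r)
c(mean(y^m), 2^m * gamma(m + r/2) / gamma(r/2))  # E[Y^m] = 2^m Gamma(m + r/2)/Gamma(r/2)
```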




A.4 Proof of Theorem 4.10

Proof. For $h \in H_W$, we bound $|\mathrm{E}[h(W)] - \mathrm{E}[h(Z)]| \le R_1 + R_2$, where $R_1 = |\mathrm{E}[h(W)] - \mathrm{E}[h(V)]|$ and $R_2 = d_W(V, Z)$. Here $V = \frac{1}{\sqrt{n}}\sum_{i=1}^n \xi_i$, where $\xi_1, \ldots, \xi_n$ are i.i.d. random vectors with $\xi_i = [I(\theta_0)]^{1/2}\tilde{\xi}_i$, and the components of $\tilde{\xi}_1$ are given by $\tilde{\xi}_{1,j} = Y_{1,j}$, $1 \le j \le p$, and $\tilde{\xi}_{1,(k,\ell)} = Y_{1,k}Y_{1,\ell} - \sigma_{k,\ell}$, $1 \le \ell \le k \le p$, where, for $p + 1 \le j \le p(p+3)/2$, we associate $\tilde{\xi}_{1,j}$ with an ordering of $\tilde{\xi}_{1,(k,\ell)}$, $1 \le \ell \le k \le p$. In bounding $R_1$ we use the triangle inequality and the Cauchy-Schwarz inequality, together with the fact that $\|h\|_{\mathrm{Lip}} \le 1$, since $h \in H_W$.

Now we bound $R_2$. We begin by showing that the assumptions of Theorem 2.5 are met, that is $\mathrm{E}[\xi_1] = 0$ and $\mathrm{E}[\xi_1\xi_1^\top] = I_{p(p+3)/2}$. We can immediately see that $\mathrm{E}[\xi_1] = [I(\theta_0)]^{1/2}\mathrm{E}[\tilde{\xi}_1] = 0$. Let us now show that $\mathrm{E}[\xi_1\xi_1^\top] = I_{p(p+3)/2}$. As the MLE is asymptotically multivariate normally distributed, we have that $W \xrightarrow{d} Z$, as $n \to \infty$ (with an abuse of notation, as we have not indexed $W$ with $n$). We have just shown that $R_1 \to 0$, as $n \to \infty$ (again with the same abuse of notation), for all $h \in H_W$. Therefore, by (A.81), we have that $V \xrightarrow{d} Z$, as $n \to \infty$, and so $\mathrm{E}[VV^\top] = I_{p(p+3)/2} + o(1)$, as $n \to \infty$. But since $\xi_1, \ldots, \xi_n$ are i.i.d., we have that $\mathrm{E}[\xi_1\xi_1^\top] = \mathrm{E}[VV^\top]$. Since $\mathrm{E}[\xi_1\xi_1^\top]$ does not involve $n$, we deduce that $\mathrm{E}[\xi_1\xi_1^\top] = I_{p(p+3)/2}$. We then obtain the fourth moment bound for $\mathrm{E}[([I(\theta_0)]^{1/2}\tilde{\xi}_1)_j^4] = \mathrm{E}\big[\big(\sum_{q=1}^{p(p+3)/2} a_{j,q}\tilde{\xi}_{1,q}\big)^4\big]$, with a factor of $(p+3)^4/16$ arising from the basic inequality of Section 2.2, and apply Theorem 2.5 to bound $R_2$. Finally, combining the bounds for $R_1$ and $R_2$ gives the bound for $d_W(W, Z)$ as stated in the theorem. $\square$

Table 1: Simulation results for the Exp(1) distribution under canonical parametrisation

Table 2: Simulation results for the Exp(1) distribution under non-canonical parametrisation