Minimax bounds for estimating multivariate Gaussian location mixtures

We prove minimax bounds for estimating Gaussian location mixtures on $\mathbb{R}^d$ under the squared $L^2$ and the squared Hellinger loss functions. Under the squared $L^2$ loss, we prove that the minimax rate is upper and lower bounded by a constant multiple of $n^{-1}(\log n)^{d/2}$. Under the squared Hellinger loss, we consider two subclasses based on the behavior of the tails of the mixing measure. When the mixing measure has a sub-Gaussian tail, the minimax rate under the squared Hellinger loss is bounded from below by $(\log n)^{d}/n$. On the other hand, when the mixing measure is only assumed to have a bounded $p^{\text{th}}$ moment for a fixed $p>0$, the minimax rate under the squared Hellinger loss is bounded from below by $n^{-p/(p+d)}(\log n)^{-3d/2}$. These rates are minimax optimal up to logarithmic factors.


Introduction
Let $\phi$ be the standard univariate normal density and, for $d \ge 1$, let $\mathcal{F}_d$ denote the class of densities on $\mathbb{R}^d$ of the form
$$(x_1, \dots, x_d) \mapsto \int \phi(x_1 - u_1)\phi(x_2 - u_2)\cdots\phi(x_d - u_d)\, dG(u_1, \dots, u_d), \tag{1.1}$$
where $G$ is a probability measure on $\mathbb{R}^d$. $\mathcal{F}_d$ is precisely the class of all Gaussian location mixture densities on $\mathbb{R}^d$. We study minimax rates for estimating an unknown density $f^* \in \mathcal{F}_d$ from i.i.d. observations $X_1, \dots, X_n$ (throughout the paper, we assume that $n \ge 2$).
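For concreteness, sampling from a density of the form (1.1) amounts to drawing $U \sim G$ and adding independent standard Gaussian noise. The following minimal Python sketch illustrates this; the discrete mixing measure chosen here is arbitrary and purely illustrative, not one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_location_mixture(n, atoms, weights, rng):
    """Draw n points from the mixture (1.1): X = U + Z with U ~ G, Z ~ N(0, I_d).

    atoms:   (k, d) array of support points of a discrete mixing measure G
    weights: (k,) array of mixing probabilities
    """
    idx = rng.choice(len(atoms), size=n, p=weights)          # U ~ G
    return atoms[idx] + rng.standard_normal(atoms[idx].shape)  # add Z ~ N(0, I_d)

# Example: a 3-atom mixing measure on R^2 (purely illustrative).
atoms = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
weights = np.array([0.5, 0.25, 0.25])
X = sample_gaussian_location_mixture(1000, atoms, weights, rng)
print(X.shape)  # (1000, 2)
```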
The minimax rate crucially depends on the choice of the loss function. We study two different loss functions in this paper. The first is the squared $L^2$ distance:
$$L^2(f, g) := \int \big(f(x) - g(x)\big)^2 \, dx. \tag{1.2}$$
The minimax risk of estimation over $\mathcal{F}_d$ under the $L^2$ loss function is
$$R_n(\mathcal{F}_d, L^2) := \inf_{\hat f_n} \sup_{f^* \in \mathcal{F}_d} \mathbb{E}_{f^*}\, L^2(\hat f_n, f^*),$$
where the infimum is over all estimators $\hat f_n$ based on $X_1, \dots, X_n$. In Theorem 2.1, we prove that $R_n(\mathcal{F}_d, L^2)$ is of the order $n^{-1}(\log n)^{d/2}$. This result is known for $d = 1$: the upper bound follows from the results proved in Ibragimov [5] for estimation of smooth functions (see also Kim [6, Section 3]) and the lower bound was proved by Kim [6]. To the best of our knowledge, the result for $d \ge 2$ is novel. It is interesting that the rate $n^{-1}(\log n)^{d/2}$ has a relatively mild dependence on the dimension $d$; the usual curse of dimensionality is thus largely avoided when estimating multivariate Gaussian location mixtures under the $L^2$ loss function.
The second loss function we investigate is the squared Hellinger distance:
$$h^2(f, g) := \int \Big(\sqrt{f(x)} - \sqrt{g(x)}\Big)^2 \, dx.$$
In order to obtain meaningful rates under the squared Hellinger distance, it is necessary to impose additional conditions on the probability measure $G$ underlying the density (1.1). The most common assumption in the literature is that $G$ is discrete with a known upper bound on the number of atoms. The Hellinger accuracy (as well as the accuracy in total variation distance) of estimating discrete Gaussian location mixtures has been investigated, for example, in [1, 4, 7, 10, 12]. In particular, it was proved by Doss, Wu, Yang and Zhou [1] (and by Wu and Yang [12] for $d = 1$) that the minimax rate is $n^{-1}$ when the dimension $d$ and the number of atoms of $G$ are bounded from above by constants.
In contrast to the discrete mixture situation, minimax rates in the squared Hellinger distance under broader assumptions on $G$ are not fully understood. Given a subclass $\mathcal{G}$ of probability measures on $\mathbb{R}^d$, let $\mathcal{F}_{\mathcal{G}}$ denote the class of all densities of the form (1.1) where $G$ is constrained to lie in $\mathcal{G}$. We shall denote the minimax risk over $\mathcal{F}_{\mathcal{G}}$ in the squared Hellinger distance by
$$R_n(\mathcal{F}_{\mathcal{G}}, h^2) := \inf_{\hat f_n} \sup_{f^* \in \mathcal{F}_{\mathcal{G}}} \mathbb{E}_{f^*}\, h^2(\hat f_n, f^*).$$
We study $R_n(\mathcal{F}_{\mathcal{G}}, h^2)$ for the following two natural subclasses $\mathcal{G}$:
1. $\mathcal{G} = \mathcal{G}_1(\Gamma)$: the class of all probability measures $G$ satisfying $G\{u : \|u\| > t\} \le \Gamma \exp(-t^2/\Gamma)$ for all $t > 0$ and a constant $\Gamma$. Every probability measure in $\mathcal{G}_1(\Gamma)$ has sub-Gaussian tails.
2. $\mathcal{G} = \mathcal{G}_2(p, K)$: the class of all probability measures $G$ with a bounded $p^{\text{th}}$ moment, $\int \|u\|^p \, dG(u) \le K^p$, for a fixed $p > 0$ and a constant $K > 0$.
The problem of estimating densities belonging to the classes $\mathcal{F}_{\mathcal{G}_1(\Gamma)}$ and $\mathcal{F}_{\mathcal{G}_2(p,K)}$ has been studied in [2, 13, 6] for $d = 1$ and in [9] for $d \ge 1$. Extending the results of Zhang [13] to $d \ge 1$, Saha and Guntuboyina [9] analyzed the performance of the nonparametric maximum likelihood estimator over $\mathcal{F}_d$, leading to upper bounds, labelled (1.4) and (1.5) below, which show that $R_n(\mathcal{F}_{\mathcal{G}_1(\Gamma)}, h^2)$ and $R_n(\mathcal{F}_{\mathcal{G}_2(p,K)}, h^2)$ are bounded from above by constant multiples of $n^{-1}(\log n)^{d+1}$ and of $n^{-p/(p+d)}$ up to a logarithmic factor, respectively. To the best of our knowledge, the corresponding lower bounds do not currently exist in the literature (except for the case of $R_n(\mathcal{F}_{\mathcal{G}_1(\Gamma)}, h^2)$ with $d = 1$), and we establish these in this paper. Specifically, we prove that
$$R_n(\mathcal{F}_{\mathcal{G}_1(\Gamma)}, h^2) \ge c\, \frac{(\log n)^d}{n} \tag{1.6}$$
and
$$R_n(\mathcal{F}_{\mathcal{G}_2(p,K)}, h^2) \ge c\, n^{-p/(p+d)} (\log n)^{-3d/2} \tag{1.7}$$
in Theorem 2.2 and Theorem 2.3 respectively.
(1.6) implies that there is a logarithmic price to be paid for dimensionality under the sub-Gaussianity assumption. (1.7) implies that the rate of convergence becomes much slower than the parametric rate if we only assume boundedness of the $p^{\text{th}}$ moment of $G$ for a fixed $p > 0$. It is usually believed that Gaussian location mixtures are arbitrarily smooth, leading to nearly parametric rates of estimation. While this is true for the $L^2$ loss function, our results reveal that the story is more complicated for the squared Hellinger loss function. Specifically, inequality (1.7) shows that, under the squared Hellinger distance, the rates can be arbitrarily slow if the mixing measure is allowed to have heavy tails. This fact does not seem to have been emphasized previously in the literature, even for $d = 1$. Furthermore, for each fixed $p$, the rate becomes exponentially slow in $d$, revealing the usual curse of dimensionality. Note that our lower bounds also imply that the upper bounds (1.4) and (1.5) cannot be substantially improved.
The rest of the paper is organized as follows. Our main results are all stated in the next section. Theorem 2.1 establishes the minimax rate of $(\log n)^{d/2}/n$ for $\mathcal{F}_d$ under the $L^2$ loss. Theorem 2.2 and Theorem 2.3 deal with the squared Hellinger loss function: Theorem 2.2 proves the minimax lower bound of $(\log n)^d/n$ under the sub-Gaussianity assumption on the mixing measure, and Theorem 2.3 proves the $n^{-p/(p+d)}(\log n)^{-3d/2}$ lower bound under the bounded $p^{\text{th}}$ moment assumption on the mixing measure. The proofs of these results are given in Section 3, where we also recall (see Subsection 3.1) some basic facts about Fourier transforms, Hermite polynomials and Assouad's lemma that are used in our proofs.

Main Results
We state all our main results in this section. Our first result shows that $R_n(\mathcal{F}_d, L^2)$ is of the order $n^{-1}(\log n)^{d/2}$.

Theorem 2.1. There exist constants $c_d$ and $C_d$ depending only on $d$ such that
$$c_d\, \frac{(\log n)^{d/2}}{n} \le R_n(\mathcal{F}_d, L^2) \le C_d\, \frac{(\log n)^{d/2}}{n}. \tag{2.1}$$

The proof of the upper bound on $R_n(\mathcal{F}_d, L^2)$ in (2.1) is based on an extension of the ideas of [5] to $d \ge 1$ (a simple exposition of these ideas can be found in Kim [6, Theorem 4.1]). It involves considering the estimator
$$\hat f_n(x) := \frac{1}{n h^d} \sum_{i=1}^n \mathbf{K}\Big(\frac{x - X_i}{h}\Big), \tag{2.2}$$
where $\mathbf{K}(y) := K(y_1) \cdots K(y_d)$ with $K(y) := (\sin y)/(\pi y)$, and the bandwidth $h$ is taken to be $h := (2 \log n)^{-1/2}$. Controlling the variance of $\hat f_n(x)$ is straightforward, while bounding the bias is non-trivial; we do this via Fourier analysis. The proof of the lower bound on $R_n(\mathcal{F}_d, L^2)$ in (2.1) is based on an extension of the ideas of Kim [6, Proof of Theorem 1.1]. It involves applying Assouad's lemma (recalled in Lemma 3.1) to a carefully chosen subset of $\mathcal{F}_d$ whose elements are indexed by a hypercube. This subset of $\mathcal{F}_d$ is constructed by taking mixing measures that are additive perturbations of a Gaussian mixing measure. The additive perturbations are created using Hermite polynomials.
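For illustration, (2.2) is an ordinary product-kernel density estimator with the sinc kernel and a logarithmically small bandwidth. Here is a minimal Python sketch; the data-generating mixture and all numerical choices are ours, purely for illustration.

```python
import numpy as np

def sinc_kernel(y):
    """K(y) = sin(y) / (pi * y); np.sinc handles y = 0 correctly (K(0) = 1/pi)."""
    return np.sinc(y / np.pi) / np.pi

def sinc_kde(x, X, h):
    """Evaluate the estimator (2.2) at a point x in R^d.

    x: (d,) evaluation point;  X: (n, d) sample;  h: bandwidth.
    """
    n, d = X.shape
    Y = (x - X) / h                          # (n, d) scaled differences
    prod = np.prod(sinc_kernel(Y), axis=1)   # product kernel over coordinates
    return prod.sum() / (n * h**d)

# Usage with the paper's bandwidth h = (2 log n)^{-1/2}:
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2)) + rng.choice([-2.0, 2.0], size=(500, 1))
h = (2 * np.log(len(X))) ** -0.5
print(sinc_kde(np.zeros(2), X, h))
```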
Our next result, Theorem 2.2, proves a lower bound of order $(\log n)^d/n$ for $R_n(\mathcal{F}_{\mathcal{G}_1(\Gamma)}, h^2)$. A comparison with the upper bound (1.4) of Saha and Guntuboyina [9] reveals that this lower bound is off by at most a factor of $\log n$, and is thus minimax rate optimal up to a single multiplicative factor of $\log n$.
The proof of Theorem 2.2 is based on an extension of the ideas of Kim [6, Proof of Theorem 1.3]. Assouad's lemma is applied to a subset of $\mathcal{F}_{\mathcal{G}_1(\Gamma)}$ constructed by taking mixing measures that are additive perturbations of a Gaussian mixing measure. The perturbations are different from those used in the proof of Theorem 2.1, although they are also based on Hermite polynomials. It is not easy to work directly with the squared Hellinger loss function when dealing with mixture densities, so we crucially use the fact that, on the constructed subset of $\mathcal{F}_d$, the squared Hellinger loss is bounded from below by a constant multiple of the chi-squared divergence.
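To see why such a comparison is possible, note that on a class of densities that are pointwise comparable, the squared Hellinger distance dominates the $\chi^2$-divergence up to constants. A short sketch, under the assumption $f \le C g$ pointwise (the constant in the paper's actual argument may differ):
$$h^2(f, g) = \int \frac{(f - g)^2}{(\sqrt{f} + \sqrt{g})^2}\, dx \;\ge\; \int \frac{(f - g)^2}{2(f + g)}\, dx \;\ge\; \frac{1}{2(C + 1)} \int \frac{(f - g)^2}{g}\, dx \;=\; \frac{\chi^2(f \,\|\, g)}{2(C + 1)},$$
since $(\sqrt{f} + \sqrt{g})^2 \le 2(f + g) \le 2(C + 1)\, g$ whenever $f \le C g$.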
Our final result, Theorem 2.3, proves a lower bound of order $n^{-p/(p+d)}$ (up to a logarithmic factor) for $R_n(\mathcal{F}_{\mathcal{G}_2(p,K)}, h^2)$. As mentioned previously, this result reveals that rates strictly slower than $n^{-1}$ are possible for estimating Gaussian location mixtures, and that the rate can also be affected by the usual curse of dimensionality. A comparison with the upper bound (1.5) of Saha and Guntuboyina [9] reveals that this lower bound is optimal up to logarithmic factors possibly depending on $d$.
For the proof of Theorem 2.3, we first construct a normal mixture density whose mixing measure is a discrete distribution that is supported on a $d$-dimensional product set (lattice) and has Pareto-type tail behavior. We then construct a hypercube of normal mixture densities by perturbing this support set: for each point in the support set, we use either the original point or a nearby point, whose distance to the unperturbed point is determined by the probability that the mixing distribution assigns to the original support point. We finally apply Assouad's lemma to the constructed hypercube of densities in $\mathcal{F}_{\mathcal{G}_2(p,K)}$.
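To make the shape of this construction concrete, here is a small Python sketch of a lattice-supported mixing measure with polynomially decaying, Pareto-type weights. All numerical choices (grid size, spacing, tail exponent) are illustrative stand-ins, not the calibrated values used in the proof.

```python
import numpy as np
from itertools import product

def pareto_lattice_mixing_measure(k0, d, p):
    """A discrete mixing measure on a d-dimensional lattice with
    weights decaying polynomially in the distance from the origin.

    Returns (atoms, weights): atoms is (k0**d, d); weights sums to 1.
    """
    side = np.arange(1, k0 + 1, dtype=float)           # 1-d support grid
    atoms = np.array(list(product(side, repeat=d)))     # lattice S^d
    # Pareto-type decay: mass ~ ||a||^{-(p + d)} (illustrative exponent)
    weights = np.linalg.norm(atoms, axis=1) ** -(p + d)
    return atoms, weights / weights.sum()

atoms, weights = pareto_lattice_mixing_measure(k0=10, d=2, p=1.0)
# p-th moment of this (finite) mixing measure:
print((weights * np.linalg.norm(atoms, axis=1) ** 1.0).sum())
```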
The proofs of our results are given in the next section.

Preliminaries
We shall recall here some standard facts about the Fourier transform and Hermite polynomials that we shall use in our main proofs. We use the notation $T$ for the Fourier transform, defined as
$$T[f](t) := \int_{\mathbb{R}^d} f(x)\, e^{i \langle t, x \rangle}\, dx. \tag{3.1}$$
Plancherel's theorem states that
$$\int |f(x)|^2\, dx = \frac{1}{(2\pi)^d} \int |T[f](t)|^2\, dt.$$
The convolution-product property of the Fourier transform states that $T[f * g] = T[f] \cdot T[g]$. It is well known that the Fourier transform of the standard Gaussian density on $\mathbb{R}^d$ is $e^{-\|t\|^2/2}$. Our lower bound constructions involve Hermite polynomials. These are defined for $d = 1$ as
$$H_j(x) := (-1)^j e^{x^2/2} \frac{d^j}{dx^j} e^{-x^2/2}, \qquad j = 0, 1, 2, \dots \tag{3.2}$$
Note that $H_j(\cdot)$ is an odd function when $j$ is odd (and even when $j$ is even). The Hermite polynomials are orthogonal with respect to the weight function $\phi(x) := (2\pi)^{-1/2}\exp(-x^2/2)$. Specifically, for all $j \ge 0$ and $k \ge 0$, we have
$$\int_{-\infty}^{\infty} H_j(x) H_k(x) \phi(x)\, dx = j!\, \mathbb{1}\{j = k\}. \tag{3.3}$$
We shall use the bound
$$|H_j(x)| \le \kappa \sqrt{j!}\, e^{x^2/4} \qquad \text{for all } x \in \mathbb{R} \text{ and } j \ge 0, \tag{3.4}$$
for a constant $\kappa \approx 1.086 < 2^{1/4}$ (see, e.g., Equation 8.954 of [3]).
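These are the probabilists' Hermite polynomials, available in NumPy as numpy.polynomial.hermite_e. A quick numerical sanity check of the orthogonality relation (3.3) via Gauss-HermiteE quadrature (illustrative only):

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

# Gauss-HermiteE quadrature: integrates f(x) * exp(-x^2/2) exactly
# for polynomials f up to degree 2*deg - 1.
x, w = He.hermegauss(40)

def inner(j, k):
    """Approximate int H_j(x) H_k(x) phi(x) dx, with phi the N(0,1) density."""
    Hj = He.hermeval(x, [0] * j + [1])  # coefficient vector selecting He_j
    Hk = He.hermeval(x, [0] * k + [1])
    return (w * Hj * Hk).sum() / sqrt(2 * pi)

print(inner(3, 3), factorial(3))  # both close to 6
print(inner(3, 5))                # close to 0
```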
For $d \ge 1$ and a multi-index $j = (j_1, \dots, j_d)$, we take $H_j(u) := H_{j_1}(u_1) \cdots H_{j_d}(u_d)$. We next recall Assouad's lemma (see, e.g., [11, Lemma 24.3]), which will be our main tool for proving minimax lower bounds.
Here $\Upsilon(\tau, \tau') := \sum_j \mathbb{1}\{\tau_j \ne \tau'_j\}$ is the Hamming distance between $\tau$ and $\tau'$, and $\chi^2(f \,\|\, g) := \int (f - g)^2/g$ is the $\chi^2$-divergence between densities $f$ and $g$.
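For orientation, a standard form of Assouad's lemma in this setting (a paraphrase of [11, Lemma 24.3] combined with Pinsker's inequality and $\mathrm{KL} \le \chi^2$; the constants in the paper's own display may differ) reads: for densities $\{f_\tau : \tau \in \{0,1\}^N\}$, $n$ i.i.d. observations, and a suitable absolute constant $c > 0$,
$$\inf_{\hat f}\, \max_{\tau \in \{0,1\}^N} \mathbb{E}_{f_\tau}\, L\big(\hat f, f_\tau\big) \;\ge\; c\, N\, \min_{\tau \ne \tau'} \frac{L(f_\tau, f_{\tau'})}{\Upsilon(\tau, \tau')}\, \min_{\Upsilon(\tau, \tau') = 1} \Big(1 - \sqrt{n\, \chi^2(f_\tau \,\|\, f_{\tau'})/2}\,\Big).$$
The construction must therefore make neighboring densities well separated in the loss $L$ while keeping $n\, \chi^2(f_\tau \,\|\, f_{\tau'})$ bounded for $\Upsilon(\tau, \tau') = 1$.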

Proof of Theorem 2.1
We break this proof into two parts: the upper bound on $R_n(\mathcal{F}_d, L^2)$ and the lower bound on $R_n(\mathcal{F}_d, L^2)$. Let us first prove the upper bound.
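The upper-bound argument rests on the usual bias-variance decomposition of the integrated squared error, recorded here for orientation (a standard identity, via Fubini's theorem):
$$\mathbb{E} \int \big(\hat f_n(x) - f(x)\big)^2\, dx \;=\; \int \big(\mathbb{E}\hat f_n(x) - f(x)\big)^2\, dx \;+\; \int \mathrm{Var}\big(\hat f_n(x)\big)\, dx.$$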
We consider the kernel estimator (2.2) with bandwidth $h = (2 \log n)^{-1/2}$. We need to control its variance and its squared bias. We first bound the integrated variance as
$$\int \mathrm{Var}\big(\hat f_n(x)\big)\, dx \le \frac{1}{n h^{2d}} \int \mathbb{E}\, \mathbf{K}^2\Big(\frac{x - X_1}{h}\Big)\, dx = \frac{1}{n h^d} \int \mathbf{K}^2(y)\, dy = \frac{1}{\pi^d\, n h^d}, \tag{3.6}$$
where the last equality follows from Plancherel's theorem (and (3.1)): $T[K] = \mathbb{1}_{[-1,1]}$, so $\int K^2 = 1/\pi$.

We next bound the bias. The density $f \in \mathcal{F}_d$ is of the form (1.1) for some probability measure $G$, and thus we can assume that $f$ is the density of $X = U + Z$ for independent random variables $U \sim G$ and $Z$ having the standard normal distribution on $\mathbb{R}^d$. We can thus write $T[f](t) = T[G](t)\, e^{-\|t\|^2/2}$. Using the formula for the characteristic function of the Gaussian random variable $Z$ and the fact that the characteristic function of $U$ is bounded by 1, we get $|T[f](t)| \le e^{-\|t\|^2/2}$. As a result, since $T[h^{-d}\mathbf{K}(\cdot/h)](t) = \prod_{i=1}^d \mathbb{1}\{|t_i| \le 1/h\}$,
$$\int \big(\mathbb{E}\hat f_n(x) - f(x)\big)^2\, dx = \frac{1}{(2\pi)^d} \int_{\max_i |t_i| > 1/h} |T[f](t)|^2\, dt \le \frac{1}{(2\pi)^d} \int_{\max_i |t_i| > 1/h} e^{-\|t\|^2}\, dt.$$
The standard Gaussian tail bound $\int_{|u| > a} e^{-u^2}\, du \le 2\sqrt{\pi}\, e^{-a^2}$ now shows that the squared bias is at most $C_d\, e^{-1/h^2}$. Combining this with (3.6), the choice $h := (2 \log n)^{-1/2}$ (for which $e^{-1/h^2} = n^{-2}$) clearly leads to $R_n(\mathcal{F}_d, L^2) \le C_d\, (\log n)^{d/2}/n$, which proves the upper bound.

We now prove the lower bound on $R_n(\mathcal{F}_d, L^2)$. The idea is to construct a subset of $\mathcal{F}_d$ indexed by a hypercube $\{0,1\}^N$ for some $N$ and then use Assouad's lemma (Lemma 3.1). Our construction, described below, is a natural extension to $d \ge 1$ of the one-dimensional construction in [6]. Let $m$ be the largest integer satisfying condition (3.7). We can assume without loss of generality that $n$ is large enough so that $m$ defined in this way satisfies $m \ge 3$. It is also easy to check (because $m^{5d/4}\, 8^d\, 3^{dm} \ge e^m$) that condition (3.7) implies $m \le \frac{1}{2} \log n$.
Below we denote by $\phi_{\sigma^2}(\cdot)$ the univariate normal density with mean zero and variance $\sigma^2$, i.e., $\phi_{\sigma^2}(x) := (\sqrt{2\pi}\sigma)^{-1}\exp(-x^2/(2\sigma^2))$. We construct the densities for the application of Assouad's lemma as perturbations of a base density $f_0$ with Gaussian mixing density $\gamma$: the perturbed densities take the form $f_\alpha := f_0 + \epsilon \sum_{j \in J} \alpha_j \Gamma_j$ for $\alpha \in \{0,1\}^J$, where $\epsilon$ is given by (3.8) for a constant $c \in (0,1)$ that will be determined later, $\Gamma_j(x) := \prod_{i=1}^d (\phi * \gamma_{j_i})(x_i)$, and the perturbation functions $\gamma_{j_i}$ are built from the Hermite polynomials $H_{j_i}(\cdot)$ (see (3.2)).

Let us first argue that $f_\alpha \in \mathcal{F}_d$. To see this, it is enough to show that the perturbed mixing density
$$u \mapsto \gamma(u) + \epsilon \sum_{j \in J} \alpha_j \gamma_j(u), \qquad \gamma_j(u) := \prod_{i=1}^d \gamma_{j_i}(u_i), \tag{3.9}$$
integrates to 1 over $u \in \mathbb{R}^d$ and is nonnegative. Integration to 1 is justified by the fact that $\gamma_{j_i}(u_i)$ is an odd function of $u_i$ for each $i = 1, \dots, d$ (so each $\gamma_j$ integrates to zero) together with the fact that $\int \gamma(u)\, du = 1$. For nonnegativity of (3.9), note first that inequality (3.4) implies the pointwise bound (3.10) for each $i = 1, \dots, d$. Because we have assumed that $n$ is large enough so that $m \ge 3$, we have $\exp(-u_i^2/6) \le \exp(-u_i^2/(2m))$, which gives $|\gamma_{j_i}(u_i)| \le 8 \cdot 3^{j_i/2} \sqrt{m}\, \phi_m(u_i)$ for every $i = 1, \dots, d$, and consequently $|\gamma_j(u)| \le 8^d\, 3^{|j|/2}\, m^{d/2} \prod_{i=1}^d \phi_m(u_i)$, where $|j| := j_1 + \cdots + j_d$. Since the maximum value of any $j_i$ is $2m - 1 \le 2m$ for every $j \in J$ and the cardinality of $J$ is $m^d$, plugging in our value of $\epsilon$ (from (3.8)) and using condition (3.7) shows that $\epsilon \sum_{j \in J} |\gamma_j(u)|$ is at most $c\, \gamma(u)$, which implies nonnegativity of (3.9) as long as $c < 1$.

We now lower bound $\min_{\alpha \ne \beta} L^2(f_\alpha, f_\beta)/\Upsilon(\alpha, \beta)$, where $\Upsilon(\alpha, \beta) := \sum_{j \in J} \mathbb{1}\{\alpha_j \ne \beta_j\}$ is the Hamming distance between $\alpha$ and $\beta$. Observe first that $f_\alpha - f_\beta = \epsilon \sum_{j \in J} (\alpha_j - \beta_j) \Gamma_j$. Because $\Gamma_j$ is defined as the product over $i = 1, \dots, d$ of the convolutions of $\phi$ with $\gamma_{j_i}$, the orthogonality of the Hermite polynomials with respect to the weight function $\phi$ (see Equation (3.3)) implies that the functions $\Gamma_j$, $j \in J$, are orthogonal in $L^2$, and therefore $L^2(f_\alpha, f_\beta)/\Upsilon(\alpha, \beta)$ is bounded from below by $\epsilon^2 \min_{j \in J} \int \Gamma_j^2$. Now we bound the $\chi^2$-divergence between $f_\alpha$ and $f_\beta$ for $\alpha$ and $\beta$ with $\Upsilon(\alpha, \beta) = 1$, using (3.11). We split the relevant integral into the region $R := \{x : |x_1| \le M m^{1/2}, \dots, |x_d| \le M m^{1/2}\}$ and its complement $R^c$, where $M$ is a dimensional constant larger than $8d \log 3$.
By Assouad's lemma (Lemma 3.1), we obtain a lower bound on $R_n(\mathcal{F}_d, L^2)$; the constant $c$ can of course be taken small enough that the resulting right-hand side is larger than a constant (depending on $d$ alone) multiple of $(\log n)^{d/2}/n$. The proof is thus complete.

Proof of Theorem 2.2
The idea for the lower bound on $R_n(\mathcal{F}_{\mathcal{G}_1(\Gamma)}, h^2)$ is again to construct a subset of $\mathcal{F}_{\mathcal{G}_1(\Gamma)}$ indexed by a hypercube and then use Assouad's lemma. Our construction is a natural extension to $d \ge 1$ of the one-dimensional construction in [6]. Let $m$ be the largest integer satisfying the defining condition of this proof; this condition implies $m \le \log n$. In order to use Assouad's lemma, we construct densities via perturbations of a base density $f_0$ with mixing density $v$, where, it may be recalled, $\phi_{\sigma^2}(\cdot)$ denotes the univariate normal density with mean zero and variance $\sigma^2$. The perturbed densities take the form $f_\alpha := f_0 + \epsilon \sum_{j \in J} \alpha_j \Lambda_j$ for $\alpha \in \{0,1\}^J$, where $\epsilon$ is determined by a constant $c \in (0,1)$ that will be chosen later, and the perturbations $v_j(u) := \prod_{i=1}^d v_{j_i}(u_i)$ of the mixing density are again based on Hermite polynomials.

By an argument analogous to the one in the proof of Theorem 2.1, we can show that $f_\alpha \in \mathcal{F}_{\mathcal{G}_1(\Gamma)}$. First, $f_\alpha(x)$ integrates to 1 since $v_{j_i}(u_i)$ is an odd function of $u_i$ for each $i = 1, \dots, d$ and $\int v(u)\, du = 1$. For nonnegativity of $v(u) + \epsilon \sum_{j \in J} \alpha_j v_j(u)$, we use (3.4) to bound $|v_{j_i}(u_i)|$ for every $i = 1, \dots, d$; this yields $v(u) + \epsilon \sum_{j \in J} \alpha_j v_j(u) \ge \frac{1}{2} v(u)$ as long as $c < 1/2$. In the same way, we have $v(u) + \epsilon \sum_{j \in J} \alpha_j v_j(u) \le \frac{3}{2} v(u)$.

The fact that all the constructed mixing densities lie between $(1/2)v(u)$ and $(3/2)v(u)$ gives the corresponding pointwise bounds $\frac{1}{2} f_0(x) \le f_\alpha(x) \le \frac{3}{2} f_0(x)$, because mixing with the Gaussian kernel preserves pointwise inequalities between mixing densities. Inequality (3.13) then allows us to pass from the squared Hellinger distance to the $\chi^2$-divergence on this class. By inequality (3.14), it is clear that for the application of Assouad's Lemma 3.1 it is enough to focus on the quantity in (3.15). Because $\Lambda_j/\sqrt{f_0}$ factorizes as a product over $i = 1, \dots, d$ of univariate factors (the factorization follows from Kim [6, Lemma 2.2]), the required separation and $\chi^2$ bounds follow coordinatewise, using the orthogonality relation (3.3). Combining these bounds with (3.14) and applying Lemma 3.1 completes the proof.

Proof of Theorem 2.3
The idea for the lower bound on $R_n(\mathcal{F}_{\mathcal{G}_2(p,K)}, h^2)$ is again to construct a subset of $\mathcal{F}_{\mathcal{G}_2(p,K)}$ indexed by a hypercube and then use Assouad's lemma. Let $S$ denote the one-dimensional support set of the construction and let $k_0$ be the largest integer less than or equal to $k_0'$. Let $k = k_0^d$ and let $a_1, \dots, a_k$ be an enumeration of the points in $S^d = S \times \cdots \times S$. We can take $a_1 = (1, 1, \dots, 1) \in \mathbb{R}^d$, $a_2 = (1 + M \log(2e), 1, \dots, 1)$, and so on, up to $a_k = (1 + (k_0 - 1) M \log(k_0 e), \dots, 1 + (k_0 - 1) M \log(k_0 e))$. Let $G$ be the discrete probability distribution supported on $S^d$ with Pareto-type weights, where $\mathbf{1}$ is the $d$-dimensional vector of ones and $C_{n,d,p}$ is the normalizing constant. We assume $n$ is sufficiently large so that $\log \log k_0 \le M$ and $M^p \ge 2^d$. We claim that
$$C_{n,d,p} > 1/2 \qquad \text{and} \qquad G \in \mathcal{G}_2(p, 6^{1/p} d). \tag{3.18}$$
Let us assume the above claim for now and proceed with the proof; the claim will be proved later. For each $i = 1, \dots, k$, the hypercube of densities either retains the support point $a_i$ or moves it to a nearby point, as described in Section 2.
Thus the lower bound will be a constant (depending only on $d$ and $p$) multiple of $n^{-p/(p+d)} (\log n)^{-3d/2}$.
In the proof of the claim (3.18), the last inequality follows by the induction hypothesis.