Asymptotically minimax empirical Bayes estimation of a sparse normal mean vector

For the important classical problem of inference on a sparse high-dimensional normal mean vector, we propose a novel empirical Bayes model that admits a posterior distribution with desirable properties under mild conditions. In particular, our empirical Bayes posterior distribution concentrates on balls, centered at the true mean vector, with squared radius proportional to the minimax rate, and its posterior mean is an asymptotically minimax estimator. We also show that, asymptotically, the support of our empirical Bayes posterior has roughly the same effective dimension as the true sparse mean vector. Simulation from our empirical Bayes posterior is straightforward, and our numerical results demonstrate the quality of our method compared to others having similar large-sample properties.


Introduction
High-dimensional problems, where the parameter is effectively lower-dimensional, are now commonplace in statistical applications. Examples include variable selection in regression (Fan and Lv 2010), covariance matrix estimation (Cai et al. 2010; Cai and Zhou 2012; Lam and Fan 2009), large-scale multiple testing (Bogdan et al. 2011; Cai and Jin 2010), and function estimation (Cai 2012; Johnstone and Silverman 2005). The canonical example, which we shall consider here, is that of estimating a sparse high-dimensional normal mean vector. Let X 1 , . . ., X n be independent observations, with X i ∼ N(θ i , 1), i = 1, . . ., n, and the goal is to estimate the mean vector θ = (θ 1 , . . ., θ n ) under squared-error loss ‖θ̂ − θ‖ 2 , where ‖ · ‖ is the usual ℓ 2 -norm on R n (e.g., Abramovich et al. 2006; Brown and Greenshtein 2009; Castillo and van der Vaart 2012; Donoho and Johnstone 1994; Donoho et al. 1992; Jiang and Zhang 2009). With only a single observation X i for each θ i , accurate estimation is not possible without some structure. Assuming θ is sparse, in the sense that most of the θ i 's are zero, makes the effective dimension relatively small, so that reasonably accurate estimation becomes possible.
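For concreteness, the sampling model just described can be simulated in a few lines. The sizes n = 200, s n = 10, and signal value A = 7 below are borrowed from the simulation settings of Section 4; the function name is ours.

```python
import numpy as np

def sparse_normal_means(n, s, A, rng):
    """Draw X_i ~ N(theta_i, 1), where theta has s non-zero entries equal to A."""
    theta = np.zeros(n)
    theta[:s] = A                       # the s_n "signal" coordinates
    X = theta + rng.standard_normal(n)  # unit-variance noise
    return theta, X

rng = np.random.default_rng(0)
theta, X = sparse_normal_means(n=200, s=10, A=7.0, rng=rng)
```

With only one observation per coordinate, X alone is a poor estimate of θ; the sparsity assumption is what the rest of the paper exploits.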
This normal means model is by now a classic one which has been widely studied from both a mathematical and applied point of view. Despite the extent to which the many-normal-means model has been studied, it is still a practically important model in a variety of problems. For example, the sparse normal mean model is the cornerstone for many modern Bayes and empirical Bayes multiple testing procedures, e.g., Scott and Berger (2006), Jin and Cai (2007), Bogdan et al. (2008), Efron (2008), and Martin and Tokdar (2012). More recently, Scott et al. (2013) have presented a novel use of the same classical model considered here but in the regression setting. Clearly, research on this classical model is alive and well, and the results provided by our unique approach, namely, asymptotically minimax concentration rates and superior finite-sample performance compared to many existing methods, are useful contributions.
Recently, Castillo and van der Vaart (2012) have considered the performance of several Bayesian methods for this problem. They focus on frequentist properties of a Bayesian posterior distribution, and the corresponding Bayes estimators, for priors with a two-groups structure. In sparse estimation problems, a two-groups prior puts positive probability on θ vectors with some exact zero entries, so the marginal prior for each component is a mixture of a continuous distribution and a point mass at zero. Castillo and van der Vaart (2012) show that, for a suitably chosen two-groups prior, the posterior concentrates around the true signal at the asymptotically optimal minimax rate. From this, concentration properties of posterior quantities, such as the posterior mean, can be derived. An important message in their paper is that care is needed in choosing the prior for the non-zero θ entries. In particular, they show that priors with tails that are too light, e.g., Gaussian, give sub-optimal concentration properties. The results presented herein provide similar guidance, though our perspective is quite different.
Here we take a novel empirical Bayes approach. In particular, we present a hierarchical two-groups prior where, given a weight ω, the θ i 's are modeled as independent, with θ i = 0 with probability g i (ω), and θ i ∼ h i (θ | ω) with probability 1 − g i (ω), where the functions g i and h i depend on the data X i . These functions are defined explicitly in Section 2. To complete the hierarchy, ω is assigned a prior concentrated near 1. We argue that the construction mitigates the effect of the data-dependent prior, preventing the posterior from tracking the data too closely. This approach provides some new insights, which we compare with those coming from the fully Bayesian framework of Castillo and van der Vaart (2012).
In Section 3 we present our theoretical framework. First, we show that our empirical Bayes posterior concentrates, with probability 1, around the true mean vector at the optimal minimax rate (with respect to squared-error loss) for the assumed sparsity class. Concentration rate theorems for empirical Bayes posteriors are relatively scarce in the literature, and our technique for handling the challenges that arise from data appearing in both the likelihood and prior might be useful in other problems; one possible extension is discussed briefly in Section 5. We then show that our empirical Bayes posterior mean is an asymptotically minimax estimator of θ. Finally, we show that, asymptotically, the support of our empirical Bayes posterior has, up to a logarithmic factor, the same effective dimension as the true sparse θ. An interesting observation is that the particular form of the prior on ω is the main catalyst for concentration of our empirical Bayes posterior.
Section 4 describes computation of our empirical Bayes posterior mean via a straightforward Markov chain Monte Carlo. Simulation results are presented to show that our empirical Bayes posterior mean generally outperforms those Bayesian and non-Bayesian competitors with comparable large-sample properties. In particular, we compare our method with two hard thresholding estimators (Donoho and Johnstone 1994), Bayes and empirical Bayes estimators based on priors with a two-groups structure (Castillo and van der Vaart 2012; Johnstone and Silverman 2004), and a new estimator based on the one-group Dirichlet-Laplace prior (Bhattacharya et al. 2014). Our proposed empirical Bayes estimator is competitive in all cases considered here, and, in many cases, is strikingly better than the others. Some concluding remarks are given in Section 5.
The dependence of the prior on (α, κ, σ 2 ) will not be reflected in our notation.
Observe that if σ 2 < (1 − κ) −1 , then the prior for θ i is proper, a mixture of a point mass and a Gaussian centered at X i . When σ 2 > (1 − κ) −1 , the prior is improper. In any case, the posterior is proper, so this possible impropriety of the prior is not a concern. In fact, σ 2 = (1 − κ) −1 is a critical boundary, corresponding to an improper uniform prior for the non-zero θ i 's; see Section 3.3. The term ω αn−1 in the joint density, which resembles a beta density, turns out to be critical to the success of our proposed method, both in theory and in implementation.
Given data X = (X 1 , . . ., X n ) from the normal mean model and the empirical Bayes prior distribution Π X for θ, we can combine these via Bayes' theorem to form an empirical Bayes posterior distribution. That is, for a suitable set A in the θ-space, define the probability measure

Q n (A) = ∫ A p n θ (X) Π X (dθ) / ∫ R n p n θ (X) Π X (dθ).

We will investigate concentration properties of the empirical Bayes posterior in Section 3.
In particular, we show that the empirical Bayes posterior mean derived from Q n is an asymptotically minimax estimator of θ.
It might seem that our apparent double use of the data, in the prior and in the likelihood, could lead to a posterior Q n that tracks the data too closely. To see that this is not the case, note that if |X i | is large, then the prior probability for θ i = 0, under Π X , would be rather large. Thus, the prior has an unexpected shrinkage effect, pushing the θ i corresponding to an X i with large magnitude towards zero. On the other hand, an X i with large magnitude shifts the prior on the non-zero part further from zero, effectively making the tails heavier, to accommodate large signals. These two phenomena suggest that using data in both the prior and the likelihood will not result in a posterior that tracks the data too closely. In fact, our theoretical and numerical results demonstrate that the posterior is doing the right thing, namely, concentrating on the true θ.

A fractional likelihood perspective
To start, it will help to look at the proposed model from a different perspective. For mathematical convenience, we shift our focus and rewrite the empirical Bayes posterior Q n using a fractional likelihood. That is, we write p n θ (X) = {p n θ (X)} κ {p n θ (X)} 1−κ and move the 1 − κ fraction into the prior Π X defined above. The effect is an alternative prior (2) for (θ, ω) of a very simple form. To provide some further intuition for the prior (1) presented in Section 2, we may consider a data-free version of the prior in (2), where the X i 's are replaced by hyperparameters µ i . The marginal likelihood for µ = (µ 1 , . . ., µ n ), given ω, is readily obtained, and X i is clearly the maximum marginal likelihood estimate of µ i . The use of plug-in estimates for mean hyperparameters was considered in Babenko and Belitser (2010), though in a slightly different context. We get the empirical Bayes prior (1) by plugging in X i for µ i and undoing the fractional likelihood. Within this alternative setup, we introduce independent binary latent variables I 1 , . . ., I n , where I i = 1 if and only if θ i = 0. Then, given ω, the indicators I 1 , . . ., I n are independent Ber(ω) variables. These indicators characterize the support of the vector θ; in particular, Σ n i=1 (1 − I i ) is the number of non-zero θ i 's and is distributed as Bin(n, 1 − ω). The beta prior for ω is concentrated near 1 for large n, so the support size will tend to be small, consistent with the assumption of sparsity. Castillo and van der Vaart (2012), on the other hand, focus primarily on priors directly on the support size, though this kind of beta-binomial prior is considered in their Example 2.2. We find that direct use of the weight ω is both theoretically and computationally convenient; see Remark 1.
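The support-size claim above can be illustrated numerically. Under the beta prior π(dω) = αn ω αn−1 dω used in this paper (a Beta(αn, 1) distribution), the support size Σ(1 − I i ) is Bin(n, 1 − ω) given ω, with prior expectation n E(1 − ω) = n/(αn + 1). A quick Monte Carlo sketch of our own, with n and α matching the first simulation study:

```python
import numpy as np

def prior_support_sizes(n, alpha, n_sims, rng):
    """Simulate the prior support size sum_i (1 - I_i) ~ Bin(n, 1 - omega),
    with omega ~ Beta(alpha * n, 1), i.e., density alpha*n * omega^(alpha*n - 1)."""
    omega = rng.beta(alpha * n, 1.0, size=n_sims)
    return rng.binomial(n, 1.0 - omega)

rng = np.random.default_rng(1)
sizes = prior_support_sizes(n=200, alpha=0.25, n_sims=20000, rng=rng)
# E[support size] = n / (alpha * n + 1) = 200/51, about 3.9 non-zero coordinates
```

Even though n = 200, the prior typically allocates only a handful of non-zero coordinates, consistent with the sparsity assumption.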
Write this new version of the prior as ΠX , and express the posterior accordingly. This version of the empirical Bayes posterior is particularly amenable to our asymptotic analysis; see also Walker and Hjort (2001). The use of pseudo-posteriors, where an inverse temperature parameter plays the role of κ, has been considered in the statistics and machine learning literature (e.g., Dalalyan and Tsybakov 2008; Jiang and Tanner 2008; Zhang 2006), but our context is different.

Lower bound on the denominator
In the normal mean model, let θ ⋆ denote the true mean vector. Assume that θ ⋆ is sparse in the sense that most of its entries are zero. To make this more precise, let S ⋆ ⊂ {1, 2, . . ., n} denote the support of θ ⋆ , i.e., θ ⋆ i ≠ 0 if and only if i ∈ S ⋆ . Let s n = #S ⋆ be the cardinality of S ⋆ , and say that θ ⋆ is s n -sparse; by sparse we mean that s n = o(n) as n → ∞. Start by rewriting the empirical Bayes posterior Q n once more as

Q n (A) = ∫ A ∏ n i=1 {p θ i (X i )/p θ ⋆ i (X i )} κ ΠX (dθ) / ∫ ∏ n i=1 {p θ i (X i )/p θ ⋆ i (X i )} κ ΠX (dθ). (3)

Our overall goal is to show that Q n concentrates its mass near θ ⋆ with P θ ⋆ -probability 1.
The strategy is to show that the denominator of Q n is not too small, and the numerator, for sets A n away from θ ⋆ , is not too large.
Our first result gives a bound on the denominator of Q n , like that which obtains from the familiar Kullback-Leibler property (e.g., Barron et al. 1999; Ghosal et al. 1999, 2000; Schwartz 1965; Shen and Wasserman 2001). This lower bound will be used in Section 3.3 to derive vanishing upper bounds on the Q n -probability assigned to complements of balls around θ ⋆ . Beyond its role as a tool for proving other results, the following lemma suggests that our empirical Bayes-style prior is sufficiently concentrated around θ ⋆ . As Castillo and van der Vaart (2012) show, without suitable prior concentration, the desired posterior concentration is not possible. Therefore, if we associate lower bounds on the denominator of Q n in (3) with adequate prior concentration, then Lemma 1 says that our prior is sufficiently concentrated around θ ⋆ .

Lemma 1. If θ ⋆ is s n -sparse, then there is a constant η = η(α, κ, σ 2 ) such that, with P θ ⋆ -probability 1, for all large n,

D n ≥ (s n /n) 2s n e ηs n . (4)
Proof. Write D n in terms of the conditional prior (θ 1 , . . ., θ n ) | ω ∼ ΠX,ω and the marginal prior ω ∼ π for ω under ΠX . For given ω, the inner expectation involves an average over all configurations of the indicators (I 1 , . . ., I n ) defined in Section 3.1. This average is clearly larger than the single term in which the indicators exactly match the support S ⋆ of θ ⋆ , times the probability of that configuration. The term ω n−s n (1 − ω) s n is the probability of the configuration of (I 1 , . . ., I n ) matching the support S ⋆ . The integral for i ∈ S ⋆ is the expectation of the normal density ratio for non-zero θ i with respect to the N(X i , σ 2 ) prior, and the product over i ∉ S ⋆ disappears because p 0 (X i ) = p θ ⋆ i (X i ) for i ∉ S ⋆ . To further bound this quantity, first pull out the terms exp{ κ 2 (X i − θ ⋆ i ) 2 } in the latter integrand that do not depend on θ i . Since, by the law of large numbers, s n −1 Σ i∈S ⋆ (X i − θ ⋆ i ) 2 → 1 with P θ ⋆ -probability 1, these terms contribute a factor e O(s n ) . The remaining product over i ∈ S ⋆ equals (1 + κσ 2 ) −s n /2 , and we can conclude that the entire product over i ∈ S ⋆ in the lower bound for D n is itself lower bounded by e O(s n ) . It remains to bound the first integral over ω. Since π(dω) = αnω αn−1 dω, the ω-integral can be bounded below by restricting it to an interval near 1; using the approximation − log(1 − x) = x + o(x), for x ≈ 0, gives a lower bound on the ω-integral of the form (s n /n) 2s n e O(s n ) . Putting these pieces together gives the lower bound (4).

Concentration
In the frequentist problem of estimating an s n -sparse vector θ under squared ℓ 2 -error loss, it is known that the minimax rate is proportional to ε n := s n log(n/s n ); see Donoho et al. (1992). Following Castillo and van der Vaart (2012), our goal here is to show that Q n concentrates asymptotically on n-balls, centered at θ ⋆ , with squared radius proportional to ε n . More precisely, for a constant M > 0, let

A M εn = {θ ∈ R n : ‖θ − θ ⋆ ‖ 2 > M ε n };

we will demonstrate that Q n (A M εn ) → 0 with P θ ⋆ -probability 1.
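As a quick sanity check on the rate formula, ε n = s n log(n/s n ) can be evaluated at the (n, s n ) pairs used in the simulations of Section 4; the helper below is ours.

```python
import math

def minimax_rate(n, s):
    """Minimax squared-error rate eps_n = s_n * log(n / s_n) for s_n-sparse means."""
    return s * math.log(n / s)

# (n, s_n) pairs from the simulation studies of Section 4
for n, s in [(200, 10), (500, 25), (1000, 100)]:
    print(n, s, round(minimax_rate(n, s), 1))
```

Note that ε n grows much more slowly than n itself, reflecting the reduced effective dimension of the problem.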
The theorem below requires a restriction on (κ, σ 2 ). In particular, we require that, for some β > 1, (κ, σ 2 ) reside in the feasible region R β defined in (5). We are particularly interested in large β, so that κ arbitrarily close to 1 can be included. Figure 1 displays a portion of the region R β , for β = 200. The condition σ 2 = (1 − κ) −1 discussed in Section 2 defines the boundary of R β , for large β and κ ≈ 1.

Theorem 1. Suppose (κ, σ 2 ) ∈ R β for some β > 1 and that θ ⋆ is s n -sparse. Then, for all sufficiently large M > 0, Q n (A M εn ) → 0 as n → ∞ with P θ ⋆ -probability 1.
Proof. Let N n be the numerator of Q n (A M εn ) in (3). Taking the expectation of N n , with respect to P θ ⋆ , yields an upper bound consisting of two terms. After some tedious algebra, the second term can be rewritten as an exponential in (θ i − θ ⋆ i ) 2 . Combining the two terms in the upper bound, ignoring the normal density, gives a bound in which, for (κ, σ 2 ) in the feasible region R β in (5), the coefficient on (θ i − θ ⋆ i ) 2 in the exponential term is negative.
We can now find a constant c > 0, depending on (κ, σ 2 , β), such that J n ω (dθ) := ∏ n i=1 J ω (dθ i ) is upper bounded by exp{−c‖θ − θ ⋆ ‖ 2 } times a probability measure in θ on R n . Therefore, by the definition of A M εn , the bound decays exponentially in M ε n . Next, take M such that cM > 2, and then take K ∈ (2, cM). Markov's inequality then gives an upper bound whose sum over n ≥ 1 is finite, so the Borel-Cantelli lemma gives that N n ≤ e −Kε n , with P θ ⋆ -probability 1, for all large n. Putting together this bound on N n and the one on D n from Lemma 1, we get Q n (A M εn ) ≤ exp{−(K − 2)ε n − ηs n } for all large n. Since s n = o(ε n ) and K > 2, the exponent diverges to −∞ regardless of the sign of η. Therefore, Q n (A M εn ) → 0 as n → ∞ with P θ ⋆ -probability 1.
Remark 1. The ε n concentration rate is driven primarily by the beta prior on the weight ω. In particular, it comes from the term (s n /n) 2s n in the lower bound (4) in Lemma 1. This means that the prior for θ, given ω, should be selected so that it does not interfere with the correct rate coming from the lower bound on the denominator of Q n .
Remark 2. Castillo and van der Vaart (2012) show that the minimax concentration rate will not hold if the prior on the non-zero θ i 's has tails that are too light, e.g., Gaussian. A way to understand this point, from our perspective, is that the Gaussian conditional prior interferes with what the beta prior for the weight ω is doing. As we have demonstrated, this does not necessarily mean that a Gaussian form is wrong, but that some adjustments should be made to prevent this interference.

Asymptotic minimaxity of the posterior mean
Since the empirical Bayes posterior concentrates around the right place at the right rate, it ought to produce an estimator of θ with good properties. For this problem, perhaps the most natural choice of estimator is the empirical Bayes posterior mean, θ̂ n = ∫ θ Q n (dθ). Next we show that θ̂ n is an asymptotically minimax estimator when θ ⋆ is s n -sparse.
Theorem 2. Take (κ, σ 2 ) as in Theorem 1. If θ ⋆ is s n -sparse, then there exists a universal constant M ′ > 0 such that, for all large n, E θ ⋆ ‖θ̂ n − θ ⋆ ‖ 2 ≤ M ′ ε n .

Proof. Start by considering the quantity ∫ ‖θ − θ ⋆ ‖ 2 Q n (dθ), which bounds ‖θ̂ n − θ ⋆ ‖ 2 by Jensen's inequality, and split the integral over A M εn and its complement, with M as in Theorem 1. On A c M εn , ‖θ − θ ⋆ ‖ 2 is bounded above by Mε n , and Q n (A c M εn ) ≤ 1 trivially. So, we immediately get a contribution of at most Mε n . For the integration over A M εn , we again look at the numerator and denominator of Q n separately, as in the previous subsection. The denominator has the same lower bound as in Lemma 1. Take n large enough that, with P θ ⋆ -probability 1, the lower bound in the lemma holds; then the expectation of the ratio can be bounded by upper bounding the expectation of the numerator, together with the lower bound on the denominator. The expectation of the numerator, with respect to P n θ ⋆ , proceeds just as in the proof of Theorem 1. This time, we get an extra factor of ‖θ − θ ⋆ ‖ 2 in the integrand, with the same exp{−c‖θ − θ ⋆ ‖ 2 } times a probability measure for θ on R n as in the proof of Theorem 1. Since the function x → xe −cx is monotonically decreasing for large enough x, we see that, for large n, ‖θ − θ ⋆ ‖ 2 e −c‖θ−θ ⋆ ‖ 2 ≤ Mε n e −cMε n on A M εn . Therefore, the expectation is eventually bounded by Mε n exp(−cMε n ). Combining this with the lemma's lower bound, we can find ν > 0 such that, for large n, the A M εn contribution is at most e −νε n = o(ε n ). Take M ′ = 2M to complete the proof.

Effective posterior dimension
Besides posterior concentration around θ ⋆ at the minimax rate, it is desirable that the majority of the posterior mass concentrate on a roughly s n -dimensional subspace of R n , where it is presumed that θ ⋆ resides. Castillo and van der Vaart (2012) show that their fully Bayes posteriors have effective dimension proportional to s n . An interesting question, therefore, is whether a similar result obtains for our empirical Bayes posterior. In this section we show that, under the conditions of Theorems 1-2, the posterior distribution for 1 − ω puts vanishingly small mass above s n n −1 (up to a logarithmic factor), so that ω tends to concentrate around 1 − s n n −1 . That this provides some information about the effective dimension of the posterior can be seen from the following expression:

ω | θ, X ∼ Beta(αn + D θ , n − D θ + 1),

where D θ = #{i : θ i = 0}; this fact derives from the full conditionals in Section 4.1 below. So, if α is not too large, and ω concentrates around 1 − s n n −1 , then D θ concentrates around n − s n . Therefore, the posterior distribution for θ must reside on a space with effective dimension proportional to s n .
Theorem 3. Let δ n = Kε n n −1 , where ε n = s n log(n/s n ) as before, and K > 0 is a suitably large constant. Then, under the conditions of Theorem 1, P(1 − ω > δ n | X) → 0 as n → ∞ with P θ ⋆ -probability 1.

Proof. Write the numerator of P(1 − ω > δ n | X) as an integral over {ω : 1 − ω > δ n }, as in the first display in the proof of Theorem 1. Just as in that proof, we get a bound on the expectation involving a positive constant c and a variance term v = v(σ 2 , β) that depends on the particular σ 2 and β values. Each integral in the inside product is bounded above by 1, so the numerator is bounded by the prior probability π(1 − ω > δ n ) = (1 − δ n ) αn ≤ exp{−Kαε n }. From Lemma 1, we have that the denominator of P(1 − ω > δ n | X) is lower bounded by exp{−2ε n + O(s n )} with probability 1 for large n. So, for large n, we get P(1 − ω > δ n | X) ≤ exp{−Kαε n + 2ε n + O(s n )}.
If we pick K such that Kα > 2, then the fact that s n = o(ε n ) implies that this upper bound approaches zero as n → ∞, proving the claim.
Since the logarithmic term log(n/s n ) is small, the practical implication of this result is that the posterior distribution of ω concentrates around 1 − s n n −1 . The simulation results displayed in Figure 3 below confirm this.

Computational considerations
Computation of the empirical Bayes posterior mean can be carried out via a simple Gibbs sampler for ω and θ = (θ 1 , . . ., θ n ) based on the full conditionals (8a)-(8b), where D θ = #{i : θ i = 0}. That is, first sample from the θ | ω conditional posterior in (8a), then from the ω | θ conditional posterior in (8b). Repeat this process to obtain a sample from the full posterior. R code for this Gibbs sampling procedure is available at www.math.uic.edu/~rgmartin. Once the posterior sample is available, the empirical Bayes estimator θ̂, the posterior mean, is obtained by computing a coordinate-wise average of the posterior θ samples. Besides the posterior mean, many other quantities of interest can be calculated. For example, the inclusion probabilities, P(θ i ≠ 0 | X), i = 1, . . ., n, can be easily calculated. Also, in a function estimation problem, where θ 1 , . . ., θ n are coefficients attached to fixed basis functions, posterior samples of the unknown function are readily available. Theory and experience suggest that good numerical results are obtained for large κ and large σ 2 . Throughout, we use κ = 0.99 and σ 2 = (1 − 0.99) −1 = 100, on the boundary of the feasible region. For α, (7) suggests that relatively small values are appropriate, so that the ω posterior can learn from X through D θ . We have found that choosing α to be decreasing with n is reasonable. (This has no consequence for the results in Theorems 1-3.) In particular, in the three examples below, with n = 200, 500, 1000, we take α = 0.25, 0.10, 0.05, respectively. Alternatively, one could use the data to choose α. For example, a method-of-moments estimator of α can be obtained as follows. First, estimate D = D θ via universal hard thresholding, i.e., D̂ equals the number of X i such that |X i | ≤ (2 log n) 1/2 . Under the assumed prior, D has a beta-binomial distribution, with expectation n 2 α/(nα + 1). Setting this expectation equal to D̂ and solving for α gives a method-of-moments estimator, in particular, α̂ = D̂{n(n − D̂)} −1 . In our examples below, we use the n-dependent but data-free choices of α mentioned above.
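The Gibbs sampler just described can be sketched as follows. The exact conditionals (8a)-(8b) are not reproduced in this excerpt, so the forms below are our own derivation from the fractional-likelihood representation, with a N(X i , σ 2 ) slab, the Beta(αn, 1) prior on ω, and D θ = #{i : θ i = 0}; treat them as an illustrative assumption rather than the paper's exact algorithm. The method-of-moments α̂ from the text is included as a helper.

```python
import numpy as np

def gibbs_eb_posterior(X, alpha, kappa=0.99, sigma2=100.0, n_iter=2000, burn=500, seed=0):
    """Gibbs sampler for the empirical Bayes posterior mean (sketch).

    Assumed conditionals (our derivation, not (8a)-(8b) verbatim):
      theta_i | omega, X : point mass at 0 with prob. 1 - p_i, else N(X_i, 1/(kappa + 1/sigma2))
      omega   | theta, X : Beta(alpha*n + D_theta, n - D_theta + 1), D_theta = #{i: theta_i = 0}
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    slab_sd = np.sqrt(1.0 / (kappa + 1.0 / sigma2))
    # log of (slab marginal)/(spike likelihood), omega factors excluded
    log_bf = 0.5 * kappa * X**2 - 0.5 * np.log1p(kappa * sigma2)
    omega = 1.0 - 1.0 / n
    total = np.zeros(n)
    kept = 0
    for t in range(n_iter):
        # (a) theta | omega: independent spike-or-slab draw for each coordinate
        p_slab = 1.0 / (1.0 + (omega / (1.0 - omega)) * np.exp(-log_bf))
        nonzero = rng.random(n) < p_slab
        theta = np.where(nonzero, X + slab_sd * rng.standard_normal(n), 0.0)
        # (b) omega | theta: beta update driven by the number of zero coordinates
        d0 = n - int(nonzero.sum())
        omega = rng.beta(alpha * n + d0, n - d0 + 1)
        if t >= burn:
            total += theta
            kept += 1
    return total / kept  # coordinate-wise average = posterior mean estimate

def alpha_mom(X):
    """Method-of-moments alpha-hat = D-hat * {n(n - D-hat)}^(-1), with D-hat the
    number of X_i with |X_i| <= sqrt(2 log n), as described in the text."""
    n = len(X)
    D = int(np.count_nonzero(np.abs(X) <= np.sqrt(2 * np.log(n))))
    return D / (n * (n - D))

# toy run mirroring the first simulation study: 10 signals at 7, 190 zeros
rng = np.random.default_rng(42)
theta_star = np.concatenate([np.full(10, 7.0), np.zeros(190)])
X = theta_star + rng.standard_normal(200)
est = gibbs_eb_posterior(X, alpha=0.25)
```

On this toy example the posterior mean keeps the strong signals essentially at their observed values and shrinks the noise coordinates toward zero, as the shrinkage discussion in Section 2 anticipates.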

Simulation studies
For illustration, we first reproduce a simulation study presented in Bhattacharya et al. (2014). In particular, we take samples X = (X 1 , . . ., X n ) of dimension n = 200 from the normal mean model X i ∼ N(θ ⋆ i , 1). Recall that the sparsity level s n is the number of non-zero θ ⋆ i 's. In this case, we consider s n = 10, 20, 40, and the signals are fixed at values A = 7, 8. Table 1 displays estimates of the mean squared error obtained from 100 replications of X. In addition to the proposed empirical Bayes posterior mean estimator (EBM), based on κ = 0.99, σ 2 = 100, and α = 0.25, the methods being compared are the Dirichlet-Laplace estimator (DL) of Bhattacharya et al. (2014), the empirical Bayes median estimator (EBMed) of Johnstone and Silverman (2004), and the fully Bayes posterior median estimator (PMed1) of Castillo and van der Vaart (2012). A few other methods have been considered in the literature recently, and some comments on why they are omitted from comparison here are given in Remark 3 below. We find that our proposed empirical Bayes estimator is the top performer across all these settings.
Consider a single sample X under the simulation setting described above, with n = 200, where the first s n = 10 entries in θ ⋆ equal A = 7 and the remaining entries are zero. For the given X, the Gibbs sampler is run to obtain a sample from our empirical Bayes posterior distribution of θ. In Figure 2 we plot the posterior inclusion probability P(θ i ≠ 0 | X) as a function of the indices i = 1, . . ., n. It is evident that the empirical Bayes posterior is able to clearly identify the correct model. As a second example, we reproduce a simulation study presented in Castillo and van der Vaart (2012). In this case, we look at n = 500, s n = 25, 50, 100, and signals fixed at A = 3, 4, 5. Table 2 displays estimates of the mean squared error based on 100 replications. This time, the methods are two fully Bayes posterior mean estimates (PM1 and PM2), two fully Bayes component-wise posterior medians (PMed1 and PMed2), the Johnstone and Silverman (2004) empirical Bayes mean (EBM) and median (EBMed), and hard thresholding (HT) and hard thresholding oracle (HTO) rules. Our proposed empirical Bayes estimator, based on α = 0.10, is competitive when A = 4 and clearly dominates when A = 5, just as in the previous illustration. Interestingly, the empirical Bayes estimators are the better performers overall in this case.
One rather unusual observation is that some of the methods have, for given s n , a mean square error that increases in the signal size A. We find this behavior counterintuitive, since it should be easier to detect stronger signals. The two thresholding estimators have mean square error decreasing as A increases, as does our proposed estimator.
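The qualitative point about thresholding can be checked directly. The sketch below uses the universal threshold (2 log n) 1/2 that appears in Section 4.1; the HT and HTO rules compared in Table 2 may use different, possibly oracle, thresholds, so this is only an illustration, not a reproduction of the table.

```python
import numpy as np

def hard_threshold(X, lam):
    """Hard thresholding rule: keep X_i when |X_i| > lam, otherwise estimate 0."""
    return np.where(np.abs(X) > lam, X, 0.0)

rng = np.random.default_rng(7)
n, s = 500, 25
lam = np.sqrt(2 * np.log(n))  # universal threshold
mses = {}
for A in (3.0, 5.0):
    theta = np.concatenate([np.full(s, A), np.zeros(n - s)])
    reps = [np.sum((hard_threshold(theta + rng.standard_normal(n), lam) - theta) ** 2)
            for _ in range(100)]
    mses[A] = float(np.mean(reps))
```

The estimated mean square error is substantially smaller at A = 5 than at A = 3, since fewer signals fall below the threshold, consistent with the decreasing-in-A behavior noted above.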
To follow up on the mean square error results in Table 2, we also display the posterior distribution of ω for two separate runs. As indicated by Theorem 3, the posterior distribution for ω should concentrate around 1 − s n n −1 . For both cases in Figure 3, the posterior is concentrated exactly where we expect it to be.
As a final example, consider an n = 1000 dimensional mean vector, with the first 10 entries of θ ⋆ equal to 10, the next 90 entries equal to A, and the remaining 900 entries equal to zero. Mean square errors for two Dirichlet-Laplace estimators in Bhattacharya et al. (2014) and our empirical Bayes estimator, based on α = 0.05, are displayed in Table 3. Here we consider a range of A, from A = 2 to A = 7. For the smaller signals, A ≤ 4, the Dirichlet-Laplace estimator with the smaller prior weight n −1 is the best, but our estimator is better for the larger signals, A > 4. The larger weight Dirichlet-Laplace prior estimator is dominated by our empirical Bayes estimator.

Remark 3. There are several other existing methods available for this problem besides those included in our comparisons here. These include the lasso (Tibshirani 1996), the Bayesian lasso (Park and Casella 2008), the horseshoe prior estimator (Carvalho et al. 2010), the empirical Bayes estimators of Jiang and Zhang (2009) and Brown and Greenshtein (2009), and, most recently, Koenker and Mizera (2014). Some of these methods, including a version of ours, are compared more extensively in Koenker (2014). Those estimators without minimax guarantees, such as the Koenker-Mizera estimator, can only be motivated by finite-sample simulation studies which, by necessity, are narrowly constructed. On the other hand, our estimator has the desired minimax property and also has the best overall finite-sample performance among the provably minimax competitors.

Discussion
This paper has considered the classical problem of estimating a sparse high-dimensional normal mean vector, and we have proposed a novel empirical Bayes solution. Though the stated prior itself may seem overly informative, we show that the prior induces a sort of shrinkage effect, preventing the posterior from tracking the data too closely. We go on to prove that the empirical Bayes posterior concentrates around θ ⋆ at the minimax rate, that its mean is an asymptotically minimax estimator, and that its effective dimension agrees with that of the true sparse mean vector.
The mathematical device used in our asymptotic analysis is an alternative representation of the empirical Bayes model with a fractional likelihood. As in Walker and Hjort (2001), this fractional likelihood posterior is a powerful tool, though our concentration results do not follow immediately from theirs. With this adjustment, the prior changes to a very simple one, which we have called ΠX . The key to the success of our empirical Bayes posterior in the asymptotic framework is the particular beta prior on ω under ΠX . From this prior, and the lower bound derived in Lemma 1, the minimax rate ε n = s n log(n/s n ) drops out almost automatically. As we indicated, to push through the minimax concentration result, we only need the conditional prior on θ, given ω, under ΠX , not to interfere with the dynamics induced by the prior on ω. Intuitively, there should be many priors that would accomplish this. We showed that, with an empirical Bayes prior obtained by centering a Gaussian prior at the observations, minimax concentration follows relatively easily under ΠX . Castillo and van der Vaart (2012) have similar results; e.g., they make sure the prior for θ does not interfere by requiring suitably heavy tails.
In addition to its good large-sample properties, our empirical Bayes procedure is easy to compute, and, in a number of cases, the finite-sample performance of our empirical Bayes posterior mean is considerably better than that of existing methods with comparable large-sample properties (Remark 3). Since our method admits a full posterior distribution, any other feature useful in the signal detection problem, such as the inclusion probabilities displayed in Figure 2, can be readily calculated.
A possible extension of the method presented herein is as follows. Suppose that each X i and θ i is an r-vector, where r = r n possibly depends on n. Collecting some of the variables together in vectors introduces a group structure. This structure appears in a variety of applications, and it has motivated developments in model selection and estimation with grouped variables (e.g., Yuan and Lin 2006). Abramovich and Grinshtein (2013) prove asymptotic minimaxity of a Bayes method in this grouped setting, and we expect that similar results can be derived based on the ideas presented here.

Figure 2 :
Plot of the empirical Bayes posterior inclusion probability P(θ i ≠ 0 | X) for i = 1, . . ., n. Here n = 200, s n = 10, and θ ⋆ 1 = • • • = θ ⋆ 10 = 7.

Table 1 :
Mean square errors, based on 100 replications, sampling X of dimension n = 200.First three rows are from Bhattacharya et al. (2014); last row corresponds to the proposed empirical Bayes posterior mean.Boldface font indicates the column winner.

Table 2 :
Mean square errors, based on 100 replications, sampling X of dimension n = 500. First eight rows are from Castillo and van der Vaart (2012); last row corresponds to the proposed empirical Bayes posterior mean. Boldface font indicates the column winner.

Table 3 :
Mean square errors, based on 100 replications, sampling X of dimension n = 1000. First two rows are from Bhattacharya et al. (2014); last row corresponds to the proposed empirical Bayes posterior mean. Boldface font indicates the column winner.