Empirical Bayes analysis of spike and slab posterior distributions

In the sparse normal means model, convergence of the Bayesian posterior distribution associated with spike and slab prior distributions is considered. The key sparsity hyperparameter is calibrated via marginal maximum likelihood empirical Bayes. The plug-in posterior squared-L² norm is shown to converge at the minimax rate for the Euclidean norm for appropriate choices of spike and slab distributions. Possible choices include the standard spike and slab with heavy-tailed slab, and the spike and slab LASSO of Ročková and George with heavy-tailed slab. Surprisingly, the popular Laplace slab is shown to lead to a suboptimal rate for the full empirical Bayes posterior. This provides a striking example where convergence of aspects of the empirical Bayes posterior does not entail convergence of the full empirical Bayes posterior itself.


Introduction
In the sparse normal means model, one observes a sequence X = (X_1, …, X_n) with

  X_i = θ_i + ε_i,  i = 1, …, n,  (1)

where θ = (θ_1, …, θ_n) ∈ R^n and ε_1, …, ε_n are i.i.d. N(0, 1). Given θ, the distribution of X is a product of Gaussians and is denoted by P_θ. Further, one assumes that the 'true' vector θ_0 belongs to the set ℓ_0[s_n] of vectors that have at most s_n nonzero coordinates, where 0 ≤ s_n ≤ n. A typical sparsity assumption is that s_n is a sequence that may grow with n but is 'small' compared to n (e.g. in the asymptotics n → ∞, one typically assumes s_n/n = o(1) and s_n → ∞). A natural problem is that of estimating θ with respect to the Euclidean loss ‖θ − θ′‖² = Σ_{i=1}^n (θ_i − θ′_i)². A benchmark is given by the minimax rate for this loss over the class of sparse vectors ℓ_0[s_n]. Denoting r_n := 2 s_n log(n/s_n), [7] show that the minimax rate equals (1 + o(1)) r_n as n → ∞.
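To fix ideas, the observation model and the benchmark rate can be sketched as follows (a minimal illustration; the values of n, s_n and the signal level are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

n, s_n = 10_000, 10                          # ambient dimension and sparsity
theta0 = np.zeros(n)
theta0[:s_n] = np.sqrt(2 * np.log(n / s_n))  # s_n nonzero means
X = theta0 + rng.standard_normal(n)          # X_i = theta_{0,i} + eps_i

r_n = 2 * s_n * np.log(n / s_n)              # minimax rate over l_0[s_n]
```

Here θ_0 lies in ℓ_0[s_n], and r_n is the benchmark any adaptive procedure is measured against.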
Taking a Bayesian approach, one of the simplest and arguably most natural classes of prior distributions in this setting is given by so-called spike and slab distributions

  Π_α = ⊗_{i=1}^n [(1 − α)δ_0 + αG],

where δ_0 denotes the Dirac mass at 0, the distribution G has density γ with respect to Lebesgue measure, and α belongs to [0, 1]. These priors were introduced and advocated in a number of papers, including [15, 8, 9, 21]. One important point is the calibration of the tuning parameter α, which can be done in a number of ways, including: a deterministic n-dependent choice; a data-dependent choice based on a preliminary estimate α̂; a fully Bayesian choice based on a prior distribution on α. Studying the behaviour of posterior distributions in sparse settings is currently the object of much activity; a brief (and by far not exhaustive) overview of recent works is given below. Given a prior distribution Π on θ, and interpreting P_θ as the law of X given θ, one forms the posterior distribution Π[· | X], which is the law of θ given X. The frequentist analysis of the posterior distribution consists in studying the convergence of Π[· | X] in probability under P_{θ_0}, thus assuming that the data has actually been generated from some 'true' parameter θ_0.
In the present paper, we follow this path and are particularly interested in obtaining a uniform bound on the posterior squared-L² moment of the order of the optimal minimax rate, that is, in proving, with C a large enough constant,

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} ∫ ‖θ − θ_0‖² dΠ(θ | X) ≤ C r_n,  (2)

for Π a prior distribution constructed using a spike and slab approach, whose prior parameters may be calibrated using the data, that is, following an empirical Bayes method. This is of interest for at least three reasons:

• it provides adaptive convergence rates for the entire posterior distribution, using a fully data-driven procedure. This is more than obtaining convergence of aspects of the posterior such as the posterior mean or mode, and in fact may require different conditions on the prior, as we shall see below.
• the inequality (2) automatically implies convergence of several commonly used point estimators derived from the posterior Π[· | X]: it implies convergence at rate C r_n of the posterior mean ∫ θ dΠ(θ | X) (using Jensen's inequality, see e.g. [6]), but also of the coordinatewise posterior median (see the supplement of [6] for details) and in fact of any fixed posterior coordinatewise quantile, for instance the quantile 1/4 of Π[· | X]. It also implies, using Chebyshev's inequality, convergence of the posterior distribution at rate M_n r_n for ‖·‖² as in (3) below with M = M_n, for any M_n → ∞.

• knowing (2) is a first step towards results for uncertainty quantification, in particular for the study of certain credible sets. Indeed, (2) suggests a natural way to build such a set, that is, C ⊂ R^n with Π[C | X] ≥ 1 − α for a given α ∈ (0, 1). Namely, define C = {θ : ‖θ − θ̂‖ ≤ r_X}, with θ̂ the posterior mean (or another suitable point estimate of θ) and r_X a large enough multiple of the (1 − α)-quantile of ‖θ − θ̂‖ under Π(· | X).
The present work is the first of a series of papers in which we study aspects of inference using spike and slab prior distributions. In particular, based on the present results, the behaviour of the previously mentioned credible sets is studied in the forthcoming paper [5].
Previous results on frequentist analysis of spike and slab type priors. In a seminal paper, Johnstone and Silverman [12] considered estimation of θ using spike and slab priors combined with an empirical Bayes method for choosing α. They chose α = α̂ based on a marginal maximum likelihood approach, described in more detail below. Denoting by θ̂ the associated posterior median (or posterior mean), [12] established that

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} ‖θ̂ − θ_0‖² ≤ C r_n,

thereby proving minimaxity up to a constant of this estimator over ℓ_0[s_n]. The estimator is adaptive, as knowledge of s_n is not required in its construction.
In [6], convergence of the posterior distribution is studied in the case where α is given a prior distribution. If α ∼ Beta(1, n + 1), Π is the corresponding hierarchical prior, and Π[· | X] the associated posterior distribution, it is established in [6] that for large enough M, as n → ∞,

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} Π[‖θ − θ_0‖² > M r_n | X] → 0.  (3)

In [14], Martin and Walker use a fractional likelihood approach to construct a certain empirical Bayes spike and slab prior, the idea being to reweight the standard spike and slab prior by a power of the likelihood. They derive rate-optimal concentration results for the corresponding posterior distribution and posterior mean.
A related class of prior distributions, recently put forward by Ročková [16] and Ročková and George [17], is given by

  Π_α = ⊗_{i=1}^n [(1 − α)G_0 + αG_1],

where both distributions G_0, G_1 have densities with respect to Lebesgue measure. The authors in particular consider the choices G_0 = Lap(λ_0) and G_1 = Lap(λ_1), where Lap(λ) denotes the Laplace (double-exponential) distribution. Taking λ_0 large enough enables one to mimic the spike of the standard spike and slab prior, and the fact that both G_0, G_1 are continuous distributions offers some computational advantages, especially when working with the posterior mode. One can also note that the posterior mode when α = 1 leads to the standard LASSO estimator. For this reason, the authors of [16, 17] call this prior the spike and slab LASSO prior. It is shown in [16], Theorem 5.2 and corollaries, that a certain deterministic n-dependent choice of α, λ_0, λ_1 (but independent of the unknown s_n) leads to posterior convergence at the near-optimal rate s_n log n, while putting a prior on α can yield ([16], Theorem 5.4) the minimax rate for the posterior, provided a certain condition on the strength of the true nonzero coefficients of θ_0 is verified.
Other priors and related work. We briefly review other options to induce sparsity using a Bayesian approach. One option, considered in [6], is first to draw a subset S ⊂ {1, …, n} at random and then to draw nonzero coordinates on this subset only. That is, sample first a dimension k ∈ {0, …, n} at random according to some prior π; given k, sample S uniformly at random over subsets of size k; and finally set the coordinates of θ outside S to zero, drawing the coordinates in S from a given density. Under the assumption that the prior π on k is of the form, referred to as the complexity prior,

  π(k) = c e^{−ak log(bn/k)},  (4)

[6] show that both (3) and (2) are satisfied. However, such a 'strong' prior on the dimension is not necessary, at least for (3) to hold: it can be checked, for instance, for π the prior on dimension induced by the spike and slab prior on θ with α ∼ Beta(1, n + 1), that π(s_n) ≍ exp(−c s_n) ≫ exp(−c s_n log(n/s_n)). So in a sense the complexity prior 'penalises slightly more than necessary'. Another popular way to induce sparsity is via the so-called horseshoe prior, which draws each coordinate of θ from a continuous distribution which is itself a mixture. As established in [18]-[19], the horseshoe yields the nearly-optimal rate s_n log n uniformly over the whole space ℓ_0[s_n], up again to the correct form of the logarithmic factor. In a different spirit, but still without using Dirac masses at 0, the paper [11] shows that, remarkably, it is also possible to adopt an empirical Bayes approach on the entire unknown distribution function F of the vector θ, interpreting θ as sampled from a certain distribution, and the authors derive oracle results over ℓ_p balls, p > 0, for the plug-in posterior mean (not including the case p = 0 though). We also note the interesting work [20], which investigates necessary and sufficient conditions for sparse continuous priors to be rate-optimal. However, the latter is for a fixed regularity parameter s_n, while the results described in Section 2 (in particular the suboptimality phenomenon, but also the upper bounds using the empirical Bayes approach) are related to adaptation.
Using complexity-type priors on the number of nonzero coordinates, Belitser and co-authors [1]-[2] consider Gaussian priors on nonzero coefficients, with a recentering of the posterior mean at the observation X_i, for those coordinates i that are selected, to adjust for overshrinkage. In [2], oracle results for the corresponding posterior are derived, which in particular imply convergence at the minimax rate up to a constant over ℓ_0[s_n], and the authors also derive results on uncertainty quantification by studying the frequentist coverage of credible sets based on their procedure.
For further references on the topic, in particular on the relationships between spike and slab priors and absolutely continuous counterparts such as the horseshoe or the spike and slab LASSO, we refer to the paper [19] and its discussion by several authors of the previously mentioned works.
Overview of results and outline. This paper obtains the following results.

1. For the spike and slab prior, in Section 2.2 we establish lower bound results showing that the popular Laplace slab yields suboptimal rates when the complete empirical Bayes posterior is considered.

2. In Sections 2.3 and 2.6, we establish rate-optimal results for the posterior squared-L² moment for the usual spike and slab prior with a Cauchy slab, when the prior hyperparameter is chosen via a marginal maximum likelihood method.

3. In Section 2.4, the spike and slab LASSO prior is considered and we provide a near-optimal adaptive rate for the corresponding complete empirical Bayes posterior distribution.
Section 2 introduces the framework, notation, and the main results, ending with a brief simulation study in Section 2.5 and a discussion. Section 3 gathers the proofs of the lower-bound results as well as upper bounds for the spike and slab prior. Technical lemmas for the spike and slab prior can be found in Section 4, while Sections 5-6 contain the proof of the result for the spike and slab LASSO prior.
For real-valued functions f, g, we write f ≲ g if there exists a universal constant C such that f(x) ≤ C g(x), and f ≳ g is defined similarly. When x is a positive real number or an integer, we write f(x) ≍ g(x) if there exist positive constants c, C, D such that for x ≥ D, we have c f(x) ≤ g(x) ≤ C f(x). For reals a, b, one denotes a ∧ b = min(a, b) and a ∨ b = max(a, b).

Empirical Bayes estimation with spike and slab prior
In the setting of model (1), the spike and slab prior on θ with fixed parameter α ∈ [0, 1] is

  Π_α = ⊗_{i=1}^n [(1 − α)δ_0 + αG],  (5)

where G is a given probability measure on R. We consider the following choices:

  G = Lap(1) or G = Cauchy(1),

where Lap(λ) denotes the Laplace (double-exponential) distribution with parameter λ and Cauchy(1) the standard Cauchy distribution. Different choices of parameters and prior distributions are possible (a brief discussion is included below), but for clarity of exposition we stick to these common distributions. In the sequel, γ denotes the density of G with respect to Lebesgue measure. By Bayes' formula, the posterior distribution under (1) and (5) is

  Π_α[· | X] = ⊗_{i=1}^n { (1 − a(X_i)) δ_0 + a(X_i) G_{X_i} },  (6)

where, denoting by φ the standard normal density and by g(x) = (φ * G)(x) = ∫ φ(x − u) dG(u) the convolution of φ and G at the point x ∈ R, the posterior weight a(X_i) is given by, for any i,

  a(x) = αg(x) / (αg(x) + (1 − α)φ(x)).  (7)

The distribution G_{X_i} has density with respect to Lebesgue measure on R

  γ_{X_i}(u) = φ(X_i − u)γ(u) / g(X_i).  (8)

The behaviour of the posterior distribution Π_α[· | X] heavily depends on the choices of the smoothing parameters α and γ. It turns out that some aspects of this distribution are thresholding-type estimators, as established in [12].
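The quantities g = φ * G and a(·) above can be sketched numerically; the quadrature-based implementation below is our illustrative stand-in (closed-form expressions exist for the Laplace slab, and the function names are ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import cauchy, norm

def g(x, slab_pdf=cauchy.pdf):
    """Convolution g(x) = (phi * gamma)(x), computed by quadrature."""
    val, _ = quad(lambda u: norm.pdf(x - u) * slab_pdf(u), -np.inf, np.inf)
    return val

def posterior_weight(x, alpha, slab_pdf=cauchy.pdf):
    """a(x) = alpha g(x) / (alpha g(x) + (1 - alpha) phi(x)), as in (7)."""
    gx = g(x, slab_pdf)
    return alpha * gx / (alpha * gx + (1 - alpha) * norm.pdf(x))
```

Since g/φ is increasing on R_+, the weight a(x) is small for small |x| and approaches 1 for large |x|, which underlies the thresholding behaviour discussed next.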
Posterior median and threshold t(α). The posterior median θ̂^med_α(X_i) of the ith coordinate has a thresholding property: there exists t(α) > 0 such that θ̂^med_α(X_i) = 0 if and only if |X_i| ≤ t(α). A default choice is α = 1/n; one can check that this leads to a posterior median behaving similarly to a hard-thresholding estimator with threshold √(2 log n). One can significantly improve on this default choice by taking a well-chosen data-dependent α.
In order to choose α, in this paper we follow the empirical Bayes method proposed in [12]. The idea is to estimate α by maximising the marginal likelihood of α in the Bayesian model, that is, the marginal density of X given α. The log-marginal likelihood in α can be written as

  ℓ(α) = Σ_{i=1}^n log((1 − α)φ(X_i) + αg(X_i)).  (9)

Let α̂ be defined as the maximiser of the log-marginal likelihood,

  α̂ = argmax_{α ∈ [α_n, 1]} ℓ(α),  (10)

where the maximisation is restricted to the interval [α_n, 1], with α_n defined by t(α_n) = √(2 log n). The reason for this restriction is that one does not need to take α smaller than α_n, which would correspond to a choice of α 'more conservative' than hard thresholding at threshold level √(2 log n). In [12], Johnstone and Silverman prove that the posterior median θ̂^med_α̂(X_i) has remarkable optimality properties for many choices of the slab density γ. For γ with tails 'at least as heavy as' those of the Laplace distribution, this point estimator converges at the minimax rate over ℓ_0[s_n]. More precisely, it follows from Theorem 1 in [12] that there exist constants C, c_0, c_1 such that if

  c_1 log² n ≤ s_n ≤ c_0 n,  (11)

then the posterior median θ̂^med satisfies

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} ‖θ̂^med − θ_0‖² ≤ C r_n.  (12)

One can actually remove the lower bound in condition (11), see Theorem 2 in [12], by a more complicated choice of α̂, for which α̂ in (10) is replaced by a smaller value if the empirical Bayes estimate is close to the value α_n given by t(α_n) = √(2 log n). In the present paper, for simplicity of exposition, we first work under condition (11). In Section 2.6 below, we show that the lower-bound part of the condition can be removed when working with the modified estimator as in [12].
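A minimal numerical sketch of the marginal maximum likelihood step (the quadrature-based g, the toy data, and the lower bound passed as alpha_min are our illustrative choices; in the paper α_n is defined through the threshold t):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import cauchy, norm

def conv_g(x):
    # g = phi * gamma for a standard Cauchy slab, computed by quadrature
    return quad(lambda u: norm.pdf(x - u) * cauchy.pdf(u), -np.inf, np.inf)[0]

def mmle_alpha(X, alpha_min):
    # maximise l(alpha) = sum_i log((1 - alpha) phi(X_i) + alpha g(X_i))
    # over the restricted interval [alpha_min, 1]
    gX = np.array([conv_g(x) for x in X])
    phiX = norm.pdf(X)

    def neg_loglik(alpha):
        return -np.sum(np.log((1 - alpha) * phiX + alpha * gX))

    return minimize_scalar(neg_loglik, bounds=(alpha_min, 1.0),
                           method="bounded").x

rng = np.random.default_rng(1)
n = 200
theta0 = np.zeros(n)
theta0[:5] = 5.0                        # a few strong signals
X = theta0 + rng.standard_normal(n)
alpha_hat = mmle_alpha(X, alpha_min=1.0 / n)
```

With a few strong signals among many pure-noise coordinates, the maximiser typically lands near the true proportion of nonzero means, which is the adaptation mechanism at work.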
Plug-in posterior distribution. The posterior we consider in this paper is Π_α̂[· | X], that is, the distribution given by (6) where α has been replaced by its empirical Bayes (EB) estimate α̂ given by (10). This posterior is called the complete EB posterior in the sequel. The value α̂ is easily found numerically, as implemented in the R package EbayesThresh, see [13]. As noted in [12], the posterior median θ̂^med_α̂(X_i) displays excellent behaviour in simulations. However, the entire posterior distribution Π_α̂[· | X] has not been studied so far. It turns out that the behaviour of the posterior median does not always reflect the behaviour of the complete posterior, as is seen in the next subsection.

Suboptimality of the Laplace slab for the complete EB posterior distribution
Theorem 1. Let Π_α be the spike and slab prior distribution (5) with slab distribution G equal to the Laplace distribution Lap(1). Let Π_α̂[· | X] be the corresponding plug-in posterior distribution given by (6), with α̂ chosen by the empirical Bayes procedure (10). There exist D > 0, N_0 > 0, and c_0 > 0 such that, for any n ≥ N_0 and any s_n with 1 ≤ s_n ≤ c_0 n, there exists θ_0 ∈ ℓ_0[s_n] such that,

Theorem 1 implies that taking a Laplace slab leads to a suboptimal convergence rate in terms of the posterior squared-L² moment. This result is surprising at first, as we know by (12) that the posterior median converges at the optimal rate r_n. The posterior mean also converges at rate r_n uniformly over ℓ_0[s_n], by Theorem 1 of [12]. So at first sight it would be quite natural to expect that so does the posterior second moment.
One can naturally ask whether the suboptimality result of Theorem 1 could come from considering an integrated L² moment, instead of simply asking for a posterior convergence result in probability, as is standard in the posterior rates literature following [10]. We now derive a stronger result than Theorem 1 under the mild condition s_n ≳ log² n. The fact that the result is stronger follows from bounding the integral in the display of Theorem 1 from below by the integral restricted to the set where ‖θ − θ_0‖₂ is larger than the target lower-bound rate.
Theorem 2. Under the same notation as in Theorem 1, if Π_α is a spike and slab distribution with slab G the Laplace distribution, there exists m > 0 such that, for any s_n with s_n/n → 0 and log² n ≲ s_n,

Theorem 2, by providing a lower bound in the spirit of [3], shows that the answer to the above question is negative: for a Laplace slab, the plug-in posterior Π_α̂[· | X] does not converge at the minimax rate uniformly over ℓ_0[s_n].
Note that the suboptimality occurring here does not result from an artificially constructed example (we work under exactly the same framework as [12]) and that it has important (negative) consequences for the construction of credible sets. Due to the rate-suboptimality of the EB Laplace posterior, typical credible sets derived from it (such as, e.g., those obtained by taking quantiles of a recentered posterior second moment) will inherit the suboptimality in terms of their diameter, and therefore will not be of optimal size. Fortunately, it is still possible to achieve optimal rates for certain spike and slab EB posteriors: the previous phenomenon indeed disappears if the tails of the slab in the prior distribution are heavy enough, as seen in the next subsection.

Optimal posterior convergence rate for the EB spike and Cauchy slab
The next result considers Cauchy tails, although other examples can be covered, as discussed below. In the sequel, we abbreviate by SAS prior a spike and slab prior with Cauchy slab.

Theorem 3. Let Π_α be the SAS prior distribution (5) with slab distribution G equal to the standard Cauchy distribution. Let Π_α̂[· | X] be the corresponding plug-in posterior distribution given by (6), with α̂ chosen by the empirical Bayes procedure (10). There exist C > 0, N_0 > 0, and c_0, c_1 > 0 such that, for any n ≥ N_0 and any s_n such that (11) is satisfied for these c_0, c_1,

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} ∫ ‖θ − θ_0‖² dΠ_α̂(θ | X) ≤ C r_n.

If one only assumes s_n ≤ c_0 n in (11), then the last statement holds with the bound C r_n replaced by C r_n + C log³ n.
Theorem 3 confirms that the empirical Bayes plug-in posterior, with α̂ chosen by marginal maximum likelihood, converges at the optimal rate with the precise logarithmic factor, at least under the mild condition (11), provided the tails of the slab distribution are heavy enough. Inspection of the proof of Theorem 3 reveals that any slab density γ with tails of order x^{−1−δ}, δ ∈ (0, 2), gives the same result. Sensitivity to the tails, in particular in view of posterior convergence in terms of d_q-distances, will be further investigated in [5].
We note that the horseshoe prior on θ considered in [18]-[19] also has Cauchy-like tails, which seems to confirm that for empirical Bayes-calibrated (product-type) sparse priors, heavy tails are important to ensure optimal or near-optimal behaviour; see also the discussion [4].
The lower bound in condition (11) is specific to the estimate α̂. Note that in the very sparse regime where s_n ≤ c_1 log² n, the rate is no more than C log³ n, thus missing the minimax rate by at most a logarithmic factor. This lower bound on s_n can be removed, and the minimax rate obtained over the whole range of sparsities s_n, if one slightly modifies α̂, changing the estimator when α̂ is too close to the lower boundary of the maximisation interval; see Section 2.6.

Posterior convergence for the EB spike and slab LASSO
Now consider the following prior on θ, with fixed parameter α ∈ [0, 1],

  Π_α = ⊗_{i=1}^n [(1 − α)G_0 + αG_1],  (13)

where both G_0 and G_1 have densities with respect to Lebesgue measure. Taking G_0 = Lap(λ_0) and G_1 = Lap(λ_1) leads to the spike and slab LASSO prior of [17]; taking instead for G_1 the distribution Cauchy(1/λ_1) gives a heavy-tailed variant of the spike and slab LASSO, that is, with slab density

  γ_1(x) = λ_1 / (π(1 + λ_1² x²)).

In this setting γ_0, γ_1 denote the densities of G_0, G_1 with respect to Lebesgue measure. We call SSL prior a spike and slab LASSO prior with Cauchy slab.
By Bayes' formula, the posterior distribution under (1) and (13) is

  Π_α[· | X] = ⊗_{i=1}^n { (1 − a(X_i)) G_{0,X_i} + a(X_i) G_{1,X_i} },  (14)

where g_k(x) = (φ * G_k)(x) is the convolution of φ and G_k at the point x ∈ R for k = 0, 1, the posterior weight a(X_i) is defined through the function a(·) given by

  a(x) = αg_1(x) / (αg_1(x) + (1 − α)g_0(x)),

and, G_k having density γ_k with respect to Lebesgue measure, the distribution G_{k,X_i} has density

  γ_{k,X_i}(u) = φ(X_i − u)γ_k(u) / g_k(X_i).

In a slight abuse of notation, we keep the same notation in the case of the SSL prior for quantities such as a(x) or α̂ below, as it will always be clear from the context which prior we work with. We consider specific choices (15) for the constants λ_0, λ_1, expressed in terms of constants L_0, L_1. The choice of the constants L_0, L_1 is mostly for technical convenience, and is similar to that of, e.g., Corollary 5.2 in [16]. Any other constant L_0 (resp. L_1) larger (resp. smaller) than the above value also works for the following result. The above numerical values may not be optimal.
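The SSL posterior weight can be sketched in the same way as for the SAS prior, now with two convolved densities (the values of λ_0, λ_1 below are arbitrary illustrations, not the choices (15), and the function names are ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import cauchy, laplace, norm

lam0, lam1 = 5.0, 0.5        # illustrative spike / slab parameters

def conv(x, dens):
    # g_k(x) = (phi * gamma_k)(x), computed by quadrature
    return quad(lambda u: norm.pdf(x - u) * dens(u), -np.inf, np.inf)[0]

def g0(x):
    return conv(x, laplace(scale=1 / lam0).pdf)   # spike: Lap(lam0)

def g1(x):
    return conv(x, cauchy(scale=1 / lam1).pdf)    # slab: Cauchy(1/lam1)

def ssl_weight(x, alpha):
    # a(x) = alpha g1(x) / (alpha g1(x) + (1 - alpha) g0(x))
    return alpha * g1(x) / (alpha * g1(x) + (1 - alpha) * g0(x))
```

Both components are now absolutely continuous, yet the weight a(x) still transitions from near 0 at small |x| to near 1 at large |x|, mimicking the spike and slab behaviour.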
Let α̂ be defined as the maximiser (16) of the log-marginal likelihood over the interval [C log n / n, 1], for C = C_0(γ_0, γ_1) a large enough constant to be chosen below (this ensures that α̂ belongs to an interval on which we can verify that β is increasing, see (40)). This time we do not have access to the threshold t, since for the SSL prior the posterior median is not a thresholding estimator, so here C log n / n plays the role of an approximate version of α_n in (10).
Theorem 4. Let Π_α be the SSL prior distribution (13) with Cauchy slab and parameters (λ_0, λ_1) given by (15). Let Π_α̂[· | X] be the corresponding plug-in posterior distribution given by (14), with α̂ chosen by the empirical Bayes procedure (16). There exist C > 0, N_0 > 0, and c_0, c_1 > 0 such that, for any n ≥ N_0 and any s_n such that (11) is satisfied for these c_0, c_1,

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} ∫ ‖θ − θ_0‖² dΠ_α̂(θ | X) ≤ C s_n log n.

If one only assumes s_n ≤ c_0 n in (11), then the last bound holds with C s_n log n replaced by C s_n log n + C log³ n.

This result is an SSL version of Theorem 3. It shows that a spike and slab LASSO prior with heavy-tailed slab distribution and an empirical Bayes choice of the weight parameter leads to a nearly optimal contraction rate for the entire posterior distribution. Hence it provides a theoretical guarantee for a fully data-driven calibration of the smoothing parameter in SSL priors.

A brief numerical study
Theorems 1-2 imply that the posterior distribution for the spike and slab prior with a Laplace(1) slab does not converge at the optimal rate, and that the discrepancy between the actual rate and the minimax rate for some 'bad' θ_0's is at least of order

  R_n = exp{√(2 log(n/s_n))} / log(n/s_n),

up to a multiplicative constant factor, as both lower and upper bounds hold up to a constant. Note that R_n grows more slowly than a polynomial in n/s_n, so the suboptimality effect will typically only be visible for quite large values of n/s_n. For instance, if n = 10^4 and s_n = 10, one has R_n ≈ 6, which is quite small given that an extra multiplicative constant is also involved.
For the present simulation study we took n = 10^7 and s_n = 10, for which R_n ≈ 13.9, and the nonzero values of θ_0 equal to {2 log(n/s_n)}^{1/2}, as the lower-bound proof of Theorems 1-2 suggests. We computed α̂ using the package EbayesThresh of Johnstone and Silverman [13] and computed ∫ ‖θ − θ_0‖₂² dΠ_α̂(θ | X) using its explicit expression, which can be obtained in closed form for a Laplace slab by computations similar to those in [13], Section 6.3. We then took the empirical average over 100 repetitions to estimate the target expectation E_{θ_0} ∫ ‖θ − θ_0‖₂² dΠ_α̂(θ | X). We first took γ = Lap(1), a standard Laplace slab, and obtained R̂_2 ≈ 1110. For comparison, we computed the empirical quadratic risk R̂_mean of the posterior mean (approximating E_{θ_0} ‖θ̂^mean − θ_0‖²) and R̂_median that of the posterior median of the same posterior, obtaining R̂_mean ≈ 158 and R̂_median ≈ 167. So, in this case R̂_2 is already 6 to 7 times larger than the risk of either the mean or the median.
To further illustrate the 'blow-up' in the rate for the posterior second moment R̂_2, we took a Laplace slab Lap(a) with inverse-scale parameter a, for which the numerator in the definition of R_n becomes exp{a√(2 log(n/s_n))} (let us also note that the multiplicative constant we refer to above also depends on a). The same simulation experiment as above was conducted, with the standard Laplace slab replaced by a Lap(a) slab, for different values of a. The numerical results are presented in Table 1; they feature a noticeable increase in the second moment R̂_2, while the risks of the posterior mean and median stay around the same value, as expected. We also performed the same experiments for the quasi-Cauchy slab prior introduced in [12]-[13] (it is very close to the standard Cauchy slab, in particular it has the same Cauchy tails, but is more convenient from the numerical perspective; see [13], Section 6.4). We found R̂_median ≈ 192 and R̂_mean ≈ 191 for the posterior median and mean, and R̂_2 ≈ 287 for the posterior second moment. This time, as expected, the posterior second moment is not far from the two other risks.

Modified empirical Bayes estimator
For n ≥ 3 and A ≥ 0, set t_n² = 2 log n − 5 log log n and t_A² = 2(1 + A) log n. For Π_α the SAS prior with a Cauchy slab, let t(α) be, as before, the posterior median threshold for fixed α. It is not hard to check that t(·) is continuous and strictly decreasing, so it has an inverse (see [12], Section 5.3). In a similar fashion to [12], Section 4, let us introduce a modified empirical Bayes estimator: for A ≥ 0 and t̂ := t(α̂), set

  α̂_A = α̂ if t̂ ≤ t_n, and t(α̂_A) = t_A otherwise.  (17)

Theorem 5. Let Π_α be the SAS prior distribution with slab distribution G equal to the standard Cauchy distribution. For a fixed A > 0, let Π_α̂_A[· | X] be the corresponding plug-in posterior distribution, with α̂_A the modified estimator (17). There exist C, c_0 > 0 and N_0 > 0 such that, for any n ≥ N_0 and any s_n with s_n ≤ c_0 n,

  sup_{θ_0 ∈ ℓ_0[s_n]} E_{θ_0} ∫ ‖θ − θ_0‖² dΠ_α̂_A(θ | X) ≤ C r_n.

Theorem 5 shows that the plug-in SAS posterior distribution using the modified estimator (17), A > 0, and a Cauchy slab attains the minimax rate of convergence r_n even in the very sparse regime s_n ≲ log² n, for which the unmodified estimate of Theorem 3 may lose a logarithmic factor.

Discussion
In this paper, we have developed a theory of empirical Bayes choice of the hyperparameter of spike and slab prior distributions. It extends the work of Johnstone and Silverman [12] in that here the complete EB posterior distribution is considered. One important message is that such a generalisation preserves optimal convergence rates, on condition that slab distributions with heavy enough tails are taken. If the tails of the slab are only moderate (e.g. Laplace), then the complete EB posterior rate may be suboptimal. This is in contrast with the hierarchical case considered in [6], where a Laplace slab combined with a Beta-distributed prior on α was shown to lead to an optimal posterior rate. On the one hand, the empirical Bayes method often leads to simpler and/or more tractable practical algorithms; on the other hand, we have illustrated here that the complete EB posterior may in some cases need slightly stronger conditions to preserve optimal theoretical guarantees. This phenomenon had not been pointed out in the literature so far, to the best of our knowledge.
We also note that Theorem 3 (or Theorem 5, if one allows for very sparse signals) enables one to recover the optimal form of the logarithmic factor log(n/s_n) in the minimax rate. This entails significant work, as one needs to control the empirical Bayes weight estimate α̂ both from above and from below. This could work too in the SSL setting of Theorem 4, although it seems to require substantial extra technical work.
Looking at Theorems 1 and 2, it is natural to wonder why the empirical Bayes approach fails for the Laplace slab where the full Bayes approach succeeds, as seen in [6], Theorem 2.2. The reason why the hierarchical Bayes version works also for γ Laplace is the extra penalty on model size induced by the hierarchical prior on the dimension. Indeed, in the full Bayes approach, the posterior distribution of α given X has density

  π(α | X) ∝ π(α) p(X | α),

where p(X | α) is the marginal density one maximises when considering the MMLE α̂. Hence adding a term log π(α), for a well-chosen π (for instance the one arising from a Beta(1, n + 1) prior on α as considered in [6]), to the log-marginal likelihood one maximises forces α̂ to concentrate on smaller values. In the present setting, one could thus consider a penalised log-marginal maximum likelihood, which would force the estimate α̂ to concentrate on slightly smaller values and would allow one to avoid the extra e^{√(log(n/s_n))}-type term arising in Theorems 1-2. The present work can also serve as a basis for constructing confidence regions using spike and slab posterior distributions. This question is considered in the forthcoming paper [5].

Proofs for the spike and slab prior
Let us briefly outline the ingredients of the proofs to follow. For Theorems 1 and 3, our goal is to bound the expected posterior risk

  E_{θ_0} ∫ ‖θ − θ_0‖² dΠ_α̂(θ | X).

There are three main tools. First, after introducing notation and basic bounds in Section 3.1, bounds on the posterior risk for fixed α are given in Section 3.2, as well as corresponding bounds for random α̂. Let us note that these upper bounds are different from those obtained on the quadratic risk of the posterior median in [12] (and in fact must be, in view of the negative result in Theorem 1). Second, inequalities on moments of the score function are stated in Section 3.3. As a third tool, we obtain deviation inequalities on the location of α̂ in Section 3.4. One of the bounds sharpens the corresponding bound of [12] in the case where the signal belongs to the nearly-black class ℓ_0[s_n], which we assume here.
Proofs of Theorems 1 and 3 are given in Sections 3.5 and 3.6. For Theorem 3, we also needed to slightly complete the proof of one of the inequalities on thresholds stated in [12]; see Lemma 11. The proof of Theorem 2, which uses ideas from both previous proofs, is given in Section 3.7. Proofs of technical lemmas for the SAS prior are given in Section 4.

Notation and tools for the SAS prior
Expected posterior L²-squared risk. For a fixed weight α, the posterior distribution of θ is given by (6): on each coordinate, the mixing weight a(X_i) is given by (7) and the density of the nonzero component γ_{X_i} by (8). In the sequel we obtain bounds, first for a given α ∈ [0, 1], on the expected posterior L²-squared risk. To do so, we study the coordinatewise second moment r_2(α, μ, x). This quantity is controlled by a(x) and the term involving γ_x. From the definition of a(x), bounding the denominator from below by one of its two components, and using a(x) ≤ 1, yields, for any real x,

  a(x) ≤ 1 ∧ (α/(1 − α)) (g/φ)(x).

The marginal likelihood in α. By definition, the empirical Bayes estimate α̂ in (10) maximises the logarithm of the marginal likelihood in α in (9). In case the maximum is not taken at the boundary, α̂ is a zero of the derivative (score) of this likelihood. Its expression is S(α) = Σ_{i=1}^n β(X_i, α), where, following [12], we set, for 0 ≤ α ≤ 1 and any real x,

  β(x) = (g/φ)(x) − 1,  β(x, α) = β(x) / (1 + αβ(x)).

The study of α̂ below uses in a crucial way the first two moments of β(X_i, α), so we introduce the corresponding notation next. Let E_τ, for τ ∈ R^n, denote the expectation under X ∼ N(τ, I_n), and denote accordingly the first two moments of β(X_i, α) under E_τ.

The thresholds ζ(α), τ(α) and t(α). Following [12], we introduce several useful thresholds. From Lemma 1 in [12], we know that g/φ, and therefore β = g/φ − 1, is a strictly increasing function on R_+. It is also continuous, so, given α, a pseudo-threshold ζ = ζ(α) can be defined by β(ζ(α)) = 1/α. Further, one can also define τ(α) as the solution in x of αg(x) = (1 − α)φ(x); equivalently, a(τ(α)) = 1/2. Recall from Section 2 that t(α) is the threshold associated with the posterior median for given α. It is shown in [12], Lemma 3, that t(α) ≤ ζ(α). Finally, a bound in terms of τ(α), useful for large x, can be found in [12], p. 1623.
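As an illustration, the threshold τ(α) can be computed numerically, taking τ(α) as the solution in x > 0 of a(x) = 1/2 per the characterisation above (a sketch with quadrature-based g for the Cauchy slab; function names are ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import cauchy, norm

def g(x):
    # g = phi * gamma for the standard Cauchy slab, computed by quadrature
    return quad(lambda u: norm.pdf(x - u) * cauchy.pdf(u), -np.inf, np.inf)[0]

def a(x, alpha):
    gx = g(x)
    return alpha * gx / (alpha * gx + (1 - alpha) * norm.pdf(x))

def tau(alpha):
    # tau(alpha) solves a(x) = 1/2 on R_+; a(., alpha) is increasing there
    # since g/phi is, so a bracketing root-finder applies
    return brentq(lambda x: a(x, alpha) - 0.5, 0.0, 20.0)
```

As expected, the threshold grows as α shrinks: smaller prior weight on the slab means larger observations are needed before the signal component dominates.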

Posterior risk bounds
Recall the notation above.

Lemma 1. Let γ be the Cauchy or Laplace density. For any x and α ∈ [0, 1/2], the pointwise bounds below hold. Let γ be the Cauchy density; then for any real x and α ∈ [0, 1/2], the bounds in expectation below hold.

The following lower bound is used in the proof of Theorem 1.
Lemma 2. Let γ be the Laplace density. There exists C_0 > 0 such that, for x ∈ R and α ∈ [0, 1], the bound below holds.

We now turn to bounding r_2(α̂, µ, x). This is the quantity r_2(α, µ, x), where α (which enters via a(x) = a_α(x)) is replaced by α̂. This is done with the help of the threshold τ(α̂).
Lemma 3 (no signal or small signal). Let γ be the Cauchy density. Let α be a fixed non-random element of (0, 1). Let α̂ be a random element of [0, 1] that may depend on x ∼ N(0, 1) and on other data. Then there exists C_1 > 0 such that the first bound below holds, and there exists C_2 > 0 such that for any real µ, if x ∼ N(µ, 1), the second bound holds.

Lemma 4 (signal). Let γ be the Cauchy density. Let α be a fixed non-random element of (0, 1). Let α̂ be a random element of [0, 1] that may depend on x ∼ N(µ, 1) and on other data, and such that τ(α̂)^2 ≤ d log(n) with probability 1 for some d > 0. Then there exists C_2 > 0 such that, for all real µ, E r_2(α̂, µ, x) is bounded as displayed below.

Moments of the score function
The next three lemmas are borrowed from [12] and apply to any density γ such that log γ is Lipschitz on R and satisfies (23). Both the Cauchy and Laplace densities satisfy (23), with κ = 2 and κ = 1 respectively, and their logarithms are Lipschitz.
Also, the function α → m(α) is nonnegative and increasing in α.
Lemma 7. There exists a constant c_1 such that for any x and α, and constants c_2, c_3, c_4 such that for any α, with κ as in (23), the bounds below hold for all µ.

In-probability bounds for α
Lemma 9 below implies that, for any possible θ_0, the estimate α̂ is smaller than a certain α_1 with high probability. One can interpret this as saying that α̂ does not lead to too much undersmoothing (i.e. too many nonzero coefficients). On the other hand, if there is enough signal in a certain sense, α̂ does not lead to too much oversmoothing (i.e. too many zero coefficients); see Lemma 10.
Although we generally follow the approach of [12], there is one significant difference: one needs a fairly sharp bound on α_1 below. Using the definition from [12] would lead to a loss of logarithmic factors for the posterior L2-squared moment. We therefore work with a somewhat different α_1, and provide a detailed proof of the corresponding Lemma 9. For the oversmoothing case, one can borrow the corresponding lemma of [12] as is.
Let α_1 = α_1(d) be defined as the solution of equation (24) below, where d is a constant to be chosen (small enough for Lemma 9 to hold). A solution of (24) exists since, using Lemma 5, the map α → α m(α) is increasing in α and equals 0 at 0. Also, provided η_n is small enough, α_1 can be made smaller than any given arbitrary constant. The corresponding threshold is ζ_1 = ζ(α_1).

Lemma 8. Let κ be the constant in (23). Let α_1 be defined by (24) for d a given constant and let ζ_1 be given by β(ζ_1) = α_1^{-1}. Then there exist real constants c_1, c_2 such that, for large enough n, the bounds below hold, with κ as in (23). Also, ζ_1^2 ∼ 2 log(n/s_n) as n/s_n goes to ∞.

Lemma 9. Let α_1 be defined by (24) for d a given small enough constant and let ζ_1 be given by β(ζ_1) = α_1^{-1}. Suppose (11) holds. Then for some constant C > 0, the supremum bound below holds.

For the oversmoothing case, one denotes by π̄(τ) the proportion of signals above a level τ. We also set, recalling that α_0 is defined via τ(α_0) = 1, the weight α(τ, π̄). One defines ζ_{τ,π̄} as the corresponding pseudo-threshold β^{-1}(α(τ, π̄)^{-1}).
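The existence argument for α_1 can be mimicked numerically. Since the exact constants of equation (24) are not reproduced in this extract, the sketch below solves α m(α) = t for a generic small target t, with m(α) = −E_0[β(X, α)] computed by quadrature for a Cauchy slab; it is only an illustration of the monotonicity of α → α m(α) (Lemma 5), not the paper's exact calibration.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / SQRT2PI

def gamma_cauchy(u):
    """Standard Cauchy slab density."""
    return 1.0 / (math.pi * (1.0 + u * u))

def g(x, half_width=12.0, n=1200):
    """Convolution g = gamma * phi by the trapezoid rule."""
    h = 2.0 * half_width / n
    total = 0.0
    for k in range(n + 1):
        u = x - half_width + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * gamma_cauchy(u) * phi(x - u)
    return total * h

# Precompute beta(x) = g/phi - 1 on a quadrature grid for X ~ N(0, 1).
XS = [-8.0 + 0.1 * k for k in range(161)]
BETAS = [g(x) / phi(x) - 1.0 for x in XS]

def m(alpha):
    """m(alpha) = -E_0[beta(X, alpha)], beta(x, alpha) = beta(x)/(1+alpha*beta(x))."""
    h = XS[1] - XS[0]
    total = 0.0
    for k, x in enumerate(XS):
        w = 0.5 if k in (0, len(XS) - 1) else 1.0
        b = BETAS[k]
        total += w * (b / (1.0 + alpha * b)) * phi(x) * h
    return -total

def solve_alpha1(target, lo=1e-4, hi=1.0, tol=1e-8):
    """Bisection for alpha * m(alpha) = target; the map is increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * m(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, with the toy target t = s_n/n = 0.01 (a hypothetical sparsity level), `solve_alpha1(0.01)` returns a small weight, and shrinking the target shrinks the solution, matching the remark that α_1 can be made smaller than any given constant.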

Proof of Theorem 1
Proof. Let α* be defined as the solution in α of equation (27). Let θ_0 be the specific signal defined, for α*, ζ* as in (27), by the display below. Using this, for c_0 in (11) small enough, one has 2 log(1/η_n) + C ≥ log(1/η_n). We next prove that, for α̂ given by (10) and for small enough c > 0, the probability bound (29) holds. If α* ≤ α_n, the probability at stake is 0, as α̂ belongs to [α_n, 1] by definition. For α* > α_n, we have the displayed identity, so Bernstein's inequality applies. The function α → α m(α) is increasing, as m(·) is (Lemma 5), so by its definition (27), α* can be made smaller than any given positive constant, provided c_0 in (11) is small enough, ensuring η_n = s_n/n is small enough. Using Lemma 6, m_1(ζ, α) ∼ 1/(2α) as α → 0. So, using (27), one obtains, for small enough c_0, a lower bound of order s_n/(12α*). On the other hand, the last part of Lemma 7 implies a bound on the variance proxy V. Using the definition of α*, one deduces V ≲ s_n/α*^2, which in turn implies (29). As Lemma 2 then implies a lower bound on the posterior squared moment for any possibly data-dependent weight α̂, an application of (28) concludes the proof.
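The one-sided Bernstein inequality used in this proof, P(Σ W_i ≥ A) ≤ exp(−A^2/(2(V + MA/3))) for centered independent W_i with |W_i| ≤ M and total variance at most V, can be sanity-checked by simulation. The variables and constants below (uniform W_i, n = 200, A = 15) are arbitrary illustrative choices, not the ones appearing in the proof.

```python
import math
import random

random.seed(0)

n, m_bound = 200, 1.0   # W_i uniform on [-1, 1]: |W_i| <= 1, Var(W_i) = 1/3
v = n / 3.0             # variance proxy V = sum of the variances
a_dev = 15.0            # deviation level A

def bernstein_bound(a, v, m):
    """One-sided Bernstein bound: P(sum W_i >= A) <= exp(-A^2 / (2(V + MA/3)))."""
    return math.exp(-a * a / (2.0 * (v + m * a / 3.0)))

# Monte Carlo estimate of the upper-tail probability.
n_sim = 5000
hits = 0
for _ in range(n_sim):
    s = sum(random.uniform(-1.0, 1.0) for _ in range(n))
    if s >= a_dev:
        hits += 1
emp = hits / n_sim
bound = bernstein_bound(a_dev, v, m_bound)
```

The empirical tail probability lies well below the Bernstein bound, as expected since the bound is not sharp for this deviation level.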

Now, using the bounds displayed below:
These are established in a similar way as in [12], but with the updated definition of α_1, ζ_1 from (24), so we include the proof below for completeness. One can now apply Lemma 4 with α = α_2 to the coordinates i with |θ_{0,i}| > ζ_1. Let us verify that the term in brackets in the last display is bounded above by a constant; conclude that the last display is bounded above by Cnπ̄. Using (31), this term is itself bounded by Cs_n log(n/s_n), which concludes the proof of the Theorem, given (30)-(31).
We now check that (30)-(31) hold. We first compare α_1 and α_2. For small enough α, the bound on m_1 from Lemma 7 becomes 1/α, so that, using the definition (24) of α_1 and the rough bound π_1 ≤ η_n, a first comparison follows. Note that both functions m(·)^{-1} and m_1(ζ_1, ·) are decreasing, by Lemmas 5-6, and so is their product on the interval where both are positive. As d < 2, by the definition of α_2 this yields (30). To prove (31), one first compares ζ_2 to a certain ζ_3. Using Lemma 11, which also gives the existence of ζ_3, one gets a first bound; reasoning as above, this yields the desired comparison. Following [12], one distinguishes two cases to further bound ζ_3. In the first case, the last inequality uses that x → x^4 e^{-(x-ζ_1)ζ_1} is decreasing on the relevant range. In the second case, by the definition of ζ_3, using Lemma 5 as before together with the bound on Φ̄(1), and taking logarithms, one gets, using π_1 ≤ η_n and the monotonicity of the relevant function, the required bound. This concludes the verification of (30)-(31) and the proof of Theorem 3.
In checking (31), one needs a lower bound on m_1. In [12], the authors mention that it follows from their lower bound (82), Lemma 8. But this bound cannot hold uniformly in the smoothing parameter α (denoted by w in [12]), as m_1(µ, w) = −m(w) < 0 when µ = 0. So, although the claimed inequality is correct, it does not seem to follow from (82). We state the inequality we use now, and prove it in Section 4.3.
exists. Let α_3 be the largest such solution. Then, for c_0 in (11) small enough,

Proof of Theorem 2
Let θ_0, α*, ζ* be defined as in the proof of Theorem 1. Below we show that the event A = {α̂ ∈ [α*, cα*]}, for c a large enough constant, has probability going to 1 faster than any polynomial in 1/n. Recall from the proof of Theorem 1 that, if α̂ ≥ α*, so in particular on A, the lower bound there applies, where m is chosen small enough so that v_n ≤ V_X/2 on A. The second line of the next display then follows from Markov's inequality. One now writes the L2-norm in the previous display as a sum over coordinates and expands the square, noting that, given X, the posterior Π_α̂[· | X] makes the coordinates of θ independent. The last bound is the same as in the proof of the upper bound Theorem 3, except that the fourth moment replaces the second moment. Denote r_4(α, µ, x) = ∫(u − µ)^4 dπ_α(u | x). In a similar way as in the proof of Lemma 1, one obtains ∫(u − µ)^4 γ_x(u) du ≤ C(1 + (x − µ)^4). Next, noting that since γ is now Laplace, g has Laplace tails and x → (1 + x^4)g(x) is integrable, proceeding as in the proof of Lemma 1 one gets E_0 r_4(α, 0, x) ≲ α as well as E_µ r_4(α, µ, x) ≲ 1 + τ(α)^4, for any fixed α. Similarly as in Lemmas 3-4, one then derives the corresponding random-α̂ bounds for E r_4(α̂, 0, x) and, for any µ, for E r_4(α̂, µ, x). Using that the probabilities in the last displays go to 0 faster than 1/n, which we show below, and gathering the bounds over all i, one deduces the displayed bound. The last bound goes to 0, as τ(α*) ≤ ζ_{α*} = ζ* and g has Laplace tails. To conclude the proof, we show that P_{θ_0}(α̂ ∉ [α*, cα*]) is small. From the proof of Theorem 1, one already has P_{θ_0}[α̂ < α*] ≤ exp(−cs_n), which is o(1/n) using s_n ≥ log^2 n. To obtain a bound on P_{θ_0}[α̂ > cα*], one can now reverse the inequalities in the reasoning leading to the Bernstein bound in the proof of Theorem 1. With A = Σ_{i=1}^n m_1(µ_i, α), the displayed inequality holds, which is the case for η_n small enough. Since by definition n m(α*) = s_n/(4α*), we have −A ≥ s_n/(8α*). From there one can carry over the same scheme of proof as for the previous Bernstein inequality, now with Ã = −A and Ṽ the variance proxy, which is bounded as displayed.

Proofs of posterior risk bounds: fixed α
Proof of Lemma 1. First one proves the first two bounds. To do so, we derive moment bounds on γ_x. Since γ_x(·) is a density function, we have ∫γ_x(u)du = 1 for any x. This implies (log g)'(x) = ∫(u − x)γ_x(u)du = ∫uγ_x(u)du − x. In [12], the authors check (see p. 1623) that m̃_1(x) := ∫uγ_x(u)du is a shrinkage rule, that is, 0 ≤ m̃_1(x) ≤ x for x ≥ 0, so by symmetry this extends to any real x. Note that for γ Laplace or Cauchy, we have |γ'| ≤ c_1γ and |γ''| ≤ c_2γ. This leads to |g'| ≤ c_1 g and similarly |g''| ≤ c_2 g, so that ∫u^2 γ_x(u)du ≤ C(1 + x^2), which gives the first bound using (18). We note, en passant, that the penultimate display also implies, for any real x, a lower bound showing that ∫u^2 γ_x(u)du goes to ∞ with x. Also, for any real µ, using again g'/g ≤ c_1 and g''/g ≤ c_2 leads to the next display. By the expression of r_2(α, µ, x), this yields the second bound of the lemma.

We now turn to the bounds in expectation. For a zero signal µ = 0, one notes that x = τ(α) is the value at which both terms in the minimum in the first inequality of the lemma are equal. For γ Cauchy, g has Cauchy tails and x → (1 + x^2)g(x) is bounded, so one gets the stated bound with α ≤ 1/2. Turning to the last bound of the lemma, we distinguish two cases. Set T := τ(α) for the remainder of the proof, for simplicity of notation. The first case is |µ| ≤ 4T, for which the bound is direct. The second case is |µ| > 4T. We bound the expectation of each term in the second bound of the lemma (that for r_2(α, µ, x)) separately. One uses the bound (22) and starts by noting that, if Z ∼ N(0, 1), a standard Gaussian tail bound applies. This implies, with Φ̄(u) = ∫_u^∞ φ(t)dt ≤ φ(u)/u for u > 0, the next display. The first term in the last sum is bounded above by 2Φ̄(|µ|/2). The second term, as A ⊂ {x : |x| ≥ |µ|/2}, is bounded above by 2Φ̄(|µ|/4). This gives the claim in the case |µ| > 4T. The last bound of the lemma follows by combining the previous bounds in the two cases.
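The shrinkage property 0 ≤ m̃_1(x) ≤ x and the growth of ∫u^2 γ_x(u)du used in this proof can be checked numerically. The sketch below computes moments of γ_x(u) ∝ γ(u)φ(x − u) for a Cauchy slab; the quadrature window and grid size are ad hoc illustrative choices.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / SQRT2PI

def gamma_cauchy(u):
    """Standard Cauchy slab density."""
    return 1.0 / (math.pi * (1.0 + u * u))

def slab_moment(x, k, half_width=12.0, n=2400):
    """k-th moment of gamma_x(u), proportional to gamma(u) * phi(x - u),
    computed by the trapezoid rule (normalising constant handled by ratio)."""
    h = 2.0 * half_width / n
    num = den = 0.0
    for j in range(n + 1):
        u = x - half_width + j * h
        w = 0.5 if j in (0, n) else 1.0
        f = w * gamma_cauchy(u) * phi(x - u)
        num += f * u ** k
        den += f
    return num / den

def m1_tilde(x):
    """Posterior mean under the slab: the shrinkage rule of [12], p. 1623."""
    return slab_moment(x, 1)
```

Numerically, m̃_1 stays between 0 and the identity on positive inputs, and the second moment of γ_x grows with x, as claimed in the proof.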
Proof of Lemma 2. From the expression of r_2(α, 0, x), the display below follows, where c_0 > 0. Indeed, both functions whose infimum is taken in the last display are continuous in x, strictly positive for any real x, and have respective limits 1 and +∞ as |x| → ∞, using (34); so these functions are bounded below on R by positive constants.

Proofs of posterior risk bounds: random α
Proof of Lemma 3. Using the bound on r_2(α̂, 0, x) from Lemma 1, one obtains the first display. For the first term in the last display, one bounds the indicator from above by 1 and proceeds as in the proof of Lemma 1 to bound its expectation by Cατ(α). The first part of the lemma follows by handling the term E[(1 − a_α̂(x)) …] via the Cauchy-Schwarz inequality. The second part of the lemma follows from the fact that, using Lemma 1, the corresponding expectation is bounded.

Proof of Lemma 4. Combining (22) and the third bound of Lemma 1, one obtains the first display. Note that it is enough to bound the first term on the right-hand side in the last display, as the last one is bounded by a constant under E_µ. Let us distinguish the two cases α̂ ≥ α and α̂ < α.
In the case α̂ ≥ α, as τ(·) is a decreasing function, the displayed bound follows, where we have used e^{-v^2/2} ≤ 1 for any v. As a consequence, one can borrow the fixed-α bound obtained previously. In the case α̂ < α, setting b_n = √(d log n) and noting that τ(α̂) ≤ b_n with probability 1 by assumption, proceeding as above with b_n now replacing τ(α̂), one can bound the same quantity. From this one deduces the next display. Using similar bounds as in the fixed-α case, one obtains the final estimate. Taking the square root and gathering the different bounds obtained concludes the proof.

Proofs on pseudo-thresholds
Proof of Lemma 8. For small α, or equivalently large ζ, we have the expansion below. Deduce that for large n, using η_n = dα_1 m(α_1) and Lemma 5 on m, the next bound holds, and from this a further bound. In particular, using log ζ ≤ a + ζ^2/4 for some constant a > 0 large enough, one gets the next inequality. Inserting this back into the previous inequality leads to the stated bound. The lower bound is obtained by bounding (κ − 1) log(ζ_1) ≥ 0, for small enough α_1.
Proof of Lemma 9. Using (11), log(1/η_n) ≤ log(n) − 2 log log n, and the bound on ζ from Lemma 8 gives the stated estimate. It follows that α_1 belongs to the interval [α_n, 1] over which the likelihood is maximised.
Then one notices that, regardless of whether the maximiser α̂ is attained in the interior or at the boundary of [α_n, 1], the displayed inequality holds. The score function equals S(α) = Σ_{i=1}^n β(X_i, α), a sum of independent variables. By Bernstein's inequality, if the W_i are centered independent variables with |W_i| ≤ M and Σ_{i=1}^n Var(W_i) ≤ V, then for any A > 0, P(Σ_{i=1}^n W_i ≥ A) ≤ exp(−A^2/(2(V + MA/3))). Set W_i = β(X_i, α_1) − m_1(θ_{0,i}, α_1) and A = −Σ_{i=1}^n m_1(θ_{0,i}, α_1). Then one can take M = c_3/α_1, using Lemma 7. One can bound −A from above as follows, using the definition of α_1, provided d is chosen small enough, and, using again the definition of α_1, where one uses that ζ^{-1} is bounded. This leads to the desired bound. As m(α) ≍ ζ_α g(ζ_α), one gets R_α ≍ e^{ζ_α ζ_1}/ζ_α → ∞ as α → 0. On the other hand, with π_1 ≤ s_n/n and α_1 m(α_1) = ds_n/n, one has R_{α_1} < 8/π_1 as d < 2. This shows that the equation at stake has at least one solution for α in the interval (0, α_1). By the definition of m_1(µ, α), for any µ and α, the decomposition into the terms (A) and (B) below holds. By the definition of ζ, the denominator in (B) is bounded from above by 2αβ(x), which controls (B). One splits the integral (A) into two parts, corresponding to β(x) ≥ 0 and β(x) < 0. Let c be the real number such that (g/φ)(c) = 1. By construction, the part of the integral where β(x) < 0 is controlled, where one uses the monotonicity of y → y/(1 + αy). For µ ≥ c, the integral ∫_{-c}^{c} φ(x − µ)dx is bounded above by 2∫_0^c φ(x − µ)dx ≤ 2cφ(µ − c). To establish (33), it thus suffices to show the remaining claim; the right-hand side equals 2m(α) times the relevant factor. Let us distinguish two cases. In the case ζ_3 ≤ 2ζ_1, the previous claim is obtained, since ζ_1 goes to infinity with n/s_n by Lemma 8. In the case ζ_3 > 2ζ_1, we obtain an upper bound on ζ_3 by rewriting the equation defining it. For t ≥ 1, one has Φ̄(t) ≥ Cφ(t)/t, and this can be rewritten using φ. By using e^x/x^2 ≥ Ce^{x/2} for x ≥ 1, one obtains the displayed bound, using that u → u log(1/u) is bounded on (0, 1). So the previous claim is also obtained in this case, as φ(ζ_1 − c) is small compared to (C'ζ_1)^{-1} for large ζ_1.

Proof of the convergence rate for the modified estimator
Proof of Theorem 5. The proof is overall in the same spirit as that of Theorem 2 in [12] and goes by distinguishing the two cases s_n ≥ log^2 n and s_n < log^2 n. The main difference is that here we work with the full posterior distribution, and the risk bounds require Lemmas 1-4, which bound the posterior risk in various settings, as well as a result in the same vein, Lemma 13 below. Also, we need to work with a modified version of ζ_1, to make sure that the probability in Lemma 9 goes to 0 fast enough. We note that this version of ζ_1 is the one used in [12] for both their Theorems 1 and 2 (in our Theorem 3 such a modification is not needed, and we worked with the simpler version there). To do so, one replaces η_n = s_n/n in the definition (24) of α_1 by η̃_n = max(η_n, log^2(n)/n).
To keep notation simple, we still denote the corresponding threshold by ζ_1. In the first part of the proof below, η_n ≥ log^2(n)/n, so this is the same version as in definition (24). In the second part of the proof, we have η̃_n = log^2(n)/n, and we now indicate the relevant properties of the corresponding modified threshold ζ_1. First, the statement of Lemma 8 becomes, with κ = 2 (as γ is Cauchy), the display (35). Second, we need below a bound on P[ζ(α̂) < ζ_1] with the modified version of ζ_1 as above. It is not hard to check that the proof of Lemma 9 goes through with the new version of ζ_1 and η_n replaced by η̃_n. The only difference is with the term cs_n/α_1, which is bounded by cnη̃_n/α_1 ≲ n m(α_1), so that Bernstein's inequality gives (37). We are now ready for the proof of Theorem 5. First consider the case s_n ≥ log^2 n, and let us show that the risk of the empirical Bayes posterior Π_{α̂_A}[· | X] is not larger than that of the non-modified one. One decomposes the risk into the terms (I) and (II) below. The term (I) corresponds to the risk of the unmodified estimator, so is bounded as in Theorem 3. For (II), one splits it according to small and large signals θ_{0,i}: (II) = S + S̄, with S the small-signal part and S̄ = (II) − S. From Lemma 1, one knows that r_2(α̂_A, µ, x) ≤ µ^2 + C(1 + (x − µ)^2), while for µ = 0 one can use the bound in expectation E_0 r_2(α, 0, x) ≤ Cατ(α). We now use the definition of α̂_A to bound α̂_A and τ(α̂_A). To bound τ(α̂_A), note that for any α ∈ (0, 1), by definition a(τ(α)) = 1/2, so for a signal of amplitude τ(α) the posterior puts half of its mass at zero, which means the posterior median is 0, implying τ(α) ≤ t(α), so that τ(α̂_A) ≤ t_A.
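The ordering τ(α) ≤ t(α) ≤ ζ(α) used here can be verified numerically for a Cauchy slab. Below, t(α) is computed as the smallest x > 0 at which the posterior median leaves 0, i.e. where a(x) times the slab-posterior mass of (0, ∞) exceeds 1/2; this characterisation of the median threshold and all quadrature choices are illustrative assumptions, not the paper's exact constants.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / SQRT2PI

def gamma_cauchy(u):
    """Standard Cauchy slab density."""
    return 1.0 / (math.pi * (1.0 + u * u))

def conv(x, lo_u, hi_u, n=2400):
    """Trapezoid rule for the integral of gamma(u)*phi(x - u) over [lo_u, hi_u]."""
    h = (hi_u - lo_u) / n
    total = 0.0
    for k in range(n + 1):
        u = lo_u + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * gamma_cauchy(u) * phi(x - u)
    return total * h

def g(x):
    return conv(x, x - 12.0, x + 12.0)

def beta(x):
    return g(x) / phi(x) - 1.0

def a(x, alpha):
    gx, px = g(x), phi(x)
    return alpha * gx / (alpha * gx + (1.0 - alpha) * px)

def p_pos(x):
    """Slab-posterior mass of (0, infinity), gamma_x(u) prop. gamma(u)*phi(x-u)."""
    return conv(x, 0.0, x + 12.0) / g(x)

def solve_increasing(f, target, lo=0.0, hi=20.0, tol=1e-7):
    """Bisection for f(x) = target, f increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = 0.1
tau_a = solve_increasing(lambda x: a(x, alpha), 0.5)           # a(tau) = 1/2
t_a = solve_increasing(lambda x: a(x, alpha) * p_pos(x), 0.5)  # median leaves 0
zeta_a = solve_increasing(beta, 1.0 / alpha)                   # beta(zeta) = 1/alpha
```

Since a(x) p_pos(x) < a(x), the median threshold is reached after τ(α), and numerically it stays below ζ(α), as in [12], Lemma 3.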
Combining with the bound for α̂_A of Lemma 12, one obtains the next display. For any fixed A > 0, this goes to 0 with n, so it is o(s_nζ_1^2), while s_nζ_1^2 is bounded by Cs_n log(n/s_n), as follows from Lemma 8. Now to bound S̄, one adapts the last bound of Lemma 1 to accommodate the indicator 1_{t̂ > t_n}. This is done in Lemma 13, whose bound (38) implies S̄ ≤ Cs_n t_A^2 P(t̂ > t_n)^{1/2}. This bound coincides, up to a universal constant, with the corresponding bound (128) in [12] (taken for p = 0, p = 1 and q = 2, which corresponds to our setting, i.e. working with ℓ_0 classes and quadratic risk). So the remaining bounds of [12] for the case s_n > c log^2 n can be used directly (the distinction of the three cases as in [12], p. 1646-1647, can be reproduced word for word and is omitted for brevity), leading to S̄ ≤ Cs_n log(n/s_n).
Second, consider the case where s_n ≤ log^2 n. We note that for this regime of s_n, the inequalities (35) become, using that by definition η̃_n = log^2(n)/n, the display below. Let us show that the risk of the plug-in posterior using the modified estimator is at most of the order of the minimax risk. For ζ_1 as above, one decomposes the risk into the terms (i), (ii) and (iii) below. For the terms (i) and (ii), apply respectively each bound of Lemma 3 with α̂ = α̂_A to get (ii) ≲ s_n log n using (35), which is bounded from above by Cs_n log(n/s_n) in the regime s_n ≤ log^2 n. Also,

For large enough n, we have τ (α
We now bound the probability at stake (cf. (53) in [12]). Using (37), we have the displayed bound, so that (i) goes to 0, and is thus o(s_n log(n/s_n)).
Finally, for the term (iii), one uses Lemma 4 with α̂ = α̂_A, which gives a bound of no more than 2Cs_n(1 + A) log n ≤ C's_n log n. As s_n ≤ c log^2 n, we have log n ≲ log(n/s_n), so (iii) ≤ Cs_n log(n/s_n). Putting the previous bounds together, one gets (i) + (ii) + (iii) ≤ Cs_n log(n/s_n), which concludes the proof.
where we use that g has Cauchy tails. The result follows by using the expression of t_A.
Lemma 13. For any real µ, for B := {t̂ > t_n}, and α̂_A, t_A as above, the bound (38) holds.

Proof. Similar to the proof of Lemma 1, one sets T := τ(α̂_A) and distinguishes two cases. If |µ| ≤ 4T, one uses the Cauchy-Schwarz inequality. If |µ| > 4T, one uses the bound on r_2 from Lemma 1 again, keeping the dependence on a(x).

Proof of Theorem 4: the SSL prior
Recall that we use the notation of the SAS case, keeping in mind that every instance of g is replaced by g 1 and (some of the) φs by g 0 .Similarly, β(x, α), m, m 1 and m 2 are defined as in Section 3.1, but with β(x) = g 1 /g 0 − 1.
The main steps of the proof generally follow those of Theorem 3, although technically there are quite a few differences. In the SSL case, we do not know whether the function β = g_1/g_0 − 1 is nondecreasing on the whole of R^+. Yet we show that β, which is an even function, is nondecreasing on an interval J_n; see Proposition 1 below. This allows us to define its inverse β^{-1} = (β|_{J_n})^{-1} on this interval. Further, we prove in Lemma 20 that β crosses the horizontal axis on this interval, is strictly negative on [0, 2λ_1], and tends to ∞ as x → ∞. As β is continuous, its graph therefore crosses any given horizontal line y = c, for any c > 0.
The threshold ζ_1 in the SSL case. In the SSL case, the function α → m(α) = −E_0[β(X, α)] is still nondecreasing, since for any real z the map M_z: α → z/(1 + αz) is nonincreasing and β(X, α) = M_{β(X)}(α). By Proposition 2, we also have that m is positive for α ≥ C log n/n and is of the order of a constant for α = 1. So the map α → α m(α) is nondecreasing on [C log n/n, 1], its value at C log n/n is less than C' log n/n, and its value at 1 is of the order of a constant. This shows, using s_n ≥ c_1 log^2 n by (11), that the corresponding equation has a unique solution α_1 ∈ (C log n/n, 1), with d a small enough constant to be chosen later (see the proof of Lemma 21). Thus we can set ζ_1 = β^{-1}(α_1^{-1}), which satisfies ζ_1^2 ≤ C log(n/s_n).
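The qualitative shape of β = g_1/g_0 − 1 in the SSL case (negative near the origin, then increasing and crossing the axis) can be illustrated numerically. The sketch below uses a Laplace spike with a fixed λ_0 = 10 and a Cauchy slab; in the paper λ_0 grows with n, so this is only a toy configuration. The closed form used for the Laplace-Gaussian convolution g_0 is the standard one.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / SQRT2PI

def norm_cdf(t):
    """Standard normal CDF via erfc, accurate in the far tails."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def g0(x, lam=10.0):
    """Laplace(lam) spike convolved with phi (standard closed form)."""
    x = abs(x)
    return 0.5 * lam * (
        math.exp(0.5 * lam * lam - lam * x) * norm_cdf(x - lam)
        + math.exp(0.5 * lam * lam + lam * x) * norm_cdf(-x - lam)
    )

def g1(x, half_width=12.0, n=2400):
    """Cauchy slab convolved with phi, by the trapezoid rule."""
    h = 2.0 * half_width / n
    total = 0.0
    for k in range(n + 1):
        u = x - half_width + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * phi(x - u) / (math.pi * (1.0 + u * u))
    return total * h

def beta_ssl(x):
    """beta = g1/g0 - 1 in the SSL case."""
    return g1(x) / g0(x) - 1.0
```

In this toy configuration, β is negative at the origin, positive for moderately large x, and increasing there, matching the behaviour established in Lemma 20 and Proposition 1.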
As in the proof of Theorem 3, one can now decompose the risk R_n(θ_0) = E_{θ_0} ∫ ‖θ − θ_0‖^2 dΠ_α̂(θ | X) according to whether coordinates of θ_0 correspond to a 'small' or a 'large' signal, the threshold being the ζ_1 defined above. We next use the first part of Lemma 16 with α = α_1, and the second part of that lemma, to obtain, for any θ_0 in ℓ_0[s_n], a bound over the indices i with θ_{0,i} = 0, where for the last inequality we use Lemma 21. From (41) one then gets the corresponding bound. Let us now check that τ(α_1) ≤ ζ_1. By definition of τ(α_1), using φ ≤ 2g_0 by Lemma 17, α_1^{-1} − 1 = 2(g_1/φ)(τ(α_1)) ≥ β(τ(α_1)) + 1. This gives β(ζ_1) ≥ β(τ(α_1)) + 1, which implies the claim, as β is increasing there. With the previous bound on ζ_1, one obtains that the contribution to the risk of the indices i with |θ_{0,i}| ≤ ζ_1 is bounded by a constant times s_n log(n/s_n).
It remains to bound the part of the risk corresponding to indices i with |θ_{0,i}| > ζ_1. To do so, one uses the second part of Lemma 16 with α chosen as α_2' = C(log n/n), with C as in (40). By the definition of α̂ in (16), the probability that α̂ is smaller than α_2' equals zero. Also, one has the required bound on τ(α_2'), which concludes the proof of Theorem 4.
6. Technical lemmas for the SSL prior

Fixed α bounds
As in the SAS case, we denote by π_α(· | x) the posterior distribution of one coordinate (θ_1, say) for fixed α in the SSL case, given X_1 = x.
Similar to Lemma 1, we have a bound on a(x). The first bound now follows from the inequality g_0 ≥ φ/2 obtained in Lemma 17. For the bound in expectation, one then proceeds as in Lemma 1 to obtain the desired bound for a zero signal. Now for a general signal µ, the bound for r_2(α, µ, x) follows from the definition and the previous bound. For the bound in expectation, by symmetry one can assume µ ≥ 0. Also note that the term with the a(x) factor is bounded in expectation by a constant, using a(x) ≤ 1. To handle the term with 1 − a(x), we distinguish two cases. First, assume µ ≤ λ_0/2. Using (a + b)^2 ≤ 2(a^2 + b^2), for the first term we proceed as in Lemma 1; for the second, using g_0 ≥ φ/2 from Lemma 17 and µ ≤ λ_0/2, it is in turn bounded by a constant times λ_0^{-2}. Now in the case µ > λ_0/2, recall from the proof of Lemma 1 the identity valid for any real x. The first two terms are, in expectation, bounded by a constant. Next one writes the remaining term using Lemma 17. One splits the integral in the last display into two parts. For |x| ≤ µ/4, one uses that g_0'' is bounded, together with the bound g_0 ≥ φ/2. For |x| > µ/4, one uses g_0''/g_0 = λ_0^2(g_0 − φ)/g_0 ≤ λ_0^2, together with 1 − a(x) ≤ (g_0/g_1)(x)/α, which follows from the expression of a(x). This leads to the displayed bound. The first term in the last expression is bounded. The second one is bounded by a constant, given our choice of λ_0, by combining the following: α^{-1} ≤ n from (16), g_0 ≍ γ_0 for µ > λ_0/8 from (47), and g_1 ≍ γ_1.

Random α bounds
Lemma 16. Let α be a fixed non-random element of (0, 1). Let α̂ be a random element of [0, 1] that may depend on x ∼ N(0, 1) and on other data. Then there exists C_1 > 0 such that the first bound below holds. There exists C_2 > 0 such that for any real µ, if x ∼ N(µ, 1), the second bound holds. Suppose now that τ(α̂)^2 ≤ d log(n) with probability 1 for some d > 0, and that x ∼ N(µ, 1). Then there exists C_2 > 0 such that the third bound holds for all real µ.

Proof of Lemma 16. For the first two inequalities, the proof is the same as in the SAS case in Lemma 3, the only difference being the presence of the term 4/λ_0^2 coming from Lemma 14 for the first inequality. For the third inequality, it follows from Lemma 14 that the display below holds. The expectation of the last term is a constant. For the first term, using Lemma 15, one gets the next bound, where the last estimate uses α ≥ 1/n. As in Lemma 4, let us distinguish the two cases α̂ ≥ α and α̂ < α. In the case α̂ ≥ α, as τ(·) is a decreasing function, the displayed bound follows, where we have used e^{-v^2/2} ≤ 1 for any v. To proceed, one uses (42). In expectation, the term in factor of (x − µ)^2 + 1 is bounded by a constant. Using (47) and the fact that g_0''/g_0 ≤ λ_0^2, the term in factor of g_0''/g_0 is bounded via ∫ x^2 e^{-λ_0|x|} dx ≲ n^4 e^{-Cn^2}.
Finally, using (45) and the fact that x → xφ(x) is bounded, one obtains the next bound. As a consequence, one can borrow the fixed-α bound obtained previously. In the case α̂ < α, setting b_n = √(d log n), noting that τ(α̂) ≤ b_n with probability 1 by assumption, and proceeding as above with b_n now replacing τ(α̂), one can bound the same quantity. From this one deduces that E(1 − a_α̂(x)) ∫(u − µ)^2 γ_{0,x}(u)du is bounded from above by a constant times the displayed quantity. Using the same bounds, but squared, as in the fixed-α case, one obtains the final estimate. Taking the square root and gathering the different bounds we obtained concludes the proof.

Properties of the functions g 0 and β for the SSL prior
Recall the notation φ, γ_0, g_0 from Section 2. For any real x, we also write ψ(x) = ∫_x^∞ e^{-u^2/2} du.
Our key result on β is the following.
We next state and prove some Lemmas used in the proof of Proposition 1 below.
The approximation property of φ by g_0 is obtained by a Taylor expansion: for any x, u ∈ R, there exists c between x and x − u such that φ(x − u) = φ(x) − uφ'(c). First we check that, for any x in the prescribed interval, the claimed inequality holds. For any real x, using the inequality e^v ≥ 1 + v, and since φ is 1-Lipschitz, one can bound from below φ(x + u) − φ(x) ≥ −u, which leads to the displayed bound for any such x. This leads to the inequality on g_{0+} − g_{0-} above, using that x belongs to the prescribed interval to obtain the nonpositivity. From this one deduces the next bound. On the prescribed interval φ(x) ≥ 5/λ_0, so using that t → (t − a)/(t + b) is increasing, for large enough n, the claim follows, which concludes the proof.
As x ≥ 2λ 1 by assumption, this leads to the announced inequality.