The Horseshoe Estimator: Posterior Concentration around Nearly Black Vectors

We consider the horseshoe estimator due to Carvalho, Polson and Scott (2010) for the multivariate normal mean model in the situation that the mean vector is sparse in the nearly black sense. We assume the frequentist framework where the data is generated according to a fixed mean vector. We show that if the number of nonzero parameters of the mean vector is known, the horseshoe estimator attains the minimax $\ell_2$ risk, possibly up to a multiplicative constant. We provide conditions under which the horseshoe estimator combined with an empirical Bayes estimate of the number of nonzero means still yields the minimax risk. We furthermore prove an upper bound on the rate of contraction of the posterior distribution around the horseshoe estimator, and a lower bound on the posterior variance. These bounds indicate that the posterior distribution of the horseshoe prior may be more informative than that of other one-component priors, including the Lasso.


Introduction
We consider the normal means problem, where we observe a vector Y ∈ R^n, Y = (Y_1, . . ., Y_n), such that Y_i = θ_i + ε_i, for independent normal random variables ε_i with mean zero and variance σ². The vector θ = (θ_1, . . ., θ_n) is assumed to be sparse, in the 'nearly black' sense that the number of nonzero means is o(n) as n → ∞. A natural Bayesian approach to recovering θ would be to induce sparsity through a 'spike and slab' prior (Mitchell and Beauchamp, 1988), which consists of a mixture of a Dirac measure at zero and a (heavy-tailed) continuous distribution. Johnstone and Silverman (2004) analyzed an empirical Bayes version of this approach, where the mixing weight is obtained by marginal maximum likelihood. In the frequentist set-up that the data is generated according to a fixed mean vector, they showed that the empirical Bayes coordinatewise posterior median attains the minimax rate, in ℓ_q norm, q ∈ (0, 2], for mean vectors that are either nearly black or of bounded ℓ_p norm, p ∈ (0, 2]. Castillo and Van der Vaart (2012) analyzed a fully Bayesian version, where the proportion of nonzero coefficients is modelled by a prior distribution. They identified combinations of priors on this proportion and on the nonzero coefficients (the 'slab') that yield posterior distributions concentrating around the underlying mean vector at the minimax rate in ℓ_q norm, q ∈ (0, 2], for mean vectors that are nearly black, and in ℓ_q norm, q ∈ (0, 2), for mean vectors of bounded weak ℓ_p norm, p ∈ (0, q). Other work on empirical Bayes approaches to the two-group model includes (Efron, 2008; Jiang and Zhang, 2009; Yuan and Lin, 2005).
As a full Bayesian approach with a mixture of a Dirac and a continuous component may require exploration of a model space of size 2^n, implementation on large datasets is currently impractical, although Castillo and Van der Vaart (2012) present an algorithm which can compute several aspects of the posterior in polynomial time, provided sufficient memory can be allocated. Several authors, including (Armagan, Dunson and Lee, 2013; Griffin and Brown, 2010), have proposed one-component priors, which model the spike at zero by a peak in the prior density at this point. For most of these proposals, theoretical justification in terms of minimax risk rates or posterior contraction rates is lacking. The Lasso estimator (Tibshirani, 1996), which arises as the MAP estimator after placing a Laplace prior with common parameter on each θ_i, is an exception. It attains close to the minimax risk rate in ℓ_q, q ∈ [1, 2] (Bickel, Ritov and Tsybakov, 2009). However, it has recently been shown that the corresponding full posterior distribution contracts at a much slower rate than the mode (Castillo, Schmidt-Hieber and Van der Vaart, 2014). This is undesirable, because it implies that the posterior distribution cannot provide an adequate measure of uncertainty in the estimate.
In general one would use a posterior distribution both for recovery and for uncertainty quantification. For the first, a measure of centre, such as a median or mode, suffices. For the second, one typically employs a credible set, which is defined as a central set of prescribed posterior probability. For realistic uncertainty quantification it is necessary that the posterior contracts to its centre at the same rate as the posterior median or mode approaches the true parameter.
In this paper we study the posterior distribution resulting from the horseshoe prior, which is a one-component prior, introduced in (Carvalho, Polson and Scott, 2009, 2010) and expanded upon in (Polson and Scott, 2012a,b; Scott, 2011). It combines a pole at zero with Cauchy-like tails. The corresponding estimator does not face the computational issues of the point mass mixture models. Carvalho, Polson and Scott (2010) already showed good behaviour of the horseshoe estimator in terms of Kullback-Leibler risk when the true mean is zero. Datta and Ghosh (2013) proved some optimality properties of a multiple testing rule induced by the horseshoe estimator. In this paper, we prove that the horseshoe estimator achieves the minimax quadratic risk, possibly up to a multiplicative constant. We furthermore prove that the posterior variance is of the order of the minimax risk, and thus that the posterior contracts at the minimax rate around the underlying mean vector. These results are proven under the assumption that the number p_n of nonzero parameters is known. However, we also provide conditions under which the horseshoe estimator combined with an empirical Bayes estimator still attains the minimax rate, when p_n is unknown. This paper is organized as follows. In section 2, the horseshoe prior is described and a summary of simulation results is given. The main results, that the horseshoe estimator attains the minimax squared error risk (up to a multiplicative constant) and that the posterior distribution contracts around the truth at the minimax rate, are stated in section 3. Conditions on an empirical Bayes estimator of the key parameter τ such that the minimax ℓ_2 risk will still be attained are given in section 4. The behaviour of such an empirical Bayes estimate is compared to a full Bayesian version in a numerical study in section 5. Section 6 contains some concluding remarks. The proofs of the main results and supporting lemmas are in the appendix.

Notation
We write A_n ≍ B_n to denote 0 < lim inf_{n→∞} A_n/B_n ≤ lim sup_{n→∞} A_n/B_n < ∞, and A_n ≲ B_n to denote that there exists a positive constant c, independent of n, such that A_n ≤ cB_n. A ∨ B is the maximum of A and B, and A ∧ B the minimum of A and B. The standard normal density and cumulative distribution function are denoted by φ and Φ, and we set Φ^c = 1 − Φ. The norm ‖·‖ will be the ℓ_2 norm, and the class of nearly black vectors will be denoted by ℓ_0[p_n] = {θ ∈ R^n : #{1 ≤ i ≤ n : θ_i ≠ 0} ≤ p_n}.

[Figure 1 caption: Decreasing τ results in a higher prior probability of shrinking the observations towards zero.]

The horseshoe prior
In this section, we give an overview of some known properties of the horseshoe estimator which will be relevant to the remainder of our discussion. The horseshoe prior for a parameter θ modelling an observation Y ∼ N(θ, σ²I_n) is defined hierarchically (Carvalho, Polson and Scott, 2010):

θ_i | λ_i, τ ∼ N(0, σ²τ²λ_i²), λ_i ∼ C⁺(0, 1), for i = 1, . . ., n,

where C⁺(0, 1) is a standard half-Cauchy distribution. The parameter τ is assumed to be fixed in this paper, rendering the θ_i independent a priori. The corresponding density p_τ increases logarithmically around zero, while its tails decay quadratically. The posterior density of θ_i given λ_i and τ is normal with mean (1 − κ_i)y_i, where κ_i = 1/(1 + τ²λ_i²). Hence, by Fubini's theorem:

T_τ(y_i) = E[θ_i | y_i, τ] = (1 − E[κ_i | y_i, τ]) y_i = y_i (∫₀¹ z^{1/2} (τ² + (1 − τ²)z)^{−1} e^{y_i²z/(2σ²)} dz) / (∫₀¹ z^{−1/2} (τ² + (1 − τ²)z)^{−1} e^{y_i²z/(2σ²)} dz). (1)

The posterior mean E[θ | y, τ] will be referred to as the horseshoe estimator and denoted by T_τ(y). The horseshoe prior takes its name from the prior on κ_i, which is given by:

p(κ_i | τ) = (τ/π) (κ_i(1 − κ_i))^{−1/2} (1 − (1 − τ²)κ_i)^{−1}.

If τ = 1, this reduces to a Be(1/2, 1/2) distribution, which looks like a horseshoe. As illustrated in Figure 1, decreasing τ skews the prior distribution on κ_i towards one, corresponding to more mass near zero in the prior on θ_i and a stronger shrinkage effect in T_τ(y).
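These properties of the shrinkage weights κ_i can be checked numerically. The density formula below is our own derivation by a change of variables from λ_i ∼ C⁺(0, 1) (a sketch, not code from the paper); it reduces to the Be(1/2, 1/2) 'horseshoe' shape at τ = 1, and a Monte Carlo draw confirms that decreasing τ pushes κ_i towards one, i.e. towards stronger shrinkage.

```python
import numpy as np

def kappa_density(kappa, tau):
    """Prior density of kappa_i = 1/(1 + tau^2 * lambda_i^2) when
    lambda_i ~ C+(0, 1), obtained by a change of variables (our derivation).
    At tau = 1 it equals the Be(1/2, 1/2) density 1/(pi sqrt(kappa(1-kappa)))."""
    return (tau / np.pi) / ((1.0 - (1.0 - tau**2) * kappa)
                            * np.sqrt(kappa * (1.0 - kappa)))

# Monte Carlo check of the shrinkage effect of tau: for the same half-Cauchy
# draws, a smaller tau yields kappa_i closer to one (more shrinkage).
rng = np.random.default_rng(0)
lam = np.abs(rng.standard_cauchy(100_000))          # lambda_i ~ C+(0, 1)
mean_kappa = {tau: np.mean(1.0 / (1.0 + tau**2 * lam**2)) for tau in (1.0, 0.1)}
```

At τ = 1 the mean shrinkage weight is 1/2 (the Be(1/2, 1/2) mean), while at τ = 0.1 it is around 0.9, matching the message of Figure 1.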
An unanswered question so far has been how τ should be chosen. Intuitively, τ should be small if the mean vector is very sparse, as the horseshoe prior will then place more of its mass near zero. By approximating the posterior distribution of τ² given κ = (κ_1, . . ., κ_n) in case a prior on τ is used, Carvalho, Polson and Scott (2010) show that if most observations are shrunk near zero, τ will be very small with high probability. They suggest a half-Cauchy prior on τ. Datta and Ghosh (2013) implemented this prior on τ, and their plots of posterior draws for τ at various sparsity levels indicate the expected relationship between τ and the sparsity level: the posterior distribution of τ tends to concentrate around smaller values when the underlying mean vector is sparser. As will be discussed further in the next section, the value τ = p_n/n (up to a log factor) is optimal in terms of mean square error and posterior contraction rates.
In case τ is estimated empirically, as will be considered in section 4, the horseshoe estimator can be computed by plugging this estimate into expression (1), thereby avoiding the use of MCMC. Other aspects of the posterior, such as the posterior variance, can be computed using such a plug-in procedure as well. Polson and Scott (2012a) and Polson and Scott (2012b) consider computation of the horseshoe estimator based on the representation in terms of degenerate hypergeometric functions, as these can be efficiently computed using converging series of confluent hypergeometric functions. They report unproblematic computations for τ² between 1/1000 and 1000. A second option is to apply a quadrature routine to the integral representation in (1). As the continuity and symmetry of T_τ(y) in y can be taken advantage of when computing the horseshoe estimator for a large number of observations, the complexity of these computations mostly depends on the value of τ. Both approaches will be slower for smaller values of τ. Hence, if we use the (estimated) sparsity level p_n/n (up to a log factor) for τ, the computation of the horseshoe estimator will be slower if there are fewer nonzero parameters. As noted by Scott (2010), problems arise in Gibbs sampling precisely when τ is small as well. Hence care needs to be taken with any computational approach if p_n/n is suspected to be very close to zero. The performance of the horseshoe prior, with additional priors on τ and σ², in various simulation studies has been very promising. Carvalho, Polson and Scott (2010) simulated sparse data where the nonzero components were drawn from a Student-t density and found that the horseshoe estimator systematically beat the MLE, the double-exponential (DE) and normal-exponential-gamma (NEG) priors, and the empirical Bayes model due to Johnstone and Silverman (2004) in terms of squared error loss. Only when the signal was neither sparse nor heavy-tailed did the MLE, DE and NEG priors have an edge over the
horseshoe estimator. In similar experiments in (Carvalho, Polson and Scott, 2009; Polson and Scott, 2012a), the horseshoe prior outperformed the DE prior, while behaving similarly to a heavy-tailed discrete mixture. In a wavelet-denoising experiment under several noise levels and loss functions, the horseshoe estimator compared favorably to the discrete wavelet transform and the empirical Bayes model (Polson and Scott, 2010). Bhattacharya et al. (2012) applied several shrinkage priors to data with the underlying mean vector consisting of zeroes and fixed nonzero values, and found the posterior median of the horseshoe prior performing better in terms of squared error than the Bayesian Lasso (BL), the Lasso, the posterior median of a point mass mixture prior as in (Castillo and Van der Vaart, 2012) and the empirical Bayes model proposed by Johnstone and Silverman (2004), and comparable to their proposed Dirichlet-Laplace (DL) prior with parameter 1/n. Results in (Armagan, Dunson and Lee, 2013) are similar. In a second simulation setting, Bhattacharya et al. (2012) generated data of length n = 1000, with the first ten means equal to 10, the next 90 equal to a number A ∈ {2, . . ., 7} and the remainder equal to zero. In this simulation, the horseshoe prior beat the BL (except when A = 2) and the DL prior with parameter 1/n (except when A = 7), while performing similarly to the DL prior with parameter 1/2. It is worthy of note that Koenker (2014) generated data according to the same scheme and applied the empirical Bayes procedures due to Martin and Walker (2014) (EBMW) and Koenker and Mizera (2014) (EBKM) to it. The MSE of EBMW was lower than that of the horseshoe prior for A ∈ {5, 6, 7}, while that of EBKM was much lower in all cases.
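Returning to the computational strategy discussed above, the quadrature route can be sketched as follows. The integral representation used here, T_τ(y) = y·I_{1/2}(y)/I_{−1/2}(y) with I_k(y) = ∫₀¹ z^k (τ² + (1−τ²)z)^{−1} e^{y²z/(2σ²)} dz, is assumed from Lemma A.1; the substitution z = u² removes the z^{−1/2} endpoint singularity, and pulling out the common factor e^{y²/(2σ²)} keeps the integrands numerically stable. This is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.integrate import quad

def horseshoe_estimator(y, tau, sigma=1.0):
    """Plug-in horseshoe posterior mean T_tau(y) by quadrature, assuming the
    representation T_tau(y) = y I_{1/2}(y) / I_{-1/2}(y) (Lemma A.1 notation)."""
    def I(k):
        # substitute z = u^2 (dz = 2u du) to remove the z^{-1/2} singularity;
        # exp(y^2/(2 sigma^2)) cancels in the ratio, so exp(y^2 (z-1)/...) is used
        f = lambda u: (2.0 * u**(2.0 * k + 1.0)
                       / (tau**2 + (1.0 - tau**2) * u**2)
                       * np.exp(y**2 * (u**2 - 1.0) / (2.0 * sigma**2)))
        return quad(f, 0.0, 1.0)[0]
    return y * I(0.5) / I(-0.5)
```

As expected of a shrinkage rule with heavy tails, small observations are pulled towards zero while large ones are left nearly untouched, and the quadrature becomes harder (more adaptive subdivisions) as τ decreases.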

Mean square error and bounds on the posterior variance
In this section, we study the mean square error of the horseshoe estimator, and the spread of the posterior distribution, under the assumption that the number of nonzero parameters p_n is known. Theorem 3.1 provides an upper bound on the mean square error, and shows that for a range of choices of the global parameter τ, the horseshoe estimator attains the minimax ℓ_2 risk, possibly up to a multiplicative constant. Theorem 3.3 states upper bounds on the rate of contraction of the posterior distribution around the underlying mean vector and around the horseshoe estimator, again for a range of values of τ. These upper bounds are equal, up to a multiplicative constant, to the minimax risk. The contraction rate around the truth is sharp, but this may not be the case for the rate of contraction around the horseshoe estimator. Theorems 3.4 and 3.5 provide more insight into the spread of the posterior distribution for various values of τ and indicate that τ = (p_n/n)√(log(n/p_n)) is a good choice.
Theorem 3.1. Suppose Y ∼ N(θ₀, σ²I_n), with θ₀ ∈ ℓ_0[p_n]. Then the horseshoe estimator satisfies:

sup_{θ₀∈ℓ_0[p_n]} E_{θ₀} ‖T_τ(Y) − θ₀‖² ≲ p_n log(1/τ) + (n − p_n) τ √(log(1/τ)), (2)

for τ → 0, as n, p_n → ∞ and p_n = o(n). By the minimax risk result in (Donoho et al., 1992), we also have a lower bound:

sup_{θ₀∈ℓ_0[p_n]} E_{θ₀} ‖T_τ(Y) − θ₀‖² ≥ 2σ² p_n log(n/p_n) (1 + o(1)),

as n, p_n → ∞ and p_n = o(n). The choice τ = (p_n/n)^α, for α ≥ 1, leads to an upper bound (2) of order p_n log(n/p_n), with (as can be seen from the proof) a multiplicative constant of at most 4ασ². Thus, for this choice of τ, we have:

2σ² p_n log(n/p_n) (1 + o(1)) ≤ sup_{θ₀∈ℓ_0[p_n]} E_{θ₀} ‖T_τ(Y) − θ₀‖² ≤ 4ασ² p_n log(n/p_n) (1 + o(1)).

The horseshoe estimator therefore performs well as a point estimator, as it attains the minimax risk (possibly up to a multiplicative constant of at most 2 for α = 1). This may seem surprising, as the prior does not include a point mass at zero to account for the assumed sparsity in the underlying mean vector. Theorem 3.1 shows that the pole at zero of the horseshoe prior mimics the point mass well enough, while the heavy tails ensure that large observations are not shrunk too much.
An upper bound on the rate of contraction of the posterior can be obtained through an upper bound on the posterior variance. The posterior variance can be expressed as:

Var(θ_i | y_i) = σ² (I_{1/2}(y_i)/I_{−1/2}(y_i)) + y_i² (I_{3/2}(y_i)/I_{−1/2}(y_i)) − T_τ(y_i)²,

in the notation of Lemma A.1. Details on the computation can be found in Lemma A.4. Using a similar approach as when bounding the ℓ_2 risk, we can find an upper bound on the expected value of the posterior variance.
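The posterior variance can be evaluated by the same plug-in quadrature as the posterior mean. The sketch below assumes the I_k representation of Lemma A.1 (with z = 1 − κ_i), so that E[θ_i | y_i] = y_i E[z | y_i] and Var(θ_i | y_i) = σ² E[z | y_i] + y_i² Var(z | y_i); this decomposition is our reconstruction, not a verbatim formula from the paper.

```python
import numpy as np
from scipy.integrate import quad

def posterior_mean_var(y, tau, sigma=1.0):
    """Posterior mean and variance of theta_i under the horseshoe prior,
    via moments of z = 1 - kappa_i: ratios of the integrals I_k in the
    (assumed) Lemma A.1 notation, computed with the substitution z = u^2."""
    def I(k):
        f = lambda u: (2.0 * u**(2.0 * k + 1.0)
                       / (tau**2 + (1.0 - tau**2) * u**2)
                       * np.exp(y**2 * (u**2 - 1.0) / (2.0 * sigma**2)))
        return quad(f, 0.0, 1.0)[0]
    i_m = I(-0.5)
    ez, ez2 = I(0.5) / i_m, I(1.5) / i_m    # E[z | y], E[z^2 | y]
    return y * ez, sigma**2 * ez + y**2 * (ez2 - ez**2)
```

For y = 0 and small τ the posterior variance is far below the noise level σ², while for |y| well past the threshold it returns to roughly σ², in line with the bounds discussed in this section.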
Theorem 3.2. Suppose Y ∼ N(θ₀, σ²I_n). Then the variance of the posterior distribution corresponding to the horseshoe prior satisfies:

sup_{θ₀∈ℓ_0[p_n]} E_{θ₀} Σ_{i=1}^n Var(θ_i | Y_i) ≲ p_n log(1/τ) + (n − p_n) τ √(log(1/τ)), (3)

for τ → 0, as n, p_n → ∞ and p_n = o(n). Again, the choice τ = (p_n/n)^α, for α ≥ 1, leads to an upper bound (3) of the order of the minimax risk. This result indicates that the posterior contracts fast enough to be able to provide a measure of uncertainty of adequate size around the point estimate. Theorems 3.1 and 3.2 combined allow us to find an upper bound on the rate of contraction of the full posterior distribution, both around the underlying mean vector and around the horseshoe estimator.
Theorem 3.3. Suppose Y ∼ N(θ₀, σ²I_n), with θ₀ ∈ ℓ_0[p_n], and take τ = (p_n/n)^α with α ≥ 1. Then, for every M_n → ∞:

E_{θ₀} Π(θ : ‖θ − θ₀‖² > M_n p_n log(n/p_n) | Y) → 0, (4)

E_{θ₀} Π(θ : ‖θ − T_τ(Y)‖² > M_n p_n log(n/p_n) | Y) → 0, (5)

as n, p_n → ∞ and p_n = o(n).

Proof. Combine Markov's inequality with the results of Theorems 3.1 and 3.2 for (4), and only with the result of Theorem 3.2 for (5).
A remarkable aspect of the preceding theorems is that many choices of τ, such as τ = (p_n/n)^α for any α ≥ 1, lead to an upper bound of the order p_n log(n/p_n) on the worst case ℓ_2 risk and posterior contraction rate. The upper bound on the rate of contraction in (4) is sharp, as the posterior cannot contract faster than the minimax rate around the true mean vector (Ghosal, Ghosh and Van der Vaart, 2000). However, this is not necessarily the case for the upper bound in (5), and for τ = (p_n/n)^α with α > 1, the posterior spread may be of smaller order than the rate at which the horseshoe estimator approaches the underlying mean vector. Theorems 3.4 and 3.5 provide more insight into the effect of choosing different values of τ on the posterior spread and mean square error.
Theorem 3.4. Suppose Y ∼ N(θ₀, σ²I_n), with θ₀ ∈ ℓ_0[p_n]. Then the variance of the posterior distribution corresponding to the horseshoe prior satisfies:

Σ_{i=1}^n E_{θ₀} Var(θ_i | Y_i) ≳ n τ √(log(1/τ)), (6)

for τ → 0 and p_n = o(n), as n → ∞. This lower bound is sharp for vectors θ_{0,n} with p_n entries equal to a_n and the remaining entries equal to zero, if a_n is such that |a_n| ≲ 1/√(log(1/τ)).
(iii) α > 1. Bound (6) is not very informative, but Theorem 3.5 exhibits a sequence θ_{0,n} ∈ ℓ_0[p_n] for which there is a mismatch between the order of the mean square error and the posterior variance. Bounds (7) and (8) are of the orders p_n log(n/p_n) and τ^{(1−1/α)∧(1−γ)²} p_n log(n/p_n), respectively. Hence, up to logarithmic factors, the total posterior variance (8) is a factor τ^{(1−1/α)∧(1−γ)²} smaller than the square distance of the center of the posterior to the truth (7). For p_n ≤ n^c for some c > 0, this factor behaves as a power of n.
These observations suggest that τ = (p_n/n)√(log(n/p_n)) is a good choice, because then (2), (3), (6), (7) and (8) are all of the order p_n log(n/p_n), suggesting that the posterior contracts at the minimax rate around both the truth and the horseshoe estimator.
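For concreteness, the two candidate choices of the global parameter and the target rate can be tabulated; the (n, p_n) values below are arbitrary illustrations, not figures from the paper.

```python
import numpy as np

def tau_choices(n, p):
    """The two choices of tau discussed in the text and the target rate:
    tau_simple = p/n and tau_opt = (p/n) sqrt(log(n/p)) differ by the
    square root of a log factor; rate = p log(n/p) is the minimax order."""
    tau_simple = p / n
    tau_opt = (p / n) * np.sqrt(np.log(n / p))
    rate = p * np.log(n / p)
    return tau_simple, tau_opt, rate

for n, p in [(400, 20), (1000, 10), (10_000, 100)]:
    tau_simple, tau_opt, rate = tau_choices(n, p)
```

For instance, with n = 400 and p_n = 20 the two choices are 0.05 and about 0.087, both far below one, reflecting the assumed sparsity.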

Empirical Bayes estimation of τ
A natural follow-up question is how to choose τ in practice, when p_n is unknown. As discussed in section 2, the full Bayesian approach suggested by Carvalho, Polson and Scott (2010) performs well in simulations. The analysis of such a hierarchical prior would however require different tools than the ones we have used so far. An empirical Bayes estimate of τ is a natural alternative, and allows us in practice to use one of the representations in (1) for computations, instead of an MCMC-type algorithm.
By adapting the approach in paragraph 6.2 of (Johnstone and Silverman, 2004), we can find conditions under which the horseshoe estimator with an empirical Bayes estimate of τ will still attain the minimax ℓ_2 risk. Based on the considerations of Section 3, we proceed with the choices τ = (p_n/n)√(log(n/p_n)) and τ = p_n/n. The former is optimal in the sense that the posterior spread is of the order of the minimax risk, but the latter has the simple interpretation of being the proportion of nonzero means, and the difference between the two is only the square root of a log factor.
Theorem 4.1. Suppose we observe an n-dimensional vector Y ∼ N(θ₀, σ²I_n) and we use T_τ̂(Y) as our estimator of θ₀. Suppose τ̂ ∈ (0, 1) satisfies the following two conditions, for τ_n = p_n/n or τ_n = (p_n/n)√(log(n/p_n)):
1. τ̂ does not exceed a constant multiple c of τ_n, except on an event of vanishing probability;
2. there exists a function g : N × N → (0, 1) such that τ̂ ≥ g(n, p_n) with probability one and −log(g(n, p_n)) ≲ log(n/p_n).
Then:

sup_{θ₀∈ℓ_0[p_n]} E_{θ₀} ‖T_τ̂(Y) − θ₀‖² ≲ p_n log(n/p_n), (9)

as n, p_n → ∞ and p_n = o(n). If only the first condition can be verified for an estimator τ̂, then max{1/n, τ̂} will have an ℓ_2 risk of at most order p_n log n.

The first condition requires that τ̂ does not overestimate the fraction p_n/n of nonzero means (up to a log factor) too much or by a too large probability. If p_n ≥ 1, as we have assumed, then it is satisfied already by τ̂ = 1/n (and c = 1). According to the last assertion of the theorem, this 'universal threshold' yields the rate p_n log n (possibly up to a multiplicative constant). This is equal to the rate of the Lasso estimator with the usual choice of λ = 2√(2σ² log n) (Bickel, Ritov and Tsybakov, 2009). However, in the framework where p_n → ∞, the estimator τ̂ = 1/n will certainly underestimate the sparsity level. A more natural estimator of p_n/n is:

τ̂ = #{i : |y_i| ≥ √(c₁σ² log n), i = 1, . . ., n} / (c₂ n), (10)

where c₁ and c₂ are positive constants. By Lemma A.7, this estimator satisfies the first condition for τ_n = p_n/n, and max{1/n, τ̂} will therefore lead to a rate of at most order p_n log n under these conditions. Its behaviour will be explored further in section 5.
The rate can be improved to p_n log(n/p_n) if the second condition is met as well, which ensures that the sparsity level is not underestimated too much or by a too large probability. As we are not aware of any estimators meeting this condition for all θ₀, this condition is currently mostly of theoretical interest. If the true mean vector is very sparse, in the sense that there are relatively few nonzero means or the nonzero means are close to zero, there is not much to be gained in terms of rates by meeting this condition. The extra occurrence of p_n relative to the rate p_n log n is of interest only if p_n is relatively large. For instance, if p_n ≍ n^α for α ∈ (0, 1), then p_n log(n/p_n) = (1 − α)p_n log n, which suggests a decrease of the proportionality constant in (9), particularly if α is close to one. Furthermore, when p_n is large, the constant in (9) may be sensitive to the fine properties of τ̂, as it depends on g(n, p_n) (as can be seen in the proof). If τ̂ seriously underestimates the sparsity level, the corresponding value of g(n, p_n) from the second condition may be so small that the upper bound on the multiplicative constant before (9) becomes very large. Hence in this case, τ̂ is required to be close to the proportion p_n/n (up to a log factor) with large probability in order to get an optimal rate. Datta and Ghosh (2013) warned against the use of an empirical Bayes estimate of τ for the horseshoe prior, because the estimate might collapse to zero. Their references for this statement, Scott and Berger (2010) and Bogdan, Ghosh and Tokdar (2008), indicate that they are thinking of a marginal maximum likelihood estimate of τ. However, an empirical Bayes estimate of τ does not need to be based on this principle. Furthermore, an estimator that satisfies the second condition from Theorem 4.1, or that is truncated from below by 1/n, would not be susceptible to this potential problem.
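A counting estimator of this type can be sketched as follows. Since the display (10) is not reproduced verbatim above, the threshold form √(c₁σ² log n), the divisor c₂n, and the truncation at 1/n below are our assumptions, chosen to match the surrounding description: with c₁ = 2 and c₂ = 1 it counts the observations above the universal threshold, and the floor at 1/n guards against the collapse to zero just discussed.

```python
import numpy as np

def tau_empirical_bayes(y, sigma=1.0, c1=2.0, c2=1.0):
    """Sketch of a counting-type empirical Bayes estimate of tau: the
    fraction of observations exceeding sqrt(c1 sigma^2 log n), divided by c2
    and truncated from below at 1/n (the truncation is our assumption)."""
    n = len(y)
    count = np.sum(np.abs(y) >= np.sqrt(c1 * sigma**2 * np.log(n)))
    return max(count / (c2 * n), 1.0 / n)

# illustration: 20 of n = 400 means equal to 7, well past the threshold
rng = np.random.default_rng(1)
theta = np.zeros(400)
theta[:20] = 7.0
tau_hat = tau_empirical_bayes(theta + rng.standard_normal(400))
```

Here τ̂ lands near the true sparsity level p_n/n = 0.05, while on all-zero data the estimate equals the floor 1/n, illustrating the role of the truncation.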

Simulation study
A simulation study provides more insight into the behaviour of the horseshoe estimator, both when using an empirical Bayes procedure with estimator (10) and when using the fully Bayesian procedure proposed by Carvalho, Polson and Scott (2010) with a half-Cauchy prior on τ. For each data point, 100 replicates of an n-dimensional vector sampled from a N(θ₀, I_n) distribution were created, where θ₀ had either 20, 40 or 200 (5%, 10% or 50%) entries equal to an integer A ranging from 1 to 10, and all the other entries equal to zero. The full Bayesian version was implemented using the code provided in (Scott, 2010), and the coordinatewise posterior mean was used as the estimator of θ₀. For the empirical Bayes procedure, the estimator (10) was used with c₁ = 2 and c₂ = 1. Performance was measured by squared error loss, which was averaged across replicates to create Figure 2. In all settings, both estimators experience a peak in the ℓ_2 loss for values of A close to the 'universal threshold' of √(2 log 400) ≈ 3.5. This is not unexpected, as in the terminology of Johnstone and Silverman (2004), the horseshoe estimator is a shrinkage rule, and while it is not a thresholding rule in their sense, it does have the bounded shrinkage property which leads to thresholding-like behaviour. The bounded shrinkage property can be derived from Lemma A.3, which yields the following inequality as τ approaches zero:

|T_τ(y) − y| ≲ √(2σ² log(1/τ)).

With τ = 1/n, this leads to the 'universal threshold' of √(2σ² log n), or with τ = (p_n/n)^α, a 'threshold' at √(2ασ² log(n/p_n)). Based on this property and the proofs of the main results, we can divide the underlying parameters into three cases: (i) those that are exactly or close to zero, where the observations are shrunk close to zero; (ii) those that are larger than the threshold, where the horseshoe estimator essentially behaves like the identity; (iii) those that are close to the 'threshold', where the horseshoe estimator is most likely to shrink the observations too
much. The horseshoe estimator performs well in cases (i) and (ii), due to its pole at zero and its heavy tails respectively. The hardest parameters to recover from the noise are those that are close to the threshold, and these are the ones that affect the estimation risk the most. This phenomenon explains the peaks in the graphs of Figure 2 around A = 3.5.
The full Bayes implementation with a half-Cauchy prior on τ attains a lower ℓ_2 loss around the universal threshold than the empirical Bayes procedure. This is because estimator (10) counts the number of observations that are above the universal threshold. When all the nonzero means are close to this threshold, τ̂ may 'miss' some of them, thereby underestimating the sparsity level p_n/n and thus leading to overshrinkage.
For values of A well past the universal threshold, the empirical Bayes estimator does better than the full Bayes version. For such large values of A, the estimator (10) will be equal to the true sparsity level with large probability, and hence its good performance is not unexpected. However, an interesting question is why the full Bayes estimator does not do as well as the empirical Bayes estimator, especially because the nonzero means are so far removed from zero that the problem is 'easy'. This could be due to the choice of a half-Cauchy prior for τ: it places no restriction on the possible values of τ and has such heavy tails that values far exceeding the sparsity level p_n/n are possible. This would lead to undershrinkage of the observations corresponding to a zero mean, which would be reflected in the ℓ_2 loss. Figure 2(d) shows a histogram of all Gibbs samples of τ in the setting where 50% of the means are set equal to 10. The range of these values is (3.1, 7.3), which is very far away from p_n/n = 1/2. This indicates that a full Bayesian version of the horseshoe prior could benefit from a different choice of prior on τ than a half-Cauchy one, for example one that is restricted to [0, 1].
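The qualitative shape of Figure 2 — a loss peak near the universal threshold and easy recovery for large A — can be reproduced in miniature. This is not the authors' simulation code: the quadrature estimator, the reduced problem size, and the plug-in choice τ = (p_n/n)√(log(n/p_n)) are our assumptions.

```python
import numpy as np
from scipy.integrate import quad

def horseshoe_estimator(y, tau, sigma=1.0):
    # T_tau(y) = y I_{1/2}(y)/I_{-1/2}(y); representation assumed (Lemma A.1),
    # with the substitution z = u^2 to remove the endpoint singularity
    def I(k):
        f = lambda u: (2.0 * u**(2.0 * k + 1.0)
                       / (tau**2 + (1.0 - tau**2) * u**2)
                       * np.exp(y**2 * (u**2 - 1.0) / (2.0 * sigma**2)))
        return quad(f, 0.0, 1.0)[0]
    return y * I(0.5) / I(-0.5)

def mean_sq_loss(A, n=200, p=10, reps=2, seed=0):
    """Average squared error for p of n means equal to A (rest zero),
    using the plug-in tau = (p/n) sqrt(log(n/p)); a reduced-size sketch."""
    tau = (p / n) * np.sqrt(np.log(n / p))
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(reps):
        theta = np.zeros(n)
        theta[:p] = A
        y = theta + rng.standard_normal(n)
        t = np.array([horseshoe_estimator(v, tau) for v in y])
        losses.append(np.sum((t - theta) ** 2))
    return float(np.mean(losses))
```

On this reduced problem, the loss near A ≈ 3.5 (close to √(2 log 200) ≈ 3.3) clearly exceeds the loss at A = 10, mirroring the peak in Figure 2.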

Concluding remarks
The choice of the global shrinkage parameter τ is critical towards ensuring the right amount of shrinkage of the observations to recover the underlying mean vector. The value τ = (p_n/n)√(log(n/p_n)) was found to be optimal. Theorem 4.1 indicates that quite a wide range of estimators for τ will work well, especially in cases where the underlying mean vector is sparse. Of course, it should not come as a surprise that an estimator designed to recover sparse vectors will work especially well if the truth is indeed sparse. An interesting extension to this work would be to investigate whether the posterior concentration properties of the horseshoe prior still remain when a hyperprior is placed on τ. The result that τ = p_n/n (up to a log factor) yields optimal rates, together with the simulation results, suggests that in a fully Bayesian approach, a prior on τ which is restricted to [0, 1] may perform better than the suggested half-Cauchy prior.
The simulation results also indicate that mean vectors with the nonzero means close to the universal threshold are the hardest to recover. In future simulations involving shrinkage rules, it would therefore be interesting to study the challenging case where all the nonzero parameters are at this threshold. The performance of the empirical Bayes estimator (10) leaves something to be desired around the threshold. In additional numerical experiments (not shown), we tried two other estimators of τ. The first was the 'oracle estimator' τ = p_n/n. For values of the nonzero means well past the 'threshold', the behaviour of this estimator was very similar to that of (10). However, before the threshold, the squared error loss of the empirical procedure with the oracle estimator was between that of the full Bayes estimator and empirical Bayes with estimator (10). The second estimator was the mean of the samples of τ from the full Bayes estimator. The resulting squared error loss was remarkably close to that of the full Bayes estimator, for all values of the nonzero means. Neither of these two estimators is of much practical use. However, their range of behaviours suggests room for improvement over the estimator (10), and it would be worthwhile to study more refined estimators for τ.
An interesting question is what aspects of the horseshoe prior are truly essential for optimal posterior contraction. Our proofs do not elucidate whether the pole at zero of the horseshoe prior is required, or whether a prior with heavy tails and, in a sense, 'sufficient' mass at zero would work as well. The failure of the Lasso to concentrate around the true mean vector at the minimax rate does indicate that heavy tails by themselves may not be sufficient, and that adding mass at zero solves this problem (Castillo, Schmidt-Hieber and Van der Vaart, 2014; Castillo and Van der Vaart, 2012). It is possible that the pole at zero is inessential, in particular if the global tuning parameter is chosen carefully, for instance by empirical Bayes. If the tuning parameter is chosen by a full Bayes method, the peak may be more essential, depending on its prior.
The horseshoe estimator has the property that its computational complexity depends on the sparsity level rather than the number of observations. Although there is no point mass at zero to induce sparsity, it still yields good reconstruction in ℓ_2, and a posterior distribution that contracts at an informative rate. None of the estimates will however be exactly zero. Variable selection can be performed by applying some sort of thresholding rule, such as the one suggested in (Carvalho, Polson and Scott, 2010) and analyzed by Datta and Ghosh (2013). The performance of this thresholding rule in simulations in the two works cited has been encouraging.

Appendix

The appendix contains the proofs of the main results and supporting lemmas, including Theorem 4.1 and Lemma A.7, which both concern the empirical Bayes procedure discussed in section 4.
Lemma A.2. If τ² < 1, the posterior mean of the horseshoe prior can be bounded above by:

T_τ(y) ≤ f(τ) y e^{y²/(2σ²)}, with f(τ) ≤ (2/3)τ,

and by a second upper bound, valid for any a > 1 and τ < 1/a, derived from the bounds of Lemma A.1.
Proof. We bound the integrals in the numerator and denominator of expression (1). For the first upper bound, we will use the fact that for 0 ≤ z ≤ 1, e^{y²z/(2σ²)} is bounded below by 1 and above by e^{y²/(2σ²)}. The posterior mean can therefore be bounded by T_τ(y) ≤ f(τ) y e^{y²/(2σ²)}, where f(τ) is the resulting ratio of the two integrals over z. By Shafer's inequality for the arctangent (Shafer, 1966), f(τ) ≤ (2/3)τ, which completes the proof for the first upper bound.
For the second inequality, we note that, in the notation of Lemma A.1, T_τ(y) = y I_{1/2}(y)/I_{−1/2}(y). The bounds in Lemma A.1 yield the stated inequality.
Lemma A.3. For τ² < 1, the absolute value of the difference between the horseshoe estimator and an observation y can be bounded by a function h(y, τ) such that h(y, τ) → 0 as τ → 0, for any y with |y| ≥ √(cσ² log(1/τ)) and any c > 2.

Proof. We assume y > 0 without loss of generality, and apply the change of variables x = 1 − z, which yields a bound of the form h(y, τ).
For any fixed τ , this bound tends to zero as y tends to infinity.If τ → 0, the term containing h 3 (τ ) could potentially diverge.For s = 2 3 and y = cσ 2 log(1/τ ), where c is a positive constant, this term displays the following limiting behaviour as τ → 0: 3 − 1 > 0 and infinity otherwise.The condition c > 3 is related to the choice of s = 2 3 and can be improved to any constant strictly greater than 2 by choosing s appropriately close to one.Hence, we find that the absolute value of the difference between the posterior mean and an observation can be bounded by a function h(y, τ ) with the desired property.

Parameters equal to zero
We split up the term for the zero means into two parts, according to whether |Y_i| ≤ ζ_τ or |Y_i| > ζ_τ, where ζ_τ = √(2σ² log(1/τ)). For the first term, we have, by the first bound in Lemma A.2, an upper bound in which the last inequality holds for ζ_τ > σ². If we apply this inequality and combine this upper bound with the upper bound on the first term, we find, for τ < e^{−σ²/2}, the stated bound for the parameters equal to zero. Hence, for τ < e^{−σ²/2}:

Conclusion
By (16) and (19), we find for τ < e^{−σ²/2} the upper bound (2) of Theorem 3.1.

Lemma A.4. The posterior variance when using the horseshoe prior can be expressed as:

Var(θ | y) = σ² (T_τ(y)/y) + y² (I_{3/2}(y)/I_{−1/2}(y)) − T_τ(y)², (20)

in the notation of Lemma A.1, and bounded from above by:

Var(θ | y) ≤ σ² (T_τ(y)/y) + y T_τ(y) − T_τ(y)² ≤ σ² (T_τ(y)/y) + y T_τ(y).

Proof. As proven in Pericchi and Smith (1992):

Var(θ | y) = σ² + σ⁴ (d²/dy²) log m(y),

where m(y) is the density of the marginal distribution of y. Equality (20) can be found by combining the integral expressions of the posterior moments of z with the equality T_τ(y) = y + σ² m′(y)/m(y). The first upper bound is implied by the property |T_τ(y)| < |y| and the fact that (1 − z)² ≤ 1 for z ∈ [0, 1]. The second upper bound can be demonstrated by noting that (1 − z)² ≤ 1 − z for z ∈ [0, 1].

Proof of Theorem 3.2. As in the proof of Theorem 3.1, we assume that θ_i ≠ 0 for i = 1, . . ., p̃_n and θ_i = 0 for i = p̃_n + 1, . . ., n, where p̃_n ≤ p_n by assumption. We consider the posterior variances for the zero and nonzero means separately.

Zero means
By the bound Var(θ | y) ≤ σ² + y², we find for c ≥ 1: For |y| < cζ_τ, we consider the upper bound Var(θ | y) ≤ (σ²/y) T_τ(y) + y T_τ(y) − T_τ(y)² from Lemma A.4. From this bound, we get Var(θ | y) ≤ (σ²/y) T_τ(y) + y T_τ(y). Hence: We bound the first integral from (22) by applying the first bound on T_τ(y) from Lemma A.2: because f(τ) ≤ (2/3)τ. For the second term in (22), we first note that the second bound from Lemma A.2 can be relaxed to: for any a > 1 and τ < 1/a. By plugging this bound into the second integral of (22), we obtain three terms, which we name I_1, I_2 and I_3 respectively. We then find, bounding from above by the integral over ℝ instead of over [−cζ_τ, cζ_τ] for I_1 and I_2: And thus:
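The crude bound Var(θ | y) ≤ σ² + y² can be sanity-checked numerically. Under the shrinkage-weight representation θ | y, z ∼ N(zy, σ²z) (a standard fact about the horseshoe, not quoted from the text), Var(θ | y) = σ² E[z | y] + y² Var(z | y); and since z ∈ [0, 1] gives E[z² | y] ≤ E[z | y], one also gets Var(θ | y) ≤ (σ²/y)T_τ(y) + y T_τ(y) − T_τ(y)². The sketch below evaluates both sides by quadrature; σ = 1 and τ = 0.1 are illustrative choices.

```python
import numpy as np
from scipy.integrate import quad

SIGMA = 1.0

def I_k(k, y, tau):
    # exp(-y^2/(2 sigma^2)) I_k(y), via the substitution z = u^2
    f = lambda u: (2.0 * u**(2.0 * k + 1.0)
                   * np.exp(y**2 * (u**2 - 1.0) / (2.0 * SIGMA**2))
                   / (tau**2 + (1.0 - tau**2) * u**2))
    val, _ = quad(f, 0.0, 1.0, limit=200)
    return val

def post_mean_var(y, tau):
    m1 = I_k(0.5, y, tau) / I_k(-0.5, y, tau)    # E[z | y]
    m2 = I_k(1.5, y, tau) / I_k(-0.5, y, tau)    # E[z^2 | y]
    T = y * m1
    # theta | y, z ~ N(z y, sigma^2 z)  =>  Var = sigma^2 E[z] + y^2 Var(z)
    return T, SIGMA**2 * m1 + y**2 * (m2 - m1**2)

ok = True
for y in (0.5, 1.0, 2.0, 4.0):
    T, v = post_mean_var(y, tau=0.1)
    ok &= v <= SIGMA**2 + y**2                           # crude bound
    ok &= v <= SIGMA**2 * T / y + y * T - T**2 + 1e-9    # since E[z^2] <= E[z]
```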

Conclusion
By (21) and (24):

Proof of Theorem 3.4

Proof. The final term in (20) is equal to: As T_τ(y)/y is non-negative, we can bound the posterior variance from below by the final two terms in (20). By the above equality, this yields the following lower bound: where I_k is as in Lemma A.1. We now use the bounds from Lemma A.1 with a = 2 and take ξ equal to c log(1/τ) for some nonnegative constant c. Then e^ξ = 1/τ^c and e^{ξ/2} = 1/τ^{c/2}. Taking for each bound on I_k, k ∈ {3/2, 1/2, −1/2}, the term that diverges fastest as τ approaches zero, we find that the lower bound is asymptotically of the order: For c ≤ 1, this reduces to: The second term is negligible compared to the first. Hence, we will use the term where ζ_τ = √(2σ² log(1/τ)). To find the lower bound on Σ_{i=1}^n E_{θ_i} Var(θ_i | Y_i), we only need to consider the parameters equal to zero: By substitution, we find: where in the last step, we used τ^{τx} ≥ τ^τ ≥ e^{−1/e} for x ∈ [0, 1], τ ∈ (0, 1]. By plugging this into (25), we find that as τ → 0: finishing the proof of the first statement of the theorem.
We now consider θ such that θ_i = a_n for i = 1, . . ., p_n, and θ_i = 0 for i = p_n + 1, . . ., n, and assume without loss of generality that a_n > 0. We wish to find conditions on a_n such that the lower bound (27) is sharp (up to a constant factor). Denote ζ_τ = √(2σ² log(1/τ)). The first integral from (28) can be split into two parts by splitting up the factor σ² + y², the first of which can be bounded, by substituting x = (y − a_n)/σ and applying Mills' ratio: The second of these integrals is, by y² = (y − a_n)² − a_n² + 2a_n y, equal to (30). The second integral of (30) can be bounded from below by zero, and the third from above by a_n E_{θ_i} Y_i = a_n². Again substituting x = (y − a_n)/σ yields the following upper bound on (30): σ² ∫_{(ζ_τ − a_n)/σ}^∞ x² φ(x) dx + a_n². Now using the equality x² φ(x) = φ(x) − (d/dx)[x φ(x)] and again Mills' ratio, and combining with (29), we find the following upper bound on the first integral from (28): By substituting x = −y in the second integral from (28) and then applying the same inequalities to it as to the first integral, the following bound is obtained: This bound does not include a term a_n², because in the step equivalent to (30), the identity y² = (y + a_n)² − a_n² − 2y a_n is used, and thus only the integral: Applying the inequality (1/y) T_τ(y) ≤ (2/3) τ e^{y²/(2σ²)}: If a_n ≲ 1/ζ_τ, we have a_n ζ_τ = O(1) and thus this term will be of the order τ ζ_τ. For the second integral from (34), we use bound (23). This leads to three integrals to be bounded, I_1, I_2 and I_3.

Proof. For k = 0, the statement is immediate. By integration by parts, the integral is seen to be equal to y^k e^y − e − ∫_1^y k u^{k−1} e^u du. For k ≠ 0, the latter integral is bounded above by: This is further bounded above by a multiple of (1 ∨ y^{k−1}) e^{y/2} + y^{k−1} e^y.
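The two normal-tail facts invoked repeatedly above — Mills' ratio, Φ̄(t) ≤ φ(t)/t for t > 0, and the identity x²φ(x) = φ(x) − (d/dx)[xφ(x)], which integrates to ∫_t^∞ x²φ(x) dx = Φ̄(t) + tφ(t) — admit a quick numerical confirmation (scipy's `norm.sf` is the upper tail Φ̄; the grid of t values is arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Mills' ratio: Phi_bar(t) <= phi(t)/t for t > 0
ts = np.array([0.5, 1.0, 2.45, 4.0])
mills_ok = bool(np.all(norm.sf(ts) <= norm.pdf(ts) / ts))

# x^2 phi(x) = phi(x) - d/dx [x phi(x)] integrates to
# int_t^inf x^2 phi(x) dx = Phi_bar(t) + t phi(t); check by quadrature:
t = 2.45
tail2, _ = quad(lambda x: x**2 * norm.pdf(x), t, np.inf)
identity_gap = abs(tail2 - (norm.sf(t) + t * norm.pdf(t)))
```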
Lemma A.6. Let I_k be as in Lemma A.1. There exist functions R_k with sup_{ζ_τ/4 ≤ y ≤ 4ζ_τ} |R_k(y)| → 0 for k > 0 and k = −1/2, such that:

Proof. We split the integral in the definition of I_k over the intervals [0, τ²] and [τ², 1]. The first interval contributes, uniformly in y as τ → 0, by the substitution u = z/τ²: The integral tends to ∫_0^1 u^k/(1 + u) du, by the dominated convergence theorem, for any k > −1. The second interval contributes, with the substitution u = (y²/(2σ²))z: This combines with the integral (35).
By the substitution u = ζ_τ y − 2σ² log ζ_τ, the remaining integral is equal to the following, by the dominated convergence theorem: This yields the approximation in (36), with the constant c_1. For the first integral in (40), we use the first bound from Lemma A.2, and obtain a bound on its absolute value equal to: where the last inequality follows by integration by parts. This is of much smaller order than the second integral from (40). The third integral of (40) we bound by Mills' ratio. This is also of much smaller order than the second integral from (40), thus concluding the proof of (36).
Proof of (39)

By expanding the term (1 − z)² in the numerator of the final term of (20), the posterior variance can be seen to be equal to: Because I_{1/2}(y)/I_{−1/2}(y) can be interpreted as the mean of the density proportional to z ↦ z^{−1/2} e^{y²z/(2σ²)}/(τ² + (1 − τ²)z), and I_{3/2}(y)/I_{−1/2}(y) as the second moment, it follows that the term in square brackets in (43) can be written as in (44). The first term in (44) is as (40), except without the factor (ζ_τ + y). Following the same steps as in the proof of (36), we see that it is smaller than a multiple of ζ_τ^{−1} times the bound on (40), so it is of the order ζ_τ^{2γ−3} τ^{(1−γ)²}. The first and third integrals of the second term of (44) are also negligible. For the first, we use that the expression in square brackets is non-negative and bounded above by I_{3/2}(y)/I_{−1/2}(y), which in turn is bounded above by I_{1/2}(y)/I_{−1/2}(y). We bound as in (42), with the difference that the leading factor is (ζ_τ + y)² instead of (ζ_τ + y). This leads to the order ζ_τ τ^{(1−γ)² + γ}, much smaller than the claimed rate. For the third integral, we can bound the term in square brackets by 1 and use Mills' ratio to see that it is of the order ζ_τ τ^{(4−γ)²}.
We are left with the middle integral of the second term of (44). On the domain of this integral, by Lemma A.6:

Proof. Suppose that Y ∼ N(θ, σ² I_n), θ ∈ ℓ_0[p_n]. We adapt the approach of paragraph 6.2 in (Johnstone and Silverman, 2004). We first derive the following inequality for events A such that τ̂ > τ holds with probability one on A: where (17) was used in the second line, and Z follows a standard normal distribution. If A is such that τ̂ > τ holds with probability one on A, we can use the inequality ζ_τ̂ < ζ_τ for τ̂ > τ to find: We now consider the nonzero and zero parameters separately. For both cases, we split up the expected ℓ_2 loss as follows: and then bound each of the terms on the right-hand side. For the nonzero means, we take c = 1, while for the zero means, we consider c ≥ 1. Note that for ζ_τ to be well-defined, we need τ ≤ 1, and consequently, when we consider the event τ̂ > cτ, we must have cτ < 1.

Zero means
We first establish an inequality for E_θ[Z² 1_A], where A is an event and Z a standard normal random variable. By Young's inequality, we have for any positive x and y: (51). Now suppose τ̂ ≤ cτ for some c ≥ 1 such that cτ < 1. First note that |T_τ(y)| increases monotonically in τ, as is clear from: Because sign(T_τ̂(y_i)) = sign(T_{cτ}(y_i)) and 0 ≤ |T_τ̂(y_i)| ≤ |T_{cτ}(y_i)|, we have: Hence: And thus, by (18), we have for θ_i = 0: (52)
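The monotonicity claim — |T_τ(y)| increases in τ, with the sign of the observation preserved — can be checked numerically from the quadrature representation of the posterior mean (σ = 1, y = 1.5 and the grid of τ values are illustrative choices of this sketch):

```python
import numpy as np
from scipy.integrate import quad

SIGMA = 1.0

def post_mean(y, tau):
    # T_tau(y) = y E[z | y], with z-density proportional to
    # z^(-1/2) exp(y^2 z/(2 sigma^2)) / (tau^2 + (1-tau^2) z); z = u^2 below
    w = lambda u: np.exp(y**2 * (u**2 - 1.0) / (2.0 * SIGMA**2)) / (tau**2 + (1.0 - tau**2) * u**2)
    num, _ = quad(lambda u: 2.0 * u**2 * w(u), 0.0, 1.0, limit=200)
    den, _ = quad(lambda u: 2.0 * w(u), 0.0, 1.0, limit=200)
    return y * num / den

y = 1.5
taus = [0.01, 0.05, 0.2, 0.8]
vals = [post_mean(y, t) for t in taus]
monotone = all(a < b for a, b in zip(vals, vals[1:]))   # |T| grows with tau
sign_ok = (all(v > 0 for v in vals)
           and all(abs(post_mean(-y, t) + post_mean(y, t)) < 1e-9 for t in taus))
```

Smaller τ moves the whole curve toward zero, which is exactly the behaviour depicted in Figure 1.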

Figure 1. The effect of decreasing τ on the priors on κ (left) and θ (middle) and the posterior mean T_τ(y) (right). The solid line corresponds to τ = 1, the dashed line to τ = 0.05. Decreasing τ results in a higher prior probability of shrinking the observations towards zero.

Figure 2. Average squared error loss over 100 replicates with underlying mean vectors of length n = 400, in case 5% (Figure (a)), 10% (Figure (b)) or 50% (Figure (c)) of the means are equal to a nonzero value A. The solid line corresponds to empirical Bayes with (10), c_1 = 2, c_2 = 1, the dashed line to full Bayes with a half-Cauchy prior on τ. Figure (d) displays a histogram of all Gibbs samples of τ (after the burn-in) of all replicates in the setting τ ∼ C⁺(0, 1), A = 10, p_n = 200.
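A stripped-down version of the simulation behind Figure 2 can be run in a few lines. The sketch below replaces the empirical-Bayes and full-Bayes choices of τ by a fixed oracle-style value τ = p_n/n, and uses n = 100 rather than 400 to keep the quadrature cheap; these substitutions, the random seed, and all helper names are assumptions of the sketch, not the paper's procedure.

```python
import numpy as np
from scipy.integrate import quad

SIGMA = 1.0

def post_mean(y, tau):
    # horseshoe posterior mean via the shrinkage-weight representation
    w = lambda u: np.exp(y**2 * (u**2 - 1.0) / (2.0 * SIGMA**2)) / (tau**2 + (1.0 - tau**2) * u**2)
    num, _ = quad(lambda u: 2.0 * u**2 * w(u), 0.0, 1.0, limit=200)
    den, _ = quad(lambda u: 2.0 * w(u), 0.0, 1.0, limit=200)
    return y * num / den

rng = np.random.default_rng(0)
n, p_n, A, reps = 100, 10, 7.0, 5      # 10% of the means equal to A = 7
tau = p_n / n                          # fixed oracle-style tuning (an assumption)
losses = []
for _ in range(reps):
    theta = np.zeros(n)
    theta[:p_n] = A                    # nearly black mean vector
    y = theta + rng.standard_normal(n) * SIGMA
    est = np.array([post_mean(yi, tau) for yi in y])
    losses.append(float(np.sum((est - theta)**2)))
avg_loss = float(np.mean(losses))
```

The resulting average loss stays well below the loss of the unshrunk estimator (about n), in line with the near-minimax behaviour discussed in the paper.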