Shrinkage with shrunken shoulders: Gibbs sampling shrinkage model posteriors with guaranteed convergence rates

Use of continuous shrinkage priors -- with a "spike" near zero and heavy tails towards infinity -- is an increasingly popular approach to induce sparsity in parameter estimates. When the parameters are only weakly identified by the likelihood, however, the posterior may end up with tails as heavy as the prior, jeopardizing the robustness of inference. A natural solution is to "shrink the shoulders" of a shrinkage prior by lightening up its tails beyond a reasonable parameter range, yielding a regularized version of the prior. We develop a regularization approach which, unlike previous proposals, preserves computationally attractive structures of original shrinkage priors. We study theoretical properties of the Gibbs sampler on resulting posterior distributions, with emphasis on convergence rates of the Pólya-Gamma Gibbs sampler for sparse logistic regression. Our analysis shows that the proposed regularization leads to geometric ergodicity under a broad range of global-local shrinkage priors. Essentially, the only requirement is for the prior $\pi_{\rm local}$ on the local scale $\lambda$ to satisfy $\pi_{\rm local}(0) < \infty$. If $\pi_{\rm local}(\cdot)$ further satisfies $\lim_{\lambda \to 0} \pi_{\rm local}(\lambda) / \lambda^a < \infty$ for $a > 0$, as in the case of Bayesian bridge priors, we show the sampler to be uniformly ergodic.


Introduction
Bayesian modelers are increasingly adopting continuous shrinkage priors to control the effective number of parameters and model complexity in a data-driven manner. These priors are designed to shrink most of the parameters towards zero while allowing the likelihood to pull a small fraction of them away from zero. To achieve such effects, a shrinkage prior has a density with a "spike" near zero and heavy tails towards infinity, encoding the information that parameter values are likely close to zero but otherwise could be anywhere. Originally developed for the purpose of sparse regression (Carvalho et al., 2009), shrinkage priors have found applications in trend filtering of time series data (Kowal et al., 2019), (dynamic) factor models (Kastner, 2019), graphical models (Li et al., 2019), and compression of deep neural networks (Louizos et al., 2017), among others.
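Throughout, we consider shrinkage priors with the global-local structure

$$\beta_j \mid \tau, \lambda_j \sim N(0, \tau^2 \lambda_j^2), \qquad \lambda_j \overset{\text{iid}}{\sim} \pi_{\rm loc}(\cdot), \qquad \tau \sim \pi_{\rm glo}(\cdot), \tag{1.1}$$

in which the global scale $\tau$ adapts to the overall sparsity level while the heavy-tailed local scales $\lambda_j$ allow individual signals to escape the shrinkage.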
Of particular interest in this paper is an application of Bayesian shrinkage to a logistic regression model $y_i \mid x_i, \beta \sim \text{Bernoulli}(\text{logit}^{-1}(x_i \beta))$ and the computational properties of the corresponding posterior inference via Gibbs sampling. Because $\beta$ may be only weakly identifiable, use of a shrinkage prior on $\beta$ here warrants a proper modification of the prior's tail in order to ensure reasonable computational and statistical behavior. Under our tail regularization strategy, we show that the Gibbs sampler achieves geometric ergodicity under a broad range of shrinkage priors. Notably, our proof technique unifies the analyses of the Gibbs samplers under various shrinkage priors, providing an easily verifiable condition for geometric and uniform ergodicity.
For a simple purpose such as estimating the unknown means of independent Gaussian observations, a broad class of shrinkage priors achieves theoretically optimal performance (van der Pas et al., 2016; Ghosh and Chakrabarti, 2017). The lack of prior information in the tail of the distribution is problematic, however, in more complex models where parameters are only weakly identified. In such models, the posterior may have a tail as heavy as the prior, resulting in unreliable parameter estimates (Ghosh et al., 2018).
To address the above shortcoming of shrinkage priors, we build on the work of Piironen and Vehtari (2017) and propose a computationally convenient way to regularize shrinkage priors. The basic idea is to modify the prior so that the marginal distribution of $|\beta_j|$ has light tails beyond a reasonable range. Our formulation has computational advantages over that of Piironen and Vehtari (2017) due to a subtle yet important difference. By preserving the global-local structure (1.1), our regularized shrinkage priors can benefit from partial marginalization approaches that substantially improve mixing of Gibbs samplers (Polson et al., 2014; Johndrow et al., 2020; Nishimura and Suchard, 2022, Appendix E). In addition, our regularization leaves the posterior conditionals of the $\lambda_j$'s unchanged, allowing their conditional updates via existing specialized samplers (Griffin and Brown, 2010; Polson et al., 2014; Nishimura and Suchard, 2022, Appendix F).
Our regularized shrinkage priors allow for posterior inference via Gibbs samplers whose convergence rates often are provably fast. As an illustrative example, we consider Bayesian sparse logistic regression models, whose need for regularization motivated the work of Piironen and Vehtari (2017). Gibbs sampling via the Pólya-Gamma data augmentation of Polson et al. (2013) is a state-of-the-art approach to posterior computation under the logistic model. When combined with advanced numerical linear algebra techniques, this Gibbs sampler is highly scalable to large data sets, but its theoretical convergence rate has not been investigated. Assuming that the prior density $\pi_{\rm loc}(\lambda)$ is continuous and bounded except possibly at $\lambda = 0$, we establish that the Gibbs sampler is geometrically ergodic whenever $\pi_{\rm loc}(0) < \infty$. Stronger uniform convergence is achieved when $\int \lambda^{-1} \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda < \infty$. The integrability condition holds in particular when $\pi_{\rm loc}(\lambda) = O(\lambda^a)$ for $a > 0$ as $\lambda \to 0$, which is the case for normal-gamma priors with shape parameter larger than 1/2 (Griffin and Brown, 2010) and for Bayesian bridge priors (Polson et al., 2014; Nishimura and Suchard, 2022, Appendix E).
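As a quick sanity check of this integrability condition, note that if $\pi_{\rm loc}(\lambda) \le C \lambda^a$ for $\lambda \le 1$ and some $a, C > 0$, then

$$\int_0^\infty \lambda^{-1} \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda \le C \int_0^1 \lambda^{a-1} \, \mathrm{d}\lambda + \int_1^\infty \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda \le \frac{C}{a} + 1 < \infty,$$

since $\lambda^{-1} \le 1$ on $[1, \infty)$ and $\pi_{\rm loc}$ is a probability density.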
Previous studies of the convergence rates under shrinkage models have focused exclusively on linear regression with specific parametric families of shrinkage priors (Pal and Khare, 2014; Johndrow et al., 2020). In contrast, our analysis requires no parametric assumptions on the shrinkage prior, while at the same time extending the convergence results to the logistic model and, in Appendix A, to the probit model.
To summarize, this work provides two major contributions to the Bayesian shrinkage literature. First, we propose an effective and Gibbs-friendly approach to suitably modify shrinkage priors for use in weakly-identifiable models (Section 2). Second, we develop theoretical tools to study the behavior of shrinkage model Gibbs samplers near the spike $\beta_j = 0$ without any parametric assumption on $\pi_{\rm loc}(\cdot)$, thereby unifying the convergence analyses of the logistic regression Gibbs samplers under a range of shrinkage priors (Section 3). We conclude the article in Section 4 by demonstrating a practical use case of regularized shrinkage models via a simulation study, which emulates the increasingly common situation where the sample size is large yet the signals are difficult to detect. Our simulation results in particular highlight the dual role of the regularization: by eliminating heavy tails in the shrinkage model posterior, it induces both more stable parameter estimates and faster mixing of the Gibbs sampler.

Regularized shrinkage prior
This section explains how our regularization approach allows us to incorporate prior information on the largest plausible parameter values while maintaining the computational tractability of the original shrinkage prior. Piironen and Vehtari (2017) propose to control the tail behavior of a global-local shrinkage prior by defining its regularized version with slab width $\zeta > 0$ as

$$\beta_j \mid \tau, \lambda_j, \zeta \sim N\left(0, \ \frac{\tau^2 \lambda_j^2 \zeta^2}{\zeta^2 + \tau^2 \lambda_j^2}\right), \tag{2.1}$$

with the prior $\pi_{\rm loc}(\cdot)$ on the local scale $\lambda_j$ unmodified. This regularization ensures that the variance of $\beta_j \mid \tau, \lambda_j, \zeta$ is upper bounded by $\zeta^2$ and hence $\beta_j \mid \zeta$ marginally has a density with Gaussian tails beyond $|\beta_j| > \zeta$. The slab width $\zeta$ can either be given a prior distribution or fixed at a reasonable value.
While beneficial in improving statistical properties (Piironen and Vehtari, 2017), regularization of the form (2.1) compromises the posterior conditional structure of shrinkage models. Specifically, the conditional distribution of $\tau, \lambda$ is altered through their dependency on $\zeta$. This structural change is at best an inconvenience and potentially a cause of computational inefficiency, prohibiting the use of common acceleration techniques. For instance, the global scale $\tau$ is known to mix slowly when updated from its full conditional, so the state-of-the-art Gibbs samplers for Bayesian sparse regression marginalize out a subset of parameters when updating $\tau$ (Johndrow et al., 2020). The analytical tractability of the integrals on which these marginalization strategies rely is lost when using the regularization as in (2.1).
We propose a more computationally convenient formulation, which induces regularization similar to that of (2.1) while keeping $\tau$ and $\lambda$ conditionally independent of $\zeta$ given $\beta$. Our regularized prior $\pi_{\rm reg}(\cdot)$ defines the distribution of $\beta_j, \lambda_j \mid \tau, \zeta$ as

$$\pi_{\rm reg}(\beta_j, \lambda_j \mid \tau, \zeta) \propto N(\beta_j \mid 0, \tau^2 \lambda_j^2) \, N(\beta_j \mid 0, \zeta^2) \, \pi_{\rm loc}(\lambda_j),$$

where $N(\, \cdot \mid 0, \sigma^2)$ denotes the centered Gaussian density with variance $\sigma^2$. In other words, in addition to defining $\pi(\beta_j \mid \tau, \lambda_j, \zeta)$ as in (2.1), we alter the prior on $\lambda_j$ as $\pi(\lambda_j \mid \tau, \zeta) \propto \pi_{\rm loc}(\lambda_j) \big/ \sqrt{1 + \tau^2 \lambda_j^2 / \zeta^2}$. Incidentally, we see that our regularized prior is very similar to that of Piironen and Vehtari (2017), but has a slightly lighter tail due to the factor $1 \big/ \sqrt{1 + \tau^2 \lambda_j^2 / \zeta^2}$ which, as $\lambda_j \to \infty$, behaves like $\zeta / (\tau \lambda_j)$. Alternatively, we can achieve the equivalent regularization through fictitious data that make values $|\beta_j| \gg \zeta$ unlikely. While it may appear unnatural to introduce an auxiliary likelihood for the purpose of indirectly modifying a prior, this alternative formulation makes the regularization mechanism and the resulting posterior properties more transparent. Figure 2.1 schematically describes this alternative construction of our regularized prior as well as the corresponding posterior structure when data $y$ and $X$ inform $\beta$ through the likelihood $L(y \mid X, \beta)$.
Given a global-local prior $\beta_j \mid \tau, \lambda_j \sim N(0, \tau^2 \lambda_j^2)$, we introduce fictitious data $z_j$ whose realized value and underlying distribution are assumed to be

$$z_j = 0, \qquad z_j \mid \beta_j \sim N(\beta_j, \zeta^2),$$

for $j = 1, \ldots, p$. We then define the regularized prior as the distribution of $\beta_j$ conditional on $z_j = 0$. Under this model, the distribution of $\beta_j \mid \tau, \lambda_j, \zeta, z_j = 0$ coincides with that of (2.1). On the other hand, the scale parameters $\tau, \lambda$ are conditionally independent of the others given $\beta$, so that the posterior full conditional $\tau, \lambda \mid \beta, \zeta, z, y, X \overset{d}{=} \tau, \lambda \mid \beta$ has the same density as in the unregularized version. Our regularization thus allows the Gibbs sampler to update $\tau, \lambda$ with the exact same algorithm as the one designed for the original shrinkage prior. We summarize our discussion as Proposition 2.1 below.
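To make the mechanism fully explicit, the calculation behind this construction is a routine completion of the square, recorded here for the reader's convenience:

$$N(\beta_j \mid 0, \tau^2 \lambda_j^2) \, N(0 \mid \beta_j, \zeta^2) \propto N\!\left(\beta_j \;\middle|\; 0, \ \left(\tau^{-2} \lambda_j^{-2} + \zeta^{-2}\right)^{-1}\right) \frac{1}{\sqrt{1 + \tau^2 \lambda_j^2 / \zeta^2}}.$$

Conditioning on $z_j = 0$ thus reproduces the Gaussian in (2.1), since $(\tau^{-2} \lambda_j^{-2} + \zeta^{-2})^{-1} = \tau^2 \lambda_j^2 \zeta^2 / (\zeta^2 + \tau^2 \lambda_j^2)$, while integrating out $\beta_j$ yields the tilted local-scale prior $\pi(\lambda_j \mid \tau, \zeta) \propto \pi_{\rm loc}(\lambda_j) \big/ \sqrt{1 + \tau^2 \lambda_j^2 / \zeta^2}$.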

Geometric and uniform ergodicity under regularized sparse logistic regression
Shrinkage priors' popularity stems, to a considerable extent, from the ease of posterior computation via Gibbs sampling (Bhadra et al., 2019). As we have shown in Section 2, shrinkage models can incorporate regularization without affecting their computational tractability. We now investigate how fast such Gibbs samplers converge. While regularization was originally motivated as a remedy for the statistically problematic behavior of heavy-tailed shrinkage priors, our results show that it can also improve the Gibbs samplers' convergence rates. The simulation results of Section 4 further corroborate the theory.
Note that the transition kernel actually depends neither on ω * nor τ * (nor λ * in the Bayesian bridge case) because of conditional independence. We refer readers to Polson et al. (2013) for more details on this data augmentation scheme. In our analysis, we do not use any specific properties of the Pólya-Gamma distribution aside from a couple of results from Choi and Hobert (2013) and Wang and Roy (2018).
The Pólya-Gamma Gibbs sampler for the logistic model has previously been analyzed under a Gaussian or flat prior on $\beta$ (Choi and Hobert, 2013; Wang and Roy, 2018), but not under shrinkage priors. We establish geometric and uniform ergodicity -- critical properties for any practical Markov chain Monte Carlo algorithm (Jones and Hobert, 2001). These properties imply the Markov chain central limit theorem and enable consistent estimation of Monte Carlo errors, ensuring that the Gibbs sampler reliably estimates quantities of interest (Flegal and Jones, 2011). To avoid cluttering notation and obscuring the main ideas, our analysis below assumes the slab width $\zeta$ to be fixed; however, the same conclusions hold if we only assume a prior constraint of the form $\zeta \le \zeta_{\max} < \infty$ (Remark 3.9).
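To be concrete about what these properties buy us: if the chain $\{\theta_i\}$ is geometrically ergodic with stationary distribution $\Pi$ and a functional $f$ satisfies $\mathbb{E}_\Pi |f|^{2+\epsilon} < \infty$ for some $\epsilon > 0$, then

$$\sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n f(\theta_i) - \mathbb{E}_\Pi f \right) \overset{d}{\to} N(0, \sigma_f^2),$$

and the asymptotic variance $\sigma_f^2$ can be consistently estimated by, for example, the batch means method (Flegal and Jones, 2011).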
Below are the main ergodicity results we will establish in this section: the uniform rate under the Bayesian bridge and the geometric rate under more general shrinkage priors.
Theorem 3.1 (Uniform ergodicity in the Bayesian bridge case). If the prior $\pi_{\rm glo}(\cdot)$ is supported on $[\tau_{\min}, \infty)$ for $\tau_{\min} > 0$, then the Pólya-Gamma Gibbs sampler for regularized Bayesian bridge logistic regression is uniformly ergodic.
Theorem 3.2 (Geometric ergodicity in the general case). If $\|\pi_{\rm loc}\|_\infty < \infty$ and the prior $\pi_{\rm glo}(\cdot)$ is supported on $[\tau_{\min}, \tau_{\max}]$ for $0 < \tau_{\min} \le \tau_{\max} < \infty$, then the Pólya-Gamma Gibbs sampler for regularized sparse logistic regression is geometrically ergodic.
Remark. Uniform/geometric ergodicity is an essential requirement for, yet not a guarantee of, practically efficient Markov chains (Roberts and Rosenthal, 2004). In fact, the simulation results of Section 4 show that the benefit of regularization is greatest when ζ is chosen small enough to impose a reasonable prior constraint on the value of β j 's.

Proof approach: minorization and drift conditions
To establish Theorems 3.1 and 3.2, we verify that each Gibbs sampler satisfies the minorization and drift conditions, upon which geometric and uniform ergodicity are immediately implied by the well-known theory of Markov chains (Meyn and Tweedie, 2009; Roberts and Rosenthal, 2004). Here we introduce the relevant notions in terms of a generic transition kernel $P(\theta^*, \mathrm{d}\theta)$.
In the statements to follow, we assume that $P(\theta^*, \mathrm{d}\theta)$ has a corresponding density function which, with slight abuse of notation, we denote by $P(\theta \mid \theta^*)$; in other words, the two satisfy the relation $P(\theta^*, A) = \int_A P(\theta \mid \theta^*) \, \mathrm{d}\theta$. A chain on the space $\theta \in \Theta$ with transition kernel $P(\theta^*, \mathrm{d}\theta)$ is said to satisfy a minorization condition with a small set $S$ if there are $\delta > 0$ and a probability density $h(\cdot)$ such that

$$P(\theta \mid \theta^*) \ge \delta \, h(\theta) \quad \text{for all } \theta^* \in S \text{ and } \theta \in \Theta.$$

The chain is uniformly ergodic when $S = \Theta$. Otherwise, the chain is geometrically ergodic if it additionally satisfies a drift condition, i.e. there is a Lyapunov function $V(\theta) \ge 0$ such that, for $\gamma < 1$ and $b < \infty$,

$$\mathbb{E}[V(\theta) \mid \theta^*] \le \gamma V(\theta^*) + b \quad \text{for all } \theta^* \in \Theta,$$

with the small set of the form $S = \{\theta : V(\theta) \le d\}$ for a suitable $d$ (Rosenthal, 1995).
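In the uniformly ergodic case $S = \Theta$, the minorization alone yields an explicit rate through the classical coupling argument:

$$\sup_{\theta^* \in \Theta} \| P^n(\theta^*, \cdot) - \Pi \|_{\rm TV} \le (1 - \delta)^n,$$

where $\Pi$ denotes the stationary distribution; the geometric case yields an analogous bound with constants depending on the drift parameters $\gamma$, $b$, and $V$ (Rosenthal, 1995).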

Behavior of shrinkage model Gibbs samplers near β j = 0
In many models, establishing the minorization and drift conditions amounts to quantifying the chain's behavior in the tail of the target. In studying convergence rates under shrinkage models, however, we face an additional and distinctive challenge: the need to establish that the chain does not get "stuck" near the spike at $\beta_j = 0$ (Pal and Khare, 2014; Johndrow et al., 2020). Regularization effectively eliminates the possibility of the chain meandering off to infinity, making it relatively routine to analyze its behavior as $\beta_j \to \infty$. On the other hand, the existing results provide no general insight into the behavior near $\beta_j = 0$. In fact, a careful examination of the proofs by Pal and Khare (2014) and Johndrow et al. (2020) reveals that the analyses under various shrinkage priors could have been unified if we had a more general characterization of shrinkage model Gibbs samplers' behavior near $\beta_j = 0$.
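A useful point of reference: under the global-local structure (1.1) -- and, by Proposition 2.1, also under its regularized version -- the full conditional density of $\lambda_j$ takes the form

$$\pi(\lambda_j \mid \beta_j, \tau) \propto \lambda_j^{-1} \exp\left( -\frac{\beta_j^2}{2 \tau^2 \lambda_j^2} \right) \pi_{\rm loc}(\lambda_j),$$

so its behavior as $\beta_j \to 0$ is governed precisely by the integrability of $\lambda^{-1} \pi_{\rm loc}(\lambda)$ near zero.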
To fill in this theoretical gap, we start our analysis by abstracting key model-agnostic results from our proofs of the minorization and drift conditions for the sparse logistic regression Gibbs sampler. Propositions 3.3 and 3.4 below characterize properties of the distribution of $\lambda_j \mid \beta_j, \tau$; this distribution, due to conditional independence, typically coincides with the full posterior conditional of $\lambda_j$ and critically informs the behavior of the subsequent update of $\beta_j$ in a shrinkage model Gibbs sampler. Our proof techniques apply to a broad range of shrinkage priors, essentially requiring only that $\|\pi_{\rm loc}\|_\infty := \max_\lambda \pi_{\rm loc}(\lambda) < \infty$.
Proposition 3.3 below plays a critical role in our proof of the minorization condition. The proposition tells us that a sample from $\lambda_j \mid \beta_j^*, \tau$ has a uniformly lower-bounded probability of $\lambda_j \ge a$ as long as $|\beta_j^* / \tau|$ is bounded away from zero. In turn, the subsequent update of $\beta_j$ conditional on $\lambda_j$ also has a guaranteed chance of landing away from zero. Intuitively, we can thus interpret the proposition as saying that a shrinkage model Gibbs sampler does not get "absorbed" into the spike at $\beta_j = 0$. The difference in the limiting behavior as $|\beta_j^* / \tau| \to 0$, depending on whether $\int \lambda^{-1} \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda < \infty$, is also significant and leads to the difference between geometric and uniform convergence in the sparse logistic regression example through Theorem 3.6.
Proposition 3.3. For any $a > 0$, the tail probability $\mathbb{P}(\lambda_j \ge a \mid \beta_j^*, \tau)$ is a decreasing function of $|\beta_j^* / \tau|$. If $\int \lambda^{-1} \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda = \infty$, then as $|\beta_j^* / \tau| \to 0$ the tail probability converges to 0, i.e. the conditional $\lambda_j \mid \beta_j^*, \tau$ converges in distribution to a delta measure at 0. If $\int \lambda^{-1} \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda < \infty$, then the conditional $\lambda_j \mid \beta_j^*, \tau$ converges in distribution to the distribution with density proportional to $\lambda^{-1} \pi_{\rm loc}(\lambda)$.
Another key property of $\lambda_j \mid \beta_j, \tau$, featured prominently in our proof of the drift condition (Theorem 3.8), is provided by Proposition 3.4 below. To briefly provide context, a Lyapunov function of the form $V(\beta) = \sum_j |\beta_j|^{-\alpha}$ has proven effective in analyzing shrinkage model Gibbs samplers (Pal and Khare, 2014; Johndrow et al., 2020, Section 3.4), and bounding the conditional expectation of $\tau^{-\alpha} \lambda_j^{-\alpha}$ as below often constitutes a critical step in establishing the drift condition.
Proposition 3.4. Let $R > 0$ and $\alpha \in [0, 1)$. If $\|\pi_{\rm loc}\|_\infty < \infty$, then there is an increasing function $\gamma(r) > 0$ with $\lim_{r \to 0} \gamma(r) = 0$ for which the expectation with respect to $\lambda_j \mid \beta_j^*, \tau$ satisfies $\mathbb{E}[\tau^{-\alpha} \lambda_j^{-\alpha} \mid \beta_j^*, \tau] \le \gamma(R) \, |\beta_j^*|^{-\alpha}$ whenever $|\beta_j^* / \tau| \le R$.
Propositions 3.3 and 3.4 are substantial theoretical contributions on their own, but we defer their proofs to Appendix B so that we can proceed without interruption to establish the ergodicity results in the regularized sparse logistic regression case.
Remark. The assumption $\|\pi_{\rm loc}\|_\infty < \infty$ is sufficient but not necessary for the conclusion of Proposition 3.4 and, later, of Theorem 3.8. Following the analysis by Pal and Khare (2014), we can show that the conclusions also hold under normal-gamma priors with any shape parameter $a > 0$. These priors have the property $\pi_{\rm loc}(\lambda) = O(\lambda^{2a-1})$ as $\lambda \to 0$ and hence $\lim_{\lambda \to 0} \pi_{\rm loc}(\lambda) = \infty$ for $a < 1/2$. We leave it as future work to characterize the behavior of general shrinkage priors with $\|\pi_{\rm loc}\|_\infty = \infty$.
Remark. In Appendix A, we show that Propositions 3.3 and 3.4 can also be applied to establish uniform/geometric ergodicity of a Gibbs sampler for Bayesian sparse probit regression, demonstrating their relevance beyond the sparse logistic regression example.

Minorization -- with uniform ergodicity in special cases
Having described the noteworthy model-agnostic results within our proofs, from now on we focus exclusively on the regularized sparse logistic regression case. We first consider the Gibbs sampler with fixed τ in Lemma 3.5 and Theorem 3.6. While fixing the global scale parameter is a common assumption in the ergodicity proofs for shrinkage models (Pal and Khare, 2014), we subsequently show that this assumption can be replaced with much weaker ones; we only require τ ∼ π glo (·) to be supported away from 0 in Theorem 3.1 and additionally away from +∞ in Theorem 3.7.
Let $P(\beta \mid \beta^*, \tau, \lambda)$ denote the transition kernel corresponding to Steps 3 and 4 of the Gibbs sampler as described on Page 6, and $P(\beta \mid \beta^*, \tau)$ the kernel corresponding to Steps 2 through 4. In other words, we define

$$P(\beta \mid \beta^*, \tau, \lambda) = \int \pi(\beta \mid \omega, \tau, \lambda) \, \pi(\omega \mid \beta^*) \, \mathrm{d}\omega, \qquad P(\beta \mid \beta^*, \tau) = \int P(\beta \mid \beta^*, \tau, \lambda) \, \pi(\lambda \mid \beta^*, \tau) \, \mathrm{d}\lambda.$$

The following lemma builds on a result of Choi and Hobert (2013) and plays a prominent role, along with Proposition 3.3, in our proofs of minorization conditions.
Lemma 3.5. Whenever $\min_j \tau \lambda_j \ge R > 0$, there is $\delta > 0$ -- independent of $\tau$ and $\lambda$ except through $R$ -- such that the following minorization condition holds:

$$P(\beta \mid \beta^*, \tau, \lambda) \ge \delta \, h_R(\beta) \quad \text{for a probability density } h_R(\cdot) \text{ depending only on } R.$$

We defer the proof to Appendix C.
We now establish a minorization condition for the Gibbs sampler with fixed τ .
Proof. Using Lemma 3.5, we have

$$P(\beta \mid \beta^*, \tau) \ge \delta \, h_R(\beta) \prod_j \int_{R/\tau}^\infty \pi(\lambda_j \mid \beta_j^*, \tau) \, \mathrm{d}\lambda_j$$

for $\delta > 0$ depending only on $R$. Also, Proposition 3.3 implies that the tail probability $\int_{R/\tau}^\infty \pi(\lambda_j \mid \beta_j^*, \tau) \, \mathrm{d}\lambda_j$ is bounded away from zero whenever $|\beta_j^* / \tau| \ge \epsilon$. Hence, $\prod_j \int_{R/\tau}^\infty \pi(\lambda_j \mid \beta_j^*, \tau) \, \mathrm{d}\lambda_j$ is lower bounded by a positive constant depending only on $\epsilon$ and $R / \tau$. In case $C = \int_0^\infty \lambda^{-1} \pi_{\rm loc}(\lambda) \, \mathrm{d}\lambda < \infty$, we can forgo the assumption $|\beta_j^* / \tau| \ge \epsilon$ and obtain a uniform lower bound since the conditional $\lambda_j \mid \beta_j^*, \tau$ then converges, as $|\beta_j^* / \tau| \to 0$, to the proper limiting distribution of Proposition 3.3.
We now relax the assumption of fixed $\tau$. The results of van der Pas et al. (2017) suggest that a constraint of the form $0 < \tau_{\min} \le \tau \le \tau_{\max} < \infty$ can improve the statistical properties of shrinkage priors. As it turns out, such a constraint also enables us to establish minorization conditions for the full Gibbs sampler under sparse logistic regression with the $\tau$ update incorporated. We can in fact take $\tau_{\max} = \infty$ in the case of the Bayesian bridge prior, whose unique structure allows us to marginalize out the $\lambda_j$'s when updating $\tau$ (Polson et al., 2014; Nishimura and Suchard, 2022, Appendix E). This collapsed update of $\tau$ from $\tau \mid \beta$ makes it possible to deduce the uniform ergodicity result of Theorem 3.1 as an immediate consequence of Theorem 3.6 by studying the marginal transition $\beta^* \to \beta$.
For more general shrinkage priors, the global scale $\tau$ must be updated from the full conditional $\tau \mid \beta, \lambda$. This makes it necessary to study the marginal transition $(\beta^*, \lambda^*) \to (\beta, \lambda)$, jointly in the regression coefficients and local scales, with kernel

$$P(\beta, \lambda \mid \beta^*, \lambda^*) = \iint \pi(\beta \mid \omega, \tau, \lambda) \, \pi(\lambda \mid \beta^*, \tau) \, \pi(\omega \mid \beta^*) \, \pi(\tau \mid \beta^*, \lambda^*) \, \mathrm{d}\omega \, \mathrm{d}\tau. \tag{3.7}$$

We establish a minorization condition for this general case in Theorem 3.7.
Proof. By Lemma 3.5 and the fact that $\tau \lambda_j \ge \tau_{\min} \lambda_j$, we know that for $R > 0$ the minorization $P(\beta \mid \beta^*, \tau, \lambda) \ge \delta \, h_R(\beta)$ holds whenever $\min_j \lambda_j \ge R / \tau_{\min}$ (3.8). To lower bound the term $\prod_j \pi(\lambda_j \mid \beta_j^*, \tau)$ in (3.7), we first recall the bounds on the conditional density of $\lambda_j$ derived above. It follows from these inequalities that

$$\prod_j \pi(\lambda_j \mid \beta_j^*, \tau) \ge \eta \prod_j \pi_{\rm lower}(\lambda_j) \tag{3.9}$$

for $\eta > 0$ and a density $\pi_{\rm lower}(\cdot)$ independent of $\beta_j^*$ and $\tau$. Combining (3.8) and (3.9), we can lower bound the transition kernel (3.7) as required.

Drift condition and geometric ergodicity
Here we establish a drift condition for geometric ergodicity under sparse logistic regression. As discussed in Section 3.2, the regularization prevents the Markov chain from meandering off to infinity, so the main question is whether the chain can get "stuck" for a long time near $\beta_j^* = 0$. The following result shows that this does not happen as long as the global scale $\tau$ is bounded away from zero.

Auxiliary results for proof of geometric ergodicity
Propositions 3.10 and 3.11 below are used in the proof of Theorem 3.8 and are proved in Appendix D. Proposition 3.10 is a refinement of Proposition A1 in Pal and Khare (2014) and of Equation (41) in Johndrow et al. (2020), neither of which has the $D(\mu/\sigma)$ term.

Simulation
We run a simulation study to assess the computational and statistical properties of the regularized sparse logistic regression model. We use the Bayesian bridge prior $\pi(\beta_j \mid \tau) \propto \tau^{-1} \exp(-|\beta_j / \tau|^a)$ to take advantage of the efficient global scale parameter update scheme. This prior also allows us to experiment with a range of spike and tail behaviors by varying the exponent $a$, inducing larger spikes and heavier tails as $a \to 0$. For the global scale parameter, we choose the objective prior $\pi_{\rm glo}(\tau) \propto \tau^{-1}$ (Berger et al., 2015; Nishimura and Suchard, 2022, Appendix E) with the range restriction $10^{-6} \le \mathbb{E}[\, |\beta_j| \mid \tau \,] \le 1$ to ensure posterior propriety, though in practice we never observe a posterior draw of $\tau$ outside this range. For the posterior computations, we use the Pólya-Gamma Gibbs sampler provided by the bayesbridge package available from the Python Package Index (pypi.org); the source code is available at the GitHub repository https://github.com/OHDSI/bayes-bridge.
Data generating process: "large n, but weak signal" problems

Piironen and Vehtari (2017) demonstrate the benefits of regularizing shrinkage priors in the "p > n" case, when the number of predictors p exceeds the sample size n.
To complement their study, we consider the case of rare outcomes and infrequently observed features, another common situation in which regularizing shrinkage priors becomes essential. For example, in healthcare data, many outcomes of interest have low incidence rates and many treatments are prescribed to only a small fraction of patients (Tian et al., 2018). This results in binary outcomes $y$ and features $x_j$ filled mostly with 0's, making the amount of information much smaller than the sample size alone would suggest (Greenland et al., 2016).
To simulate under these "large n, but weak signal" settings, we generate synthetic data with $n = 2{,}500$ and $p = 500$ as follows. We construct binary features with a range of observed frequencies by first drawing $2 w_j \sim \text{Beta}(1/2, 2)$ for $j = 1, \ldots, 500$; this in particular means $0 \le w_j \le 0.5$ and $\mathbb{E}[w_j] = 0.1$. For each $j$, we then generate $x_{ij} \sim \text{Bernoulli}(w_j)$ for $i = 1, \ldots, n$. We choose the true signal to be $\beta_j = 1$ for $j = 1, \ldots, 10$ and $\beta_j = 0$ for $j = 11, \ldots, 500$. To simulate an outcome with a low incidence rate, we choose the intercept to be $\beta_0 = 1.5$ and draw $y_i \sim \text{Bernoulli}(\text{logit}^{-1}(-x_i \beta))$, resulting in $y_i = 1$ for approximately 5% of the entries.
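For concreteness, the following is a minimal sketch of this data generating process in Python; the variable names and random seed are ours for illustration, not taken from the study code.

```python
import numpy as np
from scipy.special import expit  # inverse logit

rng = np.random.default_rng(seed=0)
n, p, n_signal = 2500, 500, 10

# Feature frequencies: 2 * w_j ~ Beta(1/2, 2), so w_j in [0, 0.5] with E[w_j] = 0.1.
w = rng.beta(0.5, 2.0, size=p) / 2.0

# Sparse binary design matrix: x_ij ~ Bernoulli(w_j).
X = (rng.uniform(size=(n, p)) < w).astype(float)

# True signal: beta_j = 1 for the first 10 coefficients, 0 otherwise.
beta = np.zeros(p)
beta[:n_signal] = 1.0
beta0 = 1.5  # intercept

# Rare outcome: y_i ~ Bernoulli(logit^{-1}(-(beta0 + x_i beta))), roughly 5% incidence.
prob = expit(-(beta0 + X @ beta))
y = (rng.uniform(size=n) < prob).astype(int)
```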

Convergence and mixing: with and without regularization
With the above data generating process, the outcome $y$ and design matrix $X$ barely contain enough information to estimate all the coefficients $\beta_j$. In particular, the sparse logistic model without regularization can lead to a heavy-tailed posterior, for which uniform and geometric ergodicity of the Pólya-Gamma Gibbs sampler become questionable.
These potential convergence and mixing issues are evidenced by the traceplot (Figure 4.1a) of the posterior samples based on bridge exponent a = 1/16. As we are particularly concerned with the Markov chain wandering off to the tail of the target, we examine the estimated credible intervals to identify the coefficients with potential convergence and mixing issues. Plotted in Figure 4.1 are the coefficients with the widest 95% credible intervals; these coefficients also have some of the smallest estimated effective sample sizes, though the accuracy of such estimates is not guaranteed without geometric ergodicity. When regularizing the shrinkage prior with a slab width ζ = 1, the posterior samples indicate no such convergence or mixing issues (Figure 4.1b) and yield more sensible posterior credible intervals (Figure 4.2).
We emphasize that there is no fundamental change in the Gibbs sampler itself when incorporating the regularization, the only change being the addition of the $\zeta^{-2} I$ term in the conditional precision matrix (3.2) when updating $\beta$. It is the change in the posterior -- more specifically, the guaranteed light tails of the $\beta$ marginal -- that induces faster convergence and mixing.
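To illustrate just how small the change is, here is a minimal, fixed-scale sketch of one sweep of the Pólya-Gamma Gibbs sampler with the regularization term included. This is an illustration only, not the bayesbridge implementation; it assumes a PG(1, z) sampler such as random_polyagamma from the polyagamma package on PyPI, and it omits the updates of tau and lam that a full sampler would perform.

```python
import numpy as np
from polyagamma import random_polyagamma

def pg_gibbs_sweep(beta, X, y, tau, lam, zeta, rng):
    """One sweep of beta | omega, tau, lam, zeta under regularized sparse
    logistic regression. Fixed-scale illustration only."""
    # Update omega_i | beta ~ PG(1, x_i beta).
    psi = X @ beta
    omega = random_polyagamma(1.0, psi, random_state=rng)

    # Update beta | omega ~ N(Phi^{-1} X' kappa, Phi^{-1}) with kappa = y - 1/2.
    # The regularization only adds the zeta^{-2} term to the prior precision.
    kappa = y - 0.5
    prior_prec = 1.0 / (tau**2 * lam**2) + 1.0 / zeta**2  # diagonal entries
    Phi = X.T @ (omega[:, None] * X) + np.diag(prior_prec)
    mean = np.linalg.solve(Phi, X.T @ kappa)

    # Draw from N(mean, Phi^{-1}) via the Cholesky factor of the precision.
    L = np.linalg.cholesky(Phi)
    z = rng.standard_normal(len(beta))
    return mean + np.linalg.solve(L.T, z)
```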
We also assess the sensitivity of convergence and mixing rates to the slab width $\zeta$. The regularized prior recovers the unregularized one as $\zeta \to \infty$. This means that, as seen from the problematic computational behavior of the unregularized model, $\zeta$ cannot be taken too large in this limited-data setting. In other words, the choice of $\zeta$ has to reflect some degree of prior information on the $\beta_j$'s. We need not assume strong prior information, however; Figure 4.3 demonstrates that even a small amount of regularization (e.g. $\zeta = 2$ or 4) can noticeably improve the computational behavior over the unregularized case.

Statistical properties of shrinkage model for weak signals
To study the shrinkage model's ability to separate the non-zero $\beta_j$'s from those with $\beta_j = 0$, we simulate 10 replicate data sets and estimate the posterior for each of them. In total, we obtain 5,000 marginal posterior distributions -- 10 independent replications for each of the $p = 500$ regression coefficients -- with 100 for the signals $\beta_j = 1$ and 4,900 for the non-signals $\beta_j = 0$. As all the predictors $x_j$ are simulated in an exchangeable manner, the 100 (and 4,900) posterior marginals for the signals (and non-signals) are also exchangeable. Figure 4.4 shows the posterior credible intervals. Due to the low incidence rate and infrequent binary features, many of the signals are too weak to be detected. We also find that the credible intervals seemingly do not achieve their nominal frequentist coverage for signals below detection strength. This finding is consistent with the existing theoretical results on shrinkage priors and is unsurprising in light of the impossibility theorem of Li (1989): confidence intervals cannot be optimally tight and have nominal coverage at the same time. Credible intervals produced by Bayesian shrinkage models tend to be optimally tight and thus require appropriate manual adjustments to achieve the nominal coverage (van der Pas et al., 2017). No statistical procedure is immune to this tightness-coverage trade-off; the apparent under-coverage should therefore be seen not as a flaw but rather as a feature of Bayesian shrinkage models.
We benchmark the signal detection capability of the posterior against the frequentist lasso, arguably the most widely used approach to feature selection. Obtaining the lasso point estimates requires selecting a hyper-parameter commonly referred to as the penalty parameter. For its choice, we first follow the standard practice of minimizing the ten-fold cross-validation error (Hastie et al., 2009). However, this approach yields inconsistent and poor overall performance, detecting only 13 out of the 100 signals (Figure 4.4). Cross-validation likely fails here because each fold does not capture the characteristics of the whole data when the signals are so weak. As a more stable alternative for calibrating the penalty parameter, we try an empirical Bayes procedure based on the Bayesian interpretation of the lasso (Park and Casella, 2008). We first estimate the posterior marginal mean of the penalty parameter from the Bayesian lasso Gibbs sampler. Conditionally on this value, we then find the posterior mode of $\beta$. This procedure seems to yield more consistent performance, detecting 39 out of the 100 signals, albeit with estimates more shrunk towards the null than the Bayesian posterior means. The empirical Bayes procedure demonstrates more consistent behavior for the non-signals as well (Figure 4.5).
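For reference, the cross-validated lasso baseline can be set up along the following lines; this is a sketch assuming the scikit-learn API, with an illustrative grid of penalty values rather than the exact settings of our study.

```python
from sklearn.linear_model import LogisticRegressionCV

# Ten-fold cross-validated L1-penalized logistic regression (lasso baseline).
# Cs=20 is an illustrative grid of inverse penalty strengths.
lasso = LogisticRegressionCV(
    Cs=20, cv=10, penalty="l1", solver="liblinear", scoring="neg_log_loss"
)
lasso.fit(X, y)
selected = lasso.coef_.ravel() != 0  # coefficients flagged as signals
```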
We also assess how the spike size and (pre-regularization) tail behavior of the prior influence statistical properties of the resulting posterior. For this purpose, we fit the regularized bridge model with the exponent $a^{-1} \in \{2, 4, 8, 16\}$ to the same data sets. The credible intervals under the larger exponents are centered around values similar to those in the $a = 1/16$ case (Figure 4.4), but are much wider overall. We observe the same pattern throughout the range of exponent values: similar median values, but tighter intervals for the smaller exponents. In particular, as can be seen in Figure 4.7, more "extreme" shrinkage priors with larger spikes and heavier tails seem to yield tighter credible intervals for the same coverage.

Discussion
Shrinkage priors have been adopted in a variety of Bayesian models, but the potential issues arising from their heavy tails are often overlooked. Our method provides a simple and convenient way to regularize shrinkage priors, making posterior inference more robust. Both the theoretical and empirical results demonstrate the benefits of regularization in improving statistical and computational properties when parameters are only weakly identified. Much of the systematic investigation into shrinkage prior properties has so far focused on rather simple models and situations in which signals are reasonably strong. Our work adds to the emerging efforts to better understand the behavior of shrinkage models in more complex settings.