Precision Matrix Estimation under the Horseshoe-like Prior–Penalty Dual

Precision matrix estimation in a multivariate Gaussian model is fundamental to network estimation. Although both Bayesian and frequentist approaches to this problem exist, it is difficult to obtain good Bayesian and frequentist properties under the same prior–penalty dual. To bridge this gap, our contribution is a novel prior–penalty dual that closely approximates the graphical horseshoe prior and penalty, and performs well in both the Bayesian and frequentist senses. A chief difficulty with the horseshoe prior is the lack of a closed-form expression for the density function, which we overcome in this article. In terms of theory, we establish a posterior convergence rate for the precision matrix that matches the oracle rate, in addition to frequentist consistency of the MAP estimator. Our results also provide theoretical justification for previously developed approaches that have so far been unexplored, e.g., the graphical horseshoe prior. Computationally efficient EM and MCMC algorithms are developed for the penalized likelihood and fully Bayesian estimation problems, respectively. In numerical experiments, the horseshoe-based approaches echo their superior theoretical properties by comprehensively outperforming the competing methods. A protein–protein interaction network estimation problem in B-cell lymphoma validates the proposed methodology.


Introduction
High-dimensional precision matrix estimation under a multivariate normal model is a fundamental building block for network estimation, and a common thread connecting disparate applications such as inference on gene regulatory networks (Huynh-Thu and Sanguinetti, 2019), econometrics (Callot et al., 2019; Fan et al., 2016), and neuroscience (Ryali et al., 2012). The frequentist solution to this problem is now relatively well understood and several useful algorithms exist; see Pourahmadi (2011) for a detailed review. However, interested readers will quickly discern that the Bayesian literature on this problem is still sparse, barring some notable exceptions described in Section 1.1. The reason for this is simple: the focus of a Bayesian is on the entire posterior and quantification of uncertainty using the said posterior, a problem fundamentally more demanding computationally. Consequently, the virtues of probabilistic uncertainty quantification notwithstanding, the Bayesian treatment of precision matrix estimation has received relatively scant attention from impatient practitioners. Furthermore, a penalized likelihood estimate with good frequentist properties need not correspond to good Bayesian posterior concentration properties under the corresponding prior. A notable example of this in linear regression models is the lasso penalty (Tibshirani, 1996) and its Bayesian counterpart using the double exponential prior (Park and Casella, 2008), for which Castillo et al.
(2015) assert: "the LASSO is essentially non-Bayesian, in the sense that the corresponding full posterior distribution is a useless object." We address this gap in the literature in the context of graphical models. Our contribution is a novel prior-penalty dual that makes both fully Bayesian and fast penalized likelihood estimation feasible. The key distinguishing feature of our work is that we provide theoretical and empirical support for both Bayesian and frequentist solutions to the problem under the same prior-penalty dual. It is shown that the Bayesian posterior as a whole concentrates around the truth and that the penalized likelihood point estimate is consistent. To our knowledge, ours is the first work to establish these results using continuous shrinkage priors under an arbitrary sparsity pattern in the true precision matrix. This is in contrast to the current state of the art in theory, which imposes additional constraints on the graph, e.g., banded-ness or more general decomposable structures (Banerjee and Ghosal, 2014; Lee and Lee, 2021; Liu and Martin, 2019; Xiang et al., 2015). Typically these assumptions are made for computational and theoretical tractability rather than any intrinsic subject matter knowledge.
Our work avoids these restrictive assumptions because we work with continuous "global-local" shrinkage priors and impose sparsity in a weak sense (see, e.g., Bhadra et al., 2019b).
The motivating data set arises from a biological application. Protein-protein interaction networks have been found to play a crucial role in cancer (Ha et al., 2018). One significant effort in this direction is "The Cancer Genome Atlas" program (Weinstein et al., 2013), which has collected data from over 7,700 patients across 32 different tumor types. From this repository, we retrieve proteomic data of 33 patients with "Lymphoid Neoplasm Diffuse Large B-cell Lymphoma," a cancer that starts in white blood cells and spreads to lymph nodes. Our findings are contrasted with those of Ha et al. (2018).

The Current State of the Art and Our Contributions in Context
A Gaussian graphical model (GGM) remains popular as a fundamental building block for network estimation because of the ease of interpretation of the resulting precision matrix estimate: an inferred off-diagonal zero corresponds to conditional independence of the two corresponding nodes given the rest (see, e.g., Lauritzen, 1996).
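This zero-pattern interpretation is easy to verify numerically. The sketch below (a hypothetical 3-node example with entries chosen only for illustration) builds a precision matrix with ω_13 = 0 and confirms that nodes 1 and 3 are conditionally uncorrelated given node 2, using the partial correlation −ω_ij/√(ω_ii ω_jj) and a Schur-complement conditional covariance:

```python
import numpy as np

# Hypothetical 3-node precision matrix with omega_13 = 0,
# chosen to be positive definite (diagonally dominant).
Omega = np.array([[2.0, 0.8, 0.0],
                  [0.8, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])
Sigma = np.linalg.inv(Omega)  # implied covariance matrix

# Partial correlation of nodes i, j given the rest:
# rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
rho_13 = -Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2])

# Conditional covariance of (X1, X3) given X2 via the Schur complement.
idx, cond = [0, 2], [1]
S11 = Sigma[np.ix_(idx, idx)]
S12 = Sigma[np.ix_(idx, cond)]
S22 = Sigma[np.ix_(cond, cond)]
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S12.T

# omega_13 = 0 implies a zero partial correlation and a zero
# off-diagonal conditional covariance (up to floating point).
print(rho_13, cond_cov[0, 1])  # both zero up to floating point
```

The off-diagonal zero in Ω and the conditional independence statement are exactly equivalent for Gaussian variables, which is what makes the sparsity pattern of the precision matrix directly interpretable as a graph.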
Among the most popular frequentist approaches for estimating GGMs are the graphical lasso (Friedman et al., 2008) and the graphical SCAD (Fan et al., 2009), which are respectively the graphical extensions of the lasso (Tibshirani, 1996) and SCAD (Fan and Li, 2001) penalties in linear models. Similarly, the CLIME estimator of Cai et al. (2011) is a graphical application of the Dantzig penalty (Candès and Tao, 2010). Fan et al. (2016) propose factor-based models for estimation of precision matrices, which are particularly attractive in financial applications where it is sensible to assume that the precision matrix of the outcome variables, conditioned on some common factors, is sparse. Alternatively, Callot et al. (2019) opt for a node-wise regression approach using an ℓ1 penalty to minimize the risk of a Markowitz portfolio. The positive definiteness of their estimate is guaranteed only asymptotically and remains hard to establish in finite samples, a common issue with node-wise regression approaches. Zhang and Zou (2014) propose a new empirical loss, termed the D-trace loss, to avoid computing the log determinant term in the ℓ1 penalized loss. Under certain conditions they also prove that the resulting estimate is identical to the CLIME estimate (Cai et al., 2011). A ridge-type estimate for the precision matrix, termed ROPE, is proposed by Kuismin et al. (2017), who use the squared Frobenius norm of the precision matrix as a penalty function. A distribution-free version of the ridge estimate is proposed by Wang et al. (2015). An elastic net penalty (Zou and Hastie, 2005) is used by Ryali et al. (2012) to determine the functional connectivity among brain regions. A comprehensive theoretical treatment of the rate of convergence of precision matrix estimates is given by Lam and Fan (2009).
The frequentist approaches listed above generally enjoy faster and more scalable computation, owing to being point estimates. Nevertheless, from a Bayesian perspective, a common theme with these penalized approaches is that the posterior concentration properties of the corresponding priors remain completely unexplored. Moving now to Bayesian methodologies for unstructured precision matrices, the literature is relatively scant. Wang (2012) proposes a Bayesian version of the graphical lasso and uses a clever decomposition of the precision matrix to facilitate block Gibbs sampling and to guarantee the positive definiteness of the resulting estimate. Banerjee and Ghosal (2015) consider a prior structure similar to the Bayesian graphical lasso, with the exception that they put a large point mass at zero for the off-diagonal elements of the precision matrix. Under assumptions of sparsity, they derive posterior convergence rates in the Frobenius norm, and also provide a Laplace approximation method for computing marginal posterior probabilities of models. Spike-and-slab variants with double exponential priors are proposed by Gan et al. (2019).
A common issue with the spike-and-slab approach is the presence of binary indicator variables, which typically hinder posterior exploration; the Bayesian lasso estimate, in turn, is known to be biased for large signals. Both of these issues are addressed by the graphical horseshoe estimate proposed by Li et al. (2019), which is an application of the popular global-local horseshoe prior (Carvalho et al., 2010) to GGMs. Li et al. (2019) provide considerable empirical evidence of superior performance over several competing Bayesian and frequentist approaches. Nevertheless, their theoretical results are limited to upper bounds on some Kullback-Leibler risk properties and the bias of the resulting estimate. Consequently, whether the graphical horseshoe posterior has correct concentration properties has remained an open question. Similarly, its frequentist dual, the penalized likelihood estimate, also remains unavailable, mainly because there is no closed form of the horseshoe prior or penalty; only a normal scale mixture representation. Both of these issues are resolved in the present paper. We propose a novel prior-penalty dual that closely approximates the graphical horseshoe prior, with the density available both explicitly and as a normal scale mixture, which has important implications in theory and in practice, in both Bayesian and frequentist settings. Moreover, as a corollary to one of our main results, the posterior concentration property of the graphical horseshoe is also established, for the first time.

Formulation of the Prior-Penalty Dual
We begin the formal treatment by pointing interested readers to Supplementary Section S.1 for a summary of the mathematical notation used in the paper. Let X^(n) = (X_1, . . ., X_n)^T be a random sample from a p-dimensional normal distribution with mean 0 and positive definite covariance matrix Σ. The corresponding precision matrix, or inverse covariance matrix, is defined as Ω = ((ω_ij)) = Σ^{-1}. We assume that Ω is sparse, in the sense that the number of non-zero off-diagonal elements is small. We utilize the duality between a Bayesian prior and a penalty, whereby the penalized likelihood estimate is understood to correspond to the maximum a posteriori (MAP) estimate under a given prior.
Hence, for fully Bayesian inference on Ω, we need a suitable prior that also results in a penalty function with good frequentist properties; a non-trivial problem even in linear models (Castillo et al., 2015). We put independent horseshoe-like priors (Bhadra et al., 2019a) on the off-diagonal elements and non-informative priors on the diagonal elements of Ω, while restricting the prior mass to positive definite matrices. A key benefit of the horseshoe-like prior, which closely mimics the sparsity-inducing global-local horseshoe prior (Carvalho et al., 2010), is that the prior density, and hence the penalty, is available in closed form under the former, unlike under the latter. This allows one to study the penalty (equivalently, the negative logarithm of the prior density) directly, and to establish important properties concerning convexity (see, e.g., Lemma 4.8), which remain much more difficult under the horseshoe prior. For the fully Bayesian model, the element-wise prior specification induced by the horseshoe-like prior is given in (1), where π(ω_ij; a) = log(1 + a/ω_ij^2) / (2π a^{1/2}) is the horseshoe-like density function for ω_ij. The motivation for using this density is two-fold: it has a sharp spike near zero, encoding the Bayesian prior belief that most signals are ignorable; and it possesses very heavy, polynomially decaying tails, allowing for the identification of signals. These two properties closely mimic the popular horseshoe prior for sparse signals (Carvalho et al., 2010); in fact, one achieves the same origin and tail rates for the density function in terms of ω_ij as with the original horseshoe. The crucial advantage of the horseshoe-like, then, is that the density function has a closed form, unlike the horseshoe prior. Nevertheless, similar to the original horseshoe prior, the horseshoe-like prior also admits a convenient latent variable representation as a Gaussian scale mixture (Bhadra et al., 2019a). To be precise, ω_ij is conditionally Gaussian given a latent scale ν_ij, as displayed in (2), and marginalizing over the latent ν_ij leads to the desired π(ω_ij; a) identified above. For modeling valid precision matrices, we must restrict the prior mass to the space of symmetric positive definite matrices M_p^+.

Figure 1: Smoothed prior densities from Markov chain Monte Carlo samples for the graphical horseshoe-like (dashes), the graphical horseshoe (small dashes) and the Bayesian graphical lasso (solid) priors, providing a visual comparison of (left) the spikes near the origin, (middle) the heaviness of the tails, and (right) the induced penalty functions, for p = 10.

Combining the unrestricted priors in (1) and (2) with the above restriction leads to the joint prior specification on Ω given in (3), where ν = {ν_ij}_{i<j}. In this formulation, the latent parameters ν_ij are component-specific, or local, and the shared parameter a is global, situating the horseshoe-like in the broader category of global-local priors (Bhadra et al., 2019b). Further details on the induced marginal prior on Ω are presented in Supplementary Section S.2. Although it is possible to put a further hyperprior on a, it is considered fixed for point estimation approaches, and is estimated by the effective model size approach of Piironen and Vehtari (2017) to avoid a collapse to zero. We defer the details to Supplementary Section S.6. With the prior specification as in (3), the log-posterior L takes the form given in (4). At this point, the corresponding hierarchy of the horseshoe prior, which the horseshoe-like closely approximates, is well worth mentioning. The horseshoe prior (Carvalho et al., 2010) is recognized as the state of the art in sparse signal recovery (Bhadra et al., 2019a), and was deployed in estimating GGMs by Li et al. (2019) with the hierarchy in (5), where C^+(0, 1) denotes the standard half-Cauchy distribution.

Figure 1 plots the smoothed histogram of prior densities of a randomly chosen off-diagonal element near the origin and at the tails for the graphical horseshoe-like, along with two of its relatives: the graphical horseshoe (Li et al., 2019) and the Bayesian graphical lasso (Wang, 2012). The corresponding penalties, given by the negative logarithm of the densities, are also shown. The key observations are: (a) the graphical horseshoe and graphical horseshoe-like densities are very similar, and (b) both have far sharper spikes near the origin and heavier tails compared to the Laplace priors used in the Bayesian graphical lasso, providing an intuitive basis for the superiority of the horseshoe family in sparse signal recovery. Extensive formal support for these observations is available in linear models (Bhadra et al., 2019b), but barring some empirical evidence, the corresponding theoretical support has been lacking in graphical models.
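Because the density is available in closed form, its basic properties can be verified directly. The sketch below (function names are ours) evaluates π(ω; a) and the induced penalty, and numerically confirms that the density integrates to one, using the identity ∫_R log(1 + a/ω²) dω = 2π√a:

```python
import numpy as np

def horseshoe_like_density(omega, a):
    """Closed-form horseshoe-like density:
    pi(omega; a) = log(1 + a / omega^2) / (2 * pi * sqrt(a))."""
    return np.log1p(a / omega**2) / (2.0 * np.pi * np.sqrt(a))

def horseshoe_like_penalty(omega, a):
    """Induced penalty (negative log density, up to an additive constant):
    pen_a(omega) = -log log(1 + a / omega^2)."""
    return -np.log(np.log1p(a / omega**2))

a = 1.0
# Sharp spike: the density is unbounded as omega -> 0 ...
assert horseshoe_like_density(1e-8, a) > horseshoe_like_density(1.0, a)
# ... with heavy, polynomially decaying tails: log(1 + a/w^2) ~ a/w^2.

# Numerical check of normalization, using the symmetry of the density
# and the identity  int_R log(1 + a/w^2) dw = 2*pi*sqrt(a).
w = np.linspace(1e-6, 200.0, 2_000_000)
f = horseshoe_like_density(w, a)
half_mass = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(w))
total_mass = 2.0 * half_mass
print(total_mass)  # close to 1; the small deficit is the tail beyond 200
```

The unbounded spike at the origin and the ω^{-2} tail are precisely the two features that the figure contrasts against the bounded, exponentially-tailed Laplace prior of the Bayesian graphical lasso.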
Some comments on the desirability of a Gaussian scale mixture representation are also in order.
First, the latent mixing variables make it easier to derive fully Bayesian computational strategies via data augmentation. A similar observation is true for point estimates via the expectation-maximization algorithm. Second, using a result of Barndorff-Nielsen et al. (1982), it is possible to derive a precise connection between the densities of the mixing variables and that of the resulting mixture. In particular, if the mixing densities are regularly varying in the tails, then so is the resulting Gaussian mixture. Since regular variation is closed under many nonlinear transformations, the heavier tails of global-local priors impart crucial robustness properties for estimating nonlinear, many-to-one functions of the parameters of interest in multi-parameter problems (Bhadra et al., 2016), and help avoid the marginalization paradoxes of Dawid et al. (1973).

ECM Algorithm for MAP Estimation
We utilize the Gaussian mixture representation of the horseshoe-like prior with latent scale parameters to devise an Expectation Conditional Maximization (ECM) (Meng and Rubin, 1993) approach to MAP estimation, building on the calculations for linear models by Bhadra et al. (2019a). For updating the elements of the precision matrix, we use the coordinate descent technique proposed by Wang (2014), which guarantees the positive definiteness of the precision matrix at each update.

E-Step: Following Bhadra et al. (2019a), we calculate the conditional expectation of the latent variable ν_ij, 1 ≤ i < j ≤ p, at the current iteration (t).

CM Steps: Having updated the latent parameters in the E-Step, the coordinate descent approach of Wang (2014) is used to update one column of the precision matrix at a time. Without loss of generality, we present the steps for updating the pth column. First, we partition the precision matrix Ω and the scaled sample covariance matrix nS into conformable blocks, where Ω_11 is the (p − 1) × (p − 1) top-left block of Ω, Ω_22 is the pth diagonal element, and Ω_12 is the (p − 1) × 1 vector of the remaining elements in the pth column; the decomposition of nS is analogous. We define γ = Ω_22 − Ω_12^T Ω_11^{-1} Ω_12 and β = Ω_12. With these transformations, we simplify (4) to update the pth column, absorbing terms independent of β and γ into constants c_1, c_2, c_3. Maximizing the resulting log-posterior in the transformed variables over β and γ gives the required update in (8). Having updated β and γ from (8), the pth column of the precision matrix is reassembled for the next iteration. We repeat the above steps for the remaining (p − 1) columns to complete the CM-Step updates for Ω, and the E- and CM-Steps are iterated until convergence to the MAP estimator Ω̂_MAP. The procedure is summarized in Algorithm 1.

Algorithm 1: ECM algorithm for MAP estimation (GHS-LIKE-ECM). While not converged: perform the E-Step; then, for each i = 1, . . ., p, update the ith column of Ω_u using the coordinate descent algorithm of Wang (2014) described above. Return Ω̂_MAP = Ω_u.
The most computationally expensive step is the required inverse of a (p − 1) × (p − 1) matrix to compute β in (8), which needs to be repeated for each of the p columns, giving a per iteration complexity of O(p 4 ) for the algorithm.
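The CM-Step column update can be illustrated with a short sketch. This is not the paper's exact update in (8): the vector D below stands in for the E-Step expectations of the latent scales, and the ridge-type system for β is a generic stand-in with the structure common to block coordinate descent updates of this type (Wang, 2014); all names are hypothetical. What the sketch does show faithfully is the Schur-complement reassembly that keeps every iterate positive definite.

```python
import numpy as np

def update_last_column(Omega, nS, D, n):
    """One schematic CM-step update of the last column of Omega.

    Omega : current precision matrix (p x p, symmetric positive definite)
    nS    : n times the sample covariance matrix
    D     : (p-1)-vector standing in for the E-step scale expectations
            (a hypothetical placeholder for the paper's exact terms in (8))
    """
    p = Omega.shape[0]
    O11 = Omega[:p-1, :p-1]
    s12 = nS[:p-1, p-1]
    s22 = nS[p-1, p-1]

    O11_inv = np.linalg.inv(O11)  # the O(p^3) step repeated per column
    # beta solves a ridge-type system: (s22 * O11^{-1} + D^{-1}) beta = -s12
    A = s22 * O11_inv + np.diag(1.0 / D)
    beta = -np.linalg.solve(A, s12)
    # gamma > 0 from maximizing (n/2) log(gamma) - s22 * gamma / 2 (schematic)
    gamma = n / s22

    Omega_new = Omega.copy()
    Omega_new[:p-1, p-1] = beta
    Omega_new[p-1, :p-1] = beta
    # Schur-complement parametrization keeps positive definiteness:
    Omega_new[p-1, p-1] = gamma + beta @ O11_inv @ beta
    return Omega_new

# Small illustration with simulated data.
rng = np.random.default_rng(0)
p, n = 5, 50
X = rng.standard_normal((n, p))
nS = X.T @ X
Omega = update_last_column(np.eye(p), nS, np.ones(p - 1), n)
print(np.linalg.eigvalsh(Omega).min() > 0)  # positive definiteness preserved
```

Because the new (p, p) entry is set to γ + β^T Ω_11^{-1} β with γ > 0, the Schur complement of the updated matrix equals γ, so positive definiteness is preserved at every column sweep whenever Ω_11 is positive definite, mirroring the guarantee of Wang (2014).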

Posterior Sampling for the Fully Bayesian Estimate
For fully Bayesian estimation, we also outline the MCMC sampling procedure. With a change of variables mapping each latent ν_ij to a new local parameter t_ij, and with a → τ², the prior in (2) can be written with a different hierarchy, in which π(t_ij) is the slash normal density, expressed as (φ(0) − φ(t_ij))/t_ij², where φ(·) is the standard normal density (Bhadra et al., 2019a). Introducing a further local latent variable r_ij, the density for t_ij can also be written as a normal scale mixture in which the scale follows a Pareto distribution. For efficient sampling, this Pareto scale mixture can in turn be represented as the product of an exponential density and an indicator function. Different choices of prior for the global scale parameter are possible, but we consider τ ∼ C^+(0, 1).
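The slash normal density admits a quick numerical sanity check: a standard slash variable T = Z/U, with Z standard normal and U uniform on (0, 1), has exactly the density (φ(0) − φ(t))/t². The sketch below (illustrative only; not part of the sampler) compares a Monte Carlo estimate of P(|T| < 1) with the integral of the density formula:

```python
import numpy as np

def phi(x):
    """Standard normal density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def slash_normal_pdf(t):
    """Slash normal density (phi(0) - phi(t)) / t^2, with limit phi(0)/2 at 0."""
    t = np.asarray(t, dtype=float)
    denom = np.maximum(t**2, 1e-16)
    return np.where(np.abs(t) < 1e-8, phi(0.0) / 2.0,
                    (phi(0.0) - phi(t)) / denom)

# T = Z / U is a standard slash variable with exactly this density.
rng = np.random.default_rng(1)
z = rng.standard_normal(500_000)
u = rng.uniform(size=500_000)
frac_in_band = np.mean(np.abs(z / u) < 1.0)

# Integral of the density over (-1, 1) should match the Monte Carlo estimate.
grid = np.linspace(-1.0, 1.0, 20_001)
f = slash_normal_pdf(grid)
mass = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))
print(frac_in_band, mass)  # both approximately 0.369
```

The heavy ω^{-2} tail of the slash normal (it has no finite mean) is what carries over the horseshoe-like prior's robustness to large signals into the MCMC parametrization.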

Following Makalic and Schmidt (2015), the half-Cauchy prior admits the hierarchy τ² | ξ ∼ Inverse-Gamma(1/2, 1/ξ), ξ ∼ Inverse-Gamma(1/2, 1), under which marginally τ ∼ C^+(0, 1). Using this, we can write the posterior updates of τ and ξ as in (10). Following the remaining updates from the graphical horseshoe sampler of Li et al. (2019), the complete MCMC scheme for the graphical horseshoe-like is outlined in Algorithm 2. The per-iteration complexity of the algorithm is O(p^4). Diagnostic plots for both the ECM and MCMC algorithms are given in Supplementary Section S.7.
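The auxiliary-variable hierarchy can be verified by simulation: composing ξ ∼ IG(1/2, 1) with τ² | ξ ∼ IG(1/2, 1/ξ) should reproduce a standard half-Cauchy for τ, whose median is tan(π/4) = 1. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200_000

# If Y ~ Gamma(shape=s, rate=b), then 1/Y ~ Inverse-Gamma(s, b);
# numpy's gamma is parametrized by scale = 1/rate.
xi = 1.0 / rng.gamma(shape=0.5, scale=1.0, size=m)  # xi ~ IG(1/2, 1)
tau2 = 1.0 / rng.gamma(shape=0.5, scale=xi)         # tau^2 | xi ~ IG(1/2, 1/xi)
tau = np.sqrt(tau2)

# Marginally tau ~ C+(0, 1), whose median is tan(pi/4) = 1.
print(np.median(tau))  # approximately 1
```

The practical payoff of the hierarchy is that both conditional updates are inverse-gamma, so the half-Cauchy global scale can be sampled by conjugate Gibbs steps rather than a Metropolis move.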

Posterior Concentration Results
In this section, we present our main result on the posterior contraction rate of the precision matrix Ω around the true precision matrix Ω_0 with respect to the Frobenius norm under the graphical horseshoe-like prior. Our proofs use the general theory on posterior convergence rates outlined in Ghosal et al. (2000), which establishes the desired convergence with respect to the Hellinger distance. However, from the perspective of a precision matrix, the Frobenius norm is easier to interpret than the Hellinger distance. Under suitable assumptions on the eigenspace of the precision matrices, Banerjee and Ghosal (2015) showed that these two distances are equivalent, and hence the same posterior contraction rates hold with respect to the Frobenius norm as well. We assume that the true underlying graph is sparse, so that the corresponding true precision matrix Ω_0 has s non-zero off-diagonal elements. The total number of non-zero elements in Ω_0 is then p + s, which gives the effective dimension of the parameter Ω_0.

Algorithm 2: The graphical horseshoe-like MCMC sampler (GHS-LIKE-MCMC). Given S = X^(n)T X^(n)/n and the sample size n, set p to the number of rows (or columns) of S and set initial values. For each of the burnin + nmc iterations: update and save Σ, T, M; then sample τ and ξ as in (10). Return the Monte Carlo samples of Ω.

To establish the desired posterior concentration results, we shall need to control both the actual dimension and the effective dimension of the true precision matrix. Overall, our theoretical analyses depend on certain assumptions on the true precision matrix, the dimension and sparsity, and the prior space.
We present the details of these assumptions along with relevant discussions below.
Assumption 4.1. The prior is restricted to a subspace M_p^+(L) of symmetric positive definite matrices whose eigenvalues all lie in [L^{-1}, L], as specified in (11).

Assumption 4.3. The true precision matrix Ω_0 belongs to the parameter space U(ε_0, s) of sparse precision matrices with suitably bounded eigenvalues.

Assumption 4.4. The bound [L^{-1}, L] on the eigenvalues of Ω as specified in (11) satisfies L > ε_0; in other words, ε_0 = cL for some c ∈ (0, 1).
The condition on the eigenvalues of Ω as specified in Assumption 4.1 is necessary for arriving at the theoretical results involving the posterior convergence rate of Ω. In this paper, we assume that L is a fixed constant, which can be large. Assumptions 4.3 and 4.4 concern the true precision matrix in a high-dimensional framework; together they ensure that Ω_0 ∈ M_p^+(L), that is, the prior space contains the true precision matrix, which is necessary for efficient learning of the same. Assumption 4.5 ensures that the prior puts sufficient mass around the true zero elements in the precision matrix. The condition on the global scale parameter a is a sufficient one, and is required to obtain the desired posterior convergence rate. We present the main theoretical result for posterior convergence now. A proof can be found in Appendix A.1.

Theorem 4.6. Let X^(n) = (X_1, . . ., X_n)^T be a random sample from a p-dimensional normal distribution with mean 0 and covariance matrix Σ_0 = Ω_0^{-1}, where Ω_0 ∈ U(ε_0, s). Consider the prior specification given by (3). Under Assumptions 4.1-4.5 on the prior, the posterior distribution of Ω satisfies E_0 Π(Ω : ‖Ω − Ω_0‖_F > M ε_n | X^(n)) → 0, for ε_n = n^{-1/2}(p + s)^{1/2}(log p)^{1/2} and a sufficiently large constant M > 0.
Corollary 4.7. Under conditions similar to those in Theorem 4.6 above, the posterior distribution of Ω attains the posterior convergence rate ε_n = n^{-1/2}(p + s)^{1/2}(log p)^{1/2} around Ω_0 with respect to the Frobenius norm under the graphical horseshoe prior as specified in (5).
A proof of Corollary 4.7 is in Supplementary Section S.4; it settles the question of posterior concentration for the graphical horseshoe, which Li et al. (2019) did not address. The posterior convergence rate above directly compares with the rate of convergence of the frequentist graphical lasso estimator (Rothman et al., 2008), and is identical to the posterior convergence rates obtained by Banerjee and Ghosal (2015) and Liu and Martin (2019). However, our work is the first to address unstructured precision matrices, apart from a mild assumption of sparsity, using computationally efficient continuous shrinkage priors. This is in contrast with previous theoretical analyses that imposed restrictive assumptions such as decomposability.

Properties of the MAP Estimator
The MAP estimator of Ω can be found by maximizing a penalized log-likelihood objective Q(Ω), in which each off-diagonal element contributes the horseshoe-like penalty pen_a(ω) = −log log(1 + a/ω²), a > 0. We start by proving that pen_a(ω) is strictly concave (Lemma 4.8), with a proof in Supplementary Section S.5.
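The strict concavity asserted in Lemma 4.8 can be probed numerically on (0, ∞): second differences of pen_a(ω) should be negative on any grid, and by the scale relation pen_a(ω) = pen_1(ω/√a) it suffices to check a few values of a. This is a numerical check, not a proof:

```python
import numpy as np

def pen(omega, a):
    """Horseshoe-like penalty pen_a(omega) = -log log(1 + a / omega^2)."""
    return -np.log(np.log1p(a / omega**2))

for a in (0.5, 1.0, 4.0):
    w = np.linspace(0.05, 20.0, 4000)
    h = w[1] - w[0]
    # Central second differences approximate the second derivative.
    second_diff = (pen(w[2:], a) - 2.0 * pen(w[1:-1], a) + pen(w[:-2], a)) / h**2
    assert second_diff.max() < 0  # strictly concave on this grid
```

Concavity of the penalty in |ω| is exactly the property the LLA majorization argument of the next paragraph needs.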
A direct consequence of Lemma 4.8 is as follows. Let Ω^(t) be the tth iterate of a local linear approximation (LLA) algorithm (Zou and Li, 2008), in which the penalty is replaced by its tangent line at the current iterate. Then Theorem 1 of Zou and Li (2008), together with the strict concavity of the horseshoe-like penalty function from Lemma 4.8, guarantees that the LLA algorithm satisfies an ascent property, that is, Q(Ω^(t+1)) > Q(Ω^(t)); hence the LLA algorithm is a special case of minorize-maximize algorithms. We now present the result on consistency of the MAP estimate obtained under the graphical horseshoe-like prior via an ECM algorithm, with a proof in Appendix A.2.
Theorem 4.9. Under the conditions of Theorem 4.6, the MAP estimator Ω̂_MAP of Ω is consistent, in the sense that ‖Ω̂_MAP − Ω_0‖_F = O_P(ε_n), where ε_n is the posterior convergence rate defined in Theorem 4.6.
The above result guarantees that the MAP estimator also converges to the true precision matrix Ω_0 at the posterior convergence rate in the Frobenius norm. By the triangle inequality, Theorems 4.6 and 4.9 together imply that ‖Ω − Ω̂_MAP‖_F = O_P(ε_n), so that the posterior probability assigned outside an O(ε_n)-neighborhood of the MAP estimator with respect to the Frobenius norm converges to zero. This pleasing correspondence between the fully Bayesian and MAP estimates under the same prior-penalty dual is far from guaranteed, in view of the contradictions pointed out by Castillo et al. (2015) for the lasso in linear models.

Numerical Experiments
We compare the MAP and MCMC estimates under the horseshoe-based methods (GHS, GHS-LIKE-MCMC and GHS-LIKE-ECM) with two frequentist approaches, GLASSO and GSCAD, and one Bayesian approach, the Bayesian graphical lasso (BGL). We consider two problem dimensions: (n, p) = (120, 100) and (120, 200). For each dimension, we perform simulations under four different structures of the true precision matrix Ω_0, as in Li et al. (2019) and Friedman et al. (2008): Random, Hubs, Cliques positive and Cliques negative, as detailed below.

2. Hubs. The rows/columns are partitioned into K disjoint groups G_1, . . ., G_K. The off-diagonal entries ω^0_ij are set to 0.25 if i ≠ j and i, j ∈ G_k for k = 1, . . ., K. In our simulations we consider p/10 groups with an equal number of elements in each group.

3. Cliques positive and Cliques negative. Same as Hubs, except that instead of setting all ω^0_ij with i ≠ j and i, j ∈ G_k, we select 3 members g_k ⊂ G_k within each group and set ω^0_ij = 0.75 for i ≠ j, i, j ∈ g_k for 'Cliques positive', and ω^0_ij = −0.45 for i ≠ j, i, j ∈ g_k for 'Cliques negative.'

For each setting of (n, p) and Ω_0, we generate 50 data sets and estimate the precision matrices by the methods stated above. All three horseshoe-based methods are implemented in MATLAB, GSCAD is as implemented by Wang (2012), and GLASSO is implemented in the R package 'glasso' (Friedman et al., 2018). Starting points for GHS-LIKE-ECM are randomly chosen in order to avoid getting stuck in a local optimum (see details in Supplementary Section S.7), and its global shrinkage parameter is chosen as in Supplementary Section S.6. Tuning parameters for GLASSO and GSCAD are chosen by 5-fold cross validation. The middle 50% posterior credible intervals are used for variable selection for the Bayesian approaches. We provide results on: Stein's loss (= tr(Ω̂Σ_0) − log det(Ω̂Σ_0) − p), Frobenius norm (F norm = ‖Ω̂ − Ω_0‖_F), true and false positive rates for detecting non-zero off-diagonal entries (resp., TPR and FPR), the Matthews correlation coefficient (MCC), and average CPU time. Note that for the fully Bayesian estimate, our theory concerns posterior concentration properties, and the connection with convergence in the Frobenius norm is established in Banerjee and Ghosal (2015). However, for the sake of completeness and comparison with point estimation approaches, we provide variable selection results for all approaches as well, in addition to Stein's loss (an empirical measure of Kullback-Leibler divergence) and F norm, which focus more directly on the entire distribution.
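The evaluation metrics can be written down compactly. The sketch below builds a small 'Hubs'-type Ω_0 (with a unit diagonal, which is our assumption; the text leaves the diagonal unspecified), perturbs it to mimic an estimate, and computes Stein's loss, the Frobenius norm, TPR/FPR and MCC; it also checks that Stein's loss equals twice the Kullback-Leibler divergence between the corresponding zero-mean Gaussians:

```python
import numpy as np

def hubs_precision(p, group_size=10, val=0.25):
    """'Hubs' Omega_0: val on within-group off-diagonals, unit diagonal
    (the diagonal value is an assumption for illustration)."""
    Omega = np.eye(p)
    for start in range(0, p, group_size):
        g = slice(start, start + group_size)
        block = np.full((group_size, group_size), val)
        np.fill_diagonal(block, 1.0)
        Omega[g, g] = block
    return Omega

def stein_loss(Omega_hat, Sigma0):
    """Stein's loss: tr(Omega_hat Sigma0) - log det(Omega_hat Sigma0) - p."""
    M = Omega_hat @ Sigma0
    return np.trace(M) - np.linalg.slogdet(M)[1] - M.shape[0]

def selection_metrics(Omega_hat, Omega0, tol=1e-8):
    """TPR, FPR and MCC for the off-diagonal support recovery."""
    iu = np.triu_indices_from(Omega0, k=1)
    truth = np.abs(Omega0[iu]) > tol
    est = np.abs(Omega_hat[iu]) > tol
    tp = np.sum(est & truth); fp = np.sum(est & ~truth)
    fn = np.sum(~est & truth); tn = np.sum(~est & ~truth)
    tpr, fpr = tp / (tp + fn), fp / (fp + tn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return tpr, fpr, mcc

p = 40
Omega0 = hubs_precision(p)
Sigma0 = np.linalg.inv(Omega0)

# A perfect estimate has zero Stein's loss.
assert abs(stein_loss(Omega0, Sigma0)) < 1e-8

# Stein's loss = 2 * KL( N(0, Sigma0) || N(0, Omega_hat^{-1}) ).
Omega_hat = Omega0 + 0.01 * np.eye(p)
Sigma_hat = np.linalg.inv(Omega_hat)
kl = 0.5 * (np.trace(np.linalg.solve(Sigma_hat, Sigma0)) - p
            + np.linalg.slogdet(Sigma_hat)[1] - np.linalg.slogdet(Sigma0)[1])
assert abs(stein_loss(Omega_hat, Sigma0) - 2.0 * kl) < 1e-8

fnorm = np.linalg.norm(Omega_hat - Omega0, 'fro')
tpr, fpr, mcc = selection_metrics(Omega_hat, Omega0)
print(round(fnorm, 4), tpr, fpr, mcc)
```

The identity between Stein's loss and twice the Kullback-Leibler divergence is the reason the text refers to Stein's loss as an empirical measure of Kullback-Leibler risk.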
It can be clearly seen from Tables 1-4 that the horseshoe-based methods generally perform the best. GHS has the smallest Stein's loss in all settings except in Hubs when (n, p) = (120, 100).
This corroborates the finding of Li et al. (2019) that GHS results in improved Kullback-Leibler risk properties (of which Stein's loss is an empirical measure) when compared to prior densities that are bounded above at the origin, e.g., BGL; it is apparent from the tables that BGL has the worst Stein's loss. For GHS-LIKE-ECM and MCMC, the measures of Stein's loss are generally close to those of GHS, and much better than those of the other competing methods. A similar pattern emerges in the results for the F norm, with the horseshoe-based methods once again outperforming the competitors and performing similarly among themselves. It is worth noting, however, that the fully Bayesian approaches (GHS-LIKE-MCMC and GHS) generally result in the best statistical performance, at the expense of a considerably longer computing time, making the trade-off between fully Bayesian and penalized likelihood approaches apparent.
Coming next to the variable selection results, one may expect the penalized likelihood approaches to shine, since these methods produce exact zeros, unlike the Bayesian approaches that necessitate some form of post-processing. Nevertheless, the Bayesian approaches offer the advantage of controlling the trade-off between TPR and FPR, for example by varying the width of the credible interval. With our chosen mechanism (i.e., a variable is considered not to be selected if the middle 50% credible interval includes zero), GHS-LIKE-MCMC and GHS have the smallest TPR. Nevertheless, the penalized methods also have higher FPR in general (except for GHS-LIKE-ECM), which results in lower MCC overall. In particular, the GSCAD estimate, which is not guaranteed to be positive definite in finite samples (Fan et al., 2016), seems not to work well in general. Additional numerical results investigating the choice of starting values for the GHS-LIKE-ECM algorithm are given in Supplementary Section S.7.

Protein-Protein Interaction Network in B-cell Lymphoma
We analyze Reverse Phase Protein Array (RPPA) data of 33 patients with the lymphoid neoplasm "Diffuse Large B-cell Lymphoma" to infer the protein interaction network; the data set comprises the RPPA measurements for these patients.

Table 2: Mean (sd) Stein's loss, Frobenius norm, true positive rate, false positive rate, and Matthews correlation coefficient of precision matrix estimates over 50 data sets generated from multivariate normal distributions with precision matrix Ω_0, where n = 120 and p = 100. The precision matrix is estimated by the frequentist graphical lasso with penalized diagonal elements (GL1) and with unpenalized diagonal elements (GL2), graphical SCAD (GSCAD), Bayesian graphical lasso (BGL), the graphical horseshoe (GHS), graphical horseshoe-like ECM (ECM) and graphical horseshoe-like MCMC (MCMC). The best performer in each row is shown in bold. Average CPU time is in seconds.

The estimated sparsity (% of zero elements) and the number of non-zeros in the lower triangle of the estimates are given in Table 5. We note that GHS-LIKE-MCMC gives the sparsest estimate, almost 4% sparser than GHS. This is consistent with prior studies that found robust gene networks to be typically sparse (Leclerc, 2008). As in the simulations, GSCAD performs the worst.
To compare with a prior analysis of the same data set, we use the PRECISE framework of Ha et al. (2018). This method can infer directed edges, but we ignore the directionality since we are interested in interactions and not causation. The proportions of edges in the estimates that 'agree' and 'do not agree' with the edges inferred using the PRECISE framework are presented in Table 6.
Protein networks realized from the estimates are presented in Figure 2. It can be seen that GHS-LIKE-MCMC has the sparsest estimate among the methods that allow for interactions across all proteins, unlike the PRECISE framework, which ignores interactions among proteins in different pathways, a restriction that may not be biologically justifiable.
Table 4: Mean (sd) Stein's loss, Frobenius norm, true positive rate, false positive rate, and Matthews Correlation Coefficient of precision matrix estimates over 50 data sets generated by multivariate normal distributions with precision matrix $\Omega_0$, where n = 120 and p = 200. The precision matrix is estimated by frequentist graphical lasso with penalized diagonal elements (GL1) and with unpenalized diagonal elements (GL2), graphical SCAD (GSCAD), Bayesian graphical lasso (BGL), the graphical horseshoe (GHS), graphical horseshoe-like ECM (ECM), and graphical horseshoe-like MCMC (MCMC). The best performer in each row is shown in bold. Average CPU time is in seconds.

Concluding Remarks
Our main contribution in this paper is twofold: first, we propose a fully analytical prior-penalty dual, termed the graphical horseshoe-like, for inference in graphical models; and second, we provide the first optimality results for both the frequentist point estimate and the fully Bayesian posterior. As a consequence, we also establish the first Bayesian optimality results for the graphical horseshoe prior of Li et al. (2019). Our simulation studies clearly establish that the family of horseshoe-based priors performs best among state-of-the-art competitors across a wide range of data generating mechanisms, and suggest a potential trade-off between computational burden and statistical performance vis-à-vis penalized likelihood and fully Bayesian procedures. Our analysis of the RPPA data establishes the proposed approach as an effective regularizer of a gene interaction network, useful for identifying key interactions in the disease etiology of cancer.
Although we focus on the estimation of $\Omega$, two other important aspects of network inference are edge selection and the associated uncertainty quantification. Here we use posterior credible intervals for edge selection, but it might be interesting to incorporate other methods that have been proposed for variable selection with shrinkage priors, such as 2-means (Bhattacharya et al., 2015) or shrinkage factor thresholding (Tang et al., 2018), with appropriate modifications. On a related note, it will be interesting to establish the Bayes risk and the oracle risk under 0-1 loss, and we conjecture that global-local shrinkage priors will attain such oracular risk under suitable assumptions on the prior tails and the global shrinkage parameter. Finally, it will be worth investigating whether one can extend the methods to generalized linear models, e.g. graphical models with exponential families as node-conditional distributions (Yang et al., 2012). It has been shown that, when restricting the response distribution to natural exponential families with quadratic variance functions, shrinkage estimators enjoy certain optimality properties (Xie et al., 2016), and it remains to be settled whether similar properties hold for graphical models as well.
(iii) the probability of the complement of the above sieve is exponentially small, that is, $\Pi(\mathcal{P}_n^{c}) \le \exp(-c\, n\epsilon_n^2)$, for some constant $c > 0$.
The above three parts together give the posterior convergence rate $\epsilon_n$ with respect to the Hellinger distance on the space of densities of the precision matrix. Owing to the intrinsic relationship between the Hellinger distance and the Frobenius distance for precision matrices, as given by Lemma A.1 of Banerjee and Ghosal (2015), we get the desired posterior convergence rate.
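For reference, the three parts correspond to the standard posterior contraction framework of Ghosal, Ghosh and van der Vaart (2000); in the notation of this proof they read, with generic constants $c_1, c_2, c_3 > 0$ (a summary sketch, not the paper's exact displays):

```latex
\begin{align*}
&\text{(i) prior concentration:} && \Pi\bigl(B(p_{\Omega_0}, \epsilon_n)\bigr) \ge \exp(-c_1\, n\epsilon_n^2),\\
&\text{(ii) sieve metric entropy:} && \log N(\epsilon_n, \mathcal{P}_n, d_H) \le c_2\, n\epsilon_n^2,\\
&\text{(iii) complement probability:} && \Pi(\mathcal{P}_n^{c}) \le \exp(-c_3\, n\epsilon_n^2).
\end{align*}
```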
We first define $B(p_{\Omega_0}, \epsilon_n)$, the $\epsilon_n^2$-neighborhood of the true density in the Kullback-Leibler sense.
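A standard form of this neighborhood, consistent with the usage above (writing the Kullback-Leibler divergence and variation as integrals against the true density; a sketch of the conventional definition):

```latex
B(p_{\Omega_0}, \epsilon_n)
  = \Bigl\{ p :\ \int p_{\Omega_0}\log\frac{p_{\Omega_0}}{p}\,d\mu \le \epsilon_n^2,\quad
    \int p_{\Omega_0}\Bigl(\log\frac{p_{\Omega_0}}{p}\Bigr)^{2}d\mu \le \epsilon_n^2 \Bigr\}.
```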
Then, using Lemma S.3.1, and since $\epsilon_n^2$ is small, we have $\max_{1 \le i \le p} |1 - d_i| < 1$. Using (13), it follows that, for a sufficiently small constant $c_1 > 0$, the prior concentration condition reduces to a bound on the prior probability of a neighborhood of $\Omega_0$. The proposed prior on $\Omega$ has a bounded spectral norm; however, such a constraint can only increase the prior concentration, since $\Omega_0 \in \mathcal{U}(\varepsilon_0, s)$ with $\varepsilon_0 < L$. Hence, we may pretend componentwise independence of the elements of $\Omega$, so that the above expression simplifies into a product of marginal prior probabilities. From equation (S.4) in Lemma S.3.2, we have the required marginal bounds for all $1 \le i, j \le p$, with $(p+s)$ contributing terms. The prior concentration rate condition thus gives $(p + s)\{\log p + \log(1/\epsilon_n)\} \lesssim n\epsilon_n^2$, so as to yield $\epsilon_n = n^{-1/2}(p + s)^{1/2}(\log n)^{1/2}$.

(ii) Choosing the sieve and controlling metric entropy.
We now carefully choose a sieve in the space of prior densities to control its Hellinger metric entropy. Consider the sieve $\mathcal{P}_n$ such that the number of elements of $\Omega$ exceeding $\delta_n = \epsilon_n/p^{\nu}$, $\nu > 0$, is at most $\bar{r}_n$, and the absolute values of the entries of $\Omega$ are at most $B$, for some sufficiently large $B > 0$ chosen so that the metric entropy condition is satisfied. Choosing $\bar{r}_n \sim k_1 n\epsilon_n^2/\log n$, $k_1 > 0$, and $B \sim k_2 n\epsilon_n^2$, $k_2 > 0$, the $\epsilon_n$-metric entropy of $\mathcal{P}_n$ with respect to the Hellinger distance is bounded by a constant multiple of $n\epsilon_n^2$. Thus, the rate $\epsilon_n$ obtained via the prior concentration rate calculation satisfies the metric entropy condition as well.
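In symbols, a sieve matching the verbal description above can be written as follows (a sketch; the exact constants and truncation are as in the text):

```latex
\mathcal{P}_n = \Bigl\{ p_{\Omega} :\ \Omega \in \mathcal{M}_p^{+}(L),\
  \#\{(i,j): i < j,\ |\omega_{ij}| > \delta_n\} \le \bar{r}_n,\
  \max_{i,j}|\omega_{ij}| \le B \Bigr\},
\qquad \delta_n = \epsilon_n/p^{\nu}.
```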
(iii) Controlling probability of the complement of the sieve.
The task of controlling the probability of the complement of the sieve can be divided into two sub-parts. We calculate the probabilities on the right-hand side of the corresponding decomposition under an unconstrained prior, and then account for the truncation used in the prior for $\Omega$, given by $\Omega \in \mathcal{M}_p^{+}(L)$, by finding a suitable lower bound for the event $\{0 < L^{-1} < \|\Omega\|_{(2,2)} < L < \infty\}$. Let us denote the prior under the unconstrained case by $\Pi^{*}$.
Using results on bounding the binomial CDF as in Song and Liang (2017), we obtain the first bound for some $C > 0$. From (S.5) in Lemma S.3.2, for the choice of $B \sim k_2 n\epsilon_n^2$ outlined in the metric entropy condition above, we obtain the second. Combining equations (14) and (15), we get, for a suitable constant $c_3 > 0$, an exponentially small bound; combining (16) and (S.8), we get the analogous bound for a suitable constant $c_4 > 0$. Hence, the complement of the chosen sieve has exponentially small prior probability. Thus, $\epsilon_n = n^{-1/2}(p + s)^{1/2}(\log n)^{1/2}$ is the posterior convergence rate, and the result is established.

A.2 Proof of Theorem 4.9
Consider the MAP estimator $\hat{\Omega}_{\mathrm{MAP}}$ of the precision matrix corresponding to the graphical horseshoe-like prior, as outlined in Section 4.
Here, $\Omega_0 = ((\omega_{ij,0}))$ is the true precision matrix, $\Sigma_0 = ((\sigma_{ij,0}))$ is the true covariance matrix, and $S = ((s_{ij}))$ is the natural estimator of the covariance. Consider $Q(\Omega)$ as defined in (12). If we can show the required boundary condition for some small $\varepsilon > 0$, then there exists a local maximizer $\hat{\Omega}$ such that $\|\hat{\Omega} - \Omega_0\|_2 = O_P(\epsilon_n)$. Let us denote $(2/n)\,\mathrm{pen}(\omega_{ij})$ by $p_n(\omega_{ij})$. By a Taylor series expansion of the logarithm of the determinant of a matrix, plugging into (17), we obtain the expression for $h$, which decomposes into three terms I, II, and III; we now bound these separately.

For bounding I: using Boole's inequality and Lemma S.3.7, with probability tending to one, the first term on the RHS of display (19) is bounded by $C_1 (\log p/n)^{1/2} \|\Delta^{-}\|_1$. By the Cauchy-Schwarz inequality and Lemma S.3.7, again with probability tending to one, we bound the remaining term. Combining the bounds above gives, with probability approaching one, a bound for expression I.

Now we proceed to find suitable bounds for expression II. Note that II is upper bounded by the negative of the minimum of a quadratic form in $\mathrm{vec}(\Delta)$. Using the result that $\min_{\|x\|_2 = 1} x^{T} A x = \mathrm{eig}_1(A)$, we obtain the stated bound with probability tending to one; the last inequality follows from the fact that $\|\Delta\|_{(2,2)} \le \|\Delta\|_2 = o(1)$.

Finally, we proceed to find suitable bounds for expression III. Let us denote the set $S = \{(i, j): \omega_{ij,0} = 0,\ i < j\}$. This set comprises the indices in the upper triangle of the true precision matrix whose entries are exactly zero; the complement of $S$ consists of the indices of the non-zero entries in the upper triangle. We can partition expression III (without the negative sign) over $S$ and $S^{c}$. The last inequality follows from the fact that $\mathrm{pen}(\theta) \to -\infty$ as $|\theta| \to 0$, and hence the first term in the resulting expression is larger than $M/n$ for some large $M > 0$.
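The variational identity invoked for bounding II, $\min_{\|x\|_2=1} x^{T}Ax = \mathrm{eig}_1(A)$, can be checked numerically. A small sketch follows; the matrix here is an arbitrary symmetric stand-in, not the quantity from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2  # arbitrary symmetric test matrix

# eigh returns eigenvalues in ascending order, so index 0 is the smallest.
eigvals, eigvecs = np.linalg.eigh(A)
lam_min, v_min = eigvals[0], eigvecs[:, 0]

# The quadratic form attains lam_min at the corresponding unit eigenvector...
attained = v_min @ A @ v_min

# ...and never falls below it on the unit sphere (spot check with random directions).
x = rng.standard_normal((5, 10000))
x /= np.linalg.norm(x, axis=0)
sphere_min = np.min(np.einsum("ij,ij->j", x, A @ x))
```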
This implies the displayed bound. By a Taylor series expansion of $p_n(\omega_{ij,0} + \delta_{ij})$ around $\omega_{ij,0}$ ($\neq 0$), and since $-x \le |x|$, we can bound the penalty difference accordingly. Combining equations (20), (21), and (23), we have, with probability tending to one, an upper bound that is negative up to a factor of $(1 + o(1))$ for $M$ sufficiently large, owing to the fact that the last two terms inside the bracket are $o(1)$, as $\min_{(i,j)\in S^{c}} |\omega_{ij,0}|$ is bounded away from zero. This completes the proof.
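The expansion used here is the standard second-order one; a sketch of the generic form, with $\tilde{\omega}_{ij}$ lying between $\omega_{ij,0}$ and $\omega_{ij,0}+\delta_{ij}$ (the paper's exact remainder form may differ):

```latex
p_n(\omega_{ij,0} + \delta_{ij})
  = p_n(\omega_{ij,0}) + p_n'(\omega_{ij,0})\,\delta_{ij}
    + \tfrac{1}{2}\, p_n''(\tilde{\omega}_{ij})\,\delta_{ij}^{2}.
```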

Figure 1 :
Figure 1: Smoothed density estimates of a randomly chosen off-diagonal element based on $10^4$ Markov chain Monte Carlo samples for the graphical horseshoe-like (dashes), the graphical horseshoe (small dashes), and the Bayesian graphical lasso (solid) priors, providing a visual comparison of (left) spikes near the origin, (middle) heaviness of the tails, and (right) the induced penalty functions, for p = 10.
$(n, p)$: sample size and number of variables, respectively.

Figure 2 :
Figure 2: (a), (b), (c), and (d) correspond to RPPA networks for GHS-LIKE-ECM, GHS-LIKE-MCMC, GHS, and PRECISE, respectively. The nodes, numbered from 1 to 67, are proteins. The map between node numbers and protein names is given in Supplementary Table S.2.
However, this condition does not affect the practical implementation of our proposed method, and is used purely as a technical requirement, so that we work only with $\Omega$ and $\Sigma$ that are bounded away from singular matrices. Beyond this, no structural assumptions, such as decomposability, are placed on either $\Omega$ or $\Sigma$. Similar assumptions have been made in related works; see Liu and Martin (2019) and Lee et al. (2021). Assumption 4.2 implies that the dimension grows to infinity as the sample size $n \to \infty$, but at a slower rate than $n$. Additionally, the condition on the effective dimension ensures that the posterior convergence rate goes to zero as $n \to \infty$. Similar conditions are necessary in proving the contraction results in other related works; see, for example, Banerjee and Ghosal (2015); Lee et al. (2021); Liu and Martin (2019). Assumption 4.3 implies that the true precision matrix $\Omega_0$ is sparse, and has eigenvalues that are bounded away from zero and infinity. Similar conditions are common in the literature on large precision matrix estimation problems; see, for example, Banerjee and Ghosal (2014, 2015); Lee et al. (2021); Liu and Martin (2019); among others. Assumption 4.4 is crucial in learning the

Table 1 :
Mean (sd) Stein's loss, Frobenius norm, true positive rate, false positive rate, and Matthews Correlation Coefficient of precision matrix estimates over 50 data sets generated by multivariate normal distributions with precision matrix $\Omega_0$, where n = 120 and p = 100.

Table 3 :
Mean (sd) Stein's loss, Frobenius norm, true positive rate, false positive rate, and Matthews Correlation Coefficient of precision matrix estimates over 50 data sets generated by multivariate normal distributions with precision matrix $\Omega_0$, where n = 120 and p = 200. The precision matrix is estimated by frequentist graphical lasso with penalized diagonal elements (GL1) and with unpenalized diagonal elements (GL2), graphical SCAD (GSCAD), Bayesian graphical lasso (BGL), the graphical horseshoe (GHS), graphical horseshoe-like ECM (ECM), and graphical horseshoe-like MCMC (MCMC). The best performer in each row is shown in bold. Average CPU time is in seconds.

Table 5 :
Percentage of zeros (% Sparsity) and number of non-zero entries (NNZ) in the lower triangle of the precision matrix estimate of RPPA data for the competing approaches.

Table 6 :
Proportion of edges that 'agree' (AE) and 'do not agree' (NE) with the edges inferred using the PRECISE framework.
For the first term on the RHS above, we have $\bar{r}_n \log\{\bar{r}_n/(p^{\nu}\epsilon_n)\} \ge \bar{r}_n \log \bar{r}_n + b_1 \bar{r}_n \log p$, since $\epsilon_n \le$

Table S.1: Mean (sd) Stein's loss, Frobenius norm, true positive rate, false positive rate, and Matthews Correlation Coefficient of precision matrix estimates for GHS-LIKE-ECM over 50 data sets with p = 100 and n = 50. The best performer in each row is shown in bold. Average CPU time is in seconds.

Table S.2: Map between node numbers and protein names in Figure 2.