Consistent Group Selection with Bayesian High Dimensional Modeling

In many applications with high dimensional covariates, the covariates are naturally structured into different groups which can be used to perform efficient statistical inference. We propose a Bayesian hierarchical model with a spike and slab prior specification to perform group selection in high dimensional linear regression models. While several penalization methods and, more recently, some Bayesian approaches have been proposed for group selection, theoretical properties of the Bayesian approaches have not been studied extensively. In this paper, we provide novel theoretical results for group selection consistency under spike and slab priors which demonstrate that the proposed Bayesian approach has advantages over penalization approaches. Our theoretical results accommodate flexible conditions on the design matrix and apply to commonly used statistical models, such as nonparametric additive models, for which very limited theoretical results are available for Bayesian methods. A shotgun stochastic search algorithm is adopted for the implementation of our proposed approach. We illustrate through simulation studies that the proposed method has better performance for group selection than a variety of existing methods.


Introduction
Variable selection is a crucial statistical tool, especially in high dimensional data settings, as it provides interpretability of the learned model and often improves prediction power by removing irrelevant predictors. Several variable selection methods from frequentist and Bayesian viewpoints have been proposed in the literature. Popular penalty based methods include the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996), bridge penalization (Frank and Friedman, 1993), the smoothly clipped absolute deviation (SCAD) estimator (Fan and Li, 2001), and the minimax concave penalty (MCP) estimator (Zhang, 2010). Several Bayesian variable selection methods have been developed in the literature with a variety of prior structures (George and McCulloch, 1993, 1997; Ishwaran and Rao, 2005; Johnson and Rossell, 2012; Narisetty and He, 2014; Ročková and George, 2014, 2018; Spitzner, 2019).
In many applications, the predictors under consideration naturally exhibit a grouping structure due to their inherent similarities. For example, in gene expression data, genes can be classified into different groups based on the phenotypic traits they control; in stock market data, stocks from the same sector form a group. Group structure also arises naturally in many statistical models. In multiple factor analysis of variance (ANOVA) models, dummy variables encoding the same factor form a group; in polynomial regressions or nonparametric additive models, the basis functions involving the same predictor become a group. Variable selection is thus translated to group selection in these problems as we desire to perform selection at the group level. Under these scenarios, traditional variable selection methods which do not incorporate the group information would not be efficient (Breheny and Huang, 2009). Incorporating the grouping structure naturally available in the predictors or implied by the statistical model helps in performing more precise variable selection. Therefore, it is important to develop variable selection methods accounting for group structure. While there are several existing frequentist methods and studies of their theoretical properties available for group selection (Yuan and Lin, 2006; Bach, 2008; Wang and Leng, 2008; Nardi and Rinaldo, 2008), the Bayesian approaches and their theoretical properties have been much less explored. Given the recent insights about the advantages of Bayesian approaches for variable selection and their model selection consistency properties under weaker conditions compared to penalization approaches (Johnson and Rossell, 2012; Narisetty and He, 2014), it is important to investigate them for group selection.
Let us consider the linear regression model with G groups:

Y = Σ_{g=1}^G X_g β_g + ε,   (1.1)

where Y is an n × 1 response vector, ε ∼ N_n(0, σ²I), and X_g and β_g are respectively the n × m_g design matrix and the m_g × 1 coefficient vector corresponding to the g'th group. The group Lasso estimator (Yuan and Lin, 2006) was proposed to perform group selection and is defined as the minimizer of the objective function

(1/2) ||Y − Σ_{g=1}^G X_g β_g||₂² + λ Σ_{g=1}^G √(m_g) ||β_g||₂.

This is a natural extension of the Lasso (Tibshirani, 1996) obtained by applying the L1 penalty to the L2 norms of the group coefficients. The theoretical properties of the group Lasso estimator were studied in Bach (2008) and Nardi and Rinaldo (2008). Group selection consistency is only guaranteed under restrictive correlation conditions which extend the irrepresentable condition (Zhao and Yu, 2006) of Lasso. To improve on this, Wang and Leng (2008) and Nardi and Rinaldo (2008) studied an adaptive group Lasso method that generalizes the adaptive Lasso (Zou, 2006) to the group setting. With adaptive penalty weights on each group, the adaptive group Lasso achieves group selection consistency under more flexible conditions. Non-convex group penalized methods like group SCAD and group MCP can be formulated similarly by applying the corresponding penalty to the L2 norms of the group coefficients. Various penalized methods have also been proposed to perform sparse group selection. These include group bridge, sparse-group Lasso (Simon et al., 2013), and group exponential Lasso (Breheny, 2015), among others. An extensive review of penalized group selection approaches is provided in Huang et al. (2012).
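As a concrete illustration (not part of the original formulation), the group Lasso objective with the usual √m_g weights can be evaluated in a few lines; the data and tuning value below are arbitrary.

```python
import numpy as np

def group_lasso_objective(Y, X_groups, beta_groups, lam):
    # Squared-error loss plus the L1 penalty applied to the L2 norms of
    # the group coefficients, each weighted by sqrt(group size).
    resid = Y - sum(X @ b for X, b in zip(X_groups, beta_groups))
    penalty = sum(np.sqrt(len(b)) * np.linalg.norm(b) for b in beta_groups)
    return 0.5 * (resid @ resid) + lam * penalty

rng = np.random.default_rng(0)
n, sizes = 20, [2, 3]
X_groups = [rng.standard_normal((n, m)) for m in sizes]
beta_true = [np.array([1.0, -1.0]), np.zeros(3)]   # second group inactive
Y = sum(X @ b for X, b in zip(X_groups, beta_true))
obj_true = group_lasso_objective(Y, X_groups, beta_true, lam=1.0)   # loss 0, penalty 2
obj_null = group_lasso_objective(Y, X_groups, [np.zeros(m) for m in sizes], lam=1.0)
```

Note that an inactive group contributes nothing to the penalty, which is what drives selection at the group level.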
To formulate the group selection problem in the Bayesian framework, multivariate-Laplacian priors over each group of coefficients were considered by Raman et al. (2009) and Kyung et al. (2010). This approach is often referred to as Bayesian group Lasso. The maximum a posteriori (MAP) estimator of Bayesian group Lasso is equivalent to the group Lasso estimator; however, Bayesian group Lasso provides the entire posterior distribution on the parameter space as opposed to the single point estimator provided by group Lasso. The posterior distribution has the flexibility to incorporate prior knowledge and can be used in more general ways (such as for posterior means and uncertainty estimation) as opposed to the one summary statistic given by the MAP estimator. However, model selection consistency results for the Bayesian group Lasso procedure have not been studied extensively. Hernández-Lobato et al. (2013) introduced generalized spike-and-slab priors for group selection problems and used expectation propagation to perform approximate inference. Xu and Ghosh (2015) adopted the prior of Bayesian group Lasso as the slab prior and introduced Bayesian group Lasso with spike and slab priors (BGL-SS). Under the orthogonal design, the posterior median estimator of BGL-SS was shown to have the oracle property (Fan and Li, 2001). However, no theoretical results have been established under general designs beyond the orthogonal design.
In this paper, we propose spike and slab priors to perform model selection at the group level in the Bayesian framework. Spike and slab priors are imposed on the group coefficients rather than on individual coefficients. Meanwhile, a binary latent variable is explicitly introduced for every group to indicate whether the corresponding group is active or not. We establish strong selection consistency (Johnson and Rossell, 2012) of our method under a general setup allowing both the number of groups and the size of the true model to go to infinity. We then consider applications of the proposed method to special cases of the group selection problem, including nonparametric additive models and seemingly unrelated regressions, and demonstrate that strong selection consistency continues to hold for these models. To the best of our knowledge, Shang and Li (2014) provide the only existing result for model selection consistency of Bayesian nonparametric additive models in high dimensions. However, they established strong selection consistency of their method only when the sparsity level is correctly specified through a model hyperparameter, which is usually not available in practice.
We place a point mass at zero as the spike prior and a multivariate normal with diagonal covariance matrix as the slab prior. Though this prior is similar to generalized spike-and-slab priors and BGL-SS, our method differs from them in two aspects. First, following the same idea of Narisetty and He (2014), we suggest that the slab prior should be sample size dependent to achieve appropriate shrinkage. With this specification, our proposed method is shown to have strong selection consistency under more general designs. Second, we perform group selection based on the inference on the latent binary variables rather than the group coefficients. We propose a shotgun stochastic search algorithm similar to the one by Hans et al. (2007) to search for the MAP estimator in the model space rather than generating samples from the posterior. Our algorithm includes a deterministic searching layer on top of the algorithm of Hans et al. (2007) to make it explore a larger model space and to converge in fewer iterations.
In the theoretical analysis, we consider a quite flexible setup to accommodate both realistic design situations as well as commonly used statistical models such as the nonparametric additive models. Our method is shown to be able to asymptotically select all the groups containing any active individual predictors. Our theoretical results are stronger than any of the existing Bayesian group selection consistency results (Shang and Li, 2014;Xu and Ghosh, 2015) in terms of the weakness of assumptions and the generality of the results. Compared with traditional variable selection methods ignoring the group structure, our proposed group selection method achieves consistency under weaker conditions on the signal strength as it makes use of the group information. When there exists a large within-group correlation, it is likely that traditional variable selection methods will select only some of the individual predictors in the group and ignore the others. In contrast, our theoretical results show that our proposed method will not suffer from a large within-group correlation and performs well under this scenario.
The rest of the paper is organized as follows. We introduce the proposed method and theoretical results in Section 2. The application of our proposed method to some special cases is discussed in Section 3. Algorithms for the implementation of the method are discussed in Section 4. Simulation and real data studies are conducted in Sections 5 and 6, respectively. Finally, we conclude in Section 7.

Bayesian Group Selection
Consider the regression model (1.1) and let m = (m_1, . . . , m_G) denote the numbers of individual variables within each group, so that p = Σ_g m_g is the total number of individual variables. For simplicity, we assume the design matrix of every group, X_g, g = 1, . . . , G, to be of full rank. Furthermore, we center Y and normalize every column of the design matrix X = (X_1, . . . , X_G). We introduce a latent binary random vector Z = (Z_1, . . . , Z_G) with entries equal to either 1 (active) or 0 (inactive) to indicate selection of groups.

Model
Our Bayesian hierarchical model is given by:

Y | β, σ² ∼ N_n(Σ_{g=1}^G X_g β_g, σ²I),
β_g | Z_g = 0 ∼ δ_0(β_g),
β_g | Z_g = 1 ∼ N(0_{m_g}, σ²τ²I_{m_g}),
P(Z_g = 1) = 1 − P(Z_g = 0) = q,   (2.1)

where g = 1, . . . , G and τ², q are prior parameters which depend on n. In the model, we use δ_0(β_g), a point mass at zero, as the spike prior and N(0_{m_g}, σ²τ²I_{m_g}) as the slab prior. We treat τ² and q as tuning parameters and let them go to infinity and to zero, respectively, along with n to achieve appropriate sparsity at the group level. We shall perform group selection based on the MAP estimator of the posterior distribution of Z.
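A minimal sketch of drawing from the prior hierarchy in (2.1); the group sizes and hyperparameter values below are arbitrary choices for illustration.

```python
import numpy as np

def sample_prior(sizes, q, tau2, sigma2, rng):
    # Z_g ~ Bernoulli(q); beta_g is the point mass delta_0 when Z_g = 0
    # and a draw from the N(0, sigma^2 tau^2 I) slab when Z_g = 1.
    Z = rng.random(len(sizes)) < q
    beta = [rng.standard_normal(m) * np.sqrt(sigma2 * tau2) if z else np.zeros(m)
            for z, m in zip(Z, sizes)]
    return Z, beta

rng = np.random.default_rng(1)
Z, beta = sample_prior(sizes=[2, 3, 4], q=0.5, tau2=4.0, sigma2=1.0, rng=rng)
# every inactive group sits exactly at zero under the point-mass spike
inactive_all_zero = all(not np.any(b) for z, b in zip(Z, beta) if not z)
```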
We work with Model (2.1) for its simplicity and convenience in studying the theoretical properties. However, several alternative priors and further hierarchies on the hyperparameters of Model (2.1) can be considered in practice. For instance, a Beta prior on q and a Gamma prior on τ² can be specified. For the slab prior, one can alternatively adopt an objective Zellner's g-prior (Zellner, 1986), that is, N(0_{m_g}, σ²τ²(X_g'X_g)^{−1}). This formulation is equivalent to first transforming X_g, the design matrix of the g'th group, to X_g(X_g'X_g)^{−1/2}, and then placing the original slab prior described in (2.1) on the new group coefficients corresponding to X_g(X_g'X_g)^{−1/2}. By this procedure, the design matrix for each group is orthonormalized, so the choice between the two slab priors is a matter of whether orthonormalization within the group is desired. The distinction between the two choices of slab prior will be discussed further later.
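The within-group orthonormalization X_g → X_g(X_g'X_g)^{−1/2} mentioned above can be computed, for instance, via an eigendecomposition of the Gram matrix; this is an illustrative sketch with arbitrary dimensions.

```python
import numpy as np

def orthonormalize_group(Xg):
    # Compute (Xg' Xg)^{-1/2} from the eigendecomposition of the Gram
    # matrix, then return Xg (Xg' Xg)^{-1/2}, whose columns have an
    # identity Gram matrix; an i.i.d. slab on the new coefficients then
    # matches a Zellner g-prior on the original ones.
    vals, vecs = np.linalg.eigh(Xg.T @ Xg)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return Xg @ inv_sqrt

rng = np.random.default_rng(2)
Xg = rng.standard_normal((30, 4))      # a full-rank group design
Xt = orthonormalize_group(Xg)
gram = Xt.T @ Xt                       # should be the 4 x 4 identity
```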
In our implementation, we use the sample size for selecting the hyperparameters of Model (2.1), so our approach is not a completely subjective Bayesian approach, as the prior may not be interpreted entirely as a representation of prior knowledge. The subjective aspect of our prior comes from the form of the prior specification that induces group sparsity, while the specific values for the hyperparameters are not chosen in a subjective manner. Motivated by the difficulty in high dimensions of having prior belief about the distribution of a high dimensional parameter in its entirety, several authors (George and Foster, 2000; Park and Casella, 2008; Scott and Berger, 2010; Xu and Ghosh, 2015; Ročková and George, 2014, 2016; Gan et al., 2018) take this approach for selecting hyperparameters while incorporating prior belief through the prior structures. These approaches, as well as our proposal, can be considered under the broad umbrella of objective priors, which include a wide range of priors that could be data-dependent implicitly or explicitly, such as Jeffreys' prior (Jeffreys, 1961), the reference prior (Bernardo, 1979), and Zellner's g-prior (Zellner, 1986).
Our approach of specifying hyperparameters that depend on the sample size is quite similar to empirical Bayes strategies, which specify hyperparameters based on the data and can therefore also be sample size dependent. Empirical Bayes strategies use more specific information from the data compared to our approach, as they often employ a likelihood maximization strategy. For instance, George and Foster (2000) estimated both the hyperparameter in their g-prior and the prior inclusion probability based on a marginal or conditional maximum likelihood criterion; Scott and Berger (2010) specified the prior inclusion probability by marginal maximum likelihood; Park and Casella (2008), Kyung et al. (2010), and Xu and Ghosh (2015) specified the parameters of their hyperpriors within a Monte Carlo expectation-maximization (EM) algorithm. In all these approaches, hyperparameters are chosen by maximizing a certain form of likelihood that requires access to the full data, while our specification of the hyperparameters depends only on the sample size.
These objective approaches are in contrast with fully subjective Bayesian approaches such as Spitzner (2019), where prior distributions are entirely characterized based on prior knowledge before the data are collected. If specific prior knowledge is available that allows fixing the prior hyperparameters at particular values, this can certainly be done in our framework and the approach would be fully Bayesian. Another alternative is to place a prior on the hyperparameters, but the selection of its own hyperparameters is also crucial since the procedures can be sensitive to those parameters in high dimensions.
We now introduce some of the notations to be used in this paper.
Model indices: Let k, a binary vector of length G with entries equal to either 1 or 0, index the active groups present in a regression model. For instance, the regression model Y = X_2β_2 + X_3β_3 + ε is indexed by the binary vector (0, 1, 1, 0, . . . , 0) of length G. Let t denote the index of the true model, which has 1's for all the groups containing active individual variables. For example, if we have 6 variables divided into three groups with m_1 = m_2 = m_3 = 2 and the active variables are the first 3 variables, which lie in groups 1 and 2, then t = (1, 1, 0).
Model symbols: For a regression model indexed by k, denote X(k), r_k, P(k), β(k), and m(k) as the corresponding design matrix, the rank of X(k), the projection matrix corresponding to X(k), the regression coefficients, and the vector of group sizes, respectively. For a vector, the operator | · | outputs its L1 norm. For example, |k| = Σ_g k_g is the number of active groups in the model indexed by k, and |m(k)| = Σ_g [m(k)]_g is the number of active individual variables in the model indexed by k.

Posterior Distribution of Z
We now proceed to derive the marginal posterior distribution of Z as group selection will be performed based on the corresponding MAP estimator. As will be seen later, this marginalization would be helpful not only in our theoretical analysis but also for computation.
The posterior probability of the model indexed by k can be computed by summing and integrating out β from the joint posterior distribution of β and Z, which gives

P(Z = k | Y) ∝ Q_k [q/(1 − q)]^{|k|} exp(−R̃_k/(2σ²)),   (2.2)

where Q_k = |I + τ²X(k)'X(k)|^{−1/2} and R̃_k = Y'Y − Y'X(k)(X(k)'X(k) + τ^{−2}I)^{−1}X(k)'Y. In the posterior probability (2.2), R̃_k can be interpreted as a regularized residual sum of squares; [q/(1 − q)]^{|k|} accounts for the number of active groups in the model indexed by k; and Q_k represents the variation within the design matrix X(k). To see the last point, rewrite Q_k as |I + τ²X(k)X(k)'|^{−1/2} by Sylvester's determinant theorem. Recall from principal components analysis that |X(k)'X(k)| is the product of the variances of the principal components of X(k). Including more groups in the model would introduce more variation within the design matrix X(k). A similar expression of the posterior probability without a group structure is obtained by imposing Bayesian shrinking and diffusing priors (Narisetty and He, 2014) on the regression coefficients.
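The following sketch evaluates an unnormalized log version of the three factors described above for a given set of active groups; the exact constants are a modeling assumption of this sketch, and the data and hyperparameters are arbitrary illustrations.

```python
import numpy as np

def log_post(Y, Xk, n_active_groups, q, tau2, sigma2):
    # Size factor [q/(1-q)]^{|k|} on the log scale.
    size_term = n_active_groups * np.log(q / (1.0 - q))
    if Xk.shape[1] == 0:                  # null model: no active groups
        return size_term - (Y @ Y) / (2.0 * sigma2)
    m = Xk.shape[1]
    gram = Xk.T @ Xk
    # Regularized residual sum of squares.
    reg_rss = Y @ Y - (Y @ Xk) @ np.linalg.solve(gram + np.eye(m) / tau2, Xk.T @ Y)
    # log Q_k = -0.5 log|I + tau^2 X(k)'X(k)|.
    log_Qk = -0.5 * np.linalg.slogdet(np.eye(m) + tau2 * gram)[1]
    return log_Qk + size_term - reg_rss / (2.0 * sigma2)

rng = np.random.default_rng(3)
n = 50
X1 = rng.standard_normal((n, 2))
Y = X1 @ np.array([3.0, 3.0]) + rng.standard_normal(n)   # group 1 truly active
lp_true = log_post(Y, X1, 1, q=0.1, tau2=10.0, sigma2=1.0)
lp_null = log_post(Y, np.empty((n, 0)), 0, q=0.1, tau2=10.0, sigma2=1.0)
```

With a strong signal, the model containing the active group receives a much larger unnormalized log posterior than the null model.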
Remark. Searching for the model that maximizes the posterior probability (2.2) becomes the familiar problem of trading off the residual sum of squares against the model size. We now discuss whether orthonormalization within the group is needed. If no orthonormalization is performed, within-group correlation is favored in the sense that, given the regularized residual sum of squares and the model size, the posterior probability (2.2) is larger when the within-group correlation is larger. To see this, suppose k indexes a one-group model, that is, |k| = 1, and the correlation between every pair of individual variables equals ρ, so that X(k)'X(k) = n[(1 − ρ)I + ρJ_{|m(k)|}], where J_{|m(k)|} denotes the all-ones matrix. We then have Q_k = [(1 + nτ²(1 − ρ))^{|m(k)|−1}(1 + nτ²(1 + (|m(k)| − 1)ρ))]^{−1/2}, which increases as ρ goes up, and therefore so does the posterior probability (2.2), as can be seen in Figure 1. Alternatively, if orthonormalization within the group is done, or equivalently, a g-prior is adopted as the slab prior, this feature of favoring the within-group correlation is discarded. We believe that the within-group correlation should be favored, so we stick to our original slab prior in Model (2.1).
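The monotonicity of Q_k in the within-group correlation ρ can be checked numerically for the equicorrelated Gram matrix X(k)'X(k) = n[(1 − ρ)I + ρJ]; the values of n, m, and τ² below are arbitrary.

```python
import numpy as np

def log_Qk_equicorrelated(n, m, rho, tau2):
    # Gram matrix of a single normalized group whose variables all have
    # pairwise correlation rho: X(k)'X(k) = n[(1 - rho) I + rho J].
    gram = n * ((1.0 - rho) * np.eye(m) + rho * np.ones((m, m)))
    # log Q_k = -0.5 log|I + tau^2 X(k)'X(k)|
    return -0.5 * np.linalg.slogdet(np.eye(m) + tau2 * gram)[1]

vals = [log_Qk_equicorrelated(n=100, m=4, rho=r, tau2=1.0) for r in (0.0, 0.5, 0.9)]
increasing = vals[0] < vals[1] < vals[2]   # Q_k grows with rho
```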

Theoretical Results
We shall now provide the strong selection consistency of our Bayesian hierarchical model (2.1) in the sense that the posterior probability (2.2) of the true model goes to one as the sample size goes to infinity. We consider a general design allowing both the number of groups G and the size of the true model |t| to go to infinity and assume σ² to be fixed for simplicity. We first introduce the following notations to be used in our theoretical results. Operations on model indices: For a regression model indexed by k, we use k^c = 1_G − k as the index of its complementary model. For two models indexed by k and w respectively, the set operations k ∪ w and k ∩ w index the models corresponding to the union and intersection of the covariates indexed by k and w, respectively. In addition, k ⊂ w (or w ⊃ k) denotes that the model indexed by k is a submodel of the model indexed by w, and k ⊊ w (or w ⊋ k) denotes strict inclusion. The submodel of the model indexed by k that only includes (or excludes) its i'th active group is denoted by k^(i) (or k^(−i)) for i = 1, . . . , |k|.
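In code, these model-index operations reduce to elementwise operations on binary vectors; the indices and group sizes below are arbitrary.

```python
# Binary model indices over G = 5 groups (illustrative values).
k = [1, 0, 1, 0, 0]                          # groups 1 and 3 active
w = [1, 1, 0, 0, 0]
k_c = [1 - kg for kg in k]                   # complement k^c = 1_G - k
union = [max(a, b) for a, b in zip(k, w)]    # k ∪ w
inter = [min(a, b) for a, b in zip(k, w)]    # k ∩ w
m = [2, 3, 2, 4, 1]                          # group sizes
size_k = sum(k)                              # |k|: number of active groups
vars_k = sum(mg for kg, mg in zip(k, m) if kg == 1)   # |m(k)|
```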
Rates: For sequences a_n and b_n, a_n ∼ b_n means a_n/b_n → c for some constant c > 0; a_n ≲ b_n (or b_n ≳ a_n) means a_n = O(b_n); a_n ≺ b_n (or b_n ≻ a_n) means a_n = o(b_n).
Next, we describe the conditions used to achieve strong selection consistency.
Here, P(t^(−i)) is the projection matrix corresponding to the model indexed by t^(−i), the submodel of the true model that excludes only the i'th active group; φ_min outputs the minimum nonzero eigenvalue of the input matrix and φ̄ outputs the geometric mean of the nonzero eigenvalues of the input matrix.
For a fixed K, we assume the regularity conditions (A.1)-(A.5). Conditions (A.1)-(A.3) are primarily related to the rates of the prior parameters, while conditions (A.4) and (A.5) are concerned with the identifiability of the true model. Condition (A.1) allows the number of groups G to grow near-exponentially with the sample size n; this is the strongest rate available in the group selection literature. Under a similar near-exponential rate for the number of groups, Wei and Huang (2010) established the selection consistency of the adaptive group Lasso and Shang and Li (2014) achieved selection consistency for nonparametric additive models.
Condition (A.4) deals with the identifiability of the active groups and condition (A.5) prevents the selection of inactive groups. Both conditions essentially require that any false model of moderate size that misses some active group cannot fit the mean response X(t)β(t) well enough, so that its residual sum of squares is lower bounded at a certain rate. It is worth noting that our conditions depend on the number of groups G instead of the number of variables p, which can be much larger than G; the conditions would be more restrictive with p in place of G. This illustrates an advantage of incorporating the group information. Under an orthogonal design, that is, X'X = nI, conditions (A.4) and (A.5) simplify further to a minimum signal strength requirement involving a constant c > 0. Therefore, the infimum group signal strength is allowed to shrink to zero with the sample size at a fast rate.
When the correlations between the covariates are high, conditions (A.4) and (A.5) can still be satisfied as long as the active coefficients are strong enough. To see this, consider the simple case of one active group with design matrix X_1 and one inactive group with design matrix X_2. Here, we assume X_1 and X_2 are orthonormalized and, for simplicity, that the cross-correlation matrix between the two groups is X_2'X_1 = ρJ. Then, the difference between ||β_1||₂² and ρ²||Jβ_1||₂² can be lower bounded even when ρ is large, as long as the active coefficients β_1 are large enough. In contrast, Bach (2008) derived necessary and sufficient conditions for the selection consistency of group Lasso as an extension of the irrepresentable condition (Zhao and Yu, 2006) of Lasso. For the selection consistency of group Lasso, the strength of the signals within the active groups does not play a role in Conditions (4) and (5) of Bach (2008). Due to this, once the correlation structure fails to satisfy Conditions (4) and (5) of Bach (2008), selection consistency cannot be achieved by group Lasso no matter how large the active signals are, which is quite restrictive and can be easily violated.
A proof of Theorem 2.1 is provided in the Supplementary Material (Yang and Narisetty, 2019).

Applications to Specific Statistical Models
So far, we have introduced our Bayesian hierarchical model under general regression models with group structures present. We now discuss how our proposed method can be applied to some special statistical models which can be formulated as group selection problems. Group selection methods have many applications in statistical problems as well as in real data analysis, as discussed in Huang et al. (2012). Here, we consider the application of our proposed method to nonparametric additive models and seemingly unrelated regressions and illustrate that the established strong selection consistency continues to hold under these special cases.

Nonparametric Additive Models
One natural application of group selection methods is the nonparametric additive model:

Y_i = Σ_{j=1}^p f_j(X_{ij}) + ε_i,  i = 1, . . . , n,

where the f_j's are unknown smooth univariate functions and the f_j(X_j)'s are usually referred to as nonparametric components. A class of polynomial spline functions is used to approximate the unknown functions f_j. Every continuous function can be approximated arbitrarily well by polynomial splines of a fixed order using a sufficient number of knots. The spline basis functions evaluated at the same individual feature form a natural group, and thus the selection of the nonparametric components becomes a group selection problem.
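To make the grouping concrete, the sketch below expands each predictor into a small polynomial basis (a dependency-free stand-in for the polynomial spline basis); the columns generated from the same predictor form one group.

```python
import numpy as np

def poly_basis_groups(X, degree):
    # Expand each predictor X[:, j] into (x, x^2, ..., x^degree); the
    # columns built from the same predictor form one group, so selecting
    # a nonparametric component means selecting its whole group.
    return [np.column_stack([X[:, j] ** d for d in range(1, degree + 1)])
            for j in range(X.shape[1])]

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(25, 3))   # n = 25 observations, p = 3 components
groups = poly_basis_groups(X, degree=4)    # three groups of 4 basis columns each
shapes = [g.shape for g in groups]
```

In practice, a B-spline basis with interior knots would replace the monomials, but the induced group structure is identical.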
To ensure that the nonparametric components can be approximated well enough, we make assumptions (B.1)-(B.4), where D denotes the differential operator and AC stands for absolute continuity.
Here, assumption (B.1) deals with model identifiability and (B.2)-(B.4) are standard assumptions ensuring the approximation power of polynomial splines. In (B.3), we assume absolute continuity for the smoothness of the f_j's. In contrast, Huang et al. (2010) assumed the f_j's are Hölder continuous for nonparametric additive models under a random design. We now provide the strong selection consistency of our method under nonparametric additive models, again considering a general design that allows both the number of nonparametric components p and the size of the true model to go to infinity.
The proof of Theorem 3.1 follows along the lines of that of Theorem 2.1, except that there is an additional approximation error in the linear model which needs to be controlled by the increasing approximation power of the polynomial splines. A proof of Theorem 3.1 is provided in the Supplementary Material (Yang and Narisetty, 2019). Bach (2008) applied group Lasso to nonparametric additive models and established model selection consistency under conditions originating from the irrepresentable condition (Zhao and Yu, 2006). Ravikumar et al. (2009) proposed sparse additive models (SpAM) for nonparametric additive models and assumed conditions similar to the irrepresentable condition for model selection consistency. Therefore, similar to the earlier discussion, our results hold under weaker conditions on the covariates. Shang and Li (2014) proposed a Bayesian nonparametric size-control model which involves a size-control prior restricting the scope of the target models. Their method has strong selection consistency only when the hyperparameter associated with the size-control prior is correctly specified, which can easily be violated as the sparsity information is usually not available.

Seemingly Unrelated Regressions
We will now discuss the generality of our results using the seemingly unrelated regressions (SUR) model (Zellner, 1962; Lounici et al., 2011; Obozinski et al., 2011; Huang et al., 2012) as an example. The SUR model consists of a set of regression equations:

y^(i) = X^(i)β^(i) + ε^(i),  i = 1, . . . , T,

where y^(i), X^(i), β^(i), and ε^(i) are respectively the n_i × 1 response vector, the n_i × p design matrix, the p × 1 coefficient vector, and the n_i × 1 error vector corresponding to the i'th regression. Equivalently, stacking the response, coefficient, and error vectors and arranging the design matrices in a block-diagonal matrix, the equations can be written as

(y^(1), . . . , y^(T))' = diag(X^(1), . . . , X^(T)) (β^(1), . . . , β^(T))' + (ε^(1), . . . , ε^(T))'.   (3.1)
Denote the whole design matrix and coefficient vector in (3.1) as X and β, respectively. The seemingly unrelated regressions become related when the j'th predictors, j = 1, . . . , p, of the T regression models describe similar features. We can then reasonably assume that the j'th predictors in the regression equations are likely to be either included or excluded together, that is, the active set of each individual regression is the same. Under this formulation, the coefficient vector β of length T p is divided into p groups, with the coefficients whose indices have the same remainder when divided by p forming one group. This group structure turns the problem into a group selection problem.
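A small sketch of building the block-diagonal SUR design of (3.1) and the induced p groups; the dimensions below are arbitrary.

```python
import numpy as np

def sur_design(X_list):
    # Stack the T regression designs into the block-diagonal matrix of
    # (3.1); coefficient j of every regression is assigned to group j,
    # so the T*p coefficients split into p groups.
    T = len(X_list)
    p = X_list[0].shape[1]
    rows = sum(X.shape[0] for X in X_list)
    Xbig = np.zeros((rows, T * p))
    r = 0
    for i, X in enumerate(X_list):
        Xbig[r:r + X.shape[0], i * p:(i + 1) * p] = X
        r += X.shape[0]
    # Group g collects the g'th coefficient from each regression,
    # i.e. the indices with the same remainder modulo p.
    group_index = [[i * p + g for i in range(T)] for g in range(p)]
    return Xbig, group_index

rng = np.random.default_rng(5)
X_list = [rng.standard_normal((n_i, 3)) for n_i in (10, 12)]   # T = 2, p = 3
Xbig, group_index = sur_design(X_list)
```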
When our proposed method is applied to seemingly unrelated regressions, we retain group selection consistency without any further effort, and the conditions can be directly translated and further simplified. Denote the design matrix, projection matrix, and coefficient vector of the model indexed by k within the i'th regression as X^(i)(k), P^(i)(k), and β^(i)(k), respectively. Then, the quantity in the sensitivity condition (A.4) decomposes into a sum of the corresponding quantities over the T individual regressions. This implies that an active feature would be selected even when its coefficients are small in some individual regressions, as long as the overall signal across the T regressions is strong enough. Similarly, the specificity condition (A.5) tells us that inactive features would not be selected even when the correlations between the active and inactive features are large within some individual regressions, as long as they are not large across all regressions.

Computation
We shall use the MAP estimator of P(Z | Y) to perform group selection, so the implementation of the proposed method can be viewed as the computationally challenging optimization problem of finding the model with the maximum posterior probability among all possible models. The discreteness of the parameter space of Z facilitates the use of a shotgun stochastic search algorithm (hereafter referred to as SSS) (Hans et al., 2007). Alternatively, Markov chain Monte Carlo (MCMC) methods like the Gibbs sampler can be adopted and inference can be performed with the samples drawn from P(Z | Y). To deal with the unknown σ², we place a flat inverse Gamma prior with both shape and scale parameters equal to 0.01 and integrate σ² out from the posterior distribution (2.2).

Shotgun Stochastic Search Algorithm
The shotgun stochastic search algorithm attempts to find models having high posterior probabilities by systematically searching the high posterior density regions of the model space, as opposed to MCMC methods which attempt to approximate the posterior distribution on the whole model space. Following the notations and definitions of Hans et al. (2007), we let Γ denote the set that will contain the models with large posterior probabilities and nbd(k) = γ+(k) ∪ γ−(k) ∪ γ◦(k) denote the neighborhood of k, where γ+(k), γ−(k), and γ◦(k) are the "addition" neighbors (one inactive group added), "deletion" neighbors (one active group removed), and "replacement" neighbors (one active group swapped with an inactive one), respectively. At iteration t, with current model k^(t), the algorithm proceeds as follows:

Step 1. Update Γ^(t+1) to be Γ^(t) ∪ nbd(k^(t)) and keep the top B models in Γ^(t+1) having the largest P(Z | Y).

Step 2. Deterministically evaluate the neighborhoods of the current top models in Γ^(t+1) and again keep only the top B models.

Step 3. Sample the next model k^(t+1) from nbd(k^(t)) with probabilities proportional to P(Z | Y).
The top model in Γ after running SSS is our MAP estimate of the posterior distribution P(Z | Y).
Here, Step 2 is the only difference from the original SSS algorithm (Hans et al., 2007). We add this deterministic step to ensure that the model space around the local maxima of P(Z | Y) is explored more exhaustively at each iteration.
Step 2 does not involve k^(t), so it does not interfere with the stochastic steps. Compared with the original SSS algorithm, our modified algorithm can converge in fewer iterations and explore a broader region of the model space, as the neighborhoods of the current top models are fully explored at each iteration. The dominant part of the computational complexity at each iteration comes from the inversion of D_k + X(k)'X(k). The inverse can be efficiently computed using the Woodbury matrix identity, which gives a computational complexity of O(n|m(k)|²), where |m(k)| is the number of active individual variables in the model indexed by k. A stochastic approximation Monte Carlo (SAMC) algorithm (Liang et al., 2007, 2013) can also be used for posterior computation, but it is a posterior sampling algorithm that generates samples from P(Z | Y), as opposed to the SSS algorithm which aims to find the MAP model.
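The neighborhood structure nbd(k) = γ+(k) ∪ γ−(k) ∪ γ◦(k) can be enumerated directly from the binary model index; this is an illustrative sketch, not the authors' implementation.

```python
def neighborhood(k):
    # All addition, deletion, and replacement neighbors of the binary
    # model index k (a tuple of 0/1 entries over the G groups).
    G = len(k)
    active = [g for g in range(G) if k[g] == 1]
    inactive = [g for g in range(G) if k[g] == 0]
    nbrs = []
    for g in inactive:                     # gamma+: add one inactive group
        nbrs.append(k[:g] + (1,) + k[g + 1:])
    for g in active:                       # gamma-: delete one active group
        nbrs.append(k[:g] + (0,) + k[g + 1:])
    for g in active:                       # gamma°: swap active for inactive
        for h in inactive:
            w = list(k)
            w[g], w[h] = 0, 1
            nbrs.append(tuple(w))
    return nbrs

nbrs = neighborhood((1, 0, 0))   # G = 3: two additions, one deletion, two replacements
```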

Gibbs Sampling
Alternatively, we can use a standard Gibbs sampling method to generate samples from P(Z | Y). As Z is a binary vector, the full conditional distributions are all Bernoulli distributions. If posterior samples of the coefficients β are desired, they can be generated along with Z and σ² from the joint posterior distribution P(Z, β, σ² | Y) by Gibbs sampling. The full conditional distributions are still standard distributions due to the use of conjugate priors. The crucial point is that β and Z need to be blocked together for Gibbs sampling to work; otherwise, the Markov chain would not be irreducible because of the point mass used as the spike prior. A Gibbs sampler to draw samples from P(Z, β, σ² | Y) is provided in the Supplementary Material (Yang and Narisetty, 2019). We prefer to sample Z | Y directly because the analytical integration over β results in faster mixing and thus speeds up convergence (George and McCulloch, 1997).
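As a hedged illustration of these Bernoulli full conditionals, the sketch below updates each Z_g given the rest, taking a generic log marginal likelihood log P(Y | Z) as input and assuming independent Bernoulli(q) priors on the Z_g's; the toy score in the test is an assumption for demonstration, not the model's actual marginal likelihood.

```python
import numpy as np

def gibbs_Z(log_marg, G, n_iter, rng, q=None):
    """Gibbs sampler for P(Z | Y): each Z_g is drawn from its Bernoulli
    full conditional, whose log-odds equal the change in log posterior
    from flipping Z_g with the other coordinates held fixed."""
    q = 1.0 / G if q is None else q   # prior inclusion probability (Section 5 default)
    Z = np.zeros(G, dtype=int)
    samples = []
    for _ in range(n_iter):
        for g in range(G):
            Z1, Z0 = Z.copy(), Z.copy()
            Z1[g], Z0[g] = 1, 0
            # log posterior odds of Z_g = 1 vs Z_g = 0 given Z_{-g}
            log_odds = (log_marg(Z1) + np.log(q)) - (log_marg(Z0) + np.log(1 - q))
            p = 1.0 / (1.0 + np.exp(-log_odds))
            Z[g] = rng.binomial(1, p)
        samples.append(Z.copy())
    return np.array(samples)
```

Note that β never appears here: the sampler operates on the marginalized posterior P(Z | Y), which is exactly why mixing is faster than sampling (Z, β, σ²) jointly.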

Remark on bi-level selection.
When within-group selection of variables is also important, many bi-level selection methods have been proposed, such as the hierarchical structured variable selection (HSVS) method (Zhang et al., 2014) and Bayesian sparse group selection with spike and slab priors (BSGS-SS) (Xu and Ghosh, 2015). It is possible to extend our proposed model (2.1) to induce within-group sparsity by introducing another set of binary variables S_g = (S_g1, . . . , S_gmg) for every group of coefficients β_g = (β_g1, . . . , β_gmg) to indicate individual-level selection. Allowing some of the S_gi's to be inactive when Z_g is active would then induce within-group sparsity. However, introducing the S_g's would increase the computational burden because the dimension of the model space would be 2^p, which can be much larger than 2^G. Therefore, we focus on group selection in this paper and defer the computational aspects of the model for bi-level selection to future research.

Simulation Results
We refer to our proposed method as Group Spike and Diffusing prior (GSD), and to the estimates computed from the SSS and Gibbs sampling algorithms as GSD-SSS and GSD-Gibbs, respectively. To test the performance of our method, we compare it with existing methods including adaptive group Lasso (agLasso), group Lasso (gLasso), group SCAD (gSCAD), group MCP (gMCP), and BGL-SS under different settings. Additionally, for our last simulation setting, where there is bi-level sparsity, we also implement the sparse-group Lasso (SGL) method as a bi-level selection approach for comparison.

To assess sensitivity to the hyperparameters, we examine how the sparsity level of the selected model changes with τ² for a fixed q, based on a simulation dataset generated as described in Case 2. As can be seen, the sparsity level does not change rapidly, so a very fine tuning of τ² may not be needed. In the following simulation studies, we take q = 1/G and tune τ² over the five values G^2.5/(10^i n) with i = 0, 1, 2, 3, and 4. We choose the optimal τ² by the mean squared prediction error from 5-fold cross-validation using ridge regression. In our simulation studies, this coarse grid of tuning parameters already yields quite promising results. Note that this makes our hyperparameters data-dependent, so the resulting procedure is similar to an empirical Bayes approach, which is common practice in the high dimensional Bayesian literature due to sparse prior knowledge about the hyperparameters. In our model formulation, both q and τ² could be tuned if desired. However, the shrinkage effects of q and τ² are related to each other, and several authors have observed that it is sufficient to tune one of them while setting the other at a reasonable value (Narisetty and He, 2014; George, 2014, 2016; Gan et al., 2018). For this reason, in our empirical studies, we fix q and tune only τ². Another possible alternative would be to tune q while fixing τ², which we do not pursue in this paper. The choice of 1/G for q is motivated by our condition (A.2).
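The tuning grid and cross-validation step can be sketched as follows. The mapping from τ² to the ridge penalty λ is our illustrative assumption (λ = 1/τ²), since ridge regression here serves only as a fast proxy for prediction error; the function names are ours.

```python
import numpy as np

def tau2_grid(G, n, b=4):
    """Candidate slab variances tau^2 = G^2.5 / (10^i n), i = 0, ..., b."""
    return np.array([G**2.5 / (10**i * n) for i in range(b + 1)])

def ridge_cv_mspe(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of ridge regression via k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        tr = np.setdiff1d(idx, f)
        Xt, yt = X[tr], y[tr]
        # closed-form ridge solution on the training fold
        beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(X.shape[1]), Xt.T @ yt)
        errs.append(np.mean((y[f] - X[f] @ beta) ** 2))
    return np.mean(errs)
```

With G = 50 and n = 100, the grid is {176.8, 17.68, 1.768, 0.177, 0.018}, a geometric sequence with ratio 10; one would select the τ² whose associated ridge fit minimizes the cross-validated MSPE.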
When implementing our method using SSS, we set B, T, and C to 10, 100, and 10, respectively, where B is the number of top models we record, T is the number of iterations, and C is related to the stopping criterion. In the Gibbs sampler, we take a burn-in period of 1000 iterations followed by 1000 iterations to compute the posterior.
We consider 6 simulation designs under the linear regression model (1.1), each with the same sample size of 100. Depending on the design, the number of groups is either 50 or 100. Group sizes are chosen uniformly at random from {4, 5, 6, 7, 8} in all designs. The covariates X are generated from a multivariate normal distribution with zero mean, unit variance, and different correlation matrices across designs. Coefficients of inactive covariates are set to 0. The errors at each observation are i.i.d. standard normal. In each design, we consider two subcases of weak and strong signal strength, with active coefficients equal to 0.3 and 1, respectively, unless otherwise specified. The different cases considered are as follows:

Case 1 (Baseline design). The number of groups is G = 50 and the first 3 groups are active. The correlations at both the group and within-group levels are equal to 0.5.
Case 2 (Dense model design). The first 7 groups are active, with the same number of groups and correlation structure as in Case 1.
Case 3 (High dimensional design). The number of groups is G = 100 with the first 3 groups active. The correlation structure is the same as in Case 1.

Case 4 (Confounding group design).
There are G = 50 groups with the first 3 groups active. The within-group correlations for the first 3 groups and for the fourth group (the confounding group) are 0.3 and 0.8, respectively. The correlation between a variable from the first 3 groups and a variable from the fourth group is 0.5. All other entries of the correlation matrix are equal to 0.1.

Case 5 (Bi-level sparsity design).
The number of groups is G = 50 with the same correlation structure as in Case 1. The first 5 groups are active, but there are 2 inactive covariates within each of the 5 active groups. Active coefficients are equal to 1 under the strong signal strength setting and 0.5 under the weak signal strength setting.

Case 6 (Bi-level high sparsity design). The number of groups is G = 50 with the same correlation structure as in Case 1. The first 5 groups are active, with 1 active covariate in the first 3 groups and 2 active covariates in the other 2 active groups. Active coefficients are equal to 1 under the strong signal strength setting and 0.8 under the weak signal strength setting.
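The designs above can be sketched as data generators. The snippet below builds the equicorrelated Case 1 data via a one-factor construction (which reproduces pairwise correlation 0.5 exactly) and the Case 4 correlation matrix; function names and the factor construction are our illustrative choices, not the paper's code.

```python
import numpy as np

def simulate_case1(n=100, G=50, s=3, signal=1.0, rho=0.5, seed=0):
    """Baseline design: G groups of random size in {4,...,8}, all pairwise
    covariate correlations equal to rho, first s groups active."""
    rng = np.random.default_rng(seed)
    sizes = rng.integers(4, 9, size=G)          # group sizes, uniform on {4,...,8}
    p = sizes.sum()
    # one-factor construction: Corr(X_i, X_j) = rho for i != j, Var(X_j) = 1
    u = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[: sizes[:s].sum()] = signal            # first s groups active
    y = X @ beta + rng.standard_normal(n)
    return X, y, sizes, beta

def case4_corr(sizes):
    """Confounding-group design (Case 4): 0.3 within the first 3 groups,
    0.8 within group 4, 0.5 between the first 3 groups and group 4,
    0.1 everywhere else."""
    p = sum(sizes)
    R = np.full((p, p), 0.1)
    bounds = np.cumsum([0] + list(sizes))
    grp = np.zeros(p, dtype=int)
    for g in range(len(sizes)):
        grp[bounds[g]:bounds[g + 1]] = g
    for i in range(p):
        for j in range(p):
            gi, gj = grp[i], grp[j]
            if gi == gj:
                R[i, j] = 0.3 if gi < 3 else (0.8 if gi == 3 else 0.1)
            elif (gi < 3 and gj == 3) or (gj < 3 and gi == 3):
                R[i, j] = 0.5
    np.fill_diagonal(R, 1.0)
    return R
```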
Each simulation design is repeated 500 times and the results are summarized in Tables 1-6, where four measures are reported: Z = t, the proportion of replications in which the selected model is exactly the true model; Z ⊃ t, the proportion in which the selected model contains the true model; AUC, the average area under the receiver operating characteristic (ROC) curve; and t ∈ Path, the proportion in which the true model appears as a candidate model in the solution path.
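The first three measures can be computed as follows; this is a minimal sketch with our own function names, using the rank-based (Mann-Whitney) form of the AUC and assuming untied inclusion scores.

```python
import numpy as np

def selection_metrics(selected, truth):
    """Exact-selection (Z = t) and inclusion (Z contains t) indicators."""
    selected, truth = set(selected), set(truth)
    return {"exact": selected == truth, "includes": truth <= selected}

def auc(scores, truth, G):
    """Area under the ROC curve of per-group inclusion scores against the
    true active set, via the Mann-Whitney rank statistic (no tie handling)."""
    labels = np.array([g in truth for g in range(G)], dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(G)
    ranks[order] = np.arange(1, G + 1)
    n1 = labels.sum()
    n0 = G - n1
    return (ranks[labels == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
```

For GSD the natural scores are the posterior inclusion probabilities of the groups; for the penalized methods, scores can be derived from the order in which groups enter the solution path.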
Comparing different implementations of our method using SSS and Gibbs sampling, overall the two algorithms give similar results but SSS has a slightly better performance in most cases. The computational times for GSD-SSS and GSD-Gibbs using a MacBook Pro with 2.9 GHz Intel Core i5 processor, 8.00 GB memory, and macOS Sierra are reported in Table 7 which show that GSD-SSS is much faster than GSD-Gibbs.
Comparing our method with the competitors, we have the following observations:

• Generally speaking, when signal strength is weak, our method outperforms all the competitors except agLasso in terms of the measure Z = t. As signal strength gets stronger, the consistency conditions of our method are easier to satisfy, so our method performs much better and outperforms all the competitors, including agLasso.
• Cases 2 and 3 are modifications of Case 1 in terms of sparsity and dimensionality, respectively. In either case, our method works best in all the measures under the strong signal design. When signal strength is weak, agLasso is the most competitive method and our method is still comparable.

• Case 4 is a more challenging scenario due to the presence of the fourth group as a confounding group. The correlations between the fourth group and the active groups are even larger than those within the active groups, so the fourth group alone can explain a large proportion of the variation in the response variable. In this case, gLasso performs poorly because it includes the confounding group along with the active groups most of the time. Without a good initial estimator, agLasso also performs poorly. In contrast, our proposed method still performs well, demonstrating that it is more robust to different correlation structures.
• Cases 5 and 6 have sparsity at both the group and within-group levels. For every active group in Case 5, 25% to 50% of the predictors are inactive. The within-group sparsity is even higher in Case 6, with only 1 or 2 active covariates within each active group. From Tables 5-6, we see that the penalized group selection methods suffer from the within-group sparsity, and the bi-level selection method SGL tends to select more groups than the true model. In contrast, our method accommodates this bi-level sparsity situation, as our consistency conditions rely only on the overall signal strength of the whole group.
• Overall, our simulation studies indicate that our proposed method can perform well under a variety of configurations with different dimensionalities, model complexities, sparsity levels, and correlation structures.

Application to Gene Expression Data
We use the real dataset from Keller et al. (2018b) to evaluate the performance of our proposed Bayesian group selection method. The dataset contains expression levels of 21771 islet genes and diabetes-related phenotypes of 378 mice, collected to study expression quantitative trait loci for pancreatic islet function. The data are available on the Dryad Digital Repository (Keller et al., 2018a).
The nonparametric additive model is applied to study the relationship between the gene expression data and two phenotype variables, homeostatic model assessment (HOMA) of insulin resistance (IR) and pancreatic islet function (B). The two phenotype responses are highly correlated with a correlation of 0.819. We adopt the seemingly unrelated regressions (SUR) model to fit the two phenotype responses together to select a common set of covariates relevant for both the phenotypes.
As this is a high dimensional problem with 21771 predictors, we first screen out 500 genes for each of the responses by quantile-adaptive nonlinear variable screening (He et al., 2013) and take the 258 common genes plus the gender predictor to fit a SUR model. The approach of He et al. (2013) is a more flexible generalization of the sure independence screening approach of Fan and Lv (2008) and incorporates information at multiple quantile levels. We approximate the nonparametric additive components using cubic splines with 10 knots. Thus, we have 259 groups in total, and the final model is Y = Σ_{g=1}^{259} 𝕏_g β_g + ε, where Y is the concatenation of the two phenotype responses HOMA-IR and HOMA-B and 𝕏_g = diag(X_g, X_g) is the block-diagonal matrix whose diagonal blocks are X_g, the g-th predictor evaluated at the basis functions of the cubic splines.
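The construction of the group design matrices can be sketched as follows. The truncated-power basis is our illustrative choice (the paper specifies cubic splines with 10 knots but not the exact basis), and the function names are ours.

```python
import numpy as np

def cubic_spline_basis(x, n_knots=10):
    """Cubic spline basis for one predictor via the truncated-power
    construction: x, x^2, x^3 plus (x - t)_+^3 at interior-quantile knots.
    An illustrative basis choice; any cubic spline basis would do."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x, x**2, x**3] + [np.clip(x - t, 0, None) ** 3 for t in knots]
    return np.column_stack(cols)

def sur_group_design(xg, n_knots=10):
    """Block-diagonal design for one predictor in the two-response SUR
    model: diag(B_g, B_g), where B_g is the spline basis of predictor g
    and the responses are stacked as Y = (HOMA-IR, HOMA-B)."""
    B = cubic_spline_basis(xg, n_knots)
    n, m = B.shape
    top = np.hstack([B, np.zeros((n, m))])
    bot = np.hstack([np.zeros((n, m)), B])
    return np.vstack([top, bot])
```

Each predictor thus contributes one group of 2 × 13 coefficients (13 basis columns per response), and group selection of 𝕏_g selects or drops the gene for both phenotypes simultaneously.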
To perform group selection for the above model, we use GSD-SSS along with gLasso, gMCP, and gSCAD. For evaluating performance of the models selected by different methods, we randomly split the data into a training set with 80 percent of the observations and a test set with the remaining 20 percent. We only use the training set for model fitting as well as tuning parameter selection and the test set is used to calculate the prediction error for the purpose of performance evaluation. This process is replicated in parallel on a cluster machine with 24 cores resulting in 24 replications.
In Figure 3, we plot the average mean squared prediction errors (MSPE) at different model sizes for GSD-SSS along with gLasso, gMCP, and gSCAD. For the SSS algorithm, we set B to 50 and, to accelerate computation, we drop the deterministic Step 2 of the algorithm and set the stopping criterion parameter C to 5. We follow the same tuning parameter selection procedure as in Section 5; the only difference is that the tuning is finer, with more values considered for the prior variance parameter τ². More specifically, we choose τ² from the set of values {G^2.5/(10^i n) : i = 0, · · · , b}, where b = 4 for the simulation studies and b = 10 for the real data analysis. We perform ridge regression with 5-fold cross-validation on the training set to obtain coefficient estimates for GSD-SSS and use the test set to calculate the MSPE. We find that GSD-SSS has the best performance in terms of average MSPE across different model sizes. This indicates that, compared to the competitors, our method selects more predictive groups at different sparsity levels. In Table 8, we report the genes selected at least twice by GSD-SSS in the 24 replications, along with the corresponding proportion of times they were selected.

Conclusion
In this paper, we propose a Bayesian hierarchical model with a spike and slab prior specification to perform group selection in high dimensional linear regression models. The group selection consistency of our method is established under mild conditions, and we show that this consistency result is retained for important statistical models including nonparametric additive models and seemingly unrelated regressions. Shotgun stochastic search and Gibbs sampling algorithms can be used for the implementation of our proposed approach. We observe that our proposed shotgun stochastic search algorithm is more computationally efficient and also has better empirical performance than a standard Gibbs sampling algorithm. Our simulation and real data studies indicate that the proposed method has better group selection performance than a variety of state-of-the-art competing methods.
The focus of our paper is to provide a general framework for group selection problems, and the framework can certainly be extended to special cases such as the multivariate response model, similar to Greenlaw et al. (2017) and Liquet et al. (2017), by considering special covariance matrix structures for the errors. Other generalizations, such as autoregressive models, can also potentially be studied within our framework by considering a general structure for the error covariance matrix.