Graphical-model based high dimensional generalized linear models

We consider the problem of both prediction and model selection in high-dimensional generalized linear models. Predictive performance can be improved by leveraging structural information among the predictors. In this paper, a graphical-model-based doubly sparse regularized estimator is discussed for high-dimensional generalized linear models, which utilizes the graph structure among the predictors. The graphical information among predictors is incorporated node by node using a decomposed representation, and sparsity is encouraged both within and between the decomposed components. We propose an efficient iterative proximal algorithm to solve the optimization problem. Statistical convergence rates and selection consistency for the doubly sparse regularized estimator are established in the ultra-high dimensional setting. Specifically, we allow the dimensionality to grow exponentially with the sample size. We compare the estimator with existing methods through numerical analysis in both simulation studies and a microbiome data analysis.


Introduction
We consider a regularization method for the high-dimensional generalized linear model (GLM) [23] in which the dimensionality greatly surpasses the sample size. The GLM is one of the most commonly used statistical methods for modeling, estimation, prediction, and classification, and it has been widely used in high-dimensional data analysis. Many traditional statistical tools are not well suited to ultra-high dimensional data, and regularization methods have been widely used in the literature. Hoerl and Kennard [12] proposed ridge regression, which uses a ridge penalty to improve estimation efficiency through a bias-variance trade-off. Tibshirani [39] proposed the Lasso, which applies an ℓ1 penalty for both shrinkage and variable selection. Many theoretical properties of the ℓ1 penalty for the high-dimensional GLM have been established, including estimation consistency [26], selection consistency [4], the persistence property for prediction [11], and risk consistency [41]. Other methods penalize the likelihood function with folded concave penalty functions, including the smoothly clipped absolute deviation (SCAD) [7,8], the adaptive Lasso penalty [52], and the minimax concave penalty (MCP) [49]. The elastic net was proposed by [53] to perform variable selection when the variables may be highly correlated. A more general Lasso penalty was proposed by [40] for some prespecified modifying variables.
If the true sparsity structure comprises clusters or groups of predictors, one can use the group Lasso to select groups of coefficients [48,27,14,10]. [44] discussed hierarchical sparse modeling, which utilizes the group Lasso and the latent overlapping group Lasso penalty. Different from these works, we consider an undirected graph structure among the predictors [46,35,51]. As predictors in a neighborhood are connected, they tend to be simultaneously effective or ineffective for predicting the response. As an example, in the diagnosis of metastatic melanoma using commensal microbial composition information, the patients' microbiomes are naturally correlated: relevant microbes in the same neighborhood of the underlying graph often influence, or fail to influence, the clinical response together; see [22]. Incorporating the structural information among the microbes can lead to the construction of a better classifier.
In the literature, many methods have utilized edge-by-edge information in the graph to solve the regression problem. For example, the OSCAR method in [2] uses an ℓ∞ penalty on every pair of predictors, and the GRACE method in [16] uses a network-constrained penalty on the pairwise differences of connected predictors. Yu and Liu [46] proposed the sparse regression method incorporating graph structure (SRIG) with a node-wise, neighborhood-based penalty in which the penalty term is distributed over all nodes instead of all edges; they also proposed an efficient computational method for the node-wise penalty. Liu et al. [18] proposed a graph-based high-dimensional sparse linear discriminant analysis method specifically for classification. Recently, Zhou et al. [51] extended sparse regression leveraging graphical structure to generalized linear models and established the estimation and prediction error bounds of the penalized estimator. Yu and Liu [46] and Zhou et al. [51] assumed that all the components within each decomposition are shrunk to zero simultaneously. This assumption is restrictive and can be relaxed: although the predictors in a neighborhood are correlated with each other, they need not all be important predictors of the response variable. To overcome this, Stephenson et al. [36] proposed a doubly sparse method (DSRIG) that encourages sparsity both within and among the decompositions under the linear regression model with a fixed design matrix. They also applied the doubly sparse method to logistic regression [37] using the predictor duplication (PD) algorithm [27]. However, the PD method is not very efficient and requires a large amount of computing memory for models with high-dimensional predictors or predictor graphs with many edges. It is therefore necessary to develop new, efficient optimization algorithms for doubly sparse graphical-model-based GLMs.
Furthermore, no work has been done on finite sample bounds of the estimation error and the model selection consistency in graphic model-based doubly sparse generalized linear models. Accordingly, there is a great need to investigate these statistical properties.
In this paper, a theoretical investigation is presented for the doubly sparse high-dimensional GLM estimator that incorporates the graph structure through a node-wise penalty. In general, the graphical structure can be either given or estimated from the study samples. We generalize the sparse least squares estimator of [46] to allow for a wider class of loss functions as well as a more general structured regularization. In terms of optimization, the penalty can be expressed as a latent sparse group Lasso over the neighborhood sets. Rather than using the predictor duplication method, we combine the fast iterative shrinkage-thresholding algorithm (FISTA) [1] with proximal splitting methods [47,46] for solving the proximal operator. On the theoretical side, we establish finite-sample bounds on the estimation error, together with a prediction error bound for random designs under some regularity conditions. The classical estimation error bound for penalized GLMs is recovered when the predictor graph has no edges. Moreover, model selection consistency is established for ultra-high dimensional graphical-model-based doubly sparse GLMs. In the simulation studies and the real data application, we show that the method improves estimation, prediction, and model selection compared with regularization methods that do not use the predictors' graphical structure.
The paper is organized as follows. In Section 2, we set up the basic notation and introduce the penalization method. In Section 3, an efficient algorithm for the optimization problem is presented. In Section 4, the main theoretical results are provided. In Sections 5 and 6, numerical simulations and an application to a human microbiome dataset demonstrate the competitive performance of the method. We provide some discussion in Section 7. Technical proofs are contained in the Appendix.
For a matrix M, let |||M|||_2 denote the spectral norm, and let |||M|||_max := max_{i,j} |m_{ij}| denote the element-wise ℓ∞ norm of M. Let ∇h denote a gradient or subgradient of a function h : R^p → R. Let supp(V) denote the support of a vector V, and |N| the cardinality of a set N. Let B_q(r) represent the centered ball of radius r in the ℓq norm for q, r > 0. Let sign(·) denote the sign function.

Methodology
We consider a common GLM setting. Let {(X_i, y_i); i = 1, ..., n} denote independent and identically distributed (i.i.d.) samples, where y_i is a response variable and X_i = (x_{i,1}, ..., x_{i,p})^T is a p-dimensional covariate vector. Let y = (y_1, ..., y_n)^T ∈ R^n and X = (X_1, ..., X_n)^T. Throughout the paper, the dimensionality p is allowed to grow with the sample size n. It is assumed that the conditional density of y_i given X_i belongs to the exponential family, with functions b(·) and c(·, ·), nuisance parameter φ, and canonical parameter θ_i = X_i^T β, where β = (β_1, ..., β_p)^T is the p-dimensional unknown regression coefficient vector. By standard properties of exponential families [23], we have E(y_i | X_i) = b'(θ_i) = μ_i. The canonical link function g(μ_i) = θ_i is used.
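For concreteness, the conditional density can be written in the standard exponential-family form [23]:

```latex
\[
f(y_i \mid X_i; \beta, \phi)
  = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\},
  \qquad \theta_i = X_i^{T}\beta,
\]
\[
\mathbb{E}(y_i \mid X_i) = b'(\theta_i), \qquad
\mathrm{Var}(y_i \mid X_i) = \phi\, b''(\theta_i).
\]
```

For example, logistic regression corresponds to b(θ) = log(1 + e^θ) with φ = 1, and linear regression to b(θ) = θ²/2 with φ = σ².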
In our analysis, it is assumed that b''(·) ≤ c for some constant c > 0. This boundedness condition implies that y_i has a bounded conditional variance, and it is required to establish the estimation error bound. A similar condition was assumed in [30] and is satisfied by many models, including linear regression, logistic regression, and multinomial regression. Although it does not hold for Poisson regression, a truncated Poisson regression model does satisfy this boundedness condition [45].
The population loss based on the negative log-likelihood is formulated as
\[
L(\beta) = \mathbb{E}\left[ b(X^{T}\beta) - y\, X^{T}\beta \right].
\]
The empirical loss function takes the form
\[
L_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left[ b(X_i^{T}\beta) - y_i X_i^{T}\beta \right],
\]
and the population-level and empirical gradients are given by
\[
\nabla L(\beta) = \mathbb{E}\left[ \{ b'(X^{T}\beta) - y \}\, X \right], \qquad
\nabla L_n(\beta) = \frac{1}{n}\sum_{i=1}^{n} \{ b'(X_i^{T}\beta) - y_i \}\, X_i.
\]
Without loss of generality, we assume the nuisance parameter φ = 1 for the rest of the paper. It can be verified that ∇L(β^0) = 0 for the true parameter β^0 of the GLM. Since the cumulant function b(·) is convex, L_n is convex.
Assume that the i.i.d. p-dimensional covariate vectors X_1, ..., X_n follow some multivariate distribution with zero mean and covariance matrix Σ. Denote the precision matrix Ω = Σ^{-1}, with elements ω_{jk}, which is assumed to be sparse. Let G be an undirected predictor graph over the set of nodes J = {1, 2, ..., p}. The edge set E in the graphical model represents the conditional dependence structure of the observed variables. When two predictors are conditionally independent, they are not connected by an edge in the graph, i.e., e_{jk} ∉ E ⟺ X_j ⊥⊥ X_k | {X_l : l ≠ j, k}. Specifically, in a Gaussian graphical model, there is an edge between a pair of nodes (j, k), j ≠ k, if and only if ω_{jk} ≠ 0. For discrete graphical models that can be represented by a minimal exponential family, Loh and Wainwright [20] investigated the connection between the support of a generalized inverse covariance matrix and the conditional independence structure of the graph. In particular, they showed that for binary variables on a tree-structured graph, the inverse of the usual covariance matrix corresponds exactly to the edge structure of the tree. For continuous but non-Gaussian distributions, Spantini et al. [34] showed that if X_1, ..., X_p have a smooth and strictly positive density π(x), the pairwise conditional independence of X_j and X_k can be assessed by
\[
X_j \perp\!\!\!\perp X_k \mid \{X_l : l \neq j, k\} \iff \partial^2_{j,k} \log \pi(x) = 0 \text{ for all } x,
\]
where Ω* denotes the generalized precision matrix with elements Ω*_{jk} = E_π[ |∂²_{j,k} log π(x)| ].
Then, if Ω*_{jk} = 0 for j ≠ k, nodes j and k are conditionally independent, and hence there is no corresponding edge (j, k) in G. Define N_j = {k : e_{jk} ∈ E} to be the neighborhood set of node j, and define the degree of node j to be the size of its neighbourhood, d_j = |N_j|. For any node k ∉ N_j (i.e., e_{jk} ∉ E), we assume that node j does not contribute to the decomposition of β_k.
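As a quick concrete check of the loss and gradient formulas above, here is a minimal sketch for the logistic model, where b(θ) = log(1 + e^θ) and b'(θ) = 1/(1 + e^{−θ}). The function names are ours, and this is a small illustration rather than the paper's implementation.

```python
import numpy as np

def logistic_b(theta):
    # Cumulant function b(theta) = log(1 + exp(theta)), computed stably.
    return np.logaddexp(0.0, theta)

def empirical_loss(beta, X, y):
    # L_n(beta) = (1/n) * sum_i [ b(X_i^T beta) - y_i * X_i^T beta ]
    theta = X @ beta
    return np.mean(logistic_b(theta) - y * theta)

def empirical_grad(beta, X, y):
    # grad L_n(beta) = (1/n) * sum_i ( b'(X_i^T beta) - y_i ) X_i
    theta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-theta))  # b'(theta) for the logistic model
    return X.T @ (mu - y) / len(y)
```

The gradient can be verified against a finite-difference approximation of `empirical_loss`, which mirrors the identity ∇L_n(β) = n^{-1} Σ_i {b'(X_i^Tβ) − y_i} X_i.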
Given the predictor graph G, the neighbourhoods N_j, j ∈ J, represent a set of possibly overlapping groups. As discussed in [51], under the GLM setting, based on the inverse regression method from Theorem 2.1 of [6] and Condition 3.1 of [17], we have
\[
\beta^0 = c\,\Omega\,\mathrm{Cov}(X, y) \tag{2.1}
\]
for some constant c, where Ω is the precision matrix. Let the predictor graph G be denoted by a p × p adjacency matrix E, where E_{jk} = 1 for connected predictors j and k and E_{jk} = 0 otherwise. We always set E_{jj} = 1 for each j. Then we have N_j = {k : E_{jk} = 1}. Based on the connection between conditional independence and the graphical structure, by (2.1), β^0 can be decomposed into
\[
\beta^0 = \sum_{j=1}^{p} \Theta^{(j)}, \tag{2.2}
\]
where the term {Θ^{(j)}_k : k ∈ N_j} arises from the marginal correlation between predictor j and the response. Let V^{(j)}_k = Θ^{(j)}_k E_{kj} for k ∈ N_j; then the decomposition of the true coefficients β^0 can be expressed as
\[
\beta^0 = \sum_{j=1}^{p} V^{(j)}, \qquad \mathrm{supp}(V^{(j)}) \subseteq N_j. \tag{2.3}
\]
Each V^{(j)} = (V^{(j)}_1, ..., V^{(j)}_p)^T depends on the interaction among the predictors (i.e., the graphical structure) and the marginal correlation between the predictors and the response. The term {V^{(j)}_k : k ∈ N_j} contains the contribution of predictor k to the response through predictor j. Such a contribution depends on the strength of the conditional dependency between j and k and on the marginal correlation between predictor j and the response variable. From the derivation in (2.3), if V^{(j)}_k = 0, then either there is no edge between predictors j and k (i.e., E_{kj} = 0), or, even though there is an edge (i.e., E_{kj} = 1), predictor j and the response variable are uncorrelated (i.e., Θ^{(j)}_k = 0). Therefore, we assume a doubly sparse decomposition to help mitigate estimation bias: only a small number of the vectors V^{(j)} are nonzero, and even for the predictors with nonzero V^{(j)}, only a small number of the entries V^{(j)}_k within each V^{(j)} are nonzero. The underlying graph G is assumed to be known or estimated from data.
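The decomposition (2.3) can be made concrete with a small sketch. The two-block graph, the latent vectors, and the numerical values below are purely illustrative (not the paper's examples); the sketch only checks the support constraint and the summation β = Σ_j V^{(j)}.

```python
import numpy as np

# Hypothetical 4-node predictor graph with two disconnected pairs; E[j, j] = 1.
E = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
p = E.shape[0]

# Latent components V^(j), each supported on the neighbourhood N_j = {k : E[j, k] = 1}.
# Doubly sparse: only the components for nodes 1 and 3 (rows 0 and 2) are nonzero
# (between-sparsity), and the node-3 component uses only part of its
# neighbourhood (within-sparsity).
V = np.zeros((p, p))
V[0] = [0.5, 0.3, 0.0, 0.0]
V[2] = [0.0, 0.0, -0.4, 0.0]

# Support constraint: supp(V^(j)) must lie inside N_j.
for j in range(p):
    assert set(np.flatnonzero(V[j])) <= set(np.flatnonzero(E[j]))

# The regression coefficient vector is the sum of the latent components.
beta = V.sum(axis=0)  # beta = sum_j V^(j)
```

Here only two of the four V^{(j)} are nonzero, and the resulting β still has three nonzero coordinates, illustrating how between- and within-sparsity interact.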
Therefore, under the GLM setting and given the predictor graph G with neighborhoods N_1, ..., N_p, we optimize the following penalized maximum likelihood objective function,
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \; L_n(\beta)
  + \lambda \min_{\substack{\sum_{j=1}^{p} V^{(j)} = \beta \\ \mathrm{supp}(V^{(j)}) \subseteq N_j}}
    \sum_{j=1}^{p} \left( \tau_j \| V^{(j)} \|_2 + \xi \| V^{(j)} \|_1 \right), \tag{2.4}
\]
which induces sparsity both between and within the components V^{(j)}, j = 1, ..., p,
where λ ≥ 0 is the tuning parameter, τ_j is a positive group-specific weight, and ξ is the mixing parameter, which can be viewed as a trade-off weight balancing the contributions of the ℓ1 and ℓ2 norms. We assume τ_j and ξ are bounded. Different values of ξ correspond to different shapes of the constraint set [28]. We use weights of the form τ_j ∝ d_j^γ with 0 < γ < 1/2 and d_j = |N_j|, the size of the neighborhood. Here we require γ < 1/2 to accommodate possibly overlapping groups, as also suggested in [27] and [51]. The penalty function R(β) can be viewed as a sparse group Lasso penalty with additional constraints. The choice of tuning parameters will be discussed in Section 5. The ℓ1 component of (2.4) controls the sparsity within each V^{(j)}, while the ℓ2 component of (2.4) controls the sparsity among the neighbourhoods. The objective reduces to the sparse GLM incorporating graphical structure discussed in [51] when ξ = 0. When the graph G has no edges, the objective function (2.4) reduces to that of the GLM Lasso [41,25]. Under the GLM setting, L_n(β) is smooth, while R(β) is not. Note that the penalty proposed in (2.4) is similar to a sparse group Lasso [32], but it shrinks the decompositions V^{(j)} rather than directly penalizing the regression coefficients β.
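The penalty value at any fixed feasible decomposition can be computed directly. A sketch follows; note that because the penalty minimizes over all decompositions, evaluating it at one fixed decomposition only yields an upper bound. The function name and the tiny two-node graph are our own illustration; the weight choices τ_j = d_j^γ and ξ = max_j τ_j follow the text.

```python
import numpy as np

def penalty_at(V, E, tau, xi):
    """Sum_j ( tau_j * ||V^(j)||_2 + xi * ||V^(j)||_1 ) at a fixed decomposition.

    V is a (p, p) array whose j-th row is V^(j); each row must be supported
    on the neighbourhood N_j encoded by row j of the adjacency matrix E.
    """
    for j in range(V.shape[0]):
        assert set(np.flatnonzero(V[j])) <= set(np.flatnonzero(E[j]))
    return float(np.sum(tau * np.linalg.norm(V, axis=1)) + xi * np.abs(V).sum())

# Weights tau_j = d_j^gamma with gamma < 1/2, and xi = max_j tau_j,
# following the choices described in the text.
E = np.array([[1, 1],
              [1, 1]])
d = E.sum(axis=1)
gamma = np.log(2) / (2 * np.log(3))
tau = d.astype(float) ** gamma
xi = tau.max()
```

For instance, the decomposition V^{(1)} = (1, 0), V^{(2)} = (0, 0) contributes τ_1·1 from the group term and ξ·1 from the element-wise term.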

Computation
Typically, problem (2.4) can be transformed into a sparse group Lasso problem [32], and the predictor duplication (PD) method [27,46] can be used to solve it. However, the PD method is not efficient because of its large memory requirements.
Therefore, we combine FISTA with a proximal splitting method to solve the problem, which is stable and efficient. Given the positive weights τ_j and ξ, for β ∈ R^p, denote
\[
\|\beta\|_{G,\tau} + \|\beta\|_{S,\xi}
  = \min_{\substack{\sum_{j=1}^{p} V^{(j)} = \beta \\ \mathrm{supp}(V^{(j)}) \subseteq N_j}}
    \sum_{j=1}^{p} \left( \tau_j \|V^{(j)}\|_2 + \xi \|V^{(j)}\|_1 \right),
\]
where ‖β‖_{G,τ} and ‖β‖_{S,ξ} collect the ℓ2 (group) and ℓ1 (element-wise) parts of a minimizing decomposition, respectively. Then R(β) = ‖β‖_{G,τ} + ‖β‖_{S,ξ} is a norm and therefore convex. The optimization problem (2.4) is equivalent to
\[
\min_{\beta \in \mathbb{R}^p} \; L_n(\beta) + \lambda R(\beta). \tag{3.1}
\]
For the exponential family of distributions, the loss function L_n(·) is continuously differentiable, and its gradient satisfies the Lipschitz property with constant L, i.e.,
\[
\| \nabla L_n(\beta_1) - \nabla L_n(\beta_2) \|_2 \le L \| \beta_1 - \beta_2 \|_2
\quad \text{for all } \beta_1, \beta_2 \in \mathbb{R}^p.
\]
Since L_n is twice continuously differentiable, any bound on the operator norm of ∇²L_n is a Lipschitz constant for ∇L_n. Then a simple tight global upper bound for L_n is
\[
L_n(\beta) \le L_n(\beta^{(t)}) + \langle \nabla L_n(\beta^{(t)}), \beta - \beta^{(t)} \rangle
  + \frac{L}{2} \| \beta - \beta^{(t)} \|_2^2.
\]
Since R contains a non-smooth ℓ1 norm, a simple iterative strategy is to minimize this upper bound for L_n at each iteration without modifying R. At iteration t + 1, the update is
\[
\beta^{(t+1)} = \mathrm{prox}_{\lambda R / L}\left( \beta^{(t)} - \tfrac{1}{L} \nabla L_n(\beta^{(t)}) \right),
\]
where the associated proximal operator is defined as
\[
\mathrm{prox}_{h}(v) = \arg\min_{\beta \in \mathbb{R}^p}
  \left\{ \tfrac{1}{2} \| \beta - v \|_2^2 + h(\beta) \right\}. \tag{3.2}
\]
Thus, to solve the optimization problem (3.1), we use the procedure summarized in Algorithm 1.
It is shown in [1] that the convergence rate to the optimal solution is O(1/t²). For high-dimensional p, the proximal step in Algorithm 1 is the most time-consuming. Following [47], we can solve proximal step 4 by finding the projection via the Parallel Dykstra-like Proximal Method [5], converting the minimization problem (3.2) into a series of convex subproblems that have closed-form solutions. The details of the Dykstra-like proximal splitting method are given in the supplementary material. With this method, we only need the graphical structure information, namely the neighbourhoods N_j for j = 1, ..., p, to identify the overlapping groups, which are pre-determined and unique. The procedure is stable and very efficient for high-dimensional data, especially when the predictor graph is not very dense and can be decomposed into several disconnected components; see Example 1 in the simulations.
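A compact sketch of the accelerated scheme (FISTA [1]) with a generic proximal operator is given below. For illustration we plug in ordinary element-wise soft-thresholding, i.e., the prox of a plain ℓ1 penalty, rather than the Dykstra-like prox of R, so this is a simplified stand-in for Algorithm 1, not the full method.

```python
import numpy as np

def fista(grad, prox, beta0, L, n_iter=500):
    # Accelerated proximal gradient (FISTA): O(1/t^2) convergence in objective.
    # grad: gradient of the smooth loss; prox(v, s): proximal operator of s * R.
    beta, z, t = beta0.copy(), beta0.copy(), 1.0
    for _ in range(n_iter):
        # Proximal step on the quadratic upper bound of L_n at the momentum point z.
        beta_next = prox(z - grad(z) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = beta_next + ((t - 1.0) / t_next) * (beta_next - beta)
        beta, t = beta_next, t_next
    return beta

def soft_threshold(v, s):
    # Prox of s * ||.||_1 (element-wise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
```

On a separable quadratic toy problem, minimizing ½‖β − b‖² + λ‖β‖_1, the scheme recovers the closed-form soft-thresholded solution, which gives a convenient correctness check before substituting the structured prox.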

Theoretical properties
Let J_0 = supp(β^0) denote the support of the true coefficient vector, and let s_0 = |J_0|. As R(β) is convex and coercive, Rao et al. [28] proved that an optimal decomposition of β minimizing R(β) always exists but may not be unique. For each β ∈ R^p, we denote by V(β) the set of all optimal decompositions of β. Define
\[
K(\beta) = \min_{(V^{(1)}, V^{(2)}, \cdots, V^{(p)}) \in \mathcal{V}(\beta)}
  \left| \left\{ j : \| V^{(j)} \|_2 \neq 0 \right\} \right|,
\]
which is the minimal number of nonzero V^{(j)} among all the optimal decompositions of β. Denote by K = sup_{supp(β) ⊆ J_0} K(β) the maximal number of nonzero groups among all such decompositions. There are at most d_max = max_{j=1,...,p} d_j nonzero elements in each nonzero V^{(j)}, where d_j = |N_j|. Note that the optimal decomposition of β is the smallest decomposition for the associated penalty R(β). In order to prove the main theorem, we first need the following property regarding the subgradient conditions for the optimization problem (2.4); it follows directly from the Karush-Kuhn-Tucker (KKT) conditions.

Proposition 4.1. The necessary and sufficient condition for β being the solution of (2.4) is that β can be decomposed as
These subgradient conditions are similar to those of [32], which were established for a least squares loss function with the group Lasso.
Remark 4.1. Assumption 1 implies that the neighbors of important predictors are also important predictors for modeling the response. This condition is also assumed in [46]. Assumption 2, sub-Gaussianity together with lower-bounded eigenvalues, is commonly used in the high-dimensional setting; it was also used to derive a lower bound on the Taylor-series error of the GLM log-likelihood with sub-Gaussian covariates in [26] and [43]. The boundedness assumption on the cumulant function b(·) is also made in [26]. From Assumption 2, we can extend the restricted strong convexity (RSC) condition for GLMs [26,43] to the current setting. That is, for any subset J ⊂ {1, 2, ..., p} with |J| ≤ s_0, and all optimal decompositions (V^{(1)}, ..., V^{(p)}) of any vector Δ with ‖Δ‖_2 ≤ 1, we have
\[
\langle \nabla L_n(\beta^0 + \Delta) - \nabla L_n(\beta^0), \Delta \rangle
  \ge \kappa_l \| \Delta \|_2^2 - \kappa_2 \frac{\log p}{n} \| \Delta \|_1^2
\]
for some κ_l > 0 and κ_2 ≥ 0, with probability tending to 1.
In the following, we establish the error bounds of the doubly sparse GLM estimator.
with probability at least 1 − c_1 exp(−c_2 log p) for some constants c_1, c_2 > 0.
From the results above, we can see that our theoretical results do not depend on any uniqueness assumption on the decomposition of β that minimizes ‖β‖_{G,τ} + ‖β‖_{S,ξ}. In contrast to [27], our result depends only on K, which represents the maximal structured sparsity of such decompositions. We consider the constraint R(β^0) ≤ ρ to ensure the existence of local/global optima, as also discussed in Loh and Wainwright [21]. Note that the setting
\[
\rho \le c\, \frac{\sqrt{d_{\max}}}{\tau_{\min} + \xi} \sqrt{\frac{n}{\log p}}
\]
is feasible under the sample size assumption n ≳ s_0 log p. The following corollary of Theorem 4.1 provides the prediction error bound, defined as
\[
D(\hat{\beta}, \beta^0) = \langle \nabla L_n(\hat{\beta}) - \nabla L_n(\beta^0), \hat{\beta} - \beta^0 \rangle.
\]
Note that under our GLM model setting, this error measure D(β̂, β^0) is equivalent to the symmetrized Bregman divergence defined by the cumulant function b(·) [21].

Remark 4.2. Note that when there is no edge in the predictor graph G, the penalty (2.4) reduces to a weighted GLM Lasso. In this case, we recover the rate in [21] and [19] for high-dimensional regularized M-estimators. However, from Theorem 4.1, the error bounds of our method depend on the minimal number K of nonzero components in an optimal decomposition rather than on the true model size s_0. If the true graph G consists of disconnected complete subgraphs and J_0 is the union of K_0 node sets of those disconnected subgraphs (see Example 1 in Section 5), then K = K_0. Our method thus gives better results than the Lasso when K_0 is much smaller than s_0, demonstrating the advantage of incorporating the structural sparsity of the predictor graph over the standard Lasso procedure. From the simulations, we find that our method indeed outperforms the GLM Lasso in all the simulation settings. Moreover, in Corollary 4.1, for fixed-design linear regression, the expression ⟨∇L_n(β̂) − ∇L_n(β^0), β̂ − β^0⟩ corresponds to the commonly used error measure ‖X(β̂ − β^0)‖²₂ / n. Next, we establish the model selection consistency of the doubly sparse high-dimensional GLM estimator.

Assumption 5. There exists a constant
Note that Assumption 3 allows for graphs and sample sizes in the "large p, small n" regime, as long as the degrees are bounded or grow at a sufficiently slow rate. Assumption 4 ensures that the Fisher information matrix of the relevant covariates is nonsingular. Assumption 5 is the incoherence condition, which requires that the (potentially large number of) irrelevant covariates not be strongly correlated with the subset of relevant covariates. The incoherence condition is also assumed in [25] to obtain the ℓ2-norm consistency of the Lasso for fixed designs, and analogous conditions are required for the success of the Lasso in high-dimensional graphical models [24,50,29]. Under these assumptions, the proposed method is model selection consistent in the high-dimensional setting.

Theorem 4.2. Under Assumptions 1-5, suppose the weight
, the tuning parameter λ, and the minimum absolute nonzero coefficient β^0_min = min_{j∈J_0} |β^0_j| satisfy the required conditions. Then, as n → ∞ and n ≳ s_0^3 log p, there exists a solution β̂ of problem (2.4) such that sign(β̂) = sign(β^0) with probability tending to 1.

Remark 4.3.
Note that the quantities τ_j and d_j depend on n. If we set the tuning parameter λ ≍ √(s_0 log p / n), then β^0_min has to be greater than C_1 n^{(3c_1 + c_2 − 1)/2} for some constants C_1 > 0 and 0 < 3c_1 + c_2 < 1 to ensure model selection consistency. Compared with the estimation consistency in Theorem 4.1, we need the stronger assumption n ≳ s_0^3 log p; the extra factor of s_0^2 is needed to ensure the consistency of the sample Fisher information matrix. Note that the theoretical results in this section require the true graph G to be known. If G is unknown, a data-splitting procedure can be used: first estimate the graph G on one part of the data, and then use the estimated graph to estimate the coefficient parameters on the remainder.

Simulation study
In this section, we conduct simulations to compare the performance of the graphical-model-based doubly sparse generalized linear model (GDSGLM) with other existing penalized methods: the Lasso, ridge regression, the elastic net (Enet), and the recently proposed sGLMg method [51]. Throughout the simulation studies, we assume that the covariates are normally distributed with zero mean. Different covariance structures are explored, and the graph G is given by the estimated precision matrix. We denote by GDSGLM-O and sGLMg-O the GDSGLM and sGLMg methods with knowledge of the true graphs. We also present the so-called oracle estimate GLM-O, obtained by maximum likelihood on the true subset of predictors without utilizing the graph structure.
In general, the tuning parameter λ, the weight exponent γ, and the mixing parameter ξ in (2.4) can be chosen by a cross-validation (CV) procedure, as in [15], but this may be time-consuming in high-dimensional settings. Therefore, throughout the simulations, we use cross-validation to select only the tuning parameter λ, while setting τ_j = d_j^γ with γ = log(2)/{2 log(3)}, as suggested in [27,51], and ξ = max_j τ_j.
In the simulations, we generate the data from the logistic regression model; that is, y|X ∼ Bernoulli{p(Xβ)}, where p(u) = e^u / (1 + e^u) is applied element-wise.
For each example, the simulated data are divided into three parts: a training set, an independent validation set, and an independent test set. All model fitting is based on the training set. As we assume the predictors follow a multivariate normal distribution, we estimate the graph structure by the graphical Lasso method [9] using the training set, where the tuning parameter for the graphical Lasso is selected by the huge.select function in the R package huge. We use the validation set to select the tuning parameters and the independent test set to compare the different methods. For each example, we set the dimension p = 100. We consider two cases of sample sizes: (I) 80/80/500 and (II) 120/120/500, where the three numbers denote the sizes of the training, validation, and test sets, respectively. For comparison, we consider the three different predictor structures in [46] and [51]. As we only assume the X_i are sub-Gaussian vectors, we also consider a simulation setting with non-Gaussian X_i following a uniform distribution.
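A sketch of this kind of data-generating step, Gaussian predictors with a block-structured covariance and a Bernoulli response through the logistic link, is given below. The specific block size, within-block correlation, and β^0 are illustrative choices of ours, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, block = 80, 100, 10

# Block-diagonal covariance: equicorrelation 0.6 within each block of 10 predictors.
Sigma_block = np.full((block, block), 0.6) + 0.4 * np.eye(block)
Sigma = np.kron(np.eye(p // block), Sigma_block)

# Sparse true coefficients supported on the first block only.
beta0 = np.zeros(p)
beta0[:block] = 0.5

# Logistic model: y | X ~ Bernoulli(p(X beta0)) with p(u) = e^u / (1 + e^u).
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
prob = 1.0 / (1.0 + np.exp(-(X @ beta0)))
y = rng.binomial(1, prob)
```

Because the support of β^0 coincides with one block of the covariance, the predictors in a neighborhood are jointly relevant, matching the structure the graph-based penalty is designed to exploit.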
Example 4 (block-diagonal Ω). Let p = 100, and let β^0 be the same as in Example 1. We generate non-Gaussian (uniformly distributed) predictors with a block-diagonal precision matrix Ω.

We adopt the following measures to compare the performance of the different approaches: the estimation accuracy, measured by the ℓ2 distance ‖β̂ − β^0‖_2; the prediction misclassification error (P), computed on the test set; and the false positive rate (FPR) and false negative rate (FNR) for variable selection accuracy. All the true predictor graphs (defined by Ω) in the above three examples are displayed in Figure 1. We simulate 50 data sets for each example.

Tables 1 and 2 present the simulation results for the first example. The results show that the GDSGLM method has the best estimation accuracy, misclassification error, false positive rate, and false negative rate among all the competing methods. The performances of sGLMg-O and GDSGLM-O are close to those of sGLMg and GDSGLM because the estimated graph is very accurate. The performance of the GLM-O method is not very good due to the high correlation among the predictors.

Tables 3 and 4 show the simulation results for the second example. The GDSGLM and sGLMg methods outperform the other, non-graph-based methods. Moreover, by adding an ℓ1 penalty, the proposed GDSGLM method improves on the sGLMg method and achieves the smallest misclassification error and the lowest FPR and FNR.

Tables 5 and 6 display the comparative results for the third example. The results are consistent with the previous two examples. Compared with the sGLMg (sGLMg-O) method, the GDSGLM (GDSGLM-O) method has smaller estimation errors for β and smaller misclassification errors for prediction, owing to the additional ℓ1 penalty. Notably, in the estimated predictor graph setting, the GDSGLM method even improves on sGLMg-O. We notice that the GDSGLM method has a much smaller FPR with a slightly higher FNR compared to sGLMg.
Tables 7 and 8 display the comparative results for the non-Gaussian example. Our method still outperforms all the other methods in this non-Gaussian setting.

Real data example
The human microbiome has received great interest in medical research, as the microbiome composition has been linked to many aspects of human health. In order to understand why only a subset of patients benefit from immunotherapy in cancer treatment, Matson et al. [22] conducted studies to determine whether some microbiome species can be used to classify metastatic melanoma patients' response to immunotherapy. The microbiome sequencing data (16S sequencing) include 38 cancer patients with 153 operational taxonomic units (OTUs). As the dataset is zero-inflated, we first filter out the variables with more than 50 percent zero samples and combine identical species. After pre-processing, we obtain 33 microbiome OTUs as predictors. The goal of the study is to use the commensal microbial composition to predict the clinical response, i.e., whether or not a patient will benefit from the immunotherapy.

A doubly sparse logistic model incorporating the graphical structure (2.4) is used to analyze the dataset. We compare the performance of the GDSGLM method with sGLMg [51], the Lasso, ridge regression, the adaptive Lasso, and the elastic net. We normalize the data and split them equally into training and test sets. Cross-validation (CV) is used to compare the different approaches. We use the graphical Lasso [9] to estimate the predictor (OTU) graph G based only on the training data; Figure 2 shows the estimated graph structure of the OTU features based on all the data. The training set is used to build the models, and the test set is used to compute all the error measures, including the percentage of correct classifications (PCC), the specificity, and the sensitivity. An inner five-fold CV procedure is used to select the tuning parameters [15]. We repeat the equal-splitting cross-validation process 50 times.

Table 9: Comparison of various methods on the microbiome sequencing dataset.
Table 9 shows the average misclassification errors of all methods. The GDSGLM method provides the best classification, with the highest PCC and a much higher sensitivity but a relatively lower specificity. Based on the 50 repetitions of two-fold cross-validation, we obtain 100 fitted models for each method. The GDSGLM method selects about 16 OTUs on average. Five OTUs are selected more than 75 times by the GDSGLM method, namely g_Adlercreutzia [33], g_Collinsella [38], g_SMB53 [13], g_Odoribacter [31], and g_Sutterella [3]. All five of these bacterial species have been shown in the literature to influence patients' response to immunotherapy. Further biological experiments are required to check specifically whether these OTUs are closely related to anti-PD-1 efficacy in patients with metastatic melanoma.

Discussion
In this paper, we investigate the doubly sparse penalized estimator for high-dimensional GLMs that exploits the graphical structure of the predictors. We establish tight finite-sample bounds for both estimation and prediction, as well as model selection consistency in the ultra-high dimensional setting. Relevant directions for future work include generalizing the statistical consistency results to nonconvex regularized M-estimators. In particular, it would be interesting to extend the theory to the minimization of the nonsmooth hinge loss for classification.
In this paper, Assumption 1 requires that the neighbors of true predictors are also true predictors. This assumption is somewhat restrictive and excludes some common graphs, for example, chain graphs. A sensitivity study is discussed in [46], which shows that if Assumption 1 is not seriously violated, the proposed method still performs well. Here, we outline a possible direction for relaxing this assumption. Let |N_j ∩ N_k| = p* for j ∈ J_0 and k ∈ J_0^c; for example, p* ≤ 2 for the chain structure. The proof of Theorem 4.1 would need to be modified by adding a small correction term δ(p*) to the corresponding bound. To extend the results of Theorem 4.1 to more general graphs, one has to control this additional term δ(p*), which warrants future investigation.

Appendix A: Proofs of main theorems
In this section, we provide the proofs of the main theorems stated in the paper. The proof of Theorem 4.1 begins with some technical lemmas. The following lemma is quoted from Lemma 2 in [46]. Lemma A.1. For any predictor graph G and positive weights τ_1, τ_2, . . . , τ_p and ξ, suppose V^(1), V^(2), . . . , V^(p) is an optimal decomposition of β ∈ R^p. Then for any S ⊂ {1, 2, . . . , p}, {V^(j) : j ∈ S} is also an optimal decomposition of Σ_{j∈S} V^(j).

A.1. Proof of Theorem 4.1
Proof. We first show that R(β) is a norm and is decomposable with respect to a pair of subspaces (M, M^⊥). By Lemma 3.2 in [36], R(β) ≥ 0, with equality only when β = 0. For u ∈ R\{0}, positive homogeneity holds, i.e., R(uβ) = |u|R(β). To verify the triangle inequality, let V^(j), j = 1, . . . , p, be an optimal decomposition of the regression parameter vector β, and let W^(j), j = 1, . . . , p, be an optimal decomposition of another regression parameter vector θ. Then {V^(j) + W^(j), j = 1, . . . , p} is a feasible (though not necessarily optimal) decomposition of β + θ, so the minimality defining R gives R(β + θ) ≤ R(β) + R(θ). Therefore, R(β) is a norm. If β ∈ M and θ ∈ M^⊥, then β and θ have non-overlapping supports. By Assumption 1, any optimal decomposition V^(j), j = 1, . . . , p, of β and any optimal decomposition W^(j), j = 1, . . . , p, of θ also have non-overlapping supports. Since the supports are mutually exclusive, the triangle inequality holds with equality, and R(β) is decomposable with respect to the subspaces M and M^⊥. Next, we show that the event {λ ≥ 4R*(∇L_n(β^0))} holds with high probability. Define u to be a p × 1 vector and let u_{N_j} be constrained to have support matching V^(j), i.e., u_{N_j} has nonzero elements only at coordinates k ∈ N_j. By the proof of Lemma 3 in [27], the dual norm R* then admits a neighborhood-wise representation. In the following, we show that there exist universal constants (c, c_1, c_2) such that the desired tail bound holds. For each 1 ≤ i ≤ n and 1 ≤ j ≤ p, define the random variable Z_{ij}, and note that E[X_{ij}^2] ≤ κ_u^2. Since n ≳ log p, the {X_{ij}, j = 1, . . . , p} are sub-Gaussian and the squared X_{ij}'s are sub-exponential, so there exist universal constants (c_1, c_2) such that P(A^c) ≤ c_1 exp(−c_2 n). We now focus on the second term on the right-hand side of (A.1). For any t ∈ R, the corresponding bound holds, where b(·) is the cumulant generating function of the underlying exponential family. Thus, by a Taylor series expansion, there exists some ν_i ∈ [0, 1] such that the stated inequality holds, where the inequality follows from the boundedness of b''(·).
Consequently, conditioned on the event A, the variable n^{-1} Σ_{i=1}^n Z_{ij} is a sub-Gaussian random variable with parameter at most κ = C max_{j=1,...,p} E[X_{ij}^2], for each j = 1, . . . , p. Applying a union bound over j, it follows that ∇L_n(β^0) ∈ R^p is a zero-mean sub-Gaussian random vector. By the assumption that λ(τ_min + ξ) ≥ c d_max √(log p / n) for some constant c, we therefore have λ ≥ 4R*(∇L_n(β^0)). Next, fix an arbitrary optimal decomposition S^(1), S^(2), . . . , S^(p) of β^0 and an arbitrary optimal decomposition T^(1), T^(2), . . . , T^(p) of β̂ − β^0. Then, by Assumption 1, the decompositions satisfy the displayed relations; combining them with the analogous bounds for the complementary terms yields the claimed inequality. Let J_0 = {j : β^0_j ≠ 0} be the true support of β^0, and let Δ̂ = β̂ − β^0 be the local optimal error vector, where β̂ is an arbitrary local optimum of (3.1). With F defined as displayed, the optimal error Δ̂ = β̂ − β^0 must satisfy F(Δ̂) ≤ 0. Since the distribution belongs to the exponential family, L_n(β) is a convex and differentiable loss function.
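The union-bound step above can be checked numerically: for i.i.d. sub-Gaussian scores, the maximum over p coordinates of the empirical averages concentrates at the rate √(log p / n). The sketch below uses standard Gaussian variables as a stand-in for the actual scores Z_{ij} (an assumption made purely for illustration):

```python
import numpy as np

# Monte Carlo illustration of the union-bound step: for i.i.d. sub-Gaussian
# Z_ij (standard normals here, purely as a stand-in), the statistic
# max_j |n^-1 sum_i Z_ij| is a small constant multiple of sqrt(log p / n).
rng = np.random.default_rng(0)
n, p = 5000, 200
Z = rng.standard_normal((n, p))        # n x p matrix of scores
stat = np.max(np.abs(Z.mean(axis=0)))  # max over coordinates of |average|
rate = np.sqrt(np.log(p) / n)          # theoretical sub-Gaussian rate
print(stat, rate)
```

With these dimensions the ratio stat/rate is a modest constant, matching the sub-Gaussian tail bound used in the proof.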

A.2. Proof of Corollary 4.1
Proof. We start from the first-order stationarity condition in [21]. Setting β = β^0 in this condition, and letting J_0 = {j : β^0_j ≠ 0} denote the true support of β^0, the condition λ ≥ 2R*(∇L_n(β^0)) yields the stated inequality. Then, substituting the ℓ_2 bound from Theorem 4.1 yields the desired result.
We first provide a lemma, quoted from [43], bounding sample covariance matrices built from sub-Gaussian ensembles.
Lemma A.2. There exist universal constants {c_j}_{j=1}^3 such that, for any row-wise sub-Gaussian random matrix X ∈ R^{n×p}, the sample covariance Σ̂ = n^{-1} Σ_{i=1}^n X_i X_i^T satisfies the stated bounds for all δ > 0.
Proof. The proof can be obtained via a discretization argument in [43].
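Lemma A.2 can likewise be illustrated by simulation. For a standard Gaussian design (one particular sub-Gaussian ensemble with population covariance I), the spectral-norm deviation of the sample covariance from the identity is of order √(p/n) + p/n. The multiplicative constants in the check below are loose and chosen only so that the comparison holds comfortably:

```python
import numpy as np

# Simulation of the sub-Gaussian covariance bound in Lemma A.2:
# ||Sigma_hat - I||_op is of order sqrt(p/n) + p/n for Gaussian rows.
rng = np.random.default_rng(1)
n, p = 2000, 50
X = rng.standard_normal((n, p))            # rows are sub-Gaussian (Gaussian)
Sigma_hat = X.T @ X / n                    # sample covariance matrix
deviation = np.linalg.norm(Sigma_hat - np.eye(p), 2)  # spectral norm
rate = np.sqrt(p / n) + p / n              # theoretical deviation rate
print(deviation, rate)
```

For Gaussian rows the extreme eigenvalues of Σ̂ concentrate near (1 ± √(p/n))^2, so the observed deviation sits at a small constant multiple of the rate.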

A.3. Proof of Theorem 4.2
Proof. By construction, the dual variables are bounded by 1 in absolute value for each k ∈ N_j. As some predictors may reside in more than one neighborhood, following [46], the estimate needs to satisfy the stated conditions for each i_1, i_2 ∉ Ĥ and j ∈ N_{i_1} ∩ N_{i_2}.
Then, any solution β̂ satisfies the displayed equation, with the quantities defined for each 1 ≤ j ≤ p. When event Ω_1 occurs, we have sign(β̂_j) = sign(β^0_j) for each j ∈ J_0. When event Ω_2 occurs, we obtain V̂^(j)_{N_j} = 0 for each j ∈ J_0^c. Furthermore, by equation (A.10), we obtain the stated bound, where the last inequality uses the facts that |f̂_j| ≤ τ_j and |ĥ_j| ≤ ξ for each j.
Because the X_i's are sub-Gaussian and b''(·) is uniformly bounded by assumption, for any v, w ∈ R^p the quantity v^T ∇^2 L_n(β̂) w = n^{-1} Σ_{i=1}^n b''(X_i^T β̂)(v^T X_i)(w^T X_i) is the i.i.d. average of products of the sub-Gaussian variables b''(X_i^T β̂) v^T X_i and w^T X_i. By Lemma A.2, a standard discretization argument over the s_0-dimensional unit sphere then yields the desired bound.