Bi-selection in the high-dimensional additive hazards regression model

In this article, we consider a class of regularized regression under the additive hazards model with censored survival data and propose a novel approach to achieve simultaneous group selection, variable selection, and parameter estimation for high-dimensional censored data, by combining a composite penalty with the pseudoscore. We develop a local coordinate descent (LCD) algorithm for efficient computation and subsequently establish the theoretical properties of the proposed selection methods. As a result, the selectors possess both the group selection oracle property and the variable selection oracle property, and thus enable us to simultaneously identify important groups and important variables within selected groups with high probability. Simulation studies demonstrate that the proposed method and the LCD algorithm perform well. A real data example is provided for illustration.

MSC2020 subject classifications: Primary 62N01, 62N02; secondary 62F12.


Introduction
Recent advances in experimental technology have enabled clinical studies to achieve goals that once seemed impossible. Most notably, the availability of high-dimensional medical data, where the number of variables p is substantially greater than the sample size n, introduces new opportunities in modern healthcare research, but the high dimensionality poses tremendous challenges to classical statistical methodology. Fortunately, the true regression model generally possesses a sparsity property, meaning that only a small number of components are nonzero. Gene expression data, for example, often involve tens of thousands of potential covariates, yet merely a handful of them are actually related to the development of a particular disease. Variable selection techniques are effective tools for reducing dimensionality: they retain the few covariates whose contribution to the outcome exceeds a certain threshold, resulting in a simpler model for interpretation and more efficient parameter estimates. One type of variable selection method frequently adopted for high-dimensional data is the regularization approach, which achieves dimension reduction by adding a penalty function to the loss function. An advantage of these methods is their capability to simultaneously identify important variables and provide parameter estimates. Commonly used penalties include the least absolute shrinkage and selection operator (LASSO) [22], the smoothly clipped absolute deviation (SCAD) penalty [7], the adaptive LASSO [32], and the minimax concave penalty (MCP) [28], among others.
Furthermore, variable selection becomes especially challenging when the dataset exhibits group structure. Examples include multilevel categorical covariates expressed in a regression model by a group of dummy variables; a continuous covariate represented by a set of basis functions; genetic markers from the same gene treated as a group in genetic association studies; and, in gene expression analysis, genes in the same biological pathway forming a natural group. Among others, [26,13,27,21] considered penalty-based group selection methods. [12] implemented a group bridge penalty to achieve bi-level selection, simultaneously selecting important groups and important variables within selected groups. Complementing this methodology, [3] subsequently proposed the local coordinate descent (LCD) algorithm to compute the bi-level selection estimates in generalized linear models.
Most survival data involve censoring, which adds complexity to the data structure and difficulty to regression modeling; nevertheless, a large body of literature has proposed approaches that specifically address variable selection at the individual and group levels based on the Cox model. For instance, [23,29,8] extended the LASSO, the adaptive LASSO, and the nonconcave penalized likelihood approach to the Cox model to select statistically significant individual variables. Building on that, [19] applied supervised group LASSO penalization to the Cox model to select important variable groups. Subsequently, [11] discussed the capability of the group bridge approach to simultaneously select important variables at both the individual and group levels in the framework of the Cox model. Furthermore, [4] studied the identification of regression structure under the Cox model by a penalized group selection method with concave penalties.
As an alternative to the popular Cox model, [5,15] considered an additive hazards model, which relaxes the proportional hazards assumption by modeling the risk difference. Variable selection in the additive hazards model has drawn much attention recently. In a fixed-dimensional setting, [14] introduced a weighted LASSO approach, and [20] discussed several regularization schemes, including the LASSO, adaptive LASSO, and Dantzig selector. In a high-dimensional setting, [31] developed tests for coefficients; [30] studied the properties of the weighted LASSO; and [16,25] explored regularized least squares and penalized empirical likelihood for sparse models, respectively. Under the framework of the additive hazards model for high-dimensional data, we propose a novel approach that captures group structure while retaining sparsity of covariates, so that it simultaneously selects important variables at the individual and group levels while providing parameter estimates. This is achieved by combining a composite penalty with the pseudoscore method, where the number of covariates p is allowed to grow nonpolynomially with the sample size n. The asymptotic properties of the proposed estimators include both the group selection oracle property and the variable selection oracle property: important groups and important variables within selected groups are consistently identified, and the resulting estimators are asymptotically normal under some regularity conditions. Furthermore, we adopt the local coordinate descent algorithm first proposed by [3], and demonstrate its effectiveness through simulation studies and real data analysis.
The remainder of the paper is organized as follows. In Section 2, we describe the penalized pseudoscore inference procedure, explore suitable penalty functions, and introduce the local coordinate descent algorithm. Theoretical properties of the estimators are studied in Section 3. We conduct simulation studies to evaluate the performance of the proposed method in Section 4, and in Section 5, we show an application of the proposed method to the breast cancer dataset. We draw some concluding remarks in Section 6 and relegate proofs of the key results to the Appendix.

Model setting and penalized procedure
Suppose that the failure time T^U satisfies the following additive hazards model:

λ(t | Z) = λ_0(t) + β_0^T Z(t),   (2.1)

where λ_0(t) is an unspecified baseline hazard function, Z(t) = (Z_1(t), . . . , Z_p(t))^T is a p-dimensional vector of possibly time-dependent covariates, and β_0 is the true regression parameter. Suppose the covariates are partitioned into K nonoverlapping groups, with A_k ⊂ {1, . . . , p} representing the k-th known group, and let (β_{k1}, . . . , β_{kJ_k})^T be the J_k-dimensional vector of regression coefficients in the kth group. Let C be a censoring time, T = T^U ∧ C the observed survival time, and Δ = I(T^U ≤ C), where I(·) is an indicator function. We assume that the failure time T^U and the censoring time C are independent given the covariate Z(·). The observed data then consist of (T_i, Δ_i, Z_i(·)) for subjects i = 1, 2, . . . , n.
Define the observed failure counting process N_i(t) = I(T_i ≤ t, Δ_i = 1) and the at-risk indicator Y_i(t) = I(T_i ≥ t). Following Lin and Ying (1994), the regression coefficients can be estimated by solving the pseudoscore estimating equation

U(β) = Σ_{i=1}^n ∫_0^τ {Z_i(t) − Z̄(t)} {dN_i(t) − Y_i(t) β^T Z_i(t) dt} = 0,

where Z̄(t) = Σ_{j=1}^n Y_j(t) Z_j(t) / Σ_{j=1}^n Y_j(t) is the at-risk average of the covariates, and τ is the maximum follow-up time.
This equation can be rewritten as U(β) = n(b − Vβ), where

V = (1/n) Σ_{i=1}^n ∫_0^τ Y_i(t) {Z_i(t) − Z̄(t)}^{⊗2} dt,
b = (1/n) Σ_{i=1}^n ∫_0^τ {Z_i(t) − Z̄(t)} dN_i(t),

with v^{⊗2} meaning vv^T for a vector v. Since V is positive semidefinite, integrating −U(β)/n with respect to β produces the least-squares-type loss function

L(β) = (1/2) β^T V β − b^T β.

In order to simultaneously select important groups and individual variables, we propose to obtain the estimator β̂ by minimizing the following objective function with composite penalty:

Q(β) = L(β) + Σ_{k=1}^K f_O( Σ_{j=1}^{J_k} f_I(|β_{kj}|) ),   (2.2)

where the outer and inner penalty functions f_O and f_I depend on the tuning parameter λ_n, and we assume that they are independent of k and j, so that the same penalty functions apply across different variables and different groups. Accordingly, the subscript n in λ_n is omitted, and we rewrite the penalty part as

Σ_{k=1}^K f_O( Σ_{j∈A_k} f_I(|β_j|) ),

where |β| = (|β_1|, . . . , |β_p|)^T. [18] studied a variety of penalty functions under the framework of generalized linear models. To determine the functional forms of f_O and f_I, we mainly consider the following types of penalty f_λ(·).
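To make the construction concrete, the following sketch computes V, b, and the loss L(β) from right-censored data with time-independent covariates. The piecewise-constant treatment of Z̄(t) between ordered observation times (with simple tie handling) is our own simplification for illustration, not the paper's specification.

```python
import numpy as np

def pseudoscore_terms(T, delta, Z):
    """Compute V and b of the Lin-Ying least-squares-type loss
    L(beta) = 0.5 * beta' V beta - b' beta, for time-independent Z (n x p)."""
    n, p = Z.shape
    V, b = np.zeros((p, p)), np.zeros(p)
    t_prev = 0.0
    for i in np.argsort(T):              # walk through ordered observation times
        at_risk = T >= T[i]              # risk set on the interval (t_prev, T_i]
        Zbar = Z[at_risk].mean(axis=0)   # at-risk average Zbar(t)
        Zc = Z[at_risk] - Zbar
        V += Zc.T @ Zc * (T[i] - t_prev) / n   # integral of Y_i {Z_i - Zbar}^{(x)2} dt
        if delta[i]:                           # only events contribute to b
            b += (Z[i] - Zbar) / n
        t_prev = T[i]
    return V, b

def loss(beta, V, b):
    """Least-squares-type loss L(beta)."""
    return 0.5 * beta @ V @ beta - b @ beta
```

For a two-subject toy dataset with T = (1, 2), both events, and scalar covariates Z = (0, 1), the first interval contributes V = 0.25 and b = −0.25, which can be verified by hand.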

Penalty functions
(i) The smoothly clipped absolute deviation (SCAD) penalty [6,7], given by the derivative

f'_λ(x) = λ { I(x ≤ λ) + (aλ − x)_+ / ((a − 1)λ) · I(x > λ) },   x > 0,

where a > 2 is a shape parameter.
(ii) The minimax concave penalty (MCP) proposed by [28], with the derivative

f'_λ(x) = (λ − x/a)_+,   x ≥ 0,

where a > 1 is a shape parameter.
(iii) The smooth integration of counting and absolute deviation (SICA) penalty [18], with

f_λ(x) = λ (a + 1) x / (a + x),   x ≥ 0,

where a > 0 is a shape parameter.
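As a concrete reference, the three derivatives (and the SICA penalty value) can be written down directly. The parameterization below follows the displays above; the default shape values are illustrative choices only.

```python
import numpy as np

def scad_deriv(x, lam, a=3.7):
    """SCAD derivative f'_lam(x) for x > 0, shape a > 2."""
    return lam * np.where(x <= lam, 1.0,
                          np.maximum(a * lam - x, 0.0) / ((a - 1) * lam))

def mcp_deriv(x, lam, a=3.0):
    """MCP derivative f'_lam(x) = (lam - x/a)_+, shape a > 1."""
    return np.maximum(lam - x / a, 0.0)

def sica(x, lam, a=1.0):
    """SICA penalty f_lam(x) = lam (a+1) x / (a + x), shape a > 0."""
    return lam * (a + 1) * x / (a + x)

def sica_deriv(x, lam, a=1.0):
    """Derivative of the SICA penalty: lam * a(a+1) / (a + x)^2."""
    return lam * a * (a + 1) / (a + x) ** 2
```

Near the origin all three behave like a multiple of λ|x|, while the SCAD and MCP derivatives vanish for large x, which is what yields nearly unbiased estimates of large coefficients.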
[3] set both f_O and f_I to the MCP penalty and suggested the shape parameters in f_I and f_O to be a = 3 and J_k aλ/2, respectively; the authors named this the "composite MCP (CMCP) penalty". In this paper, we construct several new composite penalties, including the composite SCAD (CSCAD) and composite SICA (CSICA), which take both f_O and f_I as the SCAD penalty and the SICA penalty, respectively. Additionally, we construct the composite MSICA penalty by taking f_O as the MCP penalty and f_I as the SICA penalty. In subsequent simulation studies and real data analysis, we compare the performance of these composite penalties with three variable selectors: MCP, SCAD, and SICA. [22] proposed to accomplish parameter shrinkage and selection for a linear regression model by minimizing the following squared loss function with LASSO penalty:

(1/(2n)) ||y − Xβ||_2^2 + λ Σ_{j=1}^p |β_j|,

Local coordinate descent algorithm
where y is the response vector, X = (X_1, · · · , X_p) is the design matrix with columns X_j, β = (β_1, · · · , β_p)^T, and ||v||_2 denotes the L_2-norm of a vector v. At the jth step of a cycle of the coordinate descent algorithm, the solution for β_j can be updated by the soft-thresholding rule

β_j ← S( (1/n) X_j^T (y − Σ_{l≠j} X_l β_l), λ ) / ((1/n) X_j^T X_j),   S(z, λ) = sgn(z)(|z| − λ)_+.

Motivated by these results, [3] developed a fast and stable local coordinate descent (LCD) algorithm for bi-level variable selectors. It approximates the composite penalty, locally in β_kj, by a term proportional to λ̃_kj |β_kj| obtained by taking a first-order Taylor expansion about the current value of β_kj, where

λ̃_kj = f'_O( Σ_{j'∈A_k} f_I(|β_{kj'}|) ) f'_I(|β_kj|)   (2.3)

for each k_j ∈ A_k. The coefficient β_kj is then updated by applying the soft-thresholding rule with threshold λ̃_kj (2.4). To minimize the penalized pseudoscore objective (2.2), we apply the LCD algorithm through the following steps:
Step 1. Choose an initial estimate β^(0);
Step 2. Let X = V^{1/2} and y = X^{−1} b; update λ̃_kj and β̂_kj cyclically according to (2.3) and (2.4) for each k_j ∈ A_k, k = 1, · · · , K;
Step 3. Repeat Step 2 until convergence.
The choice of the initial estimate is critical to the proposed algorithm. In the high-dimensional case, the ridge solution of the least-squares-type loss function L(β) is an ideal choice of β^(0), since it is easy to calculate and can be close enough to the true parameter for a suitable tuning parameter; we used this ridge solution in our simulations. The local coordinate descent (LCD) algorithm adopted here can be considered as an application of the algorithm developed in [17], and thus achieves optimal statistical rates.
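The steps above can be sketched as follows for the composite MCP in the linear-regression parameterization of Step 2 (design X, response y). Taking the outer MCP at unit regularization level with shape b_k = J_k·a·λ/2 is our reading of the composite construction, and the zero start stands in for the ridge initializer; both are illustrative assumptions.

```python
import numpy as np

def mcp_val(x, lam, a):
    # MCP penalty value for x >= 0
    return np.where(x <= a * lam, lam * x - x**2 / (2 * a), a * lam**2 / 2)

def mcp_der(x, lam, a):
    # MCP derivative (lam - x/a)_+
    return np.maximum(lam - x / a, 0.0)

def lcd_cmcp(X, y, groups, lam, a=3.0, n_iter=500, tol=1e-10):
    """LCD sketch for (1/2n)||y - X beta||^2 + sum_k f_O(sum_j f_I(|beta_kj|)),
    f_I = MCP(lam, a), f_O = MCP(1, b_k) with b_k = J_k * a * lam / 2."""
    n, p = X.shape
    beta = np.zeros(p)                 # zero start; a ridge solution also works
    d = (X**2).sum(axis=0) / n         # diagonal terms X_j'X_j / n
    r = y - X @ beta                   # current residual
    for _ in range(n_iter):
        max_change = 0.0
        for idx in groups:
            b_k = len(idx) * a * lam / 2.0
            inner = mcp_val(np.abs(beta[idx]), lam, a).sum()
            for j in idx:
                z = X[:, j] @ r / n + d[j] * beta[j]
                # first-order (Taylor) threshold: f_O'(inner) * f_I'(|beta_j|)
                lam_t = mcp_der(inner, 1.0, b_k) * mcp_der(abs(beta[j]), lam, a)
                new = np.sign(z) * max(abs(z) - lam_t, 0.0) / d[j]
                r += X[:, j] * (beta[j] - new)   # keep the residual in sync
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta
```

On a noiseless toy problem with two groups, the iteration recovers the true support and nearly unbiased estimates of the large coefficients, reflecting the oracle behavior discussed in Section 3.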

Asymptotic results
According to the sparsity of the parameter, we split the true parameter as β_0 = (β_{0A}^T, β_{0D}^T)^T with β_{0D} = 0, where A denotes the index set of the important (nonzero) variables and D its complement, |A| denotes the cardinality of A, and s = |A| is the number of important variables. Note that both the number of covariates p and the number of important variables s are allowed to depend on the sample size n throughout the paper; we omit the subscript n to simplify notation. We use Λ_min(·) to denote the minimum eigenvalue of a matrix, and use the subscript A on a vector or a matrix to denote the corresponding sub-vector or sub-matrix. For example, x_A is the |A|-dimensional vector consisting of the components {x_j, j ∈ A} of x, and V_AA is the |A| × |A| square matrix with entries v_ij, i ∈ A, j ∈ A, of V = (v_ij). In addition, we write |β_j|_{j∈A} = (|β_j| : j ∈ A)^T.
We first state the following lemma, which plays an important role in establishing the selection consistency of the estimators.

Lemma 3.1. β̂ ∈ R^p is a strict local minimizer of Q(β) if the following conditions hold

for any β_1 ∈ R^p \ B in a sufficiently small neighborhood of β̂ and for any β_2 that is the projection of β_1 onto the subspace B. Thus, a β̂ satisfying the conditions in Lemma 3.1 is indeed a strict local minimizer of Q(β) on the whole space R^p.
for all x > 0 and j = 1, . . . , p; (iv) the sample paths of Z_j(·), j = 1, . . . , p, are of uniformly bounded variation.
Condition 3. There exist constants α ∈ (0, 1], γ ∈ [0, 1/2], and c > 0 such that
Condition 1 is a very mild requirement that can be easily met by commonly used penalties. Indeed, many penalties, such as the LASSO, SCAD, MCP, and SICA penalties, satisfy f_I(0) = 0 when taken as the function f_I, and it is easy to see that the composite penalties listed in Section 2.2 satisfy Condition 1. Conditions 2 and 3 are similar to those in [16]: Condition 2 is a commonly used condition for survival models, and Condition 3 is the key condition for verifying selection consistency, ensuring that the empirical counterparts of matrices such as D_{DA} D_{AA}^{−1}, D_{AA}^{−1}, and D_{AA} are close to their population versions in an appropriate sense. We now present the conclusions regarding the group selection consistency of the proposed estimators.

Theorem 3.2 (Consistency of group selection). Suppose that Conditions 1-3 hold. Also assume that
where μ > 0, r_1 = (r + 4)/r, and c_1 = 2 + 1/(4c). Then, for some constants M, K > 0, the following conclusions hold with high probability. Part (a) of Theorem 3.2 shows that the unimportant groups can be excluded with high probability; part (b) provides the convergence rate, in the L_∞-norm, of the estimated regression coefficients of the important variables.
To explain the intuition behind the conditions in Theorem 3.2, we consider some simplified cases. For the concave and composite penalties listed in Section 2.2, we have ρ_λ(d) ≤ ρ'(0+). Thus, the first two conditions in (3.4) hold when Λ_min(D_AA) is bounded away from zero. In particular, if φ is a constant, then (3.5) is ensured by the condition n ≫ s^2 (log p)^{r_1}. This means that the dimension of the covariates is allowed to increase nonpolynomially with the sample size, as fast as log p = o(n^{1/r_1}), while the dimension of the true sparse model satisfies s = o(n^{1/2}). Furthermore, for bounded covariates, the third and fourth conditions in (3.4) reflect a requirement on the order of the regularization parameter λ, and the last inequality in (3.4) implies a lower bound on the minimum signal d. The conditions in Theorem 3.2 differ from those in the existing literature. For example, [9,2] required s = O(n^α) and log p = O(n^δ) with α, δ ∈ (0, 1). As pointed out by [16], besides the difference in model assumptions, the critical difference is that they imposed a condition on a large empirical covariance matrix [see, e.g., Condition 2 in [9] and Condition 8 in [2]]. Since the empirical covariance matrix involves the outcome variables in survival models, the more natural idea is to impose a nonrandom condition on the population covariance matrix, as in Condition 3 of this paper. This population assumption can be viewed as a high-dimensional extension of the classical asymptotic regularity conditions in the low-dimensional setting.
To state the asymptotic normality ofβ A , we define Λ 1 = Λ min (D AA ), Λ 2 = Λ min (Σ AA ), and Λ 3 = Λ min (D −1 AA Σ AA D −1 AA ). First, we present the oracle properties of the proposed estimator in the ideal case where the variables within each group are either all important or all unimportant.

where r_1 = (r + 4)/r. Then, for some constants M, K > 0, the stated conclusions hold with high probability. To remove such a strong condition and to identify the important variables within selected groups, we propose to estimate β by minimizing Q_ω(β), defined by replacing β in the penalty of Q(β) with β_ω = (ω_1 β_1, . . . , ω_p β_p)^T, where ω_j is a weight for β_j. The weighted estimator β̂_ω obtained by minimizing Q_ω(β) then satisfies the variable selection oracle property, and √n u^T (D_AA^{−1} Σ_AA D_AA^{−1})^{−1/2} (β̂_A − β_{0A}) is asymptotically distributed as standard normal for every u ∈ R^s with ||u||_2 = 1.

Theorem 3.4 (Variable selection oracle property). Let ω_min^D = min{|ω_j| : j ∈ D}. Suppose that Conditions 1–3, (3.4), and (3.6) hold. If f_O(θ) is bounded above by a constant c for all θ and λ, and ω_min^D → ∞, then there exists a root-n consistent local minimizer β̂_ω of Q_ω(β) possessing the variable selection oracle property.
The boundedness condition on f_O is mild, as it is satisfied by most penalties, such as the LASSO, SCAD, MCP, and SICA. Theorems 3.2 and 3.4 indicate that the weighted estimators possess both the group selection oracle property and the variable selection oracle property; that is, both the important groups and the important variables within selected groups can be identified with sufficiently high probability. Moreover, Corollary 3.5 provides a practical way of choosing weights that meet the requirements in Theorem 3.4. For example, one could use the LASSO estimator for the weights, since the LASSO penalty is convex and its solution is globally optimal.

Simulation studies
In this section, we conducted simulation studies to evaluate the finite-sample properties of the proposed method. To this end, we generated survival data from the additive hazards regression model λ(t | Z) = λ_0(t) + β_0^T Z(t), where β_0 = (β_1, . . . , β_p)^T and Z = (Z_1, . . . , Z_p)^T were subject to λ_0(t) + β_0^T Z(t) > 0. To generate the covariates Z(t), we first simulated R_1, . . . , R_p independently from the standard normal distribution, and M_1, . . . , M_K from an AR(1) model with standard normal marginals and Cov(M_{j_1}, M_{j_2}) = 0.5^{|j_1 − j_2|} for j_1, j_2 = 1, . . . , K. In the following examples, we considered two cases for generating the covariates Z_j.
Case 1: The covariates Z_j are independent of time t and were generated from R_j and M_{g_j}, where g_j is the number of the group that Z_j belongs to. In this case, we set λ_0(t) = 1.
Case 2: The covariates Z_j depend on time t. We first generated η_j from R_j and M_{g_j}, with g_j as in Case 1, and then set Z_j(t) = η_j t and λ_0(t) = 2t.
We assumed C ∼ Uniform(τ/2, τ) for the censoring time, where the value of τ was chosen so that the censoring rate reached 25%. We set the sample size to n = 250 in Case 1 and n = 600 in Case 2.
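A data-generating sketch for Case 1 is given below. Since the paper does not state the exact mixing of R_j and M_{g_j}, we assume equal weights, Z_j = (R_j + M_{g_j})/√2, as an illustrative choice; with λ_0(t) = 1 the hazard 1 + β_0^T Z is constant in t, so failure times are exponential, and covariate draws violating the positivity constraint are simply regenerated.

```python
import numpy as np

def gen_case1(n, p, K, beta0, tau, rng):
    """Generate (T, delta, Z) from the additive hazards model with lambda_0 = 1.
    Assumes Z_j = (R_j + M_{g_j}) / sqrt(2) (illustrative mixing weights)."""
    J = p // K                              # equal group size
    g = np.repeat(np.arange(K), J)          # group label of each covariate
    Z = np.empty((n, p))
    for i in range(n):
        while True:
            R = rng.standard_normal(p)
            # AR(1) group effects with Cov(M_{j1}, M_{j2}) = 0.5^{|j1 - j2|}
            M = np.empty(K)
            M[0] = rng.standard_normal()
            for k in range(1, K):
                M[k] = 0.5 * M[k - 1] + np.sqrt(0.75) * rng.standard_normal()
            z = (R + M[g]) / np.sqrt(2.0)
            if 1.0 + z @ beta0 > 0:         # enforce a positive hazard
                Z[i] = z
                break
    h = 1.0 + Z @ beta0                     # constant hazard per subject
    Tu = rng.exponential(1.0 / h)           # failure times
    C = rng.uniform(tau / 2, tau, size=n)   # censoring times
    T = np.minimum(Tu, C)
    delta = (Tu <= C).astype(int)
    return T, delta, Z
```

In practice τ would be tuned so that the empirical censoring rate is about 25%, matching the setup above.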
Example 1. We considered p = 50, so that the dimensionality of the covariates is comparable to, but smaller than, the sample size. The true coefficient β_0 had values (v^T, v^T, v^T, 0^T)^T, and the covariates were divided into 10 groups of equal size five. Thus, the sparsity dimension was s = 6 and there were 3 important groups.
Example 2. We considered p = 500 to compare the performance of the various methods when p is larger than n. As in Example 1, the true coefficient β_0 had values (v^T, v^T, v^T, 0^T)^T, and the variables were divided into 100 groups of equal size five. In this example, there were 6 important variables and 3 important groups.
Example 3. We use this example to further demonstrate the robustness of the proposed method with respect to the sparsity assumption. Similar to the setup of Example 2, we considered p = 500 and divided the variables into 100 groups of equal size five. We set the first 18 elements of β_0 to (v^T, v^T, v^T)^T and randomly chose 24 elements from positions 19 to 400 to take the values {1, −1, . . . , 1, −1}, so that the number of important groups reached 12, with a total of 30 important variables.

The simulation results, based on 200 replicates, are summarized in Tables 1–3. We compared four types of composite penalties and three types of variable selection penalties: composite MCP (CMCP), composite SICA (CSICA), composite SCAD (CSCAD), MCP-SICA (MSICA), MCP, SICA, and SCAD. The tuning parameter was chosen by 5-fold cross-validation. In the tables, we report the rate of correctly identifying the important groups (GTPR), the rate of incorrectly selecting unimportant groups (GFPR), the rate of correctly identifying the important variables (TPR), the rate of incorrectly selecting unimportant variables (FPR), and the L_2-loss measuring estimation accuracy, for both cases. Table 1 indicates that when p < n, all methods perform well in both cases in terms of the numbers of selected important groups and variables. Table 2 shows that when the model is sparse enough but p > n, CSICA outperforms the others in correctly excluding unimportant groups. In addition, the computed L_2-loss results indicate that CSICA is more efficient in group selection, with a higher correct-selection rate and thus higher accuracy. Table 3 provides further evidence that CSICA tends to select a sparser model than the other selectors as the number of important variables or important groups increases.
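The selection metrics reported in the tables can be computed as follows; treating a group as selected when any of its coefficients is nonzero is our reading of the group-level rates (GTPR/GFPR).

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, groups):
    """TPR/FPR at the variable level, GTPR/GFPR at the group level, and L2-loss."""
    sel, imp = beta_hat != 0, beta_true != 0
    tpr = (sel & imp).sum() / imp.sum()            # important variables found
    fpr = (sel & ~imp).sum() / (~imp).sum()        # unimportant variables kept
    g_sel = np.array([sel[idx].any() for idx in groups])
    g_imp = np.array([imp[idx].any() for idx in groups])
    gtpr = (g_sel & g_imp).sum() / g_imp.sum()     # important groups found
    gfpr = (g_sel & ~g_imp).sum() / max((~g_imp).sum(), 1)
    l2 = np.linalg.norm(beta_hat - beta_true)      # estimation accuracy
    return dict(TPR=tpr, FPR=fpr, GTPR=gtpr, GFPR=gfpr, L2=l2)
```

Averaging these quantities over replicates reproduces the columns of Tables 1–3.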
To compare the performance of the best model on the solution path, assuming the true sparse model is known, we recorded the maximum number of correctly selected variables among all models on the solution path and averaged it over all replicates. The results for Cases 1 and 2 are presented in Figures 1–3. Figures 1 and 2 show that, under the sparse model, all of the proposed selectors perform well, identifying all important variables as soon as the model size reaches the true sparsity dimension. In contrast, when p > n and s is large, Figure 3 shows that the proposed bi-selection method is comparable to the variable selection method in Case 1 and outperforms it in Case 2.

An application
We apply the proposed method to the breast cancer dataset containing metastasis-free survival times. In the study of [24], 295 patients with primary breast carcinomas were classified as having a gene-expression signature associated with either a poor or a good prognosis. We focus on the 144 patients with lymph-node-positive disease, for whom the censoring rate is 66%. The dataset is available in the R package "penalized". The primary objective of this study is to identify key risk factors affecting the survival time of breast cancer patients. We consider 5 clinical risk factors and 70 gene expression measurements: diameter of the tumor (1 for ≥ 2 cm and 0 for < 2 cm), number of affected lymph nodes (1 for 1–3 and 0 for ≥ 4), estrogen receptor status (1 for positive and 0 for negative), grade of the tumor (1 for well differentiated and 0 otherwise), age of the patient at diagnosis, and expression measurements of 70 prognostic genes.
Huang et al. [11] analyzed this dataset using the group bridge penalty, considering multiplicative covariate effects on the hazard. Here we suppose the survival time follows the additive hazards model (2.1). We conduct the group and variable selection procedures by minimizing the penalized pseudoscore function (2.2) with the composite penalties (CMCP, CSCAD, CSICA, MSICA) and variable selectors (MCP, SICA, and SCAD). We begin by reducing the model dimension to 50 through screening out the 25 least important variables, and then group the remaining relatively important variables into 8 distinct categories using dynamic clustering. Figure 4 shows the selection results in panels (a) and (b), where the optimal tuning parameter is chosen by 10-fold cross-validation. For comparison, panel (c) displays the results using the group bridge, adaptive LASSO, and group LASSO with the AIC principle as in Huang et al. [11], where each block represents a group. The following findings emerge from these figures: (i) the selectors identify more important variables under the multiplicative hazards model than under the additive hazards model; (ii) under the additive hazards model, all selectors identify the genes GNAZ and SCUBE2 as important; (iii) most of the selectors identify 4 important groups under the additive hazards model, compared to 8 important groups under the Cox model; (iv) the proposed method with CSICA selects a sparser model than the others.

Concluding remarks
In this article, we studied a class of regularized regression methods under the additive hazards model. The proposed approach consistently identifies both important groups and important variables within selected groups. Furthermore, we established the asymptotic properties of the proposed estimators for high-dimensional data. To compute the penalized estimators efficiently, we developed the local coordinate descent (LCD) algorithm, which approximates the penalties by their first-order linear parts, and adopted cross-validation to determine the optimal tuning parameter. The numerical studies demonstrated that the proposed penalized approaches and the LCD algorithm work well.
The assumption that covariates have solely linear effects may be unrealistic; the true covariate effects may be more complex, so it is important to accommodate potential nonlinearity, especially when continuous covariates are involved. [10,4] studied a partially linear Cox model including linear and nonlinear components. Under the additive hazards framework, a further study is to consider the partially linear additive hazards model conditional on covariates Z and X:

λ(t | Z, X) = λ_0(t) + β^T Z(t) + φ_1(X_1(t)) + · · · + φ_d(X_d(t)),

where λ_0 is an unspecified baseline hazard function, β is a p-dimensional regression parameter, and φ_1, . . . , φ_d are unknown smooth functions with d much smaller than p. A future research opportunity is to extend the approach proposed in this paper to such models with both linear and nonlinear covariate effects, while still simultaneously selecting important groups and important variables.