A Convex-Nonconvex Strategy for Grouped Variable Selection

This paper deals with the grouped variable selection problem. A widely used strategy is to augment the negative log-likelihood function with a sparsity-promoting penalty. Existing methods include the group Lasso, group SCAD, and group MCP. The group Lasso solves a convex optimization problem but is plagued by underestimation bias. The group SCAD and group MCP avoid this estimation bias but require solving a nonconvex optimization problem that may be plagued by suboptimal local optima. In this work, we propose an alternative method based on the generalized minimax concave (GMC) penalty, which is a folded concave penalty that maintains the convexity of the objective function. We develop a new method for grouped variable selection in linear regression, the group GMC, that generalizes the strategy of the original GMC estimator. We present an efficient algorithm for computing the group GMC estimator and also prove properties of the solution path to guide its numerical computation and tuning parameter selection in practice. We establish error bounds for both the group GMC and original GMC estimators. A rich set of simulation studies and a real data application indicate that the proposed group GMC approach outperforms existing methods in several different aspects under a wide array of scenarios.


Introduction
Consider the classical linear regression setting where the data have been generated according to the following model:

$$y = X\beta + \epsilon, \tag{1}$$

where $y \in \mathbb{R}^n$ is a response vector, $X \in \mathbb{R}^{n \times p}$ is a fixed design matrix whose columns are $p$ covariate variables, $\beta \in \mathbb{R}^p$ is a vector of regression coefficients, and $\epsilon$ is a vector of independent noise variables with mean zero and variance $\sigma^2$. In modern statistical applications, we often have $p \gg n$, where the ordinary least squares estimator is not well defined.
A natural strategy to address this issue is to assume that $\beta$ is sparse so as to improve both the prediction accuracy and the interpretation of the model. To that end, Tibshirani (1996) developed the least absolute shrinkage and selection operator (Lasso), which performs both coefficient estimation and variable selection. The Lasso estimator is a solution to

$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1,$$

where the objective function is a sum of the squared error loss, which represents the lack-of-fit, and the $\ell_1$-norm penalty, which encourages sparsity in the estimated model. The $\ell_1$-norm is able to perform variable selection because of its singularity at the origin. The nonnegative tuning parameter $\lambda$ balances the trade-off between the goodness-of-fit and the model complexity.
The Lasso is one of the most popular penalized regression formulations for selecting individual variables but cannot immediately deal with certain types of structured sparsity.
For example, in many statistical applications, variables may have a natural group structure.
A classic example in regression is the encoding of a single categorical variable using a group of dummy variables. In such a case, what we need is a method for selecting, or not selecting, the entire set of dummy variables. The most prominent work in grouped variable selection is the group Lasso (Yuan and Lin, 2006), which is a natural extension of the Lasso and solves the following penalized least squares problem:

$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2n}\Big\|y - \sum_{j=1}^{J} X_{\cdot,j}\beta_j\Big\|_2^2 + \lambda\sum_{j=1}^{J} K_j\|\beta_j\|_2, \tag{2}$$

where the $p$ covariates are divided into $J$ groups, $\beta = (\beta_1^T, \ldots, \beta_J^T)^T \in \mathbb{R}^p$ with $\beta_j \in \mathbb{R}^{p_j}$ and $\sum_{j=1}^{J} p_j = p$. The matrix $X_{\cdot,j}$ is the submatrix of $X$ whose columns correspond to the variables in the $j$-th group. The $K_j$'s are nonnegative weights used to adjust for the group sizes. A typical choice of $K_j$ is $\sqrt{p_j}$. The group Lasso employs the $\ell_2$-norm of the group coefficients as a component of the penalty function. One can also view the penalty as applying the $\ell_1$-norm to the vector of $\ell_2$-norms of the groups, which enforces sparsity at the group level while encouraging ridge regression-like shrinkage within a group.
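As a small illustration of the group Lasso penalty term described above, the following sketch evaluates $\lambda\sum_j K_j\|\beta_j\|_2$ with the typical weights $K_j = \sqrt{p_j}$. The function name `group_lasso_penalty` and the representation of groups as lists of column indices are assumptions made for this example only.

```python
import math

def group_lasso_penalty(beta, groups, lam):
    """Return lam * sum_j K_j * ||beta_j||_2 with K_j = sqrt(p_j).

    beta   : list of coefficients
    groups : list of index lists, one per group (hypothetical layout)
    lam    : nonnegative tuning parameter
    """
    total = 0.0
    for idx in groups:
        norm_j = math.sqrt(sum(beta[i] ** 2 for i in idx))  # l2-norm of the group
        k_j = math.sqrt(len(idx))                            # group-size weight
        total += k_j * norm_j
    return lam * total

# Two groups: the first is active, the second is entirely zero.
beta = [3.0, 4.0, 0.0, 0.0, 0.0]
groups = [[0, 1], [2, 3, 4]]
penalty = group_lasso_penalty(beta, groups, lam=0.5)
```

Only the first group contributes here, since a group whose coefficients are all zero has zero $\ell_2$-norm; this is the group-level sparsity that the penalty rewards.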
Since the introduction of the group Lasso, many variants and generalizations have been proposed and investigated. Kim et al. (2006) designed a blockwise sparse regression method to extend the idea of the group Lasso to general loss functions but used the same penalty as the group Lasso. Meier et al. (2008) derived the group Lasso for logistic regression and presented an efficient algorithm for fitting related generalized linear models. Zhao et al. (2009) developed a family of composite absolute penalties for grouped and hierarchical variable selection, which includes the group Lasso as a special case. Wei and Huang (2010) generalized the adaptive Lasso to an adaptive group Lasso method to improve the variable selection performance. Simon et al. (2013) introduced the sparse-group Lasso method, which carries out both individual and grouped variable selection, also known as bi-level variable selection, by incorporating a combination of the $\ell_1$-norm and the $\ell_2$-norm into the penalty.
The grouping information included in the models discussed above has led to improvements in both estimation accuracy and model interpretability. Many applications can be found in the corresponding references.
Nonetheless, despite their many desirable characteristics, the group Lasso and its variants suffer from the same drawback as the Lasso: they tend to underestimate large-magnitude coefficients because they apply the same amount of shrinkage to all coefficients.
Nonconvex penalties, such as the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang et al., 2010), have been developed as alternatives to the Lasso that can diminish the estimation bias in the Lasso estimator.
By applying such nonconvex penalties to the $\ell_2$-norms of the group coefficients, it is natural to obtain grouped variable selection via nonconvex penalization, such as the group SCAD and group MCP (Wang et al., 2007, 2008; Huang et al., 2012). On the other hand, applying the nonconvex penalties to the vector of $\ell_1$-norms of the groups can achieve bi-level variable selection and better estimation of coefficients. Available methods include the group bridge approach (Huang et al., 2009), the composite penalization methods (Breheny and Huang, 2009; Huang et al., 2012), and the group exponential Lasso (Breheny, 2015).
Grouped variable selection models with nonconvex penalties are not without their disadvantages, however. The nonconvex penalty, though beneficial for the estimation of co-  (Zhang and Zhang, 2012) or local optima obtained through specific initialization schemes and algorithms (Fan et al., 2014). More recently, Loh and Wainwright (2015) established statistical properties which apply to all stationary points of SCAD or MCP penalized least squares objective functions (though their results do not apply directly to group SCAD or MCP penalized estimators). However, empirical results from Fan et al. (2014) and Loh and Wainwright (2015), among others, suggest that in practice some stationary points perform much better than others, especially when the overall objective function is highly nonconvex, e.g., see the Remark on $(\alpha_1, \mu)$ and Figure 4 of Loh and Wainwright (2015).
To overcome the drawbacks of nonconvex optimization, one line of research, commonly referred to as the convex-nonconvex strategy, has been studied in the field of signal processing. This strategy adopts the so-called convexity-preserving nonconvex penalization, namely, the penalties are nonconvex but capable of maintaining the convexity of the whole objective function. The idea of convexity-preserving nonconvex penalties was introduced by Blake and Zisserman (1987), Nikolova (1998), and Nikolova et al. (2010), and then further investigated in Bayram (2015), Selesnick (2017b), and Zou et al. (2018). In particular, Selesnick (2017a) proposed a novel nonconvex penalty function for the regularized least squares problem, called the generalized minimax concave (GMC) penalty. The GMC penalty is given by

$$\psi_B(\beta) = \|\beta\|_1 - \min_{v \in \mathbb{R}^p}\Big\{\|v\|_1 + \frac{1}{2}\|B(\beta - v)\|_2^2\Big\}, \tag{3}$$

which can guarantee the convexity of the optimization problem under a suitable condition on the matrix parameter $B$.

In this paper, we focus on grouped variable selection in linear regression and propose a new generalization of the GMC penalty, called the group GMC penalty, which is also a convexity-preserving nonconvex penalty. We present the convexity-preserving condition for the group GMC model and some properties of its solution path. To solve the proposed optimization problem, we cast it as a saddle-point problem and provide a primal-dual algorithm for iteratively computing its saddle point. Theoretically, we establish error bounds for both the group GMC penalized least squares estimator and, as a special case, the GMC estimator of Selesnick (2017a), which to the best of our knowledge have not been established yet. In contrast to the theory for SCAD or MCP penalized least squares estimators, our theory applies only to global minimizers, which our algorithm is guaranteed to obtain. We evaluate the effectiveness of the proposed approach by comparing it with existing grouped variable selection methods in several different simulation experiments and a real data application.
The rest of this paper is organized as follows. In Section 2, we first review the GMC penalty and its relation to existing folded concave penalties. Then we formulate the group GMC penalty and the corresponding optimization problem for grouped variable selection in linear regression. We also include the convexity-preserving condition and some theoretical properties of the solution path in this section. In Section 3, we present in detail how to solve the proposed optimization problem with a first-order primal-dual algorithm. In Section 4, we study the statistical properties of both the group GMC estimator and the original GMC estimator by establishing $\ell_2$-norm error bounds. In Sections 5 and 6, we report on numerical experiments and a real data application. We close with a discussion in Section 7. All proofs in this paper are included in the supplement.
2 Group GMC

GMC and MCP
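We first review the GMC penalty to help readers understand the relationship between the GMC and the MCP in Zhang et al. (2010). For that purpose, we have to recall the definitions of the infimal convolution and the Huber function.
The infimal convolution of two functions $f$ and $g$ is

$$(f \,\square\, g)(\beta) = \inf_{v}\,\{f(v) + g(\beta - v)\}.$$

The Huber function (Huber, 1992) is defined as

$$h(\beta) = \begin{cases} \tfrac{1}{2}\beta^2, & |\beta| \le 1, \\ |\beta| - \tfrac{1}{2}, & |\beta| > 1, \end{cases}$$

which can be equivalently expressed as the infimal convolution

$$h(\beta) = \min_{v \in \mathbb{R}}\,\Big\{|v| + \tfrac{1}{2}(\beta - v)^2\Big\},$$

where the infimum is replaced by a minimum since the infimum in the definition is attained.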
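The equivalence between the piecewise form of the Huber function and its infimal-convolution form can be checked numerically. The sketch below assumes the standard Huber function ($\beta^2/2$ for $|\beta| \le 1$ and $|\beta| - 1/2$ otherwise) and uses the fact that the inner minimization over $v$ is solved by soft-thresholding $\beta$ at level 1; the function names are illustrative.

```python
import math

def huber(x):
    """Piecewise definition of the (unscaled) Huber function."""
    return 0.5 * x * x if abs(x) <= 1.0 else abs(x) - 0.5

def soft_threshold(x, t):
    """Minimizer of |v| * t + 0.5 * (x - v)^2 over v."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def huber_via_infconv(x):
    """Evaluate min_v { |v| + 0.5 * (x - v)^2 } at its closed-form minimizer."""
    v = soft_threshold(x, 1.0)
    return abs(v) + 0.5 * (x - v) ** 2

# The two expressions agree on a small grid of test points.
points = [-3.0, -1.0, -0.5, 0.0, 0.2, 1.0, 2.5]
max_gap = max(abs(huber(x) - huber_via_infconv(x)) for x in points)
```

Because the minimizer is available in closed form, no numerical optimizer is needed; `max_gap` should be zero up to floating-point rounding.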
Selesnick (2017a) defined the scaled version of the Huber function as

$$h_b(\beta) = \frac{h(b^2\beta)}{b^2},$$

where $b \neq 0$ is a scalar parameter. In the special case where $b = 0$, $h_b(\beta) = 0$. Based on the scaled Huber function, the scaled minimax concave (MC) penalty is given by

$$\phi_b(\beta) = |\beta| - h_b(\beta). \tag{4}$$

After defining the scaled Huber function in the univariate case, Selesnick (2017a) proposed a natural multivariate generalization. Given a matrix parameter $B$, the generalized Huber function $H_B : \mathbb{R}^p \to \mathbb{R}$ is written as

$$H_B(\beta) = \inf_{v \in \mathbb{R}^p}\Big\{\|v\|_1 + \frac{1}{2}\|B(\beta - v)\|_2^2\Big\},$$

which is a convex function; the infimum is attained since the $\ell_1$-norm is coercive. Mimicking the univariate scaled MC penalty, the generalized MC (GMC) penalty is defined as the difference of the $\ell_1$-norm and the generalized Huber function,

$$\psi_B(\beta) = \|\beta\|_1 - H_B(\beta), \tag{5}$$

which coincides with (3). Since the difference of two convex functions is not necessarily convex, the GMC penalty function is in general nonconvex. But as mentioned in the introduction, the GMC penalty can maintain the convexity of the penalized least squares problem by suitably choosing $B$. Details can be found in Selesnick (2017a).
Recall that the MCP defined in Zhang et al. (2010) is expressed as

$$P_{\lambda,\gamma}(\beta) = \sum_{j=1}^{p} p_{\lambda,\gamma}(|\beta_j|), \tag{6}$$

where the univariate MCP function defined on $[0, \infty)$ is

$$p_{\lambda,\gamma}(t) = \begin{cases} \lambda t - \dfrac{t^2}{2\gamma}, & t \le \gamma\lambda, \\[4pt] \dfrac{1}{2}\gamma\lambda^2, & t > \gamma\lambda, \end{cases} \tag{7}$$

where $\lambda \ge 0$ is the tuning parameter controlling the degree of penalization, and $\gamma > 1$ is a hyper-parameter that determines the degree of concavity of the MCP. The MCP function converges pointwise to the $\ell_1$-norm as $\gamma \to \infty$ and to the $\ell_0$-norm as $\gamma \to 1$; therefore, the MCP provides a continuum of penalties by varying the value of $\gamma$.
Now let us have a closer look at the similarities and differences between the GMC penalty (5) and the MCP (6). In the univariate case, the GMC penalty coincides with the scaled MC penalty (4) and the MCP (6) reduces to $p_{\lambda,\gamma}(|\beta|) = \lambda\,\phi_b(\beta)$ with $b = 1/\sqrt{\gamma\lambda}$, where $\phi_b$ is the scaled MC penalty (4). In other words, (5) is equivalent to (6) up to a factor of $\lambda$. The difference between the GMC penalty (5) and the MCP (6) lies in how they are generalized from the univariate case to the multivariate one. The MCP (6) takes an additive form from the univariate MCP function (7), while the GMC penalty (5) is derived from the scaled MC penalty (4) via an infimal convolution, thus leading to a non-separable penalty function whenever $B^TB$ is not diagonal.

The implications of expressing the MC penalty as an infimal convolution are non-trivial and lead to intrinsic differences with the standard MCP. It is well known that in the classic low-dimensional case where $n > p$, there exists a suitable choice of the hyper-parameter $\gamma$ for the MCP that leads to a convex objective function, but that no such $\gamma$ exists when $n < p$. In contrast, we will see that it is always possible to find a matrix $B$ that leads to a convex objective function for any $n$ and $p$. Thus, the GMC function enables the application of folded concave penalties in the high-dimensional case where $n \ll p$ without sacrificing convexity, opening the door to methods that can enjoy the best of both convex and nonconvex worlds.
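The univariate correspondence between the MCP and the scaled MC penalty can be verified numerically. The sketch below assumes the scaled Huber function $h_b(\beta) = h(b^2\beta)/b^2$ of Selesnick (2017a) and the standard piecewise form of the univariate MCP; under these assumptions the two penalties match when $b = 1/\sqrt{\gamma\lambda}$. All function names are illustrative.

```python
import math

def huber(x):
    return 0.5 * x * x if abs(x) <= 1.0 else abs(x) - 0.5

def scaled_huber(x, b):
    """h_b(x) = h(b^2 x) / b^2, with h_0 = 0 by convention."""
    if b == 0.0:
        return 0.0
    return huber(b * b * x) / (b * b)

def scaled_mc(x, b):
    """Scaled MC penalty: |x| - h_b(x)."""
    return abs(x) - scaled_huber(x, b)

def mcp(t, lam, gamma):
    """Univariate MCP evaluated at t >= 0."""
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

lam, gamma = 0.7, 3.0
b = 1.0 / math.sqrt(gamma * lam)  # assumed correspondence between b and (lam, gamma)

# Compare lam * phi_b(beta) with the MCP on both sides of the knot gamma * lam.
points = [0.0, 0.5, 1.0, 2.1, 2.5, 10.0]
max_gap = max(abs(mcp(abs(x), lam, gamma) - lam * scaled_mc(x, b)) for x in points)
```

With $b = 0$ the scaled MC penalty reduces to $|\beta|$, which mirrors the fact that the MCP approaches the $\ell_1$-norm as $\gamma \to \infty$.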

The group GMC model
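Based on the form of the GMC penalty (3) and mimicking the generalization from the Lasso to the group Lasso, we define the group GMC penalty as

$$\Phi_B(\beta) = \sum_{j=1}^{J} K_j\|\beta_j\|_2 - \min_{v \in \mathbb{R}^p}\Big\{\sum_{j=1}^{J} K_j\|v_j\|_2 + \frac{1}{2n}\|B(\beta - v)\|_2^2\Big\}, \tag{8}$$

where $\beta = (\beta_1^T, \ldots, \beta_J^T)^T$ and $v = (v_1^T, \ldots, v_J^T)^T$ with $\beta_j, v_j \in \mathbb{R}^{p_j}$, $\sum_{j=1}^{J} p_j = p$, and $K_j$ is the same as that in the group Lasso model (2). Here we insert a multiplier $1/n$ in the squared term of the group GMC penalty to put it on the same scale as the squared error loss term in (2).
Therefore, the group GMC model for grouped variable selection and coefficient estimation in linear regression is cast as the following optimization problem:

$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2n}\Big\|y - \sum_{j=1}^{J} X_{\cdot,j}\beta_j\Big\|_2^2 + \lambda\,\Phi_B(\beta), \tag{9}$$

where $\Phi_B$ is the group GMC penalty (8). Here $\lambda \ge 0$ is again a tuning parameter that controls the degree of penalization, while $B$ is a matrix parameter that controls the concavity of the group GMC penalty. Note that in our paper, we refer to $\lambda$ as the tuning parameter of the group GMC and treat the matrix $B$ as a hyper-parameter.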
Similar to the GMC approach, the basic idea of the group GMC method is to maintain the convexity of the optimization problem while using a nonconvex penalty, which can be realized with an appropriate choice of the matrix hyper-parameter $B$. The next proposition specifies the condition that $B$ has to satisfy to guarantee the convexity of problem (9). Recall that for two symmetric matrices $A$ and $B$, $A \succeq B$ means $A - B$ is positive semi-definite; similarly, $A \succ B$ means $A - B$ is positive definite.

Proposition 2.1. Let

$$F(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\,\Phi_B(\beta),$$

where $\Phi_B : \mathbb{R}^p \to \mathbb{R}$ is the group GMC penalty (8). If

$$X^TX \succeq \lambda B^TB, \tag{11}$$

then $F$ is a convex function. We call (11) the convexity-preserving condition for the group GMC problem (9).
Note that the convexity-preserving condition (11) can hold without any restriction on the problem dimension $p$ and the sample size $n$, namely, it can hold for both the low-dimensional case ($n \ge p$) and the high-dimensional case ($n < p$). To satisfy the convexity-preserving condition (11), an intuitive and simple choice for $B$ is

$$B = \sqrt{\alpha/\lambda}\, X, \quad 0 \le \alpha \le 1. \tag{12}$$

We refer to $\alpha$ as the convexity-preserving parameter of the group GMC model since $\alpha$ controls the nonconvexity of the group GMC penalty. Setting $\alpha = 0$ reduces the group GMC penalty to the group Lasso penalty, and setting $\alpha = 1$ gives a maximally nonconvex penalty that still maintains the convexity of the optimization problem (9). The convexity-preserving parameter $\alpha$ is another hyper-parameter of the group GMC method and needs to be chosen by users. We recommend a range of $0.4 < \alpha < 1$ based on our simulation studies in Section 5.
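To see why a choice of $B$ proportional to $X$ satisfies the convexity-preserving condition, the following short check may help. It assumes the scaling $B = \sqrt{\alpha/\lambda}\,X$ with $0 \le \alpha \le 1$, which matches the expression for $B$ used in Section 4.

```latex
\begin{align*}
\lambda B^{T}B
  &= \lambda \cdot \frac{\alpha}{\lambda}\, X^{T}X
   = \alpha\, X^{T}X, \\
X^{T}X - \lambda B^{T}B
  &= (1-\alpha)\, X^{T}X \succeq 0
  \quad \text{for all } 0 \le \alpha \le 1 .
\end{align*}
```

The last relation holds because $X^TX$ is always positive semi-definite, regardless of whether $n \ge p$ or $n < p$, and $1 - \alpha \ge 0$.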
The following proposition establishes the relationship between the group GMC and group MCP. It also clarifies the relationship between the GMC and MCP as a by-product.
Proposition 2.2. The group GMC method is equivalent to the group MCP method when $B^TB$ is diagonal and the diagonal elements are suitably designed. This equivalence also holds for the GMC and MCP.
We write the group GMC estimator, namely a minimizer of problem (9), as $\beta(\lambda)$, which explicitly represents the dependence of the solution to (9) on the tuning parameter $\lambda$. We next discuss two properties of the solution path $\beta(\lambda)$ that expedite the numerical computation in practice.
Theorem 2.1. Suppose $X^TX \succ \lambda B^TB$. Then the solution path $\beta(\lambda)$ to the group GMC problem (9) exists, is unique, and is continuous in $\lambda$.
Theorem 2.1 tells us that the optimization problem (9) is well-posed. Moreover, continuity of $\beta(\lambda)$ opens the door to a homotopy strategy to reduce computation time when solving a sequence of problems over a grid of $\lambda$ values. Namely, we use the solution to the problem at the previous value of $\lambda$ to initialize, or warm start, the next iterate for computing the solution at the next $\lambda$ value.
Intuitively, we may expect that all groups are excluded from the model when the tuning parameter λ is sufficiently large.The following theorem confirms our intuition.
Theorem 2.2. The group GMC problem (9) has a unique solution $\beta(\lambda) = 0_p$ for all $\lambda$ greater than $\lambda_0 = \max_j \|(X_{\cdot,j})^T y\|_2/(nK_j)$, where $X_{\cdot,j}$ and $K_j$ are as defined in (2) for $j = 1, \ldots, J$.

This second property is practically useful since it gives a range of $\lambda$, $[0, \lambda_0]$, to sample the full dynamic range of group sparse models, and as an added benefit the computation of $\lambda_0$ is straightforward.
We close this section with a few remarks. First, the group GMC penalty (8) depends on $B^TB$, not $B$ itself. Therefore, there is no need to express $B$ explicitly when computing the solution path $\beta(\lambda)$. Second, the two properties of the solution path hold for any matrix $B$ satisfying the convexity-preserving condition and are independent of how $\beta(\lambda)$ is computed, as they are intrinsic to the group GMC problem. Finally, Theorem 2.1 applies only in the classic setting where $n > p$. This is a more stringent condition than what is required to ensure the uniqueness of the Lasso (Tibshirani, 2013). The proof of the uniqueness of the Lasso solution hinges on the Karush-Kuhn-Tucker (KKT) conditions of the Lasso optimization problem. The KKT conditions for the group GMC problem are more complicated, so generalizing the Lasso proof to the group GMC is not straightforward. Nonetheless, we conjecture that relaxed conditions similar to those that ensure the uniqueness of the Lasso solution can be established and leave establishing these conditions for future work.
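The threshold $\lambda_0$ in Theorem 2.2 is cheap to compute, and it gives a natural upper end for a grid of tuning parameters. The sketch below computes $\lambda_0 = \max_j \|(X_{\cdot,j})^T y\|_2/(nK_j)$ in plain Python. The helper `lambda_grid`, which spaces the grid logarithmically from $\lambda_0$ downward, is an assumption of this example and is not prescribed by the text; such a decreasing grid pairs naturally with the warm-start strategy described after Theorem 2.1.

```python
import math

def lambda_max(X, y, groups, K):
    """Smallest lambda above which all groups are excluded (Theorem 2.2).

    X      : n x p design matrix as a list of rows
    y      : response as a list of length n
    groups : list of column-index lists, one per group (hypothetical layout)
    K      : list of group weights K_j
    """
    n = len(y)
    best = 0.0
    for idx, k_j in zip(groups, K):
        # Entries of (X_{.,j})^T y for the columns in group j.
        g = [sum(X[i][c] * y[i] for i in range(n)) for c in idx]
        best = max(best, math.sqrt(sum(v * v for v in g)) / (n * k_j))
    return best

def lambda_grid(lam0, num=10, ratio=1e-3):
    """Decreasing log-spaced grid from lam0 to ratio * lam0 (an assumed design)."""
    return [lam0 * ratio ** (k / (num - 1)) for k in range(num)]

X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [2.0, 0.0, 1.0]]
y = [1.0, 2.0, 3.0, 4.0]
groups = [[0, 1], [2]]
K = [math.sqrt(len(g)) for g in groups]  # K_j = sqrt(p_j)

lam0 = lambda_max(X, y, groups, K)
grid = lambda_grid(lam0, num=5)
```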

Algorithm for the group GMC model
In this subsection, we focus on the computation of the solution path $\beta(\lambda)$ to the group GMC model (9). We first present a first-order primal-dual method, called the Primal-Dual Hybrid Gradient (PDHG) algorithm (Esser et al., 2009; Chambolle and Pock, 2011), for computing the solution to non-smooth saddle-point problems. Then we formulate problem (9) as such a saddle-point problem and solve it by the PDHG algorithm.
The PDHG method, also known as the Chambolle-Pock method, is widely used to solve the following saddle-point problem:

$$\min_{x \in \mathcal{X}}\,\max_{y \in \mathcal{Y}}\;\; f(x) + \langle Ax, y\rangle - g(y), \tag{13}$$

where $f$ and $g$ are convex functions, $A \in \mathbb{R}^{M \times N}$ is a matrix, and $\mathcal{X} \subset \mathbb{R}^N$ and $\mathcal{Y} \subset \mathbb{R}^M$ are convex sets. A wide range of problems in statistics and machine learning can be cast as a case of (13), such as the scaled Lasso and total variation denoising.
Algorithm 1 summarizes the basic PDHG steps for problem (13), where $\sigma_k$ and $\tau_k$ are stepsize parameters for updating $x$ and $y$, respectively. One can choose constant stepsizes, $\tau_k = \tau$ and $\sigma_k = \sigma$ with $\tau\sigma < \|A^TA\|^{-1}$, to guarantee the convergence of the PDHG algorithm. Note that we use $\|A\|$ to denote the spectral norm of a matrix $A$.
Algorithm 1: Basic PDHG steps for problem (13).

We now recast the optimization problem (9) as a saddle-point problem

$$\min_{\beta \in \mathbb{R}^p}\,\max_{v \in \mathbb{R}^p}\;\; f(\beta) + \langle Z\beta, v\rangle - g(v), \tag{14}$$

where

$$f(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 - \frac{1}{2}\beta^TZ\beta + \lambda\sum_{j=1}^{J} K_j\|\beta_j\|_2 \tag{15}$$

and

$$g(v) = \frac{1}{2}v^TZv + \lambda\sum_{j=1}^{J} K_j\|v_j\|_2, \tag{16}$$

and $Z = \frac{\lambda}{n}B^TB \in \mathbb{R}^{p \times p}$ is a symmetric matrix. In addition, both (15) and (16) are convex functions under the convexity-preserving condition (11). It is straightforward to see that problem (14) is under the framework of (13) and thereby can be solved by the PDHG algorithm.
The basic PDHG method can be slow to converge, however. In this paper, we implement the accelerated version, named the adaptive PDHG algorithm (Goldstein et al., 2013, 2015a), to solve the group GMC problem. We provide details about the adaptive PDHG for problem (14) and its convergence guarantees in the supplement.
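To make the structure of a PDHG iteration concrete, here is a minimal sketch on a scalar toy saddle-point problem, not the group GMC problem itself. It assumes the standard Chambolle-Pock form of the iteration with constant stepsizes, and it takes $f(x) = \tfrac{1}{2}(x-a)^2$ and $g(y) = \tfrac{1}{2}y^2$ so that both proximal steps have closed forms. In this sketch `tau` is the primal stepsize and `sigma` the dual stepsize; this naming, like the toy problem, is an assumption of the example.

```python
def pdhg_scalar(a, A, tau, sigma, iters=500):
    """Basic PDHG on min_x max_y 0.5*(x - a)^2 + A*x*y - 0.5*y^2.

    Requires tau * sigma * A**2 < 1 for convergence.
    """
    x, y = 0.0, 0.0
    for _ in range(iters):
        # Primal step: gradient-like move on the coupling term, then prox of tau*f.
        x_hat = x - tau * A * y
        x_new = (x_hat + tau * a) / (1.0 + tau)
        # Dual step: uses the extrapolated point 2*x_new - x, then prox of sigma*g.
        y_hat = y + sigma * A * (2.0 * x_new - x)
        y = y_hat / (1.0 + sigma)
        x = x_new
    return x, y

# Eliminating y gives min_x 0.5*(x - a)^2 + 0.5*(A*x)^2, so x* = a / (1 + A^2).
x_star, y_star = pdhg_scalar(a=3.0, A=2.0, tau=0.4, sigma=0.4)
```

For the group GMC saddle-point problem the same two-step pattern applies, but the proximal steps no longer have closed forms and must themselves be solved iteratively, which is the role of the inner solver described in the next subsection.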

Algorithm for the PDHG updates
The PDHG algorithm for solving the group GMC problem requires solving two subordinate optimization problems for updating β k+1 and v k+1 which we describe next.
We first introduce an efficient algorithm, the Fast Adaptive Shrinkage/Thresholding Algorithm (FASTA) (Goldstein et al., 2014, 2015b), for solving optimization problems of the form

$$\underset{x}{\text{minimize}} \;\; m(x) + h(x), \tag{17}$$

where $m$ is convex and Lipschitz differentiable, $h$ is proper, lower semi-continuous and convex, and $m + h$ is coercive. FASTA provides a simple framework for implementing the forward-backward splitting (FBS) method, also known as the proximal gradient method (Bauschke and Combettes, 2011), to efficiently compute the solution to problem (17). Problems under the framework of (17) include the Lasso, noisy matrix completion, and many other regularized regression problems.
Algorithm 2 shows pseudocode of the basic FBS steps in FASTA for solving (17), where $t_k$ is a positive stepsize parameter that plays an important role in the convergence rate of the algorithm. The proximal operator of $h$ is given by

$$\mathrm{prox}_h(z, t) = \arg\min_{x}\Big\{t\,h(x) + \frac{1}{2}\|x - z\|_2^2\Big\}.$$

The proximal operator exists and is unique if $h$ is convex and lower semi-continuous. The key computation in FBS is the proximal mapping, and many regularizers $h$ in sparse learning admit proximal operators which either have an explicit formula or can be evaluated by an efficient algorithm. For instance, the proximal operator of the $\ell_2$-norm can be explicitly expressed as

$$\mathrm{prox}_{\|\cdot\|_2}(z, t) = \Big(1 - \frac{t}{\|z\|_2}\Big)_+ z,$$

where $(\cdot)_+ = \max\{0, \cdot\}$. We will use this expression in our proofs. The efficiency of computing the proximal operators makes the FBS method popular in practice.
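The proximal operator of the $\ell_2$-norm is the block soft-thresholding rule that drives group-level sparsity. The sketch below implements the closed form $(1 - t/\|z\|_2)_+\, z$ in plain Python; the function name `prox_l2` is illustrative, and the explicit handling of $z = 0$ is an implementation detail added here to avoid division by zero.

```python
import math

def prox_l2(z, t):
    """Proximal operator of t * ||.||_2 evaluated at z (block soft-thresholding)."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm == 0.0:
        return [0.0] * len(z)          # the prox of the zero vector is zero
    scale = max(0.0, 1.0 - t / norm)   # (1 - t / ||z||_2)_+
    return [scale * v for v in z]

shrunk = prox_l2([3.0, 4.0], 1.0)   # norm 5 exceeds t, so the block is shrunk
zeroed = prox_l2([0.3, 0.4], 1.0)   # norm 0.5 is below t, so the block is set to zero
```

A whole block is either shrunk toward the origin by a common factor or set exactly to zero, which is why penalties built on group $\ell_2$-norms select or drop entire groups.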
Algorithm 2: Basic FBS steps in FASTA for problem (17).

We now rewrite the two optimization problems for updating $\beta^{k+1}$ and $v^{k+1}$ in the PDHG updates in the form of (17). First, the optimization problem for updating $\beta^{k+1}$ can be written as (19). Similarly, we write the optimization problem for updating $v^{k+1}$ as (20). Both (19) and (20) satisfy the conditions on $m$ and $h$ in (17). Therefore, we can compute $\beta^{k+1}$ and $v^{k+1}$ by using Algorithm 2.
One of the primary difficulties with FBS is that users must carefully choose the stepsize. Fortunately, many variants of FBS are available in FASTA for adaptively choosing the stepsize and accelerating convergence. In this paper, we use the strategies adopted in the R package fasta to implement FASTA and obtain the solutions to (19) and (20).

Statistical properties

Main results
In this section, we consider the statistical properties of the group GMC estimator obtained by solving (9). First, we demonstrate that the group GMC estimator achieves an error bound of the same asymptotic order as existing estimators. Second, we discuss how the choice of $B$, or $B^TB$ to be exact, affects the error bound. We will also contrast our assumptions and error bounds with $B = \sqrt{\alpha/\lambda}\,X$ to those under the group Lasso penalization.
We now define a number of important quantities. First, define

$$v = \arg\min_{u \in \mathbb{R}^p}\Big\{\sum_{j=1}^{J} K_j\|u_j\|_2 + \frac{1}{2n}\|B(\beta - u)\|_2^2\Big\},$$

where $\beta$ is the true vector of coefficients and $v_j$ has the same dimension as $\beta_j$ for each $j \in [J]$. Implicitly, $v$ is a function of $B$, $\beta$, $n$, and $K_j$; we avoid notation indicating this dependence for improved readability. We can then define the sets $S = \{j : \|\beta_j\|_2 \neq 0, j \in [J]\}$ and $S^c = [J] \setminus S$ and use $|S|$ to denote the cardinality of $S$. We also define where $[B^TB]_{j,\cdot} \in \mathbb{R}^{p_j \times p}$ is the submatrix of $B^TB$ with rows corresponding to the indices of $\beta$ defining the $j$-th group for each $j \in [J]$. Finally, let $\bar{\nu} = \max_{j \in S} \nu_j$ and $\underline{\nu} = \min_{k \in S^c} \nu_k$.
Both $\bar{\nu}$ and $\underline{\nu}$ play important roles in our error bounds. In brief, our results indicate that a good choice of $B$ is one in which $\bar{\nu}$ is minimized and $\underline{\nu}$ is maximized, while also, the $\nu_j$ for $j \in S$ are large and $\nu_k$ for $k \in S^c$ are small. For the sake of illustration, we will derive closed form expressions for the $\nu_j$ under a particular choice of $B$ in the next subsection.
Our results require a number of conditions and assumptions. Our first condition is that the submatrices of $X$ satisfy a simple scaling condition (Negahban et al., 2012). Specifically, we assume that $\|X_{\cdot,j}\| \le \sqrt{n}$ for all $j \in [J]$, where $\|\cdot\|$ is the spectral norm. Such a condition was used in, for example, Corollary 4 of Negahban et al. (2012). In the case that $p_j = 1$, this simplifies to the standard column-wise scaling condition that each column of $X$ has squared Euclidean norm no greater than $n$ (e.g., see Example 11.1 of Hastie et al. (2015)). We also assume the following:

A1. (Subgaussian errors) The data are generated from (1) where $\epsilon \in \mathbb{R}^n$ has independent entries which are each $\sigma$-subgaussian random variables for $0 < \sigma < \infty$. That is, $E(\epsilon_i) = 0$ and for all $t \in \mathbb{R}$, $E\{\exp(t\epsilon_i)\} \le \exp(\sigma^2t^2/2)$.

A2. (Convexity) The matrix $B$ is chosen so that $X^TX \succeq \lambda B^TB$.
As a consequence of our proof technique, we also establish an error bound for the original GMC estimator in Selesnick (2017a): this is a special case of the group GMC estimator with each group consisting of a single coefficient.This is the first error bound for the GMC estimator that we are aware of.
Like the group-penalized version, the GMC penalized estimator achieves the same well-known asymptotic $\sqrt{|S|\log p/n}$ rate as its $\ell_1$-norm penalized counterpart. However, like the group GMC penalized estimator, an improvement over the Lasso in finite sample settings may be realized through an inflation of the restricted eigenvalue $\kappa_B(S, c)$, and through the role of the $\nu_j$'s.

Additional insights
The restricted eigenvalue $\kappa_B(S, c)$ in A4 differs from the analogous condition under the $\ell_2$-norm group penalization, which may partly explain the difference in performance observed in Section 5. Specifically, to establish error bounds for the group Lasso analog of (9), the corresponding restricted eigenvalue condition posits a lower bound on $\inf_{\Delta \in D_n(S,c)}$ where when the tuning parameter is chosen to be at least as large as $c\max_j \|\epsilon^TX_{\cdot,j}\|_2/n$. The difference between restricted eigenvalue conditions comes both in terms of the function the infimum is taken with respect to and in terms of the set over which the infimum is taken.
For example, when we take $B = \sqrt{\alpha/\lambda}\,X$ for $\alpha \in (0, 1)$, we see that , which would at first glance seem to imply a $(1-\alpha)$ factor decrease in the restricted eigenvalue relative to that for the group Lasso estimator. However, the benefit comes through the potential reduction in volume of the set $C_n(S, \nu, c)$ relative to $D_n(S, c)$. If, for example, each

To get a sense of how the $\nu_j$'s depend on the choice of $B$, we focus on a special case.
Proposition 4.1. Suppose $n > p$ and $X^TX \succ O_p$. Consider the choice of $B = \sqrt{\eta/\lambda}\,I_p$ where $\eta > 0$ is the smallest eigenvalue of $X^TX$ so that A2 holds. In this situation, and $v_j = 0$ for $j \in S^c$. It thus follows that for $j \in [J]$ where $1(\cdot)$ is the indicator function.
In addition to providing insight regarding the $\nu_j$, in light of Proposition 2.2, this result provides a new lens through which the group MCP can be viewed in the settings in which group MCP and group GMC are equivalent. Existing theory for group MCP is primarily concerned with the oracle property rather than, say, error bounds.
Crucially, Proposition 4.1 implies $\nu_j = K_j$ for all $j \in S^c$ in the considered scenario. This choice of $B$ is useful because $\nu_j = K_j$ for $j \in S^c$ (whereas alternative choices of $B$ will yield $\nu_j < K_j$ for some $j \in S^c$), and because it enables us to express $v$ explicitly in terms of $\beta$.
In the setting of Proposition 4.1, if each $K_j = 1$ and the $\|\beta_j\|_2$ are sufficiently large, then

In practice, of course, the $\nu_j$ cannot be computed since they depend on $\beta$. Likewise, the $B$ which minimizes the bounds in Theorem 4.1 also depends on $\beta$ and the set $S$, so this cannot be used in practice.

Simulation studies
We investigate the practical performance of the proposed group GMC method with experiments that build upon the simulation scenarios in Yuan and Lin (2006). We also compare the group GMC with the group Lasso, group MCP, and group SCAD. The computation of the three existing methods is done using the R package grpreg developed by Breheny and Huang (2015), and the hyper-parameters in the group MCP and group SCAD are set to the default values given in the R package.
There are four linear regression models considered in the simulation study in Yuan and Lin (2006). In this work, we consider the two most complicated ones, an ANOVA model with all two-way interactions and an additive model with both categorical and continuous variables. More importantly, we study different cases for each model to explore the effects of interesting factors, including the signal-to-noise ratio (SNR), the correlation among groups, the problem dimension, and the convexity-preserving parameter (only for the group GMC).
In each case, we run the experiment for 100 replications and evaluate different methods with respect to: (i) mean squared error (MSE) of the estimated coefficients; (ii) prediction error defined as $\|X\hat{\beta} - X\beta\|_2^2/n$ where $n$ is the sample size, $X$ is the design matrix, and $\beta$ and $\hat{\beta}$ are the vectors of true and estimated coefficients respectively; (iii) support recovery with a sequence of $\sigma$ so that the SNR ranges from 1 to 5. The sample size $n$ is fixed as 100 for each setting. To better understand how the convexity-preserving parameter $\alpha$ affects the performance of the proposed group GMC method, we report the results of the group GMC with $\alpha \in \{0.2, 0.4, 0.6, 0.8, 1\}$. These results also provide guidance on how to set $\alpha$ for the group GMC in practice.
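The evaluation criteria above can be sketched in a few lines. The prediction error follows the definition $\|X\hat{\beta} - X\beta\|_2^2/n$ given in the text. For support recovery, the sketch assumes the standard F1 score, $2\,\mathrm{TP}/(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, computed on sets of selected indices; the function names and the convention of returning 1 when both sets are empty are assumptions of this example.

```python
def prediction_error(X, beta_hat, beta_true):
    """||X beta_hat - X beta_true||_2^2 / n for X given as a list of rows."""
    n = len(X)
    total = 0.0
    for row in X:
        diff = sum(x * (bh - bt) for x, bh, bt in zip(row, beta_hat, beta_true))
        total += diff * diff
    return total / n

def selection_counts(selected, truth):
    """Return (true positives, false positives, false negatives)."""
    selected, truth = set(selected), set(truth)
    return len(selected & truth), len(selected - truth), len(truth - selected)

def f1_score(selected, truth):
    """Standard F1 score of the selected set against the true support."""
    tp, fp, fn = selection_counts(selected, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumed convention for two empty sets

X = [[1.0, 0.0], [0.0, 2.0]]
pe = prediction_error(X, beta_hat=[1.0, 1.0], beta_true=[0.0, 1.0])
counts = selection_counts(selected=[0, 1, 3], truth=[0, 1, 2])
f1 = f1_score(selected=[0, 1, 3], truth=[0, 1, 2])
```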
Figure 1 presents the impact of SNR on the performance of the four methods for model (22). As expected, the MSE of the estimated coefficients and the prediction error decrease as the SNR increases for all methods. The group Lasso achieves the lowest MSE, while the group GMC gives the lowest prediction error among the four methods. The convexity-preserving parameter $\alpha$ does not show a significant effect on the prediction performance of the group GMC, but it has a mild effect on the coefficient estimation. As indicated in the top left panel of Figure 1, a smaller value of $\alpha$ leads to a lower MSE. When it comes to support recovery, the group GMC shows a distinct advantage over the other three methods. It achieves a higher F1 score than existing methods in all SNR settings. The two plots on the bottom panel of Figure 1 display the variable selection results of different methods in detail. The group Lasso obtains the most true positives but also the most false positives. Both group SCAD and group MCP miss some true positives and also include some irrelevant variables in the ANOVA model. The group GMC, however, can achieve a number of true positives comparable with the group Lasso while maintaining its number of false positives at a very low level. The convexity-preserving parameter $\alpha$ indeed affects the variable selection performance of the group GMC. The numbers of both true and false positives decrease as the value of $\alpha$ increases. In other words, a large value of $\alpha$ in the group GMC results in a sparse model. In general, a range of $0.4 < \alpha < 1$ works well for this ANOVA example.
Case C2: When comparing different grouped variable selection methods, one factor of interest is to what extent the correlation among groups impacts their performance.
For that purpose, we set a grid of values for $\rho$, $\rho \in \{0, 0.2, 0.4, 0.6, 0.8\}$, so that the correlation between $Z_i$ and $Z_j$ is $\rho^{|i-j|}$ for $i \neq j$. We fix the SNR of the regression model to be 2 and the sample size to be 100 for each run. We set the convexity-preserving parameter $\alpha = 0.6$ for the group GMC method.
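The correlation structure in this case, where the correlation between $Z_i$ and $Z_j$ is $\rho^{|i-j|}$, can be written down directly as a matrix. The sketch below only builds that matrix; how the $Z_i$ are subsequently sampled and discretized is not reproduced here, and the function name is illustrative.

```python
def ar1_correlation(dim, rho):
    """Return the dim x dim matrix with (i, j) entry rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(dim)] for i in range(dim)]

# One matrix per value of rho in the grid used for this case.
rho_grid = [0.0, 0.2, 0.4, 0.6, 0.8]
matrices = {rho: ar1_correlation(4, rho) for rho in rho_grid}
C = ar1_correlation(4, 0.5)
```

With $\rho = 0$ the matrix is the identity, which corresponds to the uncorrelated setting, and correlations decay geometrically with the distance $|i - j|$ otherwise.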
Figure 2 shows the performance of the four methods under different group correlations.
Both the group GMC and group Lasso produce worse estimation as the correlation ρ increases, while the group MCP and group SCAD fail to achieve comparable estimation even in the uncorrelated setting.For the model prediction, all four methods are relatively stable across different correlation settings, while the group GMC compares favorably with the other three.Regarding the variable selection with respect to the F1 score, group GMC visibly outperforms the existing three methods.All methods see a drop in F1 score when the correlation ρ reaches up to 0.8.The plots of true and false positives provide detailed insight into the variable selection performance of different methods.The group Lasso includes the most false positives into the model, although it leads others in the inclusion of true positives.In contrast, the group MCP and group SCAD build much sparser models, thus missing some true positives.The group GMC is capable of obtaining true positives comparable with the group Lasso and excluding those irrelevant variables from the regression model.When ρ = 0.8, all methods suffer a drop in their true positives, resulting in the drop in their F1 scores as seen in the corresponding plot.
Case C3: In this third case, our goal is to explore the impact of the problem dimension on the performance of different methods. To that end, we set three different dimension settings where four, ten, and sixteen independent categorical variables Z_i are generated accordingly in each setting and then trichotomized in the same way as described above.
As a result, the problem dimension p is 32, 200, and 512, respectively. The response variable y is still generated according to model (22) with an SNR of 2, so the number of true positives is 8 in all dimension settings. The sample size is 100 for each setting. We again fix the convexity-preserving parameter α of the group GMC as 0.6. Figure 3 summarizes the simulation results. In terms of the coefficient estimation, the group MCP and group SCAD behave quite similarly and much worse than the group Lasso and group GMC. Regarding the model prediction, the group GMC fares well in all dimension settings. With respect to variable selection, the group GMC shows a distinct advantage over the existing three methods across different problem dimensions thanks to its robust behavior with respect to both true and false positives. On the one hand, while the group MCP performs well at excluding false positives, it errs on the side of being too conservative and misses some true positives. On the other hand, the group Lasso and group SCAD select too many irrelevant variables into the regression model, especially in the high-dimensional scenarios.
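The stated dimensions can be checked with a short calculation. The sketch below assumes that each trichotomized factor contributes two dummy columns and that the ANOVA model contains all main effects and all two-way interactions. This assumption is consistent with the reported values of p, but the model specification itself appears earlier in the paper. The function name is illustrative.

```python
from math import comb

def anova_dimension(k, levels=3):
    """Columns in a design with all main effects and two-way interactions.

    Assumption: each factor with `levels` categories is coded by
    levels - 1 dummy columns, and each interaction group by (levels - 1)**2.
    """
    d = levels - 1
    main = k * d
    interaction = comb(k, 2) * d * d
    return main, interaction, main + interaction

for k in (4, 10, 16):
    main, interaction, p = anova_dimension(k)
    groups = k + comb(k, 2)
    print(f"k={k}: main={main}, interaction={interaction}, p={p}, groups={groups}")
```

Under this assumption p = 2k + 4·C(k, 2) = 2k², which gives 32, 200, and 512 for k = 4, 10, and 16. Only the last two settings have p larger than the sample size of 100.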

Real Data Application
We apply our group GMC method to the birth weight data set from Hosmer and Lemeshow (1989), which studies risk factors associated with low infant birth weight. The data set is publicly available in the R package MASS and contains 189 observations of one response variable (infant birth weight) and eight explanatory variables from the mother, including both continuous and categorical factors. We include detailed descriptions of the data set in Table 1. As with Yuan and Lin (2006), we take into account the preliminary analysis that both mother's age and weight have non-linear effects on the birth weight. Therefore, we model these two effects by third-order polynomials. Finally, we get sixteen predictors from eight groups to fit a linear regression model.
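The grouping of the sixteen predictors can be sketched as follows. The variable names follow the birthwt data set in MASS. The group sizes for the categorical factors are an assumption: they are one split that is consistent with sixteen columns in eight groups, not a coding taken from the paper. The raw centered cubic basis is likewise illustrative, since an orthogonal polynomial basis would serve equally well.

```python
def cubic_basis(x):
    """Rows (c, c**2, c**3), where c is x centered at its sample mean."""
    mean = sum(x) / len(x)
    return [(v - mean, (v - mean) ** 2, (v - mean) ** 3) for v in x]

# Assumed number of columns contributed by each explanatory variable.
group_sizes = {
    "age": 3,    # third-order polynomial
    "lwt": 3,    # third-order polynomial
    "race": 2,   # three categories -> two dummy columns
    "smoke": 1,
    "ptl": 2,    # assumed to be coded as a three-level factor
    "ht": 1,
    "ui": 1,
    "ftv": 3,    # assumed to be coded as a four-level factor
}

# Group label (1..8) for each of the 16 columns, as a group penalty would need.
group_index = [g for g, size in enumerate(group_sizes.values(), start=1)
               for _ in range(size)]

age_columns = cubic_basis([19, 33, 20, 21, 18])  # hypothetical ages
```

All columns that share a label in `group_index` enter or leave the model together, which is why a polynomial effect is selected as a whole rather than term by term.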
Following our simulation studies, we analyze the data using the proposed group GMC as well as the group Lasso, group MCP, and group SCAD. For group GMC, we again set the matrix parameter B according to (12) and choose α = 0.8 based on our experience from the simulation studies. For evaluation, we first randomly sample three-quarters of the observations (142 cases) as a training set for selecting the tuning parameter λ by ten-fold cross-validation. Then we use the obtained tuning parameter to fit the full data to get the estimated coefficients. Finally, we compute the prediction error based on the testing set of the remaining one-quarter of the records. The prediction errors obtained from the four different methods are comparable.
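The evaluation protocol can be sketched as below. This is a minimal illustration, not the authors' code. The `fit` and `loss` callables are hypothetical placeholders, and a soft-thresholded sample mean on synthetic data stands in for a penalized regression solver such as the group GMC. The sketch refits on the training indices after selecting λ, which is one reading of the procedure described above.

```python
import random

def split_indices(n, train_fraction, rng):
    """Shuffle 0..n-1 and split into training and testing index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(n * train_fraction)
    return idx[:n_train], idx[n_train:]

def kfold(indices, k):
    """Partition indices into k folds of nearly equal size."""
    return [indices[i::k] for i in range(k)]

def select_lambda(train_idx, lambdas, k, fit, loss):
    """Return the lambda with the smallest average held-out loss over k folds."""
    folds = kfold(train_idx, k)
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        total = 0.0
        for fold in folds:
            held_out = set(fold)
            fit_idx = [i for i in train_idx if i not in held_out]
            total += loss(fit(fit_idx, lam), fold)
        avg = total / k
        if avg < best_err:
            best_lam, best_err = lam, avg
    return best_lam

# Synthetic responses; a toy stand-in for the birth weight data.
rng = random.Random(1)
y = [1.5 + rng.gauss(0.0, 1.0) for _ in range(189)]

def fit(idx, lam):
    """Toy penalized estimator: soft-thresholded sample mean."""
    m = sum(y[i] for i in idx) / len(idx)
    return max(abs(m) - lam, 0.0) * (1.0 if m >= 0 else -1.0)

def loss(model, idx):
    """Mean squared prediction error on the given indices."""
    return sum((y[i] - model) ** 2 for i in idx) / len(idx)

train_idx, test_idx = split_indices(189, 0.75, rng)
lam_hat = select_lambda(train_idx, [0.0, 0.25, 0.5, 1.0], 10, fit, loss)
test_error = loss(fit(train_idx, lam_hat), test_idx)
```

Rounding three-quarters of 189 gives the 142 training cases mentioned above and leaves 47 cases for testing.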
It seems that, for this birth weight data analysis, the group GMC does not exhibit any advantage over existing methods. Nevertheless, the solution paths from the four methods, as shown in Figure 4, tell a different story. The estimated coefficients from the group GMC, as indicated by the vertical dotted line, undergo noticeably less shrinkage than those of the group Lasso and are similar to the estimates from the group MCP and group SCAD. This confirms the unbiased (or at least less biased) estimation of the group GMC as a nonconvex penalization method. What is more, the group GMC method is more robust to the choice of the tuning parameter than the other three methods. It is anticipated that the estimated coefficients are increasingly shrunk as λ increases, and thus fewer variables are selected into the regression model. But as shown in Figure 4, the estimated coefficients and selected variables are stable over λ ∈ [0.04, 0.07] for the group GMC, while the other three methods do not have a comparably wide range of λ over which they are stable. This insensitivity furnishes some evidence that the group GMC method can reduce estimation bias while still achieving satisfactory variable selection.

Discussion
In this paper, we used the convex-nonconvex strategy to propose a novel concave penalty, called the group GMC, for grouped variable selection and coefficient estimation in linear regression. The group GMC penalty is a variant of the GMC penalty and thus inherits its ability to maintain the convexity of the corresponding optimization problem. Therefore, the group GMC eliminates the possibility of suboptimal local minima while maintaining unbiased estimation as a nonconvex penalization approach. We formulated linear regression with the group GMC penalization as a convex optimization problem, or more specifically a saddle-point problem when a certain condition is satisfied. The resulting group GMC estimator enjoys desirable properties which help accelerate numerical computation and tuning parameter selection. Our algorithm for computing the group GMC estimator is guaranteed to converge to the global minimizer of the group GMC optimization problem. Additionally, we analyzed statistical properties of the group GMC estimator as well as the original GMC estimator. Our results are the first to establish ℓ2-norm error bounds for the GMC least squares estimators and, as such, provide novel insights into the performance of convex-nonconvex penalization. In our simulation study, we compared the practical performance of the group GMC with the group Lasso, group MCP, and group SCAD via a comprehensive evaluation, including variable selection, coefficient estimation, and model prediction. Through a battery of simulation experiments, we found that the group GMC can achieve better or at least competitive performance in comparison with the existing three methods under different scenarios such as different SNRs, correlated or uncorrelated groups, and different dimension settings. A real data application displays the advantage of the group GMC method in its robustness in unbiased coefficient estimation and grouped variable selection.
Several related studies can be done in the future. First of all, how to set the matrix parameter B warrants more exploration and investigation. We discussed how B could affect the error bound of the group GMC estimator in Section 4 but anticipate that other approaches to set B could further improve the performance of the group GMC, both theoretically and practically. Second, the group GMC method could be extended to generalized linear models to deal with grouped variable selection problems in other high-dimensional cases. More generally, the convex-nonconvex strategy could be applied to other sparse learning scenarios so that one can enjoy the advantages of convex optimization and nonconvex penalization simultaneously.
this would guarantee the set C_n(S, ν, c) has volume no greater than D_n(S, c). More generally, if ν and many ν_j for j ∈ S are large, one may expect the reduction in volume of C_n(S, ν, c) relative to D_n(S, c) to lead to a larger restricted eigenvalue and, in turn, an improved error bound. In addition, since the restricted eigenvalue condition A4 depends on the user-specified matrix B, one may select B such that this condition is more plausible than the analogous condition under the group Lasso penalization. The matrix B also affects the error bound through the ν_j, both through the modification of C_n and the ratio ν/ν.

Figure 1: Results for Case C1: Impact of SNR. Group GMC(•) stands for the group GMC with a specific value of α. Average performance based on 100 simulation replicates for each method. MSE and prediction error are on a log scale.

Figure 2: Results for Case C2: Impact of group correlation. Average performance plus/minus one standard error based on 100 simulation replicates for each method. MSE and prediction error are on a log scale.

Figure 3: Results for Case C3: Impact of problem dimension. Average performance plus/minus one standard error based on 100 simulation replicates for each method. MSE and prediction error are on a log scale.

Figure 4: Solution paths of the birth weight data from four different methods. The dotted vertical line in each subplot indicates the selected λ via ten-fold cross-validation.
Lanza et al. (2019) introduced a more general parametric nonconvex nonseparable regularizer for a convex-nonconvex variational model. Abe et al.
Liu and Chi (2022)e idea of the GMC to a linearly involved GMC penalty, which is applicable to more general situations, especially for recovering piecewise constant signals. Liu and Chi (2022) further investigated the linearly involved GMC penalty, proposing a new method for choosing the matrix parameter B in the penalty and providing an additional algorithm to compute the solution path of the corresponding penalized least squares problem. The convex-

Table 2 summarizes the prediction errors, the number of nonzero groups, and the excluded groups from the four methods. The group Lasso and group SCAD fail to exclude any group from the model. Both group MCP and group GMC, however, regard the number of physician visits during the first trimester as an unimportant factor for the infant birth weight.

Table 1: Description of the birth weight data set

Table 2: Summarized results for the birth weight data