The Benefit of Group Sparsity in Group Inference with De-biased Scaled Group Lasso

We study confidence regions and approximate chi-squared tests for variable groups in high-dimensional linear regression. When the size of the group is small, low-dimensional projection estimators for individual coefficients can be directly used to construct efficient confidence regions and p-values for the group. However, the existing analyses of low-dimensional projection estimators do not directly carry through for chi-squared-based inference of a large group of variables without inflating the sample size by a factor of the group size. We propose to de-bias a scaled group Lasso for chi-squared-based statistical inference for potentially very large groups of variables. We prove that the proposed methods capture the benefit of group sparsity under proper conditions, for statistical inference of the noise level and variable groups, large and small. Such benefit is especially strong when the group size is large.

1. Introduction. We consider the linear regression model

y = Xβ + ε, (1.1)

where X = (x_1, . . ., x_p) ∈ R^{n×p} is a design matrix, y ∈ R^n is a response vector, ε ∼ N_n(0, σ²I_n) with an unknown noise level σ, and β = (β_1, . . ., β_p)^T ∈ R^p is a vector of unknown regression coefficients. We are interested in making statistical inference about a group of coefficients β_G = (β_j, j ∈ G)^T. For small p, the F-distribution, which is approximately chi-square with proper normalization, provides classical confidence regions for β_G and p-values for testing β_G. We want to construct approximate versions of such procedures for potentially large groups in high-dimensional models where p is large, possibly much larger than n.
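As a concrete illustration of this setup, the following sketch generates data from the linear model with a grouped coefficient vector. All dimensions, group sizes, and the noise level are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): n observations, p features
# split into M non-overlapping groups of equal size.
n, p, group_size = 100, 40, 4
M = p // group_size
groups = [np.arange(j * group_size, (j + 1) * group_size) for j in range(M)]

# A (g, s)-strongly-group-sparse truth: signal confined to g = 2 groups,
# so s = 8 nonzero coefficients in total.
beta = np.zeros(p)
beta[groups[0]] = 1.0
beta[groups[1]] = -1.0

sigma = 1.0
X = rng.standard_normal((n, p))
eps = sigma * rng.standard_normal(n)
y = X @ beta + eps            # the linear model y = X beta + eps
```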
For individual regression coefficients, Zhang and Zhang (2014) proposed a low-dimensional projection estimator (LDPE) for regular statistical inference at the parametric n^{−1/2} rate under proper conditions. Their results provide an expansion of the form (1.2) with known covariance structure V_{G,G} and sufficient conditions for the asymptotic normality, Rem_G = o(1), when the group size |G| is bounded. For random designs, the above covariance structure matches the Fisher information in the least favorable sub-model in a general context as described in Zhang (2011), and a proof of the asymptotic efficiency of the LDPE was provided in van de Geer et al. (2014). Earlier, Sun and Zhang (2012a) proved the consistency and efficiency of a scaled Lasso estimate of the noise level σ. However, the analysis of the LDPE, which guarantees ‖Rem_G‖_∞ ≲ ‖β‖_0 (log p)/√n, does not directly imply a sharp error bound for the ℓ_2-based, or equivalently chi-square-based, group inference for large groups. As Var(χ_{|G|}) ≈ 1/2, the trivial bound ‖Rem_G‖_2 ≤ |G|^{1/2} ‖β‖_0 (log p)/√n yields an extra |G|^{1/2} factor. Thus, the group inference problem is unsolved when one is unwilling to impose the condition |G|^{1/2} ‖β‖_0 (log p)/√n → 0. Our goal is to construct β̂_G satisfying ‖Rem_G‖_2 = o(1) in an expansion of the form (1.2) with moderately large |G|. The impact of such a result is certainly beyond F- or chi-square-type statistical inference.
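To make the LDPE idea concrete, here is a minimal numerical sketch of de-biasing a single coefficient: z_j is taken as the Lasso residual of x_j on X_{−j}, and the correction projects the fit residual onto z_j. The coordinate-descent solver, penalty level, and dimensions are our own simplifications, not the authors' implementation.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for min_b ||y - Xb||^2/(2n) + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                        # residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]         # remove coordinate j from the fit
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

lam = np.sqrt(2 * np.log(p) / n)        # a universal-type penalty level
b_init = lasso_cd(X, y, lam)

# LDPE for coordinate j: z_j is the Lasso residual of x_j on X_{-j};
# the corrected estimate adds z_j^T(y - X b_init)/(z_j^T x_j).
j = 0
others = np.delete(np.arange(p), j)
gamma = lasso_cd(X[:, others], X[:, j], lam)
z = X[:, j] - X[:, others] @ gamma
beta_j_debiased = b_init[j] + z @ (y - X @ b_init) / (z @ X[:, j])
```

With an independent Gaussian design the de-biased estimate should land close to the true value 2.0, with the first-stage shrinkage bias largely removed.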
Our approach is based on the natural idea that group sparsity can be exploited in statistical inference of variable groups. To this end, we propose to use an estimated efficient score matrix to correct the bias of a scaled group Lasso estimator. This combines and extends the ideas of the group Lasso (Yuan and Lin, 2006), the scaled Lasso and the LDPE, and will be shown to capture the benefit of group sparsity both in high-dimensional estimation as in Huang and Zhang (2010) and in bias correction.
The type of statistical inference under consideration here is regular in the sense that it does not require model selection consistency, and that it attains asymptotic efficiency in the sense of Fisher information without being super-efficient. A characterization of such inference is that it does not require a uniform signal strength condition on informative features, e.g. a lower bound on the non-zero |β_j| above an inflated noise level due to model uncertainty adjustment, known as the "beta-min" condition. Many attempts have been made to assess the model selected by high-dimensional regularizers; for example, some early work was done in Knight and Fu (2000), sample splitting was considered in Wasserman and Roeder (2009) and Meinshausen, Meier and Bühlmann (2009), and subsampling was considered in Meinshausen and Bühlmann (2010) and Shah and Samworth (2013). Leeb and Pötscher (2006) proved that the sampling distribution of statistics based on selected models is not estimable. Berk and Zhao (2010) and Laber and Murphy (2011) proposed conservative approaches. Alternative approaches were proposed in Lockhart et al. (2014) and Meinshausen (2014).
The basic idea of Zhang and Zhang (2014) and Zhang (2011) is to correct the bias of high-dimensional regularized estimators by projecting its residual to a direction close to that of the efficient score. Such bias correction, which has been called de-biasing, is parallel to correcting the bias of nonparametric estimators in semiparametric inference (Bickel et al., 1998). Bühlmann (2013) adopted a similar approach to correct the bias of ridge regression. van de Geer et al. (2014) considered an extension to generalized linear models. Javanmard and Montanari (2014) obtained sharper results for Gaussian designs. Belloni, Chernozhukov and Hansen (2014) considered estimation of treatment effects with a large number of controls. Sun and Zhang (2012b), Ren et al. (2013) and Jankova and van de Geer (2014) considered extensions to graphical models and precision matrix estimation.
Since our proposed method relies upon a group regularized initial estimator, in the following we provide a brief discussion of the literature on the topic. The group Lasso (Yuan and Lin, 2006) can be defined as

β̂ = arg min_β { ‖y − Xβ‖²_2/(2n) + Σ_{j=1}^M ω_j ‖β_{G_j}‖_2 }, (1.3)

where {G_j, 1 ≤ j ≤ M} forms a partition of the index set {1, . . ., p} of variables. It is worthwhile to note that when the group effects are being regularized, the choice of basis X_{G_j} = (x_k, k ∈ G_j) within the group may not play a prominent role, so that the design is often "pre-normalized" to satisfy X_{G_j}^T X_{G_j}/n = I_{G_j×G_j} as in Yuan and Lin (2006). The group Lasso and its variants have been studied in Bach (2008), Koltchinskii and Yuan (2008), Obozinski, Wainwright and Jordan (2008), Nardi and Rinaldo (2008), Liu and Zhang (2009), Huang and Zhang (2010), and Lounici et al. (2011) among many others. Huang and Zhang (2010) characterized the benefit of the group Lasso in ℓ_2 estimation, versus the Lasso (Tibshirani, 1996), under the assumption of strong group sparsity; see (2.1). Huang et al. (2009) and Breheny and Huang (2011) developed methodologies for concave group and bilevel regularization. We refer to Bühlmann and van de Geer (2011) and Huang, Breheny and Ma (2012) for further discussion and additional references.
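A minimal proximal-gradient (group soft-thresholding) sketch of a group Lasso of this form follows; the solver, dimensions, penalty level, and weights are illustrative choices of ours, not the implementations used in the cited work.

```python
import numpy as np

def group_lasso(X, y, groups, weights, lam, n_iter=500):
    """Proximal gradient for min_b ||y - Xb||^2/(2n) + lam*sum_j w_j*||b_Gj||_2."""
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()    # 1/Lipschitz constant
    for _ in range(n_iter):
        u = b - step * (X.T @ (X @ b - y) / n)            # gradient step
        for G, w in zip(groups, weights):                 # group soft-threshold
            nrm = np.linalg.norm(u[G])
            b[G] = 0.0 if nrm == 0 else max(1 - step * lam * w / nrm, 0.0) * u[G]
    return b

rng = np.random.default_rng(7)
n, p, d = 200, 40, 4
groups = [np.arange(j * d, (j + 1) * d) for j in range(p // d)]
weights = [np.sqrt(d)] * len(groups)      # the common w_j = sqrt(d_j) choice
beta = np.zeros(p)
beta[groups[0]] = 1.0
beta[groups[1]] = -1.0
X = rng.standard_normal((n, p))
y = X @ beta + 0.5 * rng.standard_normal(n)

b_hat = group_lasso(X, y, groups, weights, lam=0.1)
```

The group soft-threshold zeroes out whole groups at once, which is exactly the mechanism behind the group-sparsity benefit discussed above.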
This paper is organized as follows. In Section 2, we describe the main results of the paper on statistical inference of variable groups. In Section 3, we study a scaled group Lasso needed for the construction in Section 2. In Section 4, we present some simulation results to demonstrate the feasibility and performance of the proposed methods.
2. Group Inference. We present our results in five subsections. Subsection 2.1 describes our working assumption on the availability of certain initial estimates of β and σ. The working assumption is based on the existing literature on the group Lasso and will be verified in Section 3 under proper conditions. Subsection 2.2 develops bias correction formulations as an extension of statistical inference of real parameters. Subsection 2.3 provides optimization strategies (see equations (2.20) and (2.23)) for the construction of inference procedures for groups of variables. Subsection 2.4 provides sufficient conditions (Theorem 3) under which a feasible solution to the optimization problem (2.20) is available. Subsection 2.5 discusses strategies for finding feasible solutions.
We use the following notation throughout the paper. For vectors u ∈ R^d, the ℓ_q norm is denoted by ‖u‖_q; for matrices, the spectral norm is denoted by ‖A‖_S and the nuclear norm by ‖A‖_N = max_{‖B‖_S = 1} trace(B^T A). Given A ⊂ {1, . . ., p}, for any vector u ∈ R^p, u_A ∈ R^{|A|} denotes a vector with corresponding components from u, X_A ∈ R^{n×|A|} denotes the sub-matrix of X with corresponding columns as indicated by the set A, X_{−A} denotes the sub-matrix of X with column indices belonging to the complement of A, and R(X_A) denotes the column space spanned by the columns of X_A. Additionally, E and P denote the expectation and probability measure, and →_D denotes convergence in distribution. Finally, β* denotes the true regression coefficient vector.

2.1. Working assumption based on strong group sparsity. We assume an inherent and pre-specified non-overlapping group structure of the feature set. Put precisely, assume that {G_j, 1 ≤ j ≤ M} is a partition of {1, . . ., p} into non-overlapping groups of sizes d_j = |G_j|. In the following, we allow the quantities n, p, M, the d_j's, etc. to all grow to infinity.
In light of this group structure, further results on consistency of group regularized estimators of β* will be based on a weighted mixed ℓ_{2,1} norm, defined as

‖β‖_{2,1,ω} = Σ_{j=1}^M ω_j ‖β_{G_j}‖_2,

with ω_j > 0 for all j. This norm will be used both as a penalty and as a key loss function. Weighted mixed norms of this type provide a suitable description of the complexity of the unknown β when the following strong group sparsity holds (Huang and Zhang, 2010).
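Computing the weighted mixed (2,1) norm is straightforward; a small self-contained check (the groups and weights here are hypothetical):

```python
import numpy as np

def mixed_norm_21(beta, groups, weights):
    """Weighted l_{2,1} norm: sum_j w_j * ||beta_{G_j}||_2."""
    return sum(w * np.linalg.norm(beta[G]) for G, w in zip(groups, weights))

beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
weights = [1.0, 1.0, 2.0]

# ||(3,4)|| = 5, ||(0,0)|| = 0, ||(1,0)|| = 1  ->  1*5 + 1*0 + 2*1 = 7
assert mixed_norm_21(beta, groups, weights) == 7.0
```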
Strong group sparsity: With the given group structure {G_j, j = 1, . . ., M} as a partition of {1, . . ., p}, there exists a group-index set S* ⊆ {1, . . ., M} such that

supp(β*) ⊆ G_{S*} = ∪_{j∈S*} G_j, |S*| ≤ g, |G_{S*}| ≤ s. (2.1)

In this case, we say that the true coefficient vector β* is (g, s) strongly group sparse with group support S*.
Under the strong group sparsity assumption, various error bounds for group regularized methods have been established in the literature as we reviewed in the introduction.With the support of the existing results and our own in Section 3, we make the following working assumption.
Working assumption: Suppose that we have estimators β̂ and σ̂ satisfying (2.2), where σ* = ‖ε‖_2/√n is an oracle estimate of the noise level σ, and the G_j, s and g are as in (2.1).
As we will prove in Section 3, the error bound for β̂^{(init)} in (2.2) is attainable under proper conditions on the design matrix if the group Lasso is used with a consistent estimate of σ, and the error bounds for both β̂ and σ̂ in (2.2) are attainable if a scaled group Lasso is used. See Corollaries 1 and 2. The working assumption exhibits the benefit of strong group sparsity, since a reasonable working assumption under the ℓ_0 sparsity condition ‖β*‖_0 ≤ s would be (2.3). Although the error bounds in (2.2) and (2.3) do not dominate each other due to the different interpretation of s when supp(β*) ⊆ G_{S*}, the right-hand side of (2.2) is of smaller order when s is of the same order in both settings and g ≪ s.
2.2. Bias correction via relaxed projection. Given a regularized initial estimator β̂^{(init)} of the regression coefficient vector, Zhang and Zhang (2014) proposed to use a relaxed projection to correct the bias of β̂^{(init)}_j via

β̂_j = β̂^{(init)}_j + z_j^T (y − X β̂^{(init)}) / (z_j^T x_j), (2.4)

where z_j is designed to be nearly orthogonal to all x_k, k ≠ j. For the estimation of β_G, a formal vectorization of (2.4) is

β̂_G = β̂^{(init)}_G + (Z_G^T X_G)^† Z_G^T (y − X β̂^{(init)}), (2.5)

where Z_G is an n × |G| matrix and A^† denotes the Moore-Penrose pseudo-inverse of a matrix A.
The problem is to choose β̂^{(init)} and Z_G. Zhang and Zhang (2014) proposed two choices of z_j to match ℓ_1 regularized initial estimators β̂^{(init)}. The first takes z_j as the residual in the Lasso path in the regression of x_j against X_{−j} = (x_k, k ≠ j), as in (2.6). The Karush-Kuhn-Tucker (KKT) conditions for z_j automatically control ‖z_j^T X_{−j}‖_∞, and thus the bias of (2.4). The second proposal of Zhang and Zhang (2014), closely related to the first one and given in the discussion section of their paper, is a constrained variance minimization scheme (2.7). While the Lasso with penalty level λ_j provides a feasible solution z_j/λ_j = (x_j − X_{−j} γ̂_{−j})/λ_j for (2.7), an advantage of (2.7) is a guaranteed bias bound whenever the optimization problem is feasible. For Gaussian designs, such feasibility of z_j = n z^o_j/(x_j^T z^o_j) follows from an application of the union bound (Javanmard and Montanari, 2014). The algebraic extension of the above proposals is straightforward. Write

X_G = X_{−G} Γ_{−G,G} + Z^o_G. (2.8)

We may directly approximate Z^o_G via a regularized multivariate regression in (2.8) or mimic properties of Z^o_G with a regularized optimization scheme. The question is to make a right choice of the regularization to match a proper initial estimator of β. One possibility is to use an ℓ_1 regularized estimate of Γ_{−G,j} in the univariate regression of x_j against X_{−G} for all individual j ∈ G. This has been considered in van de Geer (2014). However, the advantage of such a scheme is unclear compared with directly using (β̂_j, j ∈ G)^T with the β̂_j in (2.4). It is worthwhile to mention that the central limit theorem for (2.4) came with large deviation bounds to justify Bonferroni adjustments (Zhang and Zhang, 2014), so that (2.4) and its variations can be used to test H_0: β*_G = 0 versus an alternative hypothesis on ‖β*_G‖_∞, especially when an ℓ_1 regularized β̂^{(init)} is used (van de Geer et al., 2014). However, we are interested in extensions of traditional F- or chi-square-type tests for ℓ_2 alternatives and to take advantage of group sparsity of β*.

2.3. An optimization strategy.
In this subsection we propose a multivariate extension of (2.7) to match the group structure and weights in our working assumption (2.2).
We write (2.5) in terms of projections so that the resulting optimization scheme will be rotation and scale free within the subspaces under consideration. As our goal in essence is to construct an inferential procedure for X_G β_G, we rewrite the regression problem (1.1) as

y = μ_G + X_{−G} β_{−G} + ε. (2.9)

Here and in the sequel, the following notation is used. For any A ⊂ {1, . . ., p}, μ_A = X_A β_A and Q_A is the orthogonal projection to R(X_A), the column space of X_A, i.e.

Q_A = X_A (X_A^T X_A)^† X_A^T. (2.10)

In the simplest case where the variable group of interest matches the group sparsity, e.g. G = G_{j_0} for some j_0, (2.9) simplifies accordingly. Let P_G be an orthogonal projection matrix close to Q_G in a certain distance and approximately orthogonal to Q_{G_k\G} for all k with G_k ⊄ G. We write (2.5) in terms of projections as in (2.12) and (2.13), with an initial estimator β̂^{(init)}. We note that the condition in (2.13) is slightly weaker than the condition in (2.12). Moreover, ‖P_G Q_G^⊥‖_S = cos θ_min, where θ_min is the minimum principal angle between the subspaces R(P_G) and R^⊥(X_G). Thus, ‖P_G Q_G^⊥‖_S = 1 iff the two subspaces have a nontrivial intersection.
Given an estimate σ̂ of the noise level, we test the hypothesis H_0: β_G = 0 with a statistic T_G of the form (2.14). A test of this form can be easily converted into elliptical confidence regions for linear mappings of β_G in the usual way. We show that (2.12) and (2.13) are consistent with (2.5) as follows. Since both Z_G and X_G are n × |G| matrices, matching their ranks provides the consistency between (2.12) and (2.5). Let Q be the projection to R(X). In the low-dimensional case of rank(X) = p < n, we may set P_G to be the projection to the part of R(X_G) orthogonal to R(X_{−G}), so that (2.12) is the least squares estimator of β_G and T²_G/|G| is the F-statistic for testing H_0: β_G = 0 when σ̂ is the degree adjusted estimate of the noise level based on the residuals of the least squares estimator. Of course, we need to relax the requirement of the orthogonality condition in the high-dimensional setting. To find the proper relaxation, we first inspect the deviation of (2.12), (2.13) and (2.14) from the low-dimensional regression theory. Let β* be the true β and μ*_A = X_A β*_A for all A ⊂ {1, . . ., p}. It follows immediately from (2.12), (2.13) and (2.14) that the estimator admits an expansion with a remainder term Rem_G. Moreover, when β*_G = 0, (2.17) holds. As P_G is an orthogonal projection matrix depending on X only, P_G ε/σ is a standard normal vector living in the image of P_G, and ‖P_G ε/σ‖²_2 has the chi-square distribution with rank(P_G) degrees of freedom. Thus, chi-square based inference can be carried out using the projection estimators in (2.12) and (2.13) and the test statistic T_G in (2.14) under proper conditions on ‖Rem_G‖_2 and σ̂; see, for example, (2.18). We still need to find an upper bound for ‖Rem_G‖_2. To this end we use (2.2) to obtain the bound (2.19), in which the factors M_k are defined. The error bound in (2.19) motivates the following extension of (2.7), the optimization problem (2.20). We say that P_G is a feasible solution of (2.20) if it satisfies all the constraints. We summarize the above analysis in the following theorem.
Theorem 1. Let β̂_G be given by (2.12) and T_G by (2.14) with a feasible solution P_G of (2.20) with rank(P_G) = |G|. Suppose that (2.2) holds for β̂ and σ̂, and that (2.21) holds with the M_k in (2.19). Then, (2.18) holds. In particular, with ‖Rem_G‖_2 = o_P(1), (2.22) holds.

Remark 1. The optimization problem (2.20) also provides geometric insights. As we have mentioned earlier, the quantity ‖P_G − Q_G‖_S is the so-called 'gap' between the subspaces spanned by P_G and Q_G, which we try to minimize. This minimization is done subject to upper bounds on the bias terms, so that by (2.17) the conclusions follow immediately.
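The chi-square behavior of ‖P_G ε/σ‖²_2 underlying this analysis is easy to check numerically in the low-dimensional regime, where P_G can be taken as the projection onto X_G residualized against X_{−G}. The dimensions below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, dG = 500, 10, 3
X = rng.standard_normal((n, p))
X_G, X_mG = X[:, :dG], X[:, dG:]

# Residualize X_G on X_{-G}; P_G projects onto R((I - Q_{-G}) X_G).
Q_mG = X_mG @ np.linalg.solve(X_mG.T @ X_mG, X_mG.T)
Z = X_G - Q_mG @ X_G
U, _ = np.linalg.qr(Z)          # orthonormal basis of the image of P_G

# Under H0 (beta_G = 0), ||P_G eps / sigma||_2^2 ~ chi-square(dG) exactly.
sigma, reps = 1.0, 4000
stats = np.empty(reps)
for i in range(reps):
    eps = sigma * rng.standard_normal(n)
    stats[i] = np.sum((U.T @ eps) ** 2) / sigma ** 2
# Empirically: mean close to dG = 3, variance close to 2 * dG = 6.
```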
A modification of (2.20), which removes the factors M_k in condition (2.21), is to replace P_G with a projection P̃_G to the column space of a modified design X̃_G. The optimization scheme and statistical methods are changed accordingly as in (2.23); with this modification, our analysis yields the following theorem.
Theorem 2. Let P̃_G, β̃_G and T̃_G be given by (2.23). Suppose that (2.2) and (2.24) hold. Then, (2.18) and (2.22) hold with the corresponding modified quantities.

The optimization problems in (2.20) and (2.23) are still somewhat abstract for the moment, although our theorems only require feasible solutions. In the following we prove the feasibility of P_G in (2.20) for Gaussian designs and describe penalized regression methods to find feasible solutions of (2.20) and (2.23).
2.4. Feasibility of relaxed orthogonal projection for random designs. Let e_i be the i-th canonical unit vector of R^n. Throughout this subsection, we assume that the matrix X has i.i.d. sub-Gaussian rows e_i^T X satisfying EX = 0 and E(X^T X/n) = Σ with a positive-definite Σ, and that the sub-Gaussian norm of the rows is bounded by a certain constant. We write the regression model (2.8) as (2.26). Let P^o_G be the orthogonal projection to the column space of Z^o_G, as in (2.27). We use the following lemma to evaluate P^o_G. The inequality (2.28) is well known; see for example Vershynin (2010) and, for Gaussian X, the supplementary material of Ma (2013).
Lemma 1. Let B_k be matrices of p rows and rank r_k, let P_k be the projection to the range of XB_k, and let Ω be as in (2.28) and (2.29).

Proof. The quantity under consideration is an average of i.i.d. variables. Since the size of an ε-net of the unit ball in R^{r_k} is bounded by (1 + 2/ε)^{r_k}, the Bernstein inequality implies the stated tail bound for r_* = r_1 + r_2 and a certain numerical constant C_0. This yields (2.28), as ‖U_1^T ∆ U_2‖_S = ‖∆‖_S for all ∆ of proper dimension. Suppose rank(P_k) = r_k. Let r_0 = rank(P_1 P_2) and 1 ≥ λ_1 ≥ · · · ≥ λ_{r_0} > 0 be the (nonzero) singular values of P_1 P_2. We have ‖P_1 P_2‖_S = λ_1 and ‖P_1 P_2^⊥‖_S = ‖P_1 − P_2‖_S = (1 − λ²_min)^{1/2} with λ_min = λ_{r_0} I{r_0 = r_1 = r_2}. Since (Z_k^T Z_k)^{−1/2} Z_k^T are unitary maps from the range of P_k to R^{r_k}, the singular values of P_1 P_2 are the same as those of the corresponding matrix expressed through unitary maps U_1 and U_2; the Weyl inequality then implies (2.29). As the conditions for λ_1 < 1 and λ_min > 0 follow from the positive-definiteness of Σ, the proof is complete.
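Lemma 1 relates ‖P_1 P_2‖_S and ‖P_1 − P_2‖_S to the principal angles between the two subspaces. A small numerical check of these identities, with random equal-rank subspaces and illustrative dimensions of our choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 50, 4
# Two random r-dimensional subspaces of R^n, via orthonormal bases.
U1, _ = np.linalg.qr(rng.standard_normal((n, r)))
U2, _ = np.linalg.qr(rng.standard_normal((n, r)))
P1, P2 = U1 @ U1.T, U2 @ U2.T

# Singular values of U1^T U2 are the cosines of the principal angles.
cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
lam_1, lam_min = cosines[0], cosines[-1]

# ||P1 P2||_S = lam_1, and for equal-rank subspaces in general position
# ||P1 - P2||_S = sqrt(1 - lam_min^2), matching the lemma.
s_prod = np.linalg.svd(P1 @ P2, compute_uv=False)[0]
s_diff = np.linalg.svd(P1 - P2, compute_uv=False)[0]
```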
Proof of Theorem 3. By (2.27), P^o_G is the orthogonal projection to the range of Z^o_G Ω, where Ω is a |G|×|G| matrix of rank |G| whose smallest singular value is λ_min. Thus, by (2.29) of Lemma 1 and the definitions of ω_k and a_n, the bound (2.30) follows. It remains to prove max_{G_k\G ≠ ∅} M_k = O_P(1) in view of Theorem 1. To this end, we notice that this follows from the condition assumed in the theorem.

2.5. Finding feasible solutions. While (2.30) of Theorem 3 guarantees a feasible solution of (2.20), we discuss here penalized multivariate regression methods for finding feasible solutions of (2.20) and (2.23). As the only difference between (2.20) and (2.23) is the respective use of X_G and X̃_G, we provide formulas here only for (2.20), with the understanding that formulas for (2.23) can be generated in the same way with X_G replaced by X̃_G.
In view of (2.26), a general formulation of the penalized multivariate regression is given in (2.32), with the penalty specified in (2.33). Our main interest is to find feasible solutions of (2.20) and (2.23), not to estimate Γ_{−G,G}.
The following weighted group nuclear penalty, given in (2.34), matches the dual of the constraints in (2.20) and (2.23). It follows from the KKT conditions for (2.32) and (2.34) that the constraints are satisfied at the solution. If we set ω̃_k = ω_k in (2.34), then conditions (2.21) and (2.24) take a simpler form in the case of Theorem 1. When the group sizes are not too large, one may even consider replacing the weighted group nuclear penalty with a weighted group Frobenius penalty, as this can be conveniently computed using group Lasso software.
Remark 2. Compared with the existing sample size condition √n ≫ ‖β‖_0 log p for statistical inference of a univariate parameter at the n^{−1/2} rate, the sample size conditions in (2.21), (2.24), (2.31) and (2.36) clearly demonstrate the benefit of group sparsity as in Huang and Zhang (2010). Moreover, the extra factor of |G| is removed in a number of scenarios even in the case of large group sizes, for example when |G| ≲ min_{G_k ⊆ G} {|G_k| + log(M/δ)} in (2.24) and (2.31).

3. Mixed Norm Consistency Results. Using the group sparsity of the regression coefficient vector and sparse eigenvalue conditions on the design matrix, Huang and Zhang (2010) provided ℓ_2 oracle inequalities to show the benefits of the group Lasso over the Lasso. In this section we provide similar results on mixed weighted norms for both the group Lasso and the scaled group Lasso under different conditions on the design.
3.1. Assumptions for the fixed design matrix. In the Lasso problem, performance bounds of the estimator are derived based on various conditions on the design matrix, for example, the restricted isometry property (Candes and Tao, 2005), the compatibility condition (van de Geer, 2007), the sparse Riesz condition (Zhang and Huang, 2008), the restricted eigenvalue condition (Bickel, Ritov and Tsybakov, 2009; Koltchinskii, 2009), and cone invertibility conditions (Ye and Zhang, 2010). van de Geer and Bühlmann (2009) showed that the compatibility condition is weaker than the restricted eigenvalue condition for the prediction and ℓ_1 loss, while Ye and Zhang (2010) showed that both conditions can be weakened by cone invertibility conditions. In the following, we define grouped versions of such conditions. Let us first define a group-wise mixed norm cone for T ⊂ {1, . . ., M} and ξ ≥ 0, as in (3.1). Following Nardi and Rinaldo (2008) and Lounici et al. (2011), the restricted eigenvalue (RE) is defined accordingly, and for the weighted ℓ_{2,1} norm the group-wise compatibility constant (CC) can be defined analogously. We also introduce the notion of a group-wise cone invertibility factor and extend it to a sign-restricted cone invertibility factor. The cone invertibility factor (CIF) is defined on the cone in (3.1); the sign-restricted cone and the group-wise sign-restricted cone invertibility factor (SCIF) are defined in (3.6). Moreover, the SCIF is always no smaller than the CIF. Thus, following (3.7), the restricted eigenvalue condition RE^{(G)}(ξ, ω, T) > κ_0 implies that all the other quantities are bounded from below by κ_0. In the following, we derive the mixed norm consistency results for the non-scaled group Lasso problem in Theorem 4 and extend it to the scaled group Lasso in Theorem 5.
We establish these results under the weakest assumption, on the SCIF. The SCIF in (3.6) will be used to derive oracle inequalities for the prediction and weighted ℓ_{2,1} loss. For the ℓ_2 loss, we define a corresponding version of the SCIF. We may also use the ℓ_2 version of the CIF, denoted by CIF_2(ξ, ω, T) and defined by replacing the sign-restricted cone C^{(G)}_−(ξ, ω, T) with the cone in (3.1). It follows from a shifting inequality (Cai, Wang and Xu, 2010; Ye and Zhang, 2010) that a lower bound involving ω_min = min_{1≤j≤M} ω_j holds. Again, the cone invertibility factors provide error bounds of sharper form than (3.2), in view of Theorem 4 below and Theorem 3.1 of Lounici et al. (2011).
Remark 3. From Theorem 4, max{‖β̂ − β*‖²_2, Σ_{j=1}^M ω_j ‖β̂_{G_j} − β*_{G_j}‖_2} = O((s + g log M)/n) when the SCIF can be treated as a constant. This shows the benefit of the group Lasso compared with the Lasso as in Huang and Zhang (2010). The same convergence rate can be derived from the ℓ_2 consistency result in Huang and Zhang (2010). Their result, however, is derived under a sparse eigenvalue condition on the design matrix X.
Proof of Theorem 4. The KKT conditions for the group Lasso assert (3.13). Now take any w ∈ R^p. Pre-multiplying both sides of (3.13) by (β̂_{G_j} − w_{G_j})^T and rearranging, we obtain a basic inequality. Putting w = β* and h = β̂ − β*, it follows that h ∈ C^{(G)}(ξ, ω, S*) in the event E. Moreover, from the KKT conditions (3.13), pre-multiplying both sides by h_{G_j}^T for j ∉ S*, we have, in the event E, h ∈ C^{(G)}_−(ξ, ω, S*). Consequently, the stated bounds follow by (3.6) and (3.14).
The bound for the weighted ℓ_{2,1} loss follows as Σ_{j=1}^M ω_j ‖h_{G_j}‖_2 ≤ (1 + ξ) Σ_{j∈S*} ω_j ‖h_{G_j}‖_2. The proof for the ℓ_2 loss is nearly identical and thus omitted.
Finally, we prove (3.12). As ε ∼ N_n(0, σ²I_n), it follows from the Gaussian concentration inequality that, for any 0 < δ < 1, the stated bound holds with probability at least 1 − δ. The result in (3.12) then follows by an application of the union bound.
3.3. Scaled Group Lasso. In the optimization problem (1.3), scale-invariance considerations have not been taken into account. Usually, the individual penalty levels ω_j could be chosen proportional to the scale σ as a remedy. This issue has been discussed and studied, as pertaining to the Lasso problem, in a number of papers. See Huber (2011), Städler, Bühlmann and Geer (2010), Antoniadis (2010), Sun and Zhang (2010), Belloni, Chernozhukov and Wang (2011), Sun and Zhang (2012a), Sun and Zhang (2013) and many more. Following Antoniadis (2010), we define an optimization problem of the form

{β̂, σ̂} = arg min_{β,σ} L_ω(β, σ), where L_ω(β, σ) = ‖y − Xβ‖²_2/(2nσ) + (1 + a)σ/2 + ‖β‖_{2,1,ω}.

Following Sun and Zhang (2010), we define an iterative algorithm for the estimation of {β, σ}:

σ̂ ← ‖y − X β̂‖_2 / {(1 + a)n}^{1/2}, ω̂ ← σ̂ ω, β̂ ← arg min_β L_{ω̂}(β), (3.17)

where L_ω(β) was defined in (1.3). Due to the convexity of the joint loss function L_ω(β, σ), the solution of (3.15) and the limit of (3.17) give the same estimator, which we call the scaled group Lasso. The constant a ≥ 0 provides control over the degrees of freedom adjustments.
In practice, for the scaled group Lasso in the p > n setting, we take a = 0 for all subsequent discussions. It is clear that with a = 0 and ω̄ = σω, one has σ L_ω(β, σ) = L_{ω̄}(β) + σ²/2. The algorithm in (3.17) suggests a profile optimization approach. The following lemma is similar to Proposition 1 in Sun and Zhang (2012a) and characterizes the solution via the partial derivative of the profile objective.
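The alternating scheme in (3.17) with a = 0 can be sketched as follows: a group-Lasso step with penalty σ̂ω alternates with the update σ̂ ← ‖y − Xβ̂‖₂/√n. The solver, penalty weights, and dimensions below are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def group_lasso(X, y, groups, pen, n_iter=300):
    """Proximal-gradient group Lasso with per-group penalty levels pen[j]."""
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    for _ in range(n_iter):
        u = b - step * (X.T @ (X @ b - y) / n)
        for G, w in zip(groups, pen):
            nrm = np.linalg.norm(u[G])
            b[G] = 0.0 if nrm == 0 else max(1 - step * w / nrm, 0.0) * u[G]
    return b

def scaled_group_lasso(X, y, groups, omega, n_outer=10):
    """Alternate beta <- group Lasso with penalty sigma*omega and
    sigma <- ||y - X beta||_2 / sqrt(n)  (the a = 0 update)."""
    n = len(y)
    sigma = np.linalg.norm(y) / np.sqrt(n)   # crude over-estimate to start
    b = np.zeros(X.shape[1])
    for _ in range(n_outer):
        b = group_lasso(X, y, groups, [sigma * w for w in omega])
        sigma = np.linalg.norm(y - X @ b) / np.sqrt(n)
    return b, sigma

rng = np.random.default_rng(4)
n, p, d = 400, 40, 4
groups = [np.arange(j * d, (j + 1) * d) for j in range(p // d)]
beta = np.zeros(p)
beta[groups[0]] = 1.0
beta[groups[1]] = -1.0
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)        # true sigma = 1

M = len(groups)
omega = [np.sqrt(d) * np.sqrt(2 * np.log(M) / n)] * M
b_hat, sigma_hat = scaled_group_lasso(X, y, groups, omega)
```

Since the joint loss is convex, the alternating updates settle at the joint minimizer; here sigma_hat should land near the true noise level 1.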
Proof of Lemma 2. For η ≥ 0, define the profile objective and compute its partial derivative; the characterization of the solution follows. All other claims follow from the joint convexity of L_ω(β, σ) and the strict convexity of the loss function in Xβ.
We now present the consistency theorem for the scaled group Lasso, which extends Theorem 4 by providing convergence results for the estimate of the scale. Let m_{d,n} be the median of the beta(d/2, n/2 − d/2) distribution, and let ω_* be the vector with elements ω_{*,j}. We will show that √m_{d_j,n} ≤ (d_j/n)^{1/2} + n^{−1/2} in the proof of the following theorem.
Theorem 5. Let {β̂, σ̂} be a solution of the optimization problem (3.16) with data (X, y), and let β* be a vector with supp(β*) ⊂ G_{S*}, where σ* = ‖ε‖_2/√n is the oracle noise level. Then, in the event E, the bounds in (3.22) hold for q = 1, 2. (ii) Suppose the regression model in (1.1) holds with Gaussian error and a design matrix satisfying a uniform bound over the group sub-matrices X_{G_j}, j ≤ M.

Corollary 1. Consider the setup of Theorem 5 (ii). Assume that the design matrix X satisfies the following sign-restricted cone invertibility condition: SCIF^{(G)}_1(ξ, S*) > c > 0 for some fixed c > 0. Let 0 < δ < 1 be a fixed small constant and take the penalty levels accordingly. Then, for a certain fixed constant C > 0 and with probability at least 1 − δ, the bounds in (3.25) hold.

4. Simulation Results. We provide a few simulation results for our theories developed in Sections 2.1 and 3. As a prelude, in the following we first show the performance of the scaled group Lasso procedure in a simulation experiment. We consider two simulation designs, with (n = 1000, p = 200) and (n = 1000, p = 2000), and design matrices with elements generated independently from N(0, 1). We assume that the true parameter β* has an inherent grouping, with the full set of p parameters divided into groups of size d_j = 4. In the design with (n = 1000, p = 200) the total number of groups is M = 50, and in the design with (n = 1000, p = 2000), M = 500. For both scenarios, the true parameter β* is assumed to be (g = 2, s = 8) strongly group sparse with its non-zero coefficients in {−1, 1}. Both simulation designs have N(0, σ²) errors added to the true regression mean Xβ* with σ = 1. We also assume that the design matrix is group-wise orthogonalized in the sense of X_{G_j}^T X_{G_j}/n = I_{G_j×G_j}, j = 1, . . ., M. In the estimation of σ we employ the scaled group Lasso procedure as in (3.17). The group-wise penalty factors ω_j are chosen equal to λ d_j for some fixed λ > 0.
The implementation of the group Lasso procedure is via the R package gglasso. In the design setup with (n = 1000, p = 200), the estimate of σ averaged over 100 replications is 0.997 with a standard deviation of 0.02. In the design setup with (n = 1000, p = 2000), the estimate of σ averaged over 100 replications is 1.0002 with a standard deviation of 0.02. Additionally, Figure 1 shows the Gaussian Q-Q plots of the test statistic √(2n)(σ̂/σ − 1).
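The approximate standard normality of √(2n)(σ̂/σ − 1) can be seen already for the oracle-type estimate ‖ε‖₂/√n; a quick Monte Carlo check (our own sketch, with illustrative replication counts, not the paper's simulation code):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, reps = 1000, 1.0, 5000
stats = np.empty(reps)
for i in range(reps):
    eps = sigma * rng.standard_normal(n)
    sigma_oracle = np.linalg.norm(eps) / np.sqrt(n)
    stats[i] = np.sqrt(2 * n) * (sigma_oracle / sigma - 1)
# stats should be close to N(0, 1): mean near 0, standard deviation near 1.
```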
4.1. Asymptotic test statistic. We also seek empirical validation of the asymptotic convergence of the group inference statistic for β_{G_j} as described in our theoretical results. For the bias correction we take the penalty function in (2.32) to be the Frobenius norm and apply group Lasso based optimization. We also consider a new simulation design, similar to before, with (n = 1000, p = 200) and σ = 1. We will consider two different schemes for the empirical analysis of asymptotic convergence.

Small group sizes
The true parameter β* is simulated to be (s = 40, g = 10) strongly group sparse with its nonzero values in the interval [2, 3]. More specifically, β* is grouped into groups of size d_j = 4 for all j. We construct the test statistic for μ_{G_j} as in (2.14) for one of the nonzero groups.

Large group sizes
The true parameter β* is simulated to be (s = 40, g = 2) strongly group sparse with its nonzero values in the interval [2, 3]. More specifically, β* is grouped into 20 groups, each of size d_j = 20. The sparsity of the true parameter β* is s = 40, contained within 2 separate groups. Again, we construct the test statistic for μ_{G_j} as in (2.14) for one of the nonzero groups.
Figure 3 shows the Q-Q plot for this group's test statistic. As the figure suggests, for large group sizes the asymptotic normality of the group test statistic is empirically supported.

Fig 1. Normal Q-Q plot for the test statistic for σ in (3.23) in Theorem 5 with n = 1000, p ∈ {200, 2000}, g = 2, s = 8. The results are produced with 100 replications of the scaled group Lasso. The red dotted line is fitted through the 1st and 3rd sample quartiles.
Figure 2 provides a χ²₄-based Q-Q plot of the sample quantiles of our test statistic.

Fig 2. Chi-square Q-Q plot for the test statistic for μ_{G_j} with n = 1000, p = 200, g = 10, s = 40. The theoretical quantiles were drawn from a χ²₄ random variable. The group being tested has size 4.

Fig 3. Normal Q-Q plot for the test statistic for μ_{G_j} with n = 1000, p = 200, g = 2, s = 40. Here the group size of the test group is 20.
Proof of Theorem 1. It follows from (2.19) and the feasibility of P_G in (2.20) that