Group Inference in High Dimensions with Applications to Hierarchical Testing

Group inference has been a long-standing question in statistics, and the development of high-dimensional group inference is an essential part of statistical methods for analyzing complex data sets, including hierarchical testing, tests of interaction, detection of heterogeneous treatment effects and local heritability. Group inference in regression models can be measured with respect to a weighted quadratic functional of the regression sub-vector corresponding to the group. Asymptotically unbiased estimators of these weighted quadratic functionals are constructed, and a procedure using these estimators for inference is proposed. We derive its asymptotic Gaussian distribution, which allows us to construct asymptotically valid confidence intervals and tests that perform well in terms of length or power. The results simultaneously address four challenges encountered in the literature: controlling coverage or type I error even when the variables inside the group are highly correlated, achieving good power when there are many small coefficients inside the group, computational efficiency even for a large group, and no requirements on the group size. We apply the methodology to several interesting statistical problems and demonstrate its strength and usefulness on simulated and real data.


Introduction
High-dimensional linear regression has found many applications in modern data analysis through extracting useful information from complex data. Statistical inference in the high-dimensional sparse linear regression is an important but also challenging problem. In the past few years, there has been a fast-growing literature on this topic. The current paper addresses an important and long-standing statistical inference problem, namely inference for, or testing significance of, groups of covariates. Specifically, we consider the following high-dimensional linear regression model,

y_i = X_{i·}^⊤ β + ε_i, for 1 ≤ i ≤ n, (1)

where X_{i·} ∈ R^p are i.i.d. p-dimensional random vectors with Σ = E X_{i·} X_{i·}^⊤ and the ε_i are i.i.d. Gaussian errors, independent of X_{i·}, with mean zero and variance σ². For a given set G ⊂ {1, 2, · · · , p}, the group significance test is for the hypothesis

H_0 : β_G = 0, (2)

where β_G = {β_j ; j ∈ G}. More generally, for inference we will consider certain weighted functionals of β_G. In many applications, identification of group significance is as important as that of individual significance, especially in the scenario when the covariates are likely to affect the outcome jointly or the covariates are highly correlated with each other.
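To make the setup concrete, here is a minimal simulation sketch of model (1) and the group hypothesis (2). The design, dimensions, coefficient values and group are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50

# AR(1)-type covariance Sigma_jk = 0.5^{|j-k|} (an illustrative choice)
idx = np.arange(p)
Sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# sparse regression vector whose support lies inside the group under test
beta = np.zeros(p)
beta[:5] = 0.5
eps = rng.normal(0.0, 1.0, size=n)      # Gaussian errors with sigma^2 = 1
y = X @ beta + eps                      # model (1): y_i = X_i' beta + eps_i

G = np.arange(10)                       # the group G = {1, ..., 10}
# H_0: beta_G = 0; since Sigma_GG is positive definite, the weighted
# quadratic functional beta_G' Sigma_GG beta_G is zero exactly under H_0
Q_Sigma = beta[G] @ Sigma[np.ix_(G, G)] @ beta[G]
```

Here the alternative holds, so the quadratic functional is strictly positive.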
In the corresponding low-dimensional setting, the (partial) F-test is the classical procedure for testing this hypothesis. However, there is still a lack of methods for conducting group inference or testing (2) in high dimensions, especially when the group G has large size.
In the following, we shall provide a series of motivations for group inference and significance testing.
1. Hierarchical Testing. As written above, when the variables are highly correlated, it is often too ambitious to detect significant single variables, and groups of correlated variables are considered instead. Hierarchical testing is a multiple testing procedure which uses p-values for group significance as input. It is a hybrid between sequential testing and Bonferroni-type correction to control the familywise error rate [23].
A hierarchy is provided in terms of a tree, which is typically the output of hierarchical clustering of the covariates: each node in the tree is a group of variables and, at every level of the tree, the groups build a partition of {1, . . ., p}. Hierarchical testing then proceeds in a top-down manner according to the tree by using group significance testing multiple times: at the beginning, in the upper part of the tree, the groups are large, and they become smaller when moving downward in the tree.
Hierarchical testing determines the resolution level in the tree in a fully data-driven way, depending on the signal strength and the correlation structure among the variables. For strong signals with regression coefficients that are large in absolute value, and when the variables are not too highly correlated, the method will discover small groups or even single variables; and vice versa when the signal is not so strong or the correlation among the variables is high. Thus, hierarchical testing addresses very elegantly the trade-off between signal strength and high correlation, and it is perhaps among the most natural ways to deal with large-scale high-dimensional testing problems in real applications [6,19]. More details are given in Section 4.1.
2. Interaction Test and Detection of Effect Heterogeneity. The group significance test can be used to test for the existence of interactions or to detect heterogeneous treatment effects. With a slight modification of (1), we write the following model with interaction terms,

y_i = D_i X_{i·}^⊤ γ + X_{i·}^⊤ β + ε_i. (3)

Here, D_i is the treatment or exposure variable and can be a binary variable in many applications. The interaction test in the above model (3) is formulated as testing H_0 : γ = 0. If D_i is an indicator of whether the i-th observation receives a treatment or not, then the test of H_0 : γ = 0 can be viewed as detecting the existence of heterogeneous treatment effects. Define W_{i·} = (D_i X_{i·}^⊤, X_{i·}^⊤)^⊤ and η = (γ^⊤, β^⊤)^⊤; then the model (3) can be expressed as y_i = W_{i·}^⊤ η + ε_i, and the test H_0 : γ = 0 can be formulated in the form of (2) as H_0 : η_G = 0 with G = {1, 2, · · · , p}. For a detailed discussion, see Section 4.2 and [9,30].
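A minimal sketch of this reformulation, with hypothetical dimensions of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
D = rng.binomial(1, 0.5, size=n)     # binary treatment indicator

# stack W_i = (D_i * X_i, X_i): model (3) becomes y_i = W_i' eta + eps_i
W = np.hstack([D[:, None] * X, X])   # eta = (gamma', beta')' in R^{2p}

# testing H_0: gamma = 0 is the group test H_0: eta_G = 0 with
G = np.arange(p)                     # the first p coordinates of eta
```

The interaction test then reduces to a single group significance test on a regression with 2p covariates.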

3. Local Heritability in Genetics. Heritability measures how much of the phenotype (response) variance can be explained by the genotypes (covariates). Local heritability is among the most important heritability measures [28], as it represents the proportion of variance explained by a subset of genotypes indexed by the group G. In applications, the group G can be naturally formulated; for example, the set of SNPs located on the same chromosome. The group significance test is motivated by studying whether the joint effect of a group of genotypes with index set G is significant. Additionally, it is also of great interest to perform statistical inference for local heritability in terms of confidence intervals, which is closely related to the group significance testing problem. See more discussion in Section 4.3.
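As a schematic sketch only (the paper's exact definition of local heritability may differ), one can think of the group's contribution as a variance ratio; the numbers below are toy values of our own:

```python
import numpy as np

# toy illustration: local heritability of group G as the proportion of the
# phenotype variance attributable to the group (schematic convention)
Sigma = np.eye(6)                      # uncorrelated genotypes for simplicity
beta = np.array([0.8, 0.6, 0.0, 0.0, 0.3, 0.0])
sigma2 = 1.0                           # noise variance
G = [0, 1]                             # e.g. SNPs on one chromosome

var_y = beta @ Sigma @ beta + sigma2   # total phenotype variance
Q_Sigma = beta[G] @ Sigma[np.ix_(G, G)] @ beta[G]
h2_local = Q_Sigma / var_y             # proportion explained by group G
```

A confidence interval for Q_Sigma thus translates directly into an interval for the explained-variance proportion.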

Problem Formulation
There exist different ways of conducting the testing problem (2), including the F-test in low-dimensional settings or the maximum test, where one considers the maximum of absolute values of single normalized components of an estimated regression vector. The latter is implicit in the work of [34] for the debiased Lasso. Here, we consider testing (2) through weighted quadratic functionals. For a given index set G ⊆ {1, 2, · · · , p}, we introduce the following null hypothesis,

H_0 : β_G^⊤ A β_G = 0, (4)

for some positive definite matrix A ∈ R^{|G|×|G|}, with |G| denoting the cardinality of G. As long as A is positive definite, the null hypothesis (4) is equivalent to (2). We shall highlight the following most interesting cases of (4). The first one is to replace A with Σ_{G,G}, where Σ = E X_{i·} X_{i·}^⊤ is the second order moment matrix,

H_0 : β_G^⊤ Σ_{G,G} β_G = 0. (5)

The other one is to replace the weight matrix A with the identity matrix,

H_0 : ‖β_G‖_2² = 0. (6)

The quantity β_G^⊤ Σ_{G,G} β_G in (5) is naturally used for group significance testing, as this quantity measures the variance explained by the set of variables indexed by G. We shall remark that the testing problem (5) can be conducted in both cases where the matrix in the middle, Σ_{G,G}, is known or not. If Σ_{G,G} is known, then it can simply be treated as a special case of (4). However, in the more practical setting where the positive definite matrix Σ_{G,G} is unknown, we need to estimate Σ_{G,G} in the construction of the test and also need to quantify the additional uncertainty of estimating this matrix from data.
Throughout the paper, when there is no confusion about the definition of the index set G, we omit the dependence on G and introduce the following notations,

Q_Σ = β_G^⊤ Σ_{G,G} β_G and Q_A = β_G^⊤ A β_G.

Though the focus of the current paper is mainly on testing group significance, the inference problem for Q_Σ and Q_A in terms of confidence regions is of independent interest, in particular for the applications in genetics concerning local heritability.
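To see how the choice of weight matrix matters, consider a toy group of two highly correlated variables whose effects nearly cancel (illustrative numbers of our own):

```python
import numpy as np

Sigma_GG = np.array([[1.0, 0.9], [0.9, 1.0]])   # highly correlated pair
beta_G = np.array([1.0, -1.0])                  # effects that nearly cancel

Q_Sigma = beta_G @ Sigma_GG @ beta_G            # A = Sigma_GG: equals 0.2
Q_I = beta_G @ beta_G                           # A = I: equals 2.0
Q_diag = beta_G @ np.diag(np.diag(Sigma_GG)) @ beta_G  # A = diag(Sigma_GG): 2.0
```

With A = Σ_{G,G}, the functional equals the variance Var(X_{i,G}^⊤ β_G) actually explained by the group, which is small here because the two effects offset each other under high correlation; with A = I or A = diag(Σ_{G,G}), the offsetting is invisible.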

Results and Contribution
Recently, statistical inference in high-dimensional models has been carefully studied in both statistics and econometrics [7,11,18,31,34], with a focus on confidence interval construction and hypothesis testing for single regression coefficients. Together with a careful use of bootstrap methods, certain maximum tests have been developed in [11,13,35] to conduct the hypothesis testing problem (2). Although the developed tests are successful in controlling the type I error under different conditions, it is known that the coordinate-based maximum test suffers from the following problems in both statistical inference and computational efficiency. Statistically speaking, when the variables inside the group G are highly correlated, there is no guarantee that the coordinate-based maximum test controls the type I error; additionally, if β_G is not zero but contains many small regression coefficients, the maximum test is expected to have relatively low power. Both of these phenomena are observed in our simulation studies. Computationally speaking, if the group G is of large size, then the coordinate-based maximum test requires solving |G| + 1 high-dimensional penalized/constrained optimization problems, where each optimization problem involves a p-dimensional parameter. This is time-consuming, especially when the size of G is large or when conducting several group significance tests.
A major goal of the current paper is to conduct the hypothesis test (2) via an estimation procedure for the quadratic forms in (4) and (5). At the same time, we would like to address the statistical and computational challenges of the coordinate-based maximum test. The testing procedure proposed in the current paper makes inference for Q_Σ = β_G^⊤ Σ_{G,G} β_G and Q_A = β_G^⊤ A β_G, both of which can be viewed as measures of the joint group effect. One of the advantages of testing the group effect jointly is that, even if the variables inside the group G are highly correlated, a joint test based on β_G^⊤ Σ_{G,G} β_G is more reliable, as the individual effects inside G are nearly non-identified.

The main methodology in the current paper is to carefully calibrate certain reasonably good initial estimators for Q_Σ and Q_A.
We take the inference problem for Q_Σ as an example and briefly describe the main idea of our proposed procedure. Denote by β̂ and Σ̂ some reasonably good estimators of β and Σ, respectively. For example, β̂ can be taken as a penalized estimator with a proper tuning parameter and Σ̂ = (1/n) Σ_{i=1}^n X_{i·} X_{i·}^⊤. It is known that such an initial estimator is not proper for conducting statistical inference, due to the fact that the bias of the penalized estimator β̂ is carried over to the plug-in estimator β̂_G^⊤ Σ̂_{G,G} β̂_G and results in a large error. We calibrate this plug-in estimator by constructing a projection direction for correcting the bias. Although the idea of using projection directions for bias correction has been developed for inference on individual regression coefficients [34], we need to develop a new way of constructing the projection direction for our specific purpose of group inference. This ensures that our proposed group significance test is statistically accurate and also computationally efficient.
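The plug-in bias can be seen numerically. The sketch below uses a crude Lasso surrogate (soft-thresholded marginal estimates, an assumption of ours purely for illustration, not the paper's estimator) on an identity-covariance design; the shrinkage bias of the initial estimator carries over to the plug-in quadratic form:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 50
X = rng.normal(size=(n, p))            # Sigma = I for this toy design
beta = np.zeros(p)
beta[:10] = 0.5
y = X @ beta + rng.normal(size=n)

# crude Lasso surrogate: soft-thresholded marginal estimates (illustration only)
lam = np.sqrt(2 * np.log(p) / n)
bhat = X.T @ y / n
bhat = np.sign(bhat) * np.maximum(np.abs(bhat) - lam, 0.0)

G = np.arange(10)
Sigma_hat = X.T @ X / n
plugin = bhat[G] @ Sigma_hat[np.ix_(G, G)] @ bhat[G]
truth = beta[G] @ beta[G]              # = beta_G' Sigma_GG beta_G since Sigma = I
# the shrinkage bias of bhat carries over: plugin systematically underestimates
```

The uncorrected plug-in value falls well below the target, which is exactly the bias the projection-direction calibration is designed to remove.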
We shall consider the most challenging case, where the index set G has a very large size, and highlight how this new construction of the projection direction enables us to achieve nice statistical and computational properties simultaneously. Intuitively, the new construction of the projection direction corrects the whole bias of β̂_G^⊤ Σ̂_{G,G} β̂_G all at once, instead of correcting the bias of β̂ coordinate by coordinate. Such a direct correction has the computational advantage of only solving an optimization problem with a p-dimensional parameter twice, no matter how large the group size is. More fundamentally, a direct calibration of β̂_G^⊤ Σ̂_{G,G} β̂_G carefully rebalances the bias and variance, in the sense that the bias is well controlled and the variance dominates the bias.
To sum up, the proposed group significance test has the following three main advantages.
1. High correlation inside G. The proposed testing procedure is effective even if the variables inside G are highly correlated. This is particularly useful for hierarchical testing and also for inference on local heritability, where the variables belonging to the same group tend to be highly correlated. See the numerical illustration in Section 5.2.

2. Computational efficiency. The computational cost of the proposed procedure does not grow with the group size |G|: no matter how large the group is, the construction requires solving only a fixed number of p-dimensional optimization problems.

3. No restriction on the group size. The proposed procedure imposes no condition on |G|; it works for small groups as well as for groups as large as |G| = p.

Literature Comparison
There is a rich literature on testing group significance and related statistical inference problems. We have reviewed some of the existing work based on the maximum test and shall mention a few other works related to the group significance test. Inference for quadratic functionals is closely related to the group significance test. Statistical inference for β^⊤ Σ β and ‖β‖_2² has been carefully investigated in [8,14,33], and the inference methods developed in [8] have been extended to conduct the signal detection problem, which is a special case of (2) obtained by setting G = {1, 2, · · · , p}. We shall emphasize that the group significance test problem is much more challenging than the signal detection problem, mainly due to the fact that the group significance test requires a decoupling between the variables inside G and the other variables inside G^c. This decoupling step between G and G^c essentially requires a novel construction of the projection direction. The same comments differentiate the current paper from the signal detection problem considered in [1,17].
Along another line, the works [25,32] extended the F-test or χ² test for fixed dimension to the high-dimensional setting, and this extended test can also be used for group significance testing. However, such a test is expected to be less powerful than the proposed test in the case of a large group G. To see this, the standard deviation level of the F-test or χ² test is at the scale of |G|/n; in contrast, the standard deviation level of our proposed test statistic is at the scale of ‖β_G‖_2²/n. If |G| is large, then the significance test proposed in the current paper is more powerful as long as ‖β_G‖_2² is smaller than |G|, which is typically true if β satisfies a certain sparsity structure. Additionally, [25] requires that the group size be smaller than the sample size n, while the current paper imposes no condition on the group size |G|. Another related work [23] considered the group significance test without any compatibility condition on the design.
Paper Organization. The remaining part of the paper is organized as follows. In Section 2, we introduce the methodology for group significance testing. In Section 3, we provide theoretical justification for the proposed method. In Section 4, we apply the general procedure to several interesting problems, including hierarchical testing, tests of interaction and detection of effect heterogeneity, and inference for local heritability. In Section 5, we conduct a large set of numerical studies to support our claims and theory. In Section 6, we apply our method to two real data sets from genomics and genetics. Proofs and additional numerical results are presented in the supplementary material.

Methodology for Testing Group Significance
In this section, we propose statistical inference procedures for both Q_Σ and Q_A, where, throughout the paper, we use β̂ to denote a reasonably good estimator of β and use the sample covariance matrix Σ̂ = (1/n) Σ_{i=1}^n X_{i·} X_{i·}^⊤ as the estimator of Σ. To simplify the discussion, we also assume the index set G is of the form {1, 2, · · · , |G|}.
We decompose the error of the plug-in estimator β̂_G^⊤ Σ̂_{G,G} β̂_G. Based on this decomposition, we need to estimate 2 β̂_G^⊤ Σ̂_{G,G} (β_G − β̂_G), as this is one of the dominant terms, and then calibrate the plug-in estimator β̂_G^⊤ Σ̂_{G,G} β̂_G accordingly. The calibration can be done through identifying a projection direction u ∈ R^p. The challenging part is how to identify such a projection direction; its construction can be motivated from the error decomposition. As long as β̂ is a reasonably good estimator with a small ‖β̂ − β‖_1, it remains to construct the projection direction u such that the approximation error is upper bounded by a small value. From a geometric perspective, this constraint ensures that the projection of the approximation vector Σ̂ u − ((Σ̂_{G,G} β̂_G)^⊤, 0)^⊤ onto each Euclidean basis vector {e_j}_{1≤j≤p} is small. However, this idea is only useful when the group size |G| is small. If the group size |G| is quite large, then this intuition is not guaranteed to work, due to the fact that ((Σ̂_{G,G} β̂_G)^⊤, 0)^⊤ can be quite dense. Instead, we need to constrain the difference term to be small for all w ∈ C, for a suitably chosen constraint set C, and we define the projection direction û as the solution of the resulting constrained optimization problem. Let σ̂² denote a reasonable estimator of σ². We then estimate the variance of the proposed estimator Q̂_Σ by V̂_Σ(τ), for some positive constant τ > 0. We can combine the proposed point estimator and the asymptotic variance to construct a confidence interval for Q_Σ and to conduct the group significance test. More details will be provided in Section 2.3.
Two remarks about the proposed testing procedure are in order. First, the computational cost of the proposed algorithm is independent of the group size |G|. No matter whether the group size is large or small, the construction of the projection direction û involves a constrained optimization problem with a p-dimensional parameter or, equivalently in its dual form, a penalized optimization problem with a p-dimensional parameter. Second, we correct the bias of the plug-in estimator all at once, instead of conducting coordinate-wise bias correction. Such a direct correction not only leads to a computationally efficient algorithm but also provides a good balance between the bias and the variance, especially for a large group size |G|.
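The one-shot correction can be sketched as follows. The construction of the projection direction itself requires solving the constrained program described above, so here `u` is a placeholder argument, and the function name and signature are our own:

```python
import numpy as np

def corrected_Q_Sigma(X, y, bhat, u, G):
    """One-shot bias correction of the plug-in estimator (a sketch).

    `u` is assumed to be a projection direction obtained from the
    constrained optimization described in the text; its construction
    is not reproduced here.
    """
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    plugin = bhat[G] @ Sigma_hat[np.ix_(G, G)] @ bhat[G]
    # correct the whole bias of the plug-in estimator all at once,
    # using the residuals of the initial fit
    return plugin + 2.0 * u @ (X.T @ (y - X @ bhat)) / n
```

With u = 0 the correction vanishes and the plug-in estimator is returned unchanged; the cost of the correction itself is a single matrix-vector product, independent of |G|.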

Inference for Q A
The main idea of estimating Q_A is similar to that of estimating Q_Σ, though the problem itself is slightly easier due to the fact that the matrix A is known. We start with the error decomposition of the plug-in estimator β̂_G^⊤ A β̂_G. Similarly, we can construct the projection direction û_A by solving an analogous constrained optimization problem. We then estimate the variance of Q̂_A by V̂_A(τ), for some positive constant τ > 0. In the following, we provide some discussion comparing Q̂_Σ in (7) and Q̂_A in (10), and then we consider the special case ‖β_G‖_2².

Comparison to inference procedure for Q Σ
We now compare the general inference results for Q_A with those for Q_Σ. The connection is that Q_Σ is equal to Q_A if the positive definite matrix A is taken as Σ_{G,G}. However, in most applications, the covariance submatrix Σ_{G,G} is typically unknown, and this leads to a more challenging problem for conducting statistical inference for Q_Σ; that is, we have to estimate Σ_{G,G} from the data. As a result, we also need to quantify the uncertainty of estimating Σ_{G,G}.
Comparing the variance of Q̂_Σ in (8) and the variance of Q̂_A in (11), we observe that there is one additional term in the variance level for Q̂_Σ, which captures the additional uncertainty of estimating Σ_{G,G}. Beyond the additional complexity of dealing with the unknown matrix Σ_{G,G} in Q_Σ, the quantity with the true covariance matrix in the middle has its own advantage. To illustrate the advantage, we consider a simpler setting where the random vector X_{i,G} is independent of the random vector X_{i,G^c}. If we further assume that |G| ≪ n, we can provide an alternative way of constructing the projection direction defined in (9), namely û = (β̂_G^⊤, 0)^⊤. Even in such a simplified setting where X_{i,G} is independent of X_{i,G^c} and |G| ≪ n, we observe that the constructed projection direction û is easier to obtain than û_A, as the construction of û is free of inverting the matrix Σ̂_{G,G}. In the case that the covariates in X_{i,G} are highly correlated, or the matrix Σ_{G,G} is close to singular, the inference procedure based on Q̂_Σ is more reliable due to this observation. Additionally, Q_Σ = E|X_{i,G}^⊤ β_G|² amounts to estimating the regression surface with ỹ_i = y_i − X_{i,G^c}^⊤ β_{G^c}. Therefore, Q_Σ is identifiable even if some of the components of X_{i,G} exhibit correlation with absolute values close to 1. See Section 5.2 for the numerical results.

A special case with A = I
In this part, we consider a commonly used special example, setting A = I, and decompose the error of the plug-in estimator ‖β̂_G‖_2² accordingly. For this special case, the projection direction û_I can be identified via a simpler optimization algorithm. In contrast to the constructions of û and û_A, this algorithm for constructing û_I is simpler since the constraint set C_0 is smaller than C; that is, we do not need to constrain the difference from the additional projection term, as the corresponding target vector (β̂_G^⊤, 0)^⊤ is a sparse vector no matter how large the set G is.

Testing procedures and confidence intervals
Having introduced the point estimators and the quantification of the variance, we propose the following two α-level significance tests, φ_Σ(τ) and φ_A(τ), where z_{1−α} is the 1 − α quantile of the standard normal distribution. As side products, we can construct confidence intervals for Q_A and Q_Σ.

Theoretical Justification

In this section, we establish the theoretical properties of the estimators and inference procedures proposed in the methodological section. We first introduce the following regularity conditions.
(A) In model (1), the rows X_{i·} are i.i.d. p-dimensional sub-Gaussian random vectors with mean μ = E X_{i·} and second order moment matrix Σ = E(X_{i·} X_{i·}^⊤), where c_0 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ C_0 for positive constants C_0 > c_0 > 0. The errors ε_1, ..., ε_n are i.i.d. centered Gaussian random variables with variance σ² and are assumed to be independent of X.
Under the sub-Gaussian condition on the covariates X, Condition (A) implies the restricted eigenvalue condition introduced in [3]; see [26,36] for the exact statement. The Gaussianity of the error is only imposed to simplify the technical analysis, and this assumption can be weakened to sub-Gaussianity using a more refined analysis.
Beyond the model assumption, we also introduce the following conditions on the initial estimators, and give an example right afterwards to demonstrate the existence of initial estimators β̂ and σ̂ satisfying the following conditions (B1) and (B2).
Most of the high-dimensional estimators proposed in the literature have been shown to satisfy the above condition (B1), on estimating the regression vector and the variance of the regression error, under various conditions. See [2,3,29] and the references therein for more details. The condition (B2) is imposed for the technical analysis. With such an independence assumption, the asymptotic normality of the proposed estimators is easier to establish. However, this condition is believed to be only technical and not necessary for the proposed method.
As shown in the simulation study, the proposed method still works well numerically even when the independence assumption imposed in condition (B2) is not satisfied. On the other hand, there exist scenarios where we can construct estimators satisfying condition (B2). If we have historical data, then we can estimate the regression vector using the historical data and conduct the correction proposed in (7) and (10) using the current data. Even in the case where we only have access to a single data set (X, y), we can use sample splitting to create the independence. We randomly split the data into two subsamples (X^(1), y^(1)) with sample size n_1 = ⌈n/2⌉ and (X^(2), y^(2)) with sample size n_2 = n − n_1, estimate β̂ based on the data (X^(1), y^(1)), and conduct the correction in (7) using (X^(2), y^(2)). A similar argument can be used to reconstruct the estimator Q̂_A defined in (10). As a result, the estimator using sample splitting is less efficient, due to the fact that only half of the data is used in constructing the initial estimator and also in correcting the bias. Multiple sample splitting and aggregation [24], single sample splitting and cross-fitting [10], or data swapping [15] are commonly used sample splitting techniques.
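A hedged sketch of the random 50/50 split just described (the function name is our own):

```python
import numpy as np

def split_sample(X, y, rng):
    """Random 50/50 split to create the independence required by (B2)."""
    n = X.shape[0]
    idx = rng.permutation(n)
    half = (n + 1) // 2                 # n1 = ceil(n/2)
    i1, i2 = idx[:half], idx[half:]     # disjoint index sets
    return (X[i1], y[i1]), (X[i2], y[i2])
```

The initial estimator is then fitted on the first half only, so it is independent of the second half used for the bias correction.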
The following theorem characterizes the behavior of the proposed estimator Q Σ .The estimation error is decomposed into two components, where a stochastic error M Σ has an asymptotic normal limit and the remaining error B Σ is controlled.
Theorem 1 Suppose that condition (A) holds and the initial estimator β̂ satisfies conditions (B1) and (B2). Then the proposed estimator Q̂_Σ satisfies Q̂_Σ − Q_Σ = M_Σ + B_Σ, where the main error component M_Σ and the remaining error component B_Σ satisfy (14) and (15), respectively. As a consequence, under the additional condition k ≪ √n/log p, we have (16) for some positive constant τ > 0.
The above theorem establishes that the main error component M_Σ of the error decomposition has an asymptotic normal limit, and that the remaining part B_Σ is well controlled in terms of the convergence rate in (15). Such a decomposition is useful from the inference perspective, where (16) establishes that, if the sparsity level is small enough, the quantiles of the standardized difference can be used to quantify the uncertainty of M_Σ + B_Σ. Since there is no distributional result for B_Σ, an upper bound for B_Σ would be one (conservative) alternative to quantify the uncertainty of B_Σ.
The enlargement of the uncertainty quantification from V_Σ^0 to V_Σ^0 + τ/n is closely related to "super-efficiency". Consider the case of bounded spectral norm ‖Σ‖_2. The variance level V_Σ^0 in (14) vanishes near the null hypothesis Q_Σ = 0, so the convergence rate there is much faster than the parametric rate, which corresponds to the "super-efficiency" phenomenon. In this case, the worst-case upper bound for B_Σ is no longer negligible relative to the standard deviation. To overcome the challenges posed by "super-efficiency", we enlarge the variance a bit, by the level τ/n, such that it always dominates the worst-case upper bound for B_Σ.
Additionally, the statistical accuracy of the test statistic depends only weakly on the group size |G|, in the sense that the standard deviation of the test statistic depends on ‖β_G‖_2. If β is a sparse vector, then the standard deviation level is not always strictly increasing with a growing set G, and this phenomenon explains the statistical efficiency of the proposed test, especially when the test size |G| is large. In contrast, the F-test or χ² test proposed in [25,32] has a standard deviation at the scale of |G|/n, which is strictly increasing with the testing size |G|; this also explains the condition |G| ≪ n on the group size.
An important feature of the proposed method is that the testing procedure imposes no conditions on the test size |G|. It works for a small group, but also for a group size as large as |G| = p.
In parallel to Theorem 1, the following theorem characterizes the error decomposition of the estimator Q A .
Theorem 2 Suppose that condition (A) holds and the initial estimator β̂ satisfies conditions (B1) and (B2). Then the proposed estimator Q̂_A satisfies Q̂_A − Q_A = M_A + B_A, where the main error component M_A and the remaining error component B_A satisfy bounds parallel to those in Theorem 1. As a consequence, under the additional condition k ≪ √n/log p, the corresponding asymptotic normality statement holds as n, p → ∞, for some positive constant τ > 0.
Theorem 2 is established in a parallel way to Theorem 1, with the main difference being that the variance level of M_A differs from that of M_Σ. Specifically, the variance component of M_Σ consists of two parts, reflecting the uncertainty of estimating both β and Σ, while the variance component of M_A only reflects the uncertainty of estimating β.

Three of the most interesting examples of the weight matrix A are A = Σ_{G,G}, A = I_{|G|×|G|} and A = diag(Σ_{G,G}), where diag(Σ_{G,G}) denotes the diagonal matrix of Σ_{G,G}. For the case A = I_{|G|×|G|}, we treat all regression coefficients equally and focus on the magnitude of the regression coefficients, instead of the magnitude of the covariate variances and the correlation between covariates. For the case A = diag(Σ_{G,G}), the constructed significance test takes the variance level of each covariate into consideration, but does not take the correlation between the covariates inside G into account. For the case A = Σ_{G,G}, we take into consideration both the covariate variance levels and the correlation between the covariates inside G. However, the constructed test, which assumes A = Σ_{G,G} to be known a priori, is different from the test statistic constructed in (7), where the construction does not make use of any prior knowledge of Σ_{G,G}. The difference between (17) and (14) illustrates the additional uncertainty we need to account for due to the unknown weight matrix Σ_{G,G}.
In the following, we control the type I error of the proposed testing procedures and establish their asymptotic power. We consider a parameter space for θ = (β, Σ, σ), where M_1 ≥ 1 and M_2 > 0 are positive constants, and we define the null hypothesis parameter space, for a fixed group G, as the subset of parameters satisfying the null hypothesis.
Corollary 1 Suppose that condition (A) holds and the initial estimators (β̂, σ̂) satisfy conditions (B1) and (B2). If k ≪ √n/log p, then, for any given positive constant τ > 0, both proposed tests φ_Σ(τ) and φ_A(τ) defined in (12) control the type I error. To study the power, we define the local alternative hypothesis parameter space; to facilitate the presentation, this local alternative parameter space is defined so that it depends on the positive definite matrix A. The following corollaries present the power of the proposed testing procedures in (12).
As a remark, for the local alternative space defined above, the separation parameter of the indifference region determines the power, and the proposed test φ_Σ has power converging to 1 as long as t → ∞. Similarly, we control the asymptotic power of the proposed test φ_A in the following corollary.
Corollary 3 Suppose that condition (A) holds and the initial estimators (β̂, σ̂) satisfy conditions (B1) and (B2). Then the asymptotic power of φ_A can be characterized, where V_A is defined in (19) and Φ(·) denotes the cumulative distribution function of the standard normal distribution.
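Putting the pieces together, the sketch below shows schematically how a point estimate, its (enlarged) variance estimate, and the normal quantile combine into the one-sided test and confidence bound of Section 2.3. The function name is our own, and we assume, as a simplification, that `v_hat` already estimates the variance of `q_hat` including the τ/n enlargement:

```python
from math import sqrt

Z_95 = 1.6449  # z_{1-alpha} for alpha = 0.05 (one-sided)

def group_test_and_ci(q_hat, v_hat, z=Z_95):
    """Sketch of the alpha-level group significance test and one-sided CI.

    Assumes v_hat already estimates the variance of q_hat, including the
    tau/n enlargement discussed in the text (a simplifying assumption).
    """
    se = sqrt(v_hat)
    reject = q_hat / se > z              # phi = 1{ Q_hat / sqrt(V_hat) > z }
    ci_lower = max(q_hat - z * se, 0.0)  # Q >= 0 since A is positive definite
    return reject, ci_lower
```

Rejection of H_0 thus corresponds exactly to the one-sided confidence bound for Q excluding zero.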

Statistical Applications
In the following sections, we motivate three statistical applications of the group significance test. In Section 4.1, we explain how to apply hierarchical testing and its advantage of immensely reducing the number of possible group tests that could be performed. Section 4.2 covers interaction tests for detecting heterogeneous effects, and finally Section 4.3 motivates the concept of local heritability in genetics, measuring explained partial variance.

Hierarchical testing
In the presence of many variables, some of which are highly correlated with each other, it is often too ambitious to detect significant single variables having regression coefficients different from zero. The effect of a single variable, after having adjusted for all others, is often not sufficiently strong. This happens in particular in the presence of high correlation or near collinearity among the variables. On the other hand, a group of variables is easier to detect as significant, especially and again when the variables are highly correlated. The core issue then becomes which groups of variables to consider for testing.
Hierarchical testing is a powerful method to go through a sequence of groups to be tested, from larger groups to smaller ones, depending on the strength of the signal and the amount of correlation among the variables within and between the groups. As such, it is a multiple testing scheme which controls the familywise error rate. The details are as follows.
The p covariates are structured into groups of variables in a hierarchical tree T such that, at every level of the tree, the groups build a partition of {1, . . ., p}. The groups at each level of the tree are such that the variables have high correlation within groups (and a tendency towards low correlation between groups). The default choice for constructing such a tree is hierarchical clustering of the variables [cf. 16], typically using 1 − (correlation)² as the dissimilarity measure and complete linkage.
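A small sketch of this default tree construction, using the 1 − correlation² dissimilarity with complete linkage; the two-block design with shared latent factors is our own toy example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)
n = 100
# two blocks of three covariates each, driven by two latent factors
Z = rng.normal(size=(n, 2))
X = np.hstack([Z[:, [0]] + 0.1 * rng.normal(size=(n, 3)),
               Z[:, [1]] + 0.1 * rng.normal(size=(n, 3))])

corr = np.corrcoef(X, rowvar=False)
dissim = 1.0 - corr ** 2                 # 1 - correlation^2 dissimilarity
np.fill_diagonal(dissim, 0.0)
tree = linkage(squareform(dissim, checks=False), method='complete')

# cutting the tree at two clusters recovers the two correlated blocks
labels = cut_tree(tree, n_clusters=2).ravel()
```

Each merge in `tree` corresponds to a node, i.e., a candidate group for the hierarchical testing procedure below.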
We assume that T is deterministic, for example when conditioning on the covariates in the linear model and taking T as the output of hierarchical clustering. Hierarchical testing with respect to T is then a sequential multiple testing adjustment procedure, as follows.
Hierarchical testing procedure
1: INPUT: Hierarchical tree T with nodes corresponding to groups of variables; group testing procedure returning a p-value P_G for each group of variables G, e.g., as described in Section 2.3; significance level α.
2: OUTPUT: Significant groups of variables such that the procedure controls the familywise error rate (FWER).
3: repeat
4: Go top-down through the tree T and perform group significance testing for the groups G. The raw p-value is corrected for multiplicity using
P_{G;adjusted} = min(P_G · p/|G|, 1),
P_{G;adjusted} ← max_{D ∈ T: D ⊇ G} P_{D;adjusted},
where G is any group in the tree T. The second line enforces monotonicity of the adjusted p-values along the tree.
5: For each group G when going top-down in T: if P_{G;adjusted} ≤ α, continue to consider the children of G for group testing.
6: until all the children of each significant group G (i.e., the finer partition of G) encountered when going top-down in T are non-significant at level α.
A schematic illustration with a binary hierarchical tree is shown in Figure 1. The color encodes whether the null hypothesis of a group could be rejected or not.
There are a few interesting properties of hierarchical testing. First, we can think of it as a hybrid of a sequential procedure and a Bonferroni correction: within every level of the tree, the p-value adjustment is a weighted Bonferroni correction (the standard Bonferroni correction if the groups have equal size), and across different levels it is a sequential procedure with no correction but a stopping criterion to not go further down the tree when no rejection happens. Indeed, the root node needs no adjustment at all, and for each level of the tree, the correction depends only on the partitioning at that level and not on how many tests have been performed before. Second, there is no need to pre-define the level of resolution of the groups, i.e., to decide where the tree should be cut. How far one goes down the tree with testing significance of groups of variables is fully data-driven, based on the hierarchical testing procedure. Third, the hierarchical testing method is computationally attractive as no further tests are considered once a certain group does not exhibit any significance. It is shown in [22] that the procedure controls the familywise error rate (FWER). The hierarchical testing method has been used in the setting of high-dimensional linear models in [20], with a further refinement in [21] based on a multi sample-splitting testing method for the groups. The latter is justified under the strong and questionable assumption that the Lasso is able to detect all the relevant variables, and in this sense, this procedure is not fully reliable in terms of error control. Further details are provided in [27].
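The top-down scheme with the weighted Bonferroni adjustment P_G · p/|G| and the monotonicity step can be sketched in a few lines. The tree encoding, the raw_pvalue callback, and the toy p-values below are our own illustrative assumptions; any group test, e.g., the one from Section 2.3, can be plugged in:

```python
def hierarchical_test(node, p_total, raw_pvalue, alpha=0.05, parent_adj=0.0):
    """Top-down hierarchical testing; returns the significant groups."""
    group = node["group"]
    # Weighted Bonferroni adjustment: multiply the raw p-value by p / |G|.
    adj = min(1.0, raw_pvalue(group) * p_total / len(group))
    # Monotonicity: never below the adjusted p-value of an ancestor.
    adj = max(adj, parent_adj)
    if adj > alpha:
        return []                     # stop: do not descend below this group
    found = [(tuple(group), adj)]
    for child in node.get("children", []):
        found += hierarchical_test(child, p_total, raw_pvalue, alpha, adj)
    return found

# Toy example: 4 variables, a binary tree, and made-up raw p-values.
tree = {"group": [0, 1, 2, 3], "children": [
    {"group": [0, 1], "children": [{"group": [0]}, {"group": [1]}]},
    {"group": [2, 3], "children": [{"group": [2]}, {"group": [3]}]},
]}
pvals = {(0, 1, 2, 3): 0.001, (0, 1): 0.004, (2, 3): 0.5,
         (0,): 0.002, (1,): 0.2, (2,): 0.9, (3,): 0.9}
significant = hierarchical_test(tree, 4, lambda g: pvals[tuple(g)])
```

In this toy run the procedure rejects the root, the pair {0, 1} and the single variable {0}, and stops at the non-significant group {2, 3} without ever testing its children, which is the computational advantage noted above.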

Testing interaction and detection of effect heterogeneity
The proposed significance test is useful for testing the existence of interactions, which is itself an important statistical problem. We focus on the interaction model (3) and re-express the model as y_i = W_i•^⊤ η + ε_i, where we define W_i• = (D_i X_i•^⊤, X_i•^⊤)^⊤ and η = (γ^⊤, β^⊤)^⊤. The detection of interaction terms then amounts to testing whether γ = 0, and we can apply the group significance test developed here.
The detection of heterogeneous treatment effects can be viewed as a special case of testing the interaction term. Suppose D_i in the interaction model (3) is a binary variable, where D_i = 0 denotes that the subject belongs to the control group and D_i = 1 that it belongs to the treatment group. Then this specific test for interaction amounts to testing whether the treatment effect is heterogeneous. In a very similar way, if D_i takes the two values D_i = 1 and D_i = 2, representing that the subject receives treatment 1 or 2, respectively, then the test of interaction tests whether the difference between the two treatment effects is heterogeneous. The developed method for detecting heterogeneous treatment effects is not restricted to the case of a binary treatment: it can be applied to essentially any type of treatment variable, such as count, categorical or continuous variables.
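Reducing the interaction test to a group test is mechanical: stack the interaction columns D_i X_i• next to the main effects and test the group formed by the first p coefficients. A minimal numpy sketch, in which the simulated D and X are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))       # covariates X_i
D = rng.integers(0, 2, size=n)        # binary treatment indicator D_i

# Design for y_i = W_i' eta + eps_i with W_i = (D_i * X_i, X_i).
W = np.hstack([D[:, None] * X, X])    # n x (2p) matrix
G = np.arange(p)                      # group of the interaction coefficients gamma
```

Testing the group hypothesis η_G = 0 for this group G is then exactly the test of no interaction, γ = 0.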

Local heritability
Local heritability is defined as a measure of the partial variance explained by a given set of genetic variants. In contrast to (global) heritability, local heritability is more informative as it describes the variability explained by a pre-specified set of genetic variants, and it contains global heritability as a special case. Assuming the regression model (1), the local heritability can be represented by weighted quadratic functionals of the sub-vector β_G, where G denotes the index set of interest. The following corollary establishes the coverage and precision properties of the proposed confidence intervals for two measures of local heritability.
Corollary 4 Suppose that condition (A) holds and the initial estimators (β̂, σ̂) satisfy conditions (B1) and (B2). Then the constructed confidence intervals defined in (13) achieve the stated coverage and precision for any given positive constant τ > 0.
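For concreteness, one natural normalized form of such a quantity (this display is our reading of the construction under model (1), not the paper's exact definition) is the fraction of the response variance explained by the variables in G:

```latex
h^2_G \;=\; \frac{\beta_G^{\top}\,\Sigma_{G,G}\,\beta_G}{\operatorname{Var}(y_i)}
      \;=\; \frac{\beta_G^{\top}\,\Sigma_{G,G}\,\beta_G}{\beta^{\top}\Sigma\,\beta + \sigma^2}.
```

Taking G = {1, . . ., p} recovers the global heritability, consistent with the statement that the local version contains the global one as a special case.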

Simulation Studies
In this section, we investigate the finite sample performance of the proposed method in three different simulation scenarios. Throughout the simulation study, we generate data from the high-dimensional linear model (1) with dimension p = 500. We generate the covariates following X_i• ∼ N(0, Σ) and the errors ε_i ∼ N(0, 1), both being i.i.d. over the indices i.

Dense alternatives
In this section, we consider the setting where the regression vector is relatively dense but with small non-zero coefficients, as this is a challenging scenario for detecting the signals. We generate the regression vector β as β_j = δ for 25 ≤ j ≤ 50 and β_j = 0 otherwise, and generate the covariance matrix as Σ_ij = 0.6^|i−j| for 1 ≤ i, j ≤ 500. We consider the group significance test H_{0,G}: β_G = 0. We vary the signal strength parameter δ over {0, 0.02, 0.04, 0.06} and the sample size n over {250, 350, 500, 800}.
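The data-generating mechanism of this scenario can be reproduced directly. The sketch below (the sample size n = 350 and the seed are arbitrary choices of ours) draws X via a Cholesky factor of the Toeplitz covariance:

```python
import numpy as np

p, delta, n = 500, 0.04, 350

# Toeplitz covariance Sigma_ij = 0.6^|i-j|.
idx = np.arange(p)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])

# beta_j = delta for 25 <= j <= 50 (1-based indexing), 0 otherwise.
beta = np.zeros(p)
beta[24:50] = delta

rng = np.random.default_rng(1)
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T    # rows distributed as N(0, Sigma)
y = X @ beta + rng.standard_normal(n)    # N(0, 1) errors
```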
Table 1 summarizes the hypothesis testing results for the different methods: we report the empirical rejection rate (ERR), the proportion of rejected hypotheses among the 500 simulation runs, for the six tests φ_I(0), φ_I(1), φ_Σ(0), φ_Σ(1), φ_FD and φ_hdi. For δ = 0, the ERR is an empirical measure of the type I error; for δ ≠ 0, it is an empirical measure of the power. We observe that, for τ = 0, the testing procedures φ_I(0) and φ_Σ(0) do not control the type I error. The reason is that the bias component dominates the variance, and τ = 0 only quantifies the uncertainty of the variance component but does not account for that of the bias. In contrast, as soon as we set τ = 1, thereby providing a conservative upper bound for the bias component, the proposed procedures φ_I(1) and φ_Σ(1) control the type I error. In comparison, the maximum test φ_hdi controls the type I error while the other maximum test φ_FD does not reliably control it. To compare the power, we focus on φ_I(1), φ_Σ(1), φ_hdi and φ_FD and observe that φ_Σ(1) is the best in terms of power among all four tests and φ_I(1) is the second best in most cases, especially for δ = 0.04, 0.06. The power of both φ_hdi and φ_FD is much lower than that of our two proposed testing procedures, φ_Σ(1) and φ_I(1). An interesting observation is that, although the proposed testing procedures φ_Σ(1) and φ_I(1) control the type I error conservatively, they still achieve a higher power than the existing maximum tests.
In Table 2, we report the coverage properties of the confidence intervals CI_I(τ = 0) and CI_I(τ = 1) for ‖β_G‖₂² and of CI_Σ(τ = 0) and CI_Σ(τ = 1) for β_G^⊤ Σ_{G,G} β_G. We observe that for τ = 0, the constructed confidence intervals do not achieve 95% coverage, while for τ = 1, they do. This happens because for small δ, the bias component of the proposed estimator cannot simply be ignored, and we use a conservative upper bound to control the bias component. In the supplementary materials, we report in Table 8 the absolute value of the bias of the plug-in estimator and of the proposed estimator.

Highly correlated covariates
Here, we consider the setting where the regression vector is relatively sparse but a few variables are highly correlated. We generate the regression vector β as β_1 = β_3 = δ and β_j = 0 for j ≠ 1, 3, and we generate the covariance matrix such that there are high correlations among the first five variables, with a pairwise correlation of 0.8 inside this group of five variables. In contrast to the previous simulation setting in Section 5.1, we do not generate a large number of non-zero entries of the regression coefficient but only allow the first and third coefficients to be non-zero. We test the group hypothesis generated by the first five regression coefficients, H_{0,G}: β_G = 0 with G = {1, . . ., 5}. We vary the signal strength parameter δ over {0, 0.1, 0.2, 0.3, 0.4, 0.5} and the sample size n over {250, 350, 500}.
Table 3 compares the empirical detection rates of our proposed method and the maximum tests. We observe that, for τ = 0, the proposed testing procedures φ_I(0) and φ_Σ(0) do not control the type I error, while for τ = 1, φ_I(1) and φ_Σ(1) keep the type I error below the significance level. The maximum test procedure φ_hdi controls the type I error while φ_FD barely controls it. Regarding power, φ_hdi and φ_FD are better for δ = 0.2, 0.3, while our proposed testing procedure φ_Σ(1) is comparable to φ_hdi and φ_FD once δ reaches 0.4. Seemingly, our proposed procedures φ_Σ(1) and φ_Σ(0) do not perform better than the maximum tests φ_hdi and φ_FD. We emphasize, however, that the coverage properties related to the maximum tests φ_hdi and φ_FD are not guaranteed, although this is not visible in Table 3. Specifically, since we are testing β_i = 0 for i = 1, 2, 3, 4, 5, we can examine the coverage properties of these two tests for the individual β_i. As reported in Table 5, for δ ≠ 0, the coordinate-wise coverage properties are not guaranteed, due to the high correlation among the first five variables. The reason for this phenomenon is that coverage for an individual coordinate β_j requires a decoupling of the j-th variable from all other variables, and in the presence of high correlations, this decoupling step is difficult to carry out accurately. In contrast, even though the first five variables are highly correlated, the constructed confidence intervals CI_I(τ = 1) and CI_Σ(τ = 1) achieve the 95% coverage, as reported in Table 4. As explained, our proposed testing procedure is more robust to high correlations inside the tested group because the whole group is tested as a unit instead of decoupling the variables inside the group. In the supplementary materials, we report in Table 9 the absolute value of the bias of the plug-in estimator and of the proposed estimator.

Hierarchical testing
We simulate data under two settings which differ in the set of active covariates and in the covariance matrix Σ used to generate X_i• ∼ N(0, Σ). The covariance matrix Σ is block diagonal in both cases. The set of indices of the active covariates is denoted by S_0. The |S_0| = 10 active covariates are chosen as the first variable in each of the first 10 blocks. The number of covariates p is kept fixed at p = 500 and we vary the number of observations n over 100, 200, 300, 500, and 800. The results are calculated based on 500 simulation runs.
In setting 1, the first 20 covariates are generated to have high correlation within small blocks of size 2. This means that the covariance matrix Σ has 1's on the diagonal, Σ_{i,i+1} = Σ_{i+1,i} = 0.7 on the first off-diagonals for i = 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, and 0's otherwise. The set of active covariates is S_0 = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}. In setting 2, there are ten blocks of 50 covariates each with a high pairwise correlation of 0.7. This means that the covariance matrix Σ has 1's on the diagonal, Σ_{i,j} = 0.7 for i ≠ j with i, j ∈ {k, k + 1, . . ., k + p/10 − 1} for k = 1, 51, 101, . . ., 451, and 0's otherwise. The set of active covariates is S_0 = {1, 51, 101, 151, . . ., 451}. We use a modified version of the power to measure the performance of the hierarchical procedure because groups of variable size are returned. The idea is that we weight the true findings by one over the group size, because smaller significant groups are more informative than larger ones; i.e., single variables get a weight of one. The adaptive power is defined accordingly in terms of the minimal true detections (MTD): a group counts as an MTD if there is no significant subgroup below it ("Minimal"), the group itself is significant ("Detection"), and the group contains at least one active variable ("True"); see [20].
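The covariance of setting 2 is a simple equicorrelated block-diagonal matrix. A sketch with 0-based indices (so the active set {1, 51, . . ., 451} of the text becomes {0, 50, . . ., 450}):

```python
import numpy as np

p, n_blocks, rho = 500, 10, 0.7
block = p // n_blocks                  # 50 covariates per block

# Setting 2: pairwise correlation rho inside each of the ten blocks.
Sigma = np.zeros((p, p))
for k in range(0, p, block):
    Sigma[k:k + block, k:k + block] = rho
np.fill_diagonal(Sigma, 1.0)

# The first variable of each block is active (0-based indices).
S0 = np.arange(0, p, block)
```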
The results are illustrated in Figure 2 and Table 6. The hierarchical procedure performs very well in setting 1, since the adaptive power is close to 1 for all values of n. The setup in setting 2 is much harder because the 10 active covariates are each highly correlated with 49 other non-active covariates. It is difficult for the procedure to distinguish the active variables from the correlated ones, and hence it stops further up in the tree, which results in larger significant groups compared to setting 1. This can be seen in the last three columns of Table 6. Thus, the adaptive power is smaller. The FWER is well controlled in both settings.

Real Data Analysis
In the following, we present the results of the hierarchical procedure for two real data sets. In Section 6.1, we compare the hierarchical procedure to testing single variables on a gene expression data set. In Section 6.2, the hierarchical procedure is analyzed for 46 traits (different responses) as linear functions of single nucleotide polymorphism (SNP) binary covariates.

Riboflavin data set
We demonstrate the hierarchical procedure, and compare it to testing single variables, on a data set about riboflavin production with Bacillus subtilis, made publicly available by [5]. It consists of n = 71 samples of strains of Bacillus subtilis for which the riboflavin (vitamin B2) production rate was measured. Engineered strains and strains grown under different fermentation conditions were hybridized multiple times during a fed-batch fermentation process. The log-expression levels of p = 4088 genes are tested for association with the response.
The hierarchical procedure goes top-down through a hierarchical cluster tree which was estimated using 1 − (empirical correlation)² as dissimilarity measure and average linkage. The results of the hierarchical procedure are displayed in Table 7. We compare them to testing all the single covariates using the group test, i.e., all groups of size one, correcting for multiplicity with Bonferroni-Holm. Testing of all single covariates reveals seven significant covariates. Ideally, with the hierarchical procedure, we would find the same seven single variables plus some additional groups: indeed, this happens here. The seven single variables are highlighted in green in Table 7. The average pairwise correlation between distinct covariates within the significant group of size three from the hierarchical procedure is 0.98. Because of such high correlations, this suggests that it is not possible to refine this group further into significant subgroups.

Yeast colony growth data set
Bloom et al. [4] performed a genome-wide association study of 46 quantitative traits to investigate the sources of missing heritability. The authors crossbred 1'008 yeast Saccharomyces cerevisiae segregants from a laboratory strain and a wine strain and measured 11'623 genotype markers, which they reduced to 4'410 markers that show less correlation. Bloom et al. [4] processed the data such that the covariates encode from which of the two strains a given genotype was passed on; this is encoded using the values 1 and −1. Each crossbred segregant was exposed to 46 different conditions, such as different temperatures, pH values, carbon sources, additional metal ions, and small molecules. The traits of interest are the colony growth measurements under each of these 46 conditions, each serving as a separate response.

Conclusions
The current paper studies the group inference problem in high-dimensional linear models by constructing an approximately unbiased estimator of weighted quadratic functionals of the regression sub-vector corresponding to the group and then quantifying its uncertainty. The proposed group inference procedure has been shown to enjoy nice theoretical and empirical properties: valid inference even when the variables in the group are highly correlated, good power when there are many small coefficients in the group, computational efficiency even for a large group, and no assumptions on the group size. The proposed method is applied to hierarchical testing, the detection of heterogeneous treatment effects, tests of interaction, and inference for local heritability.

Figure 1 :
Figure 1: The hierarchical procedure returns the three groups G_1, G_2, and G_3. The green and red colors highlight significant and non-significant groups, respectively. The group G_1 is a leaf, which means that it consists of one variable.

Figure 2 :
Figure 2: Adaptive power and FWER for the simulation study of the hierarchical procedure (5% significance level). In setting 1, each of the 10 active covariates is highly correlated with only one other covariate. In setting 2, each of the 10 active covariates is highly correlated with 49 other covariates.

Figure 3 :
Figure 3: Size of the significant groups (p-values ≤ 0.05) when applying the hierarchical procedure to each of the 46 traits of the yeast colony growth data set. The number of significant groups is displayed on top. We added a small amount of jitter to all the groups of size one for traits 20 (two groups) and 44 (three groups) in order to prevent over-plotting.

Table 1 :
Empirical Rejection Rate (ERR) for the Dense Alternative scenario (5% significance level)

Table 2 :
Empirical Coverage for the Dense Alternative scenario (95% nominal coverage).

Table 5 :
Empirical Coverage for the first five regression coefficients for the Highly Correlated scenario (95% nominal coverage). The numbers under CI FD represent the empirical coverage for β_j (1 ≤ j ≤ 5) using the method proposed in [18], and the numbers under CI hdi the empirical coverage using the method proposed in [31].

Table 7 :
Results from applying the hierarchical procedure to the Riboflavin data set (5% significance level).The seven covariates which are found as well by testing for single variables (groups of size one only) are highlighted in green.