False discovery rate control via debiased lasso

We consider the problem of variable selection in high-dimensional statistical models, where the goal is to report a set of variables, out of many predictors X_1, ..., X_p, that are relevant to a response of interest. For the high-dimensional linear model, where the number of parameters exceeds the number of samples (p > n), we propose a procedure for variable selection and prove that it controls the directional false discovery rate (FDR) below a pre-assigned significance level q ∈ [0, 1]. We further analyze the statistical power of our framework and show that, for designs with subgaussian rows and a common precision matrix Ω ∈ R^{p×p}, the procedure achieves asymptotic power one provided the minimum nonzero parameter θ_min satisfies a suitable lower bound.


Introduction
Living in the era of data deluge, modern datasets are often very fine-grained, including information on a large number of potential explanatory variables. For a given response of interest, we know a priori that a large portion of these variables are irrelevant, and we would like to select a set of predictors that influence the response. For example, in genome-wide association studies (GWAS), we collect single nucleotide polymorphism (SNP) information across a large number of loci and then aim at finding loci that are related to the trait, while being resilient to false associations.
The focus of this paper is on high-dimensional regression models, where the number of parameters exceeds the sample size. Since such models are overparameterized, they are prone to overfitting. In addition, high dimensionality brings noise accumulation and spurious correlations between the response and unrelated features, which may lead to wrong statistical inference and false predictions. Model selection is therefore a crucial task in analyzing high-dimensional models. For a successful model selection, we need to ensure that most of the selected predictors are indeed relevant. This not only leads to noise reduction and enhanced predictions but also offers reproducibility.
To be concrete in using the term "reproducibility", we characterize it for statistical inference problems via the false discovery rate (FDR) criterion, which is the expected fraction of discoveries that are false positives. The notion of FDR was proposed in the groundbreaking work [BH95] and is nowadays the criterion of choice for statistical inference in large-scale hypothesis testing problems. In that work, Benjamini and Hochberg developed a procedure to control FDR under a pre-assigned significance level. It has been shown theoretically that the BH procedure controls FDR in some special cases, such as independence or positive dependence of the tests [BH95, BY01]. Since it was initially proposed, there have been various modifications of BH [BY01, SRC+15, FHG12, Wu08, XCML11] and applications in different domains [RYB03, GLN02].
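As a point of reference, the BH step-up procedure mentioned above can be sketched in a few lines. This is a minimal numpy sketch for illustration; the function name and interface are ours, not from any particular library.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up procedure.

    Returns the indices of rejected hypotheses at target FDR level q:
    find the largest k with p_(k) <= q * k / m and reject the k smallest.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Compare each sorted p-value against its step-up threshold q * k / m.
    below = sorted_p <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])  # largest index passing the threshold
    return np.sort(order[: k + 1])
```

Note that the procedure rejects all hypotheses up to the largest passing index, even if some intermediate sorted p-values exceed their thresholds; this is the "step-up" aspect.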
Importantly, the BH procedure (and its modifications) assumes that p-values are given as input for all the hypothesis tests. The p-values are often calculated using classical formulas obtained from large-sample theory, which are theoretically justified only in the classical setting of fixed dimension p and diverging sample size n [VdV00]. For example, [LS+14] considers the setting where m i.i.d. random samples of (X_1, ..., X_p) are given and p-values are estimated from the Student's t-test statistic. The authors propose a bootstrap calibration method to use with the BH procedure and show that, under weak dependence among observations, it can control the false discovery rate when the total number of observations n = mp is larger than p(log p)^c. However, for high-dimensional models, obtaining valid p-values is highly nontrivial. This is in part due to the fact that fitting high-dimensional models often requires nonlinear and non-explicit estimation procedures, and characterizing the distribution of such estimators is extremely challenging. In the past couple of years, there has been a surge of interest in constructing frequentist p-values and confidence intervals for high-dimensional models. A common approach is the fundamental idea of debiasing, proposed in a series of works [JM14b, ZZ14, JM14a, VdGBRD14, JM13b, BCH14]. In this approach, starting from a regularized estimator, one first constructs a debiased estimator and then makes inference based on the asymptotic normality of low-dimensional functionals of the debiased estimator. This approach also provides asymptotically valid p-values for null hypotheses of the form H_{0,i}: θ_{0,i} = 0, where θ_{0,i} is a fixed single model parameter. However, these p-values are correlated, and the BH procedure is not guaranteed to control FDR in this case. The modification of BH for general dependency, which scales the significance level by a 1/(log p) factor [BY01], also turns out to be overly conservative and leads to low
power. In [BCC+18], the authors review methods for constructing p-values in the high-dimensional setting, their behavior and limitations, and describe a general set of assumptions under which these p-values can be used for inference tasks, such as constructing confidence intervals and controlling FDR. In particular, [BCC+18] extends the result of [LS+14] to the so-called "Many Approximate Means (MAM)" framework and provides a set of conditions on the dependence among p-values under which the Benjamini-Hochberg procedure has the FDR control property.
In this paper, we build upon the debiasing approach and propose a procedure for model selection in the high-dimensional regime, which is guaranteed to control FDR at a pre-assigned level q ∈ [0, 1]. We call our procedure FCD (for "FDR Control via Debiasing") and prove that it controls an even stronger criterion, namely the directional FDR. We further analyze its statistical power (without imposing any assumption on the amplitude of the regression parameters or the noise level).
Controlling FDR in regression models has been a long-standing problem. It was just a few years ago that [BC15] proposed the ingenious idea of the knockoff filter. In a nutshell, this approach constructs a set of "knockoff" variables that are irrelevant to the response (conditional on the original covariates) but whose structure mirrors that of the original covariates. The knockoff variables then behave as controls for the original covariates. This way, the procedure bypasses the need to construct p-values and directly selects a model with the desired FDR. The focus of [BC15] was on the linear regression model with n > 2p. Later, [CFJL18] extended the idea of the knockoff filter to high-dimensional nonlinear models with random designs, but assumed that the joint distribution of the covariates is known. Very recently, [FDLL17, BCS18] studied the robustness of the model-X knockoff to errors in estimating the joint distribution of the covariates. A salient feature of the knockoff approach is that for n ≥ 2p, it controls FDR in the finite-sample setting without requiring any assumption on the covariates. However, the model-X knockoff extension [CFJL18] requires knowledge of the joint distribution of the covariates. Moreover, the knockoff approach does not provide valid p-values for hypotheses regarding the model parameters. By contrast, the FCD method that we present in this paper controls FDR as long as s_0 = o(√n/(log p)^2), without requiring the joint distribution of the covariates. Furthermore, it comes with valid p-values for individual model parameters. However, the FDR control is proved in the asymptotic regime where n → ∞.

Problem formulation
Suppose we have recorded n i.i.d. observational units (y_1, x_1), ..., (y_n, x_n), with y_i ∈ R representing the response variable and x_i ∈ R^p the vector of explanatory variables for each sample, also referred to as features. We assume the classical linear regression model, where the observations obey the relation

y_i = ⟨x_i, θ_0⟩ + w_i,  i = 1, ..., n.

Here, θ_0 ∈ R^p is the unknown vector of coefficients, and ⟨·, ·⟩ denotes the standard inner product. Let y = (y_1, ..., y_n)^T and let X ∈ R^{n×p} denote the feature matrix with rows x_1^T, ..., x_n^T. Writing the linear regression model in matrix form, we obtain

y = Xθ_0 + w.

We assume that, conditional on the design X, the noise variables w_i are independent, with zero mean, variance σ^2, and E[|w_i|^a | X] ≤ C, for some constants C > 0, a > 2.
We let S ⊆ {1, ..., p} denote the set of truly relevant feature variables among the many that have been recorded. This set corresponds to the support of θ_0, i.e.,

S ≡ supp(θ_0) = {i ∈ [p] : θ_{0,i} ≠ 0}.

We let s_0 = |S| denote the size of the support, in other words the number of truly relevant variables.
In this paper, we propose a framework to select a set Ŝ of the feature variables while controlling the directional false discovery rate (FDR) for the selected variables. This criterion is intimately related to type S errors (S stands for sign). A type S error (a.k.a. a type III error) occurs when we say, with confidence, that a comparison goes one way when in fact it goes the other way [GT00]. For example, we claim that θ_1 > θ_2, with confidence, while in fact θ_1 < θ_2; in other words, we mistakenly make a claim about the sign (direction) of θ_1 − θ_2. Gelman et al. [GT00] argue that the type S error is a more relevant notion in many applications. Tukey conveys a similar message in [Tuk91], arguing that the effects of A and B, for any A and B, are always different (at some decimal precision), and hence, instead of asking whether there is any difference between two effects, the valid question is about the direction in which the effect of A differs from that of B.
Motivated by this, we formally define the directional FDR, denoted by FDR_dir. For a selected set Ŝ of features, along with estimates sign_j ∈ {−1, +1} of the sign of θ_{0,j}, we define

FDR_dir ≡ E[ |{j ∈ Ŝ : sign_j ≠ sign(θ_{0,j})}| / (|Ŝ| ∨ 1) ],

where we adopt the convention sign(0) = 0. In words, FDR_dir is the expected fraction of false discoveries among the selected variables, where a false discovery is measured with respect to both type S and type I errors. For example, if sign_j = +1 while θ_{0,j} = 0 (type I error) or θ_{0,j} < 0 (type S error), it is counted as a false discovery.
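The directional false discovery proportion behind this definition can be computed directly. The following is a small illustrative sketch; the function and variable names are ours.

```python
import numpy as np

def directional_fdp(theta0, selected, signs):
    """Fraction of selections that are false in the directional sense.

    A selection j is false if the declared sign differs from sign(theta0_j);
    with the convention sign(0) = 0, this covers both type I errors
    (theta0_j = 0) and type S errors (wrong direction).
    """
    if len(selected) == 0:
        return 0.0
    false = sum(np.sign(theta0[j]) != s for j, s in zip(selected, signs))
    return false / len(selected)
```

Taking the expectation of this quantity over the randomness of the data gives FDR_dir.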
Recall that the classical FDR is defined as

FDR ≡ E[ |{j ∈ Ŝ : θ_{0,j} = 0}| / (|Ŝ| ∨ 1) ],

which counts false discoveries only with respect to type I errors. A comparison of the two definitions therefore reveals that FDR ≤ FDR_dir for any selected set Ŝ. As a result, proving that a framework controls FDR_dir automatically implies that it also controls FDR. Likewise, we define the statistical power of a selected set Ŝ as

Power ≡ E[ |{j ∈ Ŝ : sign_j = sign(θ_{0,j}), θ_{0,j} ≠ 0}| / (s_0 ∨ 1) ],

i.e., for a true discovery, not only should the corresponding variable be in fact nonzero, but we should also declare its sign correctly.
The directional FDR has also been studied by [BC16], where it is shown that the knockoff filter controls this metric as well as the FDR.

Our contributions and outline of the paper
Here, we provide a vignette of our contributions:

Controlling directional FDR. In Section 2, we propose a method for selecting relevant variables using the debiasing approach; we call this method FCD (standing for "FDR Control via Debiasing"). In Section 3, we show that for design matrices with subgaussian rows, under the standard condition s_0 = o(√n/(log p)^2), the FCD framework achieves exact directional FDR control (see Theorem 3.1 for a formal statement).

Characterizing the statistical power. In Section 3.2, we characterize the statistical power of the FCD method. In particular, for designs with subgaussian rows and a common precision matrix Ω ∈ R^{p×p}, we show that if the minimum nonzero coefficient θ_min satisfies a suitable lower bound, then FCD achieves asymptotic power one.
Recently, [FDLL17] studied the power of the model-X knockoff filter, provided that θ_min √(n/log p) → ∞ and assuming a lower bound on the size of the model selected by the knockoff procedure, namely |Ŝ| ≥ c s_0, for some constant c ∈ (2(q s_0)^{-1}, 1). Under these assumptions, it is shown that the model-X knockoff approach achieves asymptotic power one. Other than being restrictive, these assumptions are hard to verify, and a given sufficient condition is that the size of {j : |θ_{0,j}| ≳ √(s_0 (log p)/n)} is at least c s_0, for some constant c ∈ (2(q s_0)^{-1}, 1). This condition on the amplitude of the nonzero coefficients is much stronger than the one we need for FCD to achieve power one.

Numerical validation. We validate our approach on both synthetic and real data in Sections 5 and 6 and compare its performance with the model-X knockoff. As the simulations suggest, the FCD method compares favorably to the model-free knockoff in a wide range of setups. We also compare the statistical power of FCD with the theoretical characterization and show that they are in good agreement.
Techniques. In our analysis of FDR, we use ideas from the debiasing approach [JM14b, ZZ14, JM14a, VdGBRD14, JM13b], together with results from [Liu13] regarding the order statistics of sums of Gaussian random variables (see Lemmas 6.1 and 6.2 therein). It is worth mentioning that [Liu13] developed these results for the analysis of a method proposed for Gaussian graphical models and its FDR. That context is very different from the problem studied in this paper, and, as expected, the test statistics are also very different. In our FCD approach, we construct the test statistics by debiasing the Lasso solution.
These test statistics have a Gaussian part and a bias term. In applying the results from [Liu13], we need a careful analysis of the bias term, as well as of the errors in the noise level estimation. In addition, through a careful analysis of the test statistics and the data-dependent threshold used in our procedure, we are able to analyze the statistical power of our approach.

Further related work
There exists a copious theoretical literature on high-dimensional regression and the Lasso. Most existing studies have focused on prediction error [GR04], model selection properties [MB06, ZY06, Wai09, CP09], and estimation consistency [CT05, BRT09]. For exact support recovery, it was known early on that, even in the classical setting of fixed p and diverging n, the support of the Lasso will differ from S (the support of the true signal) unless the columns of X with index in S are roughly orthogonal to those with index outside S [KF00]. This assumption is formalized as the so-called 'irrepresentability condition'. In a seminal work, Zhao and Yu [ZY06] showed that this condition also allows exact support recovery in the high-dimensional setting (p ≫ n). Independently, [MB06] studied the model selection problem for random Gaussian designs, with applications to learning Gaussian graphical models. These papers consider the setting s_0 = O(n^c), for some c < 1. Further, under a normalization of the design such that its columns have norm at most √n, they require the minimum nonzero amplitude of the signal, θ_min = min_{i∈S} |θ_{0,i}|, to satisfy θ_min > c √(s_0/n). Later, [Wai09] improved these results for the case of random Gaussian designs and showed that, for a broad range of covariance matrices, the Lasso can recover the support of a signal for which θ_min > cσ √((log p)/n). The model selection problem was also studied under the weaker, generalized irrepresentability condition, for the Gauss-Lasso estimator [JM13a].
As an alternative to the irrepresentability condition, [Lou08] proves exact model selection under an incoherence assumption of max_{i≠j} |Σ_{ij}| = O(1/s_0). This assumption is, however, stronger than the irrepresentability condition [vdGB09].
As discussed in the introduction, related to model selection is the problem of hypothesis testing for high-dimensional regression. In [ZZ11, Büh12], the authors consider null hypotheses of the form H_{0,i}: θ_{0,i} = 0 and propose methods that achieve a given power 1 − β if |θ_{0,i}| > c_β σ √(s_0 (log p)/n). Later, [JM14b] proposed a method for random Gaussian designs with known covariance, under the setting s_0/p → ε and n/p → δ, for constants ε, δ ∈ (0, 1). The proposed method achieves a given power 1 − β provided that |θ_{0,i}| > c_β σ/√n. The debiasing approach [ZZ14, JM14a, VdGBRD14] has also been proposed to test H_{0,i} in the high-dimensional setting with s_0 = o(√n/(log p)). In [JM14a], it is shown that the debiasing-based framework for testing H_{0,i} achieves a given power 1 − β if θ_min > c_β σ √((log p)/n). The applicability of the debiasing approach has been extended to the setting s_0 = o(n/(log p)^2), for random Gaussian designs, using a 'leave-one-out' technique [JM18].

Notations
Here, we provide a summary of the notations used throughout this paper. We use [p] = {1, ..., p} to refer to the first p integers. For a vector v, we denote its coordinates by v_i and let v_S be the restriction of v to the indices in a set S. Further, the support of a vector refers to its nonzero coordinates, i.e., supp(v) = {i ∈ [p] : v_i ≠ 0}. We use I to denote the identity matrix and, for clarity, may also make its dimension explicit, as in I_{d×d}. For a matrix A, we denote its maximum and minimum singular values by σ_max(A) and σ_min(A), respectively. For a random variable x, we denote its subgaussian norm by ‖x‖_ψ2, defined as

‖x‖_ψ2 = sup_{q≥1} q^{-1/2} (E|x|^q)^{1/q},

and for a random vector X ∈ R^m, its subgaussian norm is defined as

‖X‖_ψ2 = sup_{‖v‖_2=1} ‖⟨X, v⟩‖_ψ2.

FCD procedure: false discovery control via debiasing
In order to describe the FCD framework, we first give an overview of the debiasing approach. To this end, we focus on the Lasso estimator [Tib96], given by

θ̂(y, X; λ) ≡ arg min_{θ∈R^p} { (1/(2n)) ‖y − Xθ‖_2^2 + λ‖θ‖_1 }.

In case the optimization has more than one minimizer, we select one of them arbitrarily. We will often drop the arguments y, X, as they are clear from the context. There is a vast literature on the properties of the Lasso estimator in the high-dimensional regime (n < p), mainly through the lens of point estimation and prediction. A major quantity that plays a key role in the estimation error is the so-called compatibility constant of the design matrix X. Let Σ̂ ≡ X^T X/n be the sample covariance matrix. In the high-dimensional setting, where n < p, Σ̂ is always singular; this makes the estimation of θ_0 challenging, since for the parameter family {θ = θ_0 + v}, with v in the null space of Σ̂, we have Xθ = Xθ_0 and hence the same response vector. A common assumption to cope with this problem is to require Σ̂ to be nonsingular in a restricted set of directions.
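For concreteness, the Lasso objective above can be solved by cyclic coordinate descent. The sketch below is our own minimal implementation under the 1/(2n) scaling of the residual term, intended only to fix ideas; practical work would use an optimized solver.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Lasso via cyclic coordinate descent on
    (1/(2n)) ||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()  # running residual y - X theta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Partial correlation of column j with the residual (plus own term).
            rho = X[:, j] @ r / n + col_sq[j] * theta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (theta[j] - new)
            theta[j] = new
    return theta
```

On a well-conditioned design with a sufficiently strong signal, the estimate recovers the support, while each active coefficient is shrunk toward zero by roughly λ, which is exactly the bias that the debiasing step below removes.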
Definition 2.1. For a symmetric matrix Σ̂ ∈ R^{p×p} and a set S ⊆ [p], the corresponding compatibility constant is defined as

φ^2(Σ̂, S) ≡ min { |S| (θ^T Σ̂ θ) / ‖θ_S‖_1^2 : θ ∈ R^p, ‖θ_{S^c}‖_1 ≤ 3‖θ_S‖_1, θ_S ≠ 0 }.

The matrix Σ̂ ∈ R^{p×p} is said to satisfy the compatibility condition for the set S with constant φ_0 > 0 if φ(Σ̂, S) ≥ φ_0.
Despite the great properties of the Lasso in terms of point estimation and prediction, it is biased due to the ℓ_1 penalty term. Indeed, bias is unavoidable in the high-dimensional setting (n < p), as one needs to produce a point estimate in p dimensions from observed data in the lower dimension n. Furthermore, characterizing the exact distribution of a regularized estimator is in general not tractable. To deal with these challenges, the debiasing approach aims at first removing the bias of the Lasso and producing an estimator that is amenable to distributional characterization.

Debiasing lasso
A debiased estimator θ̂^d takes the simple general form

θ̂^d = θ̂ + (1/n) M X^T (y − X θ̂).

Here, M is a 'decorrelating' matrix. There are various proposals for constructing M; see, e.g., [ZZ14, JM14a, VdGBRD14]. In this paper, we use the construction introduced by [JM14a]. Here, we assume that the noise w is Gaussian, and we discuss the non-Gaussian case in Section 2.2.
To set the stage for the construction of M, note that by plugging in y = Xθ_0 + w, we have

√n (θ̂^d − θ_0) = Δ + Z,  Δ = √n (M Σ̂ − I)(θ_0 − θ̂),  Z = (1/√n) M X^T w,

where Σ̂ ≡ (X^T X)/n is the empirical covariance of the feature vectors. The first term, Δ, is the bias, controlled by ‖Δ‖_∞ ≤ √n |M Σ̂ − I|_∞ ‖θ̂ − θ_0‖_1, with |·|_∞ the entrywise maximum absolute value. The second term, Z, is unbiased Gaussian noise whose conditional covariance works out to σ^2 M Σ̂ M^T. The decorrelating matrix M is constructed via a convex optimization that aims at simultaneously reducing the bias and the variance of the coordinates of θ̂^d.
For each i ∈ [p], the i-th row of M is obtained by solving

m_i = arg min_{m∈R^p} m^T Σ̂ m  subject to  ‖Σ̂ m − e_i‖_∞ ≤ μ,

with e_i ∈ R^p being the i-th standard unit vector, and we set M = (m_1, ..., m_p)^T. If any of the above problems is not feasible, we let M = I_{p×p}. Note that M is constructed solely from X.
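To illustrate the decomposition, here is a low-dimensional numpy sketch. We deliberately take n > p so that Σ̂ is invertible and we may use M = Σ̂^{-1}, making the bias term vanish exactly; in the high-dimensional case, M only approximates the precision matrix via the optimization above. The names and the shrunk initial estimate are ours, for illustration only.

```python
import numpy as np

def debias(theta_hat, X, y, M):
    """One-step debiasing: theta_d = theta_hat + (1/n) M X^T (y - X theta_hat).

    With y = X theta_0 + w, the error splits as
    theta_d - theta_0 = (I - M Sigma_hat)(theta_hat - theta_0) + M X^T w / n,
    i.e., a bias term plus a (conditionally) Gaussian noise term.
    """
    n = X.shape[0]
    return theta_hat + M @ X.T @ (y - X @ theta_hat) / n

rng = np.random.default_rng(1)
n, p = 2000, 5
X = rng.standard_normal((n, p))
theta0 = np.array([1.0, 0.0, 2.0, 0.0, 0.0])
y = X @ theta0 + 0.1 * rng.standard_normal(n)

theta_hat = 0.8 * theta0            # a deliberately biased (shrunk) estimate
M = np.linalg.inv(X.T @ X / n)      # here M Sigma_hat = I, so the bias term is zero
theta_d = debias(theta_hat, X, y, M)
```

With this choice of M, the debiased estimator coincides with ordinary least squares, so the shrinkage bias of the input estimate is removed and only the O(σ/√n) noise remains.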
The choice of the tuning parameter μ will be discussed in the sequel.
The following proposition, proved in [JM14a], shows that the error of the debiased estimator θ̂^d can be decomposed as the sum of a 'bias' term and a 'noise' term. In addition, a high-probability bound is established on the bias term ‖Δ‖_∞, which leverages the properties of the optimization (12) and the estimation error of the Lasso estimator. Note that the compatibility condition for the design matrix X is required for the Lasso to achieve the optimal estimation rate in high dimensions [BvdG11, vdGB09]. In [JM14a], there is also a deterministic version of the following proposition, with the compatibility constant φ_0 explicit in the bound (see Theorem 2.3 therein). The next proposition concerns the setting of random designs, which itself implies the compatibility condition: by a reduction principle established in [RZ11], if the population covariance Σ has minimum singular value c_min > 0 and the sample size is large enough, namely n ≥ C s_0 log(p/s_0), then the sample covariance Σ̂ satisfies the compatibility condition with constant φ_0 = √(c_min)/2, with high probability.
The next lemma controls the variance of the noise coordinates Z i in terms of the diagonal entries of the precision matrix.

Lemma 2.3 ([JM14a]). Let Ω ≡ Σ^{-1} be the precision matrix. Under the assumptions of Proposition 2.2, the following holds for any fixed sequence of integers i = i(n): for κ̄ ≡ 2κ^2 c_min^{-1} and a constant c = c(a) > 0.

Extension to non-Gaussian noise
In the decomposition (2.2), we have Z = M X^T w/√n, and given that w ∼ N(0, σ^2 I_{n×n}), we have Z | X ∼ N(0, σ^2 M Σ̂ M^T). In [JM14a], it is shown that, by a slight modification of the optimization (12), Z admits the same conditional distribution asymptotically even for non-Gaussian noise. For the reader's convenience, and to be self-contained, we briefly explain this here.
Note that for any fixed i ∈ [p], we can write Z_i = Σ_{j=1}^n ξ_j, with ξ_j = (1/√n) m_i^T x_j w_j. Conditional on X, the terms ξ_j are zero mean and independent. Moreover, their variances sum to s_n^2 ≡ σ^2 (M Σ̂ M^T)_{ii}. Therefore, if Lindeberg's condition holds, that is, for every ε > 0, almost surely,

(1/s_n^2) Σ_{j=1}^n E[ ξ_j^2 1{|ξ_j| > ε s_n} | X ] → 0,

then Z_i is asymptotically Gaussian. The construction of M can be slightly modified to ensure Lindeberg's condition; namely, the optimization problem (12) is augmented with an additional constraint involving the parameter a from Eq. (3). The following lemma, from [JM14a], shows that with this modification the marginals of Z are asymptotically normal.
Lemma 2.4 ([JM14a], Theorem 4.1). Under conditions (3) on the noise term, and using optimization (12) to construct M, we have that Z_i/(σ √((M Σ̂ M^T)_{ii})) converges in distribution to N(0, 1). This result can easily be generalized to fixed-dimensional marginals of Z, using the fact that a vector has a multivariate normal distribution if every linear combination of its coordinates is normally distributed.
With this overview of the debiasing approach, we are ready to describe the FCD procedure.

Construction of test statistics
In order to construct the test statistics, we first need a consistent estimate of the noise variance σ^2. There are already several proposals for this in the literature; see, e.g., [FL01, FL08, SBvdG10, Zha10, SZ12, BC13, FGH12, RTF16]. To be concrete, we use the scaled Lasso [SZ12], given by

{θ̂, σ̂} ≡ arg min_{θ∈R^p, σ>0} { ‖y − Xθ‖_2^2/(2σn) + σ/2 + λ‖θ‖_1 }.

The following lemma shows that σ̂ is a consistent estimate of σ; we refer to [JM14a, Lemma 3.3] or [SZ12, Theorem 1] for its proof.
Define Λ ≡ M Σ̂ M^T. For i ∈ [p], we define the test statistic T_i as

T_i = √n θ̂^d_i / (σ̂ √(Λ_{ii})).

For a given threshold level t ≥ 0, we reject H_{0,i} if |T_i| ≥ t, and we return the sign of T_i as the estimate of the sign of θ_{0,i}. We also let R(t) = Σ_{i=1}^p I(|T_i| ≥ t) be the total number of rejections at threshold t. Next, we discuss how to choose a data-dependent threshold t that ensures the directional FDR and FDP are controlled at a pre-assigned level q ∈ [0, 1].
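Before turning to the threshold, the standardization above reads as follows in code. This is a sketch with our own names; sigma_hat would come from the scaled Lasso and M from the optimization described earlier.

```python
import numpy as np

def fcd_test_statistics(theta_d, X, M, sigma_hat):
    """T_i = sqrt(n) * theta_d_i / (sigma_hat * sqrt(Lambda_ii)),
    where Lambda = M Sigma_hat M^T and Sigma_hat = X^T X / n."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    Lam = M @ Sigma_hat @ M.T
    return np.sqrt(n) * theta_d / (sigma_hat * np.sqrt(np.diag(Lam)))
```

Under the null H_{0,i} and the Gaussian approximation of the previous section, T_i is approximately standard normal, which is what the thresholding rule below exploits.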

A data dependent threshold for the test statistics
• Step 1: For the pre-assigned level q ∈ [0, 1], let t_p = (2 log p − 2 log log p)^{1/2} and calculate

t_0 = inf{ 0 ≤ t ≤ t_p : 2p(1 − Φ(t)) / (R(t) ∨ 1) ≤ q }.

If no such t exists, set t_0 = (2 log p)^{1/2}.
• Step 2: We select the model Ŝ = {i ∈ [p] : |T_i| ≥ t_0}.
• Step 3: We return sign_i = sign(T_i) as the estimate of sign(θ_{0,i}), for i ∈ Ŝ.
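Under the reading of Step 1 given above, the threshold search can be sketched as follows. This is our own sketch: we scan the observed |T_i| as candidate thresholds (the quantity being minimized only changes at these points) and use the identity 2(1 − Φ(t)) = erfc(t/√2).

```python
import math
import numpy as np

def fcd_threshold(T, q):
    """Data-dependent threshold of the form
    t0 = inf{ 0 <= t <= t_p : 2p(1 - Phi(t)) / max(R(t), 1) <= q },
    with t_p = sqrt(2 log p - 2 log log p); if no such t exists,
    fall back to t0 = sqrt(2 log p)."""
    T = np.asarray(T)
    p = len(T)
    t_p = math.sqrt(2 * math.log(p) - 2 * math.log(math.log(p)))
    # Candidate thresholds: the observed |T_i| values within [0, t_p].
    candidates = np.sort(np.abs(T))
    candidates = candidates[candidates <= t_p]
    for t in candidates:
        tail = math.erfc(t / math.sqrt(2))        # 2 * (1 - Phi(t))
        R = np.sum(np.abs(T) >= t)                # rejections at threshold t
        if p * tail / max(R, 1) <= q:
            return float(t)
    return math.sqrt(2 * math.log(p))
```

The numerator p · 2(1 − Φ(t)) estimates the number of null statistics exceeding t in absolute value, so the ratio is an empirical estimate of the FDP at threshold t.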

Control of directional false discovery rate
Suppose that the design matrix X has i.i.d. rows, with Σ = E(x_1 x_1^T) being the population covariance. Let Ω ≡ Σ^{-1} be the precision matrix, and recall the definition Λ ≡ M Σ̂ M^T, where M is the decorrelating matrix used in the construction of the debiased estimator.
We also define the normalized matrices Ω_0 and Λ_0 by (Ω_0)_{ij} ≡ Ω_{ij}/√(Ω_{ii}Ω_{jj}) and (Λ_0)_{ij} ≡ Λ_{ij}/√(Λ_{ii}Λ_{jj}). For given constants γ > 0 and c > 0, we further define the quantities appearing in conditions (ii)-(iii) below. The following theorem states a guarantee on the directional false discovery rate of the FCD procedure introduced in the previous section.
Theorem 3.1. Consider random design matrices with i.i.d. rows, and let Σ = E(x_1 x_1^T) be the population-level covariance. Suppose that σ_min(Σ) ≥ c_min > 0 and σ_max(Σ) ≤ c_max, for some constants c_min, c_max, and that max_{i∈[p]} Σ_{ii} ≤ 1. In addition, assume that XΣ^{-1/2} has independent subgaussian rows with ‖Σ^{-1/2} x_1‖_ψ2 = κ. Also assume that (i) s_0 = o(√n/(log p)^2), along with conditions (ii)-(iii) on the precision matrix Ω. Then, for the FCD procedure, we get

lim sup_{(n,p)→∞} FDR_dir ≤ q.

Further, for any ε > 0,

lim_{(n,p)→∞} P(FDP_dir ≥ q + ε) = 0.

Remark 3.2. While the directional FDR is the expectation of the directional false discovery proportion (FDP_dir), ideally a variable selection procedure would control FDP_dir in any given realization. In general, controlling FDR_dir does not control the variations of FDP_dir. As noted by [Owe05], the variance of the FDP can be large if the test statistics are correlated, which is the case here. Let us emphasize that, by Eq. (23), our FCD procedure controls FDP_dir with high probability.
Examples. Here, we provide several examples of precision matrices that satisfy conditions (ii)-(iii) of Theorem 3.1, to demonstrate its applicability.
Example 1: Our first example is the family of circulant covariance matrices, where Σ_{ij} = η^{|i−j|}, for some constant η ∈ (0, 1). It is simple to see that the inverse of such a matrix has at most three nonzero entries per row. Therefore, the conditions are satisfied by choosing ρ = 1, with c > 0 and γ < ∞ arbitrary.

Example 2: Suppose that Σ is block diagonal with block sizes bounded (as p → ∞). Then the precision matrix also has a block diagonal structure with blocks of bounded size. It is easy to check that the conditions hold with ρ = 1, and c > 0, γ < ∞ arbitrary.

Example 3: Consider equicorrelated features, where Σ = (1 − r)I + r 1 1^T, for some constant r ∈ (0, 1), with 1 ∈ R^p denoting the all-ones vector. Then we have Ω = (a − b)I + b 1 1^T, with a = (1 + (p − 2)r)/((1 − r)(1 + (p − 1)r)) and b = −r/((1 − r)(1 + (p − 1)r)). Note that |b| = O(1/p). Therefore, the conditions hold for arbitrary constants c > 0 and 0 < ρ < 1.
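Example 3 can be checked numerically: the inverse of the equicorrelation matrix has the stated closed form by the Sherman-Morrison formula. A quick sanity-check script (the values of p and r are ours):

```python
import numpy as np

p, r = 400, 0.3
Sigma = (1 - r) * np.eye(p) + r * np.ones((p, p))

# Closed-form inverse of the equicorrelation matrix (Sherman-Morrison):
a = (1 + (p - 2) * r) / ((1 - r) * (1 + (p - 1) * r))   # diagonal of Omega
b = -r / ((1 - r) * (1 + (p - 1) * r))                  # off-diagonal of Omega
Omega = (a - b) * np.eye(p) + b * np.ones((p, p))
```

The off-diagonal entry b shrinks like 1/p, which is what makes the normalized precision matrix nearly diagonal for large p.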

Power analysis
Recall that S_0 ≡ supp(θ_0) is the set of indices of the truly significant features. Let Ŝ denote the set of significant parameters returned by our FCD procedure, namely Ŝ ≡ {i ∈ [p] : |T_i| ≥ t_0}. The power of the selected model Ŝ is defined, as before, as the expected fraction of truly significant features that are selected with the correct sign. We are now ready to characterize the statistical power of the FCD procedure for the linear model (2).
Theorem 3.3. Consider a sequence of random design matrices X ∈ R^{n×p}. Suppose that σ_min(Σ) ≥ c_min > 0 and σ_max(Σ) ≤ c_max, for some constants c_min, c_max, and that max_{i∈[p]} Σ_{ii} ≤ 1. Further, assume that XΣ^{-1/2} has independent subgaussian rows with ‖Σ^{-1/2} x_1‖_ψ2 = κ. Then the following holds true: where, for α ∈ [0, 1] and u ∈ R_+, the function F(α, u) is defined as follows: We refer to Section 7.2 for the proof of Theorem 3.3.
The proof of Corollary 3.5 is given in Appendix A.4.

Improved results for Gaussian designs
In [JM18], the authors improved upon Proposition 2.2 for Gaussian designs by providing a sharper bound on ‖Δ‖_∞ using a 'leave-one-out' technique. Specifically, for Gaussian designs with known population covariance, it is shown that ‖Δ‖_∞ = o_P(√(s_0/n) log p). The same bound holds when the population covariance is unknown but can be estimated sufficiently well, e.g., if the inverse covariance is sufficiently sparse. In this section, we employ this result to relax the sparsity assumption (Condition (i)) in Theorem 3.1.

Known covariance
Consider the linear model (2), where the design X has independent Gaussian rows with zero mean and covariance Σ. Denote by Ω ≡ Σ^{-1} the inverse population covariance, a.k.a. the precision matrix. Here, we assume that Σ is known and consider the test statistic T_i given by (18), where θ̂^d is the debiased estimator with M = Ω.
For an integer 1 ≤ k ≤ p, define τ(Ω, k) as follows: where ‖·‖_∞ denotes the ℓ_∞ operator norm (maximum ℓ_1 norm of the rows). As proved in [JM18], the following bound holds: The next theorem is analogous to Theorem 3.1 for Gaussian designs, under a weaker assumption on the sparsity level s_0.

Theorem 4.1. (Known covariance). Consider a sequence of Gaussian random design matrices
Suppose that X has i.i.d. rows with zero mean, and let Σ = E(x_1 x_1^T) be the population covariance. Suppose that σ_min(Σ) ≥ c_min > 0 and σ_max(Σ) ≤ c_max, for some constants c_min, c_max, and that max_{i∈[p]} Σ_{ii} ≤ 1. Further, assume conditions (i)-(iv). Then, for the FCD procedure, we get

lim sup_{(n,p)→∞} FDR_dir ≤ q.

Further, for any ε > 0,

lim_{(n,p)→∞} P(FDP_dir ≥ q + ε) = 0.

The proof of Theorem 4.1 proceeds along the same lines as the proof of Theorem 3.1 and uses the result of [JM18, Theorem 3.8]. We refer to Section 7.3 for its proof.

Unknown covariance
For the case of unknown covariance, we follow the construction of the decorrelating matrix M proposed in [VdGBRD14]. This construction is based on node-wise Lasso on the matrix X. Formally, for i ∈ [p], let x_i be the i-th column of X and represent it via a sparse regression against all the other columns:

γ̂_i(λ) = arg min_{γ∈R^{p−1}} { (1/(2n)) ‖x_i − X_{∼i} γ‖_2^2 + λ‖γ‖_1 },

where X_{∼i} is the submatrix obtained by removing the i-th column. Let

τ̂_i^2 ≡ x_i^T (x_i − X_{∼i} γ̂_i)/n.

The decorrelating matrix M is then defined by setting its i-th row to have entries M_{ii} = 1/τ̂_i^2 and M_{ij} = −γ̂_{i,j}/τ̂_i^2 for j ≠ i. We consider the FCD procedure, where the test statistic T_i is given by (18) and θ̂^d is the debiased estimator with the decorrelating matrix M in (34). Define the sparsity level s_Ω of the precision matrix Ω as

s_Ω ≡ max_{i∈[p]} |{j ∈ [p] : Ω_{ij} ≠ 0}|.

In words, s_Ω is the maximum row sparsity of Ω.
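A minimal sketch of this node-wise construction follows, with a small coordinate-descent Lasso as the inner solver. The function names and the solver are ours; a practical implementation would use an optimized Lasso routine and a data-driven λ.

```python
import numpy as np

def _lasso_cd(A, b, lam, n_iter=200):
    """Minimize (1/(2n))||b - A g||_2^2 + lam ||g||_1 by coordinate descent."""
    n, p = A.shape
    g = np.zeros(p)
    sq = (A ** 2).sum(axis=0) / n
    r = b.copy()
    for _ in range(n_iter):
        for j in range(p):
            if sq[j] == 0.0:
                continue
            rho = A[:, j] @ r / n + sq[j] * g[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / sq[j]
            r += A[:, j] * (g[j] - new)
            g[j] = new
    return g

def nodewise_M(X, lam):
    """Node-wise-Lasso decorrelating matrix: regress each column on the
    others, set tau_i^2 = x_i^T (x_i - X_others gamma_i) / n, and place
    (1, -gamma_i) / tau_i^2 in the i-th row of M."""
    n, p = X.shape
    M = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        g = _lasso_cd(X[:, others], X[:, i], lam)
        resid = X[:, i] - X[:, others] @ g
        tau2 = X[:, i] @ resid / n
        M[i, i] = 1.0 / tau2
        M[i, others] = -g / tau2
    return M
```

For independent, standardized columns, the node-wise regressions shrink to zero and M is close to the identity, which is the correct precision matrix in that case.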
For the case of Gaussian designs with unknown covariance, we prove that the directional FDR of the FCD procedure is controlled under a weaker assumption on the sparsity s_0 of the parameters, provided that s_Ω is small enough.

Theorem 4.2. (Unknown covariance). Consider a sequence of Gaussian random design matrices
and X has i.i.d. rows with zero mean and covariance Σ. Assume that σ_min(Σ) ≥ c_min > 0 and σ_max(Σ) ≤ c_max, for some constants c_min, c_max, and that max_{i∈[p]} Σ_{ii} ≤ 1. Further, suppose that s_0 and s_Ω satisfy the stated sparsity conditions, and that conditions (ii), (iii), (iv) of Theorem 4.1 hold for Σ. Then, for the FCD procedure, we get

lim sup_{(n,p)→∞} FDR_dir ≤ q.

Further, for any ε > 0,

lim_{(n,p)→∞} P(FDP_dir ≥ q + ε) = 0.

The proof of Theorem 4.2 proceeds along the same lines as the proof of Theorem 3.1 and uses the result of [JM18, Theorem 3.13]. We refer to Section 7.4 for its proof.

Numerical experiments
We consider the linear model (2), where the design matrix X is generated by drawing its rows independently from N(0, Σ). The covariance Σ ∈ R^{p×p} has a circulant structure, with Σ_{ij} = η^{|i−j|} for some constant η ∈ (0, 1). We then normalize the columns of X to have unit norm. We generate the vector of coefficients θ_0 ∈ R^p by choosing a subset of indices S ⊆ [p] at random, of size s_0, setting θ_{0,i} for i ∈ S uniformly at random from {±A}, and setting θ_{0,i} = 0 for i ∉ S. The noise term w is drawn from N(0, I_{n×n}).
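The data-generating process above can be sketched as follows. This is a reproducibility aid in our own notation, not the authors' exact script.

```python
import numpy as np

def make_data(n, p, s0, A, eta, seed=0):
    """Synthetic instance matching the simulation setup:
    Sigma_ij = eta^|i-j|, columns of X normalized to unit norm,
    coefficients +/-A on a random support of size s0, N(0, I) noise."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = eta ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X /= np.linalg.norm(X, axis=0)                  # unit-norm columns
    theta0 = np.zeros(p)
    S = rng.choice(p, size=s0, replace=False)       # random support
    theta0[S] = rng.choice([-A, A], size=s0)        # signs chosen uniformly
    y = X @ theta0 + rng.standard_normal(n)
    return X, y, theta0
```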
We perform three sets of simulations to compare the performance of the FCD procedure with the model-free knockoff and to examine the effects of the sparsity level, the signal magnitude, and the feature correlation. We also compare the empirical power of FCD with the analytical lower bound provided in Corollary 3.4. In all simulations, we set the target FDR level to q = 0.1.
For the FCD procedure, we use the implementation of the debiasing method provided at http://web.stanford.edu/montanar/sslasso/ to construct the debiased estimator. For the model-free knockoff, we use the implementation provided at http://web.stanford.edu/group/candes/knockoffs/.

Effect of Signal Amplitude:
We choose n = 2000, p = 3000, k = 100, η = 0.1, and vary the signal amplitude A over the set {0.5, 1, 1.5, ..., 5.5, 6}. For the FCD procedure and the model-free knockoff, we compute the directional FDR and the power by averaging across 100 realizations of the noise and of the coefficient vector θ_0. The results are plotted in Figure 1. As we observe, both methods control FDR_dir under the target level q = 0.1. As expected, the power of both procedures increases as the signal amplitude increases, with the FCD procedure having larger power than the knockoff method over the entire range of signal amplitudes.
We also plot the analytical lower bound on the power of the FCD procedure, given in Corollary 3.4. As we see, the lower bound is quite close to the actual empirical power of the FCD procedure in the setup tested.

Effect of feature correlation:
We test the effect of feature correlations on the performance of the FCD procedure, comparing it with the model-free knockoff. We set n = 700, p = 1000, k = 50, A = 4.5. Recall that the rows of the design matrix X are generated from a N(0, Σ) distribution, with Σ_ij = η^{|i−j|}, and then the columns of X are normalized to have unit norm. We vary the parameter η over the set {0.1, 0.15, 0.2, . . ., 0.75, 0.8}. For each value of η, we compute FDR_dir and power for both methods, averaging over 100 realizations of the noise and the design matrix X. The results are displayed in Figure 2.
As observed, both methods control FDR_dir over the range of correlations tested. From the power plot, we see that the power of both methods decays as the feature correlations increase. This is expected: when the features are highly correlated, it is harder to distinguish between them and report the truly significant ones. Indeed, for large values of η, both methods select only a few variables; FDR_dir is still controlled, but the power is low. The proposed FCD procedure has higher power than the model-free knockoff for η ≤ 0.65.

Effect of Sparsity:
Here, we set n = 2000, p = 3000, A = 4.5, η = 0.1 and vary the sparsity level of the coefficients over the set k ∈ {10, 15, 20, . . ., 130}. For both methods, the power and FDR are computed by averaging over 100 trials of the noise and of the coefficient vector θ_0. Both methods control FDR_dir over the entire range, with FCD achieving lower FDR_dir for small values of k. The two methods have comparable power, with FCD being more powerful for small k.

Real data experiments
In this section we evaluate the proposed method by finding the mutations in the Human Immunodeficiency Virus Type 1 (HIV-1) associated with drug resistance. This dataset is presented and analyzed in [RTW+06] and is obtained by analyzing HIV-1 subtype B sequences from persons with histories of antiretroviral treatment. The dataset contains the mutations in the protease and reverse transcriptase (RT) positions of the HIV-1 subtype B sequences that correspond to resistance to protease inhibitors (PIs), to nucleoside RT inhibitors (NRTIs), and to non-nucleoside RT inhibitors (NNRTIs).
To deal with missing measurements and to preprocess the dataset, we mostly follow the steps taken in [BC15]. The design matrix X ∈ {0, 1}^{n×p} is formed by letting X_ij = 1 if the i-th sample contains the j-th mutation, and X_ij = 0 otherwise. Further, for a specific drug, the i-th entry y_i of the response vector denotes the logarithm of the increase in resistance to that drug in the i-th patient. We let q = 0.2 and apply the FCD procedure described in Subsection 2.3 to detect the mutations in HIV-1 associated with resistance to each drug. To evaluate the performance of our method, we compare it with the knockoff filter procedure [BC15], with test statistics based on the lasso. The size of the dataset (n, p) for each drug is noted under the bar plot corresponding to that drug. For all cases, except the data for resistance to TDF, we have n > 2p.
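The construction of the binary design matrix described above can be sketched as follows. The sample records and mutation names here are made up for illustration; the actual data and filtering follow [RTW+06] and [BC15].

```python
import numpy as np

# Hypothetical sample records: set of observed mutations plus the
# log fold-change in resistance to one drug (the response y_i).
samples = [
    {"mutations": {"P10I", "P46L"}, "log_fold_resistance": 1.8},
    {"mutations": {"P46L"},         "log_fold_resistance": 0.9},
    {"mutations": set(),            "log_fold_resistance": 0.1},
]

# One column per distinct mutation observed anywhere in the data.
mutation_list = sorted(set().union(*(s["mutations"] for s in samples)))
col = {m: j for j, m in enumerate(mutation_list)}

X = np.zeros((len(samples), len(mutation_list)), dtype=int)
for i, s in enumerate(samples):
    for m in s["mutations"]:
        X[i, col[m]] = 1          # X_ij = 1 iff sample i carries mutation j
y = np.array([s["log_fold_resistance"] for s in samples])
```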
We have used two different methods for generating the knockoff variables: in knockoff1, the knockoff variables are generated by solving a semidefinite program (SDP), while in knockoff2, equi-correlated knockoff variables are created without solving an SDP, at a lower computational cost. Since this is a real data experiment, there is no ground truth. However, we use the methodology in [BC15] to assess our results: we evaluate the reproducibility of the outcomes of these procedures by comparing them with the treatment-selected mutation (TSM) panels provided in [RFZ+05]. These panels contain mutations that are observed more frequently in virus samples from patients who have been treated with a given drug than in samples from patients who have never been treated with that drug. Since these panels are created independently of the dataset that we use, they provide a good measure for validating the reproducibility of the results obtained by each procedure.
A summary of the results can be seen in Figures 4, 5, and 6. The FCD method achieves the target FDR level of q = 0.2, and in half of the cases (8 out of 16 drugs) the obtained power is larger than the power achieved by the knockoff filter. Overall, the achieved power is comparable with that of the knockoff filter method.

Proof of Theorem 3.1
Define G(t) = 2(1 − Φ(t)), with Φ(t) denoting the standard Gaussian cumulative distribution function. We begin with two lemmas about the properties of G(t).
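Numerically, G(t) = 2(1 − Φ(t)) equals erfc(t/√2). The FCD threshold rule analyzed in this proof — t_0 is the smallest t at which the estimated FDP, p·G(t)/max(R(t), 1), drops below q, with the fallback t_0 = √(2 log p) when no such t exists — can be sketched as below. This is our reconstruction from the surrounding text (the definition referenced as Eq. (19)), not the authors' code.

```python
import math

def G(t):
    """G(t) = 2*(1 - Phi(t)), the two-sided standard-normal tail,
    computed via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def fcd_threshold(T, p, q):
    """Smallest observed |T_i| at which the estimated FDP
    p*G(t)/max(#{|T_j| >= t}, 1) falls below q; falls back to
    sqrt(2 log p) when no such threshold exists."""
    for t in sorted(abs(x) for x in T):
        rejections = sum(1 for x in T if abs(x) >= t)
        if p * G(t) / max(rejections, 1) <= q:
            return t
    return math.sqrt(2.0 * math.log(p))
```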
Lemma 7.1 is the standard tail bound on the Gaussian distribution and its proof is omitted.

Lemma 7.2. For all t > 0, ε < min(1, 1/t) and δ < min(1, 1/t^2), the following holds true:

The proof of Lemma 7.2 is given in Appendix A.1. Using Proposition 2.2, we have, where Z_i ∼ N(0, Λ_0). By invoking [JM14a, Lemma 3.1], the Λ_ii are bounded from below by an arbitrary fixed constant 0 < c < 1, for large enough n. In addition, for some constant C > 0. Further, by Condition (iii) in the theorem statement, we have

We first consider the case that t_0, given by (19), does not exist. In this case, t_0 = √(2 log p) and for any ε > 0 we have. We can bound the first term on the right-hand side above, which goes to zero as n, p → ∞, due to Proposition 2.2 along with Condition (i), that s_0 = o(√n/(log p)^2), and using Lemma 2.5. Similarly, and by symmetry, the second term goes to zero as n, p → ∞ and the claim follows.

Figs 4–6. Summary of the results of applying the knockoff filter and FCD for detecting the mutation positions in HIV-1 associated with resistance to PI-type drugs, using the dataset in [RTW+06]. In these experiments we used q = 0.2. In the plots, blue bars show the number of positions detected by each method that appear in the TSM panels. On top of each bar, the proportion of detected mutations that do not appear in the TSM panel (an estimate of FDP) and the proportion of mutations in the TSM panel that are detected (an estimate of power) are stated.
We next focus on the event that t_0, given by (19), exists. By definition of t_0 in this case, we have. (Indeed, it is clear that the left-hand side is at most q. Equality holds since t_0 is the minimum t with this property.) Define. Then, hence, we need to prove that A_p → 0 in probability. Note that by symmetry it is sufficient to prove that the first term in (47) goes to zero in probability. Consider a discretization. Hence, it suffices to show that the maximum goes to zero in probability.
In the following lemma, we provide sufficient conditions under which Eq. (50) holds.
We refer to Appendix A.2 for the proof of Lemma 7.3. By virtue of Lemma 7.3, we only need to prove Eqs. (51) and (52). We start by analyzing the following expression, with p_0 = |S_≤0|. The last inequality of (53) holds because θ_{0,i} ≤ 0 for i ∈ S_≤0 and therefore T_i ≤ T_i (recall the definition of T_i, given by Eq. (39)). Further, because S^c = {i ∈ [p] : θ_{0,i} = 0} ⊆ S_≤0, we have p_0 ≥ p − s_0. Since s_0 = o(√n/(log p)^2) by Condition (i), we have p_0 = Ω(p).

Proof of Theorem 3.3
The threshold t_0 returned by the FCD procedure is data-dependent. To analyze the power, we first upper bound t_0 by a data-independent threshold t_*.
Lemma 7.4. Under the assumptions of Theorem 3.3, we have

The proof of Lemma 7.4 is given in Appendix A.3. Since t_0 ≤ t_* by Lemma 7.4, replacing t_0 by t_* yields a lower bound on the power. For fixed, arbitrarily small constants c_0, δ, ε, define. Using Equation (76), we have

Proof of Theorem 4.1
The proof follows that of Theorem 3.1. Note that for the results of the theorem to hold, it suffices that the conditions of Lemma 7.3 be satisfied. The result in [JM18, Theorem 3.8] implies that under the conditions of Theorem 4.1, for some constants C, c and n ≥ max(25 log p, c s_0 log(p/s_0)), we have. Using this, under the assumptions of Theorem 4.1, letting ε_2 = (log p)τ_0 s_0/n and following the same steps as in the proof of Theorem 3.1, we reach an equation similar to Eq. (70), and a bound of the form p^2 (log p)^{−1/2} + 2p e^{−c_min n/(16 s_0)} + p e^{−n/1000} + 8p, similar to Eq. (71). Again, by taking ε_1 = s_0 (log p)/n and ε_2 = (log p)τ_0 s_0/n, we deduce that Eq. (52) holds as well. Hence, the desired results hold under the conditions of the theorem.

Proof of Theorem 4.2
The proof is similar to the proof of Theorem 4.1. Here, using the result in [JM18, Theorem 3.13], under the conditions of Theorem 4.2, for some constants C, c, we have, where we used that φ(t) is a decreasing function. We next separate the cases t ∈ (0, 1) and t ≥ 1. For 0 < t < 1, we use the following bound. Since P(G^c) → 0 and s_0 = o(√n/(log p)^2), we can choose a deterministic sequence L_n → ∞, arbitrarily slowly as n → ∞, such that L_n P(G^c) → 0 and L_n (s_0/p) → 0. Letting A_n ≡ (q s_0/p) + P(G^c), we have L_n A_n → 0.
By applying Markov's inequality, we obtain, where the last inequality follows from (98). Therefore, with high probability, |S_0 ∩ S(t_*)^c| ≤ s_0 L_n A_n, which implies the claim. Now using Equation (100) in Equation (96), we arrive at. Therefore, for t_* given by, we reach a contradiction, which proves that our claim t_0 ≤ t_* is correct. Since G is a decreasing function, applying G^{−1} to both sides gives. Using the above bound, and by the assumption on θ_min, the left-hand side of (102) goes to ∞, which completes the proof.
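To spell out the Markov step: assuming the bound referenced as (98) gives E|S_0 ∩ S(t_*)^c| ≤ s_0 A_n (our reading of the preceding display; the notation \widehat{S}(t_*) for the selected set is ours), Markov's inequality yields

```latex
\mathbb{P}\Bigl(\,|S_0 \cap \widehat{S}(t_*)^c| > s_0 L_n A_n\,\Bigr)
\;\le\;
\frac{\mathbb{E}\bigl[\,|S_0 \cap \widehat{S}(t_*)^c|\,\bigr]}{s_0 L_n A_n}
\;\le\;
\frac{s_0 A_n}{s_0 L_n A_n}
\;=\;
\frac{1}{L_n}
\;\longrightarrow\; 0\,,
```

since L_n → ∞, which is exactly why |S_0 ∩ S(t_*)^c| ≤ s_0 L_n A_n holds with high probability.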

* A. Javanmard is supported in part by NSF CAREER Award 1844481 and a Google Faculty Research Award. A. Javanmard would also like to acknowledge the financial support of the Office of the Provost at the University of Southern California through the Zumberge Fund Individual Grant Program.

Fig 1. Testing the FCD and model-free knockoff methods while varying the coefficient amplitude A. Here, n = 2000, p = 3000, k = 100, η = 0.1. The target level is q = 10%. FDR_dir and power are computed by averaging over 100 realizations of noise and coefficient vectors.

Fig 2. Testing the FCD and model-free knockoff methods while varying the feature correlation parameter η. Here, n = 700, p = 1000, k = 50, A = 4.5. The target level is q = 10%. FDR_dir and power are computed by averaging over 100 realizations of noise and design matrices.

Fig 3. Testing the FCD and model-free knockoff methods while varying the sparsity level k. Here, n = 2000, p = 3000, A = 4.5, η = 0.1. The target level is q = 10%. FDR_dir and power are computed by averaging over 100 realizations of noise and coefficient vectors.