## The Annals of Statistics

### Controlling the false discovery rate via knockoffs

#### Abstract

In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)—the expected fraction of false discoveries among all discoveries—is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap—their construction does not require any new data—and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high.

#### Article information

Source
Ann. Statist. Volume 43, Number 5 (2015), 2055-2085.

Dates
Revised: April 2015
First available in Project Euclid: 3 August 2015

Permanent link to this document
https://projecteuclid.org/euclid.aos/1438606853

Digital Object Identifier
doi:10.1214/15-AOS1337

Mathematical Reviews number (MathSciNet)
MR3375876

Zentralblatt MATH identifier
1327.62082

Subjects
Primary: 62F03: Hypothesis testing 62J05: Linear regression

#### Citation

Barber, Rina Foygel; Candès, Emmanuel J. Controlling the false discovery rate via knockoffs. Ann. Statist. 43 (2015), no. 5, 2055--2085. doi:10.1214/15-AOS1337. https://projecteuclid.org/euclid.aos/1438606853.

#### References

• [1] Barber, R. F. and Candès, E. (2015). Supplement to “Controlling the false discovery rate via knockoffs.” DOI:10.1214/15-AOS1337SUPP.
• [2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. Stat. Methodol. 57 289–300.
• [3] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188.
• [4] Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41 802–837.
• [5] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• [6] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge.
• [7] Candès, E. J. and Plan, Y. (2009). Near-ideal model selection by $\ell_{1}$ minimization. Ann. Statist. 37 2145–2177.
• [8] Chung, E. and Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Ann. Statist. 41 484–507.
• [9] Chung, E. and Romano, J. P. (2013). Multivariate and multiple permutation tests. Technical report, Stanford Univ., Stanford, CA.
• [10] Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, 2nd ed. Wiley, New York.
• [11] Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics (IMS) Monographs 1. Cambridge Univ. Press, Cambridge.
• [12] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
• [13] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
• [14] G’Sell, M. G., Wager, S., Chouldechova, A. and Tibshirani, R. (2013). Sequential selection procedures and false discovery rate control. Available at arXiv:1309.5352.
• [15] Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
• [16] Liu, H., Roeder, K. and Wasserman, L. (2010). Stability approach to regularization selection (StARS) for high dimensional graphical models. Adv. Neural Inf. Process. Syst. 23 1432–1440.
• [17] Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the lasso. Ann. Statist. 42 413–468.
• [18] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417–473.
• [19] Miller, A. (2002). Subset Selection in Regression, 2nd ed. Monographs on Statistics and Applied Probability 95. Chapman & Hall/CRC, Boca Raton, FL.
• [20] Miller, A. J. (1984). Selection of subsets of regression variables. J. Roy. Statist. Soc. Ser. A 147 389–425.
• [21] Pati, Y. C., Rezaiifar, R. and Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers 40–44. IEEE, Los Alamitos.
• [22] Rhee, S.-Y., Fessel, W. J., Zolopa, A. R., Hurley, L., Liu, T., Taylor, J., Nguyen, D. P., Slome, S., Klein, D., Horberg, M. et al. (2005). HIV-1 protease and reverse-transcriptase mutations: Correlations with antiretroviral therapy in subtype B isolates and implications for drug-resistance surveillance. Journal of Infectious Diseases 192 456–465.
• [23] Rhee, S.-Y., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag, D. L. and Shafer, R. W. (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proc. Natl. Acad. Sci. USA 103 17355–17360.
• [24] Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 479–498.
• [25] Taylor, J., Lockhart, R., Tibshirani, R. J. and Tibshirani, R. (2014). Exact post-selection inference for forward stepwise and least angle regression. Available at arXiv:1401.3889.
• [26] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–288.
• [27] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
• [28] Voorman, A., Shojaie, A. and Witten, D. (2014). Inference in high dimensions with the penalized score test. Preprint. Available at arXiv:1401.2678.
• [29] Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
• [30] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
• [31] Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.
• [32] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.

#### Supplemental materials

• Supplement to “Controlling the false discovery rate via knockoffs”. We provide details for the proofs of several theoretical results in the paper.