The Annals of Statistics

A knockoff filter for high-dimensional selective inference

Rina Foygel Barber and Emmanuel J. Candès

Full-text: Open access

Abstract

This paper develops a framework for testing for associations in a possibly high-dimensional linear model where the number of features/variables may far exceed the number of observational units. In this framework, the observations are split into two groups, where the first group is used to screen for a set of potentially relevant variables, whereas the second is used for inference over this reduced set of variables; we also develop strategies for leveraging information from the first part of the data at the inference step for greater power. In our work, the inferential step is carried out by applying the recently introduced knockoff filter, which creates a knockoff copy—a fake variable serving as a control—for each screened variable. We prove that this procedure controls the directional false discovery rate (FDR) in the reduced model controlling for all screened variables; this says that our high-dimensional knockoff procedure “discovers” important variables as well as the directions (signs) of their effects, in such a way that the expected proportion of wrongly chosen signs is below the user-specified level (thereby controlling a notion of Type S error averaged over the selected set). This result is nonasymptotic, and holds for any distribution of the original features and any values of the unknown regression coefficients, so that inference is not calibrated under hypothesized values of the effect sizes. We demonstrate the performance of our general and flexible approach through numerical studies, showing more power than existing alternatives. Finally, we apply our method to a genome-wide association study to find locations on the genome that are possibly associated with a continuous phenotype.
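The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes marginal-correlation screening that keeps the top `k` variables, an equi-correlated fixed-design knockoff construction (which needs at least twice as many second-stage observations as screened variables), the simple statistic W_j = |X_j'y| - |X̃_j'y|, and no intercept. It omits the paper's strategies for reusing the first part of the data at the inference step. All function names and parameters (`fixed_x_knockoffs`, `knockoff_plus_threshold`, `screen_then_knockoff`, `k`, `seed`) are hypothetical.

```python
import numpy as np


def fixed_x_knockoffs(X, rng):
    """Equi-correlated fixed-design knockoffs (illustrative; requires n >= 2p).

    Returns the column-normalized design and its knockoff copy.
    """
    n, p = X.shape
    if n < 2 * p:
        raise ValueError("this construction needs n >= 2p")
    X = X / np.linalg.norm(X, axis=0)           # unit-norm columns
    Sigma = X.T @ X
    s = min(2.0 * np.linalg.eigvalsh(Sigma)[0], 1.0)
    Sigma_inv = np.linalg.inv(Sigma)
    # C satisfies C'C = 2 s I - s^2 Sigma^{-1}; use a symmetric square root.
    A = 2.0 * s * np.eye(p) - s**2 * Sigma_inv
    evals, evecs = np.linalg.eigh(A)
    C = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    # U: p orthonormal columns orthogonal to the column span of X.
    Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
    U = Q[:, p:2 * p]
    X_ko = X @ (np.eye(p) - s * Sigma_inv) + U @ C
    return X, X_ko


def knockoff_plus_threshold(W, q):
    """Smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf                               # nothing can be selected


def screen_then_knockoff(X, y, q=0.2, k=10, seed=0):
    """Split the data, screen on one half, run knockoffs on the other."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    i0, i1 = idx[: n // 2], idx[n // 2:]
    # Stage 1 (assumed screening rule): top-k absolute marginal inner products.
    score = np.abs(X[i0].T @ y[i0])
    S0 = np.sort(np.argsort(score)[-k:])
    # Stage 2: knockoff filter on the held-out half, screened columns only.
    X1, X1_ko = fixed_x_knockoffs(X[i1][:, S0], rng)
    z, z_ko = X1.T @ y[i1], X1_ko.T @ y[i1]
    W = np.abs(z) - np.abs(z_ko)                # large positive W favors the real variable
    T = knockoff_plus_threshold(W, q)
    keep = W >= T
    signs = np.sign(z - z_ko)                   # estimated direction of each effect
    return S0[keep], signs[keep]
```

Calling `screen_then_knockoff(X, y, q=0.2, k=10)` returns the indices of the selected variables together with their estimated signs. A selection counts as a directional false discovery when its estimated sign differs from the sign of the corresponding coefficient in the reduced model.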

Article information

Source
Ann. Statist., Volume 47, Number 5 (2019), 2504-2537.

Dates
Received: February 2016
Revised: July 2018
First available in Project Euclid: 3 August 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1564797855

Digital Object Identifier
doi:10.1214/18-AOS1755

Mathematical Reviews number (MathSciNet)
MR3988764

Subjects
Primary: 62F03: Hypothesis testing; 62J05: Linear regression

Keywords
Knockoffs; variable selection; false discovery rate (FDR); high-dimensional regression

Citation

Barber, Rina Foygel; Candès, Emmanuel J. A knockoff filter for high-dimensional selective inference. Ann. Statist. 47 (2019), no. 5, 2504–2537. doi:10.1214/18-AOS1755. https://projecteuclid.org/euclid.aos/1564797855


Supplemental materials

  • Supplement to “A knockoff filter for high-dimensional selective inference”. We provide details for the proofs of several theoretical results in the paper, and report detailed results on the GWAS real data experiment.