Open Access
A knockoff filter for high-dimensional selective inference
Rina Foygel Barber, Emmanuel J. Candès
Ann. Statist. 47(5): 2504–2537 (October 2019). DOI: 10.1214/18-AOS1755
Abstract

This paper develops a framework for testing for associations in a possibly high-dimensional linear model where the number of features/variables may far exceed the number of observational units. In this framework, the observations are split into two groups, where the first group is used to screen for a set of potentially relevant variables, whereas the second is used for inference over this reduced set of variables; we also develop strategies for leveraging information from the first part of the data at the inference step for greater power. In our work, the inferential step is carried out by applying the recently introduced knockoff filter, which creates a knockoff copy—a fake variable serving as a control—for each screened variable. We prove that this procedure controls the directional false discovery rate (FDR) in the reduced model controlling for all screened variables; this says that our high-dimensional knockoff procedure “discovers” important variables as well as the directions (signs) of their effects, in such a way that the expected proportion of wrongly chosen signs is below the user-specified level (thereby controlling a notion of Type S error averaged over the selected set). This result is nonasymptotic, and holds for any distribution of the original features and any values of the unknown regression coefficients, so that inference is not calibrated under hypothesized values of the effect sizes. We demonstrate the performance of our general and flexible approach through numerical studies, showing more power than existing alternatives. Finally, we apply our method to a genome-wide association study to find locations on the genome that are possibly associated with a continuous phenotype.
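To make the two-stage procedure described above concrete, here is a minimal sketch in Python of one way such a screen-then-infer pipeline could look. It is an illustration under simplifying assumptions, not the authors' implementation: it uses the equicorrelated fixed-X knockoff construction and knockoff+ threshold of Barber and Candès (2015), a lasso coefficient-difference statistic, and heuristic, hard-coded regularization levels; the function names (`equicorrelated_knockoffs`, `knockoff_filter`, `split_screen_knockoff`) are invented for this sketch, and the paper's power-boosting reuse of the screening data at the inference step is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def equicorrelated_knockoffs(X, rng):
    """Fixed-X equicorrelated knockoffs (requires n >= 2p and X full rank)."""
    n, p = X.shape
    X = X / np.linalg.norm(X, axis=0)             # unit-norm columns
    Sigma = X.T @ X
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]) * np.ones(p)
    Sinv_s = np.linalg.solve(Sigma, np.diag(s))   # Sigma^{-1} diag(s)
    # Orthonormal basis orthogonal to col(X), via QR of [X | random block].
    Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
    U = Q[:, p:2 * p]
    # C with C'C = 2 diag(s) - diag(s) Sigma^{-1} diag(s), PSD by choice of s.
    A = 2 * np.diag(s) - np.diag(s) @ Sinv_s
    w, V = np.linalg.eigh((A + A.T) / 2)
    C = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return X, X @ (np.eye(p) - Sinv_s) + U @ C

def knockoff_filter(X, y, q=0.2, rng=None):
    """Select variables (with estimated effect signs) at target FDR level q."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, p = X.shape
    X, Xk = equicorrelated_knockoffs(X, rng)
    lam = np.std(y) * np.sqrt(2 * np.log(p) / n)  # heuristic level (assumption)
    b = Lasso(alpha=lam, fit_intercept=False).fit(np.hstack([X, Xk]), y).coef_
    W = np.abs(b[:p]) - np.abs(b[p:])             # large positive => likely signal
    # Knockoff+ threshold: smallest t with (1 + #{W <= -t}) / #{W >= t} <= q.
    T = np.inf
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            T = t
            break
    selected = np.where(W >= T)[0]
    return selected, np.sign(b[selected])         # signs read off the lasso fit

def split_screen_knockoff(X, y, q=0.2, n0=None):
    """Stage 1: screen on the first n0 rows; stage 2: knockoffs on the rest."""
    n, p = X.shape
    n0 = n0 if n0 is not None else n // 2
    coef = Lasso(alpha=0.1, fit_intercept=False).fit(X[:n0], y[:n0]).coef_
    S = np.flatnonzero(coef)                      # screened set of variables
    if 2 * len(S) > n - n0:
        raise ValueError("need n - n0 >= 2|S| to build fixed-X knockoffs")
    sel, signs = knockoff_filter(X[n0:][:, S], y[n0:], q=q)
    return S[sel], signs
```

A call such as `selected, signs = split_screen_knockoff(X, y, q=0.2)` returns the screened-then-selected variable indices together with estimated effect directions. Note that the paper's directional FDR guarantee concerns the signs of coefficients in the reduced model over all screened variables; reading signs off a single lasso fit, as done here, only approximates that and is meant purely to illustrate the structure of the procedure.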

Copyright © 2019 Institute of Mathematical Statistics
Received: 1 February 2016; Published: October 2019