Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
Ann. Statist. 46(3): 989-1017 (June 2018). DOI: 10.1214/17-AOS1575
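The idea of a maximum spurious correlation and its bootstrap calibration can be illustrated with a minimal sketch. It covers only the simplest case $s = 1$, where the statistic reduces to the largest absolute sample correlation between $Y$ and a single covariate, and it uses a simplified Gaussian multiplier bootstrap on the products of standardized variables. This is an assumption-laden illustration rather than the procedure of the paper: the function names `max_spurious_correlation` and `multiplier_bootstrap_limit`, the sample sizes, and the number of bootstrap draws are all hypothetical choices made here.

```python
import numpy as np


def max_spurious_correlation(X, y):
    """Largest absolute sample correlation between y and any single
    column of X (the s = 1 case only)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    Xs = Xc / np.linalg.norm(Xc, axis=0)   # unit-norm centered columns
    ys = yc / np.linalg.norm(yc)           # unit-norm centered response
    return float(np.max(np.abs(Xs.T @ ys)))


def multiplier_bootstrap_limit(X, y, level=0.95, n_boot=500, seed=0):
    """Approximate upper limit for the s = 1 maximum spurious correlation
    using Gaussian multipliers (a simplified sketch, not the paper's
    exact procedure)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    Z = Xz * yz[:, None]        # column means of Z are the sample correlations
    Zc = Z - Z.mean(axis=0)     # center before applying multipliers
    E = rng.standard_normal((n_boot, n))          # one multiplier vector per draw
    stats = np.max(np.abs(E @ Zc), axis=1) / n    # bootstrap maxima, correlation scale
    return float(np.quantile(stats, level))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 50, 1000                      # illustrative sizes, chosen here
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)           # independent of X by construction
    print("max spurious correlation:", max_spurious_correlation(X, y))
    print("bootstrap 95% upper limit:", multiplier_bootstrap_limit(X, y))
```

Even though $\mathbf{X}$ and $Y$ are independent in this simulation, the largest sample correlation is far from zero when $p$ is much larger than $n$. An observed correlation from a variable selection procedure can then be compared against the bootstrap limit as a rough baseline.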