Are discoveries spurious? Distributions of maximum spurious correlations and their applications

Jianqing Fan; Qi-Man Shao; Wen-Xin Zhou

doi:10.1214/17-AOS1575

Ann. Statist. 46(3): 989-1017 (June 2018). DOI: 10.1214/17-AOS1575

Abstract

Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
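To make the notion of a maximum spurious correlation concrete, the following is a minimal sketch for the simplest case $s = 1$, where the statistic reduces to the largest absolute sample correlation between $Y$ and a single covariate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the simulated Gaussian design, and the simplified resampling step (replacing the response by independent standard normal multipliers and taking an upper quantile) are all hypothetical choices made for this example.

```python
import numpy as np


def max_abs_correlation(X, y):
    """Largest absolute sample correlation between y and any single column of X (s = 1)."""
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    num = Xc.T @ yc                  # one inner product per covariate
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return float(np.max(np.abs(num / den)))


def bootstrap_upper_limit(X, level=0.95, n_boot=500, seed=0):
    """Upper quantile of the maximum correlation when the response is replaced by
    independent N(0, 1) multipliers (a simplified, hypothetical variant for s = 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = [max_abs_correlation(X, rng.standard_normal(n)) for _ in range(n_boot)]
    return float(np.quantile(draws, level))


# Simulated data (assumption): X and y are independent by construction,
# so any correlation found between them is spurious.
rng = np.random.default_rng(1)
n, p = 50, 1000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

observed = max_abs_correlation(X, y)
limit = bootstrap_upper_limit(X)
print(f"observed maximum spurious correlation: {observed:.3f}")
print(f"resampled 95% upper limit:             {limit:.3f}")
```

With many more covariates than observations, the observed maximum is typically far from zero even though no covariate is related to the response. A correlation reported by a variable selection method would need to exceed an upper limit of this kind before it could be distinguished from a purely spurious one.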

Citation

Jianqing Fan, Qi-Man Shao, and Wen-Xin Zhou. "Are discoveries spurious? Distributions of maximum spurious correlations and their applications." The Annals of Statistics 46(3), 989-1017 (June 2018). https://doi.org/10.1214/17-AOS1575

Received: 1 October 2016; Published: June 2018
