Electronic Journal of Statistics

Significance testing in non-sparse high-dimensional linear models

Yinchu Zhu and Jelena Bradic

Abstract

In high-dimensional linear models, the sparsity assumption is typically made, stating that most of the parameters are equal to zero. Under the sparsity assumption, estimation and, recently, inference have been well studied. However, in practice, the sparsity assumption is not checkable and, more importantly, is often violated; a large number of covariates might be expected to be associated with the response, indicating that possibly all, rather than just a few, parameters are non-zero. A natural example is genome-wide gene expression profiling, where all genes are believed to affect a common disease marker. We show that existing inferential methods are sensitive to the sparsity assumption and may, in turn, result in a severe lack of control of Type I error. In this article, we propose a new inferential method, named CorrT, which is robust to model misspecification such as heteroscedasticity and lack of sparsity. CorrT is shown to have Type I error approaching the nominal level for any model and Type II error approaching zero for sparse and many dense models. In fact, CorrT is also shown to be optimal in a variety of frameworks: sparse, non-sparse and hybrid models where sparse and dense signals are mixed. Numerical experiments show a favorable performance of the CorrT test compared to the state-of-the-art methods.

Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 3312-3364.

Dates
Received: January 2017
First available in Project Euclid: 6 October 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1538791404

Digital Object Identifier
doi:10.1214/18-EJS1443

Mathematical Reviews number (MathSciNet)
MR3861831

Rights
Creative Commons Attribution 4.0 International License.

Citation

Zhu, Yinchu; Bradic, Jelena. Significance testing in non-sparse high-dimensional linear models. Electron. J. Statist. 12 (2018), no. 2, 3312–3364. doi:10.1214/18-EJS1443. https://projecteuclid.org/euclid.ejs/1538791404

