Electronic Journal of Statistics

A global homogeneity test for high-dimensional linear regression

Camille Charbonnier, Nicolas Verzelen, and Fanny Villers

Full-text: Open access

Abstract

This paper is motivated by the comparison of genetic networks inferred from high-dimensional datasets originating from high-throughput Omics technologies. The aim is to test whether the differences observed between two inferred Gaussian graphical models come from real differences or arise from estimation uncertainties. Adopting a neighborhood approach, we consider a two-sample linear regression model with random design and propose a procedure to test whether these two regressions are the same. Relying on multiple testing and variable selection strategies, we develop a testing procedure that applies to high-dimensional settings where the number of covariates $p$ is larger than the numbers of observations $n_{1}$ and $n_{2}$ of the two samples. Both type I and type II errors are explicitly controlled from a non-asymptotic perspective, and the test is proved to be minimax adaptive to the sparsity. The performance of the test is evaluated on simulated data. Moreover, we illustrate how this procedure can be used to compare genetic networks on the Hess et al. breast cancer microarray dataset.

Article information

Source
Electron. J. Statist., Volume 9, Number 1 (2015), 318-382.

Dates
Received: August 2013
First available in Project Euclid: 17 March 2015

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1426611767

Digital Object Identifier
doi:10.1214/15-EJS999

Mathematical Reviews number (MathSciNet)
MR3323203

Zentralblatt MATH identifier
1310.62068

Subjects
Primary: 62H15: Hypothesis testing
Secondary: 62P10: Applications to biology and medical sciences

Keywords
Gaussian graphical model; two-sample hypothesis testing; high-dimensional statistics; multiple testing; adaptive testing; minimax hypothesis testing; detection boundary

Citation

Charbonnier, Camille; Verzelen, Nicolas; Villers, Fanny. A global homogeneity test for high-dimensional linear regression. Electron. J. Statist. 9 (2015), no. 1, 318–382. doi:10.1214/15-EJS999. https://projecteuclid.org/euclid.ejs/1426611767



References

  • [1] Ambroise, C., Chiquet, J., and Matias, C. (2009). Inferring sparse Gaussian graphical models with latent structure. Electron. J. Stat. 3, 205–238. http://dx.doi.org/10.1214/08-EJS314.
  • [2] Arias-Castro, E., Candès, E. J., and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist. 39, 5, 2533–2556. http://dx.doi.org/10.1214/11-AOS910.
  • [3] Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica 6, 2, 311–329.
  • [4] Baraud, Y., Giraud, C., and Huet, S. (2014). Estimator selection in the Gaussian setting. Ann. Inst. Henri Poincaré Probab. Stat. 50, 3, 1092–1119. http://dx.doi.org/10.1214/13-AIHP539.
  • [5] Baraud, Y., Huet, S., and Laurent, B. (2003). Adaptive tests of linear hypotheses by model selection. Ann. Statist. 31, 1, 225–251. http://dx.doi.org/10.1214/aos/1046294463.
  • [6] Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37, 4, 1705–1732. http://dx.doi.org/10.1214/08-AOS620.
  • [7] Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 4, 1212–1242. http://dx.doi.org/10.3150/12-BEJSP11.
  • [8] Cai, T., Liu, W., and Xia, Y. (2013). Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J. Amer. Statist. Assoc. 108, 501, 265–277. http://dx.doi.org/10.1080/01621459.2012.758041.
  • [9] Candès, E. J. and Plan, Y. (2009). Near-ideal model selection by $\ell_1$ minimization. Ann. Statist. 37, 5A, 2145–2177. http://dx.doi.org/10.1214/08-AOS653.
  • [10] Charbonnier, C., Verzelen, N., and Villers, F. (2015). Supplement to “A global homogeneity test for high-dimensional linear regression.” http://dx.doi.org/10.1214/15-EJS999SUPP.
  • [11] Chen, S. X. and Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38, 2, 808–835. http://dx.doi.org/10.1214/09-AOS716.
  • [12] Chiquet, J., Grandvalet, Y., and Ambroise, C. (2011). Inferring multiple graphical structures. Stat. Comput. 21, 4, 537–553. http://dx.doi.org/10.1007/s11222-010-9191-2.
  • [13] Chu, J., Lazarus, R., Carey, V., and Raby, B. (2011). A statistical framework for differential network analysis from microarray data. BMC Systems Biology 5.
  • [14] Chung, S., Suzuki, H., Miyamoto, T., Takamatsu, N., Tatsuguchi, A., Ueda, K., Kijima, K., Nakamura, Y., and Matsuo, Y. (2012). Development of an orally-administrative MELK-targeting inhibitor that suppresses the growth of various types of human cancer. Oncotarget 3, 1629–1640.
  • [15] Davidson, K. R. and Szarek, S. J. (2001). Local operator theory, random matrices and Banach spaces. In Handbook of the geometry of Banach spaces, Vol. I. North-Holland, Amsterdam, 317–366.
  • [16] Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32, 3, 962–994. http://dx.doi.org/10.1214/009053604000000265.
  • [17] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32, 2, 407–499. With discussion, and a rejoinder by the authors.
  • [18] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9, 3, 432–441.
  • [19] Gill, R., Datta, S., and Datta, S. (2010). A statistical framework for differential network analysis from microarray data. BMC Bioinformatics 11.
  • [20] Giraud, C., Huet, S., and Verzelen, N. (2012a). Graph selection with GGMselect. Stat. Appl. Genet. Mol. Biol. 11, 3, Art. 3, 52. http://dx.doi.org/10.1515/1544-6115.1625.
  • [21] Giraud, C., Huet, S., and Verzelen, N. (2012b). Supplement to ‘High-dimensional regression with unknown variance’.
  • [22] Heidel, J., Liu, J., Yen, Y., Zhou, B., Heale, B., Rossi, J., Bartlett, D., and Davis, M. (2007). Potent siRNA inhibitors of ribonucleotide reductase subunit RRM2 reduce cell proliferation in vitro and in vivo. Clinical Cancer Research 13.
  • [23] Hess, K., Anderson, K., Symmans, W., Valero, V., Ibrahim, N., Mejia, J., Booser, D., Theriault, R., Buzdar, U., Dempsey, P., Rouzier, R., Sneige, N., Ross, J., Vidaurre, T., Gómez, H., Hortobagyi, G., and Pusztai, L. (2006). Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. Journal of Clinical Oncology 24, 26, 4236–4244.
  • [24] Ingster, Y. I., Tsybakov, A. B., and Verzelen, N. (2010). Detection boundary in sparse regression. Electron. J. Stat. 4, 1476–1526. http://dx.doi.org/10.1214/10-EJS589.
  • [25] Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15, 2869–2909.
  • [26] Jeanmougin, M., Guedj, M., and Ambroise, C. (2011). Defining a robust biological prior from pathway analysis to drive network inference. Journal de la Société Française de Statistique 152, 97–110.
  • [27] Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28, 5, 1302–1338.
  • [28] Lauritzen, S. L. (1996). Graphical models. Oxford Statistical Science Series, Vol. 17. The Clarendon Press, Oxford University Press, New York. Oxford Science Publications.
  • [29] Li, J. and Chen, S. X. (2012). Two sample tests for high-dimensional covariance matrices. Ann. Statist. 40, 2, 908–940. http://dx.doi.org/10.1214/12-AOS993.
  • [30] Lin, Y., Chen, C., Cheng, C., and Yang, R. (2011). Domain and functional analysis of a novel breast tumor suppressor protein, SCUBE2. Journal of Biological Chemistry 29, 27039–27047.
  • [31] Lockhart, R., Taylor, J., Tibshirani, R. J., and Tibshirani, R. (2014). Correction: Rejoinder to “A significance test for the lasso”. Ann. Statist. 42, 5, 2138–2139. http://dx.doi.org/10.1214/14-AOS1261.
  • [32] Lopes, M., Jacob, L., and Wainwright, M. J. (2011). A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, 1206–1214.
  • [33] Meinshausen, N. (2013). Assumption-free confidence intervals for groups of variables in sparse high-dimensional regression. Available online at http://arxiv.org/abs/1309.3489.
  • [34] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34, 3, 1436–1462.
  • [35] Meinshausen, N., Meier, L., and Bühlmann, P. (2009). $p$-values for high-dimensional regression. J. Amer. Statist. Assoc. 104, 488, 1671–1681. http://dx.doi.org/10.1198/jasa.2009.tm08647.
  • [36] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37, 1, 246–270. http://dx.doi.org/10.1214/07-AOS582.
  • [37] Natowicz, R., Incitti, R., Horta, E., Charles, B., Guinot, P., Yan, K., Coutant, C., Andre, F., Pusztai, L., and Rouzier, R. (2008). Prediction of the outcome of preoperative chemotherapy in breast cancer using DNA probes that provide information on both complete and incomplete responses. BMC Bioinformatics 9.
  • [38] Raskutti, G., Wainwright, M., and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11, 2241–2259.
  • [39] Rouzier, R., Perou, C., Symmans, F., Ibrahim, N., Cristofanilli, M., Anderson, K., Hess, K., Stec, J., Ayers, M., Wagner, P., Morandi, P., Fan, C., Rabiul, I., Ross, J. S., Hortobagyi, G., and Pusztai, L. (2005). Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clinical Cancer Research 11.
  • [40] Rouzier, R., Rajan, R., Wagner, P., Hess, K., Gold, D., Stec, J., Ayers, M., Ross, J., Zhang, P., Buchholz, T., Kuerer, H., Green, M., Arun, B., Hortobagyi, G., Symmans, W., and Pusztai, L. (2005). Microtubule-associated protein tau: A marker of paclitaxel sensitivity in breast cancer. Proceedings of the National Academy of Sciences 102, 8315–8320.
  • [41] Shojaie, A. and Michailidis, G. (2010). Network enrichment analysis in complex experiments. Stat. Appl. Genet. Mol. Biol. 9, Art. 22, 36. http://dx.doi.org/10.2202/1544-6115.1483.
  • [42] Srivastava, M. S. and Du, M. (2008). A test for the mean vector with fewer observations than the dimension. J. Multivariate Anal. 99, 3, 386–402. http://dx.doi.org/10.1016/j.jmva.2006.11.002.
  • [43] Städler, N. and Mukherjee, S. (2012). Two-sample testing in high-dimensional models. Available online at http://arxiv.org/abs/1210.4584v2.
  • [44] Städler, N. and Mukherjee, S. (2015). Multivariate gene-set testing based on graphical models. Biostatistics 16, 1, 47–59.
  • [45] Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99, 4, 879–898. http://dx.doi.org/10.1093/biomet/ass043.
  • [46] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392. http://dx.doi.org/10.1214/09-EJS506.
  • [47] Verzelen, N. (2012). Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electron. J. Stat. 6, 38–90. http://dx.doi.org/10.1214/12-EJS666.
  • [48] Verzelen, N. and Villers, F. (2009). Tests for Gaussian graphical models. Comput. Statist. Data Anal. 53, 5, 1894–1905. http://dx.doi.org/10.1016/j.csda.2008.09.022.
  • [49] Verzelen, N. and Villers, F. (2010). Goodness-of-fit tests for high-dimensional Gaussian linear models. Ann. Statist. 38, 2, 704–752. http://dx.doi.org/10.1214/08-AOS629.
  • [50] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55, 5, 2183–2202. http://dx.doi.org/10.1109/TIT.2009.2016018.
  • [51] Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist. 37, 5A, 2178–2201. http://dx.doi.org/10.1214/08-AOS646.
  • [52] Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76, 1, 217–242. http://dx.doi.org/10.1111/rssb.12026.
