The Annals of Statistics

Two sample tests for high-dimensional covariance matrices

Jun Li and Song Xi Chen

Abstract

We propose two tests for the equality of covariance matrices between two high-dimensional populations. One test is on the whole variance–covariance matrices, and the other is on off-diagonal sub-matrices, which define the covariance between two nonoverlapping segments of the high-dimensional random vectors. The tests are applicable (i) when the data dimension is much larger than the sample sizes, namely the “large $p$, small $n$” situation, and (ii) without assuming parametric distributions for the two populations. These two aspects surpass the capability of the conventional likelihood ratio test. The proposed tests can be applied to covariances associated with gene ontology terms.
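As a small illustration of why the “large $p$, small $n$” case is out of reach for the conventional likelihood ratio test, the sketch below checks the rank of a sample covariance matrix when the dimension exceeds the sample size. It is not the authors' procedure. The function name `sample_covariance_rank`, the Gaussian data, and the sizes $n = 10$, $p = 50$ are assumptions chosen only for illustration.

```python
import numpy as np


def sample_covariance_rank(n, p, seed=0):
    """Return the shape and numerical rank of a p x p sample covariance
    matrix computed from n observations (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # n observations of a p-dimensional vector
    S = np.cov(X, rowvar=False)       # rows are observations, so S is p x p
    return S.shape, np.linalg.matrix_rank(S)


# Illustrative sizes only: dimension p much larger than sample size n.
shape, rank = sample_covariance_rank(n=10, p=50)
print(shape, rank)
```

Because the rank of the sample covariance matrix is at most $n - 1$, the matrix is singular whenever $p \geq n$. Its determinant is then zero, so a likelihood ratio statistic built from log-determinants of sample covariance matrices is not well defined in this regime.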

Article information

Source
Ann. Statist. Volume 40, Number 2 (2012), 908-940.

Dates
First available in Project Euclid: 1 June 2012

Permanent link to this document
https://projecteuclid.org/euclid.aos/1338515142

Digital Object Identifier
doi:10.1214/12-AOS993

Mathematical Reviews number (MathSciNet)
MR2985938

Zentralblatt MATH identifier
1274.62383

Subjects
Primary: 62H15: Hypothesis testing
Secondary: 62G10: Hypothesis testing; 62G20: Asymptotic properties

Keywords
High-dimensional covariance; large $p$ small $n$; likelihood ratio test; testing for gene-sets

Citation

Li, Jun; Chen, Song Xi. Two sample tests for high-dimensional covariance matrices. Ann. Statist. 40 (2012), no. 2, 908–940. doi:10.1214/12-AOS993. https://projecteuclid.org/euclid.aos/1338515142.

