The Annals of Statistics

Two sample tests for high-dimensional covariance matrices

Jun Li and Song Xi Chen
Source: Ann. Statist. Volume 40, Number 2 (2012), 908-940.

Abstract

We propose two tests for the equality of covariance matrices between two high-dimensional populations. One test is on the whole variance–covariance matrices, and the other is on off-diagonal sub-matrices, which define the covariance between two nonoverlapping segments of the high-dimensional random vectors. The tests are applicable (i) when the data dimension is much larger than the sample sizes, namely the “large $p$, small $n$” situation, and (ii) without assuming parametric distributions for the two populations. In both respects the proposed tests go beyond the capability of the conventional likelihood ratio test. The proposed tests can be used to test covariances associated with gene ontology terms.
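
To make the testing problem concrete, the following is a minimal sketch of a distribution-free two-sample comparison of covariance matrices. It is not the statistic proposed in the article: it uses a plug-in squared Frobenius distance between the two sample covariance matrices and calibrates it by permutation, and the plug-in distance is known to be noisy when $p$ is large relative to $n$. The function names, the per-sample centring step and the default number of permutations are all assumptions made for illustration.

```python
import numpy as np


def frobenius_cov_distance(x, y):
    """Squared Frobenius norm of the difference of two sample covariance matrices.

    x and y are (n1, p) and (n2, p) arrays with observations in rows.
    """
    diff = np.cov(x, rowvar=False) - np.cov(y, rowvar=False)
    return float(np.sum(diff ** 2))


def permutation_cov_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: both samples share one covariance matrix.

    Illustrative only: a plug-in statistic with permutation calibration,
    not the test statistic of the article.
    """
    rng = np.random.default_rng(seed)
    # Assumption: centre each sample by its own mean so that a difference
    # in means is not mistaken for a difference in covariances.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    observed = frobenius_cov_distance(x, y)

    pooled = np.vstack([x, y])
    n1 = x.shape[0]
    exceed = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the statistic.
        idx = rng.permutation(pooled.shape[0])
        stat = frobenius_cov_distance(pooled[idx[:n1]], pooled[idx[n1:]])
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Calling `permutation_cov_test(x, y)` on two arrays with the same number of columns returns the observed distance and a p-value; a small p-value suggests the covariance matrices differ. Because no parametric model is fitted, the sketch shares the distribution-free spirit of aspect (ii), but it makes no claim to the high-dimensional guarantees described in the abstract.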

Primary Subjects: 62H15
Secondary Subjects: 62G10, 62G20
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1338515142
Digital Object Identifier: doi:10.1214/12-AOS993
Zentralblatt MATH identifier: 06073780
Mathematical Reviews number (MathSciNet): MR2985938


2013 © Institute of Mathematical Statistics
