The Annals of Statistics

Testing independence with high-dimensional correlated samples

Abstract

Testing independence among a number of (ultra) high-dimensional random samples is a fundamental and challenging problem. By arranging $n$ identically distributed $p$-dimensional random vectors into a $p\times n$ data matrix, we investigate the problem of testing independence among columns under the matrix-variate normal modeling of data. We propose a computationally simple and tuning-free test statistic, characterize its limiting null distribution, analyze the statistical power and prove its minimax optimality. As an important by-product of the test statistic, a ratio-consistent estimator for the quadratic functional of a covariance matrix from correlated samples is developed. We further study the effect of correlation among samples on an important high-dimensional inference problem: large-scale multiple testing of Pearson’s correlation coefficients. Indeed, blindly using classical inference results based on the assumed independence of samples will lead to many false discoveries, which suggests the need for conducting independence testing before applying existing methods. To address the challenge arising from correlation among samples, we propose a “sandwich estimator” of Pearson’s correlation coefficient by de-correlating the samples. Based on this approach, the resulting multiple testing procedure asymptotically controls the overall false discovery rate at the nominal level while maintaining good statistical power. Both simulated and real data experiments are carried out to demonstrate the advantages of the proposed methods.
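
For orientation, the testing problem described in the abstract can be written out under the standard matrix-variate normal parametrization. This is only a sketch: the symbols $\mathbf{M}$, $\mathbf{A}$, $\mathbf{B}$ and $a_{ij}$ are notation assumed here for illustration and may differ from the notation used in the paper, and the test statistic itself is not reproduced.

```latex
% Assumed notation (not taken from the abstract):
%   X = (X_1, ..., X_n) is the p x n data matrix, M its mean matrix,
%   B (p x p) is the covariance among the p variables (rows),
%   A = (a_{ij}) (n x n) is the covariance among the n samples (columns).
\[
  \operatorname{vec}(\mathbf{X}) \sim
  N_{pn}\bigl(\operatorname{vec}(\mathbf{M}),\; \mathbf{A}\otimes\mathbf{B}\bigr),
  \qquad\text{so that}\qquad
  \operatorname{Cov}(X_i, X_j) = a_{ij}\,\mathbf{B}.
\]
% Because the columns are jointly Gaussian, they are mutually independent
% exactly when every off-diagonal a_{ij} vanishes, which gives the hypotheses
\[
  H_0:\ \mathbf{A}\ \text{is diagonal}
  \qquad\text{versus}\qquad
  H_1:\ \mathbf{A}\ \text{is not diagonal}.
\]
```

Under this parametrization, the classical setting of independent samples corresponds to a diagonal $\mathbf{A}$, while any nonzero off-diagonal entry of $\mathbf{A}$ induces correlation between the corresponding pair of samples.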

Article information

Source
Ann. Statist., Volume 46, Number 2 (2018), 866-894.

Dates
Revised: March 2017
First available in Project Euclid: 3 April 2018

https://projecteuclid.org/euclid.aos/1522742439

Digital Object Identifier
doi:10.1214/17-AOS1571

Mathematical Reviews number (MathSciNet)
MR3782387

Zentralblatt MATH identifier
06870282

Subjects
Primary: 62F05: Asymptotic properties of tests
Secondary: 62H10: Distribution of statistics

Citation

Chen, Xi; Liu, Weidong. Testing independence with high-dimensional correlated samples. Ann. Statist. 46 (2018), no. 2, 866–894. doi:10.1214/17-AOS1571. https://projecteuclid.org/euclid.aos/1522742439

Supplemental materials

• Supplement to “Testing independence with high-dimensional correlated samples”. We provide the proofs of all the theoretical results as well as additional simulated and real experimental results.