The Annals of Statistics

Independence test for high dimensional data based on regularized canonical correlation coefficients

Yanrong Yang and Guangming Pan

Full-text: Open access

Abstract

This paper proposes a new statistic to test independence between two high dimensional random vectors $\mathbf{X}:p_{1}\times1$ and $\mathbf{Y}:p_{2}\times1$. The proposed statistic is based on the sum of regularized sample canonical correlation coefficients of $\mathbf{X}$ and $\mathbf{Y}$. The asymptotic distribution of the statistic under the null hypothesis is established as a corollary of general central limit theorems (CLT) for the linear statistics of classical and regularized sample canonical correlation coefficients when $p_{1}$ and $p_{2}$ are both comparable to the sample size $n$. As applications of the developed independence test, various types of dependence structures, such as factor models, ARCH models and a general uncorrelated but dependent case, are investigated by simulations. As an empirical application, cross-sectional dependence of daily stock returns of companies between different sections in the New York Stock Exchange (NYSE) is detected by the proposed test.
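To make the central quantity concrete, the following is a minimal sketch of a sum of regularized sample canonical correlation coefficients. It assumes a ridge-type regularization in which $t\mathbf{I}$ is added to both sample covariance matrices. The abstract does not give the paper's exact definition, so this form may differ from the authors'. The function name `regularized_cc_sum` and the parameter `t` are hypothetical, and the centering and scaling constants needed for the CLT-based test are not reproduced here.

```python
import numpy as np


def regularized_cc_sum(X, Y, t=0.0):
    """Sum of the eigenvalues of (Sxx + t I)^{-1} Sxy (Syy + t I)^{-1} Syx.

    X has shape (n, p1) and Y has shape (n, p2); rows are observations.
    With t = 0 and invertible sample covariances this is the sum of the
    squared classical sample canonical correlations. The two-sided ridge
    term t*I is an illustrative assumption, not the paper's definition.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # center each coordinate
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n      # sample covariance of X
    Syy = Yc.T @ Yc / n      # sample covariance of Y
    Sxy = Xc.T @ Yc / n      # sample cross-covariance
    p1, p2 = Sxx.shape[0], Syy.shape[0]
    # Solve linear systems instead of forming explicit inverses.
    A = np.linalg.solve(Sxx + t * np.eye(p1), Sxy)    # shape (p1, p2)
    B = np.linalg.solve(Syy + t * np.eye(p2), Sxy.T)  # shape (p2, p1)
    # The sum of the eigenvalues equals the trace.
    return float(np.trace(A @ B))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))  # independent of Y by construction
    Y = rng.standard_normal((200, 40))
    print(regularized_cc_sum(X, Y, t=0.5))
```

When $p_{1}$ or $p_{2}$ is at least as large as $n$, the sample covariance matrices are singular, so a positive `t` is needed for the linear solves to be well defined. Turning the raw sum into a test would additionally require the null centering and variance from the paper's theorems.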

Article information

Source
Ann. Statist., Volume 43, Number 2 (2015), 467-500.

Dates
First available in Project Euclid: 24 February 2015

Permanent link to this document
https://projecteuclid.org/euclid.aos/1424787425

Digital Object Identifier
doi:10.1214/14-AOS1284

Mathematical Reviews number (MathSciNet)
MR3316187

Zentralblatt MATH identifier
1344.60027

Subjects
Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]

Keywords
Canonical correlation coefficients; central limit theorem; large dimensional random matrix theory; independence test; linear spectral statistics

Citation

Yang, Yanrong; Pan, Guangming. Independence test for high dimensional data based on regularized canonical correlation coefficients. Ann. Statist. 43 (2015), no. 2, 467–500. doi:10.1214/14-AOS1284. https://projecteuclid.org/euclid.aos/1424787425




Supplemental materials

  • Supplementary material: Supplement to “Independence test for high dimensional data based on regularized canonical correlation coefficients”. The supplementary material is divided into Appendices A and B. Appendix A gives some useful lemmas and the proofs of all theorems and of Propositions 4–5. Appendix B provides a theorem on the CLT for a sample covariance matrix plus a perturbation matrix.