The Annals of Statistics

Spectrum estimation for large dimensional covariance matrices using random matrix theory

Noureddine El Karoui

Full-text: Open access

Abstract

Estimating the eigenvalues of a population covariance matrix from a sample covariance matrix is a problem of fundamental importance in multivariate statistics; the eigenvalues of covariance matrices play a key role in many widely used techniques, in particular in principal component analysis (PCA). In many modern data analysis problems, statisticians are faced with large datasets where the sample size, n, is of the same order of magnitude as the number of variables, p. Random matrix theory predicts that in this context, the eigenvalues of the sample covariance matrix are not good estimators of the eigenvalues of the population covariance matrix.
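The spreading of sample eigenvalues described above can be seen in a small simulation. The following is a minimal sketch, not taken from the paper: it assumes Gaussian data with identity population covariance (so every population eigenvalue equals 1) and a known zero mean, and the function name and the sizes n = 500, p = 250 are illustrative choices.

```python
import numpy as np


def sample_covariance_eigenvalues(n, p, seed=0):
    """Eigenvalues of the sample covariance of n draws from N(0, I_p).

    The mean is assumed known and equal to zero, so no centering is applied.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # n observations, p variables
    S = X.T @ X / n                   # p x p sample covariance matrix
    return np.linalg.eigvalsh(S)      # eigenvalues in ascending order


n, p = 500, 250                       # illustrative sizes with p / n = 0.5
eigs = sample_covariance_eigenvalues(n, p)

# Edges of the Marchenko-Pastur law for identity population covariance.
gamma = p / n
lower = (1 - np.sqrt(gamma)) ** 2
upper = (1 + np.sqrt(gamma)) ** 2

print("population eigenvalues: all equal to 1")
print(f"sample eigenvalue range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur support: [{lower:.3f}, {upper:.3f}]")
```

Although every population eigenvalue is 1, the sample eigenvalues spread over an interval whose width depends on the ratio p/n, which is why they cannot be used directly as estimates when p and n are comparable.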

We propose to use a fundamental result in random matrix theory, the Marčenko–Pastur equation, to better estimate the eigenvalues of large dimensional covariance matrices. The Marčenko–Pastur equation holds in very wide generality and under weak assumptions. The estimator we obtain can be thought of as “shrinking” the eigenvalues of the sample covariance matrix in a nonlinear fashion to estimate the population eigenvalues. Inspired by ideas of random matrix theory, we also suggest a change of point of view when thinking about estimation of high-dimensional vectors: we do not try to estimate the vectors directly but rather a probability measure that describes them. We think this is a theoretically more fruitful way to think about these problems.
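For orientation, the Marčenko–Pastur equation is commonly written in the following form. The notation is an assumption introduced here, since the abstract does not fix any: γ is the limit of p/n, H_∞ is the limiting distribution of the population eigenvalues, and v_∞ is the Stieltjes transform of the limiting spectral distribution of the n × n companion matrix XX*/n.

```latex
% Assumed notation (not from the abstract):
%   \gamma     = \lim p/n
%   H_\infty   = limiting spectral distribution of the population covariance
%   v_\infty   = Stieltjes transform of the limiting spectral distribution
%                of the n x n companion matrix X X^* / n
\[
  -\frac{1}{v_\infty(z)}
  = z - \gamma \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda\, v_\infty(z)},
  \qquad z \in \mathbb{C}^{+}.
\]
```

In this form the unknown measure H_∞ enters linearly once v_∞(z) is replaced by its empirical counterpart, which is what makes it natural to estimate a probability measure rather than the individual eigenvalues.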

In extensive simulations, our estimator is fast to compute and gives good or very good results. Our algorithmic approach is based on convex optimization. We also show that the proposed estimator is consistent.
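To make the convex-optimization idea concrete, here is a minimal sketch of one way such an estimator could be set up. It is not the author's code, and the abstract does not specify the formulation: the discretization of the population spectrum on a fixed grid, the choice of evaluation points z, the sup-norm linear program, and all names below are assumptions made for illustration. It relies on the Marčenko–Pastur equation in the form -1/v(z) = z - γ ∫ λ dH(λ) / (1 + λ v(z)), which is linear in H once v(z) is computed from the sample eigenvalues.

```python
import numpy as np
from scipy.optimize import linprog


def estimate_population_spectrum(sample_eigs, n, grid, z):
    """Fit weights on `grid` so the discretized equation nearly holds at each z.

    Hypothetical helper: H is approximated by sum_k w_k * delta_{grid[k]}.
    """
    p = sample_eigs.size
    gamma = p / n
    # Empirical Stieltjes transform of the companion spectral distribution.
    m = np.mean(1.0 / (sample_eigs[None, :] - z[:, None]), axis=1)
    v = -(1.0 - gamma) / z + gamma * m
    # Residual of -1/v = z - gamma * sum_k w_k t_k / (1 + t_k v) is b - A @ w.
    A = gamma * grid[None, :] / (1.0 + grid[None, :] * v[:, None])
    b = 1.0 / v + z
    # Split into real and imaginary parts to obtain a real linear program.
    A_ri = np.vstack([A.real, A.imag])
    b_ri = np.concatenate([b.real, b.imag])
    K = grid.size
    ones = np.ones((A_ri.shape[0], 1))
    # Variables (w_1, ..., w_K, u): minimize u with |b - A w| <= u componentwise.
    c = np.zeros(K + 1)
    c[-1] = 1.0
    A_ub = np.vstack([np.hstack([A_ri, -ones]), np.hstack([-A_ri, -ones])])
    b_ub = np.concatenate([b_ri, -b_ri])
    # The weights form a probability vector: nonnegative and summing to one.
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (K + 1), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:K]


# Illustrative data: identity population covariance, so H is a point mass at 1.
rng = np.random.default_rng(0)
n, p = 500, 250
X = rng.standard_normal((n, p))
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)

grid = np.linspace(0.0, 3.0, 31)  # candidate population eigenvalues
# Points in the upper half-plane; the two far points mainly constrain the mean.
z = np.concatenate([np.linspace(0.05, 4.0, 40) + 0.3j, [50j, 200j]])

weights = estimate_population_spectrum(sample_eigs, n, grid, z)
print("estimated mean of population spectrum:", float(weights @ grid))
print("grid points with weight > 0.05:", grid[weights > 0.05])
```

The output is a probability vector over the grid rather than a list of p eigenvalue estimates, in line with the point of view of estimating a measure. The grid range, the evaluation points, and the norm used for the residual are tuning choices in this sketch.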

Article information

Source
Ann. Statist., Volume 36, Number 6 (2008), 2757-2790.

Dates
First available in Project Euclid: 5 January 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1231165184

Digital Object Identifier
doi:10.1214/07-AOS581

Mathematical Reviews number (MathSciNet)
MR2485012

Zentralblatt MATH identifier
1168.62052

Subjects
Primary: 62H12: Estimation
Secondary: 62-09: Graphical methods

Keywords
Covariance matrices; principal component analysis; eigenvalues of covariance matrices; high-dimensional inference; random matrix theory; Stieltjes transform; Marčenko–Pastur equation; convex optimization

Citation

El Karoui, Noureddine. Spectrum estimation for large dimensional covariance matrices using random matrix theory. Ann. Statist. 36 (2008), no. 6, 2757–2790. doi:10.1214/07-AOS581. https://projecteuclid.org/euclid.aos/1231165184


