Annals of Statistics

PCA consistency in high dimension, low sample size context

Sungkyu Jung and J. S. Marron


Principal Component Analysis (PCA) is an important tool for dimension reduction, especially when the dimension (or the number of variables) is very high. Asymptotic studies in which the sample size is fixed and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency), and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified, and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which include a ρ-mixing condition and a broad range of sphericity measures of the covariance matrix.
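
The contrast between consistency and strong inconsistency can be illustrated numerically. The following is a minimal sketch, not the authors' method: it assumes Gaussian data with known mean zero, a single-spike population covariance diag(d^alpha, 1, ..., 1), and a fixed sample size n = 20. The function name `first_pc_angle`, the exponent `alpha`, and all numerical settings are hypothetical choices made for illustration only. It reports the angle between the first sample PC direction and the first population PC direction as the dimension d grows.

```python
import numpy as np


def first_pc_angle(d, alpha, n=20, seed=0):
    """Angle (degrees) between the first sample PC direction and the first
    population PC direction e_1, under an assumed single-spike model."""
    rng = np.random.default_rng(seed)
    # Assumed population covariance: diag(d**alpha, 1, ..., 1),
    # so the first population PC direction is the first coordinate axis.
    scale = np.ones(d)
    scale[0] = np.sqrt(float(d) ** alpha)
    # n mean-zero Gaussian observations stored as rows (no centering step).
    X = rng.standard_normal((n, d)) * scale
    # Right singular vectors of X are the eigenvectors of the sample
    # covariance X^T X / n; the thin SVD avoids forming the d x d matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    cosine = min(1.0, abs(vt[0, 0]))  # guard against rounding above 1
    return float(np.degrees(np.arccos(cosine)))


if __name__ == "__main__":
    for d in (50, 500, 5000):
        strong = first_pc_angle(d, alpha=2.0)  # spike large relative to d
        weak = first_pc_angle(d, alpha=0.2)    # spike small relative to d
        print(f"d={d:5d}  angle(alpha=2.0)={strong:6.2f}  angle(alpha=0.2)={weak:6.2f}")
```

Under these assumptions, a small angle for the large spike corresponds to consistency of the first PC direction, and an angle near 90 degrees for the small spike corresponds to strong inconsistency. The sketch covers only one spike and one random seed, so it shows the qualitative behavior rather than the general conditions of the main theorem.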

Article information

Ann. Statist., Volume 37, Number 6B (2009), 4104–4130.

First available in Project Euclid: 23 October 2009

Primary: 62H25 (Factor analysis and principal components; correspondence analysis); 34L20 (Asymptotic distribution of eigenvalues, asymptotic theory of eigenfunctions)
Secondary: 62F12 (Asymptotic properties of estimators)

Keywords: Principal component analysis; sample covariance matrix; ρ-mixing; high dimension, low sample size data; nonstandard asymptotics; consistency and strong inconsistency; spiked population model


Jung, Sungkyu; Marron, J. S. PCA consistency in high dimension, low sample size context. Ann. Statist. 37 (2009), no. 6B, 4104–4130. doi:10.1214/09-AOS709.
