The Annals of Statistics

PCA consistency in high dimension, low sample size context

Sungkyu Jung and J. S. Marron

Source: Ann. Statist. Volume 37, Number 6B (2009), 4104-4130.

Abstract

Principal Component Analysis (PCA) is an important tool for dimension reduction, especially when the dimension (or the number of variables) is very high. Asymptotic studies in which the sample size is fixed and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency), and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified, and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which include a ρ-mixing condition and a broad range of sphericity measures of the covariance matrix.
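
The contrast between consistency and strong inconsistency can be illustrated with a small simulation. The sketch below is not taken from the article. It assumes a single-spike population covariance diag(d^alpha, 1, ..., 1), Gaussian data with known mean zero, and a fixed sample size n = 20. The function name `first_pc_angle` and all parameter values are hypothetical choices made for illustration. In this setup one would expect the angle between the first sample PC direction and the first population PC direction to shrink as d grows when the spike is large (here alpha = 1.5) and to approach 90 degrees when the spike is small (here alpha = 0.3).

```python
import numpy as np


def first_pc_angle(d, n, alpha, rng):
    """Angle (degrees) between the first sample PC direction and e_1.

    Assumed model (illustrative only): rows of X are N(0, Sigma) with
    Sigma = diag(d**alpha, 1, ..., 1), so e_1 is the first population
    PC direction. The mean is treated as known (zero), so no centering.
    """
    scales = np.ones(d)
    scales[0] = d ** (alpha / 2.0)  # standard deviation of the spiked coordinate
    X = rng.standard_normal((n, d)) * scales  # n samples, d variables

    # The right singular vectors of X are the eigenvectors of the
    # sample covariance X^T X / n, ordered by decreasing eigenvalue.
    _, _, vt = np.linalg.svd(X, full_matrices=False)

    # The sign of a PC direction is arbitrary, so use |cos|.
    cos = min(abs(vt[0, 0]), 1.0)
    return float(np.degrees(np.arccos(cos)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20  # sample size stays fixed while the dimension d grows
    for alpha in (1.5, 0.3):
        for d in (100, 1000, 10000):
            angles = [first_pc_angle(d, n, alpha, rng) for _ in range(5)]
            print(f"alpha={alpha}, d={d}: mean angle = {np.mean(angles):.1f} degrees")
```

Running the script prints the mean angle over five replications for each (alpha, d) pair. The finite-d angles are only suggestive of the limiting behavior, and the exact values depend on the seed and on n.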

Primary Subjects: 62H25, 34L20
Secondary Subjects: 62F12
Keywords: Principal component analysis; sample covariance matrix; ρ-mixing; high dimension; low sample size data; nonstandard asymptotics; consistency and strong inconsistency; spiked population model

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1256303538
Digital Object Identifier: doi:10.1214/09-AOS709

2009 © Institute of Mathematical Statistics