The Annals of Statistics

Convergence and prediction of principal component scores in high-dimensional settings

Seunggeun Lee, Fei Zou, and Fred A. Wright


Abstract

A number of settings arise in which it is of interest to predict Principal Component (PC) scores for new observations using data from an initial sample. In this paper, we demonstrate that naive approaches to PC score prediction can be substantially biased toward 0 in the analysis of large matrices. This phenomenon is largely related to known inconsistency results for sample eigenvalues and eigenvectors as both dimensions of the matrix increase. For the spiked eigenvalue model for random matrices, we expand the generality of these results, and propose bias-adjusted PC score prediction. In addition, we compute the asymptotic correlation coefficient between PC scores from sample and population eigenvectors. Simulation and real data examples from the genetics literature show the improved bias and numerical properties of our estimators.
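The shrinkage described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes a single-spike Gaussian model with mean zero, and the function name `simulate_shrinkage`, the dimensions, and the spike size are all hypothetical choices made for illustration. It compares the spread of naively predicted PC scores for new observations with the spread of the training-sample PC scores, and also reports the correlation between training scores computed from the sample eigenvector and from the population eigenvector.

```python
import numpy as np


def simulate_shrinkage(n=100, p=1000, spike=21.0, n_new=2000, seed=0):
    """Simulate naive PC score prediction under a one-spike covariance model.

    Assumptions (illustrative only): the population covariance is diagonal
    with one eigenvalue equal to `spike` and all others equal to 1, and the
    data have mean zero, so no centering step is applied.
    """
    rng = np.random.default_rng(seed)

    # Population PC direction is taken to be the first coordinate axis.
    scale = np.ones(p)
    scale[0] = np.sqrt(spike)

    # Initial (training) sample and independent new observations.
    X = rng.standard_normal((n, p)) * scale
    X_new = rng.standard_normal((n_new, p)) * scale

    # Leading sample eigenvector of X'X/n, obtained from the SVD of X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]

    # PC scores in the training sample, and naive predictions for new data
    # obtained by projecting onto the same sample eigenvector.
    train_scores = X @ v
    naive_pred = X_new @ v

    # Root-mean-square size of each set of scores (the mean is zero by design).
    rms_train = np.sqrt(np.mean(train_scores ** 2))
    rms_pred = np.sqrt(np.mean(naive_pred ** 2))
    ratio = rms_pred / rms_train

    # Correlation between training scores from the sample eigenvector and
    # from the population eigenvector (sign of v is arbitrary, so take abs).
    pop_scores = X[:, 0]
    corr = abs(np.corrcoef(train_scores, pop_scores)[0, 1])

    return ratio, corr


if __name__ == "__main__":
    ratio, corr = simulate_shrinkage()
    print(f"predicted / training score scale: {ratio:.3f}")
    print(f"correlation with population PC scores: {corr:.3f}")
```

A ratio well below 1 indicates that the naively predicted scores are pulled toward 0 relative to the training scores, even though both use the same eigenvector. Increasing `n` with `p` held fixed should move the ratio toward 1, which is one way to see that the effect depends on both dimensions of the matrix being large.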

Article information

Source
Ann. Statist. Volume 38, Number 6 (2010), 3605-3629.

Dates
First available in Project Euclid: 30 November 2010

Permanent link to this document
http://projecteuclid.org/euclid.aos/1291126967

Digital Object Identifier
doi:10.1214/10-AOS821

Zentralblatt MATH identifier
1204.62097

Mathematical Reviews number (MathSciNet)
MR2766862

Subjects
Primary: 62H25: Factor analysis and principal components; correspondence analysis
Secondary: 15A18: Eigenvalues, singular values, and eigenvectors

Keywords
PCA; PC scores; random matrix; PC regression

Citation

Lee, Seunggeun; Zou, Fei; Wright, Fred A. Convergence and prediction of principal component scores in high-dimensional settings. The Annals of Statistics 38 (2010), no. 6, 3605–3629. doi:10.1214/10-AOS821. http://projecteuclid.org/euclid.aos/1291126967.


