The Annals of Statistics

Asymptotics of empirical eigenstructure for high dimensional spiked covariance

Weichen Wang and Jianqing Fan

Abstract

We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the magnitude of spiked eigenvalues, sample size and dimensionality. This regime allows high dimensionality and diverging eigenvalues and provides new insights into the roles that the leading eigenvalues, sample size and dimensionality play in principal component analysis. Our results are a natural extension of those in [Statist. Sinica 17 (2007) 1617–1642] to a more general setting and solve the rates of convergence problems in [Statist. Sinica 26 (2016) 1747–1770]. They also reveal the biases of estimating leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called Shrinkage Principal Orthogonal complEment Thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in estimation of risks for large portfolios and false discovery proportions for dependent test statistics and are illustrated by simulation studies.
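The bias mentioned in the abstract can be seen in a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes a single-spike Gaussian model with population covariance diag(spike, 1, ..., 1), and the values of n, p, spike and the number of replications are arbitrary choices. It compares the leading eigenvalue of the sample covariance matrix with the true spike.

```python
import numpy as np


def top_sample_eigenvalue(n, p, spike, rng):
    """Leading eigenvalue of the sample covariance for one simulated data set.

    Assumed model (hypothetical, for illustration): n i.i.d. mean-zero Gaussian
    rows with population covariance diag(spike, 1, ..., 1).
    """
    scales = np.ones(p)
    scales[0] = np.sqrt(spike)
    X = rng.standard_normal((n, p)) * scales  # n x p data matrix
    # Eigenvalues of X^T X / n are the squared singular values of X divided by n.
    singular_values = np.linalg.svd(X, compute_uv=False)
    return singular_values[0] ** 2 / n


rng = np.random.default_rng(0)
n, p, spike = 50, 200, 20.0  # arbitrary illustrative values with p > n
estimates = np.array(
    [top_sample_eigenvalue(n, p, spike, rng) for _ in range(200)]
)

# Ratio of the average empirical leading eigenvalue to the true spike.
mean_ratio = estimates.mean() / spike
print(f"average leading sample eigenvalue / true spike: {mean_ratio:.3f}")
```

Under these assumptions the ratio comes out above 1, which matches the upward bias of the empirical leading eigenvalue that the abstract describes. The sketch does not implement S-POET itself.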

Article information

Source
Ann. Statist., Volume 45, Number 3 (2017), 1342-1374.

Dates
Received: September 2015
Revised: June 2016
First available in Project Euclid: 13 June 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1497319697

Digital Object Identifier
doi:10.1214/16-AOS1487

Mathematical Reviews number (MathSciNet)
MR3662457

Zentralblatt MATH identifier
1373.62299

Subjects
Primary: 62H25: Factor analysis and principal components; correspondence analysis
Secondary: 62H10: Distribution of statistics

Keywords
Asymptotic distributions; principal component analysis; diverging eigenvalues; approximate factor model; relative risk management; false discovery proportion

Citation

Wang, Weichen; Fan, Jianqing. Asymptotics of empirical eigenstructure for high dimensional spiked covariance. Ann. Statist. 45 (2017), no. 3, 1342–1374. doi:10.1214/16-AOS1487. https://projecteuclid.org/euclid.aos/1497319697


References

  • Agarwal, A., Negahban, S. and Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Statist. 40 1171–1197.
  • Amini, A. A. and Wainwright, M. J. (2008). High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on 2454–2458. IEEE, New York.
  • Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Ann. Math. Stat. 34 122–148.
  • Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations. J. Amer. Statist. Assoc. 96 939–967.
  • Bai, Z. D. (1999). Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica 9 611–677.
  • Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171.
  • Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221.
  • Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, 2nd ed. Springer, Berlin.
  • Bai, Z. and Yao, J. (2012). On sample eigenvalues in a generalized spiked population model. J. Multivariate Anal. 106 167–177.
  • Bai, Z. D. and Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large-dimensional sample covariance matrix. Ann. Probab. 21 1275–1294.
  • Baik, J., Ben Arous, G. and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33 1643–1697.
  • Benaych-Georges, F. and Nadakuditi, R. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math. 227 494–521.
  • Berthet, Q. and Rigollet, P. (2013). Optimal detection of sparse principal components in high dimension. Ann. Statist. 41 1780–1815.
  • Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36 2577–2604.
  • Birnbaum, A., Johnstone, I. M., Nadler, B. and Paul, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 41 1055–1084.
  • Cai, T., Fan, J. and Jiang, T. (2013). Distributions of angles in random packing on spheres. J. Mach. Learn. Res. 14 1837–1864.
  • Cai, T., Ma, Z. and Wu, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815.
  • Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37.
  • Chamberlain, G. and Rothschild, M. (1983). Funds, factors, and diversification in arbitrage pricing models. Econometrica 51 1305–1324.
  • Chandrasekaran, V., Sanghavi, S., Parrilo, P. A. and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21 572–596.
  • Chen, K. H. and Shimerda, T. A. (1981). An empirical analysis of useful financial ratios. Financ. Manag. 51–60.
  • Davidson, K. R. and Szarek, S. J. (2001). Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces I 317–366. North-Holland, Amsterdam.
  • Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.
  • De Mol, C., Giannone, D. and Reichlin, L. (2008). Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? J. Econometrics 146 318–328.
  • Donoho, D. L., Gavish, M. and Johnstone, I. M. (2014). Optimal shrinkage of eigenvalues in the spiked covariance model. Preprint. Available at arXiv:1311.0851.
  • Fan, J., Fan, Y. and Lv, J. (2008). High dimensional covariance matrix estimation using a factor model. J. Econometrics 147 186–197.
  • Fan, J. and Han, X. (2013). Estimation of false discovery proportion with unknown dependence. J. R. Stat. Soc. Ser. B. Stat. Methodol. To appear. Available at arXiv:1305.7007.
  • Fan, J., Han, X. and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. J. Amer. Statist. Assoc. 107 1019–1035.
  • Fan, J., Liao, Y. and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 603–680.
  • Fan, J., Liao, Y. and Shi, X. (2015). Risks of large portfolios. J. Econometrics 186 367–387.
  • Fan, J., Liao, Y. and Wang, W. (2016). Projected principal component analysis in factor models. Ann. Statist. 44 219–254.
  • Fan, J., Xue, L. and Yao, J. (2015). Sufficient forecasting using factor models. Preprint. Available at arXiv:1505.07414.
  • Fan, J., Liu, H., Wang, W. and Zhu, Z. (2016). Heterogeneity adjustment with applications to graphical model inference. Preprint. Available at arXiv:1602.05455.
  • Hall, P., Marron, J. S. and Neeman, A. (2005). Geometric representation of high dimension, low sample size data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 427–444.
  • Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327.
  • Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 104 682–693.
  • Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37 4104–4130.
  • Koltchinskii, V. and Lounici, K. (2014). Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance. Preprint. Available at arXiv:1408.4643.
  • Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133.
  • Landgrebe, J., Wurst, W. and Welzl, G. (2002). Permutation-validated principal components analysis of microarray data. Genome Biol. 3 1–11.
  • Lee, S., Zou, F. and Wright, F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Ann. Statist. 38 3605–3629.
  • Leek, J. T. and Storey, J. D. (2008). A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. USA 105 18718–18723.
  • Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K. and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11 733–739.
  • Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. Ann. Statist. 41 772–801.
  • Onatski, A. (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. J. Econometrics 168 244–258.
  • Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642.
  • Pesaran, M. H. and Zaffaroni, P. (2008). Optimal asset allocation with factor models for large portfolios.
  • Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A. and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38 904–909.
  • Ringnér, M. (2008). What is principal component analysis? Nat. Biotechnol. 26 303–304.
  • Rothman, A. J., Levina, E. and Zhu, J. (2009). Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 104 177–186.
  • Shen, D., Shen, H., Zhu, H. and Marron, J. S. (2016). The statistics and mathematics of high dimension low sample size asymptotics. Statist. Sinica 26 1747–1770.
  • Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. J. Amer. Statist. Assoc. 97 1167–1179.
  • Thomas, C. G., Harshman, R. A. and Menon, R. S. (2002). Noise reduction in BOLD-based fMRI using component analysis. Neuroimage 17 1521–1537.
  • Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. Preprint. Available at arXiv:1011.3027.
  • Vu, V. Q. and Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. In AISTATS 15 1278–1286.
  • Wang, W. and Fan, J. (2017). Supplement to “Asymptotics of empirical eigenstructure for high dimensional spiked covariance.” DOI:10.1214/16-AOS1487SUPP.
  • Yamaguchi-Kabata, Y., Nakazono, K., Takahashi, A., Saito, S., Hosono, N., Kubo, M., Nakamura, Y. and Kamatani, N. (2008). Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: Effects on population-based association studies. Am. J. Hum. Genet. 83 445–456.
  • Yata, K. and Aoshima, M. (2012). Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations. J. Multivariate Anal. 105 193–215.
  • Yata, K. and Aoshima, M. (2013). PCA consistency for the power spiked model in high-dimensional settings. J. Multivariate Anal. 122 334–354.

Supplemental materials

  • Technical proofs [Wang and Fan (2017)]. This document contains technical lemmas for Section 3 and the comparison of assumptions and theoretical proofs for Section 4.