The Annals of Statistics

Sparse PCA: Optimal rates and adaptive estimation

T. Tony Cai, Zongming Ma, and Yihong Wu



Principal component analysis (PCA) is one of the most commonly used statistical procedures, with a wide range of applications. This paper considers both minimax and adaptive estimation of the principal subspace in the high-dimensional setting. Under mild technical conditions, we first establish the optimal rates of convergence for estimating the principal subspace. These rates are sharp with respect to all the parameters and thus provide a complete characterization of the difficulty of the estimation problem in terms of the convergence rate. The lower bound is obtained by calculating the local metric entropy and applying Fano's lemma. The rate-optimal estimator is constructed using aggregation, which, however, might not be computationally feasible.
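For concreteness, the error of a principal subspace estimator is conventionally measured through the distance between projection matrices; a standard choice in the sparse PCA literature is

\[
L(\widehat{U}, U) \;=\; \bigl\| \widehat{U}\widehat{U}^\top - U U^\top \bigr\|_F^2,
\]

where $U$ and $\widehat{U}$ are $p \times r$ orthonormal bases of the true and estimated subspaces. This loss is invariant to the choice of basis within each subspace and equals twice the sum of the squared sines of the principal angles between the two subspaces.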

We then introduce an adaptive procedure for estimating the principal subspace that is fully data driven and can be computed efficiently. It is shown that the estimator attains the optimal rates of convergence simultaneously over a large collection of parameter spaces. A key idea in our construction is a reduction scheme that reduces the sparse PCA problem to a high-dimensional multivariate regression problem. This method is potentially useful for other related problems as well.
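To illustrate the basic sparsity mechanism the paper exploits (though not the paper's adaptive procedure itself), the following sketch implements diagonal thresholding in the spirit of Johnstone and Lu [26]: select the coordinates with the largest sample variances, then run ordinary PCA on the selected submatrix. The function name and the toy spiked model below are illustrative choices, not from the paper.

```python
import numpy as np

def sparse_pca_diag_threshold(X, k):
    """Illustrative sparse PCA baseline (diagonal thresholding):
    keep the k coordinates with largest sample variance, then take
    the leading eigenvector of the covariance restricted to them."""
    n, p = X.shape
    S = X.T @ X / n                         # sample covariance matrix
    idx = np.argsort(np.diag(S))[::-1][:k]  # top-k sample variances
    w, V = np.linalg.eigh(S[np.ix_(idx, idx)])
    v = np.zeros(p)
    v[idx] = V[:, -1]                       # embed back into R^p
    return v

# Toy check: a single spike supported on the first 5 coordinates.
rng = np.random.default_rng(0)
p, n, k = 50, 200, 5
u = np.zeros(p)
u[:k] = 1 / np.sqrt(k)                      # sparse leading eigenvector
X = rng.standard_normal((n, 1)) * (3 * u) + rng.standard_normal((n, p))
v_hat = sparse_pca_diag_threshold(X, k)
print(abs(u @ v_hat))                       # close to 1 when the spike is recovered
```

Diagonal thresholding is known to be suboptimal in some regimes, which is part of the motivation for the rate-optimal and adaptive procedures developed in the paper.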

Article information

Ann. Statist. Volume 41, Number 6 (2013), 3074-3110.

First available in Project Euclid: 1 January 2014


Primary: 62H12: Estimation
Secondary: 62H25: Factor analysis and principal components; correspondence analysis 62C20: Minimax procedures

Keywords: adaptive estimation; aggregation; covariance matrix; eigenvector; group sparsity; low-rank matrix; minimax lower bound; optimal rate of convergence; principal component analysis; thresholding


Cai, T. Tony; Ma, Zongming; Wu, Yihong. Sparse PCA: Optimal rates and adaptive estimation. Ann. Statist. 41 (2013), no. 6, 3074--3110. doi:10.1214/13-AOS1178.



  • [1] Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653.
  • [2] Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921.
  • [3] Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, Hoboken, NJ.
  • [4] Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408.
  • [5] Berthet, Q. and Rigollet, P. (2013). Optimal detection of sparse principal components in high dimension. Ann. Statist. 41 1780–1815.
  • [6] Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36 2577–2604.
  • [7] Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch. Verw. Gebiete 65 181–237.
  • [8] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203–268.
  • [9] Birnbaum, A., Johnstone, I. M., Nadler, B. and Paul, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Preprint. Available at arXiv:1203.0967.
  • [10] Cai, T. T., Liu, W. and Zhou, H. H. (2012). Optimal estimation of large sparse precision matrices. Technical Report, Univ. Pennsylvania, PA.
  • [11] Cai, T. T., Ma, Z. and Wu, Y. (2013). Optimal estimation and rank detection for sparse spiked covariance matrices. Preprint. Available at arXiv:1305.3235.
  • [12] Cai, T. T., Ma, Z. and Wu, Y. (2013). Supplement to “Sparse PCA: Optimal rates and adaptive estimation.” DOI:10.1214/13-AOS1178SUPP.
  • [13] Cai, T. T. and Yuan, M. (2012). Adaptive covariance matrix estimation through block thresholding. Ann. Statist. 40 2014–2042.
  • [14] Cai, T. T., Zhang, C.-H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38 2118–2144.
  • [15] Cai, T. T. and Zhou, H. H. (2012). Optimal rates of convergence for sparse covariance matrix estimation. Ann. Statist. 40 2389–2420.
  • [16] Chamberlain, G. and Rothschild, M. (1983). Arbitrage, factor structure, and mean–variance analysis on large asset markets. Econometrica 51 1281–1304.
  • [17] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley, Hoboken, NJ.
  • [18] d’Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. G. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49 434–448 (electronic).
  • [19] Davidson, K. R. and Szarek, S. J. (2001). Handbook of the Geometry of Banach Spaces 317–366. North-Holland, Amsterdam.
  • [20] Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.
  • [21] Eaton, M. L. (1970). Some problems in covariance estimation. Technical Report 49, Dept. Statistics, Univ. Stanford, Stanford, CA.
  • [22] Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge Univ. Press, Cambridge.
  • [23] Hoyle, D. C. and Rattray, M. (2004). Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Phys. Rev. E (3) 69 026124.
  • [24] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327.
  • [25] Johnstone, I. M. (2001). Thresholding for weighted chi-squared. Statist. Sinica 11 691–704.
  • [26] Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 104 682–693.
  • [27] Jolliffe, I. T., Trendafilov, N. T. and Uddin, M. (2003). A modified principal component technique based on the LASSO. J. Comput. Graph. Statist. 12 531–547.
  • [28] Journée, M., Nesterov, Y., Richtárik, P. and Sepulchre, R. (2010). Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11 517–553.
  • [29] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
  • [30] Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37 4104–4130.
  • [31] Kolmogorov, A. N. and Tihomirov, V. M. (1959). $\varepsilon $-entropy and $\varepsilon $-capacity of sets in function spaces. Uspehi Mat. Nauk 14 3–86.
  • [32] Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39 2302–2329.
  • [33] Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 1302–1338.
  • [34] LeCam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
  • [35] Lounici, K. (2013). Sparse principal component analysis with missing observations. In High Dimensional Probability VI 327–356. Springer, Basel.
  • [36] Lounici, K., Pontil, M., van de Geer, S. and Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39 2164–2204.
  • [37] Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. Ann. Statist. 41 772–801.
  • [38] Mendelson, S. (2010). Empirical processes with a bounded $\psi_{1}$ diameter. Geom. Funct. Anal. 20 988–1027.
  • [39] Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. Ann. Statist. 36 2791–2817.
  • [40] Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist. 39 1069–1097.
  • [41] Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
  • [42] Paul, D. (2005). Nonparametric estimation of principal components. Ph.D. thesis, Univ. Stanford, Stanford, CA.
  • [43] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642.
  • [44] Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501.
  • [45] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
  • [46] Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930.
  • [47] Shen, D., Shen, H. and Marron, J. S. (2013). Consistency of sparse PCA in high dimension, low sample size contexts. J. Multivariate Anal. 115 317–333.
  • [48] Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. J. Multivariate Anal. 99 1015–1034.
  • [49] Stein, C. (1956). Some problems in multivariate analysis, Part I. Technical Report 6, Dept. Statistics, Univ. Stanford.
  • [50] Szarek, S. J. (1982). Nets of Grassmann manifold and orthogonal group. In Proceedings of Research Workshop on Banach Space Theory (Iowa City, Iowa, 1981) 169–185. Univ. Iowa, Iowa City, IA.
  • [51] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
  • [52] Ulfarsson, M. O. and Solo, V. (2008). Sparse variable PCA using geodesic steepest descent. IEEE Trans. Signal Process. 56 5823–5832.
  • [53] Varmuza, K. and Filzmoser, P. (2009). Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press, Boca Raton, FL.
  • [54] Vu, V. Q. and Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012).
  • [55] Wedin, P.-Å. (1972). Perturbation bounds in connection with singular value decomposition. Nordisk Tidskr. Informationsbehandling (BIT) 12 99–111.
  • [56] Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
  • [57] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
  • [58] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564–1599.
  • [59] Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res. 14 899–925.
  • [60] Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.

Supplemental materials

  • Supplementary material: Supplement to “Sparse PCA: Optimal rates and adaptive estimation”. We provide proofs for all the remaining theoretical results in the paper. The proofs rely on results in [17, 19, 20, 25, 31, 33] and [51].