The Annals of Applied Statistics

Sparse principal component analysis with missing observations

Seyoung Park and Hongyu Zhao



Principal component analysis (PCA) is a commonly used statistical method in a wide range of applications. However, it does not work well when the number of features is larger than the sample size. Motivated by the analysis of single-cell RNA sequencing data, we consider estimation of the sparse principal subspace in the high-dimensional setting with missing data. We propose a two-step estimation procedure and establish rates of convergence for estimating the principal subspace. Simulated examples with various missing-data mechanisms show its competitive performance compared to existing sparse PCA methods. We apply the method to single-cell data and show that the proposed method can better distinguish cell types than other PCA methods.
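The paper's exact two-step procedure is not given in this abstract. As a rough illustration of the general idea only (not the authors' method), one can first form an entrywise second-moment estimate from the observed entries, assuming zero-mean data missing completely at random, and then extract a sparse leading direction by a truncated power iteration in the spirit of Yuan and Zhang (2013). The spiked model and all parameter values below are illustrative assumptions.

```python
import numpy as np

def masked_covariance(X, mask):
    """Entrywise second-moment estimate from incomplete, zero-mean data.

    X    : (n, p) data array; missing entries may hold any value.
    mask : (n, p) boolean array, True where X is observed.
    Each (j, j') entry is averaged over the samples in which both
    coordinates are observed (a pairwise-complete estimator, valid
    under missing-completely-at-random).
    """
    Xz = np.where(mask, X, 0.0)
    counts = mask.T.astype(float) @ mask.astype(float)  # pairwise-complete counts
    return (Xz.T @ Xz) / np.maximum(counts, 1.0)

def truncated_power_iteration(S, k, iters=100):
    """Leading eigenvector of S with at most k nonzero entries,
    via power iteration with hard truncation at each step."""
    p = S.shape[0]
    v = np.ones(p) / np.sqrt(p)
    for _ in range(iters):
        v = S @ v
        v[np.argsort(np.abs(v))[:-k]] = 0.0  # keep the k largest entries
        v /= np.linalg.norm(v)
    return v

# Simulated spiked-covariance data with entries missing completely at random.
rng = np.random.default_rng(0)
n, p, k = 500, 50, 5
u = np.zeros(p)
u[:k] = 1.0 / np.sqrt(k)                        # sparse leading direction
X = 3.0 * rng.standard_normal((n, 1)) * u + rng.standard_normal((n, p))
mask = rng.random((n, p)) < 0.8                 # roughly 20% missing

S = masked_covariance(X, mask)
v = truncated_power_iteration(S, k)
print(abs(float(u @ v)))                        # close to 1 when recovery succeeds
```

The key point of the first step is that the pairwise-complete estimator remains (approximately) unbiased for the population second moments despite missingness, so the second step can operate on it as if it were an ordinary sample covariance.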

Article information

Ann. Appl. Stat., Volume 13, Number 2 (2019), 1016-1042.

Received: June 2017
Revised: September 2018
First available in Project Euclid: 17 June 2019


Keywords: PCA; missing data; high dimensional; single-cell data


Park, Seyoung; Zhao, Hongyu. Sparse principal component analysis with missing observations. Ann. Appl. Stat. 13 (2019), no. 2, 1016--1042. doi:10.1214/18-AOAS1220.




Supplemental materials

  • Supplement to “Sparse principal component analysis with missing observations”. We provide proofs of the theoretical results presented in the main paper, characteristics of the scRNA-seq data sets used, performance metrics, and additional tables and figures for the simulation and single-cell data analyses.