The Annals of Applied Statistics

Sparse logistic principal components analysis for binary data

Seokho Lee, Jianhua Z. Huang, and Jianhua Hu



We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced into the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem whose criterion function is motivated by a penalized Bernoulli likelihood. A Majorization–Minimization algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by an application to a single nucleotide polymorphism data set and by a simulation study.
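As a rough illustration of the ingredients named in the abstract, the Python sketch below works with a single logit parameter. It shows how a quadratic majorizer of the negative Bernoulli log-likelihood, combined with an L1 (LASSO-type) penalty, leads to a soft-thresholding update. This is not the authors' algorithm: the function names, the scalar setting, the L1 form of the penalty, and the use of the uniform curvature bound 1/4 are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def neg_log_lik(theta, y):
    """Negative Bernoulli log-likelihood of one y in {0, 1} with logit theta."""
    q = 2 * y - 1                      # recode y as -1 / +1
    z = -q * theta
    # log(1 + exp(z)), computed without overflow
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def gradient(theta, y):
    """Derivative of neg_log_lik with respect to theta."""
    q = 2 * y - 1
    return -q * (1.0 - sigmoid(q * theta))

def quadratic_majorizer(theta, theta0, y):
    """Surrogate that touches neg_log_lik at theta0 and lies above it elsewhere.

    Relies on the second derivative of neg_log_lik being at most 1/4.
    """
    d = theta - theta0
    return neg_log_lik(theta0, y) + gradient(theta0, y) * d + 0.125 * d * d

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (t - z)**2 + lam * |t| over t."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def mm_step(theta0, y, lam):
    """One MM update for neg_log_lik(theta, y) + lam * |theta| (hypothetical helper)."""
    working = theta0 - 4.0 * gradient(theta0, y)   # minimizer of the unpenalized surrogate
    return soft_threshold(working, 4.0 * lam)      # the L1 penalty shrinks it toward zero

if __name__ == "__main__":
    theta, y, lam = 0.0, 1, 0.1
    for it in range(5):
        objective = neg_log_lik(theta, y) + lam * abs(theta)
        print(it, round(theta, 4), round(objective, 4))
        theta = mm_step(theta, y, lam)
```

Because each update minimizes a surrogate that lies above the penalized objective and touches it at the current value, the objective cannot increase from one iteration to the next. A full method along the lines of the abstract would apply updates of this kind to matrices of scores and loadings rather than to a single scalar.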

Article information

Ann. Appl. Stat., Volume 4, Number 3 (2010), 1579-1601.

First available in Project Euclid: 18 October 2010


Keywords

Binary data; dimension reduction; MM algorithm; LASSO; PCA; regularization; sparsity


Lee, Seokho; Huang, Jianhua Z.; Hu, Jianhua. Sparse logistic principal components analysis for binary data. Ann. Appl. Stat. 4 (2010), no. 3, 1579–1601. doi:10.1214/10-AOAS327.




Supplemental materials

  • Supplementary material: The MM algorithm for sparse logistic PCA using the tight bound. We develop the MM algorithm for sparse logistic PCA using the tight majorizing bound. A comparison of computing time between this algorithm and the MM algorithm using the uniform bound is also presented.