## Electronic Journal of Statistics

### Bayesian variable selection for globally sparse probabilistic PCA

#### Abstract

Sparse versions of principal component analysis (PCA) have established themselves as simple yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the selected variables may be difficult to interpret, since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure that yields several sparse components with the same sparsity pattern. This allows the practitioner to identify which original variables are most relevant to describe the data. To this end, using Roweis’ probabilistic interpretation of PCA and an isotropic Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. Moreover, in order to avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented. It yields a path of candidate models by means of a variational expectation-maximization algorithm. The exact marginal likelihood can then be maximized over this path, relying on Occam’s razor to select the relevant variables. Since the sparsity pattern is common to all components, we call this approach globally sparse probabilistic PCA (GSPPCA). Its usefulness is illustrated on synthetic data sets and on several real unsupervised feature selection problems coming from signal processing and genomics. In particular, using unlabeled microarray data, GSPPCA is shown to infer biologically relevant subsets of genes. According to a metric based on pathway enrichment, in this context it vastly surpasses the performance of traditional sparse PCA algorithms. An R implementation of the GSPPCA algorithm is available at http://github.com/pamattei/GSPPCA.
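
As a rough illustration of what a sparsity pattern shared by all components means, the following Python sketch simulates data from a PPCA-style model in which only a subset of the variables carries signal. It is not the GSPPCA algorithm and not the authors' R code: the function names, the parameters `support`, `alpha` and `sigma`, and the exact form of the model are assumptions made for this example.

```python
import random


def sample_globally_sparse_ppca(n, p, d, support, sigma=0.1, alpha=1.0, seed=0):
    """Draw n observations from a PPCA-style model whose d components share one support.

    Hypothetical parameters: `support` lists the relevant variables, `alpha` is
    the prior precision of the loadings, `sigma` the noise standard deviation.
    """
    rng = random.Random(seed)
    support = set(support)
    # Loading matrix W (p x d): isotropic Gaussian entries on the support,
    # whole rows set to zero elsewhere, so all components share one pattern.
    W = [[rng.gauss(0.0, alpha ** -0.5) if j in support else 0.0 for _ in range(d)]
         for j in range(p)]
    X = []
    for _ in range(n):
        y = [rng.gauss(0.0, 1.0) for _ in range(d)]  # latent scores
        x = [sum(W[j][k] * y[k] for k in range(d)) + rng.gauss(0.0, sigma)
             for j in range(p)]
        X.append(x)
    return X, W


def row_support(W, tol=0.0):
    """Indices of variables with at least one loading larger than tol in magnitude."""
    return [j for j, row in enumerate(W) if any(abs(w) > tol for w in row)]


X, W = sample_globally_sparse_ppca(n=200, p=6, d=2, support=[0, 3, 4])
print(row_support(W))
```

Because entire rows of the loading matrix are zero, the same variables are discarded for every component, which is the property that distinguishes a globally sparse model from one with a separate pattern per axis.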

#### Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 3036-3070.

Dates
First available in Project Euclid: 20 September 2018

https://projecteuclid.org/euclid.ejs/1537430424

Digital Object Identifier
doi:10.1214/18-EJS1450

Mathematical Reviews number (MathSciNet)
MR3856168

Zentralblatt MATH identifier
06942965

#### Citation

Bouveyron, Charles; Latouche, Pierre; Mattei, Pierre-Alexandre. Bayesian variable selection for globally sparse probabilistic PCA. Electron. J. Statist. 12 (2018), no. 2, 3036–3070. doi:10.1214/18-EJS1450. https://projecteuclid.org/euclid.ejs/1537430424

#### References

• Abramowitz, M. and Stegun, I. (1965). Handbook of Mathematical Functions. Dover Publications.
• Aminghafari, M., Cheze, N. and Poggi, J. M. (2006). Multivariate denoising using wavelets and principal component analysis. Computational Statistics & Data Analysis 50 2381–2398.
• Amos, D. E. (1986). Algorithm 644: A portable package for Bessel functions of a complex argument and nonnegative order. ACM Transactions on Mathematical Software 12 265–273.
• Ando, T. (2009). Bayesian factor analysis with fat-tailed factors and its exact marginal likelihood. Journal of Multivariate Analysis 100 1717–1726.
• Archambeau, C. and Bach, F. (2009). Sparse probabilistic projections. In Advances in Neural Information Processing Systems 73–80.
• Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57 289–300.
• Biernacki, C., Celeux, G. and Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis 41 561–575.
• Bishop, C. M. (1999a). Bayesian PCA. In Advances in Neural Information Processing Systems 382–388.
• Bishop, C. M. (1999b). Variational Principal Components. In Proceedings of the Ninth International Conference on Artificial Neural Networks 509–514.
• Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
• Bouveyron, C., Celeux, G. and Girard, S. (2011). Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recognition Letters 32 1706–1713.
• Bro, R. and Smilde, A. K. (2003). Centering and scaling in component analysis. Journal of Chemometrics 17 16–33.
• Brusco, M. J. (2014). A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis. Computational Statistics & Data Analysis 77 38–53.
• Celeux, G., El Anbari, M., Marin, J. M. and Robert, C. P. (2012). Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Analysis 7 477–502.
• Chan, T. H., Jia, K., Gao, S., Lu, J., Zeng, Z. and Ma, Y. (2015). PCANet: A Simple Deep Learning Baseline for Image Classification? IEEE Transactions on Image Processing 24 5017–5032.
• d’Aspremont, A., Bach, F. and El Ghaoui, L. (2008). Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research 9 1269–1294.
• Fabregat, A., Sidiropoulos, K., Garapati, P., Gillespie, M., Hausmann, K., Haw, R., Jassal, B., Jupe, S., Korninger, F., McKay, S., Matthews, L., May, B., Milacic, M., Rothfels, K., Shamovsky, V., Webber, M., Weiser, J., Williams, M., Wu, G., Stein, L., Hermjakob, H. and D’Eustachio, P. (2016). The Reactome pathway Knowledgebase. Nucleic Acids Research 44 D481–D487.
• Fang, K. T., Kotz, S. and Ng, K. W. (1990). Symmetric multivariate and related distributions. Chapman and Hall.
• Gramfort, A., Strohmeier, D., Haueisen, J., Hämäläinen, M. S. and Kowalski, M. (2013). Time-frequency mixed-norm estimates: Sparse M/EEG imaging with non-stationary source activations. NeuroImage 70 410–422.
• Gu, Q., Li, Z. and Han, J. (2011). Joint feature selection and subspace learning. In Proceedings of the International Joint Conference on Artificial Intelligence 22 1294–1299.
• Guan, Y. and Dy, J. G. (2009). Sparse probabilistic principal component analysis. In International Conference on Artificial Intelligence and Statistics 185–192.
• Hartman, P. and Watson, G. S. (1974). “Normal” distribution functions on spheres and the modified Bessel functions. The Annals of Probability 593–607.
• Hastie, T., Tibshirani, R. and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
• Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24 417.
• Ilin, A. and Raiko, T. (2010). Practical approaches to principal component analysis in the presence of missing values. The Journal of Machine Learning Research 11 1957–2000.
• Jenatton, R., Obozinski, G. and Bach, F. (2009). Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics.
• Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104.
• Jolliffe, I. T. (1972). Discarding variables in a principal component analysis. I: Artificial data. Journal of the Royal Statistical Society. Series C (Applied Statistics) 21 160–173.
• Jolliffe, I. T. (1973). Discarding variables in a principal component analysis. II: Real data. Journal of the Royal Statistical Society. Series C (Applied Statistics) 22 21–31.
• Journée, M. (2009). Geometric algorithms for component analysis with a view to gene expression data analysis. PhD thesis, Université de Liège.
• Journée, M., Nesterov, Y., Richtárik, P. and Sepulchre, R. (2010). Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research 11 517–553.
• Kass, R. E. and Steffey, D. (1989). Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association 84 717–726.
• Khan, Z., Shafait, F. and Mian, A. (2015). Joint Group Sparse PCA for Compressed Hyperspectral Imaging. IEEE Transactions on Image Processing 24 4934–4942.
• Khanna, R., Ghosh, J., Poldrack, R. and Koyejo, O. (2015). Sparse Submodular Probabilistic PCA. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics 453–461.
• Larochelle, H., Erhan, D., Courville, A., Bergstra, J. and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning 473–480. ACM.
• Latouche, P., Mattei, P. A., Bouveyron, C. and Chiquet, J. (2016). Combining a Relaxed EM Algorithm with Occam’s Razor for Bayesian Variable Selection in High-Dimensional Regression. Journal of Multivariate Analysis 146 177–190.
• Lawley, D. N. (1953). A modified method of estimation in factor analysis and some large sample results. Proceedings of the Uppsala Symposium on Psychological Factor Analysis, Uppsala, Sweden 35–42.
• Lázaro-Gredilla, M. and Titsias, M. K. (2011). Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems 2339–2347.
• Liu, T. Y., Trinchera, L., Tenenhaus, A., Wei, D. and Hero, A. O. (2013). Globally sparse PLS regression. In New Perspectives in Partial Least Squares and Related Methods 117–127. Springer.
• Lorch, L. (1967). Inequalities for some Whittaker functions. Archivum Mathematicum 3 1–9.
• MacKay, D. J. C. (1994). Bayesian methods for backpropagation networks. In Models of Neural Networks III 211–254. Springer.
• MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press.
• Madigan, D. and Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association 89 1535–1546.
• Mäechler, M. (2013). Bessel: Bessel – Bessel Functions Computations and Approximations. R package version 0.5-5.
• Masaeli, M., Yan, Y., Cui, Y., Fung, G. and Dy, J. G. (2010). Convex principal feature selection. In SIAM International Conference on Data Mining 619–628.
• Mattei, P. A. (2017). Multiplying a Gaussian matrix by a Gaussian vector. Statistics & Probability Letters 128 67–70.
• Mattei, P. A., Bouveyron, C. and Latouche, P. (2016). Globally Sparse Probabilistic PCA. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics 976–984.
• McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions. Second Edition. John Wiley & Sons, New York.
• Miller, J. A., Cai, C., Langfelder, P., Geschwind, D. H., Kurian, S. M., Salomon, D. R. and Horvath, S. (2011). Strategies for aggregating gene expression data: the collapseRows R function. BMC bioinformatics 12 1.
• Minka, T. P. (2000). Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems 598–604.
• Minn, A. J., Gupta, G. P., Padua, D., Bos, P., Nguyen, D. X., Nuyten, D., Kreike, B., Zhang, Y., Wang, Y., Ishwaran, H., Foekens, J. A., van de Vijver, M. and Massagué, J. (2007). Lung metastasis genes couple breast tumor size and metastatic spread. Proceedings of the National Academy of Sciences 104 6740–6745.
• Mitchell, T. and Beauchamp, J. (1988). Bayesian variable selection in linear regression (with discussion). Journal of the American Statistical Association 83 1023–1036.
• Moghaddam, B., Weiss, Y. and Avidan, S. (2005). Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems 915–922.
• Mohamed, S., Heller, K. and Ghahramani, Z. (2012). Bayesian and L1 approaches for sparse unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning 751–758.
• Nakajima, S., Sugiyama, M. and Babacan, D. (2011). On Bayesian PCA: Automatic dimensionality selection and analytic solution. In Proceedings of the 28th International Conference on Machine Learning 497–504.
• Nakajima, S., Tomioka, R., Sugiyama, M. and Babacan, S. D. (2015). Condition for Perfect Dimensionality Recovery by Variational Bayesian PCA. Journal of Machine Learning Research 16 3757–3811.
• Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
• Ogata, H. (2005). A numerical integration formula based on the Bessel functions. Publications of the Research Institute for Mathematical Sciences 41 949–970.
• Passemier, D., Li, Z. and Yao, J. (2017). On estimation of the noise variance in high dimensional probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 51–67.
• Qiu, Y. and Mei, J. (2016). RSpectra: Solvers for Large Scale Eigenvalue and SVD Problems. R package version 0.12-0.
• Ringnér, M. (2008). What is principal component analysis? Nature Biotechnology 26 303–304.
• Rivals, I., Personnaz, L., Taing, L. and Potier, M. C. (2007). Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 23 401–407.
• Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society. Series C (Applied Statistics) 25 257–265.
• Roweis, S. (1998). EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 626–632.
• Schaback, R. and Wu, Z. (1996). Operators on radial functions. Journal of Computational and Applied Mathematics 73 257–270.
• Schroeder, M., Haibe-Kains, B., Culhane, A., Sotiriou, C., Bontempi, G. and Quackenbush, J. (2011). breastCancerVDX: Gene expression datasets published by Wang et al. [2005] and Minn et al. [2007] (VDX). R package version 1.8.0.
• Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 461–464.
• Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99 1015–1034.
• Sigg, C. D. and Buhmann, J. M. (2008). Expectation-maximization for sparse and non-negative PCA. In Proceedings of the 25th international conference on Machine learning 960–967.
• Sobczyk, P., Bogdan, M. and Josse, J. (2017). Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood. Journal of Computational and Graphical Statistics 26 826–839.
• Teschendorff, A. E., Journée, M., Absil, P. A., Sepulchre, R. and Caldas, C. (2007). Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Computational Biology 3 e161.
• Theobald, C. M. (1975). An Inequality with Application to Multivariate Analysis. Biometrika 62 461–466.
• Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning Research 1 211–244.
• Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 611–622.
• Ulfarsson, M. O. and Solo, V. (2008). Sparse variable PCA using geodesic steepest descent. IEEE Transactions on Signal Processing 56 5823–5832.
• Ulfarsson, M. O. and Solo, V. (2011). Vector l0 sparse variable PCA. IEEE Transactions on Signal Processing 59 1949–1958.
• Van der Vaart, A. W. (2000). Asymptotic statistics 3. Cambridge University Press.
• Vu, V. Q. and Lei, J. (2013). Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics 41 2905–2947.
• Wang, Y., Klijn, J. G. M., Zhang, Y., Sieuwerts, A. M., Look, M. P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M. E., Yu, J., Jatkoe, T., Berns, E. M. J. J., Atkins, D. and Foekens, J. A. (2005). Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. The Lancet 365 671–679.
• Wipf, D. and Nagarajan, S. (2008). A new view of automatic relevance determination. In Advances in Neural Information Processing Systems 1625–1632.
• Wipf, D. and Nagarajan, S. (2009). A unified Bayesian framework for MEG/EEG source imaging. NeuroImage 44 947–966.
• Wipf, D. P., Rao, B. D. and Nagarajan, S. (2011). Latent variable Bayesian models for promoting sparsity. IEEE Transactions on Information Theory 57 6236–6255.
• Wishart, J. and Bartlett, M. S. (1932). The distribution of second order moment statistics in a normal system. Mathematical Proceedings of the Cambridge Philosophical Society 28.
• Xiaoshuang, S., Zhihui, L., Zhenhua, G., Minghua, W., Cairong, Z. and Heng, K. (2013). Sparse Principal Component Analysis via Joint $L_{2,1}$-Norm Penalty. In AI 2013: Advances in Artificial Intelligence 148–159. Springer.
• Xu, L. and Jordan, M. (1996). On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation 8 129–151.
• Yu, G. and He, Q. Y. (2016). ReactomePA: an R/Bioconductor package for Reactome pathway analysis and visualization. Molecular BioSystems.
• Yu, L., Snapp, R. R., Ruiz, T. and Radermacher, M. (2010). Probabilistic principal component analysis with expectation maximization (PPCA-EM) facilitates volume classification and estimates the missing data. Journal of Structural Biology 171 18–30.
• Zhang, Y., d’Aspremont, A. and El Ghaoui, L. (2012). Sparse PCA: Convex relaxations, algorithms and applications. In Handbook on Semidefinite, Conic and Polynomial Optimization 915–940. Springer.
• Zhang, Y. and El Ghaoui, L. (2011). Large-scale sparse principal component analysis with application to text data. In Advances in Neural Information Processing Systems 532–539.
• Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics 15 265–286.