## The Annals of Applied Statistics

### Variational inference for probabilistic Poisson PCA

#### Abstract

Many application domains, such as ecology or genomics, must handle multivariate non-Gaussian observations. A typical example is the joint observation of the abundances of a set of species across a series of sites, with the aim of understanding the covariations between these species. The Gaussian setting provides a canonical way to model such dependencies but does not apply in general. We consider here the multivariate exponential family framework, for which we introduce a generic model with multivariate Gaussian latent variables. We show that approximate maximum likelihood inference can be achieved via a variational algorithm to which gradient descent easily applies. We show that this setting enables us to account for covariates and offsets. We then focus on the case of the Poisson-lognormal model in the context of community ecology. We demonstrate the efficiency of our algorithm on microbial ecology datasets. We illustrate the importance of accounting for the effects of covariates to better understand interactions between species.
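As an illustration of the kind of variational objective the abstract refers to, the following is a minimal sketch for a plain Poisson-lognormal model with covariates and offsets. It is not the authors' implementation and it omits the rank constraint of the PCA formulation. It assumes counts $Y_{ij} \mid Z_{ij} \sim \mathcal{P}(\exp(o_{ij} + x_i^\top \beta_j + Z_{ij}))$ with $Z_i \sim \mathcal{N}(0, \Sigma)$, and a Gaussian variational distribution with diagonal covariance for each site. The function name `pln_elbo` and all argument names are hypothetical.

```python
import math
import numpy as np


def pln_elbo(Y, O, X, B, Sigma, M, S2):
    """Variational lower bound of the log-likelihood (illustrative sketch).

    Assumed shapes (hypothetical names, not taken from the paper):
      Y     (n, p) counts            O  (n, p) offsets
      X     (n, d) covariates        B  (d, p) regression coefficients
      Sigma (p, p) latent covariance
      M     (n, p) variational means
      S2    (n, p) variational variances (strictly positive)
    """
    n, p = Y.shape
    # Expected log-intensity under q: offsets + covariate effects + latent mean
    A = O + X @ B + M
    # log(Y!) computed entrywise with the log-gamma function
    log_fact = np.vectorize(math.lgamma)(Y + 1.0)
    # E_q[log p(Y | Z)], using E[exp(Z)] = exp(m + s^2 / 2) for Gaussian Z
    poisson_term = np.sum(Y * A - np.exp(A + 0.5 * S2) - log_fact)
    # E_q[log p(Z)] up to the -n*p/2*log(2*pi) constant
    Omega = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum((M @ Omega) * M) + np.sum(S2 * np.diag(Omega))
    gauss_term = -0.5 * n * logdet - 0.5 * quad
    # Entropy of q; its +n*p/2*log(2*pi) cancels the constant dropped above
    entropy_term = 0.5 * np.sum(np.log(S2)) + 0.5 * n * p
    return poisson_term + gauss_term + entropy_term


# Tiny usage example on simulated data (all values are arbitrary)
rng = np.random.default_rng(0)
n, p, d = 5, 3, 2
X = rng.normal(size=(n, d))
B = rng.normal(scale=0.3, size=(d, p))
O = np.zeros((n, p))
Sigma = np.eye(p)
Y = rng.poisson(np.exp(X @ B)).astype(float)
elbo = pln_elbo(Y, O, X, B, Sigma, M=np.zeros((n, p)), S2=np.ones((n, p)))
```

For fixed `B` and `Sigma`, this bound is jointly concave in `M` and `S2`, which is one reason a gradient-based optimizer is a natural choice for the variational step in a sketch like this one.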

#### Article information

Source
Ann. Appl. Stat., Volume 12, Number 4 (2018), 2674-2698.

Dates
Received: March 2017
Revised: February 2018
First available in Project Euclid: 13 November 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1542078060

Digital Object Identifier
doi:10.1214/18-AOAS1177

Mathematical Reviews number (MathSciNet)
MR3875716

Zentralblatt MATH identifier
07029470

#### Citation

Chiquet, Julien; Mariadassou, Mahendra; Robin, Stéphane. Variational inference for probabilistic Poisson PCA. Ann. Appl. Stat. 12 (2018), no. 4, 2674-2698. doi:10.1214/18-AOAS1177. https://projecteuclid.org/euclid.aoas/1542078060
