Annals of Applied Statistics

Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis

Boyu Ren, Sergio Bacallado, Stefano Favaro, Tommi Vatanen, Curtis Huttenhower, and Lorenzo Trippa



Detecting associations between microbial compositions and sample characteristics is one of the most important tasks in microbiome studies. Most existing methods apply univariate models to each microbial species separately, with adjustments for multiple hypothesis testing. We propose a Bayesian analysis for a generalized mixed effects linear model tailored to this application. The marginal prior on each microbial composition is a Dirichlet process, and dependence across compositions is induced through a linear combination of individual covariates, such as disease biomarkers or the subject’s age, and latent factors. The latent factors capture residual variability, and their dimensionality is learned from the data in a fully Bayesian procedure. The proposed model is tested in data analyses and simulation studies with zero-inflated compositions. In these settings, a large proportion of the counts per microbial species within each sample are equal to zero. Under our Bayesian model, the prior probability of compositions with absent microbial species is strictly positive. We propose an efficient algorithm to sample from the posterior, together with visualizations of model parameters that reveal associations between covariates and microbial compositions. We evaluate the proposed method in simulation studies and then analyze a microbiome dataset for infants with type 1 diabetes, which contains a large proportion of zeros in the sample-specific microbial compositions.
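The generative idea in the abstract can be illustrated with a rough simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation (their supplement provides R source code). It assumes that each sample's composition is obtained by multiplying shared species weights by the squared positive part of a latent score, and that the score is a linear combination of covariates and latent factors plus Gaussian noise. The truncation is what gives exact zeros positive prior probability. All names (`simulate_compositions`, `alpha`, `depth`, the coefficient matrices) are hypothetical, and the number of latent factors is fixed here rather than learned from the data.

```python
import numpy as np


def simulate_compositions(X, n_species=50, n_factors=3, alpha=5.0, depth=1000, seed=0):
    """Simulate zero-inflated compositions and counts for samples with covariates X.

    X has shape (n_samples, n_covariates). Returns (P, counts), each of shape
    (n_species, n_samples). Illustrative only; every modelling choice below is
    an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_cov = X.shape

    # Species weights shared by all samples: a finite gamma approximation
    # to the weights of a Dirichlet process (assumed construction).
    sigma = rng.gamma(shape=alpha / n_species, scale=1.0, size=n_species)

    # Latent scores: covariate effects + latent factors + noise.
    coef = rng.normal(size=(n_species, n_cov))          # hypothetical fixed effects
    loadings = rng.normal(size=(n_species, n_factors))  # hypothetical factor loadings
    factors = rng.normal(size=(n_factors, n_samples))   # sample-level latent factors
    noise = rng.normal(size=(n_species, n_samples))
    Q = coef @ X.T + loadings @ factors + noise

    # Truncation: species with a non-positive score are absent from the sample.
    weights = sigma[:, None] * np.maximum(Q, 0.0) ** 2
    totals = weights.sum(axis=0, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("a sample has no species present; try another seed")
    P = weights / totals

    # Observed counts: multinomial sampling at a fixed sequencing depth.
    counts = np.column_stack(
        [rng.multinomial(depth, P[:, j]) for j in range(n_samples)]
    )
    return P, counts


# Usage: 20 samples with 2 standardized covariates.
X = np.random.default_rng(1).normal(size=(20, 2))
P, counts = simulate_compositions(X)
print("fraction of exact zeros in the compositions:", (P == 0).mean())
```

In this sketch a species absent from a composition always has a zero count, while a species that is present can still have a zero count because of finite sequencing depth, so the observed zeros mix both sources.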

Article information

Ann. Appl. Stat., Volume 14, Number 1 (2020), 494–517.

Received: October 2018
Revised: August 2019
First available in Project Euclid: 16 April 2020


Keywords: Truncated dependent Dirichlet processes; latent factor model; type 1 diabetes


Ren, Boyu; Bacallado, Sergio; Favaro, Stefano; Vatanen, Tommi; Huttenhower, Curtis; Trippa, Lorenzo. Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis. Ann. Appl. Stat. 14 (2020), no. 1, 494–517. doi:10.1214/19-AOAS1295.




Supplemental materials

  • Source code for “Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis”. R source code for replicating results in this paper and data files for the microbiome dataset.
  • Supplement to “Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis”. We provide the proof of the proposition for model identifiability in the general setting. We also include additional supporting plots and tables for the simulation studies and data application.