Annals of Applied Statistics

Modeling microbial abundances and dysbiosis with beta-binomial regression

Bryan D. Martin, Daniela Witten, and Amy D. Willis


Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
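As a rough illustration of the idea in the abstract, the sketch below writes out a beta-binomial log-likelihood in which both the relative abundance and the overdispersion depend on covariates. The abstract does not state the parameterization or link functions, so the choices here are assumptions: a mean parameter mu and an overdispersion parameter phi, both in (0, 1), each tied to its own covariates through a logit link. The function and argument names (betabinom_logpmf, log_likelihood, beta, beta_star) are hypothetical, and this is not the corncob implementation.

```python
import math


def inv_logit(x):
    """Map a real-valued linear predictor into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(w, m, mu, phi):
    """Log-probability of observing w counts of a taxon out of m total counts.

    Assumed parameterization: mu is the expected relative abundance and
    phi in (0, 1) is the overdispersion, so the variance of the count is
    m * mu * (1 - mu) * (1 + (m - 1) * phi).
    """
    # Convert (mu, phi) to the usual Beta shape parameters.
    a1 = mu * (1.0 - phi) / phi
    a2 = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = math.lgamma(m + 1) - math.lgamma(w + 1) - math.lgamma(m - w + 1)
    return log_choose + log_beta(w + a1, m - w + a2) - log_beta(a1, a2)


def log_likelihood(beta, beta_star, X, X_star, W, M):
    """Sum of per-sample log-probabilities.

    beta acts on the abundance covariates X; beta_star acts on the
    overdispersion covariates X_star. Both logit links are assumptions.
    """
    total = 0.0
    for x, x_star, w, m in zip(X, X_star, W, M):
        mu = inv_logit(sum(b * v for b, v in zip(beta, x)))
        phi = inv_logit(sum(b * v for b, v in zip(beta_star, x_star)))
        total += betabinom_logpmf(w, m, mu, phi)
    return total


# Hypothetical toy data: intercept plus one binary covariate, two samples.
X = [[1.0, 0.0], [1.0, 1.0]]
print(log_likelihood([0.0, 0.5], [-2.0, 1.0], X, X, W=[3, 10], M=[20, 30]))
```

Under these assumptions, a test for differential relative abundance would ask whether the covariate coefficient in beta is zero, and a test for differential variability would ask the same of the corresponding coefficient in beta_star. Maximizing this function over both coefficient vectors would require a numerical optimizer, which is omitted here.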

Article information

Ann. Appl. Stat., Volume 14, Number 1 (2020), 94-115.

Received: January 2019
Revised: June 2019
First available in Project Euclid: 16 April 2020


Keywords: relative abundance, microbiome, correlated data, overdispersion, high throughput sequencing, beta-binomial


Martin, Bryan D.; Witten, Daniela; Willis, Amy D. Modeling microbial abundances and dysbiosis with beta-binomial regression. Ann. Appl. Stat. 14 (2020), no. 1, 94–115. doi:10.1214/19-AOAS1283.




Supplemental materials

  • Supplement A: corncob R package. We provide an R package implementing all methods proposed in this paper.
  • Supplement B: Figure code. We provide code to reproduce all simulations and data analyses in this paper.