The Annals of Applied Statistics

Regression analysis for microbiome compositional data

Pixu Shi, Anru Zhang, and Hongzhe Li



One important problem in microbiome analysis is to identify the bacterial taxa that are associated with a response, where the microbiome data are summarized as the composition of the bacterial taxa at different taxonomic levels. This paper considers regression analysis with such compositional data as covariates. In order to satisfy the subcompositional coherence of the results, linear models with a set of linear constraints on the regression coefficients are introduced. Such models allow regression analysis for subcompositions and include the log-contrast model for compositional covariates as a special case. A penalized estimation procedure for estimating the regression coefficients and for selecting variables under the linear constraints is developed. A method is also proposed to obtain debiased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This provides valid confidence intervals of the regression coefficients and can be used to obtain the $p$-values. Simulation results show the validity of the confidence intervals and smaller variances of the debiased estimates when the linear constraints are imposed. The proposed methods are applied to a gut microbiome data set and identify four bacterial genera that are associated with the body mass index after adjusting for the total fat and caloric intakes.
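The role of the linear constraint can be illustrated with a minimal sketch. It is not the penalized or debiased procedure described in the abstract: it fits plain least squares on log-transformed compositions subject to a constraint matrix `C` with `C.T @ beta = 0` (by default the zero-sum constraint of the log-contrast model), and checks that the fit does not change when each sample is rescaled by an arbitrary positive factor. The function name `fit_log_contrast`, the simulated data, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def fit_log_contrast(Z, y, C=None):
    """Least squares of y on Z subject to C.T @ beta = 0.

    Z : (n, p) matrix of log-transformed compositional covariates.
    C : (p, r) constraint matrix; defaults to a column of ones,
        which gives the zero-sum constraint sum(beta) = 0.
    Illustrative helper only, with no penalty and no debiasing.
    """
    n, p = Z.shape
    if C is None:
        C = np.ones((p, 1))
    # Rows of Vt beyond rank(C) form an orthonormal basis of the null
    # space of C.T, so beta = N @ gamma always satisfies the constraint.
    _, s, Vt = np.linalg.svd(C.T, full_matrices=True)
    rank = int(np.sum(s > 1e-10))
    N = Vt[rank:].T
    # Center to absorb the intercept, then solve the reduced problem.
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    gamma, *_ = np.linalg.lstsq(Zc @ N, yc, rcond=None)
    beta = N @ gamma
    intercept = y.mean() - Z.mean(axis=0) @ beta
    return intercept, beta


# Simulated positive abundances, normalized to compositions (hypothetical data).
rng = np.random.default_rng(0)
n, p = 200, 6
abundance = rng.gamma(shape=2.0, scale=50.0, size=(n, p))
X = abundance / abundance.sum(axis=1, keepdims=True)

beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0, 0.0])  # sums to zero
y = np.log(X) @ beta_true + 0.1 * rng.standard_normal(n)

b0, beta_hat = fit_log_contrast(np.log(X), y)

# Rescale every sample by its own positive factor; under the zero-sum
# constraint the fitted coefficients should be unaffected.
scale = rng.uniform(0.5, 20.0, size=(n, 1))
b0_scaled, beta_hat_scaled = fit_log_contrast(np.log(X * scale), y)

print("sum of coefficients:", beta_hat.sum())
print("scale invariant:", np.allclose(beta_hat, beta_hat_scaled))
```

Passing a `C` with several columns, one indicator column per group of taxa, would impose a separate zero-sum constraint within each group, which is one way to read the abstract's "set of linear constraints"; the paper itself should be consulted for the exact formulation.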

Article information

Ann. Appl. Stat. Volume 10, Number 2 (2016), 1019-1040.

Received: June 2015
Revised: January 2016
First available in Project Euclid: 22 July 2016

Keywords: compositional coherence; coordinate descent method of multipliers; high dimension; log-contrast model; model selection; regularization


Shi, Pixu; Zhang, Anru; Li, Hongzhe. Regression analysis for microbiome compositional data. Ann. Appl. Stat. 10 (2016), no. 2, 1019--1040. doi:10.1214/16-AOAS928.



  • Aitchison, J. (1982). The statistical analysis of compositional data. J. Roy. Statist. Soc. Ser. B 44 139–177.
  • Aitchison, J. (2003). The Statistical Analysis of Compositional Data. Blackburn Press, Caldwell, NJ.
  • Aitchison, J. and Bacon-Shone, J. (1984). Log contrast models for experiments with mixtures. Biometrika 71 323–330.
  • Bertsekas, D. P. (1996). Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, Belmont.
  • Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19 1212–1242.
  • Cornell, J. A. (2002). Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data, 3rd ed. Wiley, New York.
  • Efron, B. (2014). Estimation and accuracy after model selection. J. Amer. Statist. Assoc. 109 991–1007.
  • Grant, M. and Boyd, S. (2013). CVX: Matlab software for disciplined convex programming, version 2.0 beta. Technical report. Available at
  • Huson, D. H., Auch, A. F., Qi, J. and Schuster, S. C. (2007). MEGAN analysis of metagenomic data. Genome Res. 17 377–386.
  • James, G. M., Paulson, C. and Rusmevichientong, P. (2015). Penalized and constrained regression. Unpublished manuscript.
  • Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
  • Kurtz, Z. D., Müller, C. L., Miraldi, E. R., Littman, D. R., Blaser, M. J. and Bonneau, R. A. (2015). Sparse and compositionally robust inference of microbial ecological networks. PLoS Computational Biology 11 e1004226.
  • Lam, Y. Y., Ha, C. W. Y., Campbell, C. R., Mitchell, A. J., Dinudom, A., Oscarsson, J., Cook, D. I., Hunt, N. H., Caterson, I. D., Holmes, A. J. and Storlien, L. H. (2012). Increased gut permeability and microbiota change associate with mesenteric fat inflammation and metabolic dysfunction in diet-induced obese mice. PLoS ONE 7 e34233.
  • Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44 907–927.
  • Ley, R. E., Bäckhed, F., Turnbaugh, P., Lozupone, C. A., Knight, R. D. and Gordon, J. I. (2005). Obesity alters gut microbial ecology. Proc. Natl. Acad. Sci. USA 102 11070–11075.
  • Ley, R. E., Turnbaugh, P. J., Klein, S. and Gordon, J. I. (2006). Microbial ecology: Human gut microbes associated with obesity. Nature 444 1022–1023.
  • Lin, W., Shi, P., Feng, R. and Li, H. (2014). Variable selection in regression with compositional covariates. Biometrika 101 785–797.
  • Manichanh, C., Borruel, N., Casellas, F. and Guarner, F. (2012). The gut microbiota in IBD. Nat. Rev. Gastroenterol. Hepatol. 9 599–608.
  • Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T. et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464 59–65.
  • Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F., Liang, S., Zhang, W., Guan, Y., Shen, D. et al. (2012). A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490 55–60.
  • Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O. and Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9 811–814.
  • Shi, P., Zhang, A. and Li, H. (2016). Supplement to “Regression analysis for microbiome compositional data.” DOI:10.1214/16-AOAS928SUPP.
  • Snee, R. D. (1973). Techniques for the analysis of mixture data. Technometrics 15 517–528.
  • Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99 879–898.
  • Turnbaugh, P. J., Ley, R. E., Mahowald, M. A., Magrini, V., Mardis, E. R. and Gordon, J. I. (2006). An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444 1027–1031.
  • Turnbaugh, P. J., Ley, R. E., Hamady, M., Fraser-Liggett, C. M., Knight, R. and Gordon, J. I. (2007). The human microbiome project. Nature 449 804–810.
  • van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • Walker, A. W., Ince, J., Duncan, S. H., Webster, L. M., Holtrop, G., Ze, X., Brown, D., Stares, M. D., Scott, P., Bergerat, A., Louis, P., McIntosh, F., Johnstone, A. M., Lobley, G. E., Parkhill, J. and Flint, H. J. (2011). Dominant and diet-responsive groups of bacteria within the human colonic microbiota. ISME J. 5 220–230.
  • Wu, G. D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S. A., Bewtra, M., Knights, D., Walters, W. A., Knight, R., Sinha, R., Gilroy, E., Gupta, K., Baldassano, R., Nessel, L., Li, H., Bushman, F. D. and Lewis, J. D. (2011). Linking long-term dietary patterns with gut microbial enterotypes. Science 334 105–108.
  • Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.

Supplemental materials

  • Supplement to “Regression analysis for microbiome compositional data.” The online Supplemental Materials include proofs of all lemmas and theorems [Shi, Zhang and Li (2016)].