The Annals of Applied Statistics

Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis

Jun Chen and Hongzhe Li


Abstract

With the development of next-generation sequencing technology, researchers are now able to study microbiome composition by direct sequencing, whose output is a set of bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of the observed counts. The DM regression model can be used to test the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach that estimates the regression parameters and selects variables by imposing a sparse group $\ell_{1}$ penalty, which encourages both group-level and within-group sparsity. Such a variable selection procedure can select the relevant covariates together with their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that sparse DM regression can identify microbiome-associated covariates better than models that ignore overdispersion or consider only the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results clearly show that nutrient intake is strongly associated with the human gut microbiome.
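
As a rough illustration of the kind of objective the abstract describes, the following Python sketch evaluates a Dirichlet-multinomial log-likelihood and adds a sparse group penalty. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the link function or the exact penalty, so the log-linear link, the unweighted penalty form (lam1 times the sum of group L2 norms plus lam2 times the sum of absolute values), and all function and parameter names are hypothetical. Each group is taken to be one covariate's coefficients across all taxa, which matches the abstract's description of selecting covariates and their associated taxa. The block-coordinate descent solver itself is not shown.

```python
import math

def dm_log_likelihood(counts, alphas):
    """Log-probability of one sample's taxa counts under a
    Dirichlet-multinomial distribution with positive parameters alphas."""
    n = sum(counts)
    a0 = sum(alphas)
    # Multinomial coefficient: n! / prod(y_j!)
    ll = math.lgamma(n + 1) - sum(math.lgamma(y + 1) for y in counts)
    # Gamma(a0) / Gamma(n + a0)
    ll += math.lgamma(a0) - math.lgamma(n + a0)
    # prod_j Gamma(y_j + a_j) / Gamma(a_j)
    ll += sum(math.lgamma(y + a) - math.lgamma(a)
              for y, a in zip(counts, alphas))
    return ll

def dm_alphas(x, intercepts, beta):
    """ASSUMED log-linear link: alpha_j = exp(b0_j + sum_k x_k * beta[k][j]).
    beta[k] holds covariate k's coefficients across all taxa."""
    return [math.exp(b0 + sum(xk * row[j] for xk, row in zip(x, beta)))
            for j, b0 in enumerate(intercepts)]

def sparse_group_penalty(beta, lam1, lam2):
    """ASSUMED penalty form: lam1 * sum of group L2 norms
    + lam2 * sum of absolute values; one group per covariate."""
    group_part = sum(math.sqrt(sum(b * b for b in row)) for row in beta)
    l1_part = sum(abs(b) for row in beta for b in row)
    return lam1 * group_part + lam2 * l1_part

def penalized_objective(Y, X, intercepts, beta, lam1, lam2):
    """Negative DM log-likelihood over all samples plus the penalty."""
    nll = -sum(dm_log_likelihood(y, dm_alphas(x, intercepts, beta))
               for y, x in zip(Y, X))
    return nll + sparse_group_penalty(beta, lam1, lam2)
```

In this sketch, the group term can set an entire row of `beta` to zero (the covariate is dropped), while the elementwise term can zero individual entries within a retained row (the covariate affects only some taxa).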

Article information

Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 418-442.

Dates
First available in Project Euclid: 9 April 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527205

Digital Object Identifier
doi:10.1214/12-AOAS592

Mathematical Reviews number (MathSciNet)
MR3086425

Zentralblatt MATH identifier
06171278

Keywords
Coordinate descent; counts data; overdispersion; regularized likelihood; sparse group penalty

Citation

Chen, Jun; Li, Hongzhe. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. Ann. Appl. Stat. 7 (2013), no. 1, 418--442. doi:10.1214/12-AOAS592. https://projecteuclid.org/euclid.aoas/1365527205


