## The Annals of Applied Statistics

### Bayesian nonparametric dependent model for partially replicated data: The influence of fuel spills on species diversity

#### Abstract

We introduce a dependent Bayesian nonparametric model for the probabilistic modeling of membership of subgroups in a community based on partially replicated data. The focus here is on species-by-site data, that is, community data where observations at different sites are classified into distinct species. Our aim is to study the impact of additional covariates, for instance, environmental variables, on the data structure, and in particular on the community diversity. To this end, we introduce dependence a priori across the covariates and show that it improves posterior inference. We use a dependent version of the Griffiths–Engen–McCloskey distribution defined via the stick-breaking construction. This distribution is obtained by transforming a Gaussian process whose covariance function controls the desired dependence. The resulting posterior distribution is sampled by Markov chain Monte Carlo. We illustrate the application of our model to a soil microbial data set acquired across a hydrocarbon contamination gradient at the site of a fuel spill in Antarctica. This method allows for inference on a number of quantities of interest in ecotoxicology, such as diversity or effective concentrations, and is broadly applicable to the general problem of community response to environmental variables.
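
The construction summarized in the abstract can be illustrated with a short prior simulation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes a squared-exponential covariance with a small nugget, a Beta(1, M) marginal for each stick obtained through the probit transform, and a finite truncation of the stick-breaking sequence. The function names, the parameters `mass`, `length_scale` and `nugget`, and the choice of the Shannon index as the diversity measure are all hypothetical choices made for illustration.

```python
import math
import numpy as np


def squared_exponential_cov(x, length_scale=0.3, nugget=1e-6):
    """Squared-exponential covariance on covariate values x (assumed kernel).

    The nugget is added to the diagonal for numerical stability.
    """
    diff = x[:, None] - x[None, :]
    cov = np.exp(-0.5 * (diff / length_scale) ** 2)
    return cov + nugget * np.eye(len(x))


def dependent_stick_breaking(x, n_sticks=50, mass=2.0, length_scale=0.3, seed=0):
    """Draw truncated stick-breaking weights p_j(x) that vary smoothly with x.

    Returns an array of shape (n_sticks, len(x)); column i holds the
    weights at covariate value x[i].
    """
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(squared_exponential_cov(x, length_scale))
    # One independent Gaussian process path per stick, with (approximately)
    # standard normal marginals at every covariate value.
    z = (chol @ rng.standard_normal((len(x), n_sticks))).T
    # Probit transform: Phi(z) is marginally uniform on (0, 1).
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    # Inverse CDF of Beta(1, mass): F^{-1}(u) = 1 - (1 - u)^(1 / mass).
    v = 1.0 - (1.0 - u) ** (1.0 / mass)
    # Stick-breaking: p_j = v_j * prod_{l < j} (1 - v_l), at each x.
    remaining = np.cumprod(1.0 - v, axis=0)
    remaining = np.vstack([np.ones((1, len(x))), remaining[:-1]])
    return v * remaining


def shannon_diversity(weights):
    """Shannon index -sum_j p_j log p_j at each covariate value."""
    p = weights / weights.sum(axis=0)  # renormalize the truncated weights
    safe = np.where(p > 0, p, 1.0)     # avoid log(0); those terms contribute 0
    return -(p * np.log(safe)).sum(axis=0)


# Usage: weights and diversity along a hypothetical contamination gradient.
x = np.linspace(0.0, 1.0, 15)
w = dependent_stick_breaking(x, seed=1)
print(shannon_diversity(w))
```

Because each stick shares one Gaussian process path across covariate values, weights at nearby values of `x` are correlated, while the marginal distribution at any single value remains that of a truncated stick-breaking prior. The length scale of the assumed kernel governs how quickly that correlation decays along the gradient.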

#### Article information

Source
Ann. Appl. Stat., Volume 10, Number 3 (2016), 1496–1516.

Dates
Revised: February 2016
First available in Project Euclid: 28 September 2016

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1475069616

Digital Object Identifier
doi:10.1214/16-AOAS944

Mathematical Reviews number (MathSciNet)
MR3553233

Zentralblatt MATH identifier
06775275

#### Citation

Arbel, Julyan; Mengersen, Kerrie; Rousseau, Judith. Bayesian nonparametric dependent model for partially replicated data: The influence of fuel spills on species diversity. Ann. Appl. Stat. 10 (2016), no. 3, 1496–1516. doi:10.1214/16-AOAS944. https://projecteuclid.org/euclid.aoas/1475069616

#### References

• Aitchison, J. (1982). The statistical analysis of compositional data. J. Roy. Statist. Soc. Ser. B 44 139–177.
• Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, London.
• Aitchison, J. (1994). Principles of compositional data analysis. In Multivariate Analysis and Its Applications (Hong Kong, 1992). Institute of Mathematical Statistics Lecture Notes—Monograph Series 24 73–81. IMS, Hayward, CA.
• Alston, C. L., Mengersen, K. L. and Gardner, G. E. (2011). Bayesian mixture models: A blood-free dissection of a sheep. In Mixtures: Estimation and Applications (K. Mengersen, C. P. Robert and M. Titterington, eds.) 293–308. Wiley, Chichester.
• Andrianakis, I. and Challenor, P. G. (2012). The effect of the nugget on Gaussian process emulators of computer models. Comput. Statist. Data Anal. 56 4215–4228.
• Arbel, J. (2013). Contributions to Bayesian nonparametric statistics. Ph.D. thesis, Univ. Paris-Dauphine.
• Arbel, J., Mengersen, K. and Rousseau, J. (2016). Supplement to “Bayesian nonparametric dependent model for partially replicated data: The influence of fuel spills on species diversity.” DOI:10.1214/16-AOAS944SUPP.
• Arbel, J., King, C. K., Raymond, B., Winsley, T. and Mengersen, K. L. (2015). Application of a Bayesian nonparametric model to derive toxicity estimates based on the response of Antarctic microbial communities to fuel-contaminated soil. Ecol. Evol. 5 2633–2645.
• Arbel, J., Favaro, S., Nipoti, B. and Teh, Y. W. (2016). Bayesian nonparametric inference for discovery probabilities: Credible intervals and large sample asymptotics. Statist. Sinica. To appear. Available at arXiv:1506.04915.
• Barrientos, A. F., Jara, A. and Quintana, F. A. (2012). On the support of MacEachern’s dependent Dirichlet processes and extensions. Bayesian Anal. 7 277–309.
• Barrientos, A. F., Jara, A. and Quintana, F. A. (2015). Bayesian density estimation for compositional data using random Bernstein polynomials. J. Statist. Plann. Inference 166 116–125.
• Bohlin, J., Skjerve, E. and Ussery, D. (2009). Analysis of genomic signatures in prokaryotes using multinomial regression and hierarchical clustering. BMC Genomics 10 487.
• Borges, E. P. and Roditi, I. (1998). A family of nonextensive entropies. Phys. Lett. A 246 399–402.
• Broms, K. M., Hooten, M. B. and Fitzpatrick, R. M. (2015). Accounting for imperfect detection in Hill numbers for biodiversity studies. Methods in Ecology and Evolution 6 99–108.
• Calabrese, E. J. (2005). Paradigm lost, paradigm found: The re-emergence of hormesis as a fundamental dose response model in the toxicological sciences. Environ. Pollut. 138 378–411.
• Caron, F., Davy, M. and Doucet, A. (2007). Generalized Pólya urn for time-varying Dirichlet process mixtures. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI’2007). Vancouver, Canada.
• Cerquetti, A. (2014). Bayesian nonparametric estimation of Patil–Taillie–Tsallis diversity under Gnedin–Pitman priors. Preprint. Available at arXiv:1404.3441.
• Chung, Y. and Dunson, D. B. (2011). The local Dirichlet process. Ann. Inst. Statist. Math. 63 59–80.
• Colwell, R. K., Chao, A., Gotelli, N. J., Lin, S.-Y., Mao, C. X., Chazdon, R. L. and Longino, J. T. (2012). Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology 5 3–21.
• Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley, New York.
• De’ath, G. (2012). The multinomial diversity model: Linking Shannon diversity to multiple predictors. Ecology 93 2286–2296.
• Donnelly, P. and Grimmett, G. (1993). On the asymptotic distribution of large prime factors. J. Lond. Math. Soc. (2) 47 395–404.
• Dorazio, R. M., Mukherjee, B., Zhang, L., Ghosh, M., Jelks, H. L. and Jordan, F. (2008). Modeling unobserved sources of heterogeneity in animal abundance using a Dirichlet process prior. Biometrics 64 635–644, 670–671.
• Dunson, D. B. and Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika 95 307–323.
• Dunson, D. B., Pillai, N. and Park, J.-H. (2007). Bayesian density regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 163–183.
• Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. J. Amer. Statist. Assoc. 104 1042–1051.
• Dunstan, P. K., Foster, S. D. and Darnell, R. (2011). Model based grouping of species across environmental gradients. Ecol. Model. 222 955–963.
• Ellis, N., Smith, S. J. and Pitcher, C. R. (2011). Gradient forests: Calculating importance gradients on physical predictors. Ecology 93 156–168.
• Favaro, S., Lijoi, A. and Prünster, I. (2012). A new estimator of the discovery probability. Biometrics 68 1188–1196.
• Favaro, S., Nipoti, B. and Teh, Y. W. (2016). Rediscovery of Good–Turing estimators via Bayesian nonparametrics. Biometrics 72 136–145.
• Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
• Ferrier, S. and Guisan, A. (2006). Spatial modelling of biodiversity at the community level. J. Appl. Ecol. 43 393–404.
• Ferrier, S., Manion, G., Elith, J. and Richardson, K. (2007). Using generalized dissimilarity modelling to analyse and predict patterns of beta diversity in regional biodiversity assessment. Divers. Distrib. 13 252–264.
• Fordyce, J. A., Gompert, Z., Forister, M. L. and Nice, C. C. (2011). A hierarchical Bayesian approach to ecological count data: A flexible tool for ecologists. PLoS ONE 6 e26785.
• Foster, S. D. and Dunstan, P. K. (2010). The analysis of biodiversity using rank abundance distributions. Biometrics 66 186–195.
• Gelfand, A. E. (1996). Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice 145–161. Chapman & Hall, London.
• George, A. W., Mengersen, K. and Davis, G. P. (2000). Localization of a quantitative trait locus via a Bayesian approach. Biometrics 56 40–51.
• Gibbs, M. N. (1997). Bayesian Gaussian processes for regression and classification. Ph.D. thesis, Citeseer.
• Gill, C. A. and Joanes, D. N. (1979). Bayesian estimation of Shannon’s index of diversity. Biometrika 66 81–85.
• Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40 237–264.
• Griffin, J. E. and Steel, M. F. J. (2006). Order-based dependent Dirichlet processes. J. Amer. Statist. Assoc. 101 179–194.
• Griffin, J. E. and Steel, M. F. J. (2011). Stick-breaking autoregressive processes. J. Econometrics 162 383–396.
• Havrda, J. and Charvát, F. (1967). Quantification method of classification processes. Concept of structural $a$-entropy. Kybernetika (Prague) 3 30–35.
• Hill, M. O. (1973). Diversity and evenness: A unifying notation and its consequences. Ecology 54 427–432.
• Holmes, I., Harris, K. and Quince, C. (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE 7 e30126.
• Johnson, D. S., Ream, R. R., Towell, R. G., Williams, M. T. and Leon Guerrero, J. D. (2013). Bayesian clustering of animal abundance trends for inference and dimension reduction. J. Agric. Biol. Environ. Stat. 18 299–313.
• Kaniadakis, G., Lissia, M. and Scarfone, A. M. (2005). Two-parameter deformations of logarithm, exponential, and entropy: A consistent framework for generalized statistical mechanics. Phys. Rev. E (3) 71 046128, 12.
• Li, H. (2015). Microbiome, metagenomics and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application 2 73–94.
• Lijoi, A., Mena, R. H. and Prünster, I. (2007). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 94 769–786.
• Lijoi, A., Nipoti, B. and Prünster, I. (2014a). Bayesian inference with dependent normalized completely random measures. Bernoulli 20 1260–1291.
• Lijoi, A., Nipoti, B. and Prünster, I. (2014b). Dependent mixture models: Clustering and borrowing information. Comput. Statist. Data Anal. 71 417–433.
• Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S. and Bähler, J. (2015). Proportionality: A valid alternative to correlation for relative data. PLoS Comput. Biol. 11 e1004075.
• MacEachern, S. N. (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science 50–55. Amer. Statist. Assoc., Alexandria, VA.
• MacEachern, S. N. (2000). Dependent Dirichlet processes. Technical report, Dept. Statistics, The Ohio State Univ.
• Newman, M. C. (2012). Quantitative Ecotoxicology. CRC Press, Boca Raton, FL.
• Pati, D., Dunson, D. B. and Tokdar, S. T. (2013). Posterior consistency in conditional distribution estimation. J. Multivariate Anal. 116 456–472.
• Patil, G. P. and Taillie, C. (1982). Diversity as a concept and its measurement. J. Amer. Statist. Assoc. 77 548–567.
• Pawlowsky-Glahn, V. and Buccianti, A., eds. (2011). Compositional Data Analysis: Theory and Applications. Wiley, Chichester.
• Pitman, J. (2006). Combinatorial Stochastic Processes. Lecture Notes in Math. 1875. Springer, Berlin.
• Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
• Rodríguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Anal. 6 145–177.
• Rodríguez, A., Dunson, D. B. and Gelfand, A. E. (2010). Latent stick-breaking processes. J. Amer. Statist. Assoc. 105 647–659.
• Royle, J. A. and Dorazio, R. M. (2006). Hierarchical models of animal abundance and occurrence. J. Agric. Biol. Environ. Stat. 11 249–263.
• Royle, J. A. and Dorazio, R. M. (2008). Hierarchical Modeling and Inference in Ecology: The Analysis of Data from Populations, Metapopulations and Communities. Academic Press, San Diego, CA.
• Rue, H., Martino, S. and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 319–392.
• Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann, M., Hollister, E. B., Lesniewski, R. A., Oakley, B. B., Parks, D. H., Robinson, C. J., Sahl, J. W., Stres, B., Thallinger, G. G., Horn, D. J. V. and Weber, C. F. (2009). Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75 7537–7541.
• Siciliano, S. D., Palmer, A. S., Winsley, T., Lamb, E., Bissett, A., Brown, M. V., van Dorst, J., Ji, M., Ferrari, B. C., Grogan, P., Chu, H. and Snape, I. (2014). Soil fertility is associated with fungal and bacterial richness, whereas pH is associated with community composition in polar soil microbial communities. Soil Biol. Biochem. 78 10–20.
• Snape, I., Siciliano, S. D., Winsley, T., van Dorst, J., Mukan, J., Palmer, A. S. and Lagerewskij, G. (2015). Operational Taxonomic Unit (OTU) Microbial Ecotoxicology Data from Macquarie Island and Casey Station: TPH, Chemistry and OTU Abundance Data. Australian Antarctic Data Centre.
• van den Boogaart, K. G. and Tolosana-Delgado, R. (2013). Fundamental concepts of compositional data analysis. In Analyzing Compositional Data with R, Use R!. Springer, Heidelberg.
• van der Vaart, A. W. and van Zanten, J. H. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann. Statist. 37 2655–2675.
• Wang, Y., Naumann, U., Wright, S. T. and Warton, D. I. (2012). mvabund—An R package for model-based analysis of multivariate abundance data. Methods in Ecology and Evolution 3 471–474.

#### Supplemental materials

• Supplement to “Bayesian nonparametric dependent model for partially replicated data: The influence of fuel spills on species diversity”. The supplementary material contains details about posterior computation and inference in the Dep-GEM model, additional results and omitted proofs that complement the analysis of the main text. It is available as Arbel, Mengersen and Rousseau (2016).