The Annals of Applied Statistics

Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria

Marko Järvenpää, Michael U. Gutmann, Aki Vehtari, and Pekka Marttinen

Full-text: Open access

Abstract

Approximate Bayesian computation (ABC) can be used for model fitting when the likelihood function is intractable but simulating from the model is feasible. However, even a single evaluation of a complex model may take several hours, limiting the number of model evaluations available. Modelling the discrepancy between the simulated and observed data with a Gaussian process (GP) can reduce the number of model evaluations required by ABC, but the sensitivity of this approach to the specific GP formulation has not yet been thoroughly investigated. We begin with a comprehensive empirical evaluation of using GPs in ABC, including various transformations of the discrepancies and two novel GP formulations. Our results indicate that the choice of GP may significantly affect the accuracy of the estimated posterior distribution, so selecting an appropriate GP model is important. We formulate an expected utility that measures the accuracy of classifying discrepancies as below or above the ABC threshold, and show that it can be used to automate the GP model selection step. Finally, based on the understanding gained with toy examples, we fit a population genetic model for bacteria, providing insight into horizontal gene transfer events within the population and from external origins.
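
As a rough illustration of the idea summarised in the abstract, the sketch below fits a GP to simulated discrepancies and uses it to estimate the probability that the discrepancy at a parameter value falls below the ABC threshold. It is not the authors' implementation: the toy simulator, the names `simulate_discrepancy`, `fit_gp` and `prob_below`, the fixed kernel hyperparameters, the constant noise variance and the threshold value are all assumptions chosen for illustration, and the GP here is a plain homoscedastic one rather than any of the formulations compared in the paper.

```python
import math
import random

random.seed(0)

OBS_MEAN = 2.0    # hypothetical observed summary statistic
EPS = 0.5         # hypothetical ABC threshold
NOISE_VAR = 0.05  # assumed constant GP noise variance (fixed, not estimated)


def simulate_discrepancy(theta, n=20):
    """Toy simulator: draw n points from N(theta, 1) and compare sample means."""
    sim_mean = sum(random.gauss(theta, 1.0) for _ in range(n)) / n
    return abs(sim_mean - OBS_MEAN)


def kernel(a, b, variance=4.0, lengthscale=1.5):
    """Squared-exponential covariance with assumed, fixed hyperparameters."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def solve_upper_t(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(xs, ys):
    """Fit a GP with constant mean (the sample mean of ys) and fixed noise."""
    mu = sum(ys) / len(ys)
    K = [[kernel(a, b) + (NOISE_VAR if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [y - mu for y in ys]))
    return mu, L, alpha


def predict(theta, xs, gp):
    """Posterior mean and variance of the latent discrepancy function at theta."""
    mu, L, alpha = gp
    k_star = [kernel(theta, x) for x in xs]
    mean = mu + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(kernel(theta, theta) - sum(t * t for t in v), 0.0)
    return mean, var


def prob_below(theta, xs, gp, eps=EPS):
    """Gaussian approximation to P(discrepancy <= eps) at theta."""
    mean, var = predict(theta, xs, gp)
    z = (eps - mean) / math.sqrt(var + NOISE_VAR)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# A small budget of simulator runs on a parameter grid stands in for the
# expensive model evaluations.
train_thetas = [-2.0 + 0.5 * i for i in range(17)]
train_disc = [simulate_discrepancy(t) for t in train_thetas]
gp = fit_gp(train_thetas, train_disc)
```

Under a uniform prior, `prob_below(theta, train_thetas, gp)` evaluated over a grid of `theta` values gives an unnormalised approximation to the ABC posterior. How well it does so depends on the GP assumptions made above, which is the sensitivity the abstract refers to.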

Article information

Source
Ann. Appl. Stat., Volume 12, Number 4 (2018), 2228–2251.

Dates
Received: October 2016
Revised: November 2017
First available in Project Euclid: 13 November 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1542078043

Digital Object Identifier
doi:10.1214/18-AOAS1150

Mathematical Reviews number (MathSciNet)
MR3875699

Keywords
Approximate Bayesian computation; intractable likelihood; Gaussian process; input-dependent noise; model selection

Citation

Järvenpää, Marko; Gutmann, Michael U.; Vehtari, Aki; Marttinen, Pekka. Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria. Ann. Appl. Stat. 12 (2018), no. 4, 2228–2251. doi:10.1214/18-AOAS1150. https://projecteuclid.org/euclid.aoas/1542078043



