Statistical Science

Estimating Transmission from Genetic and Epidemiological Data: A Metric to Compare Transmission Trees

Michelle Kendall, Diepreye Ayabina, Yuanwei Xu, James Stimson, and Caroline Colijn

Reconstructing who infected whom is a central challenge in analysing epidemiological data. Recently, advances in sequencing technology have led to increasing interest in Bayesian approaches to inferring who infected whom using genetic data from pathogens. The logic behind such approaches is that isolates that are nearly genetically identical are more likely to have been recently transmitted than those that are very different. A number of methods have been developed to perform this inference. However, testing their convergence, examining posterior sets of transmission trees and comparing methods’ performance are challenged by the fact that the object of inference—the transmission tree—is a complicated discrete structure. We introduce a metric on transmission trees to quantify distances between them. The metric can accommodate trees with unsampled individuals, and highlights differences in the source case and in the number of infections per infector. We illustrate its performance on simple simulated scenarios and on posterior transmission trees from a TB outbreak. We find that the metric reveals where the posterior is sensitive to the priors, and where collections of trees are composed of distinct clusters. We use the metric to define median trees summarising these clusters. Quantitative tools to compare transmission trees to each other will be required for assessing MCMC convergence, exploring posterior trees and benchmarking diverse methods as this field continues to mature.

Article information

Statist. Sci., Volume 33, Number 1 (2018), 70-85.

First available in Project Euclid: 2 February 2018

Digital Object Identifier: doi:10.1214/17-STS637

Keywords: Infectious diseases; genomics; epidemiology; Bayesian inference; modelling


Kendall, Michelle; Ayabina, Diepreye; Xu, Yuanwei; Stimson, James; Colijn, Caroline. Estimating Transmission from Genetic and Epidemiological Data: A Metric to Compare Transmission Trees. Statist. Sci. 33 (2018), no. 1, 70-85. doi:10.1214/17-STS637.

