Reconstructing who infected whom is a central challenge in analysing epidemiological data. Advances in sequencing technology have led to increasing interest in Bayesian approaches that infer who infected whom from pathogen genetic data. The logic behind such approaches is that isolates that are nearly genetically identical are more likely to be linked by recent transmission than isolates that are very different. A number of methods have been developed to perform this inference. However, testing their convergence, examining posterior sets of transmission trees and comparing methods’ performance are all complicated by the fact that the object of inference, the transmission tree, is a complicated discrete structure. We introduce a metric on transmission trees to quantify distances between them. The metric can accommodate trees with unsampled individuals, and highlights differences in the source case and in the number of infections per infector. We illustrate its performance on simple simulated scenarios and on posterior transmission trees from a tuberculosis (TB) outbreak. We find that the metric reveals where the posterior is sensitive to the priors, and where collections of trees are composed of distinct clusters. We use the metric to define median trees summarising these clusters. Quantitative tools to compare transmission trees to each other will be required for assessing Markov chain Monte Carlo (MCMC) convergence, exploring posterior trees and benchmarking diverse methods as this field continues to mature.
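To make the idea of a distance between transmission trees concrete, the following is a minimal illustrative sketch in Python, not the definition given in the paper. It assumes, purely for illustration, that each tree is stored as a dictionary mapping every case to its infector (with `None` for the source case), that each tree has a single source, and that a tree is summarised by a vector recording, for every pair of sampled cases, how many transmission events separate the source from the pair's most recent common infector. The distance is then taken to be the Euclidean distance between two such vectors, and a "median" tree is taken to be the tree whose vector lies closest to the mean vector of a collection. All function names (`ancestors`, `mrci_depth_vector`, `tree_distance`, `median_tree`) and the toy trees are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def ancestors(infector, case):
    """Return the chain [case, its infector, ..., source case]."""
    chain = [case]
    while infector[chain[-1]] is not None:
        chain.append(infector[chain[-1]])
    return chain


def mrci_depth_vector(infector, sampled):
    """For each pair of sampled cases (including a case paired with itself),
    record the depth of their most recent common infector, measured as the
    number of transmission events from the source case.
    Assumption: a case counts as its own 'common infector' when paired with
    itself or with one of its descendants."""
    chains = {c: ancestors(infector, c) for c in sampled}
    vec = []
    for a, b in combinations_with_replacement(sorted(sampled), 2):
        on_b_chain = set(chains[b])
        # Walking up from a, the first case also on b's chain is the
        # most recent common infector of the pair.
        mrci = next(c for c in chains[a] if c in on_b_chain)
        vec.append(len(ancestors(infector, mrci)) - 1)
    return vec


def tree_distance(tree1, tree2, sampled):
    """Euclidean distance between the two trees' depth vectors."""
    return math.dist(mrci_depth_vector(tree1, sampled),
                     mrci_depth_vector(tree2, sampled))


def median_tree(trees, sampled):
    """Index of the tree whose vector is closest to the mean vector
    (an assumed, simple notion of a median tree)."""
    vectors = [mrci_depth_vector(t, sampled) for t in trees]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return min(range(len(trees)), key=lambda i: math.dist(vectors[i], mean))


# Toy trees on the same four sampled cases (hypothetical data).
chain = {"A": None, "B": "A", "C": "B", "D": "C"}   # A -> B -> C -> D
star = {"A": None, "B": "A", "C": "A", "D": "A"}    # A infects B, C and D
sampled = ["A", "B", "C", "D"]

print(tree_distance(chain, star, sampled))
print(median_tree([chain, star, star], sampled))
```

Because only pairs of sampled cases enter the vector, a tree containing extra unsampled cases can be compared with one that has none, provided both trees contain all the sampled cases. Under these assumptions, changing the source case or concentrating infections on fewer infectors shifts many entries of the vector at once, which is the kind of difference the abstract says the metric highlights.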
"Estimating Transmission from Genetic and Epidemiological Data: A Metric to Compare Transmission Trees." Statist. Sci. 33 (1) 70 - 85, February 2018. https://doi.org/10.1214/17-STS637