The Annals of Applied Statistics

Fitting birth–death processes to panel data with applications to bacterial DNA fingerprinting

Charles R. Doss, Marc A. Suchard, Ian Holmes, Midori Kato-Maeda, and Vladimir N. Minin

Full-text: Open access


Continuous-time linear birth–death-immigration (BDI) processes are frequently used in ecology and epidemiology to model stochastic dynamics of the population of interest. In clinical settings, multiple birth–death processes can describe disease trajectories of individual patients, allowing for estimation of the effects of individual covariates on the birth and death rates of the process. Such estimation is usually accomplished by analyzing patient data collected at unevenly spaced time points, referred to as panel data in the biostatistics literature. Fitting linear BDI processes to panel data is a nontrivial optimization problem because birth and death rates can be functions of many parameters related to the covariates of interest. We propose a novel expectation–maximization (EM) algorithm for fitting linear BDI models with covariates to panel data. We derive a closed-form expression for the joint generating function of some of the BDI process statistics and use this generating function to reduce the E-step of the EM algorithm, as well as calculation of the Fisher information, to one-dimensional integration. This analytical technique yields a computationally efficient and robust optimization algorithm that we implemented in an open-source R package. We apply our method to DNA fingerprinting of Mycobacterium tuberculosis, the causative agent of tuberculosis, to study intrapatient time evolution of IS6110 copy number, a genetic marker frequently used during estimation of epidemiological clusters of Mycobacterium tuberculosis infections. Our analysis reveals previously undocumented differences in IS6110 birth–death rates among three major lineages of Mycobacterium tuberculosis, which has important implications for epidemiologists that use IS6110 for DNA fingerprinting of Mycobacterium tuberculosis.

Article information

Ann. Appl. Stat., Volume 7, Number 4 (2013), 2315-2335.

First available in Project Euclid: 23 December 2013

Permanent link to this document

Digital Object Identifier

Mathematical Reviews number (MathSciNet)

Zentralblatt MATH identifier

Missing data EM algorithm transposable element IS6110 tuberculosis


Doss, Charles R.; Suchard, Marc A.; Holmes, Ian; Kato-Maeda, Midori; Minin, Vladimir N. Fitting birth–death processes to panel data with applications to bacterial DNA fingerprinting. Ann. Appl. Stat. 7 (2013), no. 4, 2315--2335. doi:10.1214/13-AOAS673.

Export citation


  • Alonso, H., Aguilo, J. I., Samper, S., Caminero, J. A., Campos-Herrero, M. I., Gicquel, B., Brosch, R., Martín, C. and Otal, I. (2011). Deciphering the role of IS6110 in a highly transmissible Mycobacterium tuberculosis Beijing strain, GC1237. Tuberculosis 91 117–126.
  • Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 164–171.
  • Cattamanchi, A., Hopewell, P. C., Gonzalez, L. C., Osmond, D. H., Masae, Kawamura, L., Daley, C. L. and Jasmer, R. M. (2006). A 13-year molecular epidemiological analysis of tuberculosis in San Francisco. The International Journal of Tuberculosis and Lung Disease 10 297–304.
  • Crespi, C. M., Cumberland, W. G. and Blower, S. (2005). A queueing model for chronic recurrent conditions under panel observation. Biometrics 61 193–198.
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 1–38.
  • Dorman, K. S., Sinsheimer, J. S. and Lange, K. (2004). In the garden of branching processes. SIAM Rev. 46 202–229 (electronic).
  • Doss, C. R., Suchard, M. A., Holmes, I., Kato-Maeda, M. and Minin, V. N. (2013). Supplement to “Fitting birth–death processes to panel data with applications to bacterial DNA fingerprinting.” DOI:10.1214/13-AOAS673SUPP.
  • Gagneux, S., DeRiemer, K., Van, T., Kato-Maeda, M., de Jong, B. C., Narayanan, S., Nicol, M., Niemann, S., Kremer, K., Gutierrez, M. C., Hilty, M., Hopewell, P. C. and Small, P. M. (2006). Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. USA 103 2869–2873.
  • Gibson, G. J. and Renshaw, E. (1998). Estimating parameters in stochastic compartmental models using Markov chain methods. IMA Journal of Mathematics Applied in Medicine & Biology 15 19–40.
  • Golinelli, D. (2000). Bayesian inference in hidden stochastic population processes. Ph.D. thesis, Univ. Washington, Seattle, WA.
  • Guttorp, P. (1995). Stochastic Modeling of Scientific Data. Chapman & Hall, London.
  • Henrici, P. (1979). Fast Fourier methods in computational complex analysis. SIAM Rev. 21 481–527.
  • Holmes, I. (2005). Using evolutionary expectation maximization to estimate indel rates. Bioinformatics 21 2294–2300.
  • Holmes, I. and Rubin, G. M. (2002). An expectation maximization algorithm for training hidden substitution models. Journal of Molecular Biology 317 753–764.
  • Jackson, C. H. (2011). Multi-state models for panel data: The msm package for R. Journal of Statistical Software 38 1–29.
  • Jasmer, R. M., Hahn, J. A., Small, P. M., Daley, C. L., Behr, M. A., Moss, A. R., Creasman, J. M., Schecter, G. F., Paz, E. A. and Hopewell, P. C. (1999). A molecular epidemiologic analysis of tuberculosis trends in San Francisco, 1991–1997. Annals of Internal Medicine 130 971–978.
  • Kalbfleisch, J. D. and Lawless, J. F. (1985). The analysis of panel data under a Markov assumption. J. Amer. Statist. Assoc. 80 863–871.
  • Karlin, S. and McGregor, J. (1958). Linear growth birth and death processes. J. Math. Mech. 7 643–662.
  • Kato-Maeda, M., Metcalfe, J. Z. and Flores, L. (2011). Genotyping of Mycobacterium tuberculosis: Application in epidemiologic studies. Future Microbiol. 6 203–216.
  • Keiding, N. (1975). Maximum likelihood estimation in the birth-and-death process. Ann. Statist. 3 363–372.
  • Kendall, D. G. (1948). On the generalized “birth-and-death” process. Ann. Math. Statist. 19 1–15.
  • Lange, K. (1982). Calculation of the equilibrium distribution for a deleterious gene by the finite Fourier transform. Biometrics 38 79–86.
  • Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 425–437.
  • Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 44 226–233.
  • McEvoy, C. R. E., Falmer, A. A., van Pittius, N. C. G., Victor, T. C., van Helden, P. D. and Warren, R. M. (2007). The role of IS6110 in the evolution of Mycobacterium tuberculosis. Tuberculosis 87 393–404.
  • Minin, V. N. and Suchard, M. A. (2008). Counting labeled transitions in continuous-time Markov models of evolution. J. Math. Biol. 56 391–412.
  • Nee, S. (2006). Birth–death models in macroevolution. Annual Review of Ecology, Evolution, and Systematics 37 1–17.
  • Roberts, W. J. J. and Ephraim, Y. (2008). An EM algorithm for ion-channel current estimation. IEEE Trans. Signal Process. 56 26–33.
  • Rosenberg, N. A., Tsolaki, A. G. and Tanaka, M. M. (2003). Estimating change rates of genetic markers using serial samples: Applications to the transposon IS6110 in Mycobacterium tuberculosis. Theoretical Population Biology 63 347–363.
  • Sehl, M., Zhou, H., Sinsheimer, J. S. and Lange, K. L. (2011). Extinction models for cancer stem cell therapy. Math. Biosci. 234 132–146.
  • Small, P. M., Hopewell, P. C., Singh, S. P., Paz, A., Parsonnet, J., Ruston, D. C., Schecter, G. F., Daley, C. L. and Schoolnik, G. K. (1994). The epidemiology of tuberculosis in San Francisco. A population-based study using conventional and molecular methods. New England Journal of Medicine 330 1703–1709.
  • Suchard, M. A., Lange, K. and Sinsheimer, J. S. (2008). Efficiency of protein production from mRNA. J. Stat. Theory Pract. 2 173–182.
  • Tanaka, M. M. and Rosenberg, N. A. (2001). Optimal estimation of transposition rates of insertion sequences for molecular epidemiology. Stat. Med. 20 2409–2420.
  • Thorne, J. L., Kishino, H. and Felsenstein, J. (1991). An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33 114–124.
  • van Embden, J. D., Cave, M. D., Crawford, J. T., Dale, J. W., Eisenach, K. D., Gicquel, B., Hermans, P., Martin, C., McAdam, R., Shinnick, T. M. et al. (1993). Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: Recommendations for a standardized methodology. J. Clin. Microbiol. 31 406–409.
  • Warren, R. M., van der Spuy, G. D., Richardson, M., Beyers, N., Booysen, C., Behr, M. A. and van Helden, P. D. (2002). Evolution of the IS6110-based restriction fragment length polymorphism pattern during the transmission of Mycobacterium tuberculosis. J. Clin. Microbiol. 40 1277–1282.

Supplemental materials