The Annals of Applied Probability

The diversity of a distributed genome in bacterial populations

F. Baumdicker, W. R. Hess, and P. Pfaffelhuber



The distributed genome hypothesis states that the set of genes in a population of bacteria is distributed over all individuals that belong to the specific taxon. It implies that certain genes can be gained and lost from generation to generation. We use the random genealogy given by a Kingman coalescent in order to superimpose events of gene gain and loss along ancestral lines. Gene gains occur at a constant rate along ancestral lines. We assume that gained genes have never been present in the population before. Gene losses occur at a rate proportional to the number of genes present along the ancestral line. In this infinitely many genes model we derive moments for several statistics within a sample: the average number of genes per individual, the average number of genes differing between individuals, the number of incongruent pairs of genes, the total number of different genes in the sample and the gene frequency spectrum. We demonstrate that the model gives a reasonable fit with gene frequency data from marine cyanobacteria.
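The model described above can be explored by simulation. The following is a minimal sketch, not the authors' method: it draws a Kingman coalescent for a sample of size n, places gene gains along the branches, and lets each gene be lost independently on every branch below its point of origin. The parameter names `theta` and `rho`, the rates `theta / 2` (gain per lineage) and `rho / 2` (loss per gene), the unit pair-coalescence rate, and the Poisson(`theta / rho`) number of genes in the most recent common ancestor are assumptions made for this illustration and are not stated in the abstract.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson(mean) variate by counting unit-rate arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def kingman_tree(n, rng):
    """Binary coalescent tree for leaves 0..n-1 (height 0).

    Each pair of lineages merges at rate 1 (assumed time scaling).
    Returns (children, height, root).
    """
    children, height = {}, {i: 0.0 for i in range(n)}
    active, t, next_id = list(range(n)), 0.0, n
    while len(active) > 1:
        k = len(active)
        t += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(active, 2)
        active.remove(a)
        active.remove(b)
        children[next_id], height[next_id] = (a, b), t
        active.append(next_id)
        next_id += 1
    return children, height, active[0]


def simulate_gene_counts(n, theta, rho, rng):
    """For each gene present in the sample, return how many individuals carry it."""
    children, height, root = kingman_tree(n, rng)

    def descend(node, h):
        # The gene exists at height h on the branch that ends in `node`.
        # It survives down to `node` with probability exp(-rho/2 * branch part).
        if rng.random() >= math.exp(-rho / 2 * (h - height[node])):
            return 0
        if node not in children:  # a leaf, i.e. a sampled individual
            return 1
        return sum(descend(c, height[node]) for c in children[node])

    counts = []
    # Assumption: the root carries a stationary Poisson(theta/rho) number of genes.
    for _ in range(poisson(theta / rho, rng)):
        counts.append(descend(root, height[root]))
    # Gains along every branch at rate theta/2; each gain is a brand-new gene.
    for parent, kids in children.items():
        for child in kids:
            h = height[child] + rng.expovariate(theta / 2)
            while h < height[parent]:
                counts.append(descend(child, h))
                h += rng.expovariate(theta / 2)
    return [c for c in counts if c > 0]


if __name__ == "__main__":
    rng = random.Random(1)
    n = 5
    counts = simulate_gene_counts(n, theta=10.0, rho=2.0, rng=rng)
    spectrum = [counts.count(k) for k in range(1, n + 1)]
    print("gene frequency spectrum:", spectrum)
    print("average genes per individual:", sum(counts) / n)
```

Under these assumptions, `len(counts)` is the total number of different genes in the sample, `spectrum[k - 1]` is the number of genes carried by exactly k individuals, and `sum(counts) / n` is the average number of genes per individual. Averaged over many replicates, the last quantity should settle near `theta / rho`, the stationary mean of the gain and loss process along a single line.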

Article information

Ann. Appl. Probab. Volume 20, Number 5 (2010), 1567-1606.

First available in Project Euclid: 25 August 2010


Primary: 92D15: Problems related to evolution; 60J70: Applications of Brownian motions and diffusion theory (population genetics, absorption problems, etc.) [See also 92Dxx]; 92D20: Protein sequences, DNA sequences
Secondary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]

Keywords: Kingman’s coalescent; infinitely many genes model; infinitely many sites model; gene content


Baumdicker, F.; Hess, W. R.; Pfaffelhuber, P. The diversity of a distributed genome in bacterial populations. Ann. Appl. Probab. 20 (2010), no. 5, 1567--1606. doi:10.1214/09-AAP657.


