Statistical Science

Bootstrapping Phylogenetic Trees: Theory and Methods

Susan Holmes


This is a survey of the use of the bootstrap in the area of systematic and evolutionary biology. I present the current usage by biologists of the bootstrap as a tool both for making inferences and for evaluating robustness, and propose a framework for thinking about these problems in terms of mathematical statistics.

Article information

Statist. Sci. Volume 18, Issue 2 (2003), 241-255.

First available: 19 September 2003

Holmes, Susan. Bootstrapping Phylogenetic Trees: Theory and Methods. Statistical Science 18 (2003), no. 2, 241--255. doi:10.1214/ss/1063994979.


