Statistical Science

Bootstrapping Phylogenetic Trees: Theory and Methods

Susan Holmes
Source: Statist. Sci. Volume 18, Issue 2 (2003), 241-255.

Abstract

This is a survey of the use of the bootstrap in the area of systematic and evolutionary biology. I present the current usage by biologists of the bootstrap as a tool both for making inferences and for evaluating robustness, and propose a framework for thinking about these problems in terms of mathematical statistics.

Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ss/1063994979
Digital Object Identifier: doi:10.1214/ss/1063994979
Mathematical Reviews number (MathSciNet): MR2026083


2012 © Institute of Mathematical Statistics
