Annals of Statistics

Principal components analysis in the space of phylogenetic trees

Tom M. W. Nye

Full-text: Open access


Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify the main sources of variation in the original collection of trees in terms of both topology and branch lengths. The approach is illustrated by application to simulated sets of trees and to a set of gene trees from metazoan (animal) species.
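The abstract describes the principal path as an analogue of standard linear Euclidean PCA, in which the first principal line maximizes the variance of the data projected onto it. The short sketch below illustrates only that Euclidean baseline on synthetic points. It is not the tree-space algorithm of the paper, and the function names and the synthetic data are assumptions introduced for illustration.

```python
import numpy as np

def projected_variance(points, direction):
    """Variance of the points after orthogonal projection onto the line
    through the centroid with the given direction."""
    unit = direction / np.linalg.norm(direction)
    centered = points - points.mean(axis=0)
    scores = centered @ unit  # signed position of each projection along the line
    return scores.var()

def first_principal_direction(points):
    """Unit direction of the line that maximizes projected variance (via SVD)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# Synthetic 2-D data (an assumption for this sketch), stretched along the (1, 1) diagonal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
rotation = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
points = latent @ rotation.T

best = first_principal_direction(points)
best_var = projected_variance(points, best)

# No other direction should capture more projected variance than the first principal one.
trial_directions = rng.normal(size=(50, 2))
trial_vars = [projected_variance(points, d) for d in trial_directions]
```

In Euclidean space the maximizer is available in closed form from the singular value decomposition. In tree-space, as the abstract notes, no such shortcut is available and the optimal geodesic is sought with approximate optimization algorithms.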

Article information

Ann. Statist., Volume 39, Number 5 (2011), 2716-2739.

First available in Project Euclid: 22 December 2011

Primary: 92D15: Problems related to evolution
Secondary: 62H25: Factor analysis and principal components; correspondence analysis

Keywords: phylogeny; principal component; geodesic


Nye, Tom M. W. Principal components analysis in the space of phylogenetic trees. Ann. Statist. 39 (2011), no. 5, 2716–2739. doi:10.1214/11-AOS915.


Supplemental materials

  • Supplementary material: Principal components analysis in the space of phylogenetic trees: Supplementary information. This contains further information about the simulation studies in Section 4.1.