Open Access
October 2011
Principal components analysis in the space of phylogenetic trees
Tom M. W. Nye
Ann. Statist. 39(5): 2716-2739 (October 2011). DOI: 10.1214/11-AOS915

Abstract

Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify the main sources of variation in the original collection of trees in terms of both topology and branch lengths. The approach is illustrated by application to simulated sets of trees and to a set of gene trees from metazoan (animal) species.
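
For orientation, the standard linear Euclidean PCA that the abstract uses as its analogy can be sketched in a few lines. This is only an illustration of the Euclidean objective (choose the line through the mean that maximizes the variance of the projected data), found here by a simple approximate search over directions. It is not the tree-space algorithm of the paper, which works with geodesics rather than straight lines. The function names, the grid search, and the toy data are all assumptions made for this sketch.

```python
import math

def projected_variance(points, angle):
    """Variance of 2D points after orthogonal projection onto the line
    through their mean with direction (cos(angle), sin(angle))."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    ux, uy = math.cos(angle), math.sin(angle)
    # Signed position of each projected point along the line; these
    # scores have mean zero because the data were centred first.
    scores = [(x - mean_x) * ux + (y - mean_y) * uy for x, y in points]
    return sum(s * s for s in scores) / n

def first_principal_direction(points, steps=1800):
    """Approximate search (hypothetical helper): try evenly spaced
    angles in [0, pi) and keep the one with the largest projected variance."""
    angles = [math.pi * k / steps for k in range(steps)]
    return max(angles, key=lambda a: projected_variance(points, a))

# Toy data lying roughly along the line y = x (made up for illustration).
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
best_angle = first_principal_direction(points)
```

In Euclidean space the same direction is available in closed form as the leading eigenvector of the covariance matrix. The search is used here only because, as the abstract notes, the tree-space problem is nonlinear and is approached with approximate optimization.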

Citation

Tom M. W. Nye. "Principal components analysis in the space of phylogenetic trees." Ann. Statist. 39(5): 2716-2739, October 2011. https://doi.org/10.1214/11-AOS915

Information

Published: October 2011
First available in Project Euclid: 22 December 2011

zbMATH: 1231.62110
MathSciNet: MR2906884
Digital Object Identifier: 10.1214/11-AOS915

Subjects:
Primary: 92D15
Secondary: 62H25

Keywords: geodesic, phylogeny, principal component

Rights: Copyright © 2011 Institute of Mathematical Statistics

Vol. 39 • No. 5 • October 2011