The Annals of Applied Statistics

A principal component analysis for trees

Burcu Aydın, Gábor Pataki, Haonan Wang, Elizabeth Bullitt, and J. S. Marron



Functional Data Analysis, an active field concerned with understanding the variation in a set of curves, has recently been extended to Object Oriented Data Analysis, which considers populations of more general objects. A particularly challenging extension of this set of ideas is to populations of tree-structured objects. We develop an analog of Principal Component Analysis for trees, based on the notion of tree-lines, and propose numerically fast (linear time) algorithms to solve the resulting problems to proven optimality. We apply the resulting solutions to a data set of 73 individuals, where each data object is a tree of blood vessels in one person’s brain. Our analysis reveals a significant relation between the age of the individuals and their brain vessel structure.

Article information

Ann. Appl. Stat. Volume 3, Number 4 (2009), 1597-1615.

First available in Project Euclid: 1 March 2010



Aydın, Burcu; Pataki, Gábor; Wang, Haonan; Bullitt, Elizabeth; Marron, J. S. A principal component analysis for trees. Ann. Appl. Stat. 3 (2009), no. 4, 1597–1615. doi:10.1214/09-AOAS263.


