The Annals of Applied Statistics

Analysis of a data matrix and a graph: Metagenomic data and the phylogenetic tree

Elizabeth Purdom
Source: Ann. Appl. Stat. Volume 5, Number 4 (2011), 2326-2358.

Abstract

In biological experiments, researchers often have information in the form of a graph that supplements observed numerical data. Incorporating the knowledge contained in these graphs into an analysis of the numerical data is an important and nontrivial task. We look at the example of metagenomic data: data from a genomic survey of the abundance of different species of bacteria in a sample. Here, the graph of interest is a phylogenetic tree depicting the relationships among the bacterial species. We illustrate that analyzing the data in a nonstandard inner-product space effectively uses this additional graphical information and produces more meaningful results.
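
As a rough illustration of what analysis in a nonstandard inner-product space can look like, the sketch below runs a principal components analysis in which the usual dot product between samples is replaced by x^T Q y, with Q a symmetric positive semidefinite matrix built from a tree. It is not taken from the paper: the function name generalized_pca, the toy abundance matrix X, and the choice of Q as shared branch lengths on a hypothetical four-species tree ((A,B),(C,D)) with unit branch lengths are all assumptions made for illustration.

```python
import numpy as np

def generalized_pca(X, Q, n_components=2):
    """PCA of the rows of X under the inner product <x, y>_Q = x^T Q y.

    X: (samples x species) data matrix.
    Q: (species x species) symmetric positive semidefinite matrix.
    Returns the sample scores and the corresponding eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Xc = X - X.mean(axis=0)  # center each species (column)

    # Symmetric square root of Q from its eigendecomposition.
    w, V = np.linalg.eigh(Q)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    Q_half = (V * np.sqrt(w)) @ V.T

    # An ordinary SVD of Xc Q^{1/2} is equivalent to PCA in the Q metric.
    U, s, _ = np.linalg.svd(Xc @ Q_half, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    return scores, s[:n_components] ** 2

# Hypothetical tree ((A,B),(C,D)) with all branch lengths equal to 1.
# Q[i, j] is the branch length shared by species i and j on the path
# from the root (an assumed, illustrative choice of metric).
Q = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Toy abundances: 3 samples (rows) by 4 species (columns); made-up numbers.
X = np.array([
    [10.0, 8.0, 1.0, 0.0],
    [9.0, 11.0, 0.0, 2.0],
    [1.0, 0.0, 12.0, 9.0],
])

scores, eigvals = generalized_pca(X, Q, n_components=2)
print(scores)
print(eigvals)
```

With Q set to the identity matrix, the function reduces to ordinary PCA on the centered data, so any difference in the scores comes entirely from the tree-derived metric.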

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1324399597
Digital Object Identifier: doi:10.1214/10-AOAS402
Zentralblatt MATH identifier: 06017787
Mathematical Reviews number (MathSciNet): MR2907117

2013 © Institute of Mathematical Statistics