The Annals of Applied Statistics

A spectral graph approach to discovering genetic ancestry

Ann B. Lee, Diana Luca, and Kathryn Roeder
Source: Ann. Appl. Stat. Volume 4, Number 1 (2010), 179-202.

Abstract

Mapping human genetic variation is of fundamental interest in fields such as anthropology and forensic inference. At the same time, patterns of genetic diversity confound efforts to determine the genetic basis of complex disease. Owing to technological advances, it is now possible to measure hundreds of thousands of genetic variants per individual across the genome. Principal component analysis (PCA) is routinely used to summarize the genetic similarity between subjects, and the resulting eigenvectors are interpreted as dimensions of ancestry. We build on this idea using a spectral graph approach, drawing on connections between multidimensional scaling and spectral kernel methods. Our approach, based on a spectral embedding derived from the normalized Laplacian of a graph, can produce a more meaningful delineation of ancestry than PCA. The method is robust to outliers and can incorporate different similarity measures of genetic data more easily than PCA. We illustrate a new algorithm for genetic clustering and association analysis on a large, genetically heterogeneous sample.
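
To make the idea of a spectral embedding derived from the normalized Laplacian concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the algorithm of the paper: the function names `genotype_weights` and `laplacian_embedding` are hypothetical, and clipping negative inner products to zero is only one possible way to obtain nonnegative graph weights from genotype data. The embedding uses the eigenvectors of D^{-1/2} W D^{-1/2}, which are also the eigenvectors of the normalized Laplacian I - D^{-1/2} W D^{-1/2}.

```python
import numpy as np


def genotype_weights(genotypes):
    """Build a nonnegative weight matrix between subjects.

    genotypes: (n_subjects, n_snps) array of allele counts (0, 1 or 2).
    Assumption (not from the source): negative inner products are
    clipped to zero so that every graph weight is nonnegative.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)             # center each SNP column
    sd = X.std(axis=0)
    X = X[:, sd > 0] / sd[sd > 0]      # drop constant SNPs, scale the rest
    K = X @ X.T / X.shape[1]           # inner-product kernel, as used by PCA
    return np.clip(K, 0.0, None)


def laplacian_embedding(W, n_components=2):
    """Embed graph nodes using the normalized Laplacian of W.

    W: symmetric (n, n) matrix of nonnegative weights with positive row sums.
    Returns (coords, eigenvalues), where the eigenvalues are those of
    M = D^{-1/2} W D^{-1/2}; the normalized Laplacian I - M shares its
    eigenvectors, with eigenvalues 1 - lambda.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                  # node degrees
    inv_sqrt_d = 1.0 / np.sqrt(d)
    M = inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]     # largest eigenvalue first
    keep = order[1:1 + n_components]   # skip the trivial top eigenvector
    return vecs[:, keep], vals[keep]


# Toy usage: six nodes in two groups, strongly linked within each group
# and weakly linked between groups.
W_toy = 0.1 * np.ones((6, 6))
W_toy[:3, :3] = 1.0
W_toy[3:, 3:] = 1.0
coords_toy, vals_toy = laplacian_embedding(W_toy, n_components=1)
```

In the toy example the single retained coordinate takes one sign on the first three nodes and the opposite sign on the last three, so the two groups separate along it. For genotype data, `genotype_weights` would supply the matrix `W`. Skipping only the top eigenvector assumes a connected graph; if clipping disconnects the graph, the eigenvalue 1 is repeated and the retained coordinates are not uniquely determined.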

Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1273584452
Digital Object Identifier: doi:10.1214/09-AOAS281
Zentralblatt MATH identifier: 1189.62170
Mathematical Reviews number (MathSciNet): MR2758169


2013 © Institute of Mathematical Statistics
