Mapping human genetic variation is of fundamental interest in
fields such as anthropology and forensic inference. At the same
time, patterns of genetic diversity confound efforts to
determine the genetic basis of complex disease. Due to
technological advances, it is now possible to measure hundreds
of thousands of genetic variants per individual across the
genome. Principal component analysis (PCA) is routinely used to
summarize the genetic similarity between subjects, and the
resulting eigenvectors are interpreted as dimensions of ancestry.
We build on this idea using a spectral graph approach. In the process we
draw on connections between multidimensional scaling and
spectral kernel methods. Our approach, based on a spectral
embedding derived from the normalized Laplacian of a graph, can
delineate ancestry more meaningfully than PCA. The method is
robust to outliers and can incorporate different similarity
measures of genetic data more easily than PCA. We illustrate a
new algorithm for genetic clustering and
association analysis on a large, genetically heterogeneous
sample.
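
As a rough illustration of the kind of embedding described above, the following Python sketch builds a weighted graph over individuals and extracts coordinates from the eigenvectors of its normalized Laplacian, L = I - D^{-1/2} W D^{-1/2}. Several details are assumptions made for this example and are not taken from the text: the function name `spectral_embedding`, the 0/1/2 genotype coding, and the choice of edge weights (standardized inner products truncated at zero). It is meant as a minimal sketch under those assumptions, not as the algorithm the paper applies.

```python
import numpy as np

def spectral_embedding(genotypes, n_dims=2):
    """Embed individuals using eigenvectors of a normalized graph Laplacian.

    genotypes: (n_individuals, n_variants) array, assumed coded 0/1/2.
    Returns (eigenvalues, coordinates) for the n_dims smallest
    nontrivial eigenvalues of L = I - D^{-1/2} W D^{-1/2}.
    """
    X = np.asarray(genotypes, dtype=float)

    # Standardize each variant; drop variants with no variation.
    sd = X.std(axis=0)
    keep = sd > 0
    if not np.any(keep):
        raise ValueError("all variants are monomorphic")
    X = (X[:, keep] - X[:, keep].mean(axis=0)) / sd[keep]

    # Assumed similarity: inner products between individuals,
    # truncated at zero so that all edge weights are non-negative.
    H = X @ X.T / X.shape[1]
    W = np.maximum(H, 0.0)

    # Degrees are positive because the diagonal of H is positive.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)

    # Normalized Laplacian of the weighted graph.
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    L = np.eye(W.shape[0]) - M

    # eigh returns eigenvalues in ascending order. The first one is 0,
    # with eigenvector proportional to sqrt(d), and carries no structure.
    evals, evecs = np.linalg.eigh(L)
    return evals[1:n_dims + 1], evecs[:, 1:n_dims + 1]


if __name__ == "__main__":
    # Simulated data: two groups with different allele frequencies.
    rng = np.random.default_rng(0)
    freqs_a = rng.uniform(0.1, 0.9, size=500)
    freqs_b = np.clip(freqs_a + rng.normal(0.0, 0.1, size=500), 0.05, 0.95)
    group_a = rng.binomial(2, freqs_a, size=(40, 500))
    group_b = rng.binomial(2, freqs_b, size=(40, 500))
    evals, coords = spectral_embedding(np.vstack([group_a, group_b]))
    print(evals)
    print(coords[:3])
```

Because the weights enter only through the matrix W, a different similarity measure can be substituted by changing the two lines that define H and W; the rest of the computation is unchanged.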