The Annals of Applied Statistics

A spectral graph approach to discovering genetic ancestry

Ann B. Lee, Diana Luca, and Kathryn Roeder

Full-text: Open access


Mapping human genetic variation is fundamentally interesting in fields such as anthropology and forensic inference. At the same time, patterns of genetic diversity confound efforts to determine the genetic basis of complex disease. Due to technological advances, it is now possible to measure hundreds of thousands of genetic variants per individual across the genome. Principal component analysis (PCA) is routinely used to summarize the genetic similarity between subjects, and the resulting eigenvectors are interpreted as dimensions of ancestry. We build on this idea using a spectral graph approach, drawing on connections between multidimensional scaling and spectral kernel methods. Our approach, based on a spectral embedding derived from the normalized Laplacian of a graph, can delineate ancestry more meaningfully than PCA. The method is robust to outliers and incorporates different similarity measures of genetic data more easily than PCA does. We illustrate a new algorithm for genetic clustering and association analysis on a large, genetically heterogeneous sample.
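
To make the embedding idea concrete, here is a minimal Python sketch: it builds a weighted graph over subjects, forms the normalized Laplacian I - D^{-1/2} W D^{-1/2}, and uses its leading nontrivial eigenvectors as coordinates. The Gaussian weight, the median bandwidth, the function name `laplacian_embedding`, and the simulated genotypes are all assumptions made for this example; the abstract does not specify the authors' similarity measure or algorithm, so this should not be read as their implementation.

```python
import numpy as np


def laplacian_embedding(X, n_components=2):
    """Embed the rows of X with eigenvectors of the normalized graph Laplacian.

    X is an (n_subjects, n_markers) array. The Gaussian similarity used here
    is an illustrative assumption, not the paper's definition of the weights.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between subjects.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives from round-off

    # Assumed bandwidth: median off-diagonal squared distance.
    bandwidth = max(np.median(sq_dist[np.triu_indices(n, k=1)]), 1e-12)
    W = np.exp(-sq_dist / bandwidth)  # strictly positive weights

    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; the first eigenpair
    # (eigenvalue ~0, eigenvector proportional to sqrt(d)) is trivial.
    eigvals, eigvecs = np.linalg.eigh(L)
    keep = slice(1, n_components + 1)
    return eigvals[keep], eigvecs[:, keep]


# Simulated example: two groups of 20 subjects with different allele
# frequencies at 200 markers (purely synthetic data).
rng = np.random.default_rng(0)
freqs_a = rng.uniform(0.1, 0.9, size=200)
freqs_b = rng.uniform(0.1, 0.9, size=200)
genotypes = np.vstack([
    rng.binomial(2, freqs_a, size=(20, 200)),
    rng.binomial(2, freqs_b, size=(20, 200)),
])

eigvals, coords = laplacian_embedding(genotypes, n_components=2)
print(coords[:, 0].round(3))  # first embedding coordinate per subject
```

In this synthetic setting the first coordinate is expected to take opposite signs in the two groups, which is the sense in which the eigenvectors can be read as dimensions of ancestry. Swapping in another similarity measure only requires changing how `W` is computed.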

Article information

Ann. Appl. Stat., Volume 4, Number 1 (2010), 179-202.

First available in Project Euclid: 11 May 2010

Keywords: Human genetics; dimension reduction; multidimensional scaling; population structure; spectral embedding


Lee, Ann B.; Luca, Diana; Roeder, Kathryn. A spectral graph approach to discovering genetic ancestry. Ann. Appl. Stat. 4 (2010), no. 1, 179–202. doi:10.1214/09-AOAS281.


