Statistical Science

Population Structure and Cryptic Relatedness in Genetic Association Studies

William Astle and David J. Balding



We review the problem of confounding in genetic association studies, which arises principally because of population structure and cryptic relatedness. Many treatments of the problem consider only a simple “island” model of population structure. We take a broader approach, which views population structure and cryptic relatedness as different aspects of a single confounder: the unobserved pedigree defining the (often distant) relationships among the study subjects. Kinship is therefore a central concept, and we review methods of defining and estimating kinship coefficients, both pedigree-based and marker-based. In this unified framework we review solutions to the problem of population structure, including family-based study designs, genomic control, structured association, regression control, principal components adjustment and linear mixed models. The last solution makes the most explicit use of the kinships among the study subjects, and has an established role in the analysis of animal and plant breeding studies. Recent computational developments mean that analyses of human genetic association data are beginning to benefit from its powerful tests for association, which protect against population structure and cryptic kinship, as well as intermediate levels of confounding by the pedigree.
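As an illustration of one of the corrections named above, the following is a minimal sketch of genomic control, assuming the association test at each marker yields a 1-degree-of-freedom chi-square statistic. The function name `genomic_control` and the choice to bound the inflation factor below by 1 are assumptions made for this example and are not taken from the article. The inflation factor is estimated as the median observed statistic divided by the median of the chi-square distribution with 1 degree of freedom (approximately 0.4549), and every statistic is then divided by that factor.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(stats):
    """Return (lambda_hat, adjusted_stats) for 1-df chi-square statistics.

    lambda_hat is the median-based inflation factor. The adjustment divides
    each statistic by max(lambda_hat, 1.0), so that statistics are never
    inflated when lambda_hat falls below 1 (an assumption of this sketch).
    """
    lambda_hat = median(stats) / CHI2_1DF_MEDIAN
    divisor = max(lambda_hat, 1.0)
    adjusted = [s / divisor for s in stats]
    return lambda_hat, adjusted


if __name__ == "__main__":
    # Hypothetical test statistics, for illustration only.
    example_stats = [0.1, 2 * CHI2_1DF_MEDIAN, 3.0]
    lam, adj = genomic_control(example_stats)
    print(lam, adj)
```

Because a single inflation factor is applied to all markers, this adjustment cannot account for markers that differ in how strongly they are affected by the underlying pedigree, which is one motivation for the kinship-based mixed-model approaches mentioned in the abstract.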

Article information

Statist. Sci. Volume 24, Number 4 (2009), 451-471.

First available: 20 April 2010


Astle, William; Balding, David J. Population Structure and Cryptic Relatedness in Genetic Association Studies. Statistical Science 24 (2009), no. 4, 451–471. doi:10.1214/09-STS307.


