The Annals of Applied Statistics

Improving population-specific allele frequency estimates by adapting supplemental data: An empirical Bayes approach

Marc Coram and Hua Tang



Estimation of allele frequencies at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data become increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than that of either the pooled or the unpooled estimator, and it sometimes improves substantially over both extremes. The bias introduced is small, as shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation likely to arise in a genome-wide association study.
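The abstract does not spell out the model, so the following is only a minimal sketch of the general idea of adaptive pooling, not the authors' estimator. It assumes a simple beta-binomial setup: at each marker the primary-sample allele count is binomial, the population frequency has a Beta prior centered on the supplemental-sample frequency, and a single concentration parameter `m` (shared across markers) is chosen by maximizing the marginal likelihood over a grid. The function names, the grid, and the treatment of supplemental frequencies as known constants are all illustrative assumptions.

```python
import math

def betabinom_loglik(x, n, q, m):
    """Beta-binomial log marginal likelihood of x successes in n trials,
    with prior Beta(m*q, m*(1-q)). The binomial coefficient is dropped
    because it does not depend on m. Requires 0 < q < 1 and m > 0."""
    a, b = m * q, m * (1.0 - q)
    return (math.lgamma(x + a) + math.lgamma(n - x + b)
            - math.lgamma(n + a + b)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def fit_concentration(counts, sizes, supp_freqs, grid=None):
    """Pick the concentration m that maximizes the summed marginal
    likelihood across markers (hypothetical grid search, assumed here)."""
    if grid is None:
        grid = [2.0 ** k for k in range(-2, 13)]  # 0.25, 0.5, ..., 4096
    def total(m):
        return sum(betabinom_loglik(x, n, q, m)
                   for x, n, q in zip(counts, sizes, supp_freqs))
    return max(grid, key=total)

def shrink(counts, sizes, supp_freqs, m):
    """Posterior-mean frequency per marker: a weighted average of the
    primary estimate x/n (weight n/(n+m)) and the supplemental
    frequency q (weight m/(n+m))."""
    return [(x + m * q) / (n + m)
            for x, n, q in zip(counts, sizes, supp_freqs)]

# Illustrative inputs (made up for the sketch, not data from the paper).
supp = [0.1, 0.3, 0.5, 0.7, 0.9]   # supplemental-sample allele frequencies
size = [20] * 5                    # chromosomes typed in the primary sample
primary = [3, 5, 11, 13, 17]       # primary-sample allele counts

m_hat = fit_concentration(primary, size, supp)
print(m_hat, shrink(primary, size, supp, m_hat))
```

Under these assumptions, a large fitted `m` means the supplemental frequencies predict the primary counts well, so the estimates lean toward pooling; a small `m` leaves the estimates close to the unpooled frequencies `x/n`.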

Article information

Ann. Appl. Stat., Volume 1, Number 2 (2007), 459-479.

First available in Project Euclid: 30 November 2007


Keywords: Empirical Bayes; allele frequency


Coram, Marc; Tang, Hua. Improving population-specific allele frequency estimates by adapting supplemental data: An empirical Bayes approach. Ann. Appl. Stat. 1 (2007), no. 2, 459-479. doi:10.1214/07-AOAS121.


