Bayesian Analysis

Modeling Population Structure Under Hierarchical Dirichlet Processes

Lloyd T. Elliott, Maria De Iorio, Stefano Favaro, Kaustubh Adhikari, and Yee Whye Teh


We propose a Bayesian nonparametric model to infer population admixture, extending the hierarchical Dirichlet process to allow for correlation between loci due to linkage disequilibrium. Given multilocus genotype data from a sample of individuals, the proposed model classifies individuals as unadmixed or admixed, and infers both the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process, and can be applied to most of the commonly used genetic markers. We present a Markov chain Monte Carlo (MCMC) algorithm for posterior inference under the model, and we discuss methods for summarizing the MCMC output for the analysis of population admixture. We demonstrate the performance of the proposed model in a real application, using genetic data from the ectodysplasin-A receptor (EDAR) gene, which is considered to be ancestry-informative because of well-known variation across ancestries in both allele frequency and phenotypic effects. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans. We also conduct a simulation study and show that our algorithm outperforms parametric methods.

Article information

Bayesian Anal., Volume 14, Number 2 (2019), 313-339.

First available in Project Euclid: 19 May 2018

Keywords: admixture modeling; Bayesian nonparametrics; hierarchical Dirichlet process; linkage disequilibrium; population stratification; single nucleotide polymorphism data; MCMC algorithm

Creative Commons Attribution 4.0 International License.


Elliott, Lloyd T.; De Iorio, Maria; Favaro, Stefano; Adhikari, Kaustubh; Teh, Yee Whye. Modeling Population Structure Under Hierarchical Dirichlet Processes. Bayesian Anal. 14 (2019), no. 2, 313–339. doi:10.1214/17-BA1093.


