The Annals of Applied Statistics

Fast inference of individual admixture coefficients using geographic data

Kevin Caye, Flora Jay, Olivier Michel, and Olivier François

Abstract

Accurately evaluating the distribution of genetic ancestry across geographic space is one of the main questions addressed by evolutionary biologists. This question has been commonly addressed through the application of Bayesian estimation programs allowing their users to estimate individual admixture proportions and allele frequencies among putative ancestral populations. Following the explosion of high-throughput sequencing technologies, several algorithms have been proposed to cope with the computational burden generated by the massive data in those studies. In this context, incorporating geographic proximity in ancestry estimation algorithms is an open statistical and computational challenge. In this study, we introduce new algorithms that use geographic information to estimate ancestry proportions and ancestral genotype frequencies from population genetic data. Our algorithms combine matrix factorization methods and spatial statistics to provide estimates of ancestry matrices based on least-squares approximation. We demonstrate the benefit of using spatial algorithms through extensive computer simulations, and we provide an example application of our new algorithms to a set of spatially referenced samples for the plant species Arabidopsis thaliana. Without loss of statistical accuracy, the new algorithms exhibit runtimes that are much shorter than those observed for previously developed spatial methods. Our algorithms are implemented in the R package tess3r.
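
The abstract does not spell out the objective function, so the following Python sketch is only one plausible reading of "matrix factorization combined with spatial statistics": a least-squares fit of the genotype matrix plus a penalty that discourages ancestry differences between geographically close individuals. The Gaussian weight function, the parameter names `sigma` and `alpha`, the helper names, and the toy data are all assumptions made for illustration; none of this is taken from the tess3r package.

```python
import math

def gaussian_weights(coords, sigma):
    """Hypothetical spatial weights w_ij = exp(-d_ij^2 / sigma^2), zero on the diagonal."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
                w[i][j] = math.exp(-d2 / sigma ** 2)
    return w

def reconstruct(Q, F):
    """Return the matrix product Q F^T for Q (n x K) and F (L x K)."""
    return [[sum(q[k] * f[k] for k in range(len(q))) for f in F] for q in Q]

def penalized_loss(G, Q, F, W, alpha):
    """Assumed objective: ||G - Q F^T||^2 + (alpha / 2) * sum_ij w_ij ||q_i - q_j||^2."""
    pred = reconstruct(Q, F)
    fit = sum((g - p) ** 2
              for g_row, p_row in zip(G, pred)
              for g, p in zip(g_row, p_row))
    smooth = 0.0
    for i in range(len(Q)):
        for j in range(len(Q)):
            diff2 = sum((a - b) ** 2 for a, b in zip(Q[i], Q[j]))
            smooth += W[i][j] * diff2
    return fit + 0.5 * alpha * smooth

# Toy data (made up): two pairs of individuals sampled at two distant sites.
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
W = gaussian_weights(coords, sigma=1.0)
F = [[0.9, 0.1], [0.2, 0.8]]  # L = 2 loci, K = 2 ancestral populations

# Ancestry that follows geography versus ancestry that ignores it.
Q_smooth = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
Q_rough = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

# Each Q reproduces its own data exactly, so only the spatial penalty differs.
loss_smooth = penalized_loss(reconstruct(Q_smooth, F), Q_smooth, F, W, alpha=1.0)
loss_rough = penalized_loss(reconstruct(Q_rough, F), Q_rough, F, W, alpha=1.0)
print(loss_smooth < loss_rough)
```

In this toy setting both ancestry matrices fit their data perfectly, so the comparison isolates the spatial term: the geographically coherent configuration incurs a far smaller penalty than the one in which neighbours have different ancestry.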

Article information

Source
Ann. Appl. Stat. Volume 12, Number 1 (2018), 586-608.

Dates
Received: October 2016
Revised: September 2017
First available in Project Euclid: 9 March 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1520564485

Digital Object Identifier
doi:10.1214/17-AOAS1106

Keywords
Ancestry estimation algorithms; genotypic data; geographic data; fast algorithms

Citation

Caye, Kevin; Jay, Flora; Michel, Olivier; François, Olivier. Fast inference of individual admixture coefficients using geographic data. Ann. Appl. Stat. 12 (2018), no. 1, 586–608. doi:10.1214/17-AOAS1106. https://projecteuclid.org/euclid.aoas/1520564485


