The Annals of Statistics

Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data

Anand Bhaskar and Yun S. Song


Abstract

The sample frequency spectrum (SFS) is a widely used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data, and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has recently been shown that very different population demographies can generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined, where each piece belongs to some family of biologically motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and on the type of population size function in each piece. For piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inference, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the “folded” SFS, which is often used when it is ambiguous which allelic type is ancestral. Our results are proved using a generalization of Descartes’ rule of signs for polynomials to the Laplace transform of piecewise continuous functions.
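To make the two summary statistics named in the abstract concrete, the following is a minimal sketch of how an empirical SFS and its folded counterpart could be tabulated from a small sample. It is an illustration only: the abstract's results concern the expected SFS under a demographic model, not this empirical count. The function names, the 0/1 coding (0 = ancestral, 1 = derived) and the toy data are assumptions introduced here and do not come from the paper.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Unfolded SFS of n sequences (hypothetical helper).

    haplotypes: list of n equal-length lists of 0/1 alleles,
    where 1 is assumed to mark the derived allele.
    Returns a list of length n - 1 whose entry i - 1 is the number
    of sites at which the derived allele appears in exactly i sequences.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    # Derived-allele count at each site, tallied across sites.
    tally = Counter(sum(h[s] for h in haplotypes) for s in range(num_sites))
    # Sites with count 0 or n are not segregating and are excluded.
    return [tally.get(i, 0) for i in range(1, n)]


def fold(sfs):
    """Folded SFS: sites are classified by minor-allele count.

    Entry j - 1 combines derived counts j and n - j, which cannot be
    told apart when the ancestral allele is unknown.
    """
    n = len(sfs) + 1
    folded = []
    for j in range(1, n // 2 + 1):
        if j == n - j:
            folded.append(sfs[j - 1])
        else:
            folded.append(sfs[j - 1] + sfs[n - j - 1])
    return folded


# Toy sample: 4 sequences, 5 sites (made-up data for illustration).
haplotypes = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
sfs = sample_frequency_spectrum(haplotypes)
folded_sfs = fold(sfs)
print(sfs, folded_sfs)
```

In this toy sample, two sites carry the derived allele once, one site carries it twice, and the remaining two sites are not segregating, so they do not contribute. Folding merges the singleton class with the class of derived count n − 1, which is why the folded spectrum has only ⌊n/2⌋ entries.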

Article information

Source
Ann. Statist. Volume 42, Number 6 (2014), 2469-2493.

Dates
First available in Project Euclid: 20 October 2014

Permanent link to this document
http://projecteuclid.org/euclid.aos/1413810735

Digital Object Identifier
doi:10.1214/14-AOS1264

Mathematical Reviews number (MathSciNet)
MR3269986

Zentralblatt MATH identifier
1305.62027

Subjects
Primary: 62B10: Information-theoretic topics [See also 94A17]
Secondary: 92D15: Problems related to evolution

Keywords
Population genetics; identifiability; population size; coalescent theory; frequency spectrum

Citation

Bhaskar, Anand; Song, Yun S. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann. Statist. 42 (2014), no. 6, 2469–2493. doi:10.1214/14-AOS1264. http://projecteuclid.org/euclid.aos/1413810735.


