The Annals of Applied Probability

Distance-based species tree estimation under the coalescent: Information-theoretic trade-off between number of loci and sequence length

Elchanan Mossel and Sebastien Roch


We consider the reconstruction of a phylogeny from multiple genes under the multispecies coalescent. We establish a connection with the sparse signal detection problem, where one seeks to distinguish between a distribution and a mixture of the distribution and a sparse signal. Using this connection, we derive an information-theoretic trade-off between the number of genes, $m$, needed for an accurate reconstruction and the sequence length, $k$, of the genes. Specifically, we show that to detect a branch of length $f$, one needs $m=\Theta(1/[f^{2}\sqrt{k}])$ genes.
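
As a rough reading of this bound, and only a sketch: write $m(f,k)$ for the required number of genes (a notation introduced here for illustration), treat the constants hidden by $\Theta$ as fixed, and assume both parameter settings lie in the regime where the bound applies. Neither assumption is spelled out in the abstract. The ratios below then show how the requirement changes when $k$ is quadrupled or $f$ is halved.

$$
\frac{m(f,4k)}{m(f,k)} \approx \frac{f^{2}\sqrt{k}}{f^{2}\sqrt{4k}} = \frac{1}{2},
\qquad
\frac{m(f/2,k)}{m(f,k)} \approx \frac{f^{2}\sqrt{k}}{(f/2)^{2}\sqrt{k}} = 4 .
$$

Under these assumptions, quadrupling the sequence length roughly halves the number of genes needed, while halving the branch length roughly quadruples it.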

Article information

Ann. Appl. Probab. Volume 27, Number 5 (2017), 2926-2955.

Received: August 2015
Revised: September 2016
First available in Project Euclid: 3 November 2017

Digital Object Identifier: doi:10.1214/16-AAP1273

Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43] 92D15: Problems related to evolution

Keywords: phylogenetics; coalescent theory; sequence-length requirement


Mossel, Elchanan; Roch, Sebastien. Distance-based species tree estimation under the coalescent: Information-theoretic trade-off between number of loci and sequence length. Ann. Appl. Probab. 27 (2017), no. 5, 2926-2955. doi:10.1214/16-AAP1273.

