The Annals of Applied Probability

Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis

Constantinos Daskalakis and Sebastien Roch

Full-text: Open access


We present an efficient phylogenetic reconstruction algorithm allowing insertions and deletions which provably achieves a sequence-length requirement (or sample complexity) growing polynomially in the number of taxa. Our algorithm is distance-based, that is, it relies on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment.

Article information

Ann. Appl. Probab., Volume 23, Number 2 (2013), 693-721.

First available in Project Euclid: 12 February 2013

Permanent link to this document

Digital Object Identifier

Mathematical Reviews number (MathSciNet)

Zentralblatt MATH identifier

Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]
Secondary: 92D15: Problems related to evolution

Phylogenetic reconstruction alignment branching processes


Daskalakis, Constantinos; Roch, Sebastien. Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis. Ann. Appl. Probab. 23 (2013), no. 2, 693--721. doi:10.1214/12-AAP852.

Export citation


  • [1] Andoni, A., Braverman, M. and Hassidim, A. (2011). Phylogenetic reconstruction with insertions and deletions. Preprint.
  • [2] Andoni, A., Daskalakis, C., Hassidim, A. and Roch, S. (2010). Global alignment of molecular sequences via ancestral state reconstruction. In ICS 2010 358–369. Tsinghua University Press, Beijing, China.
  • [3] Athreya, K. B. and Ney, P. E. (1972). Branching Processes. Die Grundlehren der Mathematischen Wissenschaften 196. Springer, New York.
  • [4] Atteson, K. (1999). The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25 251–278.
  • [5] Csurös, M. (2002). Fast recovery of evolutionary trees with thousands of nodes. J. Comput. Biol. 9 277–297.
  • [6] Csurös, M. and Kao, M.-Y. (2001). Provably fast and accurate recovery of evolutionary trees through harmonic greedy triplets. SIAM J. Comput. 31 306–322 (electronic).
  • [7] Daskalakis, C., Hill, C., Jaffe, A., Mihaescu, R., Mossel, E. and Rao, S. (2006). Maximal accurate forests from distance matrices. In RECOMB 2006 281–295. Springer, Berlin.
  • [8] Daskalakis, C., Mossel, E. and Roch, S. (2006). Optimal phylogenetic reconstruction. In STOC’06: Proceedings of the 38th Annual ACM Symposium on Theory of Computing 159–168. ACM, New York.
  • [9] Daskalakis, C., Mossel, E. and Roch, S. (2009). Phylogenies without branch bounds: Contracting the short, pruning the deep. In RECOMB 2009 451–465. Springer, Berlin.
  • [10] Daskalakis, C. and Roch, S. (2010). Alignment-free phylogenetic reconstruction. In RECOMB 2010 123–137. Springer, Berlin.
  • [11] Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 1792–1797.
  • [12] Elias, I. (2006). Settling the intractability of multiple alignment. J. Comput. Biol. 13 1323–1339 (electronic).
  • [13] Erdős, P. L., Steel, M. A., Székely, L. A. and Warnow, T. J. (1999). A few logs suffice to build (almost) all trees. I. Random Structures Algorithms 14 153–184.
  • [14] Erdős, P. L., Steel, M. A., Székely, L. A. and Warnow, T. J. (1999). A few logs suffice to build (almost) all trees. II. Theoret. Comput. Sci. 221 77–118.
  • [15] Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol. 27 401–410.
  • [16] Felsenstein, J. (2004). Inferring Phylogenies. Sinauer, New York.
  • [17] Graur, D. and Li, W. H. (1999). Fundamentals of Molecular Evolution, 2nd ed. Sinauer, Sunderland, MA.
  • [18] Gronau, I., Moran, S. and Snir, S. (2008). Fast and reliable reconstruction of phylogenetic trees with very short edges (extended abstract). In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms 379–388. ACM, New York.
  • [19] Higgins, D. G. and Sharp, P. M. (1988). Clustal: A package for performing multiple sequence alignment on a microcomputer. Gene 73 237–244.
  • [20] Huson, D. H., Nettles, S. H. and Warnow, T. J. (1999). Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol. 6 3–4.
  • [21] Karlin, S. and Taylor, H. M. (1981). A Second Course in Stochastic Processes. Academic Press, New York.
  • [22] Katoh, K., Misawa, K., Kuma, K.-i. and Miyata, T. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30 3059–3066.
  • [23] King, V., Zhang, L. and Zhou, Y. (2003). On the complexity of distance-based evolutionary tree reconstruction. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, MD, 2003) 444–453. ACM, New York.
  • [24] Lacey, M. R. and Chang, J. T. (2006). A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences. Math. Biosci. 199 188–215.
  • [25] Liu, K., Raghavan, S., Nelesen, S., Linder, C. R. and Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324 1561–1564.
  • [26] Löytynoja, A. and Goldman, N. (2008). Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320 1632–1635.
  • [27] Metzler, D. (2003). Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19 490–499.
  • [28] Miklos, I., Lunter, G. A. and Holmes, I. (2004). A “Long Indel” model for evolutionary sequence alignment. Mol. Biol. Evol. 21 529–540.
  • [29] Mossel, E. (2007). Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Bio. Bioinform. 4 108–116.
  • [30] Mossel, E. and Roch, S. (2006). Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 16 583–614.
  • [31] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms. Cambridge Univ. Press, Cambridge.
  • [32] Rivas, E. and Eddy, S. R. (2008). Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput. Biol. 4 e1000172, 20.
  • [33] Roch, S. (2008). Sequence-length requirement for distance-based phylogeny reconstruction: Breaking the polynomial barrier. In FOCS 2008 729–738. IEEE Comput. Soc., Los Alamitos, CA.
  • [34] Semple, C. and Steel, M. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications 24. Oxford Univ. Press, Oxford.
  • [35] Steel, M. A. and Székely, L. A. (1999). Inverting random functions. Ann. Comb. 3 103–113.
  • [36] Steel, M. A. and Székely, L. A. (2002). Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discrete Math. 15 562–575 (electronic).
  • [37] Suchard, M. A. and Redelings, B. D. (2006). BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22 2047–2048.
  • [38] Thatte, B. D. (2006). Invertibility of the TKF model of sequence evolution. Math. Biosci. 200 58–75.
  • [39] Thorne, J. L., Kishino, H. and Felsenstein, J. (1991). An evolutionary model for maximum likelihood alignment of dna sequences. Journal of Molecular Evolution 33 114–124.
  • [40] Thorne, J. L., Kishino, H. and Felsenstein, J. (1992). Inching toward reality: An improved likelihood model of sequence evolution. Journal of Molecular Evolution 34 3–16.
  • [41] Wang, L. and Jiang, T. (1994). On the complexity of multiple sequence alignment. J. Comput. Biol. 1 337–348.
  • [42] Wong, K. M., Suchard, M. A. and Huelsenbeck, J. P. (2008). Alignment uncertainty and genomic analysis. Science 319 473–476.