Annals of Statistics

Consistency and convergence rate of phylogenetic inference via regularization

Vu Dinh, Lam Si Tung Ho, Marc A. Suchard, and Frederick A. Matsen IV



It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct “gene tree.” Although the gene tree may deviate from the “species tree” due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a nonstandard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera–Holmes–Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is “adaptive fast converging,” meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
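
As a rough illustration of the estimator described above, the penalized objective can be written schematically as follows. The notation is an assumption made for this sketch and is not taken from the paper: $Y_1,\dots,Y_k$ denote the $k$ observed sites of the gene alignment, $\tau$ a candidate gene tree (topology together with edge lengths), $\tau_0$ the guide (species) tree, $P_\tau$ the site-pattern distribution induced by $\tau$, $d_{\mathrm{BHV}}$ the Billera–Holmes–Vogtmann geodesic distance, and $\alpha_k > 0$ a regularization weight.

```latex
% Schematic only: symbols and normalization are assumptions, not the paper's notation.
\hat{\tau}_k \in \operatorname*{arg\,max}_{\tau}
  \left\{ \frac{1}{k} \sum_{i=1}^{k} \log P_{\tau}(Y_i)
          \;-\; \alpha_k \, d_{\mathrm{BHV}}(\tau, \tau_0)^{2} \right\}
```

In this form, a larger $\alpha_k$ pulls the estimate toward the guide tree, while a smaller $\alpha_k$ lets the sequence data dominate. The paper's exact normalization and choice of weight may differ from this sketch.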

Article information

Ann. Statist., Volume 46, Number 4 (2018), 1481-1512.

Received: June 2016
Revised: February 2017
First available in Project Euclid: 27 June 2018

Primary: 05C05: Trees; 62F12: Asymptotic properties of estimators
Secondary: 92B10: Taxonomy, cladistics, statistics; 92D15: Problems related to evolution

Keywords: Phylogenetics; tree reconstruction; gene tree; species tree; maximum likelihood estimator; regularization


Dinh, Vu; Ho, Lam Si Tung; Suchard, Marc A.; Matsen IV, Frederick A. Consistency and convergence rate of phylogenetic inference via regularization. Ann. Statist. 46 (2018), no. 4, 1481–1512. doi:10.1214/17-AOS1592.


