The Annals of Applied Probability

Distance-based species tree estimation under the coalescent: Information-theoretic trade-off between number of loci and sequence length

Elchanan Mossel and Sebastien Roch

We consider the reconstruction of a phylogeny from multiple genes under the multispecies coalescent. We establish a connection with the sparse signal detection problem, where one seeks to distinguish between a distribution and a mixture of the distribution and a sparse signal. Using this connection, we derive an information-theoretic trade-off between the number of genes, $m$, needed for an accurate reconstruction and the sequence length, $k$, of the genes. Specifically, we show that to detect a branch of length $f$, one needs $m=\Theta(1/[f^{2}\sqrt{k}])$ genes.
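
To make the scaling concrete, here is an illustrative reading of the bound, not a statement from the paper: suppose the constants hidden by $\Theta$ are replaced by a single fixed constant $C$ (an assumption made only for this sketch), so that $m(f,k) = C/(f^{2}\sqrt{k})$. Then

$$\frac{m(f,4k)}{m(f,k)} = \frac{\sqrt{k}}{\sqrt{4k}} = \frac{1}{2}, \qquad \frac{m(f/2,k)}{m(f,k)} = \frac{f^{2}}{(f/2)^{2}} = 4.$$

Under this idealization, quadrupling the sequence length $k$ halves the number of genes required, while halving the branch length $f$ quadruples it.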

Ann. Appl. Probab. Volume 27, Number 5 (2017), 2926-2955.

Received: August 2015
Revised: September 2016
First available in Project Euclid: 3 November 2017

Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]
92D15: Problems related to evolution

Phylogenetics; coalescent theory; sequence-length requirement


