CRISPR technology has enabled cell lineage tracing for complex multicellular organisms through insertion-deletion mutations of synthetic genomic barcodes during organismal development. To reconstruct the cell lineage tree from the mutated barcodes, current approaches apply general-purpose computational tools that are agnostic to the mutation process and are unable to take full advantage of the data’s structure. We propose a statistical model for the CRISPR mutation process and develop a procedure to estimate the resulting tree topology, branch lengths and mutation parameters by iteratively applying penalized maximum likelihood estimation. By assuming the barcode evolves according to a molecular clock, our method infers relative ordering across parallel lineages, whereas existing techniques only infer ordering for nodes along the same lineage. When analyzing transgenic zebrafish data from (Science 353 (2016) aaf7907), we find that our method recapitulates known aspects of zebrafish development and the results are consistent across samples.
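As a rough illustration of penalized maximum likelihood estimation in this setting, the sketch below fits a single branch length in a toy model. It is not the method of the paper. It assumes that each barcode target site mutates independently and irreversibly at a known constant rate, so a site is mutated after time t with probability 1 - exp(-rate * t), and it uses a simple quadratic penalty on the branch length. The function names, the penalty form and the grid search are all hypothetical choices made for illustration.

```python
import math


def log_likelihood(t, rate, n_sites, n_mutated):
    """Toy log-likelihood of branch length t.

    Assumption (illustrative only): every site mutates independently and
    irreversibly at the same known rate, so a site is mutated after time t
    with probability 1 - exp(-rate * t).
    """
    p = 1.0 - math.exp(-rate * t)
    # Clamp to avoid log(0) at the boundaries.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return n_mutated * math.log(p) + (n_sites - n_mutated) * math.log(1.0 - p)


def penalized_mle(rate, n_sites, n_mutated, penalty=0.0, t_max=10.0, steps=10000):
    """Grid-search maximizer of log-likelihood minus penalty * t**2.

    The quadratic penalty is a stand-in chosen for simplicity; it is not
    the penalty used in the paper.
    """
    best_t = None
    best_value = -math.inf
    for i in range(1, steps + 1):
        t = t_max * i / steps
        value = log_likelihood(t, rate, n_sites, n_mutated) - penalty * t ** 2
        if value > best_value:
            best_t, best_value = t, value
    return best_t


# Hypothetical data: 5 of 10 sites mutated, unit mutation rate.
unpenalized = penalized_mle(rate=1.0, n_sites=10, n_mutated=5)
shrunk = penalized_mle(rate=1.0, n_sites=10, n_mutated=5, penalty=5.0)
print(unpenalized, shrunk)
```

Without a penalty, the estimate in this toy model approaches the closed-form value -log(1 - n_mutated / n_sites) / rate. A positive penalty pulls the estimate toward shorter branches.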
We are grateful to Anna Minkina and Jay Shendure for helpful discussions and comments. This work was supported by National Institutes of Health Grants R01-GM113246 and R01-AI146028 as well as National Science Foundation Grant CISE-1564137. The research of Frederick Matsen was supported in part by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation. Jean Feng and Noah Simon were supported by NIH Early Independence Award 5DP5OD019820. William DeWitt was supported by NIH Grants 5T32HG000035-23 and F31 AI150163. Aaron McKenna was supported by NIH/NHGRI Pathway to Independence Award Grant K99HG010152/R00HG010152.
"Estimation of cell lineage trees by maximum-likelihood phylogenetics." Ann. Appl. Stat. 15 (1) 343–362, March 2021. https://doi.org/10.1214/20-AOAS1400