A Bayesian nonparametric model for inferring subclonal populations from structured DNA sequencing data
Shai He, Aaron Schein, Vishal Sarsani, Patrick Flaherty
Ann. Appl. Stat. 15(2): 925-951 (June 2021). DOI: 10.1214/20-AOAS1434


There are distinguishing features or “hallmarks” of cancer that are found across tumors, individuals and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet within a single tumor there is often extensive genetic heterogeneity as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important in the personalized design of targeted therapeutic combinations and monitoring for possible recurrence after treatment.

We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma–Poisson hierarchy, and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analysis of a real acute lymphoblastic leukemia sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the model's results on this dataset reveals comutated loci across samples.
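
As a rough illustration of the idea behind a Gamma-based representation of hierarchical mixture weights, the sketch below draws shared global weights and per-sample weights by normalizing independent Gamma variables. It is a finite truncation written for intuition only, not the authors' model or sampler; the function names and the parameters `gamma0`, `alpha0` and `n_components` are hypothetical.

```python
import random


def dirichlet_from_gammas(alphas, rng):
    """Draw Dirichlet(alphas) weights by normalizing independent Gamma(alpha_k, 1) draws."""
    # Guard against a shape of exactly zero, which gammavariate rejects.
    gammas = [rng.gammavariate(max(a, 1e-12), 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def hierarchical_weights(n_samples, n_components, gamma0, alpha0, rng):
    """Finite truncation of a two-level hierarchy (illustrative assumption).

    Global weights are shared by all samples; each sample's weights are
    drawn from a Dirichlet centered on the global weights.
    """
    global_w = dirichlet_from_gammas([gamma0 / n_components] * n_components, rng)
    sample_w = [
        dirichlet_from_gammas([alpha0 * b for b in global_w], rng)
        for _ in range(n_samples)
    ]
    return global_w, sample_w


rng = random.Random(0)
global_w, sample_w = hierarchical_weights(
    n_samples=3, n_components=5, gamma0=5.0, alpha0=10.0, rng=rng
)
```

Here each entry of `sample_w` plays the role of one tumor sample's distribution over a shared set of subpopulations; a larger `alpha0` keeps the per-sample weights closer to the global ones.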

Funding Statement

This work was supported by NIH Grant 1R01GM13593101.


We would like to thank Alexandre Bouchard-Côté and Mingyuan Zhou for reading an early draft of this paper. The authors would also like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments.



Shai He. Aaron Schein. Vishal Sarsani. Patrick Flaherty. "A Bayesian nonparametric model for inferring subclonal populations from structured DNA sequencing data." Ann. Appl. Stat. 15 (2) 925 - 951, June 2021. https://doi.org/10.1214/20-AOAS1434


Received: 1 October 2020; Revised: 1 November 2020; Published: June 2021
First available in Project Euclid: 12 July 2021


Keywords: augment-and-marginalize, Bayesian nonparametric, Dirichlet process mixture, DNA sequencing, tumor heterogeneity

Rights: Copyright © 2021 Institute of Mathematical Statistics

