The Annals of Applied Statistics

BayCount: A Bayesian decomposition method for inferring tumor heterogeneity using RNA-Seq counts

Fangzheng Xie, Mingyuan Zhou, and Yanxun Xu


Tumors are heterogeneous. A tumor sample usually consists of a set of subclones with distinct transcriptional profiles and potentially different degrees of aggressiveness and responses to drugs. Understanding tumor heterogeneity is therefore critical for precise cancer prognosis and treatment. In this paper we introduce BayCount, a Bayesian decomposition method for inferring tumor heterogeneity from highly over-dispersed RNA sequencing count data. Using negative binomial factor analysis, BayCount takes into account both the between-sample and gene-specific random effects on raw counts of sequencing reads mapped to each gene. For posterior inference, we develop an efficient compound Poisson-based blocked Gibbs sampler. Simulation studies show that BayCount accurately recovers the subclonal structure, including the number of subclones, the proportions of these subclones in each tumor sample, and the gene expression profiles in each subclone. As real-world examples, we apply BayCount to The Cancer Genome Atlas lung cancer and kidney cancer RNA sequencing count data and obtain biologically interpretable results. Our method represents the first effort to characterize tumor heterogeneity using RNA sequencing count data that simultaneously removes the need to normalize the counts, achieves statistical robustness, and obtains biologically and clinically meaningful insights. The R package BayCount implementing our model and algorithm is available for download.
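To make the phrase "over-dispersed counts from a mixture of subclones" concrete, the following minimal sketch simulates such data. It is an illustration only. It is not the BayCount model, its Gibbs sampler, or the interface of the R package. It assumes that each count has a negative binomial distribution, drawn as a gamma-Poisson mixture, with a mean equal to a hypothetical per-sample depth times a proportion-weighted sum of subclone expression profiles. The function names, the single shared dispersion parameter, and all numeric values are assumptions made for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson count with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture: variance = mean + mean**2 / dispersion."""
    if mean <= 0:
        return 0
    # Gamma rate with shape = dispersion and scale = mean / dispersion has the given mean.
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rate, rng)


def simulate_counts(profiles, proportions, depth, dispersion, rng):
    """Simulate a genes-by-samples count matrix (hypothetical helper).

    profiles[g][k]    : expression level of gene g in subclone k
    proportions[k][j] : share of subclone k in sample j (columns sum to 1)
    depth[j]          : assumed sample-specific scaling factor
    """
    n_subclones, n_samples = len(proportions), len(proportions[0])
    counts = []
    for gene_profile in profiles:
        row = []
        for j in range(n_samples):
            mixed = sum(gene_profile[k] * proportions[k][j] for k in range(n_subclones))
            row.append(sample_negative_binomial(depth[j] * mixed, dispersion, rng))
        counts.append(row)
    return counts


rng = random.Random(2018)
profiles = [[30.0, 5.0], [2.0, 40.0], [10.0, 10.0]]   # 3 genes, 2 subclones
proportions = [[0.8, 0.3], [0.2, 0.7]]                # 2 subclones, 2 samples
print(simulate_counts(profiles, proportions, [1.0, 1.5], 2.0, rng))

# Overdispersion check: the sample variance should clearly exceed the sample mean.
draws = [sample_negative_binomial(20.0, 2.0, rng) for _ in range(5000)]
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(mean, variance)
```

With a Poisson model the variance would equal the mean. Under the gamma-Poisson construction above it grows quadratically with the mean, which is the sense in which such counts are over-dispersed.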

Article information

Ann. Appl. Stat., Volume 12, Number 3 (2018), 1605-1627.

Received: February 2017
Revised: November 2017
First available in Project Euclid: 11 September 2018

Digital Object Identifier: doi:10.1214/17-AOAS1123

Keywords: Cancer genomics; compound Poisson; Markov chain Monte Carlo; negative binomial; overdispersion


Xie, Fangzheng; Zhou, Mingyuan; Xu, Yanxun. BayCount: A Bayesian decomposition method for inferring tumor heterogeneity using RNA-Seq counts. Ann. Appl. Stat. 12 (2018), no. 3, 1605–1627. doi:10.1214/17-AOAS1123.


Supplemental materials

  • Supplement to “BayCount: A Bayesian decomposition method for inferring tumor heterogeneity using RNA-Seq counts”. The supplementary material provides details of the posterior inference, supplementary figures, a comparison with alternative methods for determining the number of subclones, additional simulation studies, a comparison with nonnegative matrix factorization on transformed count data, and additional convergence diagnostics.