The Annals of Applied Statistics

A unified statistical framework for single cell and bulk RNA sequencing data

Lingxue Zhu, Jing Lei, Bernie Devlin, and Kathryn Roeder

Full-text: Open access


Recent advances in technology have enabled the measurement of RNA levels in individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights into the gene expression profiles of different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially “dropout” events. A dropout happens when the RNA for a gene fails to be amplified prior to sequencing, producing a “false” zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows strength from both data sources and carefully models the dropouts in single cell data, leading to more accurate estimation of cell type specific gene expression profiles. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as on the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data and in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
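As a rough illustration of the two data types the abstract describes, the toy simulation below generates single cell counts with “false” zeros and bulk counts that mix several cell types. It is not the URSM model. The multinomial counts, the Dirichlet draws, the constant dropout probability, and every name in the code (`simulate_toy_data`, `dropout_rate`, the sequencing depths) are assumptions made here for illustration only.

```python
import numpy as np


def simulate_toy_data(n_genes=50, n_types=3, n_cells=30, n_bulk=5,
                      dropout_rate=0.3, seed=0):
    """Toy generator for single cell and bulk counts (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    # Hypothetical cell type specific expression profiles:
    # one column per cell type, each column sums to 1 over genes.
    profiles = rng.dirichlet(np.ones(n_genes), size=n_types).T  # (n_genes, n_types)

    # Single cell data: each cell belongs to exactly one cell type.
    cell_types = rng.integers(0, n_types, size=n_cells)
    true_counts = np.stack(
        [rng.multinomial(2000, profiles[:, k]) for k in cell_types], axis=1
    )  # (n_genes, n_cells)

    # Dropout: an entry is replaced by a "false" zero with a fixed probability.
    # A constant rate is a simplification chosen for this sketch.
    dropped = rng.random(true_counts.shape) < dropout_rate
    observed = np.where(dropped, 0, true_counts)

    # Bulk data: each sample is a mixture of cell types with its own proportions.
    mixing = rng.dirichlet(np.ones(n_types), size=n_bulk).T  # (n_types, n_bulk)
    bulk_profiles = profiles @ mixing                        # (n_genes, n_bulk)
    bulk_counts = np.stack(
        [rng.multinomial(100000, bulk_profiles[:, j]) for j in range(n_bulk)], axis=1
    )

    return {
        "profiles": profiles, "cell_types": cell_types,
        "true_counts": true_counts, "dropped": dropped, "observed": observed,
        "mixing": mixing, "bulk_counts": bulk_counts,
    }


data = simulate_toy_data()
# Zeros in `observed` are a mix of true zeros and dropouts; telling them apart
# is the imputation problem, and recovering `mixing` from `bulk_counts` is the
# deconvolution problem.
print((data["observed"] == 0).mean(), (data["true_counts"] == 0).mean())
```

In this sketch the fraction of zeros in the observed single cell matrix is at least as large as in the underlying counts, which is the distortion that dropout-aware models aim to correct.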

Article information

Ann. Appl. Stat. Volume 12, Number 1 (2018), 609-632.

Received: September 2016
Revised: March 2017
First available in Project Euclid: 9 March 2018


Keywords: Single cell RNA sequencing; hierarchical model; empirical Bayes; Gibbs sampling; EM algorithm


Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn. A unified statistical framework for single cell and bulk RNA sequencing data. Ann. Appl. Stat. 12 (2018), no. 1, 609–632. doi:10.1214/17-AOAS1110.




Supplemental materials

  • Supplement to “A unified statistical framework for single cell and bulk RNA sequencing data.” This supplement provides additional information on the Gibbs sampling and EM algorithms.