Deconvolution of base pair level RNA-Seq read counts for quantification of transcript expression levels

Han Wu; Yu Zhu

doi:10.1214/16-AOAS906

Ann. Appl. Stat. 10(3): 1195-1216 (September 2016). DOI: 10.1214/16-AOAS906

Abstract

RNA-Seq has emerged as the method of choice for profiling the transcriptomes of organisms. In particular, it aims to quantify the expression levels of transcripts using short nucleotide sequences, or short reads, generated from RNA-Seq experiments. In real experiments, the label of the transcript from which each short read is generated is missing, and short reads are mapped to the genome rather than the transcriptome. Therefore, the quantification of transcript expression levels is an indirect statistical inference problem.

In this article, we propose to use individual exonic base pairs as observation units and, further, to model nonzero as well as zero counts at all base pairs at both the transcript and gene levels. At the transcript level, two-component Poisson mixture distributions are postulated, which gives rise to the Convolution of Poisson mixture (CPM) distribution model at the gene level. The maximum likelihood estimation method equipped with the EM algorithm is used to estimate model parameters and quantify transcript expression levels. We refer to the proposed method as CPM-Seq. Both simulation studies and real data demonstrate the effectiveness of CPM-Seq, showing that CPM-Seq produces more accurate and consistent quantification results than Cufflinks.
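To illustrate what a convolution of Poisson mixtures looks like at a single base pair, here is a minimal Python sketch. It rests on assumptions not stated in the abstract: each transcript covering the base pair contributes an independent count drawn from a two-component Poisson mixture with a hypothetical parameterization `(weight, rate_a, rate_b)`, and the observed gene-level count is the sum of those contributions. The function names are illustrative and are not taken from CPM-Seq. The sketch evaluates the gene-level probability mass function only and does not implement the EM estimation step.

```python
import math
from itertools import product


def poisson_pmf(y, rate):
    """Poisson pmf at count y; a rate of 0 puts all mass on y == 0."""
    if rate == 0:
        return 1.0 if y == 0 else 0.0
    return math.exp(-rate + y * math.log(rate) - math.lgamma(y + 1))


def cpm_pmf(y, transcripts):
    """pmf of a sum of independent two-component Poisson mixtures.

    transcripts: list of (weight, rate_a, rate_b) tuples (hypothetical
    parameterization), one per transcript covering the base pair. Each
    transcript's count is Poisson(rate_a) with probability `weight`
    and Poisson(rate_b) otherwise.

    A sum of independent Poisson counts is Poisson with the summed
    rate, so the convolution is itself a Poisson mixture over all
    2^K combinations of component choices.
    """
    total = 0.0
    for choice in product((0, 1), repeat=len(transcripts)):
        weight = 1.0
        rate = 0.0
        for (w, rate_a, rate_b), c in zip(transcripts, choice):
            weight *= w if c == 0 else 1.0 - w
            rate += rate_a if c == 0 else rate_b
        total += weight * poisson_pmf(y, rate)
    return total


# Example: two transcripts, each with a zero-rate component and an
# expressed component (values chosen purely for illustration).
example = [(0.3, 0.0, 5.0), (0.6, 0.0, 2.0)]
probabilities = [cpm_pmf(y, example) for y in range(10)]
```

Enumerating all component combinations costs 2^K Poisson evaluations for K transcripts, which is reasonable for a handful of transcripts per gene but would need a smarter scheme for genes with many isoforms.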

Citation

Han Wu, Yu Zhu. "Deconvolution of base pair level RNA-Seq read counts for quantification of transcript expression levels." Ann. Appl. Stat. 10 (3) 1195-1216, September 2016. https://doi.org/10.1214/16-AOAS906