The Annals of Applied Statistics

Classification and clustering of sequencing data using a Poisson model

Daniela M. Witten



In recent years, advances in high-throughput sequencing technology have led to a need for specialized methods for the analysis of digital gene expression data. While gene expression data measured on a microarray take on continuous values and can be modeled using the normal distribution, RNA sequencing data involve nonnegative counts and are more appropriately modeled using a discrete count distribution, such as the Poisson or the negative binomial. Consequently, analytic tools that assume a Gaussian distribution (such as classification methods based on linear discriminant analysis and clustering methods that use Euclidean distance) may not perform as well for sequencing data as methods that are based upon a more appropriate distribution. Here, we propose new approaches for performing classification and clustering of observations on the basis of sequencing data. Using a Poisson log-linear model, we develop an analog of diagonal linear discriminant analysis that is appropriate for sequencing data. We also propose an approach for clustering sequencing data using a new dissimilarity measure that is based upon the Poisson model. We demonstrate the performance of these approaches in a simulation study, on three publicly available RNA sequencing data sets, and on a publicly available chromatin immunoprecipitation sequencing data set.
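As a rough illustration of the kind of Poisson-based classifier the abstract describes, the Python sketch below fits independent Poisson rates per feature and per class, then assigns a new observation to the class with the highest log-likelihood score. The parameterization (per-sample size factors, per-feature totals, class effects `d`, and the smoothing constant `beta`) and the function names are assumptions made for this example and are not taken from the abstract. It is a minimal sketch of a Poisson naive Bayes style rule, not the paper's implementation.

```python
import numpy as np


def fit_poisson_classifier(X, y, beta=1.0):
    """Fit a simple Poisson classifier (illustrative assumption, not the paper's code).

    X: (n, p) array of nonnegative counts; y: (n,) array of class labels.
    Assumed model: X[i, j] ~ Poisson(s[i] * g[j] * d[k, j]) for sample i in class k.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    size = X.sum(axis=1) / X.sum()   # s[i]: share of all counts in sample i
    feature_total = X.sum(axis=0)    # g[j]: total count for feature j
    d = np.empty((classes.size, X.shape[1]))
    for k, c in enumerate(classes):
        in_class = y == c
        # Expected class total for each feature if there were no class effect.
        expected = size[in_class].sum() * feature_total
        # beta > 0 smooths the ratio so that d stays strictly positive.
        d[k] = (X[in_class].sum(axis=0) + beta) / (expected + beta)
    prior = np.array([(y == c).mean() for c in classes])
    return {"classes": classes, "feature_total": feature_total, "d": d, "prior": prior}


def predict_poisson_classifier(model, X_new):
    """Assign each row of X_new to the class with the largest Poisson score."""
    X_new = np.asarray(X_new, dtype=float)
    g = model["feature_total"]
    # Size factor of each new sample, on the same scale as the training samples.
    size = X_new.sum(axis=1) / g.sum()
    # Log-likelihood up to terms that do not depend on the class k:
    #   sum_j x_j * log d[k, j]  -  s * sum_j g[j] * d[k, j]  +  log prior[k]
    scores = (
        X_new @ np.log(model["d"]).T
        - np.outer(size, model["d"] @ g)
        + np.log(model["prior"])
    )
    return model["classes"][scores.argmax(axis=1)]
```

The score is linear in the counts, which is the sense in which such a rule resembles diagonal linear discriminant analysis: each feature contributes independently, with weights `log d[k, j]` in place of Gaussian mean differences.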

Article information

Ann. Appl. Stat. Volume 5, Number 4 (2011), 2493-2518.

First available in Project Euclid: 20 December 2011


Keywords: classification, clustering, genomics, gene expression, Poisson, sequencing


Witten, Daniela M. Classification and clustering of sequencing data using a Poisson model. Ann. Appl. Stat. 5 (2011), no. 4, 2493–2518. doi:10.1214/11-AOAS493.


