The Annals of Applied Statistics

A statistical framework for the analysis of microarray probe-level data

Zhijin Wu and Rafael A. Irizarry


In microarray technology, a number of critical steps are required to convert the raw measurements into the data relied upon by biologists and clinicians. These data manipulations, referred to as preprocessing, influence the quality of the ultimate measurements and studies that rely upon them. Standard operating procedure for microarray researchers is to use preprocessed data as the starting point for the statistical analyses that produce reported results. This has prevented many researchers from carefully considering their choice of preprocessing methodology. Furthermore, the fact that the preprocessing step affects the stochastic properties of the final statistical summaries is often ignored. In this paper we propose a statistical framework that permits the integration of preprocessing into the standard statistical analysis flow of microarray data. This general framework is relevant in many microarray platforms and motivates targeted analysis methods for specific applications. We demonstrate its usefulness by applying the idea in three different applications of the technology.

Article information

Ann. Appl. Stat. Volume 1, Number 2 (2007), 333-357.

First available in Project Euclid: 30 November 2007

Keywords: Microarray; preprocessing; probe level models; background noise; normalization


Wu, Zhijin; Irizarry, Rafael A. A statistical framework for the analysis of microarray probe-level data. Ann. Appl. Stat. 1 (2007), no. 2, 333--357. doi:10.1214/07-AOAS116.
