The Annals of Applied Statistics

Diverse correlation structures in gene expression data and their utility in improving statistical inference

Lev Klebanov and Andrei Yakovlev



It is well known that correlations in microarray data are a serious nuisance that degrades the performance of gene selection procedures. This paper demonstrates that the correlation structure of microarray data is also a rich source of useful information. We discuss distinct correlation substructures revealed in microarray gene expression data by an appropriate ordering of genes. These substructures include stochastic proportionality of expression signals in a large percentage of all gene pairs, negative correlations hidden in ordered gene triples, and a long sequence of weakly dependent random variables associated with ordered pairs of genes. These striking regularities are of general biological interest, and they have far-reaching implications for the theory and practice of statistical methods of microarray data analysis. We illustrate the latter point with a method for testing differential expression of nonoverlapping gene pairs. Although designed to test a different null hypothesis, this method controls the type I error rate an order of magnitude more accurately than conventional methods of individual gene expression profiling. In addition, the method is robust to technical noise. Quantitative inference of the correlation structure has the potential to extend the analysis of microarray data far beyond currently practiced methods.
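The idea that ordered, nonoverlapping gene pairs can yield weakly dependent variables can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes a toy model in which every gene's log-expression carries the same array-wide random factor (one simple way to obtain stochastic proportionality on the raw scale), and it assumes genes are ordered by sample variance before pairing. The function names, the simulated data, and the ordering rule are all hypothetical choices made for illustration.

```python
import random
import statistics


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_log_expression(n_genes, n_arrays, seed=0):
    """Toy data (assumption): log-expression = gene baseline
    + factor shared by all genes on an array + gene-specific noise."""
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(n_arrays)]
    data = []
    for _ in range(n_genes):
        base = rng.uniform(4.0, 12.0)
        noise_sd = rng.uniform(0.1, 0.5)
        data.append([base + shared[j] + rng.gauss(0.0, noise_sd)
                     for j in range(n_arrays)])
    return data


def nonoverlapping_pair_differences(data):
    """Order genes by sample variance (an assumed ordering), pair
    neighbours (1,2), (3,4), ..., and return the per-array differences
    of log-expression, i.e. log-ratios of the paired genes."""
    ordered = sorted(data, key=statistics.variance)
    return [[a - b for a, b in zip(ordered[i], ordered[i + 1])]
            for i in range(0, len(ordered) - 1, 2)]


def mean_abs_correlation(rows):
    """Average absolute pairwise correlation across all rows."""
    vals = [abs(pearson(rows[i], rows[j]))
            for i in range(len(rows)) for j in range(i + 1, len(rows))]
    return statistics.fmean(vals)


data = simulate_log_expression(n_genes=40, n_arrays=30)
diffs = nonoverlapping_pair_differences(data)
gene_corr = mean_abs_correlation(data)
pair_corr = mean_abs_correlation(diffs)
print(f"mean |r| between genes:            {gene_corr:.2f}")
print(f"mean |r| between pair differences: {pair_corr:.2f}")
```

In this toy model the shared factor cancels exactly in each difference, so the pair-level variables are far less correlated than the genes themselves. Real data need not behave this cleanly, and the sketch says nothing about the testing step or its error control.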

Article information

Ann. Appl. Stat., Volume 1, Number 2 (2007), 538–559.

First available in Project Euclid: 30 November 2007


Keywords

Correlation structure; gene expression; microarrays


Klebanov, Lev; Yakovlev, Andrei. Diverse correlation structures in gene expression data and their utility in improving statistical inference. Ann. Appl. Stat. 1 (2007), no. 2, 538–559. doi:10.1214/07-AOAS120.



