The Annals of Applied Statistics

A statistical framework for testing functional categories in microarray data

William T. Barry, Andrew B. Nobel, and Fred A. Wright
Source: Ann. Appl. Stat. Volume 2, Number 1 (2008), 286-315.

Abstract

Ready access to emerging databases of gene annotation and functional pathways has shifted assessments of differential expression in DNA microarray studies from single genes to groups of genes with shared biological function. This paper takes a critical look at existing methods for assessing the differential expression of a group of genes (functional category), and provides some suggestions for improved performance. We begin by presenting a general framework, in which the set of genes in a functional category is compared to the complementary set of genes on the array. The framework includes tests for overrepresentation of a category within a list of significant genes, and methods that consider continuous measures of differential expression. Existing tests are divided into two classes. Class 1 tests assume gene-specific measures of differential expression are independent, despite overwhelming evidence of positive correlation. Analytic and simulated results are presented that demonstrate Class 1 tests are strongly anti-conservative in practice. Class 2 tests account for gene correlation, typically through array permutation that by construction has proper Type I error control for the induced null. However, both Class 1 and Class 2 tests use a null hypothesis that all genes have the same degree of differential expression. We introduce a more sensible and general (Class 3) null under which the profile of differential expression is the same within the category and complement. Under this broader null, Class 2 tests are shown to be conservative. We propose standard bootstrap methods for testing against the Class 3 null and demonstrate they provide valid Type I error control and more power than array permutation in simulated datasets and real microarray experiments.
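As a rough illustration of the first two classes of tests described above, the following Python sketch contrasts a Class 1 overrepresentation test with a Class 2 array-permutation test. It is a minimal sketch under stated assumptions, not the authors' implementation. The hypergeometric tail probability treats genes as independent draws. The permutation test shuffles sample labels, so each array's genes stay together and gene-gene correlation is preserved. The function names, the difference-in-means gene score, and the category-minus-complement statistic are all illustrative choices that do not come from the paper.

```python
import math
import random
import statistics


def hypergeom_overrep_pvalue(n_genes, n_category, n_significant, n_overlap):
    """Class 1 style test (assumes independent genes): upper-tail
    hypergeometric probability of seeing at least n_overlap category
    genes among n_significant genes drawn from n_genes."""
    total = math.comb(n_genes, n_significant)
    upper = min(n_category, n_significant)
    tail = sum(
        math.comb(n_category, k)
        * math.comb(n_genes - n_category, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def gene_scores(expr, labels):
    """Per-gene difference in group means (label 1 minus label 0).
    expr is a list of rows, one row of sample values per gene."""
    scores = []
    for row in expr:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        scores.append(statistics.fmean(g1) - statistics.fmean(g0))
    return scores


def category_statistic(scores, in_category):
    """Illustrative statistic: mean score in the category minus the
    mean score in the complementary set of genes."""
    cat = [s for s, m in zip(scores, in_category) if m]
    comp = [s for s, m in zip(scores, in_category) if not m]
    return statistics.fmean(cat) - statistics.fmean(comp)


def array_permutation_pvalue(expr, labels, in_category, n_perm=1000, seed=0):
    """Class 2 style test: permute sample (array) labels, which keeps
    each array's genes together and so preserves gene correlation."""
    rng = random.Random(seed)
    observed = abs(category_statistic(gene_scores(expr, labels), in_category))
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        stat = abs(category_statistic(gene_scores(expr, perm), in_category))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

The bootstrap procedure proposed for the Class 3 null is not sketched here, because the abstract does not specify its details.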

Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1206367822
Digital Object Identifier: doi:10.1214/07-AOAS146
Zentralblatt MATH identifier: 1137.62390
Mathematical Reviews number (MathSciNet): MR2415604

2013 © Institute of Mathematical Statistics
