Institute of Mathematical Statistics Collections

Multiple tests of association with biological annotation metadata

Sandrine Dudoit, Sündüz Keleş, Mark J. van der Laan

Abstract

We propose a general and formal statistical framework for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating possibly censored biological and clinical outcomes to genome-wide transcript levels, DNA copy numbers, and other covariates. A generic question of great interest in current genomic research regards the detection of associations between biological annotation metadata and genome-wide expression measures. This biological question may be translated as the test of multiple hypotheses concerning association measures between gene-annotation profiles and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple hypothesis testing methodology developed in [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] and related articles, to control a broad class of Type I error rates, defined as generalized tail probabilities and expected values for arbitrary functions of the numbers of Type I errors and rejected hypotheses. The resampling-based single-step and stepwise multiple testing procedures of [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] take into account the joint distribution of the test statistics and provide Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics.
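To make the framework concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a two-group design, takes the gene-parameter profile to be the per-gene difference in mean expression, uses the Pearson correlation as the association measure, and approximates the joint null distribution by centering and scaling a nonparametric bootstrap distribution before computing single-step maxT adjusted p-values. All function names, argument names, and the toy data are hypothetical.

```python
import random
from statistics import fmean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def association(expr, group0, group1, annot):
    """Estimated association between each annotation profile and the mean-difference profile.

    expr[g][i] is the expression of gene g in sample i;
    group0/group1 are lists of sample indices (repeats allowed);
    annot[m][g] is the 0/1 annotation of gene g for term m.
    """
    # Estimated gene-parameter profile: difference in group means, one value per gene.
    diff = [fmean(row[i] for i in group1) - fmean(row[i] for i in group0) for row in expr]
    return [pearson(profile, diff) for profile in annot]


def maxt_adjusted_pvalues(expr, group0, group1, annot, n_boot=200, seed=0):
    """Single-step maxT adjusted p-values from a centered and scaled bootstrap distribution."""
    rng = random.Random(seed)
    est = association(expr, group0, group1, annot)
    boot = []
    for _ in range(n_boot):
        # Resample samples with replacement within each group.
        b0 = rng.choices(group0, k=len(group0))
        b1 = rng.choices(group1, k=len(group1))
        boot.append(association(expr, b0, b1, annot))
    cols = list(zip(*boot))
    centers = [fmean(c) for c in cols]
    scales = [stdev(c) for c in cols]
    # Observed statistics: estimate divided by its bootstrap standard error (null value 0).
    t_obs = [abs(e / s) for e, s in zip(est, scales)]
    # Null draws: bootstrap estimates centered at their mean and scaled to unit variance;
    # the maximum over hypotheses is taken within each bootstrap draw.
    max_null = [
        max(abs((v - c) / s) for v, c, s in zip(row, centers, scales)) for row in boot
    ]
    return [sum(z >= t for z in max_null) / n_boot for t in t_obs]


# Toy data (hypothetical): 60 genes, 6 samples per group, the first 15 genes shifted in group 1.
toy_rng = random.Random(1)
group0, group1 = list(range(6)), list(range(6, 12))
expr = [
    [toy_rng.gauss(0, 1) + (3.0 if g < 15 and i >= 6 else 0.0) for i in range(12)]
    for g in range(60)
]
annot = [
    [1 if g < 15 else 0 for g in range(60)],  # term annotating exactly the shifted genes
    [g % 2 for g in range(60)],               # term balanced across shifted and unshifted genes
]
adj_p = maxt_adjusted_pvalues(expr, group0, group1, annot)
print(adj_p)
```

Taking the maximum over hypotheses within each bootstrap draw is what lets the adjusted p-values reflect the joint distribution of the test statistics rather than their marginal distributions alone.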

The proposed statistical and computational methods are illustrated using the acute lymphoblastic leukemia (ALL) microarray dataset of [Blood 103 (2004) 2771–2778], with the aim of relating GO annotation to differential gene expression between B-cell ALL with the BCR/ABL fusion and cytogenetically normal NEG B-cell ALL. The sensitivity of the identified lists of GO terms to the choice of association parameter between GO annotation and differential gene expression demonstrates the importance of translating the biological question in terms of suitable gene-annotation profiles, gene-parameter profiles, and association measures. In particular, the results reveal the limitations of binary gene-parameter profiles of differential expression indicators, which are still the norm for combined GO annotation and microarray data analyses. Procedures based on such binary gene-parameter profiles tend to be conservative and lack robustness with respect to the estimator for the set of differentially expressed genes. Our proposed statistical framework, with general definitions for the gene-annotation and gene-parameter profiles, allows consideration of a much broader class of inference problems that extend beyond GO annotation and microarray data analysis.
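The dependence of binary gene-parameter profiles on the estimated set of differentially expressed genes can be illustrated with a small sketch. It assumes the binary profile is obtained by thresholding absolute test statistics at a cut-off and uses the Pearson correlation as the association measure. The statistics, the annotation profile, the cut-offs, and the function names are hypothetical and are not taken from the ALL analysis.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def binary_profile(stats, cutoff):
    """Differential expression indicators: 1 if |statistic| exceeds the cut-off, else 0."""
    return [1 if abs(s) > cutoff else 0 for s in stats]


# Hypothetical per-gene test statistics and one 0/1 gene-annotation profile.
stats = [3.0, 2.5, 1.2, 0.4, 0.1, 0.2]
annot = [1, 1, 1, 0, 0, 0]

# Association with the binary profile, recomputed for two different cut-offs.
by_cutoff = {c: pearson(annot, binary_profile(stats, c)) for c in (1.0, 2.0)}

# Association with a continuous profile, which requires no cut-off.
continuous = pearson(annot, [abs(s) for s in stats])

print(by_cutoff, continuous)
```

In this toy case the association based on the binary profile changes when the cut-off moves from 1.0 to 2.0, while the continuous profile yields a single value that does not depend on any cut-off.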

Primary Subjects: 62H15, 62P10
Secondary Subjects: 62G09, 62G10, 62H10, 62H20
Keywords: acute lymphoblastic leukemia (ALL); adjusted p-value; Affymetrix; annotation metadata; association measure; BCR/ABL fusion; Bioconductor Project; bootstrap; co-expression; correlation; cut-off; differential expression; directed acyclic graph; family-wise error rate; gene; gene-annotation profile; Gene Ontology (GO); gene-parameter profile; generalized expected value error rate; generalized tail probability error rate; genomics; joint distribution; maxT; microarray; multiple hypothesis testing; non-parametric; null distribution; null hypothesis; power; R; rejection region; resampling; software; test statistic; Type I error rate
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.imsc/1207580084
Digital Object Identifier: doi:10.1214/193940307000000446

References

[1] Al-Shahrour, F., Díaz-Uriarte, R. and Dopazo, J. (2004). FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20 578–580.
[2] Al-Shahrour, F., Minguez, P., Vaquerizas, J. M., Conde, L. and Dopazo, J. (2005). BABELOMICS: A suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Research 33 W460–W464.
[3] Banelli, B., Casciano, I., Croce, M., Vinci, A. D., Gelvi, I., Pagnan, G., Brignole, C., Allemanni, G., Ferrini, S., Ponzoni, M. and Romani, M. (2002). Expression and methylation of CASP8 in neuroblastoma: Identification of a promoter region. Nature Medicine 8 1333–1335.
[4] Beissbarth, T. and Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20 1464–1465.
[5] Birkner, M. D., Courtine, M., van der Laan, M. J., Clément, K., Zucker, J.-D. and Dudoit, S. (In preparation). Statistical methods for detecting genotype/phenotype associations in the ObeLinks Project. Technical report, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
[6] Birkner, M. D., Hubbard, A. E. and van der Laan, M. J. (2005a). Application of a multiple testing procedure controlling the proportion of false positives to protein and bacterial data. Technical Report 186, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
[7] Birkner, M. D., Hubbard, A. E., van der Laan, M. J., Skibola, C. F., Hegedus, C. M. and Smith, M. T. (2006). Issues of processing and multiple testing of SELDI-TOF MS proteomic data. Statistical Applications in Genetics and Molecular Biology 5 Article 11.
Mathematical Reviews (MathSciNet): MR2221295
Zentralblatt MATH: 1166.62332
Digital Object Identifier: doi:10.2202/1544-6115.1198
[8] Birkner, M. D., Pollard, K. S., van der Laan, M. J. and Dudoit, S. (2005b). Multiple testing procedures and applications to genomics. Technical Report 168, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
[9] Birkner, M. D., Sinisi, S. E. and van der Laan, M. J. (2005c). Multiple testing and data adaptive regression: An application to HIV-1 sequence data. Statistical Applications in Genetics and Molecular Biology 4 Article 8.
Mathematical Reviews (MathSciNet): MR2138213
Zentralblatt MATH: 1095.62123
Digital Object Identifier: doi:10.2202/1544-6115.1110
[10] Blanchette, M., Green, R. E., Brenner, S. E. and Rio, D. C. (2005). Global analysis of positive and negative pre-mRNA splicing regulators in Drosophila. Genes and Development 19 1306–1314.
[11] Bolstad, B. M., Irizarry, R. A., Gautier, L. and Wu, Z. (2005). Preprocessing high-density oligonucleotide arrays. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. C. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) Chapter 2 13–32. Springer, New York.
Mathematical Reviews (MathSciNet): MR2201836
[12] Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 3–62.
[13] Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J. and Foa, R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103 2771–2778.
[14] Dudoit, S. and van der Laan, M. J. (2008). Multiple Testing Procedures with Applications to Genomics. Springer, New York.
Mathematical Reviews (MathSciNet): MR2373771
Zentralblatt MATH: 05234992
[15] Dudoit, S., van der Laan, M. J. and Birkner, M. D. (2004a). Multiple testing procedures for controlling tail probability error rates. Technical Report 166, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
[16] Dudoit, S., van der Laan, M. J. and Pollard, K. S. (2004b). Multiple testing. Part I. Single-step procedures for control of general Type I error rates. Statistical Applications in Genetics and Molecular Biology 3 Article 13.
Mathematical Reviews (MathSciNet): MR2101462
Zentralblatt MATH: 1166.62338
Digital Object Identifier: doi:10.2202/1544-6115.1040
[17] Gentleman, R. C., Carey, V. J., Huber, W., Irizarry, R. A. and Dudoit, S., editors (2005a). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Statistics for Biology and Health. Springer, New York.
Mathematical Reviews (MathSciNet): MR2201836
[18] Gentleman, R. C., Carey, V. J. and Zhang, J. (2005b). Meta-data resources and tools in Bioconductor. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. C. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) Chapter 7 111–133. Springer, New York.
Mathematical Reviews (MathSciNet): MR2201836
[19] Gill, R. D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method. I. Scand. J. Statist. 16 97–128. (With a discussion by J. A. Wellner and J. Præstgaard and a reply by the author).
[20] Gill, R. D., van der Laan, M. J. and Wellner, J. A. (1995). Inefficient estimators of the bivariate survival function for three models. Ann. Inst. H. Poincaré Probab. Statist. 31 545–597.
Mathematical Reviews (MathSciNet): MR1338452
Zentralblatt MATH: 0855.62024
[21] Gleissner, B., Gokbuget, N., Bartram, C. R., Janssen, B., Rieder, H., Janssen, J. W., Fonatsch, C., Heyll, A., Voliotis, D., Beck, J., Lipp, T., Munzert, G., Maurer, J., Hoelzer, D., Thiel, E. and German Multicenter Trials of Adult Acute Lymphoblastic Leukemia Study Group (2002). Leading prognostic relevance of the BCR-ABL translocation in adult acute B-lineage lymphoblastic leukemia: A prospective study of the German Multicenter Trial Group and confirmed polymerase chain reaction analysis. Blood 99 1536–1543.
[22] Grossmann, S., Bauer, S., Robinson, P. N. and Vingron, M. (2006). An improved statistic for detecting over-represented gene ontology annotations in gene sets. In Research in Computational Molecular Biology: 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2–5, 2006, Proceedings. Lecture Notes in Comput. Sci. (A. Apostolico, C. Guerra, S. Istrail, P. Pevzner and M. Waterman, eds.) 3909 85–98. Springer, Berlin/Heidelberg.
Mathematical Reviews (MathSciNet): MR2260446
[23] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75 800–802.
Mathematical Reviews (MathSciNet): MR995126
Zentralblatt MATH: 0661.62067
Digital Object Identifier: doi:10.1093/biomet/75.4.800
[24] Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures. Wiley, New York.
Mathematical Reviews (MathSciNet): MR914493
[25] Kaczynski, J., Cook, T. and Urrutia, R. (2003). Sp1- and Krüppel-like transcription factors. Genome Biology 4 206.
[26] Keleş, S., van der Laan, M. J., Dudoit, S. and Cawley, S. E. (2006). Multiple testing methods for ChIP-Chip high density oligonucleotide array data. J. Comput. Biol. 13 579–613.
Mathematical Reviews (MathSciNet): MR2255432
Digital Object Identifier: doi:10.1089/cmb.2006.13.579
[27] Kirchner, D., Duyster, J., Ottmann, O., Schmid, R. M., Bergmann, L. and Munzert, G. (2003). Mechanisms of Bcr-Abl-mediated NF-κB/Rel activation. Experimental Hematology 31 504–511.
[28] McCarroll, S. A., Murphy, C. T., Zou, S., Pletcher, S. D., Chin, C.-S., Jan, Y. N., Kenyon, C., Bargmann, C. I. and Li, H. (2004). Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nature Genetics 36 197–204.
[29] Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M. J., Patterson, N., Mesirov, J. P., Golub, T. R., Tamayo, P., Spiegelman, B., Lander, E. S., Hirschhorn, J. H., Altshuler, D. and Groop, L. C. (2003). PGC–1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34 267–273.
[30] Mukhopadhyay, A., Shishodia, S., Suttles, J., Brittingham, K., Lamothe, B., Nimmanapalli, R., Bhalla, K. N. and Aggarwal, B. B. (2002). Ectopic expression of protein-tyrosine kinase Bcr-Abl suppresses tumor necrosis factor (TNF)-induced NF-κB activation and IκBα phosphorylation. Relationship with down-regulation of TNF receptors. J. Biological Chemistry 277 30622–30628.
[31] Pollard, K. S., Birkner, M. D., van der Laan, M. J. and Dudoit, S. (2005a). Test statistics null distributions in multiple testing: Simulation studies and applications to genomics. J. Soc. Française de Statistique 146 77–115.
[32] Pollard, K. S., Dudoit, S. and van der Laan, M. J. (2005b). Multiple testing procedures: The multtest package and applications to genomics. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. C. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) Chapter 15 249–271. Springer, New York.
[33] Pollard, K. S. and van der Laan, M. J. (2004). Choice of a null distribution in resampling-based multiple testing. J. Statist. Plann. Inference 125 85–100.
Mathematical Reviews (MathSciNet): MR2086890
Zentralblatt MATH: 1074.62009
Digital Object Identifier: doi:10.1016/j.jspi.2003.07.019
[34] Rubin, D., van der Laan, M. J. and Dudoit, S. (2006). A method to increase the power of multiple testing procedures through sample splitting. Statistical Applications in Genetics and Molecular Biology 5 Article 19.
Mathematical Reviews (MathSciNet): MR2240850
Zentralblatt MATH: 1166.62318
Digital Object Identifier: doi:10.2202/1544-6115.1148
[35] Shtivelman, E., Cohen, F. E. and Bishop, J. M. (1992). A human gene (AHNAK) encoding an unusually large protein with a 1.2-μm polyionic rod structure. Proc. Natl. Acad. Sci. 89 5472–5476.
[36] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102 15545–15550.
[37] Tian, L., Greenberg, S. A., Kong, S. W., Altschuler, J., Kohane, I. S. and Park, P. J. (2005). Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. 102 13544–13549.
[38] van der Laan, M. J. (2006). Statistical inference for variable importance. Internat. J. Biostatistics 2 Article 2.
Mathematical Reviews (MathSciNet): MR2275897
[39] van der Laan, M. J., Birkner, M. D. and Hubbard, A. E. (2005). Empirical Bayes and resampling based multiple testing procedure controlling tail probability of the proportion of false positives. Statistical Applications in Genetics and Molecular Biology 4 Article 29.
Mathematical Reviews (MathSciNet): MR2170445
Zentralblatt MATH: 1108.62303
Digital Object Identifier: doi:10.2202/1544-6115.1143
[40] van der Laan, M. J., Dudoit, S. and Pollard, K. S. (2004a). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology 3 Article 15.
Mathematical Reviews (MathSciNet): MR2101464
Zentralblatt MATH: 1166.62379
[41] van der Laan, M. J., Dudoit, S. and Pollard, K. S. (2004b). Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Statistical Applications in Genetics and Molecular Biology 3 Article 14.
Mathematical Reviews (MathSciNet): MR2101463
Zentralblatt MATH: 1166.62378
[42] van der Laan, M. J. and Hubbard, A. E. (2006). Quantile-function based null distribution in resampling based multiple testing. Statistical Applications in Genetics and Molecular Biology 5 Article 14.
Mathematical Reviews (MathSciNet): MR2221292
Zentralblatt MATH: 1166.62302
Digital Object Identifier: doi:10.2202/1544-6115.1199
[43] van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York.
Mathematical Reviews (MathSciNet): MR1958123
Zentralblatt MATH: 1013.62034
[44] von Heydebreck, A., Huber, W. and Gentleman, R. (2004). Differential expression with the Bioconductor Project. Technical Report 7, Bioconductor Project Working Papers.
[45] Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York.
[46] Yu, Z. and van der Laan, M. J. (2002). Construction of counterfactuals and the G-computation formula. Technical Report 122, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.

2012 © Institute of Mathematical Statistics
