Institute of Mathematical Statistics Collections

Multiple tests of association with biological annotation metadata

Sandrine Dudoit, Sündüz Keleş, and Mark J. van der Laan

Abstract

We propose a general and formal statistical framework for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating possibly censored biological and clinical outcomes to genome-wide transcript levels, DNA copy numbers, and other covariates. A generic question of great interest in current genomic research regards the detection of associations between biological annotation metadata and genome-wide expression measures. This biological question may be translated as the test of multiple hypotheses concerning association measures between gene-annotation profiles and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple hypothesis testing methodology developed in [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] and related articles, to control a broad class of Type I error rates, defined as generalized tail probabilities and expected values for arbitrary functions of the numbers of Type I errors and rejected hypotheses. The resampling-based single-step and stepwise multiple testing procedures of [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] take into account the joint distribution of the test statistics and provide Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics.
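
As a notational sketch of the Type I error rates described above, write $V_n$ for the number of Type I errors, $R_n$ for the number of rejected hypotheses, $g$ for an arbitrary function of the two, and $q$ for a bound. These symbols are notation assumed here for illustration and do not appear in the abstract itself. The two classes of error rates can then be written as:

```latex
\begin{align*}
  gTP(q, g) &= \Pr\bigl( g(V_n, R_n) > q \bigr)
    && \text{(generalized tail probability error rate)}, \\
  gEV(g)    &= \mathrm{E}\bigl[ g(V_n, R_n) \bigr]
    && \text{(generalized expected value error rate)}.
\end{align*}
% Familiar special cases:
%   g(v, r) = v,  q = 0       =>  gTP(0, g) = Pr(V_n > 0), the family-wise error rate;
%   g(v, r) = v / r (0 if r = 0)  =>  gEV(g) = E[V_n / R_n], the false discovery rate.
```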

The proposed statistical and computational methods are illustrated using the acute lymphoblastic leukemia (ALL) microarray dataset of [Blood 103 (2004) 2771–2778], with the aim of relating GO annotation to differential gene expression between B-cell ALL with the BCR/ABL fusion and cytogenetically normal (NEG) B-cell ALL. The sensitivity of the identified lists of GO terms to the choice of association parameter between GO annotation and differential gene expression demonstrates the importance of translating the biological question in terms of suitable gene-annotation profiles, gene-parameter profiles, and association measures. In particular, the results reveal the limitations of binary gene-parameter profiles of differential expression indicators, which are still the norm for combined GO annotation and microarray data analyses. Procedures based on such binary gene-parameter profiles tend to be conservative and lack robustness with respect to the estimator for the set of differentially expressed genes. Our proposed statistical framework, with general definitions for the gene-annotation and gene-parameter profiles, allows consideration of a much broader class of inference problems that extend beyond GO annotation and microarray data analysis.
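
The general idea of testing association between gene-annotation profiles and a gene-parameter profile with a resampling-based single-step maxT procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the function names (`gene_parameters`, `association`, `single_step_maxt`) are hypothetical, the gene-parameter profile is taken to be a per-gene difference in group means, the association measure is a Pearson correlation, the bootstrap is stratified by group, and the null distribution is obtained by centering the bootstrap statistics only (no scaling step).

```python
import numpy as np


def gene_parameters(expr, labels):
    """Gene-parameter profile (assumed here): per-gene difference in
    mean expression, group 1 minus group 0. expr is samples x genes."""
    return expr[labels == 1].mean(axis=0) - expr[labels == 0].mean(axis=0)


def association(annot, lam):
    """Association measure (assumed here): Pearson correlation between each
    gene-annotation profile (a column of annot, genes x terms) and lam.
    Each column of annot must vary across genes, or the ratio is undefined."""
    a = annot - annot.mean(axis=0)
    l = lam - lam.mean()
    return (a.T @ l) / np.sqrt((a ** 2).sum(axis=0) * (l ** 2).sum())


def single_step_maxt(expr, labels, annot, n_boot=500, seed=0):
    """Single-step maxT adjusted p-values for H0(m): association(m) = 0,
    using a centered bootstrap distribution as the null distribution."""
    rng = np.random.default_rng(seed)
    n = expr.shape[0]
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)

    # Observed association estimates and (unstandardized) test statistics.
    psi_hat = association(annot, gene_parameters(expr, labels))
    t_obs = np.sqrt(n) * psi_hat

    # Bootstrap the samples within each group (sampling with replacement),
    # recomputing the whole vector of statistics so that their joint
    # distribution across annotation terms is preserved.
    boot = np.empty((n_boot, annot.shape[1]))
    for b in range(n_boot):
        rows = np.concatenate([rng.choice(idx0, idx0.size),
                               rng.choice(idx1, idx1.size)])
        lam_b = gene_parameters(expr[rows], labels[rows])
        boot[b] = np.sqrt(n) * association(annot, lam_b)

    # Center each column so the resampled statistics have null mean zero.
    null = boot - boot.mean(axis=0)

    # Two-sided single-step maxT: compare each observed |T(m)| with the
    # bootstrap distribution of the maximum absolute null statistic.
    max_abs = np.abs(null).max(axis=1)
    adj_p = (max_abs[:, None] >= np.abs(t_obs)[None, :]).mean(axis=0)
    return psi_hat, adj_p
```

Because the gene-parameter profile here is continuous rather than a binary indicator of differential expression, no cut-off for calling genes differentially expressed is needed. Swapping in a different `gene_parameters` or `association` function changes the inference question without changing the multiple testing step.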

Chapter information

Source
Deborah Nolan and Terry Speed, eds., Probability and Statistics: Essays in Honor of David A. Freedman (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2008), 153–218

Dates
First available in Project Euclid: 7 April 2008

Permanent link to this document
https://projecteuclid.org/euclid.imsc/1207580084

Digital Object Identifier
doi:10.1214/193940307000000446

Mathematical Reviews number (MathSciNet)
MR2459952

Zentralblatt MATH identifier
1168.62100

Subjects
Primary: 62H15: Hypothesis testing; 62P10: Applications to biology and medical sciences
Secondary: 62G09: Resampling methods; 62G10: Hypothesis testing; 62H10: Distribution of statistics; 62H20: Measures of association (correlation, canonical correlation, etc.)

Keywords
acute lymphoblastic leukemia (ALL); adjusted p-value; Affymetrix; annotation metadata; association measure; BCR/ABL fusion; Bioconductor Project; bootstrap; co-expression; correlation; cut-off; differential expression; directed acyclic graph; family-wise error rate; gene; gene-annotation profile; Gene Ontology (GO); gene-parameter profile; generalized expected value error rate; generalized tail probability error rate; genomics; joint distribution; maxT; microarray; multiple hypothesis testing; non-parametric; null distribution; null hypothesis; power; R; rejection region; resampling; software; test statistic; Type I error rate

Rights
Copyright © 2008, Institute of Mathematical Statistics

Citation

Dudoit, Sandrine; Keleş, Sündüz; van der Laan, Mark J. Multiple tests of association with biological annotation metadata. Probability and Statistics: Essays in Honor of David A. Freedman, 153–218, Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2008. doi:10.1214/193940307000000446. https://projecteuclid.org/euclid.imsc/1207580084


References

  • [1] Al-Shahrour, F., Díaz-Uriarte, R. and Dopazo, J. (2004). FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20 578–580.
  • [2] Al-Shahrour, F., Minguez, P., Vaquerizas, J. M., Conde, L. and Dopazo, J. (2005). BABELOMICS: A suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Research 33 W460–W464.
  • [3] Banelli, B., Casciano, I., Croce, M., Vinci, A. D., Gelvi, I., Pagnan, G., Brignole, C., Allemanni, G., Ferrini, S., Ponzoni, M. and Romani, M. (2002). Expression and methylation of CASP8 in neuroblastoma: Identification of a promoter region. Nature Medicine 8 1333–1335.
  • [4] Beissbarth, T. and Speed, T. P. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20 1464–1465.
  • [5] Birkner, M. D., Courtine, M., van der Laan, M. J., Clément, K., Zucker, J.-D. and Dudoit, S. (In preparation). Statistical methods for detecting genotype/phenotype associations in the ObeLinks Project. Technical report, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
  • [6] Birkner, M. D., Hubbard, A. E. and van der Laan, M. J. (2005a). Application of a multiple testing procedure controlling the proportion of false positives to protein and bacterial data. Technical Report 186, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
  • [7] Birkner, M. D., Hubbard, A. E., van der Laan, M. J., Skibola, C. F., Hegedus, C. M. and Smith, M. T. (2006). Issues of processing and multiple testing of SELDI-TOF MS proteomic data. Statistical Applications in Genetics and Molecular Biology 5 Article 11.
  • [8] Birkner, M. D., Pollard, K. S., van der Laan, M. J. and Dudoit, S. (2005b). Multiple testing procedures and applications to genomics. Technical Report 168, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
  • [9] Birkner, M. D., Sinisi, S. E. and van der Laan, M. J. (2005c). Multiple testing and data adaptive regression: An application to HIV-1 sequence data. Statistical Applications in Genetics and Molecular Biology 4 Article 8.
  • [10] Blanchette, M., Green, R. E., Brenner, S. E. and Rio, D. C. (2005). Global analysis of positive and negative pre-mRNA splicing regulators in Drosophila. Genes and Development 19 1306–1314.
  • [11] Bolstad, B. M., Irizarry, R. A., Gautier, L. and Wu, Z. (2005). Preprocessing high-density oligonucleotide arrays. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. C. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) Chapter 2 13–32. Springer, New York.
  • [12] Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 3–62.
  • [13] Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J. and Foa, R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103 2771–2778.
  • [14] Dudoit, S. and van der Laan, M. J. (2008). Multiple Testing Procedures with Applications to Genomics. Springer, New York.
  • [15] Dudoit, S., van der Laan, M. J. and Birkner, M. D. (2004a). Multiple testing procedures for controlling tail probability error rates. Technical Report 166, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.
  • [16] Dudoit, S., van der Laan, M. J. and Pollard, K. S. (2004b). Multiple testing. Part I. Single-step procedures for control of general Type I error rates. Statistical Applications in Genetics and Molecular Biology 3 Article 13.
  • [17] Gentleman, R. C., Carey, V. J., Huber, W., Irizarry, R. A. and Dudoit, S., eds. (2005a). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Statistics for Biology and Health. Springer, New York.
  • [18] Gentleman, R. C., Carey, V. J. and Zhang, J. (2005b). Meta-data resources and tools in Bioconductor. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. C. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) Chapter 7 111–133. Springer, New York.
  • [19] Gill, R. D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method. I. Scand. J. Statist. 16 97–128. (With a discussion by J. A. Wellner and J. Præstgaard and a reply by the author).
  • [20] Gill, R. D., van der Laan, M. J. and Wellner, J. A. (1995). Inefficient estimators of the bivariate survival function for three models. Ann. Inst. H. Poincaré Probab. Statist. 31 545–597.
  • [21] Gleissner, B., Gokbuget, N., Bartram, C. R., Janssen, B., Rieder, H., Janssen, J. W., Fonatsch, C., Heyll, A., Voliotis, D., Beck, J., Lipp, T., Munzert, G., Maurer, J., Hoelzer, D., Thiel, E. and the German Multicenter Trials of Adult Acute Lymphoblastic Leukemia Study Group (2002). Leading prognostic relevance of the BCR-ABL translocation in adult acute B-lineage lymphoblastic leukemia: A prospective study of the German Multicenter Trial Group and confirmed polymerase chain reaction analysis. Blood 99 1536–1543.
  • [22] Grossmann, S., Bauer, S., Robinson, P. N. and Vingron, M. (2006). An improved statistic for detecting over-represented gene ontology annotations in gene sets. In Research in Computational Molecular Biology: 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2–5, 2006, Proceedings. Lecture Notes in Comput. Sci. (A. Apostolico, C. Guerra, S. Istrail, P. Pevzner and M. Waterman, eds.) 3909 85–98. Springer, Berlin/Heidelberg.
  • [23] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75 800–802.
  • [24] Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures. Wiley, New York.
  • [25] Kaczynski, J., Cook, T. and Urrutia, R. (2003). Sp1- and Krüppel-like transcription factors. Genome Biology 4 206.
  • [26] Keleş, S., van der Laan, M. J., Dudoit, S. and Cawley, S. E. (2006). Multiple testing methods for ChIP-Chip high density oligonucleotide array data. J. Comput. Biol. 13 579–613.
  • [27] Kirchner, D., Duyster, J., Ottmann, O., Schmid, R. M., Bergmann, L. and Munzert, G. (2003). Mechanisms of Bcr-Abl-mediated NF-κB/Rel activation. Experimental Hematology 31 504–511.
  • [28] McCarroll, S. A., Murphy, C. T., Zou, S., Pletcher, S. D., Chin, C.-S., Jan, Y. N., Kenyon, C., Bargmann, C. I. and Li, H. (2004). Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nature Genetics 36 197–204.
  • [29] Mootha, V. K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M. J., Patterson, N., Mesirov, J. P., Golub, T. R., Tamayo, P., Spiegelman, B., Lander, E. S., Hirschhorn, J. N., Altshuler, D. and Groop, L. C. (2003). PGC–1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34 267–273.
  • [30] Mukhopadhyay, A., Shishodia, S., Suttles, J., Brittingham, K., Lamothe, B., Nimmanapalli, R., Bhalla, K. N. and Aggarwal, B. B. (2002). Ectopic expression of protein-tyrosine kinase Bcr-Abl suppresses tumor necrosis factor (TNF)-induced NF-κB activation and IκBα phosphorylation. Relationship with down-regulation of TNF receptors. J. Biological Chemistry 277 30622–30628.
  • [31] Pollard, K. S., Birkner, M. D., van der Laan, M. J. and Dudoit, S. (2005a). Test statistics null distributions in multiple testing: Simulation studies and applications to genomics. J. Soc. Française de Statistique 146 77–115.
  • [32] Pollard, K. S., Dudoit, S. and van der Laan, M. J. (2005b). Multiple testing procedures: The multtest package and applications to genomics. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. C. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) Chapter 15 249–271. Springer, New York.
  • [33] Pollard, K. S. and van der Laan, M. J. (2004). Choice of a null distribution in resampling-based multiple testing. J. Statist. Plann. Inference 125 85–100.
  • [34] Rubin, D., van der Laan, M. J. and Dudoit, S. (2006). A method to increase the power of multiple testing procedures through sample splitting. Statistical Applications in Genetics and Molecular Biology 5 Article 19.
  • [35] Shtivelman, E., Cohen, F. E. and Bishop, J. M. (1992). A human gene (AHNAK) encoding an unusually large protein with a 1.2-μm polyionic rod structure. Proc. Natl. Acad. Sci. 89 5472–5476.
  • [36] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102 15545–15550.
  • [37] Tian, L., Greenberg, S. A., Kong, S. W., Altschuler, J., Kohane, I. S. and Park, P. J. (2005). Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. 102 13544–13549.
  • [38] van der Laan, M. J. (2006). Statistical inference for variable importance. Internat. J. Biostatistics 2 Article 2.
  • [39] van der Laan, M. J., Birkner, M. D. and Hubbard, A. E. (2005). Empirical Bayes and resampling based multiple testing procedure controlling tail probability of the proportion of false positives. Statistical Applications in Genetics and Molecular Biology 4 Article 29.
  • [40] van der Laan, M. J., Dudoit, S. and Pollard, K. S. (2004a). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology 3 Article 15.
  • [41] van der Laan, M. J., Dudoit, S. and Pollard, K. S. (2004b). Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Statistical Applications in Genetics and Molecular Biology 3 Article 14.
  • [42] van der Laan, M. J. and Hubbard, A. E. (2006). Quantile-function based null distribution in resampling based multiple testing. Statistical Applications in Genetics and Molecular Biology 5 Article 14.
  • [43] van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York.
  • [44] von Heydebreck, A., Huber, W. and Gentleman, R. (2004). Differential expression with the Bioconductor Project. Technical Report 7, Bioconductor Project Working Papers.
  • [45] Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York.
  • [46] Yu, Z. and van der Laan, M. J. (2002). Construction of counterfactuals and the G-computation formula. Technical Report 122, Division of Biostatistics, Univ. California, Berkeley, Berkeley, CA 94720-7360.