Identifying differentially expressed (DE) genes associated with a
sample characteristic is the primary objective of many
microarray studies. As more studies are carried out
with observational rather than well-controlled experimental
samples, it becomes important to evaluate and properly control
the impact of sample heterogeneity on DE gene identification. Typical
methods for identifying DE genes require ranking all the genes
according to a preselected statistic based on a single model for
two or more group comparisons, with or without adjustment for
other covariates. Such single-model approaches unavoidably
result in model misspecification, which can lead to increased
error due to bias for some genes and reduced efficiency for
others. We evaluated the impact of model misspecification from
such approaches on detecting DE genes and identified parameters
that affect the magnitude of this impact. To properly control for
sample heterogeneity and to provide a flexible and coherent
framework for simultaneously identifying DE genes associated
with one or more sample characteristics and/or their
interactions, we proposed a Bayesian model averaging approach
that corrects the model misspecification by averaging over
the model space formed by all relevant covariates. An empirical
approach is suggested for specifying prior model probabilities.
We demonstrated through simulated microarray data that this
approach improved performance in DE gene
identification compared with single-model approaches. The
flexibility of this approach is demonstrated through our
analysis of data from two observational microarray studies.
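
The idea of averaging over a model space formed by all relevant covariates can be illustrated with a minimal sketch for a single gene. This is not the authors' implementation: it assumes a BIC approximation to each model's marginal likelihood and, by default, a uniform prior over models, whereas the abstract describes an empirical choice of prior model probabilities. The function names (`bic_linear`, `bma_inclusion_probs`) and the optional `prior` argument are hypothetical and introduced here only for illustration.

```python
import itertools
import numpy as np

def bic_linear(y, X):
    """BIC of an ordinary least-squares fit of y on X (X includes the intercept)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + p * np.log(n)

def bma_inclusion_probs(y, covariates, prior=None):
    """Approximate posterior inclusion probability of each covariate for one gene.

    covariates: dict mapping a covariate name to a 1-D array of length len(y).
    prior: optional function mapping a tuple of covariate names to a positive
           prior model weight (assumption: uniform over models when None).
    """
    names = list(covariates)
    n = len(y)
    models, log_w = [], []
    # Enumerate every subset of covariates; each subset defines one linear model.
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            cols = [np.ones(n)] + [np.asarray(covariates[v], dtype=float) for v in subset]
            X = np.column_stack(cols)
            lw = -0.5 * bic_linear(y, X)  # BIC approximation to the log marginal likelihood
            if prior is not None:
                lw += np.log(prior(subset))
            models.append(subset)
            log_w.append(lw)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())  # subtract the maximum for numerical stability
    w /= w.sum()                     # normalized approximate posterior model probabilities
    # Inclusion probability: total weight of the models that contain the covariate.
    incl = {v: float(sum(wi for m, wi in zip(models, w) if v in m)) for v in names}
    return incl, dict(zip(models, w))
```

Given an expression vector `y` for one gene and, say, hypothetical covariates `{"smoking": ..., "sex": ...}`, the returned inclusion probability for each covariate could be used to rank genes for association with that covariate, in place of a statistic from one preselected model. Exhaustive enumeration is only practical when the number of covariates is small, since the model space doubles with each added covariate.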