Statistical Science

Microarrays, Empirical Bayes and the Two-Groups Model

Bradley Efron


The classic frequentist theory of hypothesis testing developed by Neyman, Pearson and Fisher has a claim to being the twentieth century’s most influential piece of applied mathematics. Something new is happening in the twenty-first century: high-throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and Bayesians alike. The two-groups model is a simple Bayesian construction that facilitates empirical Bayes analysis. This article concerns the interplay of Bayesian and frequentist ideas in the two-groups setting, with particular attention focused on Benjamini and Hochberg’s False Discovery Rate method. Topics include the choice and meaning of the null hypothesis in large-scale testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray studies), correlation effects, multiple confidence intervals and Bayesian competitors to the two-groups model.
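For readers unfamiliar with the False Discovery Rate method named above, the following is a minimal sketch of the Benjamini and Hochberg (1995) step-up rule, assuming each of the N cases has already been reduced to a p-value. The function name `benjamini_hochberg` and the control level parameter `q` are illustrative choices and do not come from the article.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.1) -> List[bool]:
    """Return one reject flag per case using the BH step-up rule at level q."""
    n = len(p_values)
    # Case indices ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / n) * q; stays 0 if none qualifies.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            k_max = rank
    # Reject the k_max cases with the smallest p-values.
    reject = [False] * n
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, chosen only to illustrate the step-up behavior.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.035], q=0.05))
```

Because the rule is step-up, a case whose own p-value exceeds its threshold can still be rejected when a larger ordered p-value falls below its threshold, which is why the loop records the largest qualifying rank rather than stopping at the first failure.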

Article information

Statist. Sci. Volume 23, Number 1 (2008), 1-22.

First available in Project Euclid: 7 July 2008


Efron, Bradley. Microarrays, Empirical Bayes and the Two-Groups Model. Statist. Sci. 23 (2008), no. 1, 1–22. doi:10.1214/07-STS236.


  • Allison, D., Gadbury, G., Heo, M., Fernandez, J., Lee, C. K., Prolla, T. and Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computat. Statist. Data Anal. 39 1–20.
  • Aubert, J., Bar-Hen, A., Daudin, J. and Robin, S. (2004). Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics 5 125.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
  • Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate under dependency. Ann. Statist. 29 1165–1188.
  • Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 100 71–93.
  • Broberg, P. (2005). A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 6 199.
  • Do, K.-A., Mueller, P. and Tang, F. (2005). A Bayesian mixture model for differential gene expression. J. Roy. Statist. Soc. Ser. C 54 627–644.
  • Dudoit, S., Shaffer, J. and Boldrick, J. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18 71–103.
  • Efron, B. (2003). Robbins, empirical Bayes, and microarrays. Ann. Statist. 31 366–378.
  • Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
  • Efron, B. (2005). Local false discovery rates. Available at
  • Efron, B. (2006). Size, power, and false discovery rates. Available at Ann. Appl. Statist. To appear.
  • Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93–103.
  • Efron, B. and Gous, A. (2001). Scales of evidence for model selection: Fisher versus Jeffreys. Model Selection IMS Monograph 38 208–256.
  • Efron, B. and Morris, C. (1975). Data analysis using Stein’s estimator and its generalizations. J. Amer. Statist. Assoc. 70 311–319.
  • Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density estimation. Ann. Statist. 24 2431–2461.
  • Efron, B. and Tibshirani, R. (2006). On testing the significance of sets of genes. Available at Ann. Appl. Statist. To appear.
  • Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23 70–86.
  • Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160.
  • Hedenfalk, I., Duggan, D., Chen, Y., et al. (2001). Gene expression profiles in hereditary breast cancer. New Engl. J. Medicine 344 539–548.
  • Heller, G. and Qing, J. (2003). A mixture model approach for finding informative genes in microarray studies. Unpublished manuscript.
  • Kerr, M., Martin, M. and Churchill, G. (2000). Analysis of variance in microarray data. J. Comp. Biology 7 819–837.
  • Langaas, M., Lindqvist, B. and Ferkingstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. Roy. Statist. Soc. Ser. B 67 555–572.
  • Lee, M. L. T., Kuo, F., Whitmore, G. and Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. 97 9834–9838.
  • Lehmann, E. and Romano, J. (2005). Generalizations of the familywise error rate. Ann. Statist. 33 1138–1154.
  • Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed. Springer, New York.
  • Lewin, A., Richardson, S., Marshall, C., Glazier, A. and Aitman, T. (2006). Bayesian modeling of differential gene expression. Biometrics 62 1–9.
  • Liang, C., Rice, J., de Pater, I., Alcock, C., Axelrod, T., Wang, A. and Marshall, S. (2004). Statistical methods for detecting stellar occultations by Kuiper belt objects: The Taiwanese-American occultation survey. Statist. Sci. 19 265–274.
  • Liao, J., Lin, Y., Selvanayagam, Z. and Weichung, J. (2004). A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics 20 2694–2701.
  • Newton, M., Kendziorski, C., Richmond, C., Blattner, F. and Tsui, K. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comp. Biology 8 37–52.
  • Newton, M., Noveiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture model. Biostatistics 5 155–176.
  • Pan, W., Lin, J. and Le, C. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional and Integrative Genomics 3 117–124.
  • Parmigiani, G., Garrett, E., Anbazhagan, R. and Gabrielson, E. (2002). A statistical framework for expression-based molecular classification in cancer. J. Roy. Statist. Soc. Ser. B 64 717–736.
  • Pawitan, Y., Murthy, K., Michiels, J. and Ploner, A. (2005). Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 21 3865–3872.
  • Pounds, S. and Morris, S. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of the p-values. Bioinformatics 19 1236–1242.
  • Qiu, X., Brooks, A., Klebanov, L. and Yakovlev, A. (2005). The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics 6 120.
  • Rogosa, D. (2003). Accuracy of API index and school base report elements: 2003 Academic Performance Index, California Department of Education. Available at http://www.cde.ca.gov/ta/ac/ap/researchreports.asp.
  • Schwartzman, A., Dougherty, R. F. and Taylor, J. E. (2005). Cross-subject comparison of principal diffusion direction maps. Magnetic Resonance in Medicine 53 1423–1431.
  • Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico, A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T. and Sellers, W. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203–209.
  • Smyth, G. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3 1–29.
  • Storey, J. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. Ser. B 64 479–498.
  • Storey, J., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. Roy. Statist. Soc. Ser. B 66 187–205.
  • Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102 15545–15550.
  • Turnbull, B. (2006). BEST proteomics data. Available at
  • Tusher, V., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proc. Natl. Acad. Sci. USA 98 5116–5121.
  • van’t Wout, A., Lehrman, G., Mikheeva, S., O’Keeffe, G., Katze, M., Bumgarner, R., Geiss, G. and Mullins, J. (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. J. Virology 77 1392–1402.