The Annals of Statistics

Size, power and false discovery rates

Bradley Efron
Source: Ann. Statist. Volume 35, Number 4 (2007), 1351-1377.

Abstract

Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science surveys. This paper uses false discovery rate methods to carry out both size and power calculations on large-scale problems. A simple empirical Bayes approach allows the false discovery rate (fdr) analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions. Closed-form accuracy formulas are derived for estimated false discovery rates, and used to compare different methodologies: local or tail-area fdr’s, theoretical, permutation, or empirical null hypothesis estimates. Two microarray data sets as well as simulations are used to evaluate the methodology, the power diagnostics showing why nonnull cases might easily fail to appear on a list of “significant” discoveries.
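The distinction between tail-area and local false discovery rates mentioned above can be illustrated with a minimal sketch. This is not the paper's estimation procedure. It assumes a theoretical N(0, 1) null, takes the null proportion p0 = 1 as a conservative upper bound, and uses a crude fixed-window density estimate. The function names, the simulated data, and the window half-width are all hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(z):
    """Density of the theoretical null N(0, 1) at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_sf(z):
    """Upper tail probability P(Z >= z) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_area_fdr(z_values, cutoff, p0=1.0):
    """Estimate the tail-area Fdr for the region {z >= cutoff}.

    Ratio of the expected number of null cases in the region to the
    number of cases actually observed there, capped at 1.
    """
    n_reject = sum(1 for z in z_values if z >= cutoff)
    if n_reject == 0:
        return 1.0
    expected_null = p0 * len(z_values) * normal_sf(cutoff)
    return min(1.0, expected_null / n_reject)


def local_fdr(z_values, z0, p0=1.0, half_width=0.25):
    """Estimate the local fdr(z0) = p0 * f0(z0) / f(z0).

    The mixture density f is estimated by the fraction of z-values in a
    window of total width 2 * half_width centred at z0 (an assumption
    of this sketch, standing in for a smoother density estimate).
    """
    n_in_window = sum(1 for z in z_values if abs(z - z0) <= half_width)
    if n_in_window == 0:
        return 1.0
    f_hat = n_in_window / (len(z_values) * 2.0 * half_width)
    return min(1.0, p0 * normal_pdf(z0) / f_hat)


def simulate(n_null=1900, n_nonnull=100, shift=3.0, seed=0):
    """Hypothetical data: null z-values from N(0, 1), nonnull from N(shift, 1)."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_null)]
    z += [rng.gauss(shift, 1.0) for _ in range(n_nonnull)]
    return z


z_values = simulate()
for cutoff in (2.0, 3.0, 4.0):
    print(cutoff, tail_area_fdr(z_values, cutoff), local_fdr(z_values, cutoff))
```

In the right tail the tail-area estimate averages over all cases beyond the cutoff, so it is typically smaller than the local estimate at the cutoff itself. A case sitting exactly at the threshold of a "significant" list can therefore have a noticeably higher local fdr than the list as a whole.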

Primary Subjects: 62J07, 62G07
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1188405614
Digital Object Identifier: doi:10.1214/009053606000001460
Mathematical Reviews number (MathSciNet): MR2351089
Zentralblatt MATH identifier: 1123.62008

2013 © Institute of Mathematical Statistics
