The Annals of Statistics

Robustness of multiple testing procedures against dependence

Sandy Clarke and Peter Hall


An important aspect of multiple hypothesis testing is controlling the significance level, or the level of Type I error. When the test statistics are not independent, it can be particularly challenging to deal with this problem without resorting to very conservative procedures. In this paper we show that, in the context of contemporary multiple testing problems, where the number of tests is often very large, the difficulties caused by dependence are less serious than in classical cases. This is particularly true when the null distributions of test statistics are relatively light-tailed, for example, when they can be based on Normal or Student’s t approximations. There, if the test statistics can fairly be viewed as being generated by a linear process, an analysis founded on the incorrect assumption of independence is asymptotically correct as the number of hypotheses diverges. In particular, the point process representing the null distribution of the indices at which statistically significant test results occur is approximately Poisson, just as in the case of independence. The Poisson process also has the same mean as in the independence case, and of course exhibits no clustering of false discoveries. However, this result can fail if the null distributions are particularly heavy-tailed. There, clusters of statistically significant results can occur even when the null hypothesis is correct. We give an intuitive explanation for these disparate properties in light- and heavy-tailed cases, and provide rigorous theory underpinning the intuition.
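The contrast between the light- and heavy-tailed cases can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a first-order moving average as the linear process, standard Normal innovations for the light-tailed case, Cauchy innovations for the heavy-tailed case, and a one-sided test at an empirical quantile threshold. The names `ma1_series` and `adjacent_exceedance_fraction`, and the values of `n`, `theta` and `alpha`, are illustrative choices. All statistics are generated under the null, so every exceedance is a false discovery, and the quantity reported is the fraction of false discoveries that are immediately followed by another one.

```python
import math
import random


def ma1_series(n, theta, draw):
    """Return X_i = e_i + theta * e_{i-1} for i = 1..n, with e_i from draw()."""
    e = [draw() for _ in range(n + 1)]
    return [e[i] + theta * e[i - 1] for i in range(1, n + 1)]


def adjacent_exceedance_fraction(x, alpha):
    """Among values above the empirical (1 - alpha) quantile, return the
    fraction whose immediate successor is also above that quantile."""
    threshold = sorted(x)[int((1 - alpha) * len(x))]
    hits = [v > threshold for v in x]
    n_hits = sum(hits[:-1])  # exceedances that have a successor
    n_pairs = sum(1 for a, b in zip(hits, hits[1:]) if a and b)
    return n_pairs / n_hits


rng = random.Random(0)  # fixed seed so the run is reproducible
n, theta, alpha = 200_000, 0.8, 0.005  # illustrative values, not from the paper

# Light-tailed null: Normal innovations.
light = ma1_series(n, theta, lambda: rng.gauss(0.0, 1.0))
# Heavy-tailed null: Cauchy innovations via the inverse-CDF transform.
heavy = ma1_series(n, theta, lambda: math.tan(math.pi * (rng.random() - 0.5)))

light_frac = adjacent_exceedance_fraction(light, alpha)
heavy_frac = adjacent_exceedance_fraction(heavy, alpha)
print(f"light-tailed (Normal): {light_frac:.3f}")
print(f"heavy-tailed (Cauchy): {heavy_frac:.3f}")
```

Under exact independence the reported fraction would be close to `alpha`. In this sketch the Normal case should stay small and shrink further as `alpha` decreases, whereas the Cauchy case should remain large, because a single extreme innovation pushes two consecutive statistics over the threshold. The simulation illustrates the qualitative claim only; it does not reproduce the paper's theory.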

Article information

Ann. Statist. Volume 37, Number 1 (2009), 332-358.

First available in Project Euclid: 16 January 2009

Primary: 62G10: Hypothesis testing
Secondary: 62G35: Robustness

Keywords: False-discovery rate; family-wise error rate; linear process; moving average; multiplicity; significance level; simultaneous hypothesis testing; time-series


Clarke, Sandy; Hall, Peter. Robustness of multiple testing procedures against dependence. Ann. Statist. 37 (2009), no. 1, 332–358. doi:10.1214/07-AOS557.
