The Annals of Statistics

Marginal asymptotics for the “large p, small n” paradigm: With applications to microarray data

Michael R. Kosorok and Shuangge Ma

Full-text: Open access


The “large p, small n” paradigm arises in microarray studies, image analysis, high-throughput molecular screening, astronomy, and many other high-dimensional applications. False discovery rate (FDR) methods are useful for resolving the accompanying multiple testing problems. In cDNA microarray studies, for example, p-values may be computed for each of p genes using data from n arrays, where typically p is in the thousands and n is less than 30. For FDR methods to be valid in identifying differentially expressed genes, the p-values for the nondifferentially expressed genes must simultaneously have uniform distributions marginally. While feasible for permutation p-values, this uniformity is problematic for asymptotic-based p-values, since the number of p-values involved goes to infinity and intuition suggests that at least some of the p-values should behave erratically. We examine this neglected issue when n is moderately large but p is almost exponentially large relative to n. We show the somewhat surprising result that, under very general dependence structures and for both mean and median tests, the p-values are simultaneously valid. A small simulation study and data analysis are used for illustration.
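The setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation study. It assumes independent Gaussian expression values, a one-sample mean test per gene with a normal-approximation (asymptotic) p-value, and the Benjamini-Hochberg step-up rule as the FDR method. The function names (`asymptotic_pvalue`, `benjamini_hochberg`, `simulate`) and all parameter values are hypothetical choices made for illustration only.

```python
import math
import random


def asymptotic_pvalue(sample):
    """Two-sided p-value for H0: mean = 0, using the normal
    approximation to the one-sample t-statistic (an assumption of
    this sketch; the paper also treats median tests)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t = mean / math.sqrt(var / n)
    # P(|Z| > |t|) for a standard normal Z
    return math.erfc(abs(t) / math.sqrt(2))


def benjamini_hochberg(pvalues, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its threshold.
        if pvalues[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])


def simulate(p=2000, n=20, n_signal=100, shift=1.5, seed=0):
    """One p-value per gene: p genes, n arrays, and the first
    n_signal genes shifted away from zero (hypothetical settings)."""
    rng = random.Random(seed)
    pvals = []
    for j in range(p):
        mu = shift if j < n_signal else 0.0
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        pvals.append(asymptotic_pvalue(sample))
    return pvals


if __name__ == "__main__":
    pvals = simulate()
    rejected = benjamini_hochberg(pvals, q=0.05)
    false_hits = sum(1 for i in rejected if i >= 100)
    print(len(rejected), "rejections,", false_hits, "among null genes")
```

Comparing the share of null genes among the rejections with the nominal level q, for different values of n and p, gives a rough empirical feel for how well the asymptotic p-values behave simultaneously. The sketch does not reproduce the dependence structures covered by the paper's results.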

Article information

Ann. Statist. Volume 35, Number 4 (2007), 1456-1486.

First available in Project Euclid: 29 August 2007


Primary: 62A01: Foundations and philosophical topics; 62H15: Hypothesis testing
Secondary: 62G20: Asymptotic properties; 62G30: Order statistics; empirical distribution functions

Keywords: Brownian bridge; Brownian motion; empirical process; false discovery rate; Hungarian construction; marginal asymptotics; maximal inequalities; median tests; microarrays; t-tests


Kosorok, Michael R.; Ma, Shuangge. Marginal asymptotics for the “large p, small n” paradigm: With applications to microarray data. Ann. Statist. 35 (2007), no. 4, 1456-1486. doi:10.1214/009053606000001433.


