The Annals of Statistics

A two-sample test for high-dimensional data with applications to gene-set testing

Song Xi Chen and Ying-Li Qin

Full-text: Open access


We propose a two-sample test for the means of high-dimensional data when the data dimension is much larger than the sample size. Hotelling’s classical T² test does not work in this “large p, small n” situation. The proposed test does not require explicit conditions on the relationship between the data dimension and the sample size, which offers much flexibility in analyzing high-dimensional data. One application of the proposed test is testing the significance of sets of genes, which we demonstrate in an empirical study on a leukemia data set.
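The abstract does not state the test statistic. As a minimal sketch, the code below assumes the unstandardized statistic from the body of the paper, which takes the squared distance between the two sample means and drops the diagonal terms X_i'X_i, so that no inverse of the sample covariance is needed. Hotelling's test needs that inverse, and the pooled sample covariance is singular once p exceeds n1 + n2 - 2. The function names `cross_sum` and `mean_difference_statistic` are hypothetical, and the standardization by an estimated variance that turns this quantity into a test is deliberately left out.

```python
import random


def cross_sum(xs):
    """Sum of inner products x_i . x_j over all ordered pairs i != j.

    Uses the identity sum_{i != j} x_i . x_j = ||sum_i x_i||^2 - sum_i ||x_i||^2,
    which costs O(n * p) instead of O(n^2 * p).
    """
    p = len(xs[0])
    total = [sum(x[k] for x in xs) for k in range(p)]
    sq_norm_of_total = sum(t * t for t in total)
    sum_of_sq_norms = sum(v * v for x in xs for v in x)
    return sq_norm_of_total - sum_of_sq_norms


def mean_difference_statistic(x1, x2):
    """Unstandardized two-sample statistic (assumed form, see lead-in).

    x1 and x2 are lists of observations, each a list of p floats.
    The expected value is ||mu1 - mu2||^2, so it is centred at 0 when
    the two population means are equal.
    """
    n1, n2 = len(x1), len(x2)
    p = len(x1[0])
    s1 = [sum(x[k] for x in x1) for k in range(p)]
    s2 = [sum(x[k] for x in x2) for k in range(p)]
    between = sum(a * b for a, b in zip(s1, s2))
    return (
        cross_sum(x1) / (n1 * (n1 - 1))
        + cross_sum(x2) / (n2 * (n2 - 1))
        - 2.0 * between / (n1 * n2)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    p, n1, n2 = 200, 8, 10  # dimension far larger than either sample size

    def sample(n, shift):
        # Independent standard normal coordinates plus a common mean shift.
        return [[rng.gauss(shift, 1.0) for _ in range(p)] for _ in range(n)]

    print("equal means:  ", mean_difference_statistic(sample(n1, 0.0), sample(n2, 0.0)))
    print("shifted means:", mean_difference_statistic(sample(n1, 0.0), sample(n2, 0.5)))
```

With equal means the value fluctuates around zero, while a mean shift pushes it towards p times the squared shift. Turning this into a p-value still requires the variance estimate and the asymptotic normality established in the paper.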

Article information

Ann. Statist., Volume 38, Number 2 (2010), 808-835.

First available in Project Euclid: 19 February 2010

Digital Object Identifier: doi:10.1214/09-AOS716

Primary:
  62H15: Hypothesis testing
  60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]
Secondary:
  62G10: Hypothesis testing

Keywords: High dimension; gene-set testing; large p, small n; martingale central limit theorem; multiple comparison


Chen, Song Xi; Qin, Ying-Li. A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38 (2010), no. 2, 808–835. doi:10.1214/09-AOS716.


