The Annals of Statistics

Asymptotic inference for high-dimensional data

Jim Kuelbs and Anand N. Vidyashankar
Source: Ann. Statist. Volume 38, Number 2 (2010), 836-869.

Abstract

In this paper, we study inference for high-dimensional data characterized by small sample sizes relative to the dimension of the data. In particular, we provide an infinite-dimensional framework to study statistical models that involve situations in which (i) the number of parameters increases with the sample size (that is, allowed to be random) and (ii) there is a possibility of missing data. Under a variety of tail conditions on the components of the data, we provide precise conditions for the joint consistency of the estimators of the mean. In the process, we clarify and improve some of the recent consistency results that appeared in the literature. An important aspect of the work presented is the development of asymptotic normality results for these models. As a consequence, we construct different test statistics for one-sample and two-sample problems concerning the mean vector and obtain their asymptotic distributions as a corollary of the infinite-dimensional results. Finally, we use these theoretical results to develop an asymptotically justifiable methodology for data analyses. Simulation results presented here describe situations where the methodology can be successfully applied. They also evaluate its robustness under a variety of conditions, some of which are substantially different from the technical conditions. Comparisons to other methods used in the literature are provided. Analyses of real-life data are also included.

Primary Subjects: 60B10, 60B12, 60F05, 62A01, 62H15, 62G20, 62F40, 92B15
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1266586616
Digital Object Identifier: doi:10.1214/09-AOS718
Zentralblatt MATH identifier: 05686521
Mathematical Reviews number (MathSciNet): MR2604698

2013 © Institute of Mathematical Statistics