The Annals of Statistics

High-dimensional classification using features annealed independence rules

Jianqing Fan and Yingying Fan

Abstract

Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina [Bernoulli 10 (2004) 989–1010] show that the Fisher discriminant performs poorly due to diverging spectra, and they propose the independence rule to overcome the problem. We first demonstrate that, even for the independence classification rule, classification using all the features can be as poor as random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as poorly as random guessing. Thus, it is important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, of the threshold value of the test statistics, is proposed based on an upper bound on the classification error. Simulation studies and real data analysis support our theoretical results and convincingly demonstrate the advantage of our new classification procedure.
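
The procedure summarized in the abstract is algorithmic: screen each feature by its two-sample t-statistic, keep the features whose statistics exceed a threshold (equivalently, keep the m largest in absolute value), and classify with the diagonal-covariance independence rule on the retained subset. The Python/NumPy sketch below illustrates that pipeline under simplifying assumptions; the function names are illustrative, and the cutoff m is taken as given, whereas the paper selects it by minimizing an upper bound on the classification error. This is a sketch for orientation, not the authors' implementation.

import numpy as np

def two_sample_t(X1, X2):
    """Componentwise two-sample t-statistics used to rank features."""
    n1, n2 = X1.shape[0], X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2)
    return diff / se

def fair_classifier(X1, X2, m):
    """Keep the m features with the largest |t| and build the independence
    (diagonal-covariance) discriminant on that subset."""
    t = two_sample_t(X1, X2)
    keep = np.argsort(-np.abs(t))[:m]  # features surviving the "anneal"
    mu1 = X1[:, keep].mean(axis=0)
    mu2 = X2[:, keep].mean(axis=0)
    n1, n2 = X1.shape[0], X2.shape[0]
    # Pooled per-feature variances; off-diagonal covariances are ignored,
    # which is the independence rule of Bickel and Levina (2004).
    s2 = ((n1 - 1) * X1[:, keep].var(axis=0, ddof=1)
          + (n2 - 1) * X2[:, keep].var(axis=0, ddof=1)) / (n1 + n2 - 2)
    def classify(x):
        # Assign to the class whose centroid is closer in the
        # variance-standardized metric.
        d1 = np.sum((x[keep] - mu1) ** 2 / s2)
        d2 = np.sum((x[keep] - mu2) ** 2 / s2)
        return 1 if d1 <= d2 else 2
    return classify

Retaining only the strongest features guards against the noise-accumulation phenomenon described above: when m is comparable to the full dimension, the estimation error in the centroids can swamp the signal and drive even the independence rule toward random guessing.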

Article information

Source
Ann. Statist., Volume 36, Number 6 (2008), 2605–2637.

Dates
First available in Project Euclid: 5 January 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1231165181

Digital Object Identifier
doi:10.1214/07-AOS504

Mathematical Reviews number (MathSciNet)
MR2485009

Zentralblatt MATH identifier
1360.62327

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62J12: Generalized linear models; 62F12: Asymptotic properties of estimators

Keywords
Classification; feature extraction; high dimensionality; independence rule; misclassification rates

Citation

Fan, Jianqing; Fan, Yingying. High-dimensional classification using features annealed independence rules. Ann. Statist. 36 (2008), no. 6, 2605--2637. doi:10.1214/07-AOS504. https://projecteuclid.org/euclid.aos/1231165181


References

  • Antoniadis, A., Lambert-Lacroix, S. and Leblanc, F. (2003). Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19 563–570.
  • Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica 6 311–329.
  • Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006). Prediction by supervised principal components. J. Amer. Statist. Assoc. 101 119–137.
  • Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes,” and some alternatives when there are many more variables than observations. Bernoulli 10 989–1010.
  • Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Stat. Appl. Genet. Mol. Biol. 3 1–33.
  • Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–339.
  • Bura, E. and Pfeiffer, R. M. (2003). Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics 19 1252–1258.
  • Cao, H. (2007). Moderate deviations for two sample t-statistics. ESAIM Probab. Statist. 11 264–271.
  • Chiaromonte, F. and Martinelli, J. (2002). Dimension reduction strategies for analyzing global gene expression data with a response. Math. Biosci. 176 123–144.
  • Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061–1069.
  • Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 77–87.
  • Fan, J. (1996). Test of significance based on wavelet thresholding and Neyman’s truncation. J. Amer. Statist. Assoc. 91 674–688.
  • Fan, J., Hall, P. and Yao, Q. (2006). To how many simultaneous hypothesis tests can normal, Student’s t or bootstrap calibration be applied? Manuscript.
  • Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In Proceedings of the International Congress of Mathematicians (M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera, eds.) III 595–622. Eur. Math. Soc., Zürich.
  • Fan, J. and Ren, Y. (2006). Statistical analysis of DNA microarray data. Clinical Cancer Research 12 4469–4473.
  • Fan, J. and Lv, J. (2007). Sure independence screening for ultra-high dimensional feature space. Manuscript.
  • Friedman, J. H. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 165–175.
  • Ghosh, D. (2002). Singular value decomposition regression modeling for classification of tumors from microarray experiments. Proceedings of the Pacific Symposium on Biocomputing 18–29.
  • Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531–537.
  • Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62 4963–4967.
  • Greenshtein, E. (2006). Best subset selection, persistence in high-dimensional statistical learning and optimization under l1 constraint. Ann. Statist. 34 2367–2386.
  • Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
  • Huang, X. and Pan, W. (2003). Linear regression and two-class classification with gene expression data. Bioinformatics 19 2072–2078.
  • Meinshausen, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal. To appear.
  • Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18 39–50.
  • Shao, Q. M. (2005). Self-normalized limit theorems in probability and statistics. Manuscript.
  • Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico, A. and Richie, J. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203–209.
  • Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99 6567–6572.
  • van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Marks, J. R. and Nevins, J. R. (2001). Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci. 98 11462–11467.
  • Zou, H., Hastie, T. and Tibshirani, R. (2004). Sparse principal component analysis. Technical report.