Bernoulli

Feature selection when there are many influential features

Peter Hall, Jiashun Jin, and Hugh Miller

Full-text: Open access

Abstract

Recent discussion of the success of feature selection methods has argued that focusing on a relatively small number of features has been counterproductive. Instead, it is suggested, the number of significant features can be in the thousands or tens of thousands, rather than (as is commonly supposed at present) approximately in the range from five to fifty. This change, in orders of magnitude, in the number of influential features, necessitates alterations to the way in which we choose features and to the manner in which the success of feature selection is assessed. In this paper, we suggest a general approach that is suited to cases where the number of relevant features is very large, and we consider particular versions of the approach in detail. We propose ways of measuring performance, and we study both theoretical and numerical properties of the proposed methodology.

Article information

Source
Bernoulli Volume 20, Number 3 (2014), 1647-1671.

Dates
First available in Project Euclid: 11 June 2014

Permanent link to this document
https://projecteuclid.org/euclid.bj/1402488953

Digital Object Identifier
doi:10.3150/13-BEJ536

Mathematical Reviews number (MathSciNet)
MR3217457

Zentralblatt MATH identifier
06327922

Keywords
change-point analysis; classification; dimension reduction; feature selection; logit model; maximum likelihood; ranking; thresholding

Citation

Hall, Peter; Jin, Jiashun; Miller, Hugh. Feature selection when there are many influential features. Bernoulli 20 (2014), no. 3, 1647–1671. doi:10.3150/13-BEJ536. https://projecteuclid.org/euclid.bj/1402488953


