Institute of Mathematical Statistics Collections

Ensemble classifiers

Dhammika Amaratunga, Javier Cabrera, Yauheniya Cherckas, and Yung-Seop Lee

Full-text: Open access

Abstract

Ensemble classification methods such as Random Forest are powerful and versatile classifiers. We explore variations on the ensemble approach and demonstrate the strong performance of ensemble versions of Linear Discriminant Analysis (LDA) variants such as LDA-PCA (LDA after a Principal Components Analysis step to reduce dimensionality) and LASSO in situations characterized by a huge number of features and a small number of samples, such as DNA microarray data. We also demonstrate the value of enriching the ensembles with the features most likely to be informative in situations where only a very small percentage of the features actually carries classification information. Notably, in the case studies we analyzed, the enriched ensemble procedure with LDA-PCA as the base classifier had a misclassification rate essentially half that observed with Random Forest.
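The ensemble idea described in the abstract can be sketched as follows: fit many LDA-PCA base classifiers, each on a bootstrap sample restricted to a random feature subset, and combine them by majority vote. This is a minimal illustrative sketch using scikit-learn, not the authors' implementation; all function names and parameter choices (`n_estimators`, `n_features`, `n_components`) are assumptions for illustration.

```python
# Sketch of an ensemble of LDA-PCA base classifiers: each base learner is
# fit on a bootstrap sample restricted to a random feature subset, PCA
# reduces dimensionality, LDA classifies, and the ensemble predicts by
# majority vote. Illustrative only; not the chapter's actual code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def fit_ensemble(X, y, n_estimators=25, n_features=25, n_components=5):
    """Fit (feature subset, PCA, LDA) triples on bootstrap samples."""
    n, p = X.shape
    ensemble = []
    for _ in range(n_estimators):
        feats = rng.choice(p, size=min(n_features, p), replace=False)
        rows = rng.choice(n, size=n, replace=True)  # bootstrap sample
        pca = PCA(n_components=min(n_components, len(feats), n - 1))
        Z = pca.fit_transform(X[np.ix_(rows, feats)])
        lda = LinearDiscriminantAnalysis().fit(Z, y[rows])
        ensemble.append((feats, pca, lda))
    return ensemble

def predict_ensemble(ensemble, X):
    """Majority vote of the base classifiers."""
    votes = np.stack([lda.predict(pca.transform(X[:, feats]))
                      for feats, pca, lda in ensemble])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Tiny synthetic "many features, few samples" example: only the first
# 10 of 100 features carry any class information.
X = rng.normal(size=(40, 100))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 2.0
model = fit_ensemble(X, y)
print((predict_ensemble(model, X) == y).mean())  # training accuracy
```

The enriched variant mentioned in the abstract would replace the uniform `rng.choice` over features with sampling weighted by a per-feature informativeness score (e.g., a t-type test statistic), so that likely-informative features enter more of the base classifiers.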

Chapter information

Source
Dominique Fourdrinier, Éric Marchand and Andrew L. Rukhin, eds., Contemporary Developments in Bayesian Analysis and Statistical Decision Theory: A Festschrift for William E. Strawderman (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2012), 235–246

Dates
First available in Project Euclid: 14 March 2012

Permanent link to this document
https://projecteuclid.org/euclid.imsc/1331731623

Digital Object Identifier
doi:10.1214/11-IMSCOLL816

Mathematical Reviews number (MathSciNet)
MR3202514

Zentralblatt MATH identifier
1320.62142

Subjects
Primary: 62P10 (Applications to biology and medical sciences); 68T10 (Pattern recognition, speech recognition); 68T05 (Learning and adaptive systems)

Keywords
classification; ensemble; lasso; linear discriminant analysis; microarray; random forest

Rights
Copyright © 2012, Institute of Mathematical Statistics

Citation

Amaratunga, Dhammika; Cabrera, Javier; Cherckas, Yauheniya; Lee, Yung-Seop. Ensemble classifiers. Contemporary Developments in Bayesian Analysis and Statistical Decision Theory: A Festschrift for William E. Strawderman, 235–246, Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2012. doi:10.1214/11-IMSCOLL816. https://projecteuclid.org/euclid.imsc/1331731623


References

  • Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, New York.
  • Amaratunga, D. and Cabrera, J. (2009). A conditional t suite of tests for identifying differentially expressed genes in a DNA microarray experiment with little replication. Statistics in Biopharmaceutical Research 1 26–38.
  • Amaratunga, D., Cabrera, J. and Lee, Y.S. (2008). Enriched random forest. Bioinformatics 24 2010–2014.
  • Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545–1588.
  • Belhumeur, P.N., Hespanha, J.P. and Kriegman, D.J. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Face Recognition) 19 711–720.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57 289–300.
  • Biau, G., Devroye, L. and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9 2015–2033.
  • Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
  • Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine Learning 36 85–103.
  • Breiman, L. (2001a). Random Forests. Machine Learning 45 5–32.
  • Breiman, L. (2001b). Wald Lecture II: Looking Inside the Black Box.
  • Breiman, L. and Cutler, A. (2003). Random Forests Manual (version 4.0), Technical Report of the University of California, Berkeley, Department of Statistics.
  • Dietterich, T.G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40 139–157.
  • Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97 77–87.
  • Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 179–188.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.
  • Lee, J.W., Lee, J.B., Park, M. and Song, S.H. (2005). An extensive evaluation of recent classification tools applied to microarray data. Computational Statistics and Data Analysis 48 869–885.
  • Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association 101 578–590.
  • Liu, J. and Chen, S. (2005). Resampling LDA/QR and PCA+LDA for face recognition. Australian Conference on Artificial Intelligence, 1221–1224.
  • Lokhorst, J. (1999). The lasso and generalized linear models. Honors Project. University of Adelaide, Adelaide.
  • Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7 983–999.
  • Moechars, D., Vanacker, N., Cryns, K., Andries, L., Mancini, G. and Verheijen, F. (2005). Sialin-deficient mice: a novel animal model for infantile free sialic acid storage disease (ISSD). Society for Neuroscience 35th Annual Meeting, Washington, USA.
  • Raghavan, N., De Bondt, A., Talloen, W., Moechars, D., Göhlmann, H. and Amaratunga, D. (2007). The high-level similarity of some disparate gene expression measures. Bioinformatics 23 3032–3038.
  • Shevade, S.K. and Keerthi, S.S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19 2246–2253.
  • Statnikov, A., Wang, L. and Aliferis, C.F. (2008). A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9 319.
  • Storey, J.D. and Tibshirani, R. (2003). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences 100 9440–9445.
  • Strunnikova, N., Hilmer, S., Flippin, J., Robinson, M., Hoffman, E. and Csaky, K. (2005). Differences in gene expression profiles in dermal fibroblasts from control and patients with age-related macular degeneration elicited by oxidative injury. Free Radical Biology and Medicine 39 781–796.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288.