The Annals of Applied Statistics

Improving the precision of classification trees

Wei-Yin Loh

Abstract

Besides serving as prediction models, classification trees are useful for finding important predictor variables and identifying interesting subgroups in the data. These functions can be compromised by weak split selection algorithms that have variable selection biases or that fail to search beyond local main effects at each node of the tree. The resulting models may include many irrelevant variables or select too few of the important ones. Either eventuality can lead to erroneous conclusions. Four techniques to improve the precision of the models are proposed and their effectiveness compared with that of other algorithms, including tree ensembles, on real and simulated data sets.
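
As a rough illustration of the variable selection bias discussed in the abstract (not an excerpt from the paper), the short Python sketch below fits depth-one exhaustive-search classification trees to data in which the response is independent of both predictors; scikit-learn and every setting shown are assumptions made purely for illustration. The predictor offering more candidate split points is selected at the root far more often than the binary predictor, even though neither is informative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n, reps = 200, 500
    first_split = []
    for _ in range(reps):
        x1 = rng.integers(0, 2, size=n)        # binary predictor: only one candidate split point
        x2 = rng.random(size=n)                # continuous predictor: roughly n - 1 candidate split points
        y = rng.integers(0, 2, size=n)         # response generated independently of x1 and x2
        X = np.column_stack([x1, x2])
        tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
        root_feature = tree.tree_.feature[0]   # feature index used at the root; -2 marks a leaf (no split)
        if root_feature >= 0:
            first_split.append(root_feature)

    counts = np.bincount(first_split, minlength=2)
    print("x1 chosen:", counts[0], "times; x2 chosen:", counts[1], "times")

An unbiased split selection method would instead choose each uninformative predictor at the root roughly equally often.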

Article information

Source
Ann. Appl. Stat., Volume 3, Number 4 (2009), 1710-1737.

Dates
First available in Project Euclid: 1 March 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1267453961

Digital Object Identifier
doi:10.1214/09-AOAS260

Mathematical Reviews number (MathSciNet)
MR2752155

Zentralblatt MATH identifier
1184.62109

Keywords
Bagging, kernel density, discrimination, nearest neighbor, prediction, random forest, selection bias, variable selection

Citation

Loh, Wei-Yin. Improving the precision of classification trees. Ann. Appl. Stat. 3 (2009), no. 4, 1710–1737. doi:10.1214/09-AOAS260. https://projecteuclid.org/euclid.aoas/1267453961


References

  • [1] Amasyali, M. F. and Ersoy, O. (2008). CLINE: A new decision-tree family. IEEE Transactions on Neural Networks 19 356–363.
  • [2] Atkinson, E. J. and Therneau, T. M. (2000). An introduction to recursive partitioning using the RPART routines. Technical Report 61, Biostatistics Section, Mayo Clinic, Rochester, MN.
  • [3] Breiman, L. (1996). Bagging predictors. Mach. Learn. 24 123–140.
  • [4] Breiman, L. (2001). Random forests. Mach. Learn. 45 5–32.
  • [5] Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
  • [6] Buttrey, S. E. and Karo, C. (2002). Using k-nearest-neighbor classification in the leaves of a tree. Comput. Statist. Data Anal. 40 27–37.
  • [7] Cantu-Paz, E. and Kamath, C. (2003). Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation 7 54–68.
  • [8] Clark, V. (2004). SAS/STAT 9.1 User’s Guide. SAS Publishing, Cary, NC.
  • [9] Doyle, P. (1973). The use of Automatic Interaction Detector and similar search procedures. Operational Research Quarterly 24 465–467.
  • [10] Fan, G. (2008). Kernel-induced classification trees and random forests. Manuscript.
  • [11] Gama, J. (2004). Functional trees. Mach. Learn. 55 219–250.
  • [12] Ghosh, A. K., Chaudhuri, P. and Sengupta, D. (2006). Classification using kernel density estimates: Multiscale analysis and visualization. Technometrics 48 120–132.
  • [13] Heinz, G., Peterson, L. J., Johnson, R. W. and Kerk, C. J. (2003). Exploring relationships in body dimensions. Journal of Statistics Education 11. Available at www.amstat.org/publications/jse/v11n2/datasets.heinz.html.
  • [14] Hosmer, D. W. and Lemeshow, S. (2000). Applied Logistic Regression, 2nd ed. Wiley, New York.
  • [15] Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Statist. 15 651–674.
  • [16] Kim, H. and Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. J. Amer. Statist. Assoc. 96 589–604.
  • [17] Kim, H. and Loh, W.-Y. (2003). Classification trees with bivariate linear discriminant node models. J. Comput. Graph. Statist. 12 512–530.
  • [18] Lee, T.-H. and Shih, Y.-S. (2006). Unbiased variable selection for classification trees with multivariate responses. Comput. Statist. Data Anal. 51 659–667.
  • [19] Li, X. B., Sweigart, J. R., Teng, J. T. C., Donohue, J. M., Thombs, L. A. and Wang, S. M. (2003). Multivariate decision trees using linear discriminants and tabu search. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 33 194–205.
  • [20] Li, Y. H., Dong, M. and Kothari, R. (2005). Classifiability-based omnivariate decision trees. IEEE Transactions on Neural Networks 16 1547–1560.
  • [21] Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News 2 18–22. Available at http://CRAN.R-project.org/doc/Rnews/.
  • [22] Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach. Learn. 40 203–228.
  • [23] Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statist. Sinica 12 361–386.
  • [24] Loh, W.-Y. and Shih, Y.-S. (1997). Split selection methods for classification trees. Statist. Sinica 7 815–840.
  • [25] Loh, W.-Y. and Vanichsetakul, N. (1988). Tree-structured classification via generalized discriminant analysis (with discussion). J. Amer. Statist. Assoc. 83 715–728.
  • [26] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
  • [27] Morgan, J. N. and Messenger, R. C. (1973). THAID: A sequential analysis program for the analysis of nominal scale dependent variables. Technical report, Institute for Social Research, Univ. Michigan, Ann Arbor.
  • [28] Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 415–434.
  • [29] Noh, H. G., Song, M. S. and Park, S. H. (2004). An unbiased method for constructing multilabel classification trees. Comput. Statist. Data Anal. 47 149–164.
  • [30] Perlich, C., Provost, F. and Simonoff, J. S. (2003). Tree induction vs. logistic regression: A learning-curve analysis. J. Mach. Learn. Res. 4 211–255.
  • [31] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
  • [32] StataCorp. (2003). Stata Statistical Software: Release 8.0. Stata Corporation, College Station, TX.
  • [33] Strobl, C., Boulesteix, A.-L. and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Comput. Statist. Data Anal. 52 483–501.
  • [34] Wilson, E. B. and Hilferty, M. M. (1931). The distribution of chi-square. Proc. Nat. Acad. Sci. USA 17 684–688.
  • [35] Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, San Francisco, CA.
  • [36] Yildiz, O. T. and Alpaydin, E. (2005). Linear discriminant trees. International Journal of Pattern Recognition and Artificial Intelligence 19 323–353.