Statistics Surveys

Navigating Random Forests and related advances in algorithmic modeling

David S. Siroky

Full-text: Open access

Abstract

This article addresses current methodological research on non-parametric Random Forests. It provides a brief intellectual history of Random Forests that covers CART, boosting and bagging methods. It then introduces the primary methods by which researchers can visualize results, the relationships between covariates and responses, and the out-of-bag test set error. In addition, the article considers current research on universal consistency and importance tests in Random Forests. Finally, several uses for Random Forests are discussed, and available software is identified.
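As a concrete illustration of the out-of-bag error estimate and the variable-importance ideas the article surveys, the short sketch below fits a Random Forest in Python with scikit-learn. This is not the Breiman–Cutler Fortran/R software the article itself identifies; the dataset, hyperparameters, and the use of permutation importance are illustrative assumptions only.

# Minimal sketch (not the article's own software): fit a Random Forest,
# read off the out-of-bag (OOB) error estimate, and rank covariates by
# permutation importance. Dataset and hyperparameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True scores each training observation using only the trees
# whose bootstrap sample excluded it, giving an internal test-like error.
rf = RandomForestClassifier(
    n_estimators=500, oob_score=True, random_state=0, n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.3f}")            # 1 - OOB error
print(f"Held-out accuracy: {rf.score(X_test, y_test):.3f}")

# Permutation importance on held-out data: shuffle one covariate at a
# time and record how much predictive accuracy drops.
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
ranking = sorted(
    zip(X.columns, imp.importances_mean), key=lambda t: t[1], reverse=True
)
for name, score in ranking[:5]:
    print(f"{name}: {score:.4f}")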

Article information

Source
Statist. Surv. Volume 3 (2009), 147-163.

Dates
First available in Project Euclid: 5 November 2009

Permanent link to this document
http://projecteuclid.org/euclid.ssu/1257431567

Digital Object Identifier
doi:10.1214/07-SS033

Mathematical Reviews number (MathSciNet)
MR2556872

Zentralblatt MATH identifier
05719274

Subjects
Primary: 62-02: Research exposition (monographs, survey articles)
62-04: Explicit machine computation and programs (not the theory of computation or programming)
62G08: Nonparametric regression
62G09: Resampling methods
62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]
93E25: Other computational methods
62M99: None of the above, but in this section
62N99: None of the above, but in this section

Keywords
CART; bagging; boosting; Random Forests; algorithmic methods; non-parametrics; ensemble and committee methods

Citation

Siroky, David S. Navigating Random Forests and related advances in algorithmic modeling. Statist. Surv. 3 (2009), 147--163. doi:10.1214/07-SS033. http://projecteuclid.org/euclid.ssu/1257431567.


