Statistics Surveys

Navigating Random Forests and related advances in algorithmic modeling

David S. Siroky
Source: Statist. Surv. Volume 3 (2009), 147-163.

Abstract

This article addresses current methodological research on non-parametric Random Forests. It provides a brief intellectual history of Random Forests that covers CART, boosting and bagging methods. It then introduces the primary methods by which researchers can visualize results, the relationships between covariates and responses, and the out-of-bag test set error. In addition, the article considers current research on universal consistency and importance tests in Random Forests. Finally, several uses for Random Forests are discussed, and available software is identified.

Primary Subjects: 62-02, 62-04, 62G08, 62G09, 62H30, 93E25, 62M99, 62N99
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ssu/1257431567
Digital Object Identifier: doi:10.1214/07-SS033
Mathematical Reviews number (MathSciNet): MR2556872
Zentralblatt MATH identifier: 05719274


2012 © The author, under a Creative Commons Attribution License
