The Annals of Applied Statistics

Tree ensembles with rule structured horseshoe regularization

Malte Nalenz and Mattias Villani


We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach in Friedman and Popescu [Ann. Appl. Stat. 2 (2008) 916–954], where rules from decision trees and linear terms are used in an L1-regularized regression. We modify RuleFit by replacing the L1-regularization with a horseshoe prior, which is well known to give aggressive shrinkage of noise predictors while leaving the important signal essentially untouched. This is especially important when a large number of rules are used as predictors, as many of them only contribute noise. Our horseshoe prior has an additional hierarchical layer that applies more shrinkage a priori to rules with a large number of splits, and to rules that are only satisfied by a few observations. The aggressive noise shrinkage of our prior also makes it possible to complement the rules from boosting in RuleFit with an additional set of trees from Random Forest, which brings a desirable diversity to the ensemble. We sample from the posterior distribution using a very efficient and easily implemented Gibbs sampler. The new model is shown to outperform state-of-the-art methods such as RuleFit, BART and Random Forest on 16 datasets. The model and its interpretation are demonstrated on the well-known Boston housing data, and on gene expression data for cancer classification. The posterior sampling, prediction and graphical tools for interpreting the model results are implemented in a publicly available R package.
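To make the rule-structured part of the prior concrete, the following is a minimal Python sketch, not the authors' implementation. It turns a few decision rules into 0/1 predictors and computes a per-rule prior scale that shrinks rules with low support or many splits more strongly, as the abstract describes. The functional form of `prior_scale`, the exponents `mu` and `eta`, the variable names `rooms` and `lstat`, and the toy data are all assumptions made for illustration; the abstract does not specify them.

```python
# Illustrative sketch only: rule features and a structure-dependent prior scale.
# The form of prior_scale and the exponents mu, eta are assumptions for
# illustration; they are not taken from the abstract.

def rule_feature(rows, conditions):
    """Return a 0/1 vector: 1 where a row satisfies every condition of the rule."""
    return [int(all(cond(row) for cond in conditions)) for row in rows]


def prior_scale(support, n_splits, mu=1.0, eta=2.0):
    """Hypothetical prior scale: smaller for unbalanced support and for many splits.

    support  -- fraction of observations satisfying the rule, in [0, 1]
    n_splits -- number of split conditions in the rule (>= 1)
    """
    balance = 2.0 * min(support, 1.0 - support)  # 1 at support 0.5, 0 at the extremes
    return balance ** mu / n_splits ** eta


# Toy data: each row is (rooms, lstat); the values are made up.
rows = [(6.5, 5.0), (5.0, 20.0), (7.2, 4.0), (6.8, 12.0), (5.5, 18.0), (6.1, 9.0)]

# Each rule is a conjunction of split conditions, as produced by a path in a tree.
rules = {
    "rooms > 6": [lambda r: r[0] > 6],
    "rooms > 6 & lstat < 10": [lambda r: r[0] > 6, lambda r: r[1] < 10],
    "rooms > 7 & lstat < 5": [lambda r: r[0] > 7, lambda r: r[1] < 5],
}

features = {name: rule_feature(rows, conds) for name, conds in rules.items()}
scales = {}
for name, conds in rules.items():
    support = sum(features[name]) / len(rows)
    scales[name] = prior_scale(support, len(conds))
    print(f"{name}: support={support:.2f}, splits={len(conds)}, scale={scales[name]:.3f}")
```

In a full model, each scale would act as the prior scale of that rule's local shrinkage parameter in the horseshoe hierarchy, so the third rule (two splits, satisfied by one observation) would need much stronger evidence from the data to escape shrinkage than the first.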

Article information

Ann. Appl. Stat., Volume 12, Number 4 (2018), 2379–2408.

Received: February 2017
Revised: February 2018
First available in Project Euclid: 13 November 2018


Keywords

Nonlinear regression; classification; decision trees; Bayesian; prediction; MCMC; interpretation


Nalenz, Malte; Villani, Mattias. Tree ensembles with rule structured horseshoe regularization. Ann. Appl. Stat. 12 (2018), no. 4, 2379–2408. doi:10.1214/18-AOAS1157.


  • Breiman, L. (1996). Stacked regressions. Mach. Learn. 24 49–64.
  • Breiman, L. (2001). Random forests. Mach. Learn. 45 5–32.
  • Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. In AISTATS 5 73–80.
  • Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97 465–480.
  • Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794. ACM, New York.
  • Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4 266–298.
  • Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning (ICML’95) 115–123.
  • Dembczyński, K., Kotłowski, W. and Słowiński, R. (2010). ENDER: A statistical framework for boosting decision rules. Data Min. Knowl. Discov. 21 52–90.
  • Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 96 148–156. Bari, Italy.
  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232.
  • Friedman, J. H. and Popescu, B. E. (2003). Importance sampled learning ensembles. Technical report, Dept. Statistics, Stanford Univ., Stanford, CA.
  • Friedman, J. H. and Popescu, B. E. (2008). Predictive learning via rule ensembles. Ann. Appl. Stat. 2 916–954.
  • Fürnkranz, J. (1999). Separate-and-conquer rule learning. Artif. Intell. Rev. 13 3–54.
  • George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889.
  • Glaab, E., Garibaldi, J. M. and Krasnogor, N. (2010). Learning pathway-based decision rules to classify microarray cancer samples.
  • Hahn, P. R. and Carvalho, C. M. (2015). Decoupling shrinkage and selection in Bayesian linear models: A posterior summary perspective. J. Amer. Statist. Assoc. 110 435–448.
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 3149–3157.
  • Li, L. and Yao, W. (2014). Fully Bayesian logistic regression with hyper-LASSO priors for high-dimensional feature selection. J. Stat. Comput. Simul. 88 2827–2851.
  • Linero, A. R. (2018). Bayesian regression trees for high dimensional prediction and variable selection. J. Amer. Statist. Assoc. 113 626–636.
  • Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe estimator. IEEE Signal Process. Lett. 23 179–182.
  • Nalenz, M. and Villani, M. (2018). Supplement to “Tree ensembles with rule structured horseshoe regularization.” DOI:10.1214/18-AOAS1157SUPP.
  • Piironen, J. and Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Stat. Comput. 27 711–735.
  • Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Amer. Statist. Assoc. 108 1339–1349.
  • Puelz, D., Hahn, P. R. and Carvalho, C. M. (2017). Variable selection in seemingly unrelated regressions with random predictors. Bayesian Anal. 12 969–989.
  • Rokach, L. (2010). Ensemble-based classifiers. Artif. Intell. Rev. 33 1–39.
  • Schapire, R. E. (1999). A brief introduction to boosting. In IJCAI 1401–1406.
  • Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203–209.
  • Slonim, D. K. (2002). From patterns to pathways: Gene expression data analysis comes of age. Nat. Genet. 32 (Supp) 502.
  • Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. J. Econometrics 75 317–343.
  • Terenin, A., Dong, S. and Draper, D. (2016). GPU-accelerated Gibbs sampling. Preprint. Available at arXiv:1608.04329.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Van’t Veer, L., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J., Witteveen, A. T. et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 530–536.
  • Wolpert, D. H. (1992). Stacked generalization. Neural Netw. 5 241–259.
  • Yap, Y., Zhang, X., Ling, M. T., Wang, X., Wong, Y. C. and Danchin, A. (2004). Classification between normal and tumor tissues based on the pair-wise gene expression ratio. BMC Cancer 4 72.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320.

Supplemental materials

  • The HorseRule R-package. Example code illustrating the basic features of our HorseRule package in R with standard settings. The package is available on CRAN at