## The Annals of Applied Statistics

### Variable selection for BART: An application to gene regulation

#### Abstract

We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors that regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266–298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the relevant predictors are few relative to the total number of available predictors and the underlying relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that, compared to other variable selection methods, our BART-based procedure is best able to recover the subset of covariates with the largest signal. The methods developed in this work are readily available in the R package bartMachine.
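The general idea behind a permutation-based selection rule can be illustrated with a minimal sketch. It is not the paper's procedure: the importance score here is the absolute Pearson correlation, used only as a stand-in for a BART-derived measure of variable importance, and the function names `abs_correlation` and `permutation_select`, the per-variable threshold rule, and the simulated data are all assumptions made for illustration. The sketch permutes the response to break any link with the predictors, recomputes the scores to build a null distribution, and keeps the predictors whose observed score exceeds an upper quantile of that null.

```python
import random
import statistics


def abs_correlation(x, y):
    """Absolute Pearson correlation: a stand-in importance score (assumption)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_select(columns, y, n_perm=200, alpha=0.05, seed=0):
    """Select predictors whose observed score exceeds a permutation-null threshold.

    columns: list of predictor columns, each a list of floats.
    Returns (indices of selected predictors, observed scores).
    """
    rng = random.Random(seed)
    observed = [abs_correlation(col, y) for col in columns]

    # Null scores: permuting y removes any real predictor-response relationship.
    null_scores = [[] for _ in columns]
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        for j, col in enumerate(columns):
            null_scores[j].append(abs_correlation(col, y_perm))

    selected = []
    idx = min(n_perm - 1, int((1 - alpha) * n_perm))
    for j, scores in enumerate(null_scores):
        scores.sort()
        threshold = scores[idx]  # approximately the (1 - alpha) empirical quantile
        if observed[j] > threshold:
            selected.append(j)
    return selected, observed


# Simulated data (assumption): only predictor 0 drives the response.
data_rng = random.Random(1)
n, p = 100, 5
columns = [[data_rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * columns[0][i] + data_rng.gauss(0, 1) for i in range(n)]

selected, observed = permutation_select(columns, y)
print("selected predictors:", selected)
```

Because each predictor is tested against its own null quantile, an irrelevant predictor can still be selected by chance at roughly rate `alpha`; a stricter rule would compare every score against the null distribution of the maximum score across predictors.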

#### Article information

Source
Ann. Appl. Stat., Volume 8, Number 3 (2014), 1750–1781.

Dates
First available in Project Euclid: 23 October 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1414091233

Digital Object Identifier
doi:10.1214/14-AOAS755

Mathematical Reviews number (MathSciNet)
MR3271352

Zentralblatt MATH identifier
1304.62132

#### Citation

Bleich, Justin; Kapelner, Adam; George, Edward I.; Jensen, Shane T. Variable selection for BART: An application to gene regulation. Ann. Appl. Stat. 8 (2014), no. 3, 1750–1781. doi:10.1214/14-AOAS755. https://projecteuclid.org/euclid.aoas/1414091233

#### References

• Bleich, J., Kapelner, A., George, E. and Jensen, S. (2014). Supplement to “Variable selection for BART: An application to gene regulation.” DOI:10.1214/14-AOAS755SUPP.
• Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for Bayesian model exploration. Bayesian Anal. 5 583–618.
• Breiman, L. (2001). Random forests. Machine Learning 45 5–32.
• Breiman, L. and Cutler, A. (2013). Online manual for random forests. Available at www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.
• Chipman, H. A., George, E. I. and McCulloch, R. E. (1998). Bayesian CART model search. J. Amer. Statist. Assoc. 93 935–948.
• Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4 266–298.
• Deng, H. and Runger, G. (2012). Feature selection via regularized trees. In The 2012 International Joint Conference on Neural Networks (IJCNN).
• Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7 1–13.
• Friedman, J. H. (1991). Multivariate adaptive regression splines. Ann. Statist. 19 1–141. With discussion and a rejoinder by the author.
• Friedman, J. H. (2002). Stochastic gradient boosting. Comput. Statist. Data Anal. 38 367–378.
• Friedman, J. H., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
• Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6 721–741.
• George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889.
• Gramacy, R. B., Taddy, M. and Wild, S. M. (2013). Variable selection and sensitivity analysis using dynamic trees, with an application to computer code performance tuning. Ann. Appl. Stat. 7 51–80.
• Chen, G., Jensen, S. T. and Stoeckert, C. J. (2007). Clustering of genes into regulons using integrated modeling: COGRIM. Genome Biol. 8 R4.
• Hans, C. (2009). Bayesian lasso regression. Biometrika 96 835–845.
• Hans, C., Dobra, A. and West, M. (2007). Shotgun stochastic search for “large $p$” regression. J. Amer. Statist. Assoc. 102 507–516.
• Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.
• Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics 32 1–49.
• Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. Ann. Statist. 33 730–773.
• Ishwaran, H. and Rao, J. S. (2010). Generalized ridge regression: Geometry and computational solutions when $p$ is larger than $n$. Technical report.
• Ishwaran, H., Rao, J. S. and Kogalur, U. B. (2013). spikeslab: Prediction and variable selection using spike and slab regression. Available at http://cran.r-project.org/web/packages/spikeslab/. R package version 1.1.5.
• Jensen, S. T., Chen, G. and Stoeckert, C. J., Jr. (2007). Bayesian variable selection and data integration for biological regulatory networks. Ann. Appl. Stat. 1 612–633.
• Kapelner, A. and Bleich, J. (2014). bartMachine: Machine learning with Bayesian additive regression trees. Available at arXiv:1312.2171.
• Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J. B., Volkert, T. L., Fraenkel, E., Gifford, D. K. and Young, R. A. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298 799–804.
• Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J. Amer. Statist. Assoc. 105 1202–1214.
• Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News 2 18–22.
• Miller, A. J. (2002). Subset Selection in Regression, 2nd ed. Chapman & Hall, London.
• Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83 1023–1036.
• Park, T. and Casella, G. (2008). The Bayesian lasso. J. Amer. Statist. Assoc. 103 681–686.
• Rockova, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection. J. Amer. Statist. Assoc. 109 828–846.
• Stingo, F. and Vannucci, M. (2011). Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics 27 495–501.
• Taddy, M. A., Gramacy, R. B. and Polson, N. G. (2011). Dynamic trees for learning and design. J. Amer. Statist. Assoc. 106 109–123.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• van Rijsbergen, C. J. (1979). Information Retrieval, 2nd ed. Butterworth, Stoneham.
• Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320.

#### Supplemental materials

• Supplementary material: Additional results for simulations and gene regulation application. Complete set of results for simulations in Section 4 and additional output for Section 5.