Electronic Journal of Statistics

Projective inference in high-dimensional problems: Prediction and feature selection

Juho Piironen, Markus Paasiniemi, and Aki Vehtari

Abstract

This paper reviews predictive inference and feature selection for generalized linear models with scarce but high-dimensional data. We demonstrate that in many cases one can benefit from a decision-theoretically justified two-stage approach: first, construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize the predictions. The model built in the first step is referred to as the reference model and the operation in the second step as predictive projection. The key characteristic of this approach is that it finds an excellent tradeoff between sparsity and predictive accuracy, and the gain comes from utilizing all available information, including the prior and the information carried by the left-out features. We review several methods that follow this principle and provide novel methodological contributions. We present a new projection technique that unifies two existing techniques and is both accurate and fast to compute. We also propose a way of evaluating the feature selection process using fast leave-one-out cross-validation that allows for easy and intuitive model size selection. Furthermore, we prove a theorem that helps to understand the conditions under which the projective approach could be beneficial. The key ideas are illustrated via several experiments using simulated and real-world data.
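To make the predictive-projection idea concrete, the sketch below (an illustration added for this summary, not part of the abstract) shows the draw-by-draw projection for a Gaussian observation model: for each posterior draw of the reference model, the submodel parameters are chosen to minimize the Kullback-Leibler divergence from the reference model's predictive distribution, which in this case reduces to a least-squares fit of the reference mean predictions on the submodel features, with the unexplained variation absorbed into the submodel noise variance. The function name project_draw and its arguments are hypothetical.

    import numpy as np

    def project_draw(X, beta_ref, sigma2_ref, subset):
        # Project one reference-model posterior draw onto a feature subset
        # (Gaussian observation model). Minimizing the KL divergence from the
        # reference predictive distribution reduces to fitting the reference
        # mean predictions by least squares on the kept features, and adding
        # the leftover discrepancy to the submodel noise variance.
        n = X.shape[0]
        mu_ref = X @ beta_ref              # reference-model mean predictions
        X_s = X[:, subset]                 # submodel design matrix
        beta_s, *_ = np.linalg.lstsq(X_s, mu_ref, rcond=None)
        sigma2_s = sigma2_ref + np.sum((mu_ref - X_s @ beta_s) ** 2) / n
        return beta_s, sigma2_s

Repeating this projection over the posterior draws and over a search path of growing feature subsets yields the projected submodels whose predictive performance can then be tracked, for example with the fast leave-one-out cross-validation mentioned in the abstract; the authors' projpred R package implements the full procedure for generalized linear models.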

Article information

Source
Electron. J. Statist., Volume 14, Number 1 (2020), 2155-2197.

Dates
Received: February 2019
First available in Project Euclid: 13 May 2020

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1589335310

Digital Object Identifier
doi:10.1214/20-EJS1711

Subjects
Primary: 62F15 (Bayesian inference); 62F07 (Ranking and selection); 62J12 (Generalized linear models)

Keywords
Projection; prediction; feature selection; sparsity; post-selection inference

Rights
Creative Commons Attribution 4.0 International License.

Citation

Piironen, Juho; Paasiniemi, Markus; Vehtari, Aki. Projective inference in high-dimensional problems: Prediction and feature selection. Electron. J. Statist. 14 (2020), no. 1, 2155-2197. doi:10.1214/20-EJS1711. https://projecteuclid.org/euclid.ejs/1589335310

