Gibbs posterior for variable selection in high-dimensional classification and data mining



The Annals of Statistics

Wenxin Jiang and Martin A. Tanner

Source: Ann. Statist. Volume 36, Number 5 (2008), 2207-2231.

Abstract

In the popular approach of “Bayesian variable selection” (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing a risk function without modeling the data probabilistically. This can improve the performance over the usual Bayesian approach, which depends on a probability model which may be misspecified. Conditions will be provided to achieve good risk performance, even in the presence of high dimensionality, when the number of candidate variables “K” can be much larger than the sample size “n.” In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.
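
As a rough illustration of the idea in the abstract, the sketch below targets an unnormalized Gibbs posterior of the form exp(-psi * n * R_n) times a prior, where R_n is the empirical classification error of a linear rule. It is not the authors' algorithm: the abstract mentions a data-augmentation-based Markov chain Monte Carlo scheme, whereas this sketch swaps in a plain Metropolis sampler. The names `psi`, `incl_prob`, `slab_sd` and `step`, the spike-and-slab style prior, and the use of inclusion indicators multiplying always-present coefficients are all assumptions made for the example.

```python
import math
import random


def empirical_risk(coef, X, y):
    """Classification error of the linear rule: predict 1 when x . coef > 0."""
    errors = 0
    for xi, yi in zip(X, y):
        score = sum(c * v for c, v in zip(coef, xi))
        errors += int((1 if score > 0 else 0) != yi)
    return errors / len(y)


def log_gibbs_target(gamma, beta, X, y, psi, incl_prob, slab_sd):
    """Unnormalized log Gibbs posterior: -psi * n * risk + log prior.

    Assumed prior (hypothetical, for illustration only): each indicator
    gamma_j is Bernoulli(incl_prob) and each beta_j is N(0, slab_sd^2).
    The coefficient actually used by the classifier is gamma_j * beta_j.
    """
    coef = [g * b for g, b in zip(gamma, beta)]
    log_prior = 0.0
    for g, b in zip(gamma, beta):
        log_prior += math.log(incl_prob if g else 1.0 - incl_prob)
        log_prior += -0.5 * (b / slab_sd) ** 2
    return -psi * len(y) * empirical_risk(coef, X, y) + log_prior


def sample_inclusion_freq(X, y, psi=2.0, incl_prob=0.2, slab_sd=1.0,
                          n_iter=5000, step=0.5, seed=0):
    """Metropolis sampler; returns how often each variable was included."""
    rng = random.Random(seed)
    K = len(X[0])
    gamma = [0] * K
    beta = [rng.gauss(0.0, slab_sd) for _ in range(K)]
    current = log_gibbs_target(gamma, beta, X, y, psi, incl_prob, slab_sd)
    counts = [0] * K
    for _ in range(n_iter):
        j = rng.randrange(K)
        new_gamma, new_beta = gamma[:], beta[:]
        if rng.random() < 0.5:
            new_gamma[j] = 1 - new_gamma[j]      # symmetric flip of one indicator
        else:
            new_beta[j] += rng.gauss(0.0, step)  # symmetric random-walk move
        proposed = log_gibbs_target(new_gamma, new_beta, X, y,
                                    psi, incl_prob, slab_sd)
        # Both proposals are symmetric, so the acceptance probability
        # is just the ratio of target values, capped at 1.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            gamma, beta, current = new_gamma, new_beta, proposed
        for k in range(K):
            counts[k] += gamma[k]
    return [c / n_iter for c in counts]
```

Calling `sample_inclusion_freq(X, y)` with `X` as a list of feature rows and `y` as a list of 0/1 labels returns one inclusion frequency per candidate variable. Because the target depends on the data only through the classification error, no likelihood for the labels is specified, which is the point the abstract makes about avoiding a possibly misspecified probability model.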

Primary Subjects: 62F99
Secondary Subjects: 82-08
Keywords: Data augmentation; data mining; Gibbs posterior; high-dimensional data; linear classification; Markov chain Monte Carlo; prior distribution; risk performance; sparsity; variable selection

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1223908090
Digital Object Identifier: doi:10.1214/07-AOS547

2009 © Institute of Mathematical Statistics