Bayesian Analysis

Data augmentation for support vector machines

Abstract

This paper presents a latent variable representation of regularized support vector machines (SVMs) that enables EM, ECME, or MCMC algorithms to provide parameter estimates. We verify our representation by demonstrating that minimizing the SVM optimality criterion together with the parameter regularization penalty is equivalent to finding the mode of a mean-variance mixture-of-normals pseudo-posterior distribution. The latent variables in the mixture representation lead to EM and ECME point estimates of SVM parameters, as well as MCMC algorithms based on Gibbs sampling that can bring Bayesian tools for Gaussian linear models to bear on SVMs. We show how to implement SVMs with spike-and-slab priors and run them against data from a standard spam filtering data set.
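As a concrete illustration of the data-augmentation scheme the abstract describes, the sketch below implements a two-block Gibbs sampler based on the mixture-of-normals representation of the hinge loss: conditional on the coefficients, each latent scale has an inverse-Gaussian conditional, and conditional on the latent scales, the coefficients are Gaussian. This is a minimal sketch, not the authors' code: the function name `svm_gibbs` is hypothetical, a ridge (Gaussian) prior stands in for the regularization penalty rather than the spike-and-slab prior treated in the paper, and the conditional distributions are written from the mixture representation as stated in the abstract.

```python
import numpy as np

def svm_gibbs(X, y, nu2=1.0, n_iter=500, seed=0):
    """Gibbs sampler for a regularized-SVM pseudo-posterior via
    data augmentation (hypothetical sketch, not the paper's code).

    The hinge-loss pseudo-likelihood exp{-2 max(1 - y_i x_i' beta, 0)}
    is treated as a mean-variance mixture of normals with latent scale
    lambda_i, so both full conditionals are standard distributions.
    y has entries in {-1, +1}; beta ~ N(0, nu2 * I) is a ridge prior
    standing in for the regularization penalty.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = y[:, None] * X                     # rows z_i = y_i x_i
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # 1 / lambda_i | beta is inverse-Gaussian with mean
        # 1 / |1 - z_i' beta| and shape 1 (a Wald draw in NumPy).
        resid = np.maximum(np.abs(1.0 - Z @ beta), 1e-8)
        lam = 1.0 / rng.wald(1.0 / resid, 1.0)
        # beta | lambda is Gaussian: a weighted least-squares draw with
        # pseudo-responses (1 + lambda_i) and variances lambda_i.
        W = Z / lam[:, None]
        prec = W.T @ Z + np.eye(p) / nu2   # posterior precision
        mean = np.linalg.solve(prec, W.T @ (1.0 + lam))
        chol = np.linalg.cholesky(np.linalg.inv(prec))
        beta = mean + chol @ rng.standard_normal(p)
        draws[t] = beta
    return draws
```

Because each sweep reduces to an inverse-Gaussian draw per observation and one Gaussian draw for the coefficient block, the sampler inherits the machinery of Bayesian Gaussian linear models, which is the point made in the abstract.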

Article information

Source
Bayesian Anal. Volume 6, Number 1 (2011), 1-23.

Dates
First available in Project Euclid: 13 June 2012

https://projecteuclid.org/euclid.ba/1339611936

Digital Object Identifier
doi:10.1214/11-BA601

Mathematical Reviews number (MathSciNet)
MR2781803

Zentralblatt MATH identifier
1330.62259

Citation

Polson, Nicholas G.; Scott, Steven L. Data augmentation for support vector machines. Bayesian Anal. 6 (2011), no. 1, 1--23. doi:10.1214/11-BA601. https://projecteuclid.org/euclid.ba/1339611936.

References

• Andrews, D. F. and Mallows, C. L. (1974). "Scale Mixtures of Normal Distributions." Journal of the Royal Statistical Society, Series B: Methodological, 36: 99–102.
• Carlin, B. P. and Polson, N. G. (1991). "Inference for Nonconjugate Bayesian Models Using the Gibbs Sampler." The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 19: 399–405.
• Cawley, G. C. and Talbot, N. L. C. (2005). "Constructing Bayesian formulations of sparse kernel learning methods." Neural Networks, 18(5-6): 674–683.
• Clyde, M. and George, E. I. (2004). "Model uncertainty." Statistical Science, 19: 81–94.
• Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm (C/R: p22-37)." Journal of the Royal Statistical Society, Series B, Methodological, 39: 1–22.
• Devroye, L. (1986). Non-uniform Random Variate Generation. Springer-Verlag. http://cg.scs.carleton.ca/~luc/rnbookindex.html
• Fan, J. and Li, R. (2001). "Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties." Journal of the American Statistical Association, 96(456): 1348–1360.
• George, E. I. and McCulloch, R. E. (1993). "Variable Selection Via Gibbs Sampling." Journal of the American Statistical Association, 88: 881–889.
• ——— (1997). "Approaches for Bayesian Variable Selection." Statistica Sinica, 7: 339–374.
• Gold, C., Holub, A., and Sollich, P. (2005). "Bayesian approach to feature selection and parameter tuning for support vector machine classifiers." Neural Networks, 18(5-6): 693–701.
• Goldstein, M. and Smith, A. F. M. (1974). "Ridge-type Estimators for Regression Analysis." Journal of the Royal Statistical Society, Series B: Methodological, 36: 284–291.
• Golub, G. H. and van Loan, C. F. (2008). Matrix Computations. Johns Hopkins University Press, third edition.
• Gómez-Sánchez-Manzano, E., Gómez-Villegas, M. A., and Marín, J. M. (2008). "Multivariate exponential power distributions as mixtures of normals with Bayesian applications." Communications in Statistics, 37(6): 972–985.
• Greene, W. H. and Seaks, T. G. (1991). "The restricted least squares estimator: a pedagogical note." The Review of Economics and Statistics, 73(3): 563–567.
• Griffin, J. E. and Brown, P. J. (2005). "Alternative Prior Distributions for Variable Selection with very many more variables than observations." (working paper available on Google scholar).
• Hans, C. (2009). "Bayesian lasso regression." Biometrika, 96(4): 835–845.
• Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, second edition.
• Holmes, C. C. and Held, L. (2006). "Bayesian Auxiliary Variable Models for Binary and Multinomial Regression." Bayesian Analysis, 1(1): 145–168.
• Holmes, C. C. and Pintore, A. (2006). "Bayesian Relaxation: Boosting, the Lasso and other $L^\alpha$-norms." In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M. (eds.), Bayesian Statistics 8, 253–283. Oxford University Press.
• Huang, J., Horowitz, J., and Ma, S. (2008). "Asymptotic properties of Bridge estimators in sparse high-dimensional regression models." The Annals of Statistics, 36: 587–613.
• Ishwaran, H. and Rao, J. S. (2005). "Spike and Slab Gene Selection for multigroup microarray data." Journal of the American Statistical Association, 100: 764–780.
• Johnstone, I. M. and Silverman, B. W. (2004). "Needles and Straws in Haystacks: Empirical Bayes Estimates of Possibly Sparse Sequences." The Annals of Statistics, 32(4): 1594–1649.
• ——— (2005). "Empirical Bayes Selection of Wavelet Thresholds." The Annals of Statistics, 33(4): 1700–1752.
• Liu, C. and Rubin, D. B. (1994). "The ECME Algorithm: A Simple Extension of EM and ECM With Faster Monotone Convergence." Biometrika, 81: 633–648.
• Mallick, B. K., Ghosh, D., and Ghosh, M. (2005). "Bayesian classification of tumours by using gene expression data." Journal of the Royal Statistical Society, Series B, Statistical Methodology, 67(2): 219–234.
• Meng, X.-L. and Rubin, D. B. (1993). "Maximum Likelihood Estimation Via the ECM Algorithm: A General Framework." Biometrika, 80: 267–278.
• Meng, X.-L. and van Dyk, D. A. (1999). "Seeking efficient data augmentation schemes via conditional and marginal augmentation." Biometrika, 86(2): 301–320.
• Mitchell, T. J. and Beauchamp, J. J. (1988). "Bayesian Variable Selection in Linear Regression (C/R: P1033-1036)." Journal of the American Statistical Association, 83: 1023–1032.
• Neal, R. M. (2003). "Slice Sampling." The Annals of Statistics, 31(3): 705–767.
• Pollard, H. (1946). "The representation of $e^{ - x^\lambda }$ as a Laplace integral." Bull. Amer. Math. Soc., 52(10): 908–910.
• Polson, N. G. (1996). "Convergence of Markov Chain Monte Carlo Algorithms." In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M. (eds.), Bayesian Statistics 5 – Proceedings of the Fifth Valencia International Meeting, 297–321. Clarendon Press [Oxford University Press].
• Pontil, M., Mukherjee, S., and Girosi, F. (1998). "On the Noise Model of Support Vector Machine Regression." A.I. Memo, MIT Artificial Intelligence Laboratory, 1651: 1500–1999.
• Sollich, P. (2001). "Bayesian methods for support vector machines: evidence and predictive class probabilities." Machine Learning, 46: 21–52.
• Tibshirani, R. (1996). "Regression Shrinkage and Selection Via the Lasso." Journal of the Royal Statistical Society, Series B: Methodological, 58: 267–288.
• Tipping, M. E. (2001). "Sparse Bayesian learning and the Relevance Vector Machine." Journal of Machine Learning Research, 1: 211–244.
• Tropp, J. A. (2006). "Just relax: Convex programming methods for identifying sparse signals." IEEE Transactions on Information Theory, 55(2): 1039–1051.
• West, M. (1987). "On Scale Mixtures of Normal Distributions." Biometrika, 74: 646–648.
• Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. (2004). "1-norm Support Vector Machines." In Thrun, S., Saul, L. K., and Schölkopf, B. (eds.), Advances in Neural Information Processing Systems 16, 49–56.