Statistics Surveys

Primal and dual model representations in kernel-based learning

Johan A.K. Suykens, Carlos Alzate, and Kristiaan Pelckmans


Abstract

This paper discusses the role of primal and (Lagrange) dual model representations in problems of supervised and unsupervised learning. The estimation problem is specified at the primal level as a constrained optimization problem, where the constraints relate to the model expressed in terms of the feature map. From the conditions for optimality one jointly finds the optimal model representation and the model estimate. At the dual level the model is expressed in terms of a positive definite kernel function, which is characteristic of support vector machine methodology. It is discussed how least squares support vector machines play a central role as core models across problems of regression, classification, principal component analysis, spectral clustering, canonical correlation analysis, dimensionality reduction and data visualization.
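As a concrete illustration of the primal-dual interplay the survey is built around, the sketch below writes out the least squares support vector machine (LS-SVM) regression case; the notation (γ for the regularization constant, φ for the feature map, Ω for the kernel matrix) follows common LS-SVM conventions and is not taken verbatim from this paper.

```latex
% Primal problem (LS-SVM regression): gamma is a regularization constant,
% phi the feature map, e_i the error variables.
\min_{w,b,e}\; \tfrac{1}{2} w^\top w + \tfrac{\gamma}{2}\sum_{i=1}^{N} e_i^2
\quad \text{subject to}\quad y_i = w^\top \varphi(x_i) + b + e_i, \quad i=1,\dots,N.

% Eliminating w and e from the Lagrangian conditions for optimality gives the
% dual problem: a linear system in (b, alpha) expressed only through the
% positive definite kernel K(x_i, x_j) = phi(x_i)^T phi(x_j).
\begin{bmatrix} 0 & \mathbf{1}_N^\top \\ \mathbf{1}_N & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix},
\qquad \Omega_{ij} = K(x_i, x_j),

% and the dual model representation used for prediction:
\hat{y}(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b.
```

The primal representation is parametric in w, which may live in a high- or infinite-dimensional feature space, while the dual representation involves only the N kernel evaluations; it is this mechanism that allows classification, kernel PCA, spectral clustering and the other problems listed above to be treated within one LS-SVM core framework.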

Article information

Source
Statist. Surv. Volume 4 (2010), 148–183.

Dates
First available in Project Euclid: 25 August 2010

Permanent link to this document
https://projecteuclid.org/euclid.ssu/1282746475

Digital Object Identifier
doi:10.1214/09-SS052

Mathematical Reviews number (MathSciNet)
MR2679494

Zentralblatt MATH identifier
06162223

Keywords
Kernel methods, support vector machines, constrained optimization, primal and dual problem, feature map, regression, classification, principal component analysis, spectral clustering, canonical correlation analysis, independence, dimensionality reduction and data visualization, sparseness, robustness

Citation

Suykens, Johan A.K.; Alzate, Carlos; Pelckmans, Kristiaan. Primal and dual model representations in kernel-based learning. Statist. Surv. 4 (2010), 148–183. doi:10.1214/09-SS052. https://projecteuclid.org/euclid.ssu/1282746475


