The Annals of Statistics

Kernel methods in machine learning

Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola


Abstract

We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data.

We cover a wide range of methods, from binary classifiers to sophisticated techniques for estimation with structured data.
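To make the abstract's central idea concrete (estimating a function that lives in an RKHS and is expanded in terms of a kernel evaluated at the training points), the following is a minimal, self-contained sketch of kernel ridge regression with a Gaussian kernel. It is purely illustrative and not taken from the paper; the function names and parameter choices (rbf_kernel, fit_kernel_ridge, gamma, lam) are assumptions of this sketch.

# Illustrative sketch only; not code from the paper.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    # Solve (K + lam * n * I) alpha = y; the fitted f lies in the RKHS
    # spanned by k(x_i, .) and is expanded as f(x) = sum_i alpha_i k(x_i, x).
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    # Evaluate the kernel expansion f(x) = sum_i alpha_i k(x_i, x) at new points.
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy usage: recover a nonlinear function from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y, lam=1e-3, gamma=0.5)
X_test = np.linspace(-3, 3, 5)[:, None]
print(predict(X, alpha, X_test, gamma=0.5))

Even in this toy setting the two points emphasized in the abstract are visible: the estimator is obtained by linear algebra in a linear space of functions, yet the fitted function itself is nonlinear in the inputs.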

Article information

Source
Ann. Statist. Volume 36, Number 3 (2008), 1171-1220.

Dates
First available in Project Euclid: 26 May 2008

Permanent link to this document
https://projecteuclid.org/euclid.aos/1211819561

Digital Object Identifier
doi:10.1214/009053607000000677

Mathematical Reviews number (MathSciNet)
MR2418654

Zentralblatt MATH identifier
1151.30007

Subjects
Primary: 30C40: Kernel functions and applications
Secondary: 68T05: Learning and adaptive systems [See also 68Q32, 91E40]

Keywords
Machine learning, reproducing kernels, support vector machines, graphical models

Citation

Hofmann, Thomas; Schölkopf, Bernhard; Smola, Alexander J. Kernel methods in machine learning. Ann. Statist. 36 (2008), no. 3, 1171--1220. doi:10.1214/009053607000000677. https://projecteuclid.org/euclid.aos/1211819561.



References

  • [1] Aizerman, M. A., Braverman, É. M. and Rozonoér, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25 821–837.
  • [2] Allwein, E. L., Schapire, R. E. and Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. In Proc. 17th International Conf. Machine Learning (P. Langley, ed.) 9–16. Morgan Kaufmann, San Francisco, CA.
  • [3] Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence, and learnability. In Proc. of the 34th Annual Symposium on Foundations of Computer Science 292–301. IEEE Computer Society Press, Los Alamitos, CA.
  • [4] Altun, Y., Hofmann, T. and Smola, A. J. (2004). Gaussian process classification for segmenting and annotating sequences. In Proc. International Conf. Machine Learning 25–32. ACM Press, New York.
  • [5] Altun, Y., Smola, A. J. and Hofmann, T. (2004). Exponential families for conditional random fields. In Uncertainty in Artificial Intelligence (UAI) 2–9. AUAI Press, Arlington, VA.
  • [6] Altun, Y., Tsochantaridis, I. and Hofmann, T. (2003). Hidden Markov support vector machines. In Proc. Intl. Conf. Machine Learning 3–10. AAAI Press, Menlo Park, CA.
  • [7] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.
  • [8] Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48.
  • [9] Bakir, G., Hofmann, T., Schölkopf, B., Smola, A., Taskar, B. and Vishwanathan, S. V. N. (2007). Predicting Structured Data. MIT Press, Cambridge, MA.
  • [10] Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psych. 12 387–415.
  • [11] Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
  • [12] Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482.
  • [13] Basilico, J. and Hofmann, T. (2004). Unifying collaborative and content-based filtering. In Proc. Intl. Conf. Machine Learning 65–72. ACM Press, New York.
  • [14] Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities 3 1–8.
  • [15] Ben-David, S., Eiron, N. and Long, P. (2003). On the difficulty of approximately maximizing agreements. J. Comput. System Sci. 66 496–514.
  • [16] Bennett, K. P., Demiriz, A. and Shawe-Taylor, J. (2000). A column generation algorithm for boosting. In Proc. 17th International Conf. Machine Learning (P. Langley, ed.) 65–72. Morgan Kaufmann, San Francisco, CA.
  • [17] Bennett, K. P. and Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optim. Methods Softw. 1 23–34.
  • [18] Berg, C., Christensen, J. P. R. and Ressel, P. (1984). Harmonic Analysis on Semigroups. Springer, New York.
  • [19] Bertsimas, D. and Tsitsiklis, J. (1997). Introduction to Linear Optimization. Athena Scientific, Nashua, NH.
  • [20] Bloomfield, P. and Steiger, W. (1983). Least Absolute Deviations: Theory, Applications and Algorithms. Birkhäuser, Boston.
  • [21] Bochner, S. (1933). Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Math. Ann. 108 378–410.
  • [22] Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B. and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics (ISMB) 22 e49–e57.
  • [23] Boser, B., Guyon, I. and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proc. Annual Conf. Computational Learning Theory (D. Haussler, ed.) 144–152. ACM Press, Pittsburgh, PA.
  • [24] Bousquet, O., Boucheron, S. and Lugosi, G. (2005). Theory of classification: A survey of recent advances. ESAIM Probab. Statist. 9 323–375.
  • [25] Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 121–167.
  • [26] Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE 86 2009–2026.
  • [27] Chapelle, O. and Harchaoui, Z. (2005). A machine learning approach to conjoint analysis. In Advances in Neural Information Processing Systems 17 (L. K. Saul, Y. Weiss and L. Bottou, eds.) 257–264. MIT Press, Cambridge, MA.
  • [28] Chen, A. and Bickel, P. (2005). Consistent independent component analysis and prewhitening. IEEE Trans. Signal Process. 53 3625–3632.
  • [29] Chen, S., Donoho, D. and Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
  • [30] Collins, M. (2000). Discriminative reranking for natural language parsing. In Proc. 17th International Conf. Machine Learning (P. Langley, ed.) 175–182. Morgan Kaufmann, San Francisco, CA.
  • [31] Collins, M. and Duffy, N. (2001). Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker and Z. Ghahramani, eds.) 625–632. MIT Press, Cambridge, MA.
  • [32] Cook, D., Buja, A. and Cabrera, J. (1993). Projection pursuit indices based on orthonormal function expansions. J. Comput. Graph. Statist. 2 225–250.
  • [33] Cortes, C., Mohri, M. and Weston, J. (2005). A general regression technique for learning transductions. In ICML’05: Proceedings of the 22nd International Conference on Machine Learning 153–160. ACM Press, New York.
  • [34] Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning 20 273–297.
  • [35] Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2 265–292.
  • [36] Crammer, K. and Singer, Y. (2005). Loss bounds for online category ranking. In Proc. Annual Conf. Computational Learning Theory (P. Auer and R. Meir, eds.) 48–62. Springer, Berlin.
  • [37] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
  • [38] Cristianini, N., Shawe-Taylor, J., Elisseeff, A. and Kandola, J. (2002). On kernel-target alignment. In Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker and Z. Ghahramani, eds.) 367–373. MIT Press, Cambridge, MA.
  • [39] Culotta, A., Kulp, D. and McCallum, A. (2005). Gene prediction with conditional random fields. Technical Report UM-CS-2005-028, Univ. Massachusetts, Amherst.
  • [40] Darroch, J. N. and Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. Ann. Math. Statist. 43 1470–1480.
  • [41] Das, D. and Sen, P. (1994). Restricted canonical correlations. Linear Algebra Appl. 210 29–47.
  • [42] Dauxois, J. and Nkiet, G. M. (1998). Nonlinear canonical analysis and independence tests. Ann. Statist. 26 1254–1278.
  • [43] Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Stat. Comput. 2 25–36.
  • [44] DeCoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning 46 161–190.
  • [45] Dekel, O., Manning, C. and Singer, Y. (2004). Log-linear models for label ranking. In Advances in Neural Information Processing Systems 16 (S. Thrun, L. Saul and B. Schölkopf, eds.) 497–504. MIT Press, Cambridge, MA.
  • [46] Della Pietra, S., Della Pietra, V. and Lafferty, J. (1997). Inducing features of random fields. IEEE Trans. Pattern Anal. Machine Intelligence 19 380–393.
  • [47] Einmahl, J. H. J. and Mason, D. M. (1992). Generalized quantile processes. Ann. Statist. 20 1062–1078.
  • [48] Elisseeff, A. and Weston, J. (2001). A kernel method for multi-labeled classification. In Advances in Neural Information Processing Systems 14 681–687. MIT Press, Cambridge, MA.
  • [49] Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J. 23 298–305.
  • [50] FitzGerald, C. H., Micchelli, C. A. and Pinkus, A. (1995). Functions that preserve families of positive semidefinite matrices. Linear Algebra Appl. 221 83–102.
  • [51] Fletcher, R. (1989). Practical Methods of Optimization. Wiley, New York.
  • [52] Fortet, R. and Mourier, E. (1953). Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup. 70 266–285.
  • [53] Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning 148–156. Morgan Kaufmann, San Francisco, CA.
  • [54] Friedman, J. H. (1987). Exploratory projection pursuit. J. Amer. Statist. Assoc. 82 249–266.
  • [55] Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C-23 881–890.
  • [56] Gärtner, T. (2003). A survey of kernels for structured data. SIGKDD Explorations 5 49–58.
  • [57] Green, P. and Yandell, B. (1985). Semi-parametric generalized linear models. Proceedings 2nd International GLIM Conference. Lecture Notes in Statist. 32 44–55. Springer, New York.
  • [58] Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert–Schmidt norms. In Proceedings Algorithmic Learning Theory (S. Jain, H. U. Simon and E. Tomita, eds.) 63–77. Springer, Berlin.
  • [59] Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B. and Logothetis, N. (2005). Kernel constrained covariance for dependence measurement. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (R. G. Cowell and Z. Ghahramani, eds.) 112–119. Society for Artificial Intelligence and Statistics, New Jersey.
  • [60] Ham, J., Lee, D., Mika, S. and Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. In Proceedings of the Twenty-First International Conference on Machine Learning 369–376. ACM Press, New York.
  • [61] Hammersley, J. M. and Clifford, P. E. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
  • [62] Haussler, D. (1999). Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Dept., UC Santa Cruz.
  • [63] Hein, M., Bousquet, O. and Schölkopf, B. (2005). Maximal margin classification for metric spaces. J. Comput. System Sci. 71 333–359.
  • [64] Herbrich, R. (2002). Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA.
  • [65] Herbrich, R., Graepel, T. and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers (A. J. Smola, P. L. Bartlett, B. Schölkopf and D. Schuurmans, eds.) 115–132. MIT Press, Cambridge, MA.
  • [66] Hettich, R. and Kortanek, K. O. (1993). Semi-infinite programming: Theory, methods, and applications. SIAM Rev. 35 380–429.
  • [67] Hilbert, D. (1904). Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. Nachr. Akad. Wiss. Göttingen Math.-Phys. Kl. II 49–91.
  • [68] Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 55–67.
  • [69] Hofmann, T., Schölkopf, B. and Smola, A. J. (2006). A review of kernel methods in machine learning. Technical Report 156, Max-Planck-Institut für biologische Kybernetik.
  • [70] Hotelling, H. (1936). Relations between two sets of variates. Biometrika 28 321–377.
  • [71] Huber, P. J. (1981). Robust Statistics. Wiley, New York.
  • [72] Huber, P. J. (1985). Projection pursuit. Ann. Statist. 13 435–475.
  • [73] Hyvärinen, A., Karhunen, J. and Oja, E. (2001). Independent Component Analysis. Wiley, New York.
  • [74] Jaakkola, T. S. and Haussler, D. (1999). Probabilistic kernel regression models. In Proceedings of the 7th International Workshop on AI and Statistics. Morgan Kaufmann, San Francisco, CA.
  • [75] Jebara, T. and Kondor, I. (2003). Bhattacharyya and expected likelihood kernels. Proceedings of the Sixteenth Annual Conference on Computational Learning Theory (B. Schölkopf and M. Warmuth, eds.) 57–71. Lecture Notes in Comput. Sci. 2777. Springer, Heidelberg.
  • [76] Jensen, F. V., Lauritzen, S. L. and Olesen, K. G. (1990). Bayesian updates in causal probabilistic networks by local computation. Comput. Statist. Quarterly 4 269–282.
  • [77] Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer Academic, Boston.
  • [78] Joachims, T. (2005). A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning 377–384. Morgan Kaufmann, San Francisco, CA.
  • [79] Jones, M. C. and Sibson, R. (1987). What is projection pursuit? J. Roy. Statist. Soc. Ser. A 150 1–36.
  • [80] Jordan, M. I., Bartlett, P. L. and McAuliffe, J. D. (2003). Convexity, classification, and risk bounds. Technical Report 638, Univ. California, Berkeley.
  • [81] Karush, W. (1939). Minima of functions of several variables with inequalities as side constraints. Master’s thesis, Dept. Mathematics, Univ. Chicago.
  • [82] Kashima, H., Tsuda, K. and Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proc. Intl. Conf. Machine Learning 321–328. Morgan Kaufmann, San Francisco, CA.
  • [83] Kettenring, J. R. (1971). Canonical analysis of several sets of variables. Biometrika 58 433–451.
  • [84] Kim, K., Franz, M. O. and Schölkopf, B. (2005). Iterative kernel principal component analysis for image modeling. IEEE Trans. Pattern Analysis and Machine Intelligence 27 1351–1366.
  • [85] Kimeldorf, G. S. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82–95.
  • [86] Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914.
  • [87] Kondor, I. R. and Lafferty, J. D. (2002). Diffusion kernels on graphs and other discrete structures. In Proc. International Conf. Machine Learning 315–322. Morgan Kaufmann, San Francisco, CA.
  • [88] Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. Proc. 2nd Berkeley Symposium on Mathematical Statistics and Probabilistics 481–492. Univ. California Press, Berkeley.
  • [89] Lafferty, J., Zhu, X. and Liu, Y. (2004). Kernel conditional random fields: Representation and clique selection. In Proc. International Conf. Machine Learning 21 64. Morgan Kaufmann, San Francisco, CA.
  • [90] Lafferty, J. D., McCallum, A. and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. International Conf. Machine Learning 18 282–289. Morgan Kaufmann, San Francisco, CA.
  • [91] Lee, T.-W., Girolami, M., Bell, A. and Sejnowski, T. (2000). A unifying framework for independent component analysis. Comput. Math. Appl. 39 1–21.
  • [92] Leslie, C., Eskin, E. and Noble, W. S. (2002). The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing 564–575. World Scientific Publishing, Singapore.
  • [93] Loève, M. (1978). Probability Theory II, 4th ed. Springer, New York.
  • [94] Magerman, D. M. (1996). Learning grammatical structure using statistical decision-trees. Proceedings ICGI. Lecture Notes in Artificial Intelligence 1147 1–21. Springer, Berlin.
  • [95] Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Oper. Res. 13 444–452.
  • [96] McCallum, A., Bellare, K. and Pereira, F. (2005). A conditional random field for discriminatively-trained finite-state string edit distance. In Conference on Uncertainty in AI (UAI) 388. AUAI Press, Arlington, VA.
  • [97] McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models. Chapman and Hall, London.
  • [98] Mendelson, S. (2003). A few notes on statistical learning theory. Advanced Lectures on Machine Learning (S. Mendelson and A. J. Smola, eds.). Lecture Notes in Artificial Intelligence 2600 1–40. Springer, Heidelberg.
  • [99] Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. R. Soc. Lond. Ser. A 209 415–446.
  • [100] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A. J. and Müller, K.-R. (2003). Learning discriminative and invariant nonlinear features. IEEE Trans. Pattern Analysis and Machine Intelligence 25 623–628.
  • [101] Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.
  • [102] Morozov, V. A. (1984). Methods for Solving Incorrectly Posed Problems. Springer, New York.
  • [103] Murray, M. K. and Rice, J. W. (1993). Differential Geometry and Statistics. Chapman and Hall, London.
  • [104] Oliver, N., Schölkopf, B. and Smola, A. J. (2000). Natural regularization in SVMs. In Advances in Large Margin Classifiers (A. J. Smola, P. L. Bartlett, B. Schölkopf and D. Schuurmans, eds.) 51–60. MIT Press, Cambridge, MA.
  • [105] O’Sullivan, F., Yandell, B. and Raynor, W. (1986). Automatic smoothing of regression functions in generalized linear models. J. Amer. Statist. Assoc. 81 96–103.
  • [106] Parzen, E. (1970). Statistical inference on time series by RKHS methods. In Proceedings 12th Biennial Seminar (R. Pyke, ed.) 1–37. Canadian Mathematical Congress, Montreal.
  • [107] Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods—Support Vector Learning (B. Schölkopf, C. J. C. Burges and A. J. Smola, eds.) 185–208. MIT Press, Cambridge, MA.
  • [108] Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics 19 201–209.
  • [109] Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE 78 1481–1497.
  • [110] Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (1994). Numerical Recipes in C. The Art of Scientific Computing. Cambridge Univ. Press.
  • [111] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
  • [112] Rätsch, G., Sonnenburg, S., Srinivasan, J., Witte, H., Müller, K.-R., Sommer, R. J. and Schölkopf, B. (2007). Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Computational Biology 3 e20 doi:10.1371/journal.pcbi.0030020.
  • [113] Rényi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hungar. 10 441–451.
  • [114] Rockafellar, R. T. (1970). Convex Analysis. Princeton Univ. Press.
  • [115] Schoenberg, I. J. (1938). Metric spaces and completely monotone functions. Ann. Math. 39 811–841.
  • [116] Schölkopf, B. (1997). Support Vector Learning. R. Oldenbourg Verlag, Munich. Available at http://www.kernel-machines.org.
  • [117] Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J. and Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Comput. 13 1443–1471.
  • [118] Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
  • [119] Schölkopf, B., Smola, A. J. and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 1299–1319.
  • [120] Schölkopf, B., Smola, A. J., Williamson, R. C. and Bartlett, P. L. (2000). New support vector algorithms. Neural Comput. 12 1207–1245.
  • [121] Schölkopf, B., Tsuda, K. and Vert, J.-P. (2004). Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
  • [122] Sha, F. and Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL 213–220. Association for Computational Linguistics, Edmonton, Canada.
  • [123] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press.
  • [124] Smola, A. J., Bartlett, P. L., Schölkopf, B. and Schuurmans, D. (2000). Advances in Large Margin Classifiers. MIT Press, Cambridge, MA.
  • [125] Smola, A. J. and Kondor, I. R. (2003). Kernels and regularization on graphs. Proc. Annual Conf. Computational Learning Theory (B. Schölkopf and M. K. Warmuth, eds.). Lecture Notes in Comput. Sci. 2726 144–158. Springer, Heidelberg.
  • [126] Smola, A. J. and Schölkopf, B. (1998). On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica 22 211–231.
  • [127] Smola, A. J., Schölkopf, B. and Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks 11 637–649.
  • [128] Steinwart, I. (2002). On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2 67–93.
  • [129] Steinwart, I. (2002). Support vector machines are universally consistent. J. Complexity 18 768–791.
  • [130] Stewart, J. (1976). Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math. 6 409–434.
  • [131] Stitson, M., Gammerman, A., Vapnik, V., Vovk, V., Watkins, C. and Weston, J. (1999). Support vector regression with ANOVA decomposition kernels. In Advances in Kernel Methods—Support Vector Learning (B. Schölkopf, C. J. C. Burges and A. J. Smola, eds.) 285–292. MIT Press, Cambridge, MA.
  • [132] Taskar, B., Guestrin, C. and Koller, D. (2004). Max-margin Markov networks. In Advances in Neural Information Processing Systems 16 (S. Thrun, L. Saul and B. Schölkopf, eds.) 25–32. MIT Press, Cambridge, MA.
  • [133] Taskar, B., Klein, D., Collins, M., Koller, D. and Manning, C. (2004). Max-margin parsing. In Empirical Methods in Natural Language Processing 1–8. Association for Computational Linguistics, Barcelona, Spain.
  • [134] Tax, D. M. J. and Duin, R. P. W. (1999). Data domain description by support vectors. In Proceedings ESANN (M. Verleysen, ed.) 251–256. D Facto, Brussels.
  • [135] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288.
  • [136] Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4 1035–1038.
  • [137] Tsochantaridis, I., Joachims, T., Hofmann, T. and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6 1453–1484.
  • [138] van Rijsbergen, C. (1979). Information Retrieval, 2nd ed. Butterworths, London.
  • [139] Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer, Berlin.
  • [140] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
  • [141] Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
  • [142] Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–281.
  • [143] Vapnik, V. and Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis 1 283–305.
  • [144] Vapnik, V., Golowich, S. and Smola, A. J. (1997). Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9 (M. C. Mozer, M. I. Jordan and T. Petsche, eds.) 281–287. MIT Press, Cambridge, MA.
  • [145] Vapnik, V. and Lerner, A. (1963). Pattern recognition using generalized portrait method. Autom. Remote Control 24 774–780.
  • [146] Vishwanathan, S. V. N. and Smola, A. J. (2004). Fast kernels for string and tree matching. In Kernel Methods in Computational Biology (B. Schölkopf, K. Tsuda and J. P. Vert, eds.) 113–130. MIT Press, Cambridge, MA.
  • [147] Vishwanathan, S. V. N., Smola, A. J. and Vidal, R. (2007). Binet–Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. Internat. J. Computer Vision 73 95–119.
  • [148] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
  • [149] Wahba, G., Wang, Y., Gu, C., Klein, R. and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Ann. Statist. 23 1865–1895.
  • [150] Wainwright, M. J. and Jordan, M. I. (2003). Graphical models, exponential families, and variational inference. Technical Report 649, Dept. Statistics, Univ. California, Berkeley.
  • [151] Watkins, C. (2000). Dynamic alignment kernels. In Advances in Large Margin Classifiers (A. J. Smola, P. L. Bartlett, B. Schölkopf and D. Schuurmans, eds.) 39–50. MIT Press, Cambridge, MA.
  • [152] Wendland, H. (2005). Scattered Data Approximation. Cambridge Univ. Press.
  • [153] Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B. and Vapnik, V. (2003). Kernel dependency estimation. In Advances in Neural Information Processing Systems 15 (S. T. S. Becker and K. Obermayer, eds.) 873–880. MIT Press, Cambridge, MA.
  • [154] Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, New York.
  • [155] Yang, H. H. and Amari, S.-I. (1997). Adaptive on-line learning algorithms for blind separation—maximum entropy and minimum mutual information. Neural Comput. 9 1457–1482.
  • [156] Zettlemoyer, L. S. and Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence UAI 658–666. AUAI Press, Arlington, Virginia.
  • [157] Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T. and Müller, K.-R. (2000). Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16 799–807.