The Annals of Statistics

Kernel methods in machine learning

Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola

Source: Ann. Statist. Volume 36, Number 3 (2008), 1171-1220.

Abstract

We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data.

We cover a wide range of methods, from binary classifiers to sophisticated methods for estimation with structured data.

Primary Subjects: 30C40
Secondary Subjects: 68T05
Keywords: Machine learning; reproducing kernels; support vector machines; graphical models

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1211819561
Digital Object Identifier: doi:10.1214/009053607000000677
Zentralblatt MATH identifier: 1151.30007

2009 © Institute of Mathematical Statistics