Annals of Statistics

Sparsity in multiple kernel learning

Vladimir Koltchinskii and Ming Yuan

Full-text: Open access


The problem of multiple kernel learning based on penalized empirical risk minimization is discussed. The complexity penalty is determined jointly by the empirical L2 norms and the reproducing kernel Hilbert space (RKHS) norms induced by the kernels, with a data-driven choice of regularization parameters. The main focus is on the case where the total number of kernels is large, but only a relatively small number of them are needed to represent the target function, so that the problem is sparse. The goal is to establish oracle inequalities for the excess risk of the resulting prediction rule, showing that the method is adaptive both to the unknown design distribution and to the sparsity of the problem.
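Schematically, the estimator described in the abstract can be written as follows. This is a sketch only: the paper specifies the exact penalty, the data-driven regularization parameters, and all constants. With candidate kernels inducing RKHSs H_1, ..., H_N, an additive prediction rule f = f_1 + ... + f_N with f_j in H_j is fit by minimizing the empirical risk plus a penalty combining, for each component, its empirical L2 norm and its RKHS norm:

```latex
\hat f \;=\; \operatorname*{argmin}_{\substack{f = f_1 + \dots + f_N \\ f_j \in \mathcal H_j}}
\left\{ \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(Y_i, f(X_i)\bigr)
\;+\; \sum_{j=1}^{N} \Bigl( \varepsilon \,\|f_j\|_{L_2(\Pi_n)}
\;+\; \varepsilon^{2}\, \|f_j\|_{\mathcal H_j} \Bigr) \right\},
\qquad \varepsilon \sim \sqrt{\frac{\log N}{n}},
```

where \(\|\cdot\|_{L_2(\Pi_n)}\) denotes the empirical L2 norm and \(\|\cdot\|_{\mathcal H_j}\) the RKHS norm of the j-th component. The interplay of the two norms in the penalty is what yields adaptivity to both the unknown design distribution and the sparsity of the problem; the precise choice of regularization parameters is given in the paper.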

Article information

Ann. Statist., Volume 38, Number 6 (2010), 3660-3695.

First available in Project Euclid: 30 November 2010


Primary: 62G08: Nonparametric regression 62F12: Asymptotic properties of estimators
Secondary: 62J07: Ridge regression; shrinkage estimators

High dimensionality; multiple kernel learning; oracle inequality; reproducing kernel Hilbert spaces; restricted isometry; sparsity


Koltchinskii, Vladimir; Yuan, Ming. Sparsity in multiple kernel learning. Ann. Statist. 38 (2010), no. 6, 3660--3695. doi:10.1214/10-AOS825.


