## Electronic Journal of Statistics

### Fast learning rate of non-sparse multiple kernel learning and optimal regularization strategies

Taiji Suzuki

#### Abstract

In this paper, we give a new generalization error bound for Multiple Kernel Learning (MKL) under a general class of regularizations, and discuss what kind of regularization gives favorable predictive accuracy. Our main target is dense-type regularization, including $\ell_{p}$-MKL. Numerical experiments are known to show that sparse regularization does not necessarily perform well compared with dense-type regularizations. Motivated by this fact, this paper gives a general theoretical tool for deriving fast learning rates of MKL that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. This enables us to compare the generalization performance of various types of regularization. As a consequence, we observe that the homogeneity of the complexities of the candidate reproducing kernel Hilbert spaces (RKHSs) affects which regularization strategy ($\ell_{1}$ or dense) is preferred. In fact, in homogeneous complexity settings, where the complexities of all RKHSs are the same, $\ell_{1}$-regularization is optimal among all isotropic norms. On the other hand, in inhomogeneous complexity settings, dense-type regularizations can achieve a better learning rate than sparse $\ell_{1}$-regularization. We also show that our learning rate achieves the minimax lower bound in homogeneous complexity settings.
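
For orientation, the $\ell_{p}$ mixed-norm regularization mentioned above is commonly written as follows in the MKL literature. This is only a sketch under assumed notation that does not appear in the abstract: $M$ candidate RKHSs $\mathcal{H}_{1},\dots,\mathcal{H}_{M}$, samples $(x_{i},y_{i})_{i=1}^{n}$, squared loss, and a regularization parameter $\lambda>0$. The estimator analyzed in the paper may differ in details such as the exponent on the penalty.

$$
\hat{f} = \operatorname*{arg\,min}_{f_{m}\in\mathcal{H}_{m}\,(m=1,\dots,M)} \; \frac{1}{n}\sum_{i=1}^{n}\Big(y_{i}-\sum_{m=1}^{M}f_{m}(x_{i})\Big)^{2} + \lambda\Big(\sum_{m=1}^{M}\|f_{m}\|_{\mathcal{H}_{m}}^{p}\Big)^{2/p}, \qquad p\ge 1.
$$

In this form, $p=1$ gives the sparse $\ell_{1}$ (group-lasso-type) penalty, while $p>1$ gives dense penalties. For $p=2$ the penalty reduces to $\lambda\sum_{m=1}^{M}\|f_{m}\|_{\mathcal{H}_{m}}^{2}$, which corresponds to kernel ridge regression with the unweighted sum of the $M$ kernels.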

#### Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 2141-2192.

Dates
First available in Project Euclid: 13 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1531468825

Digital Object Identifier
doi:10.1214/18-EJS1399

Mathematical Reviews number (MathSciNet)
MR3827817

Zentralblatt MATH identifier
06917472

#### Citation

Suzuki, Taiji. Fast learning rate of non-sparse multiple kernel learning and optimal regularization strategies. Electron. J. Statist. 12 (2018), no. 2, 2141–2192. doi:10.1214/18-EJS1399. https://projecteuclid.org/euclid.ejs/1531468825
