## The Annals of Statistics

### Bandwidth selection in kernel empirical risk minimization via the gradient

#### Abstract

In this paper, we deal with the data-driven selection of multidimensional and possibly anisotropic bandwidths in the general framework of kernel empirical risk minimization. We propose a universal selection rule, which leads to optimal adaptive results in a large variety of statistical models such as nonparametric robust regression and statistical learning with errors in variables. These results are stated in the context of smooth loss functions, where the gradient of the risk appears as a good criterion to measure the performance of our estimators. The selection rule consists of a comparison of gradient empirical risks. It can be viewed as a nontrivial extension of the so-called Goldenshluger–Lepski method to nonlinear estimators. Furthermore, a main advantage of our selection rule is that it does not depend on the Hessian matrix of the risk, which standard adaptive procedures usually involve.
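For orientation only, the following is a minimal sketch of the classical Goldenshluger–Lepski comparison in its simplest linear setting: pointwise kernel density estimation with a Gaussian kernel. It is not the gradient-based rule of the paper. The function names, the bandwidth grid, the constant `c` and the majorant of order $\sqrt{\log n/(nh)}$ are all illustrative assumptions.

```python
import math
import random


def gaussian_kde_at(x0, sample, h):
    """Gaussian kernel density estimate at x0 with bandwidth h."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in sample)


def majorant(h, n, c=1.0):
    """Heuristic variance proxy of order sqrt(log n / (n h)).

    The constant c is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    return c * math.sqrt(math.log(n) / (n * h))


def gl_select(x0, sample, bandwidths, c=1.0):
    """Pick a bandwidth by a simplified Goldenshluger-Lepski comparison."""
    n = len(sample)
    est = {h: gaussian_kde_at(x0, sample, h) for h in bandwidths}
    best_h, best_crit = None, math.inf
    for h in bandwidths:
        bias_proxy = 0.0  # starting at 0 implements the positive part
        for hp in bandwidths:
            # For Gaussian kernels, K_h * K_hp is Gaussian with
            # bandwidth sqrt(h^2 + hp^2), so the auxiliary estimator
            # is an ordinary estimate at that bandwidth.
            aux = gaussian_kde_at(x0, sample, math.hypot(h, hp))
            bias_proxy = max(bias_proxy,
                             abs(aux - est[hp]) - majorant(hp, n, c))
        crit = bias_proxy + majorant(h, n, c)
        if crit < best_crit:
            best_h, best_crit = h, crit
    return best_h


# Usage on simulated standard normal data (seed fixed for reproducibility).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]
grid = [0.05 * 2 ** k for k in range(6)]
h_hat = gl_select(0.0, sample, grid)
print("selected bandwidth:", h_hat)
```

The sketch compares each estimator with auxiliary estimators built from pairs of bandwidths and balances the resulting bias proxy against a variance majorant. The rule described in the abstract performs a comparison of this type on gradient empirical risks rather than on the estimators themselves.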

#### Article information

Source
Ann. Statist., Volume 43, Number 4 (2015), 1617–1646.

Dates
Revised: January 2015
First available in Project Euclid: 17 June 2015

Permanent link to this document
https://projecteuclid.org/euclid.aos/1434546217

Digital Object Identifier
doi:10.1214/15-AOS1318

Mathematical Reviews number (MathSciNet)
MR3357873

Zentralblatt MATH identifier
1317.62026

#### Citation

Chichignoud, Michaël; Loustau, Sébastien. Bandwidth selection in kernel empirical risk minimization via the gradient. Ann. Statist. 43 (2015), no. 4, 1617–1646. doi:10.1214/15-AOS1318. https://projecteuclid.org/euclid.aos/1434546217
