## Bernoulli

• Volume 22, Number 3 (2016), 1839-1893.

### On the optimal estimation of probability measures in weak and strong topologies

Bharath Sriperumbudur

#### Abstract

Given random samples drawn i.i.d. from a probability measure $\mathbb{P}$ (defined on, say, $\mathbb{R}^{d}$), it is well known that the empirical estimator is an optimal estimator of $\mathbb{P}$ in the weak topology but not even a consistent estimator of its density (if it exists) in the strong topology (induced by the total variation distance). On the other hand, various popular density estimators such as kernel and wavelet density estimators are optimal in the strong topology in the sense of achieving the minimax rate over all estimators for a Sobolev ball of densities. Recently, it has been shown in a series of papers by Giné and Nickl that these density estimators on $\mathbb{R}$ that are optimal in the strong topology are also optimal in $\Vert\cdot\Vert_{\mathcal{F} }$ for certain choices of $\mathcal{F}$ such that $\Vert\cdot\Vert_{\mathcal{F} }$ metrizes the weak topology, where $\Vert\mathbb{P} \Vert_{\mathcal{F} }:=\sup\{\int f\,\mathrm{d}\mathbb{P} \colon\ f\in\mathcal{F} \}$. In this paper, we investigate this problem of optimal estimation in weak and strong topologies by choosing $\mathcal{F}$ to be a unit ball in a reproducing kernel Hilbert space (say $\mathcal{F}_{H}$ defined over $\mathbb{R}^{d}$), a choice that is of both theoretical and computational interest. Under some mild conditions on the reproducing kernel, we show that $\Vert\cdot\Vert_{\mathcal{F}_{H}}$ metrizes the weak topology and that the kernel density estimator (with $L^{1}$ optimal bandwidth) estimates $\mathbb{P}$ at the dimension-independent optimal rate of $n^{-1/2}$ in $\Vert\cdot\Vert_{\mathcal{F}_{H}}$. We also provide a uniform central limit theorem for the kernel density estimator.
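To illustrate why the RKHS unit ball is computationally convenient, the sketch below evaluates $\Vert\mathbb{P}_{n}-\mathbb{Q}_{m}\Vert_{\mathcal{F}_{H}}$ between two empirical measures. It relies on the standard identity that, for a bounded measurable kernel $k$, $\Vert\mathbb{P}-\mathbb{Q}\Vert_{\mathcal{F}_{H}}^{2}=\mathbb{E}k(X,X')+\mathbb{E}k(Y,Y')-2\,\mathbb{E}k(X,Y)$, so the supremum reduces to averages of kernel evaluations. The Gaussian kernel, the bandwidth `sigma`, the sample sizes, and the function names are assumptions made for this illustration only; the abstract does not specify them, and this is the plain empirical plug-in, not the kernel density estimator studied in the paper.

```python
import numpy as np


def gaussian_gram(x, y, sigma=1.0):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2)).

    The Gaussian kernel is an illustrative choice, not taken from the paper.
    """
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def rkhs_distance(x, y, sigma=1.0):
    """Plug-in estimate of ||P - Q||_{F_H} from samples x ~ P and y ~ Q."""
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return float(np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0)))


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 3))        # sample from P
y_same = rng.normal(0.0, 1.0, size=(500, 3))   # second sample from P
y_shift = rng.normal(1.0, 1.0, size=(500, 3))  # sample from a shifted law

d_same = rkhs_distance(x, y_same)
d_shift = rkhs_distance(x, y_shift)
print(f"same law: {d_same:.4f}, shifted law: {d_shift:.4f}")
```

In this sketch the distance between two samples from the same law should be small and shrink as the sample size grows, while the distance to the shifted sample stays bounded away from zero. The cost is quadratic in the sample size because the full Gram matrices are formed.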

#### Article information

Source
Bernoulli, Volume 22, Number 3 (2016), 1839-1893.

Dates
Revised: February 2015
First available in Project Euclid: 16 March 2016

Permanent link to this document
https://projecteuclid.org/euclid.bj/1458133001

Digital Object Identifier
doi:10.3150/15-BEJ713

Mathematical Reviews number (MathSciNet)
MR3474835

Zentralblatt MATH identifier
1360.62163

#### Citation

Sriperumbudur, Bharath. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli 22 (2016), no. 3, 1839–1893. doi:10.3150/15-BEJ713. https://projecteuclid.org/euclid.bj/1458133001

#### References

• [1] Anthony, M. and Bartlett, P.L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge: Cambridge Univ. Press.
• [2] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.
• [3] Bartlett, P.L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
• [4] Berg, C., Christensen, J.P.R. and Ressel, P. (1984). Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Graduate Texts in Mathematics 100. New York: Springer.
• [5] Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Boston, MA: Kluwer Academic.
• [6] Bickel, P.J. and Ritov, Y. (2003). Nonparametric estimators which can be “plugged-in”. Ann. Statist. 31 1033–1053.
• [7] de la Peña, V.H. and Giné, E. (1999). Decoupling: From Dependence to Independence. Probability and Its Applications (New York). New York: Springer.
• [8] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The $L_{1}$ View. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. New York: Wiley.
• [9] Diestel, J. and Uhl, J.J. Jr. (1977). Vector Measures. Providence, RI: Amer. Math. Soc.
• [10] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63. Cambridge: Cambridge Univ. Press.
• [11] Dudley, R.M. (2002). Real Analysis and Probability. Cambridge Studies in Advanced Mathematics 74. Cambridge: Cambridge Univ. Press.
• [12] Folland, G.B. (1999). Real Analysis: Modern Techniques and Their Applications. New York: Wiley.
• [13] Fukumizu, K., Gretton, A., Sun, X. and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20 (J.C. Platt, D. Koller, Y. Singer and S. Roweis, eds.) 489–496. Cambridge, MA: MIT Press.
• [14] Fukumizu, K., Sriperumbudur, B.K., Gretton, A. and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21 (D. Koller, D. Schuurmans, Y. Bengio and L. Bottou, eds.) 473–480. Cambridge, MA: MIT Press.
• [15] Giné, E. and Nickl, R. (2008). Uniform central limit theorems for kernel density estimators. Probab. Theory Related Fields 141 333–387.
• [16] Giné, E. and Nickl, R. (2008). Adaptation on the space of finite signed measures. Math. Methods Statist. 17 113–122.
• [17] Giné, E. and Nickl, R. (2009). Uniform limit theorems for wavelet density estimators. Ann. Probab. 37 1605–1646.
• [18] Giné, E. and Nickl, R. (2009). An exponential inequality for the distribution function of the kernel density estimator, with applications to adaptive estimation. Probab. Theory Related Fields 143 569–596.
• [19] Giné, E. and Nickl, R. (2010). Adaptive estimation of a distribution function and its density in sup-norm loss by wavelet and spline projections. Bernoulli 16 1137–1163.
• [20] Giné, E. and Zinn, J. (1986). Empirical processes indexed by Lipschitz functions. Ann. Probab. 14 1329–1338.
• [21] Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B. and Smola, A. (2007). A kernel method for the two sample problem. In Advances in Neural Information Processing Systems 19 (B. Schölkopf, J. Platt and T. Hoffman, eds.) 513–520. Cambridge, MA: MIT Press.
• [22] Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation, and Statistical Applications. Lecture Notes in Statistics 129. New York: Springer.
• [23] Lepski, O.V., Mammen, E. and Spokoiny, V.G. (1997). Optimal spatial adaptation to inhomogeneous smoothness: An approach based on kernel estimates with variable bandwidth selectors. Ann. Statist. 25 929–947.
• [24] Marcus, D.J. (1985). Relationships between Donsker classes and Sobolev spaces. Z. Wahrsch. Verw. Gebiete 69 323–330.
• [25] Mendelson, S. (2002). Rademacher averages and phase transitions in Glivenko–Cantelli classes. IEEE Trans. Inform. Theory 48 251–263.
• [26] Nickl, R. (2007). Donsker-type theorems for nonparametric maximum likelihood estimators. Probab. Theory Related Fields 138 411–449.
• [27] Radulović, D. and Wegkamp, M. (2000). Weak convergence of smoothed empirical processes: Beyond Donsker classes. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 89–105. Boston, MA: Birkhäuser.
• [28] Rudin, W. (1991). Functional Analysis, 2nd ed. International Series in Pure and Applied Mathematics. New York: McGraw-Hill.
• [29] Srebro, N., Sridharan, K. and Tewari, A. (2010). Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel and A. Culotta, eds.) 2199–2207. MIT Press.
• [30] Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Lanckriet, G.R.G. and Schölkopf, B. (2009). Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams and A. Culotta, eds.) 1750–1758. Cambridge, MA: MIT Press.
• [31] Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B. and Lanckriet, G.R.G. (2012). On the empirical estimation of integral probability metrics. Electron. J. Stat. 6 1550–1599.
• [32] Sriperumbudur, B.K., Fukumizu, K. and Lanckriet, G. (2011). Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint. In Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira and K.Q. Weinberger, eds.) 1773–1781. Cambridge, MA: MIT Press.
• [33] Sriperumbudur, B.K., Fukumizu, K. and Lanckriet, G.R.G. (2011). Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12 2389–2410.
• [34] Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B. and Lanckriet, G.R.G. (2010). Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11 1517–1561.
• [35] Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Information Science and Statistics. New York: Springer.
• [36] van der Vaart, A. (1994). Weak convergence of smoothed empirical processes. Scand. J. Stat. 21 501–504.
• [37] van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge: Cambridge Univ. Press.
• [38] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics. New York: Springer.
• [39] Wendland, H. (2005). Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics 17. Cambridge: Cambridge Univ. Press.
• [40] Ying, Y. and Campbell, C. (2010). Rademacher chaos complexities for learning the kernel problem. Neural Comput. 22 2858–2886.
• [41] Yukich, J.E. (1992). Weak convergence of smoothed empirical processes. Scand. J. Stat. 19 271–279.