## Electronic Journal of Statistics

### On the empirical estimation of integral probability metrics

#### Abstract

Given two probability measures, $\mathbb{P}$ and $\mathbb{Q}$ defined on a measurable space, $S$, the integral probability metric (IPM) is defined as $$\gamma_{\EuScript{F}}(\mathbb{P},\mathbb{Q})=\sup\left\{\left\vert \int_{S}f\,d\mathbb{P}-\int_{S}f\,d\mathbb{Q}\right\vert\,:\,f\in\EuScript{F}\right\},$$ where $\EuScript{F}$ is a class of real-valued bounded measurable functions on $S$. By appropriately choosing $\EuScript{F}$, various popular distances between $\mathbb{P}$ and $\mathbb{Q}$, including the Kantorovich metric, Fortet-Mourier metric, dual-bounded Lipschitz distance (also called the Dudley metric), total variation distance, and kernel distance, can be obtained.

In this paper, we consider the problem of estimating $\gamma_{\EuScript{F}}$ from finite random samples drawn i.i.d. from $\mathbb{P}$ and $\mathbb{Q}$. Although the above mentioned distances cannot be computed in closed form for every $\mathbb{P}$ and $\mathbb{Q}$, we show their empirical estimators to be easily computable, and strongly consistent (except for the total-variation distance). We further analyze their rates of convergence. Based on these results, we discuss the advantages of certain choices of $\EuScript{F}$ (and therefore the corresponding IPMs) over others—in particular, the kernel distance is shown to have three favorable properties compared with the other mentioned distances: it is computationally cheaper, the empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension $d$ of the space (for $S=\mathbb{R}^{d}$). We also provide a novel interpretation of IPMs and their empirical estimators by relating them to the problem of binary classification: while the IPM between class-conditional distributions is the negative of the optimal risk associated with a binary classifier, the smoothness of an appropriate binary classifier (e.g., support vector machine, Lipschitz classifier, etc.) is inversely related to the empirical estimator of the IPM between these class-conditional distributions.

#### Article information

Source
Electron. J. Statist., Volume 6 (2012), 1550-1599.

Dates
First available in Project Euclid: 18 September 2012

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1347974672

Digital Object Identifier
doi:10.1214/12-EJS722

Mathematical Reviews number (MathSciNet)
MR2988458

Zentralblatt MATH identifier
1295.62035

Subjects
Primary: 62G05: Estimation

#### Citation

Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. On the empirical estimation of integral probability metrics. Electron. J. Statist. 6 (2012), 1550--1599. doi:10.1214/12-EJS722. https://projecteuclid.org/euclid.ejs/1347974672

#### References

• [1] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another., Journal of the Royal Statistical Society, Series B (Methodological) 28, 131–142.
• [2] Aronszajn, N. (1950). Theory of reproducing kernels., Trans. Amer. Math. Soc. 68, 337–404.
• [3] Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results., Journal of Machine Learning Research 3, 463–482.
• [4] Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local rademacher complexities., Annals of Statistics 33, 4, 1497–1537.
• [5] Beauzamy, B. (1985)., Introduction to Banach spaces and their Geometry. North-Holland, The Netherlands.
• [6] Berlinet, A. and Thomas-Agnan, C. (2004)., Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, London, UK.
• [7] Bickel, P. J. (1969). A distribution free version of the smirnov two sample test in the p-variate case., The Annals of Mathematical Statistics 40, 1, 1–23.
• [8] Boucheron, S., Lugosi, G., and Massart, P. (2000). A sharp concentration inequality with applications., Random Structures and Algorithms 16, 3, 277–292.
• [9] Breiman, L. (1999). Prediction games and arcing algorithms., Neural Computation 11, 7, 1493–1517.
• [10] Cortes, C. and Vapnik, V. (1995). Support-vector networks., Machine Learning 20, 273–297.
• [11] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations., Studia Scientiarium Mathematicarum Hungarica 2, 299–318.
• [12] Cucker, F. and Zhou, D.-X. (2007)., Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, UK.
• [13] Devroye, L. and Györfi, L. (1985)., Nonparametric Density Estimation: The $L_1$ View. Wiley, New York.
• [14] Devroye, L., Györfi, L., and Lugosi, G. (1996)., A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
• [15] Dudley, R. M. (2002)., Real Analysis and Probability. Cambridge University Press, Cambridge, UK.
• [16] Fedotov, A. A., Harremoës, P., and Topsøe, F. (2003). Refinements of Pinsker’s inequality., IEEE Trans. Information Theory 49, 6, 1491–1498.
• [17] Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In, Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, Cambridge, MA, 489–496.
• [18] Gibbs, A. L. and Su, F. E. (2002). On choosing and bounding probability metrics., International Statistical Review 70, 3, 419–435.
• [19] Gray, R. M., Neuhoff, D. L., and Shields, P. C. (1975). A generalization of Ornstein’s $\overlined$ distance with applications to information theory., Annals of Probability 3, 315–328.
• [20] Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., and Smola, A. (2012). A kernel two-sample test., JMLR 13, 723–773.
• [21] Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In, Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. MIT Press, 513–520.
• [22] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In, Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, 585–592.
• [23] Khosravifard, M., Fooladivanda, D., and Gulliver, T. A. (2007). Confliction of the convexity and metric properties in $f$-divergences., IEICE Trans. Fundamentals E90-A, 9, 1848–1853.
• [24] Kolmogorov, A. N. and Tihomirov, V. M. (1961). $\epsilon$-entropy and $\epsilon$-capacity of sets in functional space., American Mathematical Society Translations 2, 17, 277–364.
• [25] Lindvall, T. (1992)., Lectures on the Coupling Method. John Wiley & Sons, New York.
• [26] Massart, P. (2000). Some applications of concentration inequalities to statistics., Ann. Fac. Sci. Toulouse Math. 9, 6, 245–303.
• [27] McShane, E. J. (1934). Extension of range of functions., Bulletin of the American Mathematical Society 40, 837–842.
• [28] Mendelson, S. (2002). Rademacher averages and phase transitions in Glivenko-Cantelli classes., IEEE Transactions on Information Theory 48, 1, 251–263.
• [29] Müller, A. (1997). Integral probability metrics and their generating classes of functions., Advances in Applied Probability 29, 429–443.
• [30] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2007). Nonparametric estimation of the likelihood ratio and divergence functionals. In, IEEE International Symposium on Information Theory.
• [31] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2009). On surrogate loss functions and $f$-divergences., Annals of Statistics 37, 2, 876–904.
• [32] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization., IEEE Transactions on Information Theory 56, 11, 5847–5861.
• [33] Rachev, S. T. (1984). On a class of minimum functionals in a space of probability measures., Theory of Probability and its Applications 29, 41–48.
• [34] Rachev, S. T. (1985). The Monge–Kantorovich mass transference problem and its stochastic applications., Theory of Probability and its Applications 29, 647–676.
• [35] Rachev, S. T. and Rüschendorf, L. (1998)., Mass transportation problems. Vol. I Theory, Vol. II Applications. Probability and its Applications. Springer-Verlag, Berlin.
• [36] Rockafellar, R. T. (1970)., Convex Analysis. Princeton University Press, Princeton, NJ.
• [37] Schölkopf, B. and Smola, A. J. (2002)., Learning with Kernels. MIT Press, Cambridge, MA.
• [38] Shawe-Taylor, J. and Cristianini, N. (2004)., Kernel Methods for Pattern Analysis. Cambridge University Press, UK.
• [39] Shorack, G. R. (2000)., Probability for Statisticians. Springer-Verlag, New York.
• [40] Smirnov, N. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples., Moscow University Mathematics Bulletin 2, 3–26. University of Moscow.
• [41] Srebro, N., Sridharan, K., and Tewari, A. (2010). Smoothness, low noise and fast rates. In, Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds. 2199–2207.
• [42] Sriperumbudur, B. (2011). Mixture density estimation via Hilbert space embedding of measures. In, Proceedings of International Symposium on Information Theory. 1027–1030.
• [43] Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In, Proc. of the 21$^st$ Annual Conference on Learning Theory, R. Servedio and T. Zhang, Eds. 111–122.
• [44] Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures., Journal of Machine Learning Research 11, 1517–1561.
• [45] Steinwart, I. and Christmann, A. (2008)., Support Vector Machines. Springer.
• [46] Suquet, C. (2009). Reproducing kernel Hilbert spaces and random measures. In, Proc. of the 5th International ISAAC Congress, Catania, Italy, 25-30 July 2005, H. G. W. Begehr and F. Nicolosi, Eds. World Scientific, 143–152.
• [47] Vajda, I. (1989)., Theory of Statistical Inference and Information. Kluwer Academic Publishers, Boston.
• [48] Vallander, S. S. (1973). Calculation of the Wasserstein distance between probability distributions on the line., Theory Probab. Appl. 18, 784–786.
• [49] van de Geer, S. (2000)., Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, UK.
• [50] van der Vaart, A. W. and Wellner, J. A. (1996)., Weak Convergence and Empirical Processes. Springer-Verlag, New York.
• [51] von Luxburg, U. and Bousquet, O. (2004). Distance-based classification with Lipschitz functions., Journal for Machine Learning Research 5, 669–695.
• [52] Wang, Q., Kulkarni, S. R., and Verdú, S. (2005). Divergence estimation of continuous distributions based on data-dependent partitions., IEEE Trans. Information Theory 51, 9, 3064–3074.
• [53] Wendland, H. (2005)., Scattered Data Approximation. Cambridge University Press, Cambridge, UK.
• [54] Whitney, H. (1934). Analytic extensions of differentiable functions defined in closed sets., Transactions of the American Mathematical Society 36, 63–89.
• [55] Zolotarev, V. M. (1983). Probability metrics., Theory of Probability and its Applications 28, 278–302.