Open Access
On the empirical estimation of integral probability metrics
Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert R. G. Lanckriet
Electron. J. Statist. 6: 1550-1599 (2012). DOI: 10.1214/12-EJS722


Given two probability measures, $\mathbb{P}$ and $\mathbb{Q}$ defined on a measurable space, $S$, the integral probability metric (IPM) is defined as $$\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})=\sup\left\{\left\vert \int_{S}f\,d\mathbb{P}-\int_{S}f\,d\mathbb{Q}\right\vert\,:\,f\in\mathcal{F}\right\},$$ where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. By appropriately choosing $\mathcal{F}$, various popular distances between $\mathbb{P}$ and $\mathbb{Q}$, including the Kantorovich metric, Fortet-Mourier metric, dual-bounded Lipschitz distance (also called the Dudley metric), total variation distance, and kernel distance, can be obtained.
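To make the definition concrete, here is a small illustrative sketch (not from the paper): for discrete $\mathbb{P}$ and $\mathbb{Q}$ on a three-point space and a toy finite function class $\mathcal{F}$, the supremum can be taken directly. The particular distributions and functions below are assumptions chosen only for illustration.

```python
import numpy as np

# Toy illustration: gamma_F(P, Q) for discrete P, Q on S = {0, 1, 2}
# with a small finite class F of bounded functions.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

# Each row is one function f in F, evaluated at the three points of S.
F = np.array([
    [1.0, 0.0, 0.0],   # indicator of {0}
    [0.0, 1.0, 0.0],   # indicator of {1}
    [1.0, 1.0, 1.0],   # constant 1 (contributes 0 to the supremum)
    [0.0, 0.5, 1.0],   # a 1-Lipschitz ramp on {0, 1, 2}
])

# gamma_F(P, Q) = sup_{f in F} | E_P[f] - E_Q[f] |
gamma = np.max(np.abs(F @ P - F @ Q))
print(gamma)
```

Enlarging $\mathcal{F}$ to all functions with $\sup_x |f(x)| \le 1$ would recover the total variation distance, and to all 1-Lipschitz functions the Kantorovich metric, as described above.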

In this paper, we consider the problem of estimating $\gamma_{\mathcal{F}}$ from finite random samples drawn i.i.d. from $\mathbb{P}$ and $\mathbb{Q}$. Although the above-mentioned distances cannot be computed in closed form for every $\mathbb{P}$ and $\mathbb{Q}$, we show that their empirical estimators are easily computable and strongly consistent (except for the total variation distance). We further analyze their rates of convergence. Based on these results, we discuss the advantages of certain choices of $\mathcal{F}$ (and therefore the corresponding IPMs) over others: in particular, the kernel distance is shown to have three favorable properties compared with the other distances mentioned. It is computationally cheaper, its empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension $d$ of the space (for $S=\mathbb{R}^{d}$). We also provide a novel interpretation of IPMs and their empirical estimators by relating them to the problem of binary classification: while the IPM between class-conditional distributions is the negative of the optimal risk associated with a binary classifier, the smoothness of an appropriate binary classifier (e.g., a support vector machine or Lipschitz classifier) is inversely related to the empirical estimator of the IPM between these class-conditional distributions.
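As a rough sketch of why the kernel distance is computationally cheap to estimate: when $\mathcal{F}$ is the unit ball of a reproducing kernel Hilbert space, the empirical IPM reduces to a closed-form expression in the kernel's Gram matrices. The Gaussian kernel and bandwidth below are illustrative assumptions, not prescribed by the abstract; this is the biased (V-statistic) plug-in estimate, not necessarily the estimator analyzed in the paper.

```python
import numpy as np

def kernel_distance(X, Y, sigma=1.0):
    """Empirical kernel distance between samples X ~ P and Y ~ Q.

    Sketch with a Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2));
    returns sqrt( mean k(x,x') + mean k(y,y') - 2 mean k(x,y) ).
    """
    def gram(A, B):
        # Pairwise squared distances via the expansion ||a||^2 + ||b||^2 - 2 a.b
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))

    val = gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()
    return np.sqrt(max(val, 0.0))  # clip tiny negatives from rounding

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(1.0, 1.0, size=(200, 2))   # sample from Q (shifted mean)
Z = rng.normal(0.0, 1.0, size=(200, 2))   # second sample from P
print(kernel_distance(X, Y), kernel_distance(X, Z))
```

The estimate between the two samples from $\mathbb{P}$ should be close to zero, while the estimate against the shifted distribution should be clearly larger; the cost is $O((m+n)^2)$ kernel evaluations, with no optimization problem to solve, unlike the Kantorovich or Dudley metrics.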




Published: 2012
First available in Project Euclid: 18 September 2012

zbMATH: 1295.62035
MathSciNet: MR2988458
Digital Object Identifier: 10.1214/12-EJS722

Primary: 62G05

Keywords: dual-bounded Lipschitz distance (Dudley metric), empirical estimation, integral probability metrics, Kantorovich metric, kernel distance, Lipschitz classifier, Rademacher average, reproducing kernel Hilbert space, support vector machine

Rights: Copyright © 2012 The Institute of Mathematical Statistics and the Bernoulli Society
