Annals of Statistics

On surrogate loss functions and f-divergences

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan


Abstract

The goal of binary classification is to estimate a discriminant function γ from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer Q. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or f-divergence functionals. Whereas this correspondence was established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951) 93–102. Univ. California Press, Berkeley] for the 0–1 loss, we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.
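
To make the loss/divergence correspondence concrete, here is a minimal sketch in standard notation; the symbols $\phi$, $\mu$, $\pi$ and the risk notation below are illustrative conventions, not quoted from the paper. Given a convex function $f$ with $f(1) = 0$, the $f$-divergence (Ali-Silvey divergence [1, 7]) between measures $\mu$ and $\pi$ is
\[
I_f(\mu, \pi) \;=\; \int f\!\left(\frac{d\mu}{d\pi}\right) d\pi .
\]
Writing $R_\phi(\gamma, Q)$ for the risk of a discriminant function $\gamma$ under a margin-based surrogate loss $\phi$ when the covariates are passed through a quantizer $Q$, the correspondence states, roughly, that the optimal $\phi$-risk for a fixed $Q$ takes the form
\[
\inf_{\gamma} R_\phi(\gamma, Q) \;=\; -\, I_f(\mu, \pi),
\]
where $\mu$ and $\pi$ are the class-conditional measures induced by $Q$, and $f$ is a convex function determined by $\phi$ (conversely, suitable $f$-divergences arise from surrogate losses in this way). The 0-1 loss recovers Blackwell's classical case [3, 4].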

Article information

Source
Ann. Statist., Volume 37, Number 2 (2009), 876–904.

Dates
First available in Project Euclid: 10 March 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1236693153

Digital Object Identifier
doi:10.1214/08-AOS595

Mathematical Reviews number (MathSciNet)
MR2502654

Zentralblatt MATH identifier
1162.62060

Subjects
Primary: 62G10 (Hypothesis testing); 68Q32 (Computational learning theory) [see also 68T05]; 62K05 (Optimal designs)

Keywords
Binary classification; discriminant analysis; surrogate losses; f-divergences; Ali-Silvey divergences; quantizer design; nonparametric decentralized detection; statistical machine learning; Bayes consistency

Citation

Nguyen, XuanLong; Wainwright, Martin J.; Jordan, Michael I. On surrogate loss functions and f-divergences. Ann. Statist. 37 (2009), no. 2, 876–904. doi:10.1214/08-AOS595. https://projecteuclid.org/euclid.aos/1236693153



References

  • [1] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B 28 131–142.
  • [2] Bartlett, P., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
  • [3] Blackwell, D. (1951). Comparison of experiments. Proc. 2nd Berkeley Symp. Probab. Statist. 1 93–102. Univ. California Press, Berkeley.
  • [4] Blackwell, D. (1953). Equivalent comparisons of experiments. Ann. Math. Statist. 24 265–272.
  • [5] Bradt, R. and Karlin, S. (1956). On the design and comparison of certain dichotomous experiments. Ann. Math. Statist. 27 390–409.
  • [6] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning 20 273–297.
  • [7] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar. 2 299–318.
  • [8] Hiriart-Urruty, J.-B. and Lemaréchal, C. (2001). Fundamentals of Convex Analysis. Springer, New York.
  • [9] Jiang, W. (2004). Process consistency for AdaBoost. Ann. Statist. 32 13–29.
  • [10] Kailath, T. (1967). The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Comm. Technology 15 52–60.
  • [11] Longo, M., Lookabaugh, T. and Gray, R. (1990). Quantization for decentralized hypothesis testing under communication constraints. IEEE Trans. Inform. Theory 36 241–255.
  • [12] Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 30–55.
  • [13] Mannor, S., Meir, R. and Zhang, T. (2003). Greedy algorithms for classification—consistency, convergence rates and adaptivity. J. Mach. Learn. Res. 4 713–741.
  • [14] McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics (J. Siemons, ed.). Cambridge Univ. Press, Cambridge.
  • [15] Nguyen, X., Wainwright, M. J. and Jordan, M. I. (2005). Nonparametric decentralized detection using kernel methods. IEEE Trans. Signal Processing 53 4053–4066.
  • [16] Phelps, R. R. (1993). Convex Functions, Monotone Operators and Differentiability. Springer, New York.
  • [17] Poor, H. V. and Thomas, J. B. (1977). Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems. IEEE Trans. Comm. 25 893–900.
  • [18] Rockafellar, R. T. (1970). Convex Analysis. Princeton Univ. Press, Princeton.
  • [19] Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inform. Theory 51 128–142.
  • [20] Topsøe, F. (2000). Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inform. Theory 46 1602–1609.
  • [21] Tsitsiklis, J. N. (1993). Decentralized detection. In Advances in Statistical Signal Processing 297–344. JAI Press, Greenwich.
  • [22] Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–134.