Electronic Journal of Statistics

On the interpretability of conditional probability estimates in the agnostic setting

Abstract

We study the interpretability of conditional probability estimates for binary classification in the agnostic setting. In this setting, conditional probability estimates do not necessarily reflect the true conditional probabilities. Instead, they satisfy a calibration property: among all data points for which the classifier predicts $\mathcal{P}(Y=1|X)=p$, a fraction $p$ actually have label $Y=1$. For cost-sensitive decision problems, this calibration property provides adequate support for using the Bayes decision rule. In this paper, we define a novel measure of the calibration property together with its empirical counterpart, and prove a uniform convergence result between them. This new measure enables us to formally justify the calibration property of conditional probability estimates. It also provides new insights into the problem of estimating and calibrating conditional probabilities, and allows us to reliably estimate the expected cost of decision rules when applied to an unlabeled dataset.
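The calibration property and its use in cost-sensitive decisions can be illustrated with a short sketch. The abstract does not specify the paper's calibration measure, so the code below uses a common equal-width binned gap as a stand-in; it should not be read as the authors' definition. The function names and the cost parameters `cost_fp` and `cost_fn` are hypothetical, introduced only for this illustration.

```python
import numpy as np


def binned_calibration_gap(p_hat, y, n_bins=10):
    """Weighted mean of |mean label - mean prediction| over equal-width bins.

    This is a generic binned check, assumed here for illustration; it is
    not the measure defined in the paper.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    # Map each prediction in [0, 1] to a bin index in {0, ..., n_bins - 1}.
    bins = np.minimum((p_hat * n_bins).astype(int), n_bins - 1)
    gap = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # mask.mean() is the fraction of points that fall in this bin.
            gap += mask.mean() * abs(y[mask].mean() - p_hat[mask].mean())
    return gap


def bayes_decision(p_hat, cost_fp, cost_fn):
    """Predict 1 when its expected cost is lower than that of predicting 0."""
    p_hat = np.asarray(p_hat, dtype=float)
    # Predicting 1 costs (1 - p) * cost_fp in expectation;
    # predicting 0 costs p * cost_fn in expectation.
    return ((1.0 - p_hat) * cost_fp < p_hat * cost_fn).astype(int)


def estimated_cost(p_hat, decisions, cost_fp, cost_fn):
    """Estimate the average cost of the decisions without using labels."""
    p_hat = np.asarray(p_hat, dtype=float)
    decisions = np.asarray(decisions)
    per_point = np.where(
        decisions == 1, (1.0 - p_hat) * cost_fp, p_hat * cost_fn
    )
    return per_point.mean()
```

When the estimates are calibrated on a dataset, the label-free estimate from `estimated_cost` agrees with the cost computed from the true labels, which is the sense in which calibrated estimates support estimating the expected cost on unlabeled data.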

Article information

Source
Electron. J. Statist., Volume 11, Number 2 (2017), 5198-5231.

Dates
First available in Project Euclid: 15 December 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1513306871

Digital Object Identifier
doi:10.1214/17-EJS1376SI

Mathematical Reviews number (MathSciNet)
MR3738209

Zentralblatt MATH identifier
06825044

Citation

Gao, Yihan; Parameswaran, Aditya; Peng, Jian. On the interpretability of conditional probability estimates in the agnostic setting. Electron. J. Statist. 11 (2017), no. 2, 5198--5231. doi:10.1214/17-EJS1376SI. https://projecteuclid.org/euclid.ejs/1513306871
