## The Annals of Statistics

### Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory

#### Abstract

We describe and develop a close relationship between two problems that have customarily been regarded as distinct: that of maximizing entropy, and that of minimizing worst-case expected loss. Using a formulation grounded in the equilibrium theory of zero-sum games between Decision Maker and Nature, these two problems are shown to be dual to each other, the solution to each providing that to the other. Although Topsøe described this connection for the Shannon entropy over 20 years ago, it does not appear to be widely known even in that important special case.
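
In the Shannon special case referred to here, with log loss $L(x, Q) = -\log q(x)$, the entropy is the minimum expected log loss, $H(P) = \inf_Q \mathbb{E}_P[-\log q(X)]$, and the duality can be sketched (under regularity conditions of the kind the paper discusses; notation here is illustrative, not taken verbatim from the paper) as

$$
\sup_{P \in \Gamma} H(P) \;=\; \inf_{Q}\, \sup_{P \in \Gamma} \mathbb{E}_P[-\log q(X)],
$$

so the maximum-entropy distribution over the constraint set $\Gamma$ coincides with the robust Bayes (minimax) act.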

We here generalize this theory to apply to arbitrary decision problems and loss functions. We indicate how an appropriate generalized definition of entropy can be associated with such a problem, and we show that, subject to certain regularity conditions, the above-mentioned duality continues to apply in this extended context. This simultaneously provides a possible rationale for maximizing entropy and a tool for finding robust Bayes acts. We also describe the essential identity between the problem of maximizing entropy and that of minimizing a related discrepancy or divergence between distributions. This leads to an extension, to arbitrary discrepancies, of a well-known minimax theorem for the case of Kullback–Leibler divergence (the “redundancy-capacity theorem” of information theory).
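
One way to read the generalized definitions sketched in this paragraph: for a loss function $L$ with Bayes act $a_Q$ against a distribution $Q$, the associated entropy and divergence are, roughly,

$$
H(P) := \inf_{a} \mathbb{E}_P[L(X, a)], \qquad
D(P, Q) := \mathbb{E}_P[L(X, a_Q)] - H(P),
$$

so $D(P, Q)$ measures the excess loss incurred by acting as if $Q$ held when $P$ is true; with log loss these reduce to the Shannon entropy and the Kullback–Leibler divergence.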

For the important case of families of distributions having certain mean values specified, we develop simple sufficient conditions and methods for identifying the desired solutions. We use this theory to introduce a new concept of “generalized exponential family” linked to the specific decision problem under consideration, and we demonstrate that this shares many of the properties of standard exponential families.
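
As a concrete illustration of the mean-value case in the classical Shannon setting (not the paper's generalized framework), the maximum-entropy distribution on a finite set subject to a mean constraint takes the exponential-family form $p_i \propto e^{\beta x_i}$. A minimal sketch, in which the function name and the bisection bracket for the natural parameter are our own choices:

```python
import math

def maxent_with_mean(xs, target_mean, tol=1e-12):
    """Maximum-entropy distribution on the finite set xs subject to
    E[X] = target_mean.  The solution has exponential-family form
    p_i proportional to exp(beta * x_i); since the mean is strictly
    increasing in beta, we can find beta by bisection."""
    def mean_for(beta):
        ws = [math.exp(beta * x) for x in xs]
        z = sum(ws)
        return sum(w * x for w, x in zip(ws, xs)) / z

    lo, hi = -50.0, 50.0  # bracket for the natural parameter beta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    ws = [math.exp(beta * x) for x in xs]
    z = sum(ws)
    return [w / z for w in ws]

# On {0, 1, 2} with mean 1 the constraint is satisfied at beta = 0,
# so maximum entropy gives the uniform distribution.
p = maxent_with_mean([0, 1, 2], 1.0)
```

Raising the target mean tilts the distribution toward larger outcomes, which is exactly the exponential tilting that characterizes standard exponential families.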

Finally, we show that the existence of an equilibrium in our game can be rephrased in terms of a “Pythagorean property” of the related divergence, thus generalizing previously announced results for Kullback–Leibler and Bregman divergences.
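
In the Kullback–Leibler case the "Pythagorean property" mentioned here is classical (Csiszár, 1975, in the reference list below): if $P^*$ is the I-projection of $Q$ onto a convex set $\Gamma$, then for all $P \in \Gamma$

$$
D(P \,\|\, Q) \;\ge\; D(P \,\|\, P^*) + D(P^* \,\|\, Q),
$$

with equality when $\Gamma$ is a linear (mean-value) family; the paper's result recasts the existence of a game equilibrium as an analogue of this property for general divergences.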

#### Article information

**Source**
Ann. Statist., Volume 32, Number 4 (2004), 1367--1433.

**Dates**
First available in Project Euclid: 4 August 2004

https://projecteuclid.org/euclid.aos/1091626173

**Digital Object Identifier**
doi:10.1214/009053604000000553

**Mathematical Reviews number (MathSciNet)**
MR2089128

**Zentralblatt MATH identifier**
1048.62008

**Subjects**
Primary: 62C20: Minimax procedures
Secondary: 94A17: Measures of information, entropy

#### Citation

Grünwald, Peter D.; Dawid, A. Philip. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Statist. 32 (2004), no. 4, 1367--1433. doi:10.1214/009053604000000553. https://projecteuclid.org/euclid.aos/1091626173

#### References

• Azoury, K. S. and Warmuth, M. K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning 43 211--246.
• Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
• Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
• Berger, J. O. and Bernardo, J. M. (1992). On the development of reference priors (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35--60. Oxford Univ. Press.
• Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. Ser. B 41 113--147.
• Billingsley, P. (1999). Convergence of Probability Measures, 2nd ed. Wiley, New York.
• Borwein, J. M., Lewis, A. S. and Noll, D. (1996). Maximum entropy reconstruction using derivative information. I. Fisher information and convex duality. Math. Oper. Res. 21 442--468.
• Brègman, L. M. (1967). The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. and Math. Phys. 7 200--217.
• Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78 1--3.
• Censor, Y. and Zenios, S. A. (1997). Parallel Optimization: Theory, Algorithms and Applications. Oxford Univ. Press.
• Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36 453--471.
• Clarke, B. and Barron, A. (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statist. Plann. Inference 41 37--60.
• Cover, T. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.
• Csiszár, I. (1975). $I$-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3 146--158.
• Csiszár, I. (1991). Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist. 19 2032--2066.
• Davisson, L. D. and Leon-Garcia, A. (1980). A source matching approach to finding minimax codes. IEEE Trans. Inform. Theory 26 166--174.
• Dawid, A. P. (1986). Probability forecasting. Encyclopedia of Statistical Sciences 7 210--218. Wiley, New York.
• Dawid, A. P. (1992). Prequential analysis, stochastic complexity and Bayesian inference (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 109--125. Oxford Univ. Press.
• Dawid, A. P. (1998). Coherent measures of discrepancy, uncertainty and dependence, with applications to Bayesian predictive experimental design. Technical Report 139, Dept. Statistical Science, Univ. College London.
• Dawid, A. P. (2003). Generalized entropy functions and Bregman divergence. Unpublished manuscript.
• Dawid, A. P. and Sebastiani, P. (1999). Coherent dispersion criteria for optimal experimental design. Ann. Statist. 27 65--81.
• DeGroot, M. H. (1962). Uncertainty, information and sequential experiments. Ann. Math. Statist. 33 404--419.
• DeGroot, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
• Della Pietra, S., Della Pietra, V. and Lafferty, J. (2002). Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-109, School of Computer Science, Carnegie Mellon Univ.
• Edwards, A. W. F. (1992). Likelihood, expanded ed. Johns Hopkins Univ. Press, Baltimore, MD.
• Ferguson, T. S. (1967). Mathematical Statistics. A Decision-Theoretic Approach. Academic Press, New York.
• Gallager, R. G. (1976). Source coding with side information and universal coding. Unpublished manuscript.
• Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. Ser. B 14 107--114.
• Grünwald, P. D. (1998). The minimum description length principle and reasoning under uncertainty. Ph.D. dissertation, ILLC Dissertation Series 1998-03, Univ. Amsterdam.
• Grünwald, P. D. and Dawid, A. P. (2002). Game theory, maximum generalized entropy, minimum discrepancy, robust Bayes and Pythagoras. In Proc. 2002 IEEE Information Theory Workshop (ITW 2002) 94--97. IEEE, New York.
• Harremoës, P. and Topsøe, F. (2001). Maximum entropy fundamentals. Entropy 3 191--226. Available at www.mdpi.org/entropy/.
• Harremoës, P. and Topsøe, F. (2002). Unified approach to optimization techniques in Shannon theory. In Proc. 2002 IEEE International Symposium on Information Theory 238. IEEE, New York.
• Haussler, D. (1997). A general minimax result for relative entropy. IEEE Trans. Inform. Theory 43 1276--1280.
• Jaynes, E. T. (1957a). Information theory and statistical mechanics. I. Phys. Rev. 106 620--630.
• Jaynes, E. T. (1957b). Information theory and statistical mechanics. II. Phys. Rev. 108 171--190.
• Jaynes, E. T. (1985). Some random observations. Synthese 63 115--138.
• Jaynes, E. T. (1989). Papers on Probability, Statistics and Statistical Physics, 2nd ed. Kluwer Academic, Dordrecht.
• Jones, L. K. and Byrne, C. L. (1990). General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis. IEEE Trans. Inform. Theory 36 23--30.
• Kapur, J. N. and Kesavan, H. (1992). Entropy Optimization Principles with Applications. Academic Press, New York.
• Kivinen, J. and Warmuth, M. K. (1999). Boosting as entropy projection. In Proc. Twelfth Annual Conference on Computational Learning Theory (COLT'99) 134--144. ACM Press, New York.
• Krob, J. and Scholl, H. R. (1997). A minimax result for the Kullback--Leibler Bayes risk. Econ. Qual. Control 12 147--157.
• Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
• Lafferty, J. (1999). Additive models, boosting, and inference for generalized divergences. In Proc. Twelfth Annual Conference on Computational Learning Theory (COLT'99) 125--133. ACM Press, New York.
• Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27 986--1005.
• Merhav, N. and Feder, M. (1995). A strong version of the redundancy-capacity theorem of universal coding. IEEE Trans. Inform. Theory 41 714--722.
• von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Math. Ann. 100 295--320.
• Noubiap, R. F. and Seidel, W. (2001). An algorithm for calculating $\Gamma$-minimax decision rules under generalized moment conditions. Ann. Statist. 29 1094--1116.
• Pinsker, M. S. (1964). Information and Information Stability of Random Variables and Processes. Holden-Day, San Francisco.
• Posner, E. (1975). Random coding strategies for minimum entropy. IEEE Trans. Inform. Theory 21 388--391.
• Rao, C. R. (1982). Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology 21 24--43.
• Rényi, A. (1961). On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1 547--561. Univ. California Press, Berkeley.
• Rockafellar, R. T. (1970). Convex Analysis. Princeton Univ. Press.
• Ryabko, B. Y. (1979). Coding of a source with unknown but ordered probabilities. Problems Inform. Transmission 15 134--138.
• Scholl, H. R. (1998). Shannon optimal priors on IID statistical experiments converge weakly to Jeffreys' prior. Test 7 75--94.
• Seidenfeld, T. (1986). Entropy and uncertainty. Philos. Sci. 53 467--491.
• Shimony, A. (1985). The status of the principle of maximum entropy. Synthese 63 35--53. [Reprinted as Chapter 8 of Shimony (1993).]
• Shimony, A. (1993). Search for a Naturalistic World View 1. Cambridge Univ. Press.
• Skyrms, B. (1985). Maximum entropy inference as a special case of conditionalization. Synthese 63 55--74.
• Stoer, J. and Witzgall, C. (1970). Convexity and Optimization in Finite Dimensions. I. Springer, Berlin.
• Stroock, D. W. (1993). Probability Theory, an Analytic View. Cambridge Univ. Press.
• Topsøe, F. (1979). Information-theoretical optimization techniques. Kybernetika 15 8--27.
• Topsøe, F. (2001). Basic concepts, identities and inequalities---the toolkit of information theory. Entropy 3 162--190. Available at www.mdpi.org/entropy/.
• Topsøe, F. (2002). Maximum entropy versus minimum risk and applications to some classical discrete distributions. IEEE Trans. Inform. Theory 48 2368--2376.
• Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency requirement? Stud. Hist. Philos. Sci. B Stud. Hist. Philos. Modern Phys. 26 223--262.
• Uffink, J. (1996). The constraint rule of the maximum entropy principle. Stud. Hist. Philos. Sci. B Stud. Hist. Philos. Modern Phys. 27 47--79.
• van Fraassen, B. (1981). A problem for relative information minimizers in probability kinematics. British J. Philos. Sci. 32 375--379.
• Vidakovic, B. (2000). Gamma-minimax: A paradigm for conservative robust Bayesians. Robust Bayesian Analysis. Lecture Notes in Statist. 152 241--259. Springer, New York.
• Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London.
• Xie, Q. and Barron, A. R. (2000). Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Trans. Inform. Theory 46 431--445.