The Annals of Statistics

Mutual information, metric entropy and cumulative relative entropy risk

David Haussler and Manfred Opper



Assume $\{P_{\theta}: \theta \in \Theta\}$ is a set of probability distributions with a common dominating measure on a complete separable metric space $Y$. A state $\theta^* \in \Theta$ is chosen by Nature. A statistician obtains $n$ independent observations $Y_1, \dots, Y_n$ from $Y$ distributed according to $P_{\theta^*}$. For each time $t$ between 1 and $n$, based on the observations $Y_1, \dots, Y_{t-1}$, the statistician produces an estimated distribution $\hat{P}_t$ for $P_{\theta^*}$ and suffers a loss $L(P_{\theta^*}, \hat{P}_t)$. The cumulative risk for the statistician is the average total loss up to time $n$. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the case when the loss $L(P_{\theta^*}, \hat{P}_t)$ is the relative entropy between the true distribution $P_{\theta^*}$ and the estimated distribution $\hat{P}_t$. Here the cumulative Bayes risk from time 1 to $n$ is the mutual information between the random parameter $\Theta^*$ and the observations $Y_1, \dots, Y_n$.

New bounds on this mutual information are given in terms of the Laplace transform of the Hellinger distance between pairs of distributions indexed by parameters in $\Theta$. From these, bounds on the cumulative minimax risk are given in terms of the metric entropy of $\Theta$ with respect to the Hellinger distance. The assumptions required for these bounds are very general and do not depend on the choice of the dominating measure. They apply to both finite- and infinite-dimensional $\Theta$. They apply in some cases where $Y$ is infinite-dimensional, in some cases where $Y$ is not compact, in some cases where the distributions are not smooth, and in some parametric cases where asymptotic normality of the posterior distribution fails.
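The identity stated in the abstract, that the cumulative Bayes risk under relative entropy loss equals the mutual information between the parameter and the observations, can be checked numerically on a toy problem. The sketch below is not taken from the paper. It assumes a finite parameter set indexing Bernoulli distributions, a prior chosen only for illustration, and the Bayes strategy that predicts with the posterior predictive mixture. All function names are hypothetical.

```python
import itertools
import math


def seq_prob(theta, ys):
    """Probability of the binary sequence ys under i.i.d. Bernoulli(theta)."""
    p = 1.0
    for y in ys:
        p *= theta if y == 1 else 1.0 - theta
    return p


def mixture_prob(thetas, prior, ys):
    """Marginal probability of ys under the prior mixture of the P_theta."""
    return sum(w * seq_prob(th, ys) for th, w in zip(thetas, prior))


def kl_bernoulli(p, q):
    """Relative entropy D(Bernoulli(p) || Bernoulli(q)) in nats."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def cumulative_bayes_risk(thetas, prior, n):
    """Expected total relative entropy loss of the Bayes predictor, times 1..n."""
    risk = 0.0
    for t in range(1, n + 1):
        # Enumerate every possible past y_1, ..., y_{t-1}.
        for past in itertools.product((0, 1), repeat=t - 1):
            m_past = mixture_prob(thetas, prior, past)
            # Posterior predictive probability that Y_t = 1 given the past.
            pred_one = mixture_prob(thetas, prior, past + (1,)) / m_past
            for th, w in zip(thetas, prior):
                risk += w * seq_prob(th, past) * kl_bernoulli(th, pred_one)
    return risk


def mutual_information(thetas, prior, n):
    """Mutual information I(Theta; Y_1, ..., Y_n) in nats."""
    mi = 0.0
    for ys in itertools.product((0, 1), repeat=n):
        m = mixture_prob(thetas, prior, ys)
        for th, w in zip(thetas, prior):
            p = seq_prob(th, ys)
            if p > 0.0:
                mi += w * p * math.log(p / m)
    return mi


# Illustrative values only (assumptions, not from the paper).
thetas = [0.2, 0.5, 0.9]
prior = [0.3, 0.3, 0.4]
print(cumulative_bayes_risk(thetas, prior, 5))
print(mutual_information(thetas, prior, 5))
```

The two printed values should agree up to floating-point error. This follows from the chain rule for relative entropy, which turns the sum of the per-step expected losses into the log-likelihood ratio of the full sequence. Exhaustive enumeration keeps the sketch short, but it is only practical for small $n$.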

Article information

Ann. Statist., Volume 25, Number 6 (1997), 2451-2492.

First available in Project Euclid: 30 August 2002

Primary: 62G07: Density estimation
Secondary: 62B10: Information-theoretic topics [See also 94A17]; 62C20: Minimax procedures; 94A29: Source coding [See also 68P30]

Keywords: Mutual information; Hellinger distance; relative entropy; metric entropy; minimax risk; Bayes risk; density estimation; Kullback-Leibler distance


Haussler, David; Opper, Manfred. Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist. 25 (1997), no. 6, 2451--2492. doi:10.1214/aos/1030741081.


