Mutual information, metric entropy and cumulative relative entropy risk



The Annals of Statistics

David Haussler and Manfred Opper

Source: Ann. Statist. Volume 25, Number 6 (1997), 2451-2492.

Abstract

Assume $\{P_{\theta}: \theta \in \Theta\}$ is a set of probability distributions with a common dominating measure on a complete separable metric space Y. A state $\theta^* \in \Theta$ is chosen by Nature. A statistician obtains n independent observations $Y_1, \dots, Y_n$ from Y distributed according to $P_{\theta^*}$. For each time t between 1 and n, based on the observations $Y_1, \dots, Y_{t-1}$, the statistician produces an estimated distribution $\hat{P}_t$ for $P_{\theta^*}$ and suffers a loss $L(P_{\theta^*}, \hat{P}_t)$. The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss $L(P_{\theta^*}, \hat{P}_t)$ is the relative entropy between the true distribution $P_{\theta^*}$ and the estimated distribution $\hat{P}_t$. Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter $\Theta^*$ and the observations $Y_1, \dots, Y_n$.
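
The identity between cumulative Bayes risk and mutual information can be checked numerically on a toy problem. The following is a minimal sketch, not taken from the paper: it assumes a finite parameter set of Bernoulli distributions with a prior supplied by the caller, takes $\hat{P}_t$ to be the Bayes predictive distribution, and uses illustrative function names (`mutual_information`, `cumulative_bayes_risk`) introduced only for this example.

```python
import itertools
import math


def seq_prob(theta, ys):
    """Probability of a binary sequence ys under i.i.d. Bernoulli(theta)."""
    p = 1.0
    for y in ys:
        p *= theta if y == 1 else 1.0 - theta
    return p


def kl_bernoulli(p, q):
    """Relative entropy D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def mutual_information(thetas, prior, n):
    """I(Theta*; Y_1..Y_n) by brute-force enumeration of all 2^n sequences."""
    mi = 0.0
    for ys in itertools.product((0, 1), repeat=n):
        likelihoods = [seq_prob(th, ys) for th in thetas]
        marginal = sum(w * lik for w, lik in zip(prior, likelihoods))
        for w, lik in zip(prior, likelihoods):
            if w * lik > 0.0:
                mi += w * lik * math.log(lik / marginal)
    return mi


def cumulative_bayes_risk(thetas, prior, n):
    """Sum over t = 1..n of the expected relative entropy between the true
    distribution and the Bayes predictive distribution given Y_1..Y_{t-1}."""
    risk = 0.0
    for t in range(1, n + 1):
        for past in itertools.product((0, 1), repeat=t - 1):
            weights = [w * seq_prob(th, past) for th, w in zip(thetas, prior)]
            marginal = sum(weights)
            if marginal == 0.0:
                continue
            # Bayes predictive probability that Y_t = 1 given the past.
            pred_one = sum(wt * th for wt, th in zip(weights, thetas)) / marginal
            # weights[i] is the joint probability of (theta_i, past).
            for th_star, joint in zip(thetas, weights):
                if joint > 0.0:
                    risk += joint * kl_bernoulli(th_star, pred_one)
    return risk


if __name__ == "__main__":
    thetas = [0.2, 0.5, 0.8]      # assumed toy parameter set
    prior = [1.0 / 3.0] * 3       # assumed uniform prior
    print(mutual_information(thetas, prior, 4))
    print(cumulative_bayes_risk(thetas, prior, 4))
```

The two printed values should agree up to floating-point error, which reflects the chain rule for mutual information. Enumeration is exponential in n, so the sketch is only practical for small n.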

New bounds on this mutual information are given in terms of the Laplace transform of the Hellinger distance between pairs of distributions indexed by parameters in $\Theta$. From these, bounds on the cumulative minimax risk are given in terms of the metric entropy of $\Theta$ with respect to the Hellinger distance. The assumptions required for these bounds are very general and do not depend on the choice of the dominating measure. They apply to both finite- and infinite-dimensional $\Theta$. They apply in some cases where Y is infinite dimensional, in some cases where Y is not compact, in some cases where the distributions are not smooth and in some parametric cases where asymptotic normality of the posterior distribution fails.
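
To make the notion of metric entropy with respect to the Hellinger distance concrete, the sketch below computes Hellinger distances on a discretized Bernoulli family and builds a greedy ε-cover of it. It is an illustration under stated assumptions, not the construction used in the paper: the Hellinger distance is taken without the factor 1/2 that some authors include, the family and grid are arbitrary choices, and the greedy cover only gives an upper bound on the covering number of the grid.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two pmfs on the same finite set.

    Convention assumed here: sqrt(sum((sqrt(p_i) - sqrt(q_i))^2)),
    so the maximum possible value is sqrt(2).
    """
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def greedy_cover(points, dist, eps):
    """Return centers such that every point lies within eps of some center.

    A point becomes a new center only if it is farther than eps from all
    centers chosen so far, so the result is a valid (not necessarily
    minimal) eps-cover of the finite point set.
    """
    centers = []
    for x in points:
        if all(dist(x, c) > eps for c in centers):
            centers.append(x)
    return centers


def metric_entropy_upper_bound(points, dist, eps):
    """Log of the greedy cover size: an upper bound on the metric entropy."""
    return math.log(len(greedy_cover(points, dist, eps)))


if __name__ == "__main__":
    # Assumed toy family: Bernoulli(theta) for theta on a grid in [0, 1].
    family = [(1.0 - i / 200.0, i / 200.0) for i in range(201)]
    for eps in (0.2, 0.1, 0.05):
        print(eps, metric_entropy_upper_bound(family, hellinger, eps))
```

Shrinking ε increases the size of the cover, and the rate at which its logarithm grows is the quantity that the metric entropy bounds are stated in terms of.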

Primary Subjects: 62G07
Secondary Subjects: 62B10, 62C20, 94A29
Keywords: Mutual information; Hellinger distance; relative entropy; metric entropy; minimax risk; Bayes risk; density estimation; Kullback-Leibler distance

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1030741081
Mathematical Reviews number (MathSciNet): MR1604481
Digital Object Identifier: doi:10.1214/aos/1030741081
Zentralblatt MATH identifier: 0920.62007


2009 © Institute of Mathematical Statistics