On Information in Statistics

D. A. S. Fraser

doi:10.1214/aoms/1177700061

Ann. Math. Statist. 36(3): 890-896 (June, 1965). DOI: 10.1214/aoms/1177700061

Abstract

The three familiar definitions of statistical information, Fisher (1925), Shannon (1948), and Kullback (1959), are closely tied to asymptotic properties, hypothesis testing, and a general principle that information should be additive. A definition of information is proposed here in the framework of an important kind of statistical model. It has an interpretation for small samples, has several optimum properties, and in most cases is not additive with independent observations. The Fisher, Shannon and Kullback informations are aspects of this information.

A variety of definitions of information may be found in the statistical literature. Perhaps the oldest and most familiar is that due to Fisher (1925). For a real parameter and a density function satisfying Cramér-Rao regularity conditions it has the form
\begin{align*}I_F(\theta) &= \int \lbrack(\partial/\partial\theta) \ln f(x \mid \theta)\rbrack^2 f(x \mid \theta)\, dx \\ &= -\int \lbrack(\partial^2/\partial\theta^2) \ln f(x \mid \theta)\rbrack f(x \mid \theta)\, dx;\end{align*}
for a vector parameter $\mathbf{\theta}$ it becomes a matrix
\begin{align*}\mathbf{I}_F(\mathbf{\theta}) &= \operatorname{cov} \{(\partial/\partial\mathbf{\theta}) \ln f(x \mid \mathbf{\theta})\mid \mathbf{\theta}\} \\ &= -\int \lbrack(\partial/\partial\mathbf{\theta})(\partial/\partial\mathbf{\theta}') \ln f(x \mid \mathbf{\theta})\rbrack f(x \mid \mathbf{\theta})\, dx\end{align*}
where $\operatorname{cov}$ stands for covariance matrix and where the quantity being averaged in the second expression is the negative of the matrix of second partial derivatives of the log-likelihood.

Shannon (1948) proposed a definition of information for communication theory.
In its primary form it measures variation in a distribution; with a change of sign it measures concentration and is thereby more appropriate for statistics: $I_S(\theta) = \int \lbrack\ln f(x \mid \theta)\rbrack f(x \mid \theta)\, dx.$ Kullback (1959) considers a definition of information for "discriminating in favor of $H_1(\theta_1)$ against $H_2(\theta_2)$": $I_K(\theta_1, \theta_2) = \int \ln \lbrack f(x \mid \theta_1)/f(x \mid \theta_2)\rbrack f(x \mid \theta_1)\, dx.$

These information functions are additive with independent observations; in fact, additivity is taken as an essential property in most developments of information. The three information functions are defined for quite general statistical models (only mild regularity is required for Fisher's definition). The Fisher and the Kullback definitions are closely tied to large-sample theory and to Bayes' theory, respectively.

The emphasis in this paper is not on information in a general model based on general principles. Rather, it is on information in an important and somewhat special model: the location model and, more generally, the transformation-parameter model.
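
As a rough numerical illustration of the three classical definitions $I_F$, $I_S$ and $I_K$, the sketch below evaluates each integral for a single observation. The normal location model $N(\theta, \sigma^2)$ is an assumption chosen here for convenience, not an example taken from the paper, and the function names, the trapezoid integrator and the integration range are all hypothetical choices. For this model the closed forms are $I_F = 1/\sigma^2$, $I_S = -\tfrac{1}{2}\ln(2\pi e \sigma^2)$ and $I_K = (\theta_1 - \theta_2)^2/(2\sigma^2)$, which the code compares against.

```python
import math


def log_pdf(x, theta, sigma):
    """Log-density of N(theta, sigma^2); the normal model is an illustrative assumption."""
    z = (x - theta) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def integrate(g, lo, hi, n=20000):
    """Composite trapezoid rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h


def fisher_info(theta, sigma, eps=1e-4):
    """I_F: expected squared score, with the score taken by central difference in theta."""
    def integrand(x):
        score = (log_pdf(x, theta + eps, sigma) - log_pdf(x, theta - eps, sigma)) / (2.0 * eps)
        return score * score * math.exp(log_pdf(x, theta, sigma))
    return integrate(integrand, theta - 12.0 * sigma, theta + 12.0 * sigma)


def shannon_info(theta, sigma):
    """I_S: integral of f ln f (Shannon's measure with the sign changed)."""
    def integrand(x):
        lp = log_pdf(x, theta, sigma)
        return lp * math.exp(lp)
    return integrate(integrand, theta - 12.0 * sigma, theta + 12.0 * sigma)


def kullback_info(theta1, theta2, sigma):
    """I_K: integral of ln[f(x|theta1)/f(x|theta2)] f(x|theta1)."""
    def integrand(x):
        lp1 = log_pdf(x, theta1, sigma)
        return (lp1 - log_pdf(x, theta2, sigma)) * math.exp(lp1)
    return integrate(integrand, theta1 - 12.0 * sigma, theta1 + 12.0 * sigma)


if __name__ == "__main__":
    sigma = 2.0
    # Each pair is (numerical value, closed form for the normal location model).
    print("Fisher  :", fisher_info(0.0, sigma), 1.0 / sigma**2)
    print("Shannon :", shannon_info(0.0, sigma), -0.5 * math.log(2.0 * math.pi * math.e * sigma**2))
    print("Kullback:", kullback_info(0.0, 1.0, sigma), (0.0 - 1.0) ** 2 / (2.0 * sigma**2))
```

Because all three measures are additive with independent observations, the corresponding values for $n$ independent observations from the same model are simply $n$ times the single-observation values computed here.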