The problem considered is that of finding the "best" linear function for discriminating between two multivariate normal populations, $\pi_1$ and $\pi_2$, without limitation to the case of equal covariance matrices. The "best" linear function is found by maximizing the divergence, $J'(1, 2)$, between the distributions of the linear function. Comparison with the divergence, $J(1, 2)$, between $\pi_1$ and $\pi_2$ offers a measure of the discriminating efficiency of the linear function, since $J(1, 2) \geq J'(1, 2)$. The divergence, a special case of which is Mahalanobis's Generalized Distance, is defined in terms of a measure of information which is essentially that of Shannon and Wiener. Appropriate assumptions about $\pi_1$ and $\pi_2$ lead to discriminant analysis (Sections 4, 7), principal components (Section 5), and canonical correlations (Section 6).
"An Application of Information Theory to Multivariate Analysis." Ann. Math. Statist. 23 (1) 88 - 102, March, 1952. https://doi.org/10.1214/aoms/1177729487