A High Dimensional Two Sample Significance Test
The classical multivariate two-sample significance test based on Hotelling's $T^2$ is undefined when the number $k$ of variables exceeds the number of within-sample degrees of freedom available for estimating variances and covariances. Adding an a priori Euclidean metric to the affine $k$-space assumed by the classical method leads to an alternative approach to the same problem. A test statistic $F$, the ratio of two mean square distances, is proposed, and three methods of attaching a significance level to $F$ are described. The third method is considered in detail and leads to a "non-exact" significance test in which the null-hypothesis distribution of $F$ depends, to a good approximation, on a single unknown parameter $r$, for which an estimate must be substituted. Approximate distribution theory leads to two independent estimates of $r$ based on nearly sufficient statistics, and these may be combined to yield a single estimate. A test of $F$ nominally at the 5% level, but based on an estimate of $r$ rather than $r$ itself, has a true significance level that is a function of $r$. This function is investigated and shown to be quite near 5%. The sensitivity of the test to a parameter measuring statistical distance between population means is discussed, and it is shown that arbitrarily small differences in each individual variable can result in a detectable overall difference, provided the number of variables (or, more precisely, $r$) can be made sufficiently large. This sensitivity discussion has stated implications for the a priori choice of metric in $k$-space. Finally, a geometrical description of the case of large $r$ is presented.
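The abstract describes $F$ only as a ratio of two mean square distances, so the following is a minimal sketch of one plausible form rather than the paper's exact definition. It assumes the identity (ordinary Euclidean) metric, takes the numerator to be the scaled squared distance between the two sample mean vectors, and takes the denominator to be the pooled within-sample squared distance per degree of freedom. The function name `mean_square_ratio` and the scaling constants are assumptions made for illustration.

```python
def mean_square_ratio(sample_a, sample_b):
    """Between-sample over within-sample mean square Euclidean distance.

    sample_a, sample_b: lists of observations, each a list of k numbers.
    Assumption: identity metric and the scaling below; illustrative only.
    """
    n1, n2 = len(sample_a), len(sample_b)
    k = len(sample_a[0])

    # Coordinate-wise sample means.
    mean_a = [sum(row[j] for row in sample_a) / n1 for j in range(k)]
    mean_b = [sum(row[j] for row in sample_b) / n2 for j in range(k)]

    # Between-sample term: scaled squared distance between mean vectors (1 df).
    between = (n1 * n2 / (n1 + n2)) * sum(
        (a - b) ** 2 for a, b in zip(mean_a, mean_b)
    )

    # Within-sample term: pooled squared distances to each sample's own mean,
    # divided by the n1 + n2 - 2 within-sample degrees of freedom.
    within_ss = sum((row[j] - mean_a[j]) ** 2 for row in sample_a for j in range(k))
    within_ss += sum((row[j] - mean_b[j]) ** 2 for row in sample_b for j in range(k))
    within = within_ss / (n1 + n2 - 2)

    return between / within


# k = 3 variables but only 2 within-sample degrees of freedom.
a = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
b = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]
print(mean_square_ratio(a, b))
```

Because the ratio uses only squared distances and never inverts a covariance matrix, it remains computable when $k$ exceeds the within-sample degrees of freedom, which is the situation where $T^2$ is undefined. Attaching a significance level to the ratio still requires the parameter $r$ discussed in the abstract, which this sketch does not estimate.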
Permanent link to this document: http://projecteuclid.org/euclid.aoms/1177706437
Digital Object Identifier: doi:10.1214/aoms/1177706437
Mathematical Reviews number (MathSciNet): MR112207
Zentralblatt MATH identifier: 0226.62014