Two-Sample Tests for Multivariate Distributions

Lionel Weiss

doi:10.1214/aoms/1177705995

Ann. Math. Statist. 31(1): 159-164 (March, 1960). DOI: 10.1214/aoms/1177705995

Abstract

$X(1), X(2), \cdots, X(m), Y(1), Y(2), \cdots, Y(n)$ are independent $k$-variate random variables. The distribution of $X(i)$ has pdf $f(x)$, say, where $x$ denotes a $k$-dimensional vector throughout this paper, and the distribution of $Y(j)$ has pdf $g(x)$, say. We assume that $f(x)$ and $g(x)$ are piecewise continuous, and that each has a finite upper bound, which it is not necessary to specify. Denote by $2R_i$ the distance from $X(i)$ to the nearest of the points $X(1), \cdots, X(i - 1), X(i + 1), \cdots, X(m)$, and denote by $S_i$ the number of points $Y(1), \cdots, Y(n)$ contained in the open sphere $\{x: | x - X(i) | < R_i\}$. Clearly, the joint distribution of $S_i, S_j$ is the same as the joint distribution of $S_{i'}, S_{j'}$, for any subscripts with $i \neq j, i' \neq j'$.

Let $r$ be a non-negative integer, and $\alpha$ any fixed positive value. $Q(r)$ denotes the Lebesgue integral

$$\int_{E_k} \frac{2^k \alpha f^2 (x)\lbrack g(x) \rbrack^r}{\lbrack g(x) + 2^k\alpha f(x) \rbrack^{r + 1}} dx,$$

where $E_k$ denotes Euclidean $k$-space. We will show that

$$\lim_{m \rightarrow \infty, m/n = \alpha} P_{m, n}\lbrack S_1 = s_1, S_2 = s_2\rbrack = Q(s_1)Q(s_2),$$

for any non-negative integers $s_1,s_2$, the approach being uniform in $s_1,s_2$. Thus, in the limit $S_1, S_2$ are independently distributed, with

$$\lim_{m \rightarrow \infty, m/n = \alpha} P_{m, n}\lbrack S_1 = s_1\rbrack = Q(s_1).$$

In [1], which discussed the univariate case, $S_i$ was defined as the number of $Y$'s closer to $X(i)$ than to any other $X$ to their right. In the present paper, $S_i$ is defined as the number of $Y$'s in another neighborhood of $X(i)$. Our present definition of $S_i$ does not become for $k = 1$ the same as the definition of $S_i$ in [1]. Rather, in the univariate case, our present definition of $S_i$ is the number of $Y$'s lying within a distance $R_i$ on either side of $X(i)$.
However, if $\lim_{m \rightarrow \infty, m/n = \alpha} P_{m, n}\lbrack S_1 = s_1, S_2 = s_2\rbrack$ is computed for the univariate case using the definition of $S_i$ given in [1], the only way in which it differs from $Q(s_1)Q(s_2)$ is that $\alpha$ is replaced by $\alpha/2$. Thus it seems reasonable to treat the $S_i$ as defined here as $k$-dimensional analogues of the $S_i$ as defined in [1], at least for large samples. An intuitive reason for $\alpha$ being replaced by $\alpha/2$ is that in our present case, $\sum^m_{i = 1} S_i$ may be less than $n$, whereas in [1] this sum must always equal $n$. Thus in our present case, we are in a sense discarding some of the $Y$'s, which lowers $n$ relative to $m$ and thus raises $\alpha$ by a certain factor (2, as it happens). In our present case, $\sum S_i$ may be less than $n$ because the $R_i$ are chosen to make the spheres around the $X$'s non-overlapping, thus simplifying the analysis. The $R_i$ were chosen to give the largest possible non-overlapping spheres because it would seem intuitively that the larger the spheres, the more rapid the approach of the probabilities to their limiting values.
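
As an illustration of the definitions of $R_i$ and $S_i$ given in the abstract, the following is a minimal brute-force sketch in Python. It is not taken from the paper. The function names `weiss_counts` and `q_equal_densities`, and the choice of a uniform distribution on the unit cube for the demonstration, are assumptions made here. The second function covers only the special case $f = g$, in which the integrand of $Q(r)$ simplifies to $2^k\alpha f(x) / (1 + 2^k\alpha)^{r+1}$, so that $Q(r) = \beta/(1 + \beta)^{r+1}$ with $\beta = 2^k\alpha$, a geometric probability function.

```python
import math
import random


def weiss_counts(xs, ys):
    """Return [S_1, ..., S_m] for samples xs and ys (sequences of k-tuples).

    Hypothetical helper name; brute force, O(m^2 + m*n) distance evaluations.
    """
    if len(xs) < 2:
        raise ValueError("need at least two X points to define a nearest neighbour")
    counts = []
    for i, x in enumerate(xs):
        # 2 * R_i is the distance from X(i) to the nearest of the other X's.
        nearest = min(math.dist(x, other) for j, other in enumerate(xs) if j != i)
        radius = nearest / 2.0
        # S_i counts the Y's in the open sphere of radius R_i centred at X(i).
        counts.append(sum(1 for y in ys if math.dist(x, y) < radius))
    return counts


def q_equal_densities(r, k, alpha):
    """Q(r) under the assumption f = g: beta / (1 + beta)^(r + 1), beta = 2^k * alpha."""
    beta = (2 ** k) * alpha
    return beta / (1.0 + beta) ** (r + 1)


if __name__ == "__main__":
    # Demonstration under the assumption that both samples are uniform on the
    # unit square, so f = g. Edge effects and finite m make the agreement
    # with the limiting values only approximate.
    rng = random.Random(0)
    k, m, n = 2, 400, 400  # alpha = m / n = 1
    xs = [tuple(rng.random() for _ in range(k)) for _ in range(m)]
    ys = [tuple(rng.random() for _ in range(k)) for _ in range(n)]
    s = weiss_counts(xs, ys)
    for r in range(3):
        empirical = sum(1 for v in s if v == r) / m
        print(r, round(empirical, 3), round(q_equal_densities(r, k, m / n), 3))
```

The strict inequality in the count mirrors the open sphere in the definition, and because the spheres are non-overlapping, `sum(weiss_counts(xs, ys))` can never exceed `len(ys)`, in line with the remark above that $\sum S_i$ may be less than $n$.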

Citation

Lionel Weiss. "Two-Sample Tests for Multivariate Distributions." Ann. Math. Statist. 31 (1) 159 - 164, March, 1960. https://doi.org/10.1214/aoms/1177705995