Some Tests for Categorical Data

V. P. Bhapkar

doi:10.1214/aoms/1177705140

Ann. Math. Statist. 32(1): 72-83 (March, 1961). DOI: 10.1214/aoms/1177705140

Abstract

We shall be concerned with experimental data given in the form of frequencies in cells determined by a multiway cross-classification, with predefined categories along each way of classification. Roy and Bhapkar [10] have posed hypotheses which might be considered generalizations, appropriate to this setup, of the usual hypotheses in classical "normal" univariate "fixed effects" analysis of variance, "normal" multivariate "fixed effects" analysis of variance, and analysis of various kinds of "normal" independence. Large sample tests for such hypotheses are offered here; they are based on the $\chi^2$-test of Karl Pearson [8]. The general probability model is that of a product of several multinomial distributions. According as the marginal frequencies along any dimension are held fixed or left free, that dimension is said to be associated with a "factor" or a "response" (or variable). The probability model is \begin{equation*}\tag{1}\prod_j \left[\frac{n_{oj}!}{\prod_i n_{ij}!} \prod_i p_{ij}^{n_{ij}}\right],\end{equation*} where $\sum_i p_{ij} \equiv p_{oj} = 1$ and $\sum_i n_{ij} \equiv n_{oj}$ is held fixed. Thus $i$ refers to categories of the response while $j$ refers to categories of the factor. $n_{oj}$ denotes the preassigned sample size for the $j$th factor-category, out of which $n_{ij}$ observations happen to lie in the $i$th response-category. It should be noticed that $i$ may be a multiple subscript, say $i_1, i_2, \cdots, i_k$; $j$ also may be a multiple subscript, say $j_1, j_2, \cdots, j_l$. We then speak of a $k$-response (or $k$-variate) and $l$-factor problem. According as a set of real numbers is or is not associated with the categories along any way of classification (factor or response), that way of classification will be said to be structured or unstructured.
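As an illustration of model (1), the following minimal sketch evaluates the logarithm of the product-multinomial probability for a small table. The function name `product_multinomial_logpmf` and the list-of-lists layout (one inner list per factor-category $j$, holding the counts $n_{ij}$ over response-categories $i$) are assumptions made for this example only, and every $p_{ij}$ paired with a positive count is assumed to be strictly positive.

```python
from math import lgamma, log, exp

def product_multinomial_logpmf(counts, probs):
    """Log of model (1).

    counts[j][i] = n_ij and probs[j][i] = p_ij, where j indexes the
    factor-categories (fixed totals n_oj) and i the response-categories.
    Layout and name are illustrative assumptions, not from the paper.
    """
    total = 0.0
    for n_j, p_j in zip(counts, probs):
        n_oj = sum(n_j)                      # fixed sample size for category j
        total += lgamma(n_oj + 1)            # log n_oj!
        for n_ij, p_ij in zip(n_j, p_j):
            total -= lgamma(n_ij + 1)        # minus log n_ij!
            if n_ij > 0:
                total += n_ij * log(p_ij)    # n_ij * log p_ij
    return total

# Two factor-categories, two response-categories each.
counts = [[1, 1], [2, 0]]
probs = [[0.5, 0.5], [0.5, 0.5]]
prob = exp(product_multinomial_logpmf(counts, probs))
```

Working in logarithms via `lgamma` avoids overflow in the factorials when the sample sizes $n_{oj}$ are large.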
It is well known (for example, Neyman [6]) that if a hypothesis $H_o$ is given in the form of certain constraints on the $p_{ij}$'s, then a large sample test statistic of $H_o$ under the model (1) is a $\chi^2$ statistic given by $\sum_{ij} (n_{ij} - n_{oj}\hat p_{ij})^2/(n_{oj}\hat p_{ij}),$ or a $\chi^2_1$ statistic given by $\sum_{ij} (n_{ij} - n_{oj}\hat p_{ij})^2/n_{ij}$, where the $\hat p_{ij}$'s form any set of BAN estimates [6]. In the particular case when the constraints are linear in the $p$'s, the method of minimum $\chi^2_1$ permits a reduction of the problem to the solution of a system of linear equations and hence is more convenient. Reiersol [9] considers binomial experiments and makes use of results of Neyman [6] to determine tests for hypotheses appropriate to factorial experiments. Mitra [5] not only generalizes Reiersol's theorems to multinomial experiments, but also avoids his restriction that the parameter-sets in the different linear forms occurring in the hypothesis be nonoverlapping. We shall prove theorems to cover the cases that cannot be treated by these theorems. In Section 2, the $\chi^2_1$ statistic based on the minimum $\chi^2_1$ estimates is obtained to test linear hypotheses. It is further shown that, when $H_o$ specifies linear functions of the $p$'s as known linear functions of some unknown parameters, the $\chi^2_1$ statistic, based on the minimum $\chi^2_1$ estimates, is exactly the same as the minimum sum of squares of residuals obtained by a certain general least squares technique to estimate the unknown parameters. This is then applied to derive test criteria appropriate to various hypotheses proposed in [3] and [10].
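The two statistics above, and the remark that minimum $\chi^2_1$ reduces to linear equations under linear constraints, can be sketched for one simple case. The example below is not taken from the paper: it assumes a single factor and a single response, takes as hypothesis the homogeneity condition $p_{ij} = \pi_i$ for all $j$ (linear in the $p$'s), and requires every $n_{ij} > 0$ so that $\chi^2_1$ is defined. The function names and the array layout (`n[i, j]` $= n_{ij}$) are hypothetical. Setting the derivative of the Lagrangian to zero gives $2 a_i \pi_i + \lambda = 2\sum_j n_{oj}$ with $a_i = \sum_j n_{oj}^2 / n_{ij}$, which together with $\sum_i \pi_i = 1$ is a linear system.

```python
import numpy as np

def pearson_chi2(n, p_hat):
    """sum_ij (n_ij - n_oj p_ij)^2 / (n_oj p_ij), with n[i, j] = n_ij."""
    n = np.asarray(n, dtype=float)
    expected = n.sum(axis=0) * np.asarray(p_hat, dtype=float)
    return float(((n - expected) ** 2 / expected).sum())

def chi2_1(n, p_hat):
    """sum_ij (n_ij - n_oj p_ij)^2 / n_ij; requires all n_ij > 0."""
    n = np.asarray(n, dtype=float)
    expected = n.sum(axis=0) * np.asarray(p_hat, dtype=float)
    return float(((n - expected) ** 2 / n).sum())

def min_chi2_1_homogeneity(n):
    """Minimum chi^2_1 estimate of pi under the assumed hypothesis
    p_ij = pi_i for all j, found by solving the linear stationarity system."""
    n = np.asarray(n, dtype=float)
    r = n.shape[0]
    n_oj = n.sum(axis=0)
    a = (n_oj ** 2 / n).sum(axis=1)          # a_i = sum_j n_oj^2 / n_ij
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(2.0 * a)             # 2 a_i pi_i ...
    K[:r, r] = 1.0                           # ... + lambda = 2 N
    K[r, :r] = 1.0                           # sum_i pi_i = 1
    rhs = np.append(np.full(r, 2.0 * n_oj.sum()), 1.0)
    return np.linalg.solve(K, rhs)[:r]

n = np.array([[10.0, 20.0],
              [30.0, 40.0]])                 # rows: response i, columns: factor j
pi_min = min_chi2_1_homogeneity(n)           # minimum chi^2_1 estimate
pi_pooled = n.sum(axis=1) / n.sum()          # pooled proportions (maximum likelihood)
stat_min = chi2_1(n, pi_min[:, None])
stat_pooled = chi2_1(n, pi_pooled[:, None])
stat_pearson = pearson_chi2(n, pi_pooled[:, None])
```

In this special case the linear system has the closed-form solution $\pi_i \propto 1/a_i$, and `stat_min` can be no larger than `stat_pooled`, since the pooled proportions also satisfy the constraints.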