The Annals of Statistics

Generalized Pearson-Fisher Chi-Square Goodness-of-Fit Tests, with Applications to Models with Life History Data

Abstract

Suppose that $X_1,\ldots,X_n$ are i.i.d. $\sim F$, and we wish to test the null hypothesis that $F$ is a member of the parametric family $\mathscr{F} = \{F_\theta(x); \theta \in \Theta\}$ where $\Theta \subset \mathbb{R}^q$. The classical Pearson-Fisher chi-square test involves partitioning the real axis into $k$ cells $I_1,\ldots, I_k$ and forming the chi-square statistic $X^2 = \sum^k_{i = 1}(O_i - nF_{\hat{\theta}}(I_i))^2/nF_{\hat{\theta}}(I_i)$, where $O_i$ is the number of observations falling into cell $i$ and $\hat{\theta}$ is the value of $\theta$ minimizing $\sum^k_{i = 1}(O_i - nF_\theta(I_i))^2/nF_\theta(I_i)$. We obtain a generalization of this test to any situation for which there is available a nonparametric estimator $\hat{F}$ of $F$ for which $n^{1/2}(\hat{F} - F) \rightarrow_d W$, where $W$ is a continuous zero mean Gaussian process satisfying a mild regularity condition. We allow the cells to be data dependent. Essentially, we estimate $\theta$ by the value $\hat{\theta}$ that minimizes a "distance" between the vectors $(\hat{F}(I_1),\ldots,\hat{F}(I_k))$ and $(F_\theta(I_1),\ldots, F_\theta(I_k))$, where distance is measured through an arbitrary positive definite quadratic form, and then form a chi-square type test statistic based on the difference between $(\hat{F}(I_1),\ldots,\hat{F}(I_k))$ and $(F_{\hat{\theta}}(I_1),\ldots, F_{\hat{\theta}}(I_k))$. We prove that this test statistic has asymptotically a chi-square distribution with $k - q - 1$ degrees of freedom, and point out some errors in the literature on chi-square tests in survival analysis. Our procedure is very general and applies to a number of well-known models in survival analysis, such as right censoring and left truncation. We apply our method to deal with questions of model selection in the problem of estimating the distribution of the length of the incubation period of the AIDS virus using the CDC's data on blood-transfusion related AIDS. Our analysis suggests some models that seem to fit better than those used in the literature.
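
As a rough illustration of the classical Pearson-Fisher statistic recalled in the abstract (not the generalized test of the paper), the following is a minimal sketch. It assumes complete i.i.d. data, fixed cells, and an exponential family $F_\theta(x) = 1 - e^{-\theta x}$ with $q = 1$; the function names, the cut points, and the simulated sample are all hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2


def cell_counts(x, cuts):
    """O_i: number of observations in each of the k = len(cuts) + 1 cells."""
    idx = np.searchsorted(cuts, x, side="right")  # cell index of each point
    return np.bincount(idx, minlength=len(cuts) + 1)


def cell_probs(theta, cuts):
    """F_theta(I_i) under the ASSUMED exponential model with rate theta."""
    cdf = 1.0 - np.exp(-theta * np.asarray(cuts, dtype=float))
    return np.diff(np.concatenate(([0.0], cdf, [1.0])))


def chi_square(theta, observed, cuts):
    """sum_i (O_i - n p_i)^2 / (n p_i), with p_i = F_theta(I_i)."""
    expected = observed.sum() * cell_probs(theta, cuts)
    return float(np.sum((observed - expected) ** 2 / expected))


def pearson_fisher_test(x, cuts, q=1):
    """Minimum chi-square estimate of theta and the resulting test."""
    observed = cell_counts(np.asarray(x, dtype=float), cuts)
    # theta_hat minimizes the chi-square criterion over an assumed search range.
    res = minimize_scalar(chi_square, bounds=(1e-3, 20.0), method="bounded",
                          args=(observed, cuts))
    df = len(observed) - q - 1  # k - q - 1 degrees of freedom
    return res.x, res.fun, df, chi2.sf(res.fun, df)


# Simulated data (hypothetical): exponential with true rate 2.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=0.5, size=500)
cuts = [0.1, 0.3, 0.6, 1.0]  # k = 5 cells, chosen arbitrarily
theta_hat, stat, df, p_value = pearson_fisher_test(sample, cuts)
print(f"theta_hat={theta_hat:.3f}, X^2={stat:.3f}, df={df}, p={p_value:.3f}")
```

With censored or truncated data, the raw cell proportions $O_i/n$ would no longer be appropriate; the abstract's generalization replaces them with $\hat{F}(I_i)$ from a suitable nonparametric estimator and uses a positive definite quadratic form as the distance, which this sketch does not attempt.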

Article information

Source
Ann. Statist., Volume 21, Number 2 (1993), 772-797.

Dates
First available in Project Euclid: 12 April 2007

Permanent link to this document
https://projecteuclid.org/euclid.aos/1176349151

Digital Object Identifier
doi:10.1214/aos/1176349151

Mathematical Reviews number (MathSciNet)
MR1232519

Zentralblatt MATH identifier
0788.62020
