## The Annals of Mathematical Statistics

### Optimal Spacing in Regression Analysis

#### Abstract

When a response (or dependent) variable $y$ can be observed for a continuous range of values of the independent variable $x$, which is at the control of the experimenter, the question arises as to how a given number of observations should be spaced. It will be assumed that $x$ is measurable without error and that $y$ differs from the true response function $f(x)$ by a random term $z$ with mean zero and constant variance $\sigma^2$. We suppose that the aim of the experimenter is to estimate $f(x)$, or possibly the mean response $\overline{f(x)}$, on the basis of $n$ observations $(x_i, y_i)$.

Various aspects of this problem of optimal spacing have been studied for the case where $f(x)$ is known apart from some parameters (see e.g. Elfving [3], Chernoff [1], de la Garza [2], and Kiefer and Wolfowitz [8]). However, the functional form of $f(x)$ is often unknown or only approximately known. In the absence of a specific model to the contrary, polynomial approximations to $f(x)$ provide a convenient approach.

Section 2 deals briefly with the non-statistical case $\sigma = 0$ when the problem of choosing $n$ abscissae in order to approximate to $f(x)$ by a polynomial of degree $n - 1$ reduces to one of optimum interpolation and that of integrating $f(x)$ reduces to Gaussian quadrature. For a fuller account of this part see Hildebrand [5] or Kopal [6].

If the response contains a random element, a polynomial of degree $n - 1$ or less may be fitted to the $n$ observations by least squares. The error of approximation will now be due, in general, both to random error and the use of an incorrect approximating function. We confine ourselves to the case of fitting a straight line when the true response, while roughly linear, may contain a quadratic component. Two criteria are considered in arriving at the two abscissae resulting in an optimal fit. The first of these criteria ((3.2) below) has also been discussed in a recent paper by Box and Draper [7] who have extended its use to the case of several independent variables.

It is shown in Section 6 that for $x$-values symmetrically spaced about the centre of the region of interest nothing is gained in fitting a straight line by the use of more than two such abscissae. These optimal abscissae are determined in Sections 3 and 4. The emphasis of the present approach is on attaining an optimal straight line fit with a small number of observations, rather than on detecting departures from linearity. For the latter purpose more than two abscissae would, of course, be needed, but the number of observations required may well be uneconomically large. In Section 7 comparisons with some other simple spacings are made.

As an illustration, consider the calibration of a large number of instruments for a range of $x$ in which $f(x)$ is known to be approximately linear. In this case adequate accuracy may be attainable by the use of two observations only. If $\sigma$ is not negligibly small several observations may be taken at each of two appropriately selected settings, especially if it is much easier to repeat measurements at a given setting than to turn to a new one (compare de la Garza [2]). An example illustrating the methods proposed is given in Section 8.
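
The trade-off described in the abstract, between random error and the error from fitting a line to a slightly curved response, can be made concrete with a small sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the region of interest is scaled to $[-1, 1]$, one observation is taken at each of $x = -a$ and $x = +a$, the true response is $b_0 + b_1 x + b_2 x^2$, and the criterion is the mean squared error of the fitted line averaged over the region, which may or may not coincide with the paper's criterion (3.2). The function names `averaged_mse` and `best_half_spacing` are hypothetical.

```python
import math


def averaged_mse(a, b2, sigma):
    """Mean squared error of the fitted line, averaged over x in [-1, 1].

    Assumed setup (illustrative, not from the paper): one observation at
    x = -a and one at x = +a, true response b0 + b1*x + b2*x**2, and
    independent errors with variance sigma**2.
    """
    # The line through the two points has bias b2 * (a**2 - x**2) at x.
    # Averaging the squared bias over [-1, 1] gives the expression below.
    squared_bias = b2**2 * (a**4 - 2.0 * a**2 / 3.0 + 0.2)
    # The variance of the fitted value at x is sigma**2/2 * (1 + x**2/a**2);
    # its average over [-1, 1] uses mean(x**2) = 1/3.
    variance = 0.5 * sigma**2 * (1.0 + 1.0 / (3.0 * a**2))
    return squared_bias + variance


def best_half_spacing(b2, sigma, steps=20000):
    """Grid search for the a in (0, 1] that minimises averaged_mse."""
    grid = [(i + 1) / steps for i in range(steps)]
    return min(grid, key=lambda a: averaged_mse(a, b2, sigma))


if __name__ == "__main__":
    # No random error: only the bias term matters.
    print(best_half_spacing(b2=1.0, sigma=0.0))
    # With random error the variance term pushes the points outward.
    print(best_half_spacing(b2=1.0, sigma=0.5))
```

Under these assumptions the case $\sigma = 0$ gives $a = 1/\sqrt{3} \approx 0.577$, which is the node of the two-point Gauss-Legendre rule on $[-1, 1]$; as $\sigma$ grows relative to $b_2$, the minimising $a$ moves toward the ends of the interval.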

#### Article information

Source
Ann. Math. Statist., Volume 30, Number 4 (1959), 1072-1081.

Dates
First available in Project Euclid: 27 April 2007

Permanent link to this document
https://projecteuclid.org/euclid.aoms/1177706091

Digital Object Identifier
doi:10.1214/aoms/1177706091

Mathematical Reviews number (MathSciNet)
MR110161

Zentralblatt MATH identifier
0104.37802
