## The Annals of Mathematical Statistics

### Snowball Sampling

Leo A. Goodman

#### Abstract

An $s$ stage $k$ name snowball sampling procedure is defined as follows: A random sample of individuals is drawn from a given finite population. (The kind of random sample will be discussed later in this section.) Each individual in the sample is asked to name $k$ different individuals in the population, where $k$ is a specified integer; for example, each individual may be asked to name his "$k$ best friends," or the "$k$ individuals with whom he most frequently associates," or the "$k$ individuals whose opinions he most frequently seeks," etc. (For the sake of simplicity, we assume throughout that an individual cannot include himself in his list of $k$ individuals.) The individuals who were not in the random sample but were named by individuals in it form the first stage. Each of the individuals in the first stage is then asked to name $k$ different individuals. (We assume that the question asked of the individuals in the random sample and of those in each stage is the same and that $k$ is the same.) The individuals who were not in the random sample nor in the first stage but were named by individuals who were in the first stage form the second stage. Each of the individuals in the second stage is then asked to name $k$ different individuals. The individuals who were not in the random sample nor in the first or second stages but were named by individuals who were in the second stage form the third stage. Each of the individuals in the third stage is then asked to name $k$ different individuals. This procedure is continued until each of the individuals in the $s$th stage has been asked to name $k$ different individuals. The data obtained using an $s$ stage $k$ name snowball sampling procedure can be utilized to make statistical inferences about various aspects of the relationships present in the population. 
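
The staging rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the paper: the function name `snowball_stages`, the dictionary `names` (mapping each individual to the $k$ individuals he would name), and the six-person toy population are all hypothetical.

```python
def snowball_stages(names, initial_sample, s):
    """Return [stage 0, stage 1, ..., stage s] as a list of sets.

    names          -- hypothetical dict: individual -> list of the k individuals named
    initial_sample -- the random sample (the "zero stage")
    s              -- number of stages beyond the initial sample
    """
    stages = [set(initial_sample)]
    seen = set(initial_sample)        # everyone placed in the sample or a stage so far
    for _ in range(s):
        named = set()
        for person in stages[-1]:     # interview the most recent stage
            named.update(names[person])
        new_stage = named - seen      # drop anyone already in the sample or an earlier stage
        stages.append(new_stage)
        seen |= new_stage
    return stages


# Toy population of six individuals with k = 2 (illustrative values only).
names = {
    0: [1, 2], 1: [0, 3], 2: [0, 1],
    3: [4, 5], 4: [3, 5], 5: [3, 4],
}
stages = snowball_stages(names, initial_sample={0}, s=2)
print(stages)
```

With the initial sample `{0}` and `s=2`, the first stage is `{1, 2}` and the second stage is `{3}`. The individuals in the last stage are still interviewed; their answers are `names[person]` for each `person` in `stages[-1]`, even though nobody they name forms a further stage.
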
The relationships present, in the hypothetical situation where each individual in the population is asked to name $k$ different individuals, can be described by a matrix with rows and columns corresponding to the members of the population, rows for the individuals naming and columns for the individuals named, where the entry $\theta_{ij}$ in the $i$th row and $j$th column is 1 if the $i$th individual in the population includes the $j$th individual among the $k$ individuals he would name, and it is 0 otherwise. While the matrix of the $\theta$'s cannot be known in general unless every individual in the population is interviewed (i.e., asked to name $k$ different individuals), it will be possible to make statistical inferences about various aspects of this matrix from the data obtained using an $s$ stage $k$ name snowball sampling procedure. For example, when $s = k = 1$, the number, $M_{11}$, of mutual relationships present in the population (i.e., the number of values $i$ with $\theta_{ij} = \theta_{ji} = 1$ for some value of $j > i$) can be estimated.

The methods of statistical inference applied to the data obtained from an $s$ stage $k$ name snowball sample will of course depend on the kind of random sample drawn as the initial step. In most of the present paper, we shall suppose that a random sample (i.e., the "zero stage" in a snowball sample) is drawn so that the probability, $p$, that a given individual in the population will be in the sample is independent of whether a different given individual has appeared. This kind of sampling has been called binomial sampling; the specified value of $p$ (assumed known) has been called the sampling fraction [4]. This sampling scheme might also be described by saying that a given individual is included in the sample just when a coin, which has a probability $p$ of "heads," comes up "heads," where the tosses of the coin from individual to individual are independent. 
(To each individual there corresponds an independent Bernoulli trial determining whether he will or will not be included in the sample.) This sampling scheme differs in some respects from the more usual models where the sample size is fixed in advance or where the ratio of the sample size to the population size (i.e., the sample size-population size ratio) is fixed. For binomial sampling, this ratio is a random variable whose expected value is $p$. (The variance of this ratio approaches zero as the population becomes infinite.) In some situations (where, for example, the variance of this ratio is near zero), mathematical results obtained for binomial sampling are quite similar to results obtained using some of the more usual sampling models (see [4], [7]; compare the variance formulas in [3] and [5]); in such cases it will often not make much difference, from a practical point of view, which sampling model is utilized. (In Section 6 of the present paper some results for snowball sampling based on an initial sample of the more usual kind are obtained and compared with results presented in the earlier sections of this paper obtained for snowball sampling based on an initial binomial sample.)

For snowball sampling based on an initial binomial sample, and with $s = k = 1$, so that each individual asked names just one other individual and there is just one stage beyond the initial sample, Section 2 of this paper discusses unbiased estimation of $M_{11}$, the number of pairs of individuals in the population who would name each other. One of the unbiased estimators considered (among a certain specified class of estimators) has uniformly smallest variance when the population characteristics are unknown; this one is based on a sufficient statistic for a simplified summary of the data and is the only unbiased estimator of $M_{11}$ based on that sufficient statistic (when the population characteristics are unknown). 
This estimator (when $s = k = 1$) has a smaller variance than a comparable minimum variance unbiased estimator computed from a larger random sample when $s = 0$ and $k = 1$ (i.e., where only the individuals in the random sample are interviewed) even where the expected number of individuals in the larger random sample $(s = 0, k = 1)$ is equal to the maximum expected number of individuals studied when $s = k = 1$ (i.e., the sum of the expected number of individuals in the initial sample and the maximum expected number of individuals in the first stage). In fact, the variance of the estimator when $s = 0$ and $k = 1$ is at least twice as large as the variance of the comparable estimator when $s = k = 1$ even where the expected number of individuals studied when $s = 0$ and $k = 1$ is as large as the maximum expected number of individuals studied when $s = k = 1$. Thus, for estimating $M_{11}$, the sampling scheme with $s = k = 1$ is preferable to the sampling scheme with $s = 0$ and $k = 1$. Furthermore, we observe that when $s = k = 1$ the unbiased estimator based on the simplified summary of the data, which has minimum variance when the population characteristics are unknown, can be improved upon in cases where certain population characteristics are known, or where additional data not included in the simplified summary are available. Several improved estimators are derived and discussed.

Some of the results for the special case of $s = k = 1$ are generalized in Sections 3 and 4 to deal with cases where $s$ and $k$ are any specified positive integers. In Section 5, results are presented about $s$ stage $k$ name snowball sampling procedures, where each individual asked to name $k$ different individuals chooses $k$ individuals at random from the population. (Except in Section 5, the numbers $\theta_{ij}$, which form the matrix referred to earlier, are assumed to be fixed (i.e., to be population parameters); in Section 5, they are random variables. A variable response error is not considered except in so far as Section 5 deals with an extreme case of this.)

For social science literature that discusses problems related to snowball sampling, see [2], [8], and the articles they cite. This literature indicates, among other things, the importance of studying "social structure and...the relations among individuals" [2].
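
To make the quantity $M_{11}$ and the role of the sampling fraction $p$ concrete, the following Python sketch counts mutual pairs in a toy population with $k = 1$ and checks, by exact enumeration over all possible binomial samples, that a simple inverse-probability estimator is unbiased for $s = 0$ and for $s = 1$. It is an illustration only and is not claimed to be the estimator derived in Section 2 of the paper: the function names, the dictionary `names` (each individual mapped to the one individual he names), the integer labels, and the value $p = 0.3$ are all assumptions. The sketch counts only mutual pairs with at least one member in the initial sample, so a pair is counted with probability $p^2$ when $s = 0$ and $1 - (1 - p)^2$ when $s = 1$.

```python
from itertools import combinations


def count_mutual_pairs(names):
    """M_11: number of i with names[i] == j and names[j] == i for some j > i."""
    return sum(1 for i, j in names.items() if i < j and names[j] == i)


def estimate_m11(names, sample, p, s):
    """Inverse-probability estimate of M_11 for s = 0 or s = 1 (k = 1)."""
    sample = set(sample)
    # Answers actually collected: the sample, plus the first stage when s = 1.
    interviewed = set(sample)
    if s == 1:
        interviewed |= {names[i] for i in sample}
    answers = {i: names[i] for i in interviewed}
    # Mutual pairs seen in the answers with at least one member in the sample.
    found = {
        frozenset((i, j))
        for i, j in answers.items()
        if answers.get(j) == i and (i in sample or j in sample)
    }
    inclusion = p**2 if s == 0 else 1 - (1 - p)**2
    return len(found) / inclusion


def expected_estimate(names, p, s):
    """Exact expectation over every possible binomial sample."""
    people = list(names)
    total = 0.0
    for r in range(len(people) + 1):
        for sample in combinations(people, r):
            prob = p**r * (1 - p)**(len(people) - r)
            total += prob * estimate_m11(names, sample, p, s)
    return total


# Toy population (illustrative): 0 and 1 name each other, as do 2 and 3.
names = {0: 1, 1: 0, 2: 3, 3: 2, 4: 0, 5: 4}
print(count_mutual_pairs(names))          # 2 mutual pairs
print(expected_estimate(names, 0.3, 0))   # expectation when s = 0
print(expected_estimate(names, 0.3, 1))   # expectation when s = 1
```

Both expectations agree with the true count of 2 up to floating-point rounding. When $s = 1$, a mutual pair is revealed as soon as either member falls in the initial sample, because the other member is then interviewed either as a sample member or as part of the first stage.
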

#### Article information

Source
Ann. Math. Statist. Volume 32, Number 1 (1961), 148-170.

Dates
First available in Project Euclid: 27 April 2007

Permanent link to this document
http://projecteuclid.org/euclid.aoms/1177705148
