Size constrained unequal probability sampling with a non-integer sum of inclusion probabilities

More than 50 methods have been developed to draw unequal probability samples with fixed sample size. All these methods require the sum of the inclusion probabilities to be an integer number. There are cases, however, where the sum of desired inclusion probabilities is not an integer. Then, classical algorithms for drawing samples cannot be directly applied. We present two methods to overcome the problem of sample selection with unequal inclusion probabilities when their sum is not an integer and the sample size cannot be fixed. The first one consists in splitting the inclusion probability vector. The second method is based on extending the population with a phantom unit. For both methods the sample size is almost fixed, and equal to the integer part of the sum of the inclusion probabilities or this integer plus one. AMS 2000 subject classifications: Primary 62D05.


Introduction
Unequal probability sampling with fixed sample size is an intricate problem.At least 50 methods are described in Brewer and Hanif (1983) and Tillé (2006) to draw an unequal probability sample with fixed sample size.All these methods assume the sum of the inclusion probabilities to be an integer number.There are cases, however, where this sum is not an integer.Two main examples where the sum of inclusion probabilities is not an integer are given below.A first example is that of sampling with probabilities proportional to size from a population divided into domains.Inclusion probabilities within domains often do not sum to integer numbers.Consequently, when one wants to control the sample size within domains, one usually uses rounding algorithms to obtain integer sample sizes for all domains, maintain the requested total sample size, and then use a stratified sampling algorithm to select the sample.However, this can become problematic when there is a large number of domains with small expected sample sizes, as is commonly the case in business surveys.In this kind of survey, the population is usually divided into size classes and activity sectors.Results are published by economic sectors of activity, according to a classification defined within each country.Size classes, in terms of revenue or of number of employees, is a secondary domain of interest for publications.More importantly, business sizes have to be taken into account in order to build an efficient sampling design.Sample sizes are then controlled for cells that are the intersection of these attributes.This is done in order to ensure that fixed sample sizes are respected both at the size class level and at the activity sector level, and also that further aggregations will not be hindered by accidentally empty sample cells.The proportionality relation between inclusion probabilities and business sizes is degraded by the large number of roundings and the resulting relative deviation from the original inclusion probabilities, that can be important for small domains.Another example is that of bootstrap procedures.Antal and Tillé (2011) have proposed a bootstrap method where units are re-sampled from the initial sample using the original inclusion probabilities.However, the sum of these probabilities within the bootstrap sample is usually not integer.
The case where the sum of inclusion probabilities is not an integer number was recently discussed by Bondesson and Grafström (2011).They proposed a generalization of the Sampford (1967) method to the case where the sum of the inclusion probabilities is not an integer.In their solution, the selection of one unit of the population is dealt with in a special way, and the final sample size is equal to the integer directly below the sum of inclusion probabilities, or to the integer directly above it.In this paper, we describe general solutions to overcome the problem when the sum of the inclusion probabilities is not an integer.All fixed size sampling designs can be, through these solutions, generalized to inclusion probabilities that do not sum to an integer.We give practical procedures to do so, and in particular to implement a maximum entropy design (see Hájek, 1981).
The paper is organized as follows.In Section 2 we present the first method based on splitting the inclusion probability vector into two new inclusion probability vectors.For this method we present two different algorithms for the splitting.One is based on the πps procedure for calculating inclusion probabilities and the other one allows sampling with maximum entropy.Differences between these two splitting algorithms are illustrated with a small example.The second method, based on an augmented population, is presented in Section 3. Estimation, with a small example, is shortly treated in Section 4. We comment on some applications in Section 5. Finally, in Section 6 we discuss the interest of the different methods.

Splitting into two fixed size designs
Assign a number π k ∈ [0, 1] to all units k of a finite population U , and suppose that one wants to randomly select a subset s of U with inclusion probabilities contained in the vector π = (π k ) k∈U .The value η = k∈U π k gives the expected size of the selected sample.When η is an integer, there exist many methods (see Brewer and Hanif, 1983;Tillé, 2006) of selection whereby only samples of size η can be selected.
When η is not an integer, we may want to use a method that enables us to select a sample with a size that is close to η while respecting the inclusion probabilities π k .More precisely, the size of the selected sample should be either equal to n or n + 1, where n is the integer such that n ≤ η < n + 1.We are looking for implementations of probability distributions P on the subsets s of U such that where |s| is the cardinal of s, and I k (s) = 1 if k ∈ s and 0 otherwise is the sample membership indicator function.These constraints imply that A first possible solution is to describe all probability distributions that satisfy conditions (2.1) and (2.2) through the splitting method developed by Deville and Tillé (1998).These distributions are obtained by constructing two vectors denoted by π where q = η − n.Once these vectors are computed, a realization r of a Bernoulli variable with parameter q is generated.If r = 1, a sample s of size n + 1 is drawn from U using any fixed-size method with inclusion probabilities (π + k ) k∈U .Similarly, if r = 0, a sample s of size n is drawn from U using any fixed-size method with inclusion probabilities (π − k ) k∈U .It is easy to see that this procedure gives a solution to our problem, and that all probability distributions that are solutions can be found through this procedure.We give in Subsections 2.2 and 2.3 two different methods to compute vectors π + and π − .The first one mimics the usual probability proportional to size computation of inclusion probabilities (πps procedure); the second one allows implementation of the maximum entropy sampling design.
However, suitable vectors (π − k ) k∈U and (π + k ) k∈U can be found through the usual procedure for probability proportional to size sampling (see Särndal et al., 1992, p. 89).Here the size measure is the original inclusion probability vector π.For some units with large inclusion probabilities π k , the quantities (n + 1)π k /η may be larger than one.The standard procedure is then to assign to these units an inclusion probability equal to one, and to compute proportional to size inclusion probabilities for the remaining units, repeating the operation several times if necessary.The solution can also be found directly in a few steps given in Algorithm 2.1.

Algorithm 2.1 (Direct computation of a πps-inclusion probability vector).
Here, the size measure is the value of π k , the desired sample size is n + 1 and the obtained inclusion probabilities are π + k , k ∈ U .One gives the general solution to the computation of a πps inclusion probability vector substituting n + 1 with a suitable size.

Order the population units so that
We then define π − k = (π k − qπ + k )/(1 − q), for all k ∈ U .As stated in Proposition 2.2, vectors π + k and π − k enjoy the required properties and can be used to implement a size constrained sampling design with average size η.

Computation for maximum entropy sampling design
Consider that each sample s of U is represented by a vector s in R N , N = |U |, whose components are equal to the sample membership indicators, so that s k = I k (s), for all k ∈ U .Let S denote the set of all possible samples of any size 1 ≤ n ≤ N of U , and Q be a subset of S. A maximum entropy sampling design p(.) on Q is a probability distribution whose support is equal to Q and that maximizes the entropy function, given by Some of the most frequently used sampling designs, such as simple random sampling, stratified sampling and Poisson sampling, are maximum entropy designs.The use of maximum entropy or high entropy sampling designs is commonly deemed to be desirable on the grounds that these designs retain a high level of randomness, even when they are subject to size and inclusion probabilities constraints.Properties of high entropy sampling designs are discussed amongst others in Hájek (1964); Berger (1998); Brewer and Donadio (2003) and Qualité (2008).If the sampling design is constrained to a vector of given inclusion probabilities π, then (see Darroch and Ratcliff, 1972) where α(Q, λ) = s∈Q exp(λ ′ s), I(s ∈ Q) is equal to 1 if s ∈ Q and to 0 otherwise, and the vector-parameter λ ∈ R N is such that the inclusion probabilities (2.6)Such a vector always exists when π lies in the interior of the convex hull of Q (see Brown, 1986, p.74).Degenerate cases where π is on the boundary of this set can be accounted for by cautiously allowing some coordinates λ k to take infinite values.A fast algorithm that allows to compute π from λ and reciprocally is described amongst others in Tillé (2006, pp. 82-83) in the case of the maximum entropy sampling designs with fixed sample size, also called conditional Poisson sampling.
In the present case, the support Q is the set of all samples of size n, denoted by S n added to the set of all samples of size n + 1, denoted by S n+1 .Hence we have that α(Q, λ) = α(S n , λ) + α(S n+1 , λ).
We also have that We can thus write the natural decomposition: It follows that maximum entropy sampling with given inclusion probabilities π and support Q is a mixture of maximum entropy sampling on support S n and of maximum entropy sampling on support S n+1 with the same vector-parameter λ.
Usual numerical approximation procedures to obtain a suitable vector λ from π (see for example Chen et al., 1994;Aires, 2000;Deville, 2000;Tillé, 2006, p.83) can easily be adapted to the case where Q = S n ∪S n+1 .Indeed, these procedures are based on a simplified Newton algorithm to find a solution λ of π − π(n, λ) = 0, where π(n, λ) = s∈Sn s • p(s, S n , λ), and the fact that we can explicitly compute conditional inclusion probabilities π(n, λ) for any integer n and parameter λ.In order to derive a suitable λ from π, we can use the same idea to find a solution of Once λ is obtained, we compute inclusion probability vectors π − = π(n, λ) and π + = π(n+1, λ).These vectors automatically satisfy Conditions (2.3), (2.4) and (2.5).Moreover, with this method, inequalities π − k ≤ π k ≤ π + k hold for all k in U as the inclusion probabilities of conditional Poisson sampling with a given parameter λ increase with the size of the samples (see for example Hájek, 1981).Any fixed-size design can then be used with these vectors, or, having already computed λ, we can very rapidly draw a sample from the maximum entropy sampling distribution.

An example
The two proposed algorithms for the splitting are illustrated in Table 1.This table contains the vectors π − and π + computed for a population of N = 10 units with strongly dispersed inclusion probabilities π whose sum is equal to η = 5.5.
On this example, with very heterogeneous inclusion probabilities, the algorithms give relatively different results.For more homogeneous inclusion probabilities, it is expected that the results would have been closer.Indeed, output vectors π − and π + are, with both methods, continuous functions of π, and for equal inclusion probabilities, both algorithms give exactly the same results.Each method has its advantages.On one hand, the πps method is easy to implement, requires very few computations, and provides vectors π − and π + that remain proportional to π (and thus to the auxiliary information used initially to compute π) except for some units for which π + k is equal to 1. On the other hand, when all π k are in (0, 1), the maximum entropy method does not assign values equal to 1 to elements of the vector π + and allows to implement a maximum entropy design.It is, however, computationally intensive, and can comfortably be used on populations of up to only a few thousand units.

Second general solution through an augmented population
Instead of splitting the probability vector in two and calculating π − and π + , another method to solve the problem consists of extending the population by adding a supplementary phantom unit labeled N + 1.This unit receives the inclusion probability So, with the added phantom unit, the sum of all inclusion probabilities has the integer value n + 1.Now, a sampling design is obtained by selecting a sample of size n+1 from the augmented population, and considering the induced marginal sampling design on the true population U .Thus the real sample size is n if the phantom unit is selected and n+1 if the phantom unit is not selected.Ignoring the phantom unit does not affect the inclusion probabilities for units 1, 2, . . ., N .Thus the inclusion probabilities π k are satisfied.Sampling designs obtained through this method are usually different from those obtained through the methods of Section 2. One exception is given in Proposition 3.1.
Proposition 3.1.If the method of augmented population is applied with a maximum entropy design, then the sampling design is the same as in Section 2.3.
A proof is given in Appendix.The method of augmented population is thus a simple way to implement the maximum entropy design when the sum of the inclusion probabilities is not integer, with available implementations of fixed size maximum entropy sampling.

Estimation
Usually we are interested in estimating the total of a study variable y, which takes a fixed value y k for unit k.The population total Y = k∈U y k can be estimated without bias by the well known Horvitz-Thompson estimator If the π k 's do not sum to an integer, and the sample size cannot be fixed, then it may be more efficient to condition on the realized sample size.For the first method (Section 2), this corresponds to conditioning on the outcome of the random choice between π − k and π + k .If the conditional inclusion probabilities ) have been calculated, then we may use the conditional estimator depending on the realized sample size n or n + 1. Estimator ŶC is unbiased and also unbiased conditional on the actual sample size.The Horvitz-Thompson estimator ŶHT is not unbiased conditional on the sample size.In this situation we should also use a conditional variance estimator adapted to ŶC .Indeed, we then have that var ŶC = E var ŶC |s| + var E ŶC |s| , (4.3)However, ŶC is unbiased conditional on the size of s.It follows that the last term in Equation 4.3 is null and that any unbiased estimator of the variance of ŶC conditional on size is also an unbiased estimator of the variance of ŶC .The gain in efficiency by using ŶC instead of ŶHT can be rather substantial.We illustrate this by an example.The population details are given in Table 2.
In Table 2 we have used the maximum entropy design for calculation of π − k and π + k and also for sample selection.The full list of samples, their probabilities and the value of the estimators ŶHT and ŶC are given in Table 3.For this example we get that V ( ŶHT ) = 46.14 and V ( ŶC ) = 5.01.For samples of size n = 2, ŶHT is negatively biased and for samples of size n = 3, ŶHT is positively biased.

Applications
As a variance reduction technique, we may want to divide the population into rather small non-overlapping strata and make sure that the sample size varies Table 3 The list of all possible samples, their probabilities and the value of the estimators ŶHT and ŶC for the population given in as little as possible within the strata.With the proposed methods, this can easily be achieved.If the inclusion probabilities sum to an integer for the entire population, then it is also possible to coordinate the strata sample sizes to have a fixed overall sample size.The procedure is as follows.
Let there be H strata and let η h = k∈U h π k , n h < η h < n h + 1, where H h=1 η h = n and n is an integer.Some strata should get sample size n h and some n h + 1. Stratum h should get sample size n h + 1 with probability q h = η h −n h .As H h=1 q h = n− H h=1 n h is an integer (say m), any fixed size sampling design on {1, . . ., H}, with inclusion probabilities q 1 , q 2 , . . ., q H , can be used to select m strata that will get sample sizes n h + 1.For each selected stratum we calculate π + h and for the non-selected strata we calculate π − h .Now we can apply a fixed-size unequal probability design within each domain, with these inclusion probability vectors, to select our sample.By this procedure we respect the initial inclusion probabilities, have a minimum variance in the domain sample sizes, and a fixed overall sample size.
Another important application is that of bootstrap methods for a finite population.Antal and Tillé (2011) have shown that if the sample is selected without replacement with unequal inclusion probabilities, the bootstrap method must take the sampling design into account.They propose a two-phase bootstrap procedure.In the first phase, a set of units is selected once in the bootstrap sample with the same unequal probabilities as the original sample.In the second phase, the units that are not selected in the first phase are resampled with equal probabilities with replacement.However, during the first phase, the sum of the inclusion probabilities in the sample is not integer.For this first phase, the authors needed to have sampling procedures that can be used when inclusion probabilities do not sum to an integer.The authors used a size-constrained sampling design as described in Section 2.

Discussion
Once vectors π + and π − are computed and randomly chosen, any known fixedsize sampling method can be applied on the selected vector, which makes these procedures very general.Both solutions of Section 2 and 3 cover all probability distributions that satisfy Conditions (2.1) and (2.2).However, only a limited number of fixed size sampling procedures are actually implemented.Using these sampling procedures on an augmented population as in Section 3 or after a trial, with updated inclusion probabilities as in Section 2 usually leads to different sampling designs.Maximum entropy sampling, is a notable exception where both methods coincide to the same sampling design.
The advantage of the splitting technique is that π − k and π + k are calculated and we may use the estimator ŶC .The advantage of using the method of augmented population with a phantom unit is that it does not require calculation of π − k and π + k .Thus it is easier to implement, but if the conditional inclusion probabilities are not calculated, we may not use the more efficient estimator ŶC .For some designs it is however possible to calculate the conditional inclusion probabilities and use the more efficient estimator even if the sample was selected by the approach with an augmented population.Chen et al., 1994).Since p(•) has fixed size, such a vector λ is defined up to an additive constant.We can thus force λN+1 = 0.If λ is the vector that holds the N first coordinates of λ, we then have that p(s) = exp(λ ′ s) s∈Sn exp(λ ′ s) + s∈Sn+1 exp(λ ′ s) .

Table 1
Computation of π − k and π + k from π k by means of the πps-method and the maximum entropy method

Table 2
Population of size N = 5 used to exemplify the differences between ŶC and ŶHT