Surprise sampling: Improving and extending the local case-control sampling

Abstract: Fithian and Hastie [7] proposed a sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency through a clever adjustment tailored to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve higher sampling probability if they carry more information or appear "surprising", in the sense of, for example, a large pilot prediction error or a large absolute score. Compared with the relevant existing sampling schemes, as reported in [7] and [1], the proposed one has several advantages. It adaptively yields the optimal form for a variety of objectives, including the LCC design and that of [1] as special cases. Under the same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on the full data. We present theoretical justifications for the claimed advantages and for the optimality of the estimation and the sampling design. Different from [1], our large sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning, since it essentially only requires a specific loss function and no response-covariate structure of the data. Numerical studies are carried out and provide evidence in support of the theory.


Introduction
Nowadays, with the rapid development of data capturing and storage techniques, people meet huge amounts of data in various fields. With the growth of data size, computational capacity becomes more and more crucial for efficient data analysis. Despite significant progress in computer hardware, computational ability may still become a major constraint for data analysis when the data size is sufficiently large. For instance, it might be very time and resource consuming to try a variety of competing models instead of only fitting one or two predetermined models, to refit the model from time to time as new observations arrive continuously, or to apply data partition and reuse techniques such as cross-validation, bootstrapping, bagging, and so on. All these computationally intensive procedures require tremendous computational costs, especially for large data sets.
A simple approach to reduce the computational cost is to draw a subsample from the full data set and then analyze the subsample. The easiest subsampling design is to draw the data uniformly. However, uniform subsampling can be very inefficient for some data structures. For instance, in classification problems with two classes (one class of positive examples called "cases" and the other of negative examples called "controls"), when the classes are imbalanced, that is, one of the classes (usually the cases) is rare and the other is dominant, uniform subsampling is unfavorable because it ignores the unequal importance of the data points. For imbalanced data, case-control sampling is a well-known subsampling design [4,11,12]. It draws uniform subsamples from each of the two classes with different sampling percentages. Often comparable numbers of cases and controls are sampled, yielding a subsample with no obvious imbalance, to increase the efficiency of estimation. [2] and [14] showed that by fitting a logistic model on the case-control subsample and then making a simple adjustment, one can still get a consistent estimator of the regression parameters if the logistic model is correctly specified. Thus, case-control designs can help reduce the computational burden while retaining satisfactory estimation efficiency under an imbalanced structure. Besides classification, imbalanced structures also arise for other data types. A survey of predictive modeling under various imbalanced distributions is provided by [3].
Subsampling designs have also received attention in epidemiological cohort studies. The cohorts to follow up usually involve a great number of subjects, and the occurrence of a certain disease is of interest. In most cases, the occurrence rate of the disease is low during the entire follow-up time, so the cohort is essentially imbalanced. Meanwhile, the collection of covariate information on all involved subjects can be very expensive and time consuming because of the large cohort size. Hence, subsampling designs were developed to save sampling cost and time. The widely studied ones include the case-cohort design [13], the case-control design, and the nested case-control design [15]. More general case-cohort designs aiming to improve estimation efficiency were developed later by [5,6,19]. The sampling probability of all these subsampling designs depends on the observed follow-up times and the censoring indicators, but not on the covariates. These designs are usually used for covariate ascertainment, that is, the covariates are observed only when the subject is selected into the subsample.
Statistical models are usually imposed for data analysis, but in real applications the models are easily misspecified. For some machine learning approaches, models are just used to derive meaningful loss functions, without regard to whether they are correctly specified or not. Under these circumstances, the target parameter becomes a certain population risk minimizer corresponding to the loss function used [9]. When the model is possibly misspecified, robust analysis procedures are often preferred. However, as [7] pointed out, in classification problems, when the logistic model is misspecified, standard case-control sampling cannot provide a consistent estimator of the target parameter. To overcome this drawback, [7] proposed the local case-control (LCC) sampling design, which depends not only on the class label, but also on the predictors and a pilot estimate (a pilot estimate is a "good" guess of the target parameter). Their design provides a clever way to remedy imbalance locally throughout the feature space and possesses potential robustness against model mis-specification. Moreover, an elegant LCC estimate mimicking the full sample maximum likelihood estimate (MLE) is proposed. They showed that when the logistic model is correctly specified and the pilot is consistent and independent of the full data, the asymptotic variance of their proposed estimate is twice the variance of the full data MLE; when the logistic model is misspecified, the proposed estimate is still consistent and asymptotically normal as long as the pilot is consistent and independent of the data.
The LCC design aims to simultaneously speed up computation and provide a simple procedure to obtain a good estimate even under model mis-specification, but the authors did not discuss the optimality of the design. More recently, [17] modified the LCC design to obtain an optimal subsampling method that minimizes the conditional asymptotic mean squared error (MSE) of the subsample estimator given the full data. Later on, [1] extended the method to generalized linear models with canonical links. However, they only considered the optimization criterion regarding the conditional MSE. Other non-uniform subsampling designs for the linear model include [10] and [16].
In this paper, we propose an improved subsampling design which accommodates various types of statistical learning objectives and includes the LCC sampling and that of [1] as special cases. For estimating the target parameter based on the subsample, we apply Horvitz-Thompson (HT) type estimation [8]. The new sampling design is derived by optimizing certain well-defined criteria, such as prediction accuracy, estimation accuracy, or MSE. For each criterion, the proposed design adapts its corresponding form. Basically, it draws a data point with a large pilot prediction error or a large score into the subsample with higher probability. Given a pilot estimate, such a data point is in a certain sense unusual, or "surprising", for that pilot, so we call the proposed design the surprise sampling design.
The advantages of the proposed surprise sampling are summarized as follows.
(i) The surprise sampling design is derived from well-defined objectives. For a specific objective, the corresponding design is optimal. Meanwhile, the proposed design flexibly adapts to varying objectives; that is, for different objectives, one can derive the optimal designs adaptively. In contrast, the objective of the LCC design is not clearly defined, and its optimality is not discussed in [7]. [1] and [17] only considered the objective of minimizing the conditional MSE.
(ii) The proposed estimators are always consistent and asymptotically normal, regardless of the correctness of the model specification and the consistency of the pilot estimate. The consistency of the estimator in [7] requires the consistency of the pilot. The pilots used in [1] and [17] are also consistent.
(iii) If the pilot estimate is consistent and the logistic model is correctly specified, the proposed estimator is no worse than that in [7], in the sense that they have the same asymptotic efficiency under the LCC sampling.
(iv) The validity of the proposed estimation procedure does not require the pilot estimate to be independent of the full data, while [7] requires this independence. This relaxation is useful in applications, especially when there is no other data source and one needs to get the pilot from the full data.
(v) The large sample properties derived for the proposed estimators are population-wise (unconditional), while the parallel large sample properties in [1] and [17] are developed data-wise (conditional on the full data).
(vi) The proposed approach applies not only to supervised learning such as classification and regression but also to unsupervised learning tasks, because it essentially only requires a well-defined loss function with a finite-dimensional parameter. Hence the scope of the approach extends beyond the logistic model and other generalized linear models.
We proceed as follows. Section 2 introduces the notation and describes the problem setting. In Section 3, we present the main idea of the proposed surprise sampling, and the specific forms of the sampling design are derived to reach various objectives, such as best prediction accuracy and best estimation accuracy. Section 4 gives the large sample properties of the HT type estimator under the surprise sampling design. In Section 5, extensive simulation studies are carried out to show the effectiveness of the proposed approach. The results of real data analysis are provided in Section 6. Section 7 concludes. All the technical details are collected in the Appendix.

Notation and problem setting
Suppose the full data contain n subjects and let d_i be the observed data point for the i-th subject. We assume that d_i, i = 1, ..., n, are independent and identically distributed (i.i.d.) copies of a random element D whose distribution stands for the population. For supervised learning problems, D can be decomposed into a response, denoted by Y, and a q-dimensional predictor or covariate vector, denoted by X. Correspondingly, the observed data points d_i = (y_i, x_i), i = 1, ..., n, are i.i.d. copies of D = (Y, X). We aim to predict Y through X, or learn the regression function f(x) = E(Y | X = x), based on the observed data. Usually, a model is imposed on f(x), characterized by a p-dimensional parameter, denoted by θ throughout the paper, and a corresponding loss function is minimized to obtain an estimate of θ. Write the loss function as l(d; θ), where d = (y, x). The loss function can take the form of squared loss, negative log-likelihood, hinge loss, Huber loss, etc. Some classical choices include logistic regression with $l(d; \theta) = -y(\alpha + \beta^\top x) + \log(1 + \exp(\alpha + \beta^\top x))$ for a binary response, where $\theta = (\alpha, \beta^\top)^\top$; the Poisson log-linear model with $l(d; \theta) = -y(\alpha + \beta^\top x) + \exp(\alpha + \beta^\top x)$ for a count response; the linear model with squared loss $l(d; \theta) = (y - \alpha - \beta^\top x)^2/\sigma^2$ for a continuous response, where $\theta = (\alpha, \beta^\top, \sigma)^\top$; and also nonlinear models such as neural networks with squared loss or cross-entropy loss. For unsupervised learning tasks, there is no response-covariate structure of the data. Typically, we derive a loss function l(d; θ) for a certain purpose and aim to minimize the loss with respect to θ. For instance, in the geometric view of principal component analysis (PCA), the goal is to find a k-dimensional affine space in $\mathbb{R}^q$ that best approximates the n examples in terms of Euclidean distance. Parameterizing the affine space by $\alpha + U\beta$, where U consists of k columns of an orthogonal basis of the space, we end up with the loss function $l(d; \theta) = \|d - (\alpha + U\beta)\|^2$, where $\theta = \{\alpha, U, \beta\}$ and $\|\cdot\|$ stands for the Euclidean norm.
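As a concrete reference, the three classical losses above can be written down directly. The following is a small Python/NumPy sketch; the function names are ours, not the paper's:

```python
import numpy as np

def logistic_loss(y, x, alpha, beta):
    """Negative log-likelihood of one observation under the logistic model."""
    t = alpha + x @ beta
    return -y * t + np.log1p(np.exp(t))

def poisson_loss(y, x, alpha, beta):
    """Negative log-likelihood (up to a term free of theta) under the Poisson log-linear model."""
    t = alpha + x @ beta
    return -y * t + np.exp(t)

def pca_loss(d, alpha, U, beta):
    """Squared Euclidean distance from d to the point alpha + U @ beta of the affine space."""
    return np.sum((d - (alpha + U @ beta)) ** 2)
```

Minimizing the sum of any of these losses over the full data gives the full-sample risk minimizer discussed next.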
The target parameter that we aim to estimate is the so-called population risk minimizer, which minimizes the population risk $R(\theta) = \mathrm{E}[l(D; \theta)]$, that is, $\theta^* = \arg\min_{\theta} R(\theta)$. In supervised learning, when the regression model is correctly specified, that is, there exists some $\theta_0$ satisfying $f(x) = f_{\theta_0}(x)$, it is easy to see that $\theta_0 = \theta^*$. However, as we mentioned in Section 1, in many real applications f(x) does not satisfy the specified model; this is the so-called model mis-specification. Meanwhile, in some unsupervised learning tasks such as PCA, there is no imposed model. It is well known that whether or not the model specification holds, $\theta^*$ is well defined under general conditions. The full sample version of the risk minimizer, denoted by $\hat{\theta}^*$, is given by $\hat{\theta}^* = \arg\min_{\theta} \sum_{i=1}^n l(d_i; \theta)$. Under suitable regularity conditions, one can show that $\hat{\theta}^*$ is consistent for $\theta^*$, that is, $\hat{\theta}^*$ converges in probability to $\theta^*$ [9].
When n is sufficiently large, subsampling designs are preferred to save computational cost. For each i = 1, ..., n, let $\Delta_i$ be a 0-1 valued indicator of whether the i-th data point is sampled into the subsample ($\Delta_i = 1$) or not ($\Delta_i = 0$). Let $\tilde{\theta}$ be a pilot estimate. As we mentioned in Section 1, a pilot estimate is a guess of $\theta^*$. It can be obtained from a pilot sample that is either a subset of the full data or some other sample such as a historical sample. Let $\pi_i$ be the conditional sampling probability of the i-th data point given the full data and the pilot estimate, that is, $\pi_i = P(\Delta_i = 1 \mid d_1, \ldots, d_n; \tilde{\theta})$. The $\Delta_i$'s are generated independently of each other given the full data and the pilot.
For a binary response, the LCC design proposed by [7] takes $\pi_i = |y_i - p(\tilde{\alpha} + \tilde{\beta}^\top x_i)|$, where $p(t) = \exp(t)/[1 + \exp(t)]$ and $(\tilde{\alpha}, \tilde{\beta})$ is a pilot estimate of $(\alpha, \beta)$. After obtaining the subsample, they fit a logistic regression to the subsample and then make a simple adjustment to get the estimator of the target parameter. The main advantage of their estimation procedure is that it basically maintains the original form of maximum likelihood estimation with the full data. However, as we mentioned in Section 1, the validity of their approach heavily relies on the consistency of the pilot estimate, and it has not been extended to other regression models or more general learning tasks. We instead apply HT type estimation. Specifically, the HT type estimator is given by

$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \frac{\Delta_i}{\pi_i}\, l(d_i; \theta). \qquad (2.1)$

HT type estimation is general enough for various kinds of loss functions. The computational complexity of (2.1) is similar to that with the full data. For instance, if the objective function based on the full data is convex, so is the objective function in (2.1). Meanwhile, by inverting the sampling probability, it is quite intuitive to expect the consistency of the HT type estimator regardless of the consistency of the pilot estimate. More importantly, based on HT type estimation, we can derive an optimal form of the subsampling design for a specific objective. The details of the derivation are discussed in the following section.
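To make the procedure concrete, the following Python sketch draws an LCC-type subsample from synthetic imbalanced logistic data and computes an HT type estimate by minimizing the inverse-probability-weighted logistic loss with Newton-Raphson. The data-generating values, sample sizes, and the small clipping floor on the probabilities are illustrative choices of ours, not the paper's:

```python
import numpy as np

def fit_weighted_logistic(Z, y, w, theta0=None, n_iter=25):
    """Minimize sum_i w_i * [-y_i t_i + log(1 + exp(t_i))], t = Z theta,
    by Newton-Raphson; w plays the role of the HT weights Delta_i / pi_i."""
    theta = np.zeros(Z.shape[1]) if theta0 is None else theta0.copy()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        grad = Z.T @ (w * (p - y))
        H = (Z * (w * p * (1.0 - p))[:, None]).T @ Z
        theta -= np.linalg.solve(H, grad)
    return theta

rng = np.random.default_rng(0)

# Synthetic imbalanced logistic data (a toy stand-in for the paper's setting).
n, q = 20000, 5
X = rng.standard_normal((n, q))
theta_true = np.concatenate(([-3.0], np.ones(q)))   # intercept makes cases rare
Z = np.column_stack([np.ones(n), X])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ theta_true)))

# Pilot from a small uniform subsample.
idx = rng.choice(n, 2000, replace=False)
pilot = fit_weighted_logistic(Z[idx], y[idx], np.ones(2000))

# LCC-type sampling probabilities, the subsample, and the HT type estimate (2.1).
pi = np.clip(np.abs(y - 1.0 / (1.0 + np.exp(-Z @ pilot))), 1e-6, 1.0)
delta = rng.binomial(1, pi)
theta_ht = fit_weighted_logistic(Z, y, delta / pi, theta0=pilot)
```

Only the sampled points (those with `delta == 1`) contribute to the weighted fit, yet the inverse weighting keeps the objective an unbiased estimate of the full-sample one.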

Surprise sampling design
To give the specific form of the proposed subsampling design, we first heuristically present the asymptotic properties of the HT type estimator $\hat{\theta}$. Let $g(d; \theta) = \partial l(d; \theta)/\partial \theta$. By the definition of $\hat{\theta}$, it is easy to see that $\sum_{i=1}^n \Delta_i\, g(d_i; \hat{\theta})/\pi_i = 0$. Under suitable conditions, we can show that $\hat{\theta}$ is consistent for $\theta^*$ and that

$\sqrt{n}(\hat{\theta} - \theta^*) = -A^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\Delta_i}{\pi_i}\, g(d_i; \theta^*) + o_p(1), \qquad (3.1)$

where $A = \mathrm{E}[G(D; \theta^*)]$ and $G(d; \theta) = \partial g(d; \theta)/\partial \theta^\top$. Furthermore, we can show that $\sqrt{n}(\hat{\theta} - \theta^*)$ converges in distribution to a Gaussian vector, denoted by Z, with mean zero and variance-covariance matrix $A^{-1} V_\pi A^{-1}$, where $V_\pi = \mathrm{E}[g(D; \theta^*)^{\otimes 2}/\pi]$ and $\pi$ is a probability that may depend on D and $\theta^*$ (i.e., $\pi = \pi(D; \theta^*)$). We give sufficient conditions that guarantee the above results in Section 4.

Overall prediction accuracy
If the main purpose of the data analysis is prediction, the minimized population prediction error $R(\theta^*) = \mathrm{E}[l(D; \theta^*)]$ is a natural baseline against which to measure prediction accuracy. Based on a subsampling probability π and the corresponding HT type estimator $\hat{\theta}$, the prediction error can be measured by $R(\hat{\theta})$, which is R(θ) evaluated at $\theta = \hat{\theta}$. By Taylor expansion, we have that

$R(\hat{\theta}) - R(\theta^*) = \tfrac{1}{2}(\hat{\theta} - \theta^*)^\top A\, (\hat{\theta} - \theta^*) + o_p(n^{-1}). \qquad (3.2)$

The leading term in (3.2), scaled by n, converges in distribution to $Z^\top A Z/2$ according to the asymptotic property of $\hat{\theta}$. By some calculation, we have that

$\mathrm{E}(Z^\top A Z)/2 = \mathrm{tr}(A^{-1/2} V_\pi A^{-1/2})/2 = \mathrm{E}\big[\|A^{-1/2} g(D; \theta^*)\|^2/\pi\big]/2. \qquad (3.3)$

It is then natural to select π to minimize (3.3), which can be treated as the (scaled) average difference between the prediction error based on the subsample and the minimal population prediction error. The following proposition, proved in Appendix A.1, gives the optimal form of π.
Proposition 1. Let r ∈ (0, 1) be a constant. The optimal π to minimize (3.3), subject to $\mathrm{E}(\pi) \le r$, is given by

$\pi = \big(c\,\|A^{-1/2} g(D; \theta^*)\|\big) \wedge 1,$

where c is such that E(π) = r and, for constants a and b, a ∧ b = min{a, b}.
The constant r controls the subsampling rate. Note that the optimal π depends on some unknown quantities, such as A and $\theta^*$. Given a pilot estimate $\tilde{\theta}$, we propose the subsampling design

$\pi_i = \big(c\,\|\tilde{A}^{-1/2} g(d_i; \tilde{\theta})\|\big) \wedge 1, \qquad (3.4)$

where $\tilde{A}$ is a plug-in estimate of A and c is a constant selected to reach the predetermined subsampling rate. For a given subsampling rate r, an algorithm based on the bisection method to get the largest c such that $n^{-1}\sum_{i=1}^n \pi_i \le r$ is given in Section 3.3. The proposed subsampling probability is proportional to the standardized score evaluated at the pilot. When this value is relatively large for a data point, the data point has a large prediction error under the given pilot estimate; that is, it is somewhat more "surprising" than the ones with smaller score values. Under such a design, a data point with a larger score value is more likely to be drawn into the subsample. Thus, we call the proposed subsampling design "surprise" sampling, or score sampling.
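For the logistic loss, the kernel $\|\tilde{A}^{-1/2} g(d_i; \tilde{\theta})\|$ in (3.4) can be computed in a vectorized way. The sketch below uses the plug-in choice $\tilde{A} = n^{-1}\sum_i \tilde{p}_i(1-\tilde{p}_i) z_i z_i^\top$, which is our assumption of a natural plug-in, and a Cholesky factor in place of the matrix square root:

```python
import numpy as np

def prediction_kernel(Z, y, pilot):
    """Compute pi_tilde_i = ||A_tilde^{-1/2} g(d_i; pilot)|| for the logistic loss,
    where g(d; theta) = (p - y) z and A_tilde is a plug-in Hessian estimate."""
    p = 1.0 / (1.0 + np.exp(-Z @ pilot))
    G = (p - y)[:, None] * Z                            # score rows g(d_i; pilot)'
    A = (Z * (p * (1.0 - p))[:, None]).T @ Z / len(y)   # plug-in A_tilde
    L = np.linalg.cholesky(A)                           # A = L L'
    # ||A^{-1/2} g||^2 = g' A^{-1} g = ||L^{-1} g||^2, so any square root works
    return np.linalg.norm(np.linalg.solve(L, G.T), axis=0)
```

The sampling probabilities are then $\pi_i = (c \cdot \text{kernel}_i) \wedge 1$ with c tuned to the target rate r.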

Estimation accuracy
If the main concern is the estimation accuracy of $v^\top \hat{\theta}$ as an estimator of $v^\top \theta^*$, where v is a given p-dimensional constant vector, then the objective becomes to give an efficient subsampling scheme that increases the estimation accuracy. Without loss of generality, assume that $\|v\| = 1$. From (3.1), we have that $\sqrt{n}\,v^\top(\hat{\theta} - \theta^*)$ converges in distribution to a Gaussian variable with mean zero and variance $v^\top A^{-1} V_\pi A^{-1} v$. We then select π to minimize this asymptotic variance.
Proposition 2. Let r ∈ (0, 1) be a constant. The optimal π to minimize $v^\top A^{-1} V_\pi A^{-1} v$, subject to $\mathrm{E}(\pi) \le r$, is given by

$\pi = \big(c\,|v^\top A^{-1} g(D; \theta^*)|\big) \wedge 1,$

where c is such that E(π) = r.
Given a pilot $\tilde{\theta}$, the proposed subsampling design is given by

$\pi_i = \big(c\,|v^\top \tilde{A}^{-1} g(d_i; \tilde{\theta})|\big) \wedge 1, \qquad (3.5)$

where c is a constant selected to reach the predetermined subsampling rate. Again, this design is proportional to a projected score evaluated at $\tilde{\theta}$.
To elaborate the relationship between the proposed surprise sampling and the LCC sampling, we discuss the case of generalized linear models with D = (Y, X) and d_i = (y_i, x_i). Let $Z = (1, X^\top)^\top$ and let θ be the collection of all the regression parameters (i.e., the coefficients corresponding to X plus the intercept). For a generalized linear model, the conditional probability or density function of Y given X = x, denoted by p(y | x; θ), can be written in the form $\psi(y, \theta^\top z)$, where ψ is a known function up to a finite-dimensional parameter and $z = (1, x^\top)^\top$. A natural choice of the loss function is the negative log-likelihood $l(y, x; \theta) = -\log \psi(y, \theta^\top z)$. Define $S(y, t) = \frac{\partial}{\partial t}\psi(y, t)/\psi(y, t)$. Then it is easy to see that $g(y, x; \theta) = -S(y, \theta^\top z)\, z$. When the model is correctly specified, there exists $\theta_0 = \theta^*$, known as the true parameter value. Let $v = \mathrm{E}(\sigma^2_Z Z)$, where $\sigma^2_Z = \mathrm{Var}(Y \mid Z)$ under $\theta_0$.

Proposition 3. The optimal π to minimize the asymptotic variance of $v^\top \hat{\theta}$ is given by $\pi = c\,|S(Y, \theta_0^\top Z)| \wedge 1$, where c is a constant controlling the subsampling rate.
Based on Proposition 3, given a pilot estimate $\tilde{\theta}$, we propose the subsampling design

$\pi_i = c\,|S(y_i, \tilde{\theta}^\top z_i)| \wedge 1.$

For the logistic model with a binary response, we have $\psi(y, t) = p(t)^y (1 - p(t))^{1-y}$ and $S(y, t) = y - p(t)$. Then the proposed subsampling design becomes $\pi_i = c\,|y_i - p(\tilde{\theta}^\top z_i)| \wedge 1$, which is exactly the LCC sampling proposed by [7]. Thus, the LCC sampling can be viewed as a special case of the surprise sampling design, with a somewhat narrow objective: to achieve optimality in estimating a certain direction of the regression parameter vector. Surprise sampling, however, is better justified, since if the data analysis objective is altered, the sampling design can be adaptively modified. We show in the next section that if the logistic model is correctly specified and the pilot estimate is consistent, the proposed HT type estimator $\hat{\theta}$ has the same asymptotic efficiency as the LCC estimator of [7].
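The identity S(y, t) = y − p(t) behind this reduction is easy to verify numerically. Below, a hypothetical helper `S_numeric` differentiates log ψ by central differences and is compared against the closed form:

```python
import numpy as np

def p(t):
    """Logistic link p(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + np.exp(-t))

def S_numeric(y, t, eps=1e-6):
    """Numerical derivative d/dt log psi(y, t) for the Bernoulli likelihood
    psi(y, t) = p(t)^y * (1 - p(t))^(1 - y); analytically S(y, t) = y - p(t)."""
    logpsi = lambda yy, tt: yy * np.log(p(tt)) + (1.0 - yy) * np.log(1.0 - p(tt))
    return (logpsi(y, t + eps) - logpsi(y, t - eps)) / (2.0 * eps)
```

So the surprise design $c\,|S(y, \tilde{\theta}^\top z)| \wedge 1$ indeed collapses to the LCC acceptance probability $c\,|y - p(\tilde{\theta}^\top z)| \wedge 1$ in the logistic case.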
Besides optimizing the prediction accuracy and the estimation accuracy, other objectives can also be considered. The optimal selection probability π adaptive to various objectives can be derived similarly to those in Propositions 1 and 2. For instance, [1] and [17] mainly considered minimizing the conditional MSE of $\hat{\theta}$ given the full data. In our proposal, for minimizing the (unconditional) MSE, the optimal design is given by $\pi = \big(c\,\|A^{-1} g(D; \theta^*)\|\big) \wedge 1$, which is similar to the optimal form derived by [17] under the logistic model and by [1] under generalized linear models.

The algorithms
Let $\tilde{\pi}_i$ be the kernel part of the proposed surprise sampling design, e.g., $\tilde{\pi}_i = \|\tilde{A}^{-1/2} g(d_i; \tilde{\theta})\|$ for prediction and $\tilde{\pi}_i = |v^\top \tilde{A}^{-1} g(d_i; \tilde{\theta})|$ for estimation. The implementation procedure of the surprise sampling is summarized in Algorithm 1. The constant c is obtained by Algorithm 2, whose output is the constant c controlling the subsample size: rearrange the $\tilde{\pi}_i$'s in ascending order $\tilde{\pi}_{(1)} \le \cdots \le \tilde{\pi}_{(n)}$, and then find the largest c such that $n^{-1}\sum_{i=1}^n (c\,\tilde{\pi}_i) \wedge 1 \le r$ by the bisection method.
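A minimal Python version of the bisection step is given below. This plain sketch recomputes the truncated mean at O(n) cost per iteration rather than exploiting the sorted $\tilde{\pi}_{(i)}$'s as Algorithm 2 does, but it finds the same constant:

```python
import numpy as np

def largest_c(pi_tilde, r, tol=1e-10):
    """Largest c with mean(min(c * pi_tilde, 1)) <= r, found by bisection.
    The map c -> mean(min(c * pi_tilde, 1)) is nondecreasing in c."""
    lo, hi = 0.0, 1.0
    # Grow hi until the rate constraint is violated (or everything is capped at 1).
    while np.minimum(hi * pi_tilde, 1.0).mean() <= r:
        if np.all(hi * pi_tilde >= 1.0):
            return hi          # all probabilities equal 1; the rate r is not binding
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        if np.minimum(mid * pi_tilde, 1.0).mean() <= r:
            lo = mid
        else:
            hi = mid
    return lo
```

The final sampling probabilities are then $\pi_i = (c\,\tilde{\pi}_i) \wedge 1$ with `c = largest_c(pi_tilde, r)`.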

Large sample properties
In this section we discuss the asymptotic properties of the HT type estimator under the proposed surprise sampling designs, which require some conditions listed below. The first set of conditions includes regularity conditions on the parameter space and the underlying data distribution. The second set of conditions is about the subsampling probabilities $\pi_i$'s.
In particular, there should exist a probability π, possibly depending on D, to which the $\pi_i$'s correspond in the limit. Note that the conditions on the subsampling probabilities are stated in a general manner, so that we can first present a general theorem on the large sample theory of $\hat{\theta}$ defined in (2.1). The theorem shows the consistency and the asymptotic normality of $\hat{\theta}$ under the above conditions.
The key step in getting the asymptotic distribution is to establish the expansion (3.1). We give the proof in Appendix A.2. It is worth mentioning that the consistency and the asymptotic normality of $\hat{\theta}$ require the pilot $\tilde{\theta}$ neither to be consistent nor to be independent of the full data, both of which are required in [7]. These relaxations can be useful in practice, as we mentioned in Section 1. This is one of the main reasons that we propose to use the HT type estimator.
Based on Theorem 1, we can immediately derive the asymptotic properties of the HT type estimator under the proposed surprise sampling designs. For the surprise sampling, the specific forms of the design $\pi_i$ are given in (3.4) or (3.5), depending on the data analysis purpose. We first consider the situation where the pilot $\tilde{\theta}$ is consistent. Here, the limiting probability is $\pi = \big(c\,\|A^{-1/2} g(D; \theta^*)\|\big) \wedge 1$ for prediction and $\pi = \big(c\,|v^\top A^{-1} g(D; \theta^*)|\big) \wedge 1$ for estimation. We further introduce a condition which, together with C1-C4, implies the general conditions C5-C7.
The following corollary presents the asymptotic properties of $\hat{\theta}$ with a consistent pilot estimate.

Corollary 1. If conditions C1-C4 and C8 hold, then the conclusions of Theorem 1 hold under the surprise sampling designs (3.4) and (3.5).
It may seem that Corollary 1 only repeats the results of Theorem 1. However, note that in Corollary 1 the probability function π in $V_\pi$ takes the optimal form given in Propositions 1 and 2. This means that when the pilot estimate is consistent, the proposed sampling designs given in (3.4) or (3.5) are optimal, in the sense that the corresponding HT type estimator $\hat{\theta}$ asymptotically reaches the best prediction accuracy or estimation accuracy.
When the assumed model is correctly specified, the target parameter $\theta^*$ becomes the true parameter $\theta_0$. For the binary response problem discussed in [7], we have shown in Proposition 3 that the LCC sampling is a special case of the proposed surprise sampling. Consider the sampling design $\pi_i = |y_i - p(\tilde{\theta}^\top z_i)|$, that is, the constant c = 1. We have the following result for the corresponding $\hat{\theta}$.

Corollary 2. If the logistic regression model $p(\theta^\top z) = \exp(\theta^\top z)/[1 + \exp(\theta^\top z)]$ holds with $\theta_0$ being the true parameter value, X does not concentrate on a hyperplane of dimension smaller than q, $\mathrm{E}(\|X\|^4/\pi) < \infty$, and the pilot $\tilde{\theta}$ is consistent, then $\sqrt{n}(\hat{\theta} - \theta_0)$ converges in distribution to a normal vector with mean zero and variance-covariance matrix $2\Sigma_{\mathrm{full}}$, where $\Sigma_{\mathrm{full}}$ stands for the asymptotic variance of the MLE for the full sample.
Corollary 2 indicates that under a correctly specified logistic model, the proposed HT type estimator $\hat{\theta}$ is asymptotically as efficient as the LCC estimator of [7], as long as the pilot is consistent. Again, our result does not require independence between the pilot and the full data. Both Corollaries 1 and 2 are proved in Appendix A.2.
Next we consider the situation with an inconsistent pilot estimate. When the pilot estimate is not consistent, the LCC estimator is no longer consistent. However, based on Theorem 1, the HT type estimator can still be consistent and asymptotically normal as long as the pilot has a certain limit in probability. Specifically, suppose that $\tilde{\theta}$ converges in probability to a limit denoted by $\bar{\theta}$. Let $\bar{A} = \mathrm{E}[G(D; \bar{\theta})]$ and $\bar{V}_{\bar{\pi}} = \mathrm{E}[g(D; \theta^*)^{\otimes 2}/\bar{\pi}]$, where $\bar{\pi} = \big(c\,\|\bar{A}^{-1/2} g(D; \bar{\theta})\|\big) \wedge 1$ for prediction and $\bar{\pi} = \big(c\,|v^\top \bar{A}^{-1} g(D; \bar{\theta})|\big) \wedge 1$ for estimation. Another condition is needed here.
C9. $\bar{\theta}$ is an interior point of Θ and the matrix $\bar{A}$ is positive definite. For the inconsistent pilot, we have the following result.

Corollary 3. If conditions C1-C4 and C9 hold, then $\hat{\theta}$ is consistent for $\theta^*$ and $\sqrt{n}(\hat{\theta} - \theta^*)$ is asymptotically normal with mean zero, with π in the asymptotic variance replaced by $\bar{\pi}$.
Again, we give the proof in Appendix A.2. When the pilot estimate is not consistent for $\theta^*$, the surprise sampling design is no longer optimal. However, $\hat{\theta}$ is still consistent for $\theta^*$ and asymptotically normally distributed. Thus, the proposed estimation procedure is robust to model mis-specification as well as to an inconsistent pilot.
Finally, a plug-in approach can be applied to estimate the asymptotic variance-covariance matrix of $\hat{\theta}$. Specifically, we define

$\hat{A} = \frac{1}{n} \sum_{i=1}^n \frac{\Delta_i}{\pi_i}\, G(d_i; \hat{\theta}), \qquad \hat{V}_\pi = \frac{1}{n} \sum_{i=1}^n \frac{\Delta_i}{\pi_i^2}\, g(d_i; \hat{\theta})^{\otimes 2}.$

Then a consistent estimator of the asymptotic variance-covariance matrix is given by $\hat{A}^{-1} \hat{V}_\pi \hat{A}^{-1}$.
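For the logistic loss, the plug-in sandwich estimator can be assembled directly from the subsample. The following Python sketch uses the HT-weighted forms of $\hat{A}$ and $\hat{V}_\pi$ as our reading of the construction above, specialized to $g(d; \theta) = (p - y)z$ and $G(d; \theta) = p(1-p)zz^\top$:

```python
import numpy as np

def sandwich_variance(Z, y, theta_hat, delta, pi):
    """Plug-in estimate A_hat^{-1} V_hat A_hat^{-1} of the asymptotic
    variance-covariance matrix of sqrt(n) (theta_hat - theta_star),
    specialized to the logistic loss."""
    n = len(y)
    p = 1.0 / (1.0 + np.exp(-Z @ theta_hat))
    w = delta / pi                                    # HT weights, zero off-sample
    A_hat = (Z * (w * p * (1.0 - p))[:, None]).T @ Z / n
    g = (p - y)[:, None] * Z                          # score rows g(d_i; theta_hat)'
    V_hat = (g * (w / pi)[:, None]).T @ g / n         # Delta_i / pi_i^2 weighting
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ V_hat @ A_inv
```

Dividing the result by n gives a variance estimate for $\hat{\theta}$ itself, from which Wald confidence intervals (as reported in the simulations below) follow.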

Simulation studies
We conduct extensive simulation studies to examine the effectiveness of the proposed approach and make comparisons with the LCC sampling. The numerical studies mainly focus on various regression type problems, but as we have already mentioned, the whole procedure is general enough to be extended to unsupervised learning tasks. We first consider a binary response, where the proposed HT type estimator can be compared with the LCC estimator of [7]. Throughout, the reported squared bias and variance of an estimator $\hat{\beta}$ are $\mathrm{Bias}^2 = \|\mathrm{E}(\hat{\beta}) - \beta_0\|^2$ and $\mathrm{Var} = \sum_{j=1}^q \mathrm{Var}(\hat{\beta}_j)$.

Simulation 1: Correctly specified logistic model
In simulation 1 we consider the scenario where the logistic model is correctly specified. We set q = 50 and generate all the predictors X independently from the standard normal distribution. Given X, Y is generated from a Bernoulli distribution with success probability $p(\theta^\top Z)$, where $\theta = (\alpha, \beta^\top)^\top$ with the first 25 components of β being 1 and the remaining 25 being 0, and α chosen to yield P(Y = 1) ≈ 10%. The entire sample size n is set to $10^6$. For the pilot estimate, we use data drawn from the full sample with size $10^4$. We consider two types of consistent pilot estimates. The first is to draw a random sample with uniform probabilities and take the logistic MLE as the pilot; the second is to draw a 50-50 split case-control sample and apply the weighted case-control approach to get the pilot. For the subsampling design, we apply the LCC sampling $\pi_i = |y_i - p(\tilde{\theta}^\top z_i)|$, which is a special case of the proposed surprise sampling. Both the LCC estimator and the proposed HT type estimator based on the negative log-likelihood loss are obtained for comparison. The procedure is repeated 1000 times. We record the squared bias and variance of the estimator of β, denoted by $\hat{\beta}$, over the 1000 realizations for each of the two methods under the two pilots, respectively. The results are summarized in Table 1.
From Corollary 2, we know that when the logistic model is correctly specified, the proposed HT estimator is asymptotically as efficient as the LCC estimator under the LCC sampling. From Table 1, we see that the behaviors of the two estimates are very close to each other, which coincides with the finding of the corollary. Also, the pilot method has little impact on the behavior of the proposed estimator, as long as the pilot is consistent.
It is worthwhile to mention that under the LCC sampling, the average size of the subsample is around 6.6% of the entire sample size. Adding the sample size used for the pilot, this means that we use about 7.6% of the full data to reach half of the estimation efficiency. We also record the computation time. On the computer used for all the numerical studies (a laptop running macOS 10.14 with an Intel i7 processor and 16GB memory), it takes around 352 seconds to calculate a full sample MLE. The whole procedure of the subsample estimation, including obtaining the pilot, evaluating the sampling probabilities, drawing the subsample, and calculating the HT type estimate, takes around 19 seconds, only 5.4% of the time for the full sample MLE.

Simulation 2: Incorrectly specified logistic model with a consistent pilot
In simulation 2 we turn to the scenario where the logistic model is incorrectly specified. Here q is set to 5 and again all the predictors are generated independently from the standard normal distribution. Besides the linear combination of the 5 predictors, we add the quadratic term of the first predictor into the success probability of the Bernoulli distribution, but still fit the data by the usual logistic model with linear terms only. Thus, the fitted model is incorrectly specified. The parameters are set to yield P(Y = 1) ≈ 1%, so the data are more imbalanced than in the last scenario. The entire sample size is $n = 10^6$, and the pilot sample size is $10^4$. We still use the logistic MLE with uniform subsampling and the weighted case-control estimate as the pilot estimate, respectively. It is easy to see that both pilots are consistent. Similar to simulation 1, the LCC sampling is applied and the LCC estimator and the HT estimator are obtained. The procedure is repeated 1000 times and the results parallel to Table 1 are summarized in Table 2.
From the results we see that the LCC estimate and the HT estimate are essentially unbiased for the target parameter $\beta^*$. Meanwhile, in this case, the empirical variance of the proposed HT estimate is smaller than that of the LCC estimate. Under the LCC sampling, the average size of the subsample is around 1.9% of the entire sample size.

Simulation 3: Incorrectly specified logistic model with an inconsistent pilot
In simulation 3 we compare the two estimators under an incorrectly specified logistic model with an inconsistent pilot estimate. The data generation scheme is the same as that in simulation 2. For the pilot estimate, we still draw a subsample of size $10^4$ with uniform probabilities. The difference here is that we use the probit MLE as the pilot estimate. The resulting pilot estimate is then no longer consistent for the target parameter. Using this inconsistent pilot, the LCC sampling is applied and the LCC estimator and the HT estimator are obtained. The procedure is repeated 1000 times. Here, besides the results of the two estimates, we also report the squared bias and variance of the pilot estimate. The results are summarized in Table 3.
The consistency of the LCC estimator relies on the consistency of the pilot, but that of the proposed HT estimator does not, as Corollary 3 illustrates. From Table 3 we see that the pilot estimate is clearly biased. Compared with the HT estimator, the bias of the LCC estimator is quite obvious: although its squared bias is not large in absolute value, it is of the same order as its variance. The proposed HT estimator, in contrast, has a much smaller squared bias relative to its variance, providing positive evidence for Corollary 3. This is an advantage of the HT estimator over the LCC one.

Simulation 4: Comparison of LCC sampling and surprise sampling
Proposition 3 shows that LCC sampling is a special case of the proposed surprise sampling design, and is optimal when the target is to estimate v⊤θ* with v = E{σ̇(θ*⊤Z)Z}. When the target direction changes, however, LCC sampling is no longer adaptively optimal, while the surprise sampling design is. In this simulation, we again generate data by the same scheme as in simulation 2, but suppose that the main concern is to estimate the regression coefficient of the first predictor, i.e., β*₁. Equivalently, the target direction v here is a 6-dimensional column vector with the second element being 1 and all the others 0. For this target, LCC sampling is not optimal. Following Proposition 2, the optimal design is given by π_i = (c|v⊤Ã^{-1}g(d_i; θ̃)|) ∧ 1. The two subsampling designs are compared, with c chosen to yield comparable average subsample sizes between the two designs. We still use the same two pilot estimates as in simulation 2 and consider three estimation approaches: the LCC estimator, the HT estimator under LCC sampling, and the HT estimator under the optimal surprise sampling. The procedure is repeated 1000 times. For the HT estimator, besides the squared bias and variance, we also report the average of the variance estimates and the empirical coverage percentage of the 95% Wald confidence interval. The results are summarized in Table 4.
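A minimal sketch of this direction-specific design for the logistic case, assuming the pilot-based quantities above (score contributions g_i and the pilot Hessian estimate of A). The function name and the bisection used to pick c are ours:

```python
import numpy as np

def direction_optimal_pi(X, y, theta_pilot, v, target_size):
    """Sampling probabilities pi_i = (c |v' A^{-1} g_i|) ∧ 1 tuned for
    estimating v' theta, following the form suggested by Proposition 2.
    A is estimated by the pilot logistic Hessian; g_i is the score
    contribution of observation i."""
    Z = np.column_stack([np.ones(len(y)), X])        # add intercept
    p = 1.0 / (1.0 + np.exp(-Z @ theta_pilot))       # pilot fitted probs
    g = (p - y)[:, None] * Z                         # score contributions
    A = (Z * (p * (1 - p))[:, None]).T @ Z / len(y)  # pilot Hessian estimate
    s = np.abs(g @ np.linalg.solve(A, v))            # |v' A^{-1} g_i|
    # Bisect for c so that the expected subsample size matches the target.
    lo, hi = 0.0, 1.0
    while np.minimum(hi * s, 1.0).sum() < target_size:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.minimum(mid * s, 1.0).sum() < target_size:
            lo = mid
        else:
            hi = mid
    return np.minimum(hi * s, 1.0)
```

With v = e₂ (second coordinate 1), this reproduces the design targeted at β*₁; with the norm ‖Ã^{-1/2}g_i‖ in place of the scalar |v⊤Ã^{-1}g_i|, one recovers the all-coordinates design used in the appendix.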
The variance of the HT estimator under the optimal surprise sampling is much smaller than that of the LCC estimator. With comparable subsample sizes (around 1.9% of the full sample size), the relative efficiency of the HT estimator under the optimal sampling to the LCC estimator is at least 3 in our setting. The HT estimator under the optimal sampling is also more efficient than that under LCC sampling, as expected from Proposition 2. The plug-in variance estimate and the Wald confidence interval perform reasonably well.
In the remaining studies we turn to more general regression models with counting and continuous responses, where the LCC approach is no longer available.

Simulation 5: Counting response under log-linear model
In simulation 5 we consider a counting response. We set q = 2 and the two predictors X = (X₁, X₂) are generated independently from the standard normal distribution. Two schemes for generating Y given X are considered. In the first, Y is generated from a Poisson distribution with mean exp(α + β₁X₁ + β₂X₂), while in the second, Y is generated from a Poisson distribution with mean exp(α). Thus, the Poisson log-linear model with linear terms only is correctly specified for the first scheme but misspecified for the second. In both schemes, the parameters are set to yield P(Y ≤ 1) ≈ 93%, implying that the generated data are imbalanced in the sense that most responses are 0 or 1. The full sample size is n = 10^5. We draw a pilot sample of size 1000 from the full sample with uniform probabilities and fit the Poisson log-linear MLE as the pilot, denoted by θ̃ = (α̃, β̃₁, β̃₂)⊤. According to Proposition 3, we use the surprise sampling design π_i = (c|y_i − exp(θ̃⊤z_i)|) ∧ 1, where z_i = (1, x_{1i}, x_{2i})⊤ and c is set to make the subsample size equal to 1000. The proposed HT-type estimator based on the negative log-likelihood loss is then calculated. For comparison, we draw a uniform subsample of size 2000 to calculate the Poisson log-linear MLE; the size is 2000 because the proposed surprise sampling must pay for its pilot sample. The procedure is repeated 1000 times. The squared bias and variance of the two estimators are recorded. For the HT estimator, we also record the average of the variance estimates and the empirical coverage percentage of the 95% Wald confidence interval. The results are summarized in Table 5.
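The sampling-plus-estimation step for the Poisson case can be sketched as follows, assuming a generic pilot θ̃. The Newton iteration and the tiny probability floor are our implementation choices, not part of the paper's design:

```python
import numpy as np

def surprise_subsample_poisson(Z, y, theta_pilot, c, rng):
    """Simulation-5-style surprise sampling for a Poisson log-linear model,
    pi_i = (c |y_i - exp(theta_pilot' z_i)|) ∧ 1, followed by the HT-type
    estimator: the inverse-probability-weighted Poisson MLE on the
    subsample.  The 1e-6 floor on pi is only a numerical safeguard."""
    resid = np.abs(y - np.exp(Z @ theta_pilot))
    pi = np.minimum(np.maximum(c * resid, 1e-6), 1.0)
    keep = rng.random(len(y)) < pi                    # Poisson sampling
    Zs, ys, w = Z[keep], y[keep], 1.0 / pi[keep]      # HT weights 1/pi
    theta = theta_pilot.copy()
    for _ in range(50):                               # Newton-Raphson
        mu = np.exp(Zs @ theta)
        score = Zs.T @ (w * (ys - mu))
        hess = (Zs * (w * mu)[:, None]).T @ Zs
        step = np.linalg.solve(hess, score)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta, int(keep.sum())
```

The constant c would be tuned (e.g. by bisection on the expected subsample size Σ_i π_i) so that the realized subsample size is near 1000.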
From the results we find that, whether or not the model is correctly specified, the HT estimator based on the proposed surprise sampling is essentially unbiased for the target parameter and is much more efficient than the uniform-subsample MLE with a comparable sample size. The plug-in variance estimate and the Wald confidence interval perform satisfactorily.

Simulation 6: Continuous response under linear model
In the last simulation we consider a continuous response. Again q = 2, and the two predictors X = (X₁, X₂) are generated independently from a normal distribution with mean zero and variance 0.01. Two schemes for generating Y given X are considered. In the first, Y is generated from a normal distribution with mean α + β₁X₁ + β₂X₂ and variance 0.01, while in the second, Y is generated from a Poisson distribution with mean α + 1 and variance 0.01. Thus, the linear regression model with linear terms only is correctly specified for the first scheme but misspecified for the second. In both schemes, the parameters are set to yield P(−0.5 < Y < 0.5) ≈ 99.6%, implying that most responses are near zero. The full sample size is n = 10^5. We draw a uniform pilot sample of size 1000 from the full sample and fit the normal linear regression MLE as the pilot θ̃. For the surprise sampling, we use π_i = (c|y_i − θ̃⊤z_i|) ∧ 1, where c is set to make the subsample size equal to 1000. The proposed HT-type estimator based on the least-squares loss is calculated. As in the last simulation, the MLE based on a uniform subsample of size 2000 is obtained for comparison. The parallel results are summarized in Table 6.
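Under the least-squares loss, the HT-type estimator has a closed form: minimizing Σ_i Δ_i (y_i − θ⊤z_i)²/π_i is just weighted least squares with weights 1/π_i on the selected rows. A minimal sketch (the function name is ours):

```python
import numpy as np

def ht_wls(Zs, ys, pis):
    """HT-type estimator under the least-squares loss: weighted least
    squares on the selected rows (Zs, ys) with weights 1/pi_i, solving
    (Zs' W Zs) theta = Zs' W ys with W = diag(1/pi)."""
    w = 1.0 / pis
    A = (Zs * w[:, None]).T @ Zs
    b = Zs.T @ (w * ys)
    return np.linalg.solve(A, b)
```

No iteration is needed here, in contrast to the Poisson case, which is one reason the linear-model simulation runs essentially at the cost of a single weighted regression.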
The observations are broadly similar to those from Table 5, so we omit the details. The last two simulations show that the proposed surprise sampling and the corresponding HT estimator are quite effective for counting and continuous data.

Web Spam data
We first apply the proposed approach to the Web Spam data, available on the LIBSVM website and originally from [18]. The same data were analyzed by [7] using the LCC approach for spam filtering. The data consist of 350000 web pages, of which about 60% are labeled as "web spam", i.e., pages designed to manipulate search engines rather than display legitimate content. Following [7], we use the frequencies of the 99 unigrams that appear in at least 200 documents as features, log-transformed with an offset to reduce skew. This data set is marginally balanced but exhibits considerable conditional imbalance, in the sense that for some feature values the spam label is easy to predict.
We keep the same subsampling procedure as in [7]. The LCC sampling design is used, and the corresponding selection percentage is about 10% for this data set. To assess the sampling distribution of the estimators, we repeatedly draw uniform subsamples of size n = 100000 from the full data, 100 times. In each replication, we use the weighted case-control approach of Simulation 1 with size 10000 to obtain the pilot estimate. Under the LCC design, the subsample size for parameter estimation is also around 10000. We fit the full-sample MLE (of size 100000), the LCC estimate, and the proposed HT estimate. By Corollary 2, under the logistic model the asymptotic variance of the proposed HT estimate is twice that of the full-sample MLE and equals that of the LCC estimate. Since for this real data set the logistic model is very likely misspecified, as [7] mentioned, the asymptotic variances of the HT and LCC estimates should be slightly more than twice that of the MLE. On the other hand, the total subsample size is around 20000 (including the pilot sample), i.e., 20% of the full sample size, so a uniform subsample of size 20000 should yield variance roughly 5 times that of the full sample.
Figure 1 shows the variances of the subsampling estimates relative to the full-sample MLE. Specifically, the horizontal axis indexes the 100 regression coefficients, and the vertical axis gives the variance of each estimated coefficient relative to that of the full-sample MLE of the same coefficient. The triangles are for the HT estimate and the circles for the LCC estimate. Most relative variances of the two subsampling estimates are slightly larger than 2, as expected, but substantially smaller than 5, implying that LCC sampling is more efficient than uniform subsampling with a comparable sample size. Moreover, for most coefficients the relative variance of the HT estimate is smaller than that of the LCC estimate, implying that the proposed HT estimator is slightly more efficient than the LCC estimator. This is consistent with the findings of Simulation 2. The average computation time over the 100 replications is 66.64 seconds for the full-sample MLE, 19.32 seconds for the LCC estimate (including the subsampling design implementation), and 20.04 seconds for the HT estimate.

Micro-blog data
In the second example we use the proposed method to analyze a data set from Sina micro-blog, a Chinese counterpart of Twitter. We collected 625895 tweets posted on Sina micro-blog from January to March 2018. For each tweet, the time of posting (in hours) and the numbers of comments, retweets, and likes are recorded. We also record the gender, the numbers of followers and fans, and the number of previously posted tweets of the blogger who posted the tweet. We set the number of likes as the response and the other variables as predictors. Since the number of likes is a counting variable, the log-linear model is used for the analysis. The response, ranging from 0 to tens of thousands, is very right-skewed, i.e., imbalanced: the mean of the response is 1038 while the median is 72. Around 55% of the tweets have fewer than 100 likes, while 2.11% have more than 10000. Figure 2 shows the histogram of the response.
In this analysis we focus on predicting the response from the predictors. Usually, the full data would be randomly split into a training part for fitting the model and a testing part for measuring prediction accuracy. To increase the stability of the results, we use a ten-fold cross-validation style approach: the full data are randomly split into ten parts, and each time nine parts serve as the training set and the remaining part as the testing set. We train the log-linear model on the training set and then calculate the root mean square error (RMSE) of the predicted values on the testing set, i.e., RMSE = {n_t^{-1} Σ_{i∈I_t} (y_i − ŷ_i)²}^{1/2}, where I_t is the index set of the testing data, n_t is the size of I_t, y_i is the response value, and ŷ_i is the predicted value. Finally, the average of the ten RMSEs, called ARMSE, is treated as the measure of prediction accuracy. On the training set, we use the same sampling design as in Simulation 5 to obtain the subsample, i.e., π_i = (c|y_i − exp(θ̃⊤z_i)|) ∧ 1, where y_i is the response, z_i = (1, x_i⊤)⊤, x_i is the standardized predictor vector, and θ̃ is the pilot estimate, namely the MLE fitted on a uniform subsample of size 10000. By taking a proper value of c, the size of the surprise subsample is set to 10000. For comparison, we also draw a uniform subsample of size 20000, comparable in size to the surprise subsample plus its pilot, and train the model by MLE. The prediction based on the full-data MLE serves as the benchmark.
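The cross-validation loop computing the ARMSE can be sketched as follows; the helper name and the plug-in `fit` interface (a callable returning a prediction function) are ours:

```python
import numpy as np

def armse(Z, y, fit, rng, k=10):
    """Average RMSE over a k-fold split, as used for the micro-blog data.
    `fit` maps (Z_train, y_train) -> a prediction function; the model
    itself (surprise-sampled log-linear fit, uniform-subsample MLE, or
    the full-data MLE) is plugged in by the caller."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rmses = []
    for t in range(k):
        test = folds[t]
        train = np.concatenate([folds[j] for j in range(k) if j != t])
        predict = fit(Z[train], y[train])
        rmses.append(np.sqrt(np.mean((y[test] - predict(Z[test])) ** 2)))
    return float(np.mean(rmses))
```

The same `armse` call is then evaluated with each of the three training procedures, so the three ARMSE figures are directly comparable.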
The ARMSE based on the full data is 6.848×10³. The ARMSE based on the surprise sampling data is 7.194×10³, quite close to that of the full data. The ARMSE based on the uniform subsample, however, is 2.367×10¹³, far greater than that of the surprise sampling. A main reason for this is the right-skewness of the response. At the subsampling percentage used here, uniform subsampling has little chance of picking out the tweets with extremely large responses, say, larger than 10000. Because of the significant influence of these large responses on the model fit, the models trained on uniform subsamples in some folds have quite different coefficients for some predictors from the full-data models. Consequently, the predictions based on the uniform-subsample models are very inaccurate for some responses in the testing set, resulting in extremely large RMSEs. By contrast, surprise sampling is more likely to select those large, or "surprising", responses into the subsample by its design motivation (the selection probability can be 1 for responses that are extreme relative to the pilot prediction). Consequently, the models trained on surprise sampling data take the large responses into account and eventually have coefficient estimates and prediction performance similar to the full-data models. The average computation time over the 10 folds is 97.45 seconds for the full data, 4.22 seconds for the surprise sample, and 3.54 seconds for the uniform sample.

Conclusion and discussion
We develop an improved version of LCC sampling, called the surprise sampling design, to reduce computational burden while retaining good prediction and estimation performance. The proposed design adapts flexibly to different objectives, and for each specific objective it enjoys the corresponding optimality; LCC sampling, in this sense, is a special case of the proposed surprise sampling design. For parameter estimation, we propose an HT-type estimation approach. The resulting estimator is consistent and asymptotically normal without requiring the pilot estimator to be consistent or independent of the full data. For a binary response, if the logistic model is correctly specified and the pilot is consistent, the HT estimator is asymptotically as efficient as the estimator of [7] under LCC sampling. The surprise sampling design and the HT estimation extend to more general responses, such as counting and continuous responses.
The proposed subsampling approach is more of a working principle for efficient data analysis that can be applied to various statistical learning problems. In the numerical studies we mainly focus on regression. However, as noted earlier, the optimal sampling design, the algorithm described in Section 3, and the theory discussed in Section 4 readily extend to unsupervised learning tasks, because our setup essentially only requires a specific loss function and a finite-dimensional parameter. One example of unsupervised learning is briefly given in Section 2. More detailed discussion of the approach and further meaningful applications to unsupervised learning are of great interest.
The proposed approach still has limitations that suggest several directions for further research. First, although the subsample size is much reduced, computing the subsampling probabilities may involve the full data, which can still be computationally intensive. Second, the pilot estimate in our approach is a guess of the target parameter, so the two have the same dimension. In some real applications, one may want to use a relatively simple model to obtain the pilot quickly and then apply a more complicated model to estimate the target parameter or to do prediction. The pilot model then differs from the fitted model, and the pilot is no longer a formal guess of the target parameter; the surprise sampling design needs adjustment, and the optimality of the design requires further investigation. Third, the HT-type estimator defined in (2.1) is not semiparametrically efficient, but an immediate variant is. The surprise sampling can be defined accordingly for that variant, in which the sampling probability π_i can be estimated by a kernel smoothing approach; it is possible to further improve the estimation so that it achieves the lower bound in (A.1).
Proof of Proposition 2. Note that the asymptotic variance of √n v⊤(θ̂ − θ*), which we want to minimize, is E{(v⊤A^{-1}g)²/π}. It then follows directly that the optimal π is given by π = (c|v⊤A^{-1}g|) ∧ 1, where c is such that E(π) = r.
Proof of Proposition 3. Let e₁ be the p-dimensional unit vector whose first component is one and all the others are zero. We first prove (A.2). When θ = θ₀ we have v = A e₁, where Cov stands for the variance-covariance matrix of a random vector; hence v⊤A^{-1} = e₁⊤. Next, we have
v⊤A^{-1}g(y, x; θ₀) = e₁⊤{−S(y, θ₀⊤z)z} = −S(y, θ₀⊤z) e₁⊤(1, x⊤)⊤ = −S(y, θ₀⊤z).
On the one hand, by Proposition 2, the asymptotic variance of v⊤θ̂ under the optimal π is (A.3). On the other hand, the asymptotic variance of v⊤θ̂ under the π defined in the proposition is also equal to (A.3). Therefore, π(Y, X; θ₀) = (c|S(Y, θ₀⊤Z)|) ∧ 1 is the optimal π minimizing the asymptotic variance of v⊤θ̂.

A.2. Proof of the theorem and corollaries
To make the presentation more concise, we abbreviate the notation: for i = 1, . . . , n, let l_i(θ) = l(d_i; θ), g_i(θ) = g(d_i; θ), and G_i(θ) = G(d_i; θ). Let R_n(θ) = n^{-1} Σ_{i=1}^n Δ_i l_i(θ)/π_i, and denote by F_n the σ-algebra generated by the observed data and the pilot estimate θ̃.

Since θ̃ →p θ* and Ã →p A by Lemma 4, the first term on the right-hand side of (A.9) is no larger than sup_{(A,θ)∈N(A,θ*)} |V_{1n}(A, θ) − V_1(A, θ)| when n is sufficiently large, where N(A, θ*) is a neighborhood of (A, θ*) in (A, θ). By C3 and the uniform law of large numbers, sup_{(A,θ)∈N(A,θ*)} |V_{1n}(A, θ) − V_1(A, θ)| →p 0, owing to the continuity of V_1(A, θ) with respect to (A, θ). Thus V_{1n}(Ã, θ̃) →p V_1(A, θ*).
Proof of Corollary 2. When the covariate vector X does not concentrate on a hyperplane of dimension smaller than q and E(‖X‖⁴/π) < ∞, conditions C3, C4, and C8 hold. When the logistic model is correctly specified, it can be shown that Σ_full = {E[p(θ₀⊤Z)(1 − p(θ₀⊤Z))ZZ⊤]}^{-1}. The conclusion follows immediately.
Proof of Corollary 3. We again take π_i = (c‖Ã^{-1/2}g(d_i; θ̃)‖) ∧ 1 as an example. 1) Consistency: the proof of the consistency of θ̂ follows exactly the same steps as in Theorem 1, so we omit the details.
2) Asymptotic normality: the proof of the asymptotic normality of θ̂ also follows steps similar to those in Theorem 1. The main difference is that if θ̃ →p θ̄, then by C6 we can show that Ã →p Ā using arguments similar to those in the proof of Lemma 4. Meanwhile, it can similarly be shown that V_{1n}(Ã, θ̃) →p V_1(Ā, θ̄) = V_π. Thus, n^{-1/2} Σ_{i=1}^n Δ_i g_i(θ*)/π_i is asymptotically normal with mean zero and variance-covariance matrix V_π. Lemma 3 still holds here. It then follows that √n(θ̂ − θ*) is asymptotically normal with mean zero and variance-covariance matrix A^{-1}V_π A^{-1}.

Fig 1. Relative variance of the subsampling estimates (β̂_j) to the full-sample MLE (β̂_{j,MLE}). The triangle is for the HT-type estimate; the circle is for the LCC estimate.

Fig 2. Histogram of the number of likes in the micro-blog data.

Table 1
Comparison of the LCC and HT estimates under the correctly specified logistic model.

Table 2
Comparison of the LCC and HT estimates under the incorrectly specified logistic model with a consistent pilot estimate.

Table 3
Comparison of the LCC and HT estimates under the incorrectly specified logistic model with an inconsistent pilot estimate.

Table 4
Estimation of β*₁ using local case-control sampling and surprise sampling under the logistic model.

Table 5
Estimation results under the log-linear model. Full MLE: full-sample MLE; Sub-MLE: MLE with a uniform subsample of size 2000; HT: Horvitz-Thompson type estimate under surprise sampling; Bias²: average squared bias; Var: empirical variance; Var Est.: average of the variance estimates; CP: coverage probability of the 95% Wald confidence interval.

Table 6
Estimation results under the linear model.
Sub-MLE: MLE with a uniform subsample of size 2000; HT: Horvitz-Thompson type estimate under surprise sampling; Bias²: average squared bias; Var: empirical variance; Var Est.: average of the variance estimates; CP: coverage probability of the 95% Wald confidence interval.