Fithian and Hastie (2014) proposed a sampling scheme called local case-control (LCC) sampling, which achieves stability and efficiency through a clever adjustment tailored to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve higher sampling probability if they carry more information or appear “surprising,” in the sense of, for example, a large pilot prediction error or a large absolute score. Compared with the relevant existing sampling schemes in the literature, the proposed one has several advantages. It adaptively yields the optimal sampling form for a variety of objectives, including LCC sampling as a special case. Under the same model specifications, the proposed estimator performs no worse than those in the literature. The estimation procedure remains valid even if the model is misspecified and/or the pilot estimator is inconsistent or depends on the full data. We present theoretical justification of the claimed advantages and of the optimality of the estimation and the sampling design. Unlike earlier work, our large-sample theory is population-wise rather than data-wise. Moreover, the proposed approach applies to unsupervised learning as well, since it essentially requires only a specified loss function and no response-covariate structure in the data. Numerical studies are carried out and provide evidence in support of the theory.
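To make the “surprise” principle concrete, the classical LCC scheme that the abstract generalizes can be sketched as follows. This is a minimal illustration of the Fithian–Hastie rule only, not of the paper's more general proposal: each point is accepted with probability equal to its pilot prediction error |y − p̃(x)|, a logistic model is fit to the accepted points, and the pilot coefficients are added back as the post-sampling adjustment. All function names, the uniform pilot subsample, and the simulation settings are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # numerically safe logistic function
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic(X, y, iters=25):
    """Unpenalized logistic regression via Newton-Raphson.
    X is assumed to contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        W = p * (1 - p) + 1e-10          # IRLS weights
        H = (X * W[:, None]).T @ X       # Hessian: X' diag(W) X
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

def lcc_sample_fit(X, y, pilot_frac=0.05, rng=None):
    """Local case-control sampling with the pilot-adjustment step."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # 1. pilot fit on a small uniform subsample
    idx = rng.choice(n, size=max(int(pilot_frac * n), 200), replace=False)
    beta_pilot = fit_logistic(X[idx], y[idx])
    # 2. "surprise" = pilot prediction error; accept with prob |y - p_tilde(x)|
    a = np.abs(y - sigmoid(X @ beta_pilot))
    keep = rng.random(n) < a
    # 3. fit on accepted points, then add the pilot back (the LCC adjustment)
    beta_sub = fit_logistic(X[keep], y[keep])
    return beta_pilot + beta_sub, keep

# demo on simulated imbalanced data (hypothetical setup, not the paper's experiments)
data_rng = np.random.default_rng(0)
n = 50000
x = data_rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-3.0, 1.0])        # large negative intercept -> rare positives
y = (data_rng.random(n) < sigmoid(X @ beta_true)).astype(float)
beta_hat, keep = lcc_sample_fit(X, y, rng=1)
```

The accepted subsample is far smaller than the full data yet roughly class-balanced, which is the source of the scheme's efficiency on imbalanced problems; the final step of adding back the pilot coefficients is the adjustment that restores consistency on the original scale.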
Kani Chen was supported by Hong Kong GRF Grants 16309816 and 1616212117. Wen Yu was supported by National Natural Science Foundation of China Grant 12071088.
The authors thank Professor Cheng Zhang and Pengfei Ma for providing the micro-blog data, and thank the referee for their constructive comments and suggestions.
"Surprise sampling: Improving and extending the local case-control sampling." Electron. J. Statist. 15 (1) 2454 - 2482, 2021. https://doi.org/10.1214/21-EJS1844