The Annals of Statistics

Local case-control sampling: Efficient subsampling in imbalanced data sets

William Fithian and Trevor Hastie

Abstract

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept–reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set.
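The accept-reject scheme described above can be illustrated with a small simulation. This is a minimal sketch, not the authors' code. The abstract does not spell out the acceptance rule or the correction, so the sketch assumes each point $(x, y)$ is kept with probability $|y - \tilde{p}(x)|$, where $\tilde{p}$ is the pilot's fitted probability, and that the adjustment adds the pilot coefficients back to the subsample fit. The simulated data, the parameter values, and the helper names `sigmoid`, `fit_logistic`, and `simulate` are all hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iters=25):
    """Fit intercept + slope logistic regression by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        # Newton step: solve the 2x2 system H * delta = g by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def simulate(n, theta, rng):
    """Draw x ~ N(0, 1) and y ~ Bernoulli(sigmoid(theta0 + theta1 * x))."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(theta[0] + theta[1] * x) else 0 for x in xs]
    return xs, ys

rng = random.Random(0)
theta_true = (-4.0, 1.5)  # illustrative values chosen so that positives are rare

# Pilot estimate from a small independent sample (an assumption of this sketch).
pilot_x, pilot_y = simulate(2000, theta_true, rng)
pilot = fit_logistic(pilot_x, pilot_y)

# One scan over the full data: keep (x, y) with probability |y - p_tilde(x)|,
# so points whose response is surprising under the pilot are kept more often.
full_x, full_y = simulate(40000, theta_true, rng)
sub_x, sub_y = [], []
for x, y in zip(full_x, full_y):
    p_tilde = sigmoid(pilot[0] + pilot[1] * x)
    if rng.random() < abs(y - p_tilde):
        sub_x.append(x)
        sub_y.append(y)

# Fit on the subsample, then add the pilot back as the analytic correction.
sub_fit = fit_logistic(sub_x, sub_y)
theta_hat = (sub_fit[0] + pilot[0], sub_fit[1] + pilot[1])

print(len(sub_x), "of", len(full_x), "points kept")
print("pilot:", pilot)
print("corrected estimate:", theta_hat)
```

Each accept-reject decision depends only on the point itself and the pilot, which is why the scan can be split across workers. The subsample fit on its own estimates roughly the difference between the true and pilot coefficients, so it is only meaningful after the pilot is added back.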

Under model misspecification, standard case-control sampling is inconsistent for the population risk-minimizing coefficients $\theta^{*}$. By contrast, our estimator is consistent for $\theta^{*}$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
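The trade-off between variance and subsample size stated in the abstract can be tabulated directly. The short sketch below only evaluates the two expressions quoted there, $1+\frac{1}{c}$ and $\frac{1+c}{2}$. The function name `tradeoff` is hypothetical, and $c = 1$ is included as the baseline case.

```python
def tradeoff(c):
    """Return (asymptotic variance relative to the full-sample MLE,
    approximate subsample size relative to the baseline c = 1)."""
    if c < 1:
        raise ValueError("c must be at least 1")
    return 1.0 + 1.0 / c, (1.0 + c) / 2.0

for c in (1, 2, 4):
    var_factor, size_factor = tradeoff(c)
    print(f"c={c}: variance factor {var_factor}, subsample size factor about {size_factor}")
```

For example, $c = 2$ lowers the variance factor from 2 to 1.5 at the cost of roughly 1.5 times as many sampled points, and $c = 4$ gives a factor of 1.25 with roughly 2.5 times as many points.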

Article information

Source
Ann. Statist. Volume 42, Number 5 (2014), 1693-1724.

Dates
First available in Project Euclid: 11 September 2014

Permanent link to this document
https://projecteuclid.org/euclid.aos/1410440622

Digital Object Identifier
doi:10.1214/14-AOS1220

Mathematical Reviews number (MathSciNet)
MR3257627

Zentralblatt MATH identifier
1305.62096

Subjects
Primary: 62F10: Point estimation
Secondary: 62D05: Sampling theory, sample surveys

Keywords
Logistic regression; case-control sampling; subsampling

Citation

Fithian, William; Hastie, Trevor. Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann. Statist. 42 (2014), no. 5, 1693–1724. doi:10.1214/14-AOS1220. https://projecteuclid.org/euclid.aos/1410440622


