## The Annals of Statistics

- Ann. Statist.
- Volume 42, Number 5 (2014), 1693-1724.

### Local case-control sampling: Efficient subsampling in imbalanced data sets

William Fithian and Trevor Hastie

#### Abstract

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept–reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set.
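The scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. It assumes the acceptance probability is $|y-\tilde{p}(x)|$, where $\tilde{p}(x)$ is the pilot model's fitted probability, and that the post-hoc adjustment consists of adding the pilot coefficients back to the subsample fit; the abstract does not spell out either detail. The helper names (`sigmoid`, `fit_logistic`, `local_case_control`), the simulated data, and the choice of a uniform random pilot subset are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Plain Newton-Raphson logistic MLE; X must already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                      # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X  # observed information
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over (X, y): accept-reject, fit on the subsample, then adjust."""
    p_pilot = sigmoid(X @ theta_pilot)
    # Assumed acceptance rule: keep a point with probability |y - p_pilot(x)|,
    # so responses the pilot considers unlikely are kept more often.
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed post-hoc adjustment: the subsample log-odds equal the true
    # log-odds minus the pilot log-odds, so add the pilot coefficients back.
    return theta_sub + theta_pilot, keep

# Simulated imbalanced data (illustrative values only).
rng = np.random.default_rng(0)
n = 200_000
theta_true = np.array([-4.0, 1.0, -1.0])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Hypothetical pilot: a plain MLE on a held-out uniform subset of the data.
n_pilot = 20_000
theta_pilot = fit_logistic(X[:n_pilot], y[:n_pilot])

theta_hat, keep = local_case_control(X[n_pilot:], y[n_pilot:], theta_pilot, rng)
print("fraction of points kept:", keep.mean())
print("adjusted estimate:", theta_hat)
```

Each point's accept-reject decision depends only on its own features, its response, and the pilot coefficients, which is why a single parallelizable scan over the data suffices.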

Under model misspecification, standard case-control sampling is inconsistent for the population risk-minimizing coefficients $\theta^{*}$. By contrast, our estimator is consistent for $\theta^{*}$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
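As a quick arithmetic illustration of this trade-off, using the two expressions stated above (the values of $c$ are chosen only for illustration and do not come from the paper):

$$
\begin{aligned}
c = 2:&\quad 1+\tfrac{1}{c} = 1.5 \text{ times the MLE variance},\quad \tfrac{1+c}{2} = 1.5 \text{ times the baseline subsample size};\\
c = 4:&\quad 1+\tfrac{1}{c} = 1.25 \text{ times the MLE variance},\quad \tfrac{1+c}{2} = 2.5 \text{ times the baseline subsample size}.
\end{aligned}
$$

The variance factor approaches 1 as $c$ grows, while the approximate subsample size grows linearly in $c$.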

#### Article information

**Source**

Ann. Statist. Volume 42, Number 5 (2014), 1693-1724.

**Dates**

First available in Project Euclid: 11 September 2014

**Permanent link to this document**

https://projecteuclid.org/euclid.aos/1410440622

**Digital Object Identifier**

doi:10.1214/14-AOS1220

**Mathematical Reviews number (MathSciNet)**

MR3257627

**Zentralblatt MATH identifier**

1305.62096

**Subjects**

Primary: 62F10: Point estimation

Secondary: 62D05: Sampling theory, sample surveys

**Keywords**

Logistic regression; case-control sampling; subsampling

#### Citation

Fithian, William; Hastie, Trevor. Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann. Statist. 42 (2014), no. 5, 1693-1724. doi:10.1214/14-AOS1220. https://projecteuclid.org/euclid.aos/1410440622