Open Access
2022 New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification
Arezou Mojiri, Abbas Khalili, Ali Zeinal Hamadani
Electron. J. Statist. 16(1): 814-861 (2022). DOI: 10.1214/21-EJS1939

Abstract

In binary classification, imbalance refers to situations in which one class is heavily under-represented. It stems either from the data collection process or from one class being genuinely rare in the population. Imbalanced classification frequently arises in applications such as biology, medicine, engineering, and the social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on linear discriminant analysis (LDA) in high dimensions. We show that, due to data scarcity in one class, referred to as the minority class, and the high dimensionality of the feature space, the LDA ignores the minority class, yielding a maximum misclassification rate for that class. We then propose a new construction of hard-thresholding rules based on a data splitting technique that reduces the large difference between the misclassification rates of the two classes. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of the LDA in imbalanced cases. We evaluate the finite-sample performance of the different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or achieves comparable performance using a much smaller subset of selected features, while being computationally more efficient.
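The claim that a plug-in linear discriminant rule ignores the minority class can be illustrated with a small simulation. The sketch below is not the paper's simulation design or its proposed method. It assumes an identity covariance treated as known, equal prior weights, Gaussian features, and hypothetical values for the dimension, the class sizes, and the signal strength. The function name `simulate_lda_errors` and all of its parameters are illustrative choices.

```python
import numpy as np


def simulate_lda_errors(p=2000, n_major=500, n_minor=10,
                        n_test=1000, n_signal=10, seed=0):
    """Per-class test error of a plug-in linear rule under class imbalance.

    Assumptions (illustrative only): identity covariance treated as known,
    equal prior weights, and a mean shift of 1.0 in the first `n_signal`
    coordinates of the minority class.
    """
    rng = np.random.default_rng(seed)

    # True class means: the classes differ in only a few coordinates.
    mu_major = np.zeros(p)
    mu_minor = np.zeros(p)
    mu_minor[:n_signal] = 1.0

    # Imbalanced training samples.
    x_major = rng.standard_normal((n_major, p)) + mu_major
    x_minor = rng.standard_normal((n_minor, p)) + mu_minor

    # Plug-in estimates of the class means.
    m_major = x_major.mean(axis=0)
    m_minor = x_minor.mean(axis=0)
    direction = m_minor - m_major
    midpoint = 0.5 * (m_major + m_minor)

    def predict_minor(x):
        # Assign to the minority class when the linear score is positive.
        return (x - midpoint) @ direction > 0

    # Balanced test sets, so each class error is estimated separately.
    test_major = rng.standard_normal((n_test, p)) + mu_major
    test_minor = rng.standard_normal((n_test, p)) + mu_minor

    err_major = float(np.mean(predict_minor(test_major)))
    err_minor = float(np.mean(~predict_minor(test_minor)))
    return err_major, err_minor


if __name__ == "__main__":
    err_major, err_minor = simulate_lda_errors()
    print(f"majority-class error: {err_major:.3f}")
    print(f"minority-class error: {err_minor:.3f}")
```

Under these assumptions, the noise in the minority-class mean estimate accumulates over all p coordinates and shifts the decision boundary toward the minority class, so nearly every test point is assigned to the majority class. Restricting the rule to a small set of selected coordinates, which is the general idea behind thresholding, limits that accumulation.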

Funding Statement

Abbas Khalili was supported by the Natural Sciences and Engineering Research Council of Canada through Discovery Grants (NSERC RGPIN-2015-03805 and NSERC RGPIN-2020-05011).

Acknowledgments

We would like to thank the editor Professor Domenico Marinucci, an associate editor, and two referees for their insightful comments and suggestions that improved the quality of this paper. We thank the National High Performance Computing Center (NHPCC) at Isfahan University of Technology for the computational support used to conduct our numerical experiments. Arezou Mojiri is grateful to the late Soroush Alimoradi and to Ali Rejali for their help and constant support during her graduate studies.

Citation

Arezou Mojiri, Abbas Khalili, Ali Zeinal Hamadani. "New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification." Electron. J. Statist. 16 (1) 814-861, 2022. https://doi.org/10.1214/21-EJS1939

Information

Received: 1 November 2020; Published: 2022
First available in Project Euclid: 19 January 2022

MathSciNet: MR4366822
zbMATH: 1493.62392
Digital Object Identifier: 10.1214/21-EJS1939

Subjects:
Primary: 62H30

Keywords: classification, high-dimensionality, imbalanced, linear discriminant analysis, thresholding

Vol. 16 • No. 1 • 2022