In binary classification, imbalance refers to situations in which one class is heavily under-represented. It arises either from the data collection process or because one class is indeed rare in the population. Imbalanced classification is common in applications in biology, medicine, engineering, and the social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on linear discriminant analysis (LDA) in high dimensions. We show that, due to data scarcity in one class, referred to as the minority class, and the high dimensionality of the feature space, the LDA ignores the minority class, yielding a maximum misclassification rate. We then propose a new construction of hard-thresholding rules based on a data splitting technique that reduces the large difference between the misclassification rates of the two classes. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of the LDA in imbalanced cases. We evaluate the finite-sample performance of the different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or achieves comparable performance based on a much smaller subset of selected features, while being computationally more efficient.
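The tendency of a plug-in linear rule to ignore the minority class can be illustrated with a small simulation. The sketch below is not the paper's method or experimental setup. It assumes Gaussian classes with a known identity covariance and equal priors, in which case LDA reduces to assigning a point to the nearest estimated class mean. The function name `simulate` and all parameter values (`p`, `n_min`, `n_maj`, `shift`, `n_signal`) are hypothetical choices made for illustration only.

```python
import numpy as np


def simulate(p=2000, n_min=10, n_maj=500, n_test=200,
             shift=2.0, n_signal=4, seed=0):
    """Class-wise test error of a nearest-centroid rule (LDA with an
    identity covariance assumed known) under imbalanced class sizes."""
    rng = np.random.default_rng(seed)

    # True class means: they differ only in the first n_signal coordinates.
    mu_min = np.zeros(p)
    mu_min[:n_signal] = shift
    mu_maj = np.zeros(p)

    # Estimated class means from the (imbalanced) training samples.
    m_min = (mu_min + rng.standard_normal((n_min, p))).mean(axis=0)
    m_maj = (mu_maj + rng.standard_normal((n_maj, p))).mean(axis=0)

    def predict_minority(x):
        # Assign each row of x to the class with the closer estimated mean.
        d_min = ((x - m_min) ** 2).sum(axis=1)
        d_maj = ((x - m_maj) ** 2).sum(axis=1)
        return d_min < d_maj

    # Fresh test points from each class.
    x_min = mu_min + rng.standard_normal((n_test, p))
    x_maj = mu_maj + rng.standard_normal((n_test, p))

    err_min = 1.0 - predict_minority(x_min).mean()  # minority called majority
    err_maj = predict_minority(x_maj).mean()        # majority called minority
    return float(err_min), float(err_maj)


if __name__ == "__main__":
    err_min, err_maj = simulate()
    print(f"minority error: {err_min:.2f}, majority error: {err_maj:.2f}")
```

Under these assumed settings, the estimation noise in the minority-class mean (of order `p / n_min`) is much larger than the squared distance between the true means, so nearly every test point, whichever class it comes from, lies closer to the estimated majority mean. Shrinking `p` or enlarging `n_min` should narrow the gap between the two error rates.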
Abbas Khalili was supported by the Natural Sciences and Engineering Research Council of Canada through Discovery Grants (NSERC RGPIN-2015-03805 and NSERC RGPIN-2020-05011).
We would like to thank the editor, Professor Domenico Marinucci, an associate editor, and two referees for their insightful comments and suggestions, which improved the quality of this paper. We thank the National High Performance Computing Center (NHPCC) at Isfahan University of Technology for computational support in conducting our numerical experiments. Arezou Mojiri is grateful to the late Soroush Alimoradi and to Ali Rejali for their help and constant support during her graduate studies.
"New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification." Electron. J. Statist. 16 (1) 814–861, 2022. https://doi.org/10.1214/21-EJS1939