The robust nearest shrunken centroids classifier for high-dimensional heavy-tailed data

Abstract: The nearest shrunken centroids classifier (NSC) is a popular high-dimensional classifier. However, it is prone to inaccurate classification when the data is heavy-tailed. In this paper, we develop a robust generalization of NSC (RNSC) which remains effective under such circumstances. By incorporating the Huber loss both in the estimation and the calculation of the score function, we reduce the impacts of heavy tails. We rigorously show the variable selection, estimation, and prediction consistency in high dimensions under weak moment conditions. Empirically, our proposal greatly outperforms NSC and many other successful classifiers when data is heavy-tailed while remaining comparable to NSC in the absence of heavy tails. The favorable performance of RNSC is also demonstrated in a real data example.


Introduction
The nearest shrunken centroids classifier (NSC, [49]) is a popular high-dimensional classifier [e.g., 50, 24]. It assigns each observation to the class with the nearest centroid, i.e., the within-class mean. In high dimensions, NSC performs variable selection by shrinking the centroid estimates. Thanks to its simplicity, NSC is highly interpretable and computationally efficient. Moreover, it is observed to achieve remarkable accuracy on many benchmark datasets.
However, heavy-tailed data have been attracting much attention in recent years, as such data frequently arise in many fields, such as biometry, ecological systems, finance, and sociology [43, 39, 47, 53]. NSC is vulnerable in this situation. For one thing, NSC estimates the centroid of each class with the sample mean, but the sample mean is easily affected by a few outliers and tends to be inaccurate for heavy-tailed data. In theory, the concentration property of the sample mean is usually proved by assuming the predictors to be sub-Gaussian or sub-exponential, but such assumptions are not appropriate for heavy-tailed data. For another, NSC measures the distance between an observation and each centroid with the squared Euclidean distance. The quadratic form of this distance amplifies the influence of a few extreme values as well.
To tackle this challenge, we propose the robust nearest shrunken centroids method (RNSC) that is suitable for potentially heavy-tailed data. RNSC gains robustness by combining the Huber loss and Huber estimators with the original NSC method. The impacts of outliers are mitigated, while the resulting classifier continues to be sparse and interpretable. In theory, RNSC is consistent when only the fourth moment exists, which ensures the applicability of RNSC on a wide range of data. In numerical studies, RNSC exhibits better performance than many existing methods when heavy tails or outliers are present, and is comparable to NSC when data are Gaussian.
RNSC is a unique addition to the robust high-dimensional classification literature, even though there exist a few pioneering proposals on this topic. For example, [23] proposed component-wise median estimators in a high-dimensional scheme. The properties of median-based classifiers were also discussed by [21] and [28]. Although the median is robust and can be applied to heavy-tailed data, it is likely to suffer from efficiency loss. In contrast, RNSC balances between the mean and the median estimates depending on how heavy-tailed the data are. Other examples of robust classifiers include [6] and [20], but it is unclear whether similar frameworks can be established for more robust versions of NSC.

The nearest shrunken centroids method
We first briefly review the nearest shrunken centroids (NSC) classifier. In a classification problem, we have a pair of random variables $(Y, X)$, where the predictors $X = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$ and the response $Y \in \{1, \ldots, K\}$, with $K$ being a positive integer. Let $\Pr(Y = k) = \pi_k$ be the prior probability that an observation belongs to Class $k$. We want to predict $Y$ based on the information from $X$. The classical nearest centroids method assigns an observation to the class with the closest centroid with respect to the $\ell_2$ distance. In high dimensions, the nearest shrunken centroids (NSC) method further enforces variable selection to facilitate interpretation and accurate prediction.
Suppose that we observe the dataset $\{Y_i, X_i\}_{i=1}^{n}$, where $(Y_i, X_i)$ are independent and identically distributed copies of $(Y, X)$, and $n$ is the sample size. In high dimensions, $p$ is much larger than $n$. We further let $C_k$ be the set of indices of the $n_k$ samples in Class $k$.
In NSC, we first find initial estimates for the centroids of the $k$-th class and the variability of each $X_j$:
$$\bar X_{\cdot jk} = \frac{1}{n_k}\sum_{i\in C_k} X_{ij}, \qquad S_j^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i\in C_k}\big(X_{ij}-\bar X_{\cdot jk}\big)^2. \qquad (1)$$

We further compute the grand mean for $X_j$ as $\bar X_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$, and the sample proportion for Class $k$ as $\hat\pi_k = n_k/n$, which is also the estimator for $\pi_k$. Then we shrink $\bar X_{\cdot jk}$ as follows. Let
$$d_{jk} = \frac{\bar X_{\cdot jk}-\bar X_{\cdot j}}{m_k S_j}, \qquad (2)$$
where $m_k = \sqrt{1/n_k - 1/n}$ so that $m_k S_j$ equals the estimated standard error of the numerator in $d_{jk}$. For a user-specified parameter $\Lambda \ge 0$, we soft-threshold $d_{jk}$ to be $d'_{jk} = \mathrm{sign}(d_{jk})(|d_{jk}|-\Lambda)_+$. Then the shrunken estimator for $\mu_{jk}$ is
$$\hat\mu_{jk} = \bar X_{\cdot j} + m_k S_j d'_{jk}, \qquad (3)$$
and the estimated classifier is given by
$$\hat\delta(X^*) = \arg\min_{k}\Big\{\sum_{j=1}^{p}\frac{(X^*_j-\hat\mu_{jk})^2}{S_j^2} - 2\log\hat\pi_k\Big\}. \qquad (4)$$
It is easy to see that, for sufficiently large $\Lambda$, we have $\hat\mu_{j1} = \cdots = \hat\mu_{jK}$ for many $j$. These variables do not have any effect on the final classification.
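For concreteness, a minimal Python sketch of the NSC quantities above (class centroids, pooled variance, the standardized differences $d_{jk}$, and their soft-thresholded shrinkage) is given below. The function and variable names are our own illustration rather than the pamr implementation, and in practice a small positive constant may be added to $S_j$ as in the original NSC.

```python
import numpy as np

def nsc_shrunken_centroids(X, y, lam):
    """Sketch of the NSC shrinkage step in (1)-(3): returns shrunken class centroids.

    X: (n, p) data matrix; y: (n,) class labels; lam: soft-threshold Lambda.
    """
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    grand_mean = X.mean(axis=0)                       # grand mean of each X_j
    centroids = np.zeros((K, p))
    S2 = np.zeros(p)
    for k, c in enumerate(classes):
        Xk = X[y == c]
        centroids[k] = Xk.mean(axis=0)                # within-class centroid
        S2 += ((Xk - centroids[k]) ** 2).sum(axis=0)
    S2 /= (n - K)                                     # pooled within-class variance S_j^2
    S = np.sqrt(S2)

    shrunken = np.zeros_like(centroids)
    for k, c in enumerate(classes):
        n_k = np.sum(y == c)
        m_k = np.sqrt(1.0 / n_k - 1.0 / n)
        d = (centroids[k] - grand_mean) / (m_k * S)   # standardized difference d_{jk}
        d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # soft-thresholding
        shrunken[k] = grand_mean + m_k * S * d        # shrunken centroid estimate
    return shrunken, S2
```

Variables whose shrunken centroids coincide across all classes are then effectively dropped from the classifier in (4).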

Our proposal of the robust nearest shrunken centroids method
The nearest shrunken centroids method is a powerful method that has been proved to have excellent performance [50, 24]. However, it is sensitive to heavy-tailed distributions and outliers for at least two reasons. First, it is apparent that $\bar X_{\cdot j}$, $\bar X_{\cdot jk}$ and $S_j$ attempt to estimate the (conditional) mean and variance of $X_j$. They are sample estimates with some shrinkage when applicable. However, sample estimates are known to be unstable in the presence of heavy tails. If the data have several observations far away from the truth, the sample estimates tend to be inaccurate. Consequently, variable selection and prediction could be negatively impacted. Second, even if we knew the true (conditional) means and variances, NSC uses the $\ell_2$ loss to measure the distances between $X$ and the centroids. If $X$ is drawn from a heavy-tailed distribution, it is likely that $X$ has a few elements that drive the prediction by coincidence, and thus the classification is prone to errors.
To resolve these issues, we propose the robust NSC by replacing both the estimates and the $\ell_2$ loss with robust counterparts. We achieve both goals through the Huber loss [25]. The Huber loss is defined as
$$f_r(x) = \begin{cases} x^2, & |x| \le r,\\ 2r|x| - r^2, & |x| > r,\end{cases} \qquad (5)$$
where $r > 0$ is a parameter that controls the level of robustness for the resulting estimate. It is well known that, by switching from the $\ell_2$ loss to the $\ell_1$ loss at $r$, the Huber loss effectively reduces the contribution of extreme values. The Huber loss has recently attracted renewed interest in high dimensions, but, to the best of our knowledge, it has not been combined with classification methods.
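To make the role of $r$ concrete, the following sketch implements the Huber loss in the form of (5) and computes a one-dimensional Huber location estimate by direct minimization of the convex objective. The names are illustrative, and a production implementation would typically use an iterative reweighting scheme rather than a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(x, r):
    """Huber loss f_r in (5): quadratic for |x| <= r, linear beyond r."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= r, x ** 2, 2.0 * r * np.abs(x) - r ** 2)

def huber_mean(z, r):
    """Huber location estimate: argmin_mu sum_i f_r(z_i - mu) (a convex problem)."""
    z = np.asarray(z, dtype=float)
    return minimize_scalar(lambda mu: huber_loss(z - mu, r).sum()).x

# A heavy-tailed sample: the Huber mean is pulled far less by the outlier
# than the ordinary sample mean.
z = np.array([0.2, -0.5, 0.1, 0.4, -0.3, 25.0])
print(np.mean(z), huber_mean(z, r=1.0))
```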
To apply the Huber loss, we first note some simple facts about NSC. It is easy to see that the estimates $\bar X_{\cdot j}$ and $\bar X_{\cdot jk}$ defined in (1) are minimizers of the $\ell_2$ loss:
$$\bar X_{\cdot j} = \arg\min_{\mu}\sum_{i=1}^{n}(X_{ij}-\mu)^2, \qquad \bar X_{\cdot jk} = \arg\min_{\mu}\sum_{i\in C_k}(X_{ij}-\mu)^2, \qquad (6)$$
while $S_j^2$ is also related to the $\ell_2$ loss, as
$$S_j^2 = \frac{1}{n-K}\sum_{k=1}^{K}\min_{\mu}\sum_{i\in C_k}(X_{ij}-\mu)^2. \qquad (7)$$
To make these estimates more robust, we replace them with suitable Huber estimates. We let
$$\tilde X_{\cdot jk} = \arg\min_{\mu}\sum_{i\in C_k} f_H(X_{ij}-\mu), \qquad \tilde Q_{jk} = \arg\min_{q}\sum_{i\in C_k} f_H(X_{ij}^2-q), \qquad (8)$$
where $H > 0$ is a tuning parameter. We use $\tilde X_{\cdot jk}$ as an initial robust estimator for the conditional mean of $X_j$ within Class $k$, and $\tilde Q_{jk}$ as a robust estimator for the second moment. Then we estimate the grand mean and the conditional variance of $X_j$ as follows:
$$\tilde X_{\cdot j} = \sum_{k=1}^{K}\hat\pi_k \tilde X_{\cdot jk}, \qquad \tilde S_j^2 = \sum_{k=1}^{K}\hat\pi_k\big(\tilde Q_{jk}-\tilde X_{\cdot jk}^2\big). \qquad (9)$$
With these Huber estimates, we soft-threshold the centroid estimates in a similar way to (2)-(3). Namely, for a tuning parameter $\Lambda \ge 0$, we let
$$\tilde d_{jk} = \frac{\tilde X_{\cdot jk}-\tilde X_{\cdot j}}{m_k \tilde S_j}, \qquad \tilde d'_{jk} = \mathrm{sign}(\tilde d_{jk})\big(|\tilde d_{jk}|-\Lambda\big)_+. \qquad (10)$$
The shrunken Huber estimator for the centroid of $X_j$ within Class $k$ is
$$\tilde\mu_{jk} = \tilde X_{\cdot j} + m_k \tilde S_j \tilde d'_{jk}. \qquad (11)$$
Moreover, we also use the Huber loss when we calculate the discriminant score. We assign a new observation $X^*$ to the class
$$\hat\delta_h(X^*) = \arg\min_{k}\Big\{\sum_{j=1}^{p} f_h\Big(\frac{X^*_j-\tilde\mu_{jk}}{\tilde S_j}\Big) - 2\log\hat\pi_k\Big\}, \qquad (12)$$
where $h > 0$ is a tuning parameter. Compared with (4), we use the Huber loss instead of the $\ell_2$ loss to measure the distance between $X^*$ and each centroid, so that the effects of extreme values are limited. In practice, the parameters $H$ and $h$ are tuned separately to obtain higher accuracy. Numerically, there is a small probability that a variance estimator is a negative number close to zero. We exclude the $j$-th predictor from the classifier if $\tilde S_j^2 < 0$. We refer to this modified NSC as the robust nearest shrunken centroids method (RNSC). As in NSC, when $\Lambda$ is reasonably large, most $\tilde\mu_{jk}$ equal $\tilde X_{\cdot j}$ by shrinkage, and the corresponding $X_j$ are excluded from the final classification in RNSC. The remaining variables give us a sparse classifier. RNSC has a similar interpretation as NSC, as it assigns an observation to its closest centroid, and the selected variables can still be viewed as those differentially distributed across classes.
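Putting the pieces together, a minimal sketch of the RNSC pipeline described above (Huber estimates of the class means and second moments, soft-thresholded centroids, and a Huber discriminant score) could look as follows. The helper names and data structures are our own, and the tuning parameters $H$, $h$, $\Lambda$ would in practice be chosen as discussed in the next subsection.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_estimate(z, r):
    """argmin_theta sum_i f_r(z_i - theta), with f_r the Huber loss in (5)."""
    z = np.asarray(z, dtype=float)
    loss = lambda t: np.sum(np.where(np.abs(z - t) <= r,
                                     (z - t) ** 2,
                                     2 * r * np.abs(z - t) - r ** 2))
    return minimize_scalar(loss).x

def rnsc_fit(X, y, H, lam):
    """Huber-based centroid/variance estimates with soft-thresholding (RNSC sketch)."""
    n, p = X.shape
    classes = np.unique(y)
    pi_hat = np.array([np.mean(y == c) for c in classes])
    mu = np.array([[huber_estimate(X[y == c, j], H) for j in range(p)] for c in classes])
    Q = np.array([[huber_estimate(X[y == c, j] ** 2, H) for j in range(p)] for c in classes])
    grand = pi_hat @ mu                               # robust grand mean per variable
    S2 = pi_hat @ (Q - mu ** 2)                       # pooled robust variance per variable
    keep = S2 > 0                                     # drop variables with non-positive variance
    S = np.sqrt(np.where(keep, S2, 1.0))
    shrunken = np.empty_like(mu)
    for k, c in enumerate(classes):
        n_k = np.sum(y == c)
        m_k = np.sqrt(1.0 / n_k - 1.0 / n)
        d = (mu[k] - grand) / (m_k * S)
        d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)   # soft-threshold as in (10)
        shrunken[k] = grand + m_k * S * d                   # shrunken centroid as in (11)
    return dict(classes=classes, pi=pi_hat, centroids=shrunken, S=S, keep=keep)

def rnsc_predict(model, Xnew, h):
    """Assign each row of Xnew to the class minimizing the Huber discriminant score (12)."""
    scores = []
    keep, S = model["keep"], model["S"]
    for k, _ in enumerate(model["classes"]):
        z = (Xnew[:, keep] - model["centroids"][k][keep]) / S[keep]
        f = np.where(np.abs(z) <= h, z ** 2, 2 * h * np.abs(z) - h ** 2)
        scores.append(f.sum(axis=1) - 2 * np.log(model["pi"][k]))
    return model["classes"][np.argmin(np.vstack(scores), axis=0)]
```

A usage example would be `model = rnsc_fit(X, y, H=2.0, lam=1.0)` followed by `rnsc_predict(model, Xtest, h=2.0)`, with the three parameters tuned by cross-validation rather than fixed.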
However, RNSC is more widely applicable than NSC, especially when the data are potentially heavy-tailed. In the next section, we rigorously show in theory that RNSC is consistent under much weaker tail conditions than NSC. The robustness comes at a very low price: as we demonstrate later in numerical studies, when heavy tails are not an issue, RNSC still works almost as well as NSC.
We want to remark that robust estimators other than the Huber estimator could be used to robustify the sample mean and variance. For example, the median-of-means estimator has similar robust properties to the Huber estimator [42,31,7,27,2]. We choose to use the Huber estimator in RNSC because, in addition to the estimation of the parameters, we also need a robust way to calculate the discriminant score. RNSC uses the Huber loss to replace the squared loss in the discriminant score for this purpose. The median-of-means estimator is not associated with a robust loss function that can be used in the discriminant score.
Finally, RNSC implicitly assumes that the marginal within-class variances are constant across classes so that the pooled estimator $\tilde S_j^2$ in (9) can be used. As suggested by the referee, we note that RNSC can be generalized to the heterogeneous case where the within-class variance is not constant. We explore this direction by developing the Heterogeneous Robust Nearest Shrunken Centroids method (HRNSC). Relevant results along this line are presented in Appendix A.

Choice of tuning parameters
RNSC has three tuning parameters, $H$, $h$, and $\Lambda$. We propose to tune them by two-step cross-validation. In the first step, we find the optimal $H$ by tuning the pair of parameters $(H, \Lambda)$ with the $\ell_2$-loss classifier by cross-validation. In the second step, we fix $H$ as chosen in the first step, and find the optimal $(h, \Lambda)$ by cross-validation. Note that $\Lambda$ is tuned twice because in both steps we need a reasonably large $\Lambda$ to construct a sparse classifier that minimizes the cross-validation error, but only the second choice of $\Lambda$ is used in the final classifier. This tuning procedure is used for RNSC in our reported numerical studies, with the number of folds set to 10. We admit that such a tuning procedure may lose some robustness, as $H$ is chosen according to the $\ell_2$ loss instead of a robust loss. If we could tune $(H, h, \Lambda)$ jointly and use the Huber loss throughout the cross-validation, the results could be better. However, we choose to tune $H$ and $h$ separately for computational reasons. Also, we observe that this tuning procedure leads to fairly good results in the numerical studies to be presented. In the future, it will be interesting to develop a robust tuning procedure. A related work is [9], in which the authors considered robust cross-validation in a low-dimensional nonparametric problem.
As suggested by the referee, an alternative approach for fitting RNSC is to apply the recent tuning-free principle by [55]. It is shown therein that we can solve an equation system to simultaneously find a suitable Huber loss parameter and the corresponding estimator. We describe such a method, referred to as the tuning-free RNSC (TF-RNSC), in Section 2.4. We want to remark though that TF-RNSC only avoids tuning on H, but still resorts to cross-validation on h and Λ.

The tuning-free RNSC
We describe the tuning-free RNSC (TF-RNSC) as a variant of RNSC that alleviates the computational cost of cross-validation when we construct the Huber estimates. TF-RNSC is based on the tuning-free principle in [55]. We first give a brief review of this principle.
Consider $m$ independent and identically distributed (i.i.d.) random variables $Z_1, \ldots, Z_m$ with mean $\xi$. For a given $r$, we can find the Huber estimator for $\xi$ as $\hat\xi_r = \arg\min_{\xi}\sum_{i=1}^{m} f_r(Z_i - \xi)$. In order to choose $r$ automatically, [55] propose to solve the following equation system in $(\xi, r)$:
$$\sum_{i=1}^{m}\psi_r(Z_i-\xi) = 0, \qquad \frac{1}{m}\sum_{i=1}^{m}\frac{\min\{(Z_i-\xi)^2, r^2\}}{r^2} = \frac{z}{m}, \qquad (13)$$
where $\psi_r(x) = \mathrm{sign}(x)\min(|x|, r) = \frac{1}{2}f'_r(x)$ is half of the derivative of the Huber loss and $z$ refers to a user-specified parameter that controls the confidence level. The authors suggest $z = \log(m)$ in their paper, which we adopt. As a result, we obtain a choice of $\hat r$ and the corresponding $\hat\xi_{\hat r}$ without tuning.
In TF-RNSC, we do not construct estimates according to (8), which requires user-specified Huber loss parameters. Instead, when we estimate the Class $k$ centroid of the $j$-th variable, we obtain $\tilde X^{TF}_{\cdot jk}$ by solving (13) with $\{Z_i\} = \{X_{ij} : i \in C_k\}$. Similarly, we estimate the Class $k$ second moment of the $j$-th variable by $\tilde Q^{TF}_{jk}$, which solves (13) with $\{Z_i\} = \{X_{ij}^2 : i \in C_k\}$. Then TF-RNSC proceeds much the same as RNSC, with the only exception that $\tilde X^{TF}_{\cdot jk}$ and $\tilde Q^{TF}_{jk}$ are used in the place of $\tilde X_{\cdot jk}$ and $\tilde Q_{jk}$. With these tuning-free estimates, we further apply cross-validation to choose $(h, \Lambda)$ in the final classifier. The tuning-free principle does not apply in this part, as our ultimate goal is prediction instead of estimation. In this sense, we cannot completely avoid tuning in RNSC even with the assistance of the tuning-free principle. But we abuse the terminology and refer to the resulting classifier as tuning-free RNSC (TF-RNSC) to distinguish it from our proposal of RNSC.
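The sketch below shows one way to implement a tuning-free Huber mean in the spirit of [55], alternating between the Huber estimating equation for $\xi$ and a censored second-moment equation for $r$ as in (13). The exact iteration and stopping rule in [55] may differ, so this should be read as an assumption-laden illustration rather than a faithful reproduction of that algorithm.

```python
import numpy as np

def tuning_free_huber_mean(z, max_iter=50, tol=1e-8):
    """Tuning-free Huber mean in the spirit of [55] (our reading; an assumption).

    Alternates between (i) the Huber estimating equation sum_i psi_r(z_i - xi) = 0,
    solved by iterative reweighting, and (ii) a censored second-moment equation
    choosing r so that mean(min((z_i - xi)^2, r^2)) / r^2 = z0 / m, with z0 = log(m).
    """
    z = np.asarray(z, dtype=float)
    m = len(z)
    z0 = np.log(m)
    xi, r = np.median(z), np.std(z) + 1e-12
    for _ in range(max_iter):
        resid2 = (z - xi) ** 2
        # (ii) update r by solving the censored second-moment equation.
        g = lambda rr: np.mean(np.minimum(resid2, rr ** 2)) / rr ** 2 - z0 / m
        hi = np.sqrt(resid2.max()) + 1e-8
        if g(hi) > 0:
            # The root lies beyond the largest residual, where it has a closed form.
            r_new = np.sqrt(np.mean(resid2) * m / z0)
        else:
            lo = 1e-8
            for _ in range(60):              # bisection; g is non-increasing in r
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
            r_new = hi
        # (i) update xi: weighted mean with Huber weights min(1, r/|z - xi|).
        xi_new = xi
        for _ in range(100):
            w = np.minimum(1.0, r_new / np.maximum(np.abs(z - xi_new), 1e-12))
            xi_next = np.sum(w * z) / np.sum(w)
            if abs(xi_next - xi_new) < tol:
                xi_new = xi_next
                break
            xi_new = xi_next
        if abs(xi_new - xi) < tol and abs(r_new - r) < tol:
            return xi_new, r_new
        xi, r = xi_new, r_new
    return xi, r

# Example: robust to a gross outlier without choosing r by hand.
print(tuning_free_huber_mean(np.array([0.2, -0.5, 0.1, 0.4, -0.3, 25.0])))
```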

Theoretical results
In this section, we study the properties and benefits of RNSC under the high-dimensional setting where $p$ could be much larger than $n$. For variable selection, we show that RNSC consistently recovers the set of all important variables. For the classifier, we first show that with high probability the estimated classifier gives the same result as the true one. Both the variable selection and prediction results require only a mild condition that covers many heavy-tailed distributions. We then consider the special case of Gaussian data, where the prediction error of RNSC converges to the Bayes error.
We only present the theoretical results for RNSC, but we conjecture that TF-RNSC has similar properties that can be proved with simple modifications to our proofs for RNSC. The theoretical properties of RNSC are a consequence of the convergence of $\tilde X_{\cdot jk}$ and $\tilde Q_{jk}$ in Propositions 2 & 3 (Appendix B.1). For TF-RNSC, $\tilde X^{TF}_{\cdot jk}$ and $\tilde Q^{TF}_{jk}$ have similar properties according to Theorems 2.1 & 2.2 in [55].
Similar to NSC, RNSC can be used as a heuristic classifier without model assumptions. However, for the sake of theoretical studies, we introduce an intuitive model that allows us to define the set of important variables. We assume that the number of classes $K$ is fixed and $\Pr(Y = k) = \pi_k \in (0, 1)$. We further assume the following model for $X$ given $Y$:
$$X = \mu_k + \varepsilon \quad \text{given } Y = k, \qquad (15)$$
where $\mu_k = (\mu_{1k}, \ldots, \mu_{pk})^T \in \mathbb{R}^p$ is the conditional mean for Class $k$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_p)^T \in \mathbb{R}^p$ is the noise term. For any $j \in \{1, \ldots, p\}$, $\varepsilon_j$ follows a distribution $g_j$ with mean 0 and variance $\sigma_j^2$. This model is also studied in [35], but the main focus of that paper is variable transformation. In comparison, we work with the original data without transformation.
Throughout the rest of this section, let C be a generic positive constant that can vary in different places. The following assumptions are needed to obtain the results of variable selection and classifier convergence. Recall that g j is the probability density function for ε j .
All these assumptions are very mild. Assumptions (A1) and (A2) guarantee that the mean and variance do not go to infinity as the dimension increases. These two technical conditions simplify our calculation, but, if needed, we can also allow ζ, u, U to diverge with n, p at the price of more tedious proofs. Assumption (A3) bounds π k away from 0 and 1, which ensures that we have a decent sample size for each Class k. Assumptions (A4) and (A5) are regularity conditions on the distribution of ε j . We only assume that g j 's are bounded and ε j 's have uniformly bounded fourth moments. These assumptions are much weaker than the popular sub-Gaussian assumption in the high-dimensional statistics literature, and allow the application of RNSC to heavy-tailed data.
We first present the theoretical result on variable selection. We start by defining the set of important variables. Under model (15), for any $k_1, k_2 \in \{1, \ldots, K\}$, the conditional distributions of $X_j$ given $Y = k_1$ and given $Y = k_2$ coincide for all $x \in \mathbb{R}$ if and only if $\mu_{jk_1} = \mu_{jk_2}$. Thus, the $j$-th predictor is an important variable if and only if there exist $k_1 \neq k_2$ such that $\mu_{jk_1} \neq \mu_{jk_2}$. We define the set of important variables as
$$\mathcal{D} = \{j : \mu_{jk_1} \neq \mu_{jk_2} \text{ for some } k_1 \neq k_2\}. \qquad (16)$$
We also denote the number of important predictors as $q = |\mathcal{D}|$. The estimator for $\mathcal{D}$ is
$$\hat{\mathcal{D}} = \{j : \tilde\mu_{jk_1} \neq \tilde\mu_{jk_2} \text{ for some } k_1 \neq k_2\}. \qquad (17)$$
Further define
$$N_j = \min_{k_1, k_2:\, \mu_{jk_1} \neq \mu_{jk_2}} |\mu_{jk_1} - \mu_{jk_2}|, \qquad N_0 = \min_{j \in \mathcal{D}} N_j \qquad (18)$$
as the smallest nonzero mean difference. We have the following theorem.
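In code, the estimator (17) simply collects the variables whose shrunken centroids are not all equal across classes; a minimal helper operating on a $(K, p)$ array of shrunken centroids (such as the one returned by the RNSC sketch above) might be:

```python
import numpy as np

def estimated_important_set(shrunken_centroids):
    """Indices j whose shrunken centroids differ across classes, i.e., the set in (17)."""
    spread = shrunken_centroids.max(axis=0) - shrunken_centroids.min(axis=0)
    return np.flatnonzero(spread > 0)
```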

Theorem 1.
Assume that (A1)-(A5) hold. Let $N_0$ be defined as in (18). Then we have the following conclusions: 2. Furthermore, assume that $n \to \infty$ and $\log p = o(n)$. Then $\Pr(\hat{\mathcal{D}} = \mathcal{D}) \to 1$.

Theorem 1 shows that, with a proper choice of the tuning parameters $H$ and $\Lambda$, RNSC exactly recovers the set of important variables with probability tending to 1 even when $\log p = o(n)$, which is usually the best we can hope for in high dimensions. Also, recall that this is achieved without the sub-Gaussian or sub-exponential assumption. NSC is unlikely to have the same property, as it relies on the sample mean, and similar concentration inequalities cannot be derived without sub-Gaussian/sub-exponential assumptions.
Next, we show that RNSC consistently estimates the classifier as well. Apparently, the classifier defined in (12) intends to approximate its population counterpart
$$\delta_h(X^*) = \arg\min_{k}\Big\{\sum_{j=1}^{p} f_h\Big(\frac{X^*_j - \mu_{jk}}{\sigma_j}\Big) - 2\log\pi_k\Big\}.$$
Let the training data set be $(\mathbf{Y}^{tr}, \mathbf{X}^{tr})$. The following theorem shows that this approximation is very accurate.
Theorem 2 implies that the estimated classifier converges to the truth as long as the fourth moment of $X$ exists. This moment condition includes many heavy-tailed distributions that are frequently encountered. Therefore, the results in Theorem 2 guarantee that classification problems with heavy-tailed data or data with outliers can be handled by RNSC. We note, though, that for any fixed $h$ we need $q^{5/2}\log p / n \to 0$. This requirement is stronger than that for classifiers with the normality assumption [8], which can often handle $q\log p / n \to 0$. RNSC needs the model to be sparser (i.e., $q$ to be smaller) because $X^*$ can be heavy-tailed. Even though we can consistently identify $\mathcal{D}$ and estimate $\mu_k$ and $\sigma_j^2$ accurately, $X^*$ could inflate the estimation error in the discriminant score. However, in what follows we show that, if the data are not heavy-tailed, the stronger requirement on $q$ is not needed for RNSC to be consistent in prediction.
Consider the special case where $\varepsilon \sim N(0, \Delta)$ with $\Delta$ being diagonal. Our model in (15) then reduces to the diagonal linear discriminant analysis (LDA) model. Consequently, NSC can be viewed as an estimate of the so-called Bayes rule
$$\delta_{\mathrm{bayes}}(X^*) = \arg\min_{k}\Big\{\sum_{j=1}^{p}\frac{(X^*_j - \mu_{jk})^2}{\sigma_j^2} - 2\log\pi_k\Big\},$$
which is the optimal classifier [24]. We show that RNSC also consistently estimates the Bayes rule under the normality assumption. In other words, we gain improved robustness with little loss of efficiency when the data are not heavy-tailed. We denote by $\phi$ the probability density function of a standard normal random variable.
Theorem 3. Assume that (A1)-(A3) hold. Further assume that $\varepsilon_j \sim N(0, \sigma_j^2)$ independently. Then we have the following conclusions:

Theorem 3 shows that RNSC gives a similar prediction as the Bayes classifier under the normality and independence conditions. The dimensionality is allowed to diverge at the rate $q\log p / n = o(1)$. This rate is identical to that of LDA methods, which explicitly assume that the data are Gaussian. Hence, RNSC is expected to work as well as the less robust methods when there are no heavy tails.
As suggested by the referee, we further compare our theoretical results with sparse linear discriminant analysis (LDA) methods [e.g., 8, 13, 56, 17, 38, 36]. To start, we note that our model in (15) includes the LDA model as a special case. The LDA model requires that $X$ is normal given $Y$ [24]. If we assume that $\varepsilon$ follows a multivariate normal distribution, (15) reduces to the LDA model. The normality assumption is critical for the theoretical study of sparse LDA methods, as most of them rely on the sub-Gaussian property to show consistency in ultra-high dimensions. In our Theorems 1 & 2, however, we do not make the normality assumption and thus obtain theoretical results under weaker assumptions than the LDA model.
Another difference between LDA and RNSC is the way they handle correlations. RNSC (and its predecessor, NSC) is a distance-based classifier that calculates the distance in a coordinate-wise manner. It makes no attempt to model the correlation among $\varepsilon$. The classifier and the active set $\mathcal{D}$ are fully determined by the within-class means and variances. Theorems 1 & 2 do not make assumptions on the correlation structure either, as RNSC converges to its population counterpart regardless of the correlation structure. On the other hand, LDA explicitly exploits the correlation among variables. For example, if $K = 2$, LDA aims to estimate $\beta = \Sigma^{-1}(\mu_2 - \mu_1)$, where $\Sigma$ is the within-class covariance matrix of $X$, and we need to select the set $S = \{j : \beta_j \neq 0\}$ [8, 17, 38].
These existing works also show that D can be quite different from S. In this sense, when some special correlation exists and the normality assumption holds, RNSC could be sub-optimal to LDA even on the population level. But this suboptimality is the price we pay for the extra flexibility in analyzing heavy-tailed data that does not satisfy the LDA model.
However, there is one special case where RNSC and sparse LDA are closely related, which is studied in Theorem 3. If $\varepsilon_j \sim N(0, \sigma_j^2)$ independently, the model in (15) becomes a diagonal LDA model (i.e., the LDA model with a diagonal covariance matrix). As such, RNSC and LDA intend to estimate the same Bayes rule. Theorem 3 shows that, under suitable conditions, RNSC indeed converges to the Bayes rule under the diagonal LDA model.

Simulation results
We present some simulation results to demonstrate the performance of RNSC. We consider several models to study binary-class and multi-class problems in heavy-tailed and outlier scenarios. For all models, we set the dimension $p = 2000$ and sample size $n = 100K$, where $K$ is the number of classes. We assume the model in (15). We consider two sets of parameters in Models A and B, combined with four different distributions for $\varepsilon_j$ in Models X.1-X.4, where X = A or B. The parameters are given as follows. For each selection of parameters, we consider four settings.
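The exact parameter configurations of Models A and B are not reproduced here; the sketch below only illustrates how data can be generated from model (15) with a sparse mean-difference pattern and heavy-tailed t noise. Every constant in it (mean shift, degrees of freedom, number of important variables) is an illustrative placeholder, not one of the paper's settings.

```python
import numpy as np

def simulate_model15(n_per_class=100, p=2000, K=2, q=10, shift=0.6, df=3, seed=0):
    """Generate (X, y) from model (15): X = mu_k + eps given Y = k, with t_df noise.

    The first q variables are 'important' (their means differ across classes);
    all constants are illustrative, not the paper's exact simulation settings.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros((K, p))
    for k in range(K):
        mu[k, :q] = shift * k          # sparse mean differences on the first q variables
    y = np.repeat(np.arange(K), n_per_class)
    eps = rng.standard_t(df, size=(K * n_per_class, p))   # heavy-tailed noise term
    X = mu[y] + eps
    return X, y

X, y = simulate_model15()
print(X.shape, np.bincount(y))
```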
We consider our proposals of RNSC and TF-RNSC (defined in Section 2.4) in the numerical studies. Other competitors include the original NSC method by [49], the median-based classifier (Med; [23]), sparse optimal scoring (SOS; [13]), the support vector machine with SCAD penalty (SVM-s; [57]) and $\ell_1$-penalized logistic regression (Logistic; [24]). NSC is implemented by the R package pamr. SVM-s is implemented by the R package penalizedSVM, and SVM-s in Model B is performed by the one-vs-one method. The $\ell_1$-penalized logistic regression is implemented by the R package glmnet. In multi-class problems, SOS is implemented by the R package sparseLDA, while in the binary case, SOS is implemented by the R package TULIP [45] by exploiting an equivalence between SOS and direct sparse discriminant analysis [38, 37]. We consider 200 replicates for each model and report the results in Table 1. From the prediction errors in Table 1, we can see that in Models A.1 & B.1 our proposal performs slightly worse than NSC. This is because the noise term is normal, and NSC is a more efficient estimate of the Bayes rule; RNSC loses a little efficiency because it uses robust estimates. But RNSC has noticeable advantages in all the other settings, where the data have heavy tails or extreme outliers. Moreover, RNSC successfully selects all important variables in most replicates in all models, whereas the other competitors are likely to miss at least one or two important variables in most replicates. Such results support the application of RNSC in practice, where normality assumptions could be too stringent. They also support our theoretical finding that RNSC is consistent even when sub-Gaussian assumptions are not met.
On the other hand, TF-RNSC consistently gives a competitive performance in all the simulation settings. Even in Models A.1 & B.1 where the normality assumption holds, TF-RNSC is comparable to NSC. In all the models except for Model A.4, TF-RNSC is either comparable to or better than RNSC. Such a fact demonstrates the potential of TF-RNSC.

Application to a real dataset
In this section, we further demonstrate the performance of RNSC on the lung cancer data [22]. This dataset contains 181 patient samples with 12533 gene expression levels as predictors. The response has two classes, either mesothelioma (MPM) or adenocarcinoma (ADCA). These two classes refer to two types of lung-cancer-related conditions, where MPM is relatively rare and ADCA is much more common [22].
In the analysis, the dataset is randomly split into two parts, with the training set containing 30% of the samples (54 patients) and the testing set containing the rest. We fit classifiers by RNSC and all the competitors included in Section 4. For RNSC, we choose the tuning parameters by 10-fold cross-validation in the same way as in Section 4. The classification errors are evaluated on the testing set. We repeat this procedure 100 times. The results are reported in Table 2. It can be seen that RNSC has better prediction accuracy than all the other methods. TF-RNSC is the second best, but the one-sided paired t-test suggests that the difference between RNSC and TF-RNSC is borderline insignificant (p-value = 0.054). However, RNSC is significantly better than all the other methods at the 0.05 level, while TF-RNSC is not significantly better than NSC (p-value = 0.2513) or Med (p-value = 0.1569). In what follows, we focus on the results of RNSC. The favorable performance of RNSC can be partially explained by the heavy tails or the presence of outliers in the data. For simplicity, we say that a gene is "frequently selected" (FSG for short) if it is selected in at least 70 of the 100 replicates. Eleven genes are frequently selected by RNSC but not by the other methods. For the sake of space, we plot the empirical probability density functions of 4 of them in Figure 1. It can be seen that there are outliers far away from the bell shape in these 4 gene expressions when the patients are detected as ADCA. The expressions of genes 39409_at and 39795_at tend to be heavy-tailed when the patients are detected as MPM. Hence, it is more difficult for methods such as NSC to detect them.
We further check the empirical excess kurtosis (EEK) of the genes selected by RNSC and NSC. The EEK is defined as the empirical kurtosis minus 3, where 3 is the kurtosis of the normal distribution. A large EEK is indicative of heavy tails or the presence of outliers. The EEKs of the genes within each class are presented in Figure 2. These genes are either selected by NSC or uniquely selected by RNSC. Within the class of ADCA, the EEKs of the expressions of the selected genes are generally large. Thus, the robust method of RNSC is more appropriate on this dataset and produces more accurate classification results.
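For reference, the empirical excess kurtosis used above is a standard moment ratio; a minimal computation (without any small-sample bias correction) looks like the following.

```python
import numpy as np

def excess_kurtosis(x):
    """Empirical excess kurtosis: fourth standardized moment minus 3 (the normal value)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    m2 = np.mean(z ** 2)
    m4 = np.mean(z ** 4)
    return m4 / m2 ** 2 - 3.0

# Example: a heavy-tailed sample has clearly positive excess kurtosis.
rng = np.random.default_rng(1)
print(excess_kurtosis(rng.standard_normal(5000)))   # close to 0
print(excess_kurtosis(rng.standard_t(3, 5000)))     # clearly positive
```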
Finally, there is evidence that the genes uniquely selected by RNSC are associated with MPM. We list several of these genes and the publications that point out their association with the disease. WT1 products are found to be a possible marker for MPM by [1]. [30] observes an elevated expression level of PEA-15 in MPM cells. [29] finds over-expression of SEMA3C in MPM cells. It is observed by [14] that there are significant gene expression differences in GFPT2 between ADCA and MPM. PTGIS is deregulated with statistical significance in at least one cell line in [41]. These studies support that the variables identified by RNSC are biologically meaningful as well.

Discussion
We propose RNSC as a robust high-dimensional classifier. Numerical and theoretical results suggest that RNSC is a competitive classifier on a wide range of classification problems. The improved robustness of RNSC is a result of employing the Huber loss. We use the Huber loss because it is naturally related to the $\ell_2$ loss in the original NSC method, and it strikes a delicate balance between efficiency and robustness. We note, though, that there are other approaches to high-dimensional robust statistics. For example, estimation for conditional median regression was proposed based on different regression models [e.g., 3, 16, 54]. [10] developed a truncation-type estimator. It is beyond the scope of this paper to conduct an exhaustive study of robust extensions of NSC based on these works, but such an investigation is expected to be an interesting future research topic.

Appendix A: The heterogeneous robust nearest shrunken centroids method
In NSC and RNSC, the pooled sample variances $S_j^2$ and $\tilde S_j^2$ are used in standardization, which corresponds to the assumption in Section 3 that the true variance $\sigma_j^2$ of each $X_j$ stays constant across classes. However, if the true variances are not constant, the shrinkage procedure in (10) and the estimation of (16) may need to be modified. Hence, we consider a generalization of RNSC to heterogeneous problems, referred to as the heterogeneous robust nearest shrunken centroids classifier (HRNSC).

A.1. Methodology
Similar to the idea of RNSC, we aim to develop an HRNSC classifier of the form
$$\hat\delta^{\mathrm{H}}_h(X^*) = \arg\min_{k}\Big\{\sum_{j=1}^{p}\Big[f_h\Big(\frac{X^*_j - \hat X_{\cdot jk}}{\hat S_{jk}}\Big) + 2\log\hat S_{jk}\Big] - 2\log\hat\pi_k\Big\}, \qquad (27)$$
where $X^*$ is a new observation, and $\hat X_{\cdot jk}$, $\hat S_{jk}$ are robust estimates of the within-class mean and variance that will be formally defined later. Note that HRNSC differs from RNSC in that we have a class-specific $\hat S_{jk}$ and the additional term $2\log\hat S_{jk}$ in the classifier. The class-specific $\hat S_{jk}$ models the potential heterogeneity in the data. The term $2\log\hat S_{jk}$ makes the HRNSC classifier identical to quadratic discriminant analysis (QDA) when we make the normality assumption and let $h = \infty$. For some recent developments of sparse QDA, see [e.g., 44, 32, 18, 26].
To construct $\hat X_{\cdot jk}$ and $\hat S_{jk}$, we start by finding some robust initial estimates. Let $\tilde X_{\cdot j}$, $\tilde X_{\cdot jk}$, $\tilde Q_{jk}$ and $\tilde S_j^2$ be defined as in (8) and (9), and denote
$$\tilde S_{jk}^2 = \tilde Q_{jk} - \tilde X_{\cdot jk}^2 \qquad (28)$$
as the initial estimator for the within-class variance. These estimates involve minimizers of the Huber loss and are thus robust, but they do not induce sparsity in the final estimator. To enforce sparsity, we further consider shrunken versions of them.
We note that a predictor $X_j$ is not important in (27) if and only if its within-class means and variances are both constant across $k$. Consequently, we need to shrink both $\tilde X_{\cdot jk}$ and $\tilde S_{jk}^2 = \tilde Q_{jk} - \tilde X_{\cdot jk}^2$ towards the pooled mean and variance, respectively. In other words, we need to shrink $\tilde X_{\cdot jk} - \tilde X_{\cdot j}$ and $\tilde S_{jk}^2 - \tilde S_j^2$ to zero. Recall that, in RNSC, the shrinkage is scaled by a robust estimate of the standard error of each estimate. We use the same technique in HRNSC. Exact formulas for the standard errors of $\tilde X_{\cdot jk}$ and $\tilde S_{jk}^2$ are difficult to derive, but we can use some surrogates as follows.
Recall that $\bar X_{\cdot jk}$ is the sample mean within Class $k$ and $\bar X_{\cdot j}$ is the pooled sample mean for the $j$-th predictor. We also define $S_{jk}^2 = \frac{1}{n_k - 1}\sum_{i\in C_k}(X_{ij} - \bar X_{\cdot jk})^2$ as the sample variance within Class $k$ and $\bar S_j^2 = \sum_{k=1}^{K}\hat\pi_k S_{jk}^2$ as their average. These sample estimates are different from our Huber estimates, but their standard errors are easy to obtain, and we use robust estimates of their standard errors to scale the shrinkage of the Huber estimates. This approach is only an approximation, but it yields reasonably good performance in our numerical studies. To this end, we have the following lemma concerning the variances of $\bar X_{\cdot jk} - \bar X_{\cdot j}$ and $S_{jk}^2 - \bar S_j^2$. The proof of this lemma is in Section A.3.

Lemma 1.
For $X_{ij}$, assume that $X_{ij}$ has mean $\mu_{jk}$ and variance $\sigma_{jk}^2$ given $Y_i = k$. Then, for the random variables $\bar X_{\cdot jk} - \bar X_{\cdot j}$ and $S_{jk}^2 - \bar S_j^2$ defined as above, the conditional variances given $n_1, \ldots, n_K$ are given in (29) and (30), respectively.

To estimate the right-hand sides of (29) & (30) for heavy-tailed data, we again resort to Huber estimates. According to (8), (28), and (9), we have $\tilde X_{\cdot jk}$ as the estimate for $\mu_{jk}$, $\tilde Q_{jk}$ for $M_{2jk}$, $\tilde S_{jk}^2$ for $\sigma_{jk}^2$, and $\tilde S_j^2$ for $\sum_{k=1}^{K}\pi_k\sigma_{jk}^2$, respectively. We further define the analogous Huber estimators $\tilde M_{bjk}$ for the third and fourth moments $M_{bjk}$ with $b \in \{3, 4\}$. Then we obtain $\tilde v_{jk}^2$ as the robust estimator for $\mathrm{Var}[S_{jk}^2 - \bar S_j^2 \mid n_1, \ldots, n_K]$. Based on this conclusion, we consider the following shrunken estimator, with a tuning parameter $\Lambda_2 \ge 0$: the difference $\tilde S_{jk}^2 - \tilde S_j^2$, standardized by $\tilde v_{jk}$, is soft-thresholded at $\Lambda_2$, and the resulting shrunken Huber estimator for the variance $\sigma_{jk}^2$ within Class $k$ is given in (34). To scale $\tilde X_{\cdot jk} - \tilde X_{\cdot j}$, we again propose a robust estimator for the variance of $\bar X_{\cdot jk} - \bar X_{\cdot j}$ as a surrogate. As the robust shrunken estimator for the within-class variance is obtained in (34), we now use $\hat S_{jk}^2$ as the robust estimator for $\sigma_{jk}^2$; this yields a surrogate for the variance of $\bar X_{\cdot jk} - \bar X_{\cdot j}$ according to Lemma 1. Then, following the same shrinking procedure as for $\tilde X_{\cdot jk}$ in Section 2.2, we obtain the shrunken Huber estimator $\hat X_{\cdot jk}$ for the centroid of $X_j$ within Class $k$ in (36). In HRNSC, we plug (34) & (36) into (27). Compared to RNSC, HRNSC adapts to non-constant within-class variances.
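To illustrate how the class-specific variances and the extra $2\log\hat S_{jk}$ term enter the HRNSC rule (27), the sketch below evaluates the discriminant score given already-computed shrunken centroid and variance estimates; it deliberately takes those estimates as inputs rather than reproducing the shrinkage steps (34) and (36), and all names are illustrative.

```python
import numpy as np

def hrnsc_predict(Xnew, centroids, S2, priors, h):
    """Assign each row of Xnew to the class minimizing the heterogeneous Huber score (27).

    centroids: (K, p) shrunken robust centroids; S2: (K, p) shrunken robust
    within-class variances; priors: (K,) class proportions; h: Huber parameter.
    The estimates themselves would come from the shrinkage steps (34) and (36).
    """
    Xnew = np.atleast_2d(Xnew)
    K = centroids.shape[0]
    scores = np.empty((K, Xnew.shape[0]))
    for k in range(K):
        S = np.sqrt(S2[k])
        z = (Xnew - centroids[k]) / S
        f = np.where(np.abs(z) <= h, z ** 2, 2 * h * np.abs(z) - h ** 2)   # Huber loss
        scores[k] = (f + 2 * np.log(S)).sum(axis=1) - 2 * np.log(priors[k])
    return np.argmin(scores, axis=0)
```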

A.2. Numerical analysis
We have three sets of numerical results for HRNSC. We first apply HRNSC to Models A.X and B.X in Section 4 to compare it with the methods in Table 1. Then we construct Models C.X by modifying Models A.X, changing the variances of the five predictors following the important variables. To be specific, in Model A.X we let the within-class variance of $X_j$ in Class $k$ be $k$ times its original value for any $10 < j \le 15$. Lastly, we check the performance of HRNSC on the real dataset previously discussed in Section 5.
HRNSC has more tuning parameters than the existing methods. To save computation time, we employ tuning-free Huber estimators in HRNSC and use a grid search with 10-fold cross-validation to determine the parameters $\Lambda_1$, $\Lambda_2$ and $h$. The results of 200 replicates for Models A.X and B.X, and for Models C.X, are given in Tables 3 and 4, respectively. Moreover, we apply HRNSC to the lung cancer dataset introduced in Section 5 for 100 replicates. On average, HRNSC has a prediction error of 2.0% (0.72), 149 (45.0) selected genes, and 17 frequently selected genes (standard errors in parentheses).
It can be seen from both the simulation and the real data study that HRNSC outperforms NSC, but it does not have a clear advantage over RNSC. One potential reason is that, given the high dimension and the low sample sizes, estimating too many parameters may bring in excessive variability. Another possibility is that HRNSC may be intrinsically less robust than RNSC because it needs to estimate third and fourth moments. Therefore, we believe that the full development of HRNSC is worth further investigation, which is beyond the scope of the current paper.

A.3. Proof of Lemma 1
We prove Lemma 1 in this section. The following proposition is necessary in the proof.

Proposition 1. Let $Z_1, \ldots, Z_m$ be i.i.d. random variables with variance $\sigma^2$ and finite fourth central moment $\mu_4$, and let $S^2 = \frac{1}{m-1}\sum_{i=1}^{m}(Z_i - \bar Z)^2$, where $\bar Z$ is the sample mean. Then we have the following property: $\mathrm{Var}(S^2) = \frac{1}{m}\big(\mu_4 - \frac{m-3}{m-1}\sigma^4\big)$.

Proof of Lemma 1. For $\bar X_{\cdot jk} - \bar X_{\cdot j}$, as the $X_{ij}$ are independent over $i$, a direct calculation of the conditional variance gives (29). For $S_{jk}^2 - \bar S_j^2$, we first express the difference in terms of the class-wise sample variances. Since the $S_{jk}^2$, $k = 1, \ldots, K$, are independent, the conditional variance decomposes over classes. By Proposition 1, we have the conditional variance of each $S_{jk}^2$ for all $k$. Therefore, by straightforward calculation, we obtain (30) in Lemma 1.

Appendix B: Proof of theoretical results
To prove Theorems 1, 2 and 3, we first show in Section B.1 that the Huber estimator $\tilde X_{\cdot jk}$ converges to the true mean $\mu_{jk}$ and that $\tilde S_j$ converges to the true standard deviation $\sigma_j$. Then, the variable selection property and the convergence of the classifier are justified.
In this part, to give details of the proofs and to simplify notation, we define several auxiliary quantities. The terms that contain constants such as $C_1$, $C_2$, $m_0$, $u$, $U$, $V$ and $\zeta$ will eventually be merged into the generic constant $C$.
To simplify the expressions in our discussion, we replace $\delta$ in the original result when stating Proposition 2. The same expression will be applied in the following propositions.
The result in Proposition 2 can be generalized to the Huber estimators for the variance. As mentioned after Theorem 5 in [19], we have the following proposition.

Proposition 3 (The discussion following Theorem 5 in [19]).

For any fixed $j \in \{1, \ldots, p\}$, any class $k$ and $i \in C_k$, the $\{X_{ij}\}$ are independent and identically distributed. We can then generalize the conclusions in Propositions 2 & 3 to obtain the concentration inequalities for $\tilde X_{\cdot jk}$ and $\tilde S_j$. Considering the fact that $\tilde X_{\cdot jk}$ and $\tilde S_j$ depend on $n_k$ and $\hat\pi_k$, it is necessary to introduce the following proposition to bound them.

Proposition 4 (Hoeffding's inequality). Let $n_k$ be the sum of $n$ independent and identically distributed Bernoulli random variables with success probability $\pi_k$, and let $\hat\pi_k = n_k/n$ be the estimator for $\pi_k$. Then, for any $k$ and $\epsilon > 0$, we have $\Pr(|\hat\pi_k - \pi_k| \ge \epsilon) \le 2\exp(-2n\epsilon^2)$.
Combining Propositions 2 & 4, we have the following result for $\tilde X_{\cdot jk}$.

Lemma 2. For any $j \in \{1, \ldots, p\}$, any class $k$ and $i \in C_k$, let $\tilde X_{\cdot jk}$ defined in (8) be the Huber estimator for the class mean $\mu_{jk}$. Then, for a suitable choice of $H$, the corresponding concentration inequality holds.

Proof of Lemma 2. By Proposition 2, we have a concentration inequality for $\tilde X_{\cdot jk}$ conditional on $Y$. Define $A = \{n_k : n_k \ge n\pi_k/2\}$. Then the marginal bound follows by combining the conditional bound with Proposition 4 on the event $A$.

By applying the same strategy to the Huber estimator for $Q_{jk} = E X_{ij}^2$, we obtain the following result.

Lemma 3.
For any $j \in \{1, \ldots, p\}$, any class $k$ and $i \in C_k$, let $\tilde Q_{jk}$ defined in (8) be the Huber estimator for the second moment $Q_{jk} = E X_{ij}^2$. Then the corresponding concentration inequality holds.

Proof of Lemma 3. By Proposition 3, we have a concentration inequality for $\tilde Q_{jk}$ conditional on $Y$. Then, following the same method as in the proof of Lemma 2, we obtain the result.

Combining the results in Lemmas 2 & 3, we can now obtain the concentration inequality for $\tilde S_j$, where $C_1$ is defined in (40).

By the results in Proposition 4 and Lemmas 2 & 3, we can show the following lemma (Lemma 4). This lemma helps us merge several events that will be used simultaneously in later proofs into one. We can also show the convergence of $\tilde S_j$ by using it.

Proof of Lemma 5. This is a direct consequence of Lemma 4.

B.2. Proof of Theorem 1
With the setup above, we can now give the variable selection result. To prove $\hat{\mathcal{D}} \to \mathcal{D}$, we first show that for all $j \in \mathcal{D}^c$, all $\tilde d'_{jk}$ shrink to 0 with a proper choice of $\Lambda$, which indicates $\tilde\mu_{jk} = \tilde X_{\cdot j}$ for all $k$. Then we show that, on the other hand, when $j \in \mathcal{D}$, at least two of the $\tilde\mu_{jk}$ are different for some $k_1$ and $k_2$ with probability converging to 1.
We have the following lemma for the case where $j \in \mathcal{D}^c$.
Proof of Lemma 6. For a given class $k$, according to (A3), all $\pi_k$ are bounded away from 0 and 1. Thus, given a proper constant $C$ for $m_0 = C/n$, we have $m_k \ge m_0$ for all $k$. Then, with the choice of $\Lambda \ge 8\epsilon/(m_0 w)$, we have $8\epsilon < m_k \tilde S_j \Lambda$ when $|\tilde S_j - \sigma_j| \le C_1\epsilon$ holds. Moreover, for $j \in \mathcal{D}^c$, $\mu_{j1} = \cdots = \mu_{jK} = \mu_j$. By the definition of $\tilde X_{\cdot j}$ and Lemma 4, combining the results above, we finally obtain the desired bound.
Proof of Lemma 7. Among all the Huber mean estimators $\{\tilde X_{\cdot j1}, \ldots, \tilde X_{\cdot jK}\}$ for any $j \in \mathcal{D}$, we pick the smallest one, denoted $\tilde X_{\cdot jk_a}$, and the largest one, denoted $\tilde X_{\cdot jk_b}$. Since $\tilde X_{\cdot j} = \sum_{k=1}^{K}\hat\pi_k\tilde X_{\cdot jk}$, we have $\tilde X_{\cdot jk_a} \le \tilde X_{\cdot j} \le \tilde X_{\cdot jk_b}$. Since the threshold function $(x)_+$ is convex and satisfies the property $(ax)_+ = a(x)_+$ for any nonnegative real number $a$, we obtain the stated inequality. For any $j \in \mathcal{D}$, if the event $\{|\tilde X_{\cdot jk} - \mu_{jk}| < 4\epsilon\}$ holds for all $k$, the event $\{|\tilde S_j - \sigma_j| < C_1\epsilon\}$ holds, and Assumptions (A1) and (A2) hold, we have $Cm_0 w\Lambda \ge m_{k_a}\tilde S_j\Lambda + m_{k_b}\tilde S_j\Lambda$ for some sufficiently large constant $C$. If $\mu_{jk_b} - \mu_{jk_a} < 0$, then $\mu_{jk_b} - \mu_{jk_a} \le -N_j$ by definition. Thus, under the condition given in the lemma that $\epsilon \le N_j/16$, we have $\tilde X_{\cdot jk_b} - \tilde X_{\cdot jk_a} < 0$, which contradicts our choices of $\tilde X_{\cdot jk_b}$ and $\tilde X_{\cdot jk_a}$. On the other hand, if $\mu_{jk_b} = \mu_{jk_a}$, then since $j \in \mathcal{D}$, there is another $\mu_{jk_c}$ such that $\mu_{jk_c} \neq \mu_{jk_b}$. If $\mu_{jk_c} > \mu_{jk_b}$, then $\mu_{jk_c} - \mu_{jk_a} \ge N_j$ by definition, and it follows that $\tilde X_{\cdot jk_c} > \tilde X_{\cdot jk_b}$, contradicting our choice of $\tilde X_{\cdot jk_b}$. If instead $\mu_{jk_c} < \mu_{jk_b}$, then $\mu_{jk_c} - \mu_{jk_a} \le -N_j$, and in this case $\tilde X_{\cdot jk_c} < \tilde X_{\cdot jk_a}$, contradicting our choice of $\tilde X_{\cdot jk_a}$. Therefore, $\mu_{jk_b} - \mu_{jk_a} > 0$, and we obtain the desired bound. Combining the results above with Lemma 4, we reach the result.

Lemma 7 aims to show that the important variables can be selected as long as all the Huber estimators, $\tilde X_{\cdot jk}$ and $\tilde S_j$, are sufficiently close to the truth. Combining this result with the conclusion in Lemma 6, we can finally prove the result of variable selection.

B.3. Proof of Theorems 2 and 3
For the classifier in the nearest shrunken centroids method, we denote the population score by $\rho_{h,k}$ and its estimator by $\hat\rho_{h,k}$; for our proposal, we denote the corresponding quantities analogously. For the non-Gaussian scenario, the Bayes misclassification rate does not have an explicit form. Therefore, we consider the difference between the classifier and its estimator. Intuitively, $\hat\delta_h(X^*) = \delta_h(X^*)$ if all the estimators of the Huber loss terms for important variables converge to the truth. To give a cleaner expression of the bound, we first show that $\hat\rho_{h,k} \to \rho_{h,k}$ for any $k$, and then prove the convergence of the classifier. We start from the Huber loss function part.

Proposition 5 (Markov inequality). Let $X^*_j$ be a random variable with mean $\mu_{jY^*}$ and fourth moment $E(\varepsilon^*_j)^4 \le \kappa^2 < \infty$. Then, for any $\epsilon > 0$, we have $\Pr(|X^*_j - \mu_{jY^*}| \ge \epsilon) \le \kappa^2/\epsilon^4$.

Then we have the following lemma.

Proof of Lemma 8. Since the derivative of the Huber loss satisfies $|f'_h(x)| \le 2h$, the loss is Lipschitz with constant $2h$.
Then we can show that
Summarizing the conditions above, by the results in Lemma 4 and Proposition 5, with probability greater than $1 - \tau(\epsilon)$ we obtain the stated bound. With the preparations above, we can eventually obtain the convergence of the classifier, where $B_{1,k} = \{\min_{l \neq k}\{|\rho_{h,l}(X^*) - \rho_{h,k}(X^*)|\} \le 4K\epsilon + 2qh\vartheta(\epsilon)\}$ and $B_2^c$ is the complement of $B_2$. We start with $\sum_{k=1}^{K}\Pr(B_{1,k} \mid Y^* = k)$, which we bound for each $k$. Now consider the equation $|f_h\big(\frac{x+\mu_{jk}-\mu_{jl}}{\sigma_j}\big) - f_h\big(\frac{x}{\sigma_j}\big)| = 0$. Based on the values of $\mu_{jk}$, $\mu_{jl}$ and $h$, it can be simplified to a quadratic or linear equation, with two solutions $x = a_1(\mu_{jk}, \mu_{jl})$, $x = a_2(\mu_{jk}, \mu_{jl})$ or one solution $x = a_1(\mu_{jk}, \mu_{jl})$, where the $a_i$ are constants determined by $\mu_{jk}$ and $\mu_{jl}$.
We then prove a special case where we have the normality assumption. The following proposition is needed in the proof of Theorem 3.
Proposition 6 (Lemma 11 in [33]). Let $\Phi$ and $\phi$ denote the distribution and density functions of a standard Gaussian random variable.

Proof of Theorem 3. [49] points out that the NSC classifier is equivalent to the linear discriminant analysis classifier when the covariance matrix in LDA is diagonal. Given the condition that all predictors are independent, we have
$$\Pr\big(\hat\delta_h(X^*) \neq \delta_{\mathrm{bayes}}(X^*) \mid (\mathbf{Y}^{tr}, \mathbf{X}^{tr})\big) = \Pr\big(\hat\delta_h(X^*) \neq \delta_{\mathrm{nsc}}(X^*) \mid (\mathbf{Y}^{tr}, \mathbf{X}^{tr})\big).$$
Denote by $B$ the event that $|X^*_j - \tilde X_{\cdot jk}|$, suitably standardized, is bounded. Combining the results above, by Lemma 5 and Theorem 1, with probability greater than $1 - (q + p)C\exp\{-Cn\epsilon^2/v^2\}$ we can bound $\Pr\big(\hat\delta_h(X^*) \neq \delta_{\mathrm{bayes}}(X^*) \mid (\mathbf{Y}^{tr}, \mathbf{X}^{tr})\big)$ by $\Gamma(\epsilon)$ plus a term of order $Cq$ times a vanishing quantity. Furthermore, assume that $n \to \infty$ and $q\log p/n \to 0$. By letting $H \to \infty$ at a suitable rate while the remaining terms tend to 0, the bound converges to 0, which indicates that our estimator of the classifier consistently gives the same result as the true one.