Sparse supervised dimension reduction in high dimensional classiﬁcation

: Supervised dimension reduction has proven eﬀective in analyzing data with complex structure. The primary goal is to seek the reduced subspace of minimal dimension which is suﬃcient for summarizing the data structure of interest. This paper investigates the supervised dimension reduction in high dimensional classiﬁcation context, and proposes a novel method for estimating the dimension reduction subspace while retaining the ideal classiﬁcation boundary based on the original dataset. The proposed method combines the techniques of margin based classiﬁcation and shrinkage estimation, and can estimate the dimension and the directions of the reduced subspace simultaneously. Both theoretical and numerical re-sults indicate that the proposed method is highly competitive against its competitors, especially when the dimension of the covariates exceeds the sample size.


Introduction
Recent advances in biomedical sciences have provided statisticians with a large spectrum of new research projects, such as gene microarray analysis, protein structure analysis, and so on.These projects often involve data sets with small number of observations but huge number of covariates.For instance, a gene expression dataset concerning diagnosis of leukaemia [11] consists of only 72 patients with 7,129 genes expressed for each patient.Such a "large-p-small-n" scenario imposes great challenge to conventional statistical techniques due to the curse of dimensionality.A natural way of remedy is to reduce the data dimension, by eliminating some irrelevant covariates or creating some informative combinations of covariates.
In the literature, various dimension reduction techniques have been developed, especially in the route of supervised dimension reduction, where both scalar response Y and p-dimensional covariate X are utilized.Among other supervised dimension reduction techniques, slice inverse regression (SIR) [18], sliced average variance estimate (SAVE) [7], and principal Hessian directions (pHd) [19] are most popularly used.These methods focus on the conditional distribution of X|Y and rely on certain distributional assumptions such as the linearity assumption [18].Recently, the minimum average variance estimation method (MAVE) [29] is proposed for estimating the conditional mean function E(Y |X), which requires no distributional assumption and incorporates estimation of dimension reduction subspace in the framework of local linear smoother [10].
Note that the aforementioned dimension reduction methods are mainly developed in the regression context, and very little effort has been devoted to dimension reduction in the classification context.Most existing methods [2,4,6] formulate dimension reduction for classification in a generalized regression framework by treating P (Y |X) as a continuous response, so that successful techniques for regression such as SIR and SAVE can still be applicable.However, as pointed out in [6], dimension reduction for classification may require more careful treatment since the classification decision functions can be substantially affected by some minor change in P (Y |X).
In addition, the effectiveness of conventional dimension reduction methods can be deteriorated when dimension of X is high, or even greater than the sample size.First, the dimension reduction directions are estimated as a combination of all covariates, and thus can be difficult to interpret.Second, the hypothesis testing procedures for determining the proper dimension of the dimension reduction subspace might become unreliable due to the low power of the hypothesis tests, as demonstrated by the numerical examples in Section 4. To circumvent this difficulty in the regression context, sparse dimension reduction [20] proposes to incorporate variable selection techniques, such as LASSO, into the framework of supervised dimension reduction.
This article introduces a novel sparse supervised dimension reduction technique for high dimensional multicategory classification, which directly estimates the reduced subspace and automatically identifies the low-rank structure of the classification decision functions, while retaining the classification boundary based on the original dataset.The contribution of the proposed method is threefold.First, a margin based loss function is adopted, which directly targets on the classification decision functions rather than the conditional probabilities as in logistic regression.As a result, the proposed method is capable of identifying the central discriminant space (CDS) [6] that is most relevant to the classification, which is in contrast to most existing methods that seek to estimate the larger central space (CS) [3].Second, a bi-level LASSO type penalty is incorporated that encourages sparsity at both the direction level that forms the basis of the CDS, and the covariate level within each direction.Consequently, the proposed method not only automatically recovers the low-rank structure of the CDS consisting of the most relevant directions, but also produces sparse representation of each direction by removing the redundant covariates.This is advantageous compared to variable selection with a single-level penalty, which does not estimate the CDS at all.Lastly, the sparsity of the proposed method is especially attractive in sparse high dimensional classification, where most covariates are expected to be irrelevant.Furthermore, theoretical study reveals that the proposed method indeed overcomes the difficulty of large-p-small-n, and achieves an consistent estimation.
The rest of this paper is organized as follows.Section 2 briefly reviews the CS and central mean subspace (CMS) [5] for regression, as well as the CDS for classification.Section 3 presents our proposed sparse supervised dimension reduction method for estimating the central discriminant subspace, together with a tuning parameter selection criterion through data perturbation technique.Section 4 compares the proposed method against other top performers through a variety of simulated and real examples, followed by a brief summary in Section 5.

Preliminaries
In this section, we briefly review dimension reduction for regression and classification.Particularly, we will focus on the CMS for regression and CDS for classification.

Dimension reduction for regression
In a regression setting with response Y ∈ R and covariate X ∈ R p , assume that X is standardized such that E(X) = 0 p and Var(X) = I p , where 0 p is a vector of 0's and I p is a p × p identity matrix.
The goal of supervised dimension reduction is to seek the CS S(B cs ) of minimal dimension such that where ⊥ ⊥ denotes independence, B cs is a base of S(B cs ), and P S(Bcs) is an orthogonal projection onto S(B cs ).When the conditional mean function E(Y |X) is of primary interest, the CMS S(B cms ) is defined as the reduced subspace of minimal dimension such that Equivalently, projecting the original data onto S(B cms ) will not lose any information in regression of the conditional mean function.
To estimate the CMS, the MAVE method proposes to find a reduced subspace such that the regression mean function can be properly estimated based on the projected data on the reduced subspace.Specifically, the MAVE method estimates the CMS as the solution of min B,aj ,bj n i,j=1 subject to B T B = I d , where a j +b T j B T (X i −X j ) is a local linear approximation of E(Y |X i ) based on point X j , and n i=1 w ij = 1 with w ij being the kernel weights centered at B T X j .The optimization in (2.3) can be solved through an iterative scheme by fixing B or a j and b j respectively.

Dimension reduction for classification
In classification, a decision function φ : R p → {1, . . ., K} is estimated from a training sample (X i , Y i ), i = 1, . . ., n, independent and identically distributed according to some unknown distribution P (x, y), where X i ∈ R p and Y i ∈ {1, . . ., K}.The ideal classification decision function is known as the Bayes rule, φ B (X) = arg max k∈{1,...,K} p k (X) with p k (X) = P (Y = k|X).The primary goal of dimension reduction for classification is to seek the CDS, defined as the subspace S(B cds ) of minimal dimension such that for all values of X.That is, the Bayes rule φ B obtained from the projected data on S(B cds ) should be the same as that obtained from the original data.Note that the CDS can be a proper subspace of the CS, and that the estimation of CDS requires careful treatment as the classification decision function is highly sensitive to the distribution of Y |X in many situations.For illustration, consider a simple binary classification problem with Y ∈ {1, 2} and two independent standard normal covariates X 1 and X 2 .Suppose the conditional probability is defined as p(Y = 1|X) = 0.9, if X 1 ≥ 0, and γ otherwise, where 0 < γ < 1.In this example, the CS is spanned by (1, 0) ′ regarless of the value of γ.However, the Bayes rule φ B (x) = sign(x 1 ) if γ < 0.5, and φ B (x) ≡ 1 if γ ≥ 0.5, implying that the CDS is spanned by (1, 0) ′ if γ < 0.5, and a empty space if γ ≥ 0.5.Clearly, the CDS is a proper subspace of the CS if γ ≥ 0.5, and it changes substantially when γ changes from above 0.5 to below 0.5.

Sparse dimension reduction for classification
This section presents a sparse supervised dimension reduction method for high dimensional classification with multiple classes.The proposed method combines the techniques of margin based classification and shrinkage estimation, and is capable of estimating the dimension and the directions of the CDS simultaneously.

Margin based multicategory classification
In margin based multicategory classification, a classification function vector f = (f 1 , . . ., f K ) is constructed by minimizing a cost function of f over a linear function class subject to the sum-to-zero constraint K k=1 f k (x) = 0; ∀x ∈ R p .Here L(y i , f (x i )) is a loss function, J(f ) is a regularization term for penalizing model complexity, λ > 0 is the degree of penalization, and the zero-sum constraint K k=1 f k (x) = 0 is enforced to avoid redundancy in f .The decision function φ(x) = arg max k f k (x) estimates the Bayes rule φ B (x).Note that we restrict the candidate function to be linear because linear classifiers are in general sufficient to separate different classes and often yield more stable performance than their nonlinear competitors in high dimensional settings [12].
The loss L(y, f (x)) is margin based if it can be written as L(u(f (x), y)), where is the generalized functional margin [22].Clearly, larger components of u(f (x), y) imply better separation and more accurate classification.Based on the generalized functional margin, some popularly used loss functions are proposed: where u(f (x), y) is written as u = (u 1 , . . ., u K ) to simplify notations.Although the loss functions (3.3), (3.4) and (3.5) decrease with respect to u encouraging better classification, they are constructed based on different principles.In specific, (3.3) corresponds to the "one-versus-rest" approach, whereas (3.4) and (3.5) correspond to the simultaneous formulation that treats all classes at one time.Mathematical analysis reveals that only (3.4) is Fisher consistent when no dominating class is present [17,30].Particularly, when L(u) is set as in (3.4), the minimizer of E(L(u(f (X), Y ))) subject to the sum-to-zero constraint is and hence that φ , assuming that F K is sufficiently rich [17].Therefore, we focus our attention on (3.4) in this article, although the proposed method can be adapted to other loss functions.
The regularization term J(f ) in high dimensional classification is usually set to be the componentwise L 1 -norm of the candidate function, that is, The L 1 -norm regularization term allows the parameter estimation and variable selection at the same time, which is desirable when the data dimension is high.In addition, λ is a tuning parameter controlling the tradeoff between the loss function and the regularization term, and thus needs to be determined in order to optimize the classification performance.

Sparse dimension reduction
The proposed sparse supervised dimension reduction for classification is motivated by the MAVE method and the asymptotic consistency of the margin based classification method.In specific, the dimension reduction matrix B and the corresponding f can be estimated by solving Here L(u) is defined as in (3.4), and , whose columns correspond to the potential directions of the CDS.The original sum-to-zero constraint 0 for the ease of implementation [26].The resulting classification decision function is φ(x) = arg max k fk ( B T x).Note that the key difference between (3.6) and the ordinary margin based classification in (3.1) is the dimension reduction matrix B, which leads to automatic estimation of the CDS without sacrificing the classification accuracy.The estimated CDS spanned by B allows visualization and exploration of the original dataset with high dimension, which is one of the primary purposes of dimension reduction [3].
To illustrate the validity of (3.6) in estimating the CDS, note that the data fitting component n −1 n i=1 L(u(f (B T x i ), y i )) in (3.6) approaches the limiting functional E(L(u(f (B T X), Y ))) as n diverges, and hence that the minimizer of (3.6) approaches that of E(L(u(f (B T X), Y ))) under certain regularity conditions [21].Analogous to [17], it can be shown that the minimizer of assuming that F K is sufficiently rich.Therefore, φ( Furthermore, the LASSO type of regularization terms w k 1 and B 1 in (3.8) perform a bi-level variable selection, encouraging sparsity at both the direction level and the covariate level.The zero entries in w k 1 automatically remove the redundant columns in B and recover informative directions of the CDS.In addition, the sparseness of the remaining columns in B leads to simpler representation and easier interpretation for the informative directions.This is in contrast to the single-level penalty in L 1 -norm multicategory support vector machine (L1MSVM) [26] that performs only variable selection at the covariate level.As a result, our method is capable of identifying a low-rank structure of the CDS that usually has a rank much less than K − 1, whereas L1MSVM in general yields the decision function vector of dimension K − 1 without further dimension reduction.This idea of bi-level penalty is closely related to the hierarchical penalization in [27], which performs both group and within-group variable selection for grouped predictors.The novelty of our proposed method is that each dimension reduction direction is a sparse combination adaptively identified out of all the covariates, whereas each direction merely consists of the covariates within a given group in [27].
As computational remarks, the sub-optimization problems in Steps 2 and 3 can be solved by any linear programming algorithm, which is available in most standard statistical softwares.When p is large, to expedite Steps 2 and 3, one can restrict the dimension of B to be p × (K − 1), or solve for w (m+1) or B (m+1) column by column.In addition, Step 5 is mainly for orthonormalizing B, which is optional as the primary goal of dimension reduction in classification is to recover S(B cds ) rather than the base of S(B cds ).

Tuning parameter selection
The estimation accuracy of the proposed dimension reduction method largely depends on the tuning parameters λ, and hence that they need to be properly determined.In this section, we rewrite φ as φλ to indicate its dependency on λ, and present a tuning parameter selection criterion based on the prediction accuracy of φλ , obtained from the projected data on the reduced subspace.
To assess the prediction accuracy of φλ , the generalization error (GE) of φλ is used, defined as GE( φλ ) = E(I(Y = φλ (X))), where I(•) is an indicator function, and the expectation is taken with respect to the unknown P (x, y) and thus needs to be estimated from data.
In the literature, estimation of the GE given fixed X's has been extensively investigated; c.f., [8,9,23] for more details.Wang and Shen [25] proposes a data adaptive GE estimation technique for random X in the context of classification, where GE( φλ ) is decomposed as a sum of GE's of binary classifiers, where t : (1, . . ., K) → {0, 1} K maps j to a vector of length K which has all entries equal to 0 except the jth one equal to 1. Furthermore, Here known as covariance penalty [9] measuring the prediction accuracy of φ on X n , accounts for the randomness of X, and is independent of φλ and thus can be omitted in the estimation of GE( φλ ).
To estimate Cov and D 1k terms in (3.9), a data perturbation technique [25] can be employed.The idea is to generate local perturbations of X and Y to evaluate sensitivity of φλ by estimating its classification accuracy via its derivative estimation.The formulas are given in (11) and ( 12) of [25].The proposed tuning parameter selection technique yields higher estimation accuracy than cross validation in a wide variety of numerical examples while achieving asymptotic optimality for both fixed and random X's [25].The numerical experiments in Section 4 also demonstrate that this selection criterion yields satisfactory performance in estimating the dimension and the directions of the CDS.

Asymptotic properties
In this section, we present some asymptotic results showing that the proposed method is able to estimate the CDS, while yielding comparable asymptotical classification performance to standard classification methods, such as L1MSVM, in terms of the generalized hinge loss for the large-p-small-n classification.Similar results have been established in [13,29] in the regression context.
To handle the large-p-small-n problem, we first introduce some notations.Write X(p) = (X (1) , . . ., X (p) ) T as a truncated infinite-dimensional random vector X = (X (1) , X (2) , . ..)T , and p is allowed to grow with n at a much faster rate.To relate to the existing learning theory, we work on the following equivalent formulation of (3.6), and define f (p) ( B T x) as the solution of (3.10).The optimal performance is defined by is the class of all linear decision functions.For any f ∈ F (p), its performance is measured by the excess hinge risk e L (f , We now quantify the magnitude of e L ( f (p) ( B T x), f (p) ) under the following assumptions.Assumption A: Assume that sup 0<j<∞ X (j) < ∞.Assumption B: There exists a finite s * such that f (p) ∈ F (p, s * ) for all p, where F (p, s * ) = {f : The probability in inequality (3.11) can be bounded by Theorem 2 of [26] and the desired result immediately follows from Corollary 1 therein.
Assumption B describes an L 1 -norm "sparse scenario", which is weaker than the commonly used assumption on the L 0 -norm sparseness.Specifically, suppose that the true decision functions are sparse in L 0 -norm, depending on only a finite number of predictors, that is f where the number of non-zero entries in b * and B * is finite, independent of p.Then, it is straightforward to verify that f (p) ∈ F (p, s * ) for some constant s * , which satisfies Assumption B.
Under the spareness assumption, Theorem 3.1 yields an error rate tending to zero as long as p n grows no faster than exp(n), which indicates the optimal performance can be obtained even for p ≫ n.More importantly, the rate of convergence in Theorem 3.1 is comparable to that of L1MSVM [26], implying that the proposed method achieves the purpose of dimension reduction without sacrificing its classification performance.

Numerical experiments
This section presents numerical studies to examine the finite-sample effectiveness of the proposed sparse supervised dimension reduction procedure (SSDR).The purpose of the numerical studies is two-fold.First, we compare the estimation accuracy of S( B) by SSDR to some popular competitors, including SIR, SAVE and pHd, when B cds is know.Specifically, the vector correlation coefficient q 2 [16] and the trace correlation r 2 [15] between B and B cds , are employed to assess the closeness between S( B) and S(B cds ), where ρ 1 , . . ., ρ d are eigenvalues of B T cds B B T B cds and d is the dimension of B cds .If d = 0, for simplicity, we define q 2 ( B, B cds ) = r 2 ( B, B cds ) = 1 if the dimension of B is also 0, and 0 if the dimension of B is greater than 0. Note that both q 2 ( B, B cds ) and r 2 ( B, B cds ) range from 0 to 1, and larger values of q 2 ( B, B cds ) and r 2 ( B, B cds ) indicate S( B) and S(B cds ) are closer.
Second, we compare the classification accuracy of the classification decision functions constructed based on the corresponding B. This can be viewed as a complementary assessment of the estimation accuracy of B, when B cds is unknown as in the real examples.In specific, we first obtain B through various dimension reduction methods, then project the original dataset onto S( B) and construct the classification decision function φ( B T x) by applying multicategory SVM (MSVM) [17] to the projected data on S( B).A test error, defined as is used to measure the classification accuracy of φ and the estimation accuracy of B. Furthermore, to illustrate that SSDR can retain the classification accuracy while estimating S(B cds ), we also compare its test error with the standard classification methods including L1MSVM and linear discriminant analysis (LDA).All numerical analyses are conducted in R2.7.2.The "dr" routine in the dr package is employed for SIR, SAVE and pHd, which estimates the dimension reduction spaces and performs marginal tests concerning their dimensions, and the "solve" routine in lpSolveAPI package is employed for solving the linear programming problems in Algorithm 3.1.

Simulation
Two simulated examples are examined.
Evidently, SSDR delivers superior performance over its competitors.Specifically, in Examples 1a and 1b, SSDR recovers S(B cds ) in almost all replications, and yields smaller test errors than SIR, SAVE and pHd.In Example 1a, SAVE and pHd can hardly identify any direction of S(B cds ), whereas both SSDR and SIR deliver satisfactory performance and S( B) by SSDR is closer to S(B cds ) than that by SIR as it yields larger value of q 2 and r 2 .In Example 1b, SSDR, SAVE and pHd are able to recover S(B cds ) in all replications, while SIR tends to overestimate S(B cds ).In Examples 2a and 2b, the number of covariates is larger than that of observations and hence that SIR, SAVE and pHd are not applicable, whereas SSDR is able to handle this large-p-small-n scenario.From Table 1, the small values of q 2 and r 2 suggest that S( B) by SSDR is somewhat different from S(B cds ).However q 2 and r 2 are no longer appropriate to assess the estimation accuracy of S( B) in Example 2 as there can be multiple CDS's leading to perfect classification due to the high dimension.Figure 1 displays averaged q 2 and r 2 of SSDR in Example 2 with θ = 1 and p = 10, 20, 50, 100, 200, 500 over 100 independent replications.Clearly, both q 2 and r 2 decrease as p increases, while the test error of SSDR remains 0 for all p's.This suggests that although S( B) by SSDR deviates from S(B cds ) when p is large, the classification decision function based on S( B) remains the same as that based on S(B cds ).Finally, we compare the test error of SSDR to that of standard classification methods including L1MSVM and LDA that do not provide estimate of S(B cds ) at all.It is clear that SSDR outperforms LDA in all examples, and compares similarly to L1MSVM.In Example 2b, although S( B) is seemingly different from S(B cds ), SSDR is able to yield perfect classification in all replications.In Example 2a, due to the high dimension but relatively small separation with θ = 0.2, all methods fail to achieve the perfect classification but SSDR is still able to deliver smaller test error than both L1MSVM and LDA.The main goal is to develop a diagnosis method to determine whether a case is benign or malignant, based on 30 features computed from a digitized image of a fine needle aspiration.More details about the data is given in [28].The Lung cancer dataset consists of only 32 patients with three different types of pathological lung cancers, 9 from the first, 13 from the second and 11 from the third type, and 56 discrete features extracted from clinical and X-ray data.In Lung cancer, there are four missing values in the 4th feature and one missing value in the 38th feature, which are replaced by the modes of the 4th and 38th features, respectively.The data is described in details by [14].
In the Wisconsin breast cancer example, we randomly choose 200 cases for training and the remainder for testing; in the Lung cancer example, 24 randomly chosen cases are for training and the remainder for testing.For each pair of training and testing sets, tuning is conducted for λ using the same grid search scheme as in the simulated examples.Since S(B cds ) is unavailable in these real examples, the test errors and the dimension of B, averaged over 100 simulation replications, are summarized in Table 2.
In the Wisconsin breast cancer example, SSDR yields the smallest test error than other dimension reduction methods, and compares favorably to L1MSVM and LDA as well.Furthermore, both SSDR and SIR find one significant direction for the CDS in all replications, SAVE identifies too many significant directions for the CDS which deteriorates its classification accuracy, and pHd does not  2, the estimated directions by SSDR provide good separation of different classes of samples in both examples.In Wisconsin breast cancer example, patients with smaller values on direction 1 are much more likely to be malignant.In Lung cancer example, patients with negative values on both directions are more likely to have the third type of lung cancer, and patients of the first and second type of lung cancer can be discriminated according the relative largeness in these two directions.Furthermore, in the Wisconsin breast example, the estimated CDS is spanned by Direction 1 = 0.097X 2 + 0.184X  Evidently, SSDR yields sparse representation of the dimension reduction directions in both examples.In Wisconsin breast cancer example, one direction with only 11 out of 30 covriates is identified, and in Lung cancer example, two directions with only 10 and 18 out of 56 covariates are identified.

Summary
This article proposes a novel methodology for estimating the dimension reduction subspace for classification.In contrast to existing methods viewing the classification problem as a generalized regression problem, our proposed method directly pursues the minimal sufficient discriminant subspace to retain the classification boundary based on the original dataset.In addition, it enables estimation of the dimension and the sparse directions of the dimension reduction subspace simultaneously.Its asymptotic classification performance is shown to be comparable to the standard classification techniques in large-p-small-n scenario.Numerical analyses demonstrate that the proposed method outperforms several other top competitors in both simulated and real examples.It is worthy of pointing out that we assume that the true classification boundaries are linear as linear classification boundaries seem more appropriate when the data dimension is high [12].This assumption may fail when the dimension is relatively small and the true classification boundaries are often nonlinear.The extension of the proposed dimension reduction approach to nonlinear case is under investigation.

Fig 1 .
Fig 1.The averaged q 2 , r 2 and test error in Example 2 with θ = 1 and various p's.

2 Fig 2 .
Fig 2. The estimated CDS based on two randomly chosen replications from the Wisconsin breast cancer and the Lung cancer example.The left panel is the Wisconsin breast cancer example and the right panel is the Lung cancer example, where the digits denote different classes respectively.

Table 1
Averaged q 2 , r 2 and test errors as well as estimated standard errors (in parenthesis) of SIR, SAVE, pHd, SSDR, L1MSVM and LDA in the simulated examples.Here Example 1a and 1b correspond respectively to the simulated Example 1 with γ = 0.1 and γ = 0.6, and Example 2a and 2b correspond respectively to the simulated Example 2 with θ = 0.2 and We now examine two real examples, Wisconsin breast cancer and Lung cancer.Both datasets are available from the University of California at Irvine machine learning data repository at http://www.ics.uci.edu/∼ mlearn/MLRepository.html.The Wisconsin breast cancer dataset, collected at University of Wisconsin Hospitals, consists of 569 cases with two diagnoses, 212 malignant and 357 benign.