Classification by sparse generalized additive models

We consider (nonparametric) sparse (generalized) additive models (SpAM) for classification. The design of a SpAM classifier is based on minimizing the logistic loss with sparse group Lasso/Slope-type penalties on the coefficients of the univariate additive components' expansions in orthonormal series (e.g., Fourier or wavelets). The resulting classifier is inherently adaptive to the unknown sparsity and smoothness. We show that under a certain sparse group restricted eigenvalue condition it is nearly minimax (up to log-factors) simultaneously across the entire range of analytic, Sobolev and Besov classes. The performance of the proposed classifier is illustrated on simulated and real-data examples.

Probably the most commonly used model for p(x) is logistic regression, where it is assumed that the log-odds (logit) is linear in the features, i.e., ln(p(x)/(1 − p(x))) = β^T x (1.1), where β ∈ R^d is a vector of (unknown) regression coefficients. The corresponding Bayes (linear) classifier for the model (1.1) is η*(x) = I{β^T x ≥ 0}. Assuming d < n and estimating β by the maximum likelihood estimator (MLE) β̂ yields the plug-in classifier η̂_L(x) = I{β̂^T x ≥ 0}. Logistic regression classification has been well studied and is one of the main tools used by practitioners. In particular, it is well known that E(η̂_L, η*) ∼ d/n, which is rate-optimal (in the minimax sense) over all linear classifiers (see, e.g., [1]).
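As a concrete illustration, the following minimal NumPy sketch fits the logistic MLE by plain gradient ascent (assuming d < n) and forms the plug-in classifier η̂_L(x) = I{β̂^T x ≥ 0}; the simulated design, the true β and the step size are illustrative assumptions rather than part of the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -1.0, 0.5, 0.0, 0.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
Y = rng.binomial(1, p)

# Maximum likelihood for logistic regression via plain gradient ascent
# on the average log-likelihood (no penalty, d < n).
beta = np.zeros(d)
step = 0.1
for _ in range(2000):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
    grad = X.T @ (Y - mu) / n              # gradient of average log-likelihood
    beta += step * grad

# Plug-in linear classifier: predict class 1 iff beta_hat^T x >= 0,
# equivalently iff the estimated probability is >= 1/2.
def eta_L(x, beta_hat=beta):
    return int(x @ beta_hat >= 0.0)

print(np.round(beta, 2))
```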
Yet, a linear model for g(x) in (1.1) might be too restrictive and oversimplified in a variety of applications. Nonparametric models allow more flexibility by placing much weaker assumptions on g(x). One can then use various nonparametric techniques (e.g., kernel estimators, local polynomial regression, orthogonal series or nearest neighbours) to estimate the unknown p(x) (or, equivalently, g(x)) and define a nonparametric plug-in classifier η̂_N(x) = I{p̂(x) ≥ 1/2}. [26] showed that with properly chosen tuning parameters, η̂_N(x) achieves the minimax rates across various smoothness classes. Thus, assuming that the unknown p(x) has smoothness s, E(η̂_N, η*) = O(n^{-s/(2s+d)}). A general nonparametric classifier η̂_N suffers, however, from the "curse of dimensionality": the convergence rates rapidly slow down as d increases, and the sample size required for consistent classification grows exponentially with d. To handle this one needs additional assumptions on the model.
Note that classification is particularly challenging near the decision boundary {x : p(x) = 1/2} or, equivalently, {x : g(x) = 0}, where it is especially hard to predict the class label accurately, while at points far from the boundary it is easier. Thus, assume that for all 0 < h < h*, P(|p(X) − 1/2| ≤ h) ≤ Ch^α for some C > 0, α ≥ 0 and 0 < h* < 1/2, known as the margin or low-noise condition [23]. The case α = 0 essentially corresponds to no assumption, while the other extreme case α = ∞ implies the existence of a hard margin of size h* separating the two classes. [3] showed that under the above margin assumption, the rates of the misclassification excess risk of η̂_N can indeed be improved. Assuming again that p(x) has smoothness s, under some mild conditions on the marginal distribution P_X of X, the misclassification excess risk becomes of the order n^{-s(1+α)/((2+α)s+d)}, approaching the parametric rate n^{-1} as α → ∞. It can be reduced even further under stronger assumptions on P_X.
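For intuition only, the margin condition can be checked empirically by Monte Carlo: draw X, compute p(X) for a chosen logit g and estimate P(|p(X) − 1/2| ≤ h) over a grid of h. The univariate g below is a hypothetical example, not one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):                       # hypothetical univariate logit
    return 3.0 * np.sin(2.0 * np.pi * x)

X = rng.uniform(size=100_000)
p = 1.0 / (1.0 + np.exp(-g(X)))

# Estimate P(|p(X) - 1/2| <= h) on a grid of h; under the margin
# condition this should be bounded by C * h**alpha for small h.
for h in (0.01, 0.05, 0.1, 0.2):
    prob = np.mean(np.abs(p - 0.5) <= h)
    print(f"h = {h:4.2f}   P(|p(X)-1/2| <= h) ~ {prob:.3f}")
```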
Although the margin assumption improves the convergence rates, it provides only a partial solution to the curse of dimensionality, since the computational burden of fitting a general nonparametric classifier remains prohibitive even for relatively small and moderate d. In addition, such a model is difficult to interpret and does not lend itself to feature selection.
To overcome these challenges one needs additional structural assumptions on the model. A common approach is to consider (generalized) additive models (GAM) [11]. In particular, for logistic regression, assume that ln(p(x)/(1 − p(x))) = Σ_{j=1}^d g_j(x_j) (1.2), where the g_j(·)'s are univariate "smooth" functions. To make the model (1.2) identifiable, impose Eg_j(X_j) = 0 for all j = 1, ..., d. The GAM (1.2) is a natural nonparametric generalization of the linear logistic regression model (1.1). Additive models have become a standard tool in high-dimensional nonparametric regression and classification and can be efficiently fitted by the backfitting algorithm [11]. The additivity assumption drastically improves the convergence rates of the resulting classifier η̂_A: adapting the results of [26] and [19], one can show that the dimension d enters the resulting rates only through a multiplicative factor, while the exponent is determined by the univariate smoothness s of the components g_j (or, more generally, by their individual smoothnesses s_j).

Nevertheless, in the era of "Big Data", the number of features d might be very large, even larger than the sample size n (large d, small n setups). The GAM (1.2) requires d ≪ n and cannot cope with the curse of dimensionality in this case. Reducing the dimensionality of the feature space by selecting a sparse subset of significant (active) features becomes essential. Thus, consider a sparse generalized additive model (SpAM), assuming that ln(p(x)/(1 − p(x))) = Σ_{j∈J} g_j(x_j) (1.3), where J ⊆ {1, ..., d} is an (unknown) subset of active features of cardinality |J| = d_0 that have a "significant" impact on the outcome. We are mainly interested in sparse setups, where d_0 ≪ d. SpAMs have been intensively studied in the context of nonparametric regression w.r.t. the quadratic risk, where a common approach is based on penalized least squares estimation with various combinations of sparsity-inducing and regularized-smoothness convex penalties. See, e.g., [16, 17, 21, 13, 19, 2] for various SpAM regression estimators and their properties; their comparison is discussed in [2] and [10]. [13] and [10] extended this approach to more general convex losses relevant, in particular, for the considered logistic regression model. However, their estimators are not adaptive to the (usually unknown) smoothness of the g_j's.
Classification by sparse linear logistic regression was studied in [1]. They showed that the corresponding minimax misclassification excess risk is of the order d_0 ln(de/d_0)/n and designed penalized MLE classifiers that achieve it.
In this paper we investigate (nonparametric) SpAM classifiers. We propose a SpAM estimator of g in (1.3) defined in the Fourier/wavelet domain by minimizing the penalized logistic loss with sparse group Lasso or more general sparse group Slope penalties on the Fourier/wavelet coefficients. The estimator and the resulting plug-in classifier are inherently adaptive to the unknown sparsity and smoothness classes. We establish the minimax rates for the misclassification excess risk for SpAMs across analytic, Sobolev and Besov classes of functions and show that there is a phase transition between sparse and dense SpAMs. In particular, if g_j, j ∈ J, has smoothness s_j, the minimax misclassification excess risk is of the order max( d_0 ln(de/d_0)/n, Σ_{j∈J} n^{-2s_j/(2s_j+1)} ). We prove that with a proper choice of tuning parameters and under a certain restricted sparse group eigenvalue condition, the sparse group Lasso/Slope SpAM classifiers η̂_sgL and η̂_sgS are simultaneously nearly minimax (up to log-factors) across the entire range of those classes.
The rest of the paper is organized as follows. Section 2 presents the SpAM model for classification and defines the sparse group Lasso and Slope classifiers. In Section 3 we establish the minimaxity of the proposed SpAM classifiers across Sobolev, analytic and Besov classes. Their performance is illustrated on simulated and real-data examples in Section 4. All the proofs are given in Appendix A.

SpAM classifier
Consider the sparse additive logistic regression model (1.1)–(1.3), where X ∈ R^d is a vector of linearly independent features with a marginal probability distribution P_X on a bounded support X. Without loss of generality, assume that X = [0, 1]^d and that g satisfies (1.3), where each g_j lies in some class of functions F_j on the unit interval (e.g., Sobolev H^{s_j}[0, 1] or more general Besov B^{s_j}_{p_j,q_j}[0, 1]), Eg_j(X_j) = 0, and the subset of active features J is a priori unknown.
Expand the unknown univariate functions g_j, j = 1, ..., d, in some orthonormal basis {ψ_ℓ}_{ℓ=0}^∞ of F_j as g_j(x_j) = Σ_{ℓ=0}^∞ β_jℓ ψ_ℓ(x_j), where β_jℓ = ∫ g_j(x_j)ψ_ℓ(x_j)dx_j. For simplicity of exposition we consider the same basis for all g_j's. Due to the identifiability conditions Eg_j(X_j) = 0, we set β_j0 = 0 for all j. By Parseval's identity, ||g_j||² = Σ_ℓ β_jℓ².

The unknown coefficients β_jℓ are estimated from the data sample (X_i, Y_i), i = 1, ..., n. Obviously, for all j = 1, ..., d, we set β̂_jℓ = 0 for all ℓ > n. To estimate the β_jℓ for 1 ≤ ℓ ≤ n, consider additive truncated estimators of the form ĝ(x) = Σ_{j=1}^d Σ_{ℓ=1}^n β̂_jℓ ψ_ℓ(x_j) and the corresponding empirical logistic loss, where B ∈ R^{d×n} is the matrix of regression coefficients and Ψ(X) ∈ R^{d×n} is the matrix with entries Ψ_jℓ(X) = ψ_ℓ(X_j). Let B ∈ R^{d×n} be the matrix of the true (unknown) regression coefficients β_jℓ, j = 1, ..., d; ℓ = 1, ..., n.

Sparse representation of g and smoothness properties of the g_j's in (1.3) are directly expressed in terms of B. Evidently, β_jℓ = 0 for all ℓ whenever j ∉ J. In addition, if the chosen basis {ψ_ℓ}_{ℓ=0}^∞ allows a sparse representation of g_j, j ∈ J, in the sense that g_j can be well approximated by a small number of basis functions, then the decreasingly ordered |β|_{j(ℓ)} decay rapidly as ℓ grows. These arguments induce a double row-wise sparsity structure on B: it has only d_0 nonzero rows (global row-wise sparsity), and even those are "approximately sparse" in the sense that they contain only a few "large" entries (local row-wise approximate sparsity).

To capture this type of sparsity we consider a logistic sparse group Lasso estimator of B. Sparse group Lasso was proposed in [22] for linear regression; its logistic version was used for multiclass classification by multinomial linear logistic regression in [25] and [14]. Within the considered SpAM binary classification setup, the logistic sparse group Lasso estimator B̂_sgL of B minimizes the logistic loss penalized by λ Σ_{j=1}^d |B_j·|_2 + κ Σ_{j=1}^d |B_j·|_1 (2.3), where |B_j·|_2 and |B_j·|_1 are respectively the l_2- and l_1-norms of the j-th row of B, and λ > 0 and κ > 0 are tuning parameters. Plugging B̂_sgL into the truncated expansion yields the corresponding estimator ĝ_sgL in (2.4) and the resulting logistic sparse group Lasso classifier η̂_sgL(x) = I{ĝ_sgL(x) ≥ 0} in (2.5). A more flexible generalization of (2.3) is the logistic sparse group Slope estimator B̂_sgS in (2.6), which utilizes sorted norms in the penalty based on (with some ambiguity of notation) non-increasing weight sequences {λ_j}_{j=1}^d and {κ_ℓ}_{ℓ=1}^n, and the corresponding sparse group Slope classifier η̂_sgS in (2.7).
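A minimal sketch of this construction is given below: it builds the cosine-basis features Ψ_jℓ(X_i) = ψ_ℓ(X_ij), evaluates the average logistic loss and fits B by proximal gradient descent with the sparse group penalty λ Σ_j |B_j·|_2 + κ Σ_j |B_j·|_1. The 1/n scaling of the loss, the step-size rule and the proximal-gradient algorithm itself are implementation assumptions; the paper's exact formulation of (2.3) and the actual tuning of λ and κ may differ.

```python
import numpy as np

def cosine_features(X, L):
    """Psi[i, j, l] = sqrt(2) * cos(pi * (l+1) * X[i, j]), l = 0..L-1."""
    ell = np.arange(1, L + 1)
    return np.sqrt(2.0) * np.cos(np.pi * ell[None, None, :] * X[:, :, None])

def logistic_loss_grad(B, Psi, Y):
    """Average logistic loss and its gradient w.r.t. the d x L matrix B."""
    n = Psi.shape[0]
    g = np.einsum('ijl,jl->i', Psi, B)                  # g(X_i) = <B, Psi(X_i)>
    loss = np.mean(np.logaddexp(0.0, g) - Y * g)
    resid = 1.0 / (1.0 + np.exp(-g)) - Y                # sigma(g_i) - Y_i
    grad = np.einsum('i,ijl->jl', resid, Psi) / n
    return loss, grad

def prox_sparse_group(B, t, lam, kap):
    """Prox of t*(lam * sum_j ||B_j.||_2 + kap * sum_j ||B_j.||_1):
    entrywise soft thresholding followed by row-wise group thresholding."""
    B = np.sign(B) * np.maximum(np.abs(B) - t * kap, 0.0)
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.where(norms > 0,
                     np.maximum(1.0 - t * lam / np.maximum(norms, 1e-12), 0.0),
                     0.0)
    return B * scale

def fit_spam_sgl(X, Y, L=10, lam=0.05, kap=0.02, iters=500):
    Psi = cosine_features(X, L)
    # Crude Lipschitz bound for the logistic-loss gradient => safe step size.
    lip = 0.25 * np.mean(np.sum(Psi ** 2, axis=(1, 2)))
    step = 1.0 / lip
    B = np.zeros((X.shape[1], L))
    for _ in range(iters):
        _, grad = logistic_loss_grad(B, Psi, Y)
        B = prox_sparse_group(B - step * grad, step, lam, kap)
    return B

def classify(B, x):
    """Plug-in SpAM classifier: predict 1 iff the fitted logit is nonnegative."""
    psi = cosine_features(x[None, :], B.shape[1])[0]
    return int(np.sum(B * psi) >= 0.0)
```

Setting lam = 0 in this sketch gives a vanilla-Lasso-type fit and kap = 0 a group-Lasso-type fit, matching the special cases used in Section 4.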

Minimaxity across various function classes
We now show that, with a proper choice of tuning parameters, the proposed logistic sparse group Lasso and Slope classifiers are adaptively nearly minimax (up to log-factors) across various classes of functions F_j for the g_j's in (1.3).
As usual with convex penalization, one needs some (mild) assumptions on the design. Consider the truncated version g_n of g obtained by keeping the first n basis functions in each additive component. Let the random vectors Ψ(X_j) = (ψ_1(X_j), ..., ψ_n(X_j))^T, j = 1, ..., d, be the rows of the matrix Ψ(X) and define the cross-covariance matrices V_jk = EΨ(X_j)Ψ(X_k)^T. Given non-increasing sequences {λ_j}_{j=1}^d and {κ_ℓ}_{ℓ=1}^n, consider the cone of matrices A whose sparse group Slope norm is dominated (up to a constant) by their Frobenius norm ||A||_F, and assume that the corresponding weighted restricted eigenvalue is bounded away from zero (Assumption 3.1). The weighted restricted eigenvalue (WRE) condition 3.1 extends the WRE condition introduced for Slope in [4] to sparse group Slope and to random design. This type of assumption is common for convex penalties (see [4] for a discussion). In particular, for isotropic X, due to the orthogonality of the basis, V_jj = I_n and V_jk = 0_{n×n} for j ≠ k, and Assumption 3.1 is trivially satisfied. Evidently, sparse group Lasso corresponds to the particular case of constant weights λ_j = λ and κ_ℓ = κ.

Sobolev classes
Consider the orthonormal cosine series ψ_0(x_j) = 1, ψ_ℓ(x_j) = √2 cos(πℓx_j), ℓ = 1, 2, ..., and let β_jℓ = √2 ∫_0^1 g_j(t) cos(πℓt) dt be the cosine Fourier coefficients of g_j. Suppose that g_j belongs to the periodic Sobolev class H^{s_j}[0, 1] = {g_j : Σ_{ℓ=1}^∞ (πℓ)^{2s_j} β_jℓ² ≤ R_j}. In particular, for an integer s_j this is equivalent to requiring s_j derivatives in the L_2-sense together with the boundary conditions g_j^{(m)}(0) = g_j^{(m)}(1) = 0, m = 1, ..., s_j − 1 (e.g., [24], Section 1.7.1). Consider the set of sparse additive functions G_H(d_0, s) with g_j ∈ H^{s_j}[0, 1], j ∈ J, and the corresponding set of logistic SpAM classifiers C_H(d_0, s).

Theorem 3.2 establishes the upper bounds for the misclassification excess risk of the logistic sparse group Lasso and Slope classifiers over periodic Sobolev classes for the tuning parameters λ_j and κ_ℓ, ℓ = 1, ..., n, specified in its statement: with these choices, the excess risk is of the order max( d_0 ln(de/d_0)/n, Σ_{j∈J} n^{-2s_j/(2s_j+1)} ) up to log-factors. The proposed SpAM classifiers are inherently adaptive to the sparsity parameter d_0, the set of active features J and the smoothness parameters s_j. Theorem 3.2 also implies that the more flexible Slope-type penalty allows one to reduce the log-factor in the first term of the upper bounds. We then show that the upper bounds of Theorem 3.2 cannot be essentially improved: the corresponding minimax lower bound holds for some C > 0, where the infimum is taken over all classifiers η̃ based on the data (X_1, Y_1), ..., (X_n, Y_n).
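For intuition, the sketch below computes the cosine Fourier coefficients of a smooth zero-mean function numerically and evaluates the ellipsoid sum Σ_ℓ (πℓ)^{2s} β_ℓ² for a couple of values of s; the test function is an arbitrary illustration.

```python
import numpy as np

def cosine_coeffs(g, L, grid=10_000):
    """beta_l ~ sqrt(2) * int_0^1 g(t) cos(pi*l*t) dt via a midpoint Riemann sum."""
    t = (np.arange(grid) + 0.5) / grid
    vals = g(t)
    return np.array([np.sqrt(2.0) * np.mean(vals * np.cos(np.pi * l * t))
                     for l in range(1, L + 1)])

# Illustrative zero-mean function on [0, 1] with only two nonzero coefficients.
g = lambda t: np.cos(2.0 * np.pi * t) + 0.3 * np.cos(6.0 * np.pi * t)
beta = cosine_coeffs(g, L=50)

# Sobolev ellipsoid sum sum_l (pi*l)^(2s) * beta_l^2 for a couple of s.
ell = np.arange(1, len(beta) + 1)
for s in (1, 2):
    print(f"s = {s}:  {np.sum((np.pi * ell) ** (2 * s) * beta ** 2):.3f}")
```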
The obtained misclassification excess risk bounds contain two terms and indicate a phase transition between sparse and dense SpAMs. The term d_0 ln(de/d_0)/n ∼ (1/n) ln binom(d, d_0) corresponds to the error of selecting the d_0 nonzero univariate components g_j out of d and is common in model selection. The term Σ_{j∈J} n^{-2s_j/(2s_j+1)} is due to simultaneously estimating nonparametrically the d_0 functions g_j ∈ H^{s_j}([0, 1]). Assume for simplicity that s_j = s for all j ∈ J, so that the second term becomes d_0 n^{-2s/(2s+1)}. One then immediately realizes that the model selection error dominates in sparse cases, where d_0/d ≲ e^{-n^{1/(2s+1)}}, while for less sparse cases nonparametric estimation is the main error source. A similar phase transition phenomenon for sparse linear classification was shown in [1].
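The phase transition is easy to see numerically: with s_j = s for all j ∈ J, the two terms are d_0 ln(de/d_0)/n and d_0 n^{-2s/(2s+1)}, and the selection term dominates only for very small d_0/d. The values of n, d and s below are arbitrary.

```python
import numpy as np

n, d, s = 10_000, 1_000, 2.0

for d0 in (1, 5, 20, 100, 500):
    selection = d0 * np.log(d * np.e / d0) / n           # model selection term
    estimation = d0 * n ** (-2 * s / (2 * s + 1))        # nonparametric term
    which = "selection" if selection > estimation else "estimation"
    print(f"d0 = {d0:4d}: selection {selection:.2e}  estimation {estimation:.2e}  dominant: {which}")
```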
It can be shown that the exact minimax rate (without the extra ln n-factor) can be achieved by the penalized SpAM estimator of [10] applied to the classification setup. However, it is not adaptive to smoothness, since both the penalties and the optimal values of the smoothing parameters depend on the s_j's. For the particular case of a design on a regular lattice, [2] proposed an adaptive SpAM estimator based on a certain complexity penalty on the number of nonzero rows of B and on their nonzero entries, which achieves the exact minimax risk within the Gaussian additive models framework. Sparse group Lasso/Slope can be viewed as a convex surrogate for their complexity penalty. We conjecture that under certain conditions the results of [2] can be extended to a general design and the logistic loss, but the use of a complexity penalty would make the estimator computationally infeasible in this case.

Analytic functions
Assume now that the g_j's are very smooth, namely that g_j belongs to a class of analytic functions A^{α_j}[0, 1] = {g_j : Σ_{ℓ=0}^∞ e^{2α_j ℓ} β_jℓ² ≤ R_j}, α_j > 0. Such functions admit an analytic continuation into a band of width α_j of the complex plane (e.g., [9]) and, in particular, are infinitely differentiable.
Define the set of sparse additive functions G_A(d_0, α) with g_j ∈ A^{α_j}[0, 1], j ∈ J, and the corresponding set of SpAM classifiers C_A(d_0, α). We now show that the logistic sparse group Lasso classifier η̂_sgL in (2.5), applied to the cosine Fourier coefficients β_jℓ with the same choice of tuning parameters as in Theorem 3.2 for Sobolev classes, is also nearly minimax over C_A(d_0, α), while the logistic sparse group Slope classifier η̂_sgS in (2.7) achieves the exact minimax rate for these classes.

Besov classes
Consider now more general Besov classes B^s_{p,q}. The precise formal definition of Besov spaces can be found, e.g., in [18]. On an intuitive level, the (not necessarily integer) s measures the number of the function's derivatives, whose existence is required in the L_p-sense, while q provides a further finer gradation. Besov spaces include, in particular, the traditional Sobolev H^s and Hölder C^s classes of smooth functions (B^s_{2,2} and B^s_{∞,∞}, respectively), but also various classes of spatially inhomogeneous functions, such as functions of bounded variation, sandwiched between B^1_{1,1} and B^1_{1,∞} ([18, 8]). Besov classes can be equivalently characterized by the coefficients of a function's expansion in an orthonormal wavelet series. Given a compactly supported scaling function φ of regularity r and the corresponding mother wavelet ψ, one generates an orthonormal wavelet basis on the unit interval with ψ_{hk}(t) = 2^{h/2} ψ(2^h t − k), h ≥ 0, k = 0, ..., 2^h − 1 (see [6]). Assume that g_j ∈ B^{s_j}_{p_j,q_j}[0, 1], 1 ≤ p_j, q_j ≤ ∞, (1/p_j − 1/2)_+ < s_j < r, and let β_{j,hk} = ∫_0^1 g_j(t)ψ_{hk}(t)dt be its wavelet coefficients. The Besov norm of g_j is then equivalent to the corresponding sequence-space norm of its wavelet coefficients, with the l_{p_j} and/or l_{q_j} norms in (3.1) replaced by the corresponding l_∞-norms for p_j = ∞ and/or q_j = ∞ (e.g., [18, 8]). Moreover, setting ℓ = 2^h + k, h ≥ 0, 0 ≤ k ≤ 2^h − 1, one can re-write the triple-indexed wavelet coefficients β_{j,hk} in terms of the double-indexed β_jℓ.

Define the set of sparse additive functions G_B(d_0, s), where 1 ≤ p_j, q_j ≤ ∞ and (1/p_j − 1/2)_+ < s_j < r for all j ∈ J, and the corresponding set of SpAM classifiers C_B(d_0, s). Theorem 3.6 establishes the upper bounds on the misclassification excess risk of the wavelet-based logistic sparse group Lasso and Slope classifiers for the tuning parameters λ_j and κ_ℓ, ℓ = 1, ..., n, specified in its statement, with constants C_1, C_2 > 0 that can be derived from the proof. Theorem 3.7 provides the matching lower bound: for the SpAM (1.1)–(1.3) with d_0 ln(de/d_0) ≤ n and g ∈ G_B(d_0, s), the minimax misclassification excess risk is bounded from below by the same rate, up to log-factors, for some C > 0.
Thus, the sparse group Lasso and Slope wavelet-based SpAM classifiers with the tuning parameters from Theorem 3.6 are simultaneously nearly optimal (up to log-factors) across the entire range of Besov classes, including both smooth and non-smooth functions.
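The re-indexing ℓ = 2^h + k of the wavelet coefficients used above is pure bookkeeping and can be made explicit with a small helper (this is only the index map, not the wavelet transform itself).

```python
def to_single_index(h, k):
    """Map resolution/location (h, k), 0 <= k < 2**h, to ell = 2**h + k."""
    assert 0 <= k < 2 ** h
    return 2 ** h + k

def to_double_index(ell):
    """Inverse map: recover (h, k) from ell >= 1."""
    h = ell.bit_length() - 1
    return h, ell - 2 ** h

# Example: the first two resolution levels and a few inversions.
print([to_single_index(h, k) for h in range(2) for k in range(2 ** h)])   # [1, 2, 3]
print([to_double_index(ell) for ell in (1, 2, 3, 4, 7)])                  # [(0,0),(1,0),(1,1),(2,0),(2,3)]
```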

Examples
In this section we present experimental results for the developed SpAM classifiers applied to both simulated and real data. Solving (2.3) and (2.6) numerically amounts to convex programming, and various methods can be used. We could not find available software for fitting logistic sparse group Slope and therefore tested Lasso-type classifiers. In particular, we adapted the R package sparsegl from CRAN, based on the group-wise majorization-minimization algorithm (see [15] for details). Vanilla Lasso and group Lasso were fitted as particular cases corresponding, respectively, to λ = 0 and κ = 0 in (2.3).

Simulations
We generated a sample of n feature vectors X_i = (X_i1, ..., X_id)^T ∈ R^d according to X_ij = (W_ij + U_i)/2, where the W_ij and U_i are i.i.d. U[0, 1]. Such a design results in correlations of 0.5 between all features. We considered a SpAM with the logit function g(x) = g_1(x_1) + g_2(x_2) + g_3(x_3), while all other additive components are identically zero. The resulting vectors g_1(X_1), g_2(X_2), g_3(X_3) ∈ R^n were centred and scaled to have unit Euclidean norm. For a generated X_i, the class label Y_i was sampled from a Bernoulli distribution B(1, p(X_i)), i = 1, ..., n, with p(X_i) = e^{g(X_i)}/(1 + e^{g(X_i)}). We used the cosine series basis and applied Lasso, group Lasso and sparse group Lasso (2.3) to find ĝ_sgL in (2.4). The tuning parameters were chosen by 10-fold cross-validation. The performance of the resulting SpAM classifiers (2.5) was measured by the misclassification excess risk, i.e., by comparing their misclassification errors with those of the Bayes (oracle) classifier η*(X) = I{g(X) ≥ 0} on independently generated test sets of size 100. The procedure was replicated 10 times for each combination of d and n.
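The simulation design can be reproduced with the sketch below. The equicorrelated construction X_ij = (W_ij + U_i)/2, the centering/scaling of the active components and the Bernoulli label generation follow the description above; the particular forms of g_1, g_2, g_3 are placeholders, since their exact definitions are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 400, 100

# Equicorrelated uniform design: X_ij = (W_ij + U_i)/2 with W_ij, U_i i.i.d. U[0, 1],
# which gives correlation 0.5 between every pair of features.
W = rng.uniform(size=(n, d))
U = rng.uniform(size=(n, 1))
X = (W + U) / 2.0

# Placeholder univariate components for the three active features; all others are zero.
g_list = [np.sin(2 * np.pi * X[:, 0]), (X[:, 1] - 0.5) ** 2, np.abs(X[:, 2] - 0.5)]

# Centre each component and scale it to unit Euclidean norm, as in the simulation.
G = np.column_stack([(v - v.mean()) / np.linalg.norm(v - v.mean()) for v in g_list])
g = G.sum(axis=1)

# Class labels: Y_i ~ Bernoulli(p(X_i)) with p = e^g / (1 + e^g).
p = 1.0 / (1.0 + np.exp(-g))
Y = rng.binomial(1, p)
```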
Table 1 presents the misclassification excess risks on the test sets, the numbers of selected features (nonzero rows of the regression coefficient matrix B) and the overall numbers of selected coefficients (nonzero entries in B) averaged over 10 replications.
Table 1 demonstrates that the more flexible sparse group Lasso consistently outperforms vanilla Lasso and group Lasso. The misclassification excess risks of all methods converge to zero as n increases, for all d. Group Lasso converges more slowly and results in a larger number of selected coefficients, often leading to zero training errors; this behaviour is explained by group Lasso keeping all coefficients of the selected nonzero rows.
It should be mentioned that while Lasso-type methods perform well for prediction, they may be less successful in feature selection (support recovery) when the features are strongly correlated, which is common for high-dimensional data (see, e.g., [5], Section 7.2). Thus, although in the vast majority of cases all the procedures correctly selected X_1, X_2 and X_3, they often also included spurious features and coefficients, especially for large d. Sparse group Lasso typically selected fewer features.

Real-data example
To illustrate the performance of the various Lasso-type classifiers on real data, we considered the well-known benchmark email spam example from Example 1 of [12]. The data consist of 4601 real and spam email messages, where for each email the outcome (mail or spam) and d = 57 numeric attributes, including the relative frequencies of the most commonly occurring words and other email characters, are available. The goal is to design a classifier for automatic spam detection based on these data.
Following [20], we used only n = 300 randomly selected messages for the training set to study the performance in the relatively small n (w.r.t. d) case. We used the cosine series basis and applied Lasso, group Lasso and sparse group Lasso (2.3) to find ĝ_sgL in (2.4). The features were first scaled to have unit Euclidean norms, and the test set was used as the hold-out set for choosing the tuning parameters. The performance of the resulting SpAM classifiers (2.5) was measured by the misclassification errors on the remaining 4301 observations forming the test set. For comparison, we also present the results reported in [20] for their SpAM classifier, which is based on a certain additive smoothing procedure.
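A sketch of this protocol is given below, assuming the spam data are available locally as a comma-separated file with 57 feature columns followed by a 0/1 label (the file name and column layout are assumptions): 300 randomly chosen messages form the training set, the features are scaled to unit Euclidean norm, and the remaining 4301 messages serve as the hold-out/test set.

```python
import numpy as np

# Assumed local copy of the spam data: 57 features + 0/1 label per row.
data = np.loadtxt("spambase.data", delimiter=",")
X_all, Y_all = data[:, :57], data[:, 57].astype(int)

rng = np.random.default_rng(0)
idx = rng.permutation(len(Y_all))
train, test = idx[:300], idx[300:]

# Scale each feature by its Euclidean norm on the training set.
norms = np.linalg.norm(X_all[train], axis=0)
norms[norms == 0] = 1.0
X_train, X_test = X_all[train] / norms, X_all[test] / norms
Y_train, Y_test = Y_all[train], Y_all[test]
```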
Table 2 provides the misclassification errors on the test set, the subsets of selected features and the overall numbers of selected coefficients for the considered classifiers. The Lasso-type classifiers resulted in smaller misclassification errors than the SpAM of [20]. The sparse group Lasso classifier yielded the smallest misclassification error, and both group-based classifiers outperformed vanilla Lasso. The latter disregards the global row-wise sparsity of the matrix B and identified 32 active features (nonzero rows), although it resulted in the smallest overall number of nonzero coefficients. Grouping coefficients belonging to the same feature allowed reducing the number of active features to 22, where both sparse group Lasso and group Lasso selected the same sets of features. While all entries of the nonzero rows of group Lasso are nonzero (hence 22 × 300 = 6600 nonzero coefficients), sparse group Lasso, in addition, thresholded part of them as well.
Summarizing, the results of this section indicate that Lasso-type classifiers demonstrate good performance across various types of data sets and confirm the established theoretical findings. The more flexible sparse group Lasso overall outperforms both vanilla Lasso, which overlooks the grouped structure of the coefficients, and group Lasso, which fails to account for within-row sparsity and retains entire nonzero rows.
Solving (A.5) and (A.7), after some algebra, yields the suitable choices of ε for the Sobolev/Besov and analytic cases. In the case d/4 ≤ d_0 ≤ d (hence, d_0 ∼ d), the model selection error is negligible w.r.t. the nonparametric estimation term. Evidently, no classifier can perform better than an oracle that knows the subset of truly active features J. Therefore, for Sobolev and Besov classes in this case ε*² = C Σ_{j∈J} n^{-2s_j/(2s_j+1)} and, similarly, for analytic functions, ε*² = C (ln n/n) Σ_{j∈J} 1/α_j. Applying (A.6) completes the proof.

Table 1: Average misclassification excess risks for various Lasso-type SpAM classifiers. The average numbers of selected features and of nonzero coefficients are given in parentheses.