A Fisher consistent multiclass loss function with variable margin on positive examples

The concept of pointwise Fisher consistency (or classification calibration) states necessary and sufficient conditions to have Bayes consistency when a classifier minimizes a surrogate loss function instead of the 0-1 loss. We present a family of multiclass hinge loss functions defined by a continuous control parameter λ representing the margin of the positive points of a given class. The parameter λ allows shifting from classification uncalibrated to classification calibrated loss functions. Though previous results suggest that increasing the margin of positive points has positive effects on the classification model, other approaches have failed to give increasing weight to the positive examples without losing the classification calibration property. Our λ-based loss function can give unlimited weight to the positive examples without breaking the classification calibration property. Moreover, when embedding these loss functions into the Support Vector Machine's framework (λ-SVM), the parameter λ defines different regions for the Karush-Kuhn-Tucker conditions. A large margin on positive points also facilitates faster convergence of the Sequential Minimal Optimization algorithm, leading to lower training times than other classification calibrated methods. λ-SVM allows easy implementation, and its practical use in different datasets not only supports our theoretical analysis, but also provides good classification performance and fast training times.


Introduction
Many of the most used classification algorithms are based on the minimization of a surrogate convex loss function, since the direct minimization of the 0-1 loss is computationally intractable. Some examples of these algorithms include Support Vector Machines (SVMs) [8,10,34], boosting [13,7,22], and logistic regression [14]. Conditions such as convexity, continuity, and differentiability of these surrogate loss functions are easy to analyze; however, the statistical implications of using these surrogate loss functions are not so evident [2]. The notion of classification calibration was initially defined by Bartlett et al. [2,3] as a pointwise form of Fisher consistency for classification. It was shown to be a necessary and sufficient condition for a binary classifier to be Bayes consistent when the empirical risk Ψ of a surrogate loss function converges to the minimal possible Ψ-risk. Tewari and Bartlett [37, Theorem 2] extended this classification calibration concept to multiclass problems. However, the extension of binary loss functions to multiclass classification settings is non-trivial, leading to a large body of research to better understand classification calibration in multiclass scenarios [23,26,37,41,42,43,28,40]. The main contribution of this work is the formulation of a pointwise Fisher consistent (classification calibrated) multiclass loss function that can give arbitrarily high weight to the margin of the positive points, which is shown to be beneficial in terms of classification accuracy and training times. Our loss function overcomes some limitations of previous approaches since (i) it allows overweighting the margin of positive points while maintaining classification calibration, (ii) it yields consistent classification accuracies with respect to the classification calibration domain, and (iii) it can be efficiently trained when embedded in the Support Vector Machine's framework (λ-SVM).
There exist two main strategies to extend binary learning algorithms to a multiclass setting. The first approach consists of formulating the multiclass classification problem as a combination of several binary classification tasks. It includes strategies such as one-versus-rest, one-versus-one, and pairwise coupling [1]. Though these strategies are easy to implement, having optimal solutions of those binary classifiers does not guarantee having a globally optimal solution for the multiclass problem. Additionally, the multiclass loss function does not necessarily inherit the classification calibration properties of its binary counterpart [37,26,42]. For example, the hinge loss function commonly used in Support Vector Machines has been shown to be classification calibrated for binary problems [25], but the one-versus-rest strategy may be inconsistent when there is no dominating class [26]. The second approach is based on the formulation of multiclass surrogate loss functions to use the same global optimization procedure as in the binary case. Though several multiclass hinge loss functions can be found in the literature [39,9,23,28,19], only two of them have been shown to be classification calibrated for every multiclass problem: Lee et al.'s loss function [23] and Liu and Yuan's loss function (reinforced multicategory hinge loss) [28]. However, these approaches present some limitations. On the one hand, Lee et al.'s loss function does not consider the slack of the positive points of a given class, so it overlooks valuable information for the classification algorithm, as will be shown in our experiments (Section 6). On the other hand, the reinforced multicategory hinge loss considers both the margin of the positive and negative points of a given class, but experimental results in [28] on two synthetic datasets show that the best classification performances are obtained for values of γ that overweight the margin of the positive points and make the loss function classification uncalibrated. As pointed out by the authors, this is a surprising result. However, it points out the importance of paying attention to the error of positive examples. Additionally, the reinforced multicategory hinge loss assigns a margin of (L − 1) to the positive points, with L the number of classes, which is justified as a natural choice to have sum-to-zero loss functions. Unfortunately, using this margin in the context of Support Vector Machines is not beneficial for the optimization algorithm, as it represents a boundary between different Karush-Kuhn-Tucker (KKT) conditions, as we show in Proposition 3 (Section 5). A more detailed comparison between the different multiclass loss functions proposed in the literature can be found in Section 3.1.
The key contributions of this paper are:
• Formulation of a new family of multiclass hinge loss functions with a single control parameter λ ∈ R that represents the margin of the positive points of a given class. Our family of loss functions takes into account the error of both the positive and negative points of a given class, and it allows us to freely overweight the error associated with the positive points without losing classification calibration, a property that was attempted by Liu and Yuan [28] and Huerta et al. [19] but was not fully achieved.
• Characterization of the classification calibration domain of this family of hinge loss functions. We show that its classification calibration properties can be fully controlled by λ. This analysis reveals that our family of loss functions is classification calibrated provided that the margin of the positive points is larger than (L − 2)/2, with L the number of classes in the problem. This interesting property makes it possible to define a classification calibrated hinge loss function for every multiclass classification problem while overweighting the error of positive points. In other words, as long as one chooses λ ∉ [0, (L − 2)/2], one can optimize the λ meta-parameter while guaranteeing the classification calibration property.
• Formulation of a common framework that connects the new family of loss functions with other classification calibrated multiclass hinge loss functions studied in the literature (Figure 1). Certain values of λ recover the hinge loss functions proposed by Lee et al. [23], Liu and Yuan [28] (γ = 1/2), and Huerta et al. [19], but appropriate values of λ allow overcoming some limitations of previous approaches. Lee et al.'s loss function does not take into account the margin of the positive points of a given class, Huerta et al.'s loss function is not classification calibrated for classification problems with more than three classes, and the reinforced multicategory hinge loss cannot give large weight to positive classes without losing the classification calibration property.
• New multiclass SVM algorithm, named λ-SVM, formulated under the Inhibitory Support Vector Machine's formalism to guarantee sum-to-zero decision functions [19]. The λ-SVM implementation is based on Sequential Minimal Optimization (SMO) [30,20], the de facto standard in non-linear SVM training software [6]. Our C++ and Matlab implementations of λ-SVM are provided as Supplementary Material.
• Theoretical and empirical analysis of the λ-SVM solutions (Karush-Kuhn-Tucker conditions) as a function of λ, showing that choosing λ in [(L − 2)/2, L − 1] slows down training times given the presence of different KKT conditions in the vicinity of λ.
• Empirical proof on real-world datasets of the advantage of (i) using classification calibrated loss functions in terms of classification accuracy, and (ii) overweighting the error of the positive points in terms of computational speed.
The paper is organized as follows. Section 2 defines classification calibration for multiclass problems and establishes its relationship with Bayes consistency. Section 3 presents our family of multiclass hinge loss functions with variable margin λ and characterizes the relationships between this family of loss functions and other multiclass losses existing in the literature. Section 4 formulates Theorem 2, which states the range of values of λ that makes our family of loss functions classification calibrated (the classification calibration domain). Section 5 integrates our family of loss functions into the Support Vector Machines' framework to give rise to a new multiclass SVM model with variable margin λ (λ-SVM). Section 5 also analyzes λ-SVM solutions and KKT conditions to define a range of values for λ with good convergence properties. Section 6 provides results on four publicly available datasets in terms of classification accuracy and training times as a function of the margin of positive points λ. This section also provides a comparison with MSVMpack [21], a well-known package for multiclass Support Vector Machines. Finally, Section 7 formulates the conclusions derived from this work. A detailed proof of Theorem 2 can be found in Appendix A, and C++ and Matlab codes for λ-SVMs are provided as Supplementary Material [33].

Classification calibration for multiclass loss functions
Given an L-class classification problem (L ≥ 2), the goal of a multiclass classification algorithm is to find a classifier φ : X → Y such that the class label of every input pattern x ∈ X is correctly estimated. In other words, our goal is to find a classifier φ such that φ(x) = y for all (x, y) ∈ X × Y. However, this goal is fully achievable only when the classification problem is separable; otherwise, the objective is to correctly classify the maximum number of samples. Without loss of generality, let us assume that x_i ∈ X ⊆ R^M is an input vector, and y_i ∈ Y = {1, 2, ..., L} is its class label. We are interested in minimizing the expected misclassification risk, which is expressed as

R(φ) = E_XY [ I_{φ(x) ≠ y} ],     (1)

where E_XY is the expectation with respect to the distribution of X × Y, and I_A is the indicator function taking the value 1 if A is true, and 0 otherwise. The misclassification risk yields the probability that φ(x) provides an incorrect prediction for x ∈ X. The least possible R(φ), denoted R*, defines the Bayes risk. This is the risk associated with the Bayes rule, which is the optimal classification strategy consisting of predicting the majority class for x. The Bayes risk R* is defined as

R* = E_X [ 1 − max_{y∈Y} P_y(x) ],

where P_y(x) = P(Y = y | X = x) is the probability of class y given the point x. However, in practice we do not have a whole representation of X × Y; instead, we have a set of N training pairs {(x_i, y_i)}_{i=1}^N. In this case, our goal is to minimize the empirical error on the training data, which is given by

R_emp(φ) = (1/N) Σ_{i=1}^N I_{φ(x_i) ≠ y_i}.

Therefore, the minimum possible value of the empirical error is zero, and it corresponds to the case when all the training points are correctly classified.
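As a toy illustration of these quantities, the following Python sketch computes the Bayes rule, the Bayes risk, and the empirical 0-1 error on a small discrete problem. The distribution P below is illustrative only and not taken from the paper:

```python
import numpy as np

# Toy discrete problem: x takes 3 values, L = 3 classes, with known
# conditional probabilities P[x, y] = P(Y = y | X = x) (illustrative numbers).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

# Bayes rule: predict the majority (most probable) class for each x.
bayes_pred = P.argmax(axis=1)

# Pointwise risk of the Bayes rule is 1 - max_y P(y|x); the Bayes risk R*
# averages it over the marginal distribution of X (uniform here).
Px = np.full(3, 1.0 / 3.0)
bayes_risk = float(Px @ (1.0 - P.max(axis=1)))

# Empirical 0-1 error of a classifier phi (given as a lookup table) on a sample.
def empirical_error(phi, xs, ys):
    return float(np.mean([phi[x] != y for x, y in zip(xs, ys)]))
```

For instance, `empirical_error(bayes_pred, [0, 1, 2], [0, 1, 1])` evaluates the 0-1 loss on a three-point sample: the Bayes rule minimizes the expected risk, but its empirical error on a finite sample need not be zero.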
In what follows, we assume that the classifier φ is expressed as the composition of functions f and pred:

φ(x) = pred(f(x)).     (2)

Here, f is an L-vector that belongs to F, a class of vector functions f : X → R^L. We refer to f(x) = (f_1(x), f_2(x), ..., f_L(x)) as the decision function vector or the decision functions of point x. Each coordinate of f corresponds to the evaluation in x of the decision function associated with each class. The function pred discretizes f(x), and it is defined as pred(x) = arg max_j {f_j(x)}. Given that the maximizing argument of f is invariant with respect to the addition of a constant to all entries in f, it is advisable to impose a sum-to-zero constraint in order to simplify the analysis. Then, the class of vector functions F is defined as

F = { f : X → R^L  such that  Σ_{j=1}^L f_j(x) = 0  for all x ∈ X }.

According to this mathematical framework, the classification function φ is unequivocally defined by the decision function f and, thus, the goal of the classifier is to minimize Eq. (1) with respect to f. However, the direct minimization of Eq. (1) is known to be NP-hard [11,4], so it is common to minimize instead surrogate loss functions Ψ_y(f(x)) that approximate the 0-1 loss function and have good computational guarantees such as differentiability and convexity. More precisely, Ψ_y(f(x)) is defined as a continuous function from R^L to R_+, and it can be understood as the loss associated with predicting the label of x using f(x) when the true label is y. Therefore, the expected risk associated with Ψ_y (the Ψ-risk) is defined as

R_Ψ(f) = E_XY [ Ψ_y(f(x)) ].     (3)

Then, in practice, the classifier is inferred from the decision function f_N that minimizes the empirical Ψ-risk:

f_N = arg min_{f∈F} (1/N) Σ_{i=1}^N Ψ_{y_i}(f(x_i)).

In this framework, Bartlett et al. formulate the concept of classification calibration as a necessary and sufficient condition to have Bayes consistency when the empirical risk of a binary loss function Ψ_y converges to the minimal possible Ψ-risk [3]. Tewari and Bartlett extend this classification calibration concept to multiclass problems [37, Theorem 2]. They show that multiclass classification calibration is equivalent to Bayes consistency assuming convergence of the empirical Ψ-risk to the minimal possible Ψ-risk, and they characterize classification calibration in terms of geometric properties of the loss function. Interestingly, Tewari and Bartlett also show that Bayes consistency of binary classifiers does not automatically imply Bayes consistency of the multiclass loss function and, thus, the classification calibration problem is more interesting in multiclass settings. The classification calibration definition derives from the minimization of the Ψ-risk. Writing the Ψ-risk as

R_Ψ(f) = E_X [ Σ_{y∈Y} P_y(x) Ψ_y(f(x)) ],

the minimization of Eq. (3) is equivalent to the minimization of the inner conditional expectation for each x. Initially proposed by Tewari and Bartlett [37, Definition 1], the classification calibration property can be defined as follows.

Definition 1. [37,42] A surrogate function Ψ_y(f(x)) is said to be classification calibrated w.r.t. a margin vector f(x) = (f_1(x), f_2(x), ..., f_L(x))^T if, for all {P_y(x)}_{y∈Y} ∈ Δ_L, where Δ_L = {P ∈ R^L : P_i ≥ 0 ∀i = 1, ..., L and Σ_{i=1}^L P_i = 1} is the probability simplex in R^L, the following conditions are satisfied:
1. The risk minimization problem f̂(x) = arg min_{f(x)∈F} Σ_{y∈Y} P_y(x) Ψ_y(f(x)) has a unique solution f̂(x) = (f̂_1(x), f̂_2(x), ..., f̂_L(x))^T for all x ∈ X; and
2. arg max_{y∈Y} f̂_y(x) = arg max_{y∈Y} P_y(x) for all x ∈ X.
Intuitively, Definition 1 states that the loss function Ψ y is classification calibrated if its minimum allows recovering the index of the maximum probability for all x ∈ X .
Finally, it is worth noting that classification calibration is closely related to the concept of proper loss functions.However, classification calibration is a weaker condition as it only focuses on classification rather than estimating probabilities as in the case of properness [31].For a more detailed explanation of the classification calibration framework and the consequent Bayes consistency properties, the reader is referred to [3,37] and references therein.

Family of loss functions with variable margin λ
The analysis of Bayes consistency and classification calibration of several multiclass hinge loss functions has been extensively addressed in the literature [23,26,37,41]. However, many existing multiclass loss functions are not classification calibrated. In order to provide a classification calibrated multiclass hinge loss function for every multiclass classification problem, we propose to use a family of loss functions regulated by a control parameter λ. Our set of loss functions for a data point x_i is given in Eq. (4)-(5), where [ρ]_+ takes the value ρ for ρ ≥ 0, and 0 otherwise. Intuitively, the loss imposes a variable margin λ for points in class y_i and a margin of 1 for points belonging to other classes. Eq. (4)-(5) are indeed a continuum of loss functions parametrized by λ. Finally, note that Ψ_y(f(·)) satisfies arg min_j {Ψ_j(f(x_i))} = arg max_j {f_j(x_i)} = pred(x_i).
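One plausible reading of Eq. (4)-(5) consistent with the description above (margin λ on the positive class, margin 1 on the rest) is Ψ_y(f) = [λ − f_y]_+ + Σ_{j≠y} [1 + f_j]_+. The Python sketch below implements this hedged reconstruction; it is an assumption for illustration, not a quotation of the paper's exact formula:

```python
import numpy as np

def lambda_hinge(f, y, lam):
    """Hedged reconstruction of the variable-margin hinge loss:
    [lam - f_y]_+ for the positive class y, plus sum_{j != y} [1 + f_j]_+
    for the negative classes, where [rho]_+ = max(rho, 0)."""
    f = np.asarray(f, dtype=float)
    pos = max(lam - f[y], 0.0)                                   # margin lam on class y
    neg = sum(max(1.0 + fj, 0.0) for j, fj in enumerate(f) if j != y)
    return pos + neg
```

Under this form, a sum-to-zero vector f = (2, −1, −1) incurs zero loss for the first class when λ = 2, since the positive score reaches its margin and every negative score sits at −1; and, for a fixed f, the class minimizing Ψ_j(f) is the class maximizing f_j, matching the pred function.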

Connection with other multiclass loss functions
The connection between our family of loss functions and some other multiclass loss functions proposed in the literature is shown in Figure 1. Certain values of the parameter λ allow us to recover some existing classification calibrated loss functions. The equivalence to Lee et al.'s loss function [23] is obtained with λ < −1, but our family of loss functions is able to consider the slack of the positive points of a given class, which is beneficial to efficient learning (Section 6). The equivalence to the reinforced multicategory hinge loss [28] is obtained for γ = 1/2 and λ = L − 1, where L is the number of classes. In fact, the authors suggest using the reinforced multicategory hinge loss with γ = 1/2 as a good trade-off between classification accuracy and classification calibration. However, the best performance is generally obtained in classification uncalibrated scenarios (γ > 1/2) in which the margin of the positive points dominates the loss function. On the other hand, the reinforced multicategory hinge loss sets the margin of positive points λ equal to (L − 1) to have sum-to-zero decision functions. Beyond the mathematical convenience, this decision restricts the classification calibration domain of the loss function to γ ≤ 1/2. In Section 4, we show that our family of loss functions not only provides optimal decision functions different from those of the reinforced multicategory hinge loss, but also allows us to have a classification calibrated loss function while giving arbitrarily high weight to the margin of the positive points of a given class. Furthermore, setting the margin of positive points equal to (L − 1) has negative effects on the optimizer, since three different KKT solutions are obtained for any interval containing λ = L − 1 (Section 5). Our loss function also becomes equivalent to that of Inhibitory Support Vector Machines (ISVMs) proposed by Huerta et al. when λ = 1 [19]. Huerta et al. show that their loss function is classification uncalibrated for problems with more than three classes. This result matches the one obtained in Section 4. The introduction of the variable margin in our loss functions makes it possible to define a classification calibrated loss function for every classification problem regardless of its number of classes.
Besides the multiclass loss functions that can be treated as special cases of our multiclass loss function, other multicategory loss functions can also be found in the literature. Guermeur and Monfrini propose a multiclass loss function with a quadratic loss instead of the hinge loss (the MSVM2 loss function) [18]. As stated by Guermeur and Monfrini, the main advantage of using the 2-norm loss is that the training algorithm can be expressed, after an appropriate change of kernel, as the training algorithm of a hard margin machine. Guermeur and Monfrini established a generalized radius-margin bound on the leave-one-out error of the hard margin version of their loss function. This provides them with a differentiable objective function to perform model selection for the MSVM2 loss. However, the hinge loss is usually preferred for classification tasks. Additionally, though Guermeur and Monfrini state that their MSVM2 loss function can be seen as a quadratic loss variant of the multiclass SVM of Lee et al. [23], the consistency properties of MSVM2 are not discussed. In Section 6, we include a comparison in terms of classification accuracy and training times between the MSVM2 loss function implemented in the MSVMpack package [21] and our loss function.
Liu and Shen's multiclass loss function [27] is an extension of the binary ψ-learning originally proposed by Shen et al. [35]. ψ-learning is another margin-based technique that replaces the convex SVM loss function by a non-convex ψ-loss function. Shen et al. show that ψ-learning can achieve good classification rates while maintaining the margin interpretation. They also show that their loss function converges to the Bayes decision rule. In contrast, our loss function extends the hinge loss function traditionally used in SVMs while ensuring consistency for certain values of λ. As an extension of traditional SVMs, the λ-SVM problem is convex, and solvers commonly used for SVMs can be applied. However, these solvers are not suitable for the ψ-loss function; a method based on a difference of convex functions (dc) decomposition is used instead to solve the multiclass ψ-learning optimization problem.
Finally, the L1MSVM approach is another multiclass Support Vector Machine model that is based on the L1-norm [38]. L1MSVM simultaneously performs feature selection and classification through an L1-norm penalized sparse representation. L1MSVM is formulated to use several loss functions that can be expressed in a unified fashion. Wang and Shen conduct a detailed analysis of L1MSVM considering Lee et al.'s loss function [23], which is known to be classification calibrated. Unlike L1MSVM, in this work we embed our loss functions into the Inhibitory Support Vector Machine framework with an L2-norm regularization term.

Classification calibration domain
According to the framework described in Section 2, the analysis of classification calibration requires minimizing the inner conditional expectation for each x in Eq. (3). In what follows, we fix x and omit dependencies on x to simplify the notation. Replacing Ψ_y(f) by our set of loss functions in Eq. (3), we obtain

f̂ = arg min_{f∈F} Σ_{y∈Y} P_y Ψ_y(f).     (6)

Now, we are ready to formulate the following theorem that characterizes the classification calibration domain of our family of loss functions.
Theorem 2. Given a multiclass classification problem with L classes, the family of loss functions defined in Eq. (4)-(5) is classification calibrated if and only if λ ∈ (−∞, 0) ∪ ((L − 2)/2, +∞).

Proof. A detailed proof can be found in Appendix A. The sketch of the proof can be outlined as follows: for λ ≤ L − 1, it is shown that the optimal decision functions are lower bounded by −1, while for λ > L − 1 the decision functions are upper bounded by λ. Taking into account these bounds together with the sum-to-zero constraint, the minimization problem in Eq. (6) is formulated as an optimization problem with equality and inequality constraints. Then, the relationships between decision functions and class probabilities, which allow us to determine the classification calibration properties of our loss functions, are stated by the Karush-Kuhn-Tucker (KKT) conditions [5].
According to Theorem 2, we can define classification calibrated multiclass hinge loss functions for any multiclass classification problem by means of the scalar parameter λ. Certain values of λ make it possible not only to have classification calibrated loss functions, but also, unlike the other classification calibrated loss function [23], to take into account the margin of positive points. Figure 2 shows the ratio of classification uncalibrated solutions obtained by Monte Carlo simulations for different values of the control parameter λ and the number of classes L. These results were obtained by counting the number of classification uncalibrated cases when minimizing the empirical Ψ-risk in Eq. (6) for 10,000 random points of the probability simplex in R^L. The Monte Carlo simulations give evidence of the classification calibration domain presented in Theorem 2: λ ∈ (−∞, 0) ∪ ((L − 2)/2, +∞).

Fig 2. Monte Carlo simulation results on the minimization of the empirical Ψ-risk associated with our multiclass loss functions with variable margin λ. The figure shows the ratio of classification uncalibrated cases as a function of the control parameter λ and for different numbers of classes L. The number of simulations was set to 10,000.
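A scaled-down version of the Monte Carlo check behind Figure 2 can be sketched as follows. It assumes the reconstructed loss form Ψ_y(f) = [λ − f_y]_+ + Σ_{j≠y}[1 + f_j]_+ (a hypothesis, since Eq. (4)-(5) are not reproduced here), minimizes the inner expectation of Eq. (6) by a coarse grid search over sum-to-zero decision vectors with L = 3, and checks whether the minimizer's arg max recovers the most probable class:

```python
import itertools
import numpy as np

def psi(f, y, lam):
    # Assumed loss: [lam - f_y]_+ + sum_{j != y} [1 + f_j]_+.
    return max(lam - f[y], 0.0) + sum(max(1.0 + f[j], 0.0)
                                      for j in range(len(f)) if j != y)

def inner_risk(f, P, lam):
    # Inner conditional expectation of Eq. (6): sum_y P_y * Psi_y(f).
    return sum(P[y] * psi(f, y, lam) for y in range(len(P)))

def grid_minimizer(P, lam, lo=-3.0, hi=4.0, step=0.1):
    # Coarse search over sum-to-zero vectors (f0, f1, -f0 - f1), L = 3.
    grid = np.arange(lo, hi + step, step)
    best_f, best_r = None, np.inf
    for f0, f1 in itertools.product(grid, repeat=2):
        f = (f0, f1, -f0 - f1)
        r = inner_risk(f, P, lam)
        if r < best_r:
            best_r, best_f = r, f
    return np.array(best_f)

# With a dominant class and lambda = 3 > (L - 2)/2, the arg max of the
# approximate minimizer should match the arg max of the probabilities.
P = [0.7, 0.2, 0.1]
f_hat = grid_minimizer(P, lam=3.0)
```

Repeating this over many random draws from the simplex, and over a grid of λ values, reproduces the kind of uncalibrated-ratio curves shown in Figure 2, up to the accuracy of the grid and of the assumed loss form.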

Classification calibration for Support Vector Machines
Large-margin classifiers make the minimization of the 0-1 loss tractable by using convex surrogate loss functions. Examples of this approach are Support Vector Machines [8] and boosting [13]. The general formulation of a large-margin classification algorithm with regularization is

min_{f∈F} (1/N) Σ_{i=1}^N Ψ_{y_i}(f(x_i)) + ρ J(f),

where J(f) is a regularization term to penalize the model complexity, and ρ is the regularization parameter. Our proposed loss functions can be used in any standard regularized empirical risk minimizer. We used the Sequential Minimal Optimization (SMO) implementation of the Inhibitory Support Vector Machines (ISVMs) [19] since, as described further down in this section, they implicitly produce sum-to-zero decision functions for any example, while standard SVMs do not. For example, Lee et al.'s implementation of SVMs needs to explicitly add a sum-to-zero constraint that is not necessary in the ISVM implementation [23]. A key feature of the ISVM is its ease of implementation, which allows a quick adaptation to any variable margin framework.
ISVM is an extension of SVM that provides a simple algorithm for multiclass classification by directly integrating the concept of inhibition into the SVM formalism. The objective of the inhibition mechanism behind the ISVM algorithm is to find a hyperplane associated with each class, {w_j}_{j=1}^L, that exerts downward pressure on the rest of the hyperplanes while trying to maximize its generalization capability. The ISVM decision function for class j evaluated at a data point x_i has the form

f_j(x_i) = ⟨w_j, Φ(x_i)⟩ − μ Σ_{k=1}^L ⟨w_k, Φ(x_i)⟩,     (7)

where Φ is a mapping function from the original input space to a higher-dimensional space V (feature space) where the optimal hyperplane is calculated. The parameter μ is a scalar number that regulates the inhibitory term, which is the key difference with respect to standard SVMs. The optimal decision vector f is determined by following the standard SVMs' framework. The ISVM primal problem is expressed in Eq. (8)-(10), where w is the concatenation of the hyperplanes of each class, w = [w_1, ..., w_L], {η_ij} are the slack variables that provide room to handle noisy data, and y_ij takes the value 1 if the pattern x_i belongs to class j (i.e., y_i = j) and −1 otherwise. Note that now the trade-off between the regularization term and the loss function is controlled by the cost parameter C instead of the regularization parameter ρ. To simplify the notation, in what follows we assume that the cost parameter C is already normalized by the number of training points (N) and the number of classes (L). Inhibitory Support Vector Machines use an input space formed by L concatenations of the original input space X, and they use a feature space that is the product space V^L. Then, an input vector χ_i ∈ R^{ML} is formed by L concatenations of the original training pattern x_i ∈ R^M. The corresponding nonlinear transformation Υ(χ) ∈ V^L is defined as Υ(χ) = (Φ(x), Φ(x), ..., Φ(x)) (L times), and Υ_j(χ) is the composition of Υ(χ) with the projection operator onto the j-th coordinate subspace corresponding to the j-th class; that is, Υ_j(χ) = (0, 0, ..., Φ(x), ..., 0) with all coordinates except the j-th equal to zero. The transformations Υ and Υ_j inherit many properties from the mapping function Φ(x), summarized in Properties (11)-(13). In particular, Huerta et al. show that the optimal value for μ is μ = 1/L, which can be obtained directly from the minimization of the Lagrangian of Problem (8)-(10) [19]. They also show that, in that limit, ISVMs become a tight bound to probabilistic exponential models. The inhibition term, therefore, is the average over the evaluation of the hyperplanes of each class. Interestingly, μ depends on the number of classes of the problem, but is independent of the training points themselves. This result is especially appealing when working with multiclass margin vectors, since it yields sum-to-zero decision functions without imposing additional constraints in the optimization problem: ISVM automatically embodies all the zero-sum loss functions. The loss function in Eq. (10) corresponds to λ = 1. Therefore, it is straightforward to integrate the family of loss functions presented in Eq. (4)-(5) into the ISVM's framework, which leads to the λ-SVM primal problem in Eq. (14)-(16). To obtain the solution to Problem (14)-(16), we compute its Lagrangian, given in Eq. (17), where the Lagrange multipliers are α_ij ≥ 0 and ζ_ij ≥ 0. The decision function associated with the j-th class for a training point x_i (Eq. (7)) is now expressed as f_j(x_i) = ⟨w, Υ_j(χ_i)⟩ − μ ⟨w, Υ(χ_i)⟩. We calculate the partial derivatives of the Lagrangian with respect to the primal variables w, η, and μ and set them equal to zero, which leads to Eq. (18)-(20). Then, as in [19, Appendix B], replacing Eq. (19) in Eq. (20) yields the optimal μ as μ = 1/L. Since the partial derivatives of the Lagrangian w.r.t. μ and w do not depend on λ, this reasoning is valid for any λ ∈ R and, thus, sum-to-zero decision functions are guaranteed for all λ ∈ R. This property makes the ISVM's framework advantageous for implementing multicategory margin vectors. Now, we obtain the ISVM dual problem by applying Eq. (18)-(19) and Properties (11)-(13) to the Lagrangian in Eq. (17) with μ = 1/L. It leads to the dual cost function W, which has to be maximized with respect to the Lagrange multipliers α_ij. The decision function for the j-th class can then be written in terms of the Lagrange multipliers and the kernel function as in Eq. (21). We can simplify the evaluation by computing only the class-specific term of Eq. (21), since the remaining terms simply add the same constant to all the classes. The class of the test sample x is defined as arg max_j f_j(x). Following the notation in [19], we change the double index notation α_ij to a new index k running from 1 to NL. Assuming lexicographical order in the α_ij's, the dual cost function W can be written in terms of the single index k, from which the KKT conditions for the λ-SVM training problem follow. Huerta et al. provide a very simple implementation of ISVM with λ = 1 based on Sequential Minimal Optimization (SMO) [30] that can be easily translated to the variable margin setting with minimal changes in the computer program.
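Once the dual multipliers are available, prediction reduces to the simplified per-class score described above. The sketch below assumes the class-specific term of Eq. (21) has the usual kernel-expansion form f̃_j(x) ∝ Σ_i α_ij y_ij K(x_i, x) (a hedged reading, since Eq. (21) is not reproduced here); the dropped inhibitory average adds the same constant to every class, so arg max_j is unchanged:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian kernel matrix K[i, m] = exp(-gamma * ||X_i - Z_m||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def class_scores(alpha, Y_pm, X_train, X_test, gamma=1.0):
    """alpha: (N, L) Lagrange multipliers; Y_pm: (N, L) with y_ij = +1 if
    y_i = j and -1 otherwise. Returns the (L, N_test) class-specific scores
    sum_i alpha_ij * y_ij * K(x_i, x), up to the constant inhibitory term."""
    K = rbf_kernel(X_train, X_test, gamma)       # (N, N_test)
    return (alpha * Y_pm).T @ K

def predict(alpha, Y_pm, X_train, X_test, gamma=1.0):
    return class_scores(alpha, Y_pm, X_train, X_test, gamma).argmax(axis=0)
```

For two well-separated clusters and uniform multipliers, the arg max recovers the training labels; in practice, the α_ij come from maximizing the dual cost W with SMO.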
As originally proposed by Platt [30], the resolution of the proximity to the KKT conditions in the optimization algorithm is controlled by a tolerance parameter T > 0 and a numerical resolution ε, which depends on the machine precision.
Then, the fulfillment of the KKT conditions is formulated as in Eq. (24)-(26). An interesting point of analysis is to determine the stability of the SMO optimization algorithm by taking into account the different regions of optimal decision functions defined by λ and summarized in Figure 3 (for more details, the reader is referred to Appendix A). Note that, since SMO is used as the optimization algorithm, the λ-SVM optimal decision functions are defined as a function of the optimal Lagrange multipliers α̂_j according to Eq. (21), which can be easily obtained by means of the equality V_i = y_i(f̂_i − l_i y_i). To illustrate the negative effect of having different KKT solutions in the proximity of λ in terms of computational cost, we measured the training times in the simplest case in which SMO is applied to a single point. We set class probabilities to P_1 = 0.375, P_2 = 0.34, and P_3 = 0.28, we created N = 264 training points, and we set T = 10^-3, ε = 10^-6, and C = 10^6. The resulting training times for different λ-regions are shown in Figure 3.
The different KKT conditions derived from Figure 3, together with the numerical KKT conditions in Eqs. (24)-(26), allow us to formulate the following proposition.
Proposition 3. The optimal solution for the SVMs with variable margin has three possible KKT solutions in the domain λ ∈ (L − 1 − T, L − 1 + T) for any resolution proximity T > 0.

The resolution proximity T in the KKT conditions (Eqs. (24)-(26)) implies solving the dual problem for an effective margin λ_eff ∈ (λ − T, λ + T). That is why the SMO algorithm shows slow convergence for λ in the proximity of the boundary between different KKT solutions. It should also be noted that there may exist other points subject to KKT variations inside the same classification calibration region, since the solutions for λ ∈ (−1, 0) and λ ∈ ((L − 2)/2, L − 1) depend on the class probability distribution, which in turn depends on λ. This is not the case for λ > (L − 1), since the transition between the two possible solutions is only defined by the class probabilities; that is, the KKT conditions are constant given any λ > (L − 1) and any point. It means that the margin λ = (L − 1) imposed by the reinforced multicategory hinge loss [28], though guaranteeing classification calibration, may slow down the convergence of the SMO algorithm, given that the optimizer is searching across different KKT regions. Proposition 3 and Figure 3 suggest that the margin of the positive points should be chosen somewhere in (−∞, −1) ∪ (L − 1, +∞), with enough clearance with respect to the tolerance T so as not to incur KKT instability problems.
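The recommendation above can be codified in a hypothetical helper; the function name and the clearance rule (at least the SMO tolerance away from the region boundaries) are our own illustrative choices, not part of the paper's implementation.

```python
def lambda_avoids_kkt_transitions(lam, n_classes, tol):
    # Hypothetical helper codifying Proposition 3 and Figure 3: choose
    # the positive-point margin lambda inside (-inf, -1) or
    # (L - 1, +inf), keeping a clearance of at least the SMO tolerance
    # tol from the boundaries where the KKT solution family changes.
    return lam < -1.0 - tol or lam > (n_classes - 1.0) + tol
```

For example, for L = 3 classes the reinforced-multicategory margin λ = L − 1 = 2 sits exactly on a transition boundary and is flagged as unsafe.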
The case λ ∈ (−∞, −1) corresponds to Lee et al.'s loss function [23]. However, values of λ ≫ (L − 1) provide the best training times, as shown in Figure 4. The advantage, in terms of training times, of strongly weighting the margin of the positive points will also be confirmed in the following section.

Experimental evaluation
The aim of this section is to conduct an empirical evaluation, in terms of classification accuracy and training times, of the λ-SVM model introduced in Section 5. We used four real-world datasets from the UCI data repository [24], described in Table 1. Some of these datasets involve real applications, such as classification on gas sensor arrays [32]. The base error was obtained by predicting the majority class in each dataset. In the Covtype dataset, a random selection of 50,000 points was performed. In the Abalone dataset, age bands were obtained by dividing age by 5. These datasets were chosen because they have a large number of training points compared to the dimensionality. This favors large values of the cost parameter C, which in turn can reveal differences between classification calibrated and uncalibrated loss functions, since the regularization term almost vanishes. Otherwise, under appropriate regularization, all SVM models are classification calibrated [36].
We generated five different partitions for each experiment. The first 90% of the samples was selected as the training set, and the remaining 10% constituted the test set. The training samples were used to build the λ-SVM model. We used a function with compact support as a kernel. Kernels with nonzero tails, such as the Gaussian kernel, can be detrimental in scenarios with a finite number of points and a very large C, since points that are significantly far from the point of interest can still have a notable contribution, especially when there are not enough points in its neighborhood. Specifically, we used the compactly supported kernel proposed and analyzed in [15,16,41].
The compactly supported kernel K_{D,ν} is obtained by multiplying the Gaussian kernel K(x, x′) by a truncation function φ_{D,ν}, where D > 0 and ν ≥ (M + 1)/2 (M is the number of features). This kernel preserves positive definiteness, as shown in [16]. The function φ_{D,ν} induces sparsity, since all entries satisfying ‖x − x′‖ ≥ D are set to zero in the kernel matrix. Therefore, the constant D is called the thresholding or truncation parameter, as it regulates the support size of the kernel K_{D,ν}. The parameter ν controls the degree of smoothness or differentiability of φ_{D,ν}. Different choices of D and ν produce different compactly supported kernels. When D → 0, K_{D,ν}(x, x′) evaluates as zero for every x ≠ x′ and is equal to 1 otherwise. When D → ∞, K_{D,ν}(x, x′) recovers the Gaussian kernel. Since the value of ν has no influence on the sparsity of the kernel, it is generally fixed at some value [41]. In this paper, we fixed ν = (M + 1)/2 in order to ensure positive definiteness. We normalized the parameter γ, which determines the Gaussian kernel width, by the number of features, and we defined D as a function of γ as D = (γ/M)^(-1) to reduce the number of parameters to adjust by cross validation. The intuition behind the definition of D is to maintain certain consistency between the Gaussian kernel width and the support size: the wider the Gaussian kernel, the larger the support size. The optimal cost parameter C and kernel width γ were those with the lowest error when performing 10-fold cross validation on the training set. The cost parameter took values in the grid {10^i | i = 0, 1, ..., 7}, and the kernel width γ was selected from the grid {10^i | i = -3, -2, ..., 3}. The test set was used to report a reliable estimation of the performance of the model. The algorithm used a tolerance level of T = 5 · 10^-2 to exit. We imposed a training time limit of 2,500 seconds for the Sensor, Pendigit and Abalone datasets, and a time limit of 4,000 seconds for the Covtype dataset. We used our C++ implementation of λ-SVMs, which is provided as Supplementary Material. The Matlab code for λ-SVMs can also be found in the Supplementary Material [33]. Table 2 shows the average classification errors and training times (in seconds) over the five test sets when different values of λ are considered in the loss function. Results correspond to the optimal cost parameter C (C opt.) and kernel width γ (γ opt.) determined by cross validation. The values of λ were chosen to have different classification calibration scenarios according to the analysis presented in Section 4. Recall that λ < −1 recovers the classification calibrated loss function originally proposed by Lee et al. [23], λ = 1 provides the ISVM loss function [19], and λ = (L − 1) is equivalent to the reinforced multicategory hinge loss [28].
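The compactly supported kernel described above can be sketched as follows. The specific truncation φ_{D,ν}(r) = max(0, 1 − r/D)^ν is an assumed form (the paper refers to [15,16,41] for the exact construction), chosen here because it reproduces the stated properties: the kernel vanishes for ‖x − x′‖ ≥ D and approaches the Gaussian kernel as D → ∞.

```python
import math

def truncated_gaussian_kernel(x, z, gamma, D, nu):
    # Sketch of a compactly supported kernel
    #   K_{D,nu}(x, z) = phi_{D,nu}(||x - z||) * exp(-gamma * ||x - z||^2),
    # with the ASSUMED truncation phi_{D,nu}(r) = max(0, 1 - r/D)**nu.
    # The kernel is exactly zero whenever ||x - z|| >= D, and it
    # approaches the plain Gaussian kernel as D grows.
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
    phi = max(0.0, 1.0 - r / D) ** nu
    return phi * math.exp(-gamma * r * r)
```

In the paper's setup one would take ν = (M + 1)/2 for M features and tie the truncation to the kernel width through D = (γ/M)^(-1).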
The minimum classification error is always achieved by a classification calibrated loss function with λ ≥ (L − 1). In general, classification errors for λ ≥ (L − 1) are either lower than or similar to those corresponding to classification uncalibrated scenarios, while training times are usually lower. For example, given the optimal C and γ for each value of λ in the Pendigit dataset, λ-SVMs
with large λ are at least 7 times faster than λ-SVMs with smaller values of λ. Classification rates for the other classification calibrated losses, corresponding to λ < −1, are competitive, but training is slower than for λ ≥ (L − 1). This emphasizes the importance of accounting for the margin of positive points in the loss function, in contrast to [28]. Moreover, the fact that training did not finish for the smallest values of λ in several datasets also corroborates the remarkable relevance of shifting weight onto the margin of the positive points. It should be noted that, in the Covtype dataset, the training times for λ ≫ (L − 1) are the highest, since the optimal cost parameter C is set to 10^7 in these cases; however, the lowest values of λ did not explore the complete C grid, given that they exceeded the training time limit. The following analysis of training times as a function of λ, for a given γ and C, will show the advantages of taking λ ≫ (L − 1) in terms of computational cost.
Figure 5 shows the average training times for different values of the cost parameter C as a function of λ. In order to have comparable training times, the mode of the optimal kernel parameter γ across all the cross validation runs is chosen for each dataset. Figure 5 shows that the training times for λ ≫ (L − 1) are significantly lower than those corresponding to loss functions with negative λ or λ in the interval ((L − 2)/2, (L − 1)). Differences are especially noticeable for the largest values of the cost parameter C. This result demonstrates the advantage of strongly overweighting the margin of the positive points, and makes the use of λ ≫ (L − 1) preferable to λ < −1 (Lee et al.'s loss function), λ = 1 (ISVM), or λ = (L − 1) (reinforced multicategory). Finally, the long training times observed for λ in the interval ((L − 2)/2, (L − 1)) and in the proximity of λ = (L − 1) are presumably due to the presence of different KKT regions, as analyzed in Section 5. Thus, avoiding values of λ in or close to the interval ((L − 2)/2, L − 1) is strongly recommended.
In short, our classification calibrated loss functions not only provide consistency guarantees that are directly reflected in the performance of the classification models, but they also provide excellent training times when the error of the positive points is significantly overweighted. A good value for λ should be large enough to strongly weight the margin of the positive points while staying safely away from the region where transitions between different families of solutions are possible. For example, setting λ = 100L seems an appropriate choice in terms of classification calibration and training times according to our experimental results. Nevertheless, the best value for λ should ideally be determined empirically for each dataset by cross validation.

Comparison with other multiclass-SVM implementations
The aim of this section is to compare our multiclass loss function, in terms of classification accuracy and computational times, with other multiclass SVM implementations and with loss functions other than those that can be treated as special cases of λ-SVM. In this section, we compare the λ-SVM solver with the MSVMpack package [21], an open source software package that implements the generic multiclass SVM formulation proposed by Guermeur [17]. MSVMpack uses a Quadratic Programming solver based on the Frank-Wolfe method [12], and each step of the descent is obtained by solving a linear program (LP) by means of the lp_solve solver [29]. MSVMpack implements four multiclass loss functions: Weston and Watkins [39], Crammer and Singer [9], Lee et al. [23], and Guermeur and Monfrini [18]. For more details about the MSVMpack package, the reader is referred to [21].
We followed the same experimental setup described in Section 6. We included the kernel with compact support in the MSVMpack implementation thanks to the flexibility of this software package to customize kernel functions. The Covtype dataset is not included in the comparison given its high computational cost. Both implementations, MSVMpack and λ-SVMs, were configured to run on one single processor in order to better control the computational times. Please note that our goal is not to compete with the excellent implementation provided by MSVMpack, but to provide insight into SVMs' multiclass loss functions in terms of classification calibration properties and computational cost. The results for MSVMpack on the Sensors, Pendigit, and Abalone datasets and the four loss functions (Weston and Watkins, Crammer and Singer, Lee et al., and Guermeur and Monfrini) are shown in Table 3. MSVMpack classification rates are similar to those obtained by λ-SVM. When the optimal λ is chosen in Table 2, λ-SVM classification rates are equal to or higher than those obtained by any of the loss functions implemented by MSVMpack. This means that using our loss function can only improve the classification accuracy. Regarding the training times, in general, λ-SVMs are faster for large values of λ while maintaining competitive classification accuracies. Since MSVMpack and λ-SVMs both implement Lee et al.'s loss function, the two implementations can be compared. For Lee et al.'s loss function, experimental results show that (i) MSVMpack and λ-SVM provide similar results; and (ii) training times are dataset-dependent: the MSVMpack implementation is faster than the λ-SVM implementation on the Pendigit and Abalone datasets, while λ-SVM is faster than MSVMpack on the Sensors dataset. Overall, these experimental results show the efficiency of the MSVMpack implementation, but they also reveal that there is still room for improvement in the loss function itself.

Conclusions
In this paper, we have proposed a family of multiclass hinge loss functions regulated by a parameter λ that controls the margin of the positive points of a given class. These surrogate loss functions, Ψ_y, exhibit different classification calibration properties as a function of λ. We have determined the values of λ for which the proposed loss functions are classification calibrated, and we have shown that our family of loss functions allows us to define a classification calibrated hinge loss function for every multiclass classification problem. Unlike other classification calibrated hinge loss functions, ours can give arbitrarily high weight to the margin of the positive points, which is empirically shown to be beneficial for learning. Our family of loss functions is general enough to recover the classification calibrated loss functions of Lee et al. [23] and Liu and Yuan [28] by setting λ ≤ −1 and λ = (L − 1), respectively, with L the number of classes. However, we show that other values of λ allow overcoming some limitations of previous approaches while maintaining classification calibration properties.
We have embedded our family of loss functions in the Support Vector Machine formalism (λ-SVM) and implemented a Sequential Minimal Optimization (SMO) algorithm. We have shown that the optimization algorithm has different convergence rates that can be explained in terms of the classification calibration domain and the different families of SVM solutions and KKT conditions defined by λ. In particular, values of λ ≫ (L − 1) provide the fastest convergence while guaranteeing classification calibration.
We have compared the performance of λ-SVMs on four real-world datasets to conclude that classification calibrated loss functions that account for the margin of positive points can only improve upon classification uncalibrated loss functions in terms of classification accuracy. Additionally, λ-SVMs with large values of λ exhibit the lowest training times, which matches our theoretical analysis of SMO's solutions. These results reveal the importance of strongly overweighting the positive samples in the learning process.
In conclusion, a sufficiently large value of λ guarantees classification calibration while taking maximum advantage of the positive examples and providing good convergence rates. Though the optimal value of λ should be determined in a validation phase, our theoretical and empirical results indicate that λ = 100L is a good choice. It not only ensures classification calibration, but it also provides good classification performance and training times.

Appendix A: Detailed proof of Theorem 2
This Appendix provides a detailed proof of Theorem 2. In what follows, we assume that the class probabilities {P_1, P_2, ..., P_L} for a point x are all different and ordered as P_1 > P_2 > ... > P_L, and let f_1, f_2, ..., f_L be the decision functions associated with these class probabilities. Before addressing the proof, let us formulate two properties of our loss function that make the classification calibration analysis more tractable.
The proof of Theorem 2 results from the combination of Lemmas 6-8.

Lemma 6. Given a multiclass classification problem with L classes, the λ-parametrized family of loss functions defined in Eqs. (4)-(5) is classification calibrated for λ < −1.
Proof. Firstly, we show that the minimizer f of Eq. (6) is lower bounded by −1 for λ < −1. The solution f_1 = L − 1 and f_2 = f_3 = ... = f_L = −1 is a feasible solution lower bounded by −1, and it evaluates as (1 − P_1)L in Eq. (6). Let f¹ be another solution with f¹_j < −1 for some j. A chain of inequalities for the objective function in Eq. (6) shows that any solution with f_j < −1 produces a larger Ψ-risk than the solution above and, thus, cannot be a minimizer. Therefore, in what follows, we only need to consider f with f_j ≥ −1 for all j = 1, ..., L. Imposing the sum-to-zero constraint, Σ_{l=1}^L f_l = 0, together with f_l ≥ −1 > λ, all the terms [λ − f_l]_+ in Eq. (6) vanish, and the problem is equivalent to that proposed by Lee et al., in which the positive examples of a class do not take part in the loss function [23]. This case has already been shown to be classification calibrated [23,26]. We include the proof for completeness' sake. For λ < −1, minimizing Eq. (6) is equivalent to the maximization stated in Problem (28). The Lagrangian of Problem (28) involves multipliers α_l and μ, and the maximizer must satisfy the Karush-Kuhn-Tucker (KKT) conditions [5]:
From the complementary slackness condition, we can ensure that either f_l = −1 (and P_l = μ − α_l) or α_l = 0 (and P_l = μ). Note that it is not possible to have f_l = −1 for all l = 1, 2, ..., L, since it violates the sum-to-zero condition, and only one f_l can have α_l = 0, since all class probabilities are assumed to be different. Taking into account that the probability associated with α_l = 0 is maximum, since the remaining probabilities are defined as P_i = μ − α_i with α_i ≥ 0, the optimal solution is f_1 = L − 1 (P_1 = μ) and f_m = −1 (P_m = μ − α_m) for m > 1. This solution is classification calibrated.

Lemma 7. Given a multiclass classification problem with L classes, the λ-parametrized family of loss functions defined in Eqs. (4)-(5) is classification calibrated for −1 ≤ λ < 0 and for (L − 2)/2 < λ ≤ L − 1, and classification uncalibrated for 0 ≤ λ ≤ (L − 2)/2.

Proof. For the time being, let us assume that the optimal decision functions are lower bounded by −1. We show that this assumption is correct at the end of the proof.
Then, we have the maximization stated in Problem (30), whose Lagrangian involves multipliers α_l and μ. On the one hand, the maximizer of Problem (30) must satisfy the KKT conditions for l ∈ A:
• Stationarity: ∂L/∂f_l = P_l + α_l − μ = 0 for all l ∈ A.
• Complementary slackness: α_l(f_l − λ) = 0 for all l ∈ A.
• Primal feasibility: f_l > λ for all l ∈ A, and Σ_{l∈A∪B} f_l = 0.
• Dual feasibility: α_l ≥ 0 for all l ∈ A, and μ ≥ 0.
From the complementary slackness condition, we can ensure that either f_l = λ or α_l = 0, and then P_l = μ − α_l or P_l = μ, respectively. From the primal feasibility condition, it is not possible to have f_l = λ. Additionally, if there exists f_l with α_l = 0, it must be unique, since it implies P_l = μ and all the class probabilities are different. In fact, the probability associated with α_l = 0 is maximum according to Property (5).
Figure 6 summarizes the analysis of the KKT conditions for the subsets A and B. Given a fixed value of λ, and according to the relationships between the decision functions and the class probabilities imposed by the KKT conditions, different configurations for A and B are possible.

CASE I: −1 ≤ λ < 0. In order to satisfy Σ_{l∈A∪B} f_l = 0, it is necessary that there exists a positive x (x > λ). The decision function f_l taking the value x has to be the one with the maximum class probability (Property (5)). It can be seen that the remaining decision functions take the value −1 for class probabilities lower than P_1(1 + λ)/(2 + λ), and they evaluate as λ otherwise. In case there exists P_l = P_1/2, its associated decision function f_l also takes the value λ, as it maximizes the objective function in Problem (30). Assuming that n decision functions are equal to −1 and (L − n − 1) decision functions are equal to λ, the value of x is imposed by the primal feasibility condition Σ_{l∈A∪B} f_l = 0: x = n − (L − n − 1)λ ≤ (L − 1). Therefore, our family of loss functions is classification calibrated for −1 ≤ λ < 0.

CASE II: λ = 0. Two configurations are possible in this case:
• CASE II.1. The problem is classification uncalibrated, since there is not a single maximum.
• CASE II.2. The decision functions associated with the n lowest class probabilities take the value −1, the decision function corresponding to the maximum probability takes the value x = n, and the remaining decision functions evaluate as λ = 0. Hence, the problem is classification calibrated.
In order to establish when the minimizer is characterized by CASE II.2, we need to determine when it is better for the objective function in Problem (30) to have a decision function taking the value −1 instead of λ = 0. By Property (5), it is sufficient to find out when the objective in Problem (30) is larger for f_L = −1 (and f_1 = 1) than for f_L = 0 (and f_1 = 0): 2P_L(−1) + P_1(1) > 2P_L(0) + P_1(0) ⇒ P_L < P_1/2.
When P_L = P_1/2, the objective function in (30) is 2y(P_1/2) − yP_1 = 0, and then y can take any value in the interval (−1, 0). Therefore, the loss function is classification calibrated when P_L < P_1/2 and classification uncalibrated otherwise. This implies that the loss function is classification uncalibrated for λ = 0, since there exists at least one probability distribution that makes the loss function classification uncalibrated.

CASE III: 0 < λ < L − 1. Two different cases should be analyzed, depending on whether the subset A is empty (CASE III.1) or not (CASE III.2). Note that it is not possible to have A ≠ ∅ and B = ∅, since the primal feasibility condition Σ_{l∈A∪B} f_l = 0 would not be satisfied.
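The λ = 0 criterion derived above (classification calibrated when P_L < P_1/2) can be checked numerically. The sketch below assumes the loss family of Eqs. (4)-(5) takes the hinge form Ψ_y(f) = [λ − f_y]_+ + Σ_{l≠y}[1 + f_l]_+, a reading that reproduces the risk value (1 − P_1)L used in the proof of Lemma 6; this form is our assumption, not a verbatim quotation of the paper's equations.

```python
def psi_risk(f, P, lam):
    # Conditional Psi-risk of Eq. (6) under the ASSUMED hinge form
    #   Psi_y(f) = [lam - f_y]_+ + sum_{l != y} [1 + f_l]_+ .
    # For lam < -1 it reproduces the value (1 - P_1) * L of the
    # Lemma 6 minimizer, which is the consistency check for the form.
    pos = lambda t: max(0.0, t)
    n = len(P)
    return sum(P[y] * (pos(lam - f[y])
                       + sum(pos(1.0 + f[l]) for l in range(n) if l != y))
               for y in range(n))

# lam = 0, L = 3: a candidate with a unique arg max versus the
# degenerate all-zeros candidate (no unique arg max).
f_unique = [1.0, 0.0, -1.0]
f_tied = [0.0, 0.0, 0.0]
P_cal = [0.6, 0.25, 0.15]    # P_L < P_1 / 2: unique-max candidate wins
P_uncal = [0.4, 0.35, 0.25]  # P_L > P_1 / 2: tied candidate wins
```

When P_L exceeds P_1/2, the risk prefers the tied vector, so the arg max rule cannot recover the Bayes class, which is exactly the uncalibration mechanism described in the proof.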
First of all, we analyze the distribution of the solutions of Problem (30) assuming CASE III.1. The loss function is classification calibrated when the minimizer has a single f_l = λ. For the primal feasibility condition Σ_{l∈A∪B} f_l = 0, it must be satisfied that (−1)(L − 2) + y + λ = 0, and, thus, y = (L − 2) − λ. Imposing −1 < y < λ, we get that y only exists for λ > (L − 2)/2; otherwise, more than one decision function needs to be equal to λ. Then, assuming A = ∅, the loss function is classification uncalibrated for 0 ≤ λ ≤ (L − 2)/2 and classification calibrated for (L − 2)/2 < λ ≤ (L − 1). Now, we analyze CASE III.2 by using the results from CASE III.1. CASE III.2 is always classification calibrated, as there always exists a single maximum f_1 = x > λ. Therefore, we need to determine when it is better for the objective function in Problem (30) to have f_1 in A (f_1 = x > λ) instead of having f_1 in B (f_1 ≤ λ). As there always exists a feasible solution for CASE III.1, making f_1 = x > λ is only possible when it actually maximizes the value of the objective function in Problem (30). Then, for (L − 2)/2 < λ ≤ (L − 1), our family of loss functions is classification calibrated, since both cases are classification calibrated in this interval. Without loss of generality, we assume that there is no class probability verifying P_l = P_1/2 and, thus, no decision function with value y_2. According to the sum-to-zero constraint, we have (L − 1 − n_A)(−1) + n_A λ + x = 0, which determines x as in Eq. (32). The values of the objective function in Problem (30) for CASE III.1 and CASE III.2 are given in Eq. (33) and Eq. (34), respectively. Therefore, the loss function for 0 ≤ λ ≤ (L − 2)/2 is classification uncalibrated if there exists a distribution of probabilities {P_i}_{i=1}^L for which Eq. (33) is larger than Eq. (34) for all n_A = 0, 1, ..., n_B − 1. Then, we need to impose the difference between Eq. (33) and Eq. (34) to be positive for all n_A. Replacing y and x according to the equalities in Eq. (31) and Eq. (32), respectively, we obtain a system of linear inequalities in the class probabilities. To simplify the notation, we define θ_{n_B+1} = 2(L − 1 − n_B(λ + 1) + 1). Note that the subset of values of λ satisfying θ_{n_B+1} = 0 is a measure-zero set, as it corresponds to values of λ such that (λ + 1) = αL for α ∈ Z⁺. Then, the loss function is classification uncalibrated if we can find a probability distribution {P_i}_{i=1}^L for which Eq. (35) holds for all n_A = 0, 1, ..., n_B − 1. From the last inequality of the system we obtain a necessary condition for a classification uncalibrated loss function; in fact, it is easy to see that r < 1/2 for λ > 0. Imposing P_1(λ + 1) − 2(λ + 1)P_i < 0 for all i = 2, ..., n_B, probability distributions {P_i}_{i=1}^L that make the loss function classification uncalibrated for all 0 < λ ≤ (L − 2)/2 can be constructed as P_i = a_i P_1 for i = 2, ..., L, where the coefficients a_i are any real numbers satisfying the above constraints. In particular, such a probability distribution is obtained by taking the limit of r when λ → 0 (r is a decreasing function w.r.t. λ), and the resulting distribution is classification uncalibrated for all 0 < λ ≤ (L − 2)/2. Note that in the limit this distribution is the same as the one obtained for λ = 0. To conclude, let us show that the minimizer of the empirical Ψ-risk is lower bounded by −1. The term in Eq.
(6) associated with class i can be written as a piecewise function g(f_i), and hence the Ψ-risk for a point x is expressed as the sum of such terms (see Figure 7). The monotonicity of g(f_i) in the interval [−1, λ] depends on the prior probability P_i, being monotonically increasing for P_i < 1/2 and monotonically decreasing otherwise. Note that the class probabilities {P_i}_{i=2}^L always satisfy P_i < 1/2; only the maximum probability can be larger than 1/2. The sum-to-zero constraint together with Property (5) force f_1 to be positive. This constraint also affects f_2, f_3, ..., f_L, which might take values larger than −1 if making f_2, f_3, ..., f_L equal to −1 has a negative impact on the minimizer due to the subsequent increase of f_1. In other words, if f_{i>1} has a value larger than −1 in the solution of Problem (30), it means that it is not beneficial for the minimizer to decrease the value of this decision function by paying the cost of increasing the value of other decision functions. If a decision function f_{j>1} has a value greater than −1, setting f_j < −1 is obviously worse than having f_j = −1, since it increases not only the contribution of its own class but also the contributions of some other classes, whose decision functions are forced to increase to fulfill Σ_{l=1}^L f_l = 0.
Note that an increase of f_1 could be beneficial for the minimizer only for f_i ∈ (−1, λ) when P_i > 1/2, since g(f_i) is monotonically decreasing in this domain. However, this case is impossible, since it only applies to the majority class and f_1 ≥ λ according to the preceding analysis summarized in Figure 6. Without imposing the decision functions to be lower bounded by −1, and assuming instead that they are lower bounded by ν < −1, it is also true that f_1 ≥ λ. The KKT conditions for f lower bounded by ν < −1 can be easily inferred from the above analysis. In this case, the decision functions are in the set {ν, y, λ, x}, with y unique and ν ≤ y ≤ λ < L − 1. Then, necessarily f_1 ≥ λ.

Lemma 8. Given a multiclass classification problem with L classes, the λ-parametrized family of loss functions defined in Eqs. (4)-(5) is classification calibrated for λ > L − 1.
Proof. Firstly, we show that the Ψ-risk minimizer f in Eq. (6) is upper bounded by λ for λ > L − 1. Let us assume that there exists a solution f¹ such that one decision function, f¹_j, is larger than λ. According to Property (5), this function has to be f¹_1. Then, we parametrize f¹ as f¹_1 = λ + ε with ε > 0, and f¹_m = −1 + ε_m with ε_m ∈ R for m > 1. As f¹ is a feasible solution, it must satisfy the sum-to-zero constraint. From the complementary slackness condition, we can differentiate four cases:
• CASE A: α_l ≠ 0 and β_l ≠ 0. This case is impossible, since it implies f_l = −1 and f_l = λ simultaneously.
• CASE B: α_l ≠ 0 and β_l = 0; then, f_l = −1 and P_l = (1 + μ − α_l)/2.
On the other hand, the maximizer of Problem (38) must satisfy the corresponding KKT conditions. From the complementary slackness condition, we can ensure that either f_l = −1, which is not possible for all l = 1, 2, ..., L given the sum-to-zero constraint, or γ_l = 0 (and P_l = μ). If there exists a decision function f_l with γ_l = 0 (f_l = z < −1), it must be unique, since all the prior probabilities are different, and it has to be the one associated with the lowest probability. In the first feasible scenario (CASE I: C = ∅ and D = ∅), the loss is classification calibrated.
• CASE II: C ≠ ∅ and D ≠ ∅. For the time being, we assume that there does not exist P_l such that P_l = (P_L + 1)/2 and, thus, no decision function such that f_l = y. Then, assuming that n decision functions take the value λ, the value of z is given by z = n(−λ − 1) + L − 1 in order to satisfy the primal constraint Σ_{l=1}^L f_l = 0. Imposing z < −1, we obtain n > L/(λ + 1), which is greater than zero for λ > L − 1, and, thus, at least one decision function takes the value λ. When only one decision function takes the value λ (n = 1), our loss function is classification calibrated. In fact, this is the case: it is easy to see that the difference between the Ψ-risk corresponding to the solution for n = 1 and any other solution for n > 1 is negative, and, thus, the solution for n > 1 cannot be a minimizer. Therefore, our loss functions are also classification calibrated in this case, and the minimizer is given by f_1 = λ, f_L = L − λ − 2, and f_m = −1 for 2 ≤ m ≤ L − 1.
The next question is to determine when the minimizer of our loss functions is defined by CASE I or by CASE II (n = 1). The difference in the Ψ-risk between CASE I and CASE II is (λ − L + 1)(2P_1 − 1 − P_L), which is positive when P_1 > (1 + P_L)/2. Then, the minimizer is defined by CASE I when P_1 < (1 + P_L)/2 and by CASE II otherwise.
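Under the same assumed hinge form of the loss used in our earlier numeric check, Ψ_y(f) = [λ − f_y]_+ + Σ_{l≠y}[1 + f_l]_+, the CASE I/CASE II threshold P_1 = (1 + P_L)/2 can be verified numerically for λ > L − 1 by comparing the Ψ-risks of the two candidate minimizers.

```python
def psi_risk(f, P, lam):
    # ASSUMED hinge form Psi_y(f) = [lam - f_y]_+ + sum_{l != y}[1 + f_l]_+ .
    pos = lambda t: max(0.0, t)
    n = len(P)
    return sum(P[y] * (pos(lam - f[y])
                       + sum(pos(1.0 + f[l]) for l in range(n) if l != y))
               for y in range(n))

def case_i(L):
    # CASE I candidate: f_1 = L - 1, f_m = -1 for m > 1.
    return [float(L - 1)] + [-1.0] * (L - 1)

def case_ii(L, lam):
    # CASE II candidate: f_1 = lam, f_m = -1 for 1 < m < L,
    # and f_L = L - lam - 2.
    return [lam] + [-1.0] * (L - 2) + [float(L) - lam - 2.0]
```

For L = 3 and λ = 3, the threshold (1 + P_L)/2 decides which candidate attains the lower risk, matching the criterion stated above.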
Finally, the subset of probabilities not considered in the preceding analysis, corresponding to the case in which there exists P_l such that P_l = (P_L + 1)/2, is also classification calibrated for λ > L − 1. Note that this case is only well-defined when P_1 > (P_L + 1)/2; otherwise, β_l must satisfy β_l < 0, which violates the dual feasibility condition. It can be seen that having the decision functions in either the subset {z, −1, y} or {z, −1, y, λ} does not improve the solution f_1 = λ, f_L = L − λ − 2, and f_m = −1 for 2 ≤ m ≤ L − 1 (CASE II).
Summing up, our family of loss functions is classification calibrated when λ > L − 1.
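As a closing sanity check, the Lemma 6 minimizer (f_1 = L − 1, f_m = −1) can be compared against a brute-force grid of sum-to-zero margin vectors for λ < −1. Again, the hinge form of the loss is an assumption on our part, chosen to be consistent with the risk value (1 − P_1)L computed in the proof.

```python
def psi_risk(f, P, lam):
    # ASSUMED hinge form Psi_y(f) = [lam - f_y]_+ + sum_{l != y}[1 + f_l]_+ ,
    # consistent with the risk value (1 - P_1) * L in the proof of Lemma 6.
    pos = lambda t: max(0.0, t)
    n = len(P)
    return sum(P[y] * (pos(lam - f[y])
                       + sum(pos(1.0 + f[l]) for l in range(n) if l != y))
               for y in range(n))

P = [0.5, 0.3, 0.2]            # P_1 > P_2 > P_3, as assumed in the Appendix
lam = -2.0                     # lam < -1: Lemma 6 regime
f_star = [2.0, -1.0, -1.0]     # claimed minimizer: f_1 = L - 1, f_m = -1
risk_star = psi_risk(f_star, P, lam)

# Brute-force search over a grid of sum-to-zero margin vectors
# (f_3 = -f_1 - f_2, so two free coordinates suffice for L = 3).
grid = [i * 0.25 for i in range(-4, 13)]   # values in [-1, 3]
risk_best = min(psi_risk([a, b, -a - b], P, lam)
                for a in grid for b in grid)
```

The grid minimum coincides with the candidate from the proof, and its value equals (1 − P_1)L, as stated in Lemma 6.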

Fig 1. Connections between different multiclass loss functions proposed in the literature and our family of variable margin loss functions.
and vectors f ∈ F are known as multicategory margin vectors [44].

Fig 3. Training times of the SMO algorithm trained with a single point x as a function of the λ-regions defined by the classification calibration domain and the optimal decision functions f_j(x) = Σ_{i=1}^N α_i^j y_i^j K(x_i, x).

Fig 5. λ-SVM training times (in seconds) as a function of λ for different values of the cost parameter C. Results for the mode of the kernel parameter γ with the lowest cross-validation error for each λ and C are shown. Lee et al. [23], Huerta et al. [19], and Liu and Yuan [28]'s loss functions are indicated as Lee, ISVM, and RML (Reinforced Multicategory Loss), respectively.

Fig 6. Relationship between the set of class probabilities {P_l}_{l=1}^L and the set of decision functions {f_l}_{l=1}^L for −1 ≤ λ ≤ L − 1, according to the KKT conditions of the Ψ-risk minimizer in Eq. (6). x and y are possible values of the decision function of a given class satisfying λ < x ≤ L − 1 and −1 < y < λ, respectively.

Fig 7. Contribution of class i to the empirical Ψ-risk of a single point as a function of the decision value f_i when the probability of the class is either (7a) lower than 1/2 or (7b) larger than 1/2.
Figure 8 summarizes the analysis of the KKT conditions for C and D. Let us analyze the two feasible scenarios. In CASE I (C = ∅ and D = ∅), the analysis of the solutions is equivalent to CASE III.1 in the proof of Lemma 7: the number of decision functions taking the value λ is given by n_B = n_C = ⌊L/(λ + 1)⌋, which is zero for λ > L − 1, and the minimizer in this case is f_1 = y = L − 1 and f_{m>1} = −1.

Fig 8. Relationship between the set of class probabilities {P_l}_{l=1}^L and the set of decision functions {f_l}_{l=1}^L for λ > L − 1, according to the KKT conditions of the Ψ-risk minimizer in Eq. (6). y and z are possible values of the decision function of a given class satisfying −1 < y < λ and z < −1, respectively.

Table 1. Datasets used to evaluate λ-SVMs.

Table 2. λ-SVMs classification error rates and training times. L denotes the number of classes in the dataset.