Classification with Minimax Fast Rates for Classes of Bayes Rules with Sparse Representation

We construct a classifier which attains the rate of convergence $\log n/n$ under sparsity and margin assumptions. To obtain this result we use an approach close to the one used in approximation theory for function estimation. The idea is to expand the Bayes rule in a fundamental system of $L^2([0,1]^d)$ made of indicators of dyadic sets and to assume that the coefficients, equal to $-1$, $0$ or $1$, belong to a kind of $L^1$-ball. This assumption can be seen as a sparsity assumption, in the sense that the proportion of nonzero coefficients decreases as the "frequency" grows. Finally, rates of convergence are obtained by using the usual trade-off between a bias term and a variance term.


Introduction
Consider a measurable space $(\mathcal{X}, \mathcal{A})$ and a probability measure $\pi$ on $\mathcal{X} \times \{-1, 1\}$.
Denote by $D_n = (X_i, Y_i)_{1 \le i \le n}$ $n$ observations of $(X, Y)$, a random variable with values in $\mathcal{X} \times \{-1, 1\}$ distributed according to $\pi$. We want to construct measurable functions which associate a label $y \in \{-1, 1\}$ to each point $x$ of $\mathcal{X}$; such functions are called prediction rules. The quality of a prediction rule $f$ is given by the value $R(f) = P(f(X) \neq Y)$, called the misclassification error of $f$. It is well known (e.g. Devroye et al. [1996]) that there exists an optimal prediction rule which attains the minimum of $R$ over all measurable functions with values in $\{-1, 1\}$. It is called the Bayes rule and is defined by $f^*(x) = \mathrm{sign}(2\eta(x) - 1)$, where $\eta$ is the conditional probability function of $Y = 1$ given $X$, defined by $\eta(x) = P(Y = 1 \mid X = x)$.
The value $R^* = R(f^*) = \min_f R(f)$ is known as the Bayes risk. The aim of classification is to construct a prediction rule, using the observations $D_n$, which has a risk as close to $R^*$ as possible. Such a construction is called a classifier. The performance of a classifier $\hat f_n$ is measured by the value $E_\pi[R(\hat f_n) - R^*]$, called the excess risk of $\hat f_n$. In this case $R(\hat f_n) = P(\hat f_n(X) \neq Y \mid D_n)$ and $E_\pi$ denotes the expectation w.r.t. $D_n$ when the probability distribution of $(X_i, Y_i)$ is $\pi$ for any $i = 1, \ldots, n$. We say that a classifier $\hat f_n$ learns with the convergence rate $\phi(n)$, where $(\phi(n))_{n \in \mathbb{N}}$ is a decreasing sequence, if an absolute constant $C > 0$ exists such that for any integer $n$, $E_\pi[R(\hat f_n) - R^*] \le C\phi(n)$.
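To make these definitions concrete, here is a minimal sketch that computes the Bayes rule, the Bayes risk and an excess risk exactly for a toy distribution supported on four points. The distribution (`px`, `eta`), the helper names and the convention sign(0) = +1 are assumptions made only for this illustration.

```python
def bayes_rule(eta):
    """f*(x) = sign(2*eta(x) - 1), with the assumed convention sign(0) = +1."""
    return [1 if 2 * e - 1 >= 0 else -1 for e in eta]


def risk(f, px, eta):
    """R(f) = P(f(X) != Y) when X lives on a finite set.

    px[i] is P(X = x_i) and eta[i] is P(Y = 1 | X = x_i).
    Given X = x_i, the rule errs with probability 1 - eta[i] if it predicts +1
    and with probability eta[i] if it predicts -1.
    """
    return sum(p * ((1 - e) if label == 1 else e)
               for label, p, e in zip(f, px, eta))


# Hypothetical toy distribution on four points.
px = [0.25, 0.25, 0.25, 0.25]
eta = [0.9, 0.2, 0.6, 0.5]

f_star = bayes_rule(eta)                      # Bayes rule
bayes_risk = risk(f_star, px, eta)            # R* = R(f*)
f_other = [1, 1, 1, 1]                        # a competing prediction rule
excess = risk(f_other, px, eta) - bayes_risk  # R(f) - R*
```

In this toy case the competing rule differs from the Bayes rule only at the second point, so its excess risk is the mass of that point times $|2\eta - 1|$ there.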
We introduce a loss function on the set of all prediction rules: $d_\pi(f, g) = |R(f) - R(g)|$. This loss is a semi-distance (it is symmetric, satisfies the triangle inequality and $d_\pi(f, f) = 0$). For all classifiers $\hat f_n$, it is linked to the excess risk by $E_\pi[R(\hat f_n) - R^*] = E_\pi[d_\pi(\hat f_n, f^*)]$, where the RHS is the risk of $\hat f_n$ associated to the loss $d_\pi$. In classification we can consider three estimation problems. The first one is estimation of the Bayes rule $f^*$, the second one is estimation of the conditional probability function $\eta$ and the last one is estimation of the probability $\pi$. Usually, estimation of $\eta$ involves smoothness assumptions on the conditional function $\eta$. However, global smoothness assumptions on $\eta$ are somewhat too restrictive for the estimation of $f^*$, since the behavior of $\eta$ away from the decision boundary $\{x \in \mathcal{X} : \eta(x) = 1/2\}$ may have no effect on the estimation of $f^*$.
In this paper we deal directly with the estimation of $f^*$. In this case, however, the main difficulty of the classification problem is the dependence of the loss $d_\pi$ on $\pi$ (usually, a loss free from $\pi$ which upper bounds $d_\pi$ is used to obtain rates of convergence).
Moreover, using the loss $d_\pi$, we do not have the usual bias/variance trade-off, unlike many other estimation problems. This is due to the fact that we do not have an approximation theory in classification for the loss $d_\pi$. This gap comes from the fact that $d_\pi$ depends on $\pi$; thus, this theory has to be uniform in $\pi$. We need approximation results of the form: where $P_X$ is the marginal distribution of $\pi$ on $\mathcal{X}$, $f^* = \mathrm{sign}(2\eta - 1)$, $\mathcal{P}$ is a set of probability measures on $\mathcal{X} \times \{-1, 1\}$ and the family of classes of prediction rules $(F_\epsilon)_{\epsilon > 0}$ is decreasing ($F_\epsilon \subset F_{\epsilon'}$ if $\epsilon' < \epsilon$) and $F_\epsilon$ is less complex than $\{f^* : \pi \in \mathcal{P}\}$; in fact we expect $F_\epsilon$ to be parametric. Similar results appear in the density estimation literature, where, for instance, $\mathcal{P}$ is replaced by the set of all probability measures with a density with respect to the Lebesgue measure lying in an $L^1$-ball and $F_\epsilon$ is replaced by the set of all functions with a finite number (depending on $\epsilon$) of nonzero coefficients in the decomposition in the chosen orthogonal basis. But approximation theory in density estimation does not depend on the underlying probability measure, since the loss functions used there are generally independent of the underlying statistical problem. In this paper, we deal directly with the estimation of the Bayes rule and obtain convergence results w.r.t. the loss $d_\pi$ by using an approximation approach of the Bayes rules w.r.t. $d_\pi$. Theorems in Section 7 of Devroye et al. [1996] show that no classifier can learn with a given convergence rate for an arbitrary underlying probability distribution $\pi$. Thus, an assumption on $f^*$ has to be made to obtain convergence rates. In this paper, the assumption on $f^*$ is close to the one met in density estimation when we assume that the underlying density belongs to an $L^1$-ball.
Usually, a model (a set of measurable functions with values in $\{-1, 1\}$) is considered and we assume that the Bayes rule belongs to this model. In this case the bias is equal to zero and no bound on the approximation term is considered. In Blanchard et al. [2003], the question of the control of the approximation error for a class of models in the boosting framework is raised. In that paper, it is assumed that the Bayes rule belongs to the model and the nature of the distributions satisfying such a condition is explored.
Another related work is Lugosi and Vayatis [2004], where, under general conditions, it can be guaranteed that the approximation error converges to zero for some specific models. In the present paper, the bias term is not taken equal to zero and convergence rates for the approximation error are obtained depending on the complexity of the considered model (cf. Theorem 2).
We consider the classification problem on $\mathcal{X} = [0, 1]^d$. All the results can be generalized to a given compact subset of $\mathbb{R}^d$. As in many other works on the classification problem, an upper bound for the loss $d_\pi$ is used, but in our case we still work directly with the estimation of $f^*$. For a prediction rule $f$ we have $d_\pi(f, f^*) = E\big[|2\eta(X) - 1|\, 1I_{f(X) \neq f^*(X)}\big] \le P_X(f \neq f^*)$. In order to get a distribution-free loss function, we assume that the following assumption holds:
(A1) The marginal $P_X$ is absolutely continuous w.r.t. the Lebesgue measure $\lambda_d$ and $a \le \frac{dP_X}{d\lambda_d}(x) \le A$ for all $x \in [0, 1]^d$, where $0 < a \le 1 \le A < +\infty$.
This is a technical assumption used to control the $P_X$ measure of some subsets of $[0, 1]^d$. In recent years some assumptions have been introduced to measure the statistical quality of classification problems. The behavior of the regression function $\eta$ near the level $1/2$ is a key point of the quality of a classification problem (cf. e.g. Tsybakov [2004]). In fact, the closer $\eta$ is to $1/2$, the more difficult the classification problem is; nevertheless, when $\eta \equiv 1/2$ the classification is trivial since all prediction rules are Bayes rules. Here, we measure the quality of the classification problem through the following assumption introduced by Massart and Nédélec [2003]:
Strong Margin Assumption (SMA): There exists an absolute constant $0 < h \le 1$ such that $|2\eta(x) - 1| \ge h$ for all $x \in [0, 1]^d$.
Under assumptions (A1) and (SMA) we have $\frac{ah}{2} \|f - f^*\|_{L^1(\lambda_d)} \le d_\pi(f, f^*) \le \frac{A}{2} \|f - f^*\|_{L^1(\lambda_d)}$. Thus, estimation of $f^*$ w.r.t. the loss $d_\pi$ is the same as estimation w.r.t. the $L^1(\lambda_d)$-norm, where $\lambda_d$ is the Lebesgue measure on $[0, 1]^d$.
The paper is organized as follows. In the next section we propose a representation for functions with values in $\{-1, 1\}$ in a fundamental system of $L^2([0, 1]^d)$. The third section is devoted to approximation and estimation of Bayes rules having a sparse representation in this system. In the fourth section we discuss this approach. Proofs are given in the last section.
2 Classes of Bayes Rules with Sparse Representation
Theorem 2 of Subsection 3.1 deals with the approximation of the Bayes rules when we assume that $f^*$ belongs to a kind of "$L^1$-ball" for functions with values in $\{-1, 1\}$.
The idea is to expand $f^*$ in a fundamental system of $L^2([0, 1]^d, P_X)$ (that is, a countable family of functions such that the set of all finite linear combinations is dense in $L^2([0, 1]^d, P_X)$) inherited from the Haar basis and to control the number of nonzero coefficients. In this paper we only consider the case where $P_X$ satisfies (A1). The study can be extended to a more general case by taking another partition of $[0, 1]^d$ adapted to $P_X$.
First we construct such a fundamental system.We consider a sequence of partitions of X = [0, 1] d by setting for any integer j, where k is the multi-index and for any integer j and any k ∈ {1, . . ., 2 j − 1}, We consider the family S = φ where 1I A denotes the indicator of a set A. Set S is a fundamental system of L 2 ([0, 1] d , P X ).This is the class of indicators of the dyadic sets of [0, 1] d .
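As a minimal sketch of the dyadic system, the code below maps a point of $[0, 1]^d$ to the multi-index of the dyadic cell of frequency $j$ that contains it, and evaluates the corresponding indicator. The function names are hypothetical, and the half-open convention for the cells (with the last cell in each coordinate closed at 1) is an assumption, since the exact definition of the intervals is not legible in the text above.

```python
def dyadic_index(x, j):
    """Multi-index k = (k_1, ..., k_d) of the level-j dyadic cell containing x.

    Assumed convention: in each coordinate the cells are [k/2^j, (k+1)/2^j),
    except the last one, which is closed so that the point 1 is covered.
    """
    n = 2 ** j
    return tuple(min(int(t * n), n - 1) for t in x)


def phi(x, j, k):
    """Indicator of the dyadic cell with multi-index k at level j, evaluated at x."""
    return 1 if dyadic_index(x, j) == tuple(k) else 0


# At a fixed level the cells form a partition: exactly one indicator equals 1.
x = (0.3, 0.8)
level = 2
active = [(k1, k2) for k1 in range(4) for k2 in range(4) if phi(x, level, (k1, k2))]
```

Here `active` contains the single multi-index of the level-2 cell holding `x`.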
We consider the class of functions f defined P X − a.s.from [0, 1] d to {−1, 1} which can be written in this system by k , P X − a.s., where a where, for any point x ∈ [0, 1] d , the right hand side applied in x is a finite sum.
Denote this class by $F^{(d)}$. In what follows, we use the vocabulary of the wavelet literature. The index "$j$" of $a_k^{(j)}$ is called the "level of frequency". Since $S$ is not an orthogonal basis of $L^2([0, 1]^d, P_X)$, the expansion of $f$ w.r.t. this system is not unique. Therefore, to avoid any ambiguity, we define a unique writing for any mapping $f$ in $F^{(d)}$ by taking $a_k^{(j)} \in \{-1, 1\}$ with preference for low frequencies whenever possible. Roughly speaking, for f ∈ F (d) , denoted by k ′ = 0 for all J > J 0 and k ′ ∈ I d (J) satisfying φ k ′ = 0. We can describe a mapping $f \in F^{(d)}$ satisfying this convention by using a tree. Each node corresponds to a coefficient $A_k^{(J)}$. The root is $A_{0,\ldots,0}^{(0)}$. If a node, describing the coefficient $A_k^{(J)}$, is equal to $1$ or $-1$ then it has no branches; otherwise it has $2^d$ branches, corresponding to the $2^d$ coefficients at the following frequency. In the end, all the leaves of the tree are equal to $1$ or $-1$, and the depth of a leaf is the frequency of the associated coefficient. The writing convention says that a node cannot have all its leaves equal to $1$ (or all equal to $-1$): in this case we write the mapping by putting a $1$ (or $-1$) at the node. In what follows we say that a function $f \in F^{(d)}$ satisfies the writing convention (W) when $f$ is written in $S$ using the writing convention described in this paragraph. Note that this writing convention is not an assumption on the function, since every $f \in F^{(d)}$ can be written using this convention.
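The tree description can be sketched in code for the one-dimensional case ($d = 1$, so each internal node has $2^d = 2$ children). The sketch below assumes the function is given by its values on the cells of some finest level $J$; the data layout (nested tuples) and the function names are hypothetical choices for illustration.

```python
def build_tree(values):
    """Tree of a {-1, 1}-valued function on [0, 1], constant on level-J dyadic cells.

    `values` lists the 2^J cell values from left to right. A node is the integer
    1 or -1 (a leaf) when the function is constant on the node's cell; otherwise
    it is the tuple (0, left_child, right_child).
    """
    if all(v == values[0] for v in values):
        # Constant on the whole cell: put the value at the lowest possible
        # frequency, as required by the writing convention (W).
        return values[0]
    half = len(values) // 2
    return (0, build_tree(values[:half]), build_tree(values[half:]))


def nonzero_per_level(tree, level=0, counts=None):
    """Map each frequency level j to the number of nonzero coefficients at level j."""
    if counts is None:
        counts = {}
    if isinstance(tree, tuple):
        for child in tree[1:]:
            nonzero_per_level(child, level + 1, counts)
    else:
        counts[level] = counts.get(level, 0) + 1
    return counts


tree = build_tree([1, 1, -1, 1, -1, -1, -1, -1])
counts = nonzero_per_level(tree)
# The cells carrying nonzero coefficients tile [0, 1]: their lengths sum to 1.
total_mass = sum(c * 2.0 ** (-j) for j, c in counts.items())
```

The dictionary `counts` is the kind of per-level count of nonzero coefficients that the sparsity assumptions of this section constrain.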
The representation of Bayes rules by dyadic decision trees has been explored by Nowak and Scott [2004].
Is it possible to write every measurable function from $[0, 1]^d$ to $\{-1, 1\}$ in the fundamental system $S$ using coefficients with values in $\{-1, 0, 1\}$? Since the family of sets $(\mathring{I}_k^{(j)} : j \in \mathbb{N}, k \in I_d(j))$, where $\mathring{A}$ denotes the interior of $A$, is a basis of open subsets of $[0, 1]^d$, this question is equivalent to the following one: "Given a Borel subset $A$ of $[0, 1]^d$, is it possible to find an open subset $O$ of $[0, 1]^d$ such that the symmetric difference between $A$ and $O$ has Lebesgue measure $0$?" Unfortunately, the answer to this last question is negative. There exists a closed Borel set $F \subset [0, 1]^d$ with empty interior and positive Lebesgue measure $\lambda_d(F) > 0$. For example, in the one-dimensional case, the following algorithm yields such a set. Take $(l_k)_{k \ge 1}$ a sequence of numbers defined by $l_k = 1/2 - 1/(k + 1)^2$ for any integer $k$. Denote by $F_0$ the interval $[0, 1]$ and construct a sequence of closed sets $(F_k)_{k \ge 0}$ as in the following picture.
It is easy to check that F = ∩ k≥0 F k is closed, with an empty interior and a positive Lebesgue measure.For the d-dimensional case, the set Hence, one can easily check that for any measurable function f from [0, 1] d to {−1, 1} and any ǫ > 0, there exists a function g ∈ F (d) such that the set of all measurable functions from [0, 1] d to {−1, 1}.Now, we exhibit some usual prediction rules which belong to F (d) .
where λ d is the Lebesgue measure on [0, 1] d and A∆O is the symmetrical difference.
Now, we define a model for the Bayes rule by taking a subset of F (d) .For all functions w defined on N and with values in R + , we consider w , the model for Bayes rules, made of all prediction rules f which can be written, using the previous writing convention (W), by where a Proposition 1.Let w be a mapping from N to R + such that w(0) ≥ 1.The two following assertions are equivalent:

The class F
And if w is too large then the approximation by a parametric model will be impossible, that is why we give a particular look on the class of function introduced in the following Definition 2.
then we say that F w is a L 1 −ball of prediction rules.
Remark 1. We say that $F_w^{(d)}$ is an "$L^1$-ball" for a function $w$ satisfying (3), because the sequence $(\lfloor w(j) \rfloor)_{j \in \mathbb{N}}$ belongs to an $L^1$-ball of $\mathbb{N}^{\mathbb{N}}$, with radius $(2^{dj})_{j \in \mathbb{N}}$. Moreover, Definition 2 can be linked to the definition of an $L^1$-ball for real-valued functions, since we have a kind of basis, given by $S$, and a control on the coefficients which increases with the frequency. The control on the coefficients given by (3) is close to the one for the coefficients of a real-valued function in an $L^1$-ball, since it deals with the quality of approximation of the class $F_w^{(d)}$ by a parametric model.
w , the distribution of the nonzero coefficients in the decomposition of $f$ at a given frequency becomes sparse as the frequency grows. That is the reason why $F_w^{(d)}$ can be called a sparse class of prediction rules. For example, if $(\lfloor w(j) \rfloor / 2^{dj})_{j \ge 1}$ decreases and (3) holds, then the number of nonzero coefficients at the frequency $j$ is smaller than a proportion $j^{-1}$ of the maximal number of coefficients (that is, $2^{dj}$).
Remark 3. If we assume that P X is known then we can work with any measurable space X endowed with a Lebesgue measure λ, while assuming that P X << λ.In this case, we take  d) .We consider the writing of f in the fundamental system introduce in Section 3.1 with writing convention (W): is a low oscillating block of f when f has exactly 2 d − 1 coefficients, in this block, non equal to zero at each level of frequencies greater than J + 1.In this case we say that f has a low oscillating block of frequency J.
Remark that, if f has an oscillating block of frequency J, then f has an oscillating block of frequency J ′ , for all J ′ ≥ J.The function class F is "minimal".
Nevertheless, the following proposition shows that F (d) 0 is a rich class of prediction rules from a combinatorial point of view.We recall some quantities which measure a combinatorial richness of a class of prediction rules.For any class F of prediction rules from X to {−1, 1}, we consider 2 j+1 , 1 2 j+1 , . . ., 1 2 j+1 , for any j ∈ N. Thus, for any integer m, we have N(F has an infinite V C-dimension.

Thus every class
w ′ ), which is the case for the following classes.Now, we introduce some examples of L 1 −ball of Bayes rules.We denote by F w of prediction rules where w is equal to the function This class is called the truncated class of level K.
We consider exponential classes.These sets of prediction rules are denoted by F where Remark 4. For the one-dimensional case, an other point of view is to consider where a k (x)dx for any j ∈ N and k = 0, . . ., 2 j −1.For the control of the bias term we assume that the family of coefficients (a belongs to a L 1 −ball.But this point of view leads to analysis and estimation issues. First problem: Which functions with values in {−1, 1} have wavelet coefficients in a L 1 −ball and which wavelet basis is more adapted to our problem (maybe the Haar basis)?Second problem: Which kind of estimators could be used for the estimation of these coefficients?As we can see, the main problem is that there is no approximation theory for functions with values in {−1, 1}.We do not know how to approach, in L 2 ([0, 1]), measurable functions with values in {−1, 1} by "parametric" functions with values in {−1, 1}.Methods developed in this paper may be seen as a first step in this field.We can generalize this approach to functions with values in Z. Remark that when functions take values in R, that is for the regression problem, usual approximation theory is used to obtain a control on the bias term.
Remark 5. Other sets of prediction rules are described by the classes w where w is from N to R + and satisfies where (a j ) j≥1 is an increasing sequence of positive numbers.
3 Rates of Convergence over F Theorem 2 (Approximation Theorem).Let F (d) w be a L 1 −ball of prediction rules.We have: where f * is the Bayes rule associated to π.For example, J ǫ can be the smallest integer J satisfying +∞ j=J+1 2 −dj ⌊w(j)⌋ < ǫ/A.Remark 6.No assumption on the quality of the classification problem, like an assumption on the margin, is needed to state Theorem 2. Only assumption on the "number of oscillations" of f * is used.Theorem 2 deals with approximation of func- w by functions with values in {−1, 1} and no estimation issues are met.
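The rank $J_\epsilon$ mentioned after Theorem 2 (the smallest integer $J$ with $\sum_{j \ge J+1} 2^{-dj} \lfloor w(j) \rfloor < \epsilon / A$) can be computed numerically. The sketch below is only illustrative: the infinite sum is truncated at a hypothetical level `j_max`, which gives a lower bound on the true tail, and the weight sequence `w` used in the example is an assumed geometric sequence, not one of the classes defined in the paper.

```python
import math


def tail_mass(w, d, J, j_max=60):
    """Truncated tail sum_{j=J+1}^{j_max} 2^(-d*j) * floor(w(j))."""
    return sum(math.floor(w(j)) / 2 ** (d * j) for j in range(J + 1, j_max + 1))


def smallest_rank(w, d, eps, A, j_max=60):
    """Smallest J <= j_max whose truncated tail is below eps / A."""
    J = 0
    while J < j_max and tail_mass(w, d, J, j_max) >= eps / A:
        J += 1
    return J


# Hypothetical weights growing like 2^(d*alpha*j), slower than 2^(d*j).
d, alpha = 2, 0.5
w = lambda j: 2 ** (d * alpha * j)
J_eps = smallest_rank(w, d, eps=0.01, A=2.0)
```

With these assumed weights each term of the tail equals $2^{-j}$, so the tail after rank $J$ is essentially $2^{-J}$ and the returned rank grows like $\log_2(A/\epsilon)$.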
Remark 7. Theorem 2 is the first step to prove an estimation theorem using a trade-off between a bias term and a variance term.We write Since f ǫ belongs to a parametric model we expect to have a control of the variance term, E π d π ( fn , f ǫ ) , depending on the dimension of the parametric model which is linked to the quality of the approximation in the bias term.
the smaller the bias is. In particular, the bias is equal to zero when $\eta = 1/2$ (in this case any prediction rule is a Bayes rule). Thus, the more difficult the estimation problem is (that is, for an underlying probability measure $\pi = (P_X, \eta)$ with $\eta$ close to $1/2$), the smaller the bias is. This behavior does not appear clearly in density estimation.

Estimation Result
We consider the following class of estimators indexed by the frequency rank $J \in \mathbb{N}$: where coefficients are defined by and card i : To obtain a good control of the variance term, we need to ensure a good quality of the estimation problem. Therefore, the estimation results of Theorem 3 are obtained under the (SMA) assumption. In recent years it has been understood that the (SMA) assumption can lead to fast rates but is not enough to ensure any rate of convergence (cf. Corollary 1 at the end of Section 3.3); thus we have to define a model for $\eta$ or $f^*$. Here we use an $L^1$-ball of prediction rules as a model for $f^*$.
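An estimator of this type, piecewise constant on the dyadic cells of frequency $J$, can be sketched as a majority vote inside each cell. The exact definition of the coefficients is not legible above, so the rule below (compare the number of labels $+1$ and $-1$ in the cell, and fall back to a default label for ties and empty cells) is an assumption, as are all function names.

```python
def cell_index(x, J):
    """Multi-index of the level-J dyadic cell containing x (last cell closed at 1)."""
    n = 2 ** J
    return tuple(min(int(t * n), n - 1) for t in x)


def fit_dyadic_classifier(X, Y, J, default=1):
    """Return predict(x): majority label among the observations falling in the cell of x."""
    votes = {}
    for x, y in zip(X, Y):
        k = cell_index(x, J)
        # Running sum of labels = (number of +1) - (number of -1) in the cell.
        votes[k] = votes.get(k, 0) + y

    def predict(x):
        s = votes.get(cell_index(x, J), 0)
        if s > 0:
            return 1
        if s < 0:
            return -1
        return default  # assumed rule for ties and for cells with no observation

    return predict


# Small hypothetical sample on [0, 1].
X = [(0.1,), (0.2,), (0.3,), (0.7,), (0.8,)]
Y = [1, 1, -1, -1, -1]
predict_coarse = fit_dyadic_classifier(X, Y, J=1)
predict_fine = fit_dyadic_classifier(X, Y, J=2)
```

On this sample the coarse rule ($J = 1$) labels the whole left half $+1$, while the finer rule ($J = 2$) separates the cell containing $0.3$; increasing $J$ lowers the bias but leaves fewer observations per cell.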
Remark 9. The upper bound can be split into the bias term, $\epsilon$, and the variance term, $A\epsilon + \exp\big(-na(1 - \exp(-h^2/2))\, 2^{-dJ_\epsilon}\big)$. Note that a bias term appears in the variance term.

Optimality
This section is devoted to the optimality, in a minimax sense, of estimation in classification models such that $f^* \in F_w^{(d)}$. Let $0 < h < 1$, $0 < a \le 1 \le A < +\infty$ and $w$ a mapping from $\mathbb{N}$ to $\mathbb{R}_+$. We denote by $\mathcal{P}_{w,h,a,A}$ the set of all probability measures $\pi = (P_X, \eta)$ on $[0, 1]^d \times \{-1, 1\}$ such that
1. The marginal $P_X$ satisfies (A1).
3. The Bayes rule $f^*$, associated to $\pi$, belongs to $F_w^{(d)}$.
We use the version of Assouad's Lemma given in the appendix of Lecué [2006c] to lower bound the minimax risk on $\mathcal{P}_{w,h,a,A}$. From Theorem 3 and Theorem 4, we can deduce the optimality (up to a logarithmic term) of the estimator $\hat f_n^{(J_n)}$, where the rank $J_n$ is obtained by an optimal trade-off between the bias term and the variance term.
Theorem 4. Let w be a function from N to R + such that (i) ⌊w(0)⌋ ≥ 1 and ∀j ≥ 1, ⌊w(j)⌋ ≥ 2 d − 1 We have for all n ∈ N, , Remark 10.For a function w satisfying assumptions of Theorem 4 and under (SMA), we can not expect a convergence rate faster than 1/n, which is the usual lower bound for the classification problem under (SMA).
From the previous theorem we immediately obtain Theorem 7.1 of Devroye et al. [1996]. We denote by $\mathcal{P}_1$ the class of all probability measures on $[0, 1]^d \times \{-1, 1\}$ such that the marginal distribution $P_X$ is $\lambda_d$ (the Lebesgue probability distribution on $[0, 1]^d$) and (SMA) is satisfied with the margin $h = 1$. The case "$h = 1$" is equivalent to $R^* = 0$, that is, to a perfect classification problem, where $Y$ is an exact function of $X$ given by Y = f * (X) = η(X).
Corollary 1. For any integer n we have It means that no classifier can achieve a rate of convergence in the classification models $\mathcal{P}_1$, even if these classification problems are all very good ($Y$ is given by $f^*(X)$ without any noise and there is no region of low probability).

Rules
In this section we apply the results stated in Theorem 3 and Theorem 4 to the different functions $w$ introduced at the end of Section 2. We give rates of convergence and lower bounds for these models. Using notations introduced in Section 2 and Subsection 3.3, we consider the following models. For w = w Theorem 5. For the truncated class where $C_{K,h,a,A} > 0$ depends only on $K, h, a, A$ and, for the lower bound, there exists $C_{0,K,h,a,A} > 0$ depending only on $K, h, a, A$ such that, for all $n \in \mathbb{N}$, For the exponential class α where $0 < \alpha < 1$, we have for any integer $n$ where $C'_{\alpha,h,a,A} > 0$ and, for the lower bound, there exists $C'_{0,\alpha,h,a,A} > 0$ depending only on $\alpha, h, a, A$ such that, for all $n \in \mathbb{N}$, In both classes, the order of $J_n$ is $\lceil \log(an/(2^d \log n)) / (d \log 2) \rceil$, up to a multiplicative constant.
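To get a feeling for the frequency rank $J_n$, the following sketch evaluates $\lceil \log(an/(2^d \log n)) / (d \log 2) \rceil$ and the resulting number of cells $2^{dJ_n}$. The parenthesization of the formula is read from the garbled source and is an assumption, as are the function name and the clipping at $0$ for very small $n$.

```python
import math


def frequency_rank(n, d, a=1.0):
    """J_n = ceil( log(a*n / (2^d * log n)) / (d * log 2) ), clipped at 0 (requires n >= 2)."""
    value = math.log(a * n / (2 ** d * math.log(n))) / (d * math.log(2))
    return max(0, math.ceil(value))


n, d = 10_000, 2
J_n = frequency_rank(n, d)
n_cells = 2 ** (d * J_n)  # number of coefficients to estimate at rank J_n
```

By construction, whenever the rank is positive, the number of cells lies between $an/(2^d \log n)$ and $an/\log n$, so each cell receives on the order of $\log n$ observations on average.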
A remarkable point is that the class $F_K^{(d)}$ has an infinite VC-dimension (cf. Section 2). Nevertheless, the rate $\log n / n$ is achieved on this model.

Discussion
In this section we discuss the representation and estimation of "simple" prediction rules in our framework. When considering the classification problem over the square $[0, 1]^2$, a classifier has to be able to approximate, for instance, the "simple" Bayes rule $f^*_C$ which is equal to $1$ inside $C$, where $C$ is a disc of $[0, 1]^2$, and $-1$ outside $C$. In our framework, two questions need to be considered:
• What is the representation of the simple function $f^*_C$ in our fundamental system, using only coefficients with values in $\{-1, 0, 1\}$ and with the writing convention (W)?
• Is the estimate $\hat f_n^{(J_n)}$, where $J_n = \lceil \log(an/(2^d \log n)) / (d \log 2) \rceil$ is the frequency rank appearing in Theorem 5, a good classifier when the underlying probability measure has $f^*_C$ for Bayes rule?
At first glance, our point of view is not the right way to estimate $f^*_C$. In this regular case (the border is an infinitely differentiable curve), the direct estimation of the border is a better approach. The main reason is that a 2-dimensional estimation problem becomes a 1-dimensional problem. Such a reduction of dimension makes estimation easier (in passing, our approach is particularly well suited to the 1-dimensional case, since the notion of border does not exist in this case). Nevertheless, our approach is applicable to the estimation of such functions (cf. Theorem 6). Actually, direct estimation of the border reduces the dimension, but there is a big waste of observations, since observations far from the border are not used in this approach. This may explain why our approach is applicable. Denote by $N(A, \epsilon, \|\cdot\|_\infty)$ the $\epsilon$-covering number of a subset $A$ of $[0, 1]^2$ w.r.t. the infinity norm of $\mathbb{R}^2$. For example, the circle For any set $A$ of $[0, 1]^2$, denote by $\partial A$ the border of $A$.
Theorem 6.Let A be a subset of [0, 1] 2 such that N (∂A, ǫ, ||.|| ∞ ) ≤ δ(ǫ), for any ǫ > 0, where δ is a decreasing function from R * + with values in R + satisfying ǫ 2 δ(ǫ) −→ 0 when ǫ tends to zero.Consider the prediction rule f A = 21I A − 1.For any ǫ > 0, denote by ǫ 0 the greatest positive number satisfying δ(ǫ 0 )ǫ 2 0 ≤ ǫ.There exists a prediction rule constructed in the fundamental system S at the frequency rank J ǫ 0 with coefficients in {−1, 1} denoted by For instance, there exists a function f n , written in the fundamental system S at the frequency level J n = ⌊log(4n/(π log n))/ log 2⌋, which approaches the prediction rule f C with a L 1 (λ 2 ) error upper bounded by 36(log n)/n.This frequency level is, up to a multiplying constant, the same one appearing in Theorem 5.In a more general way, any prediction rule with a border having a finite perimeter (for instance polygons) is approached by a function written in the fundamental system at the same frequency rank J n and the same order of L 1 (λ 2 ) error (log n)/n.Remark that for this frequency level J n , we have to estimate n/ log n coefficients.Estimations of one of these coefficients a is smaller than n −1 .Thus, number of coefficients estimated with no observations is small compare to the order of approach (log n)/n and is taken into account in the variance term.Now, the problem is about finding a L 1 −ball of prediction rules such that for any integer n the approximation function f n belongs to such a ball.This problem depends on the geometry of the border set ∂A.It arises naturally since we chose a particular geometry for our partition: dyadic partitions of the space [0, 1] d , and we have to pay a price for this choice which has been made independently of the type of functions to estimate.But this choice of geometry in our case is the same as the one met in density approximation using approximation theory while choosing a particular wavelet basis.Depending on the type of Bayes rules we have to 
estimate, a special partition can be considered. For example, our "dyadic approach" is very well adapted to the estimation of Bayes rules associated to a chessboard (with the value $1$ for black squares and $-1$ for white squares). Bayes rules of this kind are very badly estimated by classification procedures which estimate the border, since most of these procedures rely on regularity assumptions which are not fulfilled in the case of a chessboard.
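The approximation of the disc rule $f_C$ discussed around Theorem 6 can be explored numerically. A rule written at frequency $J$ can be chosen to agree with $f_C$ on every dyadic square that does not meet the circle, so its $L^1(\lambda_2)$ error is at most twice the total area of the squares meeting the circle. The sketch below counts those squares; the centre, the radius and the function names are hypothetical, and the resulting bound is this crude geometric one rather than the constant stated in the theorem.

```python
import math


def boundary_cells(cx, cy, r, J):
    """Count the level-J dyadic squares of [0, 1]^2 meeting the circle of centre (cx, cy), radius r."""
    n = 2 ** J
    h = 1.0 / n
    count = 0
    for k1 in range(n):
        for k2 in range(n):
            x0, x1, y0, y1 = k1 * h, (k1 + 1) * h, k2 * h, (k2 + 1) * h
            # Smallest and largest distance from the centre to the closed square.
            dx_min = max(x0 - cx, 0.0, cx - x1)
            dy_min = max(y0 - cy, 0.0, cy - y1)
            dx_max = max(abs(x0 - cx), abs(x1 - cx))
            dy_max = max(abs(y0 - cy), abs(y1 - cy))
            # The square meets the circle iff r lies between these two distances.
            if math.hypot(dx_min, dy_min) <= r <= math.hypot(dx_max, dy_max):
                count += 1
    return count


def l1_error_bound(cx, cy, r, J):
    """Upper bound 2 * (area of the boundary squares) on the L1 error at level J."""
    return 2.0 * boundary_cells(cx, cy, r, J) * 4.0 ** (-J)


bounds = [l1_error_bound(0.5, 0.5, 0.3, J) for J in range(3, 7)]
```

The number of boundary squares grows like $2^J$ while each has area $4^{-J}$, so the bound decays like $2^{-J}$, which is consistent with an error of order $(\log n)/n$ when $2^{J}$ is of order $n / \log n$.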
We can extend our approach in several different ways. Consider the dyadic partition of $[0, 1]^d$ with frequency $J_n$. Instead of choosing $1$ or $-1$ for each square of this partition (as in our approach), we can perform a least squares regression in each cell of the partition. Inside a square $Sq = I_k^{(J_n)}$, where $k \in I_2(J_n)$, we can compute the line minimizing where $f$ is taken in the set of all indicators of half spaces of $[0, 1]^d$ intersecting $Sq$. Of course, depending on the number of observations inside the cell $Sq$, we can consider bigger classes of functions than the one made of the indicators of half spaces. Our classifier is close to the histogram estimator in the density or regression framework, which has been extended to smoother procedures. The other way to extend our approach deals with the problem of the underlying choice of geometry made by taking $S$ as fundamental system. One possible solution is to consider classifiers "adaptive to the geometry".
Using an adaptive procedure, for instance an aggregation procedure (cf. Lecué [2005]), we can construct classifiers adaptive to "rotation" and "translation". Consider the dyadic partition of $[0, 1]^2$ at the frequency level $J_n$. We can construct classifiers using the same procedure as (4) but for partitions obtained by translation of the dyadic partition by $(n_1/(2^{J_n} \log n), n_2/(2^{J_n} \log n))$, where $n_1, n_2 = 0, \ldots, \lceil \log n \rceil$. We can do the same thing by aggregating classifiers obtained by the procedure (4) for partitions obtained by rotation, with center $(1/2, 1/2)$ and angle $n_3 \pi/(2 \log n)$, where $n_3 = 0, \ldots, \lceil \log n \rceil$, of the initial dyadic partition. In this heuristic we do not discuss how to solve problems near the border of $[0, 1]^2$.
is the empty set then take g = 1.Otherwise, consider the set of index I O 1 built in the same way as previously, and for any (j, k) ∈ I O 1 we take a we go on.Denote by I the final family obtained by this construction (I may be finite or infinite).Then, we enumerate the indexes of I by (j For the first (j 1 , k 1 ) ∈ I take a (j 1 ) If the construction stops at a given iteration N then f takes its values in {−1, 1} and the writing convention (W) is fulfilled since every cells k such that a (j) k = 0 has a neighboring cell associated to a coefficient non equals to 0 with an opposite value.Otherwise, for any integer j = 0, the number of coefficient a Proof of Theorem 2. Let π = (P X , η) be a probability measure on X ×{−1, 1} belonging to P w,A .Denote by f * a Bayes classifier associated to π (for example f * = sign(2η − 1)) .We have Let ǫ > 0. Define by J ǫ the smallest integer satisfying +∞ j=Jǫ+1 2 −dj ⌊w(j)⌋ < ǫ A .
We write f * in the fundamental system (φ ) using the convention of writing of section 3.1 but we start at the level of frequency J ǫ : We consider where for all k ∈ I d (J ǫ ).Note that, if k , moreover f * take its values in {−1, 1}, thus ,we have Proof of Theorem 3. Let π = (P X , η) be a probability measure on X ×{−1, 1} satisfying (A1), (SMA) and such that f * = sign(2η − 1), a Bayes classifier associated to π, belongs to F (d) w (a L 1 −ball of Bayes rules).Let ǫ > 0 and J ǫ the smallest integer satisfying +∞ j=Jǫ+1 2 −dj ⌊w(j)⌋ < ǫ/A.We decompose the risk in the bias term and variance term: where f (Jǫ) n is introduced in (4) and f ǫ in (7).
Using the definition of J ǫ and according to the approximation Theorem (Theorem 1), the bias term satisfies: For the variance term we have (using the notations introduced in ( 4) and ( 8)): Let k ∈ I d (J ǫ ).For any m ∈ {0, . . ., n}, we introduce the sets We have k ) and Moreover, denote by Z 1 , . . ., Z n some variables i.i.d. with a Bernoulli with parameter p Concentration inequality of Hoeffding leads to for all t > 0 and m = 1, . . ., n.

Denote by a
(Jǫ) k the probability P X ∈ I > 1/2, applying second inequality of (10) leads to < 1/2 then similar arguments used in the previous case and first inequality of (10) lead to Like in the proof of Theorem 2, we use the writing = 1/2.Thus, the variance term satisfies: We have shown that for all ǫ > 0, where J ǫ is the smallest integer satisfying +∞ j=Jǫ+1 2 −dj ⌊w(j)⌋ < ǫ/A.Proof of Theorem 4. For all q ∈ N we consider G q a net of [0, 1] d defined by: and the function η q from [0, 1] d to G q such that η q (x) is the closest point of G q from x (in the case of ex aequo, we choose the smallest point for the usual order on R d ).Associated to this grid, the partition X ′ (q) 1 , . . ., X ′ (q) 2 dq of [0, 1] d is defined by x, y ∈ X ′ (q) i iff η q (x) = η q (y) and we use a special indexation for this partition: and we say that for the usual order on N d .Thus, the partition (X ′ (q) j : j = 1, . . ., 2 dq ) has an increasing indexation according to the order of (x ′ (q) k 1 ,...,k d ) for the order defined above.This order take care of the previous partition by splitting blocks in the right given order and inside a block of a partition we take the natural order of N d .We introduce an other parameter m ∈ {1, . . ., 2 qd } and we define for all i = 1, . . ., m, X (q) i = X ′ (q) i and X i .Parameters q and m will be chosen later.We consider W ∈ [0, m −1 ], chosen later, and define the function f (where λ d is the Lebesgue measure on [0, 1] d ) on X 1 , . . ., X m and (1 − mW )/λ d (X 0 ) on X 0 .We denote by P X the probability distribution on [0, 1] d with the density f X w.r.t. the Lebesgue measure.For all σ = (σ 1 , . . ., σ m ) ∈ Ω = {−1, 1} m we consider We have a set of probability measures {π σ : σ ∈ Ω} on [0, 1] d × {−1, 1} indexed by the hypercube Ω where P X is the marginal on [0, 1] d of π σ and η σ its conditional probability function of Y = 1 given X.We denote by f * σ the Bayes rule associated to π σ , we have f * σ (x) = σ j if x ∈ X j for j = 1, . . 
., m and 1 if x ∈ X 0 , for any σ ∈ Ω.Now we give conditions on q, m and W such that for all σ in Ω, π σ belongs to P w,h,a,A .If we take then P X << λ and ∀x ∈ Since we have ⌊w(j)⌋ ≥ 2 d − 1 for all j ≥ 1 and ⌊w(0)⌋ = 1, and ⌊w(j − 1)⌋ ≥ ⌊w(j)⌋/2 d , then f * σ ∈ F w for all σ ∈ Ω iff Take q, m and W such that (11) and ( 12) are fulfilled then, {π σ : σ ∈ Ω} is a subset of P w,h,a,A .Let σ ∈ Ω and fn be a classifier, we have .
Proof of Corollary 1: It suffices to apply Theorem 4 to the function $w$ defined by $w(j) = 2^{dj}$ for any integer $j$, with $a = A = 1$ for $P_X = \lambda_d$.
Proof of Theorem 5: 1.If we assume that J ǫ ≥ K then +∞ j=Jǫ+1 2 −dj ⌊w ) where C = a(1 − e −h 2 /2 )2 −d (A −1 (2 d(1−α) − 1)) 1/(1−α) .We have ǫ n ≤ (log n/(nC)) 1−α .For J n = J ǫn , we have For the lower bound we have for any integer n, j=1 B ∞ (x j , ǫ 0 ).Since 2 −Jǫ 0 ≥ ǫ 0 , only nine dyadic sets of frequency J ǫ 0 can be used to cover a ball of radius ǫ 0 for the infinity norm of R 2 .Thus, we only need 9N(ǫ 0 ) dyadic sets of frequency J ǫ 0 to cover ∂A.Consider the partition of [0, 1] 2 by dyadic sets of frequency J ǫ 0 .Except on the 9N(ǫ 0 ) dyadic sets used to cover the border ∂A, the prediction rule f A is constant, equal to 1 or −1, on the other dyadic sets.Thus, by taking k 1 ,k 2 , where a k 1 ,k 2 is equal to one value of f A in the dyadic set I (Jǫ 0 ) k 1 ,k 2 , we have these 2 d coefficients equal to 1, and the same convention holds for −1.Moreover if we have A be written in S using only coefficients with values in {−1, 0, 1}.Nevertheless, the Lebesgue measure satisfies the property of regularity, which says that for any Borel B ∈ [0, 1] d and any ǫ > 0, there exists a compact subset K and an open subset O such that K ⊆ A ⊆ O and and, either η is λ d -almost everywhere continuous (it means that there exists an open subset of [0, 1] d with a Lebesgue measure equals to 1 such that η is continuous on this open subset) or if η is λ d −almost everywhere equal to a continuous function, then f η ∈ F

w
depends on the choice of the function $w$. If $w$ is too small then the class $\mathcal{F}^{(d)}_w$ is not very rich; this is the subject of Proposition 1 below. If $w$ is too large then $\mathcal{F}^{(d)}_w$ would be too complex for a good estimation of $f^* \in \mathcal{F}^{(d)}_w$, which is why we introduce Definition 2 in what follows.
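The covering step in the proof of Theorem 5 uses a simple counting fact: when the cell side $2^{-J}$ is at least the radius $\epsilon_0$, a ball for the infinity norm of $\mathbb{R}^2$ meets at most three dyadic intervals per axis, hence at most nine dyadic cells. The following is a minimal sketch of that count; the function names are hypothetical, the cells are taken half-open, and no clipping to $[0,1]^2$ is performed.

```python
import math
from fractions import Fraction
from itertools import product


def cells_hit_1d(center, radius, J):
    """Indices k of the dyadic intervals [k/2^J, (k+1)/2^J) that meet
    the closed interval [center - radius, center + radius]."""
    scale = 2 ** J
    lo = math.floor((center - radius) * scale)
    hi = math.floor((center + radius) * scale)
    return range(lo, hi + 1)


def cells_hit_2d(center, radius, J):
    """Index pairs (k1, k2) of the dyadic squares of frequency J that meet
    the sup-norm ball of the given center and radius (no clipping to [0,1]^2)."""
    kx = cells_hit_1d(center[0], radius, J)
    ky = cells_hit_1d(center[1], radius, J)
    return list(product(kx, ky))


if __name__ == "__main__":
    # Illustrative values only: J = 2 and radius = 2^{-J} = 1/4.
    hit = cells_hit_2d((Fraction(1, 2), Fraction(1, 2)), Fraction(1, 4), 2)
    print(len(hit))
```

Exact rationals (`Fraction`) are used in the example so that points lying on a cell boundary are counted without floating-point ambiguity.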
a partition of $\mathcal{X}$ adapted to the previous one $(I^{(j-1)}_k : k \in I_d(j-1))$ and satisfying $P^X(I^{(j)}_k) = 2^{-jd}$. All the results below can be obtained in this framework. Now, examples of functions satisfying (3) are given. The classes $\mathcal{F}^{(d)}_w$ associated with these functions are used in what follows to define statistical models. As an introduction we define the minimal infinite class of prediction rules, by F ) = $2^d - 1$, for all $j \ge 1$. To understand why this class is important we introduce a notion of local oscillation of a prediction rule. This concept defines a kind of "regularity" for functions with values in $\{-1, 1\}$.
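As a quick check of why $2^d - 1$ coefficients per level is a natural threshold, the following computation is a sketch under two assumptions: each cell of level $j$ has mass $2^{-dj}$, and exactly $2^d - 1$ coefficients are nonzero at every level $j \ge 1$, with none at level 0.

```latex
\[
\sum_{j \ge 1} (2^d - 1)\, 2^{-dj}
  = (2^d - 1)\,\frac{2^{-d}}{1 - 2^{-d}}
  = (2^d - 1)\,\frac{1}{2^d - 1}
  = 1 .
\]
```

Under these assumptions the cells carrying nonzero coefficients have total mass exactly 1, so lowering the count at a single level while keeping $2^d - 1$ at all the others would bring the total mass strictly below 1.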
made of all prediction rules with one oscillating block at level 1 and of the indicator function $1\!\mathrm{I}_{[0,1]^d}$. If we have $w(j_0) < w^{(d)}_0(j_0)$ for one $j_0 \ge 1$ and $w(j) = w^{(d)}_0(j)$ for $j \ne j_0$ then the associated class $\mathcal{F}^{(d)}_w$ contains only the indicator function $1\!\mathrm{I}_{[0,1]^d}$; that is the reason why we say that $\mathcal{F}^{(d)}_0$ $x_1, \dots, x_m)) = 2^m$. Hence, the following proposition holds. Proposition 2. The class of prediction rules $\mathcal{F}^{(d)}_0$ $0 < \alpha < 1$, and are equal to $\mathcal{F}^{(d)}_w$ when $w = w^{(d)}_\alpha$ and Let $w$ be a function from $\mathbb{N}$ to $\mathbb{R}_+$ and $A > 1$. We denote by $\mathcal{P}_{w,A}$ the set of all probability measures $\pi$ on $[0,1]^d \times \{-1, 1\}$ such that the Bayes rule $f^*$ associated with $\pi$ belongs to $\mathcal{F}^{(d)}_w$, the marginal of $\pi$ on $[0,1]^d$ is absolutely continuous, and one version of its Lebesgue density is upper bounded by $A$. The following theorem can be seen as an approximation theorem for the Bayes rules w.r.t. the loss $d_\pi$, uniformly in $\pi \in \mathcal{P}_{w,A}$.

w
be an $L^1$-ball of prediction rules. Let $\pi$ be a probability measure on $[0,1]^d \times \{-1, 1\}$ satisfying assumptions (A1) and (SMA), and such that the Bayes rule associated with $\pi$ belongs to $\mathcal{F}^{(d)}_w$. The excess risk of the classifier $\hat f^{(J_\epsilon)}_n$ satisfies, for any positive number $\epsilon$, $\mathcal{P}_{w^{(d)}_K,h,a,A}$ of probability measures on $[0,1]^d \times \{-1, 1\}$ and P $k \in I_2(J_n)$, depends on the number of observations in the square $I^{(J_n)}_k$ associated with this coefficient. The probability that no observation "falls" in $I^{(J_n)}_k$ Since $\{\eta \ge 1/2\}$ is almost everywhere open, there exists an open subset $O$ of $[0,1]^d$ such that $\lambda_d(\{\eta \ge 1/2\} \Delta O) = 0$. If $O$ is the empty set then take $g = -1$; otherwise, for all $x \in O$ denote by $I_x$ the biggest subset $I^{(j)}_k$, for $j \in \mathbb{N}$ and $k \in I_d(j)$, such that $x \in I^{(j)}_k$ and $I^{(j)}_k \subseteq O$. Remark that $I_x$ exists because $O$ is open. We can see that for any $y \in I_x$ we have $I_y = I_x$, thus, (I $k \in I_d(j)$, not equal to 0 is $\lfloor w(j) \rfloor$ and the total mass of cells $I^{(j)}_k$ such that $a^{(j)}_k \ne 0$ is $\sum_{j \in \mathbb{N}} 2^{-dj}\, \mathrm{card}\{k \in I_d(j) : a^{(j)}_k \ne 0\}$, which is greater than or equal to 1 by assumption. Thus, the whole hypercube is filled by cells associated with coefficients not equal to 0. So $f$ takes its values in $\{-1, 1\}$ and the writing convention (W) is fulfilled, since every cell $I^{(j)}_k$ such that $a^{(j)}_k \ne 0$ has a neighboring cell associated with a coefficient not equal to 0 with an opposite value. Moreover $f = 1\!\mathrm{I}_{[0,1]^d}$.
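The probability that a dyadic cell receives no observation can be illustrated numerically. The sketch below assumes, for illustration only, that the marginal $P^X$ is the Lebesgue measure on $[0,1]^d$, so that a cell of frequency $J$ has mass $2^{-dJ}$ and the $n$ i.i.d. observations all miss it with probability $(1 - 2^{-dJ})^n \le \exp(-n 2^{-dJ})$. The function names are hypothetical.

```python
import math


def empty_cell_probability(n: int, d: int, J: int) -> float:
    """Probability that none of n i.i.d. uniform points on [0,1]^d falls
    in one fixed dyadic cell of frequency J (cell mass 2^{-dJ})."""
    cell_mass = 2.0 ** (-d * J)
    return (1.0 - cell_mass) ** n


def exponential_bound(n: int, d: int, J: int) -> float:
    """Upper bound exp(-n * 2^{-dJ}), from the inequality 1 - p <= exp(-p)."""
    return math.exp(-n * 2.0 ** (-d * J))


if __name__ == "__main__":
    # Illustrative values only: n observations, dimension d = 2.
    n, d = 1000, 2
    for J in range(1, 7):
        p = empty_cell_probability(n, d, J)
        b = exponential_bound(n, d, J)
        print(f"J={J}: P(empty cell)={p:.3e}, bound={b:.3e}")
```

The example makes the trade-off visible: a finer partition (larger $J$) makes each cell more likely to be empty, which is the variance side of the usual bias and variance balance.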