Minimax bounds for Besov classes in density estimation

Abstract: We study the problem of density estimation on [0, 1] under the L^p norm. We construct a new piecewise polynomial estimator and prove that it is simultaneously (near)-minimax over a very wide range of Besov classes B^α_{π,∞}(R). In particular, we may deal with unbounded densities and shed light on the minimax rates of convergence when π < p and α ∈ (1/π − 1/p, 1/π].


Introduction
We consider n independent and identically distributed random variables X_1, . . ., X_n defined on an abstract probability space (Ω, E, P). We suppose that X_i admits a density f with respect to the Lebesgue measure on [0, 1] and are interested in estimating f from the observations X_1, . . ., X_n.
We consider p ≥ 1, a subset F of the linear space (L^p([0, 1]), ‖·‖_p), and define the celebrated minimax risk R_p(F) = inf_{f̂} sup_{f∈F} E[d_p(f̂, f)], where d_p denotes the distance in the above space, and where the infimum is taken over all estimators f̂ with values in F. The minimax rate of convergence, that is, the rate at which R_p(F) converges to 0 (if it converges), is the best possible for procedures based solely on assumptions modelled by F. This rate can therefore be seen as a benchmark for statistical procedures.
In the present paper, our aim is to define a (near) minimax estimator under smoothness constraints. We will therefore pay special attention to bounded subsets F of Besov spaces B^α_{π,∞}. The subscript π indicates in which (quasi) norm the regularity α is measured. The smaller π is, the larger the class, and the more difficult the estimation problem. It is sometimes said that these Besov spaces allow taking into account spatially inhomogeneous smoothness. The minimax results depend heavily on π, p, α, as described below.
The situation appears to be more complicated when π < p and α ∈ (1/π − 1/p, 1/π]. The condition α > 1/π − 1/p ensures that B^α_{π,∞} ⊂ L^p([0, 1]) and allows the use of an L^p loss. However, B^α_{π,∞} is not included in L^∞([0, 1]) when α ≤ 1/π. A key paper for understanding the importance of the condition ‖f‖_∞ < +∞ in estimation procedures is that of [Bir14]. In this paper, [Bir14] established a minimax lower bound. It reveals that the usual rate n^{−α/(1+2α)} cannot hold over the whole range (1/π − 1/p, 1/π] when p = 2. However, this rate is the right one when the density is bounded. The problem of determining the optimal rates when the target function is smooth but not bounded has already been studied in other statistical settings such as, for instance, the Gaussian white noise model and the regression model with random design. We refer to [Bar02, Bir04, Lep15] and the references therein. In the latter model, the minimax rate is known (up to log factors) for the L^2 loss and is n^{−min{α/(1+2α), α−1/π+1/2}} when π ∈ [1, 2) and α ∈ (1/π − 1/2, 1/π]. We therefore observe a possible deterioration of the rates when the target is not bounded. This is similar to density estimation but contrasts with the white noise model, where the boundedness assumption can be removed without changing the exponent in the optimal rates: it is α/(1 + 2α) regardless of π ∈ [1, 2) and α ∈ (1/π − 1/2, 1/π] when p = 2.
Several procedures for unbounded densities have been proposed in the statistical literature. We may cite for instance the wavelet thresholding procedures of [BTWB10, RBRTM11]. They lead to nice oracle inequalities for the L^2 loss without any condition on the supremum norm of f. Also worth mentioning are the general procedures of [LW19, Bir06, Bir14]. The first is based on a pointwise selection scheme. It leads to local risk bounds, which can be integrated to become global. The other two are based on robust tests and allow for more general assumptions than smoothness. Despite this, the minimax rate does not seem to have attracted much attention in the literature when α ∈ (1/π − 1/p, 1/π] and π < p. In the present paper, we define a new estimator and show that it achieves the rate (log n)^β n^{−min{α/(1+2α), (α−1/π+1/p)/(1+α−1/π)}} for some β independent of n. Moreover, we show that this rate is optimal, up to logarithmic factors. In particular, when π < p − 2, the minimax rate is never n^{−α/(1+2α)}. When π > p − 2, there is an elbow effect, the exponent being (α − 1/π + 1/p)/(1 + α − 1/π) for "small values of α" and α/(1 + 2α) for "large values".
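For the reader's convenience, the rate exponent just stated is elementary to evaluate. The following sketch (function names are ours, not from the paper) computes ψ = min{α/(1+2α), (α − 1/π + 1/p)/(1 + α − 1/π)} on the range π < p, α ∈ (1/π − 1/p, 1/π], and checks numerically that the usual exponent α/(1+2α) is never attained when π < p − 2:

```python
def psi(alpha, pi_, p):
    """Rate exponent min{alpha/(1+2a), (a-1/pi+1/p)/(1+a-1/pi)}.

    Valid on the range pi_ < p and 1/pi_ - 1/p < alpha <= 1/pi_.
    """
    assert 0 < pi_ < p and 1 / pi_ - 1 / p < alpha <= 1 / pi_
    dense = alpha / (1 + 2 * alpha)                            # usual exponent
    sparse = (alpha - 1 / pi_ + 1 / p) / (1 + alpha - 1 / pi_)  # new exponent
    return min(dense, sparse)

# When pi_ < p - 2, the second branch is active on the whole range:
p, pi_ = 4.0, 1.0
assert all(psi(a, pi_, p) < a / (1 + 2 * a) for a in [0.76, 0.85, 0.9, 1.0])
```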
We restrict our study to the estimation of the density on the unit interval [0, 1], but other estimation domains have also been considered in the literature. For results concerning estimation on the real line under smoothness constraints, we refer to [BH78, JLL04, Efr08, RBRTM11, GL11, GL14]. However, it should be noted that the minimax rates on Besov classes can be very different depending on whether the estimation domain is compact or not. As far as we know, the rates on the real line are not fully known. In higher dimensions, the whole point is to reduce the curse of dimensionality. A solution is to allow the smoothness of f to vary with the direction. For more information on these issues of anisotropy, we refer to [Kle09, GL11, Aka12, GL14].
Our estimation strategy is based on projection estimators and on a new estimator selection rule. This procedure may be thought of as a mix between a Lepski-type procedure [Lep92] and the one of [Sar14]. It leads to a piecewise polynomial estimator of degree r that is (near) minimax and adaptive over the full scale of Besov classes π ∈ (0, p) and α ∈ (1/π − 1/p, r + 1). In other words, it achieves the rates given above, up to logarithmic factors, without prior knowledge of α and π. Our estimator also has good computational properties. We may build it in polynomial time when p > 1, and more precisely in about O(n^{p/(p−1)} log n) operations. It is also possible to make our estimator computable in (nearly) linear time when the classical condition α > 1/π is met.
The paper is organized as follows. We make explicit the minimax rates of convergence in the next section. Section 3 is devoted to the construction of our estimator and contains intermediate results such as an oracle inequality and a result on the approximation of the elements of B^α_{π,∞} by piecewise polynomial functions. The proofs are postponed to Section 4.

Besov classes
We recall here the definition of Besov spaces to fix notation, and refer to [DL93] for more details.
We consider α > 0, π ∈ (0, +∞], and the smallest integer r larger than α. Let ω_r(f, x)_π denote the modulus of smoothness of order r of f in L^π([0, 1]). The quantity |f|_{B^α_{π,∞}} = sup_{x>0} x^{−α} ω_r(f, x)_π is a semi-norm when π ≥ 1, and a quasi semi-norm otherwise. We define the Besov space B^α_{π,∞} as the set of functions f on [0, 1] with finite (quasi) semi-norm |f|_{B^α_{π,∞}}. We investigate in the next section the problem of minimax estimation over the class B^α_{π,∞}(R) of densities f satisfying |f|_{B^α_{π,∞}} ≤ R.
Up to log factors, the rate achieved when α > ᾱ cannot be improved, as it corresponds to the minimax rate for Hölder semi-balls. In the literature, [Bir14] obtained a lower bound matching this upper bound (to within logarithmic factors) when p = 2 and α ≤ ᾱ. Slight modifications of his proof lead to the following. Moreover, if α ≤ 1/π − 1/p and p > 1, the minimax risk remains bounded away from 0 for n large enough. In these inequalities, c_2 depends on R, p, π, α only.
Therefore, the optimal rate of convergence is n^{−ψ}, up to possible logarithmic factors. When α is smaller than 1/π − 1/p, B^α_{π,∞}(R) is not included in L^p([0, 1]) and minimax results for B^α_{π,∞}(R) are meaningless. The second point of the proposition says that the minimax rate does not tend to zero, even under an additional condition on the L^p norm, when p > 1.
When p = 1, ᾱ = 1/π − 1, and we recover that ψ = α/(1 + 2α) on the whole range (1/π − 1, 1/π]. This rate is actually free of logarithmic factors, see [Bir06]. The formula for ψ when p > 1, α < 1/π does not seem to appear in the literature. We suspect, however, that already existing estimators are (near) minimax. Of particular note are the wavelet estimators of [BTWB10, RBRTM11] and the selection rule of [LW19]. Unfortunately, the authors did not explicitly address this issue. But there is a likeness between the L^2 oracle inequalities of [BTWB10, RBRTM11] and ours (see (8) or (16) below). What is missing is the computation of the risk from these inequalities. As for the very general paper of [LW19], the authors explain in their Theorem 2 how to control the maximal risk of their estimator over a collection F. The functions of F need not be bounded. The condition relates rather to the inclusion of F into an L^q space, or more generally a Lorentz space. The proof of Theorem 1 also turns out to be based on a similar embedding. For each of these references, a logarithmic factor in the convergence rates is to be expected. In [LW19], this factor may be due to a too local approach to the estimation problem. In our paper, and in [BTWB10, RBRTM11], this factor appears in the penalties/thresholds. These could therefore be a little too large. Improving them would require controlling the underlying empirical process more accurately. The question of whether the optimal rates involve logarithmic factors remains open.

When α > 1/π, the minimax rate of convergence is, up to logarithmic factors, n^{−ψ} where ψ = min{α/(1 + 2α), (α − 1/π + 1/p)/(1 + 2(α − 1/π))}. This result has been known for a long time. We present in Figure 1 a graph summarizing these rates when p = 4.
There are therefore three possible formulas for the rate on (1/π − 1/p, +∞). The well-known elbow phenomenon refers in the literature to the non-differentiability of ψ at (p − π)/(2π) when p > 2 + π, i.e. when moving from the orange zone to the green zone. Actually, ψ is not differentiable at min{ᾱ, 1/π} either. Therefore, there is always at least one elbow effect on (1/π − 1/p, +∞), and sometimes two.

Estimation procedure
For all subsets F ⊂ L^p([0, 1]) and f ∈ L^p([0, 1]), we set d_p(f, F) = inf_{g∈F} d_p(f, g). The notation |I| stands both for the cardinality of a finite set I and for the length of an interval I. We set N* = N \ {0}. The letters c, c', C, C', . . . denote quantities that may change from line to line.

Collection of partitions
We introduce here the tree-structured partitions m of [0, 1] derived from the recursive algorithm of [DY90]; they are frequently encountered when estimating smooth functions by histograms or, more generally, by piecewise polynomial estimators (see [BB09, Aka12, Sar14] among other references).
Consider a partition m of [0, 1]. We may refine m by dividing some intervals I of m into two equal parts. The collection of all partitions that can be constructed from m in this way is denoted by M(m). We then define collections M_ℓ of partitions by induction, by setting M_0 = {{[0, 1]}} and M_ℓ = ⋃_{m∈M_{ℓ−1}} M(m) for ℓ ≥ 1. The collection M_ℓ is therefore composed of partitions of [0, 1] into intervals with endpoints of the form k/2^ℓ. It does not only contain the regular partitions of [0, 1] of size 2^k, k ≤ ℓ, but also partitions that are very thin locally and wider elsewhere.
We moreover define the collection M_∞ = ⋃_{ℓ≥0} M_ℓ of partitions m that can be constructed by this algorithm in a finite number of steps.
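To make the recursive construction concrete, the following small sketch (our own illustration, not code from the paper) enumerates M_ℓ for small ℓ, keeping the dyadic endpoints exact as fractions. The cardinality obeys the recursion T(ℓ) = 1 + T(ℓ−1)^2 and thus grows doubly exponentially with ℓ, which is why an exhaustive search over M_ℓ is hopeless in practice:

```python
from fractions import Fraction

def refinements(m):
    """M(m): all partitions obtained by halving any subset of intervals of m."""
    out = [[]]
    for (a, b) in m:
        mid = (a + b) / 2
        # each interval is either kept or split into its two halves
        out = [p + [(a, b)] for p in out] + [p + [(a, mid), (mid, b)] for p in out]
    return [tuple(p) for p in out]

def dyadic_partitions(level):
    """The collection M_level of tree-structured dyadic partitions of [0, 1]."""
    M = {((Fraction(0), Fraction(1)),)}
    for _ in range(level):
        M = {q for m in M for q in refinements(list(m))}
    return M
```

For instance, |M_2| = 5: the trivial partition, the two halves, the two "one half split" partitions, and the four quarters.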

Projection estimators
We estimate the density f by means of piecewise polynomial estimators. They are defined as projection estimators, as described below.
Consider an integer r and a collection m of disjoint intervals, and the space P_r(m) = {Σ_{I∈m} P_I 1_I, where each P_I is a polynomial function of degree at most r} of piecewise polynomial functions. Let (φ_{I,j})_{I∈m, j∈{0,...,r}} be the orthonormal basis of P_r(m) defined from the Legendre polynomials Q_j by φ_{I,j}(x) = √((2j + 1)/(b − a)) Q_j(2(x − a)/(b − a) − 1) 1_I(x), where a < b denote the extremities of I. We define the projection estimator f̂_m = Σ_{I∈m} Σ_{j=0}^{r} (n^{−1} Σ_{i=1}^{n} φ_{I,j}(X_i)) φ_{I,j}, and omit the dependency in r to lighten the notation.
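A direct implementation of this projection estimator might look as follows (a sketch under our own conventions; only the rescaled Legendre basis and the empirical coefficients come from the text, the function names are ours):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def phi(I, j):
    """Rescaled Legendre basis function phi_{I,j}, orthonormal in L^2([0,1])."""
    a, b = I
    coefs = np.zeros(j + 1)
    coefs[j] = 1.0
    Q = Legendre(coefs)  # j-th classical Legendre polynomial on [-1, 1]
    def f(x):
        x = np.asarray(x, dtype=float)
        # half-open intervals, closed at the right end of [0, 1]
        inside = (x >= a) & (x <= b) if b == 1 else (x >= a) & (x < b)
        return np.where(inside, np.sqrt((2 * j + 1) / (b - a)) * Q(2 * (x - a) / (b - a) - 1), 0.0)
    return f

def projection_estimator(X, m, r):
    """f_hat_m = sum over I in m, j <= r of empirical coefficients times phi_{I,j}."""
    X = np.asarray(X, dtype=float)
    basis = [phi(I, j) for I in m for j in range(r + 1)]
    coefs = [b(X).mean() for b in basis]  # n^{-1} sum_i phi_{I,j}(X_i)
    return lambda x: sum(c * b(x) for c, b in zip(coefs, basis))
```

With r = 0 this reduces to the classical histogram estimator on the partition m.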
Since their introduction by [Cen62], projection estimators have received considerable interest in statistical estimation. They are at the heart of many model selection procedures, see [BM98, BBM99, Mas07] for key references. For pedagogical reasons, we present below a risk bound for this estimator when p = 2:

E[d_2^2(f, f̂_m)] ≤ d_2^2(f, P_r(m)) + inf_{q∈(1,+∞]} (r + 1)^2 ‖f‖_q (Σ_{I∈m} |I|^{1−q/(q−1)})^{(q−1)/q} / n.   (3)

The proof of this result is merely based on Hölder's inequality, see Section 4.2. It highlights what will allow us to estimate unbounded densities f of B^α_{π,∞}. When m is a regular partition, that is, when the intervals I ∈ m are all of the same size, the term in the infimum is ‖f‖_q (r + 1)^2 |m|/n. In particular, by letting q tend to 1,

E[d_2^2(f, f̂_m)] ≤ d_2^2(f, P_r(m)) + (r + 1)^2 |m|/n.   (4)

For more general partitions m, and q → +∞,

E[d_2^2(f, f̂_m)] ≤ d_2^2(f, P_r(m)) + (r + 1)^2 ‖f‖_∞ |m|/n.   (5)

These two inequalities apply either for regular partitions, without assumption on ‖f‖_∞, or for any partition when ‖f‖_∞ is finite. More generally, the risk bounds of projection estimators sometimes involve the supremum norm of f and sometimes not, depending on whether or not the model satisfies a condition linking the L^2 and L^∞ structures of the model (equation (7.16) in [Mas07]). For more details on this phenomenon, we refer to [Bir14].
It is worth mentioning that (4) and (5) are not suitable for estimating densities f in B^α_{π,∞}(R) when α < 1/π and π < 2. Indeed, such densities may not be bounded and may be poorly approximated by piecewise polynomial functions based on regular grids. Making the bias term d_2^2(f, P_r(m)) small requires working with partitions m adapted to the spatial inhomogeneity of f. In some sense, (3) fills the gap between (4) and (5). We can deal with irregular partitions and unbounded densities. The condition on the supremum norm is replaced by a condition on the L^q norm. The larger q is, the smaller (Σ_{I∈m} |I|^{1−q/(q−1)})^{(q−1)/q} is, and vice versa.
Two issues remain to be addressed. First, the risk of the estimator depends on the choice of m. We need to explain how to define a partition m realizing a good compromise between the two terms in (3). Second, we need to compute the risk of this estimator when the target f lies in a Besov class. In the section below, we start by answering the first point.

Selection rule
For all partitions m, m' and every interval J, we set: For every interval I, we define: We set, for any collection m of intervals and ξ > 0: where log_2 denotes the logarithm in base 2. We consider κ_1, κ_2, and some ℓ ∈ N ∪ {∞}, and define for m ∈ M_ℓ: We then define m̂ as any partition of M_ℓ satisfying this criterion, and shrink the resulting estimator: f̂ = min{1, n ‖f̂_m̂‖_p^{−1}} f̂_m̂. This criterion can be seen as a Lepski-type procedure [Lep92] (see also [BB09]) modified as in [Sar14] in order to make the construction of the estimator possible by dynamic programming algorithms. We leave these computational aspects aside for the moment to focus on the theoretical properties of f̂. They will be discussed in Section 3.7.
Thereby, the risk of the estimator is controlled by the best possible compromise between a bias term d_p(f, P_r(m)) and an estimation term, up to a multiplicative factor. The partitions m can vary freely in M_ℓ, and can be very thin locally and thus well adapted to the target density f. Note that this result improves with ℓ (we can even set ℓ = +∞ in theory).
Our theorem applies to any density f ∈ L^p([0, 1]), and in particular to unbounded densities. We may always set q = p in the infimum, but playing with larger values of q will be worthwhile, as v_q(m) is non-increasing in q. The value q = +∞ is allowed, using standard algebra in R ∪ {+∞}, to deal with bounded densities f. There are two main differences between the present estimation term and that of Proposition 3. We have here an additional term ℓ_m + log n, which will be of the order of log n for the best partitions m under Besov-type regularity constraints. This implies that our results will be slightly sub-optimal in some cases. But this term cannot be avoided in general, as (8) leads to the optimal rates in the sparse zone (orange zone of Figure 1), see Section 3.6. The second difference lies in the presence of the additional term w(m)(ℓ_m + log n)/n. For adequate partitions m and densities f ∈ B^α_{π,∞}(R), this last term will be, in the worst case, of the same order of magnitude as M_q(m) v_q(m)(ℓ_m + log n)/n.

Approximation
We need to bound (8) from above when f lies in a Besov class in order to obtain our maximal risk bound. This problem falls under approximation theory and is treated below.
We know from the literature that functions f ∈ B^α_{π,∞}(R) can be well approximated by piecewise polynomial functions defined over a moderate number of intervals. We refer to [DY90, BM00, Aka12]. Let us first observe that, when δ ≤ 1, a concavity argument entails v_q(m) ≤ |m|. Moreover, if p = 1, w(m) = |m|, and Corollary 3.3 of [DY90] may be used to control the right-hand side of (8). In general, however, v_q(m) and w(m) may be much larger than |m|: we do not only need to control the size of m but also the thinness of its intervals. The following result, proved in Section 4.4, is tailored to solve this problem.
• Suppose that α > δ(1/π − 1/p). Then, there exist ℓ and m ∈ M_ℓ such that
• Suppose that α < δ(1/π − 1/p). Then, there exists a partition m ∈ M_ℓ with ℓ ≤ k such that, and, if p > δ,
• Suppose that α = δ(1/π − 1/p). Then, there exists a partition m ∈ M_ℓ with ℓ ≤ k such that, and, if p > δ,
In the above inequalities, C depends on α, δ, p, r, π only.

Minimax bound
We may therefore define partitions m with moderate v_q(m) and w(m) that are well adapted to the variations of f. This makes it possible to bound the infimum in (8) from above. We state:

Theorem 6. Let r ≥ 0, p ∈ [1, +∞), and let f̂ be the estimator defined in Section 3.3 with κ_1 = κ_2 = κ, ℓ = +∞ and ξ given by Theorem 4. For all π ∈ (0, p), α ∈ (1/π − 1/p, r + 1) and R > 0, the estimator f̂ satisfies, for n large enough, a risk bound of order n^{−ψ} up to logarithmic factors, where ψ is defined in Section 2.

Combined with Proposition 2, this theorem says that our estimator f̂ is adaptive and near-minimax over a large scale of Besov classes. It implies Theorem 1. Note that the rate is optimal in the orange zone of Figure 1.
The precise value of ᾱ can be explained by the following reasoning. A function f ∈ B^α_{π,∞} "almost belongs" to L^q where 1/q = 1/π − α. Assuming that this is true, we see from Proposition 5 that the change of rates occurs at α = δ(1/π − 1/p) with δ = pq/(2q − p). This leads to the equation of which ᾱ is the solution. By the way, slight modifications of the proof entail that τ = 0 when α < 1/π and α < ᾱ when the supremum is taken over all the densities of B^α_{π,∞}(R) whose L^q norm is uniformly bounded (with 1/q = 1/π − α). When π ≥ p, the functions in B^α_{π,∞}(R) can be approximated by piecewise polynomial functions based on regular partitions, see Lemma 12 of [BBM99]. More precisely, for every α > 0 and k ≥ 1, there is a regular partition m of size 2^k such that d_p(f, P_r(m)) ≤ CR2^{−kα}. For such a partition, v_q(m) = w(m) = |m|. By plugging these results into (8), we deduce that our estimator achieves the rate (log n/n)^{α/(1+2α)} over the Besov classes B^α_{π,∞}(R) for all π ≥ p and α ∈ (0, r + 1).
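As a numerical sanity check (our own illustration), the equation satisfied by ᾱ can be assembled from this reasoning: substituting q = 1/(1/π − α) into α = δ(1/π − 1/p) with δ = pq/(2q − p) yields the quadratic α^2 + α(2/p − 1/π) − (1/π − 1/p) = 0, whose positive root coincides with the crossing point of the two branches of the exponent given in the introduction:

```python
import math

def alpha_bar(pi_, p):
    """Positive root of x^2 + x*(2/p - 1/pi) - (1/pi - 1/p) = 0 (our derivation)."""
    u, v = 1 / pi_, 1 / p
    return ((u - 2 * v) + math.sqrt((u - 2 * v) ** 2 + 4 * (u - v))) / 2

# The root is exactly where the two rate exponents coincide:
a, u, v = alpha_bar(1.5, 2.0), 1 / 1.5, 1 / 2.0
assert abs(a / (1 + 2 * a) - (a - u + v) / (1 + a - u)) < 1e-12
```

Consistently with Section 2, the formula gives ᾱ = 1/π − 1 when p = 1.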

Computational statistics
To compute our estimator, we have to minimize γ(m) + (1 + 2^{1−p}) pen^p_ξ(m) over the set of partitions m of M_ℓ. It is not advisable, in practice, to solve this optimization problem by an exhaustive search for m̂ among all the partitions of M_ℓ. This is due to the very large cardinality of M_ℓ, even when ℓ is moderate. Actually, the calculation of γ(m) itself is an optimization problem that can hardly be solved by a naive approach.
Fortunately, dynamic programming allows us to build the estimator more efficiently. We refer to [Don97, BSR04, AL11, Aka12, Sar14] for some examples of this technique in statistics. In particular, we may slightly adapt the algorithm of [Sar14] to perform the exact computation of m̂ in at most C(2^ℓ + n) operations (see his Proposition A.1). The term C does not depend on ℓ, n, and by operations we mean elementary operations such as additions, multiplications, computations of elementary functions or integrals. Naturally, the numerical complexity of the procedure increases with ℓ. Since the theoretical results improve with ℓ, the choice of ℓ is, at first sight, a compromise between the theoretical and computational properties of the estimator.
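The full selection criterion is beyond a short snippet, but the dynamic-programming principle it exploits can be sketched on a simplified, purely additive surrogate Σ_{I∈m} cost(I): the optimum over the doubly-exponentially large collection M_ℓ is obtained by solving one subproblem per dyadic interval, i.e. O(2^ℓ) subproblems. Everything below is our own illustration, not the algorithm of [Sar14]:

```python
from functools import lru_cache

def best_partition(cost, level):
    """Minimize sum of cost(a, b) over tree partitions m in M_level.

    Dynamic programming on the dyadic tree: an interval is either kept
    as a leaf or split, whichever is cheaper.
    """
    @lru_cache(maxsize=None)
    def solve(a, b, depth):
        here = cost(a, b)
        if depth == 0:
            return here, ((a, b),)
        mid = (a + b) / 2
        lval, lpart = solve(a, mid, depth - 1)
        rval, rpart = solve(mid, b, depth - 1)
        if lval + rval < here:
            return lval + rval, lpart + rpart
        return here, ((a, b),)
    return solve(0.0, 1.0, level)
```

For instance, a cost favouring intervals of length at most 1/2 returns the two-halves partition rather than the trivial one or the quarters.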
One may, however, be interested in practice only in linear (or polynomial) time estimators. This amounts to setting 2^ℓ of the order of n^c and looking for the classes of functions on which our estimator is (nearly) rate optimal. We show: Let r ≥ 0, p ∈ [1, +∞), π ∈ (0, p), α ∈ (1/π − 1/p, r + 1) and R > 0. Let c > 0, let ℓ be the largest integer such that 2^ℓ ≤ n^c, κ_1 = κ_2 = κ, and let f̂ be the estimator defined in Section 3.3.
The numerical complexity of our procedure is of the order of n^c log n. By setting c = 1, we recover the preceding rates under the standard condition α > 1/π in nearly linear time. When p > 1, we may set c = p/(p − 1) to get all the rates in polynomial time.
Remark: the minimum length of an interval of a partition m ∈ M_ℓ is 2^{−ℓ}. In particular, it is not necessary to go below 1/n to be (near) optimal when α > 1/π. We need to cross this threshold to estimate unbounded densities, which correspond to smaller values of α.

Proof of Proposition 2
Due to the results on Hölder classes, we only need to show one inequality. It can be established by slightly adapting the arguments of [Bir14] to the case p ≠ 2. It is actually attained on a two-point problem {f_1, f_2}.

Proof of Proposition 3
Let f_m be the L^2 projection of f on P_r(m), defined by f_m = Σ (∫_0^1 f φ_{I,j} dx) φ_{I,j}, where the sum runs over the intervals I ∈ m and j ∈ {0, . . ., r}. We then apply Pythagoras' theorem. We use ‖φ_{I,j}‖_∞ = √((2j + 1)/|I|) and apply Hölder's inequality twice.

Proof of Theorem 4
We define the collection I_∞ gathering all the dyadic intervals that appear in the partitions m of M_∞, that is, all the intervals with endpoints k2^{−ℓ}, (k + 1)2^{−ℓ} where k ∈ {0, . . ., 2^ℓ − 1} and ℓ ≥ 0. Without loss of generality, we may assume that these intervals are of the form [a, b) when b < 1 and [a, b] when b = 1. We set, for I ∈ I_∞, any collection m of intervals, and ξ > 0, the quantities v^p_ξ(m) and w^p_ξ(m). We begin by proving a uniform risk bound in deviation for the projection estimators:

Lemma 1. For all r ∈ N, f ∈ L^p([0, 1]) and ξ ≥ log 4 + log(r + 1), with probability larger than 1 − (1/2)e^{−ξ}: for all J ∈ I_∞ and m ∈ M_∞, the stated bound holds, where c_1 only depends on p, r and c_2 only depends on p.
Proof of Lemma 1. Let f_{m∨J} be the L^2 projection of f on P_r(m ∨ J). By using the triangle inequality and the elementary inequality (a + b)^p ≤ 2^{p−1}(a^p + b^p), it remains to bound two terms. It follows from standard results about the L^p norm of the L^2 projection operator that the first is controlled up to a constant c depending only on r. We refer for instance to the argument below Lemma 3 of [BM00] for the proof of this inequality. We now tackle the second term in (12). Consider ℓ ≥ 0, and note that the intervals of I_ℓ = {I ∈ I_∞, |I| = 2^{−ℓ}} are of the form [k/2^ℓ, (k + 1)/2^ℓ) (the interval being closed when (k + 1)/2^ℓ = 1). In particular, |I_ℓ| = 2^ℓ. We deduce from Bernstein's inequality (Theorem 2.10 of [BLM13]) that for each interval I ∈ I_ℓ and each j, there exists an event Ω(I, j) of probability 1 − e^{−ξ_ℓ} with ξ_ℓ = 2ξ + 2ℓ log 2 on which the concentration bound holds. We then set Ω_ξ = ⋂_{ℓ≥0} ⋂_{I∈I_ℓ} ⋂_{j∈{0,...,r}} Ω(I, j), and deduce from a union bound that P(Ω_ξ^c) ≤ Σ_{ℓ≥0} 2^ℓ (r + 1) e^{−ξ_ℓ}, which is not larger than (1/2)e^{−ξ} as ξ ≥ log 4 + log(r + 1).
Proof of Lemma 2. As in the previous proof, we may apply Bernstein's inequality to get, with probability 1 − (1/2)e^{−ξ}: for all j ∈ {0, . . ., r}, the stated bound, where ℓ is such that 2^{−ℓ} = |I| and ξ_ℓ = 2ξ + 2ℓ log 2. By using the elementary inequality √(2ab) ≤ a/2 + b, we deduce the desired estimate, which leads to the result after some computations.
Proof of Theorem 8. Up to an increase of κ, the two preceding lemmas assert: with probability 1 − e^{−ξ}, for all m ∈ M_ℓ, J ∈ I_∞ and κ_1, κ_2 ≥ κ, the stated inequalities hold, where c depends on p and r only. In the sequel, we repeatedly use the inequality (a + b)^p ≤ 2^{p−1}(a^p + b^p) without further mention. The triangle inequality, together with (17), gives a first bound. The triangle inequality entails, for all m ∈ M_ℓ, a second bound. We apply (17) together with the inequalities on the penalties. By using (6), taking the infimum over all partitions m ∈ M_ℓ, and applying (7), we conclude. We then use (18) and apply (15).
We use ‖φ_{I,j}‖_∞ = √((2j + 1)/|I|). We now bound A from above by repeatedly using Hölder's inequality as in the proof of Proposition 3. First, we use it twice to get a bound for every q ≥ p. Second, let q' ≥ p/2 be such that 1/q' = 1/q + 1/(p ℓ_m) and, for j ≥ 0, proceed similarly. By combining this result with (20) and (19), we obtain the desired estimate. We now turn to the term w^p_ξ(m). We simply use |φ_{I,j}| ≤ √((2j + 1)/|I|), which ends the proof.
Proof of Theorem 4. We denote by x_+ the positive part of a real number x. We apply Proposition 8 to define an event Ω_ξ of probability 1 − e^{−ξ} on which (16) holds true. Since ξ ≥ 2p log n and w(m) ≥ 1 for all m, (2n)^p e^{−ξ} ≤ 2^p (w(m)/n)^p. We then bound d_p^p(f, f̂_m̂) thanks to (16) and Lemma 3.

Proof of Proposition 5
We introduce, for each interval I, the space P_r(I) = {P 1_I, where P is a polynomial function of degree at most r} of polynomial functions on I. We need the following claim:

Claim 1. Let j ∈ N and let m̄_j be the regular partition of [0, 1] of size 2^j. Then, for all f ∈ B^α_{π,∞}(R), the stated bound holds, where τ = α + 1/p − 1/π and C depends on π, r only.
This claim is a slight revisiting of Lemma 1 of [Aka12] in a unidimensional context. Note that it holds for all f ∈ B^α_{π,∞}(R), whereas her result is restricted to functions f ∈ B^α_{π,π}(R) when π ∈ (1, p). The term in front of R2^{−jτ} is also made more explicit here (which will be of interest for the proof of Theorem 6).
Sketch of the proof of Claim 1. We make the following minor modification to the proof of [Aka12]. Instead of applying Hölder's inequality to her inequality (27), we apply Minkowski's integral inequality. Using her notation, this yields the bound where the term C(d, r, σ, p, q) comes from her equation (19).
If we go back to her calculations, we observe that C(d, r, σ, p, q) does not depend on σ in a unidimensional setting (as σ = H(σ)). Moreover, the dependency on q comes from her equation (20), and can be removed thanks to Theorem 2.6 in Chapter 4 of [DL93]. We therefore write C(r, p) in place of C(d, r, σ, p, q) (as d = 1).
We thus obtain the announced bound, which gives the result.

We now prove the following lemma.
Proof of Lemma 4. We begin by defining two preliminary collections m̄_j, m'_j of disjoint intervals of [0, 1] by induction.
Since (η_j)_j is bounded from below, we may define the smallest ℓ such that m'_ℓ = ∅. We then set m = ⋃_{j=0}^{ℓ} m̄_j. Note that m_j = m̄_j = {I ∈ m, |I| = 2^{−j}} and m ∈ M_ℓ.
We now show the bound on |m'_j| = |m̄_j| for j ≥ 1. For all I ∈ m̄_j, there exists I' ∈ m'_{j−1} such that I ⊂ I'. Therefore, (η_{j−1})^π < d_p^π(f 1_{I'}, P_r(I')), and hence the bound follows. By using m'_{j−1} ⊂ m̃_{j−1} and Claim 1, we deduce the first bound in (21). As to the second one, we merely use that m̄_j ⊂ m̃_j and |m̃_j| = 2^j, which shows (22).
Proof of Proposition 5. We apply Lemma 4. We deduce a partition m with controlled |m'_j| such that the announced bounds hold. In the first case, |m'_j| = 0 when j > (α + δ/p)/(α + δ/p − 1/π) k, which shows that m ∈ M_ℓ. As to the second case, we have |m'_j| = 0 when j > k, and thus m ∈ M_k.

By using |m'_j| ≤ 2^j and (24), we get, in the first case, a bound where C depends on α, δ, p, π.

Proof of Theorem 6
It is well known that Besov spaces embed into L^q spaces, see [DP88, DeV98]. We use here the following lemma, which will be proved after the present proof for the sake of completeness.
We recall that ᾱ is the unique positive solution of (11). Let q̄_α ≥ p be such that 1/q̄_α = 1/q_α + ε, and let 1/δ_α = 2/p − 1/q_α. We apply Proposition 5, with k defined as the smallest integer larger than the prescribed quantity and δ = max{1, δ_α}. We therefore get a partition m ∈ M_ℓ satisfying the bounds of the first point. We use (33) with a concavity argument when δ_α < 1. Lemma 5 implies that M_{q̄_α}(m) ≤ ‖f‖_{q̄_α} ≤ c(1 + R) for some c. Due to our choice of k, we have, for n large enough, using ℓ_m ≤ c' log n for some suitable c', the announced estimate.
We now claim that the term w(m)(ℓ_m + log n)/n in (8) is negligible when n is large. This is indeed straightforward when α ≥ p/π − δ/p, as our bounds for w(m) and v_{q̄_α}(m) are of the same order of magnitude (up to a logarithmic term in case of equality). When α < p/π − δ/p, we use (34) and (35). We apply the claim below (proved after the present proof) to get w(m) ≤ C 2^{kβ(1+α)} for some β < 1.
It follows from the definition of k that, for n large enough,
As β < 1, the exponent is always larger than α/(1 + 2α), and the term is negligible. Since ‖f‖_p ≤ ‖f‖_{q̄_α} ≤ c(1 + R), we may apply Theorem 4 to get the result for n large enough. Suppose now that α ∈ (1/π − 1/p, ᾱ) and α ≤ 1/π. Then, α < δ_α(1/π − 1/p). Since α > 1/π − 1/p, we may suppose that δ_α > 1. Let k be the smallest integer larger than the prescribed quantity. We derive from the second point of Proposition 5 a partition m ∈ M_k such that v_{q_α}(m) ≤ C2^{k(1−1/δ_α)}, w(m) ≤ C2^{k(1−1/p)}, and the bias bound holds. Lemma 5 ensures that ‖f‖_{q_α/(1+q_α/(pk))} ≤ ck(1 + R) for some c depending only on π, α and p. We may thus apply Theorem 4 when n is large enough. By replacing k by (36), we observe that the first three terms are all of the same order of magnitude (up to log terms). We then use Lemma 5 to bound the L^p norm of f in the remaining term.
In the borderline case α = ᾱ and α ≤ 1/π, we proceed as above. We simply apply the third point of Proposition 5 instead of the second one, which leads to additional logarithmic factors in the final result.
Finally, when α = δ_α(1/π − 1/p), we proceed as above, but additional logarithmic factors appear as we apply the third point of Proposition 5.
Proof of Lemma 5. We apply Claim 1 of Section 4.4 with j = 0, p replaced by q, and r defined as the smallest integer larger than α. Then, there is a polynomial function g of degree at most r such that the approximation bound holds, where C depends on α and π only. Since g is a polynomial function, there exists C' depending only on r such that ‖g‖_q ≤ C'‖g‖_1 (see Theorem 2.6 in Chapter 4 of [DL93]). Therefore, using the triangle inequality, we bound ‖g‖_1. We finally put ‖f‖_q ≤ d_q(f, g) + ‖g‖_q, (37) and (38) together.
Proof of Claim 2. We first suppose that δ ≥ 1. Then,
Given m'_{j−1}, we define
m̄_j = {I ∈ m̃_j, ∃I' ∈ m'_{j−1} such that I ⊂ I' and d_p(f 1_I, P_r(I)) ≤ η_j},
m'_j = {I ∈ m̃_j, ∃I' ∈ m'_{j−1} such that I ⊂ I' and d_p(f 1_I, P_r(I)) > η_j}.
Therefore, m̄_j and m'_j are subsets of m̃_j, m̄_j ∪ m'_j is a partition of ⋃_{I'∈m'_{j−1}} I', and d_p(f 1_I, P_r(I)) ≤ η_j for all I ∈ m̄_j, while d_p(f 1_I, P_r(I)) > η_j for all I ∈ m'_j.