Wavelet block thresholding for samples with random design: a minimax approach under the $L^p$ risk

We consider the regression model with (known) random design. We investigate the minimax performance of an adaptive wavelet block thresholding estimator under the $\mathbb{L}^p$ risk with $p\ge 2$ over Besov balls. We prove that it is near optimal and that it achieves better rates of convergence than the conventional term-by-term estimators (hard, soft, etc.).


Motivations
In recent years, wavelet thresholding procedures have been widely applied to the field of nonparametric function estimation. They excel in the areas of spatial adaptivity, computational efficiency and asymptotic optimality. Among the various thresholding techniques studied in the literature, there are the term-by-term thresholding rules (hard, soft, . . .) initially developed by Donoho and Johnstone (1995) and the block thresholding rules (global, BlockShrink, . . .) introduced by Kerkyacharian et al. (1996) and Hall et al. (1999).
Several recent works have demonstrated that block thresholding methods can enjoy better theoretical properties than the conventional term-by-term thresholding methods. This superiority has been proved for various statistical models via the minimax approach under the L^2 risk. See, for instance, Cai (1999) and Cavalier and Tsybakov (2001) for the Gaussian sequence model, Cai and Chicken (2005) for density estimation, Chicken (2003) for the regression model with nonequispaced samples and Chicken (2007) for the regression model with random uniform design.
This paper presents an extension of a result established by Chicken (2007). We prove that a generalized version of the BlockShrink construction achieves better rates of convergence than the conventional term-by-term thresholding estimators.
The main contributions of this study concern the two following points:
- The model: we consider the regression model with (known) random design, not necessarily uniform.
- The statistical approach: we adopt the minimax approach under the L^p risk over Besov balls (regular, sparse and critical zones). The parameter p can be any real number greater than or equal to 2.
From a technical point of view, the proof is significantly more complicated than for the uniform design and the case p = 2. We combine a general theorem established by Chesneau (2006, Theorem 4.2) with several probability inequalities, such as the Talagrand inequality and the Borel inequality.
The paper is organized as follows. Section 2 introduces the model, the adopted minimax approach, the wavelet bases and the considered estimator. Section 3 presents the main result, while Section 4 contains its detailed proof.

The model
We observe n pairs of random variables {(X_1, Y_1), . . ., (X_n, Y_n)} governed by the equation

Y_i = f(X_i) + z_i,   i = 1, . . ., n,   (2.1)

where the design variables (X_1, . . ., X_n) are i.i.d. with X_i ∈ [0, 1], and the variables (z_1, . . ., z_n) are i.i.d. Gaussian with mean zero and variance one, independent of (X_1, . . ., X_n). We denote by g the density of X_1. The function f is unknown.
The goal is to estimate f from the observations {(X_1, Y_1), . . ., (X_n, Y_n)}. Additional assumptions on the functions f and g will be specified later (see Theorem 3.1 below).
To estimate f, several adaptive methods have been elaborated (according to the nature of the design). See, for instance, the transformation method of Cai and Brown (1998), the model selection method of Baraud (2002) and the kernel method of Gaïffas (2006).
In this study, we shall consider a particular wavelet thresholding estimator. For the sake of clarity, let us denote this estimator by f̂_n. The performance of f̂_n will be measured under the global L^p risk defined by

R(f̂_n, f) = E^n_f ( ∫_0^1 | f̂_n(x) − f(x) |^p dx ).

Here, p is a real number greater than or equal to 2 and E^n_f is the expectation with respect to the distribution of the observations {(X_1, Y_1), . . ., (X_n, Y_n)}. The unknown regression function f is supposed to belong to a wide class of functions: the Besov balls. Wavelets and Besov balls are presented in the next subsection.

Wavelets and Besov balls
We consider an orthonormal wavelet basis generated by dilations and translations of a compactly supported "father" wavelet φ and a compactly supported "mother" wavelet ψ. For the purposes of this paper, we use the periodized wavelet bases on the unit interval. Let us set the dilated wavelets φ_{j,k} and ψ_{j,k}, and let us denote their periodized versions by φ^per_{j,k} and ψ^per_{j,k}. Then, there exists an integer τ such that the collection ζ = {φ^per_{τ,k}(x), k = 0, . . ., 2^τ − 1; ψ^per_{j,k}(x), j = τ, . . ., ∞, k = 0, . . ., 2^j − 1} constitutes an orthonormal basis of L^2([0, 1]). The superscript "per" will be suppressed from the notation for convenience.
For any integer l ≥ τ, a square-integrable function on [0, 1] can be expanded into a wavelet series. For further details about wavelets, see Meyer (1990) and Cohen et al. (1993).
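The displays for the dilated wavelets, their periodized versions and the wavelet series are standard; assuming the classical construction of Cohen et al. (1993), they read:

```latex
% Dilated wavelets
\phi_{j,k}(x) = 2^{j/2}\,\phi(2^{j}x - k), \qquad
\psi_{j,k}(x) = 2^{j/2}\,\psi(2^{j}x - k).
% Periodized versions on [0,1]
\phi^{\mathrm{per}}_{j,k}(x) = \sum_{l \in \mathbb{Z}} \phi_{j,k}(x - l), \qquad
\psi^{\mathrm{per}}_{j,k}(x) = \sum_{l \in \mathbb{Z}} \psi_{j,k}(x - l).
% Wavelet series of a square-integrable f on [0,1], for any integer l >= tau
f(x) = \sum_{k=0}^{2^{l}-1} \alpha_{l,k}\,\phi_{l,k}(x)
     + \sum_{j=l}^{\infty} \sum_{k=0}^{2^{j}-1} \beta_{j,k}\,\psi_{j,k}(x),
\qquad
\alpha_{l,k} = \int_0^1 f(x)\,\phi_{l,k}(x)\,dx, \quad
\beta_{j,k} = \int_0^1 f(x)\,\psi_{j,k}(x)\,dx.
```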
Since ψ is compactly supported, the following concentration property holds: for any m > 0, there exists a constant C > 0 such that, for any j ≥ τ and any x ∈ [0, 1], (2.2) holds. Now, let us define the main sets of functions considered in our statistical approach. We say that a function f belongs to the Besov ball B^s_{π,r}(M) if and only if there exists a constant M* > 0 (depending on M) such that the associated wavelet coefficients satisfy the corresponding sequence inequality, with the usual modification if r = ∞. We work with the Besov balls because of their exceptional expressive power. For particular choices of the parameters s, π and r, they contain the Hölder and Sobolev balls (see, for instance, Meyer (1990)).
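The wavelet characterization of the Besov ball (a standard equivalence, assuming a wavelet of regularity larger than s) can be written as:

```latex
f \in B^{s}_{\pi,r}(M)
\iff
\left( \sum_{j=\tau}^{\infty}
\left( 2^{\,j\left(s + \frac{1}{2} - \frac{1}{\pi}\right)}
\left( \sum_{k=0}^{2^{j}-1} |\beta_{j,k}|^{\pi} \right)^{1/\pi}
\right)^{r} \right)^{1/r} \le M^{*},
```

with the ℓ^r sum over j replaced by a supremum when r = ∞.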

The estimator
We are now in a position to describe the main estimator of the study. It is an L^p version of the BlockShrink estimator developed by Cai (2002) for the Gaussian sequence model.
Let p ∈ [2, ∞), d ∈ (0, ∞) and let L be the integer L = ⌊(log n)^{p/2}⌋, where ⌊·⌋ denotes the floor function. Let j_1 and j_2 be suitably defined integers. For any j ∈ {j_1, . . ., j_2}, let us set A_j = {1, . . ., 2^j L^{−1}} and, for any K ∈ A_j, the corresponding block of indices B_{j,K}. We define the (L^p version of the) BlockShrink estimator f̂_n by (2.3). This estimator was first defined in this L^p form by Picard and Tribouley (2000) for general statistical models.
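As a rough illustration (not the paper's exact construction: the function name, the default value of d and the hard keep-or-kill rule on whole blocks are assumptions), the block rule at one resolution level can be sketched as follows: coefficients are grouped into non-overlapping blocks of length L = ⌊(log n)^{p/2}⌋, and a block is kept in its entirety when its normalized ℓ^p energy exceeds d n^{−1/2}, and set to zero otherwise.

```python
import numpy as np

def block_shrink(beta_hat, n, p=2.0, d=1.0):
    """Hard block thresholding of the empirical wavelet coefficients of
    one resolution level (hypothetical helper; the name, the default d
    and the keep-or-kill rule are illustrative assumptions).

    Coefficients are grouped into non-overlapping blocks of length
    L = floor((log n)^(p/2)); a block is kept whole when its mean
    l^p energy exceeds d * n^(-1/2), and set to zero otherwise.
    """
    L = max(1, int(np.floor(np.log(n) ** (p / 2.0))))
    out = np.zeros_like(np.asarray(beta_hat, dtype=float))
    for start in range(0, len(beta_hat), L):
        block = np.asarray(beta_hat[start:start + L], dtype=float)
        energy = np.mean(np.abs(block) ** p) ** (1.0 / p)
        if energy > d * n ** (-0.5):
            out[start:start + L] = block
    return out
```

Term-by-term rules make a separate keep/kill decision for every coefficient; deciding at the block level is what removes the extra logarithmic factor in the {π ≥ p} zone of Theorem 3.1.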

Comments:
- The sets A_j and B_{j,K} are chosen such that ∪_{K∈A_j} B_{j,K} = {0, . . ., 2^j − 1}.
- It is easy to show that α̂_{j,k} and β̂_{j,k} are unbiased estimators of α_{j,k} and β_{j,k}, the wavelet coefficients of f. Moreover, they satisfy several probability inequalities which will be at the heart of the proof of the main result. Further details concerning these inequalities are given in Section 4.
- The considered BlockShrink estimator is adaptive, since it does not depend on the smoothness of the unknown function f. However, it depends on the norm parameter p. An open question is: can we construct a block thresholding procedure that is adaptive under the L^p risk for all p?
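The displays defining α̂_{j,k} and β̂_{j,k} are not reproduced above; the classical choice for a known design density g is β̂_{j,k} = n^{−1} Σ_i Y_i ψ_{j,k}(X_i)/g(X_i), which is unbiased because E[Y ψ_{j,k}(X)/g(X)] = ∫_0^1 f(x) ψ_{j,k}(x) dx = β_{j,k}. The sketch below (the Haar wavelet, the uniform design and all function names are illustrative assumptions, not the paper's notation) checks this unbiasedness numerically:

```python
import numpy as np

def haar_psi(x):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def beta_hat(Y, X, g, j, k):
    # Weighted empirical coefficient: the 1/g(X_i) factor compensates
    # the non-uniform design, making the estimator unbiased for beta_{j,k}
    psi_jk = 2.0 ** (j / 2.0) * haar_psi(2.0 ** j * X - k)
    return np.mean(Y * psi_jk / g(X))

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(size=n)            # uniform design, so g = 1
Y = X + rng.standard_normal(n)     # model Y_i = f(X_i) + z_i with f(x) = x
est = beta_hat(Y, X, lambda x: np.ones_like(x), j=0, k=0)
# true coefficient: beta_{0,0} = int_0^{1/2} x dx - int_{1/2}^1 x dx = -1/4
```

With f(x) = x and the Haar ψ, the empirical estimate concentrates around the true value −1/4 at the Monte Carlo rate n^{−1/2}.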

Main result
Theorem 3.1 below determines the rates of convergence achieved by the BlockShrink estimator under the L^p risk over Besov balls.
Theorem 3.1. Let us consider the regression model with random design (2.1). Suppose that:
• the unknown regression function f is bounded from above, i.e. ‖f‖_∞ ≤ M′, where M′ > 0 denotes a known constant;
• the density g of X_1 is known and bounded from above and from below.
Let us consider the BlockShrink estimator f̂_n defined by (2.3) with a large enough threshold constant d. The rates of convergence presented in Theorem 3.1 above are minimax, except in the cases {p > π} ∩ {ε > 0} and {ε = 0}, where there is an extra logarithmic term. They are better than those achieved by the conventional term-by-term thresholding estimators (hard, soft, . . .). The main difference is in the case {π ≥ p}, where there is no extra logarithmic term. Let us mention that Theorem 3.1 can be proved for p ∈ (1, 2) if we consider the BlockShrink estimator (2.3) defined with L = ln n. Further details can be found in Chesneau (2006). Further details about the rates of convergence for the regression problem (2.1) via the minimax approach under the L^p risk over Besov balls can be found in Chesneau (2007).
As mentioned in the motivations of the paper, Theorem 3.1 is an extension of a result proved by Chicken (2007, Theorem 2) for the uniform design, the L^2 risk and the Hölder balls B^s_{∞,∞}(M).
Comments on the choice of the threshold constant d. From a theoretical point of view, it is difficult to determine the exact minimum value of d such that f̂_n achieves the rates of convergence exhibited in Theorem 3.1. In fact, Theorem 3.1 holds for any d ≥ µ_1, where µ_1 refers to the constant of Proposition 4.1 below.

Proof of Theorem 3.1
Thanks to a result proved by Chesneau (2006, Theorem 4.2), the proof of Theorem 3.1 is an immediate consequence of Proposition 4.1 below. This proposition shows that the estimators (β̂_{j,k})_k defined by (2.4) satisfy a standard moments inequality and a specific concentration inequality.
Proposition 4.1. Let p ≥ 2. There exist two constants µ_1 > 0 and C > 0 such that, for any j ∈ {j_1, . . ., j_2}, K ∈ A_j and n large enough, the estimators (β̂_{j,k})_k defined by (2.4) satisfy:
- the following moments inequality (4.1);
- the following concentration inequality (4.2).
The moments inequality has been proved by Chesneau (2007). The proof of the concentration inequality (4.2) combines several concentration inequalities, such as the Talagrand inequality and the Borel inequality. They are recalled in the two auxiliary lemmas below.

Lemma 4.1 (The Talagrand inequality). Suppose that
Then, there exist two absolute constants C_1 > 0 and C_2 > 0 such that, for any t > 0, we have:

Lemma 4.2 (The Borel inequality (see Adler (1990))). Let D be a subset of R.
Let (η_t)_{t∈D} be a centered Gaussian process. Suppose that

Then, for any x > 0, we have

We are now in a position to prove Proposition 4.1. In what follows, C represents a constant which may be different from one term to the other. We suppose that n is large enough.
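For reference, the version of the Borel inequality (often called the Borell–TIS inequality) that matches the quantities N and Q used later in the proof reads (this restatement is mine, not the paper's display):

```latex
% Borell--TIS inequality: (eta_t) centered Gaussian process on D,
% N = E[ sup_{t in D} eta_t ] < infinity,  Q^2 = sup_{t in D} Var(eta_t).
\mathbb{P}\Big( \sup_{t \in D} \eta_t \ge x + N \Big)
\le \exp\!\Big( -\frac{x^{2}}{2Q^{2}} \Big), \qquad x > 0.
```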
Proof of Proposition 4.1. By the definition of β̂_{j,k}, we have the decomposition β̂_{j,k} − β_{j,k} = A_{j,k} + B_{j,k}, where A_{j,k} and B_{j,k} are defined in (4.4) and (4.5). By the ℓ^p-Minkowski inequality, for any µ > 0, we obtain a bound involving two terms U and V. Let us investigate separately the upper bounds for U and V.
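Assuming the weighted form of the estimator, β̂_{j,k} = n^{−1} Σ_{i=1}^n Y_i ψ_{j,k}(X_i)/g(X_i) (an assumption on my part, since the paper's display is not reproduced above), the model Y_i = f(X_i) + z_i splits the error into a design term and a Gaussian term:

```latex
% Hypothetical explicit forms of A_{j,k} and B_{j,k} under the
% assumed weighted estimator
\hat{\beta}_{j,k} - \beta_{j,k} = A_{j,k} + B_{j,k}, \qquad
A_{j,k} = \frac{1}{n}\sum_{i=1}^{n}
   \left( \frac{f(X_i)\,\psi_{j,k}(X_i)}{g(X_i)} - \beta_{j,k} \right),
\quad
B_{j,k} = \frac{1}{n}\sum_{i=1}^{n} \frac{z_i\,\psi_{j,k}(X_i)}{g(X_i)}.
```

This matches the structure of the proof: A_{j,k} is a centered empirical mean of i.i.d. variables handled by the Talagrand inequality, while B_{j,k} is, conditionally on X, a centered Gaussian term handled by the Borel inequality.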
• The upper bound for U. Our goal is to apply the Talagrand inequality described in Lemma 4.1. Let us consider the set C_q and the class of functions F. By an argument of duality, we can express U in terms of A_{j,k}, defined by (4.4), and of r_n, the function defined in Lemma 4.1. Now, let us evaluate the parameters T, H and v of the Talagrand inequality.
First of all, notice that, for p ≥ 2 (and, a fortiori, q = 1 + (p − 1)^{−1} ≤ 2), an elementary inequality of ℓ^p norms applies.
− The value of T. Let h be a function in F. By the Cauchy-Schwarz inequality, the assumptions of boundedness on f and g and the concentration property (2.2), for any x ∈ [0, 1], we find that one can take T = C 2^{j/2}.
− The value of H. The ℓ^p-Hölder inequality and the Hölder inequality imply a first bound. Since (ε_1, . . ., ε_n) are independent Rademacher variables, also independent of X = (X_1, . . ., X_n), the Khintchine inequality yields a second bound. Now, let us consider the i.i.d. random variables (N_1, . . ., N_n). An elementary convexity inequality implies I ≤ 2^{p/2−1}(I_1 + I_2). Let us analyze the upper bounds for I_1 and I_2 in turn.
− The upper bound for I_1. The Rosenthal inequality applied to (N_1, . . ., N_n) and the Cauchy-Schwarz inequality imply a first estimate. For any m ≥ 1, j ∈ {j_1, . . ., j_2} and k ∈ {0, . . ., 2^j − 1}, the assumptions of boundedness on f and g give the required moment bounds. Therefore I_1 ≤ C n^{p/2}.
− The upper bound for I_2. Since E^n_f(N_1) ≤ C, we have I_2 ≤ C n^{p/2}. Combining the obtained upper bounds for I_1 and I_2, we find (4.8). Putting (4.6), (4.7) and (4.8) together, we see that one can take H = C n^{−1/2} L^{1/p}.
− The value of v. By the assumptions of boundedness on f and g and the orthonormality of ζ, we obtain v = C.
Now, let us notice that, for any j ∈ {j_1, . . ., j_2}, the required level condition holds. So, for µ large enough and t = 8^{−1} µ L^{1/p} n^{−1/2}, the Talagrand inequality yields the desired upper bound for U.
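The two classical inequalities invoked above, in the forms typically used in such proofs (the constants C_p depend only on p; this restatement is mine):

```latex
% Khintchine inequality: (epsilon_i) i.i.d. Rademacher, p >= 2, real (a_i)
\mathbb{E}\Big| \sum_{i=1}^{n} a_i \epsilon_i \Big|^{p}
\le C_p \Big( \sum_{i=1}^{n} a_i^{2} \Big)^{p/2}.
% Rosenthal inequality: (N_i) independent, centered, E|N_i|^p < infinity, p >= 2
\mathbb{E}\Big| \sum_{i=1}^{n} N_i \Big|^{p}
\le C_p \Big( \sum_{i=1}^{n} \mathbb{E}|N_i|^{p}
 + \Big( \sum_{i=1}^{n} \mathbb{E} N_i^{2} \Big)^{p/2} \Big).
```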
• The upper bound for V. Here, we apply the Borel inequality described in Lemma 4.2. Let us consider the set C_q and the process Z(a), where B_{j,k} is defined by (4.5). Let us notice that, conditionally on X = (X_1, . . ., X_n), Z(a) is a centered Gaussian process. Moreover, by an argument of duality, we can express V in terms of sup_{a∈C_q} Z(a). Now, let us investigate separately the upper bounds for E^n_f(sup_{a∈C_q} Z(a) | X) and sup_{a∈C_q} Var^n_f(Z(a) | X).
− The upper bound for E^n_f(sup_{a∈C_q} Z(a) | X). Let us consider the set B_µ. Let us work on the set B^c_µ, the complement of B_µ. By the Jensen inequality and the assumptions of boundedness made on g, we find the required bound N.
− The upper bound for sup_{a∈C_q} Var^n_f(Z(a) | X). Let us define the set A_µ. Let us work on the set A^c_µ, the complement of A_µ. Using the assumptions of boundedness of g, we obtain the required bound Q.
The obtained values of N and Q will allow us to conclude. For any x > 0, we have (4.9). The Borel inequality described in Lemma 4.2 implies (4.10). Moreover, by definition of A_µ, we have (4.11). Putting the inequalities (4.9), (4.10) and (4.11) together, for x = 8^{−1} µ L^{1/p} n^{−1/2} and µ large enough, we obtain (4.12). Lemma 4.3 below provides the upper bounds for P^n_f(A_µ) and P^n_f(B_µ).
Lemma 4.3. For µ and n large enough, we have the stated bounds.
By the inequality (4.12), the fact that L = ⌊(log n)^{p/2}⌋ and Lemma 4.3, for µ large enough, we obtain the desired upper bound for V. Combining the obtained upper bounds for U and V, we achieve the proof of Proposition 4.1.

Proof of Lemma 4.3. Let us investigate the upper bounds for P^n_f(B_µ) and P^n_f(A_µ) in turn.
• The upper bound for P^n_f(B_µ). First of all, notice that the relevant random variables are i.i.d. and, since g is bounded from below, they are bounded. So, for any j ∈ {j_1, . . ., j_2}, the Hoeffding inequality implies the existence of a constant C > 0 such that the stated bound holds. We obtain the desired upper bound by taking µ large enough.
• The upper bound for P^n_f(A_µ). The goal is to apply the Talagrand inequality described in Lemma 4.1. Let us consider the set C_q and the class of functions F′, where r_n denotes the function defined in Lemma 4.1. Thus, it suffices to determine the parameters T, H and v of the Talagrand inequality.
− The value of T. Let h be a function of F′. Using the Hölder inequality, the fact that g is bounded from below and the concentration property (2.2), for any x ∈ [0, 1], we find the value of T.
− The value of H. The Cauchy-Schwarz inequality implies (4.13). Since (ε_1, . . ., ε_n) are independent Rademacher variables, also independent of X = (X_1, . . ., X_n), the Khintchine inequality and the fact that g is bounded from below give (4.14). Using the concentration property (2.2) and the inequalities (4.13)-(4.14), we find a bound of order C n^{1/2} 2^j. Hence H = C 2^j n^{−1/2}.
− The value of v. Using the fact that g is bounded from below, the Hölder inequality and the concentration property (2.2), we obtain v = C 2^{2j}. Now, let us notice that, if t = 2^{−1} µ, then the required level condition holds. For any j ∈ {j_1, . . ., j_2} and t = 2^{−1} µ with µ large enough, the Talagrand inequality gives the stated bound. This ends the proof of Lemma 4.3.