Regression in random design and Bayesian warped wavelets estimators

In this paper we deal with the regression problem in a random design setting. We investigate the asymptotic optimality, from a minimax point of view, of various Bayesian rules based on warped wavelets, and show that they nearly attain the optimal minimax rates of convergence over the Besov smoothness classes considered. Warped wavelets were introduced recently; they offer very good computational and easy-to-implement properties while being well adapted to the statistical problem at hand. We put particular emphasis on Bayesian rules based on small and large variance Gaussian priors, and discuss their simulation performances in comparison with a hard thresholding procedure.


Introduction
We observe independent pairs of variables (X_i, Y_i), for i = 1, . . ., n, under the random design regression model Y_i = f(X_i) + ε_i, where f is an unknown regression function that we aim at estimating, and the ε_i are independent normal errors with E(ε_i) = 0 and Var(ε_i) = σ² < ∞. The design points X_i are assumed to be supported in the interval [0, 1] and to have a density g, which will be supposed known. Furthermore, we assume that the design density g is bounded from below, i.e. 0 < m ≤ g, where m is a constant. Many approaches have been proposed to tackle the problem of regression in random design; we mention among others the work of Hall and Turlach [17], Kovac and Silverman [22], Antoniadis et al. [4], Cai and Brown [8], and the model selection point of view adopted by Baraud [6].
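To fix ideas, the model above can be simulated as follows. This is an illustrative sketch, not part of the paper's procedure: the test function f, the design density g(x) = 2x, the noise level, and all numerical values are hypothetical choices.

```python
import numpy as np

def simulate(n, f, sample_design, sigma=1.0, seed=0):
    """Draw (X_i, Y_i) pairs from Y_i = f(X_i) + eps_i with Gaussian errors
    and design points X_i drawn from a density g supported on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = sample_design(rng, n)                # design points in [0, 1]
    y = f(x) + sigma * rng.standard_normal(n)
    return x, y

# Example design density g(x) = 2x on [0, 1]: its CDF is G(x) = x^2, so
# X = sqrt(U) with U uniform has density g (inverse-CDF sampling).
f = np.sin  # a smooth regression function, chosen purely for illustration
x, y = simulate(2000, f, lambda rng, n: np.sqrt(rng.uniform(size=n)), sigma=0.1)
```

Sampling the design through the inverse CDF is convenient here because the warped-wavelet analysis below works precisely with G(X_i), which is then uniformly distributed.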
The present paper provides a Bayesian approach to this problem based on a warped wavelet basis. Warped wavelet bases {ψ_jk(G), j ≥ −1, k ∈ Z} in regression with random design were recently introduced by Kerkyacharian and Picard in [20]. The authors proposed an approach that departs as little as possible from standard wavelet thresholding procedures, which enjoy optimality and adaptivity properties. These procedures have been extensively investigated in the case of equispaced samples (see the series of pioneering articles by Donoho et al. [14], [15], [13]). Kerkyacharian and Picard pointed out that expanding the unknown regression function f in the warped basis instead of the standard wavelet basis could be very interesting. Of course, this basis no longer has the orthonormality property; nonetheless, under some conditions it behaves like a standard wavelet basis.
Kerkyacharian and Picard investigated the properties of this new basis and showed that not only is it well adapted to the statistical problem at hand, avoiding unnecessary calculations, but it also offers very good theoretical features while being easily implemented. More recently, Brutti [7] highlighted its easy-to-implement computational properties. The novelty of our contribution lies in the combination of Bayesian techniques and warped wavelets to treat regression in random design. We investigate whether this combination yields optimal theoretical results and promising practical performances, which will prove to be the case. We do not deal with the case of an unknown design density g, which requires further machinery and will be the object of another paper. Bayesian techniques for shrinking wavelet coefficients have become very popular in the last few years. The majority of them were devoted to the fixed design regression scheme. Let us cite, among others, the papers of Abramovich et al. [1], [2], Clyde et al. [10], [11], [12], [5], Chipman et al. [9], Rivoirard [25], and Pensky [24] in the case of i.i.d. errors, not necessarily Gaussian. Most of these works take as prior distribution a mixture of Gaussian distributions. In particular, Abramovich et al.
in [1] and [2] explored the optimality properties of a Gaussian prior mixed with a point mass at zero, which may be viewed as an extreme case of a Gaussian mixture: β_jk ∼ π_j N(0, τ_j²) + (1 − π_j) δ(0), where the β_jk are the wavelet coefficients of the unknown regression function and τ_j² = c_1 2^{−jα} and π_j = min(1, c_2 2^{−jβ}) are the hyperparameters. This particular form was devised to capture the sparsity of the expansion of the signal in the wavelet basis. Our approach consists, at first, in using the same prior but in the context of warped wavelets. In Theorem 1 we show that the Bayesian estimator built using warped wavelets with this prior and this form of hyperparameters achieves the optimal minimax rate, up to a logarithmic term, on the considered Besov functional space. Unfortunately, this Bayesian estimator turns out not to be adaptive: the hyperparameters depend on the Besov smoothness class index. To compensate for this drawback, Autin et al. in [5] suggested considering Bayesian procedures based on Gaussian priors with large variance. Following this suggestion, we will consider priors still specified in terms of a normal density mixed with a point mass at zero, but with large variance Gaussian densities. In Theorem 2 we prove that the Bayesian estimator built with this latter form of prior, still combined with warped wavelets, achieves a nearly optimal minimax rate of convergence while being adaptive. Eventually, our simulation results highlight the very good performances and behaviour of these Bayesian procedures, whatever the regularity of the test functions, the noise level, and the design density, which may be far from the uniform case. This paper is organized as follows. In section 2 the necessary methodology is given: we start with a short review of wavelets and warped wavelets, explain the prior model, and discuss the two forms of hyperparameters we consider. We give in section 3 some definitions of the functional spaces we consider.
In section 4, we investigate the performances of our Bayesian estimators in terms of minimax rates in two cases: the first one where the Gaussian prior has small variance, the second one focusing on Gaussian priors with large variance. Section 5 is devoted to simulation results and discussion. Finally, all proofs of the main results are given in the Appendix.
In this paper, we use decompositions of 1-periodic functions on a wavelet basis of L²[0, 1]. For a given square-integrable function f in L²[0, 1], we consider periodic orthonormal wavelet bases on [0, 1], which allow the series representation f = Σ_{j≥−1} Σ_k β_jk ψ_jk, where we have denoted ψ_{−1,k} = φ_{0,k} the scaling function.
We now give the essential background on warped wavelets, which were introduced in detail in [20]. First of all, let us define the warping function G; it is assumed to be a known function, continuous and strictly monotone from [0, 1] to [0, 1]. Let us expand the regression function f in the following sense: f(x) = Σ_{j≥−1} Σ_k β_jk ψ_jk(G(x)), or equivalently f(G^{−1}(u)) = Σ_{j≥−1} Σ_k β_jk ψ_jk(u), where β_jk = ∫₀¹ f(G^{−1}(u)) ψ_jk(u) du. Hence one immediately notices that expanding f(G^{−1}) in the standard basis is equivalent to expanding f in the new warped wavelet basis {ψ_jk(G), j ≥ −1, k ∈ Z}. This gives a natural explanation of why, in what follows, regularity conditions will be expressed not for f but for f(G^{−1}).
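A minimal numerical sketch of this warping idea, assuming the empirical warped coefficients β̂_jk = (1/n) Σ_i Y_i ψ_jk(G(X_i)) in the spirit of [20], with Haar wavelets chosen purely for illustration: since G(X_i) is uniformly distributed when G is the design distribution function, β̂_jk estimates the coefficient of f(G^{−1}) in the standard basis, with no binning or matrix inversion.

```python
import numpy as np

def haar(j, k, u):
    """Haar wavelet psi_{jk} on [0, 1] (father wavelet for j = -1)."""
    if j == -1:
        return np.ones_like(u)
    v = (2.0**j) * u - k
    return 2.0**(j / 2) * (((0.0 <= v) & (v < 0.5)).astype(float)
                           - ((0.5 <= v) & (v < 1.0)).astype(float))

rng = np.random.default_rng(1)
n = 20000
x = np.sqrt(rng.uniform(size=n))        # design density g(x) = 2x, CDF G(x) = x^2
y = x + 0.1 * rng.standard_normal(n)    # model Y = f(X) + eps with f(x) = x

# Here f(G^{-1}(u)) = sqrt(u), whose father coefficient is
# int_0^1 sqrt(u) du = 2/3; the warped empirical coefficient recovers it.
beta_hat = np.mean(y * haar(-1, 0, x**2))
```

The same average with haar(0, 0, ·) estimates the first genuine wavelet coefficient of f(G^{−1}), illustrating that the whole analysis runs on the uniformized points G(X_i).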

Priors and estimators
In the following, β̂_jk denotes the empirical coefficient of f in the warped basis. As in Abramovich et al. (see [1], [2]), we use the following prior on the wavelet coefficients β_jk of the unknown function f with respect to the warped basis {ψ_jk(G), j ≥ −1, k ∈ Z}: β_jk ∼ π_j N(0, τ_j²) + (1 − π_j) δ(0). Considering the L¹ loss, from this form of prior we derive the following Bayesian rule, which is the posterior median β̃_jk, where Φ is the normal cumulative distribution function. We now introduce the estimator of the unknown regression function f, where J is a parameter which will be specified later. Note that in our case the estimator resembles the usual ones in [5], [1] and [2], except that the deterministic noise variance has been replaced by a stochastic noise level γ²_jk, whose expression is given by (4). This change has a marked impact both on the proofs of the theorems, which now rely on large deviation inequalities, and on the simulation results. Furthermore, such an L¹ rule is of thresholding type: indeed, as underlined in [1] and [2], β̃_jk is null whenever β̂_jk falls below a certain threshold λ_B. Some properties of the threshold λ_B that will be used in the sequel are given in Lemma 1 in the Appendix.
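The thresholding behaviour of the posterior median under a point-mass-plus-Gaussian prior can be checked numerically. The sketch below computes the median of the posterior mixture by bisection on its CDF; it uses a fixed noise variance s2 rather than the paper's stochastic level γ²_jk, and all parameter values in the usage line are hypothetical.

```python
from math import erf, exp, pi as PI, sqrt

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(x, var):
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * PI * var)

def posterior_median(d, pi_j, tau2, s2):
    """Median of beta | d when beta ~ pi_j N(0, tau2) + (1 - pi_j) delta_0
    and d | beta ~ N(beta, s2). The posterior is
    omega * N(mu, v) + (1 - omega) * delta_0."""
    omega = pi_j * norm_pdf(d, s2 + tau2) / (
        pi_j * norm_pdf(d, s2 + tau2) + (1.0 - pi_j) * norm_pdf(d, s2))
    mu = d * tau2 / (s2 + tau2)
    v = s2 * tau2 / (s2 + tau2)
    cdf_left = omega * norm_cdf((0.0 - mu) / sqrt(v))   # posterior mass below 0
    if cdf_left < 0.5 <= cdf_left + (1.0 - omega):
        return 0.0                     # the point mass at zero captures the median
    # otherwise solve omega * Phi((m - mu)/sqrt(v)) + shift = 1/2 by bisection
    shift = (1.0 - omega) if cdf_left + (1.0 - omega) < 0.5 else 0.0
    lo, hi = mu - 10.0 * sqrt(v), mu + 10.0 * sqrt(v)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        val = omega * norm_cdf((mid - mu) / sqrt(v)) + shift
        lo, hi = (mid, hi) if val < 0.5 else (lo, mid)
    return 0.5 * (lo + hi)

# Small observations are set exactly to zero; large ones are shrunk only mildly.
small = posterior_median(0.1, 0.5, 1.0, 1.0)   # -> 0.0
large = posterior_median(5.0, 0.5, 1.0, 1.0)   # close to 5 * tau2/(s2+tau2) = 2.5
```

This makes the "thresholding type" behaviour of the L¹ rule concrete: the median is exactly zero below a data-dependent threshold and strictly nonzero above it.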

Gaussian priors with small variance
In this paper, two cases of hyperparameters will be considered. The first one involves Gaussian priors with small variances. As suggested in Abramovich et al. (see [1], [2]), we set τ_j² = c_1 2^{−jα} and π_j = min(1, c_2 2^{−jβ}), where α and β are non-negative constants and c_1, c_2 > 0. This choice of hyperparameters is exhaustively discussed in Abramovich et al. [2]. The authors stressed that this form of hyperparameters was designed to capture the sparsity of the wavelet expansion. They pointed out the connection between the Besov space parameters and this particular form of hyperparameters, and investigated various practical choices.
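To fix ideas, the level dependence of these hyperparameters can be sketched as follows; the constants c_1, c_2, α, β below are illustrative and not calibrated values from [2].

```python
# Level-dependent hyperparameters of the small-variance regime (10):
#   tau_j^2 = c1 * 2^(-alpha * j),  pi_j = min(1, c2 * 2^(-beta * j)).
# The default constants are hypothetical, chosen only to show the decay.

def hyperparams(j, c1=1.0, c2=1.0, alpha=1.5, beta=1.0):
    tau2_j = c1 * 2.0 ** (-alpha * j)          # prior variance at level j
    pi_j = min(1.0, c2 * 2.0 ** (-beta * j))   # prior weight of nonzero coefficients
    return tau2_j, pi_j

# Both the prior variance and the prior probability of a nonzero coefficient
# shrink geometrically with the resolution level j, encoding sparsity.
schedule = [hyperparams(j) for j in range(5)]
```

The geometric decay in j is exactly what ties these hyperparameters to the Besov smoothness index, as discussed in [2].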
For this case of hyperparameters (10), the estimator of f will be denoted f̂.

Gaussian priors with large variance
The second form of hyperparameters considered in the paper involves Gaussian priors with large variance as suggested in Autin et al. [5].
As a matter of fact, we suppose that the hyperparameters do not depend on j, and we set them as in (11); besides, w_j(n) := w(n). We suppose that there exist q_1 and q_2 such that, for n large enough, (12) holds. This form of hyperparameters was emphasized in [5] in order to mimic heavy-tailed priors such as Laplace or Cauchy distributions. Indeed, Johnstone and Silverman in [18], [19] showed that their empirical Bayes approach for the regular regression setting, with a prior mixing a heavy-tailed density and a point mass at zero, proved fruitful both in theory and in practice. Pensky in [24] also underlined the efficiency of this kind of hyperparameters. We underscore that, contrary to the first form of hyperparameters (10), the latter forms (11) and (12) lead to an adaptive Bayesian estimator. For this case of hyperparameters (11) and (12), the estimator of f will be denoted f̃.

Functional spaces
In this paper, the functional classes of interest are Besov bodies and weak Besov bodies. Let us define them. Using the decomposition (2), we characterize Besov spaces by means of the following norm:

The Besov spaces satisfy the simple embeddings B^{s_1}_{p,q_1} ⊂ B^s_{p,q} for s_1 > s, or for s_1 = s and q_1 ≤ q, and B^s_{p,q} ⊂ B^{s_1}_{p_1,q} for p_1 > p and s_1 ≤ s − 1/p + 1/p_1. The index s indicates the smoothness of the function. Besov spaces capture a variety of smoothness features of a function, including spatially inhomogeneous behavior when p < 2. We recall and stress that in this paper, as mentioned above, the regularity conditions will be expressed for the function f(G^{−1}), due to the warped basis context. More precisely, we shall focus on the space B^s_{2,∞}. We have in particular:

We define the Besov ball of radius R as B^s_{2,∞}(R) = {f : ‖f‖_{s,2,∞} ≤ R}. Let us now define the weak Besov space W(r, 2).

Definition 1. Let 0 < r < 2. We say that a function f belongs to the weak Besov body W(r, 2) if and only if:

And we have the following proposition.

Proposition 1. Let 0 < r < 2 and f ∈ W(r, 2). Then:

For the proof of this proposition see for instance [21]. To conclude this section, we have the embedding B^s_{2,∞} ⊂ W(2/(1 + 2s), 2), which is not difficult to prove (see for instance [21]).
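In terms of the wavelet coefficients β_jk of decomposition (2), the Besov norm admits the following sequence-space form; this is the standard statement from the wavelet literature, supplied here for the reader's convenience rather than reproduced verbatim from the paper.

```latex
% Standard sequence-space form of the Besov norm on wavelet coefficients;
% for q = \infty the outer sum is replaced by a supremum, which gives the
% case B^s_{2,\infty} used in this paper.
\|f\|_{s,p,q} = \Big( \sum_{j \ge -1} \Big( 2^{j(s + 1/2 - 1/p)}
    \Big( \sum_k |\beta_{jk}|^p \Big)^{1/p} \Big)^{q} \Big)^{1/q},
\qquad
\|f\|_{s,2,\infty} = \sup_{j \ge -1} \, 2^{js} \Big( \sum_k \beta_{jk}^2 \Big)^{1/2}.
```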
Corollary 1. If one chooses the prior parameter α = 2s + 1, one gets:

This corollary shows that with this specific choice of the hyperparameter α, one recovers, up to a logarithmic factor, the minimax rate of convergence that one achieves with a uniform design.

Bayesian estimators based on Gaussian priors with large variance
Theorem 2. We consider the model (1). We assume that the hyperparameters are defined by (11) and (12). Set J := J_n such that 2^{J_n} = n/log n; then we have:
It is worthwhile to make some comments about Theorem 2. Here the estimator turns out to be adaptive and, contrary to the similar result in Proposition 2 of [20], we no longer have the limitation s > 1/2 on the regularity index. Moreover, Kerkyacharian and Picard [20] had to stop at the highest level J such that 2^J = (n/log(n))^{1/2}, whereas here we stop at the usual level J_n, with 2^{J_n} = n/log(n), that one gets in standard thresholding.

Simulations and discussion
A simulation study is conducted in order to compare the numerical performances of the two Bayesian estimators based on warped wavelets and on Gaussian priors with small or large variance, described respectively in sections 2.2.1 and 2.2.2, with the hard thresholding procedure using the universal threshold σ√(2 log n) on the warped basis, introduced by Kerkyacharian and Picard [20] for the nonparametric regression model in a random design setting. For more details on the Kerkyacharian and Picard procedure, the reader is referred to Willer [26]; see also [16]. We have decided to concentrate on the procedure of Kerkyacharian and Picard because it is interesting to point out the differences and compare the performances obtained by the Bayesian procedures, which apply local thresholds, and a universal threshold procedure. The main difficulty lies in implementing the Bayesian procedures with the stochastic variance (4). Note also the responses proposed by Amato et al. [3] and Kovac and Silverman [22]. All the simulations in the present paper have been conducted with MATLAB and the Wavelet Toolbox of MATLAB. We consider four test functions of Donoho and Johnstone [13] representing different levels of spatial variability. The test functions are plotted in Fig. 1. For each of the four objects under study, we compare the three estimators at two noise levels, one with signal-to-noise ratio RSNR = 4 and another with RSNR = 7. As in Willer [26], we also consider different cases of design density, which are plotted in Fig. 2. The first two densities are uniform or slightly varying, whereas the last two aim at depicting the case where a hole occurs in the design density. The sample size is n = 1024 and the wavelet used is the Symmlet 8. In order to compare the behaviors of the estimators, the RMSE criterion was retained. More precisely, if f̂(X_i) is the estimated function value at X_i and n is the sample size, then RMSE = ((1/n) Σ_{i=1}^n (f̂(X_i) − f(X_i))²)^{1/2} (17). The RMSE values displayed in Tab. 1 are computed as the average over 100 runs of expression (17). In each run, we hold all factors constant except the design points (random design) and the noise process, which were regenerated. E1 corresponds to the Bayesian estimator based on the Gaussian prior with large variance, E2 to the Bayesian estimator based on the Gaussian prior with small variance, and E3 to the estimator built following the Kerkyacharian and Picard procedure in [20].
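The error criterion of the study can be sketched as follows: a minimal implementation of expression (17) and of its average over repeated runs. The estimators E1, E2, E3 are not reproduced here; the usage line uses a toy placeholder estimator.

```python
import numpy as np

def rmse(f_true, f_hat):
    """RMSE at the design points: sqrt( (1/n) sum_i (fhat(X_i) - f(X_i))^2 ),
    as in expression (17)."""
    f_true, f_hat = np.asarray(f_true), np.asarray(f_hat)
    return np.sqrt(np.mean((f_hat - f_true) ** 2))

def average_rmse(runs):
    """Tab. 1-style figure: mean RMSE over runs, each run providing a pair
    (true values, estimated values) with design and noise regenerated."""
    return float(np.mean([rmse(ft, fh) for ft, fh in runs]))

# Toy usage: a constant estimator of f(x) = x on a uniform grid of size 1024.
grid = np.linspace(0.0, 1.0, 1024)
err = rmse(grid, np.full_like(grid, 0.5))
```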
The following plots compare the visual quality of the reconstructions (see Fig. 3 to Fig. 8). The solid line is the estimator and the dotted line is the true function. We now comment on and discuss the results displayed in Tab. 1, as well as the various visual reconstructions. The performances are always better for the Bayesian estimators, except in the case of the HeaviSine test function. More precisely, the RMSE for Blocks, whatever the noise level and design density, is smaller for Estimator 1; moreover, the RMSE values are almost equal for Estimators 1 and 2 in the case of the Bumps test function, whatever the design density, for a noise level RSNR = 4. This may be due to the irregularity of the Bumps, Blocks and Doppler test functions, which are much rougher than HeaviSine, which is more regular. Indeed, Estimators 1 and 2 tend to detect better the corners of Blocks, the high peaks in Bumps, and the high frequency parts of Doppler, as the graphics show. We may explain this by the fact that Estimators 1 and 2 have level-dependent thresholds, whereas Estimator 3 has a hard universal threshold. As for the reconstructions, one can see that they are slightly better in the case of the Sine density and small noise, whereas there are small deteriorations when a hole occurs in the design density; this change, however, does not affect the visual quality in too large proportions. This fact highlights the interest of "warping" the wavelet basis: warping allows the estimators to still behave correctly when the design density is far from uniform, as in the case of Hole2.

Appendix
In the sequel, C denotes a positive constant which may change from line to line. We also assume, without loss of generality, that σ = 1 in model (1). We have E(γ²_jk) = 1/n, the expression of γ²_jk being given by (4). Let us define the following event:

To make the proofs clearer, we recall the Bernstein inequality that we will use in the sequel (see [23], Proposition 2.8 and formula (2.16)).

Proposition 2. Let Z_1, . . ., Z_n be independent and square-integrable random variables such that, for some nonnegative constant b, Z_i ≤ b almost surely for all i ≤ n, and let v = Σ_{i≤n} E(Z_i²). Then for any positive x, we have:

Lemma 1. Let ς be some positive constant. We have:

Proof of Lemma 1. Let us deal with the first case, j ≤ J_α. To bound P(|γ²_jk − 1/n| > ς/n) we will use the Bernstein inequality and apply Proposition 2. First of all, in order to apply it, we need the value of the sum:

Let us now deal with the second case, j ≤ J_n. To bound P(|γ²_jk − 1/n| > ς/n) we follow the lines of the proof of the first case. Again, according to (21), we have:

The following lemma shows that the properties of the Bayesian estimators f̂ and f̃ can be controlled on the event Ω^δ_n. To lighten the notation in the proof of this lemma, we will write Ω_n for Ω^δ_n and Ω^c_n for the complement of Ω_n.
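For the reader's convenience, a standard form of the inequality stated in Proposition 2 reads as follows; this is the classical Bernstein bound from the literature, stated here under the same assumptions (boundedness by b, v = Σ_{i≤n} E(Z_i²)).

```latex
% Bernstein's inequality in a standard form (see [23]); here
% v = \sum_{i \le n} \mathbb{E}(Z_i^2) and Z_i \le b almost surely.
\mathbb{P}\Big( \sum_{i=1}^{n} \big( Z_i - \mathbb{E}(Z_i) \big) \ge x \Big)
  \;\le\; \exp\!\Big( - \frac{x^2}{2 \, (v + b x / 3)} \Big),
\qquad x > 0 .
```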
Lemma 2. We have:

Proof of Lemma 2.
We have the decomposition:

Let us first deal with the variance term V. The estimator β̃_jk can be written as β̃_jk = w_jk β̂_jk with 0 ≤ w_jk ≤ 1. We have, because 0 ≤ w_jk ≤ 1:

Then, using the Cauchy-Schwarz inequality, we get:

Using (20) and (40), we have:

We recall that 2^{J_n} = n/log(n); accordingly, by choosing ς large enough, we have:

As for the remaining term, we obtain the analogous bound, which completes the proof for the first estimator. The proof for the other estimator is similar: all inequalities hold a fortiori since, in that case, we have P(Ω^c_n) ≤ e^{−Cn^{1−1/α}} (see (19)).
Let us place ourselves in the setting of Theorem 1. We recall that β̃_jk is zero whenever |β̂_jk| falls below a threshold λ_B, and we have the following lemma concerning the behavior of λ_B.

Lemma 3. On the event Ω^δ_n defined by (18), with δ = 1/(2n) and α > 1, we have:

where J_α is taken such that 2^{J_α} = (3/(2n))^{−1/α}.
Proof of Lemma 3. We follow the lines of the proof of Lemma 1 in [1]. On the one hand, we have (see the proof of Lemma 1 in [1]):

where c is some suitable positive constant. Besides, we have 1/(2n) ≤ γ²_jk ≤ 3/(2n), therefore:

where c denotes a positive constant depending on c_1 and c_2, which may differ from one occurrence to another. On the other hand, for the reverse inequality, we have (see the proof of Lemma 1 in [1], page 228, and formula (14) in [1], page 221):

which completes the proof.
Proof of Theorem 1. Let us place ourselves on the event Ω^δ_n defined by (18) with δ = 1/(2n). By the usual decomposition of the MISE into a variance term and a bias term, we get:

We now upper-bound each of the terms A_1, A_2 and A_3, starting with A_1. As specified at the beginning of section 2, we have λ_B ≈ √(log n / n) and b_j ≤ 1 for j ≤ J_α, hence we get:

from which the result follows by again fixing m and ς large enough.