On a Nadaraya-Watson Estimator with Two Bandwidths

Abstract: In a regression model, we write the Nadaraya-Watson estimator of the regression function as the quotient of two kernel estimators, and propose a bandwidth selection method for both the numerator and the denominator. We prove risk bounds for both data-driven estimators and for the resulting ratio. The simulation study confirms that both estimators perform well compared to those obtained by cross-validation selection of the bandwidth. However, unexpectedly, in the small-noise context the single-bandwidth cross-validation estimator turns out to be much better than the ratio of the two previous good estimators, while the two methods perform similarly in models with large noise.


Introduction
Consider n ∈ N\{0} independent random variables X_1, . . . , X_n having the same probability density f with respect to Lebesgue's measure. Consider also the random variables Y_1, . . . , Y_n defined by

Y_i = b(X_i) + ε_i ; i ∈ {1, . . . , n},

where b is a measurable function from R into itself and ε_1, . . . , ε_n are n i.i.d. centered random variables with variance σ² > 0, independent of X_1, . . . , X_n.
Since Nadaraya [15] and Watson [20], a lot of consideration has been given to the estimator of b defined by

b̂_{n,h}(x) := (Σ_{i=1}^n Y_i K_h(x − X_i)) / (Σ_{i=1}^n K_h(x − X_i)), with K_h(·) := (1/h)K(·/h),

where K : R → R is a kernel and h > 0 is the bandwidth. This estimator has also been viewed as a weighted estimator: for K ⩾ 0,

b̂_{n,h}(x) = Σ_{i=1}^n W_{i,h}(x) Y_i, with W_{i,h}(x) := K_h(x − X_i) / Σ_{j=1}^n K_h(x − X_j),

and is often called "local average regression". It is studied e.g. in Jones and Wand [11] or Györfi et al. [8], and defined in Tsybakov [19]. Recent papers still propose methods to improve the estimation, see Chang et al. [3]. Several strategies have been proposed to select the bandwidth in a data-driven way. Cross-validation based on the leave-one-out principle is one of the most standard methods to perform this choice (see Györfi et al. [8]), even if many refinements have been proposed. Optimal rates depend on the regularity of the function b and were first established by Stone [18]: roughly speaking, they are of order O(n^{−p/(2p+1)}) for b admitting p derivatives. From a theoretical point of view, the rates of the adaptive final estimator are not always given, nor proved.
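As a minimal illustration (our own sketch, not the paper's implementation; the function name and the choice of a Gaussian kernel are ours), the classical one-bandwidth Nadaraya-Watson estimator can be written in a few lines:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """One-bandwidth Nadaraya-Watson estimator evaluated at the points x.

    Uses the Gaussian kernel; the common factor 1/(n h) cancels in the ratio.
    """
    U = (x[:, None] - X[None, :]) / h
    W = np.exp(-U**2 / 2) / np.sqrt(2 * np.pi)  # K((x - X_i)/h) for all pairs
    return (W @ Y) / W.sum(axis=1)              # local weighted average of the Y_i
```

With smooth b and a reasonable bandwidth, this local average tracks the regression function closely away from low-density regions.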
In this paper, we re-write the Nadaraya-Watson estimator as the quotient of two kernel estimators, an estimator of bf divided by an estimator of f, each with its own bandwidth (the 2bNW estimator). In a first simulation study with small noise (σ = 0.1), the single-bandwidth cross-validation Nadaraya-Watson estimator unexpectedly turned out to be much better than this ratio, even though its single selected bandwidth yields quite bad estimators of the numerator and of the denominator. This remark is of real interest for practitioners. In a second step, we increased the noise (σ = 0.7), and obtained results indicating that the two methods can have similar Mean Integrated Squared Errors (MISE) in this more difficult context. This, together with the theoretical risk bound we establish for the PCO adaptive 2bNW estimator, implies that the PCO method, applied to both numerator and denominator, remains an interesting bandwidth selection method. Moreover, we believe that both positive and negative results are of interest; detailed tables, explanations and discussion are given in Section 5.
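The two-bandwidth construction can be sketched as follows (a schematic Python illustration under our own naming; the Gaussian kernel and the truncation constant `floor`, which plays the role of the density level m_n discussed in Section 2, are assumptions of ours):

```python
import numpy as np

def _kernel_sum(x, X, weights, h):
    # (1/(n h)) * sum_i weights_i * K((x - X_i)/h), with Gaussian K
    U = (x[:, None] - X[None, :]) / h
    K = np.exp(-U**2 / 2) / np.sqrt(2 * np.pi)
    return (K * weights[None, :]).sum(axis=1) / (len(X) * h)

def two_bandwidth_nw(x, X, Y, h_num, h_den, floor=1e-3):
    """2bNW estimator: an estimator of b*f divided by an estimator of f,
    each kernel estimator having its own bandwidth."""
    bf_hat = _kernel_sum(x, X, Y, h_num)               # estimates (bf)(x)
    f_hat = _kernel_sum(x, X, np.ones(len(X)), h_den)  # estimates f(x)
    return bf_hat / np.maximum(f_hat, floor)           # guard against small denominators
```

Taking `h_num == h_den` recovers the classical Nadaraya-Watson estimator; the point of the paper is to select the two bandwidths separately.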

Bound on the MISE of the 2bNW estimator
First, we state some simple risk bound results in the case of a fixed bandwidth.
Consider β > 0 and ℓ := ⌊β⌋, where ⌊β⌋ denotes the largest integer smaller than β. In the sequel, the kernel K and the density function f fulfill the following assumption.

Assumption 2.1.
(i) The map K belongs to L²(R, dy), K is bounded and ∫_R K(y) dy = 1.

(ii) The density function f is bounded.
Under this assumption, a suitable control of the MISE of bf n,h has been established in Comte [4], Proposition 4.2.1.

Proposition 2.2. Under Assumption 2.1,
where (bf)_h := K_h * (bf) and c_{K,Y} := ‖K‖₂² E(Y₁²). In order to provide a suitable control of the MISE of the 2bNW estimator, we assume that b and f fulfill Assumption 2.3, which requires in particular that bf is bounded. The idea behind Proposition 2.4 is that we cannot pretend to accurately estimate b on domains where few X_i's are observed. Such domains correspond to small levels of the density; for small m_n, the set S_n excludes these cases. Proposition 2.4 gives a decomposition of the risk of the quotient estimator as the sum of the risks of the estimators of the numerator bf and of the denominator f, up to the multiplicative constant 8c_f/m_n². Therefore, the rate of the quotient estimator is, at best, the worst of the rates of the two estimators used to define it (see also Remark 2.5 below). The factor 1/m_n² may imply a global loss with respect to this rate: the smaller m_n is, the larger the loss.
For instance, if f is lower bounded by a known constant f_0 on a given compact set A, then we can take S_n = A and m_n = f_0. In that case, no loss occurs. If f_0 is unknown, we can still bound the risk with S_n = A and 1/m_n² = log(n) for n large enough; a logarithmic loss then occurs in the rate. Remark 2.5. For β, L > 0, consider the Nikol'ski ball H(β, L), defined as the set of ℓ = ⌊β⌋ times continuously differentiable functions ϕ : R → R such that ‖ϕ^{(ℓ)}(· + t) − ϕ^{(ℓ)}‖₂ ⩽ L|t|^{β−ℓ} for every t ∈ R. For instance, for p ∈ N, any function ϕ ∈ C^{p+1}(R) such that supp(ϕ^{(p)}) = [0, 1] and ϕ^{(p+1)} is bounded belongs to such a ball with β = p + 1. Now, assume that bf belongs to H(β_1, L) and f to H(β_2, L). We also assume that the kernel K satisfies Assumption 2.1 and is of order ℓ. Then, it follows from Tsybakov [19], Chapter 1, that ‖(bf)_h − bf‖₂² ⩽ C² h^{2β_1}. This implies that choosing h_opt = c_1 n^{−1/(2β_1+1)} in Proposition 2.2 yields a risk of order n^{−2β_1/(2β_1+1)}, which is a standard optimal rate of estimation on Nikol'ski balls. The same rate holds for the estimation of f under our assumptions, with β_1 replaced by β_2 and h_opt = c_2 n^{−1/(2β_2+1)}. This implies that the quotient estimator converges at the rate n^{−2β/(2β+1)} with β = min(β_1, β_2). So, the rate is optimal if β = min(β_1, β_2) is the regularity of b.
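In LaTeX form, the bias-variance trade-off of Remark 2.5 can be summarized as follows (our reconstruction from the surrounding statements; C gathers the kernel-order constant and the Nikol'ski radius L):

```latex
\mathbb{E}\,\|\widehat{bf}_{n,h}-bf\|_2^2
  \;\leqslant\; \|(bf)_h - bf\|_2^2 + \frac{c_{K,Y}}{nh}
  \;\leqslant\; C\,h^{2\beta_1} + \frac{c_{K,Y}}{nh},
\qquad
h_{\mathrm{opt}} \propto n^{-1/(2\beta_1+1)}
\;\Longrightarrow\;
\mathbb{E}\,\|\widehat{bf}_{n,h_{\mathrm{opt}}}-bf\|_2^2
  = O\!\big(n^{-2\beta_1/(2\beta_1+1)}\big).
```

Minimizing C h^{2β₁} + c_{K,Y}/(nh) in h indeed balances the two terms at h of order n^{−1/(2β₁+1)}.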
However, such bandwidth choices are not possible in practice, as they depend on unknown regularity parameters. Data-driven bandwidth selection methods are designed to automatically reach a squared-bias/variance compromise, inducing the optimal rate if the function under estimation does belong to a regularity space.

A bandwidth selection procedure for the 2bNW estimator based on the GL method
Moreover, we will need the following conditions.

Assumption 3.1. There exists m > 0, not depending on n, such that

(1/n) Σ_{h∈H_n} 1/h ⩽ m,

and for every c > 0, there exists m(c) > 0, not depending on n, such that

Σ_{h∈H_n} e^{−c/h} ⩽ m(c).

Example. Consider the dyadic bandwidths H_n := {2^{−j} ; j = 0, . . . , ⌊log₂(n)⌋}. Then, (1/n) Σ_{h∈H_n} 1/h ⩽ 2^{⌊log₂(n)⌋+1}/n ⩽ 2 and Σ_{j=0}^{⌊log₂(n)⌋} e^{−c 2^j} ⩽ Σ_{j⩾0} e^{−c 2^j} < ∞. Thus, Assumption 3.1 is fulfilled. Consider also bf_{n,h,η}(x) := (K_η * bf_{n,h})(x). We apply the Goldenshluger-Lepski bandwidth selection method to bf_{n,h} by solving the minimization problem

min_{h∈H_n} {A_n(h) + V_n(h)}, with A_n(h) := max_{η∈H_n} (‖bf_{n,h,η} − bf_{n,η}‖₂² − V_n(η))₊ and V_n(h) := υ c_{K,Y}/(nh),   (1)

with υ > 0 not depending on n and h, and c_{K,Y} = ‖K‖₂² E(Y₁²). In the sequel, the solution to the minimization Problem (1) is denoted by ĥ_n.
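A schematic NumPy implementation of the Goldenshluger-Lepski selection for the estimator of bf may look as follows (a sketch under our own assumptions: Gaussian kernel, a regular grid of odd length for the Riemann sums, E(Y₁²) replaced by its empirical counterpart, and the penalty constant `upsilon` left to the user):

```python
import numpy as np

def _gauss(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def _bf_hat(grid, X, Y, h):
    # \hat{bf}_{n,h}(x) = (1/(n h)) sum_i Y_i K((x - X_i)/h), Gaussian K
    return (_gauss((grid[:, None] - X[None, :]) / h) * Y).sum(axis=1) / (len(X) * h)

def _smooth(values, grid, eta):
    # K_eta * values, approximated by a Riemann sum on the regular grid
    # (the grid is assumed regular, with an odd number of points)
    dx = grid[1] - grid[0]
    half = len(grid) // 2
    u = np.arange(-half, half + 1) * dx
    return np.convolve(values, _gauss(u / eta) / eta, mode="same") * dx

def select_gl(X, Y, grid, bandwidths, upsilon=1.0):
    """Goldenshluger-Lepski bandwidth selection for \hat{bf}_{n,h} (schematic)."""
    n = len(X)
    dx = grid[1] - grid[0]
    c_KY = np.mean(Y**2) / (2 * np.sqrt(np.pi))   # ||K||_2^2 E(Y^2) with Gaussian K
    est = {h: _bf_hat(grid, X, Y, h) for h in bandwidths}
    V = {h: upsilon * c_KY / (n * h) for h in bandwidths}
    best, best_crit = None, np.inf
    for h in bandwidths:
        # A_n(h): worst positive part of ||bf_{n,h,eta} - bf_{n,eta}||^2 - V(eta)
        A = max(
            max(0.0, np.sum((_smooth(est[h], grid, eta) - est[eta]) ** 2) * dx - V[eta])
            for eta in bandwidths
        )
        crit = A + V[h]
        if crit < best_crit:
            best, best_crit = h, crit
    return best
```

The square grid in (h, η) visible in the double loop is precisely the computational burden mentioned in Section 4 as a motivation for the PCO method.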
The idea behind the criterion is that A_n(h) is an estimate of the squared bias term ‖(bf)_h − bf‖₂². Theorem 3.2 states that bf_{n,ĥ_n} automatically realizes a compromise between the squared bias (‖(bf)_h − bf‖₂²) and the variance (V_n(h)) terms. The multiplicative constant c, which is larger than one, is the price of the method but preserves the rate. Lastly, the additive quantity c log(n)²/n is negligible with respect to the possible rates of convergence (see Remark 2.5).
We now recall a version of the result proved by Goldenshluger and Lepski [7], which applies to the estimator of f; see also a simplified proof in Comte [4], Section 4.2. Let us consider the analogous criterion for the density estimator, with χ > 0 not depending on n and h. Under Assumptions 2.1 and 3.1, there exist two deterministic constants c′, c″ > 0, not depending on n, such that the corresponding oracle-type risk bound (2) holds. Gathering (2) and Theorem 3.2 yields a Corollary similar to Proposition 2.4.
The comments following Proposition 2.4 and in Remark 2.5 apply here.

A bandwidth selection procedure for the 2bNW estimator based on the PCO method
The Goldenshluger-Lepski method is mathematically very satisfying and provides a rigorous risk bound for the adaptive estimator with random bandwidth. However, it has been acknowledged as difficult to implement, due to the square grid in (h, η) required to compute intermediate versions of the criterion, and to the lack of intuition guiding the choice of the constants υ and χ, which must be calibrated through preliminary simulation experiments, see e.g. Comte and Rebafka [6]. This is why Lacour et al. [13] investigated and proposed a simplified criterion (PCO, for Penalized Comparison to Overfitting) relying on deviation inequalities for U-statistics due to Houdré and Reynaud-Bouret [10]. This inequality applies in our more complicated context, and Lacour-Massart-Rivoirard's result can be extended here as follows.
Let us recall that K_h(·) = (1/h)K(·/h), and let pen(·) denote the penalty term of the PCO criterion (see Lemma 6.1). Let h_min be the smallest bandwidth value in H_n and consider

crit(h) := ‖bf_{n,h} − bf_{n,h_min}‖₂² + pen(h).

Then, let us define ĥ_n ∈ arg min_{h∈H_n} crit(h).
The idea behind the proposal of Lacour et al. [13] is that, instead of comparing each estimator bf_{n,h} to a collection of estimators bf_{n,h,η} for different bandwidths η, it is sufficient to compare it to a single estimator, the one corresponding to the smallest bandwidth. See their Section 3.1 for more heuristic elements. This yields a faster and more efficient numerical procedure.
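The single-reference comparison can be sketched as follows (our own schematic code: Gaussian kernel, empirical plug-in for E(Y₁²), and a penalty shape in 1/h with a user-chosen constant `c`; the exact penalty of Lemma 6.1 is not reproduced here):

```python
import numpy as np

def _gauss(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def _bf_hat(grid, X, Y, h):
    # \hat{bf}_{n,h}(x) = (1/(n h)) sum_i Y_i K((x - X_i)/h), Gaussian K
    return (_gauss((grid[:, None] - X[None, :]) / h) * Y).sum(axis=1) / (len(X) * h)

def select_pco(X, Y, grid, bandwidths, c=1.0):
    """PCO-style selection: compare every estimator to the one computed with
    the smallest bandwidth, plus a penalty increasing in 1/h."""
    n = len(X)
    dx = grid[1] - grid[0]
    h_min = min(bandwidths)
    ref = _bf_hat(grid, X, Y, h_min)              # the overfitting reference
    c_KY = np.mean(Y**2) / (2 * np.sqrt(np.pi))   # ||K||_2^2 E(Y^2) with Gaussian K
    crit = {
        h: np.sum((_bf_hat(grid, X, Y, h) - ref) ** 2) * dx + c * c_KY / (n * h)
        for h in bandwidths
    }
    return min(crit, key=crit.get)
```

Note the single loop over bandwidths, against the double loop of the Goldenshluger-Lepski criterion: this is where the numerical gain comes from.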
In the sequel, in addition to Assumption 2.1, the kernel K, the functions b and f, the distribution of Y₁ and h_min fulfill the following assumption.

Assumption 4.1. The function bf is bounded, and there exists α > 0 such that E(exp(α|Y₁|)) < ∞.
As for Assumption 2.3, note that assuming bf bounded does not require b to be bounded, since most densities decrease fast at infinity. Moreover, the moment condition here, E(exp(α|Y₁|)) < ∞, is stronger than the one required for the Goldenshluger and Lepski method (E(Y₁⁶) < ∞).
Theorem 4.2 states that the estimator bf_{n,ĥ_n} performs as well as the best estimator of the collection, inf_{h∈H_n} E(‖bf_{n,h} − bf‖₂²), up to a factor (1 + ϑ); indeed, the two other terms can be considered negligible. If bf belongs to the Nikol'ski ball H(β₁, L) as in Remark 2.5, then the first right-hand-side term is of order n^{−2β₁/(2β₁+1)}. Since, for h_min = 1/n, ‖(bf)_{h_min} − bf‖₂² is of order n^{−2β₁}, both this term and the last residual term log(n)⁵/n are negligible compared to the first one. Now, we state the result that can be deduced from Lacour et al. [13] for the estimator of f. Let us consider the analogous PCO criterion for the density estimator. By Lacour et al. [13], Theorem 2, there exist two deterministic constants a′, b′ > 0, not depending on n and h_min, such that for every ϑ ∈ (0, 1), a corresponding oracle-type risk bound holds. Again, we can gather this last result and Theorem 4.2 to get the following Corollary.

Corollary 4.3. Let m_n be a positive real number, consider the set S_n as in Proposition 2.4, and let ϑ ∈ (0, 1). Under Assumptions 2.1, 2.3 and 4.1, the 2bNW estimator with PCO-selected bandwidths satisfies an oracle-type risk bound on S_n. The proof of Corollary 4.3 relies on the same arguments as the proof of Corollary 3.3 provided in Section 3.3, and is therefore omitted.

Simulation study
For the noise, we consider ε ∼ σN(0, 1), with σ = 0.1 and σ = 0.7. For the signal, we take either X ∼ N(0, 1) or X ∼ γ(3, 2)/5 (where the factor 5 is set to keep the variance of X of order 1, as in the first case). For the function b, we take four functions b₁, . . . , b₄ with different features and regularities. Figures 1 and 2 illustrate the difference between a sample generated with σ = 0.1 (small noise) and with σ = 0.7 (large noise), compared with the functions to estimate: the first case is easy, the second one is very difficult. Notice that the vertical scales are different.
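Generating such samples is straightforward; a minimal sketch (our own code; reading the second parameter of γ(3, 2) as a scale parameter is an assumption of ours, consistent with the stated variance of order 1 after division by 5):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, b, sigma, design="normal"):
    """Draw a sample (X_i, Y_i) with Y_i = b(X_i) + sigma * eps_i.

    design="normal": X ~ N(0, 1)
    design="gamma":  X ~ Gamma(shape=3, scale=2)/5, as described in the text
    """
    if design == "normal":
        X = rng.normal(size=n)
    else:
        X = rng.gamma(shape=3.0, scale=2.0, size=n) / 5.0
    eps = rng.normal(size=n)
    return X, b(X) + sigma * eps
```

With shape 3 and scale 2, Var(X) = 3·2²/5² = 0.48, indeed of order 1 as in the Gaussian design.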

Estimation of bf
The PCO method is implemented for f and bf with a kernel of order 7, built as a linear combination of Gaussian densities n_j with mean 0 and variance j. Note that, for n_{j,h}(x) := (1/h) n_j(x/h), the convolution of two such rescaled Gaussian densities is again an explicit Gaussian density, so that the required scalar products are available in closed form. The bandwidth of the density estimator is selected as in Comte and Marie [5], by minimizing the corresponding PCO criterion. The L²-norm is computed as a Riemann sum on the interquantile interval, while the penalty is explicit and exact, thanks to Formula (3). The cross-validation (CV) criterion for selecting the bandwidth of bf_{n,h} is computed as follows:

CV(h) := ‖bf_{n,h}‖₂² − (2/n) Σ_{k=1}^n Y_k bf_{n,h}^{(−k)}(X_k),

where N(.) is the Gaussian kernel, also used to compute the estimator bf_{n,h} in this case, and bf_{n,h}^{(−k)} is the leave-one-out estimator computed without the k-th observation. It provides an estimation of ‖bf_h‖₂² − 2⟨bf_h, bf⟩, relying on the idea that, for any function t, the empirical mean (1/n) Σ_k Y_k t(X_k) estimates ⟨t, bf⟩. The chosen bandwidth is the minimizer of CV(h) in the same collection as previously. Tables 1 and 2 give the MISE obtained for 200 repetitions and sample sizes 250, 500 and 1000, for the estimation of bf with the PCO and CV methods, for σ = 0.1 (Table 1) and σ = 0.7 (Table 2). The column "Or" gives the mean of the minimal squared errors for each sample; it requires the unknown true function and represents what could be obtained at best (that is, if the best possible bandwidth was chosen for each sample). We postpone the results with X ∼ γ(3, 2)/5 to Appendix A since they are similar. We can see that the PCO method is globally better than CV, although the difference is not large, and the oracle column shows that the errors have the right order of magnitude, even if not optimal.

Table 1: 100×MISE (with 100×std in parentheses below) for the estimation of bf corresponding to the four examples b₁, . . . , b₄, 200 repetitions, X ∼ N(0, 1) and σ = 0.1. Columns PCO and CV correspond to the two competing methods. "Or" is for "oracle" and gives the average error of the best possible estimator of the collection, computed for each sample.
Table 3 presents the mean of the selected bandwidths for PCO and CV, and allows comparison with the oracle bandwidth, for the same paths and configurations as previously. The conclusion is that, on average, the PCO method over-estimates the oracle bandwidth, while the CV method slightly under-estimates it. Clearly, the slightly-too-large choice gives better results.

Estimation of b
Now, we present the results for the estimation of the regression function b, obtained either with a single-bandwidth estimator, or with the ratio of two adaptive PCO estimators of bf and f .
The PCO estimators are the ones studied above, which proved to be good estimators (see also the study for the estimation of f in Comte and Marie [5]). We simply take a point by point ratio of the two adaptive PCO estimators. The oracle we refer to is computed with the estimator of b obtained as a quotient of the two oracles of bf and f for each path. It is the best performance we can expect with a PCO-ratio strategy.
The one-bandwidth Nadaraya-Watson estimator b̂_{n,h} is computed with the Gaussian kernel N(.). The leave-one-out cross-validation criterion minimized for the bandwidth selection is

CV(h) := (1/n) Σ_{k=1}^n (Y_k − b̂_{n,h}^{(−k)}(X_k))²,

where b̂_{n,h}^{(−k)} denotes the Nadaraya-Watson estimator computed without the k-th observation.
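This leave-one-out criterion can be computed directly by zeroing the diagonal of the kernel weight matrix; a short sketch (our own code, Gaussian kernel as in the text):

```python
import numpy as np

def cv_nw_bandwidth(X, Y, bandwidths):
    """Leave-one-out CV for the one-bandwidth NW estimator:
    CV(h) = (1/n) sum_k (Y_k - \hat b^{(-k)}_{n,h}(X_k))^2, Gaussian kernel."""
    scores = {}
    for h in bandwidths:
        W = np.exp(-((X[:, None] - X[None, :]) / h) ** 2 / 2)
        np.fill_diagonal(W, 0.0)                   # leave the k-th observation out
        pred = (W @ Y) / np.maximum(W.sum(axis=1), 1e-300)
        scores[h] = np.mean((Y - pred) ** 2)
    return min(scores, key=scores.get)
```

Zeroing the diagonal removes each observation from its own prediction, so the n leave-one-out fits are obtained from a single n×n weight matrix per bandwidth.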

Small noise case
We started the study with σ = 0.1, which we expected to be an easy case (see Figure 1). Table 4 presents the results for the estimation of b, either with the CV NW criterion, with the ratio of the PCO estimators of bf and f, or with the ratio of the best estimators of bf and f in the collection. More precisely, the column "Or" here gives the MISE computed with the estimator of b obtained as the quotient of the two oracles of bf and f, in each example and for each sample path. Clearly, the Nadaraya-Watson cross-validation criterion performs much better, by a multiplicative factor between 2 and 6. The variances of the quotient estimators (oracle and PCO) are large, which suggests that the mean performance is deteriorated by a few very bad results. However, the result is puzzling: even the ratio of the two best estimators of the numerator and denominator does not reach the good performance of the single-bandwidth CV method. Table 5 shows in addition that the selected bandwidths are, on average, very small. We can check that the ratio of this bad numerator divided by this bad denominator nevertheless fits the quotient function b well: this is illustrated by Figure 3. It is likely that the errors in the numerator and the denominator compensate each other, resulting in a locally, and thus also globally, better estimate. We can also notice that the selected bandwidths decrease more slowly when n increases (see Table 5) than for the estimator of bf (see Table 3). Our explanation (see the heuristic Remark 5.1 below) is that the risk of the Nadaraya-Watson estimator behaves as C(h^{2α} + σ²/(nh)), for some α > 0 related to the regularity of b, as in the projection least-squares method (see e.g. Baraud [1]). In the small-noise case, σ² makes the variance term negligible, so that the bandwidth selection method aims at making the bias term h^{2α} small.
On the other hand, the risk decomposition of the estimator of bf involves a variance term of order ‖K‖₂² E(Y₁²)/(nh), and in all our examples, the empirical evaluations of E(Y²) lie in the range [0.34, 0.70], making the ratio with σ² = 0.01 between 34 and 70. In other words, the variance term for this estimator is 34 to 70 times larger. This is why it is important to investigate the large-noise case, with a less favorable signal-to-noise ratio.

Large noise case
When setting σ = 0.7, the empirical order of E(Y²) for the four models is between 0.91 and 1.31, which, divided by σ² = 0.49, now gives a value between 1.85 and 2.67. This is much smaller than previously, and corresponds to a more difficult estimation problem, as can be seen from Figure 2.
We now comment on the results given in Table 6. The MISEs are considerably larger, but in Figure 4 we show examples of estimated curves in this case, with the associated orders of MISEs computed over 25 repetitions; they are not as good as in the small-noise case, but still reasonable. The results in Table 6 show that the MISEs now have the same orders, and the oracles can be much better than the results of the Nadaraya-Watson estimator. The selected bandwidths are larger and decrease with n (see Table 7 in Appendix).
The conclusion of this study is that adaptive estimation with kernel estimators and bandwidth selection relying on the PCO method proposed by Lacour et al. [13] gives very good results in theory and in practice, and not only for density estimation. However, for regression function estimation, one bandwidth selected with a criterion directly suited to the regression function is safer than the two different bandwidths selected when considering the Nadaraya-Watson estimator as a quotient of two functions estimated separately. The results are not bad, but the two-bandwidth strategy should be reserved for more complicated contexts where direct estimators of b are not feasible. Then,

and for a nonnegative kernel with compact support [−1, 1], if the regression function b is Lipschitz continuous, then the bias term at x is of order h.
Moreover, for a fixed h > 0, by the law of large numbers, (1/(nh)) Σ_{i=1}^n K²((x − X_i)/h) and ((1/(nh)) Σ_{i=1}^n K((x − X_i)/h))² converge to their deterministic limits. For small h, the first limit has order ‖K‖₂² f(x) and the second one has order f²(x). To sum up, the risk of b̂_{n,h}(x) is heuristically of order Ch² + σ²‖K‖₂²/(nh f(x)). This explains why, for small σ², the variance term gets small and the selection procedure can choose a small bandwidth to make the bias as small as possible.
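The heuristic of Remark 5.1 can be summarized in display form (our reconstruction of the lost formulas, made term-by-term consistent with the two limits just stated):

```latex
\frac{1}{nh}\sum_{i=1}^{n} K^2\!\Big(\frac{x-X_i}{h}\Big)
  \xrightarrow[n\to\infty]{} \frac{1}{h}\,\mathbb{E}\Big[K^2\!\Big(\frac{x-X_1}{h}\Big)\Big]
  \underset{h\to 0}{\sim} \|K\|_2^2\, f(x),
\qquad
\Big(\frac{1}{nh}\sum_{i=1}^{n} K\!\Big(\frac{x-X_i}{h}\Big)\Big)^{\!2}
  \xrightarrow[n\to\infty]{} f_h(x)^2 \underset{h\to 0}{\sim} f(x)^2,
```

so that, combining the squared Lipschitz bias with the ratio of the two limits,

```latex
\mathbb{E}\big[(\widehat b_{n,h}(x)-b(x))^2\big]
  \;\approx\; C\,h^{2} \;+\; \frac{\sigma^{2}\,\|K\|_2^{2}}{n\,h\,f(x)} .
```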

Proof of Proposition 2.4
On the one hand, Inequalities (4) and (5) follow from Comte [4], Proposition 3.3.1, and from Proposition 2.2 respectively; for the proof of Inequality (4), the reader can also refer to Tsybakov [19]. On the other hand, Inequalities (4) and (5) allow us to conclude.

Proof of Theorem 3.2
First, let us prove the following lemma.

Proof. Since E(ε_k) = 0, and X_k and ε_k are independent for every k ∈ {1, . . . , n}, the announced equality holds. Let us now find a suitable control of E(A_n(h)). First of all, for any h, η ∈ H_n, consider the difference bf_{n,h,η} − bf_{n,η}.

F. Comte and N. Marie
On the one hand, On the other hand, let C be a countable and dense subset of the unit sphere of L 2 (R, dx) and consider m(n) > 0. Then, by Lemma 6.1, where, for any ψ ∈ C, In order to apply Talagrand's inequality (see Klein and Rio [12]), we compute bounds.
Then, since (1/n) Σ_{η∈H_n} 1/η ⩽ m, there exists a constant c₄ > 0, not depending on n, such that the previous term is suitably bounded. The same ideas give that there exists a constant c₅ > 0, not depending on n and h, such that a similar bound holds. Therefore, by Inequalities (6)-(9), there exist two deterministic constants c, c′ > 0, not depending on n, such that the announced risk bound holds.

Proof of Corollary 3.3
As established in the proof of Proposition 2.4, Theorem 3.2 and Inequality (2) allow us to conclude.

Proof of Theorem 4.2
The proof relies on three lemmas, which are stated first.

Steps of the proof.
The proof of Theorem 4.2 is divided into three steps.
Step 1. In this step, a suitable decomposition of ‖bf_{n,ĥ_n} − bf‖₂² is provided. On the one hand, ‖bf_{n,ĥ_n} − bf‖₂² + pen(ĥ_n) decomposes through ‖bf_{n,ĥ_n} − bf_{n,h_min}‖₂². On the other hand, a symmetric decomposition holds. Step 2. In this step, let us provide suitable controls of E(ψ_{i,n}(h)) and E(ψ_{i,n}(ĥ_n)), i = 1, 2, 3.

Consider
1. By Lemma 6.2, the control of the first term holds. 2. On the one hand, for every η, η′ ∈ H_n, consider the corresponding comparison terms; on the other hand, combining both gives the control of the second term. 3. By Lemma 6.3, the control of the third term holds. Step 3. Consider

By Step 2, there exists a deterministic constant c_{U,V} > 0, not depending on n, h and h_min, such that the previous bound holds. Then, by Lemma 6.4 and Inequality (10), there exist two deterministic constants c₁, c₂ > 0, not depending on n, h and h_min, such that the announced bound holds. This concludes the proof.
First, note that for every η ∈ H n , For any z, z ∈ R × [−m(n), m(n)], Therefore, a n λ 2 n 2 First, note that for every η ∈ H n , For any η, η ∈ {h, h min } and z ∈ R × [−m(n), m(n)], Therefore, for any θ ∈ (0, 1), • The constant c n . Consider First, note that for every η ∈ H n , For any η, η ∈ {h, h min } and (k, l) ∈ Δ n , Moreover, Then, there exists a universal constant c 1 > 0 such that Therefore, since m(n) is larger than 1, there exists a universal constant c 2 > 0 such that • The constant d n . Consider For any (α, β) ∈ S, with, for every u ∈ R,
Therefore, since |H_n| ⩽ n, there exists a deterministic constant c₇ > 0, not depending on n and h_min, such that the first bound holds. On the other hand, consider the corresponding terms for i = 2, 3, 4, for every z, z′ ∈ R². Consider k, l ∈ {1, . . . , n} such that k ≠ l. By Markov's inequality, there exists a deterministic constant c₈ > 0, not depending on n and h_min, such that the second bound holds. The same ideas give that there exists a deterministic constant c₉ > 0, not depending on n and h_min, such that

E[sup_{h∈H_n} |U_{3,n}(h, h_min)|/n²] ⩽ c₉ log(n)/n.
For i = 4, by Markov's inequality, there exists a deterministic constant c₁₀ > 0, not depending on n and h_min, such that the corresponding bound holds. Therefore,
On the one hand, since K ∈ L¹(R, dy) and bf is bounded, a first bound holds. On the other hand, by Bernstein's inequality, there exists a universal constant c₁ > 0 such that, with probability larger than 1 − 2e^{−λ}, a deviation bound holds; hence, with probability larger than 1 − 2|H_n|² e^{−λ}, it holds for all pairs of bandwidths. For every s ∈ R₊, consider the associated event. Then, for any A > 0, using ∫₀^∞ e^{−s/2} ds = 2 and the fact that there exists a deterministic constant c₃ > 0, not depending on n and h_min, such that m(n, θ) ⩽ c₃ log(n)²/n, taking A := 4c₃ log(n)³/n gives the control of the expectation. Therefore, since |H_n| ⩽ n, there exists a deterministic constant c₄ > 0, not depending on n and h_min, such that the announced bound holds.