Asymptotic validity of bootstrap confidence intervals in nonparametric regression without an additive model

The bootstrap for nonparametric regression has been around for more than 30 years. Nevertheless, most results are based on assuming an additive regression model with independent and identically distributed (i.i.d.) errors. An exception is the Local Bootstrap of Shi [23] for which, however, no bootstrap consistency results are available. We attempt to remedy this here while at the same time showing bootstrap consistency for a more general class of methods that fall under the heading of the Model-free Bootstrap of Politis [18].


Introduction
Consider regression data {(Y_i, X_i)}, i = 1, . . . , n, which are independent and identically distributed (i.i.d.) pairs. The response Y_i is real-valued; for simplicity, we also suppose that X_i is univariate and random with density function f(x), but the method works in the same way when X_i is multivariate. Just assuming that the pairs {(Y_i, X_i)}, i = 1, . . . , n, are i.i.d. can be considered a "Model-free" regression setup where interest lies in nonparametric estimation of the conditional expectation m(x) = E(Y_i | X_i = x).
Nevertheless, practitioners often assume a model equation that relates Y_i to X_i; this would be a "model-based" setup despite the fact that the model could be of a nonparametric nature. A typical model for nonparametric regression with i.i.d. errors is

Y_i = m(X_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1.1)

or, more generally,

Y_i = m(X_i) + \sigma(X_i)\,\epsilon_i, \quad i = 1, \ldots, n,

where the \epsilon_i's are assumed i.i.d. with mean 0 and variance 1. As in the Model-free case, m(x) = E(Y|X = x) is the target of interest, and \sigma^2(x) = \mathrm{Var}(Y|X = x) is of secondary interest. Bootstrap methods based on resampling the residuals of such a model rely on its additive structure; by contrast, the Model-free bootstrap is described in detail in the recent book by Politis [18], but no consistency results have been proven so far. In this paper we provide the first theoretical results on Model-free bootstrap consistency, and we further compare it to the Local bootstrap by simulation. We also reveal a close relationship between the two methods under specified assumptions, i.e., the Local bootstrap is essentially a special case of the Model-free bootstrap. Although both the Local bootstrap and the Model-free bootstrap can be implemented using any form of nonparametric regression estimator, to fix ideas we focus in this paper on the Nadaraya-Watson (N-W) kernel estimator [11]

\hat{m}_{n,h}(x) = \frac{\sum_{i=1}^n Y_i\, K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)}, \qquad (1.2)

whose properties have been very well studied [2,7,10,8]. For example, the conditional cumulative distribution function (CDF) and quantile estimators based on the N-W estimator were studied by Li and Racine [9] and Qu and Yoon [20].
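To make (1.2) concrete, the following is a minimal Python sketch, assuming a Gaussian kernel as in the simulations of Section 7 (the normalizing constant of the normal density cancels in the ratio); the function and variable names are illustrative only.

```python
import numpy as np

def nw_estimate(x, X, Y, h):
    """Nadaraya-Watson estimate of m(x) = E(Y | X = x) with bandwidth h."""
    u = (x - X) / h
    w = np.exp(-0.5 * u ** 2)          # Gaussian kernel weights K((x - X_i)/h)
    return np.sum(w * Y) / np.sum(w)   # locally weighted average of the responses
```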
The following is a brief introduction to the spirit of the Local bootstrap and the Model-free method in the regression problem; more details can be found in Sections 3 and 4.
(i) Local bootstrap. Instead of resampling the data in an i.i.d. fashion, the Local bootstrap resamples data independently but not identically distributed. More specifically, the probability mass function (p.m.f.) that the Local bootstrap uses is first derived from the N-W estimator. Suppose we have n i.i.d. pairs (X_i, Y_i), i = 1, . . . , n, at our disposal. To each X_i, the bootstrap assigns a bootstrap response Y*_i selected from the n observed responses Y_1, . . . , Y_n according to a p.m.f. derived from the kernel weights at X_i (see Section 3). (ii) Model-free bootstrap. The idea of the Model-free bootstrap is to transform the non-i.i.d. data to i.i.d. data first, and then apply i.i.d. resampling. More particularly, for the regression problem, by the probability integral transform we can transform each observation Y_i into an approximately U(0,1) random variable U_i. Ideally the U_i's are i.i.d., so we can resample them independently to obtain bootstrap data U+_i. The last step uses the estimated quantile function of Y|x to transform the U+_i's back; the final bootstrap data at the different X_i are denoted by Y+_i.
The rest of the paper is organized as follows. In Section 2 we state all the assumptions necessary for the theoretical analysis. Sections 3 and 4 give the formal definitions of the Local bootstrap and the Model-free bootstrap, as well as important lemmas. The main results and theorems are presented in Section 5. In Section 6 we discuss the details of parameter selection, such as the choice of the transformation function and of the bandwidths. Simulation results are presented in Section 7. Proofs of most lemmas and theorems can be found in the Appendix.

Assumptions and problem setup
This paper aims to prove consistency of the Model-free bootstrap and the Local bootstrap without the assumption of an additive model; instead, we only assume that the data {(Y_i, X_i)}, i = 1, . . . , n, are i.i.d. pairs. Recall the definition of the N-W estimator (1.2); denote the density function of the independent random variables X_1, . . . , X_n by f(x), and the function of interest by m(x) = E(Y|X = x). Let f_{Y|X}(y|x) be the conditional probability density function (PDF) of Y given X = x, and let D(y|x) be the conditional cumulative distribution function (CDF). Consider the following assumptions.
Assumption A for the r.v. X: (ii) f(x) is twice differentiable, and both the first and second order derivative functions satisfy the Lipschitz condition |g(x) − g(y)| ≤ C|x − y| for some C > 0, where g denotes either f' or f''; (iii) m(x) is twice differentiable, and both the first and second order derivative functions satisfy the same Lipschitz condition.
Assumption B for the r.v. Y: (v) D(y|x) > 0 and is twice differentiable with respect to both y and x in S.

Assumption C for the kernel function K(·):
(i) K(·) is a symmetric, bounded and twice differentiable function.
Assumption D for the sample size n and the bandwidth h:
Under the assumption of i.i.d. pairs, Assumption A (i)-(iv), Assumption C (i)-(iii), and Assumption D (i), Li and Racine [8] proved the results for the estimator (1.2) that are referred to as (2.1) and (2.3) in what follows.
Remark 2.1. Motivation of some of the assumptions: a. In the following sections we need the same assumptions as Li and Racine [8], only with Assumption A (iii) and (iv) excluded; additional assumptions, which are slightly stronger and play a similar role in the proofs, are introduced in Assumption B (i) and (ii). The motivation is that, to show validity of the bootstrap methods without an additive model, conditions are needed not only on X but also on the conditional distribution of Y given X; Assumption B (ii) in particular allows us to apply the Dominated Convergence Theorem. b. Assumption A (iv), B (v) and C (iv) help to show uniform almost sure convergence of an estimator of D(y|x); for more details see Section 4. c. Assumption B (iii) and (iv) are essentially the conditions of the Lyapunov central limit theorem applied in Section 5. d. Assumption D (ii) is essentially an undersmoothing requirement.

Local bootstrap
The Local bootstrap method proposed by Shi [23] was originally designed for the scenario where the regressors x_i are deterministic design points in [0, 1]. In what follows, we address the case where the regressors X_i are random variables. The Local bootstrap procedure can be described as follows: (i) for each observation X_i, construct an estimator of the conditional probability mass function (p.m.f.) of Y given X_i, i.e., estimate the probability distribution of Y given X_i by the discrete distribution that puts mass

P^*(Y^*_i = Y_j \mid X_i) = \frac{K\left(\frac{X_i - X_j}{h}\right)}{\sum_{k=1}^n K\left(\frac{X_i - X_k}{h}\right)}, \quad j = 1, \ldots, n, \qquad (3.1)

on the observed responses Y_1, . . . , Y_n; (ii) draw the bootstrap responses Y^*_1, . . . , Y^*_n independently, with Y^*_i generated from the p.m.f. (3.1) associated with X_i; (iii) compute the bootstrap kernel estimator \hat{m}^*_{n,h}(x) by applying (1.2) to the pairs (X_i, Y^*_i). Shi [23] provided theoretical analyses of the Y^*_i but not the asymptotic properties of \hat{m}^*_{n,h}(x); this gap is filled in Section 5, which requires the following lemma and theorem presented in the rest of the current section.
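As an illustration of steps (i)-(iii), here is a minimal Python sketch assuming the resampling p.m.f. (3.1) with the same bandwidth h that is used for estimation; nw_estimate refers to the illustrative sketch given in the Introduction, and all names are illustrative rather than part of the method's formal definition.

```python
import numpy as np

def local_bootstrap(X, Y, h, x_eval, rng=np.random.default_rng(0)):
    """One Local bootstrap replication: draw Y*_i from the local p.m.f. at X_i,
    then recompute the N-W estimator at x_eval from the pairs (X_i, Y*_i)."""
    n = len(X)
    Y_star = np.empty(n)
    for i in range(n):
        w = np.exp(-0.5 * ((X[i] - X) / h) ** 2)  # kernel weights K((X_i - X_j)/h)
        Y_star[i] = rng.choice(Y, p=w / w.sum())  # independent but not identical draws
    return nw_estimate(x_eval, X, Y_star, h)      # bootstrap estimator m*_{n,h}(x_eval)
```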

Lemma 3.1. Under Assumption B (i)-(iii) and Assumption C (i)-(iii),
where g(·) is a continuous function (in this paper, g(x) = x^k or g(x) = |x|^k, k = 1, 2, 3). Notice that i does not appear on the right-hand side of (3.3), so the O(h^2) term is uniform over i = 1, . . . , n. The lemma is also valid if Y_i is replaced by Y*_i and O(·) by O_p(·) under the same assumptions, giving equation (3.4), where E_{X_i, Y*_i} denotes expectation in the bootstrap world and is thus a conditional expectation in the real world. Lemma 3.1 and (3.4) then lead to the following preliminary result:

Theorem 3.2. Under Assumption B (i)-(iv) and Assumption C (i)-(iii),
where g(·) is the same continuous function as the one in Lemma 3.1, and, similarly to equation (3.4), E* is the expectation in the bootstrap world, and thus a conditional expectation in the real world.

Model-free bootstrap
The principle of all bootstrap methods is to imitate the original sampling that occurred in the real world. If the data are sampled i.i.d., then the bootstrap resamples the data i.i.d. The problem the Local bootstrap solves is that the response data are collected independently but not identically distributed; as a solution, the Local bootstrap assigns different probabilities to the observed responses.
The Model-free bootstrap of Politis [13] uses a different technique: transform the non-i.i.d. data to i.i.d. (or approximately i.i.d.) data first, and then perform i.i.d. resampling. In the Model-free setup of i.i.d. pairs, the transformation can be achieved by the probability integral transform [1]. More specifically, denote the conditional distribution function of Y by

D_x(y) = P(Y \le y \mid X = x), \qquad (4.1)

and assume it is continuous with respect to y; then D_x(Y) follows a Unif(0,1) distribution. Since the true conditional distribution D_x(y) is unknown, it can be estimated by an N-W estimator, i.e.,

\hat{D}_x(y) = \frac{\sum_{i=1}^n \mathbf{1}\{Y_i \le y\}\, K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)}. \qquad (4.2)

Estimator (4.2) is consistent (see Li and Racine [8]) but not continuous in y. A doubly smoothed version of (4.2) is defined by

\bar{D}_x(y) = \frac{\sum_{i=1}^n \Lambda\left(\frac{y - Y_i}{h_0}\right) K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)}, \qquad (4.3)

where Λ(·) is a continuous and strictly increasing CDF over its support. Now define two estimators of the quantile (inverse) function:

\hat{D}^{-1}_x(u) = \inf\{y : \hat{D}_x(y) \ge u\}, \qquad (4.4)

\bar{D}^{-1}_x(u) = \inf\{y : \bar{D}_x(y) \ge u\}. \qquad (4.5)

Both \hat{D}^{-1}_x(·) and \bar{D}^{-1}_x(·) have good properties, see [13,14]; more discussion is given in Section 6. To fix ideas, we use \bar{D}^{-1}_x(·) defined in (4.5) in all our theoretical results; the same results are expected to hold if one uses \hat{D}^{-1}_x(u) instead. In this section, we denote Model-free bootstrap data by Y^+, to distinguish them from the Local bootstrap data Y^*. Politis [13] proposed three ways to generate Model-free bootstrap data Y^+_1, . . . , Y^+_n associated with the same original regressors X_1, . . . , X_n.
Method 1. Obtain transformed data U_1, . . . , U_n by computing U_i = \bar{D}_{X_i}(Y_i). Then sample randomly (with replacement) from the transformed data U_1, . . . , U_n to get bootstrap pseudo-data u^+_1, . . . , u^+_n. Finally, let Y^+_i = \bar{D}^{-1}_{X_i}(u^+_i).
Method 2. Generate u^+_1, . . . , u^+_n i.i.d. from the Unif(0,1) distribution directly, and transform back as in Method 1, i.e., Y^+_i = \bar{D}^{-1}_{X_i}(u^+_i).
Method 3. Let \bar{D}^{(i)}_x(·) denote the estimator \bar{D}_x(·) as computed from the delete-one dataset {(X_j, Y_j), j \ne i, j = 1, . . . , n}. Now define the "predictive" residuals U^{(i)}_i = \bar{D}^{(i)}_{X_i}(Y_i). Then sample randomly from the predictive u-data U^{(1)}_1, . . . , U^{(n)}_n to get u^+_1, . . . , u^+_n, and let Y^+_i = \bar{D}^{-1}_{X_i}(u^+_i).
Remark 4.1. In the rest of this paper, we follow the notation of Politis [18], i.e., we refer to Method 1 as Model-free (MF), Method 2 as Limit Model-free (LMF), and Method 3 as Predictive Model-free (PMF).
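The following Python sketch indicates how Methods 1-3 could be implemented, assuming a Gaussian kernel for K and the standard normal CDF for Λ, and computing the quantile inverse (4.5) by a monotone search over a grid of candidate y-values; all function names are illustrative, and the PMF branch simply recomputes the transformed data on delete-one samples as described above.

```python
import numpy as np
from scipy.stats import norm

def cond_cdf(y, x, X, Y, h, h0):
    """Doubly smoothed conditional CDF estimator (4.3)."""
    w = norm.pdf((x - X) / h)                      # kernel weights in the x-direction
    return np.sum(w * norm.cdf((y - Y) / h0)) / np.sum(w)

def cond_quantile(u, x, X, Y, h, h0, grid):
    """Quantile inverse (4.5): smallest grid value y with bar-D_x(y) >= u."""
    vals = np.array([cond_cdf(g, x, X, Y, h, h0) for g in grid])   # nondecreasing in y
    return grid[min(np.searchsorted(vals, u), len(grid) - 1)]

def model_free_pseudodata(X, Y, h, h0, grid, method="MF", rng=np.random.default_rng(0)):
    """Generate Y+_1,...,Y+_n by MF, LMF or PMF."""
    n = len(X)
    if method == "LMF":
        u_plus = rng.uniform(size=n)               # draw u+ directly from Unif(0,1)
    else:
        if method == "PMF":                        # delete-one ("predictive") transforms
            U = np.array([cond_cdf(Y[i], X[i], np.delete(X, i), np.delete(Y, i), h, h0)
                          for i in range(n)])
        else:                                      # MF: U_i = bar-D_{X_i}(Y_i)
            U = np.array([cond_cdf(Y[i], X[i], X, Y, h, h0) for i in range(n)])
        u_plus = rng.choice(U, size=n, replace=True)   # i.i.d. resampling of the u-data
    return np.array([cond_quantile(u_plus[i], X[i], X, Y, h, h0, grid)
                     for i in range(n)])
```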
For further discussion, see Theorem 4.4 and its proof. We can also modify LMF by using Y^+_i = \hat{D}^{-1}_{X_i}(u^+_i), and expect it to have similar properties, but the discussion is omitted due to space limitations.
Remark 4.3. Let x_f be the point of interest, and denote by \tilde{m}_{n,h}(x_f) the estimator (4.6) proposed by Politis [13].
We noticed that \tilde{m}_{n,h}(x_f) and \hat{m}_{n,h}(x_f) are numerically identical. In fact, by the definition of \hat{D}_x(·) in (4.2) and of \hat{D}^{-1}_x(·),

\hat{m}_{n,h}(x_f) = \int_0^1 \hat{D}^{-1}_{x_f}(u)\, du = E_U\left[\hat{D}^{-1}_{x_f}(U)\right], \qquad (4.7)

where u is a variable of integration, and U is a random variable following the Unif(0,1) distribution whose expectation, conditional on x_f, is denoted by E_U. Notice that (4.6) is an approximation of (4.7); consequently, \tilde{m}_{n,h}(x_f) is also an approximation of \hat{m}_{n,h}(x_f) when the u_i's are random variables drawn from Unif(0,1).
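In other words, since \hat{D}_{x_f} is the weighted empirical distribution that places the N-W weights on the observed responses, and the mean of any distribution equals the integral of its quantile function, we have

\int_0^1 \hat{D}^{-1}_{x_f}(u)\, du \;=\; \sum_{i=1}^n \frac{K\!\left(\frac{x_f - X_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x_f - X_j}{h}\right)}\, Y_i \;=\; \hat{m}_{n,h}(x_f).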
The following resampling algorithm is used to construct Model-free confidence intervals for m(x_f): (a) pick one Model-free method among MF, LMF and PMF to obtain bootstrap pseudo-data Y^+_1, . . . , Y^+_n; (b) based on the pseudo-data {(X_i, Y^+_i), i = 1, . . . , n}, re-estimate the conditional CDF D_x(·), and denote the bootstrap estimates by \hat{D}^+_x(·) and \bar{D}^+_x(·) (the unsmoothed and smoothed CDF estimators described earlier); (c) compute a replicate of the bootstrap confidence interval root. Steps (a)-(c) are repeated B times, and the B bootstrap root replicates are collected in the form of an empirical distribution whose α-quantile is denoted by q(α).
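A compact Python sketch of the loop (a)-(c) for the N-W statistic is given below; since the exact form of the root in step (c) is not reproduced here, the sketch takes the centered root \hat{m}^*_{n,h}(x_f) − \hat{m}_{n,h}(x_f) purely as an illustration, and it reuses the illustrative helpers nw_estimate and model_free_pseudodata from the earlier sketches.

```python
import numpy as np

def mf_bootstrap_interval(x_f, X, Y, h, h0, grid, method="MF", B=500, alpha=0.05,
                          rng=np.random.default_rng(0)):
    """Equal-tailed (1 - alpha) Model-free bootstrap interval for m(x_f)."""
    m_hat = nw_estimate(x_f, X, Y, h)
    roots = np.empty(B)
    for b in range(B):
        Y_plus = model_free_pseudodata(X, Y, h, h0, grid, method=method, rng=rng)
        roots[b] = nw_estimate(x_f, X, Y_plus, h) - m_hat     # one bootstrap root replicate
    q_lo, q_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return m_hat - q_hi, m_hat - q_lo
```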
The following lemmas and theorems show that the MF, LMF and PMF methods have the same rate of convergence in probability to the true m(x) as the Local bootstrap. In fact, LMF is closely related to the Local bootstrap. Notice that the major difference between LMF on the one hand and MF and PMF on the other is how the u^+_i are resampled: LMF draws the u^+_i directly from the Unif(0,1) distribution, while MF and PMF resample the u^+_i from the transformed data U_1, . . . , U_n. The following theorem conveys the intuition behind MF and PMF by supposing that U_1, . . . , U_n are exactly Unif(0,1) distributed; the setting of Theorem 4.5 is thus a hybrid between MF and LMF.

Then, under Assumption B (i)-(iv) and Assumption C (i)-(iii), a result similar to Theorem 3.2 holds, namely:
where g(·) is the same continuous function as in Lemma 3.1.

Lemma 4.6. Under Assumption A (v), Assumption B (i)-(v) and Assumption C (i)-(iv), we have:
Here h and h_0 are two bandwidth parameters, and S is any fixed compact set.
Lemma 4.6 and Polya's Theorem lead to the following results; in particular, under the same assumptions, including Assumption C (i)-(iv), we obtain Corollary 4.7. Note that \hat{D}_x(y) and \bar{D}_x(y) have different orders of almost sure convergence. Fortunately, the order of almost sure convergence is not needed to show pointwise convergence of MF and PMF; we only need Corollary 4.7, which holds and supports the main results for both MF and PMF.
Remark 4.11. Theorem 4.10 is valid for PMF as well. The proof is exactly the same, thus omitted.
Remark 4.12. The proofs of the theorems above are inspired by [6,10,15,21]. The theorems, lemmas and corollaries in this section are the cornerstone of the consistency results in the next section.

Main results
Although Theorem 3.2 and Theorem 4.10 establish the consistency of the bootstrapped sample mean, the main interest of this paper is the consistency of the bootstrapped N-W estimator (4.8) under the Local bootstrap, LMF, MF and PMF. Before moving forward, the following notation and lemmas are needed. By equation (5.1), we obtain a corollary of Lemma 3.1; similarly to (3.4), this result remains valid if Y_i is replaced by the Local bootstrap response Y^*_i, where E^* is the expectation in the bootstrap world and thus a conditional expectation in the real world. Let us rewrite (1.2) and (4.8); the new form is equivalent to the original N-W estimator. For the rest of this paper, we use the superscript * to denote both the Local bootstrap and the Model-free bootstrap, for simplicity. Recalling Lemma 5.1, we have the following theorems:

Theorem 5.3. Under Assumption A (ii) and (v), Assumption B (i)-(v) and Assumption C (i)-(iv),
Finally, by the Lyapunov CLT, we obtain Theorem 5.4 below. Recalling equations (2.1) and (2.3), Theorem 5.4 then yields the corresponding result for the bootstrap distribution. In order to build bootstrap confidence intervals for m(x), according to equation (2.3) we either have to estimate the bias (2.4) or use an under-smoothed bandwidth as given by Assumption D (ii); the latter is recommended in this paper. An equal-tailed (1 − α)100% bootstrap confidence interval can then be constructed as

\left[\, \hat{m}_{n,h}(x) - q^*(1 - \alpha/2), \;\; \hat{m}_{n,h}(x) - q^*(\alpha/2) \,\right],

where q^*(α) denotes the α-quantile of the empirical distribution of \hat{m}^*_{n,h}(x) − E^* \hat{m}^*_{n,h}(x). In the next two sections, we go over the parameter setup and the simulation results.

The transform and transform-back functions
In this section, we discuss the selection of the transform and transform-back functions of the Model-free method, i.e., the choice between the discrete and the smooth conditional CDF kernel estimators. Politis applied the smooth kernel estimator \bar{D}_x(·) in [13,14], see (4.3). The discrete estimator \hat{D}_x(·) in (4.2) was not suggested because it makes the transformed data look unnatural and too equidistant in applications.
Next consider the transform-back function, which maps u^+_i (resampled from U_1, . . . , U_n for the MF and PMF methods, or drawn directly from Unif(0,1) for the LMF method) to the bootstrapped data Y^+|x. As mentioned earlier, all theorems in this paper are proven for the smooth transform-back function \bar{D}^{-1}_x(·). Whether these theorems remain valid when the smooth transform-back function is replaced by the discrete transform-back function \hat{D}^{-1}_x(·) requires further study, but this is very likely, since the two transform-back functions share many of the same properties. In addition, the selection problem between \hat{D}^{-1}_x(·) and \bar{D}^{-1}_x(·) is similar to a question discussed by Efron [3], who was interested in the performance of resampling from the regular empirical distribution function versus from a kernel distribution estimator, and therefore compared the mean squared errors (MSE) of \int g(x)\, d\hat{F}(x) and \int g(x)\, d\bar{F}(x); see [22]. For the sake of simplicity, if we assume that x_1, . . . , x_n are deterministic and do some simple computation, the two MSEs can be compared directly; the second term on the left-hand side of the resulting expression determines which MSE is greater, so that, depending on its sign, one or the other transform-back function is better with respect to MSE. Although this term is extremely difficult to estimate in general, when g(·) is the identity function the MSE difference is negligible as h_0 → 0 (which requires n → ∞). More discussion along these lines can be found in [8,9].

Selection of bandwidth
Since the smooth kernel estimator of the conditional CDF involves two bandwidths h and h_0 (see Lemma 4.6), the Model-free procedure requires optimizing both of them. For h, the cross-validation method proposed by Politis [13] is employed in this paper. For h_0, Politis [13] employed the R function bw.nrd0(), a fast bandwidth selector for density estimation. Li and Racine [8] showed that h_0 = O(n^{-2/5}) and h = O(n^{-1/5}) when minimizing a weighted IMSE. Therefore, running another cross-validation for h_0 does not guarantee the correct order but definitely increases the overall complexity of the algorithm. We adopt the suggestion of Pan and Politis [12], namely simply setting \hat{h}_0 = \hat{h}^2. In our simulations, we find that both selections of h_0 work well and give numerically close outputs. The choice of the bandwidth h, and especially of h_0, requires further study.
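For concreteness, given data arrays X and Y, the bandwidth choice described above might be sketched as follows; the leave-one-out least-squares criterion used here is only one plausible implementation of the cross-validation idea and may differ in detail from the criterion of Politis [13], and nw_estimate is the illustrative helper from the Introduction.

```python
import numpy as np

def cv_bandwidth(X, Y, candidates):
    """Leave-one-out least-squares cross-validation for the N-W bandwidth h."""
    errors = []
    for h in candidates:
        sse = 0.0
        for i in range(len(X)):
            keep = np.arange(len(X)) != i                       # delete the i-th pair
            sse += (Y[i] - nw_estimate(X[i], X[keep], Y[keep], h)) ** 2
        errors.append(sse)
    return candidates[int(np.argmin(errors))]

h = cv_bandwidth(X, Y, candidates=np.linspace(0.2, 3.0, 15))
h0 = h ** 2    # the simple rule h0 = h^2 adopted in this paper
```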

Simulation
In this section we present simulation results for both an additive model and a non-additive model. The results are tabulated in the following two subsections.
For each model, 500 datasets are simulated. Each dataset has sample size n = 100, and the regressors x_1, . . . , x_n are randomly drawn from a normal distribution with mean π and standard deviation π. The kernel function employed by the N-W estimator, \hat{D}_x(·) and \bar{D}_x(·) is always a normal density function. In each cell of the following tables, the first line gives the estimated coverage probability, the second line gives the mean length of the confidence interval, and the third line gives the standard error of the length. "Norm" denotes the normal approximation based on (2.3). "MB" is the model-based bootstrap and "PRMB" is the predictive residual model-based bootstrap, both of which were proposed by Politis [13] for nonparametric regression. "LB" stands for the Local bootstrap. "LMF", "MF" and "PMF" use the statistic defined in equation (4.9); "LMF using N-W" uses the different statistic defined in equation (4.8). All intervals have confidence level 95%.
• The standard error of the reported coverage probabilities over the 500 replications is \sqrt{0.05 \times 0.95/500} ≈ 0.0097.
• Since the true regression function is m(x) = sin(x), the simulation has a symmetry that helps us to interpret the CVRs. To elaborate, note that for any x ∈ [0, π] we have |m(x)| = |m(2π − x)|, and the same symmetry holds for the derivatives of m(x) due to the sinusoidal structure. Because of this symmetry, for any good confidence interval method, \hat{m}_{n,h}(x) at two symmetric points should have exactly the same limit distribution, and hence numerically very close CVRs.
• The PRMB method shows an outstanding CVR but a very large average length and standard error. After inspecting the stored bootstrap data, it turned out that in some bootstrap iterations the bootstrapped values were very large, resulting in a wide confidence interval. Such confidence intervals are apparently outliers and deserve further root-cause analysis.
• Theorem 4.4, the equivalence between the Local bootstrap and LMF, is verified in the simulation. Notice that only "LMF using N-W" produces numerically similar results to the Local bootstrap; a finite-sample difference exists when using "LMF", which can be alleviated by a larger sample size.
• In Table 2, the Laplace error distribution violates the smoothness condition on the conditional CDF in Assumption B (v): the CDF of the Laplace distribution is not twice differentiable at 0, so Lemma 4.6 might fail or the convergence might have a slower rate. This implies that U_1, . . . , U_n calculated from (X_i, Y_i), i = 1, . . . , n, might not be approximately Unif(0,1) distributed; see Fig. 1. This is the primary reason why LMF, MF and PMF show over-coverage at some points in the case of Laplace errors.

Non-additive model
In this subsection we present the simulation results for a non-additive model. For the sake of comparison with Section 7.1, we use the model

Y = \sin(X) + \tfrac{1}{2}\,\epsilon_X, \qquad \epsilon_x = \frac{c_x Z + (1 - c_x) W}{\sqrt{c_x^2 + (1 - c_x)^2}},

where c_x = x/(2π), Z ∼ N(0,1) is independent of W, and W is distributed as \tfrac{1}{2}\chi^2_2 − 1. Thus \epsilon_x has mean 0 and variance 1, so that E(Y|X = x) = \sin(x) and Var(Y|X = x) = 1/4; the regressors x_1, . . . , x_n are again randomly drawn from a normal distribution with mean π and standard deviation π, with sample size n = 100. However, \epsilon_x has a skewness that depends on x, which violates the i.i.d.-error assumption of an additive model. The results are presented in Table 3. PMF seems to perform best among the different methods, although it yields some over-coverage, e.g. at x = π and x = 1.25π.
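Under the description above (where the exact normalisation of ε_x is partly inferred from the stated mean and variance), one replication of the non-additive design could be generated as in the following sketch; all names are illustrative.

```python
import numpy as np

def generate_nonadditive_sample(n=100, rng=np.random.default_rng(0)):
    """One dataset from the non-additive design: Y = sin(X) + 0.5 * eps_X,
    where eps_x mixes a normal and a centered chi-square with x-dependent weights."""
    X = rng.normal(loc=np.pi, scale=np.pi, size=n)      # regressors ~ N(pi, pi^2)
    c = X / (2 * np.pi)                                 # c_x = x / (2*pi)
    Z = rng.normal(size=n)                              # Z ~ N(0, 1)
    W = 0.5 * rng.chisquare(df=2, size=n) - 1           # W = chi^2_2 / 2 - 1: mean 0, variance 1
    eps = (c * Z + (1 - c) * W) / np.sqrt(c ** 2 + (1 - c) ** 2)   # mean 0, variance 1
    Y = np.sin(X) + 0.5 * eps                           # E(Y|X=x) = sin(x), Var(Y|X=x) = 1/4
    return X, Y
```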

Appendix: Proofs
Proof of Lemma 3.1. Since l is a fixed positive integer, by Assumption C (i), K^l(·) is also a symmetric, bounded and twice differentiable function. A direct computation then gives the stated expansion, where A(x) = f_X(x) \int K^l(u)\, du. The O(h^2) in the very last step follows from Assumption B (ii), which together with the dominated convergence theorem allows us to interchange the order of differentiation and integration; by Assumption B (iii), the resulting integral is finite. Notice that the term hidden in the O(h^2) no longer depends on i, so the bound is uniform in i.
Proof of Theorem 3.2. By the definition of big-O in probability, it suffices to establish the corresponding moment bound. First, by the definition of the Local bootstrap algorithm and by Lemma 3.1, we obtain the required expansion of the bootstrap expectation. Since the Y^*_i are independent, Assumption B (iv) also yields the required variance bound, and combining the two gives the result.
Proof of Theorem 4.4. Consider the resampling mechanism of the Local bootstrap at X_i: recalling (3.1), G_{X_i} is the p.m.f. used to resample the data and generate Y^*_i. On the other hand, Remark 4.2 explains that LMF also uses a p.m.f. H_{X_i} to generate Y^*_i. To show the equivalence, we will prove that G_{X_i} = H_{X_i}.
The reason we can define such a p.m.f. for the Model-free method is that, no matter which u^*_i we draw from Unif(0,1), the transform-back function D^{-1}_x(·) always maps it to one of the points {Y_1, . . . , Y_n}; thus for each Y_j there is a well-defined probability of selecting it, and computing these probabilities shows that they coincide with G_{X_i}, so that G_{X_i} = H_{X_i}.
Proof of Theorem 4.5. By the algorithm of MF, each u_i, i = 1, . . . , n, is assumed here to be Unif(0,1) distributed. Let E^* denote expectation in the Local bootstrap world and E^+ expectation in the Model-free bootstrap world. The quantity of interest is bounded by two terms; the second term was already handled in Theorem 3.2. For the first term, by equation (4.7) the relevant expectation is an integral over u drawn from the Unif(0,1) distribution; letting the random vector U = (u_1, . . . , u_n) be independent of u, we obtain the required expansion. Next consider the variance: its second term is bounded, as already shown in Theorem 3.2; for the first term, a direct calculation together with (A.2), Chebyshev's inequality and the definition of big-O in probability leads to the claim.
Proof of Lemma 4.6. Write the quantity of interest as the sum of two terms. By Assumption A (v) and the uniform convergence of \hat{f}, for sufficiently large n we can find another δ such that \inf_{x \in S} \hat{f}(x) ≥ δ > 0. For the second term, Theorem 6.2 in Li and Racine [8] gives the corresponding uniform rate, and by Assumption A (ii) and Assumption B (v), \partial^2 D_x(y)/\partial y^2 and \sup_{x \in S} \partial D_x(y)/\partial y are bounded. It then remains to establish the uniform bound over S. As S is compact, it can be covered by a finite number L_n of intervals {I_k}, k = 1, . . . , L_n, each of length l_n, with L_n = constant/l_n; accordingly we split the supremum into three terms Q_1, Q_2 and Q_3. For Q_2 the supremum can be dropped, because it involves only x_{k,n}, the central point of the interval I_k, and not x. We deal with Q_1 and Q_3 in the last part of the proof and consider Q_2 first. To show that Q_2 = O(η_n) a.s., by the Borel-Cantelli Lemma it suffices to show that the corresponding probabilities are summable. For any η_n > 0, we bound the corresponding probability as follows.

Since K(·) is bounded, let its supremum be A_1; moreover, Λ(·) is a CDF, so |Λ(·)| ≤ 1, and thus Z_{n,i} ≤ 2A_1/(nh). Define λ_n in terms of nh and \ln(n). Here we use the facts that f(x) is bounded and that \int u K^2(u)\, du = 0 since K is symmetric. Because A_2 does not depend on x, taking λ_n = [(nh)\ln(n)]^{1/2} we obtain the bound for Q_2 almost surely. Now consider Q_1 and Q_3. By Assumption C (iv), K(·) is Lipschitz continuous with Lipschitz constant C_1, which controls the increments of the kernel terms. Using the same choice of L_n as above, we choose l_n proportional to h^3 times a factor involving \ln n and n, and conclude that Q_1 is of the required order almost surely.
By exactly the same argument, we can also show the corresponding almost sure bound for Q_3.
Proof of Lemma 4.9. Recall (4.3), the definition of \bar{D}_x(·), and (4.5), the definition of \bar{D}^{-1}_x(·); both are sequences of functions indexed by n. Denote by D_x(·) the function to which \bar{D}_x(·) converges; then D_x(·) and \bar{D}_x(·) are CDFs and one-to-one increasing functions, with inverse functions D^{-1}_x(·) and \bar{D}^{-1}_x(·). Fix q_0 ∈ (0, 1). Clearly D^{-1}_x(·) is a one-to-one increasing function as well, so the corresponding inequality holds; similarly, picking M_2 < y_0, we can show the analogous bound. Returning to (A.5), we obtain the desired inequality. Now, by the same argument as in the proof of the Heine-Cantor theorem, we can find a finite open cover of C, so that finitely many values of N and δ work for all q ∈ C. Then, for every q_0 ∈ C, there exist N = N(ε) > 0 and δ = δ(ε) such that for all n > N and |q − q_0| < δ the required bound holds. To prove the theorem, it then suffices to show the convergence in probability. According to assumptions (i), (v) and (vi), both X and Y have finite expectation and variance, so by Chebyshev's inequality both X and Y are bounded in probability; furthermore, D_x(Y) is obviously bounded in probability as well, i.e., for every ε > 0 there exists M > 0 such that the corresponding probability bound holds, where D^{-1,i}_x(·) denotes the same function D^{-1}_x(·) but evaluated at the realizations x_i, y_i instead of the random variables X_i, Y_i. However, the equality above might fail because D_{x_i}(y_i) can reach 0 or 1, which would send \hat{D}^{-1}_x(·) and \bar{D}^{-1}_x(·) to infinity. Fortunately, we only study these functions in probability, and Y|X = x and D_x(Y) are both bounded in probability: for every ε > 0 there exists M such that the bound holds. Since g(x) = x^k or g(x) = |x|^k with k = 1, 2, 3, it suffices to consider h(x) = x^k.