Nonparametric regression with parametric help

In this paper we propose a new nonparametric regression technique. Our proposal has common ground with existing two-step procedures in that it starts with a parametric model. However, our approach differs from others in the choice of the parametric start within the parametric family. Our proposal chooses the function that is the projection of the unknown regression function onto the parametric family in a certain metric, while the existing methods select the best approximation in the usual $L_2$ metric. We find that this difference leads to a substantial improvement in the performance of the regression estimator in comparison with direct one-step estimation, irrespective of the choice of the parametric model. This is in contrast with the existing two-step methods, which fail if the chosen parametric model is largely misspecified. We demonstrate this with theory and numerical experiments.

The decomposition (1.3), with $\theta_0$ and $m_0$ as given in (1.1), has a projection interpretation. For this, we consider an equivalence relation such that two functions $f_1$ and $f_2$ are equivalent if their difference is a linear function. The space of equivalence classes forms a Hilbert space if we endow it with the inner product $\langle f_1, f_2 \rangle = E\, f_1''(X) f_2''(X)$.
Let $\mathcal{H}_g$ be the space of equivalence classes spanned by $g$, i.e., $\mathcal{H}_g = \{c \cdot g(\cdot) : c \in \mathbb{R}\}$.
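Under this interpretation, $\theta_0 g$ is the projection of $m$ onto $\mathcal{H}_g$ with respect to this inner product. A one-line derivation sketch of the resulting normal equation, which reappears in the next paragraph, is
\[
\langle g, m - \theta_0 g \rangle = E\, g''(X)\{m''(X) - \theta_0 g''(X)\} = 0
\quad\Longleftrightarrow\quad
\theta_0 = \frac{E\, g''(X)\, m''(X)}{E\, g''(X)^2},
\]
so that $m_0 = m - \theta_0 g$ is orthogonal to $\mathcal{H}_g$, i.e., $E\, g''(X)\, m_0''(X) = 0$.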
By estimating $m$ through the decomposition (1.3), as described in the next section, we gain substantial room for reducing the bias. In this paper we demonstrate the advantage with a local linear smoother, but the main idea can be extended to other local smoothers; see Remark 1 in Section 2. The conventional local linear estimator of $m$ with a bandwidth $b$ has the asymptotic bias $b^2 c_K m''(x)/2$, with a constant $c_K$ depending on the kernel of the local linear smoother, while our new approach based on the decomposition (1.3) gives $b^2 c_K m_0''(x)/2$; see Proposition 1. This implies a reduction in the asymptotic average squared error since
\[
E\, m''(X)^2 = E\,\{\theta_0 g''(X) + m_0''(X)\}^2 = \theta_0^2\, E\, g''(X)^2 + E\, m_0''(X)^2 \;\ge\; E\, m_0''(X)^2. \tag{1.4}
\]

Our approach is related to the existing literature in which two-step procedures, consisting of a parametric and a nonparametric fit of the data, have been proposed. These include Hjort and Glad (1995), Glad (1998), Gozalo and Linton (2000), Rahman and Ullah (2002), Fan et al. (2009) and Talamakrouni et al. (2015, 2016). All these papers consider an approach that first finds a pilot estimator of a parametric model, assuming the chosen parametric model is correct, and then updates the parametric fit by a nonparametric adjustment. This is done by an additive, multiplicative or more general adjustment based on nonparametric fits of the data or of the residuals from the parametric fit. The success of these two-step procedures turns out to depend highly on the choice of the pilot parametric model, which we illustrate in Section 3. Our approach differs from these in that we do not fit a parametric model in the first step, but instead estimate $\theta_0$ such that $E\, g''(X)\{m''(X) - \theta_0 g''(X)\} = 0$. By doing this we can always reduce the bias for any choice of $g$ with $E\, g''(X)^2 > 0$, as is seen from (1.4).

The estimation of the model (1.3) is also of independent interest, as it answers the question of what happens in the estimation of the partially linear model $Y = \theta_0 g(Z) + m_0(X) + \varepsilon$ when the two covariates $X$ and $Z$ are identical or nearly coincide. Indeed, we use the profiling technique (Severini and Wong, 1992) to estimate (1.3), which is known to be a useful technique for fitting partially linear models. Our discussion in this paper can be generalized to more complex semiparametric models, such as generalized partially linear models and generalized partially linear additive models, with common covariates in the parametric and nonparametric components. In these models one may also allow specifications of the parametric part $g(\theta, X)$ in which the parameter $\theta$ does not enter linearly. In this paper, to avoid technical difficulties and to make the presentation transparent, we focus our discussion on the model (1.3), where $g(\theta, X)$ is linear in $\theta$. For simplicity we also assume that the covariate $X$ is univariate.

This paper is organized as follows. In the next section we discuss the estimation of $m$ based on the decomposition (1.3) and develop its asymptotic theory. In Section 3 we present numerical evidence that supports the theory. Proofs are deferred to the Appendix.

2 Methodology and Theory
Our estimation procedure consists of two steps. In the first step, the parameter $\theta_0$ is estimated by an estimator $\hat\theta$. A choice of $\hat\theta$ will be discussed below. In the second step, a local smoother is applied to regress $Y - \hat\theta g(X)$ onto $X$. The result of the second step is our estimator of $m_0$. We take a local linear regression estimator as the local smoother.
Specifically, let $S_b U$ denote the local linear kernel smoother with a baseline kernel function $K$ and a bandwidth $b$, taking $X$ as the predictor and $U$ as the response. It can be written as $(S_b U)(x) = n^{-1}\sum_{i=1}^n w_b(X_i, x) U_i$ for suitable local linear weights $w_b$. We then set $\hat m_b(\cdot, \theta) = S_b\{Y - \theta g(X)\}$, the smoother applied to the partial residuals $Y_i - \theta g(X_i)$, for each $\theta$. We propose
\[
\hat m = \hat\theta g + \hat m_b(\cdot, \hat\theta) \tag{2.1}
\]
as an estimator of $m = \theta_0 g + m_0$.
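For readers who prefer a computational view, the following is a minimal NumPy sketch of the second step. The Epanechnikov kernel and all function names (epanechnikov, local_linear, m_hat) are illustrative choices rather than the paper's own code, boundary and empty-window issues are ignored, and the first-step estimate theta_hat is taken as given here; its computation by profiling is sketched later.

import numpy as np

def epanechnikov(t):
    # Compactly supported kernel on [-1, 1], consistent with assumption (A3).
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def local_linear(x_eval, X, U, b, kernel=epanechnikov):
    """(S_b U)(x): local linear kernel smoother with predictor X and response U."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    out = np.empty(x_eval.shape)
    for j, x in enumerate(x_eval):
        d = X - x
        k = kernel(d / b)
        s1, s2 = np.sum(k * d), np.sum(k * d ** 2)
        w = k * (s2 - d * s1)          # equivalent local linear weights at x
        out[j] = np.sum(w * U) / np.sum(w)
    return out

def m_hat(x_eval, X, Y, g, theta_hat, b):
    """Estimator (2.1): hat m = theta_hat * g + hat m_b(., theta_hat); g must accept arrays."""
    m0_hat = local_linear(x_eval, X, Y - theta_hat * g(X), b)
    return theta_hat * g(np.atleast_1d(np.asarray(x_eval, dtype=float))) + m0_hat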
The difference between our proposal and the existing two-step procedures is in the first step. For a direct comparison between the two approaches, suppose that one chooses a parametric model of the form $\{\theta g(\cdot) : \theta \in \mathbb{R}\}$. Then the existing two-step procedures estimate $\theta^*$, where $\theta^* g$ is the best approximation of the true regression function $m$ in the usual $L_2$ metric, so that $\theta^* = E\, m(X) g(X)/E\, g(X)^2$, while ours estimates $\theta_0$ as defined in (1.1).
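To make the contrast explicit, both pilot coefficients are projections of $m$, but in different metrics; the second characterization restates the orthogonality condition of the Introduction:
\[
\theta^* = \arg\min_{\theta \in \mathbb{R}} E\{m(X) - \theta g(X)\}^2,
\qquad
\theta_0 = \arg\min_{\theta \in \mathbb{R}} E\{m''(X) - \theta g''(X)\}^2 .
\]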
We now discuss the statistical properties of $\hat m$ at (2.1). Our first result states that $\hat m$, as an estimator of $m = \theta_0 g + m_0$, behaves like the estimator $\hat m_b(\cdot, \theta_0)$ of $m_0$ that utilizes the knowledge of $\theta_0$, and that for this it suffices to have a consistent estimator $\hat\theta$ of $\theta_0$, i.e.,
\[
\hat\theta - \theta_0 = o_P(1). \tag{2.2}
\]
In particular, it is not required that $\hat\theta$ approximate $\theta_0$ at a certain rate of convergence.
For stating this result we make use of the following assumptions.
(A2) The function $g$ and the true regression function $m$ have continuous second-order derivatives and fulfill $0 < E\, g''(X)^2 < \infty$ and $E\, m_0''(X)^2 < \infty$.
(A3) The kernel $K$ is a probability density function with compact support, say $[-1, 1]$.
(A4) For the bandwidth $b$ it holds that $b \to 0$ and $nb \to \infty$.
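For orientation, a standard bandwidth order is compatible with (A4), and also with the additional condition (A5) imposed below on the second bandwidth; a minimal check, for a constant $c > 0$, is
\[
b = c\, n^{-1/5} \;\Longrightarrow\; b \to 0, \qquad nb = c\, n^{4/5} \to \infty, \qquad nb^4 = c^4\, n^{1/5} \to \infty,
\]
so the common choice $h = b \asymp n^{-1/5}$ used with cross-validation in Section 3 is admissible for both conditions.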
Proposition 1. Assume (A1)-(A4) and that the estimator $\hat\theta$ fulfills (2.2). Then, it holds that … .

The proposition demonstrates that the asymptotic variance and bias of $\hat m$, as an estimator of $m$, are the same as those of $\hat m_b(\cdot, \theta_0)$ as an estimator of $m_0$. The asymptotic variance equals that of the direct estimator $S_b Y$. However, the asymptotic bias of $\hat m$ is $b^2 c_K m_0''(x)/2$, whereas that of $S_b Y$ is $b^2 c_K m''(x)/2$. Thus, the average squared bias of $\hat m$ is smaller than that of $S_b Y$; see (1.4). To maximize the reduction of the bias, one may choose $g \in \mathcal{G}$ that maximizes $\theta_0^2\, E\, g''(X)^2$, which is equivalent to choosing $g$ that minimizes
\[
E\{m''(X) - \theta_0 g''(X)\}^2 = E\, m_0''(X)^2. \tag{2.3}
\]

Remark 1. The main idea behind the bias reduction implied by Proposition 1 can be applied to other local smoothers. For example, in the case of the $p$th order local polynomial smoother with an odd $p$, we choose a function $g$ such that $E\, g^{(p+1)}(X)\{m^{(p+1)}(X) - \theta_0 g^{(p+1)}(X)\} = 0$, where for a function $\eta$, $\eta^{(k)}$ denotes its $k$th derivative. Then there is a unique decomposition $m = \theta_0 g + m_0$ with $E\, g^{(p+1)}(X)\, m_0^{(p+1)}(X) = 0$. In this case, the asymptotic bias of the resulting estimator involves $m_0^{(p+1)}$ in place of $m^{(p+1)}$, so the same bias reduction argument applies.

It remains to find a consistent estimator of $\theta_0$. Recall that the $\theta_0$ we need to estimate is the one that fulfills $E\, g''(X)\, m''(X, \theta) = 0$ among all $\theta$ in the decompositions $m = \theta g + m(\cdot, \theta)$, where $m(x, \theta) = m(x) - \theta g(x)$. We achieve this by using the profiling technique. The profiling technique has been proposed for the partially linear model $Y = \theta_0 g(Z) + m_0(X) + \varepsilon$ with $Z \neq X$. The profile least squares estimator of $\theta_0$ is given by
\[
\hat\theta_h = \frac{\sum_{i=1}^n \{g(X_i) - (S_h g)(X_i)\}\{Y_i - (S_h Y)(X_i)\}}{\sum_{i=1}^n \{g(X_i) - (S_h g)(X_i)\}^2},
\]
where $h$ is a second bandwidth, which may be chosen to be the same as $b$ in (2.1). The next proposition demonstrates that $\hat\theta_h$ is a consistent estimator of $\theta_0$. We need the following additional assumption for the statement of this proposition.
(A5) For the bandwidth $h$ it holds that $h \to 0$ and $nh^4 \to \infty$.

Proposition 2. Assume (A1)-(A5). Then $\hat\theta_h - \theta_0 = o_P(1)$.

This may be deduced from our asymptotic analysis presented in the Appendix. From our propositions we get the following corollary.
… uniformly for $x \in [a_L, a_U]$.
We have again the interpretation that we already formulated after the statement of Proposition 1. Also, by profile estimation we get an estimator of $m = \theta_0 g + m_0$ that optimally chooses one from a class of local linear estimators. Thus, profile estimation works quite well also in the degenerate case $X = Z$ of the partially linear model $Y = \theta_0 g(Z) + m_0(X) + \varepsilon$. The estimator $\hat m = \hat\theta_h g + \hat m_b(\cdot, \hat\theta_h)$ depends on the bandwidths $b$ and $h$. We may take $h = b$ for simplicity and choose the common bandwidth by cross-validation. We employed this strategy in our simulation and found that it worked quite well; see Section 3. To indicate its dependence on $b$ we write $\hat m_b$ for $\hat m$ with $h = b$. Let $\hat b$ denote the minimizer of the corresponding cross-validation criterion. Our estimator of $m$ is then given by $\hat m_{\hat b}$. We will check by simulation in the next section whether the cross-validation approach works.
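For concreteness, the following sketch shows one way to compute the first-step estimator by profiling and to select the common bandwidth. It assumes the closed-form expression for $\hat\theta_h$ displayed above and a leave-one-out cross-validation criterion (the exact criterion is not spelled out here, so this is an assumption); the function names are illustrative and reuse local_linear and m_hat from the earlier sketch.

import numpy as np

def profile_theta(X, Y, g, h):
    """Profile least-squares estimate of theta_0 with pilot bandwidth h."""
    g_res = g(X) - local_linear(X, X, g(X), h)   # g - S_h g at the data points
    y_res = Y - local_linear(X, X, Y, h)         # Y - S_h Y at the data points
    return np.sum(g_res * y_res) / np.sum(g_res ** 2)

def loo_cv_score(X, Y, g, b):
    """Leave-one-out CV score for the proposal with common bandwidth h = b (assumed criterion)."""
    n = len(X)
    err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        theta_i = profile_theta(X[keep], Y[keep], g, b)
        pred_i = m_hat(X[i], X[keep], Y[keep], g, theta_i, b)[0]
        err += (Y[i] - pred_i) ** 2
    return err / n

# Illustrative usage: pick b over a grid, then form hat m_{hat b}.
# grid = np.linspace(0.05, 0.5, 10)
# b_hat = min(grid, key=lambda b: loo_cv_score(X, Y, g, b))
# theta_hat = profile_theta(X, Y, g, b_hat)
# fit = m_hat(x_grid, X, Y, g, theta_hat, b_hat)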

3 Simulation Results
The purpose of this simulation study is to support the asymptotic theory developed in Section 2 and to compare our approach with other competitors. This is done with the CV bandwidth selectors introduced in the previous section. We generate $(X_i, Y_i)$ according to the model
\[
Y_i = \sin(\pi X_i) + \rho X_i + \beta \cos(\pi X_i) + \varepsilon_i, \tag{3.1}
\]
with $X_i$ generated from the uniform distribution on $[a_L, a_U]$ with $a_L = 0$ and $a_U = 1$, and $\varepsilon_i$ from $N(0, \sigma^2)$ independent of $X_i$. For the noise level we made two choices, $\sigma = 0.1$ and $\sigma = 0.5$. In the application of our approach, we took $g(x) = \sin(\pi x)$. According to (1.1), this choice gives $\theta_0 = 1$ and $m_0(x) = \rho x + \beta \cos(\pi x)$. We made two choices for $\beta$: $\beta = 0, 0.5$, and three choices for $\rho$: $\rho = 0, 1, 2$.
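As a quick check of these values (writing $\beta$ for the coefficient of the cosine term, as above), the projection defining $\theta_0$ can be evaluated in closed form for this design, since $X$ is uniform on $[0, 1]$:
\[
g''(x) = -\pi^2 \sin(\pi x), \qquad m''(x) = -\pi^2 \sin(\pi x) - \beta \pi^2 \cos(\pi x),
\]
\[
\theta_0 = \frac{E\, g''(X)\, m''(X)}{E\, g''(X)^2}
= \frac{E \sin^2(\pi X) + \beta\, E \sin(\pi X)\cos(\pi X)}{E \sin^2(\pi X)}
= \frac{1/2 + 0}{1/2} = 1,
\]
and hence $m_0(x) = m(x) - \theta_0 g(x) = \rho x + \beta \cos(\pi x)$.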
We compared our approach with a parametric fit, the direct local linear fit, and the two-step procedure that starts with a parametric fit of the model $E(Y_i | X_i) = \theta g(X_i)$ and then makes a nonparametric adjustment. The parametric fit we considered in this comparison is $\hat m_{\mathrm{pa}} = \tilde\theta g$, where $\tilde\theta$ minimizes $\sum_{i=1}^n \{Y_i - \theta g(X_i)\}^2$. We denote the direct local linear smoother by $\hat m^{\mathrm{ll}}_{\tilde h} = S_{\tilde h} Y$, where $\tilde h$ is chosen to minimize the corresponding cross-validation criterion with respect to $h$. The two-step procedure with $\hat m_{\mathrm{pa}}$ as a parametric start is $\hat m^{\mathrm{ts}}_{\tilde b} = \tilde\theta g + \hat m_{\tilde b}(\cdot, \tilde\theta)$, where $\tilde b$ is chosen by minimizing the corresponding CV criterion.

From the tables we note that the bias of $\hat m_{\mathrm{pa}}$ does not change as $n$ or the noise level varies, which is well expected. We also note that the properties of our proposal $\hat m_{\hat b}$ and of the direct local linear estimator $\hat m^{\mathrm{ll}}_{\tilde h}$ do not change as $\rho$ varies. This stems basically from the property of the weight $w_b$ that it reproduces linear functions, $n^{-1}\sum_{i=1}^n w_b(X_i, x)(a + c X_i) = a + cx$ for any constants $a$ and $c$, so that the linear component $\rho x$ is fitted exactly and does not affect the error of either estimator. Our theory in Section 2 tells us that the reduction in the bias of our proposal, relative to the direct local linear estimator, is larger if $g''$ is closer to $m''$; see (2.3). This is evident in the numerical results. We note that under the data-generating model (3.1), $g''$ is closer to $m''$ when $\beta = 0$ than when $\beta = 0.5$. The ISB (integrated squared bias) values of $\hat m_{\hat b}$ in the tables are smaller than those of $\hat m^{\mathrm{ll}}_{\tilde h}$ for both values of $\beta$, and the relative difference is larger when $\beta = 0$. We also find that $\hat m_{\hat b}$ has smaller variance as well. The smaller variance achieved by our proposal is due to the reduced bias and to the CV bandwidth choice $\hat b$ that trades off the bias and the variance. Theoretically, with a fixed bandwidth applied to both methods, the variance of our proposal is asymptotically the same as that of the direct local linear estimator, while the bias of the former is smaller than that of the latter. The smaller bias then gives our proposal some room to sacrifice bias for variance by increasing the bandwidth. Thus, the CV criteria tend to choose $\hat b > \tilde h$, which results in the smaller variance as well as the smaller bias. This is well demonstrated in Figure 1, which depicts the distributions of the CV bandwidth choices $\hat b$ for our proposal (left) and $\tilde h$ for the direct local linear estimator (right).
Our proposal exhibits the best performance in all cases except $(\beta = 0, \rho = 0)$, in which case the parametric method is the best, as expected. For the two cases with $\rho = 0$ ($\beta = 0$ and $0.5$), our proposal and the two-step procedure show comparable performance. In these cases, the true regression function $m$ is not far from the parametric function $g$. Indeed,
\[
E\{m(X) - g(X)\}^2 = E\{\rho X + \beta \cos(\pi X)\}^2 = \frac{\rho^2}{3} - \frac{4\rho\beta}{\pi^2} + \frac{\beta^2}{2},
\]
so that the squared distances between $m$ and $g$ in the case $\rho = 0$ equal $0$ and $1/8$ for $\beta = 0$ and $\beta = 0.5$, respectively. However, $m$ moves away from $g$ as $\rho > 0$ increases, and is more distant from $g$ when $\beta = 0$ than when $\beta = 0.5$ if $\rho > 0$. The main lesson from the results in the tables is that the existing two-step procedure $\hat m^{\mathrm{ts}}_{\tilde b}$ with the CV choice $\tilde b$ deteriorates very fast as $\rho$ departs from $\rho = 0$. The performance of $\hat m^{\mathrm{ts}}_{\tilde b}$ is even worse than that of the direct local linear estimator $\hat m^{\mathrm{ll}}_{\tilde h}$ when $\rho > 0$. This is in contrast with our proposal $\hat m_{\hat b}$, whose performance does not change as $\rho$ varies.
The success of $\hat m^{\mathrm{ts}}_{\tilde b}$ when $\rho = 0$ is mainly due to the fact that $g(X)$ is then orthogonal to $m_0(X)$ in the space of square-integrable random variables. In this case, the estimation of $\theta_0$ and $m_0$ in $m = \theta_0 g + m_0$ may be done by marginal regression. The marginal regression for $\theta_0$ is simply the parametric fit that minimizes $\sum_{i=1}^n \{Y_i - \theta g(X_i)\}^2$ with respect to $\theta$. Thus, in this case the minimizer $\tilde\theta$, which is the parametric start of the two-step estimator $\hat m^{\mathrm{ts}}_{\tilde b}$, approximates the true $\theta_0 = 1$ well, at the parametric rate. This observation and our simulation results suggest that the success of the existing two-step procedure $\hat m^{\mathrm{ts}}_{\tilde b}$ depends highly on the choice of the pilot parametric model, while our approach does not, as long as the chosen function $g$ satisfies $E\, g''(X)^2 > 0$.
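To see this orthogonality explicitly under the design of Section 3 (again writing $\beta$ for the cosine coefficient), note that for $\rho = 0$ we have $m_0(x) = \beta \cos(\pi x)$ and $X \sim U[0, 1]$, so
\[
E\, g(X)\, m_0(X) = \beta\, E \sin(\pi X)\cos(\pi X) = \frac{\beta}{2}\int_0^1 \sin(2\pi x)\, dx = 0,
\]
and therefore $\theta^* = E\, m(X) g(X)/E\, g(X)^2 = 1 = \theta_0$: in this special case the $L_2$ projection used as the pilot of the two-step procedure coincides with $\theta_0$.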

A.1 Proof of Proposition 1
From standard kernel smoothing theory, the condition (A1) gives that, if a function $\eta$ is twice continuously differentiable on $[a_L, a_U]$, then … uniformly for $x \in [a_L, a_U]$. We also note that there exists an absolute constant $0 < C < \infty$ such that … with probability tending to one. For (A.2), what we need is that the support of the baseline kernel $K$ contains a nontrivial interval in both of the half intervals $[-1, 0]$ and $[0, 1]$, which … For $j \ge 0$, we get $\hat\mu_j(x; b) = \mu_j(x; b) + o_P(1)$ uniformly for $x \in [a_L, a_U]$. Let … Note that $c(x; h) = \mu_2$ for all $x \in [a_L + h, a_U - h]$. This and a version of (A.1) for $(S_h g - g)(x)$ give … Similarly, for the second assertion it holds that … where the last equality follows from the definition of $m_0$ at (1.1). For the last assertion at (A.4), let $D_h(x) := (S_h g - g)(x)$ and $J_h(x) = n^{-1}\sum_{i=1}^n w_h(X_i, x) D_h(X_i)$. Then $T_3 = n^{-1}\sum_{i=1}^n \{J_h(X_i) - D_h(X_i)\}\varepsilon_i$. From the versions of (A.1) and (A.2) for the bandwidth $h$, we have $\sup_{x \in [a_L, a_U]} |D_h(x)| = O_P(h^2)$. Also, similarly as in (A.2), there exists an absolute constant $0 < C' < \infty$ such that … uniformly in $[a_L + 2h, a_U - 2h]$, under additional smoothness assumptions on $g$ and $f$. The continuity of $\sigma^2(\cdot)$ in the assumption (A1) and the result (A.5) give $\mathrm{Var}(T_3 \mid X_1, \ldots, X_n) = n^{-2} \cdots$ This completes the proof of the proposition. $\Box$