A difference based approach to the semiparametric partial linear model

A commonly used semiparametric partial linear model is considered. We propose analyzing this model using a difference based approach. The procedure estimates the linear component based on the differences of the observations and then estimates the nonparametric component by either a kernel or a wavelet thresholding method using the residuals of the linear fit. It is shown that both the estimator of the linear component and the estimator of the nonparametric component asymptotically perform as well as if the other component were known. The estimator of the linear component is asymptotically efficient and the estimator of the nonparametric component is asymptotically rate optimal. A test for linear combinations of the regression coefficients of the linear component is also developed. Both the estimation and the testing procedures are easily implementable. Numerical performance of the procedure is studied using both simulated and real data. In particular, we demonstrate our method in an analysis of an attitude data set. AMS 2000 subject classifications: Primary 60K35.


Introduction
Semiparametric models have received considerable attention in statistics and econometrics. In these models, some of the relations are believed to be of a certain parametric form while others are not easily parameterized. In this paper, we consider the following semiparametric partial linear model
$$Y_i = a + X_i'\beta + f(U_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $X_i \in \mathbb{R}^p$, $U_i \in \mathbb{R}$, $\beta$ is an unknown vector of parameters, $a$ is the unknown intercept term, $f(\cdot)$ is an unknown function and the $\epsilon_i$ are independent and identically distributed random noise with mean 0 and variance $\sigma^2$, independent of $(X_i', U_i)$.

Literature review
The semiparametric partial linear model has been extensively studied and several approaches have been developed to construct the estimators. A penalized least-squares method was used in, for example, [33,13,9]. A kernel smoothing approach was introduced in [30]. A partial residual method was proposed, for example, in [10]. A profile likelihood approach was used in [29] and [6]. Tests of significance in the partial linear model were discussed in [15,23,35]. Moreover, the estimation of the nonparametric component is discussed in [7,31,17,19].
The issue of achieving the information bound in this and other non- and semiparametric models has been examined by [26] and extensively discussed in [1].
In this article, a difference based estimation method is considered. The estimation procedure is optimal in the sense that the estimator of the linear component is asymptotically efficient (see, for example, [27]) and the estimator of the nonparametric component is asymptotically minimax rate optimal. [25] introduced a first-order differencing estimator in a nonparametric regression model for estimating the variance of the random errors. [18,24] extended the idea to higher-order differences for efficient estimation of the variance in such a setting. [21] used differencing for testing between a parametric model and a nonparametric alternative.
In particular, [34,35] introduced the differencing method to semiparametric regression with the focus on estimating the linear component. By using higher-order differences, [34,35] showed that the bias induced by the presence of the nonparametric component can be essentially eliminated. The author constructed an estimator of the linear component and showed it to be asymptotically efficient under the condition that the nonparametric function $f$ is fixed (for all $n$) and has a bounded first derivative. See also [14,22].

Main results
In this paper, instead of focusing on the linear component as in [34,35], we treat estimation of both the linear and the nonparametric components. We extend the results in [34,35] to general smoothness classes for the nonparametric component, and the condition on the nonparametric component is weakened. In addition, our results hold uniformly over such classes and so enable traditional asymptotic minimax conclusions. They also show what minimal smoothness assumptions are needed. Moreover, we consider the hypothesis testing problem for the linear coefficients and construct an F statistic. We show that the asymptotic power of the F test is the same as if the nonparametric component were known. We also consider adaptive estimation of the nonparametric function $f$ using wavelet thresholding.

It is interesting to note that although the differences are correlated, the correlation should be ignored and the linear regression coefficient vector $\beta$ should be estimated by the ordinary least squares estimator instead of a generalized least squares estimator which takes into account the correlations among the differences. If the correlation structure is incorporated in the estimation, the resulting generalized least squares estimator will not be optimal (in most cases, not even consistent).

Estimation procedure
The procedure begins by taking differences of the ordered observations (ordered according to the values of $U_i$). Let $d_t$, $t = 1, 2, \ldots, m+1$, be an order $m$ difference sequence that satisfies $\sum_t d_t = 0$ and $\sum_t d_t^2 = 1$, and let $D_i = \sum_{t=1}^{m+1} d_t Y_{i+t-1}$. Then $D_i$ can be seen as the $m$th order difference of $Y_i$. The goal of this step is to eliminate the effect of the nonparametric component $f$. The problem then reduces to a standard multiple linear regression problem. We estimate the linear regression coefficients $\beta$ by the ordinary least squares estimator based on the differences. Both the intercept $a$ and the unknown function $f$ can be estimated based on the residuals of the linear fit under certain identifiability assumptions.
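The following short derivation may help to see why differencing removes the nonparametric component. It is a sketch under stated assumptions rather than a statement from the paper: the fixed design $U_i = i/n$, the model $Y_i = a + X_i'\beta + f(U_i) + \epsilon_i$, a sequence with $\sum_t d_t = 0$ and $\sum_t d_t^2 = 1$, and $f$ Lipschitz of order $\alpha \le 1$ with constant $M$. The symbols $Z_i$, $\delta_i$ and $w_i$ are shorthand for the differenced covariates, the differenced function values and the differenced errors.

```latex
\begin{align*}
D_i &= \sum_{t=1}^{m+1} d_t Y_{i+t-1}
     = a \sum_{t} d_t
       + \Big(\sum_{t} d_t X_{i+t-1}\Big)'\beta
       + \sum_{t} d_t f(U_{i+t-1})
       + \sum_{t} d_t \epsilon_{i+t-1} \\
    &= Z_i'\beta + \delta_i + w_i
       \qquad \text{(the intercept cancels because } \textstyle\sum_t d_t = 0\text{)}, \\
|\delta_i| &= \Big|\sum_{t} d_t \big(f(U_{i+t-1}) - f(U_i)\big)\Big|
      \le M \Big(\frac{m}{n}\Big)^{\alpha} \sum_{t} |d_t|
      \le M \sqrt{m+1}\, \Big(\frac{m}{n}\Big)^{\alpha}.
\end{align*}
```

The last step uses the Cauchy-Schwarz inequality together with $\sum_t d_t^2 = 1$. For fixed $m$ the term $\delta_i$ is therefore of order $n^{-\alpha}$, which is why the differenced data behave approximately like a linear regression of $D_i$ on $Z_i$.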
We estimate the nonparametric function $f$ by both kernel and wavelet thresholding methods. The results show that under certain conditions both the linear and nonparametric components are estimated as well as if the other component were known. We also derive a test for linear combinations of the regression coefficients of the linear component. The test is fully specified and the test statistic is shown to asymptotically have the usual F distribution under the null hypothesis.
Both the estimation and the testing procedures are easily implementable. Numerical performance of the estimation procedure is studied using both simulated and real data. The simulation results are consistent with the theoretical findings.
The paper is organized as follows. Section 2 considers the simpler case where $X_i$ does not depend on $U_i$ to illustrate the whole procedure. Section 3 treats the general case where the $U_i$ are possibly correlated with the $X_i$, and the main results are given. The testing problem is considered in Section 4. A simulation study is carried out in Section 5 to study the numerical performance of the procedure. Real data applications are also discussed. The proofs are contained in Section 6.

Independent case
In this section, we consider a simple version of the semiparametric partial linear model (1) where $X_i$ does not depend on $U_i$. In Section 3 we will consider the setting where $X_i$ may depend on $U_i$. We shall always assume that the $X_i$ are random vectors. For the nonparametric component, either $U_i = i/n$ or the $U_i$ are i.i.d. random variables on $[0, 1]$ and independent of the $X_i$. In the second case, we also assume the density function of $U_i$ is bounded away from 0. Assumptions on the function $f$ are needed to make the model identifiable. Here we assume $\int_0^1 f(u)\,du = 0$ for the case where $U_i = i/n$, and $E(f(U_i)) = 0$ for the case where the $U_i$ are random variables.
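Let $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})'$ be $p$-dimensional independent random vectors with a non-singular covariance matrix $\Sigma_X$. Define the Lipschitz ball $\Lambda^\alpha(M)$ in the usual way:
$$\Lambda^\alpha(M) = \{g : \text{for all } 0 \le x, y \le 1,\ k = 0, \ldots, \lfloor\alpha\rfloor - 1,\ |g^{(k)}(x)| \le M \text{ and } |g^{(\lfloor\alpha\rfloor)}(x) - g^{(\lfloor\alpha\rfloor)}(y)| \le M|x - y|^{\alpha'}\},$$
where $\lfloor\alpha\rfloor$ is the largest integer less than $\alpha$ and $\alpha' = \alpha - \lfloor\alpha\rfloor$. Suppose $f \in \Lambda^\alpha(M)$ for some $\alpha > 0$. Then the partial linear model (1) can be written as
$$Y_i = a + X_i'\beta + f(U_i) + \epsilon_i, \quad i = 1, \ldots, n.$$
Here we assume the error terms $\epsilon_i$, $i = 1, 2, \ldots, n$, are i.i.d. random variables with finite variance $\sigma^2$. The goal is to estimate the coefficient vector $\beta$, the intercept $a$, and the unknown function $f$. This will be done through a difference based estimation.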
Suppose a difference sequence $d_1, d_2, \ldots, d_{m+1}$ satisfies the conditions in (3). One example of a sequence that satisfies these conditions is given in (4).

Remark 1. The asymptotic results in the theorems to follow require that the order $m \to \infty$ and that (3) be satisfied. However, even the simple choice of $m = 2$ seems to yield quite satisfactory performance, as attested by the simulations in Section 5.
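As an illustration, the sketch below builds one candidate difference sequence and checks the constraints that are stated in words in this paper: the entries sum to zero, the squares sum to one, and $1 - d_1^2 = O(m^{-1})$. The specific form $d_1 = \sqrt{m/(m+1)}$, $d_2 = \cdots = d_{m+1} = -1/\sqrt{m(m+1)}$ is an assumption chosen because it meets those constraints; it is not claimed to be the sequence displayed in (4). The function name `difference_sequence` is hypothetical.

```python
import numpy as np

def difference_sequence(m):
    """Return an order-m sequence d_1..d_{m+1} (assumed form, see lead-in).

    d_1 = sqrt(m / (m + 1)), d_2 = ... = d_{m+1} = -1 / sqrt(m (m + 1)).
    """
    d = np.full(m + 1, -1.0 / np.sqrt(m * (m + 1)))
    d[0] = np.sqrt(m / (m + 1))
    return d

for m in (2, 4, 8, 16):
    d = difference_sequence(m)
    # Columns: m, sum of d_t (about 0), sum of d_t^2 (about 1), 1 - d_1^2 = 1/(m+1)
    print(m, d.sum(), np.sum(d**2), 1.0 - d[0]**2)
```

Any other sequence satisfying the same constraints could be passed to the later steps in the same way.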
Remark 2. Asymptotic results like those in Theorems 1-5 are valid when $X$ depends on $n$ (say $X = X(n)$) under the condition that the multivariate sample CDF of $(X(n), U)$ converges to that which would occur as a limit in the setting of (3). We omit the details.
Remark 3. The case where $U_i$ is multi-dimensional is much more involved than the one-dimensional case since it is not easy to take differences. To use the difference based method in a high-dimensional space, we need to carefully define the difference sequence $\{d_t\}$; see, for example, [4] and the references therein on differences in high-dimensional space. In this article, we only consider the one-dimensional case.
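We now consider the difference based estimator of $\beta$. Let $D_i = \sum_{t=1}^{m+1} d_t Y_{i+t-1}$ for $i = 1, \ldots, n - m$. Then
$$D_i = Z_i'\beta + \delta_i + w_i, \qquad (5)$$
where $Z_i = \sum_{t=1}^{m+1} d_t X_{i+t-1}$, $\delta_i = \sum_{t=1}^{m+1} d_t f(U_{i+t-1})$ and $w_i = \sum_{t=1}^{m+1} d_t \epsilon_{i+t-1}$. In (5), the $\delta_i$ are the errors related to the nonparametric component $f$ in (1) and the $w_i$ are random errors which are correlated, with covariance matrix $\Psi = (\Psi_{i,j})$ given by (6). For estimating the linear regression coefficient vector $\beta$, we use the ordinary least squares estimator
$$\hat\beta = (Z'Z)^{-1} Z'D.$$
Although not entirely intuitive, it is important in this step to ignore the correlation among the $w_i$ and use the ordinary least squares estimate. If instead a generalized least squares estimator is used, i.e., $(Z'\Psi^{-1}Z)^{-1} Z'\Psi^{-1}D$, which incorporates the correlation structure, the optimality results in Theorem 1 below and Theorem 5 in the next section will not generally be valid.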
Remark 4. Using the generalized least squares method on the differences is similar to applying the ordinary least squares regression of $Y$ on $X$ in the original model (1). This would cause significant bias due to the presence of $f$. See Section 5 for a numerical comparison.
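The estimation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `difference_ols`, the test function $f(u) = \sin(2\pi u)$ and the particular difference sequence are assumptions made for the example. The sketch orders the data by $U$, forms the differences of $Y$ and $X$, runs ordinary least squares on the differences (ignoring their correlation, as the text recommends), and then estimates the intercept as the average of $Y_i - X_i'\hat\beta$.

```python
import numpy as np

def difference_ols(y, X, u, d):
    """Difference based OLS sketch.

    y: (n,) responses, X: (n, p) covariates, u: (n,) values of U,
    d: (m+1,) difference sequence. Returns (beta_hat, a_hat).
    """
    order = np.argsort(u)                      # order observations by U
    y, X = y[order], X[order]
    n, m = len(y), len(d) - 1
    # D_i = sum_t d_t Y_{i+t-1} and Z_i = sum_t d_t X_{i+t-1}, i = 1..n-m
    D = sum(d[t] * y[t:n - m + t] for t in range(m + 1))
    Z = sum(d[t] * X[t:n - m + t] for t in range(m + 1))
    # Ordinary least squares on the differences (correlation of w_i ignored)
    beta_hat, *_ = np.linalg.lstsq(Z, D, rcond=None)
    a_hat = np.mean(y - X @ beta_hat)          # intercept from the linear fit
    return beta_hat, a_hat

# Small illustration with simulated data (all settings hypothetical)
rng = np.random.default_rng(0)
n, beta = 500, np.array([2.0, 2.0, 4.0])
u = rng.uniform(size=n)
X = rng.normal(size=(n, 3))
y = X @ beta + np.sin(2 * np.pi * u) + rng.normal(size=n)
m = 2
d = np.full(m + 1, -1.0 / np.sqrt(m * (m + 1)))   # assumed sequence
d[0] = np.sqrt(m / (m + 1))
beta_hat, a_hat = difference_ols(y, X, u, d)
print(beta_hat, a_hat)
```

Because the sequence sums to zero, any constant shift in $Y$ cancels exactly in the differences, so the intercept has to be recovered afterwards from the undifferenced data.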
This means condition (3) is necessary for this procedure to be asymptotically optimal. The factor $(1 + 2\sum_{k=1}^m c_k^2)$ describes the inefficiency that results from the choice of a particular $m$ and the corresponding $\{c_1, \ldots, c_m\}$. It can perhaps best be recorded on a scale of relative values for the resulting standard deviations. See Table 1 for a few such values for $\{c_k\}$ as in (4) and for the optimal $\{c_k\}$ of [18]. Note that even modest values of $m$ yield quite high relative standard deviations.

Remark 6. A similar result has been derived in [34,35] under stronger conditions, where $\alpha \ge 1$.
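The factor $1 + 2\sum_{k=1}^m c_k^2$ can be evaluated numerically for any given difference sequence. The sketch below assumes that $c_k = \sum_i d_i d_{i+k}$, the lag-$k$ product sum of the sequence (so that $\sigma^2 c_k$ is the covariance of $w_i$ and $w_{i+k}$); that definition, the helper names, and the example sequence for $m \ge 2$ are assumptions, and the printed numbers are not claimed to reproduce Table 1.

```python
import numpy as np

def lag_products(d):
    """c_k = sum_i d_i d_{i+k}, k = 1..m, for a sequence d of length m+1."""
    d = np.asarray(d, dtype=float)
    m = len(d) - 1
    return np.array([np.dot(d[:-k], d[k:]) for k in range(1, m + 1)])

def variance_factor(d):
    """The factor 1 + 2 * sum_k c_k^2 for the sequence d."""
    c = lag_products(d)
    return 1.0 + 2.0 * np.sum(c**2)

# First-order differencing: c_1 = -1/2, so the factor is 1 + 2 * (1/4) = 1.5
first_diff = np.array([1.0, -1.0]) / np.sqrt(2.0)
print(variance_factor(first_diff))

for m in (2, 4, 8, 16):
    d = np.full(m + 1, -1.0 / np.sqrt(m * (m + 1)))   # assumed sequence
    d[0] = np.sqrt(m / (m + 1))
    fac = variance_factor(d)
    # Columns: m, variance factor, its square root (standard deviation scale)
    print(m, fac, np.sqrt(fac))
```

For the assumed sequence one finds $c_k = -k/(m(m+1))$, so the factor equals $1 + (2m+1)/(3m(m+1))$ and tends to 1 as $m$ grows.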
A natural estimator of the intercept $a$ is $\hat a = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i'\hat\beta)$. Since $\hat\beta$ is an efficient estimator of $\beta$, $\hat a$ is also an efficient estimator of $a$. We thus have the following result.
Remark 7. For the case where $\alpha \le 1/2$ and the $U_i$ are i.i.d. random variables, it can be seen from the previous discussion that $\hat a$ is asymptotically normal, but the asymptotic variance may depend on $f$ and the distribution of $U_i$.
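Once we have the estimators $\hat\beta$ and $\hat a$, they can be plugged back into the original model (1) to remove the effect of the linear component. For $i = 1, 2, \ldots, n$, let $r_i$ denote the residuals of the linear fit. The nonparametric component $f$ can then be estimated by the Gasser-Mueller estimator based on the $r_i$. Let $k(x)$ be a kernel function that satisfies $\int k(x)\,dx = 1$ and has $\lfloor\alpha\rfloor$ vanishing moments. Take $h = n^{-1/(1+2\alpha)}$ and let $\hat f$ be the Gasser-Mueller estimator (8) with kernel $k$ and bandwidth $h$ applied to the $r_i$.

Theorem 3. For each $\alpha > 0$, the estimator $\hat f$ given in (8) satisfies
$$\sup_{f \in \Lambda^\alpha(M)} E\left[\int_0^1 (\hat f(x) - f(x))^2\,dx\right] \le C n^{-2\alpha/(1+2\alpha)}$$
for some constant $C > 0$. Moreover, for any $x_0 \in (0, 1)$,
$$\sup_{f \in \Lambda^\alpha(M)} E\left[(\hat f(x_0) - f(x_0))^2\right] \le C n^{-2\alpha/(1+2\alpha)}.$$

Theorem 3 is a standard result. It shows that the estimator $\hat f$ given in (8) attains the optimal rate of convergence over the Lipschitz ball $\Lambda^\alpha(M)$ under both the global and local losses for the semiparametric problem.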
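A Gasser-Mueller type kernel smoother of the residuals can be sketched as below. The weights are integrals of the scaled kernel over cells around the ordered design points, with cell boundaries taken as midpoints between neighbouring points. Several choices are assumptions made for illustration only: the Epanechnikov kernel (chosen because its CDF has a closed form), the midpoint cell boundaries, the smoothness value `alpha = 1.0`, and the simulated stand-in residuals. The function names are hypothetical.

```python
import numpy as np

def epanechnikov_cdf(t):
    """CDF of the Epanechnikov kernel K(t) = 0.75 (1 - t^2) on [-1, 1]."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * t - 0.25 * t**3

def gasser_mueller(x, u, r, h):
    """Gasser-Mueller type estimate at points x from residuals r observed at u."""
    order = np.argsort(u)
    u, r = u[order], r[order]
    # Cell boundaries s_0 = 0 < s_1 < ... < s_n = 1 (midpoints of design points)
    s = np.concatenate(([0.0], 0.5 * (u[:-1] + u[1:]), [1.0]))
    x = np.atleast_1d(x)[:, None]
    # weight_i(x) = integral over [s_{i-1}, s_i] of K((x - v) / h) / h dv
    w = epanechnikov_cdf((x - s[:-1]) / h) - epanechnikov_cdf((x - s[1:]) / h)
    return w @ r

n = 500
u = np.arange(1, n + 1) / n
rng = np.random.default_rng(0)
r = np.sin(2 * np.pi * u) + 0.5 * rng.normal(size=n)   # stand-in residuals
alpha = 1.0                                            # assumed smoothness
h = n ** (-1.0 / (1.0 + 2.0 * alpha))                  # bandwidth from the text
x_grid = np.linspace(0.2, 0.8, 7)
f_hat = gasser_mueller(x_grid, u, r, h)
print(f_hat)
```

For points $x$ with $[x - h, x + h] \subset [0, 1]$ the weights sum to one, so a constant signal is reproduced exactly; near the boundary the weights sum to less than one and a boundary correction would be needed.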
The kernel estimator constructed above enjoys desirable optimal rate properties. However, it relies on the assumption that the smoothness parameter $\alpha$ is given, which is unrealistic in practice. It is thus important to construct estimators that automatically adapt to the smoothness of the unknown function $f$. We shall now introduce a wavelet thresholding procedure for adaptive estimation of the nonparametric component $f$.

Wavelet thresholding method
We work with an orthonormal wavelet basis generated by dilation and translation of a compactly supported mother wavelet $\psi$ and a father wavelet $\phi$ with $\int \phi = 1$. A wavelet $\psi$ is called $r$-regular if $\psi$ has $r$ vanishing moments and $r$ continuous derivatives.
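For simplicity in exposition, in the present paper we use periodized wavelet bases on $[0, 1]$. Let
$$\phi^p_{j,k}(t) = \sum_{l=-\infty}^{\infty} \phi_{j,k}(t - l), \quad \psi^p_{j,k}(t) = \sum_{l=-\infty}^{\infty} \psi_{j,k}(t - l), \quad t \in [0, 1],$$
where $\phi_{j,k}(t) = 2^{j/2}\phi(2^j t - k)$ and $\psi_{j,k}(t) = 2^{j/2}\psi(2^j t - k)$. The collection $\{\phi^p_{j_0,k}, k = 1, \ldots, 2^{j_0}; \psi^p_{j,k}, j \ge j_0 \ge 0, k = 1, \ldots, 2^j\}$ is then an orthonormal basis of $L^2[0, 1]$, provided the primary resolution level $j_0$ is large enough to ensure that the support of the scaling functions and wavelets at level $j_0$ is not the whole of $[0, 1]$. The superscript "p" will be suppressed from the notation for convenience. An orthonormal wavelet basis has an associated orthogonal Discrete Wavelet Transform (DWT) which transforms sampled data into the wavelet coefficients. See [11,32] for further details.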
Wavelet thresholding methods have been well developed for nonparametric function estimation. One of the best known wavelet thresholding procedures is Donoho-Johnstone's VisuShrink [12]. We shall now develop a wavelet thresholding procedure for the nonparametric component $f$ in the semiparametric model, similar to VisuShrink for nonparametric regression.

Estimation of nonparametric component
For simplicity, here we suppose $n = 2^J$ for some integer $J$. The procedure begins by applying the discrete wavelet transformation to the residuals of the linear fit, $r = (r_1, r_2, \ldots, r_n)$. Let $v = n^{-1/2} W r$ be the empirical wavelet coefficients, where $W$ is the discrete wavelet transformation matrix. Then $v$ can be written as
$$v = (\tilde v_{j_0,1}, \ldots, \tilde v_{j_0,2^{j_0}}, v_{j_0,1}, \ldots, v_{j_0,2^{j_0}}, \ldots, v_{J-1,1}, \ldots, v_{J-1,2^{J-1}}),$$
where the $\tilde v_{j_0,k}$ are the gross structure terms at the lowest resolution level, and the $v_{j,k}$ ($j = j_0, \ldots, J-1$, $k = 1, \ldots, 2^j$) are empirical wavelet coefficients at level $j$ which represent fine structure at scale $2^j$. For convenience, we use $(j, k)$ to denote the number $2^j + k$. Each empirical wavelet coefficient can then be written as the sum of a wavelet coefficient of $f$, an error term and a noise term: $\xi_{j_0,k}$ and $\theta_{j,k}$ are the wavelet coefficients of $f$, $\tau_{j,k}$ and $\tilde\tau_{j_0,k}$ denote the combination of the approximation error and the transformed linear residuals $n^{-1/2} W (X(\beta - \hat\beta) + a)$, and $z_{j,k}$ and $\tilde z_{j_0,k}$ are the transformed noise, i.e., $W\epsilon$. Our goal now is to estimate the wavelet coefficients $\xi_{j_0,k}$ and $\theta_{j,k}$.
The estimate of $f$ at the equally spaced sample points $U_i$ is then obtained by applying the inverse discrete wavelet transform (IDWT) to the denoised wavelet coefficients, and the estimate of the whole function $f$ is given by the corresponding wavelet expansion with the estimated coefficients. We have the following theorem.
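Theorem 4. Suppose the wavelet is $r$-regular and the moment generating function of $\epsilon_i$ exists in a neighborhood of the origin. Then for all $0 < \alpha \le r$ the wavelet thresholding estimator $\hat f$ defined above satisfies
$$\sup_{f \in \Lambda^\alpha(M)} E\left[\int_0^1 (\hat f(x) - f(x))^2\,dx\right] \le C \left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)}$$
for some constant $C > 0$. Moreover, for any $x_0 \in (0, 1)$,
$$\sup_{f \in \Lambda^\alpha(M)} E\left[(\hat f(x_0) - f(x_0))^2\right] \le C \left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)}.$$

Remark 8. A similar result for estimating the nonparametric component using a wavelet thresholding method has been derived in [17]. In [17] the linear component and the nonparametric component were estimated simultaneously, but the estimation of the linear coefficients did not achieve asymptotic efficiency.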
Comparing the results in Theorem 4 with the minimax rate attained in Theorem 3, the estimator $\hat f$ is adaptive to within a logarithmic factor of the minimax risk under both the global and local losses. Furthermore, it is not difficult to show that the extra logarithmic factor is necessary under the local loss. See, for example, [2].
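The wavelet step can be illustrated with a deliberately simplified sketch. It applies an orthonormal Haar transform to the residuals, soft-thresholds the detail coefficients at levels $j \ge j_0$ with the universal threshold $\sigma\sqrt{2\log n}$ used by VisuShrink, and inverts the transform. Several simplifications are assumptions for illustration rather than the construction analysed here: the Haar wavelet stands in for a general $r$-regular wavelet, the transform is applied to the unscaled residuals (equivalent to thresholding $n^{-1/2}Wr$ at $\sigma\sqrt{2\log n / n}$), and $\sigma$ is estimated by the median absolute deviation of the finest-level coefficients. All function names are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """Orthonormal Haar transform of a length-2^J vector.

    Returns (scaling coefficient array of length 1, list of detail arrays
    ordered from coarse to fine; the array at index j has length 2^j).
    """
    approx, details = np.asarray(x, dtype=float), []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details[::-1]

def haar_idwt(approx, details):
    """Inverse of haar_dwt (details ordered from coarse to fine)."""
    for det in details:
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + det) / np.sqrt(2.0)
        out[1::2] = (approx - det) / np.sqrt(2.0)
        approx = out
    return approx

def visu_shrink(r, sigma=None, j0=2):
    """Soft thresholding of Haar detail coefficients at levels j >= j0."""
    n = len(r)
    assert n & (n - 1) == 0, "n must be a power of two"
    approx, details = haar_dwt(r)
    if sigma is None:
        # MAD of the finest-level coefficients (a common choice; an assumption)
        sigma = np.median(np.abs(details[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(n))            # universal threshold
    shrunk = [c if j < j0 else np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)
              for j, c in enumerate(details)]
    return haar_idwt(approx, shrunk)

rng = np.random.default_rng(0)
n = 512
t = np.arange(1, n + 1) / n
r = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=n)  # stand-in residuals
f_hat = visu_shrink(r)
print(np.mean((f_hat - np.sin(2 * np.pi * t)) ** 2))
```

No smoothness parameter enters the procedure, which is the sense in which thresholding adapts to the unknown $\alpha$.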

Dependence case
We now turn to the random design version of the partial linear model (1) where both $X_i$ and $U_i$ are assumed to be random and need not be independent of each other. Note that asymptotic efficiency in this setting has been discussed, for example, in [27]. Again let $X_i$ be $p$-dimensional random vectors. Let $U_i$ be random variables on $[0, 1]$ and suppose that $(X_i', U_i)$, $i = 1, \ldots, n$, are independent with an unknown joint density function $g(x, u)$. Assume the $\epsilon_i$ are independent of $(X_i', U_i)$. Let $h(U) = E(X|U)$ and $S(U) = E(X'X|U)$. Suppose $f(u) \in \Lambda^\alpha(M_f)$ and $h(u) \in \Lambda^\gamma(M_h)$ for some $\alpha > 0$ and $\gamma > 0$. (When $X$ is a vector, we assume each coordinate of $h(u)$ satisfies this Lipschitz property.) Similar to the previous case, to make the model identifiable, assume $E(f(U_i)) = 0$. Moreover, suppose the marginal density of $U$ is bounded away from 0, i.e., there exists a constant $c > 0$ such that $\int g(x, u)\,dx \ge c$ for any $u \in [0, 1]$.

Let $U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)}$ be the order statistics of the $U_i$'s and let $X_{(i)}$ and $Y_{(i)}$ be the corresponding $X$ and $Y$. Note that the $X_{(i)}$'s are not the order statistics of the $X_i$'s, but the $X$ associated with $U_{(i)}$. Similar to the independent case, we take the $m$-th order differences $D_i = \sum_{t=1}^{m+1} d_t Y_{(i+t-1)}$, and write $Z_i = \sum_{t=1}^{m+1} d_t X_{(i+t-1)}$, $\delta_i = \sum_{t=1}^{m+1} d_t f(U_{(i+t-1)})$, and $w_i = \sum_{t=1}^{m+1} d_t \epsilon_{(i+t-1)}$. Again we estimate the linear regression coefficient vector by the ordinary least squares estimator $\hat\beta$ based on the differences.

Theorem 5. When $\alpha + \gamma > 1/2$ and $S(u) > 0$ for every $u$, the estimator $\hat\beta$ is asymptotically efficient.
Remark 9. We can see from this theorem that we do not always need $\alpha > 1/2$ to ensure asymptotic efficiency. We only need one of the two functions $f(u)$ and $h(U) = E(X|U)$ to have minimal smoothness. Theorem 1 can be considered to be a special case where $\gamma$ is infinity.

Remark 10. [34] obtained similar results for the partial linear model (1) under the conditions that both $f$ and $h$ have bounded first derivatives and hence satisfy the conditions with $\alpha = 1$ and $\gamma = 1$. In this case the condition $\alpha + \gamma > 1/2$ of Theorem 5 is obviously satisfied.
When $\alpha > 1/2$ we can use the same procedure as in the previous section to efficiently estimate the intercept $a$, i.e., $\hat a = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i'\hat\beta)$. Also, the asymptotic variance of $\hat a$ depends on the joint distribution of $X$ and $U$.
Once we have an estimate of $\beta$, we can then use the same procedure to estimate $f(u)$ as in the fixed design case. Similarly, the estimator also attains the optimal rate of convergence over the Lipschitz ball $\Lambda^\alpha(M)$ under both the global and local losses.
The proof of Theorem 5 is given in Section 6. The following lemma is one of the main technical tools. It is useful in the development of the test given in Section 4.

Testing the linear component
In this section, we consider the problem of testing the null hypothesis that the linear regression coefficients satisfy certain linear constraints. That is, we wish to test $H_0: C\beta = 0$ against $H_1: C\beta \ne 0$, where $C$ is an $r \times p$ matrix with $\mathrm{rank}(C) = r$. A special case is testing whether individual coefficients are zero. In this section, we shall assume the errors $\epsilon_i$ are independent and identically distributed $N(0, \sigma^2)$ variables.

Fixed design or independent case
We divide the testing problem into two cases. We first consider the case where $U_i = i/n$ (fixed design) or the $U_i$'s are random but independent of the $X_i$'s. From the previous sections, we know that asymptotically in this case the estimator $\hat\beta$ of the linear regression coefficient vector satisfies $\sqrt{n}(\hat\beta - \beta) \to N(0, \sigma^2\Sigma_X^{-1})$ in distribution. Both the covariance matrix $\Sigma_X$ and the error variance $\sigma^2$ are unknown in general and thus need to be estimated.
To estimate the error variance $\sigma^2$, set $H = I - Z(Z'Z)^{-1}Z'$ and use $\hat\sigma^2 = \|HD\|_2^2/(n - m - p)$ to estimate $\sigma^2$. Note that $\|HD\|_2^2 = w'Hw + 2w'H\delta + \delta'H\delta$. Suppose $\alpha > 1/2$. Then it is easy to see that $\delta'H\delta \to 0$ almost surely as $n \to \infty$. Since $w'H\delta \mid \delta \sim N(0, 2\sigma^2\delta'H\delta)$, we know that $w'H\delta \to 0$ almost surely. Here we also assume that the first term of the difference sequence satisfies $1 - d_1^2 = O(m^{-1})$ (the sequence given in (4) satisfies this condition). It can be shown that $\sigma^{-2}w'Hw$ is approximately distributed as chi-squared with $n - m - p$ degrees of freedom.

Theorem 6. Suppose $\alpha > 1/2$ and $1 - d_1^2 = O(m^{-1})$. For testing $H_0: C\beta = 0$ against $H_1: C\beta \ne 0$, where $C$ is an $r \times p$ matrix with $\mathrm{rank}(C) = r$, the test statistic $F$ asymptotically follows the $F(r, n - m - p)$ distribution under the null hypothesis. Moreover, the asymptotic power of this test (at local alternatives) is the same as that of the usual F test when $f$ is not present in the model (1).
Hence an approximate level $\alpha$ test of $H_0: C\beta = 0$ against $H_1: C\beta \ne 0$ is to reject the null hypothesis $H_0$ when the test statistic $F \ge F_{r,n-m-p;\alpha}$, where $F_{r,n-m-p;\alpha}$ is the upper $\alpha$ quantile of the $F(r, n - m - p)$ distribution.
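A sketch of how such a test statistic might be computed from the differenced data is given below. It uses the usual linear-model form $F = \hat\beta' C' (C (Z'Z)^{-1} C')^{-1} C \hat\beta / (r \hat\sigma^2)$ with $\hat\sigma^2 = \|HD\|_2^2/(n - m - p)$; that this is exactly the statistic of Theorem 6 is an assumption, since the displayed formula is not reproduced in the text. The function name is hypothetical, and the inputs in the demonstration are simulated stand-ins for the differences $D$ and $Z$.

```python
import numpy as np

def difference_f_statistic(D, Z, C):
    """F statistic for H0: C beta = 0 in the regression of D on Z.

    D: (n - m,) differenced responses, Z: (n - m, p) differenced covariates,
    C: (r, p) constraint matrix of rank r.
    """
    n_rows, p = Z.shape                         # n_rows = n - m
    r = C.shape[0]
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ D                # OLS on the differences
    resid = D - Z @ beta_hat                    # H D
    sigma2_hat = resid @ resid / (n_rows - p)   # ||H D||^2 / (n - m - p)
    Cb = C @ beta_hat
    quad = Cb @ np.linalg.solve(C @ ZtZ_inv @ C.T, Cb)
    return quad / (r * sigma2_hat)

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))                   # stand-in for differenced X
D = Z @ np.array([0.0, 0.0, 4.0]) + rng.normal(size=200)
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])                 # H0: beta_1 = beta_2 = 0
F = difference_f_statistic(D, Z, C)
print(F)
```

The null hypothesis would then be rejected at a given level when the statistic exceeds the corresponding upper quantile of the $F(r, n - m - p)$ distribution.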
Remark 11. [35] considered the testing problem. A $\chi^2$ statistic was derived under the condition that $\sigma^2$ is known.

General random design case
We now turn to the testing problem in the general random design case where the $U_i$ are random and correlated with the $X_i$. Again suppose that $(X_i', U_i)$, $i = 1, \ldots, n$, are independent with an unknown joint density function $g(x, u)$. We will show that the same F test also works in this case. Notice that in this case the asymptotic covariance of $\hat\beta$ involves a matrix $\Sigma_*$, and Lemma 1 shows that $Z'Z/n$ converges to $\Sigma_*$. Based on this observation and the discussion given in Section 4.1, we have the following theorem.

Numerical study
The difference based procedure for estimating the linear coefficients and the unknown function introduced in the previous sections is easily implementable.
In this section we investigate the numerical performance of the estimator using both simulations and analysis of real data.

Simulation
We first study the effect of the unknown function $f$ on the estimation accuracy of the linear component and then investigate the effect of the order of the difference sequence. In the first simulation study, we take $n = 500$, $U_i \overset{iid}{\sim} \mathrm{Uniform}(0, 1)$, $a = 0$ and consider four different functions $f_1, \ldots, f_4$. The test functions $f_3$ and $f_4$ are the Bumps and Doppler functions given in [12]. In the simulations, these functions are normalized to have unit variance. We also consider the case where $f \equiv 0$ for comparison. The errors $\epsilon_i$ are generated from the standard normal distribution. For $X_i$ and $\beta$, we consider two cases: Case (1) with $p = 1$, and Case (2) with $p = 3$ and $\beta = (2, 2, 4)'$, where the $X_i$ are normally distributed with covariance matrix $I_3$, the $3 \times 3$ identity matrix.
We first examine the effect of the unknown function $f$ on the estimation of the linear component. In this part, the difference sequence in equation (4) with $m = 2$ is used. The mean squared errors (MSEs) of the estimator $\hat\beta$ are calculated over 200 simulation runs. We also consider the case where the presence of $f$ is completely ignored and we directly run least squares regression of $Y$ on $X$ in model (1). The results are summarized in Table 2. The numbers inside the parentheses are the MSEs of the estimate when the nonparametric component is ignored. By comparing the MSEs in each row, it can be easily seen that we can estimate the linear coefficients nearly as well as if $f$ were known. On the other hand, if $f$ is simply ignored and $\beta$ is estimated by applying the least squares regression of $Y$ on $X$ directly, the estimator is highly inaccurate. The mean squared errors are between 2 and over 600 times as large as those of the corresponding estimators based on the differences.

Table 2
The MSEs of the estimate $\hat\beta$ over 200 replications with sample size n = 500. The numbers inside the parentheses are the MSEs of the estimate when the nonparametric component is ignored

            f ≡ 0     f_1              f_2              f_3              f_4
Case (1)    0.0028    0.0028 (1.970)   0.0028 (0.054)   0.0034 (0.013)   0.0033 (0.011)
Case (2)    0.0027    0.0023 (0.705)   0.0023 (0.025)   0.0037 (0.009)   0.0032 (0.007)

Table 3
The MSEs of the estimate $\hat f$ over 200 replications with sample size n = 500

For estimating the nonparametric function $f$, we use a kernel method with Parzen's kernel. The bandwidth was selected by cross validation; see, for example, [20,28]. For comparison, we also carried out the simulation in the case where $\beta = 0$. The mean squared error of the estimated $f$ is summarized in Table 3. It can be seen that the MSEs in each column are close to each other and hence the performance of our estimator $\hat f$ does not depend sensitively on the structure of $X$ and $\beta$.
We now consider the effect of the order $m$ of the difference sequence on the estimation accuracy. In this study, different combinations of the function $f$ and Cases (1) and (2) yield basically the same results. As an illustration of this, we focus on Case (2) and $f = f_2$. We compare four different values of $m$: 2, 4, 8, 16. The difference sequence in equation (4) was used in each case. We summarize in Table 4 the mean and standard deviation of the estimate $\hat\beta$ and the average MSE of the estimate $\hat f$. By comparing the means and standard deviations in each row we can see that the performance of the estimator does not depend significantly on $m$.
Next, we consider the test of the linear coefficients. In this study, we focus on Case (2) with two different sets of linear coefficients. One of them is $\beta = (2, 2, 4)'$, the other is $\beta = (0, 0, 4)'$. The hypothesis to be tested is $H_0: \beta_1 = \beta_2 = 0$. The total number of rejections (at level 0.05) over 200 runs and the mean value of the F statistic are summarized in Table 5. We also compare the F statistic with its nominal distribution for the case $\beta = (0, 0, 4)'$ and $f = f_2$. The empirical cumulative distribution function and the quantile-quantile plot are shown in Figure 1. It can be seen that the F statistic fits the nominal distribution very well and the F test performs as if the nonparametric component were known.

Application to attitude data
We now apply our estimation and testing procedures to the analysis of the attitude data. This data set was first analyzed in [8] using multiple linear regression and variable selection. It comes from a study of the performance of supervisors and was collected from a survey of the clerical employees of a large financial organization. The survey was designed to measure the overall performance of a supervisor, and included questions related to specific characteristics of the supervisor. The numbers give the percentage of favorable responses to seven questions in each department. Seven variables, $Y$ (overall rating of the job being done by the supervisor), $X_1$ (raises based on performance), $X_2$ (handles employee complaints), $X_3$ (does not allow special privileges), $X_4$ (opportunity to learn new things), $X_5$ (rate of advancing to better jobs), and $U$ (too critical of poor performance) are considered here. The goal is to understand the effect of the variables ($X_1, \ldots, X_5$ and $U$) on Rating ($Y$). Figure 2 plots each independent variable against the response $Y$. We can see that the effect of $U$ on $Y$ is not linear, while the effects of the other variables are roughly linear. So we employ the following model,
$$Y = a + \beta_1 X_1 + \cdots + \beta_5 X_5 + f(U) + \epsilon. \qquad (12)$$
Using the estimation procedure discussed in Section 3 with $m = 2$, the linear component in the model (12) is estimated as $18.1127 - 0.0208X_1 + 0.\ldots$

We thus perform the simultaneous F test of the hypothesis $H_0: \beta_1 = \beta_3 = \beta_5 = 0$ against $H_1$: at least one of them is nonzero. The value of the F statistic is 2.1577 and the p value is 0.1206. In comparison, the value of the F statistic for the global hypothesis $H_0: \beta_1 = \cdots = \beta_5 = 0$ is 18.4038 and the p value is less than 0.0001. The results show that we fail to reject the hypothesis $H_0: \beta_1 = \beta_3 = \beta_5 = 0$.
We can thus refine the linear component by using only Complaints ($X_2$) and Learning ($X_4$) as independent variables. In this case, the estimated linear component is $16.3467 + 0.6725X_2 + 0.2068X_4$. The F value for this model is 34.3635 and the p value is less than 0.0001.
We can then estimate the nonparametric component, the effect of Critical ($U$). For this, we run kernel estimation using the residuals of the linear fits as we did in Section 5.1. Figure 3 shows the nonparametric fits. The left panel plots the estimate of $f$ under the model (12) when the linear component is ignored, the middle panel plots the estimate of $f$ under the model (12) with all linear variables, and the right panel plots $\hat f$ with only the variables $X_2$ and $X_4$ in the linear part. We can see that the plot in the left panel is quite different from the other two. The middle and right plots are similar, since including a small number of additional non-significant variables does not have a large effect on the estimates of the remaining parts of the model. Moreover, we test the significance of the nonparametric function, i.e., $H_0: f(u) = a + bu$ for some constants $a$, $b$. We follow the test procedure described in [16]. The p-value of the likelihood ratio test is 0.043, which shows the nonparametric function is significant. In fact, we obtain a significant result with p-value 0.0259 when we fit a quadratic function to the nonparametric component. Note that in [8], standard multiple linear regression was used to model the relationship between the response and the explanatory variables. The linear model failed to detect the relation between the variable $U$ and $Y$, and it concluded that the variable $U$ did not have a significant effect on $Y$.

Proofs
We shall prove Theorems 1, 3, 4, 5 and 6.The proof of Theorem 7 is similar to that of Theorem 3 and Theorem 6, respectively.We will first prove some technical lemmas.

Technical Lemmas
Lemma 2. Under the assumptions of Theorem 1, ...

Proof. The asymptotic normality of $\sqrt{n}(Z'Z)^{-1}Z'w$ follows from the Central Limit Theorem and the fact that ... Note that with $m = o(n)$, ... so $\frac{1}{n}(Z'Z)$ converges almost surely. This implies $n\mathrm{Var}((Z'Z)^{-1}Z'w \mid Z)$ ...

The following lemma bounds the difference between the DWT of a sampled function and the true wavelet coefficients. See, for example, [3].

Lemma 4. Let $\xi_{J,k} = \langle f, \phi_{J,k} \rangle$ and $n = 2^J$. Then for some constant $C > 0$, $\sup$ ...

The following lemma is from [5]. Here we use the fact that ..., since $h(U)$ has $\gamma > 0$ derivatives. Similarly, $\lim_{n \to \infty}$ ... for any $i$, the first part of the above expression is of order $n^{-2\alpha}$. And $\delta_i E(dh_i)$ is of order $n^{-\alpha-\gamma}$ for any $i$, so the second part of the above expression is of order $n^{1-2\alpha-2\gamma}$. This implies $\frac{1}{n}E(Z'\delta\delta'Z) = O(n^{-2\alpha}) + O(n^{1-2\alpha-2\gamma})$.
For Theorem 3, we shall only prove the convergence rate under the pointwise squared error loss; the rate of convergence under the global mean integrated squared error risk can be derived using a similar line of argument.

Fig 1. QQ-plot and plot of the empirical cdf of the F statistics. In the right plot, the dotted line is the true cdf.

Fig 2. Plots of the individual explanatory variables against the response variable.

Fig 3. Kernel estimates of the nonparametric component $f$. The points are the residuals of the respective linear fits.

Table 1
Values of relative standard deviation for various m and {c k }

Table 4
The mean and standard deviation of the estimate β and the average MSEs of the estimate f over 200 replications with sample size n = 500

Table 5
The total number of rejections of the F test over 200 replications with sample size n = 500 at level 0.05. The numbers inside the parentheses are the mean values of the F statistic

Table 6
The estimated coefficients of the linear component and the significance tests

... $X_4 - 0.3747X_5$. The F statistic and the p value for testing each coefficient, $H_{i0}: \beta_i = 0$ against $H_{i1}: \beta_i \ne 0$, are given in Table 6. The p-values for $\beta_1$, $\beta_3$ and $\beta_5$ are exceedingly large.
... $(1 - \sum c_k^2)\Sigma_*$. Finally, for the third equation, let $dh_i = \ldots$; the term $\sum_t d_{t+k}\delta_{i-t}\delta_{i+k-t}$ is of order $n^{-2\alpha}$.