Oracally efficient estimation for single-index link function with simultaneous confidence band

Over the last twenty-five years, various $\sqrt{n}$-consistent estimators have been devised for the coefficient vector in the popular semiparametric single-index model. In this paper, we prove under general assumptions that the kernel estimator of the link function by a univariate regression on the index variable is oracally efficient, namely, the estimator with the true single-index coefficient vector is asymptotically indistinguishable from that with any $\sqrt{n}$-consistent coefficient vector estimator. As a mathematical byproduct of the oracle efficiency, a simultaneous confidence band is constructed for the link function based on the oracally efficient kernel estimator. Simulation experiments corroborate the theoretical results. The proposed simultaneous confidence band is applied to analyze and test hypotheses about the Boston housing data.
MSC 2010 subject classifications: Primary 62G08; secondary 62G15.


Introduction
Nonparametric regression methods have, over the last three decades, become widely used in place of classic parametric regression, as they are free from the constraints of a pre-determined form with finitely many unknown parameters. Yet nonparametric models pay for their flexibility the price of the "curse of dimensionality", i.e., unacceptable inaccuracy of function estimates when the number of predictors is large. Myriads of semiparametric models have been developed for over two decades in order to combine the strengths of purely nonparametric models with those of classic parametric models. [10] contains in-depth discussion about the parametric and nonparametric components of one typical semiparametric model, the partially linear model. The generalized additive model advocated by [16] is another popular semiparametric model, see also, for example, [18,25,26,27,38,44,45]. Another attractive semiparametric model is the single-index model, similar to the first step of projection pursuit regression, see [4,6,14,19]. The single-index model can be written as

$$Y = g(X^T\theta_0) + \varepsilon, \tag{1.1}$$

where $X = (X_1, \ldots, X_d)^T$ is a $d \times 1$ predictor vector and the unknown parameter $\theta_0 = (\theta_{0,1}, \ldots, \theta_{0,d})^T$ is the single-index coefficient vector. In addition, the link function $g$ is an unknown univariate function, and the noise satisfies $E(\varepsilon|X) = 0$, $E(\varepsilon^2|X) = \sigma^2(X)$. The linear combination $X^T\theta_0$ of $X_1, \ldots, X_d$ is referred to as the single-index variable or index.
There has been a folklore that since the true parameter $\theta_0$ is estimated by some $\hat\theta$ up to order $n^{-1/2}$, much smaller than the typical convergence rate $n^{-2/5}$ for nonparametric function estimation, one can safely ignore the difference between $\theta_0$ and $\hat\theta$, and estimate the link function $g$ by univariate regression of $Y$ on $X^T\hat\theta$ instead of $X^T\theta_0$. In contrast, both unknown parameters and nonparametric functions in partially linear models can be estimated with oracle efficiency (meaning as efficient as if all other unknowns were given), see for instance, [10,28]. We believe that most experienced statisticians would agree that the current statistical theory of the single-index model is seriously defective due to the absence of a reliable estimator of the link function $g$, however tempting it is to profess faith in the folklore that regressing $Y$ on $X^T\hat\theta$ is equivalent to regressing $Y$ on $X^T\theta_0$.
Under general assumptions, we have rigorously proved the above heuristics, namely oracle efficiency for a plug-in estimator of the link function $g$. Oracle efficiency in the context of smooth function estimation was best explained by [24], while the concept was later expanded by [25,26,28,38,37] for models with additive structures. In terms of the single-index model (1.1), if $\theta_0$ were known by an "oracle", one could construct the standard Nadaraya-Watson or local linear estimator $\tilde g$ of $g$ by regressing $Y$ on $X^T\theta_0$; since $\theta_0$ is unknown in practice, $\tilde g$ is an infeasible benchmark for estimating $g$. The Nadaraya-Watson or local linear plug-in estimator $\hat g$ of $g$ by regressing $Y$ on $X^T\hat\theta$ is called oracle, as Theorem 1 concludes that the difference $\tilde g - \hat g$ is uniformly of order $n^{-1/2}$, negligible compared to the error between $\tilde g$ and $g$.
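The comparison between the infeasible and the plug-in estimator can be sketched numerically. The following is a minimal illustration, not the paper's simulation design: the link function `np.sin`, the uniform predictors, the Epanechnikov kernel, the bandwidth constant, and the perturbation used as a stand-in for a $\sqrt{n}$-consistent estimator are all assumptions made for this example.

```python
import numpy as np

def nw_estimate(index, y, grid, h):
    """Nadaraya-Watson regression of y on a univariate index, evaluated on grid."""
    u = (index[None, :] - grid[:, None]) / h
    # Epanechnikov kernel weights (illustrative kernel choice, an assumption)
    w = 0.75 * np.clip(1.0 - u**2, 0.0, None)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(0)
n, d = 2000, 3
theta0 = np.ones(d) / np.sqrt(d)                 # true coefficient, unit norm
X = rng.uniform(-1.0, 1.0, size=(n, d))
g = np.sin                                        # hypothetical link function
y = g(X @ theta0) + 0.2 * rng.standard_normal(n)

# Stand-in for a root-n-consistent estimator: perturb theta0 at scale n^{-1/2}
theta_hat = theta0 + rng.standard_normal(d) / np.sqrt(n)
theta_hat /= np.linalg.norm(theta_hat)

grid = np.linspace(-1.0, 1.0, 201)               # interior of the index range
h = 0.5 * n ** (-0.2)                            # illustrative bandwidth
g_tilde = nw_estimate(X @ theta0, y, grid, h)    # infeasible estimator (uses theta0)
g_hat = nw_estimate(X @ theta_hat, y, grid, h)   # plug-in estimator (uses theta_hat)

sup_plugin_gap = np.max(np.abs(g_hat - g_tilde))     # sup |g_hat - g_tilde|
sup_oracle_err = np.max(np.abs(g_tilde - g(grid)))   # sup |g_tilde - g|
print(sup_plugin_gap, sup_oracle_err)
```

Comparing the two printed quantities over increasing `n` gives an informal view of the uniform closeness that the theory describes; a single run with one seed is only suggestive.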
This ideal property makes $\hat g$ asymptotically indistinguishable from $\tilde g$ uniformly, so that $\hat g$ automatically inherits all the global asymptotic properties of $\tilde g$, in particular, the simultaneous confidence band of $g$ based on $\tilde g$. Nadaraya-Watson and local linear estimators of a regression function come equipped with a simultaneous confidence band (SCB), see for instance [7,9,41]. The SCB is an extremely powerful tool for making inference on the entirety of an unknown curve with quantifiable error probability, yet it has been rather underexplored in the nonparametric curve estimation literature, due to the tremendous difficulty of obtaining the limiting distribution of the global estimation error (also known as maximal deviation). For recent theoretical developments on SCBs in various contexts, see for instance [15,23,29,36,47,48]. It should be pointed out that our proof of Theorem 1 requires only that the estimator $\hat\theta$ of $\theta_0$ be $\sqrt{n}$-consistent, regardless of whether it is derived from kernel based ([11]) or spline based ([39]) methods.
The rest of the paper is organized as follows. Section 2 states the main theoretical results on "oracle efficiency" and the SCB under some appropriate assumptions on model (1.1). Section 3 decomposes the estimation errors of $\hat g$ and $\tilde g$ into three parts for comparison, to break down the proof of the main theorem into three propositions. Section 4 describes the actual steps to implement the SCB. Section 5 reports findings of a simulation study. A real data example appears in Section 6. All technical proofs are in the Appendix.

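Let the observations $\{X_i^T, Y_i\}_{i=1}^n$ follow model (1.1), then one has

$$Y_i = g(X_i^T\theta_0) + \varepsilon_i, \quad i = 1, \ldots, n. \tag{2.1}$$

If $\theta_0$ were known by an "oracle", the standard kernel smoothing method offered by the univariate Nadaraya-Watson (NW) estimator $\tilde g_{NW}$ of $g$ is given by

$$\tilde g_{NW}(x_\theta) = \frac{\sum_{i=1}^n K\{(X_i^T\theta_0 - x_\theta)/h\}\, Y_i}{\sum_{i=1}^n K\{(X_i^T\theta_0 - x_\theta)/h\}}. \tag{2.2}$$

In practice, $\theta_0$ is unknown. Therefore we replace $\theta_0$ in (2.2) with its $\sqrt{n}$-consistent estimator $\hat\theta$ to obtain the oracle NW estimator $\hat g_{NW}$ given by

$$\hat g_{NW}(x_\theta) = \frac{\sum_{i=1}^n K\{(X_i^T\hat\theta - x_\theta)/h\}\, Y_i}{\sum_{i=1}^n K\{(X_i^T\hat\theta - x_\theta)/h\}}. \tag{2.3}$$

Similarly, we construct the univariate oracle local linear (LL) estimator $\hat g_{LL}$ of $g$ based on $\{X_i^T\hat\theta, Y_i\}_{i=1}^n$ that mimics the would-be local linear estimator $\tilde g_{LL}$ based on $\{X_i^T\theta_0, Y_i\}_{i=1}^n$; in its matrix form (2.4), the response vector is $Y = (Y_1, \ldots, Y_n)^T$, and the weight and design matrices are built from the kernel weights and the centered index values.

Throughout this paper, for any vector $\upsilon = (\upsilon_1, \ldots, \upsilon_d)^T \in \mathbb{R}^d$, we denote the norm $\|\upsilon\|_2 = (\upsilon_1^2 + \cdots + \upsilon_d^2)^{1/2}$. Without loss of generality, we take $\|\theta_0\|_2 = 1$. The technical assumptions we need are as follows:

The second order derivative of the link function $g$ is continuous on $(a, b)$.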
The above conditions (K) and (S) provide only two sets of elementary assumptions that support the high level Assumption (A1). In general, our Assumptions (A1)-(A7) allow for a rather wide selection of $\sqrt{n}$-consistent estimators $\hat\theta$ in order to establish the main Theorem 1 below.

Theorem 1. Under Assumptions (A1)-(A7), as n → ∞, the estimators ĝNW
According to classical theory on nonparametric confidence bands in [7] and [9], Assumptions (A2)-(A3), (A5)-(A7) ensure that for any $z \in \mathbb{R}$ where and $K'$ denotes the first order derivative of the kernel function $K$. Combining the above with Theorem 1, one obtains

Corollary 1. Under Assumptions (A1)-(A7), for any $z \in \mathbb{R}$, Hence for any $\alpha \in (0, 1)$, an asymptotic $100(1 - \alpha)\%$ simultaneous confidence band for $g(x_\theta)$, Alternatively, an asymptotic 100

Remark 1. It is reasonable to expect the oracle efficiency of Theorem 1 to hold as well under the settings of regression spline, P spline, etc. One reviewer has pointed out that there are four combinations, spline or kernel for the coefficient vector $\theta$ and spline or kernel for the link function $g$, and it would be quite interesting to see which combination is better and under what assumptions. We have chosen kernel smoothing for the link function $g$ simply because its SCB has been best investigated and understood. The estimation of the coefficient vector $\theta_0$ is only a preliminary step for estimating $g$, so any $\sqrt{n}$-consistent estimator $\hat\theta$ will do. We have used the B spline estimator $\hat\theta$ in the numerical work of Sections 5 and 6 because it is fast to compute (see comparison in [39]). Further research may lead to faster procedures to estimate $\theta_0$ or more accurate SCBs for $g$ than ours.

Proposition 1. Under Assumptions (A1)-(A7), as n → ∞,
Remark 2. It is easy to see that Theorem 1 follows from Assumption (A2) and Propositions 1, 2 and 3. Hence, the Appendix is devoted to the proofs of these propositions, rather than Theorem 1. If one were to prove the corresponding results for the LL estimator, one would extend Proposition 1 to include the term n ). These do not add a great deal of difficulty.

Implementation
In the following, we outline the procedures to construct the SCB given in Corollary 1. We use the triweight kernel function $K(u) = \frac{35}{32}(1 - u^2)^3 I\{|u| \le 1\}$, take $[\hat a, \hat b]$ as the index range, and construct the SCB over the compact interval $[\hat a_0, \hat b_0] = [0.9\hat a + 0.1\hat b, 0.9\hat b + 0.1\hat a]$. The bandwidth is taken to be a MISE-relevant undersmoothing bandwidth fulfilling Assumption (A7), $h = h_{opt} (\log n)^{-0.25 - 1/\log n}$, where $h_{opt}$ is the MISE optimal bandwidth of order $n^{-1/5}$, see [5].
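The kernel and the trimmed interval described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation; in particular, taking the endpoints of the index range to be the smallest and largest observed index values is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np

def triweight(u):
    """Triweight kernel K(u) = (35/32) (1 - u^2)^3 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (35.0 / 32.0) * (1.0 - u**2) ** 3, 0.0)

def trimmed_interval(index):
    """Return [0.9 a + 0.1 b, 0.9 b + 0.1 a], where [a, b] is the index range.

    Assumption: [a, b] is taken as the observed min and max of the index.
    """
    a, b = float(np.min(index)), float(np.max(index))
    return 0.9 * a + 0.1 * b, 0.9 * b + 0.1 * a

# Example: an index observed on [0, 10] is trimmed by 10% of its range at each end
a0, b0 = trimmed_interval(np.array([0.0, 2.5, 10.0]))
print(a0, b0)
```

Trimming the interval keeps the band away from the boundary of the index range, where kernel estimators have fewer observations in each smoothing window.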
The estimated index coefficient vector $\hat\theta$ is the polynomial spline estimator proposed by [39]. The pilot estimator of $f_{\theta_0}(x_\theta)$ is the kernel density estimator with bandwidth $h_f$ equal to Silverman's rule-of-thumb (ROT) bandwidth ([34], page 48, eqn (3.31)), which is the default bandwidth for the kernel density estimator in R. Meanwhile, the estimator of $\sigma^2_\theta(x_\theta)$ results from the Nadaraya-Watson estimator with bandwidth where $\hat\varepsilon_i = Y_i - \hat g(X_i^T\hat\theta)$. The consistency of $\hat f_\theta(x_\theta)$ and $\hat\sigma^2_\theta(x_\theta)$ follows from standard theory of kernel smoothing, and Slutsky's Theorem entails that Corollary 1 still holds when any consistent estimators $\hat f_\theta(x_\theta)$ and $\hat\sigma^2_\theta(x_\theta)$ are plugged into $v(x_\theta)$, satisfying that sup has asymptotic confidence level $1 - \alpha$.
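The two pilot estimators can be sketched as below. The rule-of-thumb bandwidth follows the familiar form $0.9 \min(\text{sd}, \text{IQR}/1.34)\, n^{-1/5}$; the Gaussian kernel, the function names, and the bandwidth passed to the variance smoother are assumptions of this sketch, since the text does not specify them here.

```python
import numpy as np

def silverman_rot(x):
    """Rule-of-thumb bandwidth 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = np.std(x, ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(sd, (q75 - q25) / 1.34) * x.size ** (-0.2)

def kde(index, grid, h):
    """Kernel density estimate of the index density (Gaussian kernel assumed)."""
    u = (grid[:, None] - index[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (index.size * h * np.sqrt(2.0 * np.pi))

def nw_variance(index, resid, grid, h):
    """Nadaraya-Watson smooth of squared residuals against the index."""
    u = (grid[:, None] - index[None, :]) / h
    w = np.exp(-0.5 * u**2)
    return (w @ resid**2) / w.sum(axis=1)

# Hypothetical usage with a simulated index and residuals
rng = np.random.default_rng(1)
index = rng.standard_normal(500)
resid = 0.3 * rng.standard_normal(500)
grid = np.linspace(-2.0, 2.0, 101)
h_f = silverman_rot(index)
f_hat = kde(index, grid, h_f)
sigma2_hat = nw_variance(index, resid, grid, h_f)   # bandwidth choice is illustrative
```

Here `resid` plays the role of the residuals $\hat\varepsilon_i$; in an actual analysis they would come from the fitted link function rather than from a random generator.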
For visualization of actual function estimates, Figure 4 depicts various univariate functions at $(\delta, c) = (1.0, 0), (1.5, 0.2)$, including the scatterplot of the data, the curve of the true univariate function $g$, the estimated function of $g$ using the true index coefficient vector $\theta_0$, the estimated function of $g$ using the estimated index coefficient vector $\hat\theta$, and asymptotic 95% SCBs with $n = 500$. Other settings yielded similar results, but are not included to save space. From Table 2, one can see that the SCBs based on $\hat\theta$ and $\theta_0$ have similar performances. There are no significant differences between their coverage percentages, and both are close to the nominal level for large sample size. Meanwhile, Figure 4 shows that the three curves of $g$, $\tilde g_{LL}$, $\hat g_{LL}$ are very close. All these results reveal that the oracle estimator $\hat g_{LL}(x_\theta)$ is asymptotically as efficient as the infeasible estimator $\tilde g_{LL}(x_\theta)$ regardless of noise level and/or heteroscedasticity, which is consistent with our asymptotic theory.

Real data analysis
As an illustration, we apply our method to the Boston Housing Data, consisting of the median value of homes in 506 census tracts in the Boston Standard Metropolitan Statistical Area in 1970 and 13 accompanying sociodemographic statistics. [8] estimated a housing price index model based on this data, while [2] did further analysis with their ACE algorithm to select four covariates. The response and explanatory variables of interest are: MEDV: median value of owner-occupied homes in $1000's; RM: average number of rooms per dwelling; TAX: full-value property-tax rate per $10,000; PTRATIO: pupil-teacher ratio by town school district; LSTAT: proportion of population that is of "lower status" (%). Several regression studies have explored the potential relationship between MEDV and the four covariates, for instance, [21,32,37,40,46].
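We follow the previous works in using the same four explanatory variables, and take logarithmic transformations of TAX and LSTAT for our analysis. The following single-index model is proposed to fit the data:

$$\text{MEDV} = g\{\theta_1 \text{RM} + \theta_2 \log(\text{TAX}) + \theta_3 \text{PTRATIO} + \theta_4 \log(\text{LSTAT})\} + \varepsilon,$$

and the four covariates are further standardized to facilitate the application of [39] for estimating $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$.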
By the spline method of [39], the estimated index coefficient vector is $\hat\theta = (\hat\theta_1, \hat\theta_2, \hat\theta_3, \hat\theta_4) = (0.4924, -0.1022, -0.2949, -0.8125)$. It implies that RM has a positive effect whereas log(LSTAT) has the most negative effect on the housing price. In Figure 5(a), the univariate LL estimator of the link function and the corresponding asymptotic 95% SCB are displayed together with the scatter plot of MEDV against the index $\hat\theta_1 \text{RM} + \hat\theta_2 \log(\text{TAX}) + \hat\theta_3 \text{PTRATIO} + \hat\theta_4 \log(\text{LSTAT})$. The straight solid line represents the least squares regression line. The null hypothesis $H_0: g(x_\theta) \equiv \beta_0 + \beta_1 x_\theta$ for some $\beta_0, \beta_1 \in \mathbb{R}$ will be rejected, since the 95% SCB does not entirely cover the straight regression line. In fact, the asymptotic p-value is 0.00849761 that is calculated as and $t_k$, $k = 0, \ldots, 400$ are equally spaced grid points over the interval $[\hat a_0, \hat b_0]$ where we construct the SCB, while $\hat\beta_0 + \hat\beta_1 x_\theta$ is a least squares linear approximation to $\hat g_{LL}(x_\theta)$. In other words, the asymptotic p-value $\alpha$ is a solution of

The scatter plot in Figure 5(a) shows a group of data points with a similar median value around \$50,000, and one may wonder how much influence they might have. We have removed these 16 data points from the data and redone the analysis, as seen in Figure 5(b), and obtained a revised asymptotic p-value of 0.00976571. Our conclusion based on comparing the plots in Figure 5 and the corresponding p-values is that the influence of these 16 data points is negligible.
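The visual test of linearity described here, checking whether a least squares line stays inside the band at every grid point, can be sketched as follows. The band and the fitted curve in this example are synthetic and the function name is hypothetical; the sketch does not reproduce the p-value computation, whose formula is not available in this text.

```python
import numpy as np

def line_within_band(grid, lower, upper, g_hat):
    """Fit a least squares line to g_hat over the grid and check band containment."""
    slope, intercept = np.polyfit(grid, g_hat, deg=1)   # coefficients, highest degree first
    line = intercept + slope * grid
    inside = bool(np.all((lower <= line) & (line <= upper)))
    return inside, line

# Synthetic example: a clearly curved estimate with a band of half-width 0.5
grid = np.linspace(-2.0, 2.0, 401)        # 401 equally spaced points t_0, ..., t_400
g_hat = grid**2                           # hypothetical fitted link function
inside, line = line_within_band(grid, g_hat - 0.5, g_hat + 0.5, g_hat)
reject_linearity = not inside
print(reject_linearity)
```

If the line leaves the band at any grid point, linearity is rejected at the band's nominal level; a band wide enough to contain the line everywhere would not reject.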
From the shape of the SCB, we can see that the curve of the estimated link function has a roughly increasing trend. These findings are consistent with the observations in [21,40,46], but are put on a rigorous footing through the quantification of type I error, by computing an asymptotic p-value relative to the SCB.

Appendix
Throughout this section, $\varphi_n \sim \psi_n$ means $\lim_{n\to\infty} \varphi_n/\psi_n = c$, where $c$ is some nonzero constant. For functions and $U_p(\cdot)$ if the convergence is in the sense of uniform convergence in probability.
We first state the classic Bernstein inequality used in the proofs of Propositions 1-3.
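For orientation, one common bounded-variable form of Bernstein's inequality is recalled below. It is offered as a reminder of the standard result and may differ in form or constants from the version used in the proofs; the symbols $\xi_i$, $M$ and $t$ are notation introduced for this statement only.

```latex
% Bernstein's inequality (bounded independent summands).
% Let \xi_1, \ldots, \xi_n be independent random variables with
% E\xi_i = 0 and |\xi_i| \le M almost surely. Then for every t > 0,
\[
  P\left( \left| \sum_{i=1}^{n} \xi_i \right| \ge t \right)
  \le 2 \exp\left( - \frac{t^2/2}{\sum_{i=1}^{n} E\xi_i^2 + M t/3} \right).
\]
```

In proofs of uniform convergence, a bound of this type is typically applied to truncated summands at each point of a finite grid and then combined with a union bound and the Borel-Cantelli Lemma.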

A.1. Proof of Proposition 1
According to the definitions of $\hat B(x_\theta)$, $\tilde B(x_\theta)$ given in (3.2) and the Taylor expansion of the kernel function $K$ at $(X_i^T\theta_0 - x_\theta)/h$ under Assumption (A5), one has where $R_{i,\theta_0}$ is the remainder term of the first order Taylor expansion, It is easy to see from Assumptions (A1), (A5) that max Clearly, with the addition of Assumptions (A3), In the following, we focus on analyzing sup To bound

A.2. Proof of Proposition 2
Firstly, similar to (A.1), using the second order Taylor expansion of the kernel function $K$ at $(X_i^T\theta_0 - x_\theta)/h$, the expression of $V(x_\theta)$ given in (3.3) can be written as Assumption (A5) on Lipschitz continuity of $K$ and Assumption (A1) ensure that Obviously, Assumption (A6) implies that Secondly, we define a sequence provided by Assumption (A6). The noise $\varepsilon_i$ is decomposed into tail, mean and truncated parts, i.e., . Correspondingly we define the three parts of According to Assumption (A6), The Borel-Cantelli Lemma implies that thus, applying the similar proof process of bounding sup ). Define $|V_{1,3}(x_\theta)|$ = sup With respect to $V_2(x_\theta)$, it can also be decomposed into three parts using a truncation method. We then again apply Bernstein's inequality, the Borel-Cantelli Lemma and a discretization technique, similar to the proof of (A.3), (A.8), to obtain
Additionally define