Integrated conditional moment test for partially linear single index models incorporating dimension-reduction

Studying model checking problems for partially linear single-index models, we propose a variant of the integrated conditional moment test using a linear projection weighting function, which gains dimension reduction and makes the proposed method act as if there were only one covariate even in the presence of multidimensional regressors. We derive the asymptotic distribution of the proposed test, namely an integral of the square of a centered Gaussian process.

Ma's research was supported by NSF grant DMS-1306972. Zhang's research was supported by the Natural Science Foundation of SZU (801, 00036112), the NSFC (Tianyuan fund for Mathematics, No. 11326179) of China, and NSFC grant 11101157 of China. Sun's research was supported by the President Fund of the University of Chinese Academy of Sciences. Liang (corresponding author) was partially supported by NSF grants DMS-1007167 and DMS-1207444, and by Award Number 11228103, made by the National Natural Science Foundation of China.


Introduction
One common task of model regression is to estimate a conditional mean function.Single index models, a generalization of multiple linear regression models with an unknown link function, have been widely used to estimate a conditional mean function because they relax restrictive assumptions imposed on parametric models of conditional mean functions such as linear or generalized linear models (Ichimura;1993;Härdle et al.;1993), and therefore gain more flexibility.There are various estimation procedures for single-index models.See Horowitz (2009) for a comprehensive survey and various applications of single-index models.To further combine the interpretability of multiple linear models with the flexibility of single-index models, their hybrid, the partially linear single-index models (PLSiM), have been studied and applied for analyzing various complex data generated from biological and economic studies in the literature (Xia and Härdle;2006;Yu and Ruppert;2002;Wang et al.;2010;Liang et al.;2010).The first remarkable work on PLSiM can be traced back to Carroll et al. (1997), in which a backfitting algorithm was proposed to estimate parameters of interest in a more general case.Yu and Ruppert (2002) suggested a penalized spline estimation procedure.Xia and Härdle (2006) applied the minimum average variance estimation (MAVE, Xia et al.;2002) to PLSiM and developed an effective algorithm.More recently, Wang et al. (2010) studied estimation in PLSiM with additional assumptions imposed on the model structure.Liang et al. (2010) proposed a profile least squares estimation procedure.Using PLSiM for data analysis, a natural question is that, given a set of covariates, how one feels confident that the model fits data well.In this paper, we propose a method for model checking in PLSiM integrating the dimension reduction principle.
Correct model specification is fundamental and critical in data analysis, and there is a vast literature on model checking problems. Most results were established by evaluating the difference of the conditional expectation under the null and alternative hypotheses, or the expectation of the residual under the null hypothesis. Generally speaking, there are two classes of methods for model checking. The first class focuses on procedures based on smoothing residuals and is called the "local approach"; it includes Dette (1999); Härdle et al. (1998); Horowitz and Härdle (1994); Eubank and Hart (1992); Härdle and Mammen (1993). Hart (1997) gave a comprehensive survey of nonparametric lack-of-fit tests. It is worth mentioning that the null asymptotic distributions of these tests are generally free of the data generating process, so critical values are convenient to calculate. However, these methods mostly suffer from the "curse of dimensionality", detect the alternative hypothesis only at a nonparametric rate, and need careful selection of the bandwidth. Lavergne and Patilea (2008) incorporated a dimension-reduction idea into smoothing the residuals and constructed a test using the linear projection approach. The resulting test avoids the "curse of dimensionality" and, as they advocated, has higher power. However, they also introduced a penalty function and required an initial value, both of which need to be determined in advance, and no guideline on how to choose the penalty function is available.
An alternative to the local approach is the conditional moment (CM) test (Newey; 1985; White; 1984), which was inspired by the fact that certain conditional moments can be re-expressed as a finite number of unconditional moment restrictions. As a consequence, the sample versions of these unconditional moments yield various CM tests. Unfortunately, these CM tests are generally not consistent, as pointed out by Bierens (1982): for any test based on a finite number of moment restrictions, one can always construct a data generating process under which the null hypothesis is false but the moment restrictions involved still hold. As a remedy, Bierens (1982) proposed the integrated conditional moment (ICM) test, which differs from the CM tests in that, while still a CM test, it is based on uncountably many moments. The idea works as follows.
Let X be the covariates, Y be the response variable, x be a vector with the same length as X, and e(λ, X, Y) be the model error with an unknown parameter vector λ, which may be infinite-dimensional. The ICM tests transform the conditional expectation restriction of the null hypothesis, E{e(λ, X, Y)|X} = 0, into uncountably many moments E{e(λ, X, Y)w(X, x)} = 0 for any x, with the weighting function w chosen so that the conditional expectation restriction and the infinitely many unconditional moments are equivalent. There is a large literature on the ICM test, and the following five weighting functions have been proposed: the exponential weighting function exp(√−1 x⊤X) by Escanciano (2006) and Bierens (1982); the linear indicator weighting function I(X⊤θ ≤ u) for u ∈ R¹ by Stute and Zhu (2002) and Escanciano (2006); the logistic weighting function {1 + exp(x⊤X)}⁻¹ by Lee et al. (2001); the simple indicator weighting function I(X ≤ x) by Stute (1997) and Lin et al. (2002); and the trigonometric weighting function cos(x⊤X) + sin(x⊤X) by Bierens and Ploberger (1997). Some weighting functions lead to inconsistent model checking methods, and different choices of weighting function lead to different power properties. Furthermore, no best weighting function in terms of power is available, since Bierens and Ploberger (1997) showed that all these weighting functions lead to asymptotically admissible tests. It is worth pointing out that the tests based on the linear indicator and simple indicator weighting functions are similar in spirit to the use of integrated moments over half-spaces for goodness of fit, which can be traced back to Beran and Millar (1989).
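To make the moment construction concrete, the following sketch (our illustration, not the authors' code; all function names and the simulated data are ours) evaluates sample analogues of E{e · w(X, x)} for the five weighting functions listed above, using residuals generated under a correctly specified null so that each sample moment should be close to zero.

```python
import numpy as np

def moment(resid, weights):
    """Sample analogue of the unconditional moment E{e * w(X, x)}."""
    return np.mean(resid * weights)

def w_exponential(X, x):              # Bierens (1982); complex-valued
    return np.exp(1j * X @ x)

def w_linear_indicator(X, theta, u):  # Stute and Zhu (2002)
    return (X @ theta <= u).astype(float)

def w_logistic(X, x):                 # Lee et al. (2001)
    return 1.0 / (1.0 + np.exp(X @ x))

def w_simple_indicator(X, x):         # Stute (1997)
    return np.all(X <= x, axis=1).astype(float)

def w_trigonometric(X, x):            # Bierens and Ploberger (1997)
    return np.cos(X @ x) + np.sin(X @ x)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
resid = rng.normal(size=500)          # model errors under a (true) null
theta = np.ones(3) / np.sqrt(3)       # a unit projection direction
print(abs(moment(resid, w_linear_indicator(X, theta, 0.0))))  # small under H0
```

Each weighting function indexes a family of moments over x (or over (θ, u)); a consistent test aggregates all of them rather than checking a single one.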
Among these weighting functions, the linear indicator weighting function is the most attractive because its associated tests avoid choosing the integrating function that is necessary for the logistic, exponential and trigonometric weighting functions. In fact, the linear indicator weighting function was originally introduced to avoid the high-dimensionality problem that may hamper testing methods based on the simple indicator weighting function. The linear indicator weighting function was first studied by Stute and Zhu (2002, 2005) and Xia et al. (2004) for checking generalized linear models and single-index models, where the projection direction is chosen to be the regression parameter. The rationale for the linear indicator weighting function was justified by Escanciano (2006) on the basis of an important proposition proved by Jones (1987), namely that a departure can be detected by projecting the function along a certain direction; that is, for random variables ε and x with E‖x‖ < ∞, E(ε|x) = 0 if and only if E(ε|x⊤θ) = 0 almost surely for any unit vector θ.
The linear indicator weighting function involves a nuisance parameter, namely the projection direction, and the choice of this direction is critical with respect to three pivotal concerns. The first is that the choice of the projection direction must ensure the equivalence of the null hypothesis and the weighted infinitely many unconditional moments. Stute and Zhu (2002) and Xia et al. (2004) chose the projection directions to be the regression parameters and consequently weakened the null hypothesis to the independence of the residual and the regressors (Escanciano; 2006). The second concerns the power performance of the test: the projection direction should distinguish the projected residual under the alternative hypothesis from that under the null hypothesis along this direction. Otherwise, the corresponding test loses power completely and performs badly. Moreover, to improve power, the projection direction should detect the difference as much as possible. The third concern caused by the nuisance projection direction is the computability of the critical value. The null distribution of the ICM test is case-dependent, so a bootstrap method is usually required to determine the critical value; the complexity of the projection direction can make the test statistic difficult to compute and the bootstrap method hard to carry through.
Aiming to avoid the deficiencies of most smoothing-based tests and of the existing ICM tests, we propose a variant of the ICM test using a linear projection weighting function, and we choose the projection direction by fitting a single-index model of the estimated squared model errors against all the covariates. This way of choosing the projection direction prevents the three problems mentioned above. It will be shown that the proposed method avoids the problem of data sparsity, is free of smoothing parameters, is easy to compute, and has satisfactory power performance.
The rest of this paper is organized as follows. In Section 2, we describe the proposed test and study its asymptotic properties; we further suggest a bootstrap method for calculating the critical value and investigate the power performance. Simulation studies and a real data analysis are conducted to evaluate the proposed tests in Sections 3 and 4, respectively. The technical details for the proofs of the main results and the estimation of the parameters under the null hypothesis are presented in the Appendix.

The test procedures and their theoretical properties
Consider the partially linear single-index model (PLSiM) of the form

Y = g(β⊤X) + α⊤T + e,   (2.1)

where Z = (X⊤, T⊤)⊤, X and T are p-dimensional and q-dimensional covariate vectors, respectively, β is an unknown index vector belonging to the parameter space Θ = {β = (β₁, ..., β_p)⊤ : ‖β‖ = 1, β₁ > 0}, ‖β‖ is the Euclidean norm of β, α = (α₁, ..., α_q)⊤, and g(•) is an unknown differentiable function. We are interested in checking the specification of the PLSiM:

H₀: E(Y|Z) = g(β⊤X) + α⊤T almost surely, for some β, α and g.   (2.2)

Write e(β, α, g) = Y − g(β⊤X) − α⊤T for the model error. Following Jones (1987), we consider E{e(β, α, g)|θ⊤Z} = 0 for any unit vector θ. Since θ⊤Z is a scalar, this transformation overcomes the dimensionality problem; however, such a testing method requires nonparametric smoothing of the residuals and belongs to the category of "local methods", so although it avoids the dimensionality problem, it suffers from all the other shortcomings of local methods. We therefore target the equivalent form E{e(β, α, g)I(θ⊤Z ≤ u)} = 0 for any u ∈ R¹ and any unit (p+q)-vector θ. This formulation avoids estimating E{e(β, α, g)|θ⊤Z} nonparametrically, so no bandwidth selection is required. By treating u as a random variable, we formulate our null hypothesis as E[E²{e(β, α, g)I(θ⊤Z ≤ u)|θ, u}] = 0. Based on this equality, we construct a Cramér-von Mises type test statistic. By letting the nuisance parameter θ follow a degenerate distribution, the Cramér-von Mises type statistic simplifies to a weighted sum. It is worth mentioning that this simplification leads to higher statistical power at the cost of possible inconsistency; nevertheless, the resulting statistic performs satisfactorily under the null hypothesis. The value of θ is determined by fitting a single-index model with the squared model error of the PLSiM as the response and all the regressors as covariates. The value of θ so obtained can tell the projected residual of the alternative models apart from that of the null hypothesis.
Suppose that {(Y_i, X_i, T_i), i = 1, ..., n} is a random sample from (Y, X, T). Based on this sample, we estimate the parameters β and α and the unknown function g under the null hypothetical model (2.2). The estimation details and the associated asymptotic properties of the proposed estimators are presented in Appendix A.2. Let β̂, α̂ and ĝ be the estimators of β, α and g. Then our sample version of E{e(β, α, g)I(θ⊤Z ≤ u)} is defined as

R_nw(u) = n⁻¹ Σ_{i=1}^n e_i(β̂, α̂, ĝ) I(θ̂⊤Z_i ≤ u).

To study the asymptotic properties of R_nw(u), we first introduce the following notation. Let A^⊗2 = AA⊤ for a matrix A, and let V(β) = β⊤X. For any random variable (or vector) ξ, write ξ̃(β) = ξ − E{ξ|V(β)}, and let J be the Jacobian matrix of size p × (p − 1) arising from the reparameterization of β (see Appendix A.2). Let β₀ and α₀ be the true parameters in the null hypothetical model (2.2).
For notational simplicity, we write ê_i = e_i(β̂, α̂, ĝ) in what follows. For any fixed u, Proposition 2.1 shows that √n R_nw(u) is asymptotically normal. A simple test may be constructed from {√n R_nw(u)}², which asymptotically follows a central χ² distribution. But this score-type test may not be consistent since it is based on only one moment condition.

The proposed test
Our test builds on the moment conditions E{e(β, α, g)I(θ⊤Z ≤ u)} = 0 for any u and unit vector θ. If we take θ as a random variable and denote the distribution of θ by f(θ)dθ, then our test statistic is defined as

T_n = n ∫∫ R²_nw(u) F_nθ(du) f(θ)dθ,   (2.6)

in which F_nθ(du) is the empirical distribution of (θ⊤Z₁, ..., θ⊤Z_n). Accordingly, we have the following result for T_n.
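As a rough illustration of how T_n can be computed when f(θ) is degenerate at an estimated direction (so the integral over u reduces to a sum over the empirical distribution of the projections), one might proceed as follows. This is our own sketch, not the authors' implementation; the data and function names are illustrative.

```python
import numpy as np

def cvm_statistic(resid, Z, theta):
    """T_n = sum_j R_nw(theta'Z_j)^2 with R_nw(u) = n^{-1} sum_i e_i I(theta'Z_i <= u)."""
    proj = Z @ theta                         # scalar projections theta'Z_i
    ind = proj[:, None] <= proj[None, :]     # I(theta'Z_i <= theta'Z_j)
    R = (resid[:, None] * ind).mean(axis=0)  # R_nw evaluated at each u = theta'Z_j
    return float(np.sum(R ** 2))

rng = np.random.default_rng(0)
n = 300
Z = rng.normal(size=(n, 4))
resid = rng.normal(size=n)        # residuals from a correctly specified model
theta = np.ones(4) / 2.0          # a unit vector in R^4
print(cvm_statistic(resid, Z, theta))   # O_p(1) under the null
```

Under the null the statistic stays bounded in probability, whereas under a fixed alternative the projected residuals no longer average to zero and T_n diverges with n.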
Theorem 1. Under Assumptions (C1)-(C5) in the Appendix and the null hypothesis H₀, one has

T_n → ∫∫ R²_w(u) F_θ(du) f(θ)dθ in distribution as n → ∞,

where R_w(u) is the centered Gaussian process given in Proposition 2.1. Theorem 1 follows directly from Proposition 2.1 and the continuous mapping theorem.

Calculation of the critical value
Since the asymptotic covariance of the process √n R_nw(u) depends on the distributions of Y and Z, the null distribution of the test statistic T_n defined in (2.6) is case-dependent, and the critical value cannot be obtained directly from this distribution. We use a bootstrap procedure to mimic the null distribution of the test statistic.
The procedure for calculating the critical value based on the bootstrap test statistic is as follows:

Step 1: Compute the estimated projection direction θ̂ by fitting a single-index model with the "synthetic" data (the squared residuals against all the covariates).

Step 2: Compute the test statistic T_n.

Step 3: Generate the random variable sequences {ε_ib}_{i=1}^n, b = 1, ..., B, from the two-point distribution taking the values (1 ∓ √5)/2 with probabilities (5 ± √5)/10, respectively, so that the variance equals 1, and compute the corresponding bootstrap responses for each b.

Step 4: For each b, re-calculate the bootstrap estimators α̂_b and β̂_b, then calculate the bootstrap fitted values and residuals, and define the bootstrap test statistic T_n^b from the bootstrap process R^b_nw(u) in (2.7).

Step 5: Calculate the (1 − κ)-quantile of the bootstrap test statistics {T_n^b, b = 1, ..., B} as the κ-level critical value.
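Step 3 can be sketched as follows (our illustration; the helper name is ours). The two-point distribution above is the Mammen-type multiplier law: it takes the values (1 ∓ √5)/2 with probabilities (5 ± √5)/10, which gives mean zero and unit variance, as the wild bootstrap requires.

```python
import numpy as np

def mammen_multipliers(n, rng):
    """Draw n wild-bootstrap multipliers from the two-point distribution:
    (1 - sqrt(5))/2 with probability (5 + sqrt(5))/10,
    (1 + sqrt(5))/2 with probability (5 - sqrt(5))/10."""
    lo = (1 - np.sqrt(5)) / 2
    hi = (1 + np.sqrt(5)) / 2
    p_lo = (5 + np.sqrt(5)) / 10
    u = rng.random(n)
    return np.where(u < p_lo, lo, hi)

rng = np.random.default_rng(1)
eps = mammen_multipliers(100_000, rng)
print(eps.mean(), eps.var())   # approximately 0 and 1
```

Multiplying the fitted residuals by these draws preserves the first two moments of the error distribution while breaking any systematic departure from the null, which is what makes the bootstrap critical value valid.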
Theorem 2. Under Assumptions (C1)-(C5) in the Appendix and the null hypothesis H₀, √n R^b_nw(u) converges to R_w(u), conditionally on the data, for all b = 1, ..., B, where R_w(u) is defined in Proposition 2.1.

Local power properties
To investigate the sensitivity of the proposed test, we consider the sequence of local alternative models

H_{1n}: Y = g(β⊤X) + α⊤T + δ_n D(X, T) + η,   (2.8)

with E(η_i|X_i, T_i) = 0 and some arbitrary bounded measurable function D(•, •), where δ_n governs the rate of local departure from the null.

A simulation study
In this section, we report simulation results to evaluate the performance of the proposed testing procedure.
We examine the proposed test procedure for the PLSiM under a sequence of alternative models with different values of C_o given in Cases 1 and 2. When C_o = 0, the model is the null model. In this example, we use the projection direction θ̂ estimated by fitting a single-index model with the squared model errors ê²_i(β̂, α̂, ĝ) of (3.1) as the response and Z_i as covariates. The powers of the tests are calculated at the four significance levels 0.01, 0.025, 0.05, and 0.10. Regarding bandwidth selection, the conditions of Lemma A.4 in the Appendix accommodate the optimal bandwidth for h in the process of estimating (α⊤, β⊤)⊤. Thus, standard bandwidth selection methods, such as K-fold cross-validation (CV), generalized cross-validation (GCV) or the rule of thumb, can be employed. In this example, we use 5-fold cross-validation, as suggested by Cui et al. (2011), since it is not too computationally intensive. After obtaining the estimators (α̂⊤, β̂⊤)⊤, ĝ(•) is computed to obtain the residuals ê(α̂, β̂, ĝ). From Condition (C3), we know that the bandwidth h needs to be undersmoothed. Thus, for obtaining (α̂⊤, β̂⊤)⊤ we use the bandwidth h = ĥ × n^{−2/15}, where ĥ is the bandwidth selected by 5-fold cross-validation.
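The undersmoothing rule above amounts to a one-line adjustment; in this sketch (ours, not the authors' code) h_cv stands for whatever bandwidth 5-fold cross-validation returns, and the numeric values are purely illustrative.

```python
# Shrink a cross-validated bandwidth by the factor n**(-2/15) so that the
# bias of the nonparametric step is asymptotically negligible (undersmoothing),
# as required by Condition (C3) of the paper.

def undersmooth(h_cv, n):
    return h_cv * n ** (-2 / 15)

# e.g. with a cross-validated bandwidth of 0.5 and n = 400 observations,
# the working bandwidth is noticeably smaller than h_cv
print(undersmooth(0.5, 400))
```

The exponent −2/15 makes the adjusted bandwidth converge to zero faster than the CV-optimal rate, trading some variance for the smaller bias the asymptotic theory needs.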
The power patterns in this simulation study are depicted in Figure 1, with dotted and solid lines for n = 200 and n = 400, respectively, using the bandwidth h. It is easy to see that when the null hypothesis H₀ is true, that is, C_o = 0, the percentages of H₀ being rejected are close to the corresponding nominal level for all four nominal levels, and they are closer to the significance levels for the larger sample size n. When H₀ is not true, that is, C_o ≠ 0, as the value of C_o or the sample size n increases, the empirical percentages of rejecting H₀ approach one. The results demonstrate that the proposed testing procedure is powerful.

Real data analysis
In this section, we study an automobile dataset (Johnson; 2003) to discover possible factors that influence price. The suggested retail price (the manufacturer's assessment of the vehicle's value, including adequate profit for the automaker and the dealer) serves as the response variable Y. Four binary variables are used to indicate the type of vehicle: sports car, T₁ (1=yes, 0=no); sport utility vehicle, T₂ (1=yes, 0=no); wagon, T₃ (1=yes, 0=no); and minivan, T₄ (1=yes, 0=no). In addition, two other binary variables T₅ (1=yes, 0=no) and T₆ (1=yes, 0=no) are used to indicate whether the car/truck is all-wheel drive and rear-wheel drive, respectively. Since the number of cylinders takes the values 3, 4, 5, 6, 8 and 12, we created 6 binary variables T₇, T₈, T₉, T₁₀, T₁₁ and T₁₂ to indicate whether the number of cylinders is 3, 4, 5, 6, 8 and 12, respectively. All these binary variables are treated as the explanatory variables in the linear part of the PLSiM. Moreover, 6 additional measurements are considered: engine size, X₁; horsepower, X₂; weight in pounds, X₃; wheel base in inches, X₄; length in inches, X₅; and width in inches, X₆. These continuous variables are taken as the index variables in the PLSiM. We have a total of 18 explanatory variables and 386 observations, obtained by removing 41 observations with missing values and one large outlier from the original data set of 428 observations. The PLSiM (2.1) investigated in this paper is applied to fit this dataset; the estimated coefficients, associated standard errors and p-values are given in Table 1.
We first applied the proposed testing procedure to check whether the assumed semiparametric PLSiM fits the data adequately. We treated "wheel base in inches (X₄)" as the leave-one-out component for implementing the EFM procedure to estimate θ. We generated 500 wild bootstrap replications and obtained the associated p-value 0.56, which indicates that the PLSiM is appropriate for fitting this dataset. After using the EFM method to estimate (α̂⊤, β̂⊤)⊤, we can estimate g(•). Figure 2 shows the scatter plot of Y − α̂⊤T against the estimated single-index variate β̂⊤X, along with the curve ĝ(•) estimated by local linear smoothing. It suggests that a nonlinear curve g(v) may be appropriate for this dataset. To support this conjecture, we also fit Y − α̂⊤T linearly against β̂⊤X and draw the resulting straight line in Figure 2; the line is not entirely contained within the 95% pointwise confidence bands of the nonlinear curve estimate ĝ(•). This again suggests that a linear regression is not adequate. Figure 2 also shows that a larger value of the estimated index variable, composed from the car's efficiency and power, yields a higher retail price. As a result, the PLSiM is sensible and useful for modeling the manufacturer's suggested retail price.

Discussions
We have proposed a convenient testing procedure to check whether a PLSiM fits a complex dataset properly, and studied the asymptotic properties of the proposed test statistic. Our numerical experiments indicate that the proposed testing procedure is a promising tool.

Fig 2. Scatter plot of Y − α̂⊤T against the estimated index β̂⊤X (squares); local linear estimated curve of ĝ(•) (solid line) with the associated 95% pointwise confidence intervals (broken lines), along with a linear regression fit (solid straight line).
For testing nonparametric or semiparametric models, the conventional CV, GCV and 5-fold procedures for bandwidth selection are generally suggested under the null hypothesis (Härdle et al.; 2004). However, how to select the bandwidth under the alternative hypothesis is not trivial and deserves further investigation, because the true model is generally unavailable. In our numerical experiments, we simply used the same bandwidth selection strategy as under the null hypothesis.
Some further directions from this line of work include: (i) allowing the dimension of the covariates to diverge with the sample size, and (ii) accommodating a response variable that is not continuous.
(C1) g(•) has two bounded and continuous derivatives.

(C2) The density function f_{V(β)}(•) of the random variable V(β) = β⊤X is bounded away from 0 on S_β for β in a neighborhood of β₀, and satisfies the Lipschitz condition of order 1 on S_β, where S_β = {β⊤x : x ∈ S} and S is a compact support set of X.

(C3) The kernel K is a bounded and symmetric density function with a bounded derivative.

In this article, we estimate the unknown parameters β and α and the unknown function g using the estimating function method (EFM) proposed by Cui et al. (2011). We first choose an identifiable parameterization by eliminating β₁: since ‖β‖ = 1 and β₁ > 0, the parameter space Θ can be rearranged to the form {((1 − ‖β̃‖²)^{1/2}, β̃⊤)⊤ : ‖β̃‖ < 1}, where β̃ = (β₂, ..., β_p)⊤.
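A minimal sketch of the identifiable parameterization obtained by eliminating β₁ (our illustration, following the construction in Cui et al. (2011); the function name is ours): under ‖β‖ = 1 and β₁ > 0, the first coordinate is recovered as (1 − ‖b‖²)^{1/2} from the remaining free coordinates b.

```python
import numpy as np

def beta_from_free(b):
    """Map a free parameter b with ||b|| < 1 to the full index vector
    beta = (sqrt(1 - ||b||^2), b), which has ||beta|| = 1 and beta_1 > 0."""
    b = np.asarray(b, dtype=float)
    if b @ b >= 1:
        raise ValueError("free parameter must satisfy ||b|| < 1")
    return np.concatenate(([np.sqrt(1 - b @ b)], b))

beta = beta_from_free([0.6, 0.0])
print(beta, np.linalg.norm(beta))   # a unit vector with positive first entry
```

This removes the scale and sign indeterminacy of the index, so the p − 1 free coordinates can be estimated without constraints, which is what makes the Jacobian J of size p × (p − 1) appear in the asymptotics.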

A.2.1. The kernel estimating functions for the nonparametric function g
For given β and α, we can estimate g(•) and g′(•) using local linear estimating functions. Let h be the bandwidth and K(•) be a kernel density function, with K_h(•) = h⁻¹K(•/h). Denote by a and b the values of g and g′ evaluated at v, respectively. The local linear approximation is g(V_i(β)) ≈ a + b(V_i(β) − v) for V_i(β) in a neighborhood of v. The estimators ĝ(v, ϖ) and ĝ′(v, ϖ) are obtained by solving the local estimating equations with respect to a and b. Let â(v, ϖ) and b̂(v, ϖ) be the resulting solutions at v; the local linear estimators of g(v) and g′(v) are ĝ(v, ϖ) = â(v, ϖ) and ĝ′(v, ϖ) = b̂(v, ϖ), respectively. Quantities of the form K_h^j(V_i(β) − v)(V_i(β) − v)^l, for j = 0, 1, 2 and l = 0, 1, appear in the analysis below.
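The local linear step can be sketched in least-squares form as follows. This is our simplification with a Gaussian kernel, not the paper's EFM code: the EFM solves weighted estimating equations, but the local design a + b(V_i(β) − v) is the same, and for squared-error loss the two coincide.

```python
import numpy as np

def local_linear(v, V, Y, h):
    """Local linear estimates (a, b) of (g(v), g'(v)) from data (V_i, Y_i)."""
    K = np.exp(-0.5 * ((V - v) / h) ** 2)           # kernel weights K_h(V_i - v)
    D = np.column_stack([np.ones_like(V), V - v])   # local design: a + b (V_i - v)
    A = D.T @ (K[:, None] * D)                      # weighted normal equations
    c = D.T @ (K * Y)
    a, b = np.linalg.solve(A, c)
    return a, b

rng = np.random.default_rng(0)
V = rng.uniform(-2, 2, size=2000)                   # simulated index values
Y = np.sin(V) + rng.normal(scale=0.1, size=2000)    # g(v) = sin(v) plus noise
a, b = local_linear(0.0, V, Y, h=0.2)
print(a, b)   # close to g(0) = 0 and g'(0) = 1
```

The intercept of the locally weighted fit estimates g(v) and the slope estimates g′(v); the slope estimate is the quantity whose mean squared error Lemma A.2 controls.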
Lemma A.2. Under Assumptions (C1)-(C3) and the null hypothesis H₀, if h → 0 and nh³ → ∞ as n → ∞, then for all β ∈ Θ we obtain an expansion for the asymptotic mean squared error of ĝ′(β⊤x, ϖ₀). Lemma A.2 is a consequence of the derivation of the bias and variance of ĝ′(β⊤x, ϖ₀), which is similar to that of Carroll et al. (1998). Following the same reasoning as the proof of (2.4) in Cui et al. (2011), one can obtain the following lemma.

A.4. Proofs of the main results
Proof of Proposition 2.1. Decompose √n R_nw(u) into its leading terms, which can be further decomposed as I_{n,1} + I_{n,2}. Recalling (A.4) and given nh⁴ → 0, we obtain the stated simplification. Proposition 2.1 then follows from the central limit theorem.
Then Theorem 3 can be proved by arguments similar to those in the proofs of Proposition 2.1 and Theorem 1.
Proof of Theorem 4. Under the alternative (2.8) with n^{1/2}δ_n → 1, the asymptotic expansion of √n R^b_nw(u) has a non-random shift Ω. It is easy to verify that the random symmetrization variable ε_i in √n R^b_nw(u) makes the effect of this non-random shift vanish. As a result, under the alternative (2.8) with n^{1/2}δ_n → 1, for R^b_nw(u) defined in (2.7), we have

√n R^b_nw(u) = n^{−1/2} Σ_{i=1}^n ε_i e_i(β₀, α₀, g) Π_u(Z_i) + o_p(1).
Theorem 4 can then be proved using arguments similar to those for Proposition 2.1 and Theorem 1.
Proposition 2.1. Under Assumptions (C1)-(C5) in the Appendix and the null hypothesis H₀, √n R_nw(u) converges to R_w(u) in the Skorohod space D[−∞, ∞]^{p+q}, where R_w(u) is a centered Gaussian process with covariance function given in (A.5).

Fig 1. Power calculations for Case 1 (upper panel) and Case 2 (lower panel) with n = 200 (dotted lines) and n = 400 (solid lines). Columns from left to right correspond to nominal levels 0.01, 0.025, 0.05, and 0.1, represented by horizontal lines, respectively.
… and h → 0 as n → ∞, in which h is the kernel bandwidth for estimating g(•).

(C5) E{J⊤g′(V)X̃}^⊗2 and E(T̃)^⊗2 are positive definite.

A.2. Estimation of β, α and g under the null hypothesis

For any vector ζ = (ζ₁, ..., ζ_s) ∈ R^s, denote the Euclidean norm by ‖ζ‖ = (Σ_{k=1}^s |ζ_k|²)^{1/2}. For positive numbers a_n and b_n, n ≥ 1, let a_n ∼ b_n denote lim_{n→∞} a_n/b_n = 1. Let I_d be the identity matrix of dimension d × d.

Table 1. Results for the real data example. *The absolute value is calculated as the absolute value of the coefficient divided by its standard error.