Estimation of a semiparametric transformation model: a novel approach based on least squares minimization

Consider the following semiparametric transformation model Λ_θ(Y) = m(X) + ε, where X is a d-dimensional covariate, Y is a univariate response variable and ε is an error term with zero mean that is independent of X. We assume that m is an unknown regression function and that {Λ_θ : θ ∈ Θ} is a parametric family of strictly increasing functions. Our goal is to develop two new estimators of the transformation parameter θ. The idea underlying both estimators is to minimize, with respect to θ, the L_2-distance between the transformation Λ_θ and one of its fully nonparametric estimators. We consider in particular the nonparametric estimator based on the least-absolute-deviation loss constructed in Colling and Van Keilegom (2018). We establish the consistency and the asymptotic normality of the two proposed estimators of θ. We also carry out a simulation study to illustrate and compare the performance of our new estimators to that of the profile likelihood estimator constructed in Linton et al. (2008).


Introduction
Transforming the data is a very common practice in statistics, used to improve the performance of a model or to make it easier to interpret. Transformation models appear in many different contexts, such as survival analysis and quantile regression. In survival analysis, we mention the seminal works of Cox (1972) and Bennett (1983), who introduced respectively the Cox proportional hazards model and the proportional odds model to examine the effect of covariates on the survival time. The 'Box-Cox quantile regression model', based on the Box and Cox (1964) transform, is very popular in quantile regression, see Buchinsky (1995), Machado and Mata (2000), Mu and He (2007), and Fitzenberger et al. (2010), among others.
Historically speaking, transformations of the response variable go back to the simple linear regression model Y = X^t β + ε, where Y is a dependent variable, X is a vector of explanatory variables, β is a vector of unknown regression parameters and ε is the error term. This model relies on strong assumptions, and the violation of one or several of them can lead to inconsistent or inefficient estimation of the corresponding parameters, as well as to wrong predictions of the response Y. As a possible solution to this problem, Box and Cox (1964) introduced a parametric family of power transformations and suggested that this power transformation, when applied to the response variable Y, might induce additivity of the effects, homoscedasticity and normality of the new error term, reduce skewness, and hence satisfy as closely as possible the assumptions of the new linear regression model. Note that the Box and Cox (1964) transformation also includes the logarithm, the square root, the inverse and the identity as special cases.
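As an illustration, the Box and Cox (1964) family and its special cases can be sketched in a few lines of Python (a minimal sketch of the standard textbook formula; the function name and the numerical checks are ours, not from the paper):

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox power transform of y > 0: (y**lam - 1)/lam if lam != 0, log(y) if lam == 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y**lam - 1.0) / lam

y = np.array([0.5, 1.0, 2.0, 4.0])
# lam = 0 is the logarithm, lam = 1 is the identity up to the shift y - 1,
# lam = 0.5 is the square root up to an affine rescaling, lam = -1 is the inverse up to 1 - 1/y.
log_case      = box_cox(y, 0.0)     # log(y)
identity_case = box_cox(y, 1.0)     # y - 1
sqrt_case     = box_cox(y, 0.5)     # 2*(sqrt(y) - 1)
inverse_case  = box_cox(y, -1.0)    # 1 - 1/y
```

The family is continuous in the parameter: for lam close to 0 the power branch approaches the logarithm, which is why the logarithm is treated as the lam = 0 member.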
The class of transformations introduced by Box and Cox (1964) has been generalized, see for example the Yeo and Johnson (2000) transform. We also mention the book of Carroll and Ruppert (1988), the review paper of Sakia (1992) and the papers of Zellner and Revankar (1969), John and Draper (1980), Bickel and Doksum (1981) and MacKinnon and Magee (1990) for more classes of transformations and more details on this topic.
In the literature on transformation models the regression function and the transformation of the response can be either parametric or nonparametric. The above mentioned papers all consider regression models which assume a parametric form for both functions. In the context of nonparametric transformations and parametric regression functions, we mention the work of Horowitz (1996), who proposed nonparametric estimators of the transformation and the cumulative distribution function of the error term and the work of Chen (2002), who proposed a rank-based estimator of the transformation that has the advantage of not involving nonparametric smoothing.
Next, in the context of fully nonparametric transformation models of the form Λ(Y) = m(X) + ε, where Λ(·) and m(·) are respectively an unknown transformation and an unknown regression function, Chiappori et al. (2015) and more recently Colling and Van Keilegom (2018) constructed fully nonparametric estimators of the transformation Λ. The main motivation of the estimators constructed in Colling and Van Keilegom (2018) with respect to the ones constructed in Chiappori et al. (2015) was to avoid kernel smoothing on Y, since this can work badly in practice if the distribution of Y is very skewed. Their main idea was to rewrite the transformation Λ(Y) as Γ(U), where Γ is an increasing function and U = F_Y(Y), with F_Y(·) the distribution function of Y, and to construct estimators based on kernel smoothing of U, which works globally better since U is uniformly distributed. We also mention the work of Breiman and Friedman (1985), who constructed an algorithm for estimating the different components of the same model when the regression function m is supposed to be additive.
In fully nonparametric contexts that are slightly different from that of the previous model, we would also like to mention the works of Horowitz (2001) and Jacho-Chavez et al. (2010) among others, who proposed nonparametric estimators of a generalized additive model with an unknown link function.
In this paper, we will focus on a model that assumes a parametric form for the transformation function, while the regression function is left unspecified, i.e., we will consider a semiparametric transformation model of the following form:

Λ_θ(Y) = m(X) + ε,   (1.1)

where m(·) is an unknown regression function, Λ_θ is a transformation belonging to a parametric family of strictly increasing functions and θ ∈ Θ, where Θ is a compact subset of R^k.
We will denote by θ_0 the true but unknown value of θ. Moreover, we assume that X is a d-dimensional covariate with compact support χ, Y is a univariate response variable with support Y and the error term ε has zero mean and is independent of X. We also introduce the notation m(x, θ) = E[Λ_θ(Y) | X = x] and ε(θ) = Λ_θ(Y) − m(X, θ), so that ε(θ_0) = ε. Finally, let F_X, F_{ε(θ)}, f_X and f_{ε(θ)} be the distribution and density functions of X and ε(θ). We assume that we have randomly drawn an iid sample (X_1, Y_1), ..., (X_n, Y_n) from model (1.1), where the components of X_i are denoted by (X_{i1}, ..., X_{id}) for i = 1, ..., n.

Linton et al. (2008) extensively studied the semiparametric transformation model (1.1) and proposed two estimation methods for the unknown true parameter vector θ_0: a profile likelihood method and a mean squared distance from independence method. Moreover, they established the asymptotic properties of these two estimators and showed in their simulation study that the profile likelihood estimator outperforms the other one. The main idea of the profile likelihood method is to maximize the log-likelihood function of the vector (X, Y) with respect to θ, after having replaced all unknown functions in the likelihood by nonparametric estimators. The profile likelihood estimator of θ is then defined by

θ̂_PL = argmax_{θ∈Θ} Σ_{i=1}^n [ log f̂_{ε(θ)}(Λ_θ(Y_i) − m̂(X_i, θ)) + log Λ'_θ(Y_i) ],   (1.2)

where m̂(·, θ) and f̂_{ε(θ)}(·) are suitable nonparametric estimators of m(·, θ) and f_{ε(θ)}(·) respectively and Λ'_θ(y) = (∂/∂y) Λ_θ(y).

In the literature we can find several other contributions on the semiparametric transformation model (1.1). First, we mention the work of Vanhems and Van Keilegom (2018), who studied the estimation of this model when some of the regressors are supposed to be endogenous.
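To fix ideas, model (1.1) can be simulated once a concrete parametric family is chosen. The sketch below uses a hypothetical family Λ_θ(y) = (e^{θy} − 1)/θ, a regression function m(x) = 2x − 1 and a uniform error, all of which are our illustrative choices rather than specifications from the paper:

```python
import numpy as np

# Hypothetical ingredients (ours, for illustration only):
# Lambda_theta(y) = (exp(theta*y) - 1)/theta, strictly increasing in y,
# m(x) = 2x - 1, eps ~ Uniform(-0.5, 0.5) so that 1 + theta*(m(X)+eps) > 0 below.
def lam(y, theta):
    return np.expm1(theta * y) / theta

def lam_inv(z, theta):
    return np.log1p(theta * z) / theta

rng = np.random.default_rng(1)
n, theta0 = 500, 0.5
X = rng.uniform(0.0, 1.0, n)        # d = 1 covariate with compact support [0, 1]
eps = rng.uniform(-0.5, 0.5, n)     # zero-mean error, independent of X
# Draw Y by inverting the transformation: Lambda_theta0(Y) = m(X) + eps.
Y = lam_inv(2.0 * X - 1.0 + eps, theta0)
```

Any estimator of θ_0 can then be evaluated against such synthetic samples, since the true parameter and the error distribution are known by construction.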
Next, we refer to Colling and Van Keilegom (2016), Colling and Van Keilegom (2017) and Kloodt and Neumeyer (2017), who developed tests for the parametric form of the regression function based on the error distribution function, the integrated regression function and a L 2 -distance between the nonparametric and the parametric fits of m, respectively, while Allison et al. (2018) and Kloodt and Neumeyer (2017) constructed significance tests for the explanatory variables in the model based on Fourier-type conditional expectations and on U -statistics, respectively. Moreover, Hušková et al. (2018) proposed tests for the validity of the model involving characteristic functions and Colling et al. (2015) and Heuchenne et al. (2015) studied nonparametric estimators of the error density and the error distribution respectively. Finally, we also mention the work of Neumeyer et al. (2016) who introduced estimators of the different components of a heteroscedastic transformation model and proved the asymptotic normality of these estimators.
In this paper, our goal is to construct two new estimators of the transformation parameter θ 0 in the context of a semiparametric transformation model of the form (1.1). These estimators will be competitors of the profile likelihood estimator θ P L introduced in (1.2).
The main idea of the new estimators of θ 0 is to minimize, with respect to θ, the L 2 -distance between the transformation Λ θ and one of its fully nonparametric estimators. In Section 2, we will explain in more detail the intuition behind our two new estimators of θ 0 , while we will give their exact definitions in Section 3. Next, in Section 4, we will present the theorems that establish the consistency and the asymptotic normality of these two estimators.
A simulation study comparing the performance of our new estimators with that of the profile likelihood estimator is performed in Section 5. Finally, Section 6 contains the technical assumptions and the proofs of the main results.

Main idea of the new estimators
As mentioned in the introduction, the main idea of the new estimators of the transformation parameter θ_0 is to minimize, with respect to θ, the L_2-distance between the transformation Λ_θ and one of its fully nonparametric estimators. Nonparametric estimators of the transformation have already been constructed in the literature, see Chiappori et al. (2015) and Colling and Van Keilegom (2018). Moreover, we explained in the introduction why the estimators constructed in Colling and Van Keilegom (2018) perform globally better than those constructed in Chiappori et al. (2015). The simulation studies performed in Chiappori et al. (2015) and Colling and Van Keilegom (2018) also show that a nonparametric estimator of the transformation based on the least absolute deviation loss performs better than a corresponding estimator based on the least squares loss, since the former is less sensitive to outliers. Consequently, we will use here the nonparametric estimator based on the least absolute deviation loss constructed by Colling and Van Keilegom (2018), which is, to the best of our knowledge, the nonparametric estimator of the transformation that performs best overall.
To construct this estimator, we need to assume that the true transformation Λ = Λ_{θ_0} satisfies Λ(0) = 0 and Λ(1) = 1. However, as we will see later, other identifiability constraints are possible as well. The latter condition on Λ fixes the location and the scale of the model, which is sufficient to identify the model. See Chiappori et al. (2015) and Colling and Van Keilegom (2018) for more details about the identification of the model. Following the same idea as in Colling and Van Keilegom (2018), we rewrite the transformation Λ(Y) as Λ(Y) = Γ(U), where Γ is an increasing function, U = T(Y) and

T(y) = (F_Y(y) − F_Y(0)) / (F_Y(1) − F_Y(0)),   (2.1)

with F_Y the distribution function of Y. Note that T(0) = 0 and T(1) = 1, and hence, combined with the imposed condition on Λ, we find that Γ(0) = 0 and Γ(1) = 1, i.e. Γ satisfies the same identification constraints as Λ.
We estimate the variable U by Û = T̂(Y), where T̂(y) = (F̂_Y(y) − F̂_Y(0))/(F̂_Y(1) − F̂_Y(0)) and F̂_Y(y) = n^{−1} Σ_{i=1}^n 1{Y_i ≤ y} is the empirical distribution function of Y_1, ..., Y_n. Next, to estimate the transformation Γ, first note that, for all x ∈ χ, as shown in Theorem 3.1 in Colling and Van Keilegom (2018),

Γ(u) = S_1(u, x) / S_1(1, x),   where   S_1(u, x) = ∫_0^u λ_1(w, x) dw   (2.3)

and λ_1(u, x) = −(∂ϕ(u, x)/∂u)/(∂ϕ(u, x)/∂x_1), where ϕ(u, x) = P(U ≤ u | X = x) is the conditional distribution of U given X, and x_1 is the first component of the vector x = (x_1, ..., x_d)^t. Hence, we can write Γ(u) as

Γ(u) = argmin_q ∫_χ ℓ(q − S_1(u, x)/S_1(1, x)) v(x) dx

for any positive weight function v(·) and loss function ℓ(·) satisfying ℓ(0) = 0. In particular, we can work with the loss ℓ(u) = u(2L_b(u) − 1), where L_b(·) = L(·/b), L is a given distribution function and b > 0 is a bandwidth sequence. This loss function is a smooth approximation of the absolute deviation loss ℓ(u) = |u| for small b. To estimate Γ(u), we replace the unknown function λ_1(u, x) by an appropriate kernel estimator λ̂_1(u, x), based on a univariate kernel K and bandwidth sequences h_u and h_x, and let Ŝ_1(u, x) = ∫_0^u λ̂_1(w, x) dw. Finally, define

Γ̂_{LAD,b}(u) = argmin_q ∫_χ ℓ(q − Ŝ_1(u, x)/Ŝ_1(1, x)) v(x) dx.

Consequently, a natural estimator of θ_0 is given by

θ̂ = argmin_{θ∈Θ} Σ_{i=1}^n [Λ_θ(Y_i) − Γ̂_{LAD,b}(T̂(Y_i))]^2.   (2.4)

However, it is important to recall that the estimator Γ̂_{LAD,b}(T̂(·)) has been constructed under the particular identification conditions Λ(0) = 0 and Λ(1) = 1. Certain classes of transformations do not satisfy these identification constraints. The class of Yeo and Johnson (2000) transformations, for example, satisfies Λ_θ(0) = 0 and Λ'_θ(0) = 1 instead of Λ_θ(1) = 1 for all θ ∈ Θ. Expression (2.4) would then lead to an inconsistent estimator of θ_0. In the next section we will explain in detail how we can adjust the estimator Γ̂_{LAD,b}(T̂(·)) with additive and multiplicative constants so that the corresponding adjusted estimator in (2.4) is consistent under identification conditions that are more general than Λ_θ(0) = 0 and Λ_θ(1) = 1 for all θ ∈ Θ.
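The smooth approximation ℓ(u) = u(2L_b(u) − 1) of the absolute deviation loss can be checked numerically. In the sketch below, L is taken to be the standard normal distribution function, which is one admissible choice of ours (the construction only requires L to be a given distribution function):

```python
import numpy as np
from math import erf

def L(t):
    # standard normal distribution function, one admissible choice for L
    return 0.5 * (1.0 + erf(t / np.sqrt(2.0)))

def smooth_lad(u, b):
    # ell(u) = u * (2*L(u/b) - 1): a smooth approximation of |u| for small b,
    # since 2*L(u/b) - 1 tends to sign(u) as b -> 0
    return u * (2.0 * L(u / b) - 1.0)

u = np.linspace(-2.0, 2.0, 9)
approx = np.array([smooth_lad(ui, 0.01) for ui in u])
# For b = 0.01 the loss is already numerically indistinguishable from |u|
# away from 0, while remaining differentiable at 0 (ell(0) = 0).
```

The smoothness of ℓ for b > 0 is what makes the minimization over q well behaved, while letting b shrink recovers the robustness of the absolute deviation loss.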
Another possibility to allow for other identification conditions would be to consider an estimator based on Γ̂*_{LAD,b}(T̂*(·)), where Γ̂*_{LAD,b}(·) and T̂*(·) are estimators of some suitable adaptations Γ*(·) and T*(·) of Γ(·) and T(·), depending on the particular identification conditions considered. However, the estimator Γ̂_{LAD,b}(T̂(·)) has several advantages. First, its asymptotic properties have already been developed in Colling and Van Keilegom (2018), which will facilitate the proofs in this paper. Second, we know that this estimator avoids kernel smoothing of Y, which is known to work badly in practice if the distribution of Y is skewed. This is not necessarily the case for Γ̂*_{LAD,b}(T̂*(·)). Indeed, if we consider the Yeo and Johnson (2000) transform for example, with Λ_θ(0) = 0 and Λ'_θ(0) = 1 as identification conditions, the adapted transformation T* involves the density function f_Y of Y, and so the estimation of T*(Y) will require kernel smoothing of Y. Finally, the expression of the estimators Γ̂*_{LAD,b}(·) and T̂*(·) depends on the imposed identification conditions, whereas our goal is to construct an estimator of θ_0 that is consistent under general identification conditions.
The following proposition forms the basis of the adjustment of expression (2.4) and takes into account the sets of identification conditions (I1) and (I1'). Assumptions (A1)-(A4) are given in the Appendix. Let U_0 be a compact subset of the interior of the support U of U, and let Y_0 be a compact set strictly included in T^{−1}(U_0).

Proposition 3.1. Assume (A1)-(A4) and either (I1) or (I1'). Then,

(Λ_{θ_0}(y) − Λ_{θ_0}(0)) / (Λ_{θ_0}(1) − Λ_{θ_0}(0)) = S_1(T(y), x) / S_1(1, x)   (3.1)

for all x ∈ χ and y ∈ Y_0, where T and S_1 are defined in (2.1) and (2.3) respectively. Moreover, the right hand side of (3.1) does not depend on x.
The proof of this proposition is given in Section 6.2. Consequently, using Proposition 3.1 and the fact that Γ̂_{LAD,b}(T̂(y)) is a nonparametric estimator of S_1(T(y), x)/S_1(1, x), it is natural to define the following estimator of θ_0:

θ̂_1 = argmin_{θ∈Θ} Σ_{i=1}^n w(Y_i) [Λ_θ(Y_i) − Λ_θ(0) − (Λ_θ(1) − Λ_θ(0)) Γ̂_{LAD,b}(T̂(Y_i))]^2,   (3.2)

where w is a certain positive weight function with support included in Y_0, which has been added to facilitate the proofs of the main asymptotic results presented in the next section. Moreover, if the transformation satisfies in particular Λ_θ(0) = 0 and Λ_θ(1) = 1 for all θ ∈ Θ, expression (3.2) equals expression (2.4) up to the weight function w. An alternative estimator can be obtained by letting the constants Λ_θ(1) − Λ_θ(0) and Λ_θ(0) in the expression of θ̂_1 be free parameters c_1 and c_2 that do not depend on θ. In that way we minimize our weighted L_2-distance over k + 2 parameters (k being the dimension of θ) instead of just k, which could lead to a better estimator of θ. Therefore, we define a second estimator θ̂_2 of θ_0 as the first k components of

γ̂_2 = (θ̂_2, ĉ_1, ĉ_2) = argmin_{(θ,c_1,c_2)} Σ_{i=1}^n w(Y_i) [Λ_θ(Y_i) − c_2 − c_1 Γ̂_{LAD,b}(T̂(Y_i))]^2.

If Γ̂_{LAD,b}(T̂(·)) performs well, ĉ_1 and ĉ_2 should be approximately equal to Λ_{θ̂_1}(1) − Λ_{θ̂_1}(0) and Λ_{θ̂_1}(0), and then θ̂_2 should perform similarly to θ̂_1. Otherwise, ĉ_1 and ĉ_2 can compensate for a poor estimator Γ̂_{LAD,b}(T̂(·)) by taking values far away from Λ_{θ̂_1}(1) − Λ_{θ̂_1}(0) and Λ_{θ̂_1}(0).
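Computationally, the second criterion is convenient because, for fixed θ, the free constants c_1 and c_2 enter linearly and can be profiled out by ordinary least squares. The sketch below illustrates this on synthetic data; the simple power family standing in for Λ_θ and the noisy normalized transformation standing in for the nonparametric estimate Γ̂_{LAD,b}(T̂(·)) are our illustrative assumptions, not objects from the paper:

```python
import numpy as np

def lam(y, theta):
    # simple strictly increasing family on y >= 0 (stand-in for Lambda_theta)
    return ((y + 1.0)**theta - 1.0) / theta

rng = np.random.default_rng(2)
n, theta0 = 200, 0.5
Y = rng.uniform(0.0, 3.0, n)
# Stand-in for the nonparametric estimate of (Lambda(y) - Lambda(0))/(Lambda(1) - Lambda(0)):
G = lam(Y, theta0) / lam(1.0, theta0) + rng.normal(0.0, 0.001, n)

def criterion(theta):
    # profile out (c1, c2) by linear least squares for this theta
    A = np.column_stack([G, np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, lam(Y, theta), rcond=None)
    resid = lam(Y, theta) - A @ coef
    return np.sum(resid**2)

grid = np.arange(0.3, 0.71, 0.05)
theta2 = grid[int(np.argmin([criterion(t) for t in grid]))]
```

In practice Γ̂_{LAD,b}(T̂(Y_i)) would replace the stand-in G, the weight w(Y_i) would multiply each squared residual, and the grid search over θ can be replaced by a numerical optimizer.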

Notations and definitions
Before establishing the main asymptotic results, we need to introduce several notations.
Finally, the reason for defining all these functions comes from the article of Chen et al. (2003). Indeed, Chen et al. (2003) proposed sufficient high-level conditions for the consistency and asymptotic normality of a class of semiparametric optimization estimators, which we will verify here for our estimators θ̂_1 and γ̂_2; see the proofs of Theorems 4.1 and 4.2 in Section 6.2. These sufficient conditions are mainly conditions on the class of functions H and either on the functions ℓ_1, M_1 and M_{n,1} for the estimator θ̂_1 or on the functions ℓ_2, M_2 and M_{n,2} for the estimator γ̂_2.

Consistency and asymptotic normality
The following theorems establish respectively the consistency and the asymptotic normality of θ̂_1 and γ̂_2. The assumptions under which these results are valid can be found in the Appendix.
Theorem 4.1. Assume (A1)-(A13). Then, under either (I1) or (I1') and (I2), (i) θ̂_1 →_P θ_0, and (ii) γ̂_2 →_P γ_0, where γ_0 = (θ_0, Λ_{θ_0}(1) − Λ_{θ_0}(0), Λ_{θ_0}(0)) is the true value of γ = (θ, c_1, c_2).

Theorem 4.2. Assume (A1)-(A13). Then, under either (I1) or (I1') and (I2),

(i) n^{1/2}(θ̂_1 − θ_0) →_d N(0, Ω_1) with Ω_1 = Δ_1^{−1} V_1 (Δ_1^{−1})^t, where Δ_1 is the k × k matrix of partial derivatives of M_1(θ, h_0) with respect to the components of θ, evaluated at θ_0, and the matrix V_1 is the asymptotic variance matrix obtained from the i.i.d. representation in Corollary 5.1 of Colling and Van Keilegom (2018);

(ii) n^{1/2}(γ̂_2 − γ_0) →_d N(0, Ω_2) with Ω_2 = Δ_2^{−1} V_2 (Δ_2^{−1})^t, where Δ_2 is the (k + 2) × (k + 2) matrix of partial derivatives of M_2(γ, h_0) with respect to the components of γ, evaluated at γ_0, and V_2 is defined analogously.

The proofs of these two theorems are given in Section 6.2. Note that the covariance matrices V_1 and V_2 are derived from the pathwise derivatives of the vectors M_1(θ_0, h_0) and M_2(γ_0, h_0) with respect to h. The exact expressions of these pathwise derivatives, as well as of their i.i.d. representations, are given in the proof of Theorem 4.2. Note also that Theorem 4.1(ii) implies that θ̂_2 is consistent for θ_0, and that Theorem 4.2(ii) implies that θ̂_2 is asymptotically normally distributed with variance-covariance matrix given by the lower k × k submatrix of the matrix Ω_2.

Simulations
In this section, we perform simulations in order to compare the performance of our new estimators θ̂_1 and θ̂_2 of the transformation parameter with that of the profile likelihood estimator θ̂_PL proposed by Linton et al. (2008) and defined in (1.2).
We consider four models of the form (1.1): Model 1 takes m(x) = 2x − 1 and ε ∼ N(0, 1); Model 2 takes m(x) = 2x − 1 and ε ∼ N(0, 0.5²); Model 3 takes m(x) = 2x − 1 and ε ∼ (2/√5)t_10; and Model 4 takes m(x) = 6x − 3 and ε ∼ N(0, 1). In each model, Λ_θ(Y) represents the Yeo and Johnson (2000) transformation, i.e.,

Λ_θ(y) = ((y + 1)^θ − 1)/θ if y ≥ 0 and θ ≠ 0;  log(y + 1) if y ≥ 0 and θ = 0;  −((1 − y)^{2−θ} − 1)/(2 − θ) if y < 0 and θ ≠ 2;  −log(1 − y) if y < 0 and θ = 2,

and X_1, ..., X_n are independent uniform random variables on [0, 1]. For each model we will also consider three sample sizes, n = 100, n = 200 and n = 300, and four values of the transformation parameter: θ_0 = 0, which corresponds to a logarithmic transformation, θ_0 = 0.5, which corresponds to a square root type transformation, θ_0 = 1, which corresponds to the identity, and θ_0 = 1.5.
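A direct implementation of the Yeo and Johnson (2000) transformation (the standard published formula) makes its properties Λ_θ(0) = 0 and Λ'_θ(0) = 1 easy to verify numerically; the sketch below is ours:

```python
import numpy as np

def yeo_johnson(y, theta):
    """Yeo-Johnson (2000) transform, defined branch by branch."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    if abs(theta) > 1e-12:
        out[pos] = ((y[pos] + 1.0)**theta - 1.0) / theta
    else:
        out[pos] = np.log1p(y[pos])            # theta = 0: log(y + 1)
    if abs(theta - 2.0) > 1e-12:
        out[~pos] = -((1.0 - y[~pos])**(2.0 - theta) - 1.0) / (2.0 - theta)
    else:
        out[~pos] = -np.log1p(-y[~pos])        # theta = 2: -log(1 - y)
    return out

y = np.linspace(-3.0, 3.0, 13)
# theta = 1 gives the identity on both branches; theta = 0 gives log(y + 1)
# on the positive part; every member passes through 0 with unit slope.
```

This makes concrete why the class satisfies the identification conditions Λ_θ(0) = 0 and Λ'_θ(0) = 1 rather than Λ_θ(1) = 1.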
The goal will be to analyze the influence on the bias and variance of the different estimators of the sample size n, the value of the transformation parameter θ_0, the variability of the regression function m(x) (by comparing Model 4 to Model 1), the variability of the error term (by comparing Model 2 to Model 1) and the distribution of the error term (by comparing Model 3 to Model 1). Note that, in Model 3, we consider ε ∼ (2/√5)t_10 instead of ε ∼ t_10 to ensure that Var(ε) = 1, exactly as in Model 1. In that case, if we observe some significant difference in the performance of the estimators between Models 1 and 3, we can be sure that it comes from the distribution, and not from the variability, of the error term.
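The scaling of the t_10 error in Model 3 can be verified directly: a t_ν variable has variance ν/(ν − 2), so multiplying t_10 by 2/√5 yields unit variance. A quick numerical check (the Monte Carlo part uses an arbitrary seed and sample size of ours):

```python
import numpy as np

nu = 10.0
var_t10 = nu / (nu - 2.0)          # variance of a t_10 variable: 10/8 = 1.25
scale = 2.0 / np.sqrt(5.0)         # the factor 2/sqrt(5) used in Model 3
var_scaled = scale**2 * var_t10    # (4/5) * (10/8) = 1, matching Var(eps) = 1 in Model 1

rng = np.random.default_rng(3)
eps = scale * rng.standard_t(nu, size=200_000)
mc_var = eps.var()                 # Monte Carlo check of the unit variance
```

The scaled t_10 error thus has the same mean and variance as the N(0, 1) error of Model 1, and differs only through its heavier tails.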
Next, exactly as in Colling and Van Keilegom (2018), we will work with the unsmoothed estimator of the transformation Λ(·). This estimator is arbitrarily close to the estimator Γ̂_{LAD,b}(T̂(·)) when the bandwidth b is close to zero, and is defined as the minimizer of q ↦ Σ_j v(x*_j) |q − Ŝ_1(u, x*_j)/Ŝ_1(1, x*_j)|, i.e. a weighted median of the ratios Ŝ_1(u, x*_j)/Ŝ_1(1, x*_j) over a grid of points x*_1, ..., x*_{N_x}. The smoothed estimator Γ̂_{LAD,b}(T̂(y)) is used in the theory in order to facilitate the proofs of the asymptotic properties, see Colling and Van Keilegom (2018). The grid points x*_1, ..., x*_{N_x} are generated between min_{1≤j≤n} X_j and max_{1≤j≤n} X_j, from which we remove the values x* for which the expression Ŝ_1(u, x*) diverges. This can happen if ∂ϕ̂(w, x*)/∂x_1 is very close to 0 for some w. Moreover, we also remove the x*-values that are within 0.01 of the values x* for which Ŝ_1(u, x*) diverges, even if the corresponding integrals do not diverge. Note that removing some x*-values is allowed, since it simply amounts to a particular choice of the weight function v. We consider the Epanechnikov kernel K(x) = (3/4)(1 − x²)1{|x| ≤ 1} and we select the bandwidths h_x and h_u by the classical normal reference rule for kernel density estimation, i.e., h_x = (40√π)^{1/5} σ̂_x n^{−1/5} and h_u = (40√π)^{1/5} σ̂_u n^{−1/5}, where σ̂_x and σ̂_u are the classical estimators of the standard deviation of X and U respectively. Finally, we take v(x) = 1 for all x such that Ŝ_1(u, x) does not diverge.
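The Epanechnikov kernel and the normal reference bandwidths used here are easy to code; note that (40√π)^{1/5} ≈ 2.34, the familiar normal reference constant for the Epanechnikov kernel. The sketch below (with arbitrary seeds and sample sizes of ours) checks both ingredients:

```python
import numpy as np

def epanechnikov(x):
    # K(x) = (3/4)(1 - x^2) on |x| <= 1, zero elsewhere
    x = np.asarray(x, dtype=float)
    return 0.75 * (1.0 - x**2) * (np.abs(x) <= 1.0)

def nr_bandwidth(sample):
    # normal reference rule: h = (40*sqrt(pi))^(1/5) * sigma_hat * n^(-1/5)
    sample = np.asarray(sample, dtype=float)
    return (40.0 * np.sqrt(np.pi))**0.2 * sample.std(ddof=1) * len(sample)**(-0.2)

x = np.linspace(-1.5, 1.5, 3001)
dx = x[1] - x[0]
kernel_mass = epanechnikov(x).sum() * dx   # should integrate to 1

rng = np.random.default_rng(4)
h100 = nr_bandwidth(rng.uniform(0, 1, 100))
h300 = nr_bandwidth(rng.uniform(0, 1, 300))  # smaller: h shrinks like n^(-1/5)
```

The n^{−1/5} rate is the classical optimal rate for second-order kernel density estimation, which is why the same rule is applied to both the X and the U directions.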
Consequently, as the Yeo and Johnson (2000) transformation satisfies Λ_θ(0) = 0 for all θ ∈ Θ, we approximate θ̂_1 and γ̂_2 respectively by

θ̂_1 = argmin_{θ∈Θ} Σ_{i=1}^n w(Y_i) [Λ_θ(Y_i) − Λ_θ(1) Γ̂_LAD(T̂(Y_i))]^2

and

γ̂_2 = argmin_{(θ,c_1,c_2)} Σ_{i=1}^n w(Y_i) [Λ_θ(Y_i) − c_2 − c_1 Γ̂_LAD(T̂(Y_i))]^2.

We have chosen to work with w(Y) = 1{Y ∈ Y_0}, where the compact set Y_0 is chosen large enough such that it contains (almost) all values in the sample.
Moreover, to compute the profile likelihood estimator θ P L introduced in (1.2), we use a Nadaraya-Watson estimator to estimate m(·, θ), with the same Epanechnikov kernel K as above and a bandwidth estimated by a cross-validation procedure, and we use a classical kernel density estimator to estimate f ε(θ) (·), with the Epanechnikov kernel K and a bandwidth estimated by the classical normal reference rule for kernel density estimation. The estimator θ P L is then obtained iteratively with the function optimize in R. We refer to Colling and Van Keilegom (2016) for more details on the implementation of this estimator.
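A stripped-down version of the profile likelihood criterion (1.2) can be assembled from these ingredients. The sketch below is our simplification rather than the implementation used in the paper: it uses Gaussian kernels and normal reference bandwidths throughout (whereas the paper uses the Epanechnikov kernel and a cross-validated bandwidth for the regression step) and a grid search instead of the function optimize in R:

```python
import numpy as np

def yeo_johnson(y, theta):
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    if abs(theta) > 1e-12:
        out[pos] = ((y[pos] + 1.0)**theta - 1.0) / theta
    else:
        out[pos] = np.log1p(y[pos])
    if abs(theta - 2.0) > 1e-12:
        out[~pos] = -((1.0 - y[~pos])**(2.0 - theta) - 1.0) / (2.0 - theta)
    else:
        out[~pos] = -np.log1p(-y[~pos])
    return out

def yj_deriv(y, theta):
    # derivative of the Yeo-Johnson transform with respect to y
    y = np.asarray(y, dtype=float)
    return np.where(y >= 0, (y + 1.0)**(theta - 1.0), (1.0 - y)**(1.0 - theta))

def yj_inverse(z, theta):
    # inverse transform (valid for 0 < theta < 2, sufficient for this simulation)
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = (theta * z[pos] + 1.0)**(1.0 / theta) - 1.0
    out[~pos] = 1.0 - (1.0 - (2.0 - theta) * z[~pos])**(1.0 / (2.0 - theta))
    return out

def profile_loglik(theta, X, Y):
    n = len(Y)
    Z = yeo_johnson(Y, theta)
    # Nadaraya-Watson estimate of m(x, theta) = E[Lambda_theta(Y) | X = x]
    hx = 1.06 * X.std() * n**(-0.2)
    W = np.exp(-0.5 * ((X[:, None] - X[None, :]) / hx)**2)
    resid = Z - W @ Z / W.sum(axis=1)
    # kernel density estimate of the residual density f_{eps(theta)}
    he = 1.06 * resid.std() * n**(-0.2)
    D = np.exp(-0.5 * ((resid[:, None] - resid[None, :]) / he)**2)
    fhat = D.mean(axis=1) / (np.sqrt(2.0 * np.pi) * he)
    # log-likelihood = density part + Jacobian part, as in (1.2)
    return np.sum(np.log(fhat)) + np.sum(np.log(yj_deriv(Y, theta)))

rng = np.random.default_rng(5)
n, theta0 = 200, 0.5
X = rng.uniform(0.0, 1.0, n)
Y = yj_inverse(2.0 * X - 1.0 + rng.normal(0.0, 0.2, n), theta0)

grid = np.arange(-0.5, 2.51, 0.25)
lls = np.array([profile_loglik(t, X, Y) for t in grid])
theta_pl = grid[int(np.argmax(lls))]
```

On such a simulated sample the criterion should be maximized near the true value θ_0 = 0.5; in practice the grid search would be refined or replaced by a numerical optimizer.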
Tables 1 to 4 show the bias, variance and mean squared error of the profile likelihood estimator θ̂_PL and of our new estimators θ̂_1 and θ̂_2 for all considered values of n and θ_0 when Models 1 to 4 are generated respectively, each of them obtained on the basis of 200 samples.

First, when the sample size n increases, the mean squared error of all estimators decreases, mainly due to a significant decrease of their variance, which is an expected outcome. Next, we observe that θ̂_2 outperforms θ̂_1 in all scenarios in terms of variance and in most scenarios in terms of bias, which could be expected since θ̂_2 offers more flexibility and freedom than θ̂_1, as explained in Section 3. Hence, we will concentrate our following analysis on the comparison between θ̂_PL and θ̂_2.

Next, when θ_0 increases from 0 to 1, we observe that the biases of θ̂_PL and θ̂_2 globally tend to decrease in absolute value while their variances tend to increase, which leads to a general increase in their mean squared errors. Conversely, when θ_0 increases from 1 to 1.5, the mean squared errors of θ̂_PL and θ̂_2 tend to decrease due to a decrease of their variances, even if their biases tend to increase in absolute value. This suggests that the parametric transformation is more difficult to estimate when the response Y is less variable. Indeed, a logarithmic transformation (θ_0 = 0) is easier to detect than, for instance, the identity transformation (θ_0 = 1), due to the presence of very high values in the sample Y_1, ..., Y_n.

By the same reasoning, it seems logical that both θ̂_PL and θ̂_2 perform better under Model 4 (m(x) = 6x − 3) than under Model 1 (m(x) = 2x − 1) in terms of bias and variance. Indeed, in Model 4, the regression function m(x) is more variable than in Model 1, which helps in estimating θ. Similarly, if we compare the results obtained under Model 1 to those obtained under Model 2, both θ̂_PL and θ̂_2 perform better when ε ∼ N(0, 1) than when ε ∼ N(0, 0.5²), especially in terms of variance, while the results are globally comparable in terms of bias. Consequently, a more variable error term also helps in estimating θ.

Moreover, under Model 4, θ̂_2 slightly outperforms θ̂_PL for n = 100 and n = 200, whereas θ̂_PL and θ̂_2 perform equivalently for n = 300. However, even though θ̂_PL and θ̂_2 perform globally equivalently under Model 4, θ̂_2 clearly outperforms θ̂_PL under Models 1 and 2 for all considered values of n and θ_0, which suggests that the profile likelihood estimator of Linton et al. (2008) suffers more than our new estimator when the model becomes more difficult to estimate.

Finally, if we compare the results obtained under Model 1 to those obtained under Model 3, it is clear that all the estimators perform better in terms of bias and variance when ε ∼ N(0, 1) than when ε ∼ (2/√5)t_10, for all considered values of n and θ_0. Consequently, the distribution of the errors also has an impact on the quality of the estimation of θ_0, which is again an expected conclusion. In particular, normally distributed errors help in estimating θ_0. However, when ε ∼ (2/√5)t_10, the estimator θ̂_2 clearly outperforms the profile likelihood estimator θ̂_PL, which suffers considerably under Model 3.

In conclusion, the new estimator θ̂_2 globally outperforms the estimators θ̂_1 and θ̂_PL, especially when the model becomes more difficult to estimate (Models 1 to 3). In the latter case, the performance of the profile likelihood estimator drops significantly and θ̂_1 also outperforms θ̂_PL.

Assumptions
The following conditions are needed for the main results of this paper. They are related to the distribution of ε, the transformations Γ and Λ θ , the regression function m, the kernels K and L, the bandwidths h x , h u and b, the joint density function of U and X, the weight functions v and w, the functions M 1 and M 2 , and the matrices ∆ 1 and ∆ 2 , defined in the statement of Theorem 4.2.
(A1) The distribution function F ε of ε is absolutely continuous and has a density f ε that is continuous on its support. Moreover, X and ε are independent and the support Y of Y is a connected subset of R.
(A2) The transformation Γ is strictly increasing and twice continuously differentiable on U 0 , where U 0 is a compact subset in the interior of U.
(A3) The regression function m is continuously differentiable.
(A7) The joint density function f_{Y,X} of (Y, X) is uniformly bounded and (m + 2)-times continuously differentiable on Y_0 × χ_0, where χ_0 ⊆ A_1 is the compact support of the weight function v(x) defined in (A9). We also assume that inf_{y∈Y_0, x∈χ_0} f_{Y,X}(y, x) > 0.

(A9) The weight function v has compact support χ_0 ⊆ A_1 with nonempty interior and satisfies ∫_{χ_0} v(x) dx = 1. Moreover, v is continuous on χ and is m-times continuously differentiable on χ_0.
(A10) The weight function w is positive, has support included in Y 0 and satisfies sup y∈Y 0 w(y) < ∞.
Assumptions (A1)-(A9) are basically the same as in Colling and Van Keilegom (2018) and are required since our proofs rely on the weak convergence of the estimator Γ LAD,b ( T (y)) that is established in the latter paper. Assumptions (A10)-(A11) are technical conditions that are related to the fact that we have to restrict to the compact subset Y 0 of Y. Finally, assumptions (A12)-(A13) are required for the application of Theorems 1 and 2 in Chen et al. (2003).

Proofs of the main results
In this section, we will prove the three main results of this paper. The first one justifies the definitions of our new estimators of θ 0 and the second and the third results establish respectively the consistency and the asymptotic distributions of these new estimators.
Proof of Proposition 3.1. The proof consists mainly in rewriting the function S 1 (u, x).
First, note that for u ∈ U_0, the conditional distribution of U given X can be rewritten as

ϕ(u, x) = P(U ≤ u | X = x) = P(Γ(U) ≤ Γ(u) | X = x) = P(ε ≤ Γ(u) − m(x) | X = x) = F_ε(Γ(u) − m(x)),

since Γ(u) = Λ(T^{−1}(u)) is strictly increasing for u ∈ T(Y_0), Γ(U) = Λ(Y) and X and ε are independent. Hence,

λ_1(u, x) = −(∂ϕ(u, x)/∂u) / (∂ϕ(u, x)/∂x_1) = −(f_ε(Γ(u) − m(x)) Γ'(u)) / (−f_ε(Γ(u) − m(x)) ∂m(x)/∂x_1) = Γ'(u) / (∂m(x)/∂x_1),

and consequently

S_1(T(y), x) / S_1(1, x) = (∫_0^{T(y)} Γ'(w) dw) / (∫_0^1 Γ'(w) dw) = (Γ(T(y)) − Γ(0)) / (Γ(1) − Γ(0)) = (Λ(y) − Λ(0)) / (Λ(1) − Λ(0)),

since Γ(T(y)) = Λ(y), Γ(0) = Λ(0) and Γ(1) = Λ(1). The last expression does not depend on x, which completes the proof.
Before proving the two main asymptotic results of this paper, we need a technical result regarding the estimator Γ̂_{LAD,b}(T̂(·)) and regarding the bracketing number N_{[]}(ε, H, ‖·‖_{L_2}) of the space H defined in Section 4.1, i.e. the smallest number of ε-brackets needed to cover the space H with respect to the norm ‖h‖_{L_2} = [E(h²(Y))]^{1/2}.

Proposition 6.1. Assume (A1)-(A9). Then, (i) the function Y_0 → R : y ↦ Γ̂_{LAD,b}(T̂(y)) belongs to H with probability tending to 1, and (ii) N_{[]}(ε, H, ‖·‖_{L_2}) ≤ exp(Kε^{−1}) for some K < ∞.
Proof. First, we will prove (i). It is clear that T̂ is monotone and that T̂(y) ∈ U_0 for n large and for y ∈ Y_0, since Y_0 is a compact set that is strictly included in T^{−1}(U_0) and since sup_y |T̂(y) − T(y)| = o_P(1). Hence, it suffices to show that Γ̂_{LAD,b} ∈ C¹_c(U_0) with probability tending to 1. Since sup_{u∈U_0} |Γ(u)| < ∞ and sup_{u∈U_0} |Γ'(u)| < ∞ thanks to assumption (A2), the result follows if we can show that sup_{u∈U_0} |Γ̂_{LAD,b}(u) − Γ(u)| = o_P(1) and sup_{u∈U_0} |Γ̂'_{LAD,b}(u) − Γ'(u)| = o_P(1). The former follows from Theorem 5.2 in Colling and Van Keilegom (2018), which establishes the weak convergence of the process √n(Γ̂_{LAD,b}(·) − Γ(·)) as a process defined on U_0. For the latter result, define

R_n(q_m, u) = ∫_χ ℓ(q_m − Ŝ_1(u, x)/Ŝ_1(1, x)) v(x) dx,

and note that by construction Q_n(Γ̂_{LAD,b}(u), u) = 0 for all u ∈ U_0, where Q_n(q_m, u) = (∂/∂q_m) R_n(q_m, u). Hence, the derivative of Q_n(Γ̂_{LAD,b}(u), u) with respect to u is also equal to 0, i.e.

(∂Q_n/∂q_m)(Γ̂_{LAD,b}(u), u) Γ̂'_{LAD,b}(u) + (∂Q_n/∂u)(Γ̂_{LAD,b}(u), u) = 0,

from which the uniform convergence of Γ̂'_{LAD,b} can be deduced.
We will now prove (ii). Every function h in the class H can be written as h = f ∘ g for some f ∈ C¹_c(U_0) and some monotone function g that maps Y_0 into the bounded set U_0. We will construct brackets for the space H by combining brackets for C¹_c(U_0) with brackets for the space of monotone and bounded functions. First, it follows from Corollary 2.7.2 in Van der Vaart and Wellner (1996) that the ε-bracketing number of C¹_c(U_0) with respect to the L_2-norm is bounded by N_1 ≤ exp(K_1 ε^{−1}) for some K_1 < ∞. Let f_{1,ℓ} ≤ f_{1,u}, ..., f_{N_1,ℓ} ≤ f_{N_1,u} be the N_1 brackets for the space C¹_c(U_0). Next, Theorem 2.7.5 in Van der Vaart and Wellner (1996) shows that the ε-bracketing number for the space of monotone and bounded functions with respect to the L_2-norm on Y_0 is bounded by N_2 ≤ exp(K_2 ε^{−1}) for some K_2 < ∞. Let g_{1,ℓ} ≤ g_{1,u}, ..., g_{N_2,ℓ} ≤ g_{N_2,u} be the N_2 brackets for the latter space. We will show that the bracketing number N_{[]}(ε, H, ‖·‖_{L_2}) of the space H is bounded by N_1 × N_2 ≤ exp((K_1 + K_2)ε^{−1}).
For a given 1 ≤ j ≤ N_1 and 1 ≤ k ≤ N_2, let

h_{j,k,ℓ}(y) = min_{g_{k,ℓ}(y) ≤ u ≤ g_{k,u}(y)} f_{j,ℓ}(u)   and   h_{j,k,u}(y) = max_{g_{k,ℓ}(y) ≤ u ≤ g_{k,u}(y)} f_{j,u}(u).

Then, it is clear that for all h ∈ H there exist 1 ≤ j ≤ N_1 and 1 ≤ k ≤ N_2 such that h_{j,k,ℓ}(y) ≤ h(y) ≤ h_{j,k,u}(y) for all y ∈ Y_0. Moreover, the squared L_2-length ‖h_{j,k,u} − h_{j,k,ℓ}‖²_{L_2} of each such bracket can be decomposed into four terms, and each of these four terms is easily seen to be bounded by a finite multiple of ε². This finishes the proof.
Proof of Theorem 4.1. The proof consists in verifying Conditions (1.1) to (1.5) in Theorem 1 in Chen et al. (2003). These conditions are mainly conditions on the functions ℓ_1, M_1 and M_{n,1} in case (i) and on the functions ℓ_2, M_2 and M_{n,2} in case (ii). All these functions are defined in Section 4.1.
Condition (1.4) in Chen et al. (2003) is directly ensured by the continuity of M_i(θ, h) in h, uniformly in θ, with respect to ‖·‖_{L∞}, where ‖·‖_{L∞} is the supremum norm over Y_0.
Finally, for Condition (1.5) in Chen et al. (2003) it suffices, by Lemma 1 in the latter paper, to show that the class of functions {y ↦ ℓ_{i,j}(y, θ, h) : θ ∈ Θ, h ∈ H} is P-Donsker (i = 1, 2), where P is the probability measure of Y.
Then, it is clear that ℓ_{1,j,k_1,k_2,ℓ}(y) ≤ ℓ_{1,j}(y, θ, h) ≤ ℓ_{1,j,k_1,k_2,u}(y) for all y. Next, consider the length of the bracket [ℓ_{1,j,k_1,k_2,ℓ}, ℓ_{1,j,k_1,k_2,u}], whose bound involves the constant c̄ defined at the beginning of Section 4.1, which is a uniform upper bound for h ∈ H, and Ȧ_j(θ) = (∂/∂θ) A_j(θ) (and similarly for the other functions). Hence, using the Cauchy-Schwarz inequality, this length is bounded by a finite multiple of ε² by assumptions (A10) and (A11). It now follows that the integral in (6.1) is finite.
This condition is easily seen to hold true thanks to assumptions (A10) and (A11) and calculations similar to those leading to (6.2) above.
Proof of Theorem 4.2. The proof consists in verifying the conditions of Theorem 2 in Chen et al. (2003). In the proof of Theorem 4.1, we showed that Lemma 1 in Chen et al. (2003) is verified in our case. This lemma is not only sufficient for Condition (1.5) but also for Condition (2.5), see Remark 2 in Chen et al. (2003).
Finally, we verify Condition (2.6) in Chen et al. (2003). At the end of Section 4.1, we justified that M_{n,1}(θ_0, h_0) = 0 and M_{n,2}(γ_0, h_0) = 0. Moreover, in case (i) and for j = 1, ..., k, we can expand √n M_{n,1,j}(θ_0, ĥ_b) around h_0. Using Corollary 5.1 in Colling and Van Keilegom (2018), we have ĥ_b(y) − h_0(y) = n^{−1} Σ_{i=1}^n ϕ^v_{X_i,Y_i}(y) + o_P(n^{−1/2}) uniformly in y ∈ Y_0. Consequently, the expansion is valid by assumptions (A10) and (A11) and the fact that E(h_0(Y)) < ∞, and the last expression is a sum of i.i.d. terms. Hence, we conclude the proof of case (i) using the multivariate central limit theorem and the fact that E(ϕ^v_{X,Y}(y)) = 0 for all y by Corollary 5.1 in Colling and Van Keilegom (2018). Similarly, in case (ii), we obtain an analogous i.i.d. representation, and it suffices to apply again the multivariate central limit theorem to conclude the proof of this theorem.