Adaptive estimation in the supremum norm for nonparametric mixtures of regressions

Abstract: We investigate a flexible two-component semiparametric mixture of regressions model, in which one of the conditional component distributions of the response given the covariate is unknown but assumed symmetric about a location parameter, while the other is specified up to a scale parameter. The location and scale parameters together with the proportion are allowed to depend nonparametrically on covariates. After settling identifiability, we provide local M-estimators for these parameters which converge in the sup-norm at the optimal rates over Hölder-smoothness classes. We also introduce an adaptive version of the estimators based on the Lepski method. Sup-norm bounds show that the local M-estimator properly estimates the functions globally, and are the first step in the construction of useful inferential tools such as confidence bands. In our analysis we develop general results about rates of convergence in the sup-norm as well as adaptive estimation of local M-estimators which might be of some independent interest, and which can also be applied in various other settings. We investigate the finite-sample behaviour of our method in a simulation study, and provide an illustration with a real data set from bioinformatics.


Introduction
Practitioners are frequently interested in modelling the effect of a d-dimensional explanatory vector X on a response random variable Y by using a regression model estimated from a random sample (X_i, Y_i)_{1≤i≤n} of (X, Y). To allow varying parameters for different groups of observations, finite mixtures of regressions (FMRs) have been suggested in the literature. Statistical inference for parametric FMR models using a moment generating function method was first introduced by Quandt and Ramsey (1978). An approach based on the expectation-maximization (EM) algorithm was suggested by De Veaux (1989) in the two-component case. Zhu and Zhang (2004) established the asymptotic theory for testing for the number of components in parametric FMR models. More recently, Städler et al. (2010) proposed an ℓ1-penalized method based on a Lasso-type estimator for a high-dimensional FMR model with d ≫ n.
To gain further flexibility, various authors suggested the use of semiparametric FMR models. Hunter and Young (2012) studied the identifiability of an m-component semiparametric FMR model and numerically investigated an EM algorithm for estimating its parameters. Asymptotic normality of a semiparametric estimator in a two-component mixture of linear regressions has also been established. Huang and Yao (2012) and Huang et al. (2013) considered semiparametric linear and nonlinear FMR models with Gaussian noise in which means, variances and mixing proportions depend on covariates nonparametrically. They also established the asymptotic normality of their local maximum likelihood estimator and investigated a modified EM-type algorithm. Recently, Butucea et al. (2017) proposed a Fourier-based approach to deal with a new semiparametric topographical mixture model able to capture the characteristics of dichotomously shifted response-type experiments. See also Compiani and Kitamura (2016) for an overview on semiparametric mixtures with a focus on econometric applications.
In this paper we investigate a two-component FMR model, in which one of the conditional component distributions is unknown but assumed symmetric about a location parameter µ, while the other is specified up to some scale parameter σ. The location parameter µ, the scale parameter σ as well as the proportion p are allowed to depend nonparametrically on the covariates. After settling identifiability, we provide local M-estimators for these parameters which converge in the sup-norm at the optimal rates over Hölder-smoothness classes. We also introduce an adaptive version of the estimators based on the Lepski method, see Lepskii (1992). Sup-norm bounds show that the local M-estimator properly estimates the functions globally, and allow for a slight additional smoothing of the estimated functions in order to obtain continuous estimates without deteriorating the rates of convergence. Further, uniform rates are the first step in the construction of confidence bands, which are a very useful inferential tool, see e.g. Chernozhukov et al. (2014). Inspired by Butucea et al. (2017), the contrast that we use in the estimation procedure is based on characteristic functions, thus simplifying the approach in Bordes and Vandekerkhove (2010), which requires an additional smoothing when building the contrast. We also develop general useful technical tools based on the Bernstein inequality in Giné et al. (2000) when the contrast has the form of a U-statistic. In our analysis we develop general results about rates of convergence in the sup-norm as well as adaptive estimation of local M-estimators which might be of some independent interest, and which can also be applied in various other settings, e.g. to the models in Butucea et al. (2017) or in Huang and Yao (2012). The paper is organized as follows. In Section 2 we formally introduce the model. Section 3 deals with identifiability of the parameters, for which we provide some general results. Section 4 introduces the estimation methodology and in particular develops the contrast function underlying the M-estimator. In Section 5 we obtain optimal rates of convergence in the sup-norm for our estimators, while Section 6 deals with adaptivity. In Section 7 we provide results of some numerical experiments, and also analyze the ChipMix data set from Martin-Magniette et al. (2008), which was previously analyzed using linear FMRs. Section 8 presents our general theory for local M-estimators as well as technical tools for contrasts in the form of U-statistics. Finally, Sections 9–12 contain the technical proofs.

Two-component mixture of location-scale regressions
We consider the following nonparametric regression model for sequences of independent and identically distributed (i.i.d.) random vectors (X_i)_{i∈N} supported on a compact set I ⊂ R^d, d ≥ 1, and i.i.d. random variables (Y_i)_{i∈N}, (W_i)_{i∈N}, (ε_{1,i})_{i∈N} and (ε_{2,i})_{i∈N}. The explanatory variables X_i and the response variables Y_i are assumed to be observable, while the latent variables W_i and the error variables ε_{1,i} and ε_{2,i} are not. The covariates X_i are assumed to have a probability density function (pdf), denoted by ℓ : I → (0, ∞), with respect to the Lebesgue measure. The unknown location and scaling functions µ : I → R and σ : I → (0, ∞) partially determine the distributional relationship between the explanatory and response variables, along with the unknown mixing function p : I → (0, 1). Finally, conditionally on {X_i = x}, the variables W_i are assumed to have a Bernoulli distribution with parameter p(x), that is, P(W_i = 1 | X_i = x) = p(x) and P(W_i = 0 | X_i = x) = 1 − p(x).
Further we assume that conditionally on {X_i = x}, the variables ε_{1,i} and ε_{2,i} have zero-symmetric conditional pdfs, denoted respectively f_x and f̄, where f̄ is assumed to be known and not to depend on x, while f_x is unknown and may depend on x. If we furthermore have the conditional independence relations ε_{1,i} ⊥⊥ W_i | X_i and ε_{2,i} ⊥⊥ W_i | X_i, the random vectors (Y_i, X_i) have the joint density

f_{(Y,X)}(y, x) = ℓ(x) [ p(x) f_x(y − µ(x)) + (1 − p(x)) σ(x)^{-1} f̄(y/σ(x)) ],   (2.1)

where the functional parameter ϑ(x) = (p(x), σ(x), µ(x), f_x) collects all the x-local quantities to be estimated from the data.
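For concreteness, the data-generating mechanism can be sketched in a few lines of Python. All specific choices below (the uniform covariate design, the logistic form of p, the sinusoidal µ, the Laplace f_x and the Gaussian f̄) are hypothetical illustrations and not part of the model:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter functions; the model only requires smooth functions
# with p(x) in (0,1), sigma(x) > 0 and mu(x) bounded away from 0.
p     = lambda x: 0.3 + 0.4 / (1.0 + np.exp(-x))   # mixing proportion p(x)
mu    = lambda x: 2.0 + np.sin(x)                  # location mu(x), first component
sigma = lambda x: 0.5 + 0.1 * np.cos(x)            # scale sigma(x), second component

def simulate(n):
    # With probability p(x):   Y = mu(X) + eps_1,    eps_1 ~ f_x (here Laplace);
    # with probability 1-p(x): Y = sigma(X) * eps_2, eps_2 ~ f-bar (here N(0,1)).
    X = rng.uniform(-2.0, 2.0, size=n)    # covariate with density ell on I = [-2, 2]
    W = rng.binomial(1, p(X))             # latent Bernoulli(p(X)) labels
    eps1 = rng.laplace(0.0, 1.0, size=n)  # unknown zero-symmetric error f_x
    eps2 = rng.normal(0.0, 1.0, size=n)   # known error f-bar with unit variance
    Y = W * (mu(X) + eps1) + (1 - W) * sigma(X) * eps2
    return X, Y                           # only (X_i, Y_i) are observed

X, Y = simulate(5000)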

Identifiability
Regarding the identifiability problem, it is enough to consider model (2.1) without covariates, since we aim to identify the various parameter functions for each given value of x. Our identification strategy and results are similar to those in Bordes et al. (2006) and Hohmann and Holzmann (2013). We suppose that both the known pdf f̄ as well as the unknown pdf f are zero-symmetric and have finite third-order moments. Hence, we consider mixtures of the form

f_mix(y; ϑ) = p f(y − µ) + (1 − p) σ^{-1} f̄(y/σ).   (3.1)

Note that we may assume that f̄ is normalized, that is, ∫ y² f̄(y) dy = 1. In the following we provide two sets of identifiability assumptions. The results rely on the symmetry of the component pdfs; indeed, f is symmetric if and only if its characteristic function, or Fourier transform, is real-valued. Our first assumption imposes a constraint on the true mixing parameter p* but requires only mild conditions on the component pdfs f̄ and f*.
The second assumption does not impose a restriction on the mixing parameter but rather depends on the relationship between the two component densities f̄ and f*. That is, the characteristic functions of these densities need to be distinguishable in the tails in one of the following ways.
Condition 1. We consider the following two conditions:
(C1) For all sufficiently large t ∈ R it holds that ϕ_{f*}(t) ≠ 0, and for all σ > 0 we have ϕ_{f̄}(σt)/ϕ_{f*}(t) → 0 as t → ∞.
(C2) For all sufficiently large t ∈ R it holds that ϕ_{f*}(t) ≠ 0 and ϕ_{f̄}(t) ≠ 0, and for all σ > 0 we have ϕ_{f*}(t)/ϕ_{f̄}(σt) → 0 as t → ∞.
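For instance, if f* is the standard Laplace density, with characteristic function 1/(1 + t²), and f̄ the standard normal density, with characteristic function exp(−t²/2), then ϕ_{f̄}(σt)/ϕ_{f*}(t) → 0 for every σ > 0, since a Gaussian tail is dominated by a polynomial one; the following short Python check illustrates this numerically:

import numpy as np

phi_laplace = lambda t: 1.0 / (1.0 + t**2)    # cf of the standard Laplace density
phi_gauss   = lambda t: np.exp(-t**2 / 2.0)   # cf of the standard normal density

t = np.array([1.0, 2.0, 4.0, 8.0])
for s in (0.5, 1.0, 2.0):
    print(f"sigma={s}:", phi_gauss(s * t) / phi_laplace(t))
# The ratio tends to 0 for every sigma > 0: the Gaussian tail of phi_fbar(sigma t)
# is negligible against the polynomial tail of phi_f*(t), so the two component
# densities are distinguishable in the tails.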

Estimation Methodology
We first present our estimation methodology in the model (3.1) without covariates. The approach of building a contrast function based on the Fourier transform is inspired by Butucea and Vandekerkhove (2014). In particular, as opposed to Bordes et al. (2006) and Bordes and Vandekerkhove (2010), we do not require an additional smoothing parameter for the indicator to obtain a smooth contrast function. Hence, in this restricted setting our approach yields asymptotically normally distributed estimators at the √n-rate without additional smoothing. Specifically, assume first that the observations Y_j have density f_mix(y; ϑ*) as in (3.1), where f̄, f* ∈ E₃ and Assumption 1 or 2 is imposed globally. The characteristic function of f_mix(·; ϑ*) is given by

ϕ_mix(t; ϑ*) = p* e^{itµ*} ϕ_{f*}(t) + (1 − p*) ϕ_{f̄}(σ* t).

The asymptotic contrast is given by

M(θ; γ) = ∫_R ( E_γ[ H(Y, t, θ) ] )² q(t) dt,

where again E_γ denotes the expectation with respect to the distribution P_γ, which is the probability measure from the underlying statistical model.
In order to estimate the contrast M, we use a U-statistic-type estimator localized at x,

M_n(θ, x; h) = (n(n−1))^{-1} Σ_{i≠j} ∫_R H(Y_i, t, θ) H(Y_j, t, θ) q(t) dt · h^{−2d} K((x − X_i)/h) K((x − X_j)/h),   (4.6)

where K : R^d → R is a kernel function and h ∈ (0, ∞) is a bandwidth parameter. The estimator θ̂_n : I → R³ of the parameter function θ*(·) is then defined as the pointwise minimizer of (4.6), that is,

θ̂_n(x; h) ∈ argmin_{θ∈Θ} M_n(θ, x; h),   (4.7)

where Θ is a suitable compact subset of (0, 1) × (0, ∞) × R\{0} that we specify below.
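A compact numerical sketch of this estimator follows (Python, continuing the simulated sample (X, Y) from the sketch in Section 2). The function H below is one plausible choice consistent with the symmetry arguments used in Sections 9 and 10, namely the imaginary part of e^{−itµ}(e^{itY} − (1 − p)ϕ_{f̄}(σt)); the paper's exact H in (4.3) may differ, and the discretization of the weight q is a crude Riemann sum:

import numpy as np
from scipy import optimize

phi_fbar = lambda t: np.exp(-t**2 / 2.0)   # cf of the known density f-bar, here N(0,1)

def H(y, t, theta):
    # Plausible H(y, t, theta): the imaginary part of
    #   exp(-i t mu) * (exp(i t y) - (1 - p) phi_fbar(sigma t)),
    # which has mean zero at the true theta by symmetry of f*.
    p, s, m = theta
    return np.sin(t * (y[:, None] - m)) + (1 - p) * phi_fbar(s * t) * np.sin(t * m)

def Mn(theta, x, X, Y, h, t_grid, q_w):
    # Localized U-statistic contrast, cf. (4.6), for d = 1 with triangular kernel.
    K = lambda u: np.clip(1.0 - np.abs(u), 0.0, None)
    w = K((x - X) / h) / h                 # kernel weights h^{-1} K((x - X_i)/h)
    Ht = H(Y, t_grid, theta)               # shape (n, len(t_grid))
    S, S2 = w @ Ht, (w**2) @ (Ht**2)       # full sum and diagonal terms
    n = len(X)
    U = (S**2 - S2) / (n * (n - 1))        # sum over i != j, as a function of t
    return np.sum(U * q_w)                 # Riemann sum against the weight q

t_grid = np.linspace(-5.0, 5.0, 81)        # crude discretization of q = N(0, 1)
q_w = np.exp(-t_grid**2 / 2) / np.sqrt(2 * np.pi) * (t_grid[1] - t_grid[0])
theta_hat = optimize.minimize(
    Mn, x0=np.array([0.5, 0.8, 1.5]), args=(0.0, X, Y, 0.3, t_grid, q_w),
    bounds=[(0.05, 0.95), (0.1, 5.0), (0.1, 5.0)], method="L-BFGS-B").x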

Optimal rate of convergence in the supremum norm
In this section we derive the convergence rate of the estimator θ̂_n(x; h) for the underlying parameter functions p*(·), µ*(·), σ*(·) over Hölder smoothness classes. We focus on the supremum-norm error for the following reasons. First, although the estimator is defined as a pointwise minimizer in (4.7), convergence in the sup-norm shows that it properly estimates the parameter functions p*(·), µ*(·), σ*(·) in a global way. Second, a sup-norm bound allows one to slightly smooth the estimated functions in order to obtain continuous estimates, without deteriorating the rates of convergence. Third, uniform rates are the first step in the construction of confidence bands, which are a very useful inferential tool, see e.g. Chernozhukov et al. (2014). Our technical analysis is based on general results for local M-estimators obtained in Section 8.1, and hence is quite different from that in Butucea et al. (2017), who prove pointwise asymptotic normality using undersmoothing. Indeed, our approach could also be applied to their model to obtain results similar to Theorems 5.1 and 6.1 below.
We investigate estimation over Hölder smoothness classes of functions. Denote by H(α, L, U) the set of Hölder smooth functions on I with Hölder parameter α > 0 and Hölder constant L > 0 taking values in some set U, that is, functions g : I → U which are ⌊α⌋-times continuously differentiable and whose partial derivatives of order ⌊α⌋ are Hölder continuous with exponent α − ⌊α⌋ and constant L. Here ⌊α⌋ = max{k ∈ N₀ | k < α}, and we use the standard multi-index notation for multivariate derivatives, i.e. for k = (k₁, …, k_d) ∈ N₀^d we set |k| = k₁ + ⋯ + k_d and ∂^k = ∂^{|k|}/(∂x₁^{k₁} ⋯ ∂x_d^{k_d}). We suppose that p*(·), µ*(·), σ*(·) and ℓ(·) are Hölder smooth with the same parameters α and L > 0. Specifically, for given compact rectangular sets U_p ⊂ (0, 1), U_σ ⊂ (0, ∞), U_µ ⊂ R\{0} and U_ℓ ⊂ (0, ∞), the parameter set Γ(α) in (5.1) consists of those γ for which p* ∈ H(α, L, U_p), σ* ∈ H(α, L, U_σ), µ* ∈ H(α, L, U_µ) and ℓ ∈ H(α, L, U_ℓ). We shall take Θ in the definition (4.7) of the estimator θ̂_n(x; h) to be Θ = U_p × U_σ × U_µ. Note that we have excluded the conditional density f*_x of ε_1 given X = x from the parameter set. Indeed, we of course do not assume that it is known, but in their present form the rates are not uniform with respect to this parameter. Extensions are possible but would result in still higher technical complexity.
(K3) The probability density q has a finite third absolute moment and is bounded.
Thus, the estimator θ̂_n(·; h_n) attains the convergence rate (log n/n)^{α/(2α+d)} in the sup-norm, in probability, uniformly over the parameter set Γ(α). A classic result of Stone (1982) states that this rate is optimal for nonparametric regression in d dimensions over Hölder smoothness classes. The proof of Theorem 5.1, which relies on the theory presented in Section 8.1, is given in Section 10.
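To give a feeling for this rate, a short computation (with the arbitrary illustrative choice n = 10,000) shows how it deteriorates with the dimension d and improves with the smoothness α:

import numpy as np

n = 10_000                                  # arbitrary illustrative sample size
for d in (1, 2, 5):
    for alpha in (1.0, 2.0, 4.0):
        rate = (np.log(n) / n) ** (alpha / (2 * alpha + d))
        print(f"d={d}, alpha={alpha}: rate = {rate:.4f}")
# Larger alpha (smoother functions) and smaller d give faster rates; as
# alpha -> infinity the exponent approaches 1/2, the near-parametric rate.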

Adaptive estimation
In Theorem 5.1, the choice of the bandwidth h_n ∼ (log n/n)^{1/(2α+d)} requires a-priori knowledge of the smoothness parameter α. In this section we shall make the estimator θ̂_n(x; h) in (4.7) adaptive with respect to this parameter by using the Lepski method, see Lepskii (1992), Lepski et al. (1997) and Golubev et al. (2000). We shall use an indirect approach and choose an adaptive bandwidth based on the gradients

S(θ, x; γ) = ∂_θ M(θ, x; γ),   S_n(θ, x; h) = ∂_θ M_n(θ, x; h),   (6.1)

of the contrast functions in (4.5) and (4.6), where ∂_θ = (∂_p, ∂_σ, ∂_µ)ᵀ. We let h(α) = h_n(α) = (log n/n)^{1/(2α+d)} and r(α) = r_n(α) = h(α)^α, which we consider over the grid of smoothness parameters α_k = a + k(b − a)/N, k = 0, …, N, where N = ⌈log n⌉ = min{k ∈ N | k > log n}, and set h_k = h(α_k), r_k = r(α_k). For a sufficiently large constant C_Lep < ∞ we consider the Lepski choice

k̂ = max{ k ∈ {0, …, N} : sup_{θ∈Θ, x∈I} ‖S_n(θ, x; h_k) − S_n(θ, x; h_j)‖ ≤ C_Lep r_j for all j < k },

which leads to the estimator θ̂_n^{ad}(x) = argmin_{θ∈Θ} M_n(θ, x; h_{k̂}).
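A sketch of this selection rule (in Python, following the comparison rule as reconstructed above; stat(α) stands for the vector of gradient values S_n(θ, x; h(α)) over a user-chosen grid of (θ, x), which must be supplied) is:

import numpy as np

def lepski_select(stat, alphas, r, C_lep):
    # stat(alpha): array of gradient values S_n(theta, x; h(alpha)) over a
    # user-chosen grid of (theta, x); r(alpha): the rate r_n(alpha).
    stats = [np.asarray(stat(a)) for a in alphas]
    k_hat = 0
    for k in range(len(alphas)):            # alphas sorted increasingly
        if all(np.max(np.abs(stats[k] - stats[j])) <= C_lep * r(alphas[j])
               for j in range(k)):          # compare with all coarser choices
            k_hat = k
        else:
            break                           # stop at the first violation
    return alphas[k_hat]

n, a, b = 10_000, 0.5, 4.0
N = int(np.ceil(np.log(n)))                 # grid size N = ceil(log n)
alphas = [a + k * (b - a) / N for k in range(N + 1)]
h = lambda alpha: (np.log(n) / n) ** (1.0 / (2 * alpha + 1))   # d = 1
r = lambda alpha: h(alpha) ** alpha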
In order to make use of the highest possible smoothness order b, we need the following assumption.

(K2′) The kernel K is of order b.
The proof of this theorem, which is given in Section 10, is again based on a general adaptivity result for local M-estimators obtained in Section 8.1.

Simulations
We propose in this section to investigate the finite-sample properties, in the supremum-norm sense, of the functional estimator θ̂_n(x; h_n) = (p̂_n(x), σ̂_n(x), µ̂_n(x)) over two models (M1) and (M2) described below in dimension d = 1. Common to both models, we make the following choices. Notice that we set the variance of ε_{1,i} | {X_i = x} equal to 1/4 in both models (M1) and (M2) for a fair comparison. Identifiability of model (M1) is also guaranteed by Theorem 3.1, since Assumption 1 is satisfied. The density q in the empirical contrast M_n(ϑ, x; h) in (4.6) is the N(0, 1) density, the kernel is the triangular kernel K(x) = (1 − |x|) I_{[−1,1]}(x), and the bandwidth h_n is of the form given in (7.1), where κ is a smoothness/scaling parameter whose influence is to be tested. The general form within the brackets in (7.1) is a sort of rule of thumb. Thus we refrain from implementing the Lepski search and instead manually investigate the influence of the bandwidth over a suitable grid of values. The initialization for model (M1) is done at p_initial(x) = p(x) + Unif(−0.1, 0.1), µ_initial(x) = µ(x) + Unif(−0.25, 0.25) and σ_initial(x) = σ(x) + Unif(−0.1, 0.1). We compute our estimator θ̂_n(·; h_n) over a testing grid, which is basically the interval [−5, 10] divided into cells of size 0.05.

In Figure 1, respectively Figure 4, we present the behavior of the supremum-error distribution over the location, scaling and proportion functions in model (M1), respectively model (M2), for n = 4,000, 10,000 and 20,000 and the values of κ defined in (7.1). In Figure 2, resp. 3, we display a single-run performance for n = 4,000, resp. n = 10,000, under κ = 4 and 10, to illustrate the influence of the sample size n and the scaling parameter κ on our method.

Comments on Figures 1-4. We remark first that, although the variance of the second component is common to models (M1) and (M2), the performances in terms of bias and variance of the supremum norm are dramatically better for model (M2). While this seems to be somewhat in contrast to our theory, since the Gaussian density is much smoother than the Laplace density, we suspect that the effect is due to better separability of the two densities (Gaussian and Laplace) in model (M2) as compared to model (M1) (both Gaussian). Further, Figures 2 and 3 show the positive impact of the sample size on the largest estimation deviation over the different parameter functions. We clearly see that the estimated curves better "hold onto" the target curves when the sample size increases from 4,000 to 10,000. Let us finally notice that the most difficult parameter to control is the scaling, as illustrated in Figures 2(b) and 3(b) by the quite large amplitude of the oscillations along the graphs.
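The evaluation of the supremum errors over the testing grid can be sketched as follows (Python; we use the interval [−2, 2] of the earlier simulation sketch rather than [−5, 10], and fit stands for any pointwise estimator x ↦ (p̂, σ̂, µ̂), e.g. the M-estimation sketch of Section 4):

import numpy as np

x_grid = np.arange(-2.0, 2.0 + 1e-9, 0.05)   # testing grid with cells of size 0.05

def sup_errors(fit, p, sigma, mu):
    # fit(x) returns the pointwise estimate (p_hat, sigma_hat, mu_hat) at x.
    est = np.array([fit(x) for x in x_grid])
    return {"p":     np.max(np.abs(est[:, 0] - p(x_grid))),
            "sigma": np.max(np.abs(est[:, 1] - sigma(x_grid))),
            "mu":    np.max(np.abs(est[:, 2] - mu(x_grid)))}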

Application to a NimbleGen high-density array
We consider the NimbleGen high-density array dataset analyzed by Martin-Magniette et al. (2008) and Bordes et al. The aim of these authors was to fit a simpler linear model than (2.1), in which p(x) = p ∈ (0, 1) is constant, the location function is µ(x) = α + βx, where α and β are respectively the intercept and slope of the second-component linear regression function, and the scaling function σ(x) = σ is known. Recall that the estimated value of p found by these authors is very similar in both analyses, about 0.35. In Figure 6 we display the graph of (x, y) ↦ σ̂_n(x)^{-1} f̄(y/σ̂_n(x)) over an (x, y)-grid to illustrate the influence of the scaling on the shape of the first component population. In Figure 7 we successively display the estimated parameter functions. A consequence is that our method implicitly considers that the unknown component has a more spread-out distribution, due to the symmetry assumption, than the one obtained by the previous authors. This also implies a stronger overlap between the first and the second component. Further, we can observe in Figure 5 that the first component shrinks slightly over the interval [10, 13], which is also detected by our method, as demonstrated in Figure 7(b).

General estimation theory for local M-estimators
In this section we develop rates of convergence and adaptive estimation for general local M-estimators. The proofs of the results in this section are given in Section 11. Let Γ(α), α ∈ [a, b], be sets which index statistical models (P_γ)_{γ∈Γ(α)} on some measurable space. Let I ⊂ R^d be a compact rectangle, and let Θ ⊆ R^m. Suppose that the deterministic contrast function M(·, ·; γ) : Θ × I → R is uniquely minimized in its first argument by θ*(x; γ), i.e.

θ*(x; γ) = argmin_{θ∈Θ} M(θ, x; γ),  x ∈ I, γ ∈ Γ(α).   (8.1)
The function M(θ, x; γ) is assumed to be a limiting version of a sequence of random contrast functions M_n(θ, x; α) under P_γ. In our specific model, the parameter α corresponds to the Hölder degree of smoothness in the previous sections, where M_n(θ, x; α) = M_n(θ, x; h_n(α)) is given in (4.6) with h_n(α) = (log n/n)^{1/(2α+d)}, and M(θ, x; γ) in (4.5). We suppose that M_n(·, x; α) is minimized by some θ̂_n(x; α), i.e.

θ̂_n(x; α) ∈ argmin_{θ∈Θ} M_n(θ, x; α).   (8.2)
Consider the gradients of the contrast functions,

S(θ, x; γ) = ∂_θ M(θ, x; γ) and S_n(θ, x; α) = ∂_θ M_n(θ, x; α).   (8.3)

We formulate a result on rates of convergence in sup-norm when the nuisance parameter α is known a priori, and subsequently formulate a Lepski-type method to obtain estimates which are adaptive with respect to α. We work with the following high-level assumptions.
Assumption 4. Assume that Θ is compact and convex with Θ = cl(int(Θ)). For convenience, to avoid boundary issues, assume that the contrast functions are defined on an open and convex set Ξ ⊃ Θ. Given 0 < a < b < ∞ and α ∈ [a, b], let (Γ(α), ‖·‖_α) be subsets of normed spaces. Further let r(α) = r(α; n) be given rates of convergence, which tend to 0 in n for given α and decrease in α for given n.

(A1) Let (Γ(α), ‖·‖_α) be compactly nested spaces, i.e. Γ(α) ⊂ Γ(α′) and Γ(α) is compact with respect to ‖·‖_{α′} whenever α′ < α. Furthermore, Γ(α) is closed with respect to ‖·‖_a.

(A2) For any α and any sequence α_n → α, the sets Γ(α_n) approximate Γ(α) in the sense required for the limiting arguments below.

(A3) For each x ∈ I and γ ∈ Γ(α), the function M(·, x; γ) is twice continuously differentiable in its first argument and the Hessian matrix is positive definite; in particular the eigenvalues λ¹_{x,γ;α} ≥ ⋯ ≥ λ^m_{x,γ;α} of the matrices V_x(θ*(x; γ); γ) are positive.

(A4) The map θ ↦ V_x(θ; γ) is Lipschitz continuous, i.e. for all θ, θ′ ∈ Ξ we have

sup_{α∈[a,b]} sup_{γ∈Γ(α)} sup_{x∈I} ‖V_x(θ; γ) − V_x(θ′; γ)‖ ≤ L_Hess ‖θ − θ′‖,

where the Lipschitz constant L_Hess < ∞ depends only on Ξ, I, a, b and Γ(α).

(A5) The empirical contrast is continuously differentiable in its first argument, and for the gradients it holds for some C** < ∞ that

lim sup_n sup_{α∈[a,b]} sup_{γ∈Γ(α)} P_γ( sup_{θ∈Ξ, x∈I} ‖S_n(θ, x; α) − S(θ, x; γ)‖ > C** r(α) ) = 0.

The result shows that, under the conditions of the theorem, the local M-estimator θ̂_n(x; α) inherits its rate of convergence from that of the gradients as stated in (A5). In our setting, this rate will be the sup-norm rate in d dimensions over α-Hölder classes, that is, r(α) = (log n/n)^{α/(2α+d)}. Let us turn to adaptive estimation with respect to α. Our approach will be to use the Lepski method for the gradients S_n in (8.3), hence obtaining a data-driven nuisance parameter α̂_n ∈ [a, b] at which the gradients are adaptively estimated, and then to use the estimator θ̂_n(x; α̂_n). As in Section 6 we let α_k = a + k(b − a)/N, k = 0, …, N = ⌈log n⌉, and r_k = r(α_k). For the choice

k̂ = max{ k ∈ {0, …, N} : sup_{θ∈Ξ, x∈I} ‖S_n(θ, x; α_k) − S_n(θ, x; α_j)‖ ≤ C_Lep r_j for all j < k },   (8.4)

where the Lepski constant C_Lep < ∞ has to be chosen large enough, we let α̂_n = α_{k̂} and θ̂_n^{ad}(x) = θ̂_n(x; α̂_n). The following high-level assumption allows us to bound the probability of stopping early in the selection rule (8.4).
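The two-step logic of this section, selection of α by comparisons of gradients followed by minimization of the contrast at the selected bandwidth, can be summarized in a short generic sketch (Python; Mn and Sn are user-supplied callables, and the comparison rule follows the reconstruction of (8.4) above):

import numpy as np
from scipy import optimize

def adaptive_m_estimate(Mn, Sn, x, alphas, r, C_lep, theta0, bounds):
    # (i) Lepski selection on the gradients, as in (8.4): Sn(x, alpha) returns
    # the gradient values entering the pairwise comparisons (e.g. over a grid
    # of theta); (ii) minimization of the empirical contrast Mn(theta, x, alpha)
    # at the selected smoothness index.
    grads = [np.asarray(Sn(x, a)) for a in alphas]
    k_hat = 0
    for k in range(len(alphas)):
        if all(np.max(np.abs(grads[k] - grads[j])) <= C_lep * r(alphas[j])
               for j in range(k)):
            k_hat = k
        else:
            break
    alpha_hat = alphas[k_hat]
    res = optimize.minimize(lambda th: Mn(th, x, alpha_hat), x0=theta0,
                            bounds=bounds, method="L-BFGS-B")
    return res.x, alpha_hat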

Uniform bounds for U-processes
In this section we provide tools which allow us to deal with the stochastic components in the high-level assumptions (A5), (A6) and (A7) in case the contrast function is a local U-statistic of the form

M_n(θ, x; h) = (n(n−1))^{-1} Σ_{i≠j} τ(Y_i, Y_j, θ) h^{−2d} K((x − X_i)/h) K((x − X_j)/h).

Here τ is a smooth function that is symmetric in its first two arguments, K is a kernel function and h > 0 is a bandwidth parameter. Proofs for the results in this section are given in Section 12. The first result will be used to take care of (A6) as well as of (A5) when applied to the coordinates of the gradient with respect to θ.

Theorem 8.3 (Uniform stochastic error for U-statistics). Consider local U-statistics M_n(θ, x; h_n(α)) of the above form. The support I ⊂ R^d of γ is supposed to be a compact cuboid, and sup_{γ∈Γ} ‖γ‖_∞ < ∞. Further, K : R^d → R is a Lipschitz continuous and bounded L₂-kernel, and for some non-empty set A, (h_n(α))_{n∈N}, α ∈ A, are sequences of bandwidth parameters such that sup_{α∈A} h_n(α) → 0 and sup_{α∈A} log n/(n h_n(α)^d) → 0. Then the resulting uniform bound on the stochastic error holds with a constant C < ∞ which depends on ‖τ‖_∞, L_τ, ‖K‖_∞, L_K, ρ, I, Θ, but is free of n and of the sequences of bandwidth parameters.
Remark 2. If τ (and hence M_n) takes values in R^k (e.g. the gradient of a U-statistic), it is enough to check that every coordinate function fulfills the assumptions of Theorem 8.3.
The next result, which takes care of (A7), is directly formulated for the gradient.
Then for positive constants c̄₁, c̄₂ > 0 there is an increasing linear function u (depending on c̄₁, c̄₂) such that, for sufficiently large values of C_Lep, the bound required in (A7) holds.

Proofs for Sections 3 and 4

As mentioned above, we assume that ∫ y² f̄(y) dy = 1.
Since for µ* = µ = 0 identification follows directly from Lemma 9.1, we assume µ* ≠ µ and derive a contradiction to show identification under the other conditions.
Now suppose that Condition (C2) holds. We need to consider three cases.
Proof of Proposition 4.1. Since q > 0, by continuity it suffices to prove that E_{ϑ*} H(Y, t, θ) = 0 for all t ∈ R holds if and only if θ = θ*. By (4.2) we have that E_{ϑ*} H(Y, t, θ*) = 0 for all t ∈ R. For the converse, suppose now that θ ∈ [0, 1] × (0, ∞) × R is such that E_{ϑ*} H(Y, t, θ) = 0 for all t ∈ R. Hence, the corresponding function is symmetric about zero. Taking Fourier transforms on both sides once again yields equation (9.8).
Multiplying the first equation by sin(µt) and the second one by cos(µt) once again gives the same relations as in the proof of Theorem 3.1. As Assumption 1 is fulfilled, we can repeat that proof starting after (9.9). Note that we cannot invoke Theorem 3.1 itself to conclude, because τ(·; θ|ϑ*) need not be a density. The same method works under Assumption 2. Finally, the contrast property for M is straightforward, since q is a strictly positive weight function on R.

Outline
In this section we provide the proofs of Theorems 5.1 and 6.1. The strategy is to check the assumptions (A1)-(A6) as well as (A7) in Section 8.1 for our particular model, and then to apply Theorems 8.1 and 8.2.

Main Lemmas
We choose some open rectangle Ξ whose closure Ξ̄ is contained in (0, 1) × (0, ∞) × R\{0} and which contains the parameter set Θ from Section 5. The first lemma provides a bound in which the constant C depends only on Ξ, I and q. The following lemma then takes care of (A5), (A6) and (A7); in its statement, for the uniform rate in (A5), we discuss separately the bias and variance components of the gradients S_n(θ, x; h) in (6.1). The bandwidths are assumed to satisfy sup_{α∈[a,b]} h_n(α) → 0 and sup_{α∈[a,b]} log n/(n h_n(α)^d) → 0. The constant C* > 0 depends only on a, b, the function classes Γ(α), Θ, I, q and K; the constant C_STOCH > 0 depends only on ‖K‖_∞, L_K, U, I, Θ, but is free of a and b.

Conditions (A5), (A6) and (A7) hold for any compact cuboid J ⊂ int(I). To be specific about (A5), the bound (10.1) below holds for any compact rectangle J ⊂ int(I).
In particular, when h_n(α) = (log n/n)^{1/(2α+d)}, there is a constant C > 0 such that the rate in (A5) equals C (log n/n)^{α/(2α+d)}. The proofs of Lemmas 10.1 and 10.2 are given in Section 10.4.

Derivatives associated with the contrast function
The following lemma lists the derivatives of the function H(y, t, θ) in (4.3), as well as some useful bounds.

Lemma 10.3. The derivatives of the function H(y, t, θ) in (4.3) are obtained by straightforward differentiation and satisfy the following bounds.
Under Assumption (M3), there is a constant C > 0 depending only on Ξ and f̄ such that properties (i)-(vi) below hold for all t ∈ R and θ, θ′ ∈ Ξ.

Proof of Lemma 10.3. The derivatives of H(y, t, θ) are obtained by straightforward calculation. Properties (i)-(iii) are immediate from the fact that the functions sin, cos, ϕ_{f̄}, ∂ϕ_{f̄} and ∂²ϕ_{f̄} are bounded. For (iv)-(vi), we additionally use the Lipschitz continuity of sin, cos, ϕ_{f̄}, ∂ϕ_{f̄} and ∂²ϕ_{f̄}; in particular, the Lipschitz continuity of t ↦ exp(it) with Lipschitz constant 1 yields the required estimates. Let us now turn to the derivatives of the asymptotic contrast M(θ, x; γ) in (4.5) and its Hessian V_x(θ; γ).

Lemma 10.4. Under our assumptions, the derivatives (10.4)-(10.6) of the contrast M(θ, x; γ) take the form stated below.
Proof of Lemma 10.4. From the definition of H(y, t, θ) in (4.3), taking derivatives under the integral gives (10.4). The derivatives (10.5)-(10.6) are obtained by straightforward computation.

Proofs of Lemmas 10.1 and 10.2
Proof of Lemma 10.1. (i) Let us start by showing that for each given x ∈ J, the Hessian matrix V_x(θ*(x); γ) is positive definite. When inserting the true parameter θ*(x), the derivatives (10.5)-(10.6) simplify. Since M(·, x; γ) attains a minimum at θ*(x), the Hessian matrix V_x(θ*(x); γ) is positive semi-definite. Let v = (v₁, v₂, v₃)ᵀ ∈ R³ with vᵀ V_x(θ*(x); γ) v = 0. Since q, ℓ > 0 and the function t ↦ E_γ[∂_θ H(Y, t, θ*(x)) | X = x] is continuous, we conclude that vᵀ E_γ[∂_θ H(Y, t, θ*(x)) | X = x] = 0 for all t ∈ R. It remains to show that v = 0.
First note that the first and second summands in (10.8) are zero for t ∈ (π/µ*(x))Z. Hence, we have v₃ = 0, as ϕ_{f*_x}, p*(x) > 0. Since g is zero on R, so is its first derivative, which exists as f̄ and f*_x have finite third moments. Now let us differentiate g at t = 0. The derivative at zero is determined by the term involving v₁, because µ*(x), p*(x) ≠ 0; minding v₃ = 0, we derive v₁ = 0. And since the function t ↦ t(1 − p*(x)) ∂ϕ_{f̄}(σ*(x)t) sin(µ*(x)t) is non-zero in a punctured neighbourhood of 0, we get v₂ = 0 by (10.8), so that the matrix V_x(θ*(x); γ) is indeed positive definite.
(ii) This is immediate from (10.4), the Lipschitz continuity of the derivatives in Lemma 10.4, and the fact that q has finite moments of order up to 3.
Before turning to the proof of Lemma 10.2 we show two lemmas which are required to deal with the bias in (10.1). The first lemma gives a well-known bound on the bias when using higher-order kernels for functions from Hölder classes.
Then, for any compact cuboid J ⊂ int(I), there is some constant 0 < C_Hol < ∞, depending only on [a, b], L, U and K, so that the stated bias bound holds uniformly over α ∈ [a, b].

Proof. Fix any ℓ ∈ H(α, L, U), x ∈ J, α ∈ [a, b] and h ∈ (0, ∞). Using the Taylor expansion of order ⌊α⌋ of ℓ around x and using that K is a kernel of order b, we obtain, for some τ ∈ [0, 1] and independently of ℓ, x, α, n and h, the desired bound, where the suprema are taken over α ∈ [a, b], γ ∈ Γ(α) and h ∈ (0, ∞). For h ∈ (0, ∞) and x ∈ J we estimate (10.10), where the inequality follows from the boundedness of characteristic functions by 1, from the boundedness and Lipschitz continuity of sin and cos, from the compactness of Ξ, and from (10.3) for k = 0. The term (10.9) is treated directly by Lemma 10.5, which gives a standard bias estimate for Hölder functions using higher-order kernels. The term (10.10) is handled by the fact that x ↦ f*_x(y) is Hölder-α-smooth with a Hölder constant L(y) that is integrable in y, so that Hölder-α-smoothness extends to the family of characteristic functions (ϕ_{f*_x})_{x∈I}. Note that the k-th partial derivatives of f*_·(y) are bounded by L(y) for |k| ≤ ⌊α⌋, so that Lemma 10.5 is applicable again. The second estimate follows by similar calculations.
Proof of Lemma 10.2. First let us prove (10.2). We shall show that the assumptions of Theorem 8.3 are fulfilled. The gradient of the empirical contrast M_n is obtained by differentiating under the integral. According to Lemma 10.3 (i), (ii), (iv) and (v), each of the coordinates of the resulting function fulfils all of the assumptions postulated on the function τ in Theorem 8.3, from which we obtain (10.2).
Second, let us prove (10.1). We show the stated bound for all θ, x, α, γ, h. Let us make a zero addition of the term within the norm in (10.11). Since ℓ is bounded by sup U_ℓ and the functions H and ∂_θH(·, t, ·)/(1 + |t|) are uniformly bounded according to Lemma 10.3 (i) and (ii), it is enough to examine the occurring differences. The first summand is treated by Lemma 10.3 (i) and the fact that ℓ is Hölder-α-smooth, so that Lemma 10.5 applies; the second summand is dealt with by Lemma 10.6, so that (10.12) is bounded by a constant multiple of h^α (1 + |t|).

Proofs for Section 8.1
This section provides the proofs of Theorems 8.1 and 8.2. It is organized as follows. In Section 11.1, in Theorems 11.1 and 11.2, we extend results from van der Vaart and Wellner (1996) on consistency and rates of convergence of M-estimators, specifically (van der Vaart and Wellner, 1996, Theorem 3.2.3) and (van der Vaart and Wellner, 1996, Corollary 3.2.5), by making them uniform over the probability model as well as introducing a covariate parameter x.
The proof of Theorem 8.1 in Section 11.2 then requires checking the assumptions of Theorem 11.2. For the adaptive result, Theorem 8.2, we show in Lemma 11.3 that the Lepski choice adaptively estimates the gradient of the contrast. Theorem 11.2 can then again be used to obtain the adaptive rate of convergence for θ̂_n^{ad}(·). Finally, the proofs of Theorems 11.1 and 11.2 are provided in Section 11.3.

Consistency and rates of uniform convergence for M-estimators
We start with the following general results on consistency and uniform rates of convergence, the proofs of which are provided in Section 11.3. We fix a parameter value α and drop it from the notation, writing Γ = Γ(α). For brevity, we shall also often write θ* = θ*(x; γ) for the minimizer in (8.1).
Then the estimator θ̂_n(·) is uniformly consistent, i.e. for all ε > 0 we have

sup_{γ∈Γ} P_γ( sup_{x∈I} ‖θ̂_n(x) − θ*(x; γ)‖ > ε ) → 0.

The following theorem is a generalization of (van der Vaart and Wellner, 1996, Theorem 3.2.5) that gives conditions for rates of convergence in the sup-norm, uniformly over the model parameters γ, for possibly unidentifiable models.
Theorem 11.2 (Rate of convergence: General result in sup-norm). Let the following assumptions be satisfied.

Proofs of Theorems 8.1 and 8.2
Proof of Theorem 8.1. We need to check the assumptions of Theorem 11.2 for φ_n = id and t_{n,γ} = r_{n,γ} = r_n.
Fix some ε < ε̄. Let us prove the first part of (i). For any γ ∈ Γ and x ∈ I, a second-order Taylor approximation around θ*(x; γ) yields, for every θ ∈ Ξ with ‖θ − θ*(x; γ)‖ = ε, the existence of a ξ_{x,θ,γ} ∈ [θ, θ*(x; γ)] such that

M(θ, x; γ) − M(θ*(x; γ), x; γ) = ½ (θ − θ*(x; γ))ᵀ V_x(ξ_{x,θ,γ}; γ) (θ − θ*(x; γ)) ≥ ½ (λ^m_{x,γ} − L_Hess ε) ε²

according to (A3) and (A4), where λ^m_{x,γ} is the smallest eigenvalue of V_x(θ*(x; γ); γ). Since the eigenvalues of a matrix depend continuously on its entries, and the entries of the Hessian matrices V_x(θ*(x; γ); γ) depend continuously on (x; γ) by (A3), we can deduce the required uniform lower bound. Let us prove the second part of (i). By applying the fundamental theorem of calculus along the path [θ, θ*], we bound the increments of M_n − M by the directional derivatives, and hence by the gradient. The second part of (i) is therefore given directly by (A5).
We will prove (ii), i.e. the uniform consistency of θ̂_n(·), by using Theorem 11.1. As uniform consistency of the contrast M_n is given by (A6), only condition (∗) in the assumptions of Theorem 11.1 needs to be proved.
Next let us examine (11.2). Using the Cauchy-Schwarz inequality, we obtain the corresponding bound for all α ∈ [a, b] and γ ∈ Γ(α). By the definition of k̂, and since the set of grid points grows only logarithmically in n, we further deduce the required estimate by an index shift. In order to treat the last factor, we first observe that for l < j we have r_j ≤ r_l. Hence, there is an n₀ ∈ N such that for all n ≥ n₀ and 0 ≤ l < j ≤ k_n(α), the corresponding bound holds uniformly over α ∈ [a, b] and γ ∈ Γ(α). Subsequently, we deduce that for any n ≥ n₀, α ∈ [a, b], γ ∈ Γ(α) and 0 ≤ l < j ≤ k_n(α), the claimed estimate follows.

Proofs of Theorems 11.1 and 11.2
Proof of Theorem 11.1. Fix ε > 0. Because of (∗) there is an η > 0 such that, for any γ ∈ Γ, the inequality M(θ, x; γ) − M(θ*(x; γ), x; γ) ≥ η holds whenever ‖θ − θ*(x; γ)‖ ≥ ε. The proof of Theorem 11.2 is similar to that of (van der Vaart and Wellner, 1996, Theorem 3.2.5).