Regression using localised functional Bregman divergence

Abstract: This paper is concerned with a unified approach to estimating regression methods based on a certain divergence and its localisation. Several past papers have demonstrated, theoretically and numerically, that infusing a little localisation into likelihood-based methods for regression and for density estimation can improve the resulting estimators with respect to suitably defined global risk measures; a variety of local likelihood methods have been suggested as a result. We demonstrate that a similar effect can also be observed in the general framework discussed in this paper, and with respect to robust estimation procedures: localised versions of robust regression estimation procedures perform better with respect to global risk measures based on minimisation of Bregman divergence measures. The intricate relationship between a regression model's inadequacy and its robustness can be better analysed using the local approach developed in this paper. We support our claims with a short simulation study.


Introduction
This paper discusses a divergence-based method for estimating a regression function, together with a local version of this method. Smoothing approaches using localisation have been discussed in the literature; most are based on some form of local likelihood, see [20], [7], [15], [8], [13], [9] and references therein. In this paper we aim to develop a general framework of localised regression which includes methods based on the local likelihood as special cases. The proposed framework naturally induces a robust version. The estimation scheme in this paper is built on a functional Bregman divergence generated by a strictly convex function ([4], [23]), while the parametric model for the regression function is chosen simply as a composition of a linear predictor and a smooth link function, as in the generalized linear model ([16], [23] and [17]). The localisation is applied in the estimation scheme by slotting in a kernel function, as in [20]. Hence the proposed localised regression inference is essentially determined by choosing the strictly convex function U in the functional Bregman divergence, the link function G in the model, and the kernel function K for the localisation. We claim that an appropriate choice of U and G leads to an asymptotic risk improvement by the localised estimator over the global (non-localised) estimator. Furthermore, we show that choosing a suitable U in the functional Bregman divergence naturally induces a robust version of the resulting localised regression, with a similar risk improvement by the localised estimator when an appropriate choice of G is made.
In a recent paper [18] we demonstrated theoretically and numerically that infusing a little localisation into a robust parametric density estimation procedure can bring benefits when the quality of the density estimator is measured by a global risk measure based on minimisation of a Bregman divergence measure. In this way, past results were extended to the robustness context. These past results ([15], [13], [8]) demonstrated similar effects when localising likelihood-based methods: a little localisation helps to improve on the non-localised estimators when the quality of the estimator is measured by global risk measures.
In the regression setting, the idea of localising the inference was pioneered in [20]. It was presented as an extension of the idea of scatterplot smoothing [6] to generalized linear regression models. To the best of our knowledge, there have been no efforts to apply the localisation approach to robustify inference in regression, neither in the simple linear regression setting, nor for generalized linear regression models. Our paper represents an attempt to fill this gap. We stress, however, that in the regression setting of the current paper, even the global estimator (which we then localise for better performance) seems to be new.
It is well known that likelihood-based inference rests on the idea of minimising the Kullback-Leibler (KL) divergence between an ideal and an empirical distribution. We illustrate below that the use of Bregman divergence (BD) as a generalization and replacement of the KL divergence can, when properly applied, bring intrinsic robustification to standard likelihood-based inference when the quality of the regression fit is measured by a suitable global risk measure. A further point of this paper is that, when coupled with a suitable localisation, this robustification effect can be magnified, in a way similar to the one demonstrated for density estimation in [18].
From the very beginning we stress that the general setting of our paper allows for model misspecification. That is, we allow for the possibility that the regression function, defined as the conditional expected value of the output variable given the input vector, may not equal any of the parametric relationships used to model it. Admittedly, this is the more realistic scenario in practice. Intuitively, if we are ready to admit (parametric) model misspecification, then a localisation can be expected to help improve the performance with respect to a global risk measure, since a local estimator adapts better to the unknown regression function. We give sufficient conditions for this effect to occur asymptotically by relating the choice of the function G to the choice of U.
The paper is structured as follows. We start with a detailed discussion of the setup in Section 2. The global estimation scheme is first introduced via the functional Bregman divergence [11], following the setup discussed in [23]. The localisation is then obtained naturally by exploiting a kernel function. Section 3 is devoted to asymptotic statements about the global and local estimators in our framework. The risk difference between the global and the local estimator is evaluated asymptotically in Section 3.3. The choice of U and G that yields the asymptotic risk improvement by the local estimator is addressed in Section 3.4. Section 4 discusses approaches to robustify our estimators by choosing a suitable divergence measure. It is seen in Section 4.1 that using a suitable U and applying it to the residuals robustifies the method discussed in the previous sections. The advantage of localisation, in the sense of risk improvement, is also observed in the robustness setting in Section 4.3. Numerical implementations and simulations are presented in Section 5. Some final discussions are given in Section 6. Section 7 presents the proofs of our theoretical results.

Background
Bregman divergence was introduced in [4]. It can be defined for d-dimensional (d ≥ 1) vectors, but also for matrices or for functions. For a given strictly convex function U : A → R, where A ⊂ R^d is a convex set, the Bregman divergence between two points X ∈ A and Y ∈ A is defined as

D_U(X, Y) = U(X) − U(Y) − ∇U(Y)^T (X − Y),   (2.1)

with ∇ denoting gradient-taking. This definition can be applied pointwise to positive density functions f, g defined on a common domain. The pointwise application means that in this case d = 1, ∇ means a simple derivative U′, and we interpret locally, for a fixed t,

d_U(f, g)(t) = U(f(t)) − U(g(t)) − U′(g(t)){f(t) − g(t)}.

Using this localised divergence measure at the point t, we then define the global (also called functional) Bregman divergence between the densities f and g:

D_U(f, g) = ∫ d_U(f, g)(t) dt.

A key example is an exponential family in canonical form with cumulant function ψ(θ): the KL divergence in this exponential family is tantamount to the Bregman divergence defined via the convex function ψ(θ). In addition, the Riemannian metric (∂²/∂θ∂θ^T)ψ(θ) derived from ψ(θ) in this exponential family is easily seen to equal the Fisher information matrix (compare also Theorem 2.1 in [1]). This intimate relation with the KL divergence has been an important push to consider applications of Bregman divergence outside the exponential family setting as well. There, one still believes that the data's distribution is "close to an exponential family" but may "deviate slightly", as in the robustness paradigm. In such cases it is prudent to start from the very beginning with a proposal of a convex function U in (2.1) and proceed with it. The above divergence between parameters is replaced by the functional Bregman divergence (see [11]) when analysing the quality of the fit of such models to data.
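The pointwise definition above can be sketched numerically. The following is an illustrative sketch (function names are ours, not the paper's) using U(z) = ||z||²/2, for which the Bregman divergence reduces to half the squared Euclidean distance:

```python
import numpy as np

def bregman(U, grad_U, x, y):
    """Pointwise Bregman divergence D_U(x, y) = U(x) - U(y) - <grad U(y), x - y>."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    return U(x) - U(y) - np.dot(grad_U(y), x - y)

# With U(z) = ||z||^2 / 2 (so grad U = identity), the divergence equals
# ||x - y||^2 / 2, i.e. half the squared Euclidean distance.
U = lambda z: 0.5 * np.dot(z, z)
grad_U = lambda z: z
```

For instance, `bregman(U, grad_U, [1.0, 2.0], [0.0, 0.0])` returns `2.5`, which is ||(1, 2)||²/2; the divergence of any point from itself is 0.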

Setup
Let (Y_1, X_1), ..., (Y_n, X_n) be independent copies of a pair (Y, X) with joint density f(y, x) = p(y|x) q(x), where p(y|x) is the conditional density of Y given X = x, and q is the density of X. The support of the density q is denoted by D, which is assumed to be a compact set in R^d. Let t ∈ D ⊂ R^d be a target point at which we want to estimate the value of the regression function μ(t) = E[Y|X = t]. Given that the precise distribution of the pairs (Y_i, X_i) is virtually never known in practice, there is a need to approximate this function, and a simple well-known approach is to approximate it (possibly after a nonlinear transformation) via a function that is linear in the input observations x. As usually done in practice, one postulates a parametric model for μ of the form m(x, θ) = G^{−1}(θ^T x), where x = [1 x_1 · · · x_d]^T is the vector of explanatory variables and θ = [θ_0 θ_1 · · · θ_d]^T ∈ Θ ⊂ R^{d+1} is the parameter vector, whose component θ_0 is an intercept. The one-to-one transformation G (whose inverse is denoted by G^{−1} above) is the so-called link function.
As said in the Introduction, we allow for model misspecification. That is, we allow for the possibility that the relationship m(x, θ) = μ(x) fails to hold for every θ ∈ Θ. This general setting is sometimes called approgression [5]. For the particular case where model misspecification is excluded, some of our results coincide with results obtained in [23].

Bregman divergence
Throughout this paper we denote by U the set of strictly convex functions on R. Now we fix a U ∈ U. The discrepancy between μ(·) and its parametric model m(·, θ) = m_θ(·) can then be measured by the functional Bregman divergence defined as

D_U(μ, m_θ) = ∫ [U(μ(x)) − U(m_θ(x)) − u(m_θ(x)){μ(x) − m_θ(x)}] q(x) dx,   (2.3)

where u = U′ is the derivative of U. The reason to add the term "functional" to the Bregman divergence above is that it can be interpreted as a weighted form of the pointwise Bregman divergence between μ(x) and m_θ(x), with a weight function given by the density of X.
In what follows, for the purpose of easier tractability, we prefer to utilize

D_{U*}(u(m_θ), u(μ)) = ∫ [U*(u(m_θ(x))) − U*(u(μ(x))) − μ(x){u(m_θ(x)) − u(μ(x))}] q(x) dx,   (2.4)

where U* is the convex conjugate of U, U*(s) = sup_{z∈R} {zs − U(z)}, and we have used the fact that (U*)′ = u^{−1} in (2.4). The fact that

D_U(μ, m_θ) = D_{U*}(u(m_θ), u(μ))   (2.6)

is an easy consequence of the fundamental properties of the Legendre transformation. It has been derived explicitly, for example, in [1] (Equation 1.68 on page 17). The equality (2.6) implies that minimising either quantity with respect to θ delivers the same outcome, hence we can be guided by the simplicity of the resulting optimization problem when making a choice. Up to terms not involving θ, it is clear that D_{U*}(u(m_θ), u(μ)) is more tractable, and we focus on this choice from now on. The usual parametric regression can be carried out by using a certain estimator θ̂ of the true value of θ. The necessary tools for this estimation scheme are

ρ(y, x, θ) = U*(u(m(x, θ))) − y u(m(x, θ)),   (2.7)
ψ(y, x, θ) = ∂ρ(y, x, θ)/∂θ = {m(x, θ) − y} u′(m(x, θ)) (G^{−1})′(θ^T x) x,   (2.8)

by which the estimator θ̂ and the true value θ* of θ can be defined as

∫ ψ(y, x, θ*) dF(y, x) = 0_{d+1},   (2.9)
∫ ψ(y, x, θ̂) dF_n(y, x) = 0_{d+1},   (2.10)

where F_n is the empirical distribution function based on (Y_1, X_1), ..., (Y_n, X_n), F is the cumulative distribution function with density f, and 0_{d+1} is the zero vector in R^{d+1}. Note that θ* in (2.9) is the minimiser of (2.5), and the minimiser of the empirical version of (2.5) is precisely θ̂ in (2.10). The regression function estimator can be obtained by substituting θ̂ for θ in m(·, θ):

μ̂_G(x) = m(x, θ̂).   (2.11)
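The empirical minimisation behind the global estimator can be sketched as follows. This is our own illustrative implementation, not the paper's algorithm: a crude finite-difference gradient descent on the empirical objective (any smooth optimiser could be substituted), with the objective written, up to θ-free terms, as the empirical counterpart of the dual divergence.

```python
import numpy as np

def fit_global(X, y, u, U_star, G_inv, lr=0.05, steps=2000):
    """Minimise the empirical Bregman objective
        sum_i [ U*(u(m_theta(x_i))) - y_i * u(m_theta(x_i)) ]
    over theta, with m_theta(x) = G^{-1}(theta^T x), x including an intercept.
    A naive finite-difference gradient descent stands in for solving the
    estimating equation; illustrative only."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Xd = np.column_stack([np.ones(len(X)), X])        # prepend intercept column
    y = np.asarray(y, dtype=float)
    theta = np.zeros(Xd.shape[1])

    def obj(th):
        m = G_inv(Xd @ th)
        return np.sum(U_star(u(m)) - y * u(m))

    eps = 1e-6
    for _ in range(steps):
        grad = np.array([(obj(theta + eps * e) - obj(theta - eps * e)) / (2 * eps)
                         for e in np.eye(len(theta))])
        theta = theta - lr * grad
    return theta
```

The choice U(t) = t²/2 (so u is the identity, U* = U) with the identity link recovers ordinary least squares, which gives a quick sanity check of the sketch.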

Local Bregman divergence
Now we introduce a localisation of the Bregman divergence by slotting in a kernel function K, where K(z) is a smooth, unimodal, integrable function, symmetric around z = 0_d and satisfying K(0_d) = 1. Our proposed local version of the Bregman divergence corresponding to (2.4) is defined as

D_{U*}^{t,h}(u(m_θ), u(μ)) = ∫ K((x − t)/h) [U*(u(m_θ(x))) − U*(u(μ(x))) − μ(x){u(m_θ(x)) − u(μ(x))}] q(x) dx.

Here h > 0 is the scalar bandwidth which controls the degree of localisation. This local divergence aims to evaluate the discrepancy between u(m(·, θ)) and u(μ(·)) locally around the evaluation point t.
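Localising the empirical objective amounts to weighting each term by K((x_i − t)/h). The following sketch (our names, d = 1, Gaussian kernel K(z) = exp(−z²/2) so that K(0) = 1) mirrors the global fit but with kernel weights:

```python
import numpy as np

def fit_local(X, y, t, h, u, U_star, G_inv, lr=0.05, steps=2000):
    """Localised empirical Bregman objective: each term of the global objective
    is weighted by K((x_i - t)/h), with the Gaussian kernel K(z) = exp(-z^2/2).
    Shown for d = 1; a crude finite-difference descent, illustrative only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.exp(-0.5 * ((X - t) / h) ** 2)             # kernel weights, in (0, 1]
    Xd = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(2)

    def obj(th):
        m = G_inv(Xd @ th)
        return np.sum(w * (U_star(u(m)) - y * u(m)))

    eps = 1e-6
    for _ in range(steps):
        grad = np.array([(obj(theta + eps * e) - obj(theta - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
        theta = theta - lr * grad
    return theta   # theta-hat(t); the local fit at t is G_inv(theta[0] + theta[1]*t)
```

As h → ∞ all weights tend to 1 and the local fit reduces to the global one, which matches the large-h regime studied below.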
For a better adaptation along the regression curve, we now allow the parameter θ to vary with t and suggest a scheme to estimate θ depending on t. The necessary functions are the kernel-weighted versions

ρ_{t,h}(y, x, θ) = K((x − t)/h) ρ(y, x, θ),   (2.12)
ψ_{t,h}(y, x, θ) = K((x − t)/h) ψ(y, x, θ).   (2.13)

We note that (2.12) and (2.13) are localised versions of (2.7) and (2.8) respectively, with the use of the kernel K. Using these functions, we define the true parameter θ*(t) at t and its estimator θ̂(t) as the solutions of

∫ ψ_{t,h}(y, x, θ*(t)) dF(y, x) = 0_{d+1},   (2.14)
∫ ψ_{t,h}(y, x, θ̂(t)) dF_n(y, x) = 0_{d+1}.   (2.15)

This local estimator θ̂(t) of θ*(t) also suggests a regression estimator

μ̂_L(x) = m(x, θ̂(x)),   (2.16)

which we call the local estimator of μ(x), because the involved parameter estimator is determined locally. Since θ̂(x) can vary with x, μ̂_L can be expected to be more flexible than μ̂_G. By contrast, we call the estimator μ̂_G(x) in (2.11) the global estimator of μ(x). Let us also introduce the notation in (2.17) and the (d + 1) × (d + 1) matrices in (2.18).

In [10] and in [18] the effect of this choice is discussed in more detail. In any case, when a robustification is sought, it makes sense to choose a U which is parameterized by a parameter, r say, whereby the parameterized function represents some sort of distortion of the logarithmic function. Very often this is achieved by replacing a logarithm in the core definition by the Box-Cox transformation. In this way, the new (non-local) robust estimation procedure can be interpreted as one that tries (empirically) to minimise "the discrepancy between a distribution in an ideal parametric family and one that modifies the true distribution to diminish (or emphasize) the role of extreme observations" (see [10], p. 754). Their procedure is called the maximum L_r-likelihood estimation procedure, to reflect the distortion of the usual likelihood function (obtained as the limiting case r = 1) via the distortion function L_r(u) = {u^{1−r} − 1}/(1 − r), r > 0.
These authors have not discussed robustness issues, except to mention that these might be interesting for future work, and they do not discuss any effects of localisation. Our estimation procedure is different from theirs, and it can be implemented so as to deliver robustification in the regression setting that is consistent with the classical requirement for a robust estimator to have a bounded influence function. A paper that deals explicitly with aspects of Bregman divergence in regression is [23]. It is more closely related to our own paper, although again it does not analyse any localisation effects and does not deal with model misspecification. Further, we note that in the global setting, the Bregman divergence they use is not the same as ours: they aim to measure the discrepancy between y and m(x, θ), rather than between μ(x) and m(x, θ) as we do. However, it can be confirmed that, eventually, their estimation procedure (Equation (20) on p. 126) for θ coincides with ours in the no-misspecification case μ(·) = m(·, θ*).
We should also mention that the paper [9] deals explicitly with the model misspecification issue in regression. It points out that the localisation bandwidth h controls the behaviour of the local estimator, with two distinct effects. For nonparametric consistency, a bandwidth h tending to zero as the sample size n → ∞ is to be chosen. On the other hand, with h large, the estimator shares asymptotic efficiency with the parametric estimator if the parametric model is precisely correct, while at the same time suffering much less than the parametric estimator under a slight misspecification of the parametric model. However, that paper assumes that the conditional density p(y|x) belongs to an exponential family; we do not need this assumption.
The authors of [23] investigate the question of how to tune the bandwidth h depending on the degree of model misspecification. In contrast, we focus on investigating the gain of using the large h asymptotic approach instead of purely parametric regression in cases where the model misspecification is small so that we happen to be in the large h regime. In addition, in [23] the conditional distribution of Y given X is again supposed to belong to an exponential family. We do not need this assumption in our methodological part.
We also investigate robustness properties of the estimators constructed using the large-h approach. Noticing the structure of the loss function Q in [10], it becomes apparent via simple calculations that the resulting influence function of their estimator (and hence also of our estimator) is, in terms of the notations introduced in our paper, proportional to ψ(y, x, θ*), which contains the residual y − m(x, θ*) as a factor. Even if we assume, as we do, that the input (design) variable X is restricted to a compact set, this influence function is clearly unbounded in general (unless the Y_i observations also stay in a bounded set). In contrast, our method, in the form presented in Section 4, can deliver robustification in the classical sense, with a bounded influence function.

Assumptions
We will now state the assumptions needed for our asymptotic statements related to μ̂_G and μ̂_L to hold. In the sequel, || · || denotes the Euclidean norm of a (d + 1)-dimensional vector. (A6) For any t ∈ D and any vector θ̃ satisfying the stated proximity condition, Ψ(t, θ̃) is positive definite, and for any t ∈ D and any vector θ̄ satisfying ||θ̄ − θ̃|| < ||θ*(t) − θ̃||, Ψ(t, θ̄) is nonsingular.

Remark 2.
The following remarks should be made regarding the above assumptions: (a) (A0) concerns the shape of the kernel function used for the localisation; a typical choice is the Gaussian kernel K(z) = exp(−||z||²/2). (b) We aim to develop asymptotics for μ̂_L, and hence θ̂(t), under the scenario where the bandwidth h grows as n increases. The typical order is h² ≍ √n; other orders might also be possible, see [8]. We do not pursue the optimal order of h in this paper. (c) (A2), (A3), (A4) and (A5) are needed to demonstrate consistency of our local estimator θ̂(t), see [14], [21] and [22]. (d) (A6) states that the matrix Ψ(t, θ*) is approximated well by Ψ(t, θ̃) for θ̃ close to θ*, and that the same is true for the estimated version. (e) (A7) assures the nonsingularity of Ψ(θ*) and of Ψ(t, θ̃) for θ̃ close to θ*.
Asymptotic normality of θ̂(t) and μ̂_L(t) can be demonstrated as follows: a suitably normalised version of θ̂(t) − θ*(t) converges in distribution to a (d + 1)-dimensional Gaussian distribution with mean vector zero and covariance matrix Σ* given in (3.1), where τ > 0 is the limit of √n/h².

Asymptotics for the risk
The effect of the localisation is revealed when the performance of the local and the global estimator is compared using global risk measures. The risk of the global estimator μ̂_G is defined as the expected value of its divergence,

R_G = E[D_{U*}(u(m_{θ̂}), u(μ))],

where the expectation is based on the sample; the risk R_L of the local estimator μ̂_L is defined analogously. The difference between the risks of the global and of the local estimator is thus R_G − R_L. We have the following result (Theorem 5) for the risk difference, where η_j(θ*) is defined in (3.4) for j = 1, ..., d and x_j is the j-th component of x.
From Theorem 5 it is clear (since under Assumption (A7) the matrix Ψ(θ*) is positive definite) that the local estimator asymptotically outperforms the global one. It is not always the case, however, that Assumption (A7) holds. In Theorem 6 below we investigate sufficient conditions for (A7) to hold. We also note that when the parametric model holds, η_j(θ*) is obviously zero; in this particular case the claim is that the risk difference (3.2) is still approximated by a positive quantity.

Efficient choice of the link function
Our methodology requires choices of the strictly convex function U ∈ U and of the link function G. The asymptotic result in Theorem 5 suggests an efficient choice of the link function G for a fixed U. Theorem 6. Fix a strictly convex function U. The choice G ≡ u + α, for a constant α, makes Ψ(θ*) positive definite for any μ. Remark 3. We now focus on analysing Σ* of (3.1) in Theorem 4. Consider the case G(t) = u(t) + α for some constant α. Then Ψ(θ*) is positive definite by Theorem 6. The same result as in Theorem 6 of [23] holds for Σ*.
for some constant c > 0, then the asymptotic covariance matrix Σ* of θ̂(t) attains the lower bound.

Remark on robustness
The influence function of θ̂ can be obtained as in (3.6), which is unbounded unless y stays in a bounded set. The influence function of θ̂(t) at t ∈ D can be obtained similarly; we do not pursue this here. Examining (3.6) makes clear why the influence function can be unbounded: the residual function y − m(x, θ*) appears as a multiplier in (3.6). This observation also suggests ways to implement a robust estimator as a substitute for θ̂. The way we adopt in this paper is presented in the next section.

Composition of the divergence
A typical member of U_0 is the so-called pseudo-Huber loss (see, e.g., [12])

L_δ(t) = δ²{√(1 + (t/δ)²) − 1},   (4.2)

for fixed δ > 0. The pseudo-Huber loss is similar to the Huber loss, but has continuous derivatives of all orders and is a strictly convex function of t for any fixed δ > 0. It should be noted that the derivative L′_δ is bounded.
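The pseudo-Huber loss and its derivative can be sketched directly (our function names; δ is passed explicitly). The derivative saturates at ±δ, which is the boundedness property exploited below, while the loss itself is quadratic near zero and asymptotically linear:

```python
import numpy as np

def pseudo_huber(t, delta):
    """L_delta(t) = delta^2 * (sqrt(1 + (t/delta)^2) - 1):
    behaves like t^2/2 near 0 and like delta*|t| for |t| >> delta."""
    return delta**2 * (np.sqrt(1.0 + (t / delta) ** 2) - 1.0)

def pseudo_huber_deriv(t, delta):
    """u(t) = L'_delta(t) = t / sqrt(1 + (t/delta)^2); |u(t)| < delta for all t."""
    return t / np.sqrt(1.0 + (t / delta) ** 2)
```

Both L_δ(0) = 0 and L′_δ(0) = 0 hold, which is what makes this U a convenient choice for the residual-based divergence of Section 4.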
In what follows we denote the residual function by r(θ) = y − m(x, θ). To obtain a robust estimator of θ, we utilize another form of the Bregman divergence, now applied to the residuals and defined in (4.3). In particular, if the pseudo-Huber loss (4.2) is used for U, then U(0) = 0 also holds. By minimising D_U(θ) we aim to minimise the discrepancy between the residual r(θ) and 0, rather than between Y and m(·, θ) as implemented in [23]. This is the key to robustification, since now the influence function can be made bounded even when the Y-observations are not, as will be seen in the discussion that follows.

Estimators
In this setting, the necessary functions corresponding to (2.7) and (2.8) are given by (4.4) and (4.5). Associated functions corresponding to the local versions (2.12), (2.13), (2.17) and the matrix (2.18) can be defined in the same way, hence we deal with these functions as before but using (4.4) as a starting point. The estimator θ̂, as well as the true parameter value yielding the best approximation to Y, can be defined in the same manner as in (2.10) and (2.9), respectively. The local estimator θ̂(t) and the parameter θ*(t) at t are also defined similarly to (2.15) and (2.14), respectively. Furthermore, the above parameter estimators immediately suggest the global estimator μ̂_G and the local estimator μ̂_L of the regression function, as given in (2.11) and (2.16).
With the new ρ(y, x, θ) and ψ(y, x, θ) defined in (4.4) and (4.5), the influence function of θ̂ can be derived as in (4.6), with the function c̃_2(y, x, θ) defined there. We now compare the estimator θ̂ obtained via the estimating equation based on (4.5) with that based on another divergence.
The so-called density power divergence [2] is defined as a Bregman divergence with Ũ_β(t) = t^{1+β} for t ≥ 0 and β > 0; see also Section 9.2 of [3]. Robust inference using this divergence with 0 < β < 1 was discussed in [2] and [3]. This divergence, however, cannot be utilized in its original form in the regression setting, since Ũ_β(t) is defined only for t ≥ 0, which is natural for a divergence between densities.
A suitable version of Ũ_β for the regression problem is U_β(t) = |t|^{1+β} for β > 0. This U_β is strictly convex on R, so it is a member of U. Hence it is possible to base the divergences (2.3) and (4.3) on U_β for inference.
However, it is easily confirmed that inference using the divergence (4.3) with U = U_β is not robust, since the derivative u_β = U′_β is not bounded. Using the form of U_β and (4.6), we see that the influence function of the estimator based on this power divergence is not bounded. This is in stark contrast to the influence function based on the divergence associated with U = L_δ, which is bounded due to the fact that u = U′ = L′_δ is bounded.
Both U_β and L_δ are members of U, but only L_δ is also a member of U_0. This difference is essential when analysing robustness properties.
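The contrast can be seen directly from the two derivatives (β and δ are fixed arbitrarily here, purely for illustration): u_β grows without bound, while the pseudo-Huber derivative stays below δ.

```python
import numpy as np

beta, delta = 0.5, 1.0

def u_beta(t):
    """Derivative of U_beta(t) = |t|^(1+beta):
    (1 + beta) * |t|^beta * sign(t), which is unbounded in t."""
    return (1.0 + beta) * np.abs(t) ** beta * np.sign(t)

def u_huber(t):
    """Derivative of the pseudo-Huber loss L_delta: bounded by delta."""
    return t / np.sqrt(1.0 + (t / delta) ** 2)
```

Evaluating both at a large residual (say t = 10⁶) shows |u_beta| in the thousands while |u_huber| remains below δ = 1, which is exactly the bounded-influence mechanism described above.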

Risk improvement in the robust setting
Similarly to Theorem 5, the local regression estimator μ̂_L can improve on the risk of the global estimator μ̂_G in the robust setting as well, when using the divergence (4.3) and the associated functions (4.4) and (4.5).
The risk of the global estimator and the risk of the local estimator are defined as before, with the expectation based on the sample, and the risk difference between the global and the local estimator is given accordingly. The following theorem states that, under Assumptions (A0)-(A7), the local estimator is better than the global estimator when using the global risk (4.9) as a measure of performance, for j = 1, ..., d.
We now discuss, similarly to Theorem 6, examples of sufficient conditions for the positive definiteness of Ψ(θ*). We claim that the choice G = u associated with U = L_δ, the pseudo-Huber loss in (4.2), leads to the desired positive definiteness. To see this, we introduce two constants, both of which certainly exist by the smoothness of G = u and Assumption (A4), as well as the assumed compactness of D and Θ_c. We formulate the following theorem. Note that c̃_{2,1}(y, x, θ) in (4.11) is always positive for any (y, x), any θ and any positive δ. In addition, since c̃_{2,2}(y, x, θ) in (4.12) is bounded for any positive δ, we see that c̃_2(y, x, θ) will be positive for large δ. Since the positive definiteness of Ψ(θ*) in (4.7) is captured only through c̃_2 in (4.10), Ψ(θ*) becomes positive definite for large δ.
Remark 5. The claim of Theorem 8 offers a new perspective on the cost of a robustness guarantee. It is part of the folklore that strong robustness generally implies low efficiency, and that pursuing efficiency reduces robustness. Theorem 8 shows that a large value of δ guarantees the positive definiteness of Ψ(θ*), and hence the positivity of the leading term of the risk difference in Theorem 7; therefore, a large δ guarantees the risk improvement by the proposed local estimator. But we note that, by the definition of the pseudo-Huber loss in (4.2), a large δ corresponds to weak robustness. Hence the dilemma between robustness and efficiency also appears in our setting when analysing the risk improvement by the local method. We also note that our methodology, as implemented in the local estimator setting, allows us to analyse the separate effects of model misspecification and of robustness, as well as their interplay. Using a bandwidth with h² = O(√n), as in Assumption (A1), represents a balance in the sense that a smaller-order bandwidth h helps if model misspecification of the conditional mean prevails, whereas a larger h is helpful if more robustification is needed. However, the robustification achieved by the local estimator has its limitations. Indeed, an attempt at stronger robustification by choosing a small δ may cause Ψ(θ*) to no longer be positive definite. Hence, if for some reason a strong robustification is aimed at, the global estimator may do a better job.
From a practical point of view, one possible drawback of the choice G = u in Theorem 8 is that the resulting G^{−1} = u^{−1} requires |θ^T x| < δ, since u^{−1}(s) = s/√(1 − (s/δ)²) is defined only for |s| < δ. This means that m(x, θ̂) = G^{−1}(θ̂^T x) is not defined when |θ̂^T x| ≥ δ. For large δ this does not matter theoretically, but there is a nonzero probability that θ̂^T x exceeds δ for some x, causing numerical instability in practice. A less problematic choice than G = u is G = u^{−1}. We again achieve a risk improvement by the local estimator, as shown in the following result. Theorem 9. Let U be the pseudo-Huber loss L_δ in (4.2), and let G = u^{−1} = (U′)^{−1}. Then Ψ(θ*) becomes positive definite for sufficiently large δ.
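The domain restriction can be checked directly. With u(t) = t/√(1 + (t/δ)²), solving s = u(t) gives u^{−1}(s) = s/√(1 − (s/δ)²), which blows up as |s| → δ. A small sketch (our names, δ fixed for illustration):

```python
import numpy as np

delta = 6.0

def u(t):
    """u = L'_delta for the pseudo-Huber loss; maps R onto the open interval
    (-delta, delta), so its range is bounded."""
    return t / np.sqrt(1.0 + (t / delta) ** 2)

def u_inv(s):
    """Candidate inverse link G^{-1} = u^{-1}; defined only for |s| < delta,
    and diverging as |s| approaches delta (the numerical-instability issue)."""
    return s / np.sqrt(1.0 - (s / delta) ** 2)
```

A round trip u_inv(u(t)) recovers t for moderate t, while values of the linear predictor close to ±δ are mapped to arbitrarily large fitted values, illustrating why |θ̂ᵀx| ≥ δ is problematic in practice.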

Numerical implementations and simulations
The main purpose of this section is to discuss the numerical implementation of the estimators and to compare the performance of the global and local estimators. Furthermore, the risk reduction effect established in our theorems will be demonstrated in a short simulation study.

Numerical implementation of the estimation algorithms
First, we discuss the numerical implementation of the regression estimators μ̂_G and μ̂_L. These are obtained via plug-in once the corresponding parameter estimators θ̂ and θ̂(t) are obtained. Using the obtained regression estimators, we will be able to analyze and demonstrate the risk reduction achieved when using μ̂_L instead of μ̂_G. A theoretical analysis of these improvement effects has been presented in Theorems 5 and 7; in this section additional numerical support is provided. Although the numerical implementations of the estimators are standard, we discuss them here to keep the presentation self-contained.
The relations (5.3) and (5.5) lead to a more convenient expression for the update (5.1):

θ^{[k+1]} = θ^{[k]} + (X^T W Δ X)^{−1} X^T W (Y − m(θ^{[k]})).   (5.6)

Note that the efficient choice G = u gives W = I_n, the identity matrix, as confirmed by (5.4). Furthermore, the choice G = u simplifies Δ, so that (5.6) finally reduces to

θ^{[k+1]} = θ^{[k]} + (X^T Δ(θ^{[k]}) X)^{−1} X^T (Y − m(θ^{[k]})).
As long as we utilize G = u = U′, we only need to update the matrix Δ(θ) and the vector m(θ) at each step.
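The simplified update can be sketched as a Newton-type iteration. This is our own illustrative implementation (not the paper's exact algorithm): we assume the Bernoulli generator U(t) = t log t + (1 − t) log(1 − t), for which G = u is the logit link and G^{−1} the sigmoid, and iterate a step driven by X^T(Y − m(θ^{[k]})) with Δ the diagonal of derivatives of G^{−1}:

```python
import numpy as np

def newton_canonical(X, y, G_inv, G_inv_deriv, iters=25):
    """Newton-type update for the case G = u, where the estimating equation
    reduces to X^T (y - m(theta)) = 0:
        theta <- theta + (X^T Delta X)^{-1} X^T (y - m(theta)),
    with Delta = diag((G^{-1})'(eta_i)) and eta = X theta. Illustrative sketch."""
    Xd = np.column_stack([np.ones(len(X)), X])   # add intercept column
    theta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        eta = Xd @ theta
        m = G_inv(eta)                           # fitted means m(theta)
        Delta = np.diag(G_inv_deriv(eta))        # only Delta and m are updated
        theta = theta + np.linalg.solve(Xd.T @ Delta @ Xd, Xd.T @ (y - m))
    return theta
```

With the sigmoid as G^{−1}, the converged θ̂ satisfies the estimating equation X^T(y − m(θ̂)) = 0 to machine precision, which provides a direct check of the iteration.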

The Local Estimator
A similar algorithm can be implemented for the local estimatorμ L (t) at t ∈ D.

Estimators in Bernoulli Regression
Estimation results are shown for the true regression function (5.10) in Figure 1(a) and for (5.11) in Figure 1(b). The bandwidth for the local estimator is h = 0.25 · (200)^{1/4}. We see from Figure 1(a) that the local estimator μ̂_L(x) (dotted) fits the true μ(x) well, while the global estimator μ̂_G(x) (dashed) cannot capture the structure of μ(x). In contrast, in Figure 1(b) the parametric model includes the true μ(x), hence both the global and the local approach give reasonable estimators.

Simulation related to Theorem 5
Referring to the proof of Theorem 5, we need to calculate an estimate of the quantity (5.12), which corresponds to the left-hand side of (3.3). To calculate the O(h²) term of the right-hand side of (3.3), we need approximations to Ψ(θ*) in (2.18) and η_j(θ*) in (3.4). The integrals involved in (5.12), (2.18) and (3.4) do not have closed forms in general, and we replace them by Monte Carlo integral approximations.
To this end, we generate a random sample (y*_1, x*_1), ..., (y*_S, x*_S) of size S drawn from F(y, x). Independently, we generate a data set of size n from F(y, x) for each t = 1, ..., T, and calculate the estimators θ̂(·)^(t) and θ̂^(t) at each iteration t. Using these estimators, we obtain the estimates of (5.12) and of the O(h²) term of (3.3), respectively, where η̂_j estimates η_j(θ*), see (3.4). Here x*_{sj} is the j-th component of x*_s. The vector θ† can be chosen as θ† = θ* for the parametric case μ(x) = m(x, θ*). For the more general situation where μ(x) differs from the parametric model (i.e., for the "approgression" case), it is not as easy to determine the "true" vector θ* as the solution of (2.9); in this case we utilize a suitably defined θ†.
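The Monte Carlo replacement of the integrals can be sketched generically: an expectation E[g(Y, X)] under F is approximated by the average of g over an i.i.d. sample. The sampler below uses the normal design of the next subsection, with a uniform design density assumed by us for illustration (the paper does not specify q here):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(g, sampler, S=200_000):
    """Approximate E[g(Y, X)] under F(y, x) by the sample average of g
    over i.i.d. draws (y_1, x_1), ..., (y_S, x_S) from F."""
    ys, xs = sampler(S)
    return float(np.mean(g(ys, xs)))

def sampler_normal(S):
    """Draws from the normal design: X uniform on [0, pi] (our assumption)
    and Y | X = x ~ N(sin x, 0.2^2), as in the simulation design below."""
    xs = rng.uniform(0.0, np.pi, size=S)
    ys = rng.normal(np.sin(xs), 0.2)
    return ys, xs

# e.g. mc_expectation(lambda y, x: (y - np.sin(x))**2, sampler_normal)
# should be close to the conditional variance 0.04.
```

A large S is used in place of the paper's S = 200 purely to make the sanity check tight; in the simulations themselves a much smaller S suffices.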
Under the choice G = u, it is easily verified that (5.14) can be expressed in matrix form, where ⊙ denotes the Hadamard product (see Chapter 7 in [19]), X* = [1_S X*], the diagonal matrix has entries c̃_2(y*_1, x*_1, θ†), ..., c̃_2(y*_S, x*_S, θ†), and 1_S stands for the S-dimensional vector of ones.
For simplicity, d = 1 is adopted in the following experiments. We also set the parameters to S = 200 and T = 100.

Normal regression example
Under the usual setting of linear regression with normal errors, we model the joint density f(y, x) associated with the distribution function F(y, x) as in (5.15), i.e., the conditional distribution of Y given X = x is N(μ(x), (0.2)²). We choose the true regression function as μ(x) = sin x in this simulation design. This example has also been used in [9]. Further, we note that U(t) = t²/2 corresponds to normal regression.
The results of the simulations are exhibited in Figures 2(a) and (b). For both simulated cases c = 1, 1.4, the estimate L̂ in (5.13) is positive (dashed), which means that the local estimator improves the risk. As h increases from h² = 1² · √n to h² = (1.4)² · √n, L̂ becomes uniformly smaller, meaning that the difference between the local estimator μ̂_L(x) and the global estimator μ̂_G(x) disappears, just as the theory in the previous section claims. The claim of Theorem 5 is clearly illustrated, especially in Figure 2.

Bernoulli regression example
Next we implement a Bernoulli regression example with the following experimental design: the conditional distribution of Y given X = x is the Bernoulli distribution with probability of success equal to μ(x). The true regression function μ(x) was chosen as in (5.10), with a corresponding parametric model and the convex function U(t) associated with the Bernoulli (binomial) regression. The results of the simulations are exhibited in Figures 3(a) (c = 1) and (b) (c = 1.4). A tendency similar to that in Figures 2(a) and (b) can now be observed in Figures 3(a) and (b): the risk difference is positive, which reveals the superiority of the local estimator over the global one. Although the curve of R̂(θ†)/h² happens to lie uniformly below the risk-difference curve in these two cases, the two curves are fairly close.
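As a hedged illustration of the Bernoulli case: the negative-entropy generator below is the standard convex function whose Bregman divergence yields the Bernoulli (binomial) deviance (the Kullback-Leibler divergence between success probabilities); we use it here as our assumption for the U(t) referred to in the text, and the logistic choice of μ is ours as well.

```python
import numpy as np

rng = np.random.default_rng(2)

def U_bernoulli(t):
    """Negative entropy: standard convex generator for Bernoulli regression
    (our stand-in for the paper's U(t))."""
    t = np.clip(t, 1e-12, 1 - 1e-12)
    return t * np.log(t) + (1 - t) * np.log(1 - t)

def bregman(U, a, b, eps=1e-6):
    """D_U(a, b) = U(a) - U(b) - U'(b)(a - b); U' by central differences."""
    dU = (U(b + eps) - U(b - eps)) / (2 * eps)
    return U(a) - U(b) - dU * (a - b)

# Y | X = x ~ Bernoulli(mu(x)); the logistic mu is illustrative only
n = 500
x = rng.uniform(-3.0, 3.0, n)
mu = 1.0 / (1.0 + np.exp(-x))
y = (rng.uniform(size=n) < mu).astype(float)

# divergence from the truth to itself is zero; to a constant guess it is not
d_truth = float(np.mean(bregman(U_bernoulli, mu, mu)))
d_const = float(np.mean(bregman(U_bernoulli, mu, np.full(n, 0.5))))
```

With this U, D_U(a, b) reduces to a log(a/b) + (1 − a) log((1 − a)/(1 − b)), the binary Kullback-Leibler divergence, which is what makes the Bernoulli example a Bregman-divergence instance.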

Robustness of Regression Estimators
In the rest of this section, we fix U as the pseudo-Huber loss in (4.2). We aim to check whether the global and the local estimators are robust against outliers. To do this, we generate the data as follows. Ordinary data (with no outliers) is generated according to the distribution model (5.15). Data with outliers is generated according to the contaminated model (5.16), where f(y, x) is that in (5.15); by (5.16), outliers appear in the region (5.17) with a small probability p. For the calculation of the estimate of the risk we generate, in a similar manner, another independent data set (y*_1, x*_1), . . . , (y*_S, x*_S) according to (5.15) and (5.16). We investigate the robustness of the estimators via the variation of the global risk. Following Section 4.3, the estimate of the risk for an estimator μ̂ can be defined as in (5.18), where μ̂^(t) is the estimator obtained from the t-th simulated sample of size n drawn from (5.15) or (5.16). The risk comparison was carried out under the setting μ(x) = sin x, n = 100, p = 0, 0.05, 0.1, T = 100, S = 200, δ = 6, 12 and h = 1.5, 3. Table 1 reports 10⁴ times the calculated estimates of the risk (5.18) for the estimators μ̂_G1, μ̂_L1, μ̂_G2, μ̂_L2 and μ̂_LSE(x) = θ̂ᵀ_LSE x, where θ̂_LSE is the usual least-squares estimator; θ̂_1 and θ̂_1(x) are obtained via the algorithms in Sections 5.1.1 and 5.1.2, respectively, with G = u⁻¹, while θ̂_2 and θ̂_2(x) are those obtained with G = u. We observe from Table 1 that the risk of μ̂_LSE is the most affected by outliers. For the case δ = 6, the local estimators μ̂_L1 and μ̂_L2 perform better than the global estimators μ̂_G1 and μ̂_G2, and than μ̂_LSE, irrespective of the value of p. The results for the smaller bandwidth h = 1.5 are uniformly better than those for h = 3. The global estimators behave almost similarly to μ̂_LSE. The same tendency is observed for δ = 12. Hence, it can be claimed that the local estimators are more robust in the sense of smaller values of the risk (5.18).
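A minimal sketch of the robustness mechanism at work here: we use the standard pseudo-Huber form as a stand-in for the U of (4.2) (whose exact parametrisation we do not reproduce), fit a linear model by plain gradient descent rather than the algorithms of Section 5.1, and contaminate by vertical shifts at random design points, which simplifies the region-based contamination of (5.16).

```python
import numpy as np

rng = np.random.default_rng(3)

def pseudo_huber(t, delta):
    """Standard pseudo-Huber loss: quadratic near 0, asymptotically linear."""
    return delta ** 2 * (np.sqrt(1.0 + (t / delta) ** 2) - 1.0)

def fit_pseudo_huber(y, x, delta, steps=200, lr=0.05):
    """Linear fit minimising mean pseudo-Huber loss by gradient descent
    (an illustrative optimiser, not the paper's algorithm)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(steps):
        r = y - X @ beta
        # psi(r) = r / sqrt(1 + (r/delta)^2) is d/dr of pseudo_huber
        grad = -X.T @ (r / np.sqrt(1.0 + (r / delta) ** 2)) / len(y)
        beta -= lr * grad
    return beta

# contaminated sample: with probability p a point is a gross outlier
n, p = 200, 0.1
x = rng.uniform(-3.0, 3.0, n)
y = 1.2 + 0.8 * x + 0.2 * rng.standard_normal(n)
out = rng.uniform(size=n) < p
y[out] += 10.0                           # vertical outliers

beta_ls = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]),
                          y, rcond=None)[0]
beta_ph = fit_pseudo_huber(y, x, delta=1.0)
```

Because the pseudo-Huber influence is bounded (roughly by δ) while the squared-error influence grows linearly, the contaminated points drag the least-squares intercept upwards much more than the pseudo-Huber one, mirroring the behaviour of μ̂_LSE in Table 1.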
We further investigated the behaviour of the estimators using yet another risk measure: the MISE. For an estimator μ̂ of μ, an estimate of the integrated squared error of the t-th estimator μ̂^(t) is calculated and averaged over the T replications. It is seen from Table 2 that the local estimators μ̂_L1 and μ̂_L2 perform well even in this comparison based on a usual accuracy measure such as the MISE. However, there is a possibility that the local estimators might be outperformed by the global estimators for larger p.
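The MISE computation just described can be sketched as follows; the grid-based Riemann approximation of the integral, the uniform design, and the linear fit are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def mise_estimate(fit, mu, T=50, n=100):
    """Monte Carlo MISE: average over T replications of the integrated
    squared error, the integral approximated on an evaluation grid."""
    grid = np.linspace(-3.0, 3.0, 61)
    ises = []
    for _ in range(T):
        x = rng.uniform(-3.0, 3.0, n)
        y = mu(x) + 0.2 * rng.standard_normal(n)
        mu_hat = fit(y, x)
        # ISE of the t-th estimator over [-3, 3] (interval length 6)
        ises.append(6.0 * np.mean((mu_hat(grid) - mu(grid)) ** 2))
    return float(np.mean(ises))

def fit_ls(y, x):
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda t: b[0] + b[1] * t

mise = mise_estimate(fit_ls, np.sin)
```

The MISE of the misspecified linear fit is dominated by its squared bias against μ(x) = sin x; a localised fit would reduce exactly this component.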
Next, we calculated estimators using the two sets of data: with and without outliers, to investigate their robustness. A simple linear regression is considered under the same setting as above, where the true regression function is now μ(x) = 1.2 + 0.8x.
We calculated and compared the estimators μ̂_G1(x), μ̂_L1(x) and μ̂_LSE(x). Figure 4 reports the results of the comparison of these three regression estimators under the setting n = 100, p = 0, 0.1, δ = 4 and h = 1.5 for μ̂_L1. Figure 4(a) illustrates the no-outliers case (p = 0). Obviously μ̂_LSE (solid) gives a good fit, while μ̂_G1 (dashed) and μ̂_L1 (dotted) look slightly curved when using G⁻¹ = u. Figure 4(b) exhibits the result when the data includes outliers generated according to (5.16) with p = 0.1. We observe from Figure 4(b) that all three estimators are affected by the outliers in the region (5.17), but μ̂_LSE is clearly the most affected. The local estimator μ̂_L1 adjusts to local features of the data and hence comes closer to the outliers than the global estimator μ̂_G1; however, the fit of the local estimator over −3 ≤ x ≤ 0.5 is better than that of the global one. It becomes clear that μ̂_G1 and μ̂_L1 are more robust than μ̂_LSE. In this particular example μ̂_G1 seems more robust than μ̂_L1, which might be due to the fact that the outliers are concentrated in a single cluster. We therefore also implemented a design where the outliers were spread over more clusters: the data with outliers is generated according to the mixture h(y, x | p) in (5.20), where f(y, x) is as in (5.15). The result for p = 0.1 is exhibited in Figure 4(c). The global estimator (dashed) behaves similarly to the LSE-based estimator (solid), but the two clusters of outliers have a more significant effect on the LSE estimator. The local estimator with h = 1.5 (dotted) fits the data in the main area −2 ≤ x ≤ 1, but also follows the outliers outside it. Hence the global estimator is more robust than the local one also in this case with two clusters of outliers. Summarising the results of this Section 5.3, we can say that both the global and the local estimators perform well.
The local estimator may sometimes be less competitive than the global one with respect to resistance to outliers, as it has not been constructed with this goal in mind. However, it is well tailored to minimising the combined effect of model misspecification and sensitivity to outliers. This is an important property of the local estimator since, in practice, there is typically a need to deal with both effects.

Simulation related to Theorem 7
We have designed a simulation to check Theorem 7 under the setting of Poisson regression. The ordinary data (with no outliers) is generated according to (5.21), which means that Y | X = x ~ Poisson(μ(x)) and X ~ U(−3, 3), the uniform distribution on the interval [−3, 3]. The data with outliers is generated according to (5.22), where g(y, x) contains the indicator 1(y = 30); this means that the outliers occur as Y = 30 on −3 ≤ x ≤ −1 with a small probability p. For the calculation of the risk we generate, in a similar manner, another independent data set (y*_1, x*_1), . . . , (y*_S, x*_S) according to (5.21) and (5.22). To illustrate the effect of using Theorem 7, we focus on (4.9) and on the term on the right-hand side of the theorem. Exploiting Monte Carlo integrals similar to those used in the previous sections, we can compose, with reference to (5.18), an estimate RD(μ̂_G1, μ̂_L1) of the risk difference between μ̂_G1 and μ̂_L1 in (5.23) as an estimate of (4.9), and RHS(θ̂) in (5.24) as an estimate of the right-hand-side term. We carried out the simulation to check Theorem 7 for μ̂_G1 and μ̂_L1 with the settings μ(x) = 5 + 4 sin x, T = 100 and S = 200; the bandwidth h = n^{1/4} was utilized for the local estimator μ̂_L1. Theorem 7 claims that the risk difference RD(μ̂_G1, μ̂_L1) in (5.23) is asymptotically bounded from below by RHS(θ̂) in (5.24) divided by h² = n^{1/2}; this is inequality (5.25). We observe from Figures 5(a) and (b) that the risk difference is positive, which reveals a risk improvement by the proposed local method. The risk difference for the data including outliers (solid line in Figure 5(b)) is slightly bigger than that without outliers (solid line in Figure 5(a)). The dashed line stays below the solid line in both figures; the zigzag shape of the solid line can be attributed to small-sample fluctuation. This demonstrates that in this example the inequality (5.25) holds not only in the case p = 0 but also in the case p = 0.05. It also holds for p = 0.1, although we do not include the graph for this case.
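The contaminated Poisson design for this experiment can be sketched as follows. The Bernoulli contamination mechanism and the outlier placement (Y = 30 on −3 ≤ x ≤ −1) follow the description in the text, while the exact densities (5.21)-(5.22) are only paraphrased here.

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_poisson(n, p):
    """Y | X = x ~ Poisson(mu(x)) with mu(x) = 5 + 4 sin x and X ~ U(-3, 3);
    with probability p a point is replaced by an outlier Y = 30 located
    on -3 <= x <= -1, mimicking the contaminated design."""
    x = rng.uniform(-3.0, 3.0, n)
    y = rng.poisson(5.0 + 4.0 * np.sin(x)).astype(float)
    out = rng.uniform(size=n) < p
    x[out] = rng.uniform(-3.0, -1.0, out.sum())
    y[out] = 30.0
    return y, x

y0, x0 = draw_poisson(1000, p=0.0)      # clean data, as in (5.21)
y1, x1 = draw_poisson(1000, p=0.05)     # contaminated data, as in (5.22)
```

Since μ(x) ≤ 9, values of Y near 30 are essentially impossible under the clean model, so the contaminated points stand out sharply in the region −3 ≤ x ≤ −1 where the risk comparison of (5.23)-(5.25) is then evaluated.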

Discussion
This paper presents a unified way to compose a localised regression inference method by utilising the following triplet (U, G, K): a strictly convex function U for the estimation scheme with the functional Bregman divergence, the link function G in the parametric model utilised, and the kernel function K for localisation.
In most statistical analyses, the estimation scheme is considered separately from the model, and the model is often suggested irrespective of the estimation procedure: the choice of the estimation method and the construction of the model are independent. A natural question arises: is there a model that is more suitable for the estimation scheme we would like to use? The results in Sections 3 and 4 provide an answer to this question: a specific relationship between the choice of U (for the estimation scheme in (2.3)) and G (for the model in (2.2)) yields an advantage for the local estimator associated with the kernel K; see Theorems 5, 6, 7, 8 and 9. This suggests that it is beneficial to construct the model with reference to the estimation scheme to be implemented. One then need not be overly concerned about the discrepancy between the model and the underlying true structure, as the localisation helps to adjust to the latter.
As the choices of U and of G are interrelated (for example, U = u = G − α in Theorem 6), it is not really important which one is chosen "first". In practice, the choice can be dictated by our prior information and by the degree of confidence in this prior information. If, for example, we have strong confidence in the form of the global regression function, then we can start by choosing the link function G and then determine U from the relationship U = G − α. On the other hand, if we do not have a firm idea about the parametric form of the regression function, or we wish to proceed in a robust way, then we may choose a suitable U first (for example, the pseudo-Huber loss) and then choose G accordingly as U + α. Since in this paper we have stressed the possibility of model misspecification (i.e., the case m(x, θ) ≠ μ(x)), we have focused on the choice of the function U in the first step.
In conclusion, we can say that the approgression approach adopted in this paper seems to be a realistic practical setting. In that sense, the paper [23] can be considered a special case of our methodology. Sensible asymptotic statements about the behaviour of estimators of the true regression function (i.e., of the conditional expected value of the output given the input) can be attained in our setting when this regression function is close to some parametric model. This type of requirement is similar to Fisher consistency in classical robustness theory. These are the types of statements that we have derived in this paper. We have revealed that the intricate relationship between the utilised regression model's inadequacy and its robustness can be analysed more conveniently by using the local approach developed in this paper, and we have supported our claims with a short simulation study. Several extensions are possible as future research. For example, additional sparseness-type penalties can be added to the main minimisation problem. Another problem to study is the data-driven choice of the constants c and δ in Section 5.

Proof of Theorem 1
Proof. Define F = {ρ(·, ·, θ) | θ ∈ Θ} and F_n(t) = {(1 − K((· − t)/h_n)) ρ(·, ·, θ) | θ ∈ Θ}. Theorem 1 can be proven in almost the same way as Theorems 1 and 2 in [14], using the notions of Glivenko-Cantelli classes of functions and bracketing numbers. By assumptions (A2)-(A5), the bracketing number N_[](ε, F, L_1(P)) is finite for any ε > 0; see Lemma 6.1 in [22]. This reveals that F is a Glivenko-Cantelli class and hence the right-hand side of (7.2) tends to zero as n grows. Further, using (A3), we conclude that θ̂ converges in probability to θ*; see Theorem 2.4.1 in [21] and also Theorem 1 in [14]. Next we focus on θ̂(t) and the decomposition (7.3). The first term on the right-hand side of (7.3) appears in (7.2), hence it tends to zero. It can be proven, using Lemmas 1 to 5 in [14], that the second and third terms also tend to zero. This, together with (A3), yields the desired convergence in probability of θ̂(t).

Proof of Theorem 2
Theorem 2 can be proven in almost the same way as Theorem 1 in [18].

Proof of Theorem 3
Theorem 3 can be obtained in a similar manner to Theorem 2 in [18].

Proof of Theorem 4
Proof. It is easily confirmed from Theorem 1 that the expansion holds with the term ∫ ρ^(2)(t, y, x, θ*) dF(y, x). Hence the result follows from the asymptotic normality of θ̂. A simple application of the delta method immediately yields the result for μ̂_L(t).

Proof of Theorem 5
Proof. By the definition of the Bregman divergence, it is easily confirmed that the first term on the right-hand side is nonnegative. By referring to Theorem 2, we can expand u(μ̂_G) − u(μ̂_L) accordingly. Further, since θ̂ converges in probability to θ*, Theorem 2 applies; noting also (2.8), (2.17) and (2.9), these calculations lead to a constant M such that the bound holds uniformly in x. Combining (7.17) and (7.18), for any ε satisfying 0 < ε < 3⁻¹ there exists δ_0 such that the required inequality holds for any δ > δ_0. The evaluations (7.16), (7.19) and (7.20) then show that, for δ > δ_0, the resulting expression is positive for any x ∈ D. This confirms that the matrix (4.7) is positive definite for sufficiently large δ.

Proof of Lemmas
Lemmas 1, 2 and 3 can be established in a similar way to Lemmas 1, 2 and 3 of [18].