Improved Density and Distribution Function Estimation

Given additional distributional information in the form of moment restrictions, kernel density and distribution function estimators with implied generalised empirical likelihood probabilities as weights achieve a reduction in variance due to the systematic use of this extra information. The particular interest here is the estimation of densities or distributions of (generalised) residuals in semi-parametric models defined by a finite number of moment restrictions. Such estimates are of great practical interest, being potentially of use for diagnostic purposes, including tests of parametric assumptions on an error distribution, goodness-of-fit tests or tests of overidentifying moment restrictions. The paper gives conditions for the consistency and describes the asymptotic mean squared error properties of the kernel density and distribution estimators proposed in the paper. A simulation study evaluates the small sample performance of these estimators. Supplements provide analytic examples to illustrate situations where kernel weighting provides a reduction in variance together with proofs of the results in the paper.


Introduction
In many statistical and economic applications, additional distributional information about the $d_z$-vector of observations $z$ may be available in the form of moment restrictions on its distribution. These constraints may arise from a particular economic or physical law, e.g., Chen (1997, Section 5), be implied by estimating equations, Qin and Lawless (1994, Example 1), or correspond to known population moments of another observable random vector correlated with $z$, as in survey samples with auxiliary population information available from census data, e.g., Chen and Qin (1993) and Qin and Lawless (1994, Example 2). The primary purpose of the paper is to explore the advantages of this additional information for the estimation of the density and distribution function of a scalar residual-like function of $z$ which may depend on unknown parameters.
To this end, let $g(z, \beta)$ denote a $d_g$-vector of known functions of the data observation $d_z$-vector $z \in Z$ and the $d_\beta$-vector $\beta \in B$ of parameters, where the sample space $Z \subseteq \mathbb{R}^{d_z}$ and parameter space $B \subset \mathbb{R}^{d_\beta}$ with $d_\beta \le d_g$. The moment indicator vector $g(z, \beta)$ will form the basis for inference in the following discussion and analysis. In particular, it is assumed that the true value $\beta_0$ taken by $\beta$ uniquely satisfies the population unconditional moment equality condition
$$E[g(z, \beta_0)] = 0, \tag{1.1}$$
efficiency gains from the knowledge that the mean of residuals is zero.
The outline of the paper is as follows. Section 2 briefly describes (G)EL estimation and the associated (G)EL implied probabilities. The main results concerning p.d.f. and c.d.f. estimators are given in Sections 3 and 4 for both known and unknown $\beta_0$ cases. The finite sample performance of the proposed estimators is evaluated via a simulation study reported in Section 5. Section 6 concludes. Supplements A: Proofs and B: Examples in the Supplementary Information respectively detail some additional assumptions for, and the proofs of, the results in the main text and analyse a number of examples to illustrate the properties of the estimators developed in the paper.
Remark 2.3. The implied probabilities were given for EL by Owen (1988), for ET by Kitamura and Stutzer (1997), for quadratic $\rho(\cdot)$ by Back and Brown (1993), and for the general case in the 1992 working paper version of Brown and Newey (2002); see also Smith (1997). For any function $a(z, \beta)$ and GEL estimator $\hat\beta$ the implied probabilities can be used to form a semiparametrically efficient estimator $\sum_{i=1}^{n} \hat\pi_i a(z_i, \hat\beta)$ of $E[a(z, \beta_0)]$ as in Brown and Newey (1998).
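The re-weighting idea in Remark 2.3 can be illustrated numerically. The following is a minimal sketch only, assuming ET implied probabilities of the form $\pi_i \propto \exp(\lambda' g_i)$ with $\lambda$ minimising the sample mean of $\exp(\lambda' g_i)$, a known $\beta_0$, and the simple moment indicator $g(z) = z$; the function name `et_implied_probabilities` and the damped Newton iteration are illustrative choices, not taken from the paper.

```python
import numpy as np

def et_implied_probabilities(g, max_iter=100, tol=1e-10):
    """ET implied probabilities for an (n, d_g) array of moment indicators.

    Minimises the convex function mean(exp(g_i'lam)) by damped Newton steps;
    the returned weights are proportional to exp(g_i'lam) and sum to one.
    """
    g = np.asarray(g, dtype=float)
    if g.ndim == 1:
        g = g[:, None]
    n, d = g.shape
    lam = np.zeros(d)

    def objective(l):
        return np.mean(np.exp(g @ l))

    for _ in range(max_iter):
        e = np.exp(g @ lam)
        grad = g.T @ e / n
        if np.linalg.norm(grad) < tol:
            break
        hess = (g * e[:, None]).T @ g / n
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Halve the step until the objective does not increase.
        while objective(lam - t * step) > objective(lam) and t > 1e-10:
            t *= 0.5
        lam = lam - t * step
    w = np.exp(g @ lam)
    return w / w.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=200)          # illustrative data with E[z] = 0
pi = et_implied_probabilities(z)  # moment indicator g(z) = z (an assumption)
a = z ** 2                        # a(z) whose expectation is to be estimated

print("sum_i pi_i g_i :", float(pi @ z))
print("EDF-weighted   :", float(a.mean()))
print("re-weighted    :", float(pi @ a))
```

By construction the weights satisfy the sample moment condition $\sum_i \pi_i g_i = 0$ up to numerical tolerance, which is the property exploited by the re-weighted estimator of $E[a(z, \beta_0)]$.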

GEL-Based Density Estimation
Suppose the p.d.f. $f(\cdot)$ of the scalar random variable $u = u(z, \beta_0)$ is of interest, where the scalar function $u : Z \times B \to U \subseteq \mathbb{R}$ is known up to the parameter vector $\beta_0$.
Let $N$ denote an open neighbourhood of $\beta_0$.
Assumption 3.1. For all $\beta \in N$ there exists a function $v : Z \times B \to V \subseteq \mathbb{R}^{d_z - 1}$ such that the vector of functions $(u(z, \beta), v(z, \beta)')'$ is a bijection between $Z$ and $U \times V$.
Remark 3.1. Equivalently Assumption 3.1 may be restated as requiring that for every $\beta \in N$ there exists a bijection between $z$ and some $d_z$-vector $w = w(z, \beta)$ such that, given $\{w_j(z, \beta)\}_{j=2}^{d_z}$, $u(z, \beta)$ and $w_1(z, \beta)$ are bijective. That is to say, $z$ may be solved for uniquely given values for $u$, $v$ and $\beta$.
Remark 3.2. A function $u(z, \beta)$ satisfying Assumption 3.1 may be thought of as defining a generalised residual in the sense of Cox and Snell (1968) and Loynes (1969), with $\hat u_i = u(z_i, \hat\beta)$, $i = 1, \ldots, n$, the estimated residuals. Of course, other possibilities of interest are included, e.g., estimating the density of an element of $z$ subject to the extra information available in the moment condition (1.1).
When $\beta_0$ is known, $u_i = u(z_i, \beta_0)$, $i = 1, \ldots, n$, are observed and $f$ can be estimated by the classical kernel density estimator
$$\tilde f(u) = \frac{1}{n} \sum_{i=1}^{n} k_b(u - u_i), \tag{3.1}$$
where $k_b(x) = k(x/b)/b$, $k$ is a kernel function and $b = b_n > 0$ is a bandwidth sequence; see Rosenblatt (1956) and Parzen (1962). The estimator $\tilde f$ (3.1) will serve as a benchmark for later comparisons. The properties of $\tilde f$ are well known and can be formally established under different combinations of smoothness and integrability conditions on the kernel $k$ and density $f$; see, e.g., Rao (1983, Section 2.1). A standard set of such conditions is given in Assumption 3.2 below. If $k$ is square integrable, but not absolutely integrable, as is the case for the sinc kernel, conditions such as those in Tsybakov (2009, Theorem 1.5) can be imposed.
Remark 3.4. Higher order approximations to MSE[$\tilde f(u)$] can be obtained if $f$ is sufficiently smooth. See, e.g., Rao (1983, Theorem 2.1.5), Wand and Jones (1995, Section 2.8) or Pagan and Ullah (1999, Section 2.4.3). The idea of using higher order kernels as a bias reduction technique originates at least as far back as Bartlett (1963). Hence,
Remark 3.5. If $k$ is a $(2r)$th order kernel and Assumption 3.2(b) holds with $s = 2r$, the remainder term in $E[\tilde f(u)]$ is $o(b^{2r})$. The $\sim n^{-1}$ term is kept explicit with an $O$ remainder for reasons that will become apparent below.
The mean integrated squared error (MISE), MISE[$\tilde f$] $= E\int (\tilde f(u) - f(u))^2\,du$, is a commonly used global measure of performance. The optimal bandwidth is then defined as that value of $b > 0$ minimising MISE, or an approximation thereof. In particular, the asymptotically optimal bandwidth is defined as the value $b^*$ minimising the two leading terms in the expansion, so that $b^* \sim n^{-1/(4r+1)}$. The asymptotically optimal MISE is thereby
Remark 3.6. If $k$ is of order greater than two, it necessarily takes negative values. Hence $\tilde f$ (3.1) itself need not be a density function. Note, however, that the positive part estimator, $\tilde f^{+}(u) = \max[\tilde f(u), 0]$, has MSE at most equal to MSE[$\tilde f(u)$]. Further modifications that ensure integration to unity can be applied as described in Glad, Hjort and Ushakov (2003).
The GEL-based kernel density estimator incorporates the information embedded in the moment restriction (1.1) by replacing the sample EDF weights $n^{-1}$ in the construction of $\tilde f(u)$ (3.1) with the implied probabilities $\tilde\pi_i$, $i = 1, \ldots, n$; viz.
$$\tilde f_\rho(u) = \sum_{i=1}^{n} \tilde\pi_i k_b(u - u_i), \tag{3.4}$$
and associated auxiliary parameter $\eta$; see Smith (2011, Section 3).
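The step from EDF weights to implied probabilities amounts to passing a different weight vector to the same kernel sum. The sketch below is illustrative only: it assumes a Gaussian kernel, a rule-of-thumb bandwidth, and a simple linear tilt of the weights that enforces the zero-mean restriction $\sum_i w_i u_i = 0$ as a stand-in for genuine GEL implied probabilities; the helper names `kernel_density` and `trapezoid` are hypothetical.

```python
import numpy as np

def gaussian_kernel(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def kernel_density(u_grid, u_obs, b, weights=None):
    """Evaluate sum_i w_i k_b(u - u_i) on u_grid, with k_b(x) = k(x/b)/b."""
    u_grid = np.asarray(u_grid, dtype=float)
    u_obs = np.asarray(u_obs, dtype=float)
    if weights is None:
        # EDF weights n^{-1} give the classical kernel density estimator.
        weights = np.full(u_obs.size, 1.0 / u_obs.size)
    kb = gaussian_kernel((u_grid[:, None] - u_obs[None, :]) / b) / b
    return kb @ weights

def trapezoid(y, x):
    """Trapezoidal rule on a grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

rng = np.random.default_rng(1)
u = rng.normal(size=300)                       # illustrative residuals
grid = np.linspace(-5.0, 5.0, 1001)
b = 1.06 * u.std(ddof=1) * u.size ** (-0.2)    # rule of thumb (an assumption)

# Stand-in weights: linear tilt so that the weighted mean of u is zero.
ubar = u.mean()
s2 = np.mean((u - ubar) ** 2)
w = (1.0 - (u - ubar) * ubar / s2) / u.size

f_edf = kernel_density(grid, u, b)             # EDF-weighted estimate
f_rw = kernel_density(grid, u, b, weights=w)   # re-weighted estimate

print("weighted mean of u:", float(w @ u))
print("integrals:", trapezoid(f_edf, grid), trapezoid(f_rw, grid))
```

Both estimates integrate to approximately one on the grid because the weights sum to one in each case; only the allocation of mass across observations differs.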
Remark 3.8. If the validity of the moment restriction (1.1) is in doubt, a pre-test can be conducted using the GEL-based criterion (2.1) paralleling the classical likelihood ratio test; see, e.g., Kitamura and Stutzer (1997), Imbens et al. (1998) and Smith (1997, 2011). For example, under the null hypothesis that (1.1) holds for some unique $\beta_0 \in B$, the normalised GEL criterion (2.1) evaluated at the estimated parameters, $2nP_n(\hat\beta, \hat\lambda)$, is asymptotically chi-square distributed with $d_g - d_\beta$ degrees of freedom. The parametric null hypothesis that $\beta_0$ takes a known value can be tested at the $\alpha$ level using the
To describe the properties of the GEL-based kernel density estimator $\tilde f_\rho(u)$ (3.4), the shorthand notation, e.g., $E[g_i|u] = E[g(z, \beta_0)|\{z : u(z, \beta_0) = u\}]$, for conditional expectations given $u$ is adopted.
Whenever $c_\rho = 0$, as is the case for (G)EL with $\rho_3 = -2$, e.g., EL, the $n^{-1}$ bias term in (3.5) vanishes. In general, provided the bandwidth does not go to zero faster than $n^{-1/(2r)}$, and certainly when $b = b^* \sim n^{-1/(4r+1)}$, this bias term is at most third order. Its contribution to MISE is via the integrated squared bias (ISB) with the $O(n^{-1}b^{2r})$ term generally non-zero and either positive or negative. With the asymptotically optimal bandwidth, $n^{-1}(b^*)^{2r} \sim n^{-3/2+1/(8r+2)}$, which approaches $n^{-3/2}$ arbitrarily closely as $r$ increases, whereas the leading terms in MISE[$\tilde f(\cdot; b^*)$] become arbitrarily close to $n^{-1}$. As long as $E[g_i|u] \neq 0$, the GEL-based estimator $\tilde f_\rho$ enjoys a second-order reduction in variance due to the $n^{-1}$ term in (3.6), which does not depend on the choice of GEL carrier function $\rho(\cdot)$. Hence
While this reduction is negligible asymptotically, the leading term in MISE[$\tilde f$] approaches zero only a little more slowly than $n^{-1}$. Hence the effect could be substantial in small samples.
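The rate arithmetic in this paragraph can be checked with exact fractions. This is a small sketch assuming only the stated rate $b^* \sim n^{-1/(4r+1)}$ and the standard balance between a squared bias of order $b^{4r}$ and a variance of order $(nb)^{-1}$; the function name `rate_exponents` is illustrative.

```python
from fractions import Fraction

def rate_exponents(r):
    """Exponents of n for a (2r)th order kernel with b* ~ n^{-1/(4r+1)}."""
    b_exp = Fraction(-1, 4 * r + 1)      # b* ~ n^{b_exp}
    mise_lead = 4 * r * b_exp            # leading MISE terms ~ n^{-4r/(4r+1)}
    bias_contrib = -1 + 2 * r * b_exp    # n^{-1} (b*)^{2r}
    return b_exp, mise_lead, bias_contrib

for r in (1, 2, 5, 50):
    b_exp, mise_lead, bias_contrib = rate_exponents(r)
    # Agrees with the expression -3/2 + 1/(8r + 2) quoted in the text.
    assert bias_contrib == Fraction(-3, 2) + Fraction(1, 8 * r + 2)
    print(f"r={r}: b* ~ n^({b_exp}), MISE ~ n^({mise_lead}), "
          f"n^-1 b*^2r ~ n^({bias_contrib})")
```

As $r$ grows the leading MISE exponent tends to $-1$ and the bias contribution exponent tends to $-3/2$, so the gap between the leading terms and the $n^{-1}$ variance reduction narrows.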
Intuitively, if $b$ is very small, the kernel $k_b(u - \hat u_i)$ is very narrowly centered around the incorrect value $\hat u_i$, potentially excluding the true value $u_i$; see, e.g., Silverman (1986, Figure 2.5) for a generic illustration. Assumption 3.3(c) requires $nb^4 \to \infty$ regardless of the values of $\tau$ and $\alpha$, and $b = n^{-1/4}$ is the fastest rate achievable when $\alpha = \tau = 1$. Note that the optimal bandwidth $b^*$ is excluded if $[\alpha(4r + 1) - 2]\tau < 2$.
To obtain higher order expansions for the mean and variance of $\hat f(u)$ (3.7) and $\hat f_\rho(u)$ (3.8) requires a further strengthening of the assumptions. Let $\nabla u(z, \beta)$ and $\nabla^2 u(z, \beta)$ denote respectively the $d_\beta$-vector and $d_\beta \times d_\beta$ matrix of the first and second derivatives of $u(z, \beta)$ with respect to $\beta$. Also let $\nabla u_i = \nabla u(z_i, \beta_0)$ and $\nabla^2 u_i = \nabla^2 u(z_i, \beta_0)$.
Remark 3.9. The general conclusion of Theorem 3.3 for both bias and variance is identical to that of Theorem 3.1, i.e., the estimation effects of substituting $\hat u_i$ for $u_i$, $i = 1, \ldots, n$, and the GEL implied probabilities $\hat\pi_i$ for $\tilde\pi_i$, $i = 1, \ldots, n$, are both of order $n^{-1}$. The bias term in $\hat f_\rho$ induced by estimation is similar to that for $\tilde f_\rho$ in Theorem 3.1 except that $P$ in (3.10) replaces $\Omega^{-1}$ in (3.5) and two extra terms enter via $\zeta_\lambda$, viz. $-a$ and $E[G_i H g_i]$ in (2.2). These latter terms appear in the higher order asymptotic bias $n^{-1} H(-a + E[G_i H g_i])$ for the infeasible GEL estimator based on the optimal moment indicator vector $G'\Omega^{-1} g(z, \beta)$, see Newey and Smith (2004, Theorem 4.2), and are inherited by all GEL estimators. Unlike Theorem 3.1 for the known $\beta_0$ case, this term no longer vanishes for a particular choice of a carrier function $\rho$. The replacement of $\Omega^{-1}$ by $P$ represents the loss of information occasioned by the estimation of $\beta_0$. In a number of cases, the term $E[g_i|u]' P E[g_i|u]$ may vanish; see, e.g., Supplement B: Example B.3. This of course always occurs for an exactly identified model, $d_g = d_\beta$, since $\hat\pi_i = n^{-1}$ and $\hat f_\rho$ (3.8) and $\hat f$ (3.7) are identical. However, see Supplement B: Example B.4, in general $\hat f_\rho$ may still enjoy a second-order reduction in variance due to the systematic use of overidentifying moment information (1.1).

Bias Correction
While the contribution from the n −1 bias terms to MISE is of a lower order than the contribution from the variance terms, the effect of bias can be substantial in small and moderate samples, potentially offsetting any reduction in variance. The direction of the bias cannot of course be known a priori. Hence it may be advisable to bias-correct the density estimates by estimating and subtracting the n −1 bias term.
To be more specific, the bias-corrected estimates are defined as
where $\hat\delta(u)$ and $\hat\delta_\rho(u)$ are suitable (asymptotically) unbiased estimators of $\delta(u)$ (3.9) and $\delta_\rho(u)$ (3.10). The implied probabilities $\hat\pi_i$, $i = 1, \ldots, n$, can be used to obtain efficient estimators of the component quantities entering $\delta(u)$ and $\delta_\rho(u)$, with the modifications described in Glad et al. (2003) applied to ensure that the bias-corrected estimate is a density.

GEL-Based Distribution Function Estimation
The results for distribution function estimation parallel those given in Section 3 for density estimation but can be shown to hold under much weaker conditions, and so are given here separately.

Known β 0
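When $u_i$, $i = 1, \ldots, n$, are observed, the c.d.f. $F$ of $u(z, \beta_0)$ can be estimated by
$$\tilde F(u) = \frac{1}{n} \sum_{i=1}^{n} K_b(u - u_i), \tag{4.1}$$
where $K_b(x) = K(x/b)$ and $K(x) = \int_{-\infty}^{x} k(t)\,dt$; see Nadaraya (1964) and Watson and Leadbetter (1964). The kernel distribution function estimator (4.1) can be obtained by integrating (3.1) or motivated as a smoothed version of the EDF.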
Assumption 3.2(a)(i) is sufficient for $\tilde F$ to be an asymptotically unbiased and consistent estimator of $F$ at all continuity points of $F$ if $b \to 0$ as $n \to \infty$. In addition, if $F$ is continuous then $\tilde F$ converges to $F$ uniformly with probability 1 (w.p.1); see Yamato (1973). If $k$ satisfies Assumption 3.2(a)(ii) with $\mu_{2r+2}(k) < \infty$ for some $r \ge 1$, $f$ satisfies Assumption 3.2(b) with $s = 2r + 1$, and $b \to 0$ as $n \to \infty$ (Assumption 3.2(c) is not required here), then
Provided $\psi(k) > 0$, the asymptotically optimal bandwidth minimising the leading terms in (4.2)
Remark 4.1. The leading term $n^{-1} V_F$ in (4.2) is the integrated variance and, hence, the MISE of the EDF. Thus, whenever $\psi(k) > 0$ and $b$ approaches zero at least as fast as $n^{-1/(4r-1)}$, kernel smoothing provides a second order asymptotic improvement in MISE relative to the EDF. Smoothness of the kernel estimates and the reduction in MISE are the two main reasons to prefer the kernel distribution function estimator (4.1) over the EDF. The condition $\psi(k) > 0$ is satisfied if $k$ is a symmetric second order kernel, since in this case $\psi(k) = \int K(x)(1 - K(x))\,dx > 0$. Although $\psi(k)$ need not be positive in general, this property holds for certain classes of kernels, including Gaussian kernels of arbitrary order; see Oryshchenko (2017).
Remark 4.2. If $k$ is of order greater than two, $K$ is not monotone, and the resultant estimates may not themselves be distribution functions. However, if necessary, the estimates can be corrected by rearrangement; see Chernozhukov, Fernández-Val and Galichon (2009). The MISE of the rearranged estimator is at most equal to, and is often strictly smaller than, the MISE of the original estimator.
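On an equally spaced grid, monotone rearrangement reduces to sorting the estimated values. The sketch below is illustrative: the non-monotone "estimate" is a standard normal c.d.f. plus an artificial oscillation rather than the output of a higher order kernel, and the function name `rearrange` is hypothetical.

```python
import math
import numpy as np

def rearrange(values):
    """Monotone rearrangement on an equally spaced grid: sort the values."""
    return np.sort(np.asarray(values, dtype=float))

grid = np.linspace(-4.0, 4.0, 801)
F_true = 0.5 * (1.0 + np.vectorize(math.erf)(grid / math.sqrt(2.0)))

# Artificial non-monotone estimate (an assumption made for illustration).
F_est = F_true + 0.05 * np.sin(3.0 * grid)
F_rearranged = rearrange(F_est)

step = grid[1] - grid[0]
err_before = float(np.sum((F_est - F_true) ** 2) * step)
err_after = float(np.sum((F_rearranged - F_true) ** 2) * step)
print("integrated squared error before:", err_before, "after:", err_after)
```

Because the target is non-decreasing on the grid, sorting cannot increase the sum of squared deviations, which mirrors the weak improvement property stated in Remark 4.2.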
The modified GEL kernel distribution function estimator corresponding to $\tilde f_\rho$ (3.4), which incorporates the information embedded in the moment restrictions (1.1), is
Theorem 4.1. If Supplement A: Assumptions A.1-A.3 and 3.2(a)(i) are satisfied and $b \to 0$ as $n \to \infty$, then $\tilde F_\rho(u) = F(u) + o_p(1)$ at all points of continuity of $F$. If, in addition, Assumption 3.1 is satisfied, then
These results are qualitatively similar to Theorem 3.1, the important difference being that the reduction in variance is now first-order asymptotically, whereas the contribution from the $n^{-1}$ bias term in (4.4) to MISE is of order $n^{-1}b^{2r}$. Ceteris paribus, the asymptotically optimal c.d.f. bandwidth converges to zero at a faster rate than that for density estimation. Hence the additional bias effect can be expected to be of less importance.

Unknown β 0
When $\beta_0$ is unknown, the analogues of $\tilde F$ and $\tilde F_\rho$ are
respectively.
Theorem 4.2. If Supplement A: Assumptions A.1-A.3 and 3.2(a)(i) are satisfied, Assumption 3.3(b) holds with $\tau = 1$ for some $0 < \alpha \le 1$, and $b \to 0$ and
Similar to Theorem 3.2, Theorem 4.2 establishes that the differences between $\hat F$ (4.6) and $\hat F_\rho$ (4.7) and their counterparts based on observable $u_i$, $i = 1, \ldots, n$, are negligible asymptotically. No additional requirements are placed on $k$ beyond the standard conditions in 3.2(a)(i) and the restriction on the bandwidth is thus weaker than Assumption 3.3(c).
Higher order expansions similar to those in Theorem 3.3 may be obtained under the following conditions.
, but there is no requirement that ∆(u) is absolutely continuous in Theorem 4.3. Otherwise, the interpretation is exactly the same as in Theorem 3.3. In particular, the main qualitative conclusions in Supplement B: Examples B.3 and B.4 still hold.

Preliminaries
Consider the inverse hyperbolic sine (IHS) transformation model
where $\beta = (\delta, \gamma, \theta)'$ and $z = (y, x)'$. The IHS transformation was proposed by Johnson (1949, p.158) as an alternative to the Box-Cox power transform, $(y^\lambda - 1)/\lambda$, $y \ge 0$, and developed in Burbidge, Magee and Robb (1988) and MacKinnon and Magee (1990); see also, e.g., Ramirez, Moss and Boggess (1994), Brown, Greene, Harris and Taylor (2015) and the references therein for recent applications in statistics and econometrics, and Tsai, Liou, Simak and Cheng (2017) for comparisons with other transformations. When $\theta = 0$, the IHS transform is defined as the limiting value, $\lim_{\theta \to 0} \operatorname{arsinh}(\theta y)/\theta = y$, which corresponds to the Box-Cox transform with $\lambda = 1$; when $\theta \neq 0$, the shapes of the IHS transforms are similar to those of the Box-Cox with $\lambda < 1$. The advantage of the IHS transform is that it is a smooth function of $y \in \mathbb{R}$ and $\theta \in \mathbb{R}$ with values at $\theta = 0$ defined as the corresponding limits.
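The transform and its limiting value at $\theta = 0$ can be written down directly. This is a minimal sketch of the two transforms discussed above, with hypothetical function names `ihs` and `box_cox`; it is not the estimation code used in the paper.

```python
import numpy as np

def ihs(y, theta):
    """IHS transform arsinh(theta * y) / theta, with limit y at theta = 0."""
    y = np.asarray(y, dtype=float)
    if theta == 0.0:
        return y.copy()
    return np.arcsinh(theta * y) / theta

def box_cox(y, lam):
    """Box-Cox transform (y**lam - 1) / lam for y > 0, with log limit."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

y = np.array([-10.0, -1.0, 0.0, 1.0, 10.0, 100.0])
print("theta = 0    :", ihs(y, 0.0))
print("theta = 1e-6 :", ihs(y, 1e-6))   # numerically close to y
print("theta = 0.08 :", ihs(y, 0.08))   # concave for y > 0, odd in y
print("Box-Cox, lam = 1, y > 0:", box_cox(y[y > 0], 1.0))
```

Unlike the Box-Cox transform, `ihs` accepts negative and zero arguments, which is the smoothness advantage noted in the text.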
The infeasible optimal instruments in the IHS transformation model (5.1) are
see Robinson (1991). The last element of $S(x; \beta_0)$, $s_3(x; \beta_0)$, depends on the conditional distribution of $u$ given $x$, and, in general, there is little reason to argue for a particular scalar function of $x$ as a good
In all cases the true parameters are $\delta_0 = 1$, $\gamma_0 = 2$ and $\theta_0 = 0.08$, which yield a signal-to-noise ratio of $\gamma_0^2/(1 + \gamma_0^2) = 4/5 = 0.8$, somewhat more stringent than that of $16/17 = 0.941$ in Robinson (1991, Section 7).
Three data generating processes for (x, u) are considered.
does not depend on the number of moment conditions $d_g$ and is the asymptotic reduction in integrated variance due to the constraint that the mean of $u$ is zero; see also Supplement B: Example B.2. The second term in b is non-negative and represents the increase in integrated variance due to estimation of $\gamma_0$ and $\theta_0$; it decreases as the number of moment conditions increases; e.g., for $d_g = 4, 5, 10, 20$, $\tau' D \tau = 9.8092$, $9.8514$, $9.9857$ and $9.9859$, respectively.
Scenarios 2 and 3. $x$ and $u$ have joint density
Stacy (1962), with parameters $p = 2$, $d = \nu$ and $a = (2/\nu)^{1/2}$ for some $\nu > 4$ and $f_{NM}$ is the normal mixture density with $m$ components, viz.
$\sum_{j=1}^{m} \omega_j = 1$, and $\sum_{j=1}^{m} \omega_j \mu_j = 0$, i.e., $E[w] = 0$. Here $\phi(x)$ denotes the standard normal p.d.f. and $\phi_\sigma(x) = \phi(x/\sigma)/\sigma$. The joint density $f_{ux}$ is the density of $u = w/x$ and $x$ where $w$ and $x$ are independent. The conditional density of $u$ given is the density of a noncentral $t$-distributed random variable with $\nu$ degrees of freedom and noncentrality parameter $\eta$, allowing a wide variety of shapes for $f_u$ by varying the mixture $f_{NM}$. The skewed unimodal and bimodal densities shown in Figure 1 describe the NM densities for Scenarios 2 and 3 respectively, i.e., the mixture densities Marron and Wand (1992, #2 and #8) centered to have zero mean.
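Centering a normal mixture so that $\sum_j \omega_j \mu_j = 0$ only requires shifting every component mean by the mixture mean. The sketch below uses illustrative two-component parameters chosen for this example; they are placeholders and are not asserted to be the values used in the simulation study. The helper names `centre_means` and `mixture_density` are hypothetical.

```python
import numpy as np

def centre_means(weights, means):
    """Shift component means so that the mixture has zero mean."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    return means - weights @ means

def mixture_density(x, weights, means, sds):
    """Normal mixture density sum_j w_j phi_{sd_j}(x - mu_j)."""
    x = np.asarray(x, dtype=float)[:, None]
    comp = np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2.0 * np.pi))
    return comp @ weights

# Placeholder parameters for illustration only.
weights = np.array([0.75, 0.25])
means = np.array([0.0, 1.5])
sds = np.array([1.0, 1.0 / 3.0])

centred = centre_means(weights, means)
grid = np.linspace(-8.0, 8.0, 16001)
f_nm = mixture_density(grid, weights, centred, sds)

step = grid[1] - grid[0]
print("centred means:", centred)
print("integral:", float(f_nm.sum() * step))
print("mean    :", float((grid * f_nm).sum() * step))
```

The shift leaves the shape of the density unchanged, so skewness and bimodality are preserved while the zero-mean restriction $E[w] = 0$ holds.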

Results
The study compares the performance of GEL-based kernel p.d.f. and c.d.f. estimators. The GEL parameter estimators are CUE, EL and ET, the most notable special cases of the GEL family. For each estimator the mean and variance were computed on a grid of 1000 points between −5 and 5 and are reported as the integrated squared bias and integrated variance relative to those of the corresponding infeasible estimator based on the true $u$, i.e., $\tilde f$ and $\tilde F$.
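The grid-based summary measures can be computed as follows. This is a sketch under stated assumptions, not the MATLAB code used for the study: it uses standard normal data, a Gaussian kernel, a fixed bandwidth, far fewer replications, and the hypothetical helpers `kde` and `trapezoid`.

```python
import numpy as np

def kde(grid, u, b):
    """EDF-weighted Gaussian kernel density estimate on a grid."""
    x = (grid[:, None] - u[None, :]) / b
    return np.exp(-0.5 * x ** 2).sum(axis=1) / (u.size * b * np.sqrt(2.0 * np.pi))

def trapezoid(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

rng = np.random.default_rng(3)
grid = np.linspace(-5.0, 5.0, 1000)           # grid of 1000 points on [-5, 5]
true_f = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)

reps, n, b = 200, 100, 0.4                    # illustrative settings
est = np.empty((reps, grid.size))
for r in range(reps):
    est[r] = kde(grid, rng.normal(size=n), b)

bias = est.mean(axis=0) - true_f              # pointwise bias
var = est.var(axis=0)                         # pointwise variance (ddof=0)
isb = trapezoid(bias ** 2, grid)              # integrated squared bias
ivar = trapezoid(var, grid)                   # integrated variance
mise = trapezoid(((est - true_f) ** 2).mean(axis=0), grid)

print(f"ISB={isb:.3e}  IVar={ivar:.3e}  MISE={mise:.3e}")
```

With the variance computed using divisor `reps`, the simulated MISE equals ISB plus IVar exactly, so relative ISB, IVar and MISE for two estimators can be formed as simple ratios of these quantities.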
Tables 1, 2 and 3 report results for Scenarios 1, 2 and 3 respectively. The ISB, IVar and MISE (all $\times 10^5$) for the infeasible $\tilde f$ and $\tilde F$ are presented. Rows ISB, IVar, and MISE are the ISB, IVar, and MISE of $\hat f$, $\hat f_\rho$ ($\hat F$, $\hat F_\rho$) relative to the infeasible $\tilde f$ ($\tilde F$), respectively; row 'vs $d_g = 3$' is the MISE of $\hat f$, $\hat f_\rho$ ($\hat F$, $\hat F_\rho$) relative to the corresponding value for $d_g = 3$; row 'w. vs unw.' is the MISE of $\hat f_\rho$ ($\hat F_\rho$) relative to $\hat f$ ($\hat F$). Rows MISE, 'vs $d_g = 3$', and 'w. vs unw.' examine the significance of the paired t-statistics in a two-sided test for equality of the respective ISE means, e.g., $\int (\hat f(u) - f(u))^2\,du$; the symbol † indicates that the p-value is between 0.01 and 0.05, whereas ‡ indicates that it is less than 0.01, and in all other cases the p-value is greater than 0.05. Values of relative MISE less than 1 are emphasised in bold.
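The paired comparison of ISE means and the associated markers can be sketched as below. The ISE values are synthetic, the p-value uses a normal approximation to the t distribution (reasonable only for a large number of replications), and the function names `paired_t` and `marker` are hypothetical.

```python
import math
import numpy as np

def paired_t(ise_a, ise_b):
    """Paired t-statistic and two-sided p-value (normal approximation)."""
    d = np.asarray(ise_a, dtype=float) - np.asarray(ise_b, dtype=float)
    t = d.mean() / (d.std(ddof=1) / math.sqrt(d.size))
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return float(t), float(p)

def marker(p):
    """Significance marker following the convention of the tables."""
    if p < 0.01:
        return "‡"
    if p <= 0.05:
        return "†"
    return ""

rng = np.random.default_rng(4)
# Synthetic ISE draws for two estimators over the same replications.
common = rng.gamma(shape=2.0, scale=1e-3, size=10_000)
ise_unweighted = common + rng.normal(scale=1e-4, size=common.size)
ise_weighted = 0.98 * common + rng.normal(scale=1e-4, size=common.size)

t_stat, p_val = paired_t(ise_weighted, ise_unweighted)
print(f"relative MISE = {ise_weighted.mean() / ise_unweighted.mean():.3f}, "
      f"t = {t_stat:.2f}, p = {p_val:.3g} {marker(p_val)}")
```

Pairing matters here because both estimators are computed from the same simulated samples, so the differences in ISE are far less variable than the individual ISE values.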
All computations were carried out in MATLAB; the relevant code and additional results, including the properties of GEL estimators, are available from the first named author upon request. All results are based on 10,000 random draws.
The results reported in Table 1 confirm these predictions. In fact, the reduction in variance is even larger than expected in small and medium samples due to the $o(b)$ effects. Furthermore, estimators $\hat f$ and $\hat f_\rho$ have smaller ISB relative to $\tilde f$. A comparison of $\hat f$ and $\hat f_\rho$ between $d_g = 3$ (just-identified) and $d_g = 4, 5$ (over-identified) for moderate and larger sample sizes emphasises further the contribution of additional moment information. Hence $\hat f$ and $\hat f_\rho$ enjoy a reduction in MISE of as much as 21% for $n = 100$ and 10% for $n = 2000$ relative to $\tilde f$. The benefits are even more pronounced for c.d.f. estimation, where the reduction in MISE can be as much as 56% for $n = 100$ and around 53% in moderate samples. There are also small but statistically significant benefits to re-weighting which are mostly due to the smaller biases of $\hat f_\rho$ and $\hat F_\rho$ relative to $\hat f$ and $\hat F$ at moderate and larger sample sizes. There is some deterioration in ISB, IVar and, thus, MISE with increases in $d_g$ which can be attributed to the increased importance of outliers.
Finally, while in moderate and large samples the performances of CUE, EL, and ET are virtually identical, in small samples ET can be unstable with larger d g .

Scenarios 2 and 3
Scenarios 2 and 3 with densities of (x, u) which are heavy-tailed and also, e.g., skewed and bimodal, illustrate the many difficulties for both GEL estimation and kernel p.d.f. and c.d.f. estimation which are absent in the relatively benign Scenario 1.
The performance of CUE in small samples is generally worse than that of EL and ET. It ranks last by MSE in both scenarios with $n = 100$ and $500$, except Scenario 3 with $n = 100$ where ET underperforms. In a number of cases, increasing with $d_g$, the optimisation routine for ET failed. Somewhat surprisingly, although it is known to be sensitive to outliers, EL appears to deliver good results in the simulation experiments. It ranks first by MSE in Scenario 3 with $d_g = 5$ and alternates with ET otherwise. These differences become very small with $n = 1{,}000$ and greater.
The conclusion about the inferior performance of CUE in small samples holds true for CUE-based kernel p.d.f. and c.d.f. estimators as well; see Tables 2 and 3, in particular, the ISBs of $\hat f$ and $\hat f_\rho$ with $d_g = 4, 5$ in Table 2. However, the ranking of EL and ET-based kernel p.d.f. and c.d.f. estimators by MISE does not always correspond to the ranking of the underlying EL and ET estimators of $\beta_0$ by MSE. In particular, the sensitivity of EL to outliers adversely affects the estimators $\hat f_\rho$ and $\hat F_\rho$ via the implied probabilities in Scenario 3 with $n = 500$ and greater; see Table 3. ET and CUE perform better in those cases.
Unlike Scenario 1, in Scenario 3 none of the feasible kernel density estimators have smaller MISE than their infeasible counterparts for the sample sizes considered. In Scenario 2, with less complicated distributional features, these estimators do achieve a reduction in MISE with $d_g = 4, 5$. The same is true for the feasible kernel c.d.f. estimators in Scenario 2 with $d_g = 3, 4, 5$, and more often than not in Scenario 3 as well, with the few exceptions mentioned above. Importantly, it is generally beneficial to increase the number of moment conditions beyond those necessary to identify the parameters except when stability of GEL estimators of $\beta_0$ is likely to deteriorate. Finally, the benefits of re-weighting are present, but not universal, and, as expected, are quite small; cf. Supplement B: Example B.4.

Summary and Conclusions
Large sample results and simulation evidence reported in this paper suggest that it is generally sensible to apply either the standard or re-weighted kernel estimators to estimate the p.d.f. or c.d.f. of a scalar residual $u(z, \beta_0)$ in a variety of situations, provided the error associated with the estimation of $\beta_0$ satisfies some mild regularity conditions and care is taken to ensure the bandwidth is not too small. If the assumptions on $u(z, \beta)$ prove difficult to verify in practice, using fourth or higher order kernels and the corresponding asymptotically optimal bandwidths will generally assist with ensuring the appropriate regularity conditions hold.
Incorporating information from overidentifying moment conditions by re-weighting the estimators using GEL implied probabilities offers efficiency gains which are realised in regular situations. However, if the model is highly nonlinear and the distribution of the data is heavy-tailed or contaminated with outliers, the methods proposed in this paper, including GEL, should be applied with some caution in very small samples. Robustified hybrid estimators such as the exponentially tilted empirical likelihood, see, e.g., Schennach (2007), may prove useful in these circumstances.
While the results in this paper were presented only for the scalar-valued u(z, β), generalisations to the vector case are relatively straightforward provided an analogue of the bijection Assumption 3.1 holds.
An issue for future research to usefully address is the construction of tests for overidentifying moment conditions or parametric restrictions based on the differences between the kernel p.d.f. estimators $\hat f_\rho$ and $\hat f$, or $\tilde f_\rho$ and $\tilde f$ for known $\beta_0$. Test statistics of the Bickel-Rosenblatt type based on the integrated squared difference $\int (\hat f_\rho(u) - \hat f(u))^2\,du$, Bickel and Rosenblatt (1973), Fan (1994, 1998), or the integrated absolute difference, Cao and Lugosi (2005), would be of interest. Alternatively, Kolmogorov-Smirnov or Cramér-von Mises-type tests could be constructed based on the differences between kernel c.d.f. estimators.
Throughout the Appendix, $0 < C < \infty$ and $0 \le \omega \le 1$ will denote generic constants that may be different in different uses. CS, T, and H refer to the Cauchy-Schwarz, triangle, and Hölder inequalities, respectively, with LIE and WLLN the law of iterated expectations and Khintchine's i.i.d. weak law of large numbers. MVT is the mean value theorem.

Supplement A to "Improved Density and Distribution Function Estimation": Proofs
In addition, int(·) denotes the interior of ·, w.p.(a.)1 with probability (approaching) 1, and N is an open neighbourhood of β 0 .
Assumption A.2 is Newey and Smith (2004, Assumption 2). If Assumptions A.1 and A.2 hold then P )); see Newey and Smith (2004, Theorem 3.2). Let ∇ 2 g(z, β) denote a vector of all distinct second order partial derivatives with respect to β.
is four times differentiable with Lipschitz fourth derivative in a neighbourhood of zero.
Let a(z) denote a real scalar function of z such that E[a(z) 2 ] < ∞. Write a i = a(z i ), i = 1, . . . , n.
Proof. The first result follows from the expansion for $\hat\pi_i$ in Lemma A.1. In particular, noting $E[g_i] = 0$ and $E[a_i o_p(n^{-1})] = o(n^{-1})$ by uniformity of $o_p(n^{-1})$, then, by independence,
uniformly $i = 1, \ldots, n$. Lemma A.2 remains valid with $\Omega^{-1}$ replacing $P$.
at every continuity point y of f ; if f is uniformly continuous, then convergence is uniform. Under the same conditions lim b↓0 | Remark A.1. If k is Hölder continuous with exponent 0 < τ ≤ 1 and, thus, uniformly continuous, and absolutely integrable, then it is bounded.

A.3 Proofs of Theorems
By Corollary A.1 and Owen (1990, Lemma 3),
Under Assumption 3.2(a)(i), $E[k_b(u - u_i)] = f(u) + o(1)$. Invoking Assumption 3.1 and the change of variables $z \to (u, v')'$, then, by LIE and Lemma A.
The final result is a direct consequence of Lemma A.2 and the same argument.
The same or similar terms appear in the expansions for the variance of $\hat f$ in other contexts (the $O(n^{-1})$ bias terms tend to be ignored as their contribution to MISE is $o(n^{-1})$); cf. Muhsal and Neumeyer (2010, eq. (3.5)). As the next example demonstrates, these same effects appear in a large class of parametric moment condition models.

Example B.3 (GEL With A Constant And Zero Mean Restriction)
Consider GEL estimation based on moment indicator functions of the form $g(z, \beta) = u(z, \beta)\alpha(w)$ where $u(z, \beta)$ is scalar, $\beta$ a $d_\beta$-vector of parameters, and $\alpha(w)$ a $d_g$-vector of functions of $w$. Suppose that $u(z, \beta_0)$ is independent of $w$, Assumption 3.1 holds, and the moment condition $E[g(z, \beta_0)] = 0$ includes the restriction $E[u(z, \beta_0)] = 0$. Furthermore, it is assumed that $u(z, \beta)$ contains a constant; the inclusion of an explicit constant is not essential as the results here continue to hold if $E[\partial u(z, \beta_0)/\partial\beta' \mid w]\gamma = c$ for some non-zero vector $\gamma$ and scalar $c$, in which case $E[\alpha(w)] = G\gamma/c$. Without loss of generality let $\alpha_1(w) = 1$ and $\partial u(z, \beta_0)/\partial\beta_1 = -1$.
Remark B.2. Figure B.3 shows the values of the above quantities and the overall effect on the integrated variance for selected values of q and ν > 2; note that the validity of asymptotic expansions requires ν > 4, but variance is defined for ν > 2. While the main reduction in variance is still due to the zero mean restriction as in Example B.3 (Panels A and B), there are small additional gains due to re-weighting (Panel C). The latter do increase as more moment conditions are added.