Nonparametric estimation of accelerated failure-time models with unobservable confounders and random censoring

Abstract: We consider nonparametric estimation of an accelerated failure-time model when the response variable is randomly censored on the right, and the regressors are not mean independent of the error component. This dependence can arise, for instance, because of measurement error. We achieve identification and conduct estimation using a vector of instrumental variables. Censoring is independent of the response variable given the instruments. We consider settings in which the regressors are continuously distributed. However, the instruments may or may not be continuous, and we show how various independence restrictions allow us to identify and estimate the unknown function of interest depending on the nature of the instruments. We provide rates of convergence of our estimator and showcase its finite sample properties in simulations.


Introduction
We consider identification and estimation of the following nonparametric accelerated failure-time (AFT) model in log form T = ϕ(Z) + U, (1.1) when the conventional assumption that E(U | Z) = 0 no longer holds due to potential dependence between U and Z. This dependence can arise for several reasons: measurement error, omitted variables, or simultaneity. For instance, when analyzing the effect of systolic blood pressure as a possible risk factor for developing cardiovascular diseases, measurement error is often an issue due to time constraints and other unobservable factors in routine care. This measurement error can bias, in an unknown direction, statistical evaluations of this effect [10].
Similarly, recent studies have tried to uncover the relation between unemployment status and body mass index (BMI) to explain the elevated morbidity and mortality among job seekers [see 43, among others]. In particular, one may be interested in understanding how BMI affects spells of unemployment duration. However, it is plausible that there are individual characteristics, unobserved to the statistician, that may determine both the length of unemployment spells and the subject's physical well-being. This simultaneity issue may render estimators based on the standard assumption that E [U | Z] = 0 inconsistent. Moreover, T is often not fully observed, and we assume it is subject to random right censoring, C. In particular, we consider a setting in which one observes Y = T ∧ C and δ = 1I (T ≤ C).
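The observation scheme can be illustrated with a small simulation. The sketch below is ours, not the paper's: the functional forms, the confounder structure, and the censoring rule are illustrative choices that satisfy model (1.1) with an endogenous regressor and random right censoring that is independent of T given the instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# An unobserved confounder e drives both the regressor Z and the error U,
# so E(U | Z) != 0; the instrument W shifts Z but is independent of U.
e = rng.normal(size=n)
W = rng.uniform(0.0, 1.0, size=n)
Z = 0.8 * W + 0.4 * e + 0.2 * rng.normal(size=n)
U = 0.5 * e

def phi(z):
    # illustrative structural function (unknown to the econometrician)
    return np.sin(np.pi * z)

T = phi(Z) + U                       # log duration, model (1.1)

# Random right censoring of the form C = psi(W, nu), with nu independent
# of (T, Z, W), so that T and C are independent given W.
C = 0.5 + W + rng.exponential(1.0, size=n)

Y = np.minimum(T, C)                 # observed response Y = min(T, C)
delta = (T <= C).astype(int)         # censoring indicator delta = 1I(T <= C)
```

Only (Y, delta, Z, W) would be available to the statistician; (T, C, U) are latent.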
Our analysis focuses on the regression function ϕ when the regressor Z is restricted to have a continuous distribution with respect to the Lebesgue measure. To achieve identification and carry out estimation, we rely on a vector of instrumental variables, W, which is taken to satisfy some independence restrictions with respect to the error term. In the examples provided above, plausible instrumental variables are given by the same measurement run on parents or relatives.
For instance, parents' BMI is often used as an instrument for an individual's own BMI [see, e.g., 51, among others]. Similarly, genetic markers can be used as a source of exogenous variation to help the identification of causal effects in these settings [69].
In this paper, we consider that the censoring variable C is independent of T given the exogenous instruments, W . We thus exclude unobservable factors that affect both T and C simultaneously. Upon additional exclusion restrictions that we discuss below, we can recover the unknown function ϕ.
We consider two settings, depending on the properties of W. In the first setting, we take W to be mean independent of the error term, i.e., E[U | W] = 0. We recover the regression function ϕ by solving the integral equation

E(V | W) = E(ϕ(Z) | W), (1.2)

where V is an appropriate transformation of the censored response Y [see, e.g., 48, 66, among others]. Identification in this setting requires, among other things, W to be continuously distributed. One can then recover an estimator of ϕ by replacing population objects in equation (1.2) with sample counterparts.

In a second setting, we consider the stronger assumption that U ⊥⊥ W, i.e., W is independent of U, and E(U) = 0. Independence is equivalent to

S_{U|W}(u | w) = S_U(u), for all (u, w),

where S_{U|W}(u | w) and S_U(u) are the conditional and unconditional survivor functions of U, respectively [14, 26, 27]. Both survivor functions can be consistently estimated using the conditional and unconditional Kaplan-Meier estimators to take censoring into account [22, 46, 68]. Upon additional conditions that guarantee the consistency of the product-limit estimator, censoring can be handled very easily in this setting, and it does not require any major modification of the estimation procedure. In parametric models, the assumption of independence is often justified by efficiency considerations: one can decrease the variance of a parametric estimator by taking one step towards the Maximum Likelihood Estimator [63, 65]. In our nonparametric setting, the stronger assumption of independence instead allows us to relax the relevance condition and accommodate settings in which W may only be binary or discrete.
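The role of the transformed response V can be previewed numerically. The sketch below assumes a Koul-Susarla-Van Ryzin type transformation, V = δY / S_C(Y), with a censoring survivor known in closed form (an illustrative assumption on our part; in the paper the conditional survivor S_{C|W} is estimated). It checks that V, although computable from censored data alone, has the same mean as the latent T.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Toy design with a KNOWN censoring survivor: T ~ Uniform(0, 2) plays the
# role of the (log) duration, C ~ Exp(1) is an independent censoring time,
# so S_C(t) = exp(-t) in closed form.
T = rng.uniform(0.0, 2.0, size=n)
C = rng.exponential(1.0, size=n)
Y = np.minimum(T, C)
delta = (T <= C).astype(float)

# Transformed response: V = delta * Y / S_C(Y). V is computable from the
# censored data (Y, delta) alone, yet E(V) = E(T) = 1 here.
V = delta * Y / np.exp(-Y)
```

In the paper, the same logic runs conditionally on W, which is why equation (1.2) replaces E(T | W) with E(V | W).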
In both cases, an additional technical difficulty for implementation and for the derivation of the asymptotic properties is that the resulting estimators are the solution, respectively, of a linear and a nonlinear ill-posed inverse problem. Hence, besides the smoothing step, which is common in nonparametric regressions, we have a further regularization step [see 28,45].
In this paper, we consider regularization through the Landweber-Fridman (LF) approach. In finite samples, the relative advantage of LF regularization is that it iteratively approximates the inverse of a conditional expectation operator. Thus, it avoids exact inversion of large matrices necessary, for instance, in Tikhonov regularization. Compared to sieve regularization, it does not require that the unknown function is well approximated by just a few terms of its series expansion [see, e.g., 19].
We provide a detailed explanation of the implementation of the nonparametric IV estimator with LF regularization, and we derive an upper bound on the L² loss. We show that, under our identification assumptions, random right-censoring does not affect the properties of the estimator. That is, the rate of convergence is the same as when the dependent variable is fully observed. When the instrument is mean independent, this result holds upon an appropriate choice of the bandwidth parameter.
The estimator based upon mean independence is a linear estimator and thus relatively straightforward to implement. The L² rates are minimax under weak conditions on the smoothing and regularization parameters [20]. By contrast, the nonlinearity of the estimation procedure based on independence introduces significant theoretical and practical difficulties. The LF procedure relies upon a first-order linear approximation of the nonlinear problem. Hence, regularity conditions only hold in the vicinity of the true solution. Moreover, the loss function can be decomposed into two parts: one which is due to estimation and can be handled similarly to the case with mean independence; and another due to the nonlinearity of the inverse problem. The latter determines the convergence rate of our estimator and, in certain instances, can be controlled to reach the same rates as in the linear case. Whether these rates are minimax remains, to the best of our knowledge, an open question.
Related work has considered the estimation of duration models with endogeneity and (possibly random) right-censoring. Frandsen [35] discusses nonparametric identification and estimation of a model with a binary endogenous treatment variable and a binary instrument, independent of the error term [see also 70]. More recently, Beyhum et al. [6] analyze a nonparametric duration model with endogenous treatment. They provide identification and estimation based on an instrumental variable assumption when the outcome is randomly censored on the right. Their estimator is also a solution to a nonlinear inverse problem, although they avoid ill-posedness by restricting the endogenous treatment to be discrete. They also discuss partial identification of the treatment effect when censoring is fixed. Sant'Anna [64] provides a nonparametric test of treatment effect heterogeneity for a binary treatment variable in cases where the treatment is assigned independently of the potential outcome conditional on observables. He assumes that the outcome and the censoring are independent conditional on the treatment, and he allows for the treatment to be endogenous. Our work contributes to this literature by allowing the treatment variable to be continuous and potentially endogenous. However, all these papers allow the effect of the treatment to be heterogeneous, which is ruled out by the additive separability of our model. Accommodating heterogeneous effects would be an interesting contribution, which we defer to further research.

Framework
We consider a random element (T, Z, W) with T ∈ R, Z ∈ R^p, and W a q-dimensional random vector, with q ≥ p. We let F denote the joint distribution of the random vector (T, Z, W); and L²_Z or L²_W the spaces of functions of Z or W, respectively, that are square-integrable with respect to F. Depending on the setting, we will impose additional restrictions on the distribution of (Z, W). We analyze the model with δ = 1I(T ≤ C), and C ∈ R. We maintain the following assumption.
This assumption allows any relation between the unobserved response and the censoring variable to operate only through observable components. For instance, the restriction in Assumption 2.1 holds when the censoring variable can be written as C = ψ(W, ν), with ν ⊥⊥ (T, Z, W). When Z is exogenous, that is, W = Z, Assumption 2.1 reduces to the standard exclusion restriction commonly imposed in AFT regressions with random censoring [see 48].
In the following, we let S_{·|W}(· | w) denote the survivor function conditional on W. Our identification strategy is based on the following assumption on S_{C|W}(· | w).

Assumption 2.2.
Let T be the support of T, such that sup_{t∈T} |t| = T_0 < ∞. For every w, we have that S_{C|W}(T_0 | w) > ε, for a constant ε > 0. Assumption 2.2 implies that the supremum of T is uncensored with positive probability. This condition is relatively standard in this literature, and, to the best of our knowledge, point identification of the parameters of interest is not possible without a similar restriction. For instance, when T represents unemployment spells and a randomly assigned interview determines censoring, Assumption 2.2 implies that the longest duration of a spell is finite and that interviews can be conducted late enough to guarantee that at least some of the individuals with the longest spell are interviewed after they have found employment. If Assumption 2.2 does not hold, one can only hope to identify ϕ for those t which are in the interior of the support of C [6]. Assumption 2.2 is violated, in particular, when censoring is fixed (see Remark 1 below).

Case 1: U is mean independent of W
We first treat the case in which E(U | W ) = 0 and the joint distribution of (T, Z, W ) is absolutely continuous with respect to the Lebesgue measure.
Define the random variable V = δY / S_{C|W}(Y | W), where S_{C|W} is the survivor function of C conditional on W. We have Y = T whenever δ = 1, and, under Assumption 2.1, E(V | W) = E(T | W). To achieve identification of the function ϕ, we consider the following assumption.
This completeness condition is an untestable assumption for identification in nonparametric instrumental regressions. The terminology is in analogy with the notion of a complete statistic [see, e.g., 52], and it is sometimes referred to as a strong identification condition [see, e.g., 31]. When the pair (Z, W) is continuously distributed, Andrews [3] has derived a class of distributions for which completeness holds generically, in a sense defined within that paper. Some additional results on completeness that rely on stronger restrictions on the DGP are provided in D'Haultfoeuille [25]. When completeness fails, Babii and Florens [4] and Florens et al. [32] show that the estimator may still converge to the minimal-norm solution.
Under the conditions above, we have the following proposition.
Remark 1 (Identification with fixed censoring). When censoring is fixed, one could achieve point identification of the function ϕ (up to location) as in [15].

Case 2: U is independent of W
We now turn to the case when U ⊥ ⊥ W with E(U ) = 0.
The joint distribution of (T, Z) is still restricted to be absolutely continuous with respect to the Lebesgue measure. However, we do not impose any condition on the distribution of W . Therefore, we can identify the function ϕ with purely discrete instruments. Our presentation is largely based on Centorrino et al. [14], Dunker [26], and Dunker et al. [27], who consider identification and estimation in a similar model without random censoring.
We rewrite the independence condition in terms of a function F which, roughly speaking, is a survivor function in terms of t and behaves like the negative of a density as a function of z. The independence restriction therefore implies that

S_{U|W}(u | w) = S_U(u), for all (u, w). (2.5)

We notice that, conceptually, nothing changes compared to the case when T is fully observed, at least for identification purposes. The error term U still has a well-defined (conditional and unconditional) survivor function. The main difference with the existing approach arises in estimation, where the standard nonparametric estimators are replaced with Kaplan-Meier estimators. Equation (2.5) may be written as

A(ϕ) = 0, with A(ϕ)(u, w) = S_{U|W}(u | w) − S_U(u),

which defines a nonlinear integral equation of the first kind, where ϕ† is its true solution.
Identification of ϕ † is more complex in this context. In particular, given the nonlinear nature of the integral equation, we have to consider both conditions for global and local identification. We focus here on the latter that are easier to derive and are more easily interpretable. Interested readers can refer to Centorrino et al. [14], Chernozhukov and Hansen [21], and Fève et al. [30], for a discussion of global identification conditions in this context.
Our discussion of local identification is based on the linearization of the operator A(·). We provide mild sufficient conditions such that its Fréchet derivative exists and it is well-behaved. This discussion of local identification will lead us to impose, among others, a condition that is similar to the one in Assumption 2.3.
We start by assuming the following. Under the conditions in Assumption 2.4, the nonlinear operator A is Fréchet differentiable, where A_ϕ(φ̃) denotes the derivative of A at ϕ as a linear function of the direction φ̃. The nonlinear operator A is defined on the centered functions in L²_Z and valued in L²_{U×W}. The mean of ϕ can be identified by noticing that E(T) = E(ϕ(Z)), as long as E(U) = 0. Therefore, we restrict our attention to the space of L² centered functions of Z, without loss of generality.
The Fréchet derivative A_ϕ operates between L²_Z and L²_{U×W} and, under the conditions in Assumption 2.4, is a continuous linear operator for any ϕ. Under additional minor regularity conditions, it is also a Hilbert-Schmidt operator, and thus compact and bounded [see 11].
We let D(A) be the domain of A and, for some finite constant R > 0, let B_R(ϕ†) = {ϕ ∈ D(A) : ‖ϕ − ϕ†‖ ≤ R} be the ball of radius R around the true solution. We have the following definition.
Fréchet differentiability of the operator helps control the behavior of our nonlinear problem in the vicinity of the true solution. We further take the Fréchet derivative to be injective, which is tantamount to a rank condition on A_{ϕ†} [see 17, Assumption 1, p. 788]. Finally, we need to restrict the amount of nonlinearity allowed in the ill-posed inverse problem at hand [see 17, Assumption 2, p. 789]. The last condition can be proven using uniform boundedness of the first derivative of the conditional and marginal pdfs with respect to their first argument; we omit the proof for brevity. The last statement also follows from a Lipschitz continuity condition on A_{ϕ†} [26].

Assumption 2.5 (Conditional completeness). Let ϕ ∈ E. Then
Assumption 2.6 (Scalability). For every ϕ ∈ B_R(ϕ†), we can write the following. The conditional completeness condition in Assumption 2.5 is, under independence, immediately implied by the completeness condition in Assumption 2.3, where equalities are intended almost surely. However, the converse is not true in general: conditional completeness does not imply completeness. Assumption 2.5 implies condition (ii) in Definition 2.1. To further clarify its role, we consider the following example from Dunker et al. [27]. Assumption 2.6 of scalability is proven in Centorrino et al. [14], under more primitive conditions on the conditional expectation operator. This assumption implies condition (iii) in Definition 2.1 [see also 27, 39].

Example 1. Let us assume that
We thus revisit the conditions in Definition 2.1 to obtain the following.

Estimation
We observe an IID sample {(Y i , δ i , Z i , W i ), i = 1, . . . , n} from the joint distribution of the random vector (Y, δ, Z, W ). We take the supports of Z and W , when continuous, to be compact. In particular, we restrict the support of (Z, W ) to be the unit hypercube of dimension p + q, without loss of generality. One could allow for the support of the data to be unbounded. However, this substantially complicates the proof [see 9, for convergence results in sieve Nonparametric IV when the data have unbounded support].
In the following, we also let K(v) be a standard univariate kernel function, such as the Gaussian or the Epanechnikov kernel. We also let K_h(·) = K(h⁻¹·) and, for a scalar bandwidth h, K_h(v) = ∏_j K_h(v_j), the standard product kernel.

Below we focus mainly on the practical implementation of our estimation procedure. As in any other nonparametric framework, we face the issue of selecting smoothing parameters. However, in nonparametric regressions with instrumental variables, we come across two kinds of 'smoothing parameters': the bandwidths used for the kernel estimates; and N, the number of iterations, used to regularize the ill-posed inverse problem. Taken separately, these two problems are standard, and several adaptive rules have been proposed. In the nonparametric instrumental regression setting, the bandwidths and the regularization parameter compensate for one another, and there likely exists a set of jointly 'optimal' choices for these two elements. However, this is a topic we do not tackle in this paper [see 12, for additional results on the selection of the tuning parameters in linear ill-posed inverse problems]. Below we thus consider data-driven procedures for the choice of the tuning parameters that, although not optimal in the sense of oracle minimization of a given risk function, behave reasonably well in practice.
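The product-kernel convention above can be coded directly; note that in this notation K_h(·) = K(h⁻¹·) carries no 1/h normalization. The helper below is a minimal sketch with a Gaussian K.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard univariate Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def product_kernel(v, h):
    """Product kernel K_h(v) = prod_j K(v_j / h), scalar bandwidth h.

    Follows the text's convention: no 1/h factor is included.
    """
    v = np.atleast_1d(v)
    return np.prod(gaussian_kernel(v / h), axis=-1)

# Weight of an observation at distance (0.1, -0.2) from the evaluation
# point, with bandwidth h = 0.5.
w = product_kernel(np.array([0.1, -0.2]), h=0.5)
```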

Case 1: U is mean independent of W
The estimation procedure is based on equation (2.2), where the conditional survivor function S_{C|W} is replaced by a generalization of a Kaplan-Meier type estimator [48, 66]; and on equation (1.2). To estimate the former, we follow the approach of Beran [5] [see also 22, 37, 68]. The latter can be cast as a linear integral equation of the first kind [49]. Beran's (1981) estimator of the conditional survivor function can be written as

Ŝ_{C|W}(t | w) = ∏_{i: Y_i ≤ t} [ 1 − (1 − δ_i) B_{n,i}(w) / Σ_{j=1}^n 1I(Y_j ≥ Y_i) B_{n,j}(w) ], (3.1)

where the B_{n,i}(w) are kernel weights with bandwidth parameter h_S. This estimator reduces to the standard Kaplan and Meier [46] estimator when the weights are all equal to n⁻¹. We provide conditions for the strong uniform consistency of this estimator in Section 4.

Further, let A be the conditional expectation operator A : L²_Z → L²_W, with (Aϕ)(w) = E[ϕ(Z) | W = w], and let A* : L²_W → L²_Z be the adjoint of the operator A [see 24, among others]. This notation allows us to express equation (1.2) as

r = Aϕ, with r(w) = E(V | W = w). (3.2)

Assumption 2.3 implies that the operator A is injective and therefore invertible. Under this condition, a unique solution to this problem exists, as shown in Proposition 2.1. However, the solution obtained by inverting the conditional expectation operator directly is not stable, and we are therefore faced with a linear ill-posed inverse problem. Heuristically, one could interpret the problem stated in equation (3.2) as a system of equations in which the (infinite-dimensional) matrix A is singular [13]. As discussed in the introduction, we explore the properties of our estimators using a Landweber-Fridman regularization approach.
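A minimal sketch of a Beran-type conditional product-limit estimator, under our reading of Beran's (1981) formula (the local constant Gaussian weights and the tie-free sorting are our own illustrative choices). To estimate the censoring survivor S_{C|W}, the event indicator is 1 − δ, since C is observed exactly when δ = 0; with equal weights the estimator collapses to the ordinary Kaplan-Meier estimator.

```python
import numpy as np

def beran_survivor(t, w, Y, event, W, h):
    """Kernel-weighted product-limit (Beran, 1981) estimator of S(t | w).

    Pass event = 1 - delta to estimate the censoring survivor S_{C|W}.
    Local constant Gaussian weights with bandwidth h; ties in Y are
    assumed away for simplicity.
    """
    k = np.exp(-0.5 * ((W - w) / h) ** 2)       # kernel weights
    B = k / k.sum()                             # normalized weights B_i(w)
    order = np.argsort(Y)
    Ys, ev, Bs = Y[order], event[order], B[order]
    at_risk = Bs[::-1].cumsum()[::-1]           # total weight with Y_j >= Y_i
    S = 1.0
    for yi, ei, bi, ri in zip(Ys, ev, Bs, at_risk):
        if yi > t:
            break
        if ei == 1:
            S *= 1.0 - bi / ri
    return S
```

With all W_i equal, every B_i(w) is 1/n and the recursion reproduces the textbook Kaplan-Meier product.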
The intuition underlying this regularization method is as follows. Equation (3.2) can be equivalently written as

A*Aϕ = A*r.

With simple algebra, one can show that the last identity also implies

ϕ = ϕ + cA*(r − Aϕ),

where c is an arbitrary constant which satisfies c‖A*A‖ < 1, with ‖·‖ being the operator norm. The solution ϕ thus needs to satisfy the recursive identity

ϕ_k = ϕ_{k−1} + cA*(r − Aϕ_{k−1}).

An exact solution for ϕ would be given by the infinite sum

ϕ = c Σ_{j=0}^{∞} (I − cA*A)^j A*r.

A regularized solution is obtained by stopping this infinite sum after N terms:

ϕ_N = c Σ_{j=0}^{N−1} (I − cA*A)^j A*r. (3.4)

Similarly, equation (3.4) can be expressed recursively as

ϕ_k = ϕ_{k−1} + cA*(r − Aϕ_{k−1}), k = 1, . . . , N, (3.5)

with ϕ_0 = 0. The regularized estimator of ϕ is obtained by replacing r, A, and A* in equation (3.5) with consistent nonparametric estimators, and by using a stopping rule to determine the total number of iterations, N. Equation (3.4) is the solution of a linear optimization problem. We start iterating from N = 1, with ϕ_1 = cA*r. For k = 1, 2, 3, . . . , the iterative scheme converges towards the true solution as long as c‖A*A‖ < 1. This condition implies that our iterative scheme is a contraction. Notice that ‖A*A‖ = ‖A‖² = 1, as A is a conditional expectation operator, and its norm is equal to 1. Therefore, any c < 1 guarantees convergence of our iteration scheme. Besides this restriction, the specific choice of c does not matter for our purposes, and the solution is insensitive to it. As in Engl et al. [28, p. 155], one can equivalently rewrite the problem in terms of the scaled operator A_c = c^{1/2}A, in which case convergence requires ‖A*_c A_c‖ < 1, which is the same condition as above. Values of c closer to the upper bound result in larger steps and fewer iterations for convergence. By contrast, if c is close to 0, the number of iterations can be extraordinarily large and, albeit precise, reaching the solution would require greater computational time. In our numerical experiments and empirical application, we use c = 0.5.
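The recursion (3.5) is easy to sketch on a discretized toy problem, where a row-normalized smoothing matrix stands in for the conditional expectation operator A and the matrix transpose for its adjoint (both illustrative stand-ins, not the estimators of this section).

```python
import numpy as np

m = 50
z = np.linspace(0.0, 1.0, m)

# Toy 'conditional expectation' operator: a row-normalized Gaussian
# smoothing matrix on a grid (a stand-in for A).
K = np.exp(-0.5 * ((z[:, None] - z[None, :]) / 0.1) ** 2)
A = K / K.sum(axis=1, keepdims=True)

phi_true = np.sin(2.0 * np.pi * z)
r = A @ phi_true                       # noiseless 'regression' r = A phi

# Landweber-Fridman: phi_k = phi_{k-1} + c A*(r - A phi_{k-1}), phi_0 = 0.
c = 0.5 / np.linalg.norm(A, 2) ** 2    # guarantees c ||A*A|| < 1
phi = np.zeros(m)
for _ in range(2000):                  # fixed N; no stopping rule here
    phi = phi + c * A.T @ (r - A @ phi)
```

With noisy data, iterating forever would amplify the noise on the small singular values of A, which is exactly why the number of iterations N acts as the regularization parameter.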
Below we outline the practical implementation of our estimator.
1. We compute the kernel-weighted estimator of S_{C|W}(y | w), Ŝ_{C|W}(y | w), as in equation (3.1), using local constant weights and bandwidth parameter h_S. We then construct the dependent variable V̂_i = δ_i Y_i / Ŝ_{C|W}(Y_i | W_i).

2. For the estimation of r(w) = E(V | W = w), and for all the other population objects hereafter, we advocate using local polynomial regressions. While our asymptotic properties are developed using generalized kernels [see 61] to control the behavior of the estimator at the boundaries of the support, these are seldom used in practice. Local polynomial regressions are simpler to implement and do not have any boundary effects [29]. To simplify our exposition and without loss of generality, we consider local linear fitting. Let V̂ be the n × 1 vector of the generated dependent variable; K_{W,h_W}(w), the n × n diagonal matrix of kernel weights at the point w, where h_W is a bandwidth parameter; and W(w) an n × 2 matrix with i-th row equal to (1, W_i − w). We write

r̂(w) = e_1' (W(w)' K_{W,h_W}(w) W(w))⁻¹ W(w)' K_{W,h_W}(w) V̂ =: M(w) V̂,

with e_1 = (1, 0)' and M(w) a 1 × n vector.

3. Next, we estimate the two conditional expectation operators, A and A*. Both operators are linear and can therefore be approximated by linear smoothers. To construct an estimator of A using local linear regressions, we stack the vectors M(W_i), for all i = 1, . . . , n, into a matrix of dimension n × n, so that Â = [M(W_1)', . . . , M(W_n)']'. As above, we let K_{Z,h_Z}(z) denote the diagonal matrix of kernel weights with bandwidth parameter h_Z and Z(z) a matrix with i-th row equal to (1, Z_i − z), and we construct Â* analogously from the local linear smoother of the regression on Z.

4. Given estimators of r, A and A*, we start our iteration scheme from φ̂_1 = cÂ*r̂. We compute each subsequent iteration as φ̂_k = φ̂_{k−1} + cÂ*(r̂ − Âφ̂_{k−1}).

5. To determine when to stop iterating, we adopt the cross-validation criterion developed in Centorrino [12]. We compute the leave-one-out version of φ̂_k, denoted φ̂_{k,−i}, and evaluate the resulting stopping function for k = 1, 2, . . . . The typical shape of this function can be observed in Figure 1 (computed for one draw from the simulated DGP in Section 5, with n = 500 and optimal stopping iteration N = 12).
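The projection vector M(w) used in step 2 can be sketched as follows. A useful sanity check, exploited in the test, is that local linear weights reproduce linear functions exactly: M(w) applied to a + bW returns a + bw.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
W_obs = rng.uniform(0.0, 1.0, size=n)

def local_linear_weights(w, W_obs, h):
    """Projection vector M(w): the local linear fit at w is M(w) @ y."""
    k = np.exp(-0.5 * ((W_obs - w) / h) ** 2)               # diag of K_{W,h_W}(w)
    X = np.column_stack([np.ones(len(W_obs)), W_obs - w])   # design matrix W(w)
    XtK = X.T * k                                           # W(w)' K
    return np.linalg.solve(XtK @ X, XtK)[0]                 # e_1'(W'KW)^{-1}W'K

# Local linear smoothers reproduce linear functions exactly:
y = 2.0 + 3.0 * W_obs
M = local_linear_weights(0.37, W_obs, h=0.1)
fitted = M @ y     # equals 2 + 3 * 0.37 up to numerical error
```

Stacking the rows M(W_1), …, M(W_n) gives the n × n matrix Â of step 3.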
Equation (3.5) involves unknown density, distribution, and conditional mean functions, which are consistently estimated using locally weighted kernel approaches. We employ Gaussian kernels and select the bandwidth parameters, {h S , h W , h Z }, by Silverman's rule-of-thumb. This procedure delivers a consistent estimator of the unknown function ϕ.
Remark 2 (Additional Confounders). In many empirical settings, it is common to have additional observable confounders, X ∈ R^d, which may be continuous or discrete and should be included in the regression model. The statistical model is T = ϕ(Z, X) + U, and Assumptions 2.1 and 2.2 must now hold conditional on (W, X). Because of theoretical considerations that will be explained in more detail below, it is not possible to modify the definition of the operators to include the additional exogenous variables. However, as explained in Hall and Horowitz [38], we can obtain an estimator of the function for every fixed value of X_i and then smooth it with respect to it. That is, φ̂(z, x) = Σ_{i=1}^n M_i(x) φ̂(z; X_i), where M_i(·) is the i-th element of a mixed kernel projection vector [54], which depends on an additional bandwidth, h_X, and is defined as above.

Case 2: U is independent of W
Estimation in the independent case proceeds similarly as above.
An estimator of the conditional cdf of the error term can be obtained using Beran's (1981) approach as in equation (3.1). An estimator of the unconditional survivor function can instead be obtained using a smoothed version of the simple Kaplan-Meier estimator.
The Landweber-Fridman estimator of ϕ† is based on a recursive definition, as above:

φ̂_{k+1} = φ̂_k − c Â*_{φ̂_k} Â(φ̂_k), k = 0, 1, 2, . . . , N − 1,

where k is an integer iteration index and N > 0 is the total number of iterations; Â(φ̂_k) is an estimator of A(ϕ) computed at the point φ̂_k; Â*_{φ̂_k} is an estimator of the adjoint operator of the Fréchet derivative; and c < 1 is a strictly positive constant that determines the size of the step between consecutive iterations.
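The nonlinear recursion can be illustrated on a finite-dimensional toy problem with a known Fréchet derivative. The operator below is an arbitrary smooth map of our own choosing, not the survivor-difference operator A of Section 2, and, as the text stresses, only local convergence from a starting point near the true solution can be expected.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 20
B = rng.normal(size=(m, m)) / np.sqrt(m)
x_true = 0.5 * rng.normal(size=m)
y_star = B @ x_true + 0.1 * x_true**3     # chosen so that A_op(x_true) = 0

def A_op(x):
    """Toy nonlinear operator (NOT the survivor-difference operator)."""
    return B @ x + 0.1 * x**3 - y_star

def jacobian(x):
    """Frechet derivative of A_op at x, here an ordinary Jacobian matrix."""
    return B + np.diag(0.3 * x**2)

# Nonlinear Landweber: x_{k+1} = x_k - c * A'(x_k)^T A(x_k),
# started in a neighbourhood of the true solution.
x = x_true + 0.1 * rng.normal(size=m)
c = 0.5 / np.linalg.norm(jacobian(x), 2) ** 2   # conservative step size
res0 = np.linalg.norm(A_op(x))
for _ in range(5000):
    x = x - c * jacobian(x).T @ A_op(x)
```

Each step is a gradient step on ‖A(x)‖²/2, which is why the step constant must be scaled by the squared norm of the derivative.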
An additional step for the implementation is to derive a closed-form expression for the adjoint of the Fréchet derivative. From some elementary computations, one obtains such an expression, in which E_W denotes the expectation taken with respect to the marginal distribution of W. Let us now describe the practical implementation of this algorithm. In the following, we let T̄ be the estimator of the mean of T obtained by integrating the uncensored observations with respect to the empirical Kaplan-Meier distribution.
• We select an initial value ϕ_0. Different choices of the initial condition are possible. We may take ϕ_0 equal to the nonparametric estimator of the conditional expectation of T given Z, obtained as in Dabrowska [22]. This is not a consistent estimator if Z is endogenous, but in many cases the endogeneity bias is not too strong, and E(Y | Z) may be a reasonable starting value. Another possible choice is to solve the linear problem E(V | W) = E(ϕ(Z) | W), as detailed above. If ϕ† is identified under the mean independence condition, this solution is a consistent estimator, and we conjecture that imposing the independence restriction should improve the properties of this estimator. If ϕ† is under-identified, this estimation gives an approximation [see 4, 32]. Finally, one could use a linear or nonlinear parametric instrumental variable estimator.

• At each iteration k ≥ 0, we compute the estimated centered residuals, Û_{ki}, where T̄ is the sample mean of the random variable T estimated from the censored observations Y = T ∧ C [66]. Notice that this location normalization of the residuals correctly identifies the location of the function ϕ, under the assumption that E(U) = 0. Â(φ̂_k) can be taken to be the difference between the conditional product-limit estimator of the distribution of U given W and the unconditional product-limit estimator of the distribution of U. That is, we let

Â(φ̂_k)(u, w) = Ŝ_{Û_k|W}(u | w) − Ŝ_{Û_k}(u).

If W is discrete, the conditional cdf may be computed by sorting with respect to the different (finite) values of W, allowing us to reach faster convergence rates. We provide a more detailed description of the latter case in Section 5. The adjoint operator is estimated with tuning parameter h_Z and with f̂_{Û_k}, a nonparametric density estimator of the residuals at iteration k, whose construction is discussed in more detail below. Bandwidth parameters are chosen by Silverman's rule-of-thumb. Finally, we take c = 0.5, as discussed above.
• An important component in the construction of the estimator of the adjoint operator is f̂_{Û_k}(Û_{ki}), the nonparametric estimator of the density of the error term. As our observations are right-censored, we follow the approach in Marron and Padgett [57] and Mielniczuk [59], and use the estimator

f̂_{Û_k}(u) = h⁻¹ Σ_{i=1}^n K_h(u − Û_{ki}) ΔF̂_{Û_k}(Û_{ki}),

where F̂_{Û_k} is the Kaplan-Meier estimator of the distribution of U at iteration k, and ΔF̂_{Û_k}(Û_{ki}) are its finite differences (jumps). The rate of convergence of this estimator is the same as that of the usual nonparametric density estimator under standard assumptions.
• The last point is the choice of the stopping rule. This choice is crucial, as the regularization of the ill-posed inverse problem is provided by the stopping rule. It is common in the mathematical literature to adopt the so-called Morozov discrepancy principle [see 8, 45, 60]. This principle leads one to iterate up to N_0 > 0, the first iteration such that ‖Â(φ̂_{N_0})‖ ≤ τδ, where δ is a noise level that is usually known, and τ is a positive constant which depends on the properties of the known operator A. In our problem, however, we have an additional estimation error because of a nonparametrically generated regressor [56], which may further inflate the variance as N → ∞. We therefore proceed as follows. We fix a maximum number of iterations, N_max, based on the asymptotic theory derived below. We then check the norm of Â(φ̂_k) at each iteration k = 0, 1, 2, . . . , N_max, and take N_0 to be the iteration at which this norm reaches its minimum; otherwise, we take N_0 = N_max. The typical shape of this function can be seen in Figure 2 (computed for one draw from the simulated DGP in Section 5, with n = 500 and optimal stopping iteration N = 27).
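The Kaplan-Meier-weighted density estimator entering the adjoint can be sketched as follows. Here the kernel carries the usual 1/h normalization needed for a density, and, absent censoring, the estimator reduces to a standard kernel density estimate (the property checked in the test).

```python
import numpy as np

def km_jumps(U, delta):
    """Kaplan-Meier jumps dF(U_i): mass assigned to each uncensored point.

    Assumes no ties among the U_i.
    """
    order = np.argsort(U)
    d = delta[order].astype(float)
    n = len(U)
    at_risk = np.arange(n, 0, -1)             # n, n-1, ..., 1
    S = np.cumprod(1.0 - d / at_risk)         # survivor just after each point
    S_prev = np.concatenate([[1.0], S[:-1]])
    jumps = np.zeros(n)
    jumps[order] = S_prev - S                 # censored points get zero mass
    return jumps

def censored_kde(u, U, delta, h):
    """Kernel density of U weighted by the Kaplan-Meier jumps."""
    w = km_jumps(U, delta)
    K = np.exp(-0.5 * ((u - U) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    return np.sum(w * K)

# Without censoring the jumps are all 1/n, so the estimator reduces to a
# standard kernel density estimate.
rng = np.random.default_rng(5)
U = rng.normal(size=50)
delta = np.ones(50, dtype=int)
fhat = censored_kde(0.0, U, delta, h=0.3)
```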

Remark 3 (Additional Confounders). In the presence of additional observable confounders X, the iteration proceeds as in Remark 2 for every fixed value of X_i, with the adjoint operator of the Fréchet derivative modified accordingly.

The final estimator can then be written as in Remark 2, where M_i(·) is defined as before. This approach requires an estimator of the conditional density of U given X at each iteration, which suffers from the curse of dimensionality and may result in very slow rates of convergence. A potential alternative is to use the more restrictive assumption that U ⊥⊥ (X, W) and E(U) = 0, together with a flexible semi-parametric structure.

Rates of convergence
We briefly give the main results on the rates of convergence of our estimators. Our proofs are based on results by Engl et al. [28], Dunker [26], and Centorrino et al. [14].
We start by collecting assumptions that are common across the various frameworks. To clarify our notation, we let K(·, t) be a generalized kernel with correction at the endpoints, as defined in Müller [61], where K_+(·, t) and K_−(·, t) denote its boundary versions.

Case 1: U mean independent of W
For this section, we let r̂ be computed with a multivariate generalized kernel function K_{h_W}(·, ·), as defined above. Estimators of the operators A and A* are constructed as n × n matrices of kernel weights. We also need the following additional assumptions, which, in particular, guarantee that the operator A is injective. Condition (ii) entails that A and A* admit a singular value decomposition, with their singular values having zero as a limit point. This property generates the ill-posedness of the inverse problem defined by equation (3.2). We distinguish two cases: when the singular values of A*A converge to zero at a polynomial rate, we say that the inverse problem is mildly ill-posed; when they converge at an exponential rate, we say that the problem is severely ill-posed. The degree of ill-posedness is related to the smoothness of the joint density of (Z, W). In practice, the smoother the joint density, the more the function ϕ is blurred when integrated with respect to it, and the more difficult the estimation problem becomes. The following example shows how the decay of the singular values of A and A* is related to the joint distribution of (Z, W).
Example 2 (The normal case). Suppose that (Z, W) ∈ R^2 is jointly normal with mean zero and covariance matrix with unit variances and off-diagonal element ρ, with |ρ| < 1. This implies that the conditional distribution of Z given W = w is normal with mean ρw and variance 1 − ρ^2. Therefore, the eigenfunctions associated with the operator A are Hermite polynomials, and its singular values are ρ^j, for j = 0, 1, 2, . . . . As j → ∞, the singular values converge to zero at an exponential rate: in the jointly normal case, the inverse problem is therefore severely ill-posed.

Assumption 4.4 is a smoothness condition on the conditional expectation of V given W, and its second part is tantamount to requiring that V be square-integrable.
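The exponential decay in Example 2 can be checked numerically by discretizing the operator A on a grid; the grid size and ρ = 0.5 below are illustrative choices. After a unitary reweighting by the marginal densities, the singular values of the resulting matrix approximate ρ^j:

```python
import numpy as np

# Discretize the conditional expectation operator A for (Z, W) jointly
# normal with correlation rho, viewed as a map from L2(f_Z) to L2(f_W).
rho = 0.5
m = 400
z = np.linspace(-6.0, 6.0, m)
dz = z[1] - z[0]
f_z = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # standard normal marginal
Zg, Wg = np.meshgrid(z, z)                          # Zg[i, j] = z[j], Wg[i, j] = w[i]
f_zw = np.exp(-(Zg**2 - 2.0 * rho * Zg * Wg + Wg**2) / (2.0 * (1.0 - rho**2))) \
       / (2.0 * np.pi * np.sqrt(1.0 - rho**2))      # joint normal density
# Unitary reweighting: S[i, j] = f_zw / sqrt(f_W(w_i) f_Z(z_j)) * dz
S = f_zw * dz / np.sqrt(np.outer(f_z, f_z))
sv = np.linalg.svd(S, compute_uv=False)
print(sv[:5])   # approximately 1, 0.5, 0.25, 0.125, 0.0625 = rho^j
```

The printed singular values track ρ^j closely, confirming the exponential decay that makes the normal case severely ill-posed.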
Finally, Assumption 4.5, along with the conditions on the kernel function in Assumption 4.1 and the differentiability conditions in Assumption 4.3(i), is used to show the uniform consistency of the nonparametric estimators of the joint and marginal densities of (Z, W), and of the conditional survival function S_{C|W} [22, 68]. We obtain the following result on the estimation of r.
The first part of the proposition gives the rate of convergence for the nonparametric estimator r̂. The dependent variable is estimated using standard kernel regressions, so that a projection argument, which is standard in this literature, allows us to conclude that the first-step estimation of the conditional survivor function is negligible, provided the bandwidth h_S is chosen accordingly. Notice that this choice of bandwidth allows us to achieve the same rate for the estimation of E(T | W) as if T were fully observed, and therefore to map the results that follow into the class of standard nonparametric IV estimators. We argue that such a requirement is easily satisfied by taking h_S = h_W; that is, we use the same bandwidth to estimate the conditional survivor function and the conditional expectation. A proof of the first part of the Proposition is given in the Appendix, under additional assumptions on the asymptotic representation of Ŝ_{C|W}.
The second part of the Proposition follows from the results in Darolles et al. [24]. The rate of convergence for the estimation of the operators is standard in nonparametric econometrics. As the operators are effectively estimators of conditional densities, their rates of convergence are those of the nonparametric estimator of the joint density of (Z, W ).
Denote by R the range of an operator. We present the main convergence rates in the following Theorem, which holds under either a strong source condition or the following weak source condition. For a small constant ε > 0 such that ε < 2ρ/(2ρ + q) and λ/(2λ + p + q) ≥ ρ/(2ρ + q) − ε/2, this bound is minimized for N_0 ≍ n^{2ρ/(2ρ+q) − ε}. The strong and weak source conditions link the smoothness properties of the conditional expectation operators A and A* with the smoothness properties of the function ϕ. The first part of the theorem establishes the convergence rates for the mildly ill-posed problem, which are polynomial in the sample size. In the second part, we instead provide the rates for the severely ill-posed case, which are polynomial in the logarithm of the sample size. This is a standard result in the literature: the super-smoothness of the joint density implies that the data contain very little information about the function ϕ, and a large sample is required to obtain a precise estimate [see 20].
For the mildly ill-posed problem, we note that our rate is the rate of estimation of E(V | W) raised to a power β/(β + 1), smaller than 1, which we may view as the cost of solving the inverse problem. Note that one advantage of the Landweber-Fridman method is that β is not constrained by the qualification of the method, unlike Tikhonov regularization, where β is limited to 2 (see the Appendix for a formal definition).
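For intuition, the Landweber-Fridman iteration for a linear problem r = Aϕ is the gradient-type update ϕ_{k+1} = ϕ_k + c A*(r − Aϕ_k), with c‖A‖^2 < 1, where the number of iterations N plays the role of the regularization parameter. A minimal sketch on a discretized problem follows; the smoothing matrix, noise level, true function, and c = 0.9/‖A‖^2 are illustrative assumptions, not the paper's exact objects.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 60
t = np.linspace(0.0, 1.0, m)
# An ill-conditioned smoothing matrix standing in for the operator A
A = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.1) ** 2)
A /= A.sum(axis=1, keepdims=True)
phi_true = np.sin(2.0 * np.pi * t)
r = A @ phi_true + 0.01 * rng.normal(size=m)   # noisy "reduced form"

c = 0.9 / np.linalg.norm(A, 2) ** 2            # step size, c * ||A||^2 < 1
phi = np.zeros(m)                              # initial condition phi_0 = 0
for _ in range(200):                           # N = 200 iterations
    phi = phi + c * (A.T @ (r - A @ phi))      # Landweber-Fridman update

mse = np.mean((phi - phi_true) ** 2)
```

Stopping the loop early regularizes the inversion: running it to numerical convergence would amplify the noise through the small singular values of A.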

Remark 5 (Minimax rates).
When the inverse problem is mildly ill-posed, the eigenvalues of the operator A*A decay polynomially at a speed equal to 2a, with a > 0. If s > 0 is the smoothness of the function ϕ, then ρ = s + a and β = s/a. Therefore, our rates of convergence take the form given above, where q is the dimension of W. If p = q, that is, if we have as many instruments as endogenous variables, then this rate is minimax [18, 20].
When the inverse problem is severely ill-posed, the rate of convergence is dominated by the bias term, which converges at a logarithmic rate. The rate is minimax for the given choice of the tuning parameters.

Case 2: U independent of W
For nonlinear inverse problems, iterative methods like the one used here do not, in general, converge globally. We prove local convergence by appropriately restricting the initial condition and controlling the behavior of the Fréchet derivative of the operator Â. We assume the following. Assumption 4.6. Let B_R(ϕ_0) be a ball of radius R < ∞ around the initial condition, such that B_R(ϕ_0) ⊂ D(A). We have ϕ† ∈ B_R(ϕ_0).

Assumption 4.7. A and Â are Fréchet differentiable, with Fréchet derivatives A′ and Â′ bounded linear operators.
We also impose the following additional assumptions.

Assumption 4.11. The smoothing parameters satisfy the conditions below.

Assumption 4.8 is a standard regularity condition on conditional and unconditional densities. Part (ii) is not restrictive as long as we maintain that the joint support of (Z, W) is compact. Assumption 4.9 restricts the density of the error term to be continuous and differentiable. Finally, Assumptions 4.10 and 4.11 are used for the uniform consistency of the nonparametric density estimators. One crucial difference between this estimator and the one previously described is that it involves estimating the density of the error term at each iteration. While it is plausible to assume that the support of the independent variables is bounded, such an assumption would be too restrictive for the error component U. Assumption 4.10 helps us accommodate a possibly unbounded support of the error, following the approach of Hansen [40]: we choose the points u in expanding sets of the form {u : |u| ≤ c_n}, where c_n → ∞.
We let the following hold.
In the following, to simplify notation, we suppress the dependence of δ_n and γ_n on the parameters (ε, λ, p, q). We leave the values of δ_n and γ_n unspecified, as they depend on the nature of the instrumental variables. For instance, if the instruments are binary and their dimension q is relatively small compared to the sample size, one can sort the sample so as to obtain δ_n ≍ n^{-1/2}. On the contrary, if W ∈ R^q is continuous and one uses a standard nonparametric estimator of the conditional distribution, as in Li and Racine [55], with a kernel of order at least λ, then δ_n ≍ n^{-λ/(2λ+q)}. This high-level assumption holds under Assumptions 2.1 and 2.2, and further regularity conditions similar to the ones provided in Assumptions 4.1 and 4.8-4.11.
Finally, we need to further restrict the local behavior of the Fréchet derivative, its adjoint, and their estimators. In practice, this is done by extending Assumption 2.6 to the estimators of the Fréchet derivative and its adjoint, as presented in more detail in the Appendix.
We also make two additional assumptions.
Assumption 4.14 (Tuning parameters). The tuning parameters satisfy the following restrictions. (iii) There exists β* ∈ (1/2, 1) such that the condition below holds.

Assumption 4.13 is a source condition. Differently from the statement of Theorem 4.1, the source condition is not imposed on the function ϕ† directly, but rather on the difference between our initial condition and the true solution. This is due to the local nature of our estimation procedure. Similarly, when the inverse problem is nonlinear, we cannot allow for a weak source condition, because the error accumulates across iterations at a polynomial rate: when the regularization bias decreases only at a logarithmic rate, the Landweber-Fridman algorithm cannot converge. Assumption 4.14 imposes restrictions on the tuning parameters. All restrictions depend on the unknown regularity of the ill-posed inverse problem, which is determined by β. Proposing a data-driven procedure for the choice of these parameters is an essential step to be pursued in future research.
The following Theorem contains the main result of this Section; it gives an upper bound on the mean squared error of our estimator. A weaker bound applies if Assumptions 4.14(i) and 4.14(iii) do not hold.
For β ≤ β*, as defined in Assumption 4.14(iii), the upper bound is the same one we obtain in Theorem 4.1(i) under a strong source condition. However, for β > β*, we cannot reach the same upper bound. Heuristically, we have an additional term, (h_U^2 κ_n)^{-1}, due to the estimation of the density of the error term. When β = 1, the regularization bias that accumulates across iterations converges to zero exactly as 1/N, and thus the term 1/(N h_U^4 κ_n^2) dominates. The same effect holds for any β close enough to 1, or, more precisely, for any β > β*.
The same heuristic does not apply to the other terms in the decomposition when we can choose h_U large enough and N small enough so that the nonlinearity error does not dominate. The condition on the tuning parameters in Assumption 4.14(i) serves exactly this purpose.
The last statement in the Theorem applies if we cannot choose N → ∞ slowly enough to satisfy the conditions in Assumptions 4.14(i) and 4.14(iii). In this case, N satisfies the condition above, which would be equivalent to the optimal choice of the regularization parameter for the linear ill-posed inverse problem in Theorem 4.1. However, the rate of convergence is then slower because of the additional term (h_U^4 κ_n^2)^{-1}. A potential way to let h_U go to zero more slowly is to use higher-order kernels, which is what we advocate in practice.
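As an illustration of the higher-order kernels advocated above, the following sketch checks the moments of a standard fourth-order Gaussian-based kernel. This is one common choice, not a kernel prescribed by the paper:

```python
import numpy as np

def k4(u):
    """Fourth-order Gaussian-based kernel: integrates to one and has a
    vanishing second moment, reducing smoothing bias from O(h^2) to O(h^4)."""
    return 0.5 * (3.0 - u**2) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

# Numerical check of the kernel moments on a fine grid
u = np.linspace(-12.0, 12.0, 200_001)
du = u[1] - u[0]
m0 = (k4(u)).sum() * du          # zeroth moment: should equal 1
m2 = (u**2 * k4(u)).sum() * du   # second moment: should equal 0
```

Because the second moment vanishes, the leading h^2 bias term of density and regression estimates drops out, at the cost of a kernel that takes negative values in its tails.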

Example 3.
Let us consider the case in which both Z and W are continuous and scalar. We further take kernels of order at least λ ∧ 2. In this case, Â(ϕ) is an estimator of the conditional cdf of U given the instrument W, so that one can choose the bandwidth in a way that δ_n ≍ n^{-λ/(2λ+1)}. Similarly, the Fréchet derivative Â′_ϕ is a conditional expectation operator, and the last equivalence follows by taking h_Z ≍ n^{-1/(2λ+2)}. Thus, δ_n ∨ γ_n = γ_n. Let us take h_U ≍ n^{-1/(2λ+1)}. The condition on the growth of the number of iterations then follows, where the unknown value of β determines the optimal growth of the regularization constant N.
Having established the rate of convergence of the proposed estimator, we now turn to an assessment of its finite-sample performance.
We then generate the model components using ζ ∼ N(0.1, 0.4^2) and ε ∼ N(0, 0.25^2). This generates dependence between U and Z in such a way that E(U | Z) ≠ 0, while obviously E(U | W) = 0, as W is taken to be independent of U in this example. We consider two separate scenarios for the censoring variable C. In the first case, we take C to be independent of all other variables in the model, and we generate C from a normal distribution with mean equal to the 90th percentile of T and variance equal to the variance of T. In the second case, we simulate ν ∼ N(μ_ν, 0.25^2), where μ_ν is twice the 90th percentile of T. We thus have ν ⊥⊥ (T, Z, W), and we take C = Zν.
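A hypothetical DGP in the spirit of this design can be sketched as follows. It reproduces the key features described above — endogeneity through a component ζ shared by Z and U, an instrument independent of U, and roughly 20% censoring under the first scenario — but the first stage and the choice ϕ(z) = z are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
w = rng.normal(size=n)                     # instrument, independent of U
zeta = rng.normal(0.1, 0.4, size=n)        # shared component: source of endogeneity
eps = rng.normal(0.0, 0.25, size=n)
Z = w + zeta                                # hypothetical first stage
U = (zeta - 0.1) + eps                      # E(U) = 0, E(U | W) = 0, E(U | Z) != 0
T = Z + U                                   # hypothetical phi(z) = z

# Censoring scenario 1: C independent, centred at the 90th percentile of T
C = rng.normal(np.quantile(T, 0.9), T.std(), size=n)
Y = np.minimum(T, C)
delta = (T <= C).astype(int)
cens_rate = 1.0 - delta.mean()              # roughly 20% of observations censored
```

With C centred at the 90th percentile of T and sharing its standard deviation, the implied censoring probability is about 0.18, matching the reported figure of about 20%.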
For each simulated DGP, about 20% of the observations are censored. We first look at estimation under the mean independence condition E(U | W) = 0. Figure 3 plots the median of the estimated function (solid gray line) and a simulated 95% confidence interval (dashed gray lines) for M = 1000 simulated samples of size n = 500 drawn from these DGPs. The black dashed-dotted line is the nonparametric regression estimator under the assumption of mean independence [i.e., E(U | Z) = 0; see 22, 23], and the solid black line is the true regression function. We can notice that the simple nonparametric regression estimator is never fully contained in the 95% confidence bands, and that it can severely distort marginal effects at every point of the support of Z. A numerical comparison of the Mean Integrated Squared Error (MISE) of the simple nonparametric regression against our nonparametric instrumental variable estimator confirms the graphical results (see Table 1). We point out that, as the sample size increases, the relative improvement in MISE becomes larger. This is to be expected, as the simple nonparametric regression is inconsistent in this setting.
Finally, we consider M = 1000 Monte Carlo replications for increasing sample sizes n = {250, 500, 1000}. For each replication, we use the nonparametric estimator outlined above to estimate ϕ(z), with the stopping rule described above. The constant for the Landweber-Fridman iteration was set to 0.5. All bandwidths for the conditional mean objects were selected via Silverman's rule of thumb. We consider local linear estimators of conditional means and operators, as described in Section 3, and for each replication we compute the MISE of the feasible estimator of ϕ(z) relative to the unfeasible estimator, defined as the estimator of ϕ that would be obtained if T were observed without censoring. We report summary results in Table 2, along with the median number of iterations N required for convergence. We can observe from Table 2 that the feasible estimator has reasonable properties compared to the unfeasible one. In most cases, the ratio between the median MISEs tends to 1 as n increases, as predicted by our theoretical results. As we take a larger proportion of censored observations, we can expect our estimator to approach the unfeasible one more slowly. We also consider the same DGP when estimation is carried out using an independence restriction, keeping the same data generating process as above. As we have noticed, our continuous instruments satisfy the independence restriction, so that our estimator can be implemented using this stronger restriction as described in Section 3.
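The rule-of-thumb bandwidths mentioned above can be sketched as follows; this is Silverman's rule for a Gaussian kernel with the usual robust spread, where the constant 0.9 is the standard textbook choice, not a value prescribed by the paper:

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel,
    using the robust spread min(sd, IQR / 1.349)."""
    n = x.size
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # 75th minus 25th percentile
    sigma = min(x.std(ddof=1), iqr / 1.349)
    return 0.9 * sigma * n ** (-0.2)

rng = np.random.default_rng(7)
h = silverman_bandwidth(rng.normal(size=1000))   # about 0.9 * 1000^(-1/5)
```

For a standard normal sample of size 1000, this gives a bandwidth of roughly 0.22.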
We can also directly compare the performance of the estimator under mean independence and under independence. The latter restriction carries more information about the data generating process; however, rates of convergence can be slower because of the nonlinear ill-posed inverse problem. Moreover, as the instruments are continuous, the linear estimator under mean independence is consistent. Finally, we conjecture that taking only one iteration from the mean independence estimator towards independence should be sufficient to achieve the smallest MISE. This parallels the scoring method in maximum likelihood estimation [63, 65], in which any consistent estimator of the unknown parameter reaches the efficiency bound by taking a one-step deviation towards the maximum likelihood estimator.
We can therefore assess (a) how sensitive the performance of the estimation procedure is to various choices of the initial condition; and (b) whether our conjecture holds, at least in a limited simulation setting.
The results of this comparison are reported in Table 3 for various sample sizes.
The table is divided into four sub-tables. The first one refers to the estimator under mean independence. The second refers to the estimator under independence when the initial value is taken to be the local linear estimator of the conditional expectation of Y given Z, ϕ̂_LL. The third sub-table considers the performance when the initial condition is the estimator under mean independence, ϕ̂_MI. Finally, the last sub-table considers the performance of the independence estimator when we simply take a one-step deviation from ϕ̂_MI. For each sample size and type of simulation, we report the MISE and the median number of iterations N performed. For the latter estimator, the median number of iterations is always equal to 1 and is therefore not reported. The median MISE is multiplied by a factor of 100 for the reader's convenience.
As expected, the performance of all estimators improves as the sample size increases. The properties of the estimator under mean independence are better than those of the estimator under independence. This may also be due to a choice of tuning parameters that is not optimal in the latter case; exploring this further is beyond the scope of this work.
There does not appear to be a substantial difference in the properties of the estimator under independence when we take different initial conditions. The median mean square error does not change dramatically, and neither does the median number of iterations.
Finally, we find some evidence in favor of our conjecture, at least in our simulation study: taking a single iteration under the independence restriction has, in some cases, better performance than taking multiple iterations.
As a last exercise, we consider a Monte Carlo simulation in which we replace the two continuous instruments with a single binary instrument, W, generated from a Bernoulli distribution with parameter 0.5.
We then independently generate a normal random variable, ε ∼ N(0, 0.25^2), and a uniform random variable, ω, and define ζ so that it follows a standard logistic distribution. Otherwise, we keep the same specifications of the regression function ϕ and the censoring variable C.
In this example, the estimator of the operator A is obtained by sorting the sample according to the values of the instrument W and estimating the two conditional survivor functions. Our estimators need to satisfy Assumption 4.7; hence, we do not directly employ the Kaplan-Meier estimator, but its kernel-smoothed version [see 47, 58, among others]. We let Ŝ_{U|W}(u | W = 1) and Ŝ_{U|W}(u | W = 0) be the estimators of the survivor function of U conditional on W = 1 and W = 0, respectively.
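A minimal sketch of a kernel-smoothed survivor estimator of this kind: an ordinary Kaplan-Meier estimator whose jumps are replaced by smooth cdf ramps, so that the resulting estimate is differentiable. The logistic smoothing cdf and the exponential test distributions are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def km_survivor(y, d):
    """Kaplan-Meier survivor estimate at the sorted observed times y,
    with censoring indicators d (1 = uncensored)."""
    order = np.argsort(y)
    y, d = y[order], d[order]
    at_risk = y.size - np.arange(y.size)
    return y, np.cumprod(1.0 - d / at_risk)

def smoothed_survivor(t, y, d, h):
    """Replace each Kaplan-Meier jump by a logistic cdf ramp of width h,
    yielding a smooth, differentiable survivor estimate."""
    ys, s = km_survivor(y, d)
    jumps = np.diff(np.concatenate(([1.0], s)))   # nonpositive jump sizes
    ramps = 1.0 / (1.0 + np.exp(-(np.asarray(t)[:, None] - ys[None, :]) / h))
    return 1.0 + ramps @ jumps

rng = np.random.default_rng(3)
n = 5000
T = rng.exponential(1.0, size=n)           # true survivor S(t) = exp(-t)
C = rng.exponential(4.0, size=n)           # about 20% censoring
y, d = np.minimum(T, C), (T <= C).astype(float)
S_med = smoothed_survivor(np.array([np.log(2.0)]), y, d, h=0.1)[0]
```

At the true median log 2, the smoothed estimate is close to 0.5, while remaining differentiable everywhere, as the Fréchet differentiability assumption requires.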
Let ψ(u, w) = Ŝ_{U|1}(u) − Ŝ_{U|0}(u). One can write the estimator of the adjoint operator A*ϕ in the following form, with ψ ∈ L^2_{U×W} and f̂_U(u) a nonparametric estimator of the density of U, as explained in Section 3.
In this case, the model is not identified under the mean independence restriction, as the completeness condition in Assumption 2.3 fails. As a matter of fact, the restriction E(ϕ(Z) | W) = 0 reduces to a pair of moment conditions, which cannot imply ϕ = 0, except when Z is also binary or when ϕ is a two-parameter function of Z.
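The failure of completeness can be illustrated directly. Under the hypothetical design Z | W = w ∼ N(w, 1) with binary W, the nonzero function ψ(z) = z^2 − z − 1 has zero conditional mean given both values of the instrument, so ϕ and ϕ + ψ are observationally equivalent under mean independence:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
W = rng.binomial(1, 0.5, size=n)
Z = W + rng.normal(size=n)            # hypothetical design: Z | W = w ~ N(w, 1)

psi = Z**2 - Z - 1.0                  # nonzero function of Z
m0 = psi[W == 0].mean()               # E[psi(Z) | W = 0] = 1 - 0 - 1 = 0
m1 = psi[W == 1].mean()               # E[psi(Z) | W = 1] = 2 - 1 - 1 = 0
```

Both conditional means are (numerically) zero even though ψ has substantial variance, which is exactly the non-trivial null space that the completeness condition rules out.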
Results for this simulation exercise are reported in Table 4. As above, the MISE is multiplied by a factor of 100.
Table 4. MISE and median number of iterations, N, with U ⊥⊥ W and W binary.

The performance of our estimator worsens compared to the case with two continuous instruments, which is to be expected. Nonetheless, we can appreciate how the MISE decreases as the number of observations increases. Moreover, the median number of iterations is often larger than above, which may be related to the fact that the information contained in each single iteration step is much smaller in this context.

Appendix A
Proof of Proposition 4.1. We only prove the first part of the Proposition, which is specific to this paper. The proof of the second part is identical to Darolles et al. [24] and Florens et al. [34], and it is omitted here for brevity. We introduce additional notation below, where the definition of these objects should be apparent. Under Assumption 2.2, we immediately obtain that H_{C|W}(y | w) < 1 and H^δ_{C|W}(y | w) < 1. We make use of the following Lemma.
We let r̂ be defined as above. We therefore decompose the estimation error directly, under Assumption 4.4 [see 24, 34], and we are left with the term r − r̂. After simple computations, this difference can be rewritten as above, where the last step follows from Lemma A.1 and Assumption 4.5. Therefore, we have the bound above, where the last step follows from the conditions in Assumptions 4.3(i), 4.4(iii) and 4.5, which imply the uniform convergence of the conditional Kaplan-Meier estimator [22], and from Assumption 2.2, which implies that the conditional survivor function is almost surely bounded away from 0. Directly from the results in Darolles et al. [24], we obtain the next bound, where f̂_W(w) is the Nadaraya-Watson estimator of the density of W using generalized kernels as defined in Assumption 4.1. We now consider the following term, where the conclusion follows by the usual change of variables, the uniform boundedness of the kernel function, Lemma A.1 and Assumption 4.4. Using a similar argument, we obtain the next bound, which directly implies the one that follows. Finally, the remaining terms are zero by the law of iterated expectations, and, by the usual change of variables, Assumption 4.4 and Lemma A.1, one can show the last bound. The result of the Proposition then follows from the Markov inequality and the assumptions on the bandwidths.

Proof of Theorem 4.1. We give only the main steps of the proof; more details can be found in Carrasco et al. [11] and Centorrino [12]. Let R denote a generic positive constant. To reduce the notational burden, we write Φ_β(N) = N^{-β} under the strong source condition and Φ_β(N) = (log N)^{-β} under the weak source condition. As the operator A*A is compact and thus admits a singular value decomposition, we also use the notation Φ_β(A*A) to signify that the function Φ_β is applied to the singular values of A*A. Finally, the source condition implies that we can write ϕ† in terms of a source element v, with ||v|| ≤ R. We first recall the following definition.
Definition A.1 (Qualification). A regularization procedure g_N is said to have qualification of order κ > 0 if the inequality in equation (A.1) holds at that order. In particular, the Landweber-Fridman regularization has qualification equal to ∞, in the sense that, for every η > 0, the inequality in equation (A.1) holds with 1 − a g_N(a) = (1 − ca)^N.
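The infinite qualification of Landweber-Fridman can be checked numerically: with residual polynomial 1 − a g_N(a) = (1 − ca)^N, one has sup_a a^η (1 − ca)^N ≤ (η/(cN))^η for every η > 0, i.e., an O(N^{-η}) bound at every order. A small grid check with illustrative constants:

```python
import numpy as np

c = 0.9
a = np.linspace(1e-9, 1.0 / c, 100_001)     # spectrum of A*A, scaled so c*a <= 1
ok = True
for eta in (0.5, 1.0, 2.0, 4.0):            # arbitrary qualification orders
    for N in (50, 100, 200, 400):
        lhs = np.max(a**eta * (1.0 - c * a)**N)
        ok &= lhs <= (eta / (c * N))**eta    # O(N^-eta) bound holds for every eta
```

The maximizer is a* = η/(c(N + η)), at which the left-hand side is strictly below (η/(cN))^η, so the check passes for every order tested; by contrast, Tikhonov regularization would fail such a check for η > 2.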
Moreover, we need the following.
Assumption A.1. There exist two positive constants R and η such that the bound below holds. Definition A.1 and Assumption A.1 together imply a bound that is used repeatedly in the proof below. We then have the decomposition displayed above. Given the source condition and the qualification of the Landweber-Fridman regularization, we directly have that ||III||^2 = O_P(Φ_β(N)). Moreover, the next bound follows directly from the result in Proposition 4.1, with h_S = O_P(h_W). Finally, thanks to the above, we obtain the remaining bound. By the Taylor theorem for integer powers of positive operators in Bhatia and Sinha [7], which applies provided c < 1, we obtain the expansion above. We ignore for the moment the remainder of the Taylor expansion, which can be shown to be negligible under identical conditions. We thus have the two bounds above: the first follows directly from the result of Proposition 4.1 and Assumption A.1, and the second from Proposition 4.1 and Assumption A.1 as well. We finally notice that the remainder of the Taylor expansion can be treated in the same way: it can be proven that the remainder is of order n^{-4λ/(2λ+p+q)} N^2 Φ_{2β}(N), and thus negligible under the conditions given in the statement of the Theorem. Combining these bounds, the result of the Theorem follows.

Proof of Theorem 4.2.
To obtain uniform consistency of the nonparametric estimators, we must impose some additional assumptions, listed below. Without loss of generality, we use the word density irrespective of whether W is discrete or continuous. The marginal density of W can be estimated by different methods, depending on the nature and dimension of the instrument. We therefore suppose that there is a function d(·) satisfying the condition above; this function could be a kernel for continuous or discrete variables [see 2, 54], or a product of indicator functions for purely discrete instruments. We then define the corresponding estimators. Among other things, the lemma below implies a condition that in turn implies (but is not implied by) a Lipschitz continuity condition on Â [see 26, 42, 45].
We will also use the results below. We now turn to the proof of the main result of the Theorem. We have the decomposition above, where the second line follows from A(ϕ†) = 0. By iteratively replacing ϕ̂_k, for all k = 0, . . . , N − 2, and letting ê_k = ϕ̂_k − ϕ†, for all k = 0, 1, 2, . . . , we finally obtain the expansion above. The first two terms are similar to those in the asymptotic expansion of the Landweber-Fridman regularization for linear inverse problems (see the proof of Theorem 4.1). By contrast, the terms III and IV come from the nonlinearity of the inverse problem in our framework; as a matter of fact, these terms are identically zero when the ill-posed inverse problem is linear. To control them, we use the main result provided in Lemma A.2. Let δ_n and γ_n be defined as in Assumption 4.12. We again use the letter R to denote a strictly positive constant, which may take different values in different instances. We start by considering the term I. It follows from the strong source condition that the bound above holds. Under the same conditions outlined earlier, ||I_b||^2 = O_P(N^{-β}) and ||I_a||^2 = O_P(γ_n N^{(1-β)/2}).
Similarly, for II, we obtain the analogous bound. We now control the nonlinear terms following the approach in Centorrino et al. [14]. In the display above, the second inequality follows from Lemma A.4 with s = 0.5, and a similar bound holds for the companion term.

The last result implies that the convergence of the nonlinearity term is dominated by the terms in IV.
Because of the restriction imposed in Assumption 4.14(i), we obtain a condition which, for β ≤ 1 and h_U^2 κ_n = o(1), cannot be satisfied for all β's. Therefore, there exists a β* < 1 such that the condition is satisfied; this is equivalent to the condition given in Assumption 4.14(iii). Finally, reasoning as above, the result of the Theorem follows from Markov's inequality.