Towards optimal doubly robust estimation of heterogeneous causal effects

Heterogeneous effect estimation plays a crucial role in causal inference, with applications across medicine and social science. Many methods for estimating conditional average treatment effects (CATEs) have been proposed in recent years, but there are important theoretical gaps in understanding if and when such methods are optimal. This is especially true when the CATE has nontrivial structure (e.g., smoothness or sparsity). Our work contributes in several main ways. First, we study a two-stage doubly robust CATE estimator and give a generic model-free error bound, which, despite its generality, yields sharper results than those in the current literature. We apply the bound to derive error rates in nonparametric models with smoothness or sparsity, and give sufficient conditions for oracle efficiency. Underlying our error bound is a general oracle inequality for regression with estimated or imputed outcomes, which is of independent interest; this is the second main contribution. The third contribution is aimed at understanding the fundamental statistical limits of CATE estimation. To that end, we propose and study a local polynomial adaptation of double-residual regression. We show that this estimator can be oracle efficient under even weaker conditions, if used with a specialized form of sample splitting and careful choices of tuning parameters. These are the weakest conditions currently found in the literature, and we conjecture that they are minimal in a minimax sense. We go on to give error bounds in the non-trivial regime where oracle rates cannot be achieved. Some finite-sample properties are explored with simulations.


Introduction
Heterogeneous effect estimation plays a crucial role in causal inference, with applications across medicine and social science, e.g., improving understanding of effect variation, and informing policy or optimizing treatment decisions. The most common target parameter in this setup is the conditional average treatment effect (CATE) function, E(Y^1 − Y^0 | X = x), which measures the expected difference in outcomes had those with covariates X = x been treated versus not. The CATE is typically identified under standard causal assumptions (including no unmeasured confounding) as the difference between two regression functions, τ(x) = μ_1(x) − μ_0(x). Important early methods for estimating the CATE often employed semiparametric models, for example partially linear models assuming τ(x) to be constant [Robins et al., 1992, Robinson, 1988], or structural nested models in which τ(x) followed some known parametric form [Robins, 1994, van der Laan, 2005, van der Laan and Robins, 2003, Vansteelandt and Joffe, 2014]. These approaches reflect a commonly held conviction that the CATE may be more structured and simple than the rest of the data-generating process (with the zero treatment effect case an obvious example). This is similar in spirit to the common belief that interaction terms are more often zero than "main effects".
Recent years have seen a move towards more flexible estimators of τ(x). The first formal nonparametric model where the CATE has its own complexity separate from the regression functions seems to be Example 4 of Robins et al. [2008]. van der Laan [2005] (Section 4.2) proposed an important model-free "meta-algorithm" for estimating the CATE, a variant of which we study in this paper. The last 5-10 years have seen even more emphasis on nonparametrics and incorporating machine learning [Athey and Imbens, 2016, Foster and Syrgkanis, 2019, Foster et al., 2011, Hahn et al., 2020, Imai and Ratkovic, 2013, Künzel et al., 2019, Nie and Wager, 2021, Semenova and Chernozhukov, 2017, Shalit et al., 2017, Wager and Athey, 2018]. The present work is in a similar vein, focusing on (i) providing more flexible CATE estimators with stronger theoretical guarantees, and (ii) pushing forward our understanding of optimality and the fundamental limits of CATE estimation. At the start of each of Sections 3, 4, and 5, we detail related work and describe how our results fit.
After describing the setup and presenting a simple motivating illustration in Section 2, we go on to present our three main contributions: (i) a model-free oracle inequality for regression with estimated pseudo-outcomes (given in Section 3); (ii) general conditions for oracle efficiency of a doubly robust estimator we term the DR-Learner, with applications to specific regression methods under smoothness conditions (in Section 4); and (iii) a more refined analysis of a specialized estimator we call the lp-R-Learner, showing that faster rates can be achieved in the non-oracle regime, and giving a partial answer towards understanding the fundamental limits of CATE estimation (in Section 5).

Setup & Illustration
We assume access to an iid sample of observations of Z_i = (X_i, A_i, Y_i), where X ∈ R^d are covariates, A ∈ {0, 1} is a binary treatment or exposure, and Y ∈ R an outcome of interest. The distribution of Z is indexed by the covariate distribution and the nuisance functions: the propensity score π(x) ≡ P(A = 1 | X = x) and the regression functions μ_a(x) ≡ E(Y | X = x, A = a). Our goal is estimation of the difference in regression functions under treatment versus control, τ(x) ≡ μ_1(x) − μ_0(x). Under standard causal assumptions of no unmeasured confounding, consistency, and positivity or overlap (ϵ ≤ π ≤ 1 − ϵ with probability one, which we assume throughout), the function τ(x) also equals E(Y^1 − Y^0 | X = x), where Y^a is the counterfactual outcome under A = a. We refer to μ_1(x) − μ_0(x) as the CATE, noting that our results hold regardless of whether the causal assumptions do. The average treatment effect (ATE) is given by E{τ(X)}.

Notation
We use P_n(f) = P_n{f(Z)} = (1/n) Σ_i f(Z_i) as shorthand for sample averages. When x ∈ R^d we let ∥x∥² = Σ_j x_j² denote the usual (squared) Euclidean norm, and for generic (possibly random) functions f we let ∥f∥² = ∫ f(z)² dP(z) denote the (squared) L₂(P) norm. We use the notation a ≲ b to mean a ≤ Cb for some universal constant C, and a ≍ b to mean cb ≤ a ≤ Cb, so that a ≲ b and b ≲ a. We let a_n ∼ b_n mean a_n/b_n → 1 as n → ∞.
At various points we refer to s-smooth functions, which we define as those contained in the Hölder class H(s), a canonical function class in nonparametric regression, density estimation, and functional estimation. Intuitively, it contains smooth functions that are close to their ⌊s⌋-order Taylor approximations. More precisely, H(s) is the set of functions f : X → R that are ⌊s⌋-times continuously differentiable with bounded partial derivatives, and for which

|D^m f(x) − D^m f(x′)| ≤ C ∥x − x′∥^{s−⌊s⌋}

for all x, x′ and all multi-indices m = (m_1, ..., m_d) with Σ_j m_j = ⌊s⌋, where D^m = ∂^{⌊s⌋} / (∂x_1^{m_1} ··· ∂x_d^{m_d}) is the multivariate partial derivative operator.

Simple Motivating Illustration
Consider a simple data-generating process where the covariates X are uniform on [−1, 1], π(x) = 0.5 + 0.4 sign(x), and μ_1(x) = μ_0(x) are equal to the piecewise polynomial function defined on page 10 of Györfi et al. [2002], given by

μ_a(x) = (x + 2)²/2 if −1 ≤ x < −0.5,
         x/2 + 0.875 if −0.5 ≤ x < 0,
         −5(x − 0.2)² + 1.075 if 0 ≤ x < 0.5,
         x + 0.125 if 0.5 ≤ x ≤ 1,

which is illustrated in Figure 1. Figure 1 also shows n = 1000 simulated data points from this data-generating process, approximately half of which are treated (shown in the left panel) and the other half untreated (shown in the right). Also shown are estimates of the corresponding μ_1 and μ_0 functions, using default tuning parameters with the smooth.spline function in base R.
An interesting but likely common phenomenon occurs in this simple example. The individual regression functions are non-smooth, and difficult to estimate well on their own; this is especially true in the region where there are fewer treated individuals. Thus the estimate μ̂_1 tends to oversmooth on the left, where there are more untreated individuals; in contrast, the estimate μ̂_0 tends to undersmooth on the right, where there are more treated individuals. This means a naive plug-in estimator of the CATE that simply takes the difference μ̂_1 − μ̂_0 will be a poor and overly complex estimator of the true difference, which is not only a constant but zero.

Figure 1: Plot of simulated data where the regression functions μ_1 and μ_0 individually are complex and difficult to estimate, but their difference is simply constant and equal to zero. Thus a naive plug-in estimator of the CATE will be overly complex, yielding large errors.
In contrast, suppose for simplicity that the propensity scores π were known. Then a regression of the inverse-probability-weighted (IPW) pseudo-outcome (A − π)Y / {π(1 − π)} on X would, up to constants, behave just as an oracle estimator with access to the actual counterfactual difference Y^1 − Y^0, since the conditional mean of this pseudo-outcome is exactly τ(x). Figure 2 shows results from this procedure, as well as two other more efficient and doubly robust versions described in subsequent sections, again all using default tuning parameter choices from smooth.spline. For these simulated data, the doubly robust estimators are much more efficient than the IPW estimator, and do a much better job of adapting to the correct underlying simplicity of the true τ.
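The centering property of the IPW pseudo-outcome is easy to check numerically. Below is a small sketch in Python (the paper's own code is in R, and the piecewise function follows the reconstruction above) that simulates the data-generating process and verifies that the pseudo-outcome (A − π)Y / {π(1 − π)} has mean E{τ(X)} = 0 in this example:

```python
import numpy as np

def mu(x):
    # Piecewise polynomial regression function (Gyorfi et al. 2002, p. 10, as
    # reconstructed above); here mu_1 = mu_0 = mu, so the true CATE is zero.
    return np.where(x < -0.5, (x + 2) ** 2 / 2,
           np.where(x < 0.0, x / 2 + 0.875,
           np.where(x < 0.5, -5 * (x - 0.2) ** 2 + 1.075, x + 0.125)))

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1, 1, n)
pi = 0.5 + 0.4 * np.sign(x)              # known propensity score
a = rng.binomial(1, pi)                  # treatment indicator
y = mu(x) + 0.2 * rng.standard_normal(n)

# IPW pseudo-outcome: conditional mean given X = x equals tau(x) = 0 here.
pseudo_ipw = (a - pi) * y / (pi * (1 - pi))
ipw_mean = pseudo_ipw.mean()             # approximately E{tau(X)} = 0
```

Regressing `pseudo_ipw` on `x` with any consistent smoother would then estimate τ(x) directly, which is the IPW procedure used in Figure 2.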
The results shown in Figure 2 are typical for this data-generating process: across 500 simulations, the IPW and doubly robust estimators gave smaller integrated squared bias across X than the plug-in by factors of roughly 10 and 100, respectively, and the doubly robust estimators improved on the integrated variance of the IPW estimator by nearly a factor of 20.
Remark 1. If the CATE has no additional structure (e.g., smoothness, sparsity) beyond that of the individual regression functions μ_a, then a simple plug-in estimator τ̂(x) = μ̂_1(x) − μ̂_0(x) would be sufficient. However, in general the complexities of μ_a and τ may differ substantially, and in practice we often expect the CATE to be much more structured, as described in the Introduction.
In the following sections we study the error of procedures like those illustrated above, giving model-free guarantees for practical use, as well as a more theoretical study of the fundamental limits of CATE estimation in a large nonparametric model.

Figure 2: Estimated CATE curves in a simple simulated example. The plug-in method inherits the large errors from estimating the individual regression functions, which are complex and non-smooth. The IPW and doubly robust methods adapt to the smoothness of the CATE, which is constant in this example, with the doubly robust methods more efficient.

General Pseudo-Outcome Regression
In this section we lay out a framework for characterizing the error of general two-stage regression estimators that regress estimated "pseudo-outcomes" on a covariate vector. We define a new notion of stability (which we prove holds for generic linear smoothers, for example), and then show how this stability can be combined with a small-bias condition to yield oracle efficiency. In addition to laying a foundation for the analysis of a doubly robust CATE estimator in the next section, the result should also be of independent interest in other problems involving regression with estimated or imputed outcomes [Ai and Chen, 2003, Bibaut and van der Laan, 2017, Fan and Gijbels, 1994, Kennedy et al., 2017, Rubin and van der Laan, 2005, 2006].
With a few exceptions, previous work on regression with estimated outcomes does not appear to have exploited sample splitting, and has largely focused on particular pseudo-outcomes and particular regression estimators (in both stages); in contrast, we lean on sample splitting in order to obtain a more general result that is agnostic about the methods used, beyond a basic stability condition. Prominent examples of previous work include Ai and Chen [2003], Rubin and van der Laan [2005], and Foster and Syrgkanis [2019], all of which gave results for pseudo-outcomes of a general form. Ai and Chen [2003] and Rubin and van der Laan [2005] did not use sample splitting and so limited their attention to particular estimators. Ai and Chen [2003] used sieves, and focused more on a finite-dimensional component appearing in the pseudo-outcome. Rubin and van der Laan [2005] considered least squares, penalized methods, and linear smoothers in the second stage, while being agnostic about the first-stage regression. However, their error bounds do not allow one to exploit double robustness in the pseudo-outcome, which is a crucial advantage in practice and which our result does allow.
Our results are closest in spirit to Foster and Syrgkanis [2019], who study generic empirical risk minimization when the loss involves complex nuisance functions. However, there are some important distinctions to be made. Most importantly, Foster and Syrgkanis [2019] assume the loss satisfies a Neyman orthogonality property in order to obtain squared error rates; in contrast, our bound does not rely on this structure, but can exploit it when it holds, as shown in the following section. Crucially, our bound will also be seen to yield doubly robust errors, whereas the orthogonality-based results of Foster and Syrgkanis [2019] yield errors that are second-order but not doubly robust. Double robustness is essential when different nuisance components are estimated with different errors. Lastly, Foster and Syrgkanis [2019] pursue global error bounds, while we consider pointwise error, which can be beneficial for developing inferential tools. These and some other differences are discussed further in the next section.
In what follows we first define a new notion of estimator stability, which can be viewed as a form of stochastic equicontinuity for nonparametric regression (as opposed to averaging).
Definition 1 (Stability). Suppose D^n = (Z_01, ..., Z_0n) and Z^n = (Z_1, ..., Z_n) are independent training and test samples, respectively, with covariates X_i ⊆ Z_i. Let:

1. f̂(z) = f̂(z; D^n) be an estimate of a function f(z), constructed from the training sample D^n;
2. m̂(x) = Ê_n{f̂(Z) | X = x} and m̃(x) = Ê_n{f(Z) | X = x} denote the second-stage regressions of the estimated and true outcomes, respectively, on X in the test sample;
3. b̂(x) = E{f̂(Z) − f(Z) | X = x, D^n} denote the conditional bias of f̂; and
4. R_n*(x)² = E[{m̃(x) − m(x)}²] denote the oracle mean squared error, for m(x) = E{f(Z) | X = x}.

Then the regression estimator Ê_n is defined as stable at X = x (with respect to a distance metric d) if

[m̂(x) − m̃(x) − Ê_n{b̂(X) | X = x}] / R_n*(x) →_p 0

whenever d(f̂, f) →_p 0.

Sample splitting plays an important role here: the regression procedure as defined estimates the pseudo-outcome on a separate sample, independent from the one used in the second-stage regression via Ê_n. Examples of such procedures can be found in subsequent sections, e.g., as illustrated in Figures 3 and 5. The main role of sample splitting is that it allows for informative error analysis while being agnostic about the first- and second-stage methods.
Remark 2. With iid data, one can always obtain separate independent samples by randomly splitting the data in half (or in folds); further, to regain full sample size efficiency one can always swap the samples, repeat the procedure, and average the results. This is popularly called cross-fitting, and was used for example by Bickel and Ritov [1988], Robins et al. [2008], Zheng and van der Laan [2010], and Chernozhukov et al. [2018]. In this paper, to simplify notation, we always analyze a single-split procedure, with the understanding that extending to an analysis of an average across independent splits is straightforward.
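The split-swap-average device of Remark 2 can be sketched as follows; the two-stage pipeline here is a hypothetical stand-in of our own (a global mean "nuisance fit" plus a crude local average), not the paper's estimator:

```python
import numpy as np

def two_stage_estimate(train, test, x0):
    # Hypothetical stand-in for a full pipeline: "nuisance training" fits a
    # global mean on `train`; the "second stage" locally averages residual-
    # corrected values on `test` near the point x0.
    m = train[:, 1].mean()
    near = np.abs(test[:, 0] - x0) < 0.25
    return m + (test[near, 1] - m).mean()

rng = np.random.default_rng(1)
# Columns: covariate X ~ U[-1, 1], outcome with mean 1.
z = np.column_stack([rng.uniform(-1, 1, 2000),
                     1.0 + rng.standard_normal(2000)])
d1, d2 = z[:1000], z[1000:]

single_split = two_stage_estimate(d1, d2, 0.0)
# Swap the roles of the two halves and average ("cross-fitting"), regaining
# full-sample efficiency while preserving the independence used in the theory.
cross_fit = 0.5 * (two_stage_estimate(d1, d2, 0.0)
                   + two_stage_estimate(d2, d1, 0.0))
```

The K-fold variant simply cycles each fold through the test-sample role and averages all K estimates.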
As mentioned above, Definition 1 can be viewed as a generalization of the classic empirical process condition that (P_n − P)(f̂ − f) = o_P(1/√n) whenever f̂ is consistent for f (see, e.g., Andrews [1994], Lemma 19.24 of van der Vaart [2000], Lemma 2 of Zheng and van der Laan [2010], term I_{3,k} of Chernozhukov et al. [2018], Lemma 2 of Kennedy et al. [2020], etc.). More specifically, Definition 1 generalizes from settings involving averages P_n to generic regressions Ê_n, where slower than root-n rates can appear in the denominator scaling. We also note that the stability in Definition 1 is local or pointwise, in the sense that it is defined at a particular X = x (because pointwise rates are the focus of this paper). More global variants of stability are also possible; for example, after this work appeared on arXiv, Rambachan et al. [2022] used an L₂ version of stability (their Assumption B.1 and Proposition B.1) to study L₂ rates (their Lemma B.1).
The next result shows that generic linear smoothers are stable, with respect to a natural weighted L₂ distance. This and all other proofs are given in Section 7. Linear smoothers are a fundamental class of regression estimators and include, for example, linear regression, series methods, local polynomials, nearest neighbor matching, smoothing splines, kernel ridge regression, and some versions of random forests [Biau, 2012, Scornet, 2016, Verdinelli and Wasserman, 2021]. We suspect other kinds of estimators that are not linear smoothers are also stable in the sense of Definition 1, but leave this to future work.
Theorem 1. Linear smoothers, i.e., second-stage regression estimators of the form

Ê_n{f(Z) | X = x} = Σ_{i=1}^n w_i(x; X^n) f(Z_i),

are stable in the sense of Definition 1, with respect to a weighted L₂ distance d(f̂, f) = ∥f̂ − f∥_{w,2}.

We note that Theorem 1 recovers results similar to Lemma 2 of Kennedy et al. [2020] for averages, i.e., when the weights all equal w_i(x; X^n) = n^{−1}, but also covers the much larger class of all linear smoothers. The norm ∥·∥_{w,2} defining the relevant distance d(f̂, f) for linear smoothers is a natural weighted and conditional version of an L₂(P) norm. Namely, it is a weighted average of the conditional L₂(P) norm (given X), weighted more towards the point X = x depending on how localized the weights w_i(x; X^n) are.
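For concreteness, here is a Nadaraya-Watson kernel regressor written explicitly in linear-smoother form (a sketch of ours, not the paper's code): the weights w_i(x; X^n) depend only on the test covariates and sum to one, so the fit is linear in the (pseudo-)outcomes:

```python
import numpy as np

def nw_weights(x0, xs, h=0.1):
    # Gaussian-kernel Nadaraya-Watson weights w_i(x0; X^n): a function of the
    # covariates xs only (never the outcomes), normalized to sum to one.
    k = np.exp(-0.5 * ((xs - x0) / h) ** 2)
    return k / k.sum()

rng = np.random.default_rng(2)
xs = rng.uniform(-1, 1, 2000)
f_vals = np.sin(np.pi * xs) + 0.1 * rng.standard_normal(2000)  # noisy f(Z_i)

w = nw_weights(0.3, xs)
estimate = np.sum(w * f_vals)   # sum_i w_i(x; X^n) f(Z_i), linear in outcomes
```

Because the weights are outcome-free, replacing `f_vals` by estimated pseudo-outcomes changes the fit only through the outcome column, which is the structure the stability proof exploits.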
The next result shows how stability and consistency of f̂ yield a rate of convergence relative to an oracle estimator that regresses the true unknown f(Z) on X, and that a further small-bias condition yields oracle efficiency (i.e., asymptotic equivalence to the oracle estimator). This lays out a recipe for deriving convergence rates and conditions for oracle efficiency for general pseudo-outcome regression problems, which we will use in the CATE setup in the next section.
Proposition 1. Under the same setup as Definition 1, recall the conditional bias b̂(x) = E{f̂(Z) − f(Z) | X = x, D^n} and the oracle mean squared error R_n*(x)² = E[{m̃(x) − m(x)}²]. If:

1. the regression estimator Ê_n is stable with respect to the distance d, and
2. d(f̂, f) →_p 0,

then

m̂(x) − m̃(x) = O_P{R_n*(x)} + Ê_n{b̂(X) | X = x}.

If in addition the smoothed bias satisfies Ê_n{b̂(X) | X = x} = o_P{R_n*(x)}, then m̂ is oracle efficient, i.e., asymptotically equivalent to the oracle estimator m̃ in the sense that

{m̂(x) − m̃(x)} / R_n*(x) →_p 0.
We now go on to discuss each component of Proposition 1. The oracle risk R_n*(x) will typically be known (at least up to constants) based on structural assumptions on the target regression function m(x), as well as the form of the estimator Ê_n. For example, if m is s-smooth and Ê_n is an appropriate minimax optimal estimator (e.g., based on series or local polynomial regression), then R_n*(x) ≍ n^{−1/(2+d/s)}. Similarly, if m is s-sparse and Ê_n is an appropriate minimax optimal estimator (e.g., the lasso), then R_n*(x) ≍ √(s log d / n), under some assumptions. The stability condition was described previously; as mentioned there, it holds for linear smoothers and we expect it to hold more generally. The consistency condition d(f̂, f) →_p 0 is mild, since it does not impose any particular rate requirement, and would generally follow from standard regression results (depending on the form of f̂); although consistency at any rate is sufficient for asymptotic negligibility, finite-sample behavior could depend on the rate at which d(f̂, f) converges to zero.
Under just stability of Ê_n and consistency of f̂, one obtains the rate of convergence result m̂(x) − m̃(x) = O_P{R_n*(x)} + Ê_n{b̂(X) | X = x}; the smoothed bias Ê_n{b̂(X) | X = x} is the crucial piece that will generally determine whether the estimator m̂ is oracle efficient, and if not, the rate would follow from this bias term. The bias b̂(x) can be determined on a case-by-case basis, depending on the estimator f̂. For simple plug-in estimators of f, the bias would typically have a first-order dependence on the estimation error in whatever nuisance functions appear in f. However, in some cases one will be able to construct influence-function-based estimators of f that have only a second-order dependence on nuisance estimation error. The availability of such estimators will depend on the form of m, but they would be expected to exist when m is a structured combination of regressions or densities (e.g., a dose-response curve, CATE, counterfactual density, etc.). We refer to Section 5.3 of Kennedy [2022] for some related general discussion of using influence functions for non-pathwise-differentiable quantities like m.
Note that Ê_n{b̂(X) | X = x} is an estimated regression of the bias term b̂(X) at X = x; thus it can be viewed as a smoothed bias, which may be more or less averaged or localized depending on the form of Ê_n. Since the form of b̂(x) will often be well understood (e.g., derived analytically based on the form of f̂), in the following proposition we relate it to its smoothed analog Ê_n{b̂(X) | X = x} when Ê_n is a linear smoother with weights w_i(x; X^n).

Proposition 2 shows that the smoothed bias can be expressed in terms of natural weighted norms of the bias b̂(x) itself, scaled by a constant c_n depending on the smoother weights. For many linear smoothers Σ_i |w_i(x; X^n)| ≤ C with probability one, so that c_n ≍ 1 (e.g., this is a condition in the celebrated theorem of Stone [1977] guaranteeing weak universal consistency of linear smoothers). For series estimators with weights w_i(x; X^n) = ρ(x)^T Q̂^{−1} ρ(X_i)/n, where ρ is some basis of dimension k and Q̂ = P_n(ρρ^T), a simple bound can yield c_n = sup_x ∥ρ(x)∥ (which is of order √k for many series). We suspect the dependence on k can be avoided with more careful analysis, or when working with L₂(P) norms.

DR-Learner
In this section we analyze a two-stage doubly robust estimator we refer to as the DR-Learner, following the naming scheme of Nie and Wager [2021] and Künzel et al. [2019]. After detailing previous work, we describe the algorithm in detail, and then give model-agnostic error bounds which apply for arbitrary first-stage estimators, as long as the second-stage estimator is stable in the sense of Definition 1. We go on to apply the error bound in multiple nonparametric models incorporating smoothness or sparsity structure, and then explore the performance of the DR-Learner in simulations.

Previous Work
Variants of the DR-Learner have been used before, though often tied to particular estimators and without incorporating sample splitting, which can allow for model-agnostic error bounds and reduced bias. van der Laan [2005] (Section 4.2) appears to be the first to propose the general DR-Learner approach, i.e., flexible regressions of the pseudo-outcome (2) below on covariates. Specifically, van der Laan [2005] (along with van der Laan [2013] and Luedtke and van der Laan [2016]) advocates regressing the pseudo-outcome on V ⊆ X to construct candidate CATE estimators, and then selecting among them with a tailored cross-validation approach [van der Laan and Dudoit, 2003]. The main distinction from our work is that they did not give specific error guarantees for the CATE.
To the best of our knowledge, previous papers that do give specific error rates for the DR-Learner either employ stronger nuisance estimation conditions than we show are required, or else do not allow the CATE to be smoother or more structured than the individual regression functions μ_a. Lee et al. [2017] studied a local-linear version of the DR-Learner, but assumed the first-stage nuisance error was negligible. Semenova and Chernozhukov [2017] and Zimmert and Lechner [2019] studied series and local-constant variants, respectively, but required conditions on nuisance estimation that are as restrictive as for the ATE. Fan et al. [2019] also studied a local-constant variant, but did not consider the case where the CATE is smoother than the regression functions.
Our results are closest to Foster and Syrgkanis [2019], who also considered the DR-Learner and gave an oracle inequality for generic empirical risk minimization when the loss involves complex nuisance functions. However, there are some important distinctions to be made. First and most importantly, our error bound for the DR-Learner is doubly robust, involving a simple second-order product of nuisance errors. In contrast, the results of Foster and Syrgkanis [2019] yield errors that are second-order but not doubly robust, instead involving L₄ errors of all nuisance components. This means our error bound is tighter when the propensity score and outcome regressions are estimated at different rates. Second, we focus on local estimation error at a point, whereas Foster and Syrgkanis [2019] consider global error (e.g., integrated MSE); in this sense our work is complementary. A third important distinction is that our results can be used to justify the validity of inferential procedures, such as confidence intervals and hypothesis tests, whereas Foster and Syrgkanis [2019] focus on global rates of convergence.
There is also a related literature on regression with estimated pseudo-outcomes, similar in spirit to Foster and Syrgkanis [2019], though giving somewhat more specialized results. For example, Ai and Chen [2003] and Rubin and van der Laan [2005] both studied regression with general pseudo-outcomes, though they focused on particular estimators and did not give doubly robust error bounds. Ai and Chen [2003] restricted their attention to sieves, and focused more on a finite-dimensional component appearing in the pseudo-outcome. Rubin and van der Laan [2005] considered least squares, penalized methods, and linear smoothers in the second stage, but their error bounds are not doubly robust. Künzel et al. [2019] considered various method-agnostic "meta-learners", but not the DR-Learner; further, none of their methods are doubly robust, and so in general they would inherit larger error rates from the underlying regression estimators.

Construction & Analysis
The algorithm below describes our proposed construction of the DR-Learner.

Algorithm 1 (DR-Learner). Let (D^n_1, D^n_2) denote two independent samples of n observations of Z = (X, A, Y).

Step 1. Nuisance training: Using the training sample D^n_1, construct estimates π̂ of the propensity score π and (μ̂_0, μ̂_1) of the regression functions (μ_0, μ_1).

Step 2. Pseudo-outcome regression: Construct the pseudo-outcome

φ̂(Z) = [(A − π̂(X)) / {π̂(X)(1 − π̂(X))}] {Y − μ̂_A(X)} + μ̂_1(X) − μ̂_0(X)    (2)

and regress it on covariates X in the test sample D^n_2, yielding

τ̂_dr(x) = Ê_n{φ̂(Z) | X = x}.

Step 3. Cross-fitting (optional): Repeat Steps 1-2, swapping the roles of D^n_1 and D^n_2 so that D^n_2 is used for nuisance training and D^n_1 as the test sample. Use the average of the resulting two estimators as a final estimate of τ. K-fold variants are also possible.

Figure 3: Illustration of the DR-Learner. In the first stage, nuisance estimates are constructed from the training sample D^n_1. In the second stage, these estimates are used to construct an estimate of the pseudo-outcome φ, which is then regressed on X using the test sample D^n_2.
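A minimal sketch of the two-stage procedure in Python (the paper's code is in R; the k-NN and constant-CATE toy model here are our own stand-ins, not the paper's choices): nuisances are trained on one half of the data, then the doubly robust pseudo-outcome is built and regressed on X in the other half.

```python
import numpy as np

def knn_regress(x_train, y_train, x_eval, k=50):
    # Crude k-nearest-neighbor regression in one dimension (stand-in for any
    # first- or second-stage regression method).
    out = np.empty(len(x_eval))
    for j, x0 in enumerate(x_eval):
        idx = np.argsort(np.abs(x_train - x0))[:k]
        out[j] = y_train[idx].mean()
    return out

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(-1, 1, 2 * n)
pi = 0.5 + 0.3 * np.sign(x)                 # true propensity score
a = rng.binomial(1, pi)
tau = 1.0                                   # constant CATE in this toy model
y = np.sin(np.pi * x) + tau * a + 0.3 * rng.standard_normal(2 * n)

d1, d2 = slice(0, n), slice(n, 2 * n)       # independent halves D^n_1, D^n_2

# Step 1 (nuisance training on D^n_1), evaluated at the test covariates.
pi_hat = np.clip(knn_regress(x[d1], a[d1].astype(float), x[d2]), 0.05, 0.95)
mu0_hat = knn_regress(x[d1][a[d1] == 0], y[d1][a[d1] == 0], x[d2])
mu1_hat = knn_regress(x[d1][a[d1] == 1], y[d1][a[d1] == 1], x[d2])

# Step 2 (pseudo-outcome regression on D^n_2): build phi-hat as in (2) and
# regress it on X; here we read off the fit at the single point x = 0.
a2, y2 = a[d2], y[d2]
mu_a2 = np.where(a2 == 1, mu1_hat, mu0_hat)
phi = (a2 - pi_hat) / (pi_hat * (1 - pi_hat)) * (y2 - mu_a2) \
    + mu1_hat - mu0_hat
tau_dr = knn_regress(x[d2], phi, np.array([0.0]), k=500)[0]  # near tau = 1
```

Step 3 (cross-fitting) would simply repeat the above with `d1` and `d2` swapped and average the two fits.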
Remark 3. The DR-Learner approach is motivated by the fact that (2) is the (uncentered) efficient influence function for the ATE [Hahn, 1998, Robins and Rotnitzky, 1995]; this drives many of its favorable properties. The intuition is that, to efficiently estimate the ATE, the standard doubly robust estimator averages the pseudo-outcome φ; so, to estimate the CATE, the DR-Learner regresses φ on covariates. For a review of influence functions and semiparametric theory we refer to van der Laan and Robins [2003], Tsiatis [2006], and Kennedy [2022].
Next we present the main result of this section, which gives error bounds for the DR-Learner procedure (relative to an oracle) for arbitrary first-stage estimators, as long as the second-stage estimator is stable in the sense of Definition 1.
Theorem 2. Let τ̂_dr(x) denote the DR-Learner estimator detailed in Algorithm 1, which regresses the estimated pseudo-outcome φ̂(Z) on covariates X. Assume:

1. the second-stage regression estimator Ê_n is stable with respect to a distance d, and
2. d(φ̂, φ) →_p 0.

Let τ̃(x) = Ê_n{φ(Z) | X = x} denote an oracle estimator that regresses the true pseudo-outcome φ(Z) on X, and denote its risk by R_n*(x). Then

τ̂_dr(x) − τ̃(x) = O_P{R_n*(x)} + Ê_n{b̂(X) | X = x},

where the conditional bias of the pseudo-outcome satisfies |b̂(x)| ≲ |π̂(x) − π(x)| Σ_a |μ̂_a(x) − μ_a(x)| (assuming π̂ is bounded away from zero and one); if this smoothed product of nuisance errors is o_P{R_n*(x)}, then τ̂_dr(x) is oracle efficient in the sense of Proposition 1.

The bound on the DR-Learner error given in Theorem 2 shows that it can deviate from the oracle error by at most a (smoothed) product of errors in the propensity score and regression estimators, thus allowing faster rates for estimating the CATE even when the nuisance estimates converge at slower rates. Importantly, the result is agnostic about the methods used, and requires no special tuning or undersmoothing.
Theorem 2 gives a smaller risk bound compared to earlier work. Semenova and Chernozhukov [2017] and Zimmert and Lechner [2019] gave remainder errors larger than the product of the nuisance mean squared errors, with oracle efficiency requiring ATE-like conditions on the nuisance estimators for their series and kernel-based second-stage regressions, respectively. Fan et al. [2019] assumed the CATE was only as smooth as the individual regression functions, giving a larger oracle risk. The results of Foster and Syrgkanis [2019] are closest to ours, but as mentioned earlier their error bounds are not doubly robust, and instead involve L₄ errors of all nuisance components.

Remark 4. Several important oracle inequalities for cross-validated selection of estimators exist in the literature [Díaz et al., 2018, Luedtke and van der Laan, 2016, van der Laan and Dudoit, 2003]. These are relevant for cross-validated CATE estimation, but are conceptually different from our Theorem 2; for example, they use a different oracle. Namely, in cross-validated selection the oracle is the best performing among a group of learners, whereas in our setup the oracle is the specified learner Ê_n when it is given access to the true pseudo-outcome φ(Z).
Remark 5. In addition to showing double robustness and giving conditions for oracle efficiency, Theorem 2 also has important implications for inference. Namely, if the DR-Learner is oracle efficient, so that τ̂(x) − τ̃(x) = o_P{R_n*(x)}, then whenever an inferential result is available for the oracle estimator τ̃(x) (which is just a standard regression of oracle pseudo-outcomes on covariates), this would also apply to τ̂(x), immediately allowing for the construction of confidence intervals, tests, etc. For example, if one used a local polynomial estimator for the second-stage regression Ê_n, then under standard conditions [Masry, 1996] τ̃(x) would be asymptotically normal, and so τ̂(x) would be as well, by virtue of the oracle equivalence. Thus standard confidence intervals could be constructed, treating the estimated pseudo-outcomes as one would usual observed outcomes in nonparametric regression. This is a benefit of the asymptotic equivalence to the oracle in Theorem 2. In contrast, the results of Foster and Syrgkanis [2019], for example, instead give conditions under which the global risk of a DR-Learner is within a multiplicative constant of that of the oracle, which is not as immediately useful for inference.

Examples & Illustrations
An important feature of the oracle result in Theorem 2 is that it is essentially model-free: beyond a mild consistency assumption, it only requires that the second-stage regression estimator satisfies the stability condition in Definition 1.In the following corollaries, we illustrate the flexibility of this result by applying it in settings where the nuisance functions and CATE are smooth or sparse (i.e., in settings where local polynomial, series, lasso, or random forest estimators would work well).Similar results could be obtained in generic models with known bounds on mean squared error rates.
Corollary 1. Suppose the assumptions of Theorem 2 hold. Further assume:

1. Ê_n is a minimax optimal linear smoother with Σ_i |w_i(x; X^n)| = O_P(1);
2. the propensity score π and regression functions μ_a are α-smooth and β-smooth, respectively, and are estimated at the corresponding minimax rates n^{−1/(2+d/α)} and n^{−1/(2+d/β)}; and
3. the CATE τ is γ-smooth.

Then

τ̂_dr(x) − τ̃(x) = O_P{n^{−1/(2+d/γ)}} + O_P{n^{−1/(2+d/α)} n^{−1/(2+d/β)}},

so that τ̂_dr attains the oracle rate n^{−1/(2+d/γ)} whenever

1/(2 + d/α) + 1/(2 + d/β) ≥ 1/(2 + d/γ),    (4)

and is oracle efficient when the inequality is strict.
Corollary 1 illustrates how the DR-Learner can adapt to smoothness in the CATE even when the propensity score and regression functions may be less smooth, and gives sufficient conditions for achieving the oracle rate depending on nuisance smoothness and dimension d.
It is instructive to compare the sufficient condition (4) to the analogous condition for root-n consistency of a standard doubly robust estimator of the average treatment effect, which is √(αβ) ≥ d/2 (cf. Equation 25 of Robins et al. [2009a]). First, as the CATE smoothness γ gets larger, the sufficient condition (4) for the CATE approaches that for the ATE, as should be expected (intuitively, an infinitely smooth CATE should be nearly as easy to estimate as the ATE). Second, when α + β < γ, condition (4) can be rewritten as √(αβ) ≥ (d/2) √{(1 − (α+β)/γ) / (1 + d/γ)}, where the factor multiplying d/2 is strictly less than one; this can be interpreted as a "lowered bar" for optimal estimation, due to the fact that the oracle rate n^{−1/(2+d/γ)} is slower than root-n. This phenomenon was also noted in the dose-response estimation problem by Kennedy et al. [2017], and is in contrast with the error bounds given in Nie and Wager [2021] and Zimmert and Lechner [2019], which required ATE-like conditions for oracle efficiency of CATE estimators, via n^{−1/4} or faster rates on the nuisance estimators (note that if α = β, then √(αβ) ≥ d/2 means the nuisance error rate is faster than n^{−1/4}). In the next section we will show how the sufficient condition (4) can even be improved upon.
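The comparison can be sketched numerically; the function names below are ours. The exponents encode the pointwise minimax rate n^{−1/(2+d/s)} for an s-smooth function, so CATE oracle efficiency asks the sum of the nuisance-rate exponents to beat the oracle exponent, while root-n ATE estimation asks it to beat 1/2:

```python
def exponent(s, d):
    # Pointwise minimax RMSE exponent for an s-smooth function in d
    # dimensions: the rate is n^{-1/(2 + d/s)}.
    return 1.0 / (2.0 + d / s)

def cate_oracle_ok(alpha, beta, gamma, d):
    # Product of nuisance rates at least as fast as the oracle CATE rate.
    return exponent(alpha, d) + exponent(beta, d) >= exponent(gamma, d)

def ate_root_n_ok(alpha, beta, d):
    # Product of nuisance rates at least as fast as root-n; equivalent to
    # sqrt(alpha * beta) >= d / 2.
    return exponent(alpha, d) + exponent(beta, d) >= 0.5

d = 4
# With alpha = beta = 1.5, sqrt(alpha * beta) = 1.5 < d/2 = 2, so the ATE
# condition fails; yet a smoother CATE (gamma = 6) can still be estimated
# at its oracle rate.
print(ate_root_n_ok(1.5, 1.5, d), cate_oracle_ok(1.5, 1.5, 6.0, d))  # -> False True
```

This is the "lowered bar" in action: the slower-than-root-n oracle target makes the nuisance requirement strictly weaker than in the ATE problem.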
Similar results could be obtained for other models. For example, suppose the propensity score and regression functions are α- and β-sparse, respectively, and estimated at rates √(α log d/n) and √(β log d/n), and similarly for the CATE with γ-sparsity. The corresponding condition for oracle efficiency would be αβ log² d/n² ≲ γ log d/n, which is αβ ≲ γn/log d. However, we note that we have not proved that second-stage estimators attaining these rates satisfy the stability condition of Definition 1.
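Spelling out the sparse-model calculation (a sketch, assuming as above that the nuisance RMSEs are of order √(α log d/n) and √(β log d/n) and the oracle RMSE is of order √(γ log d/n)):

```latex
\sqrt{\frac{\alpha \log d}{n}} \cdot \sqrt{\frac{\beta \log d}{n}}
= \sqrt{\frac{\alpha\beta \log^2 d}{n^2}}
\;\lesssim\; \sqrt{\frac{\gamma \log d}{n}}
\quad\Longleftrightarrow\quad
\alpha\beta \;\lesssim\; \frac{\gamma n}{\log d}.
```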

Simulation Experiments
In this section we study some finite-sample properties via simulations (R code is available in Section 8). We use four methods for CATE estimation: a plug-in that estimates the regression functions µ_0 and µ_1 and takes the difference (called the T-Learner by Künzel et al. [2019]), the X-Learner from Künzel et al. [2019], the DR-Learner from Section 4, and an oracle DR-Learner that uses the true pseudo-outcome in the second-stage regression.
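For concreteness, the pseudo-outcome entering the DR-Learner's second stage can be sketched in a few lines. This is a hedged Python illustration (the paper's simulation code is in R), assuming the standard doubly robust (AIPW-style) form for the pseudo-outcome:

```python
import numpy as np

def dr_pseudo_outcome(y, a, pi_hat, mu0_hat, mu1_hat):
    """AIPW-style pseudo-outcome
        phi = (A - pi)/{pi (1 - pi)} (Y - mu_A) + mu_1 - mu_0,
    whose conditional mean given X equals the CATE when the
    nuisance estimates are correct."""
    mu_a = np.where(a == 1, mu1_hat, mu0_hat)
    return (a - pi_hat) / (pi_hat * (1.0 - pi_hat)) * (y - mu_a) + mu1_hat - mu0_hat
```

Regressing this pseudo-outcome on X in a held-out sample gives the DR-Learner; the oracle version instead plugs in the true nuisance functions.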
First we use the piecewise polynomial model from the motivating example in Section 2.2, with the outcome and second-stage regressions fit using smoothing.spline in R. Figure 4a shows the mean squared error for the four CATE methods at n = 2000 (based on 500 simulations with MSE averaged over 500 independent test samples), across a range of convergence rates for the propensity score estimator π̂. To control the convergence rate we constructed this estimator as π̂ = expit{logit(π) + ε_n}, where ε_n ∼ N(n^{−α}, n^{−2α}), so that RMSE(π̂) ∼ n^{−α} (the estimator was truncated to lie within [0.01, 0.99] to ensure positivity holds). The results show that the plug-in estimator inherits the large error in estimating the individual regression functions, while the DR-Learner achieves much smaller errors and adapts to the smoothness of the CATE. The X-Learner has MSE in between the two. Consistent with Theorem 2, the MSE of the DR-Learner approaches that of the oracle as the propensity score estimation error decreases (i.e., as the convergence rate gets faster).
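The controlled-rate propensity estimator above can be sketched as follows (a Python illustration; the paper's code is in R):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def noisy_propensity(pi, n, alpha, rng):
    """pi_hat = expit{logit(pi) + eps_n} with eps_n ~ N(n^-alpha, n^-2alpha),
    so that RMSE(pi_hat) is of order n^-alpha; truncated to [0.01, 0.99]
    so that positivity holds."""
    eps = rng.normal(loc=n ** (-alpha), scale=n ** (-alpha), size=np.shape(pi))
    return np.clip(expit(logit(np.asarray(pi)) + eps), 0.01, 0.99)
```

Varying alpha then traces out the propensity-score convergence rates on the horizontal axis of Figure 4a.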
Next we consider a high-dimensional logistic model where X = (X_1, ..., X_d) ∼ N(0, I_d), and the propensity score and individual regressions are logistic functions of the first α and β covariates, respectively, so that they are α- and β-sparse, while the CATE is zero. The normalization of the coefficients ensures π(X) ∈ [0.2, 0.8] with high probability, and similarly for µ_a. We use standard cross-validation-tuned lasso for all model fitting, i.e., to estimate the propensity scores, regression functions, and second-stage fits. Figure 4b shows mean squared errors for the four CATE methods when d = 500 and α = β = 50, for n = 200 and n = 2000, across 100 simulations (the median of the mean squared errors is reported due to high skewness). As expected from Theorem 2, the DR-Learner is closer to the oracle than the plug-in or X-Learner, though this high-dimensional setup yields larger estimation error than the previous simulation model. The relative performance of the DR-Learner seems to improve with sample size.
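A minimal sketch of this data-generating process in Python (the paper's simulations use R with cross-validated lasso fits; the specific coefficient normalization below is a hypothetical stand-in for the one used in the paper):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_logistic_dgp(n, d, alpha, beta, rng):
    """X ~ N(0, I_d); alpha-sparse logistic propensity score and a
    beta-sparse logistic regression shared by both arms, so the CATE
    is identically zero. The 1/(2 sqrt(alpha)) and 1/(2 sqrt(beta))
    normalizations are hypothetical choices that keep the linear
    predictors O(1), so pi(X) stays in a moderate range w.h.p."""
    x = rng.standard_normal((n, d))
    pi = expit(x[:, :alpha].sum(axis=1) / (2.0 * np.sqrt(alpha)))
    mu = expit(x[:, :beta].sum(axis=1) / (2.0 * np.sqrt(beta)))  # mu_0 = mu_1
    a = rng.binomial(1, pi)
    y = rng.binomial(1, mu)
    return x, a, y, pi, mu
```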

Local Polynomial R-Learner
The previous section gave general sufficient conditions under which the DR-Learner attains the oracle error rate of an estimator with direct access to the difference Y^1 − Y^0, showing that this rate can in fact be achieved whenever the product of nuisance errors is of smaller order. This raises the crucial question of what happens when this product is not sufficiently small: in such regimes, is there any hope of still attaining the oracle error rate?
The current section provides a first answer to this question, with a more refined analysis of a different estimator. Specifically, we analyze a double-sample-split local polynomial adaptation of the R-Learner [Nie and Wager, 2021, Robinson, 1988], which we call the lp-R-Learner for short. The R-Learner of Nie and Wager [2021] is a nonparametric RKHS regression-based extension of the double-residual regression method of Robinson [1988]. A nonparametric series-based version of the R-Learner was also proposed in Example 4 of Robins et al. [2008], though assuming known propensity scores and not incorporating outcome regression. Chernozhukov et al. [2017] studied a lasso version of the R-Learner. Some important distinctions between our results and those in previous work include the following: (i) our estimator is built from local polynomials, and incorporates a specialized form of sample splitting inspired by Newey and Robins [2018] for bias reduction, (ii) our sufficient conditions for attaining oracle efficiency are substantially weaker than the n^{−1/4} rates in Nie and Wager [2021] and Chernozhukov et al. [2017], and (iii) we give specific rates of convergence outside the oracle regime.
We first describe the lp-R-Learner in detail, then give the main error bound result, which holds under a Hölder-smooth model and is valid for a wide variety of tuning parameter choices. Following that, we optimize the bound with specific tuning parameter choices, under different sets of conditions, and discuss the resulting rates.

Construction
The algorithm below describes the lp-R-Learner construction.
Let b : R^d → R^p denote the vector of basis functions consisting of all powers of each covariate, up to order ⌊γ⌋, and all interactions up to degree-⌊γ⌋ polynomials (cf. Masry [1996]). (5) Step 3. Cross-fitting (optional): Repeat Steps 1-2 twice, first using (D^n_{1b}, D^n_2) for nuisance training and D^n_{1a} as the test sample, and then using (D^n_{1a}, D^n_2) for training and D^n_{1b} as the test sample. Use the average of the resulting three estimators of τ as the final estimator τ̂_r. Remark 6. The kernel weights in the second-step regression need to be multiplied by the ratio (A − π̂_a)/(A − π̂_b) in order to ensure the independence of relevant products of nuisance estimators (i.e., that they are built from separate samples D^n_{1a} and D^n_{1b}). This allows for multiplicative biases, and thus faster rates due to undersmoothing, first exploited by Newey and Robins [2018], albeit for √n-estimable functionals. In other words, this ensures the bias of the lp-R-Learner equals a product of biases of the nuisance estimators; other ratios used in this kernel weight would therefore generally not work in the same way.
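A minimal sketch of constructing this polynomial basis (Python; the ordering that places the constant term first, so that b(0) = (1, 0, …, 0), is a hypothetical but convenient convention):

```python
import itertools
import numpy as np

def poly_basis(u, max_degree):
    """b(u): all monomials u_1^{j_1} * ... * u_d^{j_d} with total degree
    |j| <= max_degree, i.e., the local polynomial basis at order
    floor(gamma). The constant term comes first, so b(0) = (1, 0, ..., 0)."""
    u = np.asarray(u, dtype=float)
    d = len(u)
    feats = []
    for total in range(max_degree + 1):
        # multi-indices j = (j_1, ..., j_d) with j_1 + ... + j_d == total
        for j in itertools.product(range(total + 1), repeat=d):
            if sum(j) == total:
                feats.append(np.prod(u ** np.array(j)))
    return np.array(feats)
```

The basis has p = C(d + ⌊γ⌋, ⌊γ⌋) components, e.g., 6 terms for d = 2 and degree 2.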
Figure 5 gives a schematic illustrating the lp-R-Learner construction.
Figure 5: Schematic illustrating the lp-R-Learner approach. In the first stage, the nuisance functions π̂_a and (π̂_b, η̂) are estimated from training samples D^n_{1a} and D^n_{1b}, respectively. In the second stage, these estimates are used in a kernel-weighted least squares regression of the residuals (Y − η̂) on the residual-scaled basis terms (A − π̂_b)b, with weights [(A − π̂_a)/(A − π̂_b)] K_hx.

Main Error Bound & Oracle Results
Before giving the main error bound in this section, we first present the following condition on the nuisance estimators that we use in the analysis.At a high level, this condition requires the nuisance estimators to be linear smoothers with particular bias and variance bounds.
Condition 1. The nuisance estimators (π̂_a, π̂_b, η̂) are linear smoothers of the form given in (1a), with weights w_{i•}(x; X^n_{1•}) depending on a tuning parameter k, which are localized in the sense of (1b), and which satisfy the conditional bias and variance bounds in (1c). Conditions (1a)-(1c) are relatively standard. Many popular estimators take the form given in (1a), as discussed just before Theorem 1. Condition (1b) holds for several prominent linear smoothers. For example, it holds for series estimators built from k basis terms, if properly localized, and for standard kernel or local polynomial estimators when taking the bandwidth parameter as h ∼ k^{−1/d}, as shown for example in Proposition 1.13 of Tsybakov [2009].
Condition (1c) has also been shown to hold for series and local polynomial estimators, for example, when the underlying regression function is appropriately smooth. In particular, under standard conditions, the bias part of (1c) would hold for these methods when the propensity score π is α-smooth and the regression function η is β-smooth; we again refer to Belloni et al. [2015] and Tsybakov [2009] for a review of related results. For (1c) to hold uniformly over all x ∈ X would typically mean these bounds only hold up to log factors; however, our result only requires the bias to be controlled locally, near the point at which the CATE is to be estimated.
The next result gives error bounds on the lp-R-Learner from Algorithm 2, under Hölder smoothness conditions. Theorem 3. Let τ̂_r(x) denote the lp-R-Learner estimator detailed in Algorithm 2. Assume: 1. The estimator η̂ and observations Z are bounded, and X has density bounded above. 2. The propensity score π and its estimators are bounded away from zero and one by some ϵ > 0.
3. The eigenvalues of the sample Gram matrices Q hx and Q hx defined in (12) are bounded away from zero in probability.
4. The nuisance estimators ( π a , π b , η) satisfy Condition 1, with the bias and variance bounds holding for all x ′ such that ∥x ′ − x∥ ≤ h.
Let s = (α + β)/2 denote the average smoothness of the propensity score and regression function. Then, if the CATE τ(x) is γ-smooth and k^{−2(α∧β)/d} ≲ k/n, we have, in probability,

τ̂_r(x) − τ(x) ≲ h^γ + k^{−2s/d} + k^{−2α/d} + (1/√(nh^d))(1 + k/n).

Before detailing the result and implications of Theorem 3, we first discuss the assumptions. The first part of Assumption 1 is mostly to simplify presentation, and could be weakened at the cost of added complexity; the second part, ensuring X has bounded density, is more crucial, but still mild. Assumption 2 is standard in the causal literature, and in theory could be guaranteed by simply thresholding propensity score estimates; however, extreme propensity values (i.e., positivity violations) are an important issue in practice, especially in the nonparametric and high-dimensional setup [D'Amour et al., 2017]. Assumption 3 is relatively standard (see, e.g., Assumption (LP1) of Tsybakov [2009]) but restricts how n and h may scale; when d is fixed, standard bandwidth choices should suffice. Assumption 4 is arguably the most crucial, and does the most work in the proof (together with the specialized sample splitting); however, as detailed in the discussion of Condition 1 prior to the theorem statement, Assumption 4 uses standard conditions commonly found in the nonparametric regression literature. Now we give some discussion and interpretation of the (in-probability) error bound of Theorem 3.
The first three terms are the bias, and the last two are the variance (on the standard deviation scale). The bias has three components. The first h^γ bias term comes from the bias of an oracle estimator with access to the true propensity score and regression function, and matches the bias of an oracle with direct access to the difference Y^1 − Y^0. The other two bias terms come from nuisance estimation: the first k^{−2s/d} term is the product of the biases of the propensity score and regression estimators, whereas the second k^{−2α/d} term is the squared bias of the propensity score estimator. If the propensity score is at least as smooth as the regression function, then the first k^{−2s/d} term will dominate. Some heuristic intuition about why these specific bias terms arise is as follows: by virtue of its least squares construction, the lp-R-Learner can be viewed as a product of the inverse of an "X^T X-like" term involving products of π̂_a and π̂_b, and an "X^T Y-like" term involving products of π̂_a and η̂. The variance has two components. As with the bias, the first 1/√(nh^d) term is the standard deviation of an oracle estimator with access to the true nuisance functions. The second term is the product of this oracle standard error with k/n, which is the additional variance factor coming from nuisance estimation (in fact k/n is the product of the standard deviations of the nuisance estimators). In standard regimes the tuning parameter k would have to be chosen of smaller order than n (e.g., k log k/n → 0 as in Belloni et al. [2015] and elsewhere) in order for Condition 1 to hold, making the variance contribution from nuisance estimation asymptotically negligible. This last point will be discussed further shortly. Now we give conditions under which the oracle rate is achieved by the lp-R-Learner, which we conjecture are not only sufficient but also necessary.
Corollary 2. Suppose the assumptions of Theorem 3 hold. Further assume α ≥ β, so that the propensity score is at least as smooth as the regression function, and take: 1. h ∼ n^{−1/(2γ+d)}, and 2. k ∼ n/log² n, so that k log k/n → 0.
Then, up to log factors, τ̂_r(x) − τ(x) = O_P(n^{−γ/(2γ+d)} + n^{−2s/d}), and the oracle rate is achieved if the average nuisance smoothness satisfies s ≥ (d/4)/(1 + d/(2γ)).
Corollary 2 shows that an undersmoothed lp-R-Learner can be oracle efficient under weaker conditions than the error bound given in Theorem 2 would indicate for the DR-Learner.
Remark 7. Note that taking k ∼ n/log² n amounts to undersmoothing as much as possible: it drives down the nuisance bias while letting the variance multiplier k/n go to zero only very slowly, since the latter's contribution is asymptotically negligible for CATE estimation as long as k ≲ n. In general, undersmoothing can often require specific knowledge of smoothness parameters, and data-driven approaches are often lacking. However, it is worth noting that in Corollary 2 the choice of k depends only on the sample size, and not on the underlying nuisance smoothness, for example, making it more feasible to implement. For example, one could set k and then use cross-validation or other approaches to select h [Bibaut and van der Laan, 2017, van der Laan and Dudoit, 2003]; however, we leave a formal study of this to future work.
The result in Corollary 2 is reminiscent of a similar phenomenon in marginal ATE estimation, where Robins et al. [2009b] showed that the condition s ≥ d/4 is necessary and sufficient for the existence of root-n consistent estimators of the ATE functional E{E(Y | X, A = 1)}. Our result thus shows that s ≥ d/4 is also sufficient for oracle efficient estimation of the CATE, but now the oracle rate is n^{−γ/(2γ+d)} rather than root-n, and so there is in fact a weaker bar for oracle efficiency (namely s ≥ (d/4)/(1 + d/(2γ))), depending on the smoothness γ. At one extreme, when the CATE is infinitely smooth, the condition we give for oracle efficiency of the lp-R-Learner recovers the usual s ≥ d/4 condition for the ATE. At the other extreme, when the CATE is non-smooth, oracle efficiency can be much easier to achieve (e.g., if the CATE is only γ = 1-smooth, it is only required that s ≥ 1/2 even for arbitrarily large dimension d).
We conjecture that the condition s ≥ (d/4)/(1 + d/(2γ)) is not only sufficient but may also be necessary for oracle efficiency in the above Hölder model, making the proposed lp-R-Learner minimax optimal in this regime when α ≥ β; however, we leave a proof of this to future work.

When s < (d/4)/(1 + d/(2γ))
When the oracle rate is not achieved, the rate n^{−2s/d} is slower than the usual functional estimation rate n^{−4s/(4s+d)}, which is minimax optimal for the ATE if the covariate density is smooth enough [Robins et al., 2009b], and which also appears for simpler functionals like the expected density [Bickel and Ritov, 1988, Birgé and Massart, 1995]. To illustrate this gap, if s = d/8 then the rate in Corollary 2 is n^{−1/4} while the usual functional minimax rate is n^{−1/3}. Remark 8. There is a trade-off between the DR-Learner and lp-R-Learner. In short, the DR-Learner provides somewhat more general guarantees and is computationally more straightforward, while the lp-R-Learner can achieve faster rates in somewhat more specialized Hölder smoothness models, but is also somewhat more difficult to compute in practice, and requires undersmoothing.

Faster Rates with Known Covariate Density
Here we briefly consider how the lp-R-Learner rates from Corollary 2 can be improved by exploiting structure in the covariate density, as in Robins et al. [2008, 2017]. This is somewhat paradoxical, since the CATE itself does not depend on the covariate density. Some intuition for a possible rate improvement is as follows. If the density of the covariates X were known, then one could construct nuisance estimators (π̂_a, π̂_b, η̂) satisfying Condition 1 without any restriction on the tuning parameter k, such as the k log k/n → 0 restriction employed in Corollary 2. For example, a series estimator η̂ could satisfy Condition 1 for any choice of k, if it took a form based on the population Gram matrix, where b is a vector of appropriate basis functions (different from those used in the lp-R-Learner construction), and similarly for π̂. Intuitively this is because restrictions like k log k/n → 0 are only required to ensure the inverse of the sample Gram matrix P_n(bb^T) converges to a limit (in operator norm), whereas if the density of X were known, then one could simply compute the population Gram matrix E(bb^T) exactly.
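To illustrate the intuition, the population Gram matrix is available in closed form for simple known densities; here is a hedged Python sketch with X ~ Uniform(0, 1) and a plain monomial basis (both hypothetical choices for illustration only):

```python
import numpy as np

def population_gram_uniform(k):
    """E{b(X) b(X)^T} for b(x) = (1, x, ..., x^{k-1}) and X ~ Uniform(0, 1):
    entry (i, j) is E(X^{i+j}) = 1/(i + j + 1), computable exactly for any k
    when the covariate density is known -- no sample-size restriction needed."""
    idx = np.arange(k)
    return 1.0 / (idx[:, None] + idx[None, :] + 1.0)

def sample_gram_uniform(k, n, rng):
    """Empirical counterpart P_n(b b^T), whose inverse is only stable
    when k grows slowly relative to n."""
    x = rng.uniform(size=n)
    b = x[:, None] ** np.arange(k)[None, :]
    return b.T @ b / n
```

With the density known, the exact population Gram matrix replaces the sample one, removing the need for k log k/n → 0.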
We conjecture however that the n −2s/d rate from Corollary 2 may not be improvable in the non-random fixed X case.A fixed design setup was recently considered by Gao and Han [2020], though in a somewhat specialized model where the propensity scores have zero smoothness.In fact when α = 0 our rate matches their lower bound in the non-oracle regime.
The next result gives rates for the lp-R-Learner when there are no restrictions on the choice of nuisance tuning parameter k.
Corollary 3. Suppose the assumptions of Theorem 3 hold, and in particular suppose Assumption 4 holds without any restrictions on k (e.g., as if the density of X is known).
Then, if α ≥ β, the convergence rate in Theorem 3 is optimized by balancing the dominant bias and variance terms, yielding the rate n^{−3s/(2s+d(1+s/γ))}, and so the oracle rate is achieved if the average nuisance smoothness satisfies s ≥ (d/4)/(1 + d/(2γ)).
Remark 9. When the density of the covariates X is unknown but smooth and estimable at fast enough rates, we expect results similar to those in Corollary 3 to still hold.This phenomenon also occurs for the ATE functional [Robins et al., 2008[Robins et al., , 2017]].However, here the analysis of the lp-R-Learner becomes much more complicated, so we leave this to future work.
When the nuisance tuning parameter k is unrestricted, one needs to balance all three dominant terms from Theorem 3: the two bias terms h^γ and k^{−2s/d}, as well as the variance term (k/n)/√(nh^d). Notably, the same threshold s ≥ (d/4)/(1 + d/(2γ)) occurs here even when the density is known, perhaps again suggesting that this condition for oracle efficiency may be both sufficient and necessary. Whether the rate n^{−3s/(2s+d(1+s/γ))} is minimax optimal in the non-oracle regime is unknown; we will pursue this in future work.
Note that the rate n^{−3s/(2s+d(1+s/γ))} is slightly slower than the usual functional estimation rate n^{−4s/(4s+d)}. For example, when γ → ∞ and s = d/8, the CATE rate from Corollary 3 is n^{−3/10} whereas the usual functional estimation rate is n^{−1/3}. Thus there do appear to be benefits to exploiting structure in the covariate density, but whether the gap between the improved rate of Corollary 3 and the usual rate can be closed is unclear. An interesting, more philosophical question is whether structure in the covariate density should be exploited for CATE estimation in practical settings, since the CATE itself does not depend on the covariate density.
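The rate in Corollary 3 can be recovered by balancing the three dominant terms in Theorem 3 (a sketch of the calculation, assuming α ≥ β so that the k^{−2α/d} bias term is dominated):

```latex
h^{\gamma} \asymp k^{-2s/d} \asymp \frac{k/n}{\sqrt{n h^{d}}}
\;\Longrightarrow\; k \asymp h^{-\gamma d/(2s)}, \qquad
h^{\gamma + \gamma d/(2s) + d/2} \asymp n^{-3/2}
\;\Longrightarrow\; h^{\gamma} \asymp n^{-\frac{3\gamma}{2\gamma + \gamma d/s + d}}
= n^{-\frac{3s}{2s + d(1 + s/\gamma)}}.
```

Setting this equal to the oracle rate n^{−γ/(2γ+d)} and solving for s recovers the threshold s ≥ (d/4)/(1 + d/(2γ)).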

Discussion
In this paper we studied the problem of estimating conditional treatment effects, giving new results that apply to a wider variety of methods and that better exploit when the CATE itself is structured, compared to the current state of the art. Sections 3 and 4 were more practically oriented, giving model-free yet informative error bounds for general regression with estimated outcomes, and the DR-Learner method for CATE estimation, along with examples from smooth and sparse models, illustrating the flexibility of Proposition 1 and Theorem 2. In contrast, Section 5 was more theoretically oriented, aiming instead at understanding the fundamental statistical limits of CATE estimation, i.e., the smallest possible achievable error. We derived upper bounds on this error with a specially constructed (and tuned) estimator called the lp-R-Learner. Namely, we showed that in a Hölder model, oracle efficiency is possible under weaker conditions on the nuisance smoothness than indicated by the DR-Learner error bound given in Theorem 2, which itself is an improvement over conditions given in previous work.
Figure 6 summarizes the various error rates in this paper graphically, in an illustrative setup where the dimension is d = 20, the CATE smoothness is γ = 2d, and the nuisance smoothness α = β = s varies from 0 to 10. This shows the various gaps between methods/rates, in an interesting regime where the CATE is relatively smooth compared to the nuisance functions.
Our work raises some interesting open questions, some of the more immediate of which we list here for reference: 1. Can the error bounds in Proposition 1 and Theorem 2 be improved without committing to particular first-or second-stage methods?
2. What rates can be achieved by specialized sample splitting and tuning of a DR-Learner, rather than an R-Learner?
3. What are analogous results to Corollary 1 for other function classes?
4. Can the error bounds in Theorem 3 be obtained with other methods? A natural alternative is a higher-order influence function approach [Robins et al., 2008, 2017], which could avoid the α ≥ β condition, perhaps at the cost of some extra complexity.
We have obtained partial answers to some of these questions, but most remain unanswered and are left for future work. One of the deepest open questions is whether the rates given in Corollaries 2 and 3 are minimax optimal or not (in the fixed and random design setups, respectively).

Proof of Theorem 1
Here we use the notation of Proposition 1. Letting the quantity below denote the numerator of the left-hand side of (1), we will show the displayed bound, which yields the result when 1/∥σ̂∥_{w,2} = O_P(1). First note that for linear smoothers we have the given expansion, and this term has mean zero by definition of b̂ and iterated expectation. Therefore the second line follows since the f̂(Z_i) − f(Z_i) are independent given the training data, and the third since var(·) is bounded; the next display's second line follows from iterated expectation and independence of the samples, and the third by definition of ∥·∥_{w,2} (and since the squared bias term from the previous line is non-negative). Therefore the second line follows by Markov's inequality, the third from the bound in (7) and iterated expectation, and the last from the bound in (8). The result follows since we can always pick t² = 1/ϵ to ensure the above probability is no more than any ϵ.

Proof of Proposition 1
Stability and consistency together imply ) the result follows.

Proof of Proposition 2
We have the displayed bound, where the second line follows by Hölder's inequality, and the last by definition of the norms.

Proof of Theorem 2
This follows from Proposition 1, letting f (Z) = φ(Z) and noting that b by iterated expectation.

Proof of Corollary 1
The rate result follows since where the second equality follows from Proposition 2 together with Assumptions 1-3, and the third since E n is minimax optimal and the CATE is γ-smooth.For the oracle efficiency condition, note that which yields the result.

Proof of Theorem 3
To simplify notation, in this subsection we largely omit function arguments, for example writing τ = τ(x), b = b(X − x), K_hx = K_hx(X), etc. We also define b_0 = b(0). We let X^n denote all the covariates across the training and test samples (D^n_{1a}, D^n_{1b}, D^n_2), and we let D^n_1 = (D^n_{1a}, D^n_{1b}) denote the training data.
First note that by definition we have τ̂ = b_0^T P_n(b K_hx φ̂_a b^T)^{−1} P_n(b K_hx φ̂). Thus we begin with the central decomposition below, in terms of the quantities defined there. The general approach in this proof is to use conditional error bounds for each term in the decomposition, from which one arrives at bounds in probability (cf. Lemma 6.1 of Chernozhukov et al. [2018]).
The term on the right-hand side of (9) is the difference between τ and an oracle version of the lp-R-Learner that has access to the true nuisance functions, and so is built from ν and φ; we will show that it attains the same order as the oracle rate n^{−γ/(2γ+d)}. The term (10) captures the error from estimating (π̂_b, η̂) in φ̂; we will show its order is the product of the biases from estimating (π̂_b, η̂), plus a typically smaller variance term. The term (11) captures the error from estimating (π̂_a, π̂_b) in φ̂_a; we will show it behaves similarly to (10), except involving the product of the biases of (π̂_a, π̂_b). In regimes where the oracle rate is not achievable and the propensity score π is smoother than the regression function µ, we will show that the term (10) dominates.

Term (9)
The analysis of the term in (9) follows that of a standard local polynomial estimator of the pseudo-outcome φ/ν on X, with a special choice of kernel. This is because we can write the oracle estimator in the displayed form, with the given weights. Thus the oracle estimator in (9) is a local polynomial estimator of the regression of φ/ν on X using the scaled kernel function K_hx ν, which has the same support as K_hx and a smaller upper bound, since ν ≤ 1/4.
The above implies that the weights w_i satisfy analogues of Proposition 1.12 and Lemma 1.3 in Tsybakov [2009], i.e., that they reproduce polynomials in X up to degree ⌊γ⌋, and have the localizing properties given in the following lemma. Lemma 1. Assume (i) |K(u)| ≤ K_max 1(∥u∥ ≤ 1) and (ii) sup_x ∥b(x)∥ ≤ B. Then the weights in (13) satisfy the stated properties. Proof. Property (3) follows from Assumption (i). Properties (1) and (2) also follow immediately after noting, using the submultiplicative property of the operator norm, Assumptions (i)-(ii), and the facts that ∥b(0)∥ = 1 and ν is bounded, the stated operator-norm bound, by iterated expectation. Therefore, by the same logic as in Proposition 1.13 of Tsybakov [2009], using Lemma 1 with the Hölder condition on τ, we obtain the displayed bias bound, where in the second line τ_{γ,x}(X_i) is the ⌊γ⌋-order multivariate Taylor approximation of τ at x evaluated at X_i, the third line follows by the polynomial-reproducing property of w_i, the Hölder assumption on τ, and Property 3 of Lemma 1, and the last line by Property 2 of Lemma 1.
For the variance we obtain the displayed bound, since the variance of φ̂ is bounded, using Properties 1-2 of Lemma 1. Therefore term (9) satisfies the stated rate. Since ξ_n = O_P(1) and λ_n = O_P(1) by assumption, the above rate will end up matching the classical n^{−γ/(2γ+d)} rate when balancing bias and variance by taking h ∼ n^{−1/(2γ+d)}.
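The polynomial-reproducing and localizing properties in Lemma 1 can be checked numerically; below is a simplified Python sketch of standard one-dimensional local polynomial weights (a boxcar kernel and no ν scaling, both simplifications relative to the proof's construction):

```python
import numpy as np

def local_poly_weights(x0, xs, h, degree):
    """Local polynomial weights at x0 (one-dimensional X, boxcar kernel):
        w_i = b(0)^T Qhat^{-1} b(X_i - x0) K((X_i - x0)/h) / (n h),
    where Qhat = (1/(n h)) sum_i b(X_i - x0) b(X_i - x0)^T K((X_i - x0)/h).
    These weights reproduce polynomials up to the given degree."""
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    u = xs - x0
    kern = (np.abs(u / h) <= 1.0).astype(float)
    b = u[:, None] ** np.arange(degree + 1)[None, :]   # rows are b(X_i - x0)
    q = (b * kern[:, None]).T @ b / (n * h)            # Qhat
    b0 = np.zeros(degree + 1)
    b0[0] = 1.0
    return ((np.linalg.solve(q, b0) @ b.T) * kern) / (n * h)
```

Since Σ_i w_i b(X_i − x0)^T = b(0)^T, the weights sum to one and annihilate all centered monomials up to the chosen degree, which is exactly the polynomial-reproducing property used in the bias bound.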
7.6.2 Term (10)
Now we bound the conditional mean and variance of term (10). First note the identity below, which holds by iterated expectation. Let B_n(x; f̂) denote the pointwise conditional bias of a generic estimator f̂ of f at x. Then the conditional mean of term (10) is bounded as displayed, where the second line follows by iterated expectation, the third since π̂_a ⊥⊥ η̂, the fourth and fifth by Properties 1-3 of Lemma 1 and since ν ≥ ϵ(1 − ϵ), and the last by Condition 1c via Assumption 4. For the conditional variance we have the decomposition into the terms (17) and (18). For the term in (17), the second line uses that var(φ̂ − φ | D^n, X^n) and 1/ν are bounded, and the last uses Properties 1-2 of Lemma 1.
The second term (18) equals the displayed expression. For the first term (19) above we have the bound whose second line uses the stated variance fact (a slightly worse bound would give the expression in the third line plus a k^{−2(α+β)/d} term, which is of smaller order in the undersmoothing regime). For the second term in (20) we have the bound whose second line uses the stated covariance fact, where the first line uses Property 3 of Lemma 1, which implies π̂(X_i) and π̂(X_j) are built from different observations if X_i and X_j are far enough apart, the second Cauchy-Schwarz, and the third Condition 1b. Therefore, when k^{−2(α∧β)/d} ≲ k/n, we obtain the stated bound, where the first inequality follows from (22) and (23), the second by Properties 1 and 3 of Lemma 1, and the last by definition.
Therefore, combining the bounds on the four terms (16), (17), (19), and (20), we have that the term (10) satisfies the stated bound. The first term in the product in (25) can be tackled with similar logic as for term (10). First note the identity below, which holds by iterated expectation. Defining B_n(x; f̂) as in the previous subsection, the conditional means of the elements of (25) (omitting subscripts ℓ) satisfy the displayed bounds, where the second line follows by iterated expectation, the third since π̂_a ⊥⊥ π̂_b, the fourth and fifth by Properties 1-3 of Lemma 1 and since ν ≥ ϵ(1 − ϵ), and the last by Condition 1c via Assumption 4.
The analysis of the conditional variance follows exactly the same logic as for term (10), and is of the same order. For the second term in the product in (25) we have the displayed bound. Therefore, for b_ℓ the ℓ-th component of b(X_i − x), the bound follows using the boundedness of the kernel, X, and φ̂. For the term in (28), the second line follows from the definition of R̂_2(X_i) and since E(φ̂ | X_i) is constant given X^n. The rest follows the same logic as for the term in (18), noting that λ_n terms do not appear since here there is only a kernel weight, and so no corresponding ∥Q̂_hx^{−1}∥ term.
Therefore the square of the term (11) has conditional expectation bounded above by a constant multiple of

Combining Bounds
Now we use the following lemma (equivalently, Lemma 6.1 of Chernozhukov et al. [2018]) to deduce unconditional convergence from the previously derived conditional bounds. Proof. Note for some M_ϵ depending on any ϵ > 0 we have the displayed chain of inequalities, where the first equality uses iterated expectation, the second Markov's inequality, the third the fact that (b_n + s_n)² ≥ b_n² + s_n², and the last the bounds on |E(Z_n | X_n)| and var(Z_n | X_n). The result follows taking M_ϵ = C/ϵ.
Recall that our conditional bounds in (14), (24), and (29) involve the quantities (λ_n, λ̄_n, ξ_n, ξ′_n). The quantities λ_n and λ̄_n are O_P(1) by Assumption 3, and in the following lemmas we show that ξ_n and ξ′_n are as well, as long as the covariate density is bounded (Assumption 1).
Lemma 3. Assume X has a density bounded above by some C < ∞. Then ξ_n = O_P(1). Proof. First note that, using the bound on the density and the volume of a d-ball of radius h, E(ξ_n) is bounded by a constant multiple of C. Therefore for any ϵ > 0 we have P(ξ_n ≥ 5.3C/ϵ) ≤ ϵ by Markov's inequality, which yields the result.
Lemma 4. Assume X has a density bounded above by some C < ∞, and define ξ′_n as above. Proof. For any i we have the displayed bound. Note we can discard the k/n terms since, if k ≤ n, the constant term 1 dominates the variance multiplier, whereas if k ≥ n then k/n ≥ 1 and the k/n term dominates.

Figure 3
Figure 3 gives a schematic illustrating the DR-Learner construction.

Figure 3 :
Figure 3: Schematic illustrating the DR-Learner approach. In the first stage, the nuisance functions π̂ and (µ̂_0, µ̂_1) are estimated from the training sample D^n_1. In the second stage, these estimates are used to construct an estimate of the pseudo-outcome φ, which is then regressed on X using the test sample D^n_2.
and h a bandwidth parameter. Step 1. Nuisance training: (a) Using D^n_{1a}, construct estimates π̂_a of the propensity scores π. (b) Using D^n_{1b}, construct estimates η̂ of the regression function η = πµ_1 + (1 − π)µ_0, and estimates π̂_b of the propensity scores π. Step 2. Localized double-residual regression: Define τ̂_r(x) as the fitted value from a kernel-weighted least-squares regression (in the test sample D^n_2) of the outcome residual (Y − η̂) on the basis terms b scaled by the treatment residual (A − π̂_b), with weights [(A − π̂_a)/(A − π̂_b)] K_hx. Thus τ̂_r(x) = b(0)^T θ̂ for

θ̂ = argmin_θ P_n( [(A − π̂_a)/(A − π̂_b)] K_hx(X) { (Y − η̂(X)) − θ^T b(X − x)(A − π̂_b(X)) }² ).
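A minimal Python sketch of Step 2 in one dimension, with a boxcar kernel and linear basis (hypothetical simplifications; the paper's construction is d-dimensional, with a general kernel and degree-⌊γ⌋ basis, and its code is in R):

```python
import numpy as np

def lp_r_stage2(x0, xs, a, y, pi_a, pi_b, eta, h, degree=1):
    """Localized double-residual regression: kernel-weighted least squares of
    (Y - eta_hat) on b(X - x0)(A - pi_b_hat), with weights
    {(A - pi_a_hat)/(A - pi_b_hat)} K((X - x0)/h); returns b(0)^T theta_hat,
    i.e., the intercept of the local fit."""
    u = np.asarray(xs, dtype=float) - x0
    kern = (np.abs(u / h) <= 1.0).astype(float)
    # the ratio is positive: A - pi has the same sign in numerator and denominator
    w = (a - pi_a) / (a - pi_b) * kern
    design = (u[:, None] ** np.arange(degree + 1)[None, :]) * (a - pi_b)[:, None]
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(design * sw[:, None], (y - eta) * sw, rcond=None)
    return theta[0]
```

With the true nuisances plugged in and a noiseless outcome Y = η(X) + (A − π)τ for constant τ, the local fit recovers τ exactly, matching the double-residual logic.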

Figure 6 :
Figure 6: Illustration of error bounds in this work, as a function of the nuisance smoothness s = α = β. The dotted gray line represents the classical minimax lower bound for functional estimation, which is root-n when s ≥ d/4. The dashed gray line is the oracle rate that would be achieved by a minimax optimal regression using Y^1 − Y^0. The red line is the bound for a plug-in estimator, which is just the rate for estimating the individual regression functions. The blue line is the rate for the DR-Learner given in Theorem 2, which matches the oracle under the conditions given there. The purple line is the rate achieved by the lp-R-Learner when tuned as in Corollary 2, which matches the oracle when s ≥ (d/4)/(1 + d/(2γ)). The black line is the improved rate achieved by the lp-R-Learner when tuned as in Corollary 3, e.g., with known covariate density.
and the third the bias and variance from Conditions 1b-1c. (A slightly worse bound could have been used here.) The preceding display is bounded by ξ_n using boundedness of the kernel K and observations Z. For the variance we have the decomposition

var{P_n(b_ℓ K_hx φ̂) | X^n} = E[var{P_n(b_ℓ K_hx φ̂) | D^n, X^n} | X^n] (27) + var[E{P_n(b_ℓ K_hx φ̂) | D^n, X^n} | X^n]. (28)

For the term in (27), note the displayed bound on var{P_n(b_ℓ K_hx φ̂)}.

Lemma 2.
Suppose a random variable Z_n satisfies |E(Z_n | X_n)| ≲ b_n and var(Z_n | X_n) ≲ s_n² for some b_n = b(X_n) and s_n² = s(X_n)². Then Z_n = O_P(b_n + s_n).