A unified analysis of regression adjustment in randomized experiments

Regression adjustment is broadly applied in randomized trials under the premise that it usually improves the precision of a treatment effect estimator. However, previous work has shown that this is not always true. To further understand this phenomenon, we develop a unified comparison of the asymptotic variance of a class of linear regression-adjusted estimators. Our analysis is based on the classical theory for linear regression with heteroscedastic errors and thus does not assume that the postulated linear model is correct. For a completely randomized binary treatment, we provide sufficient conditions under which some regression-adjusted estimators are guaranteed to be more asymptotically efficient than others. We explore other settings such as general treatment assignment mechanisms and generalized linear models, and find that the variance dominance phenomenon no longer occurs.


Introduction
Randomized experiments are the gold standard for answering questions about causality. Many researchers use multiple linear regression with a treatment indicator and some baseline covariates to analyze randomized experiments, and the coefficient of the treatment indicator is often interpreted as a causal effect. In some fields, this is known as the "analysis of covariance" (ANCOVA), which was first proposed by Fisher (1932) to unify "two very widely applicable procedures known as regression and analysis of variance". This common practice is motivated by the belief that regression adjustment can increase precision if the covariates in the regression are predictive of the outcome.
However, as pointed out by many authors, this is not always true, especially when there is substantial treatment effect heterogeneity. Regression adjustment in randomized experiments has been studied in two different frameworks: the finite-population potential outcome model (Neyman, 1923; Rubin, 1974) and the super-population model, which assumes the experimental units are drawn independently from an infinite population (see e.g. Imbens and Rubin, 2015, Chapter 7). Three estimators have been extensively studied in the literature: the simple difference-in-means or analysis of variance (ANOVA) estimator; the ANCOVA estimator, which includes covariate main effects; and the regression-adjusted estimator that includes covariate main effects and all treatment-covariate interactions. The last is termed the analysis of heterogeneous covariance (ANHECOVA) estimator by Ye et al. (2022). The main conclusions about the asymptotic efficiency of these estimators are the same, regardless of whether the potential outcome model (Freedman, 2008a,b; Schochet, 2010; Lin, 2013; Guo and Basse, 2021) or the super-population model (Koch et al., 1998; Yang and Tsiatis, 2001; Tsiatis et al., 2008; Schochet, 2010; Rubin and van der Laan, 2011; Ye et al., 2022) is used. Consider two estimators β_1 and β_2 that converge to the same limit. We say that β_1 (asymptotically) uniformly dominates β_2 if the (asymptotic) variance of β_1 is always smaller than or equal to that of β_2, no matter what the underlying distribution is. In both the potential outcome model and the super-population model, it has been found that ANHECOVA uniformly dominates the other two, but, somewhat surprisingly, ANCOVA does not uniformly dominate ANOVA.
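To make the three classical estimators concrete, the following sketch computes them by ordinary least squares on simulated data. The data-generating process is our own illustration, not one analysed in the literature cited above; each estimator is simply the coefficient on the treatment indicator A in a different design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)            # baseline covariate, centred in the population
A = rng.binomial(1, 0.5, size=n)  # Bernoulli treatment assignment
Y = 1.0 + 2.0 * A + X + 3.0 * A * X + rng.normal(size=n)

def ols_coef(design, y, idx):
    """Least-squares fit; return the coefficient at position idx (the treatment)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[idx]

one = np.ones(n)
Xc = X - X.mean()  # empirically centred covariate

anova    = ols_coef(np.column_stack([one, A]), Y, 1)              # Y ~ 1 + A
ancova   = ols_coef(np.column_stack([one, A, Xc]), Y, 1)          # Y ~ 1 + A + X
anhecova = ols_coef(np.column_stack([one, A, Xc, A * Xc]), Y, 1)  # Y ~ 1 + A + X + A:X

print(anova, ancova, anhecova)  # all three estimate the same average treatment effect
```

All three estimators converge to the same average treatment effect (here 2), but their sampling variabilities differ; the orderings among them are precisely what the dominance results describe.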
A major limitation of the existing analyses of regression adjustment is that the investigations are restricted to specific estimators and provide limited insight into the phenomenon of uniform dominance. The variance calculations are often quite technical, which further makes the theoretical results less accessible to practitioners. Furthermore, the existing literature does not tell us whether including all treatment-covariate interactions is preferred in other settings, such as stratified experiments and generalized linear models.
In this article, we provide a unified analysis for a large class of linear regression-adjusted estimators. Besides the estimators mentioned above, our theory also applies to regression estimators with some coefficients fixed (such as the difference-in-differences estimator) or with treatment-covariate interactions only. By a simple application of the textbook theory for linear regression with heteroscedastic errors, this analysis not only recovers the known relationships between ANOVA, ANCOVA, and ANHECOVA, but also immediately provides a sufficient condition for uniform dominance when the expectation of the covariates is known (see Theorem 1 below). In the more practical situation where the covariate expectation is unknown, a slightly different sufficient condition is obtained (see Theorem 2 below). This condition shows, for example, that the so-called lagged-dependent-variable regression estimator is more efficient than the difference-in-differences estimator in randomized experiments, despite the two having a bracketing relationship in observational studies (Ding and Li, 2019). This unified analysis also allows us to explore whether uniform dominance extends to more complicated settings, and to provide numerical counterexamples. Some further remarks are given at the end of this article, and proofs of the technical lemmas can be found in Appendix A.

Linear regression adjustment in randomized trials
Let A_i ∈ {0, 1} denote the binary treatment assignment of unit i, X_i ∈ R^p a vector of unit covariates observed before treatment assignment, and Y_i ∈ R a real-valued outcome of the unit. We assume that (A_i, X_i^T, Y_i), i = 1, . . ., n, are independent and identically distributed, which is often a good approximation when the units are randomly sampled from a large population. To simplify the notation, we drop the subscript i when referring to a generic unit from the population.
Unless mentioned otherwise, we assume that each unit receives the treatment independently with equal probability pr(A = 1 | X) = π, where 0 < π < 1 is a known constant. In other words, treatment is assigned by a simple Bernoulli trial, which approximates the random sampling without replacement that is often studied in the finite-population model (Freedman, 2008a,b; Lin, 2013). Under this assignment mechanism and standard assumptions in causal inference, the average treatment effect

    β_ATE = E{Y(1) − Y(0)},

where Y(a) is the potential outcome of the unit under treatment level a, can be identified as (see e.g. Imbens and Rubin, 2015, Chapter 7):

    β_ATE = E(Y | A = 1) − E(Y | A = 0).    (1)

In this article, we consider the following class of regression-adjusted estimators of β_ATE. Let Γ = Γ^(1) × · · · × Γ^(p) and Δ = Δ^(1) × · · · × Δ^(p), where the individual components Γ^(j) and Δ^(j) are either the real line R or a singleton. Define the constrained ordinary least squares estimator as

    θ̂ = (α̂, β̂, γ̂, δ̂) = arg min_{α, β ∈ R, γ ∈ Γ, δ ∈ Δ} n^{−1} Σ_{i=1}^n {Y_i − α − β A_i − γ^T X_i − δ^T (A_i X_i)}^2.    (2)

We sometimes use the notation θ̂(Γ, Δ) (and similarly for the components of θ̂) to emphasize the dependence of the estimator on the sets Γ and Δ. Lemma 1 in Section 3.1 shows that β̂ is a reasonable estimator of β_ATE when the covariates are centred, i.e. E(X) = 0; otherwise β_ATE can be estimated by β̃ = β̂ + δ̂^T X̄, where X̄ = Σ_{i=1}^n X_i / n. Before examining the asymptotic properties of β̂ and β̃, we give several examples in the class of estimators (2).
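The constrained least squares problem in (2) can be solved by moving the columns whose coefficients are fixed (the singleton components of Γ and Δ) to the response and regressing on the remaining free columns. The sketch below is an illustrative implementation under that reading of (2); the function name and interface are our own.

```python
import numpy as np

def constrained_adjustment(Y, A, X, gamma_fix=None, delta_fix=None):
    """Constrained OLS for Y ~ alpha + beta*A + gamma'X + delta'(A*X), as in (2).

    gamma_fix / delta_fix map a covariate index j to a fixed coefficient value
    (a singleton component of Gamma or Delta); unlisted indices are free (R).
    Returns (beta_hat, delta_hat, beta_tilde), where
    beta_tilde = beta_hat + delta_hat' Xbar estimates the average treatment effect.
    """
    gamma_fix = gamma_fix or {}
    delta_fix = delta_fix or {}
    n, p = X.shape
    AX = A[:, None] * X
    # Move the fixed-coefficient columns to the response.
    Yadj = np.asarray(Y, dtype=float).copy()
    for j, c in gamma_fix.items():
        Yadj -= c * X[:, j]
    for j, c in delta_fix.items():
        Yadj -= c * AX[:, j]
    free_g = [j for j in range(p) if j not in gamma_fix]
    free_d = [j for j in range(p) if j not in delta_fix]
    Z = np.column_stack([np.ones(n), A, X[:, free_g], AX[:, free_d]])
    coef, *_ = np.linalg.lstsq(Z, Yadj, rcond=None)
    beta_hat = coef[1]
    delta_hat = np.zeros(p)
    delta_hat[free_d] = coef[2 + len(free_g):]
    for j, c in delta_fix.items():
        delta_hat[j] = c
    beta_tilde = beta_hat + delta_hat @ X.mean(axis=0)
    return beta_hat, delta_hat, beta_tilde
```

Fixing all of γ and δ to zero recovers the ANOVA (difference-in-means) estimator, while leaving everything free gives ANHECOVA.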
Example 2. In some applications, the covariate vector X includes the baseline value of the response before the treatment is assigned (call it Y_0). For simplicity, suppose the first entry of X is Y_0, so X = (X_1 = Y_0, X_2, . . ., X_p)^T. The difference-in-differences estimator corresponds to setting Γ = {1} × R^{p−1} and Δ = {0} × R^{p−1}, while the lagged-dependent-variable regression estimator corresponds to setting Γ = R^p and Δ = {0} × R^{p−1}. In observational studies, these two estimators rely on different identification assumptions (Ding and Li, 2019) and may converge to different limits. In the randomized experiment described above, both estimators should converge to the average treatment effect, but we are unaware of any comparison of their statistical efficiency in the presence of covariates besides Y_0.
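Under the reading of Example 2 in which the coefficient on Y_0 is fixed at 1 for difference-in-differences and left free for the lagged-dependent-variable regression, the two estimators can be computed side by side on simulated data. The data-generating model below is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
Y0 = rng.normal(size=n)                          # baseline outcome, pre-treatment
W = rng.normal(size=n)                           # an additional baseline covariate
A = rng.binomial(1, 0.5, size=n)
Y = 0.5 * Y0 + W + 2.0 * A + rng.normal(size=n)  # true lag coefficient 0.5, not 1

Z = np.column_stack([np.ones(n), A, W])

# Difference-in-differences: the coefficient on Y0 is fixed at 1 (gamma_1 = 1),
# so Y - Y0 is regressed on the remaining terms.
did = np.linalg.lstsq(Z, Y - Y0, rcond=None)[0][1]

# Lagged-dependent-variable regression: the coefficient on Y0 is left free.
lagged = np.linalg.lstsq(np.column_stack([Z, Y0]), Y, rcond=None)[0][1]

print(did, lagged)  # both estimate the treatment effect (2.0) under randomization
```

Both are consistent here because treatment is randomized; the efficiency comparison between them is the subject of Corollary 2.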

Covariates with known expectation
We first consider estimation of β_ATE when the covariates X have a known expectation. As will be seen in a moment, the proof of uniform dominance is fairly straightforward in this case.
Consider the population counterpart to (2):

    θ = (α, β, γ, δ) = arg min_{α, β ∈ R, γ ∈ Γ, δ ∈ Δ} E[{Y − α − βA − γ^T X − δ^T (AX)}^2].    (3)

Clearly, θ = θ(Γ, Δ), and we often suppress the dependence of θ on (Γ, Δ) if it is clear from the context.
Lemma 1. For any Γ and Δ of the form described in Section 2, we have

    β(Γ, Δ) + δ(Γ, Δ)^T E(X) = β_ATE.

Without loss of generality, we shall assume that E(X) = 0 for the rest of this section; otherwise, we can simply replace X with X − E(X) since E(X) is known. When E(X) = 0, Lemma 1 shows that β̂ is a reasonable estimator of β_ATE. To study the asymptotic properties of β̂, we first state a classical result for linear regression with heteroscedastic errors. For a proof of this result, see e.g. White (1980).
Lemma 2. Consider a linear regression of an independent and identically distributed sample of a response Y ∈ R on regressors Z ∈ R^q. Let θ̂ and θ be the sample and population least squares estimators, and let ε(θ) = Y − θ^T Z. Suppose that E(ZZ^T) and E{ε(θ)^2 ZZ^T} are positive definite and that Y and Z have bounded fourth moments. Then, as n → ∞, θ̂ → θ in probability and

    √n (θ̂ − θ) → N(0, E(ZZ^T)^{−1} E{ε(θ)^2 ZZ^T} E(ZZ^T)^{−1})    in distribution.

Note that these results do not require the linear model to be correctly specified. By applying Lemma 2 to our problem with an appropriate Z (consisting of the intercept, A, and the components of X and AX whose coefficients are unrestricted) and the corresponding regression error, we obtain the expression for the asymptotic variance of β̂. The proof of this result is straightforward due to the block-diagonal structure of E(ZZ^T), which is made possible by the assumption that E(X) = 0.
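The sandwich formula in Lemma 2 is the familiar heteroscedasticity-robust (HC0) covariance estimator. A minimal sketch, with our own illustrative data-generating process:

```python
import numpy as np

def sandwich_avar(Z, Y):
    """HC0 estimate of the asymptotic covariance of sqrt(n) * (theta_hat - theta)
    in the linear regression of Y on Z, following the sandwich formula of
    Lemma 2 (White, 1980)."""
    n = Z.shape[0]
    theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    eps = Y - Z @ theta                       # regression error epsilon(theta)
    bread = np.linalg.inv(Z.T @ Z / n)        # estimate of E(Z Z^T)^{-1}
    meat = (Z * eps[:, None] ** 2).T @ Z / n  # estimate of E{epsilon^2 Z Z^T}
    return theta, bread @ meat @ bread

rng = np.random.default_rng(2)
n, p = 2000, 2
X = rng.normal(size=(n, p))
A = rng.binomial(1, 0.5, size=n)
Z = np.column_stack([np.ones(n), A, X, A[:, None] * X])  # regressors (1, A, X, AX)
Y = 1 + 2 * A + X @ np.ones(p) + rng.normal(size=n) * (1 + A)  # heteroscedastic error
theta_hat, avar = sandwich_avar(Z, Y)
se_beta = np.sqrt(avar[1, 1] / n)  # robust standard error of the treatment coefficient
```

Note that the estimator is valid even though the error variance depends on A, i.e. the homoscedastic linear model is misspecified.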
Lemma 3. Suppose that E(X) = 0 and the regularity conditions in Lemma 2 are satisfied.
In words, Theorem 1 says that, when the expectation of the covariates is known, one linear regression-adjusted estimator is uniformly dominated by another if the two linear models are nested, the dominating estimator is obtained from the larger model, and the larger model includes an interaction term whenever the corresponding main effect is present; there is no such requirement for the smaller model. The conditions in (6) can easily be applied to obtain variance orderings among the examples in Section 2. We discuss them in more detail after deriving a similar sufficient condition for the case where the expectation of X is unknown.

Covariates with unknown expectation
In most practical situations, we do not know the expectation of the covariates, and it is common to centre the covariates empirically before performing the linear regression. Let θ̃ = (α̃, β̃, γ̃, δ̃) be the least squares estimator in (2) with X_i replaced by X_i − X̄, where X̄ = Σ_{i=1}^n X_i / n; that is,

    θ̃ = arg min_{α, β ∈ R, γ ∈ Γ, δ ∈ Δ} n^{−1} Σ_{i=1}^n [Y_i − α − β A_i − γ^T (X_i − X̄) − δ^T {A_i (X_i − X̄)}]^2.    (9)

We have β̃ = β̂ if no interaction term is included, i.e. if Δ = {0}^p, because both (2) and (9) include an intercept term. More generally, by differentiating (9) with respect to α and β and following the proof of Lemma 1, it is straightforward to verify that β̃ = β̂ + δ̂^T X̄. Thus, β̃ is a reasonable estimator of β_ATE. The estimator θ̃ is invariant to any shift transformation of the covariates: θ̃ remains the same if we replace X_i by X_i + c, i = 1, . . ., n, for any c ∈ R^p. Therefore, the statistical properties of θ̃ do not depend on E(X) and, to simplify the analysis, we shall assume E(X) = 0 without loss of generality. The asymptotic variance of β̃ generally differs from that of β̂ due to the variability in X̄. The next result quantifies this difference; it is proved in Appendix A.4.
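The identity β̃ = β̂ + δ̂^T X̄ can be checked numerically: in the unconstrained case, the centred and uncentred regressions span the same column space, so the identity holds exactly up to floating-point error. A sketch with our own simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(loc=2.0, size=(n, 2))  # covariates with non-zero mean
A = rng.binomial(1, 0.5, size=n)
Y = 1 + A + X @ np.array([0.5, -0.5]) + A * X[:, 0] + rng.normal(size=n)

def fit(Xmat):
    """OLS of Y on (1, A, Xmat, A * Xmat); returns the full coefficient vector."""
    Z = np.column_stack([np.ones(n), A, Xmat, A[:, None] * Xmat])
    return np.linalg.lstsq(Z, Y, rcond=None)[0]

coef_raw = fit(X)                   # estimator (2): uncentred covariates
coef_ctr = fit(X - X.mean(axis=0))  # estimator (9): empirically centred covariates

beta_hat, delta_hat = coef_raw[1], coef_raw[4:]
beta_tilde = coef_ctr[1]
# The two agree through the identity beta_tilde = beta_hat + delta_hat' Xbar:
print(beta_tilde, beta_hat + delta_hat @ X.mean(axis=0))
```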
Let θ̂_k and θ̃_k be the solutions to (2) and (9), respectively, for the choice (Γ, Δ) = (Γ_k, Δ_k), k = 1, 2. Let Ṽ_k be the asymptotic variance of β̃_k, that is, Ṽ_k = lim_{n→∞} n var(β̃_k). Our main goal is to derive conditions on (Γ_1, Δ_1) and (Γ_2, Δ_2) such that Ṽ_1 and Ṽ_2 admit a deterministic ordering. It seems natural to require that the models are nested, i.e. Γ_2 ⊆ Γ_1 and Δ_2 ⊆ Δ_1 componentwise. Table 1 provides a list of uniform dominance relationships in some basic comparisons. We use the R convention to denote linear models: explanatory variables in the regression are joined by +, 1 stands for the intercept term, and A : X stands for the treatment-covariate interaction. Using Table 1, we conjecture that, in order for Ṽ_1 ≤ Ṽ_2 to hold, the third condition in (6) needs to be modified; this is verified in the next theorem.
By verifying the conditions in (10), we have the following results concerning the examples in Section 2.
Corollary 1. The ANHECOVA estimator is asymptotically more efficient than the ANOVA and ANCOVA estimators. There is no guaranteed variance ordering between ANOVA and ANCOVA.
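The lack of an ordering between ANOVA and ANCOVA can be seen in a small Monte Carlo study; the configuration below (unbalanced assignment π = 0.9 with treatment effect heterogeneity) is our own illustration, not one of the paper's scenarios. A standard asymptotic calculation for this design (treated slope 2, control slope 0, unit error variance) gives ANOVA and ANCOVA variances of roughly 15.6/n and 43.6/n, so the reversal is not a small-sample artefact.

```python
import numpy as np

rng = np.random.default_rng(5)
n, pi, reps = 2000, 0.9, 500

def estimates():
    """One replication: treatment coefficients from the three regressions."""
    X = rng.normal(size=n)
    A = rng.binomial(1, pi, size=n)
    Y1 = 2.0 * X + rng.normal(size=n)  # treated outcome: slope 2 in X
    Y0 = rng.normal(size=n)            # control outcome: no dependence on X
    Y = A * Y1 + (1 - A) * Y0          # true beta_ATE = 0
    one = np.ones(n)
    Xc = X - X.mean()
    anova = np.linalg.lstsq(np.column_stack([one, A]), Y, rcond=None)[0][1]
    ancova = np.linalg.lstsq(np.column_stack([one, A, Xc]), Y, rcond=None)[0][1]
    anhecova = np.linalg.lstsq(
        np.column_stack([one, A, Xc, A * Xc]), Y, rcond=None)[0][1]
    return anova, ancova, anhecova

draws = np.array([estimates() for _ in range(reps)])
sd = draws.std(axis=0)
print("SD  anova:", sd[0], " ancova:", sd[1], " anhecova:", sd[2])
```

In this configuration the ANCOVA estimator is noticeably more variable than the unadjusted ANOVA estimator, while ANHECOVA remains competitive.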
Corollary 2. The lagged-dependent-variable regression estimator is more efficient than the difference-in-differences estimator.

Variance ordering beyond linear regression
One can establish uniform dominance under more sophisticated randomization schemes. In particular, our derivations extend to stratified randomized experiments with units grouped into B strata, in line with the results of Liu and Yang (2020), who considered two asymptotic regimes in which either the number of strata B or the stratum sample sizes increase with the total number of units n. Let S_j be the indicator variable of whether a unit belongs to stratum j. In this setting, for uniform dominance to hold, the treatment assignment should be completely randomized within strata with the allocation probability equal across strata, and the estimator should be obtained from a weighted regression that includes the centred variables A, S, X, as well as the interactions of A with S and of A with X.
In contrast, the results do not generally extend to an assignment mechanism that depends on X through π(x) = pr(A = 1 | X = x) (see Table 2 below for counterexamples). Within this framework β_ATE is still identified, but for general Γ and Δ the estimators in (2), (3) and (9) do not converge to β_ATE. In this scenario, one can use weighted estimators (Stuart et al., 2011; Tao and Fu, 2019) to recover (1).
The result in Lemma 1 is closely related to the properties of orthogonal projections and does not carry over to the wider class of generalized linear models, even if the link function is collapsible (a link function is collapsible if including independent regressors does not change the population regression coefficients; see Greenland et al., 1999; Daniel et al., 2021, for more details). The most common collapsible link functions in this setting are the identity and the log; the latter is the canonical link for Poisson regression and is frequently applied. Yet, if the link is collapsible but not the identity, including the treatment-covariate interaction may identify a different estimand. Thus, seeking uniform dominance by comparing the asymptotic variances of β̂ appears to be an ill-posed question in this setting.

Simulation study
We carry out a numerical simulation study to verify the theoretical developments in the previous sections and to explore scenarios under which uniform dominance does not hold (see the discussion in Section 6). We consider scenarios with potential outcomes generated from normal and Poisson distributions. In all scenarios, the covariate is centred and normally distributed, X = X^nc − X̄^nc with X^nc ∼ N(2, 1), where X̄^nc is the sample mean of the X^nc observations; the observed outcome is Y = AY(1) + (1 − A)Y(0); and the treatment is assigned by a Bernoulli trial, A ∼ Bernoulli(π). For linear regression, the errors are normally distributed, e_1, e_0 ∼ N(0, 1). We consider a sample size of n = 1000. The simulation scenarios are described below.
Scenario 1. Treated outcomes: Y(1) = 5 + 2.5X + e_1; untreated outcomes:

Scenario 2. Treated outcomes: Y(1) ∼ Poisson(µ_1); untreated outcomes:

Scenario 4. The same as Scenario 3, but with weights: w

To study the performance of the estimators of β_ATE, we calculated the average bias and the standard deviation over 1000 Monte Carlo replications. Finally, under the Poisson log model, the true value of β_ATE was approximated by the difference of large-sample averages (n = 10^7) of the potential outcomes.
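The replication loop can be sketched as follows. The untreated-outcome specification of Scenario 1 is truncated in our source, so the sketch uses the illustrative choice Y(0) = e_0 and a single value of π; it should not be read as the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(4)

def one_replicate(n=1000, pi=0.5):
    """One Monte Carlo draw: treatment-coefficient estimates from three models."""
    Xnc = rng.normal(2.0, 1.0, size=n)
    X = Xnc - Xnc.mean()                     # empirically centred covariate
    A = rng.binomial(1, pi, size=n)
    Y1 = 5 + 2.5 * Xnc + rng.normal(size=n)  # Scenario 1-style treated outcomes
    Y0 = rng.normal(size=n)                  # illustrative untreated outcomes
    Y = A * Y1 + (1 - A) * Y0
    one = np.ones(n)
    designs = {
        "1 + A": np.column_stack([one, A]),
        "1 + A + X": np.column_stack([one, A, X]),
        "1 + A + X + A:X": np.column_stack([one, A, X, A * X]),
    }
    return {m: np.linalg.lstsq(Z, Y, rcond=None)[0][1] for m, Z in designs.items()}

reps = [one_replicate() for _ in range(200)]
for model in reps[0]:
    est = np.array([r[model] for r in reps])
    print(model, round(est.mean(), 3), round(est.std(), 3))
```

Under this illustrative design the true β_ATE is E{Y(1) − Y(0)} = 5 + 2.5 × 2 = 10, so the printed means gauge the bias and the printed standard deviations gauge the efficiency of each model.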
Figure 1 shows the results of the simulations under Scenario 1 (top panels) and Scenario 2 (bottom panels). For normally distributed outcomes, the estimates of β_ATE are almost unbiased under any model (top-right panel). The standard deviation of β̃ is smallest for the model with both X and AX, although adjusting for the covariate main effect only is not beneficial for higher values of π. Under the Poisson log model, β̃ is not an estimator of β_ATE and the uniform dominance results do not hold. Table 2 displays the numerical performance of β̃ under Scenarios 3 and 4, for which our theory does not apply. Unsurprisingly, under Scenario 3, where the assignment mechanism depends on the covariate X, β̃ suffers from a substantial bias and uniform dominance does not apply. Under Scenario 4, which involves weighting, the regression adjustment decreases the standard deviation of β̃, but the estimator in the larger model suffers from a considerable bias.

Discussion
The linear regression model is widely used to estimate the average treatment effect in the hope of increasing the precision of the estimator. We re-established and generalized previous results on linear regression-adjusted estimators under possible model misspecification by providing a simplified and more accessible proof of uniform dominance. Our proof, however, has a geometric element that exploits the linearity of the regression adjustment and does not extend to other settings. Thus, the phenomenon of a guaranteed efficiency gain seems to be limited to estimation problems that fit into the linear framework and to treatment assignment mechanisms that do not depend on X.

A Proofs
A.1 Proof of Lemma 1

Proof. Observe that α and β are always unrestricted in (3). By taking partial derivatives with respect to α and β, we obtain

    E{Y − α − βA − γ^T X − δ^T (AX)} = 0,
    E[A{Y − α − βA − γ^T X − δ^T (AX)}] = 0.

By multiplying the first equation by π and subtracting the second equation, we obtain

    E[(π − A){Y − βA − γ^T X − δ^T (AX)}] = 0,

where the intercept term drops out because E(π − A) = 0. Finally, by using the assumption that the treatment is randomized, i.e. A ⊥⊥ X and E(A) = E(A^2) = π, we find that

    β + δ^T E(X) = E(Y | A = 1) − E(Y | A = 0) = β_ATE,

as desired.

A.2 Proof of Lemma 2
Proof. To use the results of White (1980), we need to check that our assumptions imply the regularity conditions cited by the author. Since Y and Z have bounded fourth moments, the moments appearing in White's conditions are uniformly bounded by some constant H < ∞. Furthermore, we assumed that E(ZZ^T) is positive definite, that is, E(ZZ^T) is non-singular and det E(ZZ^T) > η > 0 for some constant η; the same holds for E{ε(θ)^2 ZZ^T}. Thus, Assumptions 2 and 3 of White (1980) are satisfied, and we can follow the same steps as the author to prove the consistency and asymptotic normality of θ̂.
= E(ZZ^T)^{−1} E(ZX^T), where * represents unspecified matrices that are not important for deriving the quantities of interest. Therefore,

Table 2: Performance of β̃. SD, standard deviation; all numerical entries are multiplied by 1000 and rounded to the nearest integer.

Figure 1: Performance of estimators of β_ATE over different values of π: Scenario 1 (top panels), Scenario 2 (bottom panels). SD, standard deviation.