Designing Experiments Toward Shrinkage Estimation

We consider how increasingly available observational data can be used to improve the design of randomized controlled trials (RCTs). We seek to design a prospective RCT, with the intent of using an Empirical Bayes estimator to shrink the causal estimates from our trial toward causal estimates obtained from an observational study. We ask: how might we design the experiment to better complement the observational study in this setting? We propose using an estimator that shrinks each component of the RCT causal estimator toward its observational counterpart by a factor proportional to its variance. First, we show that the risk of this estimator can be computed efficiently via numerical integration. We then propose algorithms for determining the best allocation of units to strata (the best "design"). We consider three options: Neyman allocation; a "naive" design assuming no unmeasured confounding in the observational study; and a "defensive" design accounting for the imperfect parameter estimates we would obtain from the observational study with unmeasured confounding. We also incorporate results from sensitivity analysis to establish guardrails on the designs, so that our experiment could be reasonably analyzed with and without shrinkage. We demonstrate the superiority of these experimental designs with a simulation study involving causal inference on a rare, binary outcome.


Introduction
Recent years have seen increased interest in methods to integrate observational data with experimental data (Colnet et al., 2020). Such methods have been used to identify average causal effects in target populations (Bareinboim and Pearl, 2016; Kallus et al., 2018), identify heterogeneous treatment effects (Peysakhovich and Lada, 2016), and improve precision in causal estimation (Gagnon-Bartsch et al., 2021). This surge in methodological development is motivated, at least in part, by the increasing proliferation of observational databases. Such repositories provide statisticians with rich new data sources from which to learn. Yet the lurking danger of unmeasured confounding yields rightful trepidation about incorporating these data into estimation procedures (Colnet et al., 2020). Rosenman et al. (2020) developed a procedure for shrinking causal estimates from a stratified experiment toward the analogous estimates from an observational study. Shrinkage estimators are attractive in that they allow researchers to use observational data in tandem with experimental data, while protecting the integrity of the randomization of the experiment. They provide a guaranteed reduction in expected loss, relative to using the experimental data alone.
Separately, Rosenman and Owen (2021) proposed a method to design more powerful stratified experiments by utilizing information from an observational study to inform decisions about how to allocate a sample across strata and treatment arms. Risk reductions from this method are more modest, owing to the fact that the observational data is used only for design and not for inference.
Here, we combine the two approaches of design and shrinkage. We seek to design a prospective randomized trial, with the intent of shrinking the causal estimates from the trial toward those of an observational study that is already completed. We optimize the design with two objectives in mind. First, we want the experiment to better complement the observational data, allowing for significant gains in estimation precision. At the same time, we want the experiment to be usable in its own right, and so we seek to impose guardrails on the design such that the experiment will yield sufficiently precise causal estimates even if we do not elect to include the observational study in the final estimation.
The remainder of this paper proceeds as follows. In Section 2, we define our problem and introduce notation and assumptions. Section 3 introduces our choice of shrinkage estimator, κ2, and demonstrates how to compute its risk efficiently. This section also discusses three different heuristics under which analysts can design prospective experiments with the intent of leveraging κ2 on the final results, while also protecting the utility of the experiment on its own. Section 4 contains two simulation studies, highlighting the risk improvements that can be attained by designing toward shrinkage. Section 5 concludes.

Notation
We operate in the stratified setting, with fixed strata k = 1, . . ., K defined by covariates. Such strata could be defined by subject matter knowledge, or they could be defined based on application of a modern machine learning method for uncovering heterogeneous treatment effects (Hill, 2011; Wager and Athey, 2018). We begin by setting up the problem in full generality, such that we have access to a pilot dataset that could be either observational or experimental.
The pilot dataset comprises nπ total units, indexed by j. With each unit, we associate two binary potential outcomes, Yj(0) and Yj(1) ∈ {0, 1}, representing the unit's outcome in the absence and presence of treatment, respectively. We also define a treatment variable Wj ∈ {0, 1}, representing whether unit j is treated or not, and a vector Xj ∈ R^p representing measured covariates.
The quartets (Xj, Wj, Yj(0), Yj(1)) are sampled i.i.d. from some population distribution Fπ. Denote as Eπ, varπ, and Pπ the expectation, variance, and probability operators under Fπ. Stratum membership is defined by a variable Sj such that Sj = k ⇐⇒ Xj ∈ Xk for some set of covariate values Xk.
We design a future blocked experiment of the same treatment. In the experiment, we will recruit a fixed total of n_rk units for each stratum k, and we will then randomize n_rkt of those units to receive the treatment and n_rkc = n_rk − n_rkt to receive the control. We call the list of tuples d = {n_rkt, n_rkc}_{k=1}^K the "design." We will assume a sample size constraint such that Σ_{k=1}^K (n_rkt + n_rkc) = n_r for some fixed number n_r.
Denote as Fr the sampling distribution for the triplets (Xi, Yi(0), Yi(1)) among units i in the experimental population. We refer to the conditional sampling distribution for units in stratum k as F_{r|Si=k}. The experiment can then be understood as drawing n_rk units from F_{r|Si=k} and then assigning Wi by choosing a simple random sample of size n_rkt from the set of n_rk recruited units. We define as Er, varr, and Pr the expectation, variance, and probability operators over both the sampling and treatment randomization in the experiment.

Assumptions and Loss Function
We make the following standard assumption.
Assumption 1 (Consistency). For each unit ℓ in the pilot study or the future experiment, the observed outcome Yℓ ∈ {0, 1} is given by Yℓ = Wℓ Yℓ(1) + (1 − Wℓ) Yℓ(0); that is, there is only one "version" of the treatment.
A key assumption is that the causal effects across strata are shared between the pilot and future RCT datasets, i.e.

Assumption 2 (Shared stratum effects). For each stratum k, τk ≡ Eπ[Yj(1) − Yj(0) | Sj = k] = Er[Yi(1) − Yi(0) | Si = k].
We will typically invoke a slightly stronger version of Assumption 2, under which the stratum-specific potential outcome means and variances, and not merely the causal effects, are shared across the two data distributions; we refer to this stronger version as Assumption 3.
Though not yet realized, denote as τr the vector of difference-in-means estimates from the future experiment, and denote as τπ the analogous vector of causal estimates from the pilot study.
Our eventual causal estimator, τ ≡ τ(τr, τπ) = (τ1, . . ., τK)^T, will be a function of both τπ and τr. Our target of estimation in this manuscript is not an average treatment effect over the experimental population. Rather, we seek to obtain good estimates simultaneously for all of the stratum causal effects τ. Under Assumption 2, we can define a loss function under which we evaluate τ(τr, τπ). We use the simple, unweighted L2 loss, L(τ, τ(τr, τπ)) = ||τ(τr, τπ) − τ||². We seek to optimize the risk of τ(τr, τπ), conditional on τπ, which yields the risk expression R = Er[ ||τ(τr, τπ) − τ||² | τπ ] (Expression 1).
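As a concrete illustration of these quantities, here is a minimal Python sketch of the stratum difference-in-means estimator and the unweighted L2 loss (function names and the array-based interface are ours, not the paper's):

```python
import numpy as np

def diff_in_means(y, w, s, K):
    """Stratum-wise difference-in-means estimates tau_r (one entry per stratum).
    y: observed outcomes; w: treatment indicators; s: stratum labels in {0,...,K-1}."""
    tau = np.empty(K)
    for k in range(K):
        mask = (s == k)
        tau[k] = y[mask & (w == 1)].mean() - y[mask & (w == 0)].mean()
    return tau

def l2_loss(tau_hat, tau):
    """Unweighted squared L2 loss between an estimate and the true stratum effects."""
    return float(np.sum((tau_hat - tau) ** 2))
```

The risk is then the expectation of `l2_loss` over the experiment's sampling and randomization, holding the pilot estimates fixed.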

Related Problems
Under this framing, we can relate this problem to several problems in causal inference and experimental design. First, suppose that the pilot study is an RCT, such that under Fπ we have Wj ⊥⊥ Yj(0), Yj(1). Then, this problem becomes one of adaptive experimental design, in which a trial is conducted in multiple phases and information from the first phase can be used to design later phases. This problem has a rich history in the literature, stretching back to the foundational work of Thompson (1933, 1935). While Thompson sampling was originally designed for a more generic problem involving maximizing the expected reward, it can be used to estimate average stratum treatment effects, as discussed in Offer-Westort et al. (2021). Adaptive experimental design is an area of active research, though much recent work has focused on methods that define strata in the second phase, rather than taking strata as fixed (Tabord-Meehan, 2018; Bai, 2019). For modern methods that incorporate a fixed stratification scheme, see Hahn et al. (2011) and Chambaz et al. (2014).
Next, suppose that the pilot study is an observational study, but that our ultimate causal estimator τ will simply be τr, the vector of difference-in-means estimators arising from the experiment. This problem is closely related to the survey sampling work of Neyman (1992). Neyman computed the optimal allocation for a stratified survey, under a budget constraint, supposing pilot estimates of variance could be obtained for each stratum. These ideas can easily be extended to the causal inference setting. Here, unbiased pilot stratum variance estimates are obtainable if Wj ⊥⊥ Yj(0), Yj(1) | Xj, that is, if unconfoundedness holds in the observational study (Imbens and Rubin, 2015). In our prior work, Rosenman and Owen (2021), we demonstrate that efficiency gains are possible even in the presence of unmeasured confounding, as long as we can bound the magnitude of the unmeasured confounding using the sensitivity model of Tan (2006). This is because the observational data can indicate parts of the covariate space where variation is higher or lower, and hence, where experimenters should over- or undersample.
Lastly, suppose that the pilot study is an observational study, but that the experimental study is already completed, and our goal is to choose an estimator τ to trade off between causal estimates derived from the two data sources. This is an example of a so-called "data fusion" problem (Bareinboim and Pearl, 2016) that seeks to merge the observational and experimental data directly. Many methods rely on unconfoundedness in the observational study, including one of our prior papers (Rosenman et al., 2018) as well as Athey et al. (2019). Other studies have sought to weaken this condition, and frequently utilize alternative assumptions to proceed with merged estimation. Kallus et al. (2018) assume that the hidden confounding has a parametric structure that can be modeled effectively. Peysakhovich and Lada (2016) propose a method for the case where the observational data are time series and the bias preserves a unit-level rank ordering. For an excellent overview of the available methods in this area, see Colnet et al. (2020).

Principles Guiding Estimator Choice and Experimental Design
In this paper, we will consider the case where the pilot study is an observational study, and the experimental study has yet to be implemented. Hence, we replace the subscript π with o (e.g. defining our vector of pilot study causal estimates as τo rather than τπ). We will choose an estimator τ(τr, τo) for combining the observational and experimental data within each stratum. Then, we will design our experiment explicitly to minimize the risk of this estimator. In this section, we highlight several characteristics of the estimator and experimental design procedure that we consider ideal.
First, we would like our estimator to be robust to unmeasured confounding in the observational study. The assumption of unconfoundedness (roughly, that all variables affecting both the treatment probability and the outcome have been measured in the observational study) is fundamentally untestable, and rarely holds in practice (Imbens and Rubin, 2015). In this setting, note that our problem is somewhat asymmetric: a simple vector of difference-in-means estimates from the experimental study will be unbiased, and will be "good enough" in many cases. Hence, we do not want to incorporate the observational data unless we have strong guarantees that it will reduce statistical risk. Thus, our chosen estimator should not be highly susceptible to bias due to unmeasured confounding in the observational study.
Second, we would like our procedure to generate experiments that are still valid if they are analyzed alone. We term this feature detachability. To motivate this idea, consider an extreme case where we fail to sample any treated or any control units from a particular stratum, under the assumption that the experimental units will simply complement observational units in those strata. Suppose that we later learn of a problem with the observational study that undermines the validity of Assumption 2, e.g., perhaps the data was obtained fraudulently. In this case, we would be out of luck: the experiment has an effective sample size of 0, so we cannot use it on its own. Some measure of safety could be achieved by imposing design restrictions, such as mandating that each stratum must contain a minimum proportion of treated or control units. But we would like our procedure to naturally regularize toward more equal experimental allocations, such that detachability is achieved without such ad hoc rules.
With these features in mind, we briefly discuss the possibility of using simpler data fusion estimators. In Rosenman et al. (2018) we propose a method that is based on simply pooling observational and experimental data within each stratum (termed the "spiked-in" estimator) and also methods based on convex combinations of τo and τr (termed the "weighted average" and "dynamic weighted average" estimators). These estimators were designed under the assumption that unconfoundedness holds in the observational study, and hence they do not exhibit the desired robustness property. Thus, as discussed in the next section, we turn our attention toward estimators based on Empirical Bayes shrinkage.
Designing Toward Shrinkage

Choice of Estimator: κ2

In Green and Strawderman (1991) and Green et al. (2005), the authors consider how to shrink between an unbiased estimator and a biased estimator. Their goal is to derive Empirical Bayes estimators that guarantee a risk reduction relative to using the unbiased estimator alone. The problem turns out to be quite similar to James-Stein estimation.
In Rosenman et al. (2020), we propose two estimators that are extensions of Green and Strawderman's work. The latter estimator, κ2, is constructed such that it shrinks each component of the unbiased estimator toward its counterpart in the biased estimator by a factor inversely proportional to its precision. We consider this property desirable in many applied settings. Moreover, this estimator typically outperformed competitor estimators, including those proposed by Green and Strawderman, in our applied data analysis. Hence, we proceed in this manuscript with the intention of using κ2 to shrink between our RCT and observational study estimators. We discuss the use of alternative estimators in the Appendix.
Let τr ∈ R^K be the unbiased (RCT) estimator and τo ∈ R^K be the biased (observational study) estimator. Denote as Σr = diag(σ²_rk) ∈ R^{K×K} the diagonal covariance matrix of τr. Under mild conditions, we can assume τr ∼ N(τ, Σr) (see e.g. Li and Ding, 2017), where τ is the vector of true causal effects. Our estimator κ2 shrinks each component of τr toward its counterpart in τo by a factor proportional to σ²_rk; its precise form is given in Rosenman et al. (2020). We can compute Er(L(τ, κ2)), the risk of κ2, using results from Strawderman (2003), where the expectation is with respect to τr only, and τo is treated as a constant vector. As shown in Rosenman et al. (2020), the estimator is guaranteed to dominate τr under the squared-error loss L(τ, κ2) = ||κ2 − τ||², as long as a condition on the components of Σr (Condition 2) is satisfied.
We will also frequently consider the positive part analogue of κ2, denoted κ2+, which constrains the shrinkage estimator from "over-adjusting" by not allowing it to put a negative coefficient on τo. The risk of κ2+ is guaranteed to be strictly lower than that of κ2. In simulations in Rosenman et al. (2020), κ2+ routinely dominated τr even when Condition 2 was not met.

Exact Risk Calculation Under Known Parameters
Our goal is to optimize the experimental design over the risk given in Expression 1.
We suppose, for a moment, that the quantities Σr and ξ = Er(τr − τo) are known.
Expression 1 can then be rewritten as a difference of first moments of ratios of quadratic forms in normal random variables. Exact expressions for these moments are given in Bao and Kan (2013). The resulting integrals can be computed via numerical integration, yielding an efficient evaluation of the risk for each possible choice of the parameter values.
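Since the closed-form integrals of Bao and Kan (2013) are not reproduced here, a plain Monte Carlo approximation serves as a stand-in sketch: draw τr from its assumed normal distribution and average the loss of any candidate combined estimator. This sketch assumes a diagonal Σr and takes the estimator as a user-supplied function; all names are ours:

```python
import numpy as np

def mc_risk(estimator, tau, Sigma_r, tau_o, n_draws=100_000, seed=0):
    """Monte Carlo approximation of the risk: draw tau_r ~ N(tau, Sigma_r),
    apply the user-supplied estimator, and average the squared L2 loss.
    Assumes Sigma_r is diagonal; `estimator` maps (tau_r, tau_o, Sigma_r)
    to the combined estimate. tau_o is held fixed, as in the text."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.diag(Sigma_r))
    draws = tau + rng.standard_normal((n_draws, len(tau))) * sd
    losses = [np.sum((estimator(t, tau_o, Sigma_r) - tau) ** 2) for t in draws]
    return float(np.mean(losses))
```

With the identity estimator (returning τr untouched), the computed risk approaches tr(Σr), matching the first term of the risk decomposition.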

Design Options
In this section, we consider several heuristics for designing the experiment in the absence of perfect knowledge of Σr and ξ.

Neyman Allocation
Because the observational data is not randomized, we will need to conduct some form of statistical adjustment to remove confounding bias in τo. In this manuscript, we will use stabilized inverse probability of treatment weighting (SIPW) as our adjustment method. Briefly, this involves estimating the propensity score, the probability of treatment in the observational study, as a function of the observed covariates, and scaling the observed outcomes by the inverse of each unit's estimated propensity score. For more details, see Imbens and Rubin (2015).
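A minimal sketch of the SIPW adjustment within a single stratum, assuming propensity scores have already been estimated (the function name and interface are ours):

```python
import numpy as np

def sipw_estimate(y, w, e_hat):
    """Stabilized IPW difference-in-means within one stratum.
    y: observed outcomes; w: treatment indicators;
    e_hat: estimated propensity scores P(W = 1 | X)."""
    wt_t = w / e_hat
    wt_c = (1 - w) / (1 - e_hat)
    # Stabilization: normalize each arm's weights so they sum to one.
    mu_t = np.sum(wt_t * y) / np.sum(wt_t)
    mu_c = np.sum(wt_c * y) / np.sum(wt_c)
    return mu_t - mu_c
```

When the propensity score is constant (as in a randomized design), this reduces to the plain difference in means.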
Once the adjustment is made, a simple approach to the experimental design problem is to assume that there is no residual unmeasured confounding in the observational study. Importantly, we can make this assumption for the purposes of design only. The unconfoundedness assumption need not be strictly true to ensure good performance of κ2 once our experiment is completed. The shrinkage properties of κ2 ensure that its risk will be lower than that of τr as long as Condition 2 is met, irrespective of the presence of residual confounding in the observational study. Hence, we retain the implicit guarantee against a risk increase when it comes to inference, even if this assumption turns out to be incorrect.
If we are willing to suppose that Assumption 3 holds, as well as unconfoundedness, we can use the observational study to unbiasedly estimate the mean and variance of the potential outcomes in each stratum. Denote the estimates obtained from the observational data as µ̂_kt, µ̂_kc, σ̂²_kt, and σ̂²_kc. The simplest design heuristic is then to use a Neyman allocation (Splawa-Neyman et al., 1990) without a cost constraint, e.g. allocating units such that n_rkt ∝ σ̂_kt and n_rkc ∝ σ̂_kc, subject to the total sample size constraint.
Such a design would be optimal if the risk of our estimator were only tr(Σr)/K, the first term in Expression 1. Though the design does not directly optimize over the shrinkage portion of the risk, it serves as a reasonable starting point. As we will see in Section 4, it also typically yields good performance for κ2 in simulations.
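Under the standard cost-free Neyman solution, stratum-arm sample sizes are proportional to the estimated standard deviations. A sketch might look like the following; the rounding scheme (largest fractional parts) is a naive choice of ours, not prescribed by the paper:

```python
import numpy as np

def neyman_allocation(sd_t, sd_c, n_r):
    """Neyman-style allocation: each stratum-arm sample size proportional to its
    estimated potential-outcome standard deviation, summing to the budget n_r.
    sd_t, sd_c: length-K arrays of estimated standard deviations per arm."""
    sds = np.concatenate([sd_t, sd_c])
    raw = n_r * sds / sds.sum()
    n = np.floor(raw).astype(int)
    # Distribute the remaining units to the cells with largest fractional parts.
    for i in np.argsort(raw - np.floor(raw))[::-1][: n_r - n.sum()]:
        n[i] += 1
    K = len(sd_t)
    return n[:K], n[K:]  # (treated, control) allocations per stratum
```

With equal standard deviations across all stratum-arms, this reduces to the equal allocation.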

Heuristic Optimization Assuming ξ = 0
Per the discussion in Section 3.2, we can compute the risk exactly if both Σr and ξ = Er(τr) − τo are known. Under Assumption 3 and unconfoundedness, Σr can be estimated unbiasedly for any choice of d = {(n_rkt, n_rkc)}_k. However, ξ may be nonzero even if unconfoundedness holds, because we consider τo to be a fixed draw from the observational distribution, rather than a random variable.
In this section, we make the additional assumption that ξ = 0. Then, we have all the necessary parameter estimates to optimize R(κ2) over the choice of design d, subject to the sample size constraint (Optimization Problem 4). Unfortunately, R(κ2) is not convex in d. However, Optimization Problem 4 can be approximately solved using a greedy algorithm. Define dj as the allocation of RCT units to strata and treatment levels at iteration j of the algorithm. Next, define Dj = {d | d swaps exactly one unit across strata or treatment levels from dj}.
Because there are K strata and two treatment levels, the "swap set" Dj will typically contain 2K × (2K − 1) possible allocations. This cardinality can be reduced by imposing additional constraints, as will be discussed in Section 3.4.
Define R(d, V, ξ) as the value of R(κ2) evaluated under the design d with estimated stratum potential outcome variances V and error vector ξ. We will evaluate R(d, V, ξ) under the estimated variances V⋆ = {σ̂²_kt, σ̂²_kc}_{k=1}^K and ξ = 0. Now, we can approximately solve Optimization Problem 4 using Algorithm 5: beginning from an initial design, at each iteration we evaluate R(d, V⋆, 0) for every d ∈ Dj, move to the risk-minimizing design, and terminate when no swap reduces the risk. In words, Algorithm 5 will continue to swap units between strata and treatment levels until no swap will further reduce the estimated risk of the shrinkage estimator. The algorithm naturally enforces the sample size constraint and ensures that the returned values will be integers.
Because the objective is non-convex, it is plausible that Algorithm 5 could get stuck at a local optimum. Practically, we recommend running the algorithm from a few different starting points (e.g. an equally allocated design, the Neyman allocation, and several randomly chosen designs), and choosing the design which achieves the minimum final value of the risk. While this approach is not guaranteed to find a global optimum, it will nonetheless find a point with a reduced value of the objective function. If the relevant assumptions hold, this point will also be guaranteed to improve efficiency relative to a Neyman allocation.
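A sketch of the greedy search described above, with the risk function abstracted as a callable so the same routine can serve the ξ = 0 and worst-case variants; returning +inf from the risk function is one way to encode invalid designs, as discussed in Section 3.4. All names and the array layout are ours:

```python
import itertools
import numpy as np

def greedy_design(risk_fn, d0, max_iter=10_000):
    """Greedy coordinate search over designs (a sketch of Algorithm 5).
    d0: array of shape (K, 2) with (treated, control) counts per stratum.
    risk_fn: maps a design to its estimated risk, e.g. R(d, V, xi);
    it may return +inf for invalid designs to encode guardrails."""
    d = d0.copy()
    best = risk_fn(d)
    for _ in range(max_iter):
        improved = False
        cells = list(itertools.product(range(d.shape[0]), range(2)))
        for (src, dst) in itertools.permutations(cells, 2):
            if d[src] == 0:
                continue  # no unit available to move out of this cell
            d2 = d.copy()
            d2[src] -= 1  # swap one unit from src ...
            d2[dst] += 1  # ... to dst; total sample size is preserved
            r = risk_fn(d2)
            if r < best:
                d, best, improved = d2, r, True
        if not improved:
            break  # no single swap reduces the risk
    return d, best
```

Note that each accepted swap strictly reduces the objective, so the procedure cannot cycle and must terminate at a swap-local optimum.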

Heuristic Optimization Assuming Worst-Case Error Under Γ-Level Unmeasured Confounding
The assumption that ξ = 0 is fundamentally optimistic: it is unlikely to hold even in the absence of unmeasured confounding. If there are unmeasured confounders, it may be far from the truth. We can take a more defensive approach by imposing a sensitivity model on the observational study, and optimizing under the worst-case choice of ξ.
As in Rosenman and Owen (2021), we constrain the magnitude of the unmeasured confounding by imposing the marginal sensitivity model of Tan (2006). Under this model, a key odds ratio, between the treatment probability conditional on the potential outcomes and covariates and the treatment probability conditional on covariates only, is bounded between 1/Γ and Γ for a user-chosen parameter Γ ≥ 1. The Tan model can be seen as extending the popular Rosenbaum sensitivity model (Rosenbaum, 1987) to the setting of inverse probability weighting. In practice, Γ can be chosen by computing the treatment odds ratio with and without conditioning on each of the measured covariates, and choosing the maximum.
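Bounding this odds ratio by [1/Γ, Γ] translates an estimated propensity score into an interval for the complete-data treatment probability. A small sketch of that mapping (the function name is ours; this illustrates the bound, not the full confidence-set machinery of Rosenman and Owen (2021)):

```python
def propensity_bounds(e_hat, gamma):
    """Interval for the treatment probability conditional on potential outcomes
    and covariates, under Tan's marginal sensitivity model: its odds ratio with
    the covariate-only propensity e_hat lies in [1/gamma, gamma]."""
    odds = e_hat / (1.0 - e_hat)
    lo_odds, hi_odds = odds / gamma, odds * gamma
    # Convert the bounded odds back to probabilities.
    return lo_odds / (1.0 + lo_odds), hi_odds / (1.0 + hi_odds)
```

At Γ = 1 the interval collapses to the estimated propensity itself, recovering the unconfounded case.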
Under Assumption 3, the methods of Rosenman and Owen (2021) yield confidence intervals within which the parameters µ_kt and µ_kc reside with at least 1 − α probability, as long as the true confounding structure lies within the sensitivity model parameterized by Γ. The method relies on convex optimization and the bootstrap in order to generate valid confidence sets.
Because our outcome is binary, we can then use the confidence sets on µ_kt and µ_kc to obtain valid confidence sets for the stratum potential outcome variances σ²_kt and σ²_kc via the relations σ²_kt = µ_kt(1 − µ_kt) and σ²_kc = µ_kc(1 − µ_kc). For a full justification, see Rosenman and Owen (2021). Putting these ideas together: under our calibrated choice of Γ and a reasonable choice of α, we set each entry of the error to its worst-case value under our sensitivity model, and collect these quantities into a vector ξ′. Next, we collect the corresponding worst-case values of the variances into a matrix V′. Finally, we can evaluate our function R(d, V′, ξ′) to obtain the risk of κ2 for any experimental design d under these parameters. The procedure is henceforth exactly analogous to the one used in the prior section: we run Algorithm 5, substituting R(d, V′, ξ′) for R(d, V, 0), and obtain the design that yields the lowest value of the risk.
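For a binary outcome, the relation σ² = µ(1 − µ) maps an interval for µ directly to an interval for σ²; since µ ↦ µ(1 − µ) peaks at µ = 1/2, the largest variance is attained at the interval point closest to 1/2. A sketch (function name ours):

```python
def variance_bounds(mu_lo, mu_hi):
    """Bounds on sigma^2 = mu * (1 - mu) when mu lies in [mu_lo, mu_hi].
    The map mu -> mu(1 - mu) is concave with its peak at mu = 1/2, so the
    maximum is 1/4 if the interval contains 1/2 and is otherwise attained
    at the endpoint nearest 1/2; the minimum is attained at an endpoint."""
    f = lambda m: m * (1.0 - m)
    hi = 0.25 if mu_lo <= 0.5 <= mu_hi else max(f(mu_lo), f(mu_hi))
    lo = min(f(mu_lo), f(mu_hi))
    return lo, hi
```

These endpoint variances are the entries one would collect into the worst-case matrix V′.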

Imposing Guardrails on Designs
Algorithm 5 can easily incorporate further restrictions on the set of possible designs. The simplest method is to set the risk computation for any invalid design equal to infinity. This will force the algorithm not to choose that design on the next iteration.
Practically speaking, we suggest incorporating three constraints on the set of possible designs. First, we suggest imposing a minimum sample size constraint such that the allocations to any stratum and treatment group cannot be lower than some value SSmin. This constraint serves two purposes. First, the risk expression in Expression 1 is dependent on the normality of τr. In simulations, we find that the risk expression is quite robust to modest deviations from normality, but sample sizes could plausibly become so small that a Central Limit Theorem need not hold even approximately. Second, this constraint naturally helps improve detachability, since it prevents the variance of any entry of τr from growing too large.
Second, we suggest imposing an explicit detachability constraint. Inherent in this constraint is the idea that an analyst wants to ensure that the experiment can be analyzed on its own in case stakeholders ultimately decide not to use the observational data at all. One way to impose this constraint is to assume the observational study point estimates of the strata potential outcome variances are correct, and to choose some baseline design d̃ = {ñ_rkt, ñ_rkc}_k, e.g. equal allocation or Neyman allocation. We then invalidate any design d′ for which the estimated risk of τr under design d′ is larger than the estimated risk under the baseline design by a factor greater than δ_d, where δ_d ≥ 1 is a user-chosen tolerance parameter.
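The point-estimate detachability check can be sketched as a simple risk-ratio test against the baseline design, using the usual variance formula for stratum difference-in-means (function name and array layout are ours):

```python
import numpy as np

def detachable(d, d_base, var_t, var_c, delta):
    """Point-estimate detachability check: the unbiased estimator's estimated
    risk under design d must not exceed delta times its risk under d_base.
    Designs are (K, 2) arrays of (treated, control) counts; var_t and var_c
    are estimated stratum potential-outcome variances."""
    def risk(des):
        # Variance of the stratum difference-in-means, summed over strata.
        return float(np.sum(var_t / des[:, 0] + var_c / des[:, 1]))
    return risk(d) <= delta * risk(d_base)
```

Designs failing this check would be assigned infinite risk inside Algorithm 5, which excludes them from the greedy search.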
A more robust version of the detachability constraint can be imposed if the analyst assumes the Tan sensitivity model, as discussed in Section 3.3.3. Under a given choice of Γ and α, we can obtain bounds on the potential outcome means in each stratum. We then reject any design d′ such that, for some configuration of the potential outcome means consistent with our sensitivity bounds, the risk of τr under design d′ is larger than the estimated risk under the default design by a factor greater than δ_d (Inequality 6). Practically, the left-hand side of Inequality 6 can be reduced to a quadratic fractional programming problem and solved via Dinkelbach's method (Dinkelbach, 1967).

Third, we suggest imposing a risk reduction constraint. This means that we enforce Condition 2 explicitly by not allowing the optimization algorithm to choose allocations under which it fails to hold. Like the detachability constraint, we can impose either a point estimate version or a robust version. The point estimate version invalidates any design under which Condition 2 fails when evaluated at the point estimates. The robust version incorporates the parameter bounds from the Tan sensitivity model, rejecting any design under which Condition 2 can fail for some parameter configuration consistent with the bounds.

Simulation Study

Simulation Set-Up
We pattern the simulations on those in Rosenman et al. (2018), considering a situation with a relatively rare, binary outcome and a modest effect size. We first generate an observational super-population of 1 × 10^6 units, and an experimental super-population of the same size. We suppose that the completed observational study comprises 20,000 units, sampled a single time from the corresponding super-population. The prospective RCT comprises 1,000 units, which will be sampled repeatedly from the experimental super-population.
We define j ∈ O as the indexing variable for the observational super-population and i ∈ E as the indexing variable for the experimental super-population. We use ℓ as an index over both populations. For each unit ℓ ∈ O ∪ E, we suppose there is a covariate vector X_ℓ ∈ R^5, where X_ℓ ~ N(0, Σ) i.i.d., for Σ such that each covariate has unit variance and roughly a quarter of the covariances are +0.1, roughly a quarter are −0.1, and the remainder are 0. Such a covariance structure is roughly consistent with the applied data analysis from the Women's Health Initiative, as used in Rosenman et al. (2018).
The untreated potential outcomes Y_ℓ(0) are sampled as independent Bernoulli random variables whose success probabilities depend on X_ℓ through a coefficient vector β and noise ε_ℓ, where the ε_ℓ are generated as IID N(0, 1) random variables and the intercept α is chosen such that the average incidence rate is 10%. The treatment variables in the observational study are independent Bernoulli random variables whose success probabilities depend on X_ℓ through a coefficient vector γ. The nonzero inner product between β and γ induces strong selection bias in the observational study, with treated units likelier to have larger untreated potential outcomes.
For both datasets, we suppose the data is split into twelve strata based on the first and second covariates. The strata boundaries are defined by the 25th and 50th quantiles of the first covariate and by the quartiles of the second covariate. We seek to obtain estimates of the average causal effect in each of the resultant strata.
In both datasets, we assign individual treatment effects to match three different structures. We number the strata from k = 1, . . ., 12 such that stratum 1 corresponds to the lowest stratum on both covariates, stratum 2 corresponds to the lowest stratum on the first covariate and the second quartile of the second covariate, stratum 3 corresponds to the lowest stratum on the first covariate and the third quartile of the second covariate, etc. The values of τ_k, the stratum average treatment effects, are set to be constant, linear, and quadratic functions of the stratum index k under the constant, linear, and quadratic strata-level treatment effect models, respectively. In each case we choose the scale T > 0 so that Cohen's d (Cohen, 1988) in the observational study precisely equals 0.5, which Cohen calls a medium effect size. Denote as n_k the total number of super-population units that fall into a given stratum. Given the τ_k, the treated potential outcomes are assigned by randomly selecting τ_k × n_k units for which Y_ℓ(0) = 0, and setting Y_ℓ(1) = 1 for those units. For all of the remaining units, Y_ℓ(0) = Y_ℓ(1). Because we use the same process across both super-populations, we enforce Assumption 3, such that the stratum-specific causal effects (as well as potential outcome means and variances) are shared under the observational and RCT data distributions.
The observational data is sampled a single time. Next, leveraging the observational sample, we compute the allocations of units to strata in the RCT (the RCT "design") using each of the methods discussed in the prior section. For each possible design, we sample the RCT units from the super-population 5,000 times. For each iteration, for each stratum k, we assume treatment is assigned via a simple random sample of n_rkt units out of the n_rkt + n_rkc units recruited for the stratum. Once the units are drawn and treatments are assigned, we compute the estimators τr, κ2, and κ2+. We compute the L2 distance between each estimate and the true treatment effects τ, and take the average over all 5,000 simulations.

Ideal Case: No Unmeasured Confounding
We begin with the simplest case: we suppose all of the covariates are measured in the observational study, so there is no residual unmeasured confounding. This is an idealized case, in which all of the selection bias in the observational study can be removed with a statistical adjustment. We fit a propensity score to the observational study data and use stabilized inverse probability of treatment weighting (SIPW) to compute the observational causal estimates.
In Table 1, we show the average L2 errors over 5,000 simulations. We consider the equal allocation and Neyman allocation designs, as well as the "naive" design assuming ξ = 0. We also consider the defensive approach discussed in Section 3.3.3, and compute the optimal design under errors computed with the Tan sensitivity model parameter set to 1.0, 1.1, 1.2, and 1.5. Lastly, we consider an "oracle" design, in which we run Algorithm 5 but provide it with the true values of the potential outcome variances and the error in the observational study. Results are given for the three different treatment effect models: constant (c), linear (ℓ), and quadratic (q). Risk estimates are expressed as percentages of the risk of τr when using an equal allocation for the given treatment effect model.
Across the three treatment effect models, we see that results are relatively consistent in the ordering of the estimators. The Neyman allocation always performs best when using τr alone, realizing an 8 to 14% error reduction relative to the equal allocation. The other allocations typically achieve more modest error reductions when using τr.
If we stick with the Neyman allocation but switch to using either the shrinkage estimator κ2 or its positive-part analogue κ2+, we can realize large additional error reductions, on the order of 45 to 60%, depending on the treatment effect model. However, the naïve allocation is typically even better, yielding improvements of a few additional percentage points in error. The naïve allocation gives the best performance with κ2 and κ2+ in the constant and quadratic treatment effect models, with the robust allocations typically doing nearly as well. In the linear treatment effect model, we do best using a Γ = 1.1 robust design for κ2 and a Γ = 1.0 robust design for κ2+, with the naïve allocation performing nearly as well.

Practical Case: Unmeasured Confounding
The assumption of unconfoundedness is not defensible in many practical settings, in which only a relatively sparse set of covariates is measured in the observational study. Moreover, the assumption is not testable. Hence, we run a second set of simulations in which we induce confounding by assuming that the third entry of xi is not measured. This third covariate affects both treatment probabilities and outcomes, so bias in the observational study cannot be fully corrected with a propensity score adjustment.
Results are given in Table 2. They differ slightly from the simulations assuming no unmeasured confounding. When using κ2 and κ2+, the best non-oracle designs are the robust designs assuming Γ-level unmeasured confounding for Γ ∈ {1.0, 1.1, 1.2}. The design under Γ = 1.0 does best in the constant and linear cases, while the design under Γ = 1.2 does best in the quadratic case. The naïve allocation still performs quite well, with risk typically only a few percentage points worse than that of the best performer.
These results point to a few practical guidelines for designing toward shrinkage. First, we find that the naïve allocation is quite robust, even when ξ is far from zero. This is evident from the fact that the naïve allocation always yields significant performance gains over the equal and Neyman allocations when using a shrinkage estimator, even when unmeasured confounding is present.
Second, in the presence of unmeasured confounding, one may achieve modest further improvements by enforcing a robust design at a relatively low value of Γ. In this example, we have not constructed the confounding to match the form of our sensitivity model. Exclusion of the third covariate induces enormous discrepancies in the treatment odds between the true and estimated propensity scores for a small proportion of individuals: for about 0.2% of the super-population units, the difference exceeds a multiplicative factor of 100. For most of the population, however, the true and estimated treatment odds differ by a much smaller factor. The Tan model imposes a worst-case bound on the deviation between the true and estimated odds of treatment, so no value of Γ between 1.0 and 2.0 is large enough to account for our most extreme deviations.
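To make the odds comparison concrete, a small sketch (illustrative names) computing the fraction of units whose true treatment odds deviate from the estimated odds by more than a factor Γ, i.e. units falling outside the Tan model at level Γ:

```python
import numpy as np

def odds(p):
    return p / (1 - p)

def tan_violation_rate(e_true, e_hat, gamma):
    """Fraction of units whose true-vs-estimated treatment odds
    ratio lies outside [1/gamma, gamma], the bound imposed by the
    Tan marginal sensitivity model at level gamma."""
    ratio = odds(e_true) / odds(e_hat)
    return float(np.mean((ratio > gamma) | (ratio < 1.0 / gamma)))
```

In the simulation above, a heavy-tailed deviation pattern yields a small but nonzero violation rate at every Γ in [1.0, 2.0].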
Nonetheless, the magnitude of the worst-case error under the Tan model correlates reasonably well with the true values of ξ when choosing Γ = 1.0, 1.1, or 1.2, offering one explanation for the strong performance of the allocations designed under these schemes. In a more general sense, the robust allocations under Γ serve as a form of regularization, bringing the design closer to an equal allocation to hedge against the possibility that τo is far from τ. In doing so, the robust designs may improve empirical performance even when the Tan model does not accurately characterize the form of unmeasured confounding.

Discussion
We have considered the problem of designing a stratified experiment when we plan to use an Empirical Bayes estimator, κ2, to shrink the stratum causal estimates toward estimates previously obtained from an observational study. As we have shown, the risk of κ2 can be computed explicitly if the stratum potential outcome variances and the stratum-specific errors of the observational study estimates, ξ, are known. In the absence of such information, we have proposed three heuristics for designing the experiment: Neyman allocation; "naïve" allocation assuming ξ = 0; and robust allocation under the worst-case value of ξ given a sensitivity model. We have also emphasized "detachability," the ability to conduct sound causal inference using the RCT on its own, and suggested imposing constraints on the experimental designs to ensure detachability is preserved.
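As a rough illustration of the shrinkage idea (not the exact κ2 estimator of Rosenman et al. (2020), which chooses its shrinkage factors data-adaptively), each RCT stratum estimate can be pulled toward its observational counterpart by an amount proportional to its sampling variance:

```python
import numpy as np

def shrink_toward_obs(tau_r, tau_o, var_r, lam):
    """Component-wise shrinkage: stratum estimates with larger
    sampling variance are pulled harder toward the observational
    estimates. `lam` is a hypothetical shrinkage scale; clipping
    the weights at 1 mimics a positive-part estimator."""
    weights = np.clip(lam * var_r, 0.0, 1.0)
    return tau_r - weights * (tau_r - tau_o)
```

A stratum measured precisely in the RCT (small var_r) keeps its experimental estimate nearly unchanged, while a noisy stratum borrows heavily from the observational study.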
We simulated a realistic scenario in which we have access to a large observational database and are interested in a relatively rare outcome. We considered a stratification yielding twelve strata, in which the outcome frequency varies dramatically across the strata. In this setting, there were significant risk improvements to be realized by using a shrinkage estimator. Additional gains were possible when switching from an equally allocated RCT to one designed explicitly for use with a shrinkage estimator. In simulations with and without unmeasured confounding in the observational study, we saw that the naïve allocation typically performed well, exhibiting surprising robustness to the case when ξ was far from zero. Our robust allocations outperformed when unmeasured confounding was present in the data.
There are many plausible extensions to this line of work. We have assumed a rigid model for treatment effect heterogeneity in this paper: treatment effects vary according to a known stratification on observed covariates. More modern work (Lee et al., 2021; Nie and Wager, 2021) focuses on estimating heterogeneous treatment effects empirically, allowing for greater flexibility in how the effects differ across units. Incorporating such ideas into this work, we might imagine using some of the observational data in a first stage to estimate a stratification scheme, and using the remaining data for shrinkage in a second stage. Alternatively, we could consider modeling the causal effect explicitly as a function of the covariates, e.g. defining τ(Xi) rather than a vector of true treatment effects τ. We could then define a flexible shrinker to trade off between estimates not within strata, but within nearby values of the covariates themselves. More broadly, this process can be repeated for any estimator θ: once integral expressions for the risk have been obtained, the design heuristics described in Section 3.3 can be deployed, using R(θ) as the objective function rather than R(κ2).

Table 1 :
Risk over 5,000 iterations of τr, κ2, and κ2+ under various experimental designs, in the case of no unmeasured confounding in the observational study. Risks are expressed as a percentage of the risk of τr using an equally allocated experiment, for each of the three treatment effect models. The minimum non-oracle risk in each row is denoted with an underline.

Table 2 :
Risk over 5,000 iterations of τr, κ2, and κ2+ under various experimental designs, in the case of unmeasured confounding in the observational study via failure to measure the third covariate. Risks are expressed as a percentage of the risk of τr using an equally allocated experiment, for each of the three treatment effect models. The minimum non-oracle risk in each row is denoted with an underline.