Robust Adaptive Incorporation of Historical Control Data in a Randomized Trial of External Cooling to Treat Septic Shock

Abstract. This paper proposes a randomized controlled clinical trial design to evaluate external cooling as a means to control fever and thereby reduce mortality in patients with septic shock. The trial will include concurrent external cooling and control arms while adaptively incorporating historical control arm data. Bayesian group sequential monitoring will be done using a posterior comparative test based on the 60-day survival distribution in each concurrent arm. Posterior inference will follow from a Bayesian discrete time survival model that facilitates adaptive incorporation of the historical control data through an innovative regression framework with a multivariate spike-and-slab prior distribution on the historical bias parameters. For each interim test, the amount of information borrowed from the historical control data will be determined adaptively in a manner that reflects the degree of agreement between historical and concurrent control arm data. Guidance is provided for selecting Bayesian posterior probability group-sequential monitoring boundaries. Simulation results elucidating how the proposed method borrows strength from the historical control data are reported. In the absence of historical control arm bias, the proposed design controls the type I error rate and provides substantially larger power than reasonable comparators, whereas in the presence of bias of varying magnitude, type I error rate inflation is curbed.


Background
The methodology described in this paper was motivated by the problem of designing a randomized controlled trial to evaluate the effectiveness of using external cooling to control fever and thereby reduce mortality in patients with septic shock admitted to a hospital intensive care unit (ICU). Septic shock is a complex, life-threatening condition involving sepsis, which is an inflammatory immunological response to infection, combined with hypotension, which is dangerously low blood pressure (Angus and van der Poll, 2013; Singer et al., 2016). The 60-day mortality rate for patients with septic shock admitted to an ICU is about 40% (Annane et al., 2003; Caironi et al., 2014; Asfar et al., 2014).
Standard ICU treatment for septic shock may involve antibiotics or surgery for infection, infusion of fluids, and a vasopressor to maintain the patient's mean arterial pressure (MAP) at a minimal level to sustain organ function (Dellinger et al., 2013). The patient's vital signs are monitored continuously. The attending physician repeatedly adjusts the vasopressor dose to keep the patient's MAP in a targeted range, e.g., 65-70 mm Hg. The immediate goal is to achieve septic shock reversal, defined as the patient sustaining a MAP within or above the targeted range for 24 hours without vasopressor administration. There is no consensus among intensive care physicians on the ideal MAP range, or whether this range should be higher for patients with a history of hypertension. A randomized controlled trial reported by Asfar et al. (2014) comparing the target ranges 80-85 versus 65-70 mm Hg for MAP in septic shock patients with a history of hypertension admitted to an ICU found no significant difference in either 28-day or 90-day mortality rates. While the target MAP range will not be the focus of the trial described in this paper, the trial reported by Asfar et al. (2014) is our source for historical control data.
Septic shock often is accompanied by high fever, which may be controlled by external cooling, done either by covering the patient with a blanket containing circulating cold water, or applying ice packs directly to the patient's body. A clinical trial ("Sepsiscool") that randomized 101 febrile septic shock patients to standard treatment augmented by external cooling and 99 to standard treatment alone, reported by Schortgen et al. (2012), found statistically significant evidence at the 0.05 level that external cooling reduced vasopressor requirements. The external cooling arm had higher rates of septic shock reversal (87% versus 72%) and 14-day survival (81% versus 66%). Statistically significant evidence was lacking, however, for a survival benefit at ICU discharge (65% versus 57%) and at hospital discharge (57% versus 52%). While fever control has circulatory benefits, fever is beneficial for fighting the underlying infection. Therefore, it is unclear whether fever control is beneficial or harmful in terms of survival, and the use of external cooling remains controversial (Shime et al., 2013). This background motivates the trial described here, which will compare C = standard of care to E = standard of care augmented with external cooling started immediately upon randomization. The primary goal is to obtain a confirmatory evaluation of E versus C on survival during the 60-day period following randomization. The choice of a 60-day follow-up is motivated by the fact that, based on historical data and clinical experience, the great majority of deaths in each arm will occur within 30 days.
In this paper, we present a Bayesian group-sequential test to compare E to C in terms of their 60-day survival distributions. We use posterior probability stopping boundaries that correspond to established frequentist group sequential boundaries (Shi and Yin, 2019; Proschan et al., 2006). Statistical inference follows from a Bayesian hierarchical discrete time survival regression model that facilitates robust adaptive incorporation of the historical control data from the trial reported by Asfar et al. (2014) into the group sequential posterior probabilities. Our model assumes first-order intrinsic Gaussian Markov Random Field (IGMRF) priors for the hazard of death on each day of the observation period in each arm (Besag et al., 1991; Rue and Held, 2005). Compared to parametric survival regression models and Cox's proportional hazards model, our model provides greater flexibility for estimating the survival distribution on the domain [0, 60] days in each arm, and thus for estimating the difference in 60-day survival distributions between the two arms robustly. Our proposed comparator is more general than, but encompasses, 60-day restricted mean survival time (RMS), which is the number of days that a patient is expected to survive during the 60-day period following randomization after admission to an ICU with septic shock (Royston and Parmar, 2013). We show that our proposed comparative testing and modeling framework, which requires neither proportional hazards nor the absence of historical trial effects, provides greater power than the logrank test under the targeted benefit of E over C, even when not incorporating historical control data.
Our modeling approach for adaptively incorporating the historical control data into the group sequential tests is related to previous work on commensurate priors, see, e.g., Hobbs et al. (2011, 2012) and Murray et al. (2015). Power priors (Ibrahim and Chen, 2000) and multi-source exchangeability models (Kaizer et al., 2018) offer alternative approaches for incorporating historical data, although neither approach has been implemented with discrete time survival outcomes. The commensurate prior approach relies on hierarchically modeling data on the prospective and historical control arms while adaptively borrowing strength via priors that facilitate shrinking analogous parameters across data sources toward one another. Deviating from this strategy, we develop a regression model that explicitly parameterizes historical bias, and data-adaptively shrinks these bias parameters to zero using spike-and-slab priors. This framework facilitates robust adaptive incorporation of historical control data, based on the observed magnitude of historical bias, as the trial data accrue. Key innovations are the reformulation of commensurate priors through explicit parameterization of historical bias, development of a commensurate prior discrete time survival modeling framework, and formulation of a commensurate prior for a parameter following a GMRF prior.
In Section 2, we present our comparative test, which is based on a summary of the posterior 60-day survival distributions in the two arms. In Section 3, we describe the regression model used to evaluate our comparative test. In Section 4, we discuss posterior estimation using a Gibbs sampler with Pólya-Gamma latent variables (Polson et al., 2013). In Section 5, we address important design considerations, including specification of the target benefit of E relative to C and the Bayesian posterior probability monitoring boundaries to ensure type I error rate control. In Section 6, we report the results of a computer simulation study that evaluates performance of various design implementations over a range of differences between the hazards of death for the historical and concurrent controls. In Section 7, we conclude with a brief discussion.

Comparative Test
For the primary analysis, we will follow patients who remain alive for 60 calendar days from randomization, with primary outcome variable Y* being the follow-up day of death. Thus, Y* has discrete support {1, . . ., 60, > 60} days during the follow-up period, with Y* > 60 if the patient is alive at the end of the 60-day follow-up. The discrete support arises from the fact that the time of randomization and the time of death each are recorded as a calendar date. If a patient dies during the calendar day on which they are randomized, then Y* = 1; whereas, if a patient dies during the calendar day after they are randomized, then Y* = 2; and so on. While Y* is unknown for any patient who is alive at the end of the 60-day follow-up, it is partially observed with Y* > 60. We use an asterisk to distinguish this from the observed time Y, which is subject to additional right-censoring mechanisms arising from loss to follow-up and from interim data analyses being carried out while some participants are in the midst of follow-up. We will provide the formal connection between Y* and Y in Section 3, where the modeling framework is described.
The primary goal of the trial will be to evaluate the clinical effectiveness of E compared to C for reducing mortality. To do this, we compare the 60-day survival distributions in E versus C using posterior mean utilities, defined in arm a = E, C as

μ_a = E{U(Y*) | a} = Σ_{t=1}^{60} U(Y* = t) Pr(Y* = t | a) + U(Y* > 60) Pr(Y* > 60 | a),   (1)

where U(·) is the utility function. Formally, the trial will test H_0: μ_E = μ_C against the one-sided alternative H_A: μ_E > μ_C. The utility function U(·) in (1) should be defined to reflect the relative desirability of each possible realization of Y*. In the sequel, we specify U(Y* = t) = t − 1, for t = 1, . . ., 60, but leave R = U(Y* > 60) unspecified subject to R ≥ 60. There are a few notable specifications for R. Taking R = 60, μ_a in (1) reflects 60-day restricted mean survival time in arm a = E, C, which is the number of days a patient is expected to survive during the 60 days following admission to the ICU with septic shock (Royston and Parmar, 2013). Sensitivity of the utility-based comparison in (1) to R = U(Y* > 60) may be assessed by varying R from 60 to ∞ and identifying change points, if any, where the conclusion differs. When the 60-day survival distributions in the two arms are identical, μ_E = μ_C for all R ≥ 60; when the distributions differ, the conclusion may depend on R. The latter scenario includes crossing 60-day survival distributions in which one arm has a better 60-day survival probability but worse shorter-term survival probabilities.
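To make the utility summary concrete, the following minimal sketch (not the authors' software; the function `mean_utility` and the constant-hazard arm are hypothetical) computes μ_a from a discrete 60-day survival distribution, with R = 60 recovering 60-day restricted mean survival:

```python
import numpy as np

def mean_utility(pmf_death, p_survive_60, R=60.0):
    """Mean utility mu_a for one arm, as in (1): U(Y* = t) = t - 1 for
    t = 1, ..., 60 and U(Y* > 60) = R >= 60.

    pmf_death[t-1] = Pr(Y* = t); p_survive_60 = Pr(Y* > 60).
    """
    t = np.arange(1, 61)
    return float(np.sum((t - 1) * pmf_death) + R * p_survive_60)

# Hypothetical arm with a constant 1% daily hazard of death:
haz = np.full(60, 0.01)
surv = np.cumprod(1 - haz)                    # Pr(Y* > t), t = 1..60
at_risk = np.concatenate(([1.0], surv[:-1]))  # Pr(Y* >= t)
pmf = at_risk * haz                           # Pr(Y* = t)
mu_rms = mean_utility(pmf, surv[-1], R=60.0)  # R = 60: 60-day RMS
```

Increasing R above 60 rewards 60-day survivors more heavily, which is how the sensitivity analysis over R ∈ [60, ∞) operates.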
Our comparative test will be based on the posterior mean utility in the two arms arising from our Bayesian hierarchical model described in Section 3. We will apply our testing criteria in a group-sequential manner and reject the null hypothesis as follows: if Pr(μ_E > μ_C | Data) ≥ p, then stop the trial and recommend E, where 'Data' refers to the historical data and the currently available prospective trial data. Similarly, we will fail to reject the null hypothesis as follows: if Pr(μ_E > μ_C | Data) < q, then stop the trial and recommend C.
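The decision rule can be sketched as a small helper (hypothetical, for illustration only; the boundary values shown are those chosen later in Section 5):

```python
def interim_decision(prob_e_better, p, q):
    """Apply the group-sequential rule to Pr(mu_E > mu_C | Data):
    at or above the upper boundary p -> stop and recommend E;
    below the lower boundary q -> stop and recommend C;
    otherwise continue enrolling.
    """
    if prob_e_better >= p:
        return "recommend E"
    if prob_e_better < q:
        return "recommend C"
    return "continue"

# First-analysis boundaries from Section 5: p1 = 0.999, q1 = 0.145.
decision = interim_decision(0.62, p=0.999, q=0.145)
```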
In Section 5, we discuss timing of the group sequential tests, a method for determining boundary values p and q that control the type I error rate, and a maximum sample size that controls the type II error rate under the targeted benefit of E over C.

Probability Model
To develop a Bayesian model for learning about μ_E and μ_C from the data, we exploit the connection between these quantities and the hazard of death in each arm on each day during the 60-day follow-up period. We denote the hazard of death under regimen a on day t by π_{a,t} = Pr(Y* = t | Y* ≥ t, a), for t = 1, . . ., 60. Since

Pr(Y* = t | a) = π_{a,t} Π_{s<t} (1 − π_{a,s})  and  Pr(Y* > 60 | a) = Π_{t=1}^{60} (1 − π_{a,t}),   (2)

μ_a in (1) is a function of {π_{a,t} : t = 1, . . ., 60}. We use this relationship to construct a Bayesian hierarchical model for learning about the hazard of death in each arm on each day during the 60-day follow-up period, and thereby learning about μ_E and μ_C, based on the historical control data reported by Asfar et al. (2014) and the prospective trial data.
To do this, we propose the following discrete time survival regression model. Denote the i-th patient's treatment arm by a_i = E, C, or C0 for the historical control patients. We will assume that Y*_i | A = a_i; β, i = 1, . . ., n, are independent, where β denotes the vector of model parameters, which we will describe in detail below. Let π_{a,t}(β) = Pr(Y* = t | Y* ≥ t, A = a; β) denote the model-based hazard of death in arm A = a on day t, and note that μ_a = μ_a(β) through (2). We assume the following logistic regression model for the hazard of death in arm A = a on day t during the follow-up period:

logit{π_{a,t}(β)} = γ + g_t + (λ + ε_t) 1(a = E) + (ζ + z_t) 1(a = C0),  t = 1, . . ., 60.   (3)

On follow-up day t, ζ + z_t is the historical-versus-trial control treatment bias, defined as the log odds ratio of the hazard of death in the historical control arm C0 versus the prospective control arm C. If ζ = 0 and z_t = 0, t = 1, . . ., 60, then the prospective and historical control patients' survival times are exchangeable, and it is appropriate to incorporate all data from C0 when comparing E to C. As the values of these historical bias parameters move away from 0, the difference between the prospective and historical control survival distributions increases, and it becomes increasingly inappropriate to incorporate C0 data into the test statistic.
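A small sketch of the linear predictor in (3) may help fix ideas. This is illustrative only (the function name and the specific parameter values are hypothetical, and the form of (3) is as reconstructed above):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def daily_hazards(arm, gamma, g, lam, eps, zeta, z):
    """Daily hazards pi_{a,t}, t = 1..60, under the linear predictor of
    model (3): gamma + g_t, plus (lam + eps_t) for arm E, plus
    (zeta + z_t) for the historical control arm C0.
    """
    lp = gamma + np.asarray(g, dtype=float)
    if arm == "E":
        lp = lp + lam + np.asarray(eps)
    elif arm == "C0":
        lp = lp + zeta + np.asarray(z)
    return expit(lp)

zeros = np.zeros(60)
pi_C = daily_hazards("C", -5.0, zeros, 0.0, zeros, 0.0, zeros)
pi_C0 = daily_hazards("C0", -5.0, zeros, 0.0, zeros, 0.3, zeros)  # zeta = 0.3 bias
```

Here a positive ζ uniformly raises the C0 hazard relative to C on the log-odds scale, which is exactly the proportional-odds form of historical bias.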
For patients randomized to the trial control arm C, our regression model in (3) allows the hazard of death to be different on each day of the 60-day observation period, with logit{π_{C,t}(β)} = γ + g_t on day t. Thus, the parameters {g_t : t = 1, . . ., 60} quantify daily deviations from the shared parameter γ for the hazard of death. Our regression model also allows the odds ratio for the hazard of death in arm E versus C to be different on each day of the 60-day observation period, with logit{π_{E,t}(β)} − logit{π_{C,t}(β)} = λ + ε_t. The parameters {ε_t : t = 1, . . ., 60} quantify daily deviations from the shared parameter λ, with (λ + ε_t) the E-versus-C effect on the hazard of death on day t, for t = 1, . . ., 60.
Leveraging insights from Ghosh et al. (2018) on prior specification for logistic regression and its similarity to the proposed model, we assume γ ∼ t_7(−5, 5) and λ ∼ t_7(0, 2.5), where t_ν(μ, σ) denotes a generalized t-distribution with degrees of freedom ν, location μ and scale σ. Our choice of −5 for the location of the prior distribution on γ reflects the prior information that Pr(Y* > 60 | C) ≈ 0.58 and the model identity Pr(Y* > 60 | C; β) = (1 + e^γ)^{−60} when g = 0. To partially pool information from adjacent days about the hazard of death in each arm, we use first-order IGMRF priors that assume independent increments in the time-varying log-hazard parameters, (g_t − g_{t−1}) | σ²_g iid ∼ Normal(0, σ²_g) and (ε_t − ε_{t−1}) | σ²_ε iid ∼ Normal(0, σ²_ε), for t = 2, . . ., 60. Hennerfeind et al. (2006) also use this GMRF prior in a discrete time survival regression model. Following Besag et al. (1991) and Rue and Held (2005), the independence of the increments follows from specifying improper priors for g and ε, as follows:

p(g | σ²_g) ∝ (σ²_g)^{−59/2} exp{−gᵀQg / (2σ²_g)},

where Q is a 60 × 60 symmetric tridiagonal matrix with diagonal equal to (1, 2, . . ., 2, 1) and first off-diagonals equal to (−1, . . ., −1), with p(ε | σ²_ε) taking the same form. We specify σ²_g, σ²_ε iid ∼ Scale-inv-χ²{1, 2.5²/59}, which has a prior effective sample size of 1 and an expected value of 2.5²/59. Note that Var{g_1 − g_60 | σ²_g = E[σ²_g]} = 2.5², which reflects our prior expectation that the log-odds-ratio for the hazard of death on days 1 and 60 is very likely between −5 and 5. The same logic applies to (ε_1 − ε_60), which reflects the difference in the log-odds-ratio between arms E and C on days 1 and 60. In this way, these prior distributions are weakly informative.
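The structure matrix Q of the first-order IGMRF can be built directly from the first-difference operator, which makes the independent-increment interpretation explicit. A sketch (function name is ours, not from the paper):

```python
import numpy as np

def rw1_structure(T=60):
    """Structure matrix Q of the first-order IGMRF prior: Q = D'D for the
    first-difference operator D, i.e. symmetric tridiagonal with diagonal
    (1, 2, ..., 2, 1) and first off-diagonals equal to -1, so that
    g' Q g = sum_t (g_t - g_{t-1})^2.
    """
    D = np.zeros((T - 1, T))
    idx = np.arange(T - 1)
    D[idx, idx] = -1.0       # -g_{t-1}
    D[idx, idx + 1] = 1.0    # +g_t
    return D.T @ D

Q = rw1_structure(60)
```

Because constant vectors lie in the null space of Q, the prior is improper, which is why the sum-to-zero constraints and the shared intercepts γ and λ are needed for identifiability.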
To facilitate data-adaptive borrowing of strength from the historical control data reported by Asfar et al. (2014), we assume the following spike-and-slab prior for ζ:

ζ | ν_ζ ∼ ν_ζ Normal(0, 1/R_ζ) + (1 − ν_ζ) t_7(0, 2.5²),  ν_ζ ∼ Bernoulli(p_ζ),   (4)

and the following multivariate spike-and-slab prior for z:

(z_t − z_{t−1}) | ν_z, σ²_z iid ∼ Normal(0, ν_z/R_z + (1 − ν_z) σ²_z), t = 2, . . ., 60,  ν_z ∼ Bernoulli(p_z).   (5)

By setting R_ζ = 4000, conditional on ν_ζ = 1, ζ is effectively equal to zero. Similarly, by setting R_z = 4000, conditional on ν_z = 1, z follows an IGMRF highly concentrated about 0, and thus z is effectively equal to 0. Recall that we constrain Σ_{t=1}^{60} z_t = 0. Therefore, conditional on (ν_z, ν_ζ) = (1, 1), our probability model will effectively treat historical and prospective control patients as exchangeable, thereby facilitating full borrowing of strength from the historical control data. In contrast, conditional on ν_ζ = 0, ζ follows a weakly informative t_7(0, 2.5²) prior distribution, and conditional on ν_z = 0 and σ²_z, z follows an IGMRF prior with (z_t − z_{t−1}) | σ²_z iid ∼ Normal(0, σ²_z), t = 2, . . ., 60, where σ²_z has a prior expected value equal to 1/R_z but effective sample size equal to 1. This prior specification for σ²_z facilitates posterior computation in that it is feasible to transition from ν_z = 1 to ν_z = 0, yet conditional on ν_z = 0, the data will strongly influence the full conditional posterior for σ²_z. Conditional on (ν_z, ν_ζ) = (0, 0), our probability model will treat historical and prospective control patients as nonexchangeable, thereby facilitating much less borrowing of strength from the historical control data. For (ν_z, ν_ζ) = (1, 0), the historical bias satisfies proportional odds and our model will borrow some strength regarding the temporal variation in the hazards, whereas for (ν_z, ν_ζ) = (0, 1), the average hazard across the 60-day follow-up is similar in the two control arms, but the temporal variation differs. By setting p_ζ = 0.5 and p_z = 0.5, a priori we assign the spike and slab distributions in (4) and (5) equal weight. Our model will learn about the values of ν_ζ and ν_z
adaptively in the usual Bayesian manner, by conditioning on the observed data, and this will be reflected in the group sequential decision making during the trial. Because it is possible for ν_z = 0 with σ²_z < 1/R_z, we suggest assessing posterior evidence for exchangeability with regard to z based upon ν*_z = ν_z + (1 − ν_z) I(σ²_z ≤ 1/R_z) rather than ν_z. Our approach for incorporating historical data differs from existing commensurate prior approaches in several key ways. First, we use regression to explicitly parameterize a historical trial effect, or bias, which in this case is ζ + z_t on follow-up day t = 1, . . ., 60. Second, we specify each spike-and-slab prior as a two-part mixture of a Gaussian distribution with its density highly concentrated about zero (the spike) and a weakly informative distribution (the slab) for the corresponding parameter (cf. Murray et al., 2015). Defining the spike in this way, rather than as a probability mass at zero, facilitates posterior computation (see, e.g., George and McCulloch, 1993). Third, we specify a multivariate spike-and-slab prior for z that is a two-part mixture of GMRF prior distributions, which simultaneously facilitates temporal smoothing across elements of z and shrinkage of all elements to zero.
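The contrast between the spike and the slab in (4) can be illustrated by sampling from the prior. This is a sketch under our assumptions (the exact slab scale used here, 2.5, mirrors the scale of the prior on λ and is an assumption; the paper's slab notation is t_7(0, 2.5²)):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_zeta(n, p_spike=0.5, R_zeta=4000.0):
    """Draws from a spike-and-slab prior of the form (4) for the historical
    bias zeta: with probability p_spike a Normal(0, 1/R_zeta) spike
    (nu_zeta = 1), otherwise a heavy-tailed t_7 slab with scale 2.5
    (assumed slab scale).
    """
    spike = rng.random(n) < p_spike
    draws = np.where(
        spike,
        rng.normal(0.0, np.sqrt(1.0 / R_zeta), n),  # spike: sd ~ 0.016
        2.5 * rng.standard_t(7, n),                  # slab: weakly informative
    )
    return draws, spike

zeta_draws, is_spike = draw_zeta(20000)
```

With R_ζ = 4000 the spike component is so concentrated that ζ is numerically indistinguishable from zero, which is what "full borrowing" means operationally.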
We will refer to the above model formulation as the "Comm NPO" model, i.e., the commensurate non-proportional-odds prior model. In the sequel, for comparison, we consider three modifications to the Comm NPO model. The "Comm PO" model omits z altogether and thus assumes proportional odds for the hazard of death in the concurrent and historical control arms. The "Trad NPO" model, where Trad indicates traditional, ignores the historical control data altogether, as in a traditional clinical trial, and thus includes neither ζ nor z, since the historical data needed to estimate these parameters are not incorporated. The "Trad PO" model additionally omits the daily deviations from λ, and thus assumes that the hazards for death in the concurrent arms satisfy the proportional odds assumption.

Posterior Estimation
To estimate the posterior distribution arising from the Comm NPO model, we develop a blocked Gibbs sampling algorithm. Let n_a denote the number of patients in arm a = E, C, or C0, with n_{C0} = 761 and n = n_E + n_C + n_{C0}. Indexing patients by i, for patients who die during follow-up we will observe Y*_i, and otherwise only know that Y*_i > U*_i, where U*_i denotes the patient's administrative right-censoring time. As usual, we define the observable random variables Y_i = min(Y*_i, U*_i) and δ_i = 1(Y*_i ≤ U*_i). At a given test, we will have sample observations D_n = {(y_i, δ_i, a_i) : i = 1, . . ., n}. Under the proposed probability model described in Section 3, the likelihood function takes the form

L(β | D_n) = Π_a Π_{t=1}^{60} π_{a,t}(β)^{d_{a,t}} {1 − π_{a,t}(β)}^{r_{a,t} − d_{a,t}},

denoting the linear terms by x_{a,t}ᵀ β, with δ_{i,t} = δ_i 1_{{t}}(y_i), r_{a,t} = Σ_{i=1}^{n} 1_{{t,...,60}}(y_i) 1_a(a_i), and d_{a,t} = Σ_{i=1}^{n} δ_{i,t} 1_a(a_i). That is, δ_{i,t} is the binary indicator that patient i died on day t, r_{a,t} is the number of patients in arm a who were alive at the start of day t and thus were at risk of dying that day, and d_{a,t} is the number of patients in arm a who died on day t, for a = E, C, C0 and t = 1, . . ., 60.
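Computing the sufficient statistics r_{a,t} and d_{a,t} from individual records is straightforward; a sketch (function name and toy data are ours):

```python
import numpy as np

def sufficient_stats(y, delta, arm, arms=("E", "C", "C0"), T=60):
    """Per-arm daily sufficient statistics from (y_i, delta_i, a_i):
    r[a][t-1] = number at risk at the start of day t (y_i >= t),
    d[a][t-1] = number who died on day t (delta_i = 1 and y_i = t).
    """
    y = np.asarray(y)
    delta = np.asarray(delta)
    arm = np.asarray(arm)
    days = np.arange(1, T + 1)
    r, d = {}, {}
    for a in arms:
        ya = y[arm == a]
        da = delta[arm == a]
        r[a] = (ya[:, None] >= days).sum(axis=0)
        d[a] = ((da[:, None] == 1) & (ya[:, None] == days)).sum(axis=0)
    return r, d

# Hypothetical toy data: deaths on days 2 and 5, one censoring at day 60, all in arm C.
r, d = sufficient_stats([2, 5, 60], [1, 1, 0], ["C", "C", "C"])
```

These 2 × 60 tables per arm are all the model needs, which is the point made below about not requiring individual-level historical data.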
The likelihood function may be factored into one contribution from the historical control data and another from the prospective trial data,

L(β | D_n) = L_{C0}(β | D_n) × L_{E,C}(β | D_n).

Only the likelihood contribution of the historical control data depends on the historical bias parameters ζ and z. In this way, the proposed approach augments the prospective data likelihood with a historical control data contribution that includes historical bias parameters. This factorization illuminates two things: (i) multiple historical data sources could augment the prospective likelihood function, each with their own contribution and historical bias parameters; (ii) individual-level historical data are not required for inference; rather, sufficient statistics are all that is needed, which in this case are the number at risk and the number who died on each day during the 60-day follow-up period. Recognizing that L(β | D_n) has the structure of a logistic regression likelihood and following a strategy similar to that of Polson et al. (2013), we developed an efficient and conditionally conjugate Pólya-Gamma Gibbs sampler. We provide additional details in the Supplement (Murray et al., 2020), and freely available, user-friendly R software for implementation of all models under consideration (see Supplementary Materials).

Preliminaries
For the septic shock ICU trial, we anticipate an accrual rate of 34 patients per month across 10-15 sites. Patients will be randomized in permuted blocks by site, so that throughout the trial at each site there will be approximately an equal number of patients assigned to arms E and C. We aim to enroll up to 956 patients, i.e., 478 patients in each concurrent arm, which is expected to take 28 months. We provide justification for this sample size below. We plan to carry out up to three tests at 10-month intervals. To ensure that each patient who remains alive will be followed for at least 60 days, we plan to complete enrollment at 28 months and carry out the final test at 30 months.
Because the goal is to determine whether E confers a survival benefit relative to C, but not to demonstrate that E is harmful, we are concerned with controlling the probabilities of incorrectly recommending E when μ_E = μ_C, i.e., making a type I error, and of incorrectly recommending C when E achieves a target benefit with μ_E > μ_C, i.e., making a type II error. A key complication, which is a central issue in our testing procedure, is that we do not know the degree to which the historical and prospective control data will agree. As we will demonstrate later, the degree of similarity between these data sources will affect the type I and II error rates of our design, since posterior inference will be based on the proposed Comm NPO model that incorporates the historical control data adaptively. To deal with this, we choose monitoring boundaries and a sample size that control the type I and II error rates when posterior estimation is based on the Trad NPO model, which is fit only to the prospective data, as done conventionally. Specifically, omitting the historical control data C0, we design the trial to have type I error probability 0.025 and type II error probability 0.10, equivalently, power 0.90. This strategy leads to increased power and reduced expected sample size when the historical data are congruous with the prospective data, while providing the desired power when the historical data are so incongruous that the Comm NPO model learns to borrow zero strength. For intermediate cases, the operating characteristics of the proposed design with the Comm NPO model are less clear, but may be evaluated using computer simulation.

Determining an Alternative
Under our utility-based comparative test, the meaning of the word "power" is not entirely obvious, since the alternative H_A: μ_E > μ_C is determined by the two time-varying hazard functions and a shared utility function. Below, we explain how elicited relative risk reduction (RRR) values were used to determine the two hazard functions and the resulting numerical values for μ_E and μ_C used to compute power and select a sample size. The two survival distributions under this alternative are given in Figure 1.
Our approach will ensure that the design is adequately powered even when the historical control data are substantially incongruous with the prospective control data. This is because, in this case, the proposed Comm NPO model will tend to borrow little strength from the historical control data and result in posterior inference that is similar to that from the Trad NPO model fit only to the prospective data. In contrast, when the historical and prospective control data are congruous, the proposed Comm NPO model will tend to borrow substantially from the historical control data, leading to a design with larger power and smaller mean sample size. In Section 6, we will report computer simulation results that illustrate how the magnitude of the difference between the historical and prospective control data will affect the operating characteristics of the proposed design.
To obtain numerical values of μ_C and target μ_E for constructing a design, we set R = 60, so these parameters reflect 60-day RMS, and elicited information from FS and SK, who will be running the trial and are co-authors of this paper. To do this, we used the RRR in mortality through follow-up day t, which was familiar to FS and SK, defined as

RRR_t = 1 − Pr(Y* ≤ t | E) / Pr(Y* ≤ t | C).

Motivated by the survival benefits of external cooling observed in the trial reported by Schortgen et al. (2012), Drs. Schortgen (FS) and Katsahian (SK) hypothesized a 40% RRR in mortality for E versus C through calendar day 15, but a smaller 20% reduction through day 30, denoted by RRR_15 = 0.40 and RRR_30 = 0.20. To fully specify the target alternative scenario, we defined hazards of death in each concurrent arm on each day during the 60-day observation period, subject to the hypothesized RRR_15 = 0.40 and RRR_30 = 0.20, as follows. First, we set π_{C,t} = π̂_{C0,t}, for t = 1, . . ., 60, where π̂_{C0,t} is the posterior mean from a discrete time survival model with the GMRF prior fit to the historical control data reported by Asfar et al. (2014). The survival distribution in arm C that corresponds to these hazards is displayed as the dot-dash line in the left panel of Figure 1. Second, relying on the information provided by FS and SK, we assumed that RRR_t = 0.40 for t = 1, . . ., 15 and RRR_30 = 0.20, and used linear interpolation to obtain values for RRR_16 through RRR_29. The resulting RRR curve is displayed in the right panel of Figure 1. Third, given π_{C,t} and RRR_t, we solved for π_{E,t}, for each t = 1, . . ., 60. This gave the null RMS μ_C = μ_{C0} = 39.8 days and target μ_E = 44.9 days. The survival distribution in arm E that corresponds to the elicited hazards is displayed as the solid line in the left panel of Figure 1.
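The three steps above can be sketched as follows. This is our reconstruction, not the authors' code: the flat control hazard is hypothetical, and we assume the RRR curve stays at 0.20 from day 30 through day 60 (the text specifies the curve only through day 30):

```python
import numpy as np

def alternative_hazards(pi_C, rrr15=0.40, rrr30=0.20):
    """Build the elicited RRR curve (0.40 through day 15, linearly
    interpolated to 0.20 at day 30, assumed flat thereafter) and solve
    RRR_t = 1 - F_E(t)/F_C(t) for the arm-E daily hazards, where
    F_a(t) = Pr(Y* <= t | a) is cumulative mortality.
    """
    t = np.arange(1, 61)
    rrr = np.interp(t, [1.0, 15.0, 30.0, 60.0], [rrr15, rrr15, rrr30, rrr30])
    F_C = 1.0 - np.cumprod(1.0 - pi_C)   # cumulative mortality, arm C
    F_E = (1.0 - rrr) * F_C              # implied cumulative mortality, arm E
    S_E = 1.0 - F_E
    S_prev = np.concatenate(([1.0], S_E[:-1]))
    pi_E = (S_prev - S_E) / S_prev       # discrete-time hazards, arm E
    return pi_E, rrr

pi_C = np.full(60, 0.008)                # hypothetical flat control hazard
pi_E, rrr = alternative_hazards(pi_C)
```

In the actual design, `pi_C` would be the posterior-mean hazards fit to the historical control data rather than this flat placeholder.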

Controlling the Type I and II Error Rates
We will allow the monitoring boundary values for our comparative test to differ across the three planned data analyses. Let p_j and q_j denote the boundary values for analysis j = 1, 2, 3. To ensure that the design recommends either E or C at the trial's conclusion, we require p_3 = q_3. Following the theoretical method of Shi and Yin (2019), for j = 1, 2, 3, we set p_j = Φ(a_j) and q_j = Φ(b_j), where a_j and b_j reflect upper and lower frequentist group-sequential Z-score boundaries that provide the desired type I error rate, and Φ(x) denotes the standard normal distribution function. Specifically, we set p_1 = 0.999, q_1 = 0.145, p_2 = 0.995, q_2 = 0.408 and p_3 = q_3 = 0.977, which were obtained from the gsDesign package in R and correspond to an asymmetric group-sequential design based on the Hwang-Shih-DeCani spending function (Hwang et al., 1990). For this design, asymptotically the probability of crossing the upper boundary at or before analysis j under the null is given by 0.025[1 − exp{4(j/3)}]/[1 − exp(4)], j = 1, 2, 3. Similarly, asymptotically the probability of crossing the lower boundary without crossing the upper boundary at or before analysis j under the null is given by (1 − 0.025)[1 − exp{2(j/3)}]/[1 − exp(2)], j = 1, 2, 3. We used an iterative algorithm to select a sample size that controls the type II error rate when using the Trad PO model that ignores the historical control data, so that incorporating these data provides the potential for greater power and a shorter trial. Details about this iterative algorithm are provided in the Web Supplement. We validated the type I and II error rates of our design for the planned sample size and probability models using computer simulation, as reported in Section 6.
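The crossing probabilities quoted above correspond to Hwang-Shih-DeCani spending with parameter −4 for the upper boundary and −2 for the lower boundary. A sketch of the spending function only (gsDesign additionally solves for the Z-score boundaries a_j and b_j, which this sketch does not do):

```python
import numpy as np

def hsd_spending(alpha, gamma, t):
    """Hwang-Shih-DeCani error-spending function
    alpha(t) = alpha * (1 - exp(-gamma * t)) / (1 - exp(-gamma)),
    evaluated at information fractions t in (0, 1].
    """
    t = np.asarray(t, dtype=float)
    return alpha * (1.0 - np.exp(-gamma * t)) / (1.0 - np.exp(-gamma))

info = np.array([1.0, 2.0, 3.0]) / 3.0          # equally spaced analyses
upper = hsd_spending(0.025, -4.0, info)          # cumulative upper crossing probs
lower = hsd_spending(1.0 - 0.025, -2.0, info)    # cumulative lower crossing probs
```

With gamma = −4 the formula reduces to 0.025[1 − exp{4(j/3)}]/[1 − exp(4)], matching the expression in the text; negative gamma spends very little error at early analyses.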

Simulation Study
Our computer simulation study has two major aims: (i) validate the type I and II error rates of our planned group sequential design when posterior estimation is based on the Trad NPO model fit only to prospective data, and (ii) assess how the degree of agreement between the prospective control and historical control data affects the proposed design's operating characteristics when posterior estimation is based on the proposed Comm NPO model that adaptively incorporates the historical control data. For comparison, we included a group-sequential log-rank test with monitoring boundaries analogous to the proposed design, which we refer to as the "LRT" approach. We also evaluated the Comm PO model, which assumes the historical control arm bias satisfies a proportional-odds assumption, and the Trad PO model, which does not incorporate the historical control arm data and also assumes the concurrent arms satisfy a proportional-odds assumption.
For the Bayesian methods, after 200 warm-up iterations, we ran our Gibbs sampler for 2,000 iterations and used the resulting samples for posterior calculations. Each Gibbs sampler run for the Comm NPO model took about six minutes. The samplers for the other models required fewer steps and were slightly faster, requiring one to five minutes per run. To speed up computations, we simulated trials in parallel on an HP Linux distributed cluster.
We simulated 2,000 trials each under a null scenario and a target alternative scenario. For aim (i), prospective control data were generated from a survival distribution with π_{C,t} = π̂_{C0,t}, for t = 1, . . ., 60, i.e., ζ = 0 and z = 0, so that C was exactly congruous with the observed historical control data C0. For the null scenario, the prospective external cooling data were generated from this same survival distribution. For the alternative scenario, data were generated from the target survival distribution displayed in Figure 1. This investigation facilitates achieving aim (i) as well as evaluating the benefit of using our proposed Comm NPO model when the prospective and historical control data tend to be congruous. To achieve aim (ii), we simulated an additional 10,000 trials with varying differences between μ_C and μ_{C0} = 38.9 under both the null and the target alternative, which we always defined in terms of the elicited RRR curve displayed in Figure 1. For each trial, we sampled ζ ∼ Uniform(−0.5, 0.5), but kept z = 0, and generated prospective control data from a survival distribution with logit{π_{C,t}} = logit{π̂_{C0,t}} + ζ, for t = 1, . . ., 60.
Recall that ζ quantifies the magnitude of disagreement between the true survival distribution of the C data and the observed historical C_0 data. Since z = 0, however, the shape of the temporal variation in the hazards for death remains similar across sources. Since {π_a,t : t = 1, . . ., 60} determines μ_a through (1), μ_C is a one-to-one function of ζ in our computer simulation study. We determined that ζ ∈ [−0.5, 0.5], which corresponds to μ_C ∈ [31.0, 46.6] days, provides a suitably wide range of values for μ_C around μ_C0 = 38.9 under which to evaluate the design's operating characteristics under varying magnitudes of disagreement between the prospective and historical control arms. For each simulated trial, we assigned each of the 956 subjects a random enrollment day drawn uniformly between calendar days 1 and 855, and we carried out data analyses at calendar days 305, 610 and 915. Based on the enrollment day for each simulated subject, we determined the observed data at each of the three analyses. For the LRT, at each data analysis we calculated and stored the log-rank test statistic. For the Bayesian designs, at each data analysis we calculated and stored Pr(μ_E > μ_C | D_n). For the Comm NPO and PO models, we also calculated and stored the posterior probability of exchangeability between the current and historical control arms with respect to ζ and z, e.g., PPE = Pr(ν_ζ = 1 | D_n). To quantify the amount of borrowing from the historical data at each analysis, we also defined, calculated and stored the relative information gain (RIG) for each Comm model. RIG measures the increase in posterior precision for the treatment effect of interest, μ_E − μ_C, due to incorporation of the historical control data. Compared to the effective historical sample size (Hobbs et al., 2011), RIG is a more general metric for the amount of strength borrowed from the historical data. Table 1 gives the proportions of simulated trials that recommended E over C under the null and alternative hypotheses in the case with ζ = 0 and z = 0, where C and C_0 are fully congruous. The results in Table 1 reflect a non-binding lower boundary for recommending C, i.e.
the trial may continue when the lower boundary is crossed. When adhering to the lower boundary, for all three methods and under both the null and the alternative, the proportion of trials recommending E is unchanged, whereas the mean sample size (mSS) is much smaller. Because stopping early to recommend E is rare under the null, mSS is similar for all methods in that case. Under the target alternative, LRT has the largest mSS, followed by Trad PO, Trad NPO, Comm NPO, and then Comm PO, which stopped to recommend E at the first analysis in nearly one third of the simulated trials.
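The bookkeeping behind these operating characteristics can be sketched as follows: each stored posterior path is scanned against the group-sequential boundaries, and Pr(recommend E) and mSS are tabulated. The boundary values, look sizes, and simulated posterior paths below are placeholders, not those of the actual design.

```python
import numpy as np

upper = [0.995, 0.992, 0.986]      # illustrative efficacy (upper) boundaries
lower = [0.10, 0.10, 0.10]         # illustrative futility (lower) boundaries
n_at_look = [319, 637, 956]        # subjects enrolled by each analysis (placeholder)

def trial_outcome(post_path, binding_lower):
    """Scan one trial's posterior probabilities Pr(mu_E > mu_C | D_n) across
    the analyses; return (recommend_E, index of the stopping analysis)."""
    for k, p in enumerate(post_path):
        if p > upper[k]:
            return True, k                    # stop early, recommend E
        if binding_lower and p < lower[k]:
            return False, k                   # stop early, recommend C
    return False, len(post_path) - 1          # reach the final analysis

rng = np.random.default_rng(1)
paths = rng.uniform(0.0, 1.0, size=(2000, 3))  # placeholder posterior paths

outcomes = [trial_outcome(p, binding_lower=True) for p in paths]
prop_recommend_E = np.mean([rec for rec, _ in outcomes])
mean_sample_size = np.mean([n_at_look[k] for _, k in outcomes])
```

Setting binding_lower=False reproduces the non-binding analysis: recommendations of E are unchanged, but trials run longer on average.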
The LRT and Trad methods exhibited one-sided type I error rates near the intended nominal 0.025 level, whereas the Comm methods exhibited slightly lower rates. Under the target alternative scenario, the LRT and Trad PO methods exhibit similar power near 0.85, whereas the Trad NPO method exhibits power near 0.90, as intended. The lower power of LRT and Trad PO is likely due to the fact that LRT is most powerful for proportional-hazards effects and the Trad PO model assumes proportional odds, neither of which holds here. In contrast, the Comm PO and NPO methods exhibit similar, higher power of 0.97 and 0.96, respectively. This substantial increase in power is due to incorporating the historical control data when the prospective control data follow a survival distribution that is congruous with the observed historical control data. In particular, there is little power loss when modeling the historical bias with both ζ and z, rather than ζ alone. The probability estimates in Figure 2 are based on a generalized additive model fit to the relevant simulated data; the estimates are similar when based, instead, on LOESS (locally weighted smoothing) (Cleveland, 1979). Under the null, LRT and the two Trad models have one-sided type I error rates near the 0.025 nominal rate for all μ_C, whereas the type I error rates of the Comm models vary with μ_C. In particular, the type I error rate inflates when μ_C is slightly above μ_C0 = 39.8, deflates when μ_C is slightly below μ_C0, and moves back toward the nominal rate as μ_C deviates further from μ_C0. The Comm NPO model exhibits less variability than the Comm PO model in this regard. Under the target alternative, LRT and Trad PO exhibit lower power than Trad NPO for all μ_C, whereas the power of the Comm models follows the same trend as their type I error rates, achieving the most power when μ_C is equal to or slightly greater than μ_C0. For the Comm models, both the null and alternative curves for Pr(Recommending E) as a function of μ_C in Figure 2 have
non-monotone shapes, with a temporary dip in each curve for values near and below μ_C0 and a temporary rise for values near and above μ_C0. This is because, when μ_C is in these intermediate ranges below and above μ_C0, the posterior estimate of μ_C under the Comm models tends to be shrunk toward μ_C0. Thus, the type I error rate is inflated for values of μ_C slightly above μ_C0 and the power is reduced for values of μ_C slightly below μ_C0. This estimation bias may be considered the price paid for using a Comm model to borrow strength from the historical data. Additional simulation results in the Web Supplement (see Supplementary Materials) demonstrate that specifying a lower prior probability on the spike distributions (i.e. p_ζ and p_z) will reduce type I error inflation, but at the cost of lower power when the two sources are congruous.
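The GAM and LOESS fits mentioned above can be mimicked by any scatterplot smoother applied to the binary trial-level outcomes. Below is a self-contained stand-in (a Nadaraya-Watson kernel smoother rather than the GAM/LOESS actually used), with placeholder simulated data whose bump near the historical mean imitates the non-monotone shape described above.

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimate of E[y | x] with a Gaussian kernel; applied
    to 0/1 outcomes it estimates Pr(recommend E) as a function of x."""
    est = np.empty(len(grid))
    for i, g in enumerate(grid):
        w = np.exp(-0.5 * ((x - g) / bandwidth) ** 2)
        est[i] = np.sum(w * y) / np.sum(w)
    return est

rng = np.random.default_rng(2)
mu_C = rng.uniform(31.0, 46.6, size=10_000)              # one mu_C per trial
# Placeholder recommendation probability with a bump near the historical mean.
p_true = 0.025 + 0.2 * np.exp(-0.5 * ((mu_C - 39.0) / 1.5) ** 2)
recommend_E = (rng.random(10_000) < p_true).astype(float)

grid = np.linspace(32.0, 46.0, 50)
curve = kernel_smooth(mu_C, recommend_E, grid, bandwidth=0.75)
```

Because each simulated trial contributes only a Bernoulli outcome at its own μ_C, smoothing over neighboring trials is what turns 10,000 scattered 0/1 points into the curves of Figure 2.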

Discussion
We have proposed a Bayesian group sequential design for a randomized controlled trial to assess the effectiveness of using external cooling to control fever in patients with septic shock admitted to an ICU, and have investigated its operating characteristics.
The proposed design will use a group sequential comparative test based on the difference between the 60-day survival distributions in the external cooling and control arms. Statistical inference will be based on a Bayesian discrete time survival model that facilitates borrowing strength from the historical control data using an intuitive regression framework.
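To illustrate the form of the comparative test, the sketch below maps posterior draws of the discrete-time hazards to a 60-day mean survival summary in each arm and computes the posterior probability that cooling is better. The Beta draws are placeholders for actual MCMC output, and taking μ_a to be the 60-day restricted mean survival time is our simplifying assumption, not necessarily the paper's exact definition.

```python
import numpy as np

def mean_survival_60(hazard_draws):
    """Map posterior draws of daily hazards (rows = draws, cols = days 1..60)
    to a 60-day restricted mean survival time: the sum over days of the
    probability of surviving beyond that day."""
    surv = np.cumprod(1.0 - hazard_draws, axis=1)   # S(1), ..., S(60) per draw
    return surv.sum(axis=1)

rng = np.random.default_rng(3)
# Placeholder posterior hazard draws; E has lower daily hazards than C.
draws_C = rng.beta(2, 150, size=(2000, 60))
draws_E = rng.beta(2, 190, size=(2000, 60))

mu_C_draws = mean_survival_60(draws_C)
mu_E_draws = mean_survival_60(draws_E)
post_prob = float(np.mean(mu_E_draws > mu_C_draws))  # Pr(mu_E > mu_C | D_n)
```

At each interim analysis, the group-sequential rule would compare post_prob against the analysis-specific monitoring boundary.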
Our computer simulations showed that, if the historical control data are unbiased, then the proposed approach will control the type I error rate and provide substantially greater power than similar Bayesian designs that do not incorporate the historical control data, as well as a more traditional design based on a group-sequential log-rank test. In contrast, when the historical data are biased, the proposed approach provides reasonable power and avoids runaway type I error inflation and bias in the estimate of the treatment effect of interest, regardless of the magnitude of the bias. Decreasing the prior probability on the spike distribution for each historical bias parameter will reduce the maximum type I error inflation and estimation bias, at the cost of lower power when the data sources are congruous.
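The spike-versus-slab trade-off can be seen in a toy conjugate calculation. This is a deliberately simplified normal approximation, not the paper's model: we pretend the bias ζ has a point-mass spike at 0 with prior probability p_spike and a N(0, slab_sd²) slab, and that the data supply an estimate zeta_hat with standard error se.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior_spike_prob(zeta_hat, se, p_spike, slab_sd):
    """Posterior probability that zeta = 0, i.e. that the historical and
    concurrent controls are exchangeable, by Bayes' rule over the two
    mixture components (spike marginal vs. slab marginal likelihood)."""
    m_spike = normal_pdf(zeta_hat, 0.0, se)                       # zeta fixed at 0
    m_slab = normal_pdf(zeta_hat, 0.0, math.hypot(se, slab_sd))   # zeta integrated out
    num = p_spike * m_spike
    return num / (num + (1.0 - p_spike) * m_slab)

# Congruous data: small estimated bias -> high PPE -> strong borrowing.
ppe_congruous = posterior_spike_prob(0.02, se=0.1, p_spike=0.5, slab_sd=0.5)
# Incongruous data: large estimated bias -> low PPE -> little borrowing.
ppe_biased = posterior_spike_prob(0.45, se=0.1, p_spike=0.5, slab_sd=0.5)
```

Lowering p_spike pushes the posterior probability of exchangeability, and hence the amount of borrowing, down for any given zeta_hat, which is the mechanism behind the power/robustness trade-off noted above.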
If desired, one may modify the regression model's linear predictor to obtain covariate-adjusted treatment effects. For example, the proposed model may be extended to include enrollment center, patient prognostic covariates, or possibly high dimensional biologic/genomic variables. While conceptually straightforward, doing this may be technically much more complicated. In part, this is because such an extension may involve treatment-covariate interactions, thus requiring a more complex regression model and a covariate-specific group sequential decision structure.

When both Σ_{t=1}^{60} t × Pr(Y* = t | A = a) and Pr(Y* > 60 | A = a) are greater in one arm than in the other, the conclusion will be invariant to R as well. In contrast, when Σ_{t=1}^{60} (t − 1) × Pr(Y* = t | A = a) and Pr(Y* > 60 | A = a) favor opposite arms, the value of R will determine which arm's 60-day survival distribution is clinically preferable.

Figure 1: Survival distributions for the two prospective arms under the target alternative scenario (left panel) and the corresponding relative risk reduction of mortality for E versus C (right panel).

Figure 2: Probability of recommending external cooling as a function of μ_C under the null (left panel) and target alternative (right panel). The vertical dotted line reflects the point of zero historical bias, μ_C0 = 39.8, as estimated from the historical control data. The horizontal dotted lines reflect the desired 0.025 one-sided type I error rate and 0.90 power. For visual clarity, the y-axis range differs across the two panels.

Figure 2 displays an estimate of the probability of recommending E under the null and target alternative as a function of μ_C, corresponding to each method.

Figure 3 displays estimates of E[μ̂_C − μ_C | μ_C] as a function of μ_C for each Bayesian model, along with scatter plots of μ̂_C based on each Comm model versus the Trad NPO model. Both Trad models are biased upward for all μ_C, though the Trad PO model much more so, likely because the PO assumption is violated under the target alternative. Under the null, which satisfies the PO assumption, neither Trad model exhibits noticeable bias for any μ_C (results not shown here). Both Comm models tend to over-estimate μ_C when μ_C is below μ_C0, reflecting shrinkage toward μ_C0.

Figure 4 further elucidates the borrowing properties of the Comm NPO model, with scatter plots of PPE, the posterior probability of exchangeability for ζ (left panel), and RIG (middle panel) versus μ_C, as well as RIG versus PPE for ζ (right panel). The Comm PO model exhibits similar patterns in this regard. Recall that the Comm NPO model also allows for temporal variation in the historical control arm bias through z. Because this simulation study varied only the value of ζ, the PPE for z was relatively invariant in μ_C at around 0.85 and thus is not illustrated here. RIG provides an alternative means to capture the strength of borrowing across all bias parameters, regardless of the Comm model's complexity. When the estimate of μ_C from Trad NPO is near μ_C0, Comm NPO tends to have PPE near 1, which leads to RIG > 0 and thus greater posterior precision for the treatment effect of interest. In contrast, as the estimate of μ_C from Trad NPO deviates far from μ_C0, PPE and RIG tend toward 0, indicating that incorporating historical control data that are incongruous with the new control data yields little gain in posterior precision for the treatment effect of interest. When the estimate of μ_C from Trad NPO is near 35 or 45, Comm NPO tends to have intermediate values of PPE and RIG < 0, i.e. it suffers some reduction in posterior precision for the treatment effect of interest.
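Since the display defining RIG did not survive extraction, the sketch below uses one natural operationalization consistent with the surrounding text: the relative increase in posterior precision for μ_E − μ_C under a Comm model versus the corresponding Trad model. The normal draws are placeholders for actual posterior samples.

```python
import numpy as np

def relative_information_gain(comm_effect_draws, trad_effect_draws):
    """RIG as a precision ratio minus 1, so RIG > 0 iff incorporating the
    historical controls sharpened the posterior of mu_E - mu_C, and
    RIG < 0 iff it inflated the posterior variance."""
    prec_comm = 1.0 / np.var(comm_effect_draws)
    prec_trad = 1.0 / np.var(trad_effect_draws)
    return prec_comm / prec_trad - 1.0

rng = np.random.default_rng(4)
trad = rng.normal(3.0, 2.0, size=5000)   # wider posterior: no borrowing
comm = rng.normal(3.2, 1.5, size=5000)   # sharper posterior: strong borrowing
rig = relative_information_gain(comm, trad)   # positive: precision gained
```

Under this form, the negative RIG values seen when PPE is intermediate correspond to the mixture over spike and slab inflating the posterior variance relative to ignoring the historical data.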

Table 1: Probabilities of recommending E over C under the null and target alternative scenarios with ζ = 0 and z = 0. Reported values reflect the proportion of 2,000 simulated trials that recommended E at analyses 1, 2 and 3, and overall. mSS = mean sample size when ignoring the non-binding lower stopping boundary.

Under the null, we used π_E,t = π_C,t, t = 1, . . ., 60. Under the alternative, given ζ and {π_C,t : t = 1, . . ., 60}, we used the {π_E,t : t = 1, . . ., 60} values corresponding to the elicited RRR, and we generated observations for the E arm from the survival distribution with the resulting hazards.