Combining chains of Bayesian models with Markov melding

A challenge for practitioners of Bayesian inference is specifying a model that incorporates multiple relevant, heterogeneous data sets. It may be easier to instead specify distinct submodels for each source of data, then join the submodels together. We consider chains of submodels, where submodels directly relate to their neighbours via common quantities which may be parameters or deterministic functions thereof. We propose chained Markov melding, an extension of Markov melding, a generic method to combine chains of submodels into a joint model. One challenge we address is appropriately capturing the prior dependence between common quantities within a submodel, whilst also reconciling differences in priors for the same common quantity between two adjacent submodels. Estimating the posterior of the resulting overall joint model is also challenging, so we describe a sampler that uses the chain structure to incorporate information contained in the submodels in multiple stages, possibly in parallel. We demonstrate our methodology using two examples. The first example considers an ecological integrated population model, where multiple data sets are required to accurately estimate population immigration and reproduction rates. We also consider a joint longitudinal and time-to-event model with uncertain, submodel-derived event times. Chained Markov melding is a conceptually appealing approach to integrating submodels in these settings.


Introduction
The Bayesian philosophy is appealing in part because the posterior distribution quantifies all sources of uncertainty.However, a joint model for all data and parameters is a prerequisite to posterior inference, and in situations where multiple, heterogeneous sources of data are available, specifying such a joint model is a formidable task.Models that consider such data are necessary to describe complex phenomena at a useful precision.One possible approach begins by specifying individual submodels for each source of data.These submodels could guide the statistician when directly specifying the joint model, but to use the submodels only informally seems wasteful.Instead, it may be preferable to construct a joint model by formally joining the individual submodels together.Some specific forms of combining data are well established.Meta-analyses and evidence synthesis methods are widely used to summarise data, often using hierarchical models (Ades and Sutton, 2006;Presanis et al., 2014).Outside of the statistical literature, a common name for combining multiple data is data fusion (Kedem et al., 2017;Lahat et al., 2015), though there are many distinct methods that fall under this general name.Interest in integrating data is not just methodological; applied researchers often collect multiple disparate data sets, or data of different modalities, and wish to combine them.For example, to estimate SARS-CoV-2 positivity Donnat et al. (2020) build an intricate hierarchical model that integrates both testing data and self-reported questionnaire data, and Parsons et al. (2021) specify a hierarchical model of similar complexity to estimate the number of injecting drug users in Ukraine.Both applications specify Bayesian models with data-specific components, which are united in a hierarchical manner.In conservation ecology, integrated population models (IPMs) (Besbeas et al., 2002;Brooks et al., 2004;Maunder and Punt, 2013;Schaub and Abadi, 2011;Zipkin and Saunders, 2018) are used to estimate population dynamics, e.g.reproduction and immigration rates, using multiple data on the same population.Such data have standard models associated with them, such as the Cormack-Jolly-Seber model (Lebreton et al., 1992) for capture-recapture data, and the IPM serves as the framework in which the standard models are combined.More generally, the applications we list illustrate the importance of generic, flexible methods for combining data to applied researchers.
Markov melding (Goudie et al., 2019) is a general statistical methodology for combining submodels.Specifically, it considers M submodels that share some common quantity φ, with each of the m = 1, . . ., M submodels possessing distinct parameters ψ m , data Y m , and form p m (φ, ψ m , Y m ).Goudie et al. (2019) then propose to combine the submodels into a joint model, denoted p meld (φ, ψ 1 , . . ., ψ M , Y 1 , . . ., Y M ).However, it is unclear how to integrate models where there is no single quantity φ common to all submodels, such as for submodels that are linked in a chain structure.
We propose an extension to Markov melding, which we call chained Markov melding1 , which facilitates the combination of M submodels that are in a chain structure.For example, when M = 3 we address the case in which submodel 1 and 2 share a common quantity φ 1∩2 , and submodel 2 and 3 share a different quantity φ 2∩3 .
Our extension addresses previously unconsidered complications including the distinct domains (and possibly supports) of the common quantities, and the desire to capture possible prior correlation between them.Two examples serve to illustrate our methodology, which we introduce in the following section.The computational effort required to fit a complex, multi-response model is a burden to the model development process.We propose a multi-stage posterior estimation method that exploits the properties of our chained melded model to reduce this burden.We can parallelise aspects of the computation across the submodels, using less computationally expensive techniques for some submodels.Reusing existing software implementations of submodels, and subposterior samples where available, is also possible.Multi-stage samplers can aid in understanding the contribution of each submodel to the final posterior, and are used in many applied settings, including hierarchical modelling (Lunn et al., 2013) and joint models (Mauff et al., 2020).
One contribution of our work is to clarify the informal process commonly used in applied analyses of summarising and/or approximating submodels for use in subsequent analyses.The two most common approximation strategies seem to be (i) approximating the subposterior of the common quantity with a normal distribution for use in subsequent models (see, e.g.Jackson and White, 2018;Nicholson et al., 2021) and (ii) taking only a point estimate of the subposterior, and treating it as a known value in further models.These strategies may, but not always, produce acceptable approximations to the chained melded model.
Both the chained melded model and these approximation strategies are examples of 'multi-phase' and 'multi-source' inference (Meng, 2014), with the melding approach most comprehensively accounting for uncertainty.

Example introduction
In this section we provide a high-level overview of two applications that require integrating a chain of submodels, with more details in Sections 4 and 5. Our first example decomposes a joint model into its constituent submodels and rejoins them.This simple situation allows us to compare the output from the chained melding process to the complete joint model, and is meant to illustrate both the 'chain-of-submodels' notion and the mechanics of chained melding.The second example is a realistic and complex setting in which the combining of submodels without chained Markov melding is nonobvious.Our comparator is the common technique of summarising previously considered submodels with point estimates, and demonstrates the Figure 1: A simplified DAG of the integrated population model (IPM) for the little owls.The capturerecapture submodel (p 1 ) is surrounded by the blue line, the population count submodel (p 2 ) by the black line, and the fecundity submodel (p 3 ) by the red line.The capture-recapture and population count submodels share parameters affecting the juvenile and adult survival rate (φ 1∩2 ), whilst the parameter for fecundity is common to both the population count and fecundity submodels (φ 2∩3 ).The combination of all the submodels forms the IPM.
importance of fully accounting for uncertainty.

An integrated population model for little owls
Integrated population models (IPMs) (Zipkin and Saunders, 2018) combine multiple data to estimate key quantities governing the dynamics of a specific population.Schaub et al. (2006) and Abadi et al. (2010) used an IPM to estimate fecundity, immigration, and yearly survival rates for a population of little owls.
These authors collect and model three types of data, illustrated in Figure 1.Capture-recapture data Y 1 , and associated capture-recapture submodel p 1 (φ 1∩2 , ψ 1 , Y 1 ), are acquired by capturing and tagging owls each year, and then counting the number of tagged individuals recaptured in subsequent years.Population counts Y 2 are obtained by observing the number of occupied nesting sites, and are modelled in p 2 (φ 1∩2 , φ 2∩3 , ψ 2 , Y 2 ).
Finally, nest-record data Y 3 counts both the number of reproductive successes and possible breading pairs, and is associated with a submodel for fecundity p 3 (φ 2∩3 , ψ 3 , Y 3 ).The population count model p 2 shares the parameter φ 1∩2 with the capture-recapture model p 1 , and the parameter φ 2∩3 with the fecundity model p 3 ; each of the m = 1, 2, 3 submodels has distinct, submodel-specific parameters ψ m .No single source of data is sufficient to estimate all quantities of interest, so it is necessary to integrate the three submodels into a single joint model to produce acceptably precise estimates of fecundity and immigration rates.We will show that the chained Markov melding framework developed in Section 2 encapsulates the process of integrating these submodels, producing results that are concordant with the original joint IPM.

Survival analysis with time varying covariates and uncertain event times
Our second example considers the time to onset of respiratory failure (RF) amongst patients in intensive care units, and factors that influence the onset of RF.A patient can be said to be experiencing RF if the ratio of the partial pressure of arterial blood oxygen (PaO 2 ) to the faction of inspired oxygen (FiO 2 ) is less than 300mmHg (The ARDS Definition Task Force, 2012), though this is not the only definition of RF.Patients' PaO 2 /FiO 2 (P/F) ratios are typically measured only a few times a day.The relative infrequency of P/F ratio data, when combined with the intrinsic variability in each individual's blood oxygen level, results in significant uncertainty in about the time of onset of RF.
Factors that influence the time to onset of RF are both longitudinal and time invariant.Both types of data can be considered in joint models (Rizopoulos, 2012), which are composed of two distinct submodels, one for each data type.However, existing joint models are not able to incorporate the uncertainty surrounding the event time, which may result in overconfident and/or biased estimates of the parameters in the joint model.
Chained Markov melding offers a conceptually straightforward, Bayesian approach to incorporating uncertain event times into joint models.Specifically, we consider the event time as a submodel-derived quantity from a hierarchical regression model akin to Lu and Meeker (1993).We call this submodel the uncertain event time Figure 2: A simplified DAG of the submodels considered in the survival analysis example.The event time submodel p 1 defines the event time φ 1∩2 as noninvertible function of the other model parameters (denoted by the dotted line), whilst the survival submodel p 2 considers φ 1∩2 as the response.The longitudinal submodel p 3 has parameters φ 2∩3 in common with the survival submodel.
It is in examples such as this one that we foresee the most use for chained Markov melding; a fully Bayesian approach is desired and the submodels are nontrivial in complexity, with no previously existing or obvious joint model.

Markov melding
We now review Markov melding (Goudie et al., 2019) before detailing our proposed extension.As noted in the introduction, Markov melding is a method for combining M submodels p 1 (φ, which share the same φ.When the submodel prior marginals p m (φ) are identical, i.e. p m (φ) = p(φ) for all m, it is possible to combine the submodels using Markov combination (Dawid and Lauritzen, 1993;Massa and Lauritzen, 2010) (1) Markov combination is not immediately applicable when submodel prior marginals are distinct, so Goudie et al. define a marginal replacement procedure, where individual submodel prior marginals are replaced with a common marginal p pool (φ) = h(p 1 (φ), . . ., p M (φ)) which is the result of a pooling function h that appropriately summarises all prior marginals (the choice of which is described below).The result of marginal replacement is Goudie et al. show that p repl,m (φ, ψ m , Y m ) minimises the Kullback-Leibler (KL) divergence between a distribution q(φ, ψ m , Y m ) and p m (φ, ψ m , Y m ) under the constraint that q(φ) = p pool (φ), and that marginal replacement is valid when φ is a deterministic function of the other parameters in submodel m.Markov melding joins the submodels via the Markov combination of the marginally replaced submodels (3)

Pooled prior
Goudie et al. proposed forming p pool (φ) using linear or logarithmic prior pooling (Genest et al., 1986;O'Hagan et al., 2006) p pool, lin (φ where λ = (λ 1 , . . ., λ M ) are nonnegative weights, which are chosen subjectively to ensure p pool (φ) appropriately represents prior knowledge about the common quantity.Two special cases of pooling are of particular interest.Product of experts (PoE) pooling (Hinton, 2002) is a special case of logarithmic pooling that occurs when λ m = 1 for all m.Dictatorial pooling is a special case of either pooling method when λ m = 1 and, for all

Chained model specification
Consider m = 1, . . ., M submodels each with data Y m and parameters θ m denoted p m (θ m , Y m ), with M ≥ 3.
We assume that the submodels are connected in a manner akin to a chain and so can be ordered such that only 'adjacent' submodels in the chain have parameters in common.Specifically we assume that submodels m and m + 1 have some parameter φ m∩m+1 in common for m = 1, . . ., M − 1.For notational convenience define Note that all components of φ, ψ and Y may themselves be multivariate.Additionally, because φ m∩m+1 may be a deterministic function of either θ m or θ m+1 we refer to φ m∩m+1 as a common parameter or a common quantity as appropriate.
All submodels, and marginal and conditional distributions thereof, have density functions that are assumed to exist and integrate to one.When considering conditional distributions we assume that the parameter being conditioned on has support in the relevant region.We define the m th subposterior as p m (φ m , ψ m | Y m ).

Extending marginal replacement
We now define the chained melded model by extending the marginal replacement procedure to submodels linked in a chain-like way.The proposed chained marginal replacement operation modifies the submodels to enforce a common prior for φ.This consistency allows us to employ Markov combination to unite the submodels.
Specifically, the m th marginally replaced submodel is where p pool (φ) = g(p 1 (φ 1 ), p 2 (φ 2 ), . . ., p m (φ m )) is a pooling function that appropriately summarises all submodel prior marginals.The second equality in Equation ( 6) is because of the conditional independence We form the chained melded model by taking the Markov combination of the marginally replaced submodels Rewriting ( 9) in terms of φ m∩m+1 for m = 1, . . ., M − 1 yields Finally, we use chained melded posterior p meld (φ, ψ | Y ) ∝ p meld (φ, ψ, Y ) to refer to posterior of the chained melded model conditioned on all data.

Pooled prior
Specifying (9) requires a joint prior for φ.As in Markov melding we form the joint prior by pooling the marginal priors, selecting a pooling function g that appropriately represents prior knowledge about the common quantities.We define p pool (φ) as a generic function of all prior marginals because we do not always wish to assume independence between the components of φ.
Two special cases of Equation ( 12) are noteworthy.Firstly, if all components of φ are independent, then we can form p pool (φ) as the product of M − 1 standard pooling functions h m defined in Section 1.2.1 A second case, in between complete dependence (12) and independence (14 without any additional assumptions.That is, if any two consecutive components of φ are independent in the submodel containing both of them, we can divide the pooled prior specification problem into two pooling functions.The smaller number of arguments to g 1 and g 2 make it easier to choose appropriate forms for those functions. Selecting a specific form of g is not trivial given the many choices of functional form and pooling weights (the latter of which we discuss momentarily).One complication is that standard linear and logarithmic pooling, as defined in Equations ( 4) and (5), are not immediately applicable when the submodel marginal distributions consider different quantities.We now propose extensions to logarithmic, linear, and dictatorial pooling for use in the case of chained melding.

Chained linear pooling
Our generalisation of linear pooling to handle marginals of different quantities is a two step procedure.The first step forms intermediary pooling densities via standard linear pooling, using appropriate marginals of the relevant quantity where λ m = (λ m,1 , λ m,2 ) are nonnegative pooling weights, and for m = 2, . . ., M − 1 For m = 1 and m = M the relevant marginals are p 1 (φ 1∩2 ) and p M (φ M −1∩M ).In step two we form the pooled prior as the product of the intermediaries with K lin (λ) = M −1 m=1 p pool,m (φ m∩m+1 )dφ, for λ = (λ 1 , . . ., λ M ).Clearly, this assumes prior independence amongst all components of φ which may be undesirable, particularly if this independence was not present under one or more of the submodel priors.We discuss extensions to linear pooling that enable prior dependence between the components of φ in Section 6.

Dictatorial pooling
Chained Markov melding does not admit a direct analogue to dictatorial pooling as defined in Section 1.2.1 because not all submodel prior marginals contain all common quantities.For example, consider the logarithmically pooled prior of Equation ( 16) with, say, the m th entry in λ set to 1 and all others set to 0. This choice of λ results in p pool (φ) = p(φ m ), which is flat for φ −m .It seems reasonable to require any generalisation of dictatorial pooling to result in a reasonable prior for all components in φ.Such a generalisation should also retain the original intention of dictatorial pooling, i.e. 'the authoritative prior for φ m is p m (φ m )'.
We propose two possible forms of dictatorial pooling that satisfy the aforementioned criteria.Partial dictatorial pooling enforces a single submodel prior for the relevant components of φ, with no restrictions on the pooling of the remaining components; and complete dictatorial pooling which requires selecting one of the two possible submodel priors for each component of φ.
Partial dictatorial pooling considers p m (φ m ) as the authoritative prior for where g 1 and g 2 are linear or logarithmic pooling functions as desired 3 .
Complete dictatorial pooling requires the marginal pooled prior for each component in φ to be chosen solely on the basis of only one of the two priors specified for it under the submodels.For m = 1, . . ., M − 1, the m th marginal of the pooled prior is either If two consecutive marginals are chosen to have the same submodel prior then we instead define to preserve any dependence between φ m−1∩m and φ m∩m+1 that may be present under p m .The complete dictatorially pooled prior is thus where, subject to the potential modification in Equation ( 23), the terms in the product are as defined in Equation ( 22).For example, if M = 5 and we wish to ignore p 2 and p 4 when constructing the pooled prior and instead associate φ 1∩2 with p 1 , both φ 2∩3 and φ 3∩4 with p 3 , and φ 4∩5 with p 5 , then 3 Some care is required if the authoritative submodel is p m for m ∈ {1, 2, M − 1, M }.If it is taken to be m ∈ {1, 2}, then g 1 does not exist, and additionally in the m = 1 case p 1 (φ 0∩1 , φ 1∩2 ) := p 1 (φ 1∩2 ).The m ∈ {M − 1, M } cases have analogous definitions.

Pooling weights
Choosing values for the pooling weights is an important step in specifying the pooled prior (Abbas, 2009;Carvalho et al., 2022;María Jesús Rufo et al., 2012;MJ Rufo et al., 2012).Because appropriate values for the weights depend on the submodels being pooled and the information available a priori, universal recommendations are impossible, so we illustrate the impact of different choices in a straightforward example.It is important that prior predictive visualisations of the pooled prior are produced (Gabry et al., 2019;Gelman et al., 2020) to guide the choice of pooling weights and ensure that the result suitably represents the available information.Figure 3 illustrates how λ and the choice of pooling method impacts p pool (φ) when pooling normal distributions.Specifically, we consider M = 3 submodels and pool where N(φ; µ, σ 2 ) is the normal density function with mean µ and variance σ 2 (or covariance matrix where appropriate).The two dimensional density function p 2 has an additional parameter ρ, which controls the intra-submodel marginal correlation.We set = 1 and ρ = 0.8.In the logarithmic case we set λ 1 = λ 3 and parameterise λ 2 = 1 − 2λ 1 , so that λ 1 + λ 2 + λ 3 = 1 whilst limiting ourselves to varying only λ 1 .Similarly, in the linear case we set λ 1,1 = λ 2,2 = λ 1 and λ 1,2 = λ 2,1 = 1 − 2λ 1 .We consider 5 evenly spaced values of λ 1 ∈ [0, 0.5].
For both pooling methods, as the weight λ 1 associated with models p 1 and p 3 increases, the relative contributions of p 1 (φ 1∩2 ) and p 3 (φ 2∩3 ) increase.Note the lack of correlation in p pool under linear pooling (right column of Figure 3) due to Equation (20).A large, near-flat plateau is visible in the λ 1 = 0.25 and λ 1 = 0.375 cases, which is a result of the mixture of four, 2-D normal distributions that linear pooling produces in this example.The logarithmic pooling process produces a more concentrated prior for small values of λ 1 , and does not result in a priori independence between φ 1∩2 and φ 2∩3 .Appendix A shows analytically that λ 2 controls the quantity of correlation present in p pool in this setting.

Posterior estimation
We now present a multi-stage MCMC method for generating samples from the melded posterior.Whilst the melded posterior is a standard Bayesian posterior and so can, in principle, be targetted using any suitable Monte Carlo method, in practice this may be cumbersome or infeasible.More specifically, it may be feasible to fit each submodel separately using standard methods, but when the submodels are combined -either through Markov melding, or by expanding the definition of one submodel to include another -the computation required to estimate the posterior in a single step poses an insurmountable barrier.In such settings we can employ multi-stage posterior estimation methods including Tom et al. (2010), Lunn et al. (2013), Hooten et al. (2019), andMauff et al. (2020).We propose a multi-stage strategy that uses the chain-like relationship to both avoid evaluating all submodels simultaneously, and parallelise the computation required in the first stage to produce posterior samples in less time than an equivalent sequential method4 .Avoiding concurrently evaluating all submodels also enables the reuse of existing software, minimising the need for custom submodel and/or sampler implementations.
We also describe an approximate method, where stage one submodels are summarised by normal distributions for use in stage two.
We consider the M = 3 case, as this setting includes both of our examples.Our approach can be extended to M > 3 settings, although we anticipate that it is unlikely to be suitable for large M settings.We discuss some of difficulties associated with generic, parallel methodology for efficient posterior sampling in Section 6.

Parallel sampler
Our proposed strategy involves obtaining in stage one samples from submodels 1 and 3 in parallel.Stage two reuses these samples in a Metropolis-within-Gibbs sampler, which targets the full melded posterior.The stage specific targets are displayed in Figure 4.
Figure 4: A graphical depiction of the submodels and their shared quantities, with the parallel sampling strategy overlaid.The stage one (s 1 ) targets are surrounded by blue dashed lines, with the stage two (s 2 ) target in red.
Stage one Two independent, parallel sampling processes occur in stage one.Terms from the melded model that pertain to p 1 and p 3 are isolated and targeted using standard MCMC methodology.This produces Stage two Stage two targets the melded posterior of Equation ( 9) using a Metropolis-within-Gibbs sampler, where the proposal distributions are where q(ψ * 2 | ψ 2 ) is a generic proposal distribution for ψ 2 .We draw an index n * 1 uniformly from {1, . . ., N 1 } and use the corresponding value (φ * 1∩2 , ψ * 1 ) n * 1 as the proposal, doing likewise for n * 3 and (φ * 2∩3 , ψ * 3 ) n * 3 .The acceptance probabilities for these updates are where α(x, z) denotes the probability associated with a move from z to x.Note that all stage two acceptance probabilities only contain terms from the second submodel and the pooled prior, and thus do not depend on ψ 1 or ψ 3 .If a move is accepted then we also store the index, i.e. n * 1 or n * 3 , associated with the move, otherwise we store the current value of the index.The stored indices are used to appropriately resample ψ 1 and ψ 3 from the stage one samples.

Normal approximations to submodel components
Normal approximations are commonly employed to summarise submodels for subsequent use in more complex models.For example, two-stage meta-analyses often use a normal distribution centred on each studies' effect estimate (Burke et al., 2017).Suppose we employ such an approximation to summarise the prior and posterior of φ 1∩2 and φ 2∩3 under p 1 and p 3 respectively.In addition, assume that (a) such approximations are appropriate for p 1 (φ 1∩2 ), p 1 (φ 1∩2 | Y 1 ), p 3 (φ 2∩3 ), and p 3 (φ 2∩3 | Y 3 ), (b) we are not interested in ψ 1 and ψ 3 , and can integrate them out of all relevant densities, and (c) we employ our second form of dictatorial pooling and choose p 2 (φ 1∩2 , φ 2∩3 ) as the authoritative prior.The latter two assumptions imply that the melded posterior of interest is proportional to Denote the normal approximation of p , which is a normal distribution with mean µ 1 and covariance matrix Σ 1 .The corresponding normal approximation of the prior p ).The equivalent approximations for the subposterior and prior of p and p 3 (φ 2∩3 | µ 3,0 , Σ 3,0 ) respectively.Substituting in the approximations and using standard results for Gaussian density functions (see Bromiley (2003) and Appendix C) results in where Standard MCMC methods can be used to sample from the approximate melded posterior.If instead we opt for product-of-experts pooling, all µ de and Σ de terms disappear from the parameter definitions in Equation ( 38).

An integrated population model for little owls
We now return to the integrated population model (IPM) for the little owls introduced in Section 1.1.1.Before we detail the specifics of each submodel, we must introduce some notation.Data and parameters are stratified into two age-groups a ∈ {J, A} where J denotes juvenile owls (less than one year old) and A adults, two sexes s ∈ {M, F }, and observations occur annually at times t ∈ {1, . . ., T }, with T = 25.The sexand age-specific probability of an owl surviving from time t to t + 1 is δ a,s,t , and the sex-specific probability of a previously captured owl being recaptured at time t + 1 is π s,t+1 so long as the owl is alive at time t + 1.

Capture recapture: p 1
Capture-recapture data pertain to owls that are released at time t (having been captured and tagged), and then recaptured at time u = t + 1, . . ., T , or not recaptured before the conclusion of the study, in which case u=1 M a,s,t,u be the number of owls released at time t, i.e. a vector containing the row-wise sum of the entries in M a,s .The recapture times for owls released at time t follow an age-and sex-specific multinomial likelihood with probabilities Q a,s,t = (Q a,s,t,1 , . . ., Q a,s,t,T +1 ) such that (40)

Count data model: p 2
To estimate population abundance, a two level model is used: the first level models the observed (counted) number of females at each point in time denoted y t , with a second, latent process modelling the total number of females in population.The observation model is where we denote the number of juvenile and adult females in the population at time t as x J,t and x A,t respectively, with x t = x J,t + x A,t .If sur t adult females survive from time t − 1 to time t, and imm t adult females immigrate over the same time period, then the latent, population level model is where η t is the immigration rate.The initial population sizes x J,1 and x A,1 have independent discrete uniform priors on {0, 1, . . ., 50}.If x t−1 = 0 then we assume that the Poisson and binomial distributions become point masses at zero.

Fecundity: p 3
The fecundity submodel considers the number of breeding females at time thus the quantities in common between the submodels are φ 1∩2 = (α 0 , α 2 ) and φ 2∩3 = ρ.To align the notation of this example with our chained melding notation we define, for all permitted values of a, s and Note that the definition of φ 1∩2 does not include α 1 as it is male specific and does not exist in p 2 .The model variant of Finke et al. (2019) we consider does not include α 3 , and for comparability we keep the other parameter indices the same.
To completely specify p meld we must choose how to form p pool (φ 1∩2 , φ 2∩3 ).We form p pool (φ 1∩2 , φ 2∩3 ) using three different pooling methods and estimate the melded posterior in each case.The first pooling method is product-of-experts (PoE) pooling, which is logarithmic pooling with λ = (1, 1, 1), and we denote the melded posterior as p meld, PoE .We also use logarithmic pooling with λ = ( 1 2 , 1 2 , 1 2 ), which is denoted p meld, log and results in the chained melded model being identical to the IPM.The final pooling method is linear pooling 2 ), denoted p meld, lin .

Posterior estimation
We estimate the melded posterior -p meld (φ, ψ | Y ), proportional to Equation (9) -using both the parallel sampler (Section 3.1) and normal approximation (Section 3.2).This allows us to use pre-existing implementations of the submodels.Specifically, the capture-recapture submodel is written in BUGS (Lunn et al., 2009) and sampled via rjags (Plummer, 2019).The fecundity submodel is written in Stan (Carpenter et al., 2017) and sampled via rstan (Stan Development Team, 2021).The count data submodel is also written in BUGS, and we reuse this implementation in stage two of the multi-stage sampler via NIMBLE (de Valpine et al., 2017) and its R interface ( NIMBLE Development Team, 2019).The approximate melded posterior obtained by Section 3.2 is sampled using rjags.Code and data for this example, as well as trace plots and numerical convergence measures (Vehtari et al., 2020) for both stages of the parallel sampling process, are available in the accompanying online repository5 .) and φ 2∩3 = ρ from the posterior of the original integrated population model p ipm , and the individual subposteriors from submodels p 1 , p 2 , and p 3 .Bottom row: credible intervals for the same quantities, but with a different x-axis scale, from the original IPM (repeated from top row); the chained melded posteriors using product-of-experts pooling, logarithmic pooling, and linear pooling denoted p meld , p meld, log and p meld, lin ; and the melded posterior using the normal approximation p meld .Intervals are 50%, 80%, 95%, and 99% wide.

Results
We empirically validate our methodology and sampler by comparing the melded posterior samples to a large sample -6 chains, each containing 1 × 10 5 post-warmup iterations -from the original IPM posterior.
Similarity in the posteriors is expected as the IPM is effectively the joint model we wish to approximate with the chained melded model.It is simply fortunate, from a modelling standpoint, that this example's joint model is easy to construct and computationally feasible with standard tools.Note that under logarithmic pooling with λ = ( 1 2 , 1 2 , 1 2 ) the melded posterior is identical to the original IPM, so any differences between the two posteriors are attributable to the multi-stage sampler.Figure 5 depicts the posterior credible intervals (Gabry et al., 2021;Kay, 2020) for the common quantities from the individual submodels, the melded models, and the original IPM.The top row in Figure 5 indicates that the count data alone (p 2 ) contain minimal information about α 0 , α 2 and ρ; incorporating the data from the other submodels is essential for precise estimates.
The multi-stage sampler works well by producing melded posterior estimates generally similar to the original IPM estimate, and are near identical for logarithmic pooling.PoE pooling produces the posterior most different from the original IPM, as it yields a prior for (α 0 , α 2 ) that is more concentrated around zero than the other pooling methods.The lack of large differences between the melded posteriors that use different pooled priors indicates that the prior has almost no effect on the posterior.The similarity of the approximate approach ( p meld -bottom row of Figure 5) to the melding approaches suggests that the normal approximations are good summaries of the subposteriors, and that the approximate melding procedure of Section 3.2 is suitable for this example.

Survival analysis with time varying covariates and uncertain event times
We return now to the respiratory failure example introduced in Section 1.1.2.Our intention is to illustrate the application of chained Markov melding to an example of realistic complexity, and explore empirically the importance of accounting for all sources of uncertainty by comparing chained Markov melding to equivalent analyses which use only a point estimate summary of the uncertainty.Specifically, event times and indicators are a noninvertible function of other parameters in the first submodel, and are an uncertain response in the survival submodel.Chained Markov melding enables us to specify a suitable joint model despite these complications.
There are i = 1, . . ., N individuals in the data set.Each individual is admitted to the ICU at time 0, and is discharged or dies at time C i .See Appendix I for information on how the N = 37 individuals were selected from MIMIC-III (Johnson et al., 2016).

P/F ratio submodel (B-spline): p 1
The first submodel fits a B-spline to the PaO 2 /FiO 2 data to calculate if and when an individual experiences respiratory failure.Each individual has PaO 2 /FiO 2 ratio observations z i,j (in units of mmHg) at times t i,j , with j = 1, . . ., J i .For each individual denote the vector of observations z i = (z i,1 , . . ., z i,Ji ) and observation times t i = (t i,1 , . . ., t i,Ji ).To improve computational performance, we standardise the P/F ratio data for each individual such that z i,j = zi,j −zi ŝi , where zi,j is the underlying unstandardised observation with mean z i and standard deviation ŝi .Similarly we rescale the threshold for respiratory failure: τ i = 300−zi ŝi .We choose to model the P/F ratio using cubic B-splines and 7 internal knots, and do not include an intercept column in the spline basis (for background on B-splines see: Chapter 2 in Hastie and Tibshirani, 1999;and the supplementary material of Wang and Yan, 2021).The internal knots are evenly spaced between two additional boundary knots at min(t i ) and max(t i ).These choices result in k = 1, . . ., 10 spline basis terms per individual, with coefficients ζ i,k where ζ i = (ζ i,1 , . . ., ζ i,10 ).We denote the individual specific B-spline basis evaluated at time t i,j as B i (t i,j ) ∈ [0, ∞) 10 so that the submodel can be written as We employ a weakly informative prior for the intercept β 0,i ∼ N(0, 1 2 ), a heavy tailed distribution for the error term6 ε i,j ∼ t 5 (0, ω i ), and a weakly informative half-normal prior for the unknown scale parameter For the spline basis coefficients we set ζ i,1 ∼ N(0, 0.1 2 ), and for k = 2, . . ., 10 we employ the random-walk prior Kharratzadeh (2017).
We identify that a respiratory failure event occurred (which we denote by d i = 1) at event time T i if a solution to the following optimisation problem exists We attempt to solve Equation 46 using a standard multiple root finder (Soetaert et al., 2020).If there are no roots then the individual died or was discharged before respiratory failure occurred so we set T i = C i and The relationship between T i and other model coefficients is displayed in the left hand panel of Figure 6.

Cumulative fluid submodel (piecewise linear) p 3
The rate of fluid administration reflects the clinical management of patients by ICU staff, and hence changes to the rate reflect decisions to change treatment strategy.We employ a breakpoint regression model to capture the effect of such decisions, and consider only one breakpoint as this appears sufficient to fit the observed data.Specifically, we model the 8-hourly cumulative fluid balance data x i,l (in litres) at times u i,l , l = 1, . . ., L i .The cumulative data are derived from the raw fluid input/output observations, which we detail in Appendix D. We denote the complete vector of observations by x i = (x i,1 , . . ., x i,Li ) and times by We assume a piecewise linear model with η 0,i as the value at the breakpoint at time κ i , slope η b 1,i before the breakpoint, and slope η a 1,i after the breakpoint.We write this submodel as It will be useful to refer to the fitted value of this submodel at arbitrary time as m i (t).We assume a weakly informative prior for the error term i,l ∼ N(0, σ 2 x,i ), with individual-specific error variances σ x,i ∼ N + (0, 5 2 ), and specific, informative priors for the slope before the breakpoint η b 1,i ∼ Gamma(1.53,0.24) and after η a 1,i ∼ Gamma(1.53,0.24).An appropriate prior for κ i and η 0,i is challenging to specify due to the relationship between the two parameters and the individual-specific support for κ i .We address both challenges by reparameterisation, resulting in a prior for κ i that, in the absence of other information, places the breakpoint in the middle of an individual's ICU stay, and a prior for η 0,i that captures the diverse pathways into ICU that an individual can experience.Details and justifications for all the informative priors are available in Appendix E. Figure 6 displays the parameters and their relationship to the fitted regression line.

Survival submodel p 2
The rate at which fluid is administered is thought to influence the time to respiratory failure (Seethala et al., 2017), so we explore this relationship using a survival model.Individuals experience respiratory failure (d i = 1) at time 0 < t < C i , or are censored (d i = 0, t = C i ).We assume a Weibull hazard with shape parameter γ for the event times.All individuals have baseline (time invariant) covariates w i,a , a = 1, . . ., A, with w i = (1, w i,1 , . . ., w i,A ) (i.e.including an intercept term), and common coefficients θ = (θ 0 , . . ., θ A ).
The hazard is assumed to be influenced by these covariates and the rate of increase ∂ ∂t m i (t) in the cumulative fluid balance.The strength of the latter relationship is captured by α.Hence, the hazard is The survival function at an individual's observed event time and status, du}, has an analytic form which we derive in Appendix F. Hence the likelihood for individual where we suppress the dependence on the parameters on the right hand side for brevity.

Pooling and estimation
We consider logarithmic pooling with λ = ( 4 5 , 4 5 , 4 5 ) (any smaller value of λ results in a prior that is so uninformative that it causes computational problems) and with λ = (1, 1, 1) (Product-of-Experts).Because the correlation between φ 1∩2 and φ 2∩3 in p 2 (φ 1∩2 , φ 2∩3 ) is important, we do not consider linear pooling in this example.Logarithmic pooling requires estimates of p 1 (φ 1∩2 ) and p 2 (φ 1∩2 , φ 2∩3 ).Because these are mixed distributions, with both discrete and continuous components, standard kernel density estimation, as suggested by Goudie et al. (2019), is inappropriate.Instead, we fit appropriate parametric mixture distributions to the unknown prior marginal distributions.Details for all the parametric mixture distribution estimates are contained in Appendix H.
We use the parallel multi-stage sampler with p pool,1 (φ 1∩2 ) = p 1 (φ 1∩2 ), p pool,3 (φ 2∩3 ) = p 3 (φ 2∩3 ) and p pool,2 (φ 1∩2 , φ 2∩3 ) = p pool (φ) / p 1 (φ 1∩2 )p 3 (φ 2∩3 ) .That is, in stage one we target the subposteriors p 1 (φ 1∩2 , ψ 1 | Y 1 ) and p 3 (φ 2∩3 , ψ 3 | Y 3 ); in stage two we target the full melded model.Targeting p 1 (φ 1∩2 , ψ 1 | Y 1 ) in stage one alleviates the need to solve Equation ( 46) within an MCMC iteration, instead turning the production of φ 1∩2 into an embarrassingly parallel, post-stage-one processing step.Attempting to sample the melded posterior directly would involve solving (46) many times within each iteration, presenting a sizeable computational hurdle which we avoid.It is crucial for the convergence of our multi-stage sampler that the components of φ 1∩2 and φ 2∩3 are updated individual-at-a-time in stage two.This is possible due to the conditional independence between individuals in the stage one posterior, and Appendix K contains the details of this scheme.The stage one subposteriors are sampled using Stan, using 5 chains with 10 3 warm-up iterations and 10 4 post warm-up iterations.We use Stan to sample ψ 2 where, in every MH-within-Gibbs step, we run Stan for 9 warm-up iterations and 1 post warm-up iteration7 .We run 5 chains of 10 4 iterations for all stage two targets.Visual and numerical diagnostics (Vehtari et al., 2020) are assessed and are available in the repository accompanying this paper8 .

Results
We first inspect the subposterior fitted values for p 1 and p 3 .The top row of Figure 7 displays the P/F data, the fitted submodel, and derived event times for individuals i = 17 and 29.The spline appears to fit the raw P/F data well, with the heavy tailed error term accounting for the larger deviations away from the fitted value.It is interesting to see the relatively wide, multimodal distribution for (T 29 , d 29 ) (there is a second mode at (T 29 = C 29 , d 29 = 0) and for other individuals not shown here).The bottom row of Figure 7 displays the cumulative fluid data and the fitted submodel, with the little noise in the data resulting in minimal uncertainty about the fitted value and a concentrated subposterior distribution.
To assess the importance of fully accounting for the uncertainty in φ 1∩2 and φ 2∩3 , we compare the posterior for ψ 2 obtained using the chained melding approach with the posterior obtained by fixing φ 1∩2 and φ 2∩3 .Plugging in a point estimate reflects common applied statistical practice when combining submodels, particularly when a distributional approximation is difficult to obtain (as it is for p 1 (φ 1∩2 | Y 1 )).Additionally, standard survival models and software typically do not permit uncertainty in event times and indicators, rendering such a plug-in approach necessary.Specifically, we fix φ 1∩2 to the median value9 for each individual under p 1 (φ 1∩2 | Y 1 ) and denote it φ 1∩2 , and use the subposterior mean of p 3 (φ 2∩3 | Y 3 ) denoted φ 2∩3 .With these fixed values we sample p(ψ 2 | φ 1∩2 , φ 2∩3 , Y 2 ).We also compare the melded posterior to the submodel marginal prior p 2 (ψ 2 ), but we note that this comparison is difficult to interpret, as the melding process alters the prior for ψ 2 .Figure 8 displays the aforementioned densities for (θ 3 , θ 17 , γ, α) ⊂ ψ 2 , with (θ 3 , θ 17 ) chosen as they exhibit the greatest sensitivity to the fixing of φ 1∩2 and φ 2∩3 .For the baseline coefficients (θ 3 , θ 17 ) the chained melding posterior differs slightly in location from p(ψ 2 | φ 1∩2 , φ 2∩3 , Y 2 ), with a small increase in uncertainty.A more pronounced change is visible for α, where the melding process has added a notable degree of uncertainty and shifted the posterior leftwards.To investigate which part of the melding process causes this change in the posterior of α, we consider fixing either one of φ 1∩2 and φ 2∩3 to their respective point estimates.That is, we employ Markov melding as described in Section 1.2, using either logarithmic or PoE pooling, to obtain p meld (α 9 displays the same distributions for α as Figure 8, and adds the posteriors obtained using one fixed value ( φ 1∩2 or φ 2∩3 ) whilst melding the other non-fixed parameter.
Evident for both choices of pooling is the importance of incorporating the uncertainty in φ 1∩2 .This is expected given the large uncertainty and multimodal nature of φ 1∩2 compared to φ 2∩3 (see Figure 7).We suspect that it is the multimodality in p 1 (φ 1∩2 | Y 1 ) that produces the shift in posterior mode of φ 1∩2 , with the width of p 1 (φ 1∩2 | Y 1 ) affecting the increase in uncertainty.Because we prefer the chained melded posterior, under either pooling method, for its full accounting of uncertainty we conclude that p(α | φ 1∩2 , φ 2∩3 , Y 2 ) is both overconfident and biased.
The marginal changes to the components of ψ 2 visible in Figure 8 appear small, however the cumulative effect of such changes becomes apparent when inspecting the posterior of the survival function.Figure 9: Median (vertical line), 50%, 80%, 95%, and 99% credible intervals (least transparent to most transparent) for α.The marginal prior (grey, top row) and posterior using fixed φ 1∩2 and φ 2∩3 (black, bottom row) are as in Figure 8.For the chained melded posteriors (red and blue, rows 2 and 3) and the melded posteriors (red and blue, rows 4 -7), the tick label on the y-axis denotes the type of pooling used, and which of φ 1∩2 and/or φ 2∩3 are fixed.
The posterior survival functions differ markedly, with the 95% intervals overlapping only for small values of time.It is also interesting to see that φ 1∩2 , despite being a reasonable point estimate of p 1 (φ 1∩2 | Y 1 ), is not very likely under the melded posterior.Figure 10 also suggests that the Weibull hazard is insufficiently flexible for this example.We discuss the complexities of other hazards in Section 6.

Conclusion
This paper introduces the chained Markov melded model.In doing so we make explicit the notion of submodels related in a chain-like way, describe a generic methodology for joining together any number of such submodels and illustrate its application with our examples.Our examples also demonstrate the importance of quantifying the uncertainty when joining submodels; not doing so can produce biased, overconfident inference.We also present the choices, and their impacts, that users of chained Markov melding must make which include: the choice of pooling function, and where required the pooling weights; the choice of posterior sampler and the design thereof, including the apportionment of the pooled prior over the stages and stage-specific MCMC techniques.
We have introduced extensions to linear and logarithmic pooling to marginals of different but overlapping quantities.Linear pooling, introduced in Section 1.2.1, could be extended to induce dependence between the components of φ using multivariate or vine copulas (Kurowicka and Joe, 2011;Nelsen, 2006), or other techniques (Lin et al., 2014).Copula methods are particularly appealing as, depending on the choice of copula, they yield computationally cheap to evaluate expressions for the density function, are easy to sample, and induce correlation between an arbitrary number of marginals.
Our parallel multi-stage sampler currently only considers M = 3 submodels, rather than the fully generic definition of chained Markov melding in Equation ( 10).Whilst we anticipate needing more complex methods in large M settings, the value of M at which the performance of our multi-stage sampler becomes unacceptable will depend on the specific submodels and data under consideration.A general method would consider a large and arbitrary number of submodels in a chain, and initially split the chain into more/fewer pieces depending on the computational resources available.Designing such a method is complex, as it would have to: • avoid requiring the inverse of any component of φ with a noninvertible definition, • estimate the relative cost of sampling each submodel's subposterior, to split the chain of submodels into steps/jobs of approximately the same computational cost, • decide the order in which pieces of the chain are combined.
These are substantial challenges.It may be possible to use combine the ideas in Lindsten et al. (2017) and Kuntz et al. (2021), who propose a parallel Sequential Monte Carlo method, with the aforementioned constraints to obtain a generic methodology.Ideally we would retain the ability to use existing implementations of the submodels, however the need to recompute the weights of the particles, and hence reevaluate previously considered submodels, may preclude this requirement.Our current sampler is also sensitive to large differences in location or scale of the target distribution between the stages.The impact of these differences can be ameliorated using the methodology of Manderson and Goudie (2022), and, more generally, Sequential Monte Carlo samplers are likely to perform better in these settings.
Our chained Markov melding methodology is general and permits any form of uncertainty in the common quantities.In Section 5 we use our chained melded model to incorporate uncertainty in the event times and indicators into a survival submodel.Some specific forms of uncertainty in the event times have been considered in previous work.These include Wang et al. (2020), who consider uncertain event times arising from record linkage, where the event time is assumed to be one of a finite number of event times arising from the record linkage; and Giganti et al. (2020), Oh et al. (2018), andOh et al. (2021), who leverage external validation data to account for measurement error in the event time.However, the general and Bayesian nature of our methodology readily facilitates any form of uncertainty in the event times and the event indicators; uncertainty in the latter is not considered in the cited papers.
The example in Section 5 has three more interesting aspects to discuss.Firstly, the P/F ratio data used in the first submodel is obtained by finding all blood gas measurements from arterial blood samples.Approximately 20% of the venous/arterial labels are missing.In these instances a logistic regression model, fit by the MIMIC team10 , is used to predict the missing label based on other covariates.It is theoretically possible to refit the model in a Bayesian framework and use the chained melded model to incorporate the uncertainty in the predicted sample label -adding another 'link' to the chain.
Secondly, the application of our multi-stage sampler to this example is similar to the two-stage approach used for joint longitudinal and time-to-event models (see Mauff et al., 2020 for a description of this approach).
In the two-stage approach, the longitudinal model is fit using standard MCMC methods in stage one, and the samples are reused in stage two when considering the time-to-event data.This can significantly reduce the computational effort required to fit the joint model.However, unlike our multi-stage sampler, the typical two-stage approach does not target the full posterior distribution, which can lead to biased estimates (though Mauff et al. (2020) extend the typical two-stage approach to reduce this bias).
Thirdly, we observe a lack of flexibility the baseline hazard, visible in Figure 10.More complex hazards could be employed, e.g.modelling the (log-)hazard using a (penalised) B-spline (Rosenberg, 1995;Royston and Parmar, 2002;Rutherford et al., 2015).However, this increased flexibility precludes an analytic form for the survival function.Whilst numerical integration is possible it is not trivial, particularly when the hazard is discontinuous, as our hazard is at the breakpoint.Splines also have more coefficients than the single parameter of the Weibull hazard.Identifiability issues arise with a small number of individuals, many of whom are censored, and are compounded when there are a relatively large number of other parameters (α, θ).Whilst we do not believe these costs are worth incurring for our example, for settings with a larger number of patients and more complicated longitudinal submodels the increased flexibility may be vital.

B Sequential sampler
Figure 11: A DAG of the submodels and their common quantities, with the sequential sampling strategy overlaid.The stage one target (s 1 ) is encapsulated in the light blue dashed line; stages two and three (s 2 , s 3 ) are in dark blue and red respectively.
Figure 11 depicts graphically the strategy employed by the sequential sampler.The sequential sampler assumes that the pooled prior decomposes such that This is necessary to avoid sampling all components of φ in the first stage.All pooled priors trivially satisfy (53), as we can assume all but p pool,3 (φ 1∩2 , φ 2∩3 ) are improper, flat distributions.However, including some portion of the pooled prior in each stage of the sampler can improve performance, and eliminate computational instabilities when submodel likelihoods contain little information.

B.1 Stage one
Stage one of the sequential sampler targets using a generic proposal kernel for both φ 1∩2 and ψ 1 .The corresponding acceptance probability for a proposed update from (φ 1∩2 , ψ 1 ) to

B.2 Stage two
The stage two target augments the stage one target by including the second submodel, corresponding prior marginal distribution, and an additional pooled prior term A Metropolis-within-Gibbs strategy is employed, where the stage one samples are used as a proposal for φ 1∩2 , whilst a generic proposal kernel is used for ψ 2 and φ 2∩3 .Thus the proposal distributions for φ * 1∩2 and The acceptance probability for this proposal strategy is Our judicious choice of proposal distribution has resulted in a cancellation in Equation ( 59) which removes all terms related to p 1 .Similarly, all terms related to p 1 are constant -hence cancel -in Equation ( 60).This eliminates any need to re-evaluate the first submodel.

B.3 Stage three
In stage three we target the full melded posterior The target has now been broadened to include terms from the third submodel and the entirety of the pooled prior.Again, we employ a Metropolis-within-Gibbs sampler, with proposals drawn such that which leads to acceptance probabilities of The informed choice of proposal distribution for (φ 1∩2 , φ 2∩3 , ψ 1 , ψ 2 ) has allowed us to target the full melded posterior without needing to evaluate all submodels simultaneously.

C Normal approximation calculations
Substituting in the approximations of Section 3.2 to Equation (36) yields the approximate melded posterior Noting that the product of independent normal densities is an unnormalised multivariate normal density with independent components, we rewrite Equation (66) as The ratio of normal densities is also an unnormalised normal density, and hence Equation (67) simplifies to Mathematically, we define an ordered vector of boundary values noting that dim(v i ) = L i + 1.Because the observation times encoded as days since ICU admission and we are interested in the 8-hourly changes, our floor and ceiling functions round down or up to the appropriate third respectively.The raw fluid observations are then divided up into L i subsets of {x i , ũi } based on which boundary values the observation falls in between: for l = 1, . . ., L i .Denote N V i,l = |V i,l | / 2 (dividing by two as V i,l contains both the observation and the observation time).The l th 8-hourly fluid change ∆ i,l and corresponding observation time u i,l can then be computed as Finally, the 8-hourly cumulative fluid balance data are computed by x i,l = l s=1 ∆ i,s , and we assume they too are observed at u i,l .

E Priors and justification for the cumulative fluid submodel
The parameters for the gamma prior for η b 1,i and η a 1,i are obtained by assuming that the 2.5-, 50-, and 97.5percentiles are at 0.5, 5, and 20 (Belgorodski et al., 2017).A slope of 0.5 (i.e. the change in cumulative fluid balance per day) is unlikely but possible due to missing data; a slope of 20 is also unlikely but possible as extremely unwell patients can have very high respiratory rates and thus require large fluid inputs.
The prior for the breakpoint κ i is derived as follows.Define u i,(1) = min(u i ) and u i,(n) = max(u i ), with r i = u i,(n) − u i,(1) .We reparameterise the breakpoint by noting that κ i = κ raw i r i + u i,(1) , where κ raw ∈ [0, 1].We then set κ raw i ∼ Beta(5, 5) to regularise the breakpoint towards the middle of each individual's stay in ICU.This is crucial to ensure the submodel is identifiable when there is little evidence of a breakpoint in the data.Note that this results in the following analytic expression for p 2 (φ 2∩3 ) by the change of variables formula.
Specifying a prior for η 0,i , the cumulative fluid balance at κ i , is difficult because it too depends on the length of stay.Instead, we reparameterise so that η 0,i is a function of the y-intercept η raw 0,i .
We place a LogNormal(1.61,0.47 2 ) prior on η raw 0,i .These values are obtained assuming that, a priori, the 2.5%, 50%, and 99% percentiles of η raw 0,i are 0.5, 5, and 15 respectively (Belgorodski et al., 2017).This is a broad prior that reflects the numerous possible admission routes into the ICU.We expect those admitted from the wards to have little pre-admission fluid data.Those admitted from the operating theatre occasionally have their in-theatre fluid input recorded after admission into the ICU, with no easy way to distinguish these records in the data.

F Analytic form for the survival function
The hazard at arbitrary time t is Then, for t > κ i , the cumulative hazard is The survival functions then have corresponding definitions for t > κ i and t < κ i as S i (t) = exp{− t 0 h i (u)du}.

G Survival submodel prior justification
Our prior for (γ, α, θ) must result in a plausible distribution for p 2,i (T i | d i = 1), and a reasonable balance between d i = 1 and d i = 0 events.The primary concern is unintentionally specifying a prior for which the bulk of p 2,i (T i | d i = 1) is very close to zero.In addition, certain extreme configurations of (γ, α, θ) cause issues for the methodology of Crowther and Lambert (2013), particularly the numerical root finding and numerical integration steps.We would like to rule out such extreme configurations a priori.Ideally we would encode this information a joint prior for (γ, α, θ), but specifying the appropriate correlation structure for these parameters is prohibitively challenging.Instead we focus on specifying appropriate marginals for each of γ, α, and θ, and create visual prior predictive checks (Gabry et al., 2019;Gelman et al., 2020) to ensure the induced prior for (T i , d i ) is acceptable.
Before justifying our chosen marginal prior, we note that the exp{x i θ + α ∂ ∂Ti m i (T i )} term implies that the priors for θ and α are on the log-scale.Hence the magnitude of these parameters must be small, otherwise all event times would be very near zero or at infinity.The asymmetric effect of the transformation from the log scale also implies that symmetric priors are not obviously sensible.From these observations we deduce that θ and α must not be too large in magnitude, however if they are negative then they can be slightly larger.
Hence, we specify the skew-normal priors detailed in Section 5.3, noting that the skewness parameter for α is smaller, because ∂ ∂Ti m i (T i ) is strictly positive and typically between 0.5 and 20, whilst w i is standardised to be approximately standard normal.Lastly, if γ is too far away from 1 (in either direction), then the event times are very small either because the hazard increases rapidly (γ 1), or because almost all of the cumulative hazard is in the neighbourhood of 0 (γ 1).We specify a gamma distribution for γ with the 1 th -, 50 th -, and 99 th -percentiles of p 2 (γ) at 0.2, 1, and 2, allowing for a wide range of hazard shapes, but removing many of the extremes.

H Estimating submodel prior marginal distributions
For p 1 , we note that p 1 (φ 1∩2 ) = N i=1 p 1,i (T i , d i ), and that p 1,i (T i , d i ) conditions on each individual's length of stay (in specifying the location of the knots), as well as the range, mean, and standard deviation of the P/F data (by standardising zi,j ).Simple Monte Carlo samples are drawn from p 1 (φ 1∩2 ) and used to estimate p 1 (φ 1∩2 ).Under the second submodel we obtain samples from p 2 (φ 1∩2 , φ 2∩3 ) using the methodology of Crowther and Lambert (2013) as implemented in simsurv (Brilleman, 2021).These samples are use to estimate p 2 (φ 1∩2 , φ 2∩3 ).

H.1 P/F submodel
We approximate p 1 (φ 1∩2 ) using a mixture of discrete and continuous distributions, with a discrete spike at C i for the censored events and a beta distribution for the (rescaled) event times.Monte Carlo samples of T i and d i are obtained from p 1,i (T i , d i ) by drawing β 0,i and ζ i from their respective prior distributions and then solving (46).Denoting the estimated mixture weight π i ∈ [0, 1], the density estimate is where π i , a i and b i are maximum likelihood estimates obtained using the prior samples.Examples of p 1,i (T i , d i ) for a subset of individuals are displayed in Figure 12.
• We cannot investigate the relationship between the rate of fluid intake and respiratory failure if the latter occurs without sufficient fluid data surrounding the event.
exists due to the chained relationship between submodels.It is important to note that p repl,m (φ, ψ m , Y m ) is defined on a larger parameter space than p m (φ m , ψ m , Y m ), as it includes φ −m .Define p repl,m (φ m , ψ m , Y m ) = p repl,m (φ, ψ m , Y m )dφ −m .Each marginally replaced submodel, as defined in Equation (6), minimises the following KL divergence 2 p repl,m (φ m , ψ m , Y m ) = arg min q D KL q p m | q(φ m ) = p pool (φ m ) for all φ m , (7) where p pool (φ m ) = p pool (φ)dφ −m .We can thus interpret p repl,m (φ m , ψ m , Y m ) as a minimally modified p m (φ m , ψ m , Y m ) which admits p pool (φ m ) as a marginal.Note that it is the combination of p repl,m (φ m , ψ m , Y m ) and p pool (φ −m | φ m ) that uniquely determine (6).

Finke
et al. (2019)  consider a number of variations on the original model ofSchaub et al. (2006) andAbadi et   al. (2010): here we consider only the variant fromFinke et al. (2019) with the highest marginal likelihood (Model 4 of their online supplement).This example is particularly interesting to us as, for a certain choice of pooling function and pooling weights, the chained Markov melded model and the IPM are identical.This coincidence allows us to use the posterior from the IPM as a benchmark for our multi-stage sampler.

Figure 5 :
Figure5: Top row: credible intervals for φ 1∩2 = (α 0 , α 2 ) and φ 2∩3 = ρ from the posterior of the original integrated population model p ipm , and the individual subposteriors from submodels p 1 , p 2 , and p 3 .Bottom row: credible intervals for the same quantities, but with a different x-axis scale, from the original IPM (repeated from top row); the chained melded posteriors using product-of-experts pooling, logarithmic pooling, and linear pooling denoted p meld , p meld, log and p meld, lin ; and the melded posterior using the normal approximation p meld .Intervals are 50%, 80%, 95%, and 99% wide.

Figure 6 :
Figure 6: Parameters and form for the P/F ratio submodel (p 1 , left) and cumulative fluid submodel (p 3 , right).

Figure 7 :
Figure 7: The P/F ratio data (Y 1 , top row); cumulative fluid data (Y 3 , bottom row); subposterior means and 95% credible intervals for each of the submodels (black solid lines and grey intervals); and stage one event times (T i , red rug in the top row) for individuals i = 17 and 29.

Figure 8 :
Figure8: Density estimates for a subset of ψ 2 .The submodel marginal prior p 2 (ψ 2 ) is shown as the grey dotted line (note that this is not the marginal prior under the melded model).The figure also contains the subposteriors obtained from chained melding using PoE pooling (red, solid line) and logarithmic pooling (blue, solid line), as well as the posterior using the fixed values p(ψ 2 | φ 1∩2 , φ 2∩3 , Y 2 ) (black, dashed line).

Figure 10 :
Figure 10: Survival curves and mean survival function at time t.The red, stepped lines are draws of φ 1∩2 from the melded posterior using PoE pooling, converted into survival curves via the Kaplan-Meier estimator.The smooth red line and interval (posterior mean and 95% credible interval) denote the model-based, mean survival function obtained from the melded posterior (PoE pooling) values of ψ 2 and φ 2∩3 .The blue dashed line is the Kaplan-Meier estimate of φ 1∩2 , and the blue solid line and interval are the corresponding model-based estimate from p(ψ 2 | φ 1∩2 , φ 2∩3 , Y 2 ).

D
68) Calculating the cumulative fluid balance from the raw fluid data In the raw fluid data each patient has l = 1, . . ., Li observations.Each observation xi, l is typically a small fluid administration (e.g. an injection of some medicine in saline solution), or a fluid discharge (almost always urine excretion).The observations have corresponding observation times ũi, l, with ũi = {ũ i,1 , . . ., ũi, Li } and xi = {x i,1 , . . ., xi, Li }.We code the fluid administrations/inputs as positive values, and the excretions/outputs as negative values.Each patient has an enormous number of raw fluid observations (L i Li ) and it is computationally infeasible to consider them all at once.We aggregate the raw fluid observations into 8-hourly changes in fluid balance.From these 8-hourly changes we calculate the cumulative fluid balance.

Figure 12 :
Figure 12: Fitted distribution (curve) and Monte Carlo samples drawn from p 1 (φ 1∩2 ) (histogram) for a subset of the individuals in the cohort.The height of the atom at C i (red bar and point) has been set to 1 − π i

Figure 14 :
Figure 14: Monte Carlo (MC) samples from p 2,i (φ 1∩2 , φ 2∩3 ) obtained using simsurv and samples from the fitted normal approximation (NA) for i = 19.The panels on the off diagonal elements contain a 2D kernel density estimate for d i = 1 and the samples for d i = 0. Diagonal and lower-triangular panels are on their original scales, whilst the upper-triangular panels are on the log scale.
denoted N t , and the number of chicks produced that survive and leave the nest denoted n t .A Poisson model is employed to estimate Abadi et al. (2010) parameterise the time dependent quantities via linear predictors to minimise the number of parameters in the submodels.The specific parameterisation of Finke et al. (2019) we employ is logit(δ a,s,t