Distributed Computation for Marginal Likelihood based Model Choice

We propose a general method for distributed Bayesian model choice, using the marginal likelihood, where a data set is split into non-overlapping subsets. These subsets are accessed only locally by individual workers and no data is shared between the workers. We approximate the model evidence for the full data set through Monte Carlo sampling from the posterior on every subset, generating a model evidence per subset. The results are combined using a novel approach that corrects for the splitting using summary statistics of the generated samples. Our divide-and-conquer approach enables Bayesian model choice in the large-data setting, exploiting all available information while limiting communication between workers. We derive theoretical error bounds that quantify the resulting trade-off between computational gain and loss in precision. The embarrassingly parallel nature yields important speed-ups when used on massive data sets, as illustrated by our real-world experiments. In addition, we show how the suggested approach can be extended to model choice within a reversible jump setting that explores multiple feature combinations within one run.


Introduction
Distributed Bayesian computation largely falls in the MapReduce or split-apply-combine frameworks for computation (Dean and Ghemawat, 2008; Wickham, 2011). The distributed Bayesian computational workflow can be described in three key steps:
• Split: Divide the data set into subsets and distribute across nodes (workers).
• Apply: Each worker independently computes a posterior distribution based on a subset.
• Combine: Aggregate the worker posterior distributions to form a consensus.
The combine step involves the synthesis of multiple probability distributions, and falls under the general umbrella of belief aggregation, a well-studied topic in meta-analysis and probabilistic forecasting (West, 1984; Genest, 1984; Dietrich, 2010). Our goal is to determine how marginal likelihood based Bayesian model choice can be operationalised within the aforementioned computational framework. We find that belief aggregation rules for model choice differ in some key aspects compared to fixed-model inference and examine the theoretical and computational consequences.
In summary, our main contributions are the following.
• We derive a general decomposition of the marginal likelihood that enables efficient divide-and-conquer calculation without accessing the data in one single place. The combination of the results requires minimal communication and no exchange of data. The computational complexity per worker is O(n/S) instead of O(n) on a single machine, where n is the number of observations and S the number of workers.
• We provide a theoretical analysis of two different algorithms for distributed calculation of the marginal likelihood. The first is a simulation consistent approach making use of data augmentation, and the second is an approximate approach relying on local normal approximations. Error bounds are developed for the approximate approach.
• We illustrate the performance of our method on several challenging applications with millions of data points and show that the computation time is reduced by several orders of magnitude, incurring only a negligible bias.
• We show how to apply our approach in a reversible jump setting where an MCMC sampler moves between different feature spaces.
The rest of our work is structured as follows. We discuss related work in Section 2 and introduce relevant background material in Section 3. Then we present our main contributions on distributed Bayesian model choice in Section 4. In Section 5 we demonstrate the applicability of our approach on several data sets and models before discussing possible extensions in Section 6.

Related Work
Previous work on computationally efficient methods for processing data sets with massive sample sizes can generally be divided into two broad categories. Bardenet et al. (2017), Jahan et al. (2020) and Zhu et al. (2017) provide reviews focusing on Bayesian methods for big data.
A first stream of work focuses on speeding up computation using mini batches (i.e., random subsets) of the entire data. Initially introduced for optimisation in the field of machine learning, the idea of approximating a posterior using mini batches has received substantial attention since the work of Welling and Teh (2011) on stochastic gradient Langevin sampling and the work of Hoffman et al. (2013) on mini batch sampling for variational inference. Subsequent work on this family of algorithms has included theoretical analysis and practical extensions, see, for example, Chen et al. (2014); Alquier et al. (2016); Quiroz et al. (2019); Dang et al. (2019). The calculation of normalising constants, however, has received less attention, notable exceptions being the work of Lyne et al. (2015) and Gunawan et al. (2020).
The second line of work aims to make use of parallel processing for reducing computation time. The idea of using a divide-and-conquer approach has seen interest in both the statistics and the machine learning community, see, for example, Deisenroth and Ng (2015) for an application in Gaussian processes and Jordan et al. (2019) for a general approach under communication constraints. A variety of follow-up work has been sparked by the work of Scott et al. (2016). The consensus Monte Carlo (CMC) approach performs posterior sampling on data shards and combines the results on a single worker. This approach has been discussed and improved from both a theoretical and a practical perspective (Wang and Dunson, 2013; Neiswanger et al., 2013; Minsker et al., 2014; Scott et al., 2017; Srivastava et al., 2018; Zhang et al., 2018; Szabó and van Zanten, 2019). The idea of distributed computation has since been picked up in different communities as, for example, in sequential Monte Carlo (Rendell et al., 2020) and expectation propagation (Barthelmé et al., 2018). Combining different models using different data sources has also received substantial attention, see for instance Goudie et al. (2019); Jacob et al. (2017). The concept of federated learning (Li et al., 2020), where the estimation of a model is achieved in a highly distributed setting with repeated rounds of communication between a central node and workers, is another field adjacent to our work. Despite growing interest in Bayesian federated learning (Chen and Chao, 2021; Yurochkin et al., 2019), Bayesian model choice has seen little investigation.
We will proceed under the assumption that there are barriers to implementing a mini-batch based algorithm. Storing the data in one location may be infeasible, or privacy restrictions and communication costs rule out the possibility for one central node to be connected to each site with a subset of the data. Situations where this may occur include the analysis of electronic health records and the analysis of genetic data from large cohorts. It is natural to consider the Laplace approximation to the marginal likelihood given the large sample size (Kass and Raftery, 1995). However, the assumed barriers impede the direct computation of the Laplace approximation, as determining the posterior mode and Hessian would require extensive communication between the sites and the central node, see, for example, McMahan et al. (2017); Safaryan et al. (2021). Divide-and-conquer approaches remain feasible in such a setting.

Background
We first give an overview on Bayesian model choice using the marginal likelihood, and useful computational techniques for estimation of the marginal likelihood. We then discuss core principles for distributed Bayesian inference, and cover the use of normal approximations in the consensus Monte Carlo algorithm by Scott et al. (2016).

Bayesian model choice
The normalising constant We define the posterior distribution of a parameter θ ∈ Θ ⊂ R^p given data y as p(θ|y) = p(y|θ)p(θ)/p(y). It depends on the unknown normalising constant p(y). This normalising constant is the marginal likelihood of the data given the model, also called the model evidence, and is calculated as

p(y) = ∫_Θ p(y|θ)p(θ) dθ.

In most settings this constant is not analytically tractable, as it involves integration over a potentially high-dimensional parameter space Θ. Various sampling-based methods are available for the approximate calculation of the evidence, for example, importance sampling (IS) (Geweke, 1989), sequential Monte Carlo (Del Moral et al., 2006), nested sampling (Skilling, 2006) or bridge sampling (Meng and Wong, 1996; Gelman and Meng, 1998). See Knuth et al. (2015) for a review of the different methods.
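To make the importance sampling estimator of the evidence concrete, the following toy sketch (our own illustration, assuming a conjugate normal model y_j | θ ~ N(θ, 1) with prior θ ~ N(0, 1) so that the exact answer is available in closed form) estimates log p(y) by importance sampling and checks it against the exact value obtained from the candidate's identity log p(y) = log p(θ) + log p(y|θ) − log p(θ|y).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 50
y = rng.normal(0.4, 1.0, n)          # y_j | theta ~ N(theta, 1), prior theta ~ N(0, 1)

# conjugate posterior: theta | y ~ N(m, v)
prec = n + 1.0
m, v = y.sum() / prec, 1.0 / prec

# exact log evidence via the candidate's identity evaluated at theta = 0
log_p_exact = (norm.logpdf(0.0, 0.0, 1.0)
               + norm.logpdf(y, 0.0, 1.0).sum()
               - norm.logpdf(0.0, m, np.sqrt(v)))

# importance sampling with an overdispersed posterior approximation as proposal
N = 50_000
theta = rng.normal(m, np.sqrt(2.0 * v), N)
log_w = (norm.logpdf(theta, 0.0, 1.0)
         + norm.logpdf(y[None, :], theta[:, None], 1.0).sum(axis=1)
         - norm.logpdf(theta, m, np.sqrt(2.0 * v)))
log_p_is = np.logaddexp.reduce(log_w) - np.log(N)
```

With a proposal that is heavier-tailed than the posterior, the importance weights are well behaved and the estimator concentrates quickly around the exact value.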
The Bayes factor The posterior distribution over a set of competing models is defined using the marginal likelihood and a prior distribution on models. Models may be distinguished by different choices of link functions, hyperparameters or feature spaces. The posterior probability of model m_i is given by

p(m_i|y) = p(y|m_i)p(m_i) / ∑_j p(y|m_j)p(m_j),   (1)

where p(y|m_i) = ∫_Θ p(y|θ, m_i)p(θ|m_i) dθ, with both prior and likelihood now depending on the model i. Models may also be compared using the Bayes factor (BF) (Kass and Raftery, 1995). The BF is calculated as

BF_12 = {p(m_1|y)/p(m_2|y)} × {p(m_2)/p(m_1)} = p(y|m_1)/p(y|m_2),   (2)

where p(m_i) denotes the prior probability of model i, p(m_i|y) denotes the posterior probability of the model given the data, and p(y|m_i) for i = 1, 2 denotes the probability of the data given the model. For two models the BF allows one to select the model with the highest posterior probability while adjusting for the prior odds. It is an alternative to standard statistical testing when it comes to model choice.
Data augmentation and conditionally conjugate models Data augmentation has proven to be an important technique for Bayesian computation, facilitating both posterior sampling and marginal likelihood estimation (van Dyk and Meng, 2001; Tanner and Wong, 2010). The rationale is to introduce a latent variable z ∈ Z such that the complete data model p(y, z|θ) is more mathematically and computationally tractable than the marginal model p(y|θ) = ∫_Z p(y, z|θ) dz. Example applications include binary regression models, mixture models and factor analysers (van Dyk and Meng, 2001; Holmes and Held, 2006; Tanner and Wong, 2010). A particularly useful strategy is to construct a sampler on the extended space, θ, z ∼ p(θ, z|y), and to marginalise out z once a cloud of samples has been generated. This is an efficient strategy if the full conditionals p(θ|y, z) and p(z|y, θ) have known distributions, as it allows the construction of a Gibbs sampler. Conditionally conjugate models are obtained when selecting the prior p(θ) to be conjugate to the complete data likelihood p(y, z|θ) (Gelman et al., 2013). The prior p(θ) and the conditional posterior p(θ|y, z) are then part of the same distribution family, and hence sampling from the conditional posterior becomes straightforward for commonly chosen priors. The posterior distribution can be represented as p(θ|y) = ∫ p(θ|z, y)p(z|y) dz.
For any ordinate θ, the log marginal likelihood satisfies

log p(y) = log p(θ) + log p(y|θ) − log ∫ p(θ|z, y)p(z|y) dz.

Following Chib (1995), a simulation consistent estimator of the model evidence is

log p̂(y) = log p(θ) + log p(y|θ) − log { (1/N) ∑_{i=1}^N p(θ|z^i, y) },

where the z^i are samples from the posterior distribution p(z|y).
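The identity above can be sketched numerically. The following minimal example (our own toy construction, not the paper's implementation) uses a conjugate normal model with latent means, y_j | z_j ~ N(z_j, σ²), z_j | θ ~ N(θ, τ²), θ ~ N(0, 1), runs a Gibbs sampler on (θ, z), and compares the Chib estimate of log p(y) with the analytically available marginal likelihood.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma2, tau2, n = 1.0, 0.5, 20          # known variances, sample size
z_true = rng.normal(0.7, np.sqrt(tau2), n)
y = rng.normal(z_true, np.sqrt(sigma2))

# Gibbs sampler on the augmented space (theta, z), prior theta ~ N(0, 1)
N, burn = 5000, 500
theta, theta_draws, cond = 0.0, [], []
for _ in range(N):
    prec_z = 1.0/sigma2 + 1.0/tau2
    z = rng.normal((y/sigma2 + theta/tau2)/prec_z, np.sqrt(1.0/prec_z))
    prec_t = 1.0 + n/tau2
    m_t = (z.sum()/tau2)/prec_t
    theta = rng.normal(m_t, np.sqrt(1.0/prec_t))
    theta_draws.append(theta)
    cond.append((m_t, 1.0/prec_t))      # parameters of p(theta | z, y)

theta_star = np.mean(theta_draws[burn:])    # a high-density ordinate
# Chib's identity: log p(y) = log p(theta*) + log p(y|theta*) - log p(theta*|y),
# with p(theta*|y) estimated by averaging p(theta*|z_i, y) over the Gibbs draws
log_post_ord = np.log(np.mean([norm.pdf(theta_star, m, np.sqrt(v))
                               for m, v in cond[burn:]]))
chib = (norm.logpdf(theta_star, 0.0, 1.0)
        + norm.logpdf(y, theta_star, np.sqrt(sigma2 + tau2)).sum()
        - log_post_ord)

# exact log marginal likelihood for this conjugate toy model
cov = (sigma2 + tau2)*np.eye(n) + np.ones((n, n))
exact = multivariate_normal(np.zeros(n), cov).logpdf(y)
```

The two quantities should agree up to Monte Carlo error from the Gibbs chain.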

Distributed Bayesian inference
We assume that we observe data y ∈ Y consisting of n data points that can be split into S non-overlapping data shards y_s, each potentially containing several observations, such that y = {y_1, · · · , y_S}. The likelihood is assumed to satisfy the independence condition p(y|θ) = ∏_{s=1}^S p(y_s|θ). As outlined in the introduction, the general strategy for distributed Bayesian inference is to allocate a shard of data to each worker, which then computes a local posterior distribution on the basis of the subset y_s. The major challenge in this approach is the formation of a consensus probability distribution given the individual posterior distributions computed by each worker.
An important tool for the synthesis of probability distributions is product-of-experts pooling (Genest, 1984; Hinton, 2002; Dietrich, 2010). Taking θ to be the quantity of interest, the product-of-experts consensus distribution q(θ) is formed by taking the product of the worker distributions {q_s(θ)}_{s=1}^S, so q(θ) ∝ ∏_{s=1}^S q_s(θ). The significance of product-of-experts pooling is readily seen when the model is treated as being fixed and known. The full posterior can be represented as

p(θ|y) ∝ ∏_{s=1}^S p(y_s|θ)p(θ)^{1/S},

where p(y_s|θ) is a likelihood factor over the shard y_s and p(θ)^{1/S} is the unnormalised subprior, i.e. a fraction of the initial prior p(θ). Each worker may compute a so-called subposterior p̃(θ|y_s) on a shard of data using the fractionated prior, p̃(θ|y_s) ∝ p(y_s|θ)p(θ)^{1/S}. The full posterior distribution can then be obtained by product-of-experts pooling of the subposteriors (Huang and Gelman, 2005; Scott et al., 2016),

p(θ|y) ∝ ∏_{s=1}^S p̃(θ|y_s).   (3)
This important property does not transfer to model selection. There is no analogous product-of-experts decomposition of the full posterior distribution over models, where the subposterior model probabilities are defined by p̃(m_i|y_s) ∝ p̃(y_s|m_i)p(m_i)^{1/S} and the subposterior evidence p̃(y_s|m_i) = ∫_Θ p(y_s|θ, m_i)p̃(θ|m_i) dθ is determined using the normalised subprior p̃(θ|m_i) = p(θ|m_i)^{1/S} / ∫_Θ p(θ|m_i)^{1/S} dθ. This is due to the fact that, in general, p(m_i|y) is not proportional to

∏_{s=1}^S p̃(m_i|y_s).   (4)

The lack of a product-of-experts decomposition for the full posterior on models (4) suggests that the protocol for distributed Bayesian inference in the fixed-model setting may not be effective for model choice. If the goal is to recover the full posterior model probabilities p(m_i|y), it is no longer sufficient to compute subposterior distributions over models in the apply step and then form a consensus distribution using product-of-experts pooling in the combine step. The correct belief aggregation procedure for model selection is developed in Section 4.
In practice, each worker will typically return a Monte Carlo or analytic approximation of the subposterior distribution. An important consideration with distributed Bayesian inference is how the mean squared error of the consensus scales with the number of subsets S (Bardenet et al., 2017; Scott et al., 2016; Neiswanger et al., 2013). There is almost always a trade-off between the error and the computational benefits afforded by distributing the workload across more nodes S. This dynamic will be a key feature in our theoretical analysis and computational experiments.

Consensus Monte Carlo
The Consensus Monte Carlo algorithm (Scott et al., 2016) is based on the product-of-experts posterior decomposition (3). The key to the approach is that the product of normal subposteriors is proportional to another normal distribution:

∏_{s=1}^S N(θ|µ_s, Σ_s) ∝ N(θ|µ, Σ).

The first step is to draw samples θ_s^1, · · · , θ_s^N from each of the individual subposteriors p̃(θ|y_s). This sampling can be achieved, for example, using standard MCMC algorithms like random walk Metropolis-Hastings (Hastings, 1970; Dunson and Johndrow, 2019) or Hamiltonian Monte Carlo (Neal et al., 2011). The samples are then recombined using a normal approximation to the subposteriors.
The normal approximation N(µ_s, Σ_s) ≈ p̃(θ|y_s) is based on the estimated mean µ_s and variance Σ_s of the subposterior samples, using the Laplace-Metropolis approximation (Lewis and Raftery, 1997). This approximation is asymptotically justified through the Bernstein-von-Mises (BvM) theorem (see also Ghosh and Ramamoorthi (2003)). The recombination of the samples from the local Markov chains is based on a weighting according to the inverse covariances of the subposteriors (Scott et al., 2016). Combining the sampling-based approach with the normal approximation has the advantage that more features of the posterior distribution are captured compared to the use of a plain normal approximation where the sampling step would be omitted. See Scott et al. (2016, 2017) for more details.
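A minimal sketch of the CMC combine step may help fix ideas. This is our own illustration, not the reference implementation: we use a scalar θ (so the inverse-covariance weights reduce to inverse variances) and draw exact Gaussian subposterior samples as a stand-in for MCMC output, then compare the consensus draws with the exact full-data posterior of a conjugate normal model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, S, N = 1000, 4, 20000
y = rng.normal(0.5, 1.0, n)          # y_j ~ N(theta, 1), prior theta ~ N(0, v0)
shards = np.array_split(y, S)

# subposteriors under the fractionated prior N(0, S*v0)
v0 = 1.0
draws, weights = [], []
for ys in shards:
    prec = len(ys) + 1.0/(S*v0)      # subprior variance inflated by S
    m_s, v_s = ys.sum()/prec, 1.0/prec
    draws.append(rng.normal(m_s, np.sqrt(v_s), N))   # stand-in for MCMC output
    weights.append(1.0/v_s)          # inverse subposterior variance

# consensus draws: precision-weighted average across workers (Scott et al., 2016)
W = np.array(weights)[:, None]
consensus = (W*np.array(draws)).sum(axis=0)/W.sum()

# exact full-data posterior for comparison
prec_full = n + 1.0/v0
m_full, v_full = y.sum()/prec_full, 1.0/prec_full
```

In this Gaussian setting the weighted combination recovers the full posterior mean and variance exactly, up to Monte Carlo error.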

Distributed Model Choice
We will now introduce and discuss our contributions making use of the previously introduced background material.
Decomposing the model evidence Using Bayes' theorem and some elementary algebra, it is possible to obtain an identity for the marginal likelihood that lends itself to distributed computation.

Proposition 1. Let p̃(θ) = p(θ)^{1/S}/α, with α = ∫_Θ p(θ)^{1/S} dθ, denote the normalised subprior, and let p̃(θ|y_s) ∝ p(y_s|θ)p̃(θ) denote the subposterior with evidence p̃(y_s) = ∫_Θ p(y_s|θ)p̃(θ) dθ. Then the marginal likelihood satisfies

p(y) = α^S × ∏_{s=1}^S p̃(y_s) × ∫_Θ ∏_{s=1}^S p̃(θ|y_s) dθ.   (5)
Proof. See Appendix.
Each of the three components in the decomposition (5) has an interpretation in terms of the generic split-apply-combine framework mentioned in the introduction. The term α^S reflects that each worker is allocated a fraction of the prior information in the split step. The subposterior evidence for each shard, p̃(y_s), is computed by the workers in the apply stage. The evaluation of the subposterior integral ∫_Θ ∏_{s=1}^S p̃(θ|y_s) dθ is a necessary step for appropriate evidence synthesis in the combine stage.
Using Proposition 1 it is straightforward to show that the full posterior distribution on models has the modified product-of-experts representation

p(m_i|y) ∝ α_i^S × ∏_{s=1}^S p̃(m_i|y_s) × ∫_Θ ∏_{s=1}^S p̃(θ|y_s, m_i) dθ,   (6)

recalling that the subposterior model probabilities are given by p̃(m_i|y_s) ∝ p̃(y_s|m_i)p(m_i)^{1/S}. Comparing (6) to (4), we see that a modified product-of-experts rule is necessary for distributed Bayesian model selection, as the subposterior model probabilities p̃(m_i|y_s) do not contain enough information to recover the full posterior model probabilities p(m_i|y). Subset analyses will be relatively under-powered compared to the full data set analysis if the shard size is small compared to the total sample size, and this may manifest in a bias towards smaller models in the model subposteriors p̃(m_i|y_s). A secondary issue is that a model may appear to fit well in each subset but be of poor fit overall. The inclusion of the integral term over the subposteriors, ∫_Θ ∏_{s=1}^S p̃(θ|y_s, m_i) dθ, allows the global goodness of fit to be reconstructed by considering the overlap in the subposterior distributions.
We will now focus on how to use the presented decomposition in an algorithm to approximate the marginal likelihood of the full data set. In a variety of settings α can be computed analytically. For example, if p(θ) is a normal distribution, fragmenting the prior amounts to an inflation of the prior variance: N(0, Σ)^{1/S} ∝ N(0, SΣ), and α is thus obtained easily. The same holds true for a Laplace prior, where L(0, σ)^{1/S} ∝ L(0, Sσ). However, care must be taken for some distributions of the exponential family. When S is too large, the integral ∫_Θ p(θ)^{1/S} dθ can become infinite as too much mass is pushed into the tails. We recommend checking this on a case by case basis. The local evidence p̃(y_s) can be calculated by every individual worker using one of the various techniques described at the start of Section 3.1. The most difficult part of (5) is to evaluate the last integral,

I_sub = ∫_Θ ∏_{s=1}^S p̃(θ|y_s) dθ.   (7)

We will now discuss two different strategies to estimate the integral I_sub (7), an exact approach based on data augmentation and Gibbs sampling, and an approximate approach using normal approximations.
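The decomposition (5) can be verified numerically in a fully conjugate setting. The sketch below (our own check, assuming a scalar normal model y_j ~ N(θ, σ²) with prior θ ~ N(0, v0), so that α, the shard evidences and the subposteriors are all available in closed form) reconstructs log p(y) from the three components and compares it with the exact full-data marginal likelihood; the subposterior integral I_sub is evaluated by one-dimensional quadrature.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)
sigma2, v0, S, n = 1.0, 1.0, 4, 40
y = rng.normal(0.3, np.sqrt(sigma2), n)
shards = np.array_split(y, S)

# alpha = integral of p(theta)^(1/S) for p(theta) = N(0, v0)
alpha = np.sqrt(2*np.pi*S*v0) * (2*np.pi*v0)**(-1.0/(2*S))

log_sub_ev, sub_post = 0.0, []
for ys in shards:
    ns = len(ys)
    # shard evidence under the fractionated prior N(0, S*v0)
    cov = sigma2*np.eye(ns) + S*v0*np.ones((ns, ns))
    log_sub_ev += multivariate_normal(np.zeros(ns), cov).logpdf(ys)
    prec = ns/sigma2 + 1.0/(S*v0)
    sub_post.append((ys.sum()/sigma2/prec, 1.0/prec))   # subposterior N(m_s, v_s)

# I_sub by 1-d quadrature over the product of subposterior densities
grid = np.linspace(-3.0, 3.0, 20001)
dx = grid[1] - grid[0]
prod = np.ones_like(grid)
for m_s, v_s in sub_post:
    prod *= norm.pdf(grid, m_s, np.sqrt(v_s))
log_I = np.log(prod.sum()*dx)

log_p_dist = S*np.log(alpha) + log_sub_ev + log_I       # decomposition (5)
cov_full = sigma2*np.eye(n) + v0*np.ones((n, n))
log_p_exact = multivariate_normal(np.zeros(n), cov_full).logpdf(y)
```

Because the subposteriors are exactly Gaussian here, the identity holds exactly and the two quantities agree up to quadrature error.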

Data augmentation to compute I sub
A data augmentation strategy can be used to construct a simulation consistent estimator of the components in (5). See Ahfock (2019, Chap 2) for a detailed discussion.
In particular, the latent variables are helpful for calculating an estimator of (7), as we show now. We assume that the latent variables z_s are independent given θ. Then, the augmented full data set posterior is again proportional to the product of the subposteriors, i.e.,

p(θ, z_{1:S}|y) ∝ ∏_{s=1}^S p̃(θ, z_s|y_s).

The conditional latent variable subposteriors on the individual workers (accessing only y_s) are given as

p̃(z_s|y_s, θ) = p(z_s|y_s, θ),

and hence the subprior p̃(θ) is not material when conditioning on θ. This establishes a Gibbs sampling strategy for sampling from p̃(θ, z_s|y_s) using the conditionals defined above. In conditionally conjugate models, where the subpriors are chosen such that ∫_Θ ∏_{s=1}^S p̃(θ|y_s, z_s) dθ has a closed form solution, we can exploit the samples generated from the latent variable subposteriors to approximate I_sub.
Proposition 2. Suppose the latent variables z_s are independent given θ. Then the subposterior integral satisfies

I_sub = E[ ∫_Θ ∏_{s=1}^S p̃(θ|y_s, z_s) dθ ],

where the expectation is taken with respect to z_s ∼ p̃(z_s|y_s), independently for s = 1, · · · , S.

Proof. See Appendix.
As a consequence, Proposition 2 suggests the following Monte Carlo estimator,

Î_sub = (1/N) ∑_{i=1}^N ∫_Θ ∏_{s=1}^S p̃(θ|y_s, z_s^i) dθ,   (8)

where ∫_Θ ∏_{s=1}^S p̃(θ|y_s, z_s^i) dθ is calculated analytically and the z_s^i are samples obtained from, e.g., a Gibbs sampler. See the algorithmic version of the sampler in Algorithm 1. We provide further details on data augmentation for logistic regression using Pólya-Gamma data augmentation in the appendix. The variance of the proposed estimator can be understood through a connection to importance sampling.

Proposition 3. Suppose the z_s^i are drawn independently from p̃(z_s|y_s) for s = 1, · · · , S. Then

Var(Î_sub) = (I_sub^2/N) Var[ p(z_{1:S}|y_{1:S}) / p̃(z_{1:S}|y_{1:S}) ],

where p̃(z_{1:S}|y_{1:S}) = ∏_{s=1}^S p̃(z_s|y_s) and the variance on the right hand side is taken with respect to p̃(z_{1:S}|y_{1:S}).
Proof. See Appendix.
This links the variance of the estimator Î_sub to the relative coverage of the joint conditional posterior p(z_{1:S}|y_{1:S}) and the product of the conditional subposteriors p̃(z_{1:S}|y_{1:S}). The variance of the ratio measures the quality of the product of the conditional subposteriors as an importance sampling proposal distribution. If the tails of the proposal are thinner than those of the target, the variance of the estimator can become infinite. As the number of splits increases, we expect the product of the conditional subposteriors to become a worse approximation of the joint conditional posterior, and the variance of the estimator to increase. Note that in practice the variance of Î_sub will be larger than the suggested quantity due to the autocorrelation introduced by the Gibbs sampler.

Algorithm 1 Distributed model evidence computation using data augmentation
Input: data y, number of chunks S, likelihood p(·|θ), prior p(θ).
Split: Divide the data into S chunks y_1, · · · , y_S.
Apply in parallel:

A normal approximation to I sub
Models relying on data augmentation form a rather restricted class of models that require tailor-made samplers. A more widely applicable approach to estimate I_sub (7) is a normal approximation to each subposterior, p̃(θ|y_s) ≈ N(θ|µ_s, Σ_s), yielding the following approximation of the integral:

I_sub ≈ ∫_Θ ∏_{s=1}^S N(θ|µ_s, Σ_s) dθ.   (9)

The mean and covariance parameters for the subposterior normal approximations, µ_s, Σ_s, can be estimated given subposterior samples generated by each worker. Using the fact that the product of normal density functions is proportional to another normal density, we can provide the closed form expression (taking Θ = R^p)

∫ ∏_{s=1}^S N(θ|µ_s, Σ_s) dθ = (2π)^{(1−S)p/2} ( |Σ| / ∏_{s=1}^S |Σ_s| )^{1/2} exp( −(1/2) [ ∑_{s=1}^S µ_s^T Σ_s^{−1} µ_s − µ^T Σ^{−1} µ ] ),   (10)

where Σ = ( ∑_{s=1}^S Σ_s^{−1} )^{−1} and µ = Σ ( ∑_{s=1}^S Σ_s^{−1} µ_s ).
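The closed form for the Gaussian product integral can be implemented and sanity-checked in a few lines. The sketch below (our own helper, with the hypothetical name `log_gaussian_product_integral`) implements the formula for general p and S, and checks the S = 2 case against the known convolution identity ∫ N(θ|µ_1, Σ_1) N(θ|µ_2, Σ_2) dθ = N(µ_1; µ_2, Σ_1 + Σ_2).

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_gaussian_product_integral(mus, Sigmas):
    """Log of the integral over R^p of prod_s N(theta | mu_s, Sigma_s)."""
    S, p = len(mus), len(mus[0])
    precs = [np.linalg.inv(Sig) for Sig in Sigmas]
    Prec = sum(precs)                       # combined precision matrix
    Sigma = np.linalg.inv(Prec)
    mu = Sigma @ sum(P @ m for P, m in zip(precs, mus))
    quad = sum(m @ P @ m for P, m in zip(precs, mus)) - mu @ Prec @ mu
    logdets = sum(np.linalg.slogdet(Sig)[1] for Sig in Sigmas)
    return (0.5*(1 - S)*p*np.log(2*np.pi)
            + 0.5*np.linalg.slogdet(Sigma)[1] - 0.5*logdets - 0.5*quad)

# sanity check for S = 2: the integral equals N(mu1; mu2, Sigma1 + Sigma2)
mu1, mu2 = np.array([0.3, -0.1]), np.array([0.5, 0.2])
S1 = np.array([[0.5, 0.1], [0.1, 0.4]])
S2 = np.array([[0.3, -0.05], [-0.05, 0.6]])
lhs = log_gaussian_product_integral([mu1, mu2], [S1, S2])
rhs = multivariate_normal(mu2, S1 + S2).logpdf(mu1)
```

Note that the matrix inversions and determinants drive the O(Sp^3) cost mentioned below.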

Algorithm 2 Distributed model evidence computation using normal approximations
Input: data y, number of chunks S, likelihood p(·|θ), prior p(θ).
Split: Divide the data into S chunks y_1, · · · , y_S.
Apply in parallel:
for s = 1 to S do
  Sample θ_s^1, · · · , θ_s^N ∼ p̃(θ|y_s).
  Calculate and store p̃(y_s) = ∫ p(y_s|θ)p̃(θ) dθ and µ_s, Σ_s.
end for
Combine: Estimate the marginal likelihood via (5), using α, the p̃(y_s) and the normal approximation (10) to I_sub.

The normal approximations to the subposteriors, p̃(θ|y_s) ≈ N(θ|µ_s, Σ_s), can be justified under the usual conditions for a Bernstein-von-Mises theorem to hold, see e.g. Ghosh and Ramamoorthi (2003). Due to the involved matrix inversions and calculations of determinants, the complexity of calculating (10) is O(Sp^3). We approximate the normalising constants p̃(y_s) using any of the techniques discussed at the beginning of Section 3.1. Note that the normal approximation is only required for estimating I_sub and does not influence the estimation of p̃(y_s). We hence suggest Algorithm 2 for the estimation of the marginal likelihood.
Using mixtures of normals If the posterior distribution is multimodal, a simple normal approximation will result in a poor approximation of the integral of the product of the subposteriors. As noted by Neiswanger et al. (2013); Scott et al. (2017), a mixture of normals or kernel density estimators can be used to approximate the posterior. These approaches have been shown to work well in practice. However, if the number of mixture components or the number of splits S gets large this approach may become prohibitive.
As we have to calculate the product of S mixtures, the resulting product distribution has O(k^S) components, where k denotes the number of mixture components per subposterior. We therefore refrain from this approach.
Error of the approximation Proposition 4 gives an exact representation of the relative error when using the normal approximation (10) to the subposterior integral I_sub.
Proposition 5 provides an error bound for the proposed approximation to the marginal likelihood under assumptions on the quality of the subposterior normal approximations.
Proposition 5. Suppose {µ_s, Σ_s}_{s=1}^S are the selected parameters for the subposterior normal approximations p̃(θ|y_s) ≈ N(θ|µ_s, Σ_s), and that N(θ|µ, Σ) ∝ ∏_{s=1}^S N(θ|µ_s, Σ_s) is the resulting normal approximation to the full posterior p(θ|y). The exact marginal likelihood p(y) and the proposed approximation to the marginal likelihood p̂(y) are, respectively,

p(y) = α^S ∏_{s=1}^S p̃(y_s) ∫_Θ ∏_{s=1}^S p̃(θ|y_s) dθ,    p̂(y) = α^S ∏_{s=1}^S p̃(y_s) ∫ ∏_{s=1}^S N(θ|µ_s, Σ_s) dθ.

Suppose that the subposterior normal approximations satisfy the density ratio bounds

sup_{θ∈Θ} p̃(θ|y_s)/N(θ|µ_s, Σ_s) ≤ A,    sup_{θ∈Θ} N(θ|µ_s, Σ_s)/p̃(θ|y_s) ≤ B,    s = 1, · · · , S.

Then the relative error of the proposed approximation to the marginal likelihood satisfies

−S log B + log E_{N(θ;µ,Σ)}[1(θ ∈ Θ)] ≤ log p(y)/p̂(y) ≤ S log A + log E_{N(θ;µ,Σ)}[1(θ ∈ Θ)].

Proof. See Appendix.
If the parameter space is unconstrained, so Θ = R^{dim(θ)}, then E_{N(θ;µ,Σ)}[1(θ ∈ Θ)] = 1 and the bound can be simplified to −S log B ≤ log p(y)/p̂(y) ≤ S log A. The supremum of the density ratio is an interesting divergence measure for two probability distributions, with important connections to the total variation distance (Dümbgen et al., 2021). Under the conditions of the Bernstein-von-Mises theorem, the quality of each subposterior normal approximation is expected to improve as the shard size increases. With a fixed number of shards S, A and B will tend to 1 as n increases if {µ_s, Σ_s}_{s=1}^S are taken to be the true subposterior means and variances. Corollary 1 provides an error bound under an assumption about the average accuracy of the subposterior normal approximations over the S subsets, rather than the worst case accuracy over the subsets, as in Proposition 5.
Corollary 1. Fix the shard size n_s, and let the total sample size be given by n = Sn_s. Let p(y) represent the true marginal likelihood and p̂(y) denote the proposed approximation as in Proposition 5. Assume that the average accuracy of the subposterior normal approximations remains controlled as the number of shards S increases, where {y_s, µ_s, Σ_s}_{s=1}^S are treated as random variables. Then the relative error satisfies a bound that depends on the average, rather than the worst case, approximation error over the subsets.

Proof. See Appendix.
Proposition 5 and Corollary 1 suggest a trade-off between the computational gains from the proposed distributed strategy, which increase with the number of splits S, and the resulting approximation error.
In practice we transform (5) to the log domain and get

log p(y) = S log α + ∑_{s=1}^S log p̃(y_s) + log I_sub.

The corresponding estimator is a sum of S + 2 terms, so the variance of this quantity grows linearly in S. Moreover, when estimating log p̃(y_s) by N samples from a Markov chain the MSE is typically of order O(1/N). Thus, more simulations can reduce this error. In summary, our results suggest that if the error from MCMC sampling is relatively small and the shard sizes are large enough for the subposterior normal approximations to be of reasonable quality, our suggested approach will result in good approximations of the full data set marginal likelihood.

Model choice using reversible jump
Calculation of the marginal likelihood for every model in the candidate set is only feasible when there is a small number of models under consideration. In many practical settings, the aim is to choose among a large number of models that potentially have different support. An important example is variable selection, where a particular model consists of a specific combination of selected variables.
The reversible jump approach (Green, 1995) allows the construction of a Markov chain that jointly traverses models and the associated parameter spaces. The posterior probability of a model given the data, p(m_i|y), see (1), is estimated by the relative time the sampler spends exploring that model.

Distributed RJMCMC
The modified product-of-experts belief aggregation rule for models (6) illuminates how one may conduct distributed Bayesian model selection using reversible jump methodology. During the apply stage, it is sufficient for workers to compute subposterior model probabilities p̃(m_i|y_s) ∝ p̃(y_s|m_i)p(m_i)^{1/S} and model subposteriors p̃(θ|y_s, m_i). The integral terms ∫_Θ ∏_{s=1}^S p̃(θ|y_s, m_i) dθ can be evaluated in the combine stage, and the terms α_i^S can be determined in the split stage. The belief aggregation rule (6) can then be used to reconstruct the full posterior model probabilities. Once again, a normal approximation to each model subposterior, p̃(θ|y_s, m_i) ≈ N(θ|µ_s^i, Σ_s^i), can be used to evaluate the integral terms. Reversible jump is particularly well suited to the decomposition (6) as the sampler simultaneously explores models and the associated parameter spaces. As such, estimates of p̃(m_i|y_s), µ_s^i and Σ_s^i will be readily available from the RJMCMC output of each worker. Consequently, the splitting approach can effectively be combined with a reversible jump algorithm, as we suggest in Algorithm 3. See also the appendix for more details.

Algorithm 3 RJMCMC distributed model choice
Input: data y, number of chunks S, set of models m_k for k = 1, · · · , K, likelihoods p(·|θ, m_k), priors p(θ|m_k), p(m_k).
Split: Divide the data into S chunks y_1, · · · , y_S.
Apply in parallel:
for s = 1 to S do
  Sample θ_s^1, · · · , θ_s^N | m_k ∼ p̃(θ|y_s, m_k) using RJMCMC over all k.
  Calculate and store p̃(m_k|y_s), µ_s^k, Σ_s^k for models k = 1, . . . , K.
end for
Combine: Estimate the posterior model probabilities p̂(m_k|y) via (6). Bayes factors can be estimated as p̂(m_k|y)p(m_k′)/{p̂(m_k′|y)p(m_k)} for any two models (k, k′).
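The combine step of Algorithm 3 can be sketched for a scalar parameter. This is our own illustration with hypothetical names (`combine_model_posteriors`, `worker_out`): each worker reports, per model, its subposterior model probability and a normal approximation (mean and variance) of the model subposterior, and the central node applies rule (6) using the one-dimensional version of the Gaussian product integral (10).

```python
import numpy as np

def combine_model_posteriors(worker_out, log_alpha):
    """Combine step of the distributed RJ approach for a scalar parameter.

    worker_out[s][k] = (prob, mu, var): subposterior model probability and
    the normal approximation of the model-k subposterior on shard s.
    log_alpha[k] = log of the integral of p(theta|m_k)^(1/S).
    """
    S = len(worker_out)
    log_post = {}
    for k in worker_out[0]:
        probs = np.array([worker_out[s][k][0] for s in range(S)])
        mus = np.array([worker_out[s][k][1] for s in range(S)])
        vs = np.array([worker_out[s][k][2] for s in range(S)])
        # log integral of the product of 1-d normal subposteriors, cf. (10)
        prec = 1.0/vs
        v = 1.0/prec.sum()
        mu = v*np.sum(prec*mus)
        log_I = (0.5*(1 - S)*np.log(2*np.pi) + 0.5*np.log(v)
                 + 0.5*np.sum(np.log(prec))
                 - 0.5*(np.sum(prec*mus**2) - mu**2/v))
        # modified product-of-experts rule (6)
        log_post[k] = S*log_alpha[k] + np.sum(np.log(probs)) + log_I
    logZ = np.logaddexp.reduce(list(log_post.values()))
    return {k: np.exp(lp - logZ) for k, lp in log_post.items()}

# illustrative inputs for two shards and two models "A" and "B"
workers = [{"A": (0.7, 0.0, 0.1), "B": (0.3, 0.0, 0.1)},
           {"A": (0.6, 0.0, 0.1), "B": (0.4, 0.0, 0.1)}]
post = combine_model_posteriors(workers, {"A": 0.0, "B": 0.0})
```

With identical normal approximations and equal α terms across models, the rule reduces to multiplying the shard-level model probabilities and renormalising.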
There are a number of known issues with RJMCMC that may limit its effectiveness within a divide-and-conquer approach. Reversible jump samplers are known to be hard to tune and often slow to converge. The construction of suitable proposal distributions can also be very challenging when models are not nested. Moreover, the models of interest must have been visited a sufficient number of times to obtain reliable estimates. All of these combined make the use of RJMCMC burdensome. With regard to distributed computation, an additional issue is that if the space of potential models is large, not all models of interest might be explored on every data shard.

Experiments
In our experiment section we investigate the following questions. (a) Are naive voting strategies, upsampled likelihoods and CMC based importance sampling an alternative to our approach? (b) How do the approach based on data augmentation and the approach based on normal approximations perform relative to each other? (c) What is the magnitude of the error introduced by approximating I_sub compared to the overall scale of the log marginal likelihood? (d) How does the error of I_sub behave with respect to the number of splits? (e) What are the practical gains from using the distributed approach on very large data sets? (f) How reliable is the distributed approach for RJMCMC? We answer these questions by assessing our approach with a Bayesian logistic regression, a normal linear regression and a linear regression with Laplace priors.
Our overall experimental setup is the following: prior variances on the model parameters are set to 1 and their means to 0. We use a randomised splitting procedure. This means that for S splits we will have roughly n/S samples per split, where we use a uniform sampling scheme without replacement, if not otherwise stated. We run every sampler 20 times, where at every iteration the splitting and the Monte Carlo sampler are initialised with a different seed. Thus, the observed variation in the outcome is a combination of the variation through splitting (i.e. different partitions) and Monte Carlo sampling. The number of generated MCMC samples per chain is 10,000, where the first 2,000 samples are discarded as burn-in. We assess the error of the estimation of the log marginal likelihood by comparing it to the result of the estimation on the whole data set, whenever this is computationally feasible in a reasonable amount of time.
We use the RMSE (root mean squared error), defined as

√MSE = √( E[ (log p(y) − log p̂(y))^2 ] ),

where p̂(y) is approximated using our suggested decomposition. The relative RMSE is defined as

%√MSE = √( E[ (log p(y) − log p̂(y))^2 ] ) / |log p(y)| × 100.

The use of the relative RMSE is justified by the scale of the actual quantity we are trying to estimate.
Practical considerations We implement our algorithm using two generic R packages, namely rstan (Carpenter et al., 2017) and the bridgesampling package (Gronau et al., 2020). rstan allows convenient sampling from the posterior distribution of a model using HMC. The package can handle a variety of different models and the user only has to provide a simple code that describes the model. Tuning of hyperparameters and convergence checking is handled automatically. The bridgesampling package can use rstan models to calculate an approximation of the model evidence. For the reversible jump illustration we use the R2BGLiMS package. This package performs model choice using reversible jump MCMC on logistic, normal and Weibull regression models. We explicitly run the experiments on single core architectures to illustrate the advantages of distributing the computation over a large number of small workers. Code for reproducing the results is available through the first author's github repository.

Figure 1: In Figure 1a the local marginal likelihood of the correct model 6 is compared with the other local marginal likelihoods. Based on a majority vote it is decided whether model 6 wins; the win rate of model 6 is shown on the y-axis. In Figure 1b we use a Borda voting scheme to aggregate the order of local marginal likelihoods into a global model ordering. We display the average Borda count, and hence the correct model 6 must obtain the highest average count to win. As the number of splits increases (x-axis), the correct model 6 is no longer chosen, neither under the majority voting scheme nor under the Borda count.

Experiment 1: Voting schemes based on local marginal likelihoods and CMC IS
In this section we illustrate potential issues of natural alternatives to our approach based on the decomposition in (5). Although this comparison is limited in scope, we believe it motivates and justifies the in-depth study of Algorithm 2. This experiment is based on six different Gaussian regression models with a log-normal prior on the variance, which makes the model non-conjugate. We simulate a dataset with 10,000 observations and induce high correlation among the features. The correct model (6) uses all 17 covariates, whereas the other models each omit one relevant variable (see also the Appendix for further details).

Majority voting and Borda counts
To motivate the need for a coherent way of combining local evidences, we illustrate what can go wrong when using a naive approach for combining local inference. We try to identify the best model based on the local marginal likelihoods $\hat p(y_s|m_k)$ for the models $m_k$. We compare two different voting schemes that aggregate local rankings. First, we use a majority voting scheme, where each worker returns the winning model based on local marginal likelihoods, i.e., $\mathrm{winner}_s = \arg\max_{m_k} \hat p(y_s|m_k)$ for all $s$. The central node then aggregates the vote share of model 6 (i.e., on how many workers the correct model won): $\mathrm{win\ rate} = \sum_{s=1}^S \mathbb{1}\{\mathrm{winner}_s = 6\}/S$. As a second voting scheme we use Borda counts (Emerson, 2013). On each worker the models receive a score based on their local ranking of $\hat p(y_s|m_k)$, going from 1 (worst) to 6 (best), and the winning model is the one with the highest average Borda score $\sum_{s=1}^S \mathrm{Borda\ score}_{s,m_k}/S$. This voting scheme favours a consensus vote over a majority based decision.
Figure 2b (right): CMC IS stays competitive as the number of splits increases and seems more stable than our approximate method (approx) based on Algorithm 2. However, an additional round of communication between the central node and the distributed workers is required.
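Both aggregation rules are easy to state in code. The following sketch (our own illustration, operating on a hypothetical (S, K) array of local log marginal likelihoods) computes the majority win rate and the average Borda counts:

```python
import numpy as np

def majority_win_rate(local_log_ml, model):
    """Fraction of shards on which `model` attains the highest local evidence.
    local_log_ml: array of shape (S, K), one row per shard."""
    winners = np.argmax(local_log_ml, axis=1)
    return float(np.mean(winners == model))

def average_borda_counts(local_log_ml):
    """On each shard the worst model scores 1 and the best scores K;
    the scores are then averaged over the S shards."""
    ranks = np.argsort(np.argsort(local_log_ml, axis=1), axis=1) + 1
    return ranks.mean(axis=0)
```

The consensus winner under the Borda scheme is then `np.argmax(average_borda_counts(local_log_ml))`.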
As illustrated in Figure 1, the voting schemes work reasonably well for a small number of splits. But with as few as 10 splits (every worker sees 10% of the data), the correct model is no longer chosen, neither by the majority based voting nor by the Borda counts. Therefore, we do not compare our suggested method with these naive aggregations in the remainder of our experiments.
Mean and median of upsampled normalising constants We use the same setting as above and compare two additional methods with our suggested approach from Algorithm 2. We estimate locally the marginal likelihood of what we call an upsampled model, following the idea in Zhang et al. (2018), i.e., we replicate locally the data of each shard to match the full data set size. The upsampled marginal likelihood is computed using the replicated data as $\acute p(y_s) = \int_\Theta p(y_s|\theta)^S p(\theta)\,d\theta$.
A final estimator of the marginal likelihood is obtained by computing the mean or the median of the local upsampled marginal likelihoods on the central node: $\hat p(y) = \mathrm{mean}(\{\acute p(y_1), \ldots, \acute p(y_S)\})$ or $\hat p(y) = \mathrm{med}(\{\acute p(y_1), \ldots, \acute p(y_S)\})$. We illustrate the results of this approach in Figure 2a. Using medians or means of upsampled local marginal likelihoods becomes unstable when reaching 50 shards, compared to our approach in Algorithm 2. We therefore do not consider this method in the rest of our experiments.
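For numerical stability this combination is best carried out on the log scale; a minimal sketch (our notation, not the reference implementation) is:

```python
import numpy as np

def combine_upsampled(log_ml_upsampled, method="median"):
    """Combine per-shard upsampled log evidences log p(y_s).
    'mean' returns the log of the arithmetic mean of the evidences,
    computed stably via log-sum-exp; 'median' uses the fact that the
    log of the median equals the median of the logs."""
    x = np.asarray(log_ml_upsampled, dtype=float)
    if method == "mean":
        m = x.max()
        return m + np.log(np.mean(np.exp(x - m)))
    return float(np.median(x))
```

Shifting by the maximum before exponentiating avoids overflow, which matters since log evidences on large shards are far from zero.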
Consensus Monte Carlo based importance sampling As a final point of comparison, we suggest a straightforward extension of CMC to the computation of the normalising constant, based on importance sampling with the consensus normal distribution. This approach rests on an importance sampling identity that can also be derived from Proposition 4,
where we sample $\theta_i \sim N(\theta|\mu, \Sigma) \propto \prod_{s=1}^S N(\theta|\mu_s, \Sigma_s)$ for $N$ samples, which correctly normalises the importance sampling estimator in (11). Thus, we construct an approximation of the whole posterior, but we only need to evaluate the ratios of the local subposteriors and the local normal approximations $w_s(\theta_i)$. This approach adds an additional round of communication with the central node: $\theta_i \sim N(\theta|\mu, \Sigma)$ is generated on the central node, the weights $w_s(\theta_i)$ are computed locally and then sent back to the central node to compute the aggregation. This IS approximation is exact, but can suffer from high variance if the local normal approximation $N(\theta|\mu_s, \Sigma_s)$ and the consensus approximation $N(\theta|\mu, \Sigma)$ have little overlap. The product of importance weighting factors $\prod_{s=1}^S w_s(\theta_i)$ potentially introduces further variance as the number of splits $S$ increases (although we did not observe this empirically in Figure 2b).
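To make the communication pattern concrete, the following sketch (our own illustration; `log_subposts` are stand-ins for the local log subposterior densities, which in practice live on the workers) builds the consensus Gaussian as the precision-weighted product of the local normal approximations, and evaluates the product of importance weights at a draw from it:

```python
import numpy as np

def consensus_gaussian(mus, Sigmas):
    """N(mu, Sigma) proportional to prod_s N(mu_s, Sigma_s):
    Sigma^{-1} = sum_s Sigma_s^{-1}, mu = Sigma sum_s Sigma_s^{-1} mu_s."""
    precisions = [np.linalg.inv(S) for S in Sigmas]
    Sigma = np.linalg.inv(np.sum(precisions, axis=0))
    mu = Sigma @ np.sum([P @ m for P, m in zip(precisions, mus)], axis=0)
    return mu, Sigma

def log_normal_pdf(theta, mu, Sigma):
    d, diff = len(mu), theta - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def log_product_weight(theta, log_subposts, mus, Sigmas):
    """log prod_s w_s(theta); each worker contributes one term locally."""
    return sum(lp(theta) - log_normal_pdf(theta, m, S)
               for lp, m, S in zip(log_subposts, mus, Sigmas))
```

Only the moments (mu_s, Sigma_s) and, per draw, one weight per worker need to be communicated; the raw data never leave the workers.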
We illustrate the result of this method in Figure 2b. Using an IS approximation based on CMC with 1000 importance samples performs reasonably well in this setting. A more detailed comparison of the CMC IS approximation for computing the marginal likelihood is out of scope for this work.
Figure 3: Comparison of the calculated normalising constant (y-axis) for a logistic regression on the flights data. As the number of splits increases (x-axis), the estimates become unreliable for the conditionally conjugate approach in Algorithm 1 (denoted by random cond, middle boxes), even when using stratification (stratified cond, right boxes) for the sampling, whilst the approximate method (random approx, left boxes, Algorithm 2) is stable and accurate. The average number of observations per split is indicated in parentheses. The reference value is indicated as a horizontal line.

Experiment 2: Comparison of the conditional and approximate method
In this experiment we are interested in predicting on-time arrival of airplanes, using a Bayesian logistic regression model with 17 features (see the appendix for more details and a comparison with another model). There are in total n = 327,346 observations. Figure 3 shows the results of our simulations. The reference value for the model evidence has been obtained via importance sampling with a Laplace approximation based on the maximum a posteriori estimate. In the current setting the bias is less than −0.5%, even for as many as 50 splits, both for the conditional approach that uses data augmentation (Algorithm 1) and the approximate method based on normal approximations (Algorithm 2). However, we notice a high variance for the conditional approach, as predicted by Proposition 3. This clearly illustrates a weakness of the conditionally conjugate approach, and we observed this behaviour on different data sets (not shown here).
A remedy for the high variance is the use of stratification to construct more homogeneous data shards and thereby improve the performance of the conditional approach. We performed k-means clustering of the features with 10 clusters using the full data set. We then stratify the observations using the outcome and the cluster membership. A similar approach was used in Zhao and Zhang (2014) to diversify sampling for mini batches in stochastic gradient optimisation. The motivation is to obtain representative samples of the entire data set with every cluster being represented. Although often feasible in practice, this approach goes against the idea of distributed computation, as all data has to be seen at once to construct a stratification. The improvement for the conditional approach that comes from stratification is rather limited, as shown in Figure 3. In the remaining experiments we consider only the approximate method described in Section 4.2 and no stratification.
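The stratified splitting step can be sketched as follows (our own illustration; the stratum labels would come from crossing the outcome with the k-means cluster membership):

```python
import numpy as np

def stratified_splits(strata, n_splits, seed=0):
    """Assign each observation to one of `n_splits` shards so that every
    stratum is spread as evenly as possible across the shards."""
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    shard = np.empty(len(strata), dtype=int)
    for g in np.unique(strata):
        idx = np.flatnonzero(strata == g)
        rng.shuffle(idx)
        # deal the stratum's observations round-robin over the shards
        shard[idx] = np.arange(len(idx)) % n_splits
    return shard
```

The round-robin assignment guarantees that shard sizes within a stratum differ by at most one observation.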

Experiment 3: Assessment of the approximate method in a toy example
In this experiment we investigate the behaviour of Algorithm 2 in the same Gaussian toy model as in the first experiment from Section 5.1. We compare the performance for estimating the marginal likelihood over up to 50 splits of 10,000 observations. The error introduced via the approximation in (10) of the subposterior amounts to less than 0.02% (see Appendix), and despite a small downward bias in estimating log p(y), the resulting BFs stay stable over decreasing subset sizes (Figure 4), resulting in a consistent choice of the correct model (6). See also the comparison of log marginal likelihoods in the appendix, which draws a similar picture.

Experiment 4: The approximate method on a very large data set
This experiment is based on a very large data set from particle physics, where a binary classification problem consists in predicting the presence of a Higgs boson. The data set contains 11 million observations. Our aim is to understand whether model (1) with 21 low-level features or model (2) with 7 high-level features is more likely a posteriori. Running a full Monte Carlo simulation on the whole data set for model (1) leads to excessive computation times: a full run would take more than 450 hours (almost 3 weeks) on a single core CPU. We assess how the model evidence changes as splits get small by dividing the data set in shards of 1%, 0.2% and 0.1%. Thereby we bring the computation time down to less than 5 hours, 1 hour and less than 30 minutes respectively, all running on different single core CPUs.
Figure 5: Comparison of the calculated normalising constant using Algorithm 2 (y-axis) for the Higgs data set. The left plot shows the evidence for both models (model 1 left boxes, model 2 right boxes) on the same scale. The middle plot corresponds to model (1), the right plot corresponds to model (2), using their respective scales only. The average number of observations per split is given in parentheses (x-axis). We observe a small downward bias as the number of splits increases.
The combination step on the central node is in the order of a few seconds, as we have to perform $O(Sp^3)$ operations to calculate $I_{\mathrm{sub}}$. (The necessary matrix inversions can be pre-computed locally before sending them to the central node.) We show the results of our estimation in Figure 5. Some bias is introduced by the splitting as the number of shards grows, as illustrated by the right hand side of Figure 5. However, the bias is overall small and not visible when comparing both models on the same scale (left side of Figure 5). Consequently, we would clearly choose model (1).
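The $O(Sp^3)$ combination can be sketched concretely: under the normal approximation of (10), $I_{\mathrm{sub}} \approx \int \prod_s N(\theta|\mu_s,\Sigma_s)\,d\theta$ has a closed form that we accumulate by combining the Gaussians pairwise (a sketch under this assumption, not the exact implementation used in the paper):

```python
import numpy as np

def log_normal_pdf(x, mu, Sigma):
    d, diff = len(mu), x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def log_I_sub_normal(mus, Sigmas):
    """log of the integral of prod_s N(theta|mu_s, Sigma_s) over theta,
    via N(x|a,A) N(x|b,B) = N(a|b, A+B) N(x|c,C), applied sequentially."""
    mu, Sigma = mus[0], Sigmas[0]
    log_I = 0.0
    for m, S in zip(mus[1:], Sigmas[1:]):
        log_I += log_normal_pdf(mu, m, Sigma + S)
        P1, P2 = np.linalg.inv(Sigma), np.linalg.inv(S)  # O(p^3) per split
        Sigma = np.linalg.inv(P1 + P2)
        mu = Sigma @ (P1 @ mu + P2 @ m)
    return log_I
```

Each of the S-1 combination steps costs $O(p^3)$ for the matrix inversions, matching the stated overall cost.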
In essence our experiment on the Higgs data set illustrates the necessity for distributed computation in the large data regime. Running the same experiment on the entire data set is too slow for most applications.

Experiment 5: Sparse regression on a large genetic data set
We compare the performance of a linear regression model with a Laplace prior on a real genetic data set from the UK Biobank database. There are n = 132,353 observations available. We consider model 1 with 50 and model 2 with 100 genetic variants in the human leukocyte antigen (HLA) region of chromosome 6, which are included as features in order to predict mean red cell volume (MCV). Due to the Laplace prior the conditionally conjugate approach is not applicable. We again face a situation where sampling the posterior on the entire data set on a single core CPU is estimated to take more than 200 hours for model 2. Using 20, 50 and 100 splits brings this computation time down to 10 hours, 2 hours and 1 hour respectively.
In order to assess the bias properly, we run our approximate method on a 10% subset of the data, where it is computationally feasible to analyse the whole subset and derive a reference value for the normalising constant. We see on the left side of Figure 6 that a downward bias is present, but that we would consistently choose the right model across subsets for the 10% subset of the full data set. When we use the full data set with 20 to 100 splits, the right model is still chosen correctly within a given partition (see the right side of the same figure). We also see that model comparison becomes meaningless if different partitions are used to compare the model evidences.
Figure 6: Comparison of the calculated normalising constant using Algorithm 2 (y-axis) for a linear regression using a sparsity enforcing prior. Model 1 (left boxes) has 50 features, model 2 (right boxes) has 100 features. Left plot: estimated model evidence for the model run on a 10% subset. Right plot: estimated model evidence using the full data set, starting with 20 splits. The average number of observations per split is given in parentheses (x-axis). We observe a downward bias as the number of splits increases. For the full data set a comparison across a different number of splits would be meaningless.
As Table 1 illustrates, the error relative to the true value of the normalising constant ($\%\sqrt{\mathrm{MSE}}$) stays small even when using 50 splits, and thus having only 265 observations for the estimation of 100 parameters. As the number of splits increases, the squared bias starts to dominate the error, as illustrated by the ratio $\mathrm{Bias}^2/\mathrm{Var}$ in Table 1. The error relative to the level of the quantity that we are trying to estimate stays rather small.
Table 1: Error metrics of the approximation of the log marginal likelihood for the subset (10%) of the HLA data set with 13,235 observations.
Finally, for this experiment we assess the error of $\int_\Theta \prod_{s=1}^S \hat p(\theta|y_s)\,d\theta$ as the number of splits grows, while the total data set size of 13,235 stays fixed. We define two measures of this error.
Figure 7: The line corresponds to a linear regression fitted to the data in order to illustrate the trend. We run every simulation 6 times in order to get repeated measurements. The sampler is run over the range of splits [1, 2, 4, 5, 8, 10, 15, 20, 25, 30, 35, 40, 45, 50]. The total number of observations is fixed to 13,235, so the number of observations per split varies accordingly with the number of splits.
The error $\epsilon_{1,S}$ (referred to as relative log error) corresponds to the log of the error predicted by Proposition 5 and is expected to grow linearly in $S$. The error $\epsilon_{2,S}$ (referred to as log error of the difference) is a measure of the absolute error. The leftmost plot in Figure 7 illustrates that $\epsilon_{1,S}$ indeed seems to grow linearly with $S$. The error $\epsilon_{2,S}$ seems to grow more like $\log S$, as depicted in the rightmost plot in Figure 7.

Experiment 6: Distributed RJMCMC
Finally, we investigate the use of our splitting approach for a vanilla model selection in a RJMCMC setting. Our simulation should be seen as a proof of concept as RJMCMC faces numerous issues that make exact inference dependent on various tuning parameters that go beyond the scope of this paper.
For this purpose we simulate a toy data set of size n = 4,000 with a binary outcome, where the five features exhibit a high correlation of 0.9. The data is generated by mixing two data sets. Thus, we artificially generate a setting where it is not clear whether to include the third variable. See Table 2 as well as the appendix for more details. The comparison of the different models is shown in Figure 8. As the number of splits increases, the estimates of the Bayes factors deteriorate. In the current setting, going beyond 3 splits may lead to misleading results due to the high variance of the estimates, as our experiments show. This high variance occurs when combining the results back together and has several causes. RJMCMC samplers take a long time to mix, and less likely models are potentially not explored enough. Therefore both the estimates based on the MCMC samples of the chain and the sojourn times suffer from high variance, which accumulates when combining the results from several splits. Moreover, there is no guarantee that all models of interest are explored on all data shards if the shards are too small. In this situation it is not possible to combine the results from several shards back together.
For our experiments we decided to use a medium sized data set and only a small number of features, thereby making exploration of the relevant models more likely. Potential remedies for the evoked problems are an improved estimation of the BF using the method presented in Bartolucci et al. (2006) or the construction of more homogeneous splits using stratification. In any case, whilst theoretically possible, distributed model evidence will rely on well mixing, efficient RJMCMC in each split.

Guidelines for practical application
When using our approach in practice we recommend at least a few thousand observations per data shard for a normal approximation to be reasonable. At this stage we suggest avoiding high dimensional settings where the number of parameters exceeds the number of data points; see also Kass and Raftery (1995), where at least 5 observations per dimension are recommended. If the number of observations per shard is too small, the normal approximation becomes unreliable and one risks facing a large downward bias when combining the results. For complex models the bias grows faster as we split the data into smaller shards. It can be helpful to evaluate the variance of summary statistics across the shards to detect whether the splits are inhomogeneous, and to run the sampler on a different number of splits.
We recommend sufficiently long Markov chains (see our default settings for the experiments) and the use of convergence diagnostics to make sure that the posterior has been explored sufficiently and that estimated posterior moments are reliable. In particular, we often face the challenge of finding a balance between (a) making sure the approximations are precise enough and (b) limiting computation time. In practice, the bias in the estimation is often smaller than the variation of the estimators, so consistent model choice between competing models is possible. However, care is needed if competing models are similar.

Discussion and Conclusion
We have presented an approach to calculate the normalising constant in a distributed fashion to enable Bayesian model choice with large data sets. We are able to effectively divide the computation time by several orders of magnitude by splitting the data over a large number of workers and limiting communication between workers. We have shown overall good numerical results and explained the theoretical underpinning of our approach. There remain open questions. Although the estimation of the subposterior normalising constants is biased in general, this bias seems worth accepting in practice. It would be interesting to link this bias of $I_{\mathrm{sub}}$ to the way the data is split and to the characteristics of the model such as, e.g., its dimension. Proposition 4 suggests possible refinements of the normal approximation to $I_{\mathrm{sub}}$ in (10). In principle, the correction factor could be estimated by first approximating the density ratio terms $p(\theta|y_s)/N(\theta|\mu_s, \Sigma_s)$, and then using a simple Monte Carlo average for the expectation over the global normal approximation $N(\theta|\mu, \Sigma) \propto \prod_{s=1}^S N(\theta|\mu_s, \Sigma_s)$, as we suggest for the IS CMC estimator. Subposterior samples could be used to construct a kernel density estimate of $p(\theta|y_s)$ (Neiswanger et al., 2013). The density ratio could also be estimated using various nonparametric techniques from the machine-learning literature (Kanamori et al., 2012). For large values of S, a Laplace approximation to the expectation could also be accurate enough for practical purposes (Tierney and Kadane, 1986). Estimation of the correction term (12) will introduce some additional variance; however, this may be compensated for by a reduction in the bias. A detailed investigation of the costs and benefits of these approaches is an avenue for future research.
If we want to achieve truly parallel Bayesian computation, we must be able to both split the data and run short Markov chains without burn-in bias (Jacob et al., 2020). Based on this idea, an unbiased estimation of the normalising constant via bridge sampling (Rischard et al., 2018) could be combined with our method to improve scalability. Another interesting avenue for future research would be the use of variational inference for distributed Bayesian model choice, building on the work of Rabinovich et al. (2015) and Nowozin (2018). In practical settings shotgun stochastic search (SSS) (Hans et al., 2007) could be applicable using our decomposition, as SSS relies on normal approximations to the posterior to quickly explore different models. We also suggest generalising our approach to settings where data subsets are not i.i.d. and of varying size, and investigating applications to federated learning (Li et al., 2020) in combination with posterior approximations for neural networks (Immer et al., 2021). We think that distributed Bayesian computation merits further theoretical and practical investigation as an alternative to the mini batch paradigm.
Interestingly, the idea of splitting the data and running a sampler on the shards is also applicable in this setting. By using the decomposition in (5) we obtain an expression in which $\hat p(m_1|y_s)/\hat p(m_2|y_s)$ is obtained as the posterior odds ratio of the RJMCMC sampler of data subset s.
The quantity $\hat p(m_i|y)$ is typically available as the output of the sampler. The normalising constant of the subprior $\alpha_i$ for model $i$ is available for common priors. We can again use a normal approximation to $\hat p(\theta|y_s, m_i)$ using the samples generated from the sampler. Consequently, the splitting approach can effectively be combined with a reversible jump algorithm, as we suggest in Algorithm 3.

Appendix B: Additional details on the models and simulations
B.1 Logistic regression models
The flights data This data set is available through the nycflights13 3 R package. It contains airline on-time data for all flights departing NYC in 2013. We create a binary indicator for arrival delay if the flight arrived at least 1 minute late. As explanatory variables we use the departure delay from the departing airport and which airline the plane belongs to. After removing missing values we get a data set of n = 327,346 observations. We consider two different models to explain the outcome y: (1) a model with a dummy variable per carrier and the departing delay in minutes. This yields 17 different features.
(2) a model with interactions between carrier and departing delay in addition to the other features. This yields a model with 32 different features.
With the approximate method one would clearly favour model (2) over model (1), as its evidence is higher. This is maintained consistently as the number of splits increases, despite a small downward bias, see Figure 9. The conditionally conjugate sampler starts breaking down for the more complex model (2) due to the variance of the estimator of $I_{\mathrm{sub}}$. The estimated value of the model evidence for model (2) falls below the value of model (1) for as few as 10 splits. Therefore a consistent model choice is not possible with the conditional approach in this case.
The Higgs data Another binary regression experiment is based on a data set from particle physics, where the classification problem consists in distinguishing between a signal process which produces Higgs bosons and a background process which does not.
Figure 9: Comparison of the calculated normalising constant for a logistic regression on the flights data. As the number of splits increases, the estimates become unreliable for the conditionally conjugate approach. The left plot compares the conditional and approximate estimation. "1/2" stands for model 1/2 and "a/e" stands for the approximate (a) or conditional method (e). The average number of observations per split is indicated in parentheses.
It is available through the UCI repository 4 . The data set contains 11 million observations and in total 28 features. The first 21 features are kinematic properties measured by the particle detectors in the accelerator. The last 7 features are high level features. Our aim in this task is to understand whether model (1) with the kinematic features or model (2) with the high level features is more likely a posteriori.

B.2 Gaussian toy model
We use a synthetic Gaussian setting (a linear model with 17 covariates and a log-normal prior on the variance term, leading to a non-conjugate model). The true model (#6) contains all features; the other models each omit one relevant variable. High correlation of the features (0.9) makes model choice difficult. The true parameters of the model are simulated from a normal distribution with varying mean and variance. The exact details are available in https://github.com/alexanderbuchholz/distbayesianmc/blob/master/R/f_load_data.R#L61. Figure 10 illustrates that the contribution of the product of normals (left plot) is rather low and the error is reasonably small (right plot). Figure 11 again shows the presence of a downward bias when the number of splits increases, but this bias does not affect the ordering of the estimated marginal likelihoods.

B.3 Sparse linear regression model
We observe a vector of continuous outcomes $y \in \mathbb{R}^n$ depending on some feature matrix $X \in \mathbb{R}^{n \times p}$. We assume that $y_i \sim N(\mu_i, \sigma^2)$, where $N(\cdot,\cdot)$ denotes a Gaussian distribution with mean $\mu_i$ and variance $\sigma^2$, and $\mu_i = x_i^\top \theta$, where $\theta \in \mathbb{R}^p$ is the unknown parameter vector endowed with the prior $\theta \sim \mathcal{L}(0_p, \sigma_0 I_p)$, where $\mathcal{L}(\cdot,\cdot)$ is a Laplace (double exponential) prior.
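The unnormalised log posterior of this model, which the sampler targets on each shard, can be sketched as follows (a minimal illustration in our notation; the experiments themselves use rstan):

```python
import numpy as np

def log_posterior_unnorm(theta, X, y, sigma2=1.0, sigma0=1.0):
    """Gaussian likelihood y_i ~ N(x_i' theta, sigma2) combined with
    independent Laplace(0, sigma0) priors on the coefficients."""
    resid = y - X @ theta
    log_lik = (-0.5 * len(y) * np.log(2 * np.pi * sigma2)
               - resid @ resid / (2 * sigma2))
    # Laplace log density: -log(2 b) - |theta_j| / b, summed over j
    log_prior = (-len(theta) * np.log(2 * sigma0)
                 - np.abs(theta).sum() / sigma0)
    return log_lik + log_prior
```

The non-smooth absolute value in the prior is what breaks conditional conjugacy for this model.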
The HLA data set For our fourth experiment we compare the performance of a linear regression model with a Laplace prior on a real genetic data set from the UK Biobank database. The selected outcome variable is mean red cell volume (MCV), taken from the full blood count assay and adjusted for various technical and environmental covariates. Genome-wide imputed genotype data in expected allele dose format are available for n = 132,353 study subjects. We consider 50 and 100 genetic variants in the human leukocyte antigen (HLA) region of chromosome 6, selected so that the allelic scores have the highest absolute correlation with the outcome. The region was chosen because many associations were discovered in a genome-wide scan using univariate models (Astle et al., 2016).

B.4 Logistic regression model in the RJMCMC setting
We observe a vector of binary outcomes $y \in \{0,1\}^n$ depending on some feature matrix $X \in \mathbb{R}^{n \times p}$. We assume that $y_i \sim B(p_i)$, where $B(\cdot)$ denotes a Bernoulli distribution and $p_i$ is the probability of observing $y_i = 1$, linked to the linear predictor via the logistic transform; $\theta \in \mathbb{R}^p$ is the unknown parameter vector endowed with a multivariate Gaussian prior $\theta \sim N(0_p, \sigma^2 I_p)$.
Figure 11: log marginal likelihood of the 6 different models over an increasing number of splits. A slight downward bias is present. However, the bias does not change the ordering of the log marginal likelihoods.
We generate simulated data as follows: we generate a highly correlated feature matrix X, split the generated data in two and calculate $\mu_1 = X_1\theta_1$ and $\mu_2 = X_2\theta_2$, where $\theta_1 = [-1, 1, 0, 0, 1]$ and $\theta_2 = [-1, 1, 0.01, 0, 1]$. Thereby effectively half of the observations are better explained by including the third feature.
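A sketch of this data-generating process (our own Python rendering of the setup; the experiments themselves run in R, and the equicorrelated covariance is our assumption for "high correlation of 0.9") is:

```python
import numpy as np

def simulate_mixed_logistic(n=4000, rho=0.9, seed=1):
    """Binary outcomes from two halves of the data: theta1 excludes the
    third feature, theta2 includes it with a tiny coefficient (0.01)."""
    rng = np.random.default_rng(seed)
    p = 5
    # equicorrelated features with pairwise correlation rho
    Sigma = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    theta1 = np.array([-1.0, 1.0, 0.0, 0.0, 1.0])
    theta2 = np.array([-1.0, 1.0, 0.01, 0.0, 1.0])
    mu = np.concatenate([X[: n // 2] @ theta1, X[n // 2:] @ theta2])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-mu)))
    return X, y
```

Mixing the two halves yields a data set in which the evidence for including the third feature is deliberately ambiguous.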
We run the RJMCMC sampler for 10 million iterations where the first 2 million iterations are discarded as burn-in. We check for proper mixing by examining the trace plots of the sampler.
Plugging this decomposition into the previous equation and rewriting yields $p(\theta|y)\,p(y) = \alpha^S \prod_{s=1}^S \hat p(\theta|y_s)\,\hat p(y_s)$.

C.2 Proof of Proposition 2
Proof. We recall the definition of $I_{\mathrm{sub}}$. Let us now rewrite it in terms of $\int \hat p(\theta|y_s, z_s)\,d\theta$.

C.3 Proof of Proposition 3
Proof. We recall the definition of the effective joint distribution of the latent variables when sampling independently from each subposterior, where we have used $\hat p(\theta|y_s, z_s)\,\hat p(z_s|y_s)\,\hat p(y_s) = p(y_s, z_s|\theta)\,p(\theta)$. Now we use $p(y_{1:S}) = \alpha^S I_{\mathrm{sub}} \prod_{s=1}^S \hat p(y_s)$, yielding $\prod_{s=1}^S \hat p(y_s)/p(y_{1:S}) = \alpha^{-S} I_{\mathrm{sub}}^{-1}$.
Integrating both sides over $\theta$ yields the stated identity. We now use the estimator from (9) and plug it into the previous equation:
A simple identity is obtained by substituting (19) into (21). As $\hat p(\theta|y_s) = 0$ for $\theta \notin \Theta$, the domain of integration can be extended, and we can express the result in terms of an expectation. Substituting the definitions of $I_{\mathrm{sub}}$ and $\hat I_{\mathrm{sub}}$ gives the result presented in Proposition 4.

C.5 Proof of Proposition 5
We assume that the subposterior normal approximations satisfy the density ratio bounds in (24). For $\theta \in \Theta$, it must then hold, making use of (24), that
$$\log \frac{p(y)}{\tilde p(y)} = \log \mathbb{E}_{N(\theta;\mu,\Sigma)}\left[\prod_{s=1}^S \frac{p(\theta \mid y_s)}{N(\theta;\mu_s,\Sigma_s)}\right] = O_p(S),$$
which concludes the proof.