Merging MCMC Subposteriors through Gaussian-Process Approximations

Markov chain Monte Carlo (MCMC) algorithms have become powerful tools for Bayesian inference. However, they do not scale well to large-data problems. Divide-and-conquer strategies, which split the data into batches and, for each batch, run independent MCMC algorithms targeting the corresponding subposterior, can spread the computational burden across a number of separate workers. The challenge with such strategies is in recombining the subposteriors to approximate the full posterior. By creating a Gaussian-process approximation for each log-subposterior density we create a tractable approximation for the full posterior. This approximation is exploited through three methodologies: firstly a Hamiltonian Monte Carlo algorithm targeting the expectation of the posterior density provides a sample from an approximation to the posterior; secondly, evaluating the true posterior at the sampled points leads to an importance sampler that, asymptotically, targets the true posterior expectations; finally, an alternative importance sampler uses the full Gaussian-process distribution of the approximation to the log-posterior density to re-weight any initial sample and provide both an estimate of the posterior expectation and a measure of the uncertainty in it.


Introduction
Markov chain Monte Carlo (MCMC) algorithms are popular tools for sampling from Bayesian posterior distributions in order to estimate posterior expectations. They benefit from theoretical guarantees of asymptotic convergence of the estimators as the number of MCMC samples grows. However, whilst asymptotically exact, they can be computationally expensive when applied to datasets with a large number of observations n. Indeed, the cost of generating one sample from the MCMC algorithm is at best O(n) as the posterior distribution of the model parameters, conditional on the entire data set, must be evaluated at each MCMC iteration. For very large n, therefore, MCMC algorithms can become computationally impractical.
Research in the area of MCMC for big data can be broadly split into two streams: those which utilise one core of the central processing unit (CPU) and those that distribute the work load across multiple cores, or machines. For the single processor case, the computational cost of running MCMC on the full data set may be reduced by using a random subsample of the data at each iteration (Quiroz et al., 2014;Maclaurin and Adams, 2014;Bardenet et al., 2014); however, the mixing of the MCMC chain can suffer as a result. Alternatively, the Metropolis-Hastings acceptance step can be avoided completely by using a stochastic gradient algorithm (Welling and Teh, 2011;Chen et al., 2014), where subsamples of the data are used to calculate unbiased estimates of the gradient of the log-posterior. Consistent estimates of posterior expectations are obtained as the gradient step-sizes decrease to zero (Teh et al., 2014). While popular, subsampling methods do have the drawback that the data must be independent and the whole data set must be readily available at all times, and therefore data cannot be stored across multiple machines.
Modern computer architectures readily utilise multiple cores of the CPU for computation, but MCMC algorithms are inherently serial in implementation. Parallel MCMC, where multiple MCMC chains, each targeting the full posterior, are run on separate cores, or machines, can be easily executed (Wilkinson, 2005), however this does not address the big-data problem as each machine still needs to store and evaluate the whole data set. In order to generate a significant computational speed-up the data set must be partitioned into disjoint batches, where independent MCMC algorithms are executed on separate batches on independent processors (Huang and Gelman, 2005). Using only a subset of the entire data means that the MCMC algorithm is targeting a partial posterior, herein referred to as a subposterior. This type of parallelisation is highly efficient as there is no communication between the parallel MCMC chains. The main challenge is to then reintegrate the samples from the separate MCMC chains to approximate the full posterior distribution. Scott et al. (2013) create a Gaussian approximation for the full posterior by taking weighted averages of the means and variances of the MCMC samples from each batch; this procedure is exact when each subposterior is Gaussian, and can work well approximately in non-Gaussian scenarios. Neiswanger et al. (2013) avoid the Gaussian assumption by approximating the subposteriors using kernel density estimation, however, kernel density approximations scale poorly in high dimensions (Liu et al., 2007). Also, the upper bounds on the mean squared error given in Neiswanger et al. (2013) grow exponentially with the number of batches, which is problematic in big data scenarios where the computational benefit of parallelisation is proportional to the number of available processors.
Previous approaches used to merge the product of subposterior densities have solely relied on the parameter samples outputted from each MCMC algorithm, but have neglected to utilise the subposterior densities which are calculated when evaluating the Metropolis-Hastings ratio. We place Gaussian-process (GP) priors on the log-density of each subposterior. The resulting approximation to the log of the full posterior density is a sum of Gaussian-processes, which is itself a Gaussian-process. From this we may obtain not only a point estimate of any expectation of interest, but also a measure of uncertainty in this estimate.
Starting from this Gaussian-process approximation to the full log-posterior density, we investigate three approaches to approximating the posterior. Firstly, an efficient Hamiltonian Monte Carlo (HMC) algorithm (Neal, 2010) which targets the expectation of the posterior density (the exponential of the combined GP); samples from this provide our first means of estimating expectations of interest. Secondly, the HMC sample values may be sent to each of the cores, with each core returning the true log-subposterior at each of the sample points. Combining these coincident log-subposterior values provides the true posterior at the sampled points, which in turn provides importance weights for the HMC sample, leading to asymptotically consistent estimates of posterior expectations. The practitioner may wish to avoid the complexities and computational expense of running HMC on the expectation of the exponential of the GP and of calculating the true sub-posteriors at a sample of points. We, therefore, also consider an importance proposal based upon any approximation to the true posterior and obtain repeated samples of importance weights by repeatedly sampling realisations of the GP approximation to the log-posterior. This provides both an estimate of any expectation of interest and a measure of its uncertainty. This paper is structured as follows. Section 2 reviews the parallel MCMC approach for sampling from the posterior, the HMC algorithm and importance sampling. Section 3 then outlines the creation of our Gaussian-process approximation for each of the individual subposteriors, and for combining these. In Section 4 we detail three methods for approximating posterior expectations, each utilising the combined Gaussian-process approximation. Section 5 highlights through two toy models, and two large scale logistic regression problems, that our method offers significant improvements over competing methods when approximating non-Gaussian posteriors. We conclude with a discussion.

Bayesian inference and MCMC
Consider a data set Y = {y 1 , y 2 , . . . , y n } where we assume that the data are conditionally independent with a likelihood n i=1 p(y i |ϑ), where ϑ ∈ Θ ⊆ R d are model parameters. Assuming a prior p(ϑ) for the parameters, the posterior distribution for ϑ given Y is Alternatively, the data set Y can be partitioned into C batches {Y 1 , Y 2 , . . . , Y C } where we define a subposterior operating on a subset of the data Y c as where p(ϑ) is chosen so that p(ϑ) 1/C is proper. The full posterior is given as the product of the subposteriors π(ϑ) ∝ C c=1 π c (ϑ). In this setting we no longer require conditional independence of the data, but rather independence between the batches {Y c } C c=1 , where now the data in each batch can exhibit an arbitrary dependence structure.
Creating an approximation to the posterior, π(ϑ), commences with sampling from each of the subposteriors π c (ϑ) independently in parallel, where, given the independence between data subsets, there is no communication exchange between the MCMC algorithms operating on the subposteriors. This type of parallelisation is often referred to as embarrassingly parallel (Neiswanger et al., 2013). The challenge then lies in combining the subposteriors, for which we propose using Gaussian-process approximations.
In this paper we introduce the Hamiltonian Monte Carlo (HMC) algorithm as one possible MCMC algorithm that can be used to sample from π c (ϑ). Moreover, we use HMC in Section 4 to sample from an approximation to the full posterior, π(ϑ). Other MCMC algorithms, including the random walk Metropolis (Roberts et al., 1997), Metropolis adjusted Langevin algorithm (Roberts and Rosenthal, 1998) and adaptive versions of these (e.g. Andrieu and Thoms, 2008) can also be used.

Hamiltonian Monte Carlo
We now provide a brief overview of Hamiltonian Monte Carlo and its application in this paper; the interested reader is referred to Neal (2010) for a full and detailed review. The HMC algorithm considers the sampling problem as the exploration of a physical system with − log π(ϑ) corresponding to the potential energy at the position ϑ. We then introduce artificial momentum variables ϕ ∈ R D , with ϕ ∼ N (0, M ) being independent of ϑ. Here M is a mass matrix that can be set to the identity matrix when there is no information about the target distribution. This scheme now augments our target distribution so that we are now sampling (ϑ, ϕ) from their joint distribution the logarithm of which equates to minus the total energy of the system. Samples from the marginal distribution of interest, π(ϑ), are obtained by discarding the ϕ samples. We can sample from the target distribution by simulating ϑ and ϕ through fictitious time τ using Hamilton's equations (see Neal (2010) for details) The differential equations in (4) are intractable and must be solved numerically. Several numerical integrators are available which preserve the volume and reversibility of the Hamiltonian system (Girolami and Calderhead, 2011), the most popular being the leapfrog, or Stormer-Verlet integrator. The leapfrog integrator takes L steps, each of size , on the Hamiltonian dynamics (4), with one step given as follows: Using a discretisation introduces a small loss or gain in the total energy, which is corrected through a Metropolis-Hastings accept/reject step. The full HMC algorithm is given in Algorithm 3 in Appendix A.
The HMC algorithm has a step-size parameter and number of leap frog steps L which need to be tuned. The performance of the algorithm is highly dependent on the tuning of the parameters. One way to tune the algorithm is to optimise the parameters such that the acceptance rate is approximately 65% (Beskos et al., 2013). Alternatively, the parameters could be adaptively tuned ; for this paper we use the popular NUTS sampler Hoffman and Gelman (2014), which tunes the trajectory length L to avoid the sampler doubling back on itself. The HMC algorithm can be efficiently implemented using the popular STAN 1 software package. The STAN modelling language automatically tunes the HMC algorithm, and by using efficient automatic differentiaion, the user need only express their posterior model.

Importance sampling
A popular alternative to MCMC for estimating posterior expectations is the importance sampler (Robert and Casella, 1999). Given a proposal density, q(θ), and an unnormalised posterior density, π(θ), importance sampling (e.g. Geweke, 1989) aims to estimate expectations of some measurable function of interest, h(θ) by sampling from q. The starting point is where w(θ) := π(θ)/q(θ) and Z := π(θ)dθ is the normalisation constant. Consider a sequence, {θ i } ∞ i=1 with marginal density q. Provided that a strong law of large numbers (SLLN) applies, setting h(θ) = 1 in the above equation implies thatẐ N := 1 N N i=1 w(θ i ) → Z, almost surely, and hence, almost surely, where w N (θ) := w(θ)/Ẑ N . In Section 4 we will use importance sampling to estimate expectations with respect to the combined posterior distribution.
3 A Gaussian-process approximation to posterior 3.1 Gaussian-process approximations to the subposteriors Parallelising the MCMC procedure over C computing nodes results in C subposteriors {π c (ϑ)} C c=1 . From each subposterior, c, where the MCMC algorithm has been iterated J times, we have where each pair consists of a sample from the Markov chain with its associated log-subposterior, up to some fixed additive constant. We wish to convert this limited information on a finite set of points to information about log π c over the whole support of ϑ. We therefore treat the whole log-subposterior (up to the same additive constant), L c (ϑ), as random with a Gaussian-process prior distribution: where m : ϑ → R and K : ϑ × ϑ → R are, respectively, the mean and covariance functions. For computational convenience, we further assume that the log-subposteriors are independent of each other. We model log π c (ϑ) rather than π c (ϑ), so that our approximation to the overall log-posterior will be a sum of Gaussian-process (Section 3.2); modelling the log-posterior also avoids the need for non-negativity constraints when fitting the GP 2 .
The mean function and covariance function are chosen by the user. A mean function of zero, m(ϑ) = 0, would be inappropriate in this setting as our prior must be the logarithm of a probability density function up to a finite additive constant. We ensure that exp {L c (ϑ)} dϑ < ∞ almost surely through a quadratic mean function of the form Here V is the empirical covariance of ϑ obtained from the MCMC sample and β i , (i = 0, 1, 2) are unknown constants.
The covariance function K(·, ·), determines the smoothness of the log-subposterior, which we shall assume is continuous with respect to ϑ. A popular choice is the squared-exponential function (e.g. Rasmussen and Williams, 2006) where Λ is a diagonal matrix and ω are hyperparameters. In this paper we analytically marginalise β 0 and β 1 (O' Hagan, 1978) and estimate β 2 and the kernel hyperparameters through maximum likelihood (details given in Chapter 5 of Rasmussen and Williams (2006)). We have found that our choice of mean and covariance function work well in practice, however, alternative functions can be applied and may be more appropriate depending on the characteristics of the log-subposterior. Given the choice of prior, D c are observations of this Gaussian-process, leading to a posterior distribution, Define L c (ϑ 1:J ) := {L c (ϑ 1 ), . . . , L c (ϑ J )} and, for some parameter, or parameter vector, θ := θ 1: with, and whereK = K(ϑ 1:J , ϑ 1:J ), K * , * = K(θ 1:N , θ 1:N ) and K * = K(ϑ 1:J , θ 1:N ). The posterior distribution for the GP, L c (θ 1:N )|D c , is a random approximation of the logsubposterior surface.

Merging the subposteriors
Our next goal is to approximate the full posterior π(θ) ∝ C c=1 π c (θ) by merging the subposteriors together. The approximation of each of the C subposteriors as independent Gaussianprocesses, L c (θ) ∼ GP(·, ·) (c = 1, . . . , C) leads directly to the approximation of the full log-posterior (up to an additive constant) as the sum of C Gaussian-processes, Our assumption that the Gaussian-processes representing the log-subposteriors {L c } C c=1 are independent a priori was made for computational convenience. This may not be true in practice since gross deviations from the quadratic prior mean, m c (ϑ), such as any skewness, may be repeated across subposteriors. However, a posteriori these gross deviations should be accounted for through the posterior mean µ c (ϑ). Variability in the original partitioning of the data into batches, and variability in the sample points, ϑ 1:J , across batches will both contribute to the more subtle variations of the GPs about their individual posterior means, so that the posterior correlation should be much smaller than the prior correlation.

Illustration
The Gaussian-process subposterior, provides estimates of the uncertainty in the log-subposterior at points, θ, where the log-subposterior has not been evaluated. This contrasts with current approaches (e.g. Scott et al., 2013;Neiswanger et al., 2013;Wang and Dunson, 2013) to approximate the subposterior which give no measure of uncertainty. This is illustrated below and used in Sections 4.3 and 5.3 to gauge the uncertainty in our estimates of posterior expectations.
To illustrate how the sum of Gaussian-processes is used to approximate L(θ) we sample n = 20, 000 data points from a Normal(θ,1.5 2 ) distribution with θ = 2 and evenly split the data across C = 2 processors. Independent HMC algorithms are run on each subposterior targeting the posterior for θ. A Gaussian-process approximation, as shown in Figure 1, is fitted to each of the subposteriors, where the blue line is the GP mean and the blue band gives a 95% confidence interval of the uncertainty in the approximation at unobserved regions of the parameter space. Using (11), the approximation to the full posterior is given by summing the means and covariances of the Gaussian-process approximations to the subposteriors.

Approximating the full posterior
We now detail three methods for approximating posterior expectations, all of which utilise our Gaussian-process approximation to the full posterior density.

The expected posterior density
Here we approximate the full posterior density (up to an unknown normalising constant) by its expectation under the Gaussian-process approximation: using the properties of the log-Normal distribution. If the individual GPs provide a good approximation to the individual log-subposteriors, then E [L(θ)] will be a good approximation to the full log-posterior. The HMC algorithm then provides an efficient mechanism for obtaining an approximate sample, Evaluating the GP approximation at each iteration of this MCMC algorithm is significantly faster than evaluating the true full posterior, π(θ), directly. As is apparent from the leapfrog dynamics, HMC requires the gradient of log π E , and here the tractability of our approximation is invaluable, since Given a sufficiently large sample fromπ E , approximations of posterior expectations can be highly accurate if the individual GPs provide a good approximation to the log-subposteriors. Moreover, the approximationπ E (θ) to the full posterior can be further improved by using importance sampling on the true posterior.

Distributed importance sampling
Unlike the proposal, q, in Section 2.2, samples generated from the HMC algorithm represent an approximate, correlated sample from an approximation to the true posterior, instead of exact, independent samples from an approximation. Nonetheless, we may still correct for inaccuracies inπ E using importance sampling while spreading the computational burden across all C cores. The full sample from the HMC algorithm targetingπ E , {θ i } N i=1 , is sent to each of the C cores. Each of the C cores then evaluates the true subposterior at each θ i . A single core then combines the subposterior densities for each θ i to provide the full true posterior density:
To be clear, each sub-posterior is evaluated at the same set of θ values, allowing them to be combined exactly. In contrast, the original HMC runs, performed on each individual subposterior, created a different set of θ values for each subposterior so that a straightforward combination was not possible.
Since the unknown normalising constants for both π andπ E appear in both the numerator and the denominator of this expression, they are not needed. Almost sure convergence ofÊ N (h) to E π [h(θ)] as the HMC sample size, N → ∞ relies on the strong law of large numbers (SLLN) for Markov chains (e.g. Tierney, 1996, Theorem 4.3). In addition, if desired, an unweighted approximate sample from π may be obtained by resampling θ i with a probability proportional w i .
We expect our HMC importance proposal to be especially efficient, since it mimics the true posterior. However, other proposal distributions based on competing algorithms for merging subposteriors (e.g. Scott et al., 2013;Neiswanger et al., 2013;Wang and Dunson, 2013) can be used instead; these are compared in Section 5. Algorithm 1 describes this general distributed importance sampler.

Gaussian-process importance sampler (GP-IS)
Finally, we present an importance sampler that uses the full posterior distribution of L, the GP approximation to the full unnormalised log-posterior conditional on {ϑ c,j , π c (ϑ c,j )} C,J c=1,j=1 . Compared with the importance sampler in Section 4.2, the set of points {θ i } N i=1 is generated from a simple proposal distribution, rather than the HMC algorithm applied toπ E . Moreover, given the set of points {θ i } N i=1 the computationally-expensive evaluation of each subposterior at this set of values is replaced with repeated, but relatively cheap sampling of realisations of L at these points. For a fixed number of GP training points, J, estimates of posterior expectations are no-longer asymptotically exact in N , however estimates of the uncertainty in these estimates are also supplied.
As in Sections 4.2 and 2.2 we are interested in I h := E π [h(θ)] = 1 Z π(θ)h(θ)dθ. Here we consider approximating this with where is a realisation of L from the distribution in (11) As an alternative, robust, point estimate, the median of {I h ( m )} M m=1 would target the posterior median. Other posterior summaries for I h , such as a 95% credible interval, could also be estimated from the sample.
Unfortunately it is not possible to store the infinite-dimensional object, ; and even if it were, for moderate dimensions, numerical evaluation of I h ( ) would be computationally infeasible. Instead we use importance sampling. Consider a proposal distribution q(θ) that approximately mimics the true posterior distribution, π(θ) and sample N independent points from it: θ 1:N := (θ 1 , . . . , θ N ). For each m ∈ {1, . . . , M } we then sample the finite-dimensional object ( m (θ 1 ), . . . , m (θ N )) from the joint distribution of the GP in (11). For each such realisation we then construct an approximation to the normalisation constant and to I h ( ): for posterior inference on I h . For the specific case of I E h a simplified expression for the approximation may be derived: Algorithm 2 creates point estimates based upon this. The proposal density q(θ i ) should be a good approximation to the posterior density. To create a computationally cheap proposal, and with a similar motivation to the consensus Monte Carlo approximation (Scott et al., 2013), we make q(θ i ) a multivariate Student-t distribution on 5 degrees of freedom with mean and variance matching those of the Gaussian posterior that would arise given the mean and variance of each subposterior and if each sub-posterior were Algorithm 2 GP Importance Sampler Input: GP approximation L(θ) and proposal distribution q(θ).
-Weight the samples according to (13).
Gaussian, Alternatively, it would be possible to use the output from the HMC algorithm of Section 4.1 in an analogous manner to the way it is used in Section 4.2. Many aspects of our importance sampler can, if necessary, be parallelised: in particular, calculating µ c (θ 1:N ) and Σ c (θ 1:N ), and then sampling 1 , . . . , m and obtaining the sample

Computational cost
We briefly review some of the notation in the paper as a point of reference for this section.
• N := # samples drawn from the approximation to the merged posteriorπ E (θ), or, for GP-IS, from the Student-t proposal.
The overall computational cost of applying the methods in Sections 4.1 and 4.2 to create an approximate (weighted) sample from the full posterior can be summarised in three (four) steps: -Run MCMC on each subposterior (see Section 2). This step is common to all parallel MCMC algorithms (e.g. Scott et al., 2013;Neiswanger et al., 2013;Wang et al., 2015) and has a cost of O(Jn/C).
-Fit GP to each subposterior (see Section 3). Fitting a Gaussian-process to each subposterior has a cost of O(J 3 ) due to the inversion of the J × J matrixK. One of the drawbacks of Gaussian-processes is the computational cost. Faster, approximate Gaussian-processes, referred to as sparse GPs (e.g Csató and Opper, 2002;Seeger et al., 2003;Quiñonero-candela et al., 2005;Snelson and Ghahramani, 2006) can be used to reduce the computational cost. 3 In this paper we apply the simpler speed-up technique of first thinning the subposterior Markov chain; for example, using only every twentieth sample. The thinned Markov chain has the same stationary distribution as the full chain, but the autocorrelation is reduced and, more importantly for us, the sample contains fewer points. Secondly we remove the duplicate samples from the subposterior as they provide no extra information for the GP approximation, and cause the kernel matrixK to become singular. Fitting C independent GPs to each of the subposteriors is embarrassingly parallel as the MCMC output from each subposterior is stored on a separate core.
-Perform HMC onπ E (see Section 4.1). Each iteration of the HMC algorithm requires an evaluation of µ c and Σ c from (10) with N = 1, and multiple evaluations of the gradient terms given in Section 4.1. SinceK −1 has already been calculated, the total cost over all N iterations of the HMC algorithm is O(N J 2 ). The cost of this step is equivalent to competing algorithms including (Neiswanger et al., 2013;Wang and Dunson, 2013), which also use an MCMC-type step to sample from the approximation to the posterior.

Experiments
In this section we compare our Gaussian-process algorithms for aggregating the subposteriors against several competing algorithms: • Consensus Monte Carlo (Scott et al., 2013), where samples are weighted and aggregated.
• Weierstrass rejection sampler 5 (Wang and Dunson, 2013), where the nonparametric density estimates are passed through a Weierstrass transform to give the merged posterior.
We consider four interesting examples: a Bernoulli model with rare events which leads to a skewed posterior, a mixture of Laplace distributions which only becomes identifiable with a large amount of data, and two logistic regression models for large data sets. These examples highlight some of the challenges faced by merging non-Gaussian subposteriors and the computational efficiency of large-scale Bayesian inference.
Our Gaussian-process approximation method is implemented using J = 100 samples from the thinned chain for each subposterior to fit the GPs for the Bernoulli and multimodal examples; for the logistic regression examples J = 500. Both for our methods and for comparator methods, N = 5000 samples from each merged posterior are created. To ensure a fair comparison, the sample from eachπ E that is used both directly and in our DIS algorithm is the unthinned output from the HMC run. The Student-t proposals for the Gaussian-process importance sampler are iid.
Weighted samples from DIS and GP-IS are converted to unweighted samples by resampling with replacement, where the probability of choosing a given θ is proportional to its weight.
For each of the models studied in this section we denote the true parameter values by θ * (when known). We obtain an accurate estimate of the true posterior from a long MCMC run, thinned to a size of N , with samples denoted θ f and the true posterior mean and variance m f and V f , respectively. Samples from the approximation are denoted θ a , and their mean and variance are m a and V a . We use the following metrics to compare the competing methods: • Kullback-Leibler divergence for the Bernoulli and mixture example is calculated using a nearest neighbour search 6 and for the logistic regression example, approximate multivariate Gaussian Kullback-Leibler divergence (see Wang and Dunson (2013) for details) between the true posterior π and aggregated posteriorπ is calculated as Wang et al., 2015), which gives a measure for the posterior spread around the true value θ * (ρ=1 being ideal). 4 Implemented using the parallelMCMCcombine R package 5 Implemented using the authors R package https://github.com/wwrechard/weierstrass 6 Implemented using the FNN R package • Mean absolute skew deviation, η = 1 is the third standardised moment, and the superscripts f and a denote empirical approximations obtained from the samples obtained using the true posterior and the approximation, respectively.

Rare Bernoulli events
The consensus Monte Carlo approach of Scott et al. (2013) is the optimal algorithm for merging the subposteriors when each subposterior is Gaussian. A popular example where this algorithm struggles is the Bernoulli model with rare events, where the subposterior distributions are skewed (e.g. Wang et al., 2015;Wang and Dunson, 2013;Scott et al., 2013). We sample n = 10, 000 Bernoulli random variables y i ∼ Bern(ϑ) and split the data across C = 10 processors. We set ϑ = C/n so that the probability of observing an event is rare. In fact, each subset only contains 1 success on average. A Beta(2, 2) prior distribution is assumed for ϑ. Figure 2 and Table 1 give the results of merging the subposteriors for the various algorithms. Both GP-HMC and GP-IS samplers produce good approximations to the posterior. All of the competing algorithms can reasonably identify the mode of the posterior, but do not adequately fit the tail of the distribution. The consensus algorithm gives a reasonable approximation to the body of the density but struggles to capture the posterior skew. The nonparametric method appears to perform the worst in this setting; this could be improved by hand-tuning the bandwidth parameter. However, doing so assumes that the full posterior is known, which is not the case in practice.
We can generate samples from the full posterior using the distributed importance sampler (Alg. 1), where samples from each of the aggregation methods can be used as a proposal. Figure 2 (right panel) shows that using the DIS improves the accuracy of all of the competing methods. This improvement is most noticeable for the consensus and nonparametric approximations. Ultimately, the overall accuracy of the approximation to the full posterior will dependent on the quality of the proposal distribution.

Multimodal subposteriors
We create a concrete data scenario that could lead to a set of multimodal sub-posteriors similar to the artificial, perturbed multimodal sub-posteriors used in Wang et al. (2015). The example is a toy representation of a general situation where one or more parameters of interest are poorly identified, but as the size of the dataset increases towards 'large' n, the parameters start to become identifiable. The subposteriors are multimodal but the full posterior is unimodal (see Figure 3, left panel). We sample n = 1, 000, 000 observations from a mixture of two Laplace distributions If β 1 = β 2 then the mixture components are non-identifiable, however, by setting β 1 = 1.01 and β 2 = 0.99 the parameter of interest, θ can be identified from a sufficiently large dataset. For this experiment θ = 0.05 and the data are split equally over C = 25 processors. The scale parameters, β 1 and β 2 are fixed at their true values and a N (0, 1) prior is assumed for θ. The left panel of Figure  terior reveals, approximately, the true value for θ, whereas θ is poorly identified by each of the subposteriors. The multimodality of the subposteriors results in a poor posterior approximation from the consensus Monte Carlo algorithm (right panel) as each subposterior is assumed to be approximately Gaussian. On the other hand, most of the nonparametric methods are able to capture the approximate shape of the posterior, but fail to correctly detect the posterior mode. Table 2 shows that the DIS step can slightly degrade the quality of the approximation if the proposal (e.g. semiparametric) under-represents the tail behaviour of the true posterior. As shown in Figure 3 (right panel), the GP-HMC and GP-IS samplers produce good approximations to the full posterior and unlike the nonparametric methods, the GP approximations concentrate around the posterior mode (see ρ in Table 2).

Logistic regression
Synthetic data set We use a synthetic data set on internet click rate behaviour where one of the covariates is highly predictive, but rarely observed. The data set has n = 10, 000, with 5 covariates and is identical to that in Section 4.3 of Scott et al. (2013). Partitioning the data across 10 machines means that only a few machines carry an observation for the highly predictive covariate, leading to some skewed subposterior distributions. The consensus Monte Carlo algorithm struggles in this scenario because, even though the marginal of the full posterior for the coefficient of the rarely observed covariate is approximately Gaussian, the subposteriors are not. We also compare the standard set-up of the merging algorithms against their DIS equivalents and find that DIS improves the accuracy of the approximation. The improvement brought about by DIS applied to our GP-HMC sampler is relatively small becauseπ E is already an accurate approximation to the posterior. Nonetheless, as can be seen with the consensus approximation, the DIS step can lead to a significant improvement   if the approximation produced by the merging algorithm alone is poor. The nonparametric methods begin to struggle on this example. This is partly due to the difficulty of tuning these methods, and as discussed in Wang and Dunson (2013), these methods begin to struggle as the dimension of the parameter space increases.
Real data set We conduct parallel MCMC experiments on the Hepmass 7 data set. The challenge here is to accurately classify the collisions of exotic particles by separating the particleproducing collisions from the background source. The full data set contains 10.5 millions instances with 28 attributes representing particle features. In our experiments we use the first million instances and discard the mass attribute in our model fit. The data is divided equally across C = 20 machines. The results from Table 4 show that all methods approximate the full posterior with more or less the same level of accuracy, with the exception of the nonparametric method. As discussed in Neiswanger et al. (2013), nonparametric methods scale poorly with dimension with the Weierstrass and semiparametric algorithms performing better than the simple nonparametric method. The posterior concentration ratio is not reported as this is close to one for all methods. The subposteriors are approximately Gaussian and as a result all methods, including the consensus Monte Carlo algorithm, produce accurate approximations to the full posterior. As a result, and with the exception of the nonparametric method, applying the DIS step does  Table 5: Expectation and variance of θ 1 and θ 17 from the logistic regression model with the HEPMASS dataset. Mean, Median and quantile estimates of the quantities are calculated from 500 samples from the GP-IS sampler (i.e. M=500).
not lead to a significant improvement in the approximation. As described in Section 4.3, the GP-IS sampler draws multiple realisations from the posterior distribution of the GP approximation to the posterior. Each of these realisations provides an estimate of the expectation of interest, their centre (mean or median) provides a point estimate and their spread (2.5% and 97.5% quantiles) provide a measure of the uncertainty. In Table 5 we estimate the posterior mean and variance of two parameters and compare these estimates against the truth, as calculated from an MCMC run on the full posterior. Sampling M = 500 realisations from GP we report the mean, median and 95% CI for estimates of the mean and variance and find that these results are consistent with the truth.

Discussion
Aggregating subposteriors generated through parallel, independent MCMC simulations, to form the full posterior distribution is challenging. Currently, available methods either produce a Gaussian approximation to the posterior, or utilise nonparametric estimators which are difficult to tune and do not scale well to high-dimensional settings. In this paper we have presented an alternative approach to this problem by directly modelling the log-density of the subposteriors. Using Gaussian-process priors we were able to employ a fully Bayesian strategy towards approximating the full posterior, and unlike competing methods, we were able to account for the uncertainty in the approximation.
Compared to the nonparametric methods, fitting the Gaussian-processes is straightforward using a mixture of marginalisation and maximum likelihood techniques for the hyperparameters. The main drawback of using Gaussian-process approximations is the computational cost. We have reduced the computational cost by, for each subposterior sample, thinning the Markov chain and removing duplicate MCMC samples prior to fitting the GP. We have shown that using only a small number of samples from the subposterior we can accurately approximate the full posterior. Furthermore, the computationally intensive step of fitting the individual GPs to the subposteriors is automatically parallelised, as the subposteriors are independent by definition and the GPs are independent by design.
The algorithms we propose scale well with the number of data points n, but fitting a GP when the dimension, d, of θ is high can be computationally expensive as the number of input points required to produce an accurate approximation grows exponentially with d. An extension to this work would be to employ sparse GP approximations to reduce the computational expense for high-dimensional problems.