Scalable Bayesian Inference for the Inverse Temperature of a Hidden Potts Model

The inverse temperature parameter of the Potts model governs the strength of spatial cohesion and therefore has a major influence over the resulting model fit. A difficulty arises from the dependence of an intractable normalising constant on the value of this parameter and thus there is no closed-form solution for sampling from the posterior distribution directly. There are a variety of computational approaches for sampling from the posterior without evaluating the normalising constant, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space, such as images with a million or more pixels. We introduce a parametric surrogate model, which approximates the score function using an integral curve. Our surrogate model incorporates known properties of the likelihood, such as heteroskedasticity and critical temperature. We demonstrate this method using synthetic data as well as remotely-sensed imagery from the Landsat-8 satellite. We achieve up to a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. An open source implementation of our algorithm is available in the R package"bayesImageS."


Introduction
Markov random field (MRF) models have seen widespread use in image analysis since their introduction by Besag [1974], as surveyed by Winkler [2003] and Li [2009].A MRF is a generalisation of the Markovian dependence structure to more than one dimension: satellite imagery has two spatial dimensions, while medical images such as computed tomography (CT) are three-dimensional.The hidden Potts [1952] model employs a latent MRF on discrete states to describe spatial dependence between adjacent neighbours.The degree of dependence in the model is governed by a parameter known as the inverse temperature due to its origin in statistical physics.It is difficult to set this parameter by trial and error, particularly for noisy images.Rather than using a fixed value, it would be preferable to estimate the inverse temperature as part of the model.However, the intractable normalising constant of the Potts model depends on the value of the inverse temperature, which means that there is no closed form solution for estimating its posterior distribution.Rydén and Titterington [1998] derived a pseudolikelihood (PL) approximation [Besag, 1975] to the intractable posterior density.Gelman and Meng [1998] instead approximated the ratio of normalising constants using thermodynamic integration (TI), also known as path sampling.Møller et al. [2006] introduced an auxiliary variable method that gives an exact MCMC algorithm for the special case of a 2 component Potts model, also known as an Ising model.Murray et al. [2006] proposed a variant of the exact method known as the exchange algorithm or multiple auxiliary variable method (MAVM).By replacing the expensive perfect sampling step [Propp and Wilson, 1996] with an approximation such as Gibbs sampling, Cucala et al. [2009] developed an approximate exchange algorithm that can be applied for Potts models with k > 2. Friel et al. [2009] introduced the reduced dependence approximation (RDA) which uses recursion to calculate the normalising constant on small sub-lattices.McGrory et al. [2012] generalised RDA to an irregular lattice.Grelaud et al. [2009] used the sufficient statistic of the Potts model to estimate the inverse temperature using approximate Bayesian computation (ABC).Everitt [2012] combined the approximate exchange algorithm with particle Markov chain Monte Carlo (PMCMC) and also implemented ABC with sequential Monte Carlo (SMC-ABC) for the Ising model.Although all of these methods work very well in theory, the largest image used in any of these papers to demonstrate an algorithm is less than ten thousand pixels.This does not give a reliable indication of how these algorithms might perform when applied to images of a more substantial size.
Images containing multiple megapixels are now commonplace, from digital photography and high-definition (HD) video to medical imaging and remote sensing.Table 1 illustrates how the number of pixels translates to real world scale for various types of images, including the area covered by a satellite image, the number of axial slices of a CT scan, and the number of frames of HD video.Multispectral images from the Landsat 7 [NASA, 2011] and Landsat 8 [USGS, 2014] satellites have a spatial resolution of 30 metres per pixel (area of 900m 2 ).A Landsat image covers an area of approximately 170km north-  et al. [2010], and the references therein.The remainder of the discussion will focus on static images.This is the first study to systematically investigate the scalability of methods for intractable likelihoods.In this paper we compare the performance of pseudolikelihood (PL), the approximate exchange algorithm (MAVM), thermodynamic integration (TI), and approximate Bayesian computation (ABC).By conducting these experiments we are able to address questions such as how the computational cost increases with the size of the images, as well as how much accuracy is lost by using faster, more approximate methods.The imaging datasets, which include 2D remote sensing and 3D medical imaging, are described in Section 2. We define the hidden Potts model in Section 3 and examine the properties of its intractable normalising constant.The various computational methods for estimating the inverse temperature are reviewed in Section 4. We present our experimental results in Section 5 and conclude the article with a discussion.

Synthetic data
The inverse temperature cannot be directly observed, so the only way to measure the accuracy of a method is to use simulated data.We have adapted a method that was proposed by Cook et al. [2006] for the purpose of assessing the correctness of software for fitting a Bayesian model.Their central concept is to draw values of the model parameters from the prior and simulate data using those parameters, then fit the model and compare the estimate of the posterior distribution to the known value.We do not make use of the χ 2 test devised by Cook et al., since our goal is not to evaluate the hypothesis that our implementation of the algorithm is correct.We know that all four algorithms are approximations, therefore our aim is to compare the performance of these algorithms under various conditions.
We use a uniform prior to draw values of the inverse temperature.This prior is adopted because it reflects current practice and enables comparison between our results and other recent studies.We use informative, conjugate priors for the means and variances of the mixture components in our simulation study.This avoids the problem of label switching and corresponds with the model that we use in our applications to real data.We use a fixed number of 3 mixture components to generate 6 datasets containing 20 images each.The images within a dataset all have the same dimensions.The small images have 2 6 pixels (8×8 for 2D images, or 4×4×4 for 3D).The medium images have 5 6 (125×125, or 25 × 25 × 25).The large images have 10 6 (1000 × 1000, or 100 × 100 × 100).This enables us to study how the performance of the algorithms changes as the size of the images increases, as well as how well these algorithms generalise to a 3D lattice.

Satellite remote sensing
Landsat imagery of the Greater Brisbane region was converted to the Normalised Difference Vegetation Index (NDVI).This index is calculated using the visible red (RED) and near-infrared (N IR) bands: NDVI is generally used to detect vegetated areas, with an expected value for vegetation being between 0.3 and 0.8.In this analysis, we are interested in detecting clusters which relate to vegetated areas.The size of the data set does not enable us to individually view the satellite image of Brisbane, however, as indicated in Figure 1b, a mixture model may be an appropriate means of determining clusters, and attributes of clusters, specifically those related to vegetation.In addition, a further goal of data analysis is detecting loss (or gain) of moderate sized vegetative areas through time.A mixture model may also suit this task, as pixel movement between clusters may indicate a change in habitat or land use.

Medical images
Computed tomography (CT) is a technology for reconstructing a three-dimensional image by filtered back-projection from X-ray beams [Kalender, 2011].The pixel intensities in a CT scan are linearly related to the electron density (ED) within the object that is imaged.CT scanners are calibrated in Hounsfield units (HU) with water at 0 HU and air at -1024 HU.
Rather than using biological images, which are difficult for the untrained eye to interpret, in this paper we have used 27 CT scans of an ED phantom (CIRS, Inc. model 062).This object is manufactured from epoxy and contains    cylindrical inserts that mimic the X-ray absorption of human tissue.Since the geometry and density of this object is known, the CT scans can be used to explore the effect of inverse temperature on segmentation accuracy.An example slice of a CT scan of the ED phantom is illustrated in Figure 2. The cylindrical inserts are arranged in pairs, clockwise from the top: lung (exhale); adipose (fatty tissue); dense (cortical) bone; muscle; lung (inhale); breast (50% fat); spongy (trabecular) bone; and liver.The body of the phantom is composed of water-equivalent solid and therefore should have a mean pixel intensity of 0 HU.

Hidden Potts model
Image segmentation can be viewed as the task of labelling the observed pixels y according to a finite set of discrete states z ∈ {1, . . ., k}.The hidden Potts model allows for spatial correlation between neighbouring labels in the form of a Markov random field.The latent labels follow a Gibbs distribution, which is specified in terms of its conditional probabilities: where β is the inverse temperature, z \i represents all of the labels except z i , i ∼ are the neighbouring pixels of i, and δ(u, v) is the Kronecker delta function.Thus, i∼ δ(z i , z ) is a count of the neighbours that share the same label.
If the labels z i are indexed row-wise, the nearest (first-order) neighbours i ∼ in a regular 2D lattice with c columns are {z i−1 , z i−c , z i+c , z i+1 }.Pixels situated at the boundary of the image domain have less than four neighbours.Likewise, voxels in a regular 3D lattice have a maximum of 6 first-order neighbours.These neighbourhood relationships are reciprocal, so h ∈ i ∼ implies i ∈ h ∼ .If E is the set of all unique neighbour pairs, or edges in the image lattice, then |E| = 2(n − √ n) for a square lattice and 3(n − n 2/3 ) for a cube.The observation equation links the latent labels to the corresponding pixel values: where θ j are the parameters that govern the distribution of the pixel values with label j.The hidden Potts model can thus be viewed as a spatially-correlated generalisation of the finite mixture model [Rydén and Titterington, 1998].Green and Richardson [2002] used a Poisson likelihood for (3), with intensity λ j .Instead we follow Geman and Geman [1984], Alston et al. [2007], and many others in assuming that the pixels with label j share a common mean µ j corrupted by additive Gaussian noise with variance σ 2 j : The Gibbs distribution is a member of the exponential family and so there is a sufficient statistic for this model, as noted by Grelaud et al. [2009]: This statistic represents the total number of like neighbour pairs in the image.
The likelihood p(y, z|θ, β) can therefore be factorised into p(y|z, θ)p(S(z)|β), where the second factor does not depend on the observed data, but only on the sufficient statistic.The joint posterior is then: The conditional distributions p(θ|z, y) and p(z i |z \i , β, y i , θ zi ) can be simulated using Gibbs sampling, but p(β|y, z, θ) involves an intractable normalising constant C(β): The normalising constant is also known as a partition function in statistical physics.It has computational complexity of O(nk n ), since it involves a sum over all possible combinations of the labels z ∈ Z: It is infeasible to calculate this value exactly for nontrivial images, thus computational approximations are required.
The conditional expectation of S(z) given β can be expressed in terms of the normalising constant: As β approaches infinity, all of the pixels in the image are almost surely assigned the same label, thus the expectation of S(z) approaches the total number of edges |E| asymptotically, while the variance approaches zero.When β = 0, (2) , hence the probability of any pair of neighbours being assigned the same label follows an independent Bernoulli distribution with p = k −1 .In this case, S(z) follows a Binomial distribution with expectation |E|/k and variance |E|k −1 (1 − k −1 ).In a finite image lattice the distribution of S(z) changes smoothly between these two extremes, as illustrated by Figure 3, but its computation is intractable for nontrivial images.
The Potts model undergoes a phase transition at the critical value of β, switching from a disordered to an ordered state.Potts [1952] showed that the critical value for a 2D lattice on a cylinder can be calculated exactly according to:  The periodic boundary condition does not apply to the lattices considered in this paper, so for example the critical value for the images in Figure 3 is different to (11).However, the error introduced by the finite boundary diminishes as n increases.Figure 4 shows that ( 11) is very accurate in predicting the behaviour of S(z) for a 2D image with a maximum value of S(z) for first-order neighbours of 1,954,672.This corresponds to the satellite image described in Section 2.2, with k=6 mixture components and β crit ≈ 1.24.S(z) is approximated by simulation using the algorithm of Swendsen and Wang [1987].Figure 4a shows that the gradient of the expectation becomes very steep near the critical region, which is reflected in the super-exponential increase of the standard deviation illustrated by Figure 4c.As n → ∞ the derivative of the likelihood at the critical point, and hence the variance, is unbounded [Pickard, 1987].This heteroskedasticity has important implications for many of the methods discussed in Section 4.
There is no exact formula for β crit in 3D images, although Hajduković [1983] developed an empirical approximation that works reasonably well: The behaviour of E z|β [S(z)] for a 3D image with k = 9 and β crit ≈ 0.86 is illustrated by Figure 4b.This is typical of the CT scans described in Section 2.3, with |E| approximately equal to three million.The standard deviation is shown in Figure 4d.It is evident that (12) has underestimated β crit since the maximum value of the standard deviation occurs at β = 0.9.

Computational methods
In this section we describe the four algorithms that are evaluated in Section 5.These methods provide alternative means to simulate parameter values from (8) without computing the intractable normalising constant.We describe the algorithms in terms of Markov chain Monte Carlo (MCMC) to enable direct comparison, although we also mention other approaches where applicable, such as particle-based (SMC and PMCMC) methods.Reference implementations of all of these methods are available from various sources described below, but for the purpose of comparison we have reimplemented the algorithms using RcppArmadillo [Eddelbuettel and Sanderson, 2014].An R source package containing our code is available in Moores and Mengersen [2015].

Pseudolikelihood and composite likelihood
Pseudolikelihood is the simplest of the methods that we have considered and also the fastest.Rydén and Titterington [1998] showed that the intractable distribution (8) could be approximated using the product of the conditional densities given by (2).This enables updates for the inverse temperature at iteration t to be simulated using a Metropolis-Hastings (M-H) step, as shown in Algorithm 1.  Algorithm 1 MCMC with pseudolikelihood (PL) Calculate sufficient statistics ȳj , s 2 j ∀z i = j, ∀j ∈ {1, . . ., k}

9:
Draw u ∼ Uniform[0, 1] 10: if u < ρ then 11: else 13: end if 15: end for The M-H proposal density q(β |β t−1 ) can be any distribution such that q(β |β t−1 ) dβ = 1.However, there is a tradeoff between exploring the full state space and ensuring that the probability of acceptance (at line 10) is sufficiently high.We use the adaptive random walk (RWMH) algorithm of Garthwaite et al. [2010], which automatically tunes the bandwidth of the proposal density to target a given M-H acceptance rate.When a symmetric proposal density is used, q(β |β t−1 ) = q(β t−1 |β ) and so this term cancels out in the M-H ratio on line 8.The natural logarithm of ρ is used in practice to improve the numerical stability of Algorithm 1.
Pseudolikelihood is exact when β = 0 and provides a reasonable approximation for small values of the inverse temperature.However, the approximation error increases rapidly for β ≥ β crit , as illustrated by Figure 5.This is due to long-range dependence between the labels, which is inadequately modelled by the local approximation.The implications of this inaccuracy for posterior inference will be demonstrated in Section 5. Rydén and Titterington [1998] referred to Equation ( 13) in Algorithm 1 as point pseudolikelihood, since the conditional distributions are computed for each pixel individually.They suggested that the accuracy could be improved using block pseudolikelihood.This is where the likelihood is calculated exactly for small blocks of pixels, then ( 13) is modified to be the product of the blocks:  where N B is the number of blocks, z Bi are the labels of the pixels in block B i , and z \Bi are all of the labels except for z Bi .This is a form of composite likelihood, where the likelihood function is approximated as a product of simplified factors [Varin et al., 2011].Friel [2012] compared point pseudolikelihood to composite likelihood with blocks of 3 × 3, 4 × 4, 5 × 5, and 6 × 6 pixels.Friel showed that ( 14) outperformed ( 13) for the Ising (k = 2) model with β < β crit .Okabayashi et al. [2011] discuss composite likelihood for the Potts model with k > 2 and have provided an open source implementation in the R package potts [Geyer and Johnson, 2014].
Evaluating the conditional likelihood in ( 14) involves the normalising constant for z Bi , which is a sum over all of the possible configurations Z Bi .This is a limiting factor on the size of blocks that can be used.The brute force method that was used to compute Figure 3 and 5 is too computationally intensive for this purpose.Pettitt et al. [2003] showed that the normalising constant can be calculated exactly for a cylindrical lattice by computing eigenvalues of a k r × k r matrix, where r is the smaller of the number of rows or columns.The value of (9) for a free boundary lattice can then be approximated using path sampling.Friel and Pettitt [2004] extended this method to larger lattices using a composite likelihood approach.
The reduced dependence approximation (RDA) is another form of composite likelihood.Reeves and Pettitt [2004] introduced a recursive algorithm to calculate the normalising constant using a lag-r representation.Friel et al. [2009] divided the image lattice into sub-lattices of size r 1 < r, then approximated the normalising constant of the full lattice using RDA:

Thermodynamic integration
Gelman and Meng [1998] derived an approximation to the log ratio of normalising constants using the path sampling identity: log which follows from the definition of E z|β [S(z)] in (10).The value of this definite integral can be approximated by simulating from the Gibbs distribution for fixed values of β and then interpolating between them.We used the Swendsen-Wang algorithm for simulating from z|β, as mentioned in section 3. Figure 6a illustrates linear interpolation of S(z) on a 2D lattice for k = 6 and β ranging from 0 to 2 in increments of 0.05.This approximation can be precomputed using the same method that was used to produce Figure 4. Figure 6b illustrates interpolation of S(z) on a 3D lattice for k = 9.

5:
Update the noise parameters µ j , σ 2 j ∼ p(ȳ j , s 2 j |µ j , σ 2 j ) π(µ j |σ 2 j ) π(σ 2 j ) 6: Draw proposed parameter value β ∼ q(β |β t−1 ) 7: Evaluate the definite integral in Equation ( 16) Calculate the log M-H acceptance ratio: if u < ρ then 12: else 14: end if 16: end for Path sampling is explained in further detail by Chen et al. [2000, chap. 5].A reference implementation in R is available from the website accompanying Marin and Robert [2007].This algorithm has an advantage over auxiliary variable methods such as the exchange algorithm or ABC because the additional simulations are performed prior to fitting the model, rather than at each iteration.This is particularly the case when analysing multiple images that all have approximately the same dimensions, as in the examples of section 5.However, the computational cost is still slightly higher than pseudolikelihood, which does not require a pre-computation step.Møller et al. [2006] demonstrated that it is possible to simulate from the posterior distribution of β by introducing an auxiliary variable, so that C(β) cancels out in the M-H ratio.The exchange algorithm of Murray et al. [2006] improves the performance of this method and avoids the need for a fixed estimate of β.The drawback of this approach is that it requires perfect sampling from the stationary distribution of the Potts model.This is possible using coupling from the past [Propp andWilson, 1996, Huber, 2003] but can be computationally prohibitive, particularly for large images.McGrory et al. [2009] reported that the time required for perfect sampling increased sharply for larger values of β.For this reason, Cucala et al. [2009] substitute 500 iterations of Gibbs sampling on the auxiliary variable to produce an approximate sample from its stationary distribution.The details of this method are shown in Algorithm 3.

5:
Update the noise parameters µ , σ 2 j ∼ p(ȳ j , s 2 j |µ j , σ 2 j ) π(µ j |σ 2 j ) π(σ 2 j ) 6: Draw proposed parameter value β ∼ q(β |β t−1 ) 7: Generate w|β by sampling from Equation ( 2) Calculate the M-H acceptance ratio according to Equation ( 8): if u < ρ then 11: else 13: end if 15: end for The M-H ratio given by Equation ( 18) can be considerably simplified if a uniform prior is used for β and β is drawn from a symmetric M-H proposal density.In this case, log{ρ} simplifies to (β − β t−1 )S(z) + (β t−1 − β )S(w).The similarity with Equation ( 17) shows that the exchange algorithm is closely related to path sampling, since it is a form of importance sampling.An implementation of the exchange algorithm is available in the online supplementary material accompanying Friel and Pettitt [2011].Everitt [2012] provides source code for the approximate exchange algorithm with particle MCMC.

Approximate Bayesian computation
Like the exchange algorithm, ABC uses an auxiliary variable w to decide whether to accept or reject the proposed value of β .Instead of a M-H ratio such as (18), the summary statistics of the auxiliary variable and the observed data are directly compared.The proposal is accepted if the distance between these summary statistics is within the ABC tolerance, .This produces the following approximation when the labels z are observed without error: where • is a suitable norm.In this paper we simply use the absolute difference between S(w) and S(z), since the summary statistic ( 5) is univariate.S(z) is sufficient for β, as noted by Grelaud et al. [2009], therefore the ABC approximation ( 19) approaches the true posterior as n → ∞ and → 0. In practice there is a tradeoff between the number of parameter values that are accepted and the size of the ABC tolerance.Everitt [2012] introduced ABC for noisy data, such as the Ising model, but only when the observations are discrete and so S(y) is defined.Moores et al. [2015] showed that ABC can also be applied when the domain of the observed data is continuous, such as with the additive Gaussian noise of Equation ( 4).In this type of model the posterior distribution of β does not depend directly on the observed data y, as shown by Equation ( 7).The ABC approximation ( 19) for the hidden Potts model becomes: The other parameters can be updated using the current value of β and then the summary statistic S(z) can be computed from the current values of the labels, using an ABC within Gibbs approach as shown in Algorithm 4. Equation ( 20) can be considered as a moving target, since S(z) could change whenever the labels are updated.As a consequence, ABC with noisy data can be more prone to getting stuck in low probability regions of the parameter space.Grelaud et al. [2009] used a rejection sampler, where the proposals are drawn independently from the prior and thus q(β |β t−1 ) = π(β).Under a sparse or uninformative prior, such as the uniform prior used in this study, too many proposals are rejected for this approach to be viable.Instead we have based Algorithm 4 on the ABC-MCMC algorithm of Marjoram et al. [2003].This form of ABC algorithm is best suited for direct comparison with the other intractable likelihood methods in this article.

7:
Generate w|β by sampling from Equation ( 2) and S(w) − S(z) < then 10: end if 14: end for There have been many extensions to ABC, as reviewed by Marin et al. [2012].Of particular interest are adaptive ABC algorithms that automatically adjust the tolerance and the proposal bandwidth σ 2 M H .ABC with sequential Monte Carlo (SMC-ABC) algorithms use a sequence of target distributions where the number of SMC iterations T can be determined dynamically using a stopping rule.The SMC-ABC algorithm of Drovandi and Pettitt [2011] uses multiple MCMC steps for each SMC iteration, while the algorithm of Del Moral et al. [2012] uses multiple replicates of the summary statistics for each particle.Everitt [2012] has provided a MATLAB implementation of SMC-ABC with the online supplementary material accompanying his paper.
The computational efficiency of ABC is dominated by the cost of drawing updates to the auxiliary variable, as reported by Everitt [2012].Thus, we would expect that the execution time for ABC would be similar to the approximate exchange algorithm.Various approaches to improving this runtime have recently been proposed, such as the Gaussian process emulation of Wilkinson [2014], the "lazy ABC" of Prangle [2014], and methods involving auxiliary models [Cabras et al., 2014, Moores et al., 2015, Buzbas and Rosenberg, 2015].
All four algorithms were run for 10,000 iterations on each image, discarding the first 5,000 as burn-in.The differences in elapsed runtime are illustrated by Figure 7.All of the algorithms scaled linearly with increasing numbers of pixels, but the approximate exchange algorithm (MAVM) and ABC were roughly two orders of magnitude slower than either pseudolikelihood (PL) or path sampling (TI).For 2D images with n = 10 6 pixels the average runtime was 43 hours for MAVM, 42.5 hours for ABC, 36 minutes for PL and 29 minutes for TI.It took 30 minutes for precomputation of 64 values of E z|β [S(z)] for path sampling, using 16 parallel threads.For 3D images with n = 10 6 voxels the averages were 45 hours for MAVM or ABC, 35 minutes for PL and 30 minutes for TI.The precomputation step took 24 minutes for 44 values of β.Even allowing for the cost of precomputation, TI was still much faster than MAVM or ABC.
The distributions of the differences between the MCMC samples of β and the true value of the inverse temperature are illustrated by Figure 8 for 2D images with n = 10 6 .Results for the other simulation studies are available in the supplementary material.It is evident that all four algorithms had difficulty with one of the images that had a very small value of β = 0.032.This appears to

Satellite remote sensing
For the satellite image described in Section 2.2 we used weakly informative priors π(µ j ) ∼ N (ȳ, 5) and π(σ 2 j ) ∼ IG(1, 0.01) for the k = 6 mixture components and π(β) ∼ U(0, 3) for the inverse temperature.We were unable to evaluate accuracy for this image because the true values of the inverse temperature and the pixel labels are unknown.However, we can still compare the 95% HPD intervals for β obtained from the four algorithms, as well as the time taken to fit the model.These results are summarised in Table 2.
We ran pseudolikelihood and path sampling for a total of 60,000 iterations each, discarding the first 5,000 as burn-in.Due to the high computational cost of ABC-MCMC and the approximate exchange algorithm, we only ran these algorithms for 10,000 iterations with 5,000 burn-in, using 500 auxiliary iterations of Gibbs sampling.This took more than 50 hours, in comparison to 2 hours for pseudolikelihood and 1.4 hours for path sampling, plus 33 minutes to precompute the expectation of the sufficient statistic for 64 values of β.Pseudolikelihood was the fastest algorithm but it appears to have overestimated the value of β by a wide margin.There was no overlap between any of the credible intervals, but the other three algorithms all produced estimates in the neighbourhood of 1.3.This is higher than the critical value of β, which is 1.24 for a 2D image with k = 6.For comparison, we also analysed this image using PyMCMC [Strickland et al., 2012] and the R package potts [Geyer and Johnson, 2014].PyMCMC produced a posterior estimate of 1.27, while the maximum pseudolikelihood estimate (MPLE) returned by potts was 1.76.
We ran pseudolikelihood and path sampling for 60,000 iterations on each of the 28 CT scans, but the approximate exchange algorithm and ABC were only run for 10,000, as we did for the satellite image and the synthetic data.The first 5,000 iterations were discarded as burn-in.The distributions of the MCMC samples for β are illustrated in Figure 9a.The pooled 95% HPD intervals were [2.207; 2.281] for pseudolikelihood, [0.934; 1.019] for the exchange algorithm, [0.988; 0.992] for path sampling, and [0.938; 1.034] for ABC.Pseudolikelihood has overestimated β in comparison to the other algorithms, as with the satellite image.The pooled credible intervals for the other three algorithms overlap with each other.
The distribution of runtimes are shown in Figure 9b.It took an average of 106.6 hours for 10,000 iterations of the exchange algorithm and 115 hours for ABC-MCMC.The average runtime for 60,000 iterations of pseudolikelihood was 4.9 hours, while the average for path sampling was 3.5 hours.It took 38 minutes to precompute 44 values of E z|β [S(z)] for path sampling.

Concluding remarks
We have compared the approximate exchange algorithm, path sampling, pseudolikelihood, and ABC using a simulation study as well as real data.All of these Markov chain Monte Carlo algorithms were prone to becoming stuck in low-probability regions of the parameter space.Path sampling provided the best tradeoff between accuracy and computational cost over all of the images that we tested.This was due to precomputation of E z|β [S(z)] for fixed values of β, an operation that is highly parallelisable for modern computer architectures.The output of this precomputation step can be reused for multiple images with similar dimensions.The only drawback of path sampling is that it fails to account for the heteroskedasticity in the distribution of S(z).This can result in errors in the posterior distribution, particularly when β is above the critical value.Moores et al. [2015] introduced a precomputation step for ABC, with a second-order approximation that models the variance at the critical point.This produced a comparable runtime to path sampling, but without the outliers that are evident in Figure 8.We have shown that simulating auxiliary variables at every iteration is too computationally expensive for applications in image analysis.Exact inference is infeasible for the scale of data that is commonly encountered in scientific studies, such as medical imaging and remote sensing.This has also been reported in previous studies, such as McGrory et al. [2009], Everitt [2012], Moores et al. [2015] and Moores and Mengersen [2014], but this is the first systematic comparison of the scalability of these algorithms.Methods such as Prangle [2014] can reduce the number of auxiliary iterations that need to be performed, but not enough to compensate for the two orders of magnitude difference in runtime that we observe.
Pseudolikelihood (PL) was the fastest of the four algorithms but produced very different estimates of β when applied to real data.The other three algorithms all depend on the value of the sufficient statistic S(z), while PL uses the product of the conditional densities.We have illustrated how the approximation error of PL increases for values of β above the critical point.This is due to long-range dependence between the latent labels.Composite likelihood methods such as Friel and Pettitt [2004] and Friel et al. [2009] can reduce the approximation error by computing the estimate using sub-lattices.It is possible that the real images that we used in this study have a more complex correlation structure than what can be captured by the Potts model.This could explain the differences in the PL estimates between the simulation study and the real data.
This paper can serve as a guide to the selection of a suitable algorithm for Bayesian image analysis.We have provided our implementation of the four algorithms as an R package so that readers can experiment using their own data.There are faster algorithms for fitting a hidden Potts model, such as variational Bayes (VB) [McGrory et al., 2009] and iterated conditional modes (ICM) [Besag, 1986], but these provide limited options for estimating the inverse temperature.We have shown that it is feasible to use MCMC for images of realistic size, given reasonable computational power by contemporary standards.This enables the posterior distribution of the inverse temperature to be estimated along with all of the other model parameters.

SUPPLEMENTARY MATERIAL
R package bayesImageS: R source package containing code to perform image segmentation using Markov chain Monte Carlo.Includes C++ implementations of all of the methods described in the article.(GNU zipped tar file) Simulation study: Additional figures and tables of results for the simulation study of Section 5.1.(PDF)

Figure 1 :
Figure 1: A satellite image and the distribution of pixel intensities.

Figure 2 :
Figure 2: An axial slice of a CT scan and the distribution of pixel intensities.

Figure 3 :
Figure 3: Distribution of the sufficient statistic of the Potts model for increasing values of the inverse temperature β, the number of pixels n, and the number of unique labels k.The expectation and standard deviation of S(z) were calculated exactly, using a brute force method.
Standard deviation for a 3D Potts model with k = 9.

Figure 6 :
Figure 6: Approximation of E z|β [S(z)] by simulation for fixed values of β, with linear interpolation.

Figure 7 :
Figure 7: Relationship between elapsed runtime and image size for pseudolikelihood (PL), the approximate exchange algorithm (MAVM), path sampling (TI), and approximate Bayesian computation (ABC).Both axes are on a logarithmic scale.

Figure 8 :
Figure 8: MCMC sampling error for β in simulated data for 20 images with 2 dimensions, k = 3 and n = 10 6 .The diagonal, dashed line indicates the true value of β, while the vertical, dotted line marks the critical point.

Table 1 :
Scale of common types of images.southby 183km east-west.Tomographic reconstructions such as CT scans are usually represented as 3D image stacks, with 512 × 512 pixels per slice.The pixel resolution and slice width varies depending on the clinical protocol.
Simoncelli [1999] HD video is typically 1920 × 1080 pixels with 24 frames per second (fps), although higher frame rates and resolutions (such as ultra high definition, UHD) are available.The volumes of data involved in video processing necessitate specialised methods, which are beyond the scope of this paper.For examples of Bayesian methods for video analysis, seeSimoncelli [1999], Minvielle

Table 2 :
Results for the satellite image of Brisbane, Australia.beduetothe level of noise in the image.The algorithms produced an accurate estimate for another image with β = 0.033 and much lower levels of noise.We did not observe this behaviour with any of the 3D images, nor with the smaller 2D images.We did, however, observe the errors increasing for values of β above the critical point, indicated by the vertical line in Figure8.In the worst case, PL underestimated β by 0.61, with a 95% highest posterior density (HPD) interval of [0.653; 0.706] when the true value of β was 1.289.MAVM underestimated β for the same image by 0.215 and TI was out by 0.245.ABC was the most accurate for that image, with an error of less than 0.01.Nevertheless it produced erroneous estimates for other images, in one case underestimating β by 0.107 and overestimating for another image by 0.150.This simulation study has shown that all four algorithms are prone to getting stuck in low-probability regions of the parameter space.