Vector operations for accelerating expensive Bayesian computations -- a tutorial guide

Many applications in Bayesian statistics are extremely computationally intensive. However, they are often inherently parallel, making them prime targets for modern massively parallel processors. Multi-core and distributed computing is widely applied in the Bayesian community; however, very little attention has been given to fine-grain parallelisation using single instruction multiple data (SIMD) operations, which are available on most modern commodity CPUs and are the basis of GPGPU computing. In this work, we practically demonstrate, using standard programming libraries, the utility of the SIMD approach for several topical Bayesian applications. We show that SIMD can improve floating point arithmetic performance, resulting in up to $6\times$ improvement in serial algorithm performance. Importantly, these improvements are multiplicative to any gains achieved through multi-core processing. We illustrate the potential of SIMD for accelerating Bayesian computations and provide the reader with techniques for exploiting modern massively parallel processing environments using standard tools.


Introduction
Practical applications in Bayesian statistics are computationally challenging since the posterior density is only known up to a normalising constant. Therefore, advanced Monte Carlo schemes are often required. Intractable likelihoods further compound these computational burdens [48]. In addition to posterior sampling, there are other computational challenges in Bayesian statistics. For example, it may be necessary to ensure priors are only weakly informative when they are selected from a family of priors indexed by a hyperparameter [16,19].

An introduction to code optimisation
Here, we introduce CPU computing concepts essential for optimisation. We avoid technological details in favour of simple tools and rules-of-thumb for practitioners.

Vectorisation with SIMD
CPU cores are sequential, performing a single operation, such as arithmetic or reading/writing data, per clock cycle. Without optimisations, a loop such as that in Figure 1(a) executes in this sequential fashion. Vector arithmetic is instead implemented using vector processing units (VPUs) that execute in SIMD. The VPU inputs are wider than the scalar arithmetic units: a VPU that accepts 256 bit inputs can store eight single precision (32 bit) or four double precision (64 bit) floating point numbers. VPUs perform arithmetic element-wise on input vectors, as shown in Figure 1(b). Acceleration through SIMD is multiplicative with gains achieved through multithreading. Using the 18 threads and 512 bit vectors of an Intel Xeon Gold 6140 CPU, one could theoretically perform up to 18 × (512/64) = 144 times as many double precision operations per second as sequential processing with scalar operations [53].

Vectorisation and multithreading with OpenMP
VPUs may be accessed in a variety of ways. A simple approach is OpenMP, an open standard for automated parallel computing. OpenMP supports parallel computing through directives within standard C/C++ [55]. The directive #pragma omp simd (Figure 2(a)) guides the compiler to utilise VPUs.
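The pattern the directive applies to is a plain loop annotated with `#pragma omp simd`. A minimal sketch (the `axpy` name and example loop are illustrative, not taken from the paper's listings):

```c
#include <stddef.h>

/* axpy: a[i] += s * b[i]. The simd directive asks the compiler to map
 * the loop body onto VPU (vector) instructions; compile with, e.g.,
 * gcc -fopenmp-simd or icc -qopenmp-simd. Without those flags the
 * pragma is ignored and the loop runs as ordinary scalar code. */
void axpy(double *a, const double *b, double s, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        a[i] += s * b[i];
}
```

The result is identical either way; only the number of elements processed per instruction changes.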
The directive #pragma omp parallel tells the compiler to create a worker pool of threads within the next code block. Within a parallel block, the directive #pragma omp for before a loop causes iterations to be distributed across these threads. Figure 2 provides example usage of these directives. Some rules-of-thumb for their use are:
1. Use simd (Figure 2(a)) when the loop body only uses arithmetic and standard mathematical functions (e.g. sin, cos, exp or pow).
2. Use parallel for (Figure 2(c)) when loop iterations are independent but conditionals cause iterations to perform different calculations. The loop body can be arbitrarily complex, with calls to user defined functions.
3. A simd (Figure 2(a)) loop can be included within the body of a parallel for (Figure 2(c)) loop, but the converse is not true.
4. The full parallel and for (Figure 2(b)) construct can be useful to more explicitly control the behaviour of individual threads through the use of OpenMP functions.

The above guidelines are not all strict rules; for example, user defined functions may be called within a simd loop provided the function has been appropriately constructed and the declare simd pre-processor directive is used. Many other features are also available within OpenMP, and we refer the reader to the specifications for details. [54] and [55] also provide practical information on parallel computing for HPC. Alternatives to OpenMP are discussed in Section 5.

Memory access and alignment
To fully exploit SIMD, it is crucial to manage memory access. Commodity random access memory (RAM) bandwidth is around 20 GB/sec, whereas the VPU can process floating point data at a rate of 600 GB/sec. That means only 3% VPU utilisation is possible for data retrieved from RAM. CPUs avoid this bottleneck with a hierarchy of memory caches, typically with three levels: L1, L2, and L3. Lower cache numbers correspond to higher bandwidth but smaller capacity. L1 cache can keep the VPU close to 100% utilised, but is typically less than 30 KB. When data are requested for computation, the caches are tested in order. If the data are in L1 cache, then there is no memory access penalty. If not in L1 cache, then L2 is checked, and so on. RAM is accessed only when the data are not in any cache, which is called a cache miss. For efficiency, data arrays should be accessed in patterns that minimise cache misses:
• Aim to reuse data soon after it was last accessed. If a partial calculation is pushed out of cache, then it can be faster to re-compute it than to access the result from RAM.
• Avoid random memory access patterns in favour of regular access patterns, such as a[i], then a[i+n], then a[i+2*n] (access matrices according to their storage format: row-major in C/C++, so row-by-row; column-major in R, so column-by-column). Ideally, n = 1 or is small to avoid cache churn.
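The stride guidance above can be made concrete with a small example. Both functions below compute the same sum of a row-major n × n matrix; only the traversal order differs, and for large n the stride-1 version is markedly faster (the function names are illustrative):

```c
/* Row-by-row traversal of a row-major matrix: consecutive iterations
 * touch consecutive addresses (stride 1), so each loaded cache line
 * is fully used before eviction. */
double sum_stride1(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Column-by-column traversal of the same matrix: stride-n access
 * touches a different cache line on every iteration, churning the
 * cache when n is large. */
double sum_stride_n(const double *a, int n) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}
```

Both return identical results; the difference is purely in memory traffic.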
Memory alignment is also important for SIMD. Memory is partitioned into blocks called cache lines; for many recent Intel CPUs these are 64 bytes in length (the CPU-Z utility, https://www.cpuid.com, is a useful tool to identify cache line sizes). When the CPU loads data at an address, the entire cache line is loaded. When doing SIMD, the first byte of a data array should be aligned with a 64-byte cache line boundary, otherwise some vectors will cross cache lines. Performing SIMD on unaligned data incurs a performance penalty, or may fail completely for some older standards. In practice this means:
• Use either _mm_malloc (Intel) or memalign (GNU) in place of the default malloc, whose alignment on most systems defaults to 8 or 16 bytes.
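For readers without the Intel or GNU allocators, a portable way to obtain cache-line-aligned arrays is C11 aligned_alloc, shown here as a stand-in (note it requires the allocation size to be a multiple of the alignment):

```c
#include <stdlib.h>

/* Allocate n doubles starting on a 64-byte cache-line boundary.
 * The requested size is rounded up to a multiple of 64 bytes, as
 * required by C11 aligned_alloc. Release with free(). */
double *alloc_aligned_doubles(size_t n) {
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + 63u) & ~(size_t)63u;  /* round up to 64 */
    return (double *) aligned_alloc(64, rounded);
}
```

Arrays allocated this way can safely be used with the OpenMP aligned clause, e.g. `#pragma omp simd aligned(a:64)`.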
Efficient use of cache and memory access patterns is a large topic. See further discussion in [12] and [20].

Performance analysis
Code optimisation effort should be informed by performance analysis to ensure reductions in computation time are worth the increase in development time. It is common for this to be overlooked in practice [29,34]. Some guidelines to assist are: 1. Use profiling and static code analysis software, such as Intel VTune Amplifier and Intel Advisor, to obtain performance statistics on timing and memory bottlenecks. These tools are also essential to investigate and control compiler output.
2. Estimate the theoretical peak performance for your hardware. Compare with an estimate of the number of floating point operations your algorithm will perform, on average. Use this to estimate the minimum possible runtime [29].
3. Most algorithms require some sequential operations, and not all calculations can be efficiently mapped to SIMD. Amdahl's law [1] states that the speedup factor, s, is bounded by $s \leq (C_S + C_P)/(C_S + C_P/P)$, where $C_P$ is the sequential runtime of code that can be parallelised, $C_S$ is the runtime of code that must remain sequential, and P is the core count or vector length. Thus, $s = (C_S + C_P)/C_S$ is the maximum speedup as P → ∞.
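Amdahl's bound in item 3 is easy to evaluate when planning an optimisation effort. A small helper (illustrative, not from the paper's code):

```c
/* Amdahl's law: speedup bound s = (C_S + C_P) / (C_S + C_P / P) for
 * sequential runtime c_s, parallelisable runtime c_p and p cores or
 * vector lanes. As p grows, the bound approaches (c_s + c_p) / c_s. */
double amdahl_speedup(double c_s, double c_p, double p) {
    return (c_s + c_p) / (c_s + c_p / p);
}
```

For example, code that is 90% parallelisable (c_s = 1, c_p = 9) is bounded by a 10× speedup no matter how many cores or lanes are used.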

A note on random number generation
When combining statistics with parallel computing it is important to consider initialisation and usage of random number generators (RNGs). Dealing with these aspects poorly can lead to invalid results. In particular, the use of different seed values (one per thread) for the same sequential RNG should be approached with caution. For example, sequences of Linear Congruential Generators can become correlated if a linear sequence of seeds is used [13].
Broadly there are three approaches to deal with this: (i) generate all required randomness serially and distribute among parallel processes; (ii) use a random number generator that can be split into independent sub-streams, for example a generator that supports Skip Ahead or Leapfrog methods; or (iii) use a sequence of random number generators that are statistically independent for the same seed value. We utilise the last option through the VSL_BRNG_MT2203 generator family (available within Intel MKL) that provides statistically independent Mersenne Twister generators. See [10] and [33] for details.
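The danger of sequential seeds mentioned above can be made concrete for a linear congruential generator (LCG). Two streams of the LCG x_{n+1} = (a x_n + c) mod 2^31 seeded with s and s+1 differ by exactly a^n mod 2^31 at step n, so one stream is a deterministic function of the other. The following sketch (using the classic glibc-style constants for illustration) verifies this:

```c
#include <stdint.h>

/* One step of the LCG x_{n+1} = (a x + c) mod 2^31, with the classic
 * constants a = 1103515245, c = 12345. */
static uint32_t lcg_step(uint32_t x) {
    return (1103515245u * x + 12345u) & 0x7fffffffu;
}

/* Difference (mod 2^31) between the streams seeded s and s+1 after
 * n steps. */
uint32_t lcg_stream_diff(uint32_t s, int n) {
    uint32_t x = s, y = s + 1u;
    for (int i = 0; i < n; i++) { x = lcg_step(x); y = lcg_step(y); }
    return (y - x) & 0x7fffffffu;
}

/* a^n mod 2^31: the predicted difference, independent of the seed. */
uint32_t lcg_pow_a(int n) {
    uint32_t p = 1u;
    for (int i = 0; i < n; i++) p = (p * 1103515245u) & 0x7fffffffu;
    return p;
}
```

Because the difference between the two "independent" streams is the same whatever seed is used, adjacent seeds give perfectly dependent streams, exactly the failure mode that [13] warns about.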

Summary
This introduction to vectorisation, multithreading, memory usage, code analysis, and RNGs is not specific to Bayesian statistics. However, the ideas are important to efficiently implement steps within the increasingly computationally expensive algorithms that form a core part of modern practical Bayesian techniques. In Sections 3 and 4 we demonstrate the application of these guidelines through a detailed tutorial and a number of case studies in settings of direct interest to Bayesian practitioners.

A practical tutorial demonstration for R users
In this section, we provide a practical demonstration of the computational benefits of directly accessing CPU SIMD operations for approximate Bayesian computation (ABC) based inference. We begin with R implementations, step through relevant optimisation within R, and then demonstrate the computational advantages of direct SIMD access using C and OpenMP.

Prior predictive sampling for approximate Bayesian computation
ABC techniques are powerful for inference with intractable likelihood functions [48] and are routinely used to study complex stochastic models [46,51]. The most accessible algorithm is ABC rejection sampling. Given observed data, D_obs, the parameter prior distribution, π(θ), a discrepancy metric, ρ, and a vector of sufficient (or informative) summary statistics, S(D_obs), ABC rejection sampling generates approximate posterior samples by first generating artificial data from the prior predictive distribution, that is, θ* ∼ π(θ) followed by D* ∼ s(D; θ*), where s(D; θ) is the probability density of the data generation process for a fixed parameter vector θ in parameter space Θ. Then a small proportion of samples are accepted to form an approximation to the posterior, using a sufficiently small acceptance threshold ε. This procedure is presented in algorithmic form in Algorithm 1.

Algorithm 1: ABC rejection sampling
1: repeat
2:   Draw θ* from the prior π(θ);
3:   Use stochastic simulation to generate prior predictive data D* ∼ s(D; θ*);
4: until ρ(S(D_obs), S(D*)) ≤ ε;
5: Accept θ* as an approximate posterior sample;

In practice, ABC rejection sampling is rarely implemented in this direct manner (Algorithm 1 is serial, and produces a random number of candidate samples). Rather, it is common to generate a fixed number, N, of prior predictive joint samples, (θ*, D*). The acceptance threshold, ε, is then selected a posteriori based on the empirical distribution of the discrepancy metric, ρ. See e.g. [17] and [58] for detailed reviews of Monte Carlo algorithms for ABC. ABC-based Monte Carlo estimators converge slowly in mean-square [3], and therefore require substantial numbers of stochastic model simulations for reliable inference.

Example model: a genetic toggle switch
We consider a genetic toggle switch model [9]. Let u_i(t) ≥ 1 and v_i(t) ≥ 1 represent the expression levels of genes u and v at time t for cells i = 1, 2, . . . , C. Gene expression evolves according to two coupled stochastic differential equations (SDEs) (Equation (3)), in which parameters α_u, β_u, α_v and β_v define gene inhibition, and W_{i,u}(t) and W_{i,v}(t) are independent Wiener processes. The observation process (Equation (4)) yields noisy measurements of the expression levels at the observation time T, where μ, σ and γ control the error rate and η_i are standard normal random variables. The data are the measurements for all cells, {y_i}_{i=1}^C. Sampling the prior predictive distribution is needed to infer the parameters θ = (μ, σ, γ, α_u, α_v, β_u, β_v)′ using ABC. We adopt independent uniform priors as chosen by Bonassi et al. [9]: μ ∼ U(250, 400), σ ∼ U(0.05, 0.5), γ ∼ U(0.05, 0.35), α_u, α_v ∼ U(0, 50), and β_u, β_v ∼ U(0, 7). We follow Vo et al. [56] in adopting 19 equally spaced quantiles of the empirical distribution of {y_i}_{i=1}^C as the vector of summary statistics.

Implementation and optimisation using R
To simulate the toggle switch system (Equation (3)) and observation process (Equation (4)) we can use the Euler-Maruyama scheme [39], where ∆t is the discretisation step and ξ_{i,u}(t) and ξ_{i,v}(t) are independent normal random variables. For simplicity, we will assume ∆t = 1. This can be implemented directly using R as shown in Listing 1. Here, the pair of SDEs for u_i(t) and v_i(t) are evolved one cell at a time, for i = 1, 2, . . . , C. Such a naïve implementation is typical for initial prototypes, but it is profoundly inefficient, with a single simulation with T = 600 and C = 8000 taking 35 seconds. This is not practical for ABC sampling, even with individual simulations distributed across an HPC cluster.

Listing 1: Naïve R implementation of the toggle switch model.

One can achieve substantial computational improvements by re-writing the R code in terms of vector and matrix mathematics. This exploits optimised linear algebra libraries, such as BLAS and LAPACK, that often use SIMD. Listing 2 is an example of vectorised R code for the toggle switch model.
The key changes are: 1) genes of all cells are stored in vectors of length C; 2) all Gaussian random variates required for the entire simulation are generated in a single call of rnorm(); 3) all genes are simulated together using R's vector maths functions; and 4) logical indexing is used instead of ifelse() to implement the boundary condition. The improvements are substantial, taking 1.3 seconds to perform a single simulation with T = 600 and C = 8000. This is more than a 25× improvement, without any multithreading, explicit SIMD operations, or Rcpp code. See the excellent text by Gillespie and Lovelace [21] for details.

Listing 2: Optimised R implementation of the toggle switch model.
The purpose of comparing Listings 1 and 2 is to highlight the importance of optimising serial code. Now that the R implementation is efficient, we can distribute the process across multiple cores using the doParallel package, as shown in Listing 3. Distributing N = 8064 draws from the prior predictive across 16 cores (Intel Xeon E5-2680v3 processor) takes just over 11 minutes! For some applications, optimised R can be fast enough, and we do not advocate the use of further optimisation for all Bayesian applications. However, the trade-off between development time and runtime is very application specific. For this example, even using 16 cores, the optimised R code will take around 24 hours to draw N = 1,000,000 prior predictive samples, which may be insufficient for ABC inference with small ε. To motivate progressing beyond R, note that the indirect access of SIMD through optimised BLAS and LAPACK libraries has limitations in terms of cache utilisation for this application. This is highlighted in the next section.

Optimisation using C and SIMD operations
Direct access to SIMD through C with OpenMP enables superior memory access patterns that ensure substantially higher utilisation of the VPUs (see Section 5 for alternatives). While the optimised C implementations presented here are more complex than the optimised R implementation (Listing 2), they provide a practical example that will help reduce the learning curve for practitioners.
Listing 4 provides the main structure of the C program. Note the inclusion of the header files for the MKL libraries and the OpenMP library; we will point out calls to these libraries as necessary. Definitions of the constants VECL and ALIGN refer to the width of the SIMD vectors and the memory alignment that ensures vectors never cross cache lines (see Section 2). The functions simulate_tsw_SDE and main represent the C equivalents of the R code in Listings 2 and 3. C implementations are given in Listings 5 and 6, the details of which are discussed below. The optimised C implementation of the toggle switch model is given in Listing 5. From an algorithmic perspective, this code can be considered a hybrid between the two R implementations (Listings 1 and 2). That is, cells are evolved in small blocks of length VECL, as opposed to individually (Listing 1) or all together (Listing 2). The outer loop iterates over the cell index c in strides of VECL (the E5-2680v3 CPU supports 256 bit vectors, giving VECL = 4 doubles). As in the optimised R code (Listing 2), all the random variates required within each block are pre-computed using the vdRngGaussian function from MKL. In each block there are VECL noisy observations, y[c], y[c+1], . . . , y[c+VECL-1], to be computed (one per cell). Using SIMD we can evolve the SDE pair associated with the y[c] observation and obtain the others in the same block at the same time. This is done with a second loop over the index c2 that represents the position in the SIMD vector. Note the use of the aligned statement within the simd directive. This allows the compiler to assume the arrays zeta (random variates) and y (simulated data) are aligned to the cache boundary, and enables more efficient machine code to be generated. The SIMD loop is not really iterating over values of c2, but rather synchronously executing each statement within the loop for all values of c2 concurrently. This leads to exceptional reuse of L1 cache: since the SDE state variables u_t and v_t are updated in-place, all computation is performed using the fast CPU memory.
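The blocked update pattern described above can be sketched in a few lines. The drift here is a placeholder (the actual toggle switch drift is given by Equation (3), not reproduced here), so this illustrates only the memory and SIMD structure of Listing 5:

```c
#define VECL 4   /* doubles per 256 bit SIMD vector */

/* Evolve one block of VECL cells for nsteps Euler-Maruyama steps
 * (dt = 1). zeta holds pre-generated N(0,1) variates laid out so
 * that consecutive lanes are adjacent in memory; u is updated
 * in-place so the block's state stays resident in L1 cache. */
void evolve_block(double *restrict u, const double *restrict zeta,
                  int nsteps) {
    for (int t = 0; t < nsteps; t++) {
        #pragma omp simd aligned(u, zeta : 64)
        for (int c2 = 0; c2 < VECL; c2++) {
            double drift = -0.1 * u[c2];             /* placeholder drift */
            double x = u[c2] + drift + 0.5 * zeta[t * VECL + c2];
            u[c2] = (x < 1.0) ? 1.0 : x;             /* boundary at 1 */
        }
    }
}
```

The aligned clause assumes u and zeta were allocated on 64-byte boundaries, as discussed in Section 2.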
It is possible to do even better, since pre-computing all the Gaussian random variates for each block itself represents a trade-off. To understand this, note that advanced interfaces, such as compiler intrinsic functions (see discussion in Section 5), can enable efficient RNG sampling within the rest of the Euler-Maruyama update. However, this cannot be done with OpenMP, as the function vdRngGaussian has overheads that dominate for very small sample sizes. Conversely, generating all the Gaussian random variates at the start with a single vdRngGaussian call forces the CPU to access L2 cache after a few blocks have been processed, resulting in a loss of performance. The per-block trade-off in memory access presented in Listing 5 cannot be efficiently replicated in R: using BLAS and LAPACK functions for small vectors results in similar performance issues as for the vdRngGaussian function, so the best that can be done in R is to use long vectors to evolve all cells together. Armed with a highly efficient SIMD implementation of the model, we can generate prior predictive samples across multiple CPU cores. This uses the multithreading pre-processor directive, #pragma omp parallel, as shown in Listing 6. Each thread is provided with its own random number stream using the Intel MKL/VSL function vslNewStream, with independence guaranteed through the VSL_BRNG_MT2203 generator family (see Section 2). The memory allocated to store prior samples, theta, and simulation outputs, obs_vals, is shared across threads using the shared clause, and thread specific offsets are computed using the OpenMP functions omp_get_thread_num() and omp_get_num_threads(). The loop over index k performs the prior predictive sampling tasks assigned to each thread. Unlike the R implementation (Listing 3), we explicitly partition the work amongst the available threads to allow reuse of the memory allocated for the Gaussian random variates, zeta. OpenMP does support various forms of automated scheduling of thread tasks, but for this example explicit partitioning is more straightforward.
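The explicit partitioning described for Listing 6 follows a standard pattern: each thread derives its own contiguous task range from its thread id. A sketch (function names illustrative; the stubs make the code fall back to a single thread when compiled without OpenMP):

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void)  { return 0; }  /* serial fallback */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Compute the calling thread's contiguous task range [*lo, *hi) when
 * n tasks are split as evenly as possible across the thread team. */
void partition_range(int n, int *lo, int *hi) {
    int tid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    int chunk = n / nth, rem = n % nth;
    *lo = tid * chunk + (tid < rem ? tid : rem);
    *hi = *lo + chunk + (tid < rem ? 1 : 0);
}

/* Example use inside a parallel region: each thread sums its own
 * contiguous block, and the reduction combines per-thread totals. */
double sum_parallel(const double *x, int n) {
    double total = 0.0;
    #pragma omp parallel reduction(+ : total)
    {
        int lo, hi;
        partition_range(n, &lo, &hi);
        for (int k = lo; k < hi; k++)
            total += x[k];
    }
    return total;
}
```

Because each thread's range is contiguous, a single aligned scratch buffer (such as the zeta array in Listing 6) can be reused across all of that thread's assigned tasks.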

Benchmarks
We compare the R implementations (Listings 1, 2 and 3) with the C implementation (Listings 4, 5 and 6) in terms of the time taken to generate a fixed number, N, of prior predictive samples given P cores. The fastest sampler would, in turn, produce a smaller Monte Carlo error for a fixed computational budget. For example, the standard central limit theorem suggests a 4× speedup will result in roughly half the Monte Carlo error for the same computational budget. In an ABC setting there is also bias to consider, so reducing the model simulation time is absolutely crucial since the acceptance threshold ε must be sufficiently small.
The codes are benchmarked using two Intel Xeon E5-2680v3 (Haswell) processors or a single Intel Xeon Gold 6140 (Skylake) processor. The E5-2680v3 supports the AVX2 instruction set (256 bit vectors) and the 6140 supports the AVX512 instruction set (512 bit vectors). Both processor architectures have similar serial performance, with cores clocked at around 2.5 GHz. We benchmark with different numbers of cores, P = 1, 2, 4, 8, 16, for each implementation. For each value of P the benchmark is repeated four times, to obtain mean computation times Ĉ_T, with each replicate generating N = 8064 prior predictive samples. For fairness, we compiled R (version 3.3.1) from source using the Intel compiler suite and linked against MKL for optimised BLAS and LAPACK routines. The Intel C compiler version is 17.0.1 (compatible with the GNU C compiler version 6.3.0). To demonstrate the difference between the speed-up obtained from improved memory utilisation and that obtained from SIMD, we also compile a version of the C implementation with SIMD disabled using the -no-vec compiler option. Figure 3 summarises the results in terms of runtimes Ĉ_T (Figure 3(a),(c),(e)) and speed-up factors Ĉ_B/Ĉ_T for various serial baselines Ĉ_B (Figure 3(b),(d),(f)). See the supplementary material for precise data tables of runtimes. Using the naïve R code as the baseline (Figure 3(a)-(b)), a speedup of 16× is possible with P = 16, as one would expect. However, this pales by comparison with the more than 430× speedup using the optimised R code and P = 16, demonstrating the value in optimising serial code before implementing parallelism. The improvement is even greater for the C code, with speedups of more than 1000× (no vectors, P = 16), 3800× (AVX2 vectors, P = 16), and 4300× (AVX512, P = 16). These impressive numbers are against a naïve baseline, and more sensible numbers are obtained using the optimised R code as a baseline (Figure 3(c)-(d)).
Here, the optimised C code achieves 130× speedup (AVX2, P = 16) and 150× speedup (AVX512, P = 16), compared with only 16× speedup for the optimised R code with P = 16. Finally, even with the C code as a baseline (Figure 3(e)-(f)), SIMD enables 30× speedup (AVX2, P = 16) and almost 40× speedup (AVX512, P = 16). In this case, the speedup from multithreading alone is only 13× for P = 16. There are a few possible causes, but it is likely that each thread does not perform enough work to mask threading overheads. P ≤ 4 seems to provide a large enough workload per thread, leading to 15× speedup (AVX2, P = 4) and 20× speedup (AVX512, P = 4).

Summary
This tutorial has demonstrated the computational benefits of direct access to SIMD operations in comparison to indirect access through pre-compiled BLAS and LAPACK libraries available in languages such as R and Matlab. Our C implementation (Listings 5 and 6) is more than 20× faster than our optimised R implementation (Listings 2 and 3) for the same CPU hardware and number of available cores, P. Restricting comparison to the C implementation, SIMD operations can improve the performance of prior predictive sampling for ABC by a factor of more than 6× compared with standard scalar implementations. Given the enormous number of prior predictive samples required to obtain meaningful parameter estimates with ABC [3], we submit that these techniques are highly relevant for practitioners dealing with such applications. All codes presented here are available as supplementary material.

Case studies
In this section, we explore the benefit of SIMD operations in combination with multithreading using two case studies relevant to Bayesian practitioners.

Case study 1: Assessing weak informativity of priors

[Algorithm 2, the evaluation of the weak informativity test, is not reproduced in full here; for each hyperparameter λ_k it generates N datasets from the prior predictive distribution, computes quantities p_k^1, p_k^2, . . . , p_k^N from the estimated evidence, and computes γ_{λ_k,α} as the α quantile of p_k^1, p_k^2, . . . , p_k^N.]

This process is extremely computationally intensive, since the evidence term, π(D | λ), must be evaluated for many different D. Adaptive sequential Monte Carlo (SMC) sampling (Algorithm 3) using likelihood annealing and Markov chain Monte Carlo (MCMC) proposals [5,11] is used to estimate π(D | λ). Algorithm 2 requires 2KN executions of the adaptive SMC sampler (Algorithm 3) with N_p particles and tuning parameters c, related to the number of MCMC iterations, and h, related to the scaling of random walk MCMC proposals.

[Algorithm 3, the adaptive SMC sampler with likelihood annealing, is not reproduced in full here; at each iteration n it selects the next annealing temperature and constructs a tuned MCMC proposal kernel q(u | v) = φ(u; v, h²Σ̂_n), where Σ̂_n is the sample covariance matrix of the particle population {θ_n^(j)}_{j=1}^{N_p}.]

Parallelisation and SIMD opportunities
Parallel implementations of SMC samplers using multithreading require thread synchronisation for both the resampling and annealing steps [29,33,41]. Thus, it is more beneficial to distribute the K hyperparameter values in Algorithm 2 across P cores, with each thread computing the α quantile for K/P hyperparameters, since these computations may be performed independently. For every hyperparameter, we sequentially process the N p-value computations. The SMC sampler (Algorithm 3) can be accelerated through SIMD. SMC samplers are well suited to SIMD since synchronisation is maintained automatically. While many aspects of Algorithm 3 use SIMD in the code example, we focus here on SIMD for the MCMC proposal kernel used to diversify particles (Steps 11-17 of Algorithm 3), as it has the greatest effect on performance.
The strategy is similar to the C implementation of the toggle switch model (Section 3). At SMC step n, each particle is perturbed via R_n MCMC steps. The same operations are performed at each MCMC iteration, with the exception of a single branch operation that arises from the accept/reject step. We process the particle updates in blocks of length V and evolve the R_n MCMC steps for each block together using SIMD. All of the Gaussian and uniform random variates required for a block are generated together before the block is processed. The SIMD MCMC proposals are performed as in Algorithm 4.

[Algorithm 4, the SIMD MCMC proposal step, is not reproduced in full here; for each block it generates vectorised proposals θ*_{1:V} ← θ_n^{(i:i+V)} + ξ^r_{1:V} and evaluates the V acceptance ratios element-wise using vectorised density evaluations.]

In practice, Algorithm 4 works with the likelihood and prior on the log scale to avoid numerical underflow. For optimal performance, SIMD forms of the likelihood L_V, prior density π_V and Gaussian proposal density φ_V are required. These are available for the logistic regression model, Gaussian priors and proposals used here. Other aspects of Algorithm 3 that can utilise SIMD include likelihood evaluations and the computation of the effective sample size (ESS) and weight updates.
The scalar bottleneck is the multinomial resampling step, which utilises the look-up method. Parallel approximations for resampling have been proposed [41] and demonstrated to be very effective for large scale SMC samplers. Since we focus here on a SIMD implementation of SMC, we do not implement this; however, extending [41] to SIMD is an important piece of future work.

Performance
We test the performance improvement obtained through vectorisation and multithreading using the Xeon E5-2680v3 and Xeon Gold 6140 processors. The evaluation of the weak informativity test in Algorithm 2 is performed for K = 400 hyperparameters and N = 400 datasets. In all simulations, the K hyperparameter values are generated using a bivariate uniform distribution λ k ∼ U([0.1, 10]×[0.1, 20]) for k = 1, 2, . . . , K. The SMC sampling is performed with N p = 500 particles, with the lower particle count enabling computation to remain largely within L1 and L2 cache. Tuning parameter values are specified as c = 0.01 and h = 2.38/ √ 2. Results are provided in Figure 4 (see supplementary material for data tables). Figure 4(b) shows almost perfect speedup from multithreading and consistent improvement due to SIMD when comparing scalar and SIMD performance for the same CPU and core count (2.1× for AVX2 and 2.3× for AVX512). Diminishing returns are observed when stepping from 256 bit vector operations to 512 bit SIMD. In our bioassay example, the MCMC proposal kernel performs a small number of steps (usually no more than R n = 30) and the number of particles in the SMC sampler is N p = 500. The performance will improve for SIMD in cases where longer MCMC runs are required, since the MCMC step will dominate SMC iterations and more reuse of L1 cache will occur.

Summary
We have presented the more challenging problem of weak informativity tests over a family of priors π(θ | λ) with respect to a base prior π(θ | λ_0). This requires a large number of approximations of the posterior normalising constant for different datasets from the prior predictive distribution. The traditional approach of parallelising each SMC iteration would reduce the level of parallelism available across hyperparameters. Using the SIMD implementation of SMC, we double our computational performance and can reserve multithreading for distributing the hyperparameters. A combination of multithreading and SIMD could be implemented to improve the performance of a single SMC step; however, this offers little benefit here, since utilising threads for SMC forces hyperparameters to be processed serially. If HPC resources are available then hyperparameters could be distributed across servers, leaving both threading and SIMD for the SMC steps.

Case study 2: Parameter inference for a non-Gaussian asymmetric volatility model
We now consider an adaptive SMC sampler to perform parameter inference in eleven dimensions for the "bad environment - good environment" (BEGE) model of innovations on stock market returns [4,50].
We use S&P Composite Index returns over the period July 1926 to January 2018 (obtained from the Center for Research in Security Prices), consisting of T = 1099 months of logged monthly dividend-adjusted returns, D = R_obs,T = {r_obs,t}_{0≤t≤T}. Using these data, Bayesian inference is performed on the unknown parameters θ = (p_0, σ_p, ρ_p, φ_p^+, φ_p^−, n_0, σ_n, ρ_n, φ_n^+, φ_n^−, μ). Priors are adopted from [50]. The major challenge in this inference problem is the computational cost associated with evaluating the log-likelihood function, log L(θ; D) = log L(θ; R_obs,T) = Σ_{t=1}^{T} log π(r_obs,t | r_obs,t−1, θ).
Bekaert et al. [4] show that evaluation of the transitional densities, π(r_obs,t | r_obs,t−1, θ), requires the so-called BEGE density π_BEGE(u_t | σ_p, σ_n, p_t−1, n_t−1). This can be approximated by computing the BEGE distribution function and taking finite differences [4]. The BEGE distribution function (Equation (15)) is an integral involving G_Γ(ω_p,t − u_t | n_t−1, σ_n) = 1 − F_Γ(ω_p,t − u_t | n_t−1, σ_n), where π_Γ(· | k, s) and F_Γ(· | k, s) are, respectively, the probability density and distribution functions of a centered gamma distribution with shape k and scale s. The integral in Equation (15) can be approximated numerically using quadrature. For each point in the discretisation of the ω_p,t parameter space, we require two evaluations of the regularised incomplete gamma function (Equation (16)), P(a, x) = (1/Γ(a)) ∫_0^x t^{a−1} e^{−t} dt, where Γ(a) is the gamma function. Equation (16) can be computed using a power series [44]. We apply an adaptive SMC sampler to move N_p particles from the prior π(θ) to the posterior π(θ | R_obs,T) under the BEGE model. This is very similar to the SMC sampler applied in Section 4.1 (Algorithm 3). The main difference is the scaling rule for the proposal kernel within the MCMC step, as the posterior is highly non-Gaussian. We apply the method of [47] to evaluate a set of MCMC trials, each with a random scale factor h ∈ {0.1, 0.2, . . . , 1.0}, at each SMC iteration. We then choose the scale factor, h_opt, which maximises the median expected squared jump distance across all particles. This continues until at least half of the particles have moved further than the median [47].

Parallelisation and vectorisation opportunities
In the BEGE model inference problem, we utilise both SIMD and multithreading to accelerate a single SMC sampler. Parallel implementations of SMC samplers and particle filters have been well studied in the literature [29,33,41]. Within a single SMC iteration, particles are completely independent of each other. However, as noted in Section 4.1, automatic synchronisation can occur within the MCMC proposal mechanism. Rather than distribute N p particles across P cores arbitrarily, we ensure that the distribution occurs in contiguous blocks of length V . Each core will process N p /(P V ) blocks of particles. All data associated with particles are processed in contiguous blocks of length V , allowing each thread to independently exploit SIMD within the MCMC proposal mechanism as per Algorithm 4.
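A minimal sketch of this blocked layout (not the full MCMC proposal of Algorithm 4), assuming one parameter per particle and precomputed random-walk increments; the function and array names are illustrative:

```c
#include <stddef.h>

/* N_p particles are processed in contiguous blocks of length v: blocks are
 * distributed across threads by OpenMP, and within each block the proposal
 * update is a SIMD candidate. theta holds current particle values, step the
 * precomputed proposal increments, prop the proposed values. Assumes v
 * divides n_p. */
void propose_blocked(size_t n_p, size_t v, const double *theta,
                     const double *step, double *prop) {
    #pragma omp parallel for
    for (size_t b = 0; b < n_p / v; b++) {   /* blocks across threads */
        const size_t off = b * v;
        #pragma omp simd
        for (size_t i = 0; i < v; i++)       /* SIMD within a block */
            prop[off + i] = theta[off + i] + step[off + i];
    }
}
```

The two pragmas make the multiplicative structure explicit: thread-level parallelism across blocks, vector-level parallelism within a block.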
Another way to exploit SIMD is in the evaluation of the BEGE log-likelihood. We extend the approximation of the integral in Equation (15). Consider a discretisation of $\omega_{p,t}$ of $N_\omega + 1$ nodes with spacing $\Delta\omega$. Then Equation (15) can be approximated by
$$F_{\text{BEGE}}(u_t \mid \sigma_p, \sigma_n, p_{t-1}, n_{t-1}) \approx \sum_{j=0}^{N_\omega - 1} \left[ F_\Gamma(\omega^{j+1}_{p,t} \mid p_{t-1}, \sigma_p) - F_\Gamma(\omega^{j}_{p,t} \mid p_{t-1}, \sigma_p) \right] G_\Gamma(\omega^{j}_{p,t} - u_t \mid n_{t-1}, \sigma_n), \tag{17}$$
where $\omega^j_{p,t} = \omega^0_{p,t} + j\Delta\omega$, for $j = 0, 1, \ldots, N_\omega$, and $\omega^0_{p,t}$ is the lower bound of the discretisation. For every point $\{\omega^j_{p,t}\}_{0 \leq j \leq N_\omega}$, the incomplete gamma function is evaluated twice; once with $p_{t-1}$ and once with $n_{t-1}$. Therefore, we can consider SIMD for the incomplete gamma function for blocks of $\omega_{p,t}$ points of length $V$, that is, $P_V(a, x_{1:V}) = [P(a, x_1), P(a, x_2), \ldots, P(a, x_V)]$. We achieve this by extending the method of [44], resulting in the element-wise vector series expression
$$P_V(a, x_{1:V}) = \frac{1}{\Gamma(a)}\, x_{1:V}^{\circ a} \circ \exp_V(-x_{1:V}) \circ \sum_{n=0}^{\infty} \frac{x_{1:V}^{\circ n}}{a(a+1)\cdots(a+n)}, \tag{18}$$
where $\circ$ denotes element-wise products and powers, $\exp_V(-x_{1:V}) = [e^{-x_1}, e^{-x_2}, \ldots, e^{-x_V}] \in \mathbb{R}^{1 \times V}$, and the $n = 0$ term of the series is $e_{1:V}/a$ with $e_{1:V} = [1, 1, 1, \ldots, 1] \in \mathbb{R}^{1 \times V}$. Equation (18) allows efficient iteration that reuses previous terms, thus enabling good use of L1 and L2 cache. The series is truncated once it has converged under the $\infty$-norm. We proceed to approximate $F_{\text{BEGE}}(u_t \mid \sigma_p, \sigma_n, p_{t-1}, n_{t-1})$ by: 1) applying Equation (18) across the discretisation, $\{\omega^j_{p,t}\}_{0 \leq j \leq N_\omega}$, in blocks of length $V$ with shape $p_{t-1}$; 2) applying Equation (18) across the discretisation in blocks of length $V$ with shape $n_{t-1}$; and 3) accumulating the sum of products in Equation (17).

Performance
We implement the BEGE inference problem using adaptive SMC with likelihood annealing with N p = 1024 particles. We approximate F BEGE (u t | σ p , σ n , p t−1 , n t−1 ) using Equations (17) and (18) with the discretisation {ω j p,t } 0≤j≤Nω , N ω = 100, ω 0 p,t = 10 −4 − p t−1 σ p , and ∆ω = (10σ p √ p t−1 − ω 0 p,t )/(N ω − 1), as is performed by [4] and [50]. Results are provided in Figure 5 . We observe an improvement of up to 2× for AVX2 and almost 4× for AVX512 regardless of the number of cores. This represents improved VPU utilisation for AVX512 compared with the weak informativity test (Section 4.1). That is, the BEGE distribution function approximation is a sufficiently large proportion of the total computational cost that efficient calculation can exploit the longer 512 bit vectors. However, the overall speedup factor from multithreading is lower and increases slowly with P . The SMC sampler requires threads to synchronise at the end of each iteration. This synchronisation occurs in the resampling and annealing steps, and in estimating the optimal MCMC proposal scaling; these steps could be causing bottlenecks. This can be seen in the diminishing returns on the parallel speed-up as the number of cores increases.

Summary
We have demonstrated how SIMD can be used to further accelerate a parallel SMC sampler for a challenging inference problem from econometrics. Note that the log-likelihood approximation we apply, based on the work of Bekaert et al. [4], is biased. Recently, Li et al. [35] proposed an unbiased likelihood estimator, for which there are also SIMD opportunities. For the purposes of this manuscript, we find the biased approximation of [4] lends itself to more direct discourse.

Discussion
Across each of our example applications there have been some common features, which allow several conclusions to be drawn regarding task suitability for vectorisation. Firstly, it is clear that the form of stochastic model under study can have a dramatic effect on the potential performance boost due to vectorisation. Simulation schemes such as Euler-Maruyama are ideal candidates for vectorisation, whereas exact stochastic simulation methods like the Gillespie direct method are more challenging. We show this effect in the supplementary material using the application of ABC to models of tuberculosis transmission [51] and observe only moderate improvements from SIMD. This highlights a difference between continuous time and discrete time Markov processes, and may motivate the practitioner to consider alternative algorithms that are more suited to SIMD, for example, the τ -leap approximations to the Gillespie algorithm. Similarly, the success of the SIMD version of the MCMC proposal kernel in Section 4.1 relied on a SIMD implementation of the likelihood function.
Secondly, each application involved nested parallelisation. The utility of SIMD here is that within each parallel thread, the computational tasks may be further sub-divided through use of VPUs. Efficiency gains due to SIMD are multiplicative to those arising from the use of thread parallelisation. Finally, each application involved a mixture of tasks that can be performed completely independently and tasks that require synchronisation or communication. This is important, since independent parallel computation is ideal for multithreading and fine grain parallel operations with frequent synchronisation are well suited for SIMD. Our demonstrations are widely applicable, since these three features are common to many Bayesian applications.
Although we have focused on the acceleration of Monte Carlo methods for Bayesian applications, likelihood-based and frequentist applications can also benefit. This is especially true for complex optimisations for maximum likelihood parameter estimates [29]. Furthermore, other Monte Carlo schemes can benefit from our approach. Any application that has been demonstrated using GPGPUs could exploit SIMD on the CPU [33,28]. This is rarely taken into consideration when comparisons between GPGPUs and CPUs are made [28,29,34].
We have provided example C programs, using OpenMP and MKL libraries. We appreciate that, for very good reasons, higher level languages such as Matlab and R are often the preferred environments for many practitioners. The implementation techniques we present can be exploited through Matlab C-MEX or Rcpp interfaces. However, Matlab and R must be configured correctly to use the required compiler options. Of course, matrix operations performed within high level languages, such as Matlab or R, are likely already utilising SIMD via high-performance BLAS and LAPACK libraries.
Unfortunately, many of the SIMD and memory optimisations we presented cannot be directly exploited using a high level language alone. This is demonstrated practically in Section 3. The only reliable way to access SIMD level parallelism from within a high level language is to use built-in matrix or vector functions that are already optimised (for example, by compiling R to use MKL). Memory access is still a significant challenge here. Between successive high level function calls, it is unlikely that the caches are preserved, and as a result, our SIMD versions of the Euler-Maruyama scheme and MCMC proposal cannot be directly replicated in Matlab or R alone. A notable exception to this rule is the Julia language [6], which uses the @simd macro to achieve high performance.
OpenMP is also not the only way to access SIMD within C/C++. For example, while OpenCL is primarily applied to GPGPUs and other accelerators, OpenCL kernels may be compiled for CPUs that support SIMD units [29,36]. Instruction level intrinsic functions (also available in R via the RcppXsimd package) allow advanced features such as efficient random variates for small vectors, but this approach is very challenging and akin to machine code. Many other Monte Carlo and Bayesian applications could benefit from these approaches. One possibility is large scale particle filters, perhaps operating within a pseudo-marginal scheme [2], which could be implemented as a hybrid algorithm in which proposals and weight updates are performed in SIMD blocks that are distributed across multiple threads. Such a scheme would enable prefix summations, that is, parallel computation of cumulative sums, to be performed [27] in the resampling step, which may improve the performance. Another possibility is within the class of ABC validation or post-processing procedures. For example, in the recalibration post-processing technique of [45], each of the N samples θ * from the approximate posterior distribution is individually recalibrated by the construction of a further approximate (ABC) posterior for each θ * using all previously generated (θ * , D * ) pairs, but based on observing the associated (simulated) dataset D * , and constructing all univariate marginal posterior distribution functions. This re-use of ABC algorithms and previously generated parameter values and datasets is very common in ABC (see, e.g., Blum et al., 2013), and makes them particularly suited for performance gains through parallelism and SIMD.
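As an illustration of the prefix-summation idea for the resampling step, a two-pass blocked cumulative sum can be sketched as follows: per-block partial sums are computed in parallel, scanned serially, and then used to offset parallel per-block scans. Names are illustrative, and the block count is assumed to divide the array length:

```c
#include <stdlib.h>
#include <stddef.h>

/* Two-pass blocked prefix sum of weights w into cumulative weights c.
 * Pass 1: each thread sums its contiguous block. Serial step: exclusive
 * scan over the nb block sums. Pass 2: each thread scans its own block,
 * starting from its block offset. Assumes nb divides n. */
void prefix_sum_blocked(const double *w, double *c, size_t n, size_t nb) {
    const size_t bs = n / nb;
    double *off = calloc(nb, sizeof *off);
    #pragma omp parallel for
    for (size_t b = 0; b < nb; b++) {     /* pass 1: block sums */
        double s = 0.0;
        for (size_t i = b * bs; i < (b + 1) * bs; i++) s += w[i];
        off[b] = s;
    }
    double acc = 0.0;                     /* serial exclusive scan */
    for (size_t b = 0; b < nb; b++) {
        double t = off[b];
        off[b] = acc;
        acc += t;
    }
    #pragma omp parallel for
    for (size_t b = 0; b < nb; b++) {     /* pass 2: offset local scans */
        double s = off[b];
        for (size_t i = b * bs; i < (b + 1) * bs; i++) { s += w[i]; c[i] = s; }
    }
    free(off);
}
```

The local scans carry a loop dependence, so only the two passes themselves parallelise across threads; fully vectorised scans require log-step techniques as in [27].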
We have demonstrated that by following a few simple guidelines to maximise the utilisation of modern CPUs, advanced Monte Carlo methods may be relatively straightforwardly accelerated by a factor of up to 6× in addition to further speedups obtained through multithreading. These techniques will only become more relevant in the future as CPU architectures are released with wider VPUs and statisticians develop more complex and sophisticated inferential algorithms.

Tables

Data tables of mean runtimes used for Figures 3-5 in the main text are provided in Tables A.1 onwards. Typically, the Intel routines require at least 1,000 variates to be generated for peak performance. Fewer opportunities exist for vectorisation of the Gillespie method used for the tuberculosis model. The only suitable vectorisable step is updating the genotype weights for the lookup method used to select the genotype of the next event. The discrepancy measure components, g and H, can also be vectorised. The length of the state vector in the tuberculosis model is large enough for some performance boost from the vectorisation to be noticeable. The genetic diversity H calculation is particularly efficient, as the sum of squares operation can exploit fused multiply-add (FMA) SIMD operations that perform A ← A + (B ∘ C).
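Assuming the standard diversity index H = 1 − Σ_i g_i² over genotype frequencies g_i, the sum-of-squares kernel can be written so that the compiler emits vector FMA instructions. A sketch with an illustrative function name:

```c
#include <stddef.h>

/* Genetic diversity H = 1 - sum_i g_i^2 over genotype frequencies g.
 * The reduction body ss += g[i] * g[i] is exactly the A <- A + (B * C)
 * pattern that maps onto a fused multiply-add per vector lane. */
double genetic_diversity(const double *g, size_t n) {
    double ss = 0.0;
    #pragma omp simd reduction(+:ss)
    for (size_t i = 0; i < n; i++)
        ss += g[i] * g[i];
    return 1.0 - ss;
}
```

With `-O3 -mfma` (or `-march=native`) the loop compiles to vector FMA instructions with a final horizontal reduction.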
Tables B.1 and B.2 show the results for the tuberculosis model for a range of prior predictive sample counts, N. The speedup factor from vectorisation is, unsurprisingly, not as significant as for the toggle switch model. This demonstrates the limitations of SIMD operations (including GPGPU-based accelerators). There is still an improvement from vectorisation of between 1.4× and 1.8×, a good result in the context of the HPC literature [15,32,40].
The speedup factor due to vectorisation is larger, at 1.8×, for the Xeon E5-2680v3 compared with 1.4× for the Xeon Gold 6140. However, the Xeon Gold 6140 results demonstrate almost 3× improvement in scalar performance over the Xeon E5-2680v3. Therefore, we conclude that the tuberculosis model simulation is accessing L3 cache and main memory more often, and thus taking advantage of the higher memory bandwidth and extra memory channels on the Xeon Gold 6140. If the tuberculosis model is not utilising L1 and L2 cache as intensively, then this explains the reduced speedup for wider vectors, as the VPUs are not efficiently utilised.
Differences in performance of the toggle switch model and the tuberculosis model suggest the choice of stochastic simulation algorithm is crucial. Introducing approximations