Adaptive Tuning Of Hamiltonian Monte Carlo Within Sequential Monte Carlo

Sequential Monte Carlo (SMC) samplers form an attractive alternative to MCMC for Bayesian computation. However, their performance depends strongly on the Markov kernels used to rejuvenate particles. We discuss how to calibrate automatically (using the current particles) Hamiltonian Monte Carlo kernels within SMC. To do so, we build upon the adaptive SMC approach of Fearnhead and Taylor (2013), and we also suggest alternative methods. We illustrate the advantages of using HMC kernels within an SMC sampler via an extensive numerical study.


Introduction
Sequential Monte Carlo (SMC) samplers (Neal, 2001;Chopin, 2002;Del Moral et al., 2006) approximate a target distribution π by sampling particles from an initial distribution π 0 , and moving them through a sequence of distributions π t which ends at π T = π. In Bayesian computation this approach has several advantages over Markov chain Monte Carlo (MCMC). In particular, it enables the estimation of normalizing constants and can thus be used for model choice (Zhou et al., 2016). Moreover, particles can be propagated mostly in parallel, with direct advantages on parallel computing devices (Murray et al., 2016). Finally, SMC samplers are more robust to multimodality, see Schweizer (2012b) and Jasra et al. (2015).
SMC samplers iterate over a sequence of resampling, propagation and reweighting steps. The propagation of the particles commonly relies on MCMC kernels, that depend oftentimes on some tuning parameters. Choosing these parameters in a sensible manner is challenging and has been of interest both from a theoretical and practical point of view; see Fearnhead and Taylor (2013); Schäfer and Chopin (2013); Beskos et al. (2016).
One type of MCMC kernels that has raised attention recently is HMC (Hamiltonian Monte Carlo). HMC has originally been developed in Physics (Duane et al., 1987), and introduced to the Statistics community by Neal (1993). It has become a standard MCMC tool for sampling distributions with continuously differentiable density functions on R d (Neal, 2011). The main appeal of HMC is its better mixing (compared to, say, Metropolis samplers) in high-dimensional problems (Beskos et al., 2013;Mangoubi and Smith, 2017). This paper compares methods for the automatic tuning of HMC kernels within SMC. A few previous papers and thesis considered using HMC kernels within SMC (Gunawan et al., 2018;Burda and Daviet, 2018;Daviet, 2018;Kostov, 2016), but without focusing on tuning the kernels. In our experience, properly tuning MCMC kernels within SMC has a big impact on performance, and particularly so for HMC kernels. As a matter of fact, calibration of HMC kernels is recognised as a challenging problem in the MCMC literature (e.g. Mohamed et al., 2013;Beskos et al., 2013;Betancourt et al., 2014;Betancourt, 2016;Hoffman and Gelman, 2014). The big advantage of tuning such kernels within SMC is that we have at our disposal a cloud of particles that inform us on the shape and scale of the current target distribution.
We base our approach on the work of Fearnhead and Taylor (2013), which concerned the tuning of generic MCMC kernels within SMC samplers, and on existing approaches to tuning HMC.
We apply the proposed SMC sampler with HMC kernels to five examples; three toy examples, a binary Bayesian regression of dimension up to 95 and a log Gaussian Cox model of dimension up to 16, 384. Our numerical study illustrates the potential of SMC samplers for inference and model choice in high dimensions, and the improved performance brought by HMC kernels within SMC (relative to random walk and Langevin kernels). We also investigate the importance of adapting the tempering ladder, and the number of move steps for diversifying the particles.
The paper is organized as follows. Section 2 reviews SMC samplers and HMC kernels. Section 3 discusses adaptive tuning procedures for SMC. Section 4 provides numerical experiments. Section 5 discusses our results.

Background
The methods proposed in this article could apply to generic target distributions, but we will focus on posterior distributions and thus we will use the associated terminology. We consider the problem of calculating expectations of an integrable test function ϕ : R d → R with respect to a posterior distribution defined as π(x) = p(x)l(y|x)/Z. The random variable x with density π(·) is defined on the space (R d , B(R d )), where B(R d ) denotes the Borel set on R d . Here p(x) denotes the prior distribution, l(y|x) is the likelihood of the observed data y given the parameter x ∈ R d , and Z = R d l(y|x)p(x)dx denotes the normalizing constant, also called marginal likelihood or evidence. We next describe two algorithms widely used to approximate posterior distributions: Sequential Monte Carlo (SMC) and Hamiltonian Monte Carlo (HMC). These are building blocks for the adaptive SMC procedures discussed in this paper.

Sequential Monte Carlo samplers
2.1.1 Introducing a sequence of targets: tempering from the prior to the posterior Sequential Monte Carlo (SMC) approaches the problem of sampling from π by introducing a sequence of intermediate distributions π 0 , · · · , π t , · · · , π T defined on the common measurable target space (R d , B(R d )) for all t, and such that π 0 is easy to sample from, and π T = π.
We focus on tempering to construct intermediate distributions, that is π t (x) ∝ p(x)l(y|x) λt , where the sequence of exponents λ t is such that 0 = λ 0 < · · · < λ t < · · · < λ T = 1. These exponents do not need to be pre-specified: they may be automatically selected during the run of a SMC sampler, as described later. Other choices of sequences of distributions are possible (see e.g. Chopin, 2002;Del Moral et al., 2006;Chopin et al., 2013). Note also that we assume throughout the article that the prior distribution p(x) is a proper probability distribution, from which samples can be drawn.

SMC samplers with MCMC moves on tempered posteriors
We denote by γ t (x) = p(x)l(y|x) λt the unnormalized density associated with π t (x), and by Z t the normalizing constant: Z t = R d γ t (x)dx. One way of constructing SMC samplers is as follows. Suppose that at time t − 1 an equally weighted particle approximation x i t−1 i∈1:N of π t−1 is available, with possible duplicates among the particles. This cloud of particles is then moved with a Markov kernel K t , that leaves the distribution π t−1 invariant: for each i, . Consequently a set of new samples x i t i∈1:N is obtained. These particles are then weighted: particle x i t is assigned an importance weight w i t = γ t (x i t )/γ t−1 (x i t ), so that the next distribution π t is approximated by the set x i t , w i t i∈1:N . After resampling particles according to their weights, one obtains again an equally weighted set x i t i∈1:N and the procedure is repeated for the next target distribution π t+1 . The resulting algorithm is given in Algorithm 1.
Upon completion, the algorithm returns weighted samples {x i t , w i t } i∈1:N , which may be used to estimate expectations with respect to the target distributions as follows: as N → +∞. The algorithm also returns estimates of the ratios Z t /Z t−1 (and thus of Z T /Z 0 ); see line 13. (In the context of tempering, one may use the path sampling identity to derive an alternate estimate of Z t /Z t−1 , as explained in Zhou et al. (2016). However, in our experiments, the two estimates were very close numerically, so we focus on the former estimate from now on.) The kernels K t may be constructed, for example, as Metropolis-Hastings kernels (see e.g. Chopin, 2002;Jasra et al., 2011;Sim et al., 2012;Fearnhead and Taylor, 2013;Zhou et al., 2016). More details on the general construction of kernels and on optimality can be found in Del Moral et al. (2006Moral et al. ( , 2007. In general Markov kernels may depend on a set of tuning parameters h, and are hereafter denoted by K h t to make this dependence explicit, as in Algorithm 1. Algorithm 1: SMC sampler with MCMC moves on tempered posteriors Input: Number of particles N , Initial distribution π 0 , target distribution π T and intermediate distributions π t−1 (x) ∝ p(x)l(y|x) λ t−1 , rule for constructing Markov kernels K h t that are π t−1 invariant. Result: Set of weighted samples x i t , w i t i∈1:N for t ∈ 1 : T and normalizing constant estimates Z t /Z t−1 for t ∈ 1 : T . Initialization: t = 1, λ 0 = 0; Iteration: Tune Markov kernel parameters h using available particles; see Algorithm 5 or 6; Move step can be iterated for better mixing, see Algorithm 3 ;

10
Choose the next exponent λ t ∈ (λ t−1 , 1] based on available particles; see Algorithm 2; (a) The choice of the next exponent λ t , at line 10 of Algorithm 1, may be based on available particles; for instance on their effective sample size, as explained below.
(b) The number of move steps, at line 9 of Algorithm 1, may be based on the observed performance of the Markov kernels; see below.
(c) The tuning of the Markov kernel parameters h, at line 6 of Algorithm 1, may be based on the particles, which is the main difference compared to the usual MCMC setting. The main contribution of this paper is to investigate this tuning in the case of HMC kernels, and is described in Section 3.
(a) Choice of the next exponent A common approach (Jasra et al., 2011;Schäfer and Chopin, 2013) to choose adaptively intermediate distributions within SMC is to rely on the ESS (effective sample size, Kong et al., 1994). The ESS is a measure of performance for importance sampling estimates (Agapiou et al., 2017). This criterion is calculated as follows: where in the setting considered here. The ESS is linked to the χ 2 -divergence between the distributions π t−1 and π t or equivalently to the variance of the weights. More precisely the ESS is a Monte Carlo approximation of N/(1 + χ 2 (π t , π t−1 )) where χ 2 (π t , π t−1 ) is the χ 2 divergence from π t−1 to π t .
We may choose λ t by solving (in λ) the equation ESS(λ) = αN , for some user-chosen value α ∈ (0, 1). The corresponding algorithm is described in Algorithm 2. The validity of adaptive SMC samplers based on an ESS criterion is studied in e.g. Beskos et al. (2016), see also Huggins and Roy (2018); Whiteley et al. (2016). Another approach for choosing the sequence of temperature steps is exposed in Friel and Pettitt (2008).
Algorithm 2: Choice of the next exponent based on the effective sample size.
(b) Number of move steps The mixing of MCMC kernels plays a crucial role in the performance and stability of SMC samplers (see e.g. Del Moral et al., 2006;Schweizer, 2012a;Ridgway, 2016). For any MCMC kernel targeting a distribution π, mixing is improved by repeated application of the kernel, for a cost linear in the number of repetitions. We propose to monitor the product of componentwise first-order autocorrelations of the particles to decide how many repetitions to use. Autocorrelations are calculated w.r.t. x i t i∈1:N , the cloud of particles after reweighting and resampling at time t. After k move steps through the kernel K h t the cloud of particles is {x i t,k } i∈1:N . We then calculate the empirical correlation of the component-wise statistic x i t,k (j) + x i t,k (j) 2 , where x i t,k (j) denotes component j of the vector x i t,k , using the successive states of the chain x i t,k (j) and x i t,k−1 (j). This empirical correlation is denoted byρ k (j). This statistic is chosen to reflect the first two moments of the particles, but is otherwise arbitrary. We suggest to continue applying the Markov kernel until a large fraction (e.g. 90%) of the product of the first order autocorrelations drops below a threshold α = 0.1, for example. The resulting algorithm is described in Algorithm 3.
Algorithm 3: Adaptive move step based on autocorrelations.
Input: Particles x i t i∈1:N , proposal kernel K h t . Result: Particles after k move steps {x i t,k } i∈1:N .
Calculate the correlationρ k (j) for all j; Instead of using component-wise autocorrelations, one could also draw on recent work on the performance evaluation of MCMC algorithms in multidimensional spaces (Vats et al., 2015) or approaches based on the Stein discrepancy (Gorham and Mackey, 2015).

Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) consists in proposing moves by solving the equations of motion of a particle evolving in a potential. We first describe HMC, by following the exposition in Neal (2011), before turning to existing approaches to the tuning of its algorithmic parameters.

MCMC based on Hamiltonian dynamics
Let L(x) = log γ(x) be the unnormalized log density of the random variable of interest x. We introduce an auxiliary random variable p ∈ R d with distribution N (0, M), and hence unnormalized log density log f (p) = −1/2 p T M −1 p. The joint unnormalized density of (p, x) is given as µ(p, x) = f (p)γ(x) and the negative joint log-density is denoted by The physical analogy of this quantity is the Hamiltonian, where the first term denotes the potential energy at position x, and the second term denotes the kinetic energy of the momentum p with the mass matrix M. The movement in time of a particle with position x and momentum p can be described via its Hamilton equations, where dx/dτ, dp/dτ denote the derivatives of the position and the momentum with respect to the continuous time τ . The solution of this differential equation induces a flow Φ τ that describes the evolution of a system with initial momentum and position (p 0 , The solution is (a) energy preserving, e.g. H (p τ , x τ ) = H (p 0 , x 0 ); (b) volume preserving and consequently the determinant of the Jacobian of Φ τ equals one; (c) the flow is reversible w.r.t. time.
In terms of probability distributions this means that if (p 0 , x 0 ) ∼ µ(p, x) then also (p τ , x τ ) ∼ µ(p, x). In most cases an exact solution of the flow is not available and one has to use numerical integration methods instead. One widely used integrator is the Störmer-Verlet or leapfrog integrator (Hairer et al., 2003). The leapfrog integrator is volume preserving and reversible but not energy preserving. It iterates the following updates: where denotes the step size of the leapfrog integrator. Thus, in order to let the system evolve from τ to τ + κ with κ = L × we need to make L steps as described above. This induces a numerical flow Φ ,L such that Φ ,L (p τ , x τ ) = (p τ +κ ,x τ +κ ). In general we have ∆E κ = 0 where ∆E κ = H(p τ +κ ,x τ +κ ) − H(p τ , x τ ) is the variation of the Hamiltonian. The dynamics can be used to construct a Markov chain targeting µ on the joint space, with a Metropolis-Hastings step that corrects for the variation in the energy after sampling a random momentum and constructing a numerical flow. Algorithm 4 describes the Markov kernel of HMC.

Existing approaches to tuning HMC
The error analysis of geometric integration gives insights that allow to find step sizes that yield stable trajectories. For the leapfrog integrator the error of the energy is and the error of the position and momentum is see Leimkuhler and Matthews (2016); Bou-Rabee and Sanz-Serna (2018) for more details. It can be shown that the constant C 1 > 0 in (2) stays stable over exponential long time intervals L ≤ exp(h 0 /2 ) for some constant h 0 (Hairer et al., 2006, Theorem 8.1), whereas the constant C 2 > 0 in the (3) typically grows with L. Hence, care must be taken when choosing ( , L).
Using the error control in (2), Neal (2011) following Creutz (1988) provides an informal reasoning motivating the scaling of ε as d −1/4 , at least for targets that factorize into products of d independent components. To maintain a fixed integration time εL one should then scale L as d 1/4 . From a practical point of view the tuning of the HMC kernel requires the consideration of the following aspects. If is too large, the numerical integration of the HMC flow becomes unstable and results in large variations in the energy and thus a low acceptance rate, see (2). On the other hand if is too small, for a fixed number of steps L the trajectories tend to be short and high autocorrelations will be observed, see Neal (2011). To counterbalance this effect a large L would be needed and thus computation time would increase. If L gets too large, the trajectories might also double back on themselves (Hoffman and Gelman, 2014).
From a theoretical perspective Beskos et al. (2013) and later Betancourt et al. (2014) show that the integrator step size should be chosen such that acceptance rates between 0.651 and 0.9 are obtained, when the dimension of the target space goes to infinity. This idea has been exploited in Hoffman and Gelman (2014) where stochastic approximation is used in order to obtain reasonable values of .
A different approach is to choose the parameters of the kernel such that the expected squared jumping distance (ESJD) of the kernel is maximized, see Pasarica and Gelman (2010). The ESJD in one dimension is defined as assuming stationarity. In this sense maximizing the ESJD of a Markov chain is equivalent to minimizing the first order autocorrelation ρ 1 . In d dimensions maximizing the ESJD in Euclidean norm amounts to minimizing the correlation of the d dimensional process in Euclidean norm. In the case of HMC this has been discussed by Mohamed et al. (2013) and Hoffman and Gelman (2014). Mohamed et al. (2013) tune the HMC sampler with Bayesian optimization and vanishing adaptation in the spirit of adaptive MCMC algorithms, see Andrieu and Thoms (2008). The ESJD is then maximized as a function of ( , L). Hoffman and Gelman (2014) discuss the ESJD as a criterion for the choice of L. As a general idea the simulation of the trajectories should be stopped when the ESJD starts to decrease. However, this idea has the inconvenience of impacting the reversibility of the chain. This problem is solved by adjusting the acceptance step in the algorithm. The resulting algorithm is called NUTS (Hoffman and Gelman, 2014) and is used in the probabilistic programming language STAN, see Carpenter et al. (2017). Neal (2011) suggests to use preliminary runs in order to find reasonable values of ( , L) and to randomize the values around the chosen values. The randomization avoids pathological behavior that might occur when ( , L) are selected badly. Other approaches on identifying the optimal trajectory length are discussed in Betancourt (2016).
Another important tuning parameter is the mass matrix M, that is used for sampling the momentum. When the target distribution is close to a Gaussian, rescaling the target by the Cholesky decomposition of the inverse covariance matrix eliminates the correlation of the target and can improve the performance of the sampler. Equivalently, the inverse covariance matrix can be set to the mass matrix of the momentum. This yields the same transformation, see Neal (2011). Recently, Girolami and Calderhead (2011) suggested to use a position dependent mass matrix that takes the local curvature of the target into account. However, the numerical integrator for the Hamiltonian equation needs to be modified in consequence.

Tuning Of Hamiltonian Monte Carlo Within Sequential Monte Carlo
We now discuss the tuning of the Markov kernel in line 6 of Algorithm 1. The tuning of Markov kernels within SMC samplers can be linked to the tuning of MCMC kernels in general. One advantage of tuning the kernels in the framework of SMC is that information on the intermediate distributions is available in form of the particle approximations. Moreover, different kernel parameters can be assigned to different particles and hence a large number of parameters can be tested in parallel. This idea has been exploited by Fearnhead and Taylor (2013). We build upon their methodology and adjust their approach to the tuning of HMC kernels. We first describe the tuning of the mass matrix. Second, we present our adaptation of the approach of Fearnhead and Taylor (2013) to the tuning of HMC kernels, abbreviated by FT. Then we present an alternative approach based on a pre-tuning phase at each intermediate step, abbreviated by PR for preliminary run. Finally, we discuss the advantages and drawbacks of the two approaches.

Tuning of the mass matrix of the kernels
The HMC kernels depend on the mass matrix M, used for sampling the momentum. Calibrating this matrix based on the particles of the previous iteration allows to exploit the information generated by the particles. In the case of an independent MH proposal this has been used by Chopin (2002) and more recently by South et al. (2019). For the HMC kernel we use the following matrix at iteration t where Var πt [x t ] is the particle estimate of the covariance matrix of target π t . The restriction to a diagonal matrix makes this approach applicable in high dimensions; alternatively a non-diagonal covariance or precision matrix could be estimated in high dimension by assuming sparsity (see e.g. Liu et al., 2016). Thus, the global structure of the target distribution π t−1 is taken into account and moves in directions of higher variance are proposed.
3.2 Adapting the tuning procedure of Fearnhead and Taylor (2013) Suppose we are at line 6 during iteration t of Algorithm 1. We are now interested in choosing the parameters h of the propagation kernel K h t . Fearnhead and Taylor (2013) consider the following ESJD (expected squared jumping distance) criterion: stands for the Mahalanobis distance with respect to matrix M ; in our case we set M = M t−1 , see (4). (Fearnhead and Taylor (2013) set M to the covariance matrix of the particles at time t − 1, but, again, this requires inverting a full matrix, which may be too expensive in high-dimensional problems.) By maximizing g t (h) we minimize the first-order autocorrelation of the chain. This leads to a reduced asymptotic variance of the chain and hopefully to a reduced asymptotic variance for estimates obtained from the SMC sampler.
The tuning procedure referred to as FT has the following steps: 1. Assign different values of h i t according to their performance to the resampled particlesx i t−1 . In Step 1, we use the following performance metric, which is a Rao-Blackwellized estimator of (5):Λ Herex i t is the proposed position based on the Hamiltonian flow Φ ,L (x i t−1 , p i t ) before the MH step. The acceptance rate min(1, exp[∆E i t ]) of the MH step is based on the variation of the energy ∆E i t , and serves as weight.
This metric is the same as in Fearnhead and Taylor (2013), except that it is divided by L, in order to account for the fact that the CPU cost of an HMC kernel increases linearly with L.
The pairs h i t = ( i t , L i t ) are weighted according to the performance metricΛ(x i t−1 ,x i t ). The next set of parameters h i t+1 are sampled from where R is a perturbation kernel. We suggest to set R to where T N denotes a normal distribution truncated to R + . The part in curly brackets corresponds to a discrete mixture for the variable L. Thus is perturbed by a small (truncated) Gaussian noise, and L has an equal chance of increasing, decreasing or staying the same. The variance of the Gaussian noise is set to the value used by Fearnhead and Taylor (2013) (in our simulations we found that the tuning was robust to different choices of this value). The resulting algorithm is given in Algorithm 5.
Algorithm 5: (FT) Tuning of the HMC algorithm based on Fearnhead and Taylor (2013) Input: Previous parameters h i t−1 , estimator of associated utilityΛ(

Pretuning of the kernel at every time step
The previous tuning algorithm relies on the assumption that parameters suited for the kernel used at time t − 1 will also achieve good performance at time t.
We suggest as an alternative to the previous approach the following two-stage procedure: 1. We apply an HMC step (with respect to π t−1 ) to the N current particles; for each particle the value of ( , L) is chosen randomly from a certain uniform distribution (described in next section). We then construct a new random distribution for ( , L) based on the performance of these N steps (Section 3.3.2). The HMC trajectories are then discarded.
2. We apply again an HMC step to the N current particles, except this time (L, ) is generated from the distribution constructed in the previous step.
We now explain this approach in more detail.

Range of values for
In the first stage of our pre-tuning parameter, is generated from U[0, t−1 ], and L from U[0, L max ]. We discuss in this section how to choose t−1 . The value of the very first 0 is given in Section 4. Our approach is motivated by the upper bound in (2). If for different step sizesˆ i t and different momenta and positions (p i t , , this information may be used to fit a model of the form where ξ i t is assumed to be such that ∀i, E ξ i t = 0, Var ξ i t = σ 2 < ∞ and f : R + → R + . We may then choose so that f ( ) = | log 0.9|. This ensures that the acceptance rate of the HMC kernel stays close to 90%, following the suggestions of Betancourt et al. (2014).
In particular we suggest to use the model and we minimize the sum of the absolute value errors N i=1 ξ i t w.r.t. (α 0 , α 1 ), which amounts to a median regression. Compared to least squares regression, this approach is more robust to high fluctuations in the energy which typically occur when approaches its stability limit. We illustrate this point in Figure 1b.  Figure 1: Tempering of a normal distribution to a shifted and correlated normal distribution in dimension 10 (see the example in Section 4.1 for more details). Left: The normalized and weighted squared jumping distance (z-axis) as a function of (y-axis) and L (x-axis) for the temperature 0.008. Right: Variation of the difference in energy ∆E as a function of for the same temperature. The values of L are randomized. Based on an SMC sampler with an HMC kernel based on N = 1, 024 particles.

Construction of a random distribution for ( , L)
Algorithm 6 describes how we generate values for ( , L) during the second stage of our pre-tuning procedure. In words, these values are sampled from the weighted empirical distribution that correspond to the N values (ˆ i t ,L i t ) generated (uniformly) during the first stage, with weights given by performance metric (6). We visualize this metric as a function of , L in Figure 1a.

Range of values for L
During the first stage of our pre-tuning procedure, L is generated uniformly within range [0, L max ]. The quantity L max is initialized to some user-chosen value (we took L max = 100 in our simulations). Whenever a large proportion of the L i t generated by Algorithm 6 is close to L max , we increase L max by a small amount (5 in our simulations). Similarly, whenever a large proportion of these values are far away from L max , we decrease L max by the same small amount.

Discussion of the tuning procedures
Comparison of the two algorithms The difference between the two procedures consists in the pre-tuning phase at each intermediate step of the sampler. On one hand, pre-tuning makes the SMC sampler more costly per intermediate step. On the other hand this approach makes the sampler more robust to a sudden change in the sequence of distributions. We illustrate this point in our numerical experiments.
Both of the suggested tuning procedures have computational costs linear in the number of particles N , in line with the other operations performed in the SMC sampler.
Other potential approaches to tuning HMC within SMC One could try to maximize the squared jumping distance as a function of the position of every particle, based on the associated values of ( , L). However, learning optimal parameters for every position might be challenging, possibly harder than the original Monte Carlo problem of interest. In line with Girolami and Calderhead (2011) one could use a position dependent mass matrix that would take the geometry of the target space into account, for instance related to higher-order derivatives of the target probability density function.
Returning to the choice of ( , L) one could use an approach based on Bayesian optimization (Snoek et al., 2012), based on the performance of the ( i t−1 , L i t−1 ) at the previous iteration. This idea would amount to a parallel version of Mohamed et al. (2013). However, it is not clear how this approach would behave if the underlying distributions evolve over time. Not using a pre-tuning step reduces the computational load at the expense of making the sampler potentially less robust. Moreover, the approach of Fearnhead and Taylor (2013) already explores the hyperparameter space adaptively without requiring additional model specifications. If framed as a Bandit problem, fixing over time a grid of possible values ( , L) could be problematic if the grid misses relevant parts of the hyperparameter space. This holds true in particular when using continuous Bayesian optimization, where one typically has to define some box constraints on the underlying space.
Extensions The tuning procedure based on pre-tuning might also be used for tuning random walk (RW) Metropolis or MALA (Metropolis adjusted Langevin) kernels. In the first case we may use median regression to find an upper bound for the scale such that the acceptance rate is close to 23.4% (Roberts et al., 1997). In the second case one may target an acceptance rate of 57.4% (Roberts and Rosenthal, 1998). (MALA kernels may be viewed as HMC kernels with L = 1.) It recently came to our knowledge that the work of Salomone et al. (2018) also uses a pre-tuning approach for MCMC kernels within SMC samplers. A notable difference with our approach is that Salomone et al. (2018) concentrates on finding a single tuning parameter, rather than a distribution.

Experiments
Our experiments highlight the importance of adapting SMC samplers, in particular the parameters of their Markov kernels. Specifically, we try to answer the following questions. How important is it to adapt (a) the number of temperature steps and (b) the number of move steps? (c) Does our tuning procedure of HMC kernels provide reasonable values of ( , L) compared to other tuning procedures of HMC? (d) To what extent does HMC within an SMC sampler scale with the dimension and may be applied to real data applications? (e) How robust are SMC samplers to multimodality?
For this purpose we compare adaptive (A) and non-adaptive (N) versions of HMC-based SMC samplers, where the adaptation may be carried out using either the FT approach (our variant of the approach of Fearnhead and Taylor, 2013) or the PR (pre-tuning) approach. We also include in our comparison SMC samplers based on random walk (RW) and MALA kernels, and the FT adaptation procedure. We call our algorithms accordingly: i.e. HMCAFT stands for an SMC sampler using HMC kernels, which are adapted using the FT procedure.
In all the considered SMC samplers, the mass matrix M t of the MCMC kernels is set to the diagonal of the covariance matrix obtained at the previous iteration. Unless otherwise stated, the number of particles is N = 1, 024 and the resampling is triggered when the ESS drops below N/2. The computational load of a given sampler is defined as the number of gradient evaluations, plus the number of likelihood evaluations. Note, that this is a conservative choice as computations of the likelihood and the gradient often involve the same computations. Most of our comparisons are in terms of adjusted variance, or adjusted MSE (mean squared error), by which we mean: variance (or MSE) times computational load.
All HMC-based samplers are initialized with uniform random draws of on [0, 0.1] and L on {1, 100}. The MALA and RW-based samplers are initialized with random draws of the scale parameter on [0, 1]. The initial mass matrix is set to the identity for all different samplers. All samplers choose adaptively the number of move steps based on Algorithm 3. Code for reproducing the results shown below is available under https://github.com/alexanderbuchholz/hsmc.

Tempering from an isotropic Gaussian to a shifted correlated Gaussian
As a first toy example we consider a tempering sequence that starts at an isotropic Gaussian π 0 = N (0 d , I d ), and finishes at a shifted and correlated Gaussian π T = N (µ, Ξ), where µ = 2 × 1 d , for different values of d. For the covariance we set the off-diagonal correlation to 0.7 and the marginal variances to elements of the equally spaced sequenceΞ = [0.1, · · · , 10]. We get the covariance Ξ = diagΞ 1/2 corr(X) diagΞ 1/2 . This toy example is rather challenging due to the different length scales of the variance, the correlation and the shifted mean of the target. In this example the true mean, variance and normalizing constants are available. Therefore we report the mean squared error (MSE) of the estimators. We use normalized importance weights and thus Z T /Z 0 = 1.
We first compare the following SMC samplers: MALA, HMCAFT and HMCAPR (according to the denomination laid out in the previous section). We add to the comparison HMCNFT, an SMC sampler using adaptive (FT based) HMC steps, but where the sequence of temperatures is fixed a priori to a long equi-spaced sequence (the size of which is set according to the number of temperatures chosen adaptively during one run of HMCAFT). Figure 2a plots the ESS as a function of the temperature, for dimension d = 500, for algorithms HMCAFT and HMCNFT. Figures 2b and 3 compare the SMC samplers in terms of computational load (Figure 2b) and adjusted MSE (i.e. MSE times the computational load) for the normalizing constant and the expectation of the first component (with respect to the target). The results for other components (not shown here) are similar to the results for the first component.
A first observation is that it seems useful to adapt the sequence of temperatures: HMCNFT is outperformed by all the other algorithms for both estimates. A second observation is that there is no clear ranking between the three other samplers. HMC-based samplers (and particularly HMCAFT) do perform better than the MALA-based sampler for the normalizing constant, but the picture is less clear for the posterior expectation of the first component. It is remarkable that, even in dimensions as high as 500, MALA kernels may be competitive. In a second step, we compare the impact of adapting the number of move steps of the samplers. We restrict the comparison to a MALA-based sampler that uses FT tuning and an HMC-based sampler that uses PR tuning. For the non-adaptive samplers (N) the number of move steps is fixed to a constant number, equal to the average number of move steps over the complete run of an adaptive sampler. The temperatures are chosen adaptively.
We compare the performance of the samplers in estimating the trace of the variance Ξ. The results are shown in Figure 4a and the associated computational cost is shown in Figure 4b. The adaptive samplers seem to offer a similar trade-off in terms of MSE versus computational load. In order to assess the performance of the two tuning procedures (FT and PR), we compare the tuning parameters obtained at the final stage of our SMC samplers (HMCAFT and HMCAPR) with those obtained from the following MCMC procedures: NUTS (Hoffman and Gelman, 2014) and the adaptive MCMC algorithm of Mohamed et al. (2013). NUTS iteratively tunes the mass matrix M, the number of leapfrog steps L and the step size in order to achieve a high ESJD (expected squared jumping distance). The adaptive HMC sampler of Mohamed et al. (2013) uses Bayesian optimization in order to find values of ( , L) that yield a high ESJD.
For this purpose we run our HMC-based SMC samplers and record the achieved ESJD of the HMC kernel at the final distribution π T . NUTS and the adaptive HMC sampler are run directly on the final target distribution. For NUTS we use the implementation available in STAN; for the adaptive HMC sampler we reimplemented the procedure of Mohamed et al. (2013). After the convergence of the tuning procedures we run the chain for 2, 000 iterations and discard a burnin of 1, 000 samples. The ESJD is calculated on the remaining 1, 000 iterations of the chain. The results of this comparison are shown in Table 1. We see that both the HMC-based SMC samplers, NUTS and the adaptive HMC tuning achieve an ESJD of the same order of magnitude. Thus, our tuning procedure gives values of ( , L) that yield an ESJD comparable to other existing procedures.

Tempering from a Gaussian to a mixture of two correlated Gaussians
The aim of our second example is to assess the robustness of SMC samplers with respect to multimodality. We temper from the prior π 0 = N (µ 0 , 5I d ) towards a mixture of shifted and correlated Gaussians π T = 0.3N (µ, Ξ 1 ) + 0.7N (−µ, Ξ 2 ), where µ = 4 × 1 d and we set the off diagonal correlation to 0.7 for Ξ 1 and to 0.1 for Ξ 2 . The variances are set to elements of the equally spaced sequenceΞ j = [1, · · · , 2] for j = 1, 2. The covariances Ξ j are based on the same formula as in the first example. In order to make the example more challenging we set µ 0 = 1 d . Thus, the starting point of the sampler is slightly biased towards one of the two modes. We evaluate the performance of the samplers by evaluating the signs of the particles and use therefore the statistic T i := 1/d d j=1 1 {sign X j,i =+1} , based on a single particle. We expect a proportion of 30% of the signs to be positive, i.e. 1/N N i=1 T i ≈ 0.3, if the modes are correctly recovered. Our measure of error is based on the squared deviation from this value. We consider the following SMC samplers: MALA, RW, HMCAFT, and HMCAPR. All the samplers choose adaptively the number of move steps and the temperatures.
As shown by Figure 5b the two HMC-based samplers yield a lower error of the recovered modes adjusted for computation in moderate dimensions. In terms of the recovery of the modes all samplers behave comparably, see Figure 5a. Nevertheless, as the dimension of the problem exceeds 20 all samplers tend to concentrate on one single mode. This problem may be solved by initializing the sampler with a wider distribution. However, this approach relies on the knowledge of the location of the modes.
Consequently, SMC samplers are robust to multimodality only in small dimensions and care must be taken when using SMC samplers in this setting. That said, HMC-based samplers seem slightly more robust to multimodality; see e.g. results for d = 10 in Figure 5a. See also the recent work of Mangoubi et al. (2018) for the performance of HMC versus RWMH on multimodal targets.

Tempering from an isotropic Student distribution to a shifted correlated Student distribution
As a different challenging toy example we study the tempering from an isotropic Student distribution π 0 = T 3 (0 d , I d ) with 3 degrees of freedom towards a shifted and correlated Student distribution with ν = 10 degrees of freedom, i.e. π T = T ν (µ, Ξ). The mean and scale matrix are set as in the unimodal Gaussian example in 4.1.  Figure 6: Figure 6a shows the squared error of the estimator of the normalizing constant. Figure 6b shows the squared error of the trace of the mean over different dimensions adjusted for computation. The results are based on 100 runs of the samplers with N = 1, 024 particles.
This example is more challenging as the target distribution π T has heavy tails. We vary the dimension between d = 5 and d = 150. The temperature steps are chosen such that an ESS of 90% is attained. The adaptive samplers (A) adjust the number of move steps according to Algorithm 3. For the non-adaptive samplers (N) the number of move steps is fixed to a constant number, equal to the average number of move steps over the complete run of an adaptive sampler. The temperatures are chosen adaptively. The MALA-based sampler uses FT tuning, the HMC-based sampler uses our pre-tuning (PR) approach.
The HMC-based samplers perform better when it comes to estimating the mean (see Figure  6b) and the normalizing constant (see Figure 6a). Both samplers based on a MALA kernel tend to work poorly in terms of the estimation of the normalizing constant when the dimension increases, see Figure 6a.
In this particular example, we observe that the adaptation of the number of move steps has some negative impact on the performance of the samplers. We suspect that this is due to the heavy tails of the Student distribution: first, this phenomenon did not occur in our first numerical example, where the target had light tails (Gaussian); second, Livingstone et al. (2016) show that an HMC kernel cannot be geometrically ergodic when the target is heavy-tailed. Thus, setting the number of move steps to a fixed, large value may be beneficial for heavy-tailed targets. Our approach provides a guideline for finding this number.

Binary regression posterior
We now consider a Bayesian binary regression model. We observe J vectors z j ∈ R d and J outcomes y j ∈ {0, 1}. We assume y j ∼ Ber(r(z T j β)) where r : R → [0, 1] is a link function. The parameter of interest is β ∈ R d , endowed with a Gaussian prior, i.e. β ∼ N (0 d , I d ).
When using the logit link function r : x → exp(−x)/(1 + exp(−x)) we obtain a Bayesian logistic regression where the unnormalized log posterior is given as When using as link function the cumulative distribution function of a standard normal distribution, denoted Φ, one obtains the Bayesian Probit regression. In this case the unnormalized log posterior is given as We set π 0 to a Gaussian approximation of the posterior obtained by Expectation Propagation (Minka, 2001) as in Chopin and Ridgway (2017).
We consider two datasets (both available in the UCI repository): sonar (d = 61 covariates, J = 208 observations) and musk (d = 95, J = 476, after removing a few covariates that are highly correlated with the rest). In both cases, an intercept is added, and the predictors are normalised (to have mean 0 and variance 1).
We compare the following SMC samplers: RW, MALA, HMCAFT, HMCAPR; see Figure 7 for the estimation of the marginal likelihoods (which may be used to perform model choice) and The four samplers perform similarly on on the sonar dataset, while they perform quite differently on the musk dataset. In the latter case, the MALA-based and RW-based samplers did not complete after 48 hours of running, so we had to remove them from the comparison. (In contrast, the two HMC-based samplers took less than 45 minutes to complete). In addition, FT adaptation leads to a high variance for the normalising constant. Table 2 reports the logarithm of the adjusted variances for the four considered samplers, the two considered datasets, the two considered models (logit and probit) and a varying number of particles for the sonar dataset. By and large, HMCAPR seems the most robust approach; it often gives either the smallest, or a value close to the smallest, of the adjusted variances.  Figure 8a corresponds to the sonar dataset. Figure 8b corresponds to the musk dataset.

Log Gaussian Cox model
In order to illustrate the advantage of HMC-based SMC samplers in high dimensions we consider the log Gaussian Cox point process model in Girolami and Calderhead (2011), applied to the Finnish pine saplings dataset. This dataset consists of the geographic position of 126 trees. The aim is to recover the latent process X ∈ R d×d from the realization of a Poisson process Y = (y j,k ) j,k with intensity mΛ(j, k) = m exp(x j,k ), where m = 1/d 2 and X = (x j,k ) j,k is a Gaussian process with mean E [X] = µ × 1 d×d and covariance function Σ (j,k)(j ,k ) = σ 2 exp(−δ(j, j , k, k )/(dβ)), where δ(j, j , k, k ) = (j − j ) 2 + (k − k ) 2 . The posterior density of the model is given as p(x|y, µ, σ 2 , β) ∝    d j,k exp(y j,k x j,k − m exp(x j,k )) where the second factor is the Gaussian prior density. Following the results of Christensen et al. (2005) we fix β = 1/33, σ 2 = 1.91 and µ = log(126) − σ 2 /2. We vary the dimension of the problem from d = 100 to d = 16, 384 by considering different discretizations. We consider three SMC samplers: MALA, HMCAFT and HMCAPR. The starting distribution is the prior. Figure 9b shows that the HMC-based samplers estimate well the normalizing constant, even for a large dimension d. Moreover, we estimate the cumulative mean throughout different dimensions with relatively low variance, see Figure 9a. We omitted the simulation for the MALA-based samplers, as the simulation took excessively long in dimension 4, 096 due to slow mixing of the kernel. The estimated posterior mean of the latent field for dimension 900 is plotted in Figure 9c.  Figure 9a illustrates the estimations of the normalizing constants. Figure 9b illustrates the estimated cumulative posterior mean. Figure 9c illustrates the recovered componentwise posterior mean of the process in dimension 900. Table 3 reports adjusted variances (variances times computational load) for the considered SMC samplers. We see that the MALA-based sampler is not competitive when the dimension increases. Regarding the adaptation scheme, FT outperforms PR, especially in high dimension.  Table 3: Inefficiency of the estimation of the normalizing constant and the mean of the first component based on 40 runs of the different samplers. The inefficiency is measured as the log of adjusted variances of the samplers (variance multiplied by the mean number of gradient and likelihood evaluations). Smaller is better.

Discussion
Our experiments indicate that the relative performance of HMC kernels within SMC depends on the dimension of the problem. For low to medium dimensions, RW and MALA are much faster, and tend to perform reasonably well. On the other hand, for high dimensions, HMC kernels outperform, sometimes significantly, RW and MALA kernels.
The key to good performance of SMC samplers (based on HMC or other kernels) is to adaptively tune the Markov kernels used in the propagation step. We have considered two approaches in this paper. On posterior distributions with reasonable correlation our adaption of the approach by Fearnhead and Taylor (2013) works best. Our approach based on pre-tuning of the HMC kernels is more robust to changes in the subsequent target distributions, as illustrated by our binary regression example. This holds in particular when using SMC samplers for model choice. Moreover, we showed that, when sensibly tuned, SMC samplers with HMC kernels can scale to high dimensional problems. From a practical point of view and if the structure of the posterior is unknown the second approach may be the more prudent choice.
All in all, our methodology provides a principled approach for an automatic adaption of SMC samplers, applicable over a range of various models in different dimensions.