Delayed rejection Hamiltonian Monte Carlo for sampling multiscale distributions

The efficiency of Hamiltonian Monte Carlo (HMC) can suffer when sampling a distribution with a wide range of length scales, because the small step sizes needed for stability in high-curvature regions are inefficient elsewhere. To address this we present a delayed rejection variant: if an initial HMC trajectory is rejected, we make one or more subsequent proposals each using a step size geometrically smaller than the last. We extend the standard delayed rejection framework by allowing the probability of a retry to depend on the probability of accepting the previous proposal. We test the scheme in several sampling tasks, including multiscale model distributions such as Neal's funnel, and statistical applications. Delayed rejection enables up to five-fold performance gains over optimally-tuned HMC, as measured by effective sample size per gradient evaluation. Even for simpler distributions, delayed rejection provides increased robustness to step size misspecification. Along the way, we provide an accessible but rigorous review of detailed balance for HMC.


Introduction
Hamiltonian Monte Carlo (HMC), including auto-tuned extensions like the no U-turn sampler (NUTS), have become the de facto standard for high performance sampling of high-dimensional, differentiable distributions (Duane et al., 1987;Neal, 2011;Hoffman and Gelman, 2011).One reason for this is that HMC scales much better with dimension than other Markov chain Monte Carlo (MCMC) methods such as random-walk Metropolis or Gibbs sampling.HMC's scalability derives from its ability to move large distances by approximating the Hamiltonian flow defined by the gradient of a distribution's log density function (Betancourt, 2017).As a result, HMC is believed to require O(d 5/4 ) iterations to generate an independent sample in d dimensions as compared to the O(d 2 ) samples required with random-walk Metropolis or Gibbs sampling (Neal, 2011).The actual efficiency also depends strongly on geometric features of the density being sampled, particularly issues of high correlation between coordinates (leading to stiffness, i.e., ill-conditioning of the local curvature Hessian), and of spatially varying curvature (which defeats the use of global preconditioning to counteract stiffness).
One of the most common pathologies plaguing these algorithms is the multiscale geometry of the posterior distributions (Betancourt and Girolami, 2015 and Petzold, 2019): when the curvature of the log density varies spatially over a large dynamic range, small HMC time steps are needed for numerical stability in the highcurvature regions, preventing the use of the larger time steps needed for efficient sampling in smoother regions.This geometry arises naturally in hierarchical models that provide a population model for a group of effects in order to support regularization and partial pooling.However since all the contributions at the bottom of the hierarchy depend on the common global parameter, a small change in these high level parameters can induces large changes in the conditional density of the effects.Consequently, when the data are sparse and inference is sensitive to the priors on these parameters, the posterior density of these models looks like a "funnel" with a region of high density but low volume ("neck") widening to a region of low density and high volume ("mouth").We show a two-dimensional example of this distribution in Figure 1.In the right panel of the same figure, we show the dramatic variations in condition number as the log scale parameter moves along the funnel.Sampling this distribution is challenging because the mouth and the neck of the funnel contain equal probability mass and so any sampling algorithm needs to handle these variations in curvature.
A standard option for managing varying curvature is to use Hessian information.This has led to the development of Riemannian HMC (Girolami and Calderhead, 2011), which follows a Riemannian metric based on the posterior curvature.However, this is prohibitively expensive in high dimensions because it requires a positive-definite matrix at each point and many posteriors do not have positive-define Hessians.One way to do this is to have an explicit form of the Fisher information matrix (Girolami and Calderhead, 2011) or to use a conditioning operator like SoftAbs (Betancourt, 2013).
An alternative way to deal with high curvature is to approximate the Hamiltonian flow with an implicit symplectic integrator, which is able to naturally adjust stepping in different regions of phase space (Pourzanjani and Petzold, 2019;Brofos and Lederman, 2021b).Even simple implicit integration schemes like implicit midpoint are costly and present an algorithmic challenge for efficient and stable line search.Ultimately, we be-lieve it will be necessary to combine implicit integration and delayed rejection to achieve greater robustness in the face of even more challenging posterior sampling problems.
In this work, we develop an alternate approach inspired by the use of delayed rejection (DR) methods to sample multiscale posterior distributions.Recall that a high rejection rate increases autocorrelation of the Markov chain, reducing sampling efficiency.Whenever a rejection would occur in the Metropolis algorithm, DR methods do additional work, which can even exploit knowledge of the first rejection, to make a new proposal with a higher chance of acceptance (Haario et al., 2006).Since such new (possibly expensive) proposals are mostly made only when the standard proposal is poor, efficiency can be increased.Although well studied in the context of random walk Metropolis-Hastings sampling (Mira, 1998;Tierney and Mira, 1999;Green and Mira, 2001;Haario et al., 2006), there has been relatively little work done with these approaches for Hamiltonian Monte Carlo samplers (Sohl-Dickstein et al., 2014;Campos and Sanz-Serna, 2015).
Previous approaches employing DR with HMC extend the same trajectory upon rejection so as to balance the additional cost by making larger jumps in the state space (Sohl-Dickstein et al., 2014;Campos and Sanz-Serna, 2015).However, this approach is not helpful if the chains are stuck in a region of high curvature where instability causes a high rejection rate.In such cases, as with delayed rejection in random-walk Metropolis (Green and Mira, 2001;Haario et al., 2006), it is more productive instead to change the proposal parameters with the goal of increasing the chance of acceptance.In this work, we use this idea, building upon the original idea of delayed rejection (Tierney and Mira, 1999;Green and Mira, 2001) to develop delayed rejection HMC (DRHMC).Upon a rejection, DRHMC makes one or more subsequent proposals with smaller step sizes, with the aim that these are more likely to give stable leapfrog integration than their rejected predecessors.The result is a form of adaptivity with respect to step size, a very successful idea in numerical integration more generally.1 In the rest of this paper, we begin by reviewing, in a mathematically rigorous yet accessible fashion, Metropolis Hastings, HMC, and delayed rejection methods in Section 2. With these tools, we then derive DRHMC in section 3 for one or more proposals.We show that DRHMC obeys detailed balance, discuss the cost of delayed proposals and outline probabilistic alternatives to reduce the cost of delayed rejection approaches.Then in section 4, we consider some toy models as well as actual data models, and show that DRHMC can provide significant speed-ups as compared to traditional HMC in sampling tough multiscale distributions.We also show that in cases with no such pathologies, probabilistic DRHMC is no more expensive than HMC, thus suggesting its use as a robust alternative.We conclude with discussion in section 5. Two short appendices contain proofs needed in the main text.
2 Metropolis-Hastings, delayed rejection, deterministic maps, and HMC In this section we recap background material that is not easy to find gathered in an accessible format.While being somewhat tutorial in nature, this also sets up essential notation for the coming presentation of DRHMC.We include a proof, avoiding technical measure theory notation, that HMC samples the correct target distribution, since in the literature this is usually presented either heuristically (MacKay, 1998;Neal, 2011;Sohl-Dickstein et al., 2014;Betancourt, 2017), or rigorously but in highly abstract terms (Andrieu et al., 2020).
In general we use x ∈ S to denote the state in a continuous state space S, which can be taken as R n .(When we specialize later to HMC for sampling a target density over R d , we will set S = R 2d .)The goal of random-walk Metropolis, as with any MCMC method (MacKay, 1998;Geyer, 2011), is to sample from a target probability density function (pdf) π over S. We assume that π is absolutely continuous (AC), meaning that it can be represented by a nonnegative function.We assume the usual normalization π(x) dx = 1 (although all MCMC methods discussed can handle unnormalized π).Unless indicated, all integrals are over S.
A Markov chain is defined by its transition kernel k(x, y), which gives the probability density function of transitioning to the next state y, conditioned on the current state x.The normalization is thus k(x, y) dy = 1, ∀x ∈ S . (1) More formally, the transition kernel is a measure that depends on the parameter x, and only when this measure is AC can it be written as a kernel function k(x, y).We refer the reader to (Stein and Shakarchi, 2005;Hunter and Nachtergaele, 2001;Billingsley, 2012;Andrieu et al., 2020) for background on measure theory.We will need to handle non-AC cases, but only for measures that can be described using Dirac delta distributions, so will avoid technical language.A necessary condition for MCMC to sample the correct pdf π is its invariance under the transition operator2 , One way to ensure invariance is to construct kernels which maintain detailed balance (DB, also called "reversibility"), meaning For AC kernels this simply means that the two sides are equal for almost all x, y ∈ S.
For non-AC kernels the two sides may not be defined (e.g., infinite) for pairs (x, y) of interest, and one should interpret the left and right sides (once multiplied by dxdy) as measures over the Cartesian (tensor) product space S × S (see, e.g., (Billingsley, 2012, Ch. 18)).Then (3) should be interpreted as the left and right side being equal as product measures, which means the weak sense π(y)k(y, x) dx dy , for all (measurable) subsets A, B ⊂ S.
(4) For ease of reading we will write statements of the form (3) about kernels over (x, y) that represent product measures, with the understanding that they should be interpreted as in (4).
Finally, we recall the crucial fact that detailed balance implies invariance, which follows by substituting (3) into (2) then using (1).3

Metropolis-Hastings
The MH algorithm involves a proposal kernel q(x, y) which gives, for each starting state x, the (normalized) pdf over proposed states y.For now we will assume that q(x, •) is AC for each x ∈ S. The proposal is accepted with some x-and y-dependent probability α(x, y), in which case the next state is y, otherwise the next state remains as x.Thus the transition kernel k(x, y) defining the resulting Markov chain is, for each x, a mixture of the pdf q(x, y) and the (rejected) point mass at y = x, k(x, y) = q(x, y)α(x, y) + δ x (y)r(x) , where r(x) is the probability of rejection.Here δ is the Dirac delta distribution defined in Euclidean space by δ(x) = 0, ∀x = 0, and δ(x) dx = 1, and we use the notation δ x (y) = δ(x − y).Since the second term in (5) is x ↔ y symmetric whatever the form of r(x), then for detailed balance (3) to hold, we only need the condition on the first term π(x)q(x, y)α(x, y) = π(y)q(y, x)α(y, x) . (6) In the case where q is not AC, then (6) should be taken in the sense of equality of product measures described above.If q is AC and everywhere positive then the standard MH acceptance formula α(x, y) = min π(y)q(y, x) π(x)q(x, y) , 1 is the most efficient4 choice of α that satisfies (6).

Delayed rejection for Metropolis-Hastings
Here we summarize standard delayed rejection as introduced in (Mira, 1998;Tierney and Mira, 1999;Green and Mira, 2001).The idea is make a second proposal with kernel q 2 (x, s, y) to y if the first proposal q 1 (x, s) from x to s is rejected (see Fig. 3(a)).Note that q 2 may depend on both the current state x and the rejected state s.The transition kernel analogous to (5) must account for three possible ways to end up at state y: i) acceptance of q 1 (x, y), for which one uses the usual MH probability α 1 (x, y) obeying detailed balance (6); ii) acceptance of the second proposal, which occurs with some new probability α 2 (x, s, y); and iii) rejection of this second proposal.For cases ii) and iii) one must marginalize over all possible rejected first tries s.Thus the transition kernel is ) where r 2 is the probability of rejection of the second proposal5 .The factors q 1 (x, s)[1 − α 1 (x, s)] in the integrand are the probabilities of making the first proposal (q 1 ) then rejecting it (1 − α 1 ).The goal is then to choose α 2 (x, s, y) such that DB is satisfied for the kernel k given by (8).We have already established that this holds for the first term and the r 2 term, which leaves only the middle q 2 term.Substituting this middle term into the DB condition (6) gives where s and s are (unrelated) dummy integration variables.As before, if q 2 (x, s, •) is AC, then this condition must hold for almost all x, y ∈ S. As most clearly explained by Mira (Mira, 1998, Sec. 5.2), one way (but not the only way) to enforce this condition is simply to set the integrands equal.6This gives π(x)q 1 (x, s)[1−α 1 (x, s)]q 2 (x, s, y)α 2 (x, s, y) = π(y)q 1 (y, s)[1−α 1 (y, s)]q 2 (y, s, x)α 2 (y, s, x), (10) which now must hold for (almost) all x, y, s ∈ S.Then, again assuming AC proposal pdfs with everywhere positive densities, the acceptance probability for the second proposal that maintains DB with the least rejection is (Tierney and Mira, 1999) Note that when we propose delayed rejection for maps coming from HMC in the next section, we will not be able to use this method of Mira and Tierney, and will need to return to the general integral condition (9).For a MH proposal, as compared to Eq. 7, this acceptance probability has an extra factor q1(y,s)[1−α1(y,s)] q1(x,s)[1−α1(x,s)] which balances the probability of the first proposal being rejected at y and x respectively.

MH with deterministic proposals given by maps
Metropolis-Hastings is usually presented assuming absolutely continuous proposal densities, so that equation ( 7) may be formulated.However, HMC involves MH proposals that are given by deterministic maps, which are not AC, rendering (7) meaningless in this setting.Thus, in this section we derive rigorously the condition on the acceptance probability guaranteeing detailed balance, for the relevant class of maps.By a map we mean a smooth function F : S → S, which thus takes each state x to its image state y = F (x).The corresponding proposal kernel is which simply places the entire unit point mass at the point y = F (x), hence is deterministic.We will need the following two definitions.
Definition 1 (Involution).A map F : S → S is an involution if F −1 = F as maps, that is, F 2 = I where I is the identity map.
Definition 2 (Volume-preserving).A map F : S → S is volume (Lebesgue measure) preserving if dx , for all (measurable) subsets B ⊂ S , where F (B) := {F (x) : x ∈ B} denotes the image of the set B.
If DF (x) ∈ R n×n is the Jacobian derivative matrix with elements (DF (x)) ij = ∂F i (x)/∂x j , i, j = 1, . . ., n, then it is a standard result that volume preservation is equivalent to | det DF (x)| = 1 for all x ∈ S, i.e., a unit Jacobian determinant everywhere.(See, e.g., (Billingsley, 2012, Thm. 17.2).) If MH is performed using deterministic proposals coming from maps that are in both of the above special categories, then there is a particularly simple condition that the acceptance probability should obey for DB to hold, as follows.A simple proof is provided in Appendix A.
Lemma 3 (MH using a deterministic volume-preserving involution).Let π be an AC target density.Let F be a volume-preserving involution.Then MH with the deterministic proposal kernel q F given by (12), with acceptance probability α obeying has detailed balance with respect to π, and therefore has π as an invariant density.
The key point here is that the formula (13) for the acceptance probability does not depend on the function F at all, just on the ratio of target densities.We have not found this well explained in the literature.This will allow us in the following sections to place HMC, and our proposed DRHMC method, on a rigorous footing.q p π( ) shows the target density π(q) (bottom), the associated potential U (q) (middle), and the resulting contours in 2D phase space (q, p) of the Hamiltonian H given by ( 14).Each leapfrog step L moves approximately along such a contour.For step 2 of HMC, the proposal move F = L n ε P is sketched, for n = 3, where P is the momentum flip.(b) shows the p randomization (Gibbs move) in step 1 of HMC (red shows density of the kernel living on the d-dimensional slice q = constant).(c) shows the composition of steps 1 and 2, comprising one HMC iteration (again red shows the resulting Markov kernel density, which lives on the union of a curved d-dimensional manifold and a constant-q slice).

Classical Hamiltonian Monte Carlo
As our final piece of background, we review the Hamiltonian Monte Carlo (HMC) algorithm (Neal, 2011).We change the notation in this and the next section to overload q, with q ∈ R d now denoting the parameter vector of interest that is to be sampled. 7The target pdf, which we call π, is assumed to be continuous and differentiable.To draw samples q from π(q), HMC reinterprets the parameters of interest as a position vector with associated potential energy function U (q) = − log π(q), and simulates a Markov chain by approximating the following Hamiltonian dynamics.One introduces an auxiliary momentum vector p ∈ R d , which contributes a kinetic energy term K(p) = 1 2 p T M −1 p, where M is some symmetric positive definite mass matrix that we take as fixed.Then the Hamiltonian H : R 2d → R is the total energy function for the state x := (q, p), The state space S = R 2d is called phase space; see Fig. 2(a) for an illustration.Given initial data x(0) = (q(0), p(0)), the evolution of this physical system with respect to time t is the first-order ODE system called Hamilton's equations, where • = d/dt indicates the time derivative.Intuitively, this motion is that of a pointmass "rolling around" in the potential well U , in the absence of friction.The force vector −∇U attracts the ball so that it accelerates towards low-potential (high-probability) regions.
HMC generates samples x from the Gibbs pdf (also known as the Boltzmann or canonical distribution from statistical mechanics) defined by H, namely where Note that, since H was the sum of potential and kinetic terms, q and p are independent, with the qmarginal of π(x) being the target density π(q).Thus, given samples x (i) from π, by extracting their first d coordinates one obtains samples from π.
HMC uses as its main step an MH step using a proposal from a particular deterministic map F , which happens to approximate Hamiltonian dynamics over a certain length of time T followed by a negation of the momentum.The key property of this map-guaranteeing that is has the correct invariant distribution π-will be that it is a volume-preserving involution; the ancillary fact that it is an approximation to Hamiltonian dynamics is only relevant for creating a high mixing rate without excessive rejection in the MH acceptance step.However, the exact dynamics is restricted to a level set (energy shell) H(q, p) = constant (Neal, 2011, (2.13)), and staying permanently on this level set would not sample (16) correctly.Thus, HMC alternates these Metropolis steps with a Gibbs sampling step that draws a fresh p ∼ N (0, M ).This Gibbs update preserves the stationary distribution because the p and q terms factor in (16).Because the Metropolis and Gibbs updates both preserve the stationary distribution, so does their composition, which may be viewed as a single update in a Markov chain.
Let L ε be the map that performs one leapfrog (Verlet) step with time step ε > 0. Precisely, its action (q , p ) = L ε (q, p) is computed by the three sequential substeps, The composition of n = T /ε such leapfrog steps is a O(ε 2 )-accurate approximation to the exact dynamics (15) evolved to time T (e.g.see (Neal, 2011) for a derivation of the order of accuracy).This composition is a volume-preserving involution, but does not conserve H exactly. Also we will need the "momentum flip" operator P defined by P (q, p) = (q, −p).
With these defined, a single HMC iteration from the current state x (i) := (q (i) , p (i) ) comprises the two sequential steps: Step 1. Gibbs sampling: Resample the momentum p (i) from its Gaussian marginal distribution p ∼ N (0, M ), without changing q (i) .8This randomization step is shown as R in Fig. 2(b).(Note that there exist variants using partial randomization that we will not explore here (Neal, 2011;Sohl-Dickstein et al., 2014).) Step 2. Metropolis update: Perform a Metropolis update on x (i) , using a deterministic map F = L n ε P (here we compose operators to the right, so that P is the final operator), where n is a predetermined number of steps, ε > 0 is a time step, and L ε and P are the maps defined above.The proposal approximates Hamiltonian dynamics for time T = nε, followed by a p flip, as sketched in Fig. 2. Here, writing x = x (i) as the current state and y = F (x) as the proposal, the step is accepted with probability being the most efficient rule satisfying (13).Upon acceptance x (i+1) ← y, else x (i+1) ← x (i) .
After each such iteration, i is incremented, resulting in a Markov chain {x (i) } i=0,1,... from which expectations under π may be estimated in the usual fashion (Geyer, 2011).
The following mathematical result, while covered recently using much more technical notation (Andrieu et al., 2020), has a simple proof that we give in Appendix B.
Theorem 4 (HMC has the correct invariant pdf).Let π be a continuous, differentiable, positive pdf over R d , with associated Gibbs pdf π over R 2d given by (16).The Markov chain with HMC update, given by the composition of steps 1 (Gibbs) and 2 (MH) defined above, has π as an invariant pdf.
In short, the proof is that step 1 (Gibbs) and step 2 (MH) each independently preserve π as an invariant pdf, thus so does their composition.In particular for step 2 this hinges on Lemma 3 applied to F = L n ε P ; its approximation of Hamiltonian dynamics is irrelevant for the proof.It is also a common misunderstanding that their composition (the HMC iteration) obeys detailed balance: although steps 1 and 2 separately do, their composition in general does not.
It is worth pointing out that while first-order leapfrog integration (L) of Hamilton's equations is the most commonly used proposal in HMC, it is not the only choice.The leapfrog integrator itself can be extended to higher orders (Creutz and Gocksch, 1989;Yoshida, 1990).Neal points out that a modified Euler step is valid (Neal, 2011), and recent works have proposed using other maps, such as implicit integrators (Pourzanjani and Petzold, 2019;Brofos and Lederman, 2021a) for multiscale distributions or generalizing HMC with neural networks (Levy et al., 2017).However, a lesson of the above is that approximating Hamiltonian dynamics is not necessary to have the correct invariant pdf; it is merely a convenient way to propose long-distance moves with high acceptance rates.
x s y g 1−α1(y,g) 1−α1(x,s) ; for α 3 (x, s 1 , s 2 , y) see ( 28) Figure 3: Sketch of states (in the augmented state space R 2d ) involved in delayed rejection HMC.In this sketch, the locations are chosen purely to aid visualisation.(a) Basic 2-stage scheme with proposal maps F 1 and F 2 , each of which is a certain number of leapfrog steps followed by a momentum-flip, hence an involution (see Section 3).The first map F 1 generates a proposal s = F 1 (x), which is accepted with probability α 1 (x, s).If the first proposal is rejected, the second proposal y = F 2 (x) is accepted with probability α 2 (x, s, y).The target density at the "ghost" g = F 1 (F 2 (x)) is also needed.(b) 3-stage scheme, involving a third proposal map F 3 (see Section 3.1).The target density must be evaluated at 2 3 = 8 states, four of which are ghosts (shown by open circles).

Delayed rejection for HMC
Finally we have all the tools to combine delayed rejection (DR) with HMC.We call the resulting algorithm DRHMC.As with classical HMC, we work with the extended state x = (q, p) to sample from the desired distribution π(q), which is the marginal of π(x), the resulting Gibbs pdf (16) over the extended state.As in Section 2.2 we use s to represent intermediate proposals in DR that have been rejected, and y will always represent the most recent proposal, i.e. the proposal made in the current DR stage.
We keep the Gibbs step unchanged and apply DR only to the Metropolis step.Consider F 1 = L n ε P , a deterministic proposal map for some time step ε and number of leapfrogs n.The first acceptance probability remains the same as in classical HMC: gets rejected, this suggests a possibility that π(s)/π(x) is much less than 1, indicating very poor approximate energy conservation so that ε was too large for stable integration.This motivates a second proposal via a mapping F 2 which uses a smaller ε.The resulting second kernel is q 2 (x, s, y) = δ(y − F 2 (x)), which is independent of s.
We now derive the detailed balance condition for α 2 in a general setting.We assume only that the maps F 1 and F 2 are volume-preserving involutions, which is satisfied for HMC maps as discussed in the previous section.
For the second proposal, recall the general detailed balance condition (9) for delayed rejection.Substituting the above deterministic q 1 and q 2 kernels into this gives However simply setting the integrands equal, as done by Mira and Tierney to get (10), fails here since the LHS delta selects s = F 1 (x) and the RHS delta selects s = F 1 (y), but F 1 is injective so s = s could only hold if x = y.Instead one evaluates the two integrals to get which must hold as kernels over (x, y).Yet since F 2 = F −1 2 , the two delta distributions are the same, so equality holds as singular measures living on the manifold (18) To maximize the acceptance rate while obeying this constraint we set Since all proposals are deterministic in DRHMC, it is now useful to simplify notation by folding the known image points into the acceptance probabilities, This allows us to write the second acceptance probability obeying detailed balance as Note that this is the same as the acceptance relation (11) from plain DR in the Metropolis case, but setting all the proposal densities q to unity.However, we emphasise that its derivation is quite different, requiring care with deterministic maps, and relying on them being volume-preserving involutions.In addition to the initial point of the trajectory x and the two proposals F 1 (x) and F 2 (x), this rule demands, via α 1 (F 2 (x)), the pdf at a fourth state g = F 1 (F 2 (x)).This is the first proposal that would have been made in a hypothetical chain, had we started the chain in the reverse direction i.e. starting from y to go to x. Hence we call it a ghost preimage of the second proposal.See Fig. 3(a).While it is never proposed in the forward direction, maintaining DB requires us to evaluate the density at this point.
Finally, we describe the form of the new proposals that we test.We consider delayed rejections which reduce the step size of the leapfrog integrator by a constant adaptivity factor a > 1 but maintain the same trajectory length or time of integration (T ).Hence we will propose F 2 = L an ε/a P .In Section 4 we will show that this allows us to explore regions in the phase space that otherwise face persistent rejections with classical HMC.This completes the simplest form of DRHMC; however, we find that the higher-order proposals described next can also help.

Higher order proposals
The previous section focused on making a second proposal when the first proposal in HMC gets rejected.The same formalism can be extended to allow a third proposal upon rejection of the second, a fourth upon rejection of the third, and so on.In this section, we explicitly derive the acceptance probability for the third proposal in DRHMC and give a general recursive relation for k th proposal.Mira (Mira, 1998) presents similar acceptance probabilities for higher-order delayed proposals in the Metropolis-Hastings case.We also discuss the growth of the cost with number of proposals, since this determines the tradeoff with increased acceptance rate of DRHMC.
If we reject the first two proposals starting from a state x ∈ S, namely s 1 = F 1 (x) and s 2 = F 2 (x), we make a third proposal via a map F 3 .The resulting proposal kernel is In this case, the transition kernel analogous to (8) must account for four possible ways to end up at a state y: accepting the i) first, ii) second or the iii) third proposal with their respective acceptance probabilities or iv) rejecting all and maintaining the current state.We have established the acceptance probabilities of case i) and ii) in the previous section.For cases iii) and iv), the transition kernel must now marginalize over all possible rejected first and second proposals s 1 , s 2 , k(x, y) = q 1 (x, y) α 1 (x, y) + q 1 (x, s 1 ) [1 − α 1 (x, s 1 )] q 2 (x, s 1 , y) α 2 (x, s 1 , y) ds ( 24) where r 3 is the probability of rejecting the 3rd proposal.
As we saw in the previous section, since the first and second proposals are independent of the third proposal, their acceptance probabilities α 1 and α 2 are given by ( 6) and ( 18) respectively and hence the first two terms of Eq. ( 26) will maintain DB.As with r 2 , since r 3 also lies on the diagonal, it will also maintain DB regardless of its form.Thus to maintain detailed balance for the third proposal, we only need the condition on α 3 , where we have simplified the notation for acceptance probability via α 3 (x) := α 3 (x, F 1 (x), F 2 (x), F 3 (x)).Then following the same steps as in the derivation of α 2 and evaluating the two integrals allows us to write the the 3rd acceptance probability as Continuing in this way, one can write down a recursive relation for the acceptance probability of the kth proposal obeying detailed balance, where y = F k (x), and α k (x) := α k (x, F 1 (x), ..., F k (x)) is the notational shorthand.

Growth in cost with respect to the number of proposals
It can be tempting to keep making successively higher order proposals to increase the acceptance rate, however it is important to be mindful of their increasing cost.Thus we explicitly write down the full-form of α 1 and α 2 in the acceptance probability for the third proposal again to see the various ghost preimages that need to be evaluated.Recall that where Substituting these forms in (28) we see that the denominator involves estimating the density at points x, F 1 (x), F 2 (x) and F 1 (F 2 (x)) while the numerator requires the density at F 3 (x), F 1 (F 3 (x)), F 2 (F 3 (x)), and F 1 (F 2 (F 3 (x))).Of these 2 3 = 8 points, only F 1 (x), F 2 (x) and F 3 (x) are proposals made in DRHMC and the remaining states are the various ghost preimages that need to be evaluated to maintain detailed balance.Their form is sketched in Fig. 3(b).Evaluating the acceptance conditions for the kth proposal requires 2 k log density evaluations.In our algorithm, computation is dominated by the number of gradient evaluations.9 Despite this apparent exponential growth in cost, the cost of DRHMC is only a constant factor larger than classical HMC run at the locally optimal step size.For instance, consider a DRHMC setup where the first proposal is F 1 with n leapfrog steps of size ε, i.e., a trajectory time of T = nε, and a sequence of step sizes ε/a, ε/a 2 , . . .for subsequent higher order proposals upon rejection.Let k be the smallest integer such that ε/a k−1 gives stable leapfrog integration.Then even for this optimal step size, classical HMC would need at least a k−2 n steps.On the other hand for DRHMC, since ε is greater than the largest stable step size for this trajectory, this first proposal will very likely be rejected due to instability giving very poor H conservation and hence a tiny acceptance ratio.Our proposed higher-order DRHMC scheme then makes proposals with aforementioned sequence of step sizes ε/a, ε/a 2 , . . ., so that the first that is likely to be accepted is the kth with step size ε/a k−1 .The total number of leapfrog steps needed up to (and including) a kth order DRHMC proposal is where the first term is for the F 1 maps, the second for the F 2 maps, etc. (See Fig. 3(b) for the k = 3 case.)This sum is nka k−1 for a = 2, or O(a k−1 n) for a > 2. Thus for a = 2, the DRHMC cost is only O(ak) more than HMC with the optimal step size, and for a > 2, the DRHMC cost is O(a) (independent of k) more than HMC.
Thus we may summarize as follows.
Remark 5.Although kth-order DRHMC has a cost per proposal that grows exponentially in k, the cost is only a constant factor more expensive than classical HMC proposals made with the "correct" (largest stable) step size.

Probabilistic Delayed Rejection
One way to reduce the average cost per iteration for DRHMC is to make the delayed rejections probabilistic and dependent on where we are in the distribution.To motivate how this can be helpful, consider a case when the cost of secondary proposal is much higher than the first proposal and even though the first proposal function is well tuned for most of the state space, there are certain hard regions which can only be sampled by the second proposal.In this scenario, while we need DR to correctly sample the full distribution, we do not need it throughout the phase space.Every time we make a secondary proposal upon getting a rejection in the good regions, we might not be trading excess cost with higher acceptance rate effectively.Thus instead of making the second and subsequent proposal mandatory upon a rejection, we would like to make them probabilistic such that we make a second proposal with probability p 2 (x, s) < 1.This modifies the second proposal kernel to q 2 (x, s, y) = p 2 (x, s)δ(y − F 2 (x, s)).As was the case for the second proposal map F 2 , this probability can also be informed by the previously rejected proposals in the same trajectory.One can follow the steps from the previous section to maintain detailed balance and show that this modifies the acceptance probability as Returning to the scenario outlined above, we see that one way to avoid secondary proposals in good regions is to construct a proposal probability that makes it less likely for a secondary proposal if the first proposal was rejected on random chance despite having high acceptance probability.On the other hand, if the first proposal was rejected strongly, which might indicate that we are in a bad region of the state space for the first proposal, we make it more likely to make a subsequent proposal with a new function.
A simple heuristic proposal probability to achieve this is where s j is the j th proposal made from the current position x, and α(s j ) is the acceptance probability of the last proposal.Ideally however one would choose the proposal probability p j+1 to maximize expected squared jump distance over effort for the next proposal.Detailed balance is maintained by including this factor p j+1 in the acceptance condition α j+1 for the j + 1 th proposal along the lines of Eq. 30.In the experiments section, we will show how probabilistic delayed rejection can preserve the efficiency of basic HMC for simple distributions where HMC is effective.

Experiments
In this section, we compare the performance of delayed rejection HMC (DRHMC) to that of standard HMC.

Setup
Given a current state x, HMC makes a proposal y = F 1 (x) where F 1 = L n ε P is the deterministic mapping that integrates Hamiltonian dynamics with leapfrog integration for n steps and step size ε.In DRHMC, we consider the first proposal to be the same as in HMC.Upon rejection of the first proposal, we make k − 1 subsequent proposals.For each of these, we reduce the step size by a fixed factor a > 1 while increasing the number of steps in proportion, to maintain a constant integration time.This corresponds to a deterministic mapping, For every experiment and configuration, we run 50 chains with 1000 iterations for burn-in followed by 20,000 sampling iterations.

Choice of parameters
HMC has three tuning parameters, the step size ε, the number of leapfrog steps n, and the mass matrix M .The total integration time is T = nε.
We use Stan (Stan Development Team, 2011) to tune the reference values of these parameters using the following two steps.
1. We use the no-U-turn sampler (NUTS) (Hoffman and Gelman, 2011) to select the integration time T .NUTS is an adaptive algorithm that automatically stops every leapfrog trajectory when it starts to double back and retrace its steps and biases draws along the trajectory to later in the trajectory in an attempt to maximize expected squared jump distance.Therefore, NUTS does not require tuning for T during the warm-up phase.Following (Wu et al., 2018), we choose time of integration T to be the 90th percentile of the trajectories followed by NUTS. 10 2. After fixing T , we re-run Stan with HMC to estimate the optimal step size ε f and a diagonal mass metric, M .
In addition to the HMC tuning parameters for integration time and step size, DRHMC has tuning parameters k for the total number of of subsequent proposals made and a for the divisor by which step size is reduced for every subsequent proposal.To develop an understanding of how these parameters impact the performance of DRHMC, we report results for the grid of configurations with k ∈ {2, 3, 4} and a ∈ {2, 5, 10}.
With its ability to reduce step sizes in subsequent proposals, DRHMC is more robust to the initial tuning of step size.To demonstrate this, we evaluate HMC and RHMC with fixed and initial step sizes at, above and below the adapted step size,

Metric of comparison
To measure sampling performance, we report the umber of log density and gradient evaluations required per effective draw, that is, where N evals is the total number of log density and gradient evaluations in the Markov chain, and ESS is the effective sample size for a parameter estimate extracted from the chain.Log density and gradient evaluations dominate the cost of HMC, allowing us to ignore other implementation details.Thus C is the inverse of efficiency; smaller C is better.Its value will depend on the expectation being evaluated, so we report results for posterior means of parameters θ and their squares θ 2 , the latter of which measures performance in estimating variance.
If ρ t ∈ (−1, 1) is the autocorrelation of a quantity in the Markov chain at lag t, the effective sample size is where N is the total number of iterations (Geyer, 2011).Standard errors for estimating parameters are then derived from the MCMC central limit theorem as as where sd is posterior standard deviation.The central limit theorem states that as effective sample size grows, errors approach a normal distribution, θ − θ ∼ N (0, se 2 ) .
10 Unlike (Wu et al., 2018), we do not jitter the number of leapfrog steps.This is usually a reasonable approximation even for modest effective sample sizes.Alternatively, if we know the true posterior mean value θ, we can run independent Markov chains and calculate errors θ − θ.The sample standard deviation of the errors can be used to estimate se, from which we can back out effective sample size as ESS = sd se 2 .
In the following experiments, depending on whether we have access to the true parameter distributions, we will show results in terms of cost per effective sample calculated by autocorrelation length (C r ) or estimated through errors in cases where posterior means and variances are known (C c ).In experiments with more than one parameter being sampled, we will show the cost for the parameter that mixes the slowest in the sense of having the lowest effective sample size.We run multiple Markov chains and measure per-chain variation in cost by applying the bootstrap technique across chains.

Neal's funnel
We begin our experiments with the problem of sampling Neal's funnel, upon which we touched in the introduction.In d dimensions, given a variance σ 2 , we are interested in sampling q = {β, α 2 , . . ., α d } from Neal's funnel distribution (Neal, 2003), which is defined by where as usual we use N (µ, σ 2 ) to denote a normal pdf with mean µ and variance σ 2 .The resulting target pdf is Following Neal we set σ = 3.This distribution has equal probability mass in the regions β < 0 and β > 0. However the distribution has a wide range of length scales due to the curvature changing as β ranges from large to small values (see Fig. 1).This makes it challenging to sample the funnel efficiently with a constant step size.We can illustrate this with the help of Figure 4, which shows the empirical marginal of parameter β when sampling the funnel in d = 20 dimensions with NUTS and HMC for different settings.The correct marginal for β is N (0, 3 2 ) which means that about ∼ 5% of samples should lie at β < −5.However even for 0 = 0.2, there are no samples in this regime.For NUTS, to push to β < −5, the step size had to be reduced such that 99% of all proposals are accepted, as compared to 80% default value of Stan and 65% fraction considered optimal for normal distributions in HMC (Beskos et al., 2013).To explore beyond the 3σ region, as required to get the right results for modest tail statistics, we need to reduce the step size further to ε = 0.01.
The extremely small step size necessary to explore the neck of the funnel is very inefficient for exploring the mouth of the funnel.As β grows, the marginal p(α) approaches a lognormal distribution with σ = 3, and thus has long tails.The expected value of α 2 is on the order of 10 2 , whereas the expectation of α 4 is on the order of 10 8 .Due to the scale of the mouth of the funnel, the optimal step size is much larger than ε = 0.01 required to sample the neck of the funnel.Figure 4 provides an illustration of how well HMC can cover the mouth and neck of the funnel based on step size (left panel), as well as a comparison of the densities of accepted and rejected proposals for various step sizes (right panel).All of the step sizes are able to sample the mouth of the funnel, however inefficiently, but in the neck of the funnel, acceptance rate dwindles to a sharp cutoff below which HMC is unable to sample.To sample the tails of β < 3 • sd[β], we need to reduce step size even further below ε = 0.01.
Figure 5 shows how DRHMC can mitigate the sharp cutoff in the neck of the funnel by reducing step size as needed.The left plot shows the marginal density p(β) sampled with NUTS and HMC for step size ε 0 = 0.1, as well as one-retry DRHMC with different stepsize reduction factors, a.When the second proposal step size is reduced by factor of 10, DRHMC is able to sample β to ±3σ.The right panel shows the density of rejections and acceptances for the first, second, and third proposals of DRHMC with a larger step size (e 0 = 0.2), but allowing multiply retries with a reduction of a = 2.With the possiblity of three proposals, an initial step size of e 0 = 0.2 is also able to sample β to ±3σ.square errors to estimate effective sample size (ESS c ).This requires reference samples from the distribution, which are simple to generate independently using a non-centered parameterization of the funnel (Betancourt and Girolami, 2015).
Figure 6 shows that DRHMC is consistently a factor of 4 more efficient than HMC in terms of log density and gradient evaluations required for a given effective sample size; in some configurations the advantage is as much as a factor of 8. We restrict attention to configurations for which HMC is able to sample β to plus or minus three standard deviations (i.e., ≤ 0.1).For all configurations, we applied Kolmogorov-Smirnov (KS) tests to the body and tails of the distributions to ensure we are sampling the correct distribution.

Eight schools model
One of the motivating applications for Bayesian hierarchical modeling was a metaanalysis of the effects of a test preparation intervention on students in eight schools (Rubin, 1981).The data consists of the differences in pre-test and post-test scores, which are reported as an average y n and standard deviation σ n for each school n.The hierarchical model uses parameters θ n for the efficacy in each school and assigns them a hierarchical normal prior with unknown location µ and scale τ .The generative model is as follows (Gelman et al., 2013). 11  µ ∼ N (0, 5 2 ), τ ∼ Cauchy + (0, 5), θ n ∼ N (µ, τ 2 ), and The hyperparameter µ represents the average treatment effect across schools and τ the scale of variation of effects among schools.As τ → ∞, the model approaches no pooling, i.e., each of the school treatment effects is estimated independently.As τ → 0, the model approaches complete pooling, i.e., all of the school treatment effects are the same.For small values of τ , the school-level effects θ n are squeezed together; for large values, they are allowed to vary widely.This yields a multiscale, funnel-like geometry in the τ and θ parameters, where we would expect delayed rejection to improve the performance of baseline HMC.
Figure 7 evaluates several configurations of the DRHMC algorithm as applied to the eight schools problem (see 4.1), plotting the cost of each configuration using the standard error method (C c ) for the slowest mixing parameter.We use the reference samples provided by the posteriordb database12 to estimate the mean and variance  /2 (1)  0/2 (2)  0/2 (3)  0/5 (1)  0/5 (2)  0/5 (3)  0/10 (1)  0/10 (2 of the parameters as needed to calculate error-based effective sample size (ESS c ).The best DRHMC configuration improves over the best HMC configuration by a factor of three for estimating the parameter mean.Different configurations for HMC perform the best for the first and second moment, with the cost of second moment estimation by DRHMC being on par with that of HMC.

Gull's lighthouse
Challenging posterior geometries arise even in simple two dimensional problems if the data is not very informative.For example, consider estimating the direction of flashes emanating from a coastal lighthouse (Gull, 1988, p. 59).Assume the lighthouse is at position x 0 along a straight coast at distance y into the sea.Its light is spinning and emits a series of collimated flashes at random intervals which are then detected, each at a single point on the coastline.Given N f flashes recorded at the positions x i , i = 1, . . ., N f , we perform a Bayesian estimation of the position of the lighthouse (x 0 , y).
The lighthouse flashes in a random direction θ, relative to vertical, drawn from a uniform distribution on (−π/2, π/2).Such a flash will be observed at location x i on the coast, where θ = arctan((x i − x 0 )/y).Applying a change of variables, the likelihood of observing a flash is With improper uniform priors on x 0 and y, and the assumption that the flashes are independent, the posterior is proportional to the product of observation likelihoods, In Figure 8, we show the cost C r for the case with N f = 3 flashes observed at x i = 0.9, 1.2, 1.21, for both the parameters x 0 and y.We estimate ESS by measuring autocorrelation length of the chains since there are no reference samples available for this model.For estimating y, whose effective sample size is an order of magnitude lower than that of x 0 with HMC, DRHMC is a factor of five more efficient; there are no gains in sampling the parameter x 0 that mixes well with HMC.

Gaussian Mixture Model
Mixture models present problems for samplers with fixed step sizes when the mixture components are of different scales.Multimodal distributions whose components have varying geometries also defeat global tuning for HMC.HMC relies on tuning these parameters before sampling (for example, Stan first runs a warmup phase that performs adaptation before sampling begins).As in other intrinsically multiscale problems, DRHMC has the potential to outperform baseline HMC by using different proposal scales in different regions of the state space.
To simulate the situation arising with multivariate posteriors, we consider a univariate Gaussian mixture with equal mixing weights on the components.We take fairly separated locations µ i that still allow mixing.The scales σ i then vary by an order of HMC Configurations  /2 (1)  0/2 (2)  0/2 (3)  0/5 (1)  0/5 (2)  0/5 (3)  0/10 (1)  0/10 (2) Figure 9: Cost per effective sample for HMC and DRHMC for the Gaussian mixture model with components varying in scale by a factor of ten.The two panels show the cost for estimating the mean of θ and θ 2 , the second of which determines variance.Symbol legends are the same as in Figure 7. ESS is estimated with the standard error method after generating reference samples from the Gaussian mixture model.magnitude.The model pdf is where we fix φ 1 = 0.5, φ 2 = 0.5 µ 1 = 0, µ 2 = 3, and σ 1 = 0.1, σ 2 = 1.
Our goal is then to sample the univariate parameter θ ∈ R. We choose this simple problem for illustration because sampling mixtures only becomes more challenging in higher dimensions with differently conditioned components, in situations where the modes are either more widely separated or more highly overlapping, or when the weights of the components are highly skewed.The optimal step size for the components is directly proportional to the component's scale, which varies by an order of magnitude.
Figure 9 shows that the best DRHMC configurations can be twice as efficient as HMC.This gap can be made arbitrarily wide by increasing the number of dimensions and the difference in scales between the modes.

Stochastic volatility model
Finally we consider an example that does not suffer from the pathology of multiscale distributions, but is still challenging due to high dimensionality and correlated parameters.Stochastic volatility models (Kim et al., 1998) seek to model the volatility (i.e., variance) of a return on a financial asset, such as an option to buy a security.This changing volatility is modeled as a latent stochastic process in discrete time.Given the HMC Configurations  mean corrected returns y t on an underlying asset at T equally spaced time points as input data, we are interested in estimating the latent parameter h t for the log volatility, mean µ and variance σ of log volatility, as well as the persistence of the volatility φ.Thus the parameter vector is q = {µ, σ, φ, h t=1,...,T }, with φ ∼ uniform(−1, 1); σ ∼ Cauchy(0, 5); µ ∼ Cauchy(0, 10) σ 2 1 − φ 2 ; h t ∼ N µ + φ(h t−1 − µ), σ 2 , t = 2, 3, . . ., T y t ∼ N 0, e ht , t = 1, 2, . . ., T.
The posterior exhibits varying curvature due to the hierarchical prior on the volatility parameters, which reinforces the natural correlation among the volatility estimates due to their sequencing in time.Figure 10(a) shows that the the additional computational cost of DRHMC ends up making it more costly per effective sample than HMC.
The cost of DRHMC is high her because the original step size is optimal, so that retrying with a lower step size only doubles computational costs.To close the gap with fixed step size HMC when scales do not vary, we introduce a probabilistic modification of DRHMC that only retries when the previous proposal had a high probability of being rejected.Specifically, We consider a scheme in which subsequent proposals are made with probability p(x, s j ) = 1 − α(s j ), where s j is the previous proposal made from x and α(s j ) is its acceptance probability.Figure 10(b) shows how retrying with a probability equal to the original failure chance avoids needless step size reduction, allowing DRHMC now to exceed slightly the efficiency of HMC.

Discussion
We introduced a novel application of delayed rejection to Hamiltonian Monte Carlo (HMC) sampling in which subsequent proposals are made for the same integration time at a reduced step size.We showed that in multiscale posteriors such as mixture models or hierarchical models, delayed rejection can boost performance by a factor of five or more.We provided a proof that even if the initial step size is chosen to be too large, delayed rejection introduces at most a factor of two additional cost over choosing the optimal baseline step size.We also proved, in an accessible fashion avoiding measure theory, detailed balance for both classical Hamiltonian Monte Carlo and our new proposal.
In cases where the target density is not multiscale, we introduced a novel form of delayed rejection where retries are only attempted when the previous proposal had a high chance of failure.Unlike the case for HMC, which will fail with potentially hard to diagnose biases, DRHMC with probabilistic retries is robust to the initial choice of step size, thus reducing overall costs when tuning step size is expensive.
In realistic problems, we often do not know if our target distribution suffers from multiscale or varying geometry pathologies as in the cases we considered.For example, varying amounts of data and varying noise ratios in the data can dramatically change the posterior geometry, changing roughly normal posteriors into funnels or vice-versa, depending on the model parameterization (Papaspiliopoulos et al., 2007;Betancourt and Girolami, 2015).
Reducing the step size is not the only way of constructing delayed proposals.Another approach would be to replace the leapfrog integrator altogether for retries, for example with an implicit symplectic integrator (Pourzanjani and Petzold, 2019).Such integrators may additionally be able to deal with stiffness arising from high correlation.Delayed rejection HMC could also be combined with other improvements to HMC, e.g.ensemble preconditioning (Matthews et al., 2016), Riemannian HMC (Betancourt and Girolami, 2015), and manifold HMC (Au et al., 2020).
Proof of Theorem 4. It is sufficient to show that each step in the pair is π-invariant.This holds for step 1 (the Gibbs update of p) since it preserves the conditional over p, which is identical at each fixed q, while leaving q unaffected.It holds for step 2 (one deterministic MH step) since by Lemma 8, F = L n ε P is a volume-preserving involution, so one can apply Lemma 3.

Figure 1 :
Figure 1: Neal's funnel.(Left) Natural log density of the two-dimensional funnel (Equation 34) showing a region of high density but low volume ("neck", β < 0) to the left of a region of low density and high volume ("mouth", β > 0).(Right) Condition number of the inverse Hessian as a function of the log scale parameter β, along the slice α = 0.

Figure 4 :
Figure 4: Difficulty of sampling Neal's funnel in d = 20 dimensions.(Left) The marginal for β, which is N (0, 3 2 ) as shown in gray, when the funnel is sampled with NUTS (default settings and HMC for different step sizes.(Right) The fraction of accepted and rejected proposals for HMC as a function of β for different step sizes.All runs are done for 50000 samples and histograms are generated with bin-width of 0.1.

Figure 6 Figure 5 :
Figure6illustrates the efficiency gain of DRHMC over HMC for Neal's funnel.We show the cost per effective sample of β for funnels of dimensions d = 5, 10, 20, 100.ESS r can be biased to the high side because it only depends on the autocorrelation length of the chain and not that it is sampling the correct stationary distribution.Thus we use

Figure 6 :
Figure6: Efficiency of DRHMC for Neal's funnel.These plots show the ratio of cost per effective sample of β for DRHMC vs HMC when sampling Neal's funnel for different dimensions(d = 5, 20, 50, 100).ESS for the cost is estimated using standard error with reference samples (Eq.33).Black points and horizontal dotted black lines show the reference ratio (=1 for HMC with ε = 0.01).Different colors show the cost for DRHMC with different first step sizes ε 0 ; smaller is better.Different shapes correspond to different configurations (number of proposals k and reduction factor a).We show only configurations with ε min ≤ 0.01.Error-bars are estimated by bootstrapping over chains.

)Figure 7 :
Figure 7: Cost per effective sample for HMC and DRHMC for the eight schools model.The two panels show the cost for the slowest dimension for the first (θ) and second (θ 2 ) moments respectively.The cost for HMC is shown in black points, as estimated by using the standard error method for ESS.Different configurations for DRHMC are shown in different colors (reduction factor, a) and shapes (number of proposals, k).Different step sizes are separated by vertical dashed black lines.Dotted horizontal black line shows the cost for the reference configuration (as fit by Stan) of HMC.

Figure 8 :
Figure 8: Cost per effective sample for HMC and DRHMC for Gull's lighthouse model.The two panels show the cost for the two dimensions of the model.Symbol legends are the same as in Figure 7. ESS is estimated by measuring autocorrelation length of the chains.

Figure 10 :
Figure 10: Cost per effective sample for HMC and (a, top) DRHMC and (b, bottom) Probabilisitic DRHMC for Stochastic Volatility model.The two panels show the cost for the slowest for first (θ) and second (θ 2 ) moments respectively.Symbol legends are the same as in Figure 7.