Recycling intermediate steps to improve Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) and related algorithms have become routinely used in Bayesian computation. In this article, we present a simple and provably accurate method to improve the efficiency of HMC and related algorithms with essentially no extra computational cost. This is achieved by recycling the intermediate states along simulated trajectories of Hamiltonian dynamics. Standard algorithms use only the end points of trajectories, wastefully discarding all the intermediate states. Compared to alternative methods for utilizing the intermediate states, our algorithm is simpler to apply in practice and requires little programming effort beyond the usual implementations of HMC and related algorithms. Our algorithm applies straightforwardly to the no-U-turn sampler, arguably the most popular variant of HMC. Through a variety of experiments, we demonstrate that our recycling algorithm yields substantial computational efficiency gains.


Introduction
Markov chain Monte Carlo is routinely used to generate samples from posterior distributions. While specialized algorithms exist for restricted model classes, general-purpose algorithms are often inefficient and scale poorly in the number of parameters. Originally proposed by Duane et al. (1987) and popularized in the statistics community through the works of Neal (1996, 2010), Hamiltonian Monte Carlo promises better scalability (Neal, 2010; Beskos et al., 2013) and has enjoyed wide-ranging successes as one of the most reliable approaches in general settings (Gelman et al., 2013; Kruschke, 2014; Monnahan et al., 2016). The Stan and PyMC software packages take advantage of this generality and performance (Stan Development Team, 2015; Salvatier et al., 2016).
Given a parameter θ ∼ π_θ(·) of interest, HMC introduces an auxiliary momentum variable p and defines a distribution π(·) = π_θ(·) × N(0, M) on the augmented parameter space (θ, p) with a mass matrix M. A proposal is generated by simulating trajectories of Hamiltonian dynamics, where the evolution of the state (θ, p) is governed by a differential equation:

dθ/dt = M^{-1} p,   dp/dt = ∇ log π_θ(θ).   (1.1)

Proposals generated by this mechanism can be far away from the current state and yet accepted with high probability. In case M = I and θ ∈ R^2, the solution trajectory of (1.1) coincides with the motion of a frictionless puck sliding over a surface of height −log π_θ(θ) (Neal, 2010). The puck feels a "push" in the direction of the gradient ∇ log π_θ(θ), pointing toward higher values of log π_θ(θ). The momentum evolves accordingly but also gives the puck a tendency to keep moving in the same direction, helping HMC proposals explore the parameter space in an informed manner.
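In practice the dynamics (1.1) are discretized with the leapfrog integrator. The following minimal sketch (assuming M = I; the function names are ours, not from any particular package) shows one way to simulate a trajectory while retaining the intermediate states that our method later recycles:

```python
def leapfrog_step(theta, p, grad_log_pi, eps):
    """One leapfrog step approximating the dynamics (1.1) with M = I."""
    p = p + 0.5 * eps * grad_log_pi(theta)   # half step for momentum
    theta = theta + eps * p                  # full step for position
    p = p + 0.5 * eps * grad_log_pi(theta)   # half step for momentum
    return theta, p

def simulate_trajectory(theta0, p0, grad_log_pi, eps, n_steps):
    """Return all intermediate states F^k(theta0, p0) for k = 1, ..., n_steps."""
    states = []
    theta, p = theta0, p0
    for _ in range(n_steps):
        theta, p = leapfrog_step(theta, p, grad_log_pi, eps)
        states.append((theta, p))
    return states
```

The leapfrog integrator is symplectic, so the Hamiltonian H(θ, p) = −log π_θ(θ) + p^2/2 is approximately conserved along the trajectory; this is why proposals far from the current state can still be accepted with high probability.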
Current practice uses the last state F^K(θ_0, p_0) as a proposal and discards all the intermediate values F^k(θ_0, p_0) for k < K. As we will show, this is wasteful since the intermediate values can be recycled to generate additional samples from posterior distributions. Figure 1, to be explained in detail later, illustrates the benefit of recycling. The recycling algorithm only requires quantities that have already been sampled or computed, so there is essentially no extra computational cost. Our proposed recycling approach can also be applied directly to a wide variety of modified HMC algorithms (Neal, 2010; Girolami and Calderhead, 2011; Pakman and Paninski, 2013, 2014; Lan et al., 2014; Shahbaba et al., 2014; Fang et al., 2014; Zhang et al., 2016; Lu et al., 2016). Extensions to more complex variants are also possible, including the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014; Stan Development Team, 2015).
Our algorithm is distinguished by its simplicity and generality compared to alternative algorithms for utilizing the intermediate values of HMC (Neal, 1994;Calderhead, 2014;Bernton et al., 2015). Under our framework, one can typically implement an HMC variant as usual and simply add several lines of code to recycle the intermediate values using the familiar acceptance and rejection probabilities. On the other hand, the algorithm of Calderhead (2014) and its Rao-Blackwellization by Bernton et al. (2015) require a trajectory to be simulated forward and backward in a symmetric manner to satisfy the super-detailed balance condition (Frenkel, 2004;Tjelmeland, 2004). The proposal must then be followed by the acceptance-rejection step using the generalized Metropolis-Hastings algorithm (Calderhead, 2014) or assignment of appropriate weights to the intermediate values (Bernton et al., 2015). Importantly, these algorithms do not apply to NUTS, arguably the most popular variant of HMC. This is because NUTS yields a variable number of intermediate states and does not constitute a multi-proposal scheme necessary for using their algorithms.
The underlying idea behind our algorithm is most similar to Neal (1994) who realized that, in the variant of HMC that uses a collection of states in computing the acceptance probability, those states can be re-used when computing the posterior summaries through conditional expectation. Our theory is more general, however, and translates into methods to improve a variety of multi-proposal algorithms (Section 3).

Recycled Hamiltonian Monte Carlo
The following non-standard HMC algorithm accepts or rejects each of the intermediate values, enabling recycling of these samples.

Algorithm 1 (Recycled HMC). At each iteration:
1. Sample a momentum p_0 ∼ N(0, M).
2. Simulate a trajectory via the leapfrog steps as in standard HMC, yielding the intermediate states (θ_k, p_k) = F^k(θ_0, p_0) for k = 1, . . . , K.
3. Accept each intermediate state (θ_k, p_k) with probability min{1, π(θ_k, p_k)/π(θ_0, p_0)}; upon rejection, record the current state (θ_0, p_0) in its place.
4. Update the starting point of the next HMC iteration: (θ_0, p_0) ← (θ_K, p_K) if the final state was accepted, and keep (θ_0, p_0) otherwise.

The transition rule (θ_0, p_0) → (θ_K, p_K) above coincides with that of the standard HMC algorithm. The empirical measure N^{-1} Σ_{i=1}^N δ_{θ_0^{(i)}}(·) thus converges weakly to the target distribution. The following theorem, which is a consequence of a more general theory given in the next section, shows that the intermediate samples can actually be recycled as valid draws from the target distribution.

Theorem 1. If the samples (θ_k^{(i)}) for k = 1, . . . , K and i = 1, . . . , N are generated as in Algorithm 1, then

(NK)^{-1} Σ_{i=1}^N Σ_{k=1}^K δ_{θ_k^{(i)}}(·) →_w π_θ(·),

where →_w denotes the weak convergence of a measure. The density values π(θ_k, p_k) required for the acceptance-rejection steps are already computed during the trajectory simulation (Neal, 2010; Hoffman and Gelman, 2014) or can be obtained cheaply as a byproduct of computing the gradients ∇ log π_θ.
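A minimal sketch of the recycled HMC idea for a target with M = I follows; the function and variable names are ours, and the code is an illustrative implementation under those assumptions rather than a reference one:

```python
import numpy as np

def recycled_hmc(log_pi, grad_log_pi, theta0, eps, K, n_iter, rng):
    """Sketch of recycled HMC (M = I): every intermediate leapfrog state is
    accepted or rejected with the familiar Metropolis probability and
    recorded as a sample; the end-point decision drives the chain as usual."""
    samples = []
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        p = rng.standard_normal(theta.shape)
        log_joint0 = log_pi(theta) - 0.5 * np.dot(p, p)  # log pi(theta_0, p_0)
        theta_k, p_k = theta.copy(), p.copy()
        next_theta = theta
        for k in range(K):
            # one leapfrog step of the dynamics (1.1)
            p_k = p_k + 0.5 * eps * grad_log_pi(theta_k)
            theta_k = theta_k + eps * p_k
            p_k = p_k + 0.5 * eps * grad_log_pi(theta_k)
            log_joint_k = log_pi(theta_k) - 0.5 * np.dot(p_k, p_k)
            # accept-reject the k-th intermediate state
            if np.log(rng.uniform()) < log_joint_k - log_joint0:
                samples.append(theta_k.copy())  # recycle the accepted state
                accepted = theta_k.copy()
            else:
                samples.append(theta.copy())    # rejected: record current state
                accepted = theta
            if k == K - 1:
                next_theta = accepted           # standard HMC transition rule
        theta = next_theta
    return samples
```

Each iteration thus contributes K samples instead of one, at the cost of only a few extra comparisons, since the density values along the trajectory are computed anyway.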
Algorithm 1 uses the fixed path length K, but it is often desirable to randomize the number of steps to avoid periodic behavior in the trajectories of (1.1) (Neal, 2010). Recycling is justified under this setting as well:

Theorem 2. Consider a modified version of Algorithm 1 in which, at the i-th iteration, 1) the trajectory is simulated for a random path length L^{(i)} ∼ π_L(·) with L^{(i)} ≤ K and 2) the starting point for the next HMC iteration is set as the end point of the trajectory. Then the empirical measure over all the recycled samples again converges weakly to the target distribution. (2.3)

Theory Behind Recycling Algorithm
The validity of recycled HMC as in Theorems 1 and 2 follows from a more general principle below.
Theorem 3. Let P_k(· | ·) for k = 0, 1, . . . , K be transition kernels, each of which leaves the target distribution π(·) invariant. Generate a Markov chain {z^{(i)}}_{i≥1} with z^{(i)} = (z_0^{(i)}, . . . , z_K^{(i)}) by drawing z_0^{(i)} ∼ P_0(· | z_0^{(i-1)}) and z_k^{(i)} ∼ P_k(· | z_0^{(i-1)}) for k = 1, . . . , K. The chain is stationary with respect to π(·) and has the marginal densities z_k^{(i)} ∼ π(·) for all i and k. Then the following result holds: the pooled empirical measure over the z_k^{(i)}'s converges weakly to π(·). Additionally, the Markov chain {z^{(i)}}_{i≥1} is geometrically (or uniformly) ergodic if P_0(· | ·) is geometrically (or uniformly) ergodic.
We turn to the proof of a convergence rate (geometric or uniform ergodicity) of the chain {z^{(i)}}_{i≥1} under the corresponding assumption on P_0(· | ·). The conditional distribution of z^{(n)} given z_0^{(0)} depends on the past only through the sub-chain {z_0^{(i)}}_{i≥0}, which evolves according to P_0(· | ·). It follows that the total variation distance between the law of z^{(n)} and its stationary distribution is bounded by ‖P_0^{n-1}(z_0^{(0)}, ·) − π(·)‖_tv, where ‖·‖_tv denotes a total variation norm. Hence the chain {z^{(i)}}_{i≥1} inherits the convergence rate of P_0(· | ·).
Theorem 3 has a subtle but important difference from "composition sampling," in which one would first generate a Markov chain {z_0^{(i)}}_{i≥1} and then draw each z_k^{(i)} conditionally on z_0^{(i)}. For a Markov chain generated as in Theorem 3, the conditional distribution of z_k^{(i)} depends instead on the previous state z_0^{(i-1)}. This additional flexibility is critical for the recycling algorithms presented in this article. In particular, Theorem 3 reduces to Theorem 1 when the transition kernel P_k(· | ·) is constructed as one iteration of HMC with k leapfrog steps for k ≥ 1 and P_0(· | ·) as that with K steps. Theorem 2 is justified by an extension of Theorem 3, which shows that a recycling algorithm applies even when the number of states generated by a multi-proposal scheme varies from one iteration to another. Since in Theorem 2 the number of recyclable samples varies according to the random path lengths L^{(i)}, the general theory behind Theorem 2 relies on the framework of a Markov chain in a trans-dimensional parameter space (Hastie and Green, 2012). The precise statement is given as Theorem 4 along with a proof in Appendix A.
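The construction in Theorem 3 can be checked on a toy example: below, P_0 and the P_k are autoregressive kernels, each of which leaves N(0, 1) invariant, and the auxiliary draws z_k^{(i)} are generated conditionally on the previous state z_0^{(i-1)} rather than on z_0^{(i)}. The kernel choices and function names are illustrative, not taken from the paper:

```python
import numpy as np

def ar1_kernel(z, rho, rng):
    """AR(1) transition z' = rho*z + sqrt(1 - rho^2)*eps leaves N(0,1) invariant."""
    return rho * z + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()

def theorem3_chain(n_iter, rhos, rng):
    """rhos[0] drives the chain (P_0); rhos[1:] generate extra draws (P_k),
    all conditioned on the previous driving state z_0^(i-1)."""
    z0 = 0.0
    pooled = []
    for _ in range(n_iter):
        new_z0 = ar1_kernel(z0, rhos[0], rng)        # z_0^(i) ~ P_0(. | z_0^(i-1))
        pooled.append(new_z0)
        for rho in rhos[1:]:
            pooled.append(ar1_kernel(z0, rho, rng))  # z_k^(i) ~ P_k(. | z_0^(i-1))
        z0 = new_z0
    return np.array(pooled)
```

Pooling all the draws gives a valid (if correlated) sample from N(0, 1), mirroring how the intermediate HMC states pooled across iterations target the correct distribution.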
The general formulation of the recycling algorithm as in Theorems 3 and 4 is of practical value for any MCMC algorithm that simultaneously yields multiple valid transition kernels P_k(· | ·). Indeed, in many variants of HMC (Neal, 2010; Girolami and Calderhead, 2011; Shahbaba et al., 2014; Fang et al., 2014), a proposal is generated by computing a long trajectory whose intermediate steps constitute valid proposal states that can all be recycled by simply adding acceptance-rejection steps as in Algorithm 1. Our theory also provides an alternate and simpler justification of the algorithms by Calderhead (2014) and Bernton et al. (2015), as shown in Appendix B. Our recycling algorithm can also be applied to more complex proposal generation mechanisms, as we illustrate in the next section.

Recycled No-U-Turn-Sampler
The no-U-turn sampler (NUTS) of Hoffman and Gelman (2014) automates the choice of path lengths by simulating each trajectory of Hamiltonian dynamics until it starts moving back toward the starting point, a criterion they term the U-turn condition. The length of a trajectory is recursively doubled forward or backward in a randomly chosen direction. This generates a trajectory of length 2^d endowed with a binary tree structure, where d denotes the depth at which the U-turn condition occurs.
Unlike the simpler trajectory simulation procedure behind HMC, the trajectory doubling procedure of NUTS does not yield a sequence of valid intermediate proposals. In particular, the empirical distribution does not converge to the correct target distribution if we naively recycle all the intermediate states of NUTS as in Algorithm 1. A simple recycling algorithm for NUTS can nonetheless be devised by taking advantage of the fact below. Instead of the Metropolis acceptance-rejection procedure, NUTS determines acceptable states along a simulated trajectory using a slice sampling approach via an auxiliary slicing variable u.

Fact 1. Conditional on the slicing variable u, a transition from the current state to any state in the set A of acceptable states preserves the target distribution π(·). (4.1)

The algorithmic details behind the construction of the set A are complex. Since they are not essential for understanding our recycling approach, we refer the readers to Hoffman and Gelman (2014). The discussions there also justify the above transition rule.
Fact 1 motivates the following algorithm for utilizing the intermediate states generated during each iteration of NUTS.
Algorithm 2 (Simple Recycled NUTS). Run NUTS as usual and, at each iteration, sample additional states (θ_k, p_k) for k = 1, . . . , K uniformly from the set A of acceptable states. Fact 1 tells us that the transition (θ_0, p_0) → (θ_k, p_k) preserves the target distribution π(·) for each k = 1, . . . , K. Algorithm 2 is thus justified by a straightforward application of Theorem 3.
Algorithm 2 captures the main idea behind our recycling algorithm; however, it is actually more statistically efficient to sample the recycled states (θ_k, p_k) so that they are evenly spread along a NUTS trajectory. Such a sampling scheme can be implemented in a simple and memory-efficient manner, without storing all the intermediate states in memory, by taking advantage of the binary tree structure of a NUTS trajectory. This is described in Appendix C. When we are not constrained by memory, the following Rao-Blackwellized version of recycled NUTS allows us to simply collect and use all the acceptable states of each NUTS iteration by assigning appropriate weights.
Algorithm 3 (Rao-Blackwellized Recycled NUTS). Denote the collection of acceptable states from the i-th iteration of NUTS by A^{(i)}. Use all the states in A^{(i)} for i = 1, . . . , N as the draws from the target distribution, assigning each state in A^{(i)} the weight 1/|A^{(i)}| and yielding a weighted empirical measure. The validity of Algorithm 3 follows simply by taking an expectation over the sampling step of Algorithm 2.
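The weighting in Algorithm 3 amounts to averaging a statistic within each acceptable set before averaging across iterations; a sketch of this computation (the helper name is ours) is:

```python
def rao_blackwellized_estimate(acceptable_sets, f):
    """Weighted estimate of E[f(theta)]: each state in the acceptable set
    A^(i) receives weight 1/|A^(i)|, so every iteration contributes equally
    regardless of how many acceptable states it produced."""
    within = [sum(f(theta) for theta in A) / len(A) for A in acceptable_sets]
    return sum(within) / len(within)
```

For example, with acceptable sets [[1.0, 3.0], [2.0]] and f the identity, the first iteration contributes (1.0 + 3.0)/2 = 2.0 and the second contributes 2.0, giving the estimate 2.0.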

Numerical results
We demonstrate the benefit of recycling using four test cases: three taken from Hoffman and Gelman (2014) and one from Girolami and Calderhead (2011). We focus on these well-established test cases as guaranteeing rapid convergence and mixing of HMC in full generality is an active area of research (Livingstone et al., 2016, 2017; Mangoubi and Smith, 2017). In all our simulations we chose the stepsizes such that the corresponding average acceptance rates are approximately 70%, as values between 60% and 80% are typically considered optimal (Neal, 2010; Beskos et al., 2013). The dual averaging algorithm of Hoffman and Gelman (2014) was used to find such stepsizes. The choice of the path lengths τ^{(i)}, equivalently the numbers of leapfrog steps L^{(i)}, for HMC is discussed within the individual test cases below. The identity mass matrix M = I was used in all our simulations except when investigating the use of recycling in mass matrix tuning (see Section 5.1).

Metrics for computational and memory efficiency of recycling
Effective sample sizes for mean, variance, and quantile estimators. In comparing the algorithms with and without recycling, we use effective sample sizes (ESS) as a commonly used measure of computational efficiency of Monte Carlo algorithms (Brooks et al., 2011). We define the ESS of an estimator as the number of independent draws that would yield an estimator with the same mean squared error, so that the ESS is inversely proportional to MSE(·), the mean squared error of an estimator (5.2). The definition (5.2) agrees with the standard one when the estimator is an empirical average over the MCMC samples. In each of our examples, we compute the ratios of ESS with and without recycling for the mean, variance, and 97.5% quantile estimator along each model parameter. The ESS ratio is obtained from the ratio of MSE using the relation (5.2). The MSE of the Monte Carlo estimates is estimated by running 400 chains independently for 3,200 iterations starting from stationarity. The number of iterations here is chosen to roughly agree with the typical use cases of HMC; for example, Stan's default setting generates 1,000 NUTS samples after 1,000 warm-up iterations (Stan Development Team, 2015). We find in many cases, however, that the magnitudes of ESS gains from recycling are independent of the chain lengths (Appendix D). In computing the MSE of a Monte Carlo estimate, we need a precise value for the estimated quantity. Since analytical expressions are unavailable in our test cases (except for the first synthetic one), we run a long NUTS chain of 10^7 iterations (after 10^3 burn-in iterations) to obtain "ground truths" that are orders of magnitude more accurate than the individual Monte Carlo estimates from the chains of length 3,200.
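Because the ESS is inversely proportional to the MSE at a fixed estimand, the ESS ratio reported in our figures can be computed directly from the two collections of independent-chain estimates. A minimal sketch (function names ours):

```python
import numpy as np

def ess_ratio(estimates_recycled, estimates_plain, ground_truth):
    """ESS gain of recycling = MSE without recycling / MSE with recycling,
    where each MSE is estimated across independent chains."""
    mse_recycled = np.mean((np.asarray(estimates_recycled) - ground_truth) ** 2)
    mse_plain = np.mean((np.asarray(estimates_plain) - ground_truth) ** 2)
    return mse_plain / mse_recycled
```

A ratio of 2 means recycling doubles the effective sample size; in the log_2 plots this appears as the value 1.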

Benefits of recycling in estimating covariance structures
We also assess whether recycling helps estimate the covariance structure of the target distribution. To this end, we compute the top eigenvalue and eigenvector of the empirical covariance matrix for each chain. We then calculate the angle between the empirical eigenvector and the plane spanned by the first ℓ true principal components. This angle should be close to 0 when the eigenvector is estimated well. To ensure identifiability of the direction, we choose ℓ = min{j : σ_j^2 < σ_1^2 / 2} in all our simulations, where σ_j^2 denotes the j-th largest eigenvalue of the true covariance matrix.
As an alternate and more holistic evaluation of covariance estimation with a practical application, we further investigate the utility of recycling during the tuning phase of HMC/NUTS. Often, initial iterations of HMC/NUTS are used to estimate the covariance matrix of the target Σ = Var(θ), which can then be used to accelerate HMC/NUTS by setting the mass matrix M = Σ^{-1} (Neal, 2010; Stan Development Team, 2015). If recycling improves the covariance estimator during the tuning phase, later HMC/NUTS iterations will be faster and mix better.
To quantify the benefit of recycling in this setting, we estimate the covariance with and without recycling during the tuning phase, and then run two independent chains with the two covariance estimators to compare their ESSs. Recycling is only applied during the tuning phase for covariance estimation. In carrying out this experiment, we follow the default settings of Stan for tuning the stepsize and mass matrix. First, 50 iterations of the dual-averaging algorithm are run to tune the stepsize with the identity mass matrix, followed by N_adap iterations with a fixed stepsize to estimate the covariance matrix, and finally another 75 iterations of dual-averaging to re-adjust the stepsize with the tuned mass matrix. After the covariance estimation phase with N_adap iterations, we set the mass matrix from a regularized covariance estimator that shrinks Σ_emp toward the identity, with Σ_emp the empirical covariance matrix and I the identity matrix. After the tuning phase, we run NUTS until the total number of gradient evaluations reaches 10^4. This procedure is repeated 400 times and the ESS for each statistic is averaged across the repetitions. This experiment is not run for the models of Sections 5.4 and 5.5 as the high dimensionality of the parameter spaces makes covariance estimation impractical.
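A sketch of the mass-matrix tuning step; the shrinkage weight below is an illustrative placeholder, not the exact regularization constant used by Stan:

```python
import numpy as np

def tuned_mass_matrix(tuning_samples, shrink=0.1):
    """Estimate Sigma from tuning-phase draws (recycled or not), shrink the
    empirical covariance toward the identity for numerical stability, and
    return the mass matrix M = Sigma^{-1}."""
    sigma_emp = np.cov(np.asarray(tuning_samples), rowvar=False)
    sigma = (1.0 - shrink) * sigma_emp + shrink * np.eye(sigma_emp.shape[0])
    return np.linalg.inv(sigma)
```

Recycling enters only through `tuning_samples`: with more (recycled) draws from the same number of gradient evaluations, the empirical covariance, and hence M, is estimated more accurately.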

Memory and statistical efficiency trade-off
We also study the relationship between the number of recycled samples and statistical efficiency. In particular, we demonstrate that it is not necessary to recycle all the intermediate steps to reap the benefit of recycling. This is relevant in a high dimensional parameter space where the amount of memory required to store the extra samples becomes substantial. For long trajectories, there is substantial correlation among the intermediate states, and we can expect that recycling a subset of the intermediate states may provide almost as much statistical efficiency as recycling all of them.
To quantify this, we first run the algorithm recycling all the intermediate states. We then repeatedly reduce K, the number of samples per iteration (recycled intermediate states plus the final state), by a factor of 2. In comparing the algorithms with and without recycling, we use the smallest K for which the ESS averaged across all the estimators is within 5% of that when recycling all the intermediate states. Section 5.6 investigates in more detail the relationship between the statistical efficiency and the number of recycled samples.
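Reducing K by factors of 2 while keeping the retained states evenly spread along the trajectory can be done with a simple index computation (the helper name is ours):

```python
def evenly_spaced_subset(states, n_keep):
    """Keep n_keep of the K per-iteration states, evenly spread along the
    trajectory and always including the final state."""
    K = len(states)
    indices = [(j + 1) * K // n_keep - 1 for j in range(n_keep)]
    return [states[i] for i in indices]
```

For instance, with K = 8 states and n_keep = 4, the indices 1, 3, 5, 7 are retained; with n_keep = 1, only the final state (the standard HMC sample) survives.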

Multivariate Gaussian
The first test case is sampling from a 250-dimensional multivariate Gaussian N(0, Σ), where Σ is drawn from a Wishart distribution with 250 degrees of freedom and mean equal to the identity matrix. A covariance matrix drawn from this distribution exhibits strong correlations, and in our case the ratio between the largest and smallest eigenvalue of Σ was approximately 9.5 × 10^4. Since HMC and NUTS with the identity mass matrix are invariant under rotations, we take Σ to be diagonal with Σ_{i,i} = σ_i^2, where σ_i^2 corresponds to the i-th smallest eigenvalue of the original covariance matrix. For the path length of HMC, we first found the smallest value of τ for which the samples in the leading principal component direction are roughly independent. The typical practice would then be to jitter the τ^{(i)}'s within the range [0.9τ, 1.1τ] to avoid periodicity (Neal, 2010), but this still resulted in near perfect periodicity, and hence poor mixing, for some parameters. After some experiments, we found jittering τ^{(i)} in the range [τ/2, τ] to provide decent mixing along all the coordinates. Fig. 2a shows log_2 of the ratios between the ESS of HMC with and without recycling. To facilitate the comparison of algorithms with and without recycling, the parameters are sorted in increasing order of the ESS ratios in mean estimation. Values above zero indicate superior performance of our recycling algorithm. The use of recycling improves, uniformly and substantially, estimation of the variances and quantiles: about a 100% increase in ESS on average. Not all the mean estimators are improved by recycling, but the ones with the worst ESS are significantly improved (Fig. 2b). Out of 251 recyclable samples generated on average from each iteration of HMC, we recycled 251/8 ≈ 31 samples. The ratios of ESS in estimating the angle as well as the eigenvalues are shown in Fig. 3, where we plot the ratios against the lengths of the Markov chains.
The direction of the principal component cannot be well estimated by shorter chains of length ~200 even with recycling, but recycling conveys a substantial advantage as the chains are run longer. While here the ESS ratios vary substantially as a function of chain length, this dependence seems to be unique to this particular summary statistic; we show in Appendix D that, for the other statistics, the magnitudes of ESS gains from recycling are independent of the chain lengths. Lastly, the results of our covariance/mass matrix tuning experiment are summarized in Figure 5. Again, in this experiment recycling is only carried out during the tuning phase, and the difference in ESS comes purely from the difference in the accuracy of the covariance estimators. The benefit of recycling diminishes as N_adap increases, as the covariance matrix can be adequately approximated without recycling, and we found no advantage of recycling when N_adap ≥ 800.

Hierarchical Bayesian logistic regression
The second test case is a hierarchical Bayesian logistic regression model applied to the German credit data set available from the University of California Irvine Machine Learning Repository. Including two-way interaction terms and an intercept, there are 301 predictors, and the regression coefficients β are given a N(0, σ^2 I) prior. A hyperprior is placed on σ^2, which makes the posterior inference more challenging through the strong dependence between σ and β. We made one modification to the corresponding example in Hoffman and Gelman (2014) by defining our parameters to be (log(σ), β) instead of (σ^2, β), since such a transformation of constrained variables has become standard (Stan Development Team, 2015). A default flat prior was placed on σ.
Performance comparisons as in the previous example are shown in Fig. 6, 7, and 8. For some parameters, recycling seems to produce little gain in terms of mean estimation but provides clear benefits in terms of variance and quantile estimation. In the mass matrix tuning experiment shown in Figure 8, we tried N_adap = 500, 1000, 2000 and observed substantial improvement in the average ESS from recycling for N_adap ≤ 1000.
For the path lengths of HMC, we first found the value τ maximizing the normalized expected squared jumping distance τ^{-1/2} E‖θ^{(i+1)}(τ) − θ^{(i)}(τ)‖^2 as in Wang et al. (2013), then jittered each path length τ^{(i)} in the range [0.9τ, 1.1τ]. The average trajectory length of HMC was 9, and all the intermediate states were recycled. The average trajectory length of NUTS was 2^4 = 16, out of which 7 states were recycled.

Stochastic volatility model
The third test case is a stochastic volatility (SV) model fit to a time series y taken from the closing values of the S&P 500 index for the 3,000 days ending on Dec 31st, 2015. The model is specified with the priors s_0 ∼ Exp(mean = 1/10) and τ ∼ Gamma(1/2, 1/2). The observed value on Jan 2nd, 2008 was removed from the original data as this simple SV model could not fit this observation well. After integrating out τ to accelerate mixing, we are left with a 3,000-dimensional parameter space for log s. Performance comparisons are shown in Fig. 9 and 10. The path length for HMC was chosen as in Section 5.3. On average, 44 samples out of 90 per iteration were recycled for HMC, and 7 out of 2^7 = 128 were recycled for NUTS.

Figure 9: Performance comparison between HMC with and without recycling for the SV model.

Log-Gaussian Cox point-process (Riemann manifold HMC)
The last test case is a log-Gaussian Cox point-process model from Girolami and Calderhead (2011), where they apply Riemann manifold HMC (RMHMC) to sample from the latent process x ∈ R^4096 defined on a 64 × 64 grid. The observation y_ij for i, j = 1, . . . , 64 is assumed to follow a Poisson distribution with intensity proportional to exp(x_ij). The latent process x | µ, σ, β is given a Gaussian process prior with mean µ1 and a covariance determined by the hyper-parameters σ and β. Following Girolami and Calderhead (2011), we fix the hyper-parameters at β = 1/33, σ^2 = 1.91, and µ = log(126) − σ^2/2 and simulate the data from the generative model.
Due to the greater computational cost of simulation in this example, we only run 100 independent chains for 1,600 iterations. The ground truth statistics are calculated from a NUTS chain of 10^6 iterations. The path length of RMHMC is uniformly sampled from the range 1 to 30 as done in Girolami and Calderhead (2011).
Performance comparisons as in the previous examples are shown in Fig. 11 and 12. Recycled RMHMC yields worse ESS for the majority of the mean estimators in this example, but this is an artifact of the negative autocorrelations in non-recycled RMHMC causing super-efficiency, that is, MCMC samples yielding smaller Monte Carlo errors than the same number of independent samples. HMC, and hence RMHMC, is known to occasionally exhibit such behavior (Kennedy and Pendleton, 1991; Neal, 2010). We can see, however, that recycling uniformly and significantly improves the ESS for the variance and quantile estimators. Under NUTS, such super-efficiency artifacts are not observed and recycling improves the ESS for all the estimators. On average, 15 samples were recycled for RMHMC and 7 out of 2^4 = 16 for NUTS.

Number of recycled samples and statistical efficiency
As mentioned earlier, in the simulation results above we recycle enough of the intermediate states to achieve near-optimal efficiency gains. Here we take a closer look at how the efficiency gain from recycling depends on the number of recycled samples. The results in particular provide practical guidance on how one might trade off statistical efficiency for memory efficiency when the available memory becomes limited.
For our experiments here, we focus on the problem of estimating a quantile; the dependence of the mean and variance estimators on the number of recycled samples was found to be similar. The number of samples per iteration was repeatedly reduced by a factor of 2 until the benefit of recycling became almost negligible. The results are summarized in the log_2 ESS ratio plots as presented earlier: Figure 13 for the multivariate Gaussian example, Figure 14 for the hierarchical Bayesian logistic regression example, Figure 15 for the stochastic volatility example, and Figure 16 for the log-Gaussian Cox example. The green dotted line corresponds to the number of recycled samples at which the efficiency decrease relative to the optimal one becomes visually noticeable. The cyan dashed line corresponds to the number of recycled samples below which the benefit from recycling becomes negligible.
The performance of recycled NUTS is particularly remarkable, not only offering the near-optimal efficiency gain well below the maximal recycling size but also demonstrating an over 40% (≈ 2^{0.5}) efficiency gain with just one recycled sample. For recycled HMC, the efficiency gains remain substantial well below the maximal recycling size but start to diminish much earlier than for NUTS. Two design features of NUTS likely explain this phenomenon. First, NUTS simulates a trajectory in both the forward and backward directions, which means that some of the intermediate states lie in the direction opposite to the final proposal state relative to the starting point of a trajectory. Secondly, while HMC simulates a trajectory to construct one high-quality proposal state, NUTS generates a collection of states, any of which likely constitutes a good proposal state, and selects one state from the collection as a final proposal. Compared to those of HMC, therefore, the recyclable states of NUTS probably have smaller correlations with the final proposal state.
Our experiments here suggest that recycled NUTS may be a particularly practical alternative to the standard implementation of HMC-type algorithms; it not only eliminates the need to tune the path length but also provides a significant boost in efficiency with a rather small increase in memory requirement.

Discussion
We have proposed a simple and general algorithm for improving the efficiency of HMC and its variants with essentially no extra computational overhead. Our simulations demonstrate substantial gains in computational efficiency without excessive memory use. In practice, conceptual complexity, ease of implementation, and memory efficiency are considerations just as important as statistical efficiency. These considerations can explain why related ideas to improve the efficiency of HMC variants have not gained traction. Our algorithm provides a more practical and user-friendly alternative that applies straightforwardly to a wide range of multi-proposal schemes.

Appendix A: Recycling varying numbers of states
Theorem 4 below extends Theorem 3 to allow recycling the intermediate states of a multi-proposal scheme in which the number of generated states varies randomly from one iteration to another. Theorem 2 is a special case of Theorem 4 in which the transition kernel P_k(· | ·) is chosen as one iteration of HMC with k leapfrog steps for k ≥ 1 and P_0(· | ·) as that with L ∼ π_L(·) steps.