Variational Hamiltonian Monte Carlo via Score Matching

Traditionally, the field of computational Bayesian statistics has been divided into two main subfields: variational methods and Markov chain Monte Carlo (MCMC). In recent years, however, several methods have been proposed based on combining variational Bayesian inference and MCMC simulation in order to improve their overall accuracy and computational efficiency. This marriage of fast evaluation and flexible approximation provides a promising means of designing scalable Bayesian inference methods. In this paper, we explore the possibility of incorporating variational approximation into a state-of-the-art MCMC method, Hamiltonian Monte Carlo (HMC), to reduce the required gradient computation in the simulation of Hamiltonian flow, which is the bottleneck for many applications of HMC in big data problems. To this end, we use a free-form approximation induced by a fast and flexible surrogate function based on single-hidden layer feedforward neural networks. The surrogate provides sufficiently accurate approximation while allowing for fast exploration of parameter space, resulting in an efficient approximate inference algorithm. We demonstrate the advantages of our method on both synthetic and real data problems.


Introduction
Bayesian inference has been successful in modern data analysis. Given a probabilistic model for the underlying mechanism of the observed data, Bayesian methods properly quantify uncertainty and reveal the landscape or global structure of the parameter space. While conceptually simple, exact posterior inference in many Bayesian models is intractable. Therefore, in practice, practitioners often resort to approximation methods, among which Markov chain Monte Carlo (MCMC) and variational Bayes (VB) are the two most popular choices.
The MCMC approach is based on drawing a series of correlated samples by constructing a Markov chain with guaranteed convergence to the target distribution. Therefore, MCMC methods are asymptotically unbiased. Simple methods such as random-walk Metropolis (Metropolis et al., 1953), however, often suffer from slow mixing (due to their random walk nature) when encountering complicated models with strong dependencies among parameters. Introducing an auxiliary momentum variable, Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011) reduces the random walk behavior by proposing states following a Hamiltonian flow which preserves the target distribution. By incorporating the geometric information of the target distribution, e.g., the gradient, HMC is able to generate distant proposals with high acceptance probabilities, enabling more efficient exploration of the parameter space than standard random-walk proposals.
A major bottleneck of HMC, however, is the computation of the gradient of the potential energy function in order to simulate the Hamiltonian flow. As the datasets involved in many practical tasks, such as "big data" problems, usually have millions to billions of observations, such gradient computations are infeasible since they need full scans of the entire dataset. In recent years, many attempts have been made to develop scalable MCMC algorithms that can cope with very large datasets (Welling and Teh, 2011; Ahn et al., 2012; Chen et al., 2014; Ding et al., 2014). The key idea of these methods stems from stochastic optimization, where noisy estimates of the gradient based on small subsets of the data are utilized to scale up the algorithms. The noise introduced by subsampling, however, can lead to a non-ignorable loss of accuracy, which in turn hinders the exploration efficiency of standard MCMC approaches (Betancourt, 2015).
The main alternative to MCMC is variational Bayes inference (Jordan et al., 1999; Wainwright and Jordan, 2008). As a deterministic approach, VB transforms Bayesian inference into an optimization problem where a parametrized distribution is introduced to fit the target posterior distribution by minimizing the Kullback-Leibler (KL) divergence with respect to the variational parameters. Compared to MCMC methods, VB introduces bias but is usually faster.
A natural question would be: can we combine both methods to mitigate the drawbacks and get the best of both worlds? The first attempt in this direction was proposed by de Freitas et al. (2001), where a variational approximation was used as the proposal distribution in a block Metropolis-Hastings (MH) MCMC kernel to locate the high probability regions quickly, thus facilitating convergence. Recently, a new synthesis of variational inference and Markov chain Monte Carlo methods was explored in Salimans et al. (2015), where one or more MCMC steps are integrated into the variational approximation. The extra flexibility from MCMC steps provides a rich class of distributions to find a closer fit to the exact posterior, which allows for further improvement in approximation quality.
In this work, we explore the possibility of utilizing variational approximation to speed up HMC for problems with large-scale datasets. The key idea is to integrate fast variational approximations into the sampling procedure so that the overall computational complexity can be reduced. The main contributions are summarized as follows: 1. We train a fast neural network surrogate using an extreme learning machine (ELM) to approximate the log-posterior (potential energy function). Neural networks with randomly-mapped features (with almost any nonlinear activation function) have been proven to approximate a rich class of functions arbitrarily well (Huang et al., 2006a). After the fast training of a reasonably accurate neural network surrogate, we use its gradient to simulate the surrogate induced Hamiltonian flow in order to generate proposals. The learning and training of the surrogate function provides an effective and flexible way to explore both regularity of the parameter space and redundancy in the data collectively, which can be viewed as an implicit subsampling. However, unlike the popular subsampling-based methods, where the leapfrog stepsize needs to be annealed or must remain small enough to compensate for the resulting bias, the leapfrog stepsize in our approach can remain close to that of standard HMC while keeping a comparable acceptance probability. As such, the exploration efficiency of HMC can be maintained while reducing the computational cost. 2. A new training procedure is proposed for an efficient and general surrogate method. The neural network surrogate is trained by minimizing the squared distance between the gradient of the surrogate and the gradient of the target (log-posterior), a procedure that resembles score matching (Hyvärinen, 2005).
The training data are collected while the "modified" HMC sampler (based on the surrogate induced Hamiltonian flow) explores the parameter space. In Hyvärinen (2005), the training data are assumed to be sampled from the target distribution. In our method, by including the gradient of the target distribution in the learning process, the information from the target distribution is explicitly built into the surrogate; therefore, the restriction on the training data is relaxed. 3. To further reduce the computational cost, we also use the computationally fast surrogate in the Metropolis-Hastings correction step. As a result, the modified HMC sampler will no longer converge to the exact target distribution. Instead, it will converge to the best approximation from an exponential family with pre-specified random sufficient statistics (which we call free-form variational approximation in the sequel). This variational perspective distinguishes our approach from existing surrogate methods for accelerating HMC. Moreover, compared to traditional fixed-form variational approximations, the free-form variational approximation is usually more flexible and thus can provide a better fit to more general target distributions.
Our paper is organized as follows. In section 2, we introduce the two ingredients related to our method: Hamiltonian Monte Carlo and fixed-form variational Bayes. Section 3 presents our method, termed Variational Hamiltonian Monte Carlo (VHMC). We demonstrate the efficiency of VHMC in a number of experiments in section 4 and conclude in section 5.

Hamiltonian Monte Carlo
In the general formulation of Bayesian inference, a set of independent observations Y = {y_1, . . . , y_N} is modeled by an underlying distribution p(y|θ) with unknown parameter θ. Given a prior distribution p(θ), the posterior distribution is given by Bayes' formula,

p(θ|Y) ∝ p(θ) ∏_{n=1}^{N} p(y_n|θ)

To construct the Hamiltonian dynamical system (Duane et al., 1987; Neal, 2011), the position-dependent potential energy function is defined as the negative log unnormalized posterior density,

U(θ) = − ∑_{n=1}^{N} log p(y_n|θ) − log p(θ)    (2.1)

and the kinetic energy function is defined as a quadratic function of an auxiliary momentum variable r,

K(r) = (1/2) r^T M^{-1} r

where M is a mass matrix and is often set to the identity, I. The fictitious Hamiltonian, therefore, is defined as the total energy function of the system,

H(θ, r) = U(θ) + K(r)

As one of the state-of-the-art MCMC methods, Hamiltonian Monte Carlo suppresses random walk behavior by simulating the Hamiltonian dynamical system to propose distant states with high acceptance probabilities. That is, in order to sample from the posterior distribution p(θ|Y), HMC augments the parameter space and generates samples from the joint distribution of (θ, r),

π(θ, r) ∝ exp(−U(θ) − K(r))    (2.2)

Since θ and r are separated in (2.2), we can simply drop the momentum samples r, and the θ samples follow the marginal distribution, which is exactly the target posterior.
To generate proposals, HMC simulates the Hamiltonian flow governed by the following differential equations:

dθ/dt = M^{-1} r    (2.3)
dr/dt = −∇_θ U(θ)    (2.4)

Algorithm 1 Hamiltonian Monte Carlo
Input: starting position θ^(1) and step size ε
for t = 1, 2, . . . , T do
    Resample momentum r ∼ N(0, M); set (θ_0, r_0) = (θ^(t), r)
    Simulate discretization of Hamiltonian dynamics with L leapfrog steps:
        r ← r − (ε/2) ∇_θ U(θ)
        θ ← θ + ε M^{-1} r
        r ← r − (ε/2) ∇_θ U(θ)
    Metropolis-Hastings correction: set θ^(t+1) = θ* with probability min[1, exp(H(θ_0, r_0) − H(θ*, r*))]; otherwise θ^(t+1) = θ^(t)
end for

Over a period t, also called the trajectory length, (2.3) and (2.4) together define a map φ_t : (θ_0, r_0) → (θ*, r*) in the extended parameter space, from the starting state to the end state. As implied by a Hamiltonian flow, φ_t is reversible, volume-preserving, and also preserves the Hamiltonian, H(θ_0, r_0) = H(θ*, r*). These properties allow us to construct π-invariant Markov chains whose proposals would always be accepted. In practice, however, (2.3) and (2.4) are not analytically solvable and we need to resort to numerical integrators. As a symplectic integrator, the leapfrog scheme (see Algorithm 1) maintains reversibility and volume preservation and hence is common practice in the HMC literature. Although discretization introduces bias, which needs to be corrected in a Metropolis-Hastings (MH) step, we can control the stepsize to maintain high acceptance probabilities even for distant proposals.
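As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python. The Gaussian target, step size, and trajectory length below are illustrative choices, not values from the paper:

```python
import numpy as np

def leapfrog(theta, r, grad_U, eps, L):
    """L leapfrog steps of size eps with identity mass matrix M = I."""
    theta, r = theta.copy(), r.copy()
    r -= 0.5 * eps * grad_U(theta)           # initial half step for momentum
    for _ in range(L - 1):
        theta += eps * r                      # full step for position
        r -= eps * grad_U(theta)              # full step for momentum
    theta += eps * r
    r -= 0.5 * eps * grad_U(theta)            # final half step
    return theta, r

def hmc(U, grad_U, theta0, eps=0.1, L=20, n_samples=1000, seed=0):
    """Standard HMC (Algorithm 1): resample momentum, leapfrog, MH-correct."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples, n_accept = [], 0
    for _ in range(n_samples):
        r0 = rng.standard_normal(theta.shape)             # r ~ N(0, I)
        theta_star, r_star = leapfrog(theta, r0, grad_U, eps, L)
        H0 = U(theta) + 0.5 * r0 @ r0                     # H = U + K
        H1 = U(theta_star) + 0.5 * r_star @ r_star
        if rng.random() < np.exp(min(0.0, H0 - H1)):      # MH correction
            theta, n_accept = theta_star, n_accept + 1
        samples.append(theta)
    return np.array(samples), n_accept / n_samples

# Toy target: standard 2-d Gaussian, U(theta) = theta . theta / 2
samples, acc = hmc(lambda th: 0.5 * th @ th, lambda th: th, np.zeros(2))
```

With M = I the leapfrog position update is simply θ ← θ + εr; a non-identity mass matrix would replace r with M^{-1}r in both the position update and the kinetic energy.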
In recent years, many variants of HMC have been developed to make the algorithm more flexible and generally applicable in a variety of settings. For example, the methods proposed in Hoffman and Gelman (2011) and Wang et al. (2013) enable automatic tuning of hyperparameters such as the stepsize ε and the number of leapfrog steps L, removing much of the tuning burden. Riemannian Manifold HMC (Girolami and Calderhead, 2011) further improves standard HMC's efficiency by automatically adapting to local structures using the Riemannian geometry of the parameter space. These adaptive techniques could potentially be combined with our proposed method, which focuses on reducing the computational complexity.

Fixed-form Variational Bayes
Instead of running a Markov chain, one can also approximate the intractable posterior distribution with a more convenient and tractable distribution. A popular approach for obtaining such an approximation is fixed-form variational Bayes (Honkela et al., 2010; Saul and Jordan, 1996; Salimans and Knowles, 2013), where a parametrized distribution q_η(θ) is proposed to approximate the target posterior p(θ|Y) by minimizing the KL divergence

KL(q_η(θ) || p(θ|Y)) = E_{q_η(θ)}[log q_η(θ)] − E_{q_η(θ)}[log p(θ, Y)] + log p(Y)    (2.5)

Since log p(Y) is a constant (used extensively in model selection), it suffices to minimize the sum of the first two terms in (2.5). Usually, q_η(θ) is chosen from the exponential family of distributions with the following canonical form:

q_η(θ) = exp(T(θ)η − A(η)) ν(θ)    (2.6)

where T(θ) is a row vector of sufficient statistics, A(η) is the normalizer, and ν(θ) is a base measure. The column vector η is often called the natural parameters of the exponential family distribution q_η(θ). Taking this approach and substituting into (2.5), we now have a parametric optimization problem in η:

η̂ = arg min_η E_{q_η(θ)}[log q_η(θ)] − E_{q_η(θ)}[log p(θ, Y)]    (2.7)

The above optimization problem can be solved using gradient-based optimization or fixed-point algorithms if E_{q_η(θ)}[log q_η(θ)], E_{q_η(θ)}[log p(θ, Y)], and their derivatives with respect to η can be evaluated analytically. Without assuming posterior independence or requiring conjugate exponential models, posterior approximations of this type are usually much more accurate than a factorized approximation under mean-field assumptions. However, the requirement of being able to analytically evaluate the quantities mentioned above is also very restrictive. Many attempts, therefore, have been made to evaluate those quantities using stochastic approximations based on Monte Carlo integration (Blei et al., 2012; Ranganath et al., 2013; Kingma and Welling, 2013). As an alternative, Salimans and Knowles (2013) proposed a new optimization algorithm which relates (2.7) to stochastic linear regression.
To reveal the connection, the posterior approximation (2.6) is relaxed and rewritten in the unnormalized form

q̃_η̃(θ) = exp(T̃(θ)η̃) ν(θ)    (2.8)

where the nonlinear normalizer A(η) is removed and the vectors of sufficient statistics and natural parameters are augmented, i.e., T̃(θ) = (1, T(θ)), η̃ = (η_0, η′)′. The unnormalized version of the KL divergence is utilized to deal with q̃_η̃(θ) and achieves its minimum at

η̃ = E[T̃(θ)′ T̃(θ)]^{-1} E[T̃(θ)′ log p(θ, Y)]    (2.9)

which resembles the maximum likelihood estimator for linear regression. Based on this observation, Salimans and Knowles (2013) derived a stochastic approximation algorithm using (2.9) as a fixed-point update and approximating the involved expectations by weighted Monte Carlo.
In the next section, we will discuss how the variational Bayes approach can actually be utilized to accelerate HMC. For this, we construct a fast and accurate approximation to the computationally expensive potential energy function. The approximation is provided by variational Bayes and is incorporated in the simulation of the Hamiltonian flow.

Variational Hamiltonian Monte Carlo
Besides subsampling, an alternative approach that can save computational cost is to construct fast and accurate surrogate functions for the expensive potential energy function (Liu, 2001; Neal, 2011). As one of the most commonly used models for emulating expensive-to-evaluate functions, the Gaussian process (GP) is used in Rasmussen (2003) to approximate the potential energy and its derivatives based on true values of these quantities (the training set) collected during an initial exploratory phase. However, a major drawback of GP-based surrogate methods is that inference time typically grows cubically in the size of the training set due to the necessity of inverting a dense covariance matrix. This is especially problematic in high-dimensional spaces, where large training sets are often needed before a reasonable level of approximation accuracy is achieved. Our goal, therefore, is to develop a method that can scale to large training sets while still maintaining a desired level of flexibility. For this purpose, we propose to use neural networks along with efficient training algorithms to construct surrogate functions. A typical single-hidden-layer feedforward neural network (SLFN) with scalar output is defined as

z(θ) = ∑_{i=1}^{s} v_i σ(θ; γ_i)    (3.1)

where γ_i and v_i are the node parameters and output weight for the i-th hidden neuron, and σ is a nonlinear activation function. Given a training dataset, the estimates of these parameters can be obtained by minimizing the mean squared error (MSE) cost function. To save training time, randomly assigned node parameters {γ_i}_{i=1}^{s} are suggested in Ferrari and Stengel (2005) and Huang et al. (2006b), in which case the optimization reduces to a linear regression problem with randomly mapped features, which can be solved efficiently using algebraic approaches. This can also be viewed as using an additive model based on random (adaptive) bases to approximate the target distribution.
Unlike a standard Gaussian process, the above neural network based surrogate scales linearly in the size of training data, and cubically in the number of hidden neurons. This allows us to explicitly balance evaluation time and model capacity. Moreover, the random network in (3.1) has been proven to approximate a rich class of functions arbitrarily well (Rahimi and Recht, 2008).
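The ELM-style training described above reduces to a single regularized linear solve for the output weights once the node parameters are fixed at random. The following sketch fits the random network (3.1) to a toy 1-d target function by ridge regression on function values; the softplus activation, target function, and all sizes are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_features(theta, W, d):
    """Hidden-layer outputs softplus(W . theta + d); node parameters stay fixed."""
    return np.log1p(np.exp(theta @ W.T + d))

# Toy 1-d target function standing in for an expensive potential energy
f = lambda th: np.sin(th[:, 0]) + 0.5 * th[:, 0] ** 2

# Training set and s randomly assigned hidden nodes
s, n = 100, 1000
theta = rng.uniform(-3, 3, size=(n, 1))
W, d = rng.normal(size=(s, 1)), rng.normal(size=s)

# Only the output weights v are estimated: a single ridge-regularized solve
Phi = random_features(theta, W, d)               # n x s design matrix
lam = 1e-4
v = np.linalg.solve(Phi.T @ Phi + lam * np.eye(s), Phi.T @ f(theta))

# Surrogate accuracy at held-out points
theta_test = rng.uniform(-3, 3, size=(200, 1))
err = np.max(np.abs(random_features(theta_test, W, d) @ v - f(theta_test)))
```

The linear solve costs O(ns² + s³), which is the linear-in-data, cubic-in-neurons scaling mentioned above.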

Approximation with random networks
As shown in Rahimi and Recht (2008), functions of the form (3.1) with randomly assigned bases are flexible enough to provide accurate approximations of other well-studied classes of functions (e.g., functions in a reproducing kernel Hilbert space). Let {σ(·; γ) : γ ∈ Γ} be a family of functions on a compact set Θ ⊂ R^d with parameter γ specified over the set Γ. Let p be a distribution on Γ, and consider the rich class of functions of the form

F_p = { f(θ) = ∫_Γ v(γ) σ(θ; γ) dγ : |v(γ)| ≤ C p(γ) }    (3.2)

The following result shows that a given f ∈ F_p can be approximated within O(C/√s) error by a finite sum of the form

f̂(θ) = ∑_{i=1}^{s} v_i σ(θ; γ_i)    (3.3)

where γ_1, . . . , γ_s are sampled iid from p(γ). See Rahimi and Recht (2008) for a detailed proof.

Proposition 1. Let the space Ĥ be the completion of the set of functions of the form (3.2) such that ∫_Γ v(γ)²/p(γ) dγ < ∞. Then Ĥ is a reproducing kernel Hilbert space, Ĥ = H, with kernel

k(θ, θ′) = ∫_Γ σ(θ; γ) σ(θ′; γ) p(γ) dγ    (3.6)

Proof. See Rahimi and Recht (2008).
Theorem 2. Let F p and H be defined as above for a given activation function σ(θ; γ) and probability density p(γ). Then F p is dense in H.
Proof. By the properties of the RKHS, H is the completion of the set of all finite linear combinations of the form

f(θ) = ∑_j α_j k(θ, θ_j)    (3.7)

with the inner product satisfying ⟨k(·, θ_i), k(·, θ_j)⟩ = k(θ_i, θ_j). Using (3.6), we can rewrite (3.7) in the form

f(θ) = ∫_Γ ( ∑_j α_j σ(θ_j; γ) ) σ(θ; γ) p(γ) dγ

which belongs to F_p. Since certain RKHSs are known to fit a rich class of (density) functions arbitrarily well, approximating these RKHSs with a small number of random bases allows for efficient surrogate construction with comparable approximation accuracy.
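A quick numerical check of this random-basis view: with the classical random Fourier feature choice of Rahimi and Recht (2008) (an illustrative choice here, not the softplus basis used later), averaging s random bases recovers a Gaussian kernel at the O(1/√s) Monte Carlo rate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random Fourier features: for w ~ N(0, 1) and b ~ U[0, 2*pi],
# E[2 cos(w*x + b) cos(w*y + b)] = exp(-(x - y)^2 / 2), the Gaussian kernel.
def mc_kernel(x, y, s):
    w = rng.standard_normal(s)
    b = rng.uniform(0.0, 2.0 * np.pi, s)
    return np.mean(2.0 * np.cos(w * x + b) * np.cos(w * y + b))

x, y = 0.3, 1.1
exact = np.exp(-(x - y) ** 2 / 2.0)       # kernel value k(x, y)
approx = mc_kernel(x, y, 20000)           # s-basis Monte Carlo estimate
```

Each random basis contributes an O(1)-variance term, so the estimate concentrates around the kernel value at rate 1/√s.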

Free-form Variational Bayes
The correspondence between distributions and their potential energy functions builds a bridge between distribution approximation and function approximation. Viewed this way, each random neural network in (3.1) corresponds to a distribution in the exponential family

q_v(θ) ∝ exp(−z(θ)) = exp( ∑_{i=1}^{s} v_i Ψ_i(θ) − Φ(v) )    (3.8)

where v = (v_i, i = 1, 2, . . . , s) is called the vector of canonical parameters, and the collection of randomly-mapped features Ψ = (Ψ_i, i = 1, 2, . . . , s), with Ψ_i(θ) = −σ(θ; γ_i), is known as the sufficient statistics. The quantity Φ, known as the log partition function, is defined by the integral

Φ(v) = log ∫ exp( ∑_{i=1}^{s} v_i Ψ_i(θ) ) dθ    (3.9)

Note that q_v(θ) is properly normalized if and only if this integral is finite. Therefore, the canonical parameters v of interest belong to the set

Ω = { v : Φ(v) < ∞ }

We call q_v(θ) the induced distribution of the neural network surrogate z(θ) if v ∈ Ω. As our neural network surrogate approximates the true potential energy function U(θ), the underlying distribution q_v(θ) approximates the target posterior distribution p(θ|Y). Because both the surrogate induced distribution q_v(θ) and the target posterior distribution p(θ|Y) are known only up to a constant, we introduce the following expected squared distance between unnormalized densities,

D̃_SM(p(θ|Y) || q_v(θ)) = (1/2) ∫ q_v(θ) ||∇_θ log p(θ|Y) − ∇_θ log q_v(θ)||² dθ    (3.10)

which is similar to the well-known score matching squared distance (Hyvärinen, 2005) (see section 3.4 for a more detailed explanation). By minimizing the expected squared distance D̃_SM, we arrive at a variational inference algorithm

v̂ = arg min_{v ∈ Ω} D̃_SM(p(θ|Y) || q_v(θ))    (3.11)

The surrogate induced approximation (3.8) enriches our choices for variational approximation from fixed-form tractable distributions (e.g., Gaussians or mixtures of Gaussians) to fast and flexible intractable distributions. The integral in (3.10) is usually hard to evaluate in practice but can be approximated, by the law of large numbers, using samples from the surrogate induced distribution q_v(θ).
Unlike the fixed-form approximation, the surrogate induced distribution generally does not allow for drawing samples directly. However, we can use MCMC methods to generate samples from it.
Due to the random nature of the approximation (3.8), we call the variational algorithm (3.11) free-form variational Bayes. By choosing a proper number of hidden neurons s, free-form variational Bayes provides an implicit subsampling procedure that can effectively remove redundancy and noise in the data while striking a good balance between computational cost and approximation accuracy of the underlying distribution.
Remark. From an approximation point of view, each σ(θ; γ) is a basis function with a random configuration γ, which specifies the orientation and scaling properties, within the profile of the activation function σ. In particular, we choose the softplus function for additive nodes,

σ(θ; w, d) = log(1 + exp(w · θ + d))

to approximate the potential function, e.g., the Hamiltonian corresponding to the posterior distribution, for our free-form variational Bayes. Of course, the choice of σ is general. As shown in Zhang et al. (2015), different bases can be used in our free-form variational Bayes formulation. In particular, if the squared exponential kernel is used as radial basis functions (RBF) centered at the given data, the GP method is recovered. However, using kernel functions centered at every data point, the GP method does not effectively exploit redundancy in the data or regularity in the parameter space, and hence may result in expensive computational cost as well as instability for large datasets. If the squared exponential kernel is instead used at a smaller set of properly induced points, a more computationally tractable GP model that tries to exploit redundancy in the data, sparse GP (Snelson and Ghahramani, 2006; Quinonero-Candela and Rasmussen, 2005), is recovered. Since kernel functions are usually local, to approximate potential functions in HMC, which go to infinity in the far field, it is shown in Zhang et al. (2015) that the use of random bases built on the softplus function provides a more compact and better approximation that strikes a good balance between local features and global behavior. Also, the choice of s, the number of basis functions, can be used to balance flexibility/accuracy and overfitting.

Surrogate Induced Hamiltonian Flow
To sample from the surrogate induced distribution q_v(θ), we generate proposals by simulating the corresponding surrogate induced Hamiltonian flow governed by the equations

dθ/dt = M^{-1} r    (3.12)
dr/dt = −∇_θ z(θ)    (3.13)

where the modified Hamiltonian is

H̃(θ, r) = z(θ) + K(r)

Practitioners can use the leapfrog scheme to solve (3.12) and (3.13) numerically. The end-point (θ*, r*) of the trajectory starting at (θ_0, r_0) is accepted with probability

α_vhmc = min[1, exp(H̃(θ_0, r_0) − H̃(θ*, r*))]    (3.14)

Similar to the true Hamiltonian flow, we have the following theorem.

Theorem 3. If we construct a Markov chain by simulating the surrogate induced Hamiltonian flow (3.12) and (3.13) using leapfrog steps and a Metropolis-Hastings correction with acceptance probability (3.14), the equilibrium distribution of the chain is

π̃(θ, r) ∝ exp(−z(θ) − K(r))

See the Supplementary Materials for a proof. Theorem 3 implies that the marginal distribution of θ is exactly the surrogate induced distribution q_v(θ). Therefore, we can run the Markov chain and collect the values of interest, such as the potential energy function and its derivatives, as additional training data to improve the surrogate approximation (solving (3.11)) on the fly. In this way, our algorithm can be viewed as a generalized version of the stochastic approximation algorithm proposed in Salimans and Knowles (2013) for free-form variational Bayes.
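A minimal sketch of one such transition, assuming a toy softplus surrogate with fixed, illustrative weights: relative to standard HMC, the only changes are that the leapfrog integrator uses ∇z in place of ∇U, and the MH test uses the modified Hamiltonian H̃:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy softplus surrogate z(theta) = sum_i v_i * softplus(w_i . theta + d_i)
# with fixed, illustrative weights (positive v_i keep z confining)
s, dim = 30, 2
W, d = rng.normal(size=(s, dim)), rng.normal(size=s)
v = np.abs(rng.normal(size=s))

z      = lambda th: np.log1p(np.exp(W @ th + d)) @ v
grad_z = lambda th: (v / (1.0 + np.exp(-(W @ th + d)))) @ W   # chain rule

def surrogate_hmc_step(theta, eps=0.05, L=10):
    """One VHMC transition: leapfrog under grad_z, MH test under H-tilde."""
    r = rng.standard_normal(dim)
    H0 = z(theta) + 0.5 * r @ r
    th, rr = theta.copy(), r - 0.5 * eps * grad_z(theta)
    for _ in range(L - 1):
        th += eps * rr
        rr -= eps * grad_z(th)
    th += eps * rr
    rr -= 0.5 * eps * grad_z(th)
    H1 = z(th) + 0.5 * rr @ rr
    return (th, True) if rng.random() < np.exp(min(0.0, H0 - H1)) else (theta, False)

theta, accepts = np.zeros(2), 0
for _ in range(200):
    theta, ok = surrogate_hmc_step(theta)
    accepts += ok
```

Both the dynamics and the accept/reject test involve only the s-term surrogate, so no pass over the full dataset is needed per step.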

Score Matching
A well-known strategy for estimating unnormalized models is score matching (Hyvärinen, 2005). Assuming that data D come from a distribution with unknown density p_D(·), we want to find an approximation with a parameterized unnormalized density model q_v(·), where v is an s-dimensional vector of parameters. Score matching estimates the model by minimizing the expected squared distance between the model score function ψ_v(θ) = ∇_θ log q_v(θ) and the data score function ψ_D(θ) = ∇_θ log p_D(θ). A simple trick was suggested in Hyvärinen (2005) to avoid the expensive estimation of the data score function ψ_D(θ) from D. Similar ideas can be applied here to train our neural network surrogate. Since the posterior density is known up to a constant, the data score function can be evaluated exactly: ψ_D(θ) = −∇_θ U(θ). This allows us to estimate our density model q_v(θ) without requiring samples from the posterior distribution. Since sampling from the posterior distribution is computationally costly, we sample from the simpler and cheaper surrogate induced distribution (our variational model) instead. The corresponding expected squared distance is

J̃(v) = (1/2) E_{q_v(θ)} [ ||ψ_v(θ) − ψ_D(θ)||² ]

Since the model score function is ψ_v(θ) = −∇_θ z(θ), J̃(v) is exactly the expected squared distance D̃_SM we introduced in section 3.2. In the case that our density model is not degenerate, we have a local consistency result similar to that of Hyvärinen (2005), as shown by the following theorem.
Theorem 4. Assume that the data density p_D(·) follows the model: p_D(·) = q_{v*}(·) for some v*. Further, assume that no other parameter value gives a probability density equal to q_{v*}(·), and that q_v(θ) > 0 for all v, θ. Then

J̃(v) = 0 if and only if v = v*.

As such, minimizing the expected squared distance D̃_SM is sufficient to estimate the model. For a proof, see the Supplementary Materials.
Remark. Note that the data score function is

ψ_D(θ) = ∇_θ ( ∑_{n=1}^{N} log p(y_n|θ) + log p(θ) )

so if we were to choose our surrogate function as

z(θ) = − ∑_{n=1}^{N} log p(y_n|θ) − log p(θ)    (3.17)

then the data density p_D would follow the model exactly. In order to reduce the computational cost, our model is usually much simpler than (3.17) (s ≪ N). This allows us to explore and exploit the redundancy in the data from a function approximation perspective.
In practice, samples from the surrogate induced distribution can be collected as observations, and our surrogate can be trained by minimizing the empirical version of J̃(v). A regularization term can be included to improve numerical stability.
Now suppose that we have collected training data of size t from the Markov chain history, T_s^(t) := {(θ_n, ∇_θ U(θ_n))}_{n=1}^{t} ⊂ R^d × R^d, where θ_n is the n-th sample. The estimator of the output weight vector (variational parameter) can be obtained by minimizing the empirical squared distance between the gradient of the surrogate and the gradient of the potential (i.e., the score function) plus an additional regularization term:

v^(t) = arg min_v ∑_{n=1}^{t} ||∇_θ z(θ_n) − ∇_θ U(θ_n)||² + λ ||v||²

Since ∇_θ z(θ) is linear in v, this is a ridge regression problem, and the estimator admits an online updating formula, summarized in Proposition 2 (see the Supplementary Materials for a detailed proof). For ease of presentation, we use additive nodes σ(θ; γ_i) = σ(w_i · θ + d_i).

Proposition 2. Suppose our current estimator of the output weight vector is v^(t), based on the current training dataset T_s^(t) := {(θ_n, ∇_θ U(θ_n))}_{n=1}^{t} ⊂ R^d × R^d, using s hidden neurons. Given a new training data point (θ_{t+1}, ∇_θ U(θ_{t+1})), the estimator can be updated by a recursive least-squares step in which the inverse matrix C^(t) of the regularized normal equations is refreshed via the Sherman-Morrison-Woodbury formula. The estimator and inverse matrix are initialized as v^(0) = 0, C^(0) = (1/λ) I_s. The online learning is achieved by storing the inverse matrix C and performing these updates, which cost O(d³ + ds²) computation and O(s²) storage, independent of t.
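The online update of Proposition 2 is a recursive least-squares scheme: since ∇_θ z(θ) is linear in v, each new pair (θ, ∇_θ U(θ)) contributes a rank-d block that can be absorbed with the Sherman-Morrison-Woodbury identity. The sketch below uses a toy quadratic target and illustrative sizes, and the update formulas are our reconstruction of the standard RLS recursion rather than formulas copied from the paper; it verifies that the online estimate matches the batch ridge solution:

```python
import numpy as np

rng = np.random.default_rng(4)
s, dim, lam = 20, 2, 1e-3
W, b = rng.normal(size=(s, dim)), rng.normal(size=s)

def grad_features(theta):
    """D(theta)[j, i] = d sigma_i / d theta_j for softplus nodes sigma_i."""
    sig = 1.0 / (1.0 + np.exp(-(W @ theta + b)))   # softplus' = sigmoid
    return W.T * sig                                # dim x s

grad_U = lambda th: th                              # toy target: U = |theta|^2 / 2

# Online score matching: maintain C = (sum_n D_n^T D_n + lam * I)^{-1} and
# absorb each new rank-dim block via the Sherman-Morrison-Woodbury identity.
C, v = np.eye(s) / lam, np.zeros(s)
thetas = rng.normal(size=(100, dim))
for th in thetas:
    D, g = grad_features(th), grad_U(th)
    K = C @ D.T                                     # s x dim
    C = C - K @ np.linalg.solve(np.eye(dim) + D @ K, K.T)
    v = v + C @ D.T @ (g - D @ v)

# Batch ridge solution on the same data, for comparison
Ds = np.vstack([grad_features(th) for th in thetas])
gs = np.concatenate([grad_U(th) for th in thetas])
v_batch = np.linalg.solve(Ds.T @ Ds + lam * np.eye(s), Ds.T @ gs)
```

Per new point the dominant costs are the d×d solve and the s×s matrix updates, matching the stated O(d³ + ds²) complexity with O(s²) storage.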

Algorithm 2 Variational Hamiltonian Monte Carlo
Input: regularization coefficient λ, transition function µ_t, number of hidden neurons s, starting position θ^(1), and HMC parameters
Find the maximum a posteriori (MAP) estimate θ_L and compute the Hessian matrix ∇²_θ U(θ_L)
Randomly assign the node parameters
for t = 1, 2, . . . , T do
    Propose (θ*, r*) with the regularized surrogate induced Hamiltonian flow, using ∇_θ V_t(θ)
    Perform a Metropolis-Hastings step according to the underlying distribution π_t
    Update the surrogate with the newly collected training data
end for

The neural network based surrogate is capable of approximating the potential energy function well when there is enough training data. However, the approximation can be poor when only few training data are available, which is the case in the early stage of the Markov chain simulation. To alleviate this issue, we propose to add an auxiliary regularizer which provides enough information for the sampler at the beginning and gradually diminishes as the surrogate becomes increasingly accurate. Here, we use the Laplace approximation to the potential energy function, but any other fast approximation could be used. The regularized surrogate approximation then takes the form

V_t(θ) = (1 − µ_t) [ U(θ_L) + (1/2)(θ − θ_L)^T ∇²_θ U(θ_L)(θ − θ_L) ] + µ_t z(θ)

where θ_L is the maximum a posteriori (MAP) estimate and µ_t ∈ [0, 1] is a smooth monotone function monitoring the transition from the Laplace approximation to the surrogate approximation. Refining the surrogate approximation by acquiring training data from simulating the regularized surrogate induced Hamiltonian flow, we arrive at an efficient approximate inference method: Variational Hamiltonian Monte Carlo (VHMC) (Algorithm 2).
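The transition schedule and the blended gradient used by the sampler can be sketched as follows; the convex-combination form of V_t is our reading of the regularized surrogate (a smooth transition from the Laplace quadratic to z as µ_t goes from 0 to 1), and all names and values are illustrative:

```python
import numpy as np

# Transition schedule used in section 4: mu_t = 1 - exp(-t / n_s)
mu = lambda t, n_s: 1.0 - np.exp(-t / n_s)

def grad_V(theta, t, theta_L, H_L, grad_z, n_s=200):
    """Gradient of the regularized surrogate, assuming the convex-combination
    reading V_t = (1 - mu_t) * Laplace quadratic + mu_t * z (our assumption)."""
    m = mu(t, n_s)
    return (1.0 - m) * (H_L @ (theta - theta_L)) + m * grad_z(theta)

# At t = 0 the sampler sees only the Laplace term; for t >> n_s, only z.
theta_L, H_L = np.zeros(2), np.eye(2)
g0 = grad_V(np.ones(2), 0, theta_L, H_L, grad_z=lambda th: 5.0 * th)
```

Choosing n_s larger slows the handover, giving the surrogate more time to accumulate training data before it fully drives the dynamics.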
In practice, the surrogate approximation may reach sufficient quality early, and early stopping can then save us from inefficiently updating the output weight vector. In fact, the stopping time t_0 serves as a knob to control the desired approximation quality. Before stopping, VHMC acts as a free-form variational Bayes method that keeps improving itself by collecting training data from the history of the Markov chain. After stopping, VHMC performs as a standard HMC algorithm that samples from the surrogate induced distribution. VHMC thus combines the advantages of both variational Bayes and Hamiltonian Monte Carlo, resulting in higher computational efficiency compared to HMC and better approximation compared to VB.

Experiments
In this section, four experiments are conducted to verify the effectiveness and efficiency of the proposed Variational HMC framework. In the first two examples, we demonstrate the performance of VHMC from a pure approximation perspective and compare it to state-of-the-art fixed-form variational Bayes methods. We then test the efficiency of VHMC on two machine learning problems with large datasets and compare our method to SGLD (Welling and Teh, 2011), one of the state-of-the-art stochastic gradient MCMC methods. Compared to standard HMC, Variational HMC introduces some additional hyperparameters, such as the number of random bases s and the transition monitor µ_t, that might be hard to tune in practice. Generally speaking, we want our method to be as efficient as possible while maintaining good approximation. Therefore, we choose s to be small as long as a reasonable level of accuracy can be achieved, and let µ_t stay small until enough data have been collected to stabilize the training procedure. These choices can all be made by monitoring the acceptance rate in an initial run, as in Zhang et al. (2015). In our experiments, we found that µ_t = 1 − exp(−t/n_s) is a useful schedule, where n_s can be adjusted to accommodate different transition speeds according to the complexity of the model.

A Beta-binomial Model for Overdispersion
We first demonstrate the performance of our Variational Hamiltonian Monte Carlo method on a toy example from Albert (2009), which considers the problem of estimating the rates of death from stomach cancer for the largest cities in Missouri. The data are available from the R package LearnBayes and consist of 20 pairs (n_j, y_j), where n_j records the number of individuals at risk for cancer in city j, and y_j is the number of cancer deaths that occurred in that city. The counts y_j are overdispersed compared to what would be expected under a binomial model with a constant probability; therefore, Albert (2009) assumes a beta-binomial model with mean m and precision K,

p(y_j | m, K) = C(n_j, y_j) B(Km + y_j, K(1 − m) + n_j − y_j) / B(Km, K(1 − m))

and assigns the parameters the following improper prior:

g(m, K) ∝ 1 / ( m(1 − m)(1 + K)² )

[Figure 1: KL-divergence and score matching squared distance between the surrogate approximation and the exact posterior density using an increasing number of hidden neurons.]

The resulting posterior is extremely skewed (as shown in the bottom right corner of Figure 1), and a reparameterization is proposed to ameliorate this issue.
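Under the parameterization above (beta-binomial with α = Km, β = K(1 − m), and B the beta function), the unnormalized log posterior needed by the sampler can be evaluated as follows; the toy counts are our own illustration, not the LearnBayes cancer data:

```python
import numpy as np
from math import lgamma

def betaln(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_post(m, K, n, y):
    """Unnormalized log posterior of the beta-binomial model; the binomial
    coefficient C(n_j, y_j) is constant in (m, K) and dropped."""
    lp = -np.log(m) - np.log(1.0 - m) - 2.0 * np.log(1.0 + K)   # improper prior
    for nj, yj in zip(n, y):
        lp += (betaln(K * m + yj, K * (1.0 - m) + nj - yj)
               - betaln(K * m, K * (1.0 - m)))
    return lp

# Toy counts (illustrative only)
n_obs, y_obs = [100] * 5, [10, 12, 9, 11, 10]
lp1 = log_post(0.1, 50.0, n_obs, y_obs)   # near the empirical death rate
lp2 = log_post(0.9, 50.0, n_obs, y_obs)   # far from it
```

Working with lgamma keeps the evaluation stable even for large counts, where the beta functions themselves would underflow.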
We choose µ_t = 1 − exp(−t/200) as our transition schedule and set the HMC parameters to achieve around 85% acceptance. We run the Variational Hamiltonian Monte Carlo long enough so that we can estimate the full approximation quality of our surrogate. To demonstrate the approximation performance in terms of the number of hidden neurons s, we train the neural network based surrogate using different numbers of hidden neurons and examine the resulting KL-divergence and score matching squared distance to the true posterior density. As we can see from Figures 1 and 2, the neural network based surrogate indeed offers a high quality approximation and becomes more accurate as s increases. The surrogate-induced Hamiltonian flow effectively explores the parameter space and transfers information from the posterior to the surrogate.

Bayesian Probit Regression
Next, we demonstrate the approximation performance of our Variational HMC algorithm relative to existing variational approaches on a simple Bayesian classification problem, binary probit regression. Given N observed data pairs {(y_n, x_n) | y_n ∈ {0, 1}, x_n ∈ R^d}, n = 1, ..., N, the model comprises a probit likelihood function P(y_n = 1|θ) = Φ(θ^T x_n) and a Gaussian prior over the parameter, p(θ) = N(0, 100), where Φ is the standard Gaussian cdf. A full covariance multivariate normal approximation is used for all variational approaches. The synthetic data we use are simulated from the model, with N = 10000 and d = 5. We show the performance averaged over 50 runs for all methods. We compare our algorithm to Variational Bayesian Expectation Maximization (VBEM) (Beal and Ghahramani, 2002; Ormerod and Wand, 2010), and to the best fixed-form variational approximation of Salimans and Knowles (2013) that uses Hessian information and subsampling (FF-Minibatch). VBEM and FF-Minibatch are both implemented using the code provided in Salimans and Knowles (2013) with the same learning rate and minibatch size. For all variational approaches, we initialize the posterior approximation to the prior. For our Variational HMC algorithm, we choose s = 100 random hidden units for the surrogate and set the starting point to be the origin. The number of hidden units is chosen such that the surrogate is flexible enough to fit the target well while remaining fast to compute. The HMC parameters are set to make the acceptance probability around 70%. The target density is almost Gaussian, and a fast transition µ_t = 1 − exp(−t/5) is enough to stabilize our algorithm. Since this experiment is on synthetic data, we follow Salimans and Knowles (2013) and assess the approximation performance in terms of the root mean squared error (RMSE) between the estimate (variational mean for VB and sample mean for VHMC) and the true parameter that was used to generate the dataset.
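The probit log joint targeted by all methods can be sketched as follows; the function name and the isotropic reading of the N(0, 100) prior are our assumptions:

```python
import numpy as np
from scipy.stats import norm

def probit_log_joint(theta, X, y, prior_var=100.0):
    """Log joint of Bayesian probit regression (sketch).

    P(y_n = 1 | theta) = Phi(theta^T x_n), with a N(0, prior_var * I)
    prior over theta (normalizing constant dropped).
    """
    eta = X @ theta
    # log P(y_n | theta) = y_n log Phi(eta_n) + (1 - y_n) log Phi(-eta_n);
    # norm.logcdf avoids underflow for large |eta_n|
    loglik = np.sum(np.where(y == 1, norm.logcdf(eta), norm.logcdf(-eta)))
    logprior = -0.5 * np.dot(theta, theta) / prior_var
    return loglik + logprior
```

The gradient of this log joint is what standard HMC evaluates at every leap-frog step, and what the random-basis surrogate replaces in VHMC.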
Figure 3 shows the performance of our Variational HMC algorithm, as well as the performance of the other two variational Bayes methods. As we can see from the graph, VHMC and the subsampling-based fixed-form variational approach (FF-Minibatch) achieve lower RMSE than the VBEM algorithm. This is due to the extra factorization assumptions made by VBEM when introducing auxiliary variables (Ormerod and Wand, 2010). Even though a Gaussian approximation is already sufficiently accurate on this simple example, VHMC still arrives at a lower RMSE thanks to the extra flexibility provided by the free-form neural network surrogate function.

Bayesian Logistic Regression
Next, we test the efficiency of our Variational HMC method as a scalable sampling method. We first apply Variational HMC to a Bayesian logistic regression model. Given the i-th input vector x_i, the corresponding output (label) y_i ∈ {0, 1} is assumed to follow p(y_i = 1|x_i, β) = 1/(1 + exp(−x_i^T β)), and a Gaussian prior p(β) = N(0, 100) is used for the model parameter β. We test our proposed algorithm on the a9a dataset (Lin et al., 2008). The original dataset, which is compiled from the UCI adult dataset, has 32561 observations and 123 features. We use a 50-dimensional random projection of the original features. We choose s = 2000 hidden units for the surrogate and set a transition schedule µ_t = 1 − exp(−t/500) for our VHMC algorithm. We then compare the algorithm to HMC (Duane et al., 1987; Neal, 2011) and to stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011). For HMC and VHMC, we set the leap-frog stepsize such that the acceptance rate is around 70%. For SGLD, we choose a batch size of 500 and use a range of fixed stepsizes.
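The full-data gradient of this model's log joint, sketched below, is the quantity that VHMC's surrogate approximates and that SGLD estimates from minibatches; the function name is ours:

```python
import numpy as np

def logistic_log_joint_grad(beta, X, y, prior_var=100.0):
    """Gradient of the log joint for Bayesian logistic regression.

    Likelihood: p(y_i = 1 | x_i, beta) = sigmoid(x_i^T beta);
    prior: beta ~ N(0, prior_var * I).
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # P(y_i = 1 | x_i, beta)
    grad_loglik = X.T @ (y - p)           # sum over all N observations
    grad_logprior = -beta / prior_var
    return grad_loglik + grad_logprior
```

An SGLD-style stochastic gradient would replace the full sum X.T @ (y − p) with (N / batch_size) times the same expression computed on a random minibatch, which is exactly the source of the bias-variance trade-off discussed below.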
To illustrate the sampling efficiency of all methods, we follow Ahn et al. (2012): the relative error of mean (REM) and relative error of covariance (REC) up to time t are defined as REM(t) = ||β̄_t − β_o|| / ||β_o|| and REC(t) = ||C_t − C_o|| / ||C_o||, where β̄_t and C_t are the sample mean and sample covariance up to time t, and the ground truth β_o, C_o is obtained using a long run (T = 500K samples) of the HMC algorithm. Figure 4 shows the relative error at time T = 300 and T = 3000 as a function of the time-normalized mean ESS, which is a measure of the mixing rate. The results for the mean are shown on the top, and those for the covariance on the bottom. We run each algorithm with different settings of the parameters that control the mixing rate: the number of leap-frog steps L = [50, 40, 30, 20, 10, 5, 1] for HMC and L = [50, 40, 30, 20, 10, 5] for VHMC, and stepsizes ε = [2e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5] for SGLD.
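These relative errors are straightforward to compute from a chain of samples; the particular norm (Euclidean for the mean, Frobenius for the covariance) is an assumption of this sketch:

```python
import numpy as np

def rem_rec(samples, beta_o, C_o):
    """Relative error of mean (REM) and covariance (REC).

    samples: array of shape (t, d), the chain up to time t.
    beta_o, C_o: ground-truth mean and covariance from a long
    reference run. Norm choice here is an assumption.
    """
    beta_t = samples.mean(axis=0)               # sample mean up to time t
    C_t = np.cov(samples, rowvar=False)         # sample covariance up to time t
    rem = np.linalg.norm(beta_t - beta_o) / np.linalg.norm(beta_o)
    rec = np.linalg.norm(C_t - C_o) / np.linalg.norm(C_o)
    return rem, rec
```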
As we decrease the stepsize, SGLD becomes less biased in the gradient approximation, resulting in smaller relative error. However, the exploration efficiency drops at the same time, and sampling variance gradually dominates the relative error. In contrast, HMC uses a fixed leap-frog stepsize and therefore maintains high exploration efficiency in parameter space. The downside is the expensive computation of the full gradient and the possible turning back of the trajectories when the number of leap-frog steps is unnecessarily large. By adopting a flexible neural network surrogate, VHMC balances computation cost and approximation quality much better than subsampling does, achieving lower relative error at high mixing rates.

Independent Component Analysis
Finally, we apply our method to sample from the posterior distribution of the unmixing matrix in Independent Component Analysis (ICA) (Hyvärinen and Oja, 2000). Given N d-dimensional observations X = {x_n ∈ R^d}, n = 1, ..., N, we model the data as p(x | W) = |det(W)| ∏_{i=1}^d p_i(w_i^T x), where w_i is the i-th row of W and p_i is supposed to capture the true density of the i-th independent component. Following Welling and Teh (2011), we use a Gaussian prior over the unmixing matrix, p(w_ij) = N(0, σ), and choose p_i(y_i) = [4 cosh^2(y_i/2)]^{-1} with y_i = w_i^T x. We evaluate our method using the MEG dataset (Vigário et al., 1997), which has 122 channels and 17730 observations. We extract the first 5 channels for our experiment, which leads to a 25-dimensional sampling problem. We then compare our algorithm to standard HMC and SGLD (Welling and Teh, 2011). For SGLD, we use a natural gradient (Amari et al., 1996), which has been found to improve the efficiency of gradient descent significantly. We set σ = 100 for the Gaussian prior. For HMC and Variational HMC, we set the leap-frog stepsize to keep the acceptance ratio around 70% and set L = 40 to allow efficient exploration of the parameter space. For SGLD, we choose a batch size of 500 and use stepsizes from a polynomial annealing schedule a(b + t)^{−δ}, with a = 5 × 10^{−3}, b = 10^{−4} and δ = 0.5. (This setting reduces the stepsize from 5 × 10^{−5} to 1 × 10^{−6} over 10^7 iterations.) We choose s = 1000 hidden units and set the transition schedule µ_t = 1 − exp(−t/2000) for our Variational HMC algorithm. Note that the likelihood is invariant to row permutations of W. To measure the convergence of all samplers on this ICA problem, we therefore use the Amari distance (Amari et al., 1996) d_A(W, W_0), where W is the sample average and W_0 is the true unmixing matrix estimated using a long run (T = 100K samples) of the standard HMC algorithm.
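A sketch of the resulting log-likelihood, assuming the logistic-type source density p_i(y) = [4 cosh^2(y/2)]^{-1} of Welling and Teh (2011); the function name is ours:

```python
import numpy as np

def ica_log_lik(W, X):
    """Log-likelihood of the ICA model (sketch).

    p(x | W) = |det W| * prod_i p_i(w_i^T x), with source density
    p_i(y) = [4 cosh^2(y/2)]^{-1} (an assumption, following
    Welling and Teh, 2011).
    """
    N = X.shape[0]
    Y = X @ W.T                        # row n, column i holds w_i^T x_n
    # log p_i(y) = -log 4 - 2 log cosh(y / 2)
    log_p = -np.log(4.0) - 2.0 * np.log(np.cosh(0.5 * Y))
    return N * np.log(abs(np.linalg.det(W))) + np.sum(log_p)
```

The |det W| Jacobian term is what makes the full gradient expensive for large d, and the row-permutation invariance noted above is visible here: permuting the rows of W permutes the columns of Y and leaves both terms unchanged.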
The Amari distance as a function of runtime is reported for each of these methods in Figure 5. From the graph, we can see that SGLD converges faster than standard HMC. The noise introduced by subsampling is compensated by fast exploration of the parameter space, which allows SGLD to arrive at the high probability domain early. However, once the variance dominates the bias, the convergence speed slows down, since SGLD requires an annealed (or small) stepsize that inevitably leads to low exploration efficiency. By maintaining efficient exploration in parameter space (same stepsize as HMC) while reducing the computation in simulating the Hamiltonian flow, VHMC outperforms SGLD, arriving at a lower Amari distance much more rapidly.

Conclusion
We have presented a novel approach, Variational Hamiltonian Monte Carlo, for approximate Bayesian inference. Our approach builds on the framework of HMC, but uses a flexible and efficient neural network surrogate function to approximate the expensive full gradient. The surrogate keeps refining its approximation by collecting training data while the sampler explores the parameter space. In this way, our algorithm can be viewed as a free-form variational approach. Unlike subsampling-based MCMC methods, VHMC maintains the relatively high exploration efficiency of its MCMC counterparts while reducing the computation cost. Compared to fixed-form variational approximation, VHMC is more flexible and can thus better approximate more general target distributions. As the complexity of the model increases (e.g., high dimensional covariates), the computational cost of random network surrogates may increase, since more random bases are usually required to provide adequately accurate approximation in high dimensions. However, the free-form surrogate construction in VHMC allows us to reduce the computation cost by adapting to the model when special structure exists. For example, one can build surrogates based on graphical models to exploit conditional independence structure in the model. Alternatively, one can train the full neural network with sparsity constraints imposed on the input weights. Both approaches could be interesting to explore in future work.