The Bayesian Update: Variational Formulations and Gradient Flows

The Bayesian update can be viewed as a variational problem by characterizing the posterior as the minimizer of a functional. The variational viewpoint is far from new and is at the heart of popular methods for posterior approximation. However, some of its consequences seem largely unexplored. We focus on the following one: defining the posterior as the minimizer of a functional gives a natural path towards the posterior by moving in the direction of steepest descent of the functional. This idea is made precise through the theory of gradient flows, allowing us to bring new tools to the study of Bayesian models and algorithms. Since the posterior may be characterized as the minimizer of different functionals, several variational formulations may be considered. We study three of them and their three associated gradient flows. We show that, in all cases, the rate of convergence of the flows to the posterior can be bounded by the geodesic convexity of the functional to be minimized. Each gradient flow naturally suggests a nonlinear diffusion with the posterior as invariant distribution. These diffusions may be discretized to build proposals for Markov chain Monte Carlo (MCMC) algorithms. By construction, the diffusions are guaranteed to satisfy a certain optimality condition, and rates of convergence are given by the convexity of the functionals. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.


Introduction
In this paper we revisit the old idea of viewing the posterior as the minimizer of an energy functional. The use of variational formulations of Bayes rule seems to have been largely focused on one of its methodological benefits: restricting the minimization to a subclass of measures is the backbone of variational Bayes methods for posterior approximation. Our aim is to bring attention to two other theoretical and methodological benefits, and to study in some detail one of these: namely, that each variational formulation suggests a natural path, defined by a gradient flow, towards the posterior. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.
Let us recall informally a variational formulation of Bayes' rule. Given a prior p(u) on an unknown parameter u and a likelihood function L(y|u), the posterior p(u|y) ∝ L(y|u)p(u) can be characterized as the minimizer of a functional over probability distributions; precise formulations are given in section 3. This characterization has several benefits:
1. The variational formulation provides a natural way to approximate the posterior by restricting the minimization problem to distributions q(u) satisfying some computationally desirable property. For instance, variational Bayes methods often restrict the minimization to q(u) with product structure Attias (1999), Wainwright and Jordan (2008), Fox and Roberts (2012). A similar idea is studied in Pinski et al. (2015), where q(u) is restricted to a class of Gaussian distributions. An iterative variational procedure that progressively improves the posterior approximation by enriching the family of distributions was introduced in Guo et al. (2016).
2. If the prior p εn (u) or the likelihood L εn (y|u) depend on a parameter ε n , then the variational formulation allows one to show large-n convergence of posteriors p εn (u|y) by establishing the Γ-convergence of the associated energies. This method of proof has been employed by the authors in Garcia Trillos and Sanz-Alonso (2018a), Garcia Trillos et al. (2017b) to analyze the large-data consistency of graph-based Bayesian semi-supervised learning.
3. Each variational formulation gives a natural path, defined by a gradient flow, towards the posterior. These flows can be thought of as time-parameterized curves in the space of probability measures, converging in the large-time limit towards the posterior.
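The first benefit above can be made concrete in the Gaussian case. The following numerical sketch is an illustrative aside (the target, the mean-field family, and all numerical values are our own choices, not from the references): for a Gaussian posterior, the minimizer of the Kullback-Leibler divergence over mean-field (product) Gaussians has the posterior mean and componentwise variances equal to the reciprocals of the diagonal entries of the posterior precision matrix.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) for multivariate Gaussians."""
    d = len(m0)
    S1inv = np.linalg.inv(S1)
    dm = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + dm @ S1inv @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Hypothetical target posterior: a correlated 2D Gaussian.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Mean-field family q = N(m, diag(s2)). For a Gaussian target, the KL-optimal
# member has mean mu and variances 1 / diag(Sigma^{-1}).
s2_opt = 1.0 / np.diag(np.linalg.inv(Sigma))
q_opt = kl_gauss(mu, np.diag(s2_opt), mu, Sigma)

# Sanity check: the claimed optimum beats randomly perturbed mean-field candidates.
rng = np.random.default_rng(0)
perturbed = [kl_gauss(mu + rng.normal(0, 0.1, 2),
                      np.diag(s2_opt * np.exp(rng.normal(0, 0.2, 2))),
                      mu, Sigma) for _ in range(100)]
```

Because the correlation is not representable in the product family, q_opt is strictly positive: the mean-field approximation cannot drive the divergence to zero.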
In this paper we study three gradient flows associated with the variational formulations defined by minimization of the functionals J KL , J χ 2 , and D μ . For intuition, we recall that a gradient flow in Euclidean space defines a curve whose tangent always points in the direction of steepest descent of a given function (see equation (7)). In the same fashion, a gradient flow in a more general metric space can be thought of as a curve on that space that always points in the direction of steepest descent of a given functional Ambrosio et al. (2008). In Euclidean space the direction of steepest descent is naturally defined as that in which a Euclidean infinitesimal increment leads to the largest decrease in the value of the function. In a general metric space the direction of steepest descent is the one in which an infinitesimal increment defined in terms of the distance leads to the largest decrease in the value of the functional. In this paper we study: (i) The gradient flows defined by J KL and J χ 2 in the space of probability measures with finite second moments endowed with the Wasserstein distance (definitions are given in (3)). By construction, these flows give curves of probability measures that evolve following the direction of steepest descent of J KL and J χ 2 in Wasserstein distance, converging to the posterior measure in the large-time limit.
(ii) The gradient flow defined by the Dirichlet energy D μ in the space of square integrable densities endowed with the L 2 distance. By construction, this flow gives a curve of densities in L 2 that evolves following the direction of steepest descent of D μ in L 2 distance, converging to the posterior density in the large-time limit. Interestingly, the curve of measures associated with these densities is exactly the same as the curve defined by the J KL flow on Wasserstein space Jordan et al. (1998).
A question arises: what is the rate of convergence of these flows to the posterior? The answer is, to a large extent, provided by the theory of optimal transport and gradient flows Ambrosio et al. (2008), Villani (2003), Villani (2008), Santambrogio (2015). We will review and provide a unified account of these results in the main body of the paper, section 3. In the remainder of this introduction we discuss how rates of convergence may be studied in terms of convexity of functionals, and how these rates may be used as a guide for the choice of proposals for MCMC methods.
Rates of convergence of the flows hinge on the convexity level of each of the functionals J KL , J χ 2 , and D μ . Recalling the Euclidean case may be helpful: gradient descent on a highly convex function will lead to fast convergence to the minimizer. What is, however, a sensible notion of convexity for functionals defined over measures or densities? Our presentation highlights that the notion of geodesic (or displacement) convexity McCann (1997) nicely unifies the theory: it guarantees the existence and uniqueness of the three gradient flows and it also provides a bound on their rate of convergence to the posterior. In the L 2 setting one can show that positive geodesic convexity is equivalent to the posterior satisfying a Poincaré inequality, and also to the existence of a spectral gap-see subsection 3.2. On the other hand, the geodesic convexity of J KL and J χ 2 in Wasserstein space is determined by the Ricci curvature of the manifold, as well as by the likelihood function and prior density-see (10), (11), (13), (14). Typically the three functionals J KL , J χ 2 , and D μ will have different levels of geodesic convexity, and establishing a sharp bound on each of them may not be equally tractable.
The theory of gradient flows and optimal transport gives, for each of the flows, an associated Fokker-Planck partial differential equation (PDE) that governs the evolution of densities Ohta and Takatsu (2011), Santambrogio (2015). Such PDEs are typically costly to discretize if the parameter space is of moderate or high dimension, but they may be used in small-dimensional problems as a way to define tempering schemes. Here we do not explore this idea any further. Instead, we focus on the (nonlinear) diffusion processes associated with the PDEs. These diffusions are Langevin-type stochastic differential equations, whose evolving densities satisfy the Fokker-Planck equations. By construction, the invariant distribution of each of these diffusions is the sought posterior, and a bound on the rate of convergence of the diffusions to the posterior is given by the geodesic convexity of the corresponding functional. The gradient flow perspective automatically gives a sense in which the diffusions are optimal: the associated densities move locally (in Wasserstein or L 2 sense) in the direction of steepest descent of the functional. From this it immediately follows, for instance, that the law of a standard Langevin diffusion in Euclidean space evolves locally in Wasserstein space in the direction that minimizes the Kullback-Leibler divergence, and that it also evolves locally in L 2 in the direction that minimizes the Dirichlet energy.
The MCMC methodology allows one to use a proposal based on a discretization of a diffusion, combined with an accept-reject mechanism to remove the discretization bias, to produce, in the large-time asymptotic, correlated posterior samples. Heuristically, the rate of convergence of the un-discretized diffusion may guide the choice of proposal. Proposals based on Langevin diffusions were first suggested in Besag (1994), and the exponential ergodicity of the resulting algorithms was analyzed in Roberts and Tweedie (1996). The paper Girolami and Calderhead (2011) considered changing the metric on the parameter space in order to accelerate MCMC algorithms by taking into account the geometric structure that the posterior defines in the parameter space. This led to a new family of Riemannian MCMC algorithms. Our paper is concerned with the study of un-discretized diffusions; the effect of the accept-reject mechanism on rates and ergodicity of MCMC methods will be studied elsewhere. We suggest that a way to guide the choice of metric of Riemannian MCMC methods is to choose the one that leads to the fastest rate of convergence of the diffusion under certain constraints. We emphasize that, despite working with un-discretized diffusions, our guidance for the choice of proposals accounts for the fact that discretization will eventually be needed. Our criterion weeds out choices of metric that lead to diffusions that achieve a fast rate of convergence by merely speeding up the drift. This is crucial, since a larger drift typically leads to a larger discretization error, and therefore to more rejections in the MCMC accept-reject mechanism in order to remove the bias in the discrete chain. This important constraint on the size of the drift seems to have been overlooked in existing continuous-time analyses of MCMC methods.
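To make the connection between discretized diffusions and MCMC proposals concrete, the following sketch (the step size and the toy Gaussian target are illustrative choices of ours) implements the Metropolis-adjusted Langevin algorithm: one Euler-Maruyama step of the Langevin diffusion serves as proposal, and the accept-reject mechanism removes the discretization bias.

```python
import numpy as np

def mala(grad_logpi, logpi, x0, step, n, rng):
    """Metropolis-adjusted Langevin: Euler-Maruyama discretization of the
    Langevin diffusion dX = grad log pi(X) dt + sqrt(2) dB as a proposal,
    with an accept-reject step removing the discretization bias."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    out = np.empty((n, x.size))
    for i in range(n):
        # Proposal: one Euler-Maruyama step of the Langevin diffusion.
        y = x + step * grad_logpi(x) + np.sqrt(2 * step) * rng.standard_normal(x.size)
        # log q(x | y) and log q(y | x) for the Gaussian proposal kernel.
        lq_xy = -np.sum((x - y - step * grad_logpi(y))**2) / (4 * step)
        lq_yx = -np.sum((y - x - step * grad_logpi(x))**2) / (4 * step)
        if np.log(rng.uniform()) < logpi(y) - logpi(x) + lq_xy - lq_yx:
            x = y
        out[i] = x
    return out

# Toy posterior: standard 1D Gaussian, log pi(x) = -x^2/2 + const.
rng = np.random.default_rng(1)
samples = mala(lambda x: -x, lambda x: -0.5 * float(x @ x), [3.0], 0.5, 20000, rng)
est_mean, est_var = samples[2000:].mean(), samples[2000:].var()
```

After discarding a burn-in, the empirical mean and variance match the target's (0 and 1) up to Monte Carlo error.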
In summary, the following points highlight the key elements and common structure of the variational formulations of the Bayesian update and of the study of the associated gradient flows: • The posterior can be characterized as the minimizer of different functionals on probability measures or densities.
• One can then study the gradient flows of these functionals with respect to a metric on the space of probability measures or densities; the resulting curve is a curve of maximal slope and its endpoint is the posterior.
• The gradient flows are characterized by a Fokker-Planck PDE that governs the evolution of the density of an associated diffusion process.
• By studying the convexity of the functionals (with respect to a given metric) one can obtain rates of convergence of the gradient flows towards the posterior.
In particular, the level of convexity determines the speed of convergence of the densities of the associated diffusion process towards the posterior, and hence can be used as a criterion to guide the choice of proposals for MCMC methods; here we emphasize that care must be taken when comparing different diffusions if a higher speed of convergence comes at the cost of a more expensive discretization.
The ideas in this paper immediately extend beyond the Bayesian interpretation stressed here to any application (e.g. the study of conditioned diffusions) where a measure of interest is defined in terms of a reference measure and a change of measure. Also, we consider only Kullback-Leibler and χ 2 prior penalizations to define the functionals J KL and J χ 2 , but it would be possible to extend the analysis to the family of m-divergences introduced in Ohta and Takatsu (2011); Kullback-Leibler and χ 2 prior penalization correspond to m → 1 and m = 2, respectively, within this family. In what follows we point out some of the features of the different functionals and gradient flows that we consider in this paper.

Comparison of Functionals and Flows
We now provide a comparison of the three choices of functionals that we consider.
1. The two gradient flows in Wasserstein space (arising from the functionals J KL and J χ 2 ) are fundamentally connected with the variational formulation: these variational formulations can be used to define posterior-type measures via a penalization of deviations from the prior and deviations from the data in situations where establishing the existence of conditional distributions by disintegration of measures is technically demanding. On the other hand, the variational formulation for the Dirichlet energy is less natural and requires previous knowledge of the posterior.
2. The precise level of geodesic convexity of the functionals J KL (and J χ 2 ) can be computed from point evaluation of the Ricci tensor (of the parameter space) and derivatives of the densities. In particular, knowledge of the underlying metric suffices to compute these quantities. In contrast, establishing a sharp Poincaré inequality (that is, determining the level of geodesic convexity of the Dirichlet energy in L 2 (M, μ)) is in practice infeasible, as it effectively requires solving an infinite-dimensional optimization problem. It is for this reason, and because of the explicit dependence of the convexity in Wasserstein space on the geometry induced by the manifold metric tensor, that our analysis of the choice of metric in Riemannian MCMC methods is based on the J KL functional (see section 4, and in particular Theorem 10).
3. On the flip side of point 2, a Poincaré inequality for the posterior with a not necessarily optimal constant can be established using only tail information. In particular, even when the functional J KL is not geodesically convex in Wasserstein space, one may still be able to obtain a Poincaré inequality (see subsection 5.2 for an example).
4. In contrast to the diffusions arising from the J KL or Dirichlet flows, the stochastic processes arising from the J χ 2 formulation are inhomogeneous, and hence simulation seems more challenging unless further structure is assumed on the prior measure and likelihood function. Also, the evolution of densities of the gradient flow of J χ 2 in Wasserstein space is given by a porous medium PDE.

Outline
The rest of the paper is organized as follows. Section 2 contains some background material on the Wasserstein space, geodesic convexity of functionals, and gradient flows in metric spaces. The core of the paper is section 3, where we study the geodesic convexity, PDEs, and diffusions associated with each of the three functionals J KL , J χ 2 , and D μ . In section 4 we consider an application of the theory to the choice of metric in Riemannian MCMC methods Girolami and Calderhead (2011).

Set-up and Notation
(M, g) will denote a smooth connected m-dimensional Riemannian manifold with metric tensor g representing the parameter space. We will denote by d the associated Riemannian distance, and assume that (M, d) is a complete metric space. By the Hopf-Rinow theorem it follows that M is a geodesic space; we refer to subsection 2.1 for a discussion on geodesic spaces and their relevance here. We denote by vol g the associated volume form. To emphasize the dependence of differential operators on the metric with which M is endowed, we write ∇ g , div g , Hess g and Δ g for the gradient, divergence, Hessian, and Laplace-Beltrami operators on (M, g). The reader not versed in Riemannian geometry may focus on the case M = R m with the usual metric tensor, in which case d is the Euclidean distance and dvol g = dx is the Lebesgue measure. However, in section 4, where we discuss applications to Riemannian MCMC, we endow R m with a general metric tensor g, and hence familiarity with some notions from differential geometry is desirable.
We denote by P(M) the space of probability measures on M (endowed with the Borel σ-algebra). We will be concerned with the update of a prior probability measure π ∈ P(M), representing various degrees of belief on the value of a quantity or parameter of interest, into a posterior probability measure μ ∈ P(M), based on observed data y. We will assume that the prior is defined as a change of measure from vol g , and that the posterior is defined as a change of measure from π as follows:

π = e −Ψ vol g ,   dμ/dπ ∝ e −φ .

The data is incorporated in the Bayesian update through the negative log-likelihood function φ(·) = φ(·; y).

Preliminaries
In this section we provide some background material. The Wasserstein space, and the notion of λ-geodesic convexity of functionals are reviewed in subsection 2.1. Gradient flows in metric spaces are reviewed in subsection 2.2.

Geodesic Spaces and Geodesic Convexity of Functionals
A geodesic space (X, d X ) is a metric space with a notion of length of curves that is compatible with the metric, and where every two points in the space can be connected by a curve whose length achieves the distance between the points (see Burago et al. (2001) for more details). Geodesic spaces constitute a large family of metric spaces with a rich theory of gradient flows. Here we consider three geodesic spaces. First, the base space (M, d), i.e. the manifold M equipped with its Riemannian distance. Second, the space P 2 (M) of square integrable Borel probability measures defined on M, endowed with the Wasserstein distance W 2 . Third, the space of functions f ∈ L 2 (M, μ), with ∫ M f dμ = 1, equipped with the L 2 (M, μ) norm. We spell out the definitions of P 2 (M) and W 2 :

P 2 (M) := { ν ∈ P(M) : ∫ M d(x, x 0 ) 2 dν(x) < ∞ for some x 0 ∈ M },   W 2 (ν 1 , ν 2 ) 2 := inf α ∫ M×M d(x, y) 2 dα(x, y). (3)

The infimum in the previous display is taken over all transportation plans between ν 1 and ν 2 , i.e. over α ∈ P(M × M) with marginals ν 1 and ν 2 on the first and second factors. The space (P 2 (M), W 2 ) is indeed a geodesic space: geodesics in (P 2 (M), W 2 ) are induced by those in (M, d). All it takes to construct a geodesic connecting ν 0 ∈ P 2 (M) and ν 1 ∈ P 2 (M) is to find an optimal transport plan between ν 0 and ν 1 to determine source locations and target locations, and then transport the mass along geodesics in M (see Villani (2003) and Santambrogio (2015)).
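As a small numerical illustration of the Wasserstein distance (an aside of ours; the Gaussian example and sample sizes are illustrative), in one dimension the optimal transport plan is monotone, so W 2 between two empirical measures with equally many atoms is computed by sorting and pairing in order.

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W_2 between two empirical measures on R with equally many atoms.
    In 1D the optimal plan is monotone: sort both samples and pair in order."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys)**2))

rng = np.random.default_rng(0)
n = 200000
a = 2.0 + 0.5 * rng.standard_normal(n)    # samples from N(2, 0.5^2)
b = -1.0 + 1.5 * rng.standard_normal(n)   # samples from N(-1, 1.5^2)

# For 1D Gaussians, W_2(N(m1, s1^2), N(m2, s2^2))^2 = (m1 - m2)^2 + (s1 - s2)^2.
w2_exact = np.sqrt((2.0 - (-1.0))**2 + (0.5 - 1.5)**2)
w2_est = w2_empirical_1d(a, b)
```

The empirical value approaches the closed-form Gaussian distance as the sample size grows.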
The space of functions f ∈ L 2 (M, μ), with ∫ M f dμ = 1, equipped with the L 2 (M, μ) norm is also a geodesic space, where a constant speed geodesic connecting f 0 and f 1 is given by linear interpolation:

f t := (1 − t)f 0 + tf 1 ,   t ∈ [0, 1].

We will consider several functionals E : X → R ∪ {∞} throughout the paper. They will all be defined in one of our three geodesic spaces, that is, X = M, X = P 2 (M) or X = L 2 (M, μ). Important examples include the Kullback-Leibler divergence D KL (· π), where π is a given (prior) measure and, for ν 1 , ν 2 ∈ P(M),

D KL (ν 1 ν 2 ) := ∫ M log (dν 1 /dν 2 ) dν 1 if ν 1 ≪ ν 2 , and D KL (ν 1 ν 2 ) := +∞ otherwise,

and the potential-type functional J:

J(ν) := ∫ M h dν,

where h is a given potential function.

The Dirichlet energy
The Dirichlet energy of f ∈ L 2 (M, μ) is defined by

D μ (f ) := ∫ M |∇ g f | 2 dμ. (6)

Recall that here and throughout, ∇ g denotes the gradient in (M, g) and | · | is the norm on each tangent space T x M.
A crucial unifying concept will be that of λ-geodesic convexity of functionals. We recall it here.

Definition 1. A functional E : X → R ∪ {∞} on a geodesic space (X, d X ) is said to be λ-geodesically convex, for λ ∈ R, if for every x 0 , x 1 ∈ X there is a constant speed geodesic (x t ) t∈[0,1] connecting x 0 and x 1 such that, for all t ∈ [0, 1],

E(x t ) ≤ (1 − t)E(x 0 ) + tE(x 1 ) − (λ/2) t(1 − t) d X (x 0 , x 1 ) 2 .

The following remark characterizes the λ-convexity of functionals when X = M.
Remark 2. Let Ψ ∈ C 2 (M) so that we can define its Hessian at all points in M (see the proof of Theorem 10 in the Supplementary Material (Garcia Trillos and Sanz-Alonso, 2018b) for the definition). Then the following conditions are equivalent: (i) Ψ is λ-geodesically convex; (ii) Hess g Ψ(x) ⪰ λ g x for every x ∈ M. If (M, d) is the Euclidean space, (i) and (ii) are also equivalent to: (iii) the function x → Ψ(x) − (λ/2)|x| 2 is convex. This latter condition is known in the optimization literature as strong convexity.

Gradient Flows in Metric Spaces
In this subsection we review the basic concepts needed to define gradient flows in a metric space (X, d X ). We follow Chapter 8 of Santambrogio (2015); a standard technical reference is Ambrosio et al. (2008).
To guide the reader, we first recall the formulation of gradient flows in Euclidean space, where X = R d and d X is the Euclidean metric. Let E : R d → R be a differentiable function, and consider the equation

ẋ(t) = −∇E(x(t)),   x(0) = x 0 . (7)

Then, the solution x to (7) is the gradient flow of E in Euclidean space with initial condition x 0 ; it is a curve whose tangent vector at every point in time is the negative of the gradient of the function E at the current point. In order to generalize the notion of a gradient flow to functionals defined on more general metric spaces, and in particular when the metric space has no differential structure, we reformulate (7) in integral form by using that

E(x(0)) − E(x(t)) = (1/2) ∫ 0 t |ẋ(s)| 2 ds + (1/2) ∫ 0 t |∇E(x(s))| 2 ds for all t > 0. (8)

This identity, known as the energy dissipation equality, is equivalent to (7)-see Chapter 8 of Santambrogio (2015) for further details and other possible formulations. Crucially, (8) involves notions that can be defined in an arbitrary metric space (X, d X ): the metric derivative of a curve t → x(t) ∈ X is given by

|x′|(t) := lim h→0 d X (x(t + h), x(t)) / |h|,

and the modulus of the gradient of E can be replaced by a metric notion of slope. The identity (8) is the standard way to introduce gradient flows in arbitrary metric spaces. In this paper we consider gradient flows in L 2 and Wasserstein spaces, where the notion of tangent vector is available. L 2 has Hilbert space structure, whereas the Wasserstein space can be seen as an infinite dimensional manifold (see Ambrosio et al. (2008), Santambrogio (2015)).
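The energy dissipation equality (8) can be checked numerically for a quadratic energy in Euclidean space. The following sketch (the matrix, initial condition, and step sizes are illustrative) integrates the gradient flow with explicit Euler steps and compares both sides of the identity; along a gradient flow |ẋ| = |∇E(x)|, so the two integrals coincide.

```python
import numpy as np

# Gradient flow x'(t) = -grad E(x(t)) for E(x) = 0.5 x^T A x, integrated with
# small explicit Euler steps; we then check the energy dissipation equality
# E(x(0)) - E(x(T)) = 0.5 \int |x'|^2 ds + 0.5 \int |grad E(x)|^2 ds.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
E = lambda x: 0.5 * x @ A @ x
gradE = lambda x: A @ x

dt, T = 1e-4, 2.0
x0 = np.array([1.0, -2.0])
x = x0.copy()
dissipated = 0.0
for _ in range(int(T / dt)):
    g = gradE(x)
    # Along the flow |x'| = |grad E(x)|, so both halves of (8) equal 0.5*|g|^2.
    dissipated += dt * (0.5 * g @ g + 0.5 * g @ g)
    x = x - dt * g
gap = E(x0) - E(x) - dissipated   # should vanish up to O(dt) error
```

The residual `gap` shrinks as the step size is refined, consistent with (8) holding exactly for the continuous-time flow.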

Variational Characterizations of the Posterior and Gradient Flows
In this section we lay out the main elements of the theory of variational formulations and gradient flows in regards to the Bayesian update. Subsection 3.1 details three variational formulations defined in terms of the functionals J KL , J χ 2 and the Dirichlet energy D μ . Subsection 3.2 studies the geodesic convexity of J KL and J χ 2 in Wasserstein space and of D μ in L 2 . Finally, subsection 3.3 collects the PDEs that characterize the gradient flows, as well as the corresponding diffusion processes.

Variational Formulation of the Bayesian Update
The variational formulation of the posterior as the minimizer of J KL and J χ 2 share the same structure and will be outlined first. The variational formulation in terms of the Dirichlet energy will be given below.

The Functionals J KL and J χ 2
In mathematical analysis Jordan and Kinderlehrer (1996) and probability theory Dupuis and Ellis (2011) it is often useful to note that a probability measure μ defined by

dμ/dπ = (1/Z) e −φ ,   Z := ∫ M e −φ dπ, (9)

is the minimizer of the functional

J KL (ν) := D KL (ν π) + F KL (ν; φ),   ν ∈ P(M), (10)

where

F KL (ν; φ) := ∫ M φ dν,

and the integral is interpreted as +∞ if φ is not integrable with respect to ν. In physical terms, the Kullback-Leibler divergence represents an internal energy, F KL represents a potential energy, and the constant Z is known as the partition function. Here we are concerned with a statistical interpretation of equation (9), and view it as defining a posterior measure as a change of measure from a prior measure. In this context, the Kullback-Leibler term D KL (· π) in (10) represents a penalization of deviations from prior beliefs, the term F KL (ν; φ) penalizes deviations from the data, and the normalizing constant Z represents the marginal likelihood. For brevity, we will henceforth suppress the data y from the negative log-likelihood function φ, writing φ(u) instead of φ(u; y).
We remark that the fact that μ minimizes J KL follows immediately from the identity

J KL (ν) = D KL (ν μ) − log Z,   ν ∈ P(M).

Minimizing J KL (·) or D KL (· μ) is thus equivalent, but the functional J KL makes apparent the roles of the prior and the likelihood.
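The identity J KL (ν) = D KL (ν‖μ) − log Z is elementary to verify numerically on a finite state space, where all integrals reduce to finite sums (the state-space size and the random prior below are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 6
prior = rng.dirichlet(np.ones(k))        # pi, a random prior on k states
phi = rng.uniform(0.0, 3.0, k)           # negative log-likelihood values
Z = np.sum(np.exp(-phi) * prior)         # marginal likelihood (partition function)
post = np.exp(-phi) * prior / Z          # posterior mu ∝ exp(-phi) * pi

def dkl(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    return np.sum(p * np.log(p / q))

def J_KL(nu):
    """J_KL(nu) = D_KL(nu || pi) + E_nu[phi]."""
    return dkl(nu, prior) + np.sum(phi * nu)

nu = rng.dirichlet(np.ones(k))           # arbitrary test measure
identity_gap = J_KL(nu) - (dkl(nu, post) - np.log(Z))
```

Since D KL (ν‖μ) is minimized (with value zero) at ν = μ, the identity also confirms that the posterior minimizes J KL .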
The posterior μ also minimizes the functional J χ 2 , defined analogously in terms of a χ 2 prior penalization and a potential-type term; we refer to Ohta and Takatsu (2011) for details. Note that both J KL and J χ 2 are defined in terms of the two starting points of the Bayesian update: the prior π and the negative log-likelihood φ. The associated variational formulations suggest a way to define posterior-type measures based on these two ingredients in scenarios where establishing the existence of conditional distributions via disintegration of measures is technically demanding. This appealing feature of the two variational formulations above is not shared by the one described in the next subsection.

The Dirichlet Energy D μ
Let now the posterior μ be given, and consider the space L 2 (M, μ) of functions defined on M which are square integrable with respect to μ. Recall the Dirichlet energy introduced in equation (6). Now, since the measure μ can be characterized as the probability measure whose density with respect to μ is ρ μ ≡ 1, it follows that the posterior density ρ μ ≡ 1 is the minimizer of the Dirichlet energy D μ over probability densities ρ ∈ L 2 (M, μ) with ∫ M ρ dμ = 1.

Geodesic Convexity and Functional Inequalities
In this section we study the geodesic convexity of the functionals J KL , J χ 2 , and D μ . The geodesic convexity of J KL and J χ 2 in Wasserstein space is considered first, and will be followed by the geodesic convexity of D μ in L 2 . We will show the equivalence of the latter to the posterior satisfying a Poincaré inequality.

Geodesic Convexity of J KL and J χ 2
The next proposition can be found in von Renesse and Sturm (2005) and Sturm (2006). It shows that the convexity of J KL can be determined by the so-called curvature-dimension condition, a condition that involves the curvature of the manifold and the Hessian of the combined change of measure Ψ + φ. We recall the notation π = e −Ψ vol g and μ ∝ e −φ π.

Proposition 3. The functional J KL (equivalently, D KL (· μ)) is λ-geodesically convex in (P 2 (M), W 2 ) if and only if

Ric g + Hess g (Ψ + φ) ⪰ λ g. (13)
We recall that the Ricci curvature provides a way to quantify the disagreement between the geometry of a Riemannian manifold and that of ordinary Euclidean space. The Ricci tensor is defined as the trace of a map involving the Riemannian curvature (see do Carmo Valero (1992)).
The following example illustrates the geodesic convexity of D KL (· μ) for Gaussian μ. Example 1. Let μ = N (θ, Σ) be a Gaussian measure in R m (endowed with the Euclidean metric), with Σ positive definite. Then D KL (· μ) is 1/Λ max (Σ)-geodesically convex, where Λ max (Σ) is the largest eigenvalue of Σ. This follows immediately from the above, since here Ψ(x) = (1/2)⟨x − θ, Σ −1 (x − θ)⟩, so that Hess Ψ = Σ −1 ⪰ Λ max (Σ) −1 I, and the Euclidean space is flat (its Ricci curvature is identically equal to zero). Note that the level of convexity of the functional depends only on the largest eigenvalue of the covariance, but not on the dimension m of the underlying space.
The λ-convexity of J KL guarantees the existence of the gradient flow of J KL in Wasserstein space. Moreover, it determines the rate of convergence towards the posterior μ. Precisely, if μ 0 is absolutely continuous with respect to μ, and if λ > 0, then the gradient flow t ∈ [0, ∞) → μ t of J KL with respect to the Wasserstein metric starting at μ 0 is well defined and we have:

D KL (μ t μ) ≤ e −2λt D KL (μ 0 μ),   W 2 (μ 0 , μ) 2 ≤ (2/λ) D KL (μ 0 μ). (15)

The second inequality, known as the Talagrand inequality Villani (2003), establishes a comparison between Wasserstein geometry and information geometry. It can be established directly by combining the λ-geodesic convexity of J KL (for positive λ) with the first inequality. From (15) we see that a higher level of convexity of J KL guarantees a faster rate of convergence towards the posterior distribution μ.
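The exponential decay in (15) can be observed in closed form for a one-dimensional Gaussian target, for which the Langevin dynamics realizing the flow stay Gaussian. The sketch below (all parameter values are illustrative) checks that the Kullback-Leibler divergence decays at least as fast as e −2λt with λ = 1/Λ max (Σ), here λ = 1/s2.

```python
import numpy as np

# Langevin diffusion targeting mu = N(0, s2): dX = -(X/s2) dt + sqrt(2) dB.
# Its law stays Gaussian, N(m_t, c_t), with explicit mean/variance, and
# D_KL(law(X_t) || mu) should decay at least like exp(-2*lambda*t), lambda = 1/s2.
s2 = 2.0
lam = 1.0 / s2                        # geodesic convexity level of J_KL here
m0, c0 = 3.0, 0.25                    # mean/variance of the initial Gaussian law

def kl_1d(m, c, s2):
    """KL( N(m, c) || N(0, s2) ) in one dimension."""
    return 0.5 * (c / s2 + m * m / s2 - 1.0 + np.log(s2 / c))

def kl_at(t):
    m = m0 * np.exp(-t / s2)                       # exact mean at time t
    c = s2 + (c0 - s2) * np.exp(-2 * t / s2)       # exact variance at time t
    return kl_1d(m, c, s2)

ts = np.linspace(0.0, 5.0, 51)
ratios = [kl_at(t) / (np.exp(-2 * lam * t) * kl_at(0.0)) for t in ts]
```

All the ratios stay at or below one, matching the first inequality in (15).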
We now turn to the geodesic convexity properties of J χ 2 . We recall that m denotes the dimension of the manifold M. The following proposition can be found in (Ohta and Takatsu, 2011, Theorem 4.1).
Proposition 4. J χ 2 is λ-geodesically convex if and only if both of the following two properties are satisfied:

1. an appropriate weighted Ricci curvature associated with the prior π is non-negative (see Ohta and Takatsu (2011) for the precise condition);

2. φ is λ-geodesically convex as a real-valued function defined on M.
There are two main conclusions that we can extract from the previous proposition. First, condition 1) relates only to the prior distribution π, whereas condition 2) relates only to the likelihood; in particular, the convexity properties of J χ 2 can indeed be studied by examining the prior and the likelihood separately (notice that the proposition gives an equivalence). Second, condition 1) is a qualitative property: if it is not met, there is no hope that the functional J χ 2 has any level of global convexity, even when the likelihood function is highly convex. In addition, if 1) is satisfied, the convexity of φ completely determines the level of convexity of J χ 2 . These features are markedly different from the ones observed in the Kullback-Leibler case.
As for the functional J KL , under the assumption of λ-geodesic convexity of J χ 2 for λ > 0, one can establish analogous functional inequalities quantifying the exponential convergence of the flow towards μ. These inequalities exhibit the fact that a higher level of convexity of J χ 2 guarantees faster convergence towards the posterior distribution μ.

Geodesic Convexity of Dirichlet Energy
We now study the geodesic convexity of the Dirichlet energy functional defined in equation (6). In what follows we denote by ‖ · ‖ the L 2 (M, μ) norm. Let us start by recalling the Poincaré inequality.
Definition 5. We say that a Borel probability measure μ on M has a Poincaré inequality with constant λ if for every f ∈ L 2 (M, μ) satisfying ∫ M f dμ = 0 we have

λ ‖f‖ 2 ≤ D μ (f ). (16)

We now show that Poincaré inequalities are directly related to the geodesic convexity of the functional D μ in the L 2 (M, μ) space.

Proof. First of all we claim that

D μ ((1 − t)f 0 + tf 1 ) = (1 − t)D μ (f 0 ) + tD μ (f 1 ) − t(1 − t)D μ (f 0 − f 1 ) (17)

for all f 0 , f 1 ∈ L 2 (M, μ) and every t ∈ [0, 1]. To see this, it is enough to assume that both D μ (f 0 ) and D μ (f 1 ) are finite and then notice that equality (17) follows from the easily verifiable fact that for an arbitrary Hilbert space V with induced norm | · | one has

|(1 − t)v 0 + tv 1 | 2 = (1 − t)|v 0 | 2 + t|v 1 | 2 − t(1 − t)|v 0 − v 1 | 2 ,   v 0 , v 1 ∈ V, t ∈ [0, 1]. (18)

Now, suppose that μ has a Poincaré inequality with constant λ and consider two functions f 0 , f 1 ∈ L 2 (M, μ) satisfying ∫ M f 0 dμ = ∫ M f 1 dμ = 1. Then, (17) combined with the Poincaré inequality (taking f := f 0 − f 1 , which satisfies ∫ M f dμ = 0) gives:

D μ ((1 − t)f 0 + tf 1 ) ≤ (1 − t)D μ (f 0 ) + tD μ (f 1 ) − λ t(1 − t) ‖f 0 − f 1 ‖ 2 ,

which is precisely the 2λ-geodesic convexity condition for D μ .

Conversely, suppose that D μ is 2λ-geodesically convex in the space of L 2 (M, μ) functions that integrate to one. Let f ∈ L 2 (M, μ) be such that ∫ M f dμ = 0 and, without loss of generality, assume that D μ (f ) < ∞ and that f ≢ 0. Under these conditions, the positive and negative parts of f , f + and f − , can be rescaled to integrate to one, and the Poincaré inequality is obtained directly from (17) and (18) applied to the rescaled functions.

Remark 7.
It is well known that the best Poincaré constant for a measure μ is equal to the smallest non-trivial eigenvalue of the operator −Δ μ g defined formally as

Δ μ g f := (1/ρ μ ) div g (ρ μ ∇ g f ),   ρ μ := dμ/dvol g ,

where div g and ∇ g are the divergence and gradient operators in (M, g). This eigenvalue can be written variationally as

λ 1 = inf { D μ (f ) / ‖f‖ 2 : f ∈ L 2 (M, μ), ∫ M f dμ = 0, f ≢ 0 },

see Kipnis and Varadhan (1986).
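The spectral characterization of the best Poincaré constant can be illustrated by discretizing −Δ μ g for a standard Gaussian μ on a truncated one-dimensional grid (the truncation length, grid size, and discretization scheme below are our own illustrative choices). The exact spectrum in this case is 0, 1, 2, . . . (Hermite functions), so the computed spectral gap should be close to 1.

```python
import numpy as np

# Smallest non-trivial eigenvalue of -Delta_mu = -(1/rho) d/dx (rho d/dx)
# for mu = N(0, 1), discretized on a truncated grid with natural (Neumann)
# boundary conditions.
L, n = 8.0, 800
x = np.linspace(-L, L, n)
h = x[1] - x[0]
rho = np.exp(-x**2 / 2)                            # unnormalized density of mu
rho_mid = np.exp(-((x[:-1] + x[1:]) / 2)**2 / 2)   # density at cell midpoints

# Stiffness matrix of the Dirichlet form D(f) = \int f'^2 rho dx (each grid
# edge contributes rho_mid * (f_{i+1} - f_i)^2 / h) and lumped mass matrix
# of \int f^2 rho dx.
G = np.zeros((n, n))
idx = np.arange(n - 1)
G[idx, idx] += rho_mid / h
G[idx + 1, idx + 1] += rho_mid / h
G[idx, idx + 1] -= rho_mid / h
G[idx + 1, idx] -= rho_mid / h

# Generalized eigenproblem G f = lam M f <-> symmetric M^{-1/2} G M^{-1/2}.
Msqrt_inv = np.diag(1.0 / np.sqrt(rho * h))
eigs = np.sort(np.linalg.eigvalsh(Msqrt_inv @ G @ Msqrt_inv))
gap = eigs[1]                                      # eigs[0] ~ 0 (constants)
```

The zero eigenvalue corresponds to constant functions, and the next one approximates the best Poincaré constant λ 1 = 1 of the standard Gaussian.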

Remark 8. Spectral gaps are used in the theory of MCMC as a means to bound the asymptotic variance of empirical expectations.
Let us now consider the flow t ∈ (0, ∞) → μ t of D μ in L 2 (M, μ) with some initial condition dμ 0 /dμ = ρ 0 . It is well known that this flow coincides with that of the functional J KL in Wasserstein space. However, taking the Dirichlet-L 2 point of view, one can use a Poincaré inequality (i.e. the geodesic convexity of D μ ) to deduce the exponential convergence of μ t towards μ in the χ 2 sense. Indeed, let

χ 2 (t) := ‖ρ t − 1‖ 2 ,   where ρ t := dμ t /dμ.

A standard computation then shows that

d/dt χ 2 (t) = 2 ∫ M (ρ t − 1) (∂ρ t /∂t) dμ = −2 D μ (ρ t ) ≤ −2λ χ 2 (t).

In the second equality we have used that ∂ρ/∂t = Δ μ g ρ, as discussed in subsection 3.3 below, together with an integration by parts; the final inequality is the Poincaré inequality applied to f := ρ t − 1. Hence, by Gronwall's inequality, see e.g. Teschl (2012),

χ 2 (t) ≤ e −2λt χ 2 (0).

PDEs and Diffusions
Here we describe the PDEs that govern the evolution of densities of the three gradient flows, and the stochastic processes associated with these PDEs. We consider first the flows defined with the functionals J KL and D μ and then the flow defined by the functional J χ 2 .

J KL -Wasserstein and D μ -L 2 (M, μ)
It was shown in Jordan et al. (1998)-in the Euclidean setting and in the unweighted case π = dx-that the gradient flow of the Kullback-Leibler functional D KL (· π) in Wasserstein space produces a solution to the Fokker-Planck equation. More generally, under the convexity conditions guaranteeing the existence of the gradient flow t ∈ (0, ∞) → μ t of D KL (· μ) (equivalently of J KL ) starting from μ 0 ∈ P(M), the densities satisfy (formally) the following Fokker-Planck equations: in terms of the relative density ρ t := dμ t /dμ,

∂ρ/∂t = Δ μ g ρ, (19)

and, in terms of the density θ t of μ t with respect to vol g ,

∂θ/∂t = div g (θ ∇ g (Ψ + φ)) + Δ g θ. (20)

Equation (20) can be identified as the evolution of the densities (w.r.t. dvol g ) of the diffusion

du t = −∇ g (Ψ + φ)(u t ) dt + √2 dB g t , (21)

where B g denotes a Brownian motion defined on (M, g) and ∇ g is the gradient on (M, g). Naturally, the D μ flow in L 2 has the same associated Fokker-Planck equation (19) and diffusion process (21).
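A simple finite-difference sketch (the target, initial condition, grid, and explicit time-stepping are illustrative choices of ours) solves the Fokker-Planck equation (19) for the relative density ρ t = dμ t /dμ in one dimension and checks the convergence of μ t to the posterior in the χ 2 sense:

```python
import numpy as np

# Finite-difference solution of d rho/dt = Delta_mu rho = (1/p) d/dx (p d rho/dx)
# for mu = N(0, 1) with density p, zero-flux boundaries on a truncated grid.
L, n = 6.0, 600
x = np.linspace(-L, L, n)
h = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # density of mu
p_mid = np.exp(-((x[:-1] + x[1:]) / 2)**2 / 2) / np.sqrt(2 * np.pi)

# Initial law mu_0 = N(2, 0.5^2); rho_0 = d mu_0 / d mu, normalized on the grid.
r = np.exp(-(x - 2.0)**2 / (2 * 0.25) + x**2 / 2) / np.sqrt(0.25)
r /= np.sum(r * p * h)

def chi2(r):
    """Discrete chi^2 divergence of mu_t from mu: ||rho - 1||^2 in L^2(mu)."""
    return np.sum((r - 1.0)**2 * p * h)

chi2_0 = chi2(r)
dt, T = 0.4 * h**2, 4.0                             # explicit scheme, stable dt
for _ in range(int(T / dt)):
    flux = p_mid * np.diff(r) / h                   # p * d rho/dx at midpoints
    div = np.zeros(n)
    div[1:-1] = (flux[1:] - flux[:-1]) / h
    div[0], div[-1] = flux[0] / h, -flux[-1] / h    # zero-flux (Neumann) ends
    r = r + dt * div / p
chi2_T = chi2(r)
```

The scheme conserves the total mass ∫ ρ dμ exactly, and the χ 2 divergence decays by several orders of magnitude over the simulated horizon, in line with the Poincaré-based bound of subsection 3.2.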

J χ 2 -Wasserstein
The PDE satisfied (formally) by the densities ρ̄_t := dμ_t/dμ of the J_χ²-Wasserstein flow t ∈ (0, ∞) → μ_t is the (weighted) porous medium equation
$$\frac{\partial \bar\rho}{\partial t} = \Delta_\mu^g\big(\bar\rho^2\big), \tag{22}$$
where, writing dμ = (1/Z) e^{−Ψ} dvol_g, the weighted Laplacian and divergence are defined formally as
$$\Delta_\mu^g f := e^{\Psi}\, \mathrm{div}_g\big(e^{-\Psi}\, \nabla_g f\big), \qquad \mathrm{div}_\mu^g(v) := e^{\Psi}\, \mathrm{div}_g\big(e^{-\Psi}\, v\big). \tag{23}$$
Consider now the stochastic process {u_t}_{t≥0} formally defined as the solution to the nonlinear diffusion
$$du_t = \Big(2\nabla_g \bar\rho(t, u_t) - 2\bar\rho(t, u_t)\, \nabla_g \Psi(u_t)\Big)\, dt + 2\sqrt{\bar\rho(t, u_t)}\, dB_g(t), \tag{24}$$
where ρ̄ is the solution to (22). Let θ_t be the evolution of the densities (with respect to dvol_g) of the above diffusion. Then a formal computation shows that θ satisfies the Fokker-Planck equation
$$\frac{\partial \theta}{\partial t} = \Delta_g\big(2\bar\rho\, \theta\big) - \mathrm{div}_g\Big(\theta\big(2\nabla_g \bar\rho - 2\bar\rho\, \nabla_g \Psi\big)\Big). \tag{25}$$
If we let β := Z e^{Ψ} θ (i.e. β is the density of the law of u_t with respect to μ), we see, using (23), that
$$\frac{\partial \beta}{\partial t} = \Delta_\mu^g\big(\beta^2\big),$$
so that β solves (22) and hence coincides with ρ̄, implying that the distributions of the stochastic process (24) are those generated by the gradient flow of J_χ² in Wasserstein space. In contrast with (21), the process (24) is defined in terms of the solution of the equation satisfied by its densities. In particular, if one wanted to simulate (24) one would need to know the solution of (22) beforehand.

Application: Sampling and Riemannian MCMC
So far we have treated the Riemannian manifold (M, g) as fixed. In this section we take a different perspective and treat the metric g as a free parameter. Precisely, we will now consider a family of gradient flows of the functional J_KL with respect to Wasserstein distances induced by different metrics g on the parameter space. We do this motivated by the so-called Riemannian MCMC methods for sampling, where a change of metric in the base space is introduced in order to produce Langevin-type proposals that are adapted to the geometric features of the target, thereby exploring regions of interest and accelerating the convergence of the chain to the posterior. There are different heuristics regarding the choice of metric (see Girolami and Calderhead (2011)), but no principled way to compare different metrics and rank their performance for sampling purposes.
With the developments presented in this paper we propose one such principled criterion as we describe below. We restrict our attention to the case M = R m .
Let g be a Riemannian metric tensor on R^m defined via
$$g_x(u, v) := \langle G(x)\, u, v \rangle, \qquad u, v \in \mathbb{R}^m,$$
where G(x) is a symmetric positive definite matrix depending smoothly on x. In what follows we identify g with G, refer to both as 'the metric', and use terms such as g-geodesic, g-Wasserstein distance, etc. to emphasize that the notions considered are constructed using the metric g. Let d_g be the distance induced by the metric tensor g and let vol_g be the associated volume form. Notice that in terms of the Lebesgue measure and the metric G, we can write
$$d\mathrm{vol}_g(x) = \sqrt{\det G(x)}\, dx.$$
We use the canonical basis of R^m as a global chart for R^m and consider the canonical vector fields ∂/∂x_1, ..., ∂/∂x_m. The Christoffel symbols associated to the Levi-Civita connection of the Riemannian manifold (R^m, g) can be written in terms of derivatives of the metric as
$$\Gamma_{ij}^k = \frac{1}{2}\, g^{kl}\big(\partial_{x_i} g_{jl} + \partial_{x_j} g_{il} - \partial_{x_l} g_{ij}\big),$$
where on the right-hand side, and in what follows, we use Einstein's summation convention. The proof of the following result is in the Supplementary Material.
The sharp constant λ for which J_KL (or D_KL(· | μ)) is λ-geodesically convex in the g-Wasserstein distance is equal to
$$\lambda_G := \inf_{x \in \mathbb{R}^m} \Lambda_{\min}\Big(G^{-1/2}\big(\mathrm{Hess}\, F + B + C\big)\, G^{-1/2}\Big)(x),$$
where Hess F is the usual (Euclidean) Hessian matrix of F, and B and C are correction matrices whose coordinates involve derivatives of the metric G; in particular, both vanish when G is constant. Moreover, for any a > 0,
$$\lambda_{aG} = a^{-1}\, \lambda_G. \tag{28}$$
Note that λ_G is a key quantity in evaluating the quality of a metric G in building geometry-informed Langevin diffusions for sampling purposes, as it gives the exponential rate at which the evolution of probabilities built using the metric G converges towards the posterior: larger λ_G corresponds to faster convergence. However, in order to establish a fair performance comparison, the metrics need to be scaled appropriately. Indeed, a faster rate can always be obtained by scaling down the metric (which can be thought of as a time-rescaling), as is clearly seen from the scaling property (28) of the functional λ_G. It is important to note that scaling down the metric leads to a faster diffusion, but also makes its discretization more expensive: the error of Euler discretizations is largely determined by the Lipschitz constant of the drift. This motivates the following criterion for a fair choice of metric: maximize λ_G subject to the constraint
$$\mathrm{Lip}\big(\nabla_g F\big) = \mathrm{Lip}\big(G^{-1} \nabla F\big) \le 1, \tag{29}$$
since ∇_g F = G^{-1}∇F (where ∇ denotes the standard Euclidean gradient) is the drift of the diffusion (21). Note that the constraint (29) ensures that the metric cannot be scaled down arbitrarily, while also guaranteeing that the discretizations do not become increasingly expensive. We remark that other constraints involving higher regularity requirements may be useful if higher order discretizations are desired.
Remark 11. The functional λ G can be used to determine the optimal metric among a certain subclass of metrics of interest satisfying the condition (29). For instance, it may be of interest to find the optimal constant metric G (see Proposition 12 below), or to find the best metric within a finite family of metrics. On the other hand the constraint (29) forces feasible metrics to induce diffusions that are not expensive to discretize.
To illustrate the previous remark we show that for a Gaussian target measure the optimal preconditioner is, unsurprisingly, given by the Fisher information. More precisely, we have the following proposition.

Proposition 12. Let the target be N(0, Σ). Then G* := Σ^{-1} maximizes λ_G over the class of constant metrics G satisfying ‖G^{-1}Σ^{-1}‖ ≤ 1, as in (29). Moreover, the maximum value is λ_{G*} = 1.

Proof. Suppose for the sake of contradiction that there exists a constant metric G that satisfies condition (29), which in this case reads ‖G^{-1}Σ^{-1}‖ ≤ 1, and is such that
$$\lambda_G > 1. \tag{31}$$
Let u be a unit norm eigenvector of G with eigenvalue λ > 0. Notice that by definition of λ_G we must have
$$\big\langle G^{-1/2}\Sigma^{-1}G^{-1/2} u,\, u \big\rangle \ge \lambda_G > 1.$$
The left hand side of the above display can be rewritten as
$$\big\langle G^{-1}\Sigma^{-1}G^{-1/2} u,\, G^{1/2} u \big\rangle,$$
and by the Cauchy-Schwarz inequality we see that
$$\big\langle G^{-1}\Sigma^{-1}G^{-1/2} u,\, G^{1/2} u \big\rangle \le \|G^{-1}\Sigma^{-1}\|\, \|G^{-1/2} u\|\, \|G^{1/2} u\| \le \|G^{-1/2} u\|\, \|G^{1/2} u\|.$$
Since u is an eigenvector of G with eigenvalue λ, it follows that u is also an eigenvector of G^{1/2} with eigenvalue √λ and of G^{-1/2} with eigenvalue 1/√λ. Therefore the right hand side of the above display is equal to one. This however contradicts (31). Since λ_{G*} = Λ_min(Σ^{1/2}Σ^{-1}Σ^{1/2}) = 1, we deduce the optimality of G* among feasible metrics.
Consider, for small ε > 0, the target N(0, Σ) with Σ = diag(1, ε), the optimal metric G* = Σ^{-1} = diag(1, ε^{-1}) given by the previous proposition, and the rescaled Euclidean metric G_e := ε^{-1} I, where the scaling has been chosen so that ‖G_e^{-1}Σ^{-1}‖ = 1, as in (29). A calculation then shows that λ_{G*} = 1 while λ_{G_e} = ε. Note that if the Euclidean metric is not rescaled by ε^{-1} (violating the constraint (29)) then the same unit rate of convergence as with the metric G* is achieved. However, the drift of the associated diffusion then has Lipschitz constant ε^{-1}, making its accurate discretization expensive.
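For constant metrics and a Gaussian target, λ_G can be evaluated directly, since Hess F = Σ^{-1} is constant. The following sketch (with the illustrative value ε = 0.01) compares the Fisher metric with the Euclidean metric rescaled to satisfy (29):

```python
import numpy as np

def lambda_G(G, hess_F):
    """lambda_G = Lambda_min(G^{-1/2} Hess F G^{-1/2}) for constant G and Hess F."""
    w, V = np.linalg.eigh(G)
    G_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    return np.linalg.eigvalsh(G_inv_sqrt @ hess_F @ G_inv_sqrt).min()

Sigma = np.diag([1.0, 0.01])          # ill-conditioned Gaussian target N(0, Sigma)
hess_F = np.linalg.inv(Sigma)         # Hess F = Sigma^{-1}

G_star = np.linalg.inv(Sigma)         # Fisher-information metric
# Euclidean metric scaled so that ||G^{-1} Sigma^{-1}|| = 1, as in (29):
G_euc = np.linalg.norm(hess_F, 2) * np.eye(2)

print(lambda_G(G_star, hess_F))       # 1.0: unit rate of convergence
print(lambda_G(G_euc, hess_F))        # 0.01: the reciprocal condition number of Sigma
```

The rescaled Euclidean metric pays the full condition number of Σ in its convergence rate, while the Fisher metric does not.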

Example: Semi-Supervised Learning
In this section we study the geodesic convexity of functionals arising in the Bayesian formulation of semi-supervised classification. Our purpose is to illustrate the concepts in a tangible setting, and to show that establishing sharp levels of geodesic convexity may be more tractable for some functionals than others.
In semi-supervised classification one is interested in the following task: given a data cloud X = {x_1, ..., x_n} together with (noisy) labels y_i ∈ {−1, 1} for some of the data points x_i, i ∈ Z ⊂ {1, ..., n}, classify the unlabeled data points by assigning labels to them. We assume access to a weight matrix W quantifying the level of similarity between the points in X. Thus, we focus on the graph-based approach to semi-supervised classification, which boils down to propagating the known labels to the whole cloud using the geometry of the weighted graph (X, W). We will investigate the existence and convergence of gradient flows for several Bayesian graph-based classification models proposed in Bertozzi et al. (2017). In the Bayesian approach, the geometric structure that the weighted graph imposes on the data cloud is used to build a prior on a latent space, and the given noisy labels are used to build the likelihood. The Bayesian solution to the classification problem is a measure on the latent space, which is then pushed forward to a measure on the label space {−1, 1}^n. This latter measure contains information on the most likely labels, and also provides a principled way to quantify the remaining uncertainty in the classification.
Let (X, W) then be a weighted graph, where X = {x_1, ..., x_n} is the set of nodes of the graph and W is the matrix of weights between the points in X. All the entries of W are non-negative real numbers and we assume that W is symmetric. Let L be the graph Laplacian matrix defined by
$$L := D - W,$$
where D is the degree matrix of the weighted graph, i.e., the diagonal matrix with diagonal entries D_{ii} := Σ_{j=1}^n W_{ij}. The above corresponds to the unnormalized graph Laplacian, but different normalizations are possible; see Von Luxburg (2007). The graph Laplacian will be used in all the models below to favor prior draws of the latent variables that are consistent with the geometry of the data cloud.
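The construction of L can be sketched in a few lines (the point cloud and the compactly supported kernel below are illustrative choices, not taken from the paper):

```python
import numpy as np

def graph_laplacian(X, r=1.0):
    """Unnormalized graph Laplacian L = D - W from a point cloud X (n x d),
    using a compactly supported similarity kernel of radius r."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.maximum(1.0 - dist / r, 0.0)   # illustrative triangular kernel
    np.fill_diagonal(W, 0.0)              # no self-loops
    D = np.diag(W.sum(axis=1))
    return D - W

rng = np.random.default_rng(0)
X = rng.random((50, 2))
L = graph_laplacian(X, r=0.5)

# L is symmetric and annihilates constant functions on the graph:
print(np.allclose(L, L.T), np.allclose(L @ np.ones(50), 0.0))  # True True
```

Symmetry and the zero row sums are exactly the properties that make f ↦ f^T L f a Dirichlet form on the graph.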
Remark 13. A special case of a weighted graph (X, W) frequently found in the literature is that in which the points in X are i.i.d. samples from some distribution on a manifold M embedded in R^d, and the similarity matrix W is obtained as
$$W_{ij} := K\left(\frac{|x_i - x_j|}{r}\right).$$
In the above, K is a compactly supported kernel function, |x_i − x_j| is the Euclidean distance between the points x_i and x_j, and r > 0 is a parameter controlling the connectivity length scale of the graph. It can be shown (see Burago et al. (2013) and Garcia Trillos et al. (2017a)) that the smallest non-trivial eigenvalue of a rescaled version of the resulting graph Laplacian is close to the smallest non-trivial eigenvalue of a weighted Laplacian on the manifold, provided that r is scaled with n appropriately.
In the next two subsections we study the probit and logistic models.

Probit and Logistic Models
Traditionally, the probit approach to semi-supervised learning classifies the unlabeled data points by first optimizing the functional G : R^n → R given by
$$G(u) := \frac{1}{2} \langle L^\alpha u, u \rangle + \phi(u; y) \tag{32}$$
over all u ∈ R^n satisfying Σ_{i=1}^n u_i = 0, where φ is the probit negative log-likelihood in (34) below, and then thresholding the optimizer with the sign function; the parameter α > 0 is used to regularize the functions u. The minimizer of the functional G can be interpreted as the MAP (maximum a posteriori) estimator in the Bayesian formulation of probit semi-supervised learning (see Bertozzi et al. (2017)) that we now recall. Prior: Consider the subspace U := {u ∈ R^n : Σ_{i=1}^n u_i = 0} and let π be the Gaussian measure on U defined by
$$\pi := N\big(0, L^{-\alpha}\big),$$
where L^{-α} denotes the inverse of L^α on U. The measure π is interpreted as a prior over functions on the point cloud X with average zero. Larger values of α > 0 enforce more regularization of the functions u.
Likelihood function: For a fixed u ∈ U and for j ∈ Z define
$$y_j := S(u_j + \eta_j),$$
where the η_j are i.i.d. N(0, γ²) and S is the sign function. This specifies the distribution of the observed labels given the underlying latent variable u; in particular, P(y_j = 1 | u) = Ψ(u_j; γ), where
$$\Psi(t; \gamma) := \frac{1}{\sqrt{2\pi\gamma^2}} \int_{-\infty}^{t} e^{-s^2/(2\gamma^2)}\, ds \tag{33}$$
is the CDF of a N(0, γ²) random variable. We then define, for given data y, the negative log-density function
$$\phi(u; y) := -\sum_{j \in Z} \log \Psi(y_j u_j; \gamma). \tag{34}$$
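The probit likelihood (33)-(34) can be evaluated with the error function. A minimal sketch (the latent vector, labels, and observed index set Z below are illustrative, not data from the paper):

```python
import numpy as np
from math import erf, log, sqrt

def Psi(t, gamma=1.0):
    """CDF of N(0, gamma^2) at t, as in (33)."""
    return 0.5 * (1.0 + erf(t / (gamma * sqrt(2.0))))

def phi(u, y, Z, gamma=1.0):
    """Probit negative log-likelihood (34): -sum_{j in Z} log Psi(y_j u_j; gamma)."""
    return -sum(log(Psi(y[j] * u[j], gamma)) for j in Z)

# Illustrative latent vector and noisy labels on an observed index set Z:
u = np.array([0.5, -1.2, 0.3, 2.0])
y = {0: 1, 1: -1, 3: 1}
Z = sorted(y)
print(phi(u, y, Z))  # a finite positive value; phi is convex in u
```

Convexity of φ (used repeatedly below) follows from the log-concavity of the Gaussian CDF Ψ.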
Posterior distribution: As shown in Bertozzi et al. (2017), a simple application of Bayes' rule gives the posterior distribution of u given y (denoted by μ^y):
$$\frac{d\mu^y}{d\pi}(u) \propto \exp\big(-\phi(u; y)\big),$$
where Ψ is given by (33) and φ is given by (34).
From what has been discussed in the previous sections, the posterior μ^y can be characterized as the unique minimizer of the energy
$$J_{KL}(\nu) := D_{KL}(\nu \,\|\, \pi) + \int \phi(u; y)\, d\nu(u). \tag{35}$$
Let us first consider the gradient flow of J_KL with respect to the usual Wasserstein distance (i.e. the one induced by the Euclidean distance).
We can study the geodesic convexity of this functional by studying separately the convexity properties of D_KL(ν | π) and of φ(·; y). Precisely: i) Since π is a Gaussian measure with covariance L^{-α}, Example 1 shows that D_KL(ν | π) is (Λ_min(L))^α-geodesically convex in Wasserstein space, where Λ_min(L) is the smallest non-trivial eigenvalue of L. ii) Since φ(·; y) is convex, the potential energy ν ↦ ∫ φ(u; y) dν(u) is 0-geodesically convex in Wasserstein space.
It then follows from Proposition 3 that J_KL is (Λ_min(L))^α-geodesically convex in Wasserstein space. As a consequence, if we consider t ∈ [0, ∞) → μ_t, the gradient flow of J_KL with respect to the Wasserstein distance starting at μ_0 (a measure absolutely continuous with respect to μ^y), geometric inequalities can be immediately obtained from (15); such inequalities do not deteriorate with n (see Remark 13).
However, the diffusion associated to this flow is given by
$$dX_t = -\big(L^\alpha X_t + \nabla \phi(X_t; y)\big)\, dt + \sqrt{2}\, dB_t, \tag{36}$$
and in particular the Lipschitz constant of its drift (more precisely of the term L^α X_t) deteriorates as n gets larger.
Notice that if we wanted to control the cost of discretization by rescaling the Euclidean metric (as exhibited in Example 2), the geodesic convexity of the resulting flow would vanish as n gets larger.
The previous discussion shows that the flow of J_KL in the usual Wasserstein sense does not have good convergence properties while at the same time being cheap to discretize (robustly in n). This motivates considering the gradient flow of J_KL with respect to the Wasserstein distance induced by a certain constant metric g. Indeed, inspired by Proposition 12, let us consider the constant metric tensor
$$G := L^\alpha.$$
Since the metric tensor is constant, its induced volume form vol_g is proportional to the Lebesgue measure, and hence we can write
$$\mu^y(du) \propto \exp\big(-F(u)\big)\, du, \qquad F(u) := \phi(u; y) + \frac{1}{2} \langle L^\alpha u, u \rangle.$$
On the other hand, from the discussion in Section 3.3 we know that the densities of the stochastic process
$$dX_t = -\nabla_g F(X_t)\, dt + \sqrt{2}\, dB_g(t)$$
correspond to the gradient flow of the energy J_KL with respect to the Wasserstein distance induced by the metric g, where B_g is a Brownian motion on (R^m, g). This diffusion can be rewritten in terms of the standard Euclidean gradient ∇ and Brownian motion B as
$$dX_t = -\big(X_t + L^{-\alpha}\, \nabla \phi(X_t; y)\big)\, dt + \sqrt{2}\, L^{-\alpha/2}\, dB_t, \tag{37}$$
after noticing that
$$\nabla_g F = G^{-1} \nabla F, \qquad B_g = G^{-1/2} B,$$
where for the second identity we have used the fact that G is constant. How convex is the energy J_KL with respect to the Wasserstein distance induced by g? Since the metric tensor G is constant, it follows that
$$\lambda_G := \inf_{x \in \mathbb{R}^m} \Lambda_{\min}\big(G^{-1/2}\, \mathrm{Hess}\, F(x)\, G^{-1/2}\big),$$
where F(u) := φ(u; y) + ½⟨L^α u, u⟩. Finally, due to the convexity of φ(·; y), we deduce that
$$\lambda_G \ge \Lambda_{\min}\big(G^{-1/2} L^\alpha G^{-1/2}\big) = 1.$$
We notice that in (37) L appears as L^{-α}. This is a fundamental difference from (36), where L appears as L^α, and it has computational advantages, given that the eigenvalues of L grow towards infinity.

Remark 14.
A carefully designed discretization of (37) induces the so-called Langevin pCN proposal for MCMC computations; see Cotter et al. (2013).
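An explicit Euler step of the preconditioned diffusion (37) has the following structure. This sketch is only illustrative: the toy precision matrix and likelihood gradient are made up, and the actual Langevin pCN proposal of Cotter et al. (2013) treats the linear part semi-implicitly rather than explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: a small SPD prior precision L^alpha (alpha = 1)
# and a smooth convex likelihood potential phi with gradient tanh.
L_alpha = np.array([[2.0, -1.0], [-1.0, 2.0]])
C = np.linalg.inv(L_alpha)                 # prior covariance L^{-alpha}
w, V = np.linalg.eigh(C)
C_sqrt = V @ np.diag(np.sqrt(w)) @ V.T     # L^{-alpha/2}
grad_phi = np.tanh                         # gradient of sum_i log cosh(u_i)

def euler_step(x, h, rng):
    """One explicit Euler step of
    dX_t = -(X_t + L^{-alpha} grad_phi(X_t)) dt + sqrt(2) L^{-alpha/2} dB_t."""
    xi = rng.standard_normal(x.size)
    return x - h * (x + C @ grad_phi(x)) + np.sqrt(2.0 * h) * (C_sqrt @ xi)

x = np.zeros(2)
for _ in range(1000):
    x = euler_step(x, h=0.05, rng=rng)
print(np.all(np.isfinite(x)))  # True: the preconditioned chain stays stable
```

Note that the linear drift term is simply −X_t, with unit Lipschitz constant independently of the spectrum of L, which is the computational advantage of the preconditioning.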

Remark 15.
In the above we have considered a probit model for the likelihood function. The ideas generalize straightforwardly to other settings, notably the logistic model
$$\phi(u; y) := -\sum_{j \in Z} \log \sigma(y_j u_j; \gamma), \qquad u \in U, \tag{38}$$
where
$$\sigma(t; \gamma) := \frac{1}{1 + e^{-t/\gamma}}.$$
The convexity of φ for the logistic model (38) can be established by direct computation of the second derivative of t ↦ −log σ(t; γ).
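Indeed, a direct computation gives d²/dt² [−log σ(t; γ)] = σ(t; γ)(1 − σ(t; γ))/γ² > 0. This is easy to confirm numerically via second differences (the grid and γ = 1 are illustrative):

```python
import numpy as np

def neg_log_sigma(t, gamma=1.0):
    # -log sigma(t; gamma), with sigma(t; gamma) = 1 / (1 + exp(-t / gamma))
    return np.log1p(np.exp(-t / gamma))

# Central second differences of -log sigma on a grid: all positive,
# matching the closed form sigma * (1 - sigma) / gamma^2 > 0.
t = np.linspace(-5.0, 5.0, 101)
second_diff = (neg_log_sigma(t[2:]) - 2 * neg_log_sigma(t[1:-1])
               + neg_log_sigma(t[:-2]))
print(np.all(second_diff > 0))  # True: -log sigma is convex, hence so is phi
```

Since φ in (38) is a sum of such convex terms composed with linear maps u ↦ y_j u_j, it is convex in u.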

Ginzburg-Landau Model
We now present the Ginzburg-Landau model for semi-supervised learning. This model will provide us with an example of a functional J_KL whose geodesic convexity with respect to the Wasserstein distance is not positive (and hence one cannot deduce geometric inequalities describing the rate of convergence towards the posterior), but for which one can obtain a positive spectral gap giving the rate of convergence of the flow of the Dirichlet energy in the L² sense.
We consider the following Bayesian model.
with respect to L²(μ). How convex is this functional? For every f ∈ U the Dirichlet energy of f can be bounded below by a positive multiple of the variance of f under μ, from which a positive spectral gap follows. A similar remark to the one at the end of section 5.1, regarding the dependence on L of the resulting diffusion, applies here as well.

Conclusions and Future Work
The main contribution of this paper is to explore three variational formulations of the Bayesian update and their associated gradient flows. We have shown that, for each of the three variational formulations, the geodesic convexity of the objective functionals gives a bound on the rate of convergence of the flows to the posterior. As an application of the theory, we have suggested a criterion for the optimal choice of metric in Riemannian MCMC schemes. We summarize below some additional outcomes and directions for further work.
• We bring attention to different variational formulations of the Bayesian update. These formulations have the potential to extend the theory of Bayesian inverse problems in function spaces, in particular to cases with infinite-dimensional, non-additive, or non-Gaussian observation noise. Moreover, they suggest numerical approximations to the posterior: by restricting the space of allowed measures in the minimization, by discretizing the associated gradient flows, or by sampling via simulation of the associated diffusions.
• The variational framework considered in this paper provides a natural setting for the study of robustness of Bayesian models, and for the analysis of convergence of discrete to continuum Bayesian models. Indeed, Garcia Trillos and Sanz-Alonso (2018a) and Garcia Trillos et al. (2017b) have recently established the consistency of Bayesian semi-supervised learning in the regime with a fixed number of labeled data points and a growing number of unlabeled data points. The analysis relies on the variational formulation based on Kullback-Leibler prior penalization in equation (35).
• Our results give new understanding of the ubiquity of Kullback-Leibler penalizations in sampling methodology. In practice Kullback-Leibler is often used for computational and analytical tractability. The results in section 3.3 show that Kullback-Leibler prior penalization leads to a heat-type flow and, therefore, to an easily discretized diffusion process. On the other hand, χ 2 prior penalization leads to a nonlinear diffusion process.