Stability of doubly-intractable distributions

Doubly-intractable distributions appear naturally as posterior distributions in Bayesian inference frameworks whenever the likelihood contains a normalizing function $Z$. Given two such functions $Z$ and $\widetilde Z$, we provide estimates of the total variation and Wasserstein distances of the resulting posterior probability measures. As a consequence, this yields local Lipschitz continuity w.r.t. $Z$. In the more general framework of a random function $\widetilde Z$ we derive bounds on the expected total variation and expected Wasserstein distance. The applicability of the estimates is illustrated within the setting of two representative Monte Carlo recovery scenarios.


Introduction
Suppose that (Θ, d) is a complete and separable metric space equipped with its Borel σ-algebra B(Θ) and a σ-finite reference measure µ. Assume that there are measurable functions Φ : Θ → (−∞, ∞) and Z : Θ → (0, ∞) such that

π_Z(dθ) := exp(−Φ(θ)) / (C_Z Z(θ)) µ(dθ),  with C_Z := ∫_Θ exp(−Φ(θ))/Z(θ) µ(dθ),  (1)

defines a probability measure on (Θ, B(Θ)). We are interested in stability properties of π_Z w.r.t. the function Z. Given another measurable function Z̃ : Θ → (0, ∞) which is in some sense close to Z, we ask whether π_Z and π_Z̃ are also close to each other. This question is motivated by applications within Bayesian inference, where distributions of this type appear, see for example [4,5] as well as [17] and the references therein. The interpretation is as follows: Think of θ ↦ exp(−Φ(θ))/Z(θ) as a likelihood function which contains an unknown normalizing function Z. One is interested in sampling w.r.t. a posterior distribution based on the partially unknown likelihood function, unknown in the sense that it is infeasible to evaluate Z exactly. This, and the fact that (1) itself contains an unknown normalizing constant C_Z, is the reason for calling π_Z doubly-intractable. A recent survey on approximate sampling of such doubly-intractable distributions is given in [17]. We provide a motivating example for such scenarios.

Example 1.1 (Gibbs distribution as likelihood). A Gibbs distribution on a finite state space G is determined by a probability mass function ̺(· | β) with inverse temperature parameter β > 0 given by

̺(x | β) := exp(−βH(x)) / Z(β),  x ∈ G,

where H : G → [0, ∞) is called the Hamiltonian and Z(β) = Σ_{x∈G} exp(−βH(x)) the partition function. Suppose that observational data x_obs ∈ G is available as a realization of the Gibbs distribution with unknown β. We then aim to gain knowledge of β through the realization x_obs. In a Bayesian framework this leads to a posterior distribution of the form (1) with Θ = (0, ∞), Φ(β) = βH(x_obs) and normalizing function Z(β).
Note that the Ising model fits into this framework: Let (V, E) be a graph with (non-empty) vertex set V, edge set E ⊂ V × V and G = {−1, 1}^V with H(x) = −Σ_{(v,v′)∈E} x(v)x(v′).
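To make the Ising example concrete, the following small self-contained check (our illustration; the 2×2 grid and the value of β are arbitrary choices) enumerates G, evaluates the Hamiltonian and the partition function, and confirms that ̺(· | β) is a probability mass function:

```python
import itertools
import math

# Tiny Ising model on a 2x2 grid (4 vertices, 4 edges), chosen for illustration.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 3), (3, 2), (2, 0)]  # nearest-neighbour edges
states = list(itertools.product([-1, 1], repeat=len(vertices)))  # G = {-1,1}^V

def hamiltonian(x):
    # H(x) = - sum over edges (v, v') of x(v) * x(v')
    return -sum(x[v] * x[w] for v, w in edges)

def partition_function(beta):
    # Z(beta) = sum_{x in G} exp(-beta * H(x))
    return sum(math.exp(-beta * hamiltonian(x)) for x in states)

beta = 0.5
Z = partition_function(beta)
pmf = {x: math.exp(-beta * hamiltonian(x)) / Z for x in states}

print(len(states))                           # |G| = 16
print(abs(sum(pmf.values()) - 1.0) < 1e-12)  # rho(.|beta) sums to one
```

For realistic lattice sizes the enumeration of G is of course infeasible, which is exactly why Z(β) is intractable and the stability question below arises.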
Having such an example in mind, it is reasonable to recover Z by an approximation Z̃ and to gain knowledge of the posterior distribution by sampling w.r.t. π_Z̃ (which is hopefully close to π_Z). A theoretical justification of that approach requires a stability investigation of π_Z w.r.t. Z.
For quantifying stability properties of probability measures we need to specify how we measure the difference of distributions. For this we use either the total variation distance or the Wasserstein distance based on the metric d on (Θ, B(Θ)). The main results of this note, stated and proven in Section 2, are upper bounds on the difference of π_Z and π_Z̃ in terms of Z and Z̃ with respect to these distances. We use the following notation: For a measure ν on (Θ, B(Θ)) we denote the L^p(ν)-norm, for p ≥ 1, by ‖·‖_{ν,p}, that is, for measurable f : Θ → ℝ we have ‖f‖_{ν,p} := (∫_Θ |f(θ)|^p ν(dθ))^{1/p}. In Theorem 2.1 and Theorem 2.7 we prove that the total variation and the Wasserstein distance (w.r.t. d) of π_Z and π_Z̃ are bounded by a constant times ‖Z/Z̃ − 1‖_{π_Z,2}. For the total variation distance the result also holds with the L¹(π_Z)-norm instead of the L²(π_Z)-norm on the right-hand side. In addition, we provide a number of consequences under further regularity conditions. For example, if ‖exp(−Φ)/Z‖_{µ,p} ≤ K and inf_{θ∈Θ} Z̃(θ) ≥ ℓ for some p ∈ [1, ∞], some K < ∞ and ℓ > 0, then a constant (depending on K and ℓ) times ‖Z − Z̃‖_{µ,p/(p−1)}  (2)
is again an upper bound of the total variation distance. Under additional moment conditions on µ and ‖exp(−Φ)/Z̃‖_{µ,p} ≤ K a similar estimate is verified for the Wasserstein distance, see Corollary 2.9. Note that from (2) one can conclude a local Lipschitz continuity of the mapping Z ↦ π_Z from a subset of L^{p/(p−1)}(µ)-functions to the set of probability measures on (Θ, B(Θ)).
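To indicate where an estimate of the form (2) comes from, the following computation (a sketch of ours, under the assumptions ‖e^{−Φ}/Z‖_{µ,p} ≤ K and inf_θ Z̃(θ) ≥ ℓ, combined with the L¹(π_Z)-bound for the total variation distance) uses only the definition of π_Z and Hölder's inequality with exponents p and p/(p−1):

```latex
\left\| \frac{Z}{\widetilde Z} - 1 \right\|_{\pi_Z,1}
  = \frac{1}{C_Z} \int_\Theta \frac{e^{-\Phi(\theta)}}{Z(\theta)\,\widetilde Z(\theta)}
      \bigl|Z(\theta) - \widetilde Z(\theta)\bigr|\, \mu(\mathrm d\theta)
  \le \frac{1}{\ell\, C_Z} \left\| \frac{e^{-\Phi}}{Z} \right\|_{\mu,p}
      \bigl\| Z - \widetilde Z \bigr\|_{\mu,p/(p-1)}
  \le \frac{K}{\ell\, C_Z}\, \bigl\| Z - \widetilde Z \bigr\|_{\mu,p/(p-1)} .
```

Here the first equality uses |Z/Z̃ − 1| = |Z − Z̃|/Z̃ together with the density of π_Z w.r.t. µ, and the first inequality uses Z̃ ≥ ℓ.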
If Z takes the role of a normalizing constant, as in Example 1.1 above, an approximation Z̃ of Z by numerical integration is natural. For this purpose Monte Carlo integration, or Monte Carlo estimators of Z respectively, are a common choice. Since Monte Carlo methods yield random approximations Z̃, we also conduct a stability analysis allowing for randomized recovery algorithms of Z, leading to random probability measures π_Z̃. In that randomized scenario we provide estimates of the expected total variation, see Corollary 2.10, and the expected Wasserstein distance, see Theorem 2.12. In particular, the upper estimate on the expected total variation distance depends on an averaged relative second moment difference of Z and Z̃. We apply our randomized stability results to Monte Carlo approximations of Z in two particular examples, including the Gibbs distribution of Example 1.1.
In the following we discuss how our results fit into the literature. The study of stability properties of posterior distributions in Bayesian inference has attracted considerable attention in recent years, see e.g. [3,9,19,20]. In [19] local Lipschitz continuity for bounded likelihood functions θ ↦ exp(−Φ(θ))/Z(θ) has been investigated, and in [9] continuity results of posterior distributions w.r.t. perturbations of the observed data are proven. In contrast to [19], a consequence of our main estimate is local Lipschitz continuity also for possibly unbounded likelihood functions, and in contrast to the continuity study in [9] we focus on quantitative rather than qualitative results. In Bayesian statistics a number of Markov chain approaches have been developed for approximate sampling of π_Z; for a comprehensive review we refer to [17]. One can distinguish two types of approaches: the exact one, see [13,15], where a Markov chain with limit distribution π_Z is constructed, and the inexact one, where, whenever a function evaluation of Z is needed, an approximation of it is used. The inexact setting leads to Markov chains which do not necessarily target π_Z, but another distribution that is (hopefully) close to π_Z. In particular, the noisy Markov chain approach, investigated for example in [1,14,18], falls into this category. In addition, adaptive Markov chain Monte Carlo approaches have been developed, see [2,10]. The inexact setting is closely related to our work, since there, for a given θ ∈ Θ, Monte Carlo approximations of Z(θ) are employed.
The outline of our work is as follows: In the next section we state and prove our stability results. We introduce the total variation and Wasserstein distances and define related quantities which we need for the formulation of our results. Furthermore, for a random function Z̃ we provide estimates on the expected total variation and Wasserstein distance. Finally, we illustrate our bounds in a simple Monte Carlo recovery scenario and in a Gibbs distribution posterior setting. In the latter we use a multiple importance sampling approach.

Stability results
First, we derive bounds on the distance of π_Z and π_Z̃ in the total variation and the Wasserstein distance. After that we state stability results for the more general case where Z̃(θ) is a random variable for any θ ∈ Θ.

Total variation distance
Given two probability measures ν₁, ν₂ on (Θ, B(Θ)) we define the total variation distance of ν₁ and ν₂ by

‖ν₁ − ν₂‖_tv := sup_{|f|_∞ ≤ 1} |∫_Θ f(θ) ν₁(dθ) − ∫_Θ f(θ) ν₂(dθ)|,

where the supremum is taken over measurable f : Θ → ℝ with |f|_∞ := sup_{θ∈Θ} |f(θ)| ≤ 1.

Theorem 2.1. Suppose that Z : Θ → (0, ∞) and Z̃ : Θ → (0, ∞) are measurable functions. Then, for π_Z and π_Z̃ as in (1), we have

‖π_Z − π_Z̃‖_tv ≤ 2 ‖Z/Z̃ − 1‖_{π_Z,1}.  (3)

Proof. For a measurable function f : Θ → ℝ one has

π_Z(f) − π_Z̃(f) = ∫_Θ f(θ) (1 − (C_Z/C_Z̃) · Z(θ)/Z̃(θ)) π_Z(dθ),  (4)

since dπ_Z̃/dπ_Z = (C_Z/C_Z̃) · Z/Z̃. In order to bound the right-hand side, we note

C_Z̃/C_Z = ∫_Θ Z(θ)/Z̃(θ) π_Z(dθ),  (5)

such that, with finite |f|_∞ := sup_{θ∈Θ} |f(θ)|, we obtain

|π_Z(f) − π_Z̃(f)| ≤ |f|_∞ (‖Z/Z̃ − 1‖_{π_Z,1} + |C_Z̃/C_Z − 1|) ≤ 2 |f|_∞ ‖Z/Z̃ − 1‖_{π_Z,1}.  (6)

Taking the supremum over all measurable f with |f|_∞ ≤ 1 proves the claim. □

Let us provide some remarks and consequences.
Remark 2.2. The previous estimate is not sharp in the following sense: Set Z̃ = cZ for some constant c > 0; then the left-hand side of (3) is zero, but the right-hand side in general is not. This deficiency can easily be repaired by using the fact that π_Z̃ = π_{cZ̃} for any constant c > 0. With this fact Theorem 2.1 readily implies that

‖π_Z − π_Z̃‖_tv ≤ 2 inf_{c>0} ‖Z/(cZ̃) − 1‖_{π_Z,1}.  (7)

For the new bound, if Z̃ = cZ (with c > 0), the left- and right-hand sides are both zero. In particular, for the slightly more conservative upper bound using the L²(π_Z)-norm instead of the L¹(π_Z)-norm on the right-hand side of (7), one can derive that

‖π_Z − π_Z̃‖_tv ≤ 2 (1 − ‖Z/Z̃‖²_{π_Z,1} / ‖Z/Z̃‖²_{π_Z,2})^{1/2},

since the infimum over c > 0 can then be computed explicitly.

Remark 2.3. By interchanging the roles of Z and Z̃ in (3) one easily obtains

‖π_Z̃ − π_Z‖_tv ≤ 2 ‖Z̃/Z − 1‖_{π_Z̃,1}.

In the light of [19] one might ask for local Lipschitz continuity of the mapping Z ↦ π_Z.
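The infimum over c in the L²-version of the bound can be computed in closed form. Writing r := Z/Z̃ and substituting u := 1/c, the following short computation (ours) minimizes a quadratic in u:

```latex
\min_{c>0}\Bigl\|\tfrac{r}{c}-1\Bigr\|_{\pi_Z,2}^{2}
  = \min_{u>0}\bigl(u^{2}\,\|r\|_{\pi_Z,2}^{2} - 2u\,\|r\|_{\pi_Z,1} + 1\bigr)
  = 1-\frac{\|r\|_{\pi_Z,1}^{2}}{\|r\|_{\pi_Z,2}^{2}},
\qquad
u^{\star}=\frac{1}{c^{\star}}=\frac{\|r\|_{\pi_Z,1}}{\|r\|_{\pi_Z,2}^{2}},
```

where we used that r > 0, so that ‖r‖_{π_Z,1} = ∫_Θ r dπ_Z. The resulting minimal value is a relative-variance-type quantity of Z/Z̃ under π_Z, which vanishes exactly when Z/Z̃ is π_Z-almost surely constant.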
Note that in Theorem 2.1 and in the previous corollary the norm on the right-hand side is taken w.r.t. π_Z and therefore itself already depends on Z. One might argue that this hides some Z-dependence. Under an additional requirement we can remove this dependence by applying Hölder's inequality. …
The last inequality provides a local Lipschitz continuity w.r.t. the L^{p/(p−1)}(µ)-norm, in contrast to the local Lipschitz continuity w.r.t. the L¹(π_Z)-norm of Corollary 2.4. For p = ∞ it is essentially the estimate of [19, Theorem 8] in our context.

Remark 2.6. In the light of the final estimate in the previous corollary, let us explain in more detail what we mean by local Lipschitz continuity: Define the set

Wasserstein distance
In recent years the Wasserstein distance has become a standard tool in applied probability and statistics, see for example [9,16,18,19]. One advantage of this distance is that it takes topological properties of the metric space (Θ, d) into account, which provides a certain flexibility. For instance, for θ, θ′ ∈ Θ the Wasserstein distance of the Dirac measures δ_θ and δ_θ′ goes to zero as d(θ, θ′) → 0.
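For reference, the (first-order) Wasserstein distance w.r.t. d admits the standard coupling representation and, since (Θ, d) is complete and separable, the Kantorovich–Rubinstein dual representation:

```latex
W(\nu_1,\nu_2)
  := \inf_{\gamma \in \Gamma(\nu_1,\nu_2)}
       \int_{\Theta\times\Theta} d(\theta,\theta')\,\gamma(\mathrm d\theta,\mathrm d\theta')
  = \sup_{\substack{f\colon \Theta\to\mathbb R \\ \operatorname{Lip}(f)\le 1}}
       \left| \int_\Theta f\,\mathrm d\nu_1 - \int_\Theta f\,\mathrm d\nu_2 \right|,
```

where Γ(ν₁, ν₂) denotes the set of all couplings of ν₁ and ν₂, i.e. probability measures on Θ × Θ with marginals ν₁ and ν₂. The dual formulation is the one used in the proofs below, since it reduces Wasserstein estimates to bounds on π_Z(f) − π_Z̃(f) for Lipschitz f.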
Theorem 2.7. Suppose that Z : Θ → (0, ∞) and Z̃ : Θ → (0, ∞) are measurable functions. Then, for π_Z and π_Z̃ as in (1), the Wasserstein distance of π_Z and π_Z̃ is bounded by a constant times ‖Z/Z̃ − 1‖_{π_Z,2}, involving the first-moment quantity |ν|(1) := inf_{θ₀∈Θ} ∫_Θ d(θ, θ₀) ν(dθ). Moreover, if Z and Z̃ are sufficiently close to each other, that is, if condition (9) on ‖Z/Z̃ − 1‖_{π_Z,2} holds, a corresponding bound follows which does not involve |π_Z̃|(1).

Proof. For arbitrary θ₀ ∈ Θ, by (8) it is sufficient to estimate the right-hand side of (4). For this we again use (5) and obtain a bound of the form I₁ + I₂. By (6) and taking the infimum over θ₀ ∈ Θ we obtain I₁ ≤ ‖Z/Z̃ − 1‖_{π_Z,1} |π_Z̃|(1), and by additionally using the Cauchy–Schwarz inequality we bound I₂, which proves the first assertion. The second statement follows easily from the first by rearranging the terms accordingly. □
Remark 2.8. Having an approximation Z̃ of Z in mind, (9) tells us that "asymptotically", that is, with sufficient accuracy of the recovery algorithm Z̃, no explicit bound on |π_Z̃|(1) is required.
As in the total variation distance consideration, by a boundedness assumption and Hölder's inequality we can replace the π_Z-dependence on the right-hand side by an explicit Z-dependence.
In this context, Theorem 2.1 combined with a Fubini argument and the Cauchy–Schwarz inequality leads to the following consequence.

Corollary 2.10. Suppose Z : Θ → (0, ∞) is measurable and let Z̃ : Ω × Θ → (0, ∞) be a random function. Then, for π_Z and π_Z̃(ω), ω ∈ Ω, given according to (1), we have

E‖π_Z − π_Z̃(·)‖_tv ≤ 2 ∫_Θ E|Z(θ)/Z̃(·, θ) − 1| π_Z(dθ) ≤ 2 ∫_Θ (E|Z̃(·, θ)/Z(θ) − 1|²)^{1/2} (E|Z(θ)/Z̃(·, θ)|²)^{1/2} π_Z(dθ).  (11)

Remark 2.11. The final estimate (11) follows immediately from the Cauchy–Schwarz inequality. For a Monte Carlo recovery approximation Z̃(·, θ) of Z it is usually convenient to bound the right-hand side of the second inequality in Corollary 2.10. The reason is that the term E|Z̃(·, θ)/Z(θ) − 1|² relates readily to the relative mean squared error of the method, whereas the "reversed" relative mean absolute error E|Z(θ)/Z̃(·, θ) − 1| is harder to bound directly.
Using similar arguments as in the proof of Theorem 2.7 we obtain the following result w.r.t. the expected Wasserstein distance.
Theorem 2.12. (i) Assume that for P-almost any ω ∈ Ω we have |π_Z̃(ω)|(1) ≤ R for some R < ∞, and that |π_Z|(1) < ∞. Then an expected Wasserstein estimate holds for any θ₀ ∈ Θ; in particular, (12) follows.
(ii) Assume that for P-almost any ω ∈ Ω the functions Z and Z̃(ω, ·) are sufficiently close in the sense of Theorem 2.7. Then a corresponding bound holds which does not involve R.

Proof. First we verify statement (i). From (10) we obtain, for arbitrary θ₀ ∈ Θ, the estimate which gives (12). Further estimation (using the Cauchy–Schwarz inequality several times) leads to the second assertion of (i). Now we turn to the proof of statement (ii). By (9) it is sufficient to estimate E‖Z/Z̃(·) − 1‖_{π_Z,2}. With Jensen's inequality as well as a change of the order of integration we obtain the claim, which concludes the proof. □
Remark 2.13. The condition in the first statement of Theorem 2.12, that |π_Z̃(ω)|(1) ≤ R for some R < ∞ and almost any ω ∈ Ω, may seem restrictive. We would like to comment on that:
1. If (Θ, d) is a bounded metric space, then |π_Z̃(ω)|(1) ≤ R holds trivially with R being the upper bound for d. Further note that by d_R(θ₁, θ₂) := min{R, d(θ₁, θ₂)} we can turn any metric space (Θ, d) into a topologically equivalent bounded metric space (Θ, d_R).
3. In addition, the second statement of Theorem 2.12 tells us, within the setting of a Monte Carlo approximation Z̃, that "asymptotically", i.e. if at some accuracy level Z̃ is sufficiently close to Z with probability one, no explicit upper bound on |π_Z̃(ω)|(1) is required.
Remark 2.14. Related to results in this subsection are the recent publications [11,12]. There, the authors studied well-posedness properties of Bayesian inverse problems with random likelihoods. In particular, they bounded the expected squared Hellinger distance between a random approximate posterior distribution and a desired one.

Monte Carlo recovery -illustrative examples
Given the probability measure π_Z on (Θ, B(Θ)) one might ask how to obtain a Z̃ which is "close" to Z in the sense of our stability results. Thus, we are looking for an approximation Z̃ of Z. Observe that, for an arbitrary, possibly even unknown, constant c > 0 it is actually sufficient to approximate the function c·Z, since π_{cZ} = π_Z. However, the crucial difficulty lies in the fact that we do not have access to function evaluations of c·Z. Nevertheless, there exist scenarios where Monte Carlo estimators can successfully be applied, see [1,2,4,18]. This motivates the consideration of Monte Carlo recovery methods for Z. A random function Z̃_N, as defined in Section 2.3, is a Monte Carlo approximation of Z with information parameter N if Z̃_N uses at most N pieces of available information, e.g. function evaluations or samples w.r.t. a certain distribution. We consider two illustrative scenarios for which we provide explicit Monte Carlo approximations of Z.

Simple Monte Carlo recovery
We consider the same framework as in [14, Section 4.1]. Let (G, G) be a measurable space and ̺ : G × Θ → [0, ∞) be a measurable function such that

Z(θ) = ∫_G ̺(x, θ) ν_θ(dx),  θ ∈ Θ,

where (ν_θ)_{θ∈Θ} is a family of probability distributions on G. With another measurable function Φ : Θ → (−∞, ∞) this defines the distribution of interest π_Z as given in (1). We assume that for any θ ∈ Θ we can sample w.r.t. ν_θ. Then

Z̃_N(θ) := (1/N) Σ_{i=1}^N ̺(X_i, θ),

with X_1, …, X_N being an iid sample w.r.t. ν_θ, provides a Monte Carlo approximation of Z. Define Q_N(θ) := Z̃_N(θ)/Z(θ) and note that E Q_N(θ) = 1 and E|Q_N(θ) − 1|² = N^{-1} E|Q_1(θ) − 1|², as well as, by [14, Lemma 23], for the second inverse moment of Q_N(θ), that (E Q_N(θ)^{-2})^{1/2} ≤ (E Q_1(θ)^{-2})^{1/2}. Hence by Corollary 2.10 we obtain

E‖π_Z − π_Z̃_N(·)‖_tv ≤ (2/√N) ∫_Θ (E|Q_1(θ) − 1|²)^{1/2} (E Q_1(θ)^{-2})^{1/2} π_Z(dθ).  (13)

To further estimate the former inequality we impose the following regularity assumption throughout the rest of this section.
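As a concrete toy instance of this recovery scheme (our illustration, not from the paper): take G = ℝ, ν_θ = N(0, 1) for every θ, and ̺(x, θ) = exp(θx), so that Z(θ) = E[exp(θX)] = exp(θ²/2) is known in closed form and the relative error of Z̃_N can be checked directly:

```python
import math
import random

# Toy instance (assumed for illustration): G = R, nu_theta = N(0,1),
# rho(x, theta) = exp(theta * x), hence Z(theta) = exp(theta^2 / 2).

def Z_exact(theta):
    return math.exp(theta ** 2 / 2.0)

def Z_mc(theta, N, rng):
    # \tilde Z_N(theta) = (1/N) * sum_i rho(X_i, theta), X_i iid ~ nu_theta
    return sum(math.exp(theta * rng.gauss(0.0, 1.0)) for _ in range(N)) / N

rng = random.Random(1)
theta = 0.5
for N in (100, 10_000):
    est = Z_mc(theta, N, rng)
    print(N, abs(est / Z_exact(theta) - 1.0))  # relative error, ~ N^{-1/2} on average
```

The observed relative errors behave like the classical Monte Carlo rate N^{-1/2}, in line with the right-hand side of (13).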
Furthermore, setting R as in (14) and assuming that it is finite easily gives |π_Z̃_N(ω,·)|(1) ≤ R for any ω ∈ Ω, see also Remark 2.13. Then, by Theorem 2.12(ii) we obtain the estimate (15). Summarizing, if the integrals on the right-hand sides of (13), (14) and (15) are finite, then the difference of π_Z and π_Z̃_N(·), measured either in the total variation or the Wasserstein distance, decreases with the classical Monte Carlo rate of convergence.

Gibbs distribution
Let G be a finite set and suppose that there is a measurable function h : G × Θ → ℝ such that

̺(x | θ) := exp(−h(x, θ)) / Z(θ),  x ∈ G,

is a probability mass function on G, where Z(θ) := Σ_{x∈G} exp(−h(x, θ)) denotes the normalizing constant of exp(−h(·, θ)) for given θ ∈ Θ. In that setting, for a given x_obs ∈ G we have Φ(θ) = h(x_obs, θ). Note that this framework contains Example 1.1 as a special case. We impose the following condition.
Assumption 3.2. We assume that we can sample on G w.r.t. the distribution determined by ̺(· | θ) for any θ ∈ Θ.
There are usually two arguments why this assumption is not too restrictive: First, one might rely on perfect sampling; second, one can, at least approximately, sample from such distributions by using Markov chains.

Remark 3.3. In the scenario of Example 1.1, for a given θ ∈ Θ (or rather β ∈ (0, ∞)), a Monte Carlo product estimator for Z(θ) (or rather Z(β)) has been analyzed in a series of papers, see [6] and the references therein. As an alternative, multiple importance sampling approaches have also been applied for the approximation of Z(θ), see [2,4,10].
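A minimal self-contained sketch (ours, with an assumed four-spin chain Hamiltonian and arbitrary parameter values; not the multiple importance sampling estimator of [2,4,10]) of the basic single-reference-point idea: since π_{cZ} = π_Z, it suffices to estimate the ratio Z(θ)/Z(θ₀) by importance sampling with samples from ̺(· | θ₀):

```python
import itertools
import math
import random

# Assumed toy model: four-spin chain, h(x, theta) = theta * H(x).
states = list(itertools.product([-1, 1], repeat=4))  # G = {-1,1}^4

def H(x):
    # chain Hamiltonian with three nearest-neighbour bonds
    return -sum(x[i] * x[i + 1] for i in range(3))

def h(x, theta):
    return theta * H(x)

def Z(theta):
    # exact normalizing constant by enumeration (feasible only for tiny G)
    return sum(math.exp(-h(x, theta)) for x in states)

theta0, theta = 0.5, 0.7
rng = random.Random(0)
probs = [math.exp(-h(x, theta0)) / Z(theta0) for x in states]
samples = rng.choices(states, weights=probs, k=20_000)  # Assumption 3.2: sampling from rho(.|theta0)

# (1/N) * sum_i exp(-h(X_i, theta) + h(X_i, theta0))  ->  Z(theta)/Z(theta0)
ratio_est = sum(math.exp(-h(x, theta) + h(x, theta0)) for x in samples) / len(samples)
rel_err = abs(ratio_est / (Z(theta) / Z(theta0)) - 1.0)
print(rel_err)
```

The estimator targets c·Z(θ) with the unknown constant c = 1/Z(θ₀), which, as noted in Section 3, is sufficient for approximating π_Z; its accuracy degrades when θ is far from the reference point θ₀, which is what multiple importance sampling is designed to mitigate.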
To elaborate on this we impose the following assumption.

Conclusion
We conducted a stability analysis of doubly-intractable distributions. In particular, given two functions Z, Z̃ : Θ → (0, ∞), we derived estimates of the total variation and Wasserstein distance of π_Z and π_Z̃. Essentially, it turns out that if a relative difference between Z and Z̃, measured in a certain L^p-sense, is small, then π_Z and π_Z̃ are also close to each other. We also considered a randomization of Z̃, that is, for any θ ∈ Θ we have a random variable Z̃(θ). In this context we provided estimates on the expected total variation and Wasserstein distance of π_Z to the random measure π_Z̃. In addition, we illustrated our bounds in two simple Monte Carlo recovery settings. Finally, let us comment on further aspects: In the stability analysis we focused on the total variation and Wasserstein distance, but of course other quantities for measuring the difference of distributions, such as the Hellinger distance or the Kullback–Leibler divergence, are also reasonable to investigate. Furthermore, in Section 2.3 we only considered the expected difference of distributions. In the light of [6] and also [7,8], statements of the type "small error with high probability" are desirable. In particular, an investigation concerning the approximation of functions by Monte Carlo recovery algorithms seems to be a challenging and very interesting task.