Brittleness of Bayesian Inference Under Finite Information in a Continuous World

We derive, in the classical framework of Bayesian sensitivity analysis, optimal lower and upper bounds on posterior values obtained from Bayesian models that exactly capture an arbitrarily large number of finite-dimensional marginals of the data-generating distribution and/or that are as close as desired to the data-generating distribution in the Prokhorov or total variation metrics; these bounds show that such models may still make the largest possible prediction error after conditioning on an arbitrarily large number of sample data measured at finite precision. These results are obtained through the development of a reduction calculus for optimization problems over measures on spaces of measures. We use this calculus to investigate the mechanisms that generate brittleness/robustness and, in particular, we observe that learning and robustness are antagonistic properties. It is now well understood that the numerical resolution of PDEs requires the satisfaction of specific stability conditions. Is there a missing stability condition for using Bayesian inference in a continuous world under finite information?


Introduction
With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question.
While it is known that Bayesian methods are robust and consistent when the number of possible outcomes is finite, the exploration of Bayesian inference in a continuous world has revealed both positive [19,30,38,67,69,92,96] and negative results [12,13,35,47,48,61,71]. One contribution of this paper is the development of a calculus for the elucidation of the mechanisms generating robustness or brittleness in Bayesian inference. In particular, this paper:
1. shows that the process of Bayesian conditioning on data at fine enough resolution is sensitive (as defined in [94], modulo a small technicality) with respect to the underlying distributions, under the total variation and Prokhorov metrics; and
2. raises the question of a missing stability condition for using Bayesian inference in a continuous world under finite information, somewhat akin to the CFL condition for the stability of a discrete numerical scheme used to approximate a continuous PDE.
Point (1) is the source of negative results similar to those caused by tail properties in statistics [8,37], and can be seen as an extreme occurrence of the dilation phenomenon from robust Bayesian inference [103]. Let us now illustrate the main question explored in this paper with a simple example of Bayesian reasoning in action.
Problem 1. There is a bag containing 102 coins, one of which always lands on heads, while the other 101 are perfectly fair. One coin is picked uniformly at random from the bag, flipped 10 times, and 10 heads are obtained. What is the probability that this coin is the unfair coin?
The correct probability is given by applying Bayes' theorem:

P[A|B] = P[B|A] P[A] / P[B] = 1/(1 + 101 × 2^{-10}) ≈ 0.91,

where A is the event "the coin is the unfair coin" and B is the event "10 heads are observed". If the number of coins is not known exactly and the supposedly fair coins are not exactly fair, then Bayes' theorem can still be used to produce a robust Bayesian inference in the following sense: if the fair coins are slightly unbalanced, with probability 0.51 of landing on tails, and one uses an estimate of 100 coins and an estimate of 1/2 for the head probability of the fair coins, then the resulting estimate 1/(1 + 99 × 2^{-10}) is still a good approximation of the correct answer.
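This computation, and the robustness of its perturbed variant, can be checked directly in code (a minimal sketch; the function name is invented for this example):

```python
# Bayes' theorem for the coin-bag problem, plus the perturbed variant from the text.
def posterior_unfair(n_coins, n_heads, fair_head_prob=0.5):
    """P[A | B]: A = 'the picked coin is the unfair one', B = 'n_heads heads observed'."""
    p_unfair = 1.0 / n_coins
    like_unfair = 1.0                      # the unfair coin always lands on heads
    like_fair = fair_head_prob ** n_heads
    return like_unfair * p_unfair / (like_unfair * p_unfair + like_fair * (1 - p_unfair))

exact = posterior_unfair(102, 10)          # 1/(1 + 101 * 2**-10) ~ 0.910
approx = posterior_unfair(100, 10, 0.49)   # misspecified count, tail probability 0.51: ~ 0.927
print(exact, approx)
```

The misspecified answer differs from the exact one by under 0.02, illustrating the robustness claimed in the finite setting.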
Does this robustness hold when the underlying probability space is continuous or an approximation thereof? For example, what if the random outcomes are decimal numbers -perhaps given to finite precision -rather than heads or tails?

The General Question
To investigate these questions in a general context let us now consider the situation in which the space X where observations/samples take their values is no longer {Head, Tail} but an arbitrary Polish space (with the real line R as a prototypical example). Write M(X ) for the set of probability measures on X and let Φ : M(X ) → R be a function defining a quantity of interest. When X is the real line R, a prototypical example is Φ(µ) := µ[X ≥ a], the probability that the random variable X distributed according to µ exceeds the threshold value a; another typical example is Φ(µ) := E_µ[X], the mean of X.
Problem 2. Let the data-generating distribution µ † ∈ M(X ) be an unknown or partially known probability measure on X . The objective is to estimate Φ(µ † ) from the observation of n i.i.d. samples from µ † , which we denote by d = (d 1 , . . . , d n ) ∈ X n .
For practical reasons (and to avoid problems associated with conditioning with respect to events of measure zero) we will assume that the data is observed up to resolution/precision δ > 0, i.e. what we actually observe in Problem 2 is the event d ∈ B_δ^n, where B_δ^n := ∏_{i=1}^n B_δ(x_i), (x_1, . . . , x_n) is a fixed point of X^n, and B_δ(x) is the open ball of radius δ and center x (defined with respect to a consistent metric on the Polish space X ). Now observe that the Bayesian answer to Problem 2 is to assume that µ† is the realization of some random measure µ on M(X ). This is done by choosing a model class A ⊆ M(X ) and a probability measure π ∈ M(A), which we call the prior. This prior determines the randomness with which a representative µ ∈ A is selected and, for each such µ ∈ A, the generation of n i.i.d. samples d ∈ X^n by randomly sampling from µ^n naturally determines a product measure on A × X^n. In analogy to Problem 1, A plays the role of the bag of coins (measures) and each measure µ ∈ A plays the role of a coin. Now the prior estimate of the quantity of interest is E_{µ∼π}[Φ(µ)] and the posterior estimate is defined as the conditional expectation with respect to this product measure,

E_{µ∼π}[Φ(µ) | d ∈ B_δ^n]. (1.2)

One response to the concern that the choice of prior π is somewhat arbitrary is to explore classes of priors. Indeed: "Most statisticians would acknowledge that an analysis is not complete unless the sensitivity of the conclusions to the assumptions is investigated. Yet, in practice, such sensitivity analyses are rarely used. This is because sensitivity analyses involve difficult computations that must often be tailored to the specific problem. This is especially true in Bayesian inference where the computations are already quite difficult." [102]
In this paper we will investigate this approach, known as robust Bayesian inference [15,16,25,104] or Bayesian sensitivity analysis, and examine the robustness of Bayesian inference by computing optimal bounds on prior and posterior values in terms of given sets of priors. To do so, we need some definitions.
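The conditioning step described above can be sketched numerically. The following is a hypothetical illustration (the Beta model class, the prior on its parameters, and the data are all invented for this example): the prior π draws a measure µ from a parametric family A of measures on [0, 1], and the posterior value of Φ(µ) = E_µ[X] given d ∈ B_δ^n is approximated by weighting each draw by the likelihood of the observed event, ∏_i µ[B_δ(x_i)].

```python
import numpy as np
from scipy.stats import beta

# Hypothetical sketch: prior pi draws mu = Beta(a, b) with (a, b) uniform on
# [0.5, 5]^2; condition on data observed at resolution delta by importance
# weighting with the ball probabilities mu[B_delta(x_i)].
rng = np.random.default_rng(0)
n_prior = 50_000
a_s = rng.uniform(0.5, 5.0, n_prior)
b_s = rng.uniform(0.5, 5.0, n_prior)

data = np.array([0.62, 0.71, 0.78])   # samples observed up to resolution delta
delta = 0.01

w = np.ones(n_prior)
for x in data:
    w *= beta.cdf(x + delta, a_s, b_s) - beta.cdf(x - delta, a_s, b_s)

phi = a_s / (a_s + b_s)               # Phi(mu) = E_mu[X] for mu = Beta(a, b)
prior_value = phi.mean()              # E_{mu~pi}[Phi(mu)], ~ 1/2 by symmetry
posterior_value = np.sum(w * phi) / np.sum(w)
print(prior_value, posterior_value)   # conditioning pulls the value towards the data
```

Since the data sit near 0.7, the posterior value moves above the prior value of about 1/2, as expected.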

Example of Brittleness Under Finite Information
As illustrated in Problem 1, it is already known from classical Bayesian sensitivity analysis that posterior values are robust if the random outcomes live in a finite space (i.e. X is finite) or if the class of priors Π is finite-dimensional (i.e. if what one does not know can be represented by a finite number of known parameters). One purpose of this paper is to investigate what the very same classical Bayesian sensitivity analysis framework would conclude in the presence of finite information (i.e. if, for instance, Π is finite-codimensional). To understand this question, let us consider the following example.
Example 1.2. Our purpose is to estimate the mean Φ(µ†) := E_{µ†}[X] of some random variable X with respect to some unknown distribution µ† on the interval [0, 1], based on the observation of n i.i.d. samples d := (d_1, . . . , d_n) from µ†, given to finite resolution δ (i.e. we observe d ∈ B_δ^n, where B_δ^n is the product of n open balls of radius δ).
The Bayesian answer to that problem is to assume that µ† is the realization of some random measure distributed according to some prior π (i.e. µ ∼ π) and then compute the posterior value of the mean by conditioning on the data, i.e. compute (1.2) with Φ(µ) := E_µ[X]. Observe that to specify the prior π we need to specify the distribution of all the moments of µ (i.e. the distribution of the infinite-dimensional vector (E_µ[X], E_µ[X^2], E_µ[X^3], . . .)).
It is known, from classical robust Bayesian inference, that the posterior value (1.2) is robust with respect to finite-dimensional perturbations of the particular choice of the prior π. However, rather than specifying a finite-dimensional class of priors Π (i.e. assuming infinite information), it appears epistemologically more reasonable to specify a finite-codimensional Π (i.e. assume finite information), and a natural way to do so is to specify the distribution Q of only a large, but finite, number of moments of µ (i.e. to specify the distribution of (E_µ[X], E_µ[X^2], . . . , E_µ[X^k]), where k ∈ N can be arbitrarily large). This defines a class of priors Π on M([0, 1]) such that if π ∈ Π and µ ∼ π then (E_µ[X], E_µ[X^2], . . . , E_µ[X^k]) is distributed according to Q. More precisely, writing Ψ for the function mapping each measure µ on [0, 1] to its first k moments, Π is the class of priors whose push-forward under Ψ is Q. One consequence of one of the main results of this paper, Theorem 4.13, is that no matter how large k is, no matter how large the number of samples n is, for any Q that has a density with respect to the uniform distribution on the first k moments, if you observe the data at a fine enough resolution, then the minimum and maximum of the posterior value of the mean over the class of priors Π are 0 and 1; i.e. the following proposition holds.
Proposition 1.3. Under the assumptions above, the infimum and supremum of the posterior value (1.2) of the mean over the class of priors Π converge to 0 and 1, respectively, as the resolution δ goes to 0.
This example of brittleness is derived from Theorem 4.13 (see Example 4.16), the proof of which sheds light on the mechanism leading to brittleness in a general context and shows that the pathology illustrated by Proposition 1.3 is general and inherent to using Bayesian inference in continuous spaces (or their discretizations) under finite information. Furthermore, although this simple example concerns the posterior mean, the quantity of interest in Theorem 4.13 is arbitrary and the brittleness results apply to the whole posterior distribution.
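The mechanism behind this brittleness can be sketched with invented numbers (this is an illustration of the idea, not the paper's construction): a prior satisfying a moment-type constraint can hide a measure with an extreme quantity of interest and a thin spike of mass on the observed data; as the resolution δ goes to 0, conditioning transfers essentially all posterior weight to that measure.

```python
# Illustrative numbers (not the paper's construction): a two-point prior
# pi = (1 - t) * delta_{mu_benign} + t * delta_{mu_adv} obeying the constraint
# E_{mu~pi}[E_mu[X]] = 0.5, where mu_adv puts mass eps in a thin spike at the
# data point x1 (spike width << delta) and the rest of its mass near 1.
delta = 1e-6                  # resolution at which d1 = x1 is observed
x1 = 0.3
t, eps = 0.01, 0.01           # prior weight of mu_adv; spike mass

mean_adv = (1 - eps) * 1.0 + eps * x1          # E_{mu_adv}[X], close to 1
mean_benign = (0.5 - t * mean_adv) / (1 - t)   # chosen so the moment constraint holds
dens_benign_at_x1 = 1 + 12 * (mean_benign - 0.5) * (x1 - 0.5)  # density 1 + c(x - 1/2)

prior_value = (1 - t) * mean_benign + t * mean_adv             # = 0.5 by construction

# Likelihood of {d1 in B_delta(x1)}: ~ 2*delta*density for mu_benign, but the
# entire spike mass eps for mu_adv, so the ratio blows up as delta -> 0.
w_benign = (1 - t) * 2 * delta * dens_benign_at_x1
w_adv = t * eps
posterior_value = (w_benign * mean_benign + w_adv * mean_adv) / (w_benign + w_adv)
print(prior_value, posterior_value)    # 0.5 versus ~0.98
```

Shrinking δ further pushes the posterior value as close to 1 as desired, while the prior constraint stays satisfied.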

Example of Brittleness Under Infinitesimal Model Perturbations
Theorem 4.13 (and its corollary, Theorem 6.1), which leads to brittleness under finite information as illustrated in the previous example, also leads to brittleness under infinitesimal model perturbations in the total variation and Prokhorov metrics. We will now illustrate one mechanism causing brittleness with a simple example.
In this example we are interested in estimating Φ(µ†) = E_{µ†}[X], where µ† is an unknown distribution on the unit interval (X = [0, 1]), based on the observation of a single data point d_1 = 0.5 up to resolution δ (i.e. we observe d_1 ∈ B_δ(x_1) with x_1 = 0.5).
Consider the following two Bayesian models (measures) µ_a(θ) and µ_b(θ) on the unit interval [0, 1], parametrized by θ ∈ (0, 1), with densities f_a and f_b, where Z is a normalization constant (close to one) chosen so that ∫_{[0,1]} f_b(x, θ) dx = 1. See Figure 1.1 for an illustration of these densities.
Observe that the density of model b is that of model a apart from the small gap of width δ_c > 0 created around the data point for model b (if θ < 0.999, see Figure 1.1); since the data point is fixed at x_1 = 1/2, the total variation distance d_TV(µ_a(θ), µ_b(θ)) between the two models is, uniformly over θ ∈ (0, 1), a constant times δ_c. Assuming that the prior distribution on θ is the uniform distribution on (0, 1), observe that the prior value of the quantity of interest E_µ[X] under both models (a and b) is approximately 1/2. Now, when θ is close to one (respectively zero), the density of model a puts most of its mass towards one (respectively zero). Observe also that the density of model b behaves in a similar way, with the important exception that the probability of observing the data under model b is infinitesimally small for θ < 0.999. Therefore, for δ < δ_c, the posterior value of the quantity of interest E_µ[X] under model a is 1/2, whereas it is close to one under model b. Observe also that a perturbed model c analogous to b would lead to a posterior value close to zero.
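A concrete (hypothetical) instance of this construction can be computed numerically. The density f_a below is invented for illustration (the paper's f_a is only shown in Figure 1.1) and is far less extreme than the paper's, so the posterior under model b jumps towards 2/3 rather than towards 1; the qualitative behavior, a flat likelihood under model a and posterior mass concentrated on θ ≥ 0.999 under model b, is the same.

```python
import numpy as np

# Hypothetical density (invented for illustration): f_a(x, theta) =
# 2*theta*x + 2*(1 - theta)*(1 - x) pushes mass towards 1 as theta -> 1 and
# towards 0 as theta -> 0. Model b is model a with a gap of width delta_c
# around the data point x1 = 0.5 whenever theta < 0.999.
x1, delta, delta_c = 0.5, 1e-4, 1e-2           # data point, resolution, gap width
thetas = np.linspace(1e-6, 1 - 1e-6, 200_001)  # uniform prior on theta in (0, 1)

f_a_at_x1 = 2 * thetas * x1 + 2 * (1 - thetas) * (1 - x1)  # equals 1 for all theta
mean_mu = 1 / 3 + thetas / 3                   # E_{mu_a(theta)}[X], computed by hand

# Likelihood of {d1 in B_delta(x1)} ~ 2*delta*f(x1, theta); the gap (which
# contains B_delta(x1) since delta < delta_c / 2) kills it under model b for
# theta < 0.999. The renormalization Z ~ 1 is ignored: those theta get weight 0.
lik_a = 2 * delta * f_a_at_x1
lik_b = np.where(thetas < 0.999, 0.0, lik_a)

post_a = np.sum(mean_mu * lik_a) / np.sum(lik_a)   # = prior value = 1/2
post_b = np.sum(mean_mu * lik_b) / np.sum(lik_b)   # ~ 2/3, despite d_TV ~ delta_c
print(post_a, post_b)
```

Here the two models are within total variation distance of order δ_c = 0.01 of each other, yet their posterior values differ macroscopically.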
This simple example of brittleness under infinitesimal model perturbations is derived from the proof of Theorem 6.4, which shows that Bayesian posterior values are generally brittle under infinitesimal perturbations of Bayesian models in TV and in Prokhorov metrics.
The perturbed model µ_b(θ) is also a simple example of what worst priors can look like after a classical Bayesian sensitivity analysis over a class of priors specified via constraints on the TV or Prokhorov distance or on the distribution of a finite number of moments.
Can we dismiss these worst priors because they depend on the data? The problem with this argument is that, in the context of Bayesian sensitivity analysis, worst priors always depend on (or are pre-adapted to) the data. Therefore the same argument would lead to a dismissal of Bayesian sensitivity analysis and therefore of the robust Bayesian framework. Can we dismiss these worst priors because they depend too much on the data? The problem with this argument is that it is not a transparent task to define "too much" without introducing the following element of circular reasoning: the degree of pre-adaptation determines the degree of brittleness, the framework is dismissed when the degree of pre-adaptation is "too much", and therefore the method cannot be brittle.
Can we dismiss these worst priors because they can "look nasty" and make the probability of observing the data very small? The problem with this argument is that these worst priors are not "isolated pathologies" but directions of instability, and their number increases with the number of data points. We will illustrate this point with another simple example by placing a uniform constraint on the probability of observing the data in the model class. We already know that if the data is equally likely under all measures in the model class then posterior values are robust but learning is not possible (prior and posterior values are equal). The following example will show that although variations in the probability of the data in the model class make learning possible, they also lead to brittleness.

Example of Learning vs Robustness
In this example we are interested in estimating Φ(µ†) = µ†[a, 1] for some a ∈ (0, 1), where µ† is an unknown distribution on the unit interval (X = [0, 1]), based on the observation of n data points d_1, . . . , d_n up to resolution δ (i.e. we observe d ∈ B_δ^n). Our purpose is to examine the sensitivity of the Bayesian answer to this problem with respect to the choice of a particular prior. Consider the model class A := M([0, 1]) (1.3) and the class of priors Π := {π ∈ M(A) : E_{µ∼π}[E_µ[X]] = m}. Observe that Π corresponds to the assumption that µ† is the realization of a random measure on [0, 1] whose mean is on average m. As in the previous example, the finite-codimensional class of priors Π leads to brittleness in the sense that, although the least upper bound on prior values is

U(Π) = sup_{π∈Π} E_{µ∼π}[µ[X ≥ a]] = m/a, (1.4)

the least upper bound on posterior values is 1. Can this brittleness be avoided by adding a uniform constraint on the probability of observing the data in the model class? To investigate this question let us introduce α ≥ 1 and a probability measure µ_0 on [0, 1] with strictly positive Lebesgue density (a prototypical example being µ_0 the uniform measure on [0, 1]), and consider the (new) model class

A(α) := {µ ∈ M([0, 1]) : α^{-1} µ_0^n[B_δ^n] ≤ µ^n[B_δ^n] ≤ α µ_0^n[B_δ^n]} (1.6)

and the (new) class of priors Π(α) := {π ∈ M(A(α)) : E_{µ∼π}[E_µ[X]] = m} (1.7). Note that, for the model class A(α), the probability of observing the data is uniformly bounded below by α^{-1} µ_0^n[B_δ^n] and above by α µ_0^n[B_δ^n]. Therefore, for α = 1, the probability of observing the data is uniform in the model class, prior values are equal to posterior values, and the method is robust but learning is impossible. If α slightly deviates from 1, then the calculus developed in this paper allows us to compute the least upper bound on posterior values and obtain that

lim_{δ→0} U(Π(α)|B_δ^n) = α^2 m / (α^2 m + a − m). (1.8)

We refer to Example 4.10 for the derivation of (1.8) from Theorem 4.8. Note that the right hand side of (1.8) is equal to m/a for α = 1 (when the probability of the data is constant on the model class) and quickly converges towards 1 as α increases.
As a numerical application, observe that for a = 3/4 and m = a/2 = 3/8 we have lim_{δ→0} U(Π(α)) = 1/2 and lim_{δ→0} U(Π(α)|B_δ^n) = α^2/(α^2 + 1). Therefore, for α = 2, we have (irrespective of the number of data points) lim_{δ→0} U(Π(2)|B_δ^n) = 0.8, and for α = 10, we have (irrespective of the number of data points) lim_{δ→0} U(Π(10)|B_δ^n) = 100/101 ≈ 0.99. Moreover, if α is derived by assuming the probability of each data point to be known up to some tolerance γ, i.e. if the model class A(α) is replaced by the set of measures µ ∈ M([0, 1]) such that γ^{-1} µ_0[B_δ(x_i)] ≤ µ[B_δ(x_i)] ≤ γ µ_0[B_δ(x_i)] for i = 1, . . . , n and some γ > 1, then α = γ^n and the least upper bound on posterior values becomes γ^{2n} m/(γ^{2n} m + a − m), which exponentially converges towards 1 as the number n of data points goes to infinity.
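A hedged numerical check: taking the reduced bound (1.8) to have the form α² m / (α² m + a − m) (an assumed form, chosen to be consistent with the values quoted in the text), its behavior as α grows can be tabulated directly.

```python
# Assumed form of the bound (1.8), consistent with the values quoted in the
# text: equals m/a at alpha = 1 and tends to 1 as alpha grows.
a, m = 3 / 4, 3 / 8

def posterior_upper_bound(alpha):
    return alpha**2 * m / (alpha**2 * m + a - m)

for alpha in (1, 2, 10):
    print(alpha, posterior_upper_bound(alpha))
# alpha = 1 gives 0.5 (= m/a, no learning, robust); alpha = 2 gives 0.8;
# alpha = 10 gives ~0.990
```

With these numbers the bound simplifies to α²/(α² + 1), reproducing 0.5, 0.8, and 100/101.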
In conclusion, the effects of a uniform constraint on the probability of the data under finite information in the model class show that learning ability comes at the price of loss in stability in the following sense: when α = 1, the data is equiprobable under all measures in the model class, posterior values are equal to prior values, the method is robust but learning is not possible. As α deviates from one, the learning ability increases as robustness decreases, and when α is large, learning is possible but the method is brittle.

Missing Stability Condition for Using Bayesian Inference Under Finite Information
The previous examples have shown that Bayesian inference can be unstable under finite information; therefore, at the very least, the question of the existence and of the nature of a stability condition for using Bayesian inference remains to be answered. Indeed, it is well known that numerical solutions of PDEs can become unstable if specific stability conditions, such as the CFL condition, are not satisfied. Although numerical schemes that do not satisfy the CFL condition may look grossly inadequate, the existence of such perverse examples does not imply that a stability condition is unnecessary. Similarly, although one may, as in Subsection 1.3, exhibit grossly perverse worst priors, the existence of such priors does not invalidate the question of the missing stability condition for using Bayesian inference under finite information. The example provided in Subsection 1.4 suggests that, in the framework of Bayesian sensitivity analysis, (i) such a stability condition would depend on how well the probability of the data is known or constrained in the model class, and (ii) learning and robustness are antagonistic/conflicting requirements: there is no free lunch, and increased learning potential is paid for by decreased stability of posterior values.
Could this stability condition be derived from closeness in Kullback-Leibler divergence? The problem with this approach is that closeness in Kullback-Leibler divergence cannot be tested with discrete data, and it requires the non-singularity of the data-generating distribution with respect to the model, which could be a strong assumption for certifying the safety of a critical system. Indeed, when performing Bayesian analysis on function spaces, as is now increasingly popular for studying PDE solutions, results like the Feldman-Hájek theorem [45,56] tell us that most pairs of measures are mutually singular, and hence at Kullback-Leibler distance infinity from one another. Another problem with using Kullback-Leibler divergence is that a local sensitivity analysis (in the sense of Fréchet derivatives) of posterior values suggests infinite sensitivity as the number of data points goes to infinity [54] (and this result is valid for the broader class of divergences that includes the Hellinger distance).
A close inspection of some of the cases where Bayesian inference has been successful shows the existence of a non-Bayesian feedback loop on the evaluation of its performance [75,77,89]. Therefore one natural question is whether the missing stability condition could be derived by exiting the strict framework of Bayesian analysis/inference. According to Efron [43], without genuine prior information "Bayesian calculations cannot be uncritically accepted and should be checked by other methods, which usually means frequentistically."

Calculus for Measures over Measures
The results of this paper are derived from a calculus allowing us to solve/reduce optimization problems whose variables correspond to measures over measures over arbitrary Polish spaces. The following assertion of Theorem 3.11 is an example of this calculus:

sup_{π ∈ Ψ^{-1}Q} E_{µ∼π}[Φ(µ)] = sup_{Q ∈ Q} E_{q∼Q}[sup_{µ ∈ Ψ^{-1}(q)} Φ(µ)]. (1.10)
In (1.10), Ψ is a measurable function mapping A (a Suslin subset of the set M(X ) of probability measures on a Polish space X ) into a separable metrizable space Q, Q is a subset of M(Q), and Φ is a measurable quantity of interest defined on M(X ). Therefore, (1.10) states that the optimization problem (on its left hand side) over Ψ^{-1}Q (a subset of the set of measures on A, i.e. a subset of the set of measures on the set of measures on X ) is equal to the nesting of an optimization problem over Ψ^{-1}(q) (a subset of A, i.e. a subset of the set of measures on X ) and an optimization problem over Q (a subset of the set of measures on Q).
We will now illustrate this calculus by showing how (1.4) can be derived through a simple application of (1.10). First we need to give a short reminder on optimization over measures via the following problem.
Problem 3. A child is given one pound of playdoh and a seesaw. How much mass can she put above a threshold a while keeping the seesaw balanced at m?
The mathematical formulation of Problem 3 is

sup{µ[X ≥ a] : µ ∈ M([0, 1]), E_µ[X] = m}. (1.11)

Although (1.11) is an infinite-dimensional optimization problem over measures, it is easy to see that, to achieve the maximum, any mass put above a should be placed exactly at a to create minimum leverage towards the right hand side of the seesaw, and any mass put below a should be placed at 0 to create maximum leverage towards the left hand side of the seesaw (as illustrated in Figure 1.2.(a)). This simple argument allows us to reduce (1.11) to a simple one-dimensional problem whose solution is m/a and corresponds to Markov's inequality. This simple example of reduction calculus has a generalization to spaces of functions and measures [82] and is based on a form of linear programming in spaces of measures. In particular, the calculus developed in [82] uses results of Winkler [107], which follow from an extension of Choquet theory (see e.g. [84]) by von Weizsäcker and Winkler [97, Corollary 3] to sets of probability measures with generalized moment constraints, and a result of Kendall [64] characterizing the cones that are lattice cones in their own order.
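The seesaw reduction can be checked numerically: discretizing [0, 1] turns the optimization over measures into a finite linear program (a sketch; the grid size is arbitrary), whose optimal value recovers Markov's bound m/a.

```python
import numpy as np
from scipy.optimize import linprog

# Problem 3 as a finite linear program (sketch; grid size arbitrary): discretize
# [0, 1] and maximize the mass above a subject to total mass 1 and mean m.
a, m = 0.75, 0.375
xs = np.arange(101) / 100                       # grid on [0, 1], containing a
c = -(xs >= a).astype(float)                    # maximize mu[X >= a]
res = linprog(c, A_eq=[np.ones_like(xs), xs], b_eq=[1.0, m], bounds=(0, None))
print(-res.fun)   # 0.5 = m/a: mass 1 - m/a at 0 and m/a at a, i.e. Markov's inequality
```

The optimizer indeed concentrates mass at 0 and at a, the two extreme points identified by the leverage argument.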
We will now consider the next level of complexity, illustrated by the following two equivalent problems.
Problem 4. 10,000 children are each given one pound of playdoh and a seesaw. On average, how much mass can they put above the threshold a while, on average, keeping the seesaws balanced at m?
Problem 5. A child is given one pound of playdoh and a seesaw. What can you say about how much mass she is putting above the threshold a if all you have is the belief that she is keeping the seesaw balanced at m?
The mathematical formulation of Problems 4 and 5 is as follows (for Problem 4, replace 10,000 by N and consider the asymptotic limit N → ∞): what is the least upper bound on E_{µ∼π}[µ[X ≥ a]] if π is an unknown (imperfectly known) probability measure on M([0, 1]) (the set of probability distributions on [0, 1]) such that E_{µ∼π}[E_µ[X]] = m?
The answer to this question is

U(Π) = sup_{π ∈ Π} E_{µ∼π}[µ[X ≥ a]], (1.12)

where Π is the set of probability measures π on the set of probability measures on [0, 1] such that E_{µ∼π}[E_µ[X]] = m. Although (1.12) is an optimization problem over measures over measures, the calculus of (1.10) introduced in Theorem 3.11 allows us to reduce it to the nesting of two optimization problems over measures: an inner problem, sup{µ[X ≥ a] : E_µ[X] = q}, whose value is min(q/a, 1) by Markov's inequality, and an outer problem over the distributions Q of q with mean m, which yields U(Π) = m/a.
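The nested structure can also be checked numerically (a sketch; the grid size is arbitrary): solve the inner problem analytically via Markov's inequality and the outer problem as a finite linear program over distributions of the inner mean q.

```python
import numpy as np
from scipy.optimize import linprog

# The nesting of (1.10) in miniature: the inner problem
# sup{ mu[X >= a] : E_mu[X] = q } has value min(q/a, 1) by Markov's inequality;
# the outer problem optimizes over distributions Q of q with mean m.
a, m = 0.75, 0.375
qs = np.arange(201) / 200                       # grid of possible inner means q
inner = np.minimum(qs / a, 1.0)                 # inner optimal value for each q
res = linprog(-inner, A_eq=[np.ones_like(qs), qs], b_eq=[1.0, m], bounds=(0, None))
print(-res.fun)   # 0.5 = m/a, matching the single-level seesaw answer
```

Since min(q/a, 1) is concave, Jensen's inequality shows the outer optimum is attained by the Dirac mass at q = m, so the nested value coincides with m/a.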

Structure of the Paper and Main Results
This paper is structured as follows: Section 2 incorporates Bayesian priors into the Optimal Uncertainty Quantification (OUQ) framework [82]. In the OUQ framework, Uncertainty Quantification (UQ) is formulated as an optimization problem (over an infinite-dimensional set of functions and measures) corresponding to extremizing (i.e. finding worst and best case scenarios) probabilities of failure or other quantities of interest, subject to the constraints imposed by the scenarios compatible with the assumptions and information. In this generalization, priors are probability measures on spaces of measures, and computing optimal bounds on prior values (given a set of priors) requires solving problems in which the optimization variables are measures on spaces of measures (the results of this paper can be extended to measures over spaces of measures and functions but, for the sake of simplicity and clarity, we will limit the presentation to measures over measures).
Section 3 shows how such optimization problems can, under general conditions, be reduced to the nesting of two optimization problems over measures, where then we can apply the reduction theorems of [82].
Section 4 provides similar reduction theorems for the computation of optimal bounds on posterior values given a set of priors and the observation of the data. These reduction theorems lead to the brittleness results of Theorems 4.13, 6.4, and 6.9.
Section 5 reviews questions of Bayesian consistency, inconsistency, model misspecification, and robustness through a motivating analysis and interprets the results of this paper in relation to those questions. Section 6 presents the brittleness under local misspecification results of Theorems 6.4 and 6.9. That is, given a model, Theorem 6.4 provides optimal bounds on posterior values for priors that are at arbitrarily small distance (in the Prokhorov or total variation metrics) from a given model. Theorems 6.4 and 6.9 show that these optimal bounds on posterior values are the essential supremum and infimum of the quantity of interest irrespective of the size of data and of the size of the metric neighborhood around the model. Finally, Sections 8 and 9 contain the proofs.
2 General Set-Up

Notation and Conventions
Throughout, for a topological space Y, B(Y) will denote the Borel σ-algebra of subsets of Y and M(Y) will denote the space of Borel probability measures generally endowed with the weak topology and the corresponding Borel σ-algebra unless specified otherwise. For an alternative σ-algebra Σ Y of subsets of Y the set of probability measures on the σ-algebra Σ Y will be denoted M(Σ Y ). For a mapping between topological spaces, the term "measurable" will mean Borel measurable unless specified otherwise. Moreover, suprema over the empty set will have the value −∞ and infima over the empty set the value +∞.

The General Problem and the Optimal Uncertainty Quantification (OUQ) Framework
Let X be Polish and let Φ be a measurable function mapping M(X ), the set of probability measures on X , into the real line R, known as the quantity of interest. Let µ† be an unknown or imperfectly known probability measure on X . The general problem guiding our presentation will be that of estimating Φ(µ†). Let A be an arbitrary subset of M(X ). If A represents all that is known about µ† (in the sense that µ† ∈ A and that any µ ∈ A could, a priori, be µ† given the available information), then [82] shows that the quantities

U(A) := sup_{µ ∈ A} Φ(µ) (2.1)

and

L(A) := inf_{µ ∈ A} Φ(µ) (2.2)

are optimal given the available information µ† ∈ A in the sense that

L(A) ≤ Φ(µ†) ≤ U(A). (2.3)

It is simple to see that the inequality (2.3) follows from the assumption that µ† ∈ A. Moreover, for any ε > 0 there exists a µ ∈ A such that Φ(µ) ≥ U(A) − ε, and this µ could, given the available information, be the data-generating distribution µ†. Consequently, since all that we know about µ† is that µ† ∈ A, it follows that the upper bound Φ(µ†) ≤ U(A) is the best obtainable given that information, and the lower bound is optimal in the same sense. Although the OUQ optimization problems (2.1) and (2.2) are extremely large, we have shown in [82], for the more general situation where A is a set of functions f and measures µ and Φ a function of (f, µ), that an important subclass enjoys significant and practical finite-dimensional reduction properties. First, by [82, Cor. 4.4], although the optimization variables (f, µ) lie in a product space of functions and probability measures, for OUQ problems governed by linear inequality constraints on generalized moments, the search can be reduced to one over probability measures that are products of finite convex combinations of Dirac masses, with explicit upper bounds on the number of Dirac masses.
Furthermore, in the special case that all constraints are generalized moments of functions of f , the dependency on the coordinate positions of the Dirac masses is eliminated by observing that the search over admissible functions reduces to a search over functions on an m-fold product of finite discrete spaces, and the search over m-fold products of finite convex combinations of Dirac masses reduces to a search over the products of probability measures on this m-fold product of finite discrete spaces [82,Thm. 4.7]. Finally, by [82,Thm. 4.9], using the lattice structure of the space of functions, the search over these functions can be reduced to a search over a finite set.
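For instance, for the quantity of interest Φ(µ) = µ[X ≥ a] under the single moment constraint E_µ[X] = m, these reduction results say it suffices to search over convex combinations of two Dirac masses. A brute-force sketch (grid sizes and the numbers a, m are invented for illustration):

```python
import numpy as np

# Toy check of the Dirac-mass reduction: search over mu = p*delta_{x1} +
# (1 - p)*delta_{x2} for sup{ mu[X >= a] : E_mu[X] = m } on [0, 1].
a, m = 0.75, 0.375
xs = np.arange(101) / 100                       # candidate Dirac locations
ps = np.arange(101) / 100                       # candidate weights

P, X1, X2 = np.meshgrid(ps, xs, xs, indexing="ij")
means = P * X1 + (1 - P) * X2                   # E_mu[X] for each combination
values = P * (X1 >= a) + (1 - P) * (X2 >= a)    # Phi(mu) = mu[X >= a]
feasible = np.abs(means - m) < 1e-9             # enforce the moment constraint

best = values[feasible].max()
print(best)   # 0.5 = m/a: mass m/a at x = a and 1 - m/a at x = 0
```

The brute-force optimum over two Dirac masses matches the value m/a given by Markov's inequality over all of M([0, 1]), as the reduction theorems predict.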
For the sake of clarity we will now restrict the presentations of our results to the (simpler) situation where the quantity of interest Φ is (solely) a function of an unknown measure µ. As in [82], the results of this paper can be generalized to situations where Φ is a function of (f, µ).
A prototypical example of a quantity of interest in a certification context is Φ(µ) := µ[X ≥ a], where a is a safety margin. One is then interested in showing that µ†[X ≥ a] ≤ ε, where ε is a safety certification threshold (i.e. the maximum acceptable µ†-probability of the system exceeding the safety margin a). If U(A) ≤ ε, then the system associated with µ† is safe even in the worst case scenario (given the information represented by A). If L(A) > ε, then the system associated with µ† is unsafe even in the best case scenario (given the information represented by A). If L(A) ≤ ε < U(A), then the safety of the system cannot be decided (although we could declare the system to be unsafe due to lack of information).

Bayesian Priors on the Admissible Set
In the OUQ setting, an assumption of the form µ † ∈ A was used to derive the optimal inequality (2.3). This paper will consider the situation in which one has priors on the admissible set A and also information in the form of sample data. One of our goals is to analyse the robustness (or brittleness) of Bayesian inference by obtaining optimal bounds on posterior values given local misspecifications. In that context A can be viewed as a model class, and µ † , as the realization of a probability measure (the prior) on A.
In order to define priors on the space of admissible scenarios, A needs to be given the structure of a measurable space; i.e. a suitable σ-algebra Σ A on A must be provided. From now on, we will assume A to be a Borel subset of the Polish space M(X ), endowed with the Borel σ-algebra for A. We will also refer to a probability measure π ∈ M(Σ A ) as a prior.
Remark 2.2. The desire to have the Borel measurable structure of a Polish space might seem to be a spurious level of abstraction, but there are many good reasons for it. The first is that, by Suslin's Theorem [63, Thm. 14.2], all Borel subsets of a Polish space are Suslin, where a Suslin space is a continuous Hausdorff image of a Polish space. Indeed, Suslin sets are important in measurable selection theorems (see e.g. [29]) such as those that we use in the proof of Lemma 3.10; furthermore, in addition to Ulam's theorem [6, Thm. 4.3.8] that all probability measures on a Polish space are regular (approximable from within by compact sets), Schwartz' theorem [87] implies that all probability measures on a Suslin space are regular, and, therefore, [95, Thm. 11.1] implies that the extreme points in the space of probability measures on a Suslin space are the Dirac measures. Consequently, when M(X ) is Polish, any Borel subset A ⊆ M(X ) is Suslin, so the extreme points of probability measures on A are the Dirac measures, and some powerful measurable selection theorems are available. Moreover, when the base space is metrizable, the space of probability measures is Polish in the weak topology if and only if the base space is Polish.
Furthermore, since separability is equivalent to second countability for metric spaces, we have that the Borel structure of a product is the product of Borel structures of Polish spaces. In addition, by [40,Thm. 10.2.2], regular conditional probabilities exist for observables with values in a Polish space. Also, Polish spaces are the spaces of Descriptive Set Theory, see e.g. Kechris [63]. Polish spaces appear to be the appropriate spaces to play topological games such as the Banach-Mazur game [83], the Sierpiński game, the Ulam game, the Banach game, and the Choquet game. Moreover, a theorem of Choquet [63,Thm. 8.18] shows that a separable metric space is completely metrizable (and hence Polish) if and only if the second player has a winning strategy in the strong Choquet game. For a review of topological games, see Telgársky's review [93], and for topological games in hyperspace see that of Zsilinszky [108].

Data Spaces and Maps
In practice, the probability measure µ† is not observed directly; instead, the sample data arrives in the form of (realizations of) observation random variables, the distribution of which is related to µ†. To simplify the current presentation, we will assume that this relation is determined by a function of µ†, such as the case where the data X_1, . . . , X_n are determined by n independent realizations X_i of the random variable X with the possibly unknown distribution µ†. Throughout this paper we will use the following notation: D will denote the observable space (i.e. the space in which the sample data take values); D will be assumed to be a metrizable Suslin space; and D will denote a D-valued random variable producing the observed sample data. To represent the dependence of the observation random variable D on the unknown state µ† ∈ A, we introduce a measurable function D : A → M(D) to define this relation, where M(D) is given the Borel structure corresponding to the weak topology. The idea is that D(µ) is the probability distribution of the observed sample data D if µ† = µ, and for this reason it may be called the data map or, even more loosely, the observation operator. Often, for simplicity, we will write D instead of D(µ). Note that when the data comes in the form of n i.i.d. realizations of µ† we have D = X^n and D(µ) = µ^n (where µ^n is the n-fold tensorization of µ).
We proceed with a natural generalization of the Campbell measure and Palm distribution associated with a random measure as described in [62] (see also [33, Ch. 13] for a more current treatment). To that end, observe that, since D is metrizable, it follows from [4, Thm. 15.13] that, for any B ∈ B(D), the evaluation ν → ν(B), ν ∈ M(D), is measurable. Consequently, the measurability of D implies that the mapping (µ, B) → D(µ)[B] is a transition function in the sense that, for fixed µ ∈ A, D(µ, · ) is a probability measure, and, for fixed B ∈ B(D), D( · , B) is Borel measurable. Therefore, by [23, Thm. 10.7.2], any π ∈ M(A) defines a probability measure π D on A × D through π D[A × B] := ∫ 1_A(µ) D(µ)[B] dπ(µ), A ∈ Σ_A, B ∈ B(D), where 1_A is the indicator function of the set A. It is easy to see that π is the A-marginal of π D. Moreover, when X is Polish, [4, Thm. 15.15] implies that M(X) is Polish, and it follows that A ⊆ M(X) is second countable. Consequently, since D is Suslin and hence second countable, it follows that B(A × D) = B(A) ⊗ B(D), and hence π D is a probability measure on A × D. Let us refer to an element of M(A) as a prior on A. With a prior π on A, the quantity of interest Φ(µ) becomes a random variable, and we will be interested in estimating its distribution conditioned on the observation D ∈ B, where B ∈ B(D). Example 2.3. In the context of Example 2.1, we are interested in estimating the probability (under the prior π) that the system is unsafe, conditioned on the observations D ∈ B, i.e. the corresponding conditional expectation. If D corresponds to observing independent realizations of X, then the observation space D is X^n and the measure D(µ) is µ^n.
If D is the random variable that results from observing n independent realizations of X + ξ (i.e. X observed with additive Gaussian noise ξ ∼ N(0, σ^2)), then the measure D(µ) is the one associated with the random variable D = (X_1 + ξ_1, . . . , X_n + ξ_n), where the X_i are independent and distributed according to µ and the ξ_i are independent Gaussian random variables of mean zero and variance σ^2.
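As a concrete illustration, the noisy i.i.d. data map can be simulated as follows (a minimal sketch; the function names and the choice of µ as the uniform distribution on [0, 1] are illustrative assumptions, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(mu_sampler, n, sigma):
    """Draw one realization of D = (X_1 + xi_1, ..., X_n + xi_n), where the
    X_i are i.i.d. samples from mu and xi_i ~ N(0, sigma^2) independently."""
    x = mu_sampler(n)                    # X_1, ..., X_n ~ mu, i.i.d.
    xi = rng.normal(0.0, sigma, size=n)  # independent additive Gaussian noise
    return x + xi

# Illustrative choice of mu: the uniform distribution on [0, 1].
d = sample_data(lambda n: rng.uniform(0.0, 1.0, n), n=5, sigma=0.1)
print(d.shape)  # (5,)
```

The data map itself is the deterministic assignment µ → law of (X_1 + ξ_1, . . . , X_n + ξ_n); the sketch above only draws one realization from D(µ).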

Bayes' Theorem and Conditional Expectation
Henceforth A will be a Suslin space, and suppose now that we have π D ∈ M(A × D) constructed in the above way. Let π·D denote the corresponding Bayes sampling distribution defined by the D-marginal of π D, and note that, by (2.4), we have π·D[B] = E_π[ D(µ)[B] ], B ∈ B(D). (2.5) Since both D and A are Suslin, it follows that the product A × D is Suslin. Consequently, [23, Cor. 10.4.6] asserts that regular conditional probabilities exist for any sub-σ-algebra of B(A × D). In particular, the product theorem of [23, Thm. 10.4.11] asserts that product regular conditional probabilities exist and that they are π·D-a.e. unique.
When we consider π ∈ M(A) a prior, this result can be interpreted as the posteriors of Bayes' theorem. However, because such regular conditional probabilities are only uniquely defined π·D-a.e., when a data sample d ∈ D arrives such that π·D[{d}] = 0, a posterior π D | d, which could be any of the π·D-a.e.-equal regular conditional probabilities evaluated at d, appears to have dubious utility. Indeed, the fact that the regular conditional probabilities are only uniquely defined π·D-a.e. suggests that integrals of posteriors over subsets B ∈ B(D) such that π·D[B] > 0 are the more natural objects. Moreover, the restriction that B be an open set is natural for practical reasons, since conditioning on D lying in an open subset B, rather than on its exact value, is what one has to do when the sample data can only be observed after rounding error. Furthermore, we will show in Section 4 that if the data d have been sampled from a probability measure π†·D for some π† ∈ M(A) (commonly called a "true prior" in Bayesian statistics), then, with π†·D-probability one (on the realization of d), the π†·D-measure of any open set containing d is strictly positive. In other words, π†·D-almost surely, π† (the "true prior") belongs to the random subset of M(A) defined as the set of priors π ∈ M(A) such that π·D[B] > 0 for any open set B containing the data d (this subset is randomized through the realization of the data d).
Finally, throughout, we will find it useful to make the following assumption. Assumption 1. Φ is semibounded; that is, it is either bounded above or bounded below. Semiboundedness is sufficient to ensure that the integral of Φ with respect to any probability measure exists, possibly with the value ∞ or −∞, and such integrands are sufficient for the reduction theorems of Winkler [107] that we use.
Remark 2.4. Note that the assumption that Φ is semibounded is mostly for convenience since integrands which are not semibounded, like that defining the first moment, can be considered by restricting the space of measures to those measures that have well-defined first moments.

Incompletely Specified Priors
In practical situations, (1) the choice of a particular prior on A involves a degree of arbitrariness that may be incompatible with the certification of rare/critical events, and (2) the definition of such a prior is a non-trivial task if A is infinite dimensional. For these reasons it is necessary to consider situations in which the prior π is imperfectly known or specified. More precisely, the (lack of) information (or specification) on π can be represented via the introduction of a space Π where the subset Π ⊆ M(A) consists of the set of admissible priors π.
One of our goals in allowing incompletely specified priors is to assess the robustness of posterior Bayesian estimates with respect to the particular choice of priors. More precisely we will compute optimal bounds on E π [Φ] when π ∈ Π and show how these bounds are affected by the introduction of sample data by computing optimal bounds on E π D [Φ|B], for B ∈ B(D).

Optimal Bounds on the Prior Value
Recall that, for a subset A ⊆ M(X) and a measurable quantity of interest Φ : A → R, under the assumption µ† ∈ A, we have the optimal upper U(A) and lower L(A) bounds on the value Φ(µ†) of the quantity of interest, defined in (2.1) and (2.2) by U(A) := sup_{µ ∈ A} Φ(µ) and L(A) := inf_{µ ∈ A} Φ(µ). When we put a prior π on A, we have to define the value Φ̄(π) of the prior π corresponding to an extended quantity of interest Φ̄ : M(A) → R corresponding to Φ. Disregarding integrability concerns, for a given Φ, let us call the induced function Φ̄(π) := E_π[Φ] the canonical one associated with Φ and abuse notation by denoting the function Φ̄ as Φ. For such a canonical quantity of interest, we call the value E_π[Φ] the prior value, and note that the values U(Π) := sup_{π ∈ Π} E_π[Φ] (3.2) and L(Π) := inf_{π ∈ Π} E_π[Φ] (3.3) form a natural generalization of the values U(A) and L(A). Moreover, in the same way that U(A) and L(A) are optimal upper and lower bounds on Φ(µ†) given the information that µ† ∈ A, U(Π) and L(Π) are optimal upper and lower bounds on E_π[Φ] given the information that π ∈ Π. Of course, for these expressions to be well defined, integrability concerns should be addressed. Indeed, Assumption 1 implies that E_π[Φ] is well defined for any bounded measure π, possibly with the value ∞ or −∞, and therefore the quantities in (3.2) and (3.3) are well defined.
Remark 3.1. The restriction that the extended quantity of interest corresponding to Φ be canonical is really no restriction, but is assumed only to simplify the presentation and notation. Indeed, there are many important extended quantities of interest that are not affine as functions of the measure π. However, all the ones that we have thought of can be handled by small modifications of the present framework, and their inclusion here would simply complicate the presentation and notation. Moreover, note that many affine non-canonical extended quantities of interest become canonical through simple transformations. For example, when Φ_1(µ) := µ[X ≥ a] is a quantity of interest and the extended quantity of interest is the probability π[{µ ∈ A : Φ_1(µ) > ε}] that the system is unsafe, where {µ : Φ_1(µ) > ε} is the set of unsafe µ, then this extended quantity of interest is not canonical with respect to Φ_1. However, by transformation to the quantity of interest Φ_2(µ) := 1_{Φ_1(µ) > ε}, the extended quantity of interest becomes canonical, and U(Π) and L(Π), defined in terms of Φ_2, are optimal upper and lower bounds on the probability that the system is unsafe given the set of priors Π.

General Information Bounds on Prior Values
Let δ : A → M(A) be the mapping of points to unit Dirac measures, where δ_µ denotes the Dirac mass at µ, and, for Π ⊆ M(A), define A_Π := {µ ∈ A : δ_µ ∈ Π}. That is, A_Π consists of those scenarios µ that are not only admissible in the sense that they lie in A, but are also admissible as a prior in the sense that δ_µ is an element of Π. Moreover, if A_Π is non-empty, then L(Π) ≤ L(A_Π) ≤ U(A_Π) ≤ U(Π).

Priors Specified by Marginals
In many settings, probability measures or sets of probability measures are specified through generalized moments or other properties of marginal distributions. To analyse this case, let Q be a topological space and consider a measurable map Ψ : A → Q. We will now demonstrate how to reduce the computation of U(Ψ −1 Q) and L(Ψ −1 Q) when Q is specified by linear inequalities. Later, in Section 3.2.2, we will develop a more powerful nested reduction which will provide the foundation for our reduction methods.
Before we begin, we need to introduce some terminology. Following Winkler [107], let Y be a topological space and let M ⊆ M(Y) be a convex set of measures. Let ext(M) denote the set of extreme points of M and let the evaluation field Σ(ext(M)) be the smallest σ-algebra of subsets of ext(M) such that the evaluation map ν → ν(B) is measurable for all B ∈ B(Y). Then a measure ν ∈ M(Y) is said to be a barycenter of M if there exists a probability measure p on Σ(ext(M)) such that the barycentric formula ν[B] = ∫_{ext(M)} ξ[B] dp(ξ), B ∈ B(Y), (3.6) holds. Furthermore, the following notion of a measure affine function is central to Winkler's [107] reduction theorems, which we use: a function F : M → [−∞, ∞] is said to be measure affine if, for all ν ∈ M and all probability measures p on Σ(ext(M)) for which the barycentric formula (3.6) holds, F is p-integrable and F(ν) = ∫_{ext(M)} F(ξ) dp(ξ). A major consequence of Assumption 1, that Φ is semibounded, is that E_ν[Φ] exists, with possible values ∞ and −∞, for all finite measures ν. As a consequence, by [107, Prop. 3.1], the extended-real-valued function ν → E_ν[Φ] is measure affine.

Primary Reduction for Prior Values
Let us consider the computation of U(Ψ^{-1}Q) when Q is specified by n generalized moment inequalities determined by measurable functions g_1, . . . , g_n; the situation for the lower bound L(Ψ^{-1}Q) is the same. That is, let I_1, . . . , I_n be n closed intervals, allowing semi-infinite intervals (−∞, q] and [q, ∞), and let Q := {Q ∈ M(Q) : E_Q[g_i] ∈ I_i, i = 1, . . . , n}, where implicit in the definition is that all n integrals exist. Then, by a change of variables, E_{Ψπ}[g_i] = E_π[g_i ∘ Ψ], i = 1, . . . , n. Hence, Ψ^{-1}Q is defined by the n generalized moment inequalities corresponding to g_i ∘ Ψ : A → R for i = 1, . . . , n. Consequently, since the function π → E_π[Φ] is measure affine, it follows from the reduction theorems of [82] that we can reduce the supremum on the right-hand side of (3.7) to the convex combination of n + 1 Dirac masses. To state the theorem we have just proven, let ∆(n) ⊆ M(A) denote the set of those measures which are the (n + 1)-fold convex combinations of Dirac masses.
Theorem 3.4. Let A be Suslin, let Q be separable and metrizable, and let Ψ : A → Q be measurable. Moreover, for n measurable functions g_1, . . . , g_n : Q → R and n closed intervals I_1, . . . , I_n, let Q := {Q ∈ M(Q) : E_Q[g_i] ∈ I_i, i = 1, . . . , n} define the admissible set of Ψ-marginals. Then, U(Ψ^{-1}Q) = sup_{π ∈ Ψ^{-1}Q ∩ ∆(n)} E_π[Φ]. (3.10) Remark 3.5. The freedom to determine the intervals I_i, i = 1, . . . , n, is one way to incorporate uncertainty and maintain a reduction to n + 1 Dirac masses. In particular, by choosing semi-infinite intervals I_i := (−∞, q_i] we obtain a reduction to n + 1 Dirac masses for inequality constraints of the form E_Q[g_i] ≤ q_i, and by choosing point intervals I_i := [q_i, q_i] we obtain a reduction to n + 1 Dirac masses for equality constraints of the form E_Q[g_i] = q_i. Moreover, by choosing the interval to be a semi-infinite or point interval depending on the index i, we obtain a reduction to n + 1 Dirac masses for mixed equality and inequality constraints. Theorem 3.4 can be put into a canonical form in the following way: by considering the modified feature map Ψ′ : A → R^n with components Ψ′_i := g_i ∘ Ψ, i = 1, . . . , n, it follows from the above that, by changing from the feature map Ψ to Ψ′, we end up with a constraint set defined by the first moment of the vector function Ψ′. Therefore, let us remove the ′ from Ψ′ and require Ψ : A → R^n to be measurable. The following theorem is the canonical form of Theorem 3.4. It is a corollary of Theorem 3.4 for the constraint E_π[Ψ] ∈ Z when Z = I is a closed rectangle; however, it is true for arbitrary Z ⊆ R^n. Theorem 3.6. Let A be Suslin, let Ψ : A → R^n be measurable, let Z ⊆ R^n, and let Π := {π ∈ M(A) : E_π[Ψ] ∈ Z} be the set of those measures whose first moment belongs to Z. Then, U(Π) = sup_{π ∈ Π ∩ ∆(n)} E_π[Φ]. Example 3.7. Let X := [0, 1], let A := M(X), let Φ(µ) := µ[X ≥ a] for some fixed a ∈ (0, 1), and let Π := {π ∈ M(A) : E_π[E_µ[X]] = q} for some fixed q ∈ (0, a). Then we will show that U(Π) = q/a. (3.14) To that end, observe that Π is specified by the single first-moment constraint E_π[Ψ] = q with Ψ(µ) := E_µ[X], so that Theorem 3.6 implies that we can reduce the optimization in U(Π) to the supremum over measures of the form π = αδ_{µ_1} + (1 − α)δ_{µ_2} subject to the constraint αE_{µ_1}[X] + (1 − α)E_{µ_2}[X] = q. Introducing the slack variables q_1 := E_{µ_1}[X] and q_2 := E_{µ_2}[X] and using [82, Thm.
4.1] to reduce this problem further in µ_1, µ_2, we obtain that U(Π) is equal to the supremum of α min(q_1/a, 1) + (1 − α) min(q_2/a, 1) over α ∈ [0, 1] and q_1, q_2 ∈ [0, 1] subject to αq_1 + (1 − α)q_2 = q. Observing that the supremum is achieved at q_1, q_2 ≤ a, we conclude that U(Π) = q/a, establishing (3.14). Moreover, note that U(Π) = q/a < 1 = U(A); that is, the prior information strictly improves on the information-free bound.
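The reduced optimization of Example 3.7 can be checked numerically (a sketch under the assumptions of the example, with illustrative values a = 0.5 and q = 0.2): after the reduction, placing the whole constraint mass on a single atom at x ≥ a with weight q/x yields the objective q/x, so scanning x over [a, 1] recovers the optimum q/a at x = a.

```python
import numpy as np

a, q = 0.5, 0.2   # illustrative safety margin a and prior mean q in (0, a)

# After the Dirac reduction, a near-optimal configuration places weight
# w = q/x on a single atom at x >= a (so that the prior mean is exactly q)
# and the remaining weight at 0; the objective is then Phi = w = q/x.
xs = np.linspace(a, 1.0, 1001)   # candidate atom locations x >= a
values = q / xs                  # objective value for each configuration
print(values.max())              # maximized at x = a, giving q/a = 0.4
```

Placing any part of the constraint mass strictly above a (or anywhere below a) only wastes mean budget, which is why the scan over a single atom already attains the optimum.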

Nested Reduction for Prior Values
The result of Example 3.7 can also be deduced through a nested reduction that we will find generally more useful for two reasons. The first is that, in practice, not only is it highly non-trivial to specify a prior on the space A, since it requires quantifying information on an infinite-dimensional space, but it may also be undesirable to do so. Indeed, if an expert does not have a prior on the full space A but only on some projection Ψ(A) = Q, then, rather than arbitrarily picking one particular prior on the space A compatible with the specified prior on Ψ(A), it might be preferable to work with the set of priors on A specified through such marginals. Our second and main motivation is that, even when we can do the reduction on the primary space M(A), the reduced space remains so large that it may not be amenable to computation. However with the nested reduction theorems given below, the reduced space becomes computationally manageable for finite-dimensional Q.
Consider, for instance, the quantity of interest Φ(µ) := µ[X ≥ a], where a is thought of as a safety margin. In that example, the expert has only "the prior" that the mean of X with respect to µ is uniformly distributed on [−1, 1] and that the variance of X with respect to µ is independent of its mean and uniformly distributed on [3, 4].
Observe that in this situation Q does not uniquely specify a prior π ∈ M(A) but an infinite-dimensional set of priors Ψ −1 (Q) ⊆ M(A) and a robust approach would require assessing the safety of the system under the whole set Ψ −1 (Q) rather than under a particular element π of that set.
Idea of the Nested Reduction. Roughly, the idea of the nested reduction is as follows. To compute (3.7), consider the induced function U∘Ψ^{-1}(q) := sup_{µ ∈ Ψ^{-1}(q)} Φ(µ), q ∈ Ψ(A), where we use the notation of (2.1). From this it is natural to consider sup_{Q ∈ Q} E_Q[U∘Ψ^{-1}]. Let Q ∈ Q. Then, for any π such that Ψπ = Q, it follows that E_π[Φ] ≤ E_π[U∘Ψ^{-1}∘Ψ] = E_Q[U∘Ψ^{-1}]. However, if it were true that U∘Ψ^{-1}∘Ψ = Φ, then we would obtain E_π[Φ] = E_Q[U∘Ψ^{-1}] and conclude that U(Ψ^{-1}Q) = sup_{Q ∈ Q} E_Q[U∘Ψ^{-1}]. (3.15) We will show that, despite the fact that in general U∘Ψ^{-1}∘Ψ ≠ Φ, the conclusion (3.15) is still valid, provided that it is interpreted correctly. Heuristically, the reason for this is that the supremum sup_{π ∈ Ψ^{-1}Q} in U(Ψ^{-1}Q) is exploring the maximum value of Φ on level sets of Ψ, very much like the supremum in U∘Ψ^{-1}. If A is such that a reduction theorem, e.g. from [82], applies to reduce the computation of the inner supremum in U∘Ψ^{-1} to the supremum over convex combinations of Dirac masses, and the admissible set Q is such that a reduction theorem applies to the computation of the outer supremum of sup_{Q ∈ Q} E_Q[U∘Ψ^{-1}], then the identity (3.15) represents a nesting of reductions.
Let us now establish a result like (3.15). To do so will require addressing three questions: (1) What measurability properties does the induced function U∘Ψ^{-1} possess? (2) Can we define an integral of a function with the properties discovered from the answer to (1)? (3) Can we obtain a measurable solution operator to the optimization problem U∘Ψ^{-1}(q), where q ∈ Q? To that end, let us first recall a definition of universally measurable functions. Definition 3.9. Let (T, T) be a measurable space, and, for a positive measure ν on (T, T), let T_ν denote the ν-completion of T. Let T̄ := ∩_ν T_ν, where the intersection is over all positive bounded measures ν, denote the universally measurable sets. A T̄-measurable function is said to be universally measurable.
At the heart of the commutative representation used for the nested reduction is the following optimal measurable selection lemma, answering questions (1) and (3) above. Lemma 3.10. Let A be a Suslin space, let Q be a separable and metrizable space, and let Ψ : A → Q be measurable. Then, for any subset T ⊆ Ψ(A), the restriction of U∘Ψ^{-1} to T is universally measurable. To answer question (2) above, define a support supp(Q) of a measure Q ∈ M(Q), as in [4, Ch. 12.3], to be a closed set whose complement is Q-null and which is contained in every closed set with Q-null complement. When Q is a separable and metrizable space, it follows that it is second countable and therefore, by [4, Thm. 12.14], all Q ∈ M(Q) have a uniquely defined support. Now consider a measure Q ∈ M(Q) such that supp(Q) ⊆ Ψ(A). Then, by Lemma 3.10, E_Q[U∘Ψ^{-1}] can be defined by integration with respect to the completion Q̄. More generally, for any universally measurable function f and any finite measure Q, we define the expected value E_Q[f] := ∫ f dQ̄. (3.16) Such a method of defining integrals of possibly non-Borel measurable, but universally measurable, functions brings up many questions, such as: when is it uniquely defined?; for a fixed integrand, when is the expectation operation affine in the measure?; does it have a change of variables formula? All such questions have nice answers and, although we are sure that this is classical, we cannot find a reference for these facts, so we have included statements and proofs of the facts needed in this paper in Section 9.1 of the Appendix. We now state our nested reduction theorem of the form (3.15). Theorem 3.11. Let A be a Suslin space, let Q be a separable and metrizable space, and let Ψ : A → Q be measurable. Moreover, let Q ⊆ M(Q) be such that supp(Q) ⊆ Ψ(A) for all Q ∈ Q. Then U(Ψ^{-1}Q) = sup_{Q ∈ Q} E_Q[U∘Ψ^{-1}], (3.18) where the expectations on the right-hand side are defined as in (3.16). Finally, the expectation operator on the right-hand side is measure affine in Q.
Remark 3.12. Note that (3.18) can be written U(Ψ^{-1}Q) = sup_{Q ∈ Q} E_{q ∼ Q}[ sup_{µ ∈ Ψ^{-1}(q)} Φ(µ) ]. (3.19) Remark 3.13. Since the right-hand side is measure affine in Q, if Q is specified through (multi-)linear generalized moment inequalities, then the reduction theorems of [82] can be applied to obtain the supremum over Q by reducing Q to a convex combination of a finite number of Dirac masses on Q. Moreover, if Q consists of a single element, i.e. Q = {Q}, then U(Ψ^{-1}Q) = E_{q ∼ Q}[U∘Ψ^{-1}(q)], (3.20) and the right-hand side of (3.20) can be approximately evaluated via Monte Carlo sampling of q ∈ Q according to the measure Q.
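The Monte Carlo evaluation suggested by (3.20) can be sketched as follows (illustrative assumptions: X = [0, 1], Ψ(µ) = E_µ[X], Q the uniform distribution on [0, 1], and Φ(µ) = µ[X ≥ a], for which the inner optimal value is U∘Ψ^{-1}(q) = min(q/a, 1) by the Markov-type reduction):

```python
import numpy as np

a = 0.5                                   # illustrative safety margin
rng = np.random.default_rng(2)

# Monte Carlo sampling of q ~ Q = Uniform[0, 1], averaging the inner
# optimal value U o Psi^{-1}(q) = min(q/a, 1), as in (3.20).
qs = rng.uniform(0.0, 1.0, 1_000_000)
estimate = np.minimum(qs / a, 1.0).mean()
print(estimate)   # analytic value of the integral: a/2 + (1 - a) = 0.75
```

In higher-dimensional feature spaces the inner value is itself a small finite-dimensional OUQ problem, and the same outer Monte Carlo loop applies unchanged.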
Remark 3.14. A similar theorem can be obtained for the optimal lower bound L(Ψ^{-1}Q). Throughout this paper, results given for optimal upper bounds U can be translated into results for optimal lower bounds L by considering the negative quantity of interest −Φ, and, for the sake of concision, we will not write those results unless necessary.
Consider again the setting of Example 3.7, with Ψ(µ) := E_µ[X], and let the set of admissible priors π on A be the collection Π := {π ∈ M(A) : E_π[Ψ] = q} for some fixed q ∈ (0, a). We will now demonstrate how the result U(Π) = q/a of (3.14), obtained by the primary reduction, follows from the nested reduction theorem. To that end, observe that, since Ψ(A) = [0, 1] ⊆ R, by restricting to measures Q ∈ M(R) with support supp(Q) ⊆ [0, 1], Theorem 3.11 implies that U(Π) = sup_{Q ∈ Q} E_Q[U∘Ψ^{-1}], where Q is the set of probability measures Q on R with support contained in [0, 1] and with mean q. Theorem 4.1 of [82] shows that the inner supremum U∘Ψ^{-1}(q) of µ[X ≥ a] can be achieved by assuming that µ is the weighted sum of two Dirac masses, i.e. that U∘Ψ^{-1}(q) is the supremum of α1_{x_1 ≥ a} + (1 − α)1_{x_2 ≥ a} over α ∈ [0, 1] and x_1, x_2 ∈ [0, 1] subject to αx_1 + (1 − α)x_2 = q. (3.22)
For q > a, the supremum on the right-hand side of (3.22) is 1, and, for q ≤ a, the supremum on the right-hand side of (3.22) is achieved by x_2 = 0, x_1 = a and α = q/a, and so we conclude that U∘Ψ^{-1}(q) = min(q/a, 1) and U(Π) = sup_{Q ∈ Q} E_{q ∼ Q}[min(q/a, 1)]. (3.23) Using [82, Thm. 4.1] again, we obtain that the supremum in Q on the right-hand side of (3.23) is equal to the supremum of α min(q_1/a, 1) + (1 − α) min(q_2/a, 1) over α ∈ [0, 1] and q_1, q_2 ∈ [0, 1], subject to the constraint that αq_1 + (1 − α)q_2 = q. This supremum is achieved by q_1 = a, q_2 = 0 and α = q/a, and so we obtain that U(Π) = q/a, in agreement with (3.14).
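The outer reduction of this example can also be checked by a small grid search (a sketch under the example's assumptions, with illustrative a = 0.5 and q = 0.2): the outer measure is reduced to Q = αδ_{q_1} + (1 − α)δ_{q_2} with mean q, and the inner value is min(t/a, 1).

```python
import numpy as np

a, q = 0.5, 0.2

def inner(t):
    """Inner optimal value U o Psi^{-1}(t) = sup{mu[X >= a] : E_mu[X] = t}."""
    return min(t / a, 1.0)

best = 0.0
grid = np.linspace(0.0, 1.0, 501)
for q1 in grid:
    for q2 in grid:
        if q2 < q < q1:                    # need alpha strictly inside (0, 1)
            alpha = (q - q2) / (q1 - q2)   # enforces alpha*q1 + (1-alpha)*q2 = q
            best = max(best, alpha * inner(q1) + (1 - alpha) * inner(q2))

print(best)   # attains q/a = 0.4, e.g. at q1 = a, q2 = 0, alpha = q/a
```

Since t → min(t/a, 1) is concave, Jensen's inequality caps the objective at min(q/a, 1) = q/a, so the grid search cannot overshoot; the point of the nested reduction is that this two-atom outer search already attains the exact bound.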

Optimal Bounds on the Posterior Value
What happens to the optimal bounds (3.2) and (3.3) on the prior value E_π[Φ], investigated in Section 3, after conditioning on the data? Does the interval corresponding to these optimal bounds shrink down to a single point as more and more data comes in? Does this interval shrink as the measurement noise on the data is reduced? What happens to posterior estimates associated with two distinct but close priors, possibly sharing the same marginal distribution on a high-dimensional space? These are the questions that will be investigated in this section. Our answers will show that: (1) optimal bounds on posterior estimates grow as data comes in; (2) optimal bounds on posterior estimates grow as measurement noise is reduced; and (3) two priors sharing the same high-dimensional marginals can lead to diametrically opposed posterior estimates. In some sense these results can be seen as extreme occurrences of the dilation property observed in robust Bayesian inference [103]. As discussed in Section 2.4, let us now consider the case where the probability distribution of the data is a known function D(µ) of the admissible candidates µ ∈ A. As shown in Section 2, directly conditioning measures π D with respect to the random variable D representing the observed sample data would require manipulating regular conditional probabilities on A × D.
Furthermore, in Bayesian statistics a prior π may represent a "subjective belief" about reality, and, in such situations, the data may be sampled from π†·D†, which may be distinct from π·D. In frequentist analyses of Bayesian statistics, π† is called the "true" prior, or "data-generating distribution", and π a "subjective" prior (see [14] and references therein). Although it is known that the subjective prior π might be distinct from the true prior π†, one may still try to evaluate the conditional expectation of the quantity of interest Φ using π as the distribution on A. We will show here that, although the observation of the sample data d does not uniquely determine the true prior π†, it does determine a random subset of M(A) (i.e. a random subset of priors), denoted R(d), such that, π†-a.s., π† ∈ R(d). This observation is based on the following fundamental lemma. Lemma 4.1. Let ν be a Borel probability measure on a metrizable Suslin space. Then, for ν-almost every point d, every open set containing d has strictly positive ν-measure. Now suppose the data d are generated according to a probability measure π†·D (where π† is the "true" prior). We conclude from Lemma 4.1 that, when we observe a sample d, if we assume that π† ∈ R(d), where R(d) := {π ∈ M(A) : π·D[B] > 0 for every open set B containing d}, then we will be correct in this assumption with π†·D-probability 1. Therefore, when the data d are generated and we observe that d ∈ B_d, where B_d is an open subset containing the data d (to keep our notation simple, we will, later on, drop d in the notation B_d), then we restrict our attention to priors π ∈ Π such that π·D[B_d] > 0. That is to say, we restrict our attention to the intersection of Π with the set of priors π such that π ∈ M(A) and π·D[B_d] > 0. We write Π_{B_d} for this intersection.
If Π_{B_d} is empty, then we assert that "π† is not contained in Π", and we know that this assertion is true with π†·D-probability 1 on the realization of the data d. Conversely, if π† is contained in Π, then Π_{B_d} must, with π†·D-probability 1 on the realization of the data d, still contain π† (in particular, it must be non-empty).
Happily, this approach also facilitates the efficient computation of the conditional expectations, because now they have a simple representation. Indeed, consider the conditional expectation of a quantity of interest Φ given a prior π and data map D, conditioned on a subset B ∈ B(D) such that π·D[B] > 0. It follows from (2.4) and (2.5) that the conditional expectation of Φ given B is E_{π D}[Φ | D ∈ B] = E_π[ Φ(µ) D(µ)[B] ] / E_π[ D(µ)[B] ]. (4.1) Moreover, recall that this conditional expectation is the best mean-squared approximation of Φ under the measure π D, given the information that D ∈ B.
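For a finite prior, the representation (4.1) is a simple ratio of weighted sums. A minimal sketch (the two candidate measures, the set B, and all numbers are illustrative assumptions):

```python
import numpy as np

# pi puts weight w_i on two candidate measures, mu_1 = Uniform[0, 1] and
# mu_2 = Uniform[0, 1/2]; one sample is observed (D(mu) = mu) and we
# condition on D in B = [0, 1/4].  Quantity of interest: Phi(mu) = mu[X >= 1/2].
w = np.array([0.5, 0.5])            # prior weights on mu_1, mu_2
phi = np.array([0.5, 0.0])          # Phi(mu_1) = 1/2, Phi(mu_2) = 0
likelihood = np.array([0.25, 0.5])  # D(mu_1)[B] = 1/4, D(mu_2)[B] = 1/2

# (4.1): E_{pi D}[Phi | D in B] = E_pi[Phi(mu) D(mu)[B]] / E_pi[D(mu)[B]]
posterior_value = (w * phi * likelihood).sum() / (w * likelihood).sum()
print(posterior_value)   # = 0.0625 / 0.375 = 1/6
```

The same ratio structure is what turns the optimal-bound computations below into linear fractional programs.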

General Information Bounds on Posterior Values
Now let B ⊆ D be open, let U(Π|B) := sup_{π ∈ Π_B} E_{π D}[Φ | D ∈ B], (4.5) and use L(Π|B) for the corresponding infimum (4.6). The following theorem is a straightforward consequence of (4.1). Theorem 4.5. It holds true that L(A) ≤ L(Π|B) ≤ U(Π|B) ≤ U(A). Moreover, if A_{Π_B} is non-empty, then L(Π|B) ≤ L(A_{Π_B}) ≤ U(A_{Π_B}) ≤ U(Π|B).

Primary Reduction for Posterior Values
As in Section 3.2.1, when priors are specified through finite-dimensional inequalities, it is possible to provide a reduction of the computation of U(Π|B) on the primary space. To that end, let M_+(A) denote the set of positive bounded measures on A and let us extend the "expectation notation" to mean integration with respect to a positive measure in the natural way: for a measurable function ψ and a π_+ ∈ M_+(A), define E_{π_+}[ψ] := ∫_A ψ dπ_+, if the integral exists. Let ψ_0, . . . , ψ_n be real-valued measurable functions on A, let Π_+ ⊆ M_+(A) be defined through the corresponding n + 1 generalized moment conditions (where implicit in the definition is that all n + 1 integrals exist), and let Π_{+,n} := Π_+ ∩ ∆(n) be the set of those measures in Π_+ that are non-negative sums of n + 1 Dirac masses.
Furthermore, if ψ_0 is non-negative on A and there exists a measurable function ϕ such that Φ = ψ_0 ϕ, then sup_{π_+ ∈ Π_+} E_{π_+}[Φ] = sup_{π_+ ∈ Π_{+,n}} E_{π_+}[Φ]. Recall also the notation of (4.5), and recall the result (4.1) that, for any π ∈ Π_B, E_{π D}[Φ | D ∈ B] = E_π[ Φ(µ) D(µ)[B] ] / E_π[ D(µ)[B] ].
The proof of the following theorem is obtained by first proving the theorem for the equality constraints Z = {q}, by observing that U(Π(q)|B) is a linear fractional optimization problem in π and utilizing the fact that such problems are equivalent to linear problems [27], and then applying Theorem 4.7. To extend the result to a subset Z ⊆ R^n, one uses a layer-cake approach as in the proof of Theorem 3.6. As in Section 3, the following primary reduction theorem, Theorem 4.8, will be formulated in canonical form, and the nested reduction theorem, Theorem 4.11, will be in the general form. Example 4.9. In the setting of Example 3.7, let Π := {π ∈ M(A) : E_π[E_µ[X]] = q} for some q ∈ (0, a). We saw in Example 3.7 that U(Π) = q/a. Now suppose that we observe the random variable D := (X_1, . . . , X_n) corresponding to n i.i.d. samples of µ† ∈ A. More precisely, we observe D ∈ B, where B = B_1 × · · · × B_n and B_i is the ball in (0, 1) of center x_i and radius ρ, with x_i ∈ (0, 1) and 0 < ρ ≪ 1/n. Let D_n denote the data map corresponding to taking n i.i.d. samples, that is, D_n(µ) := µ ⊗ · · · ⊗ µ, and observe that D_n(µ)[B] = ∏_{i=1}^n µ[B_i]. The primary reduction expresses U(Π|B) as a supremum over α_1, α_2 ≥ 0 and µ_1, µ_2 ∈ A, subject to the normalization constraint of the linear fractional program and the mean constraint. Introducing slack variables β_{1,i} := µ_1[B_i] and β_{2,i} := µ_2[B_i] as n linear constraints on µ_1 and n linear constraints on µ_2, we obtain (from [82, Thm. 4.1]) that the supremum can be achieved by assuming that each µ_i is the weighted sum of at most n + 2 Dirac masses. Assuming that the B_i are non-intersecting balls of radius ρ ≪ 1/n centered on x_1, . . . , x_n, n of these Dirac masses will have to be put at x_1, . . . , x_n; for optimality, the two others will have to be put at 0 and a (with weights p_1 and p_2). Introducing γ_1 := α_1 D_n(µ_1)[B] and γ_2 := α_2 D_n(µ_2)[B], it follows that U(Π|B) is equal (as ρ ↓ 0) to the supremum of γ_1 p_1 + γ_2 p_2 over γ_1, γ_2 ≥ 0 and p_1, p_2 ∈ [0, 1], subject to the constraint γ_1 + γ_2 = 1 and the transformed mean constraint. By considering 0 < β_{i,j} ≪ 1 it is easy to obtain that U(Π|B) = 1.
Then, Theorem 4.8 implies that U(Π(α)|B^n_δ), the least upper bound on posterior values, is equal to the supremum over α_1, α_2 ≥ 0 and µ_1, µ_2 ∈ A(α) of the reduced objective subject to the reduced constraints, where we have used the notation µ^n[B^n_δ] := ∏_{i=1}^n µ(B_δ(x_i)). Introducing γ_1 := α_1 µ_1^n[B^n_δ] and γ_2 := α_2 µ_2^n[B^n_δ], it follows that U(Π(α)|B^n_δ) is equal to the supremum over γ_1, γ_2 ≥ 0 and µ_1, µ_2 ∈ A(α) of the resulting objective subject to the constraint γ_1 + γ_2 = 1, which can be simplified to the supremum over µ_1, µ_2 ∈ A(α) of the expression (4.12). By introducing slack variables for m_1 := E_{µ_1}[X] and m_2 := E_{µ_2}[X], maximizing (4.12) with m_1 and m_2 fixed, and then taking a supremum over m_1, m_2, one obtains that the supremum of (4.12) is achieved, in the limit δ ↓ 0, in the configuration (4.13), in which µ_1 puts most of its mass on a and µ_2 puts most of its mass on 0.
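The brittleness mechanism at work in these examples can be illustrated numerically (a sketch with illustrative numbers; the family µ_ε, the data points x_i, and all constants are assumptions made for the illustration): a feasible prior concentrated on a measure that places a small mass ε near the data and the rest at a drives the posterior value of Φ towards 1, while the prior bound remains q/a.

```python
import numpy as np

a, q, n = 0.5, 0.2, 5                # safety margin, prior mean, sample size
x = np.linspace(0.05, 0.45, n)       # illustrative data points, all < a

# Candidate prior: pi = alpha*delta_{mu_eps} + (1 - alpha)*delta_{delta_0},
# with mu_eps = (1 - eps)*delta_a + (eps/n) * sum_i delta_{x_i}.
for eps in [0.1, 0.01, 0.001]:
    mean_mu = (1 - eps) * a + eps * x.mean()   # E_{mu_eps}[X]
    alpha = q / mean_mu                        # enforces E_pi[E_mu[X]] = q
    L = (eps / n) ** n                         # mu_eps^n[B] for disjoint small balls B_i
    # delta_0 assigns zero likelihood to B, so by (4.1) the posterior value
    # is Phi(mu_eps) = mu_eps[X >= a] = 1 - eps:
    posterior = (alpha * (1 - eps) * L) / (alpha * L)
    print(eps, posterior)                      # posterior -> 1 as eps -> 0
```

The likelihood L cancels in the ratio: however small the mass placed near the data, conditioning transfers all posterior weight to the measure that "explains" the observations, which is exactly why the posterior bound escapes the prior bound q/a.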

Nested Reduction for Posterior Values
Here, as in Section 3.2, we show how the optimization problems (4.5) and (4.6) can be reduced to nested OUQ optimization problems (i.e. nested problems analogous to (2.1) and (2.2)) when the collection Π of admissible priors is defined by how they push forward under a measurable mapping Ψ : A → Q. That is, we specify a feature space Q, a measurable map Ψ : A → Q, and a subset Q ⊆ M(Q), and define the admissible set of priors Π := Ψ^{-1}Q. As before, we focus on reducing the upper bound U(Ψ^{-1}Q | B). (4.14) Theorem 4.11. Let A be a Suslin space, let Q be a separable and metrizable space, and let Ψ : A → Q be measurable. Moreover, let Q ⊆ M(Q) be such that supp(Q) ⊆ Ψ(A) for all Q ∈ Q. Then, for each Q ∈ Q, Ψ^{-1}Q is non-empty. Moreover, the upper bound U(Ψ^{-1}Q | B), defined in (4.14), satisfies the identity (4.15), where the expectations on the right-hand side are defined as in (3.17). Finally, the expectation operator on the right-hand side is measure affine in Q, as defined in (3.3).
Remark 4.12. Note that Theorem 4.11 is more general than Theorem 4.8 because its application does not require the assumption that Ψ^{-1}Q is defined via generalized moment constraints.
The following theorem, Theorem 4.13, is our main result. It shows not only that the right-hand side of the assertion (4.15) of Theorem 4.11 depends on the sample data in a very weak way, but also that, under very mild assumptions, the observation of this sample data leads to an increase (rather than a decrease) of the least upper bound on the quantity of interest. Remark 4.14. Note that the convention that sup ∅ = −∞ implies that, if the assumption (4.17) is satisfied, then there is a measure Q ∈ Q such that the set of q such that D(µ)[B] > 0 for some µ ∈ Ψ^{-1}(q) has strictly positive Q-measure. On the other hand, Theorem 3.2 gives the corresponding bound before conditioning, and so we conclude that observing the sample data does not improve the optimal bound! Moreover, when the inequality (4.19) is strict, we conclude that observing the sample data makes the optimal bound worse! In other words, after the observation of the sample data (which may be limited to a single realization of X under the measure µ†, or an arbitrarily large number of independent samples X_i), the optimal upper bound on the quantity of interest deteriorates. Example 4.16. Let D_n(µ) := µ ⊗ · · · ⊗ µ be the i.i.d. data map. In this example we are interested in estimating the mean of X under some unknown measure µ† ∈ A, and we observe d = (d_1, . . . , d_n), n i.i.d. samples from X; note that n can be very large. The sample data contain information on µ† through the fact that their distribution is D_n(µ†) = µ† ⊗ · · · ⊗ µ† (i.e. although the distribution of the sample data is unknown, its dependency structure, as a functional of µ†, is known). Let k be a (possibly large) number. Define Π to be the set of priors π under which the distribution of (E_µ[X], . . .
, E_µ[X^k]) is Q, where Q is a distribution on R^k such that E_µ[X] (its first marginal) is uniformly distributed on [0, 1], and such that the (conditional) distribution of E_µ[X²] conditioned on E_µ[X] = q₁ is the uniform distribution on the interval [q₁², q₁] (the set of values of E_µ[X²] that are compatible with E_µ[X] = q₁ for a measure µ on [0, 1]), and such that the conditional distributions of the other marginals E_µ[X^k] are defined iteratively in the same manner. For this example, note that Ψ(µ) = (E_µ[X], . . . , E_µ[X^k]) and that, for q := (q₁, . . . , q_k) in the range of Ψ (i.e. in Ψ(A)), Ψ⁻¹(q) is the set of measures µ ∈ A whose first k moments are q₁, . . . , q_k. We will now use Theorem 4.13 to compute optimal bounds on the posterior values of Φ(µ) = E_µ[X]. We will focus our attention on the upper bound. First observe that in this example Q is reduced to the single measure Q constructed above and D is reduced to the single data map Dₙ.
Let us first check that condition (4.17) is always satisfied (irrespective of the value of the data d). Note that condition (4.17) is satisfied if, for all δ > 0, there exists a subset of values of q of strictly positive Q-measure such that {µ ∈ Ψ⁻¹(q) | Dₙ(µ)[B] > 0 and E_µ[X] ≥ 1 − δ} is non-empty. So, let δ > 0 be arbitrary and define µ_d to be the empirical distribution of d, i.e. µ_d := (1/n) Σᵢ₌₁ⁿ δ_{dᵢ}.
One can show by induction that Ψ(A_δ) has a non-empty interior and that any open subset of Ψ(A) has strictly positive Q-measure. Let q* be a point in the interior of Ψ(A_δ), and let B_τ(q*) be a ball of center q* and radius τ such that B_{2τ}(q*) is contained in the interior of Ψ(A_δ). Note that B_τ(q*) has strictly positive Q-measure. Furthermore, for ε > 0 sufficiently small, for each q ∈ B_τ(q*) there exist q′ ∈ B_{2τ}(q*) and µ ∈ Ψ⁻¹(q′) such that µ places an ε-fraction of its mass on the empirical measure µ_d (so that Dₙ(µ)[B] > 0) while still satisfying E_µ[X] ≥ 1 − δ. It follows that (4.17) is satisfied (irrespective of the value of the data d).
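The mixture construction above can be sketched numerically. In the following sketch, the sample values are illustrative and the `moments` helper is an assumption of ours, not notation from the paper: mixing a Dirac mass near 1 with a small fraction of the empirical measure µ_d yields a measure with positive mass on every data point whose mean still exceeds 1 − δ.

```python
# Sketch: mu := (1 - eps) * delta_{1 - delta/2} + eps * mu_d keeps positive
# mass on every data point (so D_n(mu)[B] > 0) while E_mu[X] >= 1 - delta.

def moments(atoms, weights, k):
    """Moment vector (E[X], ..., E[X^k]) of the discrete measure sum_i w_i * delta_{x_i}."""
    return [sum(w * x ** j for x, w in zip(atoms, weights)) for j in range(1, k + 1)]

d = [0.12, 0.35, 0.50, 0.71, 0.90]        # illustrative samples in [0, 1]
n, delta, eps = len(d), 0.05, 1e-3

atoms = [1.0 - delta / 2] + d             # one atom near 1, plus the data
weights = [1.0 - eps] + [eps / n] * n     # almost all mass near 1

psi_mu = moments(atoms, weights, k=3)     # the feature vector Psi(mu)
```

Every data point carries weight ε/n > 0, yet the first moment psi_mu[0] exceeds 1 − δ.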
Let us now consider condition (4.16). Observe that condition (4.16) is satisfied if, for Q-almost all q ∈ Ψ(A) and all ε > 0, there exists µ ∈ Ψ⁻¹(q) such that Dₙ(µ)[B] < ε. Assume that d contains at least k + 2 distinct points and that ρ is strictly smaller than half of the minimal distance between two such points, so that the associated Bᵢ do not overlap; note that this assumption is satisfied with probability converging to one (as n → ∞) if the data are sampled from a measure µ† that is absolutely continuous with respect to the Lebesgue measure on [0, 1]. Let q ∈ Ψ(A); by the reduction theorems of [82] there exists µ_q ∈ Ψ⁻¹(q) such that µ_q is the weighted sum of at most k + 1 Dirac masses. See [80], where, in particular, a quantitative version of Theorem 4.13 is developed and then applied to Example 4.16. Curiously, a refined analysis of the integral geometry of the truncated Hausdorff moment space, used to demonstrate the approximate satisfaction of the conditions of Theorem 4.13, is shown in [80] to lead to a new family of Selberg integral formulas; see [46] for a discussion of their importance.

Moreover, if Π is convex, then by considering priors of the form λπ₀ + (1 − λ)π₁ with π₀, π₁ ∈ Π, π₀ · D[B] > 0 and π₁ · D[B] > 0, it is easy to see that the Bayesian posterior can take any value in the interval (L(A), U(A)), irrespective of the data. In addition, it is easy to observe that even including the quantity of interest Φ in the marginal Ψ does not prevent this fragility. Theorem 4.13 also leads to the following apparent paradoxes when the Bayesian framework is applied to the space A: (1) posteriors with different priors may diverge as more and more data comes in; (2) when the sample data is observed with some (say Gaussian) measurement noise of variance σ², the optimal bound U(Ψ⁻¹(Q))^B on the quantity of interest Φ converges towards U(Ψ⁻¹(Q)) as σ² → ∞.
That is, if one interprets optimal bounds on posterior values as uncertainty bounds, then one would reach the paradoxical conclusion that adding measurement uncertainty decreases the uncertainty of the quantity of interest. The idea of the proof of this assertion is based on the following observation: let y be the (noisy) measurement, whose distribution given the value of the data d is assumed to be independent of µ. Write p_σ(d)[B] for the probability that the value of y belongs to a set B, and observe that, as σ² → ∞, this probability becomes asymptotically independent of d, so that conditioning on the event y ∈ B recovers the prior bound U(Ψ⁻¹(Q)). The fact that optimal bounds on prior values may become less precise after conditioning is known as the dilation phenomenon in robust Bayesian inference [103], and, in some sense, the brittleness results presented in this paper could be seen as an extreme occurrence of this phenomenon.

Bayesian Robustness and Consistency
It is appropriate at this point to place the results of Sections 3 and 4 in the more well-established context of two key questions about Bayesian inference, namely its robustness with respect to perturbations of the prior (and likelihood and observed data), and its frequentist consistency. This discussion will also motivate Section 6, where we show that Bayesian inference can be profoundly non-robust even under arbitrarily small local perturbations in the total variation and Prokhorov metrics.

Bayesian Robustness
The robust Bayesian viewpoint appears to have been introduced independently by Box [25] and Huber [58]; see e.g. [15,16] and Chapter 15 of [60] for surveys of the field. In the robust Bayesian approach, a class Π of priors and a class Λ of likelihoods together produce a class of posteriors by pairwise combination through Bayes' rule. Robust Bayesian methods are a subclass of the methods of imprecise probability; the idea that the probability of an event need not be a single real number has a history stretching back to Boole [24] and Keynes [65], with more recent and comprehensive foundations laid out in e.g. [68,100,105]. One way of generating such a class Π of priors is via a belief function, as in [104] and Dempster-Shafer theory more generally. The belief function framework encompasses prior probabilities whose values are known only on some finite partition of the probability space, and not the whole σ-algebra; classes of ε-contaminated priors can also be represented in this way, as well as classes of locally perturbed priors. The belief function approach has the useful feature that explicit formulae can be given for the lower and upper posterior probabilities of events [104,Theorem 4.1].
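One of the simplest classes of this kind admits closed-form prior bounds: under the ε-contamination class Π = {(1 − ε)π₀ + εq | q an arbitrary probability measure}, the lower and upper prior probabilities of an event A are (1 − ε)π₀[A] and (1 − ε)π₀[A] + ε. A minimal sketch (the numerical values are illustrative):

```python
# Lower/upper prior probabilities of an event A under the epsilon-contamination
# class {(1 - eps) * pi0 + eps * q : q an arbitrary probability measure}.
# The extremes are attained by contaminants q placing all mass off A (lower)
# or all mass on A (upper).

def contamination_bounds(pi0_of_A, eps):
    lower = (1 - eps) * pi0_of_A
    upper = (1 - eps) * pi0_of_A + eps
    return lower, upper

lo, hi = contamination_bounds(pi0_of_A=0.30, eps=0.10)   # illustrative values
```

Here a 10% contamination of a prior assigning probability 0.30 to A yields the interval [0.27, 0.37] of prior probabilities for A.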
Another typical approach to generating a class Π might be to consider a finite-dimensional parametrized class of models. For example, one could consider, instead of a single Gaussian prior on R of specified mean and variance, a two-parameter class of Gaussian priors with a range of means and variances, or a three-parameter class of skew-Gaussian priors. Similarly, one might consider a two-parameter class of beta distributions instead of a uniform prior on a bounded interval.
However, a danger in specifying a finite-dimensional class Π of priors is that one is making very strong statements about the form of the priors, particularly with regard to the tails, that cannot be justified based on often-limited amounts of prior information. For example, if all the priors π ∈ Π have thin tails, then the class Π will have a very difficult time modeling events that lie in those tails, even when exposed to data from those regions. This problem is particularly important in applied fields such as catastrophe modeling, insurance, and re-insurance, in which the catastrophic events of interest are by definition high-impact low-probability "Black Swan" events: the difference between an exponentially small and an inverse-polynomially small tail can be vitally important. Also, because members of a finite-dimensional parametric family Π of priors often have similar qualitative properties (such as being mutually absolutely continuous), the apparently broader perspective does not add much to the asymptotic posterior picture in terms of robust consistency, although it does provide a broader understanding given finitely many samples.
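The gap between an exponentially small and an inverse-polynomially small tail is easy to quantify. The following sketch, with an illustrative x⁻² heavy tail of our choosing, compares the two at x = 10:

```python
import math

# Standard-Gaussian tail P(X > x) = erfc(x / sqrt(2)) / 2 versus an
# inverse-polynomial tail of the form x**(-2) (an illustrative heavy tail).
x = 10.0
gaussian_tail = 0.5 * math.erfc(x / math.sqrt(2.0))   # exponentially small
polynomial_tail = x ** -2.0                            # inverse-polynomially small
ratio = polynomial_tail / gaussian_tail                # ~ 10**21 at x = 10
```

A thin-tailed prior class assigns roughly twenty-one orders of magnitude less probability to this tail event than a polynomially tailed one.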
Rather than specifying a finite-dimensional Π, it is epistemologically more reasonable to specify a finite-codimensional Π, for example by specifying interval bounds on the expected values of finitely many observed test functions (i.e. generalized moment inequalities); this setting encompasses the finite-partition belief function framework mentioned above. Calculation of optimal prior and posterior bounds on quantities of interest is often an exercise in numerical optimization [20,82,90] rather than closed-form formulae.
One consequence of Theorems 4.8 and 4.13 is that the very same Bayesian sensitivity analysis framework that produces the robustness results of classical robust Bayesian inference under finite-dimensional classes of priors also leads to brittleness results under finite-codimensional classes of priors, when the set of all priors is infinite dimensional. As illustrated by (1.8) and Example 4.10, Theorems 4.8 and 4.13 can also be used to obtain robustness/stability results by adding sufficiently strong constraints (at the expense of learning) on the probability of the data in the model class. As discussed in Subsection 1.4, Example 4.10 suggests that posterior stability and learning are antagonistic properties in Bayesian inference under finite information.

Motivation for Bayesian Inconsistency and Model Misspecification
To motivate Section 6 and interpret the results of this paper in relation to the issue of convergence of posterior values in Bayesian inference we will now analyse and review questions of Bayesian consistency, inconsistency and model misspecification. There is, of course, a large literature on these topics, and we will not attempt to be exhaustive in providing references; rather, our aims are: first, to give a short reminder on how Bayesian inference is currently employed in Uncertainty Quantification (UQ); second, to identify issues and popular beliefs about what one actually learns from Bayesian inference, and thereby motivate the results of this paper; and, last, to present sufficient references that the interested reader can find technical justification for the formal manipulations of this subsection.
In this subsection, we are interested in estimating Φ(µ † ) where Φ is a known quantity of interest function and µ † is an unknown (or partially known) probability measure on X . For the purposes of exposition, in this subsection, we assume that X = R k . One example of a quantity of interest, when X = R, is Φ(µ † ) := µ † [X ≥ a] (the probability that the random variable X distributed according to µ † exceeds the threshold value a). We also assume that we are given n independent samples d 1 , . . . , d n , each distributed according to µ † .
We will now present the parametric Bayesian answer to this problem. For the purposes of exposition, in this section, we restrict our attention to parametric Bayesian inference. We first introduce a family {µ(·, θ)}_{θ∈Θ} of probability distributions on X parametrized by θ ∈ Θ (commonly referred to as the model class). For the sake of simplicity, here we also assume that Θ is a finite-dimensional Euclidean space. Let A₀ := {µ(·, θ) | θ ∈ Θ}. Note that A₀ is a subset of M(X) that may or may not contain µ†. If µ† ∉ A₀, then the model is said to be misspecified; otherwise, the model is said to be well specified.
We next introduce p₀ ∈ M(Θ), a probability distribution on Θ (the prior distribution on θ). Let π₀ be the push-forward (measure) of p₀ under the map θ → µ(·, θ) (see [22,23], Sections 3.6, 3.7) and observe that π₀ is a probability distribution on A₀, i.e. π₀ ∈ M(A₀), and that π₀ is the distribution of the random measure µ(·, θ) when θ is distributed according to p₀. The next step is then to estimate Φ(µ†) via conditioning. Let pₙ ∈ M(Θ) be the posterior distribution of θ given the observation of the i.i.d. samples d₁, . . . , dₙ, as obtained using Bayes' formula, and let πₙ be the push-forward of pₙ. The Bayesian estimate of Φ(µ†) is then the posterior expected value of Φ, i.e. ∫_Θ Φ(µ(·, θ)) dpₙ(θ). For the purposes of exposition, we assume that the measures µ(·, θ) and µ† are all absolutely continuous with respect to the Lebesgue measure and write f(·, θ) and f† for their densities, which we assume to be continuous. Similarly, we assume that the measure p₀ is absolutely continuous with respect to the Lebesgue measure and, abusing notation, write p₀ for both the measure p₀ and its (continuous) density, and similarly for pₙ(·), the posterior density of θ on Θ given the observation of the samples d₁, . . . , dₙ. We will now examine the convergence properties of the sequence of posterior densities pₙ(θ) as n → ∞. This analysis being classical (see for instance [79] and references therein), our purpose is not to provide rigorous justifications but rather to familiarize the reader with the mechanisms governing the convergence of posteriors.
We have pₙ(θ) ∝ p₀(θ) ∏_{j=1}^n f(d_j, θ), which we write as pₙ(θ) ∝ p₀(θ) e^{n Lₙ(θ)} with Lₙ(θ) := (1/n) ∑_{j=1}^n log f(d_j, θ). Recall that ∏_{j=1}^n f(d_j, θ) is commonly known as the likelihood and Lₙ(θ) as the (sample) average log-likelihood.
Consistency and the Large-Sample Limit. Now observe that if log f(d_j, θ) is integrable then it follows from the Law of Large Numbers that Lₙ(θ) converges almost surely, as n → ∞, to the expected log-likelihood L(θ) defined by L(θ) := ∫ f†(x) log f(x, θ) dx. Assuming that L(θ) has a unique maximizer θ* ∈ Θ (corresponding to the asymptotic limit of the maximum likelihood estimator (MLE) as the number of data points goes to infinity) and that p₀ is strictly positive in every neighborhood of θ*, it follows under regularity assumptions on f (or local strict convexity in the neighborhood of θ*) that pₙ(θ) converges almost surely, as n → ∞, towards a Dirac mass supported at θ*. Therefore, assuming Φ to be sufficiently regular, the Bayesian posterior estimate of Φ(µ†) converges almost surely as n → ∞ to Φ(µ(·, θ*)). Observe that L(θ) = −D_KL(f† ∥ f(·, θ)) + ∫ f†(x) log f†(x) dx, where D_KL(f† ∥ f(·, θ)) := ∫ f†(x) log (f†(x)/f(x, θ)) dx is the relative entropy (Kullback-Leibler divergence). It follows that θ* is also the minimizer of D_KL(f† ∥ f(·, θ)) with respect to θ, i.e. the MLE θ* is characterized by the property that µ(·, θ*) is the distribution having minimal relative entropy to µ† in the model class {µ(·, θ)}_{θ∈Θ}. An immediate consequence of this observation is the fact that if the model is not misspecified, i.e. if µ† is an element µ(·, θ†) of the model class, then θ* = θ†, µ(·, θ*) = µ†, and the Bayesian estimate (5.3) is asymptotically exact in the limit as n → ∞. In this situation, the Bayesian estimate is said to be consistent.
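The characterization of θ* as a relative-entropy projection can be checked numerically. The sketch below is our own illustration, not material from the paper: data are drawn from an assumed Exponential(1) truth (mean 1, standard deviation 1) and a Gaussian model is fitted by maximum likelihood; the estimates approach the truth's mean and standard deviation, the parameters of the KL-closest Gaussian.

```python
import math
import random

# Under a misspecified Gaussian model N(c, sigma^2), the MLE converges to the
# Kullback-Leibler projection of the truth onto the model class: the Gaussian
# with the truth's mean and standard deviation.
random.seed(0)
n = 200_000
data = [random.expovariate(1.0) for _ in range(n)]   # non-Gaussian truth

c_hat = sum(data) / n                                          # Gaussian MLE of c
sigma_hat = math.sqrt(sum((x - c_hat) ** 2 for x in data) / n) # Gaussian MLE of sigma

# Both approach (1, 1), the mean and std of Exponential(1), even though the
# exponential density looks nothing like a Gaussian.
```

With 2 × 10⁵ samples, both estimates land within roughly one percent of (1, 1).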
This convergence result is known as the Bernstein-von Mises Theorem (see for instance [79, Theorem 5]) or as the Bayesian Central Limit Theorem, since the limiting posterior can even be described in a more refined way as being asymptotically normal and not just a point mass. The condition that every open neighborhood of θ† has strictly positive p₀-probability (or, even more strongly, that the prior be globally supported) has been named Cromwell's Rule by Lindley [72].
Recent results [30,61,71,79] on the Bernstein-von Mises phenomenon show a notable dependence of the validity of the Bernstein-von Mises property upon subtle geometrical and topological details, and regularity properties of the model and the data-generating distribution. Therefore, it is to be expected that any general stability condition for Bayesian inference would have to take account of such factors.
What Happens When the Model is Misspecified? To provide an illustrative answer to this question, consider the family of Gaussian models µ(·, θ) := N(c, σ²) with θ = (c, σ). What will happen when this model is exposed to data coming from a potentially non-Gaussian truth µ†, with density f†, that has a well-defined mean c† and standard deviation σ†? By the above considerations, θ* maximizes the expected log-likelihood (5.2) with respect to θ, and the expected log-likelihood is simply L(c, σ) = −log σ − (σ†² + (c† − c)²)/(2σ²) − (1/2) log(2π). A quick calculation using partial derivatives shows that θ* = (c*, σ*) maximizes (5.5) if and only if c* = c† and σ* = σ†. That is, the Bayesian estimate (5.1) of Φ(µ†), for any distribution µ† of mean c† and standard deviation σ†, converges almost surely, as the number of sample data goes to infinity, towards Φ(µ(·, (c†, σ†))), where µ(·, (c†, σ†)) is the unique Gaussian distribution on R with mean c† and standard deviation σ†. However, now there is a problem: there are many different probability distributions µ on R that have the same first and second moments as µ† but have, say, different higher-order moments, or different quantiles. Predictions of those other moments or quantiles using µ(·, (c†, σ†)) can be inaccurate by orders of magnitude. A trivial, albeit extreme, example is furnished by Φ(µ) := µ[|X − c_µ| ≥ tσ_µ] (where c_µ and σ_µ denote the mean and standard deviation of µ). Under the Gaussian model, Φ(µ) = 1 − erf(t/√2) (defining erf(z) := (2/√π) ∫₀^z e^{−s²} ds as the error function), whereas the extreme cases that prove the sharpness of Chebyshev's inequality (in which the probability measure is a discrete measure with support on at most three points in R) have Φ(µ) = min(1, 1/t²). In the case of the archetypically rare "6σ event", the ratio between the two is approximately 1.4 × 10⁷. This is, of course, an almost perversely extreme comparison: it would be obvious to any observer with only moderate amounts of sample data that the data were being drawn from a highly non-Gaussian distribution.
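The 6σ figure above can be reproduced in a few lines (a sketch of the comparison, not part of the paper's formal development):

```python
import math

# Under the fitted Gaussian, P(|X - c| >= t*sigma) = erfc(t / sqrt(2)); the
# extremal (at most three-point) measures attaining equality in Chebyshev's
# inequality give 1 / t**2 instead.
t = 6.0
gaussian = math.erfc(t / math.sqrt(2.0))   # ~ 1.97e-9
chebyshev = 1.0 / t**2                     # ~ 2.78e-2
ratio = chebyshev / gaussian               # ~ 1.4e7, as stated in the text
```

Two measures with identical first and second moments can thus disagree on this tail probability by a factor of about 1.4 × 10⁷.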
However, it is not inconceivable that the true distribution µ † has a Gaussian-looking bulk but tails that are significantly fatter than those of a Gaussian, and the difference may be difficult to establish using reasonable amounts of sample data; yet, it is those tails that drive the occurrence of "Black Swans", catastrophically high-impact but low-probability outcomes. The results of this paper suggest that this situation is generic, and cannot be avoided no matter how many moments or integrals of arbitrary test functions of the truth µ † are matched nor how "close" µ † is to the class {µ( · , θ)} θ∈Θ .

Bayesian Inconsistency and Model Misspecification
To quote [79], "[w]hile for a Bayesian statistician the analysis ends in a certain sense with the posterior, one can ask interesting questions about the properties of posterior-based inference from a frequentist point of view." Many of these questions are asymptotic in nature: for example, in the limit of infinitely many independent µ†-distributed samples, will the posterior converge in a suitable sense to µ† regardless of the initial choice of prior π? This property is referred to as consistency; a general survey of consistency results is found in [99]. As noted above, the consistency theorem is generically known as the Bernstein-von Mises theorem [19,96], although the earliest rigorous proofs are due to Doob [38] and Le Cam [69]. Unfortunately, Cromwell's Rule is only necessary, and not sufficient, to ensure consistency. In fact, consistency is far from being a generic property, and once the probability space contains infinitely many points (and hence any parameter space Θ that parametrizes all probability measures on that probability space is infinite-dimensional), inconsistency is not the exception, but the rule [36]. In [48, Sec. 5], the parameter space Θ is the set of probability distributions θ on the positive integers, and each θ gives rise to a probability distribution P_θ = µ(·, θ) under which the observations X₁, X₂, . . . are i.i.d. with P_θ[Xₙ = i] = θ(i). The problem is assumed to be well-specified, so that one particular θ† ∈ Θ is considered to be the "true" parameter value, and the frequentist data-generating distribution is µ† = P_{θ†} = µ(·, θ†). Theorem 5 of [48] shows that, when supp(µ†) is infinite, given any "spurious" probability distribution Q = P_q, there exists a prior probability measure π on Θ that has θ† in its support, such that the posterior of π µ†-almost surely concentrates on q in the limit of observing infinitely many i.i.d. µ†-distributed samples.
In fact, there is a prior that gives positive mass to every open subset of Θ but yields consistent posterior estimates for only a first-category set of possible "true" (data-generating) parameter values θ † .
There are conditions on priors that do ensure consistency in infinite-dimensional or non-parametric contexts, e.g. the tail-free priors introduced by Freedman in [48] and hybrid Bayesian-frequentist tools such as Dirichlet process priors [52]. However, while the collection of "bad" priors that lead to inconsistent results is measure-theoretically small [38,28], it is topologically generic [49].
Remark 5.1. It is probably fair to say that, despite their popularity and documented successes, Bayesian methods have always attracted some degree of controversy and opposition: see e.g. [51] and rejoinders for a recent academic discussion, and [73,78] for less formal treatments. Often, this opposition is philosophical in nature, particularly with regard to the subjective interpretation of the probabilities involved, which is something that remains counter-intuitive to many commentators: see [44, par. 35 & 37] for a recent example in law. However, there are also analytical reasons to be careful about the application of Bayesian methods [88,76,43]. It is, in fact, now well understood that Bayesian methods may fail to converge or may converge towards the wrong solution if the underlying probability mechanism allows an infinite number of possible outcomes [35] and that, in these non-finite-probability-space situations, this lack of convergence (commonly referred to as Bayesian inconsistency) is the rule rather than the exception [36]. There is now a wide literature of positive [19,30,38,67,69,96,92] and negative results [12,35,48,47,61,71] on the consistency properties of Bayesian inference in parametric and non-parametric settings, and an emerging understanding of the fine topological and geometrical properties that determine (in)consistency.
It is important to appreciate that the requirement of positive prior mass in every neighborhood of the true distribution depends upon the topology placed upon M(X ). For example, Schwartz [86] shows that every π that puts positive mass on all Kullback-Leibler (relative entropy) neighborhoods of µ † is weakly consistent. On the other hand, Freedman [48] and Diaconis & Freedman [35] show that π may put positive mass on all weak neighborhoods of µ † and still fail to be weakly consistent -e.g. by not being tail-free. Nor are results limited to weak convergence of the posterior to µ † . For example, [9] shows that consistency holds in the Hellinger distance if π puts positive mass on all Kullback-Leibler neighborhoods of µ † and certain smoothness and tail conditions are satisfied; see [98,101] for further results on Hellinger and Kullback-Leibler consistency. The amount of prior probability mass that lies Kullback-Leibler-close to the truth, quantified using a notion called thickness, can be used to quantify the convergence properties of Bayes estimates [1,2,74]. However, it is important to note that, in the infinite-dimensional contexts that are increasingly subject to Bayesian analyses, results like the Feldman-Hájek dichotomy [45,56] suggest that probability measures are 'usually' mutually singular and 'rarely' mutually absolutely continuous, and so the Kullback-Leibler neighborhoods of µ † are 'small' sets that are 'unlikely' to intersect the model class.
The situation in which there is no θ † ∈ Θ such that µ † = µ( · , θ † ) is referred to as model misspecification. The consistency and other asymptotic properties of misspecified models appear to have first been considered by Berk [17,18] and Huber [59]. See [66,67] for a recent contribution, and [74] for convergence rates.
"In practice, Bayesian inference is employed under misspecification all the time, particularly so in machine learning applications. While sometimes it works quite well under misspecification [21,66], there are also cases where it does not [31,50], so it seems important to determine precise conditions under which misspecification is harmful -even if such an analysis is based on frequentist assumptions." [53] There is a reasonable popular belief that gross misspecification of the model will be detected by some means before engaging in a serious Bayesian analysis; indeed there do exist tests [57,106] for model misspecification, but it is important to note that while one can determine that the model is misspecified, one cannot be sure that the model is well-specified. There is also an understandable popular belief that these tests mean that one need only be concerned with the situation of "mild misspecification", and that provided µ † lies "close enough" to the model class {µ( · , θ)} θ∈Θ , the posterior estimates will still converge to a usefully informative limit.
Remark 5.2. This belief echoes G. E. P. Box's statement [26, p. 424] that "essentially, all models are wrong, but some are useful" and his question [26, p. 74]: "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful?" In terms of the above discussion, one purpose of this paper is to explore the extent to which one can simultaneously have robust Bayesian analyses that produce consistent answers, given that the models used (both priors and likelihoods) are certain to be misspecified to some degree. Can one be "just a little bit wrong" in terms of model misspecification? Our results suggest that the answer is negative within the classical framework of Bayesian sensitivity analysis, when "closeness" is measured in terms of the total variation and Prokhorov metrics or in terms of a finite (but possibly large) number of marginals of the data-generating distribution.
In particular, one aim of Section 6 is to show that this belief is wrong if "mild misspecification" is measured using the Prokhorov or the total variation metrics, the number of samples is finite (but possibly arbitrarily large), and convergence is required to hold uniformly in an arbitrarily small neighborhood of the model.

Remark 5.3. It is known from the Bernstein-von Mises theorem [19,96] that, in finite-dimensional situations, posterior values converge towards the quantity of interest if the prior distribution has strictly positive mass in every neighborhood of the truth (see also [69,79]). It is also known that "even for the simplest infinite-dimensional models, the Bernstein-von Mises theorem does not hold" [32,47]. This possible lack of convergence, referred to as the consistency problem, has been at the center of a debate between frequentists and Bayesians. We quote Diaconis and Freedman [35] (see also [36]): "If the underlying mechanism allows an infinite number of possible outcomes (e.g., estimation of an unknown probability on the integers), Bayes estimates can be inconsistent: as more and more data comes in, some Bayesian statisticians will become more and more convinced of the wrong answer." What is the significance of Theorem 4.13 in that discussion? To answer this question, consider Example 4.9 (and 4.19), in which one is interested in estimating the probability (under the unknown measure µ†) that X exceeds a after observing n independent samples. We already know from [35,32] that placing priors on the infinite-dimensional space A = M([0, 1]) of probability measures on [0, 1] is unlikely to lead to Bayesian posteriors that will converge towards the true value as more and more data comes in. One strategy to circumvent this lack of convergence would be to consider a finite-dimensional subset of A, i.e.
a family (µ λ ) of probability measures on [0, 1] indexed by a finite-dimensional parameter λ ∈ R k , put a strictly positive prior p on λ ∈ R k , and then invoke the Bernstein-von Mises theorem to guarantee the convergence of posterior values.
However, the Bernstein-von Mises theorem requires that the true distribution under which the data is sampled belongs to {µ_λ | λ ∈ R^k}, the parametrized finite-dimensional subset of A. What happens when this is not the case, i.e. in the situation of misspecification? Write π_p for the push-forward of the prior p on λ ∈ R^k to a prior on A under the map λ → µ_λ. Assume that the data have been sampled from π† · D, where π† is the (frequentist) true distribution. Here Theorem 4.13, as illustrated in Example 4.16, can be used to show that the posterior values of the quantity of interest under π_p and π† may lie near the opposite extreme values of Φ in A even if (1) π† is a Dirac mass on a measure µ† ∈ A; (2) the number of independent samples is large; and (3) k is large and k moments of µ† and µ_{λ*} are equal for some λ* ∈ R^k.

Remark 5.4. One popular method for detecting failure of convergence under model misspecification is to divide the data into data used for calibrating the parameters of the model and data used for validating the accuracy or predictability of the (calibrated) model. This approach, oftentimes described as "frequentist" [55,11], could be used to validate Bayesian calculations [43]. Although the detection (of the lack of predictability of the model) is asymptotically robust, it requires the availability of sufficient data.
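The calibration/validation idea of Remark 5.4 can be sketched as follows. The quantities here are illustrative assumptions of ours (an Exponential(1) truth, a Gaussian model, a 3σ tail event), not the paper's formal procedure: the model is calibrated on half the data, and its predicted tail probability is compared with the empirical frequency on the held-out half.

```python
import math
import random

# Fit a Gaussian on calibration data from a non-Gaussian truth, then check on
# validation data whether a 3-sigma tail event occurs as often as predicted.
random.seed(1)
data = [random.expovariate(1.0) for _ in range(100_000)]   # non-Gaussian truth
calib, valid = data[:50_000], data[50_000:]

c = sum(calib) / len(calib)
s = math.sqrt(sum((x - c) ** 2 for x in calib) / len(calib))

threshold = c + 3 * s
model_tail = 0.5 * math.erfc(3 / math.sqrt(2.0))           # Gaussian prediction
empirical_tail = sum(x > threshold for x in valid) / len(valid)

# For Exponential(1) the true tail near c + 3s ~ 4 is about e^{-4} ~ 0.018,
# roughly 13x the Gaussian prediction of ~ 0.00135: misspecification detected.
```

The mismatch between predicted and observed exceedance frequencies flags the misspecification, but only because enough validation data is available to resolve an event of probability ~10⁻².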

Brittleness under Local Misspecification
The purpose of this section is to present brittleness results with respect to local perturbations in the total variation and Prokhorov metrics. Thus, whereas the examples given for Theorem 4.13 highlighted that no finite number of common moments would be sufficient to constrain two priors to give nearby posterior values for the quantity of interest, this section shows that closeness in the TV and Prokhorov metrics is also insufficient to ensure robustness.
We now establish a corollary to the proof of Theorem 4.13, which we will then use to establish an extreme brittleness theorem for a model with local misspecification. Recall that, for a map Ψ : A → Q, a map ψ : Ψ(A) → A is called a section of Ψ if Ψ ∘ ψ(q) = q for all q ∈ Ψ(A).

Theorem 6.1. Let A be a Suslin space, let Φ : A → R be measurable, let Q be a separable and metrizable space, and let Ψ : A → Q be measurable. Let Q ⊆ M(Q) be such that supp(Q) ⊆ Ψ(A) for all Q ∈ Q, and let the data space D be metrizable. See Figure 6.1 for an illustration of Theorem 6.1.

We now use Theorem 6.1 to develop a brittleness theorem for a model with local misspecification. To that end, let X be a Polish space so that, by [4, Thm. 15.15], M(X) endowed with the weak topology is Polish. Moreover, by [40, Thm. 11.3.3], we know that if we select a complete consistent metric d for X, then the Prokhorov metric d_M defined by d_M(µ, ν) := inf{ε > 0 | µ[A] ≤ ν[A^ε] + ε for all A ∈ B(X)}, where A^ε := {x ∈ X | d(x, A) < ε} is the ε-neighborhood of A, metrizes the weak topology on M(X). Moreover, Prokhorov's theorem [40, Cor. 11.5.5] asserts that the Prokhorov metric d_M is a complete metric for the Polish space M(X). For α > 0 and µ ∈ M(X), let B_α(µ) := {µ′ ∈ M(X) | d_M(µ, µ′) < α} be the open ball of Prokhorov radius α about µ.
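As a quick sanity check on this definition (a sketch of ours, not material from the paper): for two Dirac masses the infimum can be computed by hand, giving d_M(δ_x, δ_y) = min(d(x, y), 1), since taking A = {x} forces ε ≥ min(d(x, y), 1), while any larger ε satisfies the defining inequality for every A.

```python
# Prokhorov distance between two Dirac masses on the real line:
# d_M(delta_x, delta_y) = min(|x - y|, 1). With A = {x}, delta_x[A] = 1 but
# delta_y[A^eps] = 0 whenever eps <= |x - y|, so the inequality
# mu[A] <= nu[A^eps] + eps forces eps >= min(|x - y|, 1).

def prokhorov_dirac(x, y):
    return min(abs(x - y), 1.0)

near = prokhorov_dirac(0.25, 0.75)   # 0.5
far = prokhorov_dirac(0.0, 2.5)      # 1.0: d_M never exceeds 1
```

In particular, moving a point mass by less than α produces a measure inside the Prokhorov ball B_α, the basic mechanism behind the perturbed classes A_α below.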
Let Θ be a Polish space and let the model define a map P : Θ → M(X ).
As in Section 5.2, the image P(Θ) is referred to as the (Bayesian) model class. Assume that P is measurable and denote its image by A 0 := P(Θ). Let π Θ ∈ M(Θ) be a prior distribution on Θ and let π 0 := Pπ Θ ∈ M(A 0 ) be its pushforward. Let Φ 0 : M(X ) → R be a measurable quantity of interest. We are interested in estimating Φ 0 using the prior π 0 and our purpose is to show the extreme brittleness of this estimation under arbitrarily small perturbations of the model class A 0 in both the Prokhorov and total variation metrics.
For conditioning on observations, let the data space be D := Xⁿ, and consider the n-fold i.i.d. sample data map D₀ⁿ : M(X) → M(Xⁿ) defined by D₀ⁿµ := µ ⊗ · · · ⊗ µ (n factors), for µ ∈ M(X).
To define α-perturbations of the model class A₀ in the Prokhorov metric, we introduce, for α > 0, the α-neighborhood A_α ⊆ M(X) of A₀ defined by A_α := ∪_{µ∈A₀} B_α(µ) (see Figure 6.2).

Figure 6.2: The original model class A₀ (black curve) is enlarged to its metric neighborhood A_α (shaded). This procedure determines perturbations µ_α ∈ A_α of the original random measure µ₀ ∈ A₀.

It is easy to see that the ball fibration (see Remark 6.8)
of the set of balls about points of A₀ projects onto A₀ under P₀ and onto A_α under P_α, where P₀ : M(X) × M(X) → M(X) is the projection onto the first component and P_α the projection onto the second. The naturally induced set of priors corresponding to π₀ ∈ M(A₀) is therefore the set Π_α ⊆ M(A_α) defined by

Π_α := {π_α ∈ M(A_α) | ∃π ∈ M(A) with P₀π = π₀ and P_απ = π_α}. (6.11)

Remark 6.3. Observe that each element π_α ∈ Π_α is the distribution of a random measure µ₂ on A_α such that: (i) there exists a random measure µ₁ ∈ A₀ with distribution π₀ (that of the model); (ii) (µ₁, µ₂) is jointly measurable; and (iii) with probability one the Prokhorov distance from µ₂ to µ₁ is less than α, i.e. d_M(µ₁, µ₂) < α. Observe in particular that π₀ ∈ Π_α.
Our main result is provided in Theorem 6.9 but for the sake of clarity we will first give this result in the following (simpler) form.
Theorem 6.4. Using the notation introduced above and the data map (6.5), let Π_α be defined as in (6.11). If the vanishing small-ball condition (6.12) holds, then, for all α > 0, there exists δ_c(α) > 0 such that, for all 0 < δ < δ_c(α), all n ∈ N, and all sample data (x_1, …, x_n) ∈ X^n, the optimal upper bound U on posterior values over Π_α is as large as possible (the essential supremum of the quantity of interest under the induced priors), with similar expressions for the lower bounds L.

Remark 6.5. Theorem 6.4 implies the extreme brittleness of Bayesian inference under local misspecification. Indeed, assume that the model class A_0 is well specified (i.e. it contains the truth µ†) and that, therefore, the Bayesian estimator described by π_0 is consistent. One may believe that a model A_1 lying in a 'small enough' neighborhood of A_0 should have good convergence properties; Theorem 6.4 and Remark 6.3 invalidate this belief, at least as far as the TV and Prokhorov notions of 'small enough' are concerned. Using the notation of Remark 6.3, observe in particular that an unscrupulous practitioner may design a model corresponding to a random measure µ_2 such that the distance between µ_1 (the well-specified model) and µ_2 is a.s. at most α (where α is arbitrarily small) and yet the posterior value computed from µ_2 is as distant as possible from the posterior value computed from µ_1, irrespective of the sample size n.
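This mechanism can be caricatured numerically. The following sketch (a toy construction, not the one used in the proofs) perturbs a Gaussian model class at a single extreme parameter value by mixing in total mass α of narrow bumps centred at the observed data; the perturbed model is within total variation distance α of the original (hence also within Prokhorov distance α), yet its posterior concentrates at the extreme parameter. The Gaussian model, the standard normal prior, and all numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, width = 20, 0.01, 1e-6
x = rng.normal(0.0, 1.0, n)                 # data from the "truth" theta = 0

thetas = np.linspace(-5.0, 5.0, 201)
log_prior = -0.5 * thetas**2                # prior N(0, 1), up to a constant

def log_phi(z):
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

# Log-likelihood of the data under the well-specified model N(theta, 1).
loglik = np.array([log_phi(x - t).sum() for t in thetas])

# Perturbed model: at theta = 5 only, mix in total mass alpha of narrow
# uniform bumps of width `width` centred at the observed data, so that
# d_TV(perturbed, original) <= alpha.  For distinct data points the bumps
# are disjoint, so each x_i only sees its own bump.
loglik_pert = loglik.copy()
spike_density = alpha / (n * width)         # bump height at each x_i
loglik_pert[-1] = np.log((1 - alpha) * np.exp(log_phi(x - thetas[-1]))
                         + spike_density).sum()

def post_mean(ll):
    w = np.exp(log_prior + ll - (log_prior + ll).max())
    return (thetas * w).sum() / w.sum()

print(post_mean(loglik), post_mean(loglik_pert))   # ~0 vs ~5
```

As width → 0 the likelihood spike grows without bound, so no sample size n can repair the α-small perturbation; this is the data-dependent "worst model" mechanism, here made explicit under toy assumptions.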
Remark 6.6. Observe that condition (6.12) is extremely weak and is satisfied by most Bayesian models. It can in fact be weakened further by replacing it with the assumption that, for n sufficiently large, for all θ, P(θ) does not place a Dirac mass in each ball B_δ(x_i) (i.e. on the sample data as δ ↓ 0). We also note that the proof of Theorem 6.4 does not require the samples to be i.i.d.; in particular, the same results can be obtained with coupled samples if, for instance, the data map D^n_0 is replaced by a data map D^n satisfying an analogous condition.

Remark 6.7. Theorem 6.4 is a corollary of Theorem 6.9, and the proof of Theorem 6.9 shows that, if Θ is compact, P is continuous, and Φ(µ) := µ(A) for some fixed A ∈ B(X), then the result of Theorem 6.4 also holds when the total variation distance d_TV is used instead of the Prokhorov distance, which produces a much smaller neighborhood.
However, in the total variation metric, M(X) is in general not separable, and this introduces measurability difficulties. These difficulties can be partially overcome when Θ is compact and P is continuous, since the image of a compact set under a continuous map is compact and therefore measurable. Moreover, validation- or certification-type quantities of interest defined by Φ(µ) := µ(A) for some fixed A ∈ B(X) are easily seen to be continuous, and therefore measurable, in the total variation metric. Our motivation for working mainly with the Prokhorov metric lies in the fact that we also seek to lay down measurability foundations for the scientific computation of optimal statistical estimators, where the unknown quantities are products of functions and measures, and for such spaces the total variation metric is too strong for the measurability of standard quantities of interest.
We will now give a more general version of Theorem 6.4 and elaborate on the objects entering into its formulation. We start with a set Π_Θ ⊆ M(Θ) of admissible priors and let Π_0 := PΠ_Θ ⊆ M(A_0) denote its push-forward by the model P. We consider the pull-back Φ_Θ := Φ_0 ∘ P of the measurable quantity of interest Φ_0 : M(X) → R to a measurable quantity of interest Φ_Θ : Θ → R. Then the change of variables formula [40, Thm. 4.1.11] implies that, for π_Θ ∈ M(Θ),

E_{π_Θ}[Φ_Θ] = E_{P π_Θ}[Φ_0],

whenever either side is well defined. Therefore, taking suprema and infima, we obtain

U(Π_Θ) = U(Π_0) and L(Π_Θ) = L(Π_0),

where we note that the quantity of interest implicit in these definitions is determined by the argument. For α > 0, define A_α, A, P_0 and P_α as in (6.7), (6.8), (6.9) and (6.10).
Remark 6.8. Using the affine convexity of M(X), one can show that A is indeed a Hurewicz fibration, in that it has the homotopy lifting property; see e.g. [91, p. 66].

For a subset Π_0 ⊆ M(A_0), the projection identity (6.9) implies that the set Π := P_0^{-1} Π_0 defined by

P_0^{-1} Π_0 := { π ∈ M(A) | P_0 π ∈ Π_0 }

is the induced set of probability measures on A. Moreover, for π ∈ Π, the change of variables formula shows that the push-forward P_α Π is the induced set of probability measures on A_α. Let us denote this induced set by

Π_α := P_α Π, (6.13)

so that the above equalities extend accordingly. For conditioning on observations, define D^n_0 as in (6.5) and pull it back to the data map D^n : M(X) × M(X) → M(X^n) defined by D^n := D^n_0 ∘ P_α. Define B^n_δ as in (6.6) and recall the definition (4.1)
of the conditional expectation and the corresponding upper value (4.5) in terms of the admissible set (4.3) of product measures, with the marginal defined accordingly. Let us indicate the dependence on a measure π of the essential supremum of a quantity of interest Φ by

π^∞(Φ) := ess sup_π Φ, (6.14)

and, for a set of measures Π, set Π^∞(Φ) := sup_{π ∈ Π} π^∞(Φ). For π_α = P_α π with π ∈ Π, we have π^∞(Φ) = π_α^∞(Φ_0), so that we conclude that Π^∞(Φ) = Π^∞_α(Φ_0). Let us now quantify a type of regularity for the model P through the small-ball modulus P^∞, which measures the largest mass that any model measure P(θ) assigns to a metric ball of radius δ. It is clear that P^∞ : R_+ → [0, 1] is an increasing function. Moreover, for most parametric families it is easy to show that P^∞ is continuous with P^∞(0) = 0, and for many of them it is not difficult to find useful upper bounds. Finally, let us assume that the model P is positive, in the sense that µ(B_δ(x)) > 0 for all µ ∈ A_0, x ∈ X, and δ > 0. Theorem 6.4 is a direct consequence of the following theorem.

Theorem 6.9 (Brittleness under Local Misspecification). With the notation and assumptions above, let Π_α be defined as in (6.13), and let δ > 0 and 0 < α < 1 satisfy the required smallness relation between α, δ, and the modulus P^∞. Then, using D^n_0 for the distribution of the data, for all integers n ≥ 1, the optimal upper bounds on posterior values are as large as the induced priors allow, with similar expressions for the lower bounds L.
Remark 6.10. When Cromwell's rule (see Section 5.2) is implemented (i.e. if the prior measure of every non-empty open neighborhood is strictly positive), it follows that Π^∞_0(Φ_0) = U(A_0), so that the conclusion of Theorem 6.9 can be stated with U(A_0) in place of Π^∞_0(Φ_0).

Remark 6.11. Theorem 6.9 provides sufficient conditions under which posterior conclusions can be as bad as possible regardless of how many samples are taken. One might hope that, when these conditions are not satisfied, more samples may prove beneficial. However, when the corresponding infimum condition fails, the quantitative version of Theorem 4.13 (given in [80, Thm. 3.1]; see also [80, Rmk. 3.2]) implies that things actually get 'worse' with more samples.

Conclusions and Further Developments
In this paper, we have looked at the robustness of Bayesian Inference in the classical framework of Bayesian Sensitivity Analysis. In that (classical) framework, the data is fixed, and one computes optimal bounds on (i.e. the sensitivity of) posterior values with respect to variations of the prior in a given class of priors. Although robustness is already well established when the class of priors is finite dimensional, we observe that, under general conditions, when the class of priors is finite codimensional, the optimal bounds on posterior values are as large as possible, no matter the number of data points. Our motivation for specifying a finite codimensional class of priors is to examine what classical Bayesian sensitivity analysis would conclude under finite information, and the best way to understand this notion of "brittleness under finite information" is through the simple example provided in Subsection 1.2. The mechanism causing this "brittleness" has its origin in the fact that, in classical Bayesian sensitivity analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small (the example given in Subsection 1.3 provides an illustration of this phenomenon). This data dependence of worst priors is inherent to this classical framework, and the resulting brittleness under finite information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds on prior values may become less precise after conditioning) observed in classical robust Bayesian inference [103]. Although these worst priors do depend on the data, "look nasty", and make the probability of observing the data very small, they are not "isolated pathologies" but directions of instability (of Bayesian conditioning), and their number increases with the number of data points.
The example given in Subsection 1.4 provides an illustration of this point and also suggests that learning and robustness are, to some degree, antagonistic properties: a strong constraint on the probability of the data makes the method robust but learning impossible and, as the constraint is relaxed, learning becomes possible but posterior values become brittle.
Since "brittleness under finite information" appears to be inherent to classical Bayesian Sensitivity Analysis (in which worst priors are computed given the specific value of the data), one may ask whether robustness could be established under finite information by exiting the strict framework of Robust Bayesian Inference and computing the sensitivity of posterior conclusions independently of the specific value of the data. To investigate this question, Hampel and Cuevas' notion of qualitative robustness has been generalized in [81] to Bayesian inference based on the quantification of the sensitivity of the distribution of the posterior distribution with respect to perturbations of the prior and the data generating distribution, in the limit when the number of data points grows towards infinity. Note that, contrary to classical Bayesian Sensitivity Analysis considered here, in the qualitative formulation the data is not fixed and posterior values are therefore analyzed as dynamical systems randomized through the distribution of the data. To express finite information, the total variation, Prokhorov, and Ky Fan metrics have been used to quantify perturbations and sensitivities.
Since this notion of qualitative robustness is established in the limit as the number of data points grows to infinity, it is natural to expect that the notion of consistency (i.e. the property that posterior distributions converge to the data-generating distribution) will play an important role. Although consistency is primarily a frequentist notion, it is also equivalent to intersubjective agreement, which means that two Bayesians will ultimately have very close predictive distributions; therefore, it also has importance for Bayesians. Fortunately, not only are there mild conditions which guarantee consistency, but the Bernstein-von Mises theorem goes further in providing mild conditions under which the posterior is asymptotically normal. The most famous of these consistency results are those of Doob [38], Le Cam and Schwartz [70], and Schwartz [86, Thm. 6.1]. Moreover, the assumptions needed for consistency are so mild that one can be led to the conclusion that the prior does not really matter once there is enough data. For example, we quote Edwards, Lindman and Savage [42]: "Frequently, the data so completely control your posterior opinion that there is no practical need to attend to the details of your prior opinion." To some, the consistency results appeared to generate more confidence than perhaps they should. We quote A. W. F. Edwards [41, p. 60]: "It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better." [81] shows that the Edwards defence is essentially what produces the lack of qualitative robustness in Bayesian inference. In particular, the assumptions required for consistency (e.g. the assumption that the prior has Kullback-Leibler support at the parameter value generating the data) are such that arbitrarily small local perturbations of the prior distribution (near the data-generating distribution) result in consistency or non-consistency, and therefore have large impacts on the asymptotic behavior of posterior distributions. These mechanisms are different from, and complementary to, those discovered by Hampel and developed by Cuevas; they suggest that consistency and robustness are, to some degree, antagonistic requirements (a careful selection of the prior is important if both properties, or their approximations, are to be achieved), and they also indicate that misspecification generates a lack of qualitative robustness.
In conclusion, the exploration of Bayesian inference in a continuous world has revealed both positive and negative results. However, positive results regarding the classical or qualitative robustness of Bayesian inference under finite information have yet to be obtained. To that end, observe that the example provided in Subsection 1.4 suggests that there may be a missing stability condition for Bayesian inference in a continuous world under finite information akin to the CFL condition for the stability of a discrete numerical scheme used to approximate a continuous PDE. Although numerical schemes that do not satisfy the CFL condition may look grossly inadequate, the existence of such perverse examples certainly does not imply the dismissal of the necessity of a stability condition. Similarly, although one may, as in the example provided in Subsection 1.3, exhibit grossly perverse worst priors, the existence of such priors does not invalidate the need for a study of stability conditions for using Bayesian Inference under finite information. The example of Subsection 1.4 suggests that, in the framework of Bayesian Sensitivity Analysis, under finite information, such a stability condition would strongly depend on how well the probability of the data is known or constrained in the model class in addition to the class of priors and the resolution of the measurements. It is natural to expect that such robustness and stability questions will increase in importance as Bayesian methods increase in popularity due to the availability of computational methodologies and environments to compute the posteriors. Indeed, when posterior distributions are approximated using such methods, the robustness analysis naturally includes not only quantifying sensitivities with respect to the data generating distribution and the choice of prior, but also the analysis of convergence and stability of the computational method. 
This is particularly true in Bayesian updating where Bayes' rule is applied iteratively and computed posterior distributions become prior distributions for the next iteration. Oftentimes these posterior distributions (which are then treated as prior distributions) are only approximated (e.g. via MCMC methods) and the Brittleness results discussed here and in [81] suggest that having strong convergence (of these MCMC methods) in TV would not be enough to ensure stability. At a higher level, these results appear to suggest that robust inference (in a continuous world under finite information) should be done with reduced/coarse models rather than highly sophisticated/complex models (and the level of "coarseness/reduction" would depend on the available "finite information").

Proof of Theorem 3.6
For q ∈ R^n, define Π(q) as the set of admissible measures satisfying the equality constraints determined by q, and let Π(q, n) := Π(q) ∩ ∆(n) ⊆ Π(q) be the subset consisting of (n+1)-fold convex combinations of Dirac masses. Using a 'layer-cake' approach, we use the facts that Π(Z) = ∪_{q∈Z} Π(q) and Π(Z, n) = ∪_{q∈Z} Π(q, n), apply Theorem 3.4 with the equality constraints Π(q), q ∈ R^n, and use the fact that the supremum over a union is the supremum of the suprema to obtain the reduction, which completes the proof.
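The content of this reduction can be illustrated numerically: to maximize a linear functional of µ subject to n generalized moment equality constraints, it suffices to search over convex combinations of at most n + 1 Dirac masses. The following sketch does this by brute force on a grid with n = 2 moment constraints (so 3-point measures suffice); the quantity of interest Φ and the moment targets are made up for illustration.

```python
import itertools
import numpy as np

# Maximize E_mu[Phi] over probability measures mu on [0, 1] subject to the
# n = 2 moment constraints E_mu[x] = 0.5 and E_mu[x^2] = 0.3.  By the
# reduction, the supremum (over the grid) is attained by a convex
# combination of at most n + 1 = 3 Dirac masses, so we enumerate 3-point
# supports and solve the resulting 3x3 linear system for the weights.
grid = np.linspace(0.0, 1.0, 41)
Phi = np.sin(5.0 * grid)
targets = np.array([1.0, 0.5, 0.3])     # total mass, E[x], E[x^2]

best_val, best = -np.inf, None
for idx in itertools.combinations(range(len(grid)), 3):
    pts = grid[list(idx)]
    A = np.vstack([np.ones(3), pts, pts**2])
    try:
        w = np.linalg.solve(A, targets)
    except np.linalg.LinAlgError:
        continue                        # guard; distinct points are nonsingular
    if (w >= -1e-9).all():              # keep only nonnegative (probability) weights
        val = float(w @ Phi[list(idx)])
        if val > best_val:
            best_val, best = val, (pts, w)

print(best_val, best)
```

For instance, the feasible 3-point measure with support {0, 0.5, 1} and weights (0.1, 0.8, 0.1) already achieves a value above 0.38, so the reported optimum is at least that large.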
Since T ⊂ Q is a subset of a separable metrizable space, [4, Cor. 3.5] implies that T is itself separable and metrizable. Consider the set-valued map with non-empty values Ψ^{-1} : T ⇉ A with graph G := {(q, µ) ∈ T × A | Ψ(µ) = q}. Let d be a metric that generates the topology of T and define h : T × A → R by h(q, µ) := d(Ψ(µ), q). Then, since d is continuous in each of its arguments, it follows that h is a Carathéodory function, as defined in Definition 9.2. Since T is separable and metrizable, Lemma 9.3 implies that h is B(T) ⊗ B(A)-measurable. Rewriting G = h^{-1}(0) then yields that G belongs to B(T) ⊗ B(A). Lemma 9.1 (through the identification S = A, s = µ, ϕ(t, s) = Φ(µ)) implies that the function U ∘ Ψ^{-1} : T → R defined for q ∈ T by q ↦ sup_{µ ∈ Ψ^{-1}(q)} Φ(µ) is B(T)-measurable, thereby establishing the first assertion. The second assertion then follows from the second part of Lemma 9.1.

Proof of Lemma 4.1
Consider the set and the proof is finished.

Proof of Theorem 4.8
First, we prove identity (8.4), where Π_+(q) is the set of positive finite measures π_+ on A such that E_{π_+}[Ψ(µ) − q] = 0 and E_{π_+}[D(µ)[B]] = 1. To that end, first observe that, for any π ∈ Π(q) such that E_π[D(µ)[B]] > 0, the measure obtained by normalizing π by E_π[D(µ)[B]] belongs to Π_+(q).
Since the above argument also shows that Π(q) ∩ {π | E_π[D(µ)[B]] > 0} is nonempty if and only if Π_+(q) is nonempty, (8.4) follows. The right-hand side of (8.4) is a linear program in π_+, so Theorem 4.7 implies that the supremum in π_+ can be achieved by assuming π_+ to be a weighted sum of at most n + 1 Dirac masses, i.e. by assuming that π_+ = Σ_{j=1}^{n+1} α_j δ_{µ_j} with α_j ≥ 0 and µ_j ∈ A. This finishes the proof of Theorem 4.8.

Proof of Theorem 4.11
First let us show that, for λ ∈ R, the statement that is equivalent to the statement that To that end, assume (8.7) and observe that the definition (4.3) of (Ψ −1 (Q)) B implies that π · D[B] > 0, where, by (4.4), Consequently, by (4.1), > λ, and the denominator is strictly positive. Therefore, which is a contradiction. Consequently, π · D[B] > 0 and dividing the assumption throughout yields (8.7) and the equivalence is established. The main assertion now follows from a direct application of Theorem 3.11. Finally, since Φ is semibounded, it follows that µ → Φ(µ)D(µ)[B] is semibounded and measurable, and the measure-affinity assertion follows from Lemma 9.9.

Proof of Theorem 4.13
Let us first establish that the assumptions of the theorem are well defined. To that end, note that Lemma 3.10 implies that q → inf µ∈Ψ −1 (q) D(µ)[B] is B(supp(Q))-measurable and hence (4.16) is well defined. Similarly (4.17) is well defined.
Proof of Theorem 6.9
We appeal to Theorem 6.1, the corollary to Theorem 4.13. To that end, let A be defined as in (6.8), and let Q := A_0, Ψ := P_0, and D := {D^n}.

Appendix
The following lemma is Lemma III.39 p. 86 of [29]. We also refer to p. 87 of [29] for the existence of the measurable selection η (which is also derived from Theorem III.38 p.85 of [29]). These results are related to Aumann's measurable section principle [7] (the extension to Suslin space is due to Sainte-Beuve [85]).
Lemma 9.1. Let (T, T) be a measurable space, let S be a Suslin space, let ϕ : T × S → R be a T ⊗ B(S)-measurable function, and let Γ be a multifunction (i.e. a set-valued map) from T to non-empty subsets of S whose graph G belongs to T ⊗ B(S). Then: 1. the function m(t) := sup{ϕ(t, x) | x ∈ Γ(t)} is a T-measurable function of t.
The following definition is Definition 4.50 in [4]: Definition 9.2. Let (S, Σ) be a measurable space, and let X and Y be topological spaces. A function h : S × X → Y is a Carathéodory function if: 1. for each x ∈ X, the function h x = h(., x) : S → Y is Σ, B(Y ) -measurable; and 2. for each s ∈ S, the function h s = h(s, .) : X → Y is continuous.
The following lemma is Lemma 4.51 in [4] (see also [29, p. 70]): Lemma 9.3. Let (S, Σ) be a measurable space, X a separable metrizable space, and Y a metrizable space. Then every Carathéodory function h : S ×X → Y is jointly measurable.

Universally Measurable Functions
For a topological space T, let B(T) denote the σ-algebra of universally measurable sets and, for a measure µ, let µ̄ denote its completion. Here we state a proposition that allows us to define the expected value of B(T)-measurable functions with respect to Borel measures. In all statements of the proposition, the assertions follow when the integrals involved exist, in particular for semibounded functions. The proof is straightforward but tedious, with the expectation defined by E_µ[f] := E_{µ̄}[f] when the latter exists, where µ̄ is the completion of the measure µ as described in [39, p. 37].
Recall that a carrier T for a probability measure Q ∈ M(Q) is a set T ∈ B(Q) such that Q(T) = 1. For a carrier T, since T ∈ B(Q), it follows that B(T) = B(Q) ∩ T, and we can define the trace measure Q_T ∈ M(T) by Q_T(A) := Q(A), A ∈ B(Q) ∩ T. The following proposition shows that the expectation of a function can be defined with respect to measures that possess carriers on which the function is universally measurable:

Proposition 9.6. Let S be a topological space. Suppose that f is B(T)-measurable for every measurable T ⊆ S, and that Q ∈ M(S) has a carrier T ⊆ S. Then, using Definition 9.5, any such carrier T defines an expectation E_{Q_T}[f], and this definition is independent of the carrier; that is, if T' ⊆ S is another carrier, then E_{Q_T}[f] = E_{Q_{T'}}[f]. Moreover, this expectation satisfies the affinity and monotonicity assertions of Proposition 9.4.
We also need a change of variables formula for expectations of universally measurable functions.
Proposition 9.7. Let X and Y be topological spaces, let Ψ : X → Y be a measurable map, and suppose that f : Y → R is B(Y)-measurable. Then f ∘ Ψ : X → R is B(X)-measurable and the change of variables formula holds for every π ∈ M(X).

For a Suslin space X and a subset M ⊂ M(X), let Σ(M) denote the smallest σ-subalgebra of subsets of M for which the evaluation map ν ↦ ν(B) is measurable for all B ∈ B(X). The following version of a result of von Weizsäcker and Winkler [97], as stated in [107, Thm. 3.1], will be useful to us.

Let T and T' be two carriers for Q ∈ M(S) and let f be a function such that f restricted to T and to T' is B(T)- and B(T')-measurable, respectively. Then Proposition 9.4 implies that there are functions f_1, f_2 measurable on T and f'_1, f'_2 measurable on T' whose corresponding expectations agree. Therefore we conclude that ∫ E_ν[f] dp(ν) = ∫_{ext(H)} F(ν) dp(ν), and the assertion is proved.

Proof of Lemma 9.10
Let us first establish that completion is additive, i.e. that the completion of ν_1 + ν_2 equals the sum of the completions of ν_1 and ν_2 for all ν_1, ν_2 ∈ M(X) (this is (9.3)), and that R_+K ⊂ R_+M(X) is a lattice cone in its own ordering. Observe that, since ext M(X) = {δ_x | x ∈ X} and the f_i are δ_x-integrable for all i = 1, …, n and x ∈ X, it follows that {δ_x | x ∈ X} ⊆ ext(K). Now suppose that ν ∈ ext(K) is not a Dirac mass. Then, as in the proof that the extreme points of M(X) are the Dirac masses (see e.g. [4, Thm. 15.9]), and using the fact that the support of ν must contain two or more points, we can decompose ν as a convex combination ν = αν_1 + (1−α)ν_2 with ν_1 ≠ ν_2 and α ∈ (0, 1). Moreover, f_i being ν-integrable implies that f_i is ν_j-integrable for j = 1, 2 and i = 1, …, n. Consequently, ν_j ∈ K for j = 1, 2. Since ν is an extreme point, we conclude that ν_1 = ν_2, which is a contradiction, and (9.5) follows. Now let us demonstrate that R_+K is a lattice cone in its own ordering. To that end, note that, by [84, Lem. 10.4], it suffices to show that R_+K ⊂ R_+M(X) is a hereditary subcone, in the sense that ν_1 ∈ R_+K, ν_2 ∈ R_+M(X), and ν_1 − ν_2 ∈ R_+K together imply that ν_2 ∈ R_+K. To that end, consider such ν_1 and ν_2. Then (9.3) implies that the completion of ν_1 − ν_2 equals the difference of the completions of ν_1 and ν_2, and since 0 ≤ ν_2 ≤ ν_1, the ν_1-integrability of the f_i implies their ν_2-integrability, from which we conclude that ν_2 ∈ R_+K. Hence, R_+K is a hereditary subcone, and the assertion then follows as in the proof of [107, Thm. 2.1].