Some models are useful, but how do we know which ones? Towards a unified Bayesian model taxonomy

Probabilistic (Bayesian) modeling has experienced a surge of applications in almost all quantitative sciences and industrial areas. This development is driven by a combination of several factors, including better probabilistic estimation algorithms, flexible software, increased computing power, and a growing awareness of the benefits of probabilistic learning. However, a principled Bayesian model building workflow is far from complete and many challenges remain. To aid future research and applications of a principled Bayesian workflow, we ask and provide answers for what we perceive as two fundamental questions of Bayesian modeling, namely (a)"What actually is a Bayesian model?"and (b)"What makes a good Bayesian model?". As an answer to the first question, we propose the PAD model taxonomy that defines four basic kinds of Bayesian models, each representing some combination of the assumed joint distribution of all (known or unknown) variables (P), a posterior approximator (A), and training data (D). As an answer to the second question, we propose ten utility dimensions according to which we can evaluate Bayesian models holistically, namely, (1) causal consistency, (2) parameter recoverability, (3) predictive performance, (4) fairness, (5) structural faithfulness, (6) parsimony, (7) interpretability, (8) convergence, (9) estimation speed, and (10) robustness. Further, we propose two example utility decision trees that describe hierarchies and trade-offs between utilities depending on the inferential goals that drive model building and testing.

Quantity of interest, its model-based estimator (function of θ) p(θ) Prior distribution of parameters p(y | θ) Likelihood function (explicit or implicit/simulation-based) p(θ, y) Joint distribution of parameters and observables p(θ | y) Posterior distribution of parameters given observables p A (θ | y) Approximate representation of posterior by approximator A G(·), p * (y) True data generator, true data-generating distribution Expected value of a quantity with respect to density p T, H Summary statistics of posterior, summary statistics of data

Introduction
Probabilistic (Bayesian) modeling has seen a surge of applications in almost all quantitative sciences and industrial areas [110,202,113,62,173,148]. This development is driven by a combination of several factors, including powerful probabilistic estimation algorithms [139,24,126,231,246], efficient post-processing [297,129], flexible open-source software [280,78,40], and increased information processing capacity. Furthermore, these factors are coupled with a growing awareness of the benefits of probabilistic modeling, such as inclusion of prior knowledge [220,205], regularization [108,41,26,241], or uncertainty quantification and propagation [142,202,113]. Despite these advances, creating and improving Bayesian models in the context of a principled Bayesian workflow [267,113] remains a complicated endeavor that requires expertise in various domains; these include subject matter knowledge about the system and the data it generates, statistical learning expertise, programming and understanding of software development, as well as knowledge of numerical approximation and simulation methods [173,113]. Thus, to aid future research on and applications of a principled Bayesian workflow, we ask and provide answers to what we hold to be two fundamental questions: 1. What actually is a Bayesian model? 2. What makes a good Bayesian model?
In current practice, the term Bayesian model is highly overloaded and used to describe a wide range of objects with potentially very different properties. Moreover, modern Bayesian models are more than just a likelihood and a prior -rather, they resemble complex simulation programs coupled with black-box approximators, interacting with various data structures and context variables, embedded within iterative workflows with multiple feedback loops [199,79,173,113,267]. Thus, we aim to disambiguate and structure the different meanings of a Bayesian model by proposing the PAD model taxonomy (see Section 2). Our taxonomy aims to accommodate modern uses of Bayesian models and provides an answer to Question 1. With a clear definition of Bayesian models in hand, we describe a collection of ten utility dimensions that can be used to quantify the goodness of Bayesian models holistically (see Section 3), thus providing an answer for Question 2. We then continue with a discussion of importance hierarchies and common trade-offs between utilities in Section 4 and end with a conclusion in Section 5.
This paper started as an attempt to organize our thoughts and provide a unifying and consistent language of Bayesian model building. To a certain extent, it is inevitably opinionated. Nevertheless, we aim to be comprehensive in the utility dimensions we discuss, such that all the goals we can sensibly ask from a Bayesian model to achieve have their place in this paper. In contrast, due to the large number of different topics we touch on in the process, the amount of details and cited literature per topic are necessarily non-exhaustive. The cited literature is only meant as a starting point for the interested reader to dive in deeper if they wish. In terms of the target audience, we hope that this paper will be helpful to both methodological researchers developing Bayesian models as well as users applying Bayesian models in practice.

What is a Bayesian Model?
As the term Bayesian model (or just model for that matter) can sustain multiple meanings depending on context, it can prove incredibly difficult to talk about models with sufficient clarity. As we will see later, different kinds of models may have different kinds of properties which need to be considered and prioritized by an analyst. Without clearly communicating the essential kind of model one has in mind, a discussion about its properties only contributes to the conceptual entropy in quantitative research. In this section, we attempt to resolve this issue by proposing the PAD taxonomy for Bayesian models (see Figure 1 for an overview; see also Table 1 for a quick reference of key concepts and corresponding notation). We will define four basic model classes and explain how they relate to each other. While the PAD taxonomy might be applicable and useful in other contexts, we will specifically expand on it from a Bayesian perspective.

P Models
We define P models by a joint probability distribution p(y, θ) over all quantities of interest whose potential variation or uncertainty we express in terms of probability theory. We assume that y represents all observable quantities (i.e., data, observations, or measurements) and θ represents all unobservable quantities (i.e., parameters, latent states, or system variables) within a particular modeling context. In most cases, the joint distribution factorizes into a likelihood p(y | θ) and a prior p(θ) via the chain rule of probability: p(y, θ) = p(y | θ) p(θ) (1) This conceptually simple factorization serves as the basis for the common generative (forward) notation used to denote a "probabilistic recipe" for creating synthetic data by sequentially sampling from the prior and the likelihood: The generative notation overloads the semantics of the "∼" operator, which attains a dual meaning of "distributed as" and "sampled from". Not all P models are created equal, but most are built to mimic a real-world process or a system, G, whose behavior we can observe or measure. Having some properties that are of interest to the analyst, the opaque generator G induces an unknown (true) data distribution p * (y), typically available only through finite observationsỹ ∼ p * (y) (i.e., real-world data). Accordingly, P models strive to encode probabilistic information about the true distribution p * (y) and/or structural information about the true generator G. The former means that our model matches the statistical properties of p * either a priori, p(y) ≈ p * (y), or a posteriori p(y |ỹ) ≈ p * (y), where p(y) and p(y |ỹ) are the prior and posterior predictive distributions of P, respectively. The latter means that our parameters θ correspond to some relevant (hidden) properties φ of G, for which we endeavor to learn something by analyzingỹ. We will expand on these goals in more detail in Section 4. P models are typically generative, that is, we can obtain pseudo-random parameter and data draws via Monte Carlo simulations from Equation (1). The generative property presupposes that the prior is proper (i.e., its density function has a finite integral) and that efficient algorithms for sampling random draws from both p(θ) and p(y | θ) exist. P models are the basic building blocks of all further model classes described in the upcoming sections. Moreover, due to their generative properties, standalone P models can be useful on their own for various forward inference tasks. These include, for instance, exploring the stability of complex mechanistic equations [170], testing different prior assumptions before a model sees any real-world data [23], or venturing into computational philosophy using simulation [199]. As part of a Bayesian workflow, the plausibility of P models can already be evaluated through prior predictive or prior pushfoward checks [266], which ultimately aim to determine whether the generative behavior of a P model is consistent with the available domain expertise.

Non-Parametric P Models
In contrast to the above introduced parametric formulation, non-parametric P models replace the finite joint model p(y, θ) with an infinite dimensional (functional) expression [188]. Practically speaking, the number of parameters in such models simply grows with the number of observed data points [191,253,188]. We may still assume that the observed dataỹ is drawn from some unknown distributionỹ ∼ p * , but then place a prior p(f ) over the set of all possible generating functions f , instead of over a finite-dimensional parameter space. The forward (generative) model is thus given by: where the "likelihood" describes the probability of the data given a realization of the function f . For nonparametric regression models (e.g., Gaussian processes, [253]), the function f would also depend on additional inputs (i.e., predictors or covariates) and is thus restricted by the problem design. The corresponding prior typically prescribes some properties of f , for instance, smoothness or certain frequency characteristics [191], but may itself be non-analytic; still, it is often possible to obtain random draws from the generative model and compute marginal and conditional distributions. In between the parametric and non-parametric worlds, we can encounter high-dimensional P models, such as Bayesian neural networks (BNNs) [190,148]. The parameters θ of BNNs represent the set of trainable network weights and biases or a subset thereof, such as the weights and biases of the last hidden layer (for a practical overview of recent techniques, see [153]). The prior over network weights is typically chosen out of computational convenience [93], since there is hardly any domain expertise which can yield informative priors. The likelihood of BNNs can also be an ostensibly simple distribution (e.g., a Gaussian) whose parameters are obtained through a highly nonlinear transformation defined by the computational graph of the network. Thus, even though BNNs are formally parametric P models, their high-dimensionality and non-linearity makes them behave more like non-parametric P models [176].
This paper was conceptualized and written mainly with parametric P models in mind. That said, almost all of its aspects apply to non-parametric and high-dimensional P models as well, except, perhaps, for those that presuppose direct interest in the P model parameters (e.g., Section 3.2). Furthermore, despite theoretical differences [188], the practical treatment of parametric and non-parametric P models, when trained on finite data, is not radically different in the end. Finally, non-parametric P models commonly appear as local building blocks in otherwise parametric P models (e.g., a latent Gaussian process as part of an additive model; [168]), blurring the line even further.

Explicit vs. Implicit Likelihood Models
Thus far, we have emphasized that both parametric and non-parametric P models can be analyzed through the lens of their generative properties. A common denominator in such forward inference tasks is that the P model's behavior (i.e., dynamic properties) may not be immediately obvious from the P model's specification alone (i.e., static properties). Thus, simulation methods bridge the gap between the specification and the realization of a P model [274,131]. Indeed, from a simulation perspective, we can further draw a distinction between explicit likelihood (P E ) models and implicit likelihood (P I ) models. P E models are characterized by a likelihood function that has a tractable mathematical form. This means that the likelihood p(y | θ) is known analytically (e.g., Gaussian) and its value can be evaluated directly or approximated numerically for any pair (y, θ). The same logic applies to non-parametric P models using the pair (y, f ). P E models include popular statistical models, such as (generalized) linear and additive models [134], but also (stochastic) differential equation systems with simple statistical properties [152], finite mixture models [55], or feedforward neural networks [124]. P I models are defined through a Monte Carlo simulation program y = g(θ, z) and a prior p(θ), rather than directly through an analytic likelihood function p(y | θ). The simulator g transforms its inputs θ into outputs y through a series of latent program states z. A Monte Carlo simulator only implicitly defines the likelihood density via the relation where p(y, z | θ) is the joint distribution of observables y and random latent program states z, if such a distribution exists. The above integral runs over all possible execution paths of the simulation program for a given input θ and is typically intractable, that is, we cannot explicitly write down the mathematical form of the implied likelihood p(y | θ). P I models are usually built upon firm theoretical assumptions and computational considerations aimed at providing a faithful representation of the modeled real-world system or process. Common P I models include mechanistic neural models [147], particle physics simulators [65], population genetics algorithmic models [137], or agent-based models [125], to name just a few. The distinction between P E and P I models is not a conceptual necessity, but rather an emerging practical convenience. While most standard statistical models can easily be specified in terms of known density or distribution functions, the behavior of complex computational models might be easier to emulate directly using a simulation program. Importantly, P E and P I models necessitate the use of different estimation methods and thus disparate modes of approximation and inference, as we will see in later sections.

PD Models
PD models are defined as the combination of a P model and observed dataỹ, that is, they represent a tuple ( p(y, θ),ỹ ). The data can comprise any number of measurementsỹ with an arbitrary structure (e.g., sets, time series, graphs, etc.). Furthermore, the number of observed data sets (conditioning quantities for the posterior) will be determined by the structure of the P model: Data on a hundred countries represents a single data set from the lens of a multilevel (hierarchical) model, but it comprises a hundred data sets for a single-level (non-hierarchical) P model.
The goal of PD models is to integrate the joint distribution and the observed data to arrive at the corresponding analytic posterior : where the denominator p(ỹ) = p(ỹ | θ) p(θ) dθ represents the model-implied marginal likelihood (aka evidence) evaluated atỹ and typically treated as a normalizing constant due to its independence of the model parameters. If the P model is generative, the analytic posterior exists for everyỹ that satisfies the expected data structure of the P model, regardless of whether or not it represents the true real-world generator G. In the non-representative case, the P model is said to be misspecified. In most quantitative sciences, except perhaps in some areas of the natural sciences, we can expect all P models to be misspecified to some (non-negligible) degree. This does not prevent the corresponding PD models from being useful, though, if they can at least express some relevant aspects of reality captured byỹ. PD models represent the ideal endpoint of Bayesian inference. However, because we can rarely compute the marginal likelihood p(ỹ) analytically, we do not have access to the actual PD model outside of textbook examples with limited generality and applicability (i.e., for conjugate P models, [114]). In other words, for most practically relevant and non-trivial P E and P I models, we cannot retrieve the analytic posterior p(θ |ỹ) and can only work with an approximate representation through the lens of an intermediary A which we call a posterior approximator.

PA Models
PA models are defined as the combination of a P model with a posterior approximator A, that is, they constitute a tuple ( p(y, θ), p A (θ | y) ), where the latter denotes any algorithm capable of somehow approximating the analytic posteriors of model-implied observations y for a given P model. Approximators themselves exist at both an algorithmic and an implementation level, and details on both levels can influence their behavior and performance. In the absence of actually observed dataỹ, PA models can be useful for confirming the computational faithfulness of a workflow, for instance, via simulation-based-calibration (SBC, [285,208]) or assessing the adequacy of a model for answering a particular research goal [266]. Importantly, the type of P model (i.e., P E or P I ) will typically determine or necessitate the choice of a particular approximator A, as we will see shortly.

What is an approximator?
More precisely, we can define an approximator as a triple A = {A, I, H}, where A denotes the algorithmic representation (formal computer program), I denotes the actual implementation in a concrete programming language, and H denotes the set of admissible hyperparameters (i.e., adjustable settings or inputs) of the approximator. The first two components of A are often entangled when talking about approximators in general, but they require different levels of analysis. For instance, we can determine the computational complexity of A via standard algorithmic analysis and classify approximators according to their asymptotic run time or memory requirements [60]. However, the latter two will also be constrained by the particular implementation I: Parallel computing can easily turn a scary-looking quadratic O(n 2 ) time complexity into a negligible constant run time in practice [60]. Thus, we deem it important to keep the distinction between A and I explicit 1 . In addition, the performance of an approximator will heavily depend on the choice of particular hyperparameters h ∈ H and these should be explicitly specified in any PA model.

Approximators for P E models
Currently, the two most commonly used approximators for P E models are Markov chain Monte Carlo (MCMC) samplers and variational inference (VI) methods, but there exist many more approximator classes, for example, integrated nested Laplace approximation (INLA, [261,180]) or optimal transport applied to Bayesian inference [81,160,233].
MCMC sampling algorithms, such as the Metropolis-Hastings algorithm [135], Gibbs sampling [105], Hamiltonian Monte Carlo [HMC,214], or its extension to the No-U-Turn (NUTS) sampler [139], belong to a family of stateful algorithms which generate a sequence of correlated draws that converge in distribution to a stationary target distribution [110]. Generally, our goal in MCMC is to construct a (geometrically) ergodic Markov chain on θ whose stationary distribution is the posterior p(θ | y) [110]. In practice, we then sample from the chain to obtain a finite number of random draws from the (hopefully accurate) stationary distribution p A (θ | y) and use these draws to approximate p(θ | y). More precisely, using the posterior draws, we can efficiently approximate expectations (e.g., mean or variance) and quantiles of the posterior marginals, but not the posterior density itself.
The idea of approximating a complicated distribution via dependent random draws, albeit rather straightforward in hindsight, has gradually transformed and shaped the field of Bayesian inference. Moreover, it constitutes the main logic behind major probabilistic programming languages such as Stan [50] or JAGS [242]. A sampler is thus a computer program which uses computer-generated randomness to generate draws from a (complicated) distribution, instead of deriving or estimating its algebraic form.
Differently, variational inference (VI) methods cast the problem of posterior inference as an optimization task. In contrast to MCMC, the resulting posterior approximation p A (θ | y) is in the form of a tractable density instead of random samples from the posterior. Our goal in VI is to specify a family of approximate densities Q over the parameters θ of P. Then, we try to retrieve the density q * ∈ Q which minimizes the Kullback-Leibler (KL) divergence to the analytic posterior. Finally, we use q * (θ) as our approximation p A (θ | y) to the analytic posterior.
MCMC and VI methods represent the two endpoints of a trade-off between theoretical guarantees and computational efficiency. MCMC methods enjoy the guarantee that under certain regularity conditions [110], the obtained draws represent the true parameter posterior p(θ | y). More precisely, the posterior expectations can be perfectly recovered if the MCMC chain is run infinitely long and, more practically important, expectations can be efficiently approximated already with a finite number of draws. Despite their favorable theoretical properties and major advances in recent years, MCMC algorithms are notoriously slow, which renders estimation of some complex models or applications to really big data practically infeasible [29]. On the other hand, VI methods can be very fast and offer a viable alternative to MCMC in applications to large data sets or real-time inference. However, VI approximators can suffer severe loss of posterior accuracy and, as of today, offer less guarantees for correct inference than MCMC methods ( [29], but see [322,321]). Thus, the choice between an MCMC or a VI approximator for a particular PA model will largely depend on the modeling context. In addition, highly complex P E models might not be estimable with either MCMC or VI, in which case they might be treated as P I models in practice and tackled via simulation-based approximators, as we discuss next.

Approximators for P I models
Standard MCMC and VI solutions are not applicable to statistical inference with P I models, since the latter lack an analytic likelihood function p(y | θ). Accordingly, approximators for P I models leverage Monte Carlo (i.e., randomized) simulations for estimating the posterior based on the implicit likelihood defined by the simulator and Equation (6).
Approximate Bayesian computation (ABC) comprises a broad family of asymptotically correct methods for performing inference with P I models. The core idea of ABC methods is to approximate the posterior by repeatedly drawing parameters from the prior and then running the simulator with the sampled parameters to obtain a synthetic data set. Whenever a synthetic data set is sufficiently similar to an actually observed data set (as defined by a fixed similarity criterion or a distance metric), the corresponding parameters are retained as a draw from the target posterior, otherwise rejected (i.e., rejection sampling).
In practice, ABC methods are notoriously inefficient and hindered by various methodological "curses", such as the curse of dimensionality [254] or the curse of insufficiency [193]. Several more efficient methods employ various techniques, such as sequential Monte Carlo [SMC, 275,165]) or ABC-MCMC [194] with kernel density estimation (KDE) [289] to optimize sampling or correct potential deficiencies, but the core idea of using simulations to aid real-world inference remains invariant across methods.
Recently, machine learning and deep learning innovations have permeated the field of simulation-based inference with the goal of scaling up or replacing standard ABC methods altogether [62]. Most of these innovations require simulation-based training of an expressive machine learning algorithm (e.g., random forests or neural networks) which is then used as a standalone approximator [54,126,123,245], in combination with an ABC routine [151] or an MCMC sampler [136,90,185,33].
For instance, neural density estimation (NDE) methods employ specialized neural architectures for analyzing complex high-dimensional distributions [e.g., natural images, 72,162,4]. In the context of Bayesian inference, NDE methods can approximate different components of intractable P I models and currently represent a field of active and promising development [62,173]. Specifically, neural posterior estimation (NPE) Figure 2: Amortized approximators incorporate a simulation-based approximation loop (training phase) before any real data are collected. The subsequent inference phase involves no simulations or further optimization and could be carried out almost instantly. The upfront training effort therefore amortizes over arbitrarily many observed data sets from a research domain working on the same P model family.
methods [3,126,245,123,225,6] involve simulation-based training of a conditional generative neural network [e.g., normalizing flows, 166,231]. The trained network then acts as a functional that can approximate the posterior across the entire prior predictive distribution of a P model without any re-training, enabling amortized inference (to be explained shortly). A shared feature between NPE methods is that they avoid MCMC sampling altogether and can perform exact inference under certain optimal conditions. Ultimately, the utility of any simulation-based method will depend on a combination of various factors, such as generality, domain expertise, theoretical guarantees, efficiency, scalability, and software availability. The amount of available data will once again play a crucial role in the choice of approximator. In this context, the distinction between amortized and non-amortized posterior approximators becomes crucial.

Amortized vs. Non-Amortized Approximators
Arguably, there are numerous ways to devise a taxonomy for the ever-growing zoo of posterior approximators. A particularly useful and clear-cut classification views approximators as either amortized or non-amortized, with different degrees of amortization possible. Amortized approximators involve a costly simulation-based optimization (training) phase which renders subsequent inference on simulated y or real dataỹ extremely efficient (see Figure 2). In other words, the optimization/training effort amortizes over repeated inference queries (e.g., over multiple data sets or data set sizes). Differently, non-amortized approximators repeat all necessary computations for each data set or prior choice from scratch and utilize hardly any pooling of computational resources (see Figure 3).
Examples of amortized approximators include the BayesFlow method [245,247,247,249], sequential neural posterior estimation (SNPE) methods operating in a single-round regime [126,123,80], machine learning-enhanced ABC [254], or the pre-paid estimation method [204]. Examples of non-amortized approximators include standard explicit inference algorithms, such as MCMC or VI, but also several common ABC methods, such as ABC-SMC [275,165] or ABC-MCMC [194,289]. In addition, some neural PA models might include both amortized and non-amortized components, such as multi-round SNPE methods (involving a separate training phase for each data set, [228,126,80,67]), likelihood approximators or surrogates (involving MCMC sampling, [230,185,90,33]), or inference compilation methods (involving SMC, [226,174]). Figure 3: Non-amortized approximators perform a separate approximation loop (dashed plate) for each observed data set from a given research domain. Likelihood-based approximations, such as MCMC will evaluate the likelihood, whereas simulation-based approximators, such as ABC rejection samplers, will only use random draws from the implicit likelihood (available through stochastic simulations). Approximation and inference are tightly intertwined and the observed data enters the approximation loop.
Amortized approximators are typically employed to estimate implicit PA(D) 2 models, but are equally applicable to explicit PA(D) models. In the former case, their involvement often arises out of necessity, since P I models are analytically intractable and state-of-the-art approximators, such as HMC-MCMC, are not applicable out of the box. In the latter case, amortized approximators might be the only resort to estimate multiple PAD models in the presence of multiple data sets, where non-amortized approximators, despite being feasible, would demand an inordinate amount of a researcher's lifetime [303].

PAD Models
PAD models are defined as the combination of a P model, a posterior approximator A, and observed data D, that is, they constitute a triple ( p(y, θ), p A (θ |ỹ),ỹ ). Ultimately, PAD models aim to approximate the corresponding PD model through a suitable approximator A, whereas the amount of data D, together with the type of P model, will largely determine the choice of approximator. As a consequence, the properties of a particular PAD model may be very different than what is expected from studying the corresponding PA model, since the observed dataỹ may not have been generated from P itself. This misspecified P model case can arise for both P E and P I models and can have different consequences for the validity of inference depending on the particular approximator A [197,28,96,95].
For instance, amortized approximators face the challenge of dealing with simulation gaps [269,224]. Simulation gaps occur when P model simulations do not accurately represent the real behavior of the modeled system or when they cannot adequately account for unexpected contamination of the observed data. Simulation gaps are especially critical for amortized approximators since the latter assume that simulations are faithful proxies of reality. Thus, simulations from misspecified P models may lead to subsequent problems for amortized inference on real data [269]. In these cases, the resulting p A (θ |ỹ) will not be representative of the analytic p(θ |ỹ) and any substantive conclusions based on the former will have little validity.
In contrast, principle limitations due to model misspecification do not exist for standard, non-amortized Bayesian approximators, such as MCMC. Under certain regularity conditions, MCMC samplers guarantee that the obtained samples represent the analytic posterior p(θ |ỹ) even when the underlying P model is misspecified [110]. However, misspecified models might still cause considerable difficulties and convergence problems for MCMC methods in practice. Thus, any trustworthy approximator should be equipped with diagnostics signaling improper convergence or invalid inference queries (see Section 3.8).

Intermediate Summary I
Thus far, with our PAD taxonomy, we have defined four different classes of Bayesian models comprising different, yet interdependent, conceptual elements. Common to all has been the joint probability model (P), which represents the core probabilistic and structural assumptions of a Bayesian model. In addition, we proposed to treat the posterior approximator (A) and the data (D) as further constituents of Bayesian models. We consider this warranted, since all three elements not only determine the scope and validity of the substantial conclusions derived from model-based inference but also influence which assumptions we decide to (and could!) test and which we choose to keep untouched by reality.

What makes a good Bayesian model?
Below, we present a total of ten utility dimensions that, from our perspective, capture most relevant aspects of Bayesian models as defined by our taxonomy. For each of these dimensions, we explain (a) its definition and meaning, (b) the reason why we deem it relevant for Bayesian model building, and (c) how to practically measure it. The order in which we present each utility dimension does not indicate their importance but aims to ease their presentation. We discuss the relative importance of utility dimensions in Section 4.

Causal Consistency
A common goal of scientific models is the investigation of a causal hypothesis, such as the improvement a certain treatment might bring to some medical condition or the effect an intervention has on an outcome of interest. Most people are aware of the widely recited folk wisdom that correlation does not imply causation. Yet, this adage bears the seeds of a far-reaching and nowadays generally acknowledged opinion that statistics alone simply cannot solve questions of causality [236].
While statistical inference can handle the static nature of associations in observational data, causality is a matter of changing conditions and handling these changing conditions requires causal assumptions to build upon [235]. Moreover, different P models may claim different degrees of causal sophistication. For example, some P E models built only to make accurate predictions may pass without a single mention of causality, while some mechanistic P I models may directly embody causal functional relationships, such that an input variable x is assumed to cause an observable y by construction or by derivation from scientific theory. Some complex P models may even hold standard unidirectional notions of causality inadequate, as the dynamics of certain natural systems appear to necessitate bidirectional or hierarchical forms of causal interplay [288,216].
The scientific methods developed around the notion of causality help us determine whether a P model is a valid recipe for answering a particular causal query in principle. Put differently, we ask whether the probabilistic structure of a P model is consistent with a set of external causal assumptions. Thus, we refer to this implied model utility as Causal Consistency.
In this section, we will briefly present the foundation of causal theory based on the work of Pearl [235], as it is currently the most common causal framework. There are adoptions and adaptations for individual fields, such as the social sciences [210,97] and public health research [294]. Moreover, recent Bayesian statistics textbooks have started discussing causality as a central aspect of statistical analysis [202]. In addition, the fields of causal discovery [143,278,120] and optimal experimental design [83,88,146] deserve a mention as well, since they tackle problems related to causality. Finally, other promising causal frameworks have been proposed [145] but are not discussed in detail here for reasons of brevity.

Structural Causal Models
Pearl [235] proposes a framework to express causal assumptions and construct requirements on probabilistic models that make them consistent with those assumptions. The mathematical objects that allow for causal analysis are called structural causal models [SCMs, 236] and they comprise structural equations (what we express via P models), causal graphs, as well as interventional and counterfactual logic [238]. For instance, linear regression P models, if combined with proper causal calculus, comprise a widely used and simple form of linear SCMs. However, vastly more complex SCM architectures are possible, such as causal generative neural networks [167], where a causal graph is connected to a generative adversarial network responsible for learning interventional distributions (see Section 3.1.2 for details on interventions).
For the purpose of this paper, it is sufficient to discuss SCMs comprising a set of three endogenous variables whose causal relationships are to be studied. We refer to these variables as w, x, and y. For every endogenous variable, we assume there exists a corresponding exogenous (noise) variable, ξ w , ξ x , and ξ y , respectively. Under the assumption of causal sufficiency (i.e., every exogenous variable affects no more than a single endogenous variable), a hypothesis of the form "x causes y" means that y is generated by a structural equation y = g y (x, ξ y ) for some function g y . The corresponding causal graph is simply x → y.
To extend this example, the left panel of Figure 4 illustrates a path diagram of the structural equations relating the endogenous variables w, x, and y, along with the corresponding causal graph w → x → y. Importantly, any set of structural equations also encodes assumptions about the lack of causal influence. For instance, the absence of w from the right-hand side of g y conveys the assumption that y will remain invariant to changes in w, as long as variables x and ξ y remain constant.
In general, a causal graph implied by a set of structural equations will be a directed acyclic graph (DAG). It can be constructed as follows: The variables that appear on the right-hand side of a structural equation become the parents of the variable that appears on the left-hand side of the structural equation. We can understand the structural equations as encoding explicit structural assumptions about the opaque (true) data generator G, which in turn implies a (true) joint distribution of the endogenous variables, here p * (x, y, z). This distribution is realized by first assuming a joint distribution of the noise variables, p * (ξ w , ξ x , ξ y ), and then propagating this uncertainty to w, x, and y through the respective structural equations.
In P model terms, a DAG can be understood as defining a Bayesian network for the implied joint probability distribution of the endogenous variables [236]. The conditional distribution of an arbitrary endogenous variable v is given by p(v | N v ), where N v denotes a set of parent variables of v as implied by the DAG. For the current example, this would imply a generative likelihood that factorizes as p(w, x, y | θ) = p(y | x, θ) p(x | w, θ) p(w | θ), where θ are our P model parameters (left unspecified in the DAG).
In our model taxonomy, a P model may or may not be consistent with the set of causal assumptions embodied in a DAG, which constitutes a binary metric of causal consistency. For example, consider again the simple DAG given by x → y, with structural equation y = g y (x, ξ y ). The concrete approximation of g y is part of the P model assumptions (see below), whereas adherence to the (external) DAG implies satisfying causal consistency. For example, consider the following linear P model with unspecified distributional forms of ξ x , ξ y , and β for simplicity. The approximationĝ y chosen for g y iŝ g y (x, ξ y ) = βx + ξ y whileĝ x (ξ x ) = ξ x is just the identity function. The above P model is clearly causally consistent with the DAG x → y. In contrast, another linear P model in which we had swapped x and y (i.e., assuming x = βy + ξ x ), would be causally inconsistent with the graph x → y.
In linear P models, the regression coefficients represent path coefficients of structural equations and thus quantify the linear "causal effects" of certain variables on others. However, even when a linear P model is causally consistent with a given DAG, its linear functional form y = βx + ξ y may still be a poor approximation of the true (potentially highly non-linear) structural equation y = g y (x, ξ y ). Thus, an equally causally consistent, but more flexible, non-linear P model may be a better choice in the end, depending on other utility dimensions. This illustrates that causal consistency, as defined here by the formal agreement with a causal DAG, is only a necessary, but not a sufficient condition for a P model to provide trustworthy causal inference. Further requirements will be discussed in the context of parameter recoverability (see Section 3.2).
Causal graphs allow for an unambiguous communication of assumptions about causal relations, but on their own, they still represent static entities. In contrast, interventions and counterfactuals describe actions which enable us to answer causal queries based on (a subset of) these assumptions. Below, for the sake of brevity, we will elaborate solely on interventions (see [235] for more details of counterfactuals).

Interventions
An intervention is an operation that changes the underlying structural equations, hence the corresponding causal graph. Intervening on x means setting it to a fixed valuex, say, administering the treatmentx to a patient. We denote an intervention as do(x =x) or simply do(x) for short. The effect of an intervention on the path diagram of our example three-variable SCM is shown in the right panel of Figure 4. An intervention do(x) differs from conditioning onx in the following way: The former removes the connections of node x to its parents, whereas the latter does not change the causal graph from which data is generated [235,167]. If we set the value of x to somex, then it is no longer determined through the structural equation g x (w, ξ x ), that is, we have intervened in the generative mechanism. Importantly, the interventional distribution of interest, say, p(y | do(x)) may differ from the corresponding conditional distribution p(y |x).
However, when we only have access to observational data because we cannot intervene in the causal graph (e.g., an experiment is too expensive to perform), our resort is to estimate conditional distributions. Thus, an important question arises: "Which causal queries can we answer (i.e., which interventions' effects can we estimate) based on observational data alone?" In the language of do-calculus, this translates to the question of whether we can circumvent the do operator and express the interventional distribution of interest p(y | do(x)) via a conditional distribution [237]. For this purpose, we can use three basic rules of do-calculus Posterior contraction is easier to compute and interpret, but Bayesian surprise is more general, as it captures differences beyond second moments (i.e., variances). that specify the conditions under which we can 1) ignore observations, 2) treat interventions as equivalent to observations, and 3) ignore interventions [237].
Against this background, we say that a P model is causally consistent for a given causal query, if that query can be answered by applying the rules of do-calculus to the underlying DAG and all necessary conditional distributions are part of the P model. A P model which is causally consistent with a DAG is also causally consistent for all valid causal queries of that DAG. In practice, however, we can rarely attain (or care about) the former but are only concerned with causal consistency for a few queries of interest. To illustrate this point, let us again consider the DAG w → x → y from Figure 4. The linear P model (8) is not causally consistent with this DAG, since it does not include the structural equation x = g x (w, ξ x ) but only y = g y (x, ξ y ). However, it is causally consistent for the specific query p(y | do(x)) because, after applying the second rule of do-calculus, we find that p(y | do(x)) = p(y |x) for this DAG. Correspondingly, the latter conditional distribution is part of the P model in the form of y = βx + ξ y . Naturally, the conditions under which we can answer causal queries using conditional distributions become harder to test for causal graphs containing more than just three variables, but the underlying principles remain the same [58].

Parameter Recoverability
A central goal of Bayesian modeling is to perform parameter inference, that is, to draw conclusions directly from the posterior of the latent parameters or other pushforward quantities of interest. But how can we assess whether our inferences are informative and capture all relevant layers of uncertainty? The Parameter Recoverability dimension captures the ability of P models (and of PA models; see Section 3.2.3) to gain information from data and perform faithful uncertainty quantification. Moreover, recoverability is a concept where frequentist statistics inevitably play a role, even in the context of purely Bayesian models.
For the purpose of generality, consider the task of estimating a quantity of interest φ based on a P model and (yet to be realized) data y using an estimator ψ = ψ(θ) of φ where θ ∼ p(θ | y). The epistemic uncertainty implied by the posterior p(θ | y) is naturally propagated to the posterior of ψ. Based on the implied posterior p(ψ(θ) | y), we can derive both point and uncertainty estimates, among other things, as detailed further below.
To make this notion more concrete, let us consider a simple example. Suppose we are interested in the (true) mean difference of y between two groups, φ = E[y 1 ] − E[y 2 ], where y 1 and y 2 represent the responses of the two groups, respectively. One way to estimate φ here is via a linear regression P model with response vector y = (y 1 , y 2 ) and pointwise likelihood y n ∼ Normal(µ n , σ) where I is the indicator function, C 1 is the index set of observations n belonging to Group 1, and C 2 is the corresponding index set of Group 2. Then, based on this P model, we define an estimator ψ of φ as ψ = β 1 − β 2 . Accordingly, ψ does not need to be a model parameter itself but can be any pushforward quantity computable from the parameters. The Gauss-Markov theorem tells us that the chosen estimator ψ has the lowest sampling variance among the class of linear unbiased estimators in case of flat priors on β 1 and β 2 . However, the properties of any estimator in general are not always that clear: Consider another example where the true data generator is given by y n = f (φ x n ) + ξ n , with x being a known continuous variable, f a monotonically increasing function, and ξ an additive error term. In the absence of knowledge about the exact form of f , we could set up a P model with a normal likelihood that is linear in ψ, y n ∼ Normal(ψ x n , σ).
Moreover, the properties of ψ as an estimator of φ will certainly depend on the unknown function f and is likely not as favourable as in the first example. However, through ψ, we can at least hope to get the sign of φ right, which may as well turn out to be sufficient for meeting the goals of some applications.

Identifiability and Information Gain
Oftentimes, we are interested in learning something about the (true) real-world generator G through the P model-dependent quantity ψ, justified by its resemblance to the model-independent quantity φ that we assume to play a role in G. As a first step, we need to study whether the data generated by the unknown process enables the P model to extract any information about ψ at all. If the data is not informative, there is no point in further studying the recoverability of φ through ψ. In a frequentist sense, we say that a quantity ψ is identified in the given P model, if all the possible values of ψ lead to unique conditional distributions, that is, for any ψ 1 ̸ = ψ 2 we have p(y | ψ 1 ) ̸ = p(y | ψ 2 ) [52]. Thus, frequentist identification implies that, in the limit of infinite data, no ambiguity remains about possible values of ψ [178].
In a Bayesian context, the posterior captures all information about ψ gained from the data. Thus, the posterior should be a key object for defining identifiability. Since the posterior always exists (as long as the prior is proper), regardless of how informative the data are, the mere existence of the posterior is not a helpful measure of identifiability [see also 181,122,265,264,25, for discussions of Bayesian identification]. Instead, we have to define identifiability by a juxtaposition of prior and posterior. The transition from prior to posterior (i.e., Bayesian updating) essentially conveys a reduction in uncertainty brought about by observing some data. Equivalently, it can be seen as communicating the information gain achieved by accounting for the data. Thus, we expect the posterior to be narrower (sharper) than the prior, as the opposite would imply a loss of information through observation -a rather paradoxical scenario. In other words, the data should be sufficiently informative of ψ, otherwise, the posterior will just resemble the prior.
Bayesian surprise offers a way to quantify arbitrary differences between prior and posterior. The Bayesian surprise is typically defined as the Kullback-Leibler (KL) divergence between the two distributions but other divergence or integral metrics are also possible [212]. The Bayesian surprise, as defined above, is non-negative and equals zero if and only if p (ψ | y) = p (ψ). Henceforth, to avoid commitment to the KL divergence, we will use the symbol D to denote any divergence with the above two properties. In information theory, this particular form of the Bayesian surprise is called a relative entropy, and, in Bayesian terms, represents the information gained by updating the prior to the posterior in units determined by the base of the logarithm. 3 Accordingly, in a Bayesian context, ψ is identified if the relative entropy is non-zero. Further, the concept of posterior contraction provides a simpler and tractable empirical diagnostic to assess identification and degrees of informativeness [25]. Posterior contraction formalizes the idea that the posterior should get narrower as the amount of data increases and is computed as the ratio between posterior and prior variance: If y contains no information about ψ, then PC(ψ | y) = 0. Conversely, the more information (i.e., uncertainty reduction) we gain from y, the larger PC(ψ | y) becomes, up to a maximum of PC(ψ | y) = 1. The posterior contraction can be combined with the posterior z-score (i.e., the difference between the true parameter and its posterior mean) as an intuitive two-dimensional estimate of the information gain that can be achieved by a P model when combined with data D [266].
Posterior contraction compares only the second moments (i.e., the variance) of the prior and the posterior, which means that it can be efficiently computed from random draws of those distributions. However, relevant differences between prior and posterior may manifest themselves only in higher moments: It is still possible that we learn something about a distribution, for instance, about its tail exponent or symmetry, while its variance remains largely unchanged (see Figure 5 for an illustration).
So far, we have only considered posterior contraction and Bayesian surprise brought about by a single data set y. Thus, Equation (10) provides only a measure for local (i.e., per-data) information gain. Whenever we are interested in global (i.e., in expectation over all possible observations) information gain, then the expected Bayesian surprise (EBS) should be considered: or, similarly, the expected posterior contraction (EPC). Global information gain assumes access to the distribution p * of real-world generator outputs and so we can rarely compute this quantity in practice. Instead, we can obtain a Monte Carlo estimate of Equation (13) over multiple observed data sets as an approximation of the true EBS. In many scenarios (e.g., during model development), we are interested in the recoverability of ψ over the full generative scope of a model P, in combination with a posterior approximator A, before collecting any data. In this case, we will be considering the approximate posterior p A (ψ | y) and estimating the difference between prior and approximate posterior with respect to the joint distribution p(θ, y) implied by the P model: In other words, we assume the P model to be a good representation of p * (y) and evaluate the identification of ψ under this assumption for a given approximator A. Note that approximating the expectation over p(y, θ) will be computationally expensive for many PA models relying on non-amortized approximators (i.e., ABC or MCMC), since estimating the posterior p A (ψ | y) repeatedly will dominate almost any approach (see also Section 3.2.3). Thus, well-calibrated amortized approximators [126,245] can serve as remarkable catalysts for efficiently quantifying global information gain for a given PA model before committing to the (costly) process of data collection.

Ground-Truth Comparisons
We have hitherto assumed that we are dealing with a black-box (true) generator G whose actions give rise to the data-generating distribution p * (y). Thus, we did not require φ to play any actual role in the process of data generation. In this section, we will restrict our focus to scenarios where φ does in fact represent some intrinsic properties of G. Thus, we assume an unknown conditional data-generating distribution p * (y | φ) and are interested in the similarity between φ and its P-model-based estimator ψ.
Obtaining the posterior of ψ for a single data set and verifying sufficient information gain will tell us nothing about the recoverability of φ given a P model, (i), because the resemblance between φ and its estimator ψ remains unclear and, (ii), because we need to consider the variation in y, that is, variation across possible data sets as well. This means that we ought to estimate the performance of an estimator in expectation over possible data: where f (φ, ψ | y) is some function comparing φ with the posterior of ψ, conditional on data y (see below for examples). If the applied P model were the actual data generator itself, then p * (y | φ) would be equal to p(y | φ) = p(y, θ | φ) dθ and we could set ψ = φ. In this case, φ could be directly estimated through its own posterior distribution induced by φ(θ) with θ ∼ p(θ | y). However, in reality, we do not know how well P represents the actual generator, and so we continue to distinguish φ from its P model-based estimator ψ.
Notably, the evaluation of Equation (15) does not actually require any observed data and so can be done ahead of time, before commencing any data collection. Unfortunately, as for many things in Bayesian statistics, it is a lot easier to write down the target in mathematical notation than to actually compute it: The integral in (15) is almost always intractable, even if the posterior of ψ itself were analytic. Thus, in statistical practice, we approximate the integral with a finite sum over M independently simulated data sets y 1 , . . . , y M ∼ p * (y | ψ): This Monte Carlo estimate is now conceptually easy to compute, but potentially very time-consuming, since the P model needs to be fit M times, whereby each single fit may itself demand a considerable amount of time.
In Equations (15) and (16), φ is held constant, which constitutes the typical setup in simulation studies where we fix the ground-truth to a single value per simulation instance. However, the conclusions we can draw from such studies are naturally limited to the few investigated simulation instances (chosen groundtruths). If the investigated instances were non-representative in reality, then we would learn little to nothing of value from our simulations, even if the data-generating distribution p * (y | φ) itself were faithful. To consider this implied uncertainty, we can make the criterion (15) fully Bayesian by adding a prior p * (φ) over φ. Thereby, we can now measure recovery in expectation over data y and a priori plausible values of the quantity of interest φ: with Monte Carlo (simulation-based) approximation for M ground-truth simulations, each generated according to φ m ∼ p * (φ) and y m ∼ p * (y | φ m ).
Point Estimation. One central aspect of parameter recoverability that can be assessed in terms of expectations over the data-generating distributions is point estimation. We write T (ψ | y) for a point estimator derived from the posterior of ψ. Most commonly, we compute the posterior mean ψ(θ) p(θ | y) dθ, or alternatively the posterior median or mode. Due to aleatoric uncertainty in the data y, we cannot expect T (ψ | y) = φ for all y, even if the former would be the best possible point estimator of φ. Instead, we can measure how far away our estimator is from the truth via a strict distance function d on T (ψ | y) and φ, such that d(T (ψ | y), ψ) = 0 holds if and only if T (ψ | y) = φ. Common distance functions are the bias T (ψ | y) − φ, the squared error (T (ψ | y) − φ) 2 , and the absolute error |T (ψ | y) − φ|. To estimate the performance of a point estimator in expectation over the data-generating process, we would set f (φ, ψ | y) = d(T (ψ | y), ψ) and then apply Equations (15) to (18). Whenever we compare P models based on their point estimation capabilities, we would prefer the P model with the smallest expected distance of its point estimator to the assumed true φ.
Uncertainty Estimation. An uncertainty estimator is defined as a parameter region that is supposed to contain the true quantity of interest φ with a certain (user-defined) probability q. We write U q (ψ | y) to denote a q uncertainty region derived from the posterior of ψ. Common Bayesian uncertainty regions are quantile-based credible intervals and highest density intervals (HDIs) [110]. We say that an uncertainty region is well calibrated for a given φ (in a frequentist sense) if the following equality holds: where I(φ ∈ U q (ψ | y)) is the indicator function evaluating to 1 if φ ∈ U q (ψ | y) and to 0 otherwise. In other words, an uncertainty region for probability q is well calibrated if it contains the assumed true parameter in a fraction of q data sets. If the above property holds for every uncertainty region U q (ψ | y), we say that the whole posterior of ψ is well calibrated for estimation of φ. Bayesian uncertainty regions are not generally designed to satisfy this frequentist calibration and there is no guarantee that they will [110,213]. Yet, it can be a perfectly valid approach to use them even to satisfy purely frequentist goals [103]. Interestingly, when considering expectations over (y, φ) ∼ p * (y | φ) p * (φ) as in Equation (17), a P model will exhibit perfect calibration as long as its generative behavior matches the unknown data generator and posterior computation is exact [285]. This property is extensively used in diagnosing the correctness of posterior approximations, a topic we will discuss in Section 3.2.3.
When comparing P models based on their uncertainty estimation of φ, we would prefer the model which yields uncertainty estimates closest to the equality in Equation (19) for some pre-selected, application-specific uncertainty regions. For example, if we were primarily interested in well-calibrated 95% credible intervals (perhaps more precisely stated, compatible intervals, [202]), then we would prefer the model for which these intervals had closest to q = .95 coverage of the assumed true φ. That said, for some specific analysis goals, for example in null-hypothesis significance testing [171], over-coverage (higher than q coverage) may be more acceptable than under-coverage, or vice versa, depending on the assigned utility values of the corresponding Type-I and Type-II errors [267].
Sharpness. Multiple P models, say P 1 and P 2 , may provide estimators ψ P1 and ψ P2 that are equally well calibrated for a quantity of interest φ, yet their uncertainty regions may differ in coverage [121]. This implies that calibration alone is insufficient to describe the appropriateness of uncertainty regions: Additionally, we need to introduce the concept of sharpness. We say that model P 1 is sharper than model P 2 for an uncertainty region U q (ψ | y) with finite bounds, if that region is better or equally well calibrated in P 1 than for P 2 and if the volume of U q (ψ P1 | y) is smaller than the volume of U q (ψ P2 | y) in expectation over the data-generating distribution: where Vol indicates the volume in Euclidean space. For unidimensional φ and corresponding uncertainty region, say, a 95% credible interval, the volume is simply equal to the width of the interval. Of course, depending on whether we hold φ constant or assign a generating prior p * (φ) to it, we can also investigate sharpness in expectation over the joint distribution p * (y | φ) p * (φ), instead of only focusing on p * (y | φ).
If sharpness holds for all finite-volume uncertainty regions U q (ψ | y), then the posterior of ψ P1 is sharper than the posterior of ψ P2 . Well-calibrated uncertainty regions cannot be infinitely sharp and there exists a sharpest model and corresponding estimator if the set of well-calibrated models is non-empty [121]. However, in practice, we have no access to this sharpest model. Thus, in contrast to calibration, sharpness cannot be practically computed in an absolute sense, but can only be probed as a relative quantity in the context of two or more P models.

Calibration of Posterior Approximations
So far we have primarily focused on P models in the context of parameter recoverability and all of the estimators assumed access to the analytic posterior p(θ | y) to obtain the analytic posterior p(ψ(θ) | y) of the estimator ψ of φ. Since we do not have access to the analytic posterior in practice, our typical estimators are based on PA models.  Figure 6: Simulation-based rank histograms (top) and corresponding empirical cumulative distribution function (ECDF) difference plots [284] (bottom) for three hypothetical quantities of interest. The pink areas in the ECDF difference plots indicate 95%-confidence intervals under the assumptions of uniformity and thus allow for a null-hypothesis significance test of self-consistent calibration. Left: A well-calibrated quantity. Center: A miscalibrated quantity with too many lower ranks indicating a positive bias in the PA modelbased posteriors. Right: A miscalibrated quantity with too many extreme ranks indicating overconfident PA model-based posteriors (i.e., variance underestimated).
Correspondingly, we define the estimator ψ A of φ via the approximate posterior p A (ψ(θ) | y) of ψ obtained by the approximator A. If A were approximating the posterior via random draws θ (s) from p A (θ | y), the approximate posterior p A (ψ(θ) | y) would be represented by the pushforward draws ψ(θ (s) ). Thus, we can evaluate identifiability, point and uncertainty estimation, as well as the sharpness, of a PA model by replacing ψ with ψ A in the corresponding equations. Ideally, we would like to separate the estimation of φ via ψ from the estimation of ψ via ψ A and we can do so if we assume that the considered P model is the true generator itself. This is due to two related self-consistency properties. The first one is which states that a P model's prior (left-hand side) is equal to the P model's data-averaged posterior (righthand side), that is, the posterior in expectation over its own generating distribution [285]. The second one states that all uncertainty regions U q (ψ | y) of all pushforward quantities ψ are well calibrated, as long the generating distribution of the assumed P model is equal to true data-generating distribution and posterior computation is exact [285]. Writing ψ * instead of φ to explicate the direct correspondence between the quantity of interest and its estimator ψ, this property can be written as Both self-consistency properties are useful, but Equality (22) provides a particularly convenient means to diagnose the calibration of the approximated posterior p A (ψ(θ) | y): Under perfect (self-consistent) calibration, the posterior probability Pr(ψ * ≤ ψ) is uniformly distributed in the unit interval [285,284]. If the approximate posterior can be expressed in terms of random draws, then uniformity can be tested empirically by comparing the empirical distribution of ranks over M simulated data sets to a uniform distribution, a procedure known an simulation-based calibration [SBC,285]. If the distribution of ranks is close enough to uniformity (e.g., according to a frequentist nullhypothesis significance test), we can conclude that the PA model is well calibrated for approximating the P model, assuming self-consistency of P. The required uniformity can be checked graphically, for example via histograms (top row of Figure 6) or by plotting the empirical cumulative distribution function (ECDF) of the ranks normalized against their expected values under uniformity (bottom row of Figure 6), a method known as ECDF difference plots [284].
Even though self-consistency tested via SBC is a powerful tool to ascertain the trustworthiness of a PA model if the underlying P model is well specified, it tells us nothing about the trustworthiness of PA if P is misspecified, that is, if its joint distribution cannot accurately represent the true data generating process p * (y). In the latter case, we currently have no general procedure to verify the trustworthiness of a posterior approximation, that is, how close a PAD model is to the PD model it attempts to approximate (but see [197,322] for recent theoretical work). This is a subtly different problem than dealing with misspecified PD models, whose convergence characteristics have been established under certain regularity conditions [163,164]. In the case of PAD models, we can only hope that self-consistent calibration of PA implies good enough calibration in a sufficiently large model neighborhood of P that also contains p * . For posterior approximators coming with guarantees of asymptotic correctness, such as MCMC, this hope is probably better justified than for neural approximators that have been shown to perform poorly under P model misspecification [269,307].

Predictive Performance
Undoubtedly, predictive performance is the central utility in most machine learning research [134] and an essential goal of computational [227] and scientific models in general [101]. Moreover, predictive performance has recently been elevated to an indispensable condition for reproducible quantitative research in the social sciences [320]. In deep learning, enormous amounts of computing resources are spent even for just a second decimal improvement in predictive accuracy on domain benchmark data sets [116], notably at the expense of other utilities (e.g., parsimony, see Section 3.6, or estimation speed, see Section 3.9). In our Bayesian model taxonomy, we treat predictive performance as just one of the ten model utilities, but we still recognize it as an important one.
In a way, predictive performance would be nothing but a special case of parameter recoverability (see Section 3.2), if not for the fact that it targets observable variables that are comparable against observed data. This opens up the possibility to directly evaluate predictive performance in real-world scenarios instead of having to use simulations, as is often necessary for estimating parameter recoverability. Along similar lines, predictive P(D) model comparison or averaging can be seen as a form of parameter recoverability from the perspective of mixture modeling (with the individual P models as components) or in terms of continuous model expansion [107]. However, in practice, we approach these challenges mainly based on predictions from separate P(A)D models to reduce conceptual and computational costs [319].
In the following, we denote the set of "test" data to be predicted as y * , whereas P(A)D model "training" data continues to be denoted byỹ. In principle, these two data sets are allowed to fully coincide, partially overlap, or be completely disjoint (see Section 3.3.3), andỹ may even be empty (see Section 3.3.2). Further, we will allow the test data to be clustered into C mutually independent and exhaustive clusters y * = {y * c } C c=1 . In most applications, both y * andỹ are associated with observed input variables (aka features, predictors, or covariates), but we will keep these implicit to make the notation more readable.
The ocean of predictive performance metrics for Bayesian models is vast and we refer to [296] for a comprehensive overview. To illustrate some overarching points in this article, we will focus on a few important metrics that follow the general form where f (y * c , θ) is a predictive score comparing a test data cluster y * c with corresponding model-based predictions. We compute the expected predictive score by integrating over the PD model posterior p(θ |ỹ), where l is some function applied to each expectation before summation over clusters. Whenever we use a PAD model, we need to approximate the above expectation over p A (θ |ỹ), either by using random draws from p A (θ |ỹ) or by relying on an approximate closed-form density. Below, we examine predictive performance along multiple dimensions: absolute versus relative, prior versus posterior, and in-sample versus out-of-sample predictive performance.

Absolute and Relative Predictive Performance
Evaluating absolute predictive performance requires knowing an optimally achievable value of the predictive metric, whereas relative predictive performance only involves comparing multiple P(D) models' predictions evaluated on the same test data y * . As an example for the former, consider the per-observation squared difference f (y * i , θ) = (y * i −ŷ i (θ)) 2 as a predictive score, whereŷ i (θ) is a P(D) model-implied prediction given parameter value θ (e.g., a single random draw or realization from the posterior predictive distribution, see [296]). In this case, we know that the optimal value of Equation (24) is zero. For the sake of increased interpretability, such squared differences can be further transformed to the canonical "percentage of explained variance" R 2 measures which are bounded between 0 and 1, the latter indicating optimal predictions [112].
However, optimal predictions are not achievable in practice, since even a Bayes-optimal decision maker may elicit suboptimal predictions in the presence of aleatoric uncertainty [134], at least when it comes to outof-sample predictions (see Section 3.3.3). Moreover, since we nearly never know the Bayes-optimal decisions in practice (hence the need for predictive modeling in the first place), the expected optimal achievable predictive performance is also unknown to us. As a result, relative predictive performance is usually our only resort in practical applications [296].
That said, some models produce such strikingly poor predictions that they can be ruled out without the need to find a better model first, often via visual predictive checks [102]. For instance, if we consider the case illustrated in Figure 7, it is immediately obvious that the normal likelihood P model (left-hand side) is inappropriate for the given count data. As another example, consider a P(A)D model for binary classification that achieves just 50% accuracy, equal to random chance. Assuming a balanced data set (i.e., both classes occur with the same frequency), we would not need a competing model to conclude that the classifier is bad -unless our goal was to demonstrate that the two categories cannot be possibly differentiated given the available information.

Prior and Posterior Predictive Performance
The distinction between prior and posterior predictive performance has often led to confusion in the past and still remains a rather precarious one to discuss. Prior and posterior predictive performance are distinguished based on whether we evaluate predictions before or after conditioning on the training dataỹ, respectively [183]. In other words, we either compute (or approximate) expectations over the prior, p(θ), or over the posterior p(θ |ỹ). Since prior predictive performance does not require the training data (see Equation 24), we consider it a utility of P(A) models, while we view posterior predictive performance as a utility of P(A)D models. We still require the test data y * , but it is not a part of any model class in our taxonomy.
Statistically, the line between prior and posterior predictive performance is thin and more quantitative than qualitative [219]. As an illustration, suppose we observe N data points in total -then we could choose to use none,ỹ = ∅, or any number between 1 and N for model training. For complex P models, the predictive result implied by using one or two observations for training, rather than none at all, will be almost identical, despite everything but zero training data technically counting as "posterior" predictive performance [219]. Yet, the metrics commonly applied to quantify prior and posterior predictive performance differ not only in the amount of available training data but also in some other non-trivial ways (to be explained below).
In general, any predictive metric should match the intended real-world prediction goals. Below, we will focus on certain (log-)probability metrics, which can be considered good general-purpose choices in the absence of any known task-specific option [296].
Prior Predictive Performance. The canonical metric for evaluating prior predictive performance is the joint P model likelihood evaluated at the test data, f (y * , θ) = p(y * | θ), with C = 1 and l = identity, in which case the prior expectation above becomes the marginal likelihood: When used for model comparison, the marginal likelihood then gives rise to well-known comparative metrics known as Bayes factors evaluated by comparing two P models P j and P k as and posterior model probabilities over a set of J models {P j } J j=1 as where p(y * | P j ) denotes the marginal likelihood of P model P j and p(P j ) denotes the corresponding prior probability with J j=1 p(P j ) = 1, following a closed-world assumption [21,319]. Although the marginal likelihood is formally an expectation and thus, in theory, we can approximate it arbitrarily well using sufficiently many random draws from the prior, it is practically impossible to evaluate due to its unfavorable pre-asymptotic behavior for any non-trivial model [203,296,129]. The main reason for this is that the parameter subset for which p(y * | θ) contributes to the integral in Equation (25) (i.e., the typical parameter set implied by the test data; [24]) is very narrow and thus we need a very high number of prior draws to ensure sufficiently many of them occupy that narrow space. In addition, numerical issues caused by p(y * | θ), such as floating-point underflow, can also be hindering.
For these reasons, the practical computation of marginal likelihoods currently rests on bridge sampling [15] relying on posterior draws from a corresponding PAD model [203,129]. In contrast to estimating posterior expectations or quantiles, bridge sampling requires about an order of magnitude more posterior draws to yield reliable results [129] and is still largely missing principled convergence diagnostics or uncertainty quantification [128], leaving room for future research.
An alternative prior predictive metric arises if one uses the log-likelihood f (y * , θ) = log p(y * | θ) as a (predictive) score instead of the likelihood itself, which leads to the Gibbs loss [308] that, for factorizable likelihoods [39], evaluates to The Gibbs loss is not only simpler to evaluate for exponential family models [296] and numerically more stable than the marginal likelihood but also exhibits better pre-asymptotic behavior for factorizable likelihoods when estimated via prior draws since the integrands become much simpler. However, the Gibbs loss cannot be used to obtain actual predictions because it does not evaluate to a predictive distribution over y * [296]. The latter problem can be avoided by taking expectations with respect to individual test observations y * i first and only taking the log afterwards (C = N * and l = log), which leads to the expected log predictive density (ELPD) metric [296,297], evaluated over the prior: Comparing equations (25) and (29), we see that the marginal likelihood considers the joint predictive density of all test data y * , while the ELPD considers marginal predictive densities of y * i , marginalized over all other test data. Even though the ELPD has found wide application in the context of posterior predictive performance [297], it does not yet seem to play a noteworthy role in the context of prior predictive performance. However, together with the Gibbs loss, it may become a computationally favourable competitor to metrics based on the marginal likelihood.
Posterior Predictive Performance. When assessing posterior predictive performance, we apply the same metrics we encountered in the context of prior predictive performance but evaluate expectations over the posterior induced by the training dataỹ. However, the practical popularity of the metrics seems to be reversed when it comes to posterior predictions. For example, the posterior ELPD finds widespread application [297], while the "conditional marginal likelihood" has not yet attained wide popularity, despite having several useful properties [219,17,130,183]. The choice between prior or posterior predictive performance seems to depend on the modeling goals for which a P model is specified. While prior predictive performance seems to be favored for the purpose of testing scientific theories [316,132,85,130], posterior predictive performance is the perspective of choice in almost all machine learning scenarios (but see [320]), where we first obtain a PAD model based on training data (and perhaps only minimal prior information) and then utilize the model in downstream predictive tasks [134].

In-Sample and Out-of-Sample Predictive Performance
We measure in-sample predictive performance if the test data is a subset of the training data, y * ⊆ỹ, but measure out-of-sample predictive performance if test and training data do not overlap, that is, y * ∩ỹ = ∅. Whenever we evaluate prior predictive performance, we have no training data per definition and thus always measure out-of-sample predictions. Accordingly, the difference between in-sample and out-of-sample predictive performance only matters in the context of posterior predictions.
From a posterior predictive perspective, the decision between using in-sample and out-of-sample predictive performance is based on whether or not we want to generalize our inferences from a data set to a wider population. If a given data set included the entire problem space, then in-sample predictive performance would be sufficient. However, as most introductory statistical courses teach, a data set is typically only a small sample from a much larger population, to which we would like to extend our inferences. Thus, out-of-sample predictive performance (aka generalization ability) is almost always what we are after [134,296,297,320].
That said, we can still learn from in-sample predictive performance, as it provides an upper bound for out-ofsample predictive performance in expectation, such that when in-sample predictions are poor, out-of-sample predictions are likely to be even worse [134,297,102].
In the presence of only a single overall data set y total , estimating out-of-sample predictions is practically realized via data splitting, such that y total = {ỹ, y * }. To reduce the dependency of the predictive results on a single realized data split, we typically perform cross-validation by repeating the data splitting several times (folds), evaluating out-of-sample predictions for every fold, and then aggregating the results across folds [281,296,297].
The type of cross-validation scheme employed should resemble the envisioned prediction goals for which the PD model has been created [296]. For example, the predictive goal of time series models is usually to predict future values based on past values, making leave-future-out cross-validation a sensible choice [47]. Regardless of the type of cross-validation employed, it involves the repeated fitting of the same P(A) model to different data sets. Depending on the number of such refits, the individual data sizes, and the applied approximator, the required estimation time can quickly become prohibitive for any practical use. As such, approximate cross-validation procedures that require no or only a few refits have proven to be highly popular in practice [297,300,47]. However, key cross-validation schemes, such as leave-group-out cross-validation, cannot yet be robustly approximated, so there is more research needed in that direction [223].
Although evaluating out-of-sample predictive performance is often our best shot at preventing overfitting to the training data, it is not always sufficient to fully achieve good generalization within commonly applied model-building workflows [113]. In these workflows, we typically fit different P models to the same data in an iterative fashion. For example, we might first compare two models, decide which one to retain, and only then fit a third model to compare it with the winner of the first round. Even if each model choice was based on local out-of-sample predictive performance, subsequent results can be informed by out-of-sample results from previous iterations, making it not strictly out-of-sample for any future iteration steps from the perspective of the analyst's knowledge. As such, in an iterative workflow, local out-of-sample predictive metrics may still lead to overfitting, but the degree to which this biases the end results remains a topic for future research.

Predictions in a Dynamic World
Time is one of the most precipitous sources of uncertainty and any attempt to forecast the future with a static, time-independent P(A)D model will only be meaningful if the opaque generator G is strictly stationary (i.e., its regularities are invariant to time). Otherwise, a P model needs to have an appropriate temporal resolution to deliver reasonable out-of-sample predictions beyond the empirical snapshot of the collected data. Moreover, since the precise details of temporal shifts are extremely hard to anticipate, a P(A)D model which claims universal predictive performance should regularly be subjected to the falsification of time.
This brings us to an important distinction when it comes to assessing out-of-sample predictive performance. Whenever we make our P(A)D model "blind" to certain observations in the original data set D and use these observations to assess our-of-sample predictive performance (as we do in any form of crossvalidation, even those built for time series data [47]), we are essentially testing the model's ability to perform induction about the statistical regularities of p * (y) in a temporal snapshot determined by data collection. In such a scenario, however, we are not probing the model's ability to faithfully forecast the future, since the "left-out" observations are new only from the perspective of the model, but not from that of the modeler. Thus, cross-validation can sometimes be overly optimistic in estimating out-of-sample predictive performance, since a sample collected at a future date might exhibit surprisingly different properties (i.e., the P model would no longer be structurally faithful) than the sample currently at hand.
Why would the empirical distribution p * (y) change over time? One reason can be that the hidden properties of the generator G itself may change, bringing about alterations in the statistical properties of p * (y). For instance, strong auto-correlations in financial time series are notoriously short-lived due to feedback processes and market adaptation [276]. Yet another reason can be that new sources of noise contaminate future data D in unexpected ways. For instance, a sensor in a measurement device may break and yield incorrect data or case reporting policies during an ongoing pandemic may switch between waves.
However, the P(A)D model may have no mechanism to adapt to any of these changes and its out-of-sample predictive performance would likely suffer.
Within our model taxonomy, prediction failures due to changes in p * (y) concern misaligned assumptions about temporal invariances embodied in the P model's structure. One way to revise these assumptions is to include time-varying parameters θ t in the P model, with the corresponding time-invariant parameterization being a special (and more parsimonious) case. For instance, this can be achieved within the superstatistics framework [13], which aims to represent heterogeneous dynamics through a superposition of multiple stochastic processes at different temporal scales [195]. In any case, researchers should bear in mind that static P(A)D models are not designed to deal with things that move, so, as simple as it sounds, time remains a key arbiter of the quest for universal substantial conclusions or robust predictive systems.

Fairness
Fairness in the context of model building aims to ensure that model-guided decisions are equitable, with a specific focus on groups that differ in protected attributes, such as sex, gender, or ethnic background [59,9]. In a relatively narrow sense, fairness is a primary concern for P(A)D models, as it applies to realworld outcomes and their real-world reverberations owing to the connection between a P model's structure and data D. However, purely simulation-based P(A) models are not exempt from fairness considerations, especially when used to guide important public policies and decision support systems [46,8,218]. In the following, due to its predominant share in the literature, we will examine the fairness of P(A)D models from two different perspectives, namely, from the perspectives of psychometric measurement and predictive modeling.

Measurement Fairness
In psychometric measurement theory, the aim is to estimate people's scores on latent psychological traits, for example, general intelligence, creativity, or aptitude for university programs [76]. In the model-based literature of psychometric measurement, namely Item Response Theory (IRT; [290,82,42]), two major aspects of fairness have received considerable attention.
First, we need to ensure that the observable features (i.e., items) have been selected and administered in a fair way [200,36,5]. This aspect does not appear to be immediately model-based, since it concerns the data collection process as well as causal assumptions about the latent traits' influence on the item responses [36]. However, some of its requirements can be checked via P(A)D models in the form of differential item functioning (DIF) analysis [140,222]. When investigating DIF, the item parameters ζ i of item i are allowed to vary across groups g and their P(A)D model's posteriors are compared to verify their statistical equivalence. That is, we aim to examine whether p(ζ i |ỹ, g) ≈ p(ζ i |ỹ, g ′ ) holds for all pairs of considered groups g and g ′ and all items i.
Second, we need to estimate the latent traits of all individuals with a similar degree of uncertainty [43]. In the context of P(A)D models, this means that the posterior of trait η j for person j has approximately the same entropy across all individuals being compared, that is, H(η j |ỹ) ≈ H(η j ′ |ỹ) for all pairs of individuals j and j ′ . This turns out to be a difficult, sometimes even unachievable goal: Due to floor and ceiling effects arising in almost all psychometric tests, the resulting information is non-uniform across the latent trait space in non-linear IRT models [290,46,43]. As a result, more extreme latent trait scores will be estimated less precisely than more average scores. As a partial remedy, one may try to ensure that the information gain about all individuals' trait scores at least exceeds a minimal, application-specific threshold [43].

Predictive Fairness
What we term predictive fairness has its origins in the field of machine learning [258,9]. We will define predictive fairness directly on PD models because there is no hope that a P model can yield fair decisions for all possible training data; after all, training data may themselves be biased against protected groups [258,9]. And while we define it as a utility of PD models, it also automatically pertains to a corresponding PAD model, unless the posterior has a simple analytic form.
Mathematically, for individual-level decisions, we consider a PD model-specific decision rule d(x |x,ỹ) that outputs a decision for each admissible vector of attribute values x given training data D = (x,ỹ) consisting of observed attribute valuesx and corresponding decision-relevant outcomesỹ in a supervised learning context. If we consider only binary decisions to simplify notation, we can write the decision rule as withr (x |x,ỹ) := r(x, θ) p(θ |x,ỹ) dθ being a real-valued (expected) risk score of x that is obtained as an expectation over the PD model's posterior p(θ |x,ỹ). The decision (e.g., whether to give someone a loan or release a defendant while they await trial) is then made by comparing the risk score against a pre-defined threshold τ . The conditional risk score r(x, θ) determines how the P model and its parameters θ are used for assessing risk. For example, the risk score could be the mean of the PD model's predictive distribution given feature value x and parameter value θ: r(x, θ) := y p(y | x, θ) dy.
Conditional risk scores do not necessarily have to rely on the predictive distribution. Rather, they may also be based on latent model quantities, such as psychometric trait scores obtained from IRT P(A)D models [290,82,42], which bridges the gap between measurement and predictive fairness.
There are different classes of predictive fairness criteria considered in the literature, among others anticlassification [35,59] and classification parity [59,19] (also known as statistical parity; [57]). Even within these classes, criteria are partially incompatible and neither of them can actually ensure universal fairness, but we can still learn from their limitations [59,57,9,19]. In the context of such criteria, we differentiate between protected attributes x p (e.g., sex, gender, or ethnic background) and other, unprotected attributes x u such that x = (x p , x u ). Anti-classification requires that protected attributes x p (or their proxies; [35]) are not used in model-based decisions at all, which mathematically translates to In our PAD model taxonomy, this can simply be realized by using a PD model with p(θ |x,ỹ) = p(θ |x u ,ỹ) and conditional risk score r(x, θ) that is independent of x p as well. Anti-classification approaches have two main drawbacks. First, protected attributes can often be predicted fairly well from unprotected attributes, which makes it impossible to be completely agnostic about them [89]. Second, empirical risk distributions (after removing all unfair risk influences) may differ across values of x p , such that ignoring the latter may actually lead to unfair decisions against the groups one originally attempted to protect [59]. Differently, classification parity comprises a class of fairness criteria that requires the population distribution of certain decision metrics to be the same across all values of the protected attributes [59,19]. Using demographic parity [89] as an example, we would require that the decision's distribution itself, as implied by the distribution of attributes x in the considered population, to be independent of the protected attributes: Contrary to anti-classification, we usually have to incorporate the protected attributes into the P model in the first place to ensure any kind of classification parity [19]. In the context of psychological tests, for example, this could be achieved by imposing group-specific norms of comparison [263]. Yet, classification parity does not guarantee universal fairness either, whenever the true risk score distribution (after removing all unfair risk influences) varies between groups defined by the protected attributes [59].
The shortcomings of these predictive fairness definitions highlight that requiring a certain outcomethe decision itself (anti-classification) or aspects of its population distribution (classification parity) -to be independent of the protected attributes may be insufficient. Towards the goal of achieving fairness through a PD model, the underlying P model needs to be causally consistent (see also Section 3.1) in a way that considers how the protected attributes x p relate to the causal graph that includes all the valid, unprotected attributes x u and the outcome y [35]. In addition, the training data D needs to be representative of the true (unbiased) outcome distribution p * (y). It goes without saying that these are complicated, application-specific tasks that require contributions from various scientific fields and considerable domain expertise.
What is more, fair decisions, regardless of their modeling context, need to take into account that the same decision may affect different people (and their surroundings) differently and that these differences may be related to both protected and unprotected attributes. More formally, we need to consider the decision d(x |x,ỹ) in a context C(x) that only together determine the output of a utility function U (d(x |x,ỹ), C(x)), which offsets all possible gains and losses caused by the decision. Obtaining such a function could steer a decision towards fairness as quantified by equal utility outcomes across protected groups.
At an even higher level, we should consider taking sufficient precaution that (anticipated) political decisions or societal processes triggered by anonymous modeling results do not lead to unfair treatment of protected groups. However, such considerations may come into conflict with the principle of scientific freedom, in which case a careful ethical analysis of the specific situation becomes mandatory.

Structural Faithfulness
In most data analysis scenarios, we have a reasonable amount of qualitative prior knowledge about the data structure and the data generating process, even if we don't know the precise analytic relation between the two. In particular, this knowledge concerns the scales of variables to be modeled, the dependencies between observations, as well as physical constraints, such as symmetries or invariances. The Structural Faithfulness utility captures how well a P model incorporates such knowledge. Structural faithfulness is at the core of statistical modeling, be it Bayesian or otherwise, as it determines the probability distributions we assign to our observed and unobserved variables, the parameters we add to our P models, and the assumptions we can justifiably make to simplify reality.
Moreover, we can roughly distinguish between probabilistic structure and functional structure, which are related to the modeler's degree of ignorance regarding the problem at hand. Purely statistical models aim to capture the probabilistic structure of p * (y), without making reference to functional structure of the hidden generator G. Non-deterministic mechanistic models, on the other hand, aim to capture the functional structure of G (usually represented by physical constraints), such that the probabilistic structure of p * (y) can be reproduced or explained. For instance, when we study the dynamics of a phenomenon via stochastic differential equations, functional faithfulness refers to the mathematical form of the differential equation and probabilistic faithfulness refers to the fidelity of the stochastic assumptions.
To us, it remains unclear how to measure structural faithfulness in an absolute sense and we see it primarily as a relative metric. What is more, structural faithfulness consists of multiple components that may each favor a different P model. For example, model P 1 might take a known symmetry into account that model P 2 ignores, while P 2 might assign a more appropriate distribution to a response variable than P 1 does. In this case, none of the two P models would actually be more structurally faithful than the other, at least not uniformly so.

Variable Scales
The scale of a variable determines not only what information it represents but also how it should ideally be treated within a P model. For example, if the response variable consists of count data without a known or practically reachable upper bound, we should model this data via an appropriate (unbounded) discrete distribution (e.g., Poisson, or some of its generalizations) to sensibly capture the aleatoric (irreducible) uncertainty in those count responses [302,99,313]. What is more, this ensures that the variables' natural boundaries are respected (e.g., lower bound of zero for count data), such that the corresponding model Negative−binomial Figure 7: Posterior predictive checks [102] of epilepsy treatment data [287]. The response variable is the number of epileptic seizures of patients in a given time interval, that is, a count variable without a known upper bound. Results are shown for three PAD models with different likelihoods (shown as facets) and posteriors approximated via MCMC in Stan [280]. Histograms indicate observed data and each black line indicates one draw from the posterior predictive distribution of the corresponding PAD model, smoothed via continuous density estimation. For Poisson and negative-binomial likelihoods, posterior predictions are in fact counts but are still displayed as smoothed continuous densities to ease readability and comparability across facets. As is clearly visible on the left-hand side, the PAD model with normal likelihood predicts a lot of theoretically impossible negative counts and can neither predict the spike at counts close to zero, nor the heavy right tail.
predictions cannot go beyond the data space that is possible in reality (see Figure 7 for an illustration). As another example, if our response variable is ordinal, that is, it consists of discrete ordered categories without guarantees that the categories can be considered equidistant, we should model such data via an ordinal distribution [201,179,45]. The same points hold also for predicting variables even if they are not explicitly modeled with a distribution [44,115]. Failure to consider the variable scales in P models can have detrimental consequences for the validity of the obtained results [115,44,179]. Equivalently, respecting the intrinsic scales of all quantities included in a P model can help to avoid unreasonable parameter estimates or implausible (or worse, impossible) predictions.

Probabilistic Structures
Observed data often exhibits specific probabilistic structures that can be inferred from (qualitative) understanding of the data-generating process. For example, if we collect psychometric data from multiple students in the same class, it is highly unlikely that the data points will be mutually independent (e.g., because students share the same teacher, rooms, peers, etc.). This situation is prototypical for the application of multilevel models, which aim to capture such dependencies [108,12,40,41]. Multilevel models treat such dependencies of observations belonging to the same group as equivalent to variation between groups [108]. In other words, if there were no variation between groups, there would be no structural dependency of observations within groups (at least none elicited by this grouping structure). There are three major types of structural dependence between groups that can be expressed as multilevel models: exchangeable, directed, and undirected [260,100,103], illustrated schematically in Figure 8.
Exchangeable groups are the most common assumption in multilevel models and imply that (before seeing any data) we hold the same prior beliefs about each of the groups but assume they are all drawn from the same population (e.g., students within classes, classes within schools, schools within cities, etc.). In the most simple case (i.e., two-level structure, univariate and normally distributed parameters), we would specify a Figure 8: Graphs illustrating common probabilistic structures. Rectangles depict nested parameters within a given probabilistic structure. Circles depict the corresponding hyperparameters. (a) Exchangeable parameters; (b) Conditionally dependent parameters with a directed (e.g., temporal) dependency structure. (c) Conditionally dependent parameters with bidirectional (e.g., spatial) dependency structure. univariate normal prior for each group indexed by i and group parameter ϕ i as where µ and σ are the mean and standard deviation parameters shared across groups, respectively. Typically, we would estimate the across-group parameters from the data along with the group-specific parameters ϕ i themselves.
In directed dependency structures, adjacent groups are assumed to have directed influence on each other in a way that group i can affect group j, but not vice versa. The most common example is temporal autocorrelation where a variable at time i can potentially be influenced by a variable at time i − 1 [278,103]. For a univariate Gaussian random walk, we would formalize this assumption with the following prior In undirected dependency structures, the influence of adjacent groups can go both ways, with spatial autocorrelation being the most common example [22,106,211]. For example, in (spatial) conditional autoregressive (CAR) structures [22], we could write down the prior on the group coefficients as where N i is the set of groups that are neighbours of group i. Importantly, a shared feature of these dependency structures is that they are agnostic towards the underlying causal mechanisms -their purpose is purely to accurately represent the inherent probabilistic structure of the observed data [106,312,103]. But what if the data-generating process suggests a certain kind of dependency for which we find no empirical support? For example, shall we retain a grouping term of classes even if the PAD model suggests close to zero variation between groups? There are good arguments for both choices. On the one hand, excluding such a term implies a simpler model with higher parsimony [11] (see also Sections 3.6), although the increase in parsimony will be quite small due to the partial pooling property of multilevel models induced by their hierarchical priors if there is a sufficient number of groups [108,138]. On the other hand, including the term sets a good example for future replications and applications of the same P model, in the same or different contexts. That is, if someone applies this P model to a new data set, they may very well find the between-group variation under question to be non-zero, thus justifying the inclusion of the corresponding grouping term.

Physical Constraints
In the domains of physics and natural sciences, we tend to have strong prior knowledge about the functional P model structure in the form of known hard constraints such as symmetries, invariances, or conservation laws [292,251,157,7]. For example, a harmonic oscillator expressed by the second-order differential equation with functional solution x, second derivativeẍ, as well as constant k, represents an isolated system that is energy conserving [187]. Similar to a harmonic oscillator, most physical hard constraints can be expressed via differential equations whose direct inclusion in a P model is computationally demanding if we do not have access to an analytic solution [62,173,283]. Accordingly, building a more flexible, data-driven P model as a surrogate is a computationally attractive choice [173,49]. Still, even for such a surrogate, it remains beneficial to incorporate known physical constraints to eliminate the need to learn them directly from data. This is likely to increase the model's data efficiency, that is, the amount of data required by the model to achieve a certain predictive goal [251,173]. The discussion about physics-informed modeling is particularly prominent in core areas of high-dimensional machine learning, such as neural networks that tend to be very data-hungry [251], but in principle applies to all P models created for representing data with known physical constraints.

Parsimony
Parsimony refers to the formal simplicity of a Bayesian model; some might define it as the conceptual or mathematical elegance of the underlying interpretative framework. Here, we view parsimony as a quantifiable property of a Bayesian model. We treat it also as a relative quantity -it is always possible to propose a more complex model (or possibly a simpler one) which is equally consistent with the available data.
Within our PAD framework, we will distinguish two types of parsimony: P-parsimony and A-parsimony. P-parsimony characterizes the formal simplicity of a P model and should be measurable from the structure of the joint distribution p(y, θ). A-parsimony characterizes the simplicity of an approximator and should be measurable through the interface of A. The former is directly related to the theoretical appeal of a P model's probabilistic assumptions; the latter is directly associated with the usability of an approximator.

P-Parsimony
In many real-world modeling scenarios, we have limited data and strive for P models that can capture all relevant latent properties with as little data as possible (see Figure 9 for a simple illustration). One particular aspect of this goal is captured by the dimensionality of the parameter space, whereby higher parsimony simply means lower parameter dimensionality. Canonical examples for high parsimony are physical simulators defined by complex (white-box) forward models with intractable likelihoods [62]. The latter are informed by strong subject matter knowledge and are thus able to maintain low parameter dimensionality (e.g., consider the harmonic oscillator Equation (40), which only requires a single parameter to describe highly non-linear, non-monotonic behavior). On the other end of the spectrum are neural network models that tend to use simple likelihoods (e.g., Gaussian or categorical), but are characterized by an extremely high parameter dimensionality and large compositions of non-linear transformations, such as GPT-3 featuring 175 billion parameters [92]. In a way, we need to compensate for our lack of a priori knowledge (or inability/unwillingness to use it) by applying less parsimonious models that replace more restrictive model structures with a heightened hunger for data.
The motivation for parsimony is related to other utilities as well, since more parsimonious P(A)D models tend to require less data to achieve the same reduction in epistemic uncertainty (parameter recoverability; Section 3.2) and predictions (predictive performance; Section 3.3), and tend to be easier to comprehend in real-world applications (interpretability; Section 3.7). Still, we can construct chaotic models -where minimal changes in the parameters lead to strong changes in the predictions -that are highly parsimonious, yet uninterpretable and extremely flexible in terms of the function space they can approximate [239]. However, most P models applied in current practice do not exhibit such chaotic behavior.
Despite its intimate connection to other utilities, we think that parsimony deserves to be a utility in its own right, harmonized with Occam's razor: Given two models, and other things being equal, one should choose the more parsimonious one [31]. Increasing the parsimony of a model (or a scientific theory, for that matter) implies making more restrictive assumptions (i.e., reducing the function space that can be theoretically approximated by the model), thus increasing its falsifiability: We can more easily create situations where the model is wrong. Furthermore, in applied settings, sparser models may lead to more efficient data collection and more economical measurement designs (i.e., fewer variables to measure or less acquisition trials in design optimization) [234]. Nevertheless, the strive for parsimony may not always be a useful guide to our scientific exploration, if the aesthetics of parsimonious P models make us blind for potentially more appropriate (e.g., in terms of other utilities), but less parsimonious representations. For example, the strive for parsimony may be one of the factors that has stalled the scientific progress in the foundations of physics during the past decades [141].
Effective Number of Parameters. There are different ways to measure parsimony, with simply counting the number of parameters 4 of a P model being the most straightforward approach. For simple models, such as linear regression, this measure of parsimony matches the concept of degrees of freedom (DoF) in frequentist statistics. In the same way, the DoF concept becomes awkward even for slightly more complex models [150], the former is not a generally useful measure of parsimony either [239]. The reason for this is that, from a Bayesian perspective, any prior information on a parameter increases a P model's parsimony, such that the effective number of parameters (ENP), might be substantially smaller than the nominal number of parameters [297]. The same mechanism also underlies the difficulty in computing the DoF of test statistics in frequentist multilevel models, because random effects distributions are equivalent to priors [138].
There are several ENP measures in the literature [277,309,297,241], often defined in the context of information criteria. For the information criterion based on leave-one-out cross-validation (LOO-CV), ENP is measured as the sum of the differences between the pointwise log predictive densities of the full posterior and the pointwise log predictive densities of the LOO posteriors [297]: The notation y −n indicates that the n-th data point in y has been excluded. As more parameters are added to the model, the in-sample predictive performance represented by log p(y n | y) grows more quickly than the out-of-sample predictive performance represented by log p(y n | y −n ) such that the sum of their pointwise differences grows. This provides an intuition why ENP LOO can be considered a measure of parsimony. Its concrete interpretation as an effective number of parameters is inspired by the following observation: When using very wide or even completely flat priors over all parameters, ENP LOO will roughly coincide with the nominal number of parameters, but becomes smaller than the latter in the presence of prior information [297]. Bayesian LOO-CV can usually be computed efficiently via importance sampling without any model refitting, and so can ENP LOO be computed without any actual refitting [297,300]. For a large number of observations N , ENP LOO can be asymptotically approximated by the sum of the full posterior variances over the pointwise log-likelihood values, which is the ENP estimate used in the widely applicable information criterion (WAIC) [309]: Intuitively, as the number of parameters grows, so does the epistemic uncertainty in the posterior, which leads to an increase in the variance of posterior predictive quantities, such as log p(y n | θ). The WAIC approximation of LOO-CV performance can be quite unreliable so using ENP LOO is highly recommended whenever possible [297]. What becomes apparent in these equations is that parsimony, at least when measured through these ENPs, may depend on the specifically realized dataỹ, and as such needs to be defined over PD models. This is specifically true for models with hierarchical priors, where the amount of hierarchical shrinkage (i.e., the influence of the hierarchical priors) is data-dependent [108]. Practically, the posterior integrals in (41) and (42) for PAD models are efficiently approximated via Monte Carlo estimates based on posterior draws from an approximator [297]. The huge advantage of these ENP measures is that they do not need to be aware of the internal structure of a P model, but only require its predictive outputs in the form of pointwise log-likelihood values. However, the need for the latter has the drawback that ENP measures do not work natively with P I models due to their lack of tractable likelihoods; unless one has learned not only the model's posterior but also its likelihood density during training [314]. What is more, if the model includes residual dependencies between observations, the pointwise (log-)likelihood may not be available, even if the joint likelihood is analytic [39].
Prior P-Parsimony. In the above-described ENP definitions, we integrate over the posterior distribution and so, in this sense, measure posterior parsimony. This naturally raises the question of whether we can define measures of prior parsimony as well. In a Bayesian setting, prior parsimony is automatically embodied in the marginal likelihood (sometimes called Bayesian evidence) [158,189,183], which we already encountered in our discussion on prior predictive performance (see Section 3.3.2). As a reminder, we obtain the marginal likelihood by marginalizing the joint P model over its prior Accordingly, we can interpret the marginal likelihood as the expected probability of generating data y from a P model when we randomly sample from the prior p(θ). Through the prior's role as a weight on the Figure 10: Hypothetical scenario with three P models of descending complexity: P 1 , P 2 , and P 3 . The most complex model P 1 can account for the broadest range of observations at the cost of diminished sharpness of its marginal likelihood; in contrast, the simplest model P 3 has the sharpest marginal likelihood which concentrates onto a narrow range of possible data. Even though the observed dataỹ is well within the generative scopes of models P 1 and P 2 too, the simplest model P 3 has the highest marginal likelihood atỹ among the three candidates and is therefore favored from a marginal likelihood perspective. However, the higher relative marginal likelihood of the simplest model P 3 is a poor proxy of its predictive performance for new data sets, as it assigns close to 0 density to the new data setỹ new , suggestive of overfitting. The model P 2 , whose marginal likelihood is closest to the data-generating distribution p * , would have been favored, had y new instead ofỹ been used for computing the associated Bayes factors.
likelihood, the marginal likelihood encodes a probabilistic version of Occam's razor by penalizing the prior complexity of a P model [158,189]. However, the marginal likelihood is not an explicit measure of parsimony; rather, it represents an implicit relative quantity which combines prior parsimony with the ability of a P model to fit the data by considering its entire generative scope (see Figure 10). Following [189,Chapter 28], we can illustrate the above conflation by assuming that the posterior of a P(A)D model is well represented by a (multivariate) Gaussian. In this case, the marginal likelihood can be approximated as: where θ MP is the posterior mode and H(θ MP ) is the Hessian of the likelihood evaluated at θ MP . The multiplicand p(θ MP ) det(H(θ MP )/2π) − 1 2 is termed an Occam factor and represents the factor by which a P(A)D model's parameter space contracts as the prior is updated to the posterior based on the information contained in D. Thus, under the Gaussian assumption, the magnitude of the Occam factor is an explicit measure of prior complexity (i.e., inverse prior parsimony) related to the information gain a P model can achieve over its generative scope [189,183]. Consequently, a P model with a vague prior will incur a larger penalty by the Occam factor than a different P model with a sharper prior, provided that both models share the same likelihood. However, if the Gaussian assumption is inadequate, the approximation of Equation (44) can sustain a large error and may no longer be useful. Unfortunately, we are not aware of a more general decomposition of the marginal likelihood into a prediction factor and a parsimony factor, as is the case with ENP LOO .
A closely related concept is the principle of Minimum Description Length [MDL, 255,133], which views parsimony through the lens of information theory. In the MDL framework, a probabilistic model represents a coding scheme designed to describe the dataỹ. Accordingly, a parsimonious P model provides a concise description of the data in terms of code length (relative to a competing P model). Note, that MDL is not a unique measure, but rather an umbrella framework for deriving measures of parsimony/complexity in various application contexts (see [133] for a comprehensive exposition). For instance, in a Bayesian context, one can show [133] that a canonical measure of description length for model P is given by which we recognize as the negative logarithm of the marginal likelihood introduced in Equation (43). In this way, MDL not only highlights the theoretical connection between Bayesian model comparison and information theory but also provides a principled way for deriving new measures of prior parsimony in future basic research.
Sparsity-inducing priors. Another perspective on P-parsimony is provided by sparsity-inducing priors, especially global-local shrinkage (GLS) priors [293,27,291]. These priors will shrink redundant coefficients towards values close to zero, inducing sparsity in the posterior 5 . GLS priors can be applied in many model classes, including linear and generalized linear models, non-linear and non-parametric function estimation, time series, as well as deep neural networks [293,118,27,268]. Here, we focus our discussion on Gaussian linear models as this case is most intuitive and theoretically best understood. Given a linear regression model in its simplest form, GLS priors are defined on the K regression coefficients β k as follows: where λ k denotes the local scale parameter unique to each coefficient and τ denotes the global scale parameter that is shared across all coefficients. The choice of the hyperpriors p(λ k ) and p(τ ) determines the specific properties of the GLS prior, leading to, for example, the horseshoe [51,241] or the R2D2 prior [325,1]; see [293] for a comprehensive overview. The implied posterior of the coefficients has a highly interesting relationship with the maximum likelihood (ML) estimateβ k that can be obtained from the same likelihood and data but under the assumption of flat priors on the coefficients. Concretely, and assuming that the ML estimate exists, the posterior mean E θ|y (β k ) can be computed as follows [241,1]: with Here, a k is some constant that depends on the response's and the k-th predictor's scales. Accordingly, the smaller λ k and τ , the stronger the shrinkage of β k to zero, relative to the ML estimateβ k . Conversely, the larger λ k and τ , the closer the posterior mean of β k will be toβ k . Given these properties, κ k are called shrinkage factors [241,1]. The model leading to the ML estimate has K coefficients, which are all counted fully when it comes to determining the number of parameters (see above). Since the posterior mean β k implied by the GLS prior is equal to (1 − κ k )β k , we see that summing over all (1 − κ k ) terms can be considered a measure of the effective number of coefficients [ENC,241]: In contrast to the above ENP measures, ENC GLS is essentially limited to linear models. What is more, ENC GLS only considers regression coefficients, not necessarily all P model parameters (e.g., it ignores the residual standard deviation σ). These are not the only differences between these measures though. Even though both are derived as generalizations of simply counting parameters, the ENC measures focus on posterior variance (which is explicit in the definition of ENP WAIC ), while ENC GLS focuses on the posterior mean. Thus, they consider different aspects of the posterior when measuring parsimony. Studying the relationships between these measures more closely would be an interesting endeavor for future research.

A-Parsimony
As we discussed in Section 2.3 concerning PA models, posterior approximators can range from relatively simple optimization algorithms to high-dimensional parametric models (e.g., neural networks) which themselves can be viewed as standalone P models (e.g., Bayesian neural networks). The notion of A-parsimony intends to capture our intuition that these different approximators have varying degrees of complexity. Here, we propose a very straightforward definition of A-Parsimony: The cardinality of the hyperparameter space H available for fine-tuning through the implementation interface I of the underlying mathematical algorithm A. For instance, the widespread use of MCMC in Bayesian inference is partly because probabilistic programming languages provide relatively simple interfaces, which abstract away a staggering multitude of hyperparameters of complex MCMC samplers [e.g., NUTS,139]). On the other hand, neural approximators [e.g., 246,126] inherit the vast hyperparameter spaces of deep neural networks and are thus currently still rather challenging to apply or fine-tune [301]. A-parsimony is not only relevant for the usability of approximators, but also plays an important and limiting role in comparison or benchmarking studies assessing the relative performance of different approximators. Suppose we wish to compare approximator A 1 having no hyperparameters with approximator A 2 having a single continuous hyperparameter h ∈ [0, 1], in the context of some P model. A comparison of approximators must naturally be based on some metric (or a set of metrics) q(A, P) which quantifies the approximation quality of A with respect to a given P model (e.g., the distance between corresponding PD and PAD models or the estimation speed of the approximator). However, even for the simple scenario outlined above, it is not clear how to systematically carry out such a comparison due to the presence of hyperparameters. One approach would be to approximate the average approximation quality given by 1 0 q(A 2 (h), P) p(h)dh of A 2 and compare it to q(A 1 , P). Another approach would be to seek the best approximation quality given by max h∈[0,1] q(A 2 (h), P) and compare it to that of q(A 1 , P). Needless to say, the difficulty of ranking and benchmarking approximators with large hyperparameter spaces drastically increases, which makes A-parsimony a key limiting factor as well as a desirable utility to improve upon.
Finally, A-parsimony is related to robustness (see Section 3.10) and convergence (see Section 3.8), as the presence of multiple hyperparameters raises the question of how to choose hyperparameter settings which i) lead to stable results and ii) generalize to various applications of a PA(D) model. For some approximator classes (most notably, MCMC) and P models (e.g., linear models), empirical guidelines and theoretical considerations may suggest relatively robust default choices. For newer approximator classes (e.g., neural density estimators) or more exotic applications, some form of sensitivity analysis or hyperparameter search might be necessary to ensure sufficient robustness or generalizability.

Interpretability
Interpretability of a P(A)(D) model can be qualitatively defined as "the degree to which a human can understand the cause of a [model-based] decision" [206] or as "the degree to which a human can consistently predict the model's result" [159]. A more precise, perhaps even mathematical, definition is difficult to provide given the context and expertise-dependent nature of interpretability, but there is progress in this direction [73]. In any case, achieving interpretability will help us understand why a P(A)(D) model behaves the way it does (e.g., in terms of predictive performance; see Section 3.3). Such understanding can have not only profound epistemological, but also far-reaching ethical and social implications [232,73,209].
According to [209], we can distinguish between intrinsic and post-hoc interpretability. The former is related to the intelligibility of the P(A)(D) model itself (i.e., its structure and parameters), whereas the latter is related to the explainability of the PAD model's results using auxiliary methods, such as permutation feature importance for neural networks [317] or random forests [149]. However, there is a conceptual ambiguity regarding the term in the recent literature. Some accounts use explainability as a synonym for interpretability in general [209], while others use explainability to refer solely to post-hoc interpretability [38]. In our PAD model taxonomy, we view only intrinsic interpretability as a utility of the P(A)(D) model. Differently, post-hoc interpretability is a utility of an explanator that is applied to the original PAD model's results -in fact, the explanator may just be another, more interpretable P(A)(D) model that is used as a surrogate [38]. Accordingly, the following discussion focuses only on intrinsic interpretability, to which we hitherto refer simply as interpretability.
P model interpretability relates to the general meaning of its parameters, so it makes sense to differentiate between the interpretability of P I and P E models since the two model classes often put different demands on the epistemic value of their parameters. Further, as we will see below, there are P models whose interpretability can be influenced by both data D and approximator A. As such, it can be necessary to further distinguish the interpretability of P, PD, and PAD models.

Interpretability of P I models
In P I models, most parameters correspond to real-world quantities or emergent properties, whose meaning can be understood independently of the P I model that is used to estimate them (see Section 2.1). For example, in a harmonic oscillator [187], the object's mass that serves as a parameter carries a meaning independent of the differential equation that describes the oscillator's behavior. As such, while the transformations performed to generate data from an P I model are highly non-linear and often not analytically tractable [62], the interpretability of P I models tends to be high (at least in the eyes of domain experts in the field).
However, even for domain experts, it can be exceptionally challenging to predict the generative behavior of a high-dimensional P I model given a particular parameter configuration. This can be the case, even when a P model has a small number of readily interpretable parameters. Consider, for instance, the prototypical logistic map equation [198] given by and having only a single parameter ρ ∈ [0, 4] which can be interpreted as growth rate in population dynamics modeling [282]. Despite its beguilingly simple form, the logistic map is known to develop chaotic behavior as the parameter ρ varies in the range from approximately ρ ≈ 3.56995 to ρ ≈ 3.82843. The model's generative behavior in this range is characterized by a periodic phase, intercepted by bursts of aperiodic fluctuations.
And even though such behavior can be generally abstracted and described for a single parameter, for instance, with the help of bifurcation diagrams [119], it can quickly become less amenable to high-level descriptions when it results from the interaction of two or more parameters [10]. Unsurprisingly, Bayesian analysis of PD or PAD models based on an underlying chaotic P I model has long been recognized as a challenging endeavor [20], requiring sophisticated approximators with surrogate likelihoods [279]. As alluded to above, the interpretability of high-dimensional P I models will often depend on whether we focus on individual parameters and their functional role for data generation in isolation (i.e., first-order interpretability) or try to understand interactions between parameters as well as their joint contribution to the generation of y (i.e., higher-order interpretability). Accordingly, even for complex P I models with dozens of parameters, we may still retain relatively high first-order interpretability through the theoretical embedding of each individual parameter, but higher-order interpretability may suffer, since multiple parameters can act similarly on y and interact in surprising ways due to non-linearity. For instance, the compartmental model of the early COVID-19 pandemics in Germany set up by [248] has 34 free parameters, each of which has a direct isolated interpretation, for instance, infection rate, number of initially exposed people, weekly modulation, or probability of detection. However, the exact interplay between these parameters in determining the actual reported number of daily cases might not be immediately obvious from the understanding of individual parameters alone or from the model equations themselves.
Finally, the higher-order interpretability of P I models may change once they have been connected to data due to dependencies between parameters. Oftentimes, we choose a prior p(θ) which factorizes into independent components, reflecting our assumption of disentanglement or independent generative factors of variation. However, the resulting PD or PAD models will rarely conserve independence in their joint posteriors (e.g., due to loss of information or an inherent lack of disentanglement in the inverse model). A canonical example would be a strong posterior correlation between two parameters with initially independent priors, indicating that the parameters do not fulfil orthogonal functional roles for generating the data.

Interpretability of P E models
In P E models, the parameters do not need to correspond to real-world quantities or mechanisms. Rather, their meaning can often only be understood within the P E model they are part of [111]. The archetypal P E model is linear regression, where a regression coefficient β describes the linear relationship between a predictor variable and the response whilst holding all other predictors constant. As such, β has a clear meaning to an analyst with some statistical knowledge, provided that the measurement scales of predictor and response variables make sense for the task at hand. However, the requirement to hold all other predictors constant becomes impossible to fulfil if the predictors cannot be varied independently from each other, for example, because they are correlated in purely observational data or because some of them constitute interactions between already included predictors. As such, even for as few as four or five predictors, interpretability of their joint contribution becomes highly challenging unless predictors are mutually independent [209].
The use of non-linear, monotonic transformations in P E models, such as link functions in generalized linear models [215] or non-linear activation functions in neural networks [272] further complicates the interpretability of an originally linear predictor structure. For example, when using the logarithmic link (equivalently, the exponential response/activation function), the originally additive relationships become multiplicative, resulting in exponential growth, which is much harder to comprehend for humans [304]. This then reduces the interpretability of the P E model's parameters from both their signs and magnitudes to only their signs. If one were to apply non-monotonic transformations, the interpretability of the parameters' signs would be lost as well. In addition to non-linear transformations of the whole linear predictor term, every structural deviation from a (latent) linear structure further reduces interpretability. For example, interactions, polynomial terms, hierarchical structure [108], Gaussian processes [311,253], or splines [98,315] all make interpretation of a P E model's parameters harder, if not impossible in some cases.
The interpretability of a P E model may also be affected by the data utilized for parameter estimation. Accordingly, PD models may differ in their interpretability even if their underlying P E model is the same. For example, when employing shrinkage priors for high-dimensional linear P E models with lots of irrelevant predictors, the posterior of most regression coefficients will shrink to values very close to zero, effectively eliminating the corresponding predictors from the regression equation [241,325] (see also Section 3.6.1). If only a few coefficients are substantially different from zero, the interpretability of the resulting PD model would be much higher than that of the original P E model.
Finally, an approximator A may create a situation where a PA(D) model's interpretability deviates from that of the underlying P(D) model. However, that may only happen if the posterior approximation p A (θ | y) is qualitatively different from its analytic counterpart p(θ | y) due to an incomplete posterior exploration. A common case arises when the analytic posterior is multi-modal but the approximator collapses to a single mode [104]. Notably, mode collapse represents a case where the interpretability of the PAD model may be higher than that of the underlying PD model, at the cost of other utilities, such as predictive performance (see Section 3.3). An example of an P E model class that produces highly multi-modal posteriors are artificial neural networks [104,77,148]. While the interpretability of the underlying P E model is usually low [38,323,324], some of their PAD models can exhibit much higher interpretability if they are steered in the right direction [323,324].
The above-described notions of interpretability are largely qualitative. While some attempts at a quantitative treatment have been made [73], we are not aware of any sufficiently general definition that allows for a more objective, quantitative comparison between P(AD) models with respect to their interpretability. Thus, we hope that our PAD model taxonomy may inspire more focused research on the quantification of interpretability.

Convergence
Convergence is a utility of PA and PAD models which rely on complex approximators, such as MCMC, variational inference, or neural density estimators. As explained in Section 2.3, approximators provide certain guarantees under specific assumptions, such as infinite draws in the case of MCMC [110] or infinite training and representational capacity in the case of neural density estimators [246,247,269]. In practice, ECDF difference plots with 99%-confidence envelopes [284]. Both kinds of plots use the same posterior draws, but only the rightmost ECDF difference plot highlights that Chains 1 and 4 have some mixing problems for σ. Example draws obtained from the bayesplot R package [102].
however, modelers cannot wait a lifetime of infinity for approximators' promises to come true; for the time being, we need to work with finite posterior draws and non-convex optimization objectives teeming with local optima. Thus, our convergence utility pertains to the relative distance between a particular (finitely instantiated) PA(D) model and the optimal PA(D) model attainable under perfect conditions for A. For the above definition to be useful, we need a proxy measure of how close the current approximation is to the optimal approximator outcome. We call such measures convergence diagnostics and they are indispensable for ascertaining the validity of PA(D) models. Ideally, good convergence diagnostics should also indicate that the approximation is close to the analytic posterior, but only within the space of distributions the approximator can reach. Accordingly, the relation between convergence and analytic posterior approximation is only indirect for approximators that may be asymptotically biased [318,70]. Below, we briefly detail common convergence diagnostics for different types of approximators.

Convergence Diagnostics for Markov Chain Monte Carlo
Convergence diagnostics are fundamentally important for posterior approximators that rely on MCMC since these approximators can be arbitrarily bad before full convergence [172]. Thus, all model-based inference relies on the quality of the approximation being close enough to the analytic posterior with respect to some minimally required precision. For a quick graphical check, trace plots or ECDF difference plots [284] can be used, as illustrated in Figure 11.
In terms of numerical approaches, three related classes of MCMC convergence diagnostics are applied in today's practice, namely scale reduction factor R, effective sample size (ESS) and Monte Carlo standard error (MCSE) [109,61,256,91,110,74,299]. They all provide convergence measures for univariate quantities of interest ψ = ψ(θ) that are functions of the P model's parameters θ (see also Section 3.2). There is not a single "global" R, ESS, or MCSE measure for ψ, but one for each summary statistic T (ψ) of ψ, where T can be any posterior expectation or quantile [299]. As such, for example, a set of S posterior draws ψ (s) might yield a very precise estimate for the posterior mean of ψ, while at the same time, the estimates of some tail quantiles of ψ (e.g., 5% and 95% quantiles) have much less precision [299]. Accordingly, each of these convergence measures is a function of the quantity of interest ψ and the summary statistic T , computed from the S posterior draws ψ (s) .
Broadly speaking, the scale reduction factor R compares the between-chain variance B = B(f T (ψ)) to the within-chain variance W = W (f T (ψ)): where the dependence of B and W on T is realized by an appropriate transformation f T (ψ) that is applied to each posterior draw ψ (s) before the variances are computed, usually on split chains [110,299,48]. We can conclude that convergence has been reached if R ≈ 1, that is, if the within-chain variance dominates the between-chain variance. The ESS estimates the number of independent draws that contain the same amount of information about T (ψ) as the S dependent posterior draws obtained via an MCMC approximator. As a result, we usually see ESS < S, although the opposite can also happen in case of antithetic (negatively auto-correlated) chains [299]. We can obtain the ESS from all autocorrelations ρ t = ρ t (f T (ψ)) of lag t of the chains as where, in practice, we would truncate the infinite sum at some finite value [117]. In modern versions of ESS, ρ t implicitly depends also on R to take variation across chains into account [110,299]. In case of independent draws, we have p t = 0 such that ESS = S. The MCSE describes how much (reducible) uncertainty in T (ψ) remains due to the fact that we only have a finite set of dependent MCMC draws for estimation [91,74,110,299]. If T represents an expectation, we can write down the corresponding MCSE schematically as an overall variance V = V (f T (ψ)) across the S draws divided by the corresponding ESS [91,299]: MCSE estimates for quantiles need to be computed a little differently and are provided in [299]. Ideally, we should define convergence of MCMC as reaching or undercutting the maximal MCSE that we find minimally acceptable for the given summary of interest T (ψ). However, The MCSE is scale-dependent as it has the same scale as T (ψ), which requires an understanding of how much of an error is acceptable for a certain quantity, in the context of a particular model and research question. This inherently makes MCSE harder to use in practice and hence the scale-free alternatives R and ESS are often preferred [299].
All of the above measures are univariate in the sense that they only concern a univariate T applied to a univariate ψ. Recently, a more comprehensive measure, called R * [172], has been developed that measures convergence in a multivariate way across multiple model parameters or quantities of interest. It is able to detect non-convergence in the joint posterior that may be overlooked by only investigating convergence of a small, non-exhaustive set of univariate quantities [172]. This is achieved by training an expressive machine learning model (i.e., random forest) to predict chain indices from posterior draws. If the predictive performance of the machine learning model on (unseen) test draws does not exceed chance level, we can assume that the MCMC chains have converged.
In addition to all these sampler-agnostic convergence metrics, there are also few sampler-specific metrics. Most notably, this concerns divergent transitions in Hamiltonian Monte-Carlo (HMC) [24], where every occurring divergent transition in the Markov chain may bias the MCMC results and indicate difficulties of the sampler with exploring the target posterior. Divergent transitions tend to occur in regions of high curvature of the explored posterior; regions that most other MCMC samplers struggle to explore as well, only that they fail more silently compared to HMC [24].

Convergence Diagnostics for Optimization-Based Algorithms
Many classes of posterior approximators are based on optimization algorithms. The simplest of such approximators aim to find a single point estimate to approximate the analytic posterior, namely the posterior mode, also known as maximum a posteriori (MAP) estimate [110,189]. Variational inference (VI) approximators also use optimization, but instead of finding the MAP, they aim to find a parametric distribution (e.g., a multivariate Gaussian) that approximates the analytic posterior as closely as possible [94,252,30,310]. The optimization then targets the parameters of this parametric distribution (e.g., the means and standard deviations in Gaussian mean-field VI). Expectation propagation [EP,221,207,298] and integrated Laplace approximation [INLA,261,180,262] work in a conceptually similar fashion, but the structure of their parametric approximators and their target distributions are different (e.g., for INLA, the conditional posteriors of the parameters, instead of their joint posterior). Again, highly similar in terms of their use of optimization, neural approximators (e.g., invertible neural networks; [3,246]), use optimization to find the neural network parameters that yield the best posterior approximation within the generative scope of the network [229,184,126,231,246,269] (but see Section 3.8.5 for specifics in diagnosing convergence of amortized neural approximators).
Regardless of how optimization is applied for posterior approximation, the aim is always to find a single point in a potentially high dimensional space that leads to the best approximation of the analytic posterior within the set of realizable approximations. Accordingly, all traditional convergence criteria for iterative point optimization apply. That is, for non-stochastic optimization algorithms (e.g., gradient-decent or L-BFGS; [217]), small absolute or relative changes in the point estimate, small absolute or relative changes in the target function, or small absolute or relative closeness of the target function's gradient to zero (if the gradient is available) [217,280], would indicate convergence. For stochastic optimization algorithms (e.g., stochastic gradient-decent or more sophisticated versions, such as Adam; [217,161]), measuring convergence becomes less straightforward due to the stochasticity in the objective's trajectories. If the step size is held constant, they yield a Markov chain around the target point, once the algorithm comes close enough, instead of converging directly to the target [250,84]. The latter implies that MCMC convergence diagnostics, in particular R, can be applied to diagnose convergence of stochastic optimization algorithms [70].

Convergence Diagnostics for Sequential Monte Carlo
Sequential Monte Carlo (SMC; aka particle filtering) comprises a heterogeneous class of posterior approximators for PD models whose underlying P models can be expressed in the form of a sequence of conditional distributions (i.e., time series P models) [75,68]. Most SMC samplers can be shown to provide asymptotically correct inference as the number of draws (particles) approaches infinity [64]. However, empirical convergence diagnostics in the pre-asymptotic regime appear to be relatively scarce still [63,175,64]. Perhaps this is because SMC approximators consist of multiple iteratively applied components [64], each with their own pre-asymptotic behavior requiring their own local convergence diagnostics: To assess the convergence of importance sampled (IS) particles at a given step, ESS estimates for weighted samples [169,326] or variance measures driven by the number of siblings per particle (i.e., the number of particles with the same ancestor node at step zero) [175] can be applied. The trustworthiness of the IS weights themselves could be diagnosed via the Pareto-k-diagnostic of Pareto-smoothed importance sampling (PSIS; [300,47]), although we are not aware this has been tried so far in the context of SMC (for a closely related application, see [47]). Convergence of MCMC kernels that are part of many SMC algorithms [64] could be assessed via MCMC convergence diagnostics (see Section 3.8.1). While each of these diagnostics may be locally informative for a given SMC component at a given step, whether and how they convey global convergence to the target joint posterior remains to be studied further.

Convergence Diagnostics for Approximate Bayesian Computation
The standard ABC rejection algorithm [259,71,286,243] requires a distance function which quantifies the difference between simulated data y (generated from a P model with a particular parameter configuration θ) and observed dataỹ. Further, it needs a tunable tolerance level ϵ according to which the algorithm rejects a fraction of 1 − ϵ simulated parameter values. The algorithm then keeps the remaining parameter values as random draws from an approximate posterior p ϵ (θ | y).  Figure 12: Detecting model misspecification (i.e., simulation gaps) with amortized neural approximators [269]. A summary (aka embedding) network H transforms the typical generative set T (P) of a P model (i.e., the finite set of "in-distribution" data simulations that a P model typically generates) into the typical set of a simple distribution (e.g., multivariate Gaussian). Discrepancies between the model-implied data distribution p(y) and the true data distribution p * (y) manifest themselves as detectable anomalies, causing potential posterior errors by the inference network F. We can detect these anomalies via standard our-ofdistribution (OOD) detection techniques.
The ESS of standard ABC rejection samplers is thus typically equal to (1 − ϵ) S, with S denoting the total simulation budget, since vanilla ABC samplers perform independent sampling. However, this does not mean that their sampling efficiency is particularly appealing, especially for high-dimensional P models. That is because ABC samplers notoriously suffer from the curse of dimensionality: Most simulated data sets from a high-dimensional P model will be rejected and so it becomes challenging to obtain enough random draws from p ϵ (θ | y) for a reasonable reduction of the MCSE.
More sophisticated ABC algorithms, such as ABC-SMC [275,165] or ABC-MCMC [194,289] alleviate some of these issues and inherit the convergence diagnostics of SMC and MCMC. However, whenever handcrafting of distance functions and summary statistics of the data H(ỹ) (i.e., dimensionality reduction) is involved, ABC algorithms can converge at best to p A (θ | H(ỹ)). This issue does not exist whenever H(ỹ) is a sufficient summary statistic, but it can potentially lead to overestimation of V (f T (ψ)) if H(ỹ) results in considerable loss of information about the parameters θ.
Recent work on ABC focuses on building robust ABC approximators and exploring the possibility of utilizing hand-crafted summary statistics as a key element of misspecification analysis and error correction [95,196]. A related line of work suggests comparing posterior moments recovered by differently configured ABC approximators as an empirical diagnostic [96]. It remains an interesting open question whether similar "ensemble approaches" can be generalized to other approximators for simulation-based inference, such as amortized neural surrogates [246,126,230] or ABC with learned summary statistics [56,151].

Convergence Diagnostics for Amortized Approximators
In contrast to MCMC-based approximators, there are no standard convergence diagnostics for amortized approximators yet, as the latter are grounded in diverse, and still fast-evolving theoretical frameworks. However, whenever we employ neural posterior approximators, we can resort to convergence checks used commonly in deep learning applications [124]; see also Section 3.8.2.
Since most modern neural architectures are trained using some form of stochastic gradient-based optimization, optional stopping (i.e., discontinuation of training once the cost function does not improve over some tolerance period) and other convergence heuristics can be used to determine when a neural approximator has reached a stable local minimum of its cost function. That said, due to the nature of non-convex optimization, simply assessing local convergence is not enough for trusting PA(D) models coupled with amortized neural approximators.
Moreover, amortized neural approximators require simulations to be faithful proxies of reality and might yield arbitrarily bad posterior approximations when confronted with data that are atypical under the assumed P model [269]. The latter case is also known as a simulation gap and it occurs when a P model does not accurately represent the behavior of the modeled real-world system (i.e., when the model is misspecified). Consequently, amortized approximators must be able to detect simulation gaps and potential posterior errors, so that they can warn users about suspicious input data and resulting inference.
Generally, there are two broad types of empirical convergence diagnostics we need to utilize in the context of amortized neural approximators: those of a PA model and those of a PAD model. Model-agnostic tools, such as different variants of SBC (see Section 3.2.3), are only applicable to PA models, as they assume that we have access to the actual data-generating parameters. On the other hand, if the neural approximator is a generative neural network [245,126], the latent space can be used as a source of convergence information for the PA model. For instance, flow-based networks [162,231] are trained to transform an intractable posterior into a simple base distribution from which random draws can be easily obtained. Thus, convergence of the PA model can generally be determined by a divergence between the prescribed and the learned base distribution.
Unfortunately, neither of the above PA diagnostics can tell us whether the corresponding PAD model will be able to yield faithful estimation due to potential discrepancies between the simulation model and reality (see above). Thus, further diagnostics are necessary to promote the trustworthiness of amortized posteriors. One such diagnostic is the maximum mean discrepancy [MMD,127] between summary statistics of simulated and real data which tells us whether the observed data belongs to the typical generative set of the simulator or not [269, cf. Figure 12]. As the field of simulation-based inference is still in its infancy, we expect a rapid development of convergence diagnostics for amortized approximators in the future.

Estimation Speed
For non-amortized approximators (cf. Figure 3), we can define Estimation Speed as the time from the start of running the approximator A of a PAD model (or a particular instance of a PA model) until convergence, defined by approximator-specific convergence diagnostics (see Section 3.8). In certain cases (see Section 4), it may be sensible to define estimation speed less strictly as the time until termination of the approximator run after which useful results are obtained, without necessarily having achieved convergence. For a given PAD model, estimation speed tends to vary by several orders of magnitude across different classes of approximators. For example, MAP estimators, VI, or other (non-amortized) optimization-based approximators, will usually require a fraction of what sampling-based approximators such as MCMC or SMC need [261,40]. When considering estimation speed in isolation, there is no doubt that "faster is better". However, to obtain faster approximators, we often have to give up accuracy or asymptotic guarantees of the resulting posterior approximation [94,318]. Thus, increasing speed by changing the approximator class may have an adverse effect on other utilities of PA(D) models, specifically on parameter recoverability and predictive performance (see Sections 3.2 and 3.3).
Within a given class of approximators, hyperparameter choices can greatly affect the estimation speed as well [139,318,246]. As an example, consider static HMC where the number of leapfrog steps per Markov transition has to be chosen a priori [139,24]. On the one hand, if the number of leapfrog steps is too small, MCMC draws will be highly auto-correlated and thus more draws are required to achieve convergence. On the other hand, if the number of leapfrog steps is too large, a lot of computation time is wasted by unnecessary leapfrog steps; or auto-correlation might even get worse again when the HMC sampler eventually makes a so-called "U-turn" to come back to its starting point [139]. Such problems due to hyperparameter choice can be mitigated by automatically tuning hyperparameters in a "warm-up" phase or adapting them on the fly, conditional on the local geometry of the approximated analytic posterior. For example, when using the No-U-turn sampler [NUTS,139], a generalization of HMC, the number of leapfrog steps is adaptive. It removes the requirement for the user to choose this hyperparameter manually and may even have better estimation speed than optimally tuned, static HMC [139].
The above definition of estimation speed is straightforward but can be misleading in practice if convergence is not achieved (or unachievable) within a given run of A until its termination, such that A has to be restarted [113]. A typical reason is sub-optimal choices of A's hyperparameters, for example, if the leapfrog step size in HMC is too large leading to divergent transitions whose occurrence implies irrecoverable non-convergence for the current approximator run [24]. Thus, estimation speed in practice may be strongly affected by an approximator's ability to run reliably out of the box without much tuning. Tuning demand can be reduced by adapting hyperparameters automatically on the fly or by having only a small number of sensitive hyperparameters (see Section 3.10). The latter property can further be understood as determining approximator parsimony (see Section 3.6.2).
A manually but skilfully tuned approximator A 1 might beat an auto-tuned approximator A 2 in terms of estimation speed when considering only the final, converging run. However, the overall time (including failed runs) it can take to get A 1 to this optimal state may easily more than offset its final speed advantage. As a result, in an honest comparison of practical estimation time, it may be A 2 that comes out ahead by a substantial margin. Along similar lines, the particular P model implementation may be more or less favourable for different approximators, which can also strongly affect estimation times [24,299,16,66].
Sometimes, the reason for an approximator's termination without convergence may also lie in the computational environment, for example, time or memory constraints on a computing cluster that limit the resources available for a single job. For example, if the estimation of a PA(D) model takes longer than expected, estimation might be terminated prematurely, in the worst case leaving no intermediate result to restart from. In this scenario, even if the second run would then be successful, we still had to deal with at least a doubled estimation time. Accordingly, both the predictability of the expected resource requirements and the small variance in resources between repeated runs of the same PA(D) model can imply substantial practical speed improvements.

Sampling Efficiency
For sampling-based approximators, convergence in terms of reaching a given MCSE value (for given quantities of interest and summary statistics), is strongly application-dependent, and so is the estimation speed associated with it (see also Section 3.8.1). For more general comparisons of sampling-based approximators, the concept of sampling efficiency is easier to handle and we define it as the average ESS per unit time (for a given quantity of interest ψ and summary statistic T ): where t start and t end are the start and end times of the approximator run, respectively. While most approximators, even optimization-based ones, can be used to obtain posterior draws upon convergence [261,94], we restrict the class of sampling-based approximators to those that return only posterior draws as their immediate endpoint instead of the parameters of (closed-form) density functions. MCMC, SMC, and rejection sampling are the most important members of this class [259,110,64]. Within a class of sampling-based approximators, say MCMC, the same convergence diagnostics, in particular the same ESS, can be applied to all competitors, which simplifies comparisons [16,66]. Here it is important to not only use the same implementation for these diagnostics across all approximators, but also to ensure that this implementation follows the current state-of-the-art of diagnostic development [295]. Otherwise, comparisons may be biased by outdated diagnostics. Additionally, one needs to verify empirically that the obtained stationary distribution for a given PAD model is the same for all the compared approximators. Otherwise, sampling efficiency will be misleading since at least one approximator would not have estimated the analytic posterior of the underlying PD model well enough. Comparisons between approximators belonging to different sampling-based classes may require even more care to ensure that ESS diagnostics across classes are comparable, for example, when comparing MCMC with SMC approximators.
Whenever we are performing sampling efficiency comparisons for PA instead of PAD models, we not only have, in principle, an infinite number of data sets as a basis for comparison but can utilize SBC to falsify the correctness of the achieved stationary distributions (see Section 3.2.3). However, when we investigate sampling efficiency on a set of simulated data sets, different data-generation scenarios should be studied, since well-specified P models may have different efficiency than misspecified P models.
When studying sampling efficiency, the same practical caveats apply as for estimation speed in general. For instance, to achieve a comparison that is ecologically valid for real-world situations, we have to investigate practical sampling efficiency that considers both optimized and non-optimized P model implementations, as well as failed or prematurely terminated approximator runs.

Estimation Speed of Amortized Approximators
Amortized approximators, such as the pre-paid estimation method [204] or neural density estimation methods [245,126], require a slightly modified view on estimation speed, since they tend to split inference into two phases (cf. Figure 2). Convergence in the context of amortized neural approximators typically happens before any posterior draws have been obtained [303,123], so the primary computational load falls into the upfront simulation-based training phase. In contrast, the computational cost of applying a pre-trained amortized approximator to obtain thousands of posterior draws or perform density estimation on real data is typically negligible and only a matter of seconds even for high-dimensional posteriors [245]. In this way, amortized approximators can be extremely useful for studying the (global) information gain (see Section 3.2.1) or calibration (see Section 3.2.3) as part of the parameter recoverability utility of PA models, since these demand inference on many, potentially thousands of data sets simulated from the underlying P model.
Due to the properties of amortized approximators, we can modify the definition of estimation speed as the time until convergence of the training phase plus the time for obtaining a sufficient number of posterior draws on real data to reduce the MCSE beyond a pre-defined threshold. In this context, estimation speed will greatly depend on the simulation time, that is, the computational cost of performing a sufficient number of model simulations. Some amortized methods make it possible to further subdivide estimation speed into three parts: simulation time, training time, and inference time 6 , for instance, when using BayesFlow in an offline training regime [246] or when applying sequential neural estimators [126] with the prior as a sole proposal distribution throughout training.
In any case, estimation speed for amortized approximators will be dominated by the time spent before obtaining posterior draws. Accordingly, it is often important to determine the break-even point between an amortized and a non-amortized method, that is, after how many observed data sets does the training effort amortize in terms of ESS per unit time? Naturally, this break-even point will heavily depend on the modeling context. For some P models, the break-even point between neural estimation and ABC can occur after as few as 5 or even fewer observed data sets [247], but it can also occur only after as many as dozens when comparing different neural approximators [245]. In addition, amortized neural samplers can often yield independent posterior draws upon convergence [245,123,126], so their sampling efficiency during the inference phase (cf. Figure 2) will be generally superior to non-amortized approximators (i.e., stateful samplers) yielding dependent draws.
Importantly, comparisons between amortized and non-amortized approximators (but also comparisons within the same A class) should take implementation factors into account. For instance, the estimation speed of neural approximators will be greatly enhanced by using GPU parallelization and even standard ABC rejection samplers can be quite efficient when run on a computing cluster with hundreds of nodes [165]. For simulation-based inference, the implementation of the simulation model presents a further potential bottleneck, which can be alleviated via parallelization, model reformulation reducing the algorithmic complexity of the simulator, or calibration of large-scale simulators via simpler surrogates [186]. In addition, recent hybrid methods employ a mixture of amortized and non-amortized components, such as amortized likelihood ratio approximators within non-amortized MCMC [136] or neural likelihood surrogates [230]. These hybrid methods blur the distinction between amortized and non-amortized methods and render the definition of estimation speed even more challenging. The dependence on these various implementation factors should make us wary of comparisons between approximators in terms of estimation speed and appreciate the challenges of building scalable PA(D) models.

Robustness
A common question that arises when we discuss substantive conclusions derived from model-based inference is how fragile these conclusions are with respect to crucial aspects of P, A, or D. Can we "break" the analysis by a barely perceptible change in the data or by using a slightly different approximator? Or are the main results of the analysis largely impervious to such seemingly unsubstantial changes? The Robustness utility attempts to answer such questions by measuring how much a PA, PD, or PAD model's implications change as we (systematically) perturb some of its components.
In the above definition, we use the term "component" very generally. It can refer to (structural) aspects of the P model, most notably, to priors or their hyperparameters [69,154,244,18,86,87,257] or to aspects of the likelihood function [18,34]. It can also refer to the choice of data D, for example, the percentage of left-out observations [154,219,300,297], or to hyperparameters of the posterior approximator A [139,246,75,318]. Thus, the term essentially refers to any aspect in which a P(A)(D) model can be sensibly modified.
More formally, we want to investigate the sensitivity (inverse robustness) of the posterior of some quantity ψ = ψ(α) with respect to some variable α which exerts a potential influence on a component of interest.
If we are only interested in investigating the sensitivity of a specific (point) summary of the posterior, we convey this by writing T (ψ(α)) for an arbitrary summary statistic T .
For example, we can study likelihood or prior sensitivity by power-scaling the respective components of the P model, that is, replacing the joint model p(y | θ) p(θ) with p(y | θ) α p(θ) or p(y | θ) p(θ) α , respectively [219,144,28,154]. Of course, one can also choose to power-scale only parts of the likelihood or parts of the joint prior. Although it is just one of many ways to systematically perturb a P model, power-scaling is a popular approach due to its simplicity and natural integration with existing workflows [28,154].
Regardless of the exact perturbation method, we can define (local) sensitivity as a measurable distance (represented by a function f ) between the results of the current P(A)(D) model at value α 0 and an alternative value α 1 that implies a different P(A)(D) model, diverging from the original one only in the choice of α [154,257]: For example, if T were a posterior expectation or a quantile of some univariate quantity ψ and α were a hyperparameter of the prior p(ψ) = p(ψ | α), then f could simply be the absolute difference between these expectations or quantiles as implied by α 0 and α 1 , respectively. This definition can further be generalized to sets of alternative α values in the neighborhood of α 0 [257]. If α can be perturbed continuously (e.g., using power-scaling) and if T (ψ(α)) is differentiable at α 0 , we may also define sensitivity as a function of the derivative of T (ψ(α)) evaluated at α 0 [154,244]: The latter definition has the advantage that no further value α 1 has to be chosen, but it has a smaller range of applicability and potentially more difficult interpretation. In both of the above definitions, we can always choose f such that the sensitivity is non-negative with a value of 0 indicating complete insensitivity. For complex models, small amounts of sensitivity are almost always expected but may not practically matter. Accordingly, we would say that T (ψ(α) is practically sensitive with respect to α if for some chosen threshold δ that depends on the sensitivity measure and the modeling context [154]. For example, let T denote a posterior mean and ψ denote a standardized effect size that we would deem sensitive if a change exceeds δ = 0.2 standard deviation units. Then, the results would be practically sensitive to the given perturbation, if changing α from α 0 to α 1 implied |T (ψ(α 0 )) − T (ψ(α 1 ))| > 0.2. When evaluating practical sensitivity related to hyperparameter choice within a class of posterior approximators, robustness is highly desirable, since approximators should ideally converge to the same target (see Section 3.8). What is more, as PA models grow in complexity, analyses based on a single approximator Table 3: Applicability of the ten utility dimensions for each model class of the PAD-taxonomy.
may gain trustworthiness by some form of multiverse analysis employing multiple approximators [305]. However, it is currently unclear how to systematically weigh the relative contribution of different approximators when trying to aggregate results from multiverse analysis, since some approximators might yield very poor posterior approximations and thus skew any substantial conclusion. Differently, when it comes to perturbations in P model assumptions or the observed data, neither practical sensitivity nor insensitivity is desirable per se. Rather, we would like results to be practically robust to perturbations if (a) the perturbations affect only nuisance components of the P(A)D models that are equally justifiable within the given context, or (b) if the perturbations are so small that they could very well have occurred due to uncontrolled or uncontrollable influences. In contrast, when different P model assumptions represent competing substantive theories of interest, we want the corresponding P(A)D models to be sensitive to these assumptions.
Examples for (a) include different choices of non-equivalent likelihood families that have an overall similar complexity (e.g., Log-Normal vs. Gamma distribution for continuous positive data) or different P(A)D models that are capable of similar predictive performance (see Section 3.3), in case the latter is not already the quantity of interest itself. Examples for (b) include adding small amounts of noise to the data [192], leaving out a small subset of the data [300,297], or slightly changing prior hyperparameters when the goal is to specify weakly informative priors [154]. As the magnitude of the perturbations increases, we expect results to become practically sensitive to these perturbations and observed insensitivity would then be a reason for concern. For example, if drastically increasing the amount of data would not reduce the posterior standard deviation of ψ, this would be an indication of empirical non-identifiability ( [111]; see also Section 3.2.1) or an error in our model code [113].
Another type of sensitivity, arising in modeling dynamic systems, is sensitivity to initial conditions [182,271], which in our taxonomy can be understood as part of P I models. Sensitivity to initial conditions, popularly known as the butterfly effect, implies that an arbitrarily small change in initial conditions can result in considerably different subsequent system states (or observed trajectories). Moreover, this type of sensitivity can be considered as a hallmark of deterministic chaos [271]. In the context of dynamic models, the so-called Lyapunov exponent can measure a model's sensitivity to initial conditions [14]. Lyapunov exponents characterize the rate of exponential divergence from perturbed initial conditions and the maximal Lyapunov exponent can be used to summarize the overall sensitivity of a model into a single number [155]. For more details on dynamic systems and deterministic chaos, we refer the interested readers to [271,32].

Intermediate Summary II
In the preceding sections, we have proposed to focus on ten general utility dimensions pertaining to the different Bayesian model classes within our PAD model taxonomy. Table 3 summarizes the applicability of each utility dimension to each model class; the justification for this classification can be gathered from the treatment of individual utilities. Throughout, we have tried to elucidate the rather diverse facets of these utilities compactly, while providing a comprehensive list of references for the interested readers. We believe that considerations regarding these utilities are already implicit in much of the applied Bayesian literature. However, besides disambiguation of core concepts, a major goal of collecting these utilities has been to stimulate their explicit consideration in Bayesian workflows. In the following section, we will highlight the interactions between some of these utilities and discuss their relative importance in terms of applicationdependent utility hierarchies.

Utility Hierarchies and Trade-offs
For comprehensive Bayesian model comparison across all utility dimensions, we need a thorough understanding of how the latter relate to each other. Thus, we begin by highlighting important connections between selected utilities. Some connections have been discussed already in different places in Section 3, but we summarize some of them here again for the reader's convenience.

Interplay Between Utilities
Predictive Performance and Parameter Recoverability. Any modeling goal can be summarized by a set of quantities whose inferred model-based approximations are then used in subsequent decision-making. These quantities of interest may be manifest (observable) or latent (unobservable), leading either to a focus on predictive performance (observable quantities) or parameter recoverability (latent quantities). Statistically, these two utilities can be evaluated similarly by comparing model-based approximations with their real-world or ground-truth counterparts. However, despite the statistical similarity in evaluation, prioritizing either of the two leads to substantially different modeling workflows, as detailed further below.
Parameter Recoverability and Convergence. Both parameter recoverability and convergence aim to quantify the difference between model-based results and some ideal theoretical target that we want to approximate as best as possible. However, the two utilities differ in what the ideal target is. For parameter recoverability, it is a known ground-truth from which the data D was implicitly or explicitly generated. For convergence, it is the best possible posterior approximation that a particular approximator A can achieve for a given PD model. Indeed, these utilities are different for two main reasons. First, even the analytic posterior of a PD model may perform very poorly in terms of parameter recovery, for example, when D contains too little information relative to the complexity of P. Second, for biased approximators, the ideal convergence target is not even the analytic posterior itself but rather the "closest possible" distribution within the scope of A.
Convergence and Estimation Speed. In most cases, convergence determines the definition of estimation speed by defining the latter as the time from the start of running the approximator to convergence. While this definition usually works well, there are some scenarios where it falls short. First, sometimes we may want to define estimation speed less strictly as the time until termination of the approximator run after which useful results can be obtained. This definition can be sensible especially when the goal is to achieve good predictive performance. If that goal was already achieved with a formally non-convergent model, there may be no need to bother with convergence anymore. Second, when using amortized approximators, we split the approximation process into a time-intensive training step and a subsequent inference step, which is then almost instant. In this context, convergence can be easily defined only for the training step, while estimation speed has clear definitions both for the training and inference step.
Causal Consistency and Fairness. Fairness is not detached from causal considerations. For example, measurement fairness aims to ensure that the items in a test battery measuring a psychometric construct relate to this construct in the same way for all groups differing in protected attributes. In the associated causal graph, the causal effect of the latent construct on the measured items should be conditionally independent of the protected attributes given the unprotected attributes. However, fairness considerations reach beyond causality, as they carry important ethical, political, and societal aspects that causal modeling alone cannot appropriately account for (i.e., questions of fact vs. questions of value).
Causal Consistency and Structural Faithfulness. Both causality and structural faithfulness aim to represent an empirical phenomenon or the characteristics and constraints of an assumed stochastic process as closely as possible via an adequate P model. However, they differ in what aspects of the process they try to encompass. Causal consistency focuses on the specification of conditional independence between (sets of) variables given (sets of) other variables. This structure is then reflected in the P model's probabilistic factorization without referring to specific distributions or functional forms. Differently, structural faithfulness deals with the details of the P model itself, for example, what functional forms connect the conditionally dependent variables or which probability distribution they follow.
Structural Faithfulness and Parsimony. Changing structural faithfulness affects parsimony although in different directions depending on the kind of P models being considered and how structural faithfulness is increased. For example, when adding physical constraints such as symmetries, parsimony is increased along with structural faithfulness as the model does not have to learn the constraints from data. In contrast, when modeling additional probabilistic structures (e.g., of time and space), parsimony is usually decreased as new parameters have to be added to the P model to account for these structures.
Parsimony and Interpretability. Higher parsimony is often associated with higher interpretability. However, there are notable expectations. For example, a highly regularized P model (e.g., through continuous shrinkage priors) may be considered highly parsimonious in terms of the effective number of parameters, but at the same time remain largely opaque because the parameter dimensionality itself remains high. As another example, low dimensional yet highly non-linear systems may feature directly interpretable parameters (e.g., growth rate in logistic map models), but their joint influence on the system's behavior may be very hard to understand without the aid of model simulations.

Utility Trees
Having highlighted some key connections between utilities, we will now move on to evaluating the utilities' relative importance depending on the goal of inference. We will differentiate between a utility hierarchy, where one utility is strictly more important than another, and a utility trade-off, where we can achieve a gain of one utility at the cost of a loss of another. Utility hierarchies, utility trade-offs, as well as the relative importance of different utilities, are inevitably application-specific and contingent on the particular modeling goals.
The first branching point in ranking the different utilities is whether we are interested in observable or latent quantities for subsequent decision-making, leading to what we call observable and latent inferential goals. This distinction is equivalent to focusing on either predictive performance or parameter recoverability as a primary utility. Related binary perspectives have been put forward, for example, as the difference between two "statistical cultures" [37] or the prediction-explanation dilemma [320,273]. However, the distinction between the two goals has not been discussed in the context of explicit model utilities. Below, we present two utility trees defining hierarchies and trade-offs for the two kinds of inferential goals.
The way we would like to see these utility trees being used in practice is that analysts (a) build and improve their models in a way that respects utility hierarchies and (b) talk explicitly about the utility trade-offs they have been making in the process. This should enable users of the models and consumers of statistical inference to understand which model-building decisions have been made, why they have been made, and how they affect the trustworthiness of model-based decisions.

Utility Tree for Observable Inferential Goals
The workflow for observable inferential goals centers around the predictive performance of PAD models as a central utility. This is the prevalent perspective on modeling in machine learning research. Given its practical nature, this perspective would require a practically usable representation of a posterior distribution from which predictions can subsequently be obtained, hence the focus on PAD models. Below, we discuss our proposed model utility tree for observable inferential goals (see Figure 13).

Primary Utilities
Fairness: The fact that predictive performance is the central utility under this perspective does not mean that it would be the sole or even the most important utility to consider. Rather, on the top of the utility hierarchy, we need to check whether the predictive goals concern certain aspects related to fairness. If they do, our PAD model needs to satisfy the fairness criteria agreed upon in the corresponding domain, otherwise, it would be considered invalid from an ethical and/or legal perspective (see Section 3.4), regardless of how good its predictive performance is. If fairness concerns do not apply to the particular PAD model, the fairness utility can be circumvented.

Secondary Utilities
The secondary level of our predictive utility tree includes (in addition to predictive performance itself), estimation speed, interpretability, and robustness (in alphabetical order) as utilities across which trade-offs can be made. Notably, we do not require convergence of the PAD here and view estimation speed simply until termination of the approximator, regardless of whether or not convergence had been reached (see Section 3.9). While the three additional central utilities may exhibit trade-offs among each other, increasing them may in particular justify (some) reduction in predictive performance.
Estimation speed : If achieving high predictive performance requires either a very high dimensional parameter space (e.g., as in a neural network) or the repeated evaluation of a complex simulator (e.g., in a differential equation-based mechanistic P I model), then estimation speed will exhibit a trade-off with predictive performance. In other words, PAD models with easy-to-evaluate or easy-to-simulate likelihoods may obtain faster posterior approximation, but may also yield worse predictive performance. Whenever estimation speed becomes prohibitive for the practical application of a PAD model, the context may justify the use of another P model, even if the latter sacrifices (some) predictive performance. The same logic can justify using approximators that speed up the approximation of a PAD model, even at the expense of losing (some) predictive performance (e.g., using VI instead of MCMC-based approximators; see [29] or Section 2.3).
Interpretability: Interpretability is often higher in P models with lower parameter dimensionality and linear structure (see Section 3.7). However, depending on the complexity of the real data generator, low-dimensional and/or linear(ish) models may have worse predictive performance than higher-dimensional and/or more non-linear models. Yet, even if predictive performance is the main goal, it may still be legitimate (or even legally required) to use a more interpretable PAD model, even at the cost of some predictive performance.
Robustness: A PAD model yielding high predictive performance on some test data may yield surprisingly poor predictions on slightly modified data (e.g., adversarial attacks on deep neural networks, see [2]). In a similar vein, a PAD model that is well predicting given some reasonable initial values of an approximator A may deliver worse predictions for some other (equally reasonable) initial values [177]. Both of these sensitivity types are not desirable and it can be legitimate to sacrifice (some) predictive performance for an increase in robustness against small perturbations.

Tertiary Utilities
At the third level of the predictive performance tree, we find supporting utilities that may serve as proxies for the central utilities. Specifically, these include causal consistency, convergence, parameter recoverability, parsimony, and structural faithfulness (in alphabetical order). Tertiary utilities are often easier to evaluate and available "early" in the model-building workflow, for example, when they are only requiring a P model instead of a PAD model. Using these utilities can thus help speed up the model-building process. However, these supporting utilities should only guide final modeling choices whenever multiple models are equally justifiable with respect to primary and secondary utilities.
Causal consistency: If the quantities of interest are purely predictive, enforcing a P model to satisfy causal consistency (or even thinking about a causal graph in the first place) is not required and may even have detrimental effects on predictions [202]. In other words, for predictions, it would usually be entirely irrelevant how the association between two variables came to happen, as long as the input variables are predictive of the outcome variables. However, causal consistency can still be a supporting utility to reduce the a priori admissible model space by ruling out variables or interactions, as well as related P model terms, for which a causal graph implies a lack of relation to the target variables (i.e., no path between covariates and target, regardless of path direction). Considering a genetic association study as an example, we can rule out gene areas that only encode genes whose effects are known and understood to have no plausible relationship to the phenotypes being predicted [306].
Convergence: Convergence is not strictly required in the predictive utility tree, since a non-converged PAD model may still exhibit satisfactory predictive performance. That said, achieving convergence will likely imply an improvement in central utilities as well. Not only is this true for predictive performance itself, but also for robustness. For instance, a non-converged PAD model may vary arbitrarily from another non-converged PAD model, whereas we can expect them to be (more) similar upon convergence, at least for approximators that have the potential to explore the full posterior (see Section 2.3). Accordingly, before studying central utilities that may be costly to evaluate, we can use convergence as a shortcut to rule out PAD models with low potential to score high in those central utilities.
Parameter recoverability: Parameter recoverability can indirectly enhance predictive performance since nearly non-identifiable P models and poorly calibrated PA models can be discarded early in a model-building workflow. Such models can neither yield good predictions, nor trustworthy uncertainty representation, as some information gain is necessary to achieve posterior predictions that are different from prior predictions (see Section 3.2.1). However, strict parameter recoverability still plays a secondary role for the central goal of achieving good predictions. For instance, highly over-parameterized P models may achieve zero Bayesian surprise (i.e., no difference between prior and posterior) for a large fraction of their parameters, but still yield reasonable predictions based on the few identifiable parameters [248].
Parsimony: In addition to having aesthetical value in itself (see Section 3.6), parsimony as a supporting utility bears close relations to estimation speed, interpretability and predictive performance: More parsimonious models tend to be (a) faster to estimate, at least when comparing P models that are nested (i.e., one model is a special case of the other), (b) more interpretable, as fewer parameters have to be considered simultaneously, and (c) less prone to overfitting (although they may be prone to underfitting). Some forms of parsimony (both plain number of parameters and a priori effective number of parameters; see Section 3.6.1) Figure 14: Utility tree for model-based inference of latent quantities (e.g., parameter estimation).
are available before running any approximator. Correspondingly, we can utilize parsimony as an a priori proxy for the central utilities.
Structural faithfulness: Structural faithfulness comprises several P model characteristics that we ideally know and understand before running any approximator: variable scales, probabilistic symmetries, and physical constraints (see Section 3.5). Structural faithfulness is related to multiple central utilities. Most importantly this concerns predictive performance, as structurally faithful models are more likely to predict more accurately while requiring less data and showing better uncertainty calibration [251,312]. But structural faithfulness is also related to estimation speed (for the better or worse; [108,40]) as well as robustness, for example, small perturbations of the training data [192]. Accordingly, we can also treat structural faithfulness as a proxy for central utilities to reduce the a priori considered model space to (sufficiently) structurally faithful models.

Utility Tree for Latent Inferential Goals
The utility tree for latent inferential goals centers around the parameter recoverability of P or PD models as the central utility. Modeling goals following this perspective are almost entirely of theoretical, epistemic nature, and so the approximator is not itself part of the modeling goal. Yet, in practice, we will still almost always rely on PA and PAD models for practical evaluation, hence the indispensable role of the approximator. Below, we discuss our proposed model utility tree for latent inferential goals (see Figure 14).

Primary Utilities
Fairness: Once again, on top of the hierarchy, we find fairness for ethical and/or legal reasons whenever the modeling context has fairness-related implications (see Section 3.4). First, at an individual (person-specific) level, we need to ensure that estimated latent parameters are fair with regard to individuals of protected groups. Second, at a more general (person-unspecific) level, we need to keep in mind how our inferences about latent parameters might trigger political decisions or societal processes affecting protected groups.
Causal consistency: Next, we need to ensure that the P model is causally consistent with the assumed, and theoretically justified, causal graph (see Section 3.1). We argue that thinking about causal consistency is required for any latent modeling goal. Even if studies do not engage in (sufficient) causal analysis, and then correctly state that their results cannot be interpreted causally, there remains the (perhaps implicit) wish that a causal claim would be possible. What is more, even a pure measurement goal (e.g., estimating intelligence or personality traits without the need to relate latent parameters to each other) would need a causal model to decide and justify which observable variables to use for the estimation of the latent variables [156]. Finally, even if one might find a latent inferential goal that would be honestly satisfied with association only, causal analysis and discussion would still be required to prevent people from interpreting results causally.
Convergence: Convergence of PAD models is a prerequisite for any practically trustworthy result of a latent inferential goal because we have no external validation criterion available during inference on real data as we would have when considering observable inferential goals. In fact, before convergence, posterior approximations may be almost arbitrarily incorrect, regardless of the kind of approximator being used (see Section 3.8). Specifically, for asymptotically biased approximators (e.g., VI), the approximated posterior upon convergence may still be a bad representation of the analytic posterior if the expressive scope of the approximator is limited. But even in such cases, a converged approximator is more likely to be closer to the analytic posterior than an arbitrary, non-converged approximator, and so the former is to be treated as more trustworthy.

Secondary Utilities
When it comes to inferential goals, parameter recoverability is the central utility of the secondary hierarchy. However, except for the identification sub-utility, it cannot be studied directly on real data because knowledge of the ground-truth is missing (see Section 3.2). As a result, many studies on parameter recoverability occur in the form of simulations or, if possible, mathematical analysis. This also concerns studying trade-offs with other central utilities, namely, estimation speed, interpretability, and robustness. These remain instrumentally the same as for observable inferential goals and can reveal trade-offs with parameter recoverability for the same reason as for predictive performance (see Section 4.3.2).

Tertiary Utilities
Due to the lack of ground-truth latent parameters at real-data inference time, the tertiary, supporting utilities not only aim at speeding up model building but may also function as observable proxies of parameter recoverability. These supporting utilities are parsimony, structural faithfulness, and predictive performance. The reason for the relevance of the former two is the same as for observable inferential goals (see Section 4.3.3), and so only predictive performance requires separate explanation and justification.
Predictive Performance: The relation between predictive performance and parameter recoverability is complicated and using the former as an observable proxy for the latter in a valid way requires great care [273,270]. Most importantly, we should not choose causally inconsistent P models, even if they predict better [202,270]. Fortunately, when taking the here-presented hierarchy of utilities seriously, this danger is banished by giving causal consistency priority over almost all other utilities for latent inferential goals. Within the class of causally consistent P models, it seems that using predictive performance as a proxy for parameter recoverability in (converged) PAD models represents a valid approach [270]. For example, we can utilize predictive performance to determine whether an extra probabilistic structure (e.g., accounting for potential temporal or spatial dependencies; see Section 3.5.2) is worth including or not [39]. This affects the balance between structural faithfulness and parsimony, which in turn serve as proxies for parameter recoverability. As another example, the choice of likelihood functions driven by predictive performance can be used to improve parameter recoverability in the context of regression P models [270].

Conclusion
We proposed answers to two fundamental questions of Bayesian modeling, namely (1) "What actually is a Bayesian model" and (2) "What makes a good Bayesian model"? Ultimately, we hope that both of these questions and the answers we provided will aid in thinking and talking about Bayesian models, as well as enhance the overarching model-building process, regardless of the specific methods and fields of application.
As an answer to the first question (Section 2), we proposed the PAD model taxonomy that defines four different kinds of Bayesian models as subsets of the triple of joint distribution of all involved variables (P), the training data (D), and the posterior approximator (A). In this way, we put forward our view that modern Bayesian models are more than just likelihood and prior, but comprise a variety of "external components" that influence, and, in turn, are influenced by, the goals and the results of any statistical analysis.
As an answer to the second question (Sections 3 and 4), we first argued that there are ten utility dimensions along which we can evaluate Bayesian models, namely, (1) causal consistency, (2) parameter recoverability, (3) predictive performance, (4) fairness, (5) structural faithfulness, (6) parsimony, (7) interpretability, (8) convergence, (9) estimation speed, and (10) robustness. Then, we proposed two utility trees that embody utility hierarchies and trade-offs depending on the particular inferential goals. We hope that our list of utility dimensions and structure of possible inferential goals is exhaustive (up to using synonyms and regrouping sub-utilities differently). However, it may as well become incomplete in the future, as new ideas are born and rapidly developed, and we will be happy to incorporate these into our taxonomy.