Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity ∗

Simulation-based calibration checking (SBC) is a practical method to validate computationally-derived posterior distributions or their approximations. In this paper, we introduce a new variant of SBC to alleviate several known problems. Our variant allows the user, in principle, to detect any possible issue with the posterior, while previously reported implementations could never detect large classes of problems including when the posterior is equal to the prior. This is made possible by including additional data-dependent test quantities when running SBC. We argue and demonstrate that the joint likelihood of the data is an especially useful test quantity. Some other types of test quantities and their theoretical and practical benefits are also investigated. We provide theoretical analysis of SBC, thereby providing a more complete understanding of the underlying statistical mechanisms. We also bring attention to a relatively common mistake in the literature and clarify the difference between SBC and checks based on the data-averaged posterior. We support our recommendations with numerical case studies on a multivariate normal example and a case study in implementing an ordered simplex data type for use with Hamiltonian Monte Carlo. The SBC variant introduced in this paper is implemented in the SBC R package.


Introduction
Simulation-based calibration checking (SBC; Talts et al. 2020) is a method to validate Bayesian computation, extending ideas from Cook et al. (2006). While SBC is primarily intended for validating sampling algorithms such as MCMC, it can be used for validating any method implementing or approximating Bayesian inference. Published applications include variational inference (Yao et al., 2018) and neural posterior approximations (Radev et al., 2023).
Typically, the posterior distribution π_post is the target of inference but is impossible to evaluate directly. While many computational approaches exist for sampling from the posterior or its approximations, they may fail to provide a correct answer. Problems can arise from errors in how the algorithm or the statistical model are encoded, or from the inherent inability of the computational method to correctly handle a given model with a given dataset.
Equation (3) immediately shows that, conditional on specific data y ∈ Y, the distributions of θ and θ̃ in Equations (2) and (3) are identical. In general, SBC-like checks are sensitive to different deviations from the correct posterior than checks based on the data-averaged posterior (see Section 3.5 for more details). The two families of checks coincide when Y has just a single element, as in this case both reduce to directly comparing two distributions. SBC and related methods employ two different implementations of the same statistical model and check whether the results have the same distribution conditional on data. The first step is to define a generator capable of directly simulating draws from π_prior(θ̃) and π_obs(y | θ̃); the second step is to define a probabilistic program that, in combination with a given posterior approximation algorithm, samples from the posterior distribution π_post(θ | y). Each simulation from the generator yields θ̃* ∼ π_prior(θ̃) and y* ∼ π_obs(y | θ̃*); the probabilistic program then produces posterior draws θ_1, ..., θ_M ∼ π_post(θ | y*), where M is the number of posterior draws sampled. Where confusion is possible, we use an asterisk to mark a random variable. We run many such simulations and then inspect the realized distributions of θ_1, ..., θ_M and θ̃* conditional on y*. Specific calibration checking methods differ in how exactly they test the conditional equality of the two distributions.
Proposed SBC variant
SBC has been believed to be insensitive to some classes of mismatches and, as described in Talts et al. (2020), would not work for discrete variables. To remove those limitations, we argue for the following variant of the SBC check: First, project the potentially high-dimensional parameter and data space into a scalar test quantity f : Θ × Y → R. Second, compute the rank of the prior draw in the posterior conditional on y. Specifically, we count the number of posterior draws where the test quantity is lower than in the prior draw and, if there are any ties, choose the rank uniformly at random among the tied positions:

N_less := Σ_{m=1}^{M} I[f(θ_m, y) < f(θ̃, y)],
N_equals := Σ_{m=1}^{M} I[f(θ_m, y) = f(θ̃, y)],
K ∼ uniform(0, N_equals),
N_total := N_less + K,

where I[P] denotes the indicator function for predicate P and uniform(a, b) denotes the discrete uniform distribution on {a, ..., b}. The procedure simplifies if there are no ties, which will be true for most practical test quantities over models with continuous parameter spaces. When no ties occur, we have N_total = N_less. Then, if the probabilistic program and the generator implement the same probabilistic model, we have N_total ∼ uniform(0, M). See Theorems 3 and 4 for a formal statement and proof. As a result, once we obtain a set of draws from the empirical distribution of N_total via multiple simulations, we can perform a test for uniformity. The process is then repeated for all test quantities we want to consider. If we are using MCMC to sample from π_post, the posterior sample typically needs to be thinned to ensure that θ_1, ..., θ_M are approximately independent (Talts et al., 2020; Säilynoja et al., 2022). The overall SBC process is illustrated in Figure 1.
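The rank computation with random tie-breaking described above can be sketched in a few lines. This is a hypothetical Python/NumPy sketch of the procedure, not the implementation from the SBC R package:

```python
import numpy as np

def sbc_rank(f_prior, f_post, rng):
    """Rank N_total of the prior draw's test-quantity value among M
    posterior draws, with ties broken uniformly at random.

    f_prior : scalar value f(theta_tilde, y) for the prior draw
    f_post  : array of values f(theta_m, y) for the M posterior draws
    rng     : numpy random Generator used for tie-breaking
    """
    f_post = np.asarray(f_post)
    n_less = int(np.sum(f_post < f_prior))    # N_less
    n_equal = int(np.sum(f_post == f_prior))  # N_equals
    k = int(rng.integers(0, n_equal + 1))     # K ~ uniform(0, N_equals)
    return n_less + k                         # N_total in {0, ..., M}

rng = np.random.default_rng(0)
print(sbc_rank(0.5, [0.1, 0.2, 0.7, 0.9], rng))  # no ties -> 2
```

With no ties the random term vanishes and N_total = N_less, matching the simplified procedure in the text.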
While it is possible to use numerical tests for uniformity with SBC, we generally prefer to use visualisations of the rank distribution, as they are more informative than numerical summaries and discourage dichotomous thinking. Most prominent are rank histograms and plots of empirical cumulative distribution functions (Säilynoja et al., 2022).
Our proposed SBC variant improves upon the way SBC has been previously reported and used in two major ways:

• We let test quantities depend on both data and parameters, while previous work only considered quantities that depend on the parameters. In practice, these test quantities were almost exclusively just the individual parameters themselves.
• Previous formulations of SBC required uniformity of N_less. However, even if the probabilistic program is exactly correct, N_less will not be uniform if Pr(N_equals > 0) > 0, that is, if ties can occur. With our improved SBC procedure, we can handle test quantities whose distributions have point masses and thus allow ties between f(θ̃, y) and (f(θ_1, y), ..., f(θ_M, y)).
Resolving ties lets us use SBC for models with discrete parameters as well as in some other special cases, such as when a theoretically strictly positive test quantity suffers underflow and some prior/posterior sample draws are numerically zero. Random tie-breaking has previously been used for checking that the data-averaged posterior equals the prior (1) over discrete parameter spaces (Saad et al., 2019).

Figure 1: Schematic representation of SBC with S simulations. The generator is responsible for sampling from the prior distribution θ̃ ∼ π_prior(θ̃) and from the observation model y ∼ π_obs(y | θ̃). The draws from the observation model are then treated as input for the probabilistic program and the associated algorithm, which produces M posterior draws θ_1, ..., θ_M. Each test quantity projects the prior draw and the posterior draws (potentially using data) onto the real line, letting us compute a single rank (N_total). Finally, deviations from the discrete uniform distribution are assessed numerically or visually.
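The pipeline of Figure 1 can be sketched end-to-end for a conjugate model where the exact posterior is available analytically, so the ranks must come out uniform. The model choice (µ ∼ N(0,1), y_i ∼ N(µ,1)) and all constants below are illustrative assumptions, not taken from the paper's case studies:

```python
import numpy as np

def run_sbc(S=2000, M=19, n=5, seed=1):
    """End-to-end SBC loop for a conjugate normal model. Because the
    'probabilistic program' here is the exact analytic posterior,
    the ranks should be uniform on {0, ..., M}."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(S, dtype=int)
    for s in range(S):
        mu = rng.normal()                    # generator: prior draw
        y = rng.normal(mu, 1.0, size=n)      # generator: simulated data
        post_mean = n * y.mean() / (n + 1)   # conjugate posterior mean
        post_sd = np.sqrt(1.0 / (n + 1))     # conjugate posterior sd
        draws = rng.normal(post_mean, post_sd, size=M)
        ranks[s] = np.sum(draws < mu)        # continuous quantity: no ties
    return ranks

ranks = run_sbc()
counts = np.bincount(ranks, minlength=20)
expected = len(ranks) / 20
# chi-square statistic against the discrete uniform on {0, ..., 19}
chi2 = float(np.sum((counts - expected) ** 2 / expected))
print(chi2)  # should be unremarkable for a chi-square with 19 df
```

A deliberately broken probabilistic program (e.g., sampling from the prior or with the wrong posterior scale) substituted into the loop would be the object SBC is meant to catch.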

Practical considerations
SBC will be satisfied if the generator, probabilistic program, and posterior approximation algorithm are in harmony: the generator and the probabilistic program should correspond to the same data-generating process, while the posterior approximation algorithm (including the associated tuning parameters) provides samples that have at most a negligible difference from the correct posterior for the probabilistic program, given the data simulated from the prior. Failure indicates that at least one of the components is mismatched to the others. However, by itself, SBC cannot determine where exactly the problem lies. As a result, two broad uses of SBC arise:

• We have code to simulate data and a probabilistic program we trust, and the goal is to check that an algorithm correctly samples from the posterior, or

• We have an algorithm that we trust is correct and trustworthy code to simulate data, and the goal is to check that we correctly implemented our probabilistic program.
In practice, those classes overlap and mix: we are rarely completely certain of the correctness of any algorithm, generator, or probabilistic program. Additionally, SBC as a simulation method has no way to inform us about a discrepancy between the process that generated real data and the assumptions of our statistical model. For reliable inference, SBC thus needs to be combined with other elements of Bayesian workflow that can detect model misspecification, such as posterior predictive checks or analysis of residuals (Gabry et al., 2019; Gelman et al., 2020; Kay, 2021).

Importance of test quantities
It has been generally believed that methods based on Equation (2), including SBC, are never sensitive to some classes of mismatches between the generator and the probabilistic program, most notably that it is impossible to detect if the probabilistic program samples from the prior distribution and ignores the information in the data (e.g., Equation (1.3) of Lee et al. 2019; Appendix M.2 of Lueckmann et al. 2021; Schad et al. 2022; Zhao et al. 2021; Ramesh et al. 2022; Cockayne et al. 2022).
In this paper we show that the choice of test quantities greatly influences the usefulness and sensitivity of SBC. We show that using test quantities that depend on data makes it possible to detect any conceivable mismatch between the generator and the probabilistic program. Thus, we demonstrate that the belief in inherent limitations of SBC has relied on overly restrictive and sometimes plainly incorrect assumptions. We discuss useful classes of test quantities that have not been used so far and provide a characterization of possible remaining undetected failures. We provide simulation studies as well as theoretical analysis of SBC to support our findings. We hope that our theoretical framework can serve as a basis for a better understanding of the properties of SBC and related methods. All of the techniques discussed are implemented in the SBC R package (Kim et al., 2022).
The rest of the paper is structured as follows: Section 2 discusses related work, Section 3 summarizes the theoretical results we derived, Sections 4 and 5 show results of simulation and real-world case studies, and Section 6 discusses the results and our recommendations for practical use of SBC.

Related work
Prior contributions to validation of Bayesian computation can be roughly split into works that focus on the data-averaged posterior, those that focus on the SBC property, and other relevant works that do not directly invoke any self-consistency property.

Data-averaged posterior
The idea of using simulations via a generator to verify Bayesian computation can be traced back to Geweke (2004), who compared the moments of the prior and the data-averaged posterior distributions for multiple test quantities. That paper proposed to integrate a transition kernel for an MCMC sampler targeting π_post into a scheme that samples π_joint directly. This lets one obtain the data-averaged posterior from a single run of this sampler, potentially reducing the computational cost but increasing implementation burden. Geweke's formalism allows the test quantities to depend on data, although all the examples actually shown depend only on parameters. Comparing the mean vector and covariance matrix of the prior distribution and the data-averaged posterior distribution is also discussed by Yu et al. (2021), who use repeated fits to build the data-averaged posterior. Saad et al. (2019) proposed a check for identity of two potentially high-dimensional discrete distributions by inspecting ranks generated by different total orderings over the parameter space. Their work is relevant in four ways: (1) it can be used to assess the data-averaged posterior criterion (1) for discrete domains, (2) in close analogy to the use of test quantities in this paper, they focus on different orderings of the underlying domain and their different power to detect discrepancies, (3) they propose breaking ties in ordering uniformly at random in the same way we do, and (4) they prove some results that are analogous to or special cases of some of our theoretical results.

SBC-like checks
The identity of prior and posterior distributions conditional on a specific dataset as a tool to check computation was proposed by Cook et al. (2006) and further refined by Talts et al. (2020), who introduced SBC as it is currently used. Specific variants of SBC have been proposed for variational inference (Yao et al., 2018), Bayes factors (Schad et al., 2022), and Gaussian processes (Mcleod and Simpson, 2021). SBC has also been used to validate likelihood-free inference methods including neural posterior approximators with normalizing flows (Radev et al., 2020, 2021), and an SBC variant for checking joint calibration of such methods has been proposed and used by Radev et al. (2023). Gandy and Scott (2020) proposed a procedure similar to SBC that can work with shorter sequences of Markov transitions than a full fit, reducing computational cost. This is, however, less relevant for algorithms that need a nontrivial warmup phase to adapt to the specific posterior (e.g., the adaptive Hamiltonian Monte Carlo sampler implemented in Stan; Carpenter et al., 2017). This is because warmup is a fixed cost that occurs during every model fit even if fewer post-warmup draws are needed. Prangle et al. (2014) proposed an SBC-like procedure for approximate Bayesian computation (ABC). They note that the possibility that the probabilistic program simply samples from the prior distribution cannot be ignored in this context and resolve this issue by separately inspecting ranks for some subsets of the simulated datasets. SBC is closely related to the coverage property discussed by Prangle et al.: when using M posterior sample draws, SBC can be understood as checking, for all posterior intervals of width α ∈ {1/M, ..., (M−1)/M}, that the probability the interval contains the original simulated value of the test quantity is α.
A broader framework for calibration of learning procedures has been proposed by Cockayne et al. (2022). There, Bayesian inference is just one example of procedures where calibration can be empirically verified with an SBC-like check. They distinguish between strong calibration, which corresponds to passing SBC (specifically continuous SBC as defined in Appendix A) for all measurable test quantities, and weak calibration, which corresponds to having a correct data-averaged posterior (Equation 1). However, they only consider test quantities that depend only on parameters.

Miscellaneous
The problem of diagnosing and understanding computational issues is transformed by Rendsburg et al. (2022). Their approach tries to find a prior distribution that would make the probabilistic program and algorithm exactly match the generator.
Both Grosse et al. (2016) and Domke (2021) proposed to use fits to multiple generated datasets to estimate the symmetrized KL-divergence between a distributional approximation to the correct posterior (e.g., Laplace or variational inference) and the true posterior. Cusumano-Towner and Mansinghka (2017) described a method to compute the symmetrized KL-divergence between a gold standard posterior and an approximate posterior.

Theoretical results
The formalism required for SBC is relatively heavy on definitions and syntax, so for all results in this section we also provide plain English-language summaries. Proofs, expanded definitions (as required for proofs), and some additional discussion can be found in Appendix A. Some of the results for stochastic rank statistics (SRS) by Saad et al. (2019) can be understood as special cases of some of our theorems. Specifically, SRS assumes that the parameter space Θ is finite or countable and the goal is to directly compare two distributions, which is the same as assuming the data space Y has just a single element. We will refer to this special case as the SRS assumption.
Definition 1 (Posterior family, test quantity). A posterior family ϕ assigns a normalized posterior density to each possible y ∈ Y. That is, a posterior family is a function ϕ : Θ × Y → R+ such that ∀y : ∫_Θ ϕ(θ | y) dθ = 1. For each y, we will denote the implied distribution over Θ as ϕ_y. A test quantity is any measurable function f : Θ × Y → R.

Definition 2 (Sample rank CDF, sample Q, sample SBC). Given a test quantity f, M ∈ N, and a posterior family ϕ, let θ̃ ∼ π_post(θ | y) and θ_1, ..., θ_M ∼ ϕ_y. We can define the following random variables:

N_less := Σ_{m=1}^{M} I[f(θ_m, y) < f(θ̃, y)],
N_equals := Σ_{m=1}^{M} I[f(θ_m, y) = f(θ̃, y)],
K ∼ uniform(0, N_equals),
N_total := N_less + K.

The M-sample Q is Q_{ϕ,f}(i | y) := Pr(N_total ≤ i). The posterior family ϕ passes M-sample SBC w.r.t. f if for all i ∈ {0, ..., M}: ∫_Y Q_{ϕ,f}(i | y) π_marg(y) dy = (i+1)/(M+1).

This definition does not immediately match the procedure we actually use to run SBC in practice, but it is more convenient for further analysis and is equivalent:

Theorem 1 (Procedural definition of sample SBC). A posterior family ϕ passes M-sample SBC w.r.t. f if and only if, given θ̃* ∼ π_prior(θ̃), y* ∼ π_obs(y | θ̃*), and N_total = N_total^{ϕ,f,θ̃*,y*}, we have N_total ∼ uniform(0, M).
Next, we define an idealized, continuous version of SBC that will be more amenable to theoretical analysis:

Definition 3 (Continuous rank CDF, continuous q, continuous SBC). We first define the fitted CDF F_{ϕ,f}(s | y) := Pr_{θ ∼ ϕ_y}(f(θ, y) < s) and the fitted tie probability p_{ϕ,f}(s | y) := Pr_{θ ∼ ϕ_y}(f(θ, y) = s). The continuous rank is then q_{ϕ,f} := F_{ϕ,f}(f(θ̃, y) | y) + U p_{ϕ,f}(f(θ̃, y) | y), assuming U is a random variable distributed uniformly over the [0, 1] interval. The posterior family ϕ passes continuous SBC w.r.t. f if q_{ϕ,f} ∼ uniform(0, 1) when θ̃ ∼ π_prior(θ̃) and y ∼ π_obs(y | θ̃).

Correctness
With the definitions ready, we first establish that if a probabilistic program achieves uniform distribution of ranks in sample SBC for a given test quantity as M → ∞, then it will satisfy continuous SBC as well.
Theorem 2 (Sample SBC implies continuous SBC).

1. For any fixed y ∈ Y, if as the number of sample draws M → ∞ we have ∀i ∈ {0, ..., M}: Q_{ϕ,f}(i | y) → (i+1)/(M+1), then q_{ϕ,f} is uniformly distributed conditional on y.

2. If as M → ∞ we have ∀i ∈ {0, ..., M}: ∫_Y Q_{ϕ,f}(i | y) π_marg(y) dy → (i+1)/(M+1), then ϕ passes continuous SBC for f.

Theorem 3 then shows that if a probabilistic program passes continuous SBC for a given test quantity, it will pass sample SBC for all M. We then show that passing continuous SBC (and thus our SBC variant) is a necessary condition for the correctness of posterior estimation (Theorem 4). That is, the correct posterior will always produce uniformly distributed ranks, including for test quantities that may have ties (see also Examples 5 and 6 in Appendix B). A special case of Theorem 4 under the SRS assumption was proven as Theorem 3.1 of Saad et al. (2019).

Characterization of SBC failures
Still, many incorrect posteriors will also pass SBC for any given test quantity, so in Theorem 5 we characterize those situations.
Theorem 5 (Characterization of SBC failures). A posterior family ϕ passes continuous SBC w.r.t. f when the simulations are restricted to any measurable subset Ȳ ⊆ Y if and only if for all y ∈ Y and s ∈ R the fitted CDF and fitted tie probability of f under ϕ_y coincide with those under the correct posterior π_post(· | y).

Not only does the correct posterior yield a uniform distribution of ranks when averaging over the whole data space Y, but the ranks are uniformly distributed even when we only consider simulations that yielded data in some Ȳ ⊂ Y. The reverse implication also holds: when the ranks are uniformly distributed for all subsets of the data space Ȳ ⊂ Y, then the implied posterior distribution of the test quantity under investigation has to be exactly correct. In other words, whenever SBC "fails" to detect that the implied posterior distribution of a given test quantity is incorrect although the rank distribution is uniform, we can find a subset of the data space where the ranks are non-uniform. It just so happens that all the deviations in the various subsets cancel each other out perfectly.
An obvious application of Theorem 5 is that we could partition our simulations based on some features of the data space and investigate uniformity separately for each part, similarly to the procedure suggested by Prangle et al. (2014). This however quickly runs into issues of multiple testing due to the lower number of simulations in each part. In our experience it is thus not practical, except for the special case where interest lies only in some subset of the data space, so that the SBC checks can focus on that subset alone. This is a form of rejection sampling and can be practically useful if it is easy to formulate a criterion that constrains plausible real data sets but hard to construct a defensible prior distribution that would enforce this criterion implicitly. For example, prior information can be available on the plausible variance of an outcome across the whole population, which may be hard to express as a prior on coefficients associated with predictors (but see the approaches for linear models discussed in Zhang et al., 2020, and Aguilar and Bürkner, 2023).

Data-dependent test quantities
The characterization of SBC failures discussed above provides intuition why test quantities that depend on data are useful: If SBC passes for a test quantity f, but the posterior is in fact incorrect, we can always pick a test quantity g that combines f with some aspect of the data and ensures that the discrepancies in various parts of data space add up instead of canceling out. For example, we could have an over-abundance of low ranks and under-abundance of high ranks in Y_1 ⊂ Y and a matching under-abundance of low ranks and over-abundance of high ranks in Y_2 ⊂ Y. Setting

g(θ, y) := −f(θ, y) for y ∈ Y_1 and g(θ, y) := f(θ, y) for y ∈ Y_2

will ensure an over-abundance of high ranks in both Y_1 and Y_2. Since such a test quantity uses all the simulations, we do not lose power from a reduced number of simulations.
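The sign-flip construction can be demonstrated numerically on a toy normal model where the "posterior" simply returns prior draws; the model, the subsets Y_1 = {y < 0} and Y_2 = {y ≥ 0}, and the seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
S, M = 4000, 99
ranks_f = np.empty(S, dtype=int)
ranks_g = np.empty(S, dtype=int)
for s in range(S):
    mu = rng.normal()                 # prior draw
    y = rng.normal(mu, 1.0)           # single observation
    draws = rng.normal(size=M)        # wrong posterior: just the prior
    # f(theta, y) = theta ignores data: ranks are uniform overall even
    # though they are non-uniform conditional on the sign of y
    ranks_f[s] = np.sum(draws < mu)
    # g flips the ordering on {y < 0}, so the regional deviations add up
    sign = 1.0 if y >= 0 else -1.0
    ranks_g[s] = np.sum(sign * draws < sign * mu)

print(ranks_f.mean() / M)  # close to 0.5: f passes
print(ranks_g.mean() / M)  # well above 0.5: g fails
```

The data-dependent quantity g uses every simulation, so no power is lost to partitioning.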
An even stronger reason to use data-dependent test quantities is that they make SBC in some sense complete: If there is any difference between the correct posterior and the posterior implemented by the probabilistic program, there will exist a data-dependent test quantity that fails SBC.In fact, we can construct a specific test quantity that detects the failures, which is the ratio of the correct posterior density to the posterior density actually implemented by the probabilistic program.
Theorem 6 (Density ratio). For any posterior family ϕ, take g(θ, y) = π_post(θ | y) / ϕ(θ | y). Then ϕ passes continuous SBC w.r.t. g if and only if π_post and ϕ are equal except for a set of measure 0. Here g is not a practical test quantity, as it (a) depends on the specific probabilistic program we implemented and (b) requires that we already have the correct posterior density. However, our empirical results in this paper, and our experience with using SBC in model development more generally, show that the model likelihood π_obs(y | θ) is frequently useful as a general-purpose test quantity. This makes sense intuitively, as the likelihood is an important contributor to the density ratio. In their Theorem 3.1, Saad et al. (2019) proved an analogous result under the SRS assumption, although relying on a different test quantity. Under the SRS assumption, they also show that the difference of the two densities will fail M-sample SBC for all M > 1 (their Theorem 3.6) and has maximum power against discrepancies (their Theorem 3.7).
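As a rough numerical illustration of why the likelihood is useful (an assumed toy normal model, not the paper's case study): when the probabilistic program samples from the prior, the joint log-likelihood as a test quantity piles ranks toward the top, because the prior draw that actually generated the data fits it better than fresh prior draws do:

```python
import numpy as np

rng = np.random.default_rng(3)
S, M, n = 3000, 99, 3
ranks = np.empty(S, dtype=int)
for s in range(S):
    mu = rng.normal()                   # prior draw
    y = rng.normal(mu, 1.0, size=n)     # simulated data
    draws = rng.normal(size=M)          # wrong posterior: prior draws
    # joint log-likelihood (up to a constant) as the test quantity
    ll_prior = -0.5 * np.sum((y - mu) ** 2)
    ll_draws = -0.5 * np.sum((y[None, :] - draws[:, None]) ** 2, axis=1)
    ranks[s] = np.sum(ll_draws < ll_prior)

print(ranks.mean() / M)  # well above 0.5: prior-sampling is detected
```

The same run with the correct posterior substituted for `draws` would yield a mean normalized rank near 0.5 and a uniform rank histogram.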

Ignoring data
We generalize the result that probabilistic programs sampling from the prior distribution will pass SBC against all test quantities that do not depend on data.
Theorem 7 (Incomplete use of data). Assume a model π with observation space Y and parameter space Θ, a space Y′, and a measurable function t : Y → Y′. Denote the set t⁻¹(y′) = {y ∈ Y : t(y) = y′}. Consider the model π′ with parameter space Θ and observation space Y′ such that for all θ ∈ Θ, y′ ∈ Y′:

π′_obs(y′ | θ) = ∫_{t⁻¹(y′)} π_obs(y | θ) dy.
Assume a test quantity f that depends on the data only through t. Then the posterior family ϕ(θ | y) := π′_post(θ | t(y)) passes SBC w.r.t. f. Here, the choice of t lets us choose which aspects of the data are ignored. If t is constant, e.g., ∀y ∈ Y : t(y) = 1, we recover the case where all data are ignored: π′_post(θ | y′) = π_prior(θ), and thus ϕ(θ | y) = π_prior(θ) will pass SBC w.r.t. f. If t is a bijection, no information is lost. Other choices of t then let us interpolate between those two extremes, for example ignoring just a subset of the data points, treating some data points as censored, or rounding all data to integers.

Detailed analysis of simple models and test quantities
Appendix B provides full theoretical analysis of SBC for simple models and test quantities where we can actually characterize all possible posterior distributions that will satisfy SBC. This is aimed at providing intuition on what SBC actually does and also provides counterexamples to some claims. In some literature (e.g., Lee et al., 2019; Lueckmann et al., 2021; Schad et al., 2022; Grinsztajn et al., 2021; Ramesh et al., 2022; Saad et al., 2019), it is assumed that SBC is based on the data-averaged posterior (1). We show that this is incorrect: Example 2 not only explicitly constructs posterior distributions that will satisfy (1) for some test quantity while not passing SBC, but also posterior distributions that pass SBC while not satisfying (1). One possibly more general lesson is that SBC is most naturally understood as enforcing constraints on the quantile function of the test quantity, while having a correct data-averaged posterior is most naturally seen as a constraint on the density of the test quantity.
This implies there might be some gains from using both the data-averaged posterior and SBC when verifying the correctness of Bayesian computation. However, we suspect that the additional practical benefit of using the data-averaged posterior is small, in the sense that the incorrect posteriors that pass SBC but are ruled out by Equation (1) are mostly contrived and unlikely to be a result of a computational problem or an inadvertent mistake. Lemma 2.19 of Cockayne et al. (2022) proves that if a posterior passes SBC for all possible test quantities that do not depend on data, it will have the correct data-averaged posterior for all test quantities that do not depend on data, so SBC is stronger at least in the limit of using infinitely many test quantities. We leave a more thorough examination of the relationship between the data-averaged posterior and SBC as future work.
Additionally, we show the behavior of SBC when ties are present, whether induced by a test quantity (Example 5) or by a discrete parameter space (Example 6). Discrete parameter spaces may induce additional structure on the space of posterior families passing SBC.

Monotonic transformations of test quantities
Finally, transforming a test quantity by a strictly monotonic function produces equivalent SBC results:

Theorem 8 (Monotonic transformations). Assume test quantities f, g and a set of measurable functions h_y : R → R such that ∀y ∈ Y, θ ∈ Θ : f(θ, y) = h_y(g(θ, y)), and a posterior family ϕ. If either h_y is strictly increasing for all y ∈ Y, or h_y is strictly decreasing for all y ∈ Y, then 1) ϕ passes continuous SBC w.r.t. f if and only if ϕ passes continuous SBC w.r.t. g, and 2) ϕ passes M-sample SBC w.r.t. f if and only if ϕ passes M-sample SBC w.r.t. g.
The result cannot be easily strengthened as many non-monotonic transformations lead to different, non-equivalent SBC checks.Example 3 shows that flipping the ordering of values only for some subset of the data space yields a different SBC check.Example 4 shows that we can also obtain a different check if we combine a test quantity with a non-monotonic bijection, and Example 5 shows the same for the case when a whole range of values is projected onto a single point.In all those examples, the transformed test quantities rule out some sets of posteriors that pass SBC for the original quantity, but there are also sets of posteriors not passing SBC for the original quantity but passing SBC for the transformed quantity.
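The invariance under strictly increasing transformations is easy to verify numerically; the values below are arbitrary illustrative draws:

```python
import numpy as np

rng = np.random.default_rng(4)
f_prior = rng.normal()            # test-quantity value for the prior draw
f_post = rng.normal(size=99)      # test-quantity values for posterior draws

rank_f = int(np.sum(f_post < f_prior))
# exp is strictly increasing, so the ordering -- and hence the SBC rank --
# is exactly preserved
rank_g = int(np.sum(np.exp(f_post) < np.exp(f_prior)))
assert rank_f == rank_g

# a non-monotonic transform (squaring) generally reorders the values and
# therefore defines a different, non-equivalent SBC check
rank_h = int(np.sum(f_post ** 2 < f_prior ** 2))
```

Monotone transforms are thus free to apply (e.g., log for numerical stability), while non-monotonic transforms genuinely change what the check is sensitive to.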

Numerical case studies
The theoretical analysis in the previous section primarily deals with the behavior of SBC in the limit of both infinitely many posterior draws per fit and infinitely many simulations. Here, we further support the results by numerical experiments, which let us understand not only whether a certain problem is detectable at all but also how much computational effort is required for SBC to detect the problem.

Setup
To illustrate some of the properties of various types of test quantities, we use a simple multivariate normal model with prior µ ∼ MVN(0, Σ) and observation model y_i ∼ MVN(µ, Σ), where the two-element vector µ is the target of inference and y_1, ..., y_n are observed. Introducing ȳ = (1/n) Σ_{i=1}^{n} y_i, the correct analytic posterior is MVN(nȳ/(n+1), Σ/(n+1)). Unless mentioned otherwise, we will use n = 3. In most previous use cases of SBC, the only test quantities used would have been the parameters themselves, that is, the elements of µ in the above example. Below, we also check a host of derived quantities: the sum, difference, and product of the µ elements, the joint likelihood of all the data, and pointwise likelihoods for the first two data points.
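A sketch of this setup follows; the particular Σ and the helper `quantities` are assumptions for illustration, not the exact code of the case study:

```python
import numpy as np

# mu ~ MVN(0, Sigma), y_i ~ MVN(mu, Sigma); the analytic posterior is
# MVN(n * ybar / (n + 1), Sigma / (n + 1)).
rng = np.random.default_rng(5)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
n, M = 3, 99

mu = rng.multivariate_normal(np.zeros(2), Sigma)      # prior draw
y = rng.multivariate_normal(mu, Sigma, size=n)        # simulated data
ybar = y.mean(axis=0)
post_mean = n * ybar / (n + 1)
post_cov = Sigma / (n + 1)
draws = rng.multivariate_normal(post_mean, post_cov, size=M)

def quantities(th, y):
    """Derived test quantities for a parameter vector th and data y
    (joint log-likelihood up to an additive constant)."""
    inv = np.linalg.inv(Sigma)
    ll = -0.5 * sum((yi - th) @ inv @ (yi - th) for yi in y)
    return {"mu1": th[0], "sum": th[0] + th[1],
            "prod": th[0] * th[1], "joint_log_lik": ll}
```

Each quantity is evaluated on the prior draw and on every posterior draw, yielding one rank per quantity per simulation.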
To quantify the discrepancy between an observed distribution of posterior ranks and the uniform distribution, we take the likelihood of observing the most extreme point on the empirical CDF if the rank distribution was indeed uniform:

γ := min_{i ∈ {1, ..., M}} min( Bin(R_i | S, z_i), 1 − Bin(R_i − 1 | S, z_i) ).

Here, M is the number of draws in the sample obtained from the posterior, S is the number of simulations (and thus the number of observed ranks), z_i = i/(M+1) is the expected proportion of observed ranks smaller than i, R_i is the observed count of ranks smaller than i, and Bin(R | S, p) is the CDF of the binomial distribution with S trials and probability of success p, evaluated at R. This metric was introduced by Säilynoja et al. (2022), who also provide computational methods to evaluate the distribution of γ under a uniform distribution of ranks for given M and S. Our primary metric of interest is then log(γ/γ*), where γ* is the 5th percentile of the null distribution. That is, if we adopt a hypothesis-testing framework, log(γ/γ*) < 0 implies a rejection of the hypothesis of uniform distribution at the 5% level. Having log(γ/γ*) < 0 also corresponds to situations where visual checks of the ECDF plots would show problems (for a single test quantity). This diagnostic is typically more sensitive than the Kolmogorov–Smirnov or χ² test.

Figure 2: Evolution of the difference between the gamma statistic and the threshold for rejecting uniformity at 5% for the correct posterior. mvn_log_lik[1] and mvn_log_lik[2] are the pointwise likelihoods π(y_1 | µ) and π(y_2 | µ) respectively, while mvn_log_lik is the joint likelihood. As expected when using a 5% level for rejection, false positives (values below the threshold) do happen, but they tend to correspond to only small discrepancies.
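A statistic in the spirit of γ can be computed directly from the binomial CDF; this is our reading of the formula above as a sketch (the exact implementation in Säilynoja et al. (2022) may differ), using SciPy's binomial distribution:

```python
import numpy as np
from scipy.stats import binom

def gamma_stat(ranks, M):
    """Smallest binomial tail probability over the points of the
    empirical CDF of the ranks (ranks take values in {0, ..., M})."""
    ranks = np.asarray(ranks)
    S = len(ranks)
    gammas = []
    for i in range(1, M + 1):
        z = i / (M + 1)            # expected proportion of ranks below i
        R = int(np.sum(ranks < i)) # observed count of ranks below i
        gammas.append(min(binom.cdf(R, S, z),
                          1.0 - binom.cdf(R - 1, S, z)))
    return min(gammas)

rng = np.random.default_rng(6)
uniform_ranks = rng.integers(0, 20, size=1000)  # M = 19, roughly uniform
print(gamma_stat(uniform_ranks, 19))
```

For strongly non-uniform ranks (e.g., all ranks equal to zero) the statistic collapses toward zero, which is what drives the rejection curves in the case studies.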

Correct posterior - Case study 1
Figure 2 shows how the γ statistic evolves in a fairly typical SBC run as we add more simulations, using a probabilistic program that samples from the correct posterior. There is some variability, but most of the time all quantities indicate uniformity, and when they do indicate some non-uniformity, the discrepancies tend to be small, so we are unlikely to reject this model as incorrect.

Ignoring data - Case studies 2-4
For comparison, case study 2 (Figure 3) shows the evolution of the same quantities for a typical run with an incorrect posterior that is completely equal to the prior. All quantities that do not depend on data pass SBC, barring small short-term deviations as seen for the correct posterior. But all the likelihood-based quantities start showing big discrepancies after just a handful of simulations. While the overall distribution of ranks for the parameters themselves is uniform, when we look separately at data with large average y and low average y, the ranks are strongly non-uniform in both regions (Figure 4).
In case study 3, we observe similar behaviour for the posterior that ignores only the first data point; see Figure 5. The biggest difference is that now the pointwise likelihood for the second data point, which was not ignored, passes SBC, while the joint likelihood as well as the pointwise likelihood for the first, ignored data point show problems. Additionally, the pointwise likelihood for the ignored data point now shows a bigger discrepancy than the joint likelihood. For both quantities, the discrepancy is smaller and requires about S = 20 simulations to reliably uncover, because ignoring a single data point produces a posterior that is closer to the correct one than when ignoring all the data. For case study 4 we increase the number of data points to n = 20 (Figure 6); now, ignoring just a single data point produces a posterior that is close to correct, and even after 1000 simulations the discrepancy for the joint likelihood is small. The pointwise likelihood for the first (ignored) data point still detects the problem relatively quickly. More generally, if the model (partially) ignores data, then adding a test quantity that involves both data and parameters can detect this failure. Specifically, adding the joint log-likelihood of the data as a derived quantity seems to be a useful default. If only a small part of the data is missing, using the joint likelihood in SBC will turn it into a problem of precision. Missing just a single datapoint in a large dataset (e.g., an off-by-one error in the probabilistic program) may change the posterior only slightly and be undetectable with realistic computational effort.

Figure 5: Case study 3: Evolution of the difference between the gamma statistic and the threshold for rejecting uniformity at 5% for an incorrect posterior that ignores the first datapoint among a small data set (n = 3).

Incorrect correlations -Case study 5
Suppose we have an incorrect posterior that has the correct marginal distributions for both parameters, i.e., sampling is done from independent univariate normal distributions. The evolution of the discrepancy as simulations are added is shown in Figure 7. If the test quantities are the univariate parameters, SBC passes without any indication of problems, while the likelihood-based quantities as well as the difference, product, and sum of the variables show problems relatively quickly. The joint likelihood is the first to show serious issues.
If the inference does not represent correlations in the posterior correctly, this should also manifest in an SBC failure for some function of the parameters. This can be directly targeted by using products ("interactions") of model parameters, but the log-likelihood once again seems to be generally useful as a highly nonlinear function of all model parameters.
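A small sketch of this targeting (again our own construction, assumed purely for illustration: a bivariate normal prior, a single sum observation, and an approximation with exactly correct marginals but independent components):

```python
import numpy as np

rng = np.random.default_rng(2)
S, M = 4000, 19

r_marg, r_prod = [], []
for _ in range(S):
    theta = rng.normal(size=2)         # prior: theta ~ N(0, I)
    y = theta.sum() + rng.normal()     # y | theta ~ N(theta_1 + theta_2, 1)
    m, v = y / 3.0, 2.0 / 3.0          # exact posterior: mean (m, m), var v, corr -1/2
    # WRONG approximation: correct marginals N(m, v) but independent components.
    d1 = rng.normal(m, np.sqrt(v), M)
    d2 = rng.normal(m, np.sqrt(v), M)
    r_marg.append(np.sum(d1 < theta[0]))
    r_prod.append(np.sum(d1 * d2 < theta[0] * theta[1]))

print(np.mean(r_marg) / M)  # near 0.5: marginal ranks pass
print(np.mean(r_prod) / M)  # below 0.5: the product quantity sees the missing correlation
```

Products under the true posterior are shifted down by the (negative) covariance term, so their ranks drift low, while the marginals are exactly calibrated by construction.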

Less plausible problems -Case study 6
In this subsection our results get less practical and more theoretical. The (partially) unused data case may easily arise in practice due to a bug in the probabilistic program, such as an indexing bug, or a deficient overall approach. For example, an approximate Bayesian computation algorithm may not learn from the data at all and just stick to the prior (Prangle et al., 2014). Incorrect correlations or more general higher-order structure of the posterior may also easily arise due to a problem with an approximate inference algorithm. For example, mean-field variational inference will never recover any correlations, by design. Beyond those examples, we have found it hard to find incorrect probabilistic programs that would satisfy the SBC identity and could plausibly arise from unintentional mistakes in program code or problems with an algorithm. We see this as anecdotal evidence that SBC augmented with a few well-chosen test quantities that probe usage of data and higher-order posterior structure, such as the likelihood, can robustly detect these kinds of mistakes. That said, for specific models, wide sets of artificial counterexamples that incorrectly pass SBC can be constructed.
In case study 6, we show a specific instance of a more general class of setups where we can create an incorrect posterior approximation that produces an overabundance of low ranks for datasets where the average of y is positive and compensates by producing an overabundance of high ranks for the other datasets. If this is done right, the test quantity will pass SBC. The distribution of the ranks conditional on the average of y for one such setup is shown in Figure 8; here we transform draws from the correct posterior distribution by first applying the correct CDF, manipulating the results to achieve the desired shape of ranks, and then transforming back via the quantile function. See the associated code for more details. As seen in Figure 9, when averaging over all datasets, SBC indeed passes for the univariate parameter test quantities, but if we instead look at, say, the absolute value of µ (as well as some other non-monotonic transformations of µ), we immediately see problems, as now some of the previously low ranks flip to high ranks and the discrepancies accumulate instead of canceling each other. In this particular case, the problem is also eventually picked up by the product of the µ values and, with enough simulations, even by the joint likelihood, but there is no guarantee this will always happen. In general, non-monotonic transformations can discover incorrect posteriors that would otherwise stay hidden when looking at the original variables. Still, the practical relevance of non-monotonic transforms in SBC is, in our view, likely limited, as it required careful work to construct posteriors that manifest this behaviour. We were unable to find even remotely plausible scenarios where an issue with Bayesian computation was best discovered by using a non-monotonic transformation of another test quantity.

Small discrepancies -Case study 7
A final case study considers small discrepancies in the posterior. To be specific, we introduce a small bias in the posterior, drawn from normal(0, 0.3) independently for each simulation and each element of µ. The resulting SBC history is shown in Figure 10. While all of the monitored quantities will eventually show the problem, the likelihood-based quantities and the difference of the µ values do so noticeably sooner than the others. This demonstrates that derived quantities can somewhat improve the precision of SBC: small changes in the univariate marginals can result in big (and thus easy to detect) changes for some test quantities combining the univariate marginals with data and other parameters.

Real-world case study
We present a case study adapted from an actual user discussion on the forums of the Stan probabilistic programming language. Our goal is to use Stan and its Hamiltonian Monte Carlo implementation to sample from a distribution over an ordered K-dimensional simplex, which is then to be used as a component in a larger model. To do that, we need to construct an ordered simplex from primitive data types available in Stan and compute the logarithm of the Jacobian determinant of the transformation (up to a constant).

Figure caption (case study 7): Note the different horizontal axis between the top row (quantities that detect the problem slowly or not at all) and the bottom row (quantities that detect the problem quickly). The vertical red dashed line marks 500 simulations. We only show quantities derived from the first element of µ; the situation is analogous for the second element. The drop(mu[1]) quantity is defined as µ_1 if µ_1 < 1 and as …

Proposed implementations
We consider three variants. Mimicking the fallibility of methods proposed by real statisticians, not all of the following derivations will be correct. A reader interested in little mathematical puzzles may try to pin down any errors. In the next subsection, we will then show how to use SBC to discover the error(s) without the painstaking attention to detail required for checking the math. We then also remedy the error(s).
The first variant will be called min. Here, we start with an unordered bounded vector u ∈ [0, 1]^{K−1} (which is a primitive in Stan). The minimal element of the simplex needs to satisfy x_1 < 1/K; starting from this constraint gives us a recursive formula for the transformation, which we can unroll. Here, r_i can be understood as tracking the remaining amount to be distributed, ensuring the x_i sum to 1 if all the following elements will be at least b_i. For 1 ≤ i < K we have ∂x_i/∂u_i = r_i/(K + 1 − i), and ∂x_i/∂u_j = 0 when i < j < K, so the Jacobian matrix is triangular and the Jacobian determinant is thus the product of the diagonal terms, ∏_{i=1}^{K−1} r_i/(K + 1 − i).

The second variant, called softmax, starts with a positive ordered vector v ∈ (0, +∞)^{K−1}, v_1 < … < v_{K−1} (also a primitive in Stan). We then prepend 0 to the vector and normalize it with the softmax function, writing s for the resulting normalizing constant. We notice the repeated elements and define a (K − 1)-dimensional diagonal matrix D, where D_{i,i} = exp(v_i)/s^2, which lets us express the Jacobian matrix compactly. We now define a (K − 1)-dimensional column vector c, c_k = −exp(v_k), and a row vector r, r_k = 1, and obtain J = D(cr + sI_{K−1}). By the matrix determinant lemma, det(cr + X) = det(X)(1 + rX^{−1}c) for any invertible matrix X. Since rc = Σ_{i=1}^{K−1}(−exp v_i) = 1 − s, the determinant follows.

As a different approach, if we are willing to restrict our priors over the ordered simplex to Dirichlet distributions, we may employ the fact that if w ∈ (0, +∞)^K with w_i ∼ Γ(α_i, 1), then w/Σ_{i=1}^K w_i ∼ Dirichlet(α_1, …, α_K). So if we start with w positive ordered (a primitive in Stan), then x = w/Σ_{i=1}^K w_i will be Dirichlet distributed over OrdSimplex_K and no Jacobian adjustment is required. A downside of this approach is that the mapping is many-to-one, and in models where x is tightly constrained by data, the implied geometry on w will likely pose difficulty for most samplers. This variant will be referred to as gamma.
At this point the interested reader is welcome to try to find issues with any of the above approaches.

Testing with SBC
Whether the reader managed to find errors or not, we can use SBC to test all approaches. To run SBC we embed the ordered simplex into a simple model. Implementing the simulator code is straightforward: due to symmetry, we can sample x simply by ordering a sample from the unordered Dirichlet distribution. Both the min and gamma variants show no problems in SBC and are indeed correct, but softmax exhibits issues. Figure 11 shows the evolution of the discrepancies. The problems are most quickly picked up by the first element of x and the log Dirichlet prior density. Although the problem is found relatively quickly with SBC, the bias in the inferences would likely not be noticed in an informal assessment of the model: the results are not completely wrong, just somewhat biased. The source of the issue is an off-by-one error in the exponent for s in Equation (9). Indeed, if we correct the Jacobian, SBC passes.
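A sketch of that simulator step (Python rather than the paper's R, with an arbitrary symmetric concentration chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 4

def simulate_ordered_simplex(rng, alpha):
    """Draw from an ordered-simplex prior by sorting a draw from the
    corresponding symmetric (exchangeable) Dirichlet distribution."""
    return np.sort(rng.dirichlet(alpha))

x = simulate_ordered_simplex(rng, np.ones(K))
print(x)  # nondecreasing, sums to 1
```

The sort trick relies on the Dirichlet being symmetric in its coordinates; note that the minimal element automatically satisfies x_1 ≤ 1/K, since the minimum cannot exceed the mean.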
Figure 11 (caption): log lik is the multinomial log likelihood of the data and log prior is the log density of the prior Dirichlet distribution. Note the different horizontal axis between the top row (quantities that detect the problem quickly) and the bottom row (quantities that detect the problem slowly). The vertical red dashed line marks 400 simulations.
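For reference, the softmax Jacobian determinant can be reconstructed from the quantities defined in the derivation above (D with D_{i,i} = e^{v_i}/s², c_k = −e^{v_k}, r_k = 1); treat this as our sketch of the corrected computation, not the paper's own display:

```latex
\det(c r + s I_{K-1}) = s^{K-1}\Bigl(1 + \tfrac{1}{s}\, r c\Bigr)
                      = s^{K-2}\bigl(s + (1 - s)\bigr) = s^{K-2},
\qquad
\det D = s^{-2(K-1)} \prod_{i=1}^{K-1} e^{v_i},
\qquad
\lvert\det J\rvert = \det D \cdot \det(c r + s I_{K-1})
                   = \frac{\prod_{i=1}^{K-1} e^{v_i}}{s^{K}}.
```

On the log scale this is Σ_{i=1}^{K−1} v_i − K log s; an exponent of K − 1 instead of K on s is exactly the kind of off-by-one error that the SBC run flags.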

Remarks
The previous section showed the type of modeling problem where SBC is in our view the most useful: deriving and implementing the probabilistic program is relatively involved and offers plenty of opportunities for error, but building a simulator is straightforward. Jacobian adjustments for changes of variables are also, in our experience, one of the most confusing concepts for Stan users, and SBC offers a good way to check if one's reasoning is correct. The examples in this section introduce several non-obvious conceptual questions. For min and softmax we compute the Jacobian only considering K − 1 elements of the ordered simplex, even when the Dirichlet prior then acts on all elements. Is that correct? For gamma, will ordering w imply the correct ordered simplex distribution? Running SBC is then a useful (although not completely definitive) check that our reasoning is correct.
In this example, the log likelihood did not reveal the error quickly, showing that it is not a panacea, especially in cases where the problem lies with the prior. The log prior density seems potentially useful in this case, as it shows the problem as quickly as the most problematic individual parameter. Another lesson is that SBC is useful not only for testing a full model but also for testing components of a model in isolation, akin to unit tests in software engineering. Additionally, by running SBC, we get a simulation study for free: in the specific setup described by Equation 10, min is the most efficient in terms of effective sample size per second, followed by gamma. The corrected version of softmax performs worst. The softmax variant also fails to converge in 6 of the 1000 simulations, while the other two are slightly more stable (convergence problems in 3 and 1 of the simulations, respectively). Finally, our posterior uncertainty is large and the data do not really provide a lot of information about the parameter values. See the rendered output of the supplementary code for details.

Choosing test quantities for SBC
We have found that enriching the repertoire of test quantities used in SBC provides both qualitative and quantitative improvements to the ability of SBC to detect problems in Bayesian computation. For practical use of SBC in everyday model and algorithm development, we recommend using by default the individual model parameters as test quantities, as well as the joint likelihood of the data and potentially a small number of other quantities.
Individual parameters are recommended as they are always immediately available and are able to diagnose a large number of problems with a posterior approximation. Also, the parameters are themselves often of primary interest for inference, so it is desirable to check that their uncertainty is correctly calibrated.
The joint likelihood is a highly useful quantity for detecting the types of problems discussed in Section 4 (especially ignoring data and incorrect correlations). In all of the cases presented in our simulations, the joint likelihood was able to detect the discrepancies, and in many cases it was even able to detect them with the fewest simulations among all considered quantities. While, for some specific problems, we could find quantities that are more sensitive than the joint likelihood, no other quantity was useful in all cases. Section 3.3 provides theoretical justification for why we could expect this to hold frequently and not only in the examples we discussed. We think this generality makes the joint likelihood a good default quantity to monitor in SBC. If not using all the data correctly is a potential issue (e.g., because the code handling the data is particularly complex), then adding selected likelihoods for subsets of the data might also be sensible.
As shown in Section 5, knowing where a potential problem lies can let us design more sensitive problem-specific checks (e.g., when we are not sure our prior density is correct, the log prior can be highly useful). It also makes sense to add test quantities tailored to the specific inferential goals we have built the model for (e.g., some specific model predictions). These quantities often let us implicitly check the correctness of parameter correlations or other dependency structures and safeguard the user against the problems that they care about the most. If correlations or other dependencies in the posterior are directly of interest, then pairwise products or differences of the model parameters can also be sensible test quantities.

Limitations
Although we have shown that SBC can in principle diagnose any problem, limitations for practical use remain. For nontrivial models, adding a finite number of test quantities cannot guard against all possible ways the SBC identity may be satisfied by an incorrect posterior. However, as we check more quantities, the potential counterexamples become contrived, hard to construct, and unlikely to be the result of an inadvertent bug in model or algorithm code. At the same time, adding more test quantities increases the risk of false SBC failures simply due to the number of tests performed (if no corrections for multiple comparisons are made for the SBC checks), or it may reduce the overall power of the check (if corrections for multiple comparisons are made), so choosing test quantities carefully remains important.
This problem could potentially be alleviated by improving our understanding of the expected dependency structure of different test quantities' uniformity checks, letting us correct for multiple comparisons without losing much power. However, even similar test quantities can lead to SBC checks that differ in principle (see Section 3.6). So any practical measure of dependency or orthogonality between test quantities would need to reflect not only the existence of a difference, but also its magnitude. We leave that as future work. In practice, we have seen similarity in the degree of uniformity violation between different test quantities using the same inputs, making the need for multiple comparison correction less urgent.
Moreover, there are practical limitations imposed by the fact that we always have only limited computational resources for SBC: we can produce only a limited number of simulated datasets to fit the model on and only a limited number of posterior draws per fitted model. Both contribute to the stringency and precision of the uniformity test we can perform. The difference between continuous SBC and any practical implementation of sample SBC arises due to (a) approximating q_{ϕ,f}(x|y) by Q_{ϕ,f}(⌊xM⌋ | y), and (b) using a finite number of simulations to assess uniformity of N_total. In both cases, the underlying difference can be understood as estimating a CDF by an empirical CDF and should therefore have a similar rate of decrease with more draws. This suggests that for a given computational budget a user is likely to obtain the highest sensitivity using the same order of magnitude of simulated datasets as posterior draws per dataset. However, in practice most algorithms incur a substantial cost in a warmup phase before any samples can be extracted. We also want to assess that our fitting algorithm has converged for each dataset, which typically requires the equivalent of at least 100 independent posterior samples (as measured by effective sample size), e.g., to get a low R̂ statistic, as discussed by Vehtari et al. (2021). It is thus hard to get a speedup by reducing the number of posterior draws. Unless we can afford to run many thousands of simulations, we are also unlikely to benefit substantially from getting more than this minimal number of draws.
Additional test quantities do not help much with precision problems: if the posterior is close to correct, the test quantities will also be close to correct. In some cases, though, test quantities can slightly increase the sensitivity of the check by combining multiple parameters, so that small imprecisions in each of the parameters get compounded (once again, the nonlinearity of the likelihood seems to be at least sometimes useful in this regard).

Implications for non-SBC checks
As a contribution to the broader discussion about validation of Bayesian computation, we show that SBC and the data-averaged posterior provide different checks, despite being repeatedly conflated in the literature (see Section 3.5). We leave a more detailed comparison of SBC and the data-averaged posterior as future work, although there are some tentative arguments to believe that SBC provides stricter checks.
SBC is not the only approach to validating Bayesian computation that relies on choosing specific test quantities: test quantities are fundamental to the methods of Geweke (2004), Prangle et al. (2014), Gandy and Scott (2020), and Cockayne et al. (2022). We suspect that many of the considerations regarding their choice for SBC are applicable also in these other approaches.
For each y, we will denote the implied distribution over Θ as ϕ_y.
Definition 2 (Test quantity). Given a data space Y and a parameter space Θ, a test quantity is any measurable function f : Θ × Y → R.
Definition 3 (Sample rank CDF, sample Q, sample SBC). Given a data space Y, a parameter space Θ, a test quantity f, M ∈ N, and a posterior family ϕ. For any y ∈ Y, if θ_1, …, θ_M ∼ ϕ_y, we can define the random variables N_less and N_equals, counting the draws θ_m for which f(θ_m, y) is smaller than, respectively equal to, f(θ, y), and the rank N_total, obtained from N_less by breaking ties uniformly at random. For a fixed θ, we define the M-sample rank CDF R_{ϕ,f}(i | θ, y) as the CDF of N_total. Averaging over the correct posterior, we can define the M-sample Q_{ϕ,f}(i | y). We then say that ϕ passes M-sample SBC w.r.t. f if for all i ∈ {0, …, M − 1} we have ∫_Y dy Q_{ϕ,f}(i|y) π_marg(y) = (i + 1)/(M + 1). This definition does not immediately match the procedure we actually use to run SBC in practice, but it is more convenient for further analysis. We now prove the equivalence with the SBC procedure.

Theorem 1 (Procedural definition of sample SBC). Given a data space Y, a parameter space Θ, a test quantity f, M ∈ N, and a posterior family ϕ. ϕ passes M-sample SBC w.r.t. f if and only if, given θ* ∼ π(θ), y* ∼ π_obs(y | θ*), and N_total = N_total_{ϕ,f,θ*,y*}, we have N_total ∼ uniform(0, M).

Proof. We show the equivalence of the CDF of N_total with the formula in the definition of sample SBC; the two agree for all i ∈ {0, …, M}.

However, sample SBC is not particularly amenable to direct analysis, as the choice of M can matter. We will thus focus on a continuous case, which can be understood as the limit of sample SBC as M → ∞.
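Our reading of the rank statistic in Definition 3 can be sketched as follows (plain NumPy; the uniform tie-breaking is what keeps heavily tied, e.g., discrete, test quantities calibrated):

```python
import numpy as np

rng = np.random.default_rng(3)

def sbc_rank(f_star, f_draws, rng):
    """N_total: number of draws strictly below f_star, plus a uniform
    tie-break among the draws exactly equal to f_star."""
    n_less = int(np.sum(f_draws < f_star))
    n_equal = int(np.sum(f_draws == f_star))
    return n_less + int(rng.integers(0, n_equal + 1))  # uniform on {0, ..., n_equal}

# Sanity check: for exchangeable (i.e., correctly calibrated) values the rank
# is uniform on {0, ..., M} even for a coarse, heavily tied quantity.
M, S = 9, 20000
ranks = [sbc_rank(v[0], v[1:], rng) for v in rng.integers(0, 3, size=(S, M + 1))]
counts = np.bincount(ranks, minlength=M + 1)
print(counts / S)  # each entry close to 1 / (M + 1) = 0.1
```

Without the random tie-break (e.g., always counting ties as "below"), the same experiment would produce visibly non-uniform ranks even for a perfectly calibrated quantity.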
Definition 4 (CDF, tie probability, quantile functions). Given a model π, a data space Y, a parameter space Θ, a posterior family ϕ, and a test quantity f, we define the true CDF C_f(s|y) and the fitted CDF C_{ϕ,f}(s|y) of the test quantity, the corresponding true and fitted quantile functions C_f^{−1}(x|y) and C_{ϕ,f}^{−1}(x|y), and the true and fitted tie probabilities D_f(s|y) and D_{ϕ,f}(s|y), where R̄ := R ∪ {−∞, +∞} is the extended real number line. When no confusion arises, the model superscript will be omitted.
Definition 5 (Continuous rank CDF, continuous q, continuous SBC). Assume a data space Y, a parameter space Θ, a test quantity f, and a posterior family ϕ. For fixed θ ∈ Θ and y ∈ Y we define the continuous rank CDF r_{ϕ,f}, assuming U is a random variable distributed uniformly over the [0, 1] interval. If the fitted tie probability is 0, then r is a step function and the implied rank distribution is degenerate.
Averaging over the correct posterior, we can define the continuous q. We then say that ϕ passes continuous SBC w.r.t. f if for all x ∈ [0, 1]: ∫_Y dy q_{ϕ,f}(x|y) π_marg(y) = x.
In both the sample and the continuous case, the presence of ties introduces analytical difficulties. We thus start with a useful lemma that lets us avoid dealing with ties for many purposes.

Lemma 1 (Removing ties). For any model π, posterior family ϕ, and test quantity f, there exist a model π′, a posterior family ϕ′, and a test quantity f′ such that π′ and ϕ′ have no ties w.r.t. f′, while both sample Q and continuous q are maintained.

Proof. The overall idea is that we can always smooth the implied ties in the distributions of the test quantity by introducing a gap at each possibly tied point and adding a uniformly random value. Specifically, for each y ∈ Y and s ∈ R we define the tie size D̄_{ϕ,f}(v|y) ∈ R, the set of tied values V_{ϕ,f}(y) ⊂ R, and the number w_{ϕ,f}(s, y).
For each y there can be at most a countable number of point masses in the implied distributions and the total mass is at most 1, so w_{ϕ,f}(s, y) will always be defined and smaller than 2.
We now construct the new model, posterior family, and test quantity by adding a new parameter p uniformly distributed over the [0, 1] interval, so Θ′ := Θ × [0, 1], and for all y ∈ Y, p ∈ [0, 1] we set the transformed quantities accordingly. Here, w provides gaps, as it increases by D̄_{ϕ,f}(f(θ, y)) > 0 for each tied value. Those gaps are then filled uniformly at random by adding scaled p. By inspection, there are no ties in f′. The ordering of previously non-tied elements for both continuous and sample ranks has not changed, and the order among those previously tied is uniformly random, so both sample Q and continuous q do not change.
Lemma 2 (No ties and inverting CDFs). For all models π, posterior families ϕ, and test quantities f : Θ × Y → R:
1. If the model π has no ties w.r.t. f, then for all y the quantile function C_f^{−1}(s|y) is the inverse of C_f(s|y). Additionally, C_f(s|y) is a surjection onto the [0, 1] interval.
2. If ϕ has no ties w.r.t. f, then for all y the quantile function C_{ϕ,f}^{−1}(s|y) is the inverse of C_{ϕ,f}(s|y). Additionally, C_{ϕ,f}(s|y) is a surjection onto the [0, 1] interval.
3. If both π and ϕ have no ties w.r.t. f, then for all y ∈ Y and x ∈ [0, 1]: q_{ϕ,f}(x|y) = C_f(C_{ϕ,f}^{−1}(x|y) | y).
Proof. Items 1 and 2 follow directly from the basic properties of CDFs. For item 3, when there are no ties, the definition of the continuous rank simplifies to r_{ϕ,f}(x|θ, y) = I[x ≥ C_{ϕ,f}(f(θ, y)|y)], and thus q_{ϕ,f}(x|y) = Pr(C_{ϕ,f}(f(θ, y)|y) ≤ x) = C_f(C_{ϕ,f}^{−1}(x|y)|y), where the second equality holds because C_{ϕ,f}^{−1} is the inverse of C_{ϕ,f} for all y.
We now move to establishing a close correspondence between the sample SBC and continuous SBC.
Theorem 2 (Sample SBC implies continuous SBC). Let ϕ be any posterior family over Y and Θ, and let f be a test quantity.
1. For any fixed y ∈ Y, if as the number of sample draws M → ∞ we have ∀i ∈ {0, …, M}: Q_{ϕ,f}(i|y) → (i + 1)/(M + 1), then q_{ϕ,f}(x|y) = x for all x ∈ [0, 1].
2. If as M → ∞ we have ∀i ∈ {0, …, M}: ∫_Y dy Q_{ϕ,f}(i|y) π_marg(y) → (i + 1)/(M + 1), then ϕ passes continuous SBC for f.

Proof. Using Lemma 1, it once again suffices to prove the case with no ties. We start by showing that as M → ∞, normalized sample ranks converge almost everywhere to the continuous ranks.
Assuming no ties, R_{ϕ,f}(⌊xM⌋ | θ, y) can be written in terms of Bin(K|M, p), the CDF of a binomial distribution with M trials and success probability p, evaluated at K. Using the central limit theorem, we can replace the binomial CDF in the limit by Nor(y|µ, σ), the CDF of a normal distribution with mean µ and standard deviation σ, evaluated at y. We can now inspect the limiting behaviour of the z-score. We have thus established pointwise convergence of R_{ϕ,f}(⌊xM⌋ | θ, y) to r_{ϕ,f}(x|θ, y) almost everywhere w.r.t. x, because we assume there are no ties. This means that we can satisfy the conditions of the dominated convergence theorem.
We can now prove both claims in an analogous way. We start with claim 2, which is more complex. For all x ∈ [0, 1] we obtain Equation (12), which holds regardless of the actual rank distributions. We now use the assumption that as M → ∞ the data-averaged rank distribution becomes uniform, which implies Equation (13) for all x ∈ [0, 1]. Combining (12) and (13), we get that ∫_Y dy q_{ϕ,f}(x|y) π_marg(y) = x and therefore ϕ passes continuous SBC w.r.t. f. The reasoning for claim 1 is analogous, only omitting the integration over Y.
Theorem 3 (Continuous SBC implies sample SBC). Let ϕ be any posterior family over Y and Θ. Then for all M ∈ N and any test quantity f:
1. If q_{ϕ,f}(x|y) = x for all x, then the M-sample ranks are uniform conditional on y.
2. If ϕ passes continuous SBC w.r.t. f, then ϕ passes M-sample SBC w.r.t. f.

Proof. If either ϕ or π has ties w.r.t. f, we can use Lemma 1 to construct a model and posterior family with no ties but the same continuous q and sample Q; therefore it suffices to prove the statement when there are no ties. When there are no ties, then given y, q_{ϕ,f}(x|y) is a bijection and we can thus also define q^{−1}. Additionally, θ* ∼ π_prior(θ) and θ_1, …, θ_M ∼ ϕ_y are all conditionally independent given y.
This allows us to determine the probability of the first r draws being smaller than the prior sample θ* ∼ π(θ) and the remaining M − r draws being larger, conditional on a specific y.
Since θ_1, …, θ_M are independent given y, we can easily extend to all possible orderings and substitute z = q_{ϕ,f}^{−1}(x|y). From the precondition in claim 1, q_{ϕ,f}(x|y) = x and thus ∂/∂z q_{ϕ,f}(z|y) = 1, which proves claim 1. For claim 2 we investigate the unconditional probability, for all r ∈ {0, 1, …, M}. Since claim 2 assumes ϕ satisfies SBC w.r.t. f, we have ∂/∂z ∫_Y dy π_marg(y) q_{ϕ,f}(z|y) = 1 and thus we can proceed as in claim 1. With those foundations ready, we now focus on proving statements about continuous SBC, with the understanding that, thanks to Theorems 1-3, they also shed light on SBC as actually implemented in software.
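For the no-ties case, the computation behind claim 1 can be written out compactly (our reconstruction: under the precondition, x := C_{ϕ,f}(f(θ*, y)|y) is uniform on [0, 1]):

```latex
\Pr\left(N^{\mathrm{total}} = r \,\middle|\, y\right)
  = \binom{M}{r} \int_0^1 x^{r} (1 - x)^{M - r}\, \mathrm{d}x
  = \binom{M}{r}\, \mathrm{B}(r + 1,\, M - r + 1)
  = \frac{1}{M + 1},
```

so the rank is uniform on {0, …, M}, simultaneously for every M.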
Theorem 4 (The correct posterior has correct q). If ϕ is the correct posterior family, i.e., ϕ_y = π_post(·|y) for all y, then for all y ∈ Y and x ∈ [0, 1]: q_{ϕ,f}(x|y) = x.

Proof. When π has ties w.r.t. f, we can use Lemma 1 to construct a model with no ties and the same continuous q, preserving correctness. So we only need to prove the case where there are no ties.
Without ties, we have ∀s ∈ R, y ∈ Y: C_{ϕ,f}(s|y) = C_f(s|y), and thus ∀x ∈ [0, 1], y ∈ Y: q_{ϕ,f}(x|y) = C_f(C_{ϕ,f}^{−1}(x|y)|y) = x.

Corollary. The correct posterior passes continuous SBC. Combining this result with Theorem 3, the correct posterior will also produce uniform Q and pass M-sample SBC for any M and f.
Passing SBC is a necessary condition for a correct posterior distribution, but it is not a sufficient condition. The following theorem characterizes a sufficient condition for having a correct distribution of a given test quantity.
Theorem 5 (Characterization of SBC failures). Given a posterior family ϕ over Y, Θ and a test quantity f, then for all y ∈ Y and s ∈ R: C_{ϕ,f}(s|y) = C_f(s|y) if and only if ∀x ∈ [0, 1]: q_{ϕ,f}(x|y) = x.
Proof. Given a model π and a test quantity f, we build a model π′ over the same data space and a univariate parameter space Θ′ := R by first setting π′_marg(y) := π_marg(y).
We can then define the true posterior and the posterior family ϕ′ via their univariate CDFs for all s ∈ R, y ∈ Y. This always defines a valid joint distribution π′_joint(θ′, y) = π′_post(θ′|y) π′_marg(y) and thus a valid model.
We now consider the projection function p : R × Y → R, p(θ′, y) = θ′. By construction of π′ and ϕ′, we have for all s ∈ R, x ∈ [0, 1], y ∈ Y the corresponding equalities of CDFs and ranks, which then imply the claim for all x ∈ [0, 1], y ∈ Y. It therefore suffices to prove the statement for π′. The forward implication follows from Theorem 4, because the correct posterior always produces correct q. According to Lemma 1, it suffices to prove the reverse implication only for posteriors without ties. In this case, using Lemma 2, we have the stated equality for all x ∈ [0, 1]. So for all y ∈ Y we see that C^{π′,−1} is the inverse of C^{π′}_p and thus the CDFs have to be equal. Since there is only a single parameter θ′ in π′, the equality of the CDFs for the projection function implies the equality of the densities. This completes the proof.
This characterizes SBC failures in the sense that SBC investigates the average over Y of the q_{ϕ,f} functions, which can appear correct even if for some subset of Y with non-zero marginal mass the q_{ϕ,f} functions do not yield the expected results. The consequence is that whenever we have an incorrect distribution of a given test quantity, the failure can be isolated to some Ȳ ⊂ Y and detected by separately considering ∫_Ȳ dy q_{ϕ,f}(x|y). While in practical uses of SBC inspecting rank distributions for subsets of the simulations based on the realized y ∈ Y leads to a loss of power, we can often get similar results by using a suitable test quantity g that depends on f and y in a way that causes the problems in individual y to accumulate instead of canceling. This is one of the reasons why allowing test quantities to depend on data is useful in general.
We will now show that our SBC variant is in some sense complete. Previous work was correct in noting that with test quantities that are only a function of the parameters, some problematic posteriors (e.g., the unaltered prior distribution) can never fail SBC. However, when test quantities are allowed to depend on observed data, we can in principle discover any problematic ϕ. For this we will need a technical lemma.
Lemma 3 (Value of q). Let ϕ be any posterior family over Y, Θ, let f be a test quantity, and let s = C_{ϕ,f}^{−1}(x|y). Then for all y ∈ Y and x ∈ [0, 1], the value of q_{ϕ,f}(x|y) can be evaluated explicitly in terms of C_f(s|y) and the tie probabilities.

Proof. For a given x, we split Θ into three disjoint sets Θ^x_1, Θ^x_2, Θ^x_3, according to whether f(θ, y) is smaller than, equal to, or greater than s. Now using the definition of q, the definition of r_{ϕ,f} implies that θ ∈ Θ^x_1 contributes its full posterior mass; the definition also directly implies θ ∈ Θ^x_3 ⟹ r_{ϕ,f}(x|θ, y) = 0. If D_{ϕ,f}(s|y) = 0, then Θ^x_2 is empty, and we have q_{ϕ,f}(x|y) = C_f(s|y). If D_{ϕ,f}(s|y) > 0, then we first note that θ ∈ Θ^x_2 ⟺ f(θ, y) = s. Then, straight from the definition of the continuous rank, the rank distribution is uniform between C_{ϕ,f}(s|y) − D_{ϕ,f}(s|y) and C_{ϕ,f}(s|y), and the corresponding CDF thus has a linear segment. This lets us evaluate the integral over Θ^x_2, and combining with the previous results yields the claim.

Theorem 6 (Density ratio). For any posterior family ϕ, take g(θ, y) = π_post(θ|y)/ϕ(θ|y). Then ∫_Y dy q_{ϕ,g}(x|y) π_marg(y) ≤ x for all x ∈ [0, 1], and ϕ passes SBC w.r.t. g only if g(θ, y) = 1 almost everywhere, i.e., only if ϕ coincides with the true posterior.

Proof. First we establish that ∀y ∈ Y, s ∈ R: C_{ϕ,g}(s|y) ≥ C_g(s|y). First consider s ≤ 1, where the difference of the two CDFs can be written as an integral of non-negative values. For s > 1 we can rearrange, and once again integrate over non-negative values. So the difference is non-negative for any s. Also, the difference cannot be zero for all s once there is a region of nonzero mass where g(θ, y) ≠ 1. In other words, ∀s: C_{ϕ,g}(s|y) = C_g(s|y) if and only if the condition in (14) holds.
Since C_{ϕ,g}(s|y) ≥ C_g(s|y), we have C_{ϕ,g}^{−1}(x|y) ≤ C_g^{−1}(x|y). And so, due to the monotonicity of CDFs and quantile functions, C_g(C_{ϕ,g}^{−1}(x|y)|y) ≤ C_g(C_g^{−1}(x|y)|y). If there are no ties, we can apply Lemma 2 and have for all y ∈ Y, x ∈ [0, 1]: q_{ϕ,g}(x|y) = C_g(C_{ϕ,g}^{−1}(x|y)|y) ≤ x. Thus ∫_Y dy q_{ϕ,g}(x|y) π_marg(y) ≤ x, where equality holds only if Equation (14) is satisfied. When ties are present, we recall Lemma 3 and make several observations, among them: 4. Straight from its definition, r_{ϕ,g} is a non-decreasing function of x, and thus q_{ϕ,g} is also a non-decreasing function of x.
Putting those together, we obtain that even if ties are present, (15) holds for some x directly (via point 1 above). At any other point, the q function is linear, and due to points 3 and 4 we also have q_{ϕ,g}(x|y) ≤ x; equality cannot hold at those intermediate points unless it also holds at the edges of the linear segment. Therefore, even when ties are present, ∫_Y dy q_{ϕ,g}(x|y) π_marg(y) ≤ x and equality holds only if (14) is satisfied.
Corollary. For any posterior family ϕ that would provide different posterior inferences than the true posterior π_post, there is a test quantity f such that ϕ does not pass SBC w.r.t. f.

The density ratio is typically not available in practice, and if it were, it would likely make more sense to directly check whether g(θ, y) = 1. The above theorem however provides some intuition for why using the likelihood f_lik(θ, y) = π_obs(y|θ) is useful: it is related to the density ratio, but does not depend on a particular ϕ and is usually easily available in practice.
We now generalize the result that ignoring data in test quantities implies that some problems in the posterior family can never be detected.We use a technical lemma, a continuous analog of Theorem 1.
Lemma 4 (Integral representation of SBC). Given a test quantity f and a posterior family ϕ such that both π and ϕ have no ties w.r.t. f, ϕ satisfies continuous SBC w.r.t. f if and only if for all x ∈ [0, 1]: Proof. We directly apply the definition of C_ϕ,f and use the absence of ties to obtain the invertibility of C_ϕ,f from Lemma 2.
Theorem 7 (Incomplete use of data). Assume a model π with observation space Y and parameter space Θ, a space Y′, and a measurable function t : Y → Y′. Denote t⁻¹(y′) = {y ∈ Y : t(y) = y′}. Consider the model π′ with parameter space Θ and observation space Y′ such that for all θ ∈ Θ, y′ ∈ Y′: π′_obs(y′|θ) = ∫_{t⁻¹(y′)} dy π_obs(y|θ).
Here, the choice of t lets us choose which aspects of the data are ignored. If ∀y ∈ Y : t(y) = 1, we recover the case where all data are ignored: π′_post(θ|y) = π_prior(θ), and thus ϕ(θ|y) = π_prior(θ) will pass SBC w.r.t. f. If t is a bijection, no information is lost. Other choices of t then let us interpolate between those two extremes, for example ignoring just a subset of the data points or treating some data points as censored.
Proof. Start with the integral representation of SBC for ϕ w.r.t. f from Lemma 4.
The last step is the integral representation of SBC for ϕ′ with respect to f′ from Lemma 4.
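The following sketch (again a hypothetical illustration, not one of the paper's case studies) makes Theorem 7 concrete for a two-observation normal model: ϕ is the exact posterior of the reduced model with t(y) = y_1, i.e. it silently ignores y_2. The projection f(θ, y) = θ factors through t and stays uniform, while the joint log likelihood does not and picks up the discrepancy:

```python
import numpy as np

rng = np.random.default_rng(1)
N_SIMS, M_DRAWS = 2000, 100

# Model: theta ~ N(0,1); y_i | theta ~ N(theta, 1) for i = 1, 2.
# Broken posterior: conditions on y1 only, i.e. the exact posterior of the
# reduced model with t(y) = y1, which is N(y1/2, 1/2).
fr_proj = np.empty(N_SIMS)
fr_lik = np.empty(N_SIMS)
for i in range(N_SIMS):
    theta_sim = rng.normal()
    y1, y2 = rng.normal(theta_sim), rng.normal(theta_sim)
    draws = rng.normal(y1 / 2, np.sqrt(0.5), size=M_DRAWS)  # ignores y2
    joint_log_lik = lambda t: -0.5 * ((y1 - t) ** 2 + (y2 - t) ** 2)
    fr_proj[i] = np.mean(draws < theta_sim)
    fr_lik[i] = np.mean(joint_log_lik(draws) < joint_log_lik(theta_sim))

mean_proj, mean_lik = fr_proj.mean(), fr_lik.mean()  # ~0.5 vs noticeably above 0.5
```

Because the projection factors through t, its ranks are exactly those of the correct posterior for π′ and stay uniform; the joint likelihood depends on the discarded y_2 and its mean rank drifts upward.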
Proof. If the conditions hold, then one can construct a t that merges P_1 and P_2 while preserving the values of f_i. The correct posterior for the implied model π′ will then imply an incorrect posterior for π that passes SBC.
Lemma 5 (Order-preserving transformations). Given test quantities f, g : Proof. The result follows directly from the definitions of r and R, as they depend only on the ordering of the values.
Lemma 6 (Reverse transformation). Given test quantities f, g : Proof. First consider the continuous rank CDF r. We first establish relations between the CDFs and tie probabilities: We can then proceed to directly evaluate r_ϕ,g: where U is uniformly distributed on [0, 1] and the second-to-last identity holds because the probability of exact equality is 0 and 1 − U has the same distribution as U. Now we can focus on the sample rank CDF. We first distinguish how the behaviour of the N_equals and N_less random variables (from Definition 3) relates between f and g: We can then directly evaluate R_ϕ,g:

Proof. This is a direct consequence of Lemmas 5 and 6. The test quantity g has to be either an order-preserving transform or a sequence of an order-preserving transform and a reverse transform. Since order-preserving transforms do not change the rank CDFs at all, SBC results will be exactly identical. The only remaining thing to prove is that a reverse transform maintains a uniform rank distribution in both the sample and continuous cases.
Assuming ∀y ∈ Y : h_y is strictly decreasing, we get for the sample case: and thus ϕ passes SBC w.r.t. g. If instead we first assume that ϕ passes SBC w.r.t. g, then, After substituting j = M − i − 1, this shows ϕ passes SBC w.r.t. f. For the continuous case, where both directions of the implication for passing SBC w.r.t. f and g follow immediately.
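These invariances are easy to verify numerically. The sketch below (illustrative, with the sample rank taken here as the number of posterior draws below the simulated value, and no ties) checks that a strictly increasing transform leaves the rank unchanged while a strictly decreasing one maps a rank of i to M − i:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 100
theta_sim, draws = rng.normal(), rng.normal(size=M)  # continuous values, no ties

def rank(f_sim, f_draws):
    # sample rank: number of draws whose test-quantity value is below f_sim
    return int(np.sum(f_draws < f_sim))

f = lambda t: t
g_inc = lambda t: np.exp(t)    # strictly increasing h_y (order-preserving)
g_dec = lambda t: -t ** 3      # strictly decreasing h_y (reverse transform)

r_f = rank(f(theta_sim), f(draws))
r_inc = rank(g_inc(theta_sim), g_inc(draws))  # identical rank
r_dec = rank(g_dec(theta_sim), g_dec(draws))  # mirrored rank M - r_f
```

Since the rank only depends on the ordering of values, the increasing transform changes nothing, and the decreasing transform mirrors the rank distribution, which preserves uniformity.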
Theorem 8 shows that certain simple transformations of test quantities lead to equivalent behaviour for SBC. The result cannot be easily strengthened, as many transformations can in fact lead to different behaviours for SBC. Appendix B provides several counterexamples: Example 3 shows that if we allow h_y to be strictly increasing for some y ∈ Y while other y have h_y strictly decreasing, we obtain a different check. Example 4 shows that we can obtain different behaviour if we combine a test quantity with a discontinuous non-monotonic bijection, and Example 5 shows the same when a whole range of values is projected onto a single point. In all those examples, the transformed test quantities rule out some posterior families that pass SBC for the original.

This means we are relatively free to choose one of the inverse CDFs and solve for the other, for all x ∈ [0, 1]: Given a valid inverse CDF Φ⁻¹(x|1) on [0, 1], Equation (23) defines a valid ϕ as long as: where (24) and (25) ensure a non-negative value inside the square root in (23), and (26) ensures Φ⁻¹(x|1) is nondecreasing. While this allows for a wide array of incorrect solutions, passing SBC for this condition already rules out some pathological behavior; e.g., the low quantiles of the posterior when 0 is observed cannot be too high. If we instead solve (22) for Φ⁻¹(x|1), we obtain that when 1 is observed, high quantiles cannot be too low; specifically, ∀ 1/2 < x < 1 : Φ⁻¹(x|1) > √(2x − 1).
Example 2 (Projection and data-averaged posterior). We can further build upon Example 1 to illustrate the differences between passing SBC and having a correct data-averaged posterior.
Correctness of the data-averaged posterior for the projection f_1(θ, y) = θ simply entails for all θ ∈ [0, 1]: For example, flipping the correct posterior to √x violates all the conditions in (24)–(26). On the other hand, we can take a simple posterior family satisfying SBC for f_1 (via Equation 23): The implied density has an incorrect data-averaged posterior: Finally, Lee et al. (2019) in their Equation 1.3 claim that several methods including SBC cannot notice when ϕ is a convex combination of the prior and the correct posterior. This is not true; the authors derive it from the incorrect assumption that SBC checks for the correct data-averaged posterior. We can readily build a counterexample: consider ϕ_C(θ|y) := 1/2 (π_post(θ|y) + π_prior(θ)). We then have: However, plugging Φ⁻¹_C(x|0) into (23), which is a necessary condition for passing SBC w.r.t. f_1, yields, which is in conflict with what we derived in (28), and ϕ_C thus cannot pass SBC w.r.t. f_1 (although it has a correct data-averaged posterior).
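The following sketch illustrates this numerically on a conjugate normal model (our own hypothetical setup rather than the Beta model of Example 1): the convex combination ϕ_C keeps the mean rank at 1/2, since its data-averaged posterior is correct, but its SBC rank fractions for the projection are visibly under-dispersed relative to uniform, so a uniformity check sensitive to dispersion rejects it given enough simulations:

```python
import numpy as np

rng = np.random.default_rng(3)
N_SIMS, M_DRAWS = 4000, 100

def rank_fracs(mix_with_prior):
    """SBC rank fractions for theta ~ N(0,1), y|theta ~ N(theta,1), using
    either the exact posterior N(y/2, 1/2) or the convex combination
    0.5 * posterior + 0.5 * prior discussed by Lee et al. (2019)."""
    fr = np.empty(N_SIMS)
    for i in range(N_SIMS):
        theta_sim = rng.normal()
        y = rng.normal(theta_sim)
        post = rng.normal(y / 2, np.sqrt(0.5), size=M_DRAWS)
        if mix_with_prior:
            from_prior = rng.random(M_DRAWS) < 0.5
            post[from_prior] = rng.normal(size=from_prior.sum())
        fr[i] = np.mean(post < theta_sim)
    return fr

var_correct = rank_fracs(False).var()  # close to 1/12, as for uniform ranks
var_mixture = rank_fracs(True).var()   # visibly smaller: under-dispersed ranks
```

The mixture concentrates the ranks toward the middle rather than shifting their mean, which is exactly the kind of deviation SBC detects but a data-averaged-posterior check cannot.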
To combine this with the solution in (23), we additionally need to ensure continuity at Φ⁻¹(1/2|0), so that computing the value via (23) followed by (30) matches the original value. This translates to: Combining the projection and likelihood therefore fixes the midpoint of the quantile function to its true value for both observations. However, we are still able to choose Φ⁻¹(x|0) for x ∈ [0, 1/2] freely as long as conditions (24)–(26) and (31) are met, and this then defines a ϕ that passes SBC w.r.t. both f_1 and f_2.
And the SBC condition becomes: 4(1 − h_1) max{h_0, h_1} ≤ x. (34) If max{h_0, h_1} = 1, then SBC fails, as values f_4(θ, y) = 1/2 are never generated in the posterior while appearing in the prior. If not, then the last branch implies, (35) Substituting this into the second branch in (34), we obtain a set of constraints on Φ⁻¹: x < min{h_0, h_1} ⟹ Φ⁻¹(x|1) = 2x + (Φ⁻¹(x|0) − 1)² − 1 (36) (x − 1)/(h_0 − 1) (37). This will pass SBC w.r.t. f_1, as it is derived directly via (23). However, it will not pass SBC w.r.t. f_4 for multiple reasons. Probably the easiest to see is that in this case we have h_0 = 1/2 and h_1 ≈ 0.15, so over a large range we would need Φ⁻¹_D(x|0) proportional to the square root of x due to (37). Additionally, condition (35) entails h_0 = 1/2 ⟹ h_1 = 1/2 and is therefore also violated. Passing SBC for f_4 restricts the functional form of a potentially large segment of the inverse CDFs via (37) or (38), which makes it strict in this area while providing no constraints on the distribution of values above 1/2. Constructing a posterior family satisfying SBC for f_4 can then proceed as follows: pick 0 < h_0 < 4/5 and calculate h_1 via (35). For x > h_y, the values of Φ(x|y) can be arbitrary (as long as they define valid quantile functions). We can also freely choose Φ⁻¹(x|0) for x < min{h_0, h_1}, but we need to ensure that (a) Φ⁻¹(x|1) is a valid quantile function, satisfying the conditions derived for f_1 in (24)–(26)

Figure 2: Case study 1: Evolution of the difference between the gamma statistic and threshold (log γ) for rejecting uniformity at 5% for the correct posterior. mvn_log_lik[1] and mvn_log_lik[2] are the pointwise likelihoods π(y_1|µ) and π(y_2|µ) respectively, while mvn_log_lik is the joint likelihood. As expected when using a 5% level for rejection, false positives (values below the threshold) do happen, but they tend to correspond to only small discrepancies.

Figure 3: Case study 2: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for an incorrect posterior that equals the prior. Note how quickly large discrepancies accumulate for the likelihood-based quantities, despite the horizontal axis being zoomed to show only the first 50 simulations.

Figure 5: Case study 3: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for an incorrect posterior that ignores the first data point in a small dataset (n = 3).

Figure 6: Case study 4: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for an incorrect posterior that ignores the first data point in a larger dataset (n = 20).

Figure 7: Case study 5: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for an incorrect posterior that has the wrong correlation structure.

Figure 8: Case study 6: Rank distribution for the elements of µ, split by the average value of the corresponding y elements, for the incorrect posterior that satisfies SBC for individual parameters. The distributions for the two cases exactly compensate to make the overall distribution uniform. The gray horizontal line represents the exact uniform distribution and the blue areas represent an approximate 95% prediction interval for the observed ranks, assuming a uniform rank distribution.

Figure 9: Case study 6: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for the incorrect posterior that satisfies SBC for individual parameters. Note the different horizontal axes between the top row (quantities that detect the problem slowly or not at all) and the bottom row (quantities that detect the problem quickly). The vertical red dashed line marks 500 simulations. We only show quantities derived from the first element of µ; the situation is analogous for the second element. The drop(mu[1]) quantity is defined as µ_1 if µ_1 < 1 and as µ_1 − 5 otherwise.

Figure 10: Case study 7: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for an incorrect posterior that introduces a small bias for each parameter.

Figure 11: Evolution of the difference between the gamma statistic and threshold for rejecting uniformity at 5% for the incorrectly implemented softmax variant of the ordered simplex model. log_lik is the multinomial log likelihood of the data and log_prior is the log density of the prior Dirichlet distribution. Note the different horizontal axes between the top row (quantities that detect the problem quickly) and the bottom row (quantities that detect the problem slowly). The vertical red dashed line marks 400 simulations.