The Posterior Predictive Null

Bayesian model criticism is an important part of the practice of Bayesian statistics. Traditionally, model criticism methods have been based on the predictive check, an adaptation of goodness-of-fit testing to Bayesian modeling and an effective method to understand how well a model captures the distribution of the data. In modern practice, however, researchers iteratively build and develop many models, exploring a space of models to help solve the problem at hand. While classical predictive checks can help assess each one, they cannot help the researcher understand how the models relate to each other. This paper introduces the posterior predictive null check (PPN), a method for Bayesian model criticism that helps characterize the relationships between models. The idea behind the PPN is to check whether data from one model's predictive distribution can pass a predictive check designed for another model. This form of criticism complements the classical predictive check by providing a comparative tool. A collection of PPNs, which we call a PPN study, can help us understand which models are equivalent and which models provide different perspectives on the data. With mixture models, we demonstrate how a PPN study, along with traditional predictive checks, can help select the number of components by the principle of parsimony. With probabilistic factor models, we demonstrate how a PPN study can help understand relationships between different classes of models, such as linear models and models based on neural networks. Finally, we analyze data from the literature on predictive checks to show how a PPN study can improve the practice of Bayesian model criticism. Code to replicate the results in this paper is available at \url{https://github.com/gemoran/ppn-code}.


Introduction
Bayesian model criticism is a crucial component of applied data analysis. While designing and studying Bayesian models, the goal of Bayesian model criticism is to understand in what ways the models fit the data well and in what ways they fall short. that reference distribution. If the model passes this check, the observed data is said to be consistent with the model. If the check fails, it suggests that the model cannot adequately generate data similar to the observed data; as such, the model does not provide a useful representation of observed reality.
The most commonly used predictive check is the posterior predictive check (PPC) (Guttman, 1967;Rubin, 1984). A PPC sets its reference distribution to be the posterior predictive distribution of the data, and calculates the probability of the observed data (filtered through a diagnostic function) under this distribution. A PPC captures the idea that an adequate model is one whose posterior predictive provides a plausible distribution of the data.
Over the past decades, many researchers have innovated, refined, and expanded Bayesian predictive checks. Some work has explored the benefits of different reference distributions, such as the prior predictive (Box, 1980), the posterior predictive (Guttman, 1967;Rubin, 1984;Meng, 1994;Gelman et al., 1996), and combinations of the two (Evans and Moshonov, 2006). Other work considers different ways to assess the adequacy of the observed diagnostic, be it through a p-value, another measure of surprise (Bayarri and Berger, 1999), or a visual inspection (Gelman et al., 1996;Gelman, 2004). Still other work has addressed the problem of calibration, trying to ensure that the Bayesian model check enjoys sensible frequentist properties (Robins et al., 2000;Bayarri and Berger, 2000). Finally, there is a line of research that studies how to target the checks at specific components of the model (O'Hagan, 2003;Marshall and Spiegelhalter, 2007). This large body of work provides a rich toolbox for criticizing a Bayesian model.
However, there is an important side of model criticism that predictive checks do not address. In practice, rather than focus on a single model, most Bayesian researchers posit and criticize many models (Gelman et al., 2020). Given such a collection, predictive checks can assess each model individually, but they cannot compare the models to each other. Do some models capture different aspects of the data? Are some of them equivalent to each other?
To answer these questions, this paper proposes the posterior predictive null check (PPN). A PPN asks whether posterior predictive data from one model can pass the predictive check of another model. As an example, consider the simple data in Figure 1 (left): two-dimensional data from a mixture of three Gaussians. Given the data, Figure 1 (right) shows the posterior predictive distribution under four mixture models, with the number of mixture components K ∈ {1, 2, 3, 4} (for details, see Appendix A of the Supplementary Material). As K increases, the posterior-predictive data looks more like the true distribution of the observed data; as expected, the posterior predictive for K = 3 is close to the truth. But notice the posterior predictive for K = 4 is equally good. While we might hope that a predictive check helps decide that K = 1 and K = 2 are inadequate, how can we detect that K = 4 offers no improvement over K = 3?
The PPN helps to solve this problem, asking whether posterior data generated from the K = 3 posterior predictive passes the predictive check for K = 4. As we will discuss, answering this question is equivalent to checking whether the predictive distribution for is indistinguishable from data drawn from p3(xrep | x obs ) (A-generated data). A PPN helps diagnose the fact K = 4 provides no improvement over K = 3. (This can help a researcher choose K = 3, for example, on the principle of parsimony.) K = 3 is the same as the predictive distribution of K = 4. If model K = 3 passes this check, then K = 3-generated data "fools" the posterior predictive check for K = 4; that is, the PPN suggests that K = 4 offers no additional modeling benefit, under the diagnostic used in the check.
We note that the name PPN comes from the idea that the "null distribution" of the PPC is the posterior predictive, and the principle of the PPC is that if the data came from the null then the model is consistent with the data. A PPN asks if an alternative model's posterior predictive distribution might also produce the same null distribution.
For a set of models, a PPN study can help a researcher better understand the relationships between their models, both how they are redundant with each other and how they differ in their predictive distributions. As a demonstration, consider the matrix of plots in Figure 2. Each row indexes a model K ∈ {1, 2, 3, 4}. Along the diagonal are classical predictive checks-each panel illustrates the posterior predictive distribution of the model-specific diagnostic (here, a log likelihood) and the observed value. Notice here that all the models pass their predictive checks; each model can expand the variance of its components to capture the observed data. Consequently, these checks do not narrow down the set of models under consideration.
The PPNs in the off-diagonal panels of Figure 2 can help narrow down the set of models under consideration. Each PPN panel plots the distribution of the diagnostic under both the model under study (the row) and a simpler model (the column). When these distributions overlap then data from the column's model can fool the check for the row's model. We see that K = 3 cannot be fooled by K = 2 or K = 1; but K = 4, though consistent with the data, is fooled by K = 3. When working with an index of complexity-as we are for mixtures-the classical PPC helps indicate if a model's complexity is sufficient to represent the observed data, while the PPN helps to determine whether that complexity is necessary to represent the data. (In Section 3 we will also study collections of models that are not indexed by complexity.) We have demonstrated how a PPN study, by appealing to the concept of parsimony,   Figure 2: A PPN study of mixture models which suggests K = 3 is consistent with the data (no further mixture components are needed). The data are from Figure 1 (left); the true value of K is 3. Along the diagonal are heldout predictive checks; every value of K passes the check.
To the left of the diagonal are PPNs, each one checking if a simpler model can fool the model under study. While K = 1 and K = 2 pass their checks, the PPN shows that they cannot fool K = 3, which also passes. On the other hand, K = 3 can fool the check for K = 4.
can be used to select the number of components in a mixture model. We emphasize that we do not envision the PPN study as a replacement for model selection. Rather, as we discussed, a PPN study can be used as a companion to model selection methods and help to understand the relationships within a collection of "selected" models. In this way, we echo the perspective of Gelman et al. (2020), who contend that presenting multiple models, as opposed to selecting or averaging models, provides a useful picture of the uncertainty inherent in the process of analyzing data. This viewpoint also connects to the "Rashomon effect" as coined by Breiman (2001): there are often many models which have equally good performance. Semenova et al. (2019) expand on this phenomena and define the "Rashomon set" as the set of almost equally accurate models for a given problem. A PPN study determines which models give the same posterior predictive distributions, providing a Bayesian perspective on the Rashomon effect.
In Section 2 we review Bayesian predictive checks, define the posterior predictive null check, and discuss how to use and interpret it. In Section 3 we study and demonstrate the PPN study in different modeling scenarios. With mixtures, we demonstrate how a PPN study can help select the number of components. With probabilistic factor models, we demonstrate how a PPN study can help understand relationships between different classes of models, such as linear models and models based on neural networks. Finally, we analyze a dataset from the literature on predictive checks to show how a PPN study can enhance the practice of Bayesian model criticism.
2 Posterior predictive null checks 2.1 Bayesian model criticism with posterior predictive checks We want to analyze a dataset x obs with Bayesian model A. The model has latent variables θ and is defined by its joint, (1) Bayesian analysis proceeds by evaluating the posterior and the corresponding posterior predictive The posterior distribution of θ helps us investigate the latent variables; the posterior predictive provides a distribution of new data.
Many applications of Bayesian statistics end here. Having defined the model, we use its posterior and posterior predictive to their intended purposes. But this is where the activity of Bayesian model criticism begins. Is the model of Eq. 1 a good model of the data? Does it capture the properties of the data that are important to us? If not, in what ways does it fall short?
One of the foundational methods for Bayesian model criticism is the posterior predictive check (PPC), an idea that adapts classical goodness-of-fit testing to Bayesian statistics (Guttman, 1967;Rubin, 1984). The central premise of a PPC is that if a model is good then its posterior predictive distribution will capture the true distribution of the data. If the observed data is plausible under this predictive distribution then the model has "passed" the check. Notice that this idea takes a Bayesian approach to modeling and a frequentist approach to checking.
There are several ingredients in a PPC. The first is the diagnostic statistic d A (x). It is a function of observable data that measures the incompatibility between x and model A. As discussed in Section 4.3 of Gelman et al. (1996), the choice of diagnostic should capture the aspects of the model we are interested in checking. For example, a diagnostic which assesses the overall fitness of a model is the χ 2 diagnostic: where E A and Var A are expectation and variance with respect to p A (x rep |x obs ) (i.e. model A). In this paper, we consider such model-dependent diagnostic functions in order to assess whether one model can fool another model. In other contexts, the diagnostic might not explicitly depend on the model.
A second ingredient is the reference distribution. When the model is adequate, the reference distribution is the distribution of the diagnostic d A (x) from which we expect the observed diagnostic was drawn. For their reference distribution, Guttman (1967) and Rubin (1984) use the posterior predictive of the diagnostic p A (d A (x) | x obs ), which is derived from Eq. 3. If model A is a good model then the posterior predictive of the diagnostic will capture the distribution of the observed diagnostic.
The goal of a PPC is to evaluate whether the observed diagnostic d A (x obs ) could have plausibly come from the reference distribution. The final ingredient of a PPC is a measure of surprise, a method to assess whether an observed value was drawn from a reference. One common approach is to use a p-value, a tail probability. A posterior predictive p-value is Here a small p-value indicates a poor model: the observed d A (x obs ) is too surprising under the posterior predictive. Note that a p-value is just one way to locate d(x obs ) in its reference distribution; graphs and other measures of surprise provide good alternatives (Gelman et al., 1996;Gelman, 2004;Bayarri and Castellanos, 2007).
PPCs are an intuitive method for assessing the quality of a Bayesian model, but their statistical properties have also been criticized. The central issue is that the PPC uses the data twice, once to construct the reference p A (x rep | x obs ) and once to provide the point d A (x obs ) to locate within the reference. The consequence is that the PPC might be overconfident about a false model. Bayarri and Berger (2000) and Robins et al. (2000) examine this issue, both theoretically and empirically. They consider the sampling distribution of the p-value as a function of (random) observations x obs from a true likelihood p(x obs | θ * ). A calibrated p-value has a uniform sampling distribution when the data truly come from this model. Calibration is necessary to interpret p-values; if we do not know the distribution of p-values under the null hypothesis, we cannot make a decision on whether the p-value is "surprising" or not. Bayarri and Berger (2000) and Robins et al. (2000) show that Eq. 5 is not calibrated. Bayarri and Berger (2000) also propose alternative reference distributions, called partial posterior predictives, for which the p-values enjoy better calibration. This paper will use an adaptation of their check, which we will refer to as the heldout predictive check. The heldout predictive check divides the observed data into two sets x obs = (x in , x out ), uses x in to form the reference distribution, and locates d A (x out ) within it. The check is This type of check relates closely to predictive checks that rely on cross-validation (Gelfand et al., 1992;Marshall and Spiegelhalter, 2007) and held-out data (Draper, 1996).

Posterior predictive null checks
The spirit of a predictive check is to try to falsify a model. If we find an observed diagnostic in the tail of the reference distribution then we "reject the model," taking a p-value as a measure of the risk of falsely rejecting a plausible model. When the observed diagnostic is not in the tail-when it has "passed the check"-then we have not (yet) falsified the model. With this perspective, Gelman and Shalizi (2012) relate PPCs to the Popperian view of the philosophy of science.
There is an important side of model criticism, however, that a predictive check does not address. Suppose model A is not rejected; it passes its PPC. This result means that d A (x obs ) is plausible under the model-A predictive distribution, and we do not reject model A. But does that mean we should accept it?
To help answer this question, we propose the posterior predictive null check (PPN). Consider a different model B and suppose that it provides the same posterior predictive distribution as model A. This means that data from model B will pass the predictive check for model A, i.e., that data from model B can "fool" the check for model A. In this case, we would conclude that model B captures the data equally well as model A (with respect to the chosen diagnostic). This is exactly what the PPN is designed to test. Simply, the PPN asks whether the two models produce the same posterior predictive distribution of the model-A diagnostic. While a predictive check assesses whether the model is adequate, a PPN helps to assess whether the model is necessary to represent the data. Figure 1 (Left), which shows two-dimensional data from a mixture of three Gaussians. There are clearly three clusters. Figure 1 (right) shows draws from the corresponding posterior predictive for four models, K = {1, 2, 3, 4}. As expected, a 3-mixture provides a good posterior predictive distribution but notice that K = 4 does as well; it simply splits one of the clusters. The predictive checks corroborate this visual insight-both K = 3 and K = 4 pass their check, and the partial predictive p-values (Eq. 6) are 0.42 and 0.45, respectively. (In fact, each of these models passes its check.)

Consider again
A PPN can help assess K = 4 by asking when data from the 3-mixture's posterior predictive can fool the check for the 4-mixture. This question amounts to asking if the distribution of the K = 4 diagnostic d 4 (x rep ) is the same whether x rep is drawn from K = 4 posterior predictive, which is the reference distribution of its predictive check, or the K = 3 posterior predictive. If these two posterior predictive distributions are the same then either one will adequately locate the observed diagnostic d 4 (x obs ) in the K = 4 reference distribution. Consequently, passing the predictive check for K = 4 does not rule out the possibility that the data came from K = 3 (which, in this case, it did).
Definition 1 (Posterior predictive null check; PPN). Consider two models, A and B, and their posterior predictive distributions. Each model involves its own set of latent variables, but defines a distribution on the same observation space x ∈ X , Consider a diagnostic function for model A denoted by d A (·), such as a residual (Eq. 4), and an observed dataset x obs . The posterior predictive null check PPN(d A , p A , p B , x obs ) assesses the similarity between the posterior predictive distributions of the two models under the model-A diagnostic. With a divergence, D, the PPN is: One example is the symmetrized Kullback-Leibler divergence: where . Return to the Gaussian mixture model, and recall that both K = 3 and K = 4 passed their respective predictive checks. We use a PPN to check if data from the simpler mixture (K = 3) can fool the more complex one (K = 4). For the diagnostic d 4 (·) we use the Gaussian mixture model likelihood with K = 4 components (for further details, see Appendix A of the Supplementary Material). To implement the check, we calculate the empirical distributions of d 4 (x This analysis is illustrated in the bottom row of Figure 2; all the panels in the row involve the model K = 4. In the rightmost panel is an illustration of the partial predictive check. The distribution is the posterior predictive of the diagnostic and the red line is the observed diagnostic (from held-out data); the model K = 4 passes its predictive check. The panels to the left illustrate different PPNs, each illustrating the predictive distribution of p(d 4 (x rep )|x obs , K = 4) (blue) and p(d 4 (x rep )|x obs , K = k) for k = 1, 2, 3 (red). Specifically, the leftmost panel is a PPN that checks if data from p(x rep | x obs , K = 1) can fool the check for K = 4; it cannot. The next panel asks the same question for K = 2; again it cannot fool the check. The next panel, however, illustrates the distribution for K = 3; data from p(x rep | x obs , K = 3) will pass the check for K = 4. Thus we cannot distinguish between the two models. In Appendix B of the Supplementary Material, we compare the models with Bayes factors (Jeffreys, 1961;Kass and Raftery, 1995) and obtain a similar conclusion to the PPN study.

The PPN study in a Bayesian workflow
How can we incorporate the PPN study in the workflow of Bayesian data analysis (Gelman et al., 2020)? First consider a set of models that are ordered by their natural complexity. The mixture models of Figure 1 are a good example. One approach to using a PPN is to iteratively increase the complexity of the model class, use a predictive check to check each model, and use a PPN to check whether any of the simpler models can fool the check. Based on the principle of parsimony, one can choose the model that passes its predictive check and for which no simpler model can fool it.
Figure 2 demonstrates this analysis for the mixture model. First note that the predictive check does not help determine which K is necessary. The diagonal plots show how each model passes its predictive check, including the trivial model where K = 1; the reason is that the estimated variance in the log-likelihood is too large to detect an anomaly between the observed and predictive data. The off-diagonal plots demonstrate the value of the PPN study. They show that no simpler model can fool the check for K = 3. However, as we discussed, data from the 3-mixture can fool the check for the 4-mixture. Based on this analysis, the researcher can choose K = 3.
Next, we consider classes of different types of models that are not necessarily nested within each other. We suggest using a predictive check to check each model and then use a PPN to check every pair. This process will result in an equivalence class of models that the data cannot distinguish.
Consider again two models A and B and assume that they both pass their respective predictive check. Now consider two PPNs, one to check if data from model B can fool model A and one to check if data from model A can fool model B. There are three possibilities, • Suppose data from A fools B and data from B fools A. Then these two models are in an equivalence class. Relative to their diagnostics, neither provides information that the other does not. We may use a qualitative criterion to select the model (e.g., parsimony, as we did for mixtures) or hold them both.
• Suppose data from A fools B but data from B does not fool A. In this situation, we choose model A. It provides more information than model B. (If the converse is true, choose model B.) • Suppose data from A does not fool B and data from B does not fool A. Then each model is capturing an aspect of the data that the other does not. Both models are valuable.
Enumerating these scenarios suggests a way to explore the differences between classes of models, particularly those that do not necessarily admit a natural ordering in terms of complexity.
The PPN study is related to a large literature on Bayesian predictive models for model criticism. A thorough review of such methods is provided by Vehtari and Ojanen (2012). In particular, a related method was developed in Gelfand and Ghosh (1998), which proposes to select a model that minimizes the expected error in predicting data from the posterior predictive distribution. The PPN builds on such predictive methods by determining whether the posterior predictive distribution of one model can fool the predictive check for another model. In this way, the PPN provides a notion of model similarity based on predictive distributions. Further, the PPN takes into account the sampling variability in the posterior predictive distribution.

Computing realized diagnostics
The diagnostic d A (x) is a function that quantifies, in some way, how incompatible the data x are to model A. In designing diagnostics, it is often natural and convenient to consider function of the latent variables specified in the model. Gelman et al. (1996) refer to such functions as "realized" because they require a realization of the latent variables. For example, a common realized diagnostic is the negative log likelihood of the data, where θ A are the latent parameters of model A. Large values of this diagnostic mean the data are incompatible with the realization of the latent variable.
When we use a realized diagnostic, we have to decide how to handle the latent variable. One possibility is to remove it from the diagnostic, thereby forming a simple diagnostic from a realized one (Gelman et al., 1996). Examples of such diagnostics include the average and MAP diagnostic: Note these diagnostics can be used in the context of a PPC or a PPN.
Such diagnostics, however, are still computationally expensive. To evaluate each one requires a minimization or posterior inference, and Bayesian model criticism tends to require many evaluations of the diagnostic, one for each replicate of the dataset.
To alleviate this burden, we propose a "validation diagnostic." The validation diagnostic marginalizes over the posterior of the latent parameters given a fixed held-out validation dataset x val , one that is not used in the context of the model check. The validation diagnostic is This diagnostic avoids the computational cost of refitting the model to each replicated dataset.
In practice, we split the data into x obs = {x in , x val }. We use the in-sample data to draw samples from the posterior predictive distribution. The diagnostic is then defined from the validation data. One might ask why not use the same data in both settings. The reason is that this would bias the diagnostic to favor the observed data, mirroring the "double counting" issue of the PPC. Specifically, the PPN with the validation diagnostic assesses the similarity of the distributions p . Note that when we use the PPN in concert with the heldout predictive check in Eq. 6, we instead split the data into three: • x in is in-sample data used to draw from the posterior predictive distribution; • x out is out-of-sample data located within the reference distribution for a heldout predictive check; • x val is data used to calculate the validation diagnostic Eq. 12.
Then, the heldout predictive p-value with the validation diagnostic is: The heldout predictive check with the validation diagnostic is in Algorithm 1. A PPN with the validation diagnostic is in Algorithm 2. Finally, a PPN study, which combines predictive checks and PPNs, is in Algorithm 3.

A simple regression example
As a simple pedagogical example of a PPN, we compare two regression models for which the posterior predictive distributions are known exactly. The first "regression" M A does not include any covariates, while the second M B includes p covariates, Algorithm 1 Algorithm 3: A PPN study output: A collection of PPNs S = ∅; for k = 1, . . . , K do compute the empirical heldout predictive check for M k (Algorithm 1); if M k passes the check then with θ ∈ R and β, x i ∈ R p .
Given observed data (y obs , X obs ), a PPN study helps answer the questions: Do models A and B adequately capture the data? Is model A sufficient to model the data or is the more complex model B required? In detail, the study follows these steps: 1. Split the data into (y obs , X obs ) = {(y in , X in ), (y out , X out ), (y val , X val )}.

Choose a validation diagnostic,
3. Calculate the posterior predictive distributions for both M A and M B given the in-sample data y in : p(y A rep |y in , X in ; M A ) and p(y B rep |y in , X in ; M B ).
4. Calculate heldout predictive p-values for both models, 5. Assuming both models pass their checks, calculate the PPN, which checks if posterior predictive data from model A can "fool" posterior predictive data from model B (under the diagnostic d B ), 6. If the PPN passes, conclude that model A is consistent with the data and that model B is not required.
Suppose model A is true; the covariates are not involved in producing y. To demonstrate the PPN study, we generated 2,000 data points from this model (Eq. 14, θ = 2.5) along with ten (meaningless) covariates. We then ran a PPN study to compare model A and model B; the results are in Figure 3.
We see that both models pass the heldout predictive check, the distributions of d B (y A rep ; y val ) and d B (y B rep ; y val ) are visually very similar, and their symmetric KL is 0.24. From this study, we would correctly conclude that model A is adequate and that the more complex model B (which still passes its check) is not needed.
In this simple situation, the PPN of Eq. 19 also has good theoretical properties. Given that model A is true, we can prove that the distributions of d B (y A rep ; y val ) and d B (y B rep ; y val ) are asymptotically equal; the correct model A can "fool" model B. Proposition 1. Suppose the data (y obs , X obs ) is drawn from model A in Eq. 14; the covariates do not matter. Also assume the covariates satisfy the following condition Then as n → ∞ and p/n → 0, both d B (y A rep ; y val ) and d B (y B rep ; y val ) converge in distribution to 2χ 2 n random variables.
Proof. See Appendix C of the Supplementary Material.
The PPN of the proposition compares the posterior predictive distributions of model A and model B under a model-B diagnostic. It shows that these distributions are equal in the limit as p/n → 0. As for the simulation, when the data is drawn from model A, the PPN detects that model B contains no further information. Note that the condition on the covariates in Eq. 20 may hold in a number of settings. One simple example is when the covariates are distributed as x obs,i ∼ N (0 p , I p ), i = 1, . . . , n. (With conditions, some non-diagonal covariance matrices may also satisfy Eq. 20.) This PPN study required the number of regression coefficients p to be much smaller than the sample size n. However, PPN studies are also appropriate when p n. When p n in model B, the distributions of d B (y A rep ; y val ) and d B (y B rep ; y val ) will be different. This difference is due to model B overfitting the data (under the improper prior). This overfitting will be detected by a heldout predictive check. If there is alternative model that does not overfit, the PPN will detect whether the additional complexity of that model is needed.

Empirical studies
We demonstrate the PPN with several empirical studies.
• Section 3.1: We consider the infant temperament data of Stern et al. (1995), which was also analyzed by Gelman et al. (1996) to illustrate PPCs with realized discrepancies. Following the authors, we fit a multinomial mixture to the data.  To choose the number of components, we use a PPN study; the result validates previous analyses of the data.
• Section 3.2: We consider synthetic data from a linear factor analysis model. We conduct a PPN study to choose between probabilistic PCA and two different deep generative models, fit with a variational autoencoder (VAE, Kingma and Welling, 2014) and skip-VAE (Dieng et al., 2019), respectively. The PPN study correctly suggests that PPCA is adequate to fit the data.
• Section 3.3: We consider synthetic data from a nonlinear factor analysis model. Here, the PPN study correctly suggests that PPCA (which assumes linearity) is not adequate to model the data; nonlinear deep generative models provide better fits. Stern et al. (1995) study infant temperament data, which was also analyzed by Gelman et al. (1996) to illustrate PPCs with realized discrepancies. In the study, two cohorts of infants (n = 169, in total) were scored on the (i) degree of motor activity (scored 1-4) and (ii) crying to stimuli (scored 1-3), both at 4 months, and (iii) the degree of fear to unfamiliar stimuli at 14 months (scored 1-3). Based on these scores, it is hypothesized that infants can be clustered into two groups: inhibited and uninhibited.

Multinomial mixture model
To investigate the two-group hypothesis, we follow Gelman et al. (1996) and consider a multinomial mixture model. To choose the number of mixture components, we use a PPN study. In their analysis, Gelman et al. (1996) noted that "the two-class mixture model provides an adequate fit that does not appear to improve with additional classes." Here the PPN study also suggests the two-class mixture model is sufficient to model the data.
For infant i, denote their scores in each of the three tests as {x i } and their group indicator by z i . Following Stern et al. (1995), we assume that infants in group k will have the same score probabilities, (θ k ), across the three tests. The multinomial mixture model with K groups is: We set α = 2 and α π = 2, following the recommendation of Gelman et al. (1996) to use a "weak but not uniform prior distribution." To draw from the posterior predictive, we use Gibbs sampling.
For both partial predictive checks and PPNs, we use the heldout diagnostic Eq. 12. The underlying diagnostic function is the χ 2 -discrepancy: where We split the data into x = {x in , x out , x val }, where x in is used to draw posterior predictive data, x out is used as the out of sample data in the partial predictive check, and x val is used to define the diagnostic. Specifically, the heldout diagnostic is: which is approximated via Monte Carlo with samples from the posterior p(θ|x val ).
We consider mixture models with K ∈ {1, 2, 3, 4} components. All four models pass their partial predictive checks ( Figure 5); that is, all models generate predictive distributions which are consistent with the observed data (according to the heldout diagnostic). As we cannot eliminate models based on goodness-of-fit, we use a PPN study to determine which models are providing essentially the same predictions.   Figure 5: A PPN study of multinomial mixture models. This study suggests that K = 2 is consistent with the data (no further components are needed). On the diagonal are heldout predictive checks displaying pA(dA(x A rep ; x val )|xin) (blue histogram) and dA(xout; x val ) (red line). All models pass the checks. To the left of the diagonal are PPNs, each one checking if a simpler model can fool the model under study. Specifically, the PPNs compare pB(dA(x B rep ; x val )|xin) (red histogram) and pA(dA(x A rep ; x val )|xin) (blue histogram). The PPN study shows that K = 1 passes its check but it does not fool K = 2. K = 2 passes its check and can fool K = 3 and 4.
The PPN study proceeds as follows.
• Based on visual inspection and the symmetrized Kullback-Leibler distance, the PPN suggests that K = 1 cannot fool K = 2.
• The PPN suggests that K = 2 can fool K = 3.
• The PPN suggests that K = 2 can fool K = 4.
The PPN study suggests K = 2 is adequate for modeling the data; increasing the number of mixture components beyond K = 2 is not justified. This finding corroborates that of Gelman et al. (1996), who made the heuristic choice of K = 2. The posterior predictive draws from the different models have high variability Figure 4, preventing direct visual comparison of predictive distributions at the data level.
We additionally compute the Bayes factors to compare the models. We approximate the marginal likelihood by the harmonic mean of the likelihood values (for details, see Appendix B of the Supplementary Material). The Bayes factors provide inconclusive evidence (

Linear Factor Analysis
When is a nonlinear model required for factor analysis, and when is a linear model adequate? To investigate the capacity of a PPN study to help answer this question, we consider two simulation settings. In one, the data is generated from a linear factor model; in the other, it is generated from a nonlinear factor model.
The observed data is x i ∈ R G , i = 1, . . . , N . We assume that x i has some low dimensional representation z i ∈ R K with for some function f : R K → R G and noise term ε i ∈ R G . We consider three different modeling strategies for estimating this mapping between the latent representation and the observed data: (i) probabilistic principal components analysis (PPCA, Tipping and Bishop, 1999); (ii) a deep generative model, fit with a variational autoencoder (Kingma and Welling, 2014); and (iii) a deep generative model with skip connections, fit with a skip-VAE (Dieng et al., 2019).

Models
Probabilistic PCA (Tipping and Bishop, 1999) assumes that f is a linear mapping from the low-dimensional latent representation to the observed data, The latent variables are assigned a normal prior, z i ∼ N (0, I). Tipping and Bishop (1999) estimated the linear mapping W and representations z i using the EM algorithm.
In some datasets, however, it may be that x i lies on a much lower, nonlinear manifold. In this situation, a linear mapping would require more latent dimensions to represent the underlying structure than a nonlinear method. To accommodate these nonlinearities, one option is to use a multi-layer feedforward neural network µ θ : R K → R G for the mapping from the latent variables z i to the observed data x i . Such a neural network with L ∈ N layers has the form: where p l is the dimension of layer l, b l ∈ R p l are shift vectors, W l ∈ R p l ×p l−1 are weight matrices and a b l is an activation function. The collection of latent variables is denoted by θ = {b l , W l } L+1 l=1 . We can then model the data using this flexible neural network mapping in the following deep generative model (DGM) (Kingma and Welling, 2014;Rezende et al., 2014): where the noise variance is Σ = diag{σ 2 j } G j=1 . The latent variables θ of the neural network are generally estimated via MAP estimation.
An alternative nonlinear mapping is a "skip" or residual neural network µ SKIP θ : Dieng et al., 2019). It includes direct connections to the latent variables z in each hidden layer of the neural network. Specifically, the skip neural network µ SKIP θ has the form: The skip neural network can be used in place of µ θ in the DGM in Eq. 30. We refer to this model as a skip-DGM.
For both the DGM and skip-DGM, posterior inference is intractable. We fit the DGM using a variational autoencoder (VAE, Kingma and Welling, 2014;Rezende et al., 2014), which optimizes an approximation of the regularized likelihood that uses a variational approximation of the posterior of p(z i |x). The variational family is where µ φ : R G → R K , σ 2 φ : R G → R K are neural networks parameterized by φ in a similar way to Eq. 29. The parameters θ and φ are estimated by optimizing the evidence lower bound (ELBO) (Kingma and Welling, 2014;Rezende et al., 2014). Note that the parameters θ and φ are shared across all N samples and corresponding latent variables unlike mean-field variational Bayes (Jordan et al., 1999;Blei et al., 2017), where each sample has a unique variational parameter. This sharing of functional parameters across samples is referred to as amortized variational inference (Gershman and Goodman, 2014).
Similarly, we fit the skip-DGM with a skip-VAE, which includes direct connections to the observed data x in the mappings µ SKIP φ and σ SKIP φ (Dieng et al., 2019).
For all neural networks, we use a 3-layer neural network with 20 neurons in each hidden layer. We use a rectified linear unit (ReLU) activation in each of the hidden layers, given by a b l (z) = max(z + b l , 0). To estimate the neural network parameters which maximize the ELBO, we use stochastic gradient descent with Adam (Kingma and Ba, 2014) with learning rate 1 × 10 −3 .

Synthetic Data
We first consider a linear simulation setting where we would expect PPCA to find an appropriate mapping from the latent space to the observed data, and the DGM and skip-DGM to perform similarly well. We set the number of samples to N = 1000, the number of observed features to G = 10 and the latent dimension as K = 2. The data is generated as where z i ∼ N (0, I), ε i ∼ N (0, σ 2 I) with true σ 2 = 1. (Note however that σ 2 is treated as unknown in the inference stage). The matrix W is the following block matrix, W = 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 .
That is, the first five values of x i are linearly related to the first factor, and the last five values of x i are linearly related to the second factor. We generate three datasets:

Model Checking
For both the PPN study, we use a heldout diagnostic That is, θ A is the maximum a posteriori estimate for model A given the validation data x val . The underlying realized diagnostic, d A (x, θ A ) is the reconstruction loss,  After fitting the models, each of PPCA, DGM and skip-DGM pass their partial predictive checks ( Figure 6). It is unclear which model to choose. The PPN study can help. Consider Figure 6: from both visual inspection and the symmetrized KL divergences (Table 2), the PPN suggests that each model fools every other model. The PPN study concludes that both PPCA and the deep generative models fit the data adequately and can be considered equivalent based on their posterior predictive distributions. If we prefer an interpretable linear model, then we should choose PPCA.
Note we do not consider the Bayes Factors for model comparisons here as they cannot be computed for the deep generative models.

Nonlinear Factor Analysis
Now consider a simulation setting where the true mapping from the factors to the observed data is nonlinear. In this situation, we expect that both the DGM and the skip-DGM will reconstruct the data well while PPCA will require a larger number of latent dimensions to model the nonlinear mapping.
We set the number of samples to N = 1000, the number of observed features to G = 7 and the latent dimension to K = 2. The data is generated from where z i ∼ N (0, I), ε i ∼ N (0, σ 2 I) with true σ 2 = 1. That is, the first three columns of x are related to the first factor; the next three columns are related to the first factor; and the final column is an interaction term between the two factors. Both the DGM and skip-DGM should be able to reconstruct the data using K = 2 dimensions, whereas PPCA would need at least K = 5 latent dimensions to capture the nonlinear terms.
For the DGM and skip-DGM, we use a similar neural network architecture as the previous section, except with 50 neurons in each hidden layer. For PPCA, we consider both a model with the true number of latent dimensions, K = 2, and a model with an overestimate of the latent dimension, K = 5.
We first check each model. PPCA with K = 2 fails the partial predictive check, as expected; two-dimensions are inadequate for a linear model to capture the data. Each of PPCA (K = 5), the DGM and skip-DGM pass their partial predictive check ( Figure 7). To assess which model (or set of models) to proceed with, we use the PPN study. Consider The skip-DGM fools PPCA-5 , but PPCA-5 does not fool the skip-DGM. This result suggests that the skip-DGM captures aspects of the data that PPCA-5 does not. Based on the overlap of the posterior predictive distributions, both the DGM and skip-DGM fool each other, suggesting that they both capture the same aspects of the data.

Discussion
We developed and studied the posterior predictive null check (PPN), an approach to Bayesian model criticism that complements the classical predictive checks. A PPN checks whether data from model B's posterior predictive distribution can pass the predictive check of model A. By studying a space of models with a collection of PPNs, we can understand the relationships between them. Which models capture different aspects of the data?
With mixtures, we demonstrated how a PPN study can help select a model by the principle of parsimony. With probabilistic factor models, we demonstrated how it can help understand relationships between different classes of models. We re-analyzed data from the research literature on Bayesian model criticism, and we studied the calibration properties of the PPN.
Running a PPN study is more computationally expensive than computing predictive checks. This expense is because for M models, a PPN study considers order M 2 model combinations. This computational expense may be mitigated when the models are ordered by complexity. In this case, a PPN study can proceed by comparing only consecutive models (i.e. M k vs. M k+1 ), reducing the number of model combinations to M .
In the modern practice of applied Bayesian statistics, researchers iteratively design and explore many models, a process that was recently dubbed "the Bayesian workflow" (Gelman et al., 2020). By helping the researcher understand the relationships between different models, and particularly so in the context of Bayesian model criticism, a PPN study can help guide the researcher through this process.
All of the PPN studies here, both with real data and simulated data, involve a situation where more than one model passes its predictive check. One interesting direction of future work is to consider a PPN study where no model passes its check, but where we might still be interested in understanding the differences between the models' predictive distributions.
We conduct the partial predictive and PPN checks with the heldout diagnostic (12). The underlying realized diagnostic function is the log-likelihood, given by where γ(x i , θ) is a draw from p(γ|x i , θ). For both the partial predictive and PPN checks, we use R = 200 draws from the posterior predictive distributions.

Appendix B: Comparison to Bayes Factors
To compare two models, M A and M B , the Bayes factor is In the mixture model example, we approximate the marginal likelihood, p(X|M), with the harmonic mean of the likelihood values (Newton and Raftery, 1994): where θ (r) ∼ p(θ|X, M) are draws from the posterior under model M.
For the Gaussian mixture model example, the Bayes factors suggest K = 3, similarly to the PPN (Table 4).
For the multinomial mixture model example, the Bayes factors provide inconclusive evidence (Table 5).
Then, as n → ∞ and p/n → 0, d B (y A ) has an approximately 2χ 2 n -distribution (to first order).
We now consider the distribution of data from M B under the M B diagnostic: where Z i ∼ N (0, 1). As n → ∞, we have: Further, As n → ∞, we have x in,i [X val X val ] −1 x in,i + x in,i [X in X in ] −1 x in,i → 2p/n.
Then, as n → ∞ and p/n → 0, we have d B (y B rep ; y val ) has an approximately 2χ 2 ndistribution (to first order).
Thus, d B (y B rep ; y val ) is asymptotically equal in distribution to d B (y A rep ; y val ) (to first order).