Using prior expansions for prior-data conflict checking

Any Bayesian analysis involves combining information represented through different model components, and when different sources of information are in conflict it is important to detect this. Here we consider checking for prior-data conflict in Bayesian models by expanding the prior used for the analysis into a larger family of priors, and considering a marginal likelihood score statistic for the expansion parameter. Consideration of different expansions can be informative about the nature of any conflict, and extensions to hierarchically specified priors and connections with other approaches to prior-data conflict checking are discussed. Implementation in complex situations is illustrated with two applications. The first concerns testing for the appropriateness of a LASSO penalty in shrinkage estimation of coefficients in linear regression. Our method is compared with a recent suggestion in the literature designed to be powerful against alternatives in the exponential power family, and we use this family as the prior expansion for constructing our check. A second application concerns a problem in quantum state estimation, where a multinomial model is considered with physical constraints on the model parameters. In this example, the usefulness of different prior expansions is demonstrated for obtaining checks which are sensitive to different aspects of the prior.


Introduction
A common approach to checking the likelihood in a statistical analysis is to consider model expansions motivated by thinking about plausible departures from the assumed model. Then using either formal or informal methods for model choice, we can compare the expanded model with the original one to determine whether the original model was good enough. In Bayesian analyses, information from the prior is combined with information in the likelihood, and an additional aspect of checking Bayesian models is to see whether the prior and likelihood information conflict. If the likelihood is inadequate, there will be no value of the model parameter giving a good fit to the data, whereas prior-data conflict occurs when the prior is putting all its mass out in the tails of the likelihood. Checking for prior-data conflict is important, because it is not sensible to combine conflicting sources of information without careful thought, and the influence of the prior only increases with increasing conflict. In any case something of significance has been determined when a prior-data conflict has been shown to exist.
The purpose of this work is to consider model expansion for checking for prior-data conflict, rather than for checking the likelihood. Suppose that $\theta$ is a parameter, $y$ is data, $g(\theta)$ is a prior density for $\theta$, and $p(y|\theta)$ is the sampling model. Write $g(\theta|y) \propto g(\theta)p(y|\theta)$ for the posterior density. Suppose that we have checked the likelihood component of the model and that it is thought to be adequate, so that checking for prior-data conflict is of interest. Checking the likelihood component first is important, since sound inferences cannot result from a poor model no matter what prior is used for $\theta$. Our approach to prior-data conflict checking embeds the original prior into a larger family, which we write as $g(\theta|\gamma)$, where $\gamma$ is an expansion parameter and the original prior is $g(\theta|\gamma_0)$ for some value $\gamma_0$. The corresponding posterior densities will be denoted by $g(\theta|y,\gamma)$. Throughout this work, $\gamma$ will be a scalar; if we embed the prior into a family with more than one additional parameter, we will vary these parameters one at a time. If we integrate out $\theta$ from the likelihood using $g(\theta|\gamma)$, we obtain $p(y|\gamma) = \int g(\theta|\gamma)p(y|\theta)\,d\theta$, and we propose using the score-type statistic
$$S(y) = \frac{d}{d\gamma}\log p(y|\gamma)\Big|_{\gamma=\gamma_0} \tag{1}$$
for checking for prior-data conflict. A p-value for the statistic (1) is computed to provide a calibration of its value by using the prior predictive density as the reference distribution. Let us assume the embedding family is such that a large value of $S(y)$ indicates conflict. Then an appropriate p-value for the check is
$$P\{S(Y) \geq S(y_{\mathrm{obs}})\}, \tag{2}$$
where $Y \sim p(y) = \int g(\theta)p(y|\theta)\,d\theta$ and $y_{\mathrm{obs}}$ is the observed value of $y$. In the next section we will describe a framework for Bayesian model checking that explains some logical requirements that a prior-data conflict check should satisfy, and we discuss why a check based on (1) and (2) satisfies these requirements. We show later that under appropriate regularity conditions (1) has the alternative expression
$$S(y) = \int \frac{d}{d\gamma}\log g(\theta|\gamma)\Big|_{\gamma=\gamma_0}\, g(\theta|y)\,d\theta. \tag{3}$$
Expression (3) gives an intuitive meaning to the test statistic (1), as well as being useful later for computation. We can see that (1) is the posterior expectation of the rate of change of the log prior with respect to the expansion parameter. If there is a conflict, and our posterior distribution is concentrated in the tails of the prior, then the derivative of the log prior with respect to the expansion parameter will be large if the prior is changing in a direction that resolves the conflict when we vary γ around γ 0 . One of the main advantages of the score type check we propose is that by appropriate choices of the prior expansion we can obtain checks of conflict which are sensitive to different aspects of the prior. Other proposals for prior-data conflict checking do not have this feature, which will be illustrated in some of the examples.
In the next section, we discuss prior-data conflict checking and how this differs from checking the likelihood component in a model-based statistical analysis. We also review the existing literature on checking for prior-data conflict. Section 3 discusses hierarchical extensions of our check. Section 4 illustrates implementation of the checks in two complex applications. The first concerns shrinkage estimation of coefficients in linear regression with squared error loss and a LASSO penalty (Tibshirani 1996), which can be thought of equivalently as MAP estimation for a Gaussian linear regression model with a Laplace prior on the coefficients. Griffin and Hoff (2017) have recently considered a test for the appropriateness of the LASSO penalty based on the empirical kurtosis for a point estimate of the coefficients, and where their test is designed to be powerful against alternative priors in an exponential power family. Here we consider the embedding of the Laplace prior into this same family for the construction of our prior-data conflict score test, and show that our method is an attractive one in this example. The second application considered relates to a problem in quantum tomography. Here the model is a multinomial, but with physical constraints on the parameter space. We consider several different prior expansions leading to statistics that are sensitive to different aspects of the prior. Section 5 gives some concluding discussion.

Prior-data conflict checking
In this section we first outline a framework for Bayesian model checking due to Evans and Moshonov (2006) that helps to explain some logical requirements that should be satisfied by any check for prior-data conflict. We then review the divergence-based checks considered in Nott et al. (2016), and their connections with the score-based checks suggested here. We follow this with a review of the wider prior-data conflict checking literature.

Model factorization and Bayesian model checking
Evans and Moshonov (2006) develop a general framework for Bayesian model checking based on a factorization of the joint density for parameters and data in a Bayesian model into different terms to be used for different purposes in a Bayesian analysis. Their factorization extends an earlier similar suggestion due to Box (1980). As in the introduction, let $\theta$ be a parameter, $y$ be data, and $T$ be a minimal sufficient statistic. If no non-trivial minimal sufficient statistic is available, we can consider some asymptotically sufficient statistic such as the maximum likelihood estimator. We again write $p(y|\theta)$ for the sampling model and $g(\theta)$ for the prior, and consider decomposing the joint model as
$$p(\theta, y) = p(t)\,g(\theta|t)\,p(y|t),$$
where all terms are densities with respect to appropriate base measures, $p(t) = \int p(t|\theta)g(\theta)\,d\theta$ denotes the prior predictive density of $T$, $g(\theta|t)$ is the posterior density (since $T$ is sufficient) and $p(y|t)$ is the density of $y$ given $T = t$, which does not depend on $\theta$ due to the sufficiency of $T$. This decomposition can be extended further in the case of hierarchical models or when there exist ancillary statistics, but we do not consider this here. Evans and Moshonov (2006) observe that $g(\theta|t)$ should be used for Bayesian inference about the parameters, and suggest that $p(y|t)$ should be used for checking the sampling model, since $p(y|t)$ does not depend on the prior, which has nothing to do with the correctness of the sampling model. When there are ancillary statistics, they can also be used for checking the sampling model. It is logical to use the remaining term $p(t)$ for checking for prior-data conflict, since any variation in $y$ that does not affect the likelihood is irrelevant to whether the prior and likelihood conflict. So any discrepancy used to test for prior-data conflict should be a function of a sufficient statistic and, to exclude as much irrelevant variation as possible, of a minimal sufficient statistic. Evans and Moshonov (2006) suggested using the prior predictive p-value
$$P\{p(T) \leq p(t_{\mathrm{obs}})\} \tag{4}$$
to check for conflict, where $T \sim p(t)$ is a draw from the prior predictive distribution of the minimal sufficient statistic and $t_{\mathrm{obs}}$ is the observed value. Since $T$ determines the likelihood function, (4) measures how surprising the observed likelihood is, in terms of the minimal sufficient statistic, under the prior. So (4) will be small if the observed value $t_{\mathrm{obs}}$ is in the tails of the prior predictive density for $T$. Conditioning on ancillary statistics when they are available can also be considered in (4), as well as extensions to priors which are elicited hierarchically.

Conflict checks based on relative belief
Evans and Jang (2010) note that (4) is not invariant to the minimal sufficient statistic chosen, and suggest an invariantized version of the check. Nott et al. (2016) considered an alternative check based on prior to posterior Rényi divergences (Rényi 1961) which is also invariant. We describe this approach in some detail since it is closely related to our proposed score checks. For the method of Nott et al. (2016), a conflict p-value is computed as
$$P\{R_\alpha(Y) \geq R_\alpha(y_{\mathrm{obs}})\}, \tag{5}$$
where $Y$ denotes a draw from the prior predictive distribution for $y$, $p(y) = \int g(\theta)p(y|\theta)\,d\theta$, and $R_\alpha(y)$ denotes the prior to posterior Rényi divergence for data $y$,
$$R_\alpha(y) = \frac{1}{\alpha-1}\log\int\left\{\frac{g(\theta|y)}{g(\theta)}\right\}^{\alpha-1} g(\theta|y)\,d\theta,$$
where $\alpha > 0$ and the case $\alpha = 1$ is defined by taking the limit $\alpha \to 1$, which corresponds to the Kullback-Leibler divergence. Nott et al. (2016) note connections between their suggested check and the relative belief framework for inference (Evans 2015; Baskurt and Evans 2013). For a parameter of interest $\psi(\theta)$, the relative belief function for $\psi$ is
$$RB(\psi|y) = \frac{g(\psi|y)}{g(\psi)},$$
where $g(\psi|y)$ is the posterior density for $\psi$, and $g(\psi)$ is the prior density. If the relative belief is larger than 1 at a given $\psi$, there is evidence for that value, whereas if it is less than 1 there is evidence against. We see that $R_\alpha(y_{\mathrm{obs}})$ is a measure of the average evidence in $y_{\mathrm{obs}}$ or, equivalently, of the average change in beliefs from a priori to a posteriori. So (5) is a measure of how much beliefs about $\theta$ have changed from prior to posterior compared with what is expected under the prior, and is hence a measure of how surprising the data are under the prior. It is also noteworthy that relative belief inferences have been shown to possess optimal properties of robustness to the prior, but this robustness decreases with increasing prior-data conflict; see Al Labadi and Evans (2017). The case $\alpha = 2$ gives the posterior mean of the relative belief function, whereas $\alpha \to \infty$ corresponds to the maximum relative belief.
Instead of computing (5) based on the prior and posterior for $\theta$, one can also consider the prior and posterior for a parameter of interest $\psi$. Nott et al. (2016) also discuss hierarchical versions of their check, and this extension is also considered here later. Because the discrepancy $R_\alpha(y)$ depends on the data only through the posterior, it is a function of any minimal sufficient statistic, and it is invariant to the choice of sufficient statistic. There is also a connection between the check (5) and the Jeffreys' prior. Nott et al. (2016) show that the limiting form of the p-value is
$$P\{g(\theta)|I(\theta)|^{-1/2} \leq g(\theta^*)|I(\theta^*)|^{-1/2}\}, \tag{6}$$
where $I(\theta)$ denotes the Fisher information, $\theta^*$ denotes the true parameter and $\theta \sim g(\theta)$. The p-value (6) is a measure of how far out in the tails of the prior density the true parameter is (with the prior expressed as a density with respect to the Jeffreys' prior as support measure). The limiting value is 1 if $g(\theta)$ is taken as the Jeffreys' prior, which gives a heuristic reason why the Jeffreys' prior is non-informative, since asymptotically a conflict is not possible. A similar result holds for the check of Evans and Moshonov (2006), but where the prior density is expressed with respect to the Lebesgue measure as support measure. The hierarchical extension of the checks of Nott et al. (2016) also has some relationship to reference priors (Berger et al. 2009; Ghosh 2011).

Connections between the relative belief and score checks
The score based statistic (1) is closely related to the divergence based check of Nott et al. (2016) for a number of reasons. First, it shares the property with (5) of depending on the data only through the posterior. This follows from the expression (3) for $S(y)$, an expression which is derived from Fisher's identity (see, for example, Cappé et al. (2005), equation (10.12)). Fisher's identity applies when we have some model for data $y$, with latent variables $z$ and a parameter $\eta$. There is a joint model for $(y, z)$ given $\eta$, $p(y, z|\eta)$ say, and $p(y|\eta)$ is obtained by integrating out the latent variables, $p(y|\eta) = \int p(y, z|\eta)\,dz$. Fisher's identity states that under appropriate regularity conditions
$$\nabla_\eta \log p(y|\eta) = \int \{\nabla_\eta \log p(y, z|\eta)\}\, p(z|y, \eta)\,dz.$$
Using this formula and identifying θ with z and γ with η, the expression (3) for S(y) follows, provided that the differentiation under the integral sign required for Fisher's identity is valid. Because the score check depends on the data only through the posterior, the statistic S(y) depends only on the data through the value of a sufficient statistic, and it is invariant to the choice of that statistic. This is desirable for a prior-data conflict check as discussed in Section 2.1. Furthermore, to apply the method it is not required to identify any non-trivial minimal sufficient statistic, since S(y) is computed directly from the posterior distribution.
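As a numerical illustration of the equivalence between (1) and (3), the following minimal sketch compares the two expressions in a conjugate normal model. The model, the helper names (`score_direct`, `score_fisher`) and the parameter values are assumptions chosen only for illustration: $y|\theta \sim N(\theta, \sigma^2)$ with prior $\theta \sim N(\mu_0, \gamma)$, so that the expansion parameter is the prior variance and $p(y|\gamma)$ is $N(\mu_0, \sigma^2 + \gamma)$.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def score_direct(y, mu0, sigma2, gamma, h=1e-5):
    """Expression (1): central finite difference of log p(y | gamma),
    using p(y | gamma) = N(mu0, sigma2 + gamma) for this conjugate model."""
    up = log_normal_pdf(y, mu0, sigma2 + gamma + h)
    down = log_normal_pdf(y, mu0, sigma2 + gamma - h)
    return (up - down) / (2.0 * h)

def score_fisher(y, mu0, sigma2, gamma, n_grid=20001, width=12.0):
    """Expression (3): posterior expectation of d/dgamma log g(theta | gamma),
    approximated by a Riemann sum on a grid centred at the posterior mean."""
    post_var = 1.0 / (1.0 / gamma + 1.0 / sigma2)
    post_mean = post_var * (mu0 / gamma + y / sigma2)
    half = width * math.sqrt(post_var)
    step = 2.0 * half / (n_grid - 1)
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        theta = post_mean - half + i * step
        # Unnormalised posterior weight: prior density times likelihood.
        w = math.exp(log_normal_pdf(theta, mu0, gamma)
                     + log_normal_pdf(y, theta, sigma2))
        # Derivative of the log N(mu0, gamma) prior density with respect to gamma.
        dlogprior = -0.5 / gamma + 0.5 * (theta - mu0) ** 2 / gamma ** 2
        num += w * dlogprior
        den += w
    return num / den

if __name__ == "__main__":
    print(score_direct(2.5, 0.0, 1.0, 2.0))
    print(score_fisher(2.5, 0.0, 1.0, 2.0))
```

The two printed values should agree to several decimal places. The second function never evaluates the marginal likelihood, which is the practical appeal of (3) when only posterior draws or posterior quadrature are available.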
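As well as depending on the data only through the posterior, the score based check is similar to the divergence based one through its connection with relative belief based inference. By rearranging Bayes' rule with the prior $g(\theta|\gamma)$,
$$\log p(y|\gamma) = \log p(y|\theta) - \log\frac{g(\theta|y,\gamma)}{g(\theta|\gamma)}.$$
Hence $S(y)$ is the derivative with respect to the expansion parameter at $\gamma_0$ of the negative log relative belief, evaluated at any $\theta$. Since the right-hand side does not depend on $\theta$, we can also average over any distribution on $\theta$, and choosing $g(\theta|y,\gamma)$ gives
$$\log p(y|\gamma) = \int \log p(y|\theta)\, g(\theta|y,\gamma)\,d\theta - \int \log\frac{g(\theta|y,\gamma)}{g(\theta|\gamma)}\, g(\theta|y,\gamma)\,d\theta,$$
where the second integral is the prior to posterior Kullback-Leibler divergence under the expanded prior. Hence, apart from the contribution of the expected log-likelihood term, the score-based check statistic is the negative of the derivative with respect to $\gamma$ at $\gamma_0$ of the Kullback-Leibler divergence statistic of Nott et al. (2016). A further connection between the two approaches emerges by considering the expansion $g(\theta|\gamma) = (1-\gamma)g(\theta) + \gamma q(\theta)$ for a fixed prior $q(\theta)$, so that $\gamma_0 = 0$. Using (3),
$$S(y) = \int \frac{q(\theta) - g(\theta)}{g(\theta)}\, g(\theta|y)\,d\theta = E\left\{\frac{q(\theta)}{g(\theta)}\,\Big|\,y\right\} - 1.$$
If the Jeffreys' prior is proper, then taking $q(\theta) \propto |I(\theta)|^{1/2}$ to be the Jeffreys' prior, (2) becomes
$$P\left[E\left\{\frac{|I(\theta)|^{1/2}}{g(\theta)}\,\Big|\,Y\right\} \geq E\left\{\frac{|I(\theta)|^{1/2}}{g(\theta)}\,\Big|\,y_{\mathrm{obs}}\right\}\right]$$
for $Y \sim p(y)$, and in the asymptotic limit this is equivalent to the p-value (6) obtained using the divergence based check.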

Other approaches to prior-data conflict checking
There is an extensive existing literature on prior-data conflict checking. One class of approaches involves somehow converting the likelihood and prior information into something comparable, either through renormalization or converting the likelihood to a posterior through a non-informative prior. A recent example of this approach is Presanis et al. (2013), where conflicts are examined locally at any node or group of nodes in a directed acyclic graph. Their work unifies and generalizes a number of previous suggestions (O'Hagan 2003;Dahl et al. 2007;Marshall and Spiegelhalter 2007;Gåsemyr and Natvig 2009). Scheel et al. (2011) consider a related approach where the model is formulated as a chain graph and at a certain node a marginal posterior distribution based on a local prior and lifted likelihood are compared. Bousquet (2008) considers an approach where ratios of prior-to-posterior Kullback-Leibler divergences are calculated, for the prior to be checked and a non-informative prior. Hierarchical extensions are also discussed. Reimherr et al. (2014) consider the difference in information required to be put into a likelihood function to obtain the same posterior uncertainty for a proper prior used in an analysis relative to a non-informative baseline prior that would be used if little prior information were available. Another method similar to those of Evans and Moshonov (2006) and Nott et al. (2016), in not requiring the formulation of any non-informative prior, is described in Dey et al. (1998), where vectors of quantiles of the posterior distribution itself are used in a Monte Carlo test using a prior predictive reference distribution. Bayarri and Castellanos (2007) review and evaluate various methods for checking the second level of hierarchical models, and advocate the partial posterior predictive p-value approach (Bayarri and Berger 2000). 
General discussions of Bayesian model checking which are not specifically concerned with checking for prior-data conflict are given in Gelman et al. (1996), Bayarri and Berger (2000) and Evans (2015). The method we propose here is a useful addition to the above proposals because the use of an appropriate encompassing family of priors for constructing the check gives some guidance for how to construct checks that are sensitive to conflicts in different aspects of the prior; furthermore, it does not rely on the construction of any non-informative prior for its application, which can sometimes be difficult. There is also a growing literature on the question of what to do in a Bayesian analysis if a prior-data conflict is found. We do not discuss this further, but see Evans and Moshonov (2006), Evans and Jang (2011b), Held and Sauter (2017) and Bickel (2018) among others for some recent discussion of this issue.

Hierarchical extension of the score based check
When a prior distribution is elicited hierarchically, it is desirable to check the different parts of the prior as they are elicited to determine the source of the problem if a conflict occurs.
We now describe how to do this with the proposed score based checks. Let $\theta$ be partitioned as $\theta = (\theta_1^T, \theta_2^T)^T$ and suppose the prior has been specified as $g(\theta) = g(\theta_1)g(\theta_2|\theta_1)$. The discussion can be generalized to the case where $\theta$ is partitioned into more than two parts. First, consider an expansion of the form $g(\theta|\gamma^{(1)}) = g(\theta_1)g(\theta_2|\theta_1, \gamma^{(1)})$, where the marginal prior for $\theta_1$ is held fixed but the conditional prior $g(\theta_2|\theta_1)$ is embedded into $g(\theta_2|\theta_1, \gamma^{(1)})$ with $g(\theta_2|\theta_1, \gamma_0^{(1)}) = g(\theta_2|\theta_1)$. Write $p(y|\theta_1, \gamma^{(1)}) = \int g(\theta_2|\theta_1, \gamma^{(1)})p(y|\theta)\,d\theta_2$ and define
$$S^{(1)}(y, \theta_1) = \frac{d}{d\gamma^{(1)}}\log p(y|\theta_1, \gamma^{(1)})\Big|_{\gamma^{(1)}=\gamma_0^{(1)}}$$
and
$$S^{(1)}(y) = \int S^{(1)}(y, \theta_1)\, g(\theta_1|y_{\mathrm{obs}})\,d\theta_1.$$
We propose to check for conflict for the conditional prior $g(\theta_2|\theta_1)$ using the p-value
$$P\{S^{(1)}(Y) \geq S^{(1)}(y_{\mathrm{obs}})\},$$
where $Y \sim m(y) = \int g(\theta_2|\theta_1)g(\theta_1|y_{\mathrm{obs}})p(y|\theta)\,d\theta$. Here it has been assumed again in the calculation of the p-value that the embedding prior family is such that a large value of $S^{(1)}(y)$ indicates conflict. The justification for this check is that in checking $g(\theta_2|\theta_1)$ we should consider an appropriate check for this prior as if $\theta_1$ is fixed (which leads to the statistic $S^{(1)}(y, \theta_1)$), and then eliminate the unknown $\theta_1$ by taking the expectation with respect to $\theta_1$ under the posterior given $y_{\mathrm{obs}}$. So we see if there is a conflict involving $g(\theta_2|\theta_1)$ for $\theta_1$ values that reflect knowledge of $\theta_1$ under $y_{\mathrm{obs}}$. The reference distribution for the check also reflects knowledge of $\theta_1$ under $y_{\mathrm{obs}}$, but uses the conditional prior of $\theta_2$ given $\theta_1$ in generating predictive replicates.
These checks are similar to those considered in Nott et al. (2016) for their divergence based check. As explained there, in models with particular additional structure, the checks above can be modified in various ways. For example, in hierarchical models with observation or cluster specific parameters, cross-validatory versions of the check can be considered, as well as versions of partial posterior predictive checks (Bayarri and Berger 2000) in constructing S (1) (y) and its reference distribution. If there are sufficient or ancillary statistics at different levels this can be exploited also (Evans and Moshonov 2006;Nott et al. 2016).

Simple examples
It is instructive to consider properties of the check (2) first in some simple examples, where calculations can be performed analytically. The examples we discuss were also given in Evans and Moshonov (2006) and Nott et al. (2016), and we make some connections with the checks (4) and (5) that they propose. In the examples below, we use the following notation, which was also used in Nott et al. (2016). If $S_1(y)$ and $S_2(y)$ are two discrepancies for a Bayesian model check, and one is a monotone function of the other (as a function of $y$), we will write $S_1(y) \doteq S_2(y)$. Note that prior predictive checks based on these discrepancies will give the same result, if the appropriate tail probability is calculated.
Example 4.1. Normal location model.
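Let $y_1, \ldots, y_n$ be a random sample, $y_i \sim N(\theta, \sigma^2)$, where $\sigma^2 > 0$ is a known variance and $\theta$ is an unknown mean. The sample mean is sufficient for $\theta$ and normally distributed, so it suffices to consider the case $n = 1$ and this will be assumed in what follows. We write $y_{\mathrm{obs}}$ for the observed value of $y$. Suppose the prior for $\theta$ is normal, $N(\mu_0, \tau_0^2)$, where $\mu_0$ and $\tau_0^2$ are fixed hyperparameters. Next, expand the prior to $N(\mu_0, \tau^2)$, where $\tau^2$ is allowed to vary. Clearly $p(y|\tau^2)$ is a normal density, with mean $\mu_0$ and variance $\sigma^2 + \tau^2$, and hence
$$\frac{d}{d\tau^2}\log p(y|\tau^2)\Big|_{\tau^2=\tau_0^2} = -\frac{1}{2(\sigma^2+\tau_0^2)} + \frac{(y-\mu_0)^2}{2(\sigma^2+\tau_0^2)^2}, \qquad\text{so that}\qquad S(y) \doteq (y-\mu_0)^2.$$
So to calculate the prior predictive p-value for the check we compare $(y_{\mathrm{obs}} - \mu_0)^2$ to its prior predictive distribution. This is the same check obtained by Evans and Moshonov (2006) and Nott et al. (2016) using (4) and (5), and the corresponding p-value is (Evans and Moshonov (2006), p. 897)
$$2\left\{1 - \Phi\left(\frac{|y_{\mathrm{obs}} - \mu_0|}{(\sigma^2+\tau_0^2)^{1/2}}\right)\right\},$$
where $\Phi$ denotes the standard normal distribution function.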
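For the normal location model with $n = 1$, the check can be computed both in closed form and by simulating from the prior predictive distribution. The sketch below does both; the function names and the numerical values used in the usage lines are illustrative assumptions, and the closed form relies only on the fact that $y - \mu_0 \sim N(0, \sigma^2 + \tau_0^2)$ under the prior predictive.

```python
import math
import random

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conflict_pvalue_exact(y_obs, mu0, tau0_sq, sigma2):
    """P{(Y - mu0)^2 >= (y_obs - mu0)^2} with Y ~ N(mu0, sigma2 + tau0_sq)."""
    z = abs(y_obs - mu0) / math.sqrt(sigma2 + tau0_sq)
    return 2.0 * (1.0 - normal_cdf(z))

def conflict_pvalue_mc(y_obs, mu0, tau0_sq, sigma2, n_draws=200_000, seed=1):
    """Monte Carlo version: draw theta from the prior, then y given theta,
    and count how often the replicated discrepancy exceeds the observed one."""
    rng = random.Random(seed)
    s_obs = (y_obs - mu0) ** 2
    exceed = 0
    for _ in range(n_draws):
        theta = rng.gauss(mu0, math.sqrt(tau0_sq))
        y_rep = rng.gauss(theta, math.sqrt(sigma2))
        if (y_rep - mu0) ** 2 >= s_obs:
            exceed += 1
    return exceed / n_draws

if __name__ == "__main__":
    # Hypothetical values: prior N(0, 1), sampling variance 1, observation 3.
    print(conflict_pvalue_exact(3.0, 0.0, 1.0, 1.0))
    print(conflict_pvalue_mc(3.0, 0.0, 1.0, 1.0))
```

The Monte Carlo route is the one that carries over to models where the prior predictive distribution of the discrepancy has no closed form.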
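Taking logs of the marginal likelihood $p(y|\gamma)$ and differentiating with respect to $\gamma$ at $\gamma_0$, we obtain, up to an additive constant not depending on $y$,
$$S(y) = \left(\tfrac{1}{2} - a\right)\{\psi(y+a) - \psi(n+a+b)\} + \left(\tfrac{1}{2} - b\right)\{\psi(n-y+b) - \psi(n+a+b)\},$$
where $\psi(\cdot)$ denotes the digamma function. Using the fact that $\psi(x) = \log x + O(1/x)$ for large $x$, we obtain asymptotically
$$S(y) \doteq \left(\tfrac{1}{2} - a\right)\log\hat\theta_n + \left(\tfrac{1}{2} - b\right)\log(1-\hat\theta_n), \tag{9}$$
where $\hat\theta_n = (y + a)/(n + a + b)$ is the posterior mean of $\theta$ under the prior $g(\theta)$. Equation (9) is equivalent to
$$S(y) \doteq -\log\left\{g(\hat\theta_n)\,|I(\hat\theta_n)|^{-1/2}\right\},$$
where $I(\theta) = n/(\theta(1-\theta))$ is the Fisher information. This matches the asymptotic form of the check (5) considered in Section 4 of Nott et al. (2016). We have already established in Section 3 that an arithmetic mixture involving the Jeffreys' prior would lead to a similar result. On the other hand, if instead of a geometric mixture of the Jeffreys' prior with the Beta$(a, b)$ prior we consider a geometric mixture of the uniform distribution with Beta$(a, b)$, then we obtain, using a similar argument,
$$S(y) \doteq (1 - a)\log\hat\theta_n + (1 - b)\log(1-\hat\theta_n),$$
so that
$$S(y) \doteq -\log g(\hat\theta_n),$$
and this is a discrepancy that is asymptotically equivalent to the check suggested by Evans and Moshonov (2006) (see also Evans and Jang (2011a)). Again, it is easy to see following the argument of Section 3 that an arithmetic mixture involving the uniform will lead to the same result. So for appropriate expansions of the Beta$(a, b)$ prior we can obtain checks asymptotically equivalent to both (4) and (5).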
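The digamma calculation for the Beta$(a, b)$ prior can be checked numerically. The sketch below is written under stated assumptions rather than taken from the paper: a binomial count $y$ out of $n$ trials, and a geometric mixture of the form $g(\theta|\gamma) \propto g(\theta)^{1-\gamma} q(\theta)^{\gamma}$ with $q$ a Beta$(c, c)$ density ($c = 1/2$ for the Jeffreys' prior, $c = 1$ for the uniform), so that $g(\theta|\gamma)$ is Beta$(a + \gamma(c - a), b + \gamma(c - b))$ and $\gamma_0 = 0$. The helper `digamma` is a small standard-library implementation, and all function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0, using psi(x) = psi(x + 1) - 1/x
    to shift x upwards and then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return acc + math.log(x) - 0.5 / x - series

def log_marginal(y, n, a, b, gamma, c):
    """Log beta-binomial marginal likelihood under the expanded prior,
    omitting the binomial coefficient (it does not depend on gamma)."""
    ag = a + gamma * (c - a)
    bg = b + gamma * (c - b)
    return (math.lgamma(y + ag) + math.lgamma(n - y + bg) - math.lgamma(n + ag + bg)
            - math.lgamma(ag) - math.lgamma(bg) + math.lgamma(ag + bg))

def score_exact(y, n, a, b, c):
    """Derivative of log_marginal with respect to gamma at gamma = 0."""
    return ((c - a) * (digamma(y + a) - digamma(a))
            + (c - b) * (digamma(n - y + b) - digamma(b))
            + (2.0 * c - a - b) * (digamma(a + b) - digamma(n + a + b)))

def score_asymptotic(y, n, a, b, c):
    """Data-dependent part of the large-n approximation, in terms of the
    posterior mean theta_hat = (y + a) / (n + a + b)."""
    theta_hat = (y + a) / (n + a + b)
    return (c - a) * math.log(theta_hat) + (c - b) * math.log(1.0 - theta_hat)

if __name__ == "__main__":
    for y in (300, 600):
        print(y, score_exact(y, 1000, 2.0, 3.0, 0.5), score_asymptotic(y, 1000, 2.0, 3.0, 0.5))
```

For large $n$ the exact and approximate values differ by a nearly constant offset as $y$ varies, which is all that matters for a prior predictive tail probability.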
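Consider checking the mean component of the prior, $g(\mu|\sigma^2)$. We expand the prior $g(\mu|\sigma^2)$ to $g(\mu|\sigma^2, \lambda) = N(\mu_0, \sigma^2/\lambda)$, where now $\lambda$ is allowed to vary (i.e. it is no longer fixed at $\lambda_0$). We have
$$y\,|\,\sigma^2, \lambda \sim N\left(\mu_0 1_n,\ \sigma^2(I_n + \lambda^{-1}E_n)\right),$$
where $1_n$ denotes an $n$-vector of ones and $E_n$ denotes an $n \times n$ matrix of ones. Note that
$$\frac{d}{d\lambda}\log|I_n + \lambda^{-1}E_n| = -\lambda^{-2}\,\mathrm{tr}\left\{(I_n + \lambda^{-1}E_n)^{-1}E_n\right\} = -\frac{n}{\lambda(\lambda+n)},$$
where $\mathrm{tr}(\cdot)$ denotes the matrix trace. This gives
$$S^{(1)}(y, \sigma^2) = \frac{n}{2\lambda_0(\lambda_0+n)} - \frac{n^2(\bar{y}-\mu_0)^2}{2\sigma^2(\lambda_0+n)^2}.$$
Noting that the expectation defining $S^{(1)}(y)$ is taken with respect to $g(\sigma^2|y_{\mathrm{obs}})$, which does not depend on $y$, we obtain
$$S^{(1)}(y) \doteq (\bar{y} - \mu_0)^2,$$
which is the Kullback-Leibler based check considered in Nott et al. (2016). Nott et al. (2016) also note that the check is very similar to the one suggested in Evans and Moshonov (2006), p. 909. In this example we could have expanded the prior $g(\mu|\sigma^2)$ by allowing the mean rather than the scale hyperparameter to vary, embedding $g(\mu|\sigma^2)$ into the family $g(\mu|\sigma^2, \mu') = N(\mu', \sigma^2/\lambda_0)$, where $\mu'$ is not necessarily equal to $\mu_0$. If we do this, we obtain $S^{(1)}(y) \doteq (\bar{y} - \mu_0)$, and computation of a two-sided p-value gives that this is equivalent to a check using the statistic $(\bar{y} - \mu_0)^2$, so that the two different ways of expanding the prior lead to the same result in this case. An example where two different embedding families lead to useful and quite different answers is considered later.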

Checking the appropriateness of a LASSO penalty
As a more complex application, a problem discussed in Griffin and Hoff (2017) is now considered. The goal is to assess whether or not a penalty term, used in a penalized regression to induce sparsity, is contradicted by the data. Since the use of the penalty term they consider is equivalent to employing a prior together with MAP estimation, checking the penalty term can be addressed by prior-data conflict checking, and that is how we approach it here.

Example 5.1. Many means problem with LASSO penalty
To start we restrict to the many means context with no predictors, since the analysis is easier and the behavior reflects what happens in the more general situation. Suppose $\bar{x} \sim N(\mu, (\sigma^2/m)I_n)$ is observed with $\mu \in R^n$ and there is a belief that $\mu$ is sparse, namely, many of the means satisfy $\mu_i = 0$. It is also assumed that $\sigma^2 = 1$ is known, as nothing material beyond computational complexity is added to the analysis by placing a prior on this quantity. For the prior on $\mu$, consider a product prior where each $\mu_i$ has density
$$g(\mu_i|\tau, q) = \frac{q}{2\tau}\left\{\frac{\Gamma(3/q)}{\Gamma(1/q)^3}\right\}^{1/2}\exp\left[-\left\{\frac{\Gamma(3/q)}{\Gamma(1/q)}\right\}^{q/2}\left|\frac{\mu_i}{\tau}\right|^{q}\right] \tag{13}$$
for $\mu_i \in R^1$. This is the exponential power family of priors that was considered in Griffin and Hoff (2017), and it can be shown that if $\mu_i$ has prior (13), then $E(\mu_i) = 0$ and $\mathrm{Var}(\mu_i) = \tau^2$. When $q = 2$, the prior is normal, $N(0, \tau^2)$, and when $q = 1$, the prior is a Laplace distribution with scale $\tau/\sqrt{2}$. As $q \to 0$ this family of priors induces greater sparsity. A question of interest is whether or not the Laplace prior obtained when $q = 1$ conflicts with the data, as this corresponds to the popular LASSO penalty (Tibshirani 1996). Griffin and Hoff (2017) effectively compare the observed value of the kurtosis statistic $k(\bar{x})$ with its prior distribution when $q = 1$. Actually they use the prior distribution of the kurtosis of a sample of $n$ from the prior itself as the reference distribution for computational reasons, but we use the more appropriate prior predictive distribution of $k(\bar{x})$ for this comparison. If the observed $k(\bar{x})$ lies in the tails of its prior predictive density, then this is an indication that the double exponential prior is in conflict with the data, and a modification of the prior is needed.
The p-value for the check is $2\min\{P(k(\bar{X}) < k(\bar{x})), P(k(\bar{X}) > k(\bar{x}))\}$, where $P$ is the prior predictive measure. If we decide that a conflict occurs when this p-value is less than 0.05, then the left and right critical values for testing $q = 1$ are (1.65, 6.72) when $n = 10$, and (3.01, 10.07) when $n = 100$. So if $n = 10$ and $k(\bar{x}) < 1.65$ or $k(\bar{x}) > 6.72$, then a prior-data conflict exists.
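With $g(\mu|\bar{x}, \tau, q)$ denoting the conditional posterior for $\mu$ given $\tau$ and $q$, the score function for assessing sensitivity to $q$ is
$$S(\bar{x}|\tau) = \int \frac{\partial}{\partial q}\log g(\mu|\tau, q)\Big|_{q=1}\, g(\mu|\bar{x}, \tau, 1)\,d\mu, \tag{14}$$
namely, the posterior expectation of the derivative of $\log g(\cdot\,|\,\tau, q)$ with respect to $q$ evaluated at $q = 1$. A simple calculation leads to
$$\frac{\partial}{\partial q}\log g(\mu|\tau, q)\Big|_{q=1} = \sum_{i=1}^{n}\left[1 + \tfrac{3}{2}\{\psi(1) - \psi(3)\} + C'(1)\left|\frac{\mu_i}{\tau}\right| + C(1)\left|\frac{\mu_i}{\tau}\right|\log\left|\frac{\mu_i}{\tau}\right|\right], \tag{15}$$
where $C(q) = -\{\Gamma(3/q)/\Gamma(1/q)\}^{q/2}$, $C'(1) = C(1)\{\log\Gamma(3) + \psi(1) - 3\psi(3)\}/2$, and $C(1) = -\Gamma(3)^{1/2}$. As previously, $\psi(x)$ denotes the digamma function. Rather than compute the expectation in (14), this is approximated by $\hat{S}(\bar{x}\,|\,\tau)$, where the estimates $\hat\mu_i = \bar{x}_i$ are substituted into (15). Furthermore, it is assumed hereafter that the elicited value of $\tau$ is $\tau = 1$. In general $\tau$ is chosen such that the effective support of the prior, which can be defined as a central interval containing say 0.99 of the prior probability when $q = 1$, covers all the $\mu_i$ values. As such, any prior-data conflict that arises is due to the choice of $q = 1$ alone. Figure 1 is a plot of the null distribution of $\hat{S}(\bar{x}\,|\,1)$ when $n = 10$ and $m = 20$, which leads to critical values (0.408, 1.117). When $n = 100$ and $m = 20$, the critical values are given by (0.670, 0.898).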
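The plug-in score for $q = 1$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the derivative is taken of the log exponential power density with mean 0 and variance $\tau^2$, the statistic is averaged over coordinates (a normalisation chosen here for readability, which may differ from the one behind the critical values quoted in the text), and the function names and simulation settings are hypothetical. The closed forms $\psi(1) = -\gamma_E$ and $\psi(3) = 3/2 - \gamma_E$, with $\gamma_E$ Euler's constant, avoid the need for a digamma routine.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329
PSI_1 = -EULER_GAMMA          # digamma(1)
PSI_3 = 1.5 - EULER_GAMMA     # digamma(3) = 1 + 1/2 - Euler's constant

C1 = -math.sqrt(math.gamma(3.0))  # C(1) = -Gamma(3)^(1/2)
C1_PRIME = C1 * (math.log(math.gamma(3.0)) + PSI_1 - 3.0 * PSI_3) / 2.0
CONST = 1.0 + 1.5 * (PSI_1 - PSI_3)

def dlogprior_dq(mu, tau=1.0):
    """Derivative with respect to q, at q = 1, of the log exponential power
    density for a single coordinate (u log u is taken as 0 at u = 0)."""
    u = abs(mu / tau)
    u_log_u = u * math.log(u) if u > 0.0 else 0.0
    return CONST + C1_PRIME * u + C1 * u_log_u

def approx_score(xbar, tau=1.0):
    """Plug-in score: substitute xbar_i for mu_i and average over coordinates."""
    return sum(dlogprior_dq(x, tau) for x in xbar) / len(xbar)

def null_critical_values(n, m, tau=1.0, n_rep=2000, seed=0):
    """Simulated 2.5% and 97.5% points of the plug-in score when q = 1:
    mu_i is Laplace with scale tau / sqrt(2), and xbar_i | mu_i ~ N(mu_i, 1/m)."""
    rng = random.Random(seed)
    scale = tau / math.sqrt(2.0)
    stats = []
    for _ in range(n_rep):
        xbar = []
        for _ in range(n):
            mu = rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / scale)
            xbar.append(rng.gauss(mu, math.sqrt(1.0 / m)))
        stats.append(approx_score(xbar, tau))
    stats.sort()
    return stats[int(0.025 * n_rep)], stats[int(0.975 * n_rep) - 1]

if __name__ == "__main__":
    print(null_critical_values(10, 20))
```

A useful sanity check on the derivative is that its expectation under the Laplace prior itself is zero, as it must be for a score function; the plug-in version shifts this slightly because $\bar{x}_i$ carries sampling noise in addition to the prior variation.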
To compare our approach to the method of Griffin and Hoff (2017), the power of the tests was compared. Figure 2 shows plots of the power functions of the kurtosis and the approximate score statistics in different situations. It is seen that the approximate score approach compares quite favorably with the method of Griffin and Hoff (2017).

Example 5.2. Regression with LASSO penalty
We now extend from the many means setting to a regression problem with data $y = X\beta + \sigma z$, where $X \in R^{n\times p}$, $\beta \in R^p$ is unknown, $z \sim N(0, I_n)$ and again $\sigma^2 = 1$ is assumed. Also for simplicity it is assumed that the $\beta_i$ can all be treated equivalently, so there is no intercept term which must be treated differently. The prior on $\beta$ is taken to be a product prior with the same prior (13) placed on each $\beta_i$. In practice, the columns of $X$ can be standardized to each have sum 0 and unit length. With this assumption, the mean of the $i$th coordinate of $y$ is $x_i^T\beta$, where $x_i^T$ is the $i$th row of $X$, and note that $x_i \in [-1, 1]^p$ because of the standardization. As such, a bound on the means $x_i^T\beta$ that holds for all $i$ implies that $\tau$ can be chosen to guarantee that the bounds on the means hold with high prior probability. Accordingly, our concern for prior-data conflict can focus on $q$, and again the case $q = 1$ is considered.
For the kurtosis and approximate score, formulae similar to those in the many means case are obtained, and here the $\beta_i$ are estimated via least squares. When $p > n$, so that $X$ is no longer of full column rank, the Moore-Penrose estimates are used, as these minimize the length of the estimate vector and that seems appropriate when considering sparsity. Griffin and Hoff (2017) used a ridge estimator in the non-full rank case, but this made little difference in the results reported here.
A simulation study was conducted as in Section 2 of Griffin and Hoff (2017). Data were generated from the regression model with $\sigma^2 = \tau^2 = 1$, for $n$ = 25, 50, 100 and 200 and $p$ = 25, 50, 75 and 100. The entries of $X$ were drawn from the standard normal distribution. For $10^3$ independent replicates of $X$ and $\beta$ (drawn from the prior with $q = 1$) the power was estimated for a grid of values for $q$ from 0.1 to 2 in steps of 0.1. The cutoff level in the test to determine the existence of a prior-data conflict was 0.05. The simulation results are given in Figure 3. It is seen that the approximate score does quite well, and in certain cases, namely when $p < n$, can do better than the test based on the kurtosis statistic.
The approximate score does not do as well as the kurtosis statistic when $p > n$. This is felt to be due in part to the simulation performed. When generating $X$ via independent standard normals, the matrix is of rank $n$ with probability 1 when $p \geq n$. So this situation is somewhat like having $n$ observations with $n$ independent variables, and in such a case it is not possible to criticize the model, as it will fit the data perfectly, let alone the prior. In practice, if we wish to check both the prior and likelihood components of the model, and the coefficient vector is known to be sparse, then a preliminary screening of variables (Fan and Lv 2018) can reduce $p$ to $p < n$ before penalized regression is performed, and the score based check would seem to be preferable for checking the appropriateness of the penalty in that case. Such a screening procedure could also be implemented together with data splitting, where the screening and analysis are done using disjoint subsets of the data.

Checking a truncated Dirichlet prior in a constrained multinomial model for quantum state estimation
The data acquired in measurements on quantum systems are fundamentally probabilistic because, as a matter of principle and not for lack of knowledge, one can only predict the probabilities for the various outcomes but not which outcome will be observed for the next quantum system to be measured. Therefore, the interpretation of quantum-experimental data requires the use of statistical tools, where the constraints that identify the set of physically allowed probabilities must be enforced. We shall illustrate prior checking in this context for a simple example, after setting the stage by recalling some basic tenets of quantum theory.
In the formalism of quantum theory, the Hilbert space operators of a D-dimensional quantum system can be represented by D × D matrices; for simplicity, we shall not distinguish between the operators and the matrices that represent them. There is, in particular, the statistical operator ρ that describes the state of the quantum system, which is positive-semidefinite and has unit trace. A measurement with K outcomes is specified by K positive-semidefinite probability operators (also commonly referred to as POVMs) $\Pi_1, \Pi_2, \ldots, \Pi_K$, one for each outcome. When such a measurement is performed on independent and identically prepared systems, we get a click of one of the detectors for each of the measured systems, and the probability that the kth detector will click for the next quantum system is $\theta_k = \mathrm{tr}(\rho\Pi_k)$ ("Born's rule"). The unit sum of the probabilities, $\sum_{k=1}^K \theta_k = 1$, is ensured for any ρ by the unit sum of the $\Pi_k$s, $\sum_{k=1}^K \Pi_k = I_D$, where as previously $I_D$ denotes the D × D identity matrix.
The K-tuplets of probabilities, $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, constitute a convex set in the (K − 1)-simplex, but usually they do not exhaust this simplex (for an example, see Figure 6 below). The permissible probabilities are those consistent with ρ ≥ 0, and the actual constraints obeyed by the $\theta_k$s result from the properties of the $\Pi_k$s. We denote the convex set of permissible θs by Θ.
After measuring N identically prepared copies of the quantum system, thereby counting $n_k$ clicks of the kth detector, we have the data $y = (n_1, n_2, \ldots, n_K)$ with $\sum_{k=1}^K n_k = N$. The problem of inferring the statistical operator ρ from the data y is the central theme of quantum state estimation (Paris and Řeháček 2004; Teo 2015), and Bayesian methods are well-suited for this task (Shang et al. 2013; Li et al. 2016). Enforcing ρ ≥ 0, or the corresponding implied constraints on the probabilities θ, is crucial and often challenging. We shall consider a rather simple example below, with D = 2 and K = 3.
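Born's rule and the unit-sum property can be checked numerically for the qubit case (D = 2, K = 3). The following is a minimal sketch, not the measurement used in the paper: the Bloch-vector form of ρ and the three equally weighted probability operators in `example_trine_povm` are illustrative assumptions, and all function names are hypothetical.

```python
import cmath
import math


def trace_of_product(a, b):
    """Return tr(a b) for two square matrices stored as nested lists."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def qubit_state(s1, s2, s3):
    """Qubit statistical operator in Bloch-vector form (assumed parameterization)."""
    return [[0.5 * (1 + s3), 0.5 * (s1 - 1j * s2)],
            [0.5 * (s1 + 1j * s2), 0.5 * (1 - s3)]]


def example_trine_povm():
    """Hypothetical 3-outcome measurement: weights 1/3, angles 0, 2*pi/3, 4*pi/3."""
    ops = []
    for k in range(3):
        phase = cmath.exp(1j * 2 * math.pi * k / 3)
        # Each operator is positive-semidefinite, and the three sum to the identity.
        ops.append([[1 / 3, phase.conjugate() / 3], [phase / 3, 1 / 3]])
    return ops


def born_probabilities(rho, povm):
    """theta_k = tr(rho Pi_k); the imaginary part vanishes up to rounding."""
    return [trace_of_product(rho, op).real for op in povm]


povm = example_trine_povm()
theta_a = born_probabilities(qubit_state(0.3, -0.4, 0.5), povm)
theta_b = born_probabilities(qubit_state(0.3, -0.4, -0.2), povm)  # only s3 differs
print(theta_a, sum(theta_a))
```

In this sketch the two states differ only in their third Bloch component, so comparing `theta_a` and `theta_b` shows that this particular measurement carries no information about it.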

Two prior expansions
We first consider two different families of expansions of the above prior, which do not attempt to check violations of the physical constraint on θ. Later we consider an expansion suitable for checking the physical constraint in a simple situation (D = 2 and K = 3), and where the formula (3) does not hold without some modification. To minimize notation, in all the prior families considered below the expansion parameter in the prior is denoted by γ, although it should be noted that in different families the interpretation of this parameter differs.
In our first prior expansion, similar to our earlier binomial example, we consider a geometric mixture of the original prior with the Jeffreys prior, which is the Dirichlet prior $\mathrm{Dir}(\tfrac{1}{2}, \tfrac{1}{2}, \ldots, \tfrac{1}{2})$ constrained to θ ∈ Θ. Mixing with the Jeffreys prior thickens the tails of the original prior, and as such this family may be helpful for constructing an overall test of conflict. This idea leads to a family that is still constrained Dirichlet, proportional to $\mathrm{Dir}(\delta_1, \ldots, \delta_K)$ with $\delta_k = (1 - \gamma)\alpha_0 q_k + \tfrac{1}{2}\gamma$. The corresponding density function is denoted by
$$g^{(1)}(\theta|\gamma) \propto \prod_{k=1}^K \theta_k^{(1-\gamma)\alpha_0 q_k + \gamma/2 - 1}, \qquad \theta \in \Theta.$$
The original prior is obtained when γ = 0.
Our second prior family also derives from considering a constrained Dirichlet density, $\mathrm{Dir}(\delta_1, \ldots, \delta_K)$, with $\delta_k = \alpha_0 \tilde{q}_k$, $\tilde{q}_1 = q_1 + \gamma$ and $\tilde{q}_k = q_k - \gamma/(K - 1)$ for k = 2, . . . , K. We see that $\sum_{k=1}^K \tilde{q}_k = 1$ and $\sum_{k=1}^K \delta_k = \alpha_0$, so that the overall precision parameter is kept fixed while the location parameters change: $q_1$ is increased by γ, and the other $q_k$ are adjusted to maintain the unit-sum constraint. This family of priors is constructed to focus particularly on conflicts involving the first component $\theta_1$ of θ, and we obtain the original prior at γ = 0. For this family we write
$$g^{(2)}(\theta|\gamma) \propto \prod_{k=1}^K \theta_k^{\alpha_0 \tilde{q}_k - 1}, \qquad \theta \in \Theta.$$
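The hyperparameters of the two expansions are simple functions of γ. As a small sketch (the function names are hypothetical, and the physical constraint θ ∈ Θ is not represented here), they can be computed as follows for K = 3, q = (1/3, 1/3, 1/3) and $\alpha_0 = 30$.

```python
def g1_hyperparameters(alpha0, q, gamma):
    """Dirichlet parameters of the geometric mixture with the Jeffreys prior."""
    return [(1.0 - gamma) * alpha0 * qk + 0.5 * gamma for qk in q]


def g2_hyperparameters(alpha0, q, gamma):
    """Dirichlet parameters when q_1 is shifted by gamma at fixed precision alpha0."""
    k = len(q)
    shifted = [q[0] + gamma] + [qk - gamma / (k - 1) for qk in q[1:]]
    return [alpha0 * qk for qk in shifted]


q = [1.0 / 3.0] * 3
g1_start = g1_hyperparameters(30.0, q, 0.0)      # original prior
g1_end = g1_hyperparameters(30.0, q, 1.0)        # Jeffreys prior
g2_end = g2_hyperparameters(30.0, q, 1.0 / 3.0)  # mass moved towards theta_1
print(g1_start, g1_end, g2_end)
```

Both families pass through the same prior at γ = 0 but leave it in different directions: the first changes the total precision, the second keeps it fixed at $\alpha_0$.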

Score statistics for the checks
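Consider first the family $g^{(1)}$. Apart from terms not depending on θ, we have
$$\frac{d}{d\gamma} \log g^{(1)}(\theta|\gamma) = \sum_{k=1}^K \left(\tfrac{1}{2} - \alpha_0 q_k\right) \log \theta_k$$
and, upon using (3), the score statistic is, up to an additive constant,
$$\sum_{k=1}^K \left(\tfrac{1}{2} - \alpha_0 q_k\right) E(\log \theta_k \,|\, y),$$
where the expectation is with respect to the posterior for the original prior. On the other hand, for the family $g^{(2)}$, we obtain that, apart from terms not depending on θ,
$$\frac{d}{d\gamma} \log g^{(2)}(\theta|\gamma) = \alpha_0 \left( \log \theta_1 - \frac{1}{K-1} \sum_{k=2}^K \log \theta_k \right)$$
and, using (3) again, the score statistic is, up to an additive constant,
$$\alpha_0 \left( E(\log \theta_1 \,|\, y) - \frac{1}{K-1} \sum_{k=2}^K E(\log \theta_k \,|\, y) \right).$$
We see that in the case of this prior family the components are no longer treated symmetrically in the check, with conflicts involving $\theta_1$ being the focus.
Figure 4: Power of checks based on families $g^{(1)}$ and $g^{(2)}$ when data are simulated under the prior predictive for $g^{(1)}$ (left) and under the prior predictive for $g^{(2)}$ (right). Horizontal axis: γ; vertical axis: power.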
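Both score statistics are linear combinations of the posterior expectations of log θ_k, which are not available in closed form once the Dirichlet posterior is truncated to Θ. The sketch below estimates them by rejection sampling. It rests on assumptions: K = 3 with the ideal-trine constraint $\theta_1^2 + \theta_2^2 + \theta_3^2 \le \tfrac{1}{2}$, scores computed only up to their additive constants, and illustrative counts that are not data from the paper. The function names are hypothetical.

```python
import math
import random


def truncated_posterior_mean_log(rng, post_alpha, n_keep):
    """Monte Carlo estimate of E(log theta_k | y) for a Dirichlet posterior
    truncated to sum(theta_k^2) <= 1/2 (assumed ideal-trine constraint)."""
    sums = [0.0] * len(post_alpha)
    kept = 0
    while kept < n_keep:
        draws = [rng.gammavariate(a, 1.0) for a in post_alpha]
        total = sum(draws)
        theta = [d / total for d in draws]
        if sum(t * t for t in theta) <= 0.5:  # keep only physical points
            for k, t in enumerate(theta):
                sums[k] += math.log(t)
            kept += 1
    return [s / n_keep for s in sums]


def score_g1(mean_log, alpha0, q):
    """Score for the Jeffreys-mixture family, up to an additive constant."""
    return sum((0.5 - alpha0 * qk) * m for qk, m in zip(q, mean_log))


def score_g2(mean_log, alpha0):
    """Score for the location-shift family, up to an additive constant."""
    k = len(mean_log)
    return alpha0 * (mean_log[0] - sum(mean_log[1:]) / (k - 1))


rng = random.Random(1)
alpha0, q = 30.0, [1.0 / 3.0] * 3


def scores(counts):
    # Truncated Dirichlet prior times multinomial likelihood gives a
    # truncated Dirichlet posterior with parameters alpha0 * q_k + n_k.
    post_alpha = [alpha0 * qk + n for qk, n in zip(q, counts)]
    mean_log = truncated_posterior_mean_log(rng, post_alpha, 2000)
    return score_g1(mean_log, alpha0, q), score_g2(mean_log, alpha0)


balanced = scores([10, 10, 10])  # illustrative counts agreeing with the prior
tilted = scores([20, 5, 5])      # illustrative counts favouring outcome 1
print(balanced, tilted)
```

With counts tilted towards the first outcome, the second score moves far more than the first, which is the asymmetry the $g^{(2)}$ family is designed to produce.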

Power comparison for the checks
We consider the case of K = 3 with $q = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$ and $\alpha_0 = 30$. For the family $g^{(1)}$, γ = 0 corresponds to a Dir(10, 10, 10) prior, and γ = 1 to a $\mathrm{Dir}(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2})$ prior. For the $g^{(2)}$ expansion, γ = 0 is a Dir(10, 10, 10) prior, and $\gamma = \tfrac{1}{3}$ is a Dir(20, 5, 5) prior. We examine the power of the checks in two cases, where a p-value smaller than 0.05 is considered to be a conflict. In the first case, we consider simulating data under the prior predictive for the $g^{(1)}$ expansion, for values of γ = i/20, i = 0, . . . , 20. Figure 4 (left) shows how the power of the two checks varies with γ, and we note that the check based on $g^{(1)}$ is more powerful when the prior predictive for $g^{(1)}$ is used for simulating the data. The power is approximated at each value of γ based on 500 simulations. Next, we consider simulating data under the prior predictive for $g^{(2)}$, for values of γ = i/60, i = 0, . . . , 20. Again 500 simulations are performed at each γ to compare power for the two checks, and Figure 4 (right) shows again that it is the check corresponding to the family used to generate the data that is more powerful. The different families are powerful against different kinds of conflict with the original prior, and the expansion of the prior used can be constructed with this in mind.
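The structure of such a power calculation can be sketched in a simplified setting. The code below is not the simulation reported in Figure 4. As loudly stated assumptions, it ignores the physical constraint (so the posterior is an ordinary Dirichlet and E(log θ_k | y) is a difference of digamma functions), it uses a hypothetical sample size N = 30, and it treats large values of the $g^{(1)}$ score as evidence of conflict, calibrating the 0.05 cutoff by simulation under the original prior predictive.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def draw_counts(rng, alpha, n_total):
    """Draw theta ~ Dir(alpha), then multinomial counts (prior predictive)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    theta = [d / total for d in draws]
    counts = [0] * len(alpha)
    for k in rng.choices(range(len(alpha)), weights=theta, k=n_total):
        counts[k] += 1
    return counts


def score_g1(counts, alpha):
    """g1 score up to a constant, using the unconstrained Dirichlet posterior."""
    post = [a + n for a, n in zip(alpha, counts)]
    psi_total = digamma(sum(post))
    return sum((0.5 - a) * (digamma(p) - psi_total) for a, p in zip(alpha, post))


def power(rng, alpha, gamma, n_total, threshold, reps):
    """Fraction of data sets, simulated under the expanded prior, that exceed the cutoff."""
    expanded = [(1.0 - gamma) * a + 0.5 * gamma for a in alpha]
    hits = sum(score_g1(draw_counts(rng, expanded, n_total), alpha) > threshold
               for _ in range(reps))
    return hits / reps


rng = random.Random(0)
alpha = [10.0, 10.0, 10.0]
N_TOTAL = 30  # hypothetical sample size, not taken from the paper
null_scores = sorted(score_g1(draw_counts(rng, alpha, N_TOTAL), alpha) for _ in range(2000))
threshold = null_scores[int(0.95 * len(null_scores))]  # upper 5% point under gamma = 0
power_curve = {g: power(rng, alpha, g, N_TOTAL, threshold, 500) for g in (0.0, 0.5, 1.0)}
print(power_curve)
```

At γ = 0 the rejection rate should sit near the nominal level, and it should rise as the data-generating prior moves towards the Jeffreys prior; the constrained case would replace the digamma expression with a Monte Carlo estimate over Θ.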

A simple quantum measurement scenario
The simplest genuine quantum system is that of a binary alternative (D = 2), a qubit. Here, we have 2 × 2 matrices for all operators, and the usual parameterization of the statistical operator is
$$\rho = \frac{1}{2} \begin{pmatrix} 1 + s_3 & s_1 - i s_2 \\ s_1 + i s_2 & 1 - s_3 \end{pmatrix}.$$
If we regard, as we shall, the three real parameters $s_1, s_2, s_3$ as Cartesian coordinates of a point, then there is a one-to-one correspondence between the quantum states of a qubit and the three-dimensional unit ball. Two-outcome measurements on qubits realize the situation of coin tossing and do not exhibit features particular to quantum physics. We shall, therefore, consider 3-outcome measurements (K = 3), for which we choose the $\Pi_k$s in accordance with (16). The corresponding probabilities do not involve $s_3$, so that no information about this state parameter is gained from such a measurement, and the three $\theta_k$s are restricted by $s_1^2 + s_2^2 \le 1$. The relevant parameter space is now the unit disk in the $s_1, s_2$ plane, the intersection of this plane with the unit ball.
The angles $\varphi_1, \varphi_2, \varphi_3$ divide the unit disk into three pie slices; see Figure 5. The condition $\Pi_1 + \Pi_2 + \Pi_3 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ or, equivalently, $\theta_1 + \theta_2 + \theta_3 = 1$ for all $s_1, s_2$ determines the weights $w_k$, and they will be positive if each slice is less than half of the pie; we take this for granted. The symmetric case of three pie slices of equal size is that of the so-called trine measurement. Upon choosing $\varphi_1 = 0$ by convention, we then have $\varphi_2 = \tfrac{2}{3}\pi$ and $\varphi_3 = \tfrac{4}{3}\pi$, and the probabilities of the trine measurement are those in (17), which are constrained by (18). In view of $\theta_1 + \theta_2 + \theta_3 = 1$, this constraint can be equivalently, and more symmetrically, written as $\theta_1^2 + \theta_2^2 + \theta_3^2 \le \tfrac{1}{2}$.
Figure 5: The angles $\varphi_1, \varphi_2, \varphi_3$ in the probabilities (16) slice the circular unit-disk pie into three pieces, each slice smaller than half of the pie.
For $\varphi_1 = 0$, $\varphi_2 = \pi - \phi$, $\varphi_3 = \pi + \phi$ we get a symmetrically distorted trine, for which the probabilities are those in (19), with $\cos\gamma = \tan(\tfrac{1}{2}\phi)$, and the analog of (18) is (20), where we note that the set of permissible θs depends on the value of γ. [Note: Later this distortion parameter γ will play the role of the generic expansion parameter γ.] We recover the ideal trine for $\phi = \tfrac{1}{3}\pi$ and $(\cos\gamma)^2 = \tfrac{1}{3}$, and the limiting cases of $\phi = \tfrac{1}{2}\pi$ and $\phi = 0$ yield degenerate 2-outcome measurements of no further interest. In the recent experiment by Len et al. (2018), different symmetrically distorted trines were realized (for measuring the polarization qubit of a photon), among them $(\cos\gamma)^2 = 0.1327$, for which $y = (n_1, n_2, n_3) = (180, 31, 30)$ were the counts of detection events. While the actual counts in the experiments were about ten times as many, namely (1802, 315, 303) as communicated by author Y. L. Len, we are using these smaller numbers here because prior-data conflicts are less of an issue in data-dominated situations. However, even if posterior inferences are insensitive to a prior-data conflict in large data settings, it is still of interest to detect the conflict, since this indicates a lack of scientific understanding in setting up the model.
The probability space for the symmetrically distorted trine has a simple geometry, illustrated in Figure 6. The unit disk in the $s_1, s_2$ plane accounts for all $s_1, s_2$ pairs for which the probabilities in (19) obey the constraint in (20); that is, the unit disk represents the set Θ of permissible probabilities. The $s_1, s_2$ pairs for which one of the probabilities in (19) has a chosen value mark a line in the $s_1, s_2$ plane, and different fixed values for the same $\theta_k$ yield a set of parallel lines. The particular three lines with $\theta_1 = 0$ or $\theta_2 = 0$ or $\theta_3 = 0$ are tangential to the unit circle and intersect where either $(\theta_1, \theta_2, \theta_3) = (1, 0, 0)$ or (0, 1, 0) or (0, 0, 1); the triangle thus defined is the 2-simplex for the probabilities in (19). For the ideal trine, it is an equilateral triangle; for the symmetrically distorted trine, we have an isosceles triangle with vertices at $(s_1, s_2) = (2(\sin\gamma)^{-2} - 1, 0)$ and $(-1, \mp(\cos\gamma)^{-1})$. For the general case of (16), there is an analogous construction with a triangle with no particular symmetry for the 2-simplex. Among all these triangles, the equilateral triangle for the ideal trine has the smallest area.
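The geometry described here can be checked with a few lines of arithmetic. The sketch below assumes that each probability has the form $\theta_k = w_k(1 + s_1\cos\varphi_k \pm s_2\sin\varphi_k)$, so that the unit-sum condition forces $\sum_k w_k = 1$ and $\sum_k w_k e^{i\varphi_k} = 0$; this form is an assumption made for illustration, and the results used below do not depend on the sign convention for $s_2$. The function names are hypothetical.

```python
import math


def distorted_trine_weights(cos2_gamma):
    """Angles and weights for phi_1 = 0, phi_2 = pi - phi, phi_3 = pi + phi,
    where tan(phi / 2) = cos(gamma); the weights follow from the unit-sum condition."""
    phi = 2.0 * math.atan(math.sqrt(cos2_gamma))
    angles = [0.0, math.pi - phi, math.pi + phi]
    w_side = 1.0 / (2.0 * (1.0 + math.cos(phi)))  # w_2 = w_3 by symmetry
    return angles, [1.0 - 2.0 * w_side, w_side, w_side]


def triangle_vertices(cos2_gamma):
    """Vertices of the 2-simplex in the (s1, s2) plane."""
    _, weights = distorted_trine_weights(cos2_gamma)
    right = (1.0 / weights[0] - 1.0, 0.0)       # where theta_1 = w_1 (1 + s1) = 1
    half_height = 1.0 / math.sqrt(cos2_gamma)   # left vertices at s1 = -1
    return right, (-1.0, half_height), (-1.0, -half_height)


def triangle_area(cos2_gamma):
    right, upper, lower = triangle_vertices(cos2_gamma)
    return 0.5 * (upper[1] - lower[1]) * (right[0] - upper[0])


angles, weights = distorted_trine_weights(0.1327)
closure = (sum(w * math.cos(a) for w, a in zip(weights, angles)),
           sum(w * math.sin(a) for w, a in zip(weights, angles)))
areas = {c2: triangle_area(c2) for c2 in (1 / 3 - 1 / 12, 1 / 3, 1 / 3 + 1 / 12)}
print(weights, closure, areas)
```

The three values of $(\cos\gamma)^2$ in `areas` are those of the green, blue and red triangles of Figure 6, so the comparison illustrates the minimal-area property of the equilateral triangle.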

A prior family for checking the physical constraints
We now consider the symmetrically distorted trine with its γ-dependent set of permissible θs, in accordance with (20). If it is suspected that the trine measurement set-up was not properly balanced, this gives a natural family of priors for performing our check.
In such a situation where the support for the prior changes with γ, (3) can no longer be used without modification.
Figure 6: For the probabilities $(\theta_1, \theta_2, \theta_3)$ of the symmetrically distorted trine measurement in (19), the physically allowed values correspond to the points on the unit disk (on and inside the black unit circle, the blue circle in Fig. 5) while the probability 2-simplex is a triangle whose sides touch the unit circle. The probabilities associated with the top left and bottom left vertices of the triangle are $(\theta_1, \theta_2, \theta_3) = (0, 0, 1)$ and (0, 1, 0), respectively, and the vertex on the right has (1, 0, 0). The blue equilateral triangle is for the ideal-trine probabilities in (17) when $(\cos\gamma)^2 = \tfrac{1}{3}$, the green triangle is for $(\cos\gamma)^2 = \tfrac{1}{3} - \tfrac{1}{12}$, and the red triangle is for $(\cos\gamma)^2 = \tfrac{1}{3} + \tfrac{1}{12}$. The dashed blue lines show where $\theta_3 = 0, 0.2, 0.4, 0.6, 0.8, 1$ for the blue triangle.
The changing support makes $\frac{d}{d\gamma} p(y|\gamma)$ inconvenient to evaluate numerically. To deal with this, we switch from integrating over θ to integrating over $s_1$ and $s_2$ and use polar coordinates in the $s_1, s_2$ plane, $s_1 = r\cos\varphi$, $s_2 = r\sin\varphi$ with $0 \le r \le 1$ and $0 \le \varphi \le 2\pi$, to enforce the constraints (19) for any value of γ. With this, we now have a situation where the sampling model itself changes with γ:
$$p(y|r, \varphi, \gamma) = \binom{N}{n_1, n_2, n_3}\, \theta_1^{n_1} \theta_2^{n_2} \theta_3^{n_3}, \qquad \theta_k \text{ from (19) with } s_1 + i s_2 = r e^{i\varphi}.$$
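Dirichlet prior over the physical space
Consider the case of a Dirichlet prior over the physical space, i.e., $g(\theta|\gamma) \propto \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1} I_\gamma(\theta)$, where $I_\gamma(\theta)$ is the indicator function of the set of θs permitted by (20) for the given γ. Under the r, φ parameterization, we then have a prior density proportional to $r\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1}$ on $0 \le r \le 1$, $0 \le \varphi \le 2\pi$, with the $\theta_k$ from (19) and a normalizing constant that depends on γ.
Figure 7: Probability of detecting a conflict at a p-value threshold 0.05 for data simulated under the prior predictive for different prior hyperparameter γ, for the case where $(\cos\gamma_0)^2 =$
Suppose a prior which is flat over the symmetric trine is chosen ($(\cos\gamma)^2 = \tfrac{1}{3}$). Then the score-based conflict check gives a p-value of 0.00004, indicating a conflict. If instead the correct γ is chosen, the same test yields a p-value of 0.56.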
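The polar-coordinate integration can be illustrated by computing the marginal likelihood of the counts (180, 31, 30) under a flat prior on the physical space, once for the ideal trine and once for $(\cos\gamma)^2 = 0.1327$. This is a sketch under assumptions and not the score-based check itself: the distorted-trine probabilities are taken to be $\theta_1 = \tfrac{1}{2}(\sin\gamma)^2(1 + s_1)$ and $\theta_{2,3} = \tfrac{1}{4}[1 + (\cos\gamma)^2 - s_1(\sin\gamma)^2 \pm 2 s_2\cos\gamma]$, which follows from the unit-sum condition for an assumed form of the probability operators, the integral uses a plain midpoint rule, and the multinomial coefficient (common to both cases) is omitted.

```python
import math


def distorted_trine_probs(r, phi, cos2_gamma):
    """Outcome probabilities at the point (r cos phi, r sin phi) of the unit disk
    (assumed form; see the lead-in)."""
    s1, s2 = r * math.cos(phi), r * math.sin(phi)
    sin2_gamma = 1.0 - cos2_gamma
    cos_gamma = math.sqrt(cos2_gamma)
    base = 0.25 * ((1.0 + cos2_gamma) - s1 * sin2_gamma)
    return (0.5 * sin2_gamma * (1.0 + s1),
            base + 0.5 * cos_gamma * s2,
            base - 0.5 * cos_gamma * s2)


def log_marginal_flat_prior(counts, cos2_gamma, n_r=200, n_phi=400):
    """log of the integral of theta^n over the unit disk with density r / pi,
    evaluated by a midpoint rule in polar coordinates."""
    dr, dphi = 1.0 / n_r, 2.0 * math.pi / n_phi
    log_terms = []
    for i in range(n_r):
        r = (i + 0.5) * dr  # midpoints stay strictly inside the disk
        for j in range(n_phi):
            theta = distorted_trine_probs(r, (j + 0.5) * dphi, cos2_gamma)
            log_lik = sum(n * math.log(t) for n, t in zip(counts, theta))
            log_terms.append(log_lik + math.log(r))  # r is the polar Jacobian
    peak = max(log_terms)  # log-sum-exp for numerical stability
    total = sum(math.exp(v - peak) for v in log_terms)
    return peak + math.log(total * dr * dphi / math.pi)


counts = (180, 31, 30)
log_m_trine = log_marginal_flat_prior(counts, 1.0 / 3.0)
log_m_distorted = log_marginal_flat_prior(counts, 0.1327)
print(log_m_trine, log_m_distorted)
```

Because the integration domain is always the unit disk, the same grid serves every value of γ, which is the practical benefit of the reparameterization. In this sketch the data are better supported by the distorted geometry, in line with the direction of the two p-values quoted above.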

Discussion
We have considered a new approach to constructing prior-data conflict checks based on embedding the prior used for the analysis into a larger family and then considering a marginal likelihood score statistic for the expansion parameter. The main advantage of this technique is that through the choice of the prior expansion we can construct checks which are sensitive to different aspects of the prior.
There are a number of ways in which our work could be extended. In Section 4, we considered checking for the appropriateness of the LASSO penalty in penalized regression, but it would be interesting also to check other commonly used sparse signal shrinkage priors. For example, the generalized Beta mixture of Gaussians family of Armagan et al. (2011) would provide a suitable prior expansion for checking the horseshoe prior (Carvalho et al. 2009; Carvalho et al. 2010) in our framework. It would also be interesting to use the score-based approach for checking priors on hyperparameters in nonparametric models like Gaussian processes. In the nonparametric setting, priors can be crucial for limiting flexibility and avoiding overfitting, but it can also be difficult to understand the predictive implications of an informative prior, which makes checking the prior important.