Power-Expected-Posterior Priors as Mixtures of g-Priors

One of the main approaches used to construct prior distributions for objective Bayes methods is the concept of random imaginary observations. Under this setup, the expected-posterior prior (EPP) offers several advantages: it has a simple and natural interpretation, and it provides an effective way to establish compatibility of priors across models. In this paper, we study the power-expected-posterior (PEP) prior as a generalization of the EPP in objective Bayesian model selection under normal linear models. We prove that it can be represented as a mixture of $g$-priors, like a wide range of prior distributions under normal linear models, and thus posterior distributions and Bayes factors are derived in closed form, retaining computational tractability. Comparisons with other mixtures of $g$-priors are made, and emphasis is given to the posterior distribution of $g$ and its effect on Bayesian model selection and model averaging.


Introduction
Let $\mathbf{y} = (y_1, \dots, y_n)^T$ denote the available observations. Under the objective Bayesian perspective, suppose we wish to compare the following two models (or hypotheses):
\[
M_\ell : \mathbf{y} \sim f_\ell(\mathbf{y} \,|\, \boldsymbol{\theta}_\ell), \qquad \ell \in \{0, 1\},
\]
where $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ are unknown, model-specific, parameters. Let us further suppose that $M_0$ is nested in $M_1$. By $\pi^N_\ell(\boldsymbol{\theta}_\ell)$, for $\ell \in \{0, 1\}$, we denote the baseline prior of $\boldsymbol{\theta}_\ell$ under model $M_\ell$. Here, as a baseline prior we consider any prior that expresses low information, for example the reference prior; see for example Berger, Bernardo & Sun (2009). Such baseline priors are typically improper, resulting in indeterminacy of the Bayes factor when comparing $M_0$ to $M_1$.
In order to specify the normalizing constants involved in the Bayes factor when using improper priors, Pérez & Berger (2002) developed priors through the device of "imaginary training samples". If we denote by $\mathbf{y}^*$ the imaginary training sample, of size $n^*$, they defined the expected-posterior prior (EPP) for the parameter vector $\boldsymbol{\theta}_\ell$ of model $M_\ell$ as
\[
\pi^{EPP}_\ell(\boldsymbol{\theta}_\ell) = \int \pi^{N}_\ell(\boldsymbol{\theta}_\ell \,|\, \mathbf{y}^*)\, m^*(\mathbf{y}^*)\, d\mathbf{y}^*,
\]
where $\pi^{N}_\ell(\boldsymbol{\theta}_\ell \,|\, \mathbf{y}^*)$ is the posterior of $\boldsymbol{\theta}_\ell$ under model $M_\ell$ using the baseline prior $\pi^{N}_\ell(\boldsymbol{\theta}_\ell)$ and data $\mathbf{y}^*$. A usual choice of $m^*$ is $m^*(\mathbf{y}^*) = m^{N}_0(\mathbf{y}^*) \equiv f(\mathbf{y}^* \,|\, M_0)$, i.e. the marginal likelihood, evaluated at $\mathbf{y}^*$, of the simplest model $M_0$ under the baseline prior $\pi^{N}_0(\boldsymbol{\theta}_0)$. Model $M_0$ is then called the reference model. The EPP offers several advantages, among which are its natural interpretation and the fact that it provides an effective way to establish compatibility of priors among models (Consonni & Veronese 2008).
When information on covariates is also available, under the EPP methodology imaginary design matrices $X^*$ with $n^*$ rows should also be introduced. The use of a minimal training sample size $n^*$ has been proposed by Berger & Pericchi (2004), so as to make the information content of the prior as small as possible; this is an appealing idea.
Then $X^*$ can be extracted from the original design matrix $X$ by randomly selecting $n^*$ of its $n$ rows.
To diminish the effect of the training samples, Fouskakis, Ntzoufras & Draper (2015) generalized the EPP approach by introducing the power-expected-posterior (PEP) prior, combining ideas from the power-prior approach of Ibrahim & Chen (2000) and the unit-information-prior approach of Kass & Wasserman (1995). As a first step, the likelihoods involved in the EPP distribution are raised to the power $1/\delta$ ($\delta \geq 1$) and are then density-normalized. The power parameter $\delta$ can then be set equal to the size of the training sample, $n^*$, so that the prior carries information equal to one data point. In Fouskakis et al. (2015) the authors further set $n^* = n$; this choice gives rise to significant advantages: for example, when covariates are available it results in the automatic choice $X^* = X$, and therefore the selection of a training sample and its effect on the posterior model comparison is avoided, while the prior information content is still held at one data point.
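Schematically, and following the description above (with $m^{N}_0(\cdot \,|\, \delta)$ denoting the prior predictive distribution of the reference model under the density-normalized power likelihood), the PEP prior can be written as
\[
\pi^{PEP}_\ell(\boldsymbol{\theta}_\ell \,|\, \delta) = \int \pi^{N}_\ell(\boldsymbol{\theta}_\ell \,|\, \mathbf{y}^*, \delta)\, m^{N}_0(\mathbf{y}^* \,|\, \delta)\, d\mathbf{y}^*,
\qquad
f(\mathbf{y}^* \,|\, \boldsymbol{\theta}_\ell, \delta) = \frac{f(\mathbf{y}^* \,|\, \boldsymbol{\theta}_\ell)^{1/\delta}}{\int f(\mathbf{y}^* \,|\, \boldsymbol{\theta}_\ell)^{1/\delta}\, d\mathbf{y}^*},
\]
where $\pi^{N}_\ell(\boldsymbol{\theta}_\ell \,|\, \mathbf{y}^*, \delta) \propto f(\mathbf{y}^* \,|\, \boldsymbol{\theta}_\ell, \delta)\, \pi^{N}_\ell(\boldsymbol{\theta}_\ell)$. This is only a schematic summary of the construction just described; for the exact notation and details we refer to Fouskakis et al. (2015).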

PEP Prior as a Mixture of Normal Distributions
Let $\mathbf{y} = (y_1, \dots, y_n)^T$ be a random sample. We would like to compare the nested normal linear models
\[
M_\ell : \mathbf{y} \,|\, X_\ell, \boldsymbol{\beta}_\ell, \sigma_\ell \sim N_n\!\left(X_\ell \boldsymbol{\beta}_\ell,\, \sigma^2_\ell I_n\right),
\qquad
\pi^{N}_\ell(\boldsymbol{\beta}_\ell, \sigma_\ell) = c_\ell\, \pi^{U}_\ell(\boldsymbol{\beta}_\ell, \sigma_\ell) = c_\ell\, \sigma_\ell^{-(d_\ell + 1)}, \qquad \ell \in \{0, 1\},
\]
where $c_0$ and $c_1$ are the unknown normalizing constants of $\pi^{U}_0(\boldsymbol{\beta}_0, \sigma_0)$ and $\pi^{U}_1(\boldsymbol{\beta}_1, \sigma_1)$ respectively, $X_0$ is the $(n \times k_0)$ design matrix under model $M_0$ and $X_1$ is the $(n \times k_1)$ design matrix under model $M_1$. All matrices are assumed to be of full rank. Usual choices for $d_0$ and $d_1$ are $d_0 = d_1 = 0$ (resulting in the reference prior) or $d_0 = k_0$ and $d_1 = k_1$ (resulting in the dependence Jeffreys prior).
In the above comparison we assume that model $M_0$ is nested in model $M_1$, so that $k_0 < k_1$, and we henceforth write $\boldsymbol{\beta}_1 = (\boldsymbol{\beta}_0^T, \boldsymbol{\beta}_e^T)^T$, so that $\boldsymbol{\beta}_0$ is a parameter "common" to the two models, while $\boldsymbol{\beta}_e$ is model specific. The use of a "common" parameter $\boldsymbol{\beta}_0$ in nested model comparisons is often made to justify the use of the same, potentially improper, prior on $\boldsymbol{\beta}_0$ across models. This usage is becoming standard; see for example Bayarri, Berger, Forte & García-Donato (2012) and Consonni, Fouskakis, Liseo & Ntzoufras (2018). It can be justified if, without essential loss of generality, we assume that the model has been parametrized in an orthogonal fashion, so that $X_0^T X_e = 0$. In the special case where $M_0$ is the "null" model, containing only the intercept, this assumption can be justified if we assume, again without loss of generality, that the columns of the design matrix of the full model have been centred on their corresponding means; this makes the covariates orthogonal to the intercept and gives the intercept an interpretation that is "common" to all models. Regarding the error variance, although it is also standard to treat it as a "common" parameter across models, in this paper we follow the "intrinsic prior methodology" (see for example Moreno & Girón (2008)) and treat it as a model-specific parameter. As we will see later in this Section, this causes no issues regarding the indeterminacy of Bayes factors, due to the "intrinsification".
The sufficient statistic for $(\boldsymbol{\beta}_1, \sigma^2_1)$ under $M_1$ is $T_1 = (T_{11}, T_{12}) = (\widehat{\boldsymbol{\beta}}_1, RSS_1)$, where $\widehat{\boldsymbol{\beta}}_1$ denotes the maximum likelihood estimate of $\boldsymbol{\beta}_1$ and $RSS_1$ the residual sum of squares under model $M_1$. To simplify notation we drop the double index from $T_{11}$ and $T_{12}$ and therefore use the symbols $T_1$ and $T_2$ to denote the sufficient statistics under model $M_1$ for $\boldsymbol{\beta}_1$ and $\sigma^2_1$, respectively. Let $\mathbf{y}^* = (y^*_1, \dots, y^*_{n^*})^T$ be a training (imaginary) sample of size $n^*$, with $k_1 + 1 \leq n^* \leq n$, and let $X^*_0$ and $X^*_1 = [X^*_0 \,|\, X^*_e]$ denote the corresponding imaginary design matrices. As before, we assume that all matrices are of full rank. Furthermore, let $(T^*_1, T^*_2) = (\widehat{\boldsymbol{\beta}}^*_1, RSS^*_1)$ denote the sufficient statistic under model $M_1$ using data $\mathbf{y}^*$ and design matrix $X^*_1$, where $\widehat{\boldsymbol{\beta}}^*_1$ denotes the maximum likelihood estimate of $\boldsymbol{\beta}_1$ and $RSS^*_1$ the residual sum of squares under model $M_1$ for data $\mathbf{y}^*$. Then $T^*_1$ can be decomposed in terms of $\widehat{\boldsymbol{\beta}}^*_0$, the maximum likelihood estimate of $\boldsymbol{\beta}_0$ given data $\mathbf{y}^*$. For the following analysis we use $M_0$ as the reference model.
Under the PEP methodology, using the power likelihood, the PEP prior can be derived conditionally on $(\boldsymbol{\beta}_0, \sigma_1)$. We know that, under $M_1$, $S = T^*_2/(\delta\sigma^2_1) \sim \chi^2_{n^* - k_1}$, and that a second quadratic form follows the non-central $\chi^2$ distribution with $k_1 - k_0$ degrees of freedom and non-centrality parameter $\lambda$. The resulting integrals can be expressed through the Kummer hypergeometric function $M(\cdot)$. From the above we see that, conditionally on $(\boldsymbol{\beta}_0, \sigma_1)$, the PEP prior is a beta mixture of a multivariate normal prior, and overall it can be written using a hierarchical structure. The EPP is directly available for $\delta = 1$.
Therefore, to sum up, the PEP priors (or EPPs for $\delta = 1$) for comparing models $M_0$ and $M_1$ have the hierarchical form described above, while for the reference model we set $\pi^{PEP}_0(\boldsymbol{\beta}_0, \sigma_0) = \pi^{N}_0(\boldsymbol{\beta}_0, \sigma_0)$, i.e. the reference (baseline) prior for model $M_0$. Therefore there are no issues regarding the indeterminacy of the Bayes factor when comparing model $M_0$ to $M_1$, since after the "intrinsification" the unknown constants of the imposed priors are the same for the two competing models. More specifically, the resulting Bayes factor for comparing model $M_1$ to $M_0$ involves, for $\ell = 0, 1$, the corresponding marginal likelihoods under the PEP priors. Returning to (5), and as is obvious from (6), both normalizing constants, $c_0$ and $c_1$, cancel out.
Under the usual case where the reference model $M_0$ is the null model (containing only the intercept), the prior variance-covariance matrix of the model coefficients involves $Z^*_e$, the matrix of the centred (at their means) imaginary covariates. In practice, when using the PEP prior with centred covariates and imaginary design matrices equal to the actual ones (as in Fouskakis et al. 2015), the induced approach results in a mixture of $g$-priors (Liang, Paulo, Molina, Clyde & Berger 2008) with a different hyper-prior on $g = \delta/t$. Table 1 summarizes the EPP and PEP priors, under the alternative hypothesis, for the minimal training sample size ($n^* = k_1 + 1$) as well as for any training sample size $n^* \in [k_1 + 1, n]$. Concerning the prior distribution of $\boldsymbol{\beta}_e \,|\, \boldsymbol{\beta}_0, \sigma_1$ (after integrating out the hyper-parameter $t$), for large $n^*$ its variance is equivalent to the variance of a "marginal" $g$-prior with $g = 2$ and $g = 2\delta$ for the EPP and the PEP prior, respectively.
Clearly, the PEP prior is more dispersed, accounting for information equivalent to $n^*/(2\delta)$ additional data points, while the EPP accounts for $n^*/2$ additional data points. When we consider the EPP with the minimal training sample, that is $n^* = k_1 + 1$, then $V(\boldsymbol{\beta}_e \,|\, \boldsymbol{\beta}_0, \sigma_1)$ is similar to the variance of a "marginal" $g$-prior with $g = d_1/(d_0 - 1)$. This means that it can be defined only for choices of $d_0 > 1$. On the other hand, $V(\boldsymbol{\beta}_e \,|\, \boldsymbol{\beta}_0, \sigma_1)$ can be defined without any problem when we consider any training sample of size $n^* > k_1 - d_0 + 2$.
Finally, under the PEP prior, the variance of $\boldsymbol{\beta}_e \,|\, \boldsymbol{\beta}_0, \sigma_1$ is further multiplied by $\delta$, increasing the spread of the prior and making the imposed prior less informative overall. For this reason, the corresponding posterior summaries will be more robust to the specific choices of $d_1$ and $d_0$, especially when $\delta = n^* = n$ and $n$ is large.

Priors and Properties
Generally, a wide range of prior distributions for variable selection in regression can be written in the following normal scale-mixture form:
\[
\pi_1(\boldsymbol{\beta}_e \,|\, \boldsymbol{\beta}_0, \sigma_1) = \int f_{N_{k_1 - k_0}}\!\left(\boldsymbol{\beta}_e;\, \mathbf{0},\, g\, \sigma^2_1\, \Sigma_e\right) \pi_1(g)\, dg, \tag{7}
\]
where $f_{N_d}(\mathbf{y}; \boldsymbol{\mu}, \Sigma)$ denotes the density of the $d$-dimensional normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, evaluated at $\mathbf{y}$, and $\pi_1(g)$ denotes the prior distribution of the parameter $g$ under model $M_1$. Under the PEP prior, the hyper-prior $\pi_1(g)$ for $g$ is a shifted generalized beta prime (SGBP) distribution, with density
\[
f_{SGBP}(x;\, a, b, p, q, s) = \frac{p\left(\frac{x-s}{q}\right)^{ap-1}\left[1 + \left(\frac{x-s}{q}\right)^{p}\right]^{-(a+b)}}{q\, B(a, b)}, \qquad x > s. \tag{8}
\]
The beta prime distribution is the special case of (8) with $p = q = 1$ and $s = 0$, while the generalized beta prime distribution is the special case of (8) with $s = 0$. In our case, since $q = s = \delta$, the density of the hyper-prior for $g$ simplifies accordingly, with $a = \frac{n^* + d_0 - k_1}{2}$ and $b = \frac{n^* + d_1 - d_0 - k_1}{2}$.
From the above expression, it is evident that the PEP prior implements an indirect averaging approach across all values of $g \geq \delta$. For the recommended setup (see Fouskakis et al. 2015) of $\delta = n^* = n$, this might look quite dramatic at first sight. In practice, however, it is reasonable, under lack of prior information, to consider at most a value of $g$ that corresponds to one unit of information. Moreover, in such cases the shrinkage $w = \frac{g}{g+1} = \frac{\delta}{\delta + t}$ (see Section 5) should approach the value of one, so that most of the posterior information comes from the data. If the likelihood mass supports values of $g$ lower than $\delta$, this means that the data do not contain enough information to estimate the model coefficients sufficiently well. An unrestricted prior for $g$ would then lead to greater shrinkage of the model coefficients $\boldsymbol{\beta}$ towards their prior mean. The truncation avoids such over-shrinkage, and the posterior of $g$ will be concentrated at the value of $\delta$, ensuring a minimum value of shrinkage towards the prior.
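As a small worked illustration of the effect of this truncation: since $g \geq \delta$, the shrinkage satisfies
\[
w = \frac{g}{g+1} \;\geq\; \frac{\delta}{\delta + 1},
\]
so that, for instance, under the recommended choice $\delta = n^* = n$ with $n = 50$ (the sample size used in the simulation study below) we have $w \geq 50/51 \approx 0.98$, keeping the shrinkage of the coefficients towards their prior mean small.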
Most of the known priors used for variable selection assume that $\Sigma^{-1}_e = X^T_e (I_n - P_0) X_e$ in (7). This is also the case for the PEP prior if we consider $X^*_e = X_e$, as in Fouskakis et al. (2015). Similarly, the benchmark prior (Ley & Steel 2012), the robust prior (Bayarri et al. 2012), the horseshoe prior (Carvalho, Polson & Scott 2010) and the hyper-$g$ and hyper-$g/n$ priors (Liang et al. 2008) can be written as in (7) with the hyper-prior for $g$ as in (8); details are provided in Table 2, under the usual choice of $d_1 = d_0 = 0$ for simplicity. Additionally, the EPP, as shown above (and also in Womack, Leon-Novelo & Casella (2014)), can be written as in (7), but using imaginary design matrices in $\Sigma^{-1}_e$, with number of rows usually equal to the minimal training sample size ($n^* = k_1 + 1$). Also, the intrinsic prior of Casella & Moreno (2006) can be viewed as an EPP. In their approach, as an approximation, using ideas from the arithmetic intrinsic Bayes factor approach, they used the original design matrix in $\Sigma^{-1}_e$, with all $n$ rows, together with an additional multiplier $\frac{n}{k_1 + 1}$ in the covariance matrix of the normal component in (7); see for example Womack et al. (2014). Therefore, this intrinsic prior can be viewed as a PEP prior with (a) $X^*_e = X_e$; (b) $n^* = k_1 + 1$ (minimal training sample); and (c) a model-dependent power parameter $\delta = \frac{n}{k_1 + 1}$; in the rest of the paper we will call this prior the intrinsic prior. Finally, the prior of Maruyama & George (2011) is also closely related, where the rotated coordinates are used in the normal component of (7), while the Zellner and Siow prior (Zellner & Siow 1980) is as in (7) with the hyper-prior for $g$ being an inverted gamma distribution with parameters $1/2$ and $n/2$. It has also been shown (2017) that the PEP prior satisfies the information consistency criterion (C3). Additionally, as shown here, for $d_0 = 0$ the PEP prior belongs to a more general class of conditional priors of the form (10), where $h_1(\cdot)$ is a proper density with support $\mathbb{R}^{k_1 - k_0}$. Bayarri et al. (2012) prove that the group invariance criterion (C7) holds if and only if $\pi_1(\boldsymbol{\beta}_e, \boldsymbol{\beta}_0, \sigma_1)$ has the form of (10).
Additionally, if $h_1(\cdot)$ is symmetric around zero, which is the case under the PEP prior, the predictive matching criterion (C5) also holds. Finally, when $X^*_e = X_e$, the conditional scale matrix has the form $\Sigma^{-1}_e = X^T_e (I_n - P_0) X_e$, and then null predictive matching, dimensional predictive matching and the measurement invariance criterion (C6) hold, according to Bayarri et al. (2012).

Posterior Distributions of Model Parameters
In this Section we present the posterior distributions of the model parameters under the PEP approach. For compatibility with the mixtures of $g$-priors, we work with the parameter $g = \delta/t$.

Full Conditional Posteriors and Gibbs Sampling
Under the PEP approach, the full conditional posterior distribution of $\boldsymbol{\beta}_e$ is a multivariate normal distribution of the form
\[
\boldsymbol{\beta}_e \,|\, g, \sigma_1, \boldsymbol{\beta}_0, \mathbf{y}, M_1 \sim N_{k_e}\!\left( W_e\, \widehat{\boldsymbol{\beta}}_e,\; W_e\, (X^T_e X_e)^{-1} \sigma^2_1 \right),
\]
where $k_e = k_1 - k_0$ and $W_e$ is a shrinkage matrix that depends on $\delta$; $\delta = 1$ corresponds to the EPP and $\delta > 1$ to the PEP prior.
The matrix $W_e$ plays the role of a multivariate shrinkage effect which penalizes each coefficient locally, while $w$ is a global shrinkage factor which uniformly affects the posterior mean and the posterior variance-covariance matrix. For example, if $w \to 0$ all the conditional posterior information is taken from the prior, while for $w \to 1$ the conditional posterior information is derived from the data.
Similarly, we can obtain the full conditional posterior distributions of $\boldsymbol{\beta}_0$, $\sigma^2_1$ and $g$; the latter involves the confluent hypergeometric distribution $CH(p, q, s)$, whose density function is normalized through the Kummer hypergeometric function $M(\cdot)$. The above conditional distributions can easily be used to implement a full Gibbs sampler to obtain the posterior estimates of interest for any given model, or within any Gibbs-based variable selection sampler (see for example Dellaportas, Forster & Ntzoufras 2002) to obtain estimates of the posterior model weights. We can further simplify the Gibbs sampler by combining the posterior distributions of $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_e$ given in (11) and (12).
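Since the confluent hypergeometric distribution may be less familiar, the following short Python sketch evaluates its density under the standard parameterization (kernel proportional to $x^{p-1}(1-x)^{q-1}e^{-sx}$ on $(0,1)$, normalized through Kummer's function). This standard parameterization is an assumption and may differ in its conventions from the exact $CH(p, q, s)$ form used here.

```python
import numpy as np
from scipy.special import beta as beta_fn, hyp1f1  # hyp1f1 is Kummer's M(a, b, z)

def ch_pdf(x, p, q, s):
    """Density of the confluent hypergeometric distribution on (0, 1),
    proportional to x^(p-1) * (1-x)^(q-1) * exp(-s*x); standard parameterization
    assumed, normalized by B(p, q) * M(p, p + q, -s)."""
    x = np.asarray(x, dtype=float)
    norm = beta_fn(p, q) * hyp1f1(p, p + q, -s)
    kernel = x ** (p - 1) * (1 - x) ** (q - 1) * np.exp(-s * x)
    return np.where((x > 0) & (x < 1), kernel / norm, 0.0)

# quick sanity check: the density should average to ~1 over a fine grid on (0, 1)
grid = np.linspace(1e-6, 1 - 1e-6, 200001)
print(ch_pdf(grid, p=2.0, q=3.0, s=1.5).mean())  # approximately 1
```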
Finally, the full conditional posterior distribution of $\boldsymbol{\beta}^T_1 = (\boldsymbol{\beta}^T_0, \boldsymbol{\beta}^T_e)$ can also be written in closed form, where $\mathbf{0}_{1 \times 2}$ denotes a $1 \times 2$ matrix of zeros, $w = g/(g+1)$ is the shrinkage parameter and $\widehat{\boldsymbol{\beta}}_1$ is the MLE of $\boldsymbol{\beta}_1$ under model $M_1$, given by $\widehat{\boldsymbol{\beta}}_1 = (X^T_1 X_1)^{-1} X^T_1 \mathbf{y}$.
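To make the sampling scheme concrete, the following is a minimal sketch of a Gibbs-type sampler for the conjugate special case $X^*_e = X_e$, $d_0 = d_1 = 0$, with the null model as reference and with $g$ held fixed. The function name and defaults are illustrative, and the update of $g$ itself (through its confluent hypergeometric full conditional mentioned above) is deliberately omitted.

```python
import numpy as np

def gibbs_fixed_g(y, Xe, g, n_iter=5000, seed=0):
    """Gibbs sketch for y = b0 + Xe @ beta_e + eps, eps ~ N(0, sigma2 * I),
    with a standard g-prior beta_e | sigma2, g ~ N(0, g * sigma2 * inv(Xe'Xe))
    and a reference prior on (b0, sigma2); g is kept fixed throughout."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    Xe = np.asarray(Xe, dtype=float)
    Xe = Xe - Xe.mean(axis=0)                 # centre covariates: intercept orthogonal to Xe
    n, ke = Xe.shape
    XtX = Xe.T @ Xe
    XtX_inv = np.linalg.inv(XtX)
    beta_hat = XtX_inv @ (Xe.T @ y)           # MLE of beta_e (with centred covariates)
    w = g / (g + 1.0)                         # global shrinkage factor
    b0, sigma2 = y.mean(), y.var()
    draws = []
    for _ in range(n_iter):
        # beta_e | b0, sigma2, g, y ~ N(w * beta_hat, w * sigma2 * inv(Xe'Xe))
        beta_e = rng.multivariate_normal(w * beta_hat, w * sigma2 * XtX_inv)
        # b0 | beta_e, sigma2, y ~ N(mean(y - Xe @ beta_e), sigma2 / n)
        b0 = rng.normal((y - Xe @ beta_e).mean(), np.sqrt(sigma2 / n))
        # sigma2 | beta_e, b0, g, y ~ Inv-Gamma((n + ke)/2, (RSS + beta_e'XtX beta_e / g)/2)
        resid = y - b0 - Xe @ beta_e
        rate = 0.5 * (resid @ resid + beta_e @ XtX @ beta_e / g)
        sigma2 = 1.0 / rng.gamma(0.5 * (n + ke), 1.0 / rate)
        draws.append((b0, beta_e.copy(), sigma2))
    return draws
```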

Marginal Likelihoods
The marginal likelihood conditionally on a value of $g$ is given by the usual marginal likelihood of the normal-inverse-gamma prior. It can be expressed in terms of $R^2_j$, the coefficient of determination of model $M_j$ ($j \in \{0, 1\}$), and a constant $C_1$ which is the same for all models (assuming that the covariates of $X_0$ are included in all models). The full marginal likelihood $f(\mathbf{y} \,|\, M_1)$ is then obtained by integrating over the hyper-prior of $g$, and involves $F_1(a, b_1, b_2, c; x, y)$, the hypergeometric function of two variables, or Appell hypergeometric function, given by
\[
F_1(a, b_1, b_2, c; x, y) = \sum_{m=0}^{\infty} \sum_{r=0}^{\infty} \frac{(a)_{m+r}\, (b_1)_m\, (b_2)_r}{(c)_{m+r}\; m!\; r!}\, x^m y^r,
\]
where $(\cdot)_m$ denotes the Pochhammer symbol (rising factorial). Note that the marginal likelihood of the reference model $M_0$ is $f(\mathbf{y} \,|\, M_0) = C_1$.
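For intuition, in the special case of the standard Zellner $g$-prior against the intercept-only reference model (i.e. $d_0 = d_1 = 0$ and $X^*_e = X_e$), the Bayes factor conditional on $g$ has the familiar closed form of Liang et al. (2008). The sketch below evaluates that special case only; it is not the general PEP expression above, which additionally involves $d_0$, $d_1$, $n^*$ and the Appell function.

```python
import numpy as np

def log_bf10_given_g(n, k1, R2, g):
    """log Bayes factor of M1 (k1 covariates plus intercept) against the null
    (intercept-only) model, conditional on g, under the standard Zellner g-prior:
    BF10(g) = (1 + g)^((n - k1 - 1)/2) * (1 + g * (1 - R2))^(-(n - 1)/2)."""
    return 0.5 * (n - k1 - 1) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2))

# illustrative values: n = 50 observations, k1 = 3 covariates, R^2 = 0.6, g = n
print(log_bf10_given_g(n=50, k1=3, R2=0.6, g=50.0))
```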

Marginal Posterior Distribution of g
The marginal posterior distribution of $g$ under model $M_1$, given in (15), is defined for $g \geq \delta$, with normalizing constant $C_2$ given by (16). The $\kappa$-th posterior moment of $g$ is available in closed form for $\kappa \in \{0, 1, 2, \dots\}$; note that $\tilde{F}_1(0) = F_1(0)$.
The posterior expectation and variance of $g$ are then directly obtained from these moments, where $a' = \frac{k_e}{2} + a - 1$.
The prior distribution of the shrinkage $w = g/(g+1) = \delta/(\delta + t)$ has a density that can be expressed in terms of the Humbert series $\Phi_1(\cdot)$ (Humbert 1920). In order to get more insight into the behaviour of the prior distribution of $w$, we can obtain approximations of its prior mean and variance by using the first terms of a Taylor expansion, where $\mu_t$ and $\sigma^2_t$ are the prior mean and variance of the hyper-parameter $t$. By implementing this approach, we obtain the approximations summarized in Table 3; for the PEP prior we restrict attention to the choice $\delta = n^*$. Note that for small training samples the dimensions $k_0$ and $k_1$ may influence the imposed prior, making it sometimes more informative than intended. From Table 3 it is evident that, when considering the usual EPP setup with the minimal training sample, the prior mean of the shrinkage is far from the value of one in specific cases (e.g. for the reference prior, or for the Jeffreys dependence prior when $k_0 = 1$ and $k_1 = 2$). This is not the case for the PEP prior, for which the prior mean of the shrinkage is close to one even for models of small dimension; for example, under the reference prior with $k_0 = 1$ and $k_1 = 2$ we obtain a prior mean of the shrinkage equal to 0.86 and a prior standard deviation equal to 0.071. Generally, the global shrinkage $w$ under the PEP prior is close to the value of one, implying that the prior is non-informative in the sense that most of the information is taken from the data.
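The Taylor approximations referred to above are not reproduced here; as a sketch, applying the standard second-order (delta-method) expansion of $w = \delta/(\delta + t)$ around $\mu_t$ gives
\[
E(w) \approx \frac{\delta}{\delta + \mu_t} + \frac{\delta\, \sigma^2_t}{(\delta + \mu_t)^3},
\qquad
V(w) \approx \frac{\delta^2\, \sigma^2_t}{(\delta + \mu_t)^4},
\]
which is the generic form such approximations take; the entries of Table 3 follow from substituting the corresponding $\mu_t$ and $\sigma^2_t$ of the EPP and PEP hyper-priors.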

Marginal Posterior Distribution of w
Under model $M_1$, the marginal posterior distribution of the shrinkage parameter $w$ can be directly derived from (15); the resulting density again involves the constant $C_2$ given by (16).
The $\kappa$-th posterior moment of $w$ is available in closed form for $\kappa \in \{0, 1, 2, \dots\}$, and therefore the posterior expectation and variance of $w$ are directly derived.

BMA Estimates
For a detailed derivation of the BMA estimates under the PEP prior, and for computational strategies regarding the implementation of BMA, we refer the reader to Fouskakis & Ntzoufras (2020).
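As a generic reminder of the structure of such estimates (not the specific closed-form expressions derived in the above reference), BMA point estimates average the model-specific posterior expectations with weights equal to the posterior model probabilities,
\[
\widehat{\beta}_j^{\,BMA} = \sum_{M} p(M \,|\, \mathbf{y})\, E(\beta_j \,|\, \mathbf{y}, M),
\]
with $E(\beta_j \,|\, \mathbf{y}, M) = 0$ whenever the $j$-th covariate is excluded from $M$.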

Simulation Study
In this Section we illustrate the proposed methodology on simulated data. We compare the performance of the PEP prior and the intrinsic prior, the latter as presented in Section 3. We consider 100 data sets of $n = 50$ observations with $k = 15$ covariates, under two different scenarios. Under Scenario 1 (independence), all covariates are generated from a multivariate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $I_{15}$, while the response is generated from model (20), for $i = 1, \dots, 50$. Under Scenario 2 (collinearity), the response is generated again from (20), but this time only the first 10 covariates are generated from a multivariate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $I_{10}$, while the remaining covariates $X_j$, $j = 11, \dots, 15$, $i = 1, \dots, 50$, are generated so as to induce collinearity with the first ten.
With $k = 15$ covariates there are only $2^{15} = 32{,}768$ models to compare; we were therefore able to conduct a full enumeration of the model space, obviating the need for a model-search algorithm in this example.
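A full enumeration of this size is straightforward; the following sketch illustrates the bookkeeping, where `log_marginal` and `log_prior_model` are placeholders for the closed-form PEP marginal likelihood of the previous sections and the chosen prior on the model space.

```python
import itertools
import numpy as np

def enumerate_models(y, X, log_marginal, log_prior_model):
    """Enumerate all 2^k models defined by inclusion indicators gamma and return
    posterior model probabilities and posterior inclusion probabilities.
    `log_marginal(y, X_subset)` and `log_prior_model(gamma)` are user-supplied
    placeholders (e.g. the closed-form PEP marginal likelihood and a uniform or
    beta-binomial prior on the model space)."""
    k = X.shape[1]
    gammas, log_posts = [], []
    for gamma in itertools.product([0, 1], repeat=k):
        cols = [j for j, included in enumerate(gamma) if included]
        log_posts.append(log_marginal(y, X[:, cols]) + log_prior_model(gamma))
        gammas.append(gamma)
    log_posts = np.array(log_posts)
    weights = np.exp(log_posts - log_posts.max())   # stabilize before normalizing
    post_probs = weights / weights.sum()            # posterior model probabilities
    incl_probs = np.array(gammas, dtype=float).T @ post_probs
    return post_probs, incl_probs
```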
Regarding the prior on the model space, we consider both the uniform prior and the beta-binomial (BB) prior. Under Scenario 1, the posterior inclusion probabilities for the non-zero effects (see Figure 1), for each method, follow the size of each covariate's effect, as expected. Hence the posterior inclusion probabilities for $X_1$ ($\beta_1 = 2$) are equal to one, with almost zero sampling variability, followed by $X_7$ ($\beta_7 = 1.5$) with posterior inclusion probabilities close to one and almost all values over 80%. For covariates $X_5$ and $X_{11}$, with absolute effects equal to one, the picture for the posterior inclusion probabilities is almost identical, due to the same magnitude of the effects in absolute value; moreover, we observe large sampling variability within and across the different methods. Finally, all methods fail to identify $X_{13}$ ($\beta_{13} = 0.5$) as an important covariate, with the intrinsic approaches giving higher inclusion probabilities on average (around 40%). Nevertheless, the posterior inclusion probabilities for $X_{13}$ are slightly higher on average, and more dispersed across different samples, than those of the zero effects (see Figure 2 for a representative example).
Due to the independence of the covariates, we get similar results to the ones presented
in Figure 2 for the remaining zero-effect covariates; the corresponding plots are therefore omitted for brevity. Concerning the comparison of the different methods, we observe that: (a) PEP is systematically more parsimonious than the intrinsic prior, as previously reported in the literature; (b) PEP-BB is more parsimonious than PEP-Uni; (c) I-BB supports slightly more parsimonious solutions than I-Uni. The last result is surprising at first glance, since the BB prior on the model space is promoted in the literature as a multiplicity-adjustment prior. Nevertheless, in this example, the mean of the inclusion probabilities under the uniform prior in each data set is around 0.46, which is only slightly reduced, to 0.44, under the BB prior, leaving the results virtually unchanged. This is in accordance with what is expected from this prior, since it places a U-shaped distribution on the prior probabilities of each model depending on its dimensionality. Under Scenario 2, the posterior inclusion probabilities for $X_1$ and $X_{13}$ (see Figure 3) show a similar picture to that of Scenario 1. For variable $X_7$ we again observe high inclusion probabilities, but this time with higher uncertainty. Due to collinearity, the posterior inclusion probabilities for covariates $X_5$ and $X_{11}$ are no longer similar, with the latter being close to one (a similar picture to the posterior inclusion probabilities of covariate $X_7$, with true effect equal to 1.5), while the former has posterior inclusion probabilities around 0.4 or lower, depending on the method. In Figure 4, the posterior inclusion probabilities for all the zero effects are presented. In all cases these are lower under the PEP prior, with PEP-BB behaving best. Moreover, there are differences across covariates, depending on the collinearities, mainly in the variability across samples.

Discussion
In this article we have shown that the power-expected-posterior prior, a generalization of the expected-posterior prior in objective Bayesian model selection, can be represented, under normal linear models, as a mixture of $g$-priors. This has the great advantage that posterior distributions and marginal likelihoods can be derived in closed form, permitting fast calculations even when exploring high-dimensional model spaces.
Future extensions of our method include the introduction of two different power parameters, in order to derive a family of prior distributions whose members include all the prior distributions for variable selection in regression that can be written as mixtures of $g$-priors, derived using either fixed or random imaginary data. Furthermore, we plan to extend the applicability of the PEP prior to cases where $k > n$. This can be done by (a) using shrinkage-type baseline priors, such as the Lasso or ridge; (b) assigning zero prior probability to models with dimension larger than $n$; and (c) mimicking formal approaches for using $g$-priors when $k > n$, such as Maruyama & George (2011), based on different ways of generalizing the notion of inverse matrices.