Bayesian Effect Selection in Structured Additive Distributional Regression Models

We propose a novel spike and slab prior specification with scaled beta prime marginals for the importance parameters of regression coefficients to allow for general effect selection within the class of structured additive distributional regression. This enables us to model effects on all distributional parameters for arbitrary parametric distributions, and to consider various effect types such as non-linear or spatial effects as well as hierarchical regression structures. Our spike and slab prior relies on a parameter expansion that separates blocks of regression coefficients into overall scalar importance parameters and vectors of standardised coefficients. Hence, we can work with a scalar quantity for effect selection instead of a possibly high-dimensional effect vector, which yields improved shrinkage and sampling performance compared to the classical normal-inverse-gamma prior. We investigate the propriety of the posterior, show that the prior yields desirable shrinkage properties, propose a way of eliciting prior parameters and provide efficient Markov Chain Monte Carlo sampling. Using both simulated and three large-scale data sets, we show that our approach is applicable for data with a potentially large number of covariates, multilevel predictors accounting for hierarchically nested data and non-standard response distributions, such as bivariate normal or zero-inflated Poisson.


Introduction
The flexibility of modern regression methodology is both a blessing and a curse for applied researchers and statisticians alike since, on the one hand, added flexibility enables potentially more realistic models approximating the true data generating process but, on the other hand, poses additional challenges in the model building and model checking process. In this paper, we consider structured additive distributional regression models (Rigby and Stasinopoulos, 2005;Klein, Kneib, Lang and Sohn, 2015) that combine additive predictors consisting of various types of regression effects, e.g. non-linear effects of continuous covariates, spatial effects or random effects (Kammann and Wand, 2003;Ruppert et al., 2003;Wood, 2017) with the possibility to model all parameters of the response distribution (e.g. location, scale or shape parameters) in terms of covariates in a distributional regression approach. As a consequence, an analyst is faced with the challenge of not only choosing an appropriate response distribution, (a task that we will not consider in this paper since both graphical tools for model checking as well as selection criteria are well developed, see for example Klein, Kneib, Lang and Sohn, 2015) but also with determining the most appropriate subset of covariates along with their exact modelling alternative for multiple regression predictors.
As an example, in one of our empirical illustrations on childhood undernutrition in Nigeria with more than 20,000 observations, we analyse a bivariate response variable (y 1 , y 2 ) ′ consisting of two scores for chronic and acute undernutrition. A previous study (Klein, Kneib, Klasen and Lang, 2015) suggests a bivariate normal model in which not only the marginal expectations but also the marginal scale parameters and the correlation parameter depend on covariates. This leads to a distributional regression model with five parameters µ 1 , µ 2 , σ 1 , σ 2 , ρ. In a full model, all of these parameters could be related to a predictor η ik of the form η ik =x ′ i β k + f 1,k (cage) + f 2,k (mage) + f 3,k (mbmi) + f spat ,k (region), k = 1, . . . , 5, i = 1, . . . , n, where i = 1, . . . , n denotes the observation index, k refers to the five distributional parameters, x i contains 13 binary covariates (and an intercept term) with regression coefficients β k , f j,k (·), j = 1, 2, 3, are non-linear smooth functions of age of child (cage), mother's age (mage) and mother's body mass index (mbmi), and f spat ,k are spatial effects based on regional information in the data. While effect selection (deciding which of the different effects should be included in the model) via a full search in the model space would already be challenging in a mean regression framework with only one single predictor, full effect selection in a distributional regression setting with multiple predictors is typically computationally prohibitive. This is even more the case when one is interested in deciding whether the effect of a continuous covariate shall be included in a linear or non-linear form or whether it could be excluded completely from the model. In this paper, we address these challenges and develop a novel spike and slab prior structure that enables Bayesian effect selection within structured additive distributional regression models.
While there has been extensive interest in spike and slab priors for Bayesian variable selection (i.e. the selection of effects in models with purely linear predictors) or function selection (selection of non-linear effects of continuous covariates) in previous years (see for example Clyde and George, 2004;O'Hara and Sillanpää, 2009, for reviews), most research has been restricted to additive mean regression with Gaussian errors, distributions from the exponential family or survival models but also in the context of group variable selection (Zhang et al., 2014;Xu and Ghosh, 2015). Furthermore, most approaches restrict the predictor specification to include either only linear effects or only non-linear effects of continuous covariates but do not enable the consideration of more complex effect types such as spatial effects or the decomposition of non-linear effects in linear and non-linear components.
Classical Bayesian variable selection approaches for linear models based on spike and slab priors include for example Mitchell and Beauchamp (1988), or George and McCulloch (1997). Smith and Kohn (1996) utilise these approaches for function selection in nonparametric regression with Gaussian responses by assigning the variable selection priors to individual basis functions. Approaches that move beyond the framework of Gaussian models but pertain the purely linear predictor structure comprise the approaches of Rossell and Rubio (2017) who propose a Bayesian variable selection approach that allows for skewness and thicker tails compared to the Gaussian distribution, Wang et al. (2017) who consider variable selection after transforming the response, and Chung and Dunson (2009); Kundu and Dunson (2014) who propose nonparametric models where in the former proposal the mean and shape learn the effect of covariates, while the latter assumes symmetric residuals. In all these approaches however, the spike and slab prior is directly imposed on the scalar regression coefficients. In contrast, Ishwaran and Rao (2005) consider a hierarchical specification where the spike and slab structure is not imposed directly on the regression coefficients but, on a higher level of the hierarchy, on their prior variances. This approach also allows to consider situations where selection should take place on blocks of regression coefficients representing for example the coefficients of a basis expansion in nonparametric regression. This leads to function selection approaches for additive models, also considered in Yau et al. (2003); Cottet et al. (2008); Reich et al. (2009), who combine a spike with point mass at zero with a slab that has support only on the positive real numbers. In contrast, Zhu et al. (2010) specify both spike and slab as normal distributions (with very different variance components) and Panagiotelis and Smith (2008) assign a multivariate prior with spike at the origin and normal slab directly to the whole vector of basis coefficients. In either case, one typically observes poor mixing unless sampling from marginalized full conditionals which are only available in closed form for Gaussian models (Yau et al., 2003;Reich et al., 2009;Panagiotelis and Smith, 2008) or models that have a latent Gaussian representation such as the probit model (Zhu et al., 2010). Cottet et al. (2008) address function selection in double exponential regression models, where both the mean and the dispersion parameter are linked to an additive predictor which comprises linear and non-linear effects. The model space is restricted, since functional effects may enter the model only if the corresponding linear effect is included in the model. Our proposal is inspired by the approach of Scheipl et al. (2012) that introduces effect selection in generalized additive models for simple exponential family regression and with only one meanrelated additive predictor. As Scheipl et al. (2012), we rely on a redundant parameter expansion of the vector of the basis coefficients as originally proposed in Gelman et al. (2008), and which allows us to expand the vector of basis coefficients in an importance parameter shared by all basis coefficients on the one hand and standardised basis coefficients on the other hand. Effect selection is then performed by assigning a spike and slab prior to the squared importance parameter. More precisely, our paper makes the following important contributions: • We integrate effect selection based on spike and slab priors in the structured additive distributional regression framework such that selection of general effect types is no longer restricted to mean regression models with responses from simple exponential families.
• The parameter vectors representing the additive effect components in a structured additive predictor are typically assigned partially improper multivariate normal priors. Instead of explicitly reparameterising the vector of basis coefficients to enable the specification of proper priors as in Scheipl et al. (2012), we implicitly remove the partial impropriety by adding a corresponding constraint to the prior distribution. As a consequence, we can retain sparse matrix structures for speeding up computations and show empirically that this has beneficial impact on the mixing behaviour of the MCMC simulations. In particular, when the vector of regression coefficients is large, we do not observe the strong dependence on the dimensionality of the basis coefficient vector identified in Scheipl et al. (2012). This enables us to also include effects of considerable dimension such as spatial effects to truly exploit the benefits of effect selection over function selection and even allows us to further extend the model to hierarchical specifications of the predictors (Lang et al., 2014).
• Formulating the spike and slab prior for the squared importance parameter in the redundant parameterisation yields scaled beta prime marginals which have favourable shrinkage properties (Pérez et al., 2017). We study these properties in detail and provide corresponding theoretical results for our prior structure including conditions for the propriety of the posterior.
• We develop rules for eliciting the hyperparameters of the spike and slab prior based on simple scaling criteria that are easily accessible to applied researchers. Based on the elicited parameters, we find that our new prior structure has similarly favourable shrinkage properties as the approach by Scheipl et al. (2012), while it avoids to arbitrarily fix the hyperparameters.
The rest of this paper is structured as follows: Section 2 summarises the specification of our novel spike and slab prior for effect selection in distributional regression. Properties of the prior, including prior elicitation, shrinkage properties and propriety of the posterior are discussed in Section 3. Section 4 contains details on posterior estimation via Markov chain Monte Carlo simulations and points to software and implementation. Sections 5.1 and 5.2 evaluate the performance of our approach in simulations and three diverse applications. In Section 6 we conclude. Our approach to Bayesian effect selection based on spike and slab priors is developed for the general class of (multivariate) Bayesian structured additive distributional regression (Klein, Kneib, Lang and Sohn, 2015). Let (y i , ν i ), i = 1, . . . , n denote n independent obser-vations on the (not necessarily scalar) response variable y and covariates ν. We then assume that the conditional distribution of y i given ν i is specified in terms of a K-parametric distribution with density p(y i |ϑ i1 , . . . , ϑ iK ), where ϑ i = (ϑ i1 , . . . , ϑ iK ) ′ is a collection of K scalar distributional parameters ϑ ik , k = 1, . . . , K, which depend on ν i . Compared to mean regression models where p(·) is usually assumed to belong to the exponential family and where K −1 parameters are treated as fixed or nuisance parameters, in distributional regression each of the distributional parameters is linked to a structured additive predictor η ik via a suitable one-to-one transformation h k , i.e. h k (η ik ) = ϑ ik and η ik = h −1 k (ϑ ik ).

Structured Additive Predictors
The predictors themselves are specified as where the effects f sel j,k (ν i ) represent various types of flexible functions depending on (different subsets of) the covariate vector ν i that are to be selected via spike and slab priors, while η in ik represents a second additive predictor consisting of all effects f in l,k (ν i ) that are not under selection. The separation into two subsets of effects allows us to include specific covariate effects mandatorily in the model (e.g. based on prior knowledge or since these represent confounding effects that have to be included in the model in any case). In the following, we will only discuss the specification of priors for the effects under selection in detail since the effects η in ik can be handled exactly as in distributional regression models without effect selection, but we will use the differentiation later in Section 3.4 for deriving sufficient conditions for the propriety of the posterior.
Dropping the parameter index k, the function index j and the superscript sel in the rest of this section for notational simplicity, we assume that each effect f (ν i ) can be approximated by a linear combination of basis functions such that where B d (ν i ), d = 1, . . . , D are the basis functions,β = (β 1 , . . . ,β D ) ′ is the vector of (standardised) basis coefficients and τ is an importance parameter. Due to the linear basis representation, the vector of function evaluations f = (f (ν 1 ), . . . , f (ν n )) ′ can be written as f = τ Bβ where B is the (n × D) design matrix arising from the evaluation of the basis functions B d (ν i ), d = 1, . . . , D at the observed covariate values ν 1 , . . . , ν n .
Note that the parameterisation in (M3) is equivalent to the standard specification in structured additive regression but redundant as only the product β = τβ is identified. However, the importance parameter τ allows us to remove effects from the predictor for τ = 0 while effects are considered to be of high importance if τ is large in absolute terms. We will place a spike and slab prior on the squared importance parameter τ to achieve effect selection.

Constraint Prior for Regression Coefficients
Since for many specific types of effects the vector of basis coefficients β is of relatively high dimension, it is often useful to enforce specific properties such as smoothness or shrinkage. In a Bayesian formulation, this can be facilitated by assuming (partially improper) multivariate Gaussian priors where K denotes the prior precision matrix implementing the desired properties, τ 2 is a prior variance parameter and the indicator function 1[Aβ = 0] is included to enforce linear constraints on the regression coefficients via the constraint matrix A. The latter is typically used to remove identifiability problems from the additive predictor (e.g. by centering the additive components of the predictor) but can also be used to remove the partial impropriety from the prior that comes from a potential rank deficiency of the precision matrix K with rk(K) = κ ≤ D.
We specify a prior of exactly the same structure on the vector of scaled basis coefficientsβ, and assume that the constraint matrix A is chosen such that all rank-deficiencies in K are effectively removed from the prior distribution. This can, for example, be achieved by setting where ker(K) denotes the null space of K and span (ker(K)) is a representation of the corresponding basis. This specification effectively restricts the parameter vectorβ to a lower dimensional space of dimension rk(K) and allows us to establish a decomposition of the effect f (ν) into a penalized and an unpenalized part, i.e. f unpen (ν) + f pen (ν) where f unpen (ν) represents parts of the function corresponding to the null space of K which are therefore not affected by the "penalisation" induced by K while f pen (ν) represents the part of the total effect that is associated with the proper, informative prior part. Importantly, we can now put separate spike and slab priors on both parts of f . For instance, in case of penalized splines with second order random walk prior, the space of unpenalized functions contains the linear functions, while the penalized part contains nonlinear deviations from the former. Such a parameterization hence enables the decision whether a continuous covariate should be included purely nonlinearly, whether it is sufficient to assume a pure linear effect or whether the sum of a linear and a non-linear effect is needed. The resulting models are therefore both potentially more parsimonious and easier to interpret.
The specifications (M3), (M4) and (M3 * ), (M4 * ) seem to be equivalent to each other corresponding to rescaling the regression coefficients and the prior distribution as β = τβ. However, this is only true if the prior distribution (M4) is indeed proper. To see this, assume that K is rank deficient and a constant effect is not penalised by the prior precision matrix. In this case, the traditional formulation of structured additive regression models (M3 * ) implies a constant effect if τ 2 approaches zero while the rescaled version (M3) implies an effect equal to zero since the complete function is multiplied by τ .
Note, that both (M4 * ) and (M4) rely on the same precision matrix K and hence the constraint matrix A can be constructed independently of the parametrisation. The traditional way is an explicit mixed model decomposition (Fahrmeir et al., 2004;Wood, 2011) which is used by Scheipl et al. (2012) to perform effect selection for mean regression models. As the mixed model representation yields a penalised component which isβ ∼ N(0, I), this is effectively equivalent to considering our constraint prior by choosing the constraint matrix according to (M5) and by rescaling the individual entries inβ with the eigenvalues of K (see Rue and Held, 2005, Sec. 3.2 for details). However, the explicit mixed model representation used by Scheipl et al. (2012) destroys the sparsity properties of the design matrices (such as band structures for B-splines) and causes full design matrices which in turn increases computation times. In order to keep the sparsity of the design matrices of functional effects (and hence to minimize computation time) we instead implicitly remove the improper part of p(β|τ 2 ) by sampling β directly from the constrained posterior using (M4).

Normal Beta Prime Spike and Slab Prior on Squared Importance Parameter
To achieve function selection in our model, we place a spike and slab prior specification on the squared importance parameter τ 2 . This hierarchical prior relies on a mixture of one prior concentrated close to zero such that it can effectively be thought of as representing zero (the spike component) and a more dispersed, mostly noninformative prior (the slab) and is specified via the hierarchy The scale parameter ψ 2 determines the prior expectation of τ 2 , which is ψ 2 for δ = 1 and rψ 2 for δ = 0 with r ≪ 1 being a fixed small value and hence the indicator δ determines whether a specific effect β = τβ is included in the model (δ = 1) or excluded from the model (δ = 0). The parameter ω is the prior probability for an effect being included in the model and the remaining parameters a, b, a 0 , b 0 and r are hyperparameters of the spike and slab prior. We will discuss prior elicitation for these parameters in detail in Section 3.2.
Marginalising over ψ 2 , both the spike and the slab component p(τ 2 |δ) are scaled beta prime distributions with shape parameters 1/2 and a and scale parameter 2r(δ)b (Pérez et al., 2017). Therefore we call the hierarchical prior on β = τβ specified by (M4) -(M6) the Normal Beta Prime Spike and Slab (NBPSS) prior, see Section 3 for a detailed discussion of the properties of the NBPSS prior. Equations (M1) to (M6) define our complete model specification for effect selection in structured additive distributional regression.

Special Cases
We briefly discuss some of the components of structured additive predictors used later in our empirical evaluations. These include • non-linear effects based on Bayesian P-splines (Lang and Brezger, 2004), where random walk priors are used for the regression coefficients corresponding to D different B-spline basis functions. The i-th row of B then contains the basis functions If not stated otherwise, we will use second order random walk priors and cubic B-splines with 20 inner knots resulting in D = 22.
• spatial effects for a discrete set of geographical regions modelled via Gaussian Markov random fields (GMRFs) with precision matrix given by an adjacency matrix encoding the neighbourhood relation between the regions (Rue and Held, 2005) and a design matrix with entries (i, s) equal to one if observation i is located in region s and zero otherwise. We consider the simplest form of GMRFs and define two regions as neighbours if they share common borders.
• multilevel structured additive regression models as proposed by Lang et al. (2014) that allow for hierarchical prior specifications for regression effects where each parameter vector may again be assigned an additive predictor, i.e. the vector β is decomposed as β = η + ε and the predictor η can itself be of structured additive form.

Properties of the NBPSS prior
In the following, we discuss properties of the NBPSS prior hierarchy, including elicitation of hyperparameters, shrinkage properties and propriety of the posterior. For prior elicitation and shrinkage properties, the marginal distribution of β = τβ plays a crucial role. We will therefore start with deriving this marginal distribution.

Marginal Distribution
The marginal prior for the squared importance parameter τ 2 is given by the mixture of two scaled beta prime distributions BP(1/2, a, 2b) and BP(1/2, a, 2rb) with mixture weight of the slab given by P(δ = 1|a 0 , b 0 ) = a 0 /(a 0 + b 0 ). A modified version of the NBPSS prior can alternatively be derived by assuming a mixture of two scaled t distributions for the importance parameter τ = ± √ τ 2 . Specifying this prior hierarchically, the first equation in (M6) is replaced by τ |δ, ψ 2 ∼ N (0, r(δ)ψ 2 ) and as a consequence posterior sampling would no longer be possible with Gibbs steps as the corresponding conditional posterior would depend on the likelihood function. Marginalising over ψ 2 , δ and ω, the prior p(τ ) is a mixture of two scaled t-distributions with 2a degrees of freedom, location parameter 0, scale parameters b/a and rb/a and mixture weights a 0 /(a 0 + b 0 ) and b 0 /(a 0 + b 0 ), respectively. Thus, the prior on the (signed) importance parameter τ is closely linked to the NMIG prior used in Ishwaran and Rao (2005) when considering scalar regression coefficients β that are conditionally normal given the inverse gamma distributed variance parameter τ 2 (but with one level of hierarchy less) on the one hand, and, on the other hand to the peNMIG specification of Scheipl et al. (2012).
The implied marginal distribution for β = τβ can now be derived as where pβ is given in equation (M4). However no analytical solution exists for this integral such that it has to be approximated numerically.

Prior Elicitation
In the following, we discuss prior elicitation for the NBPSS prior hyperparameters a, b, a 0 , b 0 and r. More precisely, we argue that suitable default values can be suggested for a, a 0 , and b 0 based on theoretical arguments while providing intuitive and user-friendly criteria for the elicitation of b and r. In the literature, default values have often been suggested from simulation-based evidence (e.g. in Scheipl et al., 2012) but we prefer to determine b and r in a more transparent way.
Theoretical properties of the scaled beta prime distribution have been discussed in Pérez et al. (2017). From this, it follows that for both spike and slab moments of order less than a exist and the variance decreases with a. Furthermore, for small values of a, the spike and the slab component will overlap such that moves from δ = 0 to δ = 1 are possible. However to guarantee the existence of moments, a should not be too small either. Fixing a = 5 yielded overall a convincing mixing performance and we therefore use this value also in our real data examples.
For the prior inclusion parameter ω a sensible default is to use a 0 = b 0 = 1 which corresponds to a flat prior on the unit interval. Of course, one can also choose fix values for ω in case strong prior knowledge on the prior inclusion probability of the size of the expected model is available. As the marginal prior inclusion probability is given by P(δ = 1|a 0 , b 0 ) = a 0 /(a 0 + b 0 ), a 0 and b 0 can be chosen to reflect prior assumptions on the inclusion probability of effects.
For the elicitation of b and r, we propose an approach inspired by the principled approaches of Simpson et al. (2017) and Klein and Kneib (2016). More precisely, we consider marginal probability statements on the supremum norm sup ν∈D |f (ν)| over a certain set of covariate values D conditional on the status of the inclusion/exclusion parameter δ. Given δ = 1 (inclusion of the effect), the marginal distribution of f (ν) does no longer depend on r, such that the parameter b can be determined from This is the probability that the supremum norm of an effect is smaller than a pre-specified level c for all design points ν ∈ D, such that α and c should be small. Basically we formulate the prior such that it is unlikely that the supremum norm stays below a pre-specified level if it is indeed an informative effect that should be included. Both the level c and the prior probability α have to be specified by the analyst according to her/his prior beliefs. To derive r, we proceed similarly but consider the probability now conditioning on non-inclusion. Since in this case we would rather be interested in making the probability of not exceeding the threshold c large, the probability is reversed to 1 − α. Note that the absolute value of the effects can be taken without loss of generality due to the centring constraint of each function to ensure identifiability.
The basic idea of these two equations is that such prior statements can be much more easily elicited in applications, in particular in distributional regression where the application of response functions such as the exponential function or the logit transform induce default ranges of plausible effect sizes. Of course, the levels c as well the probability levels α can be chosen to be distinct for the inclusion/exclusion criteria in (3) and (4) but we suppress this possibility notationally both for simplicity and since in most cases it seems plausible to choose the same parameter settings anyway.
To access the probabilities in (3) and (4), we have to derive the marginal distribution of sup |f (ν)| which is not analytically accessible. For a single covariate value ν, the function evaluation is given by denoting the generalized inverse of K) and p(τ 2 ) is given in Equation (1). Note that using the generalized inverse effectively removes the portion of f (ν) that corresponds to the null space of K such that we take the constraint in (M4) into account. The integrals above are scalar integrals for each covariate ν which can be solved numerically. However, obtaining the supremum over a large set D, numerical integration easily becomes computationally intractable. We hence determine the distribution of the supremum based on simulations from the hierarchical NBPSS prior.
In the Online Appendix B, we show how to determine r and b independently of each other. For given design matrix B = (b ′ ν 1 , . . . , b ′ νn ) ′ , precision matrix K, probability level α and threshold c, these can be computed for general functional effects using the R package sdPrior (Klein, 2018).

Shrinkage Properties
Regularisation and shrinkage properties of certain prior settings in regression specifications can be studied by considering the marginal distribution of the regression coefficients and/or functional effects. According to Section 3.1 the marginal densities have to be determined by numerical integration.

Constraint Regions
We compare the prior specified in (M4)-(M6) with a standard NMIG prior applied directly to the coefficients in β and the parameter expanded prior (peNMIG) of Scheipl et al. (2012). Figure 1 shows the univariate marginal log-densities where the most distinct difference is between the standard NMIG prior compared to peNMIG and NBPSS priors. While the standard NMIG prior resembles the shape of a normal distribution with a finite asymptote at zero, both parameter expanded priors feature a spike in zero. As we will show in the next section, this spike is indeed infinite such that advantageous selection behaviour is to be expected for the NBPSS prior. Figure 2 supplements the univariate considerations by bivariate marginal log-densities. We differentiate between two situations: First, we consider two parameters that depend on the same value τ 2 , i.e. parameters belonging to the same function f (ν), while in the second case we consider parameters depending on different importance parameters. This distinction is important since the standard NMIG prior always assumes independent components with separate hyperparameters. As a consequence, the peNMIG and NBPSS priors deviate from the standard situation in two ways: First by the parameter expansion itself and second by making the parameters depend on the same hyperparameter. To disentangle the effect of these two deviations, we rely on the separate presentations. We make the following important observations: • The NBPSS and peNMIG priors share the same qualitative behaviour while deviating considerably from the standard NMIG prior regardless of whether the case of shared or distinct τ 2 is considered.
• The univariate marginal densities qualitatively resemble the ones of the original spike and slab prior of Mitchell and Beauchamp (1988) with tails that are heavy enough to induce a re-descending score function which ensures robustness of the Bayesian estimators (see also the next subsection).
• For the case of distinct parameters, we observe contours similar to the convex shape of L q priors with q < 1 for the peNMIG and NBPSS priors which implies weak shrinkage of large effects while small coefficients are strongly shrunken to zero.
• For the case of shared τ 2 , the shapes of the contours imply simultaneous shrinkage of both parameters instead of the strong shrinkage towards the coordinate axes observed for distinct importance parameters. This is exactly the desired type of shrinkage for parameters belonging to one effect f (ν) to completely remove the effect from the model specification.
• As already noted in Section 2.2, the specification of the prior in Scheipl et al. (2012) differs from ours insofar as they consider the mixed model decomposition of effects. Additionally, Scheipl et al. (2012) use a bimodal prior for the standardized regression effects with modes at +1 and −1. This effectively bounds the coefficients away from zero and thus encourages sampling from one mode of the posterior, while we instead explore the full posterior. Consequently, the conditional posterior ofβ of NBPSS is a standard normal distribution p NBPSS (x) = N(x; 0, 1), while the one of peNMIG is a mixture of two normals with modes, p peNMIG (x) = 0.5 N(x; 1, 1) + 0.5 N(x; −1, 1). Taking the ratio yields p peNMIG (x) p NBPSS (x) > 1 ⇔ |x| > cosh −1 (exp(0.5)) ≈ 1.08, which explains the slightly heavier tails of peNMIG in Figures 1 and 2.
We also study the implied constraint regions for the marginal prior of function evaluations f (ν) = b ′ ν β, which can be derived in complete analogy by utilising that In contrast, the marginal prior for function evaluations for the parameter expanded prior of Scheipl et al. (2012) is not numerically accessible since it involves a complex mixture of 2 D components (where D is the dimension of β) due to the bimodal prior for the elements ofβ. Figure 3 depicts marginal densities for the effect f (ν) evaluated at one (left panel) and two (right hand panel) randomly chosen covariate values of a sequence of n = 100 equidistant values in [−π, π]. The resulting design matrix B is based on cubic Bayesian P-splines with D = dim(β) = 22. Hence, the bivariate plot corresponds to the situation of one shared importance parameter since we are interested in shrinkage of the effect evaluations for the same effect at different covariate values. Qualitatively, the behaviour from the marginal densities of the regression coefficient is translated to the function evaluations, i.e. we observe a peak in zero and simultaneous shrinkage.

Tail Behaviour and Behaviour in the Origin
Visually, the marginal prior for β features a distinct peak as shown in the previous section. We now investigate more closely, whether this spike is finite or infinite by considering the behaviour of p β (β)| β=0 . Using Equation (2) we obtain (0) [log(τ )] 1 0 = ∞, and therefore the marginal prior for β indeed has an infinite spike in zero. Note that we have shown that the multivariate parameter expanded prior has a spike in zero, while Scheipl et al. (2012) have only shown the result for the univariate marginal prior. An infinite spike in zero is considered to induce particularly beneficial shrinkage properties since we obtain heavy penalisation of small effects.
The tail behaviour of the marginal prior for β can be studied by looking at the score function of p(β) which consists of the elements Figure 4 visualizes the resulting score function and compares it to the score function of the NMIG and peNMIG priors. From the graphical representation we find that all three prior structures have heavy tails such that the score functions are re-descending (i.e. they approach zero as their argument tends to infinity) which induces Bayesian robustness of the resulting estimates.
The score functions of the peNMIG and NBPSS priors resemble the shape of L q priors with q close to zero, while the shape of the score function for the NMIG prior shows a more complex non-monotonously shape around zero.

Propriety of the Posterior Distribution
While in Section 2 we do not explicitly change the design matrices to remove the nullspace of the precision matrices K j,k (both effects with NBPSS prior and the ones not under selection), we do derive an explicit mixed model representation of the predictors η k in (M2) in this section as this greatly simplifies the derivation of sufficient conditions for the propriety of the posterior.
As the exact conditions are also dependent on the prior structures employed, we need to be more precise here about η in and will therefore introduce a slightly different notation compared to that in Section 2.

Mixed Model Representation
Assume we have L k effects in η in k and J k effects under selection and let furthermore η k = η in k +η sel k be the complete predictors for k = 1, . . . , K as defined in Section 2.1.2.
We then assume a mixed model type representation (Fahrmeir et al., 2004) for η in The columns of U in l,k are a basis of ker(K in l,k ),Ṽ in l,k forms a basis of the images of K in l,k , such that dim(β in pen,l,k ) = rk(K in l,k ) = κ in l,k and β in pen,l,k |(τ 2 l,k ) in ∼ N(0, (τ 2 l,k ) in I), while β in unpen ,l,k has dimension D in l,k − κ in l,k and a flat prior. As a consequence, we obtain L k variance parameters (τ 2 l,k ) in for the L k penalized vectors of coefficients β in l in η in k . For effects in η sel we proceed similarly but with proper NBPSS priors on both parts of K sel j,k , rk(K sel j,k ) = κ sel j,k representing a basis of the nullspace and the image each. Hence, by construction all effects under selection (after centring) can be assumed to have proper prior distributions. For non-linear effects of continuous covariates with random walk priors of order > 2 for instance, this is achieved by separating the polynomial parts up to order-1 and to include separate NBPSS prior on these, see Section 2 for details. We hence assume that the sub-predictors under selection are of the form This yields J k importance parameters (τ 2 j,k ) sel with hyperparameters ψ 2 j,k , δ j,k , ω j,k in addition to the J k regression coefficients with NBPSS priors after re-parameterisation. We furthermore introduce κ k = L k l=1 κ in l,k + J k j=1 κ sel j,k . Finally, the complete predictor can be written as where we denote β unpen,k ≡ β in unpen ,k , β pen,k = ((β in . Let us in the sequel assume that the matrices U k have full column rank r k , k = 1, . . . , K and define for X k = (U k , V k ) and t k = rk(X k ) − rk(U k ) ≤ dim(β pen ,k ), rk(X k ) = r k + t k . (6) Remark 1. In order to obtain a full column rank matrix of unpenalised effects in the mixed model representation (5), all superfluous columns have to be deleted. In particular, duplicated constant columns representing the levels of the functions are deleted which is a simple way to include the centring restrictions and is equivalent to the centring of functions that we include in our MCMC algorithm. Furthermore, using the one-to-one relationship between original parameterisation and the reparameterised model the restrictions for one presentation can be deduced from the other one. Hence, sufficient rank conditions can be formulated directly for the reparameterised model (5)

Conditional Independence Assumptions
To derive the posterior distribution of model (M1) to (M6), we make the usual conditional independence assumptions (see the Online Appendix A.1, conditions (a.1)-(a.3b)) by labelling for k = 1, . . . K the coefficients β in l,k with variances (τ 2 l,k ) in , l = 1, . . . , L k for effects not under selection; and β sel j,k , (τ 2 j,k ) sel , ψ 2 j,k , δ j,k , ω j,k , ,J = 1, . . . , J k , for the effects with NBPSS prior. In general, they mean that priors for different effects are assumed to be independent, while within an effect they are dependent by construction. In general, prior independence assumptions should be a reasonable working assumption which also does not rule out posterior dependence. Note that we always assume proper NBPSS priors and in particular a j,k > 0, b j,k > 0 in the priors for ψ 2 j,k . This is justified by our considerations on prior elicitation as discussed in Section 3.2 of the main paper. In the following we assume that conditions (a.1)-(a.3b) of the Online Appendix A.1 hold.

Gaussian Mean Regression
Assume in this section a Gaussian mean regression model for y = (y 1 , . . . , y n ) ′ with predictor η from (5) in mixed model representation, i.e.
where we assume for the error variance. Note that k = 1 in this subsection and that J k , L k , κ k are replaced by J, L, κ. Applying the mixed model representation (5) allows us writing (7) as y = U β unpen + V β pen + ε, and with the corresponding rank assumptions from above.

b. Conditions for Gaussian Mean Regression
Condition (b.1) excludes Jeffrey's prior (corresponding to a in l = b in l = 0) for effects not under selection but allows for flat priors on variances and standard deviations (τ 2 l ) in . Conditions (b.2) to (b.4) relate the ranks κ in l and κ sel j of the prior precision matrices of each of the effects to the rank κ of all prior precision matrices. For effects not under selection, the conditions can be ensured by increasing a in l . Condition (b.5) restricts the number of all effects to be smaller or equal to the number of observations but can be relaxed by increasing the hyperparameters values a ε and a in l . Condition (b.7) is always fulfilled for b ε > 0. In case of an improper prior for τ 2 ε , SSE > 0 has to be assured, while b ε > 0 becomes necessary when the number of unknown parameters is greater than n.
Theorem 1. Consider the Gaussian mean regression model (7) with mixed model representation (5) and rank conditions from (6). The proof of Theorem 1 is given in the Online Appendix A.3.
Remark 2. For effects not under selection, additional conditions on the ranks κ l and the number of effects compared to the shape parameters (a l ) in of the priors are required, as the latter can be improper and hence (a l ) in < 0 becomes possible. Consequently, one has to consider the cases t = κ or L = 1 as well as t < κ and L > 1 separately. This is not necessary for effects with NBPSS prior.

Distributional Regression
In order to achieve sufficient conditions for the propriety of the posterior in distributional regression, we define a normalized submodel with Gaussian errors to be able to apply results of Theorem 1. More precisely, we first separate the random effect with largest dimension in each predictor of (5), such that we obtain where V ε,k b ε,k corresponds to the effect with proper prior and with the largest dimension, dim(b ε,k ) = rk(K ε,k ) = κ ε,k , and V k b k contains all remaining effects with proper prior, both the ones with NBPSS prior and the ones not under selection with usual inverse gamma priors. Note that b k is based on J * k = L k + J k − 1 effects in the notation in (5), with κ k denoting the sum of ranks of the J * k precision matrices of predictor k, and where, w.l.o.g. we assume that the effects in the predictors are ordered such that the (J k + L k )-th effect corresponds to the random effect in the mixed model representation with largest dimension. Similarly, the design matrix (V k , V ε,k ) corresponds to the design matrix (V in k , V sel k ). Note also, that b ε,k can originate from an effect not under selection or one with NBPSS prior and we distinguish the two cases in Theorems 2 and 3.
Assume that the set of observations can (after re-ordering) be partitioned such that for n * ≥ 1 . . , n, k = 1, . . . , K. This implies that for at least one observation the density is integrable (with respect to the predictors) and that all remaining densities are bounded. For discrete distributions, all densities are automatically bounded by 1 so that only Condition (c.1) can be an issue in practice. Condition (c.1) is usually fulfilled if certain restrictions apply on specific parameters that exclude extreme values on the boundary of the parameter space, see  for a more detailed discussion on count data and binary distributions. For continuous distributions, the densities are sometimes not bounded (e.g. for the gamma distribution). Note that this is not a problem when all observations fulfil Condition (c.1) since n * = n is allowed. Similar as for the discrete distributions, integrability of the densities can be assured by the assumption that none of the distributional parameters is on the boundary of the parameter space (an assumption that would also have to be made to apply standard maximum likelihood asymptotics).
Letñ ε = min{κ ε,1 , . . . , κ ε,K } and assume that we can chooseñ ε observations including at least one observation fulfilling (c.1) to define the submodel with these observations, such that V ε,k,s b ε,k ∼ N(0, τ 2 ε,k V ε,k,s V ε,k,s ′ ). Then the following rank conditions have to be fulfilled: The design matrix U k,s has full rank r k .

(c.5)
To ensure (c.3), superfluous columns arising from the reparameterisation have to be deleted. In particular, duplicated constant columns representing the levels of the functions are deleted, see Klein and Kneib (2016, Remark 2 (iii) for details). Condition (c.4) indicates that the rank of the design matrices in the submodel is the same as in the complete model whereas (c.5) defines a similar restriction for the design matrix of the largest random effect arising from the mixed model representation. Finally, the normalised submodel η k,s =Ũ k,s β unpen,k +Ṽ k,s b k + ε k,s , ε k,s ∼ N(0, τ 2 ε,k Iñ ε ) is obtained by multiplying (8) with M k = (V ε,k,s V ε,k,s ′ ) −1/2 such thatη k,s = M k η k,s ,Ũ k,s = M k U k,s ,Ṽ k,s = M k V k,s , and ε k,s represents an i.i.d. random effect.
The corresponding residual sum of squares for the normalised submodel is To derive sufficient conditions for the propriety of the posterior we have to distinguish two cases: the largest random effect ε k,s corresponds to an effect with a) NBPSS prior and b) not under selection and with the usual inverse gamma priors for the variance τ ε,k .
Above, Conditions (c.·a) each correspond to the case that the largest random effect has a variance with inverse gamma prior, while Conditions (c.·b) each are active when the variance of the largest random effect has an NBPSS prior. Conditions (c.6a),(c.6b) require that if for effects not under selection b in l,k is set to zero, the parameter a in l,k has to be negative. This includes situations corresponding to flat priors for the random effects variance (a in l,k = −1) or standard deviation (a in l,k = −0.5) but excludes Jeffreys prior (a in l,k =0). Conditions (c.7a), (c.7b) and (c.8a), (c.8b) relate the rank of the random effects part of one individual effect to the sum of all rank deficiencies in the corresponding predictor, are similar for effects not under selection and the ones with NBPSS prior and require that the dimensionality is not too small. The condition can be ensured by increasing the shape parameters a in l,k and a j,k , respectively. Conditions (c.9a), (c.9b) restrict the number of effects not under selection and with flat prior to be at most equal to the dimension of the largest random effects part in the model but can again be relaxed by increasing the shape parameters a in l,k . Finally, Conditions (c.10a), (c.10b) require that there is variation in the residual sum of squares in the normalized submodel (implying that not all effects are zero) in situations where the largest random effect has an NBPSS prior and either variation in the residual sum of squares or b ε > 0 when the largest random effect has the usual inverse gamma prior on the variances. The latter requirement can always be ensured in practice but excludes flat priors for the random effects variances or standard deviations.
The proof of Theorem 2 follows from the proof of  using Theorem 1 above as we assume that all NBPSS priors are proper.

Posterior Estimation
Update of the Basis Coefficients. Due to the modular structure of Markov chain Monte Carlo (MCMC) simulation algorithms, no changes in the MCMC scheme developed by Klein, Kneib, Lang and Sohn (2015) are required for updating the basis coefficients β when supplementing them with a NBPSS prior instead of the standard inverse gamma prior. We therefore apply iteratively weighted least squares based approximations to the log full conditional and generate proposals from the multivariate normal distribution N(µ, P −1 ) with expectation and precision matrix given by where η − = η − Bβ is the predictor without the effect currently updated and the working observationsỹ and weights W are determined based on first and second derivatives of the loglikelihood with respect to the predictor.

Update of the Smoothing Variance for
Effects not Subject to Selection. For effects not subject to selection, we consider an inverse gamma prior τ 2 ∼ IG(a, b) for the smoothing variances such that the update of τ 2 can be done via a simple Gibbs sampling step drawing from Update of the Squared Importance Parameter for Effects Subject to Selection. The full conditional p(τ 2 |β, δ, ψ 2 ) is a generalised inverse Gaussian distribution GIG(p, q, c), with p = −0.5 rk(K) + 0.5, q = 1/(r(δ)ψ 2 ), c = β ′ Kβ and can be generated efficiently in a Gibbsstep. This has the advantage that τ 2 can be generated independently of the likelihood in an efficient Gibbs step. This is no longer possible when the prior is formulated for the importance parameter τ as in (Scheipl et al., 2012) where a Metropolis-Hastings update is required, see the Online Appendix C.
Updates for the Hyperparameters of the NBPSS prior. For the hyperparameters of the NBPSS prior, we obtain Gibbs sampling steps via the following full conditionals: • Inclusion indicator δ: where ϕ(·; µ, σ 2 ) denotes the density of the normal distribution with mean µ and variance σ 2 and 2ψ 2 (1/r−1) .
• Hyper-variance ψ 2 : Note that it is also possible to use the same ω for multiple effects simultaneously. If ω relates to a total of L effects, the full conditional is then given by Implementation. Spike and slab based effect selection in distributional regression has been implemented in a developer version of BayesX (Belitz et al., 2015) which is available from the authors on request. The software makes use of methods for efficient storing of large data sets and sparse matrix algorithms for sampling from multivariate Gaussian distributions (George and Liu, 1981;Rue, 2001) and also allows us to access existing procedures for example for computing simultaneous confidence bands for nonparametric effects as developed in Krivobokova et al. (2010). Hyperparameter elicitation is integrated in the R-package sdPrior (Klein, 2018).

Simulations
To evaluate the performance of the NBPSS prior for effect selection in distributional regression, we conducted extensive simulations under various settings. We distinguish different scenarios for the predictor complexity, models including and excluding spatial effects, four selected response distributions, varying sample sizes, correlated and uncorrelated covariates and a set of userdefined parameters for hyperprior elicitation. Specifically, • we consider Gaussian responses with effects only on the expectation, a Gaussian location-scale model, Poisson regression and zero-inflated Poisson models.
• we specify four test functions • we distinguish two scenarios in terms of the predictor complexity: low sparsity in which out of 16 included covariates 12 have non-zero influence. The true linear predictor is η = f 1 (x 1 )+f 2 (x 2 )+f 3 (x 3 )+f 4 (x 4 )+1.5 (f 1 (x 5 ) + f 2 (x 6 ) + f 3 (x 7 ) + f 4 (x 8 ))+ 2(f 1 (x 9 ) + f 2 (x 10 ) + f 3 (x 11 ) + f 4 (x 12 ) and we simulate the two cases with additional and without additional spatial effect f spat (s), labeled as 'spatial/non-spatial'. These settings are used for η µ in the homoscedastic Gaussian and the Gaussian location-scale model, as well as for η λ in the Poisson and the zero-inflated Poisson model. high sparsity in which out of eight included covariates four have non-zero influence. The true linear predictor is η = f 1 (x 1 ) + f 2 (x 2 ) + f 3 (x 3 ) + f 4 (x 4 ) and we again simulate the two cases with additional and without additional spatial effect f spat (s). These settings are used for η σ 2 in the Gaussian location-scale model and for η π in the zero-inflated Poisson model.
• we generate covariates either -as i.i.d. realizations from U[−2, 2] or -from an AR(1) process with correlation ρ = 0.7 and standarize x in order to facilitate prior elicitation.
• we simulate 150 replications for each combination of the settings.
• we use six combinations of α and c for the elicitation of the prior hyperparameters b and r arising from the pairwise combination of -α = 0.05, 0.1, 0.2, -c = 0.1, 0.2.
• we consider the sample sizes n = 200; 1, 000 for Gaussian, n = 500; 2, 000 for Poisson, n = 1, 000; 2, 000 for Gaussian location-scale and zero-inflated Poisson responses. The sample sizes have been chosen to reflect a challenging (small sample size) and a relatively informative (large sample size) setting, taking the different complexity of the model structures into account.
As a competitor for the single parameter distributions Gaussian and Poisson, we consider the peNMIG prior of Scheipl et al. (2012) implemented in the R-package spikeSlabGAM (Scheipl, 2016). We refrain from comparison with further variable selection priors mentioned in the introduction as these usually lack applicability beyond the framework of generalized linear models. Hyperparameter elicitation for the NBPSS prior was performed with the package sdPrior (Klein, 2018) and estimation was done with the current developer version of BayesX (Belitz et al., 2015). For both the NBPSS and the peNMIG prior, non-linear effects are based on 20 cubic B-spline basis functions constructed from an equidistant set of knots combined with second-order random walk prior unless stated otherwise.
In the following, we restrict ourselves to the main conclusions, a detailed description about simulation settings and evaluation including complete graphical evidence is provided in the Online Appendix D. As a general outcome, the NBPSS prior results in very good performance for the selection of relevant effects even in challenging distributional regression settings with effect selection on multiple distributional parameters, where no competing Bayesian variable selection approach is available so far. Evidence for that is given in Figures 5 and 6 showing posterior inclusion probabilities and the ratio between predictive NBPSS log-scores and oracle log-scores (i.e. log-scores arising from a model with given, true predictor specification), respectively, in the zero-inflated Poisson model. The log-scores have been computed from independently generated test data sets with 5,000 observations.
In the simple exponential family framework with only one single regression predictor, the NBPSS prior turns out to be a strong competitor to the peNMIG prior (see Figure 7 for overall accuracy results of the Poisson model). Selection of large coefficient blocks such as spatial effects works well for all types of response distributions, while these are particularly problematic with peNMIG due to severe mixing problems. On the other hand, the explicit reparameterisation of non-linear effects used with the peNMIG prior (as compared to the constrained sampling approach that NBPSS is based on) seems to have some advantages in separating the linear and non-linear part of non-linear effects in cases where the true effect is close to linear and at the same time covariates are strongly correlated.
Coinciding with previous evidence on Bayesian effect selection, we find a strong impact of hyperprior parameter choice on the resulting effect selection performance. Our interpretable yet flexible way of eliciting hyperprior parameters equips data analysts with an intuitive approach for choosing these hyperparameters. More precisely, changing the probability α and the threshold c can help to balance between the true positive and false negative rates of effect selection. Choosing α and c smaller, results in more conservative, i.e. sparser models. Based on our simulations, we suggest α = c = 0.1 as default values in our applications.
In summary, our simulations demonstrate that the NPBSS prior provides a promising approach for Bayesian effect selection that extends existing methods to a framework that is applicable in any distributional regression model comprising both multiple hierarchical predictor specifications and high-dimensional coefficient vectors. In addition, our effect decomposition allows to select the linear part and its non-linear deviation for an effect of a continuous covariate separately.

Applications
In this section, we demonstrate the efficacy of a simultaneous selection approach via the NBPSS prior specification and its applicability for non-Gaussian, discrete or multivariate data. Core information about the different data sets Patents, Nigeria and House prices including the type of response distribution, number of observations and effects can be found in Table 1. Estimates shown in the subsequent subsections are all the model-averaged estimates obtained from the MCMC iterates with the NBPSS prior and the covariates have been standardized for prior elicitation reasons.

Number of Patent Citations
The Patents data set contains the number of citations of patents granted by the European Patent Office (EPO). An inventor who applies for a patent has to cite all related, already existing patents his patent is based on.  use this data set to illustrate their developed methodology on Bayesian zero-inflated and overdispersed count data and conducted variable selection in a stepwise forward approach based on the deviance information criterion (DIC). In the following, we focus on zero-inflated Poisson (ZIP) models for analysing the number of patent citations. The ZIP model has two distributional parameters, λ, the rate of the count process, and π the probability of observing an excess of zeros. Including all available variables in one of the predictors η k , k = 1, 2 reads as where x contains the continuous variables year (year when patent was granted), ncountry (number of designated states for patent), nclaims (number of patent claims), as well as the binary indicators ustwin (twin patent in the US), opp (oppositions against the patent), biopharm (patent from the biotech/pharma sector), patus (patent holder from the US) and patgsgr (patent holder from Germany, Switzerland or Great Britain), see Table E.1 in the Online Appendix for summary statistics of the variables. Possible non-linear effects of the three continuous variables are captured by the functions f 1 to f 3 . The predictor specifications of the model identified in  via stepwise DIC-selection are η λ = β 0,λ + β 1,λ opp + β 2,λ biopharm + β 3,λ patus + β 4,λ patgsgr + f 1,λ (year ) + f 2,λ (ncountry) + f 3,λ (nclaims) η π = β 0,π + β 1,π opp + β 2,π biopharm + β 3,λ patus + β 4,λ patgsgr + f 1,π (year ) + f 2,π (ncountry).
This model is denoted as ZIP DIC in the following.
We compare this model to the model ZIP NPBSS with predictors selected by the NBPSS prior where r and b were determined from α ∈ {0.05, 0.1}, c = 0.1. Table 2 reports predictive log-scores (obtained from ten-fold cross validation) as well as values for the DIC and the widely applicable information criterion (WAIC). From the table, we can conclude, that the ZIP NPBSS model is clearly favoured in terms of the chosen criteria. For the NBPSS model, we report posterior probabilities P(δ|y) in Table 3. Based on the decision to include an effect if P(δ|y) ≥ 0.5 holds, the NBPSS prior coincides with the stepwise approach of ZIP DIC for the effects of the continuous covariates but yields a sparser prediction specification for the effects of binary covariates.

Bivariate Analysis of Undernutrition
The Nigeria data have been extracted from Demographic and Health Surveys (DHS, https://dhsprogram.com/) containing nationally representative information about the population's health and nutrition status in numerous developing and transition countries. Here we use data from Nigeria collected in 2013. Overall there are 23,042 observations after removing outliers and inconsistent observations from the data. We use stunting and wasting as the bivariate response vector, where stunting refers to stunted growth measured as insufficient height of the child with respect to its age, while wasting refers to insufficient weight for height. Hence stunting is an indicator for chronic undernutrition while wasting reflects acute undernutrition. We assume that the two indicators are jointly normally distributed with marginal means, marginal scales and correlation parameter depending on covariates. Specifically, the model equations for all predictors of the distributions are specified as where x i contains 13 binary covariates characterising the household the child is living in as well as the child itself, see Table C.3 of the Online Appendix for a full description of variables. The three non-linear effects f 1 to f 3 of cage (age of the child in months), mage (age of the mother in years), mbmi (body mass index of the mother) are decomposed into their linear and non-linear part as described in Section 2.2. For the scale parameters, we used an exponential response function and for ρ the response function g(x) = x/ (1 + x 2 ). The DIC/WAIC of the full model and model with NBPSS prior are 159,101/159,190 and 159,101/159,173, respectively and hence slightly better for the NBPSS prior model. Figures 8 and 9 show the posterior means together with their 95% posterior credible intervals of linear and non-linear effects for the full model (blue) and the model with NBPSS prior (red).
For the the function estimates f j,k = f j,k,lin + f j,k,nonlin , Figure 9 shows the corresponding nonlinear part f j,k,nonlin separate from the linear part f j,k,lin in Figure 8, while the sum of the two components can be found in the Online Appendix F. We see that both models yield very similar point estimates, however the NBPSS prior results in slightly smoother estimates and more narrow credible intervals and hence more precise predictions -as desired with an effective variable selection approach. Spatial effects of the five distribution parameters with the NBPSS prior are visualized in Figure 10. While we omit the ones of the full model, tendencies are similar as for the remaining effects.
Inclusion probabilities are reported in Table 4. We find that the regional effect is relevant in all distribution parameters, i.e. not only the marginal means but also the scales and the correlation between stunting and wasting. Interestingly, chronic undernutrition measured by stunting seems to be mostly driven by variables describing the life situation of the children. In contrast, besides the region of residence, the mother's nutritional status measured by mbmi has a relevant effect only for acute undernutrition (wasting).

Hedonic House Prices
We apply our methodology to the house prices dataset of n = 98, 354 single family homes in Germany. The data were provided by F+B Research & Consulting for Habitation, Real Estate and Environment Ltd, a business consultancy in Hamburg, Germany. We consider the price per square metre in Euro as the response variable and explain the variation in prices in terms of four continuous covariates representing year of construction (yoc), expert rating (rating), plot area (areapl), living area (arealiv ) and spatial location (dist). We use district-specific averages yoc dist and rating dist as further covariates. We assume a Gaussian hierarchical location-scale model, where both expectation µ and log-variance log(σ 2 ) are related to the following hierarchical predictor.
• Level 1 (houses): 5,k (dist) • Level 2 (districts): 1,k (yoc dist ) + f 2,k (rating dist ) + f 3,k (dist), where f 3,k (dist) follow Gaussian Markov random fields for k = 1, 2 and, as before, we decompose the effects of the continuous covariates in both levels into their linear and non-linear part such that we end up with 26 effects in total. The NBPSS prior is put on all effects and inclusion probabilities are given in Table 5 j,k,nonlin and the estimated spatial effects can be found in the Online Appendix G. In summary, we find that the NBPSS prior demonstrates its effect selection and shrinkage abilities also in hierarchical settings. While on level 1 the full model and the model with NBPSS prior mostly coincide, we see considerable regularisation of some non-linear effects for level 2. The NBPSS prior is clearly able to select the spatial effect and non-linear part of rating dist in both distribution parameters, while the linear part and the effect of yoc dist would be excluded according to the inclusion probabilities.

Summary and Discussion
In this paper, we have developed a novel prior structure for Bayesian effect selection in structured additive distributional regression models thus extending existing approaches in terms of both flexibility of available response distributions and predictor flexibility. We derived shrinkage properties of the NBPSS prior and show its favourable properties. In simulations we demonstrate empirically that the NBPSS prior is applicable even to the selection of high dimensional coefficient blocks in more than one distribution parameter. The method promises wide applicability which we illustrate along three different examples including zero-inflated count data, a bivariate Gaussian model and a hierarchical location-scale specification for hedonic housing priors.
Instead of arbitrarily fixing hyperparameters of the inverse gamma priors we provide an intuitive and interpretable way for hyperprior elicitation which is easily accessible by applied users. This is an important feature since results react sensitively with respect to the actual choices of hyperparameters. Yet, the NBPSS prior controls the flexibility of each effect separately since priors are assumed to be independent and does not allow to control the overall complexity of the predictor. However, the NBPSS prior could be extended to achieve also global shrinkage properties, e.g. by specifying the scale parameter in the prior on τ 2 as a product of a global and a local parameter (Polson and Scott, 2010). As in distributional regression the propriety of the posterior is not trivial, however, care has to be taken with respect to the specific prior choices (Ghosh et al., 2018). Alternatively, if interest is rather in smoothing and shrinkage than in explicit effect selection shrinkage priors like the double gamma prior Bitto and Frühwirth-Schnatter (2018) or penalised complexity priors Simpson et al. (2017) might be used.
Also, it is conceptually straightforward to include Bayesian quantile or expectile regression models into the NBPSS prior framework and we aim to do so in a future work.     and of η dist ,µ and η dist ,σ 2 (second row) with the NBPSS prior.  Relative mean logarithmic scores Figure 6: Violin plots of relative mean log-scores (i.e. mean log scores obtained with the NBPSS prior divided by mean log scores of the oracle model) in the zero-inflated Poisson model. The log-scores are averaged over 5,000 new test data observations for each simulation replicate. The columns represent the different sample sizes n = 1, 000; 2, 000, rows 1 and 3 belong to the non-spatial scenarios (no spatial effect in the data generating model) and rows 2 and 4 to the spatial ones (the data generating model comprises a spatial effect). Covariates are uncorrelated in rows 1 and 2 and correlated in rows 3 and 4. The different boxplots within a column/row correspond to different combinations of α, c denoted as (α, c) in the labels.