Informative Bayesian Neural Network Priors for Weak Signals

Encoding domain knowledge into the prior over the high-dimensional weight space of a neural network is challenging but essential in applications with limited data and weak signals. Two types of domain knowledge are commonly available in scientific applications: 1. feature sparsity (fraction of features deemed relevant); 2. signal-to-noise ratio, quantified, for instance, as the proportion of variance explained (PVE). We show how to encode both types of domain knowledge into the widely used Gaussian scale mixture priors with Automatic Relevance Determination. Specifically, we propose a new joint prior over the local (i.e., feature-specific) scale parameters that encodes knowledge about feature sparsity, and a Stein gradient optimization to tune the hyperparameters in such a way that the distribution induced on the model's PVE matches the prior distribution. We show empirically that the new prior improves prediction accuracy, compared to existing neural network priors, on several publicly available datasets and in a genetics application where signals are weak and sparse, often outperforming even computationally intensive cross-validation for hyperparameter tuning.


Introduction
Neural networks (NNs) have achieved state-of-the-art performance on a wide range of supervised learning tasks with a high signal-to-noise ratio (S/N), such as computer vision (Krizhevsky et al., 2012) and natural language processing (Devlin et al., 2018). However, NNs often fail in scientific applications where domain knowledge is essential, e.g., when data are limited or the signal is extremely weak and sparse. Applications in genetics often fall into the latter category and are used as the motivating example for our derivations. The Bayesian approach (Gelman et al., 2013) has been of interest in the NN community because of its ability to incorporate domain knowledge into reasoning and to provide principled handling of uncertainty. Nevertheless, it is still largely an open question how to encode domain knowledge into the prior over Bayesian neural network (BNN) weights, which are often high-dimensional and uninterpretable.
Figure 1: a) A spike-and-slab prior with slab probability p = 0.05 induces a Binomial distribution on the number of relevant features. The proposed informative spike-and-slab can encode a spectrum of alternative beliefs, such as a discretized or 'flattened' Laplace (for details, see Section 3). b) The informative spike-and-slab prior can remove false features more effectively than the vanilla spike-and-slab prior with the correct slab probability, where features are assumed independent (see Section 6.1).

We study the family of Gaussian scale mixture (GSM) (Andrews and Mallows, 1974)
distributions, which are widely used as priors for BNN weights. A particular example of interest is the spike-and-slab prior (Mitchell and Beauchamp, 1988), under which the weight $w^{(l)}_{i,j}$ from node $i$ in layer $l$ to node $j$ in layer $l+1$ is either exactly zero (the spike) or drawn from a Gaussian (the slab). The hyper-parameters $\{\sigma^{(l)}, \lambda^{(l)}_i, p\}$ are often given non-informative hyper-priors (Neal, 2012), such as the inverse Gamma on $\sigma^{(l)}$ and $\lambda^{(l)}_i$, or optimized using cross-validation (Blundell et al., 2015). In contrast, we propose determining the hyper-priors according to two types of domain knowledge often available in scientific applications: ballpark figures on feature sparsity and the signal-to-noise ratio. Feature sparsity refers to the expected fraction of features used by the model. For example, it is known that less than 2% of the genome encodes for genes, which may inform the expectation on the fraction of relevant features in a genetics application. A prior on the signal-to-noise ratio specifies the amount of target variance expected to be explained by the chosen features, and it can be quantified as the proportion of variance explained (PVE) (Glantz et al., 1990). For instance, one gene may explain only a tiny fraction of the variance of a given phenotype (the prediction target in genetics, e.g., the height of an individual), i.e., the PVE of a gene may be as little as 1%.
Existing scalable sparsity-inducing BNN priors, such as the spike-and-slab prior, are restricted in the forms of prior beliefs about sparsity they can express: conditionally on the slab probability $p$, the number of relevant features follows a Binomial distribution. Specifying a Beta hyper-prior on $p$ could increase flexibility, but this is still more restricted and less intuitive than specifying a distribution directly on the number of relevant features, and in practice the BNN literature uses a point estimate for $p$. The value of $p$ is either set manually, cross-validated, or optimized as part of MAP estimation (Deng et al., 2019).

Figure 2: a) The empirical distribution and corresponding kernel density estimate (KDE) of the proportion of variance explained (PVE) for a BNN, obtained by simulating from the model, before and after optimizing the hyper-parameters according to the prior belief on the PVE. b) Data with PVE = 0.8 in their generating process are more likely to be generated by a BNN when the mode of the model's PVE is approximately correct (left) than when it is incorrect (right). Colored lines are functions sampled from the BNN (for details, see Section 4).

Moreover, the weights for different features or nodes
are (conditionally) independent in Equation 1; thus, incorporating correct features will not help remove false ones. In this paper, we propose a novel informative hyper-prior over the feature inclusion indicators $\tau^{(l)}_i$, called the informative spike-and-slab, which can directly model any distribution on the number of relevant features (Figure 1a). In addition, unlike in the vanilla spike-and-slab, the $\tau^{(l)}_i$ for different features $i$ are dependent in the new informative spike-and-slab, and consequently false features are more likely to be removed when correct features are included, which can be extremely beneficial when the noise level is high, as demonstrated with a toy example in Figure 1b. The PVE assumed by a BNN affects the variability of functions drawn from the prior (Figure 2b). Intuitively, when the PVE of a BNN is close to the correct PVE, the model is more likely to recover the underlying data-generating function. The distribution of the PVE assumed by a BNN is induced by the prior on the model's weights, which in turn is affected by all the hyper-parameters. Thus, hyper-parameters that do not affect feature sparsity, e.g., $\lambda^{(l)}_i$, can be used to encode domain knowledge about the PVE. We propose a scalable gradient-based optimization approach to match the model's PVE with the prior belief on the PVE, e.g., a Beta distribution, by minimizing the Kullback-Leibler divergence between the two distributions w.r.t. chosen hyper-parameters using the Stein gradient estimator (Li and Turner, 2018) (Figure 2a). Although it has been demonstrated that using cross-validation to specify hyper-parameters, e.g., the global scale in the mean-field prior, is sufficient for tasks with a high S/N and a large dataset (Wilson and Izmailov, 2020), we empirically show that being informative about the PVE can improve performance in low-S/N and small-data regimes, even without computationally intensive cross-validation.
The structure of this paper is the following. Section 2 reviews the required background on Bayesian neural networks and Stein gradients. In Section 3, we describe our novel joint hyper-prior over the local scales, which explicitly encodes feature sparsity. In Section 4, we present the novel optimization algorithm to tune the distribution of a model's PVE according to prior knowledge. Section 5 reviews in detail a large body of related literature on BNNs. Thorough experiments with synthetic and real-world datasets are presented in Section 6, demonstrating the benefits of the method. Finally, Section 7 concludes with a discussion of the limitations of our method as well as suggested future directions.

Proportion of Variance Explained
In regression tasks, we assume that the data generating process takes the form

$$ y = f(x; w) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2), $$

where $f(x; w)$ is the unknown target function and $\epsilon$ is the unexplainable noise. The Proportion of Variance Explained (PVE) (Glantz et al., 1990) is then

$$ \mathrm{PVE}(w) = 1 - \frac{\mathrm{Var}(y - f(x; w))}{\mathrm{Var}(y)} = \frac{\mathrm{Var}(f(x; w))}{\mathrm{Var}(f(x; w)) + \sigma_\epsilon^2}. $$

The PVE is commonly used to measure the impact of the features $x$ on the prediction target $y$, for example in genomics (Marttinen et al., 2014). In general, the PVE should lie in [0, 1] because the variance of the predictions should not exceed that of the data. However, this may not hold at test time for non-linear models such as neural networks if the model has overfitted to the training data, in which case the variance of the residual can exceed the variance of the target in the test set. By placing a prior over $w$ whose $\mathrm{PVE}(w)$ concentrates around the PVE of the data generating process, the hypothesis space of the prior can be made more concentrated around the true model, which eventually yields a more accurate posterior.
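To make the definition concrete, the following minimal Python sketch (our own illustrative code, not from the paper) computes the empirical PVE of predictions on held-out data; as noted above, the value can drop below zero if the residual variance exceeds the target variance.

```python
import numpy as np

def empirical_pve(y_true, y_pred):
    """Empirical PVE: 1 - Var(residual) / Var(target).

    Can be negative on test data when the model has overfitted,
    i.e., when the residual variance exceeds the target variance.
    """
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

# Toy check: a weak signal that explains roughly 10% of the variance.
rng = np.random.default_rng(0)
signal = rng.normal(size=10_000)
y = signal + rng.normal(scale=3.0, size=10_000)   # noise variance 9, signal variance 1
print(empirical_pve(y, signal))                    # approximately 0.1
```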

Bayesian neural networks
Variational posterior approximation
Bayesian neural networks (BNNs) (MacKay, 1992; Neal, 2012) are defined by placing a prior distribution on the weights $p(w)$ of a NN. Then, instead of finding point estimates of the weights by minimizing a cost function, which is the normal practice for NNs, a posterior distribution of the weights is computed conditionally on the data. Let $f(x; w)$ denote the output of a BNN and $p(y|x, w) = p(y|f(x; w))$ the likelihood. Then, given a dataset of inputs $X = \{x^{(1)}, \ldots, x^{(N)}\}$ and outputs $y = \{y^{(1)}, \ldots, y^{(N)}\}$, training a BNN means computing the posterior distribution $p(w|X, y)$. Variational inference can be used to approximate the intractable $p(w|X, y)$ with a simpler distribution, $q_\phi(w)$, by minimizing $\mathrm{KL}(q_\phi(w)\,\|\,p(w|X, y))$. This is equivalent to maximizing the Evidence Lower BOund (ELBO) (Bishop, 2006)

$$ \mathcal{L}(\phi) = \mathbb{H}[q_\phi(w)] + \mathbb{E}_{q_\phi(w)}\left[\log p(y|X, w)\,p(w)\right]. $$

The first term in Equation 4 is the entropy of the approximate posterior, which can be calculated analytically for many choices of $q_\phi(w)$. The second term is often estimated using the reparametrization trick (Kingma and Welling, 2013), which reparametrizes the approximate posterior $q_\phi(w)$ through a deterministic and differentiable transformation $w = g(\phi, \varepsilon)$ of a noise variable $\varepsilon \sim p(\varepsilon)$, so that the expectation can be estimated by Monte Carlo integration.
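As a minimal illustration of variational BNN training with the reparametrization trick, the PyTorch sketch below fits a one-hidden-layer regression BNN with a fully factorized Gaussian posterior and a standard normal prior; the architecture, toy data, and prior scale are our own illustrative choices, not the paper's setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, H = 128, 5, 16                      # data size, input dim, hidden width
X = torch.randn(N, D)
y = X[:, :1] + 0.5 * torch.randn(N, 1)    # toy regression target

# Variational parameters of q_phi(w): means and log-stds for two weight matrices.
shapes = [(D, H), (H, 1)]
mu = [torch.zeros(s, requires_grad=True) for s in shapes]
log_sigma = [torch.full(s, -2.0, requires_grad=True) for s in shapes]
prior_std, noise_std = 1.0, 0.5

def sample_weights():
    # Reparametrization trick: w = mu + sigma * eps, eps ~ N(0, I).
    return [m + ls.exp() * torch.randn_like(m) for m, ls in zip(mu, log_sigma)]

def neg_elbo():
    w1, w2 = sample_weights()
    pred = F.relu(X @ w1) @ w2
    # Gaussian log-likelihood, up to an additive constant.
    log_lik = -0.5 * ((y - pred) ** 2).sum() / noise_std ** 2
    # Analytic KL(q || p) between factorized Gaussians (negative entropy + cross-entropy).
    kl = sum((0.5 * ((m ** 2 + (2 * ls).exp()) / prior_std ** 2 - 1)
              - ls + torch.log(torch.tensor(prior_std))).sum()
             for m, ls in zip(mu, log_sigma))
    return kl - log_lik

opt = torch.optim.Adam(mu + log_sigma, lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = neg_elbo()     # single-sample Monte Carlo estimate of the negative ELBO
    loss.backward()
    opt.step()
```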

Gaussian scale mixture priors over weights
The Gaussian scale mixture (GSM) (Andrews and Mallows, 1974) is defined as a zero-mean Gaussian conditional on its scales. In BNNs, it has been combined with Automatic Relevance Determination (ARD) (MacKay, 1996), a widely used approach for feature selection in non-linear models. An ARD prior in BNNs means that all of the outgoing weights $w^{(l)}_{i,j}$ from node $i$ in layer $l$ share the same scale $\lambda^{(l)}_i$ (Neal, 2012). We define the input layer as layer 0 for simplicity. A GSM ARD prior on each weight $w^{(l)}_{i,j}$ can be written in a hierarchically parametrized form as follows:

$$ w^{(l)}_{i,j} \mid \lambda^{(l)}_i, \sigma^{(l)} \sim \mathcal{N}\!\left(0,\; \sigma^{(l)2}\lambda^{(l)2}_i\right), \qquad \lambda^{(l)}_i \sim p(\lambda^{(l)}_i; \theta_\lambda), $$

where $\sigma^{(l)}$ is the layer-wise global scale shared by all weights in layer $l$, which can either be set to a constant value or estimated using non-informative priors, and $p(\lambda^{(l)}_i; \theta_\lambda)$ is the hyper-prior on the local scales. The marginal prior on $w^{(l)}_{i,j}$ can be obtained by integrating out the local scales given $\sigma^{(l)}$:

$$ p\!\left(w^{(l)}_{i,j} \mid \sigma^{(l)}\right) = \int \mathcal{N}\!\left(w^{(l)}_{i,j};\, 0,\; \sigma^{(l)2}\lambda^{(l)2}_i\right) p\!\left(\lambda^{(l)}_i; \theta_\lambda\right) d\lambda^{(l)}_i. $$

The hyper-prior on the local scales $p(\lambda^{(l)}_i; \theta_\lambda)$ thus determines the distribution $p(w^{(l)}_{i,j}|\sigma^{(l)})$. For example, a Dirac delta distribution at $\lambda^{(l)}_i = 1$ makes $p(w^{(l)}_{i,j}|\sigma^{(l)})$ a Gaussian with mean zero and variance $\sigma^{(l)2}$, whereas an inverse Gamma distribution makes $p(w^{(l)}_{i,j}|\sigma^{(l)})$ equal to Student's t-distribution (Gelman et al., 2013). Many sparsity-inducing priors in the Bayesian paradigm can be interpreted as Gaussian scale mixture priors with additional local scale variables $\tau^{(l)}_i$. For example, the spike-and-slab prior (Mitchell and Beauchamp, 1988), the 'gold standard' for sparse models, introduces binary local scales $\tau^{(l)}_i$, interpreted as feature inclusion indicators, such that $w^{(l)}_{i,j}$ equals 0 with probability $1 - p$ (the spike) and with probability $p$ it is drawn from a Gaussian (the slab). Continuous local scales $\tau^{(l)}_i$ lead to other shrinkage priors, such as the horseshoe and the Dirichlet-Laplace (Bhattacharya et al., 2015), which are represented as global-local (GL) mixtures of Gaussians.
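To make the hierarchical GSM ARD parametrization concrete, the sketch below (our own illustrative code) draws one weight matrix from a spike-and-slab ARD prior in which each input node i has a binary inclusion indicator tau_i and a local scale lambda_i shared by all of its outgoing weights; the inverse Gamma hyper-prior on lambda_i is one of the choices mentioned above, and the specific hyper-parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab_ard(n_in, n_out, sigma=1.0, slab_prob=0.05,
                          lam_shape=2.0, lam_scale=1.0):
    """Draw one weight matrix W (n_in x n_out) from a spike-and-slab GSM ARD prior.

    tau_i ~ Bernoulli(slab_prob)        (feature inclusion indicator, shared per row)
    lam_i ~ Inv-Gamma(shape, scale)     (local scale, shared per row)
    W_ij | tau_i, lam_i ~ N(0, sigma^2 * lam_i^2 * tau_i)
    """
    tau = rng.binomial(1, slab_prob, size=n_in)                    # binary local scales
    lam = 1.0 / rng.gamma(lam_shape, 1.0 / lam_scale, size=n_in)   # inverse-Gamma draw
    row_std = sigma * lam * tau                                    # zero rows where tau_i = 0
    return rng.normal(0.0, 1.0, size=(n_in, n_out)) * row_std[:, None]

W = sample_spike_slab_ard(n_in=100, n_out=50)
print("fraction of active input features:", (np.abs(W).sum(axis=1) > 0).mean())
```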
Gaussian scale mixtures, i.e., Equation 7, are often written in an equivalent non-centered parametrization (Papaspiliopoulos et al., 2007), which has a better posterior geometry for inference (Betancourt and Girolami, 2015) than the hierarchical parametrization. Therefore, the non-centered parametrization has been widely used in the BNN literature (Louizos et al., 2017; Ghosh et al., 2018), and we follow this common practice as well.
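For reference, a minimal sketch of the two parametrizations in the notation above (the inclusion of the binary scale $\tau^{(l)}_i$ follows the spike-and-slab form; the exact factorization is an assumption of this sketch):

```latex
% Hierarchical (centered) parametrization
w^{(l)}_{i,j} \mid \sigma^{(l)}, \lambda^{(l)}_i, \tau^{(l)}_i
  \sim \mathcal{N}\!\left(0,\ \sigma^{(l)2}\,\lambda^{(l)2}_i\,\tau^{(l)2}_i\right)

% Non-centered parametrization: the same marginal prior, but inference is
% performed over the standard normal variable z^{(l)}_{i,j}.
w^{(l)}_{i,j} = \sigma^{(l)}\,\lambda^{(l)}_i\,\tau^{(l)}_i\,z^{(l)}_{i,j},
\qquad z^{(l)}_{i,j} \sim \mathcal{N}(0, 1)
```

Writing the weight as a deterministic product of its scales and a standard normal variable removes the strong prior dependence between the weights and their scales, which is the reason for the improved posterior geometry noted above.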
The hyper-parameter $\theta_\tau$ in $p(\tau^{(l)}_i; \theta_\tau)$ controls the prior sparsity level, often quantified by the number of relevant features. However, for continuous hyper-priors that do not force weights exactly to zero, e.g., the half-Cauchy prior in the horseshoe, it is not straightforward to select the hyper-parameter $\theta_\tau$ according to prior knowledge about sparsity. On the other hand, the existing discrete hyper-priors on $\tau^{(l)}_i$ model only restricted forms of sparsity, such as the Binomial distribution in the spike-and-slab prior in Equation 8. In Section 3, we propose an informative spike-and-slab prior consisting of a new class of discrete hyper-priors over the local scales $\tau^{(l)}_i$, capable of representing any type of sparsity. Moreover, the informative spike-and-slab makes the $\tau^{(l)}_i$ dependent, which leads to a heavier penalization of false features than in independent priors, such as the vanilla spike-and-slab, after correct features have been included.
It is well known that the scale parameter of the fully factorized Gaussian prior on BNN weights affects the variability of the functions drawn from the prior (Neal, 2012), and thus the PVE. When the PVE of the BNN has much probability mass around the correct PVE, the model is more likely to recover the true data generating mechanism (demonstration in Figure 2). As we will show in Section 4, for a BNN with the GSM prior defined in Equation 9, the hyper-priors on the local scales, $p(\lambda^{(l)}_i; \theta_\lambda)$ and $p(\tau^{(l)}_i; \theta_\tau)$, control the PVE jointly (Figure 3b). However, Figure 3b also shows that sparsity is determined by $p(\tau^{(l)}_i; \theta_\tau)$ alone. Consequently, we propose choosing $p(\tau^{(l)}_i; \theta_\tau)$ based on the prior knowledge on sparsity, and after that tuning $p(\lambda^{(l)}_i; \theta_\lambda)$ to achieve the desired level of the PVE, such that in the end our joint prior incorporates both types of prior knowledge.

Stein Gradient Estimator
The Stein Gradient Estimator (SGE) (Li and Turner, 2018) provides a computational way to approximate the gradient of a log density (i.e., $\nabla_z \log q(z)$), requiring only samples from $q(z)$ instead of its analytical form. Central to the derivation of the SGE is Stein's identity (Liu et al., 2016): for a test function $h(z)$ satisfying appropriate boundary conditions, the following identity holds:

$$ \mathbb{E}_{q(z)}\!\left[ h(z)\, \nabla_z \log q(z)^\top + \nabla_z h(z) \right] = 0. $$

The SGE estimates $\nabla_z \log q(z)$ by inverting Equation 11 and approximating the expectation with $K$ Monte Carlo samples $\{z^{(1)}, \ldots, z^{(K)}\}$ from $q(z)$, such that

$$ -\frac{1}{K} \sum_{k=1}^{K} h\!\left(z^{(k)}\right) \nabla_{z^{(k)}} \log q\!\left(z^{(k)}\right)^\top \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_z h\!\left(z^{(k)}\right), $$

where $G = \left(\nabla_{z^{(1)}} \log q(z^{(1)}), \ldots, \nabla_{z^{(K)}} \log q(z^{(K)})\right)^\top$ contains the gradients $\nabla_z \log q(z)$ for the $K$ samples. A ridge regression estimator is then designed to estimate $G$ by adding an $\ell_2$ regularizer:

$$ \hat{G} = \arg\min_{G}\; \Big\| \overline{\nabla h} + \tfrac{1}{K} H G \Big\|_F^2 + \tfrac{\eta}{K^2}\,\|G\|_F^2, $$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, $H = \left(h(z^{(1)}), \ldots, h(z^{(K)})\right)$, $\overline{\nabla h} = \tfrac{1}{K}\sum_k \nabla_z h(z^{(k)})$, and the penalty $\eta \geq 0$. By solving Equation 13, the SGE is obtained:

$$ \hat{G}^{\mathrm{Stein}} = -\left(\mathbf{K} + \eta I\right)^{-1} \langle \nabla, \mathbf{K} \rangle, \qquad \mathbf{K}_{ij} = \mathcal{K}\!\left(z^{(i)}, z^{(j)}\right), \quad \langle \nabla, \mathbf{K} \rangle_{ij} = \sum_{k=1}^{K} \nabla_{z^{(k)}_j} \mathcal{K}\!\left(z^{(i)}, z^{(k)}\right), $$

where $\mathcal{K}(\cdot, \cdot)$ is the kernel function. It has been shown that the RBF kernel satisfies Stein's identity and is a default choice for the Stein gradient estimator.
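The following NumPy sketch (our own implementation written from the description above; the median-heuristic bandwidth and the sanity check are our additions) estimates the score ∇_z log q(z) from samples using the RBF kernel:

```python
import numpy as np

def rbf_kernel(Z, bandwidth=None):
    """RBF kernel matrix K and the summed gradients w.r.t. the second argument."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:                            # median heuristic
        bandwidth = np.sqrt(0.5 * np.median(sq_dists) + 1e-12)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))
    # grad_K[i, d] = sum_k d/dz^(k)_d K(z^(i), z^(k)) for the RBF kernel
    diffs = Z[:, None, :] - Z[None, :, :]            # z^(i) - z^(k)
    grad_K = (K[:, :, None] * diffs / bandwidth ** 2).sum(axis=1)
    return K, grad_K

def stein_gradient_estimate(Z, eta=0.01):
    """Estimate G[k] ~ grad_z log q(z^(k)) from samples Z (K x d)."""
    K, grad_K = rbf_kernel(Z)
    return -np.linalg.solve(K + eta * np.eye(len(Z)), grad_K)

# Sanity check on a standard Gaussian, where grad_z log q(z) = -z.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 1))
G = stein_gradient_estimate(Z)
print(np.corrcoef(G[:, 0], -Z[:, 0])[0, 1])          # close to 1
```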

Prior knowledge about sparsity
In this section, we propose a new hyper-prior for the local scales $p(\tau^{(l)}_i; \theta_\tau)$ to model prior beliefs about sparsity. The new prior generates the local scales conditionally on the number of relevant features, which allows us to explicitly express prior knowledge about the number of relevant features. We focus on the case where each local scale $\tau^{(l)}_i$ is binary with domain $\{0, 1\}$, analogously to the feature inclusion indicators in the spike-and-slab prior.

Prior on the number of relevant features
We control sparsity by placing a prior on the number of relevant features $m$ using a probability mass function $p_m(m; \theta_m)$, where $0 \leq m \leq D$ ($D$ is the dimension of the dataset). Intuitively, if $p_m$ concentrates close to 0, a sparse model with few features is preferred; if $p_m$ places much probability mass close to $D$, then all of the features are likely to be used instead. Hence, unlike other priors encouraging shrinkage, such as the horseshoe, our new prior easily incorporates experts' knowledge about the number of relevant features. In practice, $p_m(m; \theta_m)$ is chosen based on the available prior knowledge. When there is a good idea about the number of relevant features, a unimodal distribution, such as a discretized Laplace, can be used:

$$ p_m(m; \theta_m) = c_n \exp\!\left(-s_m |m - \mu_m|\right), \qquad m \in \{0, \ldots, D\}, $$

where $\mu_m$ is the mode, $s_m$ is the precision, and $c_n$ is the normalization constant. Often only an interval for the number of relevant features is available. Then it is possible to use, for example, a 'flattened' Laplace (Figure 1):

$$ p_m(m; \theta_m) = c_n \exp\!\left(-s_m \max\!\left(0,\; \mu_- - m,\; m - \mu_+\right)\right), \qquad m \in \{0, \ldots, D\}, $$

where $[\mu_-, \mu_+]$ defines the interval where the probability is uniform and reaches its maximum value, and $c_n$ is the corresponding normalization constant. Equations 15 and 16 include the (discretized) exponential distribution as a special case with $\mu_m = 0$ and $\mu_- = \mu_+ = 0$, respectively; it has been widely studied in the sparse deep learning literature (Polson and Ročková, 2018; Wang and Ročková, 2020). The 'flattened' Laplace, with a high precision $s_m$, is a continuous approximation of the distribution with uniform probability mass within $[\mu_-, \mu_+]$ and 0 outside of $[\mu_-, \mu_+]$ (blue in Figure 4). If there is no prior information about sparsity, a discrete uniform prior over $[0, D]$ is a plausible alternative. See Figure 4 for a visualization.
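A small sketch of this family of priors (our own code; the parametrization follows Equations 15-16 as written above):

```python
import numpy as np

def flattened_laplace_pmf(D, mu_minus, mu_plus, s_m):
    """Discrete 'flattened' Laplace over the number of relevant features m in {0,...,D}.

    The pmf is constant on [mu_minus, mu_plus] and decays exponentially (with
    precision s_m) outside the interval. Setting mu_minus == mu_plus recovers the
    discretized Laplace; mu_minus = mu_plus = 0 gives the discretized exponential.
    """
    m = np.arange(D + 1)
    dist = np.maximum(0, np.maximum(mu_minus - m, m - mu_plus))
    unnormalized = np.exp(-s_m * dist)
    return unnormalized / unnormalized.sum()

# Prior belief: roughly 0-30 out of 400 features are relevant.
pm = flattened_laplace_pmf(D=400, mu_minus=0, mu_plus=30, s_m=5.0)
print(pm[:32].sum())   # nearly all probability mass on m <= 31
```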

Feature allocation
Conditionally on the number of relevant features $m$, we introduce binary indicators $I_1, \ldots, I_D \in \{0, 1\}$, where $I_i = 1$ denotes that feature $i$ is relevant and $\sum_{i=1}^{D} I_i = m$. We then marginalize over $m$ using $p_m(m; \theta_m)$. We assume there is no prior knowledge about the relative importance of features (this assumption can be easily relaxed if needed), i.e., $\{I_i\}_{i=1}^{D}$ has a jointly uniform distribution given $m$:

$$ p\!\left(I_1, \ldots, I_D \mid m\right) = c_d\, \mathbb{1}\!\left\{\textstyle\sum_{i=1}^{D} I_i = m\right\}, $$

where the normalization constant is $c_d = \binom{D}{m}^{-1}$. Now we can calculate the joint distribution of the indicators by marginalizing over $m$. When the local scale variables $\tau^{(l)}_i$ are binary, the $\tau^{(l)}_i$ take the role of the indicator variables $I_i$. Thus we obtain a joint distribution over the discrete scale parameters $\tau^{(l)}_i$ as

$$ p\!\left(\tau^{(l)}_1, \ldots, \tau^{(l)}_D; \theta_\tau\right) = \binom{D}{\sum_i \tau^{(l)}_i}^{-1} p_m\!\left(\textstyle\sum_{i=1}^{D} \tau^{(l)}_i;\; \theta_m\right), $$

where $\theta_\tau$ represents the same set of hyper-parameters as $\theta_m$, and the distribution $p_m(\sum_i \tau^{(l)}_i; \theta_m)$ models the beliefs about the number of relevant features. In general, the local scales $\{\tau^{(l)}_i\}$ in Equation 19 are dependent. However, when $p_m(\cdot)$ is set to a Binomial (or its Gaussian approximation), the joint distribution of $\{\tau^{(l)}_i\}$ factorizes into a product of independent Bernoullis, corresponding to the vanilla spike-and-slab with a fixed slab probability (Equation 8). We refer to Equation 19 as the informative spike-and-slab prior.
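Putting the two ingredients together, the sketch below (our own code; the flattened Laplace from the previous snippet is inlined) evaluates the joint log-probability of Equation 19 and samples τ by first drawing m from p_m and then placing the m active indicators uniformly at random:

```python
import numpy as np
from scipy.special import gammaln

def log_binom(D, m):
    """log of the binomial coefficient C(D, m)."""
    return gammaln(D + 1) - gammaln(m + 1) - gammaln(D - m + 1)

def informative_ss_logpmf(tau, pm):
    """log p(tau) = log p_m(sum(tau)) - log C(D, sum(tau))  (Equation 19)."""
    D, m = len(tau), int(np.sum(tau))
    return np.log(pm[m]) - log_binom(D, m)

def informative_ss_sample(pm, D, rng):
    """Sample tau by drawing m ~ p_m, then choosing m active features uniformly."""
    m = rng.choice(len(pm), p=pm)
    tau = np.zeros(D, dtype=int)
    tau[rng.choice(D, size=m, replace=False)] = 1
    return tau

rng = np.random.default_rng(0)
D = 400
# Flattened Laplace p_m on [0, 30] (same form as the previous sketch), inlined here.
m_grid = np.arange(D + 1)
pm = np.exp(-5.0 * np.maximum(0, m_grid - 30))
pm /= pm.sum()

tau = informative_ss_sample(pm, D, rng)
print(tau.sum(), informative_ss_logpmf(tau, pm))
```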
In BNNs, we suggest using the informative spike-and-slab prior on the first layer to inform the model about sparsity at the feature level. For hidden layers, we do not assume prior domain knowledge, as they encode latent representations about which such knowledge is rare. However, a uniform prior on the number of hidden nodes could be applied to the hidden layers to infer optimal layer sizes from data by computing the posteriors, which we leave for future work. In this work, for the hidden layers we use the standard Gaussian scale mixture priors in Equation 5.

Prior knowledge on the PVE
After incorporating prior knowledge about sparsity into the new informative hyper-prior $p(\tau^{(l)}_i; \theta_\tau)$, in this section we introduce an optimization approach to determine the hyper-parameters (i.e., $\theta_\lambda$) of the hyper-prior on the other local scale parameters, $p(\lambda^{(l)}_i; \theta_\lambda)$, such that the distribution of the model's PVE matches the prior belief about the PVE.

PVE for Bayesian neural networks
According to the definition of the PVE, when $f(x; w)$ is a BNN, $\mathrm{PVE}(w)$ has a distribution induced by the distribution on $w$ via Equation 3. We denote the variance of the unexplainable noise in the regression model in Equation 2 by $\sigma_\epsilon^2$. We use $w^{(\sigma, \theta_\lambda, \theta_\tau)}$ to denote the BNN weights with a GSM prior (i.e., Equation 9) parametrized by hyper-parameters $\{\sigma, \theta_\lambda, \theta_\tau\}$, where $\sigma$ is $\{\sigma^{(0)}, \ldots, \sigma^{(L)}\}$. The PVE of a BNN with a GSM prior is given by

$$ \mathrm{PVE}\!\left(w^{(\sigma, \theta_\lambda, \theta_\tau)}\right) = \frac{\mathrm{Var}\!\left(f\!\left(X; w^{(\sigma, \theta_\lambda, \theta_\tau)}\right)\right)}{\mathrm{Var}\!\left(f\!\left(X; w^{(\sigma, \theta_\lambda, \theta_\tau)}\right)\right) + \sigma_\epsilon^2}. $$

The noise $\sigma_\epsilon$ and the layer-wise global scales $\sigma$ are usually given the same non-informative priors (Zhang et al., 2018) or set to a default value (Blundell et al., 2015). Setting of the hyper-parameter $\theta_\tau$ was described in Section 3. Therefore, we optimize the remaining hyper-parameter $\theta_\lambda$ to make the distribution of the PVE match our prior knowledge about the PVE.

Optimizing hyper-parameters according to prior PVE
Denote the available prior knowledge about the PVE by $p(\rho)$. In practice such a prior may be available from previous studies, and here we assume it can be encoded into the reasonably flexible Beta distribution. If no prior information is available, a uniform prior, i.e., Beta(1, 1), can be used. We incorporate such knowledge into the prior by optimizing the hyper-parameter $\theta_\lambda$ such that the distribution induced by the BNN weight prior $p(w; \sigma, \theta_\lambda, \theta_\tau)$ on the model's PVE, denoted by $q_{\theta_\lambda}(\rho(w))$, is close to $p(\rho)$. We achieve this by minimizing the Kullback-Leibler divergence from $q_{\theta_\lambda}(\rho(w))$ to $p(\rho)$ w.r.t. the hyper-parameter $\theta_\lambda$, i.e., $\theta_\lambda^* = \arg\min_{\theta_\lambda} \mathrm{KL}[q_{\theta_\lambda}(\rho(w))\,\|\,p(\rho)]$. However, the KL divergence is not analytically tractable because $q_{\theta_\lambda}(\rho(w))$ is defined implicitly: we can only sample from $q_{\theta_\lambda}(\rho(w))$ but cannot evaluate its density. We first observe that the KL divergence can be approximated by

$$ \mathrm{KL}\!\left[q_{\theta_\lambda}(\rho)\,\|\,p(\rho)\right] \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \log q_{\theta_\lambda}\!\left(\rho\!\left(g(\xi^{(m)}; \theta_\lambda)\right)\right) - \log p\!\left(\rho\!\left(g(\xi^{(m)}; \theta_\lambda)\right)\right) \right], \qquad \xi^{(m)} \sim p(\xi), $$

by reparametrization and Monte Carlo integration. Here we assume that the GSM distribution $p(w; \theta_\lambda)$ can be reparametrized by a deterministic function $w = g(\xi; \theta_\lambda)$ with $\xi \sim p(\xi)$. For non-reparametrizable distributions, score function estimators can be used instead, which we leave for future work. Moreover, since $\mathrm{PVE}(w)$ is another deterministic function of $w$ given the data $X$, we have $\mathrm{PVE}(w) = \rho(w) = \rho(g(\xi; \theta_\lambda))$. The expectation is approximated by $M$ samples from $p(\xi)$. The gradient of the KL w.r.t. $\theta_\lambda$ can then be calculated as

$$ \nabla_{\theta_\lambda} \mathrm{KL} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta_\lambda} \rho\!\left(g(\xi^{(m)}; \theta_\lambda)\right) \left[ \nabla_\rho \log q_{\theta_\lambda}(\rho) - \nabla_\rho \log p(\rho) \right]\Big|_{\rho = \rho(g(\xi^{(m)}; \theta_\lambda))}. $$

The first term, $\nabla_{\theta_\lambda} \rho(g(\xi^{(m)}; \theta_\lambda))$, can be calculated exactly with back-propagation packages such as PyTorch. The last term, the gradient of the log density $\nabla_\rho \log p(\rho)$, is tractable for a prior with a tractable density, such as the Beta distribution. However, the derivative $\nabla_\rho \log q_{\theta_\lambda}(\rho)$ is generally intractable, as the distribution of the PVE of a BNN, $q_{\theta_\lambda}(\rho)$, is implicitly defined by Equation 20. We propose to apply the SGE (Section 2.3) to estimate $\nabla_\rho \log q_{\theta_\lambda}(\rho)$, which only requires samples from $q_{\theta_\lambda}(\rho)$.
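To illustrate the whole procedure, the PyTorch sketch below tunes a single scalar local-scale hyper-parameter of a small ReLU network so that the induced PVE distribution moves toward a Beta prior; the architecture, the scalar parametrization of θ_λ, the fixed noise variance, and the one-dimensional Stein gradient estimator are our own simplifications, not the paper's exact implementation.

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 20)                              # fixed inputs
sizes = [20, 30, 1]                                   # small ReLU network
log_sigma_lam = torch.zeros((), requires_grad=True)   # theta_lambda (scalar, ours)
noise_var = 1.0                                       # sigma_eps^2 fixed for illustration
alpha, beta = 1.5, 3.0                                # prior PVE ~ Beta(1.5, 3.0)

def sample_pve(M=64):
    """Reparametrized draws rho(g(xi; theta_lambda)) of the model's PVE."""
    rho = []
    for _ in range(M):
        h = X
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            w = log_sigma_lam.exp() * torch.randn(d_in, d_out) / d_in ** 0.5
            h = torch.relu(h @ w) if d_out != 1 else h @ w
        var_f = h.var()
        rho.append(var_f / (var_f + noise_var))
    return torch.stack(rho)

def stein_score(z, eta=0.01):
    """Stein gradient estimate of d/dz log q(z) for 1-d samples z (RBF kernel)."""
    diffs = z[:, None] - z[None, :]
    bw = (0.5 * diffs.pow(2).median()).sqrt() + 1e-6
    K = torch.exp(-diffs.pow(2) / (2 * bw ** 2))
    grad_K = (K * diffs / bw ** 2).sum(dim=1)
    return -torch.linalg.solve(K + eta * torch.eye(len(z)), grad_K)

opt = torch.optim.Adam([log_sigma_lam], lr=0.05)
for step in range(300):
    rho = sample_pve()
    with torch.no_grad():
        d_log_q = stein_score(rho)                               # intractable term via SGE
        d_log_p = (alpha - 1) / rho - (beta - 1) / (1 - rho)     # Beta log-density gradient
    # Surrogate whose gradient matches (1/M) sum grad_theta rho * (d_log_q - d_log_p).
    surrogate = (rho * (d_log_q - d_log_p)).mean()
    opt.zero_grad()
    surrogate.backward()
    opt.step()
```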
When the noise $\sigma_\epsilon$ and the layer-wise global scales $\sigma$ are given non-informative priors, e.g., Inv-Gamma(0.001, 0.001), drawing samples directly according to Equation 20 is unstable, because the variance of the non-informative prior does not exist. Fortunately, if all activation functions of the BNN are positively homogeneous (e.g., ReLU), we have the following theorem (proof is given in the Supplementary):

Theorem 2. Assume an $L$-layer BNN with a Gaussian scale mixture prior of the form of Equation 9. If all activation functions are positively homogeneous, e.g., ReLU, we have

$$ f\!\left(X; w^{(\sigma, \theta_\lambda, \theta_\tau)}\right) = \left(\prod_{l=0}^{L} \sigma^{(l)}\right) f\!\left(X; w^{(1, \theta_\lambda, \theta_\tau)}\right). $$

Now, instead of giving non-informative priors to all layer-wise global scales, i.e., $\sigma = \sigma^{(0)} = \ldots = \sigma^{(L)} \sim$ Inv-Gamma(0.001, 0.001), we propose to set $\sigma^{(0)} = \ldots = \sigma^{(L-1)} = 1$ and to place the non-informative prior only on the noise $\sigma_\epsilon$ and the last layer-wise global scale $\sigma^{(L)}$, so that, by Theorem 2, the PVE samples required for the optimization can be computed from $\mathrm{Var}(f(X; w^{(1, \theta_\lambda, \theta_\tau)}))$, which avoids sampling from the non-informative inverse Gamma distribution. Figure 5 illustrates the proposed approach, where we applied the method to two 3-layer BNNs containing 100-50-30-1 nodes and the ReLU activation. We considered two GSM ARD priors for the BNN weights: a mean-field Gaussian prior and a hierarchical Gaussian prior with the non-informative prior on the noise and the last layer-wise global scale. For the Gaussian prior, we optimized $\sigma_\lambda$ according to the prior PVE. For the hierarchical Gaussian prior, we optimized $\beta$ while fixing $\alpha = 2$, because the shape parameter of the Gamma distribution is non-reparametrizable. We see that after optimizing the hyper-priors, the simulated empirical distributions of the PVE are close to the corresponding prior knowledge in both cases, especially when the prior knowledge is informative, such as Beta(1, 5) or Beta(5, 1). Moreover, we observe that even when we have fixed the shape parameter $\alpha$ of the inverse Gamma, the prior is still flexible enough to approximate the prior PVE well.

Related literature
Priors on the number of relevant features have been applied on small datasets to induce sparsity in shallow models, e.g. NNs with one hidden layer (Vehtari et al., 2001), including the geometric (Insua and Müller, 1998) and truncated Poisson (Denison et al., 1998;Insua and Müller, 1998;Andrieu et al., 2000;Kohn et al., 2001) distributions. However, those approaches rely on the reversible jump Markov chain Monte Carlo (RJMCMC) to approximate the posterior (Phillips and Smith, 1996;Sykacek, 2000;Andrieu et al., 2013), which does not scale up to deep architectures and large datasets. In this work, we incorporate such prior beliefs into the hyper-prior on the local scales of the Gaussian scale mixture prior; thus, the posterior can be approximated effectively by modern stochastic variational inference (Hoffman et al., 2013), even for deep NN architectures and large datasets.
Priors on the PVE have been proposed to fine-tune hyper-parameters (Zhang et al., 2018) and to construct shrinkage priors (Zhang et al., 2020).

Priors on BNN weights: BNNs with a fully factorized Gaussian prior were proposed by Graves (2011) and Blundell et al. (2015). They can be interpreted as NNs with dropout by using a mixture of Dirac delta distributions to approximate the posterior (Gal and Ghahramani, 2016). Nalisnick et al. (2019) extended these works and showed that NNs with any multiplicative noise can be interpreted as BNNs with GSM ARD priors. Priors over weights with low-rank assumptions, such as the k-tied normal (Swiatkowski et al., 2020) and the rank-1 perturbation (Dusenberry et al., 2020), were found to have better convergence rates and the ability to capture multiple modes when combined with ensemble methods. Matrix-variate Gaussian priors were proposed by Neklyudov et al. (2017) and Sun et al. (2017) to improve the expressiveness of the prior by accounting for correlations between the weights. Some priors have been proposed to induce sparsity, such as the log-uniform (Louizos et al., 2017), log-normal (Neklyudov et al., 2017), horseshoe (Louizos et al., 2017; Ghosh et al., 2018), and spike-and-slab priors (Deng et al., 2019). However, none of these works has proposed how to explicitly encode domain knowledge into the prior on the NN weights.
Informative priors of BNNs: The building of informative priors for NNs has been studied in the function space. One common type of prior information concerns the behavior of the output for certain inputs. Noise contrastive priors (NCPs) (Hafner et al., 2018) were designed to encourage reliably high uncertainty for OOD (out-of-distribution) data points. Gaussian processes have been proposed as a way of defining functional priors because of their ability to encode rich functional structures.

Experiments
In this section, we first compare the proposed informative sparse prior with alternatives in a feature selection task on synthetic toy datasets. We then apply it to seven public real-world UCI datasets, where we tune the level of noise and the fraction of informative features. Finally, in a genetics application we show that incorporating domain knowledge on both sparsity and the PVE can improve results in a realistic scenario.

Synthetic data
Setup: We first validate the performance of the informative spike-and-slab prior proposed in Section 3 on a feature selection task, using synthetic datasets similar to those discussed by Van Der Pas et al. (2014). Instead of a BNN, we here use linear regression, i.e., a NN without hidden layers, which enables comparing the proposed strategy of encouraging sparsity with existing alternatives.
Consider $n$ datapoints generated by

$$ y_i = w_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad i = 1, \ldots, n, $$

where each observation $y_i$ is obtained by adding Gaussian noise with variance $\sigma^2$ to the signal $w_i$. We set the number of datapoints $n$ to 400, the first $p_0 = 20$ signals $\{w_i \mid i = 1, \ldots, 20\}$ equal to $A$ (the signal level), and the rest of the signals to 0. We consider 3 different noise levels $\sigma \in \{1, 1.5, 2\}$ and 10 different signal levels $A \in \{1, 2, \ldots, 10\}$. For each parameter setting (30 in all), we generate 100 data realizations. The model in Equation 27 can be considered a linear regression, $y = X\hat{w}^\top + \epsilon$, where $X = I$ and $\hat{w} = (A, \ldots, A, 0, 0, \ldots, 0)$ with the first $p_0$ elements equal to $A$, so this is a feature selection task where the number of features and datapoints are equal. We use the mean squared error (MSE) between the posterior mean signal $\bar{w}$ and the true signal $\hat{w}$ to measure performance.
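A minimal sketch of one such data realization (our own code, following the description above):

```python
import numpy as np

def make_sequence_data(n=400, p0=20, A=6.0, sigma=2.0, seed=0):
    """y_i = w_i + eps_i with the first p0 true signals equal to A and the rest 0."""
    rng = np.random.default_rng(seed)
    w_true = np.concatenate([np.full(p0, A), np.zeros(n - p0)])
    y = w_true + rng.normal(scale=sigma, size=n)
    return y, w_true

y, w_true = make_sequence_data()
# Evaluation: MSE between an estimate w_bar and the true signal w_true.
mse = np.mean((y - w_true) ** 2)   # the trivial estimate w_bar = y gives MSE ~ sigma^2
print(mse)
```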
Parameter settings: For estimation we use linear regression with the correct structure. We consider 4 different Gaussian scale mixture priors on $w$ that all follow the general form above. The details of the different priors are shown in Table 1. For all the spike-and-slab (SS) priors, we place a diffuse inverse Gamma prior on $p(\lambda_i)$.

Figure 6: Results for the two best models (DeltaSS and InfoSS) are additionally shown with a larger signal A = 10 (the last column). Beta spike-and-slab (BetaSS) is slightly worse than Delta spike-and-slab (DeltaSS), because the latter uses the correct slab probability. Informative spike-and-slab (InfoSS) outperforms the alternatives by making the signals dependent. Horseshoe (HS) with the correct sparsity level overfits to the noise.

For the BetaSS and DeltaSS
priors, we assume that $p(\tau_i) = \mathrm{Bernoulli}(p)$, and define $p \sim \mathrm{Beta}(2, 38)$ for BetaSS and $p = 0.05$ for DeltaSS, which both reflect the correct level of sparsity. For the informative spike-and-slab, InfoSS, we use the 'flattened' Laplace (FL) prior defined in Equation 16 with $\mu_- = 0$, $\mu_+ = 30$, and $s_m = 5$, to encode the prior knowledge that the number of non-zero signals is (approximately) uniform on $[0, 30]$. For the horseshoe (HS), we place an informative half-Cauchy prior $C^+(0, \tau_0^2)$ on the global shrinkage scale $\tau$ with $\tau_0 = p_0/(n - p_0)$, to assume the same sparsity level as the other priors.

Figure 7: Synthetic datasets: Mean squared error (MSE) between the estimated and true signals. The bars represent 95% confidence intervals over 100 datasets. The novel InfoSS prior is indistinguishable from the other SS priors when the noise level is low. However, InfoSS is significantly more accurate than the other priors when there is more noise.
Results: Figures 6 and 7 show the results on the synthetic datasets. Figure 6 provides detailed visualizations of the results for the four priors with two representative signal levels, A = 6 and A = 10, and with noise level σ = 2. Figure 7 summarizes the results for all signal and noise levels. From Figure 7, we see that BetaSS with p ∼ Beta(2, 38) and DeltaSS with p = 0.05 perform similarly when the noise level is low, but DeltaSS is better than BetaSS for higher noise levels. This is because the DeltaSS prior is more concentrated close to the true sparsity level; thus, it penalizes false signals more strongly (top-left and bottom panels in Figure 6). The InfoSS prior has indistinguishable performance from the other SS priors when the noise level is low, but with high noise, e.g., σ = 2.0, InfoSS is significantly better, especially when the signal is large (A > 6). This is because InfoSS places a prior on the number of features directly, which makes the signals $w_i$ dependent, and consequently including correct signals can help remove incorrect signals. In contrast, the signals are independent of each other in the other priors considered; thus, selecting true signals will not help remove false findings. Another observation is that the horseshoe prior (HS) works well when there is little noise (e.g., σ = 1), but for larger values of σ, HS is much worse than all the spike-and-slab alternatives because it can easily overfit the noise (bottom-left panel in Figure 6).

Public real-world UCI datasets
Setup: We analyze 7 publicly available datasets: Bike sharing, California housing prices, Energy efficiency, Concrete compressive strength, Yacht hydrodynamics, Boston housing, and kin8nm, whose characteristics, the number of individuals N and the number of features D, are given in Table 3. We carry out two types of experiments. Original datasets: we analyze the datasets as such, in which case there is no domain knowledge about sparsity. Extended datasets: we concatenate 100 irrelevant features with the original features and add Gaussian noise to the dependent variable such that the PVE in the data is at most 0.2, which allows us to specify informative priors about sparsity (the number of relevant features is at most the number of original features) and the PVE (less than 0.2). We examine whether the performance can be improved by encoding this extra knowledge into the prior. We use 80% of the data for training and 20% for testing. We use the PVE (i.e., R²) on the test set to evaluate the performance. We repeat each experiment on 50 different data splits to obtain confidence intervals.

Table 2: Details of the seven Gaussian scale mixture priors included in the comparison on the UCI datasets. Hyper-parameters in MF+CV and SS+CV (local scale σ_λ and spike probability p) are chosen via 5-fold cross-validation on the training set. The hyper-parameter β is optimized to match the prior PVE, and µ_+ is equal to the number of features in the corresponding original dataset without the artificially added irrelevant features.
Parameter settings: We considered 7 different priors: 1. mean-field (independent) Gaussian prior with cross-validation (MF+CV) (Blundell et al., 2015); 2. delta spike-and-slab prior with cross-validation (SS+CV) (Blundell et al., 2015); 3. horseshoe prior (HS) (Ghosh and Doshi-Velez, 2017); 4. hierarchical Gaussian prior (HMF), which is the same as MF+CV except that the hyper-parameters receive a fully Bayesian treatment instead of cross-validation; 5. InfoHMF, which incorporates domain knowledge on feature sparsity into the HMF by applying the proposed informative prior in the input layer; 6. HMF+PVE, which instead includes the informative prior on the PVE; and finally, 7. InfoHMF+PVE, which includes both types of domain knowledge. Lasso regression with cross-validated regularization (Lasso+CV) is included as another standard baseline (Tibshirani, 2011).
The hyper-parameters for the MF+CV and SS+CV priors are chosen by 5-fold cross-validation on the training set from the grids $\sigma_\lambda \in \{e^{-2}, e^{-1}, 1, e^{1}, e^{2}\}$ and $p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$, which are wider than those used in the original work by Blundell et al. (2015). We define HS as suggested by Ghosh et al. (2018), such that the scale $\tau^{(l)}_i = \tau^{(l)}$ is shared by all weights in each layer $l$. In the HMF, we use a non-informative prior on the local scales $\lambda^{(l)}_i$. We regard MF+CV, SS+CV, HS, and HMF as strong benchmarks to compare our novel informative priors against. For InfoHMF, we use the 'flattened' Laplace (FL) prior with $\mu_- = 0$ and $\mu_+ = D$ ($D$ is the number of features in the original dataset) on the input layer to encode the prior knowledge about feature sparsity. For HMF+PVE and InfoHMF+PVE, we optimize the hyper-parameter $\beta$ to match the PVE of the BNN with a Beta(5.0, 1.2) for the original datasets (mode 0.95), and with a Beta(1.5, 3.0) for the extended datasets (mode 0.20). For all priors that are not informative about the PVE (except the HS), we use an Inv-Gamma(0.001, 0.001) prior for all the layer-wise global scales $\sigma$ and the noise $\sigma_\epsilon$. For priors informative about the PVE, the non-informative prior is used only for the last layer-wise global scale $\sigma^{(L)}$ and the noise $\sigma_\epsilon$ (see Section 4). The details of each prior are summarized in Table 2.

Table 3: Test PVE with 1.96 standard errors of the mean (in parentheses) for each prior on the UCI datasets. The first seven rows show the experimental results on the original datasets where we have no prior information, and the last seven rows on the extended datasets with 100 irrelevant features and injected noise. The best result in each column is boldfaced. The dimension (P) and size (N) are shown for each dataset. We see that both information about sparsity and about the PVE improve the performance, especially when prior information is available (on the extended datasets).
Results: The results, in terms of test PVE, are reported in Table 3. For the original datasets, we see that incorporating the prior knowledge on the PVE (HMF+PVE and InfoHMF+PVE) always yields at least as good performance as the corresponding prior without this knowledge (HMF and InfoHMF). Indeed, HMF+PVE has the (shared) highest accuracy on all datasets except Boston. The newly proposed informative sparsity-inducing prior (InfoHMF) does not improve the performance here, as we do not have prior knowledge on sparsity in the original datasets. Among the non-informative priors, HMF is slightly better than the rest, which demonstrates the benefit of the fully Bayesian treatment over cross-validation of the hyper-parameters, except for the Boston housing dataset, where the horseshoe prior (HS) achieves the highest test PVE. The linear method, Lasso+CV, is worse than all BNNs on most datasets.
In the extended datasets, with the 100 extra irrelevant features and noise added to the target, knowledge of both the PVE and sparsity improves performance significantly. For most of the datasets both types of prior knowledge are useful, and consequently InfoHMF+PVE is the most accurate on 5 out of 7 datasets. Furthermore, its PVE is also close to 20% of the maximum test PVE in the corresponding original dataset, reflecting the fact that noise was injected to keep only 20% of the signal. We find that the horseshoe (HS) works better than the HMF on small datasets, especially Boston, where the HS outperforms the others. The priors MF+CV and SS+CV do not work well on the extended datasets, and they are even worse than Lasso+CV, because cross-validation has a large variance on the noisy datasets, especially for flexible models such as BNNs. The more computationally intensive repeated cross-validation (Kuhn et al., 2013) might alleviate the problem, but its further exploration is left for future work. Overall, we conclude that incorporating knowledge of the PVE and sparsity into the prior can improve performance; however, the amount of improvement can be small if the dataset is large (California and Bike) or the prior knowledge is weak (the original datasets).

Metabolite prediction using genetic data
Setup: Genome-wide association studies (GWASs) aim to learn associations between genetic variants called SNPs (input features) and phenotypes (targets). Ultimately, the goal is to predict a given phenotype from the SNPs of an individual. This task is extremely challenging because 1. the input features are very high-dimensional and strongly correlated, and 2. the features may explain only a tiny fraction of the variance of the phenotype, e.g., less than 1%. In such a case, neural networks may overfit severely and have worse accuracy than simple prediction by the mean. Typical approaches employ several heuristic but crucial preprocessing steps to reduce the input dimension and correlation. However, strong domain knowledge about sparsity and the amount of variance explained by the SNPs is available, and we show that by incorporating this knowledge in the informative prior we can predict accurately where alternatives fail.
We apply the proposed approach on the FINRISK dataset (Borodulin et al., 2018), which contains genetic data and 228 different metabolites as phenotypes for 4,620 individuals. We select six genes that have previously been found to be associated with the metabolites (Kettunen et al., 2016). We use the SNPs in each gene as features to predict the metabolite most strongly associated with the gene, resulting in 6 different experiments. We make predictions using the posterior mean and evaluate the performance by the PVE (larger is better) on test data. We use 50% of the data for training and 50% for testing, and we repeat this 50 times for each of the six experiments (i.e., for each gene), to account for the variability due to BNN training.
Parameter settings: We train BNNs with 1 hidden layer having 100 hidden nodes; the complexity of the data prevents the use of more complex models, but even the single hidden layer increases flexibility and improves accuracy (see results). We consider six priors: the mean-field Gaussian with the local scales $\lambda^{(l)}_i$ set by cross-validation (MF+CV) or optimized using the PVE (MF+PVE); the delta spike-and-slab, where the slab probability $p$ is cross-validated and the local scales are set either by cross-validation or by the PVE (SS+CV and SS+PVE); and finally, the mean-field prior including the informative prior about feature sparsity, with the local scales either cross-validated (InfoMF+CV) or set using the PVE (InfoMF+PVE). We use the 'flattened' Laplace (FL) prior with $\mu_- = 0$ and $\mu_+ = 0.2D$, where $D$ is the number of SNPs in a given gene, to reflect the prior belief that less than 20% of the SNPs in the gene affect the phenotype. To encode the knowledge of the PVE, we optimize the hyper-parameter $\sigma_\lambda$ to match the mode of the PVE with previous findings (Kettunen et al., 2016). We also tried other priors, such as the hierarchical Gaussian and the horseshoe, but they failed to capture any signal even with prior knowledge on sparsity and the PVE, hence they are not included in the final comparison. The priors are summarized in Table 4. We include Lasso regression with cross-validated regularization as a strong linear baseline (Lello et al., 2018).

Table 4: The six Gaussian scale mixture priors included in the genetics experiment. Hyper-parameters: the local scale σ_λ and the spike probability p are chosen via 5-fold cross-validation on the training set; the informative local scale σ_λ is optimized to match the prior PVE; µ_+ equals 20% of the number of SNPs in the corresponding gene.
Results: Figure 8 shows the results for the 6 experiments (genes). We see that using the prior knowledge on the PVE always improves accuracy (purple bars). Without the prior on the PVE, the mean-field Gaussian prior can overfit severely (blue bars, negative test PVE). Furthermore, the novel informative sparse prior performs better than or similarly to the spike-and-slab prior with the cross-validated slab probability (compare InfoMF vs. SS). However, it is notable that applying the SS prior requires computationally intensive cross-validation to set the slab probability p, which is avoided by the InfoMF. The gene LIPC has the strongest signal and the BNNs perform well even without the informative priors, but the performance improves further by incorporating the prior knowledge. For the gene APOA5, only InfoMF+PVE achieves a positive test PVE, while the alternatives overfit severely. We also notice that although BNNs with informative priors are better than the Lasso in most cases, the performance is similar for the gene LDLR2, which indicates that the true effect can be captured by a linear model. Overall, the highest accuracy is achieved by the SS+PVE or InfoMF+PVE priors in most experiments.

Conclusion
In this paper, we provided an approach to incorporate two types of domain knowledge, on feature sparsity and on the proportion of variance explained (PVE), into the widely used Gaussian scale mixture priors for Bayesian neural networks. Specifically, we proposed a new informative spike-and-slab prior on the input layer to reflect beliefs about feature sparsity, and an approach to tune the model's PVE to match prior knowledge on the PVE by optimizing the hyper-parameters of the local scales of all neural network weights. We demonstrated the utility of the approach on simulated data, publicly available datasets, and in a genetics application, where it outperformed strong, commonly used baselines without computationally expensive cross-validation. The informative spike-and-slab is not limited to Gaussian scale mixtures, but can be generalized to all scale mixture distributions. However, one limitation of using the PVE to reflect the signal-to-noise ratio is that the PVE is only defined for priors with finite second moments. Therefore, for some heavy-tailed distributions, such as the horseshoe, other measures of the signal-to-noise ratio should be developed as part of future work.