Fast Exact Bayesian Inference for Sparse Signals in the Normal Sequence Model

We consider exact algorithms for Bayesian inference with model selection priors (including spike-and-slab priors) in the sparse normal sequence model. Because the best existing exact algorithm becomes numerically unstable for sample sizes over n = 500, much attention has been paid to alternative approaches, such as approximate algorithms (Gibbs sampling, variational Bayes, etc.), shrinkage priors (e.g. the horseshoe prior and the Spike-and-Slab LASSO), and empirical Bayesian methods. However, by introducing algorithmic ideas from online sequential prediction, we show that exact calculations are feasible for much larger sample sizes: for general model selection priors we reach n = 25 000, and for certain spike-and-slab priors we can easily reach n = 100 000. We further prove a de Finetti-like result for finite sample sizes that characterizes exactly which model selection priors can be expressed as spike-and-slab priors. The computational speed and numerical accuracy of the proposed methods are demonstrated in experiments on simulated data, on a differential gene expression data set, and in a comparison of multiple hyper-parameter settings in the beta-binomial prior. In our experimental evaluation we compute guaranteed bounds on the numerical accuracy of all new algorithms, which shows that the proposed methods are numerically reliable, whereas an alternative based on long division is not.


Introduction
In the sparse normal sequence model we observe a sequence Y = (Y_1, . . . , Y_n) that satisfies
Y_i = θ_i + ε_i, i = 1, . . . , n,
for independent standard normal random variables ε_i, where θ = (θ_1, . . . , θ_n) is the unknown signal of interest. It is assumed that the number of non-zero signal components s in θ is small compared to the size of the whole sample (i.e. s = o(n)). Applications of this model include detecting differentially expressed genes [45,36,39,56,23], bankruptcy prediction for publicly traded companies using Altman's Z-score in finance [2,3], separation of the background and source in astronomical images [18,28], and wavelet analysis [1,31]. The model is further of interest to sanity check (approximate) inference methods for the more general sparse linear regression model (see [15] and references therein), which reduces to the normal sequence model when the design is the identity matrix.
The sparse normal sequence model, which is also called the sparse normal means model, has been extensively studied from a frequentist perspective (see, for instance, [27,7,1]), but here we consider Bayesian approaches, which endow θ with a prior distribution. This prior serves as a natural way to introduce sparsity into the model, and the corresponding posterior can be used for model comparison and uncertainty quantification (see [24,53,37,5] and references therein). One natural and well-understood class of priors are model selection priors, which take the following hierarchical form: i.) First a sparsity level s is chosen from a prior π_n on {0, 1, . . . , n}. ii.) Then, given s, a subset S ⊂ {1, . . . , n} of size |S| = s is chosen uniformly at random. iii.) Finally, given s and S, the means θ_S = (θ_i)_{i∈S} corresponding to the nonzero coordinates in S are endowed with a prior G_S, while the remaining coefficients θ_{S^c} = (θ_i)_{i∉S} are set to zero.
As is common, we will choose the prior G_S on the nonzero coordinates in a factorized form; i.e. θ_i ∼ G independently for all i ∈ S, where G is a fixed one-dimensional prior, which we assume to have a density g (with respect to the Lebesgue measure). Under suitable conditions on π_n and G, the posterior has good frequentist properties and contracts around the true parameter at the minimax rate, as shown by Castillo and Van der Vaart [17]. Notably, they require the prior π_n to decrease at an exponential rate.
A special case of the model selection priors are the spike-and-slab priors developed by Mitchell and Beauchamp [42] and George and McCulloch [26], in which the coefficients of θ are assigned independent prior distributions
θ_i ∼ α G + (1 − α) δ_0,
with δ_0 the Dirac-delta measure at 0 (a spike) and G the same one-dimensional prior as above (called the slab in this context). The a priori likelihood of nonzero coefficients is controlled by the mixing parameter α ∈ [0, 1], and finally Λ_n is a hyperprior on α. A typical choice for Λ_n is the beta distribution: α ∼ Beta(κ, λ). In this case the prior on the sparsity level in the model selection formulation takes the form
π_n(s) = C(n, s) B(κ + s, λ + n − s) / B(κ, λ),
where C(n, s) is the binomial coefficient and B(κ, λ) denotes the beta function with parameters κ and λ. The resulting prior is called the beta-binomial prior. A natural choice is κ = λ = 1 [54], which corresponds to a uniform prior on α, but this choice does not satisfy the exponential decrease condition on π_n. Castillo and Van der Vaart therefore propose κ = 1 and λ = n + 1, which does satisfy their exponential decrease condition [17, Example 2.2], and in Section 5 we confirm empirically that the latter indeed leads to better posterior estimates for θ.
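To make the beta-binomial formula concrete, here is a small pure-Python sketch (an illustration, not the paper's implementation) that evaluates π_n(s) = C(n, s) B(κ + s, λ + n − s)/B(κ, λ) via log-gamma. It exhibits the contrast discussed above: κ = λ = 1 gives a uniform prior on the sparsity level s, while κ = 1, λ = n + 1 gives an exponentially decreasing one.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log B(a, b) computed via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_prior(n, kappa, lam):
    """pi_n(s) = C(n, s) * B(kappa + s, lam + n - s) / B(kappa, lam)."""
    log_b0 = log_beta(kappa, lam)
    return [comb(n, s) * exp(log_beta(kappa + s, lam + n - s) - log_b0)
            for s in range(n + 1)]

n = 100
uniform = beta_binomial_prior(n, 1, 1)        # Beta(1, 1): uniform over s
decaying = beta_binomial_prior(n, 1, n + 1)   # Beta(1, n + 1): exponential decay
```

For κ = 1, λ = n + 1 one can check that π_n(s + 1)/π_n(s) = (n − s)/(2n − s) ≤ 1/2, which is the kind of exponential decrease required in [17].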
Model selection priors set certain signal components exactly to zero, which is desirable for model selection, but makes computation of the posterior difficult, since the number of possible sets S is exponentially large (i.e. 2^n). Castillo and Van der Vaart [17] do provide an exact algorithm, based on multiplication of polynomials (see Appendix A), but this algorithm runs into numerical problems for sample sizes over n = 250 or sometimes n = 500 (see Section 3), and it also requires O(n^3) computation steps, which makes it too slow to handle large n.
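The polynomial-multiplication idea can be illustrated as follows. This sketch (our reading of the idea, with a Gaussian slab and a uniform π_n as purely illustrative choices) computes only the posterior over the sparsity level s from the coefficients of the product over i of (φ(Y_i) + ψ(Y_i) x); recovering all per-coordinate marginals is what pushes the full algorithm to O(n^3). A brute-force enumeration over all 2^n subsets is included as a check for small n.

```python
from itertools import combinations
from math import exp, pi, sqrt, comb

def phi(y):   # marginal density of Y_i when theta_i = 0: N(0, 1)
    return exp(-y * y / 2) / sqrt(2 * pi)

def psi(y):   # marginal density under a standard Gaussian slab: N(0, 2)
    return exp(-y * y / 4) / sqrt(4 * pi)

def sparsity_posterior(y, prior):
    """Pi_n(s | Y) via the polynomial product prod_i (phi(Y_i) + psi(Y_i) x):
    the coefficient of x^s sums the slab/spike products over all |S| = s."""
    n = len(y)
    c = [1.0]
    for yi in y:                                  # O(n^2) coefficient updates
        new = [0.0] * (len(c) + 1)
        for s, a in enumerate(c):
            new[s] += phi(yi) * a
            new[s + 1] += psi(yi) * a
        c = new
    post = [prior[s] / comb(n, s) * c[s] for s in range(n + 1)]
    z = sum(post)
    return [p / z for p in post]

def brute_sparsity(y, prior):
    """Same posterior by explicit enumeration of all 2^n subsets."""
    n = len(y)
    post = [0.0] * (n + 1)
    for s in range(n + 1):
        for S in combinations(range(n), s):
            w = prior[s] / comb(n, s)
            for i in range(n):
                w *= psi(y[i]) if i in S else phi(y[i])
            post[s] += w
    z = sum(post)
    return [p / z for p in post]
```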
The computational difficulty of model selection priors has given rise to a variety of alternative priors based on shrinkage. These include the horseshoe prior [11], for which multiple scalable implementations are available [58,30]. The corresponding posterior achieves the minimax contraction rate and, under mild conditions, also provides reliable uncertainty quantification [61,60,59]. The posterior median and draws from the horseshoe posterior are not sparse, but the posterior can still be used for model selection after postprocessing. An alternative is to replace the spike in the spike-and-slab prior by a Laplace distribution with very small variance, as in the Spike-and-Slab LASSO [48]. The maximum a posteriori (MAP) estimator of the corresponding posterior distribution can then be computed efficiently by convex optimization.
In this paper we return to the goal of exactly computing the posterior for model selection priors, without changing the prior or introducing approximations. In Section 2 we propose a new approach based on a representation of model selection priors by a Hidden Markov Model (HMM) that comes from the literature on online sequential prediction and data compression [64], for which we can apply the standard Forward-Backward algorithm [46]. The computational complexity of this algorithm is O(n^2). To appreciate the speed-up compared to the O(n^3) run time, see Section 4, where this method runs in under 15 minutes while the previous algorithm of Castillo and Van der Vaart would take approximately 20 days. Furthermore, in Section 2.2 we specialize to spike-and-slab priors and introduce an even faster algorithm based on a discretization of the α hyper-parameter, which has only O(n^{3/2}) run time. Using results from online sequential prediction [20], we show that this discretization provides an accurate approximation of the posterior that can be made exact to arbitrary precision, provided that the density of Λ_n varies sufficiently slowly. Our conditions do not directly allow κ or λ to depend on n in the beta-binomial prior, so we provide an extra result to cover the important case that κ = 1 and λ = n + 1.
Our two new approaches allow us to easily handle data sets of size n = 25 000 for general model selection priors and n = 100 000 for the subclass of spike-and-slab priors with sufficiently regular Λ n , which both substantially exceed the earlier limit of n = 500. These results are obtained on a standard laptop within a maximum time limit of half an hour. Run times for larger sample sizes can be estimated by extrapolating from Figure 2.
In Section 2.3 we further derive necessary and sufficient conditions to decide whether a model selection prior can be written in the more efficient spike-and-slab form. Since the distribution of the binary indicators of whether θ_i = 0 or not is exchangeable under a model selection prior, this amounts to a finite-sample de Finetti result for a restricted class of exchangeable distributions.
In Section 3, we demonstrate the scalability and numerical accuracy of the proposed methods on simulated data. We also show there that our deterministic algorithm can be used as a benchmark to test the accuracy of approximation methods: we compare the approximate posterior from Gibbs sampling and variational Bayes to the exact posterior computed by our algorithm, which shows the surprisingly limited number of decimal places to which their answers are reliable. Then, in Section 4, we compare our methods to other approaches suggested in the literature in an application to differential gene expression for Ulcerative Colitis and Crohn's Disease. In Section 5 we further use our new algorithms to empirically investigate the importance of the exponential decrease condition on π n by varying the hyper-parameters κ and λ of the beta-binomial prior. We find that exponential decrease is not just a sufficient condition for minimax posterior contraction, but it also leads to better posterior estimates of θ. The paper is concluded by Section 6, where we discuss possible extensions of our algorithms.
In addition to the main paper, we provide an accompanying R package that implements our new methods [62], and supplementary material with several appendices [63]. In Appendix A we first recall the exact algorithm by Castillo and Van der Vaart [17]. We show how to resolve its numerical stability issues by performing all intermediate computations in a logarithmic representation. The bottleneck then becomes its computational complexity, because it requires O(n^3) steps, which is prohibitive for large n. Two natural ideas to speed up the algorithm have been proposed [17,12]: one based on fast polynomial multiplication and one based on long division. Surprisingly, although both approaches look very promising in theory, it turns out that neither works well in practice: the theoretical speed-ups from fast polynomial multiplication are so asymptotic that they provide no significant gains for any reasonable n, and the long division approach again becomes numerically unstable. In Appendix B we provide an additional variation on an experiment from Section 5. Finally, Appendix C contains all proofs.

Exact Algorithms for Model Selection Priors
In this section we propose novel, exact algorithms for computing (marginal statistics of) the posterior distribution corresponding to model selection priors. For general model selection priors we propose a model selection HMM algorithm, and for spike-and-slab priors we introduce a faster method based on discretization of the α hyper-parameter. The section is concluded with a characterization of the subclass of model selection priors that can be expressed in the more efficient spike-and-slab form.

Marginal Statistics
We are interested in computing the marginal posterior probabilities that the coordinates of θ are nonzero: q_{n,i} := Π_n(θ_i ≠ 0 | Y) for i = 1, . . . , n.
These are sufficient to compute any other marginal statistics of interest, because, conditionally on whether θ_i is 0 or not, the pair (Y_i, θ_i) is independent of all other pairs (Y_j, θ_j)_{j≠i}. For instance, the marginal posterior means can be expressed as
E[θ_i | Y] = q_{n,i} ζ(Y_i)/ψ(Y_i),
where ψ(y) = ∫ φ(y − t) g(t) dt is the slab density and ζ(y) = ∫ t φ(y − t) g(t) dt, with φ the standard normal density. We may also obtain marginal quantiles by inverting the marginal posterior distribution functions
Π_n(θ_i ≤ u | Y) = (1 − q_{n,i}) 1{u ≥ 0} + q_{n,i} ψ(Y_i, u)/ψ(Y_i), where ψ(y, u) = ∫_{−∞}^{u} φ(y − t) g(t) dt.
In particular, the marginal medians correspond to
θ̂_i = H_{n,i}^{−1}(1/(2 q_{n,i})) if H_{n,i}(0) > 1/(2 q_{n,i}); θ̂_i = 0 if 1 − 1/(2 q_{n,i}) ≤ H_{n,i}(0) ≤ 1/(2 q_{n,i}); and θ̂_i = H_{n,i}^{−1}(1 − 1/(2 q_{n,i})) otherwise,
where H_{n,i}^{−1} is the inverse of the function H_{n,i}(u) = ψ(Y_i, u)/ψ(Y_i), and we use the convention that 1/(2 q_{n,i}) = +∞ when q_{n,i} = 0.
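As a small numeric sanity check of this mixture structure (an illustrative pure-Python sketch, not the paper's implementation): with a standard Gaussian slab, the ratio ζ(y)/ψ(y) has the closed form y/2, which the midpoint-quadrature code below reproduces.

```python
from math import exp, pi, sqrt

def phi(t):
    """Standard normal density."""
    return exp(-t * t / 2) / sqrt(2 * pi)

def quad(f, lo=-12.0, hi=12.0, m=20000):
    """Midpoint-rule quadrature; accurate enough for these smooth integrands."""
    h = (hi - lo) / m
    return h * sum(f(lo + (j + 0.5) * h) for j in range(m))

def slab_density(y, g=phi):        # psi(y) = int phi(y - t) g(t) dt
    return quad(lambda t: phi(y - t) * g(t))

def slab_first_moment(y, g=phi):   # zeta(y) = int t phi(y - t) g(t) dt
    return quad(lambda t: t * phi(y - t) * g(t))

def marginal_mean(q, y):
    """E[theta_i | Y] = q_{n,i} * zeta(Y_i) / psi(Y_i)."""
    return q * slab_first_moment(y) / slab_density(y)
```

With g = φ the slab density ψ is the N(0, 2) density, and the posterior slab mean shrinks Y_i by a factor 1/2, so marginal_mean(q, y) should return q * y / 2.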

The Model Selection HMM Algorithm
Our first computationally efficient approach is based on a Hidden Markov Model (HMM) that comes from the literature on online sequential prediction and data compression [64]. This approach makes it possible to reliably compute all marginal posterior probabilities q_{n,i} in only O(n^2) operations for any model selection prior.
To define the HMM, we encode the subset of nonzero coordinates S ⊂ {1, . . . , n} as a binary vector B = (B_1, . . . , B_n), where B_i = 1 if i ∈ S and B_i = 0 otherwise. The crucial observation is that the conditional probabilities Π_n(B_{i+1} | B_1, . . . , B_i) of the model selection prior only depend on the total number of nonzeros M_i = Σ_{j=1}^{i} B_j ∈ {0, . . . , i} in the first i coordinates and not on the locations of these coordinates. We can use this observation to interpret the model selection prior as the model selection HMM shown in Figure 1, where each hidden state H_i = (B_i, M_i) contains sufficient information to compute both the transition probabilities P(H_{i+1} | H_i) and the conditional distribution of θ_i given H_i: θ_i = 0 if B_i = 0, and θ_i ∼ G if B_i = 1. In fact, in our implementation we will integrate out θ_i to directly obtain the conditional density
p(Y_i | H_i) = φ(Y_i) if B_i = 0, and p(Y_i | H_i) = ψ(Y_i) if B_i = 1,
with φ the standard normal density and ψ the slab density as before. We note that the sequence of hidden states H_1, . . . , H_n is in one-to-one correspondence with S. Consequently, since the model selection HMM expresses the same joint distribution on H_1, . . . , H_n as the model selection prior, and the conditional distribution of θ and Y given H_1, . . . , H_n is also the same, it follows that the model selection HMM is equivalent to the corresponding model selection prior.
What we gain is that, for HMMs, standard efficient algorithms are available, whose run times depend on the number of state transitions with nonzero probabilities P(H_{i+1} | H_i) [46]. For our purposes, we will use the Forward-Backward algorithm to compute Π_n(H_i | Y) for all i in O(n^2) steps, from which we can obtain q_{n,i} = Π_n(B_i = 1 | Y) for all i in another O(n^2) steps by marginalizing. For numerical accuracy, we perform all calculations using the logarithmic representation discussed in Appendix A.2.
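The logarithmic representation serves to keep products of many densities from underflowing. A minimal sketch of the key primitive (the paper's actual implementation is in C++ with interval arithmetic; this is only an illustration):

```python
from math import exp, log, inf

def logsumexp(xs):
    """log(sum(exp(x) for x in xs)), computed without underflow or overflow."""
    m = max(xs)
    if m == -inf:
        return -inf
    return m + log(sum(exp(x - m) for x in xs))

tiny = [-1000.0, -1001.0]
naive = sum(exp(x) for x in tiny)   # underflows to 0.0 in double precision
stable = logsumexp(tiny)            # finite, close to -999.69
```

Summing forward/backward quantities in log-space via this primitive is what keeps the recursions stable for large n.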
The Forward phase of this algorithm computes the densities p(Y_1, . . . , Y_i, h_i) for all i = 1, . . . , n and all values h_i of the hidden states using the recursion
p(Y_1, . . . , Y_i, h_i) = Σ_{h_{i−1}} p(Y_1, . . . , Y_{i−1}, h_{i−1}) P(h_i | h_{i−1}) p(Y_i | h_i).
After the Forward phase, the Forward-Backward algorithm performs the Backward phase, which computes
p(Y_{i+1}, . . . , Y_n | h_i) = Σ_{h_{i+1}} P(h_{i+1} | h_i) p(Y_{i+1} | h_{i+1}) p(Y_{i+2}, . . . , Y_n | h_{i+1}).
Combining the results from the Forward and Backward phases, we can compute
Π_n(h_i | Y) = p(Y_1, . . . , Y_i, h_i) p(Y_{i+1}, . . . , Y_n | h_i) / p(Y_1, . . . , Y_n)
for all i and h_i, as desired. The HMM described here was introduced by [64] for the Beta(1/2, 1/2)-binomial prior (i.e. the spike-and-slab prior with Λ_n = Beta(1/2, 1/2)) in the context of the Switching Method for data compression. See [35] for an overview of many variations on this HMM. Indeed, for any Beta(κ, λ)-binomial prior this HMM is particularly natural, because the transition probabilities of the hidden states have a closed-form expression:
Π_n(B_{i+1} = 1 | M_i = m) = (κ + m) / (κ + λ + i).
Here we add the observation that, even when the conditional probabilities Π_n(B_{i+1} | B_1, . . . , B_i) are not available in closed form for a given model selection prior, they still depend only on M_i and can be efficiently obtained as
Π_n(B_{i+1} = 1 | M_i = m) = w_{i+1}(m + 1) / w_i(m),
where w_i(m) is the joint probability of any sequence b_1, . . . , b_i with m ones. These joint probabilities can be pre-computed for i = n, . . . , 1 in O(n^2) steps using the recursion
w_i(m) = w_{i+1}(m) + w_{i+1}(m + 1), starting from w_n(m) = π_n(m) / C(n, m).
Thus we can calculate the marginal posterior probabilities in O(n^2) steps for any model selection prior, not just for beta-binomial priors. The numerical accuracy of this algorithm is demonstrated in Section 3.
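The complete computation can be sketched in pure Python as follows. This is an illustration, not the paper's implementation: it works with plain probabilities instead of the logarithmic representation, uses a standard Gaussian slab (so that ψ is the N(0, 2) density), assumes π_n puts positive mass on every sparsity level so the transition ratios are well defined, and checks the O(n^2) Forward-Backward answer against brute-force enumeration of all 2^n subsets for a tiny n.

```python
from itertools import combinations
from math import exp, pi, sqrt, comb

def phi(y):   # density of Y_i given theta_i = 0: N(0, 1)
    return exp(-y * y / 2) / sqrt(2 * pi)

def psi(y):   # density of Y_i under a standard Gaussian slab: N(0, 2)
    return exp(-y * y / 4) / sqrt(4 * pi)

def hmm_marginals(y, prior):
    """q_{n,i} = Pi_n(theta_i != 0 | Y) by Forward-Backward in O(n^2)."""
    n = len(y)
    # w[i][m]: prior probability of any fixed b_1..b_i containing m ones
    w = [None] * (n + 1)
    w[n] = [prior[m] / comb(n, m) for m in range(n + 1)]
    for i in range(n - 1, -1, -1):
        w[i] = [w[i + 1][m] + w[i + 1][m + 1] for m in range(i + 1)]

    def p1(i, m):   # P(B_{i+1} = 1 | M_i = m)
        return w[i + 1][m + 1] / w[i][m]

    # Forward: f[i][m] = p(Y_1..Y_i, M_i = m)
    f = [[1.0]]
    for i in range(1, n + 1):
        row = [0.0] * (i + 1)
        for m, a in enumerate(f[i - 1]):
            row[m] += a * (1 - p1(i - 1, m)) * phi(y[i - 1])
            row[m + 1] += a * p1(i - 1, m) * psi(y[i - 1])
        f.append(row)
    # Backward: r[i][m] = p(Y_{i+1}..Y_n | M_i = m)
    r = [None] * (n + 1)
    r[n] = [1.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        r[i] = [(1 - p1(i, m)) * phi(y[i]) * r[i + 1][m]
                + p1(i, m) * psi(y[i]) * r[i + 1][m + 1] for m in range(i + 1)]
    p_y = sum(f[n])
    return [sum(f[i - 1][m] * p1(i - 1, m) * psi(y[i - 1]) * r[i][m + 1]
                for m in range(i)) / p_y for i in range(1, n + 1)]

def brute_marginals(y, prior):
    """The same marginals by enumerating all 2^n subsets (tiny n only)."""
    n = len(y)
    tot, qs = 0.0, [0.0] * n
    for s in range(n + 1):
        for S in combinations(range(n), s):
            wt = prior[s] / comb(n, s)
            for i in range(n):
                wt *= psi(y[i]) if i in S else phi(y[i])
            tot += wt
            for i in S:
                qs[i] += wt
    return [v / tot for v in qs]
```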

A Faster Algorithm for Spike-and-Slab Priors
In this section we restrict our attention to the spike-and-slab subclass of model selection priors, for which we propose further speed-ups. It is intuitively clear that the mixing hyper-parameter α plays a key role in the behavior of the prior distribution. The optimal choice of α heavily depends on the sparsity parameter s of the model. For instance, in the case of Cauchy slabs the optimal oracle choice α = (s/n) log(n/s) results in minimax posterior contraction [17] and reliable uncertainty quantification [16] in the ℓ_2-norm. In practice, however, the sparsity level s is typically not known in advance, so one cannot use the optimal oracle choice for α. In [17] it was also shown that by choosing α = 1/n the posterior contracts around the truth at the nearly optimal rate s log(n). This seemingly solves the problem of choosing the tuning hyper-parameter. However, a related simulation study in [17] shows that hard-thresholding at the corresponding √(2 log n) level goes together with substantially worse practical performance; see Tables 1 and 2 in [17]. Furthermore, in view of [16], the choice α = 1/n imposes too strong prior assumptions, resulting in overly small posterior spread, which leads to unreliable Bayesian uncertainty quantification: the frequentist coverage of the ℓ_2-credible set will tend to zero. Therefore, in practice one has to consider a data-driven (adaptive) choice of the hyper-parameter α. A computationally appealing approach is the empirical Bayes method, where the maximum marginal likelihood estimator is plugged into the posterior. The corresponding posterior mean achieves a (nearly) minimax convergence rate [31], and for slab distributions with polynomial tails the corresponding posterior contracts around the truth at the optimal rate [13]. However, for light-tailed slabs (e.g. Laplace) the empirical Bayes posterior distribution achieves a highly suboptimal contraction rate around the truth; see again [13].
Another standard (and from a Bayesian perspective more natural) approach is to endow the hyper-parameter α with another layer of prior Λ_n. However, computational problems may arise when using standard Gibbs sampling techniques to sample from the posterior; see Section 3.2 for a demonstration of this problem on a simulated data set. Various speed-ups have been proposed in the literature. One can, for instance, focus on relevant sub-sequences of the parameter θ and apply the Gibbs sampler only to them. Another approach is to apply the Hamiltonian Monte Carlo method; see for instance [53]. However, none of these approaches provides an easy way to quantify their approximation error when run for a finite number of iterations. In the next section we propose a deterministic algorithm to approximate the marginal posterior probabilities q_{n,i} for spike-and-slab priors, with a guaranteed bound on its approximation error that can be made arbitrarily close to zero.

Approximation via Discretization of the Mixing Parameter
For general model selection priors the fast HMM algorithm from Section 2.1 requires O(n^2) steps. However, for the special case of spike-and-slab priors we can do even better: we can approximate the corresponding posterior to arbitrary precision using only O(n^{3/2}) steps, provided that the density λ_n of the mixing distribution Λ_n on α satisfies certain regularity conditions. The Algorithm Our approach is to approximate the prior Λ_n by a prior Λ̃_n that is supported on k = O(n^{1/2}) discretization points α_1, . . . , α_k. Let Π_n be the original spike-and-slab prior corresponding to a given choice of Λ_n, and let Π̃_n be the prior corresponding to Λ̃_n. Conditional on α, the pairs (θ_i, Y_i) are independent. Computing the likelihood for a single α therefore takes O(n) steps, and consequently we can obtain the posterior probabilities Π̃_n(α_j | Y) of all k discretization points in O(kn) steps. We can then compute the marginal posterior probabilities q̃_{n,i} = Π̃_n(θ_i ≠ 0 | Y) in another O(k) steps independently for each i, leading to a total run time of O(kn) = O(n^{3/2}) steps. We again perform all calculations using the logarithmic representation from Appendix A.2.
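The overall structure of the discretization algorithm can be sketched as follows. This is illustrative pure Python, not the paper's implementation: for simplicity it uses a uniform Λ_n = Beta(1, 1) and a plain uniform grid over α rather than the arcsine-spaced grid from the next subsection, and it checks the answer against exact enumeration for a tiny n (for the Beta(1, 1) mixture, each subset S has exact prior mass B(|S| + 1, n − |S| + 1)).

```python
from itertools import combinations
from math import exp, pi, sqrt, lgamma

def phi(y): return exp(-y * y / 2) / sqrt(2 * pi)   # spike marginal N(0, 1)
def psi(y): return exp(-y * y / 4) / sqrt(4 * pi)   # Gaussian-slab marginal N(0, 2)

def discretized_marginals(y, k=2000):
    """Approximate q_{n,i} for the spike-and-slab prior with uniform
    Lambda_n = Beta(1, 1), discretized on a uniform alpha-grid."""
    wts, alphas = [], []
    for j in range(1, k + 1):
        a = (j - 0.5) / k
        like = 1.0 / k                 # uniform prior mass of the j-th cell
        for yi in y:                   # O(n) likelihood for one alpha
            like *= a * psi(yi) + (1 - a) * phi(yi)
        alphas.append(a)
        wts.append(like)
    z = sum(wts)                       # normalizer: discretized marginal likelihood
    return [sum(wj / z * a * psi(yi) / (a * psi(yi) + (1 - a) * phi(yi))
                for wj, a in zip(wts, alphas)) for yi in y]

def exact_marginals(y):
    """Exact q_{n,i}: under Beta(1, 1) each subset S has prior mass
    B(|S| + 1, n - |S| + 1) = int alpha^s (1 - alpha)^(n - s) d alpha."""
    n = len(y)
    tot, qs = 0.0, [0.0] * n
    for s in range(n + 1):
        mass = exp(lgamma(s + 1) + lgamma(n - s + 1) - lgamma(n + 2))
        for S in combinations(range(n), s):
            wt = mass
            for i in range(n):
                wt *= psi(y[i]) if i in S else phi(y[i])
            tot += wt
            for i in S:
                qs[i] += wt
    return [v / tot for v in qs]
```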

Choice of Discretization Points
As in Section 2.1, let B = (B_1, . . . , B_n) be latent binary random variables such that B_i = 0 if θ_i = 0 and B_i = 1 otherwise. We will choose discretization points α_1, . . . , α_k and the discretized prior Λ̃_n such that the ratio
Λ̃_n-to-Λ_n prior mass ratio Π̃_n(B = b)/Π_n(B = b) ∈ [1 − ε, 1 + ε] for all b ∈ {0, 1}^n,
where ε > 0 can be made arbitrarily small. Since, conditional on B, the discretized model is the same as the original model, this implies that the posterior probabilities Π_n(θ | Y) and Π̃_n(θ | Y) must also be within a factor of (1 + ε)/(1 − ε) ≈ 1.
Conditional on the mixing hyper-parameter α, the sequence B consists of independent, identically distributed Bernoulli(α) random variables, and Π_n and Π̃_n respectively assign hyperpriors Λ_n and Λ̃_n to the success probability α. To discretize α, we follow an approach introduced by [20] in the context of online sequential prediction with adversarial data. They observe that it is more convenient to reparametrize the Bernoulli model using the arcsine transformation [4,25]
β(α) = arcsin(√α) ∈ [0, π/2], with inverse α(β) = sin^2(β),
which makes the Fisher information constant as a function of β. We will use a uniform discretization of β with k discretization points spaced δ_k = π/(2k) ∝ 1/√n apart, which in the α-parametrization maps to a spacing that is proportional to 1/√n around α = 1/2 but behaves like 1/n for α near 0 or 1. Specifically, let α_j = α(β_j) with
β_j = (j − 1/2) δ_k, j = 1, . . . , k.
The prior mass of each α under Λ_n is then reassigned to its closest discretization point in the β-parametrization. If Λ_n has no point-masses exactly half-way between discretization points, then this means that
Λ̃_n({α_j}) = Λ_n({α : β_j − δ_k/2 ≤ β(α) < β_j + δ_k/2}).
Otherwise, if Λ_n does have such point-masses, their masses may be divided arbitrarily over the neighboring discretization points.
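A short sketch of this grid construction (illustrative; a uniform Λ_n is assumed so that cell masses are simply interval lengths in α-space):

```python
from math import pi, sin

def arcsine_grid(k):
    """k points uniform in beta = arcsin(sqrt(alpha)), i.e. alpha_j = sin^2(beta_j)
    with beta_j = (j - 1/2) * pi / (2k). Also returns the Lambda_n-mass of each
    cell under a uniform Lambda_n (the cell's length in alpha-space)."""
    d = pi / (2 * k)
    alphas, masses = [], []
    for j in range(1, k + 1):
        b = (j - 0.5) * d
        alphas.append(sin(b) ** 2)
        masses.append(sin(b + d / 2) ** 2 - sin(b - d / 2) ** 2)
    return alphas, masses

alphas, masses = arcsine_grid(1000)
gaps = [b - a for a, b in zip(alphas, alphas[1:])]
```

The gaps near the endpoints α = 0 and α = 1 behave like δ_k^2 while the central gaps behave like δ_k, matching the 1/n versus 1/√n spacing described above; the cell masses telescope to total mass 1.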
Approximation Guarantees For simplicity we will assume that Λ_n has a Lebesgue density λ_n(α) = dΛ_n(α)/dα. It will also be convenient to let α_0 = 0 and α_{k+1} = 1, and to define
P_α(n, α̂) := α^{n α̂} (1 − α)^{n(1 − α̂)},
which may be interpreted as the Bernoulli(α) likelihood of a binary sequence with maximum likelihood parameter α̂. In particular, if α̂ = s/n with s the number of ones in b ∈ {0, 1}^n for integer n, then P_α(n, α̂) = α^s (1 − α)^{n−s}. There is no reason to restrict the definition of P_α(n, α̂) to integer n or to the discrete set of α̂ that can be maximum likelihood parameters at sample size n, however, and following [20] we extend the definition to all α̂ ∈ [0, 1] and all real n > 0, which will be useful below to handle the Beta(1, n + 1) prior.
Theorem 2.1. Take k = 2(m + 1)√n + 1 for any integer m, and suppose there exists a constant L ≥ 0 (which is allowed to depend on n) such that condition (5) is satisfied. Then there exists a constant C_L > 0 that depends only on L such that, if m > C_L, then the ratio bound (6) holds with ε = C_L/m, and consequently (7) follows. The result (6) holds even for non-integer n, but (7) implicitly assumes that n is the number of observations in Y and must therefore be integer. The proof is deferred to Appendix C.1 in the supplementary material. We note already that condition (5) is essentially a Lipschitz condition on the log of the density of Λ_n in the β-parametrization (see Appendix C.1). Under this condition, the theorem shows that, by increasing m, we can approximate Π_n(θ | Y) to any desired accuracy, at the cost of increasing our computation time, which scales linearly with m.
Remark 1. For given m and (integer) n, the tightest possible value of ε in (7) may be determined numerically, by maximizing and minimizing the ratio in (6) over α̂ = s/n for s = 0, . . . , n.
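Remark 1 can be carried out numerically as in the following sketch (illustrative; it uses the uniform Beta(1, 1) hyperprior, for which the exact mass of a sequence with s ones is the beta function B(s + 1, n − s + 1), and the arcsine grid with uniform cell masses):

```python
from math import pi, sin, exp, lgamma

def P(alpha, n, ahat):
    """P_alpha(n, ahat) = alpha^(n * ahat) * (1 - alpha)^(n * (1 - ahat))."""
    return alpha ** (n * ahat) * (1 - alpha) ** (n * (1 - ahat))

def ratio_range(n, k):
    """Min/max over ahat = s/n of the discretized-to-exact prior mass ratio,
    for the uniform Beta(1, 1) hyperprior."""
    d = pi / (2 * k)
    pts = [(j - 0.5) * d for j in range(1, k + 1)]
    alphas = [sin(b) ** 2 for b in pts]
    masses = [sin(b + d / 2) ** 2 - sin(b - d / 2) ** 2 for b in pts]
    ratios = []
    for s in range(n + 1):
        num = sum(m * P(a, n, s / n) for m, a in zip(masses, alphas))
        # exact: int alpha^s (1 - alpha)^(n - s) d alpha = B(s + 1, n - s + 1)
        den = exp(lgamma(s + 1) + lgamma(n - s + 1) - lgamma(n + 2))
        ratios.append(num / den)
    return min(ratios), max(ratios)

lo, hi = ratio_range(50, 400)   # both should be close to 1 for a fine grid
```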

Extension to Arbitrary Beta Priors
The Lipschitz condition (5) excludes the important Beta(1, n + 1) prior, because its density varies too rapidly. We therefore describe an extension that can handle any Beta(κ, λ) prior with κ, λ ≥ 1/2, even when κ or λ grows linearly with n.

Which Model Selection Priors Are Spike-and-Slab Priors?
As described in the introduction, it is clear that spike-and-slab priors are a special case of model selection priors. However, to the best of our knowledge, it is not known when a model selection prior has a spike-and-slab representation. One advantage of the spike-and-slab formulation is that we can construct algorithms with O(n^{3/2}) run time (Section 2.2), while for a general model selection prior the computational complexity is O(n^2) (Section 2.1). In this section we give necessary and sufficient conditions for when a model selection prior can be expressed in spike-and-slab form.
The proof, which is given in Appendix C.2, shows that establishing this theorem amounts to proving a version of de Finetti's theorem for finite sequences.
Next we give several examples of priors π n that satisfy (or fail) the conditions of Theorem 2.3, which implies that the model selection prior can (or cannot) be given in spike-and-slab form (2). The proofs for the examples are in Appendix C.3.
First we consider binomial π_n, for which it is already known that there exists a spike-and-slab representation [17, Example 2.1]. Nevertheless, to illustrate the applicability of our results, we show that this choice of π_n satisfies the conditions of Theorem 2.3. We proceed to give two natural choices for π_n where the corresponding model selection prior cannot be expressed in the form (2). In the first example, π_n has a heavy (polynomial) tail, while in the second it has a light (sub-exponential) tail.

Comparing the Proposed Algorithms
In this section we compare the speed and numerical accuracy of our new algorithms with the previously proposed methods for exact computation of the posterior. We consider a sequence of sample sizes n = 50, 100, 250, 500, 1 000, 2 500, . . . , 50 000, 100 000 and construct the true signal θ_0 to have 20% non-zero signal components of value 4√(2 ln n), while the remaining signal coefficients are set to zero. For a fair comparison we run all algorithms for the spike-and-slab prior with Laplace slab g(x) = (a/2) e^{−a|x|}, with a = 1, and mixing hyper-prior Λ_n = Beta(1, n + 1). We have set up the experiments in R, but all algorithms were implemented as subroutines in C++. Since numerical instability is a major concern, we have tracked the numerical accuracy of all methods using interval arithmetic as implemented in the C++ Boost library [8] (with crlibm as a back-end to compute transcendental functions [21]), which replaces all floating-point numbers by intervals that are guaranteed to contain the mathematically exact answer. The lower end-point of each interval corresponds to always rounding down in the calculations, and the upper end-point corresponds to always rounding up. The width of the interval for the final answer therefore measures the numerical error. All experiments were performed on a MacBook Pro laptop with a 2.9 GHz Intel Core i5 processor, 8 GB (1867 MHz DDR3) memory, and a solid-state drive.

Results
The results are summarized in Figure 2, which shows the run time of the algorithms on the left, and their numerical error on the right. The reported numerical error is the maximum numerical error in calculating q_{n,i} over i = 1, . . . , n. To avoid overly long computations we terminated the algorithms if they became numerically unstable or if their run time exceeded half an hour. One can see that the original Castillo-Van der Vaart algorithm was terminated for n ≥ 250, due to numerical inaccuracy. This problem was resolved by applying the logarithmic representation from Appendix A.2, which made the algorithm numerically stable up to n ≤ 2 500; however, due to its long O(n^3) run time, the algorithm was terminated for larger values as it reached the half-hour limit. The natural speed-up idea of applying long division (see Appendix A.3) was not successful for this data, as the numerical accuracy was poor even for small sample sizes. We observe that the model selection HMM and the algorithm based on discretization performed better than the preceding methods: the model selection HMM algorithm has run time O(n^2) and the largest sample size it managed to complete within half an hour was n = 25 000, while the algorithm with discretized mixing parameter in the spike-and-slab prior (initialized according to Corollary 2.2 with parameter m = 20) has run time O(n^{3/2}) and reached the time limit after sample size n = 100 000. We also note that both algorithms were numerically accurate, giving answers that were reliable up to between 5 and 11 decimal places, depending on n. For sample size n = 2 500, we have further verified empirically that the model selection HMM algorithm indeed computes the same numbers as the Castillo-Van der Vaart algorithm, in accordance with the equivalence established in Section 2.1.

Approximation Errors for Several Standard Methods
In this section we measure the approximation error of a selection of approximation algorithms by comparing them to the exact model selection HMM algorithm, which serves as a benchmark for the correct answer. We again consider the spike-and-slab prior with Λ_n = Beta(1, n + 1), but for simplicity we use standard Gaussian slabs g(x) = (1/√(2π)) e^{−x^2/2}, since the approximation methods are typically designed for this choice of slab distribution. Our first approximation method is the discretization algorithm from Section 2.2, which uses a deterministic approximation. The discretization algorithm was again initialized according to Corollary 2.2 with m = 20. We further consider a standard Gibbs sampler (with number of iterations it = 10^3, 10^4, 10^5, half of which are used as burn-in) and a variational Bayes approximation. We consider the same test data as in the preceding section. The only difference is that we stop at n = 10 000 to limit the run times for the exact HMM algorithm and the Gibbs sampler with it = 10^5. Both the Gibbs sampler and the variational Bayes algorithm were implemented in R. For the latter we used the component-wise variational Bayes algorithm [38,57,10,47]. We measure approximation error by computing max_i |q_{n,i} − q̃_{n,i}|, where q_{n,i} is the exact slab probability computed by the model selection HMM and q̃_{n,i} is the slab probability computed by the approximation. We run each non-deterministic approximation method 5 times and report the average approximation error along with the average run time of the algorithms. The results are plotted in Figure 3 and shown numerically in Table 1.
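For reference, the standard Gibbs sampler for this model alternates two conjugate full conditionals. The sketch below is a generic textbook version (an assumption on our part, not the specific R implementation benchmarked here), with a standard Gaussian slab so that ψ is the N(0, 2) density:

```python
import random
from math import exp, pi, sqrt

def phi(y): return exp(-y * y / 2) / sqrt(2 * pi)   # spike marginal N(0, 1)
def psi(y): return exp(-y * y / 4) / sqrt(4 * pi)   # Gaussian-slab marginal N(0, 2)

def gibbs_q(y, kappa, lam, iters=3000, burn=1000, seed=7):
    """Estimate q_{n,i} by alternating the two conjugate full conditionals:
      alpha | B      ~ Beta(kappa + s, lam + n - s),  with s = sum(B)
      B_i | alpha, Y ~ Bernoulli(alpha psi(Y_i) / (alpha psi(Y_i) + (1 - alpha) phi(Y_i)))
    """
    rng = random.Random(seed)
    n = len(y)
    b = [0] * n
    counts = [0] * n
    for t in range(iters):
        alpha = rng.betavariate(kappa + sum(b), lam + n - sum(b))
        for i in range(n):
            p1 = alpha * psi(y[i])
            p0 = (1 - alpha) * phi(y[i])
            b[i] = 1 if rng.random() < p1 / (p1 + p0) else 0
        if t >= burn:
            for i in range(n):
                counts[i] += b[i]
    return [c / (iters - burn) for c in counts]

estimates = gibbs_q([8.0, 0.0, -8.0, 0.1], kappa=1, lam=1)
```

The averaged indicator frequencies after burn-in estimate the q_{n,i}; as discussed above, there is no finite-iteration guarantee on how close these estimates are to the exact values.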
One can see that the discretized version of the algorithm is very accurate, with at least seven decimal places of precision throughout. It loses approximately two decimal places of precision for every ten-fold increase of n, so we can still expect it to be accurate up to five decimal places for n = 100 000. We point out that its approximation error includes both the mathematical approximation from Section 2.2 and the numerical error already studied separately in Figure 2. Since the approximation error in Figure 3 is of the same order as the numerical error in Figure 2, we conclude that the numerical error dominates the mathematical approximation error, so the discretization algorithm may be considered an exact method for all practical purposes.
At the same time, the Gibbs sampler and the variational Bayes method both provide approximations of the posterior that are far less accurate. Variational Bayes is only accurate up to one decimal place, although in further investigations we did find that it provides a better approximation if we look only at the non-zero coefficients, with an approximation error of order O(10^{−4}). For the Gibbs sampler there is no theory that tells us how many iterations are needed to achieve a given degree of accuracy. We see here that the precision strongly depends on the number of iterations, ranging from one to three decimal places, but remains approximately constant as n increases. However, the run time for it = 10^5 iterations would become prohibitive for sample sizes much larger than the n = 10 000 we consider.

Differential Gene Expression for Ulcerative Colitis and Crohn's Disease
In this section we compare our methods to various other frequently used Bayesian approaches in the context of differential gene expression data. Data We consider a data set from Burczynski et al. [9] containing the gene expression levels of n = 22 283 genes in peripheral blood mononuclear cells, with the raw data provided by the National Center for Biotechnology Information. 1 This is an observational study, with microarray gene expression data on 26 subjects who suffered from ulcerative colitis and 59 subjects with Crohn's disease. We calculate Z-scores to identify differences in average gene expression levels between the two disease groups following the standard approach described by Quackenbush [45], which consists of dividing the difference of the average log-transformed, normalized gene expressions for the two groups by its standard error. More specifically, let us denote by U_{i,j} and C_{i,j} the measured intensities of the i-th gene for the j-th person with ulcerative colitis and Crohn's disease, respectively. As a first step we normalize the log-intensities within each patient, for each gene i and patient j, so that expression levels are comparable across patients. Then the Z-score for the i-th gene is computed as
Z_i = (log U_i − log C_i) / √(σ^2_{U,i}/26 + σ^2_{C,i}/59),
where log U_i and σ^2_{U,i} denote the average and the sample variance of the normalized log-intensities of gene i over the ulcerative colitis patients, and log C_i and σ^2_{C,i} are defined accordingly. Since it is assumed that the number of genes with a different expression level between the two groups is small compared to the total number of genes n, the data fit into the sparse normal sequence model with n = 22 283.
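The Z-score computation can be sketched as follows (illustrative pure Python; the per-patient normalization, subtracting each patient's mean log-intensity, is one common convention and an assumption on our part, as is the exact form of the standard error):

```python
from math import log, sqrt

def center_logs(X):
    """Per-patient normalization (an assumed convention): subtract each
    patient's mean log-intensity on the log scale."""
    out = []
    for patient in X:                    # X[j][i]: intensity of gene i, patient j
        logs = [log(v) for v in patient]
        mu = sum(logs) / len(logs)
        out.append([v - mu for v in logs])
    return out

def z_scores(U, C):
    """Per-gene two-sample Z-scores: difference of group means of the
    normalized log-intensities, divided by its standard error."""
    U, C = center_logs(U), center_logs(C)
    nU, nC, genes = len(U), len(C), len(U[0])
    zs = []
    for i in range(genes):
        u = [U[j][i] for j in range(nU)]
        c = [C[j][i] for j in range(nC)]
        mu, mc = sum(u) / nU, sum(c) / nC
        vu = sum((x - mu) ** 2 for x in u) / (nU - 1)
        vc = sum((x - mc) ** 2 for x in c) / (nC - 1)
        zs.append((mu - mc) / sqrt(vu / nU + vc / nC))
    return zs

U = [[2.0, 8.0], [2.2, 9.0], [1.9, 8.5]]   # gene 1 elevated in the first group
C = [[2.1, 2.0], [2.0, 2.2], [1.95, 2.1]]
z = z_scores(U, C)
```

Under the null of no differential expression, such Z-scores are approximately standard normal, which is what places the data in the sparse normal sequence model.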

Methods
We compare the run times and the selected genes for the eight procedures listed in Table 2. We consider the model selection HMM algorithm for the Beta(1, n + 1)-binomial prior with Laplace slab (with hyper-parameter a = 0.5), and the discretization algorithm from Corollary 2.2 with m = 20, which is a faster way to compute exactly the same results. Genes i with marginal posterior probability q_{n,i} ≥ 1/2 are selected.
For comparison, we also consider the model selection HMM for the Beta(1, 1)-binomial prior, which corresponds to using a uniform prior Λ_n on the mixing parameter α. In this section we use the implementations of our algorithms from our R package [62], which is approximately 5 times faster than the implementation from Section 3, because it does not incur the overhead of tracking numerical accuracy using interval arithmetic.
We compare to the empirical Bayes method of Johnstone and Silverman [31], which uses a spike-and-slab prior, but estimates the mixing parameter α by empirical Bayes. The method does not explicitly include a prior on α, but we may interpret it as using a uniform prior Λ_n. We again use a Laplace slab (with the default parameter a = 0.5) and select genes by hard thresholding at marginal posterior probability 1/2, as implemented in the R package [55].
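For intuition, once the mixing weight α is fixed (or plugged in by empirical Bayes), the marginal posterior inclusion probability of each coordinate has a simple closed form. The sketch below is an illustration under these assumptions, not the implementation in [55]; it uses the standard normal spike and the known convolution of a Laplace(a) slab with standard normal noise:

```python
import numpy as np
from scipy.stats import norm

def laplace_slab_marginal(y, a=0.5):
    """Density of Y = theta + eps with theta ~ Laplace(a), eps ~ N(0, 1).

    Closed form: (a/2) e^{a^2/2} [e^{-a y} Phi(y - a) + e^{a y} Phi(-y - a)].
    """
    return (a / 2) * np.exp(a**2 / 2) * (
        np.exp(-a * y) * norm.cdf(y - a) + np.exp(a * y) * norm.cdf(-y - a)
    )

def inclusion_probability(y, alpha, a=0.5):
    """Posterior probability that theta_i != 0 given Y_i = y,
    for a spike-and-slab prior with fixed mixing weight alpha."""
    slab = alpha * laplace_slab_marginal(y, a)
    spike = (1 - alpha) * norm.pdf(y)
    return slab / (slab + spike)
```

Hard thresholding then selects exactly the coordinates with inclusion probability at least 1/2.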
We also include EBSparse, which is a fractional empirical Bayes procedure proposed by Martin and Walker [40]. It can be interpreted as using a spike-and-slab prior with Λ_n = Beta(1, γn), but with Gaussian slabs G_i = N(Y_i, τ²) whose means depend on the data. Furthermore, in the formula for the posterior the likelihood is tempered by raising it to the power κ. We use the authors' R implementation [41], with the recommended hyper-parameter settings κ = 0.99, γ = 0.25, τ² = 100, and M = 1000 Monte Carlo samples. As the sampler is randomized, we run the algorithm 10 times.
We further consider the Spike-and-Slab LASSO of Ročková [48], which computes the maximum a posteriori parameters using Laplace distributions both for the spikes and for the slabs. As in [48, Section 6], we take the slab scale parameter to be λ_1 = 0.1, and estimate the spike scale parameter λ_0 via the two-step procedure described there, for the Beta(1, n + 1) hyper-prior on the mixing parameter. An R implementation called SSLasso was provided by Ročková [49].
We also add the Horseshoe estimator [11] with the Cauchy hyper-prior on its hyper-parameter τ, truncated to the interval [1/n, 1], as recommended by Van der Pas et al. [60]. We use the R package [58], with its default Markov Chain Monte Carlo sampler settings of 1000 burn-in iterations and 5000 iterations after burn-in. Genes are selected if their credible sets exclude zero [60]. As the sampler is randomized, we run the algorithm 10 times.
Finally, we compare with the variational Bayes algorithm (implemented in the varbvs R package) described in [10]. Notably, this method uses Gaussian slabs. The hyper-parameters (e.g. the variances of the prior and of the noise) are fitted to the data automatically. We set the tolerance and maximum number of iterations to 10⁻⁴ and 1000, respectively.

Results

Results are reported in Table 2 and Figure 4. Although we list run times to illustrate computational feasibility, it is important to keep in mind that the methods in this section compute different quantities, so their most important difference lies in which genes they select. On this point, the main conclusion is that the alternative methods give very different results from the exact Bayesian posterior for the model selection prior.
All methods except the Horseshoe and EBSparse select genes in decreasing order of the absolute values of Z_i. The Horseshoe and EBSparse also generally select genes in decreasing order of |Z_i|, but with some swaps between genes whose absolute values are close to each other, so it appears that for both sampling-based methods the sampler suffers from limited precision, as we also observed for the Gibbs samplers in Section 3.2. The methods can be divided into three main categories based on the number of genes they select: at one extreme is the variational Bayes (varbvs) method, which provides the sparsest solution; the majority of methods select between 557 and 674 genes; and at the other extreme are the empirical Bayes JS procedure and the Beta(1, 1)-binomial prior, which are both based on the same prior and both select a very large number of genes, making these two methods the most conservative. The lack of sparsity induced by the Beta(1, 1)-binomial prior is perhaps not surprising, given that it does not satisfy the exponential decrease condition of [17]. We study this further in Section 5, where we compare different choices for the hyper-parameters of the beta-binomial prior in simulations.
We further see that the Spike-and-Slab LASSO and the empirical Bayes JS procedures finish almost instantly. The EBSparse method takes several seconds to run, as does the discretization algorithm. The Horseshoe and our exact model selection HMM take approximately two minutes to run, while the variational Bayes varbvs method requires a little over 20 minutes. Nevertheless, all methods are feasible even for practitioners who would like to perform multiple similar experiments, for example with different variations of the prior or slab distributions. By contrast, we do not include the Castillo-Van der Vaart algorithm with logarithmic representation, because based on extrapolation of Figure 2 we expect it to take around 20 days.
In Figure 4 we also plot the posterior means (or, in the case of the Spike-and-Slab LASSO, the MAP estimator) and the 800 largest Z-scores in absolute value. Since the posterior means for the model selection HMM and the discretization algorithm are the same, we label both as Beta(1, n + 1)-binomial, in reference to the prior that was used. We further note that the empirical Bayes JS estimates are hidden behind the data points, because they almost coincide with the observed Z-scores. We observe that the varbvs method induces the heaviest shrinkage, followed first by the Beta(1, n + 1)-binomial prior and the empirical Bayes EBSparse method, and then by the Horseshoe and the Beta(1, 1)-binomial prior. The least shrinkage is applied by the empirical Bayes JS method, which barely shrinks the observed Z-scores (if at all). The Spike-and-Slab LASSO is in a category of its own, because it is a MAP estimator: it applies no shrinkage to the coefficients that are selected, and sets all other coefficients to zero.

Asymptotics of Spike-and-Slab Priors
The choice of the prior Λ_n on the mixing hyper-parameter α in spike-and-slab priors is considered to be highly relevant for the behavior of the posterior. Castillo and Van der Vaart [17] recommend using Λ_n = Beta(κ, λ) with parameters κ = 1 and λ = n + 1. This prior induces heavy penalization of dense models (models with large sparsity parameter s) and was shown to have optimal theoretical properties. However, it is unknown whether such heavy penalization is indeed necessary and whether even heavier penalization results in suboptimal behavior.
In this section we investigate the asymptotic behavior of the posterior for different choices of the hyper-parameters κ and λ using our new exact algorithms, which can scale up to large sample sizes. We consider: i) the uniform prior with κ = 1 and λ = 1, which is often considered a natural choice [54]; ii) mild shrinkage, κ = 1 and λ = √n; iii) the choice κ = 1 and λ = n + 1 recommended by Castillo and Van der Vaart; iv) heavy shrinkage, κ = 1 and λ = n²; and finally v) a sparsity-discouraging choice, κ = n and λ = 1. We consider two experiments: A_1 and A_2. In both cases the sample sizes range from n = 50 to n = 20 000. In Experiment A_1 we set the true sparsity level to s = 10 and draw the non-zero signal coefficients uniformly between 1 and 10, i.e. θ_i ∼ U(1, 10) for i ∈ S. In Experiment A_2 the true sparsity level is taken to be s = n^{1/3} and the non-zero signal coefficients are set to θ_i = 2√(2 log n) for i ∈ S, which is a factor of 2 above the detection threshold. In Appendix B of the supplementary material we consider an additional experiment A_3 that is similar to A_1 but with s = 25 and θ_i ∼ U(5, 10) for i ∈ S, and which gives similar results to Experiment A_1.
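For concreteness, one data set from Experiment A_2 can be generated as in the following sketch (the function name is ours, and rounding n^{1/3} to the nearest integer is our choice of convention):

```python
import numpy as np

def simulate_a2(n, rng):
    """Simulate one data set from Experiment A_2:
    s = n^(1/3) non-zero coefficients, each equal to 2*sqrt(2*log(n))."""
    s = int(round(n ** (1 / 3)))  # integer rounding of n^(1/3) is our assumption
    theta = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)  # the true support S
    theta[support] = 2 * np.sqrt(2 * np.log(n))
    y = theta + rng.standard_normal(n)  # Y_i = theta_i + eps_i
    return y, theta, support
```

Experiment A_1 is analogous, with s = 10 and theta[support] drawn from U(1, 10) instead.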
We repeat each experiment 20 times and report the average ℓ₂-error between the posterior mean for θ and the true signal θ in Table 3 and Table 6 for Experiments A_1 and A_2, respectively. In Tables 4, 5, 7 and 8 we also report the average false discovery rates and the average true positive rates. Standard deviations are provided in parentheses in all cases.

In Experiment A_1 we see that the ℓ₂-error is not very sensitive to the choice of hyper-parameters: the uniform prior i), the mild shrinkage ii), and Castillo and Van der Vaart's recommendation iii) all perform comparably. Only the heavy shrinkage iv) introduces too much penalization, especially for large models. Unsurprisingly, the choice of hyper-parameters v) is also substantially worse than the others, because it expresses exactly the wrong type of prior assumptions by heavily penalizing sparse models. In Experiment A_2 we see that the best hyper-parameters are Castillo and Van der Vaart's recommendation iii) and the heavy shrinkage iv), with the latter having a large variability in performance. Hyper-parameter choices i) and ii) introduce no or only mild penalization for large models, and indeed are observed to have somewhat worse performance than choices iii) and iv), with the difference becoming more pronounced for larger sample sizes. Finally, as in Experiment A_1, the hyper-parameter setting v) is the worst by far.
We also study the false discovery rate (FDR) and true positive rate (TPR) of the spike-and-slab priors (relatedly, see [14] for the theoretical underpinning of FDR control with empirical Bayes spike-and-slab priors). Unsurprisingly, the FDR is smallest in both experiments in the case of heavy shrinkage iv), but almost equally good rates are obtained for the recommended choice iii). Mild ii) or no i) shrinkage result in somewhat worse FDR, while the sparsity-discouraging setting v) essentially selects all the noise. In Experiment A_1 the best TPR is obtained, not surprisingly, by setting v), which conservatively selects everything. Hyper-parameter choices i) and ii) perform comparably well, closely followed by iii), while the heavy shrinkage method iv) is substantially worse. In Experiment A_2 all hyper-parameter settings perform equally well, except for the heavy shrinkage iv), which is slightly worse. The good performance of the methods is due to the relatively high value (2√(2 log n)) of the non-zero signal coefficients, which lies above the detection threshold √(2 log n).
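For reference, the FDR and TPR of a selected set of coordinates are computed against the true support as in this small sketch (our convention here: the FDR is 0 when nothing is selected):

```python
import numpy as np

def fdr_tpr(selected, true_support, n):
    """False discovery rate and true positive rate of a selection.

    selected, true_support: iterables of coordinate indices in 0..n-1.
    """
    sel = np.zeros(n, dtype=bool)
    sel[list(selected)] = True
    truth = np.zeros(n, dtype=bool)
    truth[list(true_support)] = True
    discoveries = sel.sum()
    false_disc = (sel & ~truth).sum()   # selected but truly zero
    true_pos = (sel & truth).sum()      # selected and truly non-zero
    fdr = false_disc / discoveries if discoveries > 0 else 0.0
    tpr = true_pos / truth.sum() if truth.sum() > 0 else 1.0
    return fdr, tpr
```

Averaging these quantities over the 20 repetitions gives the entries of Tables 4, 5, 7 and 8.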
We conclude that, overall, the recommended choice iii) indeed appears to have an advantage over the alternatives, and that even heavier penalization, as in choice iv), is harmful.
The above simulation study is just one example of how our exact algorithms can be used to study asymptotic properties of model selection priors, and more specifically of spike-and-slab priors. Another possible application, not considered here, would be to study the accuracy of Bayesian uncertainty quantification (see [16] for frequentist coverage of Bayesian credible sets resulting from empirical Bayes spike-and-slab priors).

Discussion
We have proposed fast and exact algorithms for computing the Bayesian posterior distribution corresponding to model selection priors (including spike-and-slab priors as a special case) in the sparse normal sequence model. Since the normal sequence model corresponds to linear regression with identity design, the question arises whether the derived algorithms can be extended to sparse linear regression with more general designs or other more complex models. We first note that all methods are agnostic about where the conditional densities of the spikes p(Y_i | B_i = 0) = φ(Y_i) and the slabs p(Y_i | B_i = 1) = ψ(Y_i) come from. It is therefore trivial to extend them to any model that replaces the distribution of Y_i given S by p(Y_i | S) = ψ_i(Y_i) if i ∈ S and p(Y_i | S) = φ_i(Y_i) otherwise, for any densities ψ_i and φ_i. (In fact, this is already supported by our R package [62].) Such extensions make it possible to easily handle other noise models for ε_i or general diagonal designs; and, as pointed out by a referee, they also allow incorporating a non-atomic prior on θ_i in case i ∈ S. We further anticipate that extensions to general sparse design matrices may be possible by generalizing the HMM from Section 2.1 to more general Bayesian networks and applying a corresponding inference algorithm to compute marginal posterior probabilities. However, for non-sparse design matrices the extension would be very challenging, if possible at all, because the Bayesian network of the hidden states could become fully connected. An interesting intermediate case is studied by Papaspiliopoulos and Rossell [43], who consider best-subset selection for block-diagonal designs. For the normal sequence model, their assumptions amount to the requirement that Λ_n is a point-mass on a single α, and they point out that in this case "best-subset selection becomes trivial." For non-diagonal designs their results are non-trivial, because they are able to integrate over a continuous hyper-prior on the variance σ² of the noise ε_i.
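To illustrate what such a computation looks like with arbitrary per-coordinate densities φ_i and ψ_i, the following didactic sketch (our reconstruction for small n, not the optimized algorithm from Section 2.1) obtains exact marginal posterior inclusion probabilities for a general model selection prior by a forward-backward recursion over the number of slab coordinates:

```python
import numpy as np
from math import comb

def inclusion_probs(y, prior, spike_pdf, slab_pdf):
    """Exact q_i = P(i in S | Y) under a model selection prior:
    s ~ prior (array of length n+1), then S uniform among the
    C(n, s) subsets of size s. Didactic, unnormalized-weights version;
    a practical implementation would work in log-space.
    """
    n = len(y)
    phi, psi = spike_pdf(y), slab_pdf(y)
    w = np.array([prior[s] / comb(n, s) for s in range(n + 1)])

    # f[i][k]: sum over spike/slab assignments of y[0:i] with k slabs.
    f = np.zeros((n + 1, n + 1))
    f[0, 0] = 1.0
    for i in range(1, n + 1):
        f[i, 0] = f[i - 1, 0] * phi[i - 1]
        for k in range(1, i + 1):
            f[i, k] = f[i - 1, k] * phi[i - 1] + f[i - 1, k - 1] * psi[i - 1]

    # b[i][m]: sum over spike/slab assignments of y[i:n] with m slabs.
    b = np.zeros((n + 1, n + 1))
    b[n, 0] = 1.0
    for i in range(n - 1, -1, -1):
        for m in range(n - i + 1):
            b[i, m] = phi[i] * b[i + 1, m]
            if m > 0:
                b[i, m] += psi[i] * b[i + 1, m - 1]

    Z = float(np.dot(w, f[n]))  # marginal likelihood of the data
    q = np.empty(n)
    for i in range(n):
        # c[t] = sum_k f[i, k] * b[i+1, t-k]; total slabs s = t + 1.
        c = np.convolve(f[i], b[i + 1])
        q[i] = psi[i] * np.dot(w[1:], c[:n]) / Z
    return q
```

Replacing `spike_pdf` and `slab_pdf` by coordinate-dependent densities (e.g. for other noise models or a diagonal design) changes nothing in the recursion itself.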
In contrast, we assume a fixed σ², which we then take to be σ² = 1 without loss of generality. Our methods can be used to calculate the marginal likelihood p(Y | σ²) without further computational overhead, so it would be possible to run them multiple times to incorporate a discrete prior on a grid of values for σ², but it is not obvious whether our results can be extended to continuous priors over σ². Exploration of these directions is left for future work.
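Schematically, a discrete prior on a grid of σ² values only requires one marginal likelihood evaluation per grid point. In the sketch below, `marginal_likelihood` is a simple stand-in (a fixed-α spike-and-slab with Gaussian slab, where the marginal likelihood factorizes per coordinate), not one of our exact algorithms:

```python
import numpy as np
from scipy.stats import norm

def marginal_likelihood(y, sigma2, alpha=0.05, tau2=4.0):
    """log p(Y | sigma^2) for a stand-in model: spike-and-slab with
    fixed mixing weight alpha and Gaussian N(0, tau2) slab on theta_i.
    (A placeholder for one run of an exact algorithm at fixed sigma^2.)"""
    spike = norm.pdf(y, scale=np.sqrt(sigma2))
    slab = norm.pdf(y, scale=np.sqrt(sigma2 + tau2))
    return np.sum(np.log((1 - alpha) * spike + alpha * slab))

def sigma2_posterior(y, grid, prior=None):
    """Posterior over a grid of sigma^2 values under a discrete prior."""
    prior = np.full(len(grid), 1 / len(grid)) if prior is None else prior
    logp = np.array([marginal_likelihood(y, s2) for s2 in grid])
    logp += np.log(prior)
    p = np.exp(logp - logp.max())  # subtract the max for stability
    return p / p.sum()
```

The overall posterior for θ is then a mixture of the fixed-σ² posteriors, weighted by these grid probabilities.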
Even without extending our methods to full linear regression or continuous priors on σ 2 , we believe that they are already very useful as a benchmark procedure: any approximation technique for general linear regression may be applied to the special case of sparse normal sequences and its approximation error computed as in Section 3.2. If a method does not work well in this special case, then certainly we cannot trust it for more general regression. The existence of such a benchmark method is very important, since, for instance, there are no available diagnostics to determine whether Markov Chain Monte Carlo samplers have converged to their stationary distribution or if they have explored a sufficient proportion of the models in the model space.
We have also explored the exact connection between general model selection priors and the more specific spike-and-slab priors. Since for spike-and-slab priors one can construct faster algorithms, it is useful to know which model selection priors can be represented in this form. The proof of our result amounts to a finite sample version of de Finetti's theorem for a particular subclass of exchangeable distributions, which may be of interest in its own right.

Supplementary Material
Supplement: Fast Exact Bayesian Inference for Sparse Signals in the Normal Sequence Model (DOI: 10.1214/20-BA1227SUPP; .pdf). The supplement contains a review of the exact algorithm by Castillo and Van der Vaart and a discussion of how to perform all computations in a logarithmic representation. It further includes an additional variation on Experiment A_1 from Section 5. Finally, the proofs of all theorems and the examples from Section 2.3 are also given in the supplement.