Dynamic Variable Selection with Spike-and-Slab Process Priors

The spike-and-slab methodology for variable selection has been widely used in statistical data analysis since its inception more than two decades ago. Developments for varying coefficient models, however, have received far less attention. Here, we address the problem of dynamic variable selection in time series regression, where the set of active predictors is allowed to evolve over time. To capture time-varying variable selection uncertainty, we introduce new dynamic shrinkage priors for the time series of regression coefficients. These priors are characterized by two main ingredients: smooth parameter evolutions and intermittent zeroes for modeling predictive breaks. More formally, our proposed Dynamic Spike-and-Slab (DSS) priors are constructed as mixtures of two processes: a spike process for the irrelevant coefficients and a slab autoregressive process for the active coefficients. The mixing weights are themselves time-varying and depend on a lagged value of the series. A key distinguishing feature of DSS priors is that their stationary distribution is fully known and characterized by spike-and-slab marginals. This property guarantees marginal stability and probabilistic coherence. We characterize dynamic selection thresholds for MAP smoothing and implement a one-step-late EM algorithm for fast calculations. We demonstrate, through simulation and a topical high-dimensional macroeconomic dataset, that DSS priors are far more effective at finding signal and improving forecasts than existing strategies.


Dynamic Sparsity
For dynamic linear modeling with many potential predictors, the assumption of a static generative model with a fixed subset of regressors (albeit with time-varying regressor effects) may be misleadingly restrictive. By obscuring variable selection uncertainty over time, confinement to a single inferential model may lead to poorer predictive performance, especially when the actual effective subset at each time is sparse. The potential for dynamic model selection techniques in time series modeling has been recognized (Frühwirth-Schnatter & Wagner, 2010; Groen et al., 2013; Nakajima & West, 2013a; Kalli & Griffin, 2014; Chan et al., 2012). A prominent practical example is inflation forecasting, where large sets of predictors are available and it is expected that different sets of predictors will predict well under different economic regimes/situations (Groen et al., 2013; Koop & Korobilis, 2012; Kalli & Griffin, 2014). For example, in recessions distress-related factors may be effective predictors, while having no predictive power in expansions. Motivated by such contexts, we develop a new dynamic shrinkage approach for time series models that exploits time-varying predictive subset sparsity when it exists.
We present our approach in the context of dynamic linear models (West & Harrison, 1997) (or varying coefficient models with a time effect modifier (Hastie & Tibshirani, 1993)) that link a scalar response y_t at time t to a set of p known regressors x_t = (x_{t1}, . . . , x_{tp})' through the relation

y_t = x_t' β_t^0 + ε_t,    t = 1, . . . , T,    (1)

where β_t^0 = (β_{t1}^0, . . . , β_{tp}^0)' is a time-varying vector of regression coefficients and where the innovations ε_t come from N(0, 1). The challenge of estimating the T × p coefficients in (1), with merely T observations, is typically made feasible with a smoothness-inducing state-space model that treats {β_t^0}_{t=1}^T as realizations from a (vector autoregressive) stochastic process (2).

Nevertheless, any regression model with a large number of potential predictors will still be vulnerable to overfitting. This phenomenon is perhaps even more pronounced here, where the regression coefficients are forced to be dynamically intertwined. The major concern is that overfitted coefficient evolutions disguise true underlying dynamics and provide misleading representations with poor out-of-sample predictive performance. For long-term forecasts, this concern is exacerbated by the proliferation of the state space in (2). As the model propagates forward, the non-sparse state innovation accumulates noise, further hindering out-of-sample forecasting ability. With many potentially irrelevant predictors, seeking sparsity is a natural remedy against the loss of statistical efficiency and forecasting ability.

We shall assume that p is potentially very large, where possibly only a small portion of the predictors is relevant for the outcome at any given time. Besides time-varying regressor effects, we adopt the point of view that the regressors are allowed to enter and leave the model as time progresses, rendering the subset selection problem ultimately dynamic.
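As a concrete illustration of this setup, the following sketch (with illustrative, made-up parameter values, not taken from the paper) simulates smooth AR(1) coefficient paths β_t and the resulting responses y_t = x_t'β_t + ε_t:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 200, 3
phi0, phi1, sd_innov = 0.0, 0.98, 0.1   # illustrative AR(1) hyper-parameters

# Smooth AR(1) paths for the time-varying coefficients beta_t
beta = np.zeros((T, p))
for t in range(1, T):
    beta[t] = phi0 + phi1 * (beta[t - 1] - phi0) + sd_innov * rng.standard_normal(p)

# Responses y_t = x_t' beta_t + eps_t with standard normal innovations
X = rng.standard_normal((T, p))
y = np.einsum("tj,tj->t", X, beta) + rng.standard_normal(T)
```

With many predictors and no sparsity, every coefficient path injects state innovation noise into the forecasts, which is precisely the overfitting concern raised above.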
This anticipation can be reflected by the following sparsity manifestations in the p × T matrix of regression coefficients B^0 = [β_1^0, . . . , β_T^0]: (a) horizontal sparsity, where each individual time series {β_{tj}^0}_{t=1}^T (for j = 1, . . . , p) allows for intermittent zeroes during periods when the j-th predictor is not persistently predictive; (b) vertical sparsity, where only a subset of the coefficients β_t^0 = (β_{t1}^0, . . . , β_{tp}^0)' (for t = 1, . . . , T) will be active at the t-th snapshot in time.
This problem has been addressed in the literature by multiple authors including, for example, Groen et al. (2013); Belmonte et al. (2014); Koop & Korobilis (2012); Kalli & Griffin (2014); Nakajima & West (2013a). Other related works include shrinkage approaches towards static coefficients in time-varying models (Frühwirth-Schnatter & Wagner, 2010; Bitto & Frühwirth-Schnatter, 2016; Lopes et al., 2016). We approach the dynamic sparsity problem through the lens of Bayesian variable selection and develop it further for varying coefficient models. Namely, we assume the traditional spike-and-slab setup by assigning each regression coefficient β_tj a mixture prior underpinned by a binary latent indicator γ_tj, which flags the coefficient as being either active or inert. While static variable selection with spike-and-slab priors has received a lot of attention (Carlin & Chib, 1995; Clyde et al., 1996; George & McCulloch, 1993; Mitchell & Beauchamp, 1988), the literature on dynamic variants is far more sparse (George et al., 2008; Frühwirth-Schnatter & Wagner, 2010; Nakajima & West, 2013a; Groen et al., 2013). Narrowing this gap, this work proposes several new dynamic extensions of popular spike-and-slab priors.
We would like to draw particular attention to the latent threshold process of Nakajima & West (2013a,b); Zhou et al. (2014); Nakajima & West (2015, 2017), a related regime-switching scheme for either shrinking coefficients exactly to zero or leaving them alone on their autoregressive path. The model assumes a latent autoregressive process {b_tj}_{t=1}^T, giving rise to the actual coefficients {β_tj}_{t=1}^T only when it meanders away from a latent basin [−d_j, d_j] around zero. This process is reminiscent of a dynamic extension of point-mass mixture priors that exhibit exact zeros (Mitchell & Beauchamp, 1988). Recently, there has been a resurgence of interest in continuous spike-and-slab variants due to their amenability to fast computation. The latent threshold approach has, so far, relied on rather laborious MCMC implementations. In this work, we propose new dynamic continuous spike-and-slab alternatives for which we develop a fast optimization algorithm.
The main thrust of this work is to introduce Dynamic Spike-and-Slab (DSS) priors, a new class of time series priors which induce either smoothness or shrinkage towards zero. These processes are formed as mixtures of two stationary time series: one for the active and another for the negligible coefficients. The DSS priors pertain closely to the broader framework of mixture autoregressive (MAR) processes with a given lag, where the mixing weights are allowed to depend on time. Despite the reported success of MAR processes (and variants thereof) for modeling non-linear time series (Wong & Li, 2000; Kalliovirta et al., 2015; Wood et al., 2011), their potential as dynamic sparsity-inducing priors has been unexplored. Here, we harvest this potential within a dynamic variable selection framework. The remarkable feature of DSS priors that sets them apart from the latent threshold model is that they yield benchmark continuous spike-and-slab priors (such as the Spike-and-Slab LASSO of Rockova (2017)) as their marginal stationary distribution. This property guarantees marginal stability in the selection/shrinkage dynamics, as well as probabilistic coherence.
By turning our time-domain priors into penalty constructs, we formalize the notion of prospective and retrospective shrinkage through doubly adaptive shrinkage terms that pull together past, current, and future information. We introduce asymmetric dynamic thresholding rules, extensions of existing rules for static symmetric regularizers (Fan & Li, 2001; Antoniadis & Fan, 2001), to characterize the behavior of joint posterior modes for MAP smoothing. For calculation, we implement the one-step-late EM algorithm of Green (1990), which capitalizes on fast closed-form one-site updates. Our dynamic penalties can be regarded as natural extensions of the spike-and-slab penalty functions introduced by Rockova (2017) and further developed by Rockova & George (2016) and Rockova & George (2015). The DSS priors are deployed here as a fast MAP smoothing catalyst rather than as a vehicle for a full-blown MCMC analysis (as in Nakajima & West (2013a)). The key distinguishing feature of this deployment is that DSS priors attain sparsity through sparse posterior modes rather than through auxiliary thresholding.
We demonstrate the effectiveness of our DSS priors with a thorough simulation study and a topical macroeconomic application. Both studies highlight the comparative improvements, in terms of inference, forecasting, and computational time, of DSS priors over conventional and recent methods in the literature. In particular, the macroeconomic application, which uses a large number of economic indicators to forecast inflation and to infer underlying economic structures, serves as a motivating example of why dynamic sparsity is effective, and even necessary, in these contexts.
The paper is structured as follows: Section 2 and Section 3 introduce the DSS processes and their variants. Section 4 develops the penalized likelihood perspective, introducing the prospective and retrospective shrinkage terms. Section 5 provides useful characterizations of the global posterior mode. Section 6 develops the one-step-late EM algorithm for MAP smoothing. Section 7 illustrates the MAP smoothing deployment of DSS on simulated examples and Section 8 on a macroeconomic dataset. Section 9 concludes with a discussion.
2. Dynamic Spike-and-Slab Priors

In this section, we introduce the class of Dynamic Spike-and-Slab (DSS) priors that constitute a coherent extension of benchmark spike-and-slab priors for dynamic selection/shrinkage. We will assume that the p time series {β_tj}_{t=1}^T (for j = 1, . . . , p) in (1) follow independent and identical DSS priors and thereby suppress the subscript j (for notational simplicity).
We start with a conditional specification of the DSS prior. Given a binary indicator γ_t ∈ {0, 1}, which encodes the spike/slab membership at time t, and a lagged value β_{t−1}, we assume that β_t arises from a mixture of the form

π(β_t | γ_t, β_{t−1}) = (1 − γ_t) ψ_0(β_t | λ_0) + γ_t ψ_1(β_t | μ_t, λ_1),    (5)

where

μ_t = φ_0 + φ_1(β_{t−1} − φ_0)    (6)

and

γ_t | θ_t ∼ Bernoulli(θ_t).    (7)

For Bayesian variable selection, it has been customary to specify a zero-mean spike density ψ_0(β | λ_0), such that it concentrates at (or in a narrow vicinity of) zero. One purposeful choice is the Laplace density ψ_0(β | λ_0) = (λ_0/2) e^{−λ_0 |β|} (with a relatively large penalty parameter λ_0 > 0), due to its ability to threshold via sparse posterior modes (Rockova, 2017). Regarding the slab distribution ψ_1(β_t | μ_t, λ_1), we require that it be moderately peaked around its mean μ_t, where the amount of spread is regulated by λ_1 > 0. While our framework encompasses a broad range of possible choices of ψ_0(·) and ψ_1(·) (as elaborated on in Section 3), we will focus primarily on the Gaussian slab ψ_1(β_t | μ_t, λ_1) (with mean μ_t and variance λ_1) due to its ability to smooth over past/future values. Our framework can be extended to higher-order autoregressive polynomials, where μ_t may also depend on values older than β_{t−1}. Here, however, we focus on the first-order autoregression due to its practicality and ubiquity (Tibshirani et al., 2005; West & Harrison, 1997; Prado & West, 2010).
The conditional DSS prior formulation (5) generalizes existing continuous spike-and-slab priors (George & McCulloch, 1993; Ishwaran & Rao, 2005; Rockova & George, 2016) in two important ways. First, rather than centering the slab around zero, the DSS prior anchors it around an actual model for the time-varying mean μ_t, a new distinctive feature. The non-central mean is defined as an autoregressive lag polynomial of the first order with hyper-parameters (φ_0, φ_1). Although estimable (subject to stationarity restrictions), these parameters will be treated as fixed. Assuming a fixed φ_1 is not too far from common practice in the Bayesian literature (Omori et al., 2007; Nakajima & West, 2013a, to name a few), which consists of imposing a tight prior around, but below, 1 for φ_1 to ensure stable, stationary estimation. It is illuminating to regard the conditional prior (5) as a "multiple shrinkage" prior (George, 1986b,a) with two shrinkage targets: (1) zero (the gravitational pull of the spike), and (2) μ_t (the gravitational pull of the slab). It is also worth emphasizing that the spike distribution ψ_0(β_t | λ_0) does not depend on β_{t−1}; only the slab does. The DSS formulation thus induces a separation of the regression coefficients into two groups, where only the active ones are assumed to walk on an autoregressive path.
The second important and distinctive generalization is implicitly hidden in the hierarchical formulation of the mixing weights θ_t in (7), which casts them as a smoothly evolving process (as will be seen in Section 2·1 below). Before turning to this formulation, we note that the conditional form (5)-(7) is a mixture of two stationary processes. Under the spike distribution, the series {β_t}_{t=1}^T is stationary and iid with marginal density ψ_0(β | λ_0). Under the slab distribution, {β_t}_{t=1}^T follows a stationary Gaussian AR(1) process whose univariate stationary marginals are given by

ψ_1^ST(β | λ_1, φ_0, φ_1) = N(φ_0, λ_1/(1 − φ_1²)),    (9)

a Gaussian density with mean φ_0 and variance λ_1/(1 − φ_1²). The availability of this tractable stationary distribution (9) is a major appeal of the conditional Gaussian slab distribution. However, the DSS construction is not confined to the Gaussian slab (Laplace spike). We elaborate on alternative choices in Section 3.

2·1. Evolving Inclusion Probabilities

A very appealing feature of DSS priors that makes them suitable for dynamic subset selection is the opportunity they afford for obtaining "smooth" spike/slab memberships. Recall that the binary indicators in (7) determine which of the spike or slab regimes is switched on at time t, where P(γ_t = 1 | β_{t−1}) = θ_t. It is desirable that the sequence of slab probabilities {θ_t}_{t=1}^T evolve smoothly over time, allowing for changes in variable importance as time progresses while, at the same time, avoiding erratic regime switching. Because the series {θ_t}_{t=1}^T is a key driver of the sparsity pattern, it is important that it be (marginally) stable and that it reflect all relevant information, including not only the previous value θ_{t−1} but also the previous value β_{t−1}. Many possible constructions of θ_t could be considered. We turn to stationarity as a guide for a principled construction of θ_t.
Although the {β_t}_{t=1}^T process is stationary under each of the spike and slab distributions separately, it is not immediately obvious that it will be stationary under the spike-and-slab mixture, where the β_t's can transition between these two distributions and where θ_t depends on β_{t−1}. However, with a suitable formulation for the {θ_t}_{t=1}^T sequence, the stability and coherence of the DSS can be maintained. Under this formulation (introduced below), the {β_t}_{t=1}^T process will not only be stationary, but will have spike-and-slab marginals. Such a θ_t sequence is obtained with a deterministic transition function θ_t = θ(β_{t−1}) of the lagged β_t values. For our formulation, we introduce a marginal importance weight 0 < Θ < 1, a scalar parameter which controls the overall balance between the spike and the slab distributions. Given (Θ, λ_0, λ_1, φ_0, φ_1), the conditional inclusion probability θ_t (or the transition function θ(β_{t−1})) is defined as

θ_t ≡ θ(β_{t−1}) = Θ ψ_1^ST(β_{t−1} | λ_1, φ_0, φ_1) / [Θ ψ_1^ST(β_{t−1} | λ_1, φ_0, φ_1) + (1 − Θ) ψ_0(β_{t−1} | λ_0)].    (10)

Before turning to the stationarity properties of the full DSS priors, we pause to appreciate the probabilistic meaning of (10). The conditional mixing weight θ_t can be interpreted as the posterior probability of classifying the past coefficient β_{t−1} as arriving from the stationary slab distribution, as opposed to the (stationary) spike distribution. This interpretation reveals how the weights {θ_t}_{t=1}^T proliferate parsimony throughout the process {β_t}_{t=1}^T. Suppose that |β_{t−1}| is large; then θ(β_{t−1}) will be large as well, signaling that the current coefficient β_t is more likely to be in the slab. The contrary occurs when |β_{t−1}| is small, where β_t is discouraged from the slab by a smaller inclusion weight θ(β_{t−1}). Let us also note that the weights in (10) are different from the conditional probabilities for classifying β_{t−1} as arising from the conditional slab in (5); those weights will be introduced later in Section 4.
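With a Laplace spike and the stationary Gaussian slab, the transition function (10) can be sketched numerically, and simulating the induced chain illustrates the spike-and-slab stationary marginals discussed above. A minimal sketch with illustrative, made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
Theta, lam0, lam1, phi0, phi1 = 0.5, 2.0, 1.0, 0.0, 0.9
s2 = lam1 / (1 - phi1**2)   # stationary slab variance, cf. (9)

def spike_pdf(b):           # Laplace spike: density (lam0/2) * exp(-lam0 * |b|)
    return 0.5 * lam0 * np.exp(-lam0 * np.abs(b))

def slab_st_pdf(b):         # stationary slab: N(phi0, s2)
    return np.exp(-(b - phi0) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def theta(b_prev):
    """Transition weight (10): posterior probability that b_prev came
    from the stationary slab rather than from the spike."""
    w = Theta * slab_st_pdf(b_prev)
    return w / (w + (1 - Theta) * spike_pdf(b_prev))

# Simulate a long DSS path: slab regime = AR(1) step, spike regime = iid Laplace draw
T = 200_000
beta = np.zeros(T)
for t in range(1, T):
    if rng.random() < theta(beta[t - 1]):
        beta[t] = phi0 + phi1 * (beta[t - 1] - phi0) + np.sqrt(lam1) * rng.standard_normal()
    else:
        beta[t] = rng.laplace(0.0, 1.0 / lam0)

# Stationary marginal is the Theta-weighted spike-and-slab mixture, so its
# variance should be Theta * s2 + (1 - Theta) * 2 / lam0**2
var_theory = Theta * s2 + (1 - Theta) * 2 / lam0**2
```

Near the origin, θ(·) increases with |β_{t−1}|, echoing the parsimony-propagating behavior described above, and the empirical variance of the simulated path matches the mixture variance of the Θ-weighted spike-and-slab marginal.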
A related deterministic transition function was proposed by Nakajima & West (2013a) (see (3) and (4)). Our formulation, however, yields a process {β_t}_{t=1}^T with a completely characterized marginal stationary distribution. It is tempting to regard Θ as the marginal proportion of nonzero coefficients. Such an interpretation is misleading, however, because sparsity levels are ultimately determined by the θ_t sequence, which is influenced by the component stationary distributions ψ_0(·) and ψ_1^ST(·), in particular by the amount of their overlap around zero. The interpretation is thus inapplicable for the continuous spike-and-slab mixtures considered here, where more caution is needed for calibration (Rockova, 2017). This issue will be revisited in Section 4. Now that we have elaborated on all the layers of the hierarchical model, we are ready to formally define the Dynamic Spike-and-Slab Process.
Definition 1. Equations (5), (6), (7) and (10) define a Dynamic Spike-and-Slab Process (DSS) with parameters (Θ, λ_0, λ_1, φ_0, φ_1). We will write {β_t}_{t=1}^T ∼ DSS(Θ, λ_0, λ_1, φ_0, φ_1).

The DSS process relates to the Gaussian mixture autoregressive (GMAR) process of Kalliovirta et al. (2015), which was conceived as a model for time series data with regime switches. Here, we deploy it as a prior on time-varying regression coefficients within the spike-and-slab framework, allowing for distributions other than Gaussian. The DSS, being an instance/elaboration of the GMAR process, inherits elegant marginal characterizations.
2·2. Stationarity

The DSS construction has a strong conceptual appeal in the sense that its marginal probabilistic structure is fully known. This property is rarely available with conditionally defined non-Gaussian time series models, where not much is known about the stationary distribution beyond the mere fact that it exists. The DSS process, on the other hand, guarantees well-behaved stable marginals that can be described through benchmark spike-and-slab priors.
Being inherently a mixture of stationary processes, the DSS process ought to be stationary. In the context of MAR models, Wong & Li (2000) describe stationarity restrictions on the autoregressive polynomial parameters as well as on the mixing weights, which are not time-varying in their setup. If the mixing weights θ_t were fixed, stationarity would then be inherited from the slab Gaussian AR(1) process when |φ_1| < 1. Going further, Kalliovirta et al. (2015) characterize the stationary distribution of the GMAR process with time-varying weights, which aligns closely with the DSS process. The following theorem is an elaboration of Theorem 1 of Kalliovirta et al. (2015).
Theorem 1. The process {β_t}_{t=1}^T ∼ DSS(Θ, λ_0, λ_1, φ_0, φ_1) has a stationary distribution characterized by the following univariate marginal distributions:

π^ST(β | Θ, λ_0, λ_1, φ_0, φ_1) = Θ ψ_1^ST(β | λ_1, φ_0, φ_1) + (1 − Θ) ψ_0(β | λ_0),    (11)

where ψ_1^ST(β | λ_1, φ_0, φ_1) is the stationary slab distribution (9).

Proof. We assume an initial condition β_0 ∼ π^ST(β_0 | Θ, λ_0, λ_1, φ_0, φ_1). Recall that the conditional density of β_1 given β_0 can be written as

π(β_1 | β_0) = θ_1 ψ_1(β_1 | μ_1, λ_1) + (1 − θ_1) ψ_0(β_1 | λ_0).

From the definition of θ_1 in (10), we have θ_1 π^ST(β_0) = Θ ψ_1^ST(β_0 | λ_1, φ_0, φ_1) and (1 − θ_1) π^ST(β_0) = (1 − Θ) ψ_0(β_0 | λ_0), so that the joint distribution can be written as

π(β_1, β_0) = Θ ψ_1(β_1 | μ_1, λ_1) ψ_1^ST(β_0 | λ_1, φ_0, φ_1) + (1 − Θ) ψ_0(β_1 | λ_0) ψ_0(β_0 | λ_0).

Integrating π(β_1, β_0) with respect to β_0, and using the stationarity of the slab AR(1) process, we obtain π^ST(β_1 | Θ, λ_0, λ_1, φ_0, φ_1).

Theorem 1 describes the very elegant property of DSS that the univariate marginals of this mixture process are Θ-weighted mixtures of the spike and slab marginals. It also suggests a more general recipe for mixing multiple stationary processes through the construction of the mixing weights (10).

3. Other Spike and Slab Densities

Rather than shrinking to the vicinity of the past value, one might like to entertain the possibility of shrinking exactly to the past value (Tibshirani et al., 2005). Such a property would be appreciated, for instance, in dynamic sparse portfolio allocation models to mitigate transaction costs associated with negligible shifts in the portfolio weights (Irie & West, 2016; Brodie et al., 2009; Jagannathan & Ma, 2003; Puelz et al., 2016). One way of attaining the desired effect would be to replace the Gaussian slab ψ_1(·) in (5) with a Laplace distribution centered at μ_t. This conditional construction, however, does not imply the Laplace distribution marginally; the univariate marginals are instead defined through the characteristic function given in (2.7) of Andel (1983). The lack of an available marginal density thwarts the specification of the transition weights (10) needed within our DSS framework. There are, however, avenues for constructing an autoregressive process with Laplace marginals, e.g., through the normal-gamma autoregressive (NGAR) process of Kalli & Griffin (2014). We define the following Laplace autoregressive (LAR) process as a special case.
Definition 2. We define the Laplace autoregressive (LAR) process through a scale-normal-mixture hierarchy in which {ψ_t}_{t=1}^T follow an exponential autoregressive process.

The LAR process exploits the scale-normal-mixture representation of the Laplace distribution, yielding Laplace marginals β_t ∼ ψ^ST(β_t | λ_1) ≡ Laplace(λ_1). This coherence property can be leveraged within our DSS framework as follows. If we replace the slab Gaussian AR(1) process in (5) with the LAR process and deploy its stationary marginal ψ^ST in the transition weights (10), we obtain a Laplace DSS variant with the Spike-and-Slab LASSO prior of Rockova (2017) as its marginal stationary distribution (according to Theorem 1).
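The scale-normal-mixture representation invoked here is the standard fact that a zero-mean Gaussian with an exponential mixing variance has Laplace marginals. The sketch below illustrates only that representation (not the full LAR recursion, whose details are in the paper), using a Monte Carlo moment check:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.5          # Laplace rate parameter: density (lam/2) * exp(-lam * |b|)
n = 500_000

# Exponential mixing variance with mean 2 / lam^2; beta | psi ~ N(0, psi)
psi = rng.exponential(scale=2 / lam**2, size=n)
beta = np.sqrt(psi) * rng.standard_normal(n)

# Laplace(lam) has mean 0, variance 2 / lam^2, and kurtosis E[b^4]/Var[b]^2 = 6
var_theory = 2 / lam**2
kurt = np.mean(beta**4) / beta.var() ** 2
```

The matching variance and kurtosis indicate that the mixed draws behave like Laplace draws, which is the marginal coherence the LAR construction relies on.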
It is worth pointing out an alternative autoregressive construction with Laplace marginals, proposed by Andel (1983), in which the AR(1) scheme (13) is considered.
The innovations in (13) come from a mixture of a point mass at zero, providing an opportunity to settle at the previous value, and a Laplace distribution. Again, by deploying this process in the slab, we obtain the Spike-and-Slab LASSO marginal distribution. Despite being methodologically appealing, these Laplace extensions are ultimately less attractive for implementation. Throughout the rest of the paper, we therefore focus on the Gaussian AR(1) slab process. Before proceeding, let us note that the spike distribution ψ_0(β | λ_0) can be replaced by any (continuous) density without disturbing the validity of Theorem 1. A Gaussian spike, for instance, would impose no new computational challenges due to its conditional conjugacy; however, additional thresholding would be required to obtain sparse posterior modes. In the sequel, we focus on the Laplace spike due to its automatic thresholding property.
4. Dynamic Spike-and-Slab Penalty

Spike-and-slab priors give rise to self-adaptive penalty functions for MAP estimation, as detailed in Rockova (2017) and Rockova & George (2016). Here, we introduce new penalty constructs for dynamic shrinkage implied by the DSS priors.

Fig. 1. Plots of the prospective penalty function
Remark 1. Note that the dependence on the previous value β_{t−1} in pen(β | β_{t−1}) is hidden in θ_t and μ_t. Throughout the paper, we will write ∂θ_t/∂β_{t−1} and ∂μ_t/∂β_{t−1} without reminding ourselves of this implicit relationship.
To gain some insight about the prospective penalty (14), it is helpful to point out the special case when μ_t = 0. Then both the spike and the slab are centered at zero, yielding a penalty reminiscent of the Spike-and-Slab LASSO of Rockova (2017). However, instead of mixing two LASSO penalties, here we obtain an adaptive variant of the elastic net (Zou & Hastie, 2005). With μ_t ≠ 0, the prospective penalty is no longer symmetric around zero (and not guaranteed to be concave). Figure 1 portrays the prospective penalty for two choices of β_{t−1} and two sets of tuning parameters φ_1, λ_1, λ_0 and Θ (assuming φ_0 = 0). Because the conditional transfer equation (5) is a mixture, pen(β | β_{t−1}) is apt to be multimodal. Figure 1(a) shows an obvious peak at zero (due to the Laplace spike), but also a peak around μ_t = 0.9 × β_{t−1}, prioritizing values in the close vicinity of the previous value (due to the non-central slab). From an implementation viewpoint, however, it is more desirable that the penalty be uni-modal, reflecting the size of the previous coefficient without ambiguity by suppressing one of the peaks. Such behavior is illustrated in Figure 1(b) and Figure 1(c), where the penalty flexibly adapts to |β_{t−1}| by promoting either zero or a value close to β_{t−1}. This effect is achieved with a relatively large stationary slab variance λ_1/(1 − φ_1²) = 10, a mild Laplace peak λ_0 = 1, and the marginal importance weight Θ = 0.9. Smaller values of Θ would provide overwhelming support for the zero mode. The parameter Θ should thus not be regarded as a proportion of active coefficients (as is customary with point-mass mixtures), but rather as an interpretation-free tuning parameter. Figure 1 plots pen(β | β_{t−1}) prospectively as a function of β, given the previous value β_{t−1}. It is also illuminating to plot pen(β_{t+1} | β) retrospectively as a function of β, given the future value β_{t+1}.
Two such retrospective penalty plots are provided in Figure 2(a) and Figure 2(b). When the future value is relatively large (β_{t+1} = 1.5 in Figure 2(b)), the penalty pen(β_{t+1} | β) has a peak near β_{t+1}, signaling that the value β_t must be large too. When the future value is small (β_{t+1} = 0 in Figure 2(a)), the penalty has a peak at zero, signaling that the current value β_t must be small. Again, this balance is achieved with a relatively large stationary slab variance and a large Θ.

Fig. 2. Plots of the retrospective penalty function and the mixing weight (10); panel (c): φ_0 = 0, φ_1 = 0.9, λ_1 = 10(1 − φ_1²).

The behavior of the prospective and retrospective penalties is ultimately tied to the mixing weight θ_t ≡ θ(β) in (10). It is desirable that θ(β) be increasing with |β|. However, the Laplace tails will begin to dominate for large enough |β|, where the probability θ(β) begins to drop (for |β| greater than a turning point δ). We can, however, make the turning point δ large enough with larger values of Θ and smaller values of λ_0, as indicated in Figure 2(c).
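The adaptive uni-/multi-modality of the prospective penalty can be reproduced numerically. The sketch below takes pen(β | β_{t−1}) to be the conditional mixture log-density from (5), up to an additive constant (an assumption of this illustration), and uses parameter values mimicking Figure 1(b)-(c):

```python
import numpy as np

Theta, lam0, phi0, phi1 = 0.9, 1.0, 0.0, 0.9
lam1 = 10 * (1 - phi1**2)    # so that the stationary slab variance is 10
s2 = lam1 / (1 - phi1**2)

def spike(b):                # Laplace spike density
    return 0.5 * lam0 * np.exp(-lam0 * np.abs(b))

def slab(b, mu):             # conditional Gaussian slab N(mu, lam1)
    return np.exp(-(b - mu) ** 2 / (2 * lam1)) / np.sqrt(2 * np.pi * lam1)

def slab_st(b):              # stationary slab N(phi0, s2)
    return np.exp(-(b - phi0) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def pen(b, b_prev):
    """Prospective penalty: conditional mixture log-density (5), up to a constant."""
    w = Theta * slab_st(b_prev)
    th = w / (w + (1 - Theta) * spike(b_prev))
    mu = phi0 + phi1 * (b_prev - phi0)
    return np.log(th * slab(b, mu) + (1 - th) * spike(b))

grid = np.linspace(-4, 4, 2001)
# Large previous value: the penalty peaks near mu = 0.9 * 3.0, not at zero
peak_large = grid[np.argmax(pen(grid, 3.0))]
# Negligible previous value: the penalty peaks at zero
peak_small = grid[np.argmax(pen(grid, 0.0))]
```

The two peak locations echo the behavior of Figure 1(b)-(c): a large |β_{t−1}| promotes values near β_{t−1}, while a negligible β_{t−1} promotes the zero mode.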
To fully grasp the shrinkage dynamics implied by the penalty (15), it is useful to study the partial derivative ∂Pen(β | β_{t−1}, β_{t+1})/∂|β|. This term encapsulates how much shrinkage we expect at time t, conditionally on (β_{t−1}, β_{t+1}). We separate it into two pieces: a prospective shrinkage effect λ(β | β_{t−1}), driven by the past value β_{t−1}, and a retrospective shrinkage effect λ(β | β_{t+1}), driven by the future value β_{t+1}.

4·1. Shrinkage "from the Past"

The prospective shrinkage term λ(β | β_{t−1}) pertains closely to the Bayesian penalty mixing introduced by Rockova (2017) and Rockova & George (2016) in the sense that it can be characterized as an adaptive linear combination (17) of the individual spike and slab shrinkage terms, with adaptive weights

p_t(β) = θ_t ψ_1(β | μ_t, λ_1) / [θ_t ψ_1(β | μ_t, λ_1) + (1 − θ_t) ψ_0(β | λ_0)].    (19)

Two observations are in order. First, writing p_t(β) = P(γ_t = 1 | β_t = β, β_{t−1}, θ_t), the weight (19) can be viewed as a posterior probability for classifying β as arising from the conditional slab (versus the spike) at time t, given the previous value β_{t−1}. Second, these weights are very different from θ_t in (10), which classify β as arising from the marginal slab (versus the spike). From (19), we can see how p_t(β) hierarchically transmits information about the past value β_{t−1} (via θ_t) to determine the right amount of shrinkage for β_t. This is achieved with a doubly adaptive chain reaction. Namely, if the previous value β_{t−1} is large, θ_t will be close to one, signaling that the next coefficient β_t is prone to be in the slab. Next, if β_t is in fact large, p_t(β_t) will be close to one, in which case the first summand in (17) becomes the leading term and shrinks β_t towards μ_t. If β_t is small, however, p_t(β_t) will be small as well, and the second term in (17) takes over to shrink β_t towards zero.
This gravitational pull is accelerated when the previous value β_{t−1} was negligible (zero), in which case θ_t will be even smaller, making it even more difficult for the next coefficient β_t to escape the spike. This mechanism explains how the prospective penalty adapts to both (β_{t−1}, β_t), promoting smooth forward proliferation of spike/slab allocations and coefficients.
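The doubly adaptive chain reaction can be checked numerically. Assuming the classification-probability form of p_t(β) described above (with a Laplace spike, Gaussian slab, and illustrative parameter values), p_t grows with both the current and the previous coefficient:

```python
import numpy as np

Theta, lam0, phi0, phi1 = 0.9, 1.0, 0.0, 0.9
lam1 = 10 * (1 - phi1**2)    # conditional slab variance; stationary variance 10
s2 = lam1 / (1 - phi1**2)

spike = lambda b: 0.5 * lam0 * np.exp(-lam0 * np.abs(b))
slab = lambda b, mu: np.exp(-(b - mu) ** 2 / (2 * lam1)) / np.sqrt(2 * np.pi * lam1)
slab_st = lambda b: np.exp(-(b - phi0) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def p_t(b, b_prev):
    """Conditional slab classification weight (19), given the previous value."""
    w = Theta * slab_st(b_prev)
    th = w / (w + (1 - Theta) * spike(b_prev))      # transition weight (10)
    mu = phi0 + phi1 * (b_prev - phi0)
    num = th * slab(b, mu)
    return num / (num + (1 - th) * spike(b))
```

Evaluating p_t at a few points shows the double adaptivity: a large previous value raises the weight through θ_t, and a large current value raises it further through the slab likelihood.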
4·2. Shrinkage "from the Future"

While the prospective shrinkage term promotes smooth forward proliferation, the retrospective shrinkage term λ(β | β_{t+1}) operates backwards. For simplicity, we will write p_{t+1} = p_{t+1}(β_{t+1}). The retrospective term synthesizes information from both (β_{t+1}, β_t) to contribute to shrinkage at time t. When (β_{t+1}, β_t) are both large, we obtain p_{t+1}(β_{t+1}) and θ_{t+1} that are both close to one. The shrinkage is then driven by the second summand in (23), forcing β_t to be shrunk towards the future value β_{t+1} (through μ_{t+1} = φ_0 + φ_1(β_t − φ_0)). When either β_{t+1} or β_t is small, shrinkage is targeted towards the stationary mean through the dominant term (22).

5. Characterization of the Joint Posterior Mode
Unlike previous developments (Nakajima & West, 2013a; Kalli & Griffin, 2014), this paper views Bayesian dynamic shrinkage through the lens of optimization. Rather than distilling posterior samples to learn about B = [β_1, . . . , β_T], we focus on finding the MAP trajectory B̂ = arg max_B π(B | Y). MAP sequence estimation problems (for non-linear non-Gaussian dynamic models) were previously addressed with, e.g., Viterbi-style algorithms (Godsill et al., 2001). Our optimization strategy is conceptually very different.
The key to our approach will be drawing upon the penalized likelihood perspective developed in Section 4. Namely, we develop a new dynamic coordinate-wise strategy, building on existing developments for static high-dimensional variable selection.

5·1. The One-dimensional Case
To illustrate the functionality of the dynamic penalty from Section 4, we start by assuming p = 1 and x_t = 1 in (1). This simple case corresponds to a sparse normal-means model, where the means are dynamically intertwined. We begin by characterizing some basic properties of the posterior mode β̂ = arg max_β π(β | y), where y = (y_1, . . . , y_T)' arises from (1) and β = (β_1, . . . , β_T)' is assigned the DSS prior.
Proof. We begin by noting that β̂_t is a maximizer in the t-th direction while keeping (β̂_{t−1}, β̂_{t+1}) fixed.
Lemma 1 formally certifies that the mode exhibits both (a) sparsity and (b) smoothness (through the prospective/retrospective shrinkage terms).
Remark 2. While Lemma 1 assumes 1 < t < T , the characterization also applies for t = 1, once we specify the initial condition β t=0 . The value β t=0 is not assumed known and will be estimated together with all the remaining parameters (Section 6). For t = T , an analogous characterization exists, where the shrinkage term and the selection threshold contain only the prospective portion of the penalty.

5·2. General Regression
When p > 1, there is a delicate interplay between the multiple series, where overfitting in one direction may impair recovery in other directions. As will be seen in Section 7, anchoring on sparsity is a viable remedy to these issues. We obtain analogous characterizations of the global mode. We will denote by ∆ − tj and ∆ + tj the selection thresholds (24) and (25) with x = x tj , β t−1 = β t−1j , and β t+1 = β t+1j .
Lemma 2. Denote by B = { β tj } T,p t,j=1 the global mode of π(B|Y ), by B tj all but the (t, j) th entry in B, and let z tj = y t − Σ i≠j x ti β ti and Z tj = x tj z tj . Then β tj satisfies the following necessary condition.

Proof. Follows from Lemma 1, noting that β tj is a maximizer in the (t, j) th direction while keeping B tj fixed.
Lemma 2 evokes coordinate-wise optimization for obtaining the posterior mode. However, the computation of the selection thresholds (∆ − tj , ∆ + tj ) (as well as the one-site maximizers (27)) requires numerical optimization. The lack of closed-form thresholding hampers practicality when T and p are even moderately large. In the next section, we propose an alternative strategy which capitalizes on closed-form thresholding rules.

6. One-step-late EM Algorithm

A (local) posterior mode B can be obtained either directly, by cycling over the one-site updates (28), or indirectly through an EM algorithm, the strategy pursued here. The direct approach consists of integrating out Γ = [γ 1 , . . . , γ T ] and solving a sequence of nonstandard optimization problems (28), which necessitate numerical optimization. The EM algorithm, on the other hand, treats Γ as missing data, obviating the need for numerical optimization by offering closed-form one-site updates. We now describe this dynamic EM approach for MAP estimation under DSS priors.
The initial vector β t=0 = (β 01 , . . . , β 0p ) at time t = 0 will be estimated together with all the remaining coefficients B. We assume that β 0 comes from the stationary distribution described in Theorem 1, where γ 0 = (γ 01 , . . . , γ 0p ) are independent binary indicators with P[γ 0j = 1 | Θ] = Θ for 1 ≤ j ≤ p. Knowing the stationary distribution thus has advantages for specifying the initial conditions. The goal is to obtain the joint mode [ B, β 0 ] of π(B, β 0 |Y ). To this end, we proceed iteratively, augmenting this objective function with the missing data [Γ, γ 0 ] and then maximizing w.r.t. [B, β 0 ]. An important observation that facilitates the derivation of the algorithm is that the prior distribution π(β 0 , B, γ 0 , Γ) can be factorized into products of the terms π(β tj |γ tj , β t−1j ) and π(γ tj |β t−1j ), defined in (5) and (7), respectively. For simplicity (and without loss of generality), we will assume φ 0 = 0 and thereby µ tj = φ 1 β t−1j . We endow the parameters [β 0 , B] with a superscript m to designate their most recent values at the m th iteration. In the E-step, we compute the expectation E γ 0 ,Γ|· log π(β 0 , B, γ 0 , Γ|Y ) with respect to the conditional distribution of [Γ, γ 0 ], given [B (m) , β (m) 0 ] and Y . All we really have to do is obtain p tj = P(γ tj = 1 | ·) from (19) when t > 0, and p 0j ≡ θ 1j ≡ θ(β 0j ) from (10), and replace all the γ tj 's in (30) with p tj 's. In the M-step, we maximize E γ 0 ,Γ|· log π(β 0 , B, γ 0 , Γ|Y ) w.r.t. [β 0 , B]. We proceed coordinate-wise, iterating over single-site updates of each β tj while keeping all the remaining parameters fixed. For 1 < t < T , the solution β (m+1) tj , if nonzero, needs to satisfy the first-order condition ∂Q tj (β)/∂β = 0 at β = β (m+1) tj .
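As a concrete illustration, the E-step weight can be viewed as the posterior probability that a coefficient was drawn from the slab rather than the spike. The sketch below is a minimal, hypothetical rendering: it assumes a Gaussian AR(1) slab with innovation variance λ 1 and a Laplace spike with scale λ 0 , which need not match the exact forms (19) and (10) in the text.

```python
import numpy as np

def laplace_pdf(b, lam):
    # Density of a Laplace spike at zero with rate lam
    return 0.5 * lam * np.exp(-lam * np.abs(b))

def normal_pdf(b, mu, var):
    # Density of a N(mu, var) slab increment
    return np.exp(-(b - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def e_step_prob(b, b_prev, theta, lam0, lam1, phi1):
    """Hypothetical E-step weight p_tj = P(gamma_tj = 1 | ...): the
    posterior probability that b arose from the AR(1) slab centered at
    phi1 * b_prev rather than from the spike at zero (a stand-in for
    the exact expression (19) in the text)."""
    slab = theta * normal_pdf(b, phi1 * b_prev, lam1)
    spike = (1.0 - theta) * laplace_pdf(b, lam0)
    return slab / (slab + spike)
```

Large coefficients that track their lagged value receive weights near one, while coefficients near zero are attributed mostly to the spike.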
For β ≠ 0, the first-order condition can be written as in (32), with the shrinkage weights obtained from (21). Recall that θ t+1j , defined in (10), depends on β tj (denoted by β above). This complicates the tractability of the M-step: if θ t+1j were fixed, we could obtain a simple closed-form solution β (m+1) tj through an elastic-net-like update (Zou & Hastie, 2005). We take advantage of this fact with a one-step-late (OSL) adaptation of the EM algorithm (Green, 1990). The OSL EM algorithm bypasses intricate M-steps by evaluating the intractable portions of the penalty derivative at the most recent value, rather than at the new value. We apply this trick to the last summand in (32): instead of treating θ t+1j as a function of β, we fix it at the most recent value β (m) tj . The resulting update (33) is a thresholding rule, with a shrinkage term that reflects the size of the penalty derivative at β. The exact thresholding property is obtained from sub-differential calculus, because Q tj (·) is not differentiable at zero (due to the Laplace spike). A very similar update is obtained for t = T , where all the terms involving p t+1j and θ t+1j in Λ tj , W tj and Z tj disappear; for t = 0, we obtain the update (34). The updates (33) and (34) can either be cycled over at each M-step, or performed just once per M-step.
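The one-step-late update can be caricatured as a soft-thresholding rule. The sketch below is schematic: the arguments stand in for Z tj , W tj , the spike penalty Λ tj and the selection threshold, all treated as precomputed quantities with θ t+1j frozen at the previous iterate; the exact expressions are those implied by (32)-(33) in the text, not reproduced here.

```python
import numpy as np

def osl_threshold_update(Z, W, Lam, Delta):
    """Stylized elastic-net-like M-step update: threshold the working
    statistic Z at the selection threshold Delta, otherwise shrink by
    the spike penalty Lam and rescale by the quadratic weight W.
    All four arguments are hypothetical stand-ins for the paper's
    Z_tj, W_tj, Lambda_tj and the selection threshold."""
    if abs(Z) <= Delta:
        return 0.0                            # thresholded to exact zero
    return np.sign(Z) * (abs(Z) - Lam) / W    # shrink, then rescale
```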
To illustrate the ability of the DSS priors to suppress noise and recover true signal, we consider a high-dimensional synthetic dataset and a topical macroeconomic dataset.

Synthetic High-Dimensional Data
We first illustrate our dynamic variable selection procedure on a simulated example with T = 100 observations generated from the model (1) with p = 50 predictors. The predictor values x tj are obtained independently from a standard normal distribution. Out of the 50 predictors, 46 never contribute to the model (predictors x t5 through x t50 ), where β 0 t5 = β 0 t6 = ... = β 0 t50 = 0 at all times. The predictor x t1 is a persisting predictor, where {β t1 } T t=1 is generated according to an AR(1) process (8) with φ 0 = 0 and φ 1 = 0.98 and where |β 0 t1 | > 0.5. The remaining three predictors are allowed to enter and leave the model as time progresses. The regression coefficients {β 0 t2 } T t=1 , {β 0 t3 } T t=1 and {β 0 t4 } T t=1 are again generated from an AR(1) process (φ 0 = 0 and φ 1 = 0.98). However, the values are thresholded to zero whenever the absolute value of the process drops below 0.5, creating zero-valued periods. The true sparse series of coefficients are depicted in Figure 3 (black lines).
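The simulated design can be reproduced along the following lines. This is a sketch: the AR(1) innovation scale and the unit observation-noise variance are our assumptions, and the additional requirement |β 0 t1 | > 0.5 for the persistent predictor is not enforced here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 100, 50
phi0, phi1 = 0.0, 0.98
sigma = 0.5                     # assumed AR(1) innovation scale

# AR(1) coefficient paths for the four signal predictors
beta = np.zeros((T, p))
state = np.zeros(4)
for t in range(T):
    state = phi0 + phi1 * (state - phi0) + rng.normal(0, sigma, 4)
    beta[t, :4] = state

# Predictors 2-4: thresholded to zero whenever |beta| drops below 0.5,
# creating zero-valued (inactive) periods; predictors 5-50 stay at zero
beta[:, 1:4] = np.where(np.abs(beta[:, 1:4]) < 0.5, 0.0, beta[:, 1:4])

X = rng.standard_normal((T, p))             # iid N(0,1) predictor values
y = (X * beta).sum(axis=1) + rng.standard_normal(T)   # unit noise assumed
```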
We begin with the standard DLM approach, which is equivalent to DSS when the selection indicators are switched on at all times, i.e., γ tj = 1 for t = 0, . . . , T and j = 1, . . . , p. The autoregressive parameters are set to the true values φ 0 = 0 and φ 1 = 0.98. Plots of the estimated trajectories of the first 6 series (including the 4 active ones) are in Figure 3 (orange lines). In the absence of the spike, the estimated series of coefficients cannot achieve sparsity. By failing to discern whether coefficients are active or inactive, the state process confuses the source of the signal, distributing it across the redundant covariates. This results in a loss of efficiency and poor recovery.
Hoping to improve on this recovery, we deploy the DSS process with a sparsity-inducing spike. For now, the hyper-parameters are chosen as φ 0 = 0, φ 1 = 0.98, λ 0 = 0.9, λ 1 = 25(1 − φ 2 1 ), and Θ = 0.98. This choice corresponds to a very mild separation between the stationary spike and slab distributions, and to unimodal retrospective and prospective penalties. Later, we explore the sensitivity of DSS to this choice. We deploy the one-step-late EM algorithm outlined in Section 6, initializing the calculation with a zero matrix. We assume that the initial vector β t=0 is drawn from the stationary distribution and estimate it together with all the other parameters, as prescribed in Section 6. The recovered series have a strikingly different pattern compared to the non-sparse solution (Figure 3, blue lines). First, the MAP series closely tracks the periods of predictor importance/irrelevance, achieving dynamic variable selection. Second, by harvesting sparsity, the DSS priors alleviate bias in the nonzero directions, outputting a cleaner representation of the true underlying signal.
We also compare the performance to the NGAR process of Kalli & Griffin (2014) and to the LASSO method; the latter does not take into account the temporal nature of the problem. For NGAR, we use the default settings, b * = s * = 0.1, with 1,000 burn-in and 1,000 MCMC iterations. For LASSO, we sequentially refit a static regression in an expanding-window fashion, using data 1 : t for each t = 1 : T to produce a series of quasi-dynamic coefficients (a common practice for applying static shrinkage methods to time series data (Bai & Ng, 2008; De Mol et al., 2008; Stock & Watson, 2012; Li & Chen, 2014)), choosing λ via 10-fold cross-validation.
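The expanding-window LASSO benchmark can be sketched as follows. For self-containment we use a plain coordinate-descent solver with a fixed penalty lam (in the text, λ is instead chosen by 10-fold cross-validation), and the burn-in length t0 is our assumption.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent LASSO with a fixed penalty lam
    (a stand-in for the cross-validated fit used in the text)."""
    T, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            w = X[:, j] @ X[:, j]
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / w
    return beta

def expanding_window_lasso(X, y, lam, t0=20):
    """Refit on data 1:t for each t >= t0, yielding the quasi-dynamic
    coefficient path used as the static-shrinkage benchmark."""
    T, p = X.shape
    path = np.zeros((T, p))
    for t in range(t0, T + 1):
        path[t - 1] = lasso_cd(X[:t], y[:t], lam)
    return path
```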
For the first series, the only persistent one, both NGAR and DSS succeed in tracing the true signal. This is especially apparent in contrast to DLM and LASSO, which both significantly undervalue the signal. The estimated coefficient evolutions for DLM and LASSO are inconclusive for assessing variable importance, since the coefficient estimates for the relevant variables are diluted by the inflated estimates for the irrelevant variables. For the second to fourth series, with their intermittent zeros, DSS is the only method able to separate the true zero/nonzero signal (seen from the flat coefficient estimates during inactive periods). NGAR, on the other hand, does not seem to shrink enough, instead smoothing the series as if curve-fitting. The LASSO method produces sparse estimates; however, its variable selection is not linked over time and is thereby erratic. For the two zero series (series five and six), DSS is clearly the only method that truly shrinks noise to zero. The DSS priors mitigate overfitting by eliminating noisy coefficients, thereby leaving enough room for the true predictors to capture the trend. As a revealing byproduct, we also obtain the evolving mixing weights that determine the relevance of each coefficient at each time. The evolutions of the rescaled weights (so that θ(0) = 0) are plotted in Figure 4. These companion plots are helpful visualizations of the time-varying sparsity profile.
We repeat this experiment 10 times, generating different responses and regressors using the same set of coefficients. We compare the sum of squared errors (SSE) and the Hamming distance between the MAP estimate B and the true series B 0 . Table 1 reports average performance metrics over the 10 experiments. The performance of DSS is compared to the full DLM model (West & Harrison, 1997), NGAR (Kalli & Griffin, 2014) and LASSO. For DLM and NGAR, we use the same specifications as above. For DSS, we now explore a multitude of combinations of hyper-parameters, with φ 1 = {0.95, 0.98}, λ 0 = {0.7, 0.9}, λ 1 = {10(1 − φ 2 1 ), 25(1 − φ 2 1 )}, and Θ = {0.9, 0.95, 0.98}. All these parameters are in the mild sparsity range, not over-emphasizing the spike. We initialize the calculation with a zero matrix.
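The two performance metrics are straightforward to compute; a minimal sketch:

```python
import numpy as np

def sse(B_hat, B_true):
    """Sum of squared errors between estimated and true coefficient paths."""
    return float(np.sum((np.asarray(B_hat) - np.asarray(B_true)) ** 2))

def hamming_pct(B_hat, B_true):
    """Percentage of (t, j) entries whose zero/nonzero status disagrees,
    i.e., the Hamming distance between supports, in percent."""
    return 100.0 * float(np.mean((np.asarray(B_hat) != 0) != (np.asarray(B_true) != 0)))
```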
Looking at Table 1, DSS performs better in terms of both SSE and Hamming distance than DLM, NGAR, and LASSO for the majority of the hyper-parameters considered. To gain more insight, the table is divided into three blocks: overall performance on β 1:50 , active coefficients β 1:4 and noise coefficients β 5:50 . The Hamming distance is reported in percentages. Because DLM and NGAR only shrink (and do not select), the Hamming distance for the block of noisy coefficients is 100%. The number of false positives (and thereby the overall Hamming distance) increases with Θ, where large values of Θ have to be compensated with a larger spike parameter λ 0 to shrink the noisy coefficients to zero. The stationary slab variance λ 1 /(1 − φ 2 1 ) also affects variable selection, where larger values increase the selection threshold and produce fewer false discoveries. The reverse is true for the signal coefficients. In terms of SSE, DSS outperforms all other methods in estimating β 5:50 , demonstrating great success in suppressing unwanted parameters. Regarding the choice of φ 1 , larger values seem beneficial for the signal coefficients, where borrowing more strength enhances stability in predictive periods.
Although there are some settings where DSS underperforms, it is clear that DSS has the potential to greatly improve over existing methods for a wide range of hyper-parameters. In terms of SSE, the less well-performing settings are associated with a large stationary slab variance (e.g. λ 1 /(1 − φ 2 1 ) = 25 with φ 1 = 0.95), where the slab process is allowed to meander away from the previous value. The lack of stickiness (smaller φ 1 ) also provides an opportunity for the spike to threshold. The best-performing setting for SSE is a sticky prior (φ 1 = 0.98) with a small slab variance (λ 1 /(1 − φ 2 1 ) = 10), a larger spike penalty (λ 0 = 0.9) and a not excessively large Θ. This combination seems to strike the right balance between selection and shrinkage. It is worth noting that these parameters could be estimated by, e.g., cross-validation, potentially improving overall performance.
Comparing the results with DLM and LASSO, DSS showcases the benefits of combining dynamics and shrinkage, since DLM (dynamics only) and LASSO (shrinkage only) underperform significantly. Additionally, we note that DSS is substantially faster, in terms of computation time, than NGAR and LASSO, which is appealing for practical applications.

Table 1. Performance evaluation of the methods compared for the high-dimensional simulated example with p = 50. The results are split for the signal parameters (x 1:4 ) and noise parameters (x 5:50 ). Hamming distance is in percentages. The number in brackets for DSS indicates the hyper-parameter set {φ 1 , λ 0 , λ 1 /(1 − φ 2 1 ), Θ}. The best 5 DSS results are in bold.
Now, we explore a far more challenging scenario, repeating the example with p = 1000 instead of 50. The coefficients and data-generating process are the same as with p = 50, but instead of 46 noise regressors we now have 996. This high regressor redundancy rate is representative of the "p >> n" paradigm ("p >> T " for time series data) and tests the limits of any sparsity-inducing procedure. Under this setting, we can truly evaluate the efficacy of DSS and compare it to other methods when there is a large number of predictors with sparse signals.
The results are collated in Table 2. The same set of hyper-parameters that performed best for p = 50 also does extremely well for p = 1000, dramatically reducing SSE over DLM and LASSO. We have not reported comparisons to NGAR, because it did not produce reliable results in this high-dimensional scenario: due to its long running time, we could only consider short MCMC chains, which produced unstable and un-reportable results. In addition, we did not attempt to tune the hyper-parameters and used the default options in the matlab program provided by the authors. DSS, on the other hand, produces reasonable estimates, improving upon DLM and LASSO for the majority of settings. We also note that, while LASSO does perform well in terms of

Table 2. Performance evaluation of the methods compared for the high-dimensional simulated example with p = 1000. The results are split for the signal parameters (x 1:4 ) and noise parameters (x 5:1000 ). Hamming distance is in percentages. The number in brackets for DSS indicates the hyper-parameter set {φ 1 , λ 0 , λ 1 /(1 − φ 2 1 ), Θ}. The best 5 DSS results are in bold.
For this example, we aim to forecast and infer on monthly US inflation (Figure 5), a context of topical interest (Cogley & Sargent, 2005; Primiceri, 2005; Koop et al., 2009; Nakajima & West, 2013a). One of the dual mandates of the central bank is to control inflation. To conduct effective economic policy, it is imperative for the policy maker to obtain accurate forecasts and to infer the effects of different indicators on inflation, using the large set of available economic indicators. The benefit of shrinkage in this example, as in other contexts, is to recover sparse signals from the underlying economic indicators that are pertinent to inflation. However, because the economy is dynamic, it is natural to assume that certain indicators are effective during one period but useless during another. For example, one might expect financial indicators to play a significant role in the economy during a financial crisis. The dataset from 2001/1 to 2015/12 is of particular interest, as it includes the subprime mortgage crisis of the late 2000s.
To evaluate our method, we first measure its forecasting ability by conducting one-month-ahead forecasts and comparing the mean squared cumulative forecast error (MSFE). The analysis is done by cutting the data in half, training the methods on the first half from 1986/1 to 2000/12. We then sequentially update the forecasts through the second half from 2001/1 to 2015/12, rerunning the estimation to produce a 1-month-ahead forecast each time a new data point is observed (using data from 1 : t to forecast t + 1 for t = 1 : T − 1, where t = 1 is 1986/1 and t = T − 1 is 2015/11). Out-of-sample forecasting is thus conducted and evaluated in a way that mirrors the realities facing decision and policy makers. At the end of the analysis (2015/12), we estimate the retrospective coefficients throughout 2001/1 to 2015/12 in order to infer the recovered signals, given all the data used in the analysis. As in Section 8, we compare DSS against the full DLM, now with discount stochastic volatility to learn the observation and state variances (see, e.g., West & Harrison, 1997; Prado & West, 2010; Dangl & Halling, 2012; Koop & Korobilis, 2013; Gruber & West, 2016, 2017; Zhao et al., 2016), NGAR, and LASSO (expanding window). For DSS, we use the hyper-parameters {φ 0 = 0, φ 1 = 0.98, λ 0 = 0.9, λ 1 = 25(1 − φ 2 1 ), Θ = 0.98}, since this setting produced decent results in the simulation study in Section 8.
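For concreteness, the running forecast-error metric can be sketched as follows. We interpret the mean squared cumulative forecast error at time t as the average of the squared one-step-ahead errors up to t, which is our reading of the metric; the forecast rules themselves are method-specific and not shown.

```python
import numpy as np

def msfe_path(y, y_hat):
    """Running mean squared one-step-ahead forecast error:
    MSFE_t = (1/t) * sum_{s <= t} (y_s - yhat_s)^2, tracked through time."""
    err2 = (np.asarray(y, float) - np.asarray(y_hat, float)) ** 2
    return np.cumsum(err2) / np.arange(1, err2.size + 1)
```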
Comparing forecast ability (Figure 6), it is clear that DSS dramatically improves forecasts over the methods compared (note that the axes are on different scales). The exact values of MSFE at 2015/12 are DLM: 0.4076; NGAR: 261.7716; LASSO: 0.4307; DSS: 0.0204. DSS significantly improves over the full DLM, which is unsurprising, since the full DLM model is plagued with overfitting and false signals. A surprising result is that the forecasting performance of LASSO is slightly worse than that of the full DLM. This indicates that, even though LASSO achieves shrinkage, it does not capture the dynamics of signals entering and leaving the model. The similarity of the LASSO and DLM forecasting results suggests that the gains coming from either dynamics or shrinkage alone are comparable. DSS, capturing and capitalizing on both dynamics and shrinkage, achieves significant forecast gains that neither one alone can. For NGAR, estimation is challenged by the high dimensionality of the data, producing less reliable results; note that, for NGAR, we used the default settings with 1,000 burn-in and 1,000 iterations. Figure 7 displays the coefficients recovered by DSS at 2015/12, the end of the analysis, taking into account the full dataset available. Out of the 119 indicators (not including the intercept), only 19 were recovered, suggesting that many of these indicators are noise. As we see from the list of the 19 indicators (Table 3), the recovered predictors span six of the eight groups, covering many different aspects of the economy.
The two largest signals are the two labor indicators (Avg Weekly Hours; the two series are very similar), which can be seen as a proxy of the labor market and its effect on inflation. Interestingly, conventional indices of the labor market (including unemployment) are either not recovered or carry a very low signal. Our analysis suggests that the number of hours worked is the major labor-market indicator relevant to inflation.
A large group of the retrieved variables is from the housing market, including housing starts and new private housing permits, for all regions and in total. This is understandable, as the real-estate market is directly tied to the real economy, including inflation.

Table 3. Economic predictors recovered by DSS in the analysis of US inflation from 2001/1 to 2015/12.

The key to obtaining this stabilizing property is the careful hierarchical construction of adaptive mixing weights, which allows them to depend on the lagged value of the process, thereby reflecting the sparsity of past coefficients. We propose various versions of DSS, using Laplace/Gaussian spike/slab distributions. For implementation, we resort to optimization as a faster alternative to posterior sampling. We implement a one-step-late algorithm for MAP estimation which iterates over one-site closed-form thresholding rules. Through simulation and a macroeconomic dataset, we demonstrate that DSS priors are well suited for the dual purpose of dynamic variable selection (through thresholding to exact zero) and smoothing (through an autoregressive slab process) for forecasting and inferential goals. Many variants of, and improvements on, the DSS prototype constructions are possible. While these go beyond the scope of the present paper, we would like to make the reader aware of generalizations that might be important for future deployments of these priors. The first is linking the time series priors across the different coordinates. This can be achieved by endowing the importance weight Θ with a prior distribution, thereby allowing it to adapt to the overall sparsity pattern. This elaboration can be embedded within our framework by adding an extra step in the algorithm. The second very important extension is treating the residual variances σ 2 as random and possibly time-varying. In this paper, we focused primarily on the priors on the regression coefficients.
However, the DSS priors can be deployed in tandem with a stochastic process prior on {σ 2 t } T t=1 (as, e.g., in Kalli & Griffin (2014)). Third, the autoregressive parameters (φ 0 , φ 1 ) have been assumed fixed and shared across coordinates; allowing these to be treated as unknown, with a prior, will be an important extension.
Matlab code is available from the authors upon request.