Bayesian Nonparametric Density Autoregression with Lag Selection

We develop a Bayesian nonparametric autoregressive model for flexible estimation of general transition densities exhibiting nonlinear lag dependence. Our approach is related to Bayesian density regression using Dirichlet process mixtures, with the Markovian likelihood defined through the conditional distribution obtained from the mixture. This results in a Bayesian nonparametric extension of a mixtures-of-experts model formulation. We address computational challenges to posterior sampling that arise from the Markovian structure in the likelihood. The base model is illustrated with synthetic data from a classical model for population dynamics, as well as a series of waiting times between eruptions of Old Faithful Geyser. We study inferences available through the base model before extending the methodology to include automatic relevance detection among a pre-specified set of lags. Inference for global and local lag selection is explored with additional simulation studies, and the methods are illustrated through analysis of an annual time series of pink salmon abundance in a stream in Alaska. We further explore and compare transition density estimation performance for alternative configurations of the proposed model.


Introduction
This article is concerned with flexible transition density estimation for non-stationary, nonlinear time series. Let {y_t : t = 1, . . . , T} denote a univariate series governed by a time-homogeneous transition density p(y_t | y_{t−1}, . . . , y_1). While nonlinearity has been used to describe various qualitative characteristics of time series, we specifically refer to nonlinear dynamics, that is, the function mapping past observations to the present. Many existing methods for nonlinear regression have been applied to autoregressive modeling both within and outside the statistical literature. Density regression has received far less attention, especially in application to transition density estimation, a crucial component of probabilistic forecasting and decision modeling. We seek to build on recent advances in transition density estimation by exploring what can be succinctly described as an extension to Bayesian nonparametric mixtures of autoregressive models. To accommodate nonlinear dependence, mixture weights are functions of lagged observations. Thus, our method is also accurately described as a locally linear autoregressive model.
Perhaps the most popular mixture modeling application to time series is the class of hidden Markov models (HMMs), which are capable of capturing nonlinear dynamics (Frühwirth-Schnatter, 2006, and references therein). Markovian dependence in a latent process, however, complicates inferences for transition densities and related functionals, especially when considering multiple lags. The likewise popular classes of threshold autoregressive models (Tong, 1990) and mixtures-of-experts (MoE) models (Jordan and Jacobs, 1994; Peng et al., 1996; Tanner, 2005, 2006) alternatively build dependence into mixture weights through lagged observations directly. We take the MoE approach, replacing parameterized link functions of lagged observations with normalized kernels for local weighting (Glasbey, 2001; Kalliovirta et al., 2015).
In contrast with most HMM and MoE methods, our models are based on countable mixtures, bypassing the need to fix the number of mixture components. Bayesian nonparametric (BNP) approaches have expanded the hidden Markov (Beal et al., 2002;Taddy and Kottas, 2009;Yau et al., 2011) and dynamic linear (Rodríguez and Ter Horst, 2008;Caron et al., 2007;Fox et al., 2011) model frameworks. Dirichlet process mixtures (DPM; Ferguson, 1973;Antoniak, 1974) of linear autoregressive (AR) models (Lau and So, 2008;Di Lucca et al., 2013), which are closer to our formulation, can be viewed as nonparametric extensions of the mixture autoregressive model of Wong and Li (2000). DPM of AR models typically use static weights, restricting transition mean functionals to be linear. Müller et al. (1997) use normalized weights that employ a finite MoE framework to accommodate nonlinearity. Posterior consistency for BNP transition density estimation has been explored by Tang and Ghosal (2007a), Tang and Ghosal (2007b), and Chae and Walker (2019).
Many of the above methods assume first-order time dependence. While convenient and occasionally justified, this assumption may over-simplify or misspecify the dynamics. Higher-order models can also enable phase-space reconstruction via time-delay embedding. Although applied to deterministic systems, a theorem by Takens (1981) justifies reconstructing multidimensional dynamical systems, up to topological equivalence, using only lags of a univariate time series. Markovian stochastic models can approximate this method for applications that exhibit noise (Kantz and Schreiber, 2004, Ch. 10, 12). The practical utility of this result is evident in fields like ecology, where full observation of all relevant variables is practically impossible.
Motivated by these considerations, we propose to model the transition density for observation y_t, conditional on L lags y_{t−1:L} ≡ (y_{t−1}, . . . , y_{t−L}), as ∑_{h=1}^∞ q_h(y_{t−1:L}) p_h(y_t | y_{t−1:L}), with component-specific normalized weight functions q_h(y_{t−1:L}) and kernel densities p_h(y_t | y_{t−1:L}). The model form resembles that of Antoniano-Villalobos and Walker (2016), who build on Martínez-Ovando and Walker (2011), constructing a transition density from a mixture model on the stationary joint density of the current observation and a single lag.
Their likelihood is based on the conditional transition density, which is a nonparametric mixture of kernels with linear autoregressive means and lag-dependent, normalized weights. Kalli and Griffin (2018) extend this framework to a stationary multivariate autoregressive model of multiple lags, although the model is implemented with a single lag. DeYoreo and Kottas (2017) use a similar model construction, achieving superior flexibility by relaxing the stationarity assumption. The model proposed in this article extends that of DeYoreo and Kottas (2017) to accommodate multiple lags and, crucially, shrink dependence to a minimally sufficient set of lags. The added modeling and computational complexity associated with high-order dependence demands that lag selection play a vital role in this work, as it affords parsimony and significantly reduces the estimation burden.
The primary contributions of this article are 1) extension of a powerful class of nonstationary, nonlinear density autoregression models to accommodate dependence on multiple lags; 2) development of a framework for model-based selection and exploration of lag dependence; 3) investigation into the proposed model's fitness for different analysis scenarios; and 4) demonstration of the need for lag selection in high-order density autoregression.
The rest of the article is organized as follows. In Section 2, we propose a BNP time-series model for density autoregression and present details for implementation and inference.
In Section 3, we illustrate the model fit to synthetic and real data. In Section 4, we extend the model to incorporate inferences about relevant lags and demonstrate its use on data. Section 5 compares transition density estimation performance under different model configurations, using simulated nonlinear time series featuring skewness, heteroscedasticity, and different lag dependence structures. Finally, Section 6 concludes with discussion. The Supplementary Material contains details on: model modifications for stationary time series; prior specification; computing time and sensitivity analysis; the Markov chain Monte Carlo (MCMC) algorithms for the base model and its extension that incorporates lag selection; and an additional simulation example.

The modeling approach
Our objective is to develop a general-purpose and fully nonparametric, time-homogeneous Markovian model that is sufficiently flexible to: 1) estimate possibly non-Gaussian transition densities, dependent on lagged values; 2) capture nonlinear dynamics; and 3) select relevant lags among a pre-specified set, up to a maximal order L. The first two objectives are accomplished through a nonparametric mixture of Gaussian densities, wherein both the mixture weights and kernel means depend on lagged observations. The general model formulation for the transition density can be written as

f(y_t | y_{t−1:L}) = ∑_{h=1}^∞ q_h(y_{t−1:L}) N(y_t | µ_h(y_{t−1:L}), σ²_h),  (1)

with local weights q_h(y_{t−1:L}) and Gaussian mixture kernels, where N(y | µ, σ²) denotes a Gaussian density with mean µ and variance σ² evaluated at y, and the weight functions satisfy q_h(y_{t−1:L}) ≥ 0 for all h ∈ N and ∑_{h=1}^∞ q_h(y_{t−1:L}) = 1 for all y_{t−1:L} ∈ R^L. We utilize kernel mean functions µ_h(y_{t−1:L}) that are linear in the lags, yielding a local linear model formulation. The objective of order and lag selection is accomplished through a stochastic-search prior structure.
Time homogeneity is a consequence of time invariance in the parameters governing the mixture weights and kernels. We note that this seemingly restrictive assumption is at least partially offset by the model's flexibility with respect to lagged observations. Apparently time-dependent structural changes can sometimes be attributed to heterogeneity of response across the state space. In such cases, a latent first-order Markov process governing the mixture weights may be less effective than our approach of using the lagged values directly. Nevertheless, dynamic drift or regime-switching in model structure may be more appropriate in some scenarios, for which we urge thoughtful exploration before selecting a model. We proceed with model specification in Section 2.1, built using a covariance matrix parameterization that is useful for interpretation and implementation. Section 2.2 discusses the roles of model parameters and gives recommended prior settings. Section 2.3 briefly outlines the MCMC algorithm used for posterior inferences and addresses implementation.
Finally, Section 2.4 discusses model inferences, including transition density estimation.

Model specification
One avenue to arrive at the conditional density form in (1) begins with a prior for joint density estimation. For clarity in notation, we use y ∈ R to represent the current observation and x ≡ y_{t−1:L} ∈ R^L to denote the lags. We begin as in Müller et al. (1996), who in the regression setting consider y and x to arise jointly from a Gaussian DPM. This implies the stick-breaking representation (Sethuraman, 1994) for the joint density,

f(y, x | G) = ∑_{h=1}^∞ ω_h N^{(h)}(y, x),  (2)

where the (µ_h, Σ_h) arise i.i.d. from the Dirichlet process (DP) centering distribution G_0, and the mixture weights, ω_h, are constructed as

ω_1 = v_1, ω_h = v_h ∏_{g<h} (1 − v_g) for h ≥ 2, with v_h ∼ Beta(1, α), independently.  (3)

Conditioning on x, we obtain the transition density model

f(y | x, G) = ∑_{h=1}^∞ [ω_h N^{(h)}(x) / ∑_{j=1}^∞ ω_j N^{(j)}(x)] N^{(h)}(y | x) = ∑_{h=1}^∞ q_h(x) N^{(h)}(y | x),  (4)

where N^{(h)}(·) refers to a Gaussian density with parameters corresponding to mixture component h, and N^{(h)}(y | x) is the univariate conditional Gaussian density derived from N^{(h)}(y, x). The joint densities in each mixture component of the numerator of (4) have been factored into their respective marginal L-dimensional Gaussian density for x (with mean µ^x and covariance Σ^x) and univariate conditional Gaussian density for y (with linear mean µ(x) ≡ µ^y + Σ^{yx}(Σ^x)^{−1}(x − µ^x) and variance σ² ≡ (σ^y)² − Σ^{yx}(Σ^x)^{−1}Σ^{xy}). The latter expression in (4) reveals the local linear model structure with lag-dependent weights.
This procedure yields a conditional density that satisfies the requirements of the proposed model (1). Specifically, since ∑_{h=1}^∞ ω_h = 1 almost surely, so long as there exists some positive constant c_N < +∞ such that 0 < N^{(h)}(x) < c_N for all h ∈ N and all x ∈ R^L (which is satisfied if there exists another constant c_Σ > 0 such that det(Σ^x_h) > c_Σ for all h ∈ N), the denominator in q_h(x) will be positive and finite for all x ∈ R^L. Although x (representing y_{t−1:L}) can legitimately be considered random in the time-series context, the Markovian likelihood requires that the conditional density (4) form the basis of the model. Besides creating redundancy in the likelihood, modeling separate joint distributions for consecutive length-(L + 1) coordinate vectors would not generally be coherent. To see this, consider y_t, which appears in both (y_{t+1}, y_t) and (y_t, y_{t−1:L}).
Modeling each vector with a joint mixture as in (2) would result in two distinct marginal distributions for y_t without additional assumptions, like strong stationarity. We forego stationarity in favor of flexibility. Consequently, we interpret the {N^{(h)}(x)} densities in {q_h(x)} exclusively as functions that localize the mixture weights, and not as joint densities of lagged observations. Indeed, localizing the weights is their only role in a conditional likelihood based on (4). Supplement S1 includes discussion of possible mixture model formulations for the stationary case.
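To make the construction concrete, the following minimal sketch (our code, with illustrative parameter values, not the paper's implementation) evaluates the lag-dependent weights q_h(x) and the conditional transition density in (4) from a truncated Gaussian mixture:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
H, L = 5, 2

# truncated stick-breaking weights as in (3), with v_H set to 1
alpha = 2.0
v = rng.beta(1.0, alpha, size=H)
v[-1] = 1.0
omega = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

# joint (y, x) Gaussian component parameters (illustrative values)
mus = rng.normal(0.0, 2.0, size=(H, L + 1))              # (mu_y, mu_x)
covs = np.stack([np.eye(L + 1) + 0.3 for _ in range(H)])

def transition_density(y, x):
    """f(y | x) = sum_h q_h(x) N^{(h)}(y | x), with q_h(x) as in (4)."""
    lik = np.empty(H)    # marginal weight kernels N^{(h)}(x)
    cond = np.empty(H)   # conditional kernels N^{(h)}(y | x)
    for h in range(H):
        mu_y, mu_x = mus[h, 0], mus[h, 1:]
        S = covs[h]
        S_yx, S_x = S[0, 1:], S[1:, 1:]
        lik[h] = multivariate_normal(mu_x, S_x).pdf(x)
        m = mu_y + S_yx @ np.linalg.solve(S_x, x - mu_x)   # linear mean
        s2 = S[0, 0] - S_yx @ np.linalg.solve(S_x, S_yx)   # reduced variance
        cond[h] = norm(m, np.sqrt(s2)).pdf(y)
    q = omega * lik
    return float(q @ cond / q.sum())

# f(. | x0) is a proper density in y for any fixed x0
x0 = np.array([0.5, -0.2])
grid = np.linspace(-15.0, 15.0, 2001)
vals = np.array([transition_density(yv, x0) for yv in grid])
print(vals.sum() * (grid[1] - grid[0]))   # close to 1
```

The final check confirms that the localized re-weighting still yields a proper density in y at any fixed x, since the normalized q_h(x) sum to one.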
The model likelihood, based on (4) and conditional on the first L observations, is ∏_{t=L+1}^{T} f_{Y|X}(y_t | y_{t−1:L}, G). This is the form adopted in Antoniano-Villalobos and Walker (2016) and Kalli and Griffin (2018), who assume stationarity, and DeYoreo and Kottas (2017), who do not. The local re-weighting of {ω_h} with probability density kernels on x distinguishes our model from nonparametric extensions of MoE for regression, such as dependent Dirichlet process (DDP; MacEachern, 2000) variants (Chung and Dunson, 2009; Fuentes-García et al., 2009; Barrientos et al., 2017) and kernel stick-breaking models (Park and Dunson, 2010; Reich et al., 2012). See Wade et al. (2014) and DeYoreo and Kottas (2020) for reviews of density regression models that build on Müller et al. (1996) and do not pre-condition the likelihood.

Covariance factorization
To facilitate interpretation in our factorization of the kernels into response and lag densities, allow flexible and parsimonious covariance modeling, and provide a vehicle for variable selection in the mixture weights, we parameterize the Gaussian covariance matrix according to the factorization Σ = B^{−1} Δ (B^{−1})'. Here, Δ = diag(σ², δ^x_1, . . . , δ^x_L) and B is an upper unit-triangular matrix with first row (1, β^y_1, β^y_2, . . . , β^y_{L−1}, β^y_L), second row (0, 1, β^x_{1,2}, . . . , β^x_{1,L−1}, β^x_{1,L}), and so forth until the Lth row (0, . . . , 0, 1, β^x_{L−1,L}) and final row (0, . . . , 0, 1). This factorization is equivalent to the square-root-free Cholesky decomposition employed by Daniels and Pourahmadi (2002) and Webb and Forster (2008), and in our setting by DeYoreo and Kottas (2017). This and similar decompositions have also been used for model selection (Smith and Kohn, 2002; Cai and Dunson, 2006). Our extension for lag selection in the mixture weights is discussed in Section 4. This parameterization also yields a sequential decomposition of a joint Gaussian density for y and x into L + 1 univariate Gaussian densities. Specifically,

N(y, x | µ, Σ) = N(x_L | µ^x_L, δ^x_L) × ∏_{ℓ=1}^{L−1} N(x_ℓ | µ^x_ℓ − ∑_{r=ℓ+1}^{L} β^x_{ℓ,r} (x_r − µ^x_r), δ^x_ℓ) × N(y | µ^y − ∑_{ℓ=1}^{L} β^y_ℓ (x_ℓ − µ^x_ℓ), σ²).  (5)

We construct from back (most distant lag) to front (y) so that the response density depends on the entire x vector while maintaining a consistent order convention. This fully parameterized representation of the covariance matrix is flexible, as each β parameter is unrestricted and the δ parameters need only be positive, and it admits control over the marginal weight density of x while preserving positive definiteness. Note also that the marginal covariance matrix of x can be constructed as Σ^x = (B^x)^{−1} Δ^x ((B^x)^{−1})', where B^x is obtained by removing the top row and first column of B, and Δ^x = diag(δ^x_1, . . . , δ^x_L). The weight kernels in q_h(x) present the most obvious and pressing opportunity to improve parameter economy in the model. We therefore also consider weight kernels with local independence between elements of x (e.g., Shahbaba and Neal, 2009).
This reduction is accomplished by setting all β^x_{ℓ,r}, for ℓ ≠ r, equal to 0, yielding diagonal Σ^x = Δ^x. We note that Gaussian mixtures with diagonal covariance can approximate general density shapes, at the cost of possibly utilizing additional mixture components to capture local behavior. The reduction becomes necessary if we include many lags, as the number of covariance parameters for each component h grows quadratically with L.
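A small numerical sketch (ours, with arbitrary values) of the factorization Σ = B^{−1} Δ (B^{−1})' and of the induced marginal covariance Σ^x, obtained by deleting the top row and first column of B:

```python
import numpy as np

rng = np.random.default_rng(7)
L = 3

# unit upper-triangular B, (L+1) x (L+1); free beta entries are unrestricted
B = np.eye(L + 1)
iu = np.triu_indices(L + 1, k=1)
B[iu] = rng.normal(0.0, 0.5, size=iu[0].size)

# Delta = diag(sigma^2, delta_1, ..., delta_L); entries need only be positive
Delta = np.diag(rng.uniform(0.5, 2.0, size=L + 1))

Binv = np.linalg.inv(B)
Sigma = Binv @ Delta @ Binv.T            # joint (y, x) covariance

# marginal covariance of x via the sub-factorization:
# delete the top row and first column of B, keep (delta_1, ..., delta_L)
Bx_inv = np.linalg.inv(B[1:, 1:])
Sigma_x = Bx_inv @ Delta[1:, 1:] @ Bx_inv.T

print(np.allclose(Sigma[1:, 1:], Sigma_x))   # True
```

The agreement of the lower-right block of Σ with Σ^x illustrates why positive definiteness of the weight-kernel covariance is preserved automatically under this parameterization.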
The final term in (5), involving µ^y and the {β^y_ℓ}, is overparameterized if used as a standalone regression model. However, the {µ^x_ℓ} parameters become at least partially identified in our mixture formulation because they serve as location parameters for the mixture weight kernels in q_h(x). It is nevertheless preferable to monitor inferences for the component-specific intercepts µ^y + ∑_{ℓ=1}^{L} β^y_ℓ µ^x_ℓ, which in our experience are far more stable than either µ^y or {µ^x_ℓ} alone.

Hierarchical model formulation
To implement the model, we truncate the infinite summation needed to normalize the mixture weights {q h (x)}, using blocked Gibbs sampling (Ishwaran and James, 2001). There are both theoretical and practical considerations when selecting the truncation level, H.
Given the DP concentration parameter α, we can calculate the prior expected truncation error, E(∑_{h=H+1}^∞ ω_h) = (α/(α + 1))^H, and choose H so that this mass is negligible. We can also monitor throughout MCMC sampling the last weight, ω_H, to ensure it remains small, as well as the number of occupied components to ensure that it does not approach H.
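The truncation level can thus be solved for directly from a tolerance on the expected leftover mass; a minimal sketch (function names ours):

```python
import math

def truncation_error(alpha, H):
    # prior expected mass beyond component H under DP(alpha) stick-breaking
    return (alpha / (alpha + 1.0)) ** H

def smallest_H(alpha, tol=1e-6):
    # smallest integer H with (alpha/(alpha+1))^H <= tol
    return math.ceil(math.log(tol) / math.log(alpha / (alpha + 1.0)))

alpha = 3.0
H = smallest_H(alpha)
print(H, truncation_error(alpha, H))   # H = 49 for alpha = 3
```

Larger α pushes prior mass toward later components and so requires a larger H for the same tolerance.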
∼ Ga(a^x_{s_0,ℓ}, b^x_{s_0,ℓ}), for ℓ = 1, . . . , L. Experience with the model suggests it is practical to fix the components in G_0 associated with y-indexed parameters rather than use the full prior specification above. Specifically, we find that fixing β*_0 at b*_0, Λ*_0 at (Ψ*_0)^{−1}, and s_0 at a prior guess s_00 works well in practice.

Prior settings
The priors for the hierarchical model in Section 2.1.2 are specified in generality. We recommend the following default settings for a baseline prior, which in most cases should be adjusted for the analysis at hand. We typically set a_α in the interval [5, 15], depending on our prior beliefs about the degree of nonlinearity in the transition function. Setting b_α = 1 yields a prior mean of a_α. Antoniak (1974) gives the expression α log((α + T − L)/α) as a rough prior estimate for the number of components. While this applies in the prior joint model, the number of components in our conditional model (4) is also a function of the Gaussian weight kernels on x. We set b*_0 = (ȳ, 0, . . . , 0), with ȳ representing the center of the time series, either empirical or based on prior information, thus centering the model. We use Ψ*_0 = s_00^{−1} diag([range(y)/2.0]², 16.0, . . . , 16.0), where s_00 is a user-supplied prior guess of σ², and range(y) represents the range of the time series, either empirical or based on prior information. The prior guess s_00 partially compensates and controls for the fact that the covariance for β* in G_0 is multiplied by σ². We use s_00 = [range(y)/6.0]²/R as an automatic prior guess of s_0. The squared quantity is divided by a prior signal-to-noise ratio R > 0 that is set by the modeler on a case-by-case basis.
We interpret R roughly as the ratio of total variance to mixture-component error variance.
Supplement S3 reports a simulation study exploring sensitivity of posterior inference results to changes in α and R.
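The default settings above can be assembled mechanically from the observed series; the sketch below (function and variable names are ours) is one way to do so:

```python
import numpy as np

def default_prior(y, L, R=5.0):
    """Baseline prior settings of Section 2.2 (function/variable names ours)."""
    rng_y = float(np.max(y) - np.min(y))                 # range(y)
    s00 = (rng_y / 6.0) ** 2 / R                         # prior guess of s_0
    b0 = np.concatenate(([np.mean(y)], np.zeros(L)))     # b*_0 centers the model
    # Psi*_0 = s00^{-1} diag([range(y)/2]^2, 16, ..., 16)
    Psi0 = np.diag(np.concatenate(([(rng_y / 2.0) ** 2],
                                   np.full(L, 16.0)))) / s00
    return {"s00": s00, "b0": b0, "Psi0": Psi0}

y = 10.0 + 3.0 * np.sin(np.linspace(0.0, 20.0, 200))     # toy series
prior = default_prior(y, L=2, R=5.0)
print(prior["s00"], prior["b0"])
```

Raising R shrinks s_00, encoding a prior belief that less of the marginal variability is component-level noise.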

Computation
We briefly outline the MCMC algorithm used to obtain posterior samples from the proposed model. Further details are given in Supplement S4. We employ a Gibbs sampler with a variety of update methods for parameter blocks, which proceeds by successively sampling the parameters in the sets and manner described below.
Latent states: The latent states identifying component membership for each observation y_t are updated individually, each using a Metropolized Gibbs step (Liu, 1996) based on discrete full conditional distributions involving {ω_h}, the weight kernel density for y_{t−1:L}, and the kernel density for y_t.

Stick-breaking proportions: The normalization of the weights in the conditional likelihood renders the full conditional distribution for the stick-breaking proportions nonstandard. To obtain direct samples from this distribution, we employ the multivariate hyper-rectangle slice sampler of Neal (2003) to update all v_h, h = 1, . . . , H − 1, simultaneously.
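For intuition, a generic hyper-rectangle slice step can be sketched as follows (our code), here applied to a toy product-of-Beta target standing in for the v_h updates; the model's actual full conditional is more involved:

```python
import numpy as np

rng = np.random.default_rng(11)

def slice_sample(x0, log_f, w, rng):
    """One hyper-rectangle slice update (Neal, 2003): draw the slice level,
    place a randomly positioned box of widths w around x0, then sample
    uniformly inside, shrinking the box toward x0 on each rejection."""
    x0 = np.asarray(x0, dtype=float)
    log_y = log_f(x0) + np.log(rng.random())      # auxiliary slice height
    lo = x0 - w * rng.random(x0.size)
    hi = lo + w
    while True:
        x1 = lo + (hi - lo) * rng.random(x0.size)
        if log_f(x1) > log_y:
            return x1
        shrink = x1 < x0                          # shrink each side toward x0
        lo[shrink] = x1[shrink]
        hi[~shrink] = x1[~shrink]

# toy target: product of Beta(1, 3) densities on (0, 1)^3
def log_f(v):
    v = np.asarray(v)
    if np.any(v <= 0.0) or np.any(v >= 1.0):
        return -np.inf
    return float(np.sum(2.0 * np.log1p(-v)))

v = np.full(3, 0.5)
draws = np.empty((4000, 3))
for i in range(4000):
    v = slice_sample(v, log_f, np.full(3, 1.0), rng)
    draws[i] = v

print(draws.mean(axis=0))   # each coordinate has mean 1/4 under Beta(1, 3)
```

The shrinkage step guarantees termination because the box collapses toward the current point, which always lies above the slice level.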
Component-specific parameters: To facilitate mixing of the y-indexed, component-specific parameters, we partition η into its y and x components, η^y ≡ {µ^y, {β^y_ℓ}, σ²} and η^x ≡ {µ^x, {β^x_{ℓ,r}}, {δ^x_ℓ}}. The weight normalization terms in q_h(y_{t−1:L}) preclude simple conjugate updates of η^x_h, for which we employ a random-walk Metropolis step. This is then followed by an exact draw from the full conditional distribution of η^y_h.

DP prior hyperparameters: All parameters of the DP centering distribution have conditionally conjugate updates. For computational stability, our implementation fixes, rather than updates, the parameters in G_0 associated with η^y at prior summary values, as noted in Section 2.1.2. Finally, the DP concentration parameter α has a gamma posterior full conditional distribution with shape a_α + H − 1 and rate b_α − log(ω_H).
We typically initialize MCMC chains at default prior settings such as the prior mean or applicable summary value from the next level of the hierarchy, or with draws from the prior model (usually with G 0 fixed). The primary exception is the initial allocation to components {s t }, for which we use output from a clustering algorithm applied to (y t , y t−1 , ..., y t−L ), for all t = L + 1, . . . , T . For example, we use hierarchical clustering with Euclidean distance and Ward linkage to assign the observations into H clusters. The sampler is then run for one or several rounds of tuning or adaptation, as described in Supplement S4. If adaptation is used, scaled empirical covariance matrices inform subsequent random-walk proposals.
After a specified burn-in period, samples are collected for inference.
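The clustering-based initialization of {s_t} described above can be sketched with standard library calls (the series here is a toy stand-in for the observed data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
T, L, H = 200, 2, 10
y = np.cumsum(rng.normal(size=T))        # toy series standing in for the data

# delay-embedding matrix: row t holds (y_t, y_{t-1}, ..., y_{t-L})
emb = np.column_stack([y[L - k: T - k] for k in range(L + 1)])

Z = linkage(emb, method="ward")                   # Euclidean distance, Ward linkage
s_init = fcluster(Z, t=H, criterion="maxclust")   # component labels in 1..H
print(emb.shape, len(np.unique(s_init)))
```

Initializing from clusters of delay vectors places observations with similar local dynamics in the same component, which tends to shorten burn-in relative to random allocation.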
In our experience, the weakly identified η^x and ω parameters present the primary mixing challenge. This appears to indicate redundancy in the weight functions, for which many configurations produce similar results. Our illustrations with the base model (i.e., without lag selection) focus on low-order dependence, L ≤ 5. Later illustrations use diagonal Σ^x = Δ^x, which reduces the computational complexity of the most expensive update, that for η^x. We further aid mixing by iterating between adaptation and pre-burn-in runs before beginning a final burn-in run. We note that despite the mixing challenges, estimates for functionals of interest are typically stable.
MCMC and other computations for the proposed model were run in the Julia language (Bezanson et al., 2017). Runtimes under various settings are compared as part of a sensitivity analysis in Supplement S3.

Transition density estimation
Posterior samples from the model yield rich inferences regarding the transition distribution for a time series. The three of most interest to us are the transition density, the transition mean functional, and inferences for relevant lags. We incorporate the latter in Section 4.
The transition mean functional and estimates of the transition density are straightforward to compute, as the stick-breaking representation and blocked Gibbs sampler yield an approximation of the random mixing distribution G at each MCMC iteration. For any value of y and x, or over a multidimensional grid of values, one can use posterior samples of parameters to calculate pointwise samples of the finite-truncated version of f_{Y|X} in (4), given each posterior draw. The samples can then be used to construct point and interval estimates for the transition density. Other functionals such as the transition mean or quantiles are similarly obtained. One can calculate the transition mean for each posterior sample over a grid of values for x, yielding pointwise estimates and intervals. We obtain samples of the u ∈ (0, 1) quantile of the transition density by solving for the unique root of Q̂_u(y | x) ≡ F̂_{Y|X}(y | x) − u as a function of y. Monte Carlo estimates of K-step-ahead forecasts can be obtained by inductively simulating (s, y)_{T+k} pairs, for k = 1, . . . , K, following the first two levels of the hierarchical model (6) for each posterior sample. Such samples propagate both forecast and inferential uncertainty, and can be useful for assessing model performance with validation data.
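The quantile computation reduces to univariate root finding on a mixture CDF; a minimal sketch for a single, hypothetical posterior draw at a fixed x (weights and kernel parameters are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# one hypothetical posterior draw at a fixed x: normalized local weights
# q_h(x) and conditional Gaussian kernels N^{(h)}(y | x)
q = np.array([0.5, 0.3, 0.2])
means = np.array([-1.0, 0.5, 3.0])
sds = np.array([0.8, 0.5, 1.2])

def mix_cdf(y):
    return float(q @ norm.cdf(y, means, sds))

def transition_quantile(u):
    # the mixture CDF is strictly increasing, so the root is unique
    lo = means.min() - 10.0 * sds.max()
    hi = means.max() + 10.0 * sds.max()
    return brentq(lambda y: mix_cdf(y) - u, lo, hi)

y80 = transition_quantile(0.8)
print(y80, mix_cdf(y80))   # CDF at the root equals 0.8 up to solver tolerance
```

Repeating this over posterior draws and a grid of x values gives pointwise estimates and intervals for quantile surfaces such as those in the Old Faithful illustration.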

Data illustrations
We illustrate the proposed model with two examples. The first synthetic data example highlights some key features and potential uses of the model. The real data example illustrates the model's utility for lag-dependent density estimation. Two default prior settings were utilized in each case, with one promoting a higher signal variance through prior signal-to-noise ratio R = 8.0 instead of the default R = 5.0. For each model fit, multiple MCMC chains were randomly initialized using the strategy described in Section 2.3, followed by iterative tuning (no adaptation) and 300,000 burn-in samples. The next 500,000 iterations were then thinned to 5,000 for inference (plots in the following illustrations generally use 1,000 or 2,000 of these). Inferences are reported for one of the chains. These values for burn-in and thinning are fairly conservative; shorter chains often suffice.

Simulated data: Ricker model
We begin with a time series simulated from an adaptation of a classical model for population dynamics (Ricker, 1954). The series was generated from a stochastic version of the Ricker map, featuring first-order nonlinear dynamics as a function of the second lag only. We fit the model to the original real-valued time series with L = 2, T = 72 (so that 70 observations contribute to the likelihood), and H = 40. The R = 5.0 fit resulted in three chains with similar traces of the log-likelihood and occupied mixture components (always at two). All traces of σ² for the most occupied cluster (not shown) converge to approximately 3.5 times the true value of 0.0081, due in part to the prior estimate s_00 = 0.119. Flexibility and prior bias in error variance, together with low sample size, result in a transition mean fit that locally mixes two planes, capturing the general shape, but missing curvature in the region y_{t−2} ∈ (0, 1) (not shown). Two of three chains with higher signal-to-noise ratio (R = 8.0) use a third mixture component to better capture this curvature (although one reverts back to two components), as demonstrated for one chain in Figure 1.
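For reference, a stochastic Ricker-type simulation with dependence on the second lag only can be sketched as below; the growth rate and initial values are our illustrative choices rather than the exact generating settings, although the noise standard deviation 0.09 corresponds to the true σ² = 0.0081 quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 72
r, sigma = 2.5, 0.09     # sigma^2 = 0.0081; growth rate r is illustrative

y = np.empty(T)
y[0], y[1] = 0.5, 0.7    # illustrative initial values
for t in range(2, T):
    # y_t = y_{t-2} exp(r (1 - y_{t-2})) exp(eps_t), eps_t ~ N(0, sigma^2)
    y[t] = y[t - 2] * np.exp(r * (1.0 - y[t - 2]) + sigma * rng.normal())

print(y.min(), y.max())   # population sizes stay positive
```

Because the update depends only on y_{t−2}, the even- and odd-indexed subsequences evolve independently, which is precisely the structure a first-lag-irrelevant fit should recover.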
The dynamics are reasonably recovered in data-rich regions of the phase space despite using an over-specified model with two lags on a short time series. We can informally assess the influence of the first lag with the second-order model by checking for sensitivity of inferences for the transition mean to different values of the first lag. For example, the left panel of Figure 1 plots estimates for the transition mean over a grid of values for the second lag, in which all values for the first lag have been fixed at their mean. The right panel replicates this plot with grid values for the first lag drawn uniformly over the range of the data. This perturbation has minimal effect, especially where data are observed, suggesting that lag 1 is negligible in the model fit. We note particularly wide credible intervals in the data-sparse region, which approximately reach 10. This appears to stem from the weight functions concentrating locally around the data, leaving data-sparse regions to revert to an indecisive mixture of the component fits and prior.

Old Faithful data

Eruptions of Old Faithful Geyser have attracted attention from geological (Nicholl et al., 1994) and statistical (Azzalini and Bowman, 1990) perspectives, partly due to nonlinear as well as non-Gaussian dynamics. We revisit Old Faithful using the traditional data set reported in Azzalini and Bowman (1990), consisting of 299 consecutive pairs of eruption durations and waiting times between August 1 and 15, 1985. Figure 2 shows a trace of eruption waiting times in minutes.

We fit the proposed model to the final T = 291 observations with L = 2 and H = 40.
Likelihood traces are similar among runs under both prior signal-to-noise ratios, switching (infrequently) between values corresponding to two and three occupied mixture components.
Estimated transition mean surfaces, one of which is shown in Figure 3 (left), are primarily driven by the first lag, with minor tilt along the second. The transition mean functional is less informative for values of y_{t−1} above 70 minutes, where the transition distribution becomes bimodal. In this region, estimates of transition quantiles may be more appropriate than the transition mean. Inferences for quantiles over a grid of fixed lag values are easily obtained from posterior samples by following the procedure described in Section 2.4. Figure 3 (right) shows a pointwise posterior mean estimate of the 0.8 quantile surface as a function of the two lags. Credible intervals for both surfaces (excluded for simplicity in the plots) are reasonable, falling within the range of the data. Posterior estimates of the transition density for three values of the two lags demonstrate the density autoregressive feature of the model, which in this case successfully captures density dependence on lags.
Interestingly, the transition density undergoes noticeable change between y_{t−2} = 50 and y_{t−2} = 80 when y_{t−1} is fixed at 80 minutes, suggesting dependence on the second lag.

Lag selection
We now discuss extending model (4) to include inferences for relevant lags. This step is important in many applications, as dependence may extend beyond the most recent lags. In some cases, not all recent lags are important. Methods for state-space reconstruction require a minimal number of lags to "unfold" an attractor, but using too many can be inefficient, or render estimation impractical. Reducing system dimensionality to the minimum necessary for fitting the data further simplifies posterior analysis and model interpretation. Our approach is to pre-specify a maximal lag horizon L, and fit an encompassing model that accommodates up to all L lags, but shrinks to select only those that significantly contribute to the transition density.
In the time series literature, autoregressive order is often assessed with standard information criteria, which can include regularization (Khalili et al., 2017). Bayesian approaches typically involve stochastic-search-type algorithms, several of which are presented in Prado and West (2010, Ch. 2). A review (2017) discusses approaches for covariate-dependent DPM, DDP, and product partition models. Most approaches involve binary indicator variables associated with each covariate that either activate lag-specific kernels (as in Reich et al., 2012) or break mixtures for key parameters (i.e., regression coefficients) involving point masses at 0 (as in Chung and Dunson, 2009). Another option with DPM models is to include model order as a mixing parameter (as in Lau and So, 2008).
We propose a model extension for global lag selection in Section 4.1. Section 4.2 discusses inference, including posterior sampling and other modifications to MCMC, and sampling for functionals. Section 4.3 describes an analogous extension for local lag selection. In Section 4.4, we revisit the data illustrations from Section 3 and include two additional data sets.

Model extension for global lag selection
In model (4), both mixture kernels and weights depend on the lags, thus necessitating coordination across multiple parameters for model-based lag selection. To this end, we employ binary variables {γ_ℓ}, for ℓ = 1, . . . , L, indicating dependence of y_t on y_{t−ℓ}, in both weights and kernels of all H mixture components, if γ_ℓ = 1. The most straightforward approach to incorporating these indicators follows Kuo and Mallick (1998), wherein we replace β^y_ℓ with γ_ℓ β^y_ℓ. The modification to β^y_ℓ controls lag dependence in the mixture kernels.
Our proposed modification to the weight kernels N^{(h)}(y_{t−1:L} | µ^x, Σ^x) totally eliminates dependence on lags for which γ_ℓ = 0, and is most clearly understood in the context of the sequential construction of weight kernels given in (5). We replace β^x_{ℓ,r} with γ_ℓ γ_r β^x_{ℓ,r}. Additionally, if γ_ℓ = 0, the univariate Gaussian density associated with y_{t−ℓ} is replaced with 1. This is equivalent to appropriately subsetting {β^x_{ℓ,r}} and δ^x prior to constructing the covariance matrix Σ^x, reducing the dimensionality of N^{(h)}(y_{t−1:L} | µ^x, Σ^x) to n_γ = ∑_{ℓ=1}^{L} γ_ℓ. If n_γ = 0, then the weight function reduces exclusively to ω, resulting in a standard univariate Gaussian DPM model. This approach reduces computational burden and offers a clean, complete lag selection, conditional on γ = (γ_1, . . . , γ_L).
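The subsetting operation can be sketched directly (our code): evaluate the weight kernel only on the active coordinates of x, so inactive lags drop out of the weight entirely (the diagonal Σ^x case is shown for simplicity; all values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

L = 4
mu_x = np.array([0.0, 1.0, -0.5, 2.0])
Sigma_x = np.diag([1.0, 0.5, 2.0, 1.5])       # diagonal case for simplicity

def weight_kernel(x, gamma):
    """Weight kernel restricted to active lags; inactive lags drop out."""
    if not gamma.any():
        return 1.0        # n_gamma = 0: weights reduce to omega alone
    idx = np.flatnonzero(gamma)
    return float(multivariate_normal(mu_x[idx],
                                     Sigma_x[np.ix_(idx, idx)]).pdf(x[idx]))

gamma = np.array([1, 0, 1, 0], dtype=bool)    # lags 1 and 3 active
x = np.array([0.1, 5.0, -0.3, -9.0])
x2 = x.copy()
x2[1] = -100.0                                # perturb an inactive lag
print(weight_kernel(x, gamma) == weight_kernel(x2, gamma))   # True
```

The check confirms that perturbing an inactive lag leaves the weight unchanged, which is the sense in which selection under γ is complete.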
The modification for lag selection affects the hierarchical model in (6) through 1) the regression mean in the mixture kernel distribution for y_t, which becomes µ_y − Σ_{ℓ=1}^L γ_ℓ β^y_ℓ (y_{t−ℓ} − µ^x_ℓ); 2) the construction of N^{(h)}(y_{t−1:L}) in the discrete distribution for s_t; and 3) the addition of a prior for {γ_ℓ}. We again favor simplicity and assign independent Bernoulli(π_{γℓ}) priors to each γ_ℓ. One option is to set π_{γℓ} equal to a constant for all lags, a common choice for variable selection in regression settings. When modeling nonlinear dynamics, however, subsets of lags are often highly correlated and subject to aliasing. We thus prefer to use, as a default, a decreasing sequence for π_{γℓ} that helps identify the model by giving ordered preference to lower lags. As a specific choice, π_{γℓ} = 0.1 + (0.4/0.5)(0.5)^ℓ, for ℓ = 1, . . . , L, geometrically decreases from 0.5 to 0.1 to promote sparsity and dimension reduction. Supplement S3 explores posterior sensitivity to these prior options.
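The default prior inclusion sequence can be computed directly; a minimal sketch (Python, with an illustrative function name rather than anything from the authors' package):

```python
# Geometric-decay prior inclusion probabilities for lag selection:
# pi_l = 0.1 + (0.4/0.5) * 0.5**l decreases from 0.5 at lag 1 toward 0.1.

def prior_inclusion_probs(L, lo=0.1, hi=0.5, rate=0.5):
    """Return [pi_1, ..., pi_L], decaying geometrically from `hi` to `lo`."""
    return [lo + (hi - lo) / rate * rate ** l for l in range(1, L + 1)]

probs = prior_inclusion_probs(5)
# pi_1 = 0.1 + 0.8 * 0.5 = 0.5; each subsequent lag halves the excess over 0.1.
```

The sequence never reaches the floor 0.1 exactly, so every lag retains nonzero prior inclusion probability.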

Posterior inference
The proposed setup is minimally disruptive to the MCMC algorithm outlined in Section 2.3. Conditional on γ, the effect of selection on the mixture kernels, and hence most of the Gibbs updates, is straightforward. We update γ as a block, with a Metropolis step that proposes switching a random subset of γ (similar to Section 3.3 of Schäfer and Chopin, 2013). Furthermore, when γ_ℓ = 0, draws for the associated parameters revert to their prior distributions, which may be diffuse relative to their posterior distributions when γ_ℓ = 1, producing draws that discourage returning to γ_ℓ = 1. Alternative methods such as Gibbs variable selection (Dellaportas et al., 2002) adapt the prior to improve mixing, but require tuning. We do not pursue this here, but note that despite mixing difficulties and attenuated posterior probabilities for alternate lag configurations, our experience has been that MCMC chains can provide useful inferences. We recommend running multiple MCMC chains, initialized at different selection configurations. We begin MCMC with a phase in which γ is not updated, followed by iterated tuning/adaptation and burn-in phases with the full sampler, and conclude with a final burn-in.
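The random-subset switching proposal can be sketched as follows. This is a hypothetical Python rendering of the move described above (after Schäfer and Chopin, 2013, Sec. 3.3), not the authors' implementation:

```python
import random

def propose_gamma(gamma, kmax=3, p=0.5):
    """Propose a new inclusion vector by flipping a random subset of entries.

    The subset size k is drawn from a geometric distribution truncated to
    {1, ..., kmax}. Because the proposal is symmetric, the Metropolis ratio
    reduces to the ratio of posterior masses of the candidate and current
    configurations. `kmax` and `p` are illustrative tuning choices.
    """
    L = len(gamma)
    weights = [p * (1 - p) ** (k - 1) for k in range(1, kmax + 1)]
    k = random.choices(range(1, kmax + 1), weights=weights)[0]
    flip = random.sample(range(L), k)
    cand = list(gamma)
    for l in flip:
        cand[l] = 1 - cand[l]  # switch 0 -> 1 or 1 -> 0
    return cand

# Accept the candidate with probability min(1, posterior(cand) / posterior(gamma)).
```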

Local lag selection
Local lag selection modestly increases computational complexity and MCMC runtime relative to global selection (see Supplement S3). This is due to H repeated calculations of the weight denominator across all observations, each requiring up to O(THL³) operations, or O(THL) with diagonal Σ_x. In our experience, however, the increased MCMC efficiency renders local selection worthwhile.
Inference for global dependence can be assessed with local selection, but is more nuanced.
We assess global lag dependence by monitoring the quantity Σ_h γ_ℓ^{(h)} Σ_t 1(s_t = h)/(T − L), which gives the proportion of observations in the time series belonging to mixture components for which lag ℓ is active. Alternatively, we can replace γ_ℓ^{(h)} in the preceding expression with γ_ℓ^{(h)} 1(|β^y_{ℓ,(h)}| > b_0), for some small threshold b_0 > 0, requiring both dependence in the weights and a minimum contribution to the slope of the kernel for a lag to be considered active.
The quantities Σ_h ω_h γ_ℓ^{(h)} and π_{γℓ} are also informative. Inferences for transition densities and associated functionals again follow the procedures in Section 4.2.
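The monitored proportion of observations allocated to components with lag ℓ active can be computed from a single posterior sample of the allocations and local indicators; a toy Python sketch with illustrative names (`s`, `gamma`):

```python
# Global lag-activity summary under local selection: the proportion of
# (post-lag) observations allocated to mixture components whose local
# indicator for lag l is on. Names are illustrative, not the authors' code.

def lag_activity(s, gamma, l):
    """s: component labels s_t for t = L+1..T; gamma[h][l]: local indicator."""
    return sum(gamma[h][l] for h in s) / len(s)

s = [0, 0, 1, 1, 1, 2]            # toy allocations for six observations
gamma = [[1, 0], [1, 1], [0, 0]]  # three components, two lags
print(lag_activity(s, gamma, 0))  # lag 1 active in components 0 and 1 -> 5/6
```

Averaging this statistic over posterior samples yields the global activity summary monitored in the text.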

Data illustrations incorporating lag selection
We now revisit the analyses from Section 3 with lag selection, and include two additional examples. All models in this section use diagonal Σ_x and default prior settings. Parameters of the components of G_0 associated with η^y were fixed. For each example, four MCMC chains were randomly initialized using the strategy described in Section 2.3: two chains were initialized with all lags off and two with all lags on. Tuning stages were followed by 300,000 burn-in samples. The next 500,000 iterations were then thinned to 5,000 for inference (and further thinned for computationally expensive functionals such as surfaces). Both global and local lag selection were employed and compared.

Simulated data: linear autoregression
To test the model's ability to identify simple structure, for which the proposed model is over-specified, we fit the models with lag selection; results are summarized in Figure 1. A few runs include lag 4, which is reasonable given that the data reside in two diagonal quadrants of the (y_{t−2}, y_{t−4}) lag-embedding space.

Old Faithful data
Model runs (T = 294, L = 5, H = 40) fit to the Old Faithful time series have mixed results.
Global lag selection runs with low prior signal-to-noise ratio R = 5.0 all converge to lag 1 only, with no mixing over other configurations. Runs with R = 8.0 continue exploring selection configurations past the specified burn-in phase, on very long time scales. One run retains lag 3 and another uses lag 2 for part of the chain. Local selection yields results similar to global selection, but with exploration of lag inclusion on shorter time scales and some local inclusion of lags 2 and 3. Nevertheless, these runs do not detect the density dependence noted in Section 3.2. Additional runs with priors more favorable to higher lags and larger prior weight-kernel variance also miss dependence on lag 2.

Pink salmon data
We next investigate a time series of annual pink salmon abundance (escapement) in a stream in Alaska, U.S.A. (Alaska Fisheries Science Center, 2018). Pink salmon have a life cycle that reliably follows a two-year pattern (Heard, 1991), so naive modeling of annual population dynamics based on the previous year only would capture inter-population, rather than generational, dynamic dependence.
We expect even lags to have the most influence in predicting the current year's population.
The trace of the natural logarithm of abundance in Figure 5 suggests a comprehensive analysis might appropriately include non-stationarity with long-term trends, which we forego in favor of a simple demonstration. Lag scatter plots (not shown) suggest that we should be able to detect lag dependence structure, even with as few as 30 observations.
Model runs (T = 30, L = 5, H = 25) fit to the pink salmon data demonstrate sensitivity to prior and model specification. Most runs with global lag selection and higher prior signal-to-noise ratio (R = 8.0) deselect all lags, although one run has lag 2 active for many inference samples. Runs with lower R = 5.0 deselect all lags except lag 2, which is on for long periods in three of four chains. Local lag selection consistently retains lag 2 throughout most inference samples, as well as lag 4 occasionally. Increasing R tends to result in a higher inclusion probability for lag 4, presumably from a tendency to over-fit a transition surface informed by data only in diagonal quadrants of the (log(y_{t−2}), log(y_{t−4})) space, similar to the Ricker model above. Figure 6 reports posterior inferences for the transition mean as a function of lag 2, under the global and local lag selection model versions. Inferences for the transition mean as a function of lag 2 alone appear appropriate and insensitive to other lags.

Transition density estimation performance
Transition density estimation is a primary objective of the methodology. To compare density estimation across model configurations and data scenarios, we fit the model to simulated time series exhibiting various features and evaluate Monte Carlo estimates of the Kullback-Leibler (K-L) divergence between the estimated and true transition densities.
The simulated time series are variants of the Ricker-type system in (7). The first modification replaces the additive Gaussian error with multiplicative log-normal error.
Specifically, transitions were generated from y_t = y_{t−2} exp(2.6 − y_{t−2} + ε_t), corresponding to a log-normal transition density. This produces right skew and heteroscedasticity in the transition distribution, which continues to depend exclusively on the second lag.
The lag scatter plot in Figure 7 depicts 250 transitions. We refer to this modification as the single-lag, log-normal simulation. The second modification adds dependence on the first lag through the log-scale, which is set equal to 0.09 y_{t−1}. Thus the transition distribution is still log-normal, with each parameter depending on a separate lag. The lag scatter plot in Figure 8 depicts 500 transitions, demonstrating dependence of the variance on both lags. We refer to this modification as the two-lag, log-normal simulation. In all simulations, a sequence of 1,000 observations was reserved for model fitting, and a validation set of size 1,000 was randomly sampled from the subsequent 9,000 observations. In similar data scenarios with right skew and positive-valued variables, we have previously modeled observations on the logarithmic scale. We nevertheless proceed by fitting these series directly in order to study and compare how the proposed models handle heteroscedasticity, subtle departures from Gaussianity, and subtle variation in lag dependence.
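Under one reading of the two-lag specification (the log-scale taken as the standard deviation of ε_t), the simulation can be sketched as follows; the function name and initial values are illustrative:

```python
import math, random

def simulate_two_lag_lognormal(T, sigma_fn, seed=None, y0=(1.0, 1.5)):
    """Simulate y_t = y_{t-2} * exp(2.6 - y_{t-2} + eps_t), the Ricker-type
    system with multiplicative log-normal error. `sigma_fn` maps the first
    lag to the log-scale of eps_t: a constant for the single-lag variant,
    and (on one reading) 0.09 * y_{t-1} for the two-lag variant."""
    rng = random.Random(seed)
    y = list(y0)
    for t in range(2, T):
        eps = rng.gauss(0.0, sigma_fn(y[t - 1]))
        y.append(y[t - 2] * math.exp(2.6 - y[t - 2] + eps))
    return y

series = simulate_two_lag_lognormal(500, lambda y1: 0.09 * y1, seed=1)
```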
The following models were fit using default settings to all three series: the proposed model (which we denote as the BNP-WMAR model, for Bayesian nonparametric, weighted mixture of autoregressive models) with L = 2, full Σ x , and no lag selection; the BNP-WMAR model with L = 5, diagonal Σ x , and global lag selection; and finally with L = 5, diagonal Σ x , and local lag selection. Three chains were run for the base model that does not incorporate lag selection, and four chains were run for each model with lag selection, with two chains initialized with all lags off and the other two initialized with all lags on.
Each posterior sample was used to create density ordinates, denoted p̂(y_t | y_{t−1:L}) and calculated from the transition density in Section 2.4, appropriately modified by lag selection indicators. With 2,000 replicate simulation draws {y_j^{(i)}}_{i=1}^{2000} from the data-generating distribution (with density p_true(y_t | y_{t−1:L})) for each validation pair {(y_j, x_j)}_{j=1}^{1000}, we approximated the Kullback-Leibler divergence by averaging log p_true(y_j^{(i)} | x_j) − log p̂(y_j^{(i)} | x_j) over the replicate draws, and in turn over validation pairs. This loss metric is reported in Table 1. Even in runs with limited mixing over lag configurations, the discrepancy in K-L loss is minimal.
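The Monte Carlo K-L approximation at a single conditioning value can be sketched as follows, with a Gaussian sanity check; function names are illustrative:

```python
import math, random

def mc_kl(logp_true, logp_hat, replicates):
    """Monte Carlo estimate of KL(p_true || p_hat) at one conditioning value:
    the average of log p_true(y) - log p_hat(y) over draws y ~ p_true.
    A sketch of the loss described in the text."""
    return sum(logp_true(y) - logp_hat(y) for y in replicates) / len(replicates)

def log_norm(y, mu=0.0):
    # log density of N(mu, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2

# Sanity check: KL(N(0,1) || N(1,1)) = 1/2.
rng = random.Random(0)
draws = [rng.gauss(0, 1) for _ in range(20000)]
est = mc_kl(lambda y: log_norm(y, 0.0), lambda y: log_norm(y, 1.0), draws)
# est should be near 0.5
```

In the article's setting, `logp_hat` would be the posterior-sample transition density ordinate and the outer average runs over the 1,000 validation pairs.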
In the two-lag, log-normal scenario, the base model with L = 2 has the advantage of being fixed at the correct lag structure. However, both models with lag selection manage superior performance. In the small sample, the base model over-fits a few points in the sparse region with y t−1 low and y t−2 high, producing inferences that fail to generalize. In contrast, the global selection model retains only lag 2 in three of four runs (selecting none in the poorly performing run), avoiding the over-fitting issue at the expense of missing density dependence on the first lag. The model compensates with a right-skewed transition density when y t−2 is low, irrespective of y t−1 . Local lag selection has similar behavior in the small sample. All models struggle in the region with small values of lag 1 and large values of lag 2, where the true density is far more concentrated than estimated.
In the large sample, global selection is inconsistent, correctly retaining both of the first two lags when initialized with all lags on, but retaining no lags and lag 2 only in respective runs initialized with no lags on. The two runs with correct lag selection yield effective density estimation, capturing variance and skew dependence on y t−1 . Local lag selection does the same, with consistent performance across initializations.
The dimension reduction and parsimony afforded by lag selection provide significant gains in density estimation, as measured by K-L distance, and make a strong case for the proposed model extensions. Global selection can be effective when the dependence structure is simple, but we generally recommend the more versatile local selection model.

Discussion
We have developed a modeling framework for fully nonparametric, nonlinear autoregressive models targeted at estimating transition densities. The modeling objectives of estimating flexible transition densities, accommodating nonlinear dynamics, and selecting active lags offer many degrees of freedom that in most cases will not be entirely identified with data alone. Decisions must be made, and corresponding behaviors encouraged through the prior settings. As such, we recommend completing a thorough exploratory analysis of the data. We further advise that practitioners fit models with a variety of prior signal-to-noise ratio and flexibility settings (through α and possibly δ^x), each with multiple MCMC chains.
Several considerations can help guide which settings are appropriate for a given scenario.
One that bears on lag selection is the interplay between noise and signal. A model attempting to fit noise may erroneously reach into higher dimensions. However, in the absence of noise, finding a high-dimensional structure is an objective of techniques such as time-delay embedding. Another consideration arises from correlation among lags, which can result in multiple distinct lag configurations that each produce comparably effective forecasts. This partially motivates our recommendation of decaying inclusion probabilities in the prior.
Inference for relevant lags remains practically challenging. Our experience has been that results from the models with lag selection tend to exhibit prior sensitivity, a natural consequence of the flexibility discussed. We have also noted that in models with lag selection, mixing challenges intensify with increased time series length, which tends to sharpen posterior modes. This often manifests through kernel coefficients being estimated at small, nonzero magnitudes while corresponding lag selection indicators remain on. Local selection helps alleviate this issue by breaking up the samples informing multiple lag indicators, naturally tempering the posterior distributions and encouraging greater mixing. The cost of added versatility and improved mixing is a more intricate picture of lag importance in posterior analysis.
Although binary lag inclusion parameters are easy to interpret, they offer limited insight into the relative contributions of active lags. Such contributions can be quantified for the mean transition function through functional decomposition, but this is less straightforward for transition densities. Ideally, weak dependence would manifest in the posterior probability of inclusion. Alternatively, we envision a framework that quantifies lag importance with shrinkage of continuous parameters. Continuous quantification of lag importance could in turn reduce the influence of weight kernels relative to the DP weights and thus accommodate multiple sources of influence on the mixture weights.
Notwithstanding theoretical and practical challenges, lag selection is critical for dimension reduction and is an integral part of this work. Simpler models can partially avoid some of the challenges noted, but risk failing to model, or even detect, nonlinear and/or non-Gaussian dynamics. Our proposed methods extend Bayesian nonparametric density autoregressive modeling by accommodating multiple lags and providing a framework for lag selection that works in concert with the other objectives.

S1 Model for stationary time series
One way to ensure stationarity of a process with transition kernel (4) is to espouse the joint density interpretation of the mixture (2) and constrain the mean vector to µ_h = µ_h 1 and Σ_h to be positive definite Toeplitz, thereby ensuring reversibility with identical marginal distributions. This is the approach taken by Antoniano-Villalobos and Walker (2016) and Kalli and Griffin (2018), who assume an AR(1) covariance structure for Σ_h. Generally, each mixture component accommodates up to L + 2 parameters. The Markovian likelihood then arising from (4) precludes closed-form Gibbs sampling.
Model-based lag selection presents the primary challenge in a stationary model with L > 1. An additional requirement is that lag selection is global. To see this, consider a two-component mixture of bivariate Gaussian densities, with each component using a separate lag: f (y t , y t−1 , y t−2 ) = ω N (1) (y t , y t−1 ) + (1 − ω) N (2) (y t , y t−2 ). Integrating out y t leaves a mixture of univariate Gaussian densities, whereas integrating out y t−2 leaves one bivariate and one univariate component, yielding two distinct marginal densities for (y t , y t−1 ) and (y t−1 , y t−2 ). We can maintain stationarity with a joint density defined over a set of nonconsecutive lags (e.g., f (y t , y t−2 )) if we assume that conditionally independent (in the transition) subsequences of the time series follow the same stationary distribution.
Given the structural requirement on Σ_h, the Cholesky factorization in Section 2.1.1 is not useful in the stationary case. We instead consider constructing Σ_h from autoregressive coefficients, which we denote φ_h = (φ^{(h)}_1, . . . , φ^{(h)}_L), allowing some control over lag dependence. Given φ_h, µ_h, and innovation variance σ²_h, we can recover Σ_h by solving the set of homogeneous difference equations defining the stationary autocovariance (equivalently, the Yule-Walker equations). To maintain causality (a sufficient condition for stationarity) of each mixture component, all roots of the AR characteristic polynomial must lie outside the unit circle (Shumway and Stoffer, 2017, pp. 86, 95, 113).
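The causality check can be carried out numerically by inspecting the roots of the characteristic polynomial 1 − φ_1 z − · · · − φ_L z^L; a sketch using NumPy (the helper name is ours, not the authors'):

```python
import numpy as np

def is_causal(phi):
    """Check whether AR coefficients phi = (phi_1, ..., phi_L) define a causal
    (hence stationary) component: all roots of 1 - phi_1 z - ... - phi_L z^L
    must lie outside the unit circle."""
    # np.roots expects coefficients ordered from highest degree to constant.
    coefs = np.r_[-np.asarray(phi, dtype=float)[::-1], 1.0]
    roots = np.roots(coefs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_causal([1.2, -0.7]))  # AR(2) used in Supplement S5.4 -> True
print(is_causal([1.5]))        # explosive AR(1) -> False
```

In an MCMC implementation, this check would be applied to every candidate φ (or subsetted φ under a candidate γ) before evaluating the acceptance ratio.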
We briefly note possible approaches to modeling lag dependence with φ_h: 1. Specify a full φ_h vector for each mixture component and use {γ_ℓ φ^{(h)}_ℓ}_{ℓ=1}^L to construct Σ_h, as in Section 4. A drawback of this approach is that the weight kernels remain L-variate Gaussian densities, dependent on all L lags.
2. To eliminate dependence on inactive lags, use the original φ_h to construct Σ_h before marginalizing both mixture and weight kernels as ∫ N^{(h)}(y_t, y_{t−1:L}) dȳ_{t−1} and ∫ N^{(h)}(y_{t−1:L}) dȳ_{t−1}, respectively, where ȳ_{t−1} contains the y_{t−ℓ} for which γ_ℓ = 0. A drawback of this approach is that the model is over-parameterized, leaving unidentified parameters and possibly inflating uncertainty.
3. Combine approaches 1 and 2, removing the effect of inactive lags when constructing the transition density, with joint densities defined over possibly nonconsecutive lags.
This approach seems the most promising. In all cases, it would be necessary to check the roots of characteristic polynomials prior to accepting any candidate φ or γ during MCMC, and mixing could suffer from inability to marginalize over mixture kernel parameters. One way to avoid checking roots is to work directly on the space of characteristic polynomial roots, as in Huerta and West (1999), although they disallow gaps in lag dependence and utilize reversible-jump algorithms to identify model order. Nevertheless, if stationarity is required in a particular modeling scenario, it may prove worthwhile to consider these approaches.

S2 Prior settings
We can visualize the effects of prior settings through prior simulation in low-dimensional models. As an example, Figure S1 shows prior realizations of the transition mean in a single-lag model. Restricting the effective number of mixture components results in transition mean functions with few change points and long stretches of near linearity, whereas allowing more components increases variability in the curve. Low values for δ_x likewise encourage rigid transition mean curves with abrupt change points. Increasing the variance in the weight kernels has a smoothing effect, as expected. Note that in regions of the lag space where multiple mixture components carry significant weight, the transition density can be multimodal, with a transition mean that does not closely follow any one of the component-specific lines. As with the transition mean, one can use prior simulation to elucidate the effects of prior settings for transition densities to aid practitioners in specifying desired characteristics and performing sensitivity analyses.

S3 Computing time and sensitivity analysis
We have noted sensitivity of results to model and prior settings at multiple points. While model flexibility can be a feature, we underscore the importance of understanding the effects of certain settings on model performance and posterior inferences. This section reports a sensitivity analysis using simulated data from the Ricker model (Sections 3.1 and 4.4) based on a 2^6 factorial experiment with the factors listed in Table S1. Each treatment combination includes two replicates, yielding 128 model runs in total. We measured the following diagnostic responses: MCMC run timing, MCMC convergence issues, posterior summaries of lag inclusion, number of occupied mixture components, interval-estimate width for the transition mean functional, and mean-squared error. These results do not necessarily generalize to other data sets, whose complexity and order of dependence can affect the monitored responses. However, they do confirm intuition about the roles of these factors.
Aside from the controlled factors, default settings were used for the models, including L = 5 and diagonal Σ_x. The DP truncation was set at an overly conservative H = 80, which appears to be sufficiently high to accommodate the prior for high α.

Table S1: Six factors used in the sensitivity analysis, including values for the two levels used for each. Ga(a, b) denotes a gamma distribution with mean a/b. Sequences of prior lag-inclusion probabilities are either geometrically decreasing from 0.5 (first lag) to 0.1 (lag L), or constant at 0.5.
Posterior means for α concentrate below 0.5 under the Ga(10, 10) prior and around 4.5-6.5 under the Ga(60, 8) prior, indicating that, for small to moderate sample sizes, this prior choice is essentially a model setting with only mild influence from data.

Timing
Timing and sensitivity runs were performed on single-core, 64-bit Intel® Xeon® Gold 6248 processors running at 2.50 GHz, using our package in Julia Version 1.4.1 (Bezanson et al., 2017). We report on the period after adaptation and burn-in.

Convergence
Among the 128 chains, 18 showed visible signs of moderate-to-high difficulty in convergence of the log-likelihood or lag-selection indicators, and an additional 16 were unacceptable.
Traces for γ that remained at the truth for all inference samples were considered acceptable, while chains that switched once or very infrequently were flagged, depending on severity.
None of the factors was predictive of difficulty generally, though several appeared to influence major difficulty. Chains had a greater tendency to become stuck in a mode with longer time series. Models employing global selection were more than twice as likely to experience major difficulty as those with local selection. Initialization is also important: several chains initialized with all lags off failed to select any lag during the run (i.e., a longer burn-in was necessary). Mixing issues were also more common among runs with higher α.
The remaining analyses exclude results from the 16 unacceptable chains.

Lag selection
With simulated data from the Ricker model, the second lag is consistently and decisively selected. However, variations along the fourth lag can give the appearance of second-order dynamics, or aliasing can lead a model to select lag 4 only. Using posterior inclusion probability on lag 4 (for global selection, and inclusion probabilities on occupied clusters for local selection) as the response, there appears to be sensitivity to the prior inclusion probabilities. In this sense, the geometric sequence succeeds in avoiding the aliasing problem by discouraging inclusion of lag 4. Local lag selection also appears to include lag 4 less often. Fits to longer time series also included lag 4 less often in these runs. Initialization of γ did not systematically affect these runs, but has appeared to contribute in others.
While these appear to be common patterns, it is difficult to predict inclusion of lag 4 from run to run.

Number of occupied components
Model inputs designed to influence the number of occupied mixture components are α and the prior signal-to-noise ratio, R, which influences flexibility of the transition mean functional through kernel variance parameters. Higher values of each indeed increase the number of occupied components, although the influence of α is weak. Larger sample sizes likewise increase the number of occupied components, with a compound effect when R is also high.

Width of interval estimates
In addition to the substantial effect of sample size in reducing the width of interval estimates, we found that low values of α have a similar effect. One possible explanation for the α effect is that a larger number of mixture components carry more weight a priori (through ω parameters) when α is high, increasing uncertainty.

Mean-squared error
Because the simulated transition distribution in this scenario is a nonlinear function with additive Gaussian noise, we consider estimation performance of the transition mean functional. Squared errors between the true transition function and estimated transition mean functional at each observed value were averaged across observations and a subset of posterior samples. Results are summarized in Figure S3.
Beyond the obvious effect of sample size, forcing low α also decreased the mean-squared error (MSE) in these runs, likely for the same reasons it decreased the width of interval estimates. If we restrict attention to the estimation performance on low values of y t−2 , a region with stronger nonlinearity in the transition mean (see Figure 1), then the prior signal-to-noise ratio R becomes important, especially with the longer time series (n = 150).
Increasing R allows the model to better fit curvature in the transition mean function.

Figure S3: Box plots of log mean-squared error (MSE) using sensitivity runs on the Ricker simulation data, restricted to observations where y_{t−2} < 2.5. Groups are separated by prior signal-to-noise ratio R, DP concentration parameter α, and fit sample size n. Each group includes between 12 and 16 runs.

S4 MCMC details for the base model
If we condition on the first L observations of the time series, the hierarchical model (6) yields the full joint posterior distribution over all model parameters up to proportionality, given in (S1). The Gibbs sampler proceeds by successively sampling the parameters in the sets and manner described below.

Latent states
The latent states identifying component membership for each observation y_t are updated individually, for t = L + 1, . . . , T, using their discrete full conditional distributions Pr(s_t = h | · · · ), for h = 1, . . . , H. The step is Metropolized by first drawing a candidate with probability mass proportional to the full conditional, excluding the current state. The Metropolis acceptance ratio is then the sum of full conditional probabilities excluding the current state, divided by the sum of full conditional probabilities excluding the candidate state (Robert and Casella, 2004, p. 394).
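A minimal sketch of this Metropolized discrete update (illustrative Python, not the authors' Julia package):

```python
import random

def metropolized_discrete_step(weights, current, rng=random):
    """One Metropolized Gibbs update for a discrete variable (after Robert
    and Casella, 2004, p. 394). `weights` are the unnormalized full
    conditional probabilities over states 0..H-1. The candidate is drawn
    proportionally to `weights` with the current state excluded; the
    acceptance ratio is sum(weights excluding current) divided by
    sum(weights excluding candidate)."""
    H = len(weights)
    others = [h for h in range(H) if h != current]
    cand = rng.choices(others, weights=[weights[h] for h in others])[0]
    num = sum(weights) - weights[current]
    den = sum(weights) - weights[cand]
    if rng.random() < min(1.0, num / den):
        return cand
    return current
```

Relative to a plain Gibbs draw, this forced-move version can improve mixing over the component labels when one state carries most of the conditional mass.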

Stick-breaking weights
The weights {ω_h}_{h=1}^H that appear in the likelihood are defined through the latent {v_h}_{h=1}^{H−1} which, conditional on the latent states {s_t} and absent the denominator in the first product term of (S1), admit H − 1 independent beta full conditional distributions (Ishwaran and James, 2001), given in (S3), where the n*_h = Σ_{t=L+1}^T 1(s_t = h), for h = 1, . . . , H, count membership in each of the H components. We define v_H = 1 for convenience in notation. To obtain direct samples from this distribution, we employ the multivariate hyperrectangle slice sampler of Neal (2003) (summarized in Figure 8 of that article) to update all v_h, h = 1, . . . , H − 1, simultaneously, as follows.
Let v = (v_1, . . . , v_{H−1}) denote the vector of latent beta variables used to construct {ω_h}_{h=1}^H, and let g(v) denote the unnormalized posterior full conditional density (S3) evaluated at v. The algorithm employs user-specified tuning parameters {τ_h}_{h=1}^{H−1}, all of which we conservatively fix equal to 1.0 to ensure that the entire support of v (i.e., the hypercube (0, 1)^{H−1}) can be reached in any iteration of MCMC.
Let v_0 denote the value of v from the previous iteration of MCMC, and v_1 denote the output of this algorithm, which proceeds as follows (Figure 8 of Neal, 2003):
1. Draw a slice level u uniformly between 0 and g(v_0).
2. Randomly position a hyperrectangle with side lengths {τ_h} around v_0.
3. Propose candidates v* uniformly from the hyperrectangle, shrinking its sides toward v_0 when points are rejected, and return as v_1 the first v* satisfying g(v*) > u.
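The shrinking-hyperrectangle update can be sketched as follows; this is a simplified Python rendering of Neal's Figure 8 under the unit-hypercube support used here, with a stand-in target g:

```python
import math, random

def slice_hyperrect(g, v0, widths, rng=random):
    """One update of the multivariate slice sampler with hyperrectangle
    shrinkage (after Neal, 2003, Fig. 8). `g` is an unnormalized density on
    the unit hypercube; `widths` are the tuning parameters tau_h (all 1.0 in
    the text). Illustrative sketch, not the authors' implementation."""
    # Slice level: log u = log g(v0) - Exp(1) is log of Uniform(0, g(v0)).
    logu = math.log(g(v0)) - rng.expovariate(1.0)
    # Randomly position the hyperrectangle around v0, clipped to (0, 1).
    lo = [max(0.0, v - w * rng.random()) for v, w in zip(v0, widths)]
    hi = [min(1.0, l + w) for l, w in zip(lo, widths)]
    while True:
        cand = [l + rng.random() * (h - l) for l, h in zip(lo, hi)]
        if math.log(g(cand)) > logu:
            return cand
        for i in range(len(v0)):  # shrink the rejected side toward v0
            if cand[i] < v0[i]:
                lo[i] = cand[i]
            else:
                hi[i] = cand[i]
```

Because the hyperrectangle shrinks toward v_0, which always lies on the slice, the loop terminates with probability one.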

Component-specific parameters
The posterior full conditional density for each η_h is given in (S5), where y^{(h)} is an n*_h-length vector containing all y_t such that s_t = h, and D_h is an n*_h × (L + 1) design matrix whose rows correspond to y^{(h)} and are composed of (1, µ^x_{1,(h)} − y_{t−1}, . . . , µ^x_{L,(h)} − y_{t−L}) for each t such that s_t = h.
Note that proportionality in (S5) is preserved with respect to the {µ^x_{ℓ,(h)}}, which appear in the regression means for y_t. Aside from the factor containing normalizing weights in the mixture denominator, the full conditional for η^x_h could be factored into a series of conjugate updates that could serve as proposal distributions for a Metropolis step. This approach yields low acceptance rates in practice, and we instead utilize a random-walk Metropolis sampler with jointly Gaussian proposals for all parameters in η^x_h (with the {δ^x_ℓ} parameters proposed on the logarithmic scale), which are evaluated using (S5). Proposals that produce computationally singular covariance matrices are automatically rejected.

DP concentration parameter
The posterior full conditional density for the DP concentration parameter α is proportional to α^{a_α + H − 2} exp{−α (b_α − log ω_H)}, yielding a gamma update with shape a_α + H − 1 and rate b_α − log(ω_H).
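The resulting draw is a single gamma variate; sketched in Python with hypothetical hyperparameter values (note that −log ω_H = −Σ_h log(1 − v_h) > 0, so the rate is always positive):

```python
import math, random

def update_alpha(omega_H, a_alpha, b_alpha, H, rng=random):
    """Draw the DP concentration parameter from its gamma full conditional:
    shape a_alpha + H - 1 and rate b_alpha - log(omega_H), where omega_H is
    the last stick-breaking weight."""
    shape = a_alpha + H - 1
    rate = b_alpha - math.log(omega_H)
    # random.gammavariate is parameterized by (shape, scale), so invert the rate.
    return rng.gammavariate(shape, 1.0 / rate)

alpha = update_alpha(omega_H=0.05, a_alpha=2.0, b_alpha=4.0, H=25,
                     rng=random.Random(0))
```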

Adaptation
After initialization, MCMC begins with a tuning phase for diagonal elements of the covariance matrix in the candidate-generating Gaussian proposal distribution for {η x h } H h=1 . Optionally, an adaptation phase may then be used to further tune a full candidate covariance matrix.
This proceeds in four steps. In the first step, the initial covariance matrix is globally scaled to adjust acceptance rates collected over a short run. This is repeated iteratively until all acceptance rates fall within a pre-specified range (we set the range low, e.g., [0.02, 0.20], to promote exploration) or a maximum number of attempts is reached. In the second step, the proposal variances are scaled locally by parameter groups corresponding to µ x , {β x r }, and δ x , while preserving correlations. In the third step, empirical cross-covariance matrices are estimated from a longer run. In the final step, these empirical covariance matrices are scaled globally until acceptance rates fall within the pre-specified range, or a maximum number of attempts is reached. At this point, adaptation ceases and the scaled empirical covariance matrices are used for subsequent random-walk proposals. We advise against adapting prematurely, which can cause the chain to develop an affinity for a local mode during burn-in.

S5.1 Posterior inference with global selection
Conditional on γ = (γ_1, . . . , γ_L), the selection effect on the mixture kernels can be passed through to the D_h matrices, for which all elements in column ℓ + 1 are replaced with 0s if γ_ℓ = 0. Because η^x_h is updated with a Metropolis step, one simply draws candidate values and evaluates (S5) with each N^{(h)}(y_{t−1:L}), for h = 1, . . . , H and t = L + 1, . . . , T, appropriately modified with respect to γ. The full conditional distribution for η^y_h is then sampled using the modified D_h. All other updates proceed as before, using the appropriately modified N^{(h)}(y_{t−1:L}) and kernel means.

S5.2 Transition density estimation under global lag selection
For any value of y and x, or over a multidimensional grid of values, samples for f_{Y|X} are calculated from f̃_{Y|X}(y | x, γ) = Σ_{h=1}^H q̃_h(x | γ) N(y | µ^{(h)}(x | γ), σ²_{(h)}), with q̃_h(x | γ) = ω_h N^{(h)}(x | γ) / Σ_{j=1}^H ω_j N^{(j)}(x | γ) and µ^{(h)}(x | γ) = µ_{y,(h)} − Σ_{ℓ=1}^L γ_ℓ β^y_{ℓ,(h)} (x_ℓ − µ^x_{ℓ,(h)}). The samples can then be used to create pointwise estimates and intervals for f̃_{Y|X}.
The expression for the transition mean becomes Ẽ_{Y|X}(y | x, γ) = Σ_{h=1}^H q̃_h(x | γ) µ^{(h)}(x | γ). Analogous expressions include γ in the procedure for estimating quantiles in Section 2.4. Likewise, K-step-ahead forecasts are inductively sampled with (s, y)_{T+k} pairs for k = 1, . . . , K, following the first two levels of the hierarchical model (6), adjusted for γ, for each posterior sample. While dependence on other parameters in (6) is implicit in the preceding expressions, we add explicit dependence on γ in order to emphasize the modifications necessary to include lag dependence.
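For a single posterior sample, the transition mean under global selection can be evaluated as a weighted mixture of component regression means; a minimal Python sketch with hypothetical parameter containers (the model's actual objects live in the authors' Julia package):

```python
import math

def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def transition_mean(x, gamma, omega, mu_x, sd_x, mu_y, beta_y):
    """x: lag vector (y_{t-1}, ..., y_{t-L}); gamma: global indicators.
    omega[h], mu_x[h][l], sd_x[h][l], mu_y[h], beta_y[h][l] come from one
    posterior sample; sd_x plays the role of a diagonal Sigma_x."""
    H, L = len(omega), len(x)
    # weight kernels: product of univariate densities over active lags only
    kern = [math.prod(normal_pdf(x[l], mu_x[h][l], sd_x[h][l])
                      for l in range(L) if gamma[l]) for h in range(H)]
    q = [omega[h] * kern[h] for h in range(H)]
    tot = sum(q)
    means = [mu_y[h] - sum(gamma[l] * beta_y[h][l] * (x[l] - mu_x[h][l])
                           for l in range(L)) for h in range(H)]
    return sum(qh * m for qh, m in zip(q, means)) / tot
```

With a single component and lag 2 deselected, the second coordinate of x has no effect, matching the complete-removal behavior described in Section 4.1.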

S5.3 Posterior inference with local selection
The posterior full conditional probability that γ (h) ℓ = 1 is Pr(γ (h) ℓ = 1 | · · · ) = π γ a h,ℓ / (π γ a h,ℓ + (1 − π γ ) b h,ℓ ), where a h,ℓ and b h,ℓ are products of the Gaussian densities evaluated with γ (h) ℓ = 1 and γ (h) ℓ = 0, respectively. Note that the first product in each line is over all time points allocated to mixture component h, and the second product is over all time points t = L + 1, . . . , T . The Gaussian densities in these expressions are from (S1), modified to reflect either γ (h) ℓ = 1 or 0, and appropriately reflecting all other {γ (i) k : i ≠ h or k ≠ ℓ}, which are held constant. Instead of drawing from the individual full conditionals, we update each γ h as a block using collapsed conditionals, with {η y h } integrated out, as in Supplement S4. First, a number of proposed switches, k ∈ {1, 2, 3}, is drawn from a truncated geometric distribution. Then, a uniformly drawn subset of k indices among {1, . . . , L}, denoted K, identifies which elements of the proposed γ cand h are switched from the current state (individually, from 0 to 1 or from 1 to 0). Symmetry of this proposal distribution (Schäfer and Chopin, 2013, Sec. 3.3) yields the Metropolis ratio p(γ cand h | · · · , −{η y h }) / p(γ old h | · · · , −{η y h }), with all quantities calculated using the full γ h vector under evaluation and all other {γ j : j ≠ h} held constant.
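The block proposal can be sketched as below; the truncated-geometric success probability is an assumed tuning value, and the Metropolis accept/reject step with the collapsed conditionals is omitted:

```python
import random

def propose_gamma_block(gamma, max_switches=3, p_geom=0.5, rng=random):
    """Symmetric block proposal for the local-selection indicators of
    one component: draw k in {1, ..., max_switches} from a truncated
    geometric distribution, then flip a uniformly drawn size-k subset
    of indices. Symmetry means the proposal densities cancel in the
    Metropolis ratio."""
    # truncated geometric weights on {1, ..., max_switches}
    weights = [p_geom * (1 - p_geom) ** (k - 1)
               for k in range(1, max_switches + 1)]
    k = rng.choices(range(1, max_switches + 1), weights=weights)[0]
    flip_idx = rng.sample(range(len(gamma)), k)  # the subset K
    cand = list(gamma)
    for i in flip_idx:
        cand[i] = 1 - cand[i]  # switch 0 -> 1 or 1 -> 0
    return cand
```

Because flipping the same subset maps the candidate back to the current state with equal probability, the proposal is symmetric and only the ratio of collapsed conditionals enters the acceptance probability.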

S5.4 Illustration of global selection on linear autoregressive simulation
We demonstrate the model's ability to identify simple structure, for which the proposed model is over-specified. Although none of the nonlinear, non-Gaussian, or mixture capabilities is necessary in this case, the model performs well. A stationary time series was generated from the model y t = µ + φ 1 (y t−1 − µ) + φ 2 (y t−2 − µ) + ε t , ε t iid ∼ N(0, σ 2 ), with µ = 2.5, φ 1 = 1.2, φ 2 = −0.7, and σ 2 = 1.0. We then fit the proposed nonparametric model to series of length T = 75 and T = 305 with a lag horizon of L = 5 (so that 70 and 300 observations contribute to the likelihood), DP truncation at H = 25, and default prior settings. With the short time series, both global and local lag selection recover the true structure, with all chains decisively selecting the first two lags only. With the long time series, all methods select the first two lags and include lag 3 in a non-negligible fraction of the samples (ranging from 0.2 to 0.6, with reasonable mixing in each chain). Other inferences from all fits appear to accurately recover the truth, with the exception that a few runs with local selection occasionally tend to over-fit the data, adding one or two unnecessary mixture components. Another exception is for µ y , which in the model is accompanied by the µ x parameters in the lag summands, and is thus over-parameterized for this stationary time series. However, the intercept, which is a function of µ y and the lag coefficients, is precisely identified. Trace plots for the coefficients of lags 4 and 5 are similar to that of lag 3, reflecting their prior with mean 0 in the next level of the hierarchy.
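The data-generating model for this illustration can be reproduced with a short simulation; the burn-in length and seed are arbitrary choices:

```python
import numpy as np

def simulate_ar2(T, mu=2.5, phi1=1.2, phi2=-0.7, sigma2=1.0, seed=0):
    """Generate a stationary AR(2) series:
    y_t = mu + phi1 (y_{t-1} - mu) + phi2 (y_{t-2} - mu) + eps_t,
    eps_t ~ iid N(0, sigma2). A burn-in period is discarded so the
    returned series is approximately stationary."""
    rng = np.random.default_rng(seed)
    burn = 200
    y = np.empty(T + burn)
    y[:2] = mu  # initialize at the stationary mean
    eps = rng.normal(0.0, np.sqrt(sigma2), size=T + burn)
    for t in range(2, T + burn):
        y[t] = mu + phi1 * (y[t - 1] - mu) + phi2 * (y[t - 2] - mu) + eps[t]
    return y[burn:]
```

With φ 1 = 1.2 and φ 2 = −0.7 the characteristic roots are complex, so the series oscillates around µ = 2.5 while remaining stationary.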
Inferences for the transition mean surface and transition densities for specific lag values (not shown), both as functions of y t−1 and y t−2 , are consistent with the data-generating mechanism. Specifically, the estimated mean surface is very close to the true plane.
Posterior mean estimates of transition densities are nearly the correct Gaussian distributions.
Furthermore, marginal posterior standard deviations are nearly identical to standard errors from a correctly specified linear model fit to the time series. Hence, conditional on admittedly overconfident lag inferences, the proposed model performs well in a simple scenario, with surprisingly low cost for additional flexibility.