Cross-validatory model selection for Bayesian autoregressions with exogenous regressors

Bayesian cross-validation (CV) is a popular method for predictive model assessment that is simple to implement and broadly applicable. A wide range of CV schemes is available for time series applications, including generic leave-one-out (LOO) and K-fold methods, as well as specialized approaches intended to deal with serial dependence such as leave-future-out (LFO), h-block, and hv-block. Existing large-sample results show that both specialized and generic methods are applicable to models of serially-dependent data. However, large sample consistency results overlook the impact of sampling variability on accuracy in finite samples. Moreover, the accuracy of a CV scheme depends on many aspects of the procedure. We show that poor design choices can lead to elevated rates of adverse selection. In this paper, we consider the problem of identifying the regression component of an important class of models of data with serial dependence, autoregressions of order p with q exogenous regressors (ARX(p,q)), under the logarithmic scoring rule. We show that when serial dependence is present, scores computed using the joint (multivariate) density have lower variance and better model selection accuracy than the popular pointwise estimator. In addition, we present a detailed case study of the special case of ARX models with fixed autoregressive structure and variance. For this class, we derive the finite-sample distribution of the CV estimators and the model selection statistic. We conclude with recommendations for practitioners.


Overview
Many workflows for constructing predictive Bayesian models require the practitioner to choose the best model among a number of candidates according to their predictive power for the task at hand. Although many predictive model selection methods are available (Vehtari and Ojanen, 2012), among the most popular is cross-validation (CV; Geisser, 1975). CV is flexible and applicable to a wide variety of statistical applications.
In finite samples, CV-based selection objectives are biased and subject to sampling variation, which leads to uncertainty about model predictive ability and the possibility of adverse model selection (Arlot and Celisse, 2010; Sivula et al., 2020). In the Bayesian context, there is a large literature on the frequency properties (i.e. variability across multiple realizations of a dataset) of model selection rules using information criteria such as the widely applicable information criterion (WAIC) and the Bayes factor (e.g., Ward, 2008; Schad et al., 2022). Despite its popularity in Bayesian applications, however, less is known about the frequency properties of dependent CV procedures for Bayesian model selection under log-predictive loss.
Recent work by Sivula et al. (2020) analyzed the frequency properties of leave-one-out CV (LOO-CV) for Bayesian regression models of exchangeable data. The authors identify at least three scenarios that lead to elevated uncertainty in CV model selection, and therefore to an increased probability of adverse model choice. These pathological cases include comparisons between candidate models that produce similar predictions, cases where models are badly misspecified, and cases where training data sizes are small.
In this study, we extend the analysis of Sivula et al. (2020) to Bayesian models of serially dependent data. We aim to characterize CV model selection uncertainty for a simple but important class of models: autoregressions of order p with q exogenous regressors, ARX(p, q). Our goal is to identify the regression component of the model under the logarithmic scoring rule, leaving to one side the related task of identifying the autoregressive component. In this context, a scoring rule is a loss function for assessing the quality of probabilistic predictions (Gneiting and Raftery, 2007). While many scoring rules are available, we focus on the logarithmic scoring rule, for which a measure of predictive performance is the expected log predictive density (elpd) described in Section 2.2.
We address two important aspects of scoring rule design for models of correlated data. First, whether the scoring rule used for model assessment will be univariate or multivariate. Second, for multivariate scoring rules, whether it will be evaluated jointly (as a multivariate predictive density) or pointwise (as univariate marginal densities). We begin with a demonstration of the importance of the latter, showing improved statistical power of model selection with a jointly-evaluated scoring rule. We continue with a detailed case study of model selection under several popular CV schemes. This comparison includes several specific univariate (pointwise) methods, and several joint methods. Throughout, we find that joint methods achieve greater (statistical) efficiency, measured as lower adverse selection rates, and the associated CV estimators tend to have lower variability.
Figure 1 illustrates the importance of joint multivariate assessments when data are correlated. The figure shows two bivariate normal distributions. In one the variates are mutually independent, and in the other they are strongly correlated. The table shows that points A and B have identical pointwise log densities under both distributions, even though point B lies in a region of very low (joint) density in the correlated case. We conclude that in contrast to the joint approach, when correlation is strong the pointwise density fails to detect that point B is in a region of low probability for this model.
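The comparison in Figure 1 is easy to reproduce numerically. The sketch below uses illustrative points A = (1, 1) and B = (1, -1), not necessarily the exact points in the figure; it shows that the pointwise sum of marginal log densities cannot separate the two points, while the joint log density can when correlation is strong.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def joint_logpdf(point, rho):
    # Joint log density under a bivariate normal with unit marginal variances.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).logpdf(point)

def pointwise_logpdf(point):
    # Sum of marginal N(0, 1) log densities; ignores any correlation structure.
    return float(np.sum(norm.logpdf(point)))

A, B = np.array([1.0, 1.0]), np.array([1.0, -1.0])

# Marginals are N(0, 1) regardless of rho, so the pointwise score is identical
# for A and B under both distributions.
print(pointwise_logpdf(A), pointwise_logpdf(B))

# The joint density distinguishes the points only in the correlated case.
print(joint_logpdf(A, 0.0) - joint_logpdf(B, 0.0))  # zero gap
print(joint_logpdf(A, 0.9) - joint_logpdf(B, 0.9))  # large positive gap
```

The gap under rho = 0.9 reflects that B sits far outside the high-density ridge of the correlated distribution even though both of its coordinates are individually unremarkable.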
To further motivate our approach, Figure 2 previews the results of a model selection experiment we will describe in more detail in Section 3. The figure compares the distribution of CV model selection statistics for selecting between two candidate ARX models.

Figure 1: Illustration of the distinction between joint and pointwise log density measures in correlated models. Both plots show bivariate normal densities centered at the origin with unit marginal variance and correlation coefficients of 0 (Panel (a)) and 0.9 (Panel (b)). Panel (c) tabulates joint and pointwise log densities evaluated at the marked points A and B, indicated in red on the plots. In the correlated case, only the joint log density, log p(y_1, y_2), identifies point B as having low log density. In contrast, the pointwise sum log p(y_1) + log p(y_2) is the same for both points.
The panels show increasing degrees of dependence from left to right. As dependence increases, the pointwise model selection statistic shows an increasing rate of adverse model selection (red bars in the vertical margin). In contrast, there is little change for the jointly-evaluated case (red bars in the horizontal margin). For a full description of this experiment, please refer to Section 3.
Several CV strategies are available for models of serially dependent data, and there seems to be little agreement about which one practitioners should adopt, especially in the Bayesian modeling literature. Furthermore, much of the existing literature addresses different blocking strategies, but does not make a distinction between joint and pointwise evaluation of the scoring rule. Moreover, we speculate that different CV schemes will be useful when assessing different aspects of such a model. Even when the analytical focus is not the autoregressive component, joint predictive measures appear to be useful for identifying the regression components.
In contrast to much of the existing literature on CV for autoregressive models, including many large-sample consistency results (e.g., Bergmeir et al., 2018; Racine, 2000), our emphasis is on the frequency properties of the CV estimator in finite samples. Further to the three problematic scenarios identified by Sivula et al. (2020), our results suggest that strong serial dependence and a cross-validatory objective function that does not capture model dynamics (e.g. pointwise objectives) can pose difficulties for CV-based model selection, even when the goal is not limited to identification of the autoregressive component of the model. Our results stand as a counterpoint to large-sample consistency results and suggest that mere consistency of the estimator is not enough. That is, under certain choices of the CV scheme the variance of the model selection statistic can be very high, leading to elevated rates of adverse selection.

Figure 2: Joint log-score differences versus pointwise log-score differences, computed using 10-fold CV model selection statistics for 500 independent posteriors of the 'hard' case (see Section 3). The DGP is a stationary ARX(2, 3) and the candidate models are M_A: ARX(1, 2) and M_B: ARX(1, 1). Model selection statistics are expected log predictive density (elpd) differences; a positive difference selects model M_A and a negative difference selects model M_B. The DGP has autoregressive parameter ϕ* = α(0.75, 0.2), so that α selects increasing serial dependence from left to right across the panels. Adverse selection increases for the pointwise method (vertical axis) as dependence increases. See Section 3 for a full description.

Contributions
We present novel results for procedures that use CV methods under the logarithmic scoring rule when serial dependence is present. Working with the logarithmic scoring rule and focusing on identifying the regression component of the model, we demonstrate that:
• Under serial dependence, CV schemes should be designed to account for the presumed dependence structure of the data in order to achieve good model selection performance;
• When serial dependence is strong, performance measures evaluated jointly are much more (statistically) efficient than pointwise counterparts;
• When the sample size is finite, there is a U-shaped relationship between certain CV scheme hyperparameters and the adverse selection rate;
• We present novel results on the variability of Bayesian CV procedures for ARX(p, q) models. To our knowledge, these are among the first results describing the finite-sample uncertainty of CV methods under serial dependence, particularly in a Bayesian setting.
We offer the following advice for practitioners working with models of serially-dependent data. Broadly speaking, since CV methods based on joint scoring rules are usually more complex to implement, their improved efficiency should be traded off against implementation burden. Following model criticism of each candidate model ahead of model selection, we recommend:
• Where measured serial dependence in the data is not very strong, simpler pointwise CV methods (like LOO-CV and LFO-CV) can be used as a first pass, and relied upon where the results are clear (see Sivula et al., 2020, for criteria);
• Otherwise, if serial dependence is strong or results are unclear, then joint CV model selection methods should be implemented instead;
• Even when the actual predictive task requires a univariate prediction (like a one-step-ahead prediction), for model selection it may be better to use a CV scheme that leaves out multiple observations, combined with a multivariate scoring rule.
The remainder of the paper proceeds as follows. In Section 2 we describe the model class and summarize CV-based model selection and some relevant literature, highlighting some key challenges associated with CV for dependent data. Section 3 presents a short simulation experiment, and Section 4 presents a detailed case study of CV model selection in a simplified form of ARX model, focusing on the properties of CV under dependence and demonstrating where challenges can arise. Finally, Section 5 discusses the results and concludes. See https://github.com/kuperov/arx for code and experiments.

Background
In this section, we briefly review CV model selection and some relevant literature.
We will suppose we have observed a data vector y = (y 1 , . . ., y T ), presumed to be drawn from a joint distribution p true , the (typically unknown) data-generating process (DGP).
Our goal is to construct predictions by first selecting the best available model M* from some set M of candidates (or candidate model families identified up to a parameter). This selection is made according to candidate models' ability to predict as-yet unseen realizations of the process, that is, by their out-of-sample predictive power.
To simplify our analysis, we will consider only pairwise comparisons (i.e. |M| = 2), but we do explicitly allow that M_true ∉ M, i.e. the model associated with p_true is not in M.

Autoregressions with exogenous regressors
We will write ARX(p, q) for an autoregression with p lags of the dependent variable and q exogenous regressors. Throughout we assume p ≪ T and q ≪ T, and that T is 'small', corresponding to common applied settings where data are limited.
The ARX(p, q) class is a key building block for time series models in a wide range of scientific, policy, and business applications.For instance, autoregressions underpin the popular vector autoregression (VAR) models used by macroeconomic policymakers (e.g., Sims, 1980) and spatial epidemiological studies (Lee, 2011).
The ARX(p, q) model is conditionally normal: for t = 1, . . ., T,

y_t = ϕ_1 y_{t-1} + · · · + ϕ_p y_{t-p} + β⊤ z_t + ε_t,    ε_t ~ iid N(0, σ²),

where the first element of the q × 1 vector of exogenous variables z_t is 1. For simplicity, we will initialize the sequence from zero, so that y_t = 0 for t ≤ 0. In comparison to the linear regression (LR) model studied by Sivula et al. (2020), the dependence structure of the ARX class substantially complicates the analysis. Naturally, one could view the lags of y_t as explanatory variables, which would make the ARX(p, q) model identical to the Gaussian linear regression (LR) analyzed by Sivula et al. (2020). However, to restrict our attention to models in the stationary regime, we must either impose informative priors on the autoregressive parameters (as we do in Section 3) or fix them (as we do in Section 4). In both Sections 3 and 4 we have allowed analytical convenience to guide the choice of prior.
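As a concrete reference, the model just described can be simulated directly. The function below is a minimal sketch of our own (the name `simulate_arx` and the example parameter values are illustrative, not taken from the paper); it implements the conditionally normal recursion with zero initial conditions.

```python
import numpy as np

def simulate_arx(phi, beta, Z, sigma=1.0, seed=None):
    """Simulate y_t = phi_1*y_{t-1} + ... + phi_p*y_{t-p} + beta'z_t + eps_t,
    eps_t ~ N(0, sigma^2), with initial conditions y_t = 0 for t <= 0."""
    phi, beta = np.asarray(phi, float), np.asarray(beta, float)
    rng = np.random.default_rng(seed)
    T, p = Z.shape[0], len(phi)
    y = np.zeros(T + p)                  # leading p zeros are the initial conditions
    for t in range(T):
        # y[t : p + t][::-1] holds (y_{t-1}, ..., y_{t-p}) in lag order
        y[p + t] = phi @ y[t:p + t][::-1] + beta @ Z[t] + sigma * rng.standard_normal()
    return y[p:]

# Illustrative use: T = 200, q = 3 regressors with an intercept column of ones.
T, q = 200, 3
rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(T), rng.standard_normal((T, q - 1))])
y = simulate_arx(phi=[0.5], beta=[1.0, 0.5, -0.5], Z=Z, seed=1)
```

Keeping Z fixed across replications, and redrawing only the noise, mirrors the experimental design described below.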
It is worth emphasizing that all the results we present in this paper depend on Z, the T × q matrix of exogenous covariates. Our results do not require any assumptions about the distribution of Z, since it is assumed known and fixed. In our experiments, we construct Z by drawing a matrix of independent standard normal variates, a matrix we keep fixed across all replicates of each experiment.

Predictive model selection
When the goal of a modeling exercise is prediction, it is natural to use predictive performance as a measure of model goodness or 'utility'. Predictive performance can be assessed using a scoring rule, a function that produces a numerical assessment of a probabilistic prediction against actual observations (Gneiting and Raftery, 2007). Since the choice of scoring rule governs the selection of M*, an ideal choice for a scoring rule would be tailored to the modeling task at hand. However, in the absence of a specific application, general-purpose scoring rules are available.
We focus on the popular logarithmic scoring rule, which enjoys the mathematical properties of being local and strictly proper and is closely related to the KL divergence (Gneiting and Raftery, 2007; Vehtari and Ojanen, 2012; Dawid, 1984). Under the logarithmic score, we call the expected score for some model M_ℓ the expected log joint predictive density (eljpd),

eljpd(M_ℓ | y) = E_{ỹ∼p_true} [ log p(ỹ | y, M_ℓ) ],    (2.3)

where 'joint' refers to the fact that the multivariate predictive p(ỹ | y, M_ℓ) is a joint density. Here, the T × 1 random variable ỹ is independent of the data y.
If p_true were known, or unlimited independent replicates ỹ ∼ p_true were available so that the expectation in (2.3) could be evaluated, the utility-maximizing model M* could be selected by 'external validation' (Gelman et al., 2014) of the model M_ℓ joint predictive p(· | y, M_ℓ) with respect to p_true,

M* = argmax_{M_ℓ ∈ M} eljpd(M_ℓ | y).    (2.4)

In many cases it is computationally convenient to compute (2.3) in a pointwise fashion, which yields the expected log pointwise predictive density (elppd),

elppd(M_ℓ | y) = Σ_{t=1}^{T} E_{ỹ∼p_true} [ log p(ỹ_t | y, M_ℓ) ].    (2.6)

The pointwise predictive p(ỹ_t | y, M_ℓ) that appears in (2.6) is simply the multivariate predictive with all but one element of ỹ marginalized out,

p(ỹ_t | y, M_ℓ) = ∫ p(ỹ | y, M_ℓ) dỹ_{-t}.    (2.7)

Either expected predictive density serves as a utility measure for model M_ℓ ∈ M given observed data y. We will often want to discuss both classes of expected predictive densities in a generic sense, in which case we will use the umbrella term expected log predictive density (elpd).
We adopt (2.4) as our benchmark for the preferred model. From this perspective, the pointwise density (elppd) is useful to the extent that it is a computationally convenient approximation of the joint density (eljpd). It is important to note that while elppd and eljpd are both useful for making comparisons against similarly-constructed measures, they are fundamentally different quantities. See, for instance, Madiman and Tetali (2010) for inequalities between joint and pointwise densities.
When observations are conditionally independent given global model parameters, it is often the case that the elppd and eljpd are close or even identical. However, under serial and other forms of dependence, this is rarely the case because the eljpd captures additional information about serial dependence of the observations not reflected by the pointwise measure.
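The gap between the two measures can be made concrete in a stylized case of our own construction (not the paper's ARX setup): a correctly specified Gaussian predictive with stationary AR(1) correlation, where both expectations have closed forms. The joint expected score exceeds the pointwise one by -0.5*log|Σ|, which is zero under independence and grows with dependence.

```python
import numpy as np

def ar1_cov(T, rho):
    # Stationary AR(1) covariance with unit marginal variance: Sigma_ij = rho^|i-j|
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def expected_scores(Sigma):
    """Expected joint and pointwise log predictive densities when y ~ N(0, Sigma)
    is scored under its own density (a best-case, correctly specified model)."""
    T = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    eljpd = -0.5 * (T * np.log(2 * np.pi) + logdet + T)
    elppd = -0.5 * np.sum(np.log(2 * np.pi * np.diag(Sigma)) + 1)
    return eljpd, elppd

for rho in (0.0, 0.5, 0.9):
    eljpd, elppd = expected_scores(ar1_cov(50, rho))
    print(rho, eljpd - elppd)  # gap is 0 when independent and grows with rho
```

The monotone growth of the gap in rho is the simplest version of the phenomenon driving the experiments below: under strong dependence, the pointwise measure discards increasingly much of what the joint measure records.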
Unfortunately, the expected utility maximization framework described above suffers a crucial drawback: p_true is rarely if ever known in practice, and thus the elpd must be estimated purely from observed data. While one might be tempted to simply substitute p(y | y, M_ℓ) into (2.3) or (2.6), this will lead to a positively biased (over-optimistic) estimate due to model overfit (Vehtari and Ojanen, 2012; Gelman et al., 2014). Instead, we need a method for estimating elppd and eljpd using only the available data.

Cross-validation
CV is a method for estimating the elpd purely from observed data, by data splitting and repeated re-fits of the model. Suppose for a moment that independent replicates of the data ỹ^(s) ∼ p_true, s = 1, . . ., S, were available, and the predictive were able to be evaluated pointwise. Then the utility (2.6) under model M_ℓ could be targeted by the following Monte Carlo estimator,

elppd(M_ℓ | y) ≈ (1/S) Σ_{s=1}^{S} Σ_{t=1}^{T} log p(ỹ_t^(s) | y, M_ℓ).

In applications where such replicates are unavailable, CV estimators exploit the fact that the data y are themselves distributed according to p_true, even if p_true is unknown. CV proceeds by repeatedly splitting the data into disjoint testing and training subsets, estimating the model on the training set, then constructing an estimator using predictions for the testing set. The CV estimator for elppd(M_ℓ | y), which divides y into K test sets, can be defined as

elppd_CV(M_ℓ | y) = (T / Σ_{k=1}^{K} |test_k|) Σ_{k=1}^{K} Σ_{t ∈ test_k} log p(y_t | y_{train_k}, M_ℓ),    (2.9)

where test_k denotes the subset of y to be evaluated under the predictive and train_k denotes the subset of y used to train the model. The scaling factor normalizes the measure to 'sum scale'. The corresponding joint measure is given by

eljpd_CV(M_ℓ | y) = (T / Σ_{k=1}^{K} |test_k|) Σ_{k=1}^{K} log p(y_{test_k} | y_{train_k}, M_ℓ).    (2.10)

We stress that CV schemes with multivariate test sets, like hv-block and K-fold, can be evaluated in either a joint or pointwise fashion. In comparison, univariate schemes like LOO can only be evaluated pointwise.
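A toy illustration of the pointwise and joint CV estimators, using a deliberately simple stand-in for the ARX posterior: i.i.d. observations y_t ~ N(mu, 1) with prior mu ~ N(0, 1). The model, prior, and contiguous folds are our own illustrative choices. Because posterior uncertainty about mu is shared by all test points, the joint predictive over a test block is correlated even though the data are conditionally independent, so the two estimators differ for multivariate test sets.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def kfold_scores(y, K):
    """Pointwise and joint K-fold CV log scores for the toy model
    y_t ~ N(mu, 1), mu ~ N(0, 1), with contiguous folds. The scaling factor
    is 1 here because the test sets partition y exactly once."""
    T = len(y)
    folds = np.array_split(np.arange(T), K)
    lppd, ljpd = 0.0, 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(T), test)
        n = len(train)
        post_mean = y[train].sum() / (n + 1)   # posterior mean of mu
        post_var = 1.0 / (n + 1)               # posterior variance of mu
        # Pointwise predictive: each test point is N(post_mean, 1 + post_var).
        lppd += norm.logpdf(y[test], post_mean, np.sqrt(1 + post_var)).sum()
        # Joint predictive: shared mu uncertainty correlates the test points,
        # giving covariance I + post_var * (ones matrix).
        cov = np.eye(len(test)) + post_var
        ljpd += multivariate_normal(np.full(len(test), post_mean), cov).logpdf(y[test])
    return lppd, ljpd

y = np.random.default_rng(0).standard_normal(60) + 0.3
print(kfold_scores(y, K=10))
```

With singleton test sets (K = T, i.e. LOO) the two scores coincide, which is the sense in which LOO can only be evaluated pointwise.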
The CV scheme blocking design is fully described by the triple (K, {test_k}, {train_k}). Classic LOO, for instance, has K = T, test_k = {k}, and train_k includes all but the kth element.
Model selection using cross-validation selects the model with the greatest estimated utility, or at least the simplest model similar to the best model. For a pairwise comparison between M = {M_A, M_B}, the CV estimate of the utility-maximizing objective is the sign of the difference

elpd_CV(M_A, M_B | y) = elpd_CV(M_A | y) - elpd_CV(M_B | y).    (2.11)

We have omitted from this formulation of the model selection objective the bias correction term that is sometimes included to account for the fact that there are fewer elements in the training set for each CV fold than in the full-data posterior. Typically a first-order correction is used, and it is usually very small (see Gelman et al., 2014).
Under correct model specification, the summands in the CV estimator (2.9) will usually be weakly correlated. Under relatively mild regularity conditions, elppd_CV should converge to the expected utility elppd as T grows large (see Bergmeir et al., 2018, for an analysis of LOO-CV, for instance).

CV schemes for serial dependence
In models of cross-sectional data where all observations can be assumed conditionally independent, the data structure imposes relatively few constraints on the sequence of training and test sets used for CV.
Under serial dependence, care is needed to ensure the contributions to (2.6) are mutually independent, or at least as independent as we can make them.To this end, a number of CV schemes have been developed specifically for models of serially dependent data.
A key consideration when selecting a CV scheme is the nature of the intended prediction task, for instance, whether the model will be used for one-step-ahead or M-step-ahead predictions. Existing analyses of these schemes refer to specific contexts that do not necessarily align with our Bayesian framework. Most use different scoring functions, and all but a few are analyzed with reference to classical models that yield point predictions. For our purpose, what we take from this earlier work is the design of the blocking scheme, i.e. the choices of K, test_k, and train_k. These schemes are summarized below and illustrated in Figure 3.

Burman et al. (1994) present h-block CV, an adaptation of CV for stationary dependent sequences. In order to nearly eliminate the bias arising from dependence between train and test sets in LOO-CV, their procedure deletes a buffer of size h around the training set, while retaining just a single test observation (see Figure 3). This reduces the size of the training set by 2h observations, but still leaves a total of T test sets. They propose that h be a fixed proportion of the data length. Although this is 'conservative' (the authors refer to Györfi et al. (1989), whose results allow consistency if h/T → 0 so long as certain conditions on the underlying dependence structure are met), they argue it is appropriate because in practice the dependence structure is typically unknown. Since each h-block fold has a single test element, it is by definition a pointwise CV framework.

Racine (2000) proposes hv-block CV as an extension of h-block CV, which increases the test set dimension from 1 to 2v + 1. The author argues this provides selection consistency in a wider range of circumstances, including nested models, which may be of interest when model identification is the goal. Since hv-block has a multivariate test set, it can be evaluated jointly.
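A minimal sketch of the hv-block fold construction (the indexing conventions are our own; treatment of the series boundaries varies across implementations):

```python
import numpy as np

def hv_block_folds(T, h, v):
    """Index sets for hv-block CV in the spirit of Racine (2000). Each fold's
    test set is the 2v+1 points centred at t; a halo of h further points on
    each side of the test block is dropped from the training set.
    h-block CV (Burman et al., 1994) is the special case v = 0."""
    folds = []
    for t in range(v, T - v):        # centres where a full test block fits
        test = np.arange(t - v, t + v + 1)
        removed = np.arange(max(0, t - v - h), min(T, t + v + h + 1))
        train = np.setdiff1d(np.arange(T), removed)
        folds.append((train, test))
    return folds

folds = hv_block_folds(T=20, h=2, v=1)
train, test = folds[0]
```

With v = 0 the construction yields T singleton test sets, recovering h-block CV; any v > 0 gives multivariate test sets that admit joint evaluation.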
Another blocking scheme specific to serially dependent data is leave-future-out CV (LFO-CV; Bürkner et al., 2020). LFO-CV trains the model only on past observations, starting with a warmup period 0 < w ≪ T, and leaves future observations unused. To be comparable to the other methods, our implementation of LFO mirrors the structure of hv-block CV (Figure 3). That is, we write LFO(h, v, w) for an LFO scheme that includes a halo h ≥ 0, size parameter v, and initial buffer w. This generalized form of LFO contains the usual formulation, which is LFO(0, 0, w).
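One way to realize the generalized LFO(h, v, w) folds described above. The advance-by-one-test-block stepping below is an assumption on our part; rolling one-step variants are also common.

```python
def lfo_folds(T, h, v, w):
    """Generalised LFO(h, v, w) folds: train only on strictly past points,
    skip a halo of h points before each test block of 2v + 1 points, and
    require at least w warm-up observations.
    Classic LFO is lfo_folds(T, 0, 0, w)."""
    folds = []
    t = w + h + v                        # earliest test-block centre with enough history
    while t + v < T:
        test = list(range(t - v, t + v + 1))
        train = list(range(0, t - v - h))    # the past, minus the halo
        folds.append((train, test))
        t += 2 * v + 1                   # advance by one whole test block (an assumption)
    return folds
```

For h = v = 0 this reduces to the usual scheme: each point from index w onward is tested once, using only the observations before it as training data.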
We have also included K-fold CV. This method is not specific to serial dependence, although it is commonly applied under serial dependence in the literature (Cerqueira et al., 2020; Bergmeir and Benítez, 2012; Bergmeir et al., 2018). We use a variant of K-fold CV that partitions the sample into K contiguous sub-blocks, each roughly of size T/K. Typical values for K are 5 or 10. The predictive may be evaluated jointly or pointwise, depending on the context.
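The contiguous partition itself is nearly a one-liner (a sketch; the exact fold sizes follow numpy's array_split rounding):

```python
import numpy as np

def contiguous_kfold(T, K):
    """Partition indices 0..T-1 into K contiguous blocks of roughly size T/K,
    as in the K-fold variant used here; training set is everything else."""
    blocks = np.array_split(np.arange(T), K)
    return [(np.setdiff1d(np.arange(T), b), b) for b in blocks]

for train, test in contiguous_kfold(T=23, K=5):
    print(test[0], test[-1], len(train))
```

Unlike the hv-block and LFO schemes, no halo is removed, so each training set abuts its test block directly.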

A model selection experiment
In this section, we illustrate the behavior of CV model selection under serial dependence by repeatedly performing a model selection experiment on simulated data. We have chosen this experiment because we believe it illustrates general behaviors of CV for autoregressive models. We use a sequence of experiments in which we control the degree of serial dependence.
Consider the following model selection problem. Let the vector y = (y_1, . . ., y_T) be distributed according to an ARX(2, 3) of the following form,

DGP:    y_t = α (0.75 y_{t-1} + 0.2 y_{t-2}) + β*⊤ z_t + ε_t,    (3.1)

where ε_t ~ iid N(0, 1). The fixed parameter α ∈ [0, 1] allows us to select the degree of serial dependence. When α = 0 the observations are mutually independent, and α = 1 generates a highly persistent series for the model in (3.1). All α ∈ [0, 1] generate stationary series. While the upper bound α = 1 is arbitrary, corresponding to ϕ* = (0.75, 0.2), it is a useful upper limit for our experiments that generates a persistent but nonetheless stationary series. We observe similar results for other choices of the upper bound and of the ratio between the elements of ϕ. For simplicity, we fix initial conditions y_t = 0 for t ≤ 0.
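A sketch of a draw from this DGP, with ϕ* = α(0.75, 0.2) as above; the β* vector in the usage example is a placeholder of our own, since the 'easy' and 'hard' values are discussed below only symbolically.

```python
import numpy as np

def simulate_dgp(alpha, beta_star, Z, T, seed=None):
    """One draw from the ARX(2, 3) DGP with phi* = alpha * (0.75, 0.2) and
    zero initial conditions (y_t = 0 for t <= 0)."""
    rng = np.random.default_rng(seed)
    phi = alpha * np.array([0.75, 0.2])
    beta_star = np.asarray(beta_star, float)
    y = np.zeros(T + 2)
    for t in range(T):
        # y[t:t+2][::-1] holds (y_{t-1}, y_{t-2}) in lag order
        y[t + 2] = phi @ y[t:t + 2][::-1] + beta_star @ Z[t] + rng.standard_normal()
    return y[2:]

T = 2000
rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])  # fixed covariates
beta_demo = np.array([0.2, 0.5, 0.1])        # illustrative placeholder, not the paper's
y_indep = simulate_dgp(0.0, beta_demo, Z, T, seed=1)   # mutually independent
y_persist = simulate_dgp(1.0, beta_demo, Z, T, seed=1) # highly persistent
```

Varying α between these endpoints traces out the left-to-right panels of Figure 2.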
The experiment selects between two candidate models, M_A : ARX(1, 2) and M_B : ARX(1, 1). We choose an analytically-convenient (although non-conjugate) prior in which σ² ~ IG(a_0, b_0) and ϕ ~ BE_(-1,1)(c_0, d_0), where a_0 = b_0 = c_0 = d_0 = 1. IG denotes the inverse-gamma density and BE_(-1,1) the beta distribution scaled to have support on (-1, 1).
This model is 'fully Bayesian' in the sense that we regard all three parameters (β, σ², and ϕ) as unknown, and we allow all of them to be estimated. For computational tractability, we have chosen conditionally conjugate priors for β and σ², and we conduct inference only on stationary ARX(1, q) models. In this special case, ϕ is univariate with support on the interval (-1, 1). We center the β prior on the truth β* to avoid distortions as the effective sample size changes with α.
Both candidate models are 'misspecified' in the sense that neither has the same functional form as (3.1). However, while both models omit the second lag of y_t and the effect β_3, the candidate M_A is nonetheless the better model in the sense that it is closer in KL divergence to the DGP. This is because it includes β_2, which is also omitted by M_B.
We will work with two vectors of true DGP parameters β*, an 'easy' case β*_easy and a 'hard' case β*_hard, distinguished by the relative ease with which CV is able to select the better model in our experiments. These arbitrary parameter values were chosen for convenience in the context of our experiments and simulated covariates. β*_easy is an example of a case where CV has little difficulty separating M_A and M_B under logarithmic loss, while under β*_hard model identification by CV is much more challenging. Naturally, the relative difficulty of β*_easy and β*_hard depends on a range of factors, including the covariates z_it, the noise variance, and the data length. This setup makes no assumptions about the distribution of the matrix Z with elements z_it, other than that it is known. In our simulations, we draw z_it ~ iid N(0, 1), and the draws remain fixed throughout.
Although the posteriors associated with the above candidate models have no closed form, we are able to estimate them relatively cheaply using one-dimensional quadrature or MCMC; see Appendix E for complete computational details. The availability of efficient estimation procedures is important for our experiments because we perform CV by brute force for each simulation draw. That is, we avoid computational shortcuts like importance sampling to ensure our results are not driven by approximation error.
Figure 2 summarizes the results of this model selection experiment. It plots the model selection statistic for 500 independent simulated data sets under the 'hard' model variant, comparing model selection statistics from two variants of 10-fold CV. The vertical axis shows the CV objective evaluated pointwise, that is elppd_CV(M_A, M_B | y), and the horizontal axis shows the objective evaluated jointly, that is eljpd_CV(M_A, M_B | y). Adverse selection for either measure is indicated by negative selection statistics, that is, those that would select M_B. Points lying in the first quadrant, for instance, represent correct selection by both joint and pointwise methods. Some stylized facts are evident in Figure 2. First, when data are mutually independent (the dependence parameter α = 0, left panel) the marginal distribution and relative performance of both methods are approximately equal. However, as α → 1 we see that the variance of the pointwise estimates grows sharply and the location of the distribution shifts in a negative direction, indicating a sharp increase in the adverse selection rate. In contrast, the joint estimates are little changed as α varies. Further, note that many pointwise estimates fall in the fourth quadrant, indicating cases in which the pointwise method would select the wrong model while the joint method would not. An equivalent experiment with the 'easy' parameter variant (not shown) displays a similar increase in the variability of the pointwise model selection statistic, but because the entire distribution lies far enough from the x-axis there is no appreciable increase in the adverse selection rate.
Figure 4 offers one explanation for the rising variance of the pointwise CV statistic. It plots the true elppd and eljpd for the 500 posteriors in Figure 2, i.e. the underlying quantities that the joint and pointwise CV estimators are targeting. The change in distributions evident in both figures is quite similar, suggesting that it is a difference between the underlying quantities, rather than some problem with the CV estimators under serial dependence, that is driving the differences between joint and pointwise 10-fold CV.

Detailed case study
In this section, we focus on the problem of identifying the regression component within the ARX class. We will work with a simplified version of the ARX(p, q), where we regard only the regression parameter β as random, with a Gaussian prior centered on the truth, β ~ N(β*, Σ_0). This simplification allows us to focus our attention narrowly on the task of identifying β. It also makes available analytical expressions for the posterior distribution, as well as the distribution of the theoretical elppd and eljpd, the associated CV estimators, and the model selection statistics. This approach circumvents a key challenge associated with analyzing the variability of CV procedures: in general there are no closed-form expressions for the variance of elpd measures (Bengio and Grandvalet, 2004). However, when ϕ and σ² are fixed, a closed form does exist. The existence of closed-form expressions allows us to derive the finite-sample distribution of the elpd for all of the CV procedures we consider.

Utility measures and model selection statistic
In this simple case, we can obtain exact distributions for the joint eljpd and pointwise elppd for the ARX class with fixed ϕ* and σ²_*, as well as their associated CV estimators. The results in this section extend earlier results for the elppd and LOO-CV derived by Sivula et al. (2020) in the case of i.i.d. Gaussian linear regressions with fixed variance.
For arbitrary ARX models M_A and M_B, we show that the stochastic variables eljpd(M_A | y) and eljpd_CV(M_A | y) (for all of the CV schemes listed in Subsection 2.4), as well as the model selection criteria eljpd(M_A, M_B | y) and eljpd_CV(M_A, M_B | y), all have generalized χ² distributions with parameters that depend on the DGP, the posited model, and the exogenous covariates.
Following the setup in Section 3, suppose that y is distributed as ARX(p*, q*) as described above, and suppose an experimenter posits a candidate ARX(p_ℓ, q_ℓ) model M_ℓ for y. This and other results are proven in Appendix F.
Proposition 1 (Quadratic polynomial form of utility measures). Let y be distributed according to an ARX(p*, q*) process, and let M_ℓ be the simplified ARX(p_ℓ, q_ℓ) model described in Section 2.1. Then the theoretical joint and pointwise measures eljpd(M_ℓ | y) and elppd(M_ℓ | y), respectively defined in (2.3) and (2.6), as well as the corresponding CV estimators eljpd_CV(M_ℓ | y) and elppd_CV(M_ℓ | y), can be expressed as second-degree vector polynomials in y,

y⊤ A_ℓ y + b_ℓ⊤ y + c_ℓ,    (4.1)

for nonrandom coefficients A_ℓ (a T × T matrix), b_ℓ (a T-vector), and scalar c_ℓ. The coefficients are functions of ϕ^(ℓ), σ²_ℓ, Z_ℓ, and the CV blocking scheme parameters. All are defined in Appendix D.
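The quadratic form in Proposition 1 is what yields the generalized χ² law: for Gaussian y, a quadratic polynomial in y is a weighted sum of noncentral χ² variables plus a constant. A quick Monte Carlo sanity check of one consequence, the mean identity E[y⊤Ay + b⊤y + c] = tr(AΣ) + μ⊤Aμ + b⊤μ + c, using arbitrary illustrative coefficients rather than the A_ℓ, b_ℓ, c_ℓ of Appendix D:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
# Arbitrary illustrative coefficients and Gaussian law for y -- not the paper's.
A = rng.standard_normal((T, T)); A = (A + A.T) / 2   # symmetrize
b, c = rng.standard_normal(T), 1.5
mu = rng.standard_normal(T)
L = rng.standard_normal((T, T)); Sigma = L @ L.T + T * np.eye(T)

# Monte Carlo over realisations of y ~ N(mu, Sigma)
ys = rng.multivariate_normal(mu, Sigma, size=200_000)
vals = np.einsum('ij,jk,ik->i', ys, A, ys) + ys @ b + c  # y'Ay + b'y + c per draw

analytic_mean = np.trace(A @ Sigma) + mu @ A @ mu + b @ mu + c
print(vals.mean(), analytic_mean)
```

The same eigendecomposition argument behind this identity gives the weights and noncentralities of the generalized χ² distributions referenced above.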
A reviewer pointed out the similarity between the quadratic polynomial form of (4.1) and that of the likelihood ratio test for non-nested models by ?. Although the latter's results are frequentist and asymptotic in character, the similarity is nonetheless interesting considering how similar the intents of these test statistics are.

'Oracle' plug-in values for fixed parameters
Our candidate models require appropriate choices for the noise variance parameter σ 2 and autoregressive parameter ϕ. Especially when the model is misspecified, it would not necessarily be optimal to use the true DGP value σ 2 ℓ = σ 2 * , in the sense that this choice would not produce the best possible predictions, with respect to log score, for the chosen model class.
Inference requires suitable choices for σ 2 ℓ and ϕ (ℓ) that would correspond as closely as possible to the behavior of an inference procedure where ϕ and σ 2 are unknown.
Suppose some hypothetical Oracle happens to know the true DGP and offers to select the best-performing autoregressive and variance parameters ϕ (ℓ) and σ 2 ℓ for our particular model and covariates. Naturally, this choice will be independent of any specific realization of y, since we are interested in utility distributions across all potential values of y. Consider two approaches our Oracle might use for selecting these parameters. She might choose to minimize the distance (in KL divergence) between the DGP and the model M ℓ predictive. Alternatively, she could directly target the objective function by maximizing the achievable expected log score. Under the logarithmic loss function, it follows from Dawid (1984) that these two options represent equivalent calculations: maximizing the expected log score and minimizing the expected KL divergence between the DGP and model predictive yield the same answer. In the optimization (4.3), Φ ℓ ⊂ R p ℓ denotes the parameter space for ϕ associated with stationary ARX(p ℓ , •) models. In our experiments, we solve the optimization (4.3) using the Nelder-Mead algorithm.
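The Oracle calculation can be sketched numerically. The toy below (not the paper's optimization (4.3); a simplified stand-in with a zero-mean AR(1) covariance in place of the full ARX predictive) minimizes the Gaussian KL divergence over (ϕ, σ 2 ) with Nelder-Mead; because the model class here contains the truth, the optimizer recovers the DGP parameters.

```python
import numpy as np
from scipy.optimize import minimize

T = 30

def ar1_cov(phi, s2, T):
    """Stationary AR(1) covariance: s2 / (1 - phi^2) * phi^|i-j|."""
    idx = np.arange(T)
    return s2 / (1.0 - phi**2) * phi ** np.abs(idx[:, None] - idx[None, :])

Sigma_true = ar1_cov(0.8, 1.0, T)       # "true DGP" covariance

def kl_gauss(S_true, S_model):
    """KL( N(0, S_true) || N(0, S_model) ) for zero-mean Gaussians."""
    Sm_inv = np.linalg.inv(S_model)
    _, ld_m = np.linalg.slogdet(S_model)
    _, ld_t = np.linalg.slogdet(S_true)
    return 0.5 * (np.trace(Sm_inv @ S_true) - T + ld_m - ld_t)

def objective(theta):
    phi, log_s2 = theta                  # log-parameterize s2 to keep it positive
    if abs(phi) >= 0.999:                # crude stationarity penalty
        return 1e10
    return kl_gauss(Sigma_true, ar1_cov(phi, np.exp(log_s2), T))

res = minimize(objective, x0=[0.3, 0.0], method="Nelder-Mead")
phi_hat, s2_hat = res.x[0], np.exp(res.x[1])
```

Under misspecification (a model class that excludes the truth) the same minimization yields the KL-optimal plug-in values rather than the DGP values, which is exactly the Oracle's role in the text.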

Distribution of the model selection statistic
The purpose of conducting CV on the candidate models is to determine which candidate has the greatest predictive performance, in our case under log score. The CV model selection statistic (2.11) is the difference between the estimated scores of the two models.
The form of the model selection statistic used in this experiment is characterized by the following corollary, implied by Proposition 1.
Corollary 2 (Form of model selection objectives and CV estimates). Let y be distributed according to an ARX(p * , q * ) process, and let M A and M B be simplified ARX(p A , q A ) and ARX(p B , q B ) models, respectively. Then the theoretical model selection statistics eljpd (M A , M B | y) and elppd (M A , M B | y), and their corresponding CV estimators eljpd CV (M A , M B | y) and elppd CV (M A , M B | y), can be expressed as second-degree polynomials in y.
Proposition 1 and Corollary 2 imply that all of the quantities of interest follow generalized χ 2 distributions (see Definition 6 in Appendix D.4). Proposition 3 describes the mean and variance, and further states the parameters of this distribution. The associated distribution function F ω(y) (w) = Pr (ω (y) < w) must be approximated numerically; this can be done by simulation or by the method of Davies (1973).
Proposition 3 (Distribution of ω (y)). Let ω (y) = y ⊤ A y + b ⊤ y + c be a quadratic polynomial in y with coefficients A, b, and c as described in Proposition 1 and Corollary 2, where A = A ⊤ , and let y ∼ N (m * , Σ * ) for a fixed T -vector m * and fixed positive-definite T × T matrix Σ * . Then ω (y) has mean and variance

E [ω (y)] = tr (A Σ * ) + m * ⊤ A m * + b ⊤ m * + c,
Var [ω (y)] = 2 tr ((A Σ * ) 2 ) + (2 A m * + b) ⊤ Σ * (2 A m * + b),

and follows a generalized χ 2 distribution with weights λ, where λ is the vector of k ≤ T nonzero eigenvalues in the eigendecomposition of Σ * 1/2 A Σ * 1/2 .

Several interesting features of the distribution of ω (y) are evident in Figure 5, which summarizes the results for the 'hard' (β * = β hard * ) case under increasing dependence α. These features are consistent with the findings of Section 3. First, notice the striking difference between the behavior of the pointwise and joint theoretical elpds as α increases (column (a)). When data are mutually independent (α = 0), the two distributions are essentially equal. For the joint measure, variability and location are little changed as dependence increases. In contrast, the pointwise objective exhibits sharply rising variability and shifts in a negative direction as dependence gains strength. Most of the distribution changes sign entirely to favor the simpler, worse candidate M B as α approaches 1, indicating an adverse selection rate near 100%. Second, notice the different profiles exhibited by pointwise and joint CV methods (columns (b) and (c), respectively). While there are clearly differences between the specific joint and pointwise methods, whether the method is computed jointly or pointwise clearly has the greatest bearing on its behavior as dependence increases. The pointwise methods in column (b) exhibit a similar increase in variability and negative shift as the pointwise objective, while the joint methods are little changed as α → 1. In addition, the pointwise CV methods (and, to a lesser extent, the joint methods) display an interesting decrease in variability as α approaches 1. A possible explanation for this drop-off is the decrease in effective sample size (Berger et al., 2014) for autoregressive models as dependence grows, all else being equal.
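The mean and variance in Proposition 3 are the standard moments of a Gaussian quadratic form, and can be checked by Monte Carlo. The sketch below uses arbitrary stand-in coefficients (not the paper's A ℓ , b ℓ , c ℓ ) and compares the closed-form moments with simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4

# Hypothetical symmetric A, vector b, scalar c, and Gaussian (m, Sigma).
A = rng.standard_normal((T, T)); A = 0.5 * (A + A.T)
b = rng.standard_normal(T); c = 0.7
m = rng.standard_normal(T)
L = rng.standard_normal((T, T)); Sigma = L @ L.T + T * np.eye(T)

# Closed-form moments of omega(y) = y'Ay + b'y + c for y ~ N(m, Sigma).
mean_exact = np.trace(A @ Sigma) + m @ A @ m + b @ m + c
l = 2 * A @ m + b
var_exact = 2 * np.trace((A @ Sigma) @ (A @ Sigma)) + l @ Sigma @ l

# Monte Carlo check.
y = rng.multivariate_normal(m, Sigma, size=200_000)
w = np.einsum("ij,jk,ik->i", y, A, y) + y @ b + c
```

The simulated draws `w` also provide a direct route to the CDF F ω(y) (w) when the Davies (1973) method is not at hand.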
Under stronger dependence, increased variability and downward movement together reduce model selection power, consistent with the known bias of CV procedures toward simpler models under small sample sizes. See, for instance, Burman (1989) for an analysis of CV bias and sample size. In contrast, for the 'easy' experiment variant (β * = β easy * ), where the results are much clearer, cross-validation generates correct model selections for all but the most strongly dependent data (see Figure 8 in Appendix A).

The cost of an inefficient CV scheme
The distribution of the model selection statistic in Proposition 3 provides a direct method for computing the probability of adverse selection. Noting that a positive selection statistic indicates the correct model choice (corresponding to M A ), it follows that

Pr (adverse selection) = Pr (ω (y) < 0) = F ω(y) (0). (4.6)

The quantity F ω(y) (0) can be interpreted visually in Figure 5 as the share of the ω (y) distribution that falls below the x-axis. Where the probability of adverse selection is very small, we say that the models are well separated under that particular CV scheme. While the infinite support of ω (y) means that (4.6) can never be zero, for practical purposes we define a small threshold γ as the cutoff for well-separatedness. For the remainder of this paper, we will use γ = 0.01.
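A simulation-based estimate of F ω(y) (0) and the well-separatedness check can be sketched as follows. The weights, offsets, and shift below are hypothetical placeholders for a generalized-χ 2 selection statistic, not parameters taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
GAMMA = 0.01   # well-separatedness cutoff used in the paper

# Hypothetical generalized-chi^2 selection statistic:
# omega = sum_i lam_i * (x_i + delta_i)^2 + shift, with x_i ~ N(0, 1).
lam = np.array([1.5, 0.8, -0.3])
delta = np.array([1.0, -0.5, 2.0])
shift = 1.0    # positive location: the better model M_A usually wins

x = rng.standard_normal((500_000, 3))
omega = ((x + delta) ** 2 * lam).sum(axis=1) + shift

adverse_rate = float(np.mean(omega < 0.0))   # Monte Carlo estimate of F_omega(0)
well_separated = adverse_rate < GAMMA
```

The same recipe applies to any of the CV estimators in this section once the corresponding coefficients are in hand.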
A particular CV scheme may be well-separated in one situation and poorly separated in another.Whether a model selection procedure is well-separated is determined by all aspects of that procedure, including the details of the data generating process, candidate models, any hyperparameters for the procedure, and the values of the exogenous covariates.
We have seen that an inappropriate choice of CV scheme can result in an elevated probability of adverse selection.In this section, we attempt to quantify this cost.
Perhaps the simplest way to measure the cost of any model selection procedure is the rate of adverse model selection, also known as the '0-1 loss' because it scores all errors equally.The adverse selection rate is the probability (with respect to repeated samples) that the selection procedure will select the wrong model.
Framing the loss as a probability over all realizations of y makes good sense for this paper, since our focus is on the properties of CV methods for ARX models in general, without reference to a particular data realization y. Panels (a) and (b) of Figure 6 compare the adverse selection rates of joint and pointwise CV methods for the 'hard' variant of our model selection experiment (losses for the 'easy' variant are negligible, and are not shown). The adverse selection rate picks up as α → 1 and, for pointwise procedures, reaches almost 100 per cent, indicating that CV incorrectly prefers the simpler model when dependence is very strong.
The adverse selection rate overlooks a key fact, however: it does not account for the severity of the prediction error, scoring all incorrect selections equally. When there is very little difference between candidates' model predictions, it matters little which model is chosen. On the other hand, when predictions differ significantly, this should be reflected in the cost of the error.

Figure 6: Adverse selection rate (0-1 loss) for pointwise and joint CV methods as data dependence increases. This experiment compares M A : ARX(1, 2) vs M B : ARX(1, 1), under an ARX(2, 3) DGP with ϕ * = α (0.75, 0.2), for α ∈ [0, 1] as described in Section 4. Greater values of α denote stronger serial dependence. The 'hard' case (β hard ) with T = 100 is shown.
An alternative measure of the cost of adverse selection is the reduction in log utility that results from choosing the incorrect model for a given y ∼ p true . Under this formulation, we use as the benchmark the underlying finite-sample theoretical elpd for each y that CV is trying to estimate. This is by its nature a finite-sample loss function.
That is, in the case where CV does not select the elpd-maximizing model, we can regard the CV utility cost, relative to the counterfactual where the correct model was chosen, as the difference between the chosen and maximal elpd. Figure 11 in Appendix A plots this log loss. While excess log loss is an attractive concept, the resulting relative measures are practically indistinguishable from the adverse selection rate. For the remainder of this paper, we will use the adverse selection rate.

Joint and pointwise objectives
Our results suggest that under dependence, joint estimators usually have lower variability, and consequently a lower rate of adverse selection, than their pointwise counterparts. In our experiments, these differences are typically negligible for independent observations (α = 0) and are most pronounced as α approaches 1 and the underlying data become highly persistent. This comparison is quite evident in Figure 6: when α = 0, joint and pointwise estimators perform roughly equally. As α → 1, there is little change in performance for joint estimators, compared with a dramatic increase in the error rate for the pointwise estimators.
To construct an apples-to-apples comparison between joint and pointwise CV methods, abstracting from other details of CV scheme design, we construct pointwise analogs of joint designs. For the present experiment, this amounts to diagonalizing the covariance matrix of the Gaussian predictive distribution, setting σ 2 ℓ V pw ℓ = σ 2 ℓ (I T ⊙ V ℓ ), for ⊙ the elementwise product operator. The results confirm a sharp increase in the error rate for pointwise CV methods under strong dependence (α = 1), compared with almost no difference in the independent case (α = 0). (See Figure 9 in Appendix A.)
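The joint-versus-pointwise distinction can be made concrete for a Gaussian predictive. The sketch below (with a hypothetical AR(1)-style predictive covariance, not the paper's V ℓ ) scores a draw once under the full joint density and once under the diagonalized analog I T ⊙ V.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(3)
T = 6

# Hypothetical correlated Gaussian predictive N(m, V).
m = np.zeros(T)
phi = 0.9
idx = np.arange(T)
V = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi**2)

y = rng.multivariate_normal(m, V)

# Joint log score: one evaluation of the multivariate density.
joint_score = multivariate_normal(m, V).logpdf(y)

# Pointwise analog: diagonalize the predictive covariance (V_pw = I_T * diag V)
# and sum the marginal log densities.
pointwise_score = norm(m, np.sqrt(np.diag(V))).logpdf(y).sum()
```

With any off-diagonal correlation the two scores differ; they coincide only when V is diagonal, which is the α = 0 case in the experiments.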

Specific CV scheme design considerations
We have seen that CV design parameters can have a significant bearing on the overall efficiency, and therefore performance, of CV model selection. In this section we look more closely at hyperparameter choices for specific CV schemes for time-series data.
Note that hv-block CV schemes require the choice of a validation block size v (the total validation set dimension is 2v + 1) and halo size h. Consistent with earlier results, for dependent data we find that by far the most important choice is to ensure that v is large enough to capture the dynamic behavior of the model. Either parameter can harm CV performance if it is too large, since both reduce the training set size, which imposes a cost in statistical efficiency; the result is a U-shaped relationship between the adverse selection rate and both h and v when dependence is present. (Figure 13 in Appendix A compares error rates under various choices of these parameters.) The importance of preserving the size of the training set is evident in the underperformance of LFO, which represents an extreme approach to dealing with contamination: discarding the entire future sample. Several authors have pointed out that this is unnecessarily conservative (Bergmeir et al., 2018). Moreover, the analyst should carefully weigh the tradeoff between the bias resulting from the use of future data and the benefit of increasing the training set size. See Figure 12 in Appendix A for a comparison between LFO and methods that use future data. In each case, the adverse selection rate for LFO is substantially higher, a consequence of LFO's reduced training set size.
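The hv-block fold construction can be sketched as follows. This is an illustrative implementation assuming one common convention (a validation block of size 2v + 1 centered on each t, with a halo of h further points removed on each side); the paper's exact indexing conventions may differ at the series boundaries.

```python
def hv_block_folds(T, v, h):
    """Train/test index sets for hv-block CV (sketch).

    For each centre t: test = {t-v, ..., t+v} (clipped to [0, T)),
    and the training set drops the test block plus a halo of h
    points on each side."""
    folds = []
    for t in range(T):
        test = [s for s in range(t - v, t + v + 1) if 0 <= s < T]
        drop = set(range(t - v - h, t + v + h + 1))
        train = [s for s in range(T) if s not in drop]
        folds.append((train, test))
    return folds

folds = hv_block_folds(T=10, v=1, h=1)
train5, test5 = folds[5]
# centre t=5: test block {4, 5, 6}; halo additionally removes {3, 7}
```

Increasing either v or h visibly shrinks each training set, which is the efficiency cost discussed above.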

Required sample size
One important practical consequence of an inefficient CV scheme design is that a larger sample size is needed for the models under consideration to be well-separated. This either requires the experimenter to collect more data than would otherwise be necessary, or forces the adverse selection rate to be higher than it otherwise would be. In this section we demonstrate that the required sample size increases with dependence, all else being equal. When the underlying process is highly persistent, the required sample size can be significantly larger than in the independent case.
The required sample size for a model to be well-separated goes beyond the well-known principle of the effective sample size (ESS) for a time series model (see e.g. Berger et al., 2014). In this section we demonstrate that properties of the CV procedure, especially whether the scoring rule is evaluated jointly or pointwise, strongly determine the data length required.
Figure 7 compares the minimum sample size required for several joint and pointwise CV methods. Consistent with earlier results, there is little difference in required sample size between pointwise and joint methods in the independent case (α = 0). Under stronger dependence, however, the greater variability of pointwise methods leads to a requirement for larger sample sizes for the two candidate models to be well-separated, all else being equal. These results underscore the importance of using efficient CV designs, especially joint scoring rules, when strong dependence is present. See also Figure 10 in Appendix A.
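Operationally, the required sample size is the smallest T at which the adverse selection rate falls below γ. A simple search can be sketched as below; the decaying error-rate function is a purely illustrative stand-in for a Monte Carlo estimate of F ω(y) (0) at each data length.

```python
import math

def required_sample_size(adverse_rate, gamma=0.01, T_min=10, T_max=2000):
    """Smallest T with adverse_rate(T) < gamma (linear scan sketch).

    `adverse_rate` is a user-supplied function of T, e.g. a Monte Carlo
    estimate of Pr(omega(y) < 0) for data length T."""
    for T in range(T_min, T_max + 1):
        if adverse_rate(T) < gamma:
            return T
    return None   # not well-separated within the search range

# Toy stand-in: an error rate decaying exponentially with T.
toy_rate = lambda T: 0.5 * math.exp(-T / 50.0)
T_req = required_sample_size(toy_rate)
```

In practice each evaluation of `adverse_rate` is itself noisy, so a monotone fit or a generous simulation budget per T is advisable before scanning.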
As we might expect, the benefit of using more efficient CV methods is greatest when candidate models are more challenging to separate. In Figure 7, the required sample size is greatest for the 'hard' variant, especially under a high degree of dependence. In comparison, under the 'easy' variant the differences between joint and pointwise methods are smaller, although there is nonetheless a pickup in the sample size required for pointwise CV methods.
Figure 7 also underscores the benefit of using as much of the available sample as possible, rather than discarding future observations as in LFO methods. As dependence increases, the additional statistical efficiency gained by allowing the model to learn from future data yields well-separated models at shorter overall data lengths.

Discussion and conclusion
We have demonstrated that in settings where serial dependence is present, appropriate CV procedure design can dramatically improve model selection performance. Working with the logarithmic scoring rule and the ARX(p, q) class of autoregressive models, we have shown that evaluating the score pointwise can yield highly inefficient CV estimators that perform poorly when compared with procedures that target joint densities. Our experiments show that pointwise CV estimators exhibit greater variability and require larger sample sizes than joint designs.
We are not the first to compare the performance of joint and pointwise densities in predictive model assessment. Osband et al. (2022), for instance, apply model assessment with joint densities in the context of neural networks.
Our results show that the consequences of using an inefficient CV procedure can be particularly pernicious under strong serial dependence. One consequence is that CV procedures become biased toward overly simple models. In extreme cases, where serial dependence is greatest, this can result in different CV procedures assigning completely different orderings to candidate models. Viewed another way, the consequence of using inappropriate CV procedures is the need for larger sample sizes to achieve good separation between models.
Relatively conservative methods like LFO can be a good choice in applied contexts, especially when CV is able to clearly separate models even without using the full sample to reduce estimator variability. Furthermore, some authors have advocated for the use of LFO when the dependence structure is unknown (e.g., Bürkner et al., 2020). While LFO certainly eliminates the possibility that contamination will bias results, the results of Section 4.6 suggest that the bias arising from the use of future data does need to be carefully traded off against the benefit of retaining a larger training set. In general, it seems unlikely that the optimal CV design would be the corner solution that excludes all future observations. Risks associated with contamination can also be avoided by the use of specification tests on candidate models before conducting model comparison, including standard Bayesian model criticism procedures and testing for autocorrelation in the residuals (Bergmeir et al., 2018). Model assessment is especially important when the underlying process is highly persistent: not only is the need for larger effective sample sizes greatest for persistent processes, but the adverse impact of contamination is also most pernicious.
Our goal throughout this paper has been model selection.We are not claiming that exploiting future observations in CV schemes yields nearly unbiased estimators of the elpd.Instead, here we are targeting relative measures that are efficient model selection objectives.
This analysis has focused on serial dependence, which most often appears in time series models. However, we expect that similar results would apply to other forms of dependence, such as spatial and spatio-temporal data, especially where the autoregressive signal is relatively strong compared with the global conditional mean. Naturally, dependence structures in more than one dimension present additional analytical challenges, so further research is warranted for these and other dependence structures. Furthermore, many of our conclusions are not specific to the logarithmic score and would also apply under other scoring rules (Gneiting and Raftery, 2007).
From a practical standpoint, it should be noted that implementing joint CV methods tends to be computationally costly compared with pointwise procedures like LOO. In situations where the difference between two models is absolutely clear, as for the 'easy' variant of our examples under weak serial dependence, pointwise CV estimators may be adequate for performing model selection and are far more convenient to construct. This is particularly relevant considering the availability of efficient computational shortcuts for computing pointwise CV procedures, such as PSIS-LOO (Vehtari et al., 2017; Bürkner et al., 2020), and the lack of similar shortcuts for joint procedures. Although in principle PSIS-LOO can also approximate joint CV procedures, in practice the thick tails of the weight distributions tend to cause importance sampling to fail.
With the complexity of implementing joint procedures in mind, we recommend the following workflow for model selection under serial dependence. Begin with a thorough model criticism of each of the candidate models, and iterate model specification until the candidate models are well-specified. Where inference results show that serial dependence is relatively weak, pointwise CV methods such as PSIS-LOO can be used as a first pass for model selection. If the pointwise CV results show a clear preference for one candidate, then that candidate can be selected. Otherwise, a joint CV procedure should be implemented and relied upon instead.
The present paper represents a first look at the uncertainty of CV-based model selection under serial dependence, but there is considerable work remaining. We have focused on identification of the regression parameter, leaving to one side the tasks of identifying the autoregressive component, theoretical analysis of these results, choosing suitable priors for model identification procedures, and constructing efficient computational methods for CV under serial dependence. We leave these aspects for future work.

Figure 12: Adverse selection rates for LFO and comparable CV schemes that use future information, by dependence α. Notice that in each case, the LFO error rate (blue) exceeds that of the other CV scheme (orange). This model selection experiment compares M A : ARX(1, 2) vs M B : ARX(1, 1) under an ARX(2, 3) DGP, as described in Section 4. We use the β hard parameter setting, autoregressive parameter ϕ * = α (0.75, 0.2), and data length T = 100.

These are misspecified settings; that is, neither M A nor M B has the same functional form as the DGP, but one is clearly a better fit than the other.
This appendix presents the results of additional experiments using other models from the ARX class. The goal of these supplementary results is to demonstrate that the main experiment is representative of typical cases, and to highlight specific cases where behaviors can differ. For consistency and simplicity we retain the same values for the parameter vectors β easy * and β hard * . We choose LOO as an example of a pointwise CV method and hv-block(3, 3) as an example of a joint CV method. In each case the focus is on identifying the regression parameter, so each candidate model has the same autoregressive structure.
Let the stationary vector y = (y 1 , . . . , y T ) be distributed according to an ARX(p * , q * ) of the form

DGP: y t = ϕ *1 y t−1 + · · · + ϕ *p * y t−p * + β * ⊤ z t + ε t , ε t iid ∼ N (0, 1).

The experiment selects between two candidate models, M A and M B . Table 1 lists the additional experiments. In each case, the DGP is parameterized by α ∈ [0, 1], which controls the degree of serial dependence. Each experiment is run twice, once each with β easy * and β hard * . All DGPs are stationary. The random noise component is distributed as ε t iid ∼ N (0, 1), the fixed per-period covariates z 1t and z 2t are drawn independently from the standard normal distribution, and priors are β i ∼ N (β * i , 1) for i = 1, 2, 3. Initial values are y t = 0 for t ≤ 0. For all examples in this section, the data length is T = 100. As in Section 4, we solve (4.3) to find optimal noise variance and autoregressive parameter values.
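The DGP above can be simulated directly. The sketch below follows the stated setup (standard normal noise and covariates, zero initial values, ϕ * = α (0.75, 0.2)); the β * values are hypothetical stand-ins, since the paper's β easy * and β hard * are not reproduced here.

```python
import numpy as np

def simulate_arx(T, phi, beta, rng):
    """Simulate an ARX(p, q) path (sketch):
    y_t = sum_i phi_i y_{t-i} + beta^T z_t + eps_t, eps_t ~ N(0, 1),
    with z_t = (1, z_1t, z_2t) and initial values y_t = 0 for t <= 0."""
    p = len(phi)
    z = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
    eps = rng.standard_normal(T)
    y = np.zeros(T + p)            # leading p zeros are the initial values
    for t in range(T):
        ar = sum(phi[i] * y[p + t - 1 - i] for i in range(p))
        y[p + t] = ar + beta @ z[t] + eps[t]
    return y[p:], z

alpha = 0.5
phi_star = alpha * np.array([0.75, 0.2])    # phi* = alpha (0.75, 0.2)
beta_star = np.array([1.0, 0.5, -0.5])      # hypothetical beta*, for illustration only
y, z = simulate_arx(100, phi_star, beta_star, np.random.default_rng(4))
```

Sweeping `alpha` over [0, 1] reproduces the dependence grid used in the experiments.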
The results of the additional experiments are broadly similar to Experiment 1: joint and pointwise methods perform similarly under no dependence, and under stronger dependence both the variability and error rates of the pointwise estimators are higher than those of the joint estimators. Experiment 4, for which the candidate models have no autoregressive component at all, is an exception: both methods perform terribly under strong dependence. It should perhaps not be surprising that such badly misspecified models perform so poorly, which underscores the need for thorough model criticism of all candidates before conducting CV.
where the T × T covariance matrix W ϕ is positive definite with |W ϕ | = 1. The T × T banded lower-triangular coefficient matrix L ϕ has entries

(L ϕ ) t,t = 1, (L ϕ ) t,t−i = −ϕ i for 1 ≤ i ≤ min(p, t − 1), and zero elsewhere.

Proof. First note that L ϕ is lower-triangular with 1s on the diagonal, so |L ϕ | = 1 and the inverse L ϕ −1 exists. The T × 1 vector η with tth entry η t = y t − Σ p i=1 ϕ i y t−i 1{t − i ≥ 1} − z t ⊤ β can equivalently be written in vector form η = L ϕ y − Zβ, where Z is the matrix with rows z t ⊤ and 1{•} is the indicator function. Then the joint density of y follows by a change of variables from η to y, as required. To see that W ϕ is positive definite, recognize that W ϕ −1 = L ϕ ⊤ L ϕ , a Gram matrix of the nonsingular L ϕ . Note that the matrix L ϕ is not the Cholesky factor of V ϕ −1 . Recall that the lower Cholesky factor A of V ϕ −1 is the unique lower-triangular solution to A A ⊤ = V ϕ −1 ; in contrast, we have W ϕ −1 = L ϕ ⊤ L ϕ , with L ϕ lower-triangular.

D.1 Posterior and predictive distributions
Lemma 5 and (2.1) imply that the model for y given ϕ, β, and σ 2 can be written as, where y = (y 1 , y 2 , . . . , y T ) and y train = Sy by definition. The T × q matrix of exogenous regressors Z includes a column of ones, and the T × T lower-triangular coefficient matrix L ϕ is as defined in (C.4).
We require the posterior density conditional on fixed values for ϕ and σ 2 , which we define in Section 4.2, below. The posterior density can be derived analytically in the usual way for models with conjugate priors, as proportional to the joint density. In the above, we have completed the square. In the case where the full training set is used, so that S = I T , the expressions simplify further. The resulting model M ℓ posterior density for β (conditional on σ 2 and ϕ) is multivariate normal (D.8). The posterior predictive density for new data ỹ test = S̃ ỹ can be derived by integrating out the parameter β from the joint distribution of new data and β given y train .

It is worth emphasizing that in our framework the DGP (3.1) and model M ℓ can represent different underlying models. In other words, we allow M ℓ to be misspecified. This means that the theoretical and cross-validatory utility measures will be functions of both the DGP parameters (with * subscripts) and the model M ℓ parameters (with ℓ subscripts).

D.2 Theoretical expected log predictive density
The theoretical log utility under external validation (2.3) is simply the expectation of the log predictive density under the true DGP. In the simplified version of the model, both the DGP and the predictives are multivariate Gaussian, and an exact solution is available analytically as a consequence of Lemma 8 (Appendix F). Sivula et al. (2020) found that the elppd for linear regressions with flat priors can be expressed as a second-degree polynomial in y. We find that this is also true of both the eljpd and elppd for our simplified ARX model with conjugate priors.

Expected log joint predictive density (eljpd)
We will derive the eljpd in a form that includes selection matrices for both testing and training data. This form is general enough that it can be used to analyze full-data elpds as well as to construct cross-validation folds with subsets of the data. Let the selection matrices S and S̃ respectively define the training and testing subsets of y and ỹ. Then, dropping the conditioning on σ 2 and ϕ from the notation for brevity, the joint theoretical utility for model M ℓ given y follows from (E.1), (D.11), and Lemma 8. Now note that m ℓ is linear in y, where we have used (D.2) to define (D.16). Observe that in the case where the full training set is used (S = I T ), the expressions simplify. Finally, rearranging (D.14) and substituting (D.15) yields the second-degree polynomial in y given in (D.18). The T × T matrix A ℓ , T × 1 vector b ℓ , and scalar c ℓ are nonlinear functions of ϕ * and ϕ ℓ (respectively via V * and V ℓ ) but are free of y and hence nonrandom. It can be seen from the form of (D.14) and the positive definiteness of S̃ ⊤ (σ 2 ℓ S̃ V ℓ S̃ ⊤ ) −1 S̃ that, for a given draw of y, the log utility measure eljpd (M ℓ | y) is larger the closer the posterior predictive mean m ℓ is to the DGP mean m * , thus assigning a greater score to better predictions.
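The Lemma-8-style calculation behind the eljpd is the expectation of one Gaussian log density under another, which has a well-known closed form. The sketch below states that identity for generic means and covariances (stand-ins for the predictive and DGP moments, not the paper's m ℓ , V ℓ ) and verifies it by Monte Carlo.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
k = 3

# Hypothetical DGP N(m_star, S_star) and predictive N(m, S).
m_star = np.array([0.5, -1.0, 0.2]); S_star = np.diag([1.0, 2.0, 0.5])
m = np.zeros(k); S = 1.5 * np.eye(k)

# E_{y ~ N(m*, S*)}[ log N(y; m, S) ]
#   = -0.5 * ( k log 2*pi + log|S| + tr(S^{-1} S*) + (m*-m)' S^{-1} (m*-m) )
Si = np.linalg.inv(S)
_, logdet = np.linalg.slogdet(S)
d = m_star - m
exact = -0.5 * (k * np.log(2 * np.pi) + logdet
                + np.trace(Si @ S_star) + d @ Si @ d)

# Monte Carlo check.
ys = rng.multivariate_normal(m_star, S_star, size=200_000)
mc = multivariate_normal(m, S).logpdf(ys).mean()
```

Substituting the appropriate selected sub-blocks of the predictive and DGP moments for each fold gives the corresponding eljpd terms.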

Expected log pointwise predictive density (elppd)
The elppd can be derived from the eljpd. We use the fact that the marginal distribution of a multivariate normal is univariate normal, where m ℓ,t denotes the tth element of m ℓ and V ℓ,t,t the tth diagonal element of V ℓ . Recognizing that, for ⊙ the elementwise product operator, the diagonal of V ℓ can be written I T ⊙ V ℓ , we can rewrite the elppd in a form similar to (D.18).

D.3 Cross-validation estimators
This section derives CV estimators for the eljpd and elppd. We will keep the notation generic and construct expressions that apply to all the CV schemes described in Section 2.4; the results can be adapted for LOO, LFO, K-fold, h-block, and hv-block simply by substituting method-specific formulations of train k and test k via the selection matrices S k and S̃ k for each fold k. The joint and pointwise CV estimators for a simplified ARX model can be constructed by substituting (D.11) into (2.9) and (2.10), putting ỹ test = y test . We can write the difference that appears in (D.29) in terms of (D.16) and e ℓ (D.17). Rearranging (D.29) then yields a second-degree polynomial in y. Recall that in the context of these expressions, the various CV blocking patterns (LOO, LFO, h-block, hv-block, and K-fold) differ only in the definitions of S k and S̃ k .

Pointwise evaluation of CV estimators
Pointwise CV estimators are special cases of joint estimators. They can be constructed in a manner similar to the pointwise elppd.

D.4 Distribution function for quadratic polynomials
Proposition 3 shows that the quadratic polynomials ω (y) have a generalized χ 2 distribution. This section provides a formal definition of this distribution class.
Quantiles of the generalized χ 2 distribution can be approximated by simulation or from its CDF (Lemma 7) using the method of Davies (1973).

The resulting posterior density is not available in closed form, but can be estimated with MCMC, one-dimensional numerical integration, or Laplace approximation. To make computations tractable, we estimate the cross-validatory estimates elpd CV by quadrature for each fold, and we estimate the (theoretical) elpd using MCMC.
The posterior density follows from an application of Bayes' rule, given in (E.4). Evaluating (E.4) therefore requires the marginal likelihood p (y train ).

E.1 Marginal likelihood
The marginal likelihood can be obtained by integration. Applying a Laplace approximation at the mode ϕ̂, it follows that log p (y train ) ≈ log p (y train , ϕ̂) + (p ℓ /2) log 2π − (1/2) log |−H (ϕ̂)|, where the Hessian H (ϕ) = ∇ ϕ 2 log p (y train , ϕ). The Laplace approximation seems very accurate in the particular context of our experiments.
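A minimal one-dimensional sketch of this Laplace calculation, using a toy quadratic log joint (chosen so that the approximation is exact and can be checked against direct numerical integration):

```python
import numpy as np

# Toy 1-d joint: log p(y, phi) = -0.5 * (phi - 1)^2 / 0.3^2 - 2.0
# (the constant -2.0 plays the role of log p(y, phi_hat) up to the quadratic).
def log_joint(phi):
    return -0.5 * (phi - 1.0) ** 2 / 0.3**2 - 2.0

phi_hat = 1.0                       # mode of log_joint
H = -1.0 / 0.3**2                   # second derivative of log_joint at the mode

# Laplace approximation: log p(y) ~ log p(y, phi_hat) + (d/2) log 2*pi - 0.5 log|-H|
log_ml_laplace = log_joint(phi_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-H)

# Direct numerical integration for comparison (fine Riemann sum).
grid = np.linspace(-2.0, 4.0, 20_001)
log_ml_numeric = np.log(np.exp(log_joint(grid)).sum() * (grid[1] - grid[0]))
```

For a Gaussian log joint the two values coincide; for the ARX posteriors the log joint is only approximately quadratic near the mode, which is why the paper reports the approximation as accurate rather than exact.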
From the form of these expressions it can be verified that each is a function of ϕ * , ϕ (ℓ) , σ 2 * , σ 2 ℓ , Z * , and Z ℓ , as well as the CV scheme parameters.

Figure 4 :
Figure 4: Joint and pointwise theoretical model selection statistics under log score for 500 independent posteriors, as dependence increases from left to right. This is the 'hard' example described in Section 4: the DGP is an ARX(2, 3) and the candidates are M A : ARX(1, 2) and M B : ARX(1, 1). Positive values select M A , the better model. The gray line is the 45-degree line where both estimators are equal. Marginal histograms are also shown, with red bars indicating adverse selection.

Figure 5 :
Figure 5: Model selection objectives for the 'hard' case (β * = β hard ). Column (a) shows the theoretical model selection objectives, and columns (b) and (c) the associated CV estimates. The top row plots the standard deviation of the relevant ω(y) for the corresponding column. The bottom row shows its mean and 98% interval. The model parameter α governs the degree of serial dependence. Notice that the adverse selection rate for both joint and pointwise methods is close to zero for all but the strongest dependence. This model selection experiment compares M A : ARX(1, 2) vs M B : ARX(1, 1) under an ARX(2, 3) DGP, as described in Section 4. The autoregressive parameter is ϕ * = α (0.75, 0.2), and the data length is T = 100. See also Figure 8 for the 'easy' case.
cost CV (y) = max M ∈M elpd (M | y) − elpd (M * CV | y), (4.7)

where M * CV = arg max M ∈M elpd CV (M | y). Pointwise elppd and joint eljpd are of course not directly comparable. To put joint and pointwise CV measures on an equal footing, we specify the cost measure in terms of joint utility for both pointwise and joint CV procedures,

cost CV (y) = max M ∈M eljpd (M | y) − eljpd (M * CV | y), (4.8)

where M * CV = arg max M ∈M elpd CV (M | y) and elpd CV (•) can represent either the joint estimate eljpd CV (•) or the pointwise estimate elppd CV (•).

Figure 8 :
Figure 8: Behavior of model selection objectives for the 'easy' case, a counterpoint to Figure 5. Column (a) shows the theoretical model selection objectives, and columns (b) and (c) the associated CV estimates. The top row plots the standard deviation of the relevant ω(y) for the corresponding column. The bottom row shows its mean and 98% interval. The model parameter α governs the degree of serial dependence. Notice that the adverse selection rate for both joint and pointwise methods is close to zero for all but the strongest dependence. This model selection experiment compares M A : ARX(1, 2) vs M B : ARX(1, 1) under an ARX(2, 3) DGP, as described in Section 4. We use the β easy parameter setting, autoregressive parameter ϕ * = α (0.75, 0.2), and data length T = 100.
Figure 13: Adverse selection rate and hyperparameter choices.
only the indexes of the desired test and training subsets. For CV estimators, train and test are mutually disjoint. Define the n × T matrix S and v × T matrix S̃ such that y train = Sy and ỹ test = S̃ ỹ. The elements of S and S̃ are given by

S j,t := 1{t = train j } and S̃ k,t := 1{t = test k }, (D.1)

where 1{•} denotes the indicator function, train j and test k respectively denote the jth and kth elements of train and test, t = 1, . . . , T , j = 1, . . . , n, and k = 1, . . . , v.
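The selection matrices of (D.1) can be built directly from index lists, as in this sketch (the index sets shown are arbitrary examples of a disjoint train/test split):

```python
import numpy as np

def selection_matrix(idx, T):
    """Selection matrix with entries S[j, t] = 1{t == idx[j]}, as in (D.1)."""
    S = np.zeros((len(idx), T))
    S[np.arange(len(idx)), idx] = 1.0
    return S

T = 6
train, test = [0, 1, 2, 5], [3, 4]     # mutually disjoint, as for a CV fold
S = selection_matrix(train, T)
S_tilde = selection_matrix(test, T)

y = np.arange(T, dtype=float)
y_train = S @ y                         # y_train = S y
y_test = S_tilde @ y                    # y_test  = S~ y
```

Each CV blocking pattern in Section 2.4 then corresponds to a particular family of such (S k , S̃ k ) pairs.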

Table 2: Adverse selection rates for experiments in Table 1.