Dynamic Regression Models for Time-Ordered Functional Data

For time-ordered functional data, an important yet challenging task is to forecast functional observations with uncertainty quantification. Scalar predictors are often observed concurrently with functional data and provide valuable information about the dynamics of the functional time series. We develop a fully Bayesian framework for dynamic functional regression, which employs scalar predictors to model the time-evolution of functional data. Functional within-curve dependence is modeled using unknown basis functions, which are learned from the data. The unknown basis provides substantial dimension reduction, which is essential for scalable computing, and may incorporate prior knowledge such as smoothness or periodicity. The dynamics of the time-ordered functional data are specified using a time-varying parameter regression model in which the effects of the scalar predictors evolve over time. To guard against overfitting, we design shrinkage priors that regularize irrelevant predictors and shrink toward time-invariance. Simulation studies decisively confirm the utility of these modeling and prior choices. Posterior inference is available via a customized Gibbs sampler, which offers unrivaled scalability for Bayesian dynamic functional regression. The methodology is applied to model and forecast yield curves using macroeconomic predictors, and demonstrates exceptional forecasting accuracy and uncertainty quantification over the span of four decades.


Introduction
In business, science, and industrial applications, data are commonly measured over a continuous domain, such as time, space, or wavelength. Functional data analysis treats these data as a single realization of an underlying function, and seeks to model the associations among distinct functional observations. Functional data present significant challenges for modeling and prediction: the data are usually highly correlated, high-dimensional, and often unevenly spaced over the domain. We address the setting of time-ordered functional data, where the primary goal is to forecast future functional observations with uncertainty quantification. Examples of time-ordered functional data include yearly sea surface temperature as a function of time-of-year (Besse et al., 2000), daily pollution curves as a function of time-of-day (Damon and Guillas, 2002; Aue et al., 2015), yearly mortality rates as a function of age (Hyndman and Ullah, 2007), and yearly disease counts as a function of time-of-year (Kowal, 2019).
Forecasting and inference for functional time series data require adequate modeling of both the functional within-curve dependence and the dynamic between-curve dependence. Simultaneous identification of prominent functional features together with their temporal behaviors provides a recipe for understanding and predicting function-valued time series. To assist in this challenging task, scalar predictors are commonly recorded concurrently with functional data, and may provide information about the time evolution of the functional data. The utility of scalar predictors for modeling functional data has been demonstrated extensively in many instances of function-on-scalars regression (Ramsay and Silverman, 2005; Montagna et al., 2012; Morris, 2015; Chen et al., 2016; Barber et al., 2017; Fan and Reimherr, 2017; Kowal and Bourgeois, 2019). However, limited attention has been given to dynamic regression models for time-ordered functional data, in particular with the primary goal of forecasting functional data.
For scalar time series data, time-varying parameter (TVP) regression models have shown the ability to produce accurate forecasts with precise uncertainty quantification (Dangl and Halling, 2012; Korobilis, 2013; Belmonte et al., 2014). In TVP regression, the scalar response is linearly associated with scalar predictors, such that the linear associations evolve dynamically over time. For financial and economic data in particular, the time-variation is essential: abrupt changes in regulations and policies, as well as gradual changes in market sentiments and the broader macroeconomy, may drive an evolving relationship between the predictors and the response (Dangl and Halling, 2012). TVP regression is a special case of a dynamic linear (or state space) model (West and Harrison, 1997), and benefits from the computational and inferential capabilities of these models.
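As a concrete illustration of the mechanism, the following sketch simulates a small TVP regression in which the coefficients follow a random walk; the dimensions, seed, and noise scales are illustrative choices rather than values from the paper. A single static least squares fit cannot track the drifting coefficients, while the oracle time-varying coefficients leave only the observation noise:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 200, 2

# Predictors and a random-walk path for the time-varying coefficients:
#   alpha_t = alpha_{t-1} + omega_t,   y_t = x_t' alpha_t + eps_t
x = rng.normal(size=(T, p))
alpha = np.cumsum(0.05 * rng.normal(size=(T, p)), axis=0)  # random walk
y = np.sum(x * alpha, axis=1) + 0.1 * rng.normal(size=T)

# A single static OLS fit cannot track the drifting coefficients ...
beta_static, *_ = np.linalg.lstsq(x, y, rcond=None)
resid_static = y - x @ beta_static
# ... whereas the oracle time-varying coefficients leave only noise.
resid_tvp = y - np.sum(x * alpha, axis=1)
print(np.var(resid_tvp) < np.var(resid_static))  # True
```

In a Bayesian treatment, the random-walk path would be inferred (e.g., by a simulation smoother) rather than observed; the point here is only that ignoring time-variation inflates the residual variance.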
Despite the impressive gains offered by TVP regression models, functional data present new and unique challenges. Model complexity, and with it the risk of overparametrization, is greatly increased: each time-varying parameter is now a time-varying function. Function estimation implicitly requires regularization, usually via smoothness or sparsity, which must be combined with the simultaneous shrinkage of irrelevant predictors and extraneous time-variation. Equally important, the data dimensionality in the functional data setting presents a significant challenge for Bayesian computing, and therefore impedes access to the forecasting distribution.
To address these challenges, we develop a dynamic Bayesian model for functional regression and forecasting. We extract prominent functional features by modeling each functional observation using unknown basis functions, which are learned from the data. The unknown basis provides a lower-dimensional space in which to model the time-variation of the functional data and greatly improves computational scalability. The dynamics of the time-ordered functional observations are modeled using a TVP regression with autoregressive innovations, which incorporates both predictor-driven and latent dynamics. The resulting model allows the scalar predictors to inform the time-evolution of the functional response, specifically corresponding to the learned functional features, with time-variation in the key parameters for greater adaptability. Shrinkage priors are included to simultaneously (i) ensure smoothness of the basis functions, (ii) regularize irrelevant predictors, (iii) shrink toward time-invariance for model parsimony, and (iv) reduce sensitivity to the dimension of the unknown basis. Full posterior inference and multi-step forecasting distributions are available via an efficient Gibbs sampler. Simulation studies confirm the importance of these modeling choices, and in particular the model for the basis functions, the time-variation of model parameters, and the shrinkage priors.
The proposed methodology is applied to model and forecast U.S. yield curve data. For a given currency and level of risk of a debt, the yield curve describes the interest rate at a given time as a function of the length of the borrowing period, or time to maturity, and evolves over time. The yield curve is observed concurrently with financial and macroeconomic predictor variables, which may assist in constructing point and interval forecasts of the yield curve. While a variety of techniques are available for yield curve modeling, to the best of our knowledge there is no existing framework that simultaneously includes (i) a nonparametric regression model for unknown basis functions, (ii) dynamic regression coefficients, and (iii) Bayesian shrinkage priors for regularization. We evaluate the proposed framework for yield curve modeling and forecasting over the span of four decades, and demonstrate substantial gains in forecasting accuracy and uncertainty quantification relative to state-of-the-art and benchmark competitors.
The remainder of the paper is organized as follows: we introduce the model in Section 2; the unknown basis function model is in Section 3; the shrinkage priors are in Section 4; the MCMC algorithm is in Section 5; a simulation analysis is in Section 6; the yield curve forecasting comparison is in Section 7; and we conclude in Section 8. Supplementary files include R code and a document with additional details on the MCMC algorithm, the basis expansions, and the simulations (Kowal 2020).

Dynamic function-on-scalars regression
Let {Y_t}_{t=1}^T be a time-ordered sequence of functions on a compact index set T ⊂ R^D with D ∈ Z^+. Each function is observed at a discrete set of points τ_j ∈ T, j = 1, …, M, which are fixed across all times t only for notational convenience. The observed data are noisy realizations of each Y_t at these observation points:

y_{j,t} = Y_t(τ_j) + ε_{j,t},   j = 1, …, M,   (1)

where ε_{j,t} represents observation or measurement error. Since the functions Y_t are typically modeled as smooth in τ, the errors ε_{j,t} account for non-smoothness of the observed data. Notably, omission of observation error for time-ordered functional data can induce misspecified dynamics for {Y_t} (Kowal et al., 2017b). We assume conditionally Gaussian observation errors with time-varying variance, [ε_{j,t} | σ_t^2] ∼ N(0, σ_t^2), with alternative distributional assumptions in Lemma 2. The dynamic variance σ_t^2 allows the variability about Y_t to change over time, and is essential for constructing precise and accurate yield curve forecasting intervals (see Section 7).
Suppose that at each time t, the functional data y_t = (y_{1,t}, …, y_{M,t})′ are observed concurrently with time-ordered predictors, x_t = (x_{1,t}, …, x_{p,t})′. Note that we do not require p < M. Our goal is to leverage these dynamic predictors x_t to produce superior forecasting distributions for y_t, and to construct inferential summaries of the associations between x_t and Y_t. For time-ordered data, it is likely that associations among variables change over time. Failure to incorporate such time-variation will produce inferior forecasts and unreliable inference. It is also expected that some predictors may not be useful in forecasting Y_t, while others may only provide predictive power for certain features or subdomains of {Y_t}. Consequently, it is important to include mechanisms for regularization of irrelevant predictor variables.
Motivated by these considerations, we propose the following dynamic function-on-scalars regression (DFOSR) model:

Y_t(τ) = Σ_{k=1}^K f_k(τ) β_{k,t},   (2)
β_{k,t} = μ_k + Σ_{j=1}^p x_{j,t} α_{j,k,t} + γ_{k,t},   (3)
α_{j,k,t} = α_{j,k,t−1} + ω_{j,k,t},   (4)
γ_{k,t} = φ_k γ_{k,t−1} + η_{k,t},   (5)

for t = 1, …, T. The functions Y_t are decomposed into a linear combination of K loading curves {f_k(τ)}_{k=1}^K and dynamic factors {β_{k,t}}_{k=1}^K. The basis expansion (2) is a core component of many functional data models. In Section 3, we develop a nonparametric regression approach for modeling each f_k as unknown. As a result, the functional basis {f_k} is data-adaptive, with each f_k learning an essential functional feature of {Y_t}.
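To make the factor structure concrete, the following sketch builds synthetic functional data with the low-rank form of the basis expansion (2), using two fixed, hypothetical loading curves in place of the learned basis (in the DFOSR the {f_k} are unknown and estimated):

```python
import numpy as np

rng = np.random.default_rng(1)
T, M, K = 100, 30, 2
tau = np.linspace(0, 1, M)

# Two illustrative loading curves (fixed here; learned in the DFOSR)
F = np.column_stack([np.ones(M) / np.sqrt(M), np.sin(2 * np.pi * tau)])
F[:, 1] /= np.linalg.norm(F[:, 1])            # unit norm, as in F'F = I_K
beta = np.cumsum(rng.normal(size=(T, K)), axis=0)  # dynamic factors (random walk)

Y = beta @ F.T                                 # Y_t(tau_j) = sum_k f_k(tau_j) beta_{k,t}
y = Y + 0.1 * rng.normal(size=(T, M))          # noisy functional observations, as in (1)

print(np.linalg.matrix_rank(Y))  # 2: the curves live in a K-dimensional subspace
```

Although each curve is observed at M = 30 points, the noiseless curves occupy only a K = 2 dimensional subspace, which is exactly the dimension reduction the unknown basis exploits.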
The dynamic factors {β_{k,t}} model the time dependence among the functions {Y_t}, and specifically correspond to the functional features defined by each f_k. Naturally, modeling the dynamics of {β_{k,t}} is fundamental for constructing accurate forecasts of the functional time series data y_t. We incorporate the dynamic predictors x_t into the model for β_{k,t} via (3)-(4), which includes an intercept μ_k, dynamic regression coefficients {α_{j,k,t}}, and an error term γ_{k,t} for each factor k. Note that dynamics on the intercept μ_k are not needed, since these effects appear in γ_{k,t}. The regression coefficients {α_{j,k,t}} are time-varying for each predictor j and factor k in (4), which implies time-variation for the functional regression model. The regression model also features an autoregressive time series model for the error terms in (5), which accounts for time-variation in {Y_t} that is not fully explained by the predictors x_t.
The DFOSR has an equivalent representation which makes explicit the association between the factor-specific parameters in (3)-(5) and the functions Y_t. Let GP(c, C) denote a Gaussian process with mean function c and covariance function C. By substituting (3)-(5) into (2), the DFOSR model implies the dynamic functional regression model

Y_t(τ) = μ̃(τ) + Σ_{j=1}^p x_{j,t} α̃_{j,t}(τ) + γ̃_t(τ),   (6)

where each functional term is a linear combination of the {f_k}, or explicitly, μ̃(τ) := Σ_k f_k(τ) μ_k. The predictors x_{j,t} are directly associated with the functional time series Y_t(τ) via the dynamic regression coefficient functions α̃_{j,t}(τ) := Σ_k f_k(τ) α_{j,k,t}. The error term γ̃_t(τ) := Σ_k f_k(τ) γ_{k,t} models both functional dependence in τ and dynamic dependence via a functional autoregressive model in (7), which has been shown to offer strong forecasting performance in the absence of predictors (Kowal et al., 2017b). Note that the auto- and cross-covariance functions of Y_t(τ) are available as a special case of Propositions 1 and 2 in Kowal (2019), and do not require separability in τ and t.
The DFOSR model is highly flexible: the dynamic regression coefficients are permitted to change at every time t, while each functional feature f_k is associated with predictors x_t via (3) and has distinct idiosyncratic dynamics via γ_{k,t} in (5). However, without adequate regularization, there is a risk of overparametrization: it is unlikely that every predictor x_{j,t} is associated with every factor β_{k,t}, and that this association changes substantially at each time t. Overparametrization is a particular concern for forecasting, and can lead to increases in both bias and variance. In Section 4, we introduce Bayesian shrinkage priors which simultaneously regularize against irrelevant predictors, unnecessary time-variation, and sensitivity to the dimension K of the unknown basis. These priors are represented as conditionally Gaussian distributions for the innovations ω_{j,k,t} and η_{k,t}, which preserves computational scalability of the state space model (2)-(5).
Note that our setting is similar to, but distinct from, longitudinal functional data analysis (Greven et al., 2011;Chen and Müller, 2012;Park and Staicu, 2015). Longitudinal functional data are time-ordered functional data, but typically include replicates of each functional time series (e.g., time-ordered functions are observed for each of many subjects).

Modeling the loading curves
The loading curves {f_k} provide the foundation for functional modeling and forecasting of Y_t. The dynamics of the DFOSR model are determined by β_{k,t} in (3)-(5), while each β_{k,t} is mapped to the functional domain τ ∈ T via the corresponding f_k(τ) in (2). Clearly, any functional features that are not effectively captured by {f_k} cannot be forecasted accurately by the DFOSR and inferentially cannot be linked to the predictors. Therefore, it is essential that the loadings {f_k} adequately capture the functional dependence among the {Y_t}.
There is an unavoidable tradeoff between flexibility of the basis and parsimony. As the dimension K increases, the model accounts for more functional features, yet the TVP regression model in (3)-(5) becomes more heavily parametrized. Furthermore, computational scalability of the DFOSR model, which may be expressed as a state space model in {α_{j,k,t}, γ_{k,t}} (see Section 5), is limited by K. Methods that use a fixed basis expansion in (2), such as splines (Laurini, 2014) or wavelets (Morris and Carroll, 2006), require a large K for adequate modeling flexibility, and therefore produce heavily parametrized and computationally infeasible models when combined with the dynamics of (3)-(5). An appealing solution is to model the basis functions {f_k} as unknown, which allows the curves {f_k} to be optimized for each functional dataset. However, we must be careful to ensure that the uncertainty regarding the unknown {f_k} is not ignored.

Priors and full conditional distributions for the loading curves
We propose a model for the loading curves {f_k} that simultaneously (i) treats {f_k} as unknown, which produces a data-adaptive basis and minimizes the number of necessary basis functions K; (ii) accounts for the inherent uncertainty in {f_k}; (iii) is scalable in the number of observation points, M; and (iv) is well-defined for T ⊂ R^D with D ∈ Z^+. Identifiability constraints are enforced on {f_k} to preserve interpretability of the loading curves {f_k} and factors {β_{k,t}} in (2). These constraints, which are detailed in Section 3.2, are designed to simplify posterior sampling for the factor-specific parameters in (3)-(5), which greatly improves computational scalability.
A common approach in nonparametric regression is to represent each unknown function (here, each f_k) as a linear combination of known basis functions, and then model the corresponding unknown basis coefficients:

f_k(τ) = b′(τ) ψ_k,   (8)

where b(τ) is an L_M-dimensional vector of known basis functions and ψ_k is an L_M-dimensional vector of unknown basis coefficients. We use low-rank thin plate splines (LR-TPS), which are smooth, flexible, and well-defined for T ⊂ R^D with D ∈ Z^+. However, modifications for other bases such as wavelets, Fourier expansions, and Gaussian processes are straightforward and follow the same construction as below. Given functional data observations y_t := (y_{1,t}, …, y_{M,t})′, we consolidate (1)-(2) to relate the unknown coefficients {ψ_k} to the observed data y_t:

y_t = Σ_{k=1}^K (B ψ_k) β_{k,t} + ε_t,   (9)

where B := (b(τ_1), …, b(τ_M))′ is the M × L_M matrix of basis functions evaluated at the observation points and ε_t := (ε_{1,t}, …, ε_{M,t})′. To encourage smoothness of f_k, LR-TPS accompany the basis expansion with a roughness penalty. In the case of D = 1, the penalty is of the form P(f_k) = ∫ {f̈_k(τ)}² dτ for f̈_k the second derivative of f_k; for D > 1, derivatives are replaced by gradients (see Wood, 2006 for details). For Bayesian implementations, the penalty corresponds to a prior distribution on the basis coefficients ψ_k, or equivalently, the implied function f_k. Consider the conditionally Gaussian prior

[ψ_k | λ_{f_k}] ∼ N(0, λ_{f_k}^{-1} Ω_b^{-}),   (10)

where λ_{f_k} > 0 is a prior precision parameter and Ω_b is a penalty matrix satisfying ψ′_k Ω_b ψ_k = P(f_k). The resulting negative log-posterior is, up to an additive constant, a penalized least squares criterion, for which the precisions λ_{f_k} are smoothing parameters for each f_k. Details on the construction of B and Ω_b and the choice of L_M for LR-TPS are provided in the supplement. The generalized inverse in (10) is standard for Bayesian splines, since Ω_b has rank L_M − 2 and is not invertible. In our implementation (see the supplement), which diagonalizes Ω_b and orthogonalizes B, the improper prior is replaced with a proper prior to ensure propriety of the posterior.
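The role of the roughness penalty can be sketched numerically. Here an identity basis with a second-difference penalty stands in for the LR-TPS basis B and penalty matrix Ω_b (a simplification; the paper's construction is in its supplement), and the precision λ plays the role of the smoothing parameter λ_{f_k}:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 50
tau = np.linspace(0, 1, M)
y = np.sin(2 * np.pi * tau) + 0.3 * rng.normal(size=M)

# Second-difference penalty: a discrete analogue of the integrated
# squared second derivative that defines the LR-TPS roughness penalty.
D2 = np.diff(np.eye(M), n=2, axis=0)     # (M-2) x M second-difference operator
Omega = D2.T @ D2

def smooth(y, lam):
    # Penalized least squares: minimize ||y - f||^2 + lam * f' Omega f
    return np.linalg.solve(np.eye(M) + lam * Omega, y)

f_rough = smooth(y, 1e-4)
f_smooth = smooth(y, 1e2)
# A larger precision lam yields a smoother (less rough) curve:
print(np.sum((D2 @ f_smooth) ** 2) < np.sum((D2 @ f_rough) ** 2))  # True
```

In the Bayesian version, minimizing this criterion corresponds to the posterior mode under the conditionally Gaussian prior, and placing a prior on λ lets the data choose the degree of smoothing.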
For posterior inference, we design a Bayesian backfitting algorithm that iteratively draws from the full conditional distribution of each f_k. The iterative approach is useful for enforcing the identifiability constraints in Section 3.2, and leads to straightforward (Gaussian) full conditional distributions for each ψ_k. Note that the full conditional distribution of f_k = Bψ_k clearly depends on the dynamic factors {β_{k,t}}, which are learned simultaneously based on model (3) (Kowal et al., 2017a). By assigning a prior distribution to each λ_{f_k}, we allow the data to inform the degree of smoothness for each f_k, without needing a cross-validation step, while incorporating the uncertainty about λ_{f_k} into the posterior distribution.
Existing methods for learning {f_k} generally fall into two categories: functional principal components analysis (FPCA) and Bayesian reduced-rank models. In FPCA, the basis functions {f_k} are estimated in advance of model-fitting (Goldsmith and Kitago, 2016). However, this approach implicitly requires a multi-step estimation procedure, whereby the FPCs are estimated separately from, and without utilizing, the remainder of the model in (3)-(4). Dynamic FPCA incorporates temporal structure, but is primarily designed for dimension reduction rather than forecasting or regression. When pre-computed FPCs are treated as fixed and known, the ignored uncertainty is often nontrivial: Goldsmith et al. (2013) demonstrate that FPC-based methods can substantially underestimate total variability, even for densely-observed functional data. Bayesian reduced-rank functional data models such as Montagna et al. (2012) and Suarez et al. (2017) are similar to FPCA, yet incorporate the uncertainty of {f_k} into the posterior distribution. Unfortunately, these approaches have limited computational scalability, and are not designed to incorporate the dynamic regression model (3)-(5) or produce forecasts of y_t.

Simplifying the likelihood via identifiability constraints
We enforce identifiability constraints on the loading curves {f_k}, which primarily serves two purposes. First, identifiability is necessary to interpret {f_k} and the k-specific model parameters in (3) and (4). Second, our particular choice of constraints provides computational improvements for sampling the dynamic parameters in (3) and (4). We constrain F′F = I_K, where F := (f_1, …, f_K) is the M × K matrix of loading curves evaluated at the observation points τ_1, …, τ_M and I_K is the K × K identity matrix. This constraint, combined with a suitable ordering constraint on k = 1, …, K (see Section 4), is sufficient for identifiability (up to sign changes, which in our experience are not problematic in the MCMC sampler).
While various identifiability constraints are available, the following result illustrates the utility of the proposed approach:

Lemma 1. Under the identifiability constraint F′F = I_K and model (9), the joint full conditional distribution for {β_{k,t}} is proportional to Π_{t=1}^T Π_{k=1}^K N(ỹ_{k,t}; β_{k,t}, σ_t^2), which is the likelihood under the working model

ỹ_{k,t} = β_{k,t} + ε̃_{k,t},   [ε̃_{k,t} | σ_t^2] ∼ N(0, σ_t^2),   (13)

with ỹ_{k,t} := f′_k y_t.

For sampling the dynamic parameters in (3)-(5), the full functional data likelihood under model (9) is unnecessary, and may be replaced by the simpler working likelihood under model (13). Consequently, all computations involving these parameters depend on M only via the projection ỹ_{k,t} = f′_k y_t, which is a one-time cost per MCMC iteration. These simplifications facilitate the inclusion of complex dynamics in (3)-(4) without sacrificing computational feasibility.
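A small numerical check illustrates the simplification: with an orthonormal loading matrix (randomly generated here, purely for illustration), the least squares solution for the factors at each time reduces to the K projections f′_k y_t, so no M-dimensional system needs to be solved:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, T = 40, 3, 5
# Orthonormal loading matrix F (F'F = I_K) via QR decomposition
F, _ = np.linalg.qr(rng.normal(size=(M, K)))
y = rng.normal(size=(T, M))

# With F'F = I_K, the least squares solution for beta_t given y_t is
# simply the projection y_tilde_{k,t} = f_k' y_t: the M-dimensional
# problem collapses to K scalars per time point.
y_tilde = y @ F                                    # (T, K) projections
beta_ls = np.linalg.lstsq(F, y.T, rcond=None)[0].T
print(np.allclose(beta_ls, y_tilde))  # True
```

This is why, once the projections are computed (a one-time cost per MCMC iteration), the dynamic parameter updates scale with K rather than M.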
As an empirical illustration, Table 1 gives computation times for simulated data from Section 6 for the proposed DFOSR model compared to Kowal et al. (2017a) (referred to as DFOSR-NIG in Section 6). Most importantly, Lemma 2 accommodates within-curve dependence of the errors ε_t(τ_j) := ε_{j,t}, such as autoregressive or non-smooth dependence in τ. Unlike the functions Y_t in (2), the error functions ε_t(τ), and therefore the functional data realizations y_t(τ_j) := y_{j,t}, are not restricted to be low rank. However, the errors ε_t remain conditionally independent across time t = 1, …, T.
Proceeding under the setting of Lemma 1, we decompose the orthonormality constraint for each f_k into two sets of constraints: the linear constraints f′_ℓ f_k = 0 for ℓ ≠ k and the unit-norm constraint ||f_k|| = 1. The sampler in Section 3.1 conditions on {f_ℓ}_{ℓ≠k}, so the linear constraint is fixed for each k; using the full conditional distribution from Section 3.1, we enforce the linear orthogonality constraint by conditioning on {f′_ℓ f_k = 0}_{ℓ≠k}. Conditioning on the constraint is a natural Bayesian approach, and produces desirable optimality properties for constrained penalized regression (see Theorem 1 of Kowal et al., 2017a). Given an orthogonally-constrained sample f*_k := Bψ*_k, which is obtained via Theorem 1 below, we rescale to enforce the unit-norm constraint: f_k = f*_k / ||f*_k||, and similarly rescale ψ*_k. This rescaling does not change the shape of the loading curve f_k, and can be counterbalanced by an equivalent rescaling of the corresponding factor, i.e., β_{k,t} ← β_{k,t} ||f*_k||. By applying this procedure iteratively for k = 1, …, K, the constraint F′F = I_K is satisfied for every MCMC iteration, which is required to obtain the computational simplifications of Lemma 1. Alternative approaches for directly sampling F on the Stiefel manifold are promising, but require customized algorithms (Jauch et al., 2019).
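The iterative procedure can be mimicked with a Gram-Schmidt-style sketch, using raw Gaussian draws as stand-ins for samples from the full conditional distribution (the paper instead conditions the Gaussian draw on the constraint via Theorem 1; the projection step below only captures the spirit of that construction). One pass of project-then-rescale over k yields F′F = I_K:

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 40, 3
F = rng.normal(size=(M, K))  # unconstrained stand-in draws of the loading curves

# Iteratively enforce the constraints, one loading curve at a time:
# (i) linear orthogonality to the other curves, (ii) unit norm via rescaling.
for k in range(K):
    others = np.delete(F, k, axis=1)
    # project f_k onto the orthogonal complement of the other curves
    F[:, k] -= others @ np.linalg.lstsq(others, F[:, k], rcond=None)[0]
    F[:, k] /= np.linalg.norm(F[:, k])  # rescale; counterbalanced by beta_{k,t}

print(np.allclose(F.T @ F, np.eye(K)))  # True
```

Each later step only modifies the current column, so the pairwise orthogonality established at earlier steps is preserved, and the full constraint holds after a single sweep over k.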
More generally, we may condition on any linear constraints C_k f_k = 0 for each k = 1, …, K. For example, the constraint f_k(τ_1) = f_k(τ_M) ensures a periodic curve f_k, while the constraints f_k(τ*) = 0 for τ* ∈ T* restrict the support of f_k to a subdomain of T. Incorporating constraints, when appropriate, reduces variability and enhances interpretability of the loading curves {f_k}. Within the proposed framework, conditioning on the linear constraints produces several important properties, which are collected in Theorem 1. Theorem 1 provides a recipe for generating constrained loading curves {f_k} based on a draw from the unconstrained full conditional distribution. The resulting f_k is a sample from the requisite full conditional distribution restricted to the linear (orthogonality) constraint. Since the constraint holds almost surely, it follows that C_k f_k = 0 for each draw of f_k, and therefore f′_ℓ f_k = 0 for ℓ ≠ k at every MCMC iteration.
The final property of Theorem 1 is broadly useful, since it provides a mechanism for applying constraints simultaneously to the loading curves {f_k} and the expected functional data response, Y_t. For example, the proposed modeling framework is applicable for periodic functional data, such as daily pollution curves (Damon and Guillas, 2002; Aue et al., 2015), by constraining f_k(τ_1) = f_k(τ_M), which guarantees that Y_t(τ_1) = Y_t(τ_M) for all t. The unknown basis functions {f_k} are still learned from the data, yet the posterior distribution properly accounts for the pre-specified structure implied by Cf_k = 0 and consequently CY_t = 0.

Shrinkage priors
While the DFOSR model provides substantial dynamic modeling flexibility, there is also a risk of overparametrization. The model may contain irrelevant predictors or unnecessary time-variation, and may be sensitive to the dimension K of the unknown basis. We introduce regularization via Bayesian shrinkage priors. In (non-functional) TVP regression, shrinkage priors offer improvements in prediction (Korobilis, 2013; Belmonte et al., 2014) and provide narrower posterior credible intervals. Importantly, these effects are confirmed in the functional data setting for both simulated and real data (see Sections 6 and 7): the shrinkage priors improve forecast accuracy, point estimation accuracy, and calibration and precision of posterior uncertainty quantification.
Since many shrinkage priors may be expressed as conditionally Gaussian distributions (Polson and Scott, 2010, 2012), we consider priors of the form

[ω_{j,k,t} | σ_{ω_{j,k,t}}] ∼ N(0, σ²_{ω_{j,k,t}}),   [η_{k,t} | σ_{η_{k,t}}] ∼ N(0, σ²_{η_{k,t}})   (16)

for all j, k, t. Importantly, the conditional Gaussianity in (16) maintains computational scalability of fully Bayesian inference for the DFOSR model (see Section 5). The priors on the variances σ²_{ω_{j,k,t}} and σ²_{η_{k,t}} will determine the shrinkage effects for ω_{j,k,t} and η_{k,t}. The innovations ω_{j,k,t} control the time-variation in the regression coefficients α_{j,k,t}: when |ω_{j,k,t}| is small, the change |α_{j,k,t} − α_{j,k,t−1}| is small, so that the regression coefficients remain approximately constant at time t. In the absence of definitive evidence for time-variation of α_{j,k,t}, the more parsimonious locally static model with |ω_{j,k,t}| = |α_{j,k,t} − α_{j,k,t−1}| ≈ 0 is preferred. Similarly, if predictor x_{j,t} is irrelevant for factor β_{k,t}, the prior should globally shrink each |ω_{j,k,t}| and the initial state |α_{j,k,0}| to zero, which effectively removes x_{j,t} from the model for β_{k,t}.
To achieve both local (in time t) and global (for factor k, predictor j) shrinkage of α_{j,k,t}, we propose nested horseshoe priors, which extend the horseshoe prior of Carvalho et al. (2010) to the functional time series setting. By design, the horseshoe prior aggressively shrinks small (absolute) values toward zero, while large (absolute) values receive minimal shrinkage. The horseshoe prior has demonstrated broad success in applications, simulations, and theory (Datta and Ghosh, 2013; van der Pas et al., 2014). We apply shrinkage at multiple levels with the following hierarchy of half-Cauchy distributions:

σ_{ω_{j,k,t}} ∼ C⁺(0, λ_{j,k}),   λ_{j,k} ∼ C⁺(0, λ_j),   λ_j ∼ C⁺(0, λ_0),   λ_0 ∼ C⁺(0, 1/√(T−1)).

First, σ_{ω_{j,k,t}} ≈ 0 implies that |ω_{j,k,t}| ≈ 0, so α_{j,k,t} ≈ α_{j,k,t−1} is locally constant. Each α_{j,k,t} for predictor j and factor k may vary at any time t, but the prior encourages most changes to be approximately negligible, which implies fewer effective parameters in the model. The shrinkage parameters λ_{j,k} and λ_j are common for all times t, and provide factor- and predictor-specific shrinkage: for each predictor j, λ_{j,k} allows some factors k to be nonzero, while λ_j operates as a group shrinkage parameter that may effectively remove predictor j from the model. Lastly, the global shrinkage parameter λ_0 controls the global level of sparsity, and is scaled by 1/√(T−1) following Piironen and Vehtari (2016). In the case of the non-dynamic FOSR and FOSR-AR models, we simply remove one level of the hierarchy: ω_{j,k,t} ∼ N(0, λ²_{j,k}). To reduce sensitivity to the dimension K of the basis, we introduce ordered shrinkage across k = 1, …, K for both the innovations η_{k,t} and the intercepts μ_k in (3). For this purpose, we use multiplicative gamma process (MGP) priors (Bhattacharya and Dunson, 2011), which assign increasing shrinkage toward zero for larger values of k.
MGP priors are a popular choice among Bayesian models with unknown rank K, including factor models (Bhattacharya and Dunson, 2011) and functional regression models (Montagna et al., 2012; Kowal and Bourgeois, 2019). The MGP prior for the intercept terms is given by μ_k ∼ N(0, σ²_{μ_k}) with σ⁻²_{μ_k} = Π_{ℓ≤k} δ_{μ_ℓ}, for δ_{μ_1} ∼ Gamma(a_{μ_1}, 1) and δ_{μ_ℓ} ∼ Gamma(a_{μ_2}, 1) when ℓ > 1. Selecting a_{μ_1} > 0 and a_{μ_2} ≥ 2 produces stochastic ordering among the implied variances σ²_{μ_k} (Bhattacharya and Dunson, 2011; Durante, 2017), which also satisfies the ordering requirement for model identifiability. Similarly, for the innovations η_{k,t} ∼ N(0, σ²_{η_{k,t}}), let σ²_{η_{k,t}} = σ²_{η_k}/ξ_{η_{k,t}} with σ⁻²_{η_k} = Π_{ℓ≤k} δ_{η_ℓ}, δ_{η_1} ∼ Gamma(a_{η_1}, 1), δ_{η_ℓ} ∼ Gamma(a_{η_2}, 1) for ℓ > 1, and ξ_{η_{k,t}} ∼ Gamma(ν_η/2, ν_η/2). The rate of ordered shrinkage is determined by the data separately for {μ_k} and {η_{k,t}} using the hyperpriors a_{μ_1}, a_{μ_2}, a_{η_1}, a_{η_2} ∼ Gamma(2, 1). Finally, the hyperprior ν_η ∼ Uniform(2, 128) for the degrees of freedom parameter incorporates the possibility of heavy tails in the marginal distribution for η_{k,t}.
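Both prior families are easy to simulate, which helps build intuition for their shrinkage profiles. The sketch below draws from a plain (non-nested) horseshoe, a half-Cauchy scale followed by a conditionally Gaussian draw, and from MGP-style cumulative-product precisions; the sample size and the choice a = 3 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100_000, 5

# Horseshoe draws: half-Cauchy local scale, then a conditionally Gaussian draw.
lam = np.abs(rng.standard_cauchy(size=n))     # lambda ~ C+(0, 1)
theta = rng.normal(0.0, lam)                  # theta | lambda ~ N(0, lambda^2)
# More mass near zero than a standard Gaussian (the "spike"), plus heavy tails:
print(np.mean(np.abs(theta) < 0.05) > np.mean(np.abs(rng.normal(size=n)) < 0.05))

# MGP-style ordered shrinkage: cumulative products of Gamma(a, 1) draws
# give stochastically increasing precisions across k = 1, ..., K
# (a >= 2, per the stochastic-ordering condition cited in the text).
delta = rng.gamma(3.0, size=(n, K))
prec = np.cumprod(delta, axis=1)              # sigma_k^{-2} = prod_{l<=k} delta_l
var = 1.0 / prec
print(np.all(np.diff(var.mean(axis=0)) < 0))  # variances decrease in k
```

The first check reflects the horseshoe's pole at zero; the second reflects the MGP's increasing shrinkage in k, which is what reduces sensitivity to the choice of K.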

MCMC sampling algorithm
We develop an efficient Gibbs sampling algorithm for the DFOSR model based on four essential components: (i) the loading curve sampler for {f_k} with the identifiability constraint F′F = I_K; (ii) the projection-based simplification of model (9) from Lemma 1; (iii) a state space simulation smoother for the dynamic regression parameters in (3) and (4); and (iv) parameter expansions for the variance components in (16). For sparsely-observed functional data, in which the functions are observed only at a small subset of {τ_j}_{j=1}^M at each time t, a sampling-based imputation step is included. An overview of the sampling algorithm is provided here, with details for the full sampling algorithm in the supplement.
Since model (14)-(15) is a dynamic linear model in the state variables (α′_{k,t}, γ_{k,t})′, the dynamic parameters {α_{k,t}, γ_{k,t}}_{t=1}^T are sampled jointly across all t = 1, …, T from their full conditional posterior distribution using an efficient simulation smoothing algorithm (Durbin and Koopman, 2002). These samplers are also valid for FOSR-AR with α_{j,k,t} = α_{j,k}. Note that the model (14)-(15) may be aggregated across k = 1, …, K to produce a joint sampler with respect to k; in our experience, however, doing so increases computation time without improving MCMC efficiency. A single draw of all dynamic regression coefficients and autoregressive regression error terms {α_{k,t}, γ_{k,t}}_{k,t} has computational complexity O(KTp³). For a small to moderate number of predictors, p < 30, the algorithm is efficient; for sufficiently small K, the sampler is nearly equivalent to the analogous non-functional TVP regression model.
In addition to the loading curve sampler for {f_k} in Section 3 and the state space simulation sampler for {α_{k,t}, γ_{k,t}}_{k,t} via (14)-(15), the Gibbs sampler proceeds by iteratively sampling the intercepts {μ_k}, the autoregressive coefficients {φ_k}, and the variance components σ²_t, σ²_{η_{k,t}}, and σ²_{ω_{j,k,t}}, as well as any relevant hyperparameters, from their full conditional distributions (see the supplement). Stationarity of γ_{k,t} is imposed by the prior on the autoregressive coefficients: we set {(φ_k + 1)/2} ∼ Beta(5, 2), which ensures that P(|φ_k| < 1) = 1. This prior on φ_k is weakly informative and encourages positive autocorrelation: P(φ_k > 0) ≈ 0.9 and P(φ_k > 0.5) ≈ 0.45. For the yield curve data in Section 7, we estimate P(φ_k > 0.5 | y) ≈ 1 for each k, so there is posterior learning which does not contradict the prior. Posterior inference is available for these quantities as well as the TVP regression functions α̃_{j,t}(τ) := Σ_{k=1}^K f_k(τ) α_{j,k,t} and the fitted curves Ŷ_t(τ) := Σ_{k=1}^K f_k(τ) β_{k,t}. To generate samples from the h-step forecasting distribution of y_{T+h}, we leverage the state space construction in (14)-(15). Specifically, we draw from the h-step forecasting distribution of the state variables in (15), [α_{k,T+h}, γ_{k,T+h} | y, −], and subsequently sample [y_{T+h} | y, −] ∼ N(Σ_{k=1}^K f_k β_{k,T+h}, σ²_{T+h} I_M), where β_{k,T+h} = μ_k + Σ_{j=1}^p x_{j,T+h} α_{j,k,T+h} + γ_{k,T+h}. Note that σ²_{T+h} must be sampled from the forecasting distribution of the stochastic volatility model (see Section 7). If only point estimates are needed, we use ŷ_{T+h} = Σ_{k=1}^K f_k β̂_{k,T+h}, where β̂_{k,T+h} = μ̂_k + Σ_{j=1}^p x_{j,T+h} α̂_{j,k,T+h} + γ̂_{k,T+h} and α̂_{j,k,T+h} and γ̂_{k,T+h} are the conditional expectations of [α_{k,T+h}, γ_{k,T+h} | y, −]. The estimator ŷ_{T+h} reduces Monte Carlo error, and may be more accurate in some cases.
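The h-step forecasting recursion for a single factor's error process can be sketched as follows, with the predictors and intercept omitted and the volatility held fixed for simplicity (in the full model σ²_{T+h} is itself drawn from the stochastic volatility forecast); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
phi, sigma_eta, gamma_T = 0.8, 0.5, 1.2
h, n_draws = 3, 50_000

# h-step forecast draws for an AR(1) state gamma_t = phi * gamma_{t-1} + eta_t,
# mimicking one factor's error process in (5).
gamma = np.full(n_draws, gamma_T)
for _ in range(h):
    gamma = phi * gamma + sigma_eta * rng.normal(size=n_draws)

# The Monte Carlo forecast mean matches the closed form phi^h * gamma_T.
print(np.isclose(gamma.mean(), phi ** h * gamma_T, atol=0.02))  # True
```

In the full sampler, each posterior draw of (φ_k, σ_{η_k}, γ_{k,T}) would seed its own forecast path, so the resulting draws integrate over parameter uncertainty rather than conditioning on fixed values as here.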
These forecasting computations remain unchanged for various modifications of the DFOSR model, for example by using a known basis in place of {f k } or an alternative prior distribution for the innovations in (16).

Simulation study
The DFOSR model features several important components: (i) the model for {f k } from Section 3, (ii) the dynamic regression coefficients in (4), and (iii) the nested horseshoe and ordered shrinkage priors for the innovations (16) from Section 4. In concert, these features comprise the proposed DFOSR model (denoted DFOSR-HS in the subsequent figures). Naturally, these assumptions may be relaxed and modified to produce simpler models within the same modeling framework. It is therefore important to determine which of the DFOSR model features are necessary to produce reliable forecasts, estimates, and inference.
The following simulation study assesses the relative importance of these modeling choices. Specifically, we are interested in evaluating forecasts of the functional time series y t and estimates of the dynamic regression coefficient functionsα j,t (τ ), as well as the empirical coverage and precision of the accompanying posterior forecasting and credible intervals. We consider simulation designs with dynamic and non-dynamic regression coefficients, and compare the proposed methods to state-of-the-art and benchmark alternatives for functional regression.

Simulation design
The synthetic data-generating process is based on model (1)-(4), but with some notable differences. There are two sources of sparsity in the regression: (i) some predictors are not associated with the functional response Y_t(τ) and (ii) some predictors are associated with Y_t(τ) exclusively via a small number of factors. Let p_0 = 10 regression coefficients be exactly zero (for all times t), and let p_1 = 5 be nonzero, resulting in p = p_0 + p_1 = 15 regression coefficients plus an intercept, which is fixed at μ*_k := 1/k. For each nonzero predictor j = 1, . . . , p_1 = 5, uniformly sample p*_j factors to be nonzero, where p*_j ∼ Poisson(1) truncated to [1, K*]. For dynamic regression coefficients, the nonzero factors k for predictor j are drawn from a Gaussian random walk with randomly selected jumps: α*_{j,k,t} := Z_{k,0} + Σ_{s≤t} Z_{k,s} I_{k,s}, where Z_{k,t} ∼ N(0, 0.75^{k−1}) and I_{k,t} ∼ Bernoulli(0.01), which results in time-varying yet locally constant regression coefficients α*_{j,k,t}. For non-dynamic regression coefficients, let α*_{j,k,t} := α*_{j,k} ∼ N(0, 0.75^{k−1}). The regression errors γ*_{k,t} are autocorrelated yet stationary with unconditional variance 0.75^{k−1}. Finally, each time-ordered predictor {x_{j,t}}_{t=1}^T is simulated from a Gaussian AR(1) model with autoregressive coefficient 0.8, unconditional mean zero, and unconditional variance one for j = 1, . . . , p.
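This design can be sketched in code; a minimal simulation under the description above, with helper names of our own (the Poisson draw is clipped to [1, K*] rather than sampled from a true truncated distribution, and the autocorrelated errors γ* are omitted for brevity):

```python
import numpy as np

def simulate_coeffs(T, K_star, p1=5, p0=10, rng=None):
    """Dynamic regression coefficients: Gaussian random walks with rare jumps,
    giving time-varying yet locally constant paths (nonzero predictors only)."""
    rng = rng or np.random.default_rng()
    alpha = np.zeros((p0 + p1, K_star, T))
    for j in range(p1):  # only the first p1 predictors are nonzero
        # Poisson(1) number of active factors, clipped to [1, K_star]
        n_active = int(np.clip(rng.poisson(1.0), 1, K_star))
        for k in rng.choice(K_star, size=n_active, replace=False):
            sd = np.sqrt(0.75**k)            # variance 0.75^{k-1} (0-based k)
            Z = rng.normal(scale=sd, size=T)
            jumps = rng.random(T) < 0.01     # Bernoulli(0.01) jump indicators
            alpha[j, k, :] = Z[0] + np.cumsum(Z * jumps)
    return alpha

def simulate_predictors(T, p, rng=None):
    """AR(1) predictors: coefficient 0.8, mean zero, unconditional variance one."""
    rng = rng or np.random.default_rng()
    x = np.zeros((p, T))
    innov_sd = np.sqrt(1.0 - 0.8**2)  # so the stationary variance is one
    x[:, 0] = rng.normal(size=p)
    for t in range(1, T):
        x[:, t] = 0.8 * x[:, t - 1] + rng.normal(scale=innov_sd, size=p)
    return x

rng = np.random.default_rng(2)
alpha = simulate_coeffs(T=100, K_star=4, rng=rng)
x = simulate_predictors(T=100, p=15, rng=rng)
```

The factor-indexed variance 0.75^{k−1} mimics the decaying importance of higher-order factors, which is exactly the structure that the ordered shrinkage priors of Section 4 are designed to exploit.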

Methods for comparison
We implement several competing methods within the dynamic Bayesian framework of (1)-(5), each of which utilizes the same general MCMC approach of Section 5 with small modifications as needed. First, let DFOSR-NIG denote the DFOSR model with normal-inverse-gamma priors on the innovations, similar to Kowal et al. (2017a), instead of the nested horseshoe priors from Section 4. Next, we replace the loading curves {f_k} with functional principal components (FPCs) estimated a priori using Xiao et al. (2013), where K is selected to explain 90% of the variability in {y_t}_t. For the FPC-based modifications, we consider both the nested horseshoe priors from Section 4 (DFPCA-HS) and the aforementioned normal-inverse-gamma prior (DFPCA-NIG). The comparative performance of these methods, DFOSR, DFOSR-NIG, DFPCA-HS, and DFPCA-NIG, illustrates the gains of including (i) shrinkage priors and (ii) a Bayesian model for {f_k}. Lastly, we omit the dynamic regression coefficients by setting α_{j,k,t} = α_{j,k}, reducing to a function-on-scalars regression with autoregressive errors (FOSR-AR). The FOSR-AR model is identical to the full DFOSR with the exception of (4), and therefore isolates the utility of the time-variation of {α_{j,k,t}}.
For benchmark comparisons, we also include non-Bayesian and non-dynamic regression models. First, we include a functional regression model which estimates the regression coefficient functions using generalized least squares (FOSR-LS; Reiss et al., 2010). FOSR-LS is similar to fitting separate linear regression models of y_{j,t} on x_t for each j = 1, . . . , M, but adds smoothness in τ_j for the regression coefficients and incorporates covariance among the errors ε_{j,t} with respect to j = 1, . . . , M. Next, we include a FOSR model which attaches a group lasso penalty to each regression function to provide variable selection (FOSR-Lasso). Both FOSR-LS and FOSR-Lasso are implemented using the refund package in R. Note that these methods do not account for time-varying regression coefficients or autocorrelated errors (with respect to time).

Evaluation criteria
Forecasting ability is evaluated using the root mean squared forecast error, RMSFE = [Σ_{ℓ=1}^M {Y_{T+1}(τ_ℓ) − Ŷ_{T+1}(τ_ℓ)}²/M]^{1/2} for forecast Ŷ_{T+1}, and the mean forecast interval width, MFIW = Σ_{ℓ=1}^M {U_{T+1}(τ_ℓ) − L_{T+1}(τ_ℓ)}/M, where L_{T+1}(τ_ℓ) and U_{T+1}(τ_ℓ) denote the lower and upper endpoints of the forecast interval at τ_ℓ. While RMSFE measures point forecast accuracy, MFIW assesses the precision of the forecasting intervals. Narrower forecast intervals, and therefore smaller MFIW, are preferable among intervals that attain the nominal empirical coverage. The empirical coverage is the proportion of observed values Y_{T+1}(τ_ℓ) contained in the corresponding forecast intervals. We consider 90% intervals for all simulations.
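These criteria are straightforward to compute from forecast draws; a minimal sketch (function names are ours):

```python
import numpy as np

def rmsfe(y_true, y_hat):
    """Root mean squared forecast error over the M observation points."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

def mfiw(lower, upper):
    """Mean forecast interval width over the M observation points."""
    return np.mean(upper - lower)

def coverage(y_true, lower, upper):
    """Empirical coverage: proportion of observed values inside the intervals."""
    return np.mean((y_true >= lower) & (y_true <= upper))

# example: intervals from posterior forecast draws (rows = draws, cols = points)
rng = np.random.default_rng(4)
draws = rng.normal(size=(1000, 17))
lower, upper = np.percentile(draws, [5, 95], axis=0)  # 90% intervals
```

In practice the point forecast is the posterior mean of the draws and the interval endpoints are posterior quantiles, so all three criteria come directly from the MCMC output.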

Simulation results
The results from the dynamic regression setting are in Figures 1 and 2 for the forecasting and estimation comparisons, respectively. Similar figures for the non-dynamic regression setting, in which α̃*_{j,t}(τ) = α̃*_j(τ) is time-invariant for all j, are in Figures 3 and 4. The RMSFE and MFIW are reported in terms of the percent reduction relative to DFOSR-NIG, which provides standardization among the forecasting metrics. Specifically, for each method j we compute 100(RMSFE_0 − RMSFE_j)/RMSFE_0 and 100(MFIW_0 − MFIW_j)/MFIW_0 for each of 100 simulated datasets, where j = 0 denotes the reference model. The DFOSR-NIG is an appropriate reference model: it includes the model for unknown {f_k} from Section 3 but does not include the nested horseshoe prior of Section 4. Methods with values above zero demonstrate improvements in forecasting performance relative to DFOSR-NIG.

Figure 1 shows the relative forecasting results for the dynamic regression simulations. The inclusion of the nested horseshoe priors (DFOSR-HS) provides more accurate point forecasts with narrower forecasting intervals compared to DFOSR-NIG, which demonstrates the utility of these shrinkage priors. At the same time, the performance of DFPCA-HS, which uses the same shrinkage priors but fixes f_k at the FPC estimates, is underwhelming, and is often outperformed by DFOSR-NIG. In concert, these results demonstrate that inclusion of both the model for {f_k} and the nested horseshoe prior produces more accurate point forecasts and more precise forecast intervals. However, the time-variation in (4) is equally important: FOSR-AR omits these dynamics, and clearly produces less accurate point forecasts and forecast intervals with insufficient coverage.

Figure 2 presents the results for the regression coefficient functions, again for the dynamic regression simulations.
Among the true nonzero coefficients (signals), it is most notable that the proposed DFOSR-HS offers significantly more accurate estimation and narrower credible intervals which maintain the correct nominal coverage. By comparison, the methods that do not include both the time-varying parameters and the model for {f k } fail to attain the correct nominal coverage. As anticipated, the shrinkage priors are comparatively more important among true zeros, where the non-dynamic FOSR-AR is clearly the best while DFOSR-HS, DFPCA-HS, and FOSR-Lasso perform similarly.
The forecasting and estimation results for the non-dynamic regression simulations are in Figures 3 and 4. As expected, FOSR-AR performs exceptionally well, since it corresponds to the true data-generating process. Again, the proposed DFOSR-HS significantly outperforms the dynamic competitors in both forecasting and inference. Interestingly, even though FOSR-AR and the non-Bayesian methods FOSR-LS and FOSR-Lasso all (correctly) assume non-dynamic coefficients, the FOSR-AR provides substantially more accurate estimates for both signal and noise coefficients.
To characterize the sensitivity of the proposed DFOSR model to our choice of priors, we considered a variation of DFOSR-HS in which the autoregressive coefficients each received a flat prior restricted to stationarity, φ_k ∼ Uniform(−1, 1), and the MGP hyperparameters were fixed at a_{μ1} = a_{η1} = 2, a_{μ2} = a_{η2} = 3, and ν_η = 3 instead of using the hyperpriors from Section 4. The results are quite similar to those in Figures 1-4, with some attenuation of the gains offered by DFOSR-HS. Therefore, we recommend the weakly informative priors for the autoregressive coefficients φ_k and the MGP hyperpriors from Section 4.

Forecasting yield curves using macroeconomic variables
The yield curve describes the time-varying term structure of interest rates: at each time t, the yield curve Y_t(τ) characterizes how interest rates vary over the length of the borrowing period, or maturity, τ. Yield curves are an essential component in many economic and financial applications: they provide valuable information about economic and monetary conditions, inflation expectations, and business cycles, and are used to price fixed-income securities and construct forward curves (Bolder et al., 2004). Connections between yield curves and macroeconomic variables are of particular interest, both for understanding the interplay of various components of the economy (Aguiar-Conraria et al., 2012) and for producing improved forecasts of key economic and financial variables (Mönch, 2008; Koop, 2013).
We are interested in leveraging macroeconomic variables to produce more accurate yield curve forecasts with reliable uncertainty quantification. There is accumulating evidence to support the inclusion of macroeconomic variables in a yield curve forecasting model (Coroneo et al., 2016; Altavilla et al., 2017). Among the most successful forecasting approaches, a noteworthy feature is the use of time-varying parameters for modeling the associations between macroeconomic variables and the yield curve (Bianchi et al., 2009; Mumtaz and Surico, 2009; Byrne et al., 2017). However, all of the models referenced above, with or without time-varying parameters, rely on the Nelson-Siegel basis expansion (Nelson and Siegel, 1987). The Nelson-Siegel approach uses a similar construction as (2), but with K = 3 parametric functions: f_1(τ) = 1, f_2(τ) = {1 − exp(−λτ)}/(λτ), and f_3(τ) = {1 − exp(−λτ)}/(λτ) − exp(−λτ). Although some approaches estimate the parameter λ > 0 (Laurini and Hotta, 2010; Cruz-Marcelo et al., 2011), it is more common to fix λ a priori, such as λ = 0.0609 (Diebold and Li, 2006; Bianchi et al., 2009; Mumtaz and Surico, 2009; Coroneo et al., 2016; Byrne et al., 2017; Altavilla et al., 2017).
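For reference, the Nelson-Siegel loadings are simple to evaluate; a small sketch with λ fixed at 0.0609 and maturities in months (variable names are illustrative):

```python
import numpy as np

def nelson_siegel_basis(tau, lam=0.0609):
    """Nelson-Siegel loadings at maturities tau > 0 (in months):
    columns are the level, slope, and curvature factors."""
    tau = np.asarray(tau, dtype=float)
    decay = np.exp(-lam * tau)
    slope = (1.0 - decay) / (lam * tau)  # requires tau > 0
    return np.column_stack([np.ones_like(tau), slope, slope - decay])

maturities = np.array([3, 6, 12, 24, 60, 120])  # months
F = nelson_siegel_basis(maturities)
# column 0 is constant (level); column 1 decays from ~1 toward 0 (slope);
# column 2 is hump-shaped in maturity (curvature)
```

With λ fixed, the K = 3 loadings are completely determined in advance, which is precisely the rigidity that the data-adaptive basis {f_k} of the DFOSR model avoids.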
The Nelson-Siegel basis provides a convenient simplification of the functional dependence, but faces several limitations. First, it is clearly inadequate for other functional datasets, many of which have no known parametric structure. Second, the common choice of λ = 0.0609 may be inappropriate for other term structure data or non-US yield curves, yet relaxation of this assumption is challenging due to the resulting nonlinearities in the model. A more subtle difficulty is that the parametric biases of the Nelson-Siegel basis may be undetected or understated among methods that are evaluated using the popular yield curve dataset from Gürkaynak et al. (2007). Gürkaynak et al. (2007) employ a Svensson model for smoothing (Svensson, 1994), which is an augmented Nelson-Siegel model with one additional nonlinear term. Consequently, the Nelson-Siegel-based models that use this dataset (Aguiar-Conraria et al., 2012;Altavilla et al., 2017;Byrne et al., 2017) are likely to show inflated performance relative to other yield curve datasets that do not pre-smooth using Nelson-Siegel or Svensson models.
The DFOSR model offers an appealing alternative: the basis functions {f k } are modeled nonparametrically, while time-varying regression coefficients are included for the macroeconomic variables x t via (4). Existing nonparametric regression approaches for yield curve modeling include Hays et al. (2012) and Jungbacker et al. (2013), which similarly model {f k } as unknown and incorporate a state space framework for {β k,t }, yet do not provide the uncertainty quantification, TVP regression, and shrinkage capabilities of the DFOSR model. The shrinkage priors of the DFOSR model are particularly important, since it is a priori unclear whether any individual macroeconomic variable is important for forecasting and which, if any, corresponding regression coefficients are time-varying. More generally, it is of interest to assess (i) whether including macroeconomic predictors can improve forecasting performance and (ii) whether alternative models for {f k }-such as FPCs or the Nelson-Siegel basis-are competitive. The benefits of including macroeconomic predictors may depend on whether or not the model allows for time-varying regression coefficients. Therefore, we consider two special cases of the DFOSR model: FOSR-AR, which omits the time-variation in (4), and the functional dynamic linear model (FDLM) of Kowal et al. (2017a) with predictors omitted entirely. In each case, we use K = 4 factors, although results are similar for K = 6. We include two variations of each of the FOSR-AR and the FDLM by modifying the model for {f k }: first, using the Nelson-Siegel basis with (NS-X) and without (NS) predictors, and second, using the FPC basis with (FPC-X) and without (FPC) predictors. All competing models may be written in the form of (1)-(5), with forecast estimates and intervals computed using the same MCMC algorithm of Section 5.
For each model, we include a stochastic volatility model for σ²_t to incorporate volatility clustering, in which the log-volatility h_t = log σ²_t evolves according to a stationary autoregressive process. Sampling the log-volatility proceeds using a Gaussian mixture approximation similar to Kim et al. (1998) and Omori et al. (2007).
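The specific evolution equation for h_t is not reproduced here; a common specification compatible with the Kim et al. (1998) mixture sampler is a stationary AR(1) for the log-volatility. A simulation sketch under that assumption, with illustrative parameter values:

```python
import numpy as np

def simulate_sv(T, mu=-1.0, phi=0.95, sig_eta=0.2, rng=None):
    """Simulate sigma2_t under an (assumed) AR(1) log-volatility model:
        h_t = mu + phi * (h_{t-1} - mu) + eta_t,  eta_t ~ N(0, sig_eta^2),
    which produces the volatility clustering described above."""
    rng = rng or np.random.default_rng()
    h = np.empty(T)
    # initialize from the stationary distribution of h_t
    h[0] = mu + rng.normal(scale=sig_eta / np.sqrt(1.0 - phi**2))
    for t in range(1, T):
        h[t] = mu + phi * (h[t - 1] - mu) + rng.normal(scale=sig_eta)
    return np.exp(h)  # sigma2_t = exp(h_t) > 0 by construction

s2 = simulate_sv(500, rng=np.random.default_rng(3))
```

A persistence parameter near one (here φ = 0.95) produces extended periods of high and low volatility, matching the clustering observed in monthly yield data.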

Forecasting design
We consider 40 years of monthly data from 1970-2009 using the unsmoothed Fama and Bliss (1987) US government bond yields provided by Van Dijk et al. (2014). These yield curve data are available for maturities of 3, 6, 9, 12, 15, 18, 21, 24, 30, 36, 48, 60, 72, 84, 96, 108, and 120 months (M = 17), and importantly are not pre-smoothed using the Nelson-Siegel or Svensson model. To accompany the yield curve data, we include the macroeconomic variables in Table 2, which are provided by the Federal Reserve and retrieved from the FRED data download website. We consider two sets of predictors: a small set of p = 3 variables and a medium set of p = 12 predictors, which are transformed according to Table 2 and lagged by one month for forecasting. These variable sets are similar to those used in Koop (2013). While it is possible to incorporate concurrent (non-lagged) predictors in the DFOSR model, this requires augmentation of the dynamics in (15) and produces a more complex state space model with accompanying computational challenges.

Forecasting results
Point and 95% interval forecasts were computed for each month from 1990-2009, totaling 227 months for evaluation. Methods are compared using RMSFE and MFIW defined in Section 6, which are reported relative to the FDLM. We select the FDLM as the baseline because it incorporates some, but not all, of the proposed features in the DFOSR: it includes the model for {f k } from Section 3, but omits predictors and consequently does not include the time-variation in (4) or the shrinkage priors of Section 4.
To illustrate the challenge of the forecasting exercise, Figure 5 provides examples of the DFOSR yield curve forecasts in 2006 and 2008. There is considerable month-to-month time-variation in both the yield curve level and shape over the previous 24 months. Nonetheless, the DFOSR point and interval forecasts are accurate, and the interval forecasts are sufficiently precise that they include the realized yield curves while excluding many of the previous yield curve observations.

The forecasting results for the small set of predictors (p = 3) are in Figure 6. All models attain the nominal empirical coverage, yet each method is somewhat conservative. Comparing methods with and without the macroeconomic variables, we identify a small decrease in forecast interval width, which suggests that predictors provide slightly more precision in the forecasting intervals. Most notable, however, is the performance of the DFOSR, which provides substantial reductions in forecasting interval widths relative to all competitors. In particular, it is clear from comparing the FDLM, the FOSR-AR, and the DFOSR that (i) including the macroeconomic predictors produces narrower forecasting intervals, and (ii) this effect is heightened for time-varying regression coefficients. The DFOSR also provides the most accurate point forecasts, albeit with smaller improvements relative to competing methods.
The forecasting results for the medium set of predictors (p = 12) are in Figure 7. The larger set of predictors is more challenging for the TVP regressions, especially in the functional data setting. Nonetheless, the DFOSR again performs exceptionally well: the model for {f k } from Section 3 provides substantial gains in forecasting performance relative to the Nelson-Siegel and FPC bases, while the time-variation of the DFOSR model offers consistent improvements relative to the non-dynamic FOSR-AR model.
To further illuminate the differences between dynamic and non-dynamic regression coefficient functions, we plot the posterior expectations of α̃_{j,t}(τ) for each predictor j as a function of maturity τ across times t from 1970-2009 in Figure 8. We include the posterior expectations under both the DFOSR and FOSR-AR models for comparison. Many of the regression coefficient functions exhibit substantial time-variation in both shape and magnitude, which cannot be modeled by non-dynamic regression coefficient functions. The improvements in forecasting performance of the DFOSR relative to the FOSR-AR suggest that the time-variation in the regression coefficient functions is an important feature of the model.

Discussion
The proposed dynamic function-on-scalars regression (DFOSR) model features several essential components: (i) a nonparametric regression model for the unknown basis functions, (ii) time-variation in the regression parameters, and (iii) shrinkage priors that regularize irrelevant predictors and shrink toward time-invariance. These model features are synthesized within a fully Bayesian framework, with posterior inference available via an efficient Gibbs sampling algorithm. Simulation studies demonstrate the utility of these modeling choices, in particular for constructing precise forecasting intervals and estimating time-varying regression coefficient functions. The proposed model is evaluated empirically for modeling and forecasting yield curves using macroeconomic predictor variables. Compared to models that exclude time-variation in the regression coefficients and models that exclude the macroeconomic predictors altogether, the DFOSR forecasts are more accurate, and in particular provide narrower forecasting intervals that attain the correct nominal coverage.
The DFOSR model is a natural extension of existing functional regression and functional time series models. When x_t = 0 for all t, i.e., there are no predictors, the DFOSR model is a reduced-rank functional factor model with autocorrelated factors, which is useful for modeling and forecasting functional time series data (Hays et al., 2012; Aue et al., 2015; Kowal et al., 2017a). When the regression coefficients α_{j,k,t} = α_{j,k} are non-dynamic and the autoregressive coefficients vanish with φ_k = 0, we obtain a Bayesian FOSR model (Montagna et al., 2012; Kowal and Bourgeois, 2019). Similarly, constraining α_{j,k,t} = α_{j,k} but allowing φ_k ≠ 0 produces a Bayesian FOSR model with autoregressive errors. Each of these special cases was compared with the full DFOSR model on both simulated and real data in Sections 6 and 7, respectively.
Future work will extend model (2)-(5) to other important dependence structures, such as dynamic functional predictors X_{j,t}(u) for u ∈ U, possibly with different domains U ≠ T. The key simplification of the likelihood from Lemma 1 only requires the functional data likelihood (9) and the identifiability constraint F′F = I_K. Alternative models for the factors {β_{k,t}} in (3), including joint state space models for forecasting Y_t and x_t, may leverage the simplifications from Lemma 1 to achieve computational scalability in other modeling frameworks. Lastly, the ability to incorporate constraints for {f_k} in Theorem 1, although not applicable for the yield curve models, offers a promising approach for modeling unknown yet a priori constrained basis functions in functional and spatial data analysis.
Supplementary Material to "Dynamic Regression Models for Time-Ordered Functional Data" (DOI: 10.1214/20-BA1213SUPPb; .zip). Supplementary materials include (i) a document with the MCMC algorithm details, the thin plate spline construction, and additional simulation results, (ii) R scripts for the yield curve analysis and the simulation studies, and (iii) an R package which implements the proposed model and all competing models from Sections 6 and 7.