Bayesian Dependent Functional Mixture Estimation for Area and Time-Indexed Data: An Application for the Prediction of Monthly County Employment

The U.S. Bureau of Labor Statistics (BLS) publishes employment totals for all U.S. counties on a monthly basis. BLS use the Quarterly Census of Employment and Wages, where responses are received on a 6–7 month lagged basis and aggregated to county, and apply a time series forecast model to each county and project forward to the current month, which ignores the dependence among counties. Our approach treats these by-county employment time series as a collection of area indexed noisy functions that we co-model. Our model includes predictor, trend and seasonality terms indexed by county. This application is among the first in the U.S. Federal Statistical System to address the joint modeling of a collection of time series expressing heterogenous seasonality patterns between them. We demonstrate that use of a Fourier basis to model seasonality outperforms a locally-adaptive, intrinsic conditional autoregressive construction on our collection of time series where the degree of expressed seasonality varies. Countyindexed parameters of the 3 terms are drawn from a dependent Dirichlet process (DDP) prior to allow the borrowing of information. We show that employment of both spatial and industry concentration predictors into the prior probabilities for co-clustering among the counties produces better prediction accuracy. Our DDP prior accounts for the possibility that nearby counties may express distinct underlying economic structures. A feature of our joint modeling framework is that it computes efficiently to support the monthly BLS production cycle. We compare the performances of alternative formulations for the dependent Dirichlet process prior on monthly, county employment data from 2002–2016.


Introduction
The Local Area Unemployment Survey (LAUS) program of the U.S. Bureau of Labor Statistics (BLS) publishes employment and unemployment totals for local areas, in-cluding counties, across all states in the U.S. in coordination and partnership with the states. BLS administers the Current Employment Statistics (CES) survey, on a monthly basis, to business establishments in relatively large metropolitan statistical areas. While covering much of the U.S. population, the CES survey excludes 1751 relatively sparsely populated counties of the total 3108 counties in the continental U.S. The LAUS program utilizes the Quarterly Census of Employment and Wages (QCEW), a census instrument administered to all business establishments by BLS, for the purpose of collecting workplace employment information for the 1751 counties not covered by the CES, which we label "non-CES counties". There is, however, a 6-7 month availability lag for county estimates from the QCEW due to reporting requirements for establishments and subsequent internal BLS processing to flag and correct errors in these responses. As a result, the LAUS program applies a time series forecasting model to each non-CES county, separately, to project forward the QCEW employment to the current month. This approach ignores the dependence among the collection of county time series induced by similarities in their economic structures.
We devise a single model for all of the county time series and use this model to perform a 7-month prediction for all 3108 counties, simultaneously, in a fashion that allows the data to learn a dependence structure among the county time series.
Fixing a county, the time series for that county is modeled as the addition of trend, seasonality and predictor functional terms. We configure a nonparametric conditional autoregressive (CAR) (Rue and Held, 2005) prior for each of the trend and seasonality terms. The CAR prior is an equivalent to a smoothing spline model that utilizes local nearest neighbors to perform the smoothing, rather than shrinking to a global mean. The latent trend term is allowed to express a non-linear shape and the magnitude and shape of the seasonality may vary over the months. The noise is composed of measurement errors induced in the by-establishment reporting process.
The generating prior hyperparameters of each of the trend, seasonality and predictor coefficients are indexed by county. We allow the data to discover dependence among the counties by performing probabilistic clustering of these county-indexed parameters to achieve a non-parametric mixture distribution for each of the time series terms where we mix over counties. We employ a dependent Dirichlet process (DDP) prior framework of Müller et al. (2011) that uses a subset of predictors to construct a prior formulation for county co-clustering that increases the prior probability for the assignment of two counties to the same cluster to the extent that they express similar predictor values relative to the Dirichlet process (DP).  , 2002, -December, 2016. Years in 5-year increments are labeled on the horizontal axis (displayed in 2-digits), where we observe some level of pause or decline in employment around the period of 2008 during the so-called, "great recession". We expect the patterns expressed across the county-indexed collection of employment time series to demonstrate correlation with one another primarily based on their similarities in economic composition or structure and the overall health of their economies, rather than based on spatial adjacency. Our modeling goal is to capture these dependencies among the collection of non-CES county time series in order to improve the bias and efficiency properties of predictions for the missing 6-7 months as compared to the modeling of each county in a separate model. Simultaneously, we wish to construct our joint model for the collection of county time series to allow sufficient estimation flexibility to capture features that take place for a subset of months and a subset of counties, like the great recession. We utilize the full set of n = 3108 counties in our modeling, rather than the 1751 non-CES counties, in order to facilitate the borrowing of information for prediction of by-county employment totals. Our modeling approach must computationally scale with sufficiently fast turnaround to support the BLS monthly production cycle. The modeling of 3108 county times series, simultaneously, is facilitated by our use of probabilistic clustering over county-indexed model parameters described in the sequel. The clustering reduces the effective dimensionality of the 3108 sets of county-indexed parameters to the total number of estimated clusters.

Introducing the QCEW Employment Data
Figure 1: Randomly-selected by-county time series of employment total, from Jan, 2002-Dec, 2016. The horizontal axis denotes year in 5-year increments (labeled with the last 2-digits of the year). Employment totals reflect the reported number of employees for employers in each county, which is a place-of-work, rather than a place-of-residence measure.
The QCEW data are characterized by both evolving trends and 12-month seasonal patterns. Figure 2 displays the time series for Albany County, NY, where we observe both a long-term trend, including the downturn and recovery from the great recession, as well as a recurring 12-month seasonality. The horizontal axis denotes year in 5-year increments (labeled with the last 2-digits of the year). Employment totals reflect the reported number of employees for employers in each county, which is a place-of-work, rather than a place-of-residence measure.
The QCEW by-county data includes predictors that we use in both our fixed effects term and for our predictor-dependent clustering. The key predictor type that drives the prior distribution of co-clustering of county-indexed parameters is the "location quotient" (lq), defined as an index ∈ [0, 1] that quantifies the relative employment concentration of a particular industry in a county as compared to the national average. We use location quotients for industries constructed from the first 2-digits of the 6-digit detailed industry code assigned to each establishment under the North American Industry Classification System (NAICS) (which we denote as "economic sectors"). Economic sectors include; Construction, Transportation, Services, Leisure, Public, Mining, Manufacturing, Information, Education. We assemble the unemployment insurance (UI) claims in each month for each county from data provided by the states. The trend of the UI claims reveals provides a measure of the changes in the relative economic health of a county. We also utilize longitude and latitude as spatial predictors in the fixed effects term and also for our predictor-dependent clustering to capture residual spatial dependence not accounted for by the lq predictors.
We introduce our probability model in Section 2 that employs terms to capture seasonality, trend and incorporate predictors into county-indexed regression models, tied together in a predictor-dependent Dirichlet process mixture. We outline computations for the full conditional posterior distributions in Section 3, where we introduce the two fit statistics we use to compare performances of our suite of models applied to prediction of QCEW in Section 4. We conclude with a discussion in Section 5.

Model for Collection of Time Series
We construct our data matrix of county employment totals as n × T, Y, which we treat as a collection of county-indexed latent functions of time with additive noise. Our approach contrasts with a model that defines a spatially-indexed process that may evolve over time. We select the former modeling approach due to the presence of seasonality, which is more difficult to capture with the latter approach in a flexible, nonparametric construction. We do, however, later compare our approach to one implementation of a time evolving spatial process.

Model for Single County Time Series
We begin by fixing a county in some row of Y and constructing our additive functional priors specification for the 1 × T county time series.
Our likelihood is formulated as, where N (·) and G(·) denote the normal and gamma distributions, respectively, with the latter parameterized with (shape, rate) hyperparameters. The first term, x j β, includes the influence of predictors for the county estimated function for month, j ∈ (1, . . . , T ). The 2 random effect terms, (γ 1j , γ 2j ), capture residual trend and seasonality patterns (not explained by the predictor term), respectively.

Multivariate Gaussian Mixture for Fixed Effects Term
The p × 1 vector of predictors, x j , includes latitude and longitude, by-industry location quotients, and the level of unemployment insurance claims for each county and month. Regression coefficients, β, are, in turn, drawn from,

Conditional Autoregressive Prior for Random Effects Terms
We implement a conditional prior formulation for the trend and seasonal random effects that induces local smoothing with, Fixing a random effects term, m ∈ (1, 2), the prior specification for (γ mj ) j=1,...,T in (4) is intrinsic conditional autoregressive (Rue and Held, 2005) over the T months to capture the autocorrelation in the time series for each county. The intrinsic conditional autoregressive prior (ICAR) encodes a sparse precision matrix through a T × T adjacency matrix, Ω m = (ω mjr ), where ω mjr = 1 indicates j ∼ r, (which is shorthand for j is a "neighbor of" or "communicates with", r), for random effect term, m ∈ (1, 2); otherwise, ω mjr = 0. The diagonal entries of Ω m are set to 0; e.g., ω mjj = 0, ∀j ∈ (1, . . . , T ). The mean in the conditional distribution of γ mj is proportional to an neighbor-weighted average of the (γ mr ) over all r ∼ j for each month, j, and the average is normalized by d mj = r∼j ω jr , which denotes the number of neighbors of γ mj . Since ω jj = 0, ∀j ∈ (1, . . . , T ) and ω jr = 0, ∀r j, we may re-expressγ mj in (6) as the inner product, We note that our prior constructions for (γ m ) are a priori and a posteriori independent over terms, m = 1, 2, given the parameters of the conditional autoregressive models.
We utilize 2 random effects terms: 1. A trend term; 2. A 12-month seasonality term. Both are defined on a line of T = 180 months. The intrinsic conditional autoregressive prior for the seasonal term constructs, Ω 2 , with seasonal increments, (γ 2j , γ 2(j+1) , . . . , γ 2(j+O−1) ), for j = 1, . . . , T − O + 1, which produces a joint density for T × 1, γ 2 proportional to, where O = 12 denotes the period of seasonality. The bandwidth of seasonal adjacency matrix, Ω 2 , is O (Rue and Held, 2005). Our construction for seasonality is nonparametric where the data are able to learn a varying or evolving seasonal pattern over time.
The trend term (m = 1) employs a so-called "random walk of order 1" (RW1) construction, with increments, γ 1j − γ 1(j−1) and γ 1j − γ 1(j+1) , where for each month j, Ω 1 is constructed such that the month before, γ 1,(j−1) and the month following, γ 1(j+1) are deemed neighbors of γ 1j . The beginning and last months of the time series communicate with only a single neighbor. Both conditional autoregressive terms induce probabilistic local smoothing within increments defined by first differences for the trend term and by the sum of O months for the seasonal term. The structure of the (Ω m ) are band diagonal with widths proportional to the lengths of the increments. (See Rue and Held (2005) for more details.) The implied joint distribution for term m ∈ (1, 2), γ m = (γ m1 , . . . , γ mT ) = , where precision matrix, Q m , is rank deficient such that the joint distribution is improper. Under our RW1 prior for the trend term, γ 1 , the overall mean level is undefined, since the prior is constructed on increments specified as differences of random effect values for neighboring time periods (on a line), which produces rank T −1 for Q 1 . Similarly, the O = 12 order of seasonality term produces Q 2 to be of rank T − O + 1 since the joint prior is defined up to any repeated seasonal pattern, (c 1 , . . . , c O ), that sums to 0 ( O o=1 c o = 0) in each sequence of O values. The rank deficiencies arise because the ICAR prior is defined locally on neighboring values in a moving average construction, rather than globally across the T months for each county. Even though the ICAR prior produces an improper joint distribution, the conditional distributions are, nevertheless, proper.
A general Bayesian dynamic linear model (DLM) is specified with where F t is a predictor matrix and G t denotes an evolution matrix and ν t and t are independent noise processes with global scale parameters (West, 2013;Prado and West, 2010). For a trend term, the G t is diagonal with each parameter bounded ∈ (−1, 1), which induces a global (as contrasted with a local neighborhood of time points) shrinking of all γ t . Our trend term under the CAR prior is equivalent to setting the diagonals of G t to 1, which induces a nonparametric random walk neighborhood smoother. Our choice of the first difference under our CAR formulation uses time points immediately following and after each focus time point for smoothing γ t . Seasonality under a DLM specifies G t = φH a , where H a is a sinusoidal term for angle, a ∈ (0, 2π) and φ ∈ (0, 1). H a encodes a parametric, fixed seasonal smoothing pattern. Our CAR prior for seasonality is more flexible as the seasonality pattern is permitted to evolve over time, though we see in the sequel that this flexibility confounds the clustering of seasonal term patterns over counties and induces undesired spiky artifacts. In contrast to local seasonal smoothing, our use of the Fourier basis for modeling seasonality provides a semi-parametric global estimate of seasonality that we later introduce as an alternative to the CAR prior.
We adjust the ICAR prior for the modeling of the seasonality term, γ 2 , by including a global parameter, ρ m ∈ [0, 1), that we intend to allow better estimation of low seasonality expressed as a dampened (longer) wavelength or length scale.
Fixing ρ m = 1 parameterizes the rank-deficient improper joint distribution of the ICAR prior. The parameter, ρ m , is often interpreted as a strength-of-correlation parameter (Banerjee et al., 2003) because the dependencies indexed in Ω m are included in proportion to the value of ρ m . Another way that we may interpret ρ m is as a length scale adjustment parameter that determines the degree of smoothness in the estimated T × 1 function, γ m (Mukherjee et al., 2011). The maximum length scale is hard-coded based in the bandwidth of T × T Ω m and ρ m allows the data to adjust the length scale in the T × 1 function to be larger. Values of ρ 2 closer to 1 will produce the minimum length scale that estimates a higher level of seasonality. The introduction of ρ m , however, induces a global behavior over the T × 1 time series for joint shrinking of all of the (γ 2t ) T t=1 , which is a similar behavior to the DLM where for a uni-dimensional γ t , ρ 2 is equivalent to φ, though the seasonality pattern is still more flexible than for a DLM. We label this model as a proper conditional autoregressive model (PCAR) in order to distinguish it from the ICAR model. Note that we also use ρ 1 for prior on trend term, γ 1 , under the PCAR option for ease of implementation, though the data estimate ρ 1 ≈ 1, so it makes no difference in the modeling of the trend term.

Fourier Basis Alternative for the Seasonal Effects Term
The PCAR alternative dampens the expression of seasonality in the predictions for low-seasonality counties, but we later show that it also tends to dampen estimated seasonality for high-seasonality counties, which is an undesired effect. The length scale adjustment parameter, ρ m , is difficult to estimate because each ρ m parameterizes a precision matrix and the degree of estimated seasonality is very sensitive to its value. By contrast, the ICAR prior does the opposite by over-amplifying seasonality in the predictions for low-seasonality counties due to its local smoothing properties that use a subset of time points to estimate each monthly seasonal coefficient. While the modeling of seasonality is well-understood in application to a single time series, there is little guidance on prediction of seasonalities for a collection of time series in a single model.
We are motivated to construct a sinusoidal basis to augment our predictors to provide a more stable estimator for seasonality that can accommodate both high and low expressions of seasonality. The Fourier basis employs predictors, z j , constructed as an orthogonal trigonometric basis, for an even value of O, the periodicity of our repeated pattern, which is O = 12 months for our QCEW time series. This set-up constructs a basis of oscillating functions that, together with a global intercept, spans any repeated pattern of periodicity with order, O. Wei (2006) propose the use of Fourier functions as an orthogonal basis in the time domain for modeling of periodic functions (e.g., seasonality) that contrasts with the usual use of Fourier functions to transform from the time to the frequency domain (which we do not do). Montesinos-López et al. (2018) compare B-spline and Fourier bases for the modeling of seasonality in functional data regression and conclude that the Fourier basis performs best.
Our analysis suggests to us that best results are achieved by incorporating the Fourier basis, z j , into an augmented predictor vector, x j ← (x j , z j ) (in contrast to creating a separate term for Fourier basis), and modeling them as part of the fixed effects term in (3). The Fourier basis construction estimates seasonality, globally in time, across each county time series. While less flexible than the ICAR prior and more similar to a DLM, we expect this approach to better adapt to variations in the amplitude of seasonality across counties. We retain the ICAR parameterization for the trend effects, γ 1 , but drop the γ 2 , under this alternative Fourier basis set-up.
Prediction of seasonality in the presence of our predictors in an augmented x j induces a moderate amount of over-shrinkage in the prediction of seasonality. The over-shrinkage is completely eliminated by replacing the predictor matrix with its principal components where we discard any redundant components (with zero eigenvalues). This step is similar to the Moran's I basis construction (Hughes and Haran, 2013) used for the modeling of spatial association (which we discuss, in detail, in the sequel as a competitor for our modeling framework). Both approaches employ a spectral rotation.
We have defined 3 alternatives for the modeling of seasonality among the collection of time series: 1. ICAR prior on random effects, γ 2 ; 2. PCAR prior on random effects, γ 2 ; Fourier basis (FB), z j , incorporated into the fixed effects term with coefficients β, in lieu of γ 2 .

Dependent Dirichlet Process
We now expand our model from a single county model of (1), to the collection of 3108 counties by indexing our observed response, y j and the fixed effects, β j , and our trend and seasonality terms, (γ 1j , γ 2j ), by county, i ∈ (1, . . . , n) in the expanded likelihood, We collect the county-indexed generating hyperparameters for the predictor, trend and seasonality terms into θ i := (μ i , Λ i , (τ γ,1i , ρ 1i )) (in the case we use a Fourier basis to model seasonality). The indexing of θ i by county, i ∈ (1, . . . , n) induces a mixture formulation over county-indexed θ i for each of the random effect terms. We cluster the county-indexed generating distribution hyperparameters by using a nonparametric mixing measure. We index the generating parameters for all of the Fourier basis seasonality, fixed effects coefficients and trend terms by county because there is a substantial heterogeneity among the expressed time series patterns across the 3108 counties of the U.S., as we see in the illustrations of Figure 1, such that we seek a joint prior distribution that provides a large support over time series patterns. (In the case we use an ICAR or PCAR prior for seasonality in lieu of the Fourier basis, their generating patterns are also indexed by county.) To make this choice intuitive, we note that if we fail to index the generating parameters of the seasonality term (for any of the Fourier basis, PCAR or ICAR alternatives) we would fail to account for the massive heterogeneity in the presence of and patterns for expressed seasonality over the counties.
The predictors in the fixed effects term may capture some dependence among the counties induced by similarities in industries, spatial location or economic conditions. The use of location quotients may be especially useful to avoid over-smoothing based on spatial location; for example, two nearby counties (one rural and the other urban) may express very different economic structures, despite their spatial adjacency. Trend and seasonality components are expressed to varying degrees among the counties. We index the 3 terms by county, i ∈ (1, . . . , n), in order to capture heterogeneity among the counties, regulated by a partition model (clustering) prior that allows the data to learn additional dependence among the collection of county time series. We show in the sequel that using the lq and spatial predictors to influence the prior probabilities of co-clustering among the counties, in addition to their use in the fixed effects term, substantially improves prediction accuracies. Each time series, y i = (y i1 , . . . , y iT ) , is scaled to variance 1 in order to allow the basis of clustering to be the shape of the functions, rather than their magnitudes.
Our use of a nonparametric prior to be next introduced for clustering county-indexed functional generating parameters is of a similar idea to that of Gelfand et al. (2005), but they impose the Dirichlet process prior directly on the function parameters (e.g., on (β i , γ 1i , γ 2i )), so that co-clustered functions must be exactly the same, which we find to be too restrictive for working with our QCEW employment data application. Instead, we impose our nonparametric clustering prior on the generating parameters, θ i , of the prior distributions for the functions such that functions which are co-clustered are drawn from the same distribution, but are not required to be exactly equal.
Our mixing measure, F , is used to draw θ 1 , . . . , θ n |F ∼ F . F is not subscripted by a parameter to indicate that we will, in turn, sample the unknown distribution, F , from a nonparametric Dirichlet process (DP) prior, F |α, F 0 ∼ DP (αF 0 ). The DP may be specified by expressing F in a discrete, stick-breaking construction (Sethuraman, 1994), where F is a countably infinite mixture of weighted point masses (or "spikes") with "locations", θ * 1 , . . . , θ * K , in the support of F indexing the unique values for the (θ i ), where K ≤ n. We record cluster memberships of counties with s = (s 1 , . . . , s n ) where s i = k denotes θ i = θ * k so that {s, (θ * k )} provides an equivalent parameterization to (θ i ) and we recover θ i = θ * si . The DP is induced by placing priors on the weights, p k , and the locations, θ * k . Each location is drawn from θ * k ∼ F 0 . Under the alternative using the Fourier basis, F 0 is constructed as: where Huang-Wand (Λ|δ, a 1 , . . . , a p ) denotes the prior of Huang and Wand (2013) defined for covariance matrices that generalizes the inverse Wishart prior to produce marginally noninformative half t-distributed priors with δ degrees of freedom for the variances. Choosing δ = 2, produces U(−1, 1) priors for the correlations. The prior for a precision matrix is constructed by forming Q = 2δdiag (a 1 , . . . , a p ) with a ∼ G (1/2, 1) , = 1, . . . , p. We proceed to draw Λ ∼ W p (ν, Q) where ν = δ + p − 1. By contrast, the usual Wishart prior would fix the diagonals of Q. We set δ = 2 to produce a weakly informative prior for the variances.
The weight, p k ∈ (0, 1) is composed as Be (1, α). This construction provides a prior penalty on the number of mixture components, but we also see that a higher value for α will produce more clusters (unique locations). Since the locations are drawn from F 0 , as the number of unique locations increases, the estimated F approaches the base distribution, F 0 . Under this construction, E (F ) = F 0 .
The joint prior for cluster assignments under the DP prior is stated with, after marginalizing out the random measure, F , where, n k = n =1 I (s i = k) denotes the number of counties assigned to cluster, k. We note that the DP prior assigns counties, i ∈ 1, . . . , n to cluster, k ∈ (1, . . . , K) proportionally to the popularity of cluster, k, as measured by its size, n k .
We, next, generalize the above-specified DP formulation. We induce predictor-assisted clustering by, first, jointly modeling our response variable, y ij , along with a subset of selected predictors, ξ ij ⊂ x ij . We specify the additional likelihood, and, second, include the parameters of this likelihood in an augmented, on which we, third, impose prior F that, in turn, receives a DP prior. Our intent is to increase the a priori probability for co-clustering of counties, i and , proportionally to how similar are the values for the q × T , predictors, Ξ i = (ξ i1 q×1 , . . . , ξ iT ) and Ξ . Augmenting θ i to include the mixing parameters of (14) adjusts (13) to incorporate information about the predictors that produces, where, such that g(Ξ * k ) constructs the similarity function of Müller et al. (2011). If counties i and share similar values for their predictors then there is a higher prior probability that they will be co-clustered and (μ ξ,i , Λ ξ,i ) = (μ ξ, , Λ ξ, ) = (μ * ξ,k , Λ * ξ,k ). The integration outlined in (16)  The integration of (16) and incorporation of g(Ξ * k ) into (15) is accomplished by simply including (μ ξ,i , Λ ξ,i ) ∈ θ i , under the DP prior construction. Augmenting θ i in this way may be viewed as a computational device to produce (15). We don't believe that the predictors, Ξ i , are random, but treat them as such in order to insert the similarity functions, (g(Ξ * k )) k=1,...,K into the prior for s. By doing so, we have estimated a complicated predictor-dependent Dirichlet process mixture model with a much simpler Dirichlet process model through imposing a probability model on our target predictors.
The insertion of predictor information into the prior for cluster assignments can reduce prior uncertainty for the cluster assignments of establishments who share similar predictor values with a large number of other establishments; nevertheless, as the number of predictors grows, the prior uncertainty will grow because it is defined on the space of predictor values. So care is warranted in the choice of which and how many predictors to use for indexing the prior for cluster assignments.
Augmenting θ i to include the generating parameters, (μ ξ,i , Λ ξ,i ), extends F 0 to, There are similar approaches to ours that construct spatial-or predictor-dependent DP priors that could be used to induce clustering over counties; for example, the probit stick breaking construction of Rodríguez and Dunson (2011) specifies a stick-breaking breaking construction as do we for F in (11) with weights, p k (s) = Φ(α k (s)) r<k (1 − Φ(α r (s))), where Φ(·) denotes the cdf of a Gaussian distribution, for each spatial location, s. They then suggest using a Gaussian process prior over the {α k (s i )} n i=1 to encode a spatial dependence (over counties). Hossain et al. (2013) similarly utilize the stick breaking construction to induce a dependence among the weights with , where x ki are general predictors and δ ki are regression coefficients. Rodríguez and Dunson (2011) note that the incorporation of by-county dependence through the weights, rather than the atoms/locations in the stick-breaking construction (as we do in our method), has the notable advantage that it places some prior mass on an independent prior (whereas our formulation does not in that there is always some non-zero probability of co-clustering, no matter how far apart are predictor values). Since Hossain et al. (2013) allow incorporation of any predictors, not just spatial locations, it could accommodate our use of the lq predictors (which we know from experience are more relevant than are spatial locations for our class of economic data). Yet, we do not pursue these weight-dependent approaches because they are simply computationally infeasible for an application of our size. The elegance of our approach is that while it constructs a predictor-dependent Dirichlet process, it is estimated as an ordinary DP simply by augmenting θ i with predictor model hyperparameters of Ξ i . So we are able to utilize efficient DP estimation algorithms, such as Algorithm 8 of Neal (2000b).
Another class of alternatives model a smooth spatial dependencies among the collection of county time series. Nobre et al. (2011) generalize the DLM setup with x t (s) = p =1 φ (s)x t− + t (s) for spatial location, s where |φ (s)| < 1 where here p denotes the autoregressive (AR) order. The restriction on φ (s) makes estimation difficult, so they reparameterize the set-up using the characteristic polynomial, H (s), and the backshift operator, B, to achieve, p =1 (1 − BH )(s)x t (s) = t (s). While they accomplish efficient estimation, their focus is confined to modeling spatial associations among the time series, which is insufficient for our application to economic, employment data. Our industry-indexed lq predictors are much more powerful for this purpose in predicting both association and heterogeneity. Our computation is also expected to be even more efficient than is theirs because clustering induces a dimension reduction over countyindexed parameters.

Summary of our Joint Model over Counties
We summarize the probability model for the joint likelihood (y ij , ξ ij ) for our response and subset of predictors, ξ ij ⊂ x ij . We co-model ξ ij in order to extract the countyindexed parameters and include them in the prior probability for cluster assignments of counties to induce a higher prior probability of co-clustering among counties sharing similar predictor values. This procedure is used to estimate the conditional distribution for y ij |x ij in a fashion that utilizes predictors in both the fixed effects term and in the prior probabilities of co-clustering.
Likelihood for predictors used for clustering,

Proper CAR prior on both Trend and Seasonality terms Employ seasonal term, (γ 2i ), in lieu of Fourier basis in augmented x ij
The ICAR prior alternative is achieved from PCAR by removing ρ mi = ρ mi × C mj γ mi (22e)

Computation
We utilize a sequential Gibbs scan to construct our MCMC (posterior) sampler for all of our models, under which each parameter set is sampled from its full conditional posterior distribution. Random effects, (γ mij ), are drawn from conjugate univariate normal conditional distributions over the T time points for each county, i ∈ 1, . . . , n.
Coefficients, (β i ), of the predictor term (including the Fourier basis) are sampled from conjugate multivariate Gaussian conditional distributions. The n × 1 cluster assignment parameter vector, s, is sampled from a nonconjugate posterior conditional distribution using the well-known algorithm 8 of Neal (2000b), which nominates new cluster locations ahead of sampled observations that may be selected on each posterior draw in the cluster assignment, s i , for each county, i = 1, . . . , n. The posterior probability for cluster assignment of each county multiplies the prior distributions for each of the parameters in θ i into the prior probability for cluster assignment. Parameter locations, (Λ * ξ,k , μ * ξ,k ), associated with predictors, (ξ ij ), use the predictor likelihood for (14) to achieve our predictor dependent Dirichlet process construction. Conditioned on the (a q ) parameters of the Huang-Wand priors, the precision matrix locations, (Λ * k , Λ * ξ,k ), are sampled from Wishart distributions. The locations for length scale parameters of the PCAR model, (ρ * mk ) ∈ [0, 1), are sampled from the univariate slice sampler of Neal (2000a). The remaining parameters are precision parameters and are sampled from conjugate univariate gamma conditional distributions.
All of our models are implemented in R Core Team (2014) using C++ (for fast computation) wrapped into easy-to-use R functions, which we will make available on request. Each model is run for 50,000 iterations and half are discarded as burn-in. Convergence of the MCMC sampler is assessed by employing a fixed width estimator with Monte Carlo standard errors (MCSE) computed using the consistent batch means (CBM) method (Jones et al., 2006).
The Fourier basis DPP model (DDP-FB) takes approximately 63 hours or 2.6 days to complete the 50,000 iterations over our n = 3108 counties, each observed for T = 180 months, in a single thread of an Intel Core i7 processor. The ICAR DDP (DDP-ICAR) and PCAR DDP (DDP-PCAR) models take about 10% longer than DDP-FB. This computation time readily supports the BLS monthly production cycle and we note in the sequel that our chosen model provides high quality uncertainty quantification, which is important information that the LAUS program provides to the States to accompany the predictions.
We employ two prediction error criteria to compare the relative performances of our models: 1. The root mean-squared prediction error (RMSPE) based on the squared difference between the predicted values (obtained from the model posterior predictive distribution) and true values for y ij for the 7 predicted months over all 3108 counties; 2. A mean absolute prediction error percentage (MAPE-C), separately computed for each county, and then averaged over all of the counties. This criterion equally weights a 5% error in a tiny county, like Loving County, TX (with population of about 133) with a 5% error in Harris County, TX (which contains the city of Houston). On the one hand, accurate estimation of employment totals in every county is important to the States and it is for this reason that the LAUS program chose this criterion for assessing prediction accuracy. On the other hand, perhaps a state would accept a relatively small increment in error for a relatively less populated county in exchange for a decrement in the predicted error for a relatively more populated county. This criterion intends to highlight the uniform prediction performance of the models across all of the counties estimated by the LAUS program. The LAUS survey program restricts computation of the MAPE-C to the 1751 subset of non-CES counties, which are smaller and more heterogeneous than the larger CES counties. They make this restriction to non-CES counties for computating MAPE-C because they use the available survey estimates for the CES counties, which are much more stably estimated due to their relatively large sample sizes. The computations of MAPE-C for the model alternatives in this paper are performed by the LAUS survey program.

Application to Employment Prediction for QCEW
We apply our dependent functional mixture model alternatives to the prediction of county employment from the QCEW during the lag period. QCEW employment levels from January, 2002-December, 2016 are used for this purpose, where we exclude the last 7 months (June-December, inclusive) for model estimation and subsequently conduct predictions for the excluded, last 7 months of 2016 by drawing (y ij ) i=1,...,n;j=174,...,180 from its posterior predictive distribution (which marginalizes over the posterior samples for the model parameters). The prediction for each county and month is formulated as the mean its posterior predictive distribution. Our fixed effects, x j include the 9 industry-indexed lq predictors, UI claims and latitude and longitude, the latter two predictors capture residual spatial dependence. The subset of predictors, ξ ij , use for our predictor-assisted DDP formulation are the location quotients and the latitude and longitude. Using location quotients in the DDP assigns a higher probability for co-clustering to two counties where the underlying economic structures are similar.
We begin by comparing the relative performances in the predictions of seasonality for both counties that well express seasonality (high-seasonality counties) and counties that express little seasonality (low-seasonality counties) because this comparison determined our model selection. The green, short-dashed line in Figure 3 during the 7-month prediction period corresponds to predictions using the ICAR prior (which sets the (ρ 2i ) = 1). We label this model as "DDP-ICAR", where "DDP" indicates that the ICAR prior is embedded in the predictor-dependent Dirichlet process prior. The DDP-ICAR induces spikiness in many of the counties where seasonality is lightly and irregularly expressed due to the local, moving average manner in which seasonality is estimated where only a subset of months is used to estimate each seasonality coefficient, γ 2ij . Most low-seasonality counties are highly irregular in their expressions of seasonality. The clustering appears to confound the estimation because co-clustered counties that express even small differences in seasonality amplitudes may induce a spike in the county whose data expresses a smaller seasonality amplitude. For example, the observed time series counties like Uintah County, UT, displayed in Figure 1 express highly irregular patterns. There is essentially too much flexibility in the estimation of seasonality under the DDP-ICAR model with the result that misclustering induces large local (for a subset of months) spikes. Though most of response values in the prediction interval are well-predicted, the spike induces a large error on one or two data points. Figure 3: Comparing the prediction performance of the ICAR prior (green, short-dashed line) to a PCAR prior (blue, long-dashed line) and to the Fourier basis prior (pink, solid line) for a county where seasonality is less expressed for the 7-month prediction period. The hollow circles represent the data points. The shading represents 95% credible intervals.
As we expected, allowing the data to estimate the (ρ 2i ) < 1 in the PCAR construction allocated the low-seasonality counties to clusters of with ρ * 2k ≈ 0.7, which produced more smoothing and eliminated the spike as seen in the blue, long-dashed line. We label this model as "DDP-PCAR". There is, however, a bit of oversmoothing of the seasonal term under the PCAR model relative to the simpler FB model, shown in the pink, solid line. The degree of estimated seasonality is particularly sensitive to ρ 2i because ρ 2i multiplies by a sparse adjacency matrix that fully determines the seasonal amplitude and pattern. While we achieve good mixing in our MCMC for ρ 2i using a single parameter to determine the degree of seasonality allowed into the model, the adjustment is too coarse.
The FB model appears to perform relatively the best in terms of capturing small magnitude month-to-month variations, a performance advantage that holds across lowseasonality counties. The shaded regions in each plot panel represent the 95% credible intervals and we note that DDP-FB is relatively more efficient in uncertainty quantification than are the other two models. The FB uses a fixed set of oscillating predictors and the coefficients are global across months. It represents the best balance between the coarse PCAR and the overly flexible ICAR.
Comparing the same models for a typical high-seasonality county in Figure 4 reveals that the DDP-PCAR dramatically oversmooths seasonality, which is, by contrast, very well predicted by both the DDP-ICAR and DDP-FB, though the DDP-FB expresses a slightly higher degree of smoothness, which likely accounts for its prediction stability on low-seasonality counties. The DDP-FB model dominates the other two in prediction accuracy over both low-and high-seasonality counties such that it achieves both the lowest Mean Squared Prediction Error (MSPE) and MAPE-C, as shown in Table 1. Figure 4 also reveals that prediction uncertainty is more efficient for the DDP-FB model. Accurate uncertainty quantification is important because it provides the states with valuable inputs on the relative confidence in the predictions that helps focus dialogue between the LAUS program and states regarding state judgments about their economic conditions. Even the DDP-PCAR achieves lower prediction fit statistics than does the DDP-ICAR because the spikes induced in low-seasonality counties by DDP-ICAR induce large errors and because there are relatively more counties that expressed relatively weak seasonality patterns. The ρ * k for the length scale adjustment location parameter for clusters containing high-seasonality counties are estimated at ≈ 0.9, which produces dramatically attenuated seasonal patterns, even though the value is close to 1. Mixing is very good for clusters with high values of ρ * k , which suggests a limitation of the model construction, in practical application. As mentioned, the parameterization is overly coarse.
The above comparison between the three models encourages our selection of a model with FB to perform the by-county prediction of employment totals. Figures 5, 6, 7 compare use our predictor-assisted dependent Dirichlet process (DDP) clustering prior to use of a simple Dirichlet process prior (that excludes predictor information in the prior probabilities of co-clustering), both under the Fourier basis construction for seasonality. We label the latter model as "DP-FB". We used all of the predictors, including the by-industry location quotients, the latitude and longitude (computed on population centroids) and level of unemployment insurance claims to influence the prior probabilities Figure 4: Comparing the prediction performance of the ICAR prior (green, short-dashed line) to a PCAR prior (blue, long-dashed line) and to the Fourier basis prior (pink, solid line) for a county where seasonality is highly expressed for the 7-month prediction period. The hollow circles represent the data points. The shading represents 95% credible intervals.
of co-clustering for the counties. The figures readily reveal that the DDP-FB uniformly outperforms the DP-FB, as confirmed by prediction error statistics in Table 1. It is interesting to note, however, that DP-FB outperforms DDP-ICAR, which highlights the importance of the Fourier basis construction for the prior on the seasonality term. We additionally tested all subsets of predictors for the DDP and found that including them all, together, performed best. It also bears mention that including only the spatial predictors in the DDP also reduced the prediction performance of DDP-FB, indicating that the non-spatial predictors are important to improve the clustering performance.
All of our models enumerated in Section 2 configure our data as a collection of time series that are spatially linked through our dependent Dirichlet process prior construction that is indexed by latitude and longitude (as well as other predictors). In other words, our models are parameterized as spatially-indexed time series. By contrast, we may also model these data as a time evolving spatial process by proposing a spatial model over counties that is allowed to evolve over time. We accomplish the latter modeling framework with, Figure 5: Comparing the prediction performance of predictor-assisted DDP clustering (pink, solid line) to a simple DP prior (turquoise, dashed line) plan for a county where seasonality is less expressed for the 7-month prediction period. The hollow circles represent the data points. The shading represents 95% credible intervals.
where we have replaced the time series random effects terms, M m=1 γ mij , used in (1) with a spatial basis construction, where the n×n H basis matrix, H j = (H ij ) i, , encodes the spatial dependence among the counties. It is multiplied by n H × 1 coefficients, η j . The η j are indexed by month, j ∈ 1, . . . , T , so they are allowed to evolve the spatial process over time. We next briefly introduce the construction for the basis matrix, H j , below, where each column represents a spatial contrast that we label as a "harmonic"; for example, the first two low resolution contrasts (from left-to-right) include north-south and east-west. Please see Hughes and Haran (2013) for a more detailed explanation. We construct and implement this model (also available in our R Core Team (2014) implementation) as an alternative to our approach.
We utilize the generalized Moran's I operator for n × p design matrix, X j , that projects an adjacency indexed spatial basis onto the space orthogonal to that defined by X j in order to avoid confounding and inflating estimates of posterior uncertainty (Hughes and Haran, 2013). The Moran's I operator is defined on a graph, G = (V, E), where V denotes the n vertices and E the edges between pairs, (i, i ) ∈ (1, . . . , n) counties. Define the n×n adjacency matrix, Ω η = (ω ii ), where ω ii = 1(i, i ∈ E, i = i ), with diag(Ω η ) = 0. The associated spatial conditional autoregressive formulation for the Figure 6: Comparing the prediction performance of predictor-assisted DDP clustering (pink, solid line) to a simple DP prior (turquoise, dashed line) plan for a county where seasonality is more expressed for the 7-month prediction period. The hollow circles represent the data points. The shading represents 95% credible intervals.
n × n precision matrix is accomplished with, Q η = Ω η 1 − Ω η , where each diagonal entry of Q η counts the number of neighboring counties for that county. Construct P j = X j (X j X j ) −1 X j to be the projection onto the space defined by X j and its orthogonal complement, P ⊥ j = I n −P j . Hughes and Haran (2013) formulate the generalized Moran's I operator with, where the n × n eigenmatrix, H j , is employed as a spatial basis matrix. The columns of H j are harmonics that express the frequencies of spatial association, with higher frequencies corresponding to rougher dependence structures. The frequency of the contrasts and resulting resolution of spatial smoothing increase from left-to-right in H j , so it is typical (Hughes and Haran, 2013) practice to select 10% of the columns to achieve a high quality estimation with substantial dimension reduction to facilitate computation. We follow this practice, so n H = 0.1 × n denotes the number of columns or harmonics used for estimation. We label this model as "MI-T" to denote that it is a Moran's I spatial model for the counties that is allowed to evolve over the months. Figure 8 demonstrates that MI-T is both biased and less efficient in prediction relative to DDP-FB, even on this relatively low seasonality county where, perhaps, we would expect a model constructed as a time-evolving spatial process (like MI-T) Figure 7: Comparing the prediction performance of predictor-assisted DDP clustering (pink, solid line) to a simple DP prior ((turquoise, dashed line)) plan for a tiny, idiosyncratic county for the 7-month prediction period. The hollow circles represent the data points.
to perform relatively well. We, additionally, configured models that employed both a Moran-I's basis, n H =1 H ij η j and the time series random effects terms, M m=1 γ mij (including both seasonality and trend as one alternative and solely seasonality as another alternative). The fit of the expanded model was very similar to the models employing only time series terms and the prediction accuracy was worse. Table 1 compares the root mean squared predictions errors for our suite of models. We see that DDP-FB performs the best on the predicted months, both for RMSPE computed on all 3108 counties and MAP E − C, computed on the 1751 CES counties. We recall that MAPE-C computes a percentage absolute error on each county and the score averages over the counties, so it treats relatively smaller counties as equivalently important to larger ones. Table 2 presents the distribution of MAPE-C by size of county for each of the DDP-FB and DDP-ICAR models. Firstly, we see how DDP-FB performs uniformly better across counties of different sizes. Secondly, we see how the two models converge as the county size category increases. So DDP-ICAR performs particularly badly on small counties, due to the instability in estimation of seasonality on lowseasonality counties, which are predominant among smaller-sized counties, as discussed above.
We propose to provide state analysts with the posterior predictive mean value for each county and month under the DDP-FB model as the predicted point estimate for Figure 8: Comparing the prediction performance of predictor-assisted DDP clustering (pink, solid line) to a time varying Moran's I spatial process (turquoise, dashed line) for the 7-month month prediction period. The hollow circles represent the data points.  Table 1: Root mean squared prediction errors (RMSPE) and mean absolute percentage error (MAPE-C) over the 7 predicted models across the n = 3108 counties for comparison models.
each non-CES county. We additionally provide the associated 95% credible interval as an indicator of our confidence in each predicted value. The LAUS program work with the states to produce an unemployment rate estimate for each county and month using the employment prediction (in the denominator of the unemployment rate statistic). The estimate for the unemployment level (used in the numerator of the unemployment rate statistic) is derived by applying a county-to-state ratio to a state-level estimator (obtained from the Current Population Survey) in each month to obtain the countylevel estimate of unemployment. An overall measure of uncertainty is not currently published for the resulting unemployment rate statistic due to this ratio estimation of the unemployment level for each county.

Discussion
We constructed a suite of functional mixed effects models for an area-indexed collection of U.S. county time series that are designed to perform prediction for employment totals over multiple months. The last month of the prediction period is reported by the U.S. Bureau Labor Statistics (BLS) after consultation with the states. Our modeling framework is characterized by terms for trend, seasonality and the incorporation of predictors; in particular, measures of concentration across industry and spatial location centroids. These terms are specialized to county, but tied together with a predictor-dependent Dirichlet process partition model that allows the borrowing of strength among counties that express similar patterns in their time-indexed functions. We demonstrate that the incorporation of predictors into the estimation of partitions/clusters sharpens the prediction accuracy.
While there are many models in use in the U.S. Federal Statistical System that incorporate seasonality for a single time series, ours is among-the-first applications of a modeling approach for a collection of dependent time series where the degree of seasonality expressed varies among them. We have shown that use of a relatively simple Fourier basis under a flexible (Huang-Wand) prior for the covariance matrix of the countyindexed vector of coefficients performs robustly over both low-and high-seasonality counties. These strong results for our DDP-FB model suggest a modeling template for general use on collections of area-indexed time series. The semi-parametric framework allows for quick adaptation of model predictions during shifts in underlying economic conditions. Future research will further explore opportunities to combine additional information during the prediction period, such as current month employment estimates for CES counties (since the CES survey estimates are available), into a model to predict employment for non-CES counties.
This case study application is particularly relevant because actual current month employment for U.S. counties is released to the public based on look-ahead predictions, rather than current month data (which are unavailable) and because this class of timeand spatially-indexed data is broadly representative of that published by government statistical agencies. It has been particularly challenging in this study to model a collection of time series under heterogenous patterns for seasonality and it is hoped that this