Scalable Spatiotemporally Varying Coefficient Modeling with Bayesian Kernelized Tensor Regression

As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC to large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR), and kernelized tensor factorization can be considered a new and scalable approach to modeling multivariate spatiotemporal processes with a low-rank covariance structure. For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.


Introduction
Local spatial regression aims to characterize the nonstationary and heterogeneous associations between the response variable and the corresponding covariates observed in a spatial domain (Banerjee et al., 2014; Cressie and Wikle, 2015). This is achieved by assuming that the regression coefficients vary locally over space. Local spatial regression offers enhanced interpretability of complex relationships and has become an important technique in many fields, such as geography, ecology, economics, environment, public health, and climate science, to name but a few. In general, a local spatial regression model for a scalar response y can be written as:

y(s) = x(s)^⊤ β(s) + ϵ(s), (1.1)

where s is the index (e.g., longitude and latitude) for a spatial location, x(s) ∈ R^P and β(s) ∈ R^P are the covariate vector and the regression coefficients at location s, respectively, and ϵ(s) ∼ i.i.d. N(0, τ^{-1}) is a white noise process with precision τ.
There are two common methods for local spatial regression analysis: the Bayesian spatially varying coefficient model (SVC) (Gelfand et al., 2003) and geographically weighted regression (GWR) (Fotheringham et al., 2003). SVC is a Bayesian hierarchical model in which the regression coefficients are modeled using Gaussian processes (GPs) with a kernel function to be learned (Rasmussen and Williams, 2006). For a collection of M observed locations, the original SVC developed by Gelfand et al. (2003) imposes a prior such that vec(β_mat^⊤) ∼ N(1_{M×1} ⊗ µ_β, K_s ⊗ Λ^{-1}), where β_mat is an M × P matrix of all coefficients, vec(X) denotes vectorization by stacking all columns of X as a vector, µ_β represents the overall regression coefficient vector used to construct the mean, K_s is an M × M spatial correlation matrix, Λ is a P × P precision matrix for covariates, and ⊗ denotes the Kronecker product. In this paper, for simplicity, we use a zero-mean GP to specify β, and the global effect of the covariates can be learned (or removed) through a linear regression term as in Gelfand et al. (2003). In addition, setting K_s as a correlation matrix simplifies the covariance specification, since the variance can be captured by scaling Λ^{-1}. This formulation is equivalent to having a matrix normal distribution β_mat ∼ MN_{M×P}(0, K_s, Λ^{-1}). GWR was developed independently using a local weighted regression approach, in which a bandwidth parameter is used to calculate the weights (based on a weight function) for different observations, with closer observations carrying larger weights. In practice, the bandwidth parameter is either prespecified based on domain knowledge or tuned through cross-validation. However, it has been shown in the literature that the estimation results of GWR are highly sensitive to the selection of the bandwidth parameter (e.g., Finley, 2011). Compared with GWR, the Bayesian hierarchical framework of SVC provides more robust results and allows us to learn the hyperparameters of the spatial kernel K_s, e.g., the length-scale, which is critical to understanding the underlying characteristics of the spatial processes. In addition, by using Markov chain Monte Carlo (MCMC), we can not only explore the posterior distribution of the kernel hyperparameters and regression coefficients, but also perform out-of-sample prediction with uncertainty quantification.
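The Kronecker-structured SVC prior above can be illustrated numerically. The following minimal sketch (toy sizes; the exponential correlation is an illustrative stand-in for any valid spatial kernel, not part of the original specification) draws β_mat from MN_{M×P}(0, K_s, Λ^{-1}) via Cholesky factors, which is equivalent to the vectorized prior stated above:

```python
# A minimal sketch (assumed toy sizes and kernel) of drawing the
# coefficient matrix from the matrix normal MN(0, K_s, Lambda^{-1}).
import numpy as np

rng = np.random.default_rng(0)
M, P = 5, 3

coords = rng.uniform(0, 10, size=(M, 1))
K_s = np.exp(-np.abs(coords - coords.T))        # toy spatial correlation
Lambda_inv = np.diag(rng.uniform(0.5, 2.0, P))  # toy covariance across covariates

L_s = np.linalg.cholesky(K_s + 1e-8 * np.eye(M))
L_w = np.linalg.cholesky(Lambda_inv)

# beta_mat = L_s Z L_w^T  =>  vec(beta_mat) ~ N(0, Lambda^{-1} ⊗ K_s),
# i.e., vec(beta_mat^T) ~ N(0, K_s ⊗ Lambda^{-1}) as in the SVC prior
Z = rng.standard_normal((M, P))
beta_mat = L_s @ Z @ L_w.T
```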
The formulation in Eq. (1.1) can be easily extended to local spatiotemporal regression to further characterize the temporal variation of the coefficients. For a response matrix Y ∈ R^{M×N} observed from a set of locations S = {s_1, ..., s_M} over a set of time points T = {t_1, ..., t_N}, the local spatiotemporal regression model defined on the Cartesian product S × T = {(s_m, t_n) : m = 1, ..., M, n = 1, ..., N} can be formulated as:

y(s_m, t_n) = x(s_m, t_n)^⊤ β(s_m, t_n) + ϵ(s_m, t_n), (1.2)

where we use m = 1, ..., M and n = 1, ..., N to index rows (i.e., locations) and columns (i.e., time points), respectively, y(s_m, t_n) is the (m, n)th element of Y, and x(s_m, t_n) and β(s_m, t_n) are the covariate vector and coefficient vector at location s_m and time t_n, respectively. Based on this formulation, Huang et al. (2010) extended GWR to geographically and temporally weighted regression (GTWR) by introducing more parameters to quantify spatiotemporal weights in the locally weighted regression. For SVC, Gelfand et al. (2003) suggested using a separable kernel structure to build a spatiotemporally varying coefficient model (STVC), which assumes that [β(s_1, t_1); ...; β(s_M, t_1); β(s_1, t_2); ...; β(s_M, t_N)] ∼ N(0, K_t ⊗ K_s ⊗ Λ^{-1}), where K_t is an N × N kernel matrix defining the correlation structure for the N time points. Note that with this GP formulation, it is not necessary for the N time points to be equally spaced. If we parameterize the regression coefficients in Eq. (1.2) as a third-order tensor B ∈ R^{M×N×P} with mode-3 fiber B(m, n, :) = β(s_m, t_n), the above specification is equivalent to having a tensor normal distribution B ∼ TN_{M×N×P}(0, K_s, K_t, Λ^{-1}). However, despite the elegant separable kernel-based formulation of STVC, the model is rarely used in real-world practice, mainly due to its high computational cost.
For example, for a fully observed matrix Y with corresponding spatiotemporal covariates, updating the coefficients β in each MCMC iteration requires a time complexity of O(M^3 N^3 P^3). Updating the kernel hyperparameters can be achieved by integrating out β, but it still requires O(M^3 N^3) time.
In this paper, we provide an alternative estimation strategy, Bayesian Kernelized Tensor Regression (BKTR), to perform Bayesian spatiotemporal regression analysis on large-scale data sets. Inspired by the idea of low-rank regression and tensor regression (see e.g., Izenman, 1975; Cressie and Johannesson, 2008; Banerjee et al., 2008; Zhou et al., 2013; Bahadori et al., 2014; Guhaniyogi et al., 2017), we use low-rank tensor factorization to encode the dependencies among the three dimensions of B with only a few latent factors. To further incorporate local spatial and temporal dependencies, we use GP priors on the spatial and temporal factor vectors following Lopes et al. (2008) and Luttinen and Ilin (2009), thus translating the default tensor factorization into a kernelized factorization model. With a specified tensor rank R, the time complexity becomes O(R^3 (M^3 + N^3 + P^3)), which is substantially reduced compared with the default STVC formulation. In addition to the spatial and temporal framework, we also consider the case where a proportion of the response matrix Y is unobserved or corrupted, given observed values of the covariates X. Such a scenario is very common in real-world applications, such as traffic state data collected from emerging crowdsourcing and moving sensing systems (e.g., Google Waze), where observations are inherently sparse in space and time. We show that the underlying Bayesian tensor decomposition structure allows us to effectively estimate both the model coefficients and the unobserved outcomes even when the missing rate of Y is high. We conduct numerical experiments on both synthetic and real-world data sets, and our results confirm the promising performance of BKTR.

Related Work
The key computational challenge in SVC/STVC is how to efficiently and effectively learn a multivariate spatial/spatiotemporal process (i.e., β_mat in Eq. (1.1) and the third-order spatiotemporal tensor B in Eq. (1.2)). For a general multivariate spatial process that is fully observed on the Cartesian product with white noise, a popular approach is to use a separable covariance specification, for which one can leverage the properties of Kronecker products to substantially reduce the computational cost (Saatçi, 2012; Wilson et al., 2014). However, for SVC, we cannot benefit directly from the Kronecker property, since the data y is obtained through a linear transformation of β_mat. In this case, computing the inverse of an MP × MP matrix becomes inevitable when sampling these spatially varying coefficients. Existing frameworks for SVC essentially adopt a two-step approach (Gelfand et al., 2003; Finley and Banerjee, 2020): (1) update only the kernel hyperparameters and µ by marginalizing β, with cost O(M^3); and (2) after burn-in, use composition sampling on the obtained MCMC samples to generate samples for β, with cost O(M^3 P^3); see Finley and Banerjee (2020) for a detailed implementation of Bayesian SVC. For STVC, the corresponding costs of the two steps are O(M^3 N^3) and O(M^3 N^3 P^3), respectively. The high computational cost of step (2) is the primary issue that limits the application of SVC/STVC in practice.
Our work follows a different approach. Instead of modeling β directly using a GP, we parameterize the third-order tensor B in STVC using a low-rank tensor decomposition (Kolda and Bader, 2009). The idea is inspired by recent studies on low-rank tensor regression/learning (see e.g., Lopes et al., 2008; Zhou et al., 2013; Bahadori et al., 2014; Rabusseau and Kadri, 2016; Yu and Liu, 2016; Guhaniyogi et al., 2017; Yu et al., 2018). The low-rank assumption not only preserves the global patterns and higher-order dependencies in the variable, but also greatly reduces the number of parameters. In fact, without considering spatiotemporal indices, we can formulate Eq. (1.2) as a scalar-tensor regression problem (Guhaniyogi et al., 2017) by reconstructing each x(s_m, t_n) as a sparse covariate tensor of the same size as B. However, for spatiotemporal data, the low-rank assumption alone cannot fully characterize the strong local spatial and temporal consistency. To better encode local spatial and temporal consistency, existing studies in tensor regression have introduced graph Laplacian regularization in defining the loss function (e.g., Bahadori et al., 2014; Rao et al., 2015) in an optimization framework. Nevertheless, this approach introduces more parameters (e.g., those used to define the distance/similarity function and the weights in the loss function), and without a Bayesian hierarchical specification it has limited power in modeling complex spatial and temporal processes. The most relevant work is a Gaussian process factor analysis model (Luttinen and Ilin, 2009) developed for a completely different problem, namely completing a spatiotemporal matrix observed from M locations over N time points, in which different GP priors are assumed on the spatial and temporal factors and the whole model is learned through variational Bayesian inference. Similarly, Lopes et al. (2008) developed a spatial dynamic factor model in which spatial factors are assumed to have GP priors and temporal factors follow a dynamic linear model. Lei et al. (2022) present an MCMC scheme for this model, in which slice sampling is used to update kernel hyperparameters and Gibbs sampling is used to update factor matrices. We follow a similar idea as in Luttinen and Ilin (2009) and Lei et al. (2022) to parameterize the coefficients β and develop MCMC algorithms for model inference. In recent work, Zhang and Banerjee (2022) also proposed a Bayesian Linear Model of Coregionalization (LMC) factor model for high-dimensional multivariate spatial processes involving both M and P. To address the large-M issue, the Nearest Neighbor Gaussian Process (NNGP) (Datta et al., 2016; Finley et al., 2019) is used to model the spatial factors. In the literature, there also exist other parameterization methods for modeling spatial processes/coefficients with multidimensional structure. For instance, Martinez-Beneito et al. (2017) used Kronecker decomposition to model a large coefficient matrix for tensor-variate data; Guhaniyogi et al. (2023) proposed to model spatially varying coefficients using basis-function models (with pre-defined basis functions and random basis coefficients) and used sketching to reduce data dimensionality to achieve scalable inference for large M. BKTR can also be viewed as using tensor factorization as a special basis-function method to model B, where the basis functions are themselves random variables (see Lei and Sun (2023) and Cressie et al. (2022)).

Preliminaries
Notation. Throughout this paper, we use lowercase letters to denote scalars, e.g., x, boldface lowercase letters to denote vectors, e.g., x ∈ R^M, and boldface uppercase letters to denote matrices, e.g., X ∈ R^{M×N}. The ℓ2-norm of x is defined as ∥x∥_2 = (Σ_m x_m^2)^{1/2}. For a vector x, we denote its mth entry by x(m). For a matrix X ∈ R^{M×N}, we denote its (m, n)th entry by x_{m,n} or X(m, n). We use I_N to denote an identity matrix of size N × N. Given two matrices A ∈ R^{M×N} and B ∈ R^{P×Q}, the Kronecker product is defined as

A ⊗ B = [a_{1,1}B ... a_{1,N}B; ... ; a_{M,1}B ... a_{M,N}B] ∈ R^{MP×NQ}.

If A and B have the same number of columns, i.e., N = Q, then the Khatri-Rao product is defined as the column-wise Kronecker product A ⊙ B = [a_1 ⊗ b_1, ..., a_N ⊗ b_N] ∈ R^{MP×N}, where a_n and b_n denote the nth columns of A and B, respectively. The vectorization vec(X) stacks all column vectors of X as a single vector. Following the tensor notation in Kolda and Bader (2009), we denote a third-order tensor by X ∈ R^{M×N×P} and its mode-k (k = 1, 2, 3) unfolding, which maps a tensor into a matrix, by X_(k). The mode-3 fibers and frontal slices of X are denoted by X(m, n, :) ∈ R^P and X(:, :, p) ∈ R^{M×N}, respectively. Lastly, we use ones(M, N) and 1_M ∈ R^M to represent an M × N matrix of ones and a length-M column vector of ones, respectively.
Tensor CP decomposition. For a third-order tensor A ∈ R^{M×N×P}, the CANDECOMP/PARAFAC (CP) decomposition factorizes A into a sum of rank-one tensors (Kolda and Bader, 2009):

A = Σ_{r=1}^{R} u_r ∘ v_r ∘ w_r, (3.1)

where R is the CP rank, ∘ represents the outer product, and u_r ∈ R^M, v_r ∈ R^N, and w_r ∈ R^P for r = 1, ..., R. The factor matrices that collect the vectors from the rank-one components are denoted by U = [u_1, ..., u_R] ∈ R^{M×R}, V ∈ R^{N×R}, and W ∈ R^{P×R}, respectively. We can write Eq. (3.1) in the following matricized forms:

A_(1) = U(W ⊙ V)^⊤, A_(2) = V(W ⊙ U)^⊤, A_(3) = W(V ⊙ U)^⊤, (3.2)

where ⊙ is the Khatri-Rao product. Eq. (3.2) relates the mode-k unfolding of a tensor to its polyadic decomposition.
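The matricized identities in Eq. (3.2) are easy to verify numerically. A minimal sketch (toy dimensions) using SciPy's khatri_rao:

```python
# A small check of the matricized CP identities of Eq. (3.2),
# e.g., A_(1) = U (W ⊙ V)^T, with ⊙ the Khatri-Rao product.
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(1)
M, N, P, R = 4, 5, 3, 2
U = rng.standard_normal((M, R))
V = rng.standard_normal((N, R))
W = rng.standard_normal((P, R))

# A = sum_r u_r o v_r o w_r (outer products)
A = np.einsum('mr,nr,pr->mnp', U, V, W)

A1 = A.reshape(M, N * P, order='F')      # mode-1 unfolding A_(1)
A3 = A.reshape(M * N, P, order='F').T    # mode-3 unfolding A_(3)

assert np.allclose(A1, U @ khatri_rao(W, V).T)
assert np.allclose(A3, W @ khatri_rao(V, U).T)
```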

Model specification
Let X be an M × N × P tensor whose (m, n)th mode-3 fiber is the covariate vector at location s_m and time t_n, i.e., X(m, n, :) = x(s_m, t_n). For example, in the application of spatiotemporal modeling of bike-sharing demand that we illustrate later (see Section 5), the response matrix Y ∈ R^{M×N} is a matrix of daily departure trips for M bike stations over N days, and the tensor variable X ∈ R^{M×N×P} represents a set of P spatiotemporal covariates for the corresponding locations and times. Using y ∈ R^{MN} to denote vec(Y), Eq. (1.2) can be formulated as:

y = (I_{MN} ⊙ X_(3))^⊤ vec(B_(3)) + ϵ, (3.3)

where X_(3) and B_(3) are the mode-3 unfoldings of X and B, respectively, the Khatri-Rao product (I_{MN} ⊙ X_(3))^⊤ is an MN × MNP block-diagonal matrix, and ϵ ∼ N(0, τ^{-1} I_{MN}).
Assuming that B admits a CP decomposition with rank R ≪ min{M, N}, i.e.,

B_(3) = W(V ⊙ U)^⊤, (3.4)

we can rewrite Eq. (3.3) as:

y = X̃ ((V ⊙ U) ⊗ I_P) vec(W) + ϵ, (3.5)

where X̃ = (I_{MN} ⊙ X_(3))^⊤ denotes the expanded covariate matrix. The number of parameters in (3.5) is R(M + N + P), which is substantially less than the MNP in (3.3).
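The algebra behind Eqs. (3.3)-(3.5) can be checked at toy scale. The sketch below builds X̃ from the mode-3 unfolding and confirms that the low-rank formulation reproduces the direct elementwise regression y(m, n) = x(s_m, t_n)^⊤ β(s_m, t_n) (all sizes are illustrative):

```python
# A toy-scale check of the low-rank reformulation in Eq. (3.5).
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(2)
M, N, P, R = 3, 4, 2, 2
U = rng.standard_normal((M, R))
V = rng.standard_normal((N, R))
W = rng.standard_normal((P, R))
X = rng.standard_normal((M, N, P))             # covariate tensor

B = np.einsum('mr,nr,pr->mnp', U, V, W)        # coefficient tensor in CP form
y_direct = np.einsum('mnp,mnp->mn', X, B).flatten(order='F')   # vec(Y)

X3 = X.reshape(M * N, P, order='F').T          # mode-3 unfolding, P x MN
X_tilde = khatri_rao(np.eye(M * N), X3).T      # MN x MNP block diagonal
y_lowrank = X_tilde @ np.kron(khatri_rao(V, U), np.eye(P)) @ W.flatten(order='F')

assert np.allclose(y_direct, y_lowrank)
```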
Local spatial and temporal processes are critical to the modeling of spatiotemporal data. However, as mentioned above, the low-rank assumption alone cannot encode such local dependencies. To address this issue, we assume specific GP priors on U and V following the GP factor analysis strategy (Luttinen and Ilin, 2009), use a conjugate normal prior on W, and then develop a fully Bayesian approach to estimate the model in Eq. (3.5). Figure 1 illustrates the proposed framework, which is referred to as Bayesian Kernelized Tensor Regression (BKTR) in the remainder of this paper. The graphical model of BKTR is shown in Figure 2.
As mentioned in the introduction, in real-world applications the dependent data is often partially observed on a set Ω of observation indices, with |Ω| < MN. This means that we only observe a subset of entries y_{m,n}, for (s_m, t_n) ∈ Ω. We denote by D ∈ R^{M×N} a binary indicator matrix with d_{m,n} = 1 if (m, n) ∈ Ω and d_{m,n} = 0 otherwise, and by O an |Ω| × MN binary matrix formed by removing from an MN × MN identity matrix the rows corresponding to the zero values in vec(D). The vector of observed data can be obtained as y_Ω = Oy. Therefore, we have:

y_Ω = O X̃ ((V ⊙ U) ⊗ I_P) vec(W) + ϵ_Ω, (3.6)

where ϵ_Ω ∼ N(0, τ^{-1} I_{|Ω|}).

Figure 1: Illustration of the proposed BKTR framework.
For the spatial and temporal factor matrices U and V, we use identical GP priors on the component vectors:

u_r ∼ N(0, K_s), v_r ∼ N(0, K_t), for r = 1, ..., R, (3.7)

where K_s ∈ R^{M×M} and K_t ∈ R^{N×N} are the spatial and temporal covariance matrices built from two valid kernel functions k_s(s_m, s_{m'}; ϕ) and k_t(t_n, t_{n'}; γ), respectively, with ϕ and γ being kernel length-scale hyperparameters. Note that we capture the variance through W and thus restrict K_s and K_t to being correlation matrices by setting the variance to one. We reparameterize the kernel hyperparameters as log-transformed variables to ensure their positivity and assume normal priors on them, i.e., log(ϕ) ∼ N(µ_ϕ, τ_ϕ^{-1}) and log(γ) ∼ N(µ_γ, τ_γ^{-1}). For the factor matrix W, we assume all columns follow an identical zero-mean Gaussian distribution with a conjugate Wishart prior on the precision matrix:

w_r ∼ N(0, Λ_w^{-1}), for r = 1, ..., R, with Λ_w ∼ W(Ψ_0, ν_0), (3.8)

where Ψ_0 is a P × P positive-definite scale matrix and ν_0 > P − 1 denotes the degrees of freedom. Finally, we use a conjugate Gamma prior τ ∼ Gamma(a_0, b_0) on the noise precision τ defined in Eq. (3.6).
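As a concrete illustration of Eq. (3.7), the sketch below builds the two correlation matrices (using the Matérn 3/2 and SE kernel forms that appear later in the experiments; the log-normal hyperparameter draws mirror the log-scale priors above) and draws the factor columns:

```python
# A minimal sketch of the GP priors on the spatial/temporal factors.
import numpy as np

def matern32(coords, phi):
    """Matern 3/2 correlation matrix from pairwise Euclidean distances."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (1.0 + np.sqrt(3.0) * d / phi) * np.exp(-np.sqrt(3.0) * d / phi)

def se(t, gamma):
    """Squared-exponential correlation matrix for 1-D time points."""
    return np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * gamma ** 2))

rng = np.random.default_rng(3)
M, N, R = 30, 20, 4
coords = rng.uniform(0, 10, size=(M, 2))
t = np.linspace(0, 10, N)

phi = np.exp(rng.normal(0.0, 0.1))     # length-scales live on the log scale
gamma = np.exp(rng.normal(0.0, 0.1))
K_s, K_t = matern32(coords, phi), se(t, gamma)

# u_r ~ N(0, K_s) and v_r ~ N(0, K_t), drawn column by column via Cholesky
U = np.linalg.cholesky(K_s + 1e-8 * np.eye(M)) @ rng.standard_normal((M, R))
V = np.linalg.cholesky(K_t + 1e-8 * np.eye(N)) @ rng.standard_normal((N, R))
```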
Based on the assumed priors and hyperpriors, we can write the covariance function of the coefficients B modeled by BKTR. Specifically, for a pair of points in the input space with indices (m, n, p) and (m', n', p'), the covariance between the two entries is

cov(B(m, n, p), B(m', n', p') | {w_r}) = k_s(s_m, s_{m'}) k_t(t_n, t_{n'}) Σ_{r=1}^{R} w_r(p) w_r(p'). (3.9)

Given the prior on w_r (see Eq. (3.8)), we have

E[Σ_{r=1}^{R} w_r w_r^⊤] = R Λ_w^{-1}. (3.10)

Integrating out w_r in Eq. (3.4) using Eqs. (3.9) and (3.10), we have

cov(B(m, n, p), B(m', n', p')) = R k_s(s_m, s_{m'}) k_t(t_n, t_{n'}) Λ_w^{-1}(p, p'). (3.11)

Similarly, if we marginalize U in Eq. (3.4), we can derive

vec(B_(1)) | V, W ∼ N(0, (W ⊙ V)(W ⊙ V)^⊤ ⊗ K_s). (3.12)

With the analysis of the variance-covariance matrix in Eqs. (3.11) and (3.12), we can interpret the kernelized tensor CP factorization as a higher-order extension of the intrinsic model of coregionalization (Bonilla et al., 2007; Banerjee et al., 2014; Álvarez et al., 2012), where the coregionalization matrix also has a low-rank specification. However, for model inference we do not use this covariance structure; instead, we directly use tensor factorization to learn the latent factor matrices for fast inference.
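The identity behind Eq. (3.12) follows from vec(B_(1)) = ((W ⊙ V) ⊗ I_M) vec(U) and the mixed-product property of Kronecker products. A toy numerical check:

```python
# Check: ((W⊙V) ⊗ I_M)(I_R ⊗ K_s)((W⊙V) ⊗ I_M)^T = (W⊙V)(W⊙V)^T ⊗ K_s.
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(4)
M, N, P, R = 3, 2, 2, 2
V = rng.standard_normal((N, R))
W = rng.standard_normal((P, R))
A0 = rng.standard_normal((M, M))
K_s = A0 @ A0.T + np.eye(M)                    # toy spatial covariance

H = khatri_rao(W, V)                           # (W ⊙ V), NP x R
G = np.kron(H, np.eye(M))                      # maps vec(U) to vec(B_(1))
lhs = G @ np.kron(np.eye(R), K_s) @ G.T        # cov of ((W⊙V) ⊗ I_M) vec(U)
rhs = np.kron(H @ H.T, K_s)
assert np.allclose(lhs, rhs)
```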

Model inference
We use Gibbs sampling to estimate the model parameters, including the coefficient factors {U, V, W}, the precision τ, and the precision matrix Λ_w. For the kernel hyperparameters {ϕ, γ}, whose conditional distributions are not easy to sample from, we use a slice sampler.

Sampling the coefficient factor matrices {U, V, W}
Sampling the factor matrices can be treated as a Bayesian linear regression problem. Taking W as an example, we can rewrite Eq. (3.6) as:

y_Ω = H_W vec(W) + ϵ_Ω, with H_W = O X̃ ((V ⊙ U) ⊗ I_P), (3.13)

where U and V are known and vec(W) is the coefficient vector to estimate. Considering that the priors of the component vectors w_r are independent and identical, the prior distribution over the whole vectorized W becomes vec(W) ∼ N(0, I_R ⊗ Λ_w^{-1}). Since both the likelihood and the prior of vec(W) are Gaussian, its posterior is also Gaussian, with mean µ*_W and precision Λ*_W given by:

Λ*_W = τ H_W^⊤ H_W + I_R ⊗ Λ_w, µ*_W = τ (Λ*_W)^{-1} H_W^⊤ y_Ω. (3.14)

Sampling vec(W) from N(µ*_W, (Λ*_W)^{-1}) is mainly dominated by the Cholesky decomposition of Λ*_W, and the procedure requires O(R^2 P^2) storage and O(R^3 P^3) time. The posterior distributions of U and V can be obtained similarly using different tensor unfoldings. In order to sample U, we use the mode-1 unfolding in Eq. (3.2) and reconstruct the regression model with vec(U) as the coefficients:

y_Ω = H_U vec(U) + ϵ_Ω, with H_U = O X̃_U ((W ⊙ V) ⊗ I_M), (3.15)

where X̃_U = (X_(3) ⊙ I_{MN})^⊤ ∈ R^{MN×MNP} and ϵ_Ω ∼ N(0, τ^{-1} I_{|Ω|}). Thus, the posterior of vec(U) has a closed form, a Gaussian distribution with mean µ*_U and precision Λ*_U, where

Λ*_U = τ H_U^⊤ H_U + I_R ⊗ K_s^{-1}, µ*_U = τ (Λ*_U)^{-1} H_U^⊤ y_Ω. (3.16)

The posterior for vec(V) can be obtained by applying the mode-2 tensor unfolding. It should be noted that the above derivation provides the posterior for a whole factor matrix, i.e., {vec(U), vec(V), vec(W)}, so the time complexity is O(R^3 (M^3 + N^3 + P^3)). We can further reduce the computational cost by sampling {u_r, v_r, w_r} one by one for r = 1, ..., R, as in Luttinen and Ilin (2009). In this case, the time complexity of learning the factor matrices can be further reduced to O(R(M^3 + N^3 + P^3)), at the cost of slower/poorer mixing.
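A sketch of the conjugate update in Eq. (3.14), assuming the design matrix H_W has already been formed as above (dense linear algebra is used for clarity; this is an illustrative implementation, not the packaged one):

```python
# A sketch of the Gibbs update for vec(W) in Eq. (3.14).
import numpy as np

def sample_vec_w(H_W, y_obs, tau, Lambda_w, R, rng):
    """Draw vec(W) | - ~ N(mu_W*, (Lambda_W*)^{-1})."""
    Lambda_star = tau * H_W.T @ H_W + np.kron(np.eye(R), Lambda_w)
    L = np.linalg.cholesky(Lambda_star)           # the dominant O(R^3 P^3) step
    mu = np.linalg.solve(L.T, np.linalg.solve(L, tau * H_W.T @ y_obs))
    z = rng.standard_normal(mu.size)
    return mu + np.linalg.solve(L.T, z)           # adds N(0, (Lambda*)^{-1}) noise
```

The updates for vec(U) and vec(V) follow the same pattern with their respective design matrices and prior precisions I_R ⊗ K_s^{-1} and I_R ⊗ K_t^{-1}.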

Sampling kernel hyperparameters {ϕ, γ}
As shown in Figure 2, sampling the kernel hyperparameters conditional on the factor matrices would be straightforward with the Metropolis-Hastings algorithm. However, in practice, conditioning on the latent variables {U, V} in such hierarchical GP models usually induces sharply peaked conditional posteriors over {ϕ, γ}, making the Markov chains mix slowly and resulting in poor updates (Murray and Adams, 2010). To address this issue, we integrate the latent factors out of the model to obtain the marginal likelihood, and sample ϕ and γ from their marginal posterior distributions using the slice sampling approach (Neal, 2003); i.e., we integrate out U when deriving the marginal posterior distribution p(ϕ | y_Ω, V, W, τ, X), and likewise we build the marginal posterior p(γ | y_Ω, U, W, τ, X) by integrating out V. Compared with recent related studies that sample hyperparameters conditioned on latent factors, e.g., Zhang and Banerjee (2022), marginalizing the latent factors yields tractable posteriors and avoids possible issues with MCMC convergence. Note that in the work of Murray and Adams (2010), the proposed auxiliary variable strategy can also be used to handle outcomes with distributions beyond Gaussian. In addition, for kernel functions that contain more than one hyperparameter, one can apply a modified rectangle slice sampler (Neal, 2003).
Consider, for example, the hyperparameter of K_s, i.e., ϕ. As vec(U) ∼ N(0, K_U) with K_U = I_R ⊗ K_s, we integrate out vec(U) in Eq. (3.15) and obtain:

y_Ω | ϕ, V, W, τ ∼ N(0, K_{y|ϕ}), with K_{y|ϕ} = H_U K_U H_U^⊤ + τ^{-1} I_{|Ω|}. (3.17)

The log marginal posterior of ϕ is then

log p(ϕ | y_Ω, V, W, τ, X) = −(1/2) y_Ω^⊤ K_{y|ϕ}^{-1} y_Ω − (1/2) log |K_{y|ϕ}| + log p(ϕ) + const, (3.18)

where we compute y_Ω^⊤ K_{y|ϕ}^{-1} y_Ω based on the Woodbury matrix identity and use the matrix determinant lemma to compute log |K_{y|ϕ}|. The detailed derivation is given in Appendix A. Computing Eq. (3.18) involves the Cholesky factorization of an RM × RM matrix. The slice sampling approach is robust to the selection of the sampling scale and easy to implement. Sampling γ can be achieved in a similar way. Note that, as mentioned, the sampling is performed on the log-transformed variables to avoid numerical issues. The detailed sampling process for the kernel hyperparameters is provided in Appendix B. We can also introduce different kernel functions (or the same kernel function with different hyperparameters) for each pair of factor vectors u_r and v_r, as in Luttinen and Ilin (2009). In this case, the marginal posterior of the kernel hyperparameters can be derived in a similar way as in Eq. (3.18).
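A sketch of evaluating the log marginal posterior of Eq. (3.18) inside the slice sampler, using the Woodbury identity and determinant lemma from Appendix A (build_Ks is an assumed helper returning K_s for a given ϕ; K_U is inverted densely for clarity, though K_U^{-1} = I_R ⊗ K_s^{-1} can be exploited):

```python
# A sketch of Eq. (3.18): log p(phi | y_Omega, V, W, tau) up to a constant.
import numpy as np

def log_marginal_phi(log_phi, y_obs, H_U, build_Ks, tau, R, mu_phi, tau_phi):
    K_s = build_Ks(np.exp(log_phi))                    # M x M correlation matrix
    M = K_s.shape[0]
    K_U = np.kron(np.eye(R), K_s)                      # I_R ⊗ K_s
    # A = K_U^{-1} + tau H_U^T H_U is the RM x RM matrix factorized once
    A = np.linalg.inv(K_U + 1e-8 * np.eye(R * M)) + tau * H_U.T @ H_U
    L = np.linalg.cholesky(A)                          # O(R^3 M^3), dominant step
    b = np.linalg.solve(L, H_U.T @ y_obs)
    n = y_obs.size
    quad = tau * (y_obs @ y_obs) - tau ** 2 * (b @ b)  # Woodbury identity
    _, logdet_KU = np.linalg.slogdet(K_U)
    logdet = 2.0 * np.log(np.diag(L)).sum() + logdet_KU - n * np.log(tau)
    log_prior = -0.5 * tau_phi * (log_phi - mu_phi) ** 2  # log(phi) ~ N prior
    return -0.5 * (quad + logdet) + log_prior
```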

Sampling Λ_w
Given the conjugate Wishart prior, the posterior distribution of Λ_w is also Wishart:

Λ_w | W ∼ W(Ψ*, ν*), with Ψ* = (Ψ_0^{-1} + W W^⊤)^{-1} and ν* = ν_0 + R.

Sampling the precision τ
Since we use a conjugate Gamma prior, the posterior distribution of τ is also a Gamma distribution, with shape a* and rate b* given by

a* = a_0 + |Ω|/2, b* = b_0 + (1/2) ∥y_Ω − O X̃ ((V ⊙ U) ⊗ I_P) vec(W)∥_2^2. (3.19)
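The corresponding draw is a one-liner. A sketch (here y_fit denotes the fitted values O X̃ ((V ⊙ U) ⊗ I_P) vec(W)):

```python
# A sketch of the conjugate Gamma update for tau in Eq. (3.19).
import numpy as np

def sample_tau(y_obs, y_fit, a0, b0, rng):
    """Draw tau | - ~ Gamma(a*, b*)."""
    a_star = a0 + 0.5 * y_obs.size
    b_star = b0 + 0.5 * np.sum((y_obs - y_fit) ** 2)
    return rng.gamma(shape=a_star, scale=1.0 / b_star)  # numpy uses scale = 1/rate
```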

Model implementation
We summarize the implementation of BKTR in Algorithm 1. It should be noted that we update the correlated latent factors and the corresponding hyperparameters as one block in the Gibbs sampler to further ensure model convergence (Knorr-Held and Rue, 2002). Specifically, we take {ϕ, U} as a block and update ϕ from its marginal posterior with a slice sampler, followed by sampling U from p(U | ϕ, −); we then similarly sample {γ, V} as another block. For MCMC inference, we run K_1 iterations as burn-in and take the following K_2 samples for estimation.

Algorithm 1: MCMC sampling process of BKTR(y_Ω, X, R, K_1, K_2)
  Initialize {U, V, W} as normally distributed random values, ϕ = γ = 1, and Λ_w ∼ W(I_P, P); set µ_ϕ = µ_γ = log(1), τ_ϕ = τ_γ = 10, and a_0 = b_0 = 10^{-4};
  for k = 1 : (K_1 + K_2) do
    Sample kernel hyperparameter ϕ using the algorithm in Appendix B;
    Sample factor vec(U) from the Gaussian distribution in Eq. (3.16);
    Sample kernel hyperparameter γ using the algorithm in Appendix B;
    Sample factor vec(V) from the analogous Gaussian distribution based on the mode-2 unfolding, with the indicator matrix formed by removing from I_{MN} the rows corresponding to the zeros in vec(D^⊤), and the covariate tensor whose frontal slices are the transposes of the frontal slices of X;
    Sample hyperparameter Λ_w from its Wishart posterior;
    Sample factor vec(W) from the Gaussian distribution in Eq. (3.14);
    Sample precision τ from the Gamma distribution in Eq. (3.19);
  end
  return the K_2 post-burn-in samples {B^(k)} to approximate the posterior coefficients and estimate unobserved data.

Model scalability
The original STVC approach (Gelfand et al., 2003) requires O(|Ω|^3) for hyperparameter sampling and O(|Ω|^3 P^3) for learning the coefficient variables. BKTR reduces the cost of updating the hyperparameters and coefficients to O(R^3 (M^3 + N^3 + P^3)); e.g., updating the factor matrix U following Eq. (3.16) requires O(M^3 R^3) time. For the slice sampling of the kernel hyperparameter ϕ, each update inside the slice sampling loop (see Algorithm 2) requires O(M^3 R^3) to compute the likelihood in Eq. (3.18). Such substantial gains in computing time allow us to analyze large-scale real-world spatiotemporal data and multi-dimensional relations, for which STVC is generally infeasible.
Given the above analysis, the computational cost of BKTR depends on the rank R, and BKTR can handle large data sets with R × M ≈ 10^3 ∼ 10^4 (the same applies to N and P). However, it should be noted that the default BKTR still encounters computational issues when the number of spatial locations M (or time points N) becomes very large (e.g., M > 10^4). We next discuss several solutions for large M. Following Luttinen and Ilin (2009), the most straightforward solution is to update the three factor matrices column by column, which reduces the time cost to O(R(M^3 + N^3 + P^3)). Through this updating scheme, the Kronecker products for learning latent factors and hyperparameters can be avoided. Taking u_r (i.e., the rth column of U) as an example, the posterior of u_r, when updated independently, still follows a Gaussian distribution N(µ*_{u_r}, (Λ*_{u_r})^{-1}), where

Λ*_{u_r} = τ H_{u_r}^⊤ H_{u_r} + K_s^{-1}, µ*_{u_r} = τ (Λ*_{u_r})^{-1} H_{u_r}^⊤ y_{u_r}, (3.20)

where H_{u_r} is the design matrix built from the rth columns of V and W, y_{u_r} is the residual obtained by removing from y_Ω the contributions of the rank-one components formed by the factor matrices without their rth columns, and H_{u_r}^⊤ H_{u_r} ∈ R^{M×M} is a diagonal matrix. Luttinen and Ilin (2009) also suggested using sparse approximations and predictive processes (Banerjee et al., 2008; Titsias, 2009) to model each latent factor vector when M becomes large, i.e., using M_0 inducing points with M_0 ≪ M; this reduces the time complexity to O(R M_0^2 M). Given that H_{u_r}^⊤ H_{u_r} is an M × M diagonal matrix, if we replace the Gaussian prior with a Gaussian Markov random field (GMRF) (Rue and Held, 2005), for which the precision matrix K_s^{-1} becomes a sparse banded matrix, we can then use sparse matrix algorithms (as Λ*_{u_r} also becomes a sparse banded matrix) to model large-scale data sets with M ≈ 10^4 ∼ 10^6. Similarly, NNGP can also be used here to model the spatial latent factor U for scalable inference (Finley et al., 2019). The column-by-column approach also offers the flexibility to learn a different kernel hyperparameter ϕ_r (and thus covariance K_s^r) for each latent factor vector u_r. The marginal likelihood in Eq. (3.17) is then conditioned on ϕ_r, and, correspondingly, the marginal posterior in Eq. (3.18) becomes:

log p(ϕ_r | y_{u_r}, −) = −(1/2) y_{u_r}^⊤ K_{y|ϕ_r}^{-1} y_{u_r} − (1/2) log |K_{y|ϕ_r}| + log p(ϕ_r) + const, with K_{y|ϕ_r} = H_{u_r} K_s^r H_{u_r}^⊤ + τ^{-1} I_{|Ω|}. (3.21)

Lastly, considering that learning hyperparameters via slice sampling is expensive, another possible solution to further reduce the computational cost is to use cross-validation to specify the kernel hyperparameters, as in Finley et al. (2019). The choice of kernel hyperparameters is often not that sensitive in regression, so cross-validation can be performed at a moderately crude resolution. However, it could still be time-consuming to directly implement the cross-validatory approach based on MCMC sampling, since the iterative process needs to be run over several folds for each combination of hyperparameters.
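A sketch of the column-wise update for u_r in Eq. (3.20) above (the design matrix H_ur and the residual y_resid are assumed to be precomputed as described; the function names are illustrative):

```python
# A sketch of the column-wise Gibbs update for u_r, Eq. (3.20).
import numpy as np

def sample_u_r(H_ur, y_resid, tau, Ks_inv, rng):
    """Draw u_r | - ~ N(mu*, (Lambda*)^{-1}) with Lambda* = tau H^T H + K_s^{-1}."""
    Lambda_star = tau * H_ur.T @ H_ur + Ks_inv    # H^T H is diagonal here
    L = np.linalg.cholesky(Lambda_star)           # banded/sparse if K_s^{-1} is a GMRF
    mu = np.linalg.solve(L.T, np.linalg.solve(L, tau * H_ur.T @ y_resid))
    return mu + np.linalg.solve(L.T, rng.standard_normal(mu.size))
```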

Simulation Study
In this section, we evaluate the performance of BKTR on three simulated data sets. All simulations are performed on a laptop with a 6-core Intel Xeon 2.60 GHz CPU and 32 GB RAM. Specifically, we conduct three studies: (1) a low-rank structured association analysis to test the estimation accuracy and statistical properties of BKTR with different rank settings and in scenarios with different observation/missing rates, (2) a small-scale analysis to compare BKTR with STVC and a pure low-rank tensor regression model, and (3) a moderately sized analysis to test the performance of BKTR on more practical STVC modeling.
We test the performance of BKTR under different settings to evaluate its sensitivity to (i) the rank specification defined by the user, and (ii) the proportion of observed responses, i.e., |Ω|/MN. For (i), we specify R ∈ {4, 7, 10, 13, 16, 20, 25, 30, 35, 40} and randomly sample 50% of the data Y as observed values. For (ii), we apply the model under five different observation rates, randomly sampling (90%, 70%, 50%, 30%, 10%) of Y as observed values, and evaluate the performance of BKTR with the true rank setting (R = 10). In all settings, we assume that the kernel functions are known, i.e., k_s is Matérn 3/2 and k_t is SE. We also include in the analysis a sixth covariate in X drawn from the standard normal distribution and unrelated to the outcome (the corresponding estimated coefficients should be close to zero). For each setting, we replicate the simulation 40 times and run the MCMC with K_1 = 1000 and K_2 = 500.
In each setting, we measure the estimation accuracy of B and the prediction accuracy of y_{Ω^c} (unobserved data) using the mean absolute error (MAE) and root mean square error (RMSE) between the true values and the corresponding posterior mean estimates, where the true values of B for the random non-significant covariate are set to zero. To evaluate the performance of uncertainty estimation, we assess the interval coverage (CVG) (Heaton et al., 2019), the interval score (INT) of the 95% credible intervals (CI), and the continuous ranked probability score (CRPS) (Gneiting and Raftery, 2007) of all B values. Note that all CIs are obtained from the K_2 samples after burn-in. Detailed definitions of all these performance metrics are given in Appendix C.
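For reference, minimal implementations of the interval-based metrics (standard definitions, intended to match Appendix C) computed from posterior samples:

```python
# Standard uncertainty metrics from posterior samples.
import numpy as np

def interval_metrics(true, lower, upper, alpha=0.05):
    """95% CI coverage (CVG) and interval score (INT)."""
    cvg = np.mean((true >= lower) & (true <= upper))
    width = upper - lower
    below = (2.0 / alpha) * (lower - true) * (true < lower)
    above = (2.0 / alpha) * (true - upper) * (true > upper)
    return cvg, np.mean(width + below + above)

def crps_from_samples(true, samples):
    """CRPS = E|X - y| - 0.5 E|X - X'|, estimated from samples of shape (S, n).
    The pairwise term is O(S^2 n) in memory; fine for a few hundred samples."""
    term1 = np.mean(np.abs(samples - true[None, :]), axis=0)
    term2 = 0.5 * np.mean(np.abs(samples[:, None, :] - samples[None, :, :]),
                          axis=(0, 1))
    return np.mean(term1 - term2)
```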

Results
Table 1 shows the performance metrics (mean ± std calculated from the 40 replicated simulations) of BKTR when the true rank is specified (R = 10) and 50% of Y is observed. We can see that the CVG of the 95% CI of B is around 95%, as expected. We also plot some estimation results from one chain/simulation in Figure 3, where panel (a) illustrates the temporal evolution of the estimated coefficients for each covariate at a given location, and panel (b) maps the spatial surfaces of B for each covariate at a given time point. These plots show that BKTR can effectively approximate the spatiotemporally varying relationships between the spatiotemporal data and the corresponding covariates.
The performance of BKTR with different rank specifications and 50% of the data points observed is compared in Figure 4(a). We see that when the rank is overspecified (larger than the true value, i.e., R > 10), BKTR can still achieve reliable estimation accuracy for both the coefficients and the unobserved data, even when R is much larger than 10, e.g., R = 40. The CVG of the 95% CI of B is also maintained around 95% when R > 10, and the corresponding INT and CRPS remain low. This indicates that BKTR is able to provide accurate estimation of the coefficients along with high-quality uncertainty even when the model includes redundant parameters. In addition, Figure 4(b) illustrates the effect of the observation rate of Y, comparing the results of BKTR (with R = 10) for different proportions of observed data. We observe that BKTR maintains a stably low estimation error as the rate of observed data decreases, and it continues to produce valid CIs despite their increased length at lower observation rates (i.e., the INT becomes larger). These results demonstrate the broad applicability of the proposed Bayesian framework. We further show the temporal estimates of the coefficients at a given location obtained in the different cases in Figure 5. From Figure 5(a), we see that a higher rank specification generates broader intervals, which is consistent with the INT results in Figure 4(a), i.e., the INT slightly increases with overspecified ranks. Panel (b) shows that BKTR can accurately estimate time-varying coefficients even when only 10% of the response data is observed, demonstrating its high modeling capacity.

Simulation setting
In this simulation, we generate a small data set in which the true B is directly generated from a separable covariance matrix. Specifically, we simulate 30 locations (M = 30) in a [0, 10] × [0, 10] square and N = 30 evenly distributed time points in [0, 10]. We then generate a small synthetic data set Y ∈ R^{30×30}, X ∈ R^{30×30×3} (P = 3) following Eq. (3.3) with:

X(:, :, 1) = ones(M, N), X(:, :, 2) = x_s 1_N^⊤, X(:, :, 3) = 1_M x_t^⊤, B ∼ TN_{M×N×P}(0, σ_s^2 K_s, σ_t^2 K_t, Λ_w^{-1}),

where x_s ∼ N(0, I_M), x_t ∼ N(0, I_N), K_s and K_t are again computed from a Matérn 3/2 kernel and an SE kernel, respectively, and Λ_w ∼ W(I_P, P). For the model parameters, we set the kernel variance hyperparameters to σ_s^2 = σ_t^2 = 2 and consider combinations of data noise variance τ^{-1} ∈ {0.2, 1, 5} and kernel length-scale hyperparameters ϕ = γ ∈ {1, 2, 4}. We specify the CP rank R = 10 for BKTR estimation and compare BKTR with the original STVC model (Gelfand et al., 2003) and a pure low-rank Bayesian probabilistic tensor regression (BTR) model that does not impose any spatiotemporal GP priors, obtained by setting K_s = I_M and K_t = I_N in the prior and using the same rank as BKTR. For both BKTR and STVC, we assume that the kernel functions are known, i.e., k_s is Matérn 3/2 and k_t is SE, and use MCMC to estimate the unknown kernel hyperparameters.
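A sketch of this generator (under the slice construction of X assumed above; ϕ = γ = 2 is one of the considered settings, and the tensor-normal draw uses mode products of the Cholesky factors):

```python
# A sketch (assumed setup) of the Simulation 2 data generator.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(5)
M = N = 30
P = 3
coords = rng.uniform(0, 10, size=(M, 2))
t = np.linspace(0, 10, N)

Ds = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
K_s = 2.0 * (1 + np.sqrt(3) * Ds / 2.0) * np.exp(-np.sqrt(3) * Ds / 2.0)  # Matern 3/2
Dt = t[:, None] - t[None, :]
K_t = 2.0 * np.exp(-Dt ** 2 / (2 * 2.0 ** 2))                             # SE

Lambda_w = wishart.rvs(df=P, scale=np.eye(P), random_state=0)
Ls, Lt, Lw = (np.linalg.cholesky(K + 1e-8 * np.eye(len(K)))
              for K in (K_s, K_t, np.linalg.inv(Lambda_w)))

# tensor-normal draw of B via mode products of the Cholesky factors
B = np.einsum('ma,nb,pc,abc->mnp', Ls, Lt, Lw, rng.standard_normal((M, N, P)))

x_s, x_t = rng.standard_normal(M), rng.standard_normal(N)
X = np.stack([np.ones((M, N)),
              np.tile(x_s[:, None], (1, N)),
              np.tile(x_t[None, :], (M, 1))], axis=-1)
Y = np.einsum('mnp,mnp->mn', X, B) + rng.normal(0.0, 1.0, (M, N))  # tau^{-1} = 1
```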
Figure 6: Estimated coefficients for the third covariate at location #8; solid curves represent the posterior mean and shaded areas denote the 95% CI.
As in the previous simulations, we add a normally distributed non-significant random covariate to X when fitting the model. We replicate each simulation 25 times and run the MCMC sampling with K_1 = 1000 and K_2 = 500.

Results
Table 2 presents the accuracy (mean ± std) of the estimated coefficient tensor B over the 25 simulations. We also report the average computing time per MCMC iteration. As we can see, BKTR with rank R = 10 achieves accuracy similar to that of STVC, and even exhibits superior performance in the presence of increased data noise or enlarged kernel length-scales. This means that BKTR is more robust to noisy data, and the low-rank assumption fits the spatiotemporal relations better when they vary more smoothly/slowly. Additionally, BKTR clearly outperforms BTR, which confirms the importance of introducing GPs to encode the spatial and temporal dependencies. These results suggest that the proposed kernelized CP decomposition can approximate the true coefficient tensor relatively well even if the true B does not follow an exact low-rank specification. In terms of computing time, we see that BKTR is much faster than STVC. Due to the high cost of STVC, it becomes infeasible to analyze a data set of even moderate size, e.g., M, N = 100 and P = 10.
Figure 6 shows the estimation results of the three approaches for one chain/simulation when τ^{-1} = 1 and ϕ = γ ∈ {1, 2, 4}. We plot an example for the third covariate at location #8 (m = 8, p = 3) to show the temporal variation of the coefficients estimated by the three methods. As can be seen, BKTR and STVC generate similar results, and most of the true coefficients are contained in the 95% CI for these two models. Although BTR can still capture the general trend, it fails to reproduce the local temporal dependency due to the absence of spatial/temporal priors. To further verify the mixing of the GP hyperparameters, we run the MCMC for BKTR for 5000 iterations and show the trace plots and probability distributions (after burn-in) for ϕ and γ obtained when τ^{-1} = 1 and ϕ = γ ∈ {1, 2, 4} in Figure 7(a) and (b), respectively. Figure 7(c) shows the effective sample size (ESS) of the kernel hyperparameters obtained by BKTR in the different settings, where in each case the mean ESS over the 25 simulations is given. We can see that the Markov chains for the kernel hyperparameters mix fast and well, and a few hundred iterations should be sufficient for posterior inference.

Simulation setting
To evaluate the performance of BKTR in a more realistic setting, where the data size is large and the relationships are generated without a low-rank structure, we further create a relatively large synthetic data set of the same size as in Simulation 1, i.e., M = 300 locations and N = 100 time points. We generate the covariates X ∈ R^{300×100×5} (P = 5) according to Eq. (4.1). The coefficients are simulated without a low-rank structure, following B ∼ TN_{M×N×P}(0, σ_s^2 K_s, σ_t^2 K_t, Λ_w^{-1}). The parameters {K_s, K_t, Λ_w} and ϵ are specified as in Simulation 1, i.e., K_s and K_t are computed from a Matérn 3/2 and an SE kernel, respectively, with hyperparameters {σ_s^2 = σ_t^2 = 2, ϕ = γ = 1}, Λ_w ∼ W(I_P, P), and τ^{-1} = 1. We randomly select 50% of the generated data Y as observed responses. To assess the effect of the rank specification, we try different CP rank settings R ∈ {5, 10, ..., 60} and compute the MAE and RMSE for both B and y_{Ω^c}. We replicate the experiment with each rank setting 15 times and run the MCMC sampling with K_1 = 1000 and K_2 = 500.

Results
Figure 8 shows the temporal behaviors and spatial patterns of the estimated coefficients of four covariates in one simulation with R = 40. As one can see, BKTR effectively reproduces the true values with acceptable errors. Figure 9 shows the effect of the rank. We see that choosing a larger rank R gives better accuracy in terms of both MAE_B/RMSE_B and MAE_{y_{Ω^c}}/RMSE_{y_{Ω^c}}. However, the accuracy gain becomes marginal when R > 30. This demonstrates that the proposed Bayesian treatment offers a flexible solution in terms of model estimation.

Bike-sharing Demand Modeling

Data description
In this section, we perform local spatiotemporal regression on a large-scale bike-sharing trip data set collected from BIXI, a docked bike-sharing service in Montreal, Canada. BIXI operates 615 stations across the city, and the yearly service window runs from April to November. We collect the number of daily departures for each station over N = 196 days (from April 15 to October 27, 2019). We discard the stations with an average daily demand of less than 5 and only consider the remaining M = 587 stations. The matrix Y contains 13.0% unobserved/corrupted values in the raw data, and we keep only the rest as the observed set Ω. Given that the collected daily departures can have different variances across spatial locations, we take the logarithm of the data matrix Y to make the variances more consistent and obtain nearly Gaussian-distributed data. We then model this transformed data set under the assumption of a Gaussian likelihood with a homoscedastic residual process.
We follow the analysis in Faghih-Imani et al. (2014) and Wang et al. (2021) and consider two types of spatial covariates: (a) features related to cycling infrastructure, and (b) land-use and built-environment features. In particular, similar to the operations in Wang et al. (2021), the spatial covariates are collected using the intersections of 250-meter circular buffers and Thiessen polygons, so that predictors are obtained from non-overlapping areas. For the temporal covariates, we mainly consider weather conditions and holidays. Table 3 lists the 13 spatial covariates (x_s^p ∈ R^M) and 5 temporal covariates (x_t^p ∈ R^N) used in this analysis. The final number of covariates is P = 13 + 5 + 1 = 19, including the intercept. In constructing the covariate tensor X, we follow the same approach as in the simulation experiments. Specifically, we set X(:, :, 1) = ones(M, N), and fill the remaining tensor slices with 1_N^⊤ ⊗ x_s^p (p = 2 : 14) for the spatial covariates and 1_M ⊗ (x_t^p)^⊤ (p = 15 : 19) for the temporal covariates. Given the difference in magnitudes, we normalize all covariates to [0, 1] using min-max normalization. Since we use a zero-mean GP to parameterize β, we normalize the departure trips by removing the global effect of all covariates through linear regression and consider y to be the unexplained part. The coefficient tensor B contains more than 2 × 10^6 entries, preventing us from using STVC directly.
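A sketch (with hypothetical variable names) of assembling X from the spatial and temporal covariate tables as described above:

```python
# A sketch of constructing the covariate tensor for the BIXI analysis.
import numpy as np

def build_covariate_tensor(S, T):
    """S: (M, 13) spatial covariates; T: (N, 5) temporal covariates.
    Assumes no covariate column is constant (for the min-max step)."""
    M, N = S.shape[0], T.shape[0]
    P = 1 + S.shape[1] + T.shape[1]                     # 19 with the intercept
    X = np.empty((M, N, P))
    X[:, :, 0] = 1.0                                    # intercept slice ones(M, N)
    for p in range(S.shape[1]):                         # spatial: repeated over days
        X[:, :, 1 + p] = np.tile(S[:, [p]], (1, N))
    for p in range(T.shape[1]):                         # temporal: repeated over stations
        X[:, :, 1 + S.shape[1] + p] = np.tile(T[:, p][None, :], (M, 1))
    for p in range(1, P):                               # min-max normalize to [0, 1]
        lo, hi = X[:, :, p].min(), X[:, :, p].max()
        X[:, :, p] = (X[:, :, p] - lo) / (hi - lo)
    return X
```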

Experimental setup
For the spatial factors, we use a Matérn 3/2 kernel as a universal prior, i.e., k_s(s_m, s_{m'}) = (1 + √3 d/ϕ) exp(−√3 d/ϕ), where d is the Euclidean distance between locations s_m and s_{m'}, and ϕ is the spatial length-scale. The Matérn class of kernels is commonly used as a prior kernel function in spatial modeling. For the temporal factors, we use a locally periodic correlation matrix obtained by taking the product of a periodic kernel and an SE kernel, i.e., k_t(t_n, t_{n'}) = exp(−2 sin^2(π|t_n − t_{n'}|/T)/γ_1^2) exp(−(t_n − t_{n'})^2/(2γ_2^2)), where γ_1 and γ_2 denote the length-scale and decay-time of the periodic component, respectively, and we fix the period at T = 7 days. This specification encodes a weekly temporal pattern that can change over time and allows us to characterize both the daily variation and the weekly trend of the coefficients. We set the CP rank R = 20 and run MCMC with K_1 = 1000 and K_2 = 500 iterations.
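A sketch of the locally periodic temporal kernel (standard periodic × SE parameterization; the hyperparameter values below are placeholders, as γ_1 and γ_2 are learned by MCMC):

```python
# A sketch of the locally periodic temporal kernel used for BIXI.
import numpy as np

def local_periodic(t, gamma1, gamma2, period=7.0):
    """Periodic kernel (period = 7 days) multiplied by an SE decay kernel."""
    d = np.abs(t[:, None] - t[None, :])
    periodic = np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / gamma1 ** 2)
    decay = np.exp(-d ** 2 / (2.0 * gamma2 ** 2))
    return periodic * decay

t = np.arange(196, dtype=float)                    # N = 196 days
K_t = local_periodic(t, gamma1=1.0, gamma2=30.0)   # placeholder hyperparameters
```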

Results
Figures 10 and 11 show several examples of estimated coefficients, with temporal plots of B(s_m, :, :) and spatial maps of B(:, t_n, :), respectively. As we can see in Figure 10, the temporal variations of the coefficients for both the spatial and temporal covariates are interpretable. The coefficients (along with CIs describing their uncertainty) allow us to identify the most important factors for each station at each time point. For example, we observe a similar variation over a week and a general long-term trend from April to October. Furthermore, the magnitude of the coefficients can be larger during summer (July/August) than at the beginning and end of the operation period (e.g., for major roads, walkscore, and mean temperature), which is as expected as the outdoor temperature drops. Overall, for the spatial variables, the positive effect of walkability and the negative effect of the length of major roads indicate that bicycle demand tends to be higher in densely populated neighborhoods. For the temporal covariates, the precipitation and humidity variables both have negative coefficients, implying that people are less likely to ride bicycles in rainy/humid periods. The spatial distributions of the coefficients in Figure 11 also demonstrate consistent estimates, with the effects of covariates tending to be more pronounced in the downtown area, where coefficients have larger absolute values. Again, one can further explore the credible intervals to evaluate the significance of each covariate at a given location. In addition, the estimated variance τ^{-1} of ϵ has a posterior mean of 6.19 × 10^{-2} with a 95% CI of [6.15, 6.28] × 10^{-2}. This variance is much smaller than the variability of the data, validating the good fit of the proposed BKTR model. In summary, BKTR produces interpretable results for understanding the effects of different spatial and temporal covariates on bike-sharing demand. These findings can help planners evaluate the performance of bike-sharing systems and update/design them.

Spatial interpolation
Thanks to the introduction of GPs, BKTR is able to perform spatial prediction, i.e., kriging, for unseen locations, and such results can be used to select sites for new stations. To validate this spatial prediction/interpolation capacity, we randomly select 30% of the locations in the BIXI data set to be missing and compute the prediction accuracy for these missing/testing data (y_{Ω^c}). Given that STVC is not applicable to a data set of this size, we compare BKTR (rank R = 20) with SVC (the spatially varying coefficient process model) (Gelfand et al., 2003), which omits the modeling of the time dimension. A linear regression model with static β is also considered as a baseline to demonstrate the role of incorporating spatiotemporally varying coefficients in such prediction tasks. We also test these methods in the case where both 30% of the locations and 30% of the time periods are missing. The prediction errors are given in Table 4, where we report the MAE, RMSE, and R^2 on y_{Ω^c}. R^2 is used to illustrate the fitting performance of the model, and its definition is also provided in Appendix C. BKTR clearly offers better performance, with lower estimation errors in both missing scenarios, indicating its effectiveness in spatial prediction for real-world spatiotemporal data sets that often contain complicated missing/corruption patterns.

Identifiability and convergence of the model
In the regression problem studied in this paper, the kernel/covariance hyperparameters, i.e., {ϕ, γ}, characterize the correlation structure of the data, and the coefficients, i.e., B, reveal the spatiotemporal processes of the underlying associations. Thus, we consider the convergence of the kernel hyperparameters and the identifiability of the coefficients to be crucial. Note that the identifiability of the latent factors obtained from decomposing B is not that important; for instance, the model is invariant to applying the same column permutation to the three factor matrices. From the results of the simulation experiments (see Figure 7), it is clear that the Markov chains for the kernel hyperparameters converge fast and well under the proposed BKTR model.

The superiority of the BKTR framework
As we can see from the tests with different rank settings and observation rates in the simulation experiments (see Figures 4 and 9), BKTR is able to provide high estimation accuracy and valid CIs even with a much larger, over-specified rank, and it effectively estimates the coefficients and the unobserved output values when only 10% of the data is observed. This indicates the advantage of the proposed Bayesian low-rank framework. Since we introduce a fully Bayesian sampling treatment for the kernelized low-rank tensor model that is free from parameter tuning, BKTR can estimate the model parameters and hyperparameters even when only a small number of observations are available. Thus, the model consistently offers reliable estimation results, implying its effectiveness and usability for real-world complex spatiotemporal data analysis.
Another benefit of the proposed framework is the greatly improved computational efficiency. As mentioned in Section 3.5, the computational cost of BKTR is substantially reduced compared to the STVC model. According to the experiments conducted, BKTR is capable of handling regression problems containing up to millions of coefficients.

Conclusion
This paper introduces an effective solution for large-scale local spatiotemporal regression analysis. We propose parameterizing the model coefficients using a low-rank CP decomposition, which greatly reduces the number of parameters from M × N × P to R(M + N + P). In contrast to previous studies on tensor regression, the proposed BKTR model goes beyond the low-rank assumption by integrating GP priors to characterize the strong local spatial and temporal dependencies. The framework also learns a low-rank multi-linear kernel, which is expressive and able to provide insights into nonstationary and complicated processes. Our numerical experiments on both synthetic and real-world data suggest that BKTR can reproduce local spatiotemporal processes efficiently and reliably.
There are several directions for future research. In the current model, the CP rank R needs to be specified in advance. One can make the model more flexible and adaptive by introducing a reasonably large core tensor with a multiplicative gamma process prior, as in Rai et al. (2014). In terms of GP priors, BKTR is flexible and can accommodate different kernels (w.r.t. both functional form and hyperparameters) for different factors, as in Luttinen and Ilin (2009). The combination of different kernels can also produce richer spatiotemporal dynamics and multiscale properties. In terms of computation, one can further reduce the cost of GP learning (e.g., O(M^3) for a spatial kernel) by introducing sparse approximation techniques such as inducing points and predictive processes (Quinonero-Candela and Rasmussen, 2005; Banerjee et al., 2008). Lastly, we can extend the regression model to handle binary- and count-valued data by using proper link functions. In this case, the challenge is that we no longer have analytical posteriors for the latent factors; a potential solution is to use elliptical slice sampling (Murray et al., 2010) to sample the latent variables.

A Calculation of the marginal likelihood
We apply the Woodbury matrix identity to K_{y|ϕ} = H_U K_U H_U^⊤ + τ^{-1} I_{|Ω|} and calculate y_Ω^⊤ K_{y|ϕ}^{-1} y_Ω in Eq. (3.18) as follows:

y_Ω^⊤ K_{y|ϕ}^{-1} y_Ω = τ y_Ω^⊤ y_Ω − τ^2 y_Ω^⊤ H_U A^{-1} H_U^⊤ y_Ω, where A = K_U^{-1} + τ H_U^⊤ H_U.

By the matrix determinant lemma, log |K_{y|ϕ}| = log |A| + log |K_U| − |Ω| log τ. Therefore, the log marginal posterior of ϕ is computed as:

log p(ϕ | y_Ω, V, W, τ, X) = −(1/2)(τ y_Ω^⊤ y_Ω − τ^2 y_Ω^⊤ H_U A^{-1} H_U^⊤ y_Ω) − (1/2)(log |A| + log |K_U| − |Ω| log τ) + log p(ϕ) + const.
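A toy numerical check of the two identities (random H, K_U, and τ):

```python
# Verify the Woodbury identity and determinant lemma used above.
import numpy as np

rng = np.random.default_rng(7)
n, q = 8, 5
H = rng.standard_normal((n, q))
A0 = rng.standard_normal((q, q))
K_U = A0 @ A0.T + np.eye(q)
tau = 2.0

K_y = np.eye(n) / tau + H @ K_U @ H.T
A = np.linalg.inv(K_U) + tau * H.T @ H

# inverse: K_y^{-1} = tau I - tau^2 H A^{-1} H^T
assert np.allclose(np.linalg.inv(K_y),
                   tau * np.eye(n) - tau ** 2 * H @ np.linalg.inv(A) @ H.T)

# determinant: log|K_y| = log|A| + log|K_U| - n log(tau)
_, ld_Ky = np.linalg.slogdet(K_y)
_, ld_A = np.linalg.slogdet(A)
_, ld_KU = np.linalg.slogdet(K_U)
assert np.isclose(ld_Ky, ld_A + ld_KU - n * np.log(tau))
```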

B Sampling algorithm for kernel hyperparameters
The detailed sampling process for the kernel hyperparameters is provided in Algorithm 2.

C Evaluation metrics
The metrics for assessing the estimation accuracy of the coefficients B (MAE_B/RMSE_B) and of the unobserved output/response entries y_{Ω^c} (MAE_{y_{Ω^c}}/RMSE_{y_{Ω^c}}/MAPE_{y_{Ω^c}}) are defined in the standard way, i.e., as the mean absolute (percentage) error and root mean square error between the true values and the corresponding posterior mean estimates. CVG and INT are computed from the 95% credible intervals, CRPS from the posterior samples, and R^2 denotes the coefficient of determination computed on the test entries.

Figure 3: (a) Estimated coefficients at location m = 3, i.e., B(3, :, p = 1, 2, 3, 4, 5, 6), where the blue lines and shaded areas are posterior means with 95% CI, and the black dotted lines denote the true values. (b) Interpolated spatial surfaces of the true values, estimated values (posterior mean), and absolute estimation errors of the coefficients at time point n = 20, i.e., B(:, 20, p = 1, 2, 4, 6); black circles denote the positions of the sampled locations.

Figure 4: Sensitivity of BKTR in Simulation 1: (a) effect of the rank R; (b) effect of the observation rate |Ω|/MN. For each case, the figure shows boxplots and mean values of the corresponding metrics calculated from the 40 replications.

Figure 9: Effect of the rank in Simulation 3. The figure shows the boxplots and means of the estimation errors on (a) B and (b) y_{Ω^c} with respect to the tensor rank.

Figure 10: Temporal illustration of the BKTR-estimated coefficients for the log-transformed BIXI demand data. The plots show the posterior mean with 95% CI.
Figure 11: (a) Spatial maps of the coefficient posterior means. (b) Spatial maps of the coefficient posterior standard deviations.
This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Mengying Lei would like to thank the Institute for Data Valorization (IVADO) for providing a scholarship to support this study.

Table 1: Performance of BKTR (R = 10) on Simulation 1, with 50% of Y missing.

Table 3: Summary of the independent variables.