A Bayesian Factor Model for Spatial Panel Data with a Separable Covariance Approach

. A hierarchical Bayesian factor model for multivariate spatially and temporally correlated data is proposed. This method searches factor scores in-corporating a dependence within observations due to both a geographical and a temporal structure and it is an extension of a model proposed by Mezzetti (2012) using the results of a separable covariance matrix for the spatial panel data as in Leorato and Mezzetti (2016). A Gibbs sampling algorithm is implemented to sample from the posterior distributions. We illustrate the beneﬁt and the performance of our model by analyzing death rates for diﬀerent diseases together with some socio-economical and behavioural indicators and by analyzing simulated data.


Introduction
Factor models are often used to reduce the complexity of data and effectively extract relevant information. The typical assumption of such models is that the observed values of a set of multivariate variables is the result of the action of some unobservable latent variables, or factors, that can also be used to reduce a large number of observables into a smaller number of latent factors. In its original definition, a factor model identifies a set of common factors that are able to reduce the dimension of the response vector for all cross-sectional or time units. A thorough review can be found in Bartholomew et al. (2011).
The first formal definition of a Bayesian approach to factor analysis can be traced back at least to Press (1972) and Kaufman and Press (1973), although a few earlier works did account for subjective information, limitedly to specific problems (see also Martin and McDonald (1975), Press and Shigemasu (1989) and Press and Shigemasu (1997)). These first models were rather restricted, with the same limitations as the classical factor model.
The introduction of iterative Markov chain Monte Carlo (MCMC) tools paved the way to the development of fully Bayesian approaches to factor analysis such as, for example, in Polasek (1997), Arminger and Muthén (1998), Aguilar and West (2000), Lopes and West (2004) and Rowe (2003). These new models allowed for heterogeneities across time of either the latent factors or the factor loadings.
The application of factor models to spatial data, soon led to the introduction of further generalizations, that allowed the so called spatial factor models to cope with specific characteristics of spatial data in terms of heterogeneity and autocorrelation. Wang and Wall (2003) use a common spatial factor model to study cancer risk across counties in Minnesota; Hogan and Tchernis (2004) use a hierarchical Bayesian factor model to account for spatial correlation between locations, where the spatial correlation is introduced on the latent scale variable and can be specified either marginally or conditionally; Lopes and West (2004) and Lopes et al. (2008) consider spatio-temporal models, where temporal and spatial dependence are modeled by latent factors and factor loadings respectively; Mezzetti (2012) generalizes the model by Rowe (2003) and Rowe and Press (1998) by proposing a hierarchical Bayesian factor model for multivariate spatially correlated data. Ren and Banerjee (2013) define a low-rank spatial factor model, to deal with large vectors of outcomes for a large number of locations.
Most of these generalized Bayesian factor models exploit the flexibility of hierarchical Bayesian modeling, that allows to take into account both heterogeneities in factors (loadings and/or latent) and possible correlation between observations. The latter is obtained through the use of proper prior information to simplify the covariance structure.
In this paper we propose a hierarchical Bayesian factor model for multidimensional spatio-temporal data that is able to take into account heterogeneity in latent factors (both in time and space dimensions). This is obtained through a separable covariance structure, assuming a multiplicative effect of the covariance among variables and among regions, combined with a Vector Autoregression (VAR) specification for the latent vectors at every location. Similarly to Lawson et al. (2012), Leorato and Mezzetti (2016) or Chan (2020), the separability of the covariance structure is obtained by the introduction of a Kronecker product. Here the separability is assumed between the N -dimensional spatial covariance matrix and the covariance matrix of the K response variables. Both matrices are given an arbitrary form, thus avoiding one of the major problems with specifications of the covariance structure, which is a sore point in spatial data modeling (see the discussion in Leorato and Mezzetti (2016) and references therein). Further, as noted by Chan (2020), the definition of a Kronecker error covariance structure is also compatible to various commonly used non-Gaussian models.
Most of the literature dealing with factor models for spatio-temporal data does not allow the dynamic factors to be location-specific or does not consider multivariate observations repeated in time on several locations. One of the lines of research where there has been a growing interest in the investigation of the determinants of simultaneous spacetime variations of multiple outcomes is health and medicine. For example, Tzala and Best (2008) use a Bayesian hierarchical latent factor model for spatially and temporally correlated multivariate epidemiological data. Their idea is to exploit factor analysis to provide a flexible framework to map the space-time evolution of disease-specific trends in cancer risks. Because of the nature of data, the Poisson model is used to describe the distribution of the observed deaths in all types of cancer analyzed. In these types of applications, the identification of spatio-temporal patterns in a small number of factors that are able to synthesize the variations of a vector of several variables, potentially correlated in a complex way, provides important information and insight on the mechanisms playing a role in the evolution of one or more diseases, reinforcing the evidence of common underlying risk factors. As pointed out by Tzala and Best (2008), accounting for spatio-temporal heterogeneity in a multivariate factor analysis, is likely to "lead to improved precision for the estimation of the underlying disease risks by borrowing strength from other diseases as well as close areas or time points".
Factor models for multivariate space-time data are also suitable to cope with environmental and ecological applications. For example, Calder (2007) propose a dynamic factor process convolution model. The approach of our paper shares some similarities with the work of Calder (2007), such as the Bayesian estimation approach and the dynamic feature of the factors. However, there are important differences: first, we don't rely on the convolution of an underlying process through a kernel smoother K, because the main purpose of our approach is the reduction of the dimension of the "types of variables" only, allowing for location and time specific factors. Therefore we don't need to address the delicate problem of the choice of tuning parameters such as the resolution of the grid, kernel and bandwidth selection. A second difference consists in the covariance matrix of the factors: our factor component F (t) plays a similar role to Kx t in Calder (2007), but while the variance of the first include both a spatial (Φ) and a "type-of-variable" component (Ψ), the variance of the latter only accounts for possible heteroskedasticity in the types of variables, through var (V ec( t Despite the potential application of multivariate factor models for space-time data in other areas of research, such as, for example, in multivariate macroeconomic time series, to the best of our knowledge, limited attention has been drawn to this kind of generalization aside from the aforementioned work by Tzala and Best (2008) and Calder (2007). Few models pay close attention to the spatio-temporal structure of the error terms, typically assuming a strict diagonal structure for the error covariance (as in Fan et al. (2008)).
Therefore, a first motivation of our work is to fill this gap in the literature, by proposing a hierarchical Gaussian-type model, which has the appealing feature of being suitable to a broader range of applications compared to Tzala and Best (2008). Moreover, the use of a Kronecker product in the definition of the covariance between variables and between locations has strong computational advantages, besides those already mentioned relative to the flexibility of the model, and appears to be fairly little used in this framework. A further novelty in our approach is the way we cope with the identification of the factors, which is slightly different from the standard arguments following by Geweke and Zhou (1996): instead of setting a fully informative prior for a subset of the factor loading matrix, we allow for a diffuse conjugate prior and impose the necessary restrictions on the posterior density.
As in Tzala and Best (2008), our motivating application aims at detecting possible common sources that might have a hidden impact on one or more diseases and their eventual dynamics in space and time as well as their mutual interaction. Section 2 describes in more detail the motivating application and the data.
In Section 3, after a brief review of Bayesian factor analysis, we present our framework, we discuss the prior distributions and the identification strategy. Section 4 is dedicated to the computation of the posterior distributions, while Section 5 is dedicated to the estimation of the number of factors. In Section 6 some simulations are presented to investigate the performance of the proposed model. Section 7 presents the results of the application, with a Subsection 7.1 on sensitivity analysis. Finally, Section 8 concludes.

Data
Several authors have tried to use factor analysis to capture hidden relationships between a disease and possible risk factors (Das et al., 2010). Analysis of local variations of disease rates helps identifying latent sources of heterogeneity and uncovers the existence of risk factors that contribute to formulate etiological hypotheses. Considering several diseases simultaneously permits to detect common determinants (Yanai et al., 1978). Moreover, taking advantage of the extra information obtained from different diseases, it allows to improve the precision of the estimates. On the other hand, socio-economical as well as behavioral features tend to have a geographical pattern in Europe: as for example, Scandinavian countries evidence a low unemployment rate and adopt similar welfare programs, while Mediterranean countries share similar dietary intakes. These and other characteristics are potential determinants of one or more diseases but their interrelation is unclear as well as the potential existence of dynamic behaviors such as, for example, local persistence in time or global convergence (Flood et al., 2008).
Following this idea, we perform a factor analysis on the death rates for different diseases, on one hand, and, on the other hand, some socio-economical and behavioural indicators. We aim at highlighting the existence of common spatial patterns in the determinants of morbidity, and its eventual variations through time.
We took the data from the Demographic and Health Surveys (DHS) Program, ICF International and European Health for All databases, World Health Organization (WHO). Data collected refer to the period 1970 − 2015 and to the following countries: Albania, Armenia, Austria, Azerbaijan, Belarus, Belgium, Bosnia and Herzegovina, Bulgaria, Croatia, Czech Republic, Denmark, Estonia, Finland, France, Georgia, Germany, Greece, Hungary, Iceland, Ireland, Italy, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Luxembourg, Netherlands, Norway, Poland, Portugal, Romania, Serbia, Slovakia, Slovenia, Spain, Sweden, Switzerland, Turkey, Turkmenistan, Ukraine, United Kingdom and Uzbekistan.
Socio economic variables such as: average income measured in dollars and unemployment rate are recorded. For each country few behavioral variables are recorded, in particular: intake of energy from fat, intake of energy from protein and adult per capita consumption of recorded alcohol. Fertility rate is observed for each country. Different health status indicators are also recorded, age standardized death rates (SDR) for 100000 inhabitants for the following diseases: smoking related, alcohol related, malignant neoplastic, female breast cancer, tuberculosis, infection disease, circulatory system disease, ischaemic heart disease, liver disease and all causes mortality rate. Finally country specific life expectancy is recorded. of the observations; in order to avoid complicated structure, the missing values are just filled with a plausible one, as suggested in Pigott (2001). Both mean for the cases of the observed variable, or the value relative to the previous year provide the same results.
Figures 1-5 show the distribution over time and space of five variables among the seventeen variables considered: alcohol consumption, fertility rate, all cause mortality, annually average income and unemployment rate. A temporal pattern appears to be very strong. Spatial pattern seems to be more difficult to capture. Distribution of standardized mortality rate for all cause of mortality looks like signing a contrast between Western and Eastern countries. On the other hand, a higher value of fertility rate is observed in Northerner countries and France, while France is closer to neighboring countries regarding alcohol consumption distribution. Average income distribution determines three main areas: North, South and Eastern Europe. Unemployment rate shows a weaker spatial pattern with respect to the other variables mapped.
The following variables: fat intake, alcohol related disease SDR, all causes SDR, breast cancer SDR, all cancer SDR, circulatory system SDR, infection disease SDR, ischaemic SDR, liver related disease SDR, smoking related disease SDR, tuberculosis SDR, average income and unemployment rate are considered on a logarithm scale to better be approximated by a Gaussian distribution. We can observe three levels of correlation in the data: between time points, between countries and between variables, displayed in Figures 6-8. The three types of correlation are all very strong. We model parametrically only the first one. Overall, all variables show evidence of temporal correlation within variables and countries (see Figure 6). Among the variables considered, the spatial structure assumes different forms, even though some common spatial pattern emerges and some correlation between countries results quite strong (see Figure 7).
Since variables are highly correlated (see Figure 8), it may be interesting to identify a small number of common factors that are able to explain the majority of information within the sample covariance matrix. Each factor will result related to a particular subset of variables, and therefore the definition of factors itself suggests their interpretation. To accomplish this task, the estimation of a factor model is a natural choice. Due to the noticeable heterogeneity of the variables in space and time, it is however necessary to allow factors to vary in both dimensions. Moreover, to take into account for the evident temporal autocorrelation among almost all variables as well as for the possibility that some of the spatial structure is induced by spatial spillovers, we want to allow the country-year-specific factors to be spatially and serially correlated. Further, in the model illustrated in Section 3, the correlation within and between regions are assumed to have a multiplicative effect and separability in the covariance is modelled, while temporal effect on factor scores is assumed additively.

Model Proposed
Let X(t) be a matrix (N × K) of observations on K variables over N countries, at time t and let X = [X(1) , X(2) , · · · , X(T ) ] be the NT × K dimensional matrix of K observed variables, for the N countries at T time periods. Following Rowe (2003), we write the model for the K observed variables at time t for location n, as where μ(t) is a K-dimensional unobserved population mean vector at time t, Λ the K × m matrix of factor loadings, f n (t) is the m-dimensional vector of common factor scores for the n-th region at time t, and ε n (t) is a vector of disturbance terms of the n-th region on the K variables at time t. The parameters (μ(t), Λ, f(t), t ≥ 1) in the model are unknown and have to be estimated. Usually, the estimation of μ(t) does not present problems, as typically coincides with the sample mean (see Lawley (1940)), therefore, without lack of generality, we can assume μ(t) = 0. While factor scores are allowed to vary across time and countries, the loadings have dimension K × m. The approach of keeping the loadings fixed across time and regions, also followed in Tzala and Best (2008) or Calder (2007), favors the description of latent factors and their interpretation as well as the identification of local trends or country-specific evolution of the scores. In a classical factor analysis model the errors ε n (t) are assumed to be normally distributed and independent, conditional on the common factors: thus the common factors are the only source of dependence among the K variables.
Classical and Bayesian factor models assume independent rows of the observation matrix. This assumption turns out to be not reasonable for spatially correlated data. In order to show our proposal to account for possible spatial correlation, let us rewrite the classical factor model as illustrated in (3.1) by rearranging the N × K terms in X(t) in a NK × 1 vector (where the first K elements are the K-observations relative to first region at time t etc.). We denote by vec(X(t)) the vector obtained by this operation, and get where ⊗ indicates the Kronecker product between two matrices. By our specification, the residuals are correlated between countries and variables: where the two positive definite matrices Φ and Ψ have sizes, N ×N and K×K and induce spatial correlation and correlation among variables, respectively. Traces are constrained to be fixed and proportional to the matrix size, for identification (see also Leorato and Mezzetti (2016)).
The n-th row x n (t) of the matrix X(t) contains the K observations of variables on region n at time t. Under the separable structure defined by (3.2) and (3.3), var(x n (t)|Φ, Ψ, m, F, Λ) = σ 2 ε φ nn Ψ and the covariance between rows n and n of X is σ 2 ε φ nn Ψ, while the covariance between columns k and l of X(t) (representing observations of variables k and l on all the N regions at time t) is σ 2 ε ψ kl Φ. We rewrite (3.2) and (3.3) as: showing that the distribution of the observations conditional on the factors admits autocorrelation on space and between variables.
Besides that, we want to allow for both a time autocorrelation and a spatial dependence on factors scores F . Specifically, the factor scores F (t) follow an autoregressive model, while we allow spatial autocorrelation in the columns of F (t) (for computational convenience we assume the spatial correlation matrix to be equal to Φ, although in   Supplementary Material (Leorato and Mezzetti, 2020) we compute the posterior distribution using an arbitrary positive definite matrix). Further, we permit correlation between factor loadings of different variables through proper definition of prior distributions. Appropriate prior distributions, presented in Section 3.3, are used to account for all these dependencies.

Identification Strategy
The model in (3.1) presents a well known problem of identification of the parameters. The question of identifiability of factor models has been widely discussed in the literature, and several different proposals have been suggested, each with its specific strengths and drawbacks. This problem is generally solved through the introduction of different sets of constraints on the parameters.
Identification restrictions are necessary due to rotation invariance of factor models: model (3.2) cannot be distinguished from a model where both factor loadings and common factors are subject to a rotation. In fact, given a nonsingular m × m matrix B, transforming F (t) into F (t)B and Λ to ΛB gives the same model as (3.2), for every B: Practically, this amounts to m 2 redundant parameters (the dimension of B) and the main strategy proposed in the literature has traditionally been the introduction of the exact same number of restrictions on the parameters. There are several ways to impose those restrictions, and it is not our purpose here to cover all of them (see Bai and Li (2012), Bai (2003) and references therein for a detailed review on this topic). Sometimes all m 2 restrictions are imposed on the factor loadings matrix, leaving the common factors unrestricted. Another possibility is to share the constraints among the sample covariances of the factor loadings and of the common factors.
The question of which one of the set of restrictions is to be chosen is far from being settled. In a Bayesian estimation approach, whenever the computation of the posterior distribution relies on some MCMC tool, the use of one or another identification restriction is more likely to produce relevant differences.
One quite popular choice is to assume the factor loading matrix to be lower triangular (with or without all ones in the main diagonal ) and to add the remaining restrictions on the covariance matrix of the factor scores.
One of the main drawbacks with this identification strategy is that the posterior distributions, and therefore the interpretation of the results, in general depend heavily on the ordering of the observations. Several authors (see for instance Aßmann et al. (2016), Frühwirth-Schnatter and Lopes (2018) and Conti et al. (2014)) have proposed alternative methods to overcome these liabilities. Nonetheless, this remains one of the most used identification strategies, one of the reasons being the fact that the structure of the loading matrix itself can ease the interpretation of factors.
For this reason, we follow the approach based on a lower triangular structure of Λ: (3.5) However, we distinguish from the standard arguments on one respect: to the best of our knowledge, most of the authors imposing identification restrictions of this form, follow the same ex-ante argument, proposed by Geweke and Zhou (1996) (see for example Aguilar and West (2000), Calder (2007) or Bai and Wang (2015)). This approach consists in the definition of a prior distribution only on the unconstrained elements of the factor loadings, while, for the others, a Dirac distribution (either concentrated in 0 or 1) is assumed. In order to take advantage of the separability of covariance matrices also in the computation of the posterior of Λ, we implement the constraint in a different way. In fact, we impose the constraints by conditioning on the unrestricted posterior distribution of the whole Λ. In this way, we can avoid to resort to a fully informative (deterministic) prior distributions for a subset of the components of Λ. The computation of the constrained posterior distribution of Λ is given in detail in Section 4. In the Supplementary Material we show that the posterior distributions obtained by using our method and the approach by Geweke and Zhou (1996) do not coincide even for very simple factor models.

Prior Distributions
Following Leorato and Mezzetti (2016), we will use natural conjugate families of prior distributions for the parameters ρ, F , Λ, σ 2 ε , Ψ and Φ. Thus for σ 2 ε , we set an inversegamma prior distribution: (3.6) As stated in Section 3.1, the factor scores F (t) follow a multivariate autoregressive model of order h ≥ 1. Let R j = diag(ρ j1 , ρ j2 , . . . , ρ jm ), j ≤ h and let us define the matrix F h (t) = (F (t − 1), . . . , F (t − h)). Then, The errors ν i (t), ν j (t) are independent for all i = j and are not serially correlated, but they are spatially correlated, through the matrix Φ. In Supplementary Material, we compute the posterior distribution of F , allowing for an arbitrary matrix Γ to govern the spatial correlation matrix, in the particular case when R = ρI m . However, in the application of the general model, we use Γ = Φ to exploit the computational gain due to simplification of the posterior distribution. A similar remark holds for the prior variance of Λ, given below.
We can write the variance of the vector obtained from the matrix F as a T × T block matrix Σ F , whose (t, s) − th block is where C |t−s| (R) is a diagonal matrix whose j-th element is the |t − s|-th order autocorrelation of an AR(h) process with coefficients (ρ 1j , . . . , ρ hj ), j = 1, . . . , m.
For example, if h = 1, R = R 1 and As for F , we use a normal prior for Λ, with Therefore, by defining the vectorization λ = vec(Λ), the prior distribution of λ is Gaussian, with mean λ 0 = vec(Λ 0 ) and variance H ⊗Ψ. The hyperparameter H is fixed, while the matrix Ψ, that allows for correlation between the loadings of different variables, is estimated.
Prior distribution of F reflects our hypothesis of factors with a temporal and a spatial dependence. Note that with our approach to identification, we don't need to specify separate Gaussian prior distributions for all non null terms of the rows of Λ, but we can specify Λ, conditionally on Ψ, as a unique Gaussian random matrix with parameters Λ 0 and H.
Both Ψ and Φ follow from a priori inverted Wishart distributions with scale matrices A and G respectively: Both A and G are symmetric and positive definite matrices, while γ and ν are two scalars. We adopt the following parametrization: E(Ψ) = A/ν and E(Φ) = G/γ.
Since the shape parameters (ν and γ) determine also the level of information of the prior distributions, we add parameters κ ν and κ γ such that: A = κ ν νI K and G = κ γ γI n . Thus, the lower the parameters (κ ν and κ γ ), the less informative the distribution. Furthermore we fix the trace of the two matrices respectively as: The prior distribution for each ρ ij is a uniform distribution as p(ρ ij ) ∝ 1 for ρ ij ∈ (−1, 1), i = 1, . . . , h, j = 1, . . . , m.
Using the definitions of the likelihood function in (4.1)-(4.2) and the set of prior distributions in (3.6)-(3.11), it is then easy to write down all the conditional posteriors.
The conditional posterior density of the disturbance covariance matrix given the factor scores, the factor loadings and the data, is again an inverted Wishart with updated parameters as: Similarly, the posterior distribution forΦ is an inverted Wishart distribution with updated parameters. The expected value is a weighted sum of empirical covariance, factor scores covariance and prior hypotheses about covariance between observations, more precisely: The posterior distribution for σ 2 ε is an inverse Gamma, with updated parameters: To compute the posterior distribution of R only (3.8) contributes, as: The conditional posterior distribution for the factor scores given the correlation matrix, the disturbance covariance matrix, the factor loadings and the data is multivariate normal as shown in Supplementary Material, and it results as follows: vec (F (1), . . . , F (T )) | X, Φ, Ψ, R, . . . ∼ N (μ F ,Σ F,m ), We underline that to compute the posterior distribution of factors in the VAR case, it is convenient to consider vec (F (1), . . . , F (T )), that is the vectorization of a N × T m matrix, rather than the vectorization of the matrix F defined above, that has dimension NT × m.
In the particular case when all factor scores have the same autocorrelation coefficient, namely when R = ρI m , we have instead Where M (ρ) is defined as: Computational details for the posterior distribution of the factors (under these and more general assumptions) are shown in Supplementary Material. The (unconstrained) posterior distribution of Λ is the following: Differently to the standard Bayesian approach, that uses completely informative priors for the elements λ kj subject to the constraints, we impose ex-post the restriction by conditioning the posterior distribution. In this way, the whole posterior distribution of Λ, which enters in all terms of the constrained distribution, is informed by the data.
We consider the m × K dimensional vector vec(Λ), and define the two sub-vectors λ u , corresponding to the unconstrained elements in Λ and λ c , that instead contains the m(m+1)/2 terms of Λ that are constrained in (3.5). Then, let λ c,0 be their corresponding constrained values (either 0 or 1).

Let us denote byΣ
(4.4) andλ = vec(Λ), namely, the posterior variance matrix and mean vector of vec(Λ). Then, by appropriately reordering the elements ofΣ λ andλ, we can writeΣ λ andλ in blocks, corresponding to λ u and λ c respectively: Imposing constraints on the posterior distribution of Λ is equivalent to computing the posterior distribution of λ u , conditional to λ c = λ c,0 . This distribution is readily computed, from the well known properties of a multivariate Gaussian distribution:

Number of Factors
An important issue to be faced is the choice of the number of factors. In the previous sections, likelihood, prior and posterior distributions are defined or computed by assuming a fixed number of factors. In this Section we illustrate how their number can be determined. Bayesian and non-Bayesian references that tackle the estimation/choice of the number of common factors are, among many others, Lawley and Maxwell (1962); Bartholomew et al. (2011); Martin and McDonald (1975); Press and Shigemasu (1989); Rowe (2003). The book by Bartholomew (1987) is an excellent overview of the field right before MCMC tools became available.
According to Rowe (2003) the posterior probability of m is given by p(m) indicates the prior distribution of m. Since different methods for model selection tend to favour high number of factors, we choose a prior distribution inversely proportional to m.
Due to the computation difficulties to sample from the previous posterior distribution, to choose the optimal number of factors we propose the following algorithm. The main idea is to sample a value m from the prior distribution of m. Given m , we estimate all the components of the model according to the method previously exposed: F (t),Λ,Ψ andΦ as defined in (3.2) and (3.3). Knowing all the components, we can sample a dataset X from the estimated model. We then compute a measure of proximity between observed data and each data set sampled from the estimated distribution, based on either the correlation or a distance. The optimal number of factors is estimated according to one of the following criteria: • we take the average of the values m weighted by the correlation or the inverse distance, • all sampled values such that the distance between the observed and sampled data is greater than the 90 − th percentile of the computed distances are discarded, and then the sample mean of the observed value is the estimated value for m.

Simulations
The aims of the simulations are basically twofold: to investigate the ability to reconstruct the data with a reduction of the dimension of the variables and the ability to correctly identify the latent structure.
We need to simulate a data set where the variables are generated by a lower number of factors and investigate whether the method is able to correctly catch the underlying structure. To reach this aim, we first follow the mechanism generating data explained in Ghosh and Dunson (2009), adding an AR(1) structure for the factor scores. By assuming K = 10, N = 20, T = 10 and m 0 = 3 we simulate the following three factors model: We simulate a data set with ten variables generated assuming the structure in (6.1)-(6.7). From each data set simulated, we estimate the factor scores and the factor loadings, with different values of m and, then, reconstruct the data set withFΛ . The number of factors is determined slightly differently from the algorithm illustrated in Subsection 5, because here for each iteration the values of m from 2 to 9 are considered. We are not sampling using the prior distribution of m, since the aim is investigating the ability of the method to reconstruct the simulated data varying the value of m. Once we have estimated the reduced model, we measure the distance between simulated and estimated data through the following three measures: (i) the correlation between the data generated X and the data reconstructedFΛ (ii) the distance between the mean μ 0 = F 0 Λ 0 and (iii) the estimated meanμ =FΛ and the log condition number of Σ −1 0Σ . We repeated these steps 100 times.
In Table 1 we report the results of the simulation. For each value m the tables shows the mean value for the three measures of distance between true and reconstructed data.
From 100 simulations, the number of factors,m, is estimated to be equal to 3.67 versus the real value of 3. This result again confirms the performance of our method. The average correlation between simulated data and reconstructed data is 0.863.  Table 1: Results for the simulation according to generating mechanism as in (6.1)-(6.7). For each number of factors extracted, the average log condition number of Σ −1 0Σ , the average distance between the mean μ 0 = F 0 Λ 0 and the estimated one and the average correlation between the data generated and the data reconstructed.
Moreover, we simulate data sets with a random generating structure: p, number of variables, is fixed to be equal to 15, T = 10 and ρ 0 is fixed to be equal 0.4. The first step is the random generation of the quantities m 0 , F 0 , Σ 0 and Λ 0 .
Elements of Λ 0 are randomly sampled from the same elements that appear in the matrix of factor loadings defined in (6.4). At the second step, we simulate a data set X with the sampled generating structure. From matrix X, we extract factor loading and factor scores with number of factors from 2 to 11, and we choose the model that minimize the distance between simulated and reconstructed data. The correlation between the data generated and the model reconstructed has an average of 0.83, ranging from 0.705 and 0.893. Moreover, the correlation between the real value of m and the chosen one is 0.683.
Furthermore, to test the ability of the method proposed to guess the underlying structure, besides reconstructing the data, we simulate data sets with p, number of variables equal to 8, T = 5, and n = 20. The matrix of factor loading is a 0/1 matrix: Λ 0 = 1 0 1 1 1 0 0 0 0 1 0 0 0 1 1 1 .
The rest of the structure remains unchanged: We simulate 500 dataset following the assumptions above and assuming ρ = 0, that corresponds to assume absence of temporal structure. For each dataset, we compute the correlation between matrix Λ 0 , factor loadings used to generate the data, and matrixΛ, mean of the samples from the posterior distribution. The average correlation is 0.859, the median is 0.959, with 95% credible interval equal to [0.277, 0.99]. Furthermore, the average correlation between matrix F 0 , generating the data, and the matrixF is 0.821, the median is 0.923, with 95% credible interval equal to [0.129, 0.975].
Finally, we simulate 500 dataset assuming ρ = 0.4. For each dataset, we compute the correlation between Λ 0 andΛ, mean of the samples from the posterior distribution, the average value is 0.922, the median is 0.976, with 95% credible interval equal to [0.212, 0.99]. Furthermore, the average correlation between F 0 generating the data and the matrixF is 0.781, the median is 0.854, with 95% credible interval equal to [0.067, 0.908].

Results
The results reported are based on two chains of posterior samples of 200000 iterations per MCMC, with a burn-in iterations of 100000. Convergence for each parameter was checked by visual examination of trace plots and by the calculation of the Brooks and Gelman diagnostic (Brooks and Gelman, 1998). Convergence of matrices was checked through convergence of eigenvalues. In terms of real time, the run took about 5 hours. All of our computations performed in Matlab (version 2019b) on a PC with a processor Intel Core i7 − 6700T and 16Gb of RAM. Note that the estimates are rather stable and the number of iterations can be reduced to 20000 without significant changes.
Applying the method described in Section 5, 1500 values were sampled from the distribution of m, and we obtain a posterior mean of the number of factors equal to 5. In this Section, results obtained by calculating five factors are therefore presented. As illustrated in Section 5, different criteria are suggested to chose the optimal number of factors, all of them leading to the estimatem = 5: • the average of the sampled values of m weighted by the correlations between real data and reconstructed data provide 5.45 as the chosen value for the number of factors, • the average of the sampled values of m weighted by the inverse of distances between real data and reconstructed data provide 5.57 as the chosen value for the number of factors, • discarding any samples that provide a distance between real data and reconstructed data greater than the 90 − th percentile,m results 5.41.
The choice of hyper-parameters is a fundamental aspect of the analysis and will be discussed in Subsection 7.1. Concerning hyperparameters in the prior distribution ofΨ andΦ, indicated in (3.10) and in (3.11), we assume that the scale parameters, ν and γ, are equal to 1, while κ ν and κ γ are fixed to be equal to 0.1.
Matrix H, prior covariance matrix for the distribution of Λ, is assumed to be a diagonal matrix with 0.01 in each cell of the diagonal, and Λ 0 is defined as explained in Section 7.1. The prior distribution for each component ρ j is a uniform distribution between −1 and 1. Moreover, the parameter σ 2 ε in (3.3) is assumed to be equal to 1, by standardization of the matrix X(t).
In Figure 9 posterior estimate of matrix Λ is shown. For the first loading, constrained to be correlated with alcohol consumption, infection disease is the only variable with a negligible association. The first factor loading results highly correlated with unemployment rate and with SDR of most of the diseases: alcohol and smoking related diseases, all causes disease, circulatory system and ischaemic diseases, tuberculosis and liver disease. Furthermore first factor loading is negatively correlated with protein intake, life expectancy, fertility rate, fat intake and average income, and has a weaker negative association with SDR for breast and all cancer. As Figure 10 shows, by plotting the first factor score over Europe, in the years 2000, 2003, 2006 and 2009, it is marked a geographical contrast between Mitteleuropean countries and the rest of Europe. The first factor score clearly marks a contrast between richer versus poorer countries, with the exception of Germany and Austria, whose scores are nearer to those of East European countries. The evolution of the scores through the decade is minimal, and shows a slight divergence in the extreme values of the score. This factor marks differences in lifestyle, that is mostly attributable to heterogeneity in financial resources and impacts on (almost) all health outcomes. The second factor loading, constrained to be correlated with fat intake, has a high positive association with life expectancy and average income. At the same time it results negatively correlated with unemployment rate, SDRs for tuberculosis, circulatory system and ischaemic diseases, infection disease, all causes disease, smoking related disease and protein intake, and appears also a weaker negative association with liver disease and alcohol related diseases SDR. Among the SDRs, the only positive association is with cancer (all and breast).
As shown in Figure 11, that maps the second factor score over Europe in 2000, 2003, 2006 and 2009, the second factor score signs a contrast between Northern and other countries, but it shows a higher variability in time relative to the first factor. The second factor seems to mark some differences in cancer mortality rates related to heterogeneities both in calories intakes (correlated with income) and dietary habits.
The third factor loading is constrained to be correlated with fertility rate and results highly positively correlated to life expectancy and average income. It is worth noting that both the second and third factors have high positive correlation with income and life expectancy, and are negatively correlated with most of the SDRs, but while the second factor shows a positive correlation with cancer SDR, the third factor has a strong negative correlation with all cancer SDRs. In Figure 12, the third factor score over Europe is shown, in 2000, 2003, 2006 and 2009, identifying an evident contrast between eastern and western countries. France is dominating the maps of the third factor scores, probably due to the higher fertility rate compared to other countries with Figure 11: 2 nd Factor Score. similar socio-economic conditions. Here the opposition of richer vs poorer countries is stronger than in the first factor and captures the diffusion of primary prevention and the quality of health care systems.
The fourth factor (maps for the last two factor scores in 2000, 2003, 2006 and 2009 are available upon request), constrained to be correlated with life expectancy, has strong correlation with income (positive) and unemployment rate (negative) and is strongly positively correlated with breast cancer SDR only. It marks a contrast between Mediterranean vs other European countries. This factor is difficult to interpret, it could pattern some hidden social or cultural unobserved variables affecting women participation to prevention programs in rich countries.
The fifth factor is the only one that shows no correlation with the economic variables. It is constrained to be correlated with protein intakes and shows positive correlations with (in descending order) breast cancer, all cancer SDR, infection, liver disease and alcohol related SDR. The lack of correlation with economic variables reflects in the geographical pattern of scores, which does not evidence clusters similar to those of the previous factors (that could be roughly summarized as a South-West/North-East contrast). This factor juxtaposes France and the Baltic states, that dominate the maps in all years, in contrast to Southern-Mediterranean and Scandinavian countries, with highly negative scores. A common feature of the latter group is wide coastal area and the development of fishing industry: this affects the availability of fish protein and might suggest the fifth factor to mark differences in quality of protein intakes. The posterior distributions of each component have a wide area of overlapping, we cannot identify different behaviours regarding temporal dependence of the factors. We then consider the model with R = ρI m . In Figure 13 the posterior distribution of ρ is shown, posterior estimates of ρ results 0.42 indicating a rather strong temporal correlation between the factor scores. It also results higher than the average autocorrelation of each score.
Overall, the factor analysis seems to suggest the existence of two main geographical blocks: Eastern vs Western countries. Depending on the factor, the latter can be further decomposed into Scandinavian, Mediterranean and Mittel-European countries. These blocks appear to be stable through time, with some slight differences among factors. Occasionally, some countries are associated to different groups: Spain and France, for example, in some cases show factor scores not in line with those of their neighbors.
With the aim of comparing our analysis with the results form a different approach, we implemented, in the WinBugs software, the algorithm proposed by Tzala and Best (2008). The comparison is not straightforward: Tzala and Best (2008) model is applicable to Poisson distributed data, and can extract only one factor. To apply their model to our data, we then limited the attention to variables representing counts of deaths, that lend themselves to be modeled by a Poisson distribution, leaving out the other socio-economical and behavioural variables.
We extracted one factor from a matrix with 10 columns, containing the observed counts of deaths for all the diseases indicated. The results obtained with the WinBugs software are in line with our results, the factor loading estimated contrasts breast and all cancer deaths versus all the other diseases, with again a differentiation between Eastern and Western European countries, quite stable in time. In terms of computing time, the two procedures are pretty similar, with our procedure only slightly faster than Tzala and Best (2008)'s. Lee and Press (1998) study robustness in Bayesian factor analysis with respect to the choice of hyperparameters. Coherently with their findings, our model confirms that estimation of Λ is very sensitive to the choice of Λ 0 . As a consequence, also posterior estimates of F depends heavily on the choice of the elements in Λ 0 .

Sensitivity Analysis
To attenuate the lack of robustness, we use a two-step procedure to model Λ 0 . First, we randomly generate elements of a matrix Λ 00 from a uniform distribution between −2 and 2, according to structure in (3.5). Running the model for 1000 iterations we keep the posterior estimates of Λ as a value of Λ 0 for the next MCMC iterations. The estimates for Λ 0 obtained in the first-step procedure is the prior expected value for the factor loadings and will influence the posterior estimates as shown in (4.3). The two step procedure aims at reducing the dependence on our prior choice for Λ 00 .
By definition, H is any positive definite matrix, it is assumed here that H is either a diagonal matrix or proportional to the identity. Assuming H scalar matrix with magnitude comparable with the inverse of the sample size looks a reasonable choice. There is no reason to assume a prior correlation between elements in Λ. From (4.4), the posterior estimates of Σ Λ , covariance matrix of Λ, is the combination of the prior precision matrix, H, and the variance covariance matrix of the column of factor scores F . The posterior estimates of variance covariance matrix of Λ can result far from a diagonal matrix as the covariance matrix of factor scores indicates.
Identifiability of the factors is the most delicate aspect of the factor model. Since we choose a lower triangular structure for Λ, a critical comment to be aware of is that the order of variables impacts the interpretation of factors and might influence the model fit. We investigate whether the posterior estimates of factor loadings and factor scores depend on the order of the variables. We randomly change the order of the variables and compare the posterior estimates of the factor scores and the factor loadings obtained conditionally on the order. The posterior estimates of factor loadings are different, (since posterior estimates are also strongly dependent on Λ 0 ), but the interpretation of factor scores remains unchanged varying the order of the variables, and furthermore it has no impact on forecasting.
Considering the application to real data illustrated in Section 2, we compare the first three factors extracted from different models, obtained by varying the parameter m, from three to eleven factors. We compute the correlation between the factor extracted and we obtain that the correlation among the first factors in all different models is always greater, in absolute value, than 0.84. Further, all absolute correlations among second factors are greater than 0.73 while the absolute correlation among the third factors is always greater than 0.62.
The choice of the shape hyperparameters in inverse Wishart prior distributions has always been critical, especially when it is desired to achieve arbitrarily high noninformativity (Gelman (2006)). One of the main drawbacks of the inverse Wishart distribution is that the uncertainty of the matrix is determined by the single scalar parameter (ν or γ). The introduction of the parameters (κ ν , κ γ ) helps to better control for the uncertainty without changing prior expectation. This limitation of the Wishart prior induced some authors to consider generalizations such as, for instance, the Generalized inverted Wishart distribution (see f.i. Brown et al. (1994)). The less informative choice for γ and ν guaranteeing stability of the estimates, that is γ = 1, ν = 1, proves to work reasonably well even when sizes N and K are (moderately) unbalanced. However, when the difference between the number of variables and the space dimension increases, the prior precision parameters may differ substantially. In particular, prior precision relative to the covariance matrix with the highest dimension tends to increase and, as a consequence, the weight of the prior expected value on the posterior expected value tends to increase. The introduction of the hyperparameters κ ν and κ γ , fixed to 0.1, doesn't appear to impact the posterior distributions of Ψ and Φ: they seem, in fact, to be more sensitive to the order of magnitude of the shape parameters than to the form of the matrices A and G.

Conclusion
This paper develops and applies a new hierarchical Bayesian method for factor analysis for multivariate spatially and temporally correlated data. Our method advances the traditional literature on factor analysis in its computational feasibility and interpretation. The main hypothesis underlying the proposed model is that the latent factors share the same between-locations dependence as the observed variables. The underlying structure reflects on one hand our idea of a vector autoregressive pattern in the latent variables relative to adjacent time periods and on the other hand the presence of high correlation between two regions, (or census tracts), that are close to each other.
The main novelty of this work is the development of the factor model for spatially and temporally correlated data motivated by Burkitt (1969) and Burkitt (1970), who stated that the variation in the geographic and temporal pattern of a disease may be related to its cause. This paper applies an innovative approach in trying to analyze the mutual relationship among the geographical distributions of different diseases together with some well known risk factors, both behavioural and socio-economic.
In analyzing high-dimensional, or even moderate-dimensional, multivariate data, one is faced with the problem of estimating a large number of covariance parameters. The method proposed results particularly useful for datasets with a large number of variables where the interest is not in any single variable measured, but in uncovering meaningful common patterns underlying the data.
Assuming an inverted Wishart distribution for matrix Φ, the model proposed does not force the correlation within variables to be necessarily determined by geographic distance, but lets the data suggest the nature of the dependence. Our method borrows information either across geographical areas and different times. Indeed, the resulting multivariate spatio-temporal analysis may also lead to improved precision for the estimation and forecasting of the underlying factors by acquiring strength from other variables as well as close areas or time points.
The flexibility of a separable approach through a Kronecker product can be adapted to specify parametric structures for Φ. For example, the exponential matrix specification is considered in Strauss et al. (2017) and Leorato and Mezzetti (2016), as a way to reduce the number of parameters to be estimated.
A real data example illustrates the relevance of the methodology offering empirical evidence of the model proposed. The application highlights a geographic pattern in the risk factors that mirrors economic differences between Eastern and Western European countries. Furthermore, a simulation study is carried out and provides very encouraging results. The aim of this type of empirical analysis is to verify whether the results contribute to strengthening the hypothesis that environmental and local characteristics can act as a common risk factors for many types of cancers and of other diseases. These environmental factors, are very likely to be shared by adjacent regions, due to spatial spillovers or correlations, thus causing risk factors to show a geographical pattern. The simultaneous variation of factor scores through time and space helps to identify the presence of convergence phenomena in one or more factors or progressive concentrations of some factors in different areas. This may further add evidence to common sources of influence that reflect underlying shared risk factors, and proves to be particularly useful for diseases with less clear aetiology or rare diseases. In our application, the spatial risk pattern appears to be stable through time, thus rather than reinforcing the hypothesis of spatial convergence of risks, our analysis highlights the existence of persistent regional specific factors. Once the factors are extracted, they can be used for forecasting the variables of interest, by augmenting an observation driven model by the estimated factors.

Supplementary Material
Supplementary Material to "A Bayesian Factor Model for Spatial Panel Data with a Separable Covariance Approach" (DOI: 10.1214/20-BA1215SUPP; .pdf).