Bayesian Non-Parametric Factor Analysis for Longitudinal Spatial Surfaces

We introduce a Bayesian non-parametric spatial factor analysis model with spatial dependency induced through a prior on factor loadings. For each column of the loadings matrix, spatial dependency is encoded using a probit stick-breaking process (PSBP), and a multiplicative gamma process shrinkage prior is used across columns to adaptively determine the number of latent factors. By encoding spatial information into the loadings matrix, meaningful factors are learned that respect the observed neighborhood dependencies, making them useful for assessing rates of change over space. Furthermore, the spatial PSBP prior can be used for clustering temporal trends, allowing users to identify regions within the spatial domain with similar temporal trajectories, an important task in many applied settings. We illustrate the model's performance on simulated data and in two real-world examples: longitudinal monitoring of glaucoma and malaria surveillance across the Peruvian Amazon. The R package spBFA, available on CRAN, implements the method.


Introduction
The covariance for the standard Bayesian factor model, Ψ = ΛΛ′ + Σ, is a matrix decomposition constructed to learn a latent representation for some potentially high-dimensional data object Y t = {Y t (s 1 ), . . . , Y t (s m )}′. We use notation from the spatial statistics literature to indicate the dimension of Y t ; however, this is only for consistency throughout the remainder of the paper. In fact, the data object Y t is often not spatial in nature, but a vector that contains a large number of highly collinear variables. As such, throughout this paper, we refer to this dimension as the "variable dimension" of the data. The subscript t indexes observed repetitions of the data object, which can be inherently independent, spatial, or temporal in nature; we refer to this data dimension as the "replication dimension".
In this manuscript, we deal with the data setting where the vector Y t represents a spatial surface and is observed longitudinally across time, t. Our ultimate goal is to obtain a low-dimensional representation of Y t , at each time t, that is learned from a process that accounts for the spatial structure of the observed data. By incorporating these spatial dependencies, the hope is that meaningful latent factors are learned that aid in understanding rates of change across the spatial surface and provide a framework for clustering spatial locations based on comparable temporal trajectories. To accomplish this, we generalize the standard factor analysis, to allow for non-linear relationships (Equation (2.1)) and introduce a novel spatial Bayesian non-parametric (BNP) prior on the columns of the factor loadings matrix, Λ (Equation (2.3)). We begin by reviewing existing factor analysis methods for spatial data.
Factor analysis is characterized by dimension reduction along the variable dimension of the observed data and is accomplished by projecting the data into a lower dimensional space, defined by a set of k factors, η t = (η t1 , . . . , η tk )′. In practice the number of factors is small compared to the dimension of the data object (k ≪ m). By definition, the factors have lower variability than the original data and are more manageable for inferential purposes due to their low dimension. In a standard Bayesian factor analysis, the latent vector η t is often modeled as a standard Gaussian (Murray et al., 2013).
Typically, much of the innovation in factor analysis involves the prior for the m × k dimensional factor loadings matrix, Λ. The naive approach assumes independent Gaussian priors for each element of Λ, which has obvious computational issues when m and k are large. Furthermore, it may lead to poor inference due to the weakness of the prior specification and is not identifiable without further restrictions. In general, the specification of Ψ is not unique, as there are infinitely many possible factor loading matrices that satisfy the form. This can be seen by noting that any matrix of the form ΛP satisfies the condition, for any orthogonal matrix P (i.e., PP′ = I k ).
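This rotation invariance is easy to verify numerically. The following is a minimal sketch (in Python, with arbitrary toy dimensions) showing that Λ and ΛP imply the same covariance Ψ:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 3  # toy variable dimension and number of factors

# A loadings matrix, idiosyncratic variances, and the implied covariance
Lam = rng.normal(size=(m, k))
Sigma = np.diag(rng.uniform(0.5, 1.5, size=m))
Psi = Lam @ Lam.T + Sigma

# Any orthogonal P (here the Q factor of a QR decomposition) gives the same Psi
P, _ = np.linalg.qr(rng.normal(size=(k, k)))
Psi_rot = (Lam @ P) @ (Lam @ P).T + Sigma

assert np.allclose(Psi, Psi_rot)  # Lambda is not identifiable from Psi alone
```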
To remedy this, Λ is often taken to be a lower triangular matrix with the diagonal loadings forced to be positive. This has been made computationally more efficient in recent years through parameter expansion of the loadings using basis elements (Ghosh and Dunson, 2009). Although these methods can be useful for identifiability, they remain computationally burdensome. Furthermore, it has been noted that from a Bayesian perspective, one does not require identifiability for many applications, including prediction, covariance estimation, and clustering (Bhattacharya and Dunson, 2011).
Adaptations of factor analysis to the spatial setting are plentiful and predominantly focus on spatial dependence in the replication dimension of the data. A typical application of spatial factor analysis involves learning a latent representation of some high-dimensional data object that is observed across a geography, whether point-referenced or areal. Here, the foundational assumption of spatial statistics, that dependence between observations weakens as the distance between locations increases, is applied to the factors, so that a latent factor at a location should be similar to factors at nearby locations. Christensen and Amemiya (2002) used this assumption to fit a shift-factor analysis method to model multivariate spatial data with temporal behavior modeled by autoregressive (AR) components. This method entertained several forms of spatial dependence through a single factor, a standard construction in the literature (Hogan and Tchernis, 2004). There have been many extensions to multiple factors, most often using Gaussian likelihoods (Ren and Banerjee, 2013; Nethery et al., 2015), but also generalizing to Poisson (Tzala and Best, 2008) and binary data (Wall and Liu, 2009). There are also extensions to informative missingness (Reich and Bandyopadhyay, 2010) and spatial mis-alignment (Nethery et al., 2018). Finally, Guhaniyogi et al. (2013) combined a low rank predictive process approach with the non-stationary linear model of coregionalization for computationally feasible modeling with large spatially-referenced datasets.
In all of these methods, the latent factors are responsible for encoding spatial dependency for the purpose of reducing the observed data at each location. In this paper, however, we will focus on an alternative form of spatial factor analysis that instead introduces spatial structure along the variable dimension of the data. Thus, instead of dimension reduction for some high-dimensional response across locations, the response is now univariate, and the dimension reduction is performed across spatial units. This approach is advantageous when the modeling goal is to identify spatial clusters whose temporal behavior is similar. This approach was introduced by Lopes et al. (2008) through a spatial dynamic factor model. The key to this approach is a spatial prior on the columns of the factor loadings matrix that allows for dimension reduction to be informed by spatial proximities. In Lopes et al. (2008), space was modeled using a distance-based Gaussian random field, while a more recent version used a Gaussian Markov random field for sparsity purposes (Strickland et al., 2011). This method has been extended to the generalized likelihood setting (Lopes et al., 2011). These methods use a lower triangular specification for the loadings matrix for identifiability purposes, and the number of factors is learned through reversible jump Markov chain Monte Carlo (MCMC). While these methods are useful for learning factors across a spatial surface, they rely on complicated identifiability constraints and lack clustering properties.
In this manuscript, we introduce a spatial factor analysis that collapses spatial locations into meaningful latent factors using a spatial BNP prior for the fully specified factor loadings matrix. In particular, we will model spatial dependency within the columns of the factor loadings using the probit stick-breaking process (PSBP), which is a scalable extension of standard spatial processes that allows for clustering. The BNP world has a rich literature involving spatial processes, mainly involving extensions of the Dirichlet process (DP). The DP is the work-horse of BNPs and, when considering spatial dependencies, is best represented using its stick-breaking construction, G = Σ_{l=1}^∞ w l δ θ l , with weights w l = v l ∏_{h<l} (1 − v h ), where v l ∼ Beta(1, α) and δ θ l is a Dirac distribution with point mass at θ l . Since the introduction of the dependent DP by MacEachern (1999), which modeled dependency through covariate information in the atoms (θ l ) and the weights (w l ), many methods have extended the DP to the spatial setting.
A popular spatial DP extension is Gelfand et al. (2005), which places a univariate stationary Gaussian process on the atoms to yield a random spatial process that is neither Gaussian nor stationary. The process has been extended to the generalized framework (Duan et al., 2007). Modeling spatial dependency through the weights of the stick-breaking representation has also been popular; however, until recent years it was computationally inefficient. In the more general stick-breaking construction, Rodriguez and Dunson (2011) introduced the PSBP, which replaces the characteristic Beta distribution prior with probit transformations of normal random variables. With the introduction of the PSBP, incorporating spatial dependency in BNP priors has become computationally straightforward and mainstream (Chung and Dunson, 2009; Pati et al., 2013; Pati and Dunson, 2014).
Using the PSBP prior to introduce spatial structure into the factor loadings matrix yields a non-separable and non-stationary spatiotemporal (ST) process, with temporal dependence introduced through the factors. We show that the PSBP prior offers benefits for scalability and is useful for clustering spatial locations into regions across space with similar risk trajectories. A computationally efficient MCMC sampler is introduced that uses slice sampling to allow for an infinite mixture model. Furthermore, a multiplicative gamma process shrinkage prior is used to adaptively determine the number of latent factors, avoiding the computationally intensive reversible jump technique. This paper is outlined as follows. In Section 2, we introduce a general factor analysis modeling framework and detail our novel spatial BNP prior for the columns of the factor loadings matrix. Through simulation in Section 3, we assess the utility of the novel prior in ST data and for clustering temporal trends across a spatial surface. Then, in Section 4, we apply the model to two real-world data applications: glaucoma disease progression and malaria risk surveillance. We conclude in Section 5 with a discussion.

Methodology
We begin by introducing a generalized modeling framework for factor analysis that allows for non-linearity and detail the temporal process for the latent factors. We then introduce the spatial PSBP prior for the factor loadings matrix and describe the multiplicative gamma process shrinkage prior for adaptively learning the appropriate number of latent factors. We conclude the section by working out a computationally efficient MCMC sampler for the infinite mixture model, describing the clustering properties of the introduced prior, and detailing prediction theory.

A General Modelling Framework
A generalized factor analysis model can be written as follows,

Y t (s i,o ) ∼ f [g −1 {ϑ t (s i,o )}, ζ t (s i,o )], ϑ t (s i,o ) = x t (s i,o )′β + Σ_{j=1}^k λ j (s i,o )η tj . (2.1)

Here, we formally define our observed data as Y t (s i,o ) for temporal visit t, (t = 1, . . . , T ) and spatial realization s i,o , for location i, (i = 1, . . . , m), and observation type o, (o = 1, . . . , O); the likelihood f has mean process g −1 {ϑ t (s i,o )}, for link function g, and possible nuisance parameters ζ t (s i,o ). This is a general specification, so that at each time t, the spatial object can be multi-layered (i.e., color channels or multiple disease outcomes per location) with O layers. We define vectorized versions of the observed data by stacking over locations and then observation types, Y t = (Y t1 ′, . . . , Y tO ′)′ with Y to = {Y t (s 1,o ), . . . , Y t (s m,o )}′. In our specification the factor loadings matrix is fully specified, with loadings λ j (s i,o ) corresponding to the stacking of the observed data, so the j th column is given by λ j = (λ j1 ′, . . . , λ jO ′)′, with λ jo = {λ j (s 1,o ), . . . , λ j (s m,o )}′. A full specification allows for a direct application of spatial structure to λ j , j = 1, . . . , k, as it has the same dimension as the underlying process, mO. While a full specification limits the interpretability of the factors themselves, as mentioned before, one does not require identifiability for many applications, including covariance estimation and prediction.
While standard Bayesian factor analysis is performed using a Gaussian likelihood, (2.1) is a generalized form. The Gaussian specification can be recovered if we choose f to be Gaussian with mean ϑ t (s i,o ), variance σ 2 (s i,o ), and g the identity link. This is equivalent to the following vectorized model specification, Y t ∼ N mO (X t β + Λη t , Σ), where Σ = diag{(σ 2 1 ′, . . . , σ 2 O ′)′} and σ 2 o = {σ 2 (s 1,o ), . . . , σ 2 (s m,o )}′. The component, X t β, allows for covariates to adjust the factor analysis. The design matrix, X t , has rows x t (s i,o )′, each of which is p dimensional.
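A minimal simulation from this vectorized Gaussian specification may clarify the stacking. All dimensions and parameter values below are fabricated toy choices (the paper's spBFA implementation is in R; Python is used here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, O, k, p, T = 10, 1, 3, 2, 5    # toy dimensions (mO = 10 stacked observations)

Lam = rng.normal(size=(m * O, k))           # fully specified loadings matrix
beta = rng.normal(size=p)                   # covariate effects
sigma2 = rng.uniform(0.1, 0.5, size=m * O)  # sigma^2(s_{i,o}), one per observation

Y = np.empty((T, m * O))
for t in range(T):
    X_t = rng.normal(size=(m * O, p))       # design matrix with rows x_t(s_{i,o})'
    eta_t = rng.normal(size=k)              # latent factors (standard Gaussian here)
    mean_t = X_t @ beta + Lam @ eta_t       # X_t beta + Lambda eta_t
    Y[t] = rng.normal(mean_t, np.sqrt(sigma2))
```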
The purpose of writing the model in this general form is that it is more flexible, allowing for various likelihoods. For example, when we study malaria in Section 4.2 we will use a Bernoulli distribution, by specifying f as Bernoulli, with probability π t (s i,o ) = g −1 (ϑ t (s i,o )), and the logit link (ζ t (s i,o ) is null). Full details for deriving the Bernoulli likelihood in this context can be found in Section 5 of the Supplementary Materials online (Berchuck et al., 2020).

Spatial Bayesian Non-Parametric Factor Loadings
The scalar form of the likelihood in (2.1) motivates spatial dependency in the factor loadings. In particular, due to the fully specified factor loadings matrix, the following linear relationship between the transformed mean process and the latent factors exists,

ϑ t (s i,o ) = x t (s i,o )′β + Σ_{j=1}^k λ j (s i,o )η tj . (2.2)

This illuminates that a factor loading λ j (s i,o ) represents the amount that observation Y t (s i,o ) is explained through the latent factor j at time t, η tj . Therefore, for two observations, Y t (s i,o ) and Y t (s i′,o ), that are spatially correlated, we would assume that their relationships to the latent factor η tj would be similar, so that λ j (s i,o ) ≈ λ j (s i′,o ). To induce the desired spatial dependency, as motivated by (2.2), into the columns of the factor loadings matrix, we use a PSBP for each column,

λ j (s i,o ) ∼ Σ_{l=1}^L w jl (s i,o )δ θ jl , w jl (s i,o ) = Φ{α jl (s i,o )} ∏_{h<l} [1 − Φ{α jh (s i,o )}], (2.3)

where {α jl (s) : s ∈ D}, l = 1, . . . , L − 1, for j = 1, . . . , k has Gaussian marginals, with D some multivariate spatial surface, and {θ jl }, l = 1, . . . , L, for j = 1, . . . , k are independent and identically distributed for each j. The form of (2.3) closely mirrors the stick-breaking construction of the DP; however, the weights are now constructed using the standard Gaussian cumulative distribution function, Φ. As is shown in Rodriguez and Dunson (2011), this is a proper construction, because for finite L (with the final break absorbing the remaining stick, w jL (s i,o ) = ∏_{h<L} [1 − Φ{α jh (s i,o )}]), it ensures that Σ_{l=1}^L w jl (s i,o ) = 1. This property clearly transfers to our new prior across the columns of the factor loadings. A more detailed discussion of this is given in Section 6.4 of the Supplementary Materials online.
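The stick-breaking weights in (2.3) can be sketched directly. Below is a minimal Python version in which, as in the finite-L construction, the final break absorbs the remaining stick mass so the weights sum to one; the α values are arbitrary illustrations:

```python
import math
import numpy as np

def probit_weights(alpha):
    """Probit stick-breaking weights w_l = Phi(alpha_l) * prod_{h<l}(1 - Phi(alpha_h)),
    with the final weight taking the remaining stick so the weights sum to one."""
    Phi = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))
    v = Phi(np.asarray(alpha, dtype=float))        # Phi(alpha_l), l = 1, ..., L-1
    sticks = np.concatenate([v, [1.0]])            # last break takes what remains
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])
    return sticks * remaining

w = probit_weights([0.3, -0.5, 1.2])               # L = 4 weights
assert w.shape == (4,) and abs(w.sum() - 1.0) < 1e-12
```

Larger α values at a location push mass onto earlier clusters; spatially correlated α surfaces therefore yield spatially smooth weights.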
The parameters that dictate the weights, α jl (s i,o ), have a joint distribution that induces spatial dependency. Define the joint parameters α jlo = {α jl (s 1,o ), . . . , α jl (s m,o )}′ and α jl = (α jl1 ′, . . . , α jlO ′)′. We specify a simple but flexible form using a separable specification, α jl ∼ N Om {0, κ ⊗ F(ρ)}. This separable Gaussian specification allows for conjugacy of α jl and κ, thus maintaining computational feasibility. Notice that while we treat space using a separable process, the resulting marginal process will be non-separable. The m × m matrix F(ρ) dictates the spatial neighborhood structure, for example a Gaussian process with exponential correlation, F(ρ) = exp{−ρD}, for a continuous spatial domain, or a Gaussian Markov random field for discrete spatial data, F(ρ) −1 = D w − ρW; we assume a proper conditional autoregressive (CAR) prior. Here D is a distance matrix (typically Euclidean) and W is an adjacency matrix, with adjacencies w ii′ that indicate the level of spatial correlation between locations i and i′ and do not change over time (D w is a diagonal matrix containing the number of neighbors of each location i). The parameter ρ indicates the level of spatial correlation.
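Construction of the separable covariance κ ⊗ F(ρ) under the proper CAR specification can be sketched as follows; the adjacency (four areal units on a line) and all parameter values are fabricated for illustration:

```python
import numpy as np

# Adjacency for 4 areal units on a line: 1-2-3-4
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_w = np.diag(W.sum(axis=1))       # diagonal matrix of neighbor counts
rho = 0.99                         # spatial correlation parameter

F = np.linalg.inv(D_w - rho * W)   # proper CAR covariance F(rho)

O = 2
kappa = np.array([[1.0, 0.3],      # O x O covariance across observation types
                  [0.3, 1.0]])

# Separable Om x Om covariance for alpha_jl: kappa Kronecker F(rho)
cov_alpha = np.kron(kappa, F)

# One draw of alpha_jl ~ N(0, kappa kron F(rho))
alpha = np.random.default_rng(2).multivariate_normal(np.zeros(8), cov_alpha)
```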
Finally, in prior attempts to model the factor loadings matrix, the number of latent factors (i.e., the number of columns of Λ) was determined using reversible jump MCMC. This requires a preliminary run for each choice of the number of factors and is very computationally intensive. As such, we model the atoms, θ jl , using a multiplicative gamma process shrinkage prior (Bhattacharya and Dunson, 2011). This prior conveniently shrinks the magnitude of possible entries, where the degree of shrinkage increases with the column index. In particular, θ jl ∼ N(0, τ j −1 ) independently, where the precision increases stochastically over the column index, τ j = ∏_{h=1}^j δ h , with δ 1 ∼ Ga(a 1 , 1) and δ h ∼ Ga(a 2 , 1) for h ≥ 2. This allows us to specify a value of k that is larger than the number of supposed factors, with the prior reducing the factors to a set of meaningful ones. Holding a 2 constant, increasing a 1 yields smaller column variances, and holding a 1 constant, increasing a 2 yields faster shrinkage of the column variances as j increases.
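A short sketch of the multiplicative gamma process, using the hyperparameter values a 1 = 1 and a 2 = 20 specified later in Section 3; the column precisions τ j are cumulative products of the δ h , so later columns of atoms are drawn with smaller variances:

```python
import numpy as np

rng = np.random.default_rng(3)
k, a1, a2 = 6, 1.0, 20.0

# delta_1 ~ Ga(a1, 1); delta_h ~ Ga(a2, 1) for h >= 2
delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                        rng.gamma(a2, 1.0, size=k - 1)])
tau = np.cumprod(delta)            # tau_j = prod_{h<=j} delta_h (precisions)

# Atoms theta_jl ~ N(0, 1/tau_j): columns with larger tau_j shrink toward zero
theta = rng.normal(0.0, 1.0 / np.sqrt(tau), size=(10, k))
```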

Computational Considerations
In order to facilitate Bayesian inference, the likelihood can be written in terms of the underlying atoms, a standard practice in mixture models,

Y t (s i,o ) ∼ f [g −1 {ϑ t (s i,o )}, ζ t (s i,o )], ϑ t (s i,o ) = x t (s i,o )′β + Σ_{j=1}^k θ j,ξ j (s i,o ) η tj . (2.4)

This is a simple replacement of the factor loadings, λ j (s i,o ), with their corresponding atom, θ jl , which is determined by a clustering indicator ξ j (s i,o ) = l. The representation in (2.4) reminds us of the discrete nature of the PSBP, as the categorical parameter ξ j (s i,o ) indicates the cluster of λ j (s i,o ) and has the distribution Pr{ξ j (s i,o ) = l} = w jl (s i,o ), for l = 1, . . . , L. This construction helps to illuminate the importance of the spatial dependency introduced in the PSBP, as the value of ξ j (s i,o ) (i.e., the cluster of λ j (s i,o )) is sampled from a multinomial distribution with weights that have been spatially smoothed to be similar to nearby locations. This is desirable, because it encourages close locations to belong to the same cluster, and thus constructs underlying factors that relate to regions of the spatial domain.
To facilitate efficient computations, we introduce latent variables, z jl (s i,o ) ∼ N{α jl (s i,o ), 1}, independently, to facilitate conjugacy in the α jl (s i,o ). Conjugacy follows from the equivalency

Pr{ξ j (s i,o ) = l} = w jl (s i,o ) = Pr{z jl (s i,o ) > 0 and z jh (s i,o ) < 0 for all h < l},

so that, conditional on the clustering indicators, each z jl (s i,o ) is a truncated Gaussian and the α jl (s i,o ) retain Gaussian full conditionals. The theory described above was for finite L; however, it can be easily extended to an infinite mixture model using a slice sampling technique (Walker, 2007). Slice sampling makes the infinite mixture model computationally feasible by introducing an upper bound for the number of clusters, thus reducing the process to a finite mixture model. In particular, all parameters that depend on the number of clusters (θ jl , α jl , κ, ρ, δ h ) will be augmented using the slice sampling truncation. The idea is to introduce a latent variable, u j (s i,o ), with uniform density so that, conditional on u j (s i,o ), j = 1, . . . , k, o = 1, . . . , O and i = 1, . . . , m, the conditional mixture distribution becomes finite. When dealing with full conditionals, this truncation corresponds to reducing the number of mixture components for each column of the factor loadings matrix to L * j , so that l j = 1, . . . , L * j .
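The probit augmentation can be checked by Monte Carlo: the PSBP weight for a cluster equals the probability that its latent Gaussian is positive while all earlier ones are negative. A sketch with arbitrary α values (L = 3, so two latent variables per location):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([0.2, -0.4])             # alpha_{jl}(s) for l = 1, 2

# PSBP weight for cluster l = 2: {1 - Phi(alpha_1)} * Phi(alpha_2)
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
w2 = (1.0 - Phi(alpha[0])) * Phi(alpha[1])

# Augmentation: z_l ~ N(alpha_l, 1); cluster 2 is chosen when z_1 < 0 and z_2 > 0
z = rng.normal(alpha, 1.0, size=(200_000, 2))
mc = np.mean((z[:, 0] < 0) & (z[:, 1] > 0))

assert abs(mc - w2) < 5e-3                # Monte Carlo matches the stick weight
```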

Model Properties
We focus on the class of spatial PSBP models, where each column of the factor loadings matrix has the following form,

G i,o j = Σ_{l=1}^L w jl (s i,o )δ θ jl , (2.5)

where each column progressively shrinks due to the gamma process shrinkage prior on the atoms. For each column, the process G i,o j marginally follows a PSBP for each s ∈ D. Therefore, for any set B ∈ B, we can obtain the moments of the process. In this section, we describe the moments of the PSBP process, originally derived in Rodriguez and Dunson (2011), plus new results that describe the conditional and marginal moments of the introduced spatial factor analysis.
The process moments are as follows, beginning with the mean, E{G i,o j (B)} = G 0j (B), where G 0j denotes the base distribution for the atoms θ jl . Furthermore, it is easy to see that the covariance across columns is zero, meaning the shrinkage process does not introduce dependency at the PSBP level. Higher moments, the variance of G i,o j (B) and the covariance between locations, depend on stick-breaking expectations of the weights. As described in Rodriguez and Dunson (2011), these stick-breaking expectations, which show up in the model properties, have closed forms as long as the underlying spatial process has a marginal distribution, α jl (s i,o ) ∼ N(μ, σ 2 ), in which case they can be computed using a change of variables. The marginal moments indicate the importance of the base distribution, G 0j , as a centering measure, with the marginal moments of the underlying Gaussian parameter, α jl (s i,o ), controlling the variance and covariance of the sampled distributions around G 0j . Furthermore, it was shown that the sampled distributions at two locations coincide as the correlation between the underlying Gaussian parameters approaches one. Having established the marginal moments of the process, we now turn our attention to deriving the moments of the observed data model. In order to derive these moments, we need to understand the induced spatial factor analysis model. The induced conditional likelihood for our model can be written in two equivalent forms, conditioning either on the factor loadings themselves or on the atoms and clustering indicators. These two equivalent forms of the induced model demonstrate the mixing, which averages over the factor loadings according to the PSBP (Equation (2.3)). These representations are critical for determining the marginal and conditional moments of the observed data model. In particular, we derive moments assuming a Gaussian likelihood, as the derivations become untenable without the identity link assumption. The conditional mean is given as,

E{Y t (s i,o ) | η t , β, α, Θ} = x t (s i,o )′β + Σ_{j=1}^k η tj Σ_{l=1}^L w jl (s i,o )θ jl ,

and the conditional variance follows similarly. The mean process is elegant, as it takes the form of the original mean process, but replaces the loadings with a mixture over the underlying atoms, weighted according to the PSBP. The spatial covariance, Cov{Y t (s i,o ), Y t (s i′,o′ ) | η t , β, α, Θ}, is conditionally zero.
The form of the conditional moments are reminiscent of a standard spatial analysis that uses a spatially varying intercept with a Gaussian process, where conditional on the spatial intercepts, the process is independent across space.
Finally, we turn our attention to the marginal moments of the process. Since the atoms are centered at zero, the marginal mean process reduces to the covariate effect, E{Y t (s i,o )} = x t (s i,o )′β. The marginal variance and covariance both take the same form as the moments from the PSBP process in (2.5); however, they are now scaled by a summation that is a function of the atom variances for each column and the second moment of the latent factors. In particular, as we see increases in both the number of factors (k) and the variability in the underlying atoms, the variance and covariance become inflated. For a full interpretation of the marginal moments, however, we need to place priors on the hyperparameters. For a detailed derivation of these model properties, see Section 6 of the Supplementary Materials online.

Prior Specification
We finalize the model specification by introducing priors for the remaining parameters: the spatial parameters, κ and ρ; the temporal parameters, Υ and ψ; and any nuisance parameters, for example the variance in the Gaussian likelihood, σ 2 (s i,o ). For each of these parameters, we choose standard priors to promote conjugacy in the full conditionals.
The spatial covariance, κ, is an O×O matrix over the multiple levels of the image and has the conjugate inverse-Wishart (IW) prior, κ ∼ IW(υ, Θ). When O = 1 this prior reduces to an inverse-Gamma (IG) distribution, IG(υ/2, Θ/2). For the degrees of freedom, we specify υ = O + 1, and for the scale matrix we use Θ = I O . This prior is appealing since it induces marginally uniform priors on the correlations of κ and allows for the diagonals to be weakly informative (Gelman et al., 2013). For the spatial tuning parameter, the prior is dependent on the type of spatial data, areal or point-referenced. For areal data, ρ is typically fixed at 0.99 to promote spatial smoothing, or given a uniform prior between zero and one. For point-referenced data, the prior is often uniform with bounds informed by the expected range of the spatial variability, as in Berchuck et al. (2016).
The hyperparameters for the temporal process are assigned priors in the same vein as the spatial parameters. For the k × k covariance of the latent factors, Υ ∼ IW(ζ, Ω), with ζ = k + 1 and Ω = I k . The prior for the temporal tuning parameter ψ depends on the temporal correlation structure. For an AR(1) process, ψ has a transformed Beta distribution, p(ψ) ∝ (1 + ψ) γ−1 (1 − ψ) β−1 on (−1, 1). In the case of an exponential process, a uniform prior is more appropriate. Finally, in the case of a Gaussian likelihood, a weakly informative prior is used for the variances, σ 2 (s i,o ) ∼ IG(a, b).

Clustering Temporal Trends
An important characteristic of our introduced methodology is its ability to cluster regions across a spatial surface dependent on rates of temporal change. In BNP, clustering is determined based on the posterior probability that two locations belong to the same underlying cluster, Pr{ξ j (s i,o ) = ξ j (s i′,o′ ) | Y}, which alleviates the label switching issue. Unfortunately, due to the full specification of the factor loadings matrix, none of the columns are themselves identifiable, and we cannot individually cluster on the k columns. Instead, we focus our clustering efforts on the factor loadings from all of the columns.
We begin by defining the loading probabilities for factor j at a particular location, w j (s i,o ) = {w j1 (s i,o ), . . . , w jL (s i,o )}′, and stacking them across factors, w(s i,o ) = {w 1 (s i,o )′, . . . , w k * (s i,o )′}′. Note that we limit to the first k * factors, which is designed to only include factors that exhibit variability across locations. To find a good value of k * , we propose a criterion based on the pairwise clustering probabilities, g j (s i,o , s i′,o′ ) = Pr{ξ j (s i,o ) = ξ j (s i′,o′ ) | Y}, and a tuning parameter π k , for j = 1, . . . , k * . This criterion attempts to exclude any factors that are non-informative for clustering, by only keeping factors with g j (s i,o , s i′,o′ ) near zero or one. The tuning parameter, π k , is continuous and ranges between 0 and 0.5. A value of π k = 0 will remove all the factors, while π k = 0.5 will not remove any factors, resulting in the original k factors. The final object, w = {w(s 1,1 )′, . . . , w(s m,1 )′, . . . , w(s 1,O )′, . . . , w(s m,O )′}′, has dimension mO × Lk * . More specifically, however, when using slice sampling each factor is truncated to L * j , and thus in practice w has a much smaller number of columns.
When clustering temporal trends across the spatial surface, we will use the factor loading probability matrix, w. In particular, we apply simple k-means to w and use the gap statistic to determine the proper number of clusters (Tibshirani et al., 2001). While this process may not seem immediately intuitive, since w is potentially larger than the original data, we found that clustering w produced improved results over clustering the raw data {Y 1 , . . . , Y T }.
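A toy version of this clustering step, using a plain k-means implementation on a fabricated loading-probability matrix w; the number of clusters is fixed at two here for illustration, whereas the paper selects it with the gap statistic:

```python
import numpy as np

def kmeans(X, k, iters=25):
    """Plain Lloyd's algorithm with a simple deterministic initialization;
    returns a cluster label for each row of X."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy loading-probability matrix w: rows are locations, columns are stacked
# stick weights from the retained factors; two clearly separated regions.
w = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.8, 0.2, 0.9, 0.1],
              [0.9, 0.1, 0.9, 0.1],
              [0.1, 0.9, 0.2, 0.8],
              [0.2, 0.8, 0.1, 0.9],
              [0.1, 0.9, 0.1, 0.9]])
labels = kmeans(w, k=2)
```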

Bayesian Non-Parametric Prediction
Once posterior samples have been obtained, prediction is often a priority. In particular, obtaining samples from the posterior predictive distribution (PPD) is of interest, for both new spatial and temporal instances. We begin by detailing how future instances of the spatial surface can be obtained, by defining the PPD as f (Y T +1 |Y). We express the PPD as an integral, ∫ Ω f (Y T +1 |Ω, Y)f (Ω|Y)dΩ, and then further partition the integrand into three component densities,

f (Y T +1 |Y) = ∫ Ω f (Y T +1 |ϑ T +1 , ζ T +1 ) f (η T +1 |η, Υ, ψ) f (Λ, η, ζ, Υ, ψ|Y) dΩ, (2.7)

where Ω = (ϑ T +1 , ζ T +1 , Λ, η, ζ, Υ, ψ). The convenient form of (2.7) is a function of three known densities that are defined as a consequence of the methodology introduced in Section 2.2. As such, the PPD can be obtained by composition sampling.
Density (2.7).1 represents the observed likelihood function written in vector form and is problem specific (in scalar form it is given by (2.1)). Density (2.7).2 depends on properties of the conditional multivariate normal (MVN), since η T +1 and η are jointly Gaussian under the temporal process for the latent factors.
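Composition sampling from (2.7) can be sketched as follows, assuming a Gaussian likelihood with no covariates and, purely for illustration, an AR(1)-type conditional for η T +1 with identity factor covariance; the "posterior draws" below are fabricated stand-ins for MCMC output:

```python
import numpy as np

rng = np.random.default_rng(5)
mO, k, T, S = 6, 2, 8, 100      # toy dimensions; S posterior draws

# Fabricated posterior draws of (Lambda, eta_T, psi, sigma2); in practice
# these come from the MCMC sampler of Section 2.3.
Lam = rng.normal(size=(S, mO, k))
eta_T = rng.normal(size=(S, k))
psi = rng.uniform(0.2, 0.4, size=S)
sigma2 = rng.uniform(0.1, 0.3, size=(S, mO))

Y_pred = np.empty((S, mO))
for s in range(S):
    # (2.7).2: conditional MVN draw of eta_{T+1} | eta_T under an AR(1)-type
    # temporal process with correlation psi (identity factor covariance here)
    eta_new = rng.normal(psi[s] * eta_T[s], np.sqrt(1.0 - psi[s] ** 2))
    # (2.7).1: the observed-data likelihood (Gaussian, no covariates)
    Y_pred[s] = rng.normal(Lam[s] @ eta_new, np.sqrt(sigma2[s]))
```

Each row of `Y_pred` is one draw from the PPD of the next spatial surface.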

Justifying the Spatial PSBP for Factor Analysis
In our first simulation experiment, we aimed to illuminate the importance of the spatial PSBP and multiplicative gamma process shrinkage priors in the presence of spatial variability. We simulated data using the full model across various settings, including different numbers of latent factors, k = 1, 3, 6, and the presence or absence of spatial correlation. To simulate data from a spatial process, the spatial covariance, F(ρ), was set to a proper CAR prior with ρ = 0.99. To simulate data with no spatial dependence, the spatial covariance was fixed at the identity matrix (i.e., F(ρ) = I). Finally, for an additional simulation setting, we simulated data from a model that removed the PSBP mechanism, assigning Gaussian processes directly to the columns of the factor loadings matrix. The purpose of this final setting was to assess our proposed model under the data generating mechanism of existing methods that rely on Gaussian process models, for example, Lopes et al. (2008). This yielded twelve simulation settings.
The simulation aimed to imitate the monitoring of glaucoma progression, so we fixed the number of spatial locations (m = 52, O = 1) to the number on a visual field (details of this setting follow in Section 4.1). The visits are uniformly spaced between zero and one, with the total number of visits set to the average in our visual field data (T = 10). Furthermore, we used the adjacency matrix from the visual field, where two locations i and i′ are considered neighbors if they share an edge or corner, w ii′ = 1(i ∼ i′). Finally, we set κ = 1, τ j = 1, ψ = 0.3, and Υ was sampled from its prior distribution, which depends on k. For each setting, 100 datasets were simulated, with each dataset generated from its own simulated instance of α to ensure that the results are not affected by a particular realization. In the simulation settings that removed the PSBP, the factor loadings matrix was simulated directly from α.
We now describe specific details of the model implementation, which, unless otherwise noted, apply to all subsequent modeling examples. For the spatial process, we used a proper CAR with ρ = 0.99 to encourage spatial dependency, similar to how the data were simulated. An exponential correlation structure was used for the temporal process, so that ψ ∼ Uniform(a ψ , b ψ ). The bounds for ψ cannot be specified arbitrarily, since it is important to account for the temporal range; they are chosen as a function of x min and x max, the minimum and maximum temporal differences between visits. The remaining priors come directly from Section 2.5; however, because there was only one spatial observation type, we specified the following prior, κ ∼ IG(0.001, 0.001). Finally, to promote shrinkage from the gamma process prior on the columns of the factor loadings matrix, we specified a 1 = 1 and a 2 = 20. Inference proceeds using the MCMC sampler described in Section 2.3, with non-convergence evaluated primarily through examination of traceplots, but also the Geweke statistic (Geweke, 1992). For each simulated dataset, the MCMC sampler is run for 100,000 iterations after a 10,000 iteration burn-in and thinned to a final sample size of 10,000. We call this method Model 1.

Table 1: Assessing the performance of the spatial stick-breaking process and multiplicative gamma process shrinkage prior. Simulation settings are defined by the number of true underlying factors (k = 1, 3, 6), whether spatial dependency is present (Y: F(0.99), N: I), and whether the PSBP or Gaussian process (GP) was used. Each of the simulated datasets has 10 uniform time points, which is the average in our visual field dataset (i.e., T = 10). Model fit was assessed using the widely applicable information criterion (WAIC) and prediction performance was defined as the accuracy of predicting the 13 th time point, as determined by the continuous ranked probability score (CRPS). Smaller values are preferred for both. Each summary is based on 100 simulated datasets.
To evaluate our introduced methodology, we compared it to various simplifications. Model 2 removed the spatial component, setting F(ρ) = I. Model 3 removed the gamma shrinkage prior, instead using independent priors, δ_h ∼ Ga(a_1, a_2). Models 4 and 5 replaced the PSBP prior with a standard multivariate CAR prior for each column of the factor loadings matrix, comparable to the model of Lopes et al. (2008). Furthermore, Models 4 and 5 removed the multiplicative gamma process shrinkage prior, instead using the same independent priors as Model 3. Finally, Model 5 also removed spatial dependency, using the identity matrix. All models were fit assuming six underlying latent factors (i.e., k = 6).
We compared the five models using both a model fit and a prediction summary. Model fit was assessed using the widely applicable information criterion (WAIC) and prediction performance was defined as accuracy of predicting a 13th simulated time point, measured by the continuous ranked probability score (CRPS) (Hersbach, 2000; Vehtari et al., 2017). Smaller values are preferred for both.
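The CRPS can be estimated directly from posterior predictive draws. A minimal sketch (function name illustrative), using the standard Monte Carlo form CRPS = E|X − y| − ½E|X − X′|, where X, X′ are independent draws from the predictive distribution and y is the held-out observation:

```python
import numpy as np

def crps_mc(samples, y):
    """Monte Carlo estimate of the continuous ranked probability score
    for observation y given posterior predictive draws `samples`:
    CRPS = E|X - y| - 0.5 * E|X - X'| (smaller is better)."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2
```

A predictive distribution concentrated near the truth scores lower than a mislocated one, which is why smaller values are preferred.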
Results are found in Table 1. We begin by studying the results for model fit. The clearest conclusion is that Models 1 and 3, the models that include the spatial PSBP (Model 3 drops the multiplicative gamma process shrinkage prior), perform the best across all settings. The difference between Models 1 and 3 is minimal, but inclusion of the gamma shrinkage prior (Model 1) does normally fit better. The only time Model 3 outperforms Model 1 is when the true number of latent factors equals the number specified in the model (i.e., k = 6), indicating that the gamma shrinkage prior is useful when the true number of factors is less than the number specified in the model. Another valuable comparison is with Models 4 and 5, which do not use the stick-breaking construction or the gamma shrinkage prior (Model 5 also does not include space) and have worse performance. Clearly, in the presence of space, the spatial PSBP is crucial for model fit. The prediction results mirror the same trend as the model fit, with Models 1 and 3 being superior. Finally, we can compare the results of the simulation across settings dependent on whether the PSBP or Gaussian process was used to generate the data. The setting where a Gaussian process was used is reflective of existing models such as the one from Lopes et al. (2008). In general, we see little change across this simulation setting, indicating that the PSBP and Gaussian process data generating mechanisms may be similar. Regardless of the data generating mechanism, the PSBP has superior performance, and therefore has utility over models with only a Gaussian process specification.
Of course, gains in performance need to be considered within the context of computation time. In this simulation study, the average (SD) runtime in minutes for Models 1-5, respectively, was 273 (58), 260 (54), 268 (54), 279 (60), and 234 (52), when using the spBFA R package. Therefore, as expected, the models with spatial variability took the longest to run, while Model 5, with only Gaussian processes and no spatial variability, was shortest. This indicates that the increased model performance from Model 1 comes at little cost, as its runtime is comparable to the other models. However, we should note that the MCMC sampler was optimized for Model 1, and the computation time for Models 4 and 5 could likely be lowered substantially.

Clustering Using the Spatial PSBP
In our second simulation study, we aimed to demonstrate the use of the spatial PSBP to cluster temporal changes across space. To do this, we used a data generating process similar to the simulation in Section 3.1; however, we made some key changes. In particular, we simulated data based on two true clusters, where the first cluster represented the region of the visual field called the inferior nasal, which includes eight spatial locations. The second cluster consisted of the remaining locations on the visual field.
We compared the clustering technique introduced in Section 2.6 to a simplified version that performed k-means on the raw data. For comparison, we used the ratio of the between sum of squares (BSS) to the total sum of squares (TSS), where the BSS sums, over clusters c, the number of observations in c times the squared difference between the cluster mean of the w(s_{i,o}) belonging to cluster c and the overall mean. The overall mean is Σ_{o,i} w(s_{i,o})/(Om). For adequate clusters, we would expect this ratio to be close to one, because a large BSS indicates high variability between clusters (and, accordingly, small variability within clusters). We present the SS ratio (BSS/TSS) for the raw clustering method, and also for the PSBP technique with π_k = 0.2, 0.3, 0.4, 0.5. We present results only for Models 1-3, since they are the only models with clustering capabilities.
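The BSS/TSS summary can be sketched as follows; here `w` stands for the clustered quantities w(s_{i,o}) flattened into one vector, and the function name is illustrative:

```python
import numpy as np

def bss_tss_ratio(w, labels):
    """Ratio of between-cluster to total sum of squares for values w
    under a cluster assignment `labels`. Values near one indicate
    well-separated clusters (large between-, small within-variability)."""
    w = np.asarray(w, dtype=float)
    labels = np.asarray(labels)
    w_bar = w.mean()                                # overall mean
    tss = np.sum((w - w_bar) ** 2)                  # total SS
    bss = sum(np.sum(labels == c) * (w[labels == c].mean() - w_bar) ** 2
              for c in np.unique(labels))           # between SS
    return bss / tss
```

Perfectly separated clusters give a ratio of one; labels unrelated to the values give a ratio near zero.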
The results of the simulation can be found in Figure 1. We can interpret the clustering performance of the PSBP across values of π_k in absolute and relative terms (i.e., compared to the raw clustering). In general, we can see that, in relative terms, the PSBP has improved clustering performance. In the only settings where the raw clustering technique has better relative performance (δ_β0 = 0, δ_β1 = 0 or 3, and δ_σ² = 2), the differences are negligible, because the SS ratios are close to zero, meaning these settings were overly difficult. Overall, it appears that the biggest boost in performance for the PSBP comes when the true underlying intercepts differ between groups. There is also evidence that the PSBP is capable of detecting clusters based on differences in the underlying true slope (i.e., δ_β1), which is particularly impactful for clustering spatial locations based on temporal trajectories.
We see that the results of the simulations are relatively robust to the choice of π k . Interestingly, the greatest performance corresponds to π k = 0.2, which indicates the utility of the multiplicative gamma process shrinkage prior, as this setting limits the clustering to only the most informative latent factors. This is confirmed when looking at the average (SD) number of latent factors used for clustering across values of π k , 2.11 (1.02), 4.17 (1.35), 6.00 (0.00), 6.00 (0.00), for π k = 0.2, 0.3, 0.4, 0.5, respectively. This indicates that π k = 0.2 is likely preferred, as the true number of clusters in the simulated data was two.
Finally, the average (SD) runtime for Models 1-3, respectively, was 225 (19), 221 (18), 221 (18) minutes, using the spBFA R package. Thus, the model that included both the gamma shrinkage prior and the spatial variability was 4 minutes longer on average. As with the first simulation, for each simulated dataset, the MCMC sampler was run for 100,000 iterations after a 10,000 burn-in and thinned to a final sample size of 10,000.

Glaucoma Progression Using Visual Fields
In our first case study using real data, we used the spatial PSBP to determine glaucoma progression from longitudinal visual fields. Glaucoma, an optic neuropathy, is the leading cause of irreversible vision loss worldwide. Although glaucomatous damage is irreversible, early treatment can usually prevent or slow down progression to functional damage and visual impairment. Estimation of rates of functional deterioration by visual fields is essential for determining patient prognosis and aggressiveness of therapy (Weinreb et al., 2014).

Figure 1: Assessing the clustering performance of the spatial PSBP and multiplicative gamma process shrinkage prior. Simulations are based on a true setting with two clusters. The first cluster is generated from a linear regression model with mean values of intercept, slope, and variance, β_0 = −8, β_1 = −4, and σ² = 3. The second cluster is simulated from the following mean parameters, β*_0 = β_0 + δ_β0, β*_1 = β_1 + δ_β1, and σ²* = σ² + δ_σ², where δ_β0, δ_β1, and δ_σ² dictate the variability across clusters. Furthermore, model parameters were simulated either from an independent (ρ = 0) or spatial (ρ = 0.99) process. The PSBP clustering from Section 2.6 with π_k = 0.2, 0.3, 0.4, 0.5 was compared with k-means performed on the raw data. We present the ratio of the between sum of squares (SS) over the total SS. Greater values of the SS ratio are preferred.

Visual field testing is a psychophysical procedure that assesses a patient's field of vision, with standard automated perimetry (SAP) being the default method. In this study, we analyzed fields generated by the Humphrey Field Analyzer-II (HFA-II; Carl Zeiss Meditec Inc., Dublin, CA). The HFA-II is an interactive technology that assesses a patient's reaction as light is systematically introduced at gridded locations across their visual field. In this study, we represented functional loss using total deviation (TD) values, an age-adjusted measure of sensitivity loss, measured in decibels (dB).
TD is a continuous measure, with large negative values indicating functional loss. An example longitudinal series of visual fields can be found in Figure 2A. Our data included 79 patients (110 eyes) diagnosed with glaucoma at baseline, with an average of 10 clinic visits and 4 years of follow-up. Of the 110 glaucomatous eyes, 51 (46%) were defined as progressing and the remaining 59 (54%) were stable. More details about the glaucoma population are given in Section 1 of the online Supplementary Materials. See Berchuck et al. (2019b) for a more in-depth introduction to visual fields for statisticians.

Rates of Glaucoma Progression
The spatial factor analysis was fit to each of the 110 eyes in our study. For this case study, the MCMC sampler used a burn-in of 50,000 samples, 300,000 subsequent iterations, and then thinned to a final sample of 10,000. Across the 110 eyes, the MCMC chains ran for 554 (152) minutes on average (SD) using the spBFA R package. For the example longitudinal series of visual fields, presented in Figure 2, we present posterior mean and standard deviation fits at each clinic visit. From this visualization, we see that the method is properly spatially smoothing the observed data to better reveal patterns across time.
We are interested in using the spatial factor analysis model to improve clinicians' ability to quantify rates of change across time. Our introduced methodology is appropriate for assessing longitudinal changes on the visual field because, instead of analyzing each location, it models temporal changes of underlying regions. This is much closer to how a clinician interprets change on the visual field, as glaucoma has characteristic patterns. To this end, we performed independent linear regressions of the posterior mean estimates of the latent factors across time. The two-sided p-values from these regressions were used as predictors of progression. In particular, we present five variations that include: (i) all six of the factors, (ii) the first three, and only the (iii) first, (iv) second, and (v) third factors. To assess the diagnostic ability of these methods to discriminate progression status, we performed logistic regressions of progression status on the resulting p-values, using the predicted probabilities of progression as a diagnostic.
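The per-factor regression step can be sketched as below. This assumes posterior mean factors arranged as a T × k matrix; scipy's `linregress` supplies the two-sided slope p-value, and the helper name is hypothetical:

```python
import numpy as np
from scipy import stats

def factor_slope_pvalues(eta, times):
    """Two-sided p-values for the slope from independent OLS fits of
    each latent factor's posterior mean trajectory against visit time.
    `eta` is (T, k): posterior mean factors at T visits."""
    return np.array([stats.linregress(times, eta[:, j]).pvalue
                     for j in range(eta.shape[1])])
```

Small p-values flag factors changing over time; these p-values then enter a logistic regression of progression status to produce predicted probabilities.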
We compared the probabilities from the latent factors to established methods of determining progression on SAP fields, mean deviation (MD) and pattern standard deviation (PSD). Both MD and PSD are age-adjusted measures of vision loss on a visual field, with MD representing a global loss and PSD indicating the level of localized loss. In practice MD (and PSD) is used to assess rates of progression using ordinary least squares (OLS) regression across time, with a lower (and upper) p-value less than 0.05 indicating progression. Again, p-values were regressed against disease status and predicted progression probabilities were obtained.
We compared diagnostics using the area under the receiver operating characteristic (ROC) curve (AUC) and partial AUC (pAUC). Larger values of AUC and pAUC indicate superior discriminatory ability. Based on the precedent of a previous study, we limited the pAUC to the region of clinically relevant specificity, 85-100% (Berchuck et al., 2019a).
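AUC and pAUC over a clinically relevant specificity range can be computed from an empirical ROC curve. A self-contained sketch (function names illustrative), with the partial area taken over false-positive rates in [0, 1 − spec_lo]:

```python
import numpy as np

def _trapezoid(y, x):
    # trapezoidal rule, avoiding the deprecated np.trapz
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def roc_auc_pauc(scores, labels, spec_lo=0.85):
    """Empirical AUC and partial AUC restricted to specificities in
    [spec_lo, 1], i.e., false-positive rates in [0, 1 - spec_lo]."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thr = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    tpr = np.array([np.mean(scores[labels == 1] >= t) for t in thr])
    fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thr])
    auc = _trapezoid(tpr, fpr)
    cut = 1.0 - spec_lo
    tpr_cut = np.interp(cut, fpr, tpr)   # interpolate TPR at the FPR cutoff
    keep = fpr < cut
    pauc = _trapezoid(np.r_[tpr[keep], tpr_cut], np.r_[fpr[keep], cut])
    return auc, pauc
```

For a perfect classifier the pAUC over the 85-100% specificity region equals its maximum possible value of 0.15.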
Of the two established methods, PSD had better performance, with an AUC of 0.65 and pAUC of 0.17 (Figure 3). When all six latent factors were included in the analysis, the AUC improved slightly; however, the pAUC decreased. This is problematic, because the pAUC is clinically more meaningful than the overall AUC. Through inspection of the posterior factors, we determined that for the majority of eyes only three factors contained meaningful information (a result of the shrinkage prior on the loadings matrix). When we limited the metric to only the first three factors, the pAUC nearly doubled, with a maximum pAUC of 0.28. When further exploring each of the first three factors independently, it became clear that the meaningful information had been encoded in the second factor, as the corresponding ROC curve has a much steeper trajectory from 100% specificity. These results indicate that the rates of change learned from the posterior latent factors have clinical utility in determining underlying structural progression from visual fields.

Sensitivity to Choice of Hyperparameters
Since our proposed model has multiple layers with numerous hyperparameters, we present a thorough sensitivity analysis of the results in the previous section. In particular, we assess sensitivity of our results to four modeling assumptions: i) the number of latent factors, k, ii) the base parameter in the multiplicative gamma process shrinkage prior, a_1, iii) the multiplicative parameter, a_2, and iv) the criterion for the bounds of ψ. For each setting, we replicated the results from Figure 3, which assessed the diagnostic performance of the posterior factors for discriminating progression status. In this sensitivity analysis, we look at version (i), which included all latent factors, since it is invariant across our sensitivity settings when changing the number of latent factors. The results of the sensitivity analysis, which display AUC and pAUC at 85% specificity, appear in Table 2. In the original analysis, the number of latent factors, k, was set to 6, while values of 2 and 10 were used for comparison in the sensitivity analysis. To demonstrate sensitivity to the choice of prior for the multiplicative gamma process shrinkage prior, we allowed the hyperparameters a_1 and a_2 to vary from the values used in the original analysis, a_1 = 1 and a_2 = 20. A value of a_1 = 3 yields smaller variances than a value of 1, holding a_2 constant, while changing a_2 to 10 slows the shrinkage process, compared to a value of 20, holding a_1 constant. Finally, we compared the original results to a version where the upper bound of ψ was based on the criterion [b_ψ : [H(b_ψ)]_{t,t'} = 0.05, |x_t − x_{t'}| = x_min], changed from the original 0.01. This criterion is motivated to give narrower and symmetric bounds.
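The roles of a_1 and a_2 can be seen by simulating the multiplicative gamma process itself, where the column-h precision is the cumulative product of gamma increments, δ_1 ∼ Ga(a_1, 1) and δ_l ∼ Ga(a_2, 1) for l ≥ 2 (Bhattacharya and Dunson, 2011). A minimal sketch (rate-1 gammas assumed; function name illustrative):

```python
import numpy as np

def mgp_precisions(a1, a2, k, n_draws=100_000, seed=1):
    """Draws of the column precisions tau_h = prod_{l<=h} delta_l of the
    multiplicative gamma process, with delta_1 ~ Ga(a1, 1) and
    delta_l ~ Ga(a2, 1) for l >= 2. Larger a2 makes tau_h grow faster,
    shrinking later loading columns harder."""
    rng = np.random.default_rng(seed)
    delta = np.column_stack(
        [rng.gamma(a1, 1.0, n_draws)] +
        [rng.gamma(a2, 1.0, n_draws) for _ in range(k - 1)])
    return np.cumprod(delta, axis=1)   # (n_draws, k): tau_h per draw

# E[tau_h] = a1 * a2**(h-1); with a1 = 1, a2 = 20 the prior variance
# 1/tau_h of column h collapses geometrically: 1, 1/20, 1/400, ...
```

This is why a_1 = 3 (holding a_2 fixed) gives smaller variances from the first column onward, while a_2 = 10 slows the geometric collapse relative to a_2 = 20.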
Table 2: Sensitivity analysis to various hyperparameter assumptions in the analysis of visual field data and glaucoma progression risk. For each set of assumptions, we present the area under the receiver operating characteristic curve (AUC) and partial AUC (pAUC) for discriminating progression status. Assumptions include i) the number of latent factors, k, ii) the base parameter in the multiplicative gamma process shrinkage prior, a_1, iii) the multiplicative parameter, a_2, and iv) the criterion for the bounds of ψ. The results from the original analysis are bolded.

The results from the sensitivity analysis show an interesting relationship between the choice of hyperparameters for the multiplicative gamma process shrinkage prior and the
number of latent factors, as related to the diagnostic performance in predicting glaucoma progression status. While at first it appears that performance increases as the number of latent factors increases, this increase actually depends on the hyperparameters of the shrinkage prior. In the scenario where the shrinkage prior has the highest variance and slowest shrinkage (a_1 = 1, a_2 = 10), performance generally increases as the number of latent factors increases. However, when the shrinkage prior has a smaller initial variance and faster shrinkage (a_1 = 3, a_2 = 20), the difference in performance is not as clear, and in fact the setting with 6 latent factors performs very well. This indicates the importance of choosing the number of latent factors and the hyperparameters of the shrinkage prior with care, considering the form of the data. In our setting, we were analyzing visual fields, and therefore chose the number of latent factors to equal the number of meaningful anatomical regions of the retina. Our hyperparameters for the shrinkage prior then allow for some shrinkage, as it is unlikely that each anatomical region has a unique rate of progression.

Clustering Temporal Trends
We close this data illustration by demonstrating the clustering ability of the spatial PSBP detailed in Section 2.6. In Figure 4A, for each latent factor, the posterior probabilities of belonging to the same cluster, g_j(s_{i,o}, s_{i',o'}), are presented. For this eye, only the first three factors contain meaningful information. This is further reinforced in Figure 4B, where the stacked posterior probabilities w(s_{i,o}) are presented across the spatial surface and observation type (w). On the left of Figure 4B, the probabilities are presented in their original order (i.e., un-ordered), while the right frame presents the ordered factors (according to k-means with two groups, determined using the gap statistic). The ordered version removed the three non-informative factors based on the criterion from Section 2.6, with π_k = 0.2. Due to the identifiability issue, the right of Figure 4B is required for clustering.
We present the clusters obtained from the spatial PSBP (left) and from clustering of the raw data using k-means with one cluster, based on the gap statistic (right). The clusters, presented in Figure 4C, show p-values, which are the average lower p-values from point-wise regressions of the PPD across all locations within each cluster. From this presentation, we can see that the spatial PSBP was capable of producing a map with regions of varying temporal trends that are straightforward for clinicians to interpret. While clustering the raw data was non-informative, the PSBP revealed a region with faster temporal trajectories that may require intervention and corresponds with true progression (Figure 2A).
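Pairwise co-clustering probabilities such as g_j(s_{i,o}, s_{i',o'}) can be estimated generically from MCMC draws of cluster labels. A minimal sketch, not tied to the spBFA internals:

```python
import numpy as np

def coclustering_probability(label_draws):
    """Posterior probability that each pair of locations shares a
    cluster, estimated from MCMC draws of cluster labels.
    `label_draws` is (n_draws, m): one label per location per draw."""
    z = np.asarray(label_draws)
    same = z[:, :, None] == z[:, None, :]   # (n_draws, m, m) agreement
    return same.mean(axis=0)                # (m, m) pairwise probabilities
```

The resulting m × m matrix is invariant to label switching across draws, which is what makes this summary usable despite the identifiability issue noted above.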

Malaria Incidence in Peru
In our second case study using real data, we investigated rates of malaria across Loreto, Peru's northernmost region, located in the Amazon rainforest. This case study is a nice complement to the glaucoma example, as it introduces additional complexities, including dealing with non-Gaussian data, having multiple spatial observation types (i.e., malaria types), and introducing covariates into the mean process.
To model malaria counts we used conventions based on the Peruvian Ministry of Health, which defines significant levels of malaria based on observed counts from previous years. In particular, define the counts of malaria c_t(s_{i,o}) at time t for district i and malaria type o. For malaria data, the temporal index is actually composed of two dimensions that represent year and epidemiological week, t = {y, w}, for y = 2012, . . . , 2018, and w = 1, . . . , 52. Furthermore, the Loreto region has 51 districts and there are two types of malaria monitored, P. falciparum (o = 1) and P. vivax (o = 2). In order to model malaria counts, we defined the proportion of malaria p_yw(s_{i,o}) = c_yw(s_{i,o})/n_yw(s_{i,o}), where n_yw(s_{i,o}) represents the population. Then, we defined our outcome as the indicator that p_yw(s_{i,o}) exceeds the average proportion of cases for a malaria type at a location over the past five years. This binary definition allowed us to model the probability of exceeding the number of cases seen in the past five years, π_yw(s_{i,o}), an important indicator for specifying interventions.
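The exceedance outcome can be sketched directly; the argument names (`counts`, `population`, `baseline_prop` for the five-year average proportion) are illustrative:

```python
import numpy as np

def exceedance_indicator(counts, population, baseline_prop):
    """Binary outcome Y = 1{p > p_bar}: whether this week's malaria
    proportion exceeds the location's average proportion over the
    previous five years (`baseline_prop`)."""
    p = np.asarray(counts, dtype=float) / np.asarray(population, dtype=float)
    return (p > np.asarray(baseline_prop)).astype(int)
```

For example, with a five-year baseline proportion of 0.03, a week with 5 cases per 100 people is flagged as an exceedance while a week with 1 case is not.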
To best model the mean process, we incorporated the following covariates through x_t(s_{i,o}): rainfall (millimeters), temperature (Celsius), and an indicator of being in the rainy season, which runs approximately from February to July but is defined using weeks, 1(w ∈ {6, . . . , 31}), in addition to an intercept, so that p = 4 (Vittor et al., 2009). To model the malaria indicator we used a Bernoulli likelihood, Y_yw(s_{i,o}) ∼ Bernoulli(π_yw(s_{i,o})), with the exceedance probability linked to the linear predictor through ϑ_t(s_{i,o}) = logit(π_yw(s_{i,o})). While inference can proceed using this likelihood directly, it is computationally intensive due to a loss of conjugacy across the majority of parameters. Computation can be made feasible through data augmentation using Pólya-Gamma (PG) latent variables (Polson et al., 2013).
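The key Pólya-Gamma identity from Polson et al. (2013) underlying this augmentation is, for κ = a − b/2 and ω ∼ PG(b, 0),

```latex
\frac{\left(e^{\psi}\right)^{a}}{\left(1+e^{\psi}\right)^{b}}
  \;=\; 2^{-b}\, e^{\kappa\psi}
  \int_{0}^{\infty} e^{-\omega\psi^{2}/2}\, p(\omega)\, d\omega,
  \qquad \kappa = a - \tfrac{b}{2},\;\; \omega \sim \mathrm{PG}(b,\,0).
```

For the Bernoulli likelihood here, a = Y_t(s_{i,o}) and b = 1, so conditional on ω_t(s_{i,o}) the linear predictor enters through a Gaussian kernel, restoring conjugate updates.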
In particular, based on Polson et al. (2013), we chose to model the observed data through a joint likelihood, f(Y_t(s_{i,o}), ω_t(s_{i,o}) | ϑ_t(s_{i,o})), that consists of an augmented parameter with distribution ω_t(s_{i,o}) ∼ PG(1, 0). The reason for this is that the conditional distribution f(ω_t(s_{i,o}) | ϑ_t(s_{i,o})) is a tilted version of the PG distribution, which can be shown using the fact that cosh{x} = (1 + exp{2x})/(2 exp{x}).

Discussion

We have introduced a Bayesian non-parametric spatial factor analysis model, with spatial dependency induced through a PSBP prior on the columns of the factor loadings matrix and a multiplicative gamma process shrinkage prior to adaptively learn the proper number of latent factors. We have described an efficient MCMC sampler to accompany the model, which has been published as an R package, spBFA. We have illustrated the model's performance in both simulated and real data examples. In particular, we showed that by encoding spatial information into the loadings matrix, meaningful factors were learned that respect the observed neighborhood dependencies, making them useful for assessing rates of change across space.
While much of the factor analysis literature performs dimension reduction on the mean process for Gaussian data, we place no such restriction, allowing for a general likelihood and a non-linear relationship with the latent factors. We illustrated this through our modeling of malaria counts with a Bernoulli likelihood. Although our specifications of both space and time are separable, stationary Gaussian processes, we have noted that the resulting model is neither separable, stationary, nor Gaussian.
Importantly, our proposed model relies on the fully-specified factor loadings matrix from Bhattacharya and Dunson (2011). One important drawback of Bhattacharya and Dunson (2011) is that the factors are not identifiable; hence, inference cannot be obtained directly on the factors. This also applies to the factors in our proposed model. However, we overcame this limitation by proposing a method for clustering temporal trajectories across a spatial surface that relies on the entire collection of factor loadings, which in their entirety are identifiable (Section 2.6). Through simulation in Section 3.2, we showed this technique was capable of identifying true underlying spatial clusters of temporal change. We then confirmed this in two real-world data applications in Section 4. Finally, as noted in Bhattacharya and Dunson (2011), there are numerous cases where this fully-specified form of the factor loadings matrix is useful, including covariance estimation and prediction. We confirmed this by using the posterior distribution of the factors to diagnose glaucoma disease progression in Figure 3.
Using our spatial PSBP prior, we were able to find regions within the spatial domain with similar temporal trajectories, an important task in many applied settings. Through simulation, we showed that our clustering technique is preferred over a standard clustering routine. In real data, we showed that aggregating rates of change at the cluster level produced patterns that aided in making actionable decisions. The task of clustering trajectories has been addressed sparsely in the literature, with the majority of methods clustering trends with no shape constraint, like our temporal process. A recent method by Napier et al. (2018), instead, clusters trajectories based on pre-specified parametric forms (e.g., linear). While this technique can be limiting, it has utility in applied settings with known trajectories, and therefore would be a valuable extension to our model. Finally, this work opens up numerous avenues for future statistical research. If applicable, a seasonal component could be added to the temporal model of the latent factors, η_t. Currently, we only allow for constant temporal smoothing; however, this could easily be generalized to incorporate any type of seasonal effect, for example a Fourier dynamic linear model (West et al., 1985). Any temporal effect that has Gaussian errors will maintain conjugacy for the MCMC sampling. Other natural extensions to this approach include the treatment of sparse data and a remedy for a large number of spatial locations, m. Although we did not treat these problems in the current manuscript, existing methods could be nicely incorporated into our modeling framework. For example, the highly scalable nearest-neighbor Gaussian process model, which allows for inference with a large number of spatial locations, could be applied in a straightforward fashion (Datta et al., 2016).

Supplementary Material
Supplementary material to 'Bayesian Non-Parametric Factor Analysis for Longitudinal Spatial Surfaces' (DOI: 10.1214/20-BA1253SUPP; .pdf). The supplement contains details related to the model, including full conditionals, moment derivations, prediction theory, and also specifics of the glaucoma population.