Gaussian orthogonal latent factor processes for large incomplete matrices of correlated data

We introduce Gaussian orthogonal latent factor processes for modeling and predicting large correlated data. To handle the computational challenge, we first decompose the likelihood function of the Gaussian random field with a multi-dimensional input domain into a product of densities at the orthogonal components with lower-dimensional inputs. The continuous-time Kalman filter is implemented to compute the likelihood function efficiently without making approximations. We also show that the posterior distributions of the factor processes are independent, as a consequence of the prior independence of the factor processes and the orthogonality of the factor loading matrix. For studies with large sample sizes, we propose a flexible way to model the mean, and we derive the marginal posterior distribution to solve identifiability issues in sampling these parameters. Both simulated and real data applications confirm the outstanding performance of this method.

Compared to the large number of studies on approximating GPs, less progress has been made on efficiently computing the likelihood function without approximation. In this work, we propose a flexible and computationally feasible approach to model large incomplete matrix observations of correlated data, called Gaussian orthogonal latent factor (GOLF) processes. Bayesian inference is derived to assess the uncertainty in parameter estimation and predictions. GPs with product covariance functions on lattice observations or semiparametric latent factor models (Sacks et al., 1989; Kennedy and O'Hagan, 2001; Teh et al., 2005) can be represented as full-rank GOLF processes, which permit much smaller computational costs than directly computing the likelihood function and making predictions. The computational cost can be reduced further by low-rank GOLF processes, where the cost is similar to the order of principal component analysis.
We highlight a few contributions of this work. We first show that for GPs with product covariance functions or semiparametric latent factor models, if the latent factor loading matrix is orthogonal, prior independence of the latent factor processes implies their posterior independence. This new finding allows one to decompose the likelihood function of lattice data into a product of densities of projected outputs, which greatly reduces the computational complexity. Separate continuous-time Kalman filters can be applied to compute the posterior distributions of the factor processes at lower-dimensional inputs in parallel, requiring a number of computational operations linear in the number of observations. Second, as a large number of observations provides rich information, we introduce a flexible way to model the mean function and derive the marginal posterior distribution of the linear coefficients, to solve identifiability issues in posterior sampling. Furthermore, compared with the maximum marginal likelihood estimation of factor loadings derived in Gu and Shen (2020), our approach is applicable to observations on an incomplete lattice. Finally, we develop Bayesian inference for uncertainty assessment, which is critically important for inverse problems in applications (Kennedy and O'Hagan, 2001; Bayarri et al., 2007).
The purpose of this work is twofold. First, we aim to develop a pipeline of computationally efficient methods for modeling correlated data with multi-dimensional inputs without approximating the likelihood function. Properties of GOLF processes derived in this work are useful for developing efficient approximation algorithms for scenarios with multi-dimensional input variables. In addition, the nonseparable covariance and coordinate-specific mean coefficients proposed in this work provide flexible choices for modeling local information. Second, we primarily focus on applications based on images, which include inverse problems based on satellite radar interferograms (Anderson et al., 2019) and estimating dynamic information from microscopic videos (Cerbino and Trappe, 2008). Our approach allows for efficient Bayesian inference in large-sample scenarios.
The rest of the article is organized as follows. In Section 2.1, we introduce the GOLF model with an emphasis on the orthogonal decomposition of the likelihood function and the posterior independence of the latent factor processes. The flexible mean function, the spatial latent factor loading matrix, and the kernel functions are discussed in Sections 2.2-2.4, respectively.
We introduce the Markov chain Monte Carlo (MCMC) algorithm and discuss the computational complexity in Section 3.1. In Section 3.2, we introduce the continuous-time Kalman filter for computing the likelihood function with linear computational complexity. Section 4 compares our approach with other alternatives, and numerical results for comparing these approaches are presented in Sections 5-6. We conclude this work and discuss several potential extensions in Section 7. Proofs of lemmas and theorems are given in the supplementary materials. The data and code used in this paper are publicly available (https://github.com/UncertaintyQuantification/GOLF).
2 Gaussian orthogonal latent factor processes

2.1 Orthogonal decomposition and posterior independence

Let $y_s(x) = (y_{s_1}(x), \ldots, y_{s_{n_1}}(x))^T$ be an $n_1 \times 1$ vector of observations at coordinates $s = (s_1, \ldots, s_{n_1})^T$ with $s_i \in \mathbb{R}^{p_1}$ for $i = 1, \ldots, n_1$ and input $x \in \mathbb{R}^{p_2}$. For spatially correlated data, for instance, $s$ and $x$ may denote the latitude and longitude, respectively; in spatio-temporal models, the spatial coordinates and time points can be defined as $s$ and $x$, respectively.
Consider the latent factor model:
$$y_s(x) = m_s(x) + A_s z(x) + \epsilon, \qquad (1)$$
where $A_s = [a_1, \ldots, a_d]$ is an $n_1 \times d$ factor loading matrix, $z(x) = (z_1(x), \ldots, z_d(x))^T$ is a $d$-dimensional vector of factor processes with $d \le n_1$, and $\epsilon \sim \mathcal{N}(0, \sigma_0^2 I_{n_1})$ is a vector of independent Gaussian noises. The mean function $m_s(x) = (m_{s_1}(x), \ldots, m_{s_{n_1}}(x))^T$ is typically modeled via a linear trend of regressors, which will be discussed in Section 2.2.
As data are typically positively correlated at two nearby inputs, we assume each $z_l(\cdot)$ independently follows a zero-mean Gaussian process (GP), meaning that for any $\{x_1, \ldots, x_{n_2}\}$, $Z_l^T = (z_l(x_1), \ldots, z_l(x_{n_2}))^T$ follows a multivariate normal distribution:
$$Z_l^T \sim \mathcal{MN}(0, \Sigma_l), \qquad (2)$$
where the $(i, j)$th entry of the covariance matrix $\Sigma_l$ is $\sigma_l^2 K_l(x_i, x_j)$, with kernel function $K_l(\cdot, \cdot)$ and variance parameter $\sigma_l^2$, for $l = 1, \ldots, d$. Here we assume independence between the factor processes a priori. A detailed comparison between our approach and other related approaches is given in Section 4.
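As an illustration of the model just described, the sketch below simulates observations from (1) with an orthonormal loading matrix and independent GP factors. The kernel choice (exponential) and all parameter values here are illustrative, not the paper's defaults:

```python
import numpy as np

def exp_kernel(x, gamma):
    # exponential kernel K(x_i, x_j) = exp(-|x_i - x_j| / gamma)
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-d / gamma)

rng = np.random.default_rng(0)
n1, n2, d = 30, 50, 5
x = np.linspace(0, 1, n2)

# orthonormal factor loading matrix A_s (columns of a QR factor)
A_s, _ = np.linalg.qr(rng.standard_normal((n1, d)))

# independent zero-mean GP factors z_l with factor-specific variances/ranges
sigma2 = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
gamma = np.array([0.5, 0.4, 0.3, 0.2, 0.1])
Z = np.stack([rng.multivariate_normal(np.zeros(n2), s2 * exp_kernel(x, g))
              for s2, g in zip(sigma2, gamma)])   # d x n2 matrix of factors

# observations Y = A_s Z + noise (zero mean for simplicity)
sigma2_0 = 0.1
Y = A_s @ Z + np.sqrt(sigma2_0) * rng.standard_normal((n1, n2))
```

Allowing each factor its own variance and range parameter is exactly what makes the implied covariance of $Y$ nonseparable in general.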
Note that only the $d$-dimensional linear subspace spanned by the factor loadings $A_s$ can be identified if no further specification of $A_s$ is made, as model (1) is unchanged if the pair $(A_s, z(x))$ is replaced by $(A_s G, G^{-1} z(x))$ for any invertible matrix $G$. Besides, the computation can be challenging when the number of factors or input parameters is large. Thus, we assume that the columns of $A_s$ are orthonormal:

Assumption 1. $A_s^T A_s = I_d$. $\qquad (3)$
Assumption 1 may be replaced by $A_s^T A_s = \Lambda$, where $\Lambda$ is a diagonal matrix. Since we estimate the variance parameters $\sigma^2 = (\sigma_1^2, \ldots, \sigma_d^2)^T$ of the latent factor processes from the data, the diagonal terms of $\Lambda$ are redundant. Thus we proceed with Assumption 1.
Let us first assume we have an $n_1 \times n_2$ matrix of observations $Y = [y_s(x_1), \ldots, y_s(x_{n_2})]$ at inputs $\{x_1, \ldots, x_{n_2}\}$; we extend our method to incomplete matrix observations in Section 3. Denote by $B$ the regression parameters in the $n_1 \times n_2$ mean matrix $M = (m_s(x_1), \ldots, m_s(x_{n_2}))$. Denote $\Theta = (A_s, B, \sigma^2, \gamma)$, which contains the factor loadings, mean parameters, variance parameters, and range parameters in the kernel functions. Further let $A_F = [A_s, A_c] = [a_1, a_2, \ldots, a_{n_1}]$, where $A_c$ is an $n_1 \times (n_1 - d)$ matrix whose columns form the orthogonal complement of $A_s$. Assumption 1 allows us to decompose the marginal likelihood (after integrating out the random factor $Z$) into a product of multivariate normal densities of the outcomes at the projected coordinates:
$$p(Y \mid \Theta) = \prod_{l=1}^{d} \mathcal{PN}(\tilde{y}_l;\, 0, \tilde{\Sigma}_l) \prod_{l=d+1}^{n_1} \mathcal{PN}(\tilde{y}_l;\, 0, \sigma_0^2 I_{n_2}), \qquad (4)$$
where $\tilde{y}_l^T = a_l^T (Y - M)$ for $l = 1, \ldots, n_1$, $\tilde{\Sigma}_l = \Sigma_l + \sigma_0^2 I_{n_2}$, and $\mathcal{PN}(\cdot\,;\mu, \Sigma)$ denotes the density of the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. In practice, note that we can avoid computing $A_c$ by using the identity $A_s A_s^T + A_c A_c^T = I_{n_1}$. The derivation of Equation (4) is given in the supplementary materials.
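The decomposition above can be checked numerically on a small example. The sketch below (with illustrative sizes, a zero mean, and an exponential kernel) compares the direct multivariate normal log-density of the column-stacked $Y$ with the sum of log-densities of the projected outputs:

```python
import numpy as np
from scipy.stats import multivariate_normal

def exp_kernel(x, gamma):
    return np.exp(-np.abs(x[:, None] - x[None, :]) / gamma)

rng = np.random.default_rng(1)
n1, n2, d, s2_0 = 6, 8, 2, 0.2
x = np.linspace(0, 1, n2)

# a full orthogonal basis: the first d columns play the role of A_s, the rest A_c
A_F, _ = np.linalg.qr(rng.standard_normal((n1, n1)))
Sigmas = [3.0 * exp_kernel(x, 0.5), 1.5 * exp_kernel(x, 0.2)]  # sigma_l^2 K_l

# covariance of vec(Y) (column-stacked): sum_l Sigma_l kron a_l a_l^T + s2_0 I
cov_full = s2_0 * np.eye(n1 * n2)
for l in range(d):
    a = A_F[:, [l]]
    cov_full += np.kron(Sigmas[l], a @ a.T)

Y = rng.standard_normal((n1, n2))     # an arbitrary data matrix to evaluate
log_direct = multivariate_normal(np.zeros(n1 * n2), cov_full).logpdf(
    Y.flatten(order="F"))

# orthogonal decomposition: product over densities of projected outputs
log_decomp = 0.0
for l in range(n1):
    y_tilde = A_F[:, l] @ Y           # l-th projected output a_l^T Y
    cov_l = (Sigmas[l] + s2_0 * np.eye(n2)) if l < d else s2_0 * np.eye(n2)
    log_decomp += multivariate_normal(np.zeros(n2), cov_l).logpdf(y_tilde)
```

The two quantities agree to numerical precision, since the orthogonal projection has unit Jacobian and decorrelates the rows.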
The orthogonal factor loading matrix in Assumption 1 and the prior independence of the factor processes lead to the posterior independence of the factor processes, stated in the following corollary.
Corollary 1. 1. Under Assumption 1, the posterior distributions $(Z_l^T \mid Y, \Theta)$, $l = 1, \ldots, d$, are mutually independent.

2. For $l = 1, \ldots, d$, the posterior distribution $(Z_l^T \mid Y, \Theta)$ follows a multivariate normal distribution:
$$(Z_l^T \mid Y, \Theta) \sim \mathcal{MN}\!\left(\Sigma_l \tilde{\Sigma}_l^{-1} \tilde{y}_l,\; \Sigma_l - \Sigma_l \tilde{\Sigma}_l^{-1} \Sigma_l\right). \qquad (5)$$

We call the latent factor processes in (1) with Assumption 1 Gaussian orthogonal latent factor (GOLF) processes, because of the orthogonal decomposition of the likelihood function and the posterior independence between any two factor processes. The main idea is to decompose the likelihood of GP models with multi-dimensional inputs into a product of densities with lower-dimensional inputs, and to utilize the continuous-time Kalman filter for fast computation. As we will see in Section 3, these two properties dramatically ease the computational burden.

2.2 Flexible mean function and marginalization
The mean function $m_s(\cdot)$ plays an important role in modeling and predicting correlated data. Computer models (such as the numerical solution of partial differential equations), for example, can be included as part of the mean in an inverse problem (Kennedy and O'Hagan, 2001). Here, for simplicity, we use only linear basis functions of $s$ and $x$, whereas additional terms may be included in the mean if available.

Figure 1: Estimated linear coefficients for temperature observations in Heaton et al. (2019). In the left panel, the dots are the estimated coefficients in a linear regression of observations at each longitude separately, using latitudes as regressors. The estimated linear coefficients for the observations at each latitude are graphed in the right panel, where longitudes are used as regressors.
In a GP model, the regression coefficients are often assumed to be the same across one basis function. For instance, the mean function may be modeled as $m_s(x) = h_1(s) b_{1,0}$ or $m_s(x) = h_2(x) b_{2,0}$, where $h_1(s)$ and $h_2(x)$ are sets of $1 \times q_1$ and $1 \times q_2$ mean basis functions, and $b_{1,0}$ and $b_{2,0}$ are $q_1 \times 1$ and $q_2 \times 1$ regression coefficients, respectively. The regression coefficients $b_{1,0}$, for example, are shared across each $x$.
Shared regression coefficients may be a restrictive assumption when data sets are large. Consider, for instance, the temperature data set used in Heaton et al. (2019), where the temperature values are shown in Figure 5. In Figure 1, we graph the fitted linear regression coefficients using latitudes or longitudes as regressors. The estimated regression coefficients are not the same across latitude or longitude. A natural extension of the mean function, therefore, is to allow the mean parameters at each row or column of the observations to differ, e.g., $m_{s_i}(x_j) = h_1(s_i) b_{1,j}$ or $m_{s_i}(x_j) = h_2(x_j) b_{2,i}$, for $i = 1, \ldots, n_1$ and $j = 1, \ldots, n_2$. Some choices of the individual mean functions are summarized in Table 1. The mean function may be specified based on model interpretation or exploratory data analysis. Models with different regression coefficients across different types of coordinates are more suitable for modeling a large number of observations, as they are more flexible in capturing the trend.
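The column-specific mean $m_{s_i}(x_j) = h_1(s_i) b_{1,j}$ can be fitted in one least-squares solve per basis matrix. The sketch below uses synthetic data whose trend in $s$ varies across columns (mimicking the pattern in Figure 1); the basis $h_1(s) = (1, s)$ and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 40, 60
s = np.linspace(0, 1, n1)            # e.g. latitudes
x = np.linspace(0, 1, n2)            # e.g. longitudes

# synthetic field whose linear trend in s differs across columns
slopes = 1.0 + 2.0 * np.sin(2 * np.pi * x)       # column-specific slopes
Y = np.outer(s, slopes) + 0.1 * rng.standard_normal((n1, n2))

# column-specific coefficients: m_{s_i}(x_j) = h_1(s_i) b_{1,j}, h_1(s) = (1, s)
H1 = np.column_stack([np.ones(n1), s])           # n1 x q1 mean basis matrix
B1 = np.linalg.lstsq(H1, Y, rcond=None)[0]       # q1 x n2: one b_{1,j} per column
```

Each column of `B1` holds the intercept and slope for one value of $x$, so the recovered slopes vary with $x$, something a single shared coefficient $b_{1,0}$ cannot capture.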
To implement full Bayesian inference of the parameters, one may sample from the posterior distribution of the regression parameters $p(B \mid \Theta_{-B}, Y, Z)$. However, we found a severe identifiability problem between the mean $M$ and $AZ$ when the regression coefficients $B$ are sampled from the full posterior distribution. This is because the likelihood function of the mean parameters is flat when the data are very correlated. Consequently, the absolute values of the entries of these two matrices can both be large, making the MCMC algorithm very unstable. To alleviate the identifiability problem, we first integrate out the factors and sample the regression parameters from the marginal posterior distribution $p(B \mid \Theta_{-B}, Y)$. The marginal posterior distributions of the regression parameters are given in the following Theorem 1 and Theorem 2.

Table 1: Summary of the mean functions studied in this work. In the third column, $H_1 = (h_1^T(s_1), \ldots, h_1^T(s_{n_1}))^T$ and $H_2 = (h_2^T(x_1), \ldots, h_2^T(x_{n_2}))^T$ are $n_1 \times q_1$ and $n_2 \times q_2$ mean basis matrices, respectively. Regression coefficients are denoted as $B_1 = (b_{1,1}, \ldots, b_{1,n_2})$ and $B_2 = (b_{2,1}, \ldots, b_{2,n_1})$ for the basis functions $h_1(\cdot)$ and $h_2(\cdot)$, respectively.
Theorem 1. 1. (Row regression coefficients). Assume $M = H_1 B_1$ and the objective prior $\pi(B_1) \propto 1$ for $B_1$. After marginalizing out the factor $Z$, posterior samples of $B_1$ from $p(B_1 \mid Y, \Theta_{-B_1})$ can be obtained by a direct sampling scheme, in which $Z_{0,s}$ is an $n_2 \times d$ matrix with the $l$th column independently sampled from $\mathcal{N}(0, \tilde{\Sigma}_l)$ for $l = 1, \ldots, d$, and $Z_{0,1}$ is an $n_1 \times n_2$ matrix with each entry independently sampled from the standard normal distribution.
2. (Column regression coefficients). Assume $M = (H_2 B_2)^T$ and the objective prior $\pi(B_2) \propto 1$ for the regression parameters $B_2$. After marginalizing out the factor $Z$, posterior samples of $B_2$ from $p(B_2 \mid Y, \Theta_{-B_2})$ can be obtained by a direct sampling scheme, in which $B_{2,0,s}$ is a $q_2 \times d$ matrix with the $l$th column independently sampled from $\mathcal{N}(0, (H_2^T \tilde{\Sigma}_l^{-1} H_2)^{-1})$ for $l = 1, \ldots, d$, $L_{H_2}$ is a $q_2 \times q_2$ matrix such that $L_{H_2} L_{H_2}^T = (H_2^T H_2)^{-1}$, and $Z_{0,2}$ is a $q_2 \times n_1$ matrix with each entry independently sampled from the standard normal distribution.
When both the row regression coefficients and the column regression coefficients are in the model, we found that $M_1 = H_1 B_1$ and $M_2 = (H_2 B_2)^T$ are not identifiable if we sample $B_1$ and $B_2$ from the full conditional distributions. To avoid this problem, we first marginalize out $B_2$ and $Z$ to sample $B_1$, and then we condition on $B_1$ to sample $B_2$.
Theorem 2. 1. After marginalizing out $Z$ and $B_2$, marginal posterior samples of $B_1$ from $p(B_1 \mid Y, \Theta_{-B_1, -B_2})$ can be obtained by a direct sampling scheme, in which $Z_{0,Q}$ is an $n_2 \times d$ matrix with the $l$th column independently sampled from $\mathcal{N}(0, Q_{1,l})$ for $l = 1, \ldots, d$, $Z_{0,1}$ is an $n_1 \times n_2$ matrix with each entry independently sampled from the standard normal distribution, and $P_0 = I_{n_2} - H_2 (H_2^T H_2)^{-1} H_2^T$.

2. Posterior samples of $B_2$ from $p(B_2 \mid Y, B_1, \Theta_{-B_1, -B_2})$ can be obtained through equation (7) by replacing $Y$ with $Y - H_1 B_1$.
In Theorem 1 and Theorem 2, the marginal posterior distributions of the regression coefficients depend on the $n_1 \times d$ factor loading matrix, but not on its orthogonal complement $A_c$. Since we do not need to compute $A_c$, the most computationally intensive terms are those containing the covariance matrix $\tilde{\Sigma}_l$ and its inverse. Fortunately, when the Matérn covariance is used, each term can be computed with complexity linear in $n_2$, instead of $O(n_2^3)$, as discussed in Section 3.2.

2.3 Spatial latent factor loading matrix
This section discusses a model of the latent factor loading matrix $A_s$ that satisfies the orthogonal constraint in (3). As output values are marginally correlated at two inputs $s_a$ and $s_b$, a natural choice is to let $A_s$ be the eigenvectors corresponding to the largest $d$ eigenvalues in the eigendecomposition of the correlation matrix $R_s$, whose $(i, j)$th entry is specified by a kernel function $K_s(s_i, s_j)$, for $1 \le i, j \le n_1$. We give a few examples of models that can be written as special cases of the GOLF model when $A_s$ is specified as eigenvectors of $R_s$. For simplicity, we assume the mean is zero. The first and second classes of models are GP models with separable covariance functions for inputs of two and three dimensions, respectively.
Example 1 (Spatial model with separable covariance). Consider a spatial model of $Y$ on a regular $n_1 \times n_2$ lattice, where the $(i, j)$th input is $(s_i, x_j)$, with $s_i$ and $x_j$ denoting the $i$th latitude coordinate and $j$th longitude coordinate, respectively. Assume the covariance of the spatial process is separable, meaning that $Y \sim \mathcal{N}(0, \sigma^2 R_s \otimes R_x + \sigma_0^2 I_{n_1 n_2})$, where the $(l_1, m_1)$th term of $R_s$ is parameterized by the kernel function $K_s(s_{l_1}, s_{m_1})$ and the $(l_2, m_2)$th term of $R_x$ is $K_x(x_{l_2}, x_{m_2})$, for $1 \le l_1, m_1 \le n_1$ and $1 \le l_2, m_2 \le n_2$. Let $R_s = U_s \Lambda_s U_s^T$, where $U_s$ is a matrix of eigenvectors and $\Lambda_s$ is a diagonal matrix of eigenvalues of $R_s$ with $l$th diagonal term $\lambda_l$. The density of this spatial model is equivalent to model (1) with $A_s = U_s$, $\Sigma_l = \sigma^2 \lambda_l R_x$ and $d = n_1$.
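The equivalence in Example 1 rests on the Kronecker identity $R_s \otimes R_x = \sum_l \lambda_l (u_l u_l^T) \otimes R_x$, which a small numerical sketch (illustrative sizes, exponential kernels) can confirm:

```python
import numpy as np

def exp_kernel(v, gamma):
    return np.exp(-np.abs(v[:, None] - v[None, :]) / gamma)

n1, n2 = 8, 10
R_s = exp_kernel(np.linspace(0, 1, n1), 0.3)
R_x = exp_kernel(np.linspace(0, 1, n2), 0.5)

lam, U = np.linalg.eigh(R_s)          # R_s = U diag(lam) U^T

# separable covariance rebuilt from the full-rank GOLF representation:
# sum_l lam_l (u_l u_l^T) kron R_x, i.e. A_s = U, Sigma_l = lam_l R_x, d = n1
rebuilt = sum(lam[l] * np.kron(np.outer(U[:, l], U[:, l]), R_x)
              for l in range(n1))
```

Dropping the terms with small $\lambda_l$ from this sum is exactly the low-rank GOLF truncation.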
Example 2 (Spatio-temporal model with separable covariance). Consider a spatio-temporal model of $Y$ on an $n_{1,1} \times n_{1,2} \times n_2$ lattice, where the $(i, j, k)$th input is $(s_{1,i}, s_{2,j}, x_k)$, with $s_{1,i}$ and $s_{2,j}$ denoting the $i$th latitude coordinate and $j$th longitude coordinate, respectively, and $x_k$ denoting the $k$th time point. Let $n_1 = n_{1,1} \times n_{1,2}$. Assume the covariance of the spatio-temporal process is separable, meaning that $Y \sim \mathcal{N}(0, \sigma^2 R_{s_1} \otimes R_{s_2} \otimes R_x + \sigma_0^2 I_{n_1 n_2})$, with $R_{s_i} = U_i \Lambda_i U_i^T$, where $U_i$ is a matrix of eigenvectors and $\Lambda_i$ is a diagonal matrix of eigenvalues $\lambda_{l_i}$ for $1 \le l_i \le n_{1,i}$ and $i = 1, 2$. The density of this spatio-temporal model is equivalent to model (1) with $A_s = U_1 \otimes U_2$, $\Sigma_{(l_1, l_2)} = \sigma^2 \lambda_{l_1} \lambda_{l_2} R_x$ and $d = n_1$.

The separable covariance is widely used in emulating and calibrating computationally expensive computer models with scalar output (Sacks et al., 1989) and vector output (Conti and O'Hagan, 2010; Paulo et al., 2012), whereas the isotropic covariance, i.e., a covariance that is a function of the Euclidean distance between inputs, is used more often in modeling spatially correlated data (Gelfand et al., 2010). Some anisotropic kernels, such as the geometrically anisotropic kernel, were studied in Zimmerman (1993) for modeling spatially correlated observations. Note that the covariance of GOLF processes in (1) is not separable in general, as the variance and kernel parameters of each factor process $z_l(\cdot)$ can be different. Different kernel parameters make the model more flexible, as the factor processes corresponding to large eigenvalues are often found to be smoother than those corresponding to small eigenvalues. A separable covariance may be restrictive in this regard, as the factor processes are assumed to share the same kernel and parameters.
Computing the likelihood of a GP with separable covariance on complete $n \times n$ lattice data generally takes $O(N^{3/2})$ operations through the eigendecomposition of the sub covariance matrices. This work generalizes this approach to nonseparable covariances for both complete and incomplete lattice observations. One can further reduce the computational complexity by selecting the $d$ eigenvectors corresponding to the $d$ largest eigenvalues from the eigendecomposition of the correlation matrix $R_s$. The proportion of the sum of the $d$ largest eigenvalues to the sum of all eigenvalues should be chosen as large as possible, so that the model can explain most of the variability of the signal (Higdon et al., 2008). We found that using more factors than the truth typically does not incur a large reduction in predictive accuracy, whereas using a much smaller number of factors than the truth causes a large predictive error (Example 4 in the simulated studies). Thus one should be cautious about using a very small number of factors.
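The eigenvalue-proportion rule for selecting $d$ can be written as a short helper; the threshold value and kernel below are illustrative:

```python
import numpy as np

def choose_rank(R_s, prop=0.99):
    # smallest d whose leading eigenvalues explain at least `prop`
    # of the total variation (PCA-style truncation)
    lam = np.linalg.eigvalsh(R_s)[::-1]          # eigenvalues, descending
    ratio = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(ratio, prop) + 1)

n1 = 100
sgrid = np.linspace(0, 1, n1)
R_s = np.exp(-np.abs(sgrid[:, None] - sgrid[None, :]) / 0.3)  # exponential kernel
d = choose_rank(R_s, 0.99)
```

For a strongly correlated $R_s$ the eigenvalues decay quickly, so `d` is well below $n_1$ and the low-rank GOLF model retains nearly all of the signal variability.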

2.4 Kernel functions
We first discuss the kernel function for the factor process $z_l(\cdot)$, $l = 1, \ldots, d$. We assume a product kernel between coordinates of the input (Sacks et al., 1989), i.e., for any inputs $x_a = (x_{a1}, \ldots, x_{ap_2})$ and $x_b = (x_{b1}, \ldots, x_{bp_2})$,
$$K_l(x_a, x_b) = \prod_{i=1}^{p_2} K_{l,i}(x_{ai}, x_{bi}),$$
where $K_{l,i}(\cdot, \cdot)$ is a kernel for the $i$th coordinate of the input, for $l = 1, \ldots, d$ and $i = 1, \ldots, p_2$.
We focus on the Matérn covariance (Handcock and Stein, 1993) as the kernel function $K_{l,i}(\cdot, \cdot)$ in this work. Each kernel contains a positive roughness parameter $\nu_{l,i}$ and a nonnegative range parameter $\gamma_{l,i}$, for $l = 1, \ldots, d$ and $i = 1, \ldots, p_2$. The roughness parameter of the Matérn kernel controls the smoothness of the process. When $\nu_{l,i} = \frac{1}{2}$, the Matérn kernel becomes the exponential kernel, $K_{l,i}(|x_{ai} - x_{bi}|) = \exp(-|x_{ai} - x_{bi}|/\gamma_{l,i})$, and when $\nu_{l,i} \to \infty$, the Matérn kernel becomes the Gaussian kernel, $K_{l,i}(|x_{ai} - x_{bi}|) = \exp(-|x_{ai} - x_{bi}|^2/\gamma_{l,i}^2)$. The half-integer Matérn kernel (i.e., $(2\nu_{l,i} + 1)/2 \in \mathbb{N}$) has a closed-form expression. When $\nu_{l,i} = 5/2$, for example, the Matérn kernel is
$$K_{l,i}(|x_{ai} - x_{bi}|) = \left(1 + \frac{\sqrt{5}\,|x_{ai} - x_{bi}|}{\gamma_{l,i}} + \frac{5\,|x_{ai} - x_{bi}|^2}{3\gamma_{l,i}^2}\right) \exp\!\left(-\frac{\sqrt{5}\,|x_{ai} - x_{bi}|}{\gamma_{l,i}}\right),$$
for $l = 1, \ldots, d$ and $i = 1, \ldots, p_2$. In constructing GOLF processes, we decompose the density of the GP model with multi-dimensional input into a product of densities at the orthogonal components with lower-dimensional inputs. This is because the likelihood and the predictive distribution of a GP model with a half-integer Matérn covariance can be computed through a number of operations linear in the sample size by the continuous-time Kalman filter (Särkkä and Hartikainen, 2012) when $p_2 = 1$. The computational advantage will be discussed in Section 3.2.
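The two closed-form half-integer kernels above translate directly into code; a minimal sketch:

```python
import numpy as np

def matern_1_2(d, gamma):
    # nu = 1/2: the exponential kernel exp(-d / gamma)
    return np.exp(-d / gamma)

def matern_5_2(d, gamma):
    # nu = 5/2: (1 + sqrt(5) d/gamma + 5 d^2 / (3 gamma^2)) exp(-sqrt(5) d/gamma)
    r = np.sqrt(5.0) * d / gamma
    return (1.0 + r + r**2 / 3.0) * np.exp(-r)
```

Both kernels equal 1 at distance zero and decay monotonically; the $\nu = 5/2$ kernel yields twice-differentiable sample paths, whereas the exponential kernel yields rougher, non-differentiable paths.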
For the factor loading matrix, we let $A_s$ be the first $d$ eigenvectors of $R_s$. The kernel functions for $R_s$ can be chosen similarly to the kernels for the latent factor processes. Without loss of generality, we assume $R_s$ is parameterized by a product kernel with range parameters $\gamma_0$, with the Matérn kernel used for each coordinate of $s$.
3 Posterior sampling for GOLF processes

3.1 A Markov chain Monte Carlo approach
In many applications, the observations contain missing values. Denote by $Y_v^o$ and $Y_v^u$ the vectors of observed data and missing data in the matrix $Y$, with sizes $N_o$ and $N_u$, respectively. Directly computing the likelihood involves calculating the inverse and determinant of an $N_o \times N_o$ covariance matrix, which takes $O(N_o^3)$ operations in general, making it infeasible for a large number of observations. Here we discuss a computationally feasible way to implement the GOLF model when the observations form an incomplete matrix.
We start with a set of initial values at the locations with missing observations. Denote by $Y^{(t)}$ the matrix whose entries at the observed locations equal $Y_v^o$ and whose entries at the missing locations are the samples $Y_v^{u,(t)}$ in the $t$th iteration, for $t = 1, \ldots, T$. First, we use a Metropolis algorithm to sample $\Theta^{(t+1)}$ from the marginal posterior distribution $p(\Theta \mid Y^{(t)})$, where the marginal density is given in Equation (4). In the second step, we sample $Z_l^{(t+1)}$ from the posterior distribution in (5) for $l = 1, \ldots, d$, and then we generate $Y_v^{u,(t+1)}$ from model (1), where $A_s^{(t+1)}$ is an $n_1 \times d$ matrix of the $d$ eigenvectors corresponding to the $d$ largest eigenvalues from the eigendecomposition of the correlation matrix $R_s$ in the $(t+1)$th iteration. We obtain $Y^{(t+1)}$ by updating the entries at the missing locations, whereas the observed vector $Y_v^o$ is never changed.

For computational reasons, we define the nugget parameter in each kernel (i.e., the inverse of the signal-variance-to-noise-variance ratio) $\eta_l = \sigma_0^2/\sigma_l^2$ for $l = 1, 2, \ldots, d$, and the inverse range parameters $\beta_{l,i} = 1/\gamma_{l,i}$, where $i = 1, \ldots, p_1$ when $l = 0$ and $i = 1, \ldots, p_2$ when $l \ge 1$. The transformed parameters $\tilde{\Theta}$ contain the mean parameters $B$, the inverse range parameters $\beta = (\beta_0, \ldots, \beta_d)$, the nugget parameters $\eta = (\eta_1, \ldots, \eta_d)$ of the factor processes, and the variance of the noise $\sigma_0^2$. For the mean and noise variance parameters, we use an objective prior $\pi_R(B, \sigma_0^2) \propto 1/\sigma_0^2$. We assume the jointly robust (JR) prior for the kernel parameters, $\pi_{JR}(\beta_l, \eta_l) \propto \big(\sum_{i=1}^{p_2} c_{l,2}\beta_{l,i} + \eta_l\big)^{c_{l,1}} \exp\big\{-c_{l,3}\big(\sum_{i=1}^{p_2} c_{l,2}\beta_{l,i} + \eta_l\big)\big\}$, with default parameters $c_{l,1} = 1/2 - p_2$, $c_{l,2} = 1/2$, and $c_{l,3}$ being the average distance between the $l$th coordinate of two inputs, for $l = 1, \ldots, d$ (Gu, 2018). Note that $c_{l,1} = 1/2 - p_2$ is the default parameter for the MCMC algorithm, whereas this prior parameter is different if one maximizes the marginal posterior distribution. The jointly robust prior is equivalent to the inverse gamma prior when the input dimension is one and there is no nugget parameter. The inverse gamma prior is assumed for each coordinate of $\beta_0$, with shape and rate parameters being $-1/2$ and $1$, respectively. The JR prior can alleviate potential numerical problems when the estimated range and nugget parameters are close to the boundary of the parameter space, as the density of the JR prior is close to zero in these scenarios. As the sample size is large, the bias introduced by the prior is small.

Algorithm 1: MCMC algorithm when the kernel parameters are different across factor processes. (The algorithm iterates over sampling the kernel, nugget, noise variance, and mean parameters, the factor processes, and the missing values; the mean matrix $M^{(t+1)}$ and the projected observations $\tilde{y}_l^{(t+1)}$ are updated before returning to step (1) when $t < T$.)
The MCMC algorithm of the GOLF model is given in Algorithm 1. In steps (1) to (4) of Algorithm 1, we marginalize out the factor processes to compute the posterior distribution of the parameters. This is critically important, as we found severe identifiability problems between the mean matrix $M$ and $AZ$ if the parameters are sampled from the full conditional distributions. Moreover, after marginalizing out the factor processes, the covariance matrix of the distribution $\mathcal{PN}(\tilde{y}_l; 0, \tilde{\Sigma}_l)$ in (4) contains a nugget term, which makes the computation stable.
Algorithm 1 can easily be modified for different scenarios. When the factor processes share the same covariance matrix, we can combine step (1) and step (2) to sample the shared kernel and nugget parameters. Step (4) may be skipped if the mean is zero, or modified if the model has shared regression coefficients.
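A heavily simplified sketch of the Metropolis step in this scheme is shown below, updating only the inverse range parameter of a single factor process on the log scale, with the nugget held fixed and a flat prior in place of the JR prior. All names and values are illustrative, not the paper's Algorithm 1:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
D = np.abs(x[:, None] - x[None, :])
# synthetic projected output from an exponential-kernel GP plus nugget
y = rng.multivariate_normal(np.zeros(50), np.exp(-4.0 * D) + 0.1 * np.eye(50))

def log_post(log_beta):
    beta = np.exp(log_beta)
    cov = np.exp(-beta * D) + 0.1 * np.eye(50)   # K + eta I, eta fixed at 0.1
    # flat prior on log beta for illustration; the paper uses the JR prior
    return multivariate_normal(np.zeros(50), cov).logpdf(y)

samples = []
cur = np.log(4.0)                        # initial value of log beta
cur_lp = log_post(cur)
for _ in range(500):
    prop = cur + 0.3 * rng.standard_normal()     # random-walk proposal
    prop_lp = log_post(prop)
    if np.log(rng.uniform()) < prop_lp - cur_lp: # Metropolis accept/reject
        cur, cur_lp = prop, prop_lp
    samples.append(cur)
```

In the full algorithm, this marginal-likelihood evaluation is replaced by the $O(n_2)$ Kalman filter of Section 3.2 rather than a dense covariance factorization.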
Denote $\tilde{\Sigma}_l = L_l L_l^T$, where $L_l$ is the lower triangular matrix in the Cholesky decomposition of $\tilde{\Sigma}_l$. We need to efficiently compute terms such as $|\tilde{\Sigma}_l|$ and $L_l^{-1}\tilde{y}_l$, which typically require $O(n_2^3)$ computational operations for each $l = 1, \ldots, d$. Luckily, for a Matérn covariance with a half-integer roughness parameter and one-dimensional input, computing any of these terms takes only $O(n_2)$ operations without approximation.

3.2 Continuous-time Kalman filter
We briefly review the continuous-time Kalman filter algorithm and the connection between the Gaussian Markov random field and a GP with Matérn covariance. The spectral density of the Matérn covariance with a half-integer roughness parameter was shown to be the same as that of a continuous-time autoregressive process defined by a stochastic differential equation (SDE) (Whittle, 1963). Suppose the observations are $\tilde{y}_l = (\tilde{y}_{l,1}, \ldots, \tilde{y}_{l,n_2})^T$. For $j = 1, \ldots, n_2$ and $l = 1, \ldots, d$, starting from the initial state $\theta_l(x_0) \sim \mathcal{MN}(0, W_l(x_0))$, the solution of the SDE follows (Hartikainen and Sarkka, 2010)
$$\theta_l(x_j) = G_l(x_j)\,\theta_l(x_{j-1}) + w_l(x_j), \qquad \tilde{y}_{l,j} = F\,\theta_l(x_j) + \epsilon_{l,j},$$
where $w_l(x_j) \sim \mathcal{MN}(0, W_l(x_j))$ and $\epsilon_{l,j}$ is an independent white noise, for $l = 1, \ldots, d$ and $j = 1, \ldots, n_2$. For the Matérn kernel with a half-integer roughness parameter, the terms $G_l(x_j)$, $W_l(x_j)$, and $F$ can be expressed explicitly as functions of $|x_j - x_{j-1}|$ and the range parameter of the kernel. Thus, the forward filtering and backward smoothing (FFBS) algorithm can be applied to compute the likelihood and to make predictions with a number of operations linear in the number of observations (see, e.g., Chapter 4 in West and Harrison (1997) and Chapter 2 in Petris et al. (2009) for the FFBS algorithm). The likelihood function and predictive distribution of a GP model having the Matérn kernel with roughness parameters 1/2 and 5/2, computed through the FFBS algorithm, are implemented in the FastGaSP package available on CRAN. The computational complexity of the FFBS algorithm is only $O(n_2)$, with $n_2$ being the number of observations.

We briefly discuss how to apply the FFBS algorithm to compute the terms $L_l^{-1}\tilde{y}_l$ and $|\tilde{\Sigma}_l|$ needed in Algorithm 1, for $l = 1, \ldots, d$. In the FFBS algorithm, the one-step-ahead predictive distribution $(\tilde{y}_{l,j} \mid \tilde{y}_{l,1:j-1}) \sim \mathcal{N}(f_l(x_j), Q_l(x_j))$ can be derived iteratively for $j = 1, \ldots, n_2$ and for each $l = 1, \ldots, d$. Closed-form expressions of $f_l(x_j)$ and $Q_l(x_j)$ for the Matérn covariance in (9) are given in Gu and Xu (2020).
For $l = 1, \ldots, d$, the computationally expensive terms in the likelihood function can be expressed through the one-step-ahead predictive moments as
$$|\tilde{\Sigma}_l| = \prod_{j=1}^{n_2} Q_l(x_j), \qquad L_l^{-1}\tilde{y}_l = \left(\frac{\tilde{y}_{l,1} - f_l(x_1)}{Q_l^{1/2}(x_1)}, \ldots, \frac{\tilde{y}_{l,n_2} - f_l(x_{n_2})}{Q_l^{1/2}(x_{n_2})}\right)^T.$$
We use the backward sampling algorithm (Petris et al., 2009) to sample $\theta_{l,n_2}$ from $p(\theta_{l,n_2} \mid \tilde{y}_l^{(t)}, \beta^{(t+1)}, \eta^{(t+1)})$ and $\theta_{l,j}$ from $p(\theta_{l,j} \mid \tilde{y}_l^{(t)}, \theta_{l,j+1}, \beta^{(t+1)}, \eta^{(t+1)})$ sequentially, for $j = n_2 - 1, \ldots, 1$. Posterior samples $Z_l^T = (z_l(x_1), \ldots, z_l(x_{n_2}))^T$ can be obtained as the first entries of the posterior samples $\theta_{l,j}$ from the backward sampling algorithm, for $j = 1, \ldots, n_2$. Furthermore, for any $n_2 \times 1$ real vector $v_l$, $L_l v_l$ can be computed with $O(n_2)$ operations, for $l = 1, \ldots, d$.
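The identities above can be verified numerically in the simplest case, $\nu = 1/2$, where the state is scalar and the GP is an Ornstein-Uhlenbeck process. The sketch below (illustrative parameter values) compares the $O(n)$ filter recursions with the direct $O(n^3)$ computation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, gamma, s2_0 = 200, 2.0, 0.3, 0.1
x = np.sort(rng.uniform(0, 1, n))
y = rng.standard_normal(n)            # arbitrary data vector to evaluate

# --- direct O(n^3) computation with the full covariance matrix ---
D = np.abs(x[:, None] - x[None, :])
Sig = sigma2 * np.exp(-D / gamma) + s2_0 * np.eye(n)
sign, logdet_direct = np.linalg.slogdet(Sig)
quad_direct = y @ np.linalg.solve(Sig, y)

# --- O(n) Kalman filter for the Matern-1/2 (OU) state-space form ---
# one-step-ahead moments f_j, Q_j give log|Sigma| = sum log Q_j and
# y^T Sigma^{-1} y = sum (y_j - f_j)^2 / Q_j
m, P = 0.0, sigma2                    # stationary prior state mean/variance
logdet_kf, quad_kf = 0.0, 0.0
for j in range(n):
    if j > 0:                         # propagate the state between inputs
        phi = np.exp(-(x[j] - x[j-1]) / gamma)
        m, P = phi * m, phi**2 * P + sigma2 * (1 - phi**2)
    f, Q = m, P + s2_0                # predictive mean and variance of y_j
    logdet_kf += np.log(Q)
    quad_kf += (y[j] - f)**2 / Q
    K = P / Q                         # Kalman gain; filtering update
    m, P = m + K * (y[j] - f), (1 - K) * P
```

For $\nu = 5/2$ the state $\theta$ is three-dimensional and the scalars $\phi$ and $P$ become the matrices $G_l(x_j)$ and $W_l(x_j)$, but the recursion, and its linear cost, are the same.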

3.3 Computational complexity
Denote by $p = p_1 + p_2$ the total dimension of the inputs $(s, x)$, and suppose the observational matrix is $n_1 \times n_2$ with irregular missing values, where $n_1 \le n_2$ and $N = n_1 n_2$. We discuss the computational complexity in three scenarios: $p = 2$ (e.g., spatially correlated data), $p = 3$ (e.g., spatio-temporal data) and $p > 3$ (e.g., functional data).
When $p = 2$, the computational complexity of the GOLF model with a half-integer Matérn kernel is $O(Nd)$. First, we compute the first $d$ eigenvectors of $\Sigma_s$ to obtain $A_s$, which takes $O(n_1^2 d)$ operations (see, e.g., Chapter 4.5.5 in Bai et al. (2000)). Second, computing the marginal likelihood and sampling the factor processes by the FFBS algorithm costs only $O(n_2 d)$ operations. The largest computational order comes from the matrix multiplication $\tilde{Y}^T = (Y - M)^T A_s$, which is of order $O(Nd)$.
For $p = 3$, we let $A_s = A_{s_1} \otimes A_{s_2}$, where $A_{s_1}$ and $A_{s_2}$ are the first $d_1$ and $d_2$ eigenvectors of the $n_{1,1} \times n_{1,1}$ matrix $\Sigma_{s_1}$ and the $n_{1,2} \times n_{1,2}$ matrix $\Sigma_{s_2}$, respectively, with $n_{1,1} \times n_{1,2} = n_1$ and $\Sigma_{s_1} \otimes \Sigma_{s_2} = \Sigma_s$. Without loss of generality, assume $d_1 \le d_2$ and $n_1 \le n_2$. Let the total number of factor processes be $d = d_1 d_2$. The computational order of the GOLF model with a half-integer Matérn covariance function is $O(n_1 n_2 d_{\max})$, where $d_{\max}$ is the maximum of $d_1$ and $d_2$ (noting this is smaller than $O(n_1 n_2 d)$). To see this, computing the eigendecompositions of $\Sigma_{s_1}$ and $\Sigma_{s_2}$ requires $O(d_1 n_{1,1}^2)$ and $O(d_2 n_{1,2}^2)$ operations, respectively. Second, using the FFBS algorithm to compute the marginal likelihood and to sample the factor processes costs $O(d n_2)$ operations. Lastly, we do not directly compute $\tilde{Y}^T = (Y - M)^T A_s$, as that takes $O(Nd)$ operations. Instead, we first write the observations as an $n_2 \times n_{1,2} \times n_{1,1}$ array $Y_{ar}^T$, where the $(i, j, k)$th entry is the outcome at $(s_{1,k}, s_{2,j}, x_i)$. Then we perform a 3-mode matrix product followed by a 2-mode matrix product, $Y_{ar}^T \times_3 A_{s_1} \times_2 A_{s_2}$ (Kolda and Bader, 2009), which takes $O(n_2 n_1 d_1)$ and $O(n_2 n_{1,2} d)$ operations, respectively. Finally, we concatenate the second and third dimensions of $\tilde{Y}_{ar}^T$ to obtain the $n_2 \times d$ matrix $\tilde{Y}^T$.

For the case $p > 3$, there are two scenarios. In the first scenario, the data are observed in an $n_{1,1} \times n_{1,2} \times \cdots \times n_{1,k} \times n_2$ tensor with irregular missing values, where $n_{1,1} \times n_{1,2} \times \cdots \times n_{1,k} = n_1$. In this scenario, the computational order is $O(N d_{\max})$, where $d_{\max}$ is the maximum of $d_1, \ldots, d_k$, by a deduction similar to the case $p = 3$. In the second scenario, we have $p_2 > 1$. Examples include emulating a computationally expensive computer model with multivariate output (Conti and O'Hagan, 2010; Paulo, 2005).
In this case, the Kalman filter algorithm may not apply, so the additional computational order is $O(n_2^3)$ when the covariance of the factor processes is shared. If the covariances are not the same, we additionally need to compute the inverses of the covariance matrices of $d$ multivariate normal distributions, which is of order $O(d n_2^3)$. In sum, the computational complexity of GOLF in all scenarios considered herein is much smaller than the $O(N_o^3)$ cost of directly inverting the covariance matrix. Besides, a few steps in the MCMC algorithm can be computed in parallel, such as the FFBS algorithm for computing the product of the $d$ marginal densities of the projected output and the matrix multiplication $\tilde{Y}^T = (Y - M)^T A_s$, to further reduce the computation time.
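The mode-product trick for avoiding the full Kronecker projection can be illustrated on a single column of $Y$. The sketch below (illustrative sizes, random matrices standing in for $A_{s_1}$ and $A_{s_2}$) uses the identity $(A_1 \otimes A_2)^T \mathrm{vec}(X) = \mathrm{vec}(A_2^T X A_1)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n11, n12, d1, d2 = 15, 20, 4, 6
A1 = rng.standard_normal((n11, d1))      # stands in for A_{s_1}
A2 = rng.standard_normal((n12, d2))      # stands in for A_{s_2}
y = rng.standard_normal(n11 * n12)       # one column of Y, length n1 = n11 n12

# direct projection: forming the Kronecker product is O(n1 d) per column
direct = np.kron(A1, A2).T @ y

# mode-product trick: reshape the column and apply the two small factors
# separately, via (A1 kron A2)^T vec(X) = vec(A2^T X A1)
X = y.reshape(n12, n11, order="F")
fast = (A2.T @ X @ A1).flatten(order="F")
```

The two small multiplications cost $O(n_1 d_2 + n_{1,1} d_1 d_2)$ per column instead of $O(n_1 d_1 d_2)$, which is the source of the $O(n_1 n_2 d_{\max})$ total above.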

Comparison and connection with other related models
GOLF processes are closely connected to a wide range of approaches for approximating GPs when modeling large correlated data. Model (1) is a linear model of coregionalization (LMC) (Gelfand et al., 2004), where the factor loading matrix is parameterized by input variables. Another widely used model for multivariate functional data is the semiparametric latent factor model (SLFM) (Teh et al., 2005), where the factor loading matrix can be estimated by principal component analysis (PCA) (Higdon et al., 2008). However, the linear subspace estimated by PCA is equivalent to the maximum marginal likelihood estimator (MMLE) with independent factors (Tipping and Bishop, 1999), whereas here the latent factors at different input values are assumed to be correlated. The MMLE of the factor loadings with correlated factors was derived in Gu and Shen (2020), called the generalized probabilistic principal component analysis (GPPCA). Our approach has two distinctions. First, it applies to observations with irregular missing values, whereas the observations are required to form complete matrices in GPPCA. Second, both inputs s and x are used for estimation, whereas only the input of the latent processes is used in GPPCA, so predictions can be more accurate.
To overcome the computational bottleneck of GPs, we project observations onto orthogonal coordinates in a GOLF model, as the likelihood of a GP with a Matérn covariance and a one-dimensional input can be computed quickly by the continuous-time Kalman filter. The computational complexity can be further reduced by retaining only the factor processes with large eigenvalues. Such reduced-rank approaches are widely used in modeling correlated data: for instance, the predictive process based on a set of pre-specified knots was studied in Banerjee et al. (2008), and multiresolution local bisquare functions were used in Cressie and Johannesson (2008). Limitations of reduced-rank methods are studied in Stein (2014). Note that even with a full-rank covariance, the computational order of GOLF is much less than O(N_o^3). Our primary goal is not to propose a reduced-rank model, but to reduce the computational complexity of a GP model with a full-rank, flexible covariance function through orthogonal projections.
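As an illustration of why one-dimensional Matérn likelihoods are cheap, the sketch below computes the exact log-likelihood of a GP with an exponential kernel (Matérn with ν = 1/2) plus noise in O(n), using its Ornstein-Uhlenbeck state-space form, and checks it against the O(n^3) direct computation; the parameter values are illustrative:

```python
import numpy as np

def ou_loglik(y, x, sigma2, gamma, tau2):
    """Exact log-likelihood of y ~ N(0, sigma2 * exp(-|x_i - x_j|/gamma) + tau2 * I),
    computed in O(n) with a Kalman filter on the Ornstein-Uhlenbeck (Matern nu = 1/2)
    state-space representation of the exponential kernel."""
    n = len(y)
    ll = 0.0
    m, P = 0.0, sigma2                 # stationary prior for the state at x[0]
    for i in range(n):
        if i > 0:                      # predict step: OU transition between inputs
            rho = np.exp(-(x[i] - x[i - 1]) / gamma)
            m = rho * m
            P = rho ** 2 * P + sigma2 * (1.0 - rho ** 2)
        S = P + tau2                   # innovation variance
        v = y[i] - m                   # innovation
        ll += -0.5 * (np.log(2.0 * np.pi * S) + v ** 2 / S)
        K = P / S                      # Kalman gain; update step
        m += K * v
        P -= K * P
    return ll

# Check against the O(n^3) direct computation on a small example
rng = np.random.default_rng(1)
n = 50
x = np.sort(rng.uniform(0, 1, n))
sigma2, gamma, tau2 = 1.0, 0.3, 0.1
cov = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / gamma) + tau2 * np.eye(n)
y = rng.multivariate_normal(np.zeros(n), cov)
_, logdet = np.linalg.slogdet(cov)
direct = -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
assert np.isclose(ou_loglik(y, x, sigma2, gamma, tau2), direct)
```

Higher half-integer smoothness levels use a higher-dimensional state vector, but the same O(n) filtering structure applies.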
Many other approximation methods for GPs follow the framework of Vecchia's approximation (Katzfuss and Guinness, 2017; Vecchia, 1988). Vecchia's approximation is a broad framework that assumes sparsity of the inverse Cholesky factor of the covariance matrix of the latent processes, where the key is selecting the order of the latent variables and imposing sensible conditional independence assumptions between variables. GOLF processes with a Matérn kernel are closely related to Vecchia's approximation, in the sense that the model can be written as a vector autoregressive model with an orthogonal factor loading matrix. Our way of computing the likelihood and predictions based on the FFBS algorithm is exact, however, rather than an approximation to the likelihood function. We compare our approach with a few other methods that fall into the framework of Vecchia's approximation in Section 6.1.

Simulated studies
We discuss two simulated examples in this section. In the first, we use a small sample size to compare the predictive performance and parameter inference of GOLF processes against the exact GP model, in which the inverse and determinant of the covariance matrix in the likelihood function are computed directly. In the second, we generate observations from separable and nonseparable models to study the predictive performance of GOLF processes with different numbers of factors and with the same or different kernel parameters. For both examples, we implement J = 100 experiments in each scenario, and we generate T = 5,000 MCMC samples for each method, with the first 20% of the samples used as burn-in.
Denote by y*_{ij} the ith held-out datum in the jth simulated experiment in each scenario, for i = 1, ..., n* and j = 1, ..., J. Let ŷ*_{ij} and CI_{ij}(95%) be the predictive mean and the 95% predictive credible interval of the ith held-out datum in the jth experiment, respectively. For both simulated examples, we record the root mean squared error, the percentage of held-out observations covered by the 95% predictive interval, and the average length of the 95% predictive interval of the jth experiment:

RMSE_j = ( (1/n*) Σ_{i=1}^{n*} (ŷ*_{ij} − y*_{ij})^2 )^{1/2},
P_{CI,j}(95%) = (1/n*) Σ_{i=1}^{n*} 1{y*_{ij} ∈ CI_{ij}(95%)},
L_{CI,j}(95%) = (1/n*) Σ_{i=1}^{n*} length(CI_{ij}(95%)),

for j = 1, ..., J. We compute the averages of these three quantities over the J = 100 simulations to evaluate each approach. A precise method should have a small average RMSE, P_{CI}(95%) close to the 95% nominal level, and short predictive intervals. Here we only consider pointwise intervals of the response at each coordinate, as the outputs are univariate on a spatial or spatio-temporal domain; simultaneous credible intervals can be used for applications with multivariate responses (Sørbye and Rue, 2011).
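A minimal sketch of these three evaluation criteria for one experiment (function and variable names are ours, not the paper's):

```python
import numpy as np

def prediction_metrics(y_star, y_hat, lower, upper):
    """RMSE, empirical coverage, and average length of the 95% predictive
    intervals for one simulated experiment."""
    rmse = np.sqrt(np.mean((y_hat - y_star) ** 2))
    coverage = np.mean((y_star >= lower) & (y_star <= upper))  # P_CI(95%)
    avg_length = np.mean(upper - lower)                        # L_CI(95%)
    return rmse, coverage, avg_length

# Toy usage with a Gaussian predictive distribution for each held-out point
rng = np.random.default_rng(2)
y_star = rng.standard_normal(1000)          # synthetic held-out data
y_hat = np.zeros(1000)                      # predictive means
sd = np.ones(1000)                          # predictive standard deviations
lower, upper = y_hat - 1.96 * sd, y_hat + 1.96 * sd
rmse, cov, length = prediction_metrics(y_star, y_hat, lower, upper)
```

Averaging these quantities over the J experiments gives the summaries reported in the tables.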
Example 3 (GOLF processes and the exact GP model). Data are sampled from a zero-mean separable GP model with two-dimensional inputs on a 25 × 25 regular lattice in [0, 1]^2. Two missing patterns are considered: in the first, the data are missing at random; in the second, a disk at the centroid of the lattice is missing.
We use a small sample size in Example 3 because of the computational burden of the exact Gaussian process model. We use the unit-variance covariance matrix parameterized by the exponential kernel and the Matérn kernel in (9) to generate the data. The range parameters of the Matérn kernel are chosen as γ_0 = 1 and γ_1 = ... = γ_d = 1/3, and those of the exponential kernel as γ_0 = 4 and γ_1 = ... = γ_d = 1. All the range parameters, the variance of the kernel, and the noise variance are estimated by each method based on the MCMC algorithm.
We compare GOLF processes with the exact GP model, in which the inverse and determinant of the covariance matrix are computed directly. Both models use the same prior and proposal distributions in the MCMC algorithm to sample the kernel parameters. Table 2 gives the predictive performance of both methods in three scenarios: 50% and 20% of the output missing at random in the first two, and approximately 20% of the output missing in a disk at the centroid of the lattice in the third. Graphs of the observed data, the full data, the predictions, and trace plots of the posterior samples in one simulation are given in the supplementary materials.
As shown in Table 2, both methods give accurate predictions and uncertainty assessments in all scenarios. The out-of-sample RMSE for predicting the held-out observations is close to 0.1, the standard deviation of the noise. The 95% predictive intervals cover around 95% of the held-out observations, and the average length of the predictive interval is small. Predictions of both methods are more precise when the data are missing at random than when a disk of output is missing at the centroid of the lattice. Note that in Example 3, GOLF processes and the exact GP model are the same model under two different computational strategies: for GOLF processes, we sample the missing values so that the fast computational strategy applies, whereas the inverse and determinant of the covariance matrix are computed directly in the exact GP model. The two strategies therefore have very different computational costs. The computational cost of GOLF processes is O(Nd) with N = n_1 × n_2 (and d = n_1 in Example 3), whereas that of the exact GP model is O(N_o^3), where N_o is the number of observations; thus GOLF processes are computationally feasible for large data sets. On the other hand, the difference in predictions and uncertainty assessment between the exact GP model and GOLF is small (last three columns of Table 2), since we make no approximation in computing GOLF processes. Figure 2 shows the histograms of the 4,000 after-burn-in posterior samples from the GOLF processes and the exact GP model in one simulation of Example 3.

Table 2 (notation): ŷ*_{ij,GOLF} and ŷ*_{ij,GP} denote the predictive means of the GOLF processes and the exact GP model, respectively; ∆L and ∆U measure the average absolute differences between the lower bounds and between the upper bounds of the 95% predictive intervals of the two methods.
The posterior samples of the two methods are close to each other, and the difference becomes even smaller when we increase the number of MCMC samples.

Figure 3 (caption): Blue curves and red curves denote the GOLF processes with different kernel parameters and with the same kernel parameter, respectively. In the left panels, the solid curves denote the RMSE for predicting the (noisy) observations, and the dashed curves denote the RMSE for predicting the mean of the observations. Proportions of observations covered by the 95% predictive interval and the average lengths of the predictive interval are graphed in the middle and right panels, respectively.
Example 4 (GOLF processes with different number of factors and kernel parameters). The data are sampled from two scenarios with two-dimensional inputs being a 100 × 100 lattice in [0, 1] 2 . In the first scenario, the range parameters of the kernel of each factor process are the same, whereas these parameters are chosen to be different in the second scenario.
In both scenarios, a disk of output at the centroid of the lattice is masked out for testing, corresponding to approximately 20% of the total number of data. We use d = 30 (low-rank) and d = 100 (full-rank) factors to generate the data, and we test GOLF processes with different numbers of factors and with the same or different range parameters.
In Example 4, the factor processes are assumed to have the Matérn kernel in (9) and unit variance. The kernel parameters are shared in the first scenario, where γ_0 = 1/4 and γ_l = 1/2 for all l, whereas in the second scenario γ_0 = 1/3 and γ_l = 1/l, for l = 1, ..., d. We estimate these parameters from the posterior samples of the MCMC algorithm.
The predictive performance of the different approaches for data simulated with d = 30 latent processes is graphed in Figure 3. In the first row of panels, since the data are simulated by GOLF processes with different kernel parameters, nonseparable GOLF processes have smaller predictive RMSE and a shorter interval that covers almost 95% of the data. In the second row of panels, GOLF processes with the same kernel parameter seem slightly better, as the true factor processes share the same kernel parameter. The difference between the two methods in the second row is smaller, as the GOLF model with a separable kernel is a special case of the one with different kernel parameters.
From Figure 3, we find that with d = 20 or more factor processes, the predictive results are similar, as the data are simulated using d = 30 factor processes. The way of selecting the number of factors is currently ad hoc; one may select the number of factors so that a large proportion of the variance is explained by the sum of the leading eigenvalues of the correlation matrix R_s. This simulation suggests that using more factors may yield better predictions than using very few factors.
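One ad hoc rule of this kind can be sketched as follows, picking the smallest d whose leading eigenvalues explain a given proportion of the total variance; the threshold and the toy correlation matrix are illustrative:

```python
import numpy as np

def select_num_factors(R_s, threshold=0.95):
    """Smallest d such that the d largest eigenvalues of the correlation
    matrix R_s explain at least `threshold` of the total variance."""
    eigvals = np.linalg.eigvalsh(R_s)[::-1]          # sort in decreasing order
    prop = np.cumsum(eigvals) / np.sum(eigvals)      # cumulative proportion explained
    return int(np.searchsorted(prop, threshold) + 1)

# Toy usage: an exponential correlation matrix on a one-dimensional grid
x = np.linspace(0, 1, 50)
R_s = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)
d = select_num_factors(R_s, threshold=0.95)
```

A smoother correlation (larger range parameter) concentrates variance in fewer eigenvalues and yields a smaller d under the same threshold.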
In Figure 4, we graph the simulated observations, the simulated mean, and the prediction from the GOLF model with d = 30 in one simulation. Predictions look reasonably accurate. Results when the data are generated by a full-rank kernel (d = 100) are provided in Figure S3 in the supplementary materials and are very similar to those in Figure 3.

6 Real applications

6.1 Predicting large spatial data on an incomplete lattice

We compare GOLF processes with different approaches for predicting the missing temperature values in Heaton et al. (2019). This data set contains daytime land surface temperatures on August 4, 2016, at 300 × 500 spatial grids, with latitude ranging from 34.30 to 37.07 and longitude from -95.91 to -91.28. The complete data set consists of 148,309 observations, with 1,791 values missing due to cloud cover. The training data are plotted in the left panel of Figure 1 and in the upper panel of Figure 5.

Table 3: Comparison for the dataset in Heaton et al. (2019). The standard deviation of the observations is 4.07. For each method, we compute RMSE, P_CI(95%), and L_CI(95%) as defined in (11)-(13). A satisfying method should have small RMSE, small L_CI(95%), and P_CI(95%) close to the 95% nominal level. We compare fixed rank kriging (FRK) (Cressie and Johannesson, 2008), the Gapfill method (Gerber et al., 2018), GOLF processes, local approximate Gaussian processes (LAGP) (Gramacy and Apley, 2015), lattice kriging (LatticeKrig) (Nychka et al., 2015), the multiresolution approximation (MRA) (Katzfuss, 2017), nearest-neighbor Gaussian processes (NNGP) (Datta et al., 2016), spatial partitioning (Partition) (Heaton et al., 2017), and stochastic partial differential equations (SPDE) (Lindgren et al., 2011).

We define GOLF processes on this dataset with s being latitude and x being longitude. Since areas with higher latitude typically have lower temperatures on average, we assume a mean parameter for each latitude value, i.e., M = (H_2 B_2)^T, where H_2 = 1_{n_2} and B_2 = (b_{2,1}, ..., b_{2,n_1})^T. We let d = n_1/2 and use exponential kernels with distinct variances and range parameters sampled from the marginal posterior distribution. We compute M = 6,000 posterior samples, with the first 20% used as burn-in. Results with longer MCMC chains and different initial values of the parameters are given in the supplementary materials.
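The per-latitude mean structure can be illustrated with a small synthetic sketch. Under a simplified noise-only model with a flat prior (ignoring the factor-process covariance that the paper's marginalization accounts for), the estimate of B_2 reduces to the per-latitude sample mean; all numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 300, 500                       # latitude x longitude grid, as in the temperature data
true_means = np.linspace(50, 30, n1)    # hypothetical decrease of temperature with latitude
Y = true_means[:, None] + rng.standard_normal((n1, n2))   # synthetic n1 x n2 data

# M = (H2 B2)^T with H2 = 1_{n2}: one mean parameter per latitude row.
# With a flat prior and independent noise, the least-squares estimate of B2
# reduces to the per-latitude sample mean.
H2 = np.ones((n2, 1))
B2_hat = np.linalg.lstsq(H2, Y.T, rcond=None)[0].ravel()  # n1 per-latitude estimates
assert np.allclose(B2_hat, Y.mean(axis=1))
```

The paper's actual sampler instead draws B_2 from its marginal posterior, which weights the residuals by the factor-process covariance rather than treating them as independent noise.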
In Heaton et al. (2019), 12 groups of researchers across the globe implemented their methods to predict the missing temperature values in a competition; this cohort includes authors of some of the most popular methods for large spatially correlated data. Besides GOLF processes, we implement 8 of the 12 approaches based on the code provided in Heaton et al. (2019); we could not implement the other 4 due to memory limitations of the computing facility or unavailability of the code. All computations are run on a 3.60GHz 8-core Intel i9 processor with 32 GB of RAM under macOS Mojave.
The predictive performance of the different approaches is recorded in Table 3. Most results are consistent with those reported in Heaton et al. (2019), with small differences remaining for methods requiring random starts or stochastic algorithms: for example, five implementations of the SPDE method give RMSEs ranging from 1.55 to 1.88. Besides, the running times of some methods, such as SPDE and LatticeKrig, differ slightly from those reported. We acknowledge that the held-out observations were not released in Heaton et al. (2019), which adds difficulty for model specification. The good performance of the GOLF model may be explained by two reasons. First, different mean parameters are assumed at each latitude, which is flexible enough to capture information from a large number of observations. Second, we assume different range and variance parameters for the factor processes, which is more flexible than separable or isotropic kernel functions.
The 95% predictive interval of the GOLF model is the shortest, and it covers around 92% of the held-out test data, as shown in Table 3. In the supplementary materials, we provide diagnostic plots of the fitted values from the GOLF model and predictive performance under several configurations, including 40,000 MCMC samples and different initial parameters; the predictive performance of the GOLF model is similar across configurations. Besides, the computational time of GOLF is around 0.49 s per MCMC iteration for this example, which is comparable to NNGP (0.53 s) and faster than MRA (3.29 s). The posterior sampling obtained here provides uncertainty quantification of the model parameters, whereas most of the methods in Table 3 only provide point estimates of the parameters. Future work is needed to reduce the number of iterations GOLF requires to achieve a similar level of predictive accuracy.
The predictive means of the GOLF processes and SPDE are graphed in the middle and right panels of Figure 5, respectively. Predictions from the GOLF processes are more accurate for temperatures in areas with high latitude, possibly due to the flexible mean parameters estimated from the data. Both methods seem to oversmooth slightly. Yet predicting the missing values of this data set is challenging, as the observations are missing in spatial blocks, and both methods seem precise in prediction.

Table 4: Predictive performance of different approaches for the NOAA monthly gridded temperature dataset. The standard deviation of the outcomes in this dataset is 0.940. Results of FRK, GOLF, and LAGP are given in the first to third rows. For the results in the fourth and fifth rows, spatial models were fitted using the RobustGaSP package with one and two initial values of the range and nugget parameters, respectively, for finding their marginal posterior mode.

Analysis of a large spatio-temporal data set
We consider the monthly gridded temperature anomalies from the U.S. National Oceanic and Atmospheric Administration (NOAA). The data set contains the average air and marine temperature anomalies on 5-degree longitude-latitude grids with respect to the 1981-2010 base period. R code and examples for loading the NOAA gridded data can be found in Shen (2017). We compare the predictive performance using the data from January 1999 to December 2018. For each month, we observe the temperature anomalies at n_1 = 36 × 28 spatial grids, with longitude ranging from 182.5 to 357.5 and latitude ranging from -62.5 to 72.5. There are 11,122 missing data, leaving 230,798 observations in total. We hold out 50% randomly sampled temperature anomalies as missing data and use the remaining 50% as training data (i.e., n = n* = 115,399). Predicting the missing values in this scenario is more difficult than in the example in Gu and Shen (2020), where the data are missing at a set of locations over the same months. We fit GOLF processes with the covariance for each spatial coordinate modeled by the Matérn covariance, and the factor processes are defined on the temporal input with different kernel parameters. Due to computational limitations, we let the number of factors be d = 0.75^2 n_1 = 567 and assume the factor loadings to be the Kronecker product of the first three-quarters of the eigenvectors of the sub-covariance matrices for longitude and latitude. Although we have a large number of factors, the computational complexity is O(N d_max) with d_max = 0.75 × 36 = 27, rather than O(N d_1 d_2), by the mode multiplication of tensors (see Section 3.3 for discussion). We assume the intercept and linear trend coefficients differ at each location, i.e., M = (H_2 B_2)^T, where H_2 = [1_{n_2}, x], with x being the 240 months and B_2 being a 2 × n_1 matrix of coefficients.
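The Kronecker construction of the loading matrix can be sketched as follows; the Matérn-5/2 kernel form and the range parameter below are illustrative choices, not the values estimated in the paper:

```python
import numpy as np

def matern_52(x, gamma):
    """Matern correlation with smoothness 5/2 on a one-dimensional grid
    (an illustrative kernel choice)."""
    d = np.abs(x[:, None] - x[None, :]) / gamma
    return (1 + np.sqrt(5) * d + 5 * d ** 2 / 3) * np.exp(-np.sqrt(5) * d)

def leading_eigvecs(S, k):
    """Eigenvectors of the k largest eigenvalues of a symmetric matrix S."""
    vals, vecs = np.linalg.eigh(S)      # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :k]

# 36 x 28 longitude-latitude grid; keep three quarters of the eigenvectors per coordinate
lon = np.linspace(182.5, 357.5, 36)
lat = np.linspace(-62.5, 72.5, 28)
A_lon = leading_eigvecs(matern_52(lon, gamma=30.0), 27)   # 0.75 * 36 = 27
A_lat = leading_eigvecs(matern_52(lat, gamma=30.0), 21)   # 0.75 * 28 = 21
A_s = np.kron(A_lon, A_lat)             # 1008 x 567 loading matrix, d = 27 * 21 = 567

# Orthonormality of each factor carries over to their Kronecker product
assert np.allclose(A_s.T @ A_s, np.eye(567), atol=1e-8)
```

Since (A ⊗ B)^T (A ⊗ B) = (A^T A) ⊗ (B^T B), orthonormal columns of the two sub-matrices guarantee an orthogonal loading matrix, which is what makes the posterior factor processes independent.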
We use M = 3,000 MCMC samples with the first 20% as burn-in, as the posterior samples converge within a small number of iterations in this example. In Table 4, we compare GOLF processes with a few other spatial and spatio-temporal methods on the NOAA dataset. We fit two spatial models separately for each month using the RobustGaSP package available on CRAN, and we also implement FRK and LAGP based on their packages (Zammit-Mangion et al., 2017; Gramacy, 2016).
As shown in Table 4, GOLF processes have the smallest predictive RMSE and the shortest predictive interval, which covers around 94% of the held-out output. Since the temporal input is not used by the two spatial models, it is not surprising that their RMSE and predictive interval lengths are larger than those of GOLF processes; if the temporal inputs were included, directly inverting the covariance matrix would be computationally infeasible. FRK and LAGP also have larger predictive errors, though both the spatial and temporal inputs are used by these methods.
Predictions from GOLF processes are more accurate for three reasons. First, we can compute the model with a large number of factors efficiently, and no approximation of the likelihood function is required. Second, the mean and trend parameters differ at each location, making the model flexible enough to capture the dynamic trend of temperatures at different locations. Finally, the latent factor processes have different kernel parameters that fit the diverse smoothness levels of the projected observations.
In Figure 6, we graph the full temperature anomalies in January 2018 and the predictions from the GOLF model and the spatial GP model fitted by the RobustGaSP package. 50% of the observations in the left panel are held out for testing. Both models seem accurate; since the temporal coordinate is used in prediction, the predictive error of GOLF processes is smaller.

Concluding remarks
We have introduced GOLF processes as a computationally feasible approach to model large incomplete lattice observations. For GPs with a product covariance function, or LMC with orthogonal latent factor loadings, the likelihood can be decomposed into a product of multivariate normal densities, and prior independence of the factor processes leads to posterior independence of the factor processes. These two properties allow one to reduce the computational burden of GPs on incomplete lattice observations without approximating the likelihood function; further reduction can be achieved by reducing the number of factors. Besides, we have introduced a flexible way to model the mean function, and the closed-form marginal likelihood is derived to alleviate the identifiability issue. Finally, we have developed an MCMC algorithm for Bayesian inference for large incomplete matrices of spatial and spatio-temporal data.
The computational tools developed in this work require observations on a lattice with possible missing values. Approximation methods such as the NNGP approach may be integrated to model correlated data with more general designs. Besides, a principled way to select the number of factors would be useful. Finally, direct marginalization of the factor processes based on an elementwise representation of GPs may further reduce the computation time required to draw a large number of posterior samples.

Supplementary materials
These supplementary materials contain three parts. The proofs for Section 2 are given in Section S1. Additional numerical results for the simulated studies and the real applications are given in Sections S2 and S3, respectively.

S1 Proofs for Section 2

S1.1 Auxiliary facts

1. Let A and B be matrices, further assuming A and B are invertible,

2. Let A, B, C and D be matrices such that the products AC and BD are defined,

Proof of Equation (4). Denote C_Y = (2πσ_0^2)^{-n_1 n_2 / 2} ∏_{l=1}^{d} |Σ_l/σ_0^2 + I_{n_2}|^{-1/2}. Directly marginalizing out Z, one has

For matrices
where the first equation follows from Lemma 1 and the Woodbury matrix identity (used to compute the normalizing constant C_Y); the second and third equations follow from fact 3; and the fourth equation follows from the Woodbury matrix identity. Equation (4) follows immediately.
Proof of Corollary 1. The proof is implied by the proof of Theorem 4 in Gu and Shen (2020); for completeness, we include it below. From Equation (1) and Equation (2), we have Based on vectorization, one has We are now ready to prove Theorem 1.
Proof of Theorem 1. After marginalizing out Z, we have

(Row regression coefficients).
From Lemma S1, the posterior mean of (B_1 | Y, Θ_{-B_1}) follows. We denote the centered B̃_1 by B̃_{1,0} = [B̃_{1,0,s}, B̃_{1,0,c}] = B̃_1 − Y^T A_F, where B̃_{1,0,s} is the first d columns of B̃_{1,0} and B̃_{1,0,c} is the last (n_1 − d) columns of B̃_{1,0}. Let b̃_{1,0,l} be the lth column of B̃_{1,0}. Then the posterior mean of (B_1 | Y, Θ_{-B_1}) can be calculated below, where B̃_{1,0,s} is an n_2 × d matrix with the lth column independently sampled from N(0, Σ̃_l) for l = 1, ..., d. For the distribution of A_c B̃^T_{1,0,c}, using part 1 of Lemma S1, we can sample A_c B̃^T_{1,0,c} by σ_0 (I_{n_1} − A_s A_s^T) Z_{0,1}, where Z_{0,1} is an n_1 × n_2 matrix with each entry independently sampled from the standard normal distribution. The results follow.

(Column regression coefficients).
We first compute the posterior mean of (B_2 | Y, Θ_{-B_2}) below. We denote the centered B̃_2 by B̃_{2,0} = [B̃_{2,0,s}, B̃_{2,0,c}], where B̃_{2,0,s} is a q_2 × d matrix with the lth column independently sampled from N(0, (H^T ...). By Lemma S2, B̃_{1,Q} is an n_2 × d matrix with the lth column independently sampled from N(0, Q_{1,l}) for l = 1, ..., d. For the distribution of A_c B̃^T_{1,0,c}, we can sample the marginal posterior distribution of A_c B̃^T_{1,0,c} by σ_0 (I_{n_1} − A_s A_s^T) Z_{0,1} P_0, where Z_{0,1} is an n_1 × n_2 matrix with each entry independently sampled from the standard normal distribution. The results follow.

Figure S1: The simulated data with full observations, the disk missing pattern, and the missing-at-random pattern with 50% of the values missing are graphed in the left, middle, and right panels, respectively.

S2 Additional results of simulated studies in Section 5
We provide additional results for the simulated studies of Example 3 in Figures S1 and S2. We graph the simulated data set with full observations, the disk missing pattern, and the missing-at-random pattern with 50% missing values in Figure S1. Posterior samples of the logarithm of the inverse range parameter of the factor loading matrix, the nugget parameter, and the inverse range parameter of the factors are graphed from the upper to the lower panels of Figure S2, respectively. The posterior samples of the parameters in the exact GP model and the GOLF processes are similar to each other.

Figure S2: Trace plots of posterior samples of parameters in the exact GP model and GOLF processes for the simulated data in Figure S1.

S3 Additional results for real applications in Section 6.1

In this section, we include additional results for GOLF processes predicting the missing values of the temperature data set discussed in Heaton et al. (2019). We show the details of 5 different configurations of GOLF processes, where the result reported in the main body of the article corresponds to configuration 1. For all configurations, the proportion of burn-in samples is 20%. We use a normal distribution centered at the previous value as the proposal distribution for the logarithm of the inverse range parameters and the logarithm of the nugget parameters. For the logarithm of the inverse range parameters of the factor loading matrix, the standard deviation of the proposal distribution is 40/n_1; for the logarithm of the inverse range parameters and the nugget parameters of the factor processes, the standard deviation of the proposal distribution is set to 40/n_2. The details of the 5 configurations are given in Table S1, and their predictive RMSE, P_CI(95%), and L_CI(95%) are given in Table S2. The predictive RMSE is similar across all 5 configurations; increasing the posterior sample size seems to slightly increase the proportion of held-out samples contained in the 95% predictive interval.

Figure S3: The predictive performance of GOLF processes with d = 5, 10, 20, 30, 40, 50, and 100 factors for Example 4, when the true number of factors is d_real = 100 in generating the data. The nonseparable kernel with distinct kernel parameters is assumed in the first row of panels, and the separable kernel with the same kernel parameter for each factor process is used in the second row. The blue and red curves denote the performance of GOLF processes with different kernel parameters and with the same kernel parameter, respectively.
Figure S3 (caption, continued): In the left panels, the solid curves denote the RMSE for predicting the (noisy) observations, and the dashed curves denote the RMSE for predicting the mean of the observations. The proportions of observations covered by the 95% predictive interval and the average lengths of the predictive interval are graphed in the middle and right panels, respectively.

Figure S4: Diagnostic plots of the GOLF processes for the data set in Heaton et al. (2019).

Table S1: Detailed settings of 5 different configurations of GOLF processes for the data set in Heaton et al. (2019). The number of samples and the computing system are shown in the second and third columns. The choice of initial values for the missing data is given in the fourth column: either the mean of the observations at each latitude or the overall mean of the observations, with a small random Gaussian noise (standard deviation 0.1 times that of the observations) added. The initial values of the logarithm of the inverse range parameters are either fixed or randomly sampled from a uniform distribution, as shown in columns 5-6.

The fitted values from the GOLF processes in configuration 1 against the residuals, and the normal Q-Q plot, are graphed in the left and right panels of Figure S4, respectively. The Q-Q plot indicates the fitted values are slightly left-skewed and slightly under-dispersed.