Bayesian nonstationary and nonparametric covariance estimation for large spatial data

In spatial statistics, it is often assumed that the spatial field of interest is stationary and its covariance has a simple parametric form, but these assumptions are not appropriate in many applications. Given replicate observations of a Gaussian spatial field, we propose nonstationary and nonparametric Bayesian inference on the spatial dependence. Instead of estimating the quadratic (in the number of spatial locations) entries of the covariance matrix, the idea is to infer a near-linear number of nonzero entries in a sparse Cholesky factor of the precision matrix. Our prior assumptions are motivated by recent results on the exponential decay of the entries of this Cholesky factor for Matern-type covariances under a specific ordering scheme. Our methods are highly scalable and parallelizable. We conduct numerical comparisons and apply our methodology to climate-model output, enabling statistical emulation of an expensive physical model.


Introduction
Modeling spatial data typically involves specification of spatial dependence in the form of a covariance function or matrix, under an implicit or explicit assumption of joint Gaussianity. A motivating example for this paper is statistical climate-model emulation (e.g., Castruccio and Stein, 2013;Castruccio et al., 2014;Nychka et al., 2018;Haugen et al., 2019): based on an ensemble of spatial fields generated by an expensive computer model (Figure 1), the goal is to learn the underlying joint distribution, and then, for instance, to draw additional samples from the distribution. This involves many challenges, including small ensemble sizes, high-dimensional distributions, and complex, nonstationary dependence. Thus, there is a need for flexible and scalable methods for inferring high-dimensional spatial covariances.
Countless approximations have been proposed to address computational challenges in spatial statistics (see Heaton et al., 2019, for a recent review and comparison). In recent years, there has been increasing interest in the idea of Vecchia (1988), which effectively approximates the Cholesky factor of the precision (i.e., inverse covariance) matrix as sparse. Under certain settings, the Vecchia approximation can provably provide -accurate approximations at near-linear computational complexity in the number of spatial locations (Schäfer et al., 2020). A generalization of the Vecchia approach includes many popular spatial approximations as special cases . However, Vecchia approaches have mostly been used for approximating parametric and often isotropic covariance functions. Isotropic, parametric covariance functions (e.g., Matérn) only depend on spatial distance and on a small number of unknown parameters. Despite being highly restrictive, this is the standard assumption in spatial statistics, especially in the absence of replicates. Approaches to relax these assumptions include parametric nonstationary covariances (e.g., as reviewed by Risser, 2016), stationary nonparametric covariances (e.g., Huang et al., 2011;Choi et al., 2013;Porcu et al., 2019), nonparametric and nonstationary covariances (e.g., Fuentes, 2002), and domain transformations (e.g., Sampson and Guttorp, 1992;Damian et al., 2001;Qadir et al., 2019). In the context of local kriging, covariance functions are typically estimated locally from a parametric (e.g., Anderes and Stein, 2011;Nychka et al., 2018) or nonpara-metric (e.g., Hsing et al., 2016) perspective, but this generally does not imply a valid joint model or positive-definite covariance matrix.
Outside of spatial statistics, covariance estimation is often performed based on (modified) Cholesky decompositions of the precision matrix. This approach is attractive, because it automatically ensures positive-definiteness, because sparsity in the Cholesky factor directly corresponds to ordered conditional independence and hence to directed acyclic graphs, and because it allows covariance estimation to be reformulated as a series of regressions. Regularization can be achieved as in other regression settings, for example by enforcing sparsity using a Lasso-like penalty or a thresholding procedure (e.g., Huang et al., 2006;Levina et al., 2008) or via Bayesian prior distributions (e.g., Smith and Kohn, 2002). Motivated by a Gaussian Markov random field assumption for spatial data, Zhu and Liu (2009) estimate the Cholesky factor based on an ordering of the spatial locations intended to minimize the bandwidth, which amounts to coordinate ordering on a regular grid, and they regularize the entries of the Cholesky factor using a weighted Lasso penalty depending on spatial distance; this approach scales cubically in the number of spatial locations.
Here, we propose scalable nonparametric Bayesian inference on a high-dimensional spatial covariance matrix. The basic idea is to infer a near-linear number of nonzero entries in a sparse Cholesky factor of the inverse covariance matrix. Our model can be viewed as a nonparametric extension of the Vecchia approach, as regularized inference on a sparse Cholesky factor of the precision matrix, or as a series of Bayesian linear regression or spatial prediction problems. We specify prior distributions that are motivated by recent results (Schäfer et al., 2017(Schäfer et al., , 2020 on the exponential decay of the entries of the inverse Cholesky factor for Matérn-type covariances under a maximum-minimum-distance ordering of the spatial locations (Guinness, 2018;Schäfer et al., 2017). Our method scales well to very large datasets, as the number of nonzero entries in the Cholesky factor and the computational cost both scale near-linearly in the number of spatial locations, in effect inferring a nearlinear number of parameters in the sparse inverse Cholesky factor instead of a square number of parameters in the dense covariance matrix. Further speed-ups are possible, as the main computational efforts are perfectly parallel. Our approach is applicable to a single realization of the spatial field, but the inference will be most useful and accurate if replicate observations are available.
The remainder of this document is organized as follows. Section 2 describes our methodology. Section 3 provides numerical comparisons using simulated data. In Section 4, our method is used for climate-model emulation. Section 5 concludes.
2 Methodology 2.1 Sparse inverse Cholesky approximation for spatial data Consider a N × n matrix of spatial data, where y ( ) i is the th observation at spatial location s i . We assume that the locations s 1 , . . . , s n , and hence the columns of Y, are ordered according to a maximin ordering (Guinness, 2018;Schäfer et al., 2017), which sequentially selects each location in the ordering to maximize the minimum distance from locations already selected (see Figure 2).
We model the rows y ( ) = (y ( ) 1 , . . . , y ( ) n ) of Y as independent n-variate Gaussians: We assume that the data are centered, either using an ad-hoc pre-processing step (e.g., by subtracting location-wise means) or using a more elaborate procedure (see Section 2.8).
Our goal is to make inference on the n × n spatial covariance matrix Σ based on the N × n observations Y, in the case where n is large (at least in the thousands) and N is relatively small. Typically, a parametric, and often isotropic, covariance function is assumed such that Σ is a function of only a small number of parameters, which can then be estimated relatively easily. Here, we avoid explicit assumptions of stationarity and isotropy.
We assume a form of ordered conditional independence, where g m (i) ⊂ (1, . . . , i − 1) is an index vector consisting of the indices of the min(m, i − 1) nearest neighbors to s i among those ordered previously; that is, s (gm(i)) j is the jth nearest neighbor of s i among s 1 , . . . , s i−1 (see Figure 2). While (3) holds trivially for m = n − 1, for many covariance structures it even holds (at least approximately) for m n, as has been demonstrated numerically (e.g., Vecchia, 1988;Stein et al., 2004;Datta et al., 2016;Guinness, 2018;Katzfuss and Guinness, 2019;Katzfuss et al., 2020a,b) and theoretically (Schäfer et al., 2020) in the context of Vecchia approximations of parametric covariance functions. Assume for now that m is known.
Consider the modified Cholesky decomposition of the precision matrix: where D = diag(d 1 , . . . , d n ) is a diagonal matrix with positive entries d i > 0, and U is an upper triangular matrix with unit diagonal (i.e., U ii = 1). (To be precise, (4) is the reverseordered Cholesky factorization of the reverse-ordered Σ −1 , which simplifies our notation later.) The ordered conditional independence assumed in (3) implies that U is sparse, with at most m nonzero off-diagonal elements per column (e.g., Katzfuss and Guinness, 2019, Prop. 3.1). We define u i = U gm(i),i as the nonzero off-diagonal entries in the ith column.

Covariance estimation via Bayesian regressions
From (4), we see that we can estimate the O(n 2 ) unknown entries of Σ by inferring the O(nm) variables d 1 , . . . , d n and u 1 , . . . , u n . To do so, our data model (2) can be written as a series of n linear regression models (Huang et al., 2006):  show their roughly equidistant spread over the domain for maximin. As an example, for m = 4, we would have conditioning sets g 4 (15) = (13, 9, 14, 6) for coordinate and g 4 (15) = (7, 13, 10, 1) for maximin.
where the "response vector" y i = (y (1) consisting of the N observations at the ith spatial location, and the "design matrix" X i consists of the observations at the m neighbor locations of s i , stored in the columns of Y with indices g m (i); specifically, X i is an N × m matrix with th row −y ( ) gm(i) . The Bayesian regression models in (5) are completed by independent conjugate normalinverse-gamma (NIG) priors: where θ is a vector of hyperparameters determining m, V i , α i , and β i (see Section 2.3 below). Due to conjugacy, the posterior distributions (conditional on θ) are also NIG: Using (7), we can easily obtain samples or posterior summaries of the entries of U and D conditional on θ. However, in many applications, primary interest will be in computing posterior summaries of Σ and other quantities. If n is not too large (n < 10 4 , say), we can simply compute Σ −1 (and hence Σ) from U and D. For large n, it is often not possible to even hold the entire dense matrix Σ in memory, but we can quickly compute useful summaries of it based on the sparse matrices U and D (e.g., Katzfuss et al., 2020a). For example, a selected inversion algorithm can compute the variances Σ ii and all entries Σ ij for which i ∈ g m (j) or j ∈ g m (i). We can also compute the covariance matrix for any set of linear combinations Hy ( ) as HΣH = A A, where A = D 1/2 U −1 H . In many applications, including climate-model emulation, it is of interest to sample new spatial fields from the model, which we can do by sampling z ∼ N (0, I n ), and then setting y = (U ) −1 D 1/2 z; if U and D are sampled from their posterior distribution given Y, then we have obtained a sample from the posterior predictive distribution p(y |Y).

Parameterization of the prior distributions
We now discuss parameterizing the NIG priors for u i and d i in (6) as a function of a small number of hyperparameters, θ = (θ 1 , θ 2 , θ 3 ) , inspired by the behavior of Matérn-type covariance functions. The parameter θ 1 is related to the marginal variance, while θ 2 and θ 3 are related to the range and smoothness. In general, our prior parameterizations are motivated by interpreting u i and d i as the kriging weights and variance, respectively, for the spatial prediction problem implied by (3) (6). For an exponential covariance with variance θ 1 and range 2/θ 2 , we have Σ i,j = θ 1 exp(−θ 2 s i − s j /2); assuming m = 1, we obtain where g = g 1 (i), and the distance s i − s g between location s i and its nearest previously ordered neighbor decreases roughly as (i) −1/p for a regular grid on a unit hypercube, D = [0, 1] p . (Throughout, i is an index and not the imaginary number.) This motivates a prior for d i that shrinks toward d i ≈ θ 1 (1−e −θ 2 (i) −1/p ). While (8) only holds exactly for an exponential covariance with m = 1, Figure 3 illustrates that this functional form approximately holds for Matérn covariance functions in two dimensions with m = n − 1 as well. Thus, we set the prior mean as Figure 3, the empirically observed variance of the d i elements around the fit line decreases with i as well, and so we set the prior standard deviation of d i to be half of the mean. Solving for α i and β i , we obtain α i = 6 and ). Recent results based on elliptic boundary-value problems (Schäfer et al., 2017, Sect. 4.1.2) imply that the Cholesky entry (u i ) j , corresponding to the jth nearest neighbor, decays exponentially as a function of j, for Matérn covariance functions whose spectral densities are the reciprocal of a polynomial (ignoring edge effects). Thus, we assume Figure 4 demonstrates this exponential decay as the neighbor number increases.
Finally, consider the choice of conditioning-set size m. Simply setting m to a fixed, reasonable value (e.g., m ≈ 10, depending on computational constraints) works well in many settings, but the results can be highly inaccurate if m is chosen too small, and the  Figure 4: Illustration of the entries (u i ) j of U as a function of neighbor number j for the same setting as in Figure 3. The dark lines correspond to approximate pointwise 95% prior intervals (±2 exp(−θ 3 i)).
computational cost is unnecessarily high if m is chosen too large. Hence, we prefer to allow the data to choose m by tying m to the prior decay of the elements of U; for all of our numerical experiments, we set m as the largest j such that exp(−θ 3 j) > 0.001, where j denotes the neighbor number. This coincides with the amount of variation expected to be learnable from the data. Thus, entries of U with sufficiently small prior variance as implied by a specific θ 3 are set to zero, which ensures computational feasibility of our method.

Inference on the hyperparameters θ
The hyperparameters θ = (θ 1 , θ 2 , θ 3 ) determine m, V i , α i , and β i as described in Section 2.3. We now discuss how θ can be inferred based on the data Y. All elements of θ are assumed to be positive due to the decay previously discussed, and so we perform all inference on the logarithmic scale. The crucial ingredient for inference on θ is the marginal or integrated likelihood, which can be obtained by combining (5) and (6), moving the product over locations outside of the integral over the entries of U and D, and simplifying: where Γ denotes the gamma function, the prior parameters α i , β i , V i are given in (6), and the posterior parameters α i , β i , G i are given in (7). Based on this integrated likelihood, both empirical and fully Bayesian inference are straightforward. Empirical Bayesian inference is based on a point estimate of θ obtained by numerically maximizing the log integrated likelihood. Fully Bayesian inference requires the specification of a hyperprior for θ, which we simply assume to be flat (on the log scale). As a result, the posterior distribution p(θ|Y) ∝ p(Y|θ) is proportional to the integrated likelihood in (9). While this distribution cannot be obtained analytically, we can sample from the posterior using the Metropolis-Hastings (MH) algorithm. To avoid slow mixing due to large negative correlation between θ 1 and θ 2 , we employ an adaptive MH algorithm that jointly proposes θ and learns its covariance matrix on-line; specifically, we use the implementation in R by Scheidegger (2012).

Computational complexity
The cost for inference, including computing the posteriors in (7), sampling y , or evaluating the integrated likelihood in (9), is dominated by computing the m × m matrix G i , which requires O(m 2 N ) time, and decomposing G i , which requires O(m 3 ) time, for each i = 1, . . . , n. Hence, the time complexity is O(n(m 2 N + m 3 )) for each unique value of θ, where m is often very small (e.g., m ≈ 10 in most of our numerical experiments). In addition, the most expensive computations can be carried out in parallel over i = 1, . . . , n.
For very small numbers of replicates, with N < m, we can use alternative expressions (see below (7)) relying on computing and decomposing the N × N matrix The maximin ordering and large nearest-neighbor conditioning sets (with m max = 50, say) can be computed in quasilinear time in n (Schäfer et al., 2017(Schäfer et al., , 2020. For any m ≤ m max implied by a specific θ, we can then simply select g m (i) as the first m entries of g mmax (i).

Asymptotics
Assume temporarily that (2) holds for some true n × n positive-definite covariance matrix Σ 0 , with fixed n and N → ∞. Then, the data model with the true Σ 0 can be written in the regression form (5) with m = n − 1. Under these assumptions, there are a fixed number (depending only on n, not on N ) of variables in the regression models, and our prior distributions on the u i , d i , and θ place nonzero mass on the true model. Hence, using well-known asymptotic results based on the Bernstein-von Mises theorem (e.g., Van der Vaart, 2000), the posterior distributions will be asymptotically normal and our posterior of Σ will contract around the true covariance Σ 0 as the number of independent replicates N approaches infinity. However, results of this nature are of limited use here, as we are most interested in the case n N , which we will examine numerically in Sections 3 and 4.

Correlation-based ordering
For our methods, as discussed in Section 2.1, we recommend a maximin ordering of the variables y 1 , . . . , y n , and then selecting the conditioning sets g m (i) based on the m nearest previously ordered variables, with m determined by θ as described at the end of Section 2.3. So far, these tasks were assumed to be based on the Euclidean distance of the corresponding locations s 1 , . . . , s n (see Figure 2), which implies that our priors shrink toward isotropy (i.e., distributions for which dependence is only a function of distance). This shrinkage is not appropriate in some real-data applications. However, it is relatively straightforward to adapt our methods to processes (e.g., anisotropic or nonstationary) for which Euclidean distance is not meaningful. We merely require some prior guess of the correlation structure, based on expert knowledge, historical data, or (a regularized version of) the sample correlation of the data Y; a simple choice used here is the element-wise product of the sample correlation and an (isotropic) exponential correlation with a large range parameter (e.g., half the maximum distance between any pair of locations in the dataset). Then, our procedures can be carried out as before, except that the ordering and nearest-neighbor conditioning is based on a correlation distance, defined as (1 − |correlation|) 1/2 . This implicitly scales the space, so that the process is approximately isotropic in the transformed space. This approach can increase accuracy in the context of Vecchia approximations of parametric covariances (Kang and Katzfuss, in prep.); we propose it here for our nonparametric procedures. Schäfer et al. (2020, Alg. 7) allows us to compute the correlation-based ordering and conditioning sets in quasilinear time in n.

Noise or spatial trend
Our methodology described so far is most appropriate if the data are observed without any noise or nugget, meaning that realizations of the underlying spatial field are continuous over space; in this setting, approximations based on sparse inverse Cholesky factors of many popular covariance functions can be highly accurate (e.g., Katzfuss and Guinness, 2019;Schäfer et al., 2020). Now consider noisy observations w ( ) |y ( ) iid ∼ N n (y ( ) , τ 2 I n ), = 1, . . . , N , with y ( ) as in (2). One option is to simply apply our methodology directly to the data w ( ) as before; this will likely work well if the noise variance τ 2 is small, but the conditional-independence assumption in (3) is less appropriate if τ 2 is large (e.g., Katzfuss and Guinness, 2019), meaning that a much larger m might be necessary. A larger m results in higher computational cost and potentially less accuracy due to the higher number of Cholesky entries that must be estimated.
Hence, for large noise levels, we instead propose a Gibbs sampler that iterates between sampling y ( ) conditional on w ( ) and Σ −1 = UD −1 U , and sampling θ and the entries of U and D conditional on the y ( ) as in Sections 2.2 and 2.4. The former task can be accomplished without increasing the computational complexity for each Gibbs iteration, by exploiting the sparsity of the Cholesky factor UD −1/2 of the prior precision, and approximating the Cholesky factor of the posterior precision using an incomplete Cholesky factorization to avoid fill-in as described in Schäfer et al. (2020, Sect. 4.1). (If τ 2 is unknown, it is straightforward to sample from its full-conditional distribution as well.) A similar Gibbs-sampling strategy can be employed to make inference on a spatial trend. For example, if the observations w ( ) are given by y ( ) plus a linear spatial trend with a Gaussian prior on the trend coefficients, the coefficients can be sampled in closed form conditional on Σ, and all other unknown quantities can be sampled given the trend coefficients as before based on y ( ) obtained by subtracting the trend from w ( ) .

Simulation study
We compared the following methods: SCOV: Basic sample covariance OURS: Our method described in the previous sections MLE: Estimate based on the MLEs of u i and d i for the regressions in (5) (i.e., no prior shrinkage), with m = min(m OURS , N − 1), with m OURS implied by OURS θ estimate LASSO: Lasso for each regression in (5), with all possible previous points included as possible predictors (i.e., m = n − 1) SLASSO: Spatial LASSO with penalty scaled by the spatial distance to favor inclusion of nearer points as predictors, intended to be similar to Zhu and Liu (2009) The spatial domain for all comparisons was the unit square.  Figure 5: Based on N = 20 draws from a Gaussian process with Matérn covariance at n = 900 locations (see Section 3.1): (a) Sample estimates (SCOV) and posterior 80% credible intervals using our fully Bayesian method (OURS) for 20 entries of the covariance matrix. (b) 50%, 80%, and 95% credible intervals using OURS for one randomly sampled entry of the covariance matrix corresponding to each unique distance.

Uncertainty quantification
First, we fit a fully Bayesian version of OURS to simulated data, to demonstrate the uncertainty quantification in the covariance estimation. Specifically, we considered N = 20 realizations of a Gaussian process with Matérn covariance function with variance 3, smoothness 1, and range parameter 0.25, at n = 900 randomly sampled locations. We obtained 50,000 samples of θ using an adaptive MCMC (Scheidegger, 2012). The trace plots showed good mixing and convergence, and the individual effective sample sizes for the three parameters were all larger than 1,000. After conservatively discarding the first half of the samples for burn-in and thinning by a factor of 50, a covariance matrix was calculated from a sample from (7) for each θ draw. Figure 5a shows the resulting 80% posterior credible intervals (CIs) along with the SCOV estimates for 20 randomly sampled matrix entries Σ ij as a function of s i − s j , the distance between the corresponding spatial locations. Most of the OURS CIs contained the true value and tracked the decay of the covariance as a function of distance. This is also the general trend for CIs at all distances shown in Figure 5b.

Comparison to LASSO for small n
We compared estimation accuracy using the Kullback-Leibler (KL) divergence between the estimated distribution N n (0,Σ) and the true distribution N n (0, Σ): where tr(·) denotes the trace and | · | denotes the determinant. This exclusive KL divergence does not require inverting the estimateΣ and thus avoids issues with SCOV for N < n. To obtain a point estimate for OURS, we computedΣ = (Û −1 ) DÛ −1 , whereÛ andD were the maximum a posteriori (MAP) estimates from (7), based on the value of θ that maximized the integrated likelihood (9).  Figure 6 shows the results, using the same set-up with n = 900 as in Section 3.1, for various numbers of replicates N . MLE was similarly accurate as OURS for large N , as expected, but it performed worse for small N due to the lack of prior shrinkage. Similarly, the inclusion of spatial information in SLASSO resulted in higher accuracy than LASSO for small N . LASSO and SLASSO were not competitive with OURS and MLE, despite increased flexibility in selecting predictors (i.e., conditioning sets) in the regressions (5), and despite much higher computational cost due to calculations involving all O(n) possible predictors. Hence, we did not consider (S)LASSO further. Figure 7 shows further comparisons with n = 2,500 spatial locations using the KL divergence in four different settings (counter-clockwise from top right), all with a marginal variance of 5: Matérn with smoothness 1 and range parameter 0.5 on a regular 50 × 50 spatial grid (corresponding to the middle panel in the bottom row of Figures 3 and 4); a Cauchy covariance with range 0.25 and memory parameters 1 and 0.5 on a regular 50 × 50 grid; Matérn covariance with varying anisotropy (Paciorek and Schervish, 2006), for which the range parameter is constant at 0.05 in the x direction but varies as 0.05 + 0.45 s y (as a function of the y-coordinate s y ) in the y direction, on a regular 50 × 50 grid; Matérn with smoothness 1 and range 0.25 at n = 2,500 randomly spaced locations sampled uniformly.

Comparison for larger n
For all scenarios, MLE was roughly as accurate as OURS for very large N , but performed poorly for small N , indicating that the added shrinkage from our prior improved the accuracy. OURS strongly outperformed SCOV in all settings. For the nonstationary covariance, we also considered the correlation-based ordering (COR) described in Section 2.7. While we used the true correlation for the comparison here, the element-wise product of the sample covariance and an exponential correlation proposed in Section 2.7 resulted in comparable accuracy (not shown). As expected, OURS-COR performed better than OURS-MM in this nonstationary setting. We also conducted some experiments (not shown) using a natural ordering by one of the spatial coordinates, which performed comparably to maximin ordering for isotropic covariances on a regular grid, but was much less accurate for randomly sampled locations. Overall, our method performed well across all simulations, even though our prior distributions were motivated by isotropic Matérn-like covariances. In addition, the computational burden for OURS was relatively low, with the estimated m often around ten and always below 30. While we only considered moderate n here in order to be able to carry out many comparisons using the KL divergence, it is also possible to run our method on much larger datasets. For example, using a C++ implementation, evaluating the integrated likelihood (9) once only took about 6 seconds on a 4-core laptop (Intel i7-7560U) for n = 250,000, m = 10, N = 50.

Climate-model emulation
We analyzed climate-model output from the Community Earth System Model (CESM) Large Ensemble Project (Kay et al., 2015). Specifically, we considered daily mean surface temperature (in Kelvin) on July 1 in 98 consecutive years starting in the year 402, on a roughly 1 • longitude-latitude grid of size n = 81 × 96 = 7,776 containing much of the Americas (see Figure 1). The chosen region features ocean, land, islands, and mountain ranges, leading to a complicated, nonstationary dependence structure. The data Y were defined as the temperature anomalies obtained by standardizing the climate-model output at each grid point to unit mean and variance. We found no evidence of temporal correlation in the data, and so the assumption of independent replicates in (2) was at least approximately satisfied. First, we compared three covariance estimates: an exponential covariance with a range parameter estimated from the data (EXP); a tapered sample covariance given by the elementwise product of the sample covariance and an exponential correlation with a range of 6,000 km, with a small added nugget with variance 10 −5 for numerical stability (SCOVT); and the MAP estimate (as in Section 3.2) using our method with correlation ordering (Section 2.7) based on the SCOVT matrix (OURS). Of the 98 replicates (i.e., years), we randomly selected and withheld 18 as test data, and fit the models on subsets of various sizes N between 6 and 80. As the true distribution was unknown, it was not possible to compute the KL divergence. Instead, we used the strictly proper log score (e.g. Gneiting and Katzfuss, 2014) given by the average negative log posterior predictive density of the test data based on (2), with Σ replaced by the corresponding estimate for each of the three estimates. Figure 8 shows the resulting scores, averaged over five random training/test splits. OURS was more accurate than SCOVT for all values of N , and more accurate than EXP for all N ≥ 10. We also tried OURS with Euclidean (instead of correlation-based) ordering, which resulted in similar scores for large N but required almost twice the N = 17 replicates to surpass EXP (not shown). While it may be possible to find other (e.g., parametric nonstationary) methods that can result in even lower scores than OURS for this dataset, such methods would likely require substantial amounts of manual tuning (e.g., specifying the parametric form of the nonstationarity).
We created a stochastic simulator emulating the climate model, by fitting a fully Bayesian version of OURS to the full dataset with N = 98 and sampling from the posterior predictive distribution p(y |Y) as described at the end of Section 2.2. Four such samples are shown  Figure 9: Four temperature-anomaly fields (in Kelvin) sampled from the posterior predictive distribution using our fully Bayesian method, computed as described in Section 4 based on climate-model output as in Figure 1 in Figure 9; they look qualitatively similar to the actual samples from the climate model in Figure 1, including reproducing features corresponding to land/ocean effects despite using no explicit information on land boundaries. These results were based on 50,000 Metropolis-Hastings (MH) samples of θ (after a burn-in of 50,000) with trace plots showing good mixing and effective sample sizes all larger than 1,000; the samples were then thinned by a factor of 50. It took about 200 minutes to train the emulator, and it took 2.5 seconds to obtain a sample y for a given value of θ, on a 4-core laptop (Intel i7-7560U) without parallelization.

Conclusions
We have developed a scalable, flexible Bayesian model for spatial covariance estimation and emulation. We regularize our method by taking advantage of a form of ordered conditional independence often assumed for spatial data. This motivates the assumption of sparsity in the Cholesky of the precision matrix, which greatly improves scalability and reduces the number of unknown parameters from quadratic to near-linear in the number of spatial locations. We describe three hyperparameters related to the marginal variance and the decay of Cholesky entries; these hyperparameters can be quickly optimized or sampled, resulting in an automatic data-based selection of the sparsity structure. Hence, our method requires no manual tuning or cross-validation. While our approach was motivated by the behavior of isotropic covariances on regular grids, our numerical comparisons demonstrated its generality with more complex covariances and irregularly spaced locations. We also applied the method to climate-model emulation, where it captured the nonstationary behavior better than standard methods. Template R code for our implementation is provided with this article.
There are several interesting extensions for our spatial covariance estimation procedure. Our method can be extended to handle missing values by imputation using a Gibbs sampler similar to the ones described in Section 2.8; however, if the number of observations at a particular location is very small or even zero, the posterior distribution at that location will be very vague and thus generally not particularly useful, unless some additional assumptions about the covariance between the unobserved and observed locations are made. For example, more explicit shrinkage toward a specific parametric covariance could be achieved by setting the prior mean of the nonzero entries of U and D in (6) to the values implied by a parametric Vecchia approximation (e.g., Katzfuss and Guinness, 2019, Sec. 4.1). Another potential extension is to estimate the covariance as a function of external variables by including them as additional covariates in the regressions in (5); for instance, for climate-model emulation, the covariance could depend on season, year, elevation, or land versus ocean. Finally, our approach could be extended to data assimilation, by using it to infer the forecast covariance matrices in an ensemble Kalman filter.