Nonparametric Bayesian Modeling and Estimation of Spatial Correlation Functions for Global Data

We provide a nonparametric spectral approach to the modeling of correlation functions on spheres. The sequence of Schoenberg coefficients and their associated covariance functions are treated as random rather than assuming a parametric form. We propose a stick-breaking representation for the spectrum, and show that such a choice spans the support of the class of geodesically isotropic covariance functions under uniform convergence. Further, we examine the first order properties of such representation, from which geometric properties can be inferred, in terms of Hölder continuity, of the associated Gaussian random field. The properties of the posterior, in terms of existence, uniqueness, and Lipschitz continuity, are then inspected. Our findings are validated with MCMC simulations and illustrated using a global data set on surface temperatures.


Introduction
There has been an increasing interest in the modeling, inference and prediction of random processes defined continuously over a large portion or the entire planet Earth. Global data sets have become ubiquitous, and along with their increasing complexity, new scientific questions and challenges have emerged involving several branches of applied mathematics, statistics, computer sciences and machine learning.

Context and State of the Art
This paper deals with nonparametric Bayesian modeling, inference and prediction, for Gaussian random fields defined continuously over the two-dimensional sphere embedded in the three dimensional Euclidean space. Although more sophisticated settings may be considered (see, e.g., Lindgren et al., 2011, with their approach on manifolds), we choose the sphere as the reference domain for the stochastic processes of interest as it is a reasonable approximation to the true geometry of the Earth (Diggle and Ribeiro, 2007). We take explicit advantage of the assumption of Gaussianity, which implies that the underlying process is uniquely characterized by its second order properties: the mean, and the covariance function (Stein, 1999). More formally, let the unit sphere S 2 be defined as the set of points in R 3 with unit Euclidean distance. We consider Gaussian random fields {Z(x), x ∈ S 2 }, with constant mean, and with covariance function C(x 1 , x 2 ) = σ 2 corr{Z(x 1 ), Z(x 2 )}, for x 1 , x 2 ∈ S 2 . Here, σ 2 is the variance and corr denotes correlation function. In this paper, estimation of the variance plays a minor role, and most attention will be given to the correlation function.
In particular, we focus on covariance functions that are geodesically isotropic (Porcu et al., 2018a): the function C above depends exclusively on the geodesic distance, that is, the shortest arc joining two points on the spherical shell. Mathematical details will be given subsequently. The nonparametric Bayesian approach to the function C has three basic ingredients: (a) the function C does not belong to any parametric family and is instead treated as an unknown function, to be estimated from data; (b) since C must be positive definite (details are given below) it is convenient to resort to spectral techniques, for which the mathematical restrictions are less severe; (c) a prior needs to be specified for the corresponding spectral density.

Random Fields on Spheres: Literature Review
The literature on random fields on spheres, manifolds, or product spaces involving spheres with locally compact groups, has become quite extensive, prompting several recent reviews. Gneiting (2013) provides a comprehensive treatment of covariance functions on spheres, Jeong et al. (2017) offers a review of geostatistical approaches to spheres, and Porcu et al. (2018a) surveys the state-of-the-art of covariance functions on spheres and spheres cross time. Further, the essays in Gneiting (2013) and Porcu et al. (2018a) report collections of open problems pertaining to the mathematical, statistical and computer science communities.
A common assumption in spatial statistics is that the covariance function has the property of being isotropic, that is, the covariance between any pair of random variables observed at two different points of the spatial domain depends exclusively on the distance between the points. A particularly relevant issue when modeling Gaussian fields on spheres is that their covariance function should depend on the appropriate metric, in this case, the great circle (or geodesic) distance. As an alternative, one can build covariance functions based on chordal distance, and we refer the reader to Banerjee (2005), Gneiting (2013), Porcu et al. (2016) and Porcu et al. (2018b) for constructive criticism about the use of such a metric on the sphere.
Covariance functions are positive definite and there is a well-established theory about positive definite functions on d-dimensional spheres of R d+1 . The reader is referred to Schoenberg (1942) and Gneiting (2013); and to the generalizations obtained by Berg and Porcu (2017) and Porcu et al. (2016).
Spectral representations of covariance functions on spheres allow for the understanding of the properties of a Gaussian field with a geodesically isotropic covariance. For instance, Lang and Schwab (2013) show that the rate of decay of the spectrum related to the function C determines the regularity properties of the associated Gaussian field in terms of interpolation spaces and Hölder continuity of sample paths. Spectral analysis on spheres became recently useful in contexts as diverse as spatial statistics (Guinness and Fuentes, 2016), equivalence of Gaussian measures and infill asymptotics (Arafat et al., 2018), approximation theory (Menegatto et al., 2006;Beatson et al., 2014;Ziegel, 2014;Massa et al., 2017) and spatial point processes (Møller et al., 2018).
Statistical methods based on spectral techniques have been explored from the more practical perspective of overcoming the computational burden associated with classical methods based on covariance functions. Most of these techniques are based on spectral representations for processes defined on a lattice and we refer again to Porcu et al. (2018a), and references therein, for a comprehensive review.

Our contribution
For a clearer exposition, our contribution can be categorized as follows: (a) Spectral analysis and priors on d-dimensional spheres. For the sake of completeness, we work on d-dimensional spheres (including the Hilbert sphere, for which details are given below). We choose a stick-breaking representation for the spectral expansion related to geodesically isotropic covariance functions.
(b) Provide the topological support for the prior. The spectral expansion of the function C becomes, under our setting, a random sequence. Thus, the function C inherits this fact and becomes a random object for which a topological support must be provided. We do so under the topology of uniform convergence.
(c) First order properties and Hölder continuity. We choose a special case of a stick-breaking representation, called GEM (Pitman and Yor, 1997, details below) that allows for the inspection of the first order properties of the random spectral sequence. As a corollary, we deduce the geometric properties of a Gaussian field with covariance function of this nature, in terms of Hölder continuity.

(d) Posterior inspection.
We show that the posterior for the above random spectral sequence exists, it is absolutely continuous, and admits a closed form. Finally, we show that such a posterior is Lipschitz continuous with respect to the Hellinger distance.
Our findings are validated through both a simulation study as well as through data analysis. Specifically, the outline of the paper is as follows: Section 2 contains the mathemat-ical background needed to understand our proposed methodology. Section 3 describes our spectral nonparametric approach together with the prior specification. Section 4 presents the posterior analysis for a single realization of the Gaussian process under the proposed prior. This is later extended to the case of multiple samples in Section 5. Section 6 provides a simulation study that illustrates some practical aspects of our approach. Section 7 reports on the application of our method to a dataset of global temperatures provided by the National Center for Atmospheric Research (NCAR). A discussion finishes the paper in Section 8. Mathematical proofs and other details are given in a Supplementary Materials file (Porcu et al., 2020).

Random Fields and Correlation Functions on d-Dimensional Spheres
We start with some notation. For a positive integer d, denotes the d-dimensional unit sphere embedded in R d+1 , with · being the Euclidean distance. We shall also refer to the Hilbert sphere is the continuous mapping defined as where is the transpose operator. Throughout, we equivalently use θ or θ(x 1 , x 2 ) to denote this distance, provided no ambiguity arises.
We consider stationary Gaussian random fields {Z(x), x ∈ S d }, with constant mean, and with covariance function C(x 1 , x 2 ) = cov{Z(x 1 ), Z(x 2 )}, for x 1 , x 2 ∈ S d . Since the variance of any linear combination of a finite realization must be non-negative, the covariance function cannot be an arbitrary mapping. In other words, for any positive integer n, {x 1 , . . . , x n } ⊂ S d and {c 1 , . . . , c n } ⊂ R, we have the condition The mappings C that satisfy (1) are called positive definite, or strictly positive definite if the inequality is strict for any non zero vector (c 1 , . . . , c n ) . A discussion of these and related topics can be found in Menegatto et al. (2006) and references therein. If, in addition, for some mapping ψ : [0, π] → R such that ψ(0) = 1 and σ 2 > 0 denoting the variance of Z(x), then C is called a geodesically isotropic covariance function by Porcu et al. (2018a), with ψ(θ) a correlation function. Gneiting (2013) calls Ψ d the class of continuous functions ψ : [0, π] → R with ψ(0) = 1 such that the positive definite function C defined on S d × S d given by (2) is positive definite. For the remainder of the paper, we do not distinguish between positive and strict positive definiteness, unless otherwise specified. We also define Ψ ∞ = ∞ d=1 Ψ d , with the inclusion relation being strict.

Spectral Representations on Spheres
Spectral representations for positive definite functions on spheres, being the equivalent of Bochner and Schoenberg's theorems in Euclidean spaces (see Daley and Porcu, 2013, and references therein) are available thanks to Schoenberg (1942), who shows that a mapping ψ : [0, π] → R belongs to the class Ψ d if and only if it can be uniquely written as where c λ n denotes the normalized λ-Gegenbauer polynomial of degree n (Abramowitz and Stegun, 1964), and {b n,d } ∞ n=0 is a probability mass sequence. Note that in the cases d = 1 (the unit circle) and d = 2 (the unit sphere of R 3 ) the Gegenbauer polynomials simplify to Tchebytcheff and Legendre polynomials (Abramowitz and Stegun, 1964), respectively.
Characterization of the class Ψ ∞ is also available thanks to Schoenberg (1942): ψ belongs to the class Ψ ∞ if and only if with {b n } ∞ n=0 again a probability mass sequence. We follow Daley and Porcu (2013) and denote the d-Schoenberg sequences of coefficients in (4) by {b n,d } ∞ n=0 , to emphasize the dependence on the index d in the class Ψ d . Analogously, {b n } ∞ n=0 are called Schoenberg coefficients throughout. Fourier inversion allows for an explicit representation of the sequences {b n,d }. Specifically, for a fixed positive integer d and any n = 0, 1, . . ., where κ(n, d) is a positive constant (see Berg and Porcu, 2017). Moreover, arguments in Corollary 2 of Gneiting (2013) show that, for any d ≥ 2 and n = 0, 1, . . ., Family

Analytic expression Parameters range
Negative Binomial Parametric families of some members of the classes Ψ ∞ are listed in Table 1. They are all obtained by evaluating the probability generating function associated to a probability mass sequence. The Schoenberg sequences of the third and fifth entries are provided in Porcu et al. (2016). For the Sine Power family, the Schoenberg coefficients have been obtained by Soubeyrand et al. (2008). Other parametric families whose Schoenberg coefficients are not available in closed-form are listed in Gneiting (2013).
The first entry in Table 1 is the Negative Binomial family. Its Schoenberg sequence is given by for n = 0, 1, . . ., and corresponds to the probability mass function of the Negative Binomial distribution. The Multiquadric family -the second entry in Table 1 -is obtained through the same Schoenberg sequence but with the change of variable δ = 2p/(1 + p 2 ), for p ∈ (0, 1). Note that an explicit form of the d-Schoenberg coefficients can be calculated using Theorem 4.2(b) in Møller et al. (2018).

Nonparametric Spectral Modeling of Covariance Functions on Spheres
Although applications usually involve cases with d = 1 and d = 2, the following exposition applies to the general case of the d-dimensional spheres S d , where d can be either finite or infinite.
Equations (4) and (6) establish a one-to-one mapping between the class Ψ d and the infinite-dimensional simplex to which the d-Schoenberg coefficients necessarily belong to. Therefore, defining a (prior) probability measure on the space Ψ d (for a given dimension d) is equivalent to defining a probability measure on Δ ∞ .
It might be worth stressing that, for any d ∈ N and any element ψ Schoenberg sequences associated to ψ by a series representation of the type (4), all of which belong to Δ ∞ . These d-Schoenberg sequences are mutually related by recursive relations (Gneiting, 2013) and projection operators (Møller et al., 2018).
For any element ψ ∈ Ψ d , we assume that a spatial covariance is of the form C(θ) = σ 2 ψ(θ), with σ 2 denoting the variance. Setting aside the case of σ 2 for now, we focus on the construction and study of properties of a prior distribution for the class of covariance functions on a sphere. Let us consider a random sequence n=0 a Schoenberg random sequence. Without loss of generality, we assume throughout that every random variable under consideration is defined on the same probability space (Ω, F, P ). The fact that the d- We now equip the class Ψ d with an appropriate sigma-algebra, A d , so that ψ (B d ) is properly defined as a random element (namely a measurable function) from (Ω, F, P ) into (Ψ d , A d ). To complete this step, it becomes critical to equip the class Ψ d with the topology induced by the uniform (or sup) norm, that is, the norm given by or equivalently, by uniform convergence. Also, we let A d be the corresponding Borel sigma-field, namely the sigma-algebra generated by the open sets in the topology of uniform convergence. To justify that such a choice is appropriate we argue as follows.
The simplex Δ ∞ is naturally equipped with the product topology and the corresponding Borel sigma-field. Let T : is in turn measurable as a function from Ω into Ψ d . Analogous remarks can be made about ψ B as a function from Ω into Ψ ∞ .

Defining a Prior for d-Schoenberg Sequences
We can now take advantage of the above considerations, together with the fact that Fourier pairs are unique, to assign a prior distribution to the random element ψ (B d ) .
We do this by defining a prior for the sequence B d = {B n,d } ∞ n=0 . A natural choice is to consider a discrete parametric distribution on the natural numbers, and place a prior on its parameters, but this would imply making strong assumptions on the behavior of the sequence of the d-Schoenberg coefficients that would be difficult to justify. For this reason it is worth considering a Bayesian nonparametric approach. Several options are available for this task. For instance, any element of the class of proper species sampling priors (Pitman, 1996) provides a sequence supported in Δ ∞ . However, this is far too general to be of practical use. Therefore, we opt for adopting a prior distribution for the sequence B d using a stick-breaking representation (see e.g. Ishwaran and James, 2001): where {W j,d } ∞ j=0 is a sequence of independent random variables with values in the unit interval [0, 1] such that the support of W j,d is the whole unit interval and It is not difficult to verify that under (11) and (12), ∞ n=0 B n,d = 1 almost surely (see subsequent Lemma 1). Ishwaran and James (2001) show that such a convergence is guaranteed even if condition (12) is replaced with a weaker condition.
The representation in (11) combined with condition (12) are central to providing the topological support of the (prior) distribution of ψ (B d ) given by (11) with respect to the topology of uniform convergence. Recall that the support of the random variable ψ (B d ) is the set of all elements ψ within the class Ψ d such that the (prior) probability that ψ (B d ) belongs to U (ψ) is positive for every open ball U (ψ) of ψ in the topology of the uniform convergence. For clarity, we henceforth denote by supp(X) the support of a random variable X. The following theorem shows that under minimal conditions the support of ψ (B d ) is full, that is, it coincides with the whole class Ψ d , in the sup norm.
Theorem 1 shows clearly why the approach proposed in this paper falls within the framework of Bayesian nonparametric approaches.
The following result illustrates a useful property of the proposed construction.
Proposition 2. Let the hypotheses of Theorem 1 hold true. Additionally, if for every j = 0, 1, 2, . . ., P (W j,d ∈ {0, 1}) = 0, then for every positive integer N , and every finite collection of pairwise distinct points x 1 , . . . , x N ∈ S d and constants c 1 , . . . , c N ∈ R, where c j = 0, for some j, the following holds almost surely: That is, ψ (B d ) is almost surely strictly positive definite.

First Order Properties and Hölder Continuity Under GEM specifications
So far we have provided a completely general construction for random elements ψ (B) having the whole class Ψ d as support. A deeper study of these objects needs a parametric specification for the marginal distributions of the independent random variables W 0,d , W 1,d , . . . appearing in the stick-breaking representation (11) that would induce a parametric specification on the d-Schoenberg random sequence B d . By Theorem 1, it is only necessary that the support of each W n,d be the unit interval. A well known probability distribution for B d is obtained through the stick-breaking representation (11) that arises when W j,d follows a Beta distribution with parameters (1 − α, ϑ + (j + 1)α), with 0 ≤ α < 1 and ϑ > −α. Such distribution is known in the probability literature as the Griffiths-Engen-McCloskey (GEM) distribution and is related to the Pitman-Yor process (Pitman, 1996;Pitman and Yor, 1997), also known as two parameter Poisson Dirichlet process. This particular specification implies increased flexibility than, e.g. the well known Ferguson-Dirichlet process (Freedman, 1963;Ferguson, 1973Ferguson, , 1974, while still retaining some tractability. For further reference, we use the notation B d ∼ GEM(α, ϑ). In particular, when α = 0 the Ferguson-Dirichlet process is obtained. For a detailed account about these distributions, the reader is referred to Pitman (2006).
Theorem 1 together with the GEM specification shows that the support of any random object ψ (B) defined according to (10) is the whole class Ψ d . The next result provides the first order properties of the d-Schoenberg random sequence B d when coupled with the stick-breaking representation under the GEM specification. Denote by x (m) the rising factorial, that is,

Proposition 3. Assume the prior distribution of the d-Schoenberg random sequence
Some comments are in order. Proposition 3 shows that the mapping n → μ d (α, ϑ, n) is the probability mass function of the beta geometric distribution with parameters (1 − α)/α and (1 + ϑ)/α. In other words, using the notation introduced in Section 2, and denoting by g u,v the beta density with parameters u and v, where b n (δ, τ ) is the d-Schoenberg sequence associated with the negative binomial family, as reported in (8). Thus, it becomes apparent that μ d does not depend on the integer d that indexes the class Ψ d .
Proposition 3 shows even more: the prior distribution of the random element ψ (B d ) is centered at the (deterministic) element of Ψ d whose d-Schoenberg coefficient are given by (13), namely: In particular, if d = ∞, then arguments in Proposition 3.4 in Berg and Porcu (2017) and the application of bounded convergence show that the prior of the random element ψ (B) (θ) has expected value where N δ,τ (θ) = 1−δ 1−δ cos θ τ is the negative binomial family as reported in the first entry of Table 1 and belonging to the class Ψ ∞ .
The behavior of the correlation function at the origin is fundamental in determining the geometric properties of the associated Gaussian random field (Diggle and Ribeiro, 2007). The framework proposed in this section allows to provide results along these lines. Since this paper deals with correlation functions ψ(θ), for ψ : [0, π] → R, derivatives at the origin of ψ can be seen as derivatives of the even extension ψ of ψ to the interval [−π, π].
For the remainder of the paper, for any pair of real-valued sequences {a n } and {d n }, we write a n ∼ d n (a n is asymptotically equivalent to d n ) when a n /d n → 1 as n → ∞. Also, we define the Digamma function (Abramowitz and Stegun, 1964) through Proposition 4. Assume that the d-Schoenberg random sequence B d = {B n,d } ∞ n=0 has prior specified as GEM(α, ϑ). Then, P -almost surely, 1. If 0 < α < 1, then there exists a random variable Z, with 0 < Z < ∞, such that Proposition 4 implies actually much more. For convenience, we recall a recent result relating Hölder continuity and d-Schoenberg sequences.
Proposition 5 (Lang and Schwab, 2013). Let d be a positive integer and let ψ be a member of the class for some β > 0, then there exists a constant k β such that for all θ ∈ [0, π], We can now combine Propositions 4 and 5 to assert the following: almost surely.
Proposition 6 has precise consequences in terms of sample path properties of Gaussian random fields. In fact, the Hölder continuity of the covariance function is directly related to the P-almost sure continuity of the associated Gaussian field. This is a direct consequence of the Kolmogorov-Chentsov theorem, which analogue on the sphere has been provided by Lang and Schwab (2015).

Existence of the Posterior and Lipschitz Continuity
The specification of the prior in Section 3.1 allows us to obtain further results. In particular, let d be a positive integer. Suppose the Gaussian process Z on S d is sampled at locations (x 1 , . . . , x N ) . Call Z the n-dimensional random vector (Z(x 1 ), . . . , Z(x N )) and let z := (z(x 1 ), . . . , z(x N )) be one such realization. Let Z have unknown correlation ψ (B d ) , being uniquely determined through the d-Schoenberg random sequence B d . For simplicity we assume that the variance of the marginal distributions of the process is known and fixed. Then, conditionally on the event {B d = b d }, Z is a Gaussian process with correlation function ψ (b d ) (θ), θ ∈ [0, π]. We note that joint distribution of B d and Z can be specified through the prior distribution of B d together with the conditional distribution of Z given B d .
For any set A, let I A denote its corresponding indicator function, namely I A (x) = 1 if x belongs to A and I A (x) = 0 otherwise. It is worth noting that the joint distribution of Z and B d can be obtained by setting: where μ = E(Z(x)), i.e., μ is the (constant) mean of the process and σ 2 is the variance.
Here, the discrete random variable M and the processes {Z n (x) : x ∈ S d , n = 0, 1, . . .} are conditionally independent given B d . Also, we let P (M = n | B d ) = B n,d and for n = 0, 1, . . ., {Z n (x) : x ∈ S d } are zero mean Gaussian processes independent of B d with with δ denoting the Kronecker delta. The assumption above on the covariance between Z n and Z m is certainly correct because, for every n ∈ N 0 and for any positive integer d, the n-th normalized Gegenbauer polynomial of order (d − 1)/2 is positive definite on S d × S d ; see Berg and Porcu (2017) for technical details.
We now have the ingredients to establish the posterior distribution of B d , namely the conditional distribution of B d given Z = z, provided such a distribution exists. We denote the posterior of B d by P z and the prior by P.
Then, the posterior P z of B d exists, is unique and absolutely continuous with respect to P, with uniquely determined density evaluated at where f (z; b d ) is the N -dimensional centered multivariate Gaussian density evaluated at z with covariance matrix where It is also useful to examine whether the posterior P z varies smoothly as a function of z, as it relates to the robustness of the posterior to small changes in the observed data. Using continuity in the Hellinger metrics as a criterion, if it is established then we say that the estimation task is well-posed (see Stuart, 2010;Dashti and Stuart, 2017).
In order to address the well-posedness of the posterior, recall the definition of the Hellinger distance between two probability measures, ν 1 and ν 2 , dominated by the same measure ν with Radon-Nikodym derivatives f 1 and f 2 , respectively: We are now able to state our next finding.

Statistical Model
For the remainder of this article, we will assume a statistical model based on a realization (possibly with replicates) of a Gaussian process Z on S d , sampled at locations (x 1 , . . . , x N ) . We work under the setting outlined in Proposition 7, i.e., Z is assumed to be a Gaussian process with unit variance and correlation function ψ (B d ) , a random element with values in Ψ d , as previously defined in (10). For the upcoming developments, we consider the case where m replicates of this process is available. We denote these by Z 1 , . . . , Z m . Write Z (m) = (Z 1 , . . . , Z m ). The model then states that Z 1 , . . . , Z m are conditionally iid given B d = b d with multivariate Gaussian distribution with covariance matrix Σ N (b d ), as defined in (17). In turn, the sequence B d is modeled as generated from a GEM(α, ϑ) sequence, i.e., as in (11), with W j,d ∼ Beta(1 − α, ϑ + (j + 1)α), where 0 ≤ α < 1 and ϑ > −α. Finally, we assume a prior distribution on α and ϑ. In summary, we work with the following model: where 0 N denotes the zero-vector in R N , LT N (θ 1 , θ 2 , θ 2 3 ) denotes the univariate normal distribution with mean θ 2 and variance θ 2 3 , but truncated to the interval (θ 1 , ∞), that is, with probability density function given by and α 0 , β 0 and A 2 are user-supplied positive constants. Note that the prior on GEM parameters, p(α, ϑ) is specified by means of a conditional and a marginal distribution. This same strategy was employed in Jara et al. (2010). In all simulation experiments and data analyses described below we assume σ = 1 by which we simply imply that the data are assumed to be standardized so that the correlation function and the Schoenberg sequence b d are our main inferential targets. We note also that extending the previous model to the case of unknown variance is straightforward, only requiring an extra prior distribution on σ 2 (or σ). We omit the corresponding details.
It is also straightforward to show that the theoretical results developed earlier in Section 4 for the case of a single realization of the Gaussian process (i.e., m = 1) remain valid in the more general case (i.e., m > 1), with some obvious adjustments, e.g. as in the case of Proposition 7 and Theorem 8, which we use here without proof.

Simulation Study
We have so far studied properties of the proposed class of nonparametric priors for a correlation function over the sphere. We also discussed in Section 4 the case of the posterior distribution corresponding to a sample from a Gaussian process having a correlation function in Ψ d , which can then be represented by means of (11). We now study through simulation the performance of various aspects of the model, in particular, estimation of correlation functions in S d and prediction.
Despite the generality of the results and model stated above, and for the sake of simplicity, the simulation experiments we present next are performed over the two dimensional sphere S 2 embedded in R 3 . We focus on two aspects: on the one hand, we explore the efficiency of Markov Chain Monte Carlo (MCMC) techniques from simulations drawn from a GEM(α, ϑ) distribution. We consider the case of a varying number of points on the sphere. On the other, we examine the flexibility of the proposed model in reproducing the second-order structure of common parametric forms. In particular, we consider two parametric classes of geodesically isotropic correlation functions. First, the exponential correlation function ψ E , given by where φ > 0 is a positive scaling parameter. Arguments in Gneiting (2013) show that ψ E belongs to the class Ψ ∞ for all φ > 0. Secondly is the multiquadric correlation ψ M (·; p, τ ), corresponding to the second entry in Table 1, where the parameter restriction is reported in the same table.
In all the simulation scenarios discussed below, we adopt the model defined as (20) through (23) with α 0 = 1, β 0 = 5 and A 2 = 10 5 , reflecting a prior that assigns higher mass to values of α closer to 0 (i.e., the Dirichlet process) and a vague specification for ϑ.

Simulation I
In this first simulation experiment, we explore some inferential aspects of the Schoenberg sequence b 2 for data drawn from the GEM(α, ϑ) prior. In particular, we consider estimation of the Schoenberg coefficients, and most importantly, of the corresponding correlation function ψ (b2) (θ), θ ∈ [0, π]. We assess the efficiency of MCMC estimation technique, varying the number of points on the sphere from which synthetic realizations are drawn. The spatial design mimics an infill asymptotics setting on the sphere, where an increasing number of points is considered over a compact set. For a recent account of infill asymptotics on d-dimensional spheres, the reader is referred to Arafat et al. (2018). Specifically, we generate 3 increasing sets consisting of N randomly chosen points over the unit sphere, with N = 100, 250, 400. For each N , we fix and assume that the true GEM distribution has a fixed value of ϑ = 3 and α = 0.3, from which a random realization is drawn. Although a realization of GEM distribution entails a countably infinite sequence in (0, ∞), conducting inference necessarily calls for a truncation of the sequence, details of which are provided in Ishwaran and James (2001). Here we consider a truncation up to and including the 30-th term. The obtained covariance matrix Σ N (b d ) is the analogue of the matrix Σ N (b d ) as defined in (17), and obtained by replacing the series with the corresponding partial sum. Inference is performed on m ∈ {1, 100, 1000} independent spatial replicates of the Gaussian random field Z.
In particular, inference is based on Hamiltonian Monte Carlo, implemented in the Stan modeling language (Carpenter et al., 2017) within the R package rstan.  Of fundamental importance for this work is the correlation function. Figure 2 shows the (truncated) correlation functions corresponding to each of the simulation scenarios, evaluated at a grid of 60 equally spaced values over [0, π]. The true values are in black and 95% posterior credibility bands are in red. All dots are joined to facilitate graphical display. As expected, the true collection of correlation points are contained in the corresponding credibility bands, and increasing m has the effect of reducing the width of these bands so that for m = 1000 they are barely distinguishable from the true values.
In closing this first simulation study we mention that we carried out a number of alternative computational experiments with different combinations of GEM parameter values as data generating distribution (including α = 0 and several values for ϑ). We also tried different specifications of the prior, including the case when both α and ϑ are known. We found a general agreement in the results in the sense that the inference for b 2 and ψ (b2) (θ) for θ ∈ [0, π] produced results of similar quality to those shown here (data not shown).

Simulation II
In this second simulation experiment we compare our model with that corresponding to the exponential correlation function. Specifically, we simulate data from a zero mean Gaussian random field, Z, defined over the two dimensional sphere S 2 , assumed to be continuous in the mean-square sense (thus, no nugget effect is considered in the covariance function), with unit variance, and with exponential correlation function ψ E as defined in (24). Thus, Z is continuous but not mean-square differentiable because ψ E is not differentiable (under an even extension) at the origin. We consider two cases for the scaling parameter, φ = 100, 1000 km (see Figure 3). For each value of φ, we Figure 2: Posterior distribution of the estimated correlation function for each of the simulation scenarios. The black curve corresponds to the true correlation function evaluated at a grid (dots are joined for display). The red curves correspond to point-wise 95% credibility bands (also joined for display). randomly generate 100 points, simulate 50 independent realization of the associated random field, and then fit both exponential correlation functions ψ E , using a vague uniform prior for the scaling parameter as φ ∼ U (0, 10 8 ), and the nonparametric model with coefficients b n,2 based on the GEM specification. Specifically, we consider 3 levels of truncation, K = 30, 40, 50. We repeat this 50 times and compute the average predictive scores, shown in Table 2. Such prediction scores are the root mean square error and the continuous ranked probability score. LettingẐ −i (x i ) denote the drop one prediction, namely the best prediction based on all data Z(x j ) for j = i, the root mean square error (RMSE) is: Letting F i (y) = P (Z(x i ) ≤ y | Z(x j ), j = i) be the predictive cumulative density function, the continuous ranked probability score (CRPS) is Both RMSE and CRPS thus correspond to measures of predictive accuracy. In both cases we observe that the level of agreement between the Bayesian Nonparametric (BNP) and parametric models increases with the level of truncation, specially for the case where the scaling parameter equals 1000. Nevertheless, the lack-of-smoothness at the origin, a well-known feature of the exponential correlation function, complicates the approximation. This explains why the model used to generate the data provides better predictions than the BNP, although by a modest margin that decreases with K.
Analogously, we repeat the experiment but now with data generated from a multiquadric correlation function ψ M (·; p, τ ), with τ = 0.5 and p = 0.8, 0.9 (see Figure 3), and using independent priors for p and ν given by p ∼ U (0, 1) and τ ∼ U (0, 10 8 ). The average predictive scores are presented in Table 3.
For both values of p, we find that even the more parsimonious ψ BNP (·; 30) performs comparably to what is obtained using the true data generating model. Indeed, for p = 0.8, increasing the number of coefficients yields no meaningful improvement in   Table 3: Average prediction scores for each model (50 cases), with data being generated from a multiquadric correlation function ψ M with τ = 0.5 and p = 0.8, 0.9. the predictive performance. This is to be expected because unlike the exponential family case illustrated earlier, the multiquadratic correlation function is a better-behaved element of Ψ ∞ (recall Table 1).

Data Illustration: Surface Temperature
Lastly, we provide an illustration of the proposed nonparametric device to surface temperature data provided by the publicly available Large Ensemble Project (LENS) (Kay et al., 2015), developed at NCAR. The dataset consists of over 30 simulations from the Community Earth System Model (CESM) version 1, covering the years 1920 to 2100. Historical forcing is used from 1920 to 2005, and the representative concentration pathway 8.5 (RCP8.5) onwards until 2100. The horizontal resolution is approximately 1 • , with 288 longitude points, and 192 latitude points, for a total of 55,296 points at each time-step. The chaotic nature of the climate system ensures that each simulation resembles an independent realization from a common data generating mechanism, by introducing small round-off level differences in the initial atmospheric conditions. We use Boreal summer averages -averages for the months of June, July and August (JJA) -for the years 2003 to 2005, for the first 30 ensemble members. As an example, Figure 4 shows the 2003 JJA mean of the first ensemble member. This gives us a set of 30 means for each of the three JJA months and we assume the total of m = 90 JJA means to be i.i.d. spatial replicates. Given the different surface temperature characteristics between land and ocean, we focus only on land regions, and only within the latitude band 57 • S-70 • N to avoid potential distortions near the poles, thus reducing the number of points to 12,037. For this study, N = 100 points were chosen from these at random, whose locations are depicted in Figure 4. Our main goal in this analysis is to illustrate the application of our model compared to other parametric alternatives, and to estimate the spatial correlation function.
A preliminary step to the modeling of the correlation structure is the removal of the mean structure from the JJA averages. Previous studies, such as Stein (2007) and Jun et al. (2008) for ozone concentration, and Jeong and Jun (2015) for sea-level pressure, have used spherical harmonics for this purpose. We initially considered their use, but found that they had difficulty in capturing localized features, for instance, over the Qinghai-Tibetan plateau and its surroundings, despite considering degrees = 0, . . . , 12, and orders 0 ≤ m ≤ . Instead, standardizing the residuals across the 90 realizations worked well in achieving a mean close to zero. Figure 5 shows the mean surface and the residual surface of the realization in Figure 4.
To examine the sensitivity of the predictive performance of the BNP model to the degree of truncation, we consider K = 30, 40, 50 as in the simulation study in Section 6. To further test the model's performance, we added this time the case K = 100. The analysis is again based on model (20) through (23) with the same prior choices as in Section 6, i.e., α 0 = 1, β 0 = 5 and A 2 = 10 5 .
In order to assess the performance relative to parametric alternatives, we fit the exponential correlation function ψ E as in (24), the multiquadric correlation function ψ M  Table 4: Posterior means and credibility intervals in parentheses, for the exponential, multiquadric and Legendre-Matérn correlation functions fit to LENS JJA surface temperature means. The parameter φ corresponds to the scale parameter of the exponential and Legendre-Matérn, the parameters p and τ are those denoted in Table 1 for the multiquadric, and ν refers to the smoothness parameter of the Legendre-Matérn. The table also includes posterior means and credibility intervals for GEM parameters.
defined in second entry in Table 1, and the Legendre-Matérn correlation function (Guinness and Fuentes, 2016) with truncation after the first 50 terms. Finally, we consider two elements of the class Ψ ∞ that do not admit a d-Schoenberg expansion in closed form: the power exponential model (see Gneiting, 2013, for details) and the generalized Cauchy class (Gneiting and Schlather, 2004). Table 4 shows the parameter estimates (posterior means) and their associated 95% credibility intervals. Note that only the results for the exponential correlation, ψ E , the multiquadratic, ψ M , and the Legendre-Matérn, ψ LM are shown, as these were the best performing. Table 5 reports the average predictive scores RMSE and CRPS, across the 90 spatial replicates, for the three versions of the nonparametric model ψ BNP together with the other parametric models.
As expected, the additional flexibility of ψ BNP (·; 100), afforded by the larger number of Schoenberg coefficients, allows it to outperform the other ψ BNP (·; K) models, K = 30, 40, 50. The posterior means and 95% credibility intervals for α on the one hand clearly suggest a departure from the simplistic case of the Dirichlet process (i.e., when α = 0), especially when K = 100, as seen from the last two rows in Table 4. In fact, when K = 100 this posterior distribution is clearly shifted to the right compared to the other cases. On the other hand, the posterior of ϑ is concentrated on fairly large values, suggesting that all the terms in the truncated mixture are important to achieve the approximation required by the data. When K = 100, this pos-  terior is somewhat shifted to the left compared to K = 30, 40, 50. It is also interesting to note that the model performs better than the parametric counterparts in terms of both RMSE and CRPS. With respect to the parametric models, the Legendre-Matérn has E(ν | Z) = 0.197 suggesting that the process is not mean-square differentiable, and thus explaining the better fit of the exponential and the Legendre-Matérn; whereas the multiquadric, which is twice differentiable at the origin, underperforms. In terms of predictive performance, we find that the exponential model slightly outperforms the other two. We point out that the lack of smoothness exhibited by the data, and empirically corroborated by the posterior summaries in Table 4, makes it generally harder for a mixture of smooth functions to produce good approximations as required by the available data. In practice, this is reflected in the need to employ an increased number of mixture terms compared to other datasets that do not exhibit this problem. For most problems, a truncation at 50 mixture terms would be more than adequate (Ishwaran and James, 2001), but interestingly, the results exhibited in Table 5 show that for these data we actually need to "get to the limit" and use more than K = 50 terms to obtain a clearly better performance with respect to competitors.
Finally, we consider estimation of the spatial correlation function using the proposed model and also the parametric alternatives mentioned earlier. Figure 6 shows the empirical spatial correlation function obtained from package ncf available for the R software. All the curves are joined to help graphical display. The posterior mean of the spatial correlation functions for several of the methods considered earlier (exponential, Legendre-Matérn, multiquadratic) are also included for comparison. Recall that the exponential correlation model (dotted line) gave the best parametric estimate according to Table 5. Even though the exponential correlation estimate generally follows the empirical curve, it cannot reproduce the ups and downs observed in the solid curve. Considering now the BNP estimate, the dashed line corresponds to the posterior mean of the correlation function obtained from ψ BNP (·; 100) (estimates are virtually identical to the case of ψ BNP (·; 50), not shown here). Observe how the BNP estimate closely follows the empirically computed function, thus denoting the flexibility of the nonparametric construction and model we have studied here. Figure 6: Estimated correlation functions under the various models considered in this analysis. The solid curve represents the empirical spatial correlation function, estimated using package ncf from R.

Conclusion and Discussion
This paper provides a novel perspective on covariance functions on spheres. We have shown that a nonparametric Bayesian approach offers a valid alternative to parametric forms from the point of view of statistical accuracy and optimal linear prediction. Our theoretical findings have been supported by a satisfactory simulation study and by a data analysis on global surface temperatures.
This work poses a series of challenges that may be of interest to pursue in the future. One of them is the nonparametric Bayesian modeling of space-time covariance functions, for which the spectral representation becomes quite involved. Surely, it would be worthwhile to extend the proposed approach to multivariate random fields, for which modeling marginal and cross spectra will become necessary.
The approach proposed in this paper has an interesting connection with integration methods that have been recently proposed in the machine learning community, and a good contribution would be to show the advantages of our approach within that community.