Asymptotic properties of the maximum likelihood and cross validation estimators for transformed Gaussian processes

The asymptotic analysis of covariance parameter estimation of Gaussian processes has been subject to intensive investigation. However, this asymptotic analysis is very scarce for non-Gaussian processes. In this paper, we study a class of non-Gaussian processes obtained by regular non-linear transformations of Gaussian processes. We provide the increasing-domain asymptotic properties of the (Gaussian) maximum likelihood and cross validation estimators of the covariance parameters of a non-Gaussian process of this class. We show that these estimators are consistent and asymptotically normal, although they are defined as if the process was Gaussian. They do not need to model or estimate the non-linear transformation. Our results can thus be interpreted as a robustness of (Gaussian) maximum likelihood and cross validation towards non-Gaussianity. Our proofs rely on two technical results that are of independent interest for the increasing-domain asymptotic literature of spatial processes. First, we show that, under mild assumptions, coefficients of inverses of large covariance matrices decay at an inverse polynomial rate as a function of the corresponding observation location distances. Second, we provide a general central limit theorem for quadratic forms obtained from transformed Gaussian processes. Finally, our asymptotic results are illustrated by numerical simulations.


Introduction
Kriging [41,34] consists of inferring the values of a (Gaussian) random field given observations at a finite set of points. It has become a popular method for a large range of applications, such as geostatistics [31], numerical code approximation [35,36,9], calibration [33,10], global optimization [27], and machine learning [34].
When considering a Gaussian process, one has to deal with the estimation of its covariance function. Usually, it is assumed that the covariance function belongs to a given parametric family (see [1] for a review of classical families). In this case, the estimation boils down to estimating the corresponding covariance parameters. Nowadays, the main estimation techniques are based on maximum likelihood [41,34], cross-validation [48,6,7,13] and variation estimators [25,3,4].
The asymptotic properties of estimators of the covariance parameters have been widely studied in the two following frameworks. The fixed-domain asymptotic framework, sometimes called infill asymptotics [41,18], corresponds to the case where more and more data are observed in some fixed bounded sampling domain. The increasing-domain asymptotic framework corresponds to the case where the sampling domain increases with the number of observed data.
Under fixed-domain asymptotics, and particularly in low dimensional settings, not all covariance parameters can be estimated consistently (see [24,41]). Hence, the distinction is made between microergodic and non-microergodic covariance parameters [24,41]. Although non-microergodic parameters cannot be estimated consistently, they have an asymptotically negligible impact on prediction [38,39,40,47]. There is, however, a fair amount of literature on the consistent estimation of microergodic parameters (see for instance [47,28,20,42,45,46]). This paper focuses on the increasing-domain asymptotic framework. Indeed, generally speaking, increasing-domain asymptotic results hold for significantly more general families of covariance functions than fixed-domain ones. Under increasing-domain asymptotics, the maximum likelihood and cross validation estimators of the covariance parameters are consistent and asymptotically normal under mild regularity conditions [30,37,7,23].
All the asymptotic results discussed above are based on the assumption that the data come from a Gaussian random field. This assumption is indeed theoretically convenient but might be unrealistic for real applications. When the data stem from a non-Gaussian random field, it is still relevant to estimate the covariance function of this random field. Hence, it would be valuable to extend the asymptotic results discussed above to the problem of estimating the covariance parameters of a non-Gaussian random field.
In this paper, we provide such an extension, in the special case where the non-Gaussian random field is a deterministic (unknown) transformation of a Gaussian random field. Models of transformed Gaussian random fields have been used extensively in practice (for example in [17,43,2,44]).
Under reasonable regularity assumptions, we prove that applying the (Gaussian) maximum likelihood estimator to data from a transformed Gaussian random field yields a consistent and asymptotically normal estimator of the covariance parameters of the transformed random field. This (Gaussian) maximum likelihood estimator corresponds to what would typically be done in practice when applying a Gaussian process model to a non-Gaussian spatial process. This estimator does not need to know the existence of the non-linear transformation function and is not based on the exact density of the non-Gaussian data. We refer to Remark 2 for further details and discussion on this point.
We then obtain the same consistency and asymptotic normality result when considering a cross validation estimator. In addition, we establish the joint asymptotic normality of both these estimators, which provides the asymptotic distribution of a large family of aggregated estimators. Our asymptotic results on maximum likelihood and cross validation are illustrated by numerical simulations.
To the best of our knowledge, our results (Theorems 3, 4, 5, 6 and 7) provide the first increasing-domain asymptotic analysis of Gaussian maximum likelihood and cross validation for non-Gaussian random fields. Our proofs intensively rely on Theorems 1 and 2. Theorem 1 shows that the components of inverse covariance matrices are bounded by inverse polynomial functions of the corresponding distance between observation locations. Theorem 2 provides a generic central limit theorem for quadratic forms constructed from transformed Gaussian processes. These two theorems have an interest in themselves.
The rest of the paper is organized as follows. In Section 2, general properties of transformed Gaussian processes are provided. In Section 3, Theorems 1 and 2 are stated. In Section 4, an application of these two theorems is given to the case of estimating a single variance parameter. In Section 5, the consistency and asymptotic normality results for general covariance parameters are given. The joint asymptotic normality result is also given in this section. The simulation results are provided in Section 6. All the proofs are provided in the appendix.

General properties of transformed Gaussian processes
In applications, the use of Gaussian process models may be too restrictive. One possibility for obtaining larger and more flexible classes of random fields is to consider transformations of Gaussian processes. In this section, we now introduce the family of transformed Gaussian processes that we will study asymptotically in this paper. This family is determined by regularity conditions on the covariance function of the original Gaussian process and on the transformation function.
Let us first introduce some notation. Throughout the paper, C inf > 0 (resp. C sup < ∞) denotes a generic strictly positive (resp. finite) constant. This constant never depends on the number of observations n, or on the covariance parameters (see Section 5), but is allowed to depend on other variables. We mention these dependences explicitly in cases of ambiguity. The values of C inf and C sup may change across different occurrences.
For a vector x of dimension d we let |x| = max i=1,...,d |x i |. Further, the Euclidean and operator norms are denoted by ||x|| and by ||M || op = sup{||M x|| : ||x|| ≤ 1}, for any matrix M . We let λ 1 (B) ≥ . . . ≥ λ r (B) be the r eigenvalues of a r ×r symmetric matrix B. We let ρ 1 (B) ≥ . . . ≥ ρ r (B) ≥ 0 be the r singular values of a r × r matrix B. We let N be the set of non-zero natural numbers.
Further, we define the Fourier transform of a function h : For a sequence of observation locations, the next condition ensures that a fixed distance between any two observation locations exists. This condition is classical [7,11]. Condition 1. We say that a sequence of observation locations, (x i ) i∈N , x i ∈ R d , is asymptotically well-separated if we have inf i,j∈N,i =j |x i − x j | > 0.
The next condition on a stationary covariance function is classical under increasing-domain asymptotics. This condition provides asymptotic decorrelation for pairs of distant observation locations and implies that covariance matrices are asymptotically well-conditioned when a minimal distance between any two distinct observation locations exists [30,7]. Condition 2. We say that a stationary covariance function k on R d is subexponential and asymptotically positive if: i) C sup and C inf exist such that, for all s ∈ R d , we have |k(s)| ≤ C sup exp (−C inf |s|); (1) ii) For any sequence (x i ) i∈N satisfying Condition 1, we have inf n∈N λ n (Σ) > 0, where Σ is the n × n matrix (k(x i − x j )) i,j=1,...,n .
In Condition 2, we remark that k : R d → R is called a stationary covariance function in the sense that (x 1 , x 2 ) → k(x 1 − x 2 ) is a covariance function. We use this slight language abuse for convenience.
We also remark that, when non-transformed Gaussian processes are considered, a polynomial decay of the covariance function in Condition 2 i) is sufficient to obtain asymptotic results [7,8]. Here an exponential decay is needed in the proofs to deal with the non-Gaussian case. Nevertheless, most classical families of covariance functions satisfy inequality (1).
When considering a transformed Gaussian process, we will consider a transformation satisfying the following regularity condition, which enables us to subsequently obtain regularity conditions on the covariance function of the transformed Gaussian process.
Condition 3. Let F : R → R be a fixed non-constant continuously differentiable function, with derivative F . We say that F is sub-exponential and non-decreasing if: i) For all t ∈ R, we have |F (t)| ≤ C sup exp (C sup |t|) and |F (t)| ≤ C sup exp (C sup |t|); ii) The function F is non-decreasing on R.
In the following lemma, we show that the covariance function of a transformed Gaussian process satisfies Condition 2, when Conditions 2 and 3 are satisfied, for the original process and for the transformation.
Lemma 1. Assume that the stationary covariance function k satisfies Condition 2 and that the transformation F satisfies Condition 3. Let X be a zero-mean Gaussian process with covariance function k and let k be the stationary covariance function of F (X(·)). Then, k satisfies Condition 2.
In the next lemma, we show that we can replace the condition of an increasingtransformation by the condition of a monomial transformation of even degree (with an additive constant).
Lemma 2. If a covariance function k satisfies Condition 2 (i) and if the Fourier transformk of k is strictly positive on R d , then k satisfies Condition 2 (ii). Furthermore, in this case, in Lemma 1, Condition 3 (ii) can be replaced by the condition F (x) = x 2r + u for r ∈ N, u ∈ R and x ∈ R.

Transformed Gaussian process framework
Throughout the paper, we will consider an unobserved latent Gaussian process Z on R d with d ∈ N fixed. Assume that Z has zero-mean and stationary covariance function k Z . We assume throughout that k Z satisfies Condition 2.
We consider a fixed transformation function T satisfying Condition 3. We assume that we observe the transformed Gaussian process Y , defined by Y (s) = T (Z(s)) for any s ∈ R d .
We assume throughout that the random field Y has zero-mean. We remark that, for a non-linear transformation F : R → R, the random variable F (X(s)) does not necessarily have zero mean for s ∈ R d . Hence, we implicitly assume that T is of the form F − E[F (Z(x))], where F satisfies Condition 3. Note that E[F (Z(x))] is constant by stationarity and that, if F satisfies Condition 3 or the condition specified in Lemma 2, then F − E[F (Z(x))] also satisfies these conditions. Here, as in many references, the assumption of a zero-mean for Y is made by notational convenience and for the sake of brevity, and could be alleviated.
We let k Y be the covariance function of Y . We remark that, from Lemma 1, k Y also satisfies Condition 2.
We let (s i ) i∈N be the sequence of observation locations, with s i ∈ R d for i ∈ N. We assume that (s i ) i∈N satisfies Condition 1.
The problem of estimating the covariance function k Y from the observation vector y is crucial and has been extensively studied in the Gaussian case (when T is a linear function). Classically, we assume that k Y belongs to a parametric family of covariance functions. We will provide the asymptotic properties of two of the most popular estimators of the covariance parameters: the one based on the (Gaussian) maximum likelihood [34,41] and the one based on cross validation [13,6,48]. To our knowledge, such properties are currently known only for Gaussian processes, and we will provide analogous properties in the transformed Gaussian framework.

Bounds on the elements of inverse covariance matrices
In the case of (non-transformed) Gaussian processes, one important argument for establishing the asymptotic properties of the maximum likelihood and cross validation estimators is to bound the largest eigenvalue of the inverse covariance matrix R −1 . Unfortunately, due to the non-linearity of the transformation T , such a bound on the largest eigenvalue is no longer sufficient in our setting.
To circumvent this issue, we obtain in the following theorem stronger control over the matrix R −1 : we show that its coefficients decrease polynomially quickly with respect to the corresponding distance between observation locations. This theorem may have an interest in itself. Theorem 1. Consider the setting of Section 3.1. For all fixed 0 < τ < ∞, we have, for all n ∈ N and i, j = 1, . . . , n where C sup depends on τ but does not depend on n, i, j.

Central limit theorem for quadratic forms of transformed Gaussian processes
In the proofs on covariance parameter estimation of Gaussian processes, a central step is to show the asymptotic normality of quadratic forms of large Gaussian vectors. This asymptotic normality is established by diagonalizing the matrices of the quadratic forms. This diagonalization provides sums of squares of decorrelated Gaussian variables and thus sums of independent variables [25,7]. In the transformed Gaussian case, one has to deal with quadratic forms involving transformations of Gaussian vectors. Hence, the previous arguments are not longer valid. To overcome this issue, we provide below a general central limit theorem for quadratic forms of transformed Gaussian vectors. This theorem may have an interest in itself.
This asymptotic normality result is established by considering a metric d w generating the topology of weak convergence on the set of Borel probability measures on Euclidean spaces (see, e.g., [22] p. 393). We prove that the distance between the sequence of the standardized distributions of the quadratic forms and Gaussian distributions decreases to zero when n increases. The introduction of the metric d w enables us to formulate asymptotic normality results in cases when the sequence of standardized variances of the quadratic forms does not necessarily converge as n → ∞.
Theorem 2. Consider the setting of Section 3.1. Let (A n ) n∈N be a sequence of matrices such that A n has dimension n × n for any n ∈ N. Let A = A n for concision. Assume that for all n ∈ N and for all i, j = 1, . . . , n, where C sup does not depend on i, j. Let . Then, as n → ∞, In addition, the sequence (n Var(V n )) n∈N is bounded.

Remark 1.
In the case where limits E ∞ and σ ∞ exist, such that and the sequence (n Var(V n )) n∈N converges to a fixed variance σ 2 ∞ , the result of Theorem 2 can be written in the classical form as n → ∞.

Estimation of a single variance parameter
We let σ 2 0 be the marginal variance of Y , that is Var(Y (s)) = σ 2 0 for any s ∈ R d . We let k Y = σ 2 0 c Y be the stationary covariance function of Y , where c Y is a correlation function. We assume that the same conditions as in Section 3 hold. Then, the standard Gaussian maximum likelihood estimator of the variance parameter isσ One can simply show that E[σ 2 ML ] = σ 2 0 even though y is not a Gaussian vector, since y has mean vector 0 and covariance matrix σ 2 0 C. Hence, a direct consequence of Theorems 1 and 2 is then that the maximum likelihood estimator is asymptotically Gaussian, with a n 1/2 rate of convergence, even though the transformed process Y is not a Gaussian process. Corollary 1. Let L n be the distribution of √ n(σ 2 ML − σ 2 0 ). Then, as n → ∞, In addition, the sequence n Var(σ 2 ML − σ 2 0 ) n∈N is bounded. The proof of Theorem 2 actually allows us to study another estimator of the variance σ 2 0 of the formσ The proof of Theorem 2 then directly implies the following. Corollary 2. Let K n be any sequence of positive numbers tending to infinity. Let L Kn,n be the distribution of √ n(σ 2 ML,Kn − σ 2 0 ). Then, as n → ∞, In addition, we have The above corollary shows that one can taper the elements of C −1 when estimating the variance parameter, and obtain the same asymptotic distribution of the error, as long as the taper range goes to infinity, with no rate assumption. This result may have an interest in itself, in view of the existing literature on covariance tapering for Gaussian processes under increasing-domain asymptotics [23,37]. We also remark that the computation costs ofσ 2 ML,K andσ 2 ML have the same orders of magnitude because C −1 needs to be computed in both cases.

Framework
As in Section 3.1, we consider a zero-mean Gaussian process Z defined on R d with covariance function k Z satisfying Condition 2. Let Y be the random field defined for any s ∈ R d by Y (s) = T (Z(s)), where T is a fixed function satisfying Condition 3. Furthermore we assume that Y has zero-mean function and we recall that from Lemma 1, its covariance function k Y also satisfies Condition 2. Finally, the sequence of observation locations (s i ) i∈N satisfies Condition 1.
Let {k Y,θ ; θ ∈ Θ} be a parametric set of stationary covariance functions on R d , with Θ a compact set of R p . We consider the following condition on this parametric set of covariance functions.
Condition 4. For all s ∈ R d , k Y,θ (s) is three times continuously differentiable with respect to θ, and we have The smoothness condition in (4) is classical and is assumed for instance in [7]. As discussed after Condition 2, milder versions of (3) can be assumed for nontransformed Gaussian processes, but (3) is satisfied by most classical families of covariance functions nonetheless.
The next condition, on the Fourier transforms of the covariance functions in the model, is standard.
We letk Y,θ be the Fourier transform of k Y,θ . Thenk Y,θ (s) is jointly continuous with respect to θ and s and is strictly positive on Θ × R d .
Finally, the next condition means that we address the well-specified case [6,8], where the family of covariance functions does contain the true covariance function of Y . The well-specified case is considered in the majority of the literature on Gaussian processes.
In the next two subsections, we study the asymptotic properties of two classical estimators (maximum likelihood and cross validation) for the covariance parameter θ 0 . The asymptotic properties of these estimators are already known for Gaussian processes and we extend them to the non-Gaussian process Y .

Maximum Likelihood
For n ∈ N, let R θ be the n × n matrix k Y,θ (s i − s j ) i,j=1,...,n , and let with L θ = 1 n log(det(R θ )) + y R −1 θ y be a maximum likelihood estimator. We will provide its consistency under the following condition.
Condition 7 can be interpreted as a global indentifiability condition. It implies in particular that two different covariance parameters yield two different distributions for the observation vector. It is used in several studies, for instance [7]. Theorem 3. Consider the setting of Section 5.1 for Z, T , Y and (s i ) i∈N . Assume that Conditions 4, 5, 6 and 7 hold. Then, as n → ∞, Remark 2. The (Gaussian) maximum likelihood estimatorθ ML in Theorem 3 is not, strictly speaking, a maximum likelihood estimator, because the distribution of the observation vector y is non-Gaussian. If the transformation T is known, and assuming that the covariance function of the Gaussian process Z belongs to a parametric set {k Z,α , α ∈ A}, a (non-Gaussian) maximum likelihood estimatorα Z,ML (y) could be defined. When T is bijective, y is a fixed invertible transformation of (Z(s 1 ), . . . , Z(s n )) and so this maximum likelihood estimatorα Z,ML (y) coincides with the standard Gaussian maximum likelihood estimator based on Z. Asymptotically, it is known thatα Z,ML (y) converges to the true parameter α 0 for which Z has covariance function k Z,α0 . In Theorem 3, we show thatθ ML similarly converges to the true parameter θ 0 for which Y has covariance function k Y,θ0 . Notice that α 0 and θ 0 usually do not coincide. For instance, if the covariance models {k Y,θ ; θ ∈ Θ} and {k Z,α ; α ∈ A} are both {σ 2 e −ρ||·|| }, then if α 0 = (σ 2 0 , ρ 0 ), by Mehler's formula [5], and thus θ 0 = (2σ 4 0 , 2ρ 0 ). Hence, the exact (non-Gaussian) maximum likelihood estimatorα Z,ML (y) based on the knowledge of T and modeling the covariance function of Z and the pseudo (Gaussian) maximum likelihood estimatorθ ML that ignores T and models the covariance function of Y estimate covariance parameters of different natures and do not have the same limit. Condition 8 can be interpreted as a regularity condition and as a local indentifiability condition around θ 0 . In the next theorem, we provide the asymptotic normality of the maximum likelihood estimator. In this theorem, the matrices M θ0 and Σ θ0 depend on the number of observation locations.
Let Σ θ0 be the p×p covariance matrix defined by (Σ θ0 ) i,j = Cov(n 1/2 ∂L θ0 /∂θ i , n 1/2 ∂L θ0 /∂θ j ). Let L θ0,n be the distribution of √ n(θ ML − θ 0 ). Then, as n → ∞, In addition, lim sup Remark 3. If the sequences of matrices M θ0 and Σ θ0 converge as n → ∞, then √ n(θ ML − θ 0 ) converges in distribution to a fixed centered Gaussian distribution where the limiting covariance matrix is given by Conditions 7 and 8 involve the model of covariance functions {k Y,θ ; θ ∈ Θ} and the sequence of observation locations (s i ) i∈N but not the transformation T . They are further discussed, in a different context, in [12]. We believe that these conditions are mild. For instance, Conditions 7 and 8 hold when the sequence of observation locations (s i ) i∈N is a randomly perturbed regular grid, as in [7].

Cross Validation
We consider the cross validation estimator consisting of minimizing the sum of leave-one-out square errors.
The asymptotic behaviour ofψ CV was studied in the Gaussian framework in [7] and under increasing-domain asymptotics. We also remark that, in the Gaussian framework, a modified leave-one-out criterion was studied in [13] in the case of infill asymptotics.
The next identifiability condition is also made in [7].
The next theorem provides the consistency of the cross validation estimator.
Theorem 5. Consider the setting of Section 5.1 for Z, T , Y and (s i ) i∈N . Assume that Conditions 4, 5, 6 and 9 hold. Then, as n → ∞, The next condition is a local identifiability condition.
In the next theorem, we provide the asymptotic normality of the cross validation estimator. In this theorem, the matrices N ψ0 and Γ ψ0 depend on the number of observation locations.
Theorem 6. Consider the setting of Section 5.1 for Z, T , Y and (s i ) i∈N . Assume that Conditions 4, 5, 6, 9 and 10 hold. Let N ψ0 be the (p − 1) × (p − 1) matrix defined by In addition, lim sup Similarly as for maximum likelihood, Condition 9 is a global identifiability condition for the correlation function. In the same way, Condition 10 is a local identifiability condition for the correlation function around ψ 0 . The sequence of observation locations presented in Lemma 3 also satisfies Conditions 9 and 10 by replacing k Y,θ with c Y,ψ .
We remark that the conditions for cross validation imply those for maximum likelihood.

Joint asymptotic normality
From Theorems 4 and 6, both the maximum likelihood and cross validation estimators converge at the standard parametric rate n 1/2 . Let us writeθ ML = (σ 2 ML ,ψ ML ). In the case where T is the identity function (that is, where we observe Gaussian processes instead of transformed Gaussian processes), numerical experiments tend to show thatψ ML is more accurate thanψ CV [7]. Indeed, when T is the identity function, maximum likelihood is based on the Gaussian probability density function of the observation vector.
In contrast, when T is not the identity function,ψ ML is an M-estimator based on a criterion which does not coincide with the observation probability density function anymore. Hence, it is conceivable thatψ CV could become more accurate thanψ ML . Furthermore, it is possible that using linear combinations of these two estimators could result in a third one with improved accuracy [29,14].
Motivated by this discussion, we now provide a joint central limit theorem for the maximum likelihood and cross validation estimators.
Let Q θ0,n be the distribution of Then, as n → ∞, In addition, lim sup Remark 4. From Theorem 7, considering any C 1 function f from S 2 → S such that f (ψ, ψ) = ψ for any ψ ∈ S, and applying the classical delta method, we obtain the asymptotic normality of the new estimator f (ψ ML ,ψ CV ). A classical choice for f is f (ψ 1 , ψ 2 ) = λψ 1 + (1 − λ)ψ 2 , which leads to linear aggregation [29,14]. We remark that selecting an optimal λ leading to the smallest asymptotic covariance matrix necessitates an estimation of the asymptotic covariance matrix in Theorem 7. We leave this as an open problem for further research.

Illustration
In this section we numerically illustrate the convergence of different estimators as stated in Corollary 1 and Theorems 4, 6 and 7. We use the following simulation setup in dimension d = 2. For n = 4(m + 1/2) 2 we define the grid {−m, . . . , m} 2 and add i.i.d. variables with uniform distribution on [−0.4, 0.4] 2 to obtain n observation points. Thus, we have a distance of at least ∆ = 0.2 between the individual observation locations. The zero-mean Gaussian process Z has stationary covariance function k Z (s) = σ 2 0 exp(−||s||/ρ 0 ), s ∈ R 2 and we will denote this reference case as the Gaussian case throughout. We define the zero-mean process Y = T (Z) = Z 2 − σ 2 0 and we will denote this case as the non-Gaussian case. Recall that k Y (s) = 2k Z (2s) = 2σ 4 0 exp(−2||s||/ρ 0 ) (see Remark 2). We set the marginal variance to σ 2 0 = 1.5 and the range to ρ 0 = 2. Hence, in the non-Gaussian case, the marginal variance of Y is 2σ 4 0 = 4.5 with a range of ρ 0 /2 which is equal to a half of that of Z.
To start, we consider the maximum likelihood estimates of the marginal variance parameters, when the range of Y or Z is assumed to be known, i.e., Corollary 1. Figure 1 illustrates the empirical densities of L n in Corollary 1 for n = 100, 400 and 900 observation locations based on N = 2500 replicates. For moderate n sizes and in the non-Gaussian case, the asymptotic variance σ 2 ∞ := n Var(σ 2 ML − 2σ 2 0 ) (Corollary 1) can be calculated based on (26) and using Cov(y i y j , y k y l ) = 4(k 2 i,k k 2 j,l + k 2 i,l k 2 j,k ) + 16(k i,k k i,l k j,k k j,l + k i,j k i,l k j,k k k,l + k i,j k i,k k j,l k k,l ), with k i,j = k Z (s i − s j ). The above display follows from tedious computations based on Isserlis' theorem. The densities in red in Figure 1 are based on the asymptotic distribution N [0, σ 2 ∞ ]. In the non-Gaussian case and for n ≥ 400, the calculation of σ 2 ∞ is computationally prohibitive, so σ 2 ∞ has instead been approximated by the empirical variance with the corresponding densities indicated in green. As expected, the convergence for the Gaussian case is faster than for the non Gaussian case. But in both situations, the results behave nicely.
We now turn to Theorem 4 and consider the bivariate variance and range maximum likelihood estimation. That is, we consider the two-dimensional maximum likelihood estimates of (σ 2 0 , ρ 0 ) in the Gaussian case and of (2σ 2 0 , ρ 0 /2) in the non-Gaussian case. Again, we do not observe many surprises. Skewness of the empirical distribution is slightly higher compared to the single variance parameter estimation, and convergence is slightly slower, as is illustrated in Figure 2. Histograms of standardized and studentized variance maximum likelihood estimates. Blue: empirical density (kernel density estimates), red: asymptotic density, green: asymptotic density based on the empirical variance. The top row shows results for the process Z (the Gaussian case), the bottom row for the transformed process Y = Z 2 − σ 2 0 (the non-Gaussian case). The columns are for n = 100, 400, 900. All panels are based on N = 2500 replicates.
For the general setting, when estimating jointly the variance and range parameter, the asymptotic bivariate covariance matrix is challenging to compute (see Theorem 4) and thus Figure 2 illustrates the empirical densities and densities based on the empirical bivariate covariance matrix.
We now consider not only maximum likelihood estimation of the variance and range, but also cross validation estimation of the range (see the beginning of Section 5.3). We have observed that the range estimates based on cross validation are much more variable, and in many situations the maximum was attained at the (imposed) boundary. Here we used the bound [2/15, 12], i.e., smaller than the minimal distance between two observation locations and 6 (resp. 12) times the diameter of the observation points of Z (resp. Y ). Estimates at or close to the boundary indicate convergence issues and would imply a second, possibly manual, inspection. For the reported results, we eliminated all cross validation cases that yielded estimates outside [0.14, 11.4]. Figure 3 shows the mean squared error, squared bias and variance ofσ 2 ML , ρ ML andρ CV under different settings. For maximum likelihood, we consider the univariate case (one parameter is estimated while the other is known) and the bivariate case (both parameters are jointly estimated). For cross validation, only the range parameter is estimated (see the beginning of Section 5.3), and thus only the univariate case is considered. The mean squared error is dominated by the variance component. Univariate maximum likelihood estimation for Gaus- sian cases have low bias and the lowest variance (top left and right panel). Joint maximum likelihood estimation has a somewhat larger variance than individual estimation. Surprisingly, cross validation for Gaussian cases has a higher variance compared to cross validation for non-Gaussian cases.
Recall that Theorems 4, 6 and 7 show that, as n increases, the distribution of the standardized estimation error is close to a Gaussian distribution in terms of the metric d w . In Figure 4, we illustrate this by computing one-dimensional Wasserstein distances between the empirical distribution of the standardized estimation errors and Gaussian distributions. The figure shows the Wasserstein distance (p = 1) as a function of the number of observation locations n for individual parameters and for specific bivariate settings (similarly as for Figure 3). In each case, the samples have been centered around the true mean (true parameter values) and standardized by an empirical standard deviation (n-weighted average over all the samples). Their empirical distribution is compared to the standardized Gaussian distribution. The top left panel shows that the densities of the cross validation parameters are converging slowest whereas their mean squared error is comparable (see Figure 3); the densities are highly skewed and thus lead to much larger Wasserstein distances compared to the distributions of the maximum likelihood derived parameters. For the bivariate maximum likelihood estimation the marginal distributions have very similar Wasserstein distances; in the center panels: the dashed and solid colored lines  are visually hardly separable. As suggested by the individual panels of Figures 1  and 2, convergence in the Gaussian case is much faster compared to the non-Gaussian case. The right column of Figure 4 illustrates the joint asymptotic normality of the range parameter estimators by maximum likelihood and cross validation. The gray lines there illustrate Theorem 7 and are Wasserstein distances for linear combinations of the range estimates by maximum likelihood and cross validation, i.e., λρ ML + (1 − λ)ρ CV for λ = j/10, j = 1, . . . , 9. The highly skewed distribution of the cross validation-estimated range parameter for Gaussian processes is clearly visible. In the non Gaussian case, the effect of the skewness is less pronounced since the maximum likelihood is skewed as well.

Conclusion
We have shown that the covariance parameters of transformed Gaussian processes can be estimated by cross validation and Gaussian maximum likelihood, with the same rate of convergence as in the case of non-transformed Gaussian processes. In particular, Gaussian maximum likelihood works well asymptotically, despite the fact that the observations do not have a Gaussian distribution. Hence Gaussian maximum likelihood is here robust with respect to non-Gaussian data. This provides the first step of a theoretical validation of the use of Gaussian maximum likelihood in frequent cases where the data are non-Gaussian.  In future research, it would be interesting to extend the results of this paper to other classes of non-Gaussian random fields rather than only transformed Gaussian processes. In addition, the asymptotic analysis of estimators of the transformation of transformed Gaussian processes is of great interest. i = 1, . . . , q. We let w i be the mean of W i for i = 1, . . . , q. We have, for t > 1, where W ∼ N [0, 1]. From the Gaussian tail inequality, we obtain, for The function of t above is clearly summable as t → +∞. Hence, we have E[g(W )] = ∞ 0 P(g(W ) ≥ t) < +∞. Lemma 6. Let X be centered Gaussian process with covariance function k X satisfying Condition 2. Let F satisfy Condition 3. Let W be the spatial process F (X(·)) and assume that W is centered. Let (x i ) i∈N satisfy Condition 1.
Lemma 7. Consider the setting of Section 3.1, that is k Z satisfies Condition 2 and T satisfies Condition 3. For n ∈ N and i, j = 1, . . . , n we have k,l=1,...,n |Cov (y i y j , y k y l )| ≤ C sup , where C sup does not depend on n, i, j.
Proof. We let d(a, (b, c)) = min(|a − b|, |a − c|) for a, b, c ∈ R d . It is enough to show that |Cov(y i y j , y k y l )| ≤ C sup exp − C inf max d(s k , (s i , s j )), d(s l , (s i , s j )) . (13) Indeed, let, for t ≥ 0, N i,j,t be the number of pairs (k, l), with 1 ≤ k, l ≤ n such that t ≤ max(d(s k , (s i , s j )), d(s l , (s i , s j ))) ≤ t + 1.
From Condition 1, we can show that we have sup n∈N,i,j=1,...,n N i,j,t ≤ C sup t 2d . Hence, we have Thus, (13) implies the result of the lemma and it suffices to prove (13). Let i, j, k, l ∈ {1, . . . , n} and let ∆ = max(d(s k , (s i , s j )), d(s l , (s i , s j ))). By symmetry, we can consider that d(s k , (s i , s j )) = ∆.
If |s k − s l | ≤ |s i − s l | and |s k − s l | ≤ |s j − s l |, then |s i − s l | ≥ ∆/2 and |s j − s l | ≥ ∆/2. Hence, we can apply Lemma 6 with distance ∆/2 to obtain | Cov(y k y l , y i y j )| ≤ C sup exp(−C inf ∆/2), where C sup and C inf do not depend on n, i, j, k, l, ∆.
If |s i − s l | ≤ |s k − s l | and |s i − s l | ≤ |s j − s l |, then |s k − s l | ≥ ∆/2. We then have since it is assumed that Y has zero-mean. In (14), the first and third covariances are bounded in absolute value by C sup exp(−C inf ∆/2) from Lemma 6, because |s k − s l | ≥ ∆/2, |s k − s i | ≥ ∆/2 and |s k − s j | ≥ ∆/2. Hence we have | Cov(y k y l , y i y j )| ≤ C sup exp(−C inf ∆/2), where C sup and C inf do not depend on n, i, j, k, l, ∆.
If |s j − s l | ≤ |s k − s l | and |s j − s l | ≤ |s i − s l |, we obtain the same bound by symmetry. We have thus considered all possible cases and the proof of (13) is concluded.
In the context of Theorem 2, the following lemma provides an approximation of V n , based on replacing A by a sparse matrix. We remark that a similar approximation was shown in a time series context in [32]. Nevertheless, we find that our assumptions on the random field Y are more transparent and interpretable than the assumptions in [32], where cumulants are used. Because of these differences of assumptions, our proof of the following lemma differs from that in [32]. Proof. For any K, n ∈ N we have n Var 1 n y Ay − 1 n y A (K) y = 1 n n i,j,k,l=1 imsart-generic ver. 2014/10/16 file: nonGaussRF.tex date: November 27, 2019 We observe that |(A−A (K) ) k,l | is equal to 0 or is smaller than C sup /(1+K d+C inf ) by assumption. Hence we have where we have used Lemma 7 and where we have observed that |(A − A (K) ) i,j | is equal to 0 or is smaller than C sup / 1 + |s i − s j | d+C inf . We have also used Lemma 4 in [23] for the last inequality above. All the above constants C sup and C inf naturally do not depend on n, so the lemma is proved.
We have  where the different C sup and C inf do not depend on b, J and ∆. This concludes the proof from (16).
Lemma 10. Consider a sequence (x i ) i∈N of points in R d satisfying Condition 1. Let τ > 0 be fixed. For n ∈ N, let (A θ ) θ∈Θ and (B θ ) θ∈Θ be families of n × n matrices. Assume that for all n ∈ N, i, j = 1, . . . , n and θ ∈ Θ, where C sup does not depend on n, i, j, θ. Then we have for all n ∈ N, i, j = 1, . . . , n and θ ∈ Θ, where C sup does not depend on n, i, j, θ.
Lemma 12. Consider the setting of Section 5.1. Under Conditions 4 and 5, we have, for n ∈ N and i, j ∈ {1, . . . , n}, where C sup and C inf do not depend on n, i, j, θ.
Proof. One can show that the proof of Theorem 1 can be made uniform over θ ∈ Θ, thus yielding Lemma 12.

A.2. Proofs of the main results
Proof of Lemma 1. As a special case of Lemma 6, k satisfies Condition 2 i).
Let us now show that k satisfies Condition 2 ii). Let (x i ) satisfy Condition 1. Let n ∈ N be fixed and let R be the n × n covariance matrix k(x i − x j ) i,j=1,...,n . Let a 1 , . . . , a n ∈ R. We have n i,j=1 We now let z = (X(x 1 ), . . . , X(x n )) and g : R n → R be defined by g(t) = n i=1 a i F (t i ). The gradient of g at t is ∇g(t) = (a 1 F (t 1 ), . . . , a n F (t n )) . We use the inequality in Theorem 3.7 in [16]. This yields n i,j=1 From Condition 2 ii), we have λ 1 (Cov(z)) ≥ C inf . This yields From Condition 3, the above expectation is non-zero, which concludes the proof.
Proof of Lemma 2. The fact that k satisfies Condition 2 ii) follows from Theorem 4 in [11].
Let us now consider the case where F is defined by F (x) = x 2r + u. Since we consider a covariance function we can assume that u = 0 without loss of generality. Assume also that k(0) = 1 without loss of generality. From Lemma 6, k satisfies Condition 2 i). Let us show that Condition 2 ii) is satisfied. Let By independence of A 1 and A 2 , we obtain, for i = 0, . . . , 2r, From Isserlis' theorem, one can show that (22) is zero if i is odd and is strictly positive if i is even. As a consequence, we have with α 1 , . . . , α r > 0. Hence, the Fourier transform of k is a linear combination of multiple convolutions of the Fourier transform of k, with strictly positive components. Since the Fourier transform of k is strictly positive everywhere, then also the Fourier transform of k is strictly positive everywhere. Hence, from Theorem 4 in [11], k satisfies Condition 2 ii).
Proof of Theorem 1. Condition 1 and Lemma 6 in [23] imply that the spectral norms of R −1 and R are bounded functions of n. Let C sup = sup n λ 1 (R) < ∞ and C inf = inf n λ n (R) > 0. We write We remark that the above sum is well-defined because the eigenvalues of I − R/C sup are between 0 and 1 − C inf /C sup . We denote M = I − C −1 sup R and h i,j = |s i − s j |. Let 1 ≤ A < ∞ and a > 0 be fixed such that M i,j ≤ Ae −2ahi,j . Let δ = inf i,j∈N,i =j |s i − s j | > 0 (Condition 1).
Let D < ∞ be a constant such that D ≥ 1 and, for any L ≥ δ and i ∈ N, the set {j ∈ N; |s i − s j | ≤ L} has no more than (D/2)L d elements.
Let 0 < µ < ∞ be fixed. We show by induction over ∈ N that there exists a constant 1 ≤ ϕ < ∞, depending on µ but not depending on , i, j, such that for ≤ µ log(h i,j ), In the case log(h i,j ) < 0, there is nothing to prove in (24), so we consider i, j such that log(h i,j ) ≥ 0 when proving (24). For = 1, (24) holds. Assume that (24) holds for some ∈ N. We have From the triangle inequality we obtain The constant Q is finite and depends only on d and δ from Condition 1. We also let ≤ µ log(h i,j ) to show the last above inequality. Hence, in order to finish the proof of (24), it remains to show that the term (.) in the above display is a bounded function of h i,j , and to let ϕ/2 ≥ 1 be a bound for the term (. The above function of h i,j clearly goes to 0 as h i,j goes to ∞. Thus, the above term (.) is bounded and thus (24) is proved.
Coming back to (23), using (24) and using the triangle inequality, we obtain, letting ∆ = 1 − C inf /C sup , and for h i,j large enough, In the above display, for any τ < ∞ in the statement of Theorem 1, we can choose µ such that µ log(∆) ≤ −d − τ . Then, it is clear that the first summand in (25) is also smaller than a constant (depending on τ ) time h −d−τ i,j . This concludes the proof of Theorem 1, since also sup n∈N max i,j=1,...,n |(R −1 ) i,j | is bounded by C sup .
Proof of Theorem 2. We have n Var(V n ) = 1 n n i,j,k,l=1 A i,j A k,l Cov(y i y j , y k y l ).
From Lemma 7, we obtain from Lemma 4 in [23]. Hence, n Var(V n ) is bounded as n → ∞.
To reach a contradiction, we will show that To simplify notations in the sequel, without loss of generality, we will consider that φ(n) = n and show that For K ≥ 0, let with the notation Lemma 8. We have n Var(V as n → ∞, in order to prove (31) and thus to conclude the proof. We remark that, because of Lemma 8, we have lim inf n→∞ n Var(V (K) n ) − lim inf n→∞ n Var(V n ) goes to 0 as K → ∞. Hence, we may take L such that lim inf n→∞ n Var(V (K) n ) > 0 for K ≥ L. Hence, up to extracting a subsequence, it is sufficient to show √ n V (K) where n Var(V X n (s i ), say, where X n can be interpreted as a centered random field defined on (s i ) i∈N .
We now let D = (s i ) i∈N , D n = (s 1 , . . . , s n ), Z i,n = n −1/2 X n (s i ) for n ∈ N and i = 1, . . . , n. We also let c i,n = n −1/2 for n ∈ N and i = 1, . . . , n. We remark that Z i,n /c i,n can be written as f (w) where w is a Gaussian vector of dimension less than N K , with variances 1 and where |f (x)| ≤ C sup e Csup|x| , where C sup does not depend on n ∈ N and i = 1, . . . , n. One can thus show, from the Cauchy-Schwarz inequality and with the same techniques as in Lemma 5, that for any q > 0 lim M →+∞ sup n∈N max i=1,...,n E |Z i,n /c i,n | 2+q 1 |Zi,n/ci,n|≥M = 0.
With the previous notation and with (33), one can show that all the assumptions of Corollary 1 in [26] are satisfied. This shows (32) and thus concludes the proof.
Hence, Theorem 3 can be proved by proceeding as in the proof of Proposition 3.1 in [7].
One can check that the mean value of ∂L θ0 /∂θ i ∂θ j is (M θ0 ) i,j (also because this mean value is calculated as if Y were a Gaussian process with zero-mean and covariance function k Y,θ0 ).
For i = 1, . . . , p − 1, As in the proof of Theorem 3, GCT and Lemma 4 in [23] lead us to sup ψ∈S λ 1 (A ψ,i A ψ,i ) ≤ C sup , which in turn implies sup ψ∈S ρ 1 (A ψ,i ) ≤ C sup . It follows that Hence, Theorem 5 can also be proved by proceeding as in the proof of Proposition 3.4 in [7].
It can be shown, similarly as in the proof of Proposition 3.7 in [7] that lim inf n→∞ λ p−1 (N ψ0 ) > 0.
From the proofs of Theorems 4 and 6 (see also the proof of Proposition D.10 in [7] that is referred to there), we know Also, from Condition 4 and Lemma 12 used inside Lemma 10, we have for i = 1, . . . , p, where c θ0 ∈ R is deterministic and, for i = 1, . . . , p − 1, where A θ,i is a n × n symmetric matrix satisfying sup θ∈Θ |(A θ,i ) a,b | ≤ C sup /(1 + |s a − s b | d+C inf ) and B ψ,i is a n × n matrix satisfying sup ψ∈S |(B ψ,i ) a,b | ≤ C sup /(1 + |s a − s b | d+C inf ). As in the proofs of Theorems 4 and 6, one can check that ∂L θ0 /∂θ i has mean zero for i = 1, . . . , p and ∂CV ψ0 /∂ψ i has mean zero for i = 1, . . . , p − 1. Thus, we can rewrite As the vectors λ and γ as well as the matrices M −1 θ0 and N −1 ψ0 are fixed, the bound holds for all k, l ∈ {1, . . . , n}.
The variance can be written as Hence, by applying product by blocks we get the matrix form expression n Var(J n ) = λ , γ D −1 θ0 Ψ θ0 D −1 θ0 λ γ .
We conclude the proof by applying the Wald Theorem.