Maximum likelihood estimation for Gaussian processes under inequality constraints

We consider covariance parameter estimation for a Gaussian process under inequality constraints (boundedness, monotonicity or convexity) in fixed-domain asymptotics. We address the estimation of the variance parameter and the estimation of the microergodic parameter of the Matérn and Wendland covariance functions. First, we show that the (unconstrained) maximum likelihood estimator has the same asymptotic distribution, unconditionally and conditionally to the fact that the Gaussian process satisfies the inequality constraints. Then, we study the recently suggested constrained maximum likelihood estimator. We show that it has the same asymptotic distribution as the (unconstrained) maximum likelihood estimator. In addition, we show in simulations that the constrained maximum likelihood estimator is generally more accurate on finite samples. Finally, we provide extensions to prediction and to noisy observations.


Introduction
Kriging (Stein, 1999; Rasmussen and Williams, 2006) consists in inferring the values of a Gaussian random field given observations at a finite set of points. It has become a popular method for a large range of applications, such as geostatistics (Matheron, 1970), numerical code approximation (Sacks et al., 1989; Santner et al., 2003; Bachoc et al., 2016) and calibration (Paulo et al., 2012; Bachoc et al., 2014), global optimization (Jones et al., 1998), and machine learning (Rasmussen and Williams, 2006).
When considering a Gaussian process, one has to deal with the estimation of its covariance function. Usually, it is assumed that the covariance function belongs to a given parametric family (see Abrahamsen, 1997, for a review of classical families). In this case, the estimation boils down to estimating the corresponding covariance parameters. The main estimation techniques are based on maximum likelihood (Stein, 1999), cross-validation (Zhang and Wang, 2010; Bachoc, 2013, 2014a) and variation estimators (Istas and Lang, 1997; Anderes, 2010; Azaïs et al., 2018).
In this paper, we address maximum likelihood estimation of covariance parameters under fixed-domain asymptotics (Stein, 1999). The fixed-domain asymptotics setting corresponds to observation points for the Gaussian process that become dense in a fixed bounded domain. Under fixed-domain asymptotics, two types of covariance parameters can be distinguished: microergodic and non-microergodic parameters (Ibragimov and Rozanov, 1978; Stein, 1999). A covariance parameter is said to be microergodic if, when it takes two different values, the two corresponding Gaussian measures are orthogonal (Ibragimov and Rozanov, 1978; Stein, 1999). It is said to be non-microergodic if, even for two different values, the corresponding Gaussian measures are equivalent. Although non-microergodic parameters cannot be estimated consistently, they have an asymptotically negligible impact on prediction (Stein, 1988, 1990a; Zhang, 2004). On the contrary, it is at least possible to consistently estimate microergodic covariance parameters, and misspecifying them can have a strong negative impact on predictions. Now consider a triangular array $(x_i^{(n)})_{n \in \mathbb{N},\, i=1,\ldots,n}$ of observation points in $[0,1]^d$, where we write for concision $(x_1, \ldots, x_n) = (x_1^{(n)}, \ldots, x_n^{(n)})$. We assume that $(x_i^{(n)})$ is dense, that is, $\sup_{x \in [0,1]^d} \inf_{i=1,\ldots,n} |x - x_i^{(n)}| \to 0$ as $n \to \infty$. Let $y$ be the Gaussian vector defined by $y_i = Y(x_i)$ for $i = 1, \ldots, n$. For $\theta \in \Theta$, let $R_\theta = [k_\theta(x_i - x_j)]_{1 \le i,j \le n}$ and let
\[
L_n(\theta) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln(|R_\theta|) - \frac{1}{2} y^\top R_\theta^{-1} y \tag{1}
\]
be the log-likelihood function. Here, $|R_\theta|$ stands for $\det(R_\theta)$. Maximizing $L_n(\theta)$ with respect to $\theta$ yields the widely studied and applied MLE (Santner et al., 2003; Stein, 1999; Ying, 1993; Zhang, 2004).
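As a concrete illustration of how (1) can be evaluated, the following minimal Python sketch computes the Gaussian log-likelihood from an observation vector and a covariance matrix via a Cholesky factorization. This is our own illustration (the function name and the use of Python are choices made here, not part of the paper, whose numerical experiments rely on an R package).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_log_likelihood(y, R):
    """Gaussian log-likelihood L_n = -n/2 ln(2*pi) - 1/2 ln|R| - 1/2 y^T R^{-1} y."""
    n = len(y)
    c, low = cho_factor(R)                      # Cholesky factor of the covariance matrix R_theta
    log_det = 2.0 * np.sum(np.log(np.diag(c)))  # ln|R_theta| from the Cholesky diagonal
    quad = y @ cho_solve((c, low), y)           # y^T R_theta^{-1} y without forming the inverse
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)
```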
In this paper, we assume that the information {Y ∈ E_κ} is available, where E_κ is a convex set of functions defined by inequality constraints. We will consider the sets E_0, E_1 and E_2 of functions that are, respectively, bounded between ℓ and u, increasing, and convex, which correspond to boundedness, monotonicity and convexity constraints respectively. For E_0, the bounds $-\infty \le \ell < u \le +\infty$ are fixed and known.
First, we will study the conditional asymptotic distribution of the (unconstrained) MLE obtained by maximizing (1), given {Y ∈ E_κ}. Nevertheless, a drawback of this MLE is that it does not exploit the information {Y ∈ E_κ}. Then we study the cMLE introduced in (López-Lopera et al., 2018). This estimator is obtained by maximizing the logarithm of the probability density function of y, conditionally to {Y ∈ E_κ}, with respect to the probability measure P_θ on Ω. This logarithm of conditional density is given by
\[
L_{n,c}(\theta) = L_n(\theta) - \ln P_\theta(Y \in E_\kappa) + \ln P_\theta(Y \in E_\kappa \mid y) = L_n(\theta) + A_n(\theta) + B_n(\theta), \tag{2}
\]
say, where $P_\theta(\cdot)$ and $P_\theta(\cdot \mid \cdot)$ are defined in Section 2.2. In (López-Lopera et al., 2018), the cMLE is studied and compared to the MLE. The authors show that the cMLE is consistent when the MLE is. In this paper, we aim at providing more quantitative results regarding the asymptotic distribution of the MLE and the cMLE, conditionally to {Y ∈ E_κ}.
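The two correction terms in (2) involve Gaussian orthant-type probabilities that are rarely available in closed form. A crude but simple way to approximate them is by Monte Carlo over discretized paths, as in the illustrative Python sketch below (our own sketch, under the assumption that discretized sample paths are available, not the procedure of the paper): applied to paths drawn under $P_\theta$ it estimates $\exp(-A_n(\theta)) = P_\theta(Y \in E_\kappa)$, and applied to paths drawn from the conditional distribution given $y$ it estimates $\exp(B_n(\theta)) = P_\theta(Y \in E_\kappa \mid y)$.

```python
import numpy as np

def fraction_satisfying_bounds(paths, lower=-np.inf, upper=np.inf):
    """Monte Carlo estimate of a boundedness probability from discretized sample paths.

    paths: array of shape (n_samples, n_grid); each row is one path on a grid.
    Returns the fraction of paths lying entirely inside [lower, upper]."""
    inside = np.all((paths >= lower) & (paths <= upper), axis=1)
    return inside.mean()

def fraction_monotone(paths):
    """Monte Carlo estimate of a monotonicity probability (nondecreasing paths on a 1D grid)."""
    return np.all(np.diff(paths, axis=1) >= 0.0, axis=1).mean()
```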

Notation
In the paper, $0 < c < +\infty$ stands for a generic constant that may differ from one line to another. It is convenient to have short expressions for terms that converge in probability to zero. Following (van der Vaart, 1998), the notation $o_P(1)$ (respectively $O_P(1)$) stands for a sequence of random variables (r.v.'s) that converges to zero in probability (resp. is bounded in probability) as $n \to \infty$. More generally, for a sequence of r.v.'s $R_n$, $X_n = o_P(R_n)$ means $X_n = Y_n R_n$ with $Y_n \to 0$ in probability, and $X_n = O_P(R_n)$ means $X_n = Y_n R_n$ with $Y_n = O_P(1)$. For deterministic sequences $X_n$ and $R_n$, the stochastic notation reduces to the usual $o$ and $O$. For a sequence of random vectors or variables $(X_n)_{n \in \mathbb{N}}$ on $\mathbb{R}^l$, that are functions of $Y$, and for a probability distribution $\mu$ on $\mathbb{R}^l$, we write that $X_n$ converges in distribution to $\mu$ conditionally to $\{Y \in E_\kappa\}$ when, for any bounded continuous function $g : \mathbb{R}^l \to \mathbb{R}$, we have $E[g(X_n) \mid Y \in E_\kappa] \to \int_{\mathbb{R}^l} g \, d\mu$ as $n \to \infty$. We also write $X_n = o_{P|Y \in E_\kappa}(1)$ when for all $\varepsilon > 0$ we have $P(|X_n| \ge \varepsilon \mid Y \in E_\kappa) \to 0$ as $n \to \infty$. Finally, we write $X_n = O_{P|Y \in E_\kappa}(1)$ when $\limsup_{n \to \infty} P(|X_n| \ge K \mid Y \in E_\kappa) \to 0$ as $K \to \infty$.
For any two functions f (Y ) and g(Y ), let E θ [f (Y )] (respectively E θ [f (Y )|g(Y )]) be the expectation (resp. the conditional expectation) with respect to the measure P θ on Ω. We define similarly P θ (A(Y )) and P θ (A(Y )|g(Y )) when A(Y ) is an event with respect to Y . Let θ 0 ∈ Θ be fixed. We consider θ 0 as the true unknown covariance parameter and we let E[·], E[·|·], P(·), and P(·|·) be shorthands for E θ0 [·], E θ0 [·|·], P θ0 (·), and P θ0 (·|·). When a quantity is said to converge, say, in probability or almost surely, it is also implicit that we consider the measure P θ0 on Ω.

Conditions on the observation points
In some cases, we will need to assume that as n → ∞, the triangular array of observation points contains finer and finer tensorized grids.
Condition-Grid. There exist $d$ sequences $(v_i^{(j)})_{i \in \mathbb{N}}$ for $j = 1, \ldots, d$, dense in $[0,1]$, and so that for all $N \in \mathbb{N}$, there exists $n_0 \in \mathbb{N}$ such that for $n \ge n_0$, we have $\{(v_{i_1}^{(1)}, \ldots, v_{i_d}^{(d)});\, 1 \le i_1, \ldots, i_d \le N\} \subset \{x_1, \ldots, x_n\}$.

In our opinion, Condition-Grid is reasonable and natural. Its purpose is to guarantee that the partial derivatives of Y are consistently estimable from y everywhere on $[0,1]^d$ (see, for instance, the proofs of Theorems 3.2 and 3.3 for κ = 1 in the appendix). We believe that, for the results for which Condition-Grid is assumed, one could replace it by a milder condition and prove similar results. The proofs would then be based on essentially the same ideas as the current ones, but could be more cumbersome.
In some other cases, we only need to assume that the observation points constitute a sequence.
Condition-Sequence. For all $n \in \mathbb{N}$ and $i \le n$, we have $x_i^{(n)} = x_i^{(n+1)}$. Condition-Sequence implies that sequences of conditional expectations with respect to the observations are martingales. This condition is necessary in some of the proofs (for instance, that of Theorem 3.3) where convergence results for martingales are used.

Model and assumptions
In this section, we focus on the estimation of a single variance parameter when the correlation function is known. Hence, we let $p = 1$, $\theta = \sigma^2$, and, for $\sigma^2 \in \Theta$, $k_{\sigma^2} = \sigma^2 k_1$, where $k_1$ is a fixed known correlation function, so that the covariance function of Y under $P_{\sigma^2}$ is $\bar{k}_{\sigma^2}$ defined by $\bar{k}_{\sigma^2}(u, v) = \sigma^2 k_1(u - v)$. We define the Fourier transform of a function $h : \mathbb{R}^d \to \mathbb{R}$ by $\hat{h}(f) = (2\pi)^{-d} \int_{\mathbb{R}^d} h(t) e^{-\mathrm{i} f^\top t} \, dt$, where $\mathrm{i}^2 = -1$, and we make the following assumption (Condition-Var).
- If κ = 0, $k_1$ is α-Hölder, which means that there exist non-negative constants $c$ and $\alpha$ such that for all $t$ and $t'$ in $\mathbb{R}^d$, $|k_1(t) - k_1(t')| \le c\, \|t - t'\|^\alpha$, where $\|\cdot\|$ is the Euclidean norm. Furthermore, the Fourier transform $\hat{k}_1$ of $k_1$ satisfies (4) for some fixed $P < \infty$.
- If κ = 1, the Gaussian process Y is differentiable in quadratic mean. For $i = 1, \ldots, d$, let $k_{1,i} = -\partial^2 k_1 / \partial x_i^2$. Remark that the covariance function of $\partial Y / \partial x_i$ is given by $\bar{k}_{1,i}$ defined by $\bar{k}_{1,i}(u, v) = k_{1,i}(u - v)$. Then $k_{1,i}$ is α-Hölder for a fixed α > 0. Also, (4) holds with $\hat{k}_1$ replaced by the Fourier transform $\hat{k}_{1,i}$ of $k_{1,i}$ for $i = 1, \ldots, d$.
These assumptions make the conditioning by {Y ∈ E_κ} valid for κ = 0, 1, 2, as established in the following lemma.

Lemma 3.1. Assume that Condition-Var holds. Then for all κ ∈ {0, 1, 2} and for any compact $K$ in $(0, +\infty)$, we have $\inf_{\sigma^2 \in K} P_{\sigma^2}(Y \in E_\kappa) > 0$.

Proof of Lemma 3.1. It suffices to follow the same lines as in the proof of (López-Lopera et al., 2018, Lemma A.6), noticing that Condition-Var implies the conditions of (López-Lopera et al., 2018, Lemma A.6) (see the discussion in López-Lopera et al., 2018).

Asymptotic conditional distribution of the constrained maximum likelihood estimator
Here, we assume that the compact set Θ is $[\sigma_l^2, \sigma_u^2]$ with $0 < \sigma_l^2 < \sigma_0^2 < \sigma_u^2 < +\infty$, and we consider the cMLE $\hat{\sigma}^2_{n,c}$ of $\sigma_0^2$ obtained by maximizing the constrained log-likelihood in (2) over the compact set Θ:
\[
\hat{\sigma}^2_{n,c} \in \operatorname*{argmax}_{\sigma^2 \in \Theta} L_{n,c}(\sigma^2). \tag{7}
\]
Now we show that the conditional asymptotic distribution of the cMLE is the same as the asymptotic distribution of the MLE.

Theorem 3.3. For κ = 1, 2, we assume that Condition-Grid holds. For κ = 0, 1, 2, under Condition-Var and Condition-Sequence, the cMLE $\hat{\sigma}^2_{n,c}$ of $\sigma_0^2$ defined in (7) is asymptotically Gaussian distributed, conditionally to {Y ∈ E_κ}. More precisely, $\sqrt{n}\,(\hat{\sigma}^2_{n,c} - \sigma_0^2)$ converges in distribution to $\mathcal{N}(0, 2\sigma_0^4)$, conditionally to {Y ∈ E_κ}.

Microergodic parameter estimation for the isotropic Matérn model

Model and assumptions
In this section, we let d = 1, 2 or 3 and we consider the isotropic Matérn family of covariance functions on $\mathbb{R}^d$. We refer to, e.g., (Stein, 1999) for more details. Here $k_\theta = k_{\theta,\nu}$ is given by, for $x \in [0,1]^d$,
\[
k_{\theta,\nu}(x) = \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\|x\|}{\rho} \right)^{\nu} K_\nu\!\left( \frac{\|x\|}{\rho} \right).
\]
The Matérn covariance function is given by $\bar{k}_{\theta,\nu}(u, v) = k_{\theta,\nu}(u - v)$. The parameter $\sigma^2 > 0$ is the variance of the process, $\rho > 0$ is the correlation length parameter that controls how fast the covariance function decays with the distance, and $\nu > 0$ is the regularity parameter of the process. The function $K_\nu$ is the modified Bessel function of the second kind of order ν (see Abramowitz and Stegun, 1964). We assume in the sequel that the smoothness parameter ν is known. Then $\theta = (\sigma^2, \rho)$ and $p = 2$.
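For numerical illustration, the parametrization above can be implemented as follows in Python. This is an illustrative sketch with our own function name; other references use a rescaled correlation length (e.g., with a $\sqrt{2\nu}$ factor, or the form recalled in Section 5), which differs only by a reparametrization of ρ.

```python
import numpy as np
from scipy.special import kv, gamma

def matern_cov(t, sigma2, rho, nu):
    """Isotropic Matern covariance k_{theta,nu}(t) for distances t >= 0."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    k = np.full(t.shape, float(sigma2))       # limit value sigma^2 at t = 0
    pos = t > 0
    z = t[pos] / rho
    k[pos] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
    return k
```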
We remark that Condition-ν naturally implies Condition-Var so that the conditioning by {Y ∈ E κ } is valid for any κ = 0, 1, 2 as established in the next lemma. We refer to (Stein, 1999) for a reference on the impact of ν on the smoothness of the Matérn function k θ,ν and on its Fourier transform.

Asymptotic conditional distribution of the maximum likelihood estimator
The log-likelihood function in (1) for $\sigma^2$ and $\rho$ under the Matérn model with fixed parameter ν can be written as
\[
L_n(\sigma^2, \rho) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2} \ln(|R_{\rho,\nu}|) - \frac{1}{2\sigma^2} y^\top R_{\rho,\nu}^{-1} y,
\]
where $R_{\rho,\nu} = [k_{1,\rho,\nu}(x_i - x_j)]_{1 \le i,j \le n}$ is the correlation matrix obtained with $\sigma^2 = 1$. Then the MLE is given by $(\hat{\sigma}^2_n, \hat{\rho}_n) \in \operatorname*{argmax}_{(\sigma^2, \rho) \in \Theta} L_n(\sigma^2, \rho)$. It has been shown in (Zhang, 2004) that the parameters $\sigma_0^2$ and $\rho_0$ cannot be estimated consistently, but that the microergodic parameter $\sigma_0^2 / \rho_0^{2\nu}$ can. Furthermore, it is shown in (Kaufman and Shaby, 2013) that $\sqrt{n}\,(\hat{\sigma}^2_n / \hat{\rho}_n^{2\nu} - \sigma_0^2 / \rho_0^{2\nu})$ converges to a $\mathcal{N}(0,\, 2(\sigma_0^2 / \rho_0^{2\nu})^2)$ distribution. In the next theorem, we show that this asymptotic normality also holds conditionally to {Y ∈ E_κ}.
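In practice, the joint maximization is typically carried out by profiling out the variance: for each fixed ρ, the maximizer in $\sigma^2$ of the log-likelihood above has the closed form recalled below (a standard computation, not reproduced verbatim from the paper), so that only a one-dimensional search over ρ remains.

```latex
\hat{\sigma}^2_n(\rho) = \frac{1}{n}\, y^\top R_{\rho,\nu}^{-1} y,
\qquad
\hat{\rho}_n \in \operatorname*{argmax}_{\rho}\; L_n\!\left(\hat{\sigma}^2_n(\rho),\, \rho\right),
\qquad
\hat{\sigma}^2_n = \hat{\sigma}^2_n(\hat{\rho}_n).
```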

Asymptotic conditional distribution of the constrained maximum likelihood estimator
We turn to the constrained log-likelihood and its maximizer. We consider two types of estimation settings obtained by maximizing the constrained log-likelihood (2) under the Matérn model. In the first setting, ρ = ρ 1 is fixed and (2) is maximized over σ 2 (in the case ρ 1 = ρ 0 this setting is already covered by Theorem 3.3). In the second setting, (2) is maximized over both σ 2 and ρ. Under the two settings, we show that the cMLE has the same asymptotic distribution as the MLE, conditionally to {Y ∈ E κ }.
(ii) For κ = 1, 2, assume that one of the following two conditions holds.
We have ν > κ + 2 and the observation points $\{x_1, \ldots, x_n\}$ are such that, for all $n \ge 2^d$, with $N = n^{1/d}$,

Then $\hat{\sigma}^2_{n,c} / \hat{\rho}_{n,c}^{2\nu}$ is asymptotically Gaussian distributed for κ = 0, 1, 2. More precisely, $\sqrt{n}\,(\hat{\sigma}^2_{n,c} / \hat{\rho}_{n,c}^{2\nu} - \sigma_0^2 / \rho_0^{2\nu})$ converges in distribution to $\mathcal{N}(0,\, 2(\sigma_0^2/\rho_0^{2\nu})^2)$, conditionally to {Y ∈ E_κ}.

In Theorem 4.4, we assume that ν is larger than in Condition-ν, and we assume that the observation points have specific quantitative space-filling properties. The condition (i) b) also implies that a portion of the observation points are located in the corners and borders of $[0,1]^d$. Furthermore, the condition (ii) b) implies that the majority of the observation points are located on regular grids. We believe that these two last conditions could be replaced by milder ones, at the cost of proofs similar to, but more cumbersome than, the present ones.
We make stronger assumptions in Theorem 4.4 than in Theorem 4.3 because the former is more challenging than the latter. Indeed, since $\rho = \rho_1$ is fixed in Theorem 4.3, we can use the equivalence of two fixed Gaussian measures in order to obtain asymptotic properties of the conditional mean function of Y under $k_{1,\rho_1,\nu}$ (see the developments following (35) in the proofs). This is not possible anymore when considering the conditional mean function of Y under $k_{1,\hat{\rho}_{n,c},\nu}$, where $\hat{\rho}_{n,c}$ is random. Hence, we use other proof techniques, based on reproducing kernel Hilbert spaces, for studying this conditional mean function, for which the above additional conditions are needed. We refer for instance to the developments following (40) in the appendix for more details.

Numerical results
In this section, we illustrate numerically the conditional asymptotic normality of the MLE and the cMLE of the microergodic parameter for the Matérn 5/2 covariance function. The numerical experiments were implemented using the R package "LineqGPR" (López-Lopera, 2018).

Experimental settings
We let d = 1 in the rest of the section. Since the event {Y ∈ E_κ} cannot be simulated exactly in practice, we consider the piecewise affine interpolation $Y_m$ of Y at $t_1, \ldots, t_m \in [0, 1]$, with m > n (Maatouk and Bay, 2017; López-Lopera et al., 2018). Then, the event {Y ∈ E_κ} is approximated by the event {Y_m ∈ E_κ}, where E_0 (respectively E_1, E_2) is the set of continuous functions bounded between ℓ and u (resp. the set of increasing functions, of convex functions). We can efficiently simulate $Y_m$ conditionally to {Y_m ∈ E_κ} by using Markov chain Monte Carlo procedures (see, for instance, Pakman and Paninski, 2014).
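The practical appeal of the piecewise affine approximation is that $Y_m$ is determined by the finite Gaussian vector $(Y(t_1), \ldots, Y(t_m))$, and the functional constraints reduce to finitely many linear inequalities on that vector (for instance, nondecreasing knot values for monotonicity). The Python sketch below illustrates the hat-function representation; it is our own illustration under these assumptions, not code from the LineqGPR package.

```python
import numpy as np

def hat_basis(x, knots):
    """Evaluate the m piecewise-affine hat functions phi_j (phi_j(t_i) = 1 if i = j, else 0)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    m = len(knots)
    phi = np.zeros((len(x), m))
    for j in range(m):
        if j > 0:      # rising part on [t_{j-1}, t_j]
            mask = (x >= knots[j - 1]) & (x <= knots[j])
            phi[mask, j] = (x[mask] - knots[j - 1]) / (knots[j] - knots[j - 1])
        if j < m - 1:  # falling part on [t_j, t_{j+1}]
            mask = (x >= knots[j]) & (x <= knots[j + 1])
            phi[mask, j] = (knots[j + 1] - x[mask]) / (knots[j + 1] - knots[j])
    return phi

def piecewise_affine_interpolant(x, knots, values):
    """Y_m(x) = sum_j Y(t_j) phi_j(x); Y_m is increasing on [0, 1] iff 'values' is nondecreasing."""
    return hat_basis(x, knots) @ values
```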
In Section 5, we consider the Matérn 5/2 function defined by
\[
k_{\theta,5/2}(x) = \sigma^2 \left( 1 + \frac{\sqrt{5}\,|x|}{\rho} + \frac{5 x^2}{3 \rho^2} \right) \exp\!\left( -\frac{\sqrt{5}\,|x|}{\rho} \right),
\]
for $x \in \mathbb{R}$ and with $\theta = (\sigma^2, \rho)$. Remark that $k_{\theta,5/2}$ is obtained by the parametrization of (Roustant et al., 2012; López-Lopera, 2018) rather than that of Section 4.1. For ease of reading, we keep the same notation.
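To illustrate how the (unconstrained) MLE of the microergodic parameter can be computed in this setting, the following self-contained Python sketch simulates a Matérn 5/2 sample path at n points and maximizes the profile likelihood over ρ on a grid. This is our own illustrative script under the parameter values $(\sigma_0^2, \rho_0) = (0.5^2, 1)$ used in the experiments; the experiments in the paper rely on the R package LineqGPR and also compute the constrained estimator.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def matern52_corr(h, rho):
    """Matern 5/2 correlation in the parametrization above (variance factored out)."""
    a = np.sqrt(5.0) * np.abs(h) / rho
    return (1.0 + a + a ** 2 / 3.0) * np.exp(-a)

def profile_mle(x, y, rho_grid, jitter=1e-10):
    """Maximize the likelihood over (sigma^2, rho) by profiling out sigma^2."""
    n = len(y)
    best = None
    for rho in rho_grid:
        R = matern52_corr(x[:, None] - x[None, :], rho) + jitter * np.eye(n)
        c, low = cho_factor(R)
        sigma2 = y @ cho_solve((c, low), y) / n            # profiled variance estimate
        log_det = 2.0 * np.sum(np.log(np.diag(c)))
        ll = -0.5 * (n * np.log(sigma2) + log_det + n)     # profile log-likelihood (up to constants)
        if best is None or ll > best[0]:
            best = (ll, sigma2, rho)
    return best[1], best[2]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 50))
R0 = 0.5 ** 2 * matern52_corr(x[:, None] - x[None, :], 1.0) + 1e-10 * np.eye(len(x))
y = np.linalg.cholesky(R0) @ rng.standard_normal(len(x))   # sample with sigma_0^2 = 0.25, rho_0 = 1
sigma2_hat, rho_hat = profile_mle(x, y, np.linspace(0.3, 3.0, 100))
print(sigma2_hat / rho_hat ** 5)  # MLE of the microergodic parameter sigma^2 / rho^(2*nu), nu = 5/2
```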
In Figure 2, we report the same quantities for κ = 1 (monotonicity constraints) and for (σ 2 0 , ρ 0 ) = (0.5 2 , 1). In this case, we observe that the distributions of both the MLE and the cMLE are close to the limit one even for small values of n (n = 5, 20).

Results on prediction
In the next proposition, we show that, conditionally to the inequality constraints, the predictions obtained when taking the constraints into account are asymptotically equal to the standard (unconstrained) Kriging predictions. Furthermore, the same is true when comparing the conditional variances obtained with and without accounting for the constraints.

Proposition 6.1. Let κ = 0, 1, 2 be fixed. Consider a Gaussian process Y on $[0,1]^d$ with mean function zero and covariance function $\bar{k}$ of the form $\bar{k}(u, v) = k(u - v)$, where k satisfies Condition-Var. Let $\hat{Y}(x_0) = E[Y(x_0) \mid y]$ and $\sigma^2(x_0) = \mathrm{Var}(Y(x_0) \mid y)$ be the unconstrained Kriging prediction and conditional variance at $x_0 \in [0,1]^d$, and let $\hat{Y}_c(x_0) = E[Y(x_0) \mid y, Y \in E_\kappa]$ and $\sigma_c^2(x_0) = \mathrm{Var}(Y(x_0) \mid y, Y \in E_\kappa)$ be their constrained counterparts. Then $(\hat{Y}_c(x_0) - \hat{Y}(x_0)) / \sigma(x_0) = o_{P|Y \in E_\kappa}(1)$ and $(\sigma_c^2(x_0) - \sigma^2(x_0)) / \sigma^2(x_0) = o_{P|Y \in E_\kappa}(1)$.

In Proposition 6.1, when taking the constraints into account or not, the predictions converge to the true values and the conditional variances converge to zero. Thus, the results in Proposition 6.1 are given on a relative scale, by dividing the difference of predictions by the conditional standard deviation (without constraints), and by dividing the difference of conditional variances by the conditional variance (without constraints).
Similarly as for estimation in Sections 3 and 4, the conclusion of Proposition 6.1 is that the constraints do not have an asymptotic impact on prediction.
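For reference, the unconstrained Kriging quantities $\hat{Y}(x_0)$ and $\sigma^2(x_0)$ appearing above can be computed as in the following illustrative Python sketch (function names are our own; the constrained quantities additionally require sampling from the constrained posterior, for instance by MCMC as in Section 5).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kriging(x0, X, y, cov, jitter=1e-10):
    """Unconstrained Kriging mean E[Y(x0)|y] and variance Var(Y(x0)|y).

    cov(h) must return the stationary covariance k(h) for an array of lags h."""
    K = cov(X[:, None] - X[None, :]) + jitter * np.eye(len(X))
    k0 = cov(x0 - X)
    c, low = cho_factor(K)
    w = cho_solve((c, low), k0)
    mean = w @ y                                      # E[Y(x0) | y]
    var = float(cov(np.array([0.0]))[0] - k0 @ w)     # Var(Y(x0) | y)
    return mean, max(var, 0.0)
```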
When there are no constraints, significant results on using misspecified covariance functions that are asymptotically equivalent to the true one have been obtained in (Stein, 1988, 1990a). Let $\bar{k}$, $\sigma(x_0)$, $\hat{Y}(x_0)$ and $\hat{Y}_c(x_0)$ be as in Proposition 6.1. Let $k_1$ satisfy Condition-Var and let $\bar{k}_1$ be defined from $k_1$ as in Proposition 6.1. Let the Gaussian measures of the Gaussian processes with mean functions zero and covariance functions $\bar{k}$ and $\bar{k}_1$ on $[0,1]^d$ be equivalent (see Stein, 1999).
Define $\hat{Y}_1(x_0)$ and $\sigma_1^2(x_0)$ analogously to $\hat{Y}(x_0)$ and $\sigma^2(x_0)$, when taking the conditional expectations with respect to $\bar{k}_1$ rather than $\bar{k}$. Then it is shown in (Stein, 1988, 1990a,c) (see also Stein, 1999, Chapter 4, Theorem 8) that $(\hat{Y}_1(x_0) - \hat{Y}(x_0))/\sigma(x_0) \to 0$ and $\sigma_1^2(x_0)/\sigma^2(x_0) \to 1$ as $n \to \infty$. Both expressions mean that the predictions and conditional variances obtained from equivalent Gaussian measures are asymptotically equivalent. A corollary of our Proposition 6.1 is that this equivalence remains true when the predictions and conditional variances are calculated accounting for the inequality constraints.

Corollary 6.2. Let κ = 0, 1, 2 be fixed. Consider a Gaussian process Y on $[0,1]^d$ with mean function zero and covariance function $\bar{k}$ of the form $\bar{k}(u, v) = k(u - v)$, where k satisfies Condition-Var. Let the Gaussian measures of the Gaussian processes with mean functions zero and covariance functions $\bar{k}$ and $\bar{k}_1$ on $[0,1]^d$ be equivalent. Then the predictions and conditional variances obtained under $\bar{k}$ and $\bar{k}_1$, when accounting for the inequality constraints, are asymptotically equivalent conditionally to {Y ∈ E_κ}, in the relative sense of Proposition 6.1.

Finally, an important question for Gaussian processes is to assess the asymptotic accuracy of predictions obtained from (possibly consistently) estimated covariance parameters. In this section, we have restricted the asymptotic analysis of prediction to fixed (potentially misspecified) covariance parameters.
When no constraints are considered, and under increasing-domain asymptotics, predictions obtained from consistent estimators of covariance parameters are generally asymptotically optimal (Bachoc, 2014b;Bachoc et al., 2018). Under fixed-domain asymptotics, without considering constraints, the predictions obtained from estimators of the covariance parameters can be asymptotically equal to those obtained from the true covariance parameters (Putter and Young, 2001). It would be interesting, in future work, to extend the results given in (Putter and Young, 2001), to the case of inequality constraints. This could be carried out by making Proposition 6.1 uniform over subspaces of covariance parameters, and by following a similar approach as for proving Corollary 6.2.

Microergodic parameter estimation for the isotropic Wendland model
In this section, we let d = 1, 2 or 3 and extend the results for the Matérn covariance functions of Section 4 to the isotropic Wendland family of covariance functions on $[0,1]^d$ (Gneiting, 2002; Bevilacqua et al., 2019). Here $k_\theta = k_{\theta,s,\mu}$, with $\theta = (\sigma^2, \rho)$, is the generalized Wendland covariance function of (Gneiting, 2002; Bevilacqua et al., 2019) with parameters s and µ. The parameters $s > 0$ and $\mu \ge (d+1)/2 + s$ are considered to be fixed and known. The Wendland covariance function is given by $\bar{k}_{\theta,s,\mu}(u, v) = k_{\theta,s,\mu}(u - v)$. The parameter s drives the smoothness of the Wendland covariance function, similarly as for the Matérn covariance function (Bevilacqua et al., 2019). The parameters $\sigma^2 > 0$ and $\rho > 0$ are interpreted similarly as for the Matérn covariance functions and are to be estimated. We remark that, for appropriate equality conditions on ν (see Section 4), s and µ, the Gaussian measures obtained from the Wendland and Matérn covariance functions are equivalent (Bevilacqua et al., 2019). The Wendland covariance function is compactly supported, which is a computational benefit (Bevilacqua et al., 2019).
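For reference, one common definition of the generalized Wendland correlation with parameters $(s, \mu)$ is recalled below, with $k_{\theta,s,\mu}(x) = \sigma^2 \varphi_{\mu,s}(\|x\|/\rho)$. This is our own recollection of the form used, for instance, in Bevilacqua et al. (2019); parameter names and normalization conventions vary across references.

```latex
\varphi_{\mu,s}(r) =
\begin{cases}
\dfrac{1}{B(2s,\, \mu+1)} \displaystyle\int_{r}^{1} u \,(u^2 - r^2)^{s-1} (1-u)^{\mu} \, du, & 0 \le r < 1,\\[1ex]
0, & r \ge 1,
\end{cases}
```

where $B(\cdot, \cdot)$ denotes the Beta function; the compact support, and hence the sparsity of the resulting covariance matrices, is apparent from this expression.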
Let us define the MLE $(\hat{\sigma}^2_n, \hat{\rho}_n)$ in the exact same way as in Section 4.2, but for the Wendland covariance functions. It is shown in (Bevilacqua et al., 2019) that the parameters $\sigma_0^2$ and $\rho_0$ cannot be estimated consistently, but that the parameter $\sigma_0^2 / \rho_0^{1+2s}$ can. Furthermore, $\sqrt{n}\,(\hat{\sigma}^2_n / \hat{\rho}_n^{1+2s} - \sigma_0^2 / \rho_0^{1+2s})$ converges to a $\mathcal{N}(0,\, 2(\sigma_0^2/\rho_0^{1+2s})^2)$ distribution. Then, we can extend Theorem 4.2, providing the asymptotic conditional distribution of the MLE of the microergodic parameter for the Matérn model, to the Wendland model.

Noisy observations
The results above hold for a continuous Gaussian process that is observed exactly. It is thus natural to ask whether similar results hold for discontinuous Gaussian processes or for Gaussian processes observed with errors. In the next proposition, we show that the standard model of a discontinuous Gaussian process with a nugget effect yields a zero probability to satisfy bound constraints. Hence, it does not seem possible to define, in a meaningful way, a discontinuous Gaussian process conditioned by bound constraints.

Proposition 6.6. Let E_0 be defined as in Section 2.1 with $-\infty < \ell$ or $u < +\infty$. Let Y be a Gaussian process on $[0,1]^d$ of the form $Y = Y_c + Y_\delta$, where $Y_c$ is a continuous Gaussian process on $[0,1]^d$ and $Y_\delta$ is a Gaussian process on $[0,1]^d$ with mean function zero and covariance function $\bar{k}_\delta$ given by $\bar{k}_\delta(u, v) = \delta^2 \mathbf{1}\{u = v\}$ for some $\delta^2 > 0$. In addition, assume that $Y_c$ and $Y_\delta$ are independent. Then $P(Y \in E_0) = 0$.

This proposition can be extended to monotonicity and convexity constraints. Hence, in the rest of this section, we consider constrained continuous Gaussian processes observed with noise.
In the case of noisy observations, obtaining fixed-domain asymptotic results on the (unconstrained) MLE of the covariance parameters and the noise variance is challenging, even more so than in the noise-free context. To the best of our knowledge, the only covariance models that have been investigated theoretically, under fixed-domain asymptotics with measurement errors, are the Brownian motion (Stein, 1990b) and the exponential model (Chen et al., 2000;Chang et al., 2017).
In the case of the exponential model, the covariance function is $k_{\sigma^2,\rho}(t) = \sigma^2 \exp(-|t|/\rho)$ for $(\sigma^2, \rho) \in \Theta$. We let Y be a Gaussian process on [0, 1] with mean function zero and covariance function $k_{\sigma_0^2,\rho_0}$ with $(\sigma_0^2, \rho_0) \in \Theta$. We consider the triangular array of observation points defined by $(x_1, \ldots, x_n) = (0, 1/(n-1), \ldots, 1)$, for $n \ge 2$. We consider that the n observations are given by $y_i = Y(x_i) + \epsilon_i$ for $i = 1, \ldots, n$, where $\epsilon_1, \ldots, \epsilon_n$ are independent $\mathcal{N}(0, \delta_0^2)$ variables, independent of Y, with $\delta_0^2 \in \Delta$. In (Chen et al., 2000), it is shown that the MLE $\hat{\sigma}^2_n / \hat{\rho}_n$ of the microergodic parameter and the MLE $\hat{\delta}^2_n$ of the noise variance jointly satisfy the central limit theorem (15). Hence, the rate of convergence of the MLE of the microergodic parameter is decreased from $n^{1/2}$ to $n^{1/4}$, because of the measurement errors. The rate of convergence of the MLE of the noise variance is $n^{1/2}$.
In the next proposition, we show that these rates are unchanged when conditioning by the boundedness event {Y ∈ E_0}.

Proposition 6.7. Consider the setting defined above, with $\theta_0$, $\delta_0$ in the interior of $\Theta \times \Delta$. Then, as $n \to \infty$, $n^{1/4}(\hat{\sigma}^2_n/\hat{\rho}_n - \sigma_0^2/\rho_0) = O_{P|Y \in E_0}(1)$ and $n^{1/2}(\hat{\delta}^2_n - \delta_0^2) = O_{P|Y \in E_0}(1)$.

It would be interesting to see whether the central limit theorem in (15) still holds conditionally to {Y ∈ E_0}. This would be an extension to the noisy case of Theorems 3.2 and 4.2. Nevertheless, to prove Theorem 3.2, we have observed that, in the noiseless case, the MLE of $\sigma_0^2$ is a normalized sum of the independent variables $W_{n,1}^2, \ldots, W_{n,n}^2$, with
\[
W_{n,i} = \frac{y_i - E[y_i \mid y_1, \ldots, y_{i-1}]}{\mathrm{Var}(y_i \mid y_1, \ldots, y_{i-1})^{1/2}}
\]
for $i = 1, \ldots, n$. We have taken advantage of the fact that conditioning by $W_{n,1}, \ldots, W_{n,k}$ enables us to condition by $Y(x_1), \ldots, Y(x_k)$ and to approximately condition by the event {Y ∈ E_0}, while leaving the distribution of $W_{n,k+1}, \ldots, W_{n,n}$ unchanged. We refer to the proof of Theorem 3.2 for more details.
In contrast, in the noisy case, the authors of (Chen et al., 2000) show that the MLE of σ 2 0 /ρ 0 is also a normalized sum of the independent variables W 2 n,1 , . . . , W 2 n,n , but each W n,i depends on the observation vector y = (y 1 , . . . , y n ) entirely (see Chen et al., 2000, Equations (3.40) and (3.42)). Hence, it appears significantly more challenging to address the asymptotic normality of the MLE of σ 2 0 /ρ 0 and δ 0 , conditionally to {Y ∈ E 0 }. We leave this question open to future work.
The constrained likelihood and the cMLE can be naturally extended to the noisy case. Nevertheless, the asymptotic study of the cMLE, in the context of the exponential covariance function as in Proposition 6.7, seems to require substantial additional work. Indeed, to analyze the cMLE in the noiseless case for the Matérn covariance functions, we have relied on the results of (Kaufman and Shaby, 2013) and (Wang and Loh, 2011), which are specific to the noiseless case. Furthermore, the martingale arguments, used for instance in point 4) of the proof of Theorem 3.3, require the observation points to be taken from a sequence. Hence, these martingale arguments are not available in the framework of this section, in which the observation points are taken on regular grids. Finally, the RKHS arguments, used for instance in point 4) of the proof of Theorem 4.4, require working with covariance functions that are at least twice differentiable, which is not the case for the exponential covariance function. Hence, we leave the asymptotic study of the cMLE, in the noisy case, open to future research.

Concluding remarks
We have shown that the MLE and the cMLE are asymptotically Gaussian distributed, conditionally to the fact that the Gaussian process satisfies either boundedness, monotonicity or convexity constraints. Their asymptotic distributions are identical to the unconditional asymptotic distribution of the MLE. In simulations, we confirm that the MLE and the cMLE have very similar performances when the number n of observation points becomes large enough. We also observe that the cMLE is more accurate for small or moderate values of n.
Hence, since the computation of the cMLE is more challenging than that of the MLE, we recommend using the MLE for large data sets and the cMLE for smaller ones. In the proofs of the asymptotic behavior of the cMLE, one of the main steps is to show that $P_\theta(Y \in E_\kappa \mid y)$ converges to one as n goes to infinity. Hence, in practice, one may evaluate $P_\theta(Y \in E_\kappa \mid y)$ for some values of θ, in order to gauge whether this conditional probability is far enough from one that using the cMLE is worth the additional computational cost; a sketch of such an evaluation is given below. Similarly, Proposition 6.1 (and its proof) shows that if $P_\theta(Y \in E_\kappa \mid y)$ is close to one, then predicting new values of Y with or without accounting for the constraints yields approximately identical results. The latter option is then preferable, as it is computationally less costly.
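As an illustration of this diagnostic, the Python sketch below estimates $P_\theta(Y \in E_0 \mid y)$ on a finite grid by sampling the (unconstrained) conditional Gaussian vector given y and counting how often it satisfies the bounds. This is our own crude rejection-style sketch, not the procedure implemented in LineqGPR, and it becomes inefficient when the probability is small.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, cholesky

def conditional_bound_probability(grid, X, y, cov, lower, upper,
                                  n_samples=2000, jitter=1e-9, seed=0):
    """Monte Carlo estimate of P_theta(Y in E_0 | y) on a finite grid.

    cov(h) must return the stationary covariance k_theta(h) for an array of lags h."""
    rng = np.random.default_rng(seed)
    K = cov(X[:, None] - X[None, :]) + jitter * np.eye(len(X))
    Ks = cov(grid[:, None] - X[None, :])
    Kss = cov(grid[:, None] - grid[None, :])
    c, low = cho_factor(K)
    mean = Ks @ cho_solve((c, low), y)                    # conditional mean on the grid
    Sigma = Kss - Ks @ cho_solve((c, low), Ks.T)          # conditional covariance on the grid
    L = cholesky(Sigma + jitter * np.eye(len(grid)), lower=True)
    Z = mean[:, None] + L @ rng.standard_normal((len(grid), n_samples))
    inside = np.all((Z >= lower) & (Z <= upper), axis=0)  # paths satisfying the bounds
    return inside.mean()
```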
Our theoretical results could be extended in different ways. First, we remark that the techniques we have used to show that P θ (Y ∈ E κ ) and P θ (Y ∈ E κ |y) are asymptotically negligible (see (21) and (22)) can be used for more general families of covariance functions. Hence, other results on the (unconditional) asymptotic distribution of the MLE could be extended to the case of constrained Gaussian processes in future work. These types of results exist for instance for the product exponential covariance function (Ying, 1993).
Also, in practice, computing the cMLE requires a discretization of the constraints, for instance using a piecewise affine interpolation as in Section 5, or a finite set of constrained points (Da Veiga and Marrel, 2012). Thus it would be interesting to extend our results by taking this discretization into account.
Finally, in this paper, we have focused on Gaussian processes that are either observed directly or with an additive Gaussian noise. These contexts are relevant in practice when applying Gaussian processes to computer experiments (Santner et al., 2003) and to regression problems in machine learning (Rasmussen and Williams, 2006). Nowadays, it has also become standard to study other more complex models of latent Gaussian processes, for instance in Gaussian process classification (Rasmussen and Williams, 2006;Nickisch and Rasmussen, 2008). Some authors have also considered latent Gaussian processes subjected to inequality constraints for modelling point processes (López-lopera et al., 2019). It would be interesting to obtain asymptotic results similar to those in our article, for latent Gaussian processes. This could be a challenging problem, as few asymptotic results are available even for unconstrained latent Gaussian process models. We remark that some of the techniques we have used in this paper could be useful when considering latent Gaussian processes under constraints. These techniques are, in particular, Lemmas A.1 and A.2 and their applications.
Now we establish three lemmas that will be useful in the sequel.
Lemma A.1. Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of r.v.'s and $(m_{k,n})_{n,k \in \mathbb{N},\, k \le n}$ and $(M_{k,n})_{n,k \in \mathbb{N},\, k \le n}$ be two triangular arrays of r.v.'s. We consider a random vector $(m, M)$ such that $m \le m_{k,n} \le M_{k,n} \le M$ for all $k \le n$. We assume that $P(m = \ell) = P(M = u) = 0$ and $P(\ell \le m \le M \le u) > 0$ for some fixed $\ell$ and $u \in \mathbb{R}$. Moreover, we consider a sequence $(k_n)_{n \in \mathbb{N}}$ so that $k_n \le n$, $k_n \to \infty$ as $n \to \infty$, and
\[
(m_{k_n,n}, M_{k_n,n}) \xrightarrow[n \to \infty]{\text{a.s.}} (m, M). \tag{16}
\]
Then for any $a \in \mathbb{R}$,
\[
\lim_{n \to +\infty} \left| P(X_n \le a \mid \ell \le m_{k_n,n} \le M_{k_n,n} \le u) - P(X_n \le a \mid \ell \le m \le M \le u) \right| = 0.
\]
Proof of Lemma A.1. For the sake of simplicity, we denote by $E_{k,n}$ (respectively $E$) the event $\{\ell \le m_{k,n} \le M_{k,n} \le u\}$ (resp. $\{\ell \le m \le M \le u\}$). Then
\[
\left| P(X_n \le a \mid E_{k_n,n}) - P(X_n \le a \mid E) \right|
\le \frac{\left| P(X_n \le a,\, E_{k_n,n}) - P(X_n \le a,\, E) \right|}{P(E_{k_n,n})}
+ P(X_n \le a,\, E) \left| \frac{1}{P(E_{k_n,n})} - \frac{1}{P(E)} \right|. \tag{18}
\]
(i) By (16), $P(E_{k_n,n})$ goes to $P(E) = P(\ell \le m \le M \le u) > 0$ as n goes to $+\infty$. Thus $1/P(E_{k_n,n})$ is well-defined for large values of n and bounded as $n \to \infty$. Moreover, by elementary arguments of set theory, one gets $|P(X_n \le a, E_{k_n,n}) - P(X_n \le a, E)| \le P(E_{k_n,n} \,\Delta\, E) = P(E_{k_n,n} \setminus E)$, since $P(E \setminus E_{k_n,n}) = 0$. Now let $\varepsilon > 0$. One has
\[
P(E_{k_n,n} \setminus E) = P(\ell \le m_{k_n,n} \le M_{k_n,n} \le u,\, (m, M) \notin [\ell, u]^2)
\le P(\ell \le m_{k_n,n},\, m < \ell) + P(M_{k_n,n} \le u,\, M > u).
\]
The first term in the right-hand side goes to 0 as n goes to infinity, by the portmanteau lemma and (16). We handle similarly the term $P(M_{k_n,n} \le u,\, M > u)$. Hence, in the r.h.s. of (18), the first term goes to 0 as $n \to \infty$.
(ii) Now we turn to the control of the second term in (18). Upper bounding $P(X_n \le a, E)$ by 1, it remains to control $|1/P(E_{k_n,n}) - 1/P(E)|$, which is immediate by the convergence in distribution of $(m_{k_n,n}, M_{k_n,n})$ as n goes to infinity (implied by the a.s. convergence) and the fact that $P(E) > 0$ and $P(m = \ell) = P(M = u) = 0$. The proof is now complete.
Lemma A.2. Consider three sequences of random functions $f_n, g_n, h_n : [x_{\inf}, x_{\sup}] \to \mathbb{R}$, with $0 < x_{\inf} < x_{\sup} < \infty$ fixed. Consider that for all $x \in [x_{\inf}, x_{\sup}]$, $f_n(x)$, $g_n(x)$, and $h_n(x)$ are functions of Y and x only. Assume the following properties.
(i) There exist $A > 0$, $B > 0$ and $\delta > 0$ such that, with probability going to 1 as $n \to \infty$,
(ii) There exists $C > 0$ such that, for all $x \in [x_{\inf}, x_{\sup}]$, with probability going to 1 as $n \to \infty$,
Lemma A.3. Let {k θ ; θ ∈ Θ} be the set of functions in Section 2 where Θ is compact. Assume that k θ satisfies Condition-Var in the case κ = 0, where c and α can be chosen independently of θ. Let Z n,θ be a Gaussian process with mean function zero and covariance function (

B Proofs for Sections 3 and 4 -Boundedness
We let κ = 0 throughout Section B.

B.1 Estimation of the variance parameter
Proof of Theorem 3.2 under boundedness constraints.
1) Let $m_{k,n} = \min_{i=1,\ldots,k} y_i$, $M_{k,n} = \max_{i=1,\ldots,k} y_i$, and $(m, M) = (Y_*, Y^*)$, where $Y_*$ and $Y^*$ have been defined in Appendix A. We clearly have $m \le m_{k_n,n} \le M_{k_n,n} \le M$. Since $(x_i)_{i \in \mathbb{N}}$ is dense, for any sequence $(k_n)_{n \in \mathbb{N}}$ so that $k_n \to \infty$ as $n \to \infty$ and $k_n \le n$, we have $(m_{k_n,n}, M_{k_n,n}) \to (m, M)$ a.s. as $n \to \infty$ (up to re-indexing $x_1, \ldots, x_n$).
2) Let $k \in \mathbb{N}$ be fixed. We have $\hat{\sigma}^2_n = y^\top R_1^{-1} y / n$. Writing the Gaussian probability density function of y as the product of the conditional probability density functions of $y_i$ given $y_1, \ldots, y_{i-1}$ leads to
\[
\sqrt{n}\,(\hat{\sigma}^2_n - \sigma_0^2) = \frac{\sigma_0^2}{\sqrt{n}} \sum_{i=1}^{n} \left( \frac{\left( y_i - E[y_i \mid y_1, \ldots, y_{i-1}] \right)^2}{\mathrm{Var}(y_i \mid y_1, \ldots, y_{i-1})} - 1 \right).
\]
The terms in the sum above are independent. Indeed, $\mathrm{Cov}(y_l,\, y_i - E[y_i \mid y_1, \ldots, y_{i-1}]) = 0$ for any $l \le i - 1$, and the Gaussian distribution then leads to independence. Therefore,
\[
\sqrt{n}\,(\hat{\sigma}^2_n - \sigma_0^2) = \frac{\sigma_0^2}{\sqrt{n}} \sum_{i=1}^{k} \left( \frac{\left( y_i - E[y_i \mid y_1, \ldots, y_{i-1}] \right)^2}{\mathrm{Var}(y_i \mid y_1, \ldots, y_{i-1})} - 1 \right)
+ \frac{\sigma_0^2}{\sqrt{n}} \sum_{i=k+1}^{n} \left( \frac{\left( y_i - E[y_i \mid y_1, \ldots, y_{i-1}] \right)^2}{\mathrm{Var}(y_i \mid y_1, \ldots, y_{i-1})} - 1 \right).
\]
The first term is o P (1) being the sum of k r.v.'s (whose variances are all equal to 2) divided by the square root of n.
Because $P_{\sigma^2}(\ell \le \min_{i=1,\ldots,k} y_i \le \max_{i=1,\ldots,k} y_i \le u) > 0$, the first term is also $o_P(1)$ conditionally to $\{\ell \le \min_{i=1,\ldots,k} y_i \le \max_{i=1,\ldots,k} y_i \le u\}$. The second term is equal to $\sigma_0^2/\sqrt{n}$ times the sum of $n - k$ independent variables with zero mean and variance 2, and is also independent of $y_1, \ldots, y_k$. Hence, from the central limit theorem and Slutsky's lemma (van der Vaart, 1998, Lemma 2.8), we obtain that $\sqrt{n}\,(\hat{\sigma}^2_n - \sigma_0^2)$ converges in distribution to $\mathcal{N}(0, 2\sigma_0^4)$, conditionally to $\{\ell \le \min_{i=1,\ldots,k} y_i \le \max_{i=1,\ldots,k} y_i \le u\}$.
3) Hence, for $x \in \mathbb{R}$, there exists a sequence $\tau_n \to \infty$ satisfying $\tau_n = o(n)$ as $n \to \infty$ so that
\[
P\!\left( \sqrt{n}\,(\hat{\sigma}^2_n - \sigma_0^2) \le x \;\middle|\; \ell \le \min_{i=1,\ldots,\tau_n} y_i \le \max_{i=1,\ldots,\tau_n} y_i \le u \right) \xrightarrow[n \to \infty]{} P(V \le x),
\]
with $V \sim \mathcal{N}(0, 2\sigma_0^4)$. The above display naturally holds. Indeed, if $(S_{\tau,n})_{n \in \mathbb{N},\, \tau=1,\ldots,n}$ is a triangular array of numbers so that, for any fixed τ, $S_{\tau,n} \to S$ as $n \to \infty$, where S does not depend on τ, then there exists a sequence $\tau_n \to \infty$ so that $S_{\tau_n,n} \to S$ as $n \to \infty$.
Therefore, from Lemma A.1 (applied with $k_n = \tau_n$), $\sqrt{n}\,(\hat{\sigma}^2_n - \sigma_0^2)$ converges in distribution to $\mathcal{N}(0, 2\sigma_0^4)$, conditionally to {Y ∈ E_0}. This concludes the proof.
In order to apply Lemma A.2, we need to check that the conditions (19) to (22) hold.
Now $y^\top R_1^{-1} y$ is the square of the norm of a Gaussian vector with variance-covariance matrix $\sigma_0^2 I_n$, where $I_n$ stands for the identity matrix of dimension n. Thus one can write $y^\top R_1^{-1} y$ as the sum of the squares of n independent and identically distributed r.v.'s $\varepsilon_i$, where $\varepsilon_i$ is Gaussian distributed with mean 0 and variance $\sigma_0^2$. We prove that (19) is satisfied. One may rewrite $L_n(\sigma^2)$ as an expression in which the $o_P(1)$ term does not depend on $\sigma^2$ and where $f_a$ has been introduced in Appendix A. By a Taylor expansion and the definition of $\hat{\sigma}^2_n$, we have, with probability going to 1 as $n \to \infty$, the corresponding bound, with $\tilde{\sigma}^2$ in the interval with endpoints $\sigma^2$ and $\hat{\sigma}^2_n$. Hence, non-random constants $A > 0$ and $\delta > 0$ exist for which (19) is satisfied.
Let $m_{n,y}$ and $\sigma^2 k_n$ be the conditional mean and covariance functions of Y given y, under the probability measure $P_{\sigma^2}$. Using the Borell–TIS inequality (Adler and Taylor, 2007), with $Z_{n,\sigma^2}$ a Gaussian process with mean function zero and covariance function $\sigma^2 k_n$, we obtain a bound holding uniformly over $\sigma^2 \in [\sigma_l^2, \sigma_u^2]$ as $n \to \infty$. By (Bect et al., 2018, Proposition 2.8) and because the sequence of observation points is dense, the corresponding supremum converges to zero. Consequently, (25) leads to the desired control of $m_{n,y}$. Similarly, taking $-Y$ instead of Y, one may prove easily the analogous statement. Then, we deduce the required convergence. Now let $\varepsilon > 0$, let $\varepsilon' = 2|\ln(1 - \varepsilon)|$ and define the event $E_{0,\delta}$. By the triangular inequality and (28), we obtain a bound by the sum of two terms, (29) and (30). As already shown, the term (29) converges to 0 as $n \to +\infty$ for any fixed $\delta > 0$. For (30), the convergence follows from Tsirelson's theorem in (Azaïs and Wschebor, 2009). Hence for all $\varepsilon > 0$, there exists $\delta^* > 0$ such that the first requirement holds. Similarly, for all $\varepsilon > 0$, there exists $\delta_* > 0$ such that the second requirement holds. Taking $\delta = \min(\delta^*, \delta_*)$, we conclude the proof of (22).
By Theorem 3.2 and Slutsky's lemma, we conclude the proof.

B.2 Isotropic Matérn process
Before proving Theorems 4.2, 4.3 and 4.4, we establish an intermediate result useful in the sequel.

4)
From a special case of Theorem 4.2 when κ = 0 and with $\rho_l = \rho_u = \rho_1$, we have the required convergence in distribution. Therefore, by Lemma A.2 and Slutsky's lemma, we conclude the proof of Theorem 4.3 when κ = 0.
Proof of Theorem 4.4 under boundedness constraints. Let κ = 0 in this proof. We apply Lemma A.2 to the sequences of functions $f_n$, $g_n$ and $h_n$ defined by $f_n(x) = L_n(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$, $g_n(x) = A_n(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$, and $h_n(x) = B_n(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$, so that $f_n(x) + g_n(x) + h_n(x) = L_{n,c}(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$.
Then, if we show (21) and (22), we can show similarly as for (38) that the corresponding bound holds with probability going to 1 as $n \to \infty$. Hence, from Lemmas A.2 and B.1 and Slutsky's lemma, we can obtain, if (21) and (22) hold, the announced asymptotic distribution. Therefore, in order to conclude the proof, it is sufficient to prove (21) and (22).

3)
We turn to (21). Let σY ρ be a Gaussian process with mean function zero and covariance function k σ 2 ,ρ,ν . Then we have, from Lemma 4.1, We introduce the following notation: and assume that Therefore, there exists a sequence (ρ k , t k , ε k ) k∈N such that We extract from (ρ k , t k , ε k ) k∈N a subsequence (still denoted (ρ k , t k , ε k ) k∈N ) such that (ρ k ) k is convergent and we denote by ρ its limit. Let Φ be the cumulative distribution function of a standard Gaussian random variable. Then by the mean value theorem, noticing that inf p∈[0,1] Φ −1 (p) > 0 and using (39).
But, using the concavity of Φ −1 • F ρ (see Lifshits, 1995, Theorem 10 in Section 11), one gets The convergence comes from the continuity of the function ρ → F ρ (t) for a fixed t (see the proof of López-Lopera et al., 2018, Lemma A.6). From Lemma 4.1, the above limit is finite, which is contradictory with (39). Hence, (21) is proved.

4)
Finally, we turn to (22). We let $m_{n,\rho,y}$ and $\sigma^2 k_{n,\rho}$ be the mean and covariance functions of Y given y, under the covariance function $\bar{k}_{\sigma^2,\rho,\nu}$. Our first aim is to show that, for any $\varepsilon > 0$, (40) and (41) hold with probability going to 1 as $n \to \infty$. Now we use tools from the theory of reproducing kernel Hilbert spaces (RKHSs) and refer to, e.g., (Wendland, 2004) for the definitions and properties of RKHSs used in the rest of the proof. For $\rho \in [\rho_l, \rho_u]$, the function $m_{n,\rho,y}$ belongs to the RKHS of the covariance function $k_{1,\rho,\nu}$. Its RKHS norm $\|m_{n,\rho,y}\|_{k_{1,\rho,\nu}}$ can be simply shown to satisfy $\|m_{n,\rho,y}\|^2_{k_{1,\rho,\nu}} = y^\top R_{\rho,\nu}^{-1} y$.
• For d = 2, we consider $\{v_1, v_2, v_3\} \subset \{x_1, \ldots, x_n\}$ for which $\bar{x}$ belongs to the convex hull of $v_1, v_2, v_3$. Then, if $\bar{x}$ belongs to one of the three segments with end points $v_1, v_2$ or $v_2, v_3$ or $v_1, v_3$, from the previous step with d = 1, it follows that $\sup_{x \in [0,1]^2} H_x m_{n,\rho,y} \le c\varepsilon/a_n^2$. Consider now that $\bar{x}$ does not belong to one of these segments and consider the (unique) intersection point r of the line with direction $v_1 - \bar{x}$ and of the segment with endpoints $v_2$ and $v_3$. If $m_{n,\rho,y}(r) \le Y^* + \varepsilon/2$, by considering the triplet $(v_1, \bar{x}, r)$, from the reasoning of the case d = 1, it follows that $\sup_{x \in [0,1]^2} H_x m_{n,\rho,y} \le c\varepsilon/a_n^2$. If $m_{n,\rho,y}(r) \ge Y^* + \varepsilon/2$, by considering the triplet $(v_2, r, v_3)$, it also follows that $\sup_{x \in [0,1]^2} H_x m_{n,\rho,y} \le c\varepsilon/a_n^2$.
Hence eventually, in all the configurations of the case b), we have $\sup_{x \in [0,1]^d} H_x m_{n,\rho,y} \le c\varepsilon/a_n^2$ under the event (44). Hence, from (43), (40) follows in the case b). Analogously, (41) holds in that case.
Consequently, (22) follows and the proof is concluded.

C Proofs for Sections 3 and 4 -Monotonicity
We let κ = 1 throughout Appendix C.

C.1 Estimation of the variance parameter
Proof of Theorem 3.2 under monotonicity constraints. The proof is similar to that of Theorem 3.2 when κ = 0 and is also divided into the three steps 1), 2) and 3).
1) For $n \in \mathbb{N}$, let $N_n$ be the greatest integer such that Condition-Grid holds. Now we define $m_{i,n}$ for all $i = 1, \ldots, d$. Since $N_n \to \infty$ as $n \to \infty$, $m_{i,n}$ converges a.s., by Condition-Var, to $\inf_{x \in [0,1]^d} \partial Y(x)/\partial x_i$. Now we notice that $m_{i,n} = g_{i,n}(y_1, \ldots, y_n)$ and we define $m_{k,i,n} = g_{i,k}(y_1, \ldots, y_k)$. One can see that a slightly different version of Lemma A.1 can be shown (up to re-indexing $x_1, \ldots, x_n$) with $m_{k,n} = \min\{m_{k,1,n}, \ldots, m_{k,d,n}\}$, $m = \inf_{x \in [0,1]^d} \min_{i=1,\ldots,d} \partial Y(x)/\partial x_i$, and $M_{k,n} = M = u = +\infty$. After applying this different version, points 2) and 3) in the proof of Theorem 3.2 when κ = 0 remain unchanged. This concludes the proof.
Proof of Theorem 3.3 under monotonicity constraints. The proof is similar to that of Theorem 3.3 when κ = 0 and is also divided into the five steps 1) to 5). We apply Lemma A.2 to the sequences of functions $f_n$, $g_n$ and $h_n$ defined by $f_n(\sigma^2) = L_n(\sigma^2)$, $g_n(\sigma^2) = A_n(\sigma^2)$ and $h_n(\sigma^2) = B_n(\sigma^2)$. Here we recall that for $\sigma^2 \in \Theta$, $A_n(\sigma^2) = -\ln P_{\sigma^2}(Y \in E_1)$ and $B_n(\sigma^2) = \ln P_{\sigma^2}(Y \in E_1 \mid y)$.
In order to apply Lemma A.2, we need to check that the conditions (19) to (22) hold.
3) Let us introduce the Gaussian process $Y_r$ with mean function zero and covariance function $\bar{k}_1$. Then we have, for any $\sigma^2 > 0$, $P_{\sigma^2}(Y \in E_1) = P(\sigma Y_r \in E_1) = P(Y_r \in E_1)$, since $E_1$ is invariant under multiplication by a positive constant. Hence $A_n(\sigma^2)$ does not depend on $\sigma^2$, so that (21) holds.

C.2 Isotropic Matérn process
Proof of Theorem 4.2 under monotonicity constraints. The proof is the same as for Theorem 4.2 when κ = 0 and is concluded by applying Theorem 3.2 when κ = 1.
Proof of Theorem 4.3 under monotonicity constraints. The proof is similar to that of Theorem 4.3 when κ = 0 and is also divided into the four steps 1) to 4). We apply Lemma A.2 to the sequences of functions $f_n$, $g_n$ and $h_n$ defined by $f_n(\sigma^2) = L_n(\sigma^2, \rho_1)$, $g_n(\sigma^2) = A_n(\sigma^2, \rho_1)$, and $h_n(\sigma^2) = B_n(\sigma^2, \rho_1)$.
1) The proof that (19) and (20) are satisfied is identical to the proof of Theorem 4.3 when κ = 0, as (19) and (20) do not involve the event {Y ∈ E 1 }.
2) Let us introduce the Gaussian process $Y_r$ with mean function zero and covariance function $\bar{k}_{1,\rho_1,\nu}$. Then we have, for any $\sigma^2 > 0$, $P_{(\sigma^2,\rho_1)}(Y \in E_1) = P(\sigma Y_r \in E_1) = P(Y_r \in E_1)$, since $E_1$ is invariant under multiplication by a positive constant. Hence $A_n(\sigma^2, \rho_1)$ does not depend on $\sigma^2$, so that (21) holds.
3) We turn to $B_n(\sigma^2, \rho_1)$. We conclude that (22) holds by following the same lines as in the proof of Theorem 4.3 when κ = 0 and using the equivalence of measures.

4)
We conclude the proof of Theorem 4.3 when κ = 1 similarly as in the proof of Theorem 4.3 when κ = 0 using Theorem 4.2 when κ = 1.
Proof of Theorem 4.4 under monotonicity constraints. The proof follows the same four steps as the proof of Theorem 4.4 when κ = 0. We apply Lemma A.2 to the sequences of functions $f_n$, $g_n$ and $h_n$ defined by $f_n(x) = L_n(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$, $g_n(x) = A_n(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$, and $h_n(x) = B_n(x \hat{\rho}_{n,c}^{2\nu}, \hat{\rho}_{n,c})$.
1) The proof that (19) and (20) are satisfied is identical to the proof of Theorem 4.4 when κ = 0, as (19) and (20) do not involve the event {Y ∈ E 1 }.
3) Similarly as in the proof of Theorem 4.3 when κ = 1, we show that (21) holds.
Hence, we have that, with probability going to 1 as $n \to \infty$, the required bound holds. This proves that (22) holds, so that the proof is complete.

D Proofs for Sections 3 and 4 -Convexity
We let κ = 2 throughout Appendix D.

D.1 Estimation of the variance parameter
Proof of Theorem 3.2 under convexity constraints. The proof is similar to that of Theorem 3.2 when κ = 1, where the finite differences of order one are replaced by finite differences of order two.
Proof of Theorem 3.3 under convexity constraints. The proof is similar to that of Theorem 3.3 when κ = 0, introducing a Gaussian process V defined from the second-order derivatives of Y and observing that Y ∈ E_2 if and only if the Hessian matrix $HY(x)$ is positive semidefinite for all $x \in [0,1]^d$, where $Hf(x)$ represents the Hessian matrix of f at x, which means that $Hf(x)_{i,j} = \partial^2 f(x)/(\partial x_i \partial x_j)$.

D.2 Isotropic Matérn process
Proof of Theorem 4.2 under convexity constraints. The proof is the same as for Theorem 4.2 when κ = 0 and is concluded by applying Theorem 3.2 when κ = 2.
Proof of Theorem 4.3 under convexity constraints. The proof is similar to that of Theorem 4.3 when κ = 0.
Proof of Theorem 4.4 under convexity constraints. The proof follows the same four steps as the proof of Theorem 4.4 when κ = 0. Points 1) to 3) are identical. Turning to (22), point 4) can be treated similarly as in the proof of Theorem 4.4 when κ = 1, but with more cumbersome notation and arguments. In order to ease the reading of the paper, we omit this technical proof.

E Proofs for Section 6
Proof of Proposition 6.1. We decompose the quantity of interest into two terms, A and B, say. By the Cauchy–Schwarz inequality, we have $|A| \le E[E_n(x_0)^2 \mid y]^{1/2} \, (1 - P(Y \in E_\kappa \mid y))^{1/2}$. In the above display, the first square root is 1 by definition and the second one is $o_{P|Y \in E_\kappa}(1)$, as shown in point 4) of the proof of Theorem 3.3.
Proof of Theorems 6.3, 6.4 and 6.5. The proof is the same as in the Matérn case in Theorems 4.2, 4.3 and 4.4. In particular, when $1 + 2s = 2\nu$, the Matérn and Wendland covariance functions have the same smoothness, see (Bevilacqua et al., 2019, Theorem 1). Hence, a lemma similar to Lemma 4.1 holds. We also remark that a lemma similar to Lemma B.1 can be proved, by using (Bevilacqua et al., 2019, Lemma 1) together with the results given in the proof of (Bevilacqua et al., 2019, Theorem 8) (see the online supplementary material to this paper).
Proof of Proposition 6.6. Without loss of generality, we can consider that $u < +\infty$. Recall the notation $Y_c^{**} = \sup_{x \in [0,1]^d} |Y_c(x)| < +\infty$ a.s. of Appendix A. Let $(x_i)_{i \in \mathbb{N}}$ be any sequence of pairwise distinct points in $[0,1]^d$. We bound $P(Y \in E_0)$ by the expectation, over $Y_c$, of the conditional probability that $Y(x_1), \ldots, Y(x_n)$ all lie below u given $Y_c$. The above conditional probability goes to zero as $n \to \infty$ for any $Y_c^{**} < \infty$. Thus, by dominated convergence, the above expectation goes to zero as $n \to \infty$. This concludes the proof.