On the Inference of Applying Gaussian Process Modeling to a Deterministic Function

Gaussian process modeling is a standard tool for building emulators for computer experiments, which are usually used to study deterministic functions, for example, a solution to a given system of partial differential equations. This work investigates applying Gaussian process modeling to a deterministic function from prediction and uncertainty quantification perspectives, where the Gaussian process model is misspecified. Specifically, we consider the case where the underlying function is fixed and from a reproducing kernel Hilbert space generated by some kernel function, and the same kernel function is used in the Gaussian process modeling as the correlation function for prediction and uncertainty quantification. While upper bounds and the optimal convergence rate of prediction in the Gaussian process modeling have been extensively studied in the literature, a comprehensive exploration of convergence rates and theoretical study of uncertainty quantification is lacking. We prove that, if one uses maximum likelihood estimation to estimate the variance in Gaussian process modeling, under different choices of the regularization parameter value, the predictor is not optimal and/or the confidence interval is not reliable. In particular, lower bounds of the prediction error under different choices of the regularization parameter value are obtained. The results indicate that, if one directly applies Gaussian process modeling to a fixed function, the reliability of the confidence interval and the optimality of the predictor cannot be achieved at the same time.


Introduction
Computer experiments are often used to study a system of interest. For example, Mak et al. (2018) study a complex simulation model for turbulent flows in swirl injectors. Other examples include Burchell et al. (2006), who estimate the sexual transmissibility of human papillomavirus infection, and Moran et al. (2015), who use the Cardiovascular Disease Policy Model to project the cost-effectiveness of treating hypertension in the U.S. according to 2014 guidelines. In these examples, the simulators are expensive, and input/response pairs are often not available in sufficient quantity for an extensive exploration of the underlying function. One well-established approach to this problem is the use of an emulator, an inexpensive approximation of the simulator.
In computer experiments, the two major problems are prediction and uncertainty quantification. Gaussian process modeling, which is a widely used method in computer experiments, naturally enables prediction and statistical uncertainty quantification. In the Gaussian process modeling, the underlying function is assumed to be a realization of a Gaussian process. Based on the Gaussian process assumption, the conditional distribution can be constructed at each unobserved point in a region, which provides a natural predictor via conditional expectation and a pointwise confidence interval. The pointwise confidence intervals can be used for statistical uncertainty quantification.
However, in practice, the Gaussian process is usually misspecified. The responses of a computer model usually come from a deterministic function, which may not be a sample path of the Gaussian process used in the modeling, or may lie in a smaller function space that has probability zero under that process (Ba et al., 2012; Bower et al., 2010; Higdon, 2002; Tan, 2018; Xu et al., 2019). For example, a function in an infinite-dimensional reproducing kernel Hilbert space is typically smoother than a sample path of the corresponding Gaussian process (Steinwart, 2019). Here, the word "corresponding" means that the covariance function of the Gaussian process and the kernel function of the reproducing kernel Hilbert space are the same, up to a constant multiplier. Therefore, if one applies Gaussian process modeling to a function in the corresponding reproducing kernel Hilbert space, a model misspecification issue occurs.
Despite this model misspecification, maximum likelihood estimation is commonly used to estimate unknown parameters in the covariance function within the Gaussian process model (Santner et al., 2003). Applying maximum likelihood estimation when a model is misspecified may be problematic because the estimated parameter can diverge as the sample size goes to infinity. For example, Xu and Stein (2017) show that if the underlying function is f(x) = x^γ on [0, 1] and a Gaussian correlation function is used in the Gaussian process modeling, the estimated variance can go to either zero or infinity as the sample size increases to infinity. Another question is: when the Gaussian process model is misspecified, are the confidence intervals with estimated parameters reliable? In practice, it is often observed that Gaussian process models have poor coverage of their confidence intervals (Gramacy and Lee, 2012; Joseph and Kang, 2011; Yamamoto, 2000). One possible reason is that the Gaussian process model may be misspecified; thus the confidence intervals may be inadequate for quantifying the uncertainty of predictions.
In this work, we investigate the prediction and confidence intervals of misspecified Gaussian process models used to recover deterministic functions, from a frequentist view, i.e., assuming the underlying function is fixed but unknown. This differs from the Bayesian perspective, where a Gaussian process prior is imposed on the function space. Specifically, we consider the following settings.
Settings 1.1. The underlying deterministic function f is fixed and lies in a reproducing kernel Hilbert space. A Gaussian process model is applied for prediction and uncertainty quantification. The kernel function of the reproducing kernel Hilbert space and the correlation function in the Gaussian process model are the same.
In Settings 1.1, we assume that the variance in the Gaussian process model is unknown and needs to be estimated. For the formal settings considered in this work and further discussion, see Section 2.5. As stated before, model misspecification occurs under Settings 1.1. This misspecification does not change the form of the predictor (which may be one reason it often goes unnoticed), but it significantly changes the uncertainty quantification results. We consider two cases: in the first, the observations are noiseless; in the second, the observations are corrupted by noise. When the observations are noiseless, we show that if an estimated variance obtained by maximum likelihood estimation is used in the confidence intervals, then the confidence intervals are not reliable. Here, reliability is used in the sense introduced in Section 2.3. This suggests that, if the Gaussian process model is misspecified, the confidence interval needs to be carefully constructed to quantify the uncertainties, not merely derived from the corresponding Gaussian process model. In many cases, computer experiments are stochastic, in the sense that stochastic errors are introduced to simulate the randomness in real systems. A recent overview of stochastic emulators is Baker et al. (2020). In stochastic computer experiments, a regularization parameter is used to counteract the influence of the noise. The value of this regularization parameter is usually a constant (Ankenman et al., 2010; Baker et al., 2020; Dancik, 2007). It is known that the optimal convergence rate for non-parametric regression is determined by the smoothness of the underlying function, denoted by ν. The optimal convergence rate is n^{−ν/(2ν+d)}, where n is the sample size and d is the dimension of the input space (Stone, 1982).
In this work, we show that if the regularization parameter value is chosen to be a constant, the corresponding predictor is not optimal, in the sense that the convergence rate of the prediction error is of a higher order than the optimal rate. Furthermore, with an estimated variance obtained by maximum likelihood estimation, we show that under different choices of the regularization parameter value (not restricted to be a constant), the corresponding predictor is not optimal, or the confidence interval is not reliable. We also derive some lower bounds on the convergence rates of the prediction error of Gaussian process modeling. These results suggest that we may lose the prediction efficiency or reliability of uncertainty quantification if the Gaussian process model is misspecified and maximum likelihood estimation is used.
The rest of this paper is arranged as follows. In Section 2, we introduce Gaussian process modeling, maximum likelihood estimation, and reproducing kernel Hilbert spaces, as well as our definition of the reliability of confidence intervals. The main results of this work are also summarized in Section 2. In Sections 3 and 4, we present the main results under the case where the observations are noiseless and the case where the observations are noisy, respectively. Simulation studies are reported in Section 5. Conclusions and discussion are given in Section 6. The technical proofs are given in the Appendix.

Preliminaries
This section provides a brief introduction to Gaussian process modeling, maximum likelihood estimation, and reproducing kernel Hilbert spaces, which are used in developing the main results. We also provide our definition of the reliability of confidence intervals. Problem settings and a summary of the main results are presented at the end of this section.

Gaussian process modeling
In this work, we consider applying Gaussian process modeling to a fixed function f, defined on a convex and compact set Ω ⊂ R^d with positive Lebesgue measure. Because the domain Ω is fixed, the corresponding asymptotic framework is called fixed-domain asymptotics (Stein, 1995, 1999). Suppose we have n observed pairs (x_k, y_k), k = 1, ..., n, given by

y_k = f(x_k) + ε_k,    (1)

where x_k ∈ Ω are distinct measurement locations (i.e., x_k ≠ x_j for k ≠ j) and ε_k ∼ N(0, σ_ε²) are i.i.d. normally distributed random errors with variance σ_ε² ≥ 0. If the observations are not corrupted by noise, we have σ_ε² = 0; otherwise σ_ε² > 0. One popular method to recover the function f is stationary Gaussian process modeling. Let Z be a stationary Gaussian process defined on R^d. For ease of mathematical treatment, we consider simple kriging. Therefore, we assume Z has mean zero, variance σ² and correlation function Ψ, denoted by Z ∼ GP(0, σ²Ψ). The correlation function Ψ is stationary, i.e., the value of Ψ(x, x′) depends only on the difference x − x′; thus we can write Ψ(x − x′) := Ψ(x, x′). We also assume Ψ is strictly positive definite and integrable on R^d, and Ψ(0) = 1. By Bochner's theorem (page 208 of Gihman and Skorokhod (1974); Theorem 6.6 of Wendland (2004)) and Theorem 6.11 of Wendland (2004), there exists a function f_Ψ such that

Ψ(h) = ∫_{R^d} f_Ψ(ω) e^{iω^T h} dω

for any h ∈ R^d. The function f_Ψ is known as the spectral density of Z or Ψ. In this work, we suppose that f_Ψ decays algebraically, i.e., satisfies the following condition.

Condition 2.1. There exist constants c_1, c_2 > 0 and ν > d/2 such that, for all ω ∈ R^d,

c_1 (1 + ‖ω‖_2²)^{−ν} ≤ f_Ψ(ω) ≤ c_2 (1 + ‖ω‖_2²)^{−ν}.
One example of a correlation function satisfying Condition 2.1 is the isotropic Matérn correlation function (Stein, 1999), given by

Ψ_M(h) = (1/(Γ(ν̃) 2^{ν̃−1})) (2√ν̃ φ ‖h‖_2)^{ν̃} K_{ν̃}(2√ν̃ φ ‖h‖_2),    (2)

with the spectral density (Tuo and Wu, 2016)

f_Ψ(ω) ∝ (4ν̃φ² + ‖ω‖_2²)^{−(ν̃+d/2)},

where φ, ν̃ > 0, and K_{ν̃} is the modified Bessel function of the second kind. By setting ν = ν̃ + d/2, we can see that Ψ_M satisfies Condition 2.1.
Suppose we are interested in the value of Z(x) on Ω and we observe data

z_k = Z(x_k) + ε_k,  k = 1, ..., n,    (3)

where x_k ∈ Ω are distinct measurement locations (i.e., x_k ≠ x_j for k ≠ j) and ε_k ∼ N(0, σ_ε²) are i.i.d. normally distributed random errors with variance σ_ε² ≥ 0. Conditional on Z = (z_1, ..., z_n)^T, Z(x) is normally distributed at each point x. Note that Z is a random vector, where the randomness is induced by Z(x_j) and ε_j. The conditional expectation of Z(x) is given by

E[Z(x) | Z] = r(x)^T (R + µI_n)^{−1} Z,    (4)

where r(x) = (Ψ(x − x_1), ..., Ψ(x − x_n))^T, R = (Ψ(x_j − x_k))_{jk}, I_n is the n × n identity matrix, and µ = σ_ε²/σ². The conditional expectation is a natural predictor of Z(x), and it can be shown that the conditional expectation (4) is the best linear unbiased predictor (Santner et al., 2003; Stein, 1999). A predictor given by Gaussian process modeling is then the conditional expectation of Z(x).
In addition to prediction, uncertainty quantification plays an essential role in statistics. Gaussian process modeling enables statistical uncertainty quantification via confidence intervals (Rasmussen, 2006; Santner et al., 2003). Conditional on Z, the conditional variance of Z(x) is given by

Var[Z(x) | Z] = σ² (1 − r(x)^T (R + µI_n)^{−1} r(x)),

where R, r(x) and µ are as in (4). Let Φ denote the cumulative distribution function of the standard normal distribution N(0, 1) and let q_β = Φ^{−1}(1 − β/2) denote the (1 − β/2)th quantile, where β ∈ (0, 1). A level (1 − β)100% pointwise confidence interval at a point x ∈ Ω can be constructed as

[E[Z(x) | Z] − c_{n,β}(x), E[Z(x) | Z] + c_{n,β}(x)],  where  c_{n,β}(x) = q_β σ (1 − r(x)^T (R + µI_n)^{−1} r(x))^{1/2},    (5)

and r(x) and R are as in (4). The confidence interval is often used in numerical simulations to display the uncertainty quantification results of Gaussian process modeling. A few examples are Ba et al. (2012); Gramacy and Apley (2015); Hung et al. (2015); Williams et al. (2006).
Recall that we apply Gaussian process modeling to the deterministic function f. Therefore, we treat Y = (y_1, ..., y_n)^T as observations from the Gaussian process Z.
The predictor of f(x) becomes

f̂(x) = r(x)^T (R + µI_n)^{−1} Y,    (6)

and the confidence interval becomes

CI_{n,β}(x) = [f̂(x) − c_{n,β}(x), f̂(x) + c_{n,β}(x)],    (7)

where c_{n,β}(x) is as in (5).
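As a concrete numerical illustration of the predictor (6) and the pointwise confidence interval (7), the following Python sketch implements both for the noiseless case (µ = 0) with a Matérn correlation. The test function sin(4x), the design, and the values of σ², φ and ν̃ are illustrative assumptions, not quantities prescribed by this work.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv
from scipy.stats import norm

def matern(h, nu=2.5, phi=1.0):
    """Isotropic Matern correlation of (2); nu and phi are illustrative values."""
    s = 2.0 * np.sqrt(nu) * phi * np.asarray(h, dtype=float)
    out = np.ones_like(s)                 # correlation equals 1 at distance 0
    pos = s > 0
    out[pos] = s[pos] ** nu * kv(nu, s[pos]) / (gamma(nu) * 2 ** (nu - 1))
    return out

X = np.linspace(0.05, 0.95, 8).reshape(-1, 1)   # measurement locations
y = np.sin(4.0 * X).ravel()                     # stand-in for the fixed function f
R = matern(cdist(X, X))                         # correlation matrix; mu = 0 (noiseless)
alpha = np.linalg.solve(R, y)

def predict_ci(xs, sigma2=1.0, beta=0.05):
    """Predictor (6) and pointwise confidence interval (7) with mu = 0."""
    r = matern(cdist(np.atleast_2d(xs), X))     # shape (m, n)
    pred = r @ alpha
    quad = np.einsum("ij,ij->i", r, np.linalg.solve(R, r.T).T)
    half = norm.ppf(1 - beta / 2) * np.sqrt(sigma2 * np.clip(1.0 - quad, 0.0, None))
    return pred, pred - half, pred + half
```

At the measurement locations the predictor interpolates the data and the interval width collapses to zero, which is the behavior exploited later in the noiseless-case analysis.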

Maximum likelihood estimation
Recall that we consider an underlying function that is deterministic and lies in some reproducing kernel Hilbert space (see Settings 1.1). Gaussian process modeling is used for prediction and uncertainty quantification, where the correlation function is the same as the kernel function of the reproducing kernel Hilbert space. Despite the model misspecification issue, i.e., the underlying function is typically smoother than a sample path of the corresponding Gaussian process, maximum likelihood estimation (Rasmussen, 2006; Santner et al., 2003; Stein, 1999) is often used to specify the parameter values that are not pre-determined in the covariance function σ²Ψ(· − ·). Let Ψ_θ be a family of correlation functions indexed by θ = (θ_1, ..., θ_q)^T ∈ Θ ⊂ R^q, and R(θ) = (Ψ_θ(x_j − x_k))_{jk}. By direct calculation and reparametrization, it can be shown that, up to an additive constant, the log-likelihood function is (page 169 of Stein (1999); page 66 of Santner et al. (2003))

ℓ(θ, σ², µ) = −(1/2) [n log σ² + log det(R(θ) + µI_n) + Y^T (R(θ) + µI_n)^{−1} Y / σ²].    (8)

The maximum likelihood estimate of the unknown parameters θ, σ², µ can be found by maximizing the log-likelihood function. In practice, it is often assumed that the scale parameter, which is φ in (2) if a Matérn correlation function is used, and the variance σ² are unknown and need to be estimated (Bachoc, 2013b; Bostanabad et al., 2018; Hung, 2011; Joseph, 2006; Lee and Owen, 2018; Li and Sudjianto, 2005; Rasmussen, 2006; Santner et al., 2003; Stein, 1999). The smoothness parameter ν̃ in the Matérn correlation function (2) is usually pre-determined. We will discuss the parameter µ later. The consistency of parameter estimation has been studied in the literature under the assumption that the underlying truth is a realization of a Gaussian process (Anderes et al., 2010; Bevilacqua et al., 2019; Kaufman and Shaby, 2013; Mardia and Marshall, 1984; Wang et al., 2011; Ying, 1991, 1993; Zhang, 2004). In particular, Bevilacqua et al. (2019); Kaufman and Shaby (2013); Wang et al. (2011); Ying (1991); Zhang (2004) show that the parameter estimation can be inconsistent and that only the microergodic parameter can be estimated. For details on microergodic parameters, see Matheron (2012); also see pages 163-165 of Stein (1999).
If the underlying truth is a fixed function that is not a sample path of the Gaussian process used in the Gaussian process modeling, to the best of our knowledge, the only related work is Xu and Stein (2017). In Xu and Stein (2017), it is shown that if the underlying function is f (x) = x γ with γ ≥ 0 defined on [0, 1] and a Gaussian correlation function is used in the Gaussian process modeling, the estimated variance can either go to infinity or converge to zero as the sample size increases. Other works related to the parameter estimation under model misspecification include Bachoc (2013a); Bachoc et al. (2018).
In this work, we do not consider the consistency of parameter estimation, but we investigate the influence of parameter estimation on the prediction and uncertainty quantification of Gaussian process modeling under model misspecification. For ease of mathematical treatment, we assume Ψ is known. It follows from standard arguments that the maximizer of (8) with respect to σ² is

σ̂² = Y^T (R + µ̂_n I_n)^{−1} Y / n.    (9)

Similar settings have also been considered by Karvonen et al. (2020), who call σ a scale parameter and consider estimating only σ. In practice, µ is usually imposed as a constant (Dancik, 2007), estimated by the sample average if there are replicates at each measurement location (Ankenman et al., 2010), or estimated via maximum likelihood estimation (Wang and Haaland, 2019). We mainly focus on the first approach (using an imposed µ̂_n) and note that the results can easily be generalized to the second approach. The third approach is much more complicated, and we do not consider it in the present work. Here we use µ̂_n to stress the difference, because it may not be the true value of µ. Recall that the underlying truth is a deterministic function from a frequentist point of view. Thus there is no definition of the "true" value of σ², and it is likewise unclear what the "true" µ = σ_ε²/σ² would be. In this work, we call µ̂_n a regularization parameter, because this parameter is imposed. Nevertheless, we allow the value of the regularization parameter µ̂_n to increase or decrease with the sample size; it is not restricted to be a constant.
By plugging in the estimated (and imposed) parameters, we can obtain the corresponding predictor and confidence interval. We use f̂_n(x) to denote the predictor with imposed µ̂_n at an unobserved point x, i.e.,

f̂_n(x) = r(x)^T (R + µ̂_n I_n)^{−1} Y,    (10)

where r(x) and R are as in (4). By plugging σ̂² as in (9) and µ̂_n into (7) and (5), we obtain the estimated pointwise confidence interval at a point x ∈ Ω

ĈI_{n,β}(x) = [f̂_n(x) − ĉ_{n,β}(x), f̂_n(x) + ĉ_{n,β}(x)],    (11)

where

ĉ_{n,β}(x) = q_β σ̂ (1 − r(x)^T (R + µ̂_n I_n)^{−1} r(x))^{1/2}    (12)

and f̂_n(x) is as in (10). In (12), we impose a regularization parameter µ̂_n > 0 if σ_ε² > 0, and set µ̂_n = 0 if σ_ε² = 0. Note that the estimated variance σ̂² is not present in (10); thus it does not influence the predictor. However, σ̂² appears in (11), and as we will see later, it influences the reliability of the confidence interval and thus the uncertainty quantification results of Gaussian process modeling.
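The plug-in quantities (9)-(12) take only a few lines to compute. The sketch below uses a Matérn correlation with smoothness 3/2 (which has a closed form, avoiding Bessel functions) and an imposed constant regularization parameter µ̂_n; the length-scale, noise level, and test function are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import norm

def corr(h, ell=0.2):
    """Matern correlation with smoothness 3/2; length-scale ell is an assumed value."""
    s = np.asarray(h, dtype=float) / ell
    return (1.0 + s) * np.exp(-s)

rng = np.random.default_rng(1)
n = 30
X = rng.uniform(0.0, 1.0, (n, 1))                       # measurement locations
y = np.sin(4.0 * X).ravel() + 0.1 * rng.normal(size=n)  # noisy observations
mu_n = 0.5                                              # imposed constant regularization parameter
K = corr(cdist(X, X)) + mu_n * np.eye(n)
alpha = np.linalg.solve(K, y)
sigma2_hat = y @ alpha / n                              # variance estimate, eq. (9)

def plug_in_ci(x, beta=0.05):
    """Predictor (10) and estimated interval (11)-(12) at a point x."""
    r = corr(cdist(np.atleast_2d(x), X)).ravel()
    pred = r @ alpha
    half = norm.ppf(1 - beta / 2) * np.sqrt(
        sigma2_hat * max(1.0 - r @ np.linalg.solve(K, r), 0.0))
    return pred - half, pred + half
```

Note that σ̂² enters only through the interval half-width, mirroring the observation that the estimated variance does not affect the predictor (10) but does affect (11).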

Reliability of confidence intervals
In practice, the Gaussian process model is often misspecified. The underlying fixed function may lie in a subspace of the support of the corresponding Gaussian process that has probability zero, or may not even be in the support. This model misspecification may influence the reliability of the confidence interval and thus the quality of uncertainty quantification. However, it is not possible to quantify the reliability of confidence intervals without a clear definition of the term "reliability". In this section, we first review some possible ways to define reliability, and then propose our definition of the reliability of confidence intervals.
Recall that in this work, we assume that the underlying truth is a deterministic function. Therefore, we mainly consider the reliability of confidence intervals for a fixed function. Let g ∈ G be a fixed function, where G is a Hilbert space of functions equipped with norm ‖·‖_G. Let I_X g be a linear predictor of a function g ∈ G, where X = {x_1, ..., x_n} ⊂ Ω is the set of measurement locations. The predictor I_X g depends on X and the observations. Suppose the observations are y_j^{(g)} for j = 1, ..., n, given by

y_j^{(g)} = g(x_j) + ε_j,    (13)

where the ε_j's are i.i.d. realizations of a noise variable with mean zero and variance σ_ε² ∈ [0, ∞). Typically, I_X g is a linear combination of the observations, i.e., it has the form

I_X g(x) = Σ_{j=1}^n b_j(x) y_j^{(g)}    (14)

for a point x ∈ Ω, where the b_j's are functions not depending on g but possibly depending on X. Let CI_{X,β}(x) = [I_X g(x) − a_β(x), I_X g(x) + a_β(x)] be an imposed level (1 − β)100% pointwise confidence interval at a point x ∈ Ω (determined by the uncertainty quantification method that a user applies), where β ∈ (0, 1) and a_β is a non-negative function. Clearly, this imposed confidence interval may not have confidence level (1 − β)100%. We want to define the reliability of this imposed confidence interval. Note that CI_{X,β} and a_β can depend on X, but we suppress the dependency for notational simplicity. Also note that the confidence interval at a point x is centered at I_X g(x).
Probably the most natural way to define reliability is via the definition of confidence intervals. This approach is considered by Sniekers and van der Vaart (2015). Consider the probability P_g(g(x) ∈ CI_{X,β}(x)) for a point x ∈ Ω, where P_g refers to the distribution of y_1^{(g)}, ..., y_n^{(g)} as in (13), with the "true" g given. If confidence intervals are reliable, the probability P_g(g(x) ∈ CI_{X,β}(x)) should be close to the nominal level (1 − β)100%, or at least larger than (1 − β)100% (conservative). In Sniekers and van der Vaart (2015), a function defined on [0, 1] and a Brownian motion prior are considered, with equally spaced measurement locations. Under these settings, Sniekers and van der Vaart (2015) show that CI_{X,β} can be conservative or unreliable, depending on the smoothness of g. However, as stated in Sniekers and van der Vaart (2015), the exact formulas strongly depend on the equally spaced measurement locations and cannot easily be extended to a more general choice of measurement locations.
Another probability-based definition of the reliability of confidence intervals is via the average coverage probability (ACP) (Nychka, 1988). In Nychka (1988), confidence intervals are considered reliable if the ACP for the function g and the confidence interval CI_{X,β},

(1/n) Σ_{j=1}^n P(g(x_j) ∈ CI_{X,β}(x_j)),

is close to the nominal level (1 − β)100%, where x_j and g(x_j) are as in (13). This definition only quantifies the reliability of the confidence interval at the measurement locations and does not account for the confidence interval CI_{X,β}(x) at any unobserved point x ∈ Ω. Therefore, the ACP is not suitable for quantifying the uncertainties, because if the observations are noiseless and an interpolant is used, the ACP always equals one.
Coverage rates are often used to assess the reliability of the confidence interval in the field of computer experiments (Joseph and Kang, 2011; Lee and Owen, 2018; Sung et al., 2020). The coverage rate is defined by

Vol({x ∈ Ω : g(x) ∈ CI_{X,β}(x)}) / Vol(Ω),

where Vol(A) denotes the volume of a set A ⊂ Ω with respect to the Lebesgue measure. A practical way to compute the coverage rate is by random sampling. Suppose x_1, ..., x_N are N uniformly distributed points in Ω. Then the coverage rate can be approximated by

card({x_j : g(x_j) ∈ CI_{X,β}(x_j), j = 1, ..., N}) / N,    (15)

where card(B) denotes the cardinality of a set B. However, we find it hard to theoretically investigate the quantity (15), because {x : g(x) ∈ CI_{X,β}(x)} can be irregular and hard to characterize.
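The random-sampling approximation (15) is straightforward to carry out. In the sketch below, the "true" function, the predictor, and the interval half-width are all hypothetical stand-ins, since the point is only the Monte Carlo estimate itself.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20000
xs = rng.uniform(0.0, 1.0, N)          # N uniform points in Omega = [0, 1]
g = np.sin(4.0 * xs)                    # hypothetical true function
pred = g + 0.05 * np.cos(7.0 * xs)      # hypothetical predictor (error term is illustrative)
half = 0.04                             # hypothetical constant interval half-width

# Monte Carlo approximation (15): fraction of sampled points whose interval covers g.
covered = np.abs(g - pred) <= half
coverage_rate = covered.mean()
```

A coverage rate well below the nominal level (1 − β) would flag intervals that are too narrow, which is exactly the failure mode studied later in this work.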
In this work, we consider the ratio of the prediction error to the width of the confidence interval, given by (g − I_X g)/|CI_{X,β}|, where |CI_{X,β}| = 2a_β denotes the width of CI_{X,β}. We use the convention 0/0 = 0 if |CI_{X,β}(x)| = 0 for some x ∈ Ω. If the confidence interval CI_{X,β} is reliable, the width of the confidence interval should be large enough to cover the difference between the predictor I_X g and the true function g with high probability, so that the ratio |g(x) − I_X g(x)|/|CI_{X,β}(x)| is small for x ∈ Ω. In particular, we consider the expectation

E‖(g − I_X g)/|CI_{X,β}|‖^p_{L_p(Ω)}

for 2 ≤ p ≤ ∞ (we assume it exists; if it does not, the confidence interval is considered not reliable), where the expectation is taken with respect to the noise and the set of measurement locations X, and ‖f‖_{L_p(Ω)} is the L_p-norm of f ∈ L_p(Ω), defined by

‖f‖_{L_p(Ω)} = (∫_Ω |f(x)|^p dx)^{1/p}

for p < ∞, and ‖f‖_{L_∞(Ω)} = ess sup_{x∈Ω} |f(x)|. The expectation E‖(g − I_X g)/|CI_{X,β}|‖^p_{L_p(Ω)} is the L_p-norm on the probability space (A, B, P), where A is the sample space, B is the Borel algebra, and P is the probability measure induced by the noise and X. Note that the randomness in (g − I_X g)/|CI_{X,β}| does not come from the function g, because g is fixed from a frequentist perspective. Because we are interested in the scenario where the number of measurement locations increases, we consider an infinite sequence of sets of measurement locations, denoted by X = {X_1, X_2, ..., X_n, ...}. Without loss of generality, we assume that card(X_n) = n, where n takes values in an infinite subset of N_+. We call X a sampling scheme, as in Tuo and Wang (2019). In the rest of this work, we suppress the dependency of X on n for notational simplicity. If the confidence interval CI_{X,β} is reliable, [E‖(g − I_X g)/|CI_{X,β}|‖^p_{L_p(Ω)}]^{1/p} should be small; at the least, it should be bounded by a constant that does not depend on the sample size. From a standard frequentist perspective, we consider the minimax setting, i.e., we consider the worst case.
Depending on the prior knowledge of the function g, we consider two subcases. Recall that g ∈ G, where G is a Hilbert space of functions equipped with norm ‖·‖_G. The first subcase is that ‖g‖_G is bounded above by a known constant. Without loss of generality, assume this known constant is one, i.e., the function g lies in the unit ball of G. We say the confidence interval CI_{X,β} is L_p-weakly-reliable if

sup_{g∈G, ‖g‖_G ≤ 1} [E‖(g − I_X g)/|CI_{X,β}|‖^p_{L_p(Ω)}]^{1/p} ≤ C    (16)

holds for all X ∈ X and all n, where C is a constant not depending on n. In other words, the confidence interval is weakly-reliable if it is reliable on a ball of G with a certain radius. However, in practice we cannot always expect ‖g‖_G to be bounded by a known constant. Since g is a fixed function in G, we know ‖g‖_G is finite. Therefore, for any increasing sequence {a_n}_{n≥1} not depending on g with lim_{n→∞} a_n = ∞, there exists an N such that ‖g‖_G ≤ a_n for all n ≥ N. We say the confidence interval CI_{X,β} is L_p-strongly-reliable if there exists an increasing sequence {a_n}_{n≥1} not depending on g such that lim_{n→∞} a_n = ∞ and

sup_{g∈G, ‖g‖_G ≤ a_n} [E‖(g − I_X g)/|CI_{X,β}|‖^p_{L_p(Ω)}]^{1/p} ≤ C′    (17)

holds for all X ∈ X and all n, where C′ is a constant not depending on n. Here we note that the constants C and C′ can depend on X. Roughly speaking, a confidence interval is strongly-reliable if it is eventually reliable on the entire space G as the sample size increases to infinity. We summarize the above arguments in the following definition. Note that in Definition 2.1, we suppress the dependency of X on n for notational simplicity.
Definition 2.1. Let β ∈ (0, 1) be fixed, and let X be a sampling scheme. Let I_X g be a linear predictor as in (14) and n = card(X) the sample size. Let CI_{X,β} be an imposed level (1 − β)100% pointwise confidence interval satisfying lim_{n→∞} sup_{x∈Ω} |CI_{X,β}(x)| = 0. For 2 ≤ p ≤ ∞, CI_{X,β} is said to be L_p-weakly-reliable if (16) holds for all n, and is said to be L_p-strongly-reliable if there exists an increasing sequence {a_n}_{n≥1} not depending on g with lim_{n→∞} a_n = ∞ such that (17) holds for all n, where C and C′ are constants not depending on n but possibly depending on p, β, and X. The expectation is taken with respect to the noise and X.
Remark 2.1. In Definition 2.1, the reason we require the width of the confidence interval to satisfy lim_{n→∞} sup_{x∈Ω} |CI_{X,β}(x)| = 0 is that we want the confidence interval to provide some information; otherwise, one could select an extremely wide confidence interval (for example, CI_{X,β}(x) = [I_X g(x) − n, I_X g(x) + n] for all x ∈ Ω) that covers g(x) but does not provide any useful information.
Definition 2.1 is motivated by the properties of confidence intervals of Gaussian processes. Let Z ∼ GP(0, σ²Ψ) be a Gaussian process defined on Ω. At a point x ∈ Ω, let I_X^{(1)} Z(x) denote the conditional expectation of Z(x) as in (4). It can be seen that I_X^{(1)} Z is a linear predictor and has the form (14). Let CI_{n,β}(x) be the confidence interval as in (7). Furthermore, assume the observations are not corrupted by noise, which implies σ_ε² = 0 and µ = 0. Consider E‖(Z − I_X^{(1)} Z)/|CI_{n,β}|‖^p_{L_p(Ω)}. We have the following proposition.
Proposition 2.1. Let Z, I_X^{(1)} Z, and CI_{n,β} be as described above, and let β ∈ (0, 1). Then

[E‖(Z − I_X^{(1)} Z)/|CI_{n,β}|‖^p_{L_p(Ω)}]^{1/p} ≤ C    (18)

holds for all n and any 2 ≤ p < ∞, where C is a constant depending only on p, β and Ω.
It can be seen that our definition of reliability stated in Definition 2.1 is analogous to (18). According to Definition 2.1, if a confidence interval CI_{X,β} is reliable, then for any fixed constant c > 0, the scaled interval cCI_{X,β}(x) = [I_X g(x) − c a_β(x), I_X g(x) + c a_β(x)] is also reliable. Furthermore, the "less than or equal to" relationship in Definition 2.1 favors wider confidence intervals. Therefore, our definition of reliability is more like a necessary condition than a sufficient one. One way to specify the constant in Definition 2.1 is to use the constant C in Proposition 2.1. However, one can argue that this constant may not be appropriate because, unlike the unbiased predictor I_X^{(1)} Z, I_X g is usually a biased predictor, and the constant in Proposition 2.1 may not be large enough to cover the bias. Practitioners may also consider other constants to counteract the model misspecification. How to choose an appropriate constant is beyond the scope of this work, and we do not discuss it further.
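The ratio (g − I_X g)/|CI_{X,β}| underlying Definition 2.1 can also be examined numerically. The sketch below builds a function g that lies in the reproducing kernel Hilbert space by construction (a finite kernel expansion), interpolates it from noiseless data, and evaluates an empirical version of the L_∞ ratio on a grid; the kernel, centers, weights, and design are all illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import norm

def corr(h, ell=0.3):
    """Matern correlation with smoothness 3/2; length-scale ell is assumed."""
    s = np.asarray(h, dtype=float) / ell
    return (1.0 + s) * np.exp(-s)

# g lies in the RKHS of corr by construction (finite kernel expansion).
centers = np.array([[0.3], [0.7]])
g = lambda x: corr(cdist(x, centers)) @ np.array([1.0, 0.5])

X = np.linspace(0.05, 0.95, 10).reshape(-1, 1)      # noiseless design
R = corr(cdist(X, X))
yv = g(X)
alpha = np.linalg.solve(R, yv)

xs = np.linspace(0.0, 1.0, 400).reshape(-1, 1)      # evaluation grid
pred = corr(cdist(xs, X)) @ alpha                   # interpolant I_X g
sigma2_hat = yv @ alpha / len(yv)                   # plug-in variance estimate
r = corr(cdist(xs, X))
quad = np.einsum("ij,ij->i", r, np.linalg.solve(R, r.T).T)
width = 2 * norm.ppf(0.975) * np.sqrt(sigma2_hat * np.clip(1.0 - quad, 0.0, None))
err = np.abs(g(xs) - pred)
# Convention 0/0 := 0 where the interval width vanishes (at the design points).
ratio = np.where(width > 1e-10, err / np.maximum(width, 1e-300), 0.0)
sup_ratio = ratio.max()                             # empirical L_infinity ratio
```

A sup-ratio that stays bounded as the design is refined is consistent with reliability in the sense of Definition 2.1; an exploding ratio is the unreliability the later sections establish theoretically.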

Reproducing kernel Hilbert spaces and power functions
In this subsection, we review reproducing kernel Hilbert spaces and power functions, which are closely related to the Gaussian process model. Under the settings of computer experiments, if µ = 0 in the Gaussian process model, the right-hand side of (6) is called a kriging interpolant, denoted by

I_{Ψ,X} f(x) = r(x)^T R^{−1} Y,    (19)

where X = {x_1, ..., x_n} denotes the set of measurement locations. Note that the x_k ∈ Ω are distinct measurement locations and Ψ is strictly positive definite; thus R is invertible. In the area of scattered data approximation, interpolation using the operator I_{Ψ,X} is also called radial basis function approximation. A standard theory of radial basis function approximation works by employing reproducing kernel Hilbert spaces. One way to define the reproducing kernel Hilbert space generated by a stationary correlation function is via the Fourier transform F(f)(ω) = (2π)^{−d/2} ∫_{R^d} f(x) e^{−ix^T ω} dx; the definition of the reproducing kernel Hilbert space can be generalized, see Girosi et al. (1995) and Theorem 10.12 of Wendland (2004). Specifically,

N_Ψ(R^d) = { f ∈ L_2(R^d) ∩ C(R^d) : ∫_{R^d} |F(f)(ω)|² / f_Ψ(ω) dω < ∞ },

with the inner product

⟨f, g⟩_{N_Ψ(R^d)} = ∫_{R^d} F(f)(ω) \overline{F(g)(ω)} / f_Ψ(ω) dω.

For a positive number ν > d/2, the Sobolev space on R^d with smoothness ν can be defined as

H^ν(R^d) = { f ∈ L_2(R^d) : ‖f‖²_{H^ν(R^d)} = ∫_{R^d} (1 + ‖ω‖_2²)^ν |F(f)(ω)|² dω < ∞ }.

It can be shown that H^ν(R^d) coincides with the reproducing kernel Hilbert space N_Ψ(R^d) if Ψ satisfies Condition 2.1 (Wendland (2004), Corollary 10.13; also see Lemma C.3). This equivalence allows us to evaluate whether a predictor in a reproducing kernel Hilbert space is optimal; see Section 4 for more details.
Reproducing kernel Hilbert spaces can also be defined on a suitable subset (for example, convex and compact) Ω ⊂ R^d, denoted by N_Ψ(Ω), with norm

‖f‖_{N_Ψ(Ω)} = inf{ ‖f_E‖_{N_Ψ(R^d)} : f_E ∈ N_Ψ(R^d), f_E|_Ω = f },

where f_E|_Ω denotes the restriction of f_E to Ω. Sobolev spaces on Ω can be defined in a similar way.
If f ∈ N_Ψ(Ω), there is a simple error bound (Wendland (2004), Theorem 11.4):

|f(x) − I_{Ψ,X} f(x)| ≤ P_{Ψ,X}(x) ‖f‖_{N_Ψ(Ω)}

for each x ∈ Ω, where P_{Ψ,X} is a function independent of f. The square of P_{Ψ,X} is called the power function, given by

P²_{Ψ,X}(x) = 1 − r(x)^T R^{−1} r(x),

where r(x) and R are as in (4). In addition, we define

P̄_{Ψ,X} := sup_{x∈Ω} P_{Ψ,X}(x).

Note that the power function P_{Ψ,X} and its supremum P̄_{Ψ,X} depend only on X, Ω and Ψ, and do not depend on the observations.
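The power function is directly computable from r(x) and R. The sketch below evaluates the squared power function for an assumed Matérn-type correlation on a one-dimensional design, confirming that it vanishes at the measurement locations and is positive in between; the length-scale and design are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def corr(h, ell=0.2):
    """Matern correlation with smoothness 3/2; length-scale ell is assumed."""
    s = np.asarray(h, dtype=float) / ell
    return (1.0 + s) * np.exp(-s)

X = np.linspace(0.0, 1.0, 9).reshape(-1, 1)   # measurement locations
R = corr(cdist(X, X))

def power2(x):
    """Squared power function 1 - r(x)^T R^{-1} r(x); zero at design points."""
    r = corr(cdist(np.atleast_2d(x), X)).ravel()
    return 1.0 - r @ np.linalg.solve(R, r)

p2_node = power2(X[0])          # ~0 at a measurement location
p2_mid = power2([[1.0 / 16]])   # positive between measurement locations
```

Since the error bound scales with P_{Ψ,X}(x)‖f‖_{N_Ψ(Ω)}, the spatial profile of the power function shows exactly where the interpolant can be inaccurate.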

Problem settings and summary of results
In this work, we consider the inference of misspecified Gaussian process models. Specifically, we consider prediction and uncertainty quantification when applying Gaussian process modeling to a fixed function f ∈ N Ψ (Ω), under the following misspecified model assumption.
Assumption* 2.1 (Misspecified model assumption). The function f is a realization of a Gaussian process with mean zero and covariance function σ²Ψ for some finite σ² > 0.
We use an asterisk "*" to denote that Assumption* 2.1 is a misspecified assumption and is not true. After the earliest version of this work was submitted, Assumption* 2.1 was also considered by Karvonen et al. (2020). Under Assumption* 2.1, we incorrectly assume that f is a realization of Z ∼ GP(0, σ²Ψ) for f ∈ N_Ψ(Ω). Assumption* 2.1 is a misspecified model assumption because if Z ∼ GP(0, σ²Ψ), then P(Z ∈ N_Ψ(Ω)) = 0, since Ψ satisfies Condition 2.1 and Ω is convex and compact with positive Lebesgue measure (Driscoll, 1973). In fact, the smoothness of the sample paths differs from the smoothness of the correlation function by at least d/2 if ν in Condition 2.1 is larger than d (Steinwart, 2019). The two assumptions, f ∈ N_Ψ(Ω) and f being a realization of Z ∼ GP(0, σ²Ψ), yield the same predictor, but the prediction error analysis methodologies are completely different. For a discussion of these two assumptions, see Scheuerer et al. (2013).
Under the misspecified model assumption Assumption * 2.1, one can use maximum likelihood estimation to "estimate" the unknown parameters and impose confidence intervals. Of course, this is questionable, but it is widely used in practice, as stated in Section 1, and also in numerical examples shown in research papers. In these synthetic numerical examples, the test function is typically chosen to be a fixed function with a closed form (and usually infinitely differentiable), which naturally satisfies the condition f ∈ N_Ψ(Ω). Under Assumption * 2.1, we show the following results: (i) If the observations are not corrupted by noise, then the confidence interval is not L_p-weakly-reliable for p ∈ (2, ∞], and is not L_2-strongly-reliable. (ii) If the observations are corrupted by noise, then the confidence interval is not L_2-strongly-reliable, or the predictor is not optimal, in the sense that the predictor does not achieve the optimal convergence rate under the L_2 metric.
In the rest of this work, we will use the following definitions. For two positive sequences a_n and b_n, we write a_n ≍ b_n if, for some constants C, C' > 0, C ≤ a_n/b_n ≤ C'. Similarly, we write a_n ≳ b_n and b_n ≲ a_n if a_n ≥ C b_n for some constant C > 0. For notational simplicity, we will use C, C', C_1, C_2, ... and η, η_0, η_1, ... to denote constants whose values can change from line to line.

When the observations are noiseless
In this section, we consider the case that the observations have no noise. We call this the deterministic case, because repeated measurements at the same location always lead to the same response.

The unreliability of the confidence interval
We focus on the Matérn correlation function, defined in (2). Since φ and ν are known, we can let φ = 1/(2√ν), because otherwise we can stretch the region Ω to adjust the scale parameter φ. After this reparametrization, we can rewrite (2) as (22). Recall that in the deterministic case σ² = 0; thus ε_k = 0 for k = 1, ..., n, µ = 0 and µ̂_n = 0. The predictor f̂_n(x) in (10) becomes the kriging interpolant (19), i.e., for any point x ∈ Ω,
f̂_n(x) = I_{Ψ,X}f(x) = r(x)^T R^{−1} Y,   (23)
where r(x) and R are as in (4), and Y = (y_1, ..., y_n)^T. Because the observations are not corrupted by noise, we have y_k = f(x_k) for k = 1, ..., n. Note that in (23) the variance is not present, and there is no estimated or imposed parameter.
As stated in Section 2.3, for β ∈ (0, 1), an imposed confidence interval with estimated variance at a point x ∈ Ω can be constructed by plugging µ̂_n = 0 in (11). The confidence interval is given by
CI_{n,β}(x) = [f̂_n(x) − ĉ_{n,β}(x), f̂_n(x) + ĉ_{n,β}(x)], where ĉ_{n,β}(x) = q_{1−β/2} σ̂ P_{Ψ,X}(x) and σ̂² = Y^T R^{−1} Y/n.   (24)
Since the underlying function f is fixed and f ∈ N_Ψ(Ω), we can apply (20) to derive an upper bound on the prediction error |f(x) − f̂_n(x)|. Comparing this inequality with (20), and noting that ĉ_{n,β}(x) ≍ σ̂ P_{Ψ,X}(x), if the confidence interval is reliable, intuitively it can be expected that σ̂² should be close to ‖f‖²_{N_Ψ(Ω)}. However, this is not true. From the identity (Wendland, 2004)
‖f‖²_{N_Ψ(Ω)} = ‖f − I_{Ψ,X}f‖²_{N_Ψ(Ω)} + ‖I_{Ψ,X}f‖²_{N_Ψ(Ω)},   (26)
it can be seen that nσ̂² = Y^T R^{−1} Y = ‖I_{Ψ,X}f‖²_{N_Ψ(Ω)} ≤ ‖f‖²_{N_Ψ(Ω)}, so σ̂² is of order 1/n and is not close to ‖f‖²_{N_Ψ(Ω)} as n becomes larger. This indicates that ĉ_{n,β} is too small to be used in constructing confidence intervals. Following this intuition, we show that the confidence interval is not reliable, as stated in Theorem 3.1. We need the following condition. Recall that we suppress the dependency of X on n for notational simplicity.
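The quantities above are easy to compute. The following sketch (our own minimal illustration, not code from the paper; the standard Matérn smoothness 5/2, the jitter, and the test function are our choices) builds the kriging interpolant, the maximum likelihood variance estimate σ̂² = Y^T R^{−1} Y/n, and the pointwise half-width q_{1−β/2} σ̂ P_{Ψ,X}(x):

```python
import numpy as np
from scipy.stats import norm

def matern52(d):
    # Standard Matern correlation with smoothness 5/2 (a stand-in for (22))
    a = np.sqrt(5.0) * np.abs(d)
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def fit_noiseless(X, y, jitter=1e-10):
    """Cholesky factor of R, the weights R^{-1} Y, and the MLE of the variance."""
    R = matern52(X[:, None] - X[None, :]) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(R)
    w = np.linalg.solve(L.T, np.linalg.solve(L, y))   # R^{-1} Y
    sigma2_hat = y @ w / len(X)                       # hat{sigma}^2 = Y^T R^{-1} Y / n
    return L, w, sigma2_hat

def predict_ci(xs, X, L, w, sigma2_hat, beta=0.05):
    """Kriging interpolant and half-width of the plug-in confidence interval."""
    r = matern52(xs[:, None] - X[None, :])            # rows are r(x)^T
    f_hat = r @ w
    v = np.linalg.solve(L, r.T)
    power2 = np.clip(1.0 - np.einsum("ij,ij->j", v, v), 0.0, None)  # P_{Psi,X}(x)^2
    half = norm.ppf(1.0 - beta / 2.0) * np.sqrt(sigma2_hat * power2)
    return f_hat, half
```

Because σ̂² behaves like ‖I_{Ψ,X}f‖²_{N_Ψ(Ω)}/n, the half-width shrinks by an extra factor of n^{−1/2} relative to the error bound (20), which is exactly the intuition behind Theorem 3.1.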
Condition 3.1. Let X = {x_1, ..., x_n}, thus n = card(X). The fill distance of X, defined as
h_{X,Ω} = sup_{x∈Ω} min_{x_k∈X} ‖x − x_k‖_2,
satisfies h_{X,Ω} ≍ n^{−1/d} for all X ∈ X, where X is a sampling scheme.
Condition 3.1 can be easily fulfilled. For example, sampling schemes with grid points satisfy Condition 3.1. In fact, any quasi-uniform sampling scheme satisfies Condition 3.1, as shown in the following proposition.
Proposition 3.1 (Proposition 14.1 of Wendland (2004)). Let X be a sampling scheme. Suppose there exists a constant C > 0 such that for all X ∈ X, h_{X,Ω} ≤ C q_X, where q_X = min_{j≠k} ‖x_j − x_k‖_2/2 is the separation distance of X. Then X satisfies Condition 3.1.
By the definition of the fill distance, it can be seen that Ω ⊆ ∪_{k=1}^n B(x_k, h_{X,Ω}), where B(x_k, h_{X,Ω}) denotes the Euclidean ball centered at x_k with radius h_{X,Ω}. Therefore, a comparison of volumes yields Vol(Ω) ≤ C' n h^d_{X,Ω}. Hence, for any set of measurement locations X with card(X) = n, h_{X,Ω} ≳ n^{−1/d}. By (20), (21) and Lemma C.2 in Appendix C, a set of measurement locations with small fill distance is desired, because we want the measurement locations to spread over Ω as much as possible.
Because quasi-uniform sampling schemes achieve the optimal rate of fill distance, they are widely used in computer experiments. Thus we believe Condition 3.1 is satisfied in many practical situations.
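The two distances in this discussion are straightforward to compute numerically. A sketch for Ω = [0, 1] (the dense-reference-grid approximation of the supremum and the helper names are ours):

```python
import numpy as np

def fill_distance(X, n_ref=2001):
    """Approximate fill distance h_{X,Omega} on Omega = [0, 1]: the largest
    distance from a point of Omega to its nearest design point, with a dense
    reference grid standing in for the supremum over Omega."""
    ref = np.linspace(0.0, 1.0, n_ref)
    return np.abs(ref[:, None] - X[None, :]).min(axis=1).max()

def separation_distance(X):
    """q_X: half of the smallest pairwise distance between design points."""
    return 0.5 * np.min(np.diff(np.sort(X)))

# For the grid design X = {0, 1/(n-1), ..., 1}, h_{X,Omega} = q_X = 1/(2(n-1)),
# so h_{X,Omega} stays within constant multiples of n^{-1/d} (d = 1),
# which is exactly Condition 3.1.
```

For a quasi-uniform design, h_{X,Ω} ≤ C q_X with C independent of n, which is the hypothesis of Proposition 3.1.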
Proposition 3.2. For any fixed sampling scheme X satisfying Condition 3.1, we have that
Under Condition 3.1, we have the following theorem. Note that f ∈ N_Ψ(Ω) and Ψ being a Matérn correlation function defined in (22) with ν > d/2 imply f ∈ H^ν(Ω); thus f ∈ L_∞(Ω).
Theorem 3.1. Suppose 2 < p ≤ ∞ and β ∈ (0, 1) are fixed, and σ = 0. For any fixed sampling scheme X satisfying Condition 3.1, (27) holds for all n. For any increasing sequence {a_n}_{n≥0} satisfying lim_{n→∞} a_n = ∞, (28) holds for all n. In (27) and (28), f̂_n is as in (23), CI_{n,β} is as in (24), and C and C' are positive constants depending on p, β, Ω, Ψ and the constants in Condition 3.1, but not on n.
Remark 3.1. After the earliest version of this work was submitted, a related result appeared as Theorem 3.2 of Karvonen et al. (2020).
Theorem 3.1 states that if one uses the estimated variance σ̂² derived by maximum likelihood estimation to construct a pointwise confidence interval, the confidence interval can be unreliable. The confidence interval is not L_p-weakly-reliable for 2 < p ≤ ∞, as in (27): for any M > 0 and sufficiently large n, there exists a function f in the unit ball of N_Ψ(Ω) such that ‖(f − f̂_n)/|CI_{n,β}|‖_{L_p(Ω)} ≥ M. It is also not L_2-strongly-reliable, as in (28): for M > 0 and sufficiently large n, there exists a function f ∈ N_Ψ(Ω) such that ‖(f − f̂_n)/|CI_{n,β}|‖_{L_2(Ω)} ≥ M. Therefore, it may not be appropriate to quantify the uncertainty by using the confidence interval derived by Gaussian process modeling for a deterministic function lying in the corresponding reproducing kernel Hilbert space if there is no noise.

Some reliable confidence intervals under Assumption * 2.1
We adopt a reviewer's suggestion and consider two other approaches to choosing σ̂²: (1) setting it equal to a constant; and (2) removing the 1/n factor from the maximum likelihood estimate. Note that in both cases, Lemma C.2 and Condition 3.1 imply |CI_{n,β}(x)| ≤ C n^{−ν/d+1/2}; thus lim_{n→∞} sup_{x∈Ω} |CI_{n,β}(x)| = 0. If we set σ̂² to be a positive constant, the corresponding confidence interval is L_∞-weakly-reliable (and thus L_p-weakly-reliable for 2 ≤ p < ∞) but not L_∞-strongly-reliable, as stated in the following theorem.
Theorem 3.2. Suppose β ∈ (0, 1) is fixed, and σ = 0. Let Ψ = Ψ_M, where Ψ_M is as in (22). Let CI_{n,β} be as in (24) but with σ̂² = 1. For any fixed sampling scheme X satisfying Condition 3.1, (29) holds for all n. For any increasing sequence {a_n}_{n≥0} satisfying lim_{n→∞} a_n = ∞, (30) holds for all n. The constant C is positive and depends on p, β, Ω, Ψ and the constants in Condition 3.1, but does not depend on n.
Next we discuss the second approach, removing the 1/n factor from the maximum likelihood estimate. With this approach, the constructed confidence interval CI_{n,β} is as in (24) but with σ̂² = Y^T R^{−1} Y. From the identity (26), it can be seen that this σ̂² is close to ‖f‖²_{N_Ψ(Ω)} only if ‖f − I_{Ψ,X}f‖²_{N_Ψ(Ω)} is small, which need not hold for a general f ∈ N_Ψ(Ω) (Edmunds and Triebel, 2008). Therefore, we need to impose a stronger condition on f such that ‖f − I_{Ψ,X}f‖²_{N_Ψ(Ω)} converges to zero and the corresponding confidence interval is reliable. Define an integral operator T: L_2(Ω) → L_2(Ω) by
(Tg)(x) = ∫_Ω Ψ(x − y) g(y) dy, for g ∈ L_2(Ω).
Lemma 3.1. If f = Tg for some g ∈ L_2(Ω), then ‖f − I_{Ψ,X}f‖_{N_Ψ(Ω)} ≤ C P̄_{Ψ,X} ‖g‖_{L_2(Ω)}, where P̄_{Ψ,X} is as in (21), and C only depends on Ω.
Lemma 3.1 can be derived directly from the proof of Theorem 11.23 in Wendland (2004) and the fact that P̄_{Ψ,X} ≤ 1; thus the proof is omitted here. We have the following theorem, which states that the confidence interval constructed by the second approach is asymptotically reliable for a fixed function f.
Theorem 3.3. Suppose β ∈ (0, 1) and f ∈ T(L_2(Ω)) are fixed, and σ = 0. Let Ψ = Ψ_M, where Ψ_M is as in (22). Let CI_{n,β} be as in (24) but with σ̂² = Y^T R^{−1} Y. For any fixed sampling scheme X satisfying Condition 3.1, there exists N > 0 depending on Ψ, Ω, f and the constants in Condition 3.1, such that for all n ≥ N,
where C is a positive constant depending only on f, Ψ, Ω, and β.
Although Theorem 3.3 does not imply that the confidence interval is L ∞ -strongly-reliable because the sample size N depends on f , it can provide a guideline for practitioners to construct confidence intervals for deterministic functions. Whether the confidence interval withσ 2 = Y T R −1 Y is L ∞ -strongly-reliable and the confidence interval with constantσ 2 is L p -strongly-reliable (p < ∞) will be pursued in future works.

When the observations are noisy
In this section, we consider the case that the observations are corrupted by noise. We call this the stochastic case, because multiple evaluations of the function at the same measurement location may yield different observations. The observations y_k's are given by (1). In the stochastic case, the variance of the noise σ² > 0. In this section, we still assume f ∈ N_Ψ(Ω) in (1) is a fixed function. Under the misspecified assumption Assumption * 2.1, we use f̂_n(x), defined by
f̂_n(x) = r(x)^T (R + µ̂_n I_n)^{−1} Y,   (31)
to predict f(x) at a point x ∈ Ω, where r(x) and R are as in (4), and Y = (y_1, ..., y_n)^T. Throughout this section, we assume that the measurement locations x_1, ..., x_n are drawn uniformly from the input space Ω, and µ̂_n ≍ n^α with α ∈ R. It is obvious that α should be less than one in order to make meaningful predictions. In particular, if α = 0, then µ̂_n is at a constant rate, which is widely used in computer experiments (Baker et al., 2020; Dancik, 2007). If replicates at the same measurement location are available, then Ankenman et al. (2010) sets µ̂_n to be the sample variance of these replicates, which also converges to a constant as the number of replicates at each measurement location goes to infinity. It is well known that if α = d/(2ν + d), f̂_n achieves the optimal convergence rate n^{−ν/(2ν+d)} under the L_2 metric (Stone, 1982; van de Geer, 2000). In the following theorem, we show that if α ≠ d/(2ν + d), the optimal convergence rate is not achieved. Recall that we use C, C', C_1, C_2, ... and η, η_0, η_1, ... to denote constants whose values can change from line to line, and that the x_k's are drawn uniformly from Ω.
Theorem 4.1 provides non-asymptotic lower bounds on the mean squared prediction error under different choices of the regularization parameter value. In particular, it shows that if α ≠ d/(2ν + d), the optimal convergence rate cannot be achieved with high probability, as summarized in the following corollary.
Corollary 4.1 states that, with uniformly distributed measurement locations, a value of α other than d/(2ν + d) cannot lead to the optimal predictor. This is intuitively true because the regularization parameter µ̂_n determines the trade-off between the bias and the variance. It is well known in the literature that choosing α = d/(2ν + d) achieves the best trade-off. If α > d/(2ν + d), the bias becomes large while the variance is small. On the other hand, if α < d/(2ν + d), the variance is large while the bias is small. In both cases, we cannot achieve the best trade-off between the bias and the variance, and the prediction error has a slower convergence rate. To the best of our knowledge, it has not been shown in the literature that choosing the regularization parameter at a rate other than n^{d/(2ν+d)} precludes the optimal rate.
Next, we consider uncertainty quantification for f̂_n. Recall that for β ∈ (0, 1), at a point x ∈ Ω, the confidence interval constructed by Gaussian process modeling is given by
CI_{n,β}(x) = [f̂_n(x) − ĉ_{n,β}(x; µ̂_n), f̂_n(x) + ĉ_{n,β}(x; µ̂_n)],
with ĉ_{n,β}(x; µ̂_n) = q_{1−β/2} σ̂ (1 − r(x)^T (R + µ̂_n I_n)^{−1} r(x))^{1/2}   (35)
and σ̂² = Y^T (R + µ̂_n I_n)^{−1} Y/n.
Proposition 4.1 is a direct result of Lemmas F.6 and F.8 in Appendix F, and implies lim_{n→∞} sup_{x∈Ω} |CI_{n,β}(x)| = 0 if 0 ≤ α < 1. Intuitively, when α is large, the bias is large; the confidence interval, which reflects only the variance, is then not wide enough to capture the bias, and as a consequence CI_{n,β} is not reliable. On the other hand, a smaller regularization parameter value lets the variance dominate the error, so the confidence interval is reliable. The results related to the reliability of the confidence interval CI_{n,β} are presented in the following theorem. In Theorem 4.2, recall that x_1, ..., x_n are drawn uniformly from Ω, and |CI_{n,β}(x)| = 2ĉ_{n,β}(x; µ̂_n).
Suppose α = d/(2ν + d). Then for any increasing positive sequence {a_n}_{n≥0} satisfying lim_{n→∞} a_n = ∞, (37) holds. (iii) Suppose α < 0. Then, with probability at least 1 − C_5 exp(−C_6 n^{η_5}), (38) and (39) hold. In (36)-(39), f̂_n is as in (31) and ĉ_{n,β}(·; µ̂_n) is as in (35). In all statements, the expectation is taken with respect to ε = (ε_1, ..., ε_n)^T, and the constants C', C'', C''', the C_i's and the η_j's are positive and depend on Ψ, Ω, β and σ², but not on n. The probabilities are with respect to X.
As direct results of Theorem 4.2, we have the following corollary, which states the results related to the L 2 -reliability of the confidence interval CI n,β (x).
Suppose α = d/(2ν + d). Then for any increasing positive sequence {a_n}_{n≥0} satisfying lim_{n→∞} a_n = ∞, the corresponding bound holds. In all statements, the constants C', C'', C''', the C_i's and η_1 are positive, and depend only on Ψ, Ω, β and σ², but not on n.
Note that Theorem 4.2 and Corollary 4.2 do not make any theoretical assertion about L_2-weak-reliability in the case 0 ≤ α ≤ d/(2ν + d), or about L_2-strong-reliability in the case 0 ≤ α < d/(2ν + d). As (37) indicates, we conjecture that the constructed confidence interval in the stochastic case is L_2-weakly-reliable if α = d/(2ν + d), and is L_2-strongly-reliable if 0 ≤ α < d/(2ν + d). We also note that (iii) in Corollary 4.2 does not imply L_2-strong-reliability, since we have not confirmed that the width of the confidence interval converges to zero when α < 0.
Combining Corollary 4.1 and Corollary 4.2 suggests that if one applies the prediction and uncertainty quantification procedure from Gaussian process modeling to a deterministic function with noisy observations, the optimality of the predictor and the L 2 -strong-reliability of the confidence interval cannot be achieved at the same time.
As a by-product of Theorem 4.1, we show that if the observations are not corrupted by noise, and a regularization parameter is used to counteract potential numerical instability (Peng and Wu, 2014), then with uniformly distributed measurement locations, the prediction error can still be controlled.
Theorem 4.3. Suppose σ = 0, µ̂_n ≍ n^α with α < 1, and the correlation function Ψ satisfies Condition 2.1. Then for all n, with probability at least 1 − C_1 exp(−C_2 n^{η_1}), we have
where C_1, C_2, C_3 and η_1 are positive constants depending on Ψ, Ω, and f. Note that in this theorem, the regularization parameter value is allowed to increase as the sample size increases.

A numerical example
We present a numerical example to illustrate the results in Section 3, where we show that the confidence interval is not reliable in the deterministic case.
Recall that Theorem 3.1 states that there exists a function with smoothness (at least) ν such that the confidence interval is not L p -reliable. However, such a function is generally not straightforward to find. In this section, we numerically illustrate that there exists a function such that the confidence interval is not L p -reliable.
Consider the following function (Gramacy and Lee, 2012), for x ∈ [0, 1], where t_1(·; J_µ, J_σ) is a Cauchy density with location parameter J_µ and scale J_σ. In Gramacy and Lee (2012), it is shown that using a Gaussian correlation function yields a poor coverage rate. In this section we use a Matérn correlation function instead. The numerical results suggest that, with a Matérn correlation function, the Gaussian process model still does not provide an L_p-reliable confidence interval. As in Section 3, we compute
E = ‖(f − f̂_n)/|CI_{n,β}|‖_{L_4(Ω)},   (41)
where f is as in (40), f̂_n is as in (23), CI_{n,β} is as in (24), and p = 4. We use a Matérn correlation function as in (22) with ν = 3.5. It can be checked that f ∈ N_{Ψ_M}([0, 1]). We consider the 95% confidence interval constructed by Gaussian process modeling; thus β = 0.05. We set the number of measurement locations to n = 20k, k = 2, ..., 20, and the measurement locations are grid points. We use the Monte Carlo average (1/500) Σ_{j=1}^{500} ((f(x_j) − f̂_n(x_j))/|CI_{n,β}(x_j)|)^4 to approximate ‖(f − f̂_n)/|CI_{n,β}|‖⁴_{L_4(Ω)}, where the x_j's are the first 500 points of the Halton sequence (Niederreiter, 1992). This should give a good approximation since the points are dense enough. We add a jitter of 10^{−8} to stabilize the computation of the matrix inverse in (23) and (24). The results are shown in Panel 1 of Figure 1.
We use the following approach to numerically estimate the rate of divergence of the ratio of the prediction error to the width of the confidence interval. By Theorem 3.1, the supremum of this ratio grows at least polynomially in n. Taking logarithms on both sides of (42), log E is approximately linear in log n, so we regress the logarithm of the Monte Carlo approximation of E on log n. If the regression coefficient is larger than one, it supports the results in Theorem 3.1. The results are shown in Panel 2 of Figure 1.
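The slope-estimation step can be sketched as follows. This is a simplified version of the experiment, with our own substitutions: a stand-in sharp bump rather than the function (40), standard Matérn smoothness 5/2 rather than ν = 3.5, and off-grid midpoints rather than the Halton sequence, so the fitted slope will not reproduce the value reported in the text.

```python
import numpy as np
from scipy.stats import norm

def matern52(d):
    # Standard Matern correlation with smoothness 5/2 (stand-in for (22))
    a = np.sqrt(5.0) * np.abs(d)
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def ratio_Lp(n, f, p=4, beta=0.05, jitter=1e-8, m=500):
    """Monte Carlo approximation of ||(f - f_hat)/|CI_{n,beta}||_{L_p} on [0, 1]
    for a grid design of size n, mirroring (41)."""
    X = np.linspace(0.0, 1.0, n)
    R = matern52(X[:, None] - X[None, :]) + jitter * np.eye(n)  # jitter as in the text
    w = np.linalg.solve(R, f(X))
    sigma2_hat = f(X) @ w / n                 # MLE of the variance
    xs = (np.arange(m) + 0.5) / m             # evaluation points off the design grid
    r = matern52(xs[:, None] - X[None, :])
    f_hat = r @ w
    V = np.linalg.solve(R, r.T)
    power2 = np.clip(1.0 - np.einsum("ij,ji->i", r, V), 1e-12, None)
    width = 2.0 * norm.ppf(1.0 - beta / 2.0) * np.sqrt(sigma2_hat * power2)
    return np.mean((np.abs(f(xs) - f_hat) / width) ** p) ** (1.0 / p)

# Stand-in test function: a single sharp Cauchy-type bump (location 0.5, scale 0.05)
f = lambda x: 1.0 / (1.0 + ((x - 0.5) / 0.05) ** 2)
ns = np.array([40, 80, 160])
E = np.array([ratio_Lp(n, f) for n in ns])
slope = np.polyfit(np.log(ns), np.log(E), 1)[0]  # estimated divergence rate
```

For a fixed smooth function, the sign and size of the slope depend on the test function and the sample-size range; Theorem 3.1 concerns the worst case over the unit ball of N_Ψ(Ω), so the slope here need not exceed one.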
Next, we consider the two approaches for constructing confidence intervals described in Section 3.2, and similarly compute the Monte Carlo approximation of the ratio for each. It can be seen from Panel 1 of Figure 1 that E in (41) increases as the number of measurement locations increases. From Panel 2 of Figure 1, we can see that the regression line fits the data relatively well. The regression coefficient is 1.548, which is larger than one. This gives an empirical confirmation that our results in Theorem 3.1 are valid, and that there exists a function such that the confidence interval is not L_p-reliable. As indicated by Figure 1, we believe the results in Theorem 3.1 can be improved. In Panel 3, although the ratio increases, its largest value is only about 10^{−4}. Panels 3 and 4 indicate that the two approaches in Section 3.2 can provide reliable confidence intervals. It can be seen that the confidence interval derived by setting σ̂² = C may be conservative (the ratio is very small). Therefore, other uncertainty quantification methods may be considered if the underlying function is known to lie in some reproducing kernel Hilbert space.

Conclusions and discussion
In this work, we consider the prediction and uncertainty quantification of the Gaussian process model applied to a fixed function in the corresponding reproducing kernel Hilbert space, from a frequentist perspective. The model is misspecified under Assumption * 2.1. We consider two cases: the deterministic case, in which the observations are noiseless, and the stochastic case, in which the observations are corrupted by noise. In both cases, we assume that the variance is estimated by maximum likelihood estimation, according to Assumption * 2.1.
In the deterministic case, we show that the constructed confidence interval in the Gaussian process model is not L p -weakly-reliable for p > 2, and is not L 2 -strongly-reliable. We also present two reliable confidence intervals under some scenarios. In the stochastic case, the regularization parameter value is assumed to be at a certain rate. We show that the predictor derived by the Gaussian process modeling is not optimal and/or the constructed confidence interval is not L 2 -strongly-reliable. These results indicate that the optimal predictor and L 2 -strong-reliability cannot be achieved at the same time if the Gaussian process model is misspecified. As by-products, we obtain several lower bounds on the mean squared prediction error under different choices of the regularization parameter value.
In Gaussian process modeling, it is often assumed that there are unknown parameters, and maximum likelihood estimation or Bayesian methods are used to estimate them, even if the Gaussian process model is misspecified. Prediction performance of the Gaussian process model with maximum likelihood estimation under misspecification has been studied by Stein (1993) and Bachoc et al. (2018). Stein (1993) shows that, for periodic functions and under some conditions, the misspecified Gaussian process model with maximum likelihood estimation can still work well in terms of prediction. If the underlying truth is indeed a Gaussian process, using a misspecified correlation function with maximum likelihood estimation may not yield the desired prediction performance, as suggested by Bachoc et al. (2018).
In addition to prediction, uncertainty quantification is another important problem in computer experiments. Because the Gaussian process model has a probabilistic structure, models based on Gaussian process modeling are usually validated via confidence intervals. In order to test the performance of these models, typically, several simulations are conducted. In some literature, the test function is selected to be a deterministic function with a closed form. In these cases, the Gaussian process model is usually misspecified, i.e., the function may not be a sample path of the corresponding Gaussian process. However, an imposed pointwise confidence interval is still constructed and used to quantify the uncertainty. It has been observed that Gaussian process models often have poor coverage of their confidence intervals (Gramacy and Lee, 2012;Joseph and Kang, 2011;Yamamoto, 2000). There is no theoretical result explaining this phenomenon from a frequentist perspective to the best of our knowledge.
Our results provide some insights on explaining the poor coverage of the confidence interval, and a better understanding of the model misspecification in Gaussian process modeling.
Several statistical inference methods have been studied in which the confidence interval is not derived directly from Gaussian process modeling. Most of them take a Bayesian perspective. For example, Sniekers and van der Vaart (2015) and Yoo et al. (2016) construct and analyze credible intervals for Gaussian process models (or, in particular, Brownian motion). Additionally, Yang et al. (2017) and Yano and Kato (2018) derive finite-sample bounds on the frequentist coverage errors of Bayesian credible intervals for Gaussian process models. Therefore, one can also use inference methods other than the confidence interval derived from Gaussian process modeling.
Several problems are not considered in this work. First, we only consider uniformly distributed measurement locations in the stochastic case, where fixed designs are not considered. Second, we only consider the case that the regularization parameter value is predetermined with a certain rate. We could not confirm similar results if we use maximum likelihood estimation to estimate the regularization parameter value, or select parameter values using other criteria as in Kou (2003). Also, we do not consider using maximum likelihood estimation to estimate the parameters of the correlation function Ψ. Therefore, a thorough investigation of applying maximum likelihood estimation under misspecification is needed. Third, as discussed in Section 2.3, Definition 2.1 is a necessary condition. One possible way to improve this definition is to restrict the L p -norm of the ratio such that it is bounded away from zero.

A Notation
We use ⟨·, ·⟩_n to denote the empirical inner product, defined by
⟨f, g⟩_n = (1/n) Σ_{i=1}^n f(x_i) g(x_i)
for two functions f and g, and let ‖g‖²_n = ⟨g, g⟩_n be the empirical norm of a function g. In particular, let
⟨ε, f⟩_n = (1/n) Σ_{i=1}^n ε_i f(x_i)
for a function f, where ε = (ε_1, ..., ε_n)^T. Let a ∨ b = max(a, b) for two real numbers a, b. We use H(·, F, ‖·‖) and H_B(·, F, ‖·‖) to denote the entropy number and the bracketing entropy number of a class F with respect to the (empirical) norm ‖·‖, respectively. Throughout the proofs, we assume Vol(Ω) = 1 for notational simplicity.

B Proof of Proposition 2.1
By (7) and (5), it can be seen that |CI_{n,β}(x)| = 2 q_{1−β/2} (Var[Z(x)|Z])^{1/2} at a point x. By Fubini's theorem, the claim follows, where the second equality can be derived from the formula for the pth absolute moment of the normal distribution; see Walck (1996).
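The moment formula used in the last step is standard; for completeness:

```latex
\mathbb{E}\,|W|^{p} \;=\; s^{p}\,\frac{2^{p/2}\,\Gamma\!\left(\tfrac{p+1}{2}\right)}{\sqrt{\pi}},
\qquad W \sim \mathcal{N}(0, s^{2}),\quad p > -1,
```

which reduces to $\mathbb{E}\,W^2 = s^2$ for $p = 2$ and to $\mathbb{E}\,|W| = s\sqrt{2/\pi}$ for $p = 1$.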
Lemma C.1. Suppose p ≥ 2. The approximation number b n defined in (C.1) satisfies that for all n ∈ N, where c 1 and c 2 are two positive constants depending on Ω, ν and p.
Lemma C.2 is a direct result of Theorem 5.14 of Wu and Schaback (1993), which provides an upper bound on P Ψ M ,X defined in (21).
Lemma C.2. Let Ω be compact and convex with a positive Lebesgue measure, and let Ψ_M be a Matérn correlation function given by (22). Suppose Condition 3.1 holds for a sampling scheme X. Then there exist constants c, c_1 depending only on Ω, X, and ν in (22), such that P̄_{Ψ_M,X} ≤ c n^{−ν/d+1/2}, provided that n^{−ν/d+1/2} ≤ c_1 and X ∈ X.
Lemma C.3 is a direct result of Corollary 10.13 in Wendland (2004).
Lemma C.3. Let Ψ = Ψ_M with ν > d/2. Suppose Ω is compact and convex. Then the reproducing kernel Hilbert space N_Ψ(Ω) coincides with the Sobolev space H^ν(Ω) of smoothness ν, and the norms ‖·‖_{N_Ψ(Ω)} and ‖·‖_{H^ν(Ω)} are equivalent.
Now we are ready to prove Theorem 3.1.
By Lemma C.1, there exists a function φ_n ∈ H^ν(Ω) satisfying ‖φ_n‖_{H^ν(Ω)} = 1 such that
since I_{Ψ,X} is a rank-n linear operator. By Lemma C.2, (26), and Lemma C.3, we have, for sufficiently large N such that N^{−ν/d+1/2} ≤ h_0, for all n ≥ N and any x ∈ Ω,
where Y = (φ_n(x_1), ..., φ_n(x_n))^T. Let f in (27) be equal to φ_n. Therefore, we have
which finishes the proof of (27).
The case p = 2 can be proved similarly. The only difference is that we let f = a_nφ_n, so that ‖a_nφ_n‖_{H^ν(Ω)} = a_n.
The second inequality (30) can be shown by an approach similar to the proof of Theorem 3.1. By Lemma C.1, there exists a function φ_n ∈ H^ν(Ω) satisfying ‖φ_n‖_{H^ν(Ω)} = 1 such that
By Lemma C.2, we have, for any x ∈ Ω and sufficiently large n such that n^{−ν/d+1/2} ≤ h_0,
Letting f = a_nφ_n, we have
which finishes the proof.

E Proof of Theorem 3.3
By plugging in ĉ_{n,β}(x), it suffices to show that there exists N > 0 such that for all n > N,
By (20) and Lemma 3.1, for sufficiently large n, it can be seen that
where the first inequality is by (26), and the last inequality follows from the fact that n^{−2ν/d+1} converges to 0. This finishes the proof.

F Proof of Theorem 4.1
Recall that in the stochastic case, we assume x_1, ..., x_n are drawn uniformly from Ω. Before the proof of Theorem 4.1, we first present some lemmas used in this section. Note that the proof of Lemma F.1 is based on Lemma 8.4 of van de Geer (2000); thus it is omitted here. Lemmas F.2 and F.3 are Theorem 10.2 of van de Geer (2000) and Theorem 2.1 of van de Geer (2014), respectively.
Lemma F.1. Suppose ε_1, ..., ε_n are independent and identically distributed normal variables. Then for all t > C, with probability at least 1 − C_1 exp(−C_2 t²),
where f̂_n is defined as in (31).
Then for all t > 0, with probability at least 1 − exp(−t), where C 1 is a constant, and with C 2 another constant.
The following lemma is a Bernstein-type inequality for a single function g. See, for example, Massart (2007).
We first state the intuition behind the proof. Direct calculation decomposes the mean squared prediction error into a bias term and a variance term. If µ̂_n is large (α > d/(2ν + d)), the bias dominates; therefore, to obtain a lower bound on the mean squared prediction error in this case, we only need a lower bound on the bias term. On the other hand, if µ̂_n is small (α < d/(2ν + d)), we only need a lower bound on the variance term. Now we present the proof of Theorem 4.1. We first consider the case α > d/(2ν + d). By the proof of Lemma F.6, it can be seen that f̂_n is the minimizer of the regularized least-squares objective (F.2). In the rest of the proof we write f̂_n as f̂ for simplicity. Plugging (1) into the objective function of (F.2), we obtain (F.3) with probability at least 1 − C_1 exp(−C_2 t²). By Lemma F.2 and the triangle inequality, (F.3) can be lower bounded by (F.4). By Lemma F.2 and the interpolation inequality, we have ‖f̂‖_{N_Ψ(Ω)} ≤ C_4 and ‖f̂‖_{L_∞(Ω)} ≤ C_5, so f̂ ∈ H^ν(C_5), where H^ν(C_5) denotes the ball in the Sobolev space H^ν(Ω) with radius C_5. Thus, the bracketing entropy number can be bounded by (Adams and Fournier, 2003). By Lemma F.3, for all t > 0, with probability at least 1 − exp(−t), where R = sup_{f∈F} ‖f‖_{L_2(Ω)}. Choosing t = n^η with η = ((α − 1)(2ν + d)/(2ν) + 1)/4, for sufficiently large n, the right-hand side of (F.4) can be lower bounded as claimed, where we also apply Jensen's inequality.

G Proof of Theorem 4.2
In the proof of Theorem 4.2, we suppress all probabilities of the form 1 − C exp(−C'n^η) for conciseness.
Case 1: α > d/(2ν + d). By (F.23) and Lemma F.6, we have the stated bound, where r(x) and R are as in (4), and f(X) = (f(x_1), ..., f(x_n))^T. The first inequality is true because of the Cauchy-Schwarz inequality and the fact that f − f̂_n is normal. The last inequality follows from Lemma F.8. If α > d/(2ν + d), then by the proof of Theorem 4.1, we have n ∫_Ω (f(x) − r(x)^T (R + µ̂_n I_n)^{−1} f(X))² dx ≳ n^β with β > 0 for the f ∈ N_Ψ(Ω) constructed in the proof of Theorem 4.1, which finishes the proof of Case 1.
Noting that f − f̂_n is normal, we have (G.1), where the first inequality is by Fubini's theorem, the second is by the Cauchy-Schwarz inequality, the third is by (20) and the fact that ‖f‖²_{N_Ψ(Ω)} ≤ log n, the fourth is by µ̂_n ≤ C_2 n^{d/(2ν+d)}, and the fifth is by (F.25). By Lemma C.2 and Lemma F.8, (G.1) can be further bounded, or the conditions of Lemma F.5 are satisfied. If the latter happens, Lemma F.5 yields the same bound. By noticing that n^{α−ν/d+1/2} log n + n^{−2ν/(2ν+d)} n^α log n ≤ C_5 n^{(α−1)(1−d/(2ν))}, we finish the proof of the first part.
For the second part, by the standard minimax theory in nonparametric regression, there exists a function f such that E‖f − f̂_n‖²_{L_2(Ω)} ≥ C_6 n^{−2ν/(2ν+d)} ‖f‖²_{N_Ψ(Ω)}, which implies a lower bound of order C_9 a_n.
By Lemma F.6, we have the stated bound. The second inequality is true because of (F.25), and the third is true because of Lemma F.8. Note that µ̂_n log n → 0, which finishes the proof of the case α < 0.

H Proof of Lemma F.5
The idea of the proof is to use the bracket entropy number. By the definition of the bracket entropy number, we can find finite functions g s 's such that the ball with small radius centered on g s 's can cover the function class G. By showing the results hold for these g s 's, we can show the results hold for all function g ∈ G.
By choosing appropriate C and η_1 (the choice only depends on Vol(Ω)), we obtain the first bound. Therefore, taking all g ∈ G yields a bound on the probability of the infimum over G. Since H_B(δ_n/Vol(Ω), G, ‖·‖_{L_∞(Ω)}) ≤ nδ²_n/(1200 Vol(Ω) K²), it can be seen that the bound holds for some constants C_1 and C_2 depending only on Vol(Ω), which finishes the proof of the first part.
Since H_B(δ_n/Vol(Ω), G, ‖·‖_{L_∞(Ω)}) admits an analogous bound, the second part holds for some constants C_3 and C_4 depending on Vol(Ω), which finishes the proof of the second part.

I Proof of Lemma F.6
Notice that (I.1) holds, which can be verified by minimizing the objective function inside the right-hand side of (I.1). Let û = R^{−1}Ŷ. By plugging û into the right-hand side of (I.1), we obtain (I.2). Therefore, by the representer theorem and (1), the right-hand side of (I.2) is the same as Σ_{i=1}^n (y_i − f̂(x_i))² + C n^{−2ν/(2ν+d)} ‖f̂‖²_{N_Ψ(Ω)}.
Applying the results in the case µ̂_n^{−1} = O_P(n^{−d/(2ν+d)}), we obtain µ̂_n Y^T(R + µ̂_n I_n)^{−1} Y ≤ 2σ², with probability at least 1 − C_1 exp(−C_2 n^η). The lower bound can be obtained with probability tending to one, where the first inequality is because of the Cauchy-Schwarz inequality, and the last inequality is true because ‖f − f̂‖²_n and µ̂_n ‖f̂‖²_{N_Ψ(Ω)} converge to zero (van de Geer, 2000; Gu, 2013). This completes the proof.

J Proof of Lemma F.7
We only present the proof of the first inequality; the second can be proved similarly. By direct calculation, it can be shown that
tr((A + B)(A + B + C)^{−1}) ≥ tr(A(A + C)^{−1})
⇔ tr(C(A + B + C)^{−1}) ≤ tr(C(A + C)^{−1})
⇔ tr(C(A + B + C)^{−1} B (A + C)^{−1}) ≥ 0,
which is true since A, B and C are positive definite.

K Proof of Lemma F.8
Notice that at a point x, for any u = (u_1, ..., u_n)^T ∈ R^n, the stated inequality holds, where the first inequality is because of the Cauchy-Schwarz inequality. Plugging in u = (R + µ̂_n I_n)^{−1} r(x) finishes the proof of the first part. Now we prove the second part of this lemma.

L Properties of eigenvalues
Lemma L.1 states the asymptotic rate of the eigenvalues of Ψ(· − ·). Lemma L.2 states the minimum eigenvalue of Ψ_1^T Ψ_1, where Ψ_1 will be defined later. Since Ψ(· − ·) is a positive definite function, by Mercer's theorem, there exist a countable set of positive eigenvalues λ_1 ≥ λ_2 ≥ ... > 0 and an orthonormal basis {ϕ_k}_{k∈N} of L_2(Ω) such that
Ψ(x − y) = Σ_{k=1}^∞ λ_k ϕ_k(x) ϕ_k(y),
where the summation is uniformly and absolutely convergent.
Proof. Let T be the embedding operator of N_Ψ(Ω) into L_2(Ω), and let T* be the adjoint of T. By Proposition 10.28 in Wendland (2004),
By Lemma C.3, H^ν(Ω) coincides with N_Ψ(Ω). By Theorem 5.7 in Edmunds and Evans (2018), T and T* have the same singular values. By Theorem 5.10 in Edmunds and Evans (2018), for all k ∈ N, a_k(T) = µ_k(T), where a_k(T) denotes the kth approximation number of the embedding operator (as well as of the integral operator), and µ_k denotes the kth singular value of T. By the theorem in Section 3.3.4 of Edmunds and Triebel (2008), the approximation numbers of the embedding operator T satisfy
C_3 k^{−ν/d} ≤ a_k(T) ≤ C_4 k^{−ν/d},   (L.2)
where C_3 and C_4 are two positive constants. By Theorem 5.7 in Edmunds and Evans (2018), T*Tϕ_k = µ²_k ϕ_k, and since T*Tϕ_k = T*ϕ_k = λ_k ϕ_k, we have λ_k = µ²_k. By (L.2), λ_k ≍ k^{−2ν/d} holds.
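The rate λ_k ≍ k^{−2ν/d} can be probed numerically with a Nyström (quadrature) approximation of the Mercer eigenvalues. This is our own sketch on Ω = [0, 1] with a standard Matérn correlation of smoothness 5/2; in the paper's parametrization this corresponds to ν = 3 in d = 1 (cf. Lemma C.3), so the predicted log-log slope is −2ν/d = −6. The quadrature size and index range are our choices.

```python
import numpy as np

def matern52(d):
    # Standard Matern correlation with smoothness 5/2
    a = np.sqrt(5.0) * np.abs(d)
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def nystrom_eigenvalues(m=400):
    """Approximate Mercer eigenvalues of Psi on [0, 1] via midpoint quadrature:
    eigenvalues of the kernel matrix on an m-point grid, scaled by 1/m."""
    x = (np.arange(m) + 0.5) / m
    K = matern52(x[:, None] - x[None, :])
    return np.linalg.eigvalsh(K)[::-1] / m  # sorted in decreasing order

lam = nystrom_eigenvalues()
ks = np.arange(4, 40)
slope = np.polyfit(np.log(ks), np.log(lam[ks - 1]), 1)[0]
# slope should be in the vicinity of -2*nu/d = -6 for this kernel
```

Only the leading eigenvalues (k well below the quadrature size m) are trustworthy in this approximation; the tail is dominated by discretization and floating-point error.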
The entropy condition is satisfied if (L.3) holds, where δ_n = 1/ρ, K ≤ C_4 p_1^{1/2}/ρ, and C_6 is some constant depending on C_1-C_5 and Ω. By direct calculation, if p_1 ≤ C_7 √n for some constant C_7, then (L.3) is satisfied.
By Lemma F.5, the desired bound holds with probability at least 1 − C_8 exp(−C_9 n^{η_1}) for some constants η and η_1. This finishes the proof.