Estimation of a density using an improved surrogate model

Abstract: Quantification of uncertainty of a technical system is often based on a surrogate model of a corresponding simulation model. In any application the simulation model will not describe the reality perfectly, and consequently also the surrogate model will be imperfect. In this article we show how observed data of the real technical system can be used to improve such a surrogate model, and we analyze the rate of convergence of density estimates based on the improved surrogate model. The results are illustrated by applying the estimates to simulated and real data.


Introduction
Any design of complex technical systems by engineers nowadays is based on some sort of mathematical model of the technical system. Such models are never able to describe the reality perfectly, therefore their analysis has to take into account some kind of uncertainty. This uncertainty might occur, e.g., because some of the parameters of the model are not exactly known, because of the use of an imperfect mathematical model of the technical system during the design process which does not really describe all aspects of the underlying technical system, or because of lack of knowledge about future use. A good quantification of uncertainty of the system is essential in order to avoid oversizing and to conserve resources. In this article we quantify uncertainty of the technical system by estimating a density of a real-valued random variable representing the outcome of an experiment with the technical system. The starting point for our estimation problem is a physical model of the technical system with uncertain parameters. This physical model has parameters which are chosen randomly because their exact values are uncertain and consequently unknown, and it computes the outcome of the technical system by computing the value of a function depending on concrete values of these parameters. In case the distribution of the parameters is known (which we will assume from now on) and the function, which has to be computed, is given, Monte Carlo can be used to estimate either quantiles or the density of the output of the technical system.
Usually, the distribution of m(X) is estimated by random sampling using a computer program, and computer experiments can be used to generate values for the Monte Carlo estimates. However, it often happens that the generation of these values is rather time consuming, so that standard Monte Carlo estimates cannot be used. Instead, one has to apply techniques which are able to quantify the uncertainty in the computer experiment using only a few evaluations of the computer program. There is a vast literature on the design and analysis of such computer experiments, cf., e.g., Santner, Williams and Notz (2003) and the literature cited therein. Often, so-called surrogate models of the computer experiment are used in order to analyze computer experiments. Surrogate models have been introduced and investigated with the aid of simulated and real data in connection with quadratic response surfaces in Bucher and Bourgund (1990), Kim and Na (1997) and Das and Zheng (2000), in the context of support vector machines in Hurtado (2004), Deheeger and Lemaire (2010) and Bourinet, Deheeger and Lemaire (2011), in connection with neural networks in Papadrakakis and Lagaros (2002), and in the context of kriging in Sacks et al. (1989), Kaymaz (2005) and Bichon et al. (2008). Consistency and rate of convergence of density estimates based on surrogate models have been studied in Devroye, Felber and Kohler (2013), Bott, Felber and Kohler (2015) and Felber, Kohler and Krzyżak (2015a). A method for the adaptive choice of the smoothing parameter of such estimates has been presented in Felber, Kohler and Krzyżak (2015b).
Computer simulation usually allows one to estimate parameters of the computer model so as to ensure that it closely matches the performance of the technical system. In a Bayesian context, a corresponding approach towards parameter estimation (so-called calibration) was described in Kennedy and O'Hagan (2001). Applications of this approach in various fields can be found, e.g., in Bayarri et al. (2007), Goh et al. (2013), Han, Santner and Rawlinson (2009), Higdon et al. (2013) and Wang, Chen and Tsui (2009). Tuo and Wu (2015) pointed out that this approach might lead to estimates which are not consistent (in the sense introduced by them) in case of an imperfect computer model, for which there exist no values of the parameters which fit the technical system perfectly. They suggested and analyzed non-Bayesian methods for the choice of the parameters of such models.
It is clear that the mathematical/computer model cannot perfectly mimic the performance of the technical system, and thus the question arises how to characterize this mismatch in terms of uncertainty quantification. The standard approach in science is to make some assumptions about the reality and to try to quantify the uncertainty under these assumptions. E.g., in the Bayesian analysis of computer experiments, Kennedy and O'Hagan (2001) and the papers cited above, which applied their methods, suggested modeling the discrepancy between the computer experiments and the outcome of the technical system by a Gaussian process. Under the assumption that the reality is described by this Gaussian process perfectly, this makes it possible to compute various estimates and to derive bounds on their error even for small sample sizes. But of course the derived bounds on the error of the estimates only hold if the assumptions about the reality are true, which illustrates the saying "We buy information with assumptions" (Coombs (1964)).
Due to the fact that an erroneous specification of the uncertainty in the technical system (e.g., an inadequate physical model of the maximum relative compression of the suspension strut of the airplane wheel assembly; see Section 4 and Fig. 5) might result in a catastrophic failure of the technical system during its operation (e.g., a malfunction of the aircraft's front wheel during touchdown), it is very important to avoid arbitrary assumptions in uncertainty quantification as much as possible. In this paper we are interested in a non-Bayesian approach towards uncertainty quantification in case of imperfect models. Recently, such an approach was proposed in Wong, Storlie and Lee (2017), where standard nonparametric regression estimates have been applied in order to estimate the discrepancy function between the model and the real data by using these methods to smooth the residuals of the model. The authors used bootstrapping for uncertainty quantification by means of confidence regions, and they analyzed it using an empirical norm.
There are some similarities between our approach and gradient boosting introduced by Friedman (2001, 2002). Gradient boosting also smooths residuals; however, in gradient boosting the residuals of the original data are smoothed, while in our approach the residuals of the new (and better) data are smoothed.
In this paper we use a similar approach as in Wong, Storlie and Lee (2017). We assume that an additional (small) sample of the real technical system is available, and we consider the problem of estimating, from this sample together with the imperfect simulation model, an improved surrogate model. But in order to be able to apply our method also for small sample sizes (in our application below we have just n = 10 real data points available), we use a weighted penalized least squares estimate in which we augment the n = 10 residuals of the model by artificial data points. Furthermore, unlike Wong, Storlie and Lee (2017), we analyze the error of our improved surrogate model in the theoretical (and not the empirical) $L_2$ norm, which allows us to derive error bounds for density estimates based on our improved surrogate model.
The mathematical setting which we consider is as follows: Let $(X, Y)$, $(X_1, Y_1)$, $(X_2, Y_2)$, ... be independent and identically distributed random variables with values in $\mathbb{R}^d \times \mathbb{R}$, and let $m : \mathbb{R}^d \to \mathbb{R}$ be a (known and given) measurable function. Here $Y$ describes the outcome of an experiment with our technical system, and our aim is to estimate the density $g$ of $Y$ (w.r.t. the Lebesgue measure), which we assume to exist. The random vector $X$ with known and given distribution and the measurable function $m$ describe our physical model with uncertain parameters, and in this model we use $m(X)$ as an approximation of $Y$. Given the data
$$(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n})), X_{n+L_n+1}, \ldots, X_{n+L_n+N_n} \quad (1.1)$$
(where $L_n, N_n \in \mathbb{N}$), our goal is to construct an estimate of $g$. The simplest way of solving this problem is to ignore $X$ and $m$ and to use only the data
$$Y_1, \ldots, Y_n \quad (1.2)$$
to estimate $g$ by the kernel density estimate
$$\hat{g}_n(y) = \frac{1}{n \cdot h_n} \sum_{i=1}^{n} K\left(\frac{y - Y_i}{h_n}\right). \quad (1.3)$$
Here $K : \mathbb{R} \to \mathbb{R}$ (the so-called kernel, which is assumed to be a density) and $h_n > 0$ (the so-called bandwidth) are parameters of the estimate.
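The kernel density estimate (1.3) is straightforward to implement. The following minimal R sketch evaluates it with a Gaussian kernel on a grid; the sample values and the bandwidth are placeholders chosen only for illustration.

```r
# Kernel density estimate (1.3) with a Gaussian kernel; the sample and the
# bandwidth are illustrative placeholders.
kde <- function(y, sample, h) mean(dnorm((y - sample) / h)) / h

set.seed(1)
Y <- rnorm(10)                        # small real sample Y_1, ..., Y_n
h_n <- 0.5                            # bandwidth h_n
y_grid <- seq(-3, 3, by = 0.1)
g_hat <- sapply(y_grid, kde, sample = Y, h = h_n)
```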
In the sequel we assume that for $m^* : \mathbb{R}^d \to \mathbb{R}$ defined by
$$m^*(x) = \mathbf{E}\{Y \mid X = x\}$$
the error of approximating $Y$ by $m^*(X)$ is small. In this case an alternative way to estimate $g$ is to use the data
$$(X_1, Y_1), \ldots, (X_n, Y_n) \quad (1.4)$$
to construct an estimate
$$\hat{m}_n(\cdot) = \hat{m}_n(\cdot, (X_1, Y_1), \ldots, (X_n, Y_n)) \quad (1.5)$$
of $m^*$, and to define the corresponding surrogate density estimate
$$\hat{g}_{(X,\hat{m}_n),N_n}(y) = \frac{1}{N_n \cdot h_{N_n}} \sum_{i=1}^{N_n} K\left(\frac{y - \hat{m}_n(X_{n+L_n+i})}{h_{N_n}}\right). \quad (1.6)$$
In this article we are interested in situations where the sample size $n$ is rather small (in our application in Section 4 we will have $n = 10$), since the collection of the real data (1.2) is rather expensive owing to the complex nature of the physical system. Consequently, it might also be useful to use data from our model to estimate $g$. One possibility of doing this is to define an estimate of $g$ on the basis of the data
$$(X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n})), X_{n+L_n+1}, \ldots, X_{n+L_n+N_n} \quad (1.7)$$
by first estimating a surrogate (often called an emulator in the literature)
$$\tilde{m}_{L_n}(\cdot) = \tilde{m}_{L_n}(\cdot, (X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n}))) \quad (1.8)$$
of $m$, and by subsequently defining the corresponding surrogate density estimate via
$$\hat{g}_{(X,m(X)),L_n}(y) = \frac{1}{N_n \cdot h_{N_n}} \sum_{i=1}^{N_n} K\left(\frac{y - \tilde{m}_{L_n}(X_{n+L_n+i})}{h_{N_n}}\right). \quad (1.9)$$
The main question which we want to investigate is whether there exist situations in which suitably defined estimates based on the complete data (1.1) achieve simultaneously a better rate of convergence than the estimates (1.3), (1.6) and (1.9). Due to the fact that our sample size $n$ is rather small, we will consider only those cases where the model $m$ is rather good, and we will show that in this case we can improve a density estimate based on this model by real data.
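To make the surrogate idea concrete, the following hedged R sketch implements the estimate (1.6), with a simple linear regression standing in for a generic regression estimate of $m^*$; the data-generating model and all sample sizes are invented for illustration.

```r
# Sketch of the surrogate density estimate (1.6): fit a regression estimate
# to the small real sample and apply a kernel density estimate to its
# predictions on fresh X values; lm() is only an illustrative stand-in for
# a generic nonparametric estimate of m*.
set.seed(1)
n <- 10; N_n <- 10000
X <- rnorm(n)
Y <- 2 * X + 1 + 0.1 * runif(n)                 # toy real data (X_i, Y_i)
m_hat <- lm(Y ~ X)                              # estimate (1.5) of m*
X_new <- rnorm(N_n)                             # X_{n+L_n+1}, ..., X_{n+L_n+N_n}
g_surr <- density(predict(m_hat, data.frame(X = X_new)))   # estimate (1.6)
```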
To do this, we propose in the next section a novel method for improving the surrogate models (1.5) and (1.8) by using a combination of the real data (1.4) and the simulation data (1.7). Our main result is that the rate of convergence of the corresponding surrogate density estimate is at least as good as the rates of convergence of the density estimates (1.3), (1.6) and (1.9), and is, in special situations, better than any of these rates of convergence. The finite sample size behaviour of our estimates is validated on simulated data (see Section 4). The usefulness of our newly proposed estimate for uncertainty quantification is demonstrated by using it to analyze the uncertainty occurring in experiments with a suspension strut.
Throughout this paper we use the following notation: $\mathbb{N}$, $\mathbb{N}_0$ and $\mathbb{R}$ are the sets of positive integers, nonnegative integers and real numbers, respectively.
Let $p = k + \beta$ for some $k \in \mathbb{N}_0$ and $0 < \beta \leq 1$, and let $C > 0$. A function $m : \mathbb{R}^d \to \mathbb{R}$ is called $(p, C)$-smooth if for every $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}_0^d$ with $\sum_{j=1}^{d} \alpha_j = k$ the partial derivative $\frac{\partial^k m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}$ exists and satisfies
$$\left| \frac{\partial^k m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) - \frac{\partial^k m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right| \leq C \cdot \|x - z\|^{\beta}$$
for all $x, z \in \mathbb{R}^d$. If $X$ is a random variable, then $\mathbf{P}_X$ is the corresponding distribution, i.e., the measure associated with the random variable. For $D \subseteq \mathbb{R}^d$ and a function $f : \mathbb{R}^d \to \mathbb{R}$ we write $\|f\|_{\infty,D} = \sup_{x \in D} |f(x)|$ for the supremum norm of $f$ on $D$.

The outline of this paper is as follows: In Section 2 the construction of the improved surrogate model is explained. The main results are presented in Section 3 and proven in Section 5. The finite sample size performance of our estimates is illustrated in Section 4 by applying them to simulated and real data.

A new method for improving an imperfect surrogate model by real data
In this section we describe the ideas behind the construction of the improved surrogate model. In order to construct density estimates on the basis of the data (1.1) we proceed as follows: We start by defining a surrogate model
$$\tilde{m}_{L_n}(\cdot) = \tilde{m}_{L_n}(\cdot, (X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n}))) : \mathbb{R}^d \to \mathbb{R} \quad (2.1)$$
of $m$. The reason for using this surrogate model instead of $m$ is that it might be much more complicated to evaluate $m$ (because, e.g., we may have to solve a partial differential equation in order to compute a function value) than to evaluate the surrogate model. In principle, any nonparametric regression estimate can be used to construct the surrogate model. In Section 4 we will use the penalized least squares estimate defined by
$$\tilde{m}_{L_n,(k,\lambda_{L_n})}(\cdot) = \arg\min_{f \in W^k(\mathbb{R}^d)} \left[ \frac{1}{L_n} \sum_{i=1}^{L_n} |f(X_{n+i}) - m(X_{n+i})|^2 + \lambda_{L_n} \cdot J_k^2(f) \right], \quad (2.2)$$
where
$$J_k^2(f) = \sum_{\substack{\alpha_1, \ldots, \alpha_d \in \mathbb{N}_0 \\ \alpha_1 + \cdots + \alpha_d = k}} \frac{k!}{\alpha_1! \cdots \alpha_d!} \int_{\mathbb{R}^d} \left| \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right|^2 dx \quad (2.3)$$
is a penalty term penalizing the roughness of the estimate and where $W^k(\mathbb{R}^d)$ denotes the Sobolev space of functions whose weak derivatives of order $k$ are contained in $L_2(\mathbb{R}^d)$. The condition $2k > d$ implies that the functions in $W^k(\mathbb{R}^d)$ are continuous, and hence the value of a function at a point is well defined. In order to be able to analyze the rate of convergence of this estimate for an arbitrary distribution of $X$ and dimension $d > 1$, we truncate this estimate at some level $\beta > 0$, i.e., we define $\tilde{m}_{L_n} = T_{\beta} \tilde{m}_{L_n,(k,\lambda_{L_n})}$, where $T_{\beta} z = \max\{-\beta, \min\{\beta, z\}\}$. If we know a bound on the estimated function in an application, we can truncate our estimate at this level.
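As an illustration, the truncated surrogate model (2.1)-(2.3) can be approximated with the thin plate spline routine Tps() from the R package fields (the routine also used in Section 4). In the following sketch the model m is the linear function of model 1 from Section 4, used as a stand-in, and the truncation level beta is an invented placeholder.

```r
# Hedged sketch of the truncated surrogate model (2.1)-(2.3) based on a
# thin plate spline; requires the 'fields' package.
library(fields)
m <- function(x) 2 * x[, 1] + x[, 2] + 2          # model 1 from Section 4
L_n <- 500
X_model <- matrix(rnorm(2 * L_n), ncol = 2)       # X_{n+1}, ..., X_{n+L_n}
fit_model <- Tps(X_model, m(X_model))             # penalized least squares fit
beta <- 50                                        # truncation level (assumed)
m_tilde <- function(x) pmax(-beta, pmin(beta, predict(fit_model, x)))
```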
Next we compute the residuals of the estimate (2.1) on the data set (1.4), i.e., we compute
$$\hat{\epsilon}_i = Y_i - \tilde{m}_{L_n}(X_i) \quad (i = 1, \ldots, n). \quad (2.4)$$
Then we define an estimate
$$\hat{m}_n \quad (2.5)$$
which smooths these residuals (see below), and we define our final surrogate model $(X, m_n(X))$ (where $m_n$ is an estimate of $m^*(\cdot) = \mathbf{E}\{Y \mid X = \cdot\}$) by setting
$$m_n(x) = \tilde{m}_{L_n}(x) + \hat{m}_n(x). \quad (2.6)$$
In order to define the estimate (2.5) we use two data sets. The first data set corresponds to the residuals of $\tilde{m}_{L_n}$ on $X_1, \ldots, X_n$, i.e., the data set
$$(X_1, Y_1 - \tilde{m}_{L_n}(X_1)), \ldots, (X_n, Y_n - \tilde{m}_{L_n}(X_n)). \quad (2.7)$$
The second data set corresponds to the residuals of $\tilde{m}_{L_n}$ on the artificial sample with measurement errors of $(X, Y)$, i.e., the data set
$$(X_{n+1}, m(X_{n+1}) - \tilde{m}_{L_n}(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n}) - \tilde{m}_{L_n}(X_{n+L_n})). \quad (2.8)$$
The second data set will be particularly useful in case $\tilde{m}_{L_n}$ is already very close to $m^*$, since the sample size $n$ of the data set (2.7) might be too small to detect that $0 \approx m^* - \tilde{m}_{L_n}$ is the optimal choice for $\hat{m}_n$.
Since the two data sets are not equally trustworthy, we weigh them by some weight $w^{(n)} \in [0, 1]$ and set
$$\hat{m}_n(\cdot) = \hat{m}_n(\cdot, (X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n}))), \quad (2.9)$$
where
$$\tilde{\hat{m}}_n(\cdot) = \arg\min_{f \in W^k(\mathbb{R}^d)} \Bigg[ \frac{w^{(n)}}{n} \sum_{i=1}^{n} \left| f(X_i) - (Y_i - \tilde{m}_{L_n}(X_i)) \right|^2 + \frac{1 - w^{(n)}}{L_n} \sum_{i=1}^{L_n} \left| f(X_{n+i}) - (m(X_{n+i}) - \tilde{m}_{L_n}(X_{n+i})) \right|^2 + \lambda_n \cdot J_k^2(f) \Bigg] \quad (2.10)$$
and
$$\hat{m}_n = T_{c \cdot \alpha_n} \tilde{\hat{m}}_n, \quad (2.11)$$
where $c \geq 1$ and $\alpha_n > 0$. Finally, we use $(X, m_n(X))$ as a surrogate model for $(X, Y)$ and estimate the density $g$ of $Y$ by applying a kernel density estimate to a sample of $m_n(X)$. To do this, we choose a kernel $K : \mathbb{R} \to \mathbb{R}$ and a bandwidth $h_{N_n} > 0$ and define
$$\hat{g}_{N_n}(y) = \frac{1}{N_n \cdot h_{N_n}} \sum_{i=1}^{N_n} K\left( \frac{y - m_n(X_{n+L_n+i})}{h_{N_n}} \right). \quad (2.12)$$
The procedure is summarized in Algorithm 1.

Algorithm 1: Proposed estimate of the density $g$ of $Y$ using a surrogate model for $(X, Y)$.
1. Collect real data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from experiments with the physical system.
2. Generate additional data $(X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n}))$ using the known computer model $m(x)$.
3. Generate additional data $X_{n+L_n+1}, \ldots, X_{n+L_n+N_n}$ from the distribution of $X$.
4. Compute the penalized least squares regression estimate $\tilde{m}_{L_n}$ as in (2.1)-(2.3).
5. Compute the penalized least squares regression estimate $\hat{m}_n$ from (2.10) and (2.11), weighing the residuals corresponding to the real and the artificial data, and set $m_n = \tilde{m}_{L_n} + \hat{m}_n$.
6. Use $(X, m_n(X))$ as a surrogate model for $(X, Y)$ and estimate the density $g$ by the kernel density estimate $\hat{g}_{N_n}$ from (2.12).
7. Return $\hat{g}_{N_n}(y)$.
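Continuing the sketch after (2.3), the following R code illustrates steps 5 and 6 of Algorithm 1: the residuals (2.7) and (2.8) are smoothed by a weighted thin plate spline, an illustrative approximation of the penalized criterion (2.10) in which the truncation (2.11) at level c times alpha_n is omitted for brevity, and the final density estimate (2.12) is computed from a large sample of m_n(X). The weight w and all sample sizes are placeholders.

```r
# Weighted smoothing of the residuals (2.7) and (2.8) and the final density
# estimate (2.12); continues the Tps() sketch above. The truncation step
# (2.11) is omitted to keep the sketch short.
n <- 10; w <- 0.8                                  # weight w^(n), assumed
X_real <- matrix(rnorm(2 * n), ncol = 2)           # X_1, ..., X_n
Y_real <- m(X_real) + 0.5 * runif(n)               # real measurements Y_i
X_all <- rbind(X_real, X_model)
res <- c(Y_real - m_tilde(X_real),                 # data set (2.7)
         m(X_model) - m_tilde(X_model))            # data set (2.8)
wts <- c(rep(w / n, n), rep((1 - w) / L_n, L_n))   # weights as in (2.10)
fit_res <- Tps(X_all, res, weights = wts)
m_n <- function(x) m_tilde(x) + predict(fit_res, x)  # final surrogate (2.6)
N_n <- 100000
X_new <- matrix(rnorm(2 * N_n), ncol = 2)          # fresh X values
g_hat <- density(as.numeric(m_n(X_new)))           # density estimate (2.12)
```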

Main results
In the next theorem we present bounds on the rate of convergence of our surrogate estimate, which we will use to derive bounds on the rate of convergence of our density estimate. In principle, all of our error bounds are also valid for finite n. In order to simplify the presentation, we consider the case $n \to \infty$ and assume that the distribution of $(X, Y)$ and also the physical model with uncertain parameters $(X, m(X))$ change with increasing $n$ such that $Y - m^*(X)$ and the error $m(X) - m^*(X)$ get smaller with increasing $n$. In order to simplify the notation we write $(X, Y)$ and $m$ instead of $(X^{(n)}, Y^{(n)})$ and $m^{(n)}$, respectively. In order to derive our main result we will use the following assumptions:
(A4) For some $\alpha_n > \alpha_n^*$ (where $\alpha_n^*$ is the value from (A2)) and for some $K, \sigma_0 > 0$ we have ...

Here (A1) imposes some mild condition on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from our experiments with the technical system. (A2) describes how well $Y$ can be approximated by a function of $X$, and $\alpha_n^*$ is an upper bound on the error of this approximation. The simplest (but most restrictive) case is that (A2) is satisfied for $\alpha_n^* = 0$, which means that $Y = m^*(X)$ holds. In (A3) the function $m : \mathbb{R}^d \to \mathbb{R}$ describes the simulation model which we use; here we impose smoothness assumptions on this function. Our main assumption is (A4): here $\alpha_n$ is a bound on the error of our model $m$ (when we compare it with $m^*$), and we impose a smoothness assumption on $m^* - m$.
Theorem 1. Assume that (A1)-(A4) hold. Define $\hat{m}_n$ by (2.10) and (2.11) for some $N_n$ satisfying $N_n \leq c_4 \cdot n^l$ for some $l \in \mathbb{N}$, let $w^{(n)} \in [0, 1]$ and define $m_n$ by (2.6). Then there exist constants $c_5, \ldots, c_{10} \in \mathbb{R}_+$ such that ... In particular, in case $w^{(n)} = 1$ and $\lambda_n = c_{11} \cdot ((\log n)/n)^{2 \cdot k / (2 \cdot k + d)}$ we get ...

Remark 1. In any application of the estimate in Theorem 1 we have to choose the parameters depending on the data. In Section 4 we will use $k$-fold cross-validation applied to the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ in order to choose $w^{(n)}$ and $\lambda_n$, and we choose $\lambda_{L_n}$ by generalized cross-validation applied to the data $(X_{n+i}, m(X_{n+i}))$ $(i = 1, \ldots, L_n)$.
Theorem 1 implies the following corollary concerning the $L_1$ error of the density estimate (2.12):

Corollary 1. Assume that the density $g$ of $Y$ is $(r, C)$-smooth for some $r \in (0, 1]$ and that its support is compact. Let $K : \mathbb{R} \to \mathbb{R}$ be a symmetric and bounded density which decreases monotonically on $\mathbb{R}_+$, and define the estimate $\hat{g}_{N_n}$ as in Section 2, where $m_n$ is defined as at the end of Theorem 1. Assume that the assumptions of Theorem 1 are satisfied and that, in addition, ...

Proof. Lemma 1 in Bott, Felber and Kohler (2015) implies that for any $z_1, z_2 \in \mathbb{R}$ we have ... From this and standard bounds on the $L_1$ error of kernel density estimates (cf., e.g., the proof of Theorem 1 in Felber, Kohler and Krzyżak (2015a)) we conclude ... Application of Theorem 1 yields the assertion.
Remark 2. It is well known that the $L_1$ error of the standard kernel density estimate applied to the data (1.2) achieves, under the assumptions of Corollary 1, the (optimal) rate of convergence $n^{-r/(2r+1)}$.
It follows from the proof of Corollary 1 (together with standard bounds on the $L_2$ error of smoothing spline estimates, cf., e.g., Chapter 21 in Györfi et al. (2002)) that the $L_1$ errors of the surrogate density estimates defined in (1.6) and (1.9) achieve, under the assumptions of Corollary 1, the rates of convergence ... For $\alpha_n$ suitably small the bound on the rate of convergence in Corollary 1 converges to zero faster than any of the above rates of convergence, which proves that there exist situations in which our estimate theoretically outperforms the estimates defined in (1.3), (1.6) and (1.9). So our main result is that in case our model $m$ is really good, we can improve the rate of convergence of a density estimate based on this model by incorporating real data. In the next section we demonstrate with simulated data that this is also the case for finite sample sizes.
Remark 3. In case the error $\alpha_n$ of our simulation model does not converge to zero quickly enough, it might happen that the rate of convergence in Corollary 1 is slower than the rate $n^{-r/(2r+1)}$ of the standard kernel density estimate applied to the data (1.2). For large $n$ it is easy to modify our estimate such that it always achieves at least the rate of convergence of this standard kernel density estimate: To do this we can split the data (1.2) into two parts of approximately equal size, define our estimate from Corollary 1 and the standard kernel density estimate using only the first part of the data, and use the second part to choose the better of these two estimates via the combinatorial method in density estimation of Devroye and Lugosi (2001). Using the results presented in Devroye and Lugosi (2001), it is easy to see that the $L_1$ error of the resulting density estimate converges to zero with the better of the two rates under the assumptions of Corollary 1; a sketch of the splitting idea is given below. Since in this paper we are interested in applications where the sample size is rather small (we have $n = 10$ in our real data application in the next section), we are interested in situations where the standard kernel density estimate cannot produce satisfying results, and therefore we do not incorporate this splitting of the sample in our simulations in the next section.
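A minimal R sketch of this splitting idea follows; a held-out log-likelihood score is used as a simple stand-in for the combinatorial method of Devroye and Lugosi (2001), and objects from the earlier sketches are reused.

```r
# Sample splitting as in Remark 3: fit two candidate density estimates on
# the first half of the real data and select the one that scores better on
# the second half. The selection rule below (held-out log-likelihood) is a
# simple stand-in for the combinatorial method of Devroye and Lugosi (2001).
idx <- sample(n, n %/% 2)
g_a <- density(Y_real[idx])              # standard kernel density estimate
g_b <- g_hat                             # surrogate-based estimate from above
score <- function(g_est) {
  dens <- approx(g_est$x, g_est$y, xout = Y_real[-idx],
                 yleft = 0, yright = 0)$y
  -sum(log(pmax(dens, 1e-12)))           # negative held-out log-likelihood
}
g_best <- if (score(g_a) < score(g_b)) g_a else g_b
```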
Remark 4. The rate of convergence in Corollary 1 gets worse as the dimension $d$ of $X$ increases. This is a consequence of the well-known curse of dimensionality in nonparametric regression, which states that estimation of functions in high dimensions is particularly difficult. In our method we use smoothing splines in order to estimate a surrogate model of our model $m$ and the difference between $m$ and $m^*$. It is well known that this kind of estimate usually requires $d \in \{1, 2, 3\}$ in order to produce satisfying results. For larger $d$ one has to use other techniques from function estimation in our procedure and to impose structural assumptions on the functions to be estimated in order to get satisfying results. How this can be achieved with neural networks in the context of the newly proposed density estimate is described in Götz, Kersting and Kohler (2020).

Application to simulated and real data
In this section we illustrate the finite sample size performance of our estimates by applying them to simulated and real data.
We begin with the simulated data, which we use to illustrate how the size of the error of the model influences the performance of our estimate. To do this, we choose $X$ with a $d$-dimensional standard normal distribution and $\epsilon$ uniformly distributed on $[0, 1]$ such that $X$ and $\epsilon$ are independent, set $Y = m(X) + \sigma \cdot \epsilon$ for some $m : \mathbb{R}^d \to \mathbb{R}$ defined below and $\sigma \in \{0.1, 0.5, 1\}$, and let $(X_1, Y_1), (X_2, Y_2), \ldots$ be independent and identically distributed copies of $(X, Y)$. Our estimate uses the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from the real technical system as well as the data $(X_{n+1}, m(X_{n+1})), \ldots, (X_{n+L_n}, m(X_{n+L_n}))$ from the (imperfect) computer model (where $\sigma$ controls the maximal error occurring in this model), and the additional $X$-values $X_{n+L_n+1}, \ldots, X_{n+L_n+N_n}$.
If we compare this setting with Theorem 1, we see that in the notation of Theorem 1 we have $m^*(x) = \mathbf{E}\{Y \mid X = x\} = m(x) + \frac{1}{2} \cdot \sigma$, and it is easy to see that in this setting the assumptions of Theorem 1 are satisfied if we choose $\alpha_n^* = \sigma / 32^{1/3}$ and $\alpha_n = \sigma / 2$. In all of our applications we choose $n \in \{10, 20, 40\}$ and $L_n = 500$. As surrogate estimate we use a thin plate spline as implemented in the routine Tps() in the statistics software R, where we use 5-fold cross-validation (applied to the data $(X_1, Y_1), \ldots, (X_n, Y_n)$) to choose the degrees of freedom df of the fitted spline from the set $\{4, 8, 16, \ldots, 256\}$. In the same way we also choose $w^{(n)}$ from the set $\{0, 0.1, \ldots, 1\}$, i.e., we choose the degrees of freedom df and the weight $w^{(n)}$ simultaneously by 5-fold cross-validation. During the 5-fold cross-validation we compute an estimate of the $L_1$ error of our smoothing spline estimate.
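A hypothetical sketch of this simultaneous 5-fold cross-validation over df and the weight could look as follows; it reuses the objects from the sketches in Section 2, uses the mean absolute error on the held-out real data as the selection criterion, and is only an illustration of the selection step, not the exact implementation used in the simulations.

```r
# Simultaneous 5-fold cross-validation of the degrees of freedom df and the
# weight w^(n); variable names follow the sketches above.
cv_l1 <- function(df, w) {
  folds <- sample(rep(1:5, length.out = n))
  errs <- sapply(1:5, function(k) {
    tr <- folds != k
    X_tr <- rbind(X_real[tr, , drop = FALSE], X_model)
    res_tr <- c(Y_real[tr] - m_tilde(X_real[tr, , drop = FALSE]),
                m(X_model) - m_tilde(X_model))
    wts <- c(rep(w / sum(tr), sum(tr)), rep((1 - w) / L_n, L_n))
    fit <- Tps(X_tr, res_tr, weights = wts, df = df)
    pred <- m_tilde(X_real[!tr, , drop = FALSE]) +
            predict(fit, X_real[!tr, , drop = FALSE])
    mean(abs(Y_real[!tr] - pred))      # L1 error estimate on the held-out fold
  })
  mean(errs)
}
grid <- expand.grid(df = 2^(2:8), w = seq(0, 1, by = 0.1))
scores <- mapply(cv_l1, grid$df, grid$w)
best <- grid[which.min(scores), ]      # selected (df, w^(n))
```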
For our newly proposed density estimate we use a sample of size $N_n = 500{,}000$ of $m_n(X)$ (where $m_n$ is the estimate introduced in Section 2) and construct a kernel density estimate based on this sample with $L_2$ cross-validation as implemented in the routine density() in the statistics package R. Here the bandwidth is chosen as the maximum of the bandwidth chosen by $L_2$ cross-validation for the kernel density estimate (as implemented in R) and two times the estimated $L_1$ error of our smoothing spline estimates.
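In code, this bandwidth rule might look as follows (a sketch; 'scores' is the cross-validation objective from the sketch above, whose minimum serves as the estimated L1 error):

```r
# Bandwidth choice for the final kernel density estimate: maximum of the
# L2 cross-validated bandwidth and twice the cross-validated L1 error
# estimate of the spline; continues the sketches above.
sample_mn <- as.numeric(m_n(X_new))
bw_ucv <- density(sample_mn, bw = "ucv")$bw     # L2 (unbiased) cross-validation
l1_hat <- min(scores)                           # estimated L1 error of the spline
g_hat <- density(sample_mn, bw = max(bw_ucv, 2 * l1_hat))
```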
The density of $Y$ is the convolution of the density of $m(X)$ and a uniform density. We do not try to compute its exact form; instead we compute it approximately by applying a kernel density estimate (as implemented in the routine density() in R) to a sample of size 1,000,000 of $Y$. In order to judge the quality of our density estimates, the resulting density is treated in our simulations as if it were the real density.
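For instance (continuing the running sketch with sigma = 0.5):

```r
# Reference density: kernel density estimate from a large sample of Y,
# treated as the true density when judging the estimates.
M <- 1e6
X_big <- matrix(rnorm(2 * M), ncol = 2)
g_ref <- density(m(X_big) + 0.5 * runif(M))
```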
We compare our estimate (est. 4) with four other density estimates. The first one (est. 1) is the standard kernel density estimate as implemented in R, applied to the sample of size $n$ of $Y$, cf. (1.3). Estimates two and three are surrogate density estimates, where the kernel density estimate of R is used with a sample of size $N_n = 500{,}000$ of the surrogate model. For est. 2 the surrogate model is obtained by applying a thin plate spline (as implemented in R) to the sample of size $n$ of $(X, Y)$, cf. (1.6). For est. 3 the surrogate model is computed in the same way, but this time using the sample of size $L_n = 500$ of our model $(X, m(X))$, cf. (1.9). Finally, est. 5 is our newly proposed estimate, but this time with the weight $w^{(n)} = 1$ chosen independently of the data.
We consider three different models. In the first model we choose $d = 2$ and define $m$ by $m(x_1, x_2) = 2 \cdot x_1 + x_2 + 2$. Figure 1 shows plots of the reference density and of the five different density estimates for a data set of model 1, where we use $n = 20$, $\sigma = 0.5$, $L_n = 500$ and $N_n = 500{,}000$.
In the second model we choose again $d = 2$, but use a different definition of $m$. Figure 2 shows plots of the reference density and of the five different density estimates for a data set of model 2, where we use $n = 20$, $\sigma = 0.5$, $L_n = 500$ and $N_n = 500{,}000$.
In the third model we choose $d = 1$ and define $m$ by $m(x) = \exp(x)$. Figure 3 shows plots of the reference density and of the five different density estimates for a data set of model 3, where we use $n = 20$, $\sigma = 0.5$, $L_n = 500$ and $N_n = 500{,}000$.

We compare the $L_1$ errors of our five different estimates. To do this, we approximate the $L_1$ distance to the reference density by a Riemann sum defined on an equidistant partition consisting of 8192 subintervals of the interval $[-6, 10]$ (in model 1) or the interval $[0, 10]$ (in models 2 and 3). Since this $L_1$ error is random, we repeat each simulation 100 times and report in Table 1 the median (and in brackets the interquartile range) of the 100 $L_1$ errors for each of our five estimates.
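The Riemann-sum approximation of the L1 error can be sketched as follows (for model 1 on the interval [-6, 10]; g_hat and g_ref are density() objects from the sketches above):

```r
# L1 error via a Riemann sum on an equidistant partition of [-6, 10] with
# 8192 subintervals; both densities are evaluated on a common grid.
grid_y <- seq(-6, 10, length.out = 8192)
f_hat <- approx(g_hat$x, g_hat$y, xout = grid_y, yleft = 0, yright = 0)$y
f_ref <- approx(g_ref$x, g_ref$y, xout = grid_y, yleft = 0, yright = 0)$y
l1_error <- sum(abs(f_hat - f_ref)) * (10 - (-6)) / 8192
```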
Table 1: Simulation results in the three different models.

From Table 1 we see that our estimate outperforms estimates 1 through 3 in 19 out of 27 settings, and in these cases its error is often two to three times smaller than the errors of all other estimates. In the five cases where it does not achieve the smallest error compared with estimates 1 through 3, its error is approximately of the same size as the smallest error (and at most 30 percent larger). These larger errors occur only in model 1, where the function $m$ is a linear function which can easily be estimated even from a small sample of observations, and where therefore the surrogate density estimate based on an estimated surrogate is rather good.
By comparing the performance of estimates 4 and 5 in Table 1, we can furthermore get an idea of the usefulness of our data-dependent choice of the weight. Here we see that estimate 4 is better than estimate 5 in 11 cases, and estimate 5 is better than estimate 4 in 11 cases. However, this occurs for different sample sizes: for $n = 10$, estimate 4 (with the data-dependent weight) is better 6 times (and worse only once) than estimate 5 (with weight $w^{(n)} = 1$), but for $n = 40$, estimate 4 is better only 2 times (and worse 5 times). From this we see that our data-dependent choice of the weight is useful for very small sample sizes (which is the most important case in practice).
The merits of the new uncertainty quantification approach proposed in this paper are reinforced by applying it to real data obtained in an experimental study of the suspension strut illustrated in Figure 4. This assembly has been used to study uncertainty in load distributions and the feasibility of controlling vibrations, stability and load paths in suspension struts such as aircraft landing gears. The structure of the strut assembly is shown in the left panel of Figure 5. The right panel presents the simplified physical model of the suspension strut. The suspension strut consists of an upper and a lower structure. The lower structure contains a spring-damper component, which serves as a mediating agent transmitting axial forces between the upper structure and the foot of the suspension strut. Our goal is to study the impact of the randomly selected free-fall height $h_f$ on the maximum relative compression of the spring-damper component. To that end we assume that the free-fall heights are independent normally distributed random variables with mean 0.05 m and standard deviation 0.0057 m. We analyze the uncertainty in the maximum relative compression of our suspension strut using a simplified mathematical/physical model of the suspension strut (cf., Figure 5 (right panel)). In this model the upper and the lower structures of the suspension strut are represented by the lumped masses $m$ and $m_1$, respectively, the spring-damper component is characterized by the stiffness and damping coefficients $k$ and $b$, respectively, and the elasticity of the foot component is described by the stiffness parameter $k_{ef}$. Assuming linear stiffness and an axiomatic damping, it is possible to compute the maximum relative compression by solving an ordinary differential equation with a Runge-Kutta algorithm (cf., model a) in Mallapur and Platz (2017)). We use $L_n = 500$ samples generated in computer experiments to construct the surrogate estimate $\tilde{m}_{L_n}$ described above.
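The following R sketch indicates how such a drop test could be simulated. It is a deliberately simplified two-mass spring-damper model in the spirit of the right panel of Figure 5, not the model a) of Mallapur and Platz (2017), and all parameter values are invented for illustration; it requires the deSolve package.

```r
# Simplified two-mass drop model: upper mass m and lower mass m1 coupled by
# a spring-damper (k, b), elastic foot (k_ef, one-sided contact), released
# with the touchdown velocity of a free fall from random height h_f.
# All parameter values are illustrative; requires the 'deSolve' package.
library(deSolve)
rhs <- function(t, s, p) with(as.list(c(s, p)), {
  f_sd <- k * (z - z1) + b * (v - v1)     # spring-damper force
  f_ef <- k_ef * min(z1, 0)               # foot contact force (one-sided)
  list(c(v,  -f_sd / m  - grav,           # upper mass: position z, velocity v
         v1, (f_sd - f_ef) / m1 - grav))  # lower mass: position z1, velocity v1
})
p <- c(m = 80, m1 = 20, k = 3e4, b = 1.5e3, k_ef = 5e5, grav = 9.81)
h_f <- rnorm(1, mean = 0.05, sd = 0.0057)         # random free-fall height
v0 <- -sqrt(2 * p[["grav"]] * h_f)                # velocity at touchdown
out <- ode(y = c(z = 0, v = v0, z1 = 0, v1 = v0),
           times = seq(0, 0.5, by = 1e-4),
           func = rhs, parms = p, method = "rk4")
max_rel_compression <- max(out[, "z1"] - out[, "z"])  # maximum relative compression
```

Repeating this computation for randomly drawn heights yields the computer-experiment sample from which the surrogate estimate is constructed.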
In Figure 6 we see in the upper left panel data from $L_n = 500$ computer experiments together with the corresponding surrogate model (solid line), and in the upper right panel the corresponding surrogate density estimate. In the lower left panel we see again the surrogate model (dashed-dotted) based on the data from the computer experiments, together with $n = 10$ real data points from the experiment. Clearly, our (dashed-dotted) surrogate model based only on the computer experiment is imperfect, since it does not really fit the real data. We can improve this imperfect surrogate model by using the techniques introduced in this paper, and we effectively obtain the solid line in the lower left panel of Figure 6. The corresponding surrogate density estimate is shown in the lower right panel of Figure 6. We observe that the use of $n = 10$ additional data points clearly leads in this example to a different density estimate than the estimate based only on the model data in the upper right panel of Figure 6.

Proofs

Auxiliary results
In this subsection we present various auxiliary results on smoothing spline estimates, which we use in the next subsection in order to derive a new error bound for smoothing splines applied to weighted data with additional measurement errors in the dependent variable, cf. Theorem 2 below. This result will be used to prove Theorem 1.

A deterministic lemma
Lemma 1. Let ... (where we tacitly assume that the above minimum exists), and let $m_N^* \in \mathcal{F}_N$ be arbitrary. Then ...

Proof. The assertion follows as in the proof of Lemma 5 in Furer and Kohler (2015). A complete proof is included in the Appendix.

A bound on a covering number
Definition 1. Let $l \in \mathbb{N}$ and let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^l \to \mathbb{R}$. The covering number $\mathcal{N}_2(\epsilon, \mathcal{F}, x_1^n)$ is defined for any $\epsilon > 0$ and $x_1^n = (x_1, \ldots, x_n) \in (\mathbb{R}^l)^n$ as the smallest integer $k$ such that there exist functions $g_1, \ldots, g_k : \mathbb{R}^l \to \mathbb{R}$ with the property that for every $f \in \mathcal{F}$ there is a $j = j(f) \in \{1, \ldots, k\}$ such that
$$\left( \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - g_j(x_i)|^2 \right)^{1/2} < \epsilon.$$

Lemma 2. Let $L, A, c > 0$ and set
$$\mathcal{F} = \left\{ T_L f : f \in W^k(\mathbb{R}^d), \, J_k^2(f) \leq c \right\}.$$
Then there exist constants $c_{17}, c_{18}, c_{19} \in \mathbb{R}_+$ depending only on $A$, $k$ and $d$ such that for any $\epsilon > 0$ and all $x_1, \ldots, x_n \in [-A, A]^d$ we have ...

Proof. See Lemma 20.6 and Problem 20.9 in Györfi et al. (2002), or Lemma 3 in Kohler, Krzyżak and Schäfer (2002).

A bound on the error for smoothing spline estimates for fixed design regression
Let $L \geq 0$ and assume that
$$Y_i = m(x_i) + W_i \quad (i = 1, \ldots, n)$$
for some $x_1, \ldots, x_n \in \mathbb{R}^d$, $m : \mathbb{R}^d \to \mathbb{R}$ and some random variables $W_1, \ldots, W_n$ which are independent and have expectation zero. We assume that the $W_i$'s are sub-Gaussian in the sense that
$$\max_{i=1,\ldots,n} K^2 \cdot \mathbf{E}\left\{ e^{W_i^2 / K^2} - 1 \right\} \leq \sigma_0^2 \quad (5.4)$$
for some $K, \sigma_0 > 0$. Our goal is to estimate $m$ from $(x_1, \bar{Y}_{1,n}), \ldots, (x_n, \bar{Y}_{n,n})$, where $\bar{Y}_{1,n}, \ldots, \bar{Y}_{n,n} \in [-L, L]$ are arbitrary (bounded) random variables with the property that the average squared measurement error
$$\frac{1}{n} \sum_{i=1}^{n} |\bar{Y}_{i,n} - Y_i|^2$$
is "small". Let $\mathcal{F}_n$ be a set of functions $f : \mathbb{R}^d \to \mathbb{R}$ and consider the least squares estimate with complexity penalty
$$\tilde{m}_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \left[ \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - \bar{Y}_{i,n}|^2 + \mathrm{pen}_n^2(f) \right],$$
where for $f \in \mathcal{F}_n$, $\mathrm{pen}_n^2(f) \geq 0$ is a penalty term penalizing the complexity of $f$, and where $\beta_n \geq L$. Set $\hat{m}_n = T_{\beta_n} \tilde{m}_n$.

Lemma 3. Assume that the sub-Gaussian condition (5.4) holds. Let $\mathcal{F}_n$ be a set of functions $f : \mathbb{R}^d \to \mathbb{R}$, let $\tilde{m}_n(\cdot) = \tilde{m}_n(\cdot, (x_1, \bar{Y}_{1,n}), \ldots, (x_n, \bar{Y}_{n,n})) \in \mathcal{F}_n$ and set $\hat{m}_n = T_{\beta_n} \tilde{m}_n$. Then there exist constants $c_{20}, c_{21}, c_{22} > 0$ which depend only on $\sigma_0$ and $K$ such that for any $\delta_n > c_{20}/n$ with ... $du$ for all $\delta \geq \delta_n / 6$ and all $g \in \mathcal{F}_n$, we have for any $m_n^* \in \mathcal{F}_n$
$$\mathbf{P}\left\{ \|\hat{m}_n - m_n^*\|_n^2 + \mathrm{pen}_n^2(\tilde{m}_n) + 4 \cdot \delta_n \leq 24 \ldots \right\} \ldots$$

Proof. The result follows from the proof of Lemma 2 in Kohler and Krzyżak (2012). A detailed proof is included in the Appendix.

A bound on the deviation between the $L_2$ error and the empirical $L_2$ error for smoothing splines
Let $(X, Y), (X_1, Y_1), \ldots$ be independent and identically distributed $\mathbb{R}^d \times \mathbb{R}$-valued random variables with $\mathbf{E} Y^2 < \infty$. Let $m(x) = \mathbf{E}\{Y \mid X = x\}$ be the corresponding regression function. Let $\bar{Y}_{1,n}, \ldots, \bar{Y}_{n,n}$ be $\mathbb{R}$-valued random variables and define the estimate $\tilde{m}_n$ by
$$\tilde{m}_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \left[ \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - \bar{Y}_{i,n}|^2 + \mathrm{pen}_n^2(f) \right],$$
where $\mathcal{F}_n$ is a set of functions $f : \mathbb{R}^d \to \mathbb{R}$ and for $f \in \mathcal{F}_n$, $\mathrm{pen}_n^2(f) \geq 0$ is a penalty term penalizing the complexity of $f$. Set $\hat{m}_n = T_{\beta_n} \tilde{m}_n$ for some $\beta_n > 0$. Then the following result holds.
Lemma 4. Let $\beta_n \geq L \geq 1$ and assume that $m$ is bounded in absolute value by $L$. Let $\mathcal{F}_n$ be a set of functions $f : \mathbb{R}^d \to \mathbb{R}$, let $\tilde{m}_n(\cdot) = \tilde{m}_n(\cdot, (X_1, \bar{Y}_{1,n}), \ldots, (X_n, \bar{Y}_{n,n})) \in \mathcal{F}_n$ and set $\hat{m}_n = T_{\beta_n} \tilde{m}_n$. Then there exist constants $c_{23}, c_{24}, c_{25}, c_{26} > 0$ such that for any $\delta_n > 0$ which satisfies ... $du$ for all $\delta \geq \delta_n$ and all $x_1, \ldots, x_n \in \mathbb{R}^d$, we have for $n \in \mathbb{N}$
$$\mathbf{P}\left\{ \int |\hat{m}_n(x) - m(x)|^2 \, \mathbf{P}_X(dx) > \delta_n + 3 \cdot \mathrm{pen}_n^2(\tilde{m}_n) \ldots \right\} \ldots$$

Proof. The result follows from the bound on $P_{1,n}$ presented in the proof of Lemma 3 in Kohler and Krzyżak (2012). A detailed proof is included in the Appendix.
Proof of Theorem 2. In the first step of the proof we show that we can assume w.l.o.g. that
$$\bar{Y}_{i,n} \in [-\beta_n, \beta_n] \quad \text{for all } i = 1, \ldots, n + L_n. \quad (5.13)$$
To do this, we let
$$A_n = \left\{ |\bar{Y}_{i,n}| \leq \beta_n \text{ for all } i = 1, \ldots, n + L_n \right\}$$
be the event that all $\bar{Y}_{i,n}$ are bounded in absolute value by $\beta_n$. The union bound together with the Markov inequality implies ... On the event $A_n$ the estimate $\hat{m}_n$ coincides with the estimate ... From this we can conclude ..., which completes the first step of the proof.
So from now on we assume that (5.13) holds. Set ... In the second step of the proof we show that the assertion follows from
$$\int_{36 \cdot \gamma_n}^{\infty} \mathbf{P}\{T_n > t\} \, dt \leq \frac{c_{33}}{n}.$$
To do this, we observe ... The definition of $\gamma_n$ and of the weights implies the assertion of the second step.
In the fourth step of the proof we derive an upper bound on $\int_{36 \cdot \gamma_n}^{\infty} P_{1,n}(t) \, dt$.
Let $t \geq 36 \cdot \gamma_n$. The definition of the weights together with ... implies that we have ... We show next that Lemma 4 is applicable to the two different probabilities, where both times $\beta_n$ is replaced by $\beta$ and where we use the sample sizes $n$ and $L_n$, respectively. Since $t \geq 36 \cdot \gamma_n \geq 2 \cdot \gamma_n$ implies ..., in order to show that Lemma 4 is applicable to the first probability it suffices to show $\delta_n > c_{34} \cdot \beta_n^2$ and ... Using this together with Lemma 2 we see that Lemma 4 is applicable to the first probability if $\delta_n > c_{34} \cdot \beta_n^2$ and the following inequality hold: ... The last condition is implied by ..., which in turn follows from
$$\delta \geq c_{40} \cdot \frac{\log n}{n} \cdot \lambda_n^{d/(2k)} \quad \text{and} \quad \delta \geq c_{40} \cdot \frac{\log n}{n}.$$

In case that $\lambda_n \leq 1$, the last two conditions hold for all $\delta \geq \delta_n$, provided $c_{32}$ is chosen large enough.
In the same way one can show that Lemma 4 is also applicable to the second probability above.
In the fifth step of the proof we derive an upper bound on $\int_{36 \cdot \gamma_n}^{\infty} P_{2,n}(t) \, dt$, which will allow us to increase the truncation level in the following. To do this, we observe that the above inequality together with (5.13) and Lemma 1 (applied with $m_N^* = m$) implies
$$P_{2,n}(t) \leq \ldots$$
Proceeding as in the fourth step, we can conclude from the definition of the weights that the last probability is bounded by ... and that Lemma 3 can be applied to both probabilities. From this we can conclude that the above probabilities are bounded by ... Summarizing the above results, we get the assertion.

Acknowledgment
The first author would like to thank the German Research Foundation (DFG) for funding this project (Projektnummer 57157498, SFB 805). The second author would like to acknowledge the support from the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN-2020-06793.

Appendix
This together with the definition of the estimate implies ... We show next that $T_1 \leq T_2$. Assume to the contrary that this is not true. Then ... But this is a contradiction to (5.1), so we have indeed proved $T_1 \leq T_2$. As a consequence we can conclude from (5.1) ... In the next to last inequality we have used that $a^2/2 - b^2 \leq (a - b)^2$ for $a, b \in \mathbb{R}$. Application of Chernoff's exponential bounding method (cf. Chernoff (1952)) together with (5.4) yields ...