Deep Gaussian Processes for Calibration of Computer Models

Abstract. Bayesian calibration of black-box computer models offers an established framework for quantification of uncertainty of model parameters and predictions. Traditional Bayesian calibration involves the emulation of the computer model and an additive model discrepancy term using Gaussian processes; inference is then carried out using Markov chain Monte Carlo. This calibration approach is limited by the poor scalability of Gaussian processes and by the need to specify a sensible covariance function to deal with the complexity of the computer model and the discrepancy. In this work, we propose a novel calibration framework, where these challenges are addressed by means of compositions of Gaussian processes into Deep Gaussian processes and scalable variational inference techniques. Thanks to this formulation, it is possible to obtain a flexible calibration approach, which is easy to implement in development environments featuring automatic differentiation and exploiting GPU-type hardware. We show how our proposal yields a powerful alternative to the state-of-the-art by means of experimental validations on various calibration problems. We conclude the paper by showing how we can carry out adaptive experimental design, and by discussing the identifiability properties of the proposed calibration model.


Introduction
The task of carrying out inference of parameters of expensive computer models from data is a classical problem in Statistics (Sacks et al., 1989). Such a problem is referred to as calibration (Kennedy and O'Hagan, 2001), and the results are often of interest to draw conclusions on parameters that have a direct interpretation as physical quantities (see, e.g., Section 2.2 of Brynjarsdóttir and O'Hagan 2014). Calibration finds numerous applications in fields as diverse as climatology (Sansó et al., 2008; Salter et al., 2018), environmental sciences (Larssen et al., 2006; Arhonditsis et al., 2007), biology (Henderson et al., 2009), and mechanical engineering, to name a few. There are many fundamental difficulties in calibrating expensive computer models, which we can distinguish between computational and statistical. Computational issues arise from the fact that traditional optimization and inference techniques require running the expensive computer model many times for different values of the parameters, which might be unfeasible within a given computational budget. Statistical limitations, instead, arise from the fact that computer models can only abstract real processes with a given level of accuracy.
Building on previous work from Sacks et al. (1989), in their seminal paper, Kennedy and O'Hagan (2001) propose a statistical model, based on Gaussian processes (gps; Rasmussen and Williams, 2006), which jointly tackles the problems above. In their model, which we will refer to as koh, the output of a deterministic computer model is emulated through a gp estimated from a set of computer experiments; in this way, computational issues are bypassed by using the predictive distribution of the emulating gp for any given set of parameters in place of expensive runs of the computer model. Observations from the real process, also known as field observations, instead, are modeled through the gp emulating the computer model with the addition of a so-called discrepancy term, which is also assigned a gp prior. The introduction of the discrepancy term is key to avoid biased estimates of the parameters due to misspecifications of the computer model (Brynjarsdóttir and O'Hagan, 2014). The koh model is treated in a Bayesian way, making it suitable for problems where quantification of uncertainty is an important requirement. This is often the case when one is interested in drawing conclusions on parameters of interest, making predictions for decision-making with specific costs associated with predictions, or when one is interested in iteratively improving the experimental design.
While the koh model and inference make for an attractive and elegant framework to tackle quantification of uncertainty for calibration of expensive computer models, there are limitations which we aim to overcome with this work. From the modeling perspective, gps are indeed flexible emulators, provided that a suitable covariance function is chosen, as in the literature of nonstationary gps (e.g. Paciorek and Schervish 2003). However, more recent approaches like Deep gps (dgps; Damianou and Lawrence 2013) have shown great modeling flexibility for many classes of functions and can potentially lead to more accuracy in the emulation of the computer model and the real process compared to gps. From the computational perspective, limitations are inherited from the poor scalability of gps (Rasmussen and Williams, 2006), for which inference becomes impractical when the number of runs of the code and the number of real observations are collectively beyond a few thousands. In addition, the use of Markov chain Monte Carlo (mcmc) (Neal, 1993) techniques to carry out inference for gp models can be painfully slow without careful tuning and clever parameterizations (Filippone et al., 2013;Filippone and Girolami, 2014).
This work aims to tackle these issues by proposing the use of recent developments in the gp and dgp literature and variational inference, (i) to extend the modeling capabilities of gps in emulation using dgps; (ii) to extend the original framework in Kennedy and O'Hagan (2001), by casting the model as a special case of a dgp; (iii) to adapt techniques based on random feature expansions and stochastic variational inference, building on the work by Cutajar et al. (2017), to obtain a scalable framework for Bayesian calibration of computer models. Thanks to this formulation, it is possible to obtain a flexible calibration approach, which is easy to implement in development environments featuring automatic differentiation and exploiting GPU-type hardware.
We validate our proposal, which we name dgp-cal, on a variety of calibration problems, comparing with alternatives from the state-of-the-art. We demonstrate the flexibility and the scalability of dgp-cal, as well as the ability to capture the uncertainty in model parameters and model discrepancy. We conclude the paper by showing how we can carry out adaptive experimental design, and by discussing the identifiability properties of the proposed calibration model. The code to replicate all the results in the paper is available at the following url: https://github.com/SebastienMarmin/variational-calibration.

Background
In this section, we introduce the problem of calibration of computer models and we present the koh model. We then introduce Gaussian processes (gps), which are the main modeling ingredients in the koh model. Motivated by the difficulties associated with carrying out inference with gps, we present random feature expansions as a way to reduce complexity and to exploit recent advances in approximate inference. In particular, we focus on variational inference (vi) techniques that are able to operate on mini-batches of data and that can be easily implemented in development environments featuring automatic differentiation. We conclude this section by showing how we can increase the flexibility of gps by composing processes, obtaining dgps, for which we can extend the use of random feature expansions and vi. This background material gives us all the elements to present our proposed calibration model in Section 3.

Bayesian Calibration
Consider prediction and uncertainty quantification for a phenomenon approximated by a computer model, which is expensive to evaluate. Throughout the paper, we will assume that the output of the computer model is univariate, but we will discuss ways in which we can deal with multiple responses. We denote observations from the real phenomenon of interest by y ∈ R, and we assume that we have n of these available, y = [y_1, . . . , y_n]^⊤, for a number of inputs X = [x_1, . . . , x_n]^⊤, with x_i ∈ D_1 ⊂ R^{d_1}. For example, in climate modeling, y could correspond to temperature measurements at n locations identified by latitude and longitude (in this case, d_1 = 2). The computer model simulating the real phenomenon requires the so-called calibration parameters θ ∈ D_2 ⊂ R^{d_2}, as well as input variables x ∈ D_1. Calibration parameters may have a physical meaning (e.g., exchange rates determining the carbon cycle) and inference over these is a central goal of Bayesian calibration.
Beside the observations y associated with X, the computer model is run at (possibly different) inputs X* = [x*_1, . . . , x*_N]^⊤ and calibration parameters T = [t_1, . . . , t_N]^⊤, yielding a collection of outputs z = [z_1, . . . , z_N]^⊤. Note that we denote by T the collection of N parameter configurations at which the computer model is run, while θ denotes the true (unknown) parameter we are interested in inferring. Generally, N is larger than n, as running the computer model is easier compared to obtaining real-world observations (albeit computationally expensive).
Following the definition of the koh model in Kennedy and O'Hagan (2001), we assume that y and z are drawn from p(y_i | f_i) and p(z_j | η*_j), which determine the likelihood functions. The vectors f = [f_1, . . . , f_n]^⊤ and η* = [η*_1, . . . , η*_N]^⊤ result from mapping (x_i, θ)_{i=1,...,n} and (x*_j, t_j)_{j=1,...,N} through random functions f and η, respectively. The link between the computer model, with latent representation η, and the real phenomenon, with latent representation f, is modeled by

f(x) = η(x, θ) + δ(x),    (2.1)

where δ(x) represents the discrepancy between the computer model and the real process. Figure 1 illustrates the koh calibration model. In their Bayesian formulation, Kennedy and O'Hagan (2001) assume η(x, t) and δ(x) to be independent Gaussian processes (gps); in other words, the koh model assumes a given prior distribution over these functions, which takes the form of a gp. In addition, they assume a prior over θ, and they aim to characterize the posterior distribution over θ given the observations of the real process and runs of the computer model. In order to keep the notation uncluttered, we denote by ψ the set of gp parameters for η(x, t) and δ(x), and we denote by U the collection of all input locations X, X*, T. The marginal likelihood of the koh model is

p(y, z | U, ψ) = ∫ p(y | η + δ) p(z | η*) p(η, η*, δ | U, θ, ψ) p(θ) dθ dη dη* dδ,

where η = [η(x_1, θ), . . . , η(x_n, θ)]^⊤ and δ = [δ(x_1), . . . , δ(x_n)]^⊤. The high dimensionality and the nontrivial dependence of this integrand on the parameters of interest θ make inference intractable, thus requiring approximations. Similar considerations can be made for the predictive distribution, that is, the distribution of predictions made by the koh model at new input values. In the original work by Kennedy and O'Hagan (2001), inference is carried out using mcmc, which offers guarantees of convergence to the true posterior distribution over model parameters, but it can be extremely slow and impractical when the number of field observations and computer simulations is large.
This is due to the poor scalability properties of gps, compounded by the need to repeatedly solve expensive linear systems involving the covariance matrix within the mcmc loop.
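To make the generative structure of the koh model concrete, the following NumPy sketch simulates field observations and computer runs in a toy one-dimensional setting. The functions `eta` and `delta` and all numerical values are purely illustrative stand-ins: a real computer model would be expensive to evaluate, and the discrepancy would be a draw from a gp prior rather than a fixed function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "computer model" eta(x, theta); in practice expensive to run.
def eta(x, theta):
    return np.sin(theta * x)

# Illustrative discrepancy delta(x), standing in for a GP prior draw.
def delta(x):
    return 0.1 * x**2

theta_true = 1.5      # true calibration parameter (unknown in practice)
sigma_noise = 0.05    # observation noise standard deviation

# Field observations: y_i = eta(x_i, theta) + delta(x_i) + noise, cf. (2.1)
X = rng.uniform(0.0, 1.0, size=20)
y = eta(X, theta_true) + delta(X) + sigma_noise * rng.standard_normal(20)

# Computer runs: z_j = eta(x*_j, t_j) on a design over inputs and parameters
X_star = rng.uniform(0.0, 1.0, size=50)
T = rng.uniform(0.5, 2.5, size=50)
z = eta(X_star, T)
```

Calibration then amounts to inferring `theta_true` (and the discrepancy) from `y` and `z` jointly, without further runs of the expensive model.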
It is worth mentioning the identifiability issues associated with the koh model, brought up in the discussion of the paper of Kennedy and O'Hagan (2001). Such issues arise due to the over-parameterization of the model, whereby it is possible to confound the effects of calibration parameters and model discrepancy. In particular, the koh calibration model yields a joint posterior distribution over θ and δ(x), but as the number of observations increases, the koh model concentrates this posterior over the manifold of pairs (θ, δ(·)) for which η(x, θ) + δ(x) matches the real process. Brynjarsdóttir and O'Hagan (2014) nicely illustrate this problem as they study how removing the discrepancy would result in a model where the estimation of model parameters θ is biased. The argument is that if η(x, θ) does not model the real physical process exactly and there is no discrepancy, the estimate of θ would be based on a misspecified model. Increasing the number of observations would not cure this fundamental issue with model misspecification. The conclusion is that discrepancy in the koh model is necessary to hope for a sound inference over calibration parameters θ of the computer model, and imposing good priors on θ and δ(x) becomes of fundamental importance to mitigate the lack of identifiability. Alternatively, one can improve identifiability when multiple responses are available, and these are mutually dependent on the same set of calibration parameters θ (Arendt et al., 2012). In the literature, there are works addressing the issue of identifiability of the koh model with alternative formulations, such as loss minimization (Tuo and Wu, 2016) or frequentist formulations (Wong et al., 2017). Within this line of work, assuming that the optimal θ is the optimizer of a loss function, Plumlee (2017) showed that the prior over δ(x) should be orthogonal to the gradient of the computer model.
We remark that the issue of identifiability affects the model we propose here in a similar way, and we dedicate part of the discussion to commenting on how we can adopt current strategies from the literature to deal with it.

Gaussian Processes and Random Feature Expansions
A Gaussian process (gp) is a set of random variables such that any subset of these is jointly distributed as a Gaussian (Rasmussen and Williams, 2006). This definition makes them suitable for assigning priors over functions. Imposing a gp prior over a function g_exact(u), u ∈ D ⊂ R^d, means assigning a prior over the realizations of the function [g_exact(u_1), . . . , g_exact(u_l)]^⊤ at a set of l inputs u_1, . . . , u_l, such that this is multivariate Gaussian; this is because of the properties of marginals of multivariate Gaussian distributions. What needs to be specified is a mean function and a covariance function c(u_i, u_j), which determines how realizations of the function at different inputs covary, and therefore the properties of the functions that can be drawn from the gp. For simplicity, we assume a constant zero mean, but adding a parametric mean function is straightforward. Inference in models involving gps quickly becomes intractable when l grows beyond a few thousand. The reason is that sampling from gps and posterior inference require algebraic operations with the covariance matrix obtained by evaluating the covariance function among all possible pairs of inputs. These operations usually cost O(l^3) computations and require storing O(l^2) entries for the covariance matrix. In this work, we bypass these limitations by making a model approximation which lowers both complexities to O(l), as we discuss shortly.
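The costs above can be illustrated with a short NumPy sketch that draws one gp prior sample via a Cholesky factorization of the covariance matrix; the squared-exponential covariance, the jitter term, and all settings are illustrative choices, not prescriptions.

```python
import numpy as np

def sq_exp_cov(u1, u2, sigma2=1.0, lengthscale=0.1):
    # Squared-exponential covariance c(u, u') = s^2 exp(-(u-u')^2 / (2 l^2))
    d2 = (u1[:, None] - u2[None, :]) ** 2
    return sigma2 * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
l = 200
u = np.linspace(0.0, 1.0, l)

C = sq_exp_cov(u, u)                    # O(l^2) storage for the covariance
jitter = 1e-6 * np.eye(l)               # small diagonal for numerical stability
L = np.linalg.cholesky(C + jitter)      # O(l^3) time: the scalability bottleneck
g = L @ rng.standard_normal(l)          # one draw from the GP prior
```

Every posterior computation with an exact gp involves factorizations like the one above, which is what the random feature approximation discussed next avoids.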
Note that in this short presentation of gps we assume that the function is univariate, that is, g_exact(u) ∈ R. Extending this to multivariate functions g_exact(u) ∈ R^o is rather straightforward. What needs to be specified is a richer covariance structure, which is able to characterize the covariance between (g_exact)_r(u_i) and (g_exact)_s(u_j); see, e.g., Alvarez and Lawrence (2011) for an extensive treatment of these scenarios. When we assume a zero covariance across functions, we are effectively modeling each (g_exact)_r(u) as an independent gp. We are free to parameterize each gp separately, or to use a common covariance c(u_i, u_j), so that covariance parameters are shared across the o functions. In this paper we prefer the latter approach, to avoid introducing too many parameters, although we can easily incorporate more advanced modeling assumptions in our implementation.
For a large class of covariance functions, it is possible to show that draws from the gp prior are a linear combination of a possibly infinite number of basis functions with Gaussian-distributed weights (Neal, 1996; Rasmussen and Williams, 2006). This can be formulated for u ∈ D as an infinite sum

g_exact(u) = φ_∞(u)^⊤ w_∞,

with w_∞ an infinite-dimensional random vector with i.i.d. standard normal components, and φ_∞(u) the evaluation of an infinite set of basis functions at u. The exact covariance of g_exact is readily obtained as

c(u, u') = φ_∞(u)^⊤ φ_∞(u'),   ∀ u, u' ∈ D.

The infinite representation induced by the covariance function suggests a way to approximate gps by means of a finite-dimensional truncation of φ_∞(u), which we denote by φ(u) ∈ R^p, so that

c(u, u') ≈ φ(u)^⊤ φ(u').

The function g_exact(u) is then approximated by

g_exact(u) ≈ g(u) = φ(u)^⊤ w.   (2.5)

When using gps in modeling problems with l observations, the truncation has the advantage of avoiding the need to solve expensive algebraic operations with the covariance matrix. Instead, the truncation turns gps into generalized linear models. In order to retain the probabilistic flavor of gps, it is natural to treat these models in a Bayesian way, and this requires algebraic operations with matrices of size p × p (cost O(p^3)), while the complexity with respect to l is linear.
Random feature expansions (Rahimi and Recht, 2008; Lázaro-Gredilla et al., 2010) offer an elegant framework to construct a finite p-dimensional representation φ(u), whose components are referred to as the random features. As a working example, throughout this paper we consider the Gaussian covariance (or kernel) function, also known as the squared exponential or radial basis covariance function:

c(u, u') = σ^2 exp( -(1/2) (u - u')^⊤ A (u - u') ).   (2.6)

The symmetric positive definite matrix A controls the scaling and mixing of the inputs, whereas σ^2 controls the marginal variance of the gp. When the covariance is shift-invariant, it is possible to express the covariance as the Fourier transform of a positive measure (Rahimi and Recht, 2008). Applying the Fourier transform to the Gaussian covariance, and defining ι = √−1, we obtain:

c(u, u') = σ^2 ∫ exp( ι ω^⊤ (u - u') ) p(ω) dω,

which immediately suggests that the distribution p(ω) is also Gaussian, and it has the form p(ω) = N(0, A). By sampling from p(ω), we can approximate the integral in the Fourier formulation by means of Monte Carlo, thus obtaining a finite-dimensional representation of the covariance:

c(u, u') ≈ (σ^2 / (N_RF / 2)) Σ_{r=1}^{N_RF/2} exp( ι ω̄_r^⊤ u ) exp( -ι ω̄_r^⊤ u' ).

In this expression, we sampled N_RF/2 values of ω, denoted by ω̄_r, and exploited the property of shift invariance to split the complex exponential in two parts. While the basis functions are complex, by realizing that the left-hand side is a real number, with simple trigonometric manipulations of the right-hand side it is possible to express the previous equation in the equivalent form:

c(u, u') ≈ (2σ^2 / N_RF) Σ_{r=1}^{N_RF/2} [ sin(ω̄_r^⊤ u) sin(ω̄_r^⊤ u') + cos(ω̄_r^⊤ u) cos(ω̄_r^⊤ u') ].

Therefore, introducing φ : R^{N_RF} → R^{N_RF} as the element-wise application of sine (for the first N_RF/2 components) and cosine (for the last N_RF/2 components), the resulting basis functions are

σ √(2 / N_RF) φ( [Ω^⊤ u; Ω^⊤ u] ),   (2.10)

where Ω = [ω̄_1, . . . , ω̄_{N_RF/2}], and the sin and cos functions are applied element-wise to their argument. The basis functions are also called random features, because they are obtained by multiplying the inputs u_i by a random matrix Ω, followed by the application of a nonlinearity.
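A minimal NumPy sketch of this construction, under the assumption of a Gaussian covariance with a diagonal spectral matrix A (all names and values illustrative), shows the random-feature inner product approximating the exact covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rf, sigma2 = 2, 2000, 1.5
A = np.diag([4.0, 1.0])   # spectral covariance: p(omega) = N(0, A)

# Sample N_RF/2 frequencies and build the trigonometric features, cf. (2.10)
Omega = rng.multivariate_normal(np.zeros(d), A, size=n_rf // 2)  # (N_RF/2, d)

def phi(u):
    proj = Omega @ u                                   # (N_RF/2,)
    return np.sqrt(2.0 * sigma2 / n_rf) * np.concatenate(
        [np.sin(proj), np.cos(proj)])                  # (N_RF,)

u1, u2 = np.array([0.3, -0.1]), np.array([0.5, 0.2])

# Exact Gaussian covariance: c(u, u') = sigma^2 exp(-0.5 (u-u')^T A (u-u'))
tau = u1 - u2
c_exact = sigma2 * np.exp(-0.5 * tau @ A @ tau)

# Monte Carlo estimate via the random-feature inner product
c_approx = phi(u1) @ phi(u2)
```

Note that at u = u' the feature inner product reproduces the marginal variance σ^2 exactly, since sin^2 + cos^2 = 1; off the diagonal, the error shrinks as N_RF grows.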
Similar considerations can be made, for instance, for the Matérn covariance, where the ω's are sampled according to a multivariate Student-t distribution. See also Cho and Saul (2009) for an alternative derivation showing how the arc-cosine covariance of order one can be approximated in a similar fashion by sampling ω from p(ω) = N (0, A) and by employing a rectified linear unit nonlinearity (h(x) = x if x > 0 and h(x) = 0 otherwise), which is very popular in the literature of deep neural networks.

Deep Gaussian Processes and Random Feature Expansions
A Deep Gaussian Process (dgp) is defined as a composition of functions:

g(u) = ( g^{(L)} ∘ g^{(L-1)} ∘ · · · ∘ g^{(1)} )(u),

where each function g^{(i)}(·) is assigned a gp prior (Damianou and Lawrence, 2013; Neal, 1996). Again, for simplicity we focus our presentation on univariate gps, but we will discuss ways in which we can deal with multivariate gps shortly. The composition can be interpreted as a way to feed the output of g^{(i)} as input to g^{(i+1)}. In a parallel with deep neural networks, each gp can be thought of as a "layer". The composition operation makes dgp priors substantially different from gps; that is, the realizations of g^{(L)}(u) at U are no longer multivariate Gaussian in general. We refer the reader to Neal (1996); Duvenaud et al. (2014); Matthews et al. (2018) for some in-depth discussions and illustrations on the composition of gps.
The dgp prior induced by the choice of gp priors over the functions in the composition can be used as a prior over functions in statistical models. After choosing an appropriate likelihood function, one is usually interested in optimizing all model parameters, which include the covariance parameters of the gps at all layers, characterizing the posterior over g^{(L)}(u) at U, and making predictions at any u*. For dgps, these tasks are analytically intractable due to the nontrivial dependence introduced by the composition. Most of the literature on dgps extends approximations and inference techniques developed for "shallow" gps. For instance, Hensman and Lawrence (2014) and Salimbeni and Deisenroth (2017) extend the use of Nyström-type approximations (also known as inducing-point approximations) to dgps and carry out inference using variational techniques, whereas Bui et al. (2016) employs expectation propagation.
This work focuses on random feature expansions for dgps instead, which were proposed and studied in Gal and Ghahramani (2016); Cutajar et al. (2017). In this framework, each gp in the composition is approximated by means of random features, as shown in the previous section. With this approximation, each gp layer becomes a linear model with a given distribution over the weights. Denoting by a the input to layer (i), the gp at the i-th layer approximated with random features implements the following operations:

φ^{(i)}(a) = σ^{(i)} √(2 / N_RF) [ sin(Ω^{(i)} a)^⊤, cos(Ω^{(i)} a)^⊤ ]^⊤,    g^{(i)}(a) = φ^{(i)}(a)^⊤ w^{(i)}.

With this approximation, each gp layer can be seen as a two-layer neural network. The first layer implements a multiplication by a random matrix Ω^{(i)} and applies a nonlinearity by means of trigonometric functions. The second layer implements a linear combination of the inputs. Therefore, composing these approximate gps gives rise to a particular form of a Bayesian deep neural network. In traditional deep neural networks, nonlinearities are applied at each layer and all the weights are optimized; in the approximate dgp viewed as a Bayesian deep neural network, nonlinearities are applied every other layer, and only the w^{(i)} weights are inferred, whereas the Ω^{(i)} are random. See Cutajar et al. (2017) for an in-depth discussion on the connection with Bayesian deep neural networks and other ways in which the Ω^{(i)} can be treated in order to improve performance.
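As a sketch of these operations, the following NumPy code composes two random-feature gp layers, with weights w^{(i)} and frequencies Ω^{(i)} drawn from standard normal priors; all dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rf_layer(a, Omega, w, sigma2):
    # One GP layer approximated with random features:
    # g(a) = phi(a)^T w, with trigonometric features as in (2.10)
    n_rf = 2 * Omega.shape[0]
    proj = a @ Omega.T                                  # (n, N_RF/2)
    phi = np.sqrt(2.0 * sigma2 / n_rf) * np.hstack(
        [np.sin(proj), np.cos(proj)])                   # (n, N_RF)
    return phi @ w                                      # (n, d_out)

# A sample from a 2-layer DGP prior on 50 inputs
n, d_in, d_hidden, n_rf = 50, 3, 2, 100
X = rng.standard_normal((n, d_in))

Omega1 = rng.standard_normal((n_rf // 2, d_in))     # layer-1 frequencies
w1 = rng.standard_normal((n_rf, d_hidden))          # layer-1 linear weights
Omega2 = rng.standard_normal((n_rf // 2, d_hidden)) # layer-2 frequencies
w2 = rng.standard_normal((n_rf, 1))                 # layer-2 linear weights

h = rf_layer(X, Omega1, w1, sigma2=1.0)   # hidden "layer" output
f = rf_layer(h, Omega2, w2, sigma2=1.0)   # final univariate output
```

Everything above reduces to matrix products and element-wise nonlinearities, which is why this approximation maps well onto GPU hardware.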
The model approximation with random features bypasses the challenges of carrying out inference having to deal with the composition of gps, but the composition of the resulting linear models is still intractable from a Bayesian perspective due to the nonlinearities introduced by the basis functions. In the next section, we present variational inference as a way to derive a tractable and scalable inference scheme for dgps approximated with random features.

Stochastic Variational Inference
In this work, we make use of Variational Inference (vi) techniques to carry out inference over model parameters. We give a brief overview of vi here, and we will show how this is applied to the proposed calibration model in the next section. We consider a modeling problem for a set of l pairs of input/label observations, with inputs collected in U and labels in v. Imagine developing a statistical model with parameters Θ and with likelihood function p(v|U, Θ), and assume a prior p(Θ).
vi is useful when the posterior over Θ, that is p(Θ|v, U), is intractable. In vi, an approximation q(Θ) ∈ Q to the posterior is introduced, and the objective is to make it as close as possible to the actual posterior p(Θ|v, U). The standard way to do so in vi is to set up the following optimization problem:

q*(Θ) = argmin_{q(Θ) ∈ Q} D_KL[ q(Θ) ‖ p(Θ|v, U) ],   (2.13)

where D_KL is the Kullback-Leibler divergence, measuring how different the two distributions are. With simple manipulations, it is possible to show that an equivalent problem is the one of maximizing the following lower bound on the log-marginal likelihood with respect to q(Θ) (see, e.g., Jordan et al. 1999; Graves 2011; Blei et al. 2017):

log p(v|U) ≥ E_{q(Θ)}[ log p(v|U, Θ) ] − D_KL[ q(Θ) ‖ p(Θ) ].   (2.14)

In other words, a candidate q(Θ) providing the highest lower bound also minimizes the divergence to the exact posterior.
With an expression for the lower bound on the marginal likelihood, we can now attempt to maximize it with respect to q(Θ), which generally means optimizing it with respect to its parameters. Therefore, the family of distributions Q for the candidate approximation q(Θ) ∈ Q needs to be chosen before inference. While the approximation is constrained by the choice of the family Q, the complexity of Q impacts the complexity of the lower bound maximization. The trade-off between inference speed and the quality of the approximation is an active research domain. For instance, Rezende and Mohamed (2015) and Liu and Wang (2016) increase the expressiveness of the variational distribution while keeping inference tractable, at the cost of increasing the number of parameters of q(Θ).
The lower bound contains two terms: the first is a model fitting term, whereas the second is a regularization term which penalizes approximations that deviate too much from the prior. The D_KL term can be computed analytically when priors and approximate posteriors have particular forms. For example, when p_1(x) = N(μ_1, σ_1^2) and p_2(x) = N(μ_2, σ_2^2), the Kullback-Leibler divergence between the two has the form:

D_KL[ p_1 ‖ p_2 ] = log(σ_2 / σ_1) + (σ_1^2 + (μ_1 − μ_2)^2) / (2 σ_2^2) − 1/2.   (2.15)

The first term in the lower bound (2.14) depends on q(Θ) through the expectation of the log-likelihood. This complication is usually bypassed by employing stochastic optimization using Monte Carlo:

E_{q(Θ)}[ log p(v|U, Θ) ] ≈ (1 / N_MC) Σ_{k=1}^{N_MC} log p(v|U, Θ^{(k)}),   Θ^{(k)} ∼ q(Θ).   (2.16)

The Monte Carlo approximation is unbiased, and so is its derivative with respect to any of the parameters of q(Θ). This means that we can employ stochastic gradient optimization to adapt the parameters of q(Θ) to maximize the lower bound, with guarantees of reaching a local optimum of the objective (Robbins and Monro, 1951; Graves, 2011). The only precaution needed to make this viable is to reparameterize the samples from q(Θ) using the so-called reparameterization trick (Kingma and Welling, 2014). In its simplest form, assuming a fully factorized Gaussian posterior over all parameter components θ ∈ Θ, the expression

θ^{(k)} = μ_θ + ε̃^{(k)} σ_θ

separates out the stochastic (ε̃^{(k)} ∼ N(0, 1)) and deterministic (μ_θ and σ_θ) components in the way samples from the approximate posterior are generated. In this way, the lower bound can be differentiated with respect to the variational parameters μ_θ and σ_θ (with the ε̃^{(k)} variables fixed), and it is therefore possible to perform gradient-based optimization (Graves, 2011; Kingma and Welling, 2014; Cutajar et al., 2017). The gradients are stochastic because the ε̃^{(k)} variables are random, but the Monte Carlo estimate of the objective guarantees that the estimate of the gradient is unbiased, allowing for the use of stochastic gradient-based optimization (Robbins and Monro, 1951).
This is the reason why this implementation of vi is also referred to as stochastic variational inference (svi).
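The following NumPy sketch puts (2.14)-(2.16) together for a deliberately simple toy model (a Gaussian mean with a standard normal prior; all names and values are illustrative), using the reparameterization trick to generate samples from q:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: v_j ~ N(theta, 1), prior theta ~ N(0, 1)
v = rng.standard_normal(100) + 2.0

def log_lik(theta, v):
    return -0.5 * np.sum((v - theta) ** 2) - 0.5 * len(v) * np.log(2 * np.pi)

# Variational posterior q(theta) = N(mu, sigma^2), reparameterized as
# theta^(k) = mu + eps^(k) * sigma, with eps^(k) ~ N(0, 1)
mu, log_sigma = 0.5, np.log(0.3)
n_mc = 64
eps = rng.standard_normal(n_mc)
theta_samples = mu + eps * np.exp(log_sigma)

# Monte Carlo estimate of the expected log-likelihood term in (2.14)
fit_term = np.mean([log_lik(t, v) for t in theta_samples])

# Analytic KL between q = N(mu, sigma^2) and the N(0, 1) prior, cf. (2.15)
sigma2 = np.exp(2 * log_sigma)
kl = -log_sigma + 0.5 * (sigma2 + mu**2) - 0.5

elbo_estimate = fit_term - kl
```

In an autodiff framework, differentiating `elbo_estimate` with respect to `mu` and `log_sigma` (with `eps` held fixed) yields the unbiased stochastic gradients described above.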
Finally, it is worth noting that it is possible to considerably reduce the variance of the stochastic gradients, thus increasing the convergence speed of the optimization, by means of the so-called local reparameterization trick (Kingma et al., 2015). In this approach, instead of sampling ε̃^{(k)} for the weights, one samples directly from the distribution of the product between the inputs to a layer and samples from q(Θ); see Kingma et al. (2015) for more details.
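For a single linear layer with a factorized Gaussian posterior over its weights, the local reparameterization trick can be sketched as follows; dimensions and values are illustrative, and in practice the sampled outputs feed into the stochastic lower bound:

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_rf = 32, 100
Phi = rng.standard_normal((n, n_rf))    # random-feature design matrix

# Factorized Gaussian posterior over the weights of one layer
mu_w = 0.1 * rng.standard_normal(n_rf)
sigma_w = 0.05 * np.ones(n_rf)

# Standard reparameterization: sample the weights, then multiply
w = mu_w + sigma_w * rng.standard_normal(n_rf)
b_standard = Phi @ w

# Local reparameterization: sample the layer outputs directly from their
# induced Gaussian, b_j ~ N(Phi_j mu, Phi_j^2 sigma^2) with Phi_j the j-th row
b_mean = Phi @ mu_w
b_std = np.sqrt((Phi**2) @ sigma_w**2)
b_local = b_mean + b_std * rng.standard_normal(n)
```

Both estimators have the same mean, but sampling the outputs decorrelates the noise across data points, which is what lowers the gradient variance.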
Mini-Batch-Based Learning and Automatic Differentiation

Part of the huge success of deep learning is due to the exploitation of mini-batch-based optimization and automatic differentiation (Graves, 2011). The former enables scalability, as the model is updated by iteratively processing subsets of data. The latter, instead, allows one to tremendously simplify the implementation of complex models, as one has to implement only the objective function, and automatic differentiation takes care of computing its derivatives based on the graph of computations. Naïvely operating with mini-batches in gps ignores the covariance among observations, which is crucial for effective gp modeling.
The proposed gp and dgp approximation and svi allow us to exploit mini-batch-based optimization. Assuming that the likelihood factorizes across observations, p(v|Θ, U) = Π_{j=1}^{n} p(v_j|Θ, U), the terms within the approximation of the fitting term in the lower bound can be estimated without bias by selecting I, a set of m out of n indices (Graves, 2011):

E_{q(Θ)}[ log p(v|U, Θ) ] ≈ (n / m) Σ_{j∈I} (1 / N_MC) Σ_{k=1}^{N_MC} log p(v_j|U, Θ^{(k)}).   (2.17)

This approximation introduces an extra level of stochasticity in the optimization (without introducing bias), but it allows one to scale the inference of these models to virtually any number of observations; previous work has reported results on dgps for 10^7 observations with a single-machine implementation (Cutajar et al., 2017). Another important remark is that this approximation and inference approach for dgps can be implemented relying exclusively on matrix-matrix and matrix-vector products, which can be accelerated by using GPU-type hardware.
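The unbiasedness of the mini-batch rescaling in (2.17) can be checked numerically; in this sketch the per-observation log-likelihood terms are replaced by illustrative random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 1000, 50
# Stand-ins for the per-observation terms log p(v_j | U, Theta)
log_lik_terms = rng.standard_normal(n) - 3.0

full_sum = np.sum(log_lik_terms)

# Mini-batch estimate: rescale a random subset of m terms by n/m, cf. (2.17)
I = rng.choice(n, size=m, replace=False)
minibatch_estimate = (n / m) * np.sum(log_lik_terms[I])

# Averaging over many mini-batches recovers the full sum (unbiasedness)
estimates = [(n / m) * np.sum(log_lik_terms[rng.choice(n, m, replace=False)])
             for _ in range(2000)]
```

Each individual estimate is noisy, but its expectation equals the full sum, which is what stochastic gradient optimization requires.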

DGPs for Calibration of Computer Models
In this section we present our contribution, which we refer to as dgp-cal. We begin by observing that the koh calibration model can be seen as a special case of a dgp, and this allows us to generalize the original formulation of the koh model to more flexible ones. We then show how we can leverage the advances in approximation and inference for dgps presented in the previous sections, namely random feature expansions and stochastic variational inference, in order to obtain a scalable framework for calibration, while retaining the flexibility offered by the use of dgps. We conclude the section by discussing implementation details.

Generalization of the KOH Calibration Model as a DGP
The original formulation of the koh calibration model involves the use of gps to emulate the computer model and to model the additive discrepancy. As pointed out by Kennedy and O'Hagan (2001), additive discrepancy is somewhat specific and it can be generalized (see, e.g., Qian and Wu 2008). We propose to do so by assuming that the function underlying the real process is obtained by a warping function γ applied to the emulator:

f(x, t) = γ( η(x, t), x ).   (3.1)

We retrieve the koh formulation (2.1) when the warping applies the identity to η(x, t) and adds it to a gp on x. In other words, we can think of the function f(x, t) in the koh model as a composition of two functions; the first is a gp that, given the inputs x and t, yields η(x, t), whereas the second applies a linear combination of η and another gp δ(x) with fixed unit weights.
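A tiny sketch (with illustrative stand-ins for η and δ) makes this relationship explicit: choosing a warping that adds a function of x to its first argument recovers the additive koh model exactly.

```python
import numpy as np

# Hypothetical stand-ins for the emulator eta and a discrepancy-like function
def eta(x, t):
    return np.sin(t * x)

def delta(x):
    return 0.1 * x**2

# Generalized model: f(x, t) = gamma(eta(x, t), x), for a generic warping gamma
def gamma_additive(e, x):
    # The identity-plus-discrepancy warping recovers the KOH model (2.1)
    return e + delta(x)

x, t = 0.4, 1.5
f_koh = eta(x, t) + delta(x)                # additive KOH formulation
f_warp = gamma_additive(eta(x, t), x)       # same quantity via the warping
```

In general, γ need not be additive: assigning it a gp (or dgp) prior lets the data determine how the emulator output is distorted as a function of x.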
In the original formulation of the koh model, the discrepancy between the computer code and the real process is modeled through the additive discrepancy term δ(x). In the proposed generalization of the koh model, instead, we assume a gp prior over γ (η(x, t), x), so that we apply a warping as a function of x. Similarly to the original koh model, the analysis of the warping function γ (η(x, t), x) allows one to reason about the discrepancy between the computer model and the real process.
A further possible extension, which we implement in our work, is to increase model flexibility by letting η(x, t) and/or γ(η(x, t), x) be modeled as dgps instead of gps. The deep extension is particularly useful when the emulator or the real process exhibits a space-dependent behavior that is difficult to model by designing appropriate covariance functions. dgps offer a way to learn such nonstationarities from data, which is particularly appealing in such challenging applications. We will give illustrations of dgps and the generalized formulation in the experiments. Thanks to the possibility of employing random feature approximations of the dgps in the generalized model, we can obtain a scalable framework for calibration, as discussed next.

Model Approximation Using Random Features
In this section, we discuss how to employ the random feature approximation to make the koh model and its generalization suitable for variational inference. We start from the koh model, assuming that η(x, t) and δ(x) are modeled as gps with the Gaussian covariance in (2.6); we denote their anisotropy matrices by A η and A δ and their marginal variances σ η 2 and σ δ 2 , respectively. By applying the random feature expansion detailed in Section 2.2 we obtain η(x, t) ≈ φ η (x, t)^T w η , (3.2) and δ(x) ≈ φ δ (x)^T w δ . (3.3) The feature maps φ η and φ δ use the functions φ : R NRF → R NRF given in (2.10). The elements of w η and w δ , of size N RF , have i.i.d. standard normal priors, whereas the matrices Ω η , Ω δ , of size N RF × (d 1 + d 2 ) and N RF × d 1 , have i.i.d. normal rows, with covariance dependent on the positive definite matrices A η and A δ ; in particular, the rows of Ω η and Ω δ are i.i.d. N (0, A η ) and N (0, A δ ), respectively. Figure 2 represents the model (using a neural network-like diagram) according to (2.1), (3.2), and (3.3).
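The random feature expansion can be checked numerically against the exact kernel. The sketch below assumes a Gaussian covariance of the form σ² exp(−½ (x − x')ᵀ A (x − x')) with the anisotropy matrix A entering the exponent directly (the exact parameterization of (2.6) may differ, e.g., via A⁻¹); with frequencies drawn i.i.d. N(0, A), the inner product of the feature maps converges to the kernel.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n_rf, sigma2 = 2, 5000, 1.0
A = np.array([[1.0, 0.3], [0.3, 2.0]])   # positive-definite anisotropy matrix

def kernel(X):
    # Exact Gaussian covariance: sigma2 * exp(-0.5 (x - x')^T A (x - x'))
    D = X[:, None, :] - X[None, :, :]
    return sigma2 * np.exp(-0.5 * np.einsum('ijk,kl,ijl->ij', D, A, D))

# Random feature map: rows of Omega are drawn i.i.d. from N(0, A).
Omega = rng.multivariate_normal(np.zeros(d), A, size=n_rf)

def phi(X):
    P = X @ Omega.T
    return np.sqrt(sigma2 / n_rf) * np.concatenate([np.cos(P), np.sin(P)], axis=1)

X = rng.normal(size=(5, d))
K_exact = kernel(X)
K_rff = phi(X) @ phi(X).T     # Monte Carlo approximation of K_exact
```

With 5000 features the elementwise error is of order 1/√N_RF; in the calibration model the weights w multiplying these features carry the Gaussian prior.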
It is straightforward to extend this formulation to model η(x, t) and δ(x) with dgps instead of gps, by applying the random feature expansion to each gp layer. Assuming that the dgps have L η and L δ layers, respectively, η(x, t) and δ(x) are approximated by compositions of random feature expansions, turning each dgp into a Bayesian deep neural network where each gp layer is approximated as in (3.2) and (3.3). Similarly, we can construct a random feature approximation for a dgp modeling δ(x) in the koh model, or apply the same construction to the warping function γ (η(x, t), x) in the generalized version of the koh model. The advantage of the proposed approximation using random features is that it enables the use of the stochastic variational inference presented in Section 3.3, as discussed next.

Figure 2: (3.2) and (3.3) formulate random feature expansions for η(x, t) and δ(x).

Inference of the Approximate DGP Calibration Model
Again, we focus on the formulation of the koh model with the discrepancy term δ(x) instead of the warping γ(η(x, t), x), but it is easy to follow the same derivation for this latter case. We make use of variational inference (vi), so we approximate the posterior distribution over all model parameters w η , w δ , and θ by introducing a variational posterior q(w η , w δ , θ); see Section 2.4. We assume q(w η , w δ , θ) to be Gaussian and completely factorized across parameters, that is, q(w η , w δ , θ) = ∏ i q(w η,i ) ∏ i q(w δ,i ) ∏ i q(θ i ), although the factorization assumption can be relaxed. With the assumption that q(w η , w δ , θ) factorizes across parameters, we deal with computations and storage which are linear in the number of model parameters. Relaxing the factorization means assuming that certain groups of parameters have nonzero covariance. For example, one could assume q(w η,·j ) = N (μ η,j , Σ η,j ) and parameterize Σ η,j = L η,j L η,j ^T to preserve positive-definiteness by optimizing L η,j . It is easy to verify that this choice requires O(N RF 2 ) storage and O(N RF 3 ) computations. Following the principles of vi, we need to derive a lower bound on the marginal likelihood of the model and maximize it with respect to the distribution q(w η , w δ , θ). In practice, this problem turns into the optimization of the lower bound with respect to the parameters that govern q(w η , w δ , θ). Taking the lower bound (2.14) and its unbiased Monte Carlo and mini-batch-based approximation (2.17), we now adapt this to our calibration model. In this adaptation we realize that there are two types of input points, namely observations X and computer runs [X * , T ]; note that the shapes of X, X * , and T do not allow a concatenation of these three matrices into a single matrix. One possibility is to apply mini-batching to the union of the data sets. However, we do not recommend this procedure without ensuring that each category ("Observations" versus "Runs") gets sufficiently represented.
For example, a uniform sampling blind to the category of the input points, with n ≪ N, would make the number of sampled observations vary a lot from one iteration to the next and may sometimes sample none of them. A workaround is to draw two index sets, I ⊂ {1, . . . , n} and J ⊂ {1, . . . , N }, of sizes m and M . The fitting term of the lower bound can now be approximated as E := E q(w η ,w δ ,θ) [log p(y, z|w η , w δ , θ, U)] ≈ (n/m) ∑ i∈I E q(w η ,w δ ,θ) [log p(y i |w η , w δ , θ)] + (N/M) ∑ j∈J E q(w η ,θ) [log p(z j |w η , θ)]. (3.6) The regularization term can be easily calculated when both priors and posteriors are Gaussian using (2.15).
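The stratified mini-batch estimator can be sketched as follows. The log-likelihood values here are arbitrary stand-ins; the point is that drawing separate index sets for the two categories and rescaling each partial sum keeps the estimator unbiased for the full fitting term.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical expected log-likelihood terms for n observations and N runs.
n, N = 40, 1000
loglik_y = rng.normal(size=n)
loglik_z = rng.normal(size=N)

def fitting_term(m, M):
    """One stochastic estimate of the fitting term: draw separate index sets
    I (observations) and J (runs) so both categories are always represented,
    then rescale each sum by n/m and N/M to preserve unbiasedness."""
    I = rng.choice(n, size=m, replace=False)
    J = rng.choice(N, size=M, replace=False)
    return (n / m) * loglik_y[I].sum() + (N / M) * loglik_z[J].sum()
```

Averaging many such estimates recovers the full sum of log-likelihood terms, which is what makes stochastic optimization of the lower bound valid.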

Implementation Details
Considering the large number of parameters to optimize, the optimization procedure is divided into stages. We first focus on the computer model response; all parameters are fixed except the ones influencing the prediction of z, i.e., σ z , the means and variances of the components of w η , and the gp/dgp parameters of η(x, t). In the second stage, all other parameters are freed for inferring y and θ jointly, i.e., adding the weights and hyperparameters of δ(x) or γ (η(x, t), x). Within each stage, we first optimize the means and variances of θ, and then all parameters jointly with a smaller learning rate.
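The staged schedule amounts to running gradient steps while only updating the currently "free" parameter groups. The toy code below illustrates this freezing mechanism on a hypothetical quadratic loss; the group names and loss are illustrative, not the actual model.

```python
import numpy as np

# Toy illustration of staged optimization: minimize a quadratic loss in two
# parameter groups, updating only the groups listed as free in each stage.
params = {"emulator": np.array([2.0, -1.0]), "calibration": np.array([3.0])}

def grad(p):
    # gradient of 0.5*||emulator||^2 + 0.5*||calibration - 1||^2 (hypothetical)
    return {"emulator": p["emulator"], "calibration": p["calibration"] - 1.0}

def run_stage(p, free, lr, steps):
    for _ in range(steps):
        g = grad(p)
        for name in free:                 # frozen groups are left untouched
            p[name] = p[name] - lr * g[name]
    return p

# Stage 1: fit the emulator parameters only; stage 2: free everything,
# with a smaller learning rate, mirroring the schedule described above.
params = run_stage(params, free=["emulator"], lr=0.1, steps=200)
params = run_stage(params, free=["emulator", "calibration"], lr=0.05, steps=400)
```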
For initializing the weights of the dgp (or gp) η(x, t), we use the methodology proposed by Rossi et al. (2019), which is a scalable, layer-wise initialization strategy based on Bayesian linear regression. The idea is to perform a series of regression tasks mapping each layer to the labels, so that sensible initial parameters for the variational distribution can be obtained. Starting from the first layer, we perform Bayesian linear regression from the inputs X * and T to the labels z, so that we can estimate the posterior over the parameters at the first layer. We then freeze this distribution and compute the output of the first layer with the given input data X * and T . This result is then used as the input to the second layer, for which the corresponding linear model is estimated by performing regression to the labels z. We then proceed iteratively up to the last layer. Intuitively, this initialization promotes configurations where the first layers are already capable of obtaining sensible regression results, while the layers closer to the labels serve as refinements.
We define the routine μ, Σ ← LM(X LM , y LM ), which performs Bayesian linear regression on a set of input-output pairs X LM , y LM . The routine returns the mean and the covariance of the posterior distribution over the weights w in the instrumental regression y LM := z = X LM w + σ η ε, where σ η 2 is the variance of the likelihood function, and the components of w (a priori) and of ε are i.i.d. N (0, 1). With these definitions, the initialization is reported in Algorithm 1. The posterior distribution in Bayesian linear regression has full covariance in general, whereas the assumed posterior is fully factorized. In this case, we match the optimal factorized Gaussian distribution to the actual posterior using the Kullback-Leibler divergence, which explains the assignment to q(W (ℓ) η,:,i ) in the last line of Algorithm 1. The procedure can easily exploit mini-batching, as reported in Algorithm 1, and it can operate with stochastic optimization, thus making it suitable for large-scale problems. We refer the reader to Rossi et al. (2019) for more details.
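A minimal sketch of the LM routine under the stated assumptions (standard normal prior on w, Gaussian likelihood with variance sigma2) is:

```python
import numpy as np

def LM(X_lm, y_lm, sigma2=1.0):
    """Sketch of the Bayesian linear regression routine: prior w ~ N(0, I),
    likelihood y = X w + eps with eps ~ N(0, sigma2 I). Returns the posterior
    mean and covariance of w; sigma2 plays the role of the likelihood variance."""
    d = X_lm.shape[1]
    # Posterior precision = prior precision + scaled Gram matrix.
    Sigma = np.linalg.inv(np.eye(d) + X_lm.T @ X_lm / sigma2)
    mu = Sigma @ X_lm.T @ y_lm / sigma2
    return mu, Sigma

# For the factorized variational posterior, only the mean and the matched
# diagonal variances of this posterior would be retained.
X_lm = np.array([[1.0], [1.0]])
y_lm = np.array([1.0, 1.0])
mu, Sigma = LM(X_lm, y_lm)    # closed form here: mu = [2/3], Sigma = [[1/3]]
```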

Experiments
In this section we validate dgp-cal on a number of calibration problems. In each experiment, we specify whether dgp-cal uses the additive structure in (2.1), as in the koh formulation, or the general one in (3.1); we also specify when the model is tested with dgps instead of gps (the default). The experiments have the following setup. The likelihoods p(y i |f i ) and p(z j |η * j ) are Gaussian with variances σ 2 y and σ 2 z treated as hyperparameters within ψ. All covariance functions of η(x, t) and δ(x) are Gaussian, except for the comparative experiment in Section 4.2 where a Matérn kernel is used. The variational posteriors q(W η )q(W δ )q(θ) and the prior p(θ) are Gaussian.
Competing Methods A large literature is devoted to the practicalities of numerically challenging applications. Gramacy et al. (2015) use local approximate gp modeling and calibrate parameters by solving a derivative-free maximization of a likelihood term. Pratola and Higdon (2016) handle large problems using a Bayesian sum-of-trees regression for modeling data from the computer model and the real process jointly. More recently, in Gu and Wang (2018) and Gu (2018), calibration is performed within a Bayesian framework by defining a prior distribution directly on the L 2 norm of the discrepancy. Xie and Xu (2018) sample from the posterior distribution over calibration parameters by minimizing the L 2 norm of a sample path of the discrepancy (similarly to Wong et al. 2017). These authors provided easy-to-use code in R or C++ packages. We will refer to these methods as LaGP, Sum-of-trees, Robust, and Projected, respectively.

Algorithm 1: Initialization for the approximate posterior distribution over η. Input: a dgp emulator η(x, t) = g (L η ) (. . . (g (1) (x, t))), with the variational distributions of the weights set to the prior N (0, 1) i.i.d. for all components; computer runs X * , T with outputs z organized in mini-batches of size M . Result: a distribution q(w η ) to start the lower bound maximization with. For each layer index ℓ and each output component i of g (ℓ) : propagate the mini-batch through a sample path of η up to layer ℓ, save the result as X LM , and set q(W (ℓ) η,:,i ) from μ, Σ ← LM(X LM , z).

Illustrative Example
We illustrate dgp-cal on a calibration problem with one variable and one calibration input. As a first test, the prior and hyperparameters used to generate the data set are assumed to be known, with θ ∼ N (0, 1), σ η = 1, A η = (1/2) I, σ δ = 2/10, A δ = 1/20. We choose locations for N = 7 computer runs and n = 4 observations from the real process in a space-filling manner in [0, 1] × [−5/2, 5/2]. The output vector z of the computer model at (x * i ) i=1,...,N is sampled from its prior distribution. In order to determine the real observations y, we first sample p(θ) to get θ true and then a sample path of δ(x). The gp priors of η(x, t) and δ(x) are approximated with N RF = 50 random features through (3.2) and (3.3), and the observations are computed using (2.1). The results of dgp-cal are displayed in Figure 3. In the first and the third panels, we see that the posterior of θ, obtained analytically by integrating out w η and w δ , has its mass concentrated around the true value ≈ 0.8, where there is a (color) match between z (the dots) and y (the lines). The variational posterior (blue line) offers a reasonable approximation of the true posterior.

Model Calibration in Cell Biology
We apply dgp-cal to a biological application, which has been previously studied in Plumlee (2017) and Xie and Xu (2018). The output is the normalized current through ion channels of cardiac cells needed to maintain the membrane potential at −35 mV. The input variable x is the logarithm of the experiment time rescaled to D 1 = [0, 1]. The calibration inputs θ ∈ D 2 = [0, 10] 3 control a mathematical model η cell (·, θ) of the phenomenon proposed by Clancy and Rudy (1999). Here it is considered to be an expensive black box with N = 300 runs available, whereas the number of observations is n = 19. The runs are located in a space filling manner in D 1 × D 2 (Latin hypercube sampling optimized with maximin distance criterion).
We compare dgp-cal with additive and general discrepancy against four competitors. The method "L 2 " is a simple minimization over θ of the L 2 residual error ||y − η̂ cell (X, θ)||, where η̂ cell is a surrogate of η cell given X * , T , and z. Its minimization takes 30 seconds and the residual error is 1.31. This method is good for predicting observations from the real process, but it provides no quantification of uncertainty.
In Table 1, we report the mean squared error (mse), E q(θ) (||y − η cell (X, θ)|| 2 ), where q represents the estimated posterior density of θ. All tuning parameters of the codes are left to default values, and all methods are run on the same machine to ensure some fairness in reporting running times (a laptop with 4 × 2.50 GHz cores). We see that the mse values obtained by the methods Projected, dgp-cal, and Robust are significantly lower than the mse of koh. The proposed dgp-cal is the fastest. The version of dgp-cal with general discrepancy performs slightly better, thanks to the relaxation of the hypothesis of purely additive discrepancy.
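The reported metric is a simple expectation under the posterior of θ, which can be estimated by Monte Carlo. The sketch below uses a toy emulator and a Gaussian stand-in for q(θ); neither is the actual cell model or posterior.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo estimate of E_{q(theta)} ||y - eta(X, theta)||^2 on a toy problem.
def eta_toy(X, theta):
    return np.sin(X + theta)               # hypothetical emulator

X = np.linspace(0.0, 1.0, 19)              # 19 inputs, as in the cell-biology data
theta_true = 0.8
y = eta_toy(X, theta_true)                 # noiseless toy observations

theta_samples = rng.normal(theta_true, 0.05, size=2000)   # stand-in for q(theta)
mse = np.mean([np.sum((y - eta_toy(X, t)) ** 2) for t in theta_samples])
```

A posterior concentrated near the best-fitting θ drives this quantity down, which is why the mse doubles as a sanity check on the calibration.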
The posterior distributions over θ obtained by the calibration methods we tested are reported in Figure 4. All methods yield a distribution concentrated around the L 2 minimizer (the red dot). However, the distributions are clearly not similar to each other (except for Robust and dgp-cal). This could be explained by differences in the model formulations. Although we ensured that the covariance and mean functions are the same for all competing methods (Matérn with smoothness 5/2 and constant mean), there are several differences that cannot be matched. For instance, Robust has an additional step in the hierarchy of priors concerning the L 2 norm of the discrepancy. Moreover, the definition of the calibration parameters θ itself differs among methods. In Projected, θ is a minimizer of a given stochastic process, while other methods follow the koh definition. Also, Robust performs a fully Bayesian inference including hyperparameters, while in the other methods, including ours, they are optimized. It would be straightforward to allow for a Bayesian treatment of the hyperparameters in dgp-cal, but we leave this for future work.
To visualize the results of the calibration process, in Figure 5 we overlay the observations from the real process with the responses of the computer model η cell (·, θ) when θ is sampled from its posterior distribution. All the probabilistic methods present a good fit while allowing for quantification of uncertainty in the predictions, with larger uncertainty for models that account for the uncertainty in the hyperparameters.
We see how the computer model output η is warped by γ in dgp-cal with general discrepancy (3.1). In Figure 6 we display the expected derivative of the warping with respect to the computer model output, i.e., E[∂γ(·, x)/∂η], for three values of x. As the estimated values oscillate around one for every x ∈ D 1 , this model confirms that an additive discrepancy is a sensible assumption. When the estimated γ(·, x) is exactly the identity, the general discrepancy boils down to an additive one. This figure also shows how the model with general discrepancy can adapt to data sets with space-dependent behavior. Indeed, in this test case the values of η have a very different distribution according to x. If x is around 0.2, the distribution of the computer runs is very asymmetric, with a heavy tail for high values and a very light tail for low values (see the grey dots in Figure 5). On the other hand, for higher values of x, say higher than 0.6, the distribution of z is more symmetric and looks closer to a Gaussian distribution. This corresponds to the warping observed in Figure 6, where the output of η is warped and concentrated asymmetrically toward lower values for x = 0.2, while for x = 0.6 or 1 its Gaussian output is left almost untouched.

Model Calibration for Complex Response
We deal now with a case-study with locally nonsmooth/nonstationary response of the computer model, for which a stationary gp is generally inadequate. The computer model is a simulator of the effects of underground nuclear tests on radionuclide diffusion into aquifers at the Yucca Flats in the United States (Fenelon, 2005). We take the same data set as generated by a script in the supplementary material of Pratola and Higdon (2016), which is available online, with d 1 = 2, d 2 = 6, n = 10 and N = 17600.
In Pratola and Higdon (2016), the size of the dataset as well as nonstationary modeling is handled with a sum-of-trees regression. We carry out calibration using dgp-cal with a two-layer dgp emulator for the computer model to showcase the ability of a more complex emulator to capture the nonstationarity that characterizes this problem. We therefore compare dgp-cal with a shallow gp emulator. Details about the initialization can be found in Table 2. Furthermore, we compare against the modularized method with Local Approximate gps (lagp) of Gramacy et al. (2015).
In Figure 7, we display the posterior over the function f (·, θ) modeling the real observations. We observe that only the deep variational calibration and the sum-of-trees approach manage to reproduce the nonstationary nature of the data set by capturing the spike characterizing one observation.

Table 2: Initialization parameters for the shallow and deep models.
Model Calibration on a Large Data Set

We build a shallow dgp-cal model with additive discrepancy as described by (2.1), (3.2), and (3.3). Indeed, as the response surface of the borehole function is smooth, we consider "shallow" gps for η(x, θ) and δ(x), with Gaussian covariance approximated by 100 random features.
Concerning the sum-of-trees calibration, a sensible budget would be to set 2000 posterior samples plus 10000 for burn-in, with 1000 tree cut-points. However, this corresponds to one month of computation on our computers, so we divided the sampling budget by 5, and set 100 cut-points, keeping all other default parameters untouched.
We did not compare with the modularized calibration using lagp, as the current implementation in R does not support large amounts of real observations. This does not question the relevance of the method, which could be modified by using a scalable gp for the discrepancy.

Table 3: Results of calibration on a large data set.
We evaluate the performance by comparing the posterior over θ with the truth (Figure 8) and by evaluating the mse between the computer model and observations from the real process, E q(θ) (||y − η bh (X, θ)|| 2 ) (Table 3). dgp-cal provides the best performance both in retrieving θ and in mse, and it is the fastest by far.

Figure 8: Posterior distribution of θ on a large data set (black: truth, blue: dgp-cal, green: sum-of-trees).

Discussion
The koh model and inference in Kennedy and O'Hagan (2001) offers a classical framework to tackle calibration problems where quantification of uncertainty is of primary interest. In this paper we proposed dgp-cal, which offers a number of improvements over the koh calibration. From the modeling perspective, we cast the koh calibration model as a special case of a more general dgp model, where the latent process modeling the real observations is a warped version of the emulator of the computer model; we showed that this general calibration model retains the possibility to reason about uncertainty in the discrepancy between the computer model and the real process.
Furthermore, the proposed approximation of gps and dgps with random features and approximate inference through variational techniques give dgp-cal a number of advantages, such as simplicity of implementation in development environments featuring automatic differentiation and the possibility to exploit GPU-type hardware. The experiments showed that the approximations introduced to recover tractability do not affect the ability of dgp-cal to effectively calibrate parameters of complex computer models, while enjoying scalability to large numbers of observations and/or computer runs, demonstrating that dgp-cal is a powerful alternative to the state-of-the-art. We are currently investigating the application of dgp-cal to other large-scale calibration problems in environmental sciences, where the koh model and related calibration methodologies are usually not the preferred choice due to limited scalability.

Figure 9: Calibration error as a function of the number of runs.
We conclude the paper by discussing two important aspects of this work, namely adaptive experimental design and identifiability.

Adaptive Experimental Design
Building up the design sequentially may allow smaller training sets, limiting the evaluations of the simulator code (see, e.g., Sauer et al. 2021). It is possible to extend dgp-cal to handle cases where the uncertainty in the model can be used to guide the incremental design of the experiment. For simplicity, we assume η(x, t) in the koh model to be modeled as a gp. The goal is to improve calibration by sequentially optimizing the locations of the inputs of the computer model (X * and T ). More precisely, instead of determining the N locations from the outset, N add points are added at each of n it iterations to an initial design of N 0 < N experiments. At each iteration, the model is inferred on the data integrating the new batch of N add inputs and outputs. The batch of points [X * ca , T ca ] with corresponding labels z ca added to the design is determined as the solution to the optimization of a criterion. We can specify a criterion such that the evaluation of η at the best candidate points maximally reduces the variance of θ. Many sampling criteria can be imagined and tested; for example, one can choose the sum of the partial derivatives of the lower bound with respect to the (logarithm of the) variance of the components of θ: C(X * ca , T ca ) = ∑ i ∂L ca /∂ log ξ i , (5.1) where ξ i is the parameter of the variational distribution controlling the variance of θ i . The function L ca returns the lower bound in (2.14) after updating the variational distribution with candidate input points.
In the case of gps with random features, the update of the variational distribution can be computed analytically (see, e.g., Section 2.3.3 of Bishop (2006) for a derivation), obtaining updated mean and covariance Σ = (Σ 0 −1 + σ z −2 Φ η,ca ^T Φ η,ca ) −1 and μ = Σ (Σ 0 −1 μ 0 + σ z −2 Φ η,ca ^T z ca ), with Φ η,ca ∈ R N ca ×N RF the row-by-row evaluation of φ η (x, t) on X * ca and T ca , where μ 0 and Σ 0 are the mean and the covariance of the variational distribution before selecting the candidate points, and σ z 2 > 0 corresponds to the noise term of the likelihood function. The criterion forecasts the effect of evaluating η at [X * ca , T ca ] using the predictive mean z ca = E(η(X * ca , T ca )|U, ψ, Ω). Therefore, the adaptive sampling chooses new input locations which maximally reduce the variance of the approximate posterior over θ.
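This conjugate Gaussian update can be sketched directly. In the code below, sizes and the feature matrix are illustrative; note that because the pseudo-label z ca is the predictive mean, the updated mean coincides with the old one while the covariance shrinks, which is exactly the variance-reduction effect the criterion forecasts.

```python
import numpy as np

rng = np.random.default_rng(4)

# Conjugate Gaussian update of the variational distribution over the random
# feature weights after adding candidate points (sizes are illustrative).
n_rf, n_ca, sigma2_z = 30, 8, 0.01
mu0 = rng.normal(size=n_rf)
Sigma0 = np.eye(n_rf)                      # variational covariance before update
Phi_ca = rng.normal(size=(n_ca, n_rf))     # stand-in for phi_eta on [X*_ca, T_ca]
z_ca = Phi_ca @ mu0                        # predictive mean used as pseudo-label

Sigma0_inv = np.linalg.inv(Sigma0)
Sigma = np.linalg.inv(Sigma0_inv + Phi_ca.T @ Phi_ca / sigma2_z)
mu = Sigma @ (Sigma0_inv @ mu0 + Phi_ca.T @ z_ca / sigma2_z)
```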
As an illustration of this approach, we perform adaptive sampling on the borehole function in (4.1), starting from N 0 = 100 points and adding batches of N add = 20 until we reach N = 400. The initial design is obtained by a Latin hypercube sampling with optimized maximin distance. The sampling criterion in (5.1) is optimized with gradient descent using automatic differentiation with respect to X * ca and T ca . We compare against Latin hypercube sampling with optimization of the maximin distance, which is a classic space-filling design, for N = 100, 200, 300, and 400. The experiment is reproduced five times with different initial and space-filling designs, and we report calibration errors ||E q(θ) (θ) − θ|| with standard deviations in Figure 9. The designs produced by the adaptive method lead to consistently lower calibration error due to the optimized locations of the computer runs.

Identifiability
The issue of identifiability that we discussed for the koh model affects our formulation too. In particular, for the generalized formulation, where the discrepancy is not modeled through an additive component but through a composition with an unknown function γ(η(x, t), x), our approach yields a posterior over θ and γ(η(x, t), x) which concentrates on the manifold M = {(t, γ(η(x, t), x)) | γ(η(x, t), x) = f (x, t) with x ∈ X ∪ X * } as the number of observations increases. Ignoring the discrepancy by removing the composition of the warping function with η(x, t) would not constitute a problem from the implementation perspective, and our approach would result in a fast and flexible calibration model where θ is identifiable. Similarly to the koh model, however, ignoring the discrepancy results in a model that could potentially yield a biased estimation of the parameters θ, depending on how accurately η(x, t) models the physical process.
The composition with a discrepancy function is therefore needed to avoid bias in the estimate of θ. As a result, priors over θ and the model discrepancy are as important for mitigating the lack of identifiability as they are in the koh model. This can dictate the choice of gp versus dgp priors. For instance, in the radionuclide diffusion test case (Section 4.3), a preview of the erratic observations hints in favor of a dgp model. On the contrary, in Section 4.4, knowing that the borehole function is smooth, the computer model and the discrepancy were given gp priors. Beyond these general considerations, there exist works that allow one to impose constraints on dgp priors, such as positivity, monotonicity, convexity, or being solutions to differential equations. In Lorenzi and Filippone (2018), this is done within the same framework proposed here, involving random feature approximations and variational inference, so it would be rather straightforward to incorporate these constraints into our model and inference.
Other ideas proposed in the literature on how to deal with the lack of identifiability could easily be adapted to our framework. For example, when multiple responses are available and they are mutually dependent on the same set of calibration parameters θ, this improves identifiability (Arendt et al., 2012), and accommodating for multiple responses in our framework is straightforward. Furthermore, it would be possible to apply similar ideas to the ones proposed in Tuo and Wu (2016); Wong et al. (2017) where gp priors are replaced by dgps.
Another exciting line of research, which we could benefit from in order to investigate the identifiability properties of dgp-cal, follows recent results on so-called disentanglement (Locatello et al., 2019). This literature studies the properties of deep models in the context of density estimation, through the use of variational autoencoders (Kingma and Welling, 2014). Autoencoders are models composed of two elements: an encoder and a decoder. The encoder maps a set of observations into a low-dimensional latent representation, whereas the decoder maps the latent variables into the observations. In variational autoencoders, one is interested in obtaining a distribution over the latent variables by means of a variational formulation. Recently, a lot of interest has been devoted to disentangled representations, which are those where the latent variables are independently controlling different generative factors, and their number is sufficient to capture the diversity observed in the observations. A popular example is the one where observations are images of a ball of different colors and sizes, and with varying color of the background; the three generating factors are color and size of the ball, and color of the background, and one hopes to be able to infer this without any prior knowledge and simply by analyzing the given images. Locatello et al. (2019) made a theoretical breakthrough showing that this is impossible unless some extra information on these generating factors is available. Following this negative result, Khemakhem et al. (2020) provided theoretical guarantees on how to recover identifiability of these latent generating factors by means of a factorized prior distribution over the latent variables that is conditioned on some additional observed variables. While the setup is different to calibration, there are many similarities with our model in that they both use deep models and variational inference. 
Our calibration model can be seen as the decoder of such variational autoencoders, and θ in our model can be seen as the latent generative factors, without any encoder. We believe that this is an interesting direction to establish novel results and insights on the identifiability of the proposed calibration model.