Eﬃcient Emulation of Computer Models Utilising Multiple Known Boundaries of Diﬀering Dimension

. Emulation has been successfully applied across a wide variety of sci-entiﬁc disciplines for eﬃciently analysing computationally intensive models. We develop known boundary emulation strategies which utilise the fact that, for many computer models, there exist hyperplanes in the input parameter space for which the model output can be evaluated far more eﬃciently, whether this be analytically or just signiﬁcantly faster using a more eﬃcient and simpler numerical solver. The information contained on these known hyperplanes, or boundaries, can be incorporated into the emulation process via analytical update, thus involving no additional computational cost. In this article, we show that such analytical updates are available for multiple boundaries of various dimensions. We subsequently demonstrate which conﬁgurations of boundaries such analytical updates are available for, in particular by presenting a set of conditions that such a set of boundaries must satisfy. We demonstrate the powerful computational advantages of the known boundary emulation techniques developed on both an illustrative low-dimensional simulated example and a scientiﬁcally relevant and high-dimensional systems biology model of hormonal crosstalk in the roots of an Arabidopsis plant.


Introduction
Computer models, or simulators, have been used across a wide range of disciplines to help understand the behavioural dynamics of physical systems.Such computer models are often high-dimensional, due to them possessing large numbers of input parameters, and take a substantial amount of time to evaluate.As a result, performing a full uncertainty analysis of model behaviour -a critical part of any scientific study that requires model evaluations at a vast number of parameter combinations -may be unfeasible.For this reason, emulators are frequently used as fast statistical approximations to computer model output, providing a predicted value at any input and a corresponding measure of uncertainty, given that the model has been evaluated for a set of training inputs (Sacks et al., 1989;Higdon et al., 2004;Bowman and Woods, 2016).Emulation has been successfully applied across a variety of scientific disciplines, such as astrophysics (Higdon et al., 2004;Kaufman et al., 2011;Vernon et al., 2014), climate science (Castel-letti et al., 2012;Williamson et al., 2013;Edwards et al., 2021), engineering (Du et al., 2021), epidemiology (Andrianakis et al., 2015;McKinley et al., 2018), and volcanology (Gu and Berger, 2016;Marshall et al., 2019).Improved emulation strategies therefore have the potential to benefit many scientific areas, allowing more accurate analysis at lower computational cost.Vernon et al. (2019) describe an advance in emulation strategy that can lead to substantial improvements in emulator performance by exploiting the fact that there often exist parameter settings for which computer model output can be evaluated far more efficiently (whether this be analytically or just significantly faster using a simpler numerical solver).This may be possible as a result of allowing various modules to decouple from more complex parts of the model, particularly when certain parameters are set to zero.Such parameter settings commonly lie across boundaries or hyperplanes of the input parameter space, hence leading to effectively known model behaviour on these boundaries that impose constraints on the emulator itself (note that such Dirichlet boundary conditions on the emulator are distinct from the Dirichlet boundary conditions that could be imposed on the computer model itself).The information on these known boundaries can be incorporated into the emulation process via analytical update, thus involving no additional computational cost.This is preferable to the approach explored by Tan (2018), which uses substantial extra modelling and multiple extra emulator parameters (each requiring estimation) to ensure consistency with the known boundary.In contrast, the approach in Vernon et al. (2019) includes no extra modelling, and zero additional parameters, instead updating the Gaussian Process (GP) style emulator with the boundary information in a natural way.
In this article, we extend the work of the literature to show that such analytical updates are available for multiple boundaries of various dimensions.In particular, we demonstrate which configurations of boundaries such analytical updates are available for.The results of this article both provide analytical insights and are directly applicable to the analysis of many realistic physical systems represented by computer models.We demonstrate this by applying the methodology to a scientifically relevant model of hormonal crosstalk in the roots of Arabidopsis Thaliana.Due to the ease and substantial benefits of including known boundaries when emulating the Arabidopsis model, we would suggest that future Uncertainty Quantification (UQ) analyses of scientific models include a phase of identification and incorporation of known boundaries, if they are found to exist, as standard practice.
The remainder of this article is organised as follows.In Section 2, we review and extend the work of Vernon et al. (2019) to the case of a single known boundary of any dimension (as opposed to p − 1, where p is the number of input components to the computer model).Section 3 extends the theory to multiple boundaries of various dimensions, covering which configurations of boundaries may be incorporated into an emulator analytically, before exhibiting a low-dimensional illustrative example.This example can be thoroughly investigated via associated code available at https://github.com/samjacksonstats/KBE. Section 4 applies the emulation techniques to a current systems biology model of Arabidopsis Thaliana, with the article being concluded in Section 5.All referenced appendices can be found in the supplementary material (Jackson and Vernon, 2022).

Known Boundaries of Dimension p − k
This section reviews the work presented in Vernon et al. (2019), whilst extending it by allowing the known boundaries to be of any dimension.

Emulation of Computer Models
We consider a computer model f (x), where x ∈ X denotes a p-dimensional vector containing the computer model's input parameters, and X ⊂ R p is a pre-specified input parameter space of interest.We assume that f (x) is univariate, however, the results presented directly generalise to the corresponding multivariate case, with acceptable correlation structure, as discussed further in Appendix H.We make the judgement, consistent with much of the computer model literature, that f (x) has a product correlation structure: with r j (0) = 1, corresponding to deterministic f (x).For example, a common choice is the Gaussian correlation function, given by (2) If we perform a set of runs at locations X D = {x (1) , . . ., x (n) } over the input space of interest X , giving computer model outputs as the column vector D = (f (x (1) ), . . ., f (x (n) )) T , then we can update our beliefs about the computer model f (x) in light of D. While this can be done using Bayes theorem (if, say, f (x) is assumed to be a Gaussian Process), we instead prefer a less fully specified framework.Hence, we treat expectation as primitive (motivated by de Finetti (1974)), give a partial prior belief specification only in terms of expectations, variances and covariances, and employ the Bayes linear update which is the natural choice when operating under such a partial specification (Goldstein, 1999;Goldstein and Wooff, 2007): 5) ] are the expectation, variance and covariance of f (x) adjusted by D (Goldstein, 1999;Goldstein and Wooff, 2007).Although we will work within the Bayes linear formalism, the derived results would apply to a version of the fully specified probabilistic Bayesian emulation case, were one willing to make the additional assumption of full normality that use of a GP entails, and to condition on various emulator parameters.See Goldstein (1999) and Goldstein and Wooff (2007) for a further discussion of the Bayes linear approach and its foundational motivation.
As discussed in Vernon et al. (2019), since the results of this article rely on the product correlation structure of the emulator, extension of these methods to more general emulator forms requires further calculation.

Known Boundary Emulation
hyperplane defined by fixing the value of a subset of the inputs, with J K = {j 1 , . . ., j k } ⊂ P = {1, . . ., p} being the indices of the fixed inputs.We consider the situation where f is analytically solvable along K (that is, we know f (x) for any x ∈ K).We wish to update our emulator, and hence our beliefs about f (x) at input point x ∈ X , in light of K. We capture model behaviour along K by evaluating K = (f (y (1) ), . . ., f(y (m) )) for a large but finite number m of points on K, denoted y (1) , . . ., y (m) , however we structure our calculations so that they can be easily generalised to the case of continuous model evaluations on K, as shown in Vernon et al. (2019).Naively plugging these m runs into the Bayes Linear update Equations ( 3), ( 4) and ( 5) by replacing D with K may be infeasible due to the size of the m × m matrix inversion Var[K] −1 (m may need to be extremely large to capture all the information available from K).A direct update of the emulator is therefore non-trivial, hence we show from first principles that this update can be performed analytically for a wide class of emulators.This is done by exploiting a sufficiency argument briefly described in the supplementary material of Kennedy and O'Hagan (2001) though, to our knowledge, only utilised for the first time in the context of known boundary emulation in Vernon et al. (2019).
We begin by evaluating f (x K ), where x K is the orthogonal projection of x onto the boundary K, and extending the collection of boundary evaluations, K, to be the m + 1 column vector K = (f (x K ), f(y (1) ), . . ., f(y (m) )) T .Note that x K j = α K j for j ∈ J K .Crucially, we have that arising from the first row of the somewhat trivial equation Var , where I (m+1) is the identity matrix of dimension (m + 1).
Equation ( 6) is of particular value when considering the behaviour of f at the point of interest x.As we have defined x K as the orthogonal projection of x onto K, we can define a K = x − x K to be the p-vector of shortest distance from boundary K to x.Note that the elements of a K have the property that: where we define (for two sets A, B) A\B to be the elements in A but not B. We also define: for a generic collection of indices J and vector of constants q.By partitioning the dimension indices P = {1, . . ., p} into P = {J K , P K } we obtain the following covariance expressions: since r j (0) = 1, components J K of x K and y (s) must be equal as they all lie on K (that is, x K j = y (s) j for j ∈ J K ), and for j ∈ P K .From ( 8) and ( 9), the covariance between f (x) and the set of boundary evaluations K is given by Using ( 6) and ( 10) we obtain the important result that: As we have avoided the need to explicitly evaluate the intractable matrix inverse Var[K] −1 , we can find the Bayes Linear adjusted expectation for f (x) with respect to K analytically by combining (3) and ( 11): where we have defined Δf ( We have thus eliminated the need to explicitly invert the large matrix Var[K] entirely by exploiting the symmetric product correlation structure and (6).Similarly, we find the adjusted covariance between f (x) and f (x ) given the boundary K, where f (x ) is the model output at a second point x , using (5) and ( 11), and again exploiting the partition P = {J K , P K }: where the 'updated correlation component' in the x J K directions is given as By setting x = x , we obtain an expression for the adjusted variance of f (x): Equations ( 12) and (15) give the expectation and variance of the emulator at a point x, updated by a known boundary K.As they require only evaluations of the analytic boundary function and the correlation function they can be implemented with trivial computational cost in comparison to a direct update by K. Useful insights into the sufficiency, stationarity and limiting behaviour, along with a generalisation of the above to continuous boundary evaluations K, are discussed in Vernon et al. (2019).Note that when using a general product correlation structure, as given by ( 1), we require the known boundary K (and all subsequent known boundaries) to be axially aligned.However, if the Gaussian/squared exponential correlation structure is used which is invariant under rotations in R p , the initial boundary K can have any orientation, and subsequent boundaries (see Section 3) can be included that are parallel to K or perpendicular to K (in a basis where all correlation lengths are equal).

Three-dimensional Example
For illustration, we consider the problem of emulating the three-dimensional function: + cos(x 3 ), ( 16) 2π, 2π].This simulated example will, throughout this article, take a prior expectation E[f (x)] = 0, and a product Gaussian covariance structure, as given by (2), with correlation length parameters θ = (π, π/8, π) and variance parameter σ 2 = 2.These values are adequate for the example presented, as illustrated in the diagnostic panels of the figures that follow.Having said this, it is important and informative to explore the effect of varying the correlation length parameters on emulator predictions, particularly in combination with known boundaries.We explore varying these parameters for the Arabidopsis model application in Section 4.
We begin by assuming a known boundary K = {x ∈ R 3 |(x 2 , x 3 ) = (0, 0)}, hence that we can evaluate f (x K ) = sin(x 1 ) + 1 for any point on the boundary x K ∈ K.We hence apply the emulator expectation and variance update given by ( 12) and (15).
In order to illustrate the effect of the known boundary on the emulator, we examine emulator behaviour across two-dimensional slices (keeping one variable fixed) of the three-dimensional input space, as shown in Figure 1.The top row depicts the input space as a cube, with the one-dimensional boundary being illustrated by the red line.The green planes are two-dimensional slices of the input space over which emulator and simulator behaviour are compared in the remaining plots.The remaining rows show (from top to bottom) simulator f (x) (for comparison purposes), emulator expectation In each case, the variable with smaller index is along the horizontal axis.The left column of the figure shows the results for x 2 = 0. Since this slice contains the known boundary, we see that for x 3 = 0 emulator expectation precisely matches the true simulator function, and the variance goes to zero.As we move further away from the boundary in the x 3 direction, the variance increases.Note that, since K is parallel to the x 1 direction, altering the value of this variable doesn't alter the emulator variance.The middle column shows a slice away Figure 1: Updating the emulator for the three-dimensional function given by ( 16) by a single boundary K. Rows from top to bottom, show: 1) position of known boundary (red line) in the three-dimensional input space, along with the position of the twodimensional slices (green planes) illustrating the two-dimensional plane in the threedimensional input space, over which the remaining plots in the same column are plotted, 2) simulator function f (x), 3) emulator expectation μ(x), 4) emulator variance ν(x), 5) standardised errors s(x).Columns from left to right show results on the three planes x 2 = 0, x 2 = −π/8 and x 1 = −π respectively, shown as the green planes in the top row.Note that for each two dimensional plot, the variable with smaller index is along the horizontal axis.from the boundary (x 2 = −π/8).Again, the smallest variance is at x 3 = 0, however, now it is not zero.The right column shows x 1 = −π.In this case, the function is only known at the centre point (x 2 , x 3 ) = (0, 0) with variance increasing radially away from this point.The diagnostic plots provide evidence for the validity of the emulator, with few parts of the input space having standardised errors greater than 2. Associated code for the examples contained in Sections 2 and 3 can be found at https://github.com/samjacksonstats/KBE.This code permits user-friendly investigation into varying all features of the presented example, including toy function, parameter ranges, known boundaries and diagnostic slices.

Updating by Further Model Evaluations
Since we have analytic expressions for E we are now able to include additional computer model evaluations into the emulation process.To do this, we perform n (expensive) evaluations of the computer model across X to obtain D = (f (x (1) ), . . ., f(x (n) )), and use these to supplement the evaluations, K, available on the boundary.We want to update the emulator by the union of the evaluations D and K, that is to find This can be achieved via a sequential Bayes Linear update: Var where we first update our emulator analytically by K, and subsequently update these quantities by the evaluations D (Goldstein and Wooff, 2007).n is typically of small/modest size due to the relative expense of evaluating the computer model, hence these calculations (in particular Var K [D] −1 ) will remain tractable, leading to an overall O(n 3 ) calculation.It is worth comparing this to the brute force approach of including a large number m of points on the (p − k)-dimensional known boundary K, and updating the emulator directly, which leads to an O((m + n) 3 ) operation.For example, we may consider a fine grid of points that are half a correlation length (θ/2) apart in each direction to be enough to capture most of the information from the boundary K.In this case we would have ).For the Arabidopsis application, p = 38 and k = 4.This represents a vastly less efficient calculation compared to the above analytic O(n 3 ) approach.
It is worth noting that users of standard black-box GP emulation packages may be unable to implement directly the formulae of ( 12) and ( 15).However, we see that due to underlying sufficiency arguments, such a user can simply add the (n + 1) trivial boundary evaluations f (x K ) and and then their black box Gaussian process package will produce results that precisely match ( 17)-( 19).This would only require the inversion of a (2n + 1) × (2n + 1) matrix, and hence be of O((2n + 1) 3 ).However, this is only true for a single emulated point, while typically a user would wish to emulate at a large number n of input locations.Naively adding all corresponding n boundary evaluations directly to D * would lead to an unnecessary O((n + 2n) 3 ) calculation with n n.However, splitting the n points into N batches improves calculation efficiency to O(N ( n N + 2n) 3 ), which has a computationally optimal batch size of n, thus an optimal number of batches N = n /n.This results in the calculation being O(N , where again n n.As typical values may be say n = 10 2 and n = 10 6 , whereby n = n 3 , were this relationship to scale, we would have O(n 5 ).In comparison, the above proposed analytic method (( 17)-( 19) combined with ( 12), ( 13), and ( 15)) is of O(n 3 ) regardless of the size of n , and hence is seen to be clearly superior.For further discussion of this issue and its exacerbation for multiple known boundaries, see Appendix G.

Multiple Boundaries of Various Dimensions
In this section, we begin by discussing the requirements for being able to analytically update by a second known boundary, before considering larger numbers of boundaries.

Two Known Boundaries
Given the results of Section 2, we now proceed to consider analytical updating by a second known boundary As this section progresses, we will restrict the form of L, for example, to being either intersecting and orthogonal, or parallel to K, however, for now we consider the general case.We define L = f (x L ), f(z (1) ), . . ., f(z (m) ) T to be a vector of model evaluations, where z (1) , • • • , z (m) constitute a large but finite number m of points along L, and denote the p-vector of shortest distance to x from its orthogonal projection x L as a L = x − x L .We also define x LK to be the sequential orthogonal projection of x first onto L and then onto K, and correspondingly a LK = x − x LK to be the p-vector of shortest distance to x from this sequential projection.In Vernon et al. (2019), it was demonstrated that analytic updating of two perpendicular or parallel p − 1-dimensional boundaries K, L could be achieved.As an example, the update for two perpendicular p − 1-boundaries was given by: where it is assumed that K and L are defined by x 1 = α K and x 2 = α L , and we have utilised the notation of ( 7) and ( 14).The inclusion-exclusion nature of this result will also feature in the general results that follow, both in this section and in Section 3.2.
More generally, to permit analytic updating by a second boundary L of dimension p − l, as defined above, we need to find an analogous version of (10), which relates Cov K [f (x), L] to Cov K f (x L ), L .We begin by examining the expression for Cov K f (x L ), f(z (s) ) in light of (13), exploiting the notation of ( 7) and ( 14), and using the partition P K = {J L \J K , P K∪L } where P K∪L = P \(J K ∪ J L ): where LK j is a constant giving the orthogonal distance from L to K in the x j -direction.In addition, note that we define throughout r ∅ (•) = 1.We then examine: By combining ( 22) and ( 23) we obtain: In order to obtain an equation analogous to (10) we need to be able to write ) and a function that does not depend on z (s) , thus permitting replacement of f (z (s) ) by L in the Cov K • , f(z (s) ) terms.In general, this is not possible for a second boundary, since the required product correlation structure no longer exists, thus resulting in the appearance of z (s) several times in the quotient in the expression on the right hand side of (24).However, there are two general and commonly occurring cases when this dependency does not exist and our methods permit further analytic update by a second known boundary (and indeed further known boundaries, as discussed in Sections 3.2 and 3.3).The first case is if we wish to update by a known boundary which is a hyperplane which is orthogonal to and intersecting the first, and the second case is if we wish to update by a known boundary that is a hyperplane which is parallel to the first, or a subplane thereof.We discuss these two cases in the following sections.

Two Intersecting Orthogonal Known Boundaries
If K and L are two intersecting orthogonal boundaries, that is K ∩ L = ∅, then α K j = α L j , a K j = a L j and LK j = 0 for j ∈ J K ∩ J L .In this case we can rewrite (24) as: We can now avoid explicit evaluation of the intractable Var K [L] −1 term by combining (26) with sequential update ( 17)-( 19) to give: with where J T = t∈T J t .Extended derivation of the general Expressions ( 27) and ( 28) for updating by any two intersecting orthogonal boundaries can be found in Appendix A. Note that if K is defined by x 1 = α K , and L is defined by x 2 = α L , these expressions collapse back to those given by ( 21).We can see that Expressions ( 27) and ( 28) are invariant under the interchange of the two boundaries.This should be as expected, since the boundaries are orthogonal to and intersecting each other.

Two Parallel Boundaries
Consider now that L is such that J K ⊆ J L .In other words, L is either a hyperplane which is parallel to K, or a subplane thereof.Note that now x L = x KL = x LK , that is, the order of the boundaries matters, and that We also define LK to be the p-vector of shortest distance from (any point on) L to K.
In this case, (24) can be rewritten as hence we have that: Equation ( 30) allows us to again avoid explicit evaluation of the intractable Var K [L] −1 term.The adjusted expectation and covariance can then be calculated using the sequential update ( 17)-( 19), to be: where we define: Extended derivation of Expressions ( 31) and (32) can be found in Appendix B.
We observe that, for the case when J K ⊂ J L , the result is not invariant under the interchange of the two boundaries K ↔ L, as expected.Although the order in which we update by the two boundaries should not affect the final result, whilst we were able to provide the analytical solution above for the case where we updated by the boundary of largest dimension first, this is not the case if we first update by the boundary of lower dimension.As discussed in Section 3.1, a problem arises in the latter case due to us being unable to write Cov K f (x), f(z (s) ) as a product of Cov K f (x L ), f(z (s) ) and a function of x only.Therefore, we cannot obtain an expression analogous to (11) which enables analytic updating of f (x) by K and L by avoiding the explicit inversion of Var K [L] −1 .In the case when J K = J L , the result is invariant under K ↔ L, as shown in Appendix C. In addition, if K is given by x 1 = α K and L is given by x 1 = α L , Expressions (31) and (32) collapse to the result in Vernon et al. (2019).

Multiple Known Boundaries
Following Section 3.1, it is logical to assume that analytic updating would be possible for further intersecting orthogonal and parallel hyperplanes along which model behaviour is known.We hence proceed to discuss the form of an emulator updated by h boundaries K H = K 1 , . . ., K h , where boundary In this section, we first consider intersecting orthogonal boundaries, then we consider parallel boundaries.Similar to the previous sections, we define a Ki = x − x Ki to be the vector of shortest distance from boundary K i to x, where x Ki is the orthogonal projection of x onto boundary K i .Letting T = (t 1 , . . ., t τ ) be a sequence of boundary indices, we also define K T to be the sequence of boundaries K t1 , . . ., K tτ , x K T to denote x projected sequentially onto boundaries K T in reverse order (that is, K tτ , K tτ−1 , etc.), and a

Multiple Intersecting Orthogonal Boundaries
In this section, we consider that the h boundaries are all orthogonal to and intersecting each other, in other words that Theorem 3.2.1.The expectation and covariance of f (x) sequentially adjusted by boundaries K H , such that K 1 ∩ • • • ∩ K h = ∅, are given by: where H = 1, . . ., h, J T = t∈T J Kt , J H = i∈H J Ki and P H = P \J H .
Theorem 3.2.1 provides the general form for analytically updating our emulator by multiple intersecting orthogonal boundaries.We can see that Expressions (33) and ( 34) are invariant under the interchange of the h boundaries.This should be as expected, since all boundaries are orthogonal to and intersecting each other.Proof of Theorem 3.2.1 by induction is presented in Appendix D.

Multiple Parallel Boundaries
In this section, we consider that the h boundaries are such that J Ki−1 ⊆ J Ki .In other words, for all i ≥ 2, K i is either a hyperplane which is parallel to K i−1 , or a subplane thereof.Such ordering of the boundaries by decreasing dimension size is required in order to leave the correlation structure in the appropriate product form to perform all the calculations analytically at each stage (see the discussion in the main part of Section 3.1 and at the end of Section 3.1 for more detail).Theorem 3.2.2.The expectation and covariance of f (x) adjusted by h ≥ 2 parallel boundaries K H , with J Ki−1 ⊆ J Ki for i = 2, . . ., h, are given by: where and R  (γ) is defined recursively by: R (γ) Note that for a single boundary R (1) K (x, x ) = R J K (a K , a K ) .Theorem 3.2.2,proved by induction in Appendix E, provides the general formulae for analytically updating our emulator by multiple parallel boundaries.Expressions ( 35) and ( 36) are not invariant under interchange of the h boundaries due to the need for the boundaries to be taken in order of decreasing dimension size in order for the calculations to be performed analytically.

Additional Sets of Known Boundaries
Section 3.2 demonstrated that analytic update calculations are possible given a set of mutually orthogonal known boundaries or for sets of known boundaries where each boundary is a hyperplane which is parallel to the previous one, or a subset thereof.Given these results, the natural question to ask is: for which combinations of known boundaries can an emulator be updated, whilst allowing all of the necessary calculations to be performed analytically?We now state the following proposition to answer this general question.

we can update by a further boundary K h if and only if, for each
In other words, for each i = 1, . . ., h − 1, K h must either be an intersecting and orthogonal hyperplane to K i , or be a hyperplane (or subplane thereof) which is parallel to K i .
As discussed in Section 3.1, in order to perform an analytical update by a further known boundary, we needed to be able to write Cov K f (x), f(z (s) ) as a product of Cov K f (x L ), f(z (s) ) and a function that does not depend on z (s) .In other words, we needed an appropriate product correlation structure.This same criterion extends to further known boundaries, and must hold between every pair of a set of boundaries for analytic update to be performed.The general formula is somewhat complex, although the analytic update formulae can be derived by iteratively applying the sequential update ( 17)-( 19) with appropriate analogous equations to (10).

Three-Dimensional Example
We continue the example of Section 2.3 by adding two extra boundaries.We add a onedimensional boundary L = {x ∈ R 3 |(x 2 , x 3 ) = (0, −π)} which is parallel to the first, and also a two-dimensional boundary M = {x ∈ R 3 |x 1 = 0}, which is orthogonal to the others.K, L and M therefore form a set of known boundaries satisfying the conditions given in the propostion in Section 3.3.The update formulae for this particular set of boundaries can be given as: The derivation of ( 37) and ( 38) is provided in Appendix F. The emulator outputs, derived using these equations, are shown in Figure 2. We can see that with these three boundaries, much is learnt across each of the displayed two-dimensional slices of the input space.Variance is particularly reduced for x 2 = 0 (left-hand column), this column also essentially containing the story of a smaller two-dimensional example (that when x 2 = 0) with three one-dimensional boundaries.The emulator predicts the model across much of the input space well; only in the top left and top right corners, when x 3 is large and x 1 small or large, is behaviour really uncertain.
The middle column shows the plane x 2 = π/8.We can see that the intersecting known boundary M at x 1 = 0 has much greater influence on the adjusted beliefs across the plane of interest than the lower dimensional known boundaries K and L, these being subplanes of a plane parallel to the one of interest in this case.In contrast, if the plane of interest is parallel to the two-dimensional plane, for example x 1 = π in the right-hand column, then the intersecting lines have a greater influence, although concentrated over a smaller area of the plane.The right-hand column particularly highlights the advantages of having as many known boundaries as possible.The intersecting lines provide much Figure 2: Updating the emulator for the three-dimensional function given by ( 16) by two sets of perpendicular boundaries with two and one boundaries respectively.Rows from top to bottom, show 1) position of known boundaries (red line K, blue line L and pink plane M) in the three-dimensional input space, along with the position of the two-dimensional slices (green planes) illustrating the two-dimensional plane in the three-dimensional input space, over which the remaining plots in the same column are plotted, 2) simulator function f (x), 3) emulator expectation μ(x), 4) emulator variance ν(x), 5) standardised errors s(x).Columns from left to right show results on the three planes x 2 = 0, x 2 = −π/8 and x 1 = −π respectively, shown as the green planes in the top row.Note that for each two-dimensional plot, the variable with smaller index is along the horizontal axis.increased precision over a smaller area, whilst the parallel plane reduces variance slightly (though still to a worthwhile degree) across the whole plane.In addition, the diagnostics are satisfactory across each plane in the example.
To summarise, for computer model applications where such sets of known boundaries exist, the gains of including them in the analysis using the general results derived in this section can be substantial, and therefore they should be included whenever possible.

Application of Methods to Arabidopsis Model
In the previous sections of this article, we have presented methodology for utilising knowledge of the behaviour of computer models along particular boundaries of the input space to aid emulation across the whole input space.In this section, we explore the implications such boundaries can have on a higher-dimensional scientifically relevant systems biology model of the hormonal crosstalk in the roots of an Arabidopsis plant.

Model of Hormonal Crosstalk in Arabidopsis Thaliana
Arabidopsis Thaliana is a small flowering plant that is widely used as a model organism in plant biology (Initiative, 2000).We demonstrate our known boundary emulation techniques on a model of hormonal crosstalk in the root of an Arabidopsis plant that was constructed by Liu et al. (2013).This Arabidopsis model represents the crosstalk of auxin, ethylene and cytokinin in Arabidopsis root development as a set of 18 differential equations, given in Table 2 of Appendix I, which must be solved numerically.The model takes an input vector of 45 rate parameters (k 1 , k 1a , k 2 , . ..), although we will be interested in a subset of 38 of them, as discussed in Appendix I, and returns an output vector of 18 chemical concentrations ([Auxin], [X], [P LSp], . ..).This Arabidopsis model has been successfully emulated in the literature in the context of history matching (Vernon et al., 2018).
For the purposes of this article, we are interested in modelling the important output component [ET ], which represents the concentration of ethylene (Burg, 1973;Swarup et al., 2007), at early time t = 2.The ranges over which we allowed the inputs to vary are given in Table 3 in Appendix I, these being elicited as ranges of interest deemed sensible by the biological experts (Liu et al., 2013), and square rooted and mapped to a [−1, 1] scale prior to analysis.

Establishing Known Boundaries
Establishing known boundaries requires some understanding of the scientific model.It is not uncommon for one or more known boundaries to occur in a model for some output components.Often, setting certain parameters to specific values will decouple smaller subsections of the system, which may allow subsets of the model equations to be solved analytically, for particular output components, as is the case for the Arabidopsis model.
We consider known boundaries for output component [ET ] by considering its rate equation: A known boundary exists when rate parameter k 12a = 0, since in this case: where [ET 0 ] is the initial condition of the [ET ] output component, and we see that [ET ] has been entirely decoupled from the rest of the system.
[ET ] can now be obtained along the boundary k 12a = 0 with negligible computational cost.Note that this boundary is of dimension p − 1 = 38 − 1 = 37.The second (perpendicular) known boundary for [ET ] is a p − 4 = 34-dimensional boundary given by k 1a = k 2a = k 3a = k 18a = 0, which decouples the combined system of [Auxin], [CK] and [ET ].In this case, we can solve for [Auxin] and [CK] first: Inserting these solutions into the rate equation for [ET ] then yields: which can now be solved analytically with negligible computational cost, given the initial conditions [Auxin 0 ] and [CK 0 ] for Auxin and Cytokinin respectively.In this case, we have [CK 0 ] = [ET 0 ] = [Auxin 0 ] = 0.1 as the initial conditions suggested by the biological experts.The remaining initial conditions are shown in Table 4 in Appendix I. We will refer to this p − 4-dimensional boundary as K and the earlier presented p − 1-dimensional boundary as L in order to show the effect of the smaller-dimension boundary in comparison to the larger-dimension one.In addition, it is important to note that both boundaries K and L lie outside the input space of interest X as given by Table 3 in Appendix I. Despite this, assuming the behaviour of the model is reasonable in the vicinity of the boundaries, the information provided by the analytical solutions along the boundary can be useful for predicting model behaviour inside X .

Emulator Structure and Specification
We restrict the form of our emulator to have the covariance structure as given by ( 1).We used a product Gaussian correlation function of the form given by ( 2), as we assumed that the solution to the Arabidopsis model would most likely be smooth and that many orders of derivatives would exist.
The prior emulator expectation and variance were taken to be constant, that is E[f (x)] = β and Var[f (x)] = σ 2 , where β and σ 2 were estimated to be the sample mean and variance of a set of previously evaluated scoping runs.In this section, we specify a common correlation length parameter θ = 3 for each input, a choice consistent with the argument for approximately assessing correlation lengths presented in Vernon et al. (2010).This value of θ was also checked for adequacy using standard emulator diagnostics (Bastos and O'Hagan, 2008).We made this relatively simple emulator specification for illustrative purposes, the reason being that we wish to demonstrate that there are benefits to utilising the known boundaries regardless of how the parameters may have been estimated.To this end, in Section 4.5 we compare the effects of several different values of θ on an analysis with and without the known boundaries, but for now keep the value fixed at θ = 3.

Comparison of Results
In this section, we compare the emulators of the above form constructed with and without use of the known boundaries K : k 1a = k 2a = k 3a = k 18a = 0 and L : k 12a = 0, and also with and without the addition of training points.The design for the additional training points is obtained by constructing a Maximin Latin hypercube design of size 1000 across the 38-dimensional input space, this then being sampled from to explore the effects of using different numbers of training points up to 1000.Bayes linear updates were carried out using the single and two perpendicular boundary updates given by ( 12), ( 13) and ( 33), (34) respectively.Additional updating is then performed using the sequential update formulae given by ( 17)-( 19).
Equivalent plots to those shown in Figures 1 and 2 are substantially more difficult to visualise across all dimensions of a high-dimensional input space.We will use numerical diagnostics to assess these emulators in Section 4.5, but in this section we will restrict comparison of the emulators to visual diagnostics.Figure 3 shows model output against emulator expectation ±3 standard deviations for a set of 100 diagnostic test points for each of six emulators; first row: no boundaries (0KB); second row: 1 known boundary K (1KB); third row: two known boundaries K and L (2KB).The left column shows diagnostics for emulators without any training points D in the bulk of the input space (0TP), and the right column shows diagnostics for emulators that include 500 training points (500TP).Since the error bars for most of the points intersect the line f (x) = E[f (x)], this gives evidence to the fact that these emulators are valid, in the sense of generating predictions with appropriate associated uncertainty estimates, without making reference to the accuracy of the predictions.This heuristic appeals to Pukelsheim's Three Sigma Rule (Pukelsheim, 1994) which states that 95% of the probability mass of any unimodal distribution lies within 3 standard deviations of the mean.In terms of the predictions themselves, the middle left panel of Figure 3 shows that the expected values of the points have been marginally influenced in general by K, but to a greater degree for inputs for which the model output is smaller.The bottom left panel shows that the addition of L results in a much improved effect on the predictions than boundary K alone.We do note that in addition to increased accuracy, this leads here to slightly underestimated predictions: this is due to the the function values on the boundary L being typically lower than corresponding orthogonal parts of the input space, in addition to the emulator expectation approximately tracking the form of the correlation component r 1 (a) when moving orthogonally away from the boundary, a form which may undershoot the true form of f (x).We note that the results of including both boundaries K and L are comparable to using no boundaries and 500 training points across the input space (top right panel), thus highlighting that utilising knowledge of computer model behaviour along known boundaries is worthwhile.Crucially, however, whereas the 500 training points require 500 potentially computationally intensive model evaluations and emulator matrix inversion calculations, the known boundaries involve no model evaluations or matrix inversion calculations.The simultaneous inclusion of both known boundaries and additional training points leads, unsurprisingly, to the emulators with greatest accuracy.In particular, it may be fair to say that out of the six emulators for which diagnostic plots are provided here, only the one with diagnostics provided in the bottom right panel yields sufficient accuracy and predictive capability for practical application.
The substantial and moderate effects of L and K respectively on our beliefs in comparison to individual points is largely a result of the dimension of the objects.The known boundaries are p − 1-and p − 4-dimensional objects respectively, resulting in significant variance resolution as a consequence of the volume of the input space within their proximity (particularly L for which the correlation function is effectively over a single dimension).In comparison, individual training runs (which are 0−dimensional objects) influence far smaller volumes especially in high dimensions.
Since there is little computational cost involved in the incorporation of known boundaries, the most practical solution is to utilise them in conjunction with the regular training points.Looking at the bottom right panel of Figure 3, we notice a substantial improvement in comparison to using either the known boundaries or the training points individually.Were one aware of the known boundaries in advance, one could design the set of 500 runs accordingly, leading to further efficiency gains (see Vernon et al. (2019)).In terms of the biological application to the Arabidopsis model, we see that we are now able to utilise the information from many more known boundaries, now of differing dimensions, for each of the biological outputs of interest, leading to more powerful emulators.This is in contrast to Vernon et al. (2019) where only p − 1 boundaries could be used, which restricted the number of boundaries available.

Sensitivity to Emulator Parameter Specification
We now compare emulators constructed using various different emulator parameter specifications.In particular, we explore the effect of changing the common correlation length parameter θ discussed in Section 4.3.We do this as we wish to focus on the advantage of utilising known boundaries on emulation without confounding the effect on choice of parameter specification.Whilst we will demonstrate that the effects of known boundaries are substantial regardless of emulator structure and parameter specification, the value of θ does affect the relative size of the contributions of individual points to known boundaries (that is, larger dimensional objects).We compare several emulators with various values of correlation length parameter θ, numbers of training points and numbers of known boundaries using numerical diagnostics for 500 diagnostic points.These diagnostics, shown numerically in Table 1 and visually in Figure 4, are the sum of variances and Mean Absolute Standardised Prediction Error (MASPE), given by: 500 w=1 ν(x (w) ) and 1 500 , respectively, where μ(x) and ν(x) represent the appropriate emulator mean and variance in each case.
The prior sum of variances is 344.23 (constant for all θ), this being reduced by various degrees depending on the three varying features of our analysis.For small θ = 0.1, neither training points nor known boundaries reduce the variances of the diagnostic points appreciably.With θ = 1, training points are having negligible effect on variance, however, the larger known boundary objects have sufficient diagnostic points within their proximity to reduce uncertainty to some degree.For θ = 6, the reduction in diagnostic variance arising from two known boundaries only (15.04) is greater than that of 1000 training points alone (23.87).As θ gets larger, 1000 training points in X have greater affect than two known boundaries outside of X , for example, reducing the sum of variances to 2.07 and 2.44 respectively when θ = 10 (and to 0.02 when both are used).These results are as expected from purely geometrical considerations.
It is common for acceptable values of the MASPE to be broadly around 1 (appealing to the properties of a standard half-normal distribution, which has expectation 2/π), and providing substantial evidence that an emulator is invalid if much greater than 2 or 3 (appealing to Pukelsheim's 3σ rule (Pukelsheim, 1994)).Equivalently, substantial change in MASPE between prior and adjusted beliefs is also cause for concern.Prior MASPE is 0.84, which is suitably close to both 1 and 2/π ≈ 0.8.The MASPE values for emulators with large values of θ are as expected unacceptable, with the value for 1000 training points alone being 25.09 and that for 1000 training points and two known boundaries 7764.93.This excessively larger value is due to the different ways in which known boundaries and training points influence the emulator.The known boundaries affect the input space as a large object with much influence over a particular part of the input space.On the other hand, since the training points are spread out across X , the effect of averaging via interpolation of the points is likely to result in more accurate (and thus with common variance reduction appear more valid) predictions, even if θ is large.The MASPE values for θ = 3, 1000 training points and two known boundaries is 1.07, which is much more acceptable.For the emulators with acceptable diagnostics, we see that the inclusion of known boundaries is clearly beneficial.In addition to sum of variances and MASPEs, we also calculated Root Mean Square Errors (RMSEs) for each emulator, these being displayed and discussed in Appendix I.

Conclusion
We have discussed how additional prior insight into the physical structure of a computer model related to known boundaries can be incorporated into emulators leading to substantial increases in accuracy for little additional computational cost.
In particular, here it is shown that if a computer model has boundaries or hyperplanes in its input space where it can either be analytically solved or just evaluated far more efficiently, then these known boundaries can be formally incorporated into the emulation process by analytic Bayesian updating of the emulators with respect to the information contained on the boundaries.Furthermore, we have shown that this is possible for a large class of emulators, and for multiple boundaries of various forms.The progress in this work in comparison to Vernon et al. (2019) is that we presented substantially more general results for arbitrary numbers of boundaries of varying dimension, stating which configurations of known boundaries permit analytical updates.Due to these analytic results and to the ease and substantial benefits of including known boundaries when emulating the Arabidopsis model, we would suggest that future UQ analyses of serious scientific models include a phase of identification and incorporation of known boundaries, if they are found to exist, as standard practice.Whilst the results of this article were with respect to a univariate computer model, the results extend naturally to the multivariate case, as discussed in Appendix H.In addition, many of the examples in this article can be thoroughly investigated via the associated code available at https://github.com/samjacksonstats/KBE.There are many ways in which the work of this article could be developed.For example, extensions to the case of uncertain regression parameters within the emulator are possible, although the formal update would now depend on the specific form of the correlation function r j (a), which may not be tractable for some choices.Curved boundaries of various geometries could also be incorporated, provided both that suitable transformations were found to convert them to hyperplanes and that we were happy to adopt the induced transformed product correlation structure as our prior beliefs.Finally we note that, for some applications, there may be several hyperplanes in the input space along which model behaviour is known, however, analytical updates incorporating the information given by all of them may not be possible due to the set not satisfying the properties of the proposition in Section 3.3.There is then a possible design problem which involves selecting the best (in some sense) subset of the known boundaries which do permit analytic updating.This choice of boundaries may be in conjunction with design of the training points in the bulk of the input space X D .Training point design should anyway take the known boundaries into account, as discussed in Vernon et al. (2019).We leave all these considerations to future research.

Figure 3 :
Figure 3: Diagnostic plots for the emulators of the Arabidopsis model output component [ET ].These show true model output against emulator expectation plus/minus 3 standard deviations for a set of diagnostic test points given; first row -no known boundary (0KB); second row -single known boundary K (1KB); third row -two known boundaries K and L (2KB).The left column shows diagnostics for emulators without additional training points D in the bulk of the input space (0TP), and the right column shows with 500 additional points (500TP), all for common correlation length parameter θ = 3 for all input parameters.

Figure 4 :
Figure 4: Sum of Variances and log Mean Absolute Standardised Prediction Error for the set of 500 diagnostic points for different values of common θ and numbers of known boundaries (KB) and training points (TP) in the bulk of the input space.

Table 1 :
Sum of Variances and Mean Absolute Standardised Prediction Error for the set of 500 diagnostic points for different values of common θ and numbers of known boundaries (KB) and training points (TP) in the bulk of the input space.