Bayesian decision-theoretic design of experiments under an alternative model

Traditionally, Bayesian decision-theoretic design of experiments proceeds by choosing a design to minimise the expectation of a given loss function over the space of all designs. The loss function encapsulates the aim of the experiment, and the expectation is taken with respect to the joint distribution of all unknown quantities implied by the statistical model that will be fitted to the observed responses. In this paper, an extended framework is proposed whereby the expectation of the loss is taken with respect to a joint distribution implied by an alternative statistical model. Motivation for this includes promoting robustness, ensuring computational feasibility and allowing realistic prior specification when deriving a design. To aid in exploring the new framework, an asymptotic approximation to the expected loss under an alternative model is derived, and the properties of different loss functions are established. The framework is then demonstrated on a linear regression versus full-treatment model scenario, on estimation of the parameters of a non-linear model under model discrepancy, and on a cubic spline model with an unknown number of basis functions.


Introduction
The Bayesian decision-theoretic approach (Chaloner and Verdinelli, 1995) is a natural framework to plan experiments in many fields of science and engineering. It starts with specification of a loss function representing the aim of the experiment. A Bayesian design then minimises the expectation of the loss over the space of all possible designs where expectation is with respect to a probability distribution over all unknown quantities implied by the statistical model that will be fitted upon observation of the experimental responses.
The approach is scientifically appealing since the precise aim of the experiment is encoded by the loss function and, through specification of a probability distribution over all unknown quantities, pre-experimental (lack of) knowledge is incorporated. In practice, finding a Bayesian design by minimising the expected loss is a challenging computational problem (Ryan et al., 2016b). The expected loss is typically analytically intractable and the space of all designs potentially high-dimensional. However, recently there has been significant progress in the development of novel computational methodology for finding Bayesian designs originating in diverse fields of study (e.g. Long et al., 2013; Ryan et al., 2016b; Foster et al., 2019; Beck et al., 2020, and references therein).
In this paper, we propose an extended framework for Bayesian decision-theoretic design of experiments. This is achieved by defining the expected loss by taking expectation with respect to a probability distribution implied by an alternative statistical model (termed a designer model). Etzioni and Kadane (1993) considered the case where the fitted and designer prior distributions were different but the joint distributions for the responses were identical. We extend this by considering the case where both the prior and distribution for responses can be different. There are several reasons why a design may be sought under a different model to the fitted model. For example, the fitted model may emulate a partially observed process, but by considering the whole data-generating process, robustness is introduced into the design procedure. Assuming a simpler fitted model can also be used to induce robustness to model misspecification (Joseph et al., 2009) or may be necessary to ensure computational feasibility of finding Bayesian designs (Ryan et al., 2016a). Conversely, reliable prior specification can be assisted by the designer model being simpler than the fitted model. Accordingly, we provide the framework and supporting methodology to enable Bayesian design within such settings, and apply these methods to typical design problems found in the literature.
The outline of the paper is as follows. In Section 2, the extended framework for Bayesian design of experiments under an alternative model is described and justified, and the choice of loss function is considered. In Section 3, an asymptotic approximation to the expected loss under the extended framework is developed to explore the consequences of designing under an alternative model. Under the traditional approach, designs found under a loss, and under its posterior expectation, are equivalent. This is no longer the case under the proposed extended framework. We consider the choice between these pairs of loss functions and show that there is essentially a bias-variance trade-off to be made. To demonstrate the extended framework, three examples are considered in Section 4 and Section SM6 in the Supplementary Material. The first considers a linear regression versus full-treatment model scenario where application of the extended framework results in objective functions sharing properties of classical design criteria that promote increased replication (Gilmour and Trinca, 2012). The second considers estimating parameters of non-linear models under model discrepancy, and finds that the resulting designs resemble a compromise between designs found under the traditional Bayesian decision-theoretic framework and space-filling designs. Finally, the third (in the Supplementary Material) considers estimation of the parameters of a model formed using cubic splines, with an unknown number of basis functions.
The Supplementary Material includes proofs and derivations of the results, and the cubic spline model example.
2 Bayesian design under an alternative model

Internal and external expected loss
Suppose the ultimate aim of the experiment is to learn the relationship between k controllable variables and a response. The experiment consists of n runs where, for i = 1, . . . , n, the ith run involves specifying a treatment-level combination, i.e. values for the k controllable variables ∆_i = (∆_{i1}, . . . , ∆_{ik})^T, and subsequent observation of a response y_i. Let ∆ denote the design; the n × k matrix with ith row given by ∆_i^T, and let y = (y_1, . . . , y_n)^T denote the n × 1 vector of observed responses. On the completion of the experiment, a statistical model is assumed describing the relationship between the controllable variables and response. That is, y is assumed to be a realisation from a multivariate probability distribution with density/mass function π(y|β, F, ∆) completely known up to a p × 1 vector of unknown parameters β with prior density π(β|F, ∆). We refer to π(y|β, F, ∆) and π(β|F, ∆) as the fitted likelihood and prior, respectively, and collectively as the fitted model, where the conditioning on F is to make it clear that this is under the fitted model. Using Bayes' theorem, the fitted posterior distribution is π(β|y, F, ∆) ∝ π(y|β, F, ∆)π(β|F, ∆). Note that β refers to parameters of interest. There may be additional nuisance parameters present but these have been marginalised, with respect to a prior distribution, to obtain π(y|β, F, ∆).
Since, apart from the values of β, the fitted model completely specifies the relationship between the controllable variables and response, the experimental aim reduces to determining the value of β. This aim can manifest itself in various different ways. Bayesian decision-theoretic design of experiments starts by encoding the exact aim in a loss function, λ(β, y, F, ∆), giving the loss of estimating β via the fitted posterior distribution conditional on y. The choice of loss function is considered in more detail in Section 2.2.
Prior to experimentation, the aim is to specify ∆ to best estimate β as measured by the loss function. Traditionally, a Bayesian design minimises the expected loss over the space of all designs, where expectation is with respect to the joint distribution of β and y as implied by the fitted model (e.g. Lindley, 1972). Mathematically, the expected loss is

L_FF(∆) = E_{β,y|F,∆} [λ(β, y, F, ∆)],   (1)

referred to, in this paper, as the internal expected loss under the fitted model. Unless otherwise specified, references to the internal expected loss are under the fitted model. Instead, consider minimising the expectation of the loss with respect to an alternative model: the designer model. Here we suppose that y is a realisation from a probability distribution with density/mass π(y|θ, D, ∆), where θ is a p_θ × 1 vector of parameters with prior density π(θ|D, ∆). We refer to π(y|θ, D, ∆) and π(θ|D, ∆) as the designer likelihood and prior, respectively, and the conditioning on D is to make it clear that this is under the designer model. Similar to the fitted model, any additional nuisance parameters have been marginalised to obtain π(y|θ, D, ∆).
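To make the internal expected loss concrete, the following is a minimal Monte Carlo sketch for a toy conjugate normal linear model under a squared error loss; the model, prior and candidate designs here are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def internal_expected_loss(design, B=5000):
    """Monte Carlo approximation of the internal expected loss for a toy
    fitted model y_i ~ N(beta_0 + beta_1 x_i, 1) with prior beta ~ N(0, I_2),
    under the squared error loss ||beta - E(beta|y)||^2.
    (Illustrative sketch only; model and prior are assumptions.)"""
    x = np.asarray(design, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    V = np.linalg.inv(np.eye(2) + X.T @ X)  # conjugate posterior variance
    total = 0.0
    for _ in range(B):
        beta = rng.standard_normal(2)               # beta drawn from the prior
        y = X @ beta + rng.standard_normal(len(x))  # y | beta from the fitted likelihood
        post_mean = V @ (X.T @ y)                   # E(beta | y)
        total += np.sum((beta - post_mean) ** 2)    # evaluate the loss
    return total / B

# A spread-out design should achieve lower expected squared error loss
# than a clustered one:
spread = internal_expected_loss(np.linspace(-1, 1, 8))
clustered = internal_expected_loss(np.full(8, 0.1))
print(spread, clustered)
```

Minimising such an approximation over ∆ (e.g. by a coordinate exchange method) yields an approximately optimal design.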
Let β = (γ T , θ T ) T , where γ is a p γ × 1 vector of parameters only present in the fitted model and θ are parameters common to both models. If the fitted and designer models share no common parameters, i.e. β = γ, then the two models are referred to as disjoint. Conversely, if all parameters of interest are common to both models, i.e. β = θ, then the models are referred to as compatible. As such, prior to marginalisation, the fitted model can be nested within the designer model, vice-versa, or no nesting may be present.
A design found under the proposed extended framework is given by minimising the external expected loss, defined as

L_DF(∆) = E_{θ,y|D,∆} [ E_{γ|θ,y,F,∆} { λ(β, y, F, ∆) } ].   (2)

The external expected loss is formed by assuming that once responses are observed, we proceed as if the fitted model represents the true data-generating process, i.e. under the M-closed paradigm (Bernardo and Smith, 1994, page 384). Initially, expectation is taken with respect to the fitted posterior distribution of γ conditional on θ, since these parameters are only present in the fitted model (if the models are non-compatible). Then expectation is taken with respect to the remaining unknowns under the designer model. Note that if the fitted and the designer models are the same, then the external and internal expected loss functions are equal, i.e. L_DF(∆) = L_FF(∆).
There are several reasons to consider finding a design under an alternative model. These can be separated into two cases: when the designer model is simpler than the fitted model, and vice versa. In Section 4 and Section SM6 in the Supplementary Material we provide examples of both cases.
First consider the case where the fitted model represents the a priori best representation of the data-generating process, and the designer model is a proxy simplification. This can occur when the fitted model is complex with a large number of parameters and it is necessary to use a fitted prior distribution representing weak prior information. It is known (e.g. Ryan et al., 2016b) that in these cases, the vagueness of the fitted prior distribution combined with the large number of parameters leads to the internal expected loss being relatively flat and, therefore, non-trivial to (numerically) minimise. However, it may be possible to elicit a prior distribution for a simpler designer model and use this to take expectation of the loss function formed under the fitted model. Now consider the case where the designer model represents the best representation of the data-generating process and the fitted model is a simplification. In this case, it would initially make sense to find a design by minimising the internal expected loss under the designer model, i.e.

L_DD(∆) = E_{θ,y|D,∆} [λ(θ, y, D, ∆)].   (3)
For certain models it may not be possible to represent the experimental aim by a suitable loss function, λ(θ, y, D, ∆), depending on the designer posterior distribution. Instead, an alternative fitted model is used for which it is possible to form a suitable loss function. However, it is prudent to take expectation under the original designer model since this is the best representation of the data-generating process. For example, suppose the designer model is a full-treatment model (e.g. Gilmour and Trinca, 2012) with each unique treatment-level combination permitting a unique mean response. This represents a plausible data-generating process, but for convenience and interpretability a regression model (positing a relationship between the numerical values of the controllable variables and mean response) is usually assumed as the fitted model. Alternatively, suppose the designer model is based on scientific reasoning through a mechanistic or phenomenological model. Again it would make sense to find the design via the internal expected loss under the designer model (3). Assuming the fitted and designer models are compatible, consider Monte Carlo approximations to the external and internal expected loss functions given by

L̂_DF(∆) = (1/B) Σ_{b=1}^{B} λ(θ_b, y_b, F, ∆)  and  L̂_DD(∆) = (1/B) Σ_{b=1}^{B} λ(θ_b, y_b, D, ∆),

where {θ_b, y_b}_{b=1}^{B} is a sample generated from the joint distribution of θ and y under the designer model. There are models, e.g. so-called intractable likelihood models, for which it is straightforward to generate such a sample but extremely difficult to evaluate the designer posterior distribution and hence the loss function λ(θ_b, y_b, D, ∆), for b = 1, . . . , B. However, by forming an approximating fitted model it becomes computationally feasible to find a Bayesian design via the external expected loss (e.g., see the indirect inference approach of Ryan et al., 2016a).
In Section 2.2, we show, under a certain class of loss function, that the external expected loss is an upper bound on the internal expected loss under the designer model, providing formal justification for using such approaches.

Loss functions
In Section 2.1, the loss function is given informally by λ(β, y, F, ∆), stating that it gives the loss of estimating β via the fitted posterior (conditional on y). We now consider the specification of the loss and the implications of such a choice when designing an experiment under an alternative model. For this, we consider two exemplar loss functions as defined in Table 1 (left-hand column): self-information and squared error. Note that ‖v‖²_{2,A} = v^T A v and I_p is the p × p identity matrix. These loss functions have been chosen since, under a normal linear model with non-informative prior distributions, the internal expected loss reduces to the objective function under D- and A-optimality, respectively; two widely-studied classical optimal design criteria.

Optimal loss functions
First consider estimating β based on observed responses y. Suppose the loss for this aim is λ_0(β, b), where b ∈ B is an action for estimating β and B is the action space (e.g. b is a point estimate of β in B = R^p). The optimal action minimises the fitted posterior expectation of λ_0(β, b), i.e. b̂_F = arg min_{b∈B} E_{β|y,F,∆} [λ_0(β, b)].
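For example (a standard calculation, not specific to this paper), under squared error λ_0(β, b) = ‖β − b‖²_2 the fitted posterior expectation decomposes as

```latex
\mathrm{E}_{\beta|y,F,\Delta}\left[\|\beta-b\|_2^2\right]
= \operatorname{tr}\operatorname{var}(\beta|y,F,\Delta)
+ \left\|\mathrm{E}(\beta|y,F,\Delta)-b\right\|_2^2 ,
```

which is minimised at b̂_F = E(β|y,F,∆), with minimum value tr var(β|y,F,∆), i.e. the trace variance loss of Table 1.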
Under optimal loss functions, the following result gives an inequality for the external expected loss in relation to the internal expected loss under the designer model. The proof is given in Section SM1 of the Supplementary Material.
Lemma 2.1 Suppose the fitted and designer models are compatible, i.e. β = θ, with λ(β, y, F, ∆) an optimal loss function. Then L_DF(∆) ≥ L_DD(∆), where L_DD(∆) is the internal expected loss under the designer model given by (3).
In Lemma 2.1, the difference between the external expected loss and the internal expected loss under the designer model is due to the loss of information caused by the fitted model being different to the model under which expectation is taken. Note that Lemma 2.1 only requires the compatible fitted and designer models to be different, not that the designer model is more complex.
By finding the design that minimises the external expected loss, we are minimising an upper bound on the internal expected loss under the designer model. A special case of Lemma 2.1 for the self-information loss is called the Barber-Agakov bound (Barber and Agakov, 2003) and has been used by Foster et al. (2019) to approximately find Bayesian designs. Lemma 2.1 extends the justification of this approach to other loss functions, as has previously been pursued for intractable likelihood models (for example, Ryan et al. 2016a, as discussed in Section 2.1).
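A one-dimensional conjugate example makes the inequality tangible. In this sketch both models share the prior θ ~ N(0, 1) and differ only in the likelihood variance; the variance values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
vD, vF = 1.0, 4.0  # designer / fitted likelihood variances (illustrative)

# Simulate (theta, y) under the designer model: y | theta ~ N(theta, vD)
theta = rng.standard_normal(500_000)
y = theta + np.sqrt(vD) * rng.standard_normal(500_000)

# Posterior means under each model (conjugate shrinkage of y toward 0)
postmean_D = y / (1 + vD)  # designer posterior mean
postmean_F = y / (1 + vF)  # fitted posterior mean

# Internal expected loss under the designer model vs external expected loss
L_DD = np.mean((theta - postmean_D) ** 2)
L_DF = np.mean((theta - postmean_F) ** 2)
print(L_DD, L_DF)  # L_DF >= L_DD, consistent with Lemma 2.1
```

Here L_DD converges to vD/(1 + vD) = 0.5 and L_DF to a larger value, since the fitted posterior mean applies the wrong amount of shrinkage under the designer model.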
Lemma 2.1 is uninformative on the relationship between the external expected loss and the internal expected loss under the fitted model (1).

Generator and composite loss functions
There are intuitive choices of loss function given as scalar summaries of the fitted posterior distribution, independent of β, e.g. λ(y, F, ∆) = |var(β|y, F, ∆)|, where |·| denotes the determinant of a square matrix. In some cases, these are given as the fitted posterior expectation of a loss function λ_G(β, y, F, ∆), i.e. λ_C(y, F, ∆) = E_{β|y,F,∆} [λ_G(β, y, F, ∆)]. For such pairs, we term λ_C(y, F, ∆) composite and λ_G(β, y, F, ∆) generator. The right-hand column of Table 1 shows the entropy and trace variance loss functions, which are composite to the generator loss functions of self-information and squared error, respectively, in the left-hand column.
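The generator/composite relationship can be checked numerically for the squared error/trace variance pair; the posterior below is a made-up bivariate normal used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "fitted posterior": beta | y ~ N(m, S) for some mean m and covariance S.
m = np.array([1.0, -2.0])
S = np.array([[2.0, 0.5], [0.5, 1.0]])

draws = rng.multivariate_normal(m, S, size=200_000)

# Composite loss: a scalar summary of the posterior (trace of variance).
composite = np.trace(S)

# Generator loss averaged over the posterior: E_{beta|y} ||beta - E(beta|y)||^2.
generator_avg = np.mean(np.sum((draws - m) ** 2, axis=1))

print(composite, generator_avg)  # agree up to Monte Carlo error
```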
The internal expected composite loss is equal to the corresponding internal expected generator loss. Apart from when fitted and designer models are disjoint, i.e. β = γ, this equivalence is not necessarily true when considering the external expected loss. This leads to a question on how to choose between a pair of loss functions equivalent under internal expected loss, e.g. how should we choose between self-information and entropy loss functions? In the next section, to answer this question, we develop an asymptotic approximation to the external expected loss. This is used to explore differences in behaviour between the exemplar generator and composite loss functions shown in Table 1. We find that a composite loss solely focuses on fitted posterior precision. On the other hand, a generator loss also incorporates point estimation of common parameters θ. This is essentially a bias-variance trade-off. Considering this trade-off is unnecessary under internal expected loss because the fitted model is, in a sense, "unbiased" as it is identical to the designer model.

Understanding the external expected loss
In this section, an asymptotic approximation to the external expected loss is derived. This approximation is used to provide insight into properties of exemplar loss functions, in particular, to investigate the difference between external expected generator and composite loss functions.

Asymptotic approximation of external expected loss
In general, the external expected loss (2) is not available in closed form and use of an approximation is therefore necessary. Recent developments in computational methodology for approximating (and minimising) the expected loss (example references in Section 1) can actually be used for this task with little modification. Indeed, in the examples in Section 4.2 and Section SM6 in the Supplementary Material we describe Monte Carlo approximations to the external expected loss function.
However, in this section, to gain understanding of the behaviour of the external expected loss, we develop an asymptotic approximation. The resulting approximations to the external expected loss under the exemplar loss functions in Table 1 are derived and compared to the corresponding expressions under the internal expected loss.
First, some preliminary definitions are required. Let β̃ = (γ̃^T, θ̃^T)^T be the p × 1 vector of parameter values for β that minimise the Kullback-Leibler divergence between the designer and fitted likelihoods for a given value of θ under the designer model, i.e. β̃ maximises

E_{y|θ,D,∆} [log π(y|t, F, ∆)]

with respect to t. Define

I(t) = −E_{y|θ,D,∆} [∂² log π(y|t, F, ∆)/∂t ∂t^T]  and  J(t) = E_{y|θ,D,∆} [∂ log π(y|t, F, ∆)/∂t × ∂ log π(y|t, F, ∆)/∂t^T],

and let Ĩ and J̃ be I(t) and J(t), respectively, evaluated at t = β̃. Let K̃ = Ĩ⁻¹ J̃ Ĩ⁻¹, with Ĩ and K̃ partitioned as

Ĩ = ( Ĩ_γγ  Ĩ_γθ ; Ĩ_θγ  Ĩ_θθ )  and  K̃ = ( K̃_γγ  K̃_γθ ; K̃_θγ  K̃_θθ ),

where Ĩ_γγ is the p_γ × p_γ sub-matrix corresponding to γ, with similar definitions for Ĩ_θθ, Ĩ_θγ = Ĩ_γθ^T and the sub-matrices of K̃.
Under regularity conditions (see Section SM2 in the Supplementary Material), an asymptotic approximation (Kleijn and van der Vaart, 2012) to the fitted posterior distribution of β is given by

β | y, F, ∆ ∼ N(β̂, Ĩ⁻¹),   (5)

where β̂ is the maximum likelihood estimate of β under the fitted model. The loss function is approximated by replacing fitted posterior quantities by the corresponding approximated fitted posterior quantities under (5). Such an approximation is denoted λ*(β, β̂, ∆). For example, the self-information loss is approximated by

λ*(β, β̂, ∆) = N_SI − (1/2) log |Ĩ| + (1/2) ‖β − β̂‖²_{2,Ĩ},

where N_SI = p log(2π)/2 is a constant not depending on the design. To approximate the external expected loss (2), the parameters present only in the fitted model, γ, are marginalised with respect to an approximate fitted posterior conditional on θ, induced from (5). Then expectation with respect to the distribution of y conditional on θ under the designer model can be approximated using asymptotic results for inference under an alternative (or misspecified) model (White, 1982).
The result is now formally stated with proof given in Section SM2 of the Supplementary Material.
Theorem 3.1 Under conditions given in Section SM2 of the Supplementary Material, an asymptotic approximation to the external expected loss is given by

L*_DF(∆) = E_{θ|D,∆} [ E_{β̂|θ} { E_{γ|θ,β̂} ( λ*(β, β̂, ∆) ) } ],   (6)

where the two inner expectations are taken with respect to the approximate conditional fitted posterior of γ given θ induced from (5), and the asymptotic sampling distribution of β̂ given θ under the designer model, β̂ | θ ∼ N(β̃, K̃). In Theorem 3.1, K̃ has the sandwich form of the "robust variance formula" (e.g. Pawitan, 2003, page 374). In inferential settings, the values of Ĩ and J̃ are estimated from observed responses, whereas, in design settings, the values follow from the assumption of a designer model.
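The sandwich form K̃ = Ĩ⁻¹ J̃ Ĩ⁻¹ can be checked directly in a linear setting where the fitted-model MLE is ordinary least squares but the designer model has heteroscedastic noise; all numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
x = np.linspace(0.1, 1.0, n)
X = np.column_stack([np.ones(n), x])
theta = np.array([1.0, 2.0])

# Designer model: heteroscedastic noise; fitted model wrongly assumes variance 1.
sd = 0.2 + 1.5 * x  # true (designer) standard deviations

I_tilde = X.T @ X                       # information under the fitted model
J_tilde = X.T @ (sd[:, None] ** 2 * X)  # E[score score^T] under the designer model
K = np.linalg.inv(I_tilde) @ J_tilde @ np.linalg.inv(I_tilde)  # sandwich

# Empirical covariance of the fitted-model MLE (here OLS) under the designer model
XtXinv = np.linalg.inv(X.T @ X)
est = []
for _ in range(20_000):
    y = X @ theta + sd * rng.standard_normal(n)
    est.append(XtXinv @ (X.T @ y))
emp = np.cov(np.array(est).T)

print(np.round(K, 5))
print(np.round(emp, 5))  # close to the sandwich K, not to inv(I_tilde)
```

The empirical covariance matches K̃ rather than the naive fitted-model variance Ĩ⁻¹, mirroring the distinction drawn in Theorem 3.1.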
Due to the tractability of the normal distribution, the two inner expectations in (6) are available in closed form for many common loss functions, e.g. the exemplar loss functions in Table 1. Then the approximate external expected loss is written L*_DF(∆) = E_{θ|D,∆} [ℓ*_G(θ, ∆)]; the designer prior expectation of a function ℓ*_G(θ, ∆), typically a function of Ĩ and K̃.
For loss functions independent of β, including composite losses, the approximate loss is denoted λ*(β̂, ∆) and the approximate external expected loss is given by L*_DF(∆) = E_{θ|D,∆} [ℓ*_C(θ, ∆)]. If the fitted and designer models are identical, then Ĩ = J̃ and β̃ = β. In this case, we obtain an asymptotic approximation to the internal expected loss given by L*_FF(∆) = E_{β|F,∆} [ℓ*_U(β, ∆)]. Table 2 shows the functions ℓ*_G(θ, ∆) and ℓ*_C(θ, ∆) for the exemplar loss functions considered in this paper (see Table 1). Derivation of these expressions is given in Section SM3 of the Supplementary Material. Also shown is the function ℓ*_U(β, ∆) for the asymptotic approximation to the internal expected loss. Note that designs that minimise the fitted prior expectation of ℓ*_U(β, ∆) under self-information and squared error are commonly referred to as pseudo-Bayesian D- and A-optimal, respectively.

Exemplar loss functions
In Table 2, T̃_θθ = (Ĩ_θθ − Ĩ_θγ Ĩ_γγ⁻¹ Ĩ_γθ)⁻¹ is the approximate marginal fitted posterior variance matrix of θ under (5). Additionally, S̃_θθ = I_{p_θ} + Ĩ_θγ Ĩ_γγ⁻¹ Ĩ_γγ⁻¹ Ĩ_γθ. Note that ℓ*_G, ℓ*_C and ℓ*_U all feature −log |Ĩ| (under self-information/entropy) or tr(Ĩ⁻¹) (under squared error/trace variance). Minimising these terms has the effect of (approximately) maximising posterior precision. For the pseudo-Bayesian objective functions these are averaged with respect to the fitted prior, whereas for the external expected loss they are averaged with respect to the designer prior of θ through β̃; a function of θ giving the best possible estimate of β under the fitted model. The expressions for ℓ*_C (i.e. under composite loss) give the entropy and trace of variance under the asymptotic distribution of β under a misspecified model (5).
The expressions for ℓ*_G(θ, ∆) under the generator squared error and self-information loss functions feature two extra terms related to the quality of θ̂ as a point estimate of the common parameters θ. The first of these, ‖θ − θ̃‖²_{2,A}, where A is either T̃_θθ⁻¹ or S̃_θθ, relates to bias. This term penalises designs that lead to θ̂ being (on average under the designer model) far from θ. The second term measures the difference between the variance of θ̂ under the fitted model, i.e. T̃_θθ, and under the designer model, i.e. K̃_θθ.

Examples
Two examples are considered in this section where designs are found under our proposed design framework. In both cases, the fitted model is simpler than the designer model. As will be seen, the designs found provide robustness through replicated design points within a linear regression setting, and through space-filling properties for estimating the parameters of a non-linear model. In Section SM6 in the Supplementary Material, a further example is provided where the fitted model is more complex (as measured by the number of parameters) than the designer model.

Normal linear regression vs full-treatment model
In this example we explore the implications of assuming a full-treatment designer model within a fitted linear regression setting. Under the fitted model, it is assumed that

y | γ, σ², F, ∆ ∼ N(Xγ, σ² I_n),

where X is an n × p model matrix (a function of ∆), and σ² > 0 is a nuisance parameter. We assume a conjugate normal-inverse-gamma fitted prior distribution for γ and σ² such that γ | σ², F ∼ N(µ_F, σ² V_F) and σ² | F ∼ IG(a_F/2, b_F/2). For the designer model, at this stage, it is only assumed that y has finite mean E(y|D, ∆) = m_D and variance var(y|D, ∆) = Σ_D. Note that the fitted and designer models are disjoint, sharing no common parameters. Under the fitted and designer models, it is possible to derive closed form expressions, to some extent, for the external expected loss. The fitted posterior distribution of γ is a multivariate t-distribution with updated hyperparameters μ̂_F, V̂_F, â_F and b̂_F. The external expected self-information and squared error loss functions are given by (7) and (8), respectively, in which H_SI,1 is a constant not depending on the design. Derivations of (7) and (8) are provided in Section SM4 in the Supplementary Material. The expectation of log b̂_F with respect to the designer model, appearing in the external expected self-information loss (7), is not available in closed form. In what follows, we use a delta approximation (for example Davison, 2003, pages 33-35) where E_{y|D,∆}(log b̂_F) ≈ log E_{y|D,∆}(b̂_F). However, note that by Jensen's inequality, i.e. E_{y|D,∆}(log b̂_F) ≤ log E_{y|D,∆}(b̂_F), the approximate external expected self-information loss (which we minimise to find designs) is actually an upper bound on the true external expected loss. Compare (7) and (8) to the corresponding expressions for the internal expected self-information and squared error loss functions.
Now suppose the designer model is a full-treatment model,

y | τ, σ², D, ∆ ∼ N(Zτ, σ² I_n),

where τ is a q × 1 vector of mean treatment effects. The n × q designer model matrix Z is a function of ∆, where q is the number of unique treatments. For i = 1, . . . , n, the ith row of Z is the vector e_j, i.e. a vector of zeros except for the jth element, which is one, indicating that run i receives unique treatment j. Therefore, m_D and Σ_D follow directly from Z and the designer prior for τ and σ². Interpretation is simplified by assuming non-informative fitted and designer prior distributions where, for the fitted model, b_F = 0, µ_F = 0_p (the p × 1 vector of zeros) and V_F⁻¹ = 0_{p×p} (the p × p matrix of zeros), and, for the designer model, µ_D = 0_q. However, V_D needs to be finite for Σ_D to be finite and for the external expected loss to exist. It is assumed the elements of τ are independent with τ_j ∼ N(0, κσ²/n_j), where κ > 0 and n_j is the number of runs receiving the jth unique treatment, for j = 1, . . . , q. Under this prior specification, the designer posterior mean of τ_j is the sample mean of the corresponding responses shrunk toward zero by a common factor κ/(1+κ), with κ controlling the amount of shrinkage. Consequently, V_D = κ(Z^T Z)⁻¹. Under these prior specifications, minimising the external expected self-information and squared error loss functions is equivalent to minimising the objective functions (9) and (10), respectively, where d = n − q is known as the pure error degrees of freedom, i.e. the number of repeated treatments. Derivations of expressions (9) and (10) are provided in Section SM5 of the Supplementary Material. We refer to designs that minimise (9) and (10) as DE- and AE-optimal, respectively. The corresponding expressions to (9) and (10) under the internal expected loss are denoted L̄_FF,SI(∆) and L̄_FF,SE(∆), respectively, and designs that minimise L̄_FF,SI(∆) and L̄_FF,SE(∆) are called D- and A-optimal, respectively. Hence it can be seen that under the external expected loss there is a trade-off between precision of estimation of γ, as measured by functions of X^T X, and pure error degrees of freedom, i.e. DE- and AE-optimal designs will tend to feature more replication than D- and A-optimal designs, respectively. A numerical example verifying this is now provided.
Consider Example 1 from Gilmour and Trinca (2012) involving an experiment with n = 16 runs and k = 3 design variables. The fitted model is a second-order model including an intercept, three first-order terms, three quadratic terms and three pairwise interactions, i.e. p = 10. We specify κ = n, leading to a unit-information prior for τ (Smith and Spiegelhalter, 1980). The DE- and AE-efficiencies of a design ∆ are defined relative to ∆*_DF,SI and ∆*_DF,SE, the DE- and AE-optimal designs, respectively, with similar expressions for D- and A-efficiency. The D-efficiency of the DE-optimal design is 85%, whereas the DE-efficiency of the D-optimal design is only 7%. The corresponding figures for squared error are 84% and 10%. Clearly, the D- and A-optimal designs are not robust to the external expected loss, whereas the DE- and AE-optimal designs are robust to the internal expected loss. The D-optimal design has q = 16 unique treatments compared to q = 10 for the DE-optimal design. The equivalent values for the A- and AE-optimal designs are q = 14 and q = 10, respectively. The minimum number of unique treatments to be able to estimate all elements of γ under the fitted model is q = p = 10. Thus the DE- and AE-optimal designs feature the maximum amount of replication possible whilst still being able to estimate γ.
Expressions for the external expected loss under self-information (9) and squared error (10) should be compared to the objective functions (11) and (12) under the DP- and AP-criteria (Gilmour and Trinca, 2012), respectively. Note that L̄_DP(∆) in (11) is actually the natural logarithm of the objective function in Gilmour and Trinca (2012), but the optimisation problem is equivalent due to log being a monotonically increasing function. In (11) and (12), F_{p,d,1−α} is the 1−α quantile of the F_{p,d} distribution, with F_{p,d,1−α} being a decreasing function of d. The motivation behind the DP- and AP-criteria is to minimise the volume of the confidence region, or the sum of the confidence intervals, respectively, for γ, where the response variance σ² is estimated under the full-treatment model; a model-independent estimate. In both cases, the objective functions are modified by a decreasing function of the pure error degrees of freedom, d. As shown above, we are able to provide a different, Bayesian decision-theoretic, motivation for the same outcome.
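The pure error degrees of freedom d = n − q driving this trade-off can be computed directly from a design. A small utility sketch (the two 6-run designs below are hypothetical):

```python
import numpy as np

def pure_error_df(design):
    """d = n - q, where q is the number of unique treatment-level
    combinations (rows) in the n-run design."""
    design = np.asarray(design, dtype=float)
    n = design.shape[0]
    q = np.unique(design, axis=0).shape[0]
    return n - q

# Two hypothetical 6-run designs in k = 2 variables:
no_replication = [[-1, -1], [-1, 1], [1, -1], [1, 1], [0, 0], [0.5, 0.5]]
with_replication = [[-1, -1], [-1, -1], [1, 1], [1, 1], [0, 0], [0, 0]]
print(pure_error_df(no_replication), pure_error_df(with_replication))  # 0 and 3
```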

Estimating parameters of a Michaelis-Menten model under model discrepancy
In this section we consider designing an experiment to estimate the parameters of a non-linear model. Finding designs for such models is widely featured in the design of experiments literature (see, e.g. Pronzato and Pázman 2013; Fedorov and Leonov 2014). Under such models, it is hypothesised that the expectation of y_i is η(θ, ∆_i); a mathematical model based on mechanistic or phenomenological arguments. As such, the parameters θ can have interpretable physical meaning. As a specific example, consider the Michaelis-Menten model (e.g. Dette and Biedermann, 2003), commonly used in the study of pharmacokinetics and chemical kinetics. A standard experiment has a response given by reaction velocity y_i measured for k = 1 controllable variable: the scaled concentration of substrate ∆_i = x_i ∈ [0, 1], for i = 1, . . . , n. The model has

η(θ, x) = θ_1 x L / (x L + θ_2),

where θ_1 > 0 is the maximum velocity of the chemical reaction, θ_2 > 0 is the substrate concentration at which the reaction rate is half of θ_1, and L is the maximum possible concentration of substrate, which we assume is L = 400. Supposing a normal distribution, we assume the fitted model

y | θ, σ², F, ∆ ∼ N(η(θ, ∆), σ² I_n),   (14)

where η(θ, ∆) = (η(θ, x_1), . . . , η(θ, x_n))^T and σ² > 0 is the unknown response variance (a nuisance parameter, not of direct interest). The design problem is to choose the values of ∆ = (x_1, . . . , x_n)^T, i.e. the n concentrations.
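As a quick numerical check of the model's shape (with hypothetical parameter values; θ_1 = 1 and θ_2 = 100 are illustrative, not from the paper):

```python
import numpy as np

L = 400.0  # maximum substrate concentration, as in the text

def eta(theta, x):
    """Michaelis-Menten mean response: theta_1 * x * L / (x * L + theta_2)."""
    theta1, theta2 = theta
    return theta1 * x * L / (x * L + theta2)

theta = (1.0, 100.0)  # hypothetical parameter values for illustration
x = np.array([0.0, 0.25, 1.0])
print(eta(theta, x))
# velocity is 0 at x = 0, equals theta_1 / 2 when x * L = theta_2,
# and approaches theta_1 as x * L >> theta_2
```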
The fitted model given by (14) is assumed for pragmatic convenience. However, the reasoning behind the mathematical model is usually incomplete, meaning that there can be a systematic discrepancy between the true expected value of y_i and η(θ, x_i). The Kennedy and O'Hagan (2001) framework aims to model this discrepancy using a Gaussian process, and we assume this for the designer model. Specifically, the designer model is assumed to be y|θ, σ², δ, D, ∆ ∼ N(η(θ, ∆) + δ, σ² I_n).
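Simulating responses from a designer model of this form can be sketched as follows. The squared exponential kernel and the parameterisation of the discrepancy covariance in terms of α (scale) and ρ (lengthscale) are assumptions made for illustration; the paper's exact specification of the Gaussian process prior for δ is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 400.0

def eta(theta, x):
    # Michaelis-Menten mean function
    return theta[0] * x * L / (x * L + theta[1])

def simulate_designer(theta, sigma2, alpha, rho, x):
    """Draw y from N(eta(theta, x) + delta, sigma2 * I), where delta is a
    draw from a zero-mean GP with assumed covariance sigma2 * alpha * K(rho)
    and squared exponential kernel K (illustrative parameterisation)."""
    n = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) / rho) ** 2)
    cov = sigma2 * alpha * K + 1e-10 * np.eye(n)  # jitter for stability
    delta = rng.multivariate_normal(np.zeros(n), cov)
    return eta(theta, x) + delta + rng.normal(0.0, np.sqrt(sigma2), n)

x = np.linspace(0.05, 1.0, 8)
y = simulate_designer(np.array([1.0, 100.0]), 0.1, 0.5, 0.3, x)
assert y.shape == (8,)
```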
We consider three designs found by minimising (i) external expected squared error loss; (ii) external expected trace variance loss; (iii) internal expected squared error loss.
In each case, the expected loss is not available in closed form. Indeed, since the fitted posterior is not of known form, the loss function itself is analytically intractable. To approach this problem, we use a nested Monte Carlo approximation (e.g. Rainforth et al., 2018). The following algorithm is used to approximate the external expected squared error loss.
1. Generate a sample {θ_b, σ²_b, α_b, ρ_b}, b = 1, . . . , B, of size B from the designer prior distribution of θ, σ², α and ρ.
2. For b = 1, . . . , B, generate y_b from the designer likelihood.
3. Generate a sample of size B̃ from the fitted prior distribution of θ and σ².
4. Approximate the external expected squared error loss by the Monte Carlo approximation (18), in which Ê(θ|y_b, F, ∆) is an inner Monte Carlo approximation to the fitted posterior mean of θ.
Note that the sample generated in Step 3 is reused to calculate Ê(θ|y_b, F, ∆) for all b = 1, . . . , B.
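The nested Monte Carlo scheme above can be sketched on a toy problem. Here both the designer and fitted models are taken to be a scalar normal model N(θ, 1) with a N(0, 1) prior on θ, a stand-in (not the paper's Michaelis-Menten set-up) chosen because the exact expected loss is known: it equals the posterior variance, 1/2. The inner posterior mean is approximated by importance-weighting a single reusable fitted-prior sample:

```python
import numpy as np

rng = np.random.default_rng(1)
B, B_tilde = 2000, 2000

# Inner sample from the fitted prior, generated once and reused for all b
inner_theta = rng.normal(0.0, 1.0, B_tilde)

total = 0.0
for _ in range(B):
    theta_b = rng.normal(0.0, 1.0)          # step 1: designer prior draw
    y_b = rng.normal(theta_b, 1.0)          # step 2: designer likelihood draw
    # Self-normalised importance weights: fitted likelihood at inner sample
    logw = -0.5 * (y_b - inner_theta) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    post_mean = w @ inner_theta             # inner approximation of E(theta | y_b)
    total += (theta_b - post_mean) ** 2     # squared error loss
approx = total / B                          # outer Monte Carlo average

# For this toy model the exact external expected loss is 1/2
assert abs(approx - 0.5) < 0.1
```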
In a similar fashion, the external expected trace variance loss can be approximated by an analogous nested Monte Carlo estimator.
To minimise the nested Monte Carlo approximations to the expected loss functions, we use the approximate coordinate exchange algorithm (Overstall and Woods, 2017). We use B = B̃ = 20,000, but it is noted that if ‖θ − Ê(θ|y, F, ∆)‖²_{2,I_2} is continuously differentiable then the convergence rate can be slightly improved (Rainforth et al., 2018) by setting B ∝ B̃² (whilst holding BB̃ fixed). Table 3 shows the efficiencies of the three designs under the three expected loss functions and the number of unique concentrations. The top panel of Figure 1 shows µ(θ, x) plotted against x for 100 values generated from the designer prior of θ, and the bottom panel shows the locations of the concentrations for each of the three designs, i.e. those found by minimising (i) the external expected squared error loss; (ii) the external expected trace variance loss; and (iii) the internal expected squared error loss. The size of the plotting symbol is an increasing function of the replication of each concentration. From Figure 1, the three designs have common clusters of concentrations around 0.2 and 1. The designs found by minimising the external expected trace variance and internal expected squared error loss functions are broadly similar. This is confirmed in Table 3 by the relatively high efficiencies for these designs. Compared to the other designs, the concentrations under the design that minimises the external expected squared error loss are more "spread out", i.e. a compromise between the designs found under the external trace variance and internal squared error losses and a space-filling design.
The similarity of the designs found under the external expected trace variance (ii) and the internal expected squared error (iii) loss functions agrees with the asymptotic results from Section 4.2. In this example, the function d(t) given by (4) depends on t_θ and t_{σ²}, the elements of t corresponding to θ and σ², respectively, and on σ²_0 = E_{σ²|D,∆}(σ²) and ρ_0 = E_{ρ|D,∆}(ρ), the designer prior means of σ² and ρ, respectively. Clearly, d(t_θ, t_{σ²}) is maximised by θ̃ = θ and σ̃² = σ²_0(1 + ρ_0). It then follows from (5) that the fitted marginal posterior distribution of θ is approximated by N(θ, Ĩ⁻¹). Therefore, from Table 2, the external expected trace variance loss is approximated by (19); by contrast, from Table 2, the internal expected squared error loss is approximated by (20). Since σ²_0(1 + ρ_0) and E_{σ²|F,∆}(σ²) do not depend on the design (for this example), the approximations (19) and (20) are minimised by the same design.

Discussion
An extended framework is proposed for Bayesian design whereby the expectation of the loss is taken with respect to the probability distribution implied by an alternative model (the designer model) to the one being fitted to the observed responses. Properties of the framework are explored. The key findings are that the external expected loss is an upper bound on the corresponding internal expected loss under the designer model, providing justification for using a simpler fitted model for the purposes of computing a design. We also observed that designs found under composite and generator loss functions are no longer equivalent. By developing an asymptotic approximation to the external expected loss we were able to show that the composite loss functions focus solely on maximising posterior precision, whereas generator loss functions also incorporate point estimation.
We encourage researchers to consider the extended framework when designing experiments under the Bayesian decision-theoretic approach: if not to specify the design to be used in practice, then at least as a benchmark to assess the performance of that design. Designs can be found under the proposed framework with only minor modification of existing computational methods (see Section 4.2 and Section SM6 in the Supplementary Material).
In two of the examples, the designer model is more complex than the fitted model, and we are, in a sense, attempting to build robustness into the design procedure. Robust Bayesian inference (i.e. inference under the M-open paradigm of Bernardo and Smith 1994, page 385) has become an important research field in recent years (e.g. Bissiri et al., 2016; Grünwald and van Ommen, 2017; Jewson et al., 2018). Related to this is a body of current work on the estimation of the parameters of a non-linear model (see Section 4.2) taking account of the incompleteness of the underlying mathematical model (e.g. Plumlee, 2017; Xie and Xu, 2021). Considering Bayesian decision-theoretic design of experiments under these approaches will be the focal point of future work.
Lastly, we have focused on an experimental aim of parameter estimation. In principle, the extended framework can be generalised to model selection and prediction aims. Exploring this will be a further avenue for future work.

SM1 Proof of Lemma 2.1
Recall that the fitted and designer models are compatible, so that β = θ. Therefore, the internal expected loss under the designer model can be written as (SM1), and the result for compatible models follows from (SM1).

SM2 Proof of Theorem 3.1
The following regularity conditions are sufficient to prove Theorem 3.1.
A3 The fitted log-likelihood log π(y|t, F, ∆) converges almost surely to d(t) for all t ∈ B.
The external expected loss given by (2) is approximated by replacing the loss with the approximate loss under (5).
LAN is a general property. However, it may be non-trivial to verify for a given fitted likelihood. Instead, Miller (2019) provides the following conditions, which are more restrictive but may be easier to verify.

SM3 Derivation of expressions in
The inner expectation on the right hand side of (SM3) is as required.

SM3.2 Squared error loss
For the generator squared error loss, the inner expectation is
SM4 Derivation of expressions (7) and (8) in Section 4.1
The self-information loss function is given by the following, where Γ(·) is the Gamma function. The expected value of λ_SI(γ, y, F, ∆) with respect to the fitted posterior distribution of γ is given in (SM8), where (SM7) follows from, for example, Kotz and Nadarajah (2004, page 23), with ψ(·) the digamma function. The expression (7) is given by taking the expectation of (SM8) with respect to the marginal distribution of y under the designer model. The squared error loss is λ_SE(γ, y, F, ∆) = ‖γ − μ̃_F‖²_{2,I_p}. From properties of the multivariate t-distribution (for example, Kotz and Nadarajah, 2004, page 11), the expected value of λ_SE(γ, y, F, ∆) with respect to the fitted posterior distribution of γ is given in (SM9). The expression (8) is given by taking the expectation of (SM9) with respect to the marginal distribution of y under the designer model.

SM5 Derivation of expressions (9) and (10) in Section 4.1
Under the fitted prior distribution with V⁻¹, the last line follows from d = n − q. The q × q matrix Z^T Z is diagonal, with the lth diagonal element, denoted r_l, giving the number of times treatment l = 1, . . . , q is replicated; note that Σ_{l=1}^q r_l = n. Since Z^T Z is diagonal, the result follows with h_{Xij} the ijth element of H_X and z_{il} the ilth element of Z. Note that if runs i and j receive the same treatment (say l), then h_{Xij} = h_{Xii} = h_{Xjj} and z_{il} = z_{jl} = 1. Lastly, (SM10) is substituted into expressions (7) and (8), ignoring constants that do not depend on ∆.
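The diagonal structure of Z^T Z and the replication counts r_l can be verified directly; the treatment allocation below is an assumed example with n = 6 runs and q = 3 treatments:

```python
import numpy as np

# Assumed allocation: run i receives treatment alloc[i]
alloc = [0, 0, 1, 1, 1, 2]
n, q = len(alloc), 3
Z = np.zeros((n, q))
Z[np.arange(n), alloc] = 1.0          # treatment indicator matrix Z

ZtZ = Z.T @ Z
r = np.diag(ZtZ)                      # replication counts r_l
assert np.allclose(ZtZ, np.diag(r))   # Z^T Z is diagonal
assert r.sum() == n                   # sum of replications equals n
print(r)                              # -> [2. 3. 1.]
```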

SM6 Estimating parameters of cubic spline models
Suppose the aim is to learn the relationship between a scalar controllable variable x ∈ X = [0, 1] and a response y. For i = 1, . . . , n, it is assumed that y_i|µ, σ² ∼ N(µ(x_i), σ²), independently, where µ(x) and σ² are unknown. The function µ(x) can be approximated by the Michaelis-Menten model given by (13). However, this model is known to be an incomplete representation of reality. In light of this, the fitted model expresses the mean as a linear combination of cubic spline basis functions g_1(x), . . . , g_m(x) (for example Woods, 2017, Chapter 5). Instead of fixing m, we allow m to be unknown with the constraint 4 ≤ m ≤ n (the lower bound following from the assumption of cubic splines).
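One standard choice of cubic spline basis is the truncated power basis; the sketch below uses it with equally spaced interior knots, though the paper's exact basis g_1, . . . , g_m may differ. The lower bound m ≥ 4 corresponds to the four monomial terms of a cubic:

```python
import numpy as np

def spline_basis(x, m):
    """Cubic spline basis with m functions: the monomials 1, x, x^2, x^3
    plus m - 4 truncated power terms (x - k_j)^3_+ at equally spaced
    interior knots on [0, 1] (an assumed, illustrative construction)."""
    assert m >= 4, "cubic splines need at least 4 basis functions"
    cols = [x ** d for d in range(4)]              # 1, x, x^2, x^3
    knots = np.linspace(0.0, 1.0, m - 2)[1:-1]     # m - 4 interior knots
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)                   # n x m matrix G_m

x = np.linspace(0.0, 1.0, 10)
G = spline_basis(x, 6)
assert G.shape == (10, 6)
```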
where R_m = I_n + κ G_m G_m^T. The expressions (SM11) and (SM12) follow from the fact that the fitted model is a normal linear model and can be found in, for example, O'Hagan and Forster (2004, Chapter 11).
The chosen loss function is the predictive squared error λ(y, µ(·), F, ∆) = ∫_X (µ(x) − E(µ(x)|y, F, ∆))² dx, a measure of the accuracy of the model-averaged fitted posterior mean of µ(x). We consider finding designs by minimising (a) the internal expected predictive squared error loss; and (b) the external expected predictive squared error loss.
For (b), the external expected predictive squared error loss, the disjoint designer model has µ(x) = η(ξ, x), where η(·, x) is given by (13). Neither expected loss function is available in closed form; both require approximation, for which we use Monte Carlo. The internal expected predictive squared error loss can be approximated using the following steps.
The external expected predictive squared error loss can be approximated using the following steps.
1. For b = 1, . . . , B, generate ξ_b from the designer prior distribution and σ²_b from IG(a_D/2, b_D/2). Then generate y_b from N(η(ξ_b, ∆), σ²_b I_n).
For both approximations, the integration in (SM13) is evaluated using quadrature. The hyperparameters of the fitted and designer prior distributions of σ² are set as a_F = a_D = 6 and b_F = b_D = 4, ensuring that the prior mean and variance are both one. The value of κ is chosen to be 10⁶, indicating a very vague fitted prior distribution for γ^(m). Designs are found with n = 10 using the approximate coordinate exchange algorithm. Table SM1 shows the efficiencies of the two designs under the internal and external predictive squared error losses. Figure SM1 shows the values of ∆ = (x_1, . . . , x_n)^T for the two designs (where the size of the plotting symbol is an increasing function of the replication). The design found by minimising the internal expected loss is similar to an evenly-spaced design, whereas the design found by minimising the external expected loss appears to be a compromise between the evenly-spaced design and the structure of points found for the Michaelis-Menten model in Section 4.2. Further, the internal expected loss was found to be flat near its minimising design, evidenced by the high efficiency, under the internal expected loss, of the design minimising the external expected loss. This agrees with the observation of Ryan et al. (2016b) that expected loss functions become flat under vague prior distributions (here, κ = 10⁶). By contrast, the design found by minimising the external expected loss provides a design of high efficiency for the fitted model but reflects best current knowledge of the underlying process.
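The quadrature evaluation of the predictive squared error integral over X = [0, 1] can be sketched as follows; mu_true and mu_hat are hypothetical stand-ins for µ(x) and the model-averaged fitted posterior mean, chosen so the integral is known exactly:

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 201)          # quadrature grid on X = [0, 1]
mu_true = np.sin(2 * np.pi * grid)         # hypothetical true mean function
mu_hat = mu_true + 0.1                     # hypothetical fitted posterior mean

# Trapezoidal rule for the integral of (mu(x) - E(mu(x)|y, F, Delta))^2 over X
sq = (mu_true - mu_hat) ** 2
h = grid[1] - grid[0]
loss = h * (0.5 * sq[0] + sq[1:-1].sum() + 0.5 * sq[-1])

# The constant offset 0.1 gives an exact integral of 0.1^2 = 0.01
assert abs(loss - 0.01) < 1e-9
```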