A Role for Symmetry in the Bayesian Solution of Differential Equations

The interpretation of numerical methods, such as finite difference methods for differential equations, as point estimators suggests that formal uncertainty quantification can also be performed in this context. Competing statistical paradigms can be considered and Bayesian probabilistic numerical methods (PNMs) are obtained when Bayesian statistical principles are deployed. Bayesian PNMs have the appealing property of being closed under composition, such that uncertainty due to different sources of discretisation in a numerical method can be jointly modelled and rigorously propagated. Despite recent attention, no exact Bayesian PNM for the numerical solution of ordinary differential equations (ODEs) has been proposed. This raises the fundamental question of whether exact Bayesian methods for (in general nonlinear) ODEs even exist. The purpose of this paper is to provide a positive answer for a limited class of ODEs. To this end, we work at a foundational level, where a novel Bayesian PNM is proposed as a proof-of-concept. Our proposal is a synthesis of classical Lie group methods, to exploit underlying symmetries in the gradient field, and non-parametric regression in a transformed solution space for the ODE. The procedure is presented in detail for first and second order ODEs and relies on a certain strong technical condition -- existence of a solvable Lie algebra -- being satisfied. Numerical illustrations are provided.


Introduction
Numerical methods underpin almost all of scientific, engineering and industrial output. In the abstract, a numerical task can be formulated as the approximation of a quantity of interest Q : Y → Q, subject to a finite computational budget. The true underlying state y† ∈ Y is typically high- or infinite-dimensional, so that only limited information is provided and exact computation of Q(y†) is prohibited. For example, numerical integration aims to approximate an integral Q(y†) = ∫ y†(t) dt given the values A(y†) = {(x_i, y†(x_i))}_{i=1}^n of the integrand y† on a finite number of abscissae {x_i}_{i=1}^n. Similarly, a numerical approximation to the solution Q(y†) = y† of a differential equation dy/dx = f(x, y(x)), y(x_0) = y_0, must be based on at most a finite number of evaluations of f, the gradient field. In this viewpoint a numerical method corresponds to a map b : A → Q, as depicted in Figure 1a, where b(a) represents an approximation to the solution of the differential equation based on the information a ∈ A.
The increasing ambition and complexity of contemporary applications is such that the computational budget can be extremely small compared to the precision that is required at the level of the quantity of interest. As such, in many important problems it is not possible to reduce the numerical error to a negligible level. Fields acutely associated with this challenge include climate forecasting (Wedi, 2014), computational cardiology (Chabiniok et al., 2016) and molecular dynamics (Perilla et al., 2015). In the presence of non-negligible numerical error, it is unclear how scientific interpretation of the output of computation can proceed. For example, a posteriori analysis of traditional numerical methods can be used to establish hard upper bounds on the numerical error, but these bounds typically depend on an unknown global constant. In the case of ODEs, this may be the maximum value of a norm ‖f‖ of the gradient field (see e.g. Estep, 1995). If ‖f‖ were known, it would be possible to provide a hard bound on numerical error. However, in the typical numerical context all that is known is that ‖f‖ is finite. One could attempt to approximate ‖f‖ with cubature, but that itself requires a numerical cubature method whose error is required to obey a known bound. In general, therefore, there are no hard error bounds without global information being a priori provided (Larkin, 1974). Our aim in this paper is to consider, as an alternative to traditional numerical analysis, an exact Bayesian framework for solution uncertainty quantification in the ordinary differential equation context.

Probabilistic Numerical Methods
The field of probabilistic numerics dates back to Larkin (1972) and a modern perspective is provided in Hennig et al. (2015); Oates and Sullivan (2019). Under the abstract framework just described, numerical methods can be interpreted as point estimators in a statistical context, where the state y † can be thought of as a latent variable in a statistical model, and the 'data' consist of information A(y † ) that does not fully determine the quantity of interest Q(y † ) but is indirectly related to it. Hennig et al. (2015) provide an accessible introduction and survey of the field. In particular, they illustrated how PNM can be used to quantify uncertainty due to discretisation in important scientific problems, such as astronomical imaging.
Let the notation Σ_Y denote a σ-algebra on the space Y and let P_Y denote the set of probability measures on (Y, Σ_Y). A probabilistic numerical method (PNM) is a procedure which takes as input a 'belief' distribution μ ∈ P_Y, representing epistemic uncertainty with respect to the true (but unknown) value y†, along with a finite amount of information, A(y†) ∈ A. The output is a distribution B(μ, A(y†)) ∈ P_Q on (Q, Σ_Q), representing epistemic uncertainty with respect to the quantity of interest Q(y†) after the information A(y†) has been processed. For example, a PNM for an ordinary differential equation (ODE) takes an initial belief distribution defined on the solution space of the differential equation, together with information arising from a finite number of evaluations of the gradient field, plus the initial condition of the ODE, to produce a distribution over either the solution space of the ODE, or perhaps some derived quantity of interest. In this paper, the measurability of A and Q will be assumed.
Despite computational advances in this emergent field, until recently there had not been an attempt to establish rigorous statistical foundations for PNM. In Cockayne et al. (2019) the authors argued that Bayesian principles can be adopted. In brief, this framework requires that the input belief distribution μ carries the semantics of a Bayesian agent's prior belief and the output of a PNM agrees with the inferences drawn when the agent is rational. To be more precise recall that, in this paper, information is provided in a deterministic 1 manner through (1) and thus Bayesian inference corresponds to conditioning of μ on the level sets of A. Let Q_# : P_Y → P_Q denote the push-forward map associated to Q, i.e. Q_#(μ)(S) = μ(Q^{−1}(S)) for all S ∈ Σ_Q. Let {μ^a}_{a∈A} ⊂ P_Y denote the disintegration, assumed to exist, 2 of μ ∈ P_Y along the map A.
Definition 1 (Bayesian PNM). The PNM B is said to be Bayesian if, for each a in the image of A, the output satisfies B(μ, a) = Q_#(μ^a).

This definition is intuitive; the output of the PNM should coincide with the marginal distribution for Q(y†) according to the disintegration element μ^a ∈ P_Y, based on the information a ∈ A that was provided. The definition is equivalent to the statement that Figure 1b is a commutative diagram. In Cockayne et al. (2019) the map A was termed an information operator and the map B was termed a belief update operator; we adhere to these definitions in our work. The Bayesian approach to PNM confers several important benefits:

1 It is of course possible to perform Bayesian inference in the noisy-data context, but for the ODEs considered in this paper we assume that one can obtain noiseless evaluations of the gradient field.
2 The reader unfamiliar with the concept of a disintegration can treat μ^a as a technical notion of the 'conditional distribution of y given A(y) = a' when reading this work. The disintegration theorem, Thm. 1 of Chang and Pollard (1997), guarantees existence and uniqueness up to an A_#μ-null set under the weak requirement that Y is a metric space, Σ_Y is the Borel σ-algebra, μ is Radon, Σ_A is countably generated and Σ_A contains all singletons {a} for a ∈ A.
• The input μ and output B(μ, a) belief distributions can be interpreted, respectively, as a prior and (marginal) posterior. 3 As such, they automatically inherit the stronger formal semantics and philosophical foundations that underpin the Bayesian framework and, in this sense, are well-understood (see e.g. Gelman and Shalizi, 2013).
• The definition of Bayesian PNM is operational. Thus, if we are presented with a prior μ and information a then there is a unique Bayesian PNM and it is constructively defined.
• The class of Bayesian PNM is closed under composition, such that uncertainty due to different sources of discretisation can be jointly modelled and rigorously propagated. This point will not be discussed further in this work, but we refer the interested reader to Section 5 of Cockayne et al. (2019).
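The defining property of a Bayesian PNM, namely that the output B(μ, a) coincides with the push-forward through Q of the disintegration element μ^a, can be made concrete in a toy finite setting, where the disintegration reduces to elementary conditioning. The example below is our own illustration and is not taken from Cockayne et al. (2019):

```python
from fractions import Fraction

# Toy Bayesian PNM on a finite state space.
# A 'state' y is a function on {0, 1}, encoded as a tuple (y(0), y(1)).
# Information operator A(y) = y(0); quantity of interest Q(y) = y(1).
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
prior = {y: Fraction(1, 4) for y in states}  # uniform prior mu

A = lambda y: y[0]
Q = lambda y: y[1]

def bayesian_pnm(prior, a):
    """B(mu, a) = Q_#(mu^a): condition mu on the level set A(y) = a,
    then push the conditional forward through Q."""
    level_set = {y: p for y, p in prior.items() if A(y) == a}
    z = sum(level_set.values())
    posterior_q = {}
    for y, p in level_set.items():
        posterior_q[Q(y)] = posterior_q.get(Q(y), Fraction(0)) + p / z
    return posterior_q

out = bayesian_pnm(prior, a=0)
# Conditioning on y(0) = 0 leaves (0, 0) and (0, 1) equally likely, so the
# belief about Q(y) = y(1) is uniform on {0, 1}.
```

In this finite setting the belief update operator is unique and constructively defined, exactly as the second bullet point asserts.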
Nevertheless, the strict definition of Bayesian PNM limits the scope to design convenient computational algorithms and indeed several proposed PNM are not Bayesian (see Table 1 in Cockayne et al., 2019). The challenge is two-fold: for a Bayesian PNM, the elicitation of an appropriate prior distribution μ and the exact computation of its disintegration {μ^a}_{a∈A} must both be addressed. In the next section we argue that, perhaps as a consequence of these constraints, a strictly Bayesian PNM for the numerical solution of an ODE does not yet exist.

Existing Work for ODEs
The first PNM (of any flavour) for the numerical solution of ODEs, of which we are aware, was due to Skilling (1992). Two decades later, this problem is receiving renewed critical attention as part of the active development of PNM. The aim of this section is to provide a high-level overview of existing work and to argue that existing methods do not adhere to the definition of Bayesian PNM.

Notation
The notational convention used in this paper is that the non-italicised y denotes a generic function, whereas the italicised y denotes a scalar value taken by the function y. The notation y † is reserved for the true solution to an ODE. Throughout, the underlying state space Y is taken to be a space occupied by the true solution of the ODE, i.e. y † ∈ Y.

Skilling (1992)
The first paper on this topic, of which we are aware, was Skilling (1992). This will serve as a prototypical PNM for the numerical solution of an ODE. Originally described as 'Bayesian' by the author, we will argue that, at least in the strict sense of Definition 1, it is not a Bayesian PNM. Consider a generic univariate first-order initial value problem

dy/dx = f(x, y(x)), y(x_0) = y_0, x ∈ [x_0, x_T]. (2)

Throughout this paper all ODEs that we consider will be assumed to be well-defined and admit a unique solution y† ∈ Y, where Y is some pre-specified set. In this paper the quantity of interest Q(y†) will either be the solution curve y† itself or the value y†(x_T) of the solution at a specific input (in this section it will be the former). The approach outlined in Skilling (1992) allows for a general prior μ ∈ P_Y. The gradient field f is treated as a 'black box' oracle that can be queried at a fixed computational cost. Thus we are provided with evaluations of the gradient field at a finite number of input pairs (x_i, y_i). This approach of treating evaluations of the gradient field as 'data' will be seen to be a common theme in existing PNM for ODEs and theoretical support for this framework is rooted in the field of information-based complexity (Traub and Woźniakowski, 1992). Let a_i = f(x_i, y_i) and let a_{0:i} = [a_0, . . . , a_i] collect the evaluations obtained so far. The selection of the input pairs (x_i, y_i) on which f is evaluated is not constrained and several possibilities, of increasing complexity, were discussed in Skilling (1992). To fix ideas, the simplest such approach is to proceed iteratively as follows:

(0.1) The first pair (x_0, y_0) is fully determined by the initial condition of the ODE.
(0.2) The oracle then provides one piece of information, a 0 = f (x 0 , y 0 ).
(0.3) The prior μ is updated according to a 0 , leading to a belief distribution μ 0 which is just the disintegration element μ a 0 .
(1) A discrete time step h > 0 is performed, taking x_1 = x_0 + h, and a particular point estimate y_1 = ∫ y(x_1) dμ_0(y) for the unknown true value y†(x_1) is obtained. This specifies the second pair (x_1, y_1).
The process continues similarly, such that at time step i−1 we have a belief distribution μ_{i−1} = B(μ, a_{0:i−1}) ∈ P_Y, where the general belief update operator B is yet to be defined, and the analogous steps are performed to obtain (x_i, y_i), a_i and μ_i. The final output is a probability distribution μ_n = B(μ, a_{0:n}) ∈ P_Y. Now, strictly speaking, the method just described is not a PNM in the concrete sense that we have defined. Indeed, the final output μ_n is a deterministic function of the values a_{0:n} of the gradient field that were obtained. However, in the absence of additional assumptions on the global smoothness of the gradient field, the values of f(x, y) outside any open neighbourhood of the true solution curve C = {(x, y) : y = y†(x), x ∈ [x_0, x_T]} do not determine the solution of the ODE and, conversely, the solution of the ODE provides no information about the values of the gradient field outside any open neighbourhood of C. Thus it is not possible, in general, to write down an information operator A : Y → A that reproduces the information a_{0:n} when applied to the solution curve y†(·) of the ODE.
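To fix intuition, the iterative scheme above can be sketched with a Gaussian process prior on y, in which the gradient evaluations a_i = f(x_i, y_i) are treated as observations of dy/dx(x_i) subject to assumed Gaussian noise of variance σ². The kernel, grid and test problem below are our own illustrative choices; the sketch is not a faithful reproduction of Skilling (1992):

```python
import numpy as np

# Gaussian process prior y ~ GP(0, k) with a squared-exponential kernel.
# Derivative observations induce the cross-covariances below.
ell = 0.5
k      = lambda s, t: np.exp(-(np.asarray(s) - np.asarray(t)) ** 2 / (2 * ell ** 2))
k_dt   = lambda s, t: ((np.asarray(s) - np.asarray(t)) / ell ** 2) * k(s, t)       # cov(y(s), y'(t))
k_dsdt = lambda s, t: (1 / ell ** 2 - (np.asarray(s) - np.asarray(t)) ** 2 / ell ** 4) * k(s, t)  # cov(y'(s), y'(t))

def posterior_mean(t, x0, y0, xs, a, sigma):
    """Posterior mean of y(t) given y(x0) = y0 and noisy derivative data a."""
    t, xs, a = np.atleast_1d(t), np.asarray(xs), np.asarray(a)
    n = len(xs)
    # Gram matrix over [y(x0), y'(x_0), ..., y'(x_{n-1})]; noise of variance
    # sigma^2 is assumed on the derivative observations only.
    G = np.zeros((n + 1, n + 1))
    G[0, 0] = k(x0, x0) + 1e-10
    G[0, 1:] = k_dt(x0, xs)
    G[1:, 0] = G[0, 1:]
    G[1:, 1:] = k_dsdt(xs[:, None], xs[None, :]) + sigma ** 2 * np.eye(n)
    w = np.linalg.solve(G, np.concatenate([[y0], a]))
    cross = np.column_stack([k(t, x0), k_dt(t[:, None], xs[None, :])])
    return cross @ w

def skilling_sketch(f, x0, y0, grid, sigma):
    """Iterate: estimate y at the next abscissa from the current posterior,
    query the oracle f there, and update the Gaussian process."""
    xs, ys = [x0], [y0]
    for xi in grid[1:]:
        a = [f(u, v) for u, v in zip(xs, ys)]
        yi = posterior_mean(xi, x0, y0, xs, a, sigma)[0]
        xs.append(float(xi)); ys.append(float(yi))
    a = [f(u, v) for u, v in zip(xs, ys)]
    return lambda t: posterior_mean(t, x0, y0, xs, a, sigma)
```

When f does not depend on its second argument the derivative data are exact, so the assumed noise σ² is then a deliberate mis-specification, a point returned to below.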
The approach taken in Skilling (1992) was therefore to posit an approximate information operator Â and a particular belief update operator B, which are now described. The approximate information operator is motivated by the intuition that if y†(x_i) is well-approximated by y_i at the abscissa x_i, then dy†/dx(x_i) should be well-approximated by f(x_i, y_i). That is, the following approximate information operator was constructed:

Â(y) = [ dy/dx(x_0), . . . , dy/dx(x_n) ]. (3)

Of course, Â(y†) ≠ a_{0:n} in general. To acknowledge the approximation error, Skilling (1992) proposed to model the information with an approximate Gaussian likelihood of the form

a_{0:n} | y ∼ N(Â(y), σ² I). (5)

This was referred to in Skilling (1992) as simply a "likelihood" and, together with μ_0 = μ^{a_0}, the output μ_n is completely specified. Here σ is a fixed positive constant, however in principle a non-diagonal covariance matrix can also be considered. The negative consequences of basing inferences on an approximate information operator Â are potentially twofold. First, recall that values of the gradient field that are not contained on the true solution curve of the ODE do not, in principle, determine the true solution curve y†. It is therefore unclear if these values should be taken into account at all. Second, in the special case where the gradient field f does not depend on its second argument, the quantities dy†/dx(x_i) and f(x_i, y_i) are identical. From this perspective, μ_n represents inference under a mis-specified likelihood, since information is treated as erroneous when it is in fact exact. The use of a mis-specified likelihood violates the likelihood principle and implies violation of the Bayesian framework. This confirms, through a different argument, that the approach of Skilling (1992) cannot be Bayesian in the strict sense of Definition 1. After Skilling (1992), several authors have proposed improvements to the above method. The approach of Schober et al. (2014) considered (5) in the σ ↓ 0 limit.
In order that exact conditioning can be performed in this limit, the input belief distribution μ was restricted to be a k-times integrated Wiener measure on the solution space of the ODE. The tractability of the integrated Wiener measure leads to a closed-form characterisation of the posterior and enables computation to be cast as a Kalman filter. This direction of research can be motivated by the following fact: for k ∈ {1, 2} the authors prove that if the input pair (x_1, y_1) is taken as y_1 = ∫ y(x_1) dμ_0(y), as indicated in Section 1.2, then the smoothing estimate ŷ_1 = ∫ y(x_1) dμ_1(y), i.e. the posterior mean for y(x_1) based on the information a_0, a_1, coincides with the deterministic approximation to y†(x_1) that would be provided by a k-th order Runge-Kutta method. As such, theoretical guarantees such as local convergence order are inherited. For k = 3 it was shown that the same conclusion can be made to hold, provided that the input pair (x_1, y_1) is selected in a manner that is no longer obviously related to μ_0. In all cases the identification with a classical Runge-Kutta method does not extend beyond iteration n = 1. Similar connections to multistep methods of Nordsieck and Adams form were identified, respectively, in Schober et al. (2019) and Teymur et al. (2016, 2018). The approach of Schober et al. (2014) is not Bayesian in the sense of Definition 1, which can again be deduced from dependence on values of the gradient field away from the true solution curve, so that the likelihood principle is violated.
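A minimal sketch of this filtering perspective, for a once-integrated Wiener process prior with state (y, dy/dx) and the gradient field evaluated at the predictive mean, is given below; the step size, diffusion scale and test problem are our own illustrative choices:

```python
import numpy as np

def iwp_ode_filter(f, x0, y0, h, n, q=1.0):
    """Kalman filter for dy/dx = f(x, y) under a once-integrated Wiener
    process prior on y.  The 'data' at each step is the gradient field
    evaluated at the predictive mean, in the spirit of Schober et al. (2014)."""
    A = np.array([[1.0, h], [0.0, 1.0]])                 # state transition
    Q = q * np.array([[h ** 3 / 3, h ** 2 / 2],
                      [h ** 2 / 2, h]])                  # process noise
    H = np.array([[0.0, 1.0]])                           # observe dy/dx
    m = np.array([y0, f(x0, y0)])
    P = np.zeros((2, 2))
    for i in range(n):
        x = x0 + (i + 1) * h
        m, P = A @ m, A @ P @ A.T + Q                    # predict
        z = f(x, m[0])                                   # gradient 'data'
        S = (H @ P @ H.T)[0, 0]                          # noiseless update
        K = (P @ H.T / S)[:, 0]
        m = m + K * (z - m[1])
        P = P - np.outer(K, H @ P)
    return m, P
```

For a gradient field that depends on x only, each filter update reduces to a trapezoidal quadrature step, which illustrates how classical convergence guarantees can be inherited.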

Kersting and Hennig (2016)
The subsequent work of Kersting and Hennig (2016) attempted to elicit an appropriate non-zero covariance matrix for use in (5), in order to encourage uncertainty estimates to be better calibrated. Their proposed approximate likelihood can be viewed as the predictive marginal likelihood for the value f(x_i, y(x_i)) based on μ_{i−1}. From a practical perspective, the approach is somewhat circular, as the integrals in (7) and (8) involve the black-box gradient field f and therefore cannot be computed. The authors suggested a number of ways in which these quantities could be numerically approximated, 4 which involve evaluating f(x_i, y_i) at one or more values y_i that must be specified. The overall approach again violates the likelihood principle and is therefore not Bayesian in the sense of Definition 1.

Chkrebtii et al. (2016)
The original work of Chkrebtii et al. (2016) is somewhat related to Kersting and Hennig (2016); however, instead of using the mean of the current posterior as input to the gradient field, the input pair (x_i, y_i) was selected by sampling y_i from the marginal distribution for y(x_i) implied by μ_{i−1}. Compared to (6), (7) and (8), the approximate likelihood in this approach does not rely on integrals over the unknown gradient field. However, the approach still relies on the approximate information operator in (3) and is thus not Bayesian according to Definition 1.

Conrad et al. (2017); Abdulle and Garegnani (2018)
The approaches proposed in Conrad et al. (2017) and Abdulle and Garegnani (2018) are not motivated in the Bayesian framework, but instead seek to introduce a stochastic perturbation into a classical numerical method. Both methods focus on the quantity of interest Q(y†) = y†(x_T). In the simple context of (2), the method of Conrad et al. (2017) augments the explicit Euler method with a stochastic perturbation:

y_{i+1} = y_i + h f(x_i, y_i) + ξ_{i+1}.

The distribution of the sequence (ξ_i)_{i=1}^n must be specified. In the simplest case where the ξ_i are modelled as independent, say with ξ_i ∼ ρ, the canonical flow map Φ_i : R → R of the explicit Euler method, defined as Φ_i(z) = z + h f(x_i, z), is replaced by the probabilistic counterpart Ψ_i : P_R → P_R given by the convolution Ψ_i(ν) = ((Φ_i)_# ν) ∗ ρ, through which stochasticity can be propagated. The output of the method of Conrad et al. (2017) is then B = Ψ_n ∘ · · · ∘ Ψ_1(δ(y_0)), where δ(z) denotes an atomic distribution on z ∈ R. For the case where ρ has zero mean, it can be shown that the mean of B equals Φ_n ∘ · · · ∘ Φ_1(y_0), which is exactly the deterministic approximation produced by the explicit Euler method.
This framework can be practically problematic, since ξ_i is charged with modelling the extrapolation error and such errors are not easily modelled as independent random variables; Section 2.8 of Higham (2002) is devoted to this point. Thus, if for example f(x, y) = y, the true linearisation error at step i is e^{x_i} − (1 + h) e^{x_{i−1}}, so that the 'true' sequence (ξ_i)_{i=1}^n in this case is monotonic and exponentially unbounded. The challenge of designing a stochastic model for the sequence (ξ_i)_{i=1}^n that reflects the highly structured nature of the error remains unresolved. On the other hand, the mathematical properties of this method are now well-understood (Lie et al., 2019). The proposal of Abdulle and Garegnani (2018) was to instead consider randomisation of the inputs {x_i}_{i=0}^T in the context of a classical numerical method, also outside of the Bayesian framework.
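The randomised Euler construction can be sketched as follows. The choice ξ_i ∼ N(0, h³), matching the order of the local truncation error, is one common option chosen here for illustration rather than a prescription taken from Conrad et al. (2017):

```python
import numpy as np

def perturbed_euler(f, x0, y0, h, n, rng, scale):
    """One sample path of the randomised Euler method
    y_{i+1} = y_i + h f(x_i, y_i) + xi_i, with xi_i ~ N(0, scale)."""
    y = y0
    for i in range(n):
        y = y + h * f(x0 + i * h, y) + rng.normal(0.0, np.sqrt(scale))
    return y

rng = np.random.default_rng(0)
f = lambda x, y: y                        # dy/dx = y, exact solution e^x
h, n = 0.1, 10
samples = [perturbed_euler(f, 0.0, 1.0, h, n, rng, scale=h ** 3)
           for _ in range(2000)]
euler = perturbed_euler(f, 0.0, 1.0, h, n, rng, scale=0.0)  # deterministic
# For zero-mean perturbations and this linear f, the ensemble mean recovers
# the deterministic Euler approximation 1.1**10, not the exact value e.
```

The independent zero-mean ξ_i here stand in contrast to the monotone, exponentially growing 'true' error sequence described above, which is precisely the modelling difficulty at issue.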

Cockayne et al. (2019); Tronarp et al. (2019)
The survey just presented raises the question of whether a Bayesian PNM for ODEs can exist at all. A first step toward this goal was taken in Cockayne et al. (2019), where it was argued that an information operator can be constructed if the vector field f is brought to the left-hand-side in (2). Specifically, they proposed the information operator

Ã(y) = [ dy/dx(x_i) − f(x_i, y(x_i)) ]_{i=0}^n,

for which the 'data' are trivial: ã_n = 0. It was rigorously established that the approximate likelihood, in the σ ↓ 0 limit, recovers an exact Bayesian PNM, where the convergence →_F is in an integral probability metric defined by a suitable set F of test functions (see Section 4 of Cockayne et al., 2019). However, the dependence of the information operator Ã on f means that this cannot be used as the basis for a practical method. Indeed, unless f depends linearly on its second argument and conjugacy properties of the prior can be exploited, the posterior cannot easily be characterised. Approximate techniques from nonlinear filtering were proposed to address this challenge in Tronarp et al. (2019).

Our Contribution
The comprehensive literature review in the previous section reveals not only that no Bayesian PNM for ODEs has yet been proposed, but also that such an endeavour may be fundamentally difficult. Indeed, a theme that has emerged with existing PNM for ODEs, which can be traced back to Skilling (1992), is the use of approximate and subjective forms for the likelihood. The complex, implicit relationship between the latent ODE solution y† and the data f(x_i, y_i) arising from the gradient field appears to preclude use of an exact likelihood. Of course, violation of the likelihood principle is not traditionally a concern in the design of a numerical method, yet if the strictly richer output that comes with a Bayesian PNM is desired, then clearly adherence to the likelihood principle is important. It is therefore natural to ask the question, "under what conditions can exact Bayesian inference for ODEs be made?". This paper presents a proof-of-concept PNM for the numerical solution of a (limited) class of ODEs that is both (a) Bayesian in the sense of Definition 1 and (b) can in principle be implemented. The method being proposed is indicated in Figure 2 and its main properties are as follows:
• The classical theory of Lie groups is exploited, for the first time in the context of PNM, to understand when an ODE of the form in (2) can be transformed into an ODE whose gradient field is a function of the independent state variable only, reducing the ODE to an integral.
• For ODEs that admit a solvable Lie algebra, our proposal can be shown to simultaneously perform exact Bayesian inference on both the original and the Lie-transformed ODE. Crucially, as we explain later, to identify a Lie algebra only high-level a priori information about the ODE is required. The case of first- and second-order ODEs is presented in detail, but the method itself is general.
• In general the specification of prior belief can be difficult. The prior distributions that we construct are guaranteed to respect aspects of the structure of the ODE. As such, our priors are, to some extent, automatically adapted to the ODE at hand as opposed to being arbitrarily posited.
• In addition to the benefits conferred in the Bayesian framework, detailed in Section 1.1 and in Cockayne et al. (2019), the method being proposed can be computationally realised. On the other hand, there is a cost in terms of the run-time of the method that is substantially larger than existing, non-Bayesian approaches (especially classical numerical methods). As such, we consider this work to be a proof-of-concept rather than an applicable Bayesian PNM.
The remainder of the paper is structured as follows: Section 2 is dedicated to a succinct review of Lie group methods for ODEs. In Section 3, Lie group methods are exploited to construct priors over the solution space of the ODE whenever a solvable Lie algebra is admitted and exact Bayesian inference is performed on a transformed version of the original ODE which takes the form of an integral. Numerical experiments are reported in Section 4. Finally, some conclusions and recommendations for future research are drawn in Section 5.

Background
This section provides a succinct overview of classical Lie group methods, introduced in the 19th century by Sophus Lie in the differential equation context (Hawkins, 2012). Lie developed the fundamental notion of a Lie group of transformations, which roughly correspond to maps that take one solution of the ODE to another. This provided a formal generalisation of certain algebraic techniques, such as dimensional analysis and transformations based on spatial symmetries, that can sometimes be employed to algebraically reduce the order of an ODE.
This section proceeds as follows: First, in Section 2.1 we introduce a one-parameter Lie group of transformations and then, in Section 2.2, we explain what it means for a curve or a surface to be transformation-invariant. In Section 2.3 we recall consequences of Lie's theory in the ODE context. Last, in Section 2.4 the generalisation to multiparameter Lie groups is indicated. Our development is heavily influenced by Bluman and Anco (2002) and we refer the reader to their book when required.

One-Parameter Lie Groups of Transformations
The purpose of this section is to recall essential definitions, together with the first fundamental theorem of Lie, which relates a Lie group of transformations to its infinitesimal generator. In what follows we consider a fixed domain D ⊂ R^d and denote a generic state variable as x ∈ D.

Definition 2 (One-Parameter Group of Transformations). A one-parameter group of transformations on D is a map X : D × S → D, defined on D × S for some set S ⊂ R, together with a bivariate map φ : S × S → S, such that the following hold:
(1) For each ε ∈ S, the transformation X(·, ε) is a bijection on D.
(2) (S, φ) forms a group with composition law φ, and X(X(x, ε), δ) = X(x, φ(ε, δ)) for all x ∈ D and all ε, δ ∈ S.
(3) If ε_0 is the identity element in (S, φ), then X(·, ε_0) is the identity map on D.
In what follows we continue to use the shorthand notation x* = X(x, ε). The notion of a Lie group additionally includes smoothness assumptions on the maps that constitute a group of transformations. Recall that a real-valued function is analytic if it can be locally expressed as a convergent power series.
Definition 3 (One-Parameter Lie Group of Transformations). Let X, together with φ, form a one-parameter group of transformations on D. Then we say that X, together with φ, form a one-parameter Lie group of transformations on D if, in addition, the following hold: (5) S is a (possibly unbounded) interval in R.
(7) For each x ∈ D, X(x, ·) is an analytic function on S.
Without loss of generality it will be assumed, through re-parametrisation if required, that S contains the origin and ε = 0 is the identity element in (S, φ). The definition is illustrated through the following examples:

Example 1 (Translation in the x-Axis). The one-parameter transformation x*_1 = x_1 + ε, x*_2 = x_2, for ε ∈ R, forms a Lie group of transformations on D = R² with group composition law φ(ε, δ) = ε + δ.
Example 2 (Rotation Group). The one-parameter transformation x*_1 = x_1 cos(ε) − x_2 sin(ε), x*_2 = x_1 sin(ε) + x_2 cos(ε), for ε ∈ R, again forms a Lie group of transformations on D = R² with group composition law φ(ε, δ) = ε + δ. The first fundamental theorem of Lie establishes that a Lie group of transformations can be characterised by its infinitesimal generator, defined next:

Definition 4 (Infinitesimal Transformation). Let X be a one-parameter Lie group of transformations. Then the transformation

x* = x + ε ξ(x), where ξ(x) = ∂X(x, ε)/∂ε |_{ε=0},

is called the infinitesimal transformation associated to X and the map ξ is called an infinitesimal.
Definition 5 (Infinitesimal Generator). The infinitesimal generator of a one-parameter Lie group of transformations X is defined to be the operator X = ξ · ∇, where ξ is the infinitesimal associated to X and ∇ = (∂/∂x_1, ∂/∂x_2, . . . , ∂/∂x_n) is the gradient.

Example 4 (Ex. 1, continued). For Ex. 1 we have ξ(x) = (1, 0), so the infinitesimal generator for translation in the x-axis is X = ∂/∂x_1.

Example 5 (Ex. 2, continued). Similarly, the infinitesimal generator for the rotation group is X = −x_2 ∂/∂x_1 + x_1 ∂/∂x_2.

The first fundamental theorem of Lie provides a constructive route to obtain the infinitesimal generator from the transformation itself:

Theorem 1 (First Fundamental Theorem of Lie; see pages 39-40 of Bluman and Anco (2002)). A one-parameter Lie group of transformations X is characterised by the initial value problem

dx*/dτ = ξ(x*), x* = x when τ = 0, (9)

where τ(ε) is a parametrisation of ε which satisfies τ(0) = 0 and, for ε ≠ 0, τ(ε) = ∫_0^ε Γ(δ) dδ.
Here δ −1 denotes the group inverse element for δ.
Since (9) is translation-invariant in τ, it follows that without loss of generality we can assume a parametrisation τ(ε) such that the group action becomes φ(τ_1, τ_2) = τ_1 + τ_2 and, in particular, τ^{−1} = −τ. In the remainder of the paper, for convenience, we assume that all Lie groups are parametrised such that the group action is φ(ε_1, ε_2) = ε_1 + ε_2.
The next result can be viewed as a converse to Theorem 1, as it shows how to obtain the transformation from the infinitesimal generator. All proofs are reserved for Supplemental Section A.1 (Wang et al., 2019).

Theorem 2. A one-parameter Lie group of transformations with infinitesimal generator X admits the representation x* = X(x, ε) = e^{εX} x, where e^{εX} x := Σ_{k=0}^∞ (ε^k / k!) X^k x and X^k denotes k-fold application of the operator X.
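Theorems 1 and 2 can be illustrated symbolically for the rotation group of Ex. 2: the group law φ(ε, δ) = ε + δ is verified directly, the infinitesimal ξ = (−x_2, x_1) is recovered by differentiating at ε = 0, and a truncated Lie series e^{εX} x_1 reproduces the transformation. This is an illustrative check only, assuming the sympy library:

```python
import sympy as sp

x1, x2, eps, delta = sp.symbols('x1 x2 epsilon delta', real=True)

def rotate(u, v, e):
    """The rotation group of Ex. 2."""
    return (u * sp.cos(e) - v * sp.sin(e), u * sp.sin(e) + v * sp.cos(e))

# Group law: composing rotations by eps and delta is a rotation by eps + delta.
u1, v1 = rotate(*rotate(x1, x2, eps), delta)
u2, v2 = rotate(x1, x2, eps + delta)
group_law = (sp.expand(u1 - sp.expand_trig(u2)) == 0 and
             sp.expand(v1 - sp.expand_trig(v2)) == 0)

# Theorem 1 direction: the infinitesimal is xi(x) = dX/d(eps) at eps = 0.
X1, X2 = rotate(x1, x2, eps)
xi = (sp.diff(X1, eps).subs(eps, 0), sp.diff(X2, eps).subs(eps, 0))  # (-x2, x1)

# Theorem 2 direction: exponentiate the generator X = -x2 d/dx1 + x1 d/dx2
# as a truncated Lie series applied to the coordinate x1.
L = lambda g: xi[0] * sp.diff(g, x1) + xi[1] * sp.diff(g, x2)
term, lie_series = x1, sp.S(0)
for k in range(7):
    lie_series += eps ** k / sp.factorial(k) * term
    term = L(term)
target = sp.series(X1, eps, 0, 7).removeO()   # Taylor expansion of x1*
```

The truncated Lie series agrees term-by-term with the Taylor expansion of x_1 cos(ε) − x_2 sin(ε), as Theorem 2 predicts.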
The following is immediate from the proof of Theorem 2:

Corollary. For analytic f : D → R it holds that f(x*) = e^{εX} f(x).

Invariance Under Transformation
In this section we explain what it means for a curve or a surface to be invariant under a Lie group of transformations and how this notion relates to the infinitesimal generator. Based on the results in Section 2.1, one might expect that invariance to a transformation can be expressed in terms of the infinitesimal generator of the transformation; this is indeed the case. The following definition is fundamental to the method proposed in Section 3:

Definition 7 (Canonical Coordinates). Consider a coordinate system r = (r_1(x), . . . , r_n(x)) on D. Then any one-parameter Lie group of transformations x* = X(x, ε) induces a transformation of the coordinates, r*_i = r_i(x*). The coordinate system r is called canonical for the transformation if r*_1 = r_1, . . . , r*_{n−1} = r_{n−1} and r*_n = r_n + ε.
In canonical coordinates, a one-parameter Lie group of transformations can be viewed as a straightforward translation in the r_n-axis. The existence of canonical coordinates is established in Thm. 2.3.5-2 of Bluman and Anco (2002). Note that Thms. 3 and 4 imply that X r*_i = 0 for i = 1, 2, . . . , n − 1, and X r*_n = 1. The invariance of a surface, as for a function, can be cast in terms of an infinitesimal generator.
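For the rotation group of Ex. 2, polar coordinates are canonical: the radius r_1 is invariant and the angle r_2 is translated by ε. The conditions X r_1 = 0 and X r_2 = 1 noted above can be verified symbolically; this is an illustrative check, restricted to the chart x_1 > 0:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', positive=True)

# Generator of the rotation group (Ex. 2): X = -x2 d/dx1 + x1 d/dx2.
X = lambda f: -x2 * sp.diff(f, x1) + x1 * sp.diff(f, x2)

# Candidate canonical coordinates: polar radius and angle.
r1 = sp.sqrt(x1 ** 2 + x2 ** 2)
r2 = sp.atan(x2 / x1)

# Canonical coordinate conditions: X r1 = 0 (invariant), X r2 = 1 (translated).
invariant = sp.simplify(X(r1))
translated = sp.simplify(X(r2))
```

In these coordinates a rotation by ε is literally the translation r_2 ↦ r_2 + ε, which is the sense in which canonical coordinates trivialise the group action.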

Symmetry Methods for ODEs
The aim of this section is to relate Lie transformations to ODEs for which these transformations are admitted. These techniques form the basis for our proposed method in Section 3.
For an ODE of the form in (2), one can consider the action of a transformation on the coordinates (x, y); i.e. a special case of the above framework where the generic coordinates x_1 and x_2 are respectively the independent (x) and dependent (y) variables of the ODE. It is clear that such a transformation also implies some kind of transformation of the derivatives y_m := d^m y / dx^m. Indeed, consider a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)). Then we have from the chain rule that y*_m := d^m y* / d(x*)^m is a function of x, y, y_1, . . . , y_m, and we denote y*_m = Y_m(x, y, y_1, . . . , y_m; ε). As an explicit example, for m = 1 the chain rule gives

y*_1 = dy*/dx* = (∂Y/∂x + y_1 ∂Y/∂y) / (∂X/∂x + y_1 ∂X/∂y).

In this sense a transformation defined on (x, y) can be naturally extended to a transformation on (x, y, y_1, y_2, . . . ) as required.
Our next task is to understand how the infinitesimal generator of a transformation can be extended to act on derivatives y m .
Definition 10 (Extended Infinitesimal Transformation). The mth extended infinitesimals of a one parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)) are the functions ξ, η, η^(1), ..., η^(m) for which x* = x + εξ(x, y) + O(ε²), y* = y + εη(x, y) + O(ε²) and y*_k = y_k + εη^(k)(x, y, y_1, ..., y_k) + O(ε²) for k = 1, ..., m. It can be shown straightforwardly by induction that η^(m) = (d/dx) η^(m-1) - y_m (d/dx) ξ, where d/dx denotes the full derivative with respect to x, i.e. d/dx = ∂/∂x + y_1 ∂/∂y + Σ_{k=2}^{m+1} y_k ∂/∂y_{k-1}. It follows that η^(m) is a polynomial in y_1, y_2, ..., y_m whose coefficients are linear combinations of ξ, η and their partial derivatives up to order m.

Definition 11 (Extended Infinitesimal Generator). The mth extended infinitesimal generator is defined as X^(m) = ξ ∂/∂x + η ∂/∂y + η^(1) ∂/∂y_1 + ... + η^(m) ∂/∂y_m; equivalently X^(m) = (ξ, η, η^(1), ..., η^(m)) · ∇^(m), where ∇^(m) := (∂/∂x, ∂/∂y, ∂/∂y_1, ..., ∂/∂y_m) is the extended gradient.
The following corollaries are central to the actual computation of the admitted Lie groups of an ODE. Corollary 4. A one parameter Lie group of transformations is admitted by an mth order ODE F(x, y, y_1, ..., y_m) = 0 if and only if its extended infinitesimal generator X^(m) satisfies X^(m) F(x, y, y_1, ..., y_m) = 0 whenever F(x, y, y_1, ..., y_m) = 0.

Multi-Parameter Lie Groups and Lie Algebras
To leverage the full power of Lie symmetry methods for ODEs of order m ≥ 2, we need to consider multiple Lie symmetries which are collectively described by a Lie algebra. Fortunately, the notion of a multi-parameter Lie group of transformations is a natural generalisation from the one parameter case. Thus, this last section of background material concerns the generalisation of the definitions in Section 2.1 to the case of a multi-parameter Lie group. The associated Lie algebra will also be defined.

Definition 13 (Infinitesimal Matrix). The appropriate generalisation of the infinitesimal transformation is the infinitesimal matrix Ξ(x), whose entries are ξ_{αi}(x) := ∂X_i(x; ε)/∂ε_α evaluated at ε = 0, for α = 1, ..., r and i = 1, ..., n.

Definition 14 (Infinitesimal Generator). An r-parameter Lie group of transformations is associated with r infinitesimal generators X_α = Σ_{i=1}^n ξ_{αi}(x) ∂/∂x_i, α = 1, ..., r, one for each row of the infinitesimal matrix.
The first fundamental theorem of Lie can be generalised to the multi-parameter case. In particular, it can be shown that an r-parameter Lie group of transformations is characterised by the set of its r infinitesimal generators. The generalisation is straightforward and so, for brevity, we refer the reader to pages 39-40 of Bluman and Anco (2002).
Next we explain how the collection of infinitesimal generators forms a Lie algebra. This relies on the basic facts that the set D of differential operators on D is a vector space over R (i.e. λX + μY ∈ D for all X, Y ∈ D and λ, μ ∈ R) and that differential operators can be composed (i.e. XY ∈ D for all X, Y ∈ D).

Definition 15 (Commutator). The commutator of two infinitesimal generators X and Y is [X, Y] := XY - YX. Theorem 5. Let L denote the span of the infinitesimal generators X_1, ..., X_r of an r-parameter Lie group of transformations. Then [·, ·] maps L × L into L; i.e. the right hand side of (11) belongs to L for all λ, μ ∈ R^r.
The space L, defined in Thm. 5, satisfies the properties of an r-dimensional Lie algebra, defined next: Definition 16 (Lie Algebra). An r-dimensional vector space L over R together with a bilinear operator [·, ·] : L × L → L is called an r-dimensional Lie algebra if the following hold: (1) alternativity, [X, X] = 0 for all X ∈ L; (2) the Jacobi identity, [X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]] = 0 for all X, Y, Z ∈ L. In general, for the methods presented in Section 3 to be applied, existence of an n-parameter Lie group of transformations is not in itself sufficient; we require the existence of an n-dimensional solvable Lie sub-algebra, defined next:

Definition 17 (Normal Lie Sub-algebra). Consider a Lie sub-algebra J of a Lie algebra L with bilinear operator [·, ·]; i.e. a subset J ⊂ L that, when equipped with the restriction of [·, ·] to J × J, is itself a Lie algebra. The sub-algebra J is called normal if [X, Y] ∈ J for all X ∈ J and Y ∈ L.
Definition 18 (Solvable Lie Algebra). An r-dimensional Lie algebra L is called solvable if there exists a chain of sub-algebras L_1 ⊂ L_2 ⊂ ... ⊂ L_{r-1} ⊂ L_r =: L such that L_{i-1} is a normal sub-algebra of L_i for i = 2, 3, ..., r.
For low-order ODEs, the existence requirement for an admitted Lie group of transformations is more restrictive than the requirement that the associated Lie algebra is solvable. Indeed, we have the following result: Theorem 6. All two-dimensional Lie Algebras are solvable.
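Solvability can be checked directly from commutators. As a minimal illustration (ours, not the paper's), consider the two-dimensional algebra spanned by X_1 = ∂/∂x and X_2 = x ∂/∂x; since [X_1, X_2] = X_1, the chain {X_1} ⊂ {X_1, X_2} exhibits solvability, consistent with Thm. 6:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.Function('f')(x)   # generic test function on which operators act

# Two infinitesimal generators on the line: X1 = d/dx, X2 = x d/dx.
def X1(phi):
    return sp.diff(phi, x)

def X2(phi):
    return x * sp.diff(phi, x)

# Commutator [X1, X2] = X1 X2 - X2 X1, applied to the test function f.
bracket = sp.simplify(X1(X2(f)) - X2(X1(f)))

# [X1, X2] = X1, so the derived algebra is the span of X1: one-dimensional
# and abelian.  The chain {X1} subset {X1, X2} therefore witnesses
# solvability of this two-dimensional Lie algebra.
assert bracket == X1(f)
```

The same commutator computation, applied to all pairs of generators, is how solvability of a candidate Lie algebra would be verified in practice.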
This completes our review of background material. The exact Bayesian PNM developed in Section 3 for an nth order ODE requires the existence of an admitted n-parameter Lie group of transformations with a solvable Lie algebra. In practice we therefore require some high-level information about the gradient field f, in order to establish which transformations of the ODE may be admitted. The requirement of a solvable Lie algebra further limits the class of ODEs for which our exact Bayesian methods can be employed. Nevertheless, this class of ODEs is sufficiently broad to have merited extensive theoretical research (Bluman and Anco, 2002) and the development of software (Baumann, 2013).

Methods
In this section our novel Bayesian PNM is presented. The method relies on high-level information about the gradient field f and, in Section 3.1, we discuss how such information can be exploited to identify any Lie transformations that are admitted by the ODE. In the case of a first order ODE, any non-trivial transformation is sufficient for our method and an explicit information operator is provided for this case, together with recommendations for prior construction, in Section 3.2. Together, the prior and the information operator uniquely determine a Bayesian PNM, as explained in Section 1.1. In the general case of an mth order ODE, we require that an m-dimensional solvable Lie algebra is admitted by the ODE. The special case m = 2 is treated in detail, with an explicit information operator and guidance for prior construction provided in Section 3.3. In the Supplemental Section A.2 the selection of input pairs (x i , y i ) to the gradient field is discussed.

From an ODE to its Admitted Transformations
For the methods proposed in this paper, transformations admitted by the ODE, together with their infinitesimal generators, must first be obtained. The algorithm for obtaining infinitesimal generators is a consequence of Cor. 4. Indeed, suppose we have an mth order ODE of the form y_m - f(x, y, y_1, ..., y_{m-1}) = 0. Then, by Cor. 4, a transformation with infinitesimal generator X is admitted by the ODE if and only if X^(m)(y_m - f(x, y, y_1, ..., y_{m-1})) = 0 whenever y_m = f(x, y, y_1, ..., y_{m-1}). (12) In infinitesimal notation, (12) is equivalent to a system of determining equations, (13), and the direct solution of (13) recovers any transformations that are admitted.
In the common scenario where f(x, y, y_1, ..., y_{m-1}) is a polynomial in y_1, y_2, ..., y_{m-1}, the algorithm just described for identifying admitted transformations can be fully automated (c.f. Baumann, 2013). Indeed, from Def. 10 it follows that the extended infinitesimals η^(k), for k ∈ {1, 2, ..., m}, are polynomial in y_1, y_2, ..., y_k. Thus, after substituting y_m = f(x, y, y_1, ..., y_{m-1}), the left hand side of (12) is also a polynomial in y_1, y_2, ..., y_{m-1}. Moreover, the coefficients of this polynomial must vanish, because (12) holds for arbitrary values of x, y, y_1, ..., y_{m-1}. This argument, of setting coefficients to zero, leads to a system of linear partial differential equations (overdetermined when m ≥ 2) for ξ(x, y) and η(x, y), which can be solved exactly to retrieve all the infinitesimal generators of the ODE. The same strategy can often be applied beyond the polynomial case, and explicit worked examples of this procedure are now provided:

Example 9 (First Order ODE). Consider the class of all first order ODEs of the form dy/dx = F(y/x). For the determining equation to hold for general F, the coefficients of 1, F, F² and F′ must each vanish: η_x = 0, η_y - ξ_x = 0, ξ_y = 0, -ξ y/x² + η/x = 0. This is a linear system of partial differential equations in (ξ, η) which is easily solved; namely ξ = x, η = y. The associated infinitesimal generator is X = x ∂/∂x + y ∂/∂y.
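The same conclusion can be verified mechanically. The sketch below (our own illustration) checks, for an arbitrary symbolic F, that ξ = x, η = y satisfies the standard linearised symmetry condition for a first order ODE y′ = f(x, y), which is the condition whose coefficients are set to zero in the determining equations:

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
F = sp.Function('F')          # arbitrary function, so f(x, y) = F(y/x)
f = F(y / x)

# Candidate infinitesimals from Ex. 9:
xi, eta = x, y

# Linearised symmetry condition for a first order ODE y' = f(x, y):
# eta_x + (eta_y - xi_x) f - xi_y f**2 - xi f_x - eta f_y = 0.
lhs = (sp.diff(eta, x)
       + (sp.diff(eta, y) - sp.diff(xi, x)) * f
       - sp.diff(xi, y) * f**2
       - xi * sp.diff(f, x)
       - eta * sp.diff(f, y))

# The condition holds identically, for every F: the scaling symmetry
# is admitted by the whole class of ODEs dy/dx = F(y/x).
assert sp.simplify(lhs) == 0
```

Because F is left symbolic, this confirms that the symmetry is a property of the class of ODEs, requiring knowledge only of the form f(x, y) = F(y/x) and not of F itself.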

Example 10 (Second Order ODE). The infinitesimal generators for the second order ODE in (14) are derived in Supplementary Section A.1.

The Case of a First Order ODE
In this section we present our approach for a first order ODE. This allows some of the more technical details associated with the general case to be omitted, owing to the fact that any one-dimensional Lie algebra is trivial. The main result that will allow us to construct an exact Bayesian PNM is as follows: Theorem 7 (Reduction of a First Order ODE to an Integral). If a first order ODE admits a one parameter Lie group of transformations, then there exist coordinates r(x, y), s(x, y) in which the ODE takes the form ds/dr = G(r) (16) for some explicit function G(r).
Note that the transformed ODE in (16) is nothing more than an integral, for which exact Bayesian PNM have already been developed (e.g. Briol et al., 2019; Karvonen et al., 2018). At a high level, as indicated in Figure 2, our proposed Bayesian PNM performs inference for the solution s(r) of (16) and then transforms the resultant posterior back into the original (x, y)-coordinate system. Our PNM is therefore based on the information operator A(y) = [G(r_0), ..., G(r_n)] ∈ A = R^{n+1}, which corresponds indirectly to n + 1 evaluations of the original gradient field f at certain input pairs (x_i, y_i). The selection of the inputs r_i is discussed in Section A.2.
The transformation of a first order ODE is clearly illustrated in the following: Example 11 (Ex. 9, continued). Consider the first order ODE dy/dx = f(x, y(x)), f(x, y) = F(y/x). Recall from Ex. 9 that this ODE admits the one parameter Lie group of transformations x* = αx, y* = αy for α ∈ R, with associated infinitesimal generator X = x ∂/∂x + y ∂/∂y. Solving the pair of partial differential equations Xr = 0, Xs = 1 yields the canonical coordinates s = log y, r = y/x. The transformed ODE is then ds/dr = F(r)/(rF(r) - r²) =: G(r). Thus an evaluation of G(r) corresponds to an evaluation of f(x, y) at an input (x, y) such that r = y/x.
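The form G(r) = F(r)/(rF(r) - r²) follows from the chain rule, ds/dr = (ds/dx)/(dr/dx), evaluated along solutions of the ODE. A short SymPy sketch of this calculation (ours, as a consistency check):

```python
import sympy as sp

x = sp.symbols('x', positive=True)
y = sp.Function('y')(x)
F = sp.Function('F')
r_sym = sp.symbols('r', positive=True)

# Canonical coordinates from Ex. 11: r = y/x, s = log(y).
r = y / x
s = sp.log(y)

# Along a solution, ds/dr = (ds/dx)/(dr/dx), with y' replaced by F(y/x).
yp = F(y / x)
ds_dx = sp.diff(s, x).subs(sp.Derivative(y, x), yp)
dr_dx = sp.diff(r, x).subs(sp.Derivative(y, x), yp)
ds_dr = sp.simplify(ds_dx / dr_dx)

# Expected right hand side: G(r) = F(r) / (r*F(r) - r**2), evaluated at r = y/x.
G = F(r_sym) / (r_sym * F(r_sym) - r_sym**2)
assert sp.simplify(ds_dr - G.subs(r_sym, y / x)) == 0
```

Since the right hand side depends on r alone, the transformed problem is indeed a quadrature, as Thm. 7 asserts.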
Two important points must now be addressed. First, the approach just described cannot be Bayesian unless it corresponds to a well-defined prior distribution μ ∈ P_Y in the original coordinate system Y. This precludes standard (e.g. Gaussian process) priors in general, as such priors assign mass to functions in (r, s)-space that do not correspond to well-defined functions in (x, y)-space (see Figure 3). Second, any prior that is used ought to be consistent with the Lie group of transformations that the ODE is known to admit. To address these two points, we propose two general principles for prior construction in this work. The first is the implicit prior principle, which ensures that a prior specified in the transformed coordinates (r, s) can be safely transformed into a well-defined distribution on Y. For such an implicit prior to be well-defined, we need to understand when a function in (r, s)-space maps to a well-defined function on the original (x, y) domain of interest. Let S denote the image of Y under the canonical coordinate transformation.

Figure 3: Illustration of the implicit prior principle: A prior elicited for the function s(r) in the transformed coordinate system (r, s) must be supported on functions s(r) that correspond to well-defined functions y(x) in the original coordinate system (x, y). Thus the situation depicted would not be allowed.
Principle 1 (Implicit Prior). A distribution ν ∈ P S on the transformed solution space S corresponds to a well-defined implicit prior μ ∈ P Y provided that x(r, s(r)) is strictly monotone as a function of r.
Example 12 (Ex. 11, continued). For the canonical coordinates of Ex. 11, dx/dr > 0 if and only if s′(r) > 1/r, and the implicit prior principle requires that we respect the constraint log(r) ≤ s(r) ≤ log(r) + log(x_T) for all r > 0. The set S must therefore consist of differentiable functions s, defined on r ∈ (0, ∞), satisfying log(r) ≤ s(r) ≤ log(r) + log(x_T). Now we turn to the second important point, namely that the prior ought to encode knowledge of any Lie transformations that are known to be admitted by the ODE. In working in the transformed space S, it becomes clear how to construct a prior measure in which this knowledge is encoded. Our second principle for prior specification states that equal prior weight should be afforded to all curves that are identical up to a Lie transformation: Principle 2 (Invariant Prior). A distribution ν ∈ P_S on the transformed solution space S is said to be invariant provided that ν(S) = ν(S_ε), where the elements of S_ε are the elements of S after a vertical translation, s(·) → s(·) + ε, and both S, S_ε ∈ Σ_S.
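The invariant prior principle has a concrete geometric meaning in Ex. 11: a vertical translation s(·) → s(·) + ε in canonical coordinates corresponds exactly to the scaling transformation (x, y) → (e^ε x, e^ε y) admitted by the ODE. A small numerical sketch (with an illustrative admissible curve of our own choosing) makes this explicit:

```python
import numpy as np

# An illustrative curve in canonical coordinates (r, s): s = log(r) + zeta(r),
# with zeta(r) = 0.5*tanh(r), which is increasing and bounded in (0, log(5)),
# so the constraints of Ex. 12 hold for x_T = 5.
r = np.linspace(0.1, 5.0, 100)
s = np.log(r) + 0.5 * np.tanh(r)

def to_xy(r, s):
    """Invert the canonical coordinates s = log(y), r = y/x."""
    y = np.exp(s)
    return y / r, y

eps = 0.7
x0, y0 = to_xy(r, s)
x1, y1 = to_xy(r, s + eps)                # vertical translation in s

# Translation by eps in s is exactly the scaling symmetry (x, y) -> (a x, a y)
# with a = exp(eps): this is the sense in which an invariant prior respects
# the Lie group admitted by the ODE.
assert np.allclose(x1, np.exp(eps) * x0)
assert np.allclose(y1, np.exp(eps) * y0)
```

Thus a prior that is translation-invariant in s assigns equal weight to an entire orbit of solution curves under the admitted symmetry group.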
Our recommendation is that, where possible, both the implicit prior principle and the invariant prior principle should be enforced. In practice, however, it appears difficult to satisfy both simultaneously, and our empirical results in Section 4 are based on implicit priors that are not invariant.

The Case of a Second Order ODE
In this section we present our approach for a second order ODE. The study of second order ODEs is particularly important, since Newtonian mechanics is based on ODEs of second order. The presentation is again simplified relative to the general case of an mth order ODE, this time by virtue of the fact that any two-dimensional Lie algebra is guaranteed to be solvable (Thm. 6). The main result that will allow us to construct an exact Bayesian PNM is Thm. 8, which states that if a second order ODE admits a two parameter Lie group of transformations whose Lie algebra is solvable, then there exist coordinates in which the ODE reduces to a pair of integrals, ds/dr = G(r) (19) and ds̃/dr̃ = H(r̃) (20), for some explicit function H(r̃), where G is constructed from the solution s̃ of (20).
Note that the ODE in (18) is reduced to two integrals, namely (19) and (20). At a high level, our proposed Bayesian PNM performs inference for the solution s(r) of (19) and then transforms the resultant posterior back into the original (x, y)-coordinate system. However, because G in (19) depends on the solution s̃(r̃) of (20), we must also estimate s̃(r̃), and for this we need to evaluate H. Our PNM is therefore based on the information operator A(y) = [G(r_0), ..., G(r_n), H(r̃_0), ..., H(r̃_n)] ∈ A = R^{2(n+1)}, which corresponds indirectly to 2(n + 1) evaluations of f, the original gradient field. The extension of our approach to a general mth order ODE proceeds analogously, with A = R^{m(n+1)}. The use of Thm. 8 is illustrated in Example 14 in the Supplement.
The two principles of prior construction that we advocated in the case of a first order ODE apply equally to second- and higher-order ODEs. It therefore remains only to discuss the selection of the specific inputs r_i (and r̃_i in the case of a second order ODE) that are used to define the information operator A. This discussion is again reserved for Supplemental Section A.2.

Numerical Illustration
In this section the proposed Bayesian PNM is empirically illustrated. Recall that we are not advocating these methods for practical use; rather, they serve as a proof-of-concept, demonstrating that exact Bayesian inference can in principle be performed for ODEs, albeit at considerable effort -- a non-trivial finding that helps to shape ongoing research and discussion in this nascent field.
The case of a first order ODE is considered in Section 4.1 and the second order case is contained in Section 4.2. In both cases, scope is limited to verifying the correctness of the procedures, as well as indicating how conjugate prior distributions can be constructed.

A First Order ODE
This section illustrates the empirical performance of the proposed method for a first order ODE.
ODE To limit scope we consider first order ODEs of the form dy/dx = F(y/x), y(1) = y_0, x ∈ (1, x_T]. (21) Note that the admitted transformation and associated canonical coordinates for this class of ODE have already been derived in Ex. 9, Ex. 11 and Ex. 12.
Prior In constructing a prior μ ∈ P_Y we refer to the implicit prior principle in Section 3.2. Indeed, recall from Ex. 11 that the ODE in (21) can be transformed into an ODE of the form ds/dr = G(r), r ∈ (0, ∞), s(y_0) = log(y_0).
Then our approach constructs a distribution ν ∈ P S where, from Ex. 12, S is the set of differentiable functions s defined on r ∈ (0, ∞) and satisfying log(r) ≤ s(r) ≤ log(r) + log(x T ).
To ensure monotonicity in the implicit prior principle, we require dx/dr > 0, which translates into the requirement that ds/dr > 1/r. (23)
If (23) holds, then ν induces a well-defined distribution μ ∈ P_Y. Note that the constraints in (22) and (23) preclude the direct use of standard prior models, such as Gaussian processes. However, it is nevertheless possible to design priors that are convenient for a given set of canonical coordinates. Indeed, for the canonical coordinates r, s in our example, we can consider a prior of the form s(r) = log(r) + ζ(r), where the function ζ : (0, ∞) → R satisfies 0 ≤ ζ(r) ≤ log(x_T) and ζ′(r) > 0. (24) For this experiment, the approach of López-Lopera et al. (2018) was used as a prior model for the monotone, bounded function ζ; this requires that a number, N, of basis functions is specified -- for brevity we defer the detail to Appendix A.3.
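To convey the flavour of such a construction, the sketch below draws a monotone, bounded ζ from a simplified stand-in prior -- a simplex-weighted sum of increasing logistic bumps, not the López-Lopera et al. (2018) model, and with all numerical choices (basis centres, lengthscale, N) hypothetical -- and maps the resulting s(r) back to a curve in the (x, y) plane:

```python
import numpy as np

rng = np.random.default_rng(0)
x_T = 5.0                      # right endpoint of the domain, as in Section 4.1
N = 8                          # number of basis functions (hypothetical choice)

def sample_zeta(n_grid=200):
    """Draw one random monotone function zeta with 0 <= zeta <= log(x_T).

    A simplified stand-in for the Lopez-Lopera et al. (2018) prior: zeta is a
    simplex-weighted sum of increasing logistic bumps, scaled by log(x_T).
    """
    r = np.linspace(0.05, 10.0, n_grid)
    centres = np.linspace(0.5, 8.0, N)        # hypothetical basis centres
    w = rng.dirichlet(np.ones(N))             # positive weights on the simplex
    basis = 1.0 / (1.0 + np.exp(-(r[:, None] - centres) / 0.5))
    zeta = np.log(x_T) * basis @ w            # increasing, in (0, log(x_T))
    return r, zeta

r, zeta = sample_zeta()
s = np.log(r) + zeta           # implicit prior sample in (r, s) coordinates

# Map back to (x, y): s = log(y) and r = y/x give y = exp(s), x = y/r.
y_curve = np.exp(s)
x_curve = y_curve / r

# The implicit prior principle holds: x(r) is strictly increasing, so the
# sampled curve is a well-defined function y(x).
assert np.all(np.diff(x_curve) > 0)
assert np.all((zeta >= 0.0) & (zeta <= np.log(x_T)))
```

Note that x_curve = exp(zeta), so the sampled curves automatically live on x ∈ (1, x_T), which is exactly the content of the bound constraints on ζ.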
The prior just described incorporates the symmetry structure of the ODE, in the sense that the independent variable r = y/x is the first canonical coordinate of the infinitesimal generator of the Lie group of transformations admitted by the original ODE in Ex. 11. In other words, r is fixed by the Lie group of transformations of the ODE (in this case x* = αx, y* = αy, so r* = r). Because the prior is defined on functions s(r) of r, the prior itself is unchanged by the Lie group of transformations of the ODE.

Results
To obtain empirical results we consider the ODE with F(r) = r^{-1} + r, y_0 = 1 and x_T = 5. Posterior distributions were obtained as the number n of data points was increased; samples from these posteriors are plotted in the (r, s) and (x, y) planes in Figure 4. In each case a basis of size N = 2n was used. Observe that the implicit prior principle ensures that all curves in the (x, y) plane are well-defined functions (i.e. there is at most one y value for each x value). Observe also that the posterior mass appears to contract to the true solution y† of the ODE as the number of evaluations n of the gradient field is increased.
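This example also admits a closed form against which implementations can be validated: substituting v = y/x separates the ODE, and (by our own derivation, under the initial condition y(1) = 1) one obtains y(x) = x √(2 log x + 1). The sketch below cross-checks this against a standard numerical integrator:

```python
import numpy as np
from scipy.integrate import solve_ivp

# ODE from Section 4.1: dy/dx = F(y/x) with F(r) = 1/r + r, y(1) = 1.
def rhs(x, y):
    r = y / x
    return 1.0 / r + r

# Substituting v = y/x gives x v' = 1/v, whence v**2/2 = log(x) + 1/2 and
# (our derivation, under the stated initial condition)
# y(x) = x * sqrt(2*log(x) + 1).
def y_exact(x):
    return x * np.sqrt(2.0 * np.log(x) + 1.0)

sol = solve_ivp(rhs, (1.0, 5.0), [1.0], rtol=1e-10, atol=1e-12,
                dense_output=True)
xs = np.linspace(1.0, 5.0, 50)
err = np.max(np.abs(sol.sol(xs)[0] - y_exact(xs)))

# The numerical and closed-form solutions agree to high accuracy.
assert err < 1e-6
```

Such a closed form is what makes this example convenient for verifying the correctness of the posterior contraction reported in Figure 4.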

A Second Order ODE
This section illustrates the empirical performance of the proposed method for a second order ODE.
ODE Consider again the second order nonlinear ODE in (14), together with the initial conditions y(x_0) = y_0 and (dy/dx)(x_0) = y′_0.
Prior It is shown in Ex. 14 in the Supplement that (14) can be reduced to a first order ODE in (s, r) with -1/x_0 - r ≤ s ≤ -1/x_T - r. The implicit prior principle in this case requires that ds/dr > -1. Thus we are led to consider a parametrisation of the form s(r) = -r + ζ(r), where the function ζ again satisfies the conditions in (24). The approach of López-Lopera et al. (2018) was therefore again used as a prior model. For this example an additional level of analytic tractability is possible, as described in detail in Ex. 14 in the Supplement. Thus we need only consider an information operator of the form A(y) = [G(r_0), ..., G(r_n)].

Results
The posterior distributions that were obtained are plotted in the (r, s) plane and the (x, y) plane in Figure 5. A basis of size N = 2n was used, with [y_0, y′_0] = [-10, 1] and [x_0, x_T] = [5, 10]. Observe that the implicit prior principle ensures that all curves in the (x, y) plane are well-defined functions (i.e. there is at most one y value for each x value). The true solution appears to be smoother than the samples, even for 50 gradient evaluations, which suggests that the prior was somewhat conservative in this context.

Figure 5: Samples from the posterior in the (r, s) plane (left) and the (x, y) plane (right), whilst the exact solution is indicated in red. The blue curves represent a constraint on the domain that arises when the implicit prior principle is applied. The number of gradient evaluations was n = 50.

Conclusion
This paper presented a foundational perspective on PNM. It was first argued that there did not exist a Bayesian PNM for the numerical solution of ODEs. Then, to address this gap, a prototypical Bayesian PNM was developed. The Bayesian perspective that we have put forward sheds light on foundational issues which will need to be addressed going forward:

Foundation of PNM As explained in Section 1.2, existing PNM for ODEs each take the underlying state space Y to be the solution space of the ODE. This appears to be problematic, in the sense that a generic evaluation f(x_i, y_i) of the gradient field cannot be cast as information A(y†) about the solution y† of the ODE unless the point (x_i, y_i) lies exactly on the solution curve {(x, y†(x)) : x ∈ [x_0, x_T]}. As a consequence, all existing PNM of which we are aware violate the likelihood principle and are therefore not strictly Bayesian. The assumption of a solvable Lie algebra, used in this work, can be seen as a mechanism to ensure the existence of an exact information operator A, so that the likelihood principle can be obeyed. However, for a general ODE it might be more natural to take the underlying state space to be a set F of permitted gradient fields, and the quantity of interest Q(f) to map a gradient field f to the solution of the associated ODE. This would make the information operator A trivial, but evaluation of the push-forward Q_# μ would require the exact solution operator of the ODE; this reliance on access to an oracle solver Q makes such an approach philosophically somewhat distinct from PNM.

Limitations of Bayesian PNM
The proposed method was intended as a proof-of-concept and it is therefore useful to highlight the respects in which it is limited. First, when an mth order ODE admits an r-parameter Lie group of transformations with r > m, there is an arbitrariness to the particular m-dimensional sub-group of transformations that is selected. Second, the route to obtaining transformations admitted by the ODE demands that some aspects of the gradient field f are known, in contrast to other work in which f is treated as a black-box. For instance, in Ex. 11 we used the fact that f can be expressed as f(x, y) = F(y/x), although knowledge of the form of F was not required. Third, the class of ODEs for which a solvable Lie algebra is admitted is relatively small. On the other hand, references such as Bluman and Anco (2002) document important cases where our method could be applied. Fourth, the principles for prior construction that we identified do not entail a unique prior and, as such, the question of prior elicitation must still be addressed.
Outlook The goal of providing rigorous and exact statistical uncertainty quantification for the solution of an ODE is, we believe, important and will continue to be addressed. Traditional numerical methods have benefitted from a century of research effort and, in comparison, Bayesian PNM is an under-developed field. For example, the limited existing work on PNM for ODEs, such as Skilling (1992), does not attempt to provide adaptive error control (though we note promising ongoing research in that direction by Chkrebtii and Campbell, 2019). Nevertheless, the case for developing Bayesian numerical methods -- which shares some parallels with the case for Bayesian statistics as opposed to other inferential paradigms -- is clear, as argued in Diaconis (1988) and Hennig et al. (2015). The insights we have provided in this paper serve to highlight the foundational issues pertinent to Bayesian PNM for ODEs. Indeed, our proof-of-concept highlights that performing exact Bayesian inference for ODEs may be extremely difficult. This in turn provides motivation for the continued development of 'approximately Bayesian' approaches to PNM, which we surveyed in detail in Section 1.2.