Numerical methods for backward stochastic differential equations: A survey

Backward Stochastic Differential Equations (BSDEs) have been widely employed in various areas of social and natural sciences, such as the pricing and hedging of financial derivatives, stochastic optimal control problems, optimal stopping problems and gene expression. Most BSDEs cannot be solved analytically and thus numerical methods must be applied to approximate their solutions. There have been a variety of numerical methods proposed over the past few decades as well as many more currently being developed. For the most part, they exist in a complex and scattered manner with each requiring a variety of assumptions and conditions. The aim of the present work is thus to systematically survey various numerical methods for BSDEs, and in particular, compare and categorize them, for further developments and improvements. To achieve this goal, we focus primarily on the core features of each method based on an extensive collection of 333 references: the main assumptions, the numerical algorithm itself, key convergence properties and advantages and disadvantages, to provide an up-to-date coverage of numerical methods for BSDEs, with insightful summaries of each and a useful comparison and categorization.


Introduction
A Backward Stochastic Differential Equation (BSDE) was first introduced by Jean-Michel Bismut in 1973 [34]. The paper used a linear BSDE as an equation for the adjoint process in the stochastic version of the Pontryagin maximum principle. BSDEs were generalized by Pardoux and Peng in 1990 [255] to a general non-linear BSDE of the following form: where questions regarding the existence and uniqueness of the solution (Y t , Z t ) t∈[0,T ] were addressed. We also mention here the closely related Forward BSDEs (FBSDEs) given as follows: for t ∈ [0, T ]. They are essentially a BSDE coupled with a (forward) Stochastic Differential Equation (SDE), in the sense that the driver can depend on the solution to the SDE, and the terminal condition of the BSDE is a function of the terminal value of the SDE. We also look in detail at FBSDEs, as these types of BSDEs appear widely in literature (especially in regards to numerical methods). The solution of an FBSDE can be formulated in terms of the solution of a semilinear parabolic PDE, and this equivalence is exploited in many numerical methods. For the remainder of the introduction when we mention BSDEs, this can refer to BSDEs or FBSDEs. In following sections, however, we will be strict on which type of BSDE we are referring to. BSDEs have been used widely in a number of areas of social and natural sciences. The problem of pricing and hedging a European option can be formulated in terms of a BSDE. In fact, any pricing problem using a replication argument can be written in terms of a BSDE [105,171]. A BSDE can also take into account portfolio constraints in pricing problems [43,53,83,85]. Thus, BSDEs have often been employed for the valuation of many financial derivatives in both complete and incomplete or constrained markets, including European and American options [103,104]. BSDEs can provide necessary and sufficient conditions for optimality [260,261] and enhance implementability [150,166] in optimal control problems such as utility maximization control problems with constraints and risk-sensitive control problems [101,171,172,176,273,307]. BSDEs have also been tied to nonlinear expectations and thus have been used to construct the risk measures [262]. The more complex reflected BSDEs [103] and doubly reflected BSDEs [84,214] have been developed in connection with zero-sum Dynkin games, optimal stopping problems, recallable options, mixed differential games, and mixed stochastic control [157,159]. In addition, BSDEs have found applications in filtering [14,15,16] and physics [93,278]. More examples of applications in these areas can be found in [79,102,105] and image processing [36].
Due to the complex nature of BSDEs, it is rarely possible to find an analytical solution. Thus, one must often resort to numerical methods in order to solve these equations. There have been a variety of numerical methods proposed over the past few decades and also many more currently in progress. For the most part they exist in a complex and scattered manner, with each requiring a variety of assumptions and conditions. The ultimate goal of this survey is to further facilitate the research activity on numerical methods for BSDEs by providing an up-to-date overview of the different types in an organized structure. Before providing the systematic survey of numerical methods, we will first review some basic theory regarding BSDEs (Section 2) in order to better prepare for the goal. Broadly speaking, we will then look at the three main classes of numerical methods, namely backward (Section 3), forward (Section 5), and deep learning based methods (Section 6). By a backward method, we mean a numerical method which works backwards in time and requires the computation of conditional expectations (Section 4), whereas for a forward method, we refer to a method that does not inherently work backward in time so as to (originally, at least) avoid the computation of conditional expectations. In sharp contrast to backward and forward methods, the method of deep learning (Section 6) is quite distinctive in its rather mixed structure and effectiveness in high-dimensional problems.
We then summarize in Section 7 the major components of each category to discuss the power and limitations of various classes of numerical methods in collective comparison. We next provide in Section 8 a further survey on numerical methods for BSDEs with a variety of nonstandard features, such as coupled FBSDEs (Section 8.1), reflected BSDEs (Section 8.2), BSDEs with jumps (Section 8.3), non-Lipschtiz BSDEs (Section 8.4), quadratic BSDEs (Section 8.5), second-order BSDEs (Section 8.6), McKean-Vlasov BSDEs (Section 8.7) and backward stochastic partial differential equations (Section 8.8), each of which has attracted increasing interest for addressing various emerging problems in social and natural sciences and nonlinear partial differential equations. Finally, in Section 9, we summarize our discourse and highlight some future directions.
In order to achieve our main goal of describing and discussing the main idea of each category, we will only focus on one or two relevant representative methods in each subsection, followed by an overview of other methods. To avoid overloading the paper with non-essential technical details, we omit the rather lengthy technical intricacies of the methods in most instances. In particular, the algorithms we provide in each section are described without going into detail and are presented for illustrative purposes only.

Preliminaries
The aim of this section is to review the basic theory of BSDEs (Section 2.1) and FBSDEs (Section 2.2) in brief. In particular, we give the standard sufficient conditions that ensure the existence and uniqueness of their solutions. Let (Ω, F , F, P) be a given filtered probability space which satisfies the usual conditions of completeness and right continuity. On this space, we define the d-dimensional standard Brownian motion (W t ) t∈[0,T ] , whose natural filtration, augmented by the class of P-null sets of F T , is F = (F t ) t∈ [0,T ] . We reserve T for a strictly positive constant which indicates a fixed terminal time of the interval [0, T ], and denote by π n := {0 =: t 0 < · · · < t k < t k+1 < · · · < t n := T } an arbitrary yet fixed partition of the interval [0, T ] for all n ∈ N, with |π n |:= max k∈{0,···,n−1} (t k+1 − t k ). For the sake of brevity, we often suppress the subscript n from the notation π n . We define the following probabilistic spaces and sets: • P: the σ -field of predictable sets in Ω × [0, T ].
• L 2 m (F t ): the set of all R m -valued random vectors X that are F t -measurable and square-integrable, that is, E[ X 2 ] < ∞. We reserve C for a positive constant, whose value changes from line to line, depending on the context at hand. Let |·| and · denote the magnitude and Euclidean norm respectively, where the latter is understood to be a suitable matrix norm in the context of matrices. We denote D ⊗2 := DD ⊤ and by B(D) the Borel σ -field of a set D. We define the following function spaces which appear throughout: • C k,l b : set of continuously differentiable and bounded functions which are k times continuously differentiable in their first coordinate and l times in their second, with bounded partial derivatives up to order l. • C k p : set of C k−1 functions with piecewise continuous k-th derivative. • C k+α : set of C k functions whose k-th derivative is Hölder continuous of order α ∈ [0, 1].

Backward stochastic differential equations
We aim here to review the basics of the following BSDE: • (Z t ) t∈[0,T ] is an F-predictable process in R m×d satisfying P( T 0 Z t 2 dt < ∞) = 1.
We will only consider solutions (Y, Z) in the space S 2 m (0, T ) × H 2 m×d (0, T ). The following theorem [255,318] identifies the standard sufficient conditions imposed to ensure the existence and uniqueness of a solution to the BSDE (2.1) in S 2 m (0, T ) × H 2 m×d (0, T ).
Before moving on, we briefly present some intuition regarding the solution (Y, Z) of the BSDE (2.1). Consider the simple case when f ≡ 0 in (2.1) giving us the following BSDE: Clearly, the degenerate driver f ≡ 0 satisfies the sufficient assumptions for uniqueness and existence. Hence, under the assumption that ξ ∈ L 2 m (F T ), it holds by Theorem 2.1 that the BSDE with null generator (2.2) admits a unique solution (Y, Z) ∈ S 2 m (0, T ) × H 2 m×d (0, T ). If (Y, Z) is the unique solution in S 2 m (0, T ) × H 2 m×d (0, T ) of the BSDE with null driver (2.2), then Y satisfies and thus Y is an (F t ) t∈[0,T ] -martingale. Moreover, based on the formulation (2.2), the martingale Y can be written as follows: As the process Y is a martingale, it is certainly a local martingale. Thus, by the martingale representation theorem, there exists a unique process γ ∈ H 2 m×d (0, T ) satisfying Y t = Y 0 + t 0 γ s dW s for all t ∈ [0, T ]. Hence, we obtain which shows that (Y, γ) solves the BSDE (2.2). By the uniqueness of the solution to the BSDE, we get Z = γ. For this simple case, we see that an adapted solution to the BSDE can only be given by a pair, that is, the Z component is needed to ensure that the process Y is adapted. In a sense, the Z component "steers" the system and is thus called the control process. One cannot simply revert time as in deterministic ODEs, as the filtration can only go in one direction.
To better illustrate the interpretations of Y , Z and ξ , we provide a simple example in a tangible problem in finance. Consider a financial market with one risky asset S, whose price at time t, S t , follows the following SDE: A trader can either invest in the risky asset S or borrow/invest money at an instantaneous risk free interest rate, denoted by r.
Here, we assume that µ, σ and r are bounded and predictable processes. If the amount of money invested in S at time t is φ t and the total wealth of the trade is Y , then the magnitude |Y t − φ t | represents the amount of money that is borrowed (if Y t − φ t is negative) or invested (if Y t − φ t is positive) at time t. The wealth process Y can thus be shown to follow the dynamics: Consider a European option with payoff at time T given by a random variable ξ ∈ L 2 m (F T ). We note that the payoff for a European option ξ will usually be a function of the asset price S at time T , that is ξ = Φ(S T ). Although this would move us to the context of a FBSDE (Section 2.2), we ignore this fact here for illustrative purposes. A trader who wants to sell this option must identify the minimal initial amount of capital Y 0 in such a way that the payoff ξ = Φ(S T ) can be replicated. If a process φ is found such that Y T = ξ , then this initial amount is Y 0 . In other words, we look for a couple (Y, φ ) such that If there exists a predictable process π such that (µ − r) = σ π, then by setting Z = φ σ , the above equation becomes Hence, the problem of pricing and hedging is reduced to finding a solution to the above (linear) BSDE, where Y is the wealth process (which is equivalent to the price of the option), ξ = Φ(S T ) is the payoff of a European option and Z is proportional to the volatility in the model and the amount invested in the risky asset S.
Another typical example in quantitative finance, that is more general than the simple European case and which justifies the use of nonlinear BSDEs, is the so-called the differential rates problem (for instance, [313] among many others). Consider again, a financial market with one risky asset S given by the SDE (2.3), with φ t representing the amount of money invested in S at time t. We thus have that the dynamics of the total wealth Y are again given by (2.4). Let a lending interest rate r be applied to the lending case Y t − φ t ≥ 0 and let a borrowing interest rate R such that R > r be applied to the borrowing case Y t − φ t ≤ 0. In a similar manner to the European case above, if a process φ is found such that Y T = ξ with target payoff ξ ∈ L 2 m (F T ) and with Z = φ σ , we then get a backward process: which lies in the framework of nonlinear BSDEs. Stochastic control is undoubtedly a major field of application. To briefly describe stochastic control in our notation, consider the stochastic optimization problem sup k J(k) in the weak formulation [105]: where Φ and f take values in R, and k controls the SDE as in and E k is the expectation with respect to the probability measure P k under which B k t := B t − t 0 σ −1 (s, X s )b(s, X s , k s )ds is a Brownian motion. Then, the following linear BSDE under P k has a unique solution (Y k , Z k ): due to Theorem 2.1 with ξ = Φ(X T ), provided that all the conditions are met. In view of J(k) = Y k 0 , the stochastic control problem (2.5) is well within the framework (2.1).

Forward backward stochastic differential equations
We next review forward-backward stochastic differential equations (FBSDEs), which have already made a brief implicit appearance in the example on European option pricing (Section 2.1). Consider the following BSDE: where (Y, Z) t∈[0,T ] takes values in R m × R m×d , X is the R q -valued diffusion process solving the standard stochastic differential equation: and f : are all given functions, and x 0 is a suitable point in R q . We note that the dimension of the Y component is set to one (m = 1) in many instances, particularly when dealing with deterministic PDEs (Theorems 2.3 and 2.5). The pair of equations (2.6)-(2.7) is known as the Markovian BSDE, or the (uncoupled) FBSDE. We now introduce the "Markovian" counterparts of the assumptions made in Theorem 2.1 which ensure the existence and uniqueness of solutions to the FBSDE (2.6)-(2.7), as well as an equivalence between the solution of the FBSDE to a viscosity solution of the parabolic PDE (2.9). We refer the reader to, for instance, [256,318] for more detail and proofs.
Analytic solutions are rarely available for FBSDEs, especially those of practical relevance. However, for illustrative purposes, we present a few examples of the FBSDE in the form (2.6)-(2.7), for which unique solutions are available in closed form, to indicate how simple the FBSDE must be for an analytic solution to be available. First, set µ(t, x) = θ (a − x) and σ (t, x) = b. With a suitable initial state X 0 = x 0 ∈ R, the forward component (2.7) is then a diffusion process in the form of X t = e −θt x 0 + a(1 − e −θt ) + b t 0 e −θ (t−s) dW s , that is, an Ornstein-Uhlenbeck process if θ > 0. With the quadratic terminal Φ(x) = x 2 fixed, if the driver is given by f (t, x, y, z) = ry + e −r(T −t) b 2 + θ (a − x)z/b, for some r, a, b ∈ R, then the FBSDE (2.6) admits a unique solution given by Or, if the driver is set, instead, to f (t, x, y, z) = 2θ (a − x)x + b 2 , for some a, b, θ ∈ R, then the unique solution to the FBSDE (2.6) is available in closed form as Next, if the terminal, the driver and the forward component are given by We once again stress that these simple examples are only presented for demonstration purposes. FBSDEs of practical interest are generally not explicitly solvable and thus do require numerical treatment.
We next state the nonlinear Feynman-Kac formula with the representation of the derivative of the PDE solution. We need the following additional assumption to give the representation under the Lipschitz continuous condition of f and Φ. For details (including the definition of the solution ∇X of a variational equation) and applications we refer the reader to, for instance, [228,316].
Assumption 2.4. Assume the following conditions:

10)
and From here on out, with some exceptions (including Sections 5.2 and 8.4), all the conditions of Theorems 2.1 and 2.3 are imposed as standing assumptions. Further assumptions (such as Assumption 2.4) may be imposed in different forms for different numerical methods depending on the BSDEs and FBSDEs of interest.
From a practical perspective, FBSDEs have long been actively studied in the pricing and hedging problem in the incomplete market [83], which remains an ongoing topic, for instance, [29]. With the aid of the relevant techniques developed there, it has been known that FBSDEs play an essential role for formulating X-Value Adjustment (XVA), which is a collective term for various valuation adjustments for derivative instruments, for instance, credit risk and funding costs. Ever since the catastrophic wave of bankruptcies of big financial firms in the 2008 global financial crisis, the XVA pricing has been realized as one of most urgent problems in derivatives pricing. To be more precise, XVAs can be formulated in the form of a FBSDE with a Lipschitz continuous driver f , which is non-differentiable at some points, such as f (t, x, y, z) = a(t) max{0, y} + b(t). Ever since the pioneering work [267], a variety of extensions and refinements of the XVA formulation have been investigated, certainly in the framework of FBSDEs, from both theoretical and practical standpoints [32,33,75,77,215]. In particular, various numerical methods for the XVA pricing have also been proposed, such as a branching algorithm [167], a Fourier-based discretization method [38], higher-order discretization methods [247], a dual algorithm for the stochastic control problem [168] and the deep learning algorithm [136].
Despite being outside the scope of Assumption 2.2, the American option pricing problem [103,104] can also be formulated in the framework of FBSDEs with a discontinuous driver of the form f (t, x, y, z) = a(t)½(y < Φ(x)) + b(t), based on which various numerical methods have been proposed. Examples of such methods include the perturbative expansion and particle method [117] and a local polynomial approximation and branching processes method [40].
As such, there exist a variety of advanced features and relevant terminologies around FBSDEs and their applications in the literature, which we summarize here in brief. First, the FBSDE in the form (2.6)-(2.7) is called a decoupled FBSDE, in contrast to "coupled" when the coefficients µ and σ in the forward component (2.7) depend on the backward component (2.6) (Section 8.1). Moreover, the term "fully coupled" is often attached when those coefficients depend on the outcome ω. The backward component may be reflected at a given stochastic process (Section 8.2), while FBSDEs may contain jumps in the backward and/or forward component (Section 8.3). If the drift of the backward component (2.6) contains the second order derivative of the corresponding PDE (2.9), then such a coupled FBSDE is referred to as a second-order FBSDE (Section 8.6). If not only the coefficients µ and σ depend on the law of the processes (X ,Y, Z) but also the driver f , then such a coupled FBSDE is called a McKean-Vlasov FBSDE (Section 8.7).

Backward numerical methods
With the basic background reviewed in Section 2, we begin with various numerical methods for BSDEs which work backwards in time under two categories with respect to the degree of discretization, namely, backward Euler methods (Section 3.1) and higher-order methods (Section 3.2). In each subsection, we present one or two representative numerical methods in some detail, while we skip lengthy technical explanations. In particular, when describing the algorithms in each subsection, we do not go into much technical detail in order to avoid digression from the main idea. We then give a brief overview of various other methods in the category.
In general, a discretization method is said to be of order p if the discretization error has the convergence rate O(n −p ) in a suitable norm where the given time interval is discretized into n subintervals. However, this convergence rate does not necessarily represent the efficiency of the numerical algorithm under consideration. That is, in most instances, backward numerical methods require the additional computation of conditional expectations in each subinterval, whose efficiency depends largely on many factors, such as the dimension and the method applied. As we are only focusing on backward methods alone here, we will discuss various methods for computing conditional expectations collectively in Section 4, such as leastsquares regression (Section 4.1), Malliavin calculus based methods (Section 4.2), quantization methods (Section 4.3), tree based methods (Section 4.4) and cubature methods (Section 4.5).

Backward Euler methods
The first class of backward numerical methods we review are the so-called backward Euler schemes. The main references on the topic are [47,317], although the ideas of the backward Euler scheme date back to [72]. The convergence of the backward Euler scheme is of order 1/2 under standard Lipschitz assumptions and with Lipschitz terminal condition [47,317]. However, the scheme has a convergence rate of 1 if the forward SDE can be simulated perfectly on the grid or if it is approximated by a higher order scheme under the conditions that the coefficients are sufficiently smooth [137]. We note that discretization methods of higher order are often collectively called higher-order discretization methods, but we do not discuss them here and instead devote Section 3.2 to them.
In the present context, there are certainly two general categories of explicit and implicit discretization schemes for the FBSDE (2.6)-(2.7) (see, also, [318, Section 5.3.2]), which can be summarized as follows:

Algorithm 1
Initialization: Approximate the terminal condition Y n t n = Φ(X n t n ) with the Euler-Maruyama scheme X n . for k = (n − 1) to 0 do With (Y n , Z n ) t∈[0,T ] understood to be the pair of the step processes whose state at t k corresponds to (3.1)-(3.2), it was first derived [47,317] The backward Euler scheme in the reflected case converges at order 1/2, when f is independent of Z [11], but only at order 1/4 in the general case [39,229]. We note that convergence may depend on the regularity of the terminal condition in general through the L 2 -regularity of Z [127]. Stability analysis is conducted in [64] for the backward Euler method in a similar manner to the Euler method for ordinary differential equations. Backward Euler methods have been successfully generalized to address a broader class of BSDEs with nonstandard features (Section 8), such as BSDEs with a general terminal condition and driver [177], BSDEs driven by an infinite activity Poisson random measure in addition to the Brownian motion [230], reflected BSDEs [39,229,281], BSDEs with jumps [42,232] and quadratic BSDEs [272]. Backward Euler methods have also been tailored to, for instance, mean-field BSDEs [321], G-FBSDEs (FBSDEs driven by G-Brownian motion) [175], FBSDEs driven by càdlàg martingales [205], and backward stochastic Volterra integral equations of type I [24,303] and of type II [160]. The convergence of the backward Euler method for BSDEs is discussed in [177] with general terminal condition and driver.
We close this section with [44,106,156], where some studies focus to construct numerical methods for nonlinear PDEs based on backward Euler methods for FBSDEs. One may also look to [152,322] for related studies on discretization, as well as [155] where a number of finite difference methods are reviewed for solving BSDEs, similar to backward Euler methods for the most part.

Higher-order discretization methods
In the literature, a variety of discretization methods have been proposed to achieve a higher order of convergence than the backward Euler method of order 1 (Section 3.1). In principle, to achieve a higher order of convergence, the following three steps need to be planned and implemented carefully: 1. We first discretize the BSDE. A typical result on the discretization of a solution to a BSDE (Y, Z) is given by where discretization of the forward component X has not been taken into account. Here, Y π and Z π denote suitable simple schemes involving conditional expectation representations for Y and Z, respectively (see, for instance, [81]). 2. The corresponding forward component X is to be discretized using a suitable high-order discretization scheme and in such a way that the overall error remains within the order of the backward component (like, the inequality (3.4)). 3. By then applying a computation method for the conditional expectation (Section 4), an efficient numerical scheme is obtained which has a faster convergence rate than a scheme with the backward Euler method (Section 3.1).
A variety of relevant numerical methods have been developed in conjunction with existing techniques in each of those three steps. We devote the present section to a review of such higher-order discretization methods. For illustrative purposes, we begin with a simple numerical method of higher-order type based upon a random walk approximation of the Brownian motion. This method is given in the context of FBSDEs whose driver is independent of X , although the method can be easily extended to when one does not have this independence. The forward (SDE) component is approximated using the Euler-Maruyama method or Milstein method. The solution to the FBSDE is approximated by a discretization scheme that is derived by using the so-called theta method. For example, the numerical approximation for Y is derived by first showing that and then approximating the integral using a theta approximation: a convex combination of an explicit term (term at time t k+1 ) and an implicit term (term at time t k ): where ∆ n := T /n(= t k+1 − t k ) is a fixed time step. In a similar manner, a numerical approximation for Z is derived, and then a backward numerical scheme is presented using these approximations. Namely, the scheme for approximating (Y, Z) by (Y π , Z π ) is given as Y π for k ∈ {n − 1, · · · , 0} backwards, with θ 2 ∈ (0, 1]. The FBSDE and PDE equivalence (Theorem 2.5) is applied in computing the terminal condition for Z. Hence, Assumption 2.4 is essential here. The above scheme is proved in [328] to converge with order of convergence 2 when θ 1 = θ 2 = 1/2 and order of convergence 1 otherwise. Since its first development [323], the θ -scheme has been extensively generalized in various directions. For instance, as previously mentioned, for BSDEs whose driver is independent of the control process Z, the θ -scheme is proved [328] to converge with order 2 when θ 1 = θ 2 = 1/2 and order 1 otherwise. In order to achieve the same orders (2 when θ 1 = θ 2 = 1/2 and order 1 otherwise) even when the driver depends on the pair (Y, Z), the θ -scheme is generalized [327] by introducing more parameters. The concept of the θ -scheme has been applied to more general problem settings, for instance, mean-field BSDEs [283], BSDEs whose driver contains not only the present value (Y t , Z t ) but also the future values [173], and G-BSDEs [174]. In particular, the θ -scheme with θ = 1/2, termed the Crank-Nicolson scheme [301], is investigated along with error analysis [218,326] and applications to FBSDEs [217], where higher-order discretization formulas are derived for the control process Z via approximation of the derivative of the backward process Y by the well-known Crank-Nicolson method, as well as to meanfield BSDEs [321]. Inspired by the θ -scheme, a numerical method for one-dimensional FBSDEs is proposed [277] without using conditional expectations, along with a few analytical solutions of particular FBSDEs. From an implementation point of view, the θ -scheme has been further accelerated via parallel computing [87].
To achieve even higher orders of convergence, the multistep scheme [329] is developed as an extension of the θ -scheme with convergence at an arbitrary order when the driver is independent of the control process. In the multistep scheme, the pair (Y π t k , Z π t k ) is approximated based on the expectation of values at multiple future time points {(Y π t l , Z π t l )} l∈{k+1,···,k+L} , for which evidently more extensive computing is required than the backward Euler method (Section 3.1) and the θ -scheme. Error analyses are conducted in the cases where the driver depends on the control process [58] and with variable step sizes [164], backed up by convincing numerical results. Moreover, by exploiting the independence of the computation at each grid point, the multistep scheme is tailored [201] for parallelized computing on GPUs.
The concept of multistep methods has also been quite prevalent in the context of FBSDEs. A multistep scheme is developed for coupled FBSDEs [324] along with a crude Euler-Maruyama method employed for discretizing the forward component, with convergence analysis refined in [309]. The multistep method has been generalized to address more general BSDEs with nonstandard features (Section 8), such as BSDEs with jumps [112,114], second-order FBSDEs [332] and fully nonlinear parabolic PDEs [207], as well as improved in combination with relevant techniques, such as the sparse-grid method [315], the spectral sparse grid approximation with fast Fourier transform [115], the Lagrange interpolation polynomial [221,222], finite difference for approximating the derivatives of the solution of FBSDEs in multi-time levels [294], 3-point Gauss-Hermite quadrature rule and non-equidistant difference scheme [253], polynomial approximation [331], and predictor-corrector and least-squares Monte Carlo schemes [165]. Naturally, the multistep method for BSDEs and FBSDEs parallels with the relevant numerical methods for ODEs and PDEs. For instance, the Runge-Kutta method for ODEs is applied to BSDEs [61], where the order barrier is exhibited to be more restrictive for BSDEs than for ODEs. The defferred correction method for ODEs is tailored to coupled FBSDEs [288] and second-order FBSDEs [310]. It is proved [289] that a linear multistep method for FBSDEs is stable as long as the so-called root condition is met in a similar manner to that for ODEs. A general framework is constructed [311] to investigate the stability, consistency and convergence of discretization schemes for FBSDEs in a unified manner, including the backward Euler method, the θ -scheme and various types of the multistep method. In [200,201], the authors demonstrate an acceleration of the multistep method for FBSDEs, relying on parallel GPU computing using CUDA. It is natural to employ established higher-order discretization schemes for the forward component, such as the Milstein and stochastic Taylor methods, together with the multistep method for BSDEs [113,325,330]. Along this line, various first-and second-order discretization methods have been developed by combining the multistep method for the backward component (such as the trapezoidal rule) with a suitable discretization scheme for the forward component, and then compute the resulting conditional expectations efficiently by the Gauss-Hermite quadrature rule [330], a Lagrangian interpolation [113], and an approach with Malliavin derivative [325]. Error estimates are improved in [320] for the schemes developed in [330,331]. Those methods are also applied to mean-field FBSDEs [282].
In [81], a second order discretization scheme is constructed for BSDEs with a nonsmooth boundary data using Brownian weights for the approximation of the Z component under weaker conditions, where the gradient estimate of [78] is employed. In order to construct numerical methods for these FBSDEs, the backward component is discretized by the trapezoidal or Simpson's rule, and the forward component by the cubature method on Wiener space (Section 4.5). Such methods result in the orders of convergence 1 [80], 2 [81], and up to 3 [247]. A similar approach is employed in [88] for discretizing McKean-Vlasov FBSDEs. When the forward and backward components are discretized, respectively, by the cubature and backward Euler methods, one can construct a discretization method on the basis of an explicit error expansion [66], resulting in an arbitrary order of convergence with the aid of the Richardson-Romberg extrapolation. Inspired by the development of [81], a second-order discretization method is proposed in [241], by introducing higher-order correction terms for the backward component in the form of a sum of polynomials of the Brownian motion through the Malliavin weights.
As previously mentioned, the computation of conditional expectations is undoubtedly a major step in higher-order discretization methods (Section 4). Here, Fourier analysis may play central roles for various purposes [68,126,180,190,251,275,302], for instance, via the characteristic function of the discretized forward component [274]. A tree-based regression is suggested in [291] for efficiently implementing higher-order discretization methods. In conjunction with the four-step scheme [226], higher-order discretization methods are investigated, for instance, on an approximation of spatial derivatives in the coefficients of the four-step scheme using finite difference [238], and with the Hermite-spectral method [227]. Finally, in [265], the authors explored the potential for parallel computing with FBSDEs when using the binomial tree type approximation. Due to the special structure of the numerical method proposed (which is similar to that in [179]), a block allocation algorithm is developed in parallelization, where large communication overhead is avoided, backed up by numerical illustration of encouraging speedups for the parallel implementation.

Computation of conditional expectations
In the present section, we survey major techniques for the computation of conditional expectations in the context of numerical methods (mostly, backward numerical methods (Section 3)) for BSDEs, namely, least-squares regression based methods (Section 4.1), Malliavin calculus based methods (Section 4.2), quantization methods (Section 4.3), tree based methods (Section 4.4) and cubature methods (Section 4.5). We remark that some deep learning based methods may also be interpreted as a mechanism for computing the conditional expectation in numerical methods for BSDEs, using nonlinear least-squares Monte Carlo with a deep neural network architecture. We do not review those methods here but review deep learning based methods collectively in Section 6, as this topic forms a rapidly growing field of research deserving of a separate section.

Least-squares regression based methods
Methods that fit into this category are ones which use a form of least-squares regression to evaluate the conditional expectations appearing in a discretization of the BSDE.
Here, we proceed with representative methods of [140,213] to illustrate this type of method in a clear and concise manner. Let ∆ k = T /n, that is, t k = kT /n and ∆W k = W t k+1 − W t k for k ∈ {0, 1, · · · , n}. For each time point t k , we take R K -valued deterministic function bases (e K i,k ) i∈{0,1,···,d} , whose elements are given by, for instance, the sequence of Hermite polynomials or that of Laguerre polynomials of size K. For every k ∈ {1, · · · , n}, let {∆W m k } m∈{1,···,M} be independent copies of ∆W k , and let {X π,m k } m∈{1,···,M} be corresponding copies of X π k . An algorithm for approximating the backward Euler scheme (Y π k , Z π k ) can be described as follows.

Algorithm 2
Set y n,M,K n (·) = Φ for k = (n − 1) to 1 do Under the Lipschitz continuity in the state variables, the (1/2)-Hölder continuity in time of the coefficients of the Markovian BSDE, and the condition that for all measurable functions ϕ such that ϕ( where y n,K k and z n,K k denote the theoretical means of the respective empirical ones y n,M,K k and z n,M,K k of (4.1) and (4.2) above, as well as almost surely, by the strong law of large numbers. While the convergence holds by taking sufficiently large K and M, its rate has been found to be quite complex. It is shown in [213], which extends [140], that the squared error of the time discretization is of the order O(n −1 ), and roughly speaking, the numbers of basis functions and of the paths are to be chosen as K ≈ n d and M ≈ n d+3 in order to control the global squared error to be O(n −1 ), upon a suitable choice of basis functions. We refer the reader to, for instance, [26,140,213,318] for detailed analyses. In a similar spirit, a numerical scheme is developed in [147] for solving the multi-step forward dynamic programming equation arising from the discretization of the FBSDE. The resulting sequence of conditional expectations is computed using empirical least-squares regressions. Also, in [142], we see another algorithm based on least-squares Monte Carlo (LSMC). Here, the algorithm is designed to allow large scale parallelization of the computations on highly multicore processors, such as GPUs by stratifying sample paths to minimize the exposure to the memory requirements due to the storage of simulations. Possible discontinuities at the interfaces due to piecewise polynomial bases [142] are avoided [141,146] by instead employing smooth orthonormal basis functions. Also, [26,140] propose numerical schemes based on iterative least-square regressions on function bases, where the involved coefficients are evaluated using Monte Carlo simulations. We mention the so-called regression-later approach [134], which can be thought of as a variant of least-squares regression methods.
The primal-dual methodology is generalized in [25] to a backward dynamic programming equation associated with time discretization schemes of reflected BSDEs (Section 8.2). They suggest a pathwise approach to the dynamic programming equation, which avoids the evaluation of conditional expectations in backward recursion in time. This approach leads to a minimization problem and is thus thought of as a dual minimization problem. Under suitable assumptions, Y can also be represented as the supremum over a class of classical optimal stopping problems, and this maximization problem can be seen as a primal problem. Using the representations for Y as the value of a maximization and a minimization problem, confidence intervals are constructed for Y 0 using LSMC (even if only to generate an initial input approximation), along with a few numerical examples to test the performance of the algorithm for multi-dimensional reflected BSDEs in nonlinear pricing problems.
In [69], the authors apply the Stochastic Grid Bundling Method (SGBM) to develop a numerical scheme. The SGBM algorithm involves the approximation of conditional expectations by means of bundling Monte Carlo sample paths and a local regress-later technique within each bundle. By employing Hermite martingales, the problem of solving a FBSDE is formulated in [257] as the problem of solving a countably infinite-dimensional system of ODEs. On this basis, they develop a numerical scheme which involves the projection of the solution onto generalized Hermite polynomials. Similarly, a numerical scheme is proposed in [293] based on the projection of conditional expectations onto cubic spline polynomials. Finally, we refer to [241] for a second-order discretization method for FBSDEs based on an algorithm which utilizes polynomials of Brownian motions and is implemented by use of a least-squares Monte Carlo method.

Malliavin calculus based methods
Malliavin calculus based methods are closely related to least-squares regression based methods (Section 4.1) in the sense that a lot use least-squares regression to solve the conditional expectations resulting from a discretization of a BSDE which incorporates Malliavin weights.

Representation of conditional expectations using integration-by-parts
Malliavin calculus was first applied to BSDEs in [47] for the purpose of computing conditional expectations in line with [41,111], for instance, the two terms appearing in the implicit discretization scheme BS-DEs (3.1) and (3.2). With the aid of the Markovian property, those conditional expectations boil down to the form . Here, we describe in brief how this conditional expectation can be reformulated without conditioning in such a way that Monte Carlo methods can play an effective role. In general, for two smooth Wiener functionals F and G in the sense of Malliavin, if G is "non-degenerate", then the conditional expectation E[F|G = x] can be written as fractions: where δ x (G) is understood to be the composition of the delta function at x and G as a Watanabe distribution on the Wiener space, H x denotes a Heaviside function, that is, , and δ W denotes the Skorohod integral operator.
The second equality is due to Malliavin integration by parts and the process u satisfying T 0 D s Gu(s)ds = 1. Under the further constraint T 0 D s Fu(s)ds = 0, by applying the property of the Skorohod integral operator δ W , the last expression in (4.3) can be rewritten in the following form, which lends itself to estimation of the conditional expectation E[φ (X s )| X t = x] by Monte Carlo methods: by generating independent copies with sufficiently large sample size M. This approach can also be applied to reflected BSDEs (Section 8.2) in [47]. As is clear, the effectiveness of this approach depends, next to the Heaviside function, on the Skorohod integral term δ W (u). In principle, one needs a reasonably explicit expression for this term, without which simulation would be essentially impossible. Even if one has an explicit form, its computation can be prohibitive when the problem dimension is high. To address this issue, a variant of the method above is proposed in [82] for a significant reduction of the numerical complexity by wisely modifying the Skorohod integral term δ W (u) in such a way to reduce the terms involved and skip differentiation of the drift and diffusion coefficients of the forward component.

Malliavin weights dynamic programming with regression
We next describe a dynamic programming scheme with the regression methods, named the Malliavin weights dynamics programming, an application of Malliavin calculus in a different spirit from Section 4.2.1. For illustrative purposes, we proceed with the regression method [148], among many candidates, that solves a dynamic programming equation with Malliavin weights arising from the time-discretization of FBSDEs (which may also be fit for Section 3). Now, the main idea of the dynamic programming scheme [148] is to approximate the solution (Y, Z) of a FBSDE by the discrete time stochastic processes (Y π , Z π ), defined on the partition π n , which are given as follows: This system is then solved backwards in the order Y π n , Z π n−1 ,Y π n−1 and so forth. It is called the Malliavin weights dynamic programming (MWDP) equation, as it literally takes the form of a dynamic programming equation with Malliavin weights. The MWDP equation (4.5) is inspired by [228], which gives the following representation of the control process Z: where the processes (H (t) s ) 0≤t<s≤T are the Malliavin weights defined as follows: with (D t X r ) t denoting the Malliavin derivative of the marginal X r . We note that this method adapts the least-squares multistep forward dynamic programming algorithm to a scenario which incorporates Malliavin weights. Under suitable technical conditions presented in [147], it holds almost surely that |Y π The conditional expectations (and hence processes Y π and Z π ) appearing in the MWDP equation (4.5) are computed using a Monte Carlo least-squares regression scheme (Section 4.1). Due to the Markovian assumptions, there exist measurable and deterministic (but unknown) functions y k (·) : ). The aim of the Monte Carlo regression scheme is thus to estimate these functions. Ordinary least-squares regression (OLS) is defined in such a way that easily allows path-dependence and joint laws. The authors reformulate the MWDP equation (4.5) in terms of the given definition of OLS. Specifically, take K (l) k to be any dense subset in the R l -valued functions belonging to In words, for instance, y k (·) is set to be the least-squares approx- n , X π k , · · · , X π n ) −1 . However, the above least-squares regressions give rise to two computational problems, that is, is often infinite dimensional, and the integrals of the OLS in (4.8) are presumably computed using the untraceable law of (H . These issues are addressed by approximating y k (·) and z k (·) on finite-dimensional function spaces K Y,k and K Z,k , respectively, with respect to the empirical version ν k,M of the law P • (H The global error of the algorithm is a weighted time-average of three different errors. The first is the approximation error which relates to the error involved in approximating the functions (y k , z k ) in the finite dimensional approximation spaces. This accuracy is achieved asymptotically as the number of simulations goes to infinity. The second error term is the usual statistical error, which improves with a larger number of simulations or a smaller dimension of the vector spaces. The third and final error term together is the interdependence error which is related to the inter dependencies between regressions at different times. This error is of the same magnitude as the statistical error terms, up to logarithmic factors. Thus, roughly speaking, the global error is of order of the best approximation errors plus the statistical errors.
A similar method can be found in [149], where the solution of the FBSDE is approximated by using a backward MWDP equation and LSMC regression (Section 4.1), along with importance sampling to minimize the conditional variance occurring in the LSMC algorithm and accelerate the convergence of Monte Carlo approximation in a similar manner to [23]. The Radon-Nikodym derivative is not given, as in [23], but rather computed adaptively within the LSMC procedure. However, it makes sense to apply importance sampling only if the driver is independent of Z. If the driver did depend on Z, there would be a propagation of "lack of variance reduction" on the Y component due to the Z component through the driver. This means that it would not be possible to keep track of the benefit of importance sampling for Y . If the Monte Carlo estimation of Z is made with appropriate variance reduction (suited to Z specifically), then this problem would be avoided, and would allow the driver to depend on Z, but this is left to future investigation.

Quantization methods
The quantization method is a numerical scheme for computing the conditional expectation The quantization method can be regarded somewhere in the middle of deterministic and probabilistic methods, because it relies on space grids and weights just like deterministic methods, while the computation of weights often requires Monte Carlo methods. Such quantization methods are developed and examined in [11,12,13] for probabilistically solving multidimensional optimal stopping problems, and then applied to develop discretization schemes for reflected BSDEs (Section 8.2) based on an optimal discrete spatial quantization tree. Further improvements on quantized BSDE schemes are provided in [54,109,248,252].
Here, we describe quantized BSDE schemes in brief in accordance with [109]. On the basis of the explicit backward Euler scheme (3.1) and (3.2), we denote by , · · · , N n }, and then for k ∈ {n − 1, · · · , 1, 0} backwards, with weights given in the forms of conditional probability and expectations: which are to be estimated by Monte Carlo methods if analytic formulas are not available. In principle, more accurate results are naturally expected by increasing the size of the quantizations, whereas from an implementation point of view, the size should be chosen carefully and reasonably, relative to the problem dimension, especially when the dynamics is multivariate with dependent components, since then the conditional distribution cannot be decomposed into a set of independent univariate distributions. The complexity of the quantization methods comes largely from the computation of the optimal quantizers and their transition probabilities, if Monte Carlo simulation needs to be employed for the estimation of the transition probabilities.

Tree based methods
Tree based methods have been found effective in the computation of conditional expectations, particularly in low-dimensional problems, with relevant analyses presented in [49,225] (while its origin may date back to [72]). In order to illustrate the basics of tree based methods, we describe a numerical method proposed in [225] for a BSDE whose driver is independent of Z. The main idea is to approximate the Brownian motion W by a simple random walk B n t : n} is a sequence of iid Rademacher random variables. Then, they use this approximation in a discretized version of the BSDE with a constant time step ∆ := 1/n, whose solution is denoted (Y n , Z n ). The solution to this discretized BSDE is then approximated by ( Y n , Z n ), which satisfies the following algorithm.
Initialization: Approximate the terminal conditions Y n t n = ξ n and Z n t n = 0. for k = (n − 1) to 0 do The involved conditional expectations are computed using a tree structure. We briefly look at the error involved in approx- . It is assumed that ξ = F(W ) and thus ξ n = F(B n ), where F : Ω → R is a bounded Lipschitz function with respect to the uniform topology on Ω, that is, The authors prove that it can then be assumed without loss of generality that the driver is bounded and furthermore that for all k ∈ {0, · · · , n} and for large n, where c denotes the Lipschitz coefficient of the driver. Also, the sequence (Y n ,U n ) is shown to converge weakly in the Skorohod topology to (Y, ZdW ), where We also mention Donsker's theorem which has been derived in the relevant context [49,51]. In addition, stability and convergence of the discretized filtration have been analyzed in [9,73,74]. With the aid of those results, an approximation is investigated [50] in the sense that a sequence of solutions to a BSDE driven by a martingale, approximating the Brownian motion, converges to the solution to (2.1), and is further generalized to BSDEs with random terminal time [295]. Tree based methods can also be employed in conjunction with the theta approximation (3.5), which is a convex combination of explicit and implicit terms: Here, the resulting conditional expectations can be approximated on the basis of a recombining tree structure of a random walk approximation for the Brownian motion with the aid of the Markov property of the approximation of the forward process X . After all conditional expectations have been approximated in this way, one obtains the following algorithm (in accordance with the θ -scheme (3.6)) for computing Y j,k and Z j,k at node ( j, k) backward in time in the recombining tree structure (for instance, [179]).

end end
Since the order of convergence for the tree approximation is 1 and this error dominates the error coming from the θdiscretization, the overall scheme has an order of convergence 1 for any values of θ 1 ∈ [0, 1] and θ 2 ∈ (0, 1]. The expression for Y π j,k above is given implicitly and thus needs approximation by Picard iterations. We close this section by mentioning random walk approximations of BSDEs [71,194,231,235,243,263] and an L 2convergence of the random walk approximation of BSDEs derived in [129,130] which makes use of [127]. Finally, we finish by noting that parallel computing on GPUs is found effective in accelerating tree-based methods [264,265].

Cubature methods
Finally, we review cubature methods on the Wiener space and its application in computing conditional expectations appearing in discretization methods for BSDEs. Consider the following Stratonovich forward SDE and backward SDE: where the coefficients satisfy suitable conditions and • denotes the symbol for the Stratonovich integral. We say that the positive weights {λ k } k∈{1,···,N} and the paths of bounded variation ω 1 , · · · , ω N : holds true for any multi-index α ∈ {0, 1, · · · , d} k such that #{i : , the formula (4.9) can be rewritten as Let x 0,x · (ω) be the solution of the ordinary differential equation obtained by replacing the d-dimensional Brownian motion W with a path of bounded variation ω in ) for a function g : R q → R. We note that implementable expressions of cubature measures are available for some specific degrees (such as m = 3, 5) and dimensions d. Broadly speaking, the cubature measure of degree m yields the following short-time asymptotics: for a smooth function g. We note that the error term depends on the bounds of the higher order derivatives of g. For instance, the following method provides an implicit computation scheme for the iterative conditional expectations in the backward Euler scheme: for i ∈ {n − 1, · · · , 0}, with a suitable degree m [80]. In order to reduce its computational cost, some additional techniques, such as the tree-based branching algorithm, are recommended in implementation. A non-uniform partition is found effective to treat possibly nonsmooth boundary data Φ based on the fact that the corresponding solution of the non-linear PDE u(t, ·) is smooth under a suitable condition on V i , even when Φ is not smooth enough, so that the cubature method (4.10) performs well. Similarly, cubature methods are employed in [81] for constructing a second order discretization scheme, in [247] for a third order scheme, and in [88] for discretizing McKean-Vlasov FBSDEs (Section 8.7). We refer the reader to [66] for an error expansion and the complexity control of the cubature method for solving BSDEs.

Forward numerical methods
Unlike the backward numerical methods (Section 3), the methods that we review in the present section do not inherently work backwards in time, and thus (originally, at least) avoid the computation of conditional expectations (Section 4). We hence call this category "forward" numerical methods in a collective way. In a similar manner to Section 3, we examine one or two representative methods in some detail in each subsection, followed by a brief overview of various other methods in the category.

Picard iteration methods
In this section, we survey forward numerical methods based on Picard iterations. We start with a forward scheme proposed in [22] via Picard iterations on sample paths, which is aimed at numerically approximating sample paths of nonlinear FBSDEs, based upon the discretization of a Picard type iteration. It is assumed that W and X are multidimensional, while Y is univariate.
Also, we impose that µ and σ are 1/2-Hölder continuous with respect to the time variable, as well as Φ is Lipschitz. We note that it is not assumed that the matrix σ is quadratic or that σ ⊗2 is invertible. The limit of a Picard type iteration is used to approximate the processes (Y, Z). Specifically, we set (Y (0) , Z (0) ) ≡ (0, 0), and (Y (r) , Z (r) ) as the solution of the following FBSDE: By taking conditional expectation, the process Y (r) is given as follows: and Z (r) is given via the martingale representation theorem, meaning the above Picard iteration is implicit. The authors propose a time discretization of the above iteration which is explicit in time, but still requires the evaluation of conditional expectations. Given a partition π n of [0, T ] and an approximation X (π) of X , set (Y (0,π) , Z (0,π) ) ≡ (0, 0). Then, for k ∈ {0, 1, · · · , n} forward, define the conditional expectations where W l · denotes the l-th component of W · . The processes Y (r,π) and Z (r,π) are extended to RCLL processes by using constant interpolation. By then approximating those conditional expectations, for instance, by using the LSMC regression method (Section 4.1), the convergence of the above discretized Picard type iteration is given in the form: The discretized Picard type iteration has no (high order) nestings of conditional expectations backwards in time, whereas it does have (lower order) nestings of conditional expectations forward in time in the number of Picard iterations. This turns out to be an advantage from a numerical point of view relative to backward methods. Overall, the error when approximating the involved conditional expectations (by a generic estimator) is significantly lower relative to backward methods. Specifically, it turns out the error grows moderately when the mesh of partition goes to zero and the number of Picard iterations tends to infinity. This is once again an advantage over backward methods, where the error often explodes when the mesh tends to zero. The algorithm that puts everything all together is tested on a hedging problem to demonstrate the proven theoretical convergence of the numerical method. We note that a variance reduced version of the algorithm is also mentioned and tested on a numerical example.
This work is extended in [23,239], where the authors introduce a variance reduced version of the forward approximation scheme by means of importance sampling, or more specifically, by means of a measure transformation based on a general Radon-Nikodym derivative. The technique of importance sampling is generalized from simulating expectations to computing the initial value of a FBSDE. The convergence of this modified and fully implementable numerical method is proved and the success of the involved generalized importance sampling is illustrated by numerical examples in the context of financial option pricing.
We next look at forward numerical methods which focus on solving the equivalent PDE (Theorem 2.5) by Picard iterations. Here, we primarily examine one numerical method, which illustrates this category well. In [138], the authors combine two main ingredients to an algorithm which is aimed at solving the PDE equivalent to the general FBSDE (Theorem 2.5). Hence, Assumption 2.4 is essential. To describe the scheme, we prepare some notation. Let v : [0, T ] × R q → R be C 1 in space and define f v : [0, T ] × R q → R as the following function where f is the driver of the FBSDE (2.6) and σ is the diffusion coefficient of the SDE (2.7). We define the following two random variables Ψ(s, y, g 1 , g 2 ) := T s g 1 (r, X s,y r )dr + g 2 (X s,y T ), Ψ N (s, y, g 1 , g 2 ) := where X s,y (respectively, X N,s,y ) denotes the diffusion process starting from y at time s (respectively, its approximation with N-time steps) which solves the forward SDE (2.7). Finally, we define We first look at the case where the FBSDE has a driver which is independent of Y and Z, that is, f (t, x, y, z) = f (t, x), which means the FBSDE is linear. We demonstrate the adaptive control variate approach (one of the two main ingredients) in this simple setting. It is then extended to the general FBSDE via Picard iterations (the second of the two main ingredients). From Theorem 2.5, we see that in order to solve a linear FBSDE, one can equivalently solve the linear PDE (2.9), for which an adaptive control variate scheme is developed in [143]. Here, one aims to numerically solve the linear PDE: Using the Feynman-Kac formula, the probabilistic solution of this PDE can be expressed as the following conditional expectation in accordance with (5.2): where X t,x is the solution to the standard SDE (2.7) starting from x at time t. One wishes to compute a sequence of solutions as this sequence is proved to converge to the solution v of (5.3). This approach is backed up by the probabilistic representation of the error term: We now proceed to the second ingredient, which is the most general form of Picard iterations [138] and will be used to approximate the solution of the nonlinear FBSDE by the solutions of a sequence of (simple) linear FBSDEs (ones with driver independent of Y and Z), which converge geometrically to (Y, Z). Define recursively the Picard iterative sequence (Y r , Z r ) r∈{0,1,···} with (Y 0 , Z 0 ) = (0, 0), as follows: for r ∈ {0, 1, · · ·} forward. This sequence of linear FBSDEs converges to the unique solution (Y, Z) of the original non-linear FBSDE dt ⊗ dP-a.e. By writing Y r t = v r (t, X t ) and Z r t = (∇v r (t, X t )) ⊤ σ (t, X t ) and by Theorem (2.5), each linear FBSDE can be equivalently written as a PDE: The sequence of solutions of linear PDEs (v r , ∇v r ) r∈{0,1,···} converges in a suitable L 2 norm to the solution (v, ∇v) of the semilinear PDE: which then yields the solution of the non-linear FBSDE by setting (Y, Z) = (v, (∇v) ⊤ σ ). The adaptive control variate method also converges geometrically, and so when combined, the two ingredients provide an algorithm with geometric convergence.

Algorithm 5
Initialization: Set v 0 ≡ 0 and assume that an approximate solution v r of class C 1,2 b has been built at step r. Take n points (t r k , x r k ) k∈{1,···,n} ∈ [0, T ] × R q at each step of the iteration.
• Evaluate c r (t r k , x r k ) using M iid realizations by • Build the global solution c M r (·) based on the values (c M r (t r k , x r k )) k∈{1,···,n} using a linear approximation operator P r : where (w r k ) k∈{1,···,n} are suitable weight functions. The approximation of v at step r + 1 is then computed as We here address a few key questions regarding the algorithm. First, it is critical how to choose the grid points (t r k , x r k ) k∈{1,···,n} at each iteration r. There, at each iteration, we take a new grid of n points (t r k , x r k ) k∈{1,···,n} that are independent and uniformly distributed on [0, T ] × [−a, a] q for a suitable positive constant a. As we want to solve the PDE on [0, T ] × R q , we must choose a large enough. Next, an appropriate norm needs to be prepared for measuring the error. By combining results on BSDEs stated in a norm (leading to the integration with respect to the measure e β s ds [105]) and results on the bounds for solutions of linear PDEs in weighted Sobolev spaces (leading to the integration with respect to the measure e −λ x dx [30]), the convergence of the algorithm is derived in the norm: The sequence of approximation operators (P r ) r∈{0,1,···} is allowed to change at each step of the iteration, under a number of technical conditions: measurability, linearity, regularity, boundedness, able to approximate functions and spatial derivatives well and stability and centering property for random function. We refer the reader to [138] for explicit definitions of each property and an example of a sequence of operators P r which satisfy the desired properties. The operators used are kernel based estimators and are based on the non-parametric technique local averaging. This scheme is further accelerated via parallel computing [210] by replacing the kernel operator with an extrapolating operator for approximating functions and their derivatives.
The geometric convergence results for the algorithm can be described as follows [138]. If Assumption 2.4 holds, Φ ∈ C 2+α b and f is bounded and Lipschitz, then there exists a constant K(T ) such that where S r ≤ ηS r−1 + ε for suitable constants η and ε. Moreover, for β and P-parameters large enough so that η < 1, it holds that This result is derived by splitting the error Y −Y r 2 λ ,β + Z − Z r 2 λ ,β into its difference sources, such as an Euler scheme, Picard iteration, approximation operator P and Monte Carlo simulations.
In [286], the authors propose an analytical approximation scheme which also centers on the representation of FBSDEs given in [228], where a Malliavin calculus method is applied to the forward SDE in conbination with the Picard iteration scheme for the backward component. Various numerical examples are presented to demonstrate the convergence of the proposed algorithm.
Before moving on, we mention that there is a line of research on deep learning based algorithms for solving FBSDEs and parabolic PDEs in high dimensions. The main idea is to make an analogy between the FBSDE and reinforcement learning, where PDEs play an important role from an implementation point of view. As the class of deep learning based algorithms cannot be categorized simply as forward methods and has become a very active field of research on its own, we do not discuss it here but set up an individual section (Section 6) exclusively for this class of algorithms.

Branching diffusion system based methods
We next look at the branching diffusion system based numerical methods for nonlinear PDEs and corresponding BSDEs [167,170]. Consider the following semilinear PDE of KPP (Kolmogorov-Petrovskii-Piskunov) type: , which can be written as (2.9) with L t = ∆ and f (t, x, y, z) = β (∑ k∈N 0 p k y k − y), where ∆ denotes the Laplacian and {p k } k∈N 0 is a probability mass sequence satisfying p k ∈ [0, 1] and ∑ k∈N 0 p k = 1. It is well known [233,280,306] that the solution to the PDE (5.5) admits a probabilistic representation based on the branching diffusion system in which every particle dies in an exponential time of parameter β and creates k iid descendants with probability p k . Then, every descendant dies and reproduces iid descendants independently after independent exponential times in accordance with the same mechanism. In the other words, the solution to the PDE (5.5) can be represented as where N T is the number of particles alive at time T , Z k T denotes the position of the kth particle at time T and the condition indicates that the system is initialized at time t with one particle at position x. For the semilinear PDE with the Laplacian ∆ in (5.5) replaced by the Ito generator L t and a general function f (y): or the corresponding FBSDE (2.6) and (2.7) with driver f (t, x, y, z) = f (y), the so-called "marked" branching diffusion method is proposed in [167] to evaluate u(0, x) based on the equivalence between (5.5) and (5.6) after a polynomial approximation of the driver f (y) ≈ β (∑ m k=0 (a k /p k )p k y k − y) of finite degree m, where {p k } k∈{0,···,m} here is such that p k ∈ [0, 1] and ∑ k∈{0,···,m} p k = 1 and the fraction (a k /p k ) represents the weight to count at vertices of type k. The corresponding FBSDE can thus be evaluated via a fully forward-looking simulation of particles owing to the representation (5.6).
The computation required for branching diffusion methods comes almost entirely from the construction of branching particles. Hence, in terms of the problem dimension, those methods are not as prohibitive as Picard iteration methods (Section 5.1), let alone the computation of conditional expectations (Section 4), for which a few dimensions would be the best in practice. It is reported in [170] that eight-dimensional PDEs are successfully solved without an issue. We note that the marked branching diffusion method is extended further in [3] to elliptic semilinear PDEs by introducing absorption of particles and in [169] to address the case where the nonlinear driver depends on the Z component. In addition, the branching diffusion system is rich enough to provide probabilistic representations of semilinear PDEs beyond the standard form (2.9). For instance, those up to any higher order derivative in the driver (not only up to the first order ∇u as of (2.9)) have been addressed in [245,246] with numerical illustrations. Moreover, probabilistic representations and associated numerical methods have been developed in [258,259] for parabolic and elliptic semilinear PDEs (2.9), but with the Ito generator L t replaced by a fractional Laplacian that corresponds to subordination of the underlying Brownian motion by a stable subordinator.
Interestingly, the concept of the branching diffusion system can be employed for constructing backward numerical methods [45,46], where the driver depends on Z [45] or is independent of Z [46]. Despite the backward nature lies outside the primary scope of Section 5, we illustrate such a method in brief, in accordance with the latter for the sake of simplicity. The other main assumptions made are that Φ : R d → R is measurable and bounded and that f : R d × R → R is measurable (as usual), uniformly Lipschitz continuous in its first argument and satisfies linear growth and Lipschitz continuity in its second argument. As a consequence, there exists a constant M ≥ 1 such that |Φ(X T )|≤ M and X +|Y |≤ M on [0, T ] almost surely. The driver f of the FBSDE is approximated by f l 0 which has a local polynomial structure and is given as f l 0 (x, y 1 , y 2 ) := ∑ j 0 j=1 ∑ l 0 l=0 a j,l (x)y l 1 φ j (y 2 ), where (a j,l , φ j ) ( j,l)∈{1,···, j 0 }×{0,···,l 0 } is a family of continuous and bounded maps. For all y 1 , y 2 ∈ R, j ∈ {1, · · · , j 0 }, l ∈ {0, · · · , l 0 } and a positive constant C, these maps satisfy |a j,l |≤ C, |φ j (y 1 ) − φ j (y 2 )|≤ C|y 1 − y 2 | and |φ j |≤ 1. If the driver was simply approximated by a polynomial, then typically, the approximating FBSDEs would explode in finite time and thus no convergence could be expected. The solution to the FBSDE (Y , Z) with driver f replaced by f l 0 can approximate the true solution (Y, Z) well whenever f l 0 is a good approximation of f : where the positive constant C is independent of f l 0 . Hence, for accurate approximation, one requires a driver that can be approximated well by polynomials. The solution to the new FBSDE (Y , Z) is then approximated by means of a Picard type iteration scheme. To that end, define the process Y m in a recursive way. That is, let Y r T := Φ(X T ), and define, on each interval [t k ,t k+1 ], (Y r . , Z r . ) as the solution on The aim is thus to solve the above Picard iteration backwards on each interval [t k ,t k+1 ], and then using the now completely solved iteration on [0, T ], begin solving the next iteration backwards across the intervals. Importantly, this algorithm requires the truncation of the approximation of Y at some given time steps in order to reduce the approximating driver to a globally Lipschitz driver. The error due to the Picard iteration scheme is given in the form |Y r t −Y t |≤ C(T − t) r for all t ∈ [0, T ] and r ∈ N, where |Y r t | is uniformly bounded in t and r. Each step of the Picard iteration is conducted backwards on each interval [t k ,t k+1 ) by using a representation of Y r in terms of branching diffusion systems. A set of particles (X (l) ) l∈K is constructed, where each particle is the solution to the forward SDE (but each with their own Brownian motion) with a killing time T l . At this time, the particle splits into a random number of new particles which follow the same dynamics, but with their own killing times and Brownian motion. Define the set K l t as the collection of particles in the l-th generation that were born before or at time t, and the set K l t as the collection of particles in K l t that are still alive at time t. Also, given v r−1 and v r (t k+1 , ·), define , a Monte Carlo approximation on a finite functional space, and putting everything together, one obtains a numerical method for approximating (Y, Z). When implementing the algorithm in practice, it is best to modify the algorithm to avoid lots of expensive Picard iterations. Two modifications are suggested which use only one Picard iteration, but provide an accurate estimate nevertheless.

Asymptotic expansion
The method of asymptotic expansion provides a tailor-made approximation for BSDEs by expanding a nonlinear BSDE into by a sequence of linear BSDEs in a flexible way, yet with a rigorous error analysis. By taking advantage of simple computation involved, the method offers a fast computing scheme for BSDEs. Asymptotic expansion is categorized here as a forward methods since the resulting linear BSDEs here can be computed by forward Monte Carlo schemes alone, without the need for backward or regression schemes. We discuss the method of asymptotic expansion here separately from Picard based forward methods, as PDEs are certainly useful here (as we describe shortly) but not fully essential.
We here describe the method in accordance with [285]. On a filtered probability space (Ω, F , F, P), consider the following FBSDE with a small perturbation ε ∈ [0, 1]: where the driver f is sufficiently smooth, with corresponding PDE given by due to Theorem 2.5. Recall that Y ε t = v ε (t, X t ) and Z ε t = (∇v ε (t, X t )) ⊤ σ (t, X t ) =: ∇v ε σ (t, X t ) hold. Now, we expand the process (Y ε , Z ε ) around the solution of null driver (linear) BSDE (Y 0 , Z 0 ): where v 0 is the solution of the corresponding linear PDE ((∂ /∂t) + L t )v 0 (t, x) = 0 with v 0 (T, x) = Φ(x). The following expansion is then proposed [118,285]: for every m ∈ N, where each process (Y (k) , Z (k) ) evidently satisfies (Y (k) , Z (k) ) = ((∂ k /∂ ε k )Y ε , (∂ k /∂ ε k )Z ε )| ε=0 and, moreover, can be characterized by the linear BSDE: It is essential that every (Y (k) , Z (k) ) is explicitly solvable since each solves a linear BSDE. The expansion (5.10) is justified in [285] as follows: for each β > 0, λ > 0 and m ∈ N, there exists C > 0 such that for k ∈ N, with the norm h β ,λ := T 0 R q e β s h(s, x) e −λ |x| dxds for β > 0 and λ > 0. This error estimate is consistent with the expansion (5.10) in the sense of Y The development in [285] is generalized in [145] to nonsmooth drivers. The development in [145] can be summarized as follows: for a driver f which is Lipschitz in (t, x) and is Gateaux-differentiable in (y, z) in the sense that for any squareintegrable predictable processes φ and ψ, there exist κ ∈ (0, 1] and a square-integrable predictable process D f · (φ , ψ) such that The effectiveness and practical accuracy of the schemes has been well demonstrated via numerical experiments in [118,145]. In practice, the expansion up to the first or second order often provides a sufficient accuracy even for non-smooth drivers. An efficient implementation with interacting particles is developed in [120]. We also refer to [76] for nonlinear Monte Carlo scheme using asymptotic expansion method with interacting particles. The method of asymptotic expansion has found its application to a variety of models in mathematical finance. For instance, it is employed in [119] for a complete market on BSDEs. Numerical analyses are conducted for American options in [117] based on asymptotic expansion method with interacting particles. Asymptotic expansion is applied in [77] for numerical experiments in a counterparty risk model. In [10], a dynamic framework is introduced for analyzing XVAs with asymptotic expansion employed in numerical experiments. A polynomial expansion method is developed in [116] by approximating target BSDEs via a recursive system of linear ODEs. Using asymptotic expansion, numerical approximations are derived in [4] for McKean-type anticipative BSDEs arising in initial margin requirement models. We close this section with even different lines of research, such as [276] for quadratic BSDEs, [121] for BSDEs with jumps, [123,242,284] for deep learning schemes inspired by asymptotic expansion, and [107,122,286] for similar but different types of expansions in nonlinear PDE and BSDEs.
Based on asymptotic expansion, though not quite in the same spirit as above, statistical inference has been investigated for BSDEs, as from a practical point of view, it may well happen that some problem coefficients cannot be fully determined upon implementation. For instance, in the case where the drift coefficient µ(t, x; θ ) of the forward process (2.7) remains partially unspecified in its parameter θ and the diffusion coefficient εσ (t, x) is kept small with ε ≈ 0, the approximation problem for the solution to FBSDEs is investigated in [208,209] with maximum likelihood estimation of the unknown parameter θ concurrently conducted. The case where the diffusion coefficient is parameterized instead is investigated in [125].

Multilevel Picard approximation
Multilevel Picard approximation is an emerging class of numerical methods, mainly consisting of the following three steps [98]: 1. reformulation of the PDE as a stochastic fixed point problem, 2. approximation of the unique fixed point by Picard iterations, 3. and then approximation of the iterations by multilevel Monte Carlo methods.
The method is said to be full history recursive in the sense that the realizations in the m-th iteration require those in the 1st, 2nd, · · ·, (m − 1)-th iteration. Literally, its name originates from the two major components, that is, Picard iterations and multilevel Monte Carlo methods. As only a few Picard iterations are required in practice and expectations are nested along the iterations, the conditional expectations may be approximated by the standard Monte Carlo method, leading to nested simulations in the number of iterations. The cost can be kept tractable by using only a small number of samples in each nested layer of simulations, for which a very efficient variance reduction is needed, typically handled by the multilevel Monte Carlo approach.
A major motivation for introducing multilevel Picard approximation is to solve high dimensional nonlinear PDEs in a similar spirit to deep learning based methods (Section 6). The category is initiated in the work [99], in which the applicability of approximation methods based on Picard iterations and multilevel Monte Carlo methods is investigated for high dimensional nonlinear PDEs arising in physics and finance. It is subsequently shown [183] that the complexity grows polynomially both in the dimension and in the reciprocal of the required accuracy in the case of semilinear heat equations with gradient-independent and globally Lipschitz continuous nonlinearities. For instance, its complexity is shown to be O(dε −4 ) for such semilinear heat equations, where d is the dimension of the problem and ε is the required precision.
Multilevel Picard approximation methods have been developed further in many papers. A class of multilevel Picard approximation is proposed in [20] for computing iterated nested expectations. For very high-dimensional problems, there is another nested Monte Carlo algorithm for the PDE formulation, where the nesting is along a random time grid [305] Multilevel Picard approximation algorithms are shown to overcome the curse of dimensionality for high dimensional nonlinear heat equations with general time horizons and gradient-dependent nonlinearities [182] and also overcome the curse of dimensionality in the L p sense for high-dimensional semilinear PDEs [186]. It is shown in [185] that a Monte Carlo type numerical method approximates the solution path of BSDEs with complexity which is at most polynomial in the model dimension and at most quadratic in the reciprocal of the prescribed approximation accuracy. It is proved in [187] that semilinear heat equations with gradient-dependent nonlinearities can be approximated under suitable technical conditions with polynomial complexity both in the dimension and the reciprocal of the accuracy. In [188], Picard iterations for backward stochastic differential equations with globally Lipschitz continuous nonlinearity are shown to converge, at least, at the rate of square-root of factorial. An extension of multilevel Picard approximation is applied to general forward diffusion in [189]. Other numerical algorithms can be found in [100] for approximating solutions of general high-dimensional semilinear parabolic partial differential equations at single space-time points, and in [21] for parametric approximation problems by combining Monte Carlo algorithms with machine learning techniques to learn the random variables in Monte Carlo simulations including the multilevel Picard approximation. Yet another type can be found in [60] based on a Picard iteration scheme in which a sequence of linear-quadratic optimization problems are solved by means of the stochastic gradient descent algorithm.

Further on forward numerical methods
We also find a fully forward-looking numerical method in [117], where the non-linear driver of the FBSDE is treated as a perturbation and consequently the FBSDE is converted into a series of decoupled linear FBSDE. The method proposed for approximating these linear FBSDEs uses an interacting particle method in order to perform the involved integration steps, as opposed to using direct Monte Carlo simulation. The homotopy analysis method has been reported effective [333] for FBSDEs. The coupled FBSDEs are transformed into a control problem [86] by discretizing the backward component in the forward direction and are simulated by combining the standard backward discretization with Picard iterations [28]. Finally, a forward method can be found in [52] based on a Wiener chaos expansion, Picard iterations and Malliavin calculus techniques.

Deep learning
The method of deep learning is one of the most active emerging categories in the context of numerical methods for BSDEs. These methods have attracted a great deal of attention for their unique feature of solving high dimensional nonlinear BSDEs and corresponding nonlinear PDEs, owning to neural network approximation by which the estimation of nested high-dimensional expectations and gradients is significantly eased. So far, a wide variety of numerical results have been developed for approximating the solution of nonlinear PDEs by a neural network from physics and finance and FBSDEs related to a nonlinear pricing model for financial derivatives. Those numerical methods demonstrate the efficiency and accuracy of the algorithm for several 100-dimensional nonlinear PDEs and FBSDEs. This effectiveness in high-dimensional problems, in addition to the intrinsic structure involving both aspects of backward and forward methods, makes the method of deep learning quite distinctive in sharp contrast to backward and forward methods (Sections 3 and 5). Despite the complexity of deep learning methods not being able to be investigated in a general form at present, there exist a few theoretical results in the literature that support deep neural networks in overcoming the curse of dimensionality in numerical approximations of some linear and nonlinear PDEs, where the complexity is shown to grow at most polynomially in both the PDE dimension and the reciprocal of the approximation accuracy (for instance, [19,31,153,154,184,195,271,287]).
In what follows, we describe and summarize such deep learning based methods. To this end, we start with the basics on multilayer neural networks. We reserve L for the number of layers. For k ∈ {0, 1, · · · , L − 1}, let ℓ k ∈ N denote the size of input of the k-th layer and let ℓ L be the size of the output layer. For k ∈ {1, · · · , L}, define a(x e )) for x ∈ R e , where a denotes a real-valued nonlinear function on R, called the activation function. With all those, we define the neural network N Θ L : Note that every continuous function on a compact set can be approximated by a neural network to any desired precision.

Deep BSDE
We first summarize a numerical scheme called the deep BSDE method [97,162]. The essence of deep BSDE methods is, broadly speaking, to make use of the gradient of the solution (with respect to the control process Z) as the policy function, and approximate it through a neural network as is done in the standard deep reinforcement learning. For the reader's convenience, we recall the FBSDE (2.6) and (2.7): with X 0 = x and Y T = Φ(X T ), and Theorem 2.5, which asserts (Y t , Z t ) = (v(t, X t ), (∇v(t, X t )) ⊤ σ (t, X t )), with v satisfying the semilinear parabolic PDE under Assumption 2.4: In general, the deep learning-based method aims to approximate Y 0 = v(0, x). We start with the following stochastic optimization problem: Evidently, to solve this minimization problem, the expectation E[|Φ(X T ) −Y y,Z T | 2 ] needs to be computable or approximated. For this purpose, define the sequence {X n t k } k∈{0,1,···,n} as the Euler-Maruyama discretization of the forward process: with initial state X n t 0 = x, and define the forward discretization {Y n t k } k∈{0,···,n−1} of the backward process by with initial state Y n t 0 = y, in order to approximate the minimization problem, as follows: for sufficiently large M for decent estimation quality, where {X n T,l } l∈N and {Y n T,l } l∈N are independent copies of X n T and Y n T , respectively. At the end, the obtained solutions, say, y * and {z * k } k∈{0,1,···,n−1} , provide the approximation Y 0 ≈ y * and Z t k ≈ z * k (X n t k ) for k ∈ {0, 1, · · · , n − 1}. Deep learning comes into play when solving the approximate minimization problem (6.3). For k ∈ {1, · · · , n − 1}, approximate the target continuous function z k by a neural network as x → z θ k k (x) ≈ N θ k L (x) with parameter θ k ∈ R q(L) for a suitable dimension q(L) ∈ N, depending on the number of layers L. For Θ = (θ 1 , · · · , θ d+1 , θ 1 , · · · , θ n−1 ) with θ k ∈ R for k ∈ {1, · · · , d + 1}, the forward discretization of the backward process (6.2) can be parameterized with Θ as with initial states Y n 0 (Θ) = θ 1 and z θ 0 0 (x) = (θ 2 , · · · , θ 1+d ). We seek the parameter vector Θ via the following minimization problem: by stochastic gradient descent, that is, where γ is a suitable learning rate. After a sufficiently large number of recursions, say K, we obtain the parameter vector Θ K with which the approximation of the target is given by Y 0 = u(0, x) ≈ Y n 0 (Θ K ). As described in brief in the beginning of the present section, the significance of deep BSDE methods is to adopt deep neural networks for computing gradients of the solution and approximating the backward component forward in time so that highdimensional PDEs and BSDEs can be solved in a realistic runtime. An error analysis is conducted for deep BSDE methods in [163], where a posteriori estimates are derived for coupled FBSDEs which relate the quadratic terminal loss to the approximation error for the numerical solution of the FBSDE (a posteriori estimates for BSDEs can also be found in [27]). We note that the approximation of the backward component Y parametrically forward in time as above and then the minimization of an error criterion has been found valid even outside the realm of deep learning [8].
The deep BSDE method is modified in [304] by measuring the loss at the forward initial time (rather than the terminal time, as above), called Deep BSDE-ML method, for approximating linear decoupled FBSDEs, as well as is extended in [135] for BSDEs with jumps. The backward deep solvers are developed in [299,313] in order to apply the deep BSDE solver-based method to financial problems, such as pricing Bermudan swaption and nonlinear pricing in high-dimensional settings. Highdimensional coupled FBSDEs with non-Lipschitz diffusion coefficients are numerically solved in [198] using the deep BSDE method. The deep BSDE method is also developed in [6] for strongly coupled FBSDEs stemming from stochastic control.

Deep backward dynamic programming
Deep backward dynamic programming is proposed in [181] and then further improved in [132], literally on the basis of backward dynamic programming arising from discretization methods of BSDEs for high dimensional nonlinear PDEs via the minimization of loss functions at each step, defined recursively by backward induction. In contrast to deep BSDE methods (Section 6.1), deep backward dynamic programming is built in reference to the backward resolution technique upon an implicit backward Euler scheme. It employs machine learning techniques for estimating the solution and its gradient by minimizing a loss function on each time step, where such local problems are then solved recursively with a stochastic gradient algorithm backward in time. Two schemes are proposed in [181] for dealing with those local problems: (DB1) approximating both the solution and its gradient by a neural network, and (DB2) approximating the solution alone by a neural network, with its gradient estimated directly with automatic differentiation. It is worth mentioning that from a viewpoint of the computation of conditional expectations, deep backward dynamic programming may also be regarded as a backward Euler scheme where nonlinear regression is employed with a deep learning technique.
To describe deep backward dynamic programming in brief, the solution (X ,Y, Z) of the FBSDE (2.6) and (2.7) is discretized, first through the Euler-Maruyama discretization of the forward component {X n t k } k∈{0,1,···,n} with X n t 0 = x on the time grid (0 = )t 0 < t 1 < · · · < t n (= T ) in a similar manner to (6.1) and the backward component, by starting with Y n t n = Φ(X n t n ) and then defining Y , for k ∈ {n − 1, · · · , 1, 0} backwards, where the integrand component {Z n t k } k∈{0,···,n} is not explicit as of yet and to be found during the following procedure. Along the discretized pair {(X n t k ,Y n t k )} k∈{0,···,n} , we describe the aforementioned two schemes of deep backward dynamic programming: for k ∈ {n − 1, · · · , 1, 0} backwards.
for k ∈ {n − 1, · · · , 1, 0} backwards, where ∇u k is the automatic differentiation of the network function u k . At each step k ∈ {n − 1, · · · , 1, 0}, the network functions (U k , Z k ) in (DB1) and V k in (DB2) can be found literally by employing deep neural networks. The effectiveness of deep backward dynamic programming has been supported by numerical results on nonlinear PDEs up to 50-dimension [181] as well as error analysis [132].

Deep splitting
In a similar yet different line from the previous two methods, the so-called deep splitting method is proposed in [17] on the basis of the splitting principle with deep neural networks. Consider the nonlinear parabolic PDE: with u(T, x) = Φ(x), which can be represented, by the nonlinear Feynman-Kac formula, as where the forward component {X t : t ∈ [0, T ]} is as given in Section 6.1. In light of this nonlinear Feynman-Kac representation, one wishes to construct the recursive approximation V k (·) ≈ u(t k , ·) for k ∈ {n − 1, · · · , 1, 0} backwards at discrete time points (0 =)t 0 < t 1 < · · · < t n (= T ), given by where {X n t k } k∈{0,1,···,n} is a sequence of discrete observations of the forward component by the Euler-Maruyama scheme with X n t 0 = x in a similar manner to (6.1). In the deep splitting method, this recursion is approximated by a continuous function on the support of the marginal via the minimization: so as to approximate the map x → V k (x) by a neural network V k (·) ≈ N Θ L (·) with respect to parameter set Θ. It is reported in [17] that the deep splitting method succeeds to deal with as high-dimensional nonlinear PDEs as 10000 dimensions.

Deep Galerkin method and physics-informed neural networks
Deep Galerkin Method (DGM) [279] and Physics-Informed Neural Networks (PINN) [240] are proposed for solving highdimensional nonlinear problems. Despite both methods being based largely on deterministic PDEs, we summarize their essence here for the reason that those may provide solvers for nonlinear BSDEs as well. Hereafter, we refer to DGM and PINN collectively as the deep learning based PDE solver and give a brief summary in accordance with [279].
Let D ⊂ R d and let u : [0, T ] × D → R solve the following (nonlinear) PDE: , x), ∇u(t, x), Hess(u(t, x))) = 0, (t, x) ∈ [0, T ] × D, where f is a nonlinear differential operator, and Ψ : R d → R and g : [0, T ] × R d → R represent the initial and boundary conditions, respectively. In the deep learning based PDE solver, the solution u is approximated by a function u Θ = N Θ L through a deep neural network with respect to parameter set Θ, by minimizing the error between both sides of the nonlinear PDE above, evaluated at sampled points in the space-time domain and in its boundary. An optimal parameter set Θ * is here searched for in such a way to ideally minimize the loss function: where ϕ S,ν := S |ϕ(x)| 2 ν(dx) with the probability measure ν and its support S. For minimization, one employs stochastic gradient decent on an approximate loss function: where {γ n } n∈N is a sequence of learning rates. Here, G n is an approximation of the loss function (6.4), defined by where (t n , x n ), w n and (τ n , z n ) are random elements, respectively, taking values in [0, T ] × D, in D and in [0, T ] × ∂ D, according to the probability measures ν 1 , ν 2 and ν 3 . The iterative procedure above is to be repeated until a suitable convergence criterion is met. In addition, more generalized frameworks using Monte Carlo methods and an efficient implementation of the neural network are discussed in [279], along with various numerical results on high-dimensional American options, high-dimensional HJB equations and Burger's equations. We do not go into further details and applications on the deep learning based PDE solver, since those would lie way outside the scope of the present survey on BSDEs. Instead, we refer the reader to [202,224] for more applications and to [249,269] for its close connection with BSDEs.

Further on deep learning based methods
A wide variety of deep learning based schemes have been developed for solving high-dimensional BSDEs, some of which take advantage of the aforementioned backward and forward methods (Sections 3 and 5). For instance, in [123], a deep learning scheme is introduced in combination with asymptotic expansion (Section 5.3), which is further extended in [242,284,297]. The deep learning technique of [97] is extended in [124,312] to address both terminal and boundary conditions of PDEs. As a computational framework for portfolio risk management problems, the so-called Deep xVA solver is proposed [136] by recursively using the deep BSDE method for a coupled system of BSDEs. A discretization scheme is employed in [244] for solving BSDEs based on deep learning regressions (Section 4.1). A deep signature/log-signature FBSDE algorithm is developed in [108] for approximating FBSDEs with state and path dependent features. A deep Runge-Kutta method is proposed in [59]. In the presence of constraints on the gains process, an approximation of BSDEs is obtained in [204] by neural network approximation. Iterative diffusion optimization techniques are studied using deep learning techniques [250] for applications such as importance sampling and rare event simulation. The Long Short Term Memory networks is applied in [199] to improve the Deep BSDE method [97]. A new algorithm is proposed in [292] based on a θ -discretization of the time-integrands with eXtreme Gradient Boosting (XGBoost) regression for efficiently computing conditional expectations. A deep learning-based stochastic branching algorithm is developed in [245] for numerically solving fully nonlinear PDEs. High-dimensional fullycoupled FBSDEs are solved in [196] with three algorithms based on deep learning. A deep learning based method is proposed in [290] for solving forward-backward doubly stochastic differential equations. Finally, we refer the reader to [19,35,98,133] for more recent developments and surveys on deep learning based methods, to [67] for singular BSDEs, to [18] for second-order BSDEs (Section 8.6), to [56,110,131,161] for McKean-Vlasov BSDEs (Section 8.7), and to [197,266] for stochastic control problems.

Discussion
In the preceding sections (Sections 3, 4, 5 and 6), we have presented a systematic survey and categorization of various numerical methods for BSDEs. However, these methods have thus far been presented in isolation and so not in an appropriate manner for drawing comparisons. In the present section, we thus present those categories all on the table, with a brief description of each, so that a relevant contrast can be made against a few key factors regarding their implementation. Such factors include the convergence property, the dimensionality, and the complexity, which directly influence, individually and/or in combination, what method would be chosen for the BSDE in question. For instance, if a very accurate approximation is needed, on the one hand, then it is desirable to select a method with a strong theoretical convergence guarantee even if the method is more difficult to code and has a longer running time. On the other hand, if a less rigorous approximation is enough for the time being (such as, for preliminary testing purposes), then it would be more reasonable to choose a method that is easier to code and runs faster, even if it may be less theoretically accurate.
In order to make comprehensive, yet, to-the-point contrasts, we must, out of sheer necessity, restrict ourselves to a representative method or two in displaying each category and do not claim that the resulting conclusions entirely hold for every numerical method in a certain category. Moreover, throughout, we focus on the approximation of the backward component Y of each BSDE model (X ,Y, Z) and the corresponding PDE solution (either the point u(t, x) or the function u(t, ·)). We reserve d for the dimension of the forward component X and assume Y is one-dimensional, that is, x(∈ R d ) → u(t, x)(∈ R), and write C for constant multiples whose values change depending on the context.

Backward numerical methods along with computation of conditional expectations
As backward discretization (Section 3) is intrinsically built upon the computation of conditional expectations (Section 4), we discuss representative methods consisting of those two methodological components in combination. Here, we let n represent the number of discretization points in the given time interval, that is, (t 0 ,t 1 , · · · ,t n ) and let M indicate the number of iid replications for Monte Carlo methods involved in each subinterval. provided that K ≈ n d and M ≈ n d+3 . Despite the availability of theoretical analyses [139,140,213], it is still an open question as to the choice of basis functions for a given BSDE. It is worth mentioning that LSMC methods are effective in approximating the function u(t, ·) (rather than a single point u(t, x) alone).
• Malliavin (Section 4.2) + Backward Euler (Section 3.1): Let Y n,M t k denote an approximation at time t k in accordance with [47]. It is shown under suitable conditions that max k∈{0,···,n} which suggests M ≈ n d/2+1 to bound the overall rate by O(n −1/2 ). In general, the required computation can be heavy for dealing with terms involving Skorohod integrals (such as (4.4)), for which an improved algorithm is proposed in [82], while as many as 2 d iid standard normal random variables need to be generated.
• Quantization (Section 4.3) + Backward Euler (Section 3.1): Let Y n,N t k denote an approximation at time t k in accordance with [11], where N = 1 + N 1 + · · · + N n with each N k the number of points in R d used to make up the space grid at the k-th discretization step. It is shown under suitable conditions that max k∈{0,···,n} which suggests N ≈ n (3/2)d+1 to bound the overall rate by O(n −1/2 ). As described in the above and Section 4.3, Monte Carlo simulation is required for completing the quantization method with care on the number of iid replications M.
Quantization methods work in relatively low dimensions, while generally being effective for the approximation of the function u(t, ·).
• Tree (Section 4.4) + Backward Euler (Section 3.1): Let Y n t k be an approximation at time t k by the standard (deterministic) binomial tree method [49,225]. In order to satisfy the order 1/2, that is, the required complexity is approximately 2 dn , which is very large, whereas the complexity drops down to (n + 1) with a recombination tree in one-dimensional problems.
• Cubature (Section 4.5) + Second-order discretization (Section 3.2): Let Y n,N 0 be an approximation of the initial state Y 0 by a cubature method and the tree-based branching algorithm (TBBA) measure with the cubature degree of 7 and N particles at every time point t k , while n here denotes the number of time discretization by the second order method [81]. It is known under suitable conditions that We refer the reader to [66] for relevant complexity analysis.
With numerical methods for conditional expectations (Section 4) now collectively aligned in conjunction with backward discretization methods (Section 3), we make a few relevant remarks before moving on to forward numerical methods (Section 7.2). First, recall that the backward Euler method Y n with forward Euler-Maruyama scheme X n is shown to be of order 1/2 for the backward process Y of a Markovian FBSDE (X ,Y, Z) under minimal Lipschitz conditions [317], that is, max k Y t k −Y n t k 2 ≤ Cn −1/2 , whereas the better rate Y 0 − Y n 0 ≤ Cn −1 at the initial time has also been reported elsewhere [72,137]. This is not an essential disagreement but a simple difference in the evaluation point, due to the error on the forward component of order 1/2 for all time points but the initial time (that is, X t k − X n t k 2 = O(n −1/2 ) for all k ∈ {1, · · · , n}) and the deterministic equality at the initial time (that is, X 0 = X n 0 = x). Thus, the rates described in (7.1), (7.2), (7.3) and (7.4) can be conservative if the approximation is focused on the initial state Y 0 alone.
Next, we make a note on the complexity and dimensionality, which are undoubtedly important upon implementation individually, as well as inextricably bound up together. LSMC can work in up to 10 dimensions and tends to be more efficient than Malliavin calculus based methods. For example, it is known [213] that the complexity of LSMC is O(ε −d−4 ) to achieve the squared error ε 2 in some instances, while the Malliavin calculus based method bears the complexity O(ε −d−13 ), provided that the forward component is a geometric Brownian motion [47]. In general, the application of quantization or tree based methods is limited to considerably low-dimensional problems due to the use of the safety grid. Cubature methods with multi-linear interpolation result in the complexities O(ε −3d/2+1/2 ) if the backward Euler discretization is employed and O(ε −d+1/2 ) with the second-order discretization with the Richardson-Romberg extrapolation [66]. As such, in most cases, the complexity of backward methods (in combination with computation for conditional expectations) tends to explode exponentially as the dimension of the problem increases.
We also remark on the standing assumption of each category. Least-squares regression based methods generally work under minimal assumptions on BSDEs (and SDEs), while a few extra regularity assumptions on the coefficients are required to allow the derivation of robust estimates for the involved error using various regression tools and also to ensure the stability of the algorithm (for instance, [61,147,257,293]). Malliavin calculus based methods tend to require lots of extra assumptions, including various regularity assumptions and in particular, assumptions regarding the Malliavin weights (as in [148,149,241]). Quantization and tree based methods work under the standard Lipschiz conditions on forward and backward SDEs, while cubature methods require some extra smoothness on coefficients of the forward SDE and a smooth driver with Lipschitz terminal condition in most instances.

Forward numerical methods
Next, we make a contrast of the four categories of forward numerical methods (Section 5), again, by providing a brief summary of the convergence property, and key features.
• Picard iteration (Section 5.1): For an approximate solution (Y r,n t ) t∈ [0,T ] in accordance with the Picard iteration method [22], it is shown that for sufficiently large n. This scheme needs to be combined with a numerical method for computing conditional expectations (Section 4) and, by and large, similar limitations apply as described in Section 7.1. As such, in its implementation, variance reduction techniques play an importance role, for instance, importance sampling [23,239] and control variates [138], in solving high-dimensional problems.
• Branching diffusion system (Section 5.2): For a BSDE (Y, Z) with driver f (independent of z), it is shown under suitable conditions [170] that where Y n,ℓ · denotes a branching diffusion approximation to the BSDE with a discretized ℓ-th order polynomial driver f n,ℓ [45,46]. After the branching diffusion approximation has been applied, it is the usual practice to employ Monte Carlo simulation for implementation, such as a nonlinear Monte Carlo simulation [76], which is known to be effective in high-dimensional nonlinear pricing problems.
• Asymptotic expansion (Section 5.3): For a BSDE (Y ε , Z ε ) with a perturbed driver ε f , it holds under suitable conditions [145,285] that After the expansion has been performed, similarly to the aforementioned branching diffusion system, often Monte Carlo simulation is used for implementation [76].
• Multilevel Picard approximation (Section 5.4): Consider a solution u of a d-dimensional semilinear PDE corresponding to a BSDE model, and let u θ n,M : [0, T ] × R d × Ω → R denote an approximate solution to the PDE u(·, ·) by the multilevel Picard approximation in accordance with [98,183]. For each x ∈ R d , there exist K : (0, 1] → N and c > 0 such that for all d ∈ N and ε ∈ (0, 1], with the complexity O(ε −c d c ), which suggests that for each initial state of the forward component, the initial state Y 0 of the backward component can be approximated without suffering from the curse of dimensionality.
To conclude the categories of forward numerical methods, we make a short remark on the standing assumption, as we have done towards the end of Section 7.1. As mentioned throughout the survey, forward numerical methods do not inherently work backwards to avoid the computation of conditional expectations, often with the aid of the FBSDE and PDE equivalence (Theorem 2.5). In return, it is rather evident that numerical methods based on solving the equivalent PDE (Section 5) must impose Assumption 2.4 so that the FBSDE and PDE equivalence holds. In addition, a few extra regularity assumptions are often made. For instance, the branching diffusion system based method (as of [45,46,170]) often requires the key assumption that the driver can be represented as the sum of a power series to be approximated by polynomials, which obviously does not hold for every BSDE. The asymptotic expansion method (as of [77,118,120,285]) requires some smoothness on the driver to expand a nonlinear BSDE by linear BSDEs, while the condition can be relaxed in some instances [145]. Finally, to employ the multilevel Picard iteration method, one needs to carefully verify the relevant conditions on the driver as well as check the structure of the forward component (see, for example, [182,183,187,188,189]).

Summary
To conclude the comparisons and discussions made above, and by considering the enormous numerical examples on BSDEs in the literature, we summarize in Table 1 key aspects of each category upon implementation, that is, what is approximated and the problem dimension. In particular, the dimension of the problem at hand is one of the more restrictive conditions for which method can be chosen, in particular, most methods are not efficient in high-dimensional problems. Note that we did not make a comparative discussion on deep learning based methods (Section 6) in the form of a separate subsection (like Sections 7.1 and 7.2), as their error and complexity analyses are still evolving, some in infancy, and awaiting major advances, as opposed to the effectiveness in very high-dimensional spatial approximation problems. Still, for the sake of completeness, we align the category of deep learning based methods at the bottom of Table 1  Without a doubt, computing time is a vital factor to consider when selecting a method, as there may be a need for very quick calculations. However, if no time pressure exists, then the choice of method is less restrictive. We remark that various attempts have been made so far in the literature by wisely splitting the required computation for massive parallelization on highly multicore GPUs. Parallelization here has naturally proven effective because the primary issue does not lie in the computational time but memory consumption requirements, for instance, by algorithms that require many sample paths at once on memory, such as binomial lattice based methods [87,264,265], the multistep method [201], LSMC [141,142,146], the four-step scheme [296], and the forward Picard iteration [210].

Numerical methods for BSDEs with nonstandard features
To address problems in stochastic control, finance, and partial differential equations, BSDEs often need to be equipped with extra features for better capturing relevant properties under consideration. There has been an increasing interest in those classes of BSDEs and in developing numerical methods exclusively for each or some combinations of those features. Despite most of what follows having already appeared in their respective subsections, we again categorize and summarize those numerical methods here in terms of the class of BSDEs in brief, without going into much detail in order to avoid overloading the paper with lengthy technical intricacies.

Coupled FBSDEs
Consider the following FBSDE, where the coefficients of the forward SDE can depend on the backward components (Y, Z): The FBSDE in this form is called a coupled FBSDE in the sense that the backward components are allowed to couple in the drift and the diffusion of the forward component, in sharp contrast to the decoupled formulation (2.6)-(2.7). Hence, the standard Ito generator L t and the diffusion coefficient σ in the PDE (2.9) for decoupled FBSDEs needs to replaced accordingly in order to carry the backward components, resulting in the following PDE under suitable technical conditions: , v(t, x))), ∇v(t, x) As such, numerical approximation of a coupled FBSDE or the equivalent PDE is evidently far more intricate than that of a decouple counterpart, of which the forward component can be treated separately from the backward components. In particular, if the forward component depends on Z as well, then the gradient of the solution ∇u(t, x) needs to be addressed inside the forward component. Such coupled FBSDEs naturally appear in the utility maximization problem with general utility function in economics [172]. As for numerical methods, the four-step scheme [226] opened the door for coupled FBSDEs as early as in the 1990s, followed by various studies through the corresponding semi-or quasi-linear parabolic PDE. Based on this four-step scheme, the authors in [92] develop an implementable numerical method. Specifically, for the PDE part, they use the combined characteristics and finite difference method to approximate its solution. Almost sure uniform convergence and weak convergence of the method are proven, with the rate of convergence being comparable to that of the approximation of the forward SDE (done using an Euler type scheme). In [89] and moreover [90], we find improved alternatives of the method described in [92], which weakens regularity assumptions required. The authors in [236,237,238] also give a numerical method based on the four-step scheme, but propose a different method for solving the involved PDE. They approximate the solution of the PDE by using layer methods, which are constructed by means of a probabilistic approach. The derivatives of the solution to the PDE are found by using finite differences. The four-step scheme once again appears and is reconstructed in [296] with new conditions. It is then associated with the idea of domain decomposition methods (associated with the Schawrz waveform relaxation method). This approach is used in order to parallelize the related equations. Finally, in [151], we find a numerical method based on the four-step scheme which focuses on approximating the solution to coupled FBSDEs.
We add that various numerical methods have also been applied in tailored ways to coupled FBSDEs, such as Picard iterations [28], a transform into a control problem via a fully forward discretization [86], multistep schemes [221,324], a defferred correction method for ODEs [288], Fourier methods for computing conditional expectations [180], and more recently, deep learning based methods [6,163,196,197].

Reflected BSDEs
We first review reflected backward stochastic differential equations (RBSDEs) [103], which form an important class of BSDEs in the sense that they provide a deep insight into many practical problems, such as optimal stopping problems, American option pricing, stochastic optimal controls and differential games. Their solutions are reflected at a given stochastic process, called the obstacle, and provide a probabilistic representation for the unique viscosity solution of an obstacle problem for a nonlinear parabolic partial differential equation. Now, consider the following BSDE, where the process (Y t ) t∈[0,T ] is forced to stay above the process (h(t, X t )) t∈[0,T ] : where the stochastic process (K t ) t∈[0,T ] is continuous and non-decreasing with K 0 = 0, and grows only when can then be written in the form of the solution to an optimal stopping problem, as where T t denotes the set of (F s ) s∈[t,T ] -stopping times taking values in [t, T ], and provides a probabilistic interpretation Y t = v(t, X t ) where v is a solution to the obstacle problem in a similar yet different manner from Theorem 2.5. For theoretical details on RBSDEs, we refer the reader to, for instance, the monograph [91, Chapter 14].
The most popular application of RBSDEs is, perhaps, the pricing of American options [103,104], due to its structure with lower obstacles. In fact, the RBSDE framework is not strictly necessary for the pricing of American options, as the classical binomial pricing model for example works well for its low-dimensional problems, while least-square Monte-Carlo methods [223] remain efficient for high-dimensional problems, both outside the RBSDE framework. Nonetheless, the RBSDE framework is useful for the pricing of American options, for instance, through detailed error analysis by approximating RBSDEs in quantization methods [11,12] and its stability results for approximating its payoff function [48].
As for discretization methods, it is proved in [11] that the backward Euler type discretization attains the order 1/2 for BSDEs with Lipchitz coefficients and the order 1 for those with semi-convex obstacles. Other backward Euler methods are developed for RBSDEs in [194,231] based on the random walk approximation, in [263] based on the lattice tree and in [47] in combination with LSMC regression and the Malliavin approach (Sections 4.1 and 4.2) for approximating conditional expectations. The stability analysis for backward Euler methods for RBSDEs is conducted in [39,229]. Most recently, the stability and convergence analysis is refined in [281] with a focus on quadratic RBSDEs.
After discretizing RBSDEs, more complex equations need to be solved concurrently with approximation of conditional expectations in the presence of the obstacle term. This difficulty has been addressed by discretizing the state space, such as the so-called quantization, before employing the dynamic programming principle. Quantization algorithms for RBSDEs, initially developed in [13], have been investigated intensively in [11,12] to deal with multi-dimensional RBSDEs with an application to the pricing of an American option, and further extended, for example, in combination with the higher-order discretization [234] and for those whose driver depends on the control process [248]. In addition, the primal-dual method is generalized in [25] to a backward dynamic programming equation associated with time discretization schemes for RBSDEs.
In the literature, other advanced types of RBSDEs have also attracted attention, such as discrete RBSDEs, RBSDEs with multiple obstacles and RBSDEs with jumps. To deal with those classes, the concept of penalization, which was originally employed for proving the existence and uniqueness of RBSDEs [103], proves useful for numerical discretization [235] as well, and can be combined with Fourier transform techniques [190]. It has been applied to optimal switching problems in real option pricing [158], followed by RBSDEs with oblique reflections [63,178], doubly RBSDEs [308], and doubly RBSDEs with jumps [95]. A numerical method is developed in [300] for doubly RBSDEs whose generator depends on the future values of the solution with application to the default risk modeling.

BSDEs with jumps
BSDEs may contain jumps in their backward component as −dY t = f (t, X t ,Y t , Z t ,U t ) − Z t dW t − U t dN t with a suitable jump process (N t ) t∈[0,T ] and an control process (U t ) t∈[0,T ] , and often in their forward component as well. On top of non-trivial issues on existence, uniqueness and stability [254], the presence of jumps increases complexity of numerical implementation, for instance, by paying a great deal of attention to the jump size and timing. For instance, discretization schemes are developed, based on Malliavin calculus on the Gaussian space [42], and the so-called shot noise representation of the Poisson random measure [314] in [232]. A forward scheme for BSDEs [52] is extended [128] to BSDEs with jumps by Wiener chaos expansion and Picard iterations. A numerical algorithm is developed in [1] to approximate the solution to a decoupled FBSDE driven by a pure Lévy process with its small jumps approximated by a suitable Brownian motion. An explicit prediction-correction scheme is developed for solving decoupled FBSDEs with jumps [112]. In [212], Brownian and Poissonian components are independently approximated by random walks. More complex structures are also of interest. A Fourier-based method [38] and asymptotic expansion [121] are proposed, by which non-Poissonian local jump measures can be taken care of. Numerical methods for doubly reflected BSDEs with jumps are developed, with [94] and without [95] penalization. A single jump (not systematic jumps) is considered in the context of BSDEs in [203], where a single jump can represent a default time in credit risk or counterparty risk. A deep learning based method is developed in [135] and applied to BSDEs with jumps. We refer the reader to the monograph [91] for details as well as applications in insurance and finance.

BSDEs with non-global Lipschitz conditions
The majority of key properties of numerical methods for BSDEs are derived upon the global Lipschitz conditions on the drivers and terminal condition, whereas such restrictive conditions do not hold all the time in important applications, such as the Fisher-KPP and FitzHugh-Nagumo equations. In the literature, considerable attempts have been made to develop viable numerical methods where those Lipschitz conditions are relaxed. The L 2 -time regularity of the Z-component is studied in [144] for an irregular terminal condition (such as an indicator function), as an extension of the L 2 -time regularity result [317] for a Lipschitz terminal, where the order of convergence is explicitly connected to the rate of decrease of the expected conditional variance of the terminal. Two discretization methods are studied in [298] for a class of locally Lipschitz Markovian BSDEs: One is a backward Euler scheme (Section 3.1) by approximating a projection of the Z process, while the other is the Malliavin weight scheme (Section 4.2, especially (4.6) and (4.7)) by directly approximating the marginals of the Z process, where advanced a priori estimates and stability results for this class of BSDEs are derived by extending the representation theorem [228] and employed for obtaining competitive convergence rates. It is proved [71] that a backward stochastic difference equation (BS∆E) admits a solution and its sequence is convergent to a BSDE as the time-grid gets finer even when their drivers are not Lipschitz in the Z component. An error analysis of a time discretization method is performed in [219] for systems of BSDEs with drivers of polynomial growth and monotone in the state variable, where a tamed version of the explicit Euler scheme is shown to converge despite that the standard version may diverge unlike with the canonical Lipschitz driver. A general framework is developed in [220] for explicit numerical schemes for BSDEs with drivers of polynomial growth, where the convergence of some modified explicit scheme is of the same rate as implicit schemes and has comparable computing cost to the standard explicit scheme. Finally, for drivers not Lipschitz in the backward component, the Picard iteration is applied to discretized FBSDEs [37] and is employed for deriving the existence and uniqueness for G-BSDEs [319].

Quadratic BSDEs
BSDEs with generators of quadratic growth (with respect to the variable z as in | f (t, x, y, z)|≤ C(1 + y + z 2 )) have been studied actively under the name of quadratic BSDEs. Quadratic BSDEs play important roles in mathematical finance, such as utility optimization with exponential utility functions and risk minimization for the entropic risk measure. Their numerical aspects have thus been investigated ever since the theoretical formalization [206]. In contrast to standard BSDEs, the treatment of the quadratic driver is not trivial, particularly in placing the upper bound estimate of the control process Z. In the development of numerical methods, this issue is typically addressed by truncating the quadratic driver.
The well-known path regularity theorem [317] is extended in [191] to the quadratic-growth setting along with convergence rate of the distance between the solution of a quadratic BSDE and its approximation by truncation. In [192], the Cole-Hopf exponential transformation is applied to the approximate solution. Time-discretization schemes are developed in [272] for quadratic BSDEs based on truncation and non-uniform time partition, and in [65] with the error of order less than 1/2. For quadratic BSDEs with reflection, a truncated discrete-time numerical scheme is proposed in [281] along with some practical examples. In [119], an asymptotic expansion technique (Section 5.3) is developed for a quadratic-growth FBSDE appearing in an incomplete market with stochastic volatility. An approximation method is constructed in [122] for quadratic BSDEs using semi-analytic asymptotic expansion. With a view towards utility maximization problems, a special class of backward stochastic partial differential equations (Section 8.8) is investigated in [193], along with numerical simulation for relevant portfolio optimization problems.

Second-order BSDEs
Second-order BSDEs (2BSDEs) are a type of BSDE whose nonlinear drift contains the second order derivative of the corresponding PDE, widely applied in financial modeling such as the uncertain volatility model, transaction cost model, illiquid market model and the pricing for passport options. In general, a process (Y t , Z t , Γ t , α t ) t≥0 is called a 2BSDE if it solves the system with Y T = g(X T ) and a suitable forward process X in the form of (2.7). This system can be expressed as Y t = u(t, X t ), Z t = ∇u(t, X t ), Γ t = Hess(u(t, X t )) and α t = ((∂ /∂t) + L )∇u(t, X t ) for t ∈ [0, T ] on the basis of the parabolic PDE: ∂ ∂t u(t, x) + f (t, x, u(t, x), ∇u(t, x), Hess(u(t, x))) = 0, u(T, x) = g(x), whose full nonlinearity makes the class of 2BSDE distinct from the standard BSDEs. Its existence and the uniqueness are proved in [70] under Lipschitz continuity of the driver and suitable conditions, along with the numerical scheme: starting with Y T = g(X T ) and Z T = ∇g(X T ), proceed backwards for k ∈ {1, · · · , n}, based on the aforementioned nonlinear Feynman-Kac representation. In [106], an implementable explicit method is developed to approximate the conditional expectations appearing in the above scheme. A variance reduction method is proposed in [5] for the computation of involving conditional expectations. 2BSDEs can also be effectively solved numerically with a monotone scheme [156]. We mention that deep learning based approximation algorithms (Section 6) are developed in [18] for 2BSDEs.

McKean-Vlasov FBSDEs
When dealing with a stochastic differential game with mean field interactions, also known as a mean field game, one often encounters the following optimal control problem: inf α∈A E T t f (s, X s , µ s , α s )ds + g(X T , µ T ) X t = x =: u(t, x), (8.1) subject to the so-called McKean-Vlasov stochastic differential equation dX s = b(s, X s , µ s , α s )ds + σ (s, X s , µ s , α s )dW s , where (µ t ) t∈[0,T ] is a deterministic flow of probability measures and the infimum in (8.1) is taken over the set A of all progressively measurable processes. The McKean-Vlasov SDE (8.2) can be interpreted as a limit of the empirical measure of increasing individual particles, known as the propagation of chaos [57] where the particles tend to be independent of each other in this limit as the impact of each decreases. Under suitable technical conditions, the solution u to the optimization problem (8.1) corresponds to that of a suitable HJB equation, as well as providing an insightful probabilistic interpretation through the representation u(t, X t ) = Y t in accordance with the McKean-Vlasov FBSDE, in general: for s ∈ [t, T ], dX s = b(s, X s , L (X s ), L (Y s ), L (Z s ))ds + σ (s, X s , L (X s ), L (Y s ), L (Z s ))dW s , −dY s = f (s, X s , L (X s ), L (Y s ), L (Z s ))ds − Z s dW s , with X t = ξ and Y T = g(X T , L (X T )), where L (F) denotes the law of the random element F. We note that two approaches (of Pontryagin and weak types) result in distinct McKean-Vlasov FBSDEs and thus numerical methods [7].
As for relevant numerical methods, a variety of deterministic ones can be found in the literature for mean field games [2,211], for instance, based on finite differences or variational approaches, whereas McKean-Vlasov FBSDEs above pave the way towards probabilistic numerical methods and analysis [7], such as a tree method [62], first-order and Crank-Nicolson schemes [321], a higher order discretization method for decoupled cases [88], McKean-Vlasov anticipative FBSDEs arising in initial margin requirements [4] and a posteriori error estimates for fully coupled McKean-Vlasov FBSDEs [270], and recently very actively, deep learning based methods [55,56,110,131,161].

BSPDEs
Backward stochastic partial differential equations are of emerging interest from an implementation point view in its intimate relation to stochastic optimal and utility maximization problems. A few existing developments of numerical methods are as follows. For forward-backward stochastic heat equations from stochastic optimal control, discretization methods are developed via spatial discretization [96], space-time discretization [268], and the local discontinuous Galerkin method [216]. A special class of BSPDEs is investigated in [193] via their reduction to BSDEs from both theoretical and numerical perspectives in the context of utility maximization and portfolio optimization problems. A deep learning based method is developed in [290] for forward-backward doubly stochastic differential equations, where the additional diffusion terms in its BSDE component yields an equivalence to semilinear parabolic SPDEs as well as optimal control problems.

Concluding remarks
In this survey, we have focused on examining a wide array of numerical methods for BSDEs along with the main assumptions made, key convergence properties, as well as summaries of the key advantages and disadvantages. In particular, we have characterized broadly into three categories; backward (Section 3), forward (Section 5) and deep learning based methods (Section 6), along with the computation of conditional expectations (Section 4) to complete the categories of backward numerical methods (Section 3). Within those categories, we have further categorized the involved methods. For backward methods, we have sub-categorized the methods as backward Euler methods (Section 3.1) and higher-order methods (Section 3.2). For the computation of conditional expectations to complement backward methods (Section 3), we have sub-categorized the methods as least-squares regression based methods (Section 4.1), Malliavin calculus based methods (Section 4.2), Malliavin weights dynamic programming with regression methods (Section 4.2.2), quantization methods (Section 4.3), tree based methods (Section 4.4), and cubature methods (Section 4.5). For forward methods, we have sub-categorized the methods as Picard iteration methods (Section 5.1), branching diffusion system based methods (Section 5.2), asymptotic expansion (Section 5.3), and multilevel Picard iterations (Section 5.4). For deep learning based methods, we have sub-categorized the methods as deep BSDE (Section 6.1), deep backward dynamic programming (Section 6.2), deep splitting (Section 6.3), and deep Galerkin methods and physics-informed neural networks (Section 6.4).
As is clear from the present survey, numerical methods for BSDEs are intrinsically involved, and further, have been presented in a scattered manner in the literature. Hence, for the sake of collective comparison, we have devoted Section 7 to displaying those surveyed categories against a few important practical aspects, such as convergence properties and implementation aspects, each of which is crucial for understanding the power and limitations of numerical methods from a practical viewpoint.
In addition, BSDEs of complex structure, and thus the numerical methods employed for them, have become of growing interest in the years to come. To enhance this line of research, we have also surveyed existing numerical methods in terms of BSDEs with nonstandard features in Section 8, such as coupled (Section 8.1), reflection (Section 8.2), jumps (Section 8.3), non-Lipschtiz (Section 8.4), quadratic (Section 8.5), second-order (Section 8.6) and McKean-Vlasov BSDEs (Section 8.7), and BSPDEs (Section 8.8). To the best of our knowledge, numerical aspects have not been explored for other types as of yet, such as ergodic BSDEs, delayed BSDEs, and BSDEs with regime-switching. As such, we hope this survey can serve as an initial yet insightful guideline when selecting and developing appropriate numerical methods for the interested BSDE, now and in the future.
With the advent of deep learning based methods (Section 6), advances in numerical methods for BSDEs appear to have entered a new era. In sharp contrast to the conventional approaches (Sections 3, 4, and 5), deep learning based methods have opened the door wide for very high-dimensional BSDEs and nonlinear PDEs by employing the policy functions for readily approximating gradient dependent nonlinearities, to which the existing methods have devoted substantial effort. Further lines of numerical methods are still expected to come, which will benefit from being built upon the distinctive developments and advantageous features of the existing numerical methods for BSDEs and PDEs, to say nothing of expanding fields of application in relation to high-dimensional and nonlinear problems. However, it also remains to address a variety of unexplored theoretical aspects of the existing numerical methods, for instance, ones for BSDEs with nonstandard features (Section 8), not only from the traditional perspective of applied mathematics and numerical analysis, but also from the viewpoint of pure mathematics. Synergy and complementarity of those distinctive expertise would be beneficial and indispensable for making further advances on numerical methods for BSDEs.