Parametric inference for diffusions observed at stopping times

In this paper we study the problem of parametric inference for multidimensional diffusions based on observations at random stopping times. We work in the asymptotic framework of high-frequency data over a fixed horizon. Previous works on the subject (such as [10, 17, 19, 5] among others) consider only observation times that are deterministic, strongly predictable, or random but independent of the process, and thus do not cover our setting. Under mild assumptions we construct a consistent sequence of estimators for a large class of stopping time observation grids (studied in [20, 23]). Further, we carry out the asymptotic analysis of the estimation error and establish a Central Limit Theorem (CLT) with a mixed Gaussian limit. In addition, in the case of a 1-dimensional parameter, for any sequence of estimators satisfying the CLT conditions without bias, we prove a uniform a.s. lower bound on the asymptotic variance, and show that this bound is sharp.

MSC 2010 subject classifications: 62Mxx, 62Fxx, 60F05, 60G40, 60Gxx, 62F12.


Introduction
Statement of the problem. In this work we study the problem of parametric inference for a d-dimensional Brownian semimartingale (S t ) 0≤t≤T of the form based on a finite random number of observations of S at stopping times. The time horizon T > 0 and S 0 are fixed. We assume that the observations are the values of a single trajectory of (S t : 0 ≤ t ≤ T ) sampled from the model (1.1) with an unknown parameter ξ = ξ ∈ Ξ. Our goal is to estimate ξ using these discrete observations and study the asymptotic properties of the estimator sequence as the number of observations goes to infinity; we work in the highfrequency fixed horizon setting. Handling data at random observation times is * This author research is part of the Chaire Risque Financiers of the Fondation du Risque and of the Finance for Energy Market Research Centre (FiME lab) of Institut Europlace de Finance.
important in practice (see the examples in [25, 14] for instance) and has a large impact on the inference procedure, as argued in [4]. A large number of works (see the references below) are devoted to the inference of diffusion models in the case of deterministic, random independent, or strongly predictable observation time grids. In most cases they are based on approximations of the transition probability density of the diffusion process, resulting in so-called approximate maximum likelihood estimators (AMLEs). However, in practice, the observation times may be random and, moreover, the randomness may be (at least partly) endogenous, i.e. depending on the sampled process itself: see [25] for empirical evidence about the connection between volatility and inter-transaction duration in finance, and [14] for modeling bid or ask quotation data and tick time sampling. In other words, as motivated by those examples, the observation grid may be given by a sequence of general stopping times with respect to a general filtration; see the introduction of [21] for additional motivation and discussion. To the best of our knowledge this setting has not yet been studied in the literature, except in [30], where a Central Limit Theorem (CLT) for estimating the integrated volatility in dimension 1 is established assuming the convergence in probability of the renormalized quarticity and tricity (the authors, however, do not characterize the stopping times for which these convergences hold). One reason for this lack of studies is essentially that the necessary tools for the analysis of stopping time discretization grids for multidimensional processes were not available until recently. In particular, the study of the asymptotic normality of a sequence of estimators requires a general central limit theorem for discretization errors based on such grids.
Such a result has been very recently obtained in [21] in a concrete setting (i.e. for an explicitly defined class of grids, rather than grids given by abstract assumptions as in [30]), in several dimensions (unlike the above references), and with a tractable limit characterization. Note that in [15] the derivation of a CLT is achieved in the context of general stopping times, but the limit depends on implicit conditions that are hardly tractable except in certain situations (notably in dimension 1). Another issue is that it is delicate to design an appropriate MLE method in this stopping time setting: in general, approximating the increment distribution seems hardly possible, since the expression for the distribution of (S_τ, τ), where τ is a stopping time, is out of reach in multiple dimensions even in the simplest cases.
In this work we aim at constructing a consistent sequence of estimators (ξ_n)_{n≥0} of the true parameter ξ* in the case of random observation grids given by general stopping times. We provide an asymptotic analysis that allows us to directly apply the existing results of [21] on CLTs for discretization errors, and we show the convergence in distribution of the renormalized error √(N^n_T)(ξ_n − ξ*) (where N^n_T is the number of observation times) to an explicitly defined mixture of normal variables.
Literature background. A number of works study the problem of inference for diffusions. For general references, see the books [31,13] and the lecture notes [27].
The nonparametric estimation of the diffusion coefficient σ(·) is investigated in [12] for equidistant observation times on a fixed time interval. In [17] the authors consider the problem of the parametric estimation of a multidimensional diffusion under regular deterministic observation grids. They construct consistent sequences of estimators of the unknown parameter based on the minimization of certain contrasts, and prove the weak convergence of the error, renormalized at the rate √n, to a mixed Gaussian variable, where n is the number of observations. The problem of achieving a minimal variance estimator is investigated using the local asymptotic mixed normality (LAMN) property, see e.g. [8, Chapter 5] for the definition: this LAMN property is established in [10] for one-dimensional S, and in [19] for higher dimensions using Malliavin calculus techniques, when the n observation times are equidistant on a fixed interval. These latter results show the optimality of Gaussian AMLEs, which achieve consistency with minimal variance.
If the time step between the observations is not small, one can use more advanced techniques based on the expansions of transition densities in order to approximate the likelihood of the observations. See, for instance, [1,2,3,9]. Note that these works consider only the case of deterministic observation grids.
In [18] the authors study the case where each new observation time may be chosen by the user depending on the previous observations (so that the times depend on the trajectory of S). The authors exhibit a sequence of sampling schemes with an asymptotic conditional variance achieving the optimal bound (over all such schemes with random times) for the LAMN property, for all parameter values simultaneously. We remark that, though in [18] the observation times are random, they are not stopping times, and the perspective is quite different from ours: the authors assume that observations at all times are, in principle, available, and aim at choosing adaptively a finite number of them to optimize the asymptotic variance of the estimator. In our setting the observation times are stopping times and are not chosen by the user in an anticipative way.
Several works are dedicated to the inference problem with observations at stopping times, but under quite restrictive assumptions on those times, in contrast to our general setting. More precisely, in [4, 11] the authors assume that the time increment τ^n_i − τ^n_{i−1} depends only on the information up to τ^n_{i−1} and on extra independent noise. A similar condition is considered in [26], and it can take the form of strongly predictable times (τ^n_i is known at time τ^n_{i−1}). In [5], the time increments are simply independent and identically distributed. In [14, 16], the authors consider observation times given by exit times of S from an interval in dimension 1: because such one-dimensional exit times can be explicitly approximated, they are able to establish some CLT results for the realized variance. For potentially more general stopping times, but still in dimension 1, [30] provides CLT results under the extra condition of convergence of the quarticity and tricity. To summarize, all the above results consider stopping times with significant restrictions and, in any case, in a one-dimensional setting for S. In the current study we aim at overcoming these restrictions.

Our contributions.
• To the best of our knowledge, this is the first work that analyzes the problem of parametric inference for multidimensional diffusions based on observations at general stopping times.
• Under mild assumptions we construct a sequence of estimators and prove its consistency for a large class of observation grids which, following [23, Remark 1], contains most of the examples of practical interest.
• Using our asymptotic analysis and applying the results of [21], we prove the weak convergence of the renormalized error to a mixture of normal variables, for a quite general class of random observations, which includes exit times from general random domains and allows a combination of endogenous and independent sources of randomness. In addition, we explicitly compute the limit distribution. The asymptotic limit is, in general, biased, and we characterize both the asymptotic bias and the asymptotic variance. Such a bias has not been observed in previous parametric inference problems, owing to the centering property of Gaussian increments for strongly predictable grids.
• We provide a uniform lower bound on the limit variance in the case of a 1-dimensional parameter ξ ∈ Ξ, over the set of observation grids for which the weak convergence to a mixture of normal variables without bias holds. We also prove that this bound is sharp in this class of grids. To the best of our knowledge, this result is new for parametric inference for diffusions, and it allows, for instance, a discussion of optimal sampling procedures.
Last, for other applications and results on stopping times in the high-frequency regime, see [15, 20, 23].
Outline of the paper. In Section 2 we present the model for the observed process S, the random observation grids, and construct a sequence of estimators (ξ n ) n≥0 based on the discretized version of the integrated Kullback-Leibler divergence in the Gaussian case. Section 3 is devoted to the statements of the main results of the paper. We continue with the proofs in Section 4. Several technical points are postponed to Section A.

The model
Let (B_t)_{0≤t≤T} be a d-dimensional Brownian motion on a probability space (Ω, F, (F_t)_{0≤t≤T}, P) with a filtration (F_t)_{0≤t≤T} verifying the usual conditions of being right-continuous and complete. By |·| we denote the Euclidean norm on matrix and tensor spaces. Let Mat_{m,n} be the space of real m × n matrices, and denote by S^{++}_m (resp. S^+_m) the set of positive (resp. non-negative) definite symmetric real m × m matrices.
Let Ξ ⊂ R^q, q ≥ 1, be a convex compact set with non-empty interior (to avoid degenerate cases). We fix a parameter ξ* ∈ Ξ \ ∂Ξ (where ∂Ξ is the boundary of Ξ). The observed process is a d-dimensional Brownian semimartingale (S_t)_{0≤t≤T} of the form

S_t = S_0 + ∫_0^t b_s ds + ∫_0^t σ(s, S_s, ξ*) dB_s,   (1.1)

verifying the assumptions below. In what follows we denote for simplicity σ_t(ξ) := σ(t, S_t, ξ), and we let c_t(·) := σ_t(·) σ_t(·)^T. We suppose, in addition, the following parameter identifiability assumption (H_ξ).

Random observation grids
We consider a sequence of random observation grids T := {T^n = (τ^n_i)_{0≤i≤N^n_T} : n ≥ 0} on the interval [0, T] and suppose that for each n, only the values (τ^n_i, S_{τ^n_i})_{0≤i≤N^n_T} are available for the parameter estimation: these are the observation data. For each n, (τ^n_i : 0 ≤ i ≤ N^n_T) is a sequence of F-stopping times and N^n_T is an a.s. finite random variable. Here we do not assume further information on the structure of these stopping times (e.g. that they are hitting times for S of such or such boundary, and so on): we are aware that having this structural information would presumably be beneficial for the inference problem, by making the estimation more accurate. Proving optimality results (like in [10, 19]) given the sequence of observations {(τ^n_i, S_{τ^n_i})_{0≤i≤N^n_T} : n ≥ 0} is so far out of reach, and we leave these problems for further investigation. However, we establish a partial optimality result in Section 3.4.
Our statistical analysis is based on asymptotic techniques, developed recently in [20, 23, 24], for admissible random discretization grids in the setting of quadratic variation minimization. In this work we adapt these techniques to the problem of parametric estimation.
We introduce assumptions that depend on the choice of a positive sequence (ε_n)_{n≥0} with ε_n → 0 and of a parameter ρ_N ≥ 1 (compare to [23]): an oscillation assumption (A_osc.S), requiring that a certain non-negative random variable, which controls the fluctuations of S between successive observation times at the scale ε_n, is a.s. finite; and a counting assumption (A_N), which controls the renormalized number of observation times through ε_n and ρ_N. Let us now fix a sequence of discretization grids T. We assume, for some ρ_N ∈ [1, (1 + 2η_b) ∧ 4/3), the following hypothesis (H_T): from any subsequence one can extract a further subsequence, together with a sequence ε_n → 0, for which the assumptions (A_osc.S)-(A_N) (with the given ρ_N) are verified. Remark that the class of grids verifying (H_T) is very general and covers most of the settings considered in previous works on inference for diffusions. At the same time, it allows new types of grids that were not studied before. In particular, it includes:
• The sequences of deterministic or strongly predictable discretization grids for which the time steps are controlled from below and from above and for which the step size tends to zero. Here ρ_N > 1, see [23, Remark 1].
• The sequences of grids based on exit times from general random domains and, possibly, extra independent noise. Namely, let {(D^n_t)_{0≤t≤T} : n ≥ 0} be a sequence of general random adapted processes with values in the set of domains of R^d, continuous and converging (in a suitable sense, see the details in [21, Section 2.2]) to an adapted continuous domain-valued process (D_t)_{0≤t≤T}. Consider also an i.i.d. family of random variables (U_{i,n})_{n,i∈N}, uniform on [0, 1], and an arbitrary P ⊗ B-measurable function G. Then the discretization grids built from the successive exit times of S from the domains D^n, combined with G, the variables (U_{i,n}) and negligible perturbations (Δ_{n,i})_{n,i∈N}, verify the assumption (H_T) with ρ_N = 1 (see [21, Section 3.3]). This class of discretization grids allows a coupling of the endogenous noise generated by hitting times and of extra independent noise given, for example, by a Poisson process with stochastic intensity (see [21, Section 2.2.3]). In addition, we can rely on a CLT for a general discretization error term based on such grids (see [21, Theorem 2.4]).
The optimal observation grid in Section 3.4 is of the above form, taking for D^n some ellipsoid, with G(·) = +∞ and Δ_{n,i} = 0.
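As an illustration, the simplest instance of such an endogenous grid in dimension 1 observes S at the successive exit times of a symmetric interval of half-width ε around the last observation point. The toy simulation below (our own sketch with hypothetical parameter choices, not the construction of [21]; exits are detected approximately on a fine simulation grid) shows that the number of observations N^n_T then scales like T/ε², consistently with ρ_N = 1:

```python
import math
import random

def hitting_time_grid(T=1.0, eps=0.05, dt=1e-5, seed=1):
    """Observe a 1-d Brownian path at the successive exit times of the
    symmetric moving interval (S_last - eps, S_last + eps)."""
    rng = random.Random(seed)
    t, s, last = 0.0, 0.0, 0.0
    times = [0.0]
    step = math.sqrt(dt)
    while t < T:
        t += dt
        s += step * rng.gauss(0.0, 1.0)
        if abs(s - last) >= eps:  # exit detected on the fine simulation grid
            times.append(t)
            last = s
    return times

eps = 0.05
N = len(hitting_time_grid(eps=eps)) - 1
# For Brownian motion, the mean exit time of (-eps, eps) is eps**2,
# so N * eps**2 stays of order T = 1 as eps -> 0 (the case rho_N = 1).
print(N, round(N * eps * eps, 2))
```

Observing on a finer fine-grid (smaller dt) sharpens the exit detection but does not change the N ≈ T/ε² scaling.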
The subsequence formulation of the assumption (H_T) is motivated by the following classical subsequence principle (Lemma 2.1): for real-valued random variables, X_n →_P X as n → +∞ if, and only if, from any subsequence (X_{ι(n)})_{n≥0} of (X_n)_{n≥0} one can extract a further subsequence converging to X almost surely. It allows us to first prove a.s. results for sequences of observation grids verifying (A_osc.S)-(A_N) with Σ_{n≥0} ε²_n < +∞, and then to pass to the equivalent results in probability in the general case.
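For the reader's convenience, the two implications behind this classical principle can be sketched as follows (a standard textbook argument, not specific to this paper):

```latex
\textbf{Proof sketch.}
($\Rightarrow$) Given a subsequence $(X_{\iota(n)})_{n\ge 0}$, since
$X_{\iota(n)} \to X$ in probability we may choose indices $n_k \uparrow \infty$
with
\[
  \mathbf{P}\big(|X_{\iota(n_k)} - X| > 2^{-k}\big) \le 2^{-k};
\]
by the Borel--Cantelli lemma, only finitely many of these events occur a.s.,
hence $X_{\iota(n_k)} \to X$ a.s.
($\Leftarrow$) If $X_n \not\to X$ in probability, there exist
$\varepsilon, \delta > 0$ and a subsequence $(X_{\iota(n)})_{n\ge 0}$ with
$\mathbf{P}\big(|X_{\iota(n)} - X| > \varepsilon\big) \ge \delta$ for all $n$;
no further subsequence of it can converge to $X$ a.s., since a.s.\ convergence
implies convergence in probability. \qed
```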

Sequence of estimators
Suppose that T := {T^n : n ≥ 0} is a sequence of random grids verifying (H_T) for some ε_n → 0 and ρ_N ∈ [1, (1 + 2η_b) ∧ 4/3). For any process H denote (omitting the dependence on n)

ΔH_{τ^n_i} := H_{τ^n_i} − H_{τ^n_{i−1}},   Δτ^n_i := τ^n_i − τ^n_{i−1}.   (2.5)

Parametric inference for a discretely observed process typically requires a discrete approximation of some criterion whose optimization yields the true parameter ξ*. A standard approach is to approximate the likelihood of S_{τ^n_0}, ..., S_{τ^n_i}, or equivalently the distribution of ΔS_{τ^n_i} conditionally on S_{τ^n_0}, ..., S_{τ^n_{i−1}}. Gaussian approximations are often used when the distance between observation times is small; see, for instance, [17]. The optimality of Gaussian-based likelihood approximations in the case of regular observation times has been proved in [10, 19]. Although the distribution of S_τ for a stopping time τ may be quite different from Gaussian, we are inspired by the same approach, because of the flexibility and tractability of the resulting contrast estimator with respect to the choice of the observation times τ^n_i; however, below we present a slightly different interpretation of the same minimization criterion, since in the stopping time case the distribution of the process increments is not necessarily close to Gaussian. We also generalize the criterion to account for a non-equidistant distribution of the discretization points over [0, T]. Denote by p_Σ(x) the density at x of a centered d-dimensional Gaussian variable N_d(0, Σ) with covariance matrix Σ (assumed to be non-degenerate), and denote the Kullback-Leibler (KL) divergence between the variables N_d(0, Σ_1) and N_d(0, Σ_2) by

D_KL(Σ_1, Σ_2) := (1/2) [ Tr(Σ_2^{−1} Σ_1) − d + log(det Σ_2 / det Σ_1) ].   (2.6)

Fix a continuous positive weight process (ω_t)_{0≤t≤T}. The divergence D_KL(Σ_1, Σ_2) is always non-negative and equals 0 if and only if Σ_1 = Σ_2. Thus, in view of (H_ξ), the minimization of ∫_0^T D_KL(c_t(ξ*), c_t(ξ)) ω_t dt naturally yields the true parameter ξ*. Our goal is to construct a discretized version of this criterion based on the observations of S.
We write

U(ξ) := ∫_0^T D_KL(c_t(ξ*), c_t(ξ)) ω_t dt,   (2.7)

which, expanding (2.6), can be rewritten as

U(ξ) = C_0 + (1/2) ∫_0^T [ Tr(c_t(ξ)^{−1} c_t(ξ*)) + log det c_t(ξ) ] ω_t dt,   (2.8)

where C_0 is independent of ξ and where the term ∫_0^T Tr(c_t(ξ)^{−1} c_t(ξ*)) ω_t dt represents a weighted quadratic variation, since c_t(ξ*) dt = d⟨S⟩_t. Thus we define the following discretized version of U(·), that uses only the observations (τ^n_i, S_{τ^n_i}):

U_n(ξ) := (1/2) Σ_{i≥1: τ^n_i ≤ T} [ (ΔS_{τ^n_i})^T c_{τ^n_{i−1}}(ξ)^{−1} ΔS_{τ^n_i} + log det c_{τ^n_{i−1}}(ξ) Δτ^n_i ] ω_{τ^n_{i−1}}.   (2.9)

The random function U_n(·) plays the role of a contrast function: it is asymptotically equal to U(·), whose minimum is achieved at ξ*. In the case of regular grids and ω_t = 1, the contrast (2.9) coincides with [17, eq. (3)]. Define the sequence of estimators (ξ_n)_{n≥0} by

ξ_n := Argmin_{ξ∈Ξ} U_n(ξ)   (2.10)

(if the minimizing set of U_n(·) is not a single point, we take any of its elements). We expect the minimizer of U_n(·) to asymptotically attain the minimizer ξ* of U(·). Note that the user is free to choose the form of the process ω_t. While the rigorous optimization of the choice of ω_t given only the observations (τ^n_i, S_{τ^n_i}, 0 ≤ i ≤ N^n_T) is complicated, it seems reasonable to increase ω_t on the time intervals where the observation frequency is higher. We have not investigated further in this direction.
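To make the construction concrete, here is a minimal 1-d sketch of the contrast minimization, under simplifying and hypothetical assumptions of our own: constant diffusion coefficient σ(t, x, ξ) = ξ, a regular observation grid, and weight ω_t = 1 (all helper names are ours, not from the paper):

```python
import math
import random

def kl_gaussian(c1, c2):
    """Scalar case of the KL divergence between N(0, c1) and N(0, c2)."""
    return 0.5 * (c1 / c2 - 1.0 + math.log(c2 / c1))

def simulate_path(xi, T=1.0, n=1000, seed=7):
    """Exact simulation of dS_t = xi dB_t on a regular grid (toy model)."""
    rng = random.Random(seed)
    dt = T / n
    s, obs = 0.0, [(0.0, 0.0)]
    for i in range(1, n + 1):
        s += xi * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        obs.append((i * dt, s))
    return obs

def contrast(xi, obs):
    """Discretized Gaussian KL contrast: sum over increments of
    log c(xi) * dtau + dS^2 / c(xi), with c(xi) = xi**2 and omega = 1."""
    c = xi * xi
    u = 0.0
    for (t0, s0), (t1, s1) in zip(obs, obs[1:]):
        dtau, ds = t1 - t0, s1 - s0
        u += math.log(c) * dtau + ds * ds / c
    return u

obs = simulate_path(xi=0.8)
# crude Argmin over a compact parameter set Xi = [0.1, 2.0]
grid = [0.1 + 0.002 * k for k in range(951)]
xi_hat = min(grid, key=lambda x: contrast(x, obs))
print(round(xi_hat, 2))  # should be close to the true value 0.8
```

In this toy case the minimizer in c is the realized quadratic variation divided by T, so the contrast estimator reduces to the usual realized-volatility estimator.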

Main results
For the subsequent convergences, we adopt the following natural notations. By O^a.s._n(1) (resp. o^a.s._n(1)) we denote any a.s. bounded (resp. a.s. converging to 0) sequence of random variables; in addition, we write O^a.s._n(x) = x O^a.s._n(1) and o^a.s._n(x) = x o^a.s._n(1). Similarly, we write o^P_n(1) for sequences converging to 0 in probability. Besides, we introduce a convenient short notation for random vectors written as mixtures of Gaussian random variables. Given a (possibly random) matrix V ∈ S^+_m, we denote by N(0, V) a random variable equal in distribution to V^{1/2} G, where V^{1/2} is the principal square root of V and G is a centered Gaussian m-dimensional vector with covariance matrix Id_m, independent from everything else.
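In the scalar case this notation can be illustrated numerically: below, V is drawn at random (a log-normal choice of ours, purely for illustration) and the mixture N(0, V) is sampled as V^{1/2} G; the unconditional variance is then E[V]:

```python
import math
import random

def sample_mixed_normal(rng):
    """One draw of N(0, V) with random scalar variance V = exp(Z),
    Z standard normal, and G standard normal independent of V."""
    v = math.exp(rng.gauss(0.0, 1.0))   # random variance, E[V] = exp(1/2)
    g = rng.gauss(0.0, 1.0)             # independent Gaussian factor
    return math.sqrt(v) * g

rng = random.Random(0)
xs = [sample_mixed_normal(rng) for _ in range(200000)]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs)
print(round(mean, 2), round(var, 2))  # mean near 0, variance near exp(0.5)
```

Note that such a mixture is symmetric but, unlike a plain Gaussian, has heavier tails when V is genuinely random.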

Consistency
The following result states the convergence of the estimators (ξ_n)_{n≥0} in probability to ξ* for any sequence of random observation grids verifying (H_T). Its proof is postponed to Section 4.1.

Asymptotic error analysis
We now proceed with the asymptotic analysis of the error sequence (ξ_n − ξ*)_{n≥0}. Recall that D_KL(Σ_1, Σ_2), defined in (2.6), is always non-negative and equals 0 if and only if Σ_1 = Σ_2. Thus, for any t ∈ [0, T], the point ξ* ∈ Ξ \ ∂Ξ is a minimum of ξ ↦ D_KL(c_t(ξ*), c_t(ξ)), which implies that ∇²_ξ D_KL(c_t(ξ*), c_t(ξ))|_{ξ=ξ*} is positive semidefinite a.s. for all t ∈ [0, T]. We introduce the following assumption:

(H_H): There exists a subset I ⊂ [0, T] of positive Lebesgue measure such that ∇²_ξ D_KL(c_t(ξ*), c_t(ξ))|_{ξ=ξ*} is positive definite for all t ∈ I.

Note that in practice, since ξ* is not known, the verification of (H_H) is typically required for all possible values of ξ* ∈ Ξ \ ∂Ξ. Assumption (H_H) in particular implies that

H_T := ∫_0^T ∇²_ξ D_KL(c_t(ξ*), c_t(ξ))|_{ξ=ξ*} ω_t dt = ∇²_ξ U(ξ)|_{ξ=ξ*}   (3.1)

is positive definite, where the second equality follows from (2.7) (note that we can interchange differentiation and integration via the dominated convergence theorem).
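In the scalar case (d = q = 1) this positivity can be checked numerically. For a smooth parametrization c(ξ) of the squared diffusion coefficient (the particular choice below is ours, for illustration only), the second derivative of ξ ↦ D_KL(c(ξ*), c(ξ)) at ξ* equals (c'(ξ*)/c(ξ*))²/2, positive as soon as c'(ξ*) ≠ 0:

```python
import math

def kl_scalar(c1, c2):
    """KL divergence between N(0, c1) and N(0, c2), scalar case."""
    return 0.5 * (c1 / c2 - 1.0 + math.log(c2 / c1))

def c(xi):
    # hypothetical smooth parametrization of the squared diffusion coefficient
    return 1.0 + xi * xi

xi_star, h = 0.7, 1e-3
f = lambda xi: kl_scalar(c(xi_star), c(xi))
# central finite difference for the second derivative at xi_star
second = (f(xi_star + h) - 2.0 * f(xi_star) + f(xi_star - h)) / (h * h)
exact = 0.5 * (2.0 * xi_star / c(xi_star)) ** 2  # (c'(xi*)/c(xi*))^2 / 2
print(round(second, 3), round(exact, 3))
```

The closed form follows from the chain rule: the first derivative of the KL divergence vanishes at c = c(ξ*), and its second derivative in c there equals 1/(2 c(ξ*)²).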
In what follows we adopt the following conventions. The gradient of an R-valued function is assumed to be a column vector; for a Mat_{d,d}-valued function, the spatial gradient ∇_x is understood componentwise. Here comes the main result of this section. It is a universal decomposition of the estimation error, valid for any stopping time grids as in (H_T), which will be the starting point for establishing a CLT later. The proof is done in Section 4.2.

CLT in the case of ellipsoid exit times
We start with the following lemma, which plays an important role in the sequel: it states that a certain matrix equation admits exactly one solution x(y) ∈ S^+_d. Theorem 3.2 shows that it is enough to study the convergence in distribution of √(N^n_T) Z^n_T in order to obtain such a convergence for √(N^n_T)(ξ_n − ξ*). Indeed, from (3.4) we get a remainder of order o^P_n(√(N^n_T) ε^{ρ_N}_n) = o^P_n(1), by (H_T) and the subsequence principle (Lemma 2.1). This makes possible the direct application of general results on CLTs for discretization errors of the form (3.5); we refer to [21] for a discussion and references on the subject.
Since we are particularly interested in stopping time discretization grids in the multidimensional case, we use [21, Theorem 2.4], where a CLT for discretization errors of the form (3.5) with general M_t and A_t has been proved in a quite general setting. We state a particular case of this setting, namely exit times from random ellipsoids (as defined in (3.8)). This example is, in particular, used in Section 3.4.
We consider the setting of [21, Section 2.2]: the observation times are the successive exit times of S from random ellipsoids, described by a continuous adapted process (Σ_t)_{0≤t≤T} with values in S^{++}_d, and the variable C_σ verifies E(C⁴_σ) < +∞ (this condition, in particular, holds for a diffusion process with bounded coefficients b and σ such that their derivatives are also bounded). Define the sequence of discretization grids T = {T^n : n ≥ 0} by (3.8). To simplify, we write σ_t := σ_t(ξ*) until the end of this section. Define the process m_t := Tr(σ^T_t Σ_t σ_t)^{−1}. Following [21], define, for any t ∈ [0, T] and any measurable function f, the limit statistics of the rescaled exit problem driven by an extra d-dimensional Brownian motion W, independent from everything else; the resulting matrices are continuous, symmetric and non-negative definite. Define a Mat_{m,m}-valued process (K_t)_{0≤t≤T} in terms of the solution of the matrix equation (3.6). Remark that the process (Q_t)_{0≤t≤T} defined in [21, eq. (2.18)] is equal to 0 in our case since the domains D_t and D^n_t are symmetric, see [21, Section 2.4]. Also note that the matrix equation (3.6) may easily be solved numerically, see the details in [20, Section A.4]; an analytic solution, however, is available only in dimension 1. In general (especially in the multidimensional case), the computation of K is hardly explicit and requires numerical methods, such as Monte Carlo schemes suitable for statistics of stopped processes, see e.g. [22]. The following result is an application of [21, Theorem 2.4 and its proof].

(3.11), where H_T is defined in (3.1). More specifically, for Z^n, M^n, A^n defined in (3.5), we have convergences from which (3.11) may easily be deduced (here μ(·, ·) denotes the distance between domains, as defined in [21, Section 2]).
The latter bound can be controlled uniformly in t and n in view of the continuity and the nondegeneracy of Σ t , Σ n t and the condition (H Σ )-1.

E. Gobet and U. Stazhynski
Finally, [21, (H_G)] is trivial in this case since the function G(·) equals +∞ and Δ_{n,i} = 0 (in the notation of [21]). The other assumptions of [21, Theorem 2.4] follow from (H_Σ)-3. Last, the sign in front of H^{−1}_T in (3.7) does not change the limit distribution, which is symmetric (as that of W).
Note that the drift b does not enter the parameters of the CLT; this is due to the symmetry of the domain defining the observation times.
Because W is independent of everything else, we have an identity in distribution for the limit, written with an extra independent m-dimensional Gaussian random variable N(0, Id_m). In other words, the (random) covariance limit of √(N^n_T)(ξ_n − ξ*) admits an explicit expression.

Optimal uniform lower bound on the limit variance
In this section we assume q = 1, so that Ξ ⊂ R. Our aim is to seek the optimal observation times (among ellipsoid-based stopping times) achieving the lowest possible limit variance. Let X_t(ξ) be the solution of the matrix equation (3.6) associated with the parameter ξ, which is fixed from now on; the corresponding optimal variance is denoted V^opt._T, see (3.13). In the case where the weak convergence of the renormalized error to a mixture of normal variables holds without bias (e.g. for deterministic grids, see [17]; or for hitting times of symmetric boundaries, see [21, Section 2.4] and Theorem 3.4), we prove that V^opt._T is a uniform lower bound on the asymptotic variance of the sequence of estimators (2.10). In addition, this lower bound is tight, in the sense that one can find a sequence of observation times approaching it arbitrarily closely. This is formalized in the following definition.

Definition 1. Let κ_0 > 0. A parametric family of discretization grid sequences {T_κ : κ ∈ (0, κ_0]} is κ-optimal if there exists an a.s. finite random variable C_0, independent of κ, such that √(N^n_T)(ξ_n − ξ*) converges in distribution to a mixture of centered normal variables for all T_κ, and the limit variance V^κ_T associated with T_κ verifies the corresponding closeness condition to V^opt._T, with constant C_0 and accuracy κ.

The subsequent κ-optimal observation times are related to some random ellipsoid hitting times, built as follows. Let χ(·) be a smooth function such that 1_{(−∞,1/2]} ≤ χ(·) ≤ 1_{(−∞,1]}, and denote by χ_κ(x) the corresponding rescaled function; here λ_min(M) stands for the smallest eigenvalue of M ∈ S^+_d. Hence Λ^κ_t(ξ) ∈ S^{++}_d as soon as κ > 0. Recall that under the general assumptions of Theorem 3.2 we have the decomposition (3.4), with Z^n given by (3.5). In view of (3.4), to study the weak convergence of √(N^n_T)(ξ_n − ξ*) we essentially need to consider √(N^n_T) Z^n_T. The result below states that, under standard conditions implying the CLT for √(N^n_T) Z^n_T (and hence for √(N^n_T)(ξ_n − ξ*)), there exists a uniform lower bound on the limit variance; we also show the tightness of this bound in the sense of Definition 1. The CLT conditions take the form (3.14), for some adapted non-negative continuous process (K_t)_{0≤t≤T}; assume also the convergence in probability of the renormalized quantity in (3.14) to an a.s. finite random variable. Then the following holds:

(i) √(N^n_T) Z^n_T converges in distribution to a mixture of centered normal variables N(0, V_T), for some non-negative random variable V_T (the asymptotic variance).
(ii) The asymptotic variance V_T satisfies the uniform lower bound V_T ≥ V^opt._T a.s. Moreover, the lower bound V^opt._T is tight in the following sense: the parametric family of discretization grid sequences {T_κ : κ ∈ (0, 1]}, given for any ε_n → 0 by T_κ = {T^n_κ : n ≥ 0} with T^n_κ = (τ^n_i)_{0≤i≤N^n_T} as in (3.15), is κ-optimal with κ_0 = 1 in the sense of Definition 1.
We remark that the class of discretization grids over which the universal variance lower bound is obtained in Theorem 3.5 includes most of the examples for which a CLT has been established, since conditions of the type (3.14) are quite commonly required (see [29, Chapter IX, Theorem 7.3] for a classical result). Typically, for deterministic or strongly predictable grids these conditions hold with ρ = ρ_N > 1, while in the setting of [21, Section 2.2] we have ρ = ρ_N = 1. See also the discussion in Section 2.1 and [23, Remark 1].
As we may notice, the κ-optimal sequence of discretization grids in (3.15) depends on the unknown parameter ξ*. Besides, the optimal variance V^opt._T in (3.13) also involves ξ*, as well as ω_t: we argued in Section 2.2 that the rigorous optimization of ω_t (to minimize V^opt._T) is out of reach because ξ* is unknown. However, for all these extra optimization steps, a heuristic approach might be used. Namely, in practice one may pre-estimate ξ* on some initial interval [0, T_1] using any reasonable consistent estimator, and then proceed on [T_1, T] with the estimation achieving a limit variance close to the optimum, using this pre-estimator instead of ξ*. A similar methodology has been designed and analyzed in [24]. A thorough analysis of the limit variance in our case would be possible, although quite technical; we naturally expect that such a method would constitute a κ-optimal family of strategies for T_1 = κ² T, in view of the robustness results for the optimal sequence of discretization grids established in [24, Section 3.1].
(ii) For (∇_x σ_t(ξ*))_{0≤t≤T} defined in Section 3.2 and any ρ > 0, an analogous bound holds.

Proof. To prove (i), remark that (S_t)_{0≤t≤T} is Hölder continuous with any exponent smaller than 1/2 by [6, Theorem 5.1]. We conclude by using that σ = σ(t, x, ξ*) is locally Lipschitz in t and x, due to its continuous differentiability, and that (S_t)_{0≤t≤T} is a.s. bounded on [0, T].

To prove (ii), we use the differentiability of σ(t, x, ξ*) in t and x given by (H_S)-1, and write the corresponding estimate for any ρ > 0 and some a.s. finite C_ρ, which finishes the proof.
The next lemma states the a.s. convergence of U n (·) to U (·), as well as the corresponding results for the derivatives ∇ ξ U n (·) and ∇ 2 ξ U n (·).
Proof. Using (2.8) and Lemma A.1, we deduce expressions for the derivatives of U_n. Recall (4.5). Let us first prove that for any ξ ∈ Ξ the convergence (4.6) holds. The convergence of the first term on the right-hand side of (4.5) follows from the standard Riemann integral approximation, using that sup_i Δτ^n_i → 0 a.s.

For the second term, we apply [23, Proposition 3.7]. Hence the convergence (4.6) follows from taking the sum of (4.7) and (4.8). Further, using Lemma A.1 we obtain (4.10). Using (4.3), (4.4) and applying the same reasoning as for the proof of (4.6), we also show the analogous convergences (4.11) for any ξ ∈ Ξ, with some a.s. finite C > 0. This implies that the sequences (U_n(·))_{n≥0}, (∇_ξ U_n(·))_{n≥0} are equicontinuous, and hence the convergences in (4.6) and (4.11) are uniform in ξ ∈ Ξ. We are done.

Proof of Theorem 3.1
First suppose that Σ_{n≥0} ε²_n < +∞ and that the grid sequence T verifies (A_osc.S)-(A_N).
Recall that D_KL(c_t(ξ*), c_t(ξ)) ≥ 0, with equality if and only if c_t(ξ*) = c_t(ξ). From (H_ξ) we have that for any ξ ≠ ξ* the processes c_t(ξ*) and c_t(ξ) are not almost everywhere equal on [0, T]. Hence ξ* is the unique minimum of ∫_0^T D_KL(c_t(ξ*), c_t(ξ)) ω_t dt, and in view of (2.7) we have that a.s. ξ* = Argmin_{ξ∈Ξ} U(ξ).
Finally, the convergence ξ_n →_P ξ* as n → +∞ for T verifying (H_T) with general ε_n → 0 follows from the subsequence principle in Lemma 2.1.

Proof of Theorem 3.2
First suppose that Σ_{n≥0} ε²_n < +∞ and that the grid sequence T verifies (A_osc.S)-(A_N).