Rate of estimation for the stationary distribution of jump-processes over anisotropic Hölder classes

We study the problem of nonparametric estimation of the density of the stationary distribution of a multivariate stochastic differential equation with jumps (X_t), when the dimension d is larger than or equal to 3. From the continuous observation of the sample path on [0, T] we show that, under anisotropic Hölder smoothness constraints, kernel-based estimators can achieve fast convergence rates. In particular, they are as fast as the ones found by Dalalyan and Reiss [9] for the estimation of the invariant density in the case without jumps under isotropic Hölder smoothness constraints. Moreover, they are faster than the ones found by Strauch [29] for the invariant density estimation of continuous stochastic differential equations under anisotropic Hölder smoothness constraints. Furthermore, we obtain a minimax lower bound on the L2-risk for pointwise estimation, with the same rate up to a log(T) term. It implies that, on a class of diffusions whose invariant density belongs to the anisotropic Hölder class we consider, it is impossible to find an estimator with a rate of estimation faster than the one we propose.


Introduction
Diffusion processes with jumps have recently become powerful tools to model various stochastic phenomena in many areas such as physics, biology, medical sciences, social sciences, and economics. In finance, jump-processes were introduced to model the dynamics of exchange rates ([6]), asset prices ([25], [22]), or volatility processes ([5], [15]). Applications of jump-processes in neuroscience can be found, for instance, in [14]. Therefore, stochastic differential equations with jumps are nowadays widely studied by statisticians.
In this work, we aim at estimating the invariant density π associated to the process (X_t)_{t≥0}, solution of the following multivariate stochastic differential equation with Lévy-type jumps:
X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t a(X_s)\,dW_s + \int_0^t \int_{R^d \setminus \{0\}} \gamma(X_{s^-})\, z\, \tilde{\mu}(ds, dz),   (1)
where W is a d-dimensional Brownian motion and \tilde{\mu} a compensated Poisson random measure with possibly infinite jump activity. We assume that a continuous record of observations X^T = (X_t)_{0≤t≤T} is available.
at some point x ∈ R^d in the anisotropic context is given by
\hat{\pi}_{h,T}(x) = \frac{1}{T} \int_0^T \prod_{j=1}^d \frac{1}{h_j} K\Big(\frac{x_j - X_u^j}{h_j}\Big)\, du,
where h = (h_1, ..., h_d) is a multi-index bandwidth. First of all, we extend the previous results by proving the following upper bound for the mean squared error:
\mathbb{E}\big[ |\hat{\pi}_{h,T}(x) - \pi(x)|^2 \big] \le c \Big(\frac{\log T}{T}\Big)^{\frac{2\bar{\beta}_3}{2\bar{\beta}_3 + d - 2}},   (2)
where β_1 ≤ β_2 ≤ ... ≤ β_d and \frac{1}{\bar{\beta}_3} := \frac{1}{d-2} \sum_{l=3}^d \frac{1}{\beta_l}. As by construction \bar{\beta}_3 is bigger than \bar{\beta}, the convergence rate here above is in general faster than the one proposed in [2].
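To make this comparison concrete, the following short numerical check (a sketch only; the smoothness values are purely illustrative, and the \bar{\beta}-based exponent is written in the same functional form solely for the comparison) computes both exponents for a sample anisotropic smoothness vector.

```python
# Compare the exponent driven by the mean smoothness over beta_3, ..., beta_d
# with the exponent driven by the harmonic mean over all d directions.
# The smoothness values below are illustrative.
beta = [1.5, 2.0, 3.0, 3.0]                 # beta_1 <= ... <= beta_d, here d = 4
d = len(beta)

beta_bar = d / sum(1.0 / b for b in beta)               # harmonic mean over all directions
beta_bar_3 = (d - 2) / sum(1.0 / b for b in beta[2:])   # mean smoothness over beta_3, ..., beta_d

exp_new = 2 * beta_bar_3 / (2 * beta_bar_3 + d - 2)     # exponent of (log T / T) in (2)
exp_old = 2 * beta_bar / (2 * beta_bar + d - 2)         # same functional form with beta_bar

print(f"beta_bar   = {beta_bar:.3f} -> exponent {exp_old:.3f}")
print(f"beta_bar_3 = {beta_bar_3:.3f} -> exponent {exp_new:.3f}")
# Since beta_bar_3 >= beta_bar, the new exponent is at least as large,
# i.e. the corresponding rate in T is at least as fast.
```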
After that, we want to understand whether it is possible to improve the convergence rate by using other density estimators, and what the best possible rate of convergence is. To answer, the idea is to look for lower bounds for the minimax risk associated to the anisotropic Hölder class. For the computation of lower bounds, we introduce a jump-process simpler than (1), in which the diffusion coefficient a and the jump coefficient γ are constants. We moreover assume the intensity of the jumps to be finite. We anticipate here the definition of the minimax risk that will be given in (14):
\mathcal{R}_T(\beta, L) := \inf_{\tilde{\pi}_T} \sup_{b \in \Sigma(\beta, L)} \mathbb{E}_b\big[ |\tilde{\pi}_T(x_0) - \pi_b(x_0)|^2 \big],
where the infimum is taken over all possible estimators of the invariant density and Σ(β, L) gathers the drifts for which the considered process is stationary and whose stationary measure has the prescribed Hölder regularity. In order to prove a lower bound for the minimax risk, the knowledge of the link between b and π_b is crucial. In the absence of jumps, considering reversible diffusion processes with unit diffusion part (as in both [11] and [32]), such a connection is explicit: the drift is of gradient form, b = ∇V, and π_b is proportional to e^{2V}, where V ∈ C^2(R^d) is referred to as the potential. Once jumps are added, this is no longer true and, in our framework, it is challenging to get a relation between b and π_b. The idea is to write the drift b as a function of π_b knowing that they must satisfy A*_b π_b = 0, where A*_b is the adjoint operator of A, the generator of the diffusion X solution to (3) (see Proposition 2 below; a similar argument can also be found in [12]). We are in this way able to prove the following main result:
\mathcal{R}_T(\beta, L) \gtrsim T^{-\frac{2\bar{\beta}_3}{2\bar{\beta}_3 + d - 2}},
where we recall that β_1 ≤ β_2 ≤ ... ≤ β_d and \frac{1}{\bar{\beta}_3} = \frac{1}{d-2} \sum_{l=3}^d \frac{1}{\beta_l}. It follows that, on a class of diffusions X whose invariant density belongs to the anisotropic Hölder class we are considering, it is impossible to find an estimator with a rate of estimation better than T^{-\bar{\beta}_3/(2\bar{\beta}_3 + d - 2)} for the pointwise L^2 risk. Comparing the lower bound here above with the upper bound in (2), we observe that, up to a logarithmic term, the two convergence rates are the same. Furthermore, we present some numerical results in dimension 3. We show that the variance depends only on the biggest bandwidth. The simulations match the theory and illustrate that we can remove the two smallest bandwidths, which are associated to the smallest smoothnesses. It implies we get a convergence rate which does not depend on the two smallest smoothnesses β_1 and β_2.
The outline of the paper is the following. In Section 2 we introduce the model and give the assumptions, while in Section 3 we propose the kernel estimator for the estimation of the invariant density and state the upper bound for the mean squared error. In Section 4 we complement it with lower bounds for the minimax risk, while in Sections 5 and 6 we provide, respectively, the proofs of the upper and lower bounds. Some technical results are moreover proved in Section 7.

Model
We consider the question of nonparametric estimation of the invariant density of a d-dimensional diffusion process X, assuming that a continuous record of the process up to time T is available. The diffusion is given as a strong solution of the following stochastic differential equation with jumps:
X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t a(X_s)\,dW_s + \int_0^t \int_{R^d \setminus \{0\}} \gamma(X_{s^-})\, z\, \tilde{\mu}(ds, dz),   (4)
where the coefficients are such that b : R^d → R^d, while a and γ map R^d into the set of d × d matrices. The process W = (W_t, t ≥ 0) is a d-dimensional Brownian motion and µ is a Poisson random measure on [0, T] × R^d associated to the Lévy process L = (L_t)_{t∈[0,T]}, with L_t := \int_0^t \int_{R^d} z\, \tilde{\mu}(ds, dz). The compensated measure is \tilde{\mu} = \mu - \bar{\mu}. We suppose that the compensator has the following form: \bar{\mu}(dt, dz) := F(dz)\,dt, where conditions on the Lévy measure F will be given later. The initial condition X_0, W and L are independent. In the sequel, we will denote \tilde{a} := a \cdot a^T.

Assumptions
We want first of all to show an upper bound on the mean squared error, as we will see in detail in Section 5. To do that, we need the following assumptions to hold:
A1: The functions b(x), γ(x) and \tilde{a}(x) are globally Lipschitz and, for some c ≥ 1,
c^{-1} I_{d×d} \le \tilde{a}(x) \le c\, I_{d×d},
where I_{d×d} denotes the d × d identity matrix. Denoting by |·| and ⟨·,·⟩ respectively the Euclidean norm and the scalar product in R^d, we suppose moreover that there exists a constant c > 0 such that, ∀x ∈ R^d, |b(x)| ≤ c.
As shown in Lemma 2 of [2], A2 ensures, together with the last point of A3, the existence of a Lyapunov function, while the second and third points of A3 imply the irreducibility of the process. The process X therefore admits a unique invariant distribution µ and the ergodic theorem holds. We assume the invariant probability measure µ of X to be absolutely continuous with respect to the Lebesgue measure and from now on we will denote its density by π: dµ = π dx. Our goal is to propose an estimator of the invariant density and to study its convergence rate. We start our analysis by introducing the natural estimator in this context and by analysing upper bounds for the mean squared error. Then, we investigate the existence of a lower bound for the minimax risk.

Estimator and upper bound
In this section we introduce the expression of our estimator of the stationary measure π of the stochastic equation with jumps (4) in an anisotropic context. After that, we present the rate of convergence the estimator achieves, depending on the smoothness of π. The notion of anisotropy plays an important role. Indeed, the smoothness properties of elements of a function space may depend on the chosen direction of R^d. The Russian school considered anisotropic spaces from the beginning of the theory of function spaces in the 1950s-1960s (in [27] the author surveys these developments). However, results on minimax rates of convergence in classical statistical models over anisotropic classes remained scarce for a long time.
We work under the following anisotropic smoothness constraints.
for D_i^k g denoting the k-th order partial derivative of g with respect to the i-th component, \lfloor β_i \rfloor denoting the largest integer strictly smaller than β_i and e_1, ..., e_d denoting the canonical basis of R^d.
We deal with the estimation of the density π belonging to the anisotropic Hölder class H_d(β, L). Given the observation X^T of a diffusion X, solution of (4), we propose to estimate the invariant density π by means of a kernel estimator. We introduce a kernel function K : R → R satisfying
\int_{R} K(x)\,dx = 1, \qquad \int_{R} x^l K(x)\,dx = 0 \ \ \text{for all } l \in \{1, ..., M\},
with M ≥ max_i β_i. Denoting by X_t^j, j ∈ {1, ..., d}, the j-th component of X_t, t ≥ 0, a natural estimator of π ∈ H_d(β, L) at x = (x_1, ..., x_d)^T ∈ R^d in the anisotropic context is given by
\hat{\pi}_{h,T}(x) := \frac{1}{T} \int_0^T K_h(x - X_u)\,du, \qquad K_h(y) := \prod_{j=1}^d \frac{1}{h_j} K\Big(\frac{y_j}{h_j}\Big),   (5)
where h = (h_1, ..., h_d) is a multi-index bandwidth and is small. In particular, we assume h_i < 1/2 for any i ∈ {1, ..., d}. The asymptotic behaviour of the estimator relies on the standard bias-variance decomposition. Hence, we need an evaluation of the variance of the estimator, as in the next proposition, which we prove in Section 5. One can remark that in [32], where a continuous reversible diffusion process with unit diffusion coefficient is considered, the author relies on functional inequalities (of Poincaré and Nash type) to get an upper bound on the variance of the estimator. The main advantage of using functional inequalities is that they allow the constants involved in the upper bound of the variance to be controlled uniformly. However, this approach is restricted to the symmetric diffusion framework and so it cannot be applied in our setting. To overcome this difficulty we derive upper bounds on the variance of our estimator by exploiting the mixing properties of X. In particular, the proof of the proposition below relies on a bound on the transition density (see Lemma 1 in [2]) and on the exponential ergodicity and the exponential β-mixing property of the process X (as established in Lemma 2 of [2]). However, this approach has some disadvantages, above all the fact that, as the upper bounds rely on mixing properties, the constants depend on the coefficients. Hence, it is very challenging to understand how the constants involved can be controlled uniformly, and this is still an open question. Proposition 1. Suppose that A1-A3 hold. If π is bounded and \hat{\pi}_{h,T} is the estimator given in (5), then there exists a constant c independent of T such that the variance of \hat{\pi}_{h,T}(x) satisfies the bound (6). We underline that, in the upper bound of the variance here above, it would have been possible to remove, in the denominator, the contribution of any two bandwidths. We arbitrarily choose to remove the contribution of h_1 and h_2 since, in the bias term, they are associated to β_1 and β_2, which are the smallest values of smoothness (see Theorem 1 below) and so they provide the strongest constraints.
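In practice the time integral in (5) is computed from a discretized trajectory by a Riemann sum. The following sketch shows one way to do this; the Gaussian kernel is chosen purely for illustration and does not satisfy the vanishing-moment conditions required above, and the synthetic path in the usage example merely stands in for the observation X^T.

```python
import numpy as np

def kernel_invariant_density(path, x, h, dt):
    """Riemann-sum approximation of the anisotropic kernel estimator (5).

    path : array of shape (n, d), discretized observation of X on [0, T]
    x    : array of shape (d,), evaluation point
    h    : array of shape (d,), multi-index bandwidth (h_1, ..., h_d)
    dt   : time step of the discretization, so that T = n * dt
    """
    path, x, h = np.asarray(path), np.asarray(x), np.asarray(h)
    # Gaussian kernel, for illustration only (no higher-order vanishing moments).
    K = lambda u: np.exp(-u ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    T = path.shape[0] * dt
    # product kernel K_h(x - X_s) evaluated along the path, shape (n,)
    weights = np.prod(K((x - path) / h) / h, axis=1)
    # (1/T) times the integral over [0, T], approximated by a Riemann sum
    return weights.sum() * dt / T

# Minimal usage example with a synthetic 3-dimensional path.
rng = np.random.default_rng(1)
fake_path = rng.standard_normal((10_000, 3)).cumsum(axis=0) * 0.01
print(kernel_invariant_density(fake_path, x=np.zeros(3), h=np.array([0.3, 0.3, 0.3]), dt=0.01))
```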
One may wonder about the origin of the logarithmic term in the upper bound (6). We will see in the proof of Proposition 1 that it is possible to bound the absolute value of the covariance k(s) in several ways, indexed by l ∈ {0, ..., d}. Then, we need to integrate such a bound over the time s. When s is small the best choice consists in taking l = 0, while for s far away from a neighbourhood of 0 it is convenient to take l = d. As we will see in the proof of Proposition 1, it is possible to make the bound on the variance smaller by considering also the case that stands in between, for which s is not zero but can be arbitrarily small. Here the best choice is to take l = 2, which provides the logarithm appearing in (6).
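The following heuristic (not part of the proof, and assuming an s^{-1}-type behaviour of the intermediate covariance bound) only illustrates how a logarithm can arise when such a bound is integrated over the intermediate regime.

```latex
% Heuristic only: if, on the intermediate regime [\delta_1, \delta_2), the
% covariance bound obtained with l = 2 behaves like c (\prod_{j\ge 3} h_j)\, s^{-1},
% then integrating over s produces a logarithm,
\int_{\delta_1}^{\delta_2} \frac{c \prod_{j \ge 3} h_j}{s}\, ds
  \;=\; c \Big(\prod_{j \ge 3} h_j\Big)\, \log\frac{\delta_2}{\delta_1},
% which is the kind of logarithmic factor appearing in the bound (6).
```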
To better understand how to choose the bandwidths whose contributions we remove, let us see in more detail what happens for d = 3. In this case, the two strongest constraints are connected to the two smallest bandwidths and so we arbitrarily decide to remove their contributions. It follows that the upper bound on the variance will depend only on the largest bandwidth among h_1, h_2 and h_3, up to a logarithmic term. In particular, for d = 3, equation (6) in Proposition 1 simplifies accordingly, with a companion bound covering the opposite regime of the bandwidths. As this final result is quite surprising, we decide to support it by presenting some simulations. Our goal is to illustrate that the variance depends only on the largest bandwidth. We consider the process X solution of an equation of the form (4); the Brownian motion has variance I_3 and the jump process is a compound Poisson process with intensity 1 and Gaussian jump law N(0, I_3). We evaluate the variance of the kernel estimator for different values of the bandwidths h_1, h_2 and h_3 over the interval [0, T], where we choose T = 100. The process is simulated by an Euler scheme with discretization step Δ_n = 10^{-7} and the integral in the definition of the kernel estimator is replaced by a Riemann sum whose discretization step is once again 10^{-7}. We use a Monte Carlo method based on 2000 replications and we provide a 3d graphic, in which the x- and y-axes carry respectively the values of log_10(h_1) and log_10(h_2), while the z-axis carries the value of log_10(Var(\hat{\pi}_{h,T}(x))). The idea is to fix h_3 bigger than h_1 and h_2 and to see how the variance of our estimator changes as a function of h_1 and h_2, on a logarithmic scale.
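A condensed sketch of this Monte Carlo experiment is given below. It uses a much coarser discretization step and fewer replications than the ones quoted above, the drift b(x) = −x is an assumption made only for this illustration (the simulated equation is not restated here), and the kernel is Gaussian, again purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_jump_diffusion(T, dt, x0, drift, intensity=1.0):
    """Euler scheme for a 3d diffusion with unit Brownian part and compound
    Poisson jumps of the given intensity and N(0, I_3) jump law."""
    n = int(T / dt)
    X = np.empty((n + 1, len(x0)))
    X[0] = x0
    for k in range(n):
        dW = np.sqrt(dt) * rng.standard_normal(len(x0))
        n_jumps = rng.poisson(intensity * dt)
        dJ = rng.standard_normal((n_jumps, len(x0))).sum(axis=0) if n_jumps else 0.0
        X[k + 1] = X[k] + drift(X[k]) * dt + dW + dJ
    return X

def kde_at(X, x, h, dt, T):
    """Riemann-sum approximation of the kernel estimator (5), Gaussian kernel."""
    K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.prod(K((x - X) / h) / h, axis=1).sum() * dt / T

# Illustrative parameters (coarser than in the paper, for speed); the drift is an
# assumption for this sketch only.
T, dt, x = 100.0, 1e-2, np.zeros(3)
h3 = 10 ** -0.5
drift = lambda y: -y

results = {h12: [] for h12 in (1e-2, 10 ** -3.4)}
for _ in range(100):                      # 2000 replications in the paper
    X = simulate_jump_diffusion(T, dt, np.zeros(3), drift)[:-1]
    for h12 in results:
        results[h12].append(kde_at(X, x, np.array([h12, h12, h3]), dt, T))

for h12, vals in results.items():
    print(f"h1 = h2 = {h12:.2e}:  empirical variance of the estimator ~ {np.var(vals):.3e}")
```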
In particular, we take h_3 = 10^{-0.5} and h_1 and h_2 belonging to [10^{-3.4}, 10^{-2}]. Therefore, h_3 is much larger than the other bandwidths and so the variance of the estimator should, according to our results, depend only on h_3. In particular, as h_1 h_2 < h_3^2, from (8) we obtain that the theoretical variance is upper bounded, up to constants, by a quantity depending only on h_3. Even if the 3d graphic reported in Figure 1 does not seem to represent a completely constant function, one can easily see by looking at the z-axis that the variance is only weakly dependent on h_1 and h_2. The minimal variance, indeed, is achieved for h_1 = h_2 = 10^{-2} and its value is 10^{-1.88}, while its maximal value is 10^{-1.61} and it is achieved for h_1 = h_2 = 10^{-3.4}. It means that the variance varies only a little while the kernel bandwidths move over more than an order of magnitude. Another piece of evidence of the dependence of the variance of the estimator only on h_3 is given by Figure 2 below. To better understand that graphic, we underline that the orange and blue curves correspond to the two edges, respectively. In particular, the orange curve corresponds to the variation of the variance for h_1 = 10^{-0.5} fixed and h_2 shifting from 10^{-0.5} to 10^{-2.4}, while the blue curve represents the variation of the variance when h_2 is fixed equal to 10^{-0.5} and h_1 goes from 10^{-0.5} to 10^{-2.4}. The green curve corresponds to the diagonal of the 3d graphic and so it represents the variance of the estimator when h_1 = h_2 moves from 10^{-0.5} to 10^{-2.4}. We start by discussing the behaviour of the green curve. According to the theory, we know the variance should not depend on h_1 = h_2. Therefore, the derivative of the log-variance function with respect to log_10(h_1) = log_10(h_2) should be null. The numerical results match the theoretical ones, as the slope of the diagonal is quite weak, being equal to −0.186. Regarding the edge curves, one can easily remark that the results they provide are in line with the theory as well. Based on the upper bound on the variance found in Proposition 1 and discussed above, we can now state the main result on the asymptotic behaviour of the estimator. Its proof can be found in Section 5.
Theorem 1. Suppose that A1-A3 hold. If π ∈ H_d(β, L), then the estimator given in (5) satisfies, for d ≥ 3, the risk estimate given in (10). Taking β_1 ≤ β_2 ≤ ... ≤ β_d and defining \frac{1}{\bar{\beta}_3} := \frac{1}{d-2} \sum_{l=3}^d \frac{1}{\beta_l}, the rate optimal choice for the bandwidth h provided in (32) and (33) below yields the convergence rate
\mathbb{E}\big[ |\hat{\pi}_{h,T}(x) - \pi(x)|^2 \big] \le c \Big(\frac{\log T}{T}\Big)^{\frac{2\bar{\beta}_3}{2\bar{\beta}_3 + d - 2}}.
Moreover, in the isotropic context β_1 = β_2 = ... = β_d =: β, the following convergence rate holds true:
\mathbb{E}\big[ |\hat{\pi}_{h,T}(x) - \pi(x)|^2 \big] \le c \Big(\frac{\log T}{T}\Big)^{\frac{2\beta}{2\beta + d - 2}}.   (11)
We recall that in [2], under the same assumptions, the convergence rate (12), governed by the harmonic mean smoothness \bar{\beta} of the invariant density over the d different dimensions, with \frac{1}{\bar{\beta}} := \frac{1}{d} \sum_{l=1}^d \frac{1}{\beta_l}, has been found for the pointwise estimation of the invariant density for d ≥ 3. We remark that the rate in (12) for d ≥ 3 is the same Strauch found in [32] in the absence of jumps, which is also the rate obtained in the isotropic context in [11], up to replacing the mean smoothness \bar{\beta} by β, the common smoothness over the d dimensions, as we did in (11). By construction, \bar{\beta}_3 is bigger than \bar{\beta} and, therefore, the upper bound found in Theorem 1 is faster than the one reported in (12) in a general anisotropic context. Now the following two questions arise. Can we improve the rate by using other density estimators? What is the best possible rate of convergence? To answer these questions it is useful to consider the minimax risk R_T(β, L) associated with the anisotropic Hölder class H_d(β, L) we defined in Definition 1, as we are going to explain in Section 4.

Lower bounds
In this section, we ask whether it is possible to construct an estimator with a rate better than the one obtained in Theorem 1.
For the computation of lower bounds, we introduce the following stochastic differential equation with jumps, namely (4) with constant coefficients:
X_t = X_0 + \int_0^t b(X_s)\,ds + a\,W_t + \int_0^t \int_{R^d \setminus \{0\}} \gamma\, z\, \tilde{\mu}(ds, dz),   (13)
where a and γ are constants, γ is also invertible and b is a Lipschitz and bounded function. We assume that the jump measure satisfies the conditions gathered in points 1, 2, 4 and 5 of A3. We moreover suppose that there exists λ_1 such that the off-diagonal entries of a·a^T are controlled by the diagonal ones. We underline that if the matrix a·a^T is diagonal, then the request here above is always satisfied. If it is not the case, such an assumption implies that the diagonal terms dominate the others. As the model satisfies A1, we know that the stochastic differential equation with jumps (13) admits a solution. Moreover, as γ is invertible, A3 is automatically true. If A2 also holds, we know from Lemma 2 in [2] that the process admits a unique stationary measure, which we denote by π_b. We omit in the notation the dependence on a and γ, as they will be fixed in the sequel, while the connection between b and π_b will be made explicit in Section 6.1.
If the invariant measure exists, we denote by P_b the law of a stationary solution (X_t)_{t≥0} of (13) and by E_b the corresponding expectation. Moreover, we will denote by P the law of (X_t)_{t∈[0,T]}, solution of (13). In order to write down an expression for the minimax risk of estimation, we have to consider a set of solutions to equation (13) which are stationary and whose stationary measure has the prescribed Hölder regularity introduced in Definition 1. This leads us to the following definition.
the set of Lipschitz and bounded functions b : R^d → R^d satisfying A2 and for which the density π_b of the invariant measure associated to the stochastic differential equation (13) belongs to H_d(β, 2L).
We introduce the minimax risk for the estimation at some point. Let x_0 ∈ R^d and Σ(β, L) be as in Definition 2 here above. We define the minimax risk
\mathcal{R}_T(\beta, L) := \inf_{\tilde{\pi}_T} \sup_{b \in \Sigma(\beta, L)} \mathbb{E}_b\big[ |\tilde{\pi}_T(x_0) - \pi_b(x_0)|^2 \big],   (14)
where the infimum is taken over all possible estimators of the invariant density. Our main result is a lower bound for the minimax risk defined here above. The proof is based on the two hypotheses method, explained for example in Section 2.3 of [33].
Theorem 2. There exists c > 0 such that, if ĉ < c (recall: ĉ is defined in the fifth point of A3), then, for T large enough,
\mathcal{R}_T(\beta, L) \ge c_1\, T^{-\frac{2\bar{\beta}_3}{2\bar{\beta}_3 + d - 2}}
for some constant c_1 > 0, where we recall that β_1 ≤ β_2 ≤ ... ≤ β_d and \frac{1}{\bar{\beta}_3} = \frac{1}{d-2} \sum_{l=3}^d \frac{1}{\beta_l}. The condition on ĉ follows from the fact that, in our approach, the jumps must not be too big. In this way it is possible to build ergodic processes where, in the analysis of the link between the invariant measure and the drift function, the continuous part of the generator dominates (see Lemma 2).
Regarding the choice of the model, it is worth noticing that our framework does not allow us to treat continuous processes and jump diffusions simultaneously, as we need the coefficients to be always different from zero to get the mixing properties of our process. Hence, we choose to take into account the case where we have an additional piece of information: we do have jumps. In particular, we are looking for a lower bound over a class of processes where we know that jumps really occur, which is truly challenging. It is interesting to remark that it is possible to follow the scheme provided in Section 6 also when one aims at finding a lower bound over a class of continuous diffusion processes. The main difference would be the absence of the discrete part of the generator A_d, which would imply the absence of its adjoint A*_{d,i} in the definition of the coordinates of b (see Equation (38)). As we will see, in the construction of the priors, the idea will be to provide a first density with the prescribed regularity and then to obtain the second as the first plus a bump. As we will need to consider the drifts associated to the priors so built, we will need to evaluate the adjoint of the generator of the process at the bump. The main difficulty comes from the discrete part of the generator, which is a non-local operator (see Points 1 and 2 of Proposition 4: without the jumps the difference between the drifts would simply be zero there).
It follows from Theorem 2 that, on a class of diffusions X whose invariant density belongs to H_d(β, L), and starting from the observation of the process (X_t)_{t∈[0,T]}, it is impossible to find an estimator with a rate of estimation better than T^{-\bar{\beta}_3/(2\bar{\beta}_3 + d - 2)} for the pointwise L^2 risk. Comparing the lower bound here above with the upper bound gathered in Theorem 1, we observe that, up to a logarithmic term, the two convergence rates coincide. Hence, the convergence rate obtained by means of a kernel estimator is the best possible, up to a logarithmic term.

Proof upper bound
This section is devoted to the proof of the upper bound gathered in Theorem 1. To do that, we first of all need to prove Proposition 1. Before proving it we recall a result from [2] that will be useful in the sequel. From Lemma 1 in [2], which heavily relies on the first point of Theorem 1.1 in [7], we know that an upper bound on the transition density holds true for t ∈ [0, 1]. Such a bound is not uniform for large t. Nevertheless, for t ≥ 1, a companion bound holds, and we deduce a bound valid for all t. Proof of Proposition 1. In the sequel, the constant c may change from line to line and is independent of T. From the definition (5) and the stationarity of the process we obtain an expression for the variance in terms of the covariance function. In order to find an upper bound for the integral on the right hand side, we split the time interval [0, T] into 4 pieces, [0, δ_1), [δ_1, δ_2), [δ_2, D) and [D, T], where δ_1, δ_2 and D will be chosen later, to obtain an upper bound which is as sharp as possible.
• For s ∈ [0, δ 1 ), from Cauchy -Schwartz inequality and the stationarity of the process we get The variance is smaller than Using the boundedness of π and the definition of K h given in (5) it follows • For s ∈ [δ 1 , δ 2 ), taking δ 2 < 1, we use the definition of transition density, for which with We now study k 1 (s). To this end we observe that, for y = (y 1 , ...y d ), it is where q G s (y 3 ...y d |y 1 , y 2 , y) = e −λ 0 Then, from the definition of k 1 (s) given in (18), we get Using the definition of K h and (19) we obtain We remark that in the reasoning here above it would have been possible to remove the contribution of no matter which couple of bandwidth. We choose to remove h 1 and h 2 because they are associated, in the bias term, to the smallest values of the smoothness (β 1 and β 2 ) and so they provide the strongest constraints. Replacing the result here above in (20) and as We want to act in the same way on k 2 (s). We observe it is We remark that sup as each of the d − 2 multiplication factors can be seen as applied the change of variable Again, acting as on k 1 (s), we use the definition of the kernel function K h and the integrability of q J s gathered in (22) to obtain From (17), (21) and (23) it follows as 1 − α 2 > 0 and so the term coming from k 2 is negligible compared to | log(δ 2 )| j≥3 h j , for δ 2 small enough.
• For s ∈ [δ_2, D) we still use (15), observing that, in particular, the bound simplifies in this regime. We therefore get the corresponding estimate, where we have used that, as d ≥ 3, 1 − d/2 < 0. The exponent of the second term in the integral here above, after integration, is 2 − (d+α)/2. It is positive if d < 4 − α, which is possible only if α ∈ (0, 1) and d = 3, and negative otherwise. Moreover, as α < 2, we have 2 − (d+α)/2 > 1 − d/2. Finally, the logarithmic terms are negligible compared to the others, for δ_2 small enough and D large enough.
• For s ∈ [D, T] our main tool is Lemma 2 in [2]. As the process X is exponentially β-mixing, indeed, an exponentially decaying (in s) control on the covariance holds true, for ρ and c positive constants as given in Definition 1 of exponential ergodicity in [2]. It entails a bound on the corresponding integral. Collecting together (16), (24), (25) and (26) we deduce the announced estimate, where we have also used that D^{2−(d+α)/2} 1_{\{d<4−α\}} ≤ D. Indeed, since d ≥ 3 and α ∈ (0, 2), we always have 2 − (d+α)/2 ≤ 1. Moreover, we know that D ≥ 1 by definition. When d < 4 − α, the power is positive, thus D^{2−(d+α)/2} 1_{\{d<4−α\}} ≤ D. We now want to choose δ_1, δ_2 and D for which the estimate here above is as sharp as possible. Making this choice, and using the fact that, for any j ∈ {1, ..., d}, h_j is small and in particular smaller than 1/2, all the remaining terms can be absorbed into the main contribution; (6) is therefore proved.

Proof of Theorem 1
Proof. We write the usual bias-variance decomposition
\mathbb{E}\big[ |\hat{\pi}_{h,T}(x) - \pi(x)|^2 \big] = \big( \mathbb{E}[\hat{\pi}_{h,T}(x)] - \pi(x) \big)^2 + \mathrm{Var}\big( \hat{\pi}_{h,T}(x) \big).
Regarding the bias, a standard computation (see for example the proof of Proposition 2 of [2]) provides
\big| \mathbb{E}[\hat{\pi}_{h,T}(x)] - \pi(x) \big| \le c \sum_{l=1}^d h_l^{\beta_l}.
An analogous computation can be found in Proposition 1.2 of [33] or in Proposition 1 of [8].
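For the reader's convenience, here is a sketch of that standard computation (under the normalization and vanishing-moment conditions imposed on K in Section 3); it is only meant to recall where the anisotropic bias bound comes from.

```latex
% Sketch of the bias computation.  By stationarity and \int K = 1,
E[\hat\pi_{h,T}(x)] - \pi(x)
  = \int_{\mathbb{R}^d} \Big( \prod_{l=1}^d \tfrac{1}{h_l} K\big(\tfrac{x_l - y_l}{h_l}\big) \Big)
      \big( \pi(y) - \pi(x) \big)\, dy .
% Expanding \pi(y) around x direction by direction with a Taylor formula of order
% \lfloor \beta_l \rfloor, the vanishing moments of K cancel the polynomial terms
% and the Hölder condition on the remainder yields
\big| E[\hat\pi_{h,T}(x)] - \pi(x) \big| \;\le\; c \sum_{l=1}^d h_l^{\beta_l},
% with c depending only on K and on the Hölder constants L_1, ..., L_d.
```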
It is here important to remark that the constant c does not depend on x. For h 1 h 2 < ( l≥3 h l ) In order to choose the rate optimal bandwidth, we define h l (T ) := ( log T T ) a l for l ∈ {1, ..., d} and we look for a 1 , ... a d such that the upper bound of the mean-squared error in the right hand side of (10) is as small as possible. We remark that Therefore, after having replaced h l (T ), the right hand side of (10) is To get the balance we have to solve the following system in a 3 , ... , a d : while a 1 and a 2 have to be big enough to ensure that both ( 1 T ) 2β 1 a 1 and ( 1 T ) 2β 2 a 2 are negligible compared to the other terms. We observe that, as a consequence of the first d − 3 equations, we can write Hence, the last equation becomes whereβ 3 is the mean smoothness over β 3 , ... , β d and it is such that 1 Regarding a 1 and a 2 , we take them big enough to ensure that Plugging them in (30) we get as we wanted. We now observe that, in the anisotropic case, the multi bandwidth h always satisfies h 1 h 2 < ( l≥3 h l ) Because of the choice of a 1 , ..., a d gathered in (32) and (33), it holds true if As β 1 ≤ β 2 ≤ ... ≤ β d , equation (34) always holds true, in the anisotropic context. However, in the isotropic context, we have Here estimation (29) together with decomposition (28) and the upper bound on the variance gathered in (7) of Proposition 1 gives us, remarking also that β 1 = ... = β d =: β, It leads us to the rate optimal choice h( as we wanted.
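The balance just described can be checked numerically. The snippet below uses illustrative smoothness values, and the closed-form exponents are written under the assumption that the squared-bias and variance terms are exactly of the orders used above; it verifies that the squared-bias exponents for l ≥ 3 match the variance exponent.

```python
# Check of the bandwidth balance in the proof of Theorem 1, with h_l(T) = (log T / T)^{a_l}.
beta = [1.5, 2.0, 3.0, 4.0, 6.0]          # illustrative values, beta_1 <= ... <= beta_d
d = len(beta)

beta_bar_3 = (d - 2) / sum(1.0 / b for b in beta[2:])
# Balanced exponents for l >= 3; a_1 and a_2 only need to be taken large enough
# for their squared-bias contributions to be negligible.
a_tail = [beta_bar_3 / (b * (2 * beta_bar_3 + d - 2)) for b in beta[2:]]

var_exponent = 1.0 - sum(a_tail)                       # from log(T)/(T * prod_{j>=3} h_j)
bias_exponents = [2 * b * a for b, a in zip(beta[2:], a_tail)]
target = 2 * beta_bar_3 / (2 * beta_bar_3 + d - 2)

print("a_3, ..., a_d  :", [round(a, 4) for a in a_tail])
print("variance expon.:", round(var_exponent, 4))
print("bias exponents :", [round(e, 4) for e in bias_exponents])
print("target exponent:", round(target, 4))
# All the displayed exponents coincide, which is exactly the announced rate.
```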

Proof lower bound
We want to prove Theorem 2 using the two hypotheses method, as explained for example in Section 2.3 of Tsybakov [33]. The idea is to introduce two drift functions b_0 and b_1 which belong to Σ(β, L) and for which the laws P_{b_0} and P_{b_1} are close. To do so, the knowledge of the link between b and π_b is crucial. In particular, we will study this link in detail in Section 6.1, while we will provide two priors in Section 6.2. In Section 6.3 we will use these preliminaries in order to prove the lower bound for the pointwise minimax risk gathered in Theorem 2.
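Before going into the details, it may help to keep in mind the generic two-point reduction behind the argument, stated here only as a guide and in the spirit of Section 2.3 of [33].

```latex
% Generic two-point reduction: for any estimator \hat\pi_T and any b_0, b_1 \in \Sigma(\beta, L),
\inf_{\hat\pi_T} \sup_{b \in \Sigma(\beta, L)}
  \mathbb{E}_b\big[ (\hat\pi_T(x_0) - \pi_b(x_0))^2 \big]
\;\ge\;
  \frac{\big( \pi_{b_1}(x_0) - \pi_{b_0}(x_0) \big)^2}{4}\,
  \mathbb{E}_{b_0}\!\Big[ \min\Big( 1, \frac{d\mathbb{P}_{b_1}}{d\mathbb{P}_{b_0}}\big(X^T\big) \Big) \Big].
% The priors are built so that |\pi_{b_1}(x_0) - \pi_{b_0}(x_0)| is of order 1/M_T,
% while the likelihood ratio stays bounded away from 0 with positive probability
% (this is the role of Lemma 4 below).
```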

Explicit link between the drift and the stationary measure
In absence of jumps, most of the times, reversible diffusion processes with unit diffusion processes are considered in order to estimate the invariant density (see [11] and [32]). In this case, the connection between the drift function and the invariant measure is explicit: where V is a C 2 (R d ) function, which we refer to as potential. Adding the jumps, it is no longer true and so, in our framework, it is challenging to get a relation between b and π b . We need to introduce A, the generator of the diffusion X solution of (13). It is composed by a continuous part and a discrete one: We now introduce a class of function that will be useful in the sequel: We denote furthermore as A * b the adjoint operator of A on L 2 (R d ) which is such that, for f, g ∈ C, The following lemma, that will be proven in Section 7, makes explicit the form of A * b . Lemma 1. Let A * b the adjoint operator on L 2 (R d ) of A, generator of the diffusion X solution of (13), where the subscript b is to underline its dependence on the drift function. Then, for g ∈ C, it is If g : R d → R is a probability density of class C 2 , solution of A * b g = 0, then it is an invariant density for the process we are considering. When the stationary distribution π b is unique, therefore, it can be computed as solution of the equation A * b π b = 0. As one can see from Lemma 1, the adjoint operator has a pretty complicate form. Hence, it seems impossible to find explicit solutions g of A * b g = 0 for any b and consequently it seems impossible to write π b as an explicit function of b. However, it can be seen that if one consider π b as fixed and b as the unknown variable, then finding solutions in b is simpler. Moreover, the adjoint of the discrete part of the generator does not depend on b and therefore the solution in b is the same it would have been in absence of jumps, plus a second term which derives from the contribution of the jumps. In order to compute a function b = b g solution of A * b g = 0, we need to introduce some notations. For g ∈ C we denote as A * d g the adjoint operator of A d g which is, for all x ∈ R d , Moreover, we introduce the following quantity, that will be useful in the sequel: To make easier the notation here above, we denote asx i the vector ( and so it is easy to prove that the sum of Then, for g ∈ C and g > 0, we introduce for all x ∈ R d and for all i ∈ {1, ..., d}, where w i = (x 1 , ..., x i−1 , w, x i+1 , ..., x d ). We observe that, by the definition of A * d,i and the fact that the function g is integrable, b i is well defined. Moreover, as both g and its derivatives goes to zero at infinity and using that the Lebesgue measure is invariant on R, it is Hence, the two definitions of b i given here above are equivalent on R. We finally denote ). We show that the function b g here above introduced is actually solution of A * b g(x) = 0. Proposition 2.
1. Let g a positive function in C. Then, 2. Let π : R d → R a probability density such that π ∈ C and π > 0. If b π , defined as in (38), is a bounded Lipschitz function which satisfies A2, then π is the unique stationary probability of the stochastic differential equation (13) Proof. 1. For b i g (x) defined as in (38), we get Replacing b i g (x) and given by Lemma 1 and using (37), we easily obtain A * bg g(x) = 0. 2. From Ito's formula, one can check that any π solution of A * b π(x) = 0 is a stationary measure for the process X solution of (13). From point 1 we know that π is solution to A * bπ π(x) = 0 and so it is a stationary measure for the process X whose drift is b π . However, we have assumed b π to be a bounded Lipschitz function which satisfies A2 and, from Lemma 2 of [2], we know it is enough to ensure the existence of a Lyapounov and to show that the stationary measure of the equation with drift coefficient b π is unique. It follows it is equal to π.
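As a sanity check of the purely continuous, reversible case recalled at the beginning of this subsection, the short symbolic computation below (one dimension, unit diffusion coefficient, no jumps, drift of gradient form) verifies that a density proportional to e^{2V} annihilates the adjoint of the continuous part of the generator; it is only an illustration and plays no role in the proofs.

```python
import sympy as sp

x = sp.symbols('x', real=True)
V = sp.Function('V')

# Continuous reversible case in dimension one, unit diffusion coefficient:
# drift b = V' and candidate (un-normalised) invariant density g = exp(2 V).
b = sp.diff(V(x), x)
g = sp.exp(2 * V(x))

# Adjoint of the continuous generator applied to g:  A* g = (1/2) g'' - (b g)'.
adjoint_g = sp.Rational(1, 2) * sp.diff(g, x, 2) - sp.diff(b * g, x)
print(sp.simplify(adjoint_g))   # prints 0: g solves A* g = 0
```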
We recall that our purpose, in this section, is to clarify the link between the drift coefficient b_π of the stochastic differential equation (13) and the unique stationary distribution π. As a consequence of the second point of Proposition 2, this is achieved when b_π is a bounded Lipschitz function which satisfies A2. We now introduce some assumptions on π under which the associated drift b_π has the desired properties.
2. We denote k 1 := max h ((a · a T ) −1 ) hh . There exists > 0 such that < 0 |γ j |d k 1 , (with |γ j | the euclidean norm of the j-th line of the matrix γ and 0 the value appearing in the fifth point of Assumption A3), for which for any y, z ∈ R, where c 2 is some constant > 0.
3. For > 0 as in point 2 there exists c 3 ( ) > 0 such that 4. We denote k 2 := max l,h∈{1,...d} |(γ T · γ) lh | and we recall thatĉ is the constant appearing in the fifth point of A3. There exists 0 <˜ < , where c 5 is the constant that will be introduced below, in the fifth point of Ad, and there exists R such that, for any y : Moreover, there exists a constant c 4 such that, for any y ∈ R, | π j (y) π j (y) | ≤ c 4 .

5.
For each i ∈ {1, ..., d}, for any x ∈ R d and for˜ as in point 4 there exists a constant c 5 such that Moreover, there exists a constantc 5 such that, for any x ∈ R d , it is Even though the just listed properties do not seem very natural and they have been introduced especially to make the associated drift function such that we can use the second point of Proposition 2, they are all satisfied by choosing a probability density in an exponential form, as we will see better in Lemma 2. The proof of the following proposition will be given in Section 7.
Proposition 3. Suppose that π satisfies Ad. Then b π , defined as in (38), is a bounded Lipschitz function which satisfies A2.
From Proposition 3 here above and the second point of Proposition 2 it follows that, if we choose the probability density carefully so that all the properties gathered in Assumption Ad hold true, then π is the unique stationary probability of the stochastic differential equation (13) with drift coefficient b_π. The next subsection is devoted to the construction of two densities which satisfy the properties listed in Ad.

Construction of the priors
The proof of the lower bound is made by a comparison between the minimax risk introduced in (14) and some Bayesian risk where the Bayesian prior is supported on a set of two elements. We want to provide two drift functions belonging to Σ(β, L) and, to do it, we introduce two probability densities defined on the purpose to make Ad hold true. We set where c η is the constant that makes π 0 a probability measure. For any y ∈ R we define π k,0 (y) : and η is a constant in (0, 1 2 ) which plays the same role as and˜ did in Ad, as it can be chosen as small as we want. In particular we choose η small enough to get π 0 ∈ H d (β, L). Moreover, f is a C ∞ function and it is such that, for any x ∈ R, The function f has been introduced with the purpose of making π k,0 (y) a C ∞ function for which all the conditions in Ad are satisfied. We state the following lemma, which will be proven in Section 7.
We remark that, as a consequence of the fifth point of A3 and of (39) here above, the assumption required on the jumps in order to make π_0 satisfy Ad is that there exists ε_0 such that the corresponding exponential moment condition holds. It means that the jumps have to be well behaved. In particular, they have to integrate an exponential function and such an integral has to be upper bounded by a constant which depends on the model.
From Proposition 3, we know that b π 0 is a bounded lipschitz function which satisfies A2 and, using also the second point of Proposition 2 it follows that π 0 is the unique stationary probability of X (0) solution of It yields b π 0 ∈ Σ(β, L), according to Definition 2. To provide the second drift function belonging to Σ(β, L) on which we want to apply the two hypothesis method, we introduce the probability measure π 1 . We are given it as π 0 to which we add a bump: let K : R → R be a C ∞ function with support on [−1, 1] and such that We set where x 0 = (x 1 0 , ..., x d 0 ) ∈ R d is the point in which we are evaluating the minimax risk, as defined in (14), M T and h l (T ) will be calibrated later and satisfy M T → ∞ and, ∀l ∈ {1, ..., d}, h l (T ) → 0 as T → ∞. From the properties of the kernel function given in (41) we obtain Moreover, as π 0 > 0, K has support compact and 1 M T → 0, for T big enough we can say that π 1 > 0 as well. The fact of the matter consists of calibrating M T and h l (T ) such that both the densities π 0 and π 1 belong to the anisotropic Holder class H d (β, 2L) (according with Definition 2 of Σ(β, L)) and the laws P bπ 0 and P bπ 1 are close. It will provide us some constraints, under which we will choose M T and h l (T ) such that the lower bound on the minimax risk is as large as possible. In order to make the here above mentioned constraints explicit, we first of all need to evaluate how the two proposed drift functions differ in a neighbourhood of x 0 ∈ R d , as stated in the next proposition. Its proof will be given in Section 7.
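A one-dimensional numerical illustration of this construction is given below. The kernel used is an explicit C^∞, compactly supported function built only to satisfy K(0) = 1 and ∫K = 0 (the two properties of (41) used later), and π_0 is replaced by a standard Gaussian density purely for illustration, so none of the constants below are the ones of the paper.

```python
import numpy as np

def psi(u):
    """Classical C-infinity bump supported on (-1, 1)."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    inside = np.abs(u) < 1
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

def K(u):
    """C-infinity kernel with support [-1, 1], K(0) = 1 and integral 0
    (an illustrative construction; only these two properties are used here)."""
    return (2.0 * psi(2.0 * u) - psi(u)) / np.exp(-1.0)   # psi(0) = exp(-1)

# Perturbed prior pi_1 = pi_0 + (1/M_T) * K((x - x0)/h), in dimension one.
x0, h, M_T = 0.0, 0.1, 50.0
grid = np.linspace(-6.0, 6.0, 200001)
dx = grid[1] - grid[0]

pi0 = np.exp(-grid ** 2 / 2.0) / np.sqrt(2.0 * np.pi)     # illustrative baseline density
pi1 = pi0 + K((grid - x0) / h) / M_T

i0 = np.argmin(np.abs(grid - x0))
print("integral of pi1 :", (pi1 * dx).sum())    # still ~ 1: the bump integrates to 0
print("min of pi1      :", pi1.min())           # > 0 once M_T is large enough
print("pi1(x0)-pi0(x0) :", pi1[i0] - pi0[i0])   # ~ 1/M_T, the separation used in the lower bound
```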
Proposition 4. Let us define the compact set of R d Then, for T large enough, 1. For any x ∈ K c T and ∀i ∈ {1, ..., d}: 3. For any x ∈ K T and ∀i ∈ {1, ..., d}: Using Proposition 4 it is possible to show that also b π 1 belongs to Σ(β, L), up to calibrate properly M T and h i (T ), for i ∈ {1, ..., d}.
Lemma 3. Let > 0 and assume that, for all T large, We suppose moreover that d j=1 for all T sufficiently large.
Proof. From the first point of Proposition 4 here above we know that, ∀i ∈ {1, ..., d}, while the third point of Proposition 4 provides us, being the last equality a consequence of d j=1 1 h j (T ) = o(M T ), for T going to ∞. We recall that we have built the density π 0 especially to apply Proposition 3 on b π 0 . Therefore, b π 0 is a bounded Lipschitz function which satisfies A2 and, as a consequence of (44) and (43), the same goes for b π 1 . From the second point of Proposition 2 it follows that π 1 is the unique stationary measure associated to b π 1 . As Σ(β, L) is defined as in Definition 2, the proof of the lemma is complete as soon as π 1 ∈ H d (β, 2L). Let us check the Hölder condition with respect to the i-th component. We first of all introduce the following notation: For all x ∈ R d and t ∈ R it is where we have used the definition of d T and the fact that π 0 ∈ H d (β, L). We now observe that Therefore, defining We have assumed that, ∀i ∈ {1, ..., d}, 1 M T ≤ h i (T ) β i . We can choose small enough to ensure that ≤ L i c K , obtaining Moreover, from the definition of π 1 and the fact that π 0 ∈ H d (β, L) we also get, for any x ∈ R d and k = 0, ..., β i Again, it is enough to choose such that ≤ L i c K to get |D k i π 1 (x)| ≤ 2L i . We have proven the required Hölder controls on the derivatives of π 1 , the lemma follows.
We remark that the two conditions on the calibration parameters provide which is always true as we have asked β j > 1 for any j ∈ {1, ..., d} in Definition 2.

Proof of Theorem 2
Proof. We first of all recall the notations previously introduced. We denote P b the law of the stationary solution of (13) on the canonical space C([0, ∞), R d ) and E b the corresponding expectation; we also denote as P We want to use the two hypothesis method based on the two drift functions b π 0 and b π 1 which, therefore, have to belong to Σ(β, L). It is b π 0 ∈ Σ(β, L) by construction. Moreover, from Lemma 3 we know in detail the constraints required on the calibrations M T and h i (T ) in order to get b π 1 belonging to Σ(β, L). We therefore assume that the following conditions hold true: As b π 0 , b π 1 ∈ Σ(β, L) we have In order to lower bound the right hand side we need the following lemma, which will be showed in Section 7. and we assume that Then, there exist C and λ > 0 such that, for all T large enough, From (48) it turns out another condition on the calibration quantities. Indeed, using all the three points of Proposition 4, it is as h j (T ) goes to 0 for T going to infinity and so the second term here above is negligible compared to the first one. It provides us the constraints on the calibration that we need to require in order to apply Lemma 4 here above. From Lemma 4, as Z (T ) exists, we can write and so we obtain We recall that π 0 and π 1 have been built in Section 6.2 and in particular, since π 1 has been defined as below (41), it is where we have also used that K(0) = 1, as stated in (41). Moreover from Lemma 4 we know that for some λ, as soon as (49) holds, We deduce that, if (46), (47) and (49) are satisfied, then for c > 0. Hence, we have to find the largest choice for 1 We observe that (47) can be seen as We plug it in (49) and we observe that the biggest term in the sum is l =1 h l (T ) h 1 (T ) . In order to make it as small as possible, we decide to increment h 1 (T ), such that condition (46) is no longer saturated for j = 1. In particular, we increase h 1 (T ) up to get h 1 (T ) = h 2 (T ), remarking that it is not an improvement to take h 1 (T ) also bigger than h 2 (T ) because would be the biggest term, and it would be larger than l =1 h l (T ) for h 1 (T ) = h 2 (T ). Then, we have the possibility to no longer saturate condition (46) also for other j, which means to increase some h j (T ). However, it implies the worse term to be bigger, and so it does not consist in a good choice. Finally, we take h j (T ) which saturates (46) for any j = 1 and h 1 (T ) = h 2 (T ). Replacing them in (49), we get the following condition: Replacing (53) in (52), it leads us to the choice Plugging the value of M T in (50) we obtain that, for any possible estimatorπ T of the invariant density, it is The wanted lower bound on the minimax risk defined in (14) follows.

Proofs
This section is devoted to the proofs of the technical results we have introduced in the previous section.

Proof of Lemma 1
Proof. We aim at making explicit the adjoint operator of A, the generator of the process solution to (13), on L 2 (R d ). It is such that, for f, g belonging to the set C as introduced in Section 6.1, We start analysing the continuous part of the generator of (13), A. From (35), a repeated use of integration by parts and the fact that the function g vanishes for x i going to ±∞ for any i ∈ {1, ..., d} we get (54) We now look for the adjoint operator of the discrete part of the generator A d as defined in (35). It is We evaluate first of all I 1 , on which we operate the change of variable u := x + γ · z. It provides us with the last equality which follows from Fubini theorem. We recall that |γ −1 | stands for the absolute value of the determinant of the matrix γ −1 . Regarding I 2 , one can clearly isolate the adjoint part without further computations as (− R d F (z)dz)g(x). The last term left to deal with is I 3 . From integration by parts and once again the fact that g vanishes for x i going to ±∞ we obtain where we have also changed the variable γ −1 · (x − y) = z in the first integral. From (54) and (55) the lemma follows.
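The mechanism behind the jump part of this computation is perhaps easiest to see in dimension one with a constant, non-zero γ; the following sketch is given only as a guide and uses the same two ingredients as above (a change of variable and one integration by parts).

```latex
% Sketch, d = 1, \gamma constant.  With
%   A_d f(x) = \int_{\mathbb{R}} \big( f(x + \gamma z) - f(x) - \gamma z\, f'(x) \big) F(dz),
% the change of variable u = x + \gamma z in the first term and an integration by
% parts in the third one (f, g \in \mathcal{C} vanish at infinity) give
\int_{\mathbb{R}} A_d f(x)\, g(x)\, dx
  \;=\; \int_{\mathbb{R}} f(u) \int_{\mathbb{R}}
        \big( g(u - \gamma z) - g(u) + \gamma z\, g'(u) \big) F(dz)\, du ,
% so that A_d^{*} g(u) = \int_{\mathbb{R}} \big( g(u - \gamma z) - g(u) + \gamma z\, g'(u) \big) F(dz),
% which, as remarked in Section 6.1, does not depend on the drift b.
```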

Proof of Proposition 3
Proof. We start proving that b π is bounded. We can assume WLOG x i < 0, if x i > 0 an analogous reasoning applies. As π is in a multiplicative form, we can compute where we have also used that, for the first point of Ad, π j (−∞) = 0. Comparing the equation here above with the definition (38) of b i one can see that, for all x ∈ R d , From the fourth point of Ad it easily follows that there exists a constant c > 0 for which Regarding I 2 [π], we start evaluating, for any The fifth point of Ad (with the notation introduced in the fouth one) provides us an upper bound on the second derivative of π which yields, using also that π is in a multiplicative form, The second point of Ad yields We apply exactly the same on π j (x j − (γ · z) j ), for j < i and we replace them in the right hand side of (57), recalling that |s| < 1. We obtain it is upper bounded by Now, by the definition of given in the third point of Ad, we know it satisfies (a · a T ) −1 jj | j≤i d k=1 γ jk z k | ≤ 0 d|z| 1 d = 0 |z|. It follows that the integral in z is bounded bŷ c and so We plug it in I 2 [π], getting as the integral on w is upper bounded by c 3 ( ), from the third point of Ad. We have proved (56) and (58) and, therefore, b i is clearly bounded. We now want to prove the drift condition A2 on b i π . To do it, we investigate the behavior of x i b i π (x). From the fourth point of Ad, which holds true for any x i such that |x i | > R √ d , it is As we have assumed | j =i (a · a T ) ij (a · a T ) −1 jj | < 1 2 , it is It follows Using (58) we also get Hence, for x i such that |x i | > R √ d , there existsc > 0 such that where the last inequality is a consequence of the fact we have assumed˜ < From (59), using also the boundedness of b i π showed before, it follows where the last inequality is a consequence of the fact that, for |x| > R, there has to be at least a component x i such that |x i | > R √ d . Hence, we can use the sup norm and compare it with the euclidean one. Moreover, as |x| is lower bounded by R, it exists a constant C 1 > 0 such that −c 1 |x| + c 2 ≤ −C 1 |x|.
The drift condition on b π clearly holds. As b is also Lipschitz, the result follows.

Proof of Lemma 2
Proof. We recall that π 0 has been defined as with π k,0 (y) := f (η(a · a T ) −1 kk |y|) and By construction, π 0 is clearly in a multiplicative form and always positive. Moreover, point 1 of Ad directly hold true from the definition of π j,0 (y). To show the second point to hold, we observe it is It implies that point 2 of Ad holds with c 2 = 4 and = η, as we can choose η small enough to make also the condition in the definition of satisfied.
In order to prove that the third point Ad holds true, we need to show that, for any y < 0, By the lower and upper bounds on π k,0 provided through the first property of f we know it is 1 π k,0 (y) For y > 0 an analogous reasoning applies, thus the third point of Ad follows with c 3 ( ) = c 3 (η) = 4 k 3 η . It is easy to check that also the fourth point of Ad hold true as, for |y| > 1 η(a·a T ) −1 kk , it is π k,0 (y) = −η(a · a T ) −1 kk sgn(y)π k,0 (y) ∀k ∈ {1, .., d} .
It means that the fourth point of Ad holds true for |y| > R/√d, up to taking R = √d/(η k_3). Moreover, in order to prove that the fifth point of Ad also holds, we start from the definition of π_0 and the properties of f, which provide, for any j ∈ {1, .., d} and for k = 1, 2, bounds on the derivatives of π_{j,0}. It follows that condition five of Ad holds true with c_5 = 16 and ε̃ = η. Finally, we have to check that, according to the definition of ε̃ given in the fifth point of Ad, the required inequality holds, where we have also replaced the values of the constants found above. It holds true if and only if ĉ < k_3/(4^{d+4} k_1^2 k_2).
Since this is exactly the condition assumed in the statement of this lemma, all the points gathered in Ad are satisfied.

Proof of Proposition 4
Proof. Point 1 We suppose x i < 0. If otherwise it is x i ≥ 0 it is enough to act in the same way on the integral between x i and ∞ to get the same result. We first of all introduce the following quantitiesĨ (a · a T ) ij ∂π 0 ∂x j (x), We moreover introduce the notatioñ According with the definition (38), we have Let us also recall the notation presented in (45) for which ).
We are going to show that for any x ∈ K c T ,Ĩ i 2,3 [d T ](x) = 0 while | 1 By the definition (45) of d T and the fact that its support is included in K T it is, for any x ∈ R d , (65) With the purpose to use Fubini theorem, we analyse more in detail the condition (w i − γ · z) ∈ K T . It means that, and for j = i which gives us We define the set The use of Fubini theorem on (65) provides us l>i K( (68) We observe that the supremum value in the innermost integral should have been min(x i , x i 0 + h i (T ) + (γ · z) i ) but when x i > x i 0 + h i (T ) + (γ · z) i , the integral is and the property (41) of the kernel function K. When x i < x i 0 + h i (T ) + (γ · z) i it is (γ · z) i > x i − x i 0 − h i (T ). We can introduce such a constraint in the set G i z , which becomes G i z (x) := z ∈ R d : ∀j ≤ i x j − x j 0 − h j (T ) ≤ (γ · z) j ≤ x j − x j 0 + h j (T ), ∀j > i − ∞ < (γ · z) j < ∞} .
We now observe that, defining Hence, replacing it in (77), we getĨ i 2,3 [d T ](x) is equal to With the change of variable u := w−x i 0 h i (T ) we obtain whereG i z (x) as defined below (68): We know moreover from (66) that, for j > i, x j 0 − h j (T ) ≤ x j ≤ x j 0 + h j (T ). We observe first of all that Therefore, using also Fubini theorem once again, we get where the set G x (z) derives fromG i z (x) and from (66) directly, writing the constraint on the components of x instead of on the components of z: G x (z) := x ∈ R d : ∀j ≤ i (γ · z) j + x j 0 − h j (T ) ≤ x j ≤ (γ · z) j + x j 0 + h j (T ), We therefore need to evaluateĨ i [d T ](x) =Ĩ i From the assumption (48) in the statement of the lemma, it follows that sup T ≥0 which is sufficient to ensure that there exists λ 0 such that, for any T large enough, as we wanted.