Adaptive estimation for degenerate diffusion processes

Abstract: We discuss parametric estimation of a degenerate diffusion system from time-discrete observations. The first component of the degenerate diffusion system has a parameter θ_1 in a non-degenerate diffusion coefficient and a parameter θ_2 in the drift term. The second component has a drift term parameterized by θ_3 and no diffusion term. Asymptotic normality is proved in two different situations: for an adaptive estimator of θ_3 with some initial estimators for (θ_1, θ_2), and for an adaptive one-step estimator of (θ_1, θ_2, θ_3) with some initial estimators for them. Our estimators incorporate information from the increments of both components. Thanks to this construction, the asymptotic variance of the estimators for θ_1 is smaller than the standard one based only on the first component. The convergence of the estimators for θ_3 is much faster than that for the other parameters. The resulting asymptotic variance is smaller than that of an estimator using only the increments of the second component.

Recently, there has been growing interest in hypo-elliptic diffusions, which appear in various applied fields. Examples of hypo-elliptic diffusions include the harmonic oscillator, the Van der Pol oscillator and the FitzHugh-Nagumo neuronal model; see e.g. León and Samson [30]. For parametric estimation of hypo-elliptic diffusions, we refer the reader to Gloter [18] for a discretely observed integrated diffusion process, and Samson and Thieullen [42] for a contrast estimator. Comte et al. [5] treated adaptive estimation under partial observation. Recently, Ditlevsen and Samson [12] studied filtering and inference for hypo-elliptic diffusions from complete and partial observations. When the observations are discrete and complete, they showed asymptotic normality of their estimators under the assumption that the true values of some of the parameters are known. Melnykova [33] studied the estimation problem for the model (1.1), comparing contrast functions and least squares estimators. The contrast functions we propose in this paper are different from the one in [33]. Recently, Delattre et al. [11] gave a rate of convergence for a nonparametric estimator of the stationary distribution of a hypo-elliptic diffusion.
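To fix ideas, the following minimal simulation sketch generates data from a system of the degenerate type considered here (non-degenerate first component, noiseless second component) by the Euler scheme. The concrete linear coefficients and parameter values are illustrative assumptions, not the coefficients of the general model (1.1).

```python
import numpy as np

def simulate(n, h, theta1, theta2, theta3, seed=0):
    """Euler scheme for an illustrative degenerate system:
    dX_t = -theta2 * X_t dt + theta1 dW_t        (non-degenerate first component)
    dY_t = (X_t - theta3 * Y_t) dt               (second component without diffusion)
    These coefficients are toy choices used only to illustrate the sampling scheme."""
    rng = np.random.default_rng(seed)
    X = np.empty(n + 1)
    Y = np.empty(n + 1)
    X[0], Y[0] = 0.0, 0.0
    dW = rng.normal(scale=np.sqrt(h), size=n)
    for j in range(n):
        X[j + 1] = X[j] - theta2 * X[j] * h + theta1 * dW[j]
        Y[j + 1] = Y[j] + (X[j] - theta3 * Y[j]) * h
    return X, Y

# sampling design with h -> 0, nh -> infinity, nh^2 -> 0, e.g. h = n^{-2/3}
n = 100_000
h = n ** (-2.0 / 3.0)
X, Y = simulate(n, h, theta1=1.0, theta2=0.5, theta3=0.3)
```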
In this paper, we will present several estimation schemes. Since we assume discrete-time observations of Z = (Z_t)_{t∈R_+}, quasi-likelihood estimation for θ_1 and θ_2 is already known; the only difference from the standard diffusion case is the presence of the covariate Y = (Y_t)_{t∈R_+} in the equation for X = (X_t)_{t∈R_+}, but it causes no theoretical difficulty. Thus, our first approach, in Section 3, is estimation of θ_3 given initial estimators for θ_1 and θ_2. The idea behind the construction of the quasi-likelihood function in the elliptic case is the local Gaussian approximation of the transition density. It is then natural to approximate the distribution of the increments of Y by that of the principal Gaussian variable in the expansion of the increment. However, this method entails a loss of efficiency, as we will observe; see Section 8. We present a more efficient method that incorporates an additional Gaussian part coming from X. The error rate attained by the estimator for θ_3 is n^{-1/2}h^{1/2}, which is much faster than the rate (nh)^{-1/2} for θ_2 and n^{-1/2} for θ_1. Section 4 treats some adaptive estimators using suitable initial estimators for (θ_1, θ_2, θ_3), and shows their joint asymptotic normality. It should be remarked that the asymptotic variance of our estimator θ̂_1 for θ_1 improves upon that of the ordinary volatility parameter estimator, e.g. θ̂_1^0 recalled in Section 3.4, which would be asymptotically optimal if the system consisted only of X. Section 2 collects the assumptions under which we will work. Section 5 offers several basic estimates for the increments of Z.
To investigate the efficiency of the presented estimators, we need the LAN property of the exact likelihood function of the hypo-elliptic diffusion. Another important and natural question the reader may have is the asymptotic behavior of the joint quasi-maximum likelihood estimator based on a quasi-likelihood random field for the full parameter θ; an expression of this random field has essentially already appeared in (4.2). In the present situation, the three parameters have different convergence rates, and in particular the handling of θ_3 is not straightforward: for estimation of θ_3, the parameters (θ_1, θ_2) become nuisance parameters, but any estimator of them has a very large error compared with θ_3. The user could obtain some estimated value from the joint quasi-likelihood random field; however, without further analysis there is no theoretical backing for such a scheme. Though somewhat sophisticated treatments are necessary, we can validate the joint quasi-maximum likelihood estimator and show that it attains, up to the first order, the same asymptotic variance as the one-step quasi-likelihood estimator provided in this article. We will discuss these problems elsewhere; meanwhile we refer the reader to Gloter and Yoshida [22] for a more complete exposition including the non-adaptive approach and additional information.

Assumptions
We assume that Θ_i (i = 1, 2, 3) are bounded open domains in R^{p_i}, and that Θ = Θ_1 × Θ_2 × Θ_3 has a good boundary, so that Sobolev's embedding inequality (2.1) (cf. Adams [1]) holds with some positive constant C_Θ. If Θ has a Lipschitz boundary, then this condition is satisfied. Obviously, the embedding inequality (2.1) is valid for functions depending only on a part of the components of θ. In this paper, giving priority to simplicity of presentation, we use Sobolev's inequality to control the maximum of a random field, though other embedding inequalities, such as the GRR inequality, would improve the assumptions on the differentiability of the coefficients of the stochastic differential equations.
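For orientation, a standard form of such an embedding inequality on a bounded Lipschitz domain, which is the kind of bound meant here, reads as follows; this is stated only for reference, and the precise display (2.1) may differ in its constants and norms:

\[
\sup_{\theta\in\Theta}|f(\theta)|
\;\le\; C_\Theta \sum_{k=0}^{1}
\Bigl(\int_\Theta \bigl|\partial_\theta^{k} f(\theta)\bigr|^{p}\,d\theta\Bigr)^{1/p},
\qquad f\in C^{1}(\Theta),\ \ p>p_1+p_2+p_3 .
\]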
In this paper, we will propose an estimator for θ and show its consistency and asymptotic normality.
in any order, and f and all such derivatives extend continuously to R^{d_Z} × Θ_i; moreover, they are of at most polynomial growth in z ∈ R^{d_Z} uniformly in θ_i ∈ Θ_i. Let C(z, θ_1) = B(z, θ_1)B(z, θ_1)^⋆, where ⋆ denotes the matrix transpose. We suppose that the process (Z_t)_{t∈R_+} generating the data satisfies the stochastic differential equation (1.1) for a true value θ^* = (θ_1^*, θ_2^*, θ_3^*) of the parameter, and that Z is ergodic with invariant probability measure ν, in the sense that T^{-1}∫_0^T f(Z_t) dt → ∫ f dν in probability as T → ∞ for any bounded continuous function f : R^{d_Z} → R.
(iii) The function θ_1 ↦ C(Z_t, θ_1)^{-1} is continuous on Θ_1 a.s.
The ergodicity assumption is standard. Exponential ergodicity and boundedness of moments of any order of the process are also well known. For non-degenerate diffusions, see e.g. Pardoux and Veretennikov [38], Meyn and Tweedie [34] and Kusuoka and Yoshida [26], among many others. For damping Hamiltonian systems, we refer the reader to Wu [49]. The Lyapounov function method provides exponential mixing (even in the non-stationary case) and estimates of the moments of the invariant probability measure up to any order. Wu's paper investigated several examples including the van der Pol model. Additional information is given in Delattre et al. [11]. Let ... and ... The random field Y^{(3)} is well defined under [A1] and [A2]. Obviously, ν depends on the value θ^*. We suppress θ^* from the notation since it is fixed in this article; it would need to vary only in discussions such as that of the asymptotic minimax bound. We will assume all or some of the following identifiability conditions.
[A3] (i) There exists a positive constant χ_1 such that ...
(ii) There exists a positive constant χ_2 such that ...
(iii) There exists a positive constant χ_3 such that ...
In the hypo-elliptic case, which is the most interesting one, checking these identifiability conditions is usually easy, since ν is equivalent to, or at least dominated by, the Lebesgue measure and admits a density that is positive on a non-empty open set. Thus, identifiability is a matter of the parameterization of the model. In particular, it is obvious that this condition causes no difficulty for the linearly parametrized models often appearing in applications.
As already mentioned, we assume h → 0, nh → ∞ and nh^2 → 0 as n → ∞ throughout this article. The condition nh^2 → 0 is a standard one, called the condition of rapidly increasing experimental design (Prakasa Rao [40]). Yoshida [51] relaxed this condition to nh^3 → 0, and Kessler [24] to nh^p → 0 for an arbitrary positive number p. Uchida and Yoshida [48] carried out the Ibragimov-Has'minskii-Kutoyants program under the condition nh^p → 0, with the so-called quasi-likelihood analysis based on the polynomial type large deviation estimate for the quasi-likelihood random field (Yoshida [52]). It is well known that these approaches under nh^p → 0 require more smoothness of the model than our assumptions, because they inevitably involve higher-order expansions of the semigroup. In this paper, when estimating the order of a random variable, we ultimately use only n → ∞, h → 0, nh → ∞ or nh^2 → 0, so it is easy for the reader to recognize which convergence is used in each case. For example, if the reader finds O_p(n^{1/2}h), then quite likely it will be estimated as o_p(1), since n^{1/2}h = (nh^2)^{1/2} → 0. However, we have kept as many traces as possible in the proofs.
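As a concrete check of these conditions, take the illustrative design h = n^{-2/3} (any exponent in (1/2, 1) works similarly). Then

\[
nh = n^{1/3}\to\infty,\qquad nh^{2}=n^{-1/3}\to 0,\qquad n^{1/2}h=(nh^{2})^{1/2}=n^{-1/6}\to 0,
\]

and the rates quoted in the Introduction become n^{-1/2}h^{1/2} = n^{-5/6} for θ_3, (nh)^{-1/2} = n^{-1/6} for θ_2 and n^{-1/2} for θ_1, which makes the gap between the three rates visible.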

Adaptive estimation of θ_3
We write U^{⊗k} = U ⊗ ··· ⊗ U (k times) for a tensor U. For a tensor T = (T_{i_1,...,i_m}) and tensors S_1 = (S_1^{i_1}), ..., S_m = (S_m^{i_m}), we write T[S_1, ..., S_m] = Σ_{i_1,...,i_m} T_{i_1,...,i_m} S_1^{i_1} ··· S_m^{i_m}. This notation will be applied to a tensor-valued tensor T as well.
Remark 3.1. Clearly, this notation has an advantage over matrix-product notation, since the elements S_1, ..., S_m quite often have long expressions in the inference. The matrix notation repeats the S_i twice for a quadratic form, three times for a cubic form, and so on. This notation was introduced by [52] and has already been adopted by many papers, e.g. [45], [48], [53], [31], [23], [32], [13], [37], [36], [35], to name a few. Let ... Define the R^{d_Y}-valued function G_n(z, θ_1, θ_2, θ_3) by ... We will work with some initial estimators θ̂_1^0 for θ_1 and θ̂_2^0 for θ_2. The following standard convergence rates, in part or in full, will be assumed for these estimators: ... The expansions (5.1) and (5.6), together with Lemma 5.5, suggest two approaches to estimating θ_3. The first approach is based on the likelihood of Δ_jY only, without the assistance of Δ_jX. The second one uses the likelihood corresponding to D_j(θ_1, θ_2, θ_3). However, it is possible to show that the first approach yields a larger asymptotic variance; see Section 8. So we take the second approach here.
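As an informal illustration of the bracket notation of Remark 3.1, the following sketch computes the multilinear form T[S_1, ..., S_m] = Σ_{i_1,...,i_m} T_{i_1...i_m} S_1^{i_1}···S_m^{i_m} for vector arguments, and checks the quadratic-form case without repeating the argument; the function name and the example are illustrative only.

```python
import numpy as np

def bracket(T, *S):
    """Multilinear form T[S_1, ..., S_m] = sum_{i_1,...,i_m} T_{i_1...i_m} S_1^{i_1} ... S_m^{i_m},
    computed by contracting the leading index of T with each vector S_k in turn."""
    out = np.asarray(T, dtype=float)
    for v in S:
        out = np.tensordot(out, np.asarray(v, dtype=float), axes=([0], [0]))
    return out

d = 3
A = np.random.default_rng(1).normal(size=(d, d))
u = np.arange(1.0, d + 1)
# quadratic form A[u^{otimes 2}] = u' A u, written without repeating the argument u
assert np.isclose(bracket(A, u, u), u @ A @ u)
```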

Adaptive quasi-likelihood function for θ_3
Recall (2.2): ... Since ... is approximately conditionally Gaussian in the short-time asymptotics, it seems natural to construct a likelihood function based on the local Gaussian approximation. Remark that ... is the covariance matrix of the principal conditionally Gaussian part of (Δ_jX, Δ_jY), if properly scaled and evaluated at z = Z_{t_{j-1}} and at (θ_1, ...); see Lemmas 5.4 and 5.5. We define a log quasi-likelihood function by ... The QMLE θ̂_3^0 for H_n^{(3)} depends on n, as it does on the data (Z_{t_j})_{j=0,1,...,n}; the estimator θ̂_1^0 in the function Ŝ also depends on (Z_{t_j})_{j=0,1,...,n}.
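Numerically, a contrast of this local-Gaussian type is handled as sketched below. This is only a schematic sketch: the arrays D and S stand for the centered, scaled increment vectors and the plug-in covariance matrices produced by model-specific functions (here the hypothetical make_D and make_S) evaluated at the initial estimators and a candidate value of θ_3; the explicit formulas are those of the text and are not reproduced here.

```python
import numpy as np

def quasi_log_lik(D, S):
    """Gaussian-type contrast  -(1/2) * sum_j ( S_j^{-1}[ D_j^{otimes 2} ] + log det S_j ).
    D : array of shape (n, d), the centered and scaled increment vectors,
    S : array of shape (n, d, d), the plug-in covariance matrices,
    both produced by model-specific functions evaluated at (theta1_hat, theta2_hat, theta3)."""
    _, logdet = np.linalg.slogdet(S)                          # log det S_j, j = 1, ..., n
    quad = np.einsum('ji,jik,jk->j', D, np.linalg.inv(S), D)  # quadratic forms S_j^{-1}[D_j, D_j]
    return -0.5 * np.sum(quad + logdet)

# Adaptive QMLE for theta3: maximize over Theta_3 with the initial estimators plugged in.
# make_D and make_S are hypothetical model-specific builders of D and S.
# from scipy.optimize import minimize_scalar
# theta3_hat = minimize_scalar(
#     lambda t3: -quasi_log_lik(make_D(Z, theta1_hat, theta2_hat, t3),
#                               make_S(Z, theta1_hat, t3)),
#     bounds=theta3_bounds, method='bounded').x
```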

Consistency of θ̂_3^0
The proof of Theorem 3.2 is given in Section 6.

Asymptotic normality of θ̂_3^0
The following theorem provides the asymptotic normality of θ̂_3^0. The convergence of θ̂_3^0 is much faster than that of the other components of the estimators. The proof of the following theorem and the definition of M_n^{(3)} are given in Section 6.3.

About initial estimators
Let ... Given the data (Z_{t_j})_{j=0,1,...,n}, let us consider the quasi-maximum likelihood estimator (QMLE) θ̂_1^0 = θ̂_{1,n}^0 for θ_1; that is, θ̂_1^0 is any measurable function of (Z_{t_j})_{j=0,1,...,n} satisfying ... Routinely, n^{1/2}-consistency and asymptotic normality of θ̂_1^0 can be established. We give a brief account for self-containedness and for later use. Let ... for u_1 ∈ R^{p_1}. We will see the existence and positivity of Γ^{(1)} in the following theorem. We refer the reader to Gloter and Yoshida [22] for a proof.
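A sketch of a volatility-type contrast of this standard kind is given below, assuming the user supplies cov_fun(x, y, θ_1) = C(z, θ_1); it is the usual local-Gaussian quasi-log-likelihood for the diffusion coefficient and is meant only to illustrate the construction behind an estimator such as θ̂_1^0, not to reproduce the paper's display verbatim.

```python
import numpy as np

def H1(theta1, X_obs, Y_obs, h, cov_fun):
    """Volatility-type contrast for theta1 (usual local-Gaussian form):
    H1 = -(1/2) sum_j { h^{-1} C(Z_{t_{j-1}}, theta1)^{-1}[ (Delta_j X)^{otimes 2} ]
                        + log det C(Z_{t_{j-1}}, theta1) }.
    X_obs: (n+1, d_X) array, Y_obs: (n+1, d_Y) array;
    cov_fun(x, y, theta1) returns the d_X x d_X matrix C(z, theta1) (user supplied)."""
    dX = np.diff(X_obs, axis=0)
    total = 0.0
    for j in range(len(dX)):
        C = np.atleast_2d(cov_fun(X_obs[j], Y_obs[j], theta1))
        total += dX[j] @ np.linalg.inv(C) @ dX[j] / h + np.linalg.slogdet(C)[1]
    return -0.5 * total
```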

Remark 3.5.
It is possible to show that the quasi-Bayesian estimator (QBE) also enjoys the same asymptotic properties as the QMLE in Theorem 3.4 if we follow the argument in Yoshida [52]. This means that we can use either estimator, together with the estimator for θ_2 given e.g. in Section 3.4, to construct a one-step estimator for θ_3 based on the scheme presented in Section 3.1, and consequently we can construct a one-step estimator for θ = (θ_1, θ_2, θ_3) by the method in Section 4.
We now recall a standard construction of an estimator for θ_2. As usual, the scheme is adaptive. Suppose that an estimator θ̂_1^0 based on the data (Z_{t_j})_{j=0,1,...,n} satisfies Condition [A4] (i), i.e., ... as n → ∞. Obviously we can apply the estimator constructed above, but any estimator satisfying this condition can be used. Define the random field H_n^{(2)} on Θ_2 by ... We denote by θ̂_2^0 = θ̂_{2,n}^0 any sequence of quasi-maximum likelihood estimators for H_n^{(2)}, that is, ... See Gloter and Yoshida [22] for a proof of the following theorem.
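Schematically, an adaptive drift-type contrast for θ_2 with the plug-in θ̂_1^0 can be coded as follows; drift_fun(x, y, θ_2) = A(z, θ_2) and cov_fun(x, y, θ_1) = C(z, θ_1) are user-supplied model functions, and the displayed form is the usual one, intended as an illustration of such a random field rather than a verbatim reproduction of the paper's definition.

```python
import numpy as np

def H2(theta2, X_obs, Y_obs, h, drift_fun, cov_fun, theta1_hat):
    """Adaptive drift-type contrast for theta2 with the plug-in theta1_hat:
    H2 = -(1/(2h)) sum_j C(Z_{t_{j-1}}, theta1_hat)^{-1}[ (Delta_j X - h A(Z_{t_{j-1}}, theta2))^{otimes 2} ].
    drift_fun(x, y, theta2) = A(z, theta2), cov_fun(x, y, theta1) = C(z, theta1) (user supplied)."""
    dX = np.diff(X_obs, axis=0)
    total = 0.0
    for j in range(len(dX)):
        resid = dX[j] - h * np.atleast_1d(drift_fun(X_obs[j], Y_obs[j], theta2))
        Cinv = np.linalg.inv(np.atleast_2d(cov_fun(X_obs[j], Y_obs[j], theta1_hat)))
        total += resid @ Cinv @ resid
    return -total / (2.0 * h)
```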
does not assume that each initial estimator attains the optimal asymptotic variance, nor that it is asymptotically normal. The quasi-maximum likelihood estimator θ̂_2^0 with respect to (3.9) is one option. Another choice of the initial estimator θ̂_2^0 is the simple least squares estimator using the coefficient A, though it is less efficient than the quasi-maximum likelihood estimator for the first component of the model. Theorem 4.1 in this section shows that the one-step estimator for θ_2 recovers efficiency even if such a less efficient estimator is used as the initial estimator for θ_2.
The initial estimator θ̂_3^0 is not necessarily the one defined in Section 3, though we already know that that one satisfies [A4] (iii). That is, the initial estimator θ̂_3^0 used in this section is only required to attain the error rate n^{-1/2}h^{1/2}, not necessarily to achieve an asymptotic variance equal to Γ_{33}^{-1} or less. We know there is an estimator of θ_1 satisfying Condition [A4] (i) based only on the first equation of (1.1). It is known that its information cannot be greater than the matrix ... It will turn out that the amount of information is increased by the one-step estimator. Let ... If H_x is an invertible (square) matrix, then Γ_{11} coincides with ... Otherwise, this is not always true. Let ...

We will use the following random fields: ... and ... To construct one-step estimators, we consider the functions ... The event X_n is a statistic because it is determined only by the data (Z_{t_j})_{j=0,...,n}. For (θ_1, θ_2, θ_3), the one-step estimator (θ̂_1, θ̂_2, θ̂_3) with the initial estimator (θ̂_1^0, θ̂_2^0, θ̂_3^0) is defined by ... We obtain a limit theorem for the joint adaptive one-step estimator.
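In its simplest form, a one-step (Newton-Raphson type) correction of an initial estimator looks as follows; grad_H and hess_H stand for the gradient and Hessian of the relevant quasi-log-likelihood random field and are placeholders in this sketch, whereas the paper's construction uses the block structure and normalizations described above.

```python
import numpy as np

def one_step(theta0, grad_H, hess_H):
    """Generic Newton-Raphson one-step correction of an initial estimator:
    theta_hat = theta0 - hess_H(theta0)^{-1} grad_H(theta0).
    grad_H and hess_H are callables returning the gradient vector and Hessian matrix
    of the quasi-log-likelihood random field evaluated at theta0."""
    theta0 = np.asarray(theta0, dtype=float)
    return theta0 - np.linalg.solve(hess_H(theta0), grad_H(theta0))
```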
Condition [A3] is used to ensure the non-degeneracy of the information matrix. We give a proof of Theorem 4.1 in Section 7.

Basic estimation of the increments
The following sections will be devoted to the proofs.
We have ... Proof. (a) is trivial. For (b), the first term on the right-hand side of (5.2) can be estimated by the Burkholder-Davis-Gundy inequality and Taylor's formula for ... Then, thanks to (5.3), we obtain the following estimate. ... Write ... Then E[ζ_j^{⊗2}] = h^3 I_r for the r-dimensional identity matrix I_r. The function G_n is defined in (3.1). Under sufficient smoothness of the coefficients, we have ... where κ(Z_{t_{j-1}}, θ_1^*, θ_3^*) is given in (3.4).
Proof. (a) can be shown by (5.7), using the estimate (5.3) with the help of Taylor's formula. In addition to the representation (5.6), by using (5.1) and (5.2), we obtain (b).
We denote by (B_xB)(z, θ_2) the tensor defined by (B_xB)... We will apply this rule in similar situations. Let ... and ... with some random variables r_j^{(5.12)} ... for every p > 1. Proof. The decomposition (5.9) is obtained by Itô's formula. The estimate (5.13) is verified by (5.3), since ∂_z(B_xB) and ∂_zA are bounded by a polynomial in z uniformly in θ. The estimate (5.14) uses ∂_2A for θ_2 near θ_2^* as well as ∂_zA evaluated at θ_2^*: ... with some positive constant r and some random variables r_{n,j}^{(5.15)} satisfying (5.15). The small number r was taken to ensure the convexity of the vicinity of θ_2^*. For θ_2 such that |θ_2 − θ_2^*| ≥ r, the estimate (5.14) remains valid by enlarging r_{n,j}^{(5.15)} if necessary.
Proof. By (5.6), we have ... Then the decomposition (5.16) is obvious. The first and third terms on the right-hand side of (5.19) can be estimated with Taylor's formula and (5.3), and the second term is easy to estimate. Thus, we obtain (5.21). Since ∂_{(θ_1,θ_2)}L_H(z, θ_1, θ_2, θ_3^*) is bounded by a polynomial in z uniformly in (θ_1, θ_2), there exist random variables r... for some random variables r... Proof.

Random fields
To solve the problem, we need to exploit the stochastic orthogonality between random fields. Though this technique is standard, in order to carry it out as transparently as possible, we introduce various random fields below. These symbols are useful to clarify which parameters are replaced in each formula and which order of error is incurred, and also to keep large formulas compact and to avoid repeating them. The following random fields depend on n: ...

Proof of Theorem 3.3
Let .

with the help of orthogonality. In particular, ... This implies R^{(6.7)} ... Thus, we obtain the result.
In what follows, we quite often use the estimates in Lemma 6.1 without mentioning them explicitly.

Let as n → ∞.
Proof. By using Lemma 5.6 and Lemma 5.5 (b) together with the convergence rates of the initial estimators, we have ... The open ball of radius r centered at θ is denoted by U(θ, r). Define the random field Φ_n^{(7.1)} by ... This completes the proof.
Proof. The proof is similar to that of Lemma 7.1. First, ... Then we can show the lemma in the same fashion as Lemma 7.1 with a random field ... Let ... We will use the following random fields.
Proof. We have ... Apply Lemma 5.6 and Lemma 5.5 (b) to obtain ... Here we used the assumption that the functions are bounded by a polynomial in z uniformly in the parameters, and the count ... Proof. By using Lemma 5.5 (b) together with the convergence rates of the estimators θ̂_1^0 and θ̂_3^0, and then by Lemma 5.5 (a) and Lemma 5.4, we have ... Here we used the derivative ∂_1H.
We consider the random field Φ_n^{(7.13)} ... − Φ_n^{(7.13)}(0) →^p 0. Since the first term on the right-hand side of (7.12) is nothing but Φ_n^{(7.13)}(u_1^†) on an event whose probability goes to 1, we have already obtained the result.
Proof. We have .
Then this lemma can be proved in the same way as Lemma 7.8. Let ... →^p 0 (7.14) as n → ∞. In particular, ... as n → ∞.
Proof. Let ... Here c is a positive constant, which we will take sufficiently small. Then ... thanks to Lemmas 7.7 and 7.6. On the event X_n^{**(2,3)}, we apply Taylor's formula to obtain ... Then Lemmas 7.6 and 7.10 give (7.14). The martingale central limit theorem then gives (7.15).
The following notation for random fields will be used: ...
We sometimes keep parameters in the notation even when some of them do not appear in a specific expression of the formula, if such an expression is not needed for later use; e.g., Ψ_{1,1} does not in fact depend on θ_2.

Discussion on the estimation of θ_3 when only the information of Δ_jY is available
In Sections 3 and 4, the estimators for θ_3 used the information of Δ_jX as well as Δ_jY, given the data (Z_{t_j})_{j=0,...,n}. A natural question is then what happens when only the information of Δ_jY, i.e., the martingale part of Δ_jY, is available. It is possible to construct a QMLE θ̃_3 for θ_3 based on the quasi-log-likelihood function ... Consistency of θ̃_3 is obtained if the initial estimators are consistent. ... higher-order inference (Sakamoto and Yoshida [41]). Then, under a certain set of conditions, we have ... Therefore θ̂_3^0 is superior to θ̃_3. Remark that θ̂_3^0 was given an initial estimator of θ_2 with error rate O_p((nh)^{-1/2}), but not the above θ̂_2^0 having the faster rate O_p((m_nh)^{-1/2}). In what follows, we will consider a slightly more general θ̂_2^0 and show a convergence result including (8.4) as a special case.
as n → ∞ for some random variable L.