Least squares type estimation of the transition density of a particular hidden Markov chain

In this paper, we study the following hidden Markov model: $Y_i=X_i+\epsilon_i$, $i=1,...,n+1$, with $(X_i)$ a real-valued stationary Markov chain and $(\epsilon_i)_{1\leq i\leq n+1}$ a noise with known distribution, independent of the sequence $(X_i)$. We present an estimator of the transition density obtained by minimizing an original contrast that takes advantage of the regressive structure of the problem. It is selected among a collection of projection estimators by a model selection method. The $L^2$-risk and its rate of convergence are evaluated for ordinary smooth noise, and simulations illustrate the method. We obtain uniform risk bounds over classes of Besov balls. In addition, our estimation procedure requires no prior knowledge of the regularity of the true transition. Finally, our estimator avoids the drawbacks of quotient estimators.


Introduction
In this paper we consider the following additive hidden Markov model: Y_i = X_i + ε_i, i ≥ 1, with (X_i)_{i≥1} a real-valued Markov chain, (ε_i)_{i≥1} a sequence of independent and identically distributed variables, and (X_i)_{i≥1} and (ε_i)_{i≥1} independent. Only the variables Y_1, ..., Y_{n+1} are observed. Besides its initial distribution, the chain (X_i)_{i≥1} is characterized by its transition, i.e. the distribution of X_{i+1} given X_i. We assume that this transition has a density Π, defined by Π(x, y)dy = P(X_{i+1} ∈ dy | X_i = x), and our aim is to estimate this transition density Π.
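As a concrete illustration of the observation scheme Y_i = X_i + ε_i, the following sketch simulates a hidden chain and its noisy observations. The AR(1) transition, the Laplace noise and all parameter values are illustrative choices, not taken from the paper:

```python
import numpy as np

# Minimal sketch of the model Y_i = X_i + eps_i: a hidden stationary
# Markov chain (here AR(1), an illustrative choice) observed through
# an additive noise with known distribution (here Laplace).
def simulate_hmm(n, a=0.5, sigma=1.0, noise_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    # draw X_0 from the stationary law of the AR(1) chain
    x[0] = rng.normal(0.0, sigma / np.sqrt(1 - a**2))
    for i in range(n):
        x[i + 1] = a * x[i] + sigma * rng.normal()
    eps = rng.laplace(0.0, noise_scale, size=n + 1)  # known noise law
    return x, x + eps  # hidden chain, observations

x, y = simulate_hmm(1000)
```

Only y would be available to the statistician; x is returned here for checking purposes.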
This model belongs to the class of hidden Markov models (HMM), a well-known class of discrete-time stochastic processes with many applications in various areas such as biology, speech recognition or finance. For a general reference on these models, we refer to Cappé et al. (2005). Here, we study a simple model of HMM where the noise is additive (which also allows dealing with multiplicative noise by use of a logarithm). In standard HMM, it is assumed that the joint density of (X_i, Y_i) has a parametric form and the aim is then to infer the parameter from the observations Y_1, ..., Y_n, generally by maximizing the likelihood. For this type of study, see, among others, Baum and Petrie (1966), Leroux (1992), Bakry et al. (1997), Bickel et al. (1998), Jensen and Petersen (1999), Douc et al. (2004), Fuh (2006).
This model is also similar to the so-called convolution model (in which the aim is to estimate the density of (X_i)_{i≥1}). As in that model, we use the Fourier transform extensively. The restrictions on the error distribution and the rate of convergence obtained for our estimator are also of the same kind. Related works include Stefanski (1990), Fan (1993), Masry (1993) (for the multivariate case), Pensky and Vidakovic (1999), Comte et al. (2006).
The estimation of the transition density of a hidden Markov chain is studied by Clémençon (2003). His estimator is based on the thresholding of a wavelet-vaguelette decomposition. Its drawback is that it does not achieve the minimax rate, because of a logarithmic loss. Lacour (2007b) describes an estimation procedure by quotient of an estimator of the joint density and an estimator of the stationary density f. The minimax rate is reached by this estimator if it is assumed that f and fΠ have regularity α. But this smoothness condition on f raises a problem. Indeed, Clémençon (2000) gives an example in which the stationary density f is not continuous whereas the transition density Π is constant. This shows that f can be much less regular than Π. Therefore, our aim is to find an estimator of the transition density which does not have the above-mentioned disadvantages.
To estimate Π, we use an original contrast inspired by the mean square contrast. The first idea is to connect our problem with the regression model. For any function g, we can write g(X_{i+1}) = ∫Π(X_i, y)g(y)dy + η_{i+1}, where η_{i+1} = g(X_{i+1}) − E[g(X_{i+1})|X_i]. Then, for every function g, we can consider x ↦ ∫Π(x, y)g(y)dy as a regression function. The mean square contrast to estimate this regression function, if the X_i were known, would be (1/n)Σ_{i=1}^n [g(X_{i+1}) − t(X_i)]², which can be expressed in terms of T by setting T(x, y) = t(x)g(y), i.e. T such that ∫T(x, y)g(y)dy = t(x). It is this contrast which is used in Lacour (2007a), but in our case only Y_1, ..., Y_{n+1} are known. Therefore we introduce in this paper two operators Q and V which compensate for the noise. This leads to the following contrast:

γ_n(T) = (1/n) Σ_{i=1}^n [ Q_{∫T²(·,y)dy}(Y_i) − 2 V_T(Y_i, Y_{i+1}) ].   (3)

A collection of estimators is then defined by minimization of this contrast on wavelet spaces. Indeed, wavelets have many useful properties; in particular they can have compact support and can be regular enough to balance the smoothness of the noise. A general reference on the subject is Meyer (1990)'s book.
A method of model selection inspired by Barron et al. (1999) and based on contrast (3) is used to build our final estimator. A data-driven choice of model is performed via the minimization of a penalized criterion: the chosen model is the one which minimizes the empirical risk plus a penalty function. In most cases, when estimating mixing processes, a mixing term appears in this penalty. In the same way, some unknown terms derived from the dependence between the X_i appear in the thresholding constant used to define the estimator of Clémençon (2003). Here a conditioning argument enables us to avoid such a mixing term in the penalty. Our penalty contains only known quantities or terms that can be estimated, and is thus computable.
For an ordinary smooth noise with regularity γ, the rate of convergence n^{−2α/(2α+4γ+2)} is obtained if it is assumed that the transition Π belongs to a Besov space with regularity α. Our estimator is then better than that of Clémençon (2003), which achieves only the rate (ln(n)/n)^{2α/(2α+4γ+2)}. Moreover, this rate is obtained without assuming the regularity α of Π to be known.
This paper is organized as follows. In Section 2 we present the model and the assumptions. Section 3 is devoted to the definitions of the contrast and of the estimator. The main result and a sketch of proof are to be found in Section 4. Numerical illustrations through simulated examples are reported in Section 5. The detailed proofs are gathered in Section 6.

Notations
For the sake of clarity, we use lowercase letters for dimension 1 and capital letters for dimension 2. For a function t : R → R, we denote by ‖t‖ the L² norm, that is ‖t‖² = ∫_R t²(x)dx. The Fourier transform t* of t is defined by t*(u) = ∫ e^{−ixu} t(x)dx. Notice that t is then the inverse Fourier transform of t* and can be written t(x) = (1/2π) ∫ e^{ixu} t*(u)du. The convolution product is defined by (t ⋆ s)(x) = ∫ t(x − y)s(y)dy.
In the same way, for a function T : R² → R, ‖T‖² = ∫_{R²} T²(x, y)dxdy, and the Fourier transform T* is defined analogously on R². We denote by t ⊗ s the function (x, y) ↦ (t ⊗ s)(x, y) = t(x)s(y).
We will estimate Π on a compact set A = A_1 × A_2 only, and we denote by ‖·‖_A the norm in L²(A), i.e. ‖T‖²_A = ∫_A T²(x, y)dxdy.

Assumptions on noise
The Markov chain (X_i)_{i≥1} is observed through a noise sequence (ε_i)_{i≥1} of independent and identically distributed random variables. The density of ε_i is denoted by q and is assumed to be known. We assume that the Fourier transform of q never vanishes and that q is ordinary smooth. More precisely, the assumption on the error density is the following:

H1 q is uniformly bounded and there exist γ > 0 and k_0, k_1 > 0 such that, for all u in R, k_0(u² + 1)^{−γ/2} ≤ |q*(u)| ≤ k_1(u² + 1)^{−γ/2}.

This assumption restrains the regularity class of the noise. Among the so-called ordinary smooth noises, we can cite the Laplace distribution, the exponential distribution and all the Gamma or symmetric Gamma distributions. The noise follows a Gamma distribution with scale parameter λ and shape parameter ζ if q(x) = λ^ζ x^{ζ−1} e^{−λx}/Γ(ζ) for x > 0, with Γ the classical Gamma function. Then |q*(u)| = (1 + u²/λ²)^{−ζ/2}, so q is bounded and verifies H1 with γ = ζ. The case ζ = 1 corresponds to an exponential distribution, and if λ = 1/2, ζ = p/2, it is a chi-square χ²(p). A Laplace noise, with density proportional to e^{−|x|/b}, has Fourier transform q*(u) = 1/(1 + b²u²); then H1 is satisfied with γ = 2. More generally, we can define the symmetric Gamma distributions, which satisfy H1 as well.

Remark 1. We have to point out that the Gaussian noise does not verify Assumption H1. Indeed, an exponential decrease of the Fourier transform of the error density is more difficult to control, and a supersmooth noise makes denoising more difficult. For that reason, many authors, among which Butucea (2004), Koo and Lee (1998) or Youndjé and Wells (2002), have considered only ordinary smooth noise. The method used in this paper does not allow dealing with supersmooth noise. Indeed, it requires a wavelet basis which is more regular than the noise and compactly supported (because of Assumption H4 below), which is impossible when the noise is supersmooth.
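The polynomial decay required by H1 is easy to check numerically. The sketch below does so for a standard Laplace noise with density q(x) = e^{−|x|}/2 (the unit scale is an illustrative choice), whose Fourier transform is q*(u) = 1/(1 + u²), so that |q*(u)| |u|^γ tends to a positive constant with γ = 2:

```python
import numpy as np

# Check of the ordinary-smooth condition H1 for a Laplace noise with
# density q(x) = exp(-|x|)/2 and Fourier transform q*(u) = 1/(1+u^2):
# |q*(u)| behaves like |u|^{-2}, i.e. gamma = 2 in H1.
def q_star_laplace(u):
    return 1.0 / (1.0 + u**2)

u = np.array([10.0, 100.0, 1000.0])
decay = q_star_laplace(u) * u**2   # should approach a constant (here 1)
```

The same check with a Gaussian q* would give a product that collapses to 0 for every power of u, which is why the Gaussian noise is excluded by H1.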

Assumptions on the chain
The hypotheses on the hidden Markov chain (X_i)_{i≥1} are the following:

H2 The chain is irreducible, positive recurrent and stationary with unknown density f.

H3 There exists a positive real f_0 such that, for all x in A_1, f(x) ≥ f_0.

H5 The process (X_k) is geometrically β-mixing (β_q ≤ e^{−θq}), or arithmetically β-mixing (β_q ≤ q^{−θ}) with θ > 8, where β_q = ∫ ‖P^q(x, ·) − μ‖_TV μ(dx), with P^q(x, ·) the distribution of X_{i+q} given X_i = x, μ the stationary distribution and ‖·‖_TV the total variation distance.
We refer to Doukhan (1994) for details on β-mixing. Assumption H5 implies that the process (Y_k) is β-mixing, with β-mixing coefficients smaller than those of (X_k). Assumption H3 is common (but restrictive) and is crucial to control the empirical processes brought into play. Many processes verify Assumptions H2–H5, such as autoregressive processes, diffusions or ARCH processes. These examples are detailed in Lacour (2007a).
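The geometric decay required by H5 can be visualized on a Gaussian AR(1) chain, for which everything is explicit: the law of X_{i+q} given X_i = x is N(a^q x, σ²(1 − a^{2q})/(1 − a²)) and the stationary law is N(0, σ²/(1 − a²)), so the total-variation distance between them decays like a^q. All parameter values below are illustrative:

```python
import numpy as np

# Illustration of geometric beta-mixing for a Gaussian AR(1) chain
# X_{i+1} = a X_i + sigma eps_i: the total-variation distance between
# the q-step conditional law and the stationary law decays like a^q.
def tv_gaussians(m1, s1, m2, s2):
    """Total-variation distance between N(m1,s1^2) and N(m2,s2^2),
    computed by numerical integration on a fine grid."""
    grid = np.linspace(-20.0, 20.0, 8001)
    dx = grid[1] - grid[0]
    d1 = np.exp(-(grid - m1)**2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
    d2 = np.exp(-(grid - m2)**2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
    return 0.5 * np.sum(np.abs(d1 - d2)) * dx

a, sigma, x0 = 0.5, 1.0, 3.0
s_inf = sigma / np.sqrt(1 - a**2)          # stationary standard deviation
tv = [tv_gaussians(a**q * x0,
                   sigma * np.sqrt((1 - a**(2 * q)) / (1 - a**2)),
                   0.0, s_inf)
      for q in (1, 5, 10)]
```

The β-mixing coefficient β_q of H5 is the average of this distance over the stationary law; the pointwise decay above already shows the geometric behaviour.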

Projection spaces
Here we describe the projection spaces that we use to estimate the transition Π. We will consider an increasing sequence of spaces, indexed by m, to construct a collection of estimators. For the sake of simplicity, we assume that A = [0, 1]².
We use a compactly supported wavelet basis on the interval [0, 1], described in Cohen et al. (1993). The construction provides a set of functions (φ_k) for k = 0, ..., 2^J − 1, with J a fixed level, and for all j > J a set of functions (ψ_jk), k = 0, ..., 2^j − 1. The collection of these functions forms a complete orthonormal system on [0, 1], so that any u in L²([0, 1]) can be expanded on it. Here φ is a Daubechies father wavelet with support [−N + 1, N], and φ⁰, φ¹ are edge wavelets explicitly constructed in Cohen et al. (1993). For r a positive real, N is chosen large enough so that φ has regularity r (in the sense defined in (4)): this is possible since it is a property of the Daubechies wavelets that the smoothness of φ increases linearly with N. We choose J such that 2^J ≥ 2N, so that the two edges do not interact (no overlap between φ⁰ and φ¹). The construction ensures that φ⁰ and φ¹ are also of regularity r. In the same way, for each level j, the ψ_jk are dilations and translations of functions ψ, ψ⁰ and ψ¹ with regularity r. Now we construct a wavelet basis of L²([0, 1]²) by the tensor product method (see Meyer (1990), Chapter 3, Section 3). The father wavelet is φ ⊗ φ and the mother wavelets are φ ⊗ ψ, ψ ⊗ φ, ψ ⊗ ψ, so that any function T in L²([0, 1]²) can be expanded on this basis. For the sake of simplicity, we write the basis functions ϕ_jk ⊗ ϕ_jl, where ϕ_jk(x) = 2^{j/2}ϕ(2^j x − k) with ϕ = φ, φ⁰, φ¹, ψ, ψ⁰ or ψ¹ according to the values of j and k. For j > J, Λ_j is a set with cardinality 3·2^{2j}, and Λ_J is a set with cardinality 2^{2J}. In the rest of this paper we will use the following property of ϕ, deriving from the regularity of the initial Daubechies wavelet: there exists a positive constant k_3 such that ϕ satisfies the decay condition (4). Now, for m ≥ J, we can consider the space S_m spanned by the ϕ_jk ⊗ ϕ_jl for J ≤ j ≤ m. Note that the functions in S_m are all supported in [0, 1]². The dimension of S_m is D_m² = 2^{2m}. We denote by S the space S_{m0} with the greatest dimension D²_{m0} = D² smaller than n^{1/(4γ+2)}. It is the maximal space that we consider. The spaces S_m have the following property, which derives from the orthonormality of the basis: ‖Σ_{jkl} a_{jkl} ϕ_jk ⊗ ϕ_jl‖² = Σ_{jkl} a²_{jkl}.
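To fix ideas, the nested projection spaces S_m can be illustrated with the simplest wavelet system, the Haar basis, used here as an illustrative stand-in for the smoother Daubechies wavelets that the theory requires: at level m, S_m is spanned by indicators of dyadic squares of side 2^{−m}, and the L² projection onto it is a local averaging.

```python
import numpy as np

# Toy illustration of the nested spaces S_m on [0,1]^2 with the Haar
# system (a stand-in for the Daubechies basis of the paper): the L2
# projection onto S_m replaces a function by its mean on each dyadic
# square of side 2^{-m}; the dimension of S_m is 4^m.
def project_haar(F, m):
    """L2 projection of a function sampled on an n x n grid onto S_m."""
    n = F.shape[0]
    b = n // 2**m                       # grid samples per dyadic cell
    out = np.empty_like(F)
    for i in range(2**m):
        for j in range(2**m):
            cell = F[i*b:(i+1)*b, j*b:(j+1)*b]
            out[i*b:(i+1)*b, j*b:(j+1)*b] = cell.mean()
    return out

grid = np.linspace(0.0, 1.0, 64)
g = np.add.outer(grid**2, grid)         # a smooth test function
coarse = project_haar(g, 2)
fine = project_haar(g, 5)
```

As expected for nested spaces, the approximation error decreases as m grows, which is the bias term of the bias-variance trade-off discussed below.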
Now, for every function t : R → R, let v_t be the inverse Fourier transform of t*/q*(−·), i.e. v_t(x) = (1/2π) ∫ e^{ixu} t*(u)/q*(−u) du.
This operator is introduced because it verifies E[v_t(x + ε_i)] = t(x) for every function t and every real x. We can then state the following lemma, which is proved in Section 6.
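The defining property of v_t can be checked numerically. For a symmetric Laplace noise with q*(u) = 1/(1 + u²), the Fourier definition gives v_t*(u) = t*(u)(1 + u²), i.e. v_t = t − t″ in the space domain, and the property E[v_t(x + ε)] = t(x) reads (v_t ⋆ q)(x) = t(x) since q is symmetric. The test function t (a Gaussian bump) and the grid are illustrative choices:

```python
import numpy as np

# Sketch of the deconvolution operator v_t for a Laplace noise with
# q*(u) = 1/(1+u^2): then v_t = t - t'', and convolving v_t with q
# recovers t, i.e. E[v_t(x + eps)] = t(x). Checked on a grid.
x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]
t = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # test function t
t2 = np.gradient(np.gradient(t, dx), dx)          # second derivative t''
v_t = t - t2                                      # v_t for Laplace noise
q = 0.5 * np.exp(-np.abs(x))                      # Laplace density
recovered = np.convolve(v_t, q, mode="same") * dx
err = np.max(np.abs(recovered - t))               # should be tiny
```

Without the correction (convolving t itself with q) the discrepancy is large, which is exactly the bias that v_t is designed to remove.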

Construction of a contrast
Let us now estimate the transition density of the Markov chain by minimizing a contrast. This section is devoted to the definition of this contrast. We explain here how it can be obtained, first by considering the case without noise.
3.2.1. First step: if X_1, ..., X_{n+1} were observed. We present here a heuristic to explain our choice of contrast, assuming first that the (X_i) are known. For every function g, the definition of the transition density implies E[g(X_{i+1})|X_i] = ∫Π(X_i, y)g(y)dy, so that we can write g(X_{i+1}) = ∫Π(X_i, y)g(y)dy + η_{i+1} with E[η_{i+1}|X_i] = 0. We then recognize a regression model. A contrast to estimate ∫Π(·, y)g(y)dy is (1/n)Σ_{i=1}^n [g(X_{i+1}) − t(X_i)]². It is the classical mean square contrast to estimate a regression function. But we want to estimate Π(·, y), and not only ∫Π(·, y)g(y)dy. So we observe that if ‖g‖² = 1 and T(x, y) = u(x)g(y), then u(·) = ∫T(·, y)g(y)dy. So if u(·) = ∫T(·, y)g(y)dy estimates ∫Π(·, y)g(y)dy, we can expect that T estimates Π. Since ∫T²(·, y)dy = u²(·), the contrast becomes (1/n)Σ_{i=1}^n [∫T²(X_i, y)dy − 2T(X_i, X_{i+1})]. It is the contrast studied in Lacour (2007a), and it allows for a good estimation of Π when the Markov chain is observed. We can observe that its expectation equals ‖T‖²_f − 2⟨T, Π⟩_f = ‖T − Π‖²_f − ‖Π‖²_f, where f is the density of X_i and ‖T‖²_f = ∫∫ T²(x, y)f(x)dxdy. Thus this contrast is an empirical counterpart of the distance ‖T − Π‖_f.

Second step: the X i 's are unknown, the observations are the Y i 's
The aim of this step is to modify the previous contrast in order to take into account that the X_i's are not observed. To do this, we use the same technique as in the convolution problem (see Comte et al. (2006)). Let us denote by F_X the density of (X_i, X_{i+1}) and by F_Y the density of (Y_i, Y_{i+1}). We remark that F_Y = F_X ⋆ (q ⊗ q), hence F*_Y = F*_X (q* ⊗ q*); the Parseval equality then allows us to transfer integrals against F_X into integrals against F_Y. The idea is to define V_T via V*_T = T*/(q* ⊗ q*), so that E[V_T(Y_i, Y_{i+1})|X_i, X_{i+1}] = T(X_i, X_{i+1}). Then we replace the term T(X_i, X_{i+1}) in the contrast by V_T(Y_i, Y_{i+1}). In the same way, we find an operator Q to replace the term ∫T²(X_i, y)dy. More precisely, for every function T, let V_T be the inverse Fourier transform of T*/(q* ⊗ q*)(−·).
V and Q have been chosen so that the following lemma holds.
Lemma 2. For all k ∈ {1, ..., n + 1}, the operators V and Q satisfy in particular E[V_T(Y_k, Y_{k+1})|X_1, ..., X_{n+1}] = T(X_k, X_{k+1}) and E[Q_{∫T²(·,y)dy}(Y_k)|X_k] = ∫T²(X_k, y)dy.

Points 1. and 3. are proved in Section 6; the other assertions are their immediate consequences. Note that V and Q are strongly linked with v: in particular, V_{t⊗s} = v_t ⊗ v_s. By using the operators V and Q, we now define the contrast, depending only on the observations Y_1, ..., Y_{n+1}:

γ_n(T) = (1/n) Σ_{i=1}^n [ Q_{∫T²(·,y)dy}(Y_i) − 2 V_T(Y_i, Y_{i+1}) ].   (3)

We want to estimate Π by minimizing γ_n. The definition of the contrast leads to the following "empirical norm": Ψ_n(T) = (1/n) Σ_{i=1}^n Q_{∫T²(·,y)dy}(Y_i). The term "empirical norm" is used because EΨ_n(T) = ‖T‖²_f, but Ψ_n is not a norm in the common sense of the word.

Definition of the estimator
We have to minimize the contrast γ_n to find our estimator. By writing T = Σ_{j=J}^m Σ_{(k,l)∈Λ_j} a_{jkl} ϕ_jk ⊗ ϕ_jl = Σ_λ a_λ ω_λ(x, y), and by denoting A_m the vector of the coefficients a_λ of T, we obtain γ_n(T) = ᵗA_m G_m A_m − 2 ᵗA_m Z_m, where the matrix G_m and the vector Z_m are computable from the observations. But the matrix G_m is not necessarily invertible. This is why we introduce the set Γ = { min Sp(G_m) ≥ (2/3) f_0 }, where Sp denotes the spectrum, i.e. the set of the eigenvalues of the matrix, and f_0 is the lower bound of f on A_1. On Γ, G_m is invertible and γ_n is convex, so that the minimization of γ_n is equivalent to Equation (5) and admits the solution Â_m = G_m^{−1} Z_m.

Remark 2. The term 2/3 in Γ can be replaced by any constant smaller than 1. Moreover, the construction of Π̂_m described here requires the knowledge of f_0. Nevertheless, when f_0 is unknown, we can replace it by an estimator f̂_0 defined as the minimum of an estimator of f (for an estimator of the density of a hidden Markov chain, see Lacour (2007b)). The result is then unchanged if f is regular enough and the mixing rate high enough.
We then have an estimator Π̂_m of Π for each S_m. But we have to choose the best model m̂, to obtain an estimator which achieves the best rate of convergence whatever the regularity of Π. So we set m̂ = arg min_{m∈M_n} [γ_n(Π̂_m) + pen(m)], where pen is a penalty function to be specified later and M_n is the collection of model indices. Then we can define our final estimator Π̃ = Π̂_m̂.
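The two-step scheme above (solve the normal equations for each model, then pick the model minimizing the penalized criterion) can be sketched in a simplified setting. The sketch below replaces the wavelet spaces and the noise-corrected contrast by a plain polynomial regression; the basis, the target function and the penalty constant are illustrative assumptions, but the selection mechanism, m̂ = arg min [−ᵗÂ_m Z_m + pen(m)] with Â_m = G_m^{−1} Z_m, is the one described in the text:

```python
import numpy as np

# Schematic penalized model selection: for each model m, solve
# G_m A_m = Z_m, then choose m minimizing -A_m' Z_m + pen(m).
# Toy setting: nested polynomial models for a 1-D regression; the
# penalty constant kappa is an illustrative calibration.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)

def criterion(m, kappa=2.0):
    B = np.vander(x, m + 1, increasing=True)   # basis of model m
    G = B.T @ B / n                            # Gram matrix G_m
    Z = B.T @ y / n                            # vector Z_m
    A = np.linalg.solve(G, Z)                  # A_m = G_m^{-1} Z_m
    return -A @ Z + kappa * (m + 1) / n        # minimized contrast + pen

m_hat = min(range(1, 15), key=criterion)
```

The penalty term prevents the criterion from always preferring the largest model, exactly as pen(m) does for the wavelet spaces S_m.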

Risk and rate of convergence
For a function G and a subspace S, we define d(G, S) = inf_{T∈S} ‖G − T‖_A. We recall that A is the estimation area. For each estimator Π̂_m, we have the following decomposition of the risk into a bias term and a variance term.

Proposition 1. We consider a Markov chain and a noise satisfying Assumptions H1–H5 with γ ≥ 3/4. For m fixed in M_n, we consider Π̂_m the estimator of the transition density Π previously described. Then there exists C > 0 such that E‖Π̂_m − Π‖²_A ≤ C(‖Π − Π_m‖²_A + D_m^{4γ+2}/n).

We do not prove this proposition because it is contained in Theorem 1 below, which is proved in Section 6. Now, if Π belongs to a Besov space with regularity α, it is a common approximation property of the wavelet spaces that ‖Π − Π_m‖²_A ≤ C D_m^{−2α}. By choosing D_{m1} = n^{1/(2α+4γ+2)}, we obtain the minimum risk, of order n^{−2α/(2α+4γ+2)}. But this choice of m_1 is impossible if α is unknown (which is a priori the case since Π is unknown). That is why we have built our estimator Π̃ via model selection. Now we can state the following theorem.
Theorem 1. We consider a Markov chain and a noise that satisfy Assumptions H1–H5 with γ > 3/4. We consider Π̃ the estimator of the transition density Π previously described, with r > 2γ + 3/2. Then E‖Π̃ − Π‖²_A ≤ C inf_{m∈M_n} (‖Π − Π_m‖²_A + pen(m)) + C′/n. Note that this result is non-asymptotic. It is an advantage of the least squares method over the quotient method.
All the constants on which the penalty depends do not have the same status. The constants Φ_1, γ and ‖q‖_∞ are known, since the wavelet basis and the noise distribution are known. The constant f_0 is unknown, but it can be estimated (see Remark 2). Then, even if it means replacing f_0 by an estimator f̂_0, the penalty is computable. In particular, the dependence coefficients of the sequence do not appear at all in the penalty.
The condition γ > 3/4 is due to an additional term of order D_m^{2γ+7/2}/n in the variance: if γ > 3/4, the term D_m^{4γ+2}/n is the dominant one. If γ = 3/4, the result is still true but the constant in the penalty also depends on ‖Π‖_A. In the other cases the estimation is possible, but the term D_m^{2γ+7/2}/n is not negligible any more and the order of the variance (and consequently the rate of convergence) must be changed. This constraint γ > 3/4 is not very restrictive, since γ must be larger than 1/2 in order for q to be square integrable. Moreover, in the case of a Gamma noise, q is not bounded if γ < 1.
We can now evaluate the rate of convergence of our estimator.
Corollary 1. We suppose that the restriction of Π to A belongs to the Besov space B^α_{2,∞}(A) with α < r. Then, under the assumptions of Theorem 1, E‖Π̃ − Π‖²_A = O(n^{−2α/(2α+4γ+2)}). To our knowledge, the minimax rates are unknown for the specific estimation problem we consider here, and finding them is beyond the scope of this paper. Nevertheless, Clémençon (2003) proved that the rate n^{−2α/(2α+4γ+2)} is optimal whenever f and fΠ belong to B^α_{2,∞}(R) and B^α_{2,∞}(R²) respectively. We remark that we obtain the same rate of convergence with Π̃ as the one obtained with Π̂_{m1}, where D_{m1} = n^{1/(2α+4γ+2)}, but without requiring the knowledge of α. Moreover, our estimator is better than that of Clémençon (2003), which achieves only the rate (ln(n)/n)^{2α/(2α+4γ+2)}. It is also an improvement on the result of Lacour (2007b), because this rate is obtained without requiring any regularity for f or fΠ.
To compare the quotient method described in Lacour (2007b) with the one introduced in this paper, we can say that only the quotient method allows dealing with supersmooth distributions, at least from a theoretical point of view. However, the least squares method has the advantage of giving a good rate of convergence without requiring prior information on the stationary density. Moreover, our result is non-asymptotic, contrary to that of Lacour (2007b).

Sketch of proof of Theorem 1
We give in this section a sketch of proof of Theorem 1.
Let m ∈ M_n. We denote by Π_m the orthogonal projection of Π on S_m. We use a bias-variance decomposition of the risk on A. On Γ, the definitions of Π̂_m and m̂ lead to the inequality γ_n(Π̃) + pen(m̂) ≤ γ_n(Π_m) + pen(m), which involves a centered empirical process Z_{n,m}(T). The main steps of the proof are then: 1. to control the term sup_{T∈B_f(m,m̂)} Z_{n,m̂}(T); 2. to link the empirical "norm" Ψ_n with the L² norm ‖·‖_A.
• To deal with the supremum of the empirical process Z_{n,m}(T), we use an inequality of Talagrand stated in Lemma 6 (Section 6.8). This inequality is very powerful but can be applied only to a sum of independent random variables. That is why we split Z_{n,m}(T) into three processes plus a bias term.
For the first process Z^{(1)}_n, we are back to independent variables by remarking that, conditionally on X_1, ..., X_{n+1}, the couples (Y_{2i−1}, Y_{2i}) are independent (see Proposition 3).
For the other processes, we use the mixing assumption H5 to build auxiliary variables X*_i, which are approximations of the X_i's and which constitute independent clusters of variables (see Proposition 4).
• To pass from Ψ_n to the L² norm, we introduce a set ∆ on which Ψ_n and ‖·‖²_f are equivalent on S. We can easily prove (see Section 6.3) that ∆ ⊂ Γ.

Simulations
To illustrate the method, we compute our estimator Π̃ for different Markov processes with known transition density. The estimation procedure involves several Fourier transforms. This may seem heavy but, for each noise distribution, the computation of v_{ϕ_jk} for all the basis functions can be done beforehand. Here we use the Daubechies wavelet D20. Next, to compute Π̃ from the data Y_1, ..., Y_{n+1}, we use the following steps (see Section 3.3):
• For each m, compute the matrix G_m and the vector Z_m;
• Deduce Â_m = G_m^{−1} Z_m;
• Select the m̂ which minimizes γ_n(Π̂_m) + pen(m) = −ᵗÂ_m Z_m + pen(m);
• Compute Π̃ from the coefficients Â_m̂.
Actually, following the theoretical procedure, we should set Π̂_m = 0 on Γ^c (see Section 3.3) but, for practical purposes, it is more sensible to invert G_m whenever possible. In all the examples examined below, the minimum of the spectrum of G_m has never been too small (so that we merely inverted G_m without using the set Γ). The reason is that P(Γ^c) is very small: it appears in the proofs that it can be bounded by an exponential inequality.
• A radial Ornstein-Uhlenbeck process (in its discrete version). For j = 1, ..., δ, we define the processes ξ^j_{n+1} = aξ^j_n + βε^j_n, where the ε^j_n are i.i.d. standard Gaussian. The chain is then defined by X_n = (Σ_{j=1}^δ (ξ^j_n)²)^{1/2}. The transition density is given in Chaleyat-Maurel and Genon-Catalot (2006), where this process is studied in detail; it involves the Bessel function I_{δ/2−1} with index δ/2 − 1. This process (here with a = 0.5, β = 3, δ = 3) is denoted by √CIR, since its square is actually a Cox-Ingersoll-Ross process. The estimation domain for this process is [2, 10]².
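A simulation of this chain is straightforward; the sketch below uses the parameters a = 0.5, β = 3, δ = 3 from the text, and a burn-in period (the burn-in length is an illustrative choice, see also the next paragraph) so that the samples are approximately stationary:

```python
import numpy as np

# Sketch of the discrete radial Ornstein-Uhlenbeck chain: delta
# independent AR(1) components xi^j, and X_n = sqrt(sum_j (xi_n^j)^2).
# Parameters a=0.5, beta=3, delta=3 as in the paper; the burn-in
# length is an illustrative choice.
def simulate_sqrt_cir(n, a=0.5, beta=3.0, delta=3, burn=500, seed=2):
    rng = np.random.default_rng(seed)
    xi = np.zeros(delta)
    out = []
    for _ in range(n + burn):
        xi = a * xi + beta * rng.normal(size=delta)
        out.append(np.sqrt(np.sum(xi**2)))
    return np.array(out[burn:])    # drop burn-in to approach stationarity

x = simulate_sqrt_cir(2000)
```

The observations would then be Y_n = X_n + ε_n with one of the noise distributions below; the simulated values indeed concentrate in the estimation domain [2, 10].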
For this last chain, the stationary density is not explicit. So we simulate n + 500 variables and use only the last n for the estimation, to ensure the stationarity of the process. For the other chains, it is sufficient to simulate an initial variable X_0 with density f.

We consider two different noises:
Laplace noise. In this case, the density of ε_i is a Laplace density. The smoothness parameter is γ = 2, so that the theoretical penalty is of order D_m^{10}/n.

Gaussian noise. In this case, the density of ε_i is Gaussian. This noise does not verify Assumption H1, but it is interesting to see whether this assumption is also necessary for practical purposes. Given the exponential regularity of this noise, we consider a penalty adapted to it, whose constant κ = 5 is calibrated by simulation experiments.
Table 1 presents the L² risks of our estimator of the transition density for the six Markov chains and the two noises. These results can be compared with those of Lacour (2007a) (Table 2), where the processes AR(i), √CIR and ARCH are studied but directly observed, i.e. without noise. The risk values are higher in our case, but of the same order, which is satisfactory. It is noticeable that the estimation works almost as well with the Gaussian noise, but with a slower decrease of the risk, as can be observed in Figure 1. It is a classical phenomenon in deconvolution problems, since the Gaussian noise is much more regular than the Laplace noise.
Figure 2 shows the result for the process ARCH observed through a Laplace noise: the surfaces z = Π(x, y) and z = Π̃(x, y) are presented. We also give cross-sections of these surfaces. We can see in Figure 3 the curve z = Π(x, −0.44) versus z = Π̃(x, −0.44), and the curve z = Π(1.12, y) versus z = Π̃(1.12, y), for the process AR(i). Generally, for a multidimensional estimation, the joint control of the two directions does not allow one to do as well as in a classical one-dimensional function estimation. Nevertheless, here the curves are very close.
From a practical point of view, it is difficult to compare the method described here with that of Lacour (2007b); indeed, the bases used are very different. However, we can say that the quotient method seems to give better results when the noise distribution is Gaussian (which conforms to the theory), whereas the least squares procedure is better for a Laplace noise, especially when n is small.

Detailed proofs
• The computation of v_{ϕ_jk} gives an explicit formula. Next, it follows from Assumption H1 that |v_{ϕ_jk}(x)| ≤ C_{1,γ}(2^j)^{γ+1/2}/(2πk_0), using Lemma 5 (Section 6.8), since r > γ + 1.
• To prove P5, we apply the Parseval equality. Using H1 and given that 2r > 2γ + 1, we obtain the announced bound. Then Lemma 5 (Section 6.8) shows that the corresponding sum converges. Hence, since r > γ + 2, the required constant exists.
• Applying Parseval's equality and then Lemma 5 again, it is sufficient to sum the resulting quantity over all k, k′, taking into account the superposition of the supports, to prove P7 as soon as Φ_1 ≥ 3C(4N − 3).

Proof of Lemma 2
1. First we write the expectation explicitly. By using the independence between (X_i) and (ε_i), we compute the conditional expectation and obtain the result.
3. We proceed in a similar way. By using the independence between (X_i) and (ε_i), and by denoting by T_y the function x ↦ T_y(x) = T(x, y), we obtain the announced identity (10).

Proof of Theorem 1
We start by introducing some auxiliary variables, whose existence is ensured by the mixing Assumption H5. In the case of arithmetic mixing, since θ > 8, there exists a real c such that 0 < c < 3/8 and cθ > 3; in this case, we set q_n = ⌊n^c⌋. In the case of geometric mixing, we set q_n = ⌊c ln(n)⌋, where c is a real larger than 3/θ. For the sake of simplicity, we suppose that n + 1 = 2p_n q_n, with p_n an integer. For l = 0, ..., p_n − 1, let A_l = (X_{2lq_n+1}, ..., X_{(2l+1)q_n}) and B_l = (X_{(2l+1)q_n+1}, ..., X_{(2l+2)q_n}). As in Viennet (1997), by using Berbee's coupling lemma, we can build a sequence (A*_l) of independent blocks with the same distribution as the (A_l). In the same way, we build (B*_l), and we define the corresponding auxiliary variables X*_i. Let us recall that S is the space S_m with maximal dimension D² ≤ n^{1/(4γ+2)}. Let us fix m ∈ M_n. We denote by Π_m the orthogonal projection of Π on S_m.
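The alternating block structure used in this coupling step can be sketched as follows. Only the index blocks are built here; Berbee's lemma is the probabilistic ingredient that then replaces the variables inside the A_l (resp. B_l) blocks by independent copies. The exponent c is an illustrative admissible value for the arithmetic-mixing case:

```python
import numpy as np

# Sketch of the block construction: the indices 0..n are split into
# 2 p_n alternating blocks A_l, B_l of length q_n = floor((n+1)^c).
# Berbee's lemma (not implemented here) then couples the variables
# within the A_l (resp. B_l) with independent copies.
def make_blocks(n, c=0.25):
    q = int(np.floor((n + 1) ** c))          # block length q_n
    p = (n + 1) // (2 * q)                   # number of block pairs p_n
    A = [np.arange(2 * l * q, (2 * l + 1) * q) for l in range(p)]
    B = [np.arange((2 * l + 1) * q, (2 * l + 2) * q) for l in range(p)]
    return A, B

A, B = make_blocks(999)
```

With n + 1 = 2 p_n q_n exactly, the blocks partition all the indices; otherwise a final incomplete block remains, which the proof absorbs into a remainder term.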
Then we have the decomposition of the risk. Using the Markov inequality and the definition of Π̃, we can state the following proposition.

Proposition 2. There exists C_0 > 0 such that the remainder term is bounded by C_0/n.

Now we have to bound the main term. The estimators Π̂_m are defined by minimization of the contrast on the set Γ defined in (6). Let us prove that this set Γ contains Ω; more precisely, we prove that ∆ ⊂ Γ. For T = Σ_λ a_λ ω_λ ∈ S_m, with A_m = (a_λ) the matrix of its coefficients in the basis (ω_λ), we obtain that any eigenvalue μ of G_m satisfies μ ≥ (2/3)f_0. So ∆ ⊂ Γ, and Π̂_m minimizes the contrast on ∆. We now observe that, for all functions T, S, the contrast difference can be expressed through the empirical process Z_{n,m}, where, for all m′, B_f(m, m′) = {T ∈ S_m + S_{m′}, ‖T‖_f = 1}. Now let p(·, ·) be a function such that, for all m, m′, 12p(m, m′) ≤ pen(m) + pen(m′). Then, using the definition of ∆ ⊃ Ω and Assumption H3, we obtain the desired bound. Now, by denoting E^X the expectation conditionally on X_1, ..., X_{n+1}, the process Z_{n,m}(T) can be split into several parts; by introducing functions p_1(·, ·), p_2(·, ·) and p_3(·, ·), we bound each supremum separately. We now use the following propositions.
A function T in S can be written T(x, y) = Σ_{j=J}^{m_0} Σ_{k,l} a_{jkl} ϕ_jk(x)ϕ_jl(y), where m_0 is such that S = S_{m_0}. For the sake of simplicity, we denote λ = (j, k) and λ′ = (j, k′). The corresponding lemma is proved in Baraud et al. (2001) for an orthonormal basis. On the set D, we will use the Bernstein inequality given in Birgé and Massart (1998).

Fig 1. Mean of the MISE for the six processes when n increases.