Inhomogeneous and Anisotropic Conditional Density Estimation from Dependent Data

Abstract: The problem of estimating a conditional density is considered. Given a collection of partitions, we propose a procedure that selects from the data the best partition among that collection and then provides the best piecewise polynomial estimator built on that partition. The observations are not supposed to be independent but only β-mixing; in particular, our study includes the estimation of the transition density of a Markov chain. For a well-chosen collection of possibly irregular partitions, we obtain oracle-type inequalities and adaptivity results in the minimax sense over a wide range of possibly anisotropic and inhomogeneous Besov classes. We end with a short simulation study.


Introduction
In this paper, we are concerned with conditional density estimation. Such a model brings more information than the well-studied regression model; for instance, it may reveal multimodality. Yet, references about conditional density estimation are rather scarce, even for nonadaptive procedures. For independent data, we can cite for instance Györfi and Kohler [GK07] for a histogram-based procedure, or Faugeras [Fau07] for a copula-based kernel estimator. For mixing data, De Gooijer and Zerom [DGZ03] and Fan and Yim [FY04] propose kernel methods. For Markov chains, nonadaptive estimation of the transition density is considered for instance in [Rou69, Bir83, DG83], and we also refer to [Lac07] for a more complete bibliography. But, in order to reach the optimal rate of convergence, those methods require the smoothness of the function to estimate to be known, so as to choose some tuning parameter adequately.
Adaptive estimators of the conditional density have only recently been proposed. For independent data, Efromovich [Efr07, Efr08] and Brunel, Comte and Lacour [BCL07] give oracle inequalities and adaptivity results in the minimax sense. Efromovich [Efr07, Efr08] uses a Fourier decomposition to build a blockwise-shrinkage Efromovich-Pinsker estimator, whereas Brunel et al. [BCL07] perform model selection based on a penalized least-squares criterion. Regarding dependent data, Clémençon [Clé00b] and Lacour [Lac07] study adaptive estimators of the conditional density for Markovian observations, the former via wavelet thresholding, the latter via model selection. Besides, the procedures proposed by [Efr07, Efr08, BCL07, Lac07] are all able to adapt to anisotropy; in other words, the conditional density to estimate is allowed to have unknown and different degrees of smoothness in each direction.
But the smoothness of the function to estimate may also vary spatially. If the risk of the estimator is measured via some L_q-norm, one way to take that inhomogeneous behaviour into account is to consider functions whose smoothness is measured in an L_p-norm, with p < q. Among the aforementioned references, only Clémençon [Clé00b] is able to cope with inhomogeneous smoothness.
imsart-ejs ver. 2010/09/07 file: RevdenscondinhomogeneEJS.tex date: October 14, 2011
In the simpler framework of density estimation, without conditioning variables, adaptation to inhomogeneity has been studied in the following works. Thresholding methods, in a univariate framework, are proposed by Hall, Kerkyacharian and Picard [HKP98] for independent data, Clémençon [Clé00a] for Markovian data, and Gannaz and Wintenberger [GW10] for a wide class of weakly dependent data. Piecewise polynomial selection procedures based on a penalized contrast have also been considered; they consist in selecting from the data a best partition and a best piecewise polynomial built on that partition. Thus, Comte and Merlevède [CM02] estimate the univariate density of absolutely regular stationary processes, in discrete or continuous time, selecting a best partition among the collection of all partitions of [0, 1] built on a thin regular grid via a least-squares criterion. Besides, three papers have lately considered density estimators inspired from the "multiresolution histogram" of Engel [Eng94, Eng97], or the "dyadic CART procedure" of Donoho [Don97]. Willett and Nowak [WN07] select best piecewise polynomials built on partitions into dyadic cubes via a penalized maximum likelihood contrast. Klemelä [Kle09] and Blanchard, Schäfer, Rozenholc and Müller [BSRM07] select best histograms based on partitions into dyadic rectangles via a penalized criterion based on the L_2-distance for the former, and on the Kullback-Leibler divergence for the latter. But all these procedures only reach optimal rates of convergence up to a logarithmic factor, and only [Kle09] is able to prove adaptivity both to anisotropy and inhomogeneity.
In this paper, we provide an estimator of the conditional density via a piecewise polynomial selection procedure based on an adequate least-squares criterion. To deal with the possible dependence of the observations, we mainly use β-mixing coefficients and their coupling properties. Thus, our dependence assumptions, while being satisfied by a wide class of Markov chains, are not restricted to Markovian frameworks. We first prove nonasymptotic oracle-type inequalities fulfilled by any collection of partitions satisfying some mild structural conditions. We then consider the collection of partitions into dyadic rectangles, as [Kle09] or [BSRM07]. We obtain oracle-type inequalities and adaptivity results in the minimax sense, without logarithmic factor, over a wide range of Besov smoothness classes that may contain functions with inhomogeneous and anisotropic smoothness, whether the data are independent or satisfy suitable dependence assumptions. The adaptivity of our procedure greatly relies on the approximation result proved in [Aka10]. Moreover, determining in practice the penalized estimator based on that collection only requires a computational complexity linear in the size of the sample.
This paper is organized as follows. We begin by describing the framework and the estimation procedure, and we present an evaluation of the risk on one model. This study allows us to understand what bound for the L_2-risk we seek to obtain. The choice of a penalty yielding an oracle-type inequality is the topic of Section 3.1. Section 3.2 is devoted to the collection of partitions into dyadic rectangles, and adaptivity results are proved for an adequate penalty. We show in Section 4 that all these results can be extended to dependent data. In Section 5, the practical implementation of our estimator is explained and some simulations are presented, both for independent and dependent data. Most proofs are deferred to Section 6.

Framework and estimation procedure
In this section, we define a contrast and deduce from it a collection of estimators ŝ_m. In order to understand which model m we should choose, we give an evaluation of the risk of each estimator ŝ_m. This allows us to define the penalized estimator.

Framework and notation
Let {Z_i}_{i∈Z} = {(X_i, Y_i)}_{i∈Z} be a strictly stationary process, where, for all i ∈ Z, X_i and Y_i take values respectively in [0, 1]^{d_1} and [0, 1]^{d_2}, with d_1 and d_2 positive integers, and set d = d_1 + d_2. We assume that the variables (X_i)_{i∈Z} admit a bounded marginal density f with respect to the Lebesgue measure. Given some integer n ≥ 2, our aim is to estimate, on the basis of the observation of (Z_1, . . ., Z_n), the density s of Y_i conditionally to X_i. Thus, our parameter of interest s is the real-valued function of d variables such that, for all x ∈ [0, 1]^{d_1}, s(x, ·) is the density of Y_i given X_i = x. In particular, if (X_i)_{i∈Z} is a homogeneous Markov chain of order 1, and Y_i = X_{i+1} for all i ∈ Z, then s is the transition density of the chain (X_i)_{i∈Z}.
Let us introduce some standard notation. For any real-valued function t defined and bounded on some set D, we set ‖t‖_∞ = sup_{x∈D} |t(x)|. We denote by L_2([0, 1]^{d_1} × [0, 1]^{d_2}) the set of all real-valued functions that are square integrable with respect to the Lebesgue measure. Since f is bounded, we can also define on L_2([0, 1]^{d_1} × [0, 1]^{d_2}) the semi-scalar product ⟨t, u⟩_f = ∫∫ t(x, y) u(x, y) f(x) dx dy and the associated semi-norm ‖·‖_f.

Contrast and estimator on one model
In order to estimate the conditional density s, we consider the empirical criterion γ described in [BCL07] and defined on L_2([0, 1]^{d_1} × [0, 1]^{d_2}) by

γ(t) = (1/n) Σ_{i=1}^{n} [ ∫_{[0,1]^{d_2}} t²(X_i, y) dy − 2 t(X_i, Y_i) ].

Due to the nature of the function to estimate, the contrast used here borrows both from the classical regression and density least-squares contrasts. This contrast verifies E_s[γ(t)] = ‖t − s‖_f² − ‖s‖_f², so minimizing the expected contrast amounts to minimizing ‖t − s‖_f². Thus, a natural way to build an estimator of s consists in minimizing γ over some subset of L_2([0, 1]^{d_1} × [0, 1]^{d_2}), which we choose here as a space of piecewise polynomial functions with degree smaller than a given nonnegative integer r. More precisely, for a partition m of [0, 1]^{d_1} × [0, 1]^{d_2} into rectangles, we denote by S_m the space of all real-valued piecewise polynomial functions on [0, 1]^{d_1} × [0, 1]^{d_2} which are polynomial with coordinate degree ≤ r on each rectangle of m. We define a best estimator of s with values in the model S_m by setting ŝ_m = argmin_{t∈S_m} γ(t).
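To make the contrast concrete, here is a minimal numerical sketch (not the authors' code) of γ and of its explicit minimizer on a histogram model, i.e. the case r = 0 on a regular partition; the sample, the grid sizes and all function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample: Y_i = X_i + noise, both kept inside [0, 1).
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = np.clip(x + 0.1 * rng.standard_normal(n), 0.0, 1.0 - 1e-9)

def contrast(coef, edges_x, edges_y, x, y):
    """Empirical least-squares contrast
    gamma(t) = (1/n) sum_i [ int t^2(X_i, u) du - 2 t(X_i, Y_i) ]
    for a piecewise-constant t taking value coef[a, b] on cell (a, b)."""
    n = len(x)
    ia = np.clip(np.searchsorted(edges_x, x, side="right") - 1, 0, coef.shape[0] - 1)
    ib = np.clip(np.searchsorted(edges_y, y, side="right") - 1, 0, coef.shape[1] - 1)
    wy = np.diff(edges_y)                 # Lebesgue measure of each y-bin
    quad = (coef**2 @ wy)[ia].sum() / n   # (1/n) sum_i int t^2(X_i, u) du
    lin = coef[ia, ib].sum() / n          # (1/n) sum_i t(X_i, Y_i)
    return quad - 2.0 * lin

def histogram_estimator(edges_x, edges_y, x, y):
    """Explicit minimizer of the contrast over the histogram model:
    on A x B, the coefficient is #{(X_i, Y_i) in A x B} / (|B| #{X_i in A})."""
    counts, _, _ = np.histogram2d(x, y, bins=[edges_x, edges_y])
    nx = counts.sum(axis=1, keepdims=True)
    wy = np.diff(edges_y)
    with np.errstate(invalid="ignore", divide="ignore"):
        coef = np.where(nx > 0, counts / (wy[None, :] * nx), 0.0)
    return coef

edges_x = np.linspace(0, 1, 9)   # regular 8 x 8 partition: a model S_m with r = 0
edges_y = np.linspace(0, 1, 9)
shat = histogram_estimator(edges_x, edges_y, x, y)

# The explicit minimizer beats any perturbed candidate of the same model.
print(contrast(shat, edges_x, edges_y, x, y)
      < contrast(shat + 0.3, edges_x, edges_y, x, y))
```

For each x-cell A met by the data, ŝ_m(x, ·) integrates to 1 in y, as a conditional density estimator should.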
An explicit formula for computing ŝm is given in Section 5.

Risk on one model
In this subsection, we fix some partition m of [0, 1]^{d_1} × [0, 1]^{d_2} into rectangles and give an upper bound for the risk of ŝ_m when Z_1, . . ., Z_n are independent.
As for all the theorems stated in the sequel, we evaluate that risk in the random semi-norm ‖·‖_n naturally associated to our problem, defined, for all t ∈ L_2([0, 1]^{d_1} × [0, 1]^{d_2}), by

‖t‖_n² = (1/n) Σ_{i=1}^{n} ∫_{[0,1]^{d_2}} t²(X_i, y) dy

(remember that our problem is a mixture of regression in the x-direction and of density estimation in the y-direction). However, it is also possible to control the classical L_2-norm, using a truncated estimator (see, for instance, Corollary 3.2 in Section 3). Besides, for any partition m′ of a unit cube into rectangles, we denote by |m′| the number of rectangles in m′ and say that the partition m′ is regular if all its rectangles have the same dimensions. For the risk of the estimator ŝ_m, we can prove the following result.

Proposition 2.1. Let s_m be the orthogonal projection of s on S_m for the norm ‖·‖, and let D_m denote the dimension of S_m, so that D_m = (r + 1)^d |m|. Assume that s and f are bounded, and that f is also bounded from below by a positive constant. If the variables Z_1, . . ., Z_n are independent, then

E_s[‖s − ŝ_m‖_n²] ≤ C ( ‖s − s_m‖_f² + ‖s‖_∞ D_m / n ),

where C only depends on r, d, ι(f), ‖f‖_∞, ‖s‖_∞.
We recover approximately, in the upper bound stated in Proposition 2.1, the usual decomposition into a squared bias term, of order ‖s − s_m‖_f², and a variance term of order ‖s‖_∞ D_m/n, proportional to the dimension of the model S_m. A major interest of such a bound is that it allows us to understand how to build an optimal estimator from the minimax point of view. Let us first recall that when s belongs to classical classes of functions with isotropic smoothness σ (isotropic Besov classes for instance), a minimax estimator over such a class reaches the estimation rate n^{−2σ/(2σ+d)}. Roughly speaking, when s belongs to a well-chosen class of isotropic functions with smoothness σ measured in an L_p-norm with p ≥ 2, the bias term ‖s − s_m‖_f² is at most of order D_m^{−2σ/d} for any regular partition m into cubes. If we knew at least the smoothness parameter σ, we could choose some regular partition m_opt(σ) into cubes realizing a good compromise between the bias and the variance terms, i.e. such that D_{m_opt(σ)}^{−2σ/d} and D_{m_opt(σ)}/n are of the same order. We would then obtain with ŝ_{m_opt(σ)} an estimator that reaches the optimal estimation rate n^{−2σ/(2σ+d)} whatever p ≥ 2. But when s has isotropic smoothness σ measured in an L_p-norm with p < 2, one can only ensure that the bias term ‖s − s_m‖_f² is at most of order D_m^{−2σ/d} for some irregular partition m into cubes that does not only depend on σ, but must be adapted to the inhomogeneity of s over the unit cube (see for instance Section 3.2 and [Aka10]). Thus, there exists some well-chosen irregular partition m_opt(s) into cubes such that D_{m_opt(s)} is of order n^{d/(2σ+d)}, but the resulting estimator reaches the estimation rate n^{−2σ/(2σ+d)} only at s, and probably not on the whole class of functions with smoothness σ in an L_p-norm with p < 2. Last, if s has anisotropic smoothness, similar properties still hold, with partitions into rectangles (regular or not, depending on the homogeneity of s) whose dimensions are adapted to the anisotropy of s.
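The bias-variance compromise described above can be illustrated numerically; the sketch below minimizes the proxy risk D^{−2σ/d} + D/n over D and compares the minimizer with n^{d/(2σ+d)}. The constants c_bias and c_var are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

def oracle_tradeoff(n, sigma, d, c_bias=1.0, c_var=1.0):
    """Minimize the proxy risk c_bias * D**(-2*sigma/d) + c_var * D / n over
    integer dimensions D, mimicking the bias/variance compromise."""
    D = np.arange(1, n + 1)
    risk = c_bias * D ** (-2.0 * sigma / d) + c_var * D / n
    i = int(np.argmin(risk))
    return int(D[i]), float(risk[i])

n, sigma, d = 10_000, 1.0, 2
D_opt, r_opt = oracle_tradeoff(n, sigma, d)
print(D_opt, n ** (d / (2 * sigma + d)))          # D_opt of order n^{d/(2s+d)} = 100
print(r_opt, n ** (-2 * sigma / (2 * sigma + d))) # risk of order n^{-2s/(2s+d)}
```

With σ = 1 and d = 2, the balance D^{−1} ≈ D/n gives D_opt ≈ √n and a risk of order n^{−1/2}, as announced.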

Penalized estimator
We give ourselves a finite collection M of partitions of [0, 1]^{d_1} × [0, 1]^{d_2} into rectangles. The aim is to choose the best estimator among the collection {ŝ_m}_{m∈M} without any assumption on the smoothness of s. To do so, we use the model selection method introduced by [BBM99], which allows us to select an estimator from the data only, by minimizing a penalized criterion. Thus, we consider the random selection procedure

m̂ = argmin_{m∈M} { γ(ŝ_m) + pen(m) }

and the penalized estimator s̃ = ŝ_{m̂}, where pen : M → R_+ is a so-called penalty function that remains to be chosen so that s̃ performs well. The choice of the collection of partitions M is discussed in the next section. The practical implementation of the penalized estimator based on the collection of partitions into dyadic rectangles is described in Section 5.
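As an illustration of the selection procedure, the following sketch selects, among regular k × k partitions only (a much poorer collection than M, kept for brevity), the one minimizing γ(ŝ_m) + pen(m) with a penalty proportional to D_m/n; the data, the value of kappa and the restriction to regular partitions are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(size=n)
y = np.clip(0.5 + 0.25 * np.sin(6 * x) + 0.1 * rng.standard_normal(n), 0, 1 - 1e-9)

def gamma_hat(k):
    """Least-squares contrast gamma(s_hat_m) of the histogram estimator on the
    regular k x k partition of the unit square (degree r = 0, D_m = k**2)."""
    edges = np.linspace(0, 1, k + 1)
    counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    nx = counts.sum(axis=1, keepdims=True)
    w = 1.0 / k                                    # Lebesgue measure of a y-bin
    coef = np.divide(counts, w * nx, out=np.zeros_like(counts), where=nx > 0)
    ia = np.minimum((x * k).astype(int), k - 1)    # x-cell of each observation
    ib = np.minimum((y * k).astype(int), k - 1)    # y-cell of each observation
    quad = ((coef**2).sum(axis=1) * w)[ia].sum() / n
    return quad - 2.0 * coef[ia, ib].sum() / n

kappa = 2.0                                        # penalty constant, to calibrate
models = range(1, 17)
crit = {k: gamma_hat(k) + kappa * k**2 / n for k in models}
k_hat = min(crit, key=crit.get)                    # selected partition
print(k_hat)
```

The penalty term kappa * k**2 / n plays the role of pen(m) ∝ D_m/n; without it, the criterion would always pick the finest partition.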

Main result
In this section, we study the risk of the penalized estimator s̃ for independent data, first with a general collection of partitions, then with a relevant choice of collection that ensures the optimal estimation of a possibly inhomogeneous and anisotropic function s.

Oracle inequality
Ideally, we would like to choose a penalty pen such that the penalized estimator s̃ is almost as good as the best estimator in the collection {ŝ_m}_{m∈M}, in the sense that

E_s[‖s − s̃‖_n²] ≤ C inf_{m∈M} E_s[‖s − ŝ_m‖_n²]   (3.1)

for some positive constant C. Theorem 3.1 below suggests a form of penalty yielding an inequality akin to

E_s[‖s − s̃‖_n²] ≤ C inf_{m∈M} { ‖s − s_m‖_f² + D_m/n }.   (3.2)

Yet, as recalled in the previous section, for each m ∈ M, E_s[‖s − ŝ_m‖_n²] is expected to be of order ‖s − s_m‖_f² + D_m/n. So, Inequality (3.2) is expected to be almost as good as Inequality (3.1). In order to deal with a large collection M that may contain irregular partitions, we only impose a minor structural condition on M. That assumption ensures that all the models are included in a biggest model, without imposing that the models be nested as in [BCL07]. We also assume that s and f are bounded.
Assumption (P1) All the partitions in the collection M are built on a common regular partition m⋆ of [0, 1]^d into cubes, whose number |m⋆| is suitably bounded in terms of n.
We establish an oracle-type inequality for a very general collection of partitions. Thus we state the following model selection theorem.
Theorem 3.1. Let M be a collection of partitions satisfying Assumption (P1) and {L_m}_{m∈M} be a family of reals greater than or equal to 1, that may depend on n, satisfying Condition (3.3), which requires the weights L_m to be large enough for Σ_{m∈M} exp(−L_m D_m) to remain bounded. Assume that (Z_i)_{1≤i≤n} are independent and that s, f satisfy Assumption (B). If the penalty is large enough, namely at least of order L_m D_m/n for all m ∈ M, up to a factor involving ‖s‖_∞, ι(f) and some large enough positive absolute constant κ, then the oracle-type inequality (3.4), commented on below, holds. Theorem 3.1 is only proved in its general version for dependent data (see Theorem 4.1 in Section 4.3).
The penalty contains unknown terms, but in practice ‖s‖_∞ and ι(f) can be replaced with an estimator, as in [BM97] (Proposition 4) for instance, and κ is calibrated via a simulation study. To state a result with the precise replacement, we choose m•_1 and m•_2 regular partitions of [0, 1]^{d_1} and [0, 1]^{d_2} into cubes, and set m• = m•_1 × m•_2, where F_{m•_1} is the space of all functions on [0, 1]^{d_1} which are polynomial with coordinate degree ≤ r on each rectangle of m•_1, and we estimate ‖s‖_∞ and ι(f) accordingly. We also impose Besov-type smoothness assumptions on f and s. For σ = (σ_1, . . ., σ_d) ∈ (0, r + 1)^d, R > 0, p > 0, we refer to [Tri06] (Chapter 5) for a definition of the anisotropic Besov space B^σ_{pp′} and the associated norm ‖·|B^σ_{pp′}‖, and we introduce the anisotropic Besov balls B(σ, p, R), where p′ = ∞ if 0 < p ≤ 1 or p ≥ 2, and p′ = p if 1 < p < 2. We recall that, due to the continuous embeddings stated for instance in [Tri06], B^σ_{p∞} contains all the spaces B^σ_{pp′}, for p′ > 0, so our choice of p′ in the definition of B(σ, p, R) is the least stringent one for 0 < p ≤ 1 or p ≥ 2. Last, we set σ̲ = min_{1≤l≤d} σ_l and denote by H(σ) the harmonic mean of σ_1, . . ., σ_d, i.e. d/H(σ) = Σ_{l=1}^{d} 1/σ_l.
Corollary 3.1. Assume that s ∈ B(σ, p, R) and f ∈ B(α, p, R_1), and that the penalty is as above with some large enough positive constant κ. Then, under the assumptions of Theorem 3.1, and for n large enough, Inequality (3.4) holds. We omit the proof, since it exactly follows that of Theorem 12 in [Lac07]. The smoothness conditions arise from the control of ‖s − s_{m•}‖_∞ and ‖f − f_{m•_1}‖_∞, for which we use the results of [Aka10] (Lemma 2). It should be noticed that m• may differ from m⋆. In particular, it may be chosen less fine than m⋆, so as to have better estimates of ‖s‖_∞ and ι(f).
Let us now comment on Inequality (3.4), which is similar to (3.2) up to the factor C_1, which does not depend on n, and the factor max_{m∈M} L_m². We have already explained that we need irregular partitions to estimate inhomogeneous functions. However, irregular partitions often form too rich a collection. If L_m only depends on D_m, Condition (3.3) means that the weights L_D have to be large enough to balance the number of models of the same dimension D: if the number of models per dimension is high, the L_m's have to be high too. For instance, [BM97] use weights (L_m)_{m∈M} of order log(n) to ensure Condition (3.3), which spoils the rates of convergence. We describe in the next section an interesting collection of partitions for which the factor max_{m∈M} L_m² can be bounded by a constant, although the collection is rich enough to have good approximation qualities with respect to functions of inhomogeneous smoothness.
Let us mention that we can define an estimator s̃* for which we can control the risk associated to the L_2-norm ‖·‖ instead of ‖·‖_n.
Corollary 3.2. Define s̃* = s̃ 1_{‖s̃‖ ≤ n}. Then, under the assumptions of Theorem 3.1, a similar oracle-type inequality holds for the L_2-risk of s̃*. The proof exactly follows the proof of Theorem 4 in [Lac07] and is therefore omitted. (The idea is the following: when ‖s̃‖ ≤ n, the result is already proved; and the probability of the complementary event is low enough to become a remainder term.) Then all the following results (Theorems 2-5) can be stated for the L_2-norm ‖·‖, replacing s̃ by s̃* = s̃ 1_{‖s̃‖ ≤ n}.

The penalized estimator based on dyadic partitions
Let us describe the particular collection of partitions that we use here. We call dyadic rectangle of [0, 1]^d any set of the form I_1 × . . . × I_d where, for all 1 ≤ l ≤ d, I_l is a dyadic interval of the form [(k_l − 1)2^{−j_l}, k_l 2^{−j_l}], with j_l ∈ N and k_l ∈ {1, . . ., 2^{j_l}}. In other words, a dyadic rectangle of [0, 1]^d is defined as a product of d dyadic intervals of [0, 1] that may have different lengths. We consider the collection of partitions of [0, 1]^d into dyadic rectangles with sidelength ≥ 2^{−J⋆}, where J⋆ is a nonnegative integer chosen according to Proposition 3.1 below. We denote by M_rect such a collection of partitions. Let us underline that a partition of M_rect may be composed of rectangles with different Lebesgue measures, as illustrated by Figure 1. For such a collection, we obtain, as a straightforward consequence of Theorem 3.1, that the estimator s̃ is almost as good as the best estimator in the collection {ŝ_m}_{m∈M_rect}.
Proposition 3.1. The notation is that of Theorem 3.1 and Assumption (B) is supposed to be fulfilled. Let J⋆ be as above and let pen be given on M_rect by a penalty proportional to D_m/n, where κ is some positive absolute constant. If κ is large enough, then Inequality (3.6) holds, where C_2 is a positive real that depends on κ, r, and on s and f.

Proof. Let D be a positive integer. Building a partition of [0, 1]^d into D dyadic rectangles amounts to choosing a vector (l_1, . . ., l_{D−1}) ∈ {1, . . ., d}^{D−1} of cutting directions and growing a binary tree with root corresponding to [0, 1]^d and with D leaves. For instance, the partition of [0, 1]^2 represented in Figure 1 can be described by the binary tree structure represented in Figure 2, together with the sequence of cutting directions (2, 1, 2, 2), where 1 stands for a vertical cut and 2 stands for a horizontal cut. Since the number of binary trees with D leaves is given by the Catalan number

(1/D) (2(D−1) choose D−1)

(see for instance [Sta99]), the number of such partitions is at most (4d)^D. Therefore, Condition (3.3) is fulfilled for weights L_m all equal to the same large enough constant. Inequality (3.6) is then a straightforward consequence of Theorem 3.1.

Figure 2: Binary tree labeled with the sequence of cutting directions (2, 1, 2, 2), corresponding to the dyadic partition represented in Figure 1.
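The counting argument in the proof can be checked numerically: the number of binary trees with D leaves is the Catalan number C_{D−1}, and combined with the d^{D−1} choices of cutting directions this stays below (4d)^D. A small sketch (the function name is ours):

```python
from math import comb

def num_tree_partitions(D, d):
    """Number of (binary tree, cutting directions) descriptions of a partition
    of [0,1]^d into D dyadic rectangles: the Catalan number
    C_{D-1} = (1/D) * binom(2(D-1), D-1), times d**(D-1) direction choices.
    This is an upper bound on the number of distinct partitions."""
    catalan = comb(2 * (D - 1), D - 1) // D
    return catalan * d ** (D - 1)

print(num_tree_partitions(5, 2))  # C_4 * 2^4 = 14 * 16 = 224
for D in range(1, 15):
    print(D, num_tree_partitions(D, 2) <= (4 * 2) ** D)
```

Since C_{D−1} ≤ 4^{D−1}, the bound num_tree_partitions(D, d) ≤ (4d)^D is immediate, which is what makes constant weights L_m admissible in Condition (3.3).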
We are now able to compute estimation rates for the penalized estimator based on the collection M_rect over the anisotropic Besov balls defined by (3.5), by combining Proposition 3.1 with the approximation results of [Aka10] (Proposition 2 and Theorem 2). Let q(σ, d, p) be the index defined from σ, d and p, where (x)_+ stands for the positive part of a real x. Contrary to [Kle09], we have chosen a parameter J⋆ that does not depend on the unknown smoothness of s, hence the factor σ̲/H(σ) in the above definition. That factor, which is at most 1, with equality only in the isotropic case, may be interpreted as an index measuring the lack of isotropy. We assume that q(σ, d, p) > 1, which is equivalent to a condition involving λ = σ̲/H(σ). Thus, if q(σ, d, p) > 1, then H(σ)/d > 1/p, so B(σ, p, R) only contains continuous functions, which are uniformly bounded by C(σ, r, d, p)R.
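The quantities H(σ) and σ̲/H(σ) are easy to compute; the following sketch (with an arbitrary anisotropic example σ = (0.5, 2)) checks that the lack-of-isotropy index is at most 1, with equality in the isotropic case.

```python
def harmonic_mean(sigma):
    """H(sigma) = d / sum_l (1/sigma_l), the harmonic mean of the smoothness
    parameters; min(sigma) <= H(sigma) <= max(sigma)."""
    d = len(sigma)
    return d / sum(1.0 / s for s in sigma)

sigma = (0.5, 2.0)             # anisotropic smoothness in each direction
H = harmonic_mean(sigma)       # 2 / (2 + 0.5) = 0.8
lam = min(sigma) / H           # lack-of-isotropy index, always <= 1
print(H, lam)
```

In the isotropic case sigma = (s, ..., s), H(σ) = s and the index equals 1, matching the remark above.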
Theorem 3.2. The notation is that of Theorem 3.1 and Proposition 3.1, and the assumptions are those of Proposition 3.1. Let p > 0 and σ ∈ (0, r + 1)^d. The rate (R n^{−H(σ)/d})^{2d/(d+2H(σ))} is the minimax one, given the lower bounds proved in [Lac07] for transition density estimation of a Markov chain. We are able to reach that rate not only for functions with homogeneous smoothness, i.e. for p ≥ 2, as [Lac07], but also for functions with inhomogeneous smoothness, i.e. for 0 < p < 2, which is impossible with the collection of regular models considered in [Lac07]. Besides, let us underline that, among the references cited in the introduction, only [Kle09] can deal simultaneously with anisotropy and inhomogeneous smoothness. Theorem 3.2 improves on [Kle09] by allowing to approximately reach the minimax risk up to a factor that does not depend on n, and by considering smoothness parameters possibly larger than 1.

Dependent data
We now show that the previous results can be extended to dependent variables. The case of a Markov chain is of particular interest: if (X_i)_{i∈Z} is a homogeneous Markov chain of order 1, and Y_i = X_{i+1} for all i ∈ Z, then s is the transition density of the chain (X_i)_{i∈Z}.

Definitions and notation
Let us introduce the notions of dependence used in the sequel. For two sub-σ-fields A and B of F, the β-mixing (or absolute regularity) coefficient is defined by

β(A, B) = (1/2) sup Σ_{i,j} |P(A_i ∩ B_j) − P(A_i)P(B_j)|,

the supremum being taken over all finite partitions (A_i) and (B_j) of Ω that are respectively A- and B-measurable, and the ρ-mixing (or maximal correlation) coefficient by ρ(A, B) = sup |Corr(X, Y)|, where the supremum is taken over all real-valued random variables X and Y that are respectively A- and B-measurable and square integrable. We recall that β- and ρ-mixing are among the weakest forms of mixing conditions, in the sense that both are implied by φ-mixing (uniform mixing) and imply α-mixing (see for instance [Dou94]). Besides, in general, ρ-mixing does not imply β-mixing, and β-mixing does not imply ρ-mixing. In the sequel, the letter θ stands for β or ρ. For every positive integer j, let θ_j^Z denote the corresponding mixing coefficient between the past σ-field σ(Z_i, i ≤ 0) and the future σ-field σ(Z_i, i ≥ j). The process (Z_i)_{i∈Z} is said to be θ-mixing when lim_{j→+∞} θ_j^Z = 0. In particular, (Z_i)_{i∈Z} is geometrically θ-mixing with rate b, b > 0, if there exists a positive constant a such that, for all positive integers j, θ_j^Z ≤ a exp(−bj). We shall also use the 2-mixing coefficients θ(σ(Z_0), σ(Z_j)), which satisfy, for all j ≥ 1, θ(σ(Z_0), σ(Z_j)) ≤ θ_j^Z.

Dependence assumptions
We consider the following dependence assumptions. Except for the last one, they are related to some rate of mixing. In each case, we also define a real ϑ, that may vary according to the dependence assumption, and will appear in the penalty proposed in the following section.
Assumption (Dβ_cond) Assumption (Dβ) is satisfied and, in addition, for all j ≥ 2, Z_j is independent of Z_1 conditionally to X_j. We then set ϑ = 1 and δ = 0.
Let us give sufficient conditions for (Z_i)_{i∈Z} to be θ-mixing. First, if (X_i)_{i∈Z} is a strictly stationary θ-mixing process, then so is (Z_i)_{i∈Z} = (X_i, Y_i)_{i∈Z} in the Markovian case Y_i = X_{i+1}.

Main result
All the results of Section 3 can be extended to the case of dependent data, under slightly more restrictive conditions on the thinnest partition.

Theorem 4.1. Assume that (Z_i)_{i∈Z} satisfies Assumption (Dβ) and that s, f satisfy Assumption (B). If the penalty satisfies, for all m ∈ M, the analogue of the condition of Theorem 3.1, for some large enough positive absolute constant κ (where b, δ and ϑ are defined in the dependence assumptions), then the corresponding oracle-type inequality holds, where C_3 is a positive constant that depends on κ, ϑ, δ, a, b, r, and on s and f.

Under Assumptions (Dβρ) and (Dβ2-ρ), the price to pay for avoiding the logarithmic factor despite the dependence of the data is the presence of the term ϑ in the penalty. For practical purposes, it is necessary to include this term in the constant κ to calibrate. Notice that under Assumption (Dβ_cond), for instance when we estimate the transition density of a Markov chain, the logarithmic factor still disappears and ϑ = 1, so that the penalty is almost as simple as in the independent case. Actually, it is possible to consider an arithmetic β-mixing instead of a geometric one. In this case, it is necessary to slightly strengthen Assumption (P2), assuming rather |m⋆|² ≤ n^{1−ζ}, with ζ a number in (0, 1). Then, if β_q ≤ a q^{−b} with b > 5/ζ − 2, Theorem 4.1 is still valid in the cases where δ = 0 (ρ-mixing and conditional independence). The penalty is identical, except that the term b² is removed. The proof is the same as for the original statement (see Subsection 6.3), but with q_n = ⌊n^ξ⌋, where ξ ∈ ((5 − ζ)/(2 + 2b), ζ/2).
Then, for our penalized estimator based on partitions into dyadic rectangles described in Section 3.2, we can state the following theorem.
Theorem 4.2. The notation is that of Theorem 4.1, and Assumption (B) is supposed to be fulfilled. Let J⋆ be chosen as in Proposition 3.1. Let p > 0 and σ ∈ (0, r + 1)^d. If q(σ, d, p) > 1, then there exists some positive real C(σ, r, d, p), that only depends on σ, r, d, p, such that sup_{s∈B(σ,p,R)} E_s[‖s − s̃‖_n²] ≤ C(σ, r, d, p) (R n^{−H(σ)/d})^{2d/(d+2H(σ))}.

Remarks on the dependence assumptions
We can wonder whether weaker dependence assumptions could be used. Another dependence assumption is used for instance by [Bos98] (Theorem 2.1) to prove that, asymptotically, the quadratic risk of kernel density estimators reaches the minimax rate (see also [CM02]). But we can prove (see [Aka09]) that this assumption is much stronger than Assumption (Dβ2-ρ), which is enough for obtaining the optimal estimation rate from the minimax point of view.
It is difficult to bound the risk of s̃ under weaker dependence assumptions, but it is possible to weaken the assumptions so as to bound the risk E[‖s − ŝ_m‖_n²] for one model. In [Aka09], a version of Proposition 2.1 is proved under assumptions of geometric α-mixing. Actually, a sufficient condition to ensure that E_s[‖ŝ_m − s_m‖_n²] is of the same order as in the independent case is that the variance inequality (4.2) holds for some constant C and all t ∈ S_m. Assumptions (Dβρ) and (Dβ2-ρ) are optimal for obtaining such an inequality in the following sense. Let us assume that (Z_i)_{i∈N} is a strictly stationary, Harris ergodic and reversible Markov chain satisfying (4.2) for all real-valued functions t defined on [0, 1]^d. Then the chain is variance bounding in the sense of [RR08], which implies that there is a spectral gap in L_2(sf) := {t : [0, 1]^d → R s.t. ⟨t, sf⟩ = 0 and ‖t‖_{sf} < ∞} (Theorem 14 in [RR08]). This leads to the geometric ergodicity of the chain (Theorem 2.1 in [RR97]), which, given the reversibility assumption, implies that the chain is ρ-mixing. As a conclusion, a strictly stationary, Harris ergodic and reversible Markov chain (Z_i)_{i∈Z} satisfies (4.2) for all real-valued functions t defined on [0, 1]^d if and only if it is ρ-mixing.
Thanks to Formula (5.2), one can compute γ(ŝ_m) explicitly for all m ∈ M_rect. We shall consider a penalty pen of the form pen(m) = c D_m/n, where c is some positive constant, as in Theorem 4.1. With such a penalty, m̂ is given by a minimization of a criterion that is additive over the rectangles of the partition: m̂ = argmin_{m∈M_rect} Σ_{K∈m} crit(K), where, for every rectangle K, crit(K) denotes the local criterion associated to K. That characterization allows us to determine m̂ without having to compute all the estimators of the collection {ŝ_m}_{m∈M_rect}. Indeed, we can for instance adapt to our estimation framework the algorithm proposed by [Don97], which requires a computational complexity of order 2^{dJ⋆}. Thus, choosing 2^{dJ⋆} at most of order n, which allows for a larger choice of J⋆ than prescribed by our theoretical results (cf. Proposition 3.1 and Theorem 4.2), the computational complexity is at most linear in the number of observations. Let us also mention that the algorithm proposed by [BSRM07] allows for instance the slightly larger choice J⋆ = ⌊log(n)⌋, which does not depend on d, while keeping an almost linear computational complexity, of order n d log^{d+1}(n).
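The recursive structure exploited by the algorithm of [Don97] can be sketched as follows: each dyadic rectangle is either kept as one piece or split at its midpoint in one of the directions, and the best penalized criterion is computed by memoized recursion. Everything below (the sample, the histogram leaf cost derived from the contrast with r = 0, and the penalty constant pen0 = 2/n per rectangle) is an illustrative assumption, not the authors' implementation.

```python
from functools import lru_cache
import numpy as np

rng = np.random.default_rng(2)
n, J = 2000, 4                      # 2^J x 2^J finest dyadic grid (role of m_star)
x = rng.uniform(size=n)
y = np.clip(x + 0.05 * rng.standard_normal(n), 0, 1 - 1e-9)

G = 1 << J
cnt = np.histogram2d(x, y, bins=[np.linspace(0, 1, G + 1)] * 2)[0]
cum = np.zeros((G + 1, G + 1))
cum[1:, 1:] = cnt.cumsum(0).cumsum(1)          # 2D prefix sums: O(1) cell counts
col = cnt.sum(axis=1).cumsum()
colsum = lambda i0, i1: col[i1 - 1] - (col[i0 - 1] if i0 else 0.0)

def leaf_cost(i0, i1, j0, j1, pen0):
    """Contribution of rectangle [i0,i1) x [j0,j1) (grid units) to the penalized
    contrast when kept as one piece, with the optimal histogram coefficient:
    -N_K^2 / (n |B| N_A) + pen0."""
    N_K = cum[i1, j1] - cum[i0, j1] - cum[i1, j0] + cum[i0, j0]
    N_A = colsum(i0, i1)
    lebB = (j1 - j0) / G
    fit = 0.0 if N_A == 0 else -N_K**2 / (n * lebB * N_A)
    return fit + pen0

@lru_cache(maxsize=None)
def best(i0, i1, j0, j1):
    """Dyadic-CART recursion: keep the rectangle, or split it at its midpoint
    in one of the two directions, whichever yields the smaller criterion."""
    c = leaf_cost(i0, i1, j0, j1, pen0=2.0 / n)
    if i1 - i0 > 1:
        m = (i0 + i1) // 2
        c = min(c, best(i0, m, j0, j1) + best(m, i1, j0, j1))
    if j1 - j0 > 1:
        m = (j0 + j1) // 2
        c = min(c, best(i0, i1, j0, m) + best(i0, i1, m, j1))
    return c

print(best(0, G, 0, G))             # penalized criterion of the selected partition
```

Each dyadic rectangle is visited once thanks to the memoization, which is what keeps the overall complexity of order 2^{dJ⋆} rather than proportional to the (exponentially large) number of partitions.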
We propose a simulation study based on the four following examples.

Example 1. Here (X_i)_{1≤i≤n} are i.i.d. Gaussian variables with mean 6 and variance 4/3, and (ǫ_i)_{1≤i≤n} are i.i.d. standard Gaussian variables, independent of the X_i's.

Example 2. Here (X_i)_{1≤i≤n} are i.i.d., uniformly distributed over [−6, 6], and (ǫ_i)_{1≤i≤n} are i.i.d. standard Gaussian variables, independent of the X_i's.

Example 3. Let β(., a, b) be the density of the β distribution with parameters a and b. Here (X_i)_{1≤i≤n} are i.i.d., uniformly distributed in [0, 1], (ǫ_i)_{1≤i≤n} are i.i.d. standard Gaussian variables, independent of the X_i's, and g is the density of a variable defined from N_1 and N_2, where N_1 is Gaussian with mean 1/2 and standard deviation 1/6, N_2 is Gaussian with mean 3/4 and standard deviation 1/18, and N_1 and N_2 are independent.

Each model is of the above form, where ǫ_i is a standard Gaussian variable, so the conditional density of Y_i given X_i can be written in terms of φ, the density of ǫ_1. Besides, this allows us to consider Markovian counterparts of Examples 1 to 4, which we will call Example 1 (Markov), . . ., Example 4 (Markov). More precisely, we also estimate the transition density of the Markov chain (X_i)_{i≥1} that satisfies the same recursion, with X_1 following the stationary distribution of the chain. Thus, for Example 1 (Markov), X_1 has the same distribution as in Example 1, but in the other examples, the distribution of X_1 differs between the independent and the Markovian cases. In practice, we simulate the chain long enough so that it finally reaches the stationary regime. We estimate s respectively on [4, 8]² for Example 1, [−6, 6]² for Example 2, and [0, 1]² for Examples 3 and 4, both for independent and Markovian data. The four conditional densities are represented on these rectangles in Figure 5. We may say that the first two examples are rather homogeneous functions, whereas the last two are rather inhomogeneous. We implement s̃ for r = 0 and choose the following parameters. The supremum norm of s is estimated by ‖ŝ_{m•}‖_∞, where m• is the regular partition of [0, 1]² into cubes with sidelength 2^{−J•}. We select a best partition among those into dyadic rectangles with sidelength ≥ 2^{−J⋆}, with 2^{J⋆} as close as possible to √n. For n = 250, we set J• = 2 and J⋆ = 4, and for n = 1000, we set J• = 3 and J⋆ = 5. Let us denote by s̃(c) the penalized estimator obtained with the penalty (5.3) for the penalty constant c. For the sample sizes n = 250 and n = 1000, we give the estimated risks respectively in Tables 1 and 2, where the minimum is obtained by varying c from 0 to 4 by steps of 0.1. All these quantities have been estimated over 100 simulations.

Table 1: Results for n = 250 data and 100 simulations.

Besides, for Example 3, we represent in Figure 5 the selected partition for one simulation with 1000 independent data and the penalty constant c = 3. That partition is both anisotropic and inhomogeneous and well adapted to the function, which illustrates the interest of allowing nonregular and nonisotropic partitions in our selection procedure. Just below, we represent two sections of that conditional density (dark line) together with the corresponding sections of s̃(3).
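The reported choices of J• and J⋆ are consistent with taking 2^{J⋆} as close as possible to √n; the offset J• = J⋆ − 2 in the sketch below matches the two reported settings but is our assumption, not a rule stated in the paper.

```python
from math import log2, sqrt

def choose_resolutions(n):
    """J_star with 2**J_star as close as possible to sqrt(n), and
    J_bullet = J_star - 2 (assumed offset matching the reported values)."""
    j_star = round(log2(sqrt(n)))
    return j_star - 2, j_star

print(choose_resolutions(250))    # (2, 4): 2^4 = 16 is closest to sqrt(250) ~ 15.8
print(choose_resolutions(1000))   # (3, 5): 2^5 = 32 is closest to sqrt(1000) ~ 31.6
```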

The closeness between the minimal risk min_{0≤c≤4} E[‖s − s̃(c)‖_n²] and the risk E[‖s − s̃(3)‖_n²] indicates that a penalty constant equal to 3 seems to be a good choice. We observe that, for each example, the risks obtained in the independent and the Markovian cases are also close, which tends to confirm Theorem 4.1, in other words that the penalty under Assumption (Dβ_cond) is not much affected by the dependency between the data. For Examples 1 and 2, we can compare our results with those of Lacour [Lac07] in the Markovian case, obtained via regular model selection. We obtain similar results for Example 2 and even better results for Example 1. Last, let us mention that the performance of s̃ in practice might still be improved by a data-driven choice of the penalty constant based on the slope heuristics, as described in [BMM11] for instance, but this is beyond the scope of the paper.

Notation and preliminary lemmas
In all the proofs, the letter C denotes a real that may change from line to line. The notation C(θ) means that the real C may depend on θ.
We will use the following lemma several times to bound some variance terms.
Lemma 6.1. Let q be a positive integer. For all …, where δ and ϑ are defined in Section 4.2 for the dependent case, or δ = 0 and ϑ = 1 when the variables Z_i are independent. Besides, …

Proof: First we use a convexity inequality to write, without further assumption, …, whereas in the independent case Var_s(…) … Var_s(Γ_t(Z_1)). Now, under Assumption (Dβρ), Lemma 8.15 in [Bra07] provides …. Under Assumption (Dβ2-ρ), we immediately deduce from the definition of the ρ-mixing coefficients and the stationarity of (Z_i)_{i∈Z} that, for all 1 ≤ j ≤ q − 1, …. Thus, …. Under Assumption (Dβ_cond), …, hence Inequality (6.3) in the last case. Besides, … ∫ t²(x, y) s(x, y) f(x) dx dy.
We recall here Bernstein's inequality for independent random variables (see [Mas07], Section 2.2.3, for a proof).

Lemma 6.2 (Bernstein's inequality). Let n ≥ 1 and let (W_i)_{1≤i≤n} be an independent and identically distributed sequence, defined on the probability space (Ω, F, P), with values in W. Let g be a real-valued and bounded function defined on W, and set σ²_g = Var(g(W_1)). Then, for all x > 0,
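With the notation of Lemma 6.2, the bound in question can be written in the standard one-sided form of [Mas07] (a classical statement, recalled here for convenience):

```latex
\mathbb{P}\left( \sum_{i=1}^{n} \bigl( g(W_i) - \mathbb{E}[g(W_1)] \bigr)
    \ge \sqrt{2\, n\, \sigma_g^2\, x} + \frac{\|g\|_\infty\, x}{3} \right)
\le e^{-x}.
```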

Proof of Theorem 4.1
Let us fix m ∈ M. We also fix η ≥ 1 and θ₁ > 0, to be determined at the end of the proof. By definition of m̂ and ŝ_m̂,

γ(ŝ_m̂) + pen(m̂) ≤ γ(ŝ_m) + pen(m) ≤ γ(s_m) + pen(m). (6.8)

Using the same arguments as in the proof of Proposition 2.1, we deduce from (6.8) …; we obtain in the same way as Inequality (6.5) that, on the set Ω_η(m⋆) defined as in (6.4), …. Consequently, provided θ₁ > 2η, ….

To pursue the proof, we have to control the term χ²_f(m, m̂). Since the data are β-mixing, we can introduce blockwise independent data. More precisely, let q_n = ⌈3b⁻¹ log(n)⌉ (where b is defined in Assumption (Dβ)) and let (d_n, r_n) be the unique couple of nonnegative integers such that n = d_n q_n + r_n and 0 ≤ r_n < q_n. For the sake of simplicity, we assume in the sequel that r_n = 0 and d_n = 2p_n > 0; the other cases can be treated in a similar way. For l = 0, …, p_n − 1, let us set …. As recalled for instance in [Vie97] (proof of Proposition 5.1), we can build, for l = 0, …, p_n − 1, … such that, for all l = 0, …, p_n − 1, …; the (A•_l)_{0≤l≤p_n−1} are independent random variables, and so are the (B•_l)_{0≤l≤p_n−1}. We set ….

The proof of Theorem 4.1 heavily relies on the following concentration inequality satisfied by the random variables χ²_f(m, m′), for m, m′ partitions built on m⋆. The proof of that proposition is deferred to Section 6.4.

Proposition 6.1. Under the assumptions of Theorem 4.1, there exists a positive constant C such that …, where [x]₊ denotes the positive part of a real x and C depends on ϑ, ‖s‖_∞, r, d, ι(f) and b.
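The blocking device used above — cutting the sample into d_n = 2p_n consecutive blocks of length q_n, alternating "even" blocks A_l and "odd" blocks B_l before coupling each family with independent copies — can be illustrated at the index level. A minimal sketch of the splitting only (the coupling itself requires enlarging the probability space and is not shown):

```python
import math

def split_into_blocks(n, b):
    """Cut the indices 0, ..., n-1 into consecutive blocks of length
    q_n = ceil(3/b * log(n)), alternating 'even' blocks A_l and 'odd'
    blocks B_l; leftover indices are returned as a count.
    (The proof assumes for simplicity that nothing is left over.)"""
    q = math.ceil(3.0 / b * math.log(n))
    d, _ = divmod(n, q)              # n = d_n * q_n + r_n with 0 <= r_n < q_n
    p = d // 2                       # d_n = 2 * p_n in the proof
    A = [range(2 * l * q, (2 * l + 1) * q) for l in range(p)]
    B = [range((2 * l + 1) * q, (2 * l + 2) * q) for l in range(p)]
    return A, B, n - 2 * p * q
```

For instance, with n = 1000 and b = 3 one gets q_n = 7 and p_n = 71 pairs of blocks, with 6 indices left over.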

Assumption (P2)
All the partitions in the collection M are built on a regular partition m⋆ of [0, 1]^d into cubes such that |m⋆|² ≤ n/log²(n).

By comparison with Theorem 3.1, a logarithmic factor then appears in the penalty (and hence in the rate of estimation) under the sole condition of β-mixing, but this term disappears under Assumption (Dβρ), (Dβ2-ρ) or (Dβ_cond); hence the factor log^δ(n) with δ ∈ {0, 1}. Let us first present the oracle-type inequality.

Theorem 4.1. Let M be a collection of partitions satisfying Assumption (P2) and {L_m}_{m∈M} be a family of reals, greater than or equal to 1, such that Σ_{m∈M} exp(−L_m |m|) ≤ 1.

Fig 3. Level lines of the conditional densities to estimate.

Fig 4. Top left: Level lines of the conditional density s for Example 4. Top right: selected partition for c = 3 and n = 1000. Bottom: two sections of s (dark line) together with the corresponding sections of s(3) (light line).
then there exists some positive real C(σ, r, d, p) that only depends on σ, r, d, p such that ….

If (Z_i)_{i∈Z} is a strictly stationary Harris ergodic Markov chain (aperiodic, irreducible, positive Harris recurrent), then (Z_i)_{i∈Z} is geometrically β-mixing, i.e. Assumption (Dβ) is verified, if and only if it is geometrically ergodic (cf. [Bra05], Theorem 3.7). In the sequel, we will mainly be concerned with mixing assumptions possibly involving ρ-mixing and β-mixing at the same time. Under adequate hypotheses, Markov chains (always assumed to be homogeneous of order 1) provide examples of such processes:
• if (Z_i)_{i∈Z} is a strictly stationary Harris ergodic Markov chain that is also reversible and geometrically ergodic, then (Z_i)_{i∈Z} is both geometrically ρ-mixing and geometrically β-mixing (cf. [Jon04], Theorem 2);
• if (Z_i)_{i∈Z} is a strictly stationary, ergodic and aperiodic Markov chain satisfying the Doeblin condition, then (Z_i)_{i∈Z} is uniformly ergodic, hence both geometrically ρ-mixing and geometrically β-mixing (cf. [Bra05], pp. 119–121, or [MT93], Section 16.2).
We refer to [DG83, Mok90, DT93, Dou94, AN98] for examples of stationary processes that are geometrically β-mixing, or both geometrically β- and ρ-mixing, among commonly used time series such as nonlinear ARMA or nonlinear ARCH models.
The (X_i)_{1≤i≤n} are i.i.d., uniformly distributed in [0, 1], and the (ε_i)_{1≤i≤n} are i.i.d. standard (centered and reduced) Gaussian variables, independent of the X_i's.
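The design just described — uniform X_i's and standard Gaussian noise independent of the design — is straightforward to reproduce. A minimal sketch, in which the link function `F` and the noise level `sigma` are hypothetical placeholders and not the paper's actual examples:

```python
import numpy as np

def simulate_sample(n, F, sigma=1.0, rng=None):
    """Draw (X_i, Y_i), i = 1, ..., n, with X_i i.i.d. uniform on [0, 1]
    and Y_i = F(X_i) + sigma * eps_i, the eps_i being i.i.d. N(0, 1)
    and independent of the X_i's."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.standard_normal(n)
    return x, F(x) + sigma * eps
```

Setting `sigma=0.0` recovers the noiseless link, which is a convenient sanity check for the generator.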