Should we estimate a product of density functions by a product of estimators?

In this paper, we consider the inverse problem of estimating the product fg of two densities, given a d-dimensional n-sample of i.i.d. observations drawn from each distribution. We propose a general method of estimation encompassing both projection estimators with a model selection device and kernel estimators with bandwidth selection strategies. The procedures do not consist in taking the product of the two density estimators, but in plugging an overfitted estimator of one of the two densities into an estimator based on the second sample. Our findings are a first step toward a better understanding of the good performance of overfitting in the Nadaraya-Watson regression estimator.


Introduction
In this work, we consider that we have n observations X_i, i = 1, ..., n, in R^d, independent and identically distributed (i.i.d.) with density f, and independent from n additional observations Y_i, i = 1, ..., n, in R^d, i.i.d. with density g. We study the question of estimating the product function fg from these observations. Note that the resulting function is not, in general, a density, and none of the observations are directly related to this product. In that sense, we face an inverse problem. Our framework contains the case where f = g and the goal is to estimate f^2 by splitting a 2n-sample. These quantities may be of interest in some testing problems or as a first step for estimating the L^2-norm of f, see Laurent and Massart (2000); other product problems are considered in Butucea et al. (2018).
However, we must explain that we considered this problem as a simplified setting (a toy problem, in some sense) for a more complicated question. Let us explain it. Consider a regression model in dimension d = 1 with Y_i = b(X_i) + ε_i, with i.i.d. and independent sequences (X_i)_{1≤i≤n} and (ε_i)_{1≤i≤n}. The question is to estimate the regression function b(·) from the observations (X_i, Y_i)_{1≤i≤n}. A popular proposal is the Nadaraya-Watson estimator (see Györfi et al. (2002)),
$$\hat b_h(x) = \frac{\frac1n\sum_{i=1}^n Y_i K_h(x - X_i)}{\frac1n\sum_{i=1}^n K_h(x - X_i)} = \sum_{i=1}^n \frac{K_h(x - X_i)}{\sum_{j=1}^n K_h(x - X_j)}\, Y_i,$$
where K is a kernel, h a bandwidth parameter and K_h(·) = K(·/h)/h. This estimator can be seen as a weighted combination of the Y_i's (second equality) or as a ratio of an estimator of bf, where f is still the density of the X_i's, divided by an estimator of f (first equality). In the latter case, it is not clear that the same bandwidth h must be chosen for the numerator and the denominator. Surprisingly, Comte and Marie (2021) proposed sophisticated strategies for these two terms, but noticed in their simulation experiments that, although the numerical results obtained for each function separately were excellent, the performance of the ratio was almost systematically defeated by the single-bandwidth method selected from a least squares criterion relying on the weighted view of the question. The unique bandwidth selected in this case is small, but the ratio of these two poor, overfitted estimators is undoubtedly very good, at least when the noise level is not too high (see a related discussion in Section 4.1 of Bartlett et al. (2019)). This is why we wondered whether the product of two functional estimators is a good estimator of the product of two functions; we took these functions to be densities for simplicity. Thus our motivation is mainly theoretical, but we believe that the question is of general interest. Now, let us see why taking a product of density estimators can be seen as an inadequate (sub-optimal) strategy. Assume that we set \widehat{fg} := \hat f × \hat g, where \hat f and \hat g are minimax optimal estimators of f and g respectively. To get an upper bound result, there is no other way than to separate the role of each estimator: both individual risks of \hat f and \hat g emerge. Then, the resulting rate is the slowest between the rates of estimation of f and g: it is the rate induced by the less regular density among f and g, say g without loss of generality for the remainder of this discussion. Clearly, this is not optimal if the product fg is more regular than g. For instance, take d = 1, f a β(p, p) density with p ≥ 2, p integer, and g a uniform density, i.e. a β(1, 1). Then, on R, f has regularity p − 1 and g regularity 0, but fg = f has regularity p − 1. Therefore, one can wonder whether, in these cases, it is possible to build an estimator directly adapted to the regularity of the product fg.
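To make the two readings of this estimator concrete, here is a minimal Python sketch (not the code of Comte and Marie (2021); the function name, the Gaussian kernel and the bandwidth values are illustrative assumptions) that computes the Nadaraya-Watson estimate as a ratio, either with a single bandwidth or with separate bandwidths for the numerator and the denominator.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h_num, h_den=None):
    """Nadaraya-Watson estimate at points x, written as a ratio of kernel
    estimators: the numerator estimates b*f, the denominator estimates f.
    With h_den=None a single bandwidth is used for both terms."""
    if h_den is None:
        h_den = h_num
    K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    # numerator: (1/n) sum_i Y_i K_{h_num}(x - X_i), an estimator of b*f
    num = np.mean(Y[None, :] * K((x[:, None] - X[None, :]) / h_num) / h_num, axis=1)
    # denominator: (1/n) sum_i K_{h_den}(x - X_i), an estimator of f
    den = np.mean(K((x[:, None] - X[None, :]) / h_den) / h_den, axis=1)
    return num / np.maximum(den, 1e-12)

# toy illustration: a small (overfitted) bandwidth for both terms vs. separate bandwidths
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 500)
Y = np.sin(3 * X) + 0.2 * rng.standard_normal(500)
x_grid = np.linspace(-2, 2, 200)
b_single = nadaraya_watson(x_grid, X, Y, h_num=0.05)              # one small h
b_double = nadaraya_watson(x_grid, X, Y, h_num=0.05, h_den=0.3)   # separate bandwidths
```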
A related disadvantage of an upper bound separating the roles of f and g is that it does not treat this problem as an inverse problem: both individual regularities of f and g intervene, whereas one expects that the sole regularity of fg should matter. This is especially true since, depending on the regularity classes under consideration, there is often no universal rule relating the regularities of f and g to that of the product.
To complete this discussion, notice that it is easy to derive a lower bound result, inspired by the former example on beta distributions. Denote by Σ(s, L), where s = (s_1, ..., s_d) with positive s_i, i = 1, ..., d, and L > 0, a ball of radius L in a space of functions with regularity s. Then, for any measurable function T of (X_i, Y_i)_{1≤i≤n}, the risk sup_{fg ∈ Σ(s,L)} E||T − fg||^2 is bounded from below, and we recover on the right-hand side the lower bound of the direct density estimation problem. To summarize, if the regularity set Σ(s, L) contains a [0, 1]^d-supported density f_0, a lower bound for the product is given by a lower bound for the direct estimation of f_0. This is enough to state that the upper bound results presented below are optimal. For instance, in dimension d = 1, we recover rates in n^{−2s/(2s+1)} if fg belongs to Σ(s, L), a Sobolev class of regularity s, and these rates are minimax.
The plan of the paper is the following. We propose in Section 2.1 a general estimation strategy: we define a functional estimator of the product fg and prove a non-asymptotic risk bound under general assumptions. This general strategy encompasses projection and kernel methods, for which we check the assumptions and present specific results in Section 2.2. Then we study, in a particular projection case, the resulting rate, related to the regularity of fg: it can be reached for a well-chosen dimension of the projection space (see Section 2.4). As this choice depends on unknown parameters, we then propose in Section 3.1 a general parameter selection strategy of Goldenshluger and Lepski (2011) type, and prove that the resulting estimator automatically reaches the squared-bias/variance compromise. This general method applies to projection and kernel methods. However, in both cases and in dimension 1, alternative methods are used in the simulations; they outperform the Goldenshluger and Lepski procedures numerically and are simpler to calibrate, with faster execution times. Note that an invariable difficulty in the theoretical study of these procedures is that, even if we have i.i.d. variables, our estimators are built from sums of dependent variables (see e.g. (2.6) or (2.7)). Numerical comparisons of the different methods and associated strategies for product estimators are conducted in Section 4, and they agree with our theoretical findings. Several additional questions are presented in the concluding remarks of Section 5. Lastly, proofs are gathered in Section 6 for the results of Section 2 and in Section 7 for the adaptive results stated in Sections 3 and 4.

Functional estimator and first risk bound
We consider a general family K of functions K : I^2 → R, where I ⊂ R^d, that are symmetric (i.e. for all x, y ∈ I, K(x, y) = K(y, x)).
In the sequel, we denote by ψ_K the quantity ψ_K(x) := ∫ K(x, y)ψ(y) dy, for any function ψ. The collection K must be chosen such that ψ_K can be a good approximation of ψ. Examples of possible collections K are provided in Section 2.2.
For such a function K* ∈ K, we define an estimator of f by
$$\hat f_{K^*}(x) := \frac{1}{n}\sum_{i=1}^n K^*(x, X_i), \qquad (2.1)$$
expected to be a good approximation of f. In an analogous way, we propose as an estimator of fg, for K ∈ K,
$$\widehat{(fg)}_{K,K^*}(x) := \frac{1}{n}\sum_{i=1}^n K(x, Y_i)\,\hat f_{K^*}(Y_i). \qquad (2.2)$$
Note that, for (2.2) to make sense, X_1 and Y_1 must have the same dimension.
We set assumptions (A1)-(A2) on K. Moreover, we require assumptions (A3)-(A5) on \hat f_{K*}; these assumptions are related to K* and therefore to the density f only. We can now establish the following upper bound on the mean integrated risk of \widehat{(fg)}_{K,K*}.
Proposition 2.1. Let K, K* belong to K. Assume that Assumptions (A1)-(A5) hold and that g is bounded on I, with bound denoted by ||g||_∞. Let \widehat{(fg)}_{K,K*} be the estimator defined by (2.2). Then we have
$$E\big\|\widehat{(fg)}_{K,K^*} - fg\big\|^2 \;\le\; \|(fg)_K - fg\|^2 \;+\; c\,\|f - f_{K^*}\|^2 \;+\; C(f,g)\,\frac{L(K)}{n}, \qquad (2.3)$$
where c is a numerical constant and C(f, g) depends on ||f||_∞, ||g||_∞ and on the constants appearing in (A1)-(A5).
Strategy suggested by (2.3). The risk bound (2.3) contains two standard terms, the squared bias ||(fg)_K − fg||^2 and the variance C(f, g)L(K)/n, requiring a standard selection for K (see Lerasle et al. (2016)). It also involves the bias term ||f − f_{K*}||^2, which has no counterpart: thus K* can and should be chosen so as to make it negligible, and \hat f_{K*} can therefore be taken overfitted.
If, in the initial problem, f and g have symmetric roles, this is no longer true in the definition (2.2) of the estimator, where one of the two densities is estimated first. As a matter of fact, Proposition 2.1 suggests to plug into the product estimator (2.2) an overfitted estimator, which eliminates a selection issue for K*. When possible, we select for this overfitted estimator the one corresponding to the smoother density. Indeed, this should make the additional bias term decrease faster. However, the information about which of f and g is smoother is not available. From a theoretical viewpoint, both ||f − f_{K*}||^2 and ||g − g_{K*}||^2 are negligible under a minimal regularity assumption on f and g, and choosing K* such that Assumptions (A1)-(A5) hold makes the bias term as small as possible. From a practical point of view and in dimension 1, we propose to consider that the smoother density is the one for which a selection method for the direct density estimation of f and g leads to the smallest complexity L(K) (i.e. the smallest selected dimension in projection, or the largest bandwidth for kernels; see Section 2.2 hereafter).
In the sequel, we may write \widehat{(fg)}_K instead of \widehat{(fg)}_{K,K*} when there is no ambiguity, as K* can be considered fixed.
[Ker] Kernel case: for K an integrable and square-integrable symmetric function (K(−z) = K(z)) defined on R, such that ∫ K = 1, and h ∈ [0, 1]^d, we denote by K_h(z) := Π_{j=1}^d h_j^{-1} K(z_j/h_j) the associated rescaled kernel. Next we prove that Assumptions (A3)-(A5) are verified for the projection estimator of f,
$$\hat f_{m^*} = \sum_{0\le j\le m^*-1} \hat a_j\,\varphi_j, \qquad \hat a_j = \frac1n\sum_{i=1}^n \varphi_j(X_i), \qquad (2.4)$$
where Σ_{0≤j≤m*−1} stands for the multi-index sum over j = (j_1, ..., j_d) with 0 ≤ j_k ≤ m*_k − 1 for k = 1, ..., d, and for the kernel estimator
$$\hat f_{h^*}(x) = \frac1n\sum_{i=1}^n K_{h^*}(x - X_i). \qquad (2.5)$$
Proposition 2.3. Assume that f is bounded. Then the estimators of f defined by (2.4) and (2.5) satisfy Assumptions (A3)-(A5).
The conditions L(m*) ≤ n and L(h*) ≤ n represent a stronger version of Assumption (A2). Moreover, f_{K_m} = f_m = Σ_{0≤j≤m−1} a_j ϕ_j, with a_j = ⟨f, ϕ_j⟩, is the projection of f on S_m: it is a good approximation of f for m large (in the sense that min_{1≤j≤d} m_j is large). Analogously, f_{K_h} = f ⋆ K_h, with u ⋆ v denoting the convolution product of two L^2(I) functions, gets close to f for small h (in the sense that max_{1≤j≤d} h_j is small), under mild conditions, see e.g. Tsybakov (2009) or Goldenshluger and Lepski (2014).
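To fix ideas in dimension d = 1, here is a minimal Python sketch (not the authors' code; the function names, the Hermite recurrence and the Gaussian kernel are illustrative assumptions) of the two direct estimators of f: the projection estimator (2.4) on the Hermite basis and the kernel estimator (2.5).

```python
import numpy as np

def hermite_basis(x, m):
    """First m normalized Hermite functions evaluated at points x (d = 1),
    computed with the standard three-term recurrence."""
    x = np.asarray(x, dtype=float)
    phi = np.empty((m, x.size))
    phi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if m > 1:
        phi[1] = np.sqrt(2.0) * x * phi[0]
    for j in range(1, m - 1):
        phi[j + 1] = np.sqrt(2.0 / (j + 1)) * x * phi[j] - np.sqrt(j / (j + 1)) * phi[j - 1]
    return phi

def projection_density(x, X, m):
    """Projection estimator (2.4): hat f_m = sum_j hat a_j phi_j,
    with hat a_j = (1/n) sum_i phi_j(X_i)."""
    a_hat = hermite_basis(X, m).mean(axis=1)     # estimated coefficients hat a_j
    return a_hat @ hermite_basis(x, m)           # evaluation on the grid x

def kernel_density(x, X, h):
    """Kernel estimator (2.5): hat f_h(x) = (1/n) sum_i K_h(x - X_i),
    here with a Gaussian kernel K."""
    K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.mean(K((x[:, None] - X[None, :]) / h) / h, axis=1)
```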
This allows us to derive, in the following section and from Proposition 2.1, upper-bound results for the projection and kernel estimators that we introduce below.

A Corollary in the projection or kernel case
In a purely projection approach, where we take K and K* in the same collection of the form [P], we obtain the following estimator of fg:
$$\widehat{(fg)}_{m,m^*} = \sum_{0\le j\le m-1} \hat a_j^{(m^*)} \varphi_j, \qquad \hat a_j^{(m^*)} := \frac1n\sum_{i=1}^n \varphi_j(Y_i)\,\hat f_{m^*}(Y_i). \qquad (2.6)$$
Clearly, E(\hat a_j^{(m*)}) = ⟨ϕ_j, f_{m*} g⟩, which shows that our estimator is indeed close to f_{m*} g, which in turn should be near fg for large m*. Choosing m* large is possible since the variance of \hat f_{m*} does not appear in the risk bound. It is worth noting that the sum on the right-hand side of (2.6) is composed of identically distributed but dependent variables: the dependency appears through \hat f_{m*}, which is based on the X-sample.
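Under the same illustrative assumptions as above (and reusing the hypothetical helpers hermite_basis and projection_density from the previous sketch), the plug-in projection estimator (2.6) can be sketched as follows.

```python
def projection_product(x, X, Y, m, m_star):
    """Plug-in product estimator (2.6): an overfitted projection estimate of f,
    built from the X-sample with large m_star, is plugged into coefficients
    computed on the Y-sample."""
    f_over = projection_density(Y, X, m_star)               # hat f_{m*}(Y_i)
    a_hat = (hermite_basis(Y, m) * f_over).mean(axis=1)     # hat a_j^{(m*)}
    return a_hat @ hermite_basis(x, m)
```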
In a purely kernel approach, where we take K and K* in the same collection of the form [Ker], we consider the following estimator of fg:
$$\widehat{(fg)}_{h,h^*}(x) = \frac1n\sum_{i=1}^n K_h(x - Y_i)\,\hat f_{h^*}(Y_i). \qquad (2.7)$$
A straightforward consequence of Proposition 2.1 is the following corollary.
Corollary 2.1. Assume that f and g are bounded.
Let \widehat{(fg)}_{m,m*} be defined by (2.6). For any m* such that L(m*) ≤ n, we have the bound (2.8), where \overline{fg} = fg 1_I, C_1 := ||g||_∞(1 + 2||f||_∞), (fg)_m is the orthogonal projection of fg on S_m and f_{m*} is the orthogonal projection of f on S_{m*}. Let \widehat{(fg)}_{h,h*} be defined by (2.7). For any h* such that L(h*) ≤ n, we have the analogous bound (2.9). It is also possible to consider a mixed strategy, where K* is selected in collection [P] and K in [Ker], or vice versa, and a similar result is obtained. This strategy is investigated in the numerical section. In that case, the two bias terms ||(fg)_K − fg||^2 and ||f − f_{K*}||^2 refer to different regularity spaces; see the discussion below and the next section.
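A kernel counterpart of the previous sketch, again with a plain Gaussian kernel for brevity (the kernel actually used in Section 4 is the Gaussian mixture (4.6)), illustrates the plug-in estimator (2.7); kernel_density refers to the hypothetical helper defined above.

```python
def kernel_product(x, X, Y, h, h_star):
    """Plug-in product estimator (2.7): an overfitted kernel estimate of f
    (small bandwidth h_star, built from the X-sample) is plugged into a kernel
    sum over the Y-sample with bandwidth h."""
    f_over = kernel_density(Y, X, h_star)                   # hat f_{h*}(Y_i)
    K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.mean(K((x[:, None] - Y[None, :]) / h) / h * f_over[None, :], axis=1)
```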
The bound (2.9) suggests choosing h* as small as possible, in order to make this term negligible. For instance, if f belongs to a Nikol'ski ball with regularity parameter α = (α_1, ..., α_d) (see Goldenshluger and Lepski (2014), Definition 1), then ||f_{h*} − f||^2 is of order Σ_{j=1}^d (h*_j)^{2α_j}, provided the kernel K has order at least max_j ⌊α_j⌋ (where ⌊α⌋ is the largest integer such that ⌊α⌋ < α and the order of the kernel is understood as in Section 3.2 of Goldenshluger and Lepski (2014)). It follows that, if min_j α_j > 1/2 and h*_j = 1/n for all j, this term is of order less than 1/n and is negligible. Then, only the bandwidth h requires to be selected.
If, in addition, fg belongs to a Nikol'ski ball with regularity parameter β and K has order at least max_j ⌊β_j⌋, then the estimator reaches the minimax rate (see Goldenshluger and Lepski (2014)), for h_j chosen of order n^{−\bar β/(β_j(2\bar β+1))}, where 1/\bar β := Σ_{j=1}^d 1/β_j. Such a choice of h is not feasible since β is unknown, so a data-driven procedure for selecting h must be proposed.
In the following section we provide a more precise evaluation of the rates of estimation in the projection case in dimension 1.

Projection strategy: Rates on Sobolev Hermite spaces
In this section we focus on the case d = 1 and give an example of the rate induced by the bound (2.8), in the case of the Hermite basis and the associated Sobolev spaces (see Comte and Lacour (2021), Definition 1, for the general case d ≥ 1). The Hermite functions (ϕ_j)_{j≥0} are defined from the Hermite polynomials (H_j)_{j≥0} by, for x ∈ R,
$$\varphi_j(x) = c_j H_j(x) e^{-x^2/2}, \qquad H_j(x) = (-1)^j e^{x^2}\frac{d^j}{dx^j}\big(e^{-x^2}\big), \qquad c_j = \big(2^j j!\sqrt{\pi}\big)^{-1/2}. \qquad (2.10)$$
The Hermite polynomials (H_j)_{j≥0} are orthogonal with respect to the weight function e^{−x^2}, that is, ∫_R H_j(x)H_k(x)e^{−x^2} dx = 2^j j! √π δ_{j,k} (see Abramowitz and Stegun (1964), Chapter 22.2.14). Therefore, the Hermite basis (ϕ_j)_{j≥0} is an orthonormal basis of L^2(R). We note also that each ϕ_j is bounded (see Abramowitz and Stegun (1964)). For s > 0, the Sobolev-Hermite ball (see Bongioanni and Torrea (2006)) is defined by
$$W^s_H(D) = \Big\{ f \in L^2(\mathbb{R}) : \ \sum_{j\ge 0} j^s \langle f, \varphi_j\rangle^2 \le D \Big\},$$
and, for integer s, the condition Σ_{j≥0} j^s ⟨f, ϕ_j⟩^2 < +∞ is equivalent to: f admits derivatives up to order s such that f, f', ..., f^{(s)} and x^{s−ℓ} f^{(ℓ)}, for ℓ = 0, ..., s − 1, belong to L^2(R). Moreover, for any function f ∈ W^s_H(D), we have ||f − f_m||^2 ≤ D m^{−s}. It is also easy to see that if, in addition, s > 1, then f is bounded. As f and g are assumed to be bounded, the bound (2.8) applies, and balancing its squared-bias and variance terms shows that the resulting rate is of order n^{−2s/(2s+1)}; this rate is optimal, by the lower bound derived in the Introduction.
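For completeness, the computation behind this rate can be summarized as follows (a sketch under the assumptions above, taking the additional bias term ||f − f_{m*}||^2 to be negligible and using L(m) of order √m for the Hermite basis).

```latex
% Squared-bias/variance tradeoff for the Hermite projection estimator (d = 1).
% Bias: if fg \in W^s_H(D), then \|(fg)_m - fg\|^2 \le D m^{-s}.
% Variance: L(m) \le C_H \sqrt{m}, so the variance term is of order \sqrt{m}/n.
\[
  \mathbb{E}\big\|\widehat{(fg)}_{m,m^*} - fg\big\|^2
  \;\lesssim\; m^{-s} + \frac{\sqrt{m}}{n}
  \quad\Longrightarrow\quad
  m_{\mathrm{opt}} \asymp n^{\frac{2}{2s+1}},
  \qquad
  \mathbb{E}\big\|\widehat{(fg)}_{m_{\mathrm{opt}},m^*} - fg\big\|^2 \;\lesssim\; n^{-\frac{2s}{2s+1}}.
\]
```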

General adaptation result
We propose a method of Goldenshluger and Lepski (2011) type. Define K_n = {K_τ}_{τ∈T_n}, a family of symmetric functions indexed by a parameter τ and satisfying (A1)-(A2). For simplicity, we write ψ_τ instead of ψ_{K_τ} and L(τ) instead of L(K_τ). For example, in the previous examples, the parameter τ is a d-dimensional vector of integers m in the projection case and a bandwidth h ∈ [0, 1]^d in the kernel context. We add the assumptions (A6)-(A10); in particular, the collection of models is such that Card(T_n) ≤ n^d and, for all c > 0, a summability condition over T_n holds. The selection of \hat τ is done by a rule comparing estimators, involving the quantity A(τ) defined in (3.1), for some positive constants κ and κ' to be selected. The estimator \hat f_{K_{τ*}} of f defined in (2.1) now relies on the symmetric function K_{τ*} ∈ K_n and is rewritten accordingly. The resulting risk bound (Theorem 3.1) is of oracle type, where C is a numerical constant and C' depends on ||f||_∞, ||g||_∞ and on a constant specific to the collection K.
Theorem 3.1 shows that the adaptive procedure automatically realizes the squared-bias/variance tradeoff, up to negligible terms, as soon as τ* is such that ||f − f_{τ*}||^2 is of order at most 1/n. This result is general and allows us to consider any collection K such that (A1)-(A10) are satisfied. For example, a mixed strategy where a kernel estimator of f is plugged into a projection estimator of the product is possible. It is worth stressing that the proof of Theorem 3.1 relies on the Talagrand inequality and does not involve the study of a U-statistic (see Lerasle et al. (2016) and our Theorem 4.2 below).
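As an illustration only (the exact rule (3.1) and the constants κ, κ' are not reproduced in this extract), a generic Goldenshluger-Lepski-type selection for the projection family could be sketched as follows in Python, reusing the hypothetical helpers of Section 2.2; V(m) mimics a variance term of order L(m)/n, with L(m) of order √m for the Hermite basis.

```python
def select_gl_projection(X, Y, m_star, M_max, kappa=1.0, kappa_prime=1.0):
    """Goldenshluger-Lepski-type selection sketch for the projection family:
    A(m) compares nested estimators, V(m) = kappa * sqrt(m) / n."""
    n = len(Y)
    f_over = projection_density(Y, X, m_star)                 # overfitted hat f_{m*}(Y_i)
    a_hat = (hermite_basis(Y, M_max) * f_over).mean(axis=1)   # coefficients up to M_max
    V = lambda m: kappa * np.sqrt(m) / n
    def A(m):
        # sup over m' of [ ||hat(fg)_{m'} - hat(fg)_{m /\ m'}||^2 - V(m') ]_+
        return max(max(np.sum(a_hat[min(m, mp):mp] ** 2) - V(mp), 0.0)
                   for mp in range(1, M_max + 1))
    crits = [A(m) + kappa_prime * V(m) for m in range(1, M_max + 1)]
    return 1 + int(np.argmin(crits))
```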

Assumptions (A6)-(A10) for projection and kernel estimators
In this section, we present assumptions ensuring (A1)-(A10) for the previous procedures. First, in the projection case (2.6), we define a collection M_n of proposals for m and set the following assumptions: [P1] f and g are bounded on I.
[P4] The collection of models is such that Card(M_n) ≤ n^d and, for all c > 0, a summability condition holds. In Assumption [P3], the maximal value of m* depends on ||f||_∞. This constraint can be replaced by L(m*) ≤ n/log^{3/2}(n), and the result then follows for n large enough. Assumptions [P2] and [P4] are classical; for instance, they are fulfilled by the trigonometric and Hermite bases. Condition [P5] is a minimal regularity constraint. For instance, for d = 1, it requires that the function f has a minimal regularity of 1/2 on Sobolev-Fourier spaces for I = [0, 1], and of 1 on Sobolev-Hermite spaces.
Second, for the kernel estimator (2.7), we take for T_n a discrete collection H_n of bandwidths in (1/n, 1), with cardinality less than n. We consider the following set of assumptions: [K1] f and g are bounded on I.
[K2] The kernel K is even, bounded and integrable.

Contrary to the projection case, the kernel method does not require any regularity constraint of type [P5].

Numerically efficient adaptive procedures in dimension 1
For the numerical study, we focus on dimension d = 1 and do not implement the above adaptive procedure. Indeed, the Goldenshluger and Lepski method is often difficult to calibrate from an implementation viewpoint (see Comte and Rebafka (2012)) and suffers from important computational costs. It involves the calibration of two constants, κ and κ'. This preliminary calibration step is difficult, probably because these constants act simultaneously on the bias and variance terms. In the kernel case, the "double" convolution which appears when computing \widehat{(fg)}_{h,h'} is numerically time consuming. Instead, we propose two numerically efficient procedures, model selection and PCO, for which the squared-bias/variance tradeoff is also attained, under [P1]-[P5] and [K1]-[K4] respectively.

Model selection for projection estimators
We present, under the above assumptions [P1]-[P5], a result for a more standard and simpler model selection procedure. More precisely, we define a penalty pen(m) and then select \hat m by minimizing a penalized criterion, where κ is a numerical constant. With the same tools as those used in the proof of Theorem 3.1, we can prove the following result.
Theorem 4.1. In the projection case [P], if Assumptions [P1]-[P5] hold, then there exists κ_0 such that, for any κ ≥ κ_0, we have an oracle-type risk bound, where C is a constant depending on ||f||_∞ and C_a.
The proof is omitted, but details can be found in the preprint version Comte and Duval (2022), version 1, which also indicates that κ_0 = 8 × 12 = 96 would suit. In practice, this theoretical value is always too large and has to be calibrated on preliminary simulation experiments. Note that the estimate \widehat{(fg)}_{m,m*} is replaced by its positive part, for which the same risk bound holds. Moreover, it can easily be checked that this proof also holds in a mixed strategy where a kernel estimator for f is plugged into a projection estimator of the product, i.e. K* in [Ker] and K in [P]. This justifies why, for the numerical results, we experiment with this mixed strategy together with a penalized criterion for adaptation.
The values ||f||_∞, ||g||_∞ in the penalty term are unknown and must be replaced by estimates. The bound ||f||_∞ can be estimated by the maximal value of a projection estimate of f on a middle-sized space, and an analogous approach can be adopted for ||g||_∞. Let us denote these estimators by \widehat{||f||}_∞ and \widehat{||g||}_∞. This strategy is theoretically studied in Theorem 12, p. 594 (Appendix A: Random penalty) in Lacour (2007). We consider in the numerical section the estimator \widehat{(fg)}_{m,m*} and adopt the following strategy. The penalty is obtained from the theory as the sum of bounds on two terms, one of variance type and an additional term given by ||f||_∞ ||g||_∞ L(m)/n; following ideas in Massart (2007) (see also Theorem 7.6, p. 216, in the density case), both terms are replaced by empirical counterparts. The constant κ is calibrated by preliminary simulations, see Section 4.
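As a rough illustration of this model selection step (a sketch only: the penalty below keeps a single term of the form κ \widehat{||f||}_∞ \widehat{||g||}_∞ L(m)/n with L(m) of order √m, and the function names are the hypothetical helpers introduced earlier), one may write:

```python
def select_dimension(X, Y, m_star, M_max, kappa=0.15):
    """Model selection sketch: minimize -||hat(fg)_m||^2 + pen(m) over m.
    Simplified penalty kappa * sup|hat f| * sup|hat g| * sqrt(m) / n
    (Hermite case); the paper's penalty has an additional empirical term
    not reproduced here."""
    n = len(Y)
    f_over = projection_density(Y, X, m_star)                  # overfitted hat f_{m*}(Y_i)
    a_hat = (hermite_basis(Y, M_max) * f_over).mean(axis=1)    # coefficients up to M_max
    # rough sup-norm estimates of f and g from middle-sized projection spaces
    grid = np.linspace(Y.min(), Y.max(), 400)
    f_sup = projection_density(grid, X, int(np.sqrt(len(X)))).max()
    g_sup = projection_density(grid, Y, int(np.sqrt(n))).max()
    crits = [-np.sum(a_hat[:m] ** 2) + kappa * f_sup * g_sup * np.sqrt(m) / n
             for m in range(1, M_max + 1)]
    return 1 + int(np.argmin(crits))
```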

Bandwidth selection with PCO
We describe here a method called PCO ("Penalized Comparison to Overfitting"), introduced for kernel density estimation by Lacour et al. (2017). The PCO method is more complicated from a theoretical point of view, because it involves the study of several U-statistics of order 2. But it is much easier to calibrate and implement in practice.
Let us describe the PCO method. Denoting by \widehat{(fg)}_h the estimator (2.7) (omitting h* for simplicity), we select \hat h by comparing \widehat{(fg)}_h to the most overfitted estimator \widehat{(fg)}_{h_min} and adding a penalty made of two terms, pen_1 and pen_2. Note that pen_1 and pen_2 are both of order 1/(nh). This is obvious for pen_2; for pen_1, observe that it can be bounded accordingly, where c_3 and C are positive constants depending on f, g, K, and (c_1, c_2) = (3.6, 10.4) would suit.
The risk bound of Theorem 4.2 involves four terms. The first term is the minimal risk among the collection of estimators, up to a multiplicative constant. The two following terms, ||\widehat{(fg)}_{h_min} − fg||^2 and ||f_{h*} − f||^2, are bias terms corresponding to small bandwidths; they are negligible if h_min is of order 1/n and h* of order log(n)/n (as required by Assumption [K3]), and if the functions f and fg have regularity larger than 1/2. The last term, log(n)/n, is negligible compared to the first one. Therefore, the adaptive estimator achieves the intended squared-bias/variance compromise stated at the end of Section 2.3.
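A minimal sketch of this selection step (not the paper's exact procedure: the penalty pen_1 + pen_2 is replaced by a single term c_1/(nh) for illustration, and kernel_product is the hypothetical helper defined earlier):

```python
def select_bandwidth_pco(X, Y, h_star, bandwidths, c1=3.6):
    """PCO sketch: compare each hat(fg)_h to the most overfitted estimator
    hat(fg)_{h_min} and add a penalty of order 1/(n h)."""
    n = len(Y)
    grid = np.linspace(Y.min() - 1, Y.max() + 1, 600)
    dx = grid[1] - grid[0]
    h_min = min(bandwidths)
    fg_min = kernel_product(grid, X, Y, h_min, h_star)      # overfitted reference
    crits = []
    for h in bandwidths:
        fg_h = kernel_product(grid, X, Y, h, h_star)
        comparison = np.sum((fg_h - fg_min) ** 2) * dx      # ||hat(fg)_h - hat(fg)_{h_min}||^2
        crits.append(comparison + c1 / (n * h))
    return bandwidths[int(np.argmin(crits))]
```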

Description of the implemented procedures
We illustrate the performance of the projection estimator with the Hermite basis (see Section 2.4) and of the kernel estimator with a kernel built as a Gaussian mixture, see (4.6), where n_j(x) denotes the density of a centered Gaussian with variance equal to j. This kernel is of order 3 (i.e. ∫ x^j K(x) dx = 0 for j = 1, 2, 3); a sketch of a kernel of this type is given after the list below. We consider four examples. We compute normalized L^2-risks to allow the numerical comparison of the different examples, for which ||fg||^2 varies a lot. Namely, we evaluate the following procedures.
• Product: this estimator is obtained as the product \hat f \hat g, where each factor is an adaptive optimal estimator. In the projection case, the product is \hat f_{\hat m_1} \hat g_{\hat m_2}, where \hat f_{\hat m_1} is defined by (2.4), with \hat m_1 selected by a standard penalized criterion for direct density estimation, and \hat g_{\hat m_2} is defined analogously. In the kernel case, it is \hat f_{\hat h_1} \hat g_{\hat h_2}, where \hat f_{\hat h_1} is defined by (2.5), with \hat h_1 selected by PCO, and \hat g_{\hat h_2} is defined analogously.
• First X: in all our examples, f is smoother than g. The theoretical results then suggest to use for the preliminary estimate the dataset X, which has density f. In the projection setting, our estimate is \widehat{(fg)}_{\hat m, m*} of Theorem 4.1, with penalty given by (4.3), where the sup-norms ||f||_∞ and ||g||_∞ are estimated as described above, and with κ = 0.15 after calibration. In the kernel case, we consider the estimator \widehat{(fg)}_{\hat h} of Theorem 4.2, with penalty given by (4.4), where pen_2 is replaced by its empirical counterpart.
• Optimal first: as the information about the compared smoothness of f and g is unavailable in practice, we have proposed an adaptive method for choosing which estimate is plugged in: we apply a classical penalized (resp. PCO) procedure (see the Product step) to the datasets X and Y, and we take as preliminary projection (resp. kernel) estimate the one for which \hat m (resp. \hat h) is the smallest (resp. largest). Indeed, the optimal dimension (resp. bandwidth) is asymptotically a decreasing (resp. increasing) function of the regularity. For instance, if \hat m_1 < \hat m_2, we proceed as in the First X step; otherwise, the roles of X and Y are switched. We count the number of times where Y is selected first; thus, when this count is zero, First X and Optimal first coincide and give the same result.
• Oracle (optimal first): our benchmark is computed as follows. We consider, for all dimensions or bandwidths, the estimators of the Optimal first step and select the oracle, i.e. the one minimizing the true L^2-error. This quantity provides a numerical lower bound for the L^2-risk of our procedure.
Table 2. Case [GE]: L^2-risks with std in parentheses (both multiplied by 10^2), Dmax = 100. For Optimal first, the bold subscript is the number of times where Y is selected first.
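The Gaussian-mixture kernel mentioned above can be sketched as follows (an illustrative order-3 mixture of the centered Gaussians n_1 and n_2 with weights 2 and −1; the exact weights of the paper's kernel (4.6) are not reproduced here).

```python
def gaussian_mixture_kernel(x):
    """An order-3 kernel built as a mixture of centered Gaussians n_j with
    variance j: K = 2*n_1 - n_2. The weights (2, -1) make the second moment
    vanish (2*1 - 1*2 = 0); odd moments vanish by symmetry."""
    n1 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # centered Gaussian, variance 1
    n2 = np.exp(-x ** 2 / 4) / np.sqrt(4 * np.pi)   # centered Gaussian, variance 2
    return 2 * n1 - n2
```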

Numerical results
Let us comment on the results of Tables 1-4. First, we compare the projection and kernel procedures separately. Let us start with the two fully data-driven methods, Product and Optimal first. We observe that the results of the corresponding columns nicely confirm the theory: the risk of our procedure is almost systematically and significantly smaller (see Table 2 in particular). Besides, the risk of Optimal first is always comparable with, and sometimes even better than, the risk of the First X method, which uses the unavailable knowledge of the smoothest density. The risk of Optimal first has the same order as the Oracle, even if a multiplicative factor larger than 2 is observed. Lastly, as the risks are normalized, we can compare the risks across the different tables; we see that the first two examples (Tables 1 and 2) are slightly more difficult, which was expected: these densities are less regular as functions on R.
Second, we can compare the projection and kernel methods. The kernel method is much more time consuming than the projection method (by a factor of more than 10). We can see that, for the operational Optimal first method, the kernel strategy seems better for the first two examples, while the projection method wins in the two other cases. Nevertheless, the gap between the risks is never very large.
Mixed kernel and projection strategy. Finally, we implement a mixed strategy where a kernel estimator for f, based on the kernel (4.6), is plugged into a Hermite projection estimator of the product fg. The selection procedure is performed via model selection as described above. Following the same strategies as previously, we display the estimated normalized L^2-risks for the four examples. The adaptive Optimal first strategy is conducted by adaptively selecting the smoothest density with a model selection procedure, even if the plugged-in estimator is a kernel. Comparing the results displayed in Table 5 with the above results, we observe an analogous behavior for the mixed strategy as for the projection and kernel ones.
Table 5. Mixed Kernel-Projection: L^2-risks with std in parentheses (both multiplied by 10^2). For Optimal first, the bold subscript is the number of times where Y is selected first.

Concluding remarks
In this paper, we have shown that an optimal strategy for estimating a product of densities is not to take the product of two estimators, but to plug an overfitted estimator of one of the densities into the estimator of the product. This can be done both with projection and kernel estimators, and adequate model or bandwidth selection methods are proved to deliver adaptive estimators. We have implemented these methods and demonstrated their good numerical performance in dimension 1.
We assumed that the two samples have the same size, but the case where the X-sample has size n_X and the Y-sample has size n_Y is worth studying, for instance if n_X = γn, n_Y = n, γ ∈ (0, ∞). Following the steps of the proof in the projection case suggests that the procedure can be adapted and leads to similar results, with a rate induced by the smallest sample size (1 ∧ γ)n.
We considered a product of two densities, but generalizations to products of other functions, or to products of more than two densities, may be worth studying. Lastly, it is likely that our methods would extend to dependent variables, provided that the two sequences remain independent, but this should be further investigated.
If we come back to the Nadaraya-Watson problem that initiated our question, we have justified, in our context, that plugging in an overfitted estimator is an optimal strategy. Other contexts where overfitting has been recognized as judicious exist (see Chinot and Lerasle (2020)). The next step, as the original problem is a ratio, is to address the question of estimating 1/f when f is a density.

Proof of Proposition 2.1
Consider the classical bias-variance decomposition E||\widehat{(fg)}_{K,K*} − fg||^2 = B + V, with B the integrated squared bias and V the integrated variance. We first study the bias term B: note that E[\widehat{(fg)}_{K,K*}(x)] = (f_{K*} g)_K(x), so that B involves B_2 = ||(fg)_K − fg||^2, the standard integrated squared bias, plus a term which is controlled, using (A1), by means of ||f − f_{K*}||^2. Next, we split V so as to involve the conditional variance given (X_1, ..., X_n). For the first term, as the sum is composed of centered terms, we get a bound in which we used successively (A2), (A4) and (A3). Finally, we write, using (A5), a last bound, and (A2) gives the last line. Using ∫ g^2 ≤ ||g||_∞ and gathering all the bounds completes the proof.

Proof of Proposition 2.2
In case [P], we have, denoting by ψ_m the orthogonal projection of ψ on the linear space S_m := span(ϕ_0, ..., ϕ_{m−1}), that ψ_{K_m} = ψ_m, so Assumption (A1) holds with C_1 = 1. On the other hand, for (A2), we have ∫ K_m^2(x, y) dx = Σ_{0≤j≤m−1} ϕ_j^2(y), so that L(m) = sup_{y∈I} Σ_{0≤j≤m−1} ϕ_j^2(y). In case [Ker], we denote by u ⋆ v(x) = ∫ u(y)v(x − y) dy the convolution product. Then, by the Young inequality (8.1), we get ||ψ ⋆ K_h|| ≤ ||K||_1 ||ψ||, and thus (A1) holds with C_1 = ||K||_1.

Proof of Proposition 2.3
In case [P], we have E[\hat f_{m*}(y)] = f_{m*}(y), where f_{m*} is the projection of f on S_{m*}, which is bounded in terms of ||f||_∞, and (A3) holds with the corresponding constant. Therefore, (A4) holds with C_4 = 1. Lastly, recalling that \hat a_j = n^{−1} Σ_{i=1}^n ϕ_j(X_i), and noting that a_j = ⟨ϕ_j, f⟩ = E(\hat a_j), we have a bound in which the omitted term is the opposite of a nonnegative quantity (it can be written as the opposite of a square). It follows that (A5) holds with C_5 = ||f||_∞. Now we consider case [Ker] and note the analogous bound, where we used Young's inequality. Therefore, (A5) holds in this case as well.

Proof of Theorem 3.1
The proof starts with decompositions which are standard when studying Goldenshluger and Lepski (2011) type methods and bounds. For suitable κ and κ', we get (see Comte (2017), Section 4.2) a decomposition whose first right-hand-side term has an expectation controlled by applying Proposition 2.1, and the term V(τ) can be associated with it. Only A(τ) must be studied, and we write the corresponding decomposition.
The bound on the middle term is obtained with (A1), which refers to an adequate bias term. For the last term, we use (A1) and get a bound whose expectation is also studied in Proposition 2.1, together with a bound involving B(0, 1), a countable set of square-integrable functions t with ||t|| = 1, and the empirical process ν_n. Therefore, recalling the definition of A(τ) given by (3.1), the result of Theorem 3.1 holds if we prove that, for two constants c_1, c_2, the bound (7.3) holds. Indeed, plugging (7.3) into (7.2), and the result into (7.1), gives the result of Theorem 3.1.
We now prove (7.3). We split ν_n into four terms ν_{n,1}, ..., ν_{n,4}, where, for some positive constant c_0 to be defined in the sequel, we set the corresponding definitions.

Study of ν n,2
We start with the study of ν_{n,2}, as it leads to fixing c_0, and we first establish that E(sup_{t∈B(0,1)} |ν_{n,2}(t)|) ≤ n^{−p} for some positive p. Using (A9), we obtain that \hat f_{τ*}(Y_1)^2 ≤ C_9 L^2(τ*) ≤ C_10 n^2 (where the last inequality follows from (A10)), and, with (A7), a further bound. We conclude by applying the Bernstein inequality to Z_i = K_{τ*}(X_i, y), with v_2^2 a bound on Var(Z_i) and b_2 an a.s. bound on the Z_i's. We find with (A9) that b_2 = √(C_9) L(τ*) suits, and with (A2) we bound Var(Z_i); using (A10), it follows that the deviation term is negligible. Then, gathering (7.4) and (7.6) leads to a bound on E(sup_{t∈B(0,1)} |ν_{n,2}(t)|). As a consequence, under (A8), we get the corresponding term of (7.3).
Study of ν_{n,1}. We apply the Talagrand inequality (see Lemma 8.1) to ν_{n,1}, conditionally on X. Using that t ↦ ν_{n,1}(t) is linear, the Cauchy-Schwarz inequality and ||t|| = 1, we get a first bound. Next, note that the relevant supremum is controlled using (A1) and (A7). Applying Lemma 8.1 with δ = 1/2, it follows that a Talagrand bound holds; since it does not depend on X, the inequality holds unconditionally in expectation. Therefore, under (A8) and as L(τ) ≥ 1, we get, for C a positive constant, the corresponding term, using also (7.5), (A1) and (A2). Next, note that, using the Cauchy-Schwarz inequality and (A1), a similar control holds; applying Lemma 8.1 with δ = 1/2, it follows that, under (A8), we get, for C a positive constant, the corresponding term.
Study of ν_{n,4}. Again we apply the Talagrand inequality; similar computations enable us to derive the required bounds from (A1) and (A7). Therefore, under (A8), we get, for C a positive constant, the last term.
As a consequence, gathering (7.7), (7.10), (7.11) and (7.12) gives the result (7.3), for C a positive finite constant depending on a, ||f||_∞, ||g||_∞, C_1 and C_7.

Proof of Corollary 3.2
We already checked that [Ker] satisfies (A1) and (A2), and that, for f bounded, Assumptions (A3)-(A5) hold under L(h*) ≤ n, which is ensured by [K3] (for n large enough). Next we compute the relevant integral, where we successively make the affine changes of variables u = x − z and v = u + y. Using that K is even, it follows that Assumption (A6) is fulfilled. Using [K1], we easily bound the remaining quantity. As in Comte and Marie (2021), we decompose ψ_n into several terms, which are handled through (7.14). We finally state a series of lemmas that allow us to establish Theorem 4.2.
Lemma 7.1. Under the assumptions of Theorem 4.2, it holds that ψ_{2,n} is controlled as stated, where C = C(f, g, K) is a positive constant depending on f, g and K.
Lemma 7.2. Under the assumptions of Theorem 4.2, for every ϑ ∈ (0, 1), the stated bound holds. Lemma 7.3. Under the assumptions of Theorem 4.2, for every ϑ ∈ [0, 1], the stated bound holds. We deduce from (7.13) the key decomposition. We have, for all positive h, and for all positive θ, the standard inequalities; applying these with h = \hat h, we get a first bound, where we used Lemma 7.1. Now, using Lemmas 7.2 and 7.3, we get a further bound, provided that 1 − κ/θ ≤ 0. Therefore, we choose κ ≥ θ and apply Lemma 7.2 again. We obtain a bound, and similarly a symmetric one. Plugging the last two bounds into the expectation of (7.16) implies an intermediate inequality. Now, applying Proposition 2.1 with a rougher bound on the variance (the constant is larger), we get, for θ ∈ (0, 1/4) and κ ≥ θ, the next inequality. We conclude that, for κ ≥ 1/4 and all θ ∈ (0, 1/4), the risk bound holds, where c_3 and C are positive constants depending on θ, f, g, K. Taking θ = 0.2, we get the result given in Theorem 4.2.
7.4.1. Proof of Lemma 7.1. For the study of ψ_{2,n}, note the preliminary bound; as a consequence, for all positive h, h', a comparison holds. From the definition of ψ_{2,n} given by (7.15), it follows that the result of Lemma 7.1 holds with C depending on ||g||_∞ and on the kernel constants.
7.4.2. Proof of Lemma 7.2 and study of V_n. We have, for all positive θ, an inequality implying (7.17). Next, we write a decomposition and apply the Bernstein inequality. Using that K is even, the variance bound is obtained, and an a.s. bound is available as well. Therefore, the Bernstein inequality (8.2) implies that, with probability larger than the stated level, the deviation is controlled; this, together with (8.3), handles the term V_{n,2}. Here, we apply the Bernstein inequality conditionally on X, with the corresponding b and v; the orders of b_{h,h'} and v_{h,h'} being the same as for V_{n,2} and independent of X, the result for V_{n,1} follows. Lastly, recalling (7.5) and using, as in (7.6), the Bernstein inequality, we get the last bound under [K3].
Warning. For the study of the next term, in order to avoid burdensome technicalities, we assume that \hat f_{h*} is bounded by 2||K||_1 ||f||_∞. We proved in the study of V_n (see (7.20)) that the probability of the complementary event is at most 1/n^4 under [K3].
• Treatment of U_n^{(1),(1)}(h, h_min). We have, by analogy with Lemma 6.2 in Comte and Marie (2021), that the stated bound holds for every ϑ ∈ [0, 1], and it is easy to see that all bounds do not depend on X, so de-conditioning is straightforward.
• Treatment of U_n^{(2),(3)}(h, h_min). This term is easy to handle thanks to an equality relating it to an already treated term. The process can be written as a sum of variables Z_i^{2,3}. To apply the Bernstein inequality, we need to bound the variance and the sup-norm of the Z_i^{2,3}'s. For the moment of order 2 and for the upper bound, the stated controls hold. Then the Bernstein inequality implies that, with probability larger than 1 − 2e^{−λ}, λ > 0, for any ϑ ∈ (0, 1), the deviation is controlled. As a consequence, we obtain the corresponding bound, and the conclusion then follows from (8.3).
• Treatment of U_n^{(1),(3)}(h, h_min). We apply the Bernstein inequality conditionally on X, bounding the variance by an iterative application of the Young inequality and controlling the sup-norm directly. The bounds do not depend on X, so the stated inequality holds for a constant C > 0 depending on f, g, K. Moreover, the bounds do not depend on h, h_min, so the same bound holds for U_n^{(3),(1)}(h, h_min)/n^2.
• Treatment of U_n^{(1),(2)}(h, h_min). First, we apply a Bernstein inequality conditionally on X; the variance term and the upper bound are obtained with the usual tricks. The resulting quantity splits into two parts. The first part is bounded by taking the expectation of the conditional Bernstein bound, whose constants b, v are independent of the X_i's. For the second part, we use the Talagrand inequality, relying on a linear empirical process indexed by B(0, 1), a countable dense subset of {t ∈ L^2(R), ||t|| = 1}. To apply the Talagrand inequality, we compute H^2, v, b: H^2 follows from (7.25), and b is computed directly. The Talagrand inequality then gives the corresponding bound. As a consequence, since under [K4] and h ≤ 1 we have Σ_{h∈H_n} exp(−c_1/h) ≤ Σ < +∞ and card(H_n) ≤ n, we get the required control.
The decomposition of this term involves, first, a U-statistic related to X, and terms corresponding to i = j that are studied separately. Developing the latter product leads to the study of four terms: two cross-terms, which are bounded directly; the product of the last terms, which can be rewritten; and the product of the first terms, which leads to a bound involving H_i(u, v, x) and ∫ H_i(u, v, x) g(u)g(v) du dv dx.
Let us deal with the U-statistic U_n^X(h, h_min). We follow the line of the proof of Lemma 6.2 in Comte and Marie (2021) and write U_n^X(h, h_min) = Σ_{1≤i≠j≤n} G_{h,h_min}(X_i, X_j). Indeed, G_{h,h_min}(X_i, X_j) = G_{h_min,h}(X_i, X_j), since for all functions u, v it holds that ⟨u ⋆ K_h, v ⋆ K_{h_min}⟩ = ⟨u, v ⋆ K_{h_min} ⋆ K_h⟩ = ⟨u ⋆ K_{h_min}, v ⋆ K_h⟩.
We apply the deviation inequality for U-statistics of order 2, as in Lacour et al. (2017); see Theorem 3.4 in Houdré and Reynaud-Bouret (2003). Following the notations of the aforementioned papers, we have to compute four bounds a_n, b_n, c_n, d_n. First, a_n is a bound on sup_{z,z'} |G_{h,h_min}(z, z')|, which is bounded directly. Next, b_n^2 is a bound on n sup_z E[G^2_{h,h_min}(z, X_1)], which we compute. We then compute c_n^2, which is a bound on n^2 E[G^2_{h,h_min}(X_1, X_2)], by decomposing it into four squared terms; the first two are bounded directly, the twin term in (h_min, h) clearly has the same bound, and the last one is bounded as well. We get that, for all positive θ, λ, the stated deviation holds. Lastly, the term d_n is a bound on the supremum over functions a, b of Σ_{1≤i≠j≤n} E[a_i(X_i) b_j(X_j) G_{h,h_min}(X_i, X_j)], where the functions a_k(·), b_k(·), k = 1, ..., n, satisfy the usual normalization in expectation. Using the independence, for i ≠ j, between functions of X_i and functions of X_j, we get that the term inside the supremum is bounded as stated. Plugging this into formula (7.28), we get the bound on d_n. It then follows, by applying the deviation inequality for U-statistics of order 2, that the announced control holds.

Appendix
In the paper we make an extensive use of the following:
• The Young inequality: for u ∈ L^p(R^d) and v ∈ L^q(R^d), with 1 ≤ p, q ≤ r ≤ ∞ and 1 + 1/r = 1/p + 1/q,
$$\|u \star v\|_r \le \|u\|_p\,\|v\|_q. \qquad (8.1)$$

First we state that the above functions fulfill Assumptions (A1)-(A2).
Proposition 2.2. The functions K_m defined in [P] and K_h defined in [Ker] satisfy Assumptions (A1)-(A2) and belong to K.
The order of L(K_m) := L(m) depends on the choice of the basis. In dimension 1, for the trigonometric basis, with m odd and I = [0, 1], it holds that L(m) = m. For the Hermite basis, where I = R, we have L(m) ≤ C_H √m (see Lemma 1 in Comte and Lacour (2021) and Section 2.4). For the Legendre polynomial basis, where I = [−1, 1], it holds that L(m) = m^2, see Cohen et al. (2013, p. 831). In any case, we consider that L(m) ≥ 1, which holds at least for m ≥ m_0. The extension to dimension d is straightforward by tensorization of the bases, with L(m) = Π_{j=1}^d L(m_j). In the kernel case, we simply write L(K_h) := L(h).

7.4. Proof of Theorem 4.2
Following the first step in Lacour et al. (2017), we write the corresponding decomposition.
Remark (on the estimator (2.1)). This estimator is considered in Lerasle et al. (2016) and, for specific choices of K*, it covers for instance projection estimators, kernel estimators or weighted projection estimators. The first two examples are detailed in Section 2.2. Remark also that the relevant quantity is bounded with large probability, see (7.6).
Theorem 4.2. Consider the kernel case [K]. Assume that Assumptions [K1]-[K4] hold and that 1/(nh_min) ≤ 1. Then, for any κ ≥ 1/4, the stated oracle-type risk bound holds.
Simulation setup. Estimators and associated deviations are computed from N = 100 independent datasets, with different values of the sample size n = 200, 1000 and 2000. All adaptive methods require the calibration of the constants κ appearing in the penalties. This is done by preliminary simulation experiments. For calibration strategies (dimension jump and slope heuristics), the reader is referred to Baudry et al. (2012), and to Lerasle (2012) for theoretical justifications. Here, we test a grid of values of κ; the tests are conducted on a set of densities different from the ones considered hereafter, to avoid overfitting. The different estimators are computed on the same datasets and compared.

Table 1. Case [BU]: L^2-risks with std in parentheses (both multiplied by 10^2), Dmax = 130. For Optimal first, the bold subscript is the number of times where Y is selected first.

Table 3. Case [NL]: L^2-risks with std in parentheses (both multiplied by 10^2), Dmax = 100. For Optimal first, the bold subscript is the number of times where Y is selected first.

Table 4. Case [NC]: L^2-risks with std in parentheses (both multiplied by 10^2), Dmax = 50. For Optimal first, the bold subscript is the number of times where Y is selected first.