Recursive estimation of the conditional geometric median in Hilbert spaces

,


Introduction
It is not unusual nowadays to get large samples of high-dimensional or functional data together with real covariates that are correlated with the functional variable under study.The estimation of how the shape of the functional response may depend on real or functional covariates has been deeply studied in the statistical literature : linear models for functional response have been proposed by Faraway (1997), Cuevas et al. (2002) or Bosq (2000) (see also Ramsay and Silverman (2005)) and Greven et al. (2010) whereas nonlinear relationships are studied in Lecoutre (1990), Chiou et al. (2004), Lian (2007), Cardot (2007), Lian (2011) andFerraty et al. (2011).
The main drawback of all the above mentioned estimators, whose target is the conditional expectation, is that they all rely, explicitly or not, on least squares and are consequently sensitive to outliers.In such a context of large samples of high dimensional data, outlying observations, which may not be uncommon, might be hard to detect with automatic procedures.Directly considering robust indicators of centrality such as medians is a way to deal with this issue.If Y be a random variable taking values in a Hilbert space H, its geometric median m (also called spatial median or L 1 -median, see Small (1990) for a survey) is defined as follows (1) The median m is uniquely defined under simple conditions when the dimension of H is larger than or equal to 2, it has a 0.5 breakdown point (Kemperman (1987)) as well as a bounded gross sensitivity error (Cardot et al. (2011)).When one has a sample at hand, algorithms based on the minimization of the empirical version of risk (1) have been proposed by Vardi and Zhang (2000) and properties of such robust estimators can be found in the recent review by Möttönen et al. (2010).Nevertheless, these computational techniques may not be able to handle very large samples of high-dimensional data since they require to store all the data.An alternative approach, developed by Chaouch and Goga (2012) and which can cope with this issue, consists in considering unequal probability sampling techniques in order to select, in a effective way, subsamples with sizes much smaller than the initial sample size.
We suggest in this work another direction based on recursive techniques which do not require to store all the data.Another interest of these recursive approaches is that they allow automatic update of the estimators if, for example, the data arrive sequentially.Recently, a simple recursive algorithm which gives efficient estimates of the geometric median in separable Hilbert spaces has been proposed by Cardot et al. (2011).It is shown that averaged versions of classic stochastic gradient algorithms have a limiting normal distribution that is the same as the distribution of the static estimator based on a direct minimization of the empirical version of risk (1).
In a finite dimension context, Cadre and Gannoun (2000) and Cheng and De Gooijer (2007) proposed to introduce a kernel function K in the empirical version of (1) in order to take covariate effects into account.The kernel weights are controlled by a sequence of bandwidth values that tends to zero when the sample size increases in order to build consistent estimates of the conditional geometric median.With the same ideas of local approximation of the conditional distribution, we study, in this work, a modification of the recursive algorithm suggested in Cardot et al. (2011).It consists in introducing weights, controlled by a kernel function, in order to build consistent recursive estimators of the conditional geometric median.The response variable is also allowed to take values in a separable Hilbert space.For real response, recursive estimators of the regression function based on kernel weights have been introduced by Révész (1977) whereas a deep study of their asymptotic properties, which also includes averaged estimation procedures, is proposed in Mokkadem et al. (2009).
The paper is organized as follows.In Section 2, we first define the stochastic gradient recursive estimator as well as its averaged version for the case of a real covariate.Note that our results could be extended to multidimensional covariates.We state the asymptotic normality, under general conditions, of the averaged algorithm in separable Hilbert spaces, with an optimal rate of convergence.The regularity hypotheses, which are much weaker than those of Cadre and Gannoun (2000), are also expressed in terms of the Wasserstein distance between the conditional distributions.
In Section 3, a comparison of the static approach, which consists in minimizing the empirical version of risk (1), with the stochastic gradient estimator and its averaged version is performed on a simulation study.It confirms the good behavior as well as the stability, with respect to the descent steps, of the averaged algorithm.The ability of this estimator to deal with large samples of very high-dimensional data is then illustrated on the estimation of television audience profiles given the total time spent watching television.Proofs are gathered in Section 4.

Notations, hypotheses and main results
Let (Y, X) be a pair of random variables taking values in H × R, where H is a Hilbert space whose norm is denoted by • .Suppose that X is continuous, and denote by p(x) its density at x ∈ R. For any x in the support of X, denote by µ x the conditional law of Y given X = x.Consider, for (α, x) ∈ H × R, the following functional (2) The geometric median of Y given X = x, denoted by m(x), is defined as the solution of the following optimization problem: (3) The solution of ( 3) is unique provided that the conditional distribution µ x is not supported by a straight line (Kemperman (1987)).We suppose from now on the following assumption.

A1.
For every x in the support of the probability density function p of the random variable X, µ x is not concentrated on a straight line: for all v ∈ H, there is w ∈ H such that v, w = 0 and Suppose we have a sequence (X n , Y n ) n≥1 of independent copies of (X, Y).In the unconditional case where the X variable is not taken into account, one can look for the unconditional median, i.e. the minimum m defined by (1).Under weak hypotheses, the median is uniquely defined as the zero of the derivative: We introduced in Cardot et al. (2011) the following recursive estimator of m: where γ n was a well-chosen deterministic sequence.In the present case, the law of Y n is not the conditional law µ x , so this idea does not work directly.However, it is natural to see Y n as an approximate sample of µ x if X n happens to be very close to x.Therefore, a simple estimator can be built by introducing weights, through a kernel function K, whose properties will be specified later.We modify (5) as follows to take the weights into account, and define our recursive estimator of m(x): with two deterministic sequences of tuning parameters h n and γ n whose properties are given below.
For a constant sequence (h n ), this algorithm converges towards the minimum of the modified objective function: The partial derivative of G h with respect to α is an element of H defined by We will see in Proposition 4.1 that, under suitable hypotheses, when h goes to zero, Φ h goes to the gradient Φ of G, defined by: The idea of using a kernel, and of assigning a large weight to Y n when X n is close to x can only work if the conditional law µ x varies, in some sense, regularly.A natural way of expressing this regularity is through the Mallows-Wasserstein distance.Let us recall its definition.
Definition 1.Let µ and ν be two probability measures on H with finite second order moments.Let C be the set of couplings of µ and ν, i.e. the set of measures π on H × H whose first marginal is µ and whose second marginal is ν.
The Wasserstein distance between µ and ν is given by: We may now state our assumptions.
A2.The probability density function p of the random variable X is bounded and satisfies a uniform Hölder condition : there are two constants β > 0 and C 2 > 0 such that We denote by p max = sup x∈R p(x).
A3.The gradient Φ(x, α) defined by (9) satisfies a uniform Hölder condition with coefficient β.There is A4.The conditional law µ x = L(Y|X = x) varies regularly with x: there are two constants C 4 and A5.The kernel function K is positive, bounded with compact support and satisfies R K(u)du = 1.

A6.
There is a constant C 6 such that: Remark 1.Without loss of generality, we suppose that the constant β in A2, A3 and A4 has always the same value.Assumption A3 is a regularity assumption that is required to control the approximation error and to prove the convergence of the algorithm.Assumption A4 seems to be more natural, and we prove in section 4.1 that, together with A6, it implies A3.
Hypotheses A2 and A5 are classical in nonparametric estimation and could be weakened at the expense of more complicated proofs.For classical properties of kernel estimators under general hypotheses, see for example Wand and Jones (1995).
Similarly, Hypothesis A6 is stated quite strongly here, in order to avoid additional technicalities in the proof of the asymptotic normality if the averaged algorithm.See Cardot et al. (2011) for a relaxed version, under which the same results should hold.Informally it forces the law to be "spread out" and this avoids pathological behaviors of the algorithm.
We have three main results.The first one states the almost sure convergence of the algorithm.
Theorem 2.1.Under assumptions A1-A3 and A5, and if Remark 2. In the following, for simplicity, we choose the step size and window size as inverse powers of n: With these choices the assumptions on the step sizes are: The assumptions on h and γ are always satisfied if we choose γ = 1 and h < 1.However, as shown in the simulation study, the performances of algorithm (6) strongly depend on the choice of the steps γ n and particularly on the constant c γ .Therefore, we also introduce the following averaged algorithm which is less sensitive to the choice of the step sizes γ n and has nice convergence properties, Our main result is a central limit theorem on this averaged algorithm.To adapt the proof of the corresponding CLT from Cardot et al. (2011), we need a good a priori bound on the error Z n (x) − m.
Proposition 2.2.Suppose that x is such that p(x) > 0 and that γ ≤ 1, 2γ − h > 1, γ + βh > 1, and h(1 + 2β) ≥ γ.Under Assumptions A1-A3 and A5, there exist an increasing sequence of events (Ω N ) N∈N , and constants C N , such that Ω = N∈N Ω N , and This proposition tells us that, up to a logarithmic factor, the optimal rates of convergence in nonparametric estimation can be attained for well chosen values of the parameter γ and h Finally our main result is the following central limit theorem for the averaged algorithm.
Theorem 2.3.Assume A1, A2 and A4-A6.Let x satisfy p(x) > 0. where As shown in Cardot et al. (2011) in the unconditional framework, the operator Γ has a bounded inverse under assumption A1, so that the asymptotic variance operator is well defined.Let us also remark that with our assumptions on the sequence of bandwidths, we have Consequently, the rate of convergence in the CLT is of order √ nh n , which is the usual rate of convergence in distribution for nonparametric regression, provided that the bias term is negligible compared to the variance.This latter condition is ensured by the additional condition h < (2β + 1) −1 and we have, with Theorem 2.3, As in the real regression case (see Mokkadem et al. (2009)) it turns out that the averaged estimator has a smaller asymptotic variance, with in our case a factor (1 + h) −1 , than the classical kernel estimator which minimizes the empirical version of risk (7).
Remark 3. Proceeding exactly as in the proof of Theorem 2.3, it is possible to establish a CLT for another weighted version of the algorithm Under the same assumptions of Theorem 2.3, one has:

Examples
We first consider a simple simulated example in order to compare the performances of the averaged algorithm with the more classic static one as well as the recursive Robbins-Monro estimator without averaging.Then, the ability of our recursive averaged estimator to deal with large samples of very high-dimensional data is illustrated on the robust estimation of television audience profiles, measured at a minute scale over a period of 24 hours, given the total time spent watching television.All functions are coded in (R Development Core Team ( 2010)) and are available on request to the authors.
In this picture we represent the possible choices for the parameters h and γ, when β varies.On the left is the most regular case where β = 1, on the right we set β = 1/2.In both cases, if (γ, h) lies in the lighter region, Theorem 2.1 holds and the algorithm converges.In the middle region, the algorithm converges and the additional convergence estimate of Proposition 2.2 holds.Finally, if (γ, h) is in the darker region, the CLT of Theorem 2.3 holds.All these two regions get smaller when β is small.Note that even in the most regular case β = 1, in order to fulfill the hypotheses of Theorem 2.3, it is necessary to choose γ larger than 2/3 and h larger than 1/3.

A simulated example
Consider a Brownian motion Y measured at d equispaced time points in the interval [0, 1], so that we have Y = (Y(t 1 ), . . ., Y(t d )).Besides, suppose that we know the mean value X = 1 0 Y(t)dt of each trajectory Y.We can look for the conditional (geometric) median of vector Y given X.The joint distribution of (Y, X) is clearly Gaussian with EY = 0, EX = 0, Consequently, the distribution of Y given X = x is Gaussian with conditional expectation, for j = 1, . . ., p, and a covariance matrix that does not depend on x.By symmetry of the Gaussian distribution, it is also clear that the conditional expectation is equal to the conditional geometric median, when H = R d equipped with the usual Euclidean norm, so that The hypotheses on the density p are clearly satisfied since X is a Gaussian random variable.Furthermore, the Wasserstein distance between two Gaussian laws with expectations m 1 and m 2 and the same covariance matrix is simply m 1 − m 2 , (see e.g.Givens and Shortt (1984)) so that we can deduce, with (20), that β = 1 in Assumption A4.
We draw n i.i.d.copies of (Y, X) and we focus in this simulation study on the geometric median of Y given x = 0.39, which corresponds to the value of the third quartile of X.Note that our conclusions remain unchanged for other non extreme values of X.
We first compute the static estimator, named "static kernel" in the following.It is based on a direct minimization, with the Weiszfeld's algorithm (see Vardi and Zhang (2000) and Möttönen et al. (2010)), of where and K is the Gaussian kernel.The Robbins Monro estimator Z n , defined in (6), and the averaged estimator Z n , defined in (15), are run for 10 starting points chosen randomly in the sample.Among the 10 estimations, we retain the one with the smallest empirical risk (21).
The accuracy of the different estimators m are compared, for different values of the bandwidth h and sample sizes n, with the quadratic criterion, Since β = 1, we can choose γ = 9/10 and h = 3/10, c h = 1, so that the quadratic estimation error for the Robbins-Monro algorithm, will be, up to the ln(n) factor, of order n −6/10 (see Proposition 2.2).Note that, for simplicity of comparison with the static kernel estimator, we also consider fixed values for h n ∈ {0.05, 0.10, 0.15, 0.20, 0.25} and take in this case γ = 2/3.We are aware that the assumptions needed for the asymptotic convergence are not satisfied but the sample size is fixed in advance here.
We first present in Table 1 the mean value, over 500 replications, of the MSE defined in (22), when estimating the conditional median with a sample size of n = 500 in dimension d = 100.For comparison and interpretability of the results, note that 100R(0) = 18.4.
We note that, when the sample size is moderate (i.e.n = 500), the interest of considering the averaged recursive estimation procedure is less evident than in the unconditional case (see Cardot et al. (2010)) since the Robbins-Monro estimator Z n defined in (6) can perform, for well chosen values of the tuning parameters c γ and h n , nearly as well as the static estimator.Nevertheless, we can remark that Z n is highly sensitive to the values of the tuning parameters and its performances deteriorate much with small variations of these parameters as seen in Table 1.This is not the case of the averaged estimator Z n , defined in (15), which is much less sensitive and thus allows less sharp choices of the values of the tuning parameters provided the descent steps do not force the algorithm to converge too rapidly.We note again (see Cardot et al. (2010)) that for too small values of c γ (i.e.c γ = 0.1), the algorithm converges too quickly and averaging leads to estimations that are outperformed by the direct Robbins-Monro approach.A way to deal with this drawback is to perform averaging only after a certain number of iterations.All these remarks are clearly illustrated in Figure 2 which presents the estimation error, defined in ( 22), for both algorithms and for different values of c γ .
When the sample size gets larger the interest of the averaging step becomes clearer since the estimation error of the Robbins-Monro estimator are always larger as soon as c γ ≥ 1 (see Table 2).Furthermore, the estimation errors of the static kernel estimator and the averaged recursive one are also now very close to each other.

Television audience data
We have a sample of n = 5422 individual audiences measured every minute over a period of 24 hours and by the Médiamétrie company in France.For j = 1, . . ., 1440, an observation Y i (t j ) represents the proportion of time spent by the individual i watching television during the j th minute of this day.Thus, each vector Y i belongs to [0, 1] 1440 .Note that in fact the first measurement t 1 is made at 3 AM of day d and the last one just before 3 AM of day d + 1 (see Figure 3).A more detailed description of these data can be found in Cardot et al. (2011).
We are interested in estimating television consumption behaviors, over a 24 hours period, according to the total time spent watching television.The covariate X, is the proportion of time spent watching television over the considered period, X i = (∑ j Y i (t j ))/1440, for i = 1, . . ., n = 5422.We consider the quantile values of X which are, in the sample, q 25 = 0.0599, q 50 = 0.128, q 75 = 0.225 and q 90 = 0.348.This means for example, that the ten percent of consumers with the highest consumption levels spend more than 34.8 % of their time watching television whereas the 25 % of consumers with the lowest consumption levels spend less than 6% of their time watching television.
We have drawn in Figure 3 the estimated conditional median profiles with a bandwidth value set to h n = 0.05 and a descent parameter c γ = 0.5, for x ∈ {q 25 , q 50 , q 75 , q 90 }.For comparison and better interpretation, we have also plotted the overall geometric median as well as the mean profile.One can note that the shape of the conditional profiles strongly depend on the value of the covariate and that multiplicative models that could be thought to be natural (see the simulation study), are in fact not adapted for modeling the conditional audience median profiles.This is clear if we compare, for example, the levels of the conditional median curves for x = q 75 and x = q 90 at time 15 and at time 21.Around 21, their values are approximately the same and are close to the global maximum whereas at time 15 the value of the conditional median for x = q 90 is about twice the value of the conditional median for x = q 75 .
From a computational speed point of view, for one starting point, our algorithm, which takes less than two seconds, is about 70 times faster than the static estimator which requires 140 seconds to converge.

Proofs
Notation.In all the proofs, x will be a fixed point in R satisfying p(x) > 0. Since x will not vary, we will abuse notation and drop it from various quantities.In particular, in the following m will denote the median m(x) of the conditional law µ x , and we will write Z n = Z n (x) and Φ(α) = Φ(x, α).

About the assumptions
We begin by a simple geometric result on unit vectors.For a, b two points in H, let D(a, b) be the unit vector "starting" from a in the direction of b.Now if a, b, c are three points in H, such that a − b ≤ a − c , Thales' theorem shows that: In any case, We will need a "decoupled" version of this inequality: We can now prove that A4 and A6 imply A3.Let x, x be two real numbers in the support of p. Recall that µ x denotes the law L(Y|X = x).Let Y and Y be two random variables with respective laws µ x and µ x , such that their joint law π achieves the Wasserstein distance.Let us first show that: Fix an α ∈ H.We have: Now we use the geometric bound (24), and Hölder's inequality: The first term is bounded by 2 √ C 6 thanks to A6.The second one is, by definition, the Wasserstein distance, and is bounded by

First properties
Recall that, for z ∈ H, Φ h (z) is defined by (8) as the conditional expectation of the step, with window size h.When h goes to zero, this "expected step" converges.
Proof.With our strong hypotheses this result is easy to prove.Indeed so that by Jensen's inequality Now we use Assumption A3 to bound the norm by C 3 |x − x| β , the compact support of the bounded function K (Assumption A5) and we integrate: Thanks to this result, we have a natural decomposition of algorithm ( 6).Let us introduce the two following quantities: In terms of these quantities, we can rewrite (6) as: The first term D h n (Z n ) will be controlled by Proposition 4.1.The second term ξ n+1 defines a sequence of martingale differences, since the conditional expectation given the sequence of σ-algebra For future reference, let us note the following bound on ξ n :

Almost sure convergence
In this section we prove Theorem 2.1.Define V n = Z n − m 2 .By (30), we have: The first scalar product is non positive, we denote it by (−η n ).We condition by F n : the ξ n+1 in the scalar products disappear by the martingale property.Then we use Hölder's inequality: On the last term we use 2xy ≤ x 2 + y 2 to get: On the last term we bound Z n − m by (1 + V n ) to get: Finally we bound D h n (Z n ) by Ch and Φ(Z n ) by p(x).This yields: Therefore by the Robbins-Siegmund Lemma (Theorem 1.3.12 of Duflo (1997)), V n converges almost surely and ∑ n η n < ∞.This implies that the limit of V n is zero, by the same argument than in Cardot et al. (2011), assuming that ∑ n γ n = ∞.

Proof of proposition 2.2
For the sake of clarity, we follow the same steps as the proof of Proposition 3.2 in Cardot et al. (2011), and emphasize the necessary changes.
Step 1 -a spectral decomposition.This step is exactly the same as in Cardot et al. (2011): thanks to a spectral decomposition of Γ, we can define the operators: Introducing the sequence of real functions, for n ∈ N, we see that each operator β n can be also expressed as follows: their inverses are bounded operators, and satisfy: x e λ .Moreover there exist constants κ 1 , κ 2 , κ 3 such that: where we recall that s n = ∑ n k=1 γ k , and Step 2 -Decomposition of the algorithm.Recall the decomposition (30) , and rewrite the algorithm as follows: where is the difference between the gradient of G and the gradient of its quadratic approximation.Compared to Cardot et al. (2011), there are two differences: the martingale difference ξ n has changed, and there is an additional term γ n D h n (Z n ).Therefore: ∀k, Rewriting where At this point, the first and third term are the same as in Cardot et al. (2011), the martingale has changed and there is an additional remainder term R n .
Step 3 -The deterministic term.Just as in Cardot et al. (2011), we get: Step 4 -The martingale.Still following Cardot et al. (2011), we use the spectral decomposition to deal with the martingale part.The changes appear just before eq.( 41) in that paper, where the bound on E[ ξ k 2 ] has to be changed (from 1 to C/h n , using the new bound ( 31)).Then we use the bounds (32) to get: Once more, the first terms in the sum are negligible (thanks to the exponential), and we isolate the last terms, for k ≥ l(n), where l(n) is given by for some constant c α .Choosing c α large enough, the arguments from Cardot et al. (2011) ensure that the main contribution comes from the last terms.The number of terms, that is n − l(n), is of the order ln(n)n γ , and Step 5 -the error terms.
The first error term is . This one can be treated exactly as in Cardot et al. (2011).We recall the definition of the event Ω N : for a value of K to be chosen later, and l(n) defined by (38).Then, for any power of n (say n −42 ) there is a C such that, on Ω N and for n ≥ N, We now turn to the bound of the new error term Therefore for N large enough, and for k ≥ l(n), For k smaller than l(n), we use the crude bound D h k (Z k ) ≤ p max + 1. Finally we get: The last term is bounded by Cn −βh and dominates the first term.Finally, since by assumption, h(1 + 2β) ≥ γ, one gets Now we use ( 36), ( 39), ( 40) and ( 41) to bound the four terms that appear in (35).We get, for n ≥ N and some new constant C: By the same induction than in Cardot et al. (2011), we obtain the bound announced in Proposition 2.2.

Proof of Theorem 2.3
The following proof follows the same guidelines as the proof of Theorem 3.4 in Cardot et al. (2011).Again we emphasize the necessary changes due to the introduction of the kernel and of the conditional distribution.We first linearize the target function around the conditional median m as in (33): where (ξ n ) is a martingale difference sequence.Therefore, for all k, Define now, and sum (42) over k where Step Zero -convergence of covariance operators Our first task is to establish a central limit theorem for the last term of (43): n where Σ is the limiting covariance defined by ( 17).On the space of linear operators on H we consider two classical norms, the (strong) operator norm and the Hilbert-Schmidt norm: , where e j is an orthonormal base of H.The following lemma will be useful.
Lemma 4.2.Define a random covariance operator Σ n by: Then: In particular, Σ n converges to Σ a.s. in the operator norm.Moreover, if Σ n denotes the following averaged version of Σ n : Finally, for any orthogonal projection operator P, Remark 4. Let us note that the convergence of square roots of covariance operators is equivalent to the convergence of the centered Gaussian laws with these covariances; see e.g.(Bogachev, 1998), Example 3.8.13.
Proof.We first show that the convergence (46) holds in operator norm.Recall that D(x, y) denotes the unit vector (y − x)/ y − x .Let us rewrite Σ n .
Denote by (X, Y) a couple of random variables with the original joint law, and Y x be a random variable with law µ x , independent from (X, Y).
We decompose the difference where Note that only the first and the last terms are random; the others depend on n only through the quantity h n .For (a, b, c) ∈ H 3 , it is easy to see that: where we used (24) in the last line.Therefore: Conditioning on X and using Assumption A6, we get: The boundedness of p and the finiteness of by dominated convergence; therefore the sequence (h Since Z n converges a.s. to m, D 1 op converges a.s. to zero.
The second term D 2 is treated similarly; we get: Recall that µ x = L(Y|X = x) and let µ x,x be a coupling of µ x and µ x that achieves the Wasserstein distance.We condition on the value of X and we apply Hölder's inequality in order to bound the first integral with Assumption A6 and the second one with Assumption A4: In the third term D 3 , since Y x is independent of X we may write Thanks to (49), this converges to zero.Finally, by Proposition 4.1, Φ h n (Z n ) is almost surely bounded, and since a ⊗ b op ≤ a b , Therefore, Σ n converges to Σ in the operator norm.
To prove the convergence of Σ n in operator norm, observe that Since Σ k − Σ op converges to zero, the conclusion follows by the Toeplitz lemma.
Let us show that these convergences hold in the Hilbert-Schmidt norm.For any a, Tr(a ⊗ a) = a .Therefore: . Another application of the Toeplitz lemma shows that . By the same reasoning as in Example 3.8.15 of (Bogachev, 1998), this implies the H.-S. convergences ( 46) and (47).Finally, let P be an orthogonal projection operator.Choose a basis (e i ) i∈N of orthonormal eigenvectors of P: Pe i = 0 or Pe i = e i .Since Σ n is trace-class, so is Σ n P and: This convergence is almost sure.Since Tr(Σ k ) ≤ 1 h k E K 2 ((X − x)/h k ) ≤ C, the convergence also holds in L 1 by dominated convergence, and (48) holds.
Step 1 -The CLT for the martingale.To prove the CLT (44), let us check that the assumptions of Theorem 5.1 in (Jakubowski, 1988) are fulfilled.Reminding (19), translated in our context, these assumptions are: ∀ε > 0 lim where (e n ) n∈N is an orthonormal basis of H and ψ i,j := Σe i , e j .We deal with condition (50) by applying Markov's inequality.Let η > 0.
Condition (51) is a consequence of the law of large numbers for martingales.Let us consider (e n ) n∈N an orthonormal basis of H. From the decomposition ξ n+1 , e i ξ n+1 , e j = E ξ n+1 , e i ξ n+1 , e j F n + ε n+1 , with ε n+1 := ξ n+1 , e i ξ n+1 , e j − E ξ n+1 , e i ξ n+1 , e j F n , we have By Lemma 4.2, the matrix element e i , Σ n e j converges to ψ i,j .The law of large numbers for the martingale (∑ n k=1 ε k+1 ) n∈N whose increasing process is of order Step 2 -The remaining terms are negligible.Now, it remains to prove that all the other terms in (43) converge in probability to zero.Due to the equivalence (19), we have to prove the convergence in probability to zero of Recall that E 1 Ω N T n 2 ≤ C N

ln(n)
n γ h n , thanks to Proposition 2.2.For the first term A n = T n+1 γ n , we have: The end of the proof follows the same guidelines as in Cardot et al. (2011).

Figure 1 :
Figure 1: Possible choices for h and γ.

Figure 2 :
Figure 2: Comparison of the two recursive algorithms according to the mean square error of estimation for different values of c γ (with a logarithmic scale).The sample size is n = 500 and d = 100.

Figure 3 :
Figure 3: Estimation of the conditional median profile for different levels of total time spent watching television, on the 6th September 2010.
≥ 1.We chose an integer p such that p > 2. By convexity of the function x → x p , we have, for any n,+ 2 p−1 E Φ h n (Z n ) p .In the last term, Φ h n (Z n ) is bounded, thanks to (26).Consequently, there exists a constant C(p) (independent of n) such that .Hence we have, for a constant C (p) independent of n, It remains to check condition (52).Let ε > 0. Applying Markov's inequality, we have , Σ n e j .Call P N the orthogonal projection on the e i , i ≥ N.