Non asymptotic controls on a recursive superquantile approximation

In this work, we study a new recursive stochastic algorithm for the joint estimation of the quantile and superquantile of an unknown distribution. The novelty of this algorithm is to use the Cesaro average of the quantile estimates inside the recursive approximation of the superquantile. We provide sharp non-asymptotic bounds on the quadratic risk of the superquantile estimator for different step-size sequences. We also prove new non-asymptotic $L^p$-controls on the Robbins-Monro algorithm for quantile estimation and its averaged version. Finally, we derive a central limit theorem for our joint procedure, using the diffusion approximation point of view hidden behind our stochastic algorithm.

1. Introduction

1.1. Quantiles and superquantiles. In this paper, we are interested in the estimation of a standard financial risk measure with a recursive stochastic algorithm. Given a real random variable X that mimics the outcome of a portfolio of financial assets, [ADEH99] introduces a set of axioms that a coherent risk measure (denoted by ρ in this introduction) shall satisfy in actuarial science and financial economics. One of these key properties is that ρ must encode the diversification principle: the risk of two portfolios together is less than the sum of the two individual risks, which means from a mathematical point of view that for two random variables Z_1 and Z_2, ρ satisfies ρ(Z_1 + Z_2) ≤ ρ(Z_1) + ρ(Z_2). In some recent works, this subadditivity property has been relaxed and replaced by a convexity axiom: ρ(λZ_1 + (1 − λ)Z_2) ≤ λρ(Z_1) + (1 − λ)ρ(Z_2). While the value at risk (VaR), i.e. a quantile of a random variable, is a very popular measure of risk in finance, it turns out not to be a coherent risk measure since it does not satisfy the diversification principle. Oppositely, [ADEH97] introduced the Conditional Value at Risk (CVaR) measure (or expected shortfall), which is shown to be a coherent risk measure in [Pfl00]. More precisely, if X refers to the outcome of a portfolio, we denote by F its cumulative distribution function:

F(x) = P(X ≤ x),  x ∈ R,

and for a given α ∈ (0, 1), the quantile θ_α is defined by:

(1)  θ_α = inf{x ∈ R : F(x) ≥ α}.

Note that in many works in actuarial and financial mathematics, the notation VaR_α(X) or ES_α(X) (for expected shortfall) is used instead of θ_α. For a given α ∈ (0, 1), the superquantile is then defined, as soon as the random variable X is L^1, by:

(2)  ϑ_α = E[X | X > θ_α] = (1/(1 − α)) E[X 1_{X > θ_α}].

Superquantiles are often adopted as a measure of risk when modeling a risk-averse decision maker and are commonly denoted by CVaR_α(X) in financial economics and mathematics.
More precisely, ϑ_α quantifies a risk-averse constraint on asset return prices: for a weekly (log-)return Z of a portfolio, we are commonly interested in the estimation of, or in some guarantees on, the superquantile of −Z when α = 95%, which represents the average loss induced by the worst 5% of scenarios. Besides being a coherent risk measure, the superquantile also conveys more information on the tail of the distribution of the random variable X.
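As a concrete baseline, definitions (1)-(2) can be checked on simulated data with a naive batch estimator based on order statistics. The snippet below is only an illustration of the definitions; the function name and the Exp(1) example are ours, not part of the paper. For X ~ Exp(1) and α = 0.95, one has θ_α = ln 20 ≈ 2.996 and, by memorylessness, ϑ_α = θ_α + 1 ≈ 3.996.

```python
import random

def empirical_quantile_superquantile(sample, alpha):
    # Batch (order-statistics) estimates of the quantile theta_alpha
    # and of the superquantile vartheta_alpha = E[X | X > theta_alpha].
    xs = sorted(sample)
    n = len(xs)
    k = int(alpha * n)              # index of the empirical alpha-quantile
    theta = xs[k]
    tail = xs[k + 1:]               # observations beyond the quantile
    vartheta = sum(tail) / len(tail)
    return theta, vartheta

random.seed(0)
sample = [random.expovariate(1.0) for _ in range(100_000)]
theta, vartheta = empirical_quantile_superquantile(sample, 0.95)
# theta should be close to ln(20) ~ 2.996 and vartheta close to 3.996
```

This batch estimator requires storing and sorting the whole sample, which is precisely what the recursive procedures studied in this paper avoid.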
In our paper, we develop a new method for jointly estimating (θ_α, ϑ_α) and obtain some non-asymptotic guarantees for the quadratic loss of estimation, which is a novel type of result for this kind of estimation problem with stochastic algorithms. We also obtain a central limit theorem associated with our method.
Related works. Estimating quantiles has a longstanding history in statistics: except in parametric models where explicit formulas are available, the estimation of quantiles is a real issue. It may be tackled with the help of Monte-Carlo approaches when some simulation tools of the random variable X are available, or using the Extreme Value Theory when α is close to 0 or 1 (see e.g. [DHF06,Emb99]). Another popular point of view is to approximate the behaviour of the VaR α (X) with correcting linear and quadratic terms (see e.g. [BJS99,DP01,Rou97]). Finally, a large amount of work is based on order statistics to derive some estimation procedures of VaR α (X). We refer to [TPFG20] for an overview of some recent uses of order statistics for quantile estimation with some prospects in the field of computer experiments.
The development of new mathematical studies on superquantiles started with [BTT86] and [RU00], which use a variational formulation of CVaR_α(X), as the value of a convex optimization problem, to derive an estimation strategy. In particular, their approach does not require any parametric form of the distribution of X. In [LRGGI16], the authors exploit a Monte-Carlo strategy coupled with order statistics to estimate superquantiles and they establish a central limit theorem associated with their estimation method. Oppositely, many (econometric or finance) works rely on an important parametric a priori modeling (see e.g. [HY03,WZ16,ENW09]) and, owing to their parametric or semi-parametric point of view, the methods cited above obviously suffer from robustness issues.
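To make the variational point of view concrete, recall the Rockafellar-Uryasev identity from [RU00]: CVaR_α(X) = min_t { t + E[(X − t)_+]/(1 − α) }, the minimum being attained at t = θ_α. The following sketch is our illustration, not the paper's procedure; the helper name and the Gaussian example are assumptions. It minimizes the empirical objective by scanning the order statistics with suffix sums:

```python
import random

def cvar_variational(sample, alpha):
    # Empirical Rockafellar-Uryasev objective
    #   J(t) = t + mean((x - t)_+) / (1 - alpha),
    # whose minimum over t is the empirical superquantile CVaR_alpha.
    # J is piecewise linear, so the minimum is attained at a sample point.
    xs = sorted(sample)
    n = len(xs)
    suffix = 0.0                    # running sum of xs[i+1:], built backwards
    best = float("inf")
    for i in range(n - 1, -1, -1):
        t = xs[i]
        # sum of (x - t)_+ equals suffix - (number of points above t) * t
        obj = t + (suffix - (n - 1 - i) * t) / ((1 - alpha) * n)
        best = min(best, obj)
        suffix += t
    return best

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(50_000)]
cvar = cvar_variational(sample, 0.95)   # exact value for N(0,1): ~2.063
```

The suffix-sum trick keeps the scan in O(n log n) overall, dominated by the sort.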
All the previous methods are based on a batch framework, where the user observes the whole set of samples before starting the estimation procedure. Nevertheless, in some concrete situations, the observations may only be recorded on-line, while in other big-data situations, the observations are simply too numerous to be handled with a batch method. In these cases, we are led to consider recursive strategies where we estimate quantiles and superquantiles from the observations sequentially, with the help of a stochastic approximation algorithm (see, e.g., [KY03,BMP90]). The recursive quantile algorithm is a classical example of such a method (see e.g. [Duf97]) and has led to numerous recent contributions on the generalization to geometric medians (see, among others, [God15,CCZ13,CCGB17]).
The classical algorithm can be written as:

(3)  θ_{n+1} = θ_n − a_n (1_{X_{n+1} ≤ θ_n} − α)
     ϑ_{n+1} = ϑ_n + b_n ( X_{n+1}/(1 − α) 1_{X_{n+1} > θ_n} − ϑ_n )

Finally, the recent work [BFP09] developed a different stochastic algorithm that estimates both CVaR_α(X) and VaR_α(X) within the same joint procedure. Indeed, [BFP09] used the current value of (θ_n)_{n≥1} and modified (ϑ_n)_{n≥1} into:

ϑ_{n+1} = ϑ_n + b_n ( θ_n + (X_{n+1} − θ_n)/(1 − α) 1_{X_{n+1} > θ_n} − ϑ_n )

to estimate ϑ_α. Their main idea is to consider a convexified version of the algorithm, since the update function L(θ, x) = θ + (x − θ)/(1 − α) 1_{x > θ} satisfies that θ → E[L(θ, X)] is convex. Then they use a Ruppert-Polyak averaging procedure, introduced in the seminal contributions [Rup88,PJ92], to obtain a central limit theorem for their stochastic algorithm. In the recent work [BCG21], we developed a purely asymptotic study of these two algorithms and derived almost sure properties (Quadratic Strong Law and Law of the Iterated Logarithm) as well as a central limit theorem.
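For readers who prefer code, the classical joint recursion can be sketched in a few lines. The step-size choices a_n = b_n = n^{-3/4}, the function name and the Exp(1) test case are illustrative assumptions of ours, not prescriptions from the paper:

```python
import random

def classical_quantile_cvar(stream, alpha, a1=1.0, b1=1.0, a=0.75, b=0.75):
    # Classical recursion: Robbins-Monro step for theta_n, coupled with a
    # stochastic-approximation step for vartheta_n using the current theta_n.
    theta, vartheta = 0.0, 0.0
    for n, x in enumerate(stream, start=1):
        ind = 1.0 if x <= theta else 0.0
        # superquantile step uses the indicator computed at the current theta_n
        vartheta += b1 * n ** (-b) * (x * (1.0 - ind) / (1.0 - alpha) - vartheta)
        theta -= a1 * n ** (-a) * (ind - alpha)
    return theta, vartheta

random.seed(2)
stream = (random.expovariate(1.0) for _ in range(200_000))
theta, vartheta = classical_quantile_cvar(stream, 0.9)
# target values for Exp(1): theta_0.9 = ln(10) ~ 2.303, vartheta_0.9 ~ 3.303
```

Each observation is processed once and then discarded, which is the on-line feature discussed above.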
1.2. Algorithm. In this paper, we develop a stochastic procedure, closely related to the one of [BFP09,BCG21], to jointly estimate the quantile and the superquantile of an unknown distribution: we expect to benefit from the acceleration of the Cesaro averaging procedure applied to (θ_n)_{n≥1} to obtain a better recursion for the superquantile estimation. For this purpose, we first introduce the following three-dimensional stochastic algorithm, which is a generalization of the stochastic algorithm (3):

(4)  θ_{n+1} = θ_n − a_n (1_{X_{n+1} ≤ θ_n} − α)
     θ̄_{n+1} = θ̄_n n/(n + 1) + θ_{n+1}/(n + 1) = (1/(n + 1)) Σ_{k=1}^{n+1} θ_k
     ϑ̂_{n+1} = ϑ̂_n + b_n ( θ̄_n + (X_{n+1} − θ̄_n)/(1 − α) 1_{X_{n+1} > θ̄_n} − ϑ̂_n )

The sequence (θ_n)_{n≥1} corresponds to the standard recursive quantile algorithm (see e.g. [Duf97]), whereas (θ̄_n)_{n≥1} is the Cesaro average over the past iterations of the initial stochastic gradient descent (θ_n)_{n≥1}. Cesaro averaging is known to be an efficient way of reducing the quadratic risk of estimation of θ_α (see e.g. [Rup88,PJ92]). We will provide a sharp non-asymptotic analysis of (θ_n)_{n≥1} and (θ̄_n)_{n≥1} below, since it will be necessary for our study of (ϑ̂_n)_{n≥1}. Our approach is similar to [GP20], even though for our purpose we need to derive some upper bounds on the L^p risk of (θ̄_n)_{n≥1} for p = 4 instead of p = 2 in [GP20]. We compare the algorithm proposed in this article with existing ones at the end of this section.
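A minimal working sketch of the joint recursion (4) follows. It assumes, as our reading of (4) suggests, that the superquantile step averages L(θ̄_n, X_{n+1}) = θ̄_n + (X_{n+1} − θ̄_n)(1 − α)^{-1} 1_{X_{n+1} > θ̄_n}; the Exp(1) test case, the function name and the step-size choices are our illustrative assumptions:

```python
import random

def averaged_quantile_cvar(stream, alpha, a1=1.0, a=0.75, b1=1.0, b=1.0):
    # Three-dimensional recursion: Robbins-Monro quantile iterate theta,
    # its Cesaro average theta_bar, and a superquantile recursion driven by
    # L(theta_bar, x) = theta_bar + (x - theta_bar)_+ / (1 - alpha).
    theta, theta_bar, vartheta = 0.0, 0.0, 0.0
    for n, x in enumerate(stream, start=1):
        L = theta_bar + max(x - theta_bar, 0.0) / (1.0 - alpha)
        vartheta += b1 * n ** (-b) * (L - vartheta)   # uses the current Cesaro average
        theta -= a1 * n ** (-a) * ((1.0 if x <= theta else 0.0) - alpha)
        theta_bar += (theta - theta_bar) / (n + 1)    # Cesaro averaging
    return theta_bar, vartheta

random.seed(3)
stream = (random.expovariate(1.0) for _ in range(200_000))
theta_bar, vartheta = averaged_quantile_cvar(stream, 0.9)
# target values for Exp(1): theta_0.9 = ln(10) ~ 2.303, vartheta_0.9 ~ 3.303
```

With b = 1 and b_1 = 1, the superquantile iterate reduces to a running average of the L(θ̄_k, X_{k+1}) values, which is the regime analyzed in Theorem 1.2 ii).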
We do not assume any parametric form for the underlying distribution; we only assume that it is absolutely continuous with respect to the one-dimensional Lebesgue measure λ, with a density denoted by f, so that F(x) = ∫_{−∞}^{x} f(t) dt. We consider the general situation where f is supported over R, i.e. we do not make any assumption on the compactness of the support of f and we do not assume that f is lower bounded by any positive constant.
Assumption H_f: We assume that f(θ_α) > 0, that ‖f‖_∞ < +∞ and that θ → (1 + |θ|)|f′(θ)| is a bounded function. We further assume that X has a moment of order strictly larger than 2:

∃p > 2 :  E[|X|^p] < +∞.

In particular, H_f implies that f is L-Lipschitz for a large enough L. This assumption is very close to the framework used in [BFP09]. The boundedness of θ → (1 + |θ|)|f′(θ)| is crucial to obtain the non-asymptotic bounds and some precise controls of the first and second order terms.
Assumption H_{(a_n,b_n)}: The gain sequences satisfy:

a_n = a_1 n^{−a} and b_n = b_1 n^{−b}, with a_1, b_1 > 0 and a, b ∈ (0, 1].

This corresponds to the standard framework of the Robbins-Monro algorithm. We specifically consider the case of two-scale algorithms, similarly to [MP06]. We emphasize that the case where a = b was studied in [BFP09] and the case b < a in [BCG21]. Here, we also consider the complementary situation where the learning rate of the superquantile algorithm is faster than the one of the quantile sequence.
1.3. Main results. We derive non-asymptotic estimates of the quadratic loss as well as L^p upper bounds for the recursive quantile estimation (θ̄_n)_{n≥1}, with p ≥ 1. We introduce the L^p risk associated with θ̄_n − θ_α, denoted by:

M_{n,2p} := E[(θ̄_n − θ_α)^{2p}].

Theorem 1.1. Assume H_f and a_n = a_1 n^{−a}.
i) The quadratic risk satisfies:

M_{n,2} ≤ α(1 − α)/(n f(θ_α)²) + κ′ n^{−(1/2+a)∧(2−a)},

where κ′ is an explicit constant given in (23). The optimal choice of a corresponds to a = 3/4, in which case the second order term is of order n^{−5/4}.
ii) For any integer p ≥ 1, a constant c_p exists such that:

M_{n,2p} ≤ c_p n^{−p}.

The first item i) of Theorem 1.1 is a consequence of Corollary 6 in [GP20]. However, we make precise the computations of the second order term in the framework of quantile estimation and we obtain precise constants for this term. We also point out that this second order term involves f(θ_α)^{−3}. Whereas the first order term involves the local curvature given by the Cramér-Rao lower bound f(θ_α)^{−2}, this curvature enters the second order term with a worse power (3 instead of 2), which is of course compensated by the n^{−5/4} factor. As indicated in [GP20], the second order term is better regarding its dependency on n, but may be degraded by the curvature or the dimension of the ambient space. Even though the second result ii) is stated for a general value of p, we emphasize that this result is weaker than i) when p = 2: the first item provides a first-order optimal result, while the constant c_p given by the second item is pessimistic. The first order term is optimal thanks to the Ruppert-Polyak central limit theorem satisfied by √n(θ̄_n − θ_α) (see e.g. [PJ92,Rup88]). Nevertheless, ii) of Theorem 1.1 will be essential to derive a satisfactory linearization of our superquantile algorithm (ϑ̂_n)_{n≥1}. Therefore, Theorem 1.1 should be seen both as a result in itself and as a cornerstone tool for the next result.
We also state in Theorem 2.2 similar results for the Robbins-Monro algorithm (θ_n)_{n≥1}; in this setting, however, we have no explicit expression for the first-order constant.
Below, we study the properties of the superquantile recursive estimation algorithm and we split our results into two cases that depend on the choice of the sequence (b_n)_{n≥0}. We first consider the case b_n = b_1/(n + 1) and then b_n = b_1 (n + 1)^{−b} with 1/2 < b < 1. We introduce the important notation V_α, defined in (5).

Theorem 1.2. Assume H_f and H_{(a_n,b_n)}.
i) If b ∈ (1/2, 1), then a large enough constant Γ exists such that:

E[(ϑ̂_n − ϑ_α)²] ≤ Γ b_n.

ii) If b_n = b_1/(n + 1), then a large enough constant Γ exists such that:

E[(ϑ̂_n − ϑ_α)²] ≤ C_{α,b_1} n^{−1} + Γ n^{−r},

for some r > 1 made explicit in Section 3. Theorem 1.2 provides some statistical guarantees on the behaviour of (ϑ̂_n)_{n≥1} over a finite horizon n, while our assumptions on f are very weak: we only assume the smoothness of f together with f(θ_α) > 0. The second assumption seems necessary: otherwise the distribution P is flat around θ_α and the definition of the quantile and of the superquantile may be ambiguous.
We also state a sharp asymptotic analysis of the sequence ( ϑ n ) n≥1 and we derive a central limit theorem with an explicit computation of the limiting variance.
Discussion. We complete below our previous discussion regarding the results we obtained in Theorem 1.2 and Theorem 1.3.

Low computational cost.
It is well known that on-line methods lead to very simple and fast ways of computing estimators, compared for example to estimators based on order statistics or plug-in rules. Hence, it is somewhat classical, but nevertheless important, to recall how fast our algorithm manages the estimation of ϑ_α recursively. Moreover, our method may be adapted to a stream of observations, contrary to batch methods.

Non-asymptotic result.
Gathering i) of Theorem 1.2 and i) of Theorem 1.3, we observe that our upper bound i) cannot be improved, as it involves the lowest possible constant. Regarding now ii) of Theorem 1.2 and ii) of Theorem 1.3, we observe that the rate of convergence in Theorem 1.2 ii) is of order 1/n and is optimal. Hence, we obtain with our sequential method a parametric rate of estimation of ϑ_α. Unfortunately, we emphasize that the constant C_{α,b_1} given in ii) of Theorem 1.2 does not match the limiting variance obtained in Theorem 1.3, so that the constant involved in ii) of Theorem 1.2 may certainly be improved. Comparing the results i) and ii) of Theorems 1.2 and 1.3, we finally observe that, from an asymptotic point of view, the choice b_n = b_1/(n + 1) always outperforms the other one.
To the best of our knowledge, estimating the superquantile ϑ_α with a recursive method has only been addressed in [BFP09], with the help of a Cesaro averaging procedure applied to the initial sequence (ϑ_n)_{n≥1}, and in [BCG21], where some asymptotic results are established for another method that produces a sequence different from (ϑ̂_n)_{n≥1}. The results in [BFP09] and in [BCG21] are somewhat weaker than those of Theorem 1.2 since, with the same set of assumptions, they only derive asymptotic results instead of a non-asymptotic bound on the mean square error. Our contribution stated in Theorem 1.2 is significantly stronger: we obtain a non-asymptotic upper bound on the performance of our estimator, which attains the convergence rate O(b_n) with a sequential strategy.

Limiting variance and comparison with existing results
Regarding now the asymptotic properties, there exist, roughly speaking, three lines of work in the literature. The first family of works is concerned with parametric or semi-parametric approaches: some parametric models are considered for the sequence of observations and then a plug-in parametric estimation is used, with either a direct formula (if available for ϑ_α) or a Monte-Carlo integral approximation. It is the point of view adopted in [CU01], [HY03] or [WZ16] for example. Although efficient from a numerical point of view, these approaches suffer from robustness problems, essentially due to some questionable parametrizations. Consequently, even if some of these previous works also establish central limit theorems with explicit limiting variances of estimation, they cannot be directly compared to our result stated in Theorem 1.3.
Oppositely, nonparametric (or historical) VaR_α(X) or CVaR_α(X) estimation is a robust alternative to the (semi-)parametric approaches; among these works, we can mention batch methods and on-line ones that can manage a stream of observations.
• Concerning the batch approaches, they are commonly based either on order statistics, on kernel smoothing or on the variational description of CVaR_α(X). For example, in [LRGGI16], the authors derive a central limit theorem for the estimation of the superquantile using the standard approach with order statistics and an assumption on the behaviour of the quantile function, which entails an assumption on the tail of the distribution of the random variable X and yields a non-adaptive method regarding the tail parameter of the distribution. In [BZ10], the authors use a plug-in method to derive a central limit theorem, with the help of an ad hoc functional delta method. In the same way, [CU01] also uses a variational approach to CVaR_α(X) to derive an asymptotic functional central limit theorem for the estimation of ϑ_α whose limiting variance depends on the tail index used in their model. In all these works, the variances are difficult to compare with the one stated in Theorem 1.3, at least from a theoretical point of view, while we emphasize that these methods are computationally heavier than our on-line method. We also point out that in our work, we do not need any very restrictive assumption on the tail of the distribution of X.
• Therefore, it seems that the meaningful methods we have to compare with are those of [BCG21], [BFP09] and [LRGGI16]. In [LRGGI16], the superquantile is estimated using a Monte-Carlo method and some order statistics of the distribution. The authors obtain a central limit theorem (Theorem 3.1), but the expression of the limiting variance is difficult to compare with ours because of the lack of explicit computations. More interestingly, [BCG21] states a central limit theorem for the algorithms (ϑ_n) and its averaged version (see Theorem 3.4): both √(n^b)(ϑ_n − ϑ_α) and its averaged counterpart converge to a Gaussian random variable with variance Γ_{ϑ_α}. In the case where b < 1 and ϑ_α and θ_α are positive, the asymptotic variance obtained in Theorem 1.3 ii) is larger than Γ_{ϑ_α}. The discussion in the case where b = 1 is more interesting. For a fixed value of b_1, a direct comparison shows that if the distribution f satisfies ϑ_α/θ_α > 3/2, there exists b_1 such that the asymptotic variance of ϑ̂_n is smaller than the ones obtained in [BCG21]. Now, for a given distribution, the question remains how to find a value of b_1 that minimizes S²_{22} and Γ_{ϑ_α}. The minimal Γ_{ϑ_α} is equal to τ²_α and is reached for b_1 = 1. This value corresponds to the limiting variance of the central limit theorem for the averaged version of ϑ_n stated in [BFP09] (Theorem 2.4). The variance S²_{22} admits a minimum at some b_1 ∈ (1/2, 1). However, the comparison of this minimal value with τ²_α is difficult to handle for a generic distribution. We nevertheless point out that our new algorithm (ϑ̂_n)_{n≥1} shall improve the limiting variance of estimation, in comparison with the competing sequences, when ϑ_α is significantly larger than θ_α, which is a common feature of heavy-tailed distributions like Student, Weibull or Pareto ones. This last remark may be particularly useful for possible applications in finance and actuarial science.
In both fields, CVaR estimation is a cornerstone for designing optimal policies, either for contracting or hedging. We emphasize that in finance it is commonly admitted that asset return prices may possess heavy tails (see e.g. [Man63,RM00]), whereas actuarial scientists generally face optimal reinsurance problems with fat tails (see e.g. [BBH09,Del09]).
Organisation of the paper and notations. As indicated above, the main goal of this paper is to prove Theorem 1.2 and Theorem 1.3. To do so, we will also establish some results on the stochastic gradient descent (θ_n)_{n≥1} involved in the quantile estimation, and on the averaged sequence (θ̄_n)_{n≥1} for the quantile estimation itself. Beyond their own interest, these results are necessary to study (ϑ̂_n)_{n≥1} and, to the best of our knowledge, are new for the recursive quantile algorithm (θ_n)_{n≥1}; in particular, they give a state-of-the-art result for the averaged sequence (θ̄_n)_{n≥1} when compared to known results stated in [Duf97, CCZ13, God15] (among others).
The paper is organized as follows. Section 2 presents a careful study of the L^p loss of the recursive quantile. The results derive from a strategy based on a family of Lyapunov functions V_q (see Equation (9)). In particular, we establish Theorem 2.2, which states that E[(θ_n − θ_α)^{2p}] is less than a_n^p up to a constant that depends on p. In Section 2.2, we prove an optimal non-asymptotic result for (θ̄_n)_{n≥1}. Our proof relies on a spectral analysis of the algorithm, close to the strategy developed in [GP20]. Section 3 then uses Theorem 1.1 to derive the main result of the paper on the superquantile estimation (the proof of Theorem 1.2), and Section 4 establishes the central limit theorem with the help of an approximation of the rescaled stochastic algorithm by Markov diffusion processes.
Below, we will use a.s. to refer to an almost sure convergence result. For two positive sequences (t_n)_{n≥1} and (s_n)_{n≥1}, t_n ≲ s_n refers to an inequality true for any value of n up to a constant C independent of n: t_n ≤ C s_n. In the sequel, we will always denote (VaR_α(X), CVaR_α(X)) by (θ_α, ϑ_α).

Quantile estimation
This paragraph deals with the statistical estimation of θ_α via the sequence (θ_n)_{n≥1} and the Cesaro averaging procedure defined by (θ̄_n)_{n≥1}.

2.1. Convergence of the quantile algorithm (θ_n)_{n≥1}. We first state a standard almost sure convergence result, established with the help of the Robbins-Monro Theorem.
Theorem 2.1 (Robbins-Monro Theorem - Almost sure convergence of (θ_n)_{n≥1}). Assume that a_n = a_1 n^{−a} with a ∈ [1/2, 1] and a_1 > 0; then the sequence (θ_n)_{n≥1} satisfies:

θ_n −→ θ_α  a.s.  as n → +∞.

We refer to [Duf97] for a proof of this standard result. The major drawback of Theorem 2.1 is the lack of quantitative information about the rate of convergence. We are generally interested in a result of the form E[(θ_n − θ_α)^{2q}] ≤ r_{n,q}, where r_{n,q} is referred to as the convergence rate of the algorithm. We emphasize that such bounds are more meaningful than a simple almost sure convergence result because they make it possible to draw quantitative comparisons among algorithms (in particular when looking at the rates of convergence and at the first order terms). Deriving such a bound is a cornerstone of our forthcoming study of (θ̄_n)_{n≥1}.
Theorem 2.2. Assume H_f. If a_n = a_1 n^{−a} with a ∈ (0, 1), then a large enough constant C exists such that:

E[(θ_n − θ_α)²] ≤ C a_n.

More generally, for any integer q ≥ 1, we have:

E[(θ_n − θ_α)^{2q}] ≤ C_q a_n^q.

In particular, we can take an explicit value of C_q, expressed in terms of the constants c_1, c_2 and m given in Lemma 2.3.
We emphasize that our result holds for a ∈ (0, 1), instead of the classical assumption a ∈ (1/2, 1) in the Robbins-Monro Theorem (see for example the classical result on the dosage in Theorem 1.4.26 and Proposition 1.4.27 of [Duf97]). We prove here that the common constraint Σ_{n≥1} a_n² < +∞ is useless in the case of the recursive quantile algorithm to obtain a convergence result. This important and unusual result holds because of the boundedness of the noise from one iteration to another and is obtained with a subtle strong convexification of the loss function with the help of the family of functions V_q. To the best of our knowledge, such an improvement has only been observed in the case of bounded noise in a general result of Métivier and Priouret (see e.g. Theorem 1 of [MP87]), and was also stated in a simpler way in Proposition 4.2 of [Ben99]. We emphasize that these works only state a.s. asymptotic results. Said differently, for the dosage issue and other companion problems (such as median estimation, for example), the assumption a > 1/2 used in [Duf97,God15] to assess convergence results on (θ_n)_{n≥1} is unnecessary.
Second, we also observe that our upper bound degrades as f(θ_α) → 0, which corresponds to a flat area of the cumulative distribution function around θ_α. Such a phenomenon is fully coherent with the CLT (not shown here) known for the dosage problem.
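The rate E[(θ_n − θ_α)²] ≲ a_n of Theorem 2.2 can be probed numerically. The small Monte-Carlo experiment below is our illustration, with the Uniform(0,1) distribution and a = 3/4 as arbitrary choices; it estimates the quadratic risk at a fixed horizon over independent replications:

```python
import random

def quantile_risk(alpha, n_steps, n_rep, a1=1.0, a=0.75, seed=4):
    # Monte-Carlo estimate of E[(theta_n - theta_alpha)^2] for the
    # Robbins-Monro quantile recursion, with X ~ Uniform(0,1).
    rng = random.Random(seed)
    sq_err = 0.0
    for _ in range(n_rep):
        theta = 0.0
        for n in range(1, n_steps + 1):
            x = rng.random()
            theta -= a1 * n ** (-a) * ((1.0 if x <= theta else 0.0) - alpha)
        sq_err += (theta - alpha) ** 2   # theta_alpha = alpha for Uniform(0,1)
    return sq_err / n_rep

risk = quantile_risk(0.5, n_steps=10_000, n_rep=100)
a_n = 10_000 ** (-0.75)
# Theorem 2.2 predicts risk <= C * a_n for some constant C
```

For Uniform(0,1) the density at the quantile equals 1, so the degradation in f(θ_α)^{−1} discussed above does not show up in this particular experiment.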
Proof of Theorem 2.2. We start with the linearization of (θ_n)_{n≥1}. We define: The function Φ is associated with the gradient descent involved in Equation (4). We point out that Φ ≥ 0 and satisfies: We shall remark that: where the martingale increment is defined by: We now follow the roadmap of [GP20] and define the Lyapunov function: We first state some important properties of V_q, whose proofs are postponed to Appendix B.
Lemma 2.3. i) A constant m > 0 exists such that for any θ ∈ R and q ≥ 1: ii) A constant c_q > 0 that only depends on q and f exists such that: Our strategy of proof relies on a recursive contraction from V_q(θ_n) to V_q(θ_{n+1}), using a Taylor expansion of V_q(θ_{n+1}): We first consider the drift term. Since ∆M_{n+1} is a martingale increment: We now pay specific attention to the second order term. Lemma 2.3 ii) yields: We are thus led to upper bound Φ(ξ_{n+1}). Using a Taylor expansion near θ_n, we obtain: We recall that ℓ_{n+1} is a random variable in [0, 1] and that Φ′(θ_n) and ∆M_{n+1} are uniformly bounded by 1. We thus deduce that: which leads to, for n large enough: Changing the constant c_q from line to line, we are led to: where we used that exp(Φ(θ_n)) = V_0(θ_n) ≤ C(1 + V_q(θ_n)). Combining Equations (12) and (13), we deduce that: Finally, using again the fact that Φ′(θ_n) and ∆M_{n+1} are uniformly bounded, we obtain: We conclude using (11) that: We deduce from the study of this recursive inequality, given in Lemma B.1, that for n_0 large enough: In particular, we can choose: Taking the expectation at time n and using Equation (15), we conclude the proof.

2.2. Cesaro averaging. This section is devoted to the proof of Theorem 1.1. The proof relies on a spectral analysis of the algorithm, close to the strategy developed in [GP20]. We introduce Z_n = (θ_n − θ_α, θ̄_n − θ_α), which gives the joint evolution of the Robbins-Monro algorithm and its averaged version.
Proposition 2.4. The sequence (Z_n = (θ_n − θ_α, θ̄_n − θ_α)^T)_{n≥0} satisfies: where: i) A_n translates the linearization of the algorithm around (θ_α, θ_α) at step n: ii) The martingale increment ∆M_{n+1} defined in (8) satisfies: iii) The rest term R_n is an F_n-measurable random variable such that: Proof. Proof of i): We write the two sequences (θ_n)_{n≥1} and (θ̄_n)_{n≥1} as: We then use a linear approximation of F around θ_α and write that: Finally, we can write the joint evolution of (θ_n, θ̄_n)_{n≥1} as a perturbed linear recursive sequence given by (16).

Proof of ii):
We study the variance of the martingale increment defined in (8): We then deduce that: Proof of iii): The rest term is an F_n-measurable random variable given by: Using that f is L-Lipschitz, the rest term R_n satisfies: Proof of Theorem 1.1 i). The scheme of the proof is to use the linearization to identify the negligible terms in the recursion. To do so, we diagonalize the matrix A_n and perform a change of basis. The eigenvalues of A_n are 1 − a_n f(θ_α) and 1 − 1/(n + 1), and we verify that: We introduce a suitable change of basis and denote by Z̃^{(2)}_n the second component of Z̃_n. Equation (16) yields: where:

(20)  R̃_n = ω_n (θ_n − θ_α) + (1/(n + 1) − ǫ_{n+1} a_n) R_n,  with ω_n := (ǫ_{n+1} − ǫ_n)(1 − a_n f(θ_α)).
We now compute the quadratic expansion, using the fact that (∆M_{n+1})_{n≥1} is a sequence of martingale increments. Step 1: We expand E({Z̃^{(2)}_{n+1}}²) and use the Cauchy-Schwarz inequality on the last term to deduce that: We study in Lemmas 2.5 and 2.6 the behaviour of all the terms involved in the previous decomposition. Lemma 2.5 concerns the variance of the martingale increments, from which a term of order n^{−2} arises, while Lemma 2.6 proves that the other terms are negligible.
Lemma 2.5. For every δ > 0, a constant C_δ exists such that for all n: Lemma 2.6. There exists a constant C (independent of n) such that for all n: The proofs of these results are postponed to Appendix B.2. We deduce from these lemmas that some constant Γ exists such that: We now apply Lemma A.2 in order to derive the convergence rate of the algorithm. With the notations introduced in Lemma A.2, applied with γ = 1, A_1 = 0 and: We then obtain that a large enough Γ_a exists such that: Step 2: Refinement of the previous bound. Here, however, the second order term is not optimal and, to improve our bound, we refine our analysis with a precise study of the covariance term: Lemma 2.7. For all n ≥ 1, we have: Gathering these bounds in (21), we obtain that: with κ = max(LK²a_1, 2(a+1)/a_1, 2a_1) and ε_n a negligible term with respect to n^{−(3/2+a)∧(3−a)}. In particular, there exists n_0 large enough such that for all n ≥ n_0: We finally apply Lemma A.3 and deduce that a large enough Γ_a exists such that: The optimal choice is a = 3/4, in which case the second order term is of order n^{−5/4}, with a constant involving f(θ_α)^{−3}, plus a remainder Γ_a n^{−9/4}.

From the definition of Z̃^{(2)}_n, we obtain that:

E((θ̄_n − θ_α)²) = E({Z̃^{(2)}_n}²) + 2ǫ_n E(Z̃^{(2)}_n (θ_n − θ_α)) + ǫ_n² E((θ_n − θ_α)²).

Therefore, it remains to prove that the two last terms are negligible with respect to n^{−r}. From Theorem 2.2, we know that a large enough Γ exists such that: and the Cauchy-Schwarz inequality yields (up to a modification of Γ): Proof of Theorem 1.1-ii). The proof is deferred to Appendix B.2.

Embedded averaging super-quantile algorithm
The purpose of this paragraph is to prove Theorem 1.2, i.e., we aim to derive a non-asymptotic result on the sequence (ϑ̂_n) (see the definition in Equation (4)).
3.1. Linearization of the embedded algorithm. To study the behavior of Algorithm (4), we follow the roadmap of Section 2 and write a linear approximation of ( ϑ n ) n≥1 . To this end, we define: We shall prove the following result.
Proposition 3.1. Assume H_f. For any integer n, one has: where ξ_n ∈ (θ̄_n, θ_α), V_α is defined in (5) and ∆N_{n+1} satisfies: Proof. We observe that: According to Equations (25) and (26), we obtain that: We use a second order Taylor expansion of Q near θ_α, using the relationships: In particular, H_f implies that ‖Q″‖_∞ < +∞. We obtain that: Now, we compare the variance of the martingale increment with V_α: where the quantity involved is bounded, so that a large enough constant C exists such that: Computing the whole expectation and using the Cauchy-Schwarz inequality, we obtain:

3.2. Spectral analysis and recursion. We now study the sequence (ϑ̂_n) with the help of the previous linearization, combining Propositions 3.1 and 2.4. We introduce the matrix B_n. We emphasize that the eigenvalues of B_n are 1 − a_n f(θ_α), 1 − 1/(n + 1) and 1 − b_n. Obviously, B_n may be reduced to a diagonal form as soon as b_n ∉ {a_n f(θ_α), 1/(n + 1)}. In particular, we obtain the following decomposition (stated without proof): if b_n ∉ {a_n f(θ_α), 1/(n + 1)}, we have a diagonalization of B_n, where P_n is the matrix of change of basis.
As pointed out above, B_n may not be reduced to a diagonal form for an arbitrary sequence b_n → 0. Therefore, instead of using an exact spectral decomposition as is the case in [GP20], the novelty of our work here is to handle a decomposition of B_n without an exact diagonal form. For this purpose, we introduce the invertible matrix P̃_n defined by: and we verify that: Remark 1. The matrix P̃_n corresponds to setting k_{α,n} = 0, which removes the singularity issue, since k_{α,n} → ∞ when b_n is close to 1/(n + 1).
Equation (30) may be used as follows: we introduce the vector O_n defined by: and we remark that Proposition 3.1 yields: Using P̃_n, we translate the spectral information on B_n to the new vector Õ_n := P̃_n^{−1} O_n. Equation (30) yields: We define:

(31)  η_n := ǫ_n δ_n − ǫ_{n+1} δ_{n+1},

and verify that: Considering the third coordinate of the sequence (Õ_n)_{n≥1}, we obtain the recursion: All negligible F_n-measurable terms are gathered in U_n, which collects in particular the terms (1 − a_n f(θ_α)) η_n O^{(1)}_n, −ǫ_{n+1} δ_{n+1} a_n R_n and b_n R_n, and we regroup the martingale increments into ∆V_{n+1}, defined by: Then the recursion reads: The next result gives a recursive inequality on W_n = E[{Õ^{(3)}_n}²], where V_α is given in (5) and C > 0.

Proof.
A direct square expansion of (33) yields: The Cauchy-Schwarz inequality entails: The next lemmas derive some bounds on E(U_n²) and E(∆V²_{n+1}). The proofs are given in Appendix C.
Lemma 3.4. Assume that H_f and H_{(a_n,b_n)} hold; then: Moreover, in the special case where b_n = b_1 (n + 1)^{−1}, a positive constant C exists such that: Lemma 3.5. Assume that H_f and H_{(a_n,b_n)} hold; then: We apply Lemma 3.5 and Lemma 3.4 and obtain that a large enough C > 0 exists such that: We conclude this section with the proof of Theorem 1.2.
Then, we easily verify that the largest possible value of p in Lemma A.1 is: Since b < p, we may apply Lemma A.1 and obtain that a large enough Γ_a exists such that: Case b_n = b_1(n+1)^{−1}. In this case, we use Lemma 3.4 to refine the recursion obtained in Proposition 3.3 into: with r_1 = (5/2 − a) ∧ (a + 1) and r_2 = (2 + a/2) ∧ (3 − a). If we choose ρ = (1 + a/2) ∧ (2 − a), then we obtain that, for any b_1 > ρ − 1/2: and this inequality is optimized for a = 2/3, which yields a second-order term of order n^{−4/3} when b_1 > 5/6.

Conclusion. To conclude the proof, we come back to: It therefore remains to prove that the last two terms are negligible with respect to the first one. From Theorem 2.2, we know that a large enough C exists such that ε_n² δ_n² E((θ_n − θ_α)²) ≤ C n^{−2−2b+3a}, and the Cauchy-Schwarz inequality yields (up to a modification of C): When b = 1, an easy comparison leads to: When b ∈ (1/2, 1) and a < b, we can verify that both terms are negligible when compared to n^{−(1+a)/2}.

4. Central Limit Theorem
Deriving a central limit theorem for a stochastic algorithm is a standard way to assess its asymptotic variance. A central limit theorem for superquantiles has been obtained in [BFP09] for the Cesaro averaging of the sequence (ϑ_n)_{n≥1}. Another relevant work on the subject is [MP06], where a general central limit theorem is obtained for stochastic algorithms with two different time scales, which allows one to study a weak convergence result for the pair (θ_n, ϑ_n) (see also [BCG21]). We also refer to the pioneering works [Bor97, KT04] for other results on two-time-scale algorithms. However, in our case, even though (θ_n, ϑ_n)_{n≥1} seems to evolve with two different time scales (a_n, b_n)_{n≥1}, the evolution of (ϑ̄_n)_{n≥1} from iteration n to iteration n + 1 highly depends on θ̄_n, and therefore a third time scale (1/(n + 1)) is involved. This significantly complicates the use of Theorem 1 of [MP06]. Moreover, it appears that some of the technical assumptions used in [MP06] (especially Assumption (A4)-iii)) can be avoided with another proof strategy.
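To fix ideas, the joint procedure for the pair (θ_n, ϑ_n) can be sketched in a few lines. The sketch below is illustrative only: the recursions and constants (a Robbins-Monro quantile step with a_n = n^{-3/4}, Cesaro averaging of the quantile iterates, and a superquantile update based on the Rockafellar-Uryasev representation with b_n = 1/(n+1)) are plausible stand-ins, not the paper's exact specification.

```python
import numpy as np

def joint_quantile_superquantile(sample, alpha, a1=1.0, a=0.75):
    """Illustrative sketch (not the paper's exact constants) of the joint
    recursion: a Robbins-Monro quantile step, its Cesaro average, and a
    superquantile update driven by the averaged quantile, with b_n = 1/(n+1)."""
    theta, theta_bar, sq = 0.0, 0.0, 0.0
    for n, x in enumerate(sample, start=1):
        a_n = a1 * n ** (-a)
        # Robbins-Monro step targeting F(theta_alpha) = alpha
        theta -= a_n * (float(x <= theta) - alpha)
        # Cesaro (Ruppert-Polyak) averaging of the quantile iterates
        theta_bar += (theta - theta_bar) / n
        # Rockafellar-Uryasev integrand, evaluated at the averaged quantile
        h = theta_bar + max(x - theta_bar, 0.0) / (1.0 - alpha)
        sq += (h - sq) / (n + 1)  # superquantile step with b_n = 1/(n+1)
    return theta_bar, sq

rng = np.random.default_rng(0)
alpha = 0.8
q_hat, sq_hat = joint_quantile_superquantile(rng.exponential(1.0, 200_000), alpha)
# Exp(1): theta_alpha = -log(1 - alpha), CVaR_alpha = theta_alpha + 1
```

On an Exp(1) sample, both targets are explicit, which makes the sketch easy to sanity-check against the closed-form quantile and superquantile.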
In what follows, we first consider the case b_n = b_1(n + 1)^{−1}, for which the convergence rate of (ϑ_n)_{n≥1} is optimal, and derive a joint convergence for (θ_n, ϑ_n) that evolves with a single time scale proportional to 1/n (see Equation (4)). To derive a central limit theorem for the pair (θ_n, ϑ_n)_{n≥1}, we use the diffusion approximation of stochastic algorithms and follow the roadmap of [BMP90] (see also [KY03]). We write the joint evolution of a rescaled version of the algorithm (θ_n, ϑ_n)_{n≥1} and introduce:

4.1.1.
Decomposition of (T_n)_{n≥1} and (V_n)_{n≥1}. To obtain a central limit theorem, we use an approximation of the rescaled algorithm by a stochastic differential equation. More precisely, we show that the pair (T_n, V_n)_{n≥1} is close to the discretization of a specific Markov process: an Ornstein-Uhlenbeck diffusion. Of course, considering the first coordinate, we then recover the central limit theorem for the Cesaro averaging of the quantile sequence. We emphasize that our method significantly differs from those previously developed to establish central limit theorems for the Ruppert-Polyak averaging algorithm: [PJ92], [Pel00] and [CCGB17] are essentially based on the Abel transform and the decomposition of Equation (54).
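The central limit theorem for the Cesaro-averaged quantile sequence can be probed empirically. The Monte Carlo sketch below (illustrative constants, not the paper's) checks that √n(θ̄_n − θ_α) has an empirical standard deviation close to the classical Ruppert-Polyak limit √(α(1−α))/f(θ_α), which equals 2 for an Exp(1) sample and α = 0.8.

```python
import numpy as np

# Monte Carlo check (illustrative) of the CLT for the Cesaro-averaged
# Robbins-Monro quantile iterates: sqrt(n)(bar{theta}_n - theta_alpha)
# should be roughly N(0, alpha(1-alpha)/f(theta_alpha)^2).
# For Exp(1) and alpha = 0.8: theta_alpha = -log(0.2), f(theta_alpha) = 0.2,
# hence the limiting standard deviation is sqrt(0.8 * 0.2) / 0.2 = 2.
rng = np.random.default_rng(1)
alpha, a1, a = 0.8, 1.0, 0.75
R, N = 400, 20_000                      # replications, iterations
theta = np.zeros(R)                     # R independent quantile iterates
theta_bar = np.zeros(R)                 # their running Cesaro averages
for n in range(1, N + 1):
    x = rng.exponential(1.0, R)
    theta -= a1 * n ** (-a) * ((x <= theta).astype(float) - alpha)
    theta_bar += (theta - theta_bar) / n
z = np.sqrt(N) * (theta_bar - (-np.log(1.0 - alpha)))
# z.std() should be close to the predicted limit value 2
```

The tolerance in any such check has to absorb both the Monte Carlo noise (of order 1/√R) and the finite-n corrections to the limiting variance.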
Recursive analysis of (T_n)_{n≥1}. Instead of directly studying the pair (T_n, V_n)_{n≥1}, we first consider the reduction of (θ_n − θ_α) obtained from our eigenvalue decomposition in Section 2.2. We recall the definition of (Z̃_n)_{n≥1} in (18) and introduce: We remark, using (19), that: Finally, T̃_{n+1} = T̃_n − T̃_n/(2(n + 1)) + a_n √(n + 1) …, where Ẽ_n^1 satisfies: The previous upper bound is obtained using a Taylor approximation of n/(n + 1) in the first term, the control R̃_n defined in (20) and given by Lemma 2.6, the definition of (ε_n)_{n≥1} and Equation (17). Recursive analysis of (V_n)_{n≥1}. We use the recursive formulation given in Proposition 3.1: Using T_n = T̃_n + ε_n √n (θ_n − θ_α) and a first-order Taylor expansion of 1 + 1/(n + 1) and of (n + 2)/n, we deduce that: where Ẽ_n^2 satisfies: Gathering Equations (36) and (38), we obtain the following proposition.
Proposition 4.1. For any n ≥ 1, define Z̃_n = (T̃_n, Ṽ_n). Then: where the drift term is denoted H_n(Z̃_n), and Ẽ_n = (Ẽ_n^1, Ẽ_n^2) satisfies: 4.1.2. Tightness and limit of the martingale increments. We use standard notations from the diffusion approximation of rescaled stochastic algorithms. We introduce: The number of iterations needed to span a time t after a shift of Γ_n is then denoted by N(n, t): We associate with the sequence (Z̃_n)_{n≥1} a sequence (Z̃^{(n)})_{n≥1} of time-shifted continuous processes defined as follows: for any integer n, the process (Z̃_t^{(n)})_{t≥0} is the continuous-time affine interpolation of the rescaled stochastic algorithm started at position Z̃_n at time 0: Using the notations of Proposition 4.1, we remark that: and: In this paragraph, we identify the possible weak limits of the sequence (Z̃^{(n)}) (see e.g. [BMP90]); some details are skipped for the sake of brevity. The reader may find similar arguments in Section 5 of [GPS18].
We define the infinitesimal generator G as follows: for any ϕ ∈ C_b²(R², R): Proposition 4.2. i) The sequence (Z̃^{(n)})_{n≥1} is tight for the topology induced by weak convergence on compact intervals of continuous-time processes.
The proof of this previous compactness result is deferred to Appendix D.

Central Limit Theorem: Theorem 1.3-ii).
Proof of Theorem 1.3-ii). The proof is briefly sketched since it exactly follows the same lines as those of Theorem 2.4 of [GPS18] (Section 5.3).
• From Proposition 4.2-i), we know that the sequence of processes (Z̃^{(n)})_{n≥1} is tight, and Proposition 4.2-ii)/iii) entails that any weak limit of (Z̃^{(n)})_{n≥1} is a solution of the martingale problem with generator G on the domain C_b²(R², R) of twice continuously differentiable bounded functions. We emphasize that G is an Ornstein-Uhlenbeck Markov generator. The associated semigroup P_t = e^{tG} is elliptic and ergodic: a unique invariant Gaussian distribution ν_∞ exists such that, for any test function ϕ, P_t(ϕ) → ν_∞(ϕ) as t → +∞, uniformly on compact sets of R². The limit covariance is computed in Proposition 4.2-ii).
• Second, the normalized algorithm (Z̃_n)_{n≥1} is also a tight sequence: we consider a possible weak limit µ associated with an extracted subsequence (Z̃_{n_k})_{k≥1} and a time T large enough. Then, we prove that an associated subsequence of shifted processes Z̃^{(m_k)} exists such that Z̃_T^{(m_k)} is arbitrarily close both to ν_∞ and to µ, because of the shifted nature of the continuous-time processes.
• We then conclude that µ = ν_∞, meaning that (Z̃_n)_{n≥1} is a tight sequence with a unique accumulation point. This proves that Z̃_n converges in distribution to ν_∞ as n → +∞. We recall that ν_∞ is a two-dimensional Gaussian distribution with mean 0 and covariance matrix Σ. We finally obtain the convergence of √n(θ_n − θ_α, ϑ_n − ϑ_α) using Slutsky's theorem. Let us remark that:
We verify that: so that √n ε_n(θ_n − θ_α) converges to 0 in probability (and even in L¹). We therefore deduce that √n(θ_n − θ_α, ϑ_n − ϑ_α) converges to a Gaussian distribution with mean 0 and covariance matrix: • We finally detail the computation of the limiting variance Σ and identify the invariant measure ν_∞ associated with the infinitesimal generator G defined in (42). ⋄ We apply the stationarity relation ∫ Gϕ(x, y) ν_∞(dx, dy) = 0 to ϕ(x, y) = x²/2 and observe that: which leads to: ⋄ Then, using the function ϕ(x, y) = xy, we obtain: Therefore, we deduce that: ⋄ Finally, using ϕ(x, y) = y²/2, we have: and thus:
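The three stationarity relations above, obtained from the quadratic test functions x²/2, xy and y²/2, are precisely the scalar components of a continuous Lyapunov equation for the invariant covariance of an Ornstein-Uhlenbeck diffusion. The snippet below illustrates this identification numerically with stand-in drift and diffusion matrices A and S (illustrative values, not the paper's).

```python
import numpy as np

# Invariant covariance Sigma of an Ornstein-Uhlenbeck diffusion
# dZ_t = A Z_t dt + S dB_t (A stable): Sigma solves the continuous
# Lyapunov equation A Sigma + Sigma A^T + S S^T = 0, which is what the
# relations int G(phi) d(nu_inf) = 0 express for quadratic phi.
# A and S are stand-ins, not the paper's exact drift and diffusion.
A = np.array([[-0.5, 0.0],
              [ 1.0, -1.0]])      # stable drift (eigenvalues -1/2, -1)
S = np.array([[1.0, 0.0],
              [0.3, 0.8]])        # invertible diffusion coefficient
Q = S @ S.T
# Vectorized Lyapunov solve: (A (x) I + I (x) A) vec(Sigma) = -vec(Q)
M = np.kron(A, np.eye(2)) + np.kron(np.eye(2), A)
Sigma = np.linalg.solve(M, -Q.ravel()).reshape(2, 2)
```

Since Q is symmetric and the eigenvalues of A have negative real parts, the solution Sigma is the unique symmetric positive-definite invariant covariance.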

4.2.
Case of a step sequence b n = b 1 /n b . Since the computations are rather similar, we only sketch the essential steps of the proof.
Proof of Theorem 1.3-i). We still use Proposition 3.1, written as: Following Proposition 4.1, we identify the main part of the recursion and the rest terms: Let us remark that: Then, the arguments developed in the previous sections can easily be adapted. Below, we only detail the modifications and skip the technical details related to the discrete/continuous time scales.
Tightness of (Ṽ_n)_{n≥1}. The proof is similar to the one given in Appendix D. Indeed, sup_{n≥1} E[Ṽ_n²] < +∞ and, for any δ > 0, ∑_{k=N(n,t)}^{N(n,t+δ)+1} b_k ≤ 2δ for n large enough; thus: The only modification comes from the rest terms: We apply the Cauchy-Schwarz inequality: We now compute the expectation of the previous terms and verify that: Using Theorem 1.1 (L^p loss on the sequence (θ̄_n)_{n≥1}) and Theorem 1.2 (L² loss on the sequence (ϑ̄_n)_{n≥1}), we obtain that the bracket term is uniformly bounded in k, and therefore: Hence, the sequence (Ṽ_n) is tight. Finally, applying the above arguments and the bound on Ẽ_n, any weak limit is a solution of the martingale problem associated with the generator: It remains to identify the invariant measure associated with the Ornstein-Uhlenbeck diffusion (44). An easy computation with ϕ(x) = x²/2 yields the next result.

Appendix A. From Lyapunov contraction to non-asymptotic upper bound
The next lemmas provide non-asymptotic upper bounds derived from a Lyapunov-type contraction.
Lemma A.1. Let b_n = b_1 n^{−b} with 1/2 < b < 1, A, C > 0, r_1 > 3b/2 and r_2 > 2b. Consider a positive sequence (u_n) such that: u_{n+1} ≤ (1 − b_n) u_n + A n^{−r_1} √u_n + C n^{−r_2}. Then, for any p ∈ (b, (r_1 − b/2) ∧ (r_2 − b) ∧ 2b ∧ 1), a large enough Γ_p exists such that: Proof. We prove the result by induction. The initialization holds for a large enough Γ. Assume that Equation (45) holds for an integer n; then: where: The aim is now to prove that, for a suitable choice of p, e_{n+1} + ẽ_{n+1} is negative. For e_{n+1}, an easy computation leads to: For ẽ_{n+1}, we combine the first two terms to obtain: This difference is therefore negative as soon as n is large enough. Finally: Now, remark that if p + b ≤ (r_1 + b/2) ∧ r_2 ∧ 3b ∧ (b + 1), then a large enough Γ exists such that e_{n+1} + ẽ_{n+1} ≤ 0, which concludes the proof.
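A quick numerical illustration (not a proof) of Lemma A.1: iterating the recursive inequality at equality with illustrative parameters satisfying the assumptions, the renormalized sequence u_n n^p stays bounded for an admissible exponent p.

```python
# Numerical illustration (not a proof) of Lemma A.1 with illustrative
# parameters: b = 0.75, A = C = 1, r1 = 1.325 > 3b/2, r2 = 1.7 > 2b.
# The recursion u_{n+1} <= (1 - b_n) u_n + A n^{-r1} sqrt(u_n) + C n^{-r2}
# is iterated at equality; the lemma predicts u_n <= Gamma_p n^{-p}
# for p in (b, (r1 - b/2) ^ (r2 - b) ^ 2b ^ 1), e.g. p = 0.9 here.
b, A, C, r1, r2, p = 0.75, 1.0, 1.0, 1.325, 1.7, 0.9
u, worst = 1.0, 0.0
for n in range(1, 100_000):
    u = max(0.0, 1.0 - n ** (-b)) * u + A * n ** (-r1) * u ** 0.5 + C * n ** (-r2)
    worst = max(worst, u * (n + 1) ** p)
# worst tracks sup_n u_n n^p, which remains bounded in this illustration
```

Balancing the contraction b_n u_n against the driving term C n^{-r2} suggests a quasi-stationary regime u_n of order n^{-(r2-b)}, consistent with any admissible p < r2 - b.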

Proof.
We proceed by induction on n to prove that u_n ≤ V_n + Γ n^{−ρ} for the largest possible value of ρ. The initialization is obvious at the price of a large enough Γ. We assume that the property holds at iteration n. Then: We observe that, for n large enough: (1 + Γ V^{−1} n^{−(ρ−1)})^{1/2} ≤ 1 + (Γ/(2V)) n^{−(ρ−1)} + O(n^{−2(ρ−1)}).
Thus we deduce that: Now, using the series-integral comparison ∑_{k=1}^{n} k^{−r}(k + 1)² ≤ 4 + n^{3−r}/(3 − r) + 2 n^{2−r}/(2 − r) + n^{1−r}/(1 − r), we obtain: Let us remark that the leading term within the brackets is of order n^{−r}, which leads to the conclusion. A similar reasoning also holds if the inequality is only satisfied after some integer n_0.
Appendix B. Technical results for the quantile estimation Proof of Lemma 2.3. A straightforward computation from Equation (9) yields: Proof of i): Using (47), we deduce that: and we derive a lower bound on R for the bracket term q Φ′(θ)²/Φ(θ) + Φ′(θ)². This lower bound is obtained by a careful analysis in a neighborhood of θ_α and outside this small area. Local study. Using the definition (6), we know that Φ′(θ_α) = 0. A Taylor expansion yields: which implies, since Φ″(θ_α) = f(θ_α) > 0 and Φ‴ = f′, that: The Young inequality |ab| ≤ a²/2 + b²/2 leads to: We then define ε = f(θ_α) ‖f′‖_∞^{−1} and observe that: In the meantime, we also verify that: so that: We gather (49) and (50) and use them in Equation (48) to obtain: Far from θ_α. Remark that Φ′, given by Φ′(θ) = ∫_{θ_α}^{θ} f(s) ds, is increasing, negative on (−∞, θ_α] and positive on [θ_α, +∞). In particular, we have: A straightforward computation associated with a first-order Taylor expansion leads to: The same computations also apply at θ_α − ε, and we conclude that: We use Equation (48) again together with the previous lower bound to deduce that:

Proof of ii):
Since Φ″ = f and V_q = Φ V_{q−1}, Equation (47) yields: We have shown in i) that Φ′² Φ^{−1} is bounded near θ_α, whereas the unique zero of Φ is θ_α and Φ′ satisfies ‖Φ′‖_∞ ≤ 1. Consequently, Φ′² Φ^{−1} is a bounded function. More precisely, we observe that: We repeat the same arguments when θ ≤ θ_α and deduce that: Lemma B.1. Consider the function V_q introduced in (9) and recall that there exists n_0 > 0 such that, for any q ≥ 1, a c_q > 0 exists such that: E(V_q(θ_{n+1})) ≤ E(V_q(θ_n))(1 − a_n m + c_q a_n²) + c_q a_n² E(V_{q−1}(θ_n)) + c_q a_n^{q+1}. Then, for any q ≥ 1, there exist (A_q, B_q, C_q) ≥ 0 such that: for all n ≥ 1. In particular, Proof of Lemma B.1. Initialization for q = 1: We prove the result by induction on n. Let us first remark that V_0(θ) ≤ e + V_1(θ); then (14) becomes: E(V_1(θ_{n+1})) ≤ E(V_1(θ_n))(1 − a_n m + c_1 a_n²) + c_1 a_n² E(e + V_1(θ_n)) + c_1 a_n² ≤ E(V_1(θ_n))(1 − a_n m + 2 c_1 a_n²) + c_1(1 + e) a_n².
We then apply the same argument as in Theorem 1 of [BM11] and deduce that: which concludes the induction and proves the property for q = 1, defining: Recursion step: We assume that the property holds up to the integer q − 1 and prove similarly, by induction on n, that it holds for the integer q. We then obtain that: We then use the (rough) upper bound: We then repeat the arguments of [BM11] and deduce that: where (A_q, B_q, C_q) are given by: Technical results for the averaged quantile estimation (θ̄_n)_{n≥1}. This paragraph gathers technical results useful for the study of (θ̄_n)_{n≥1}.
The second lemma handles the rest terms given in (20).
Proof of Lemma 2.6. Using the Jensen inequality together with Proposition 2.4-iii), we deduce that: For the first term, a direct computation shows that: (53) ω_n = −(a + 1) n^{−2} a_n^{−1} + o(n^{−2+a}), which entails: ω_n² (ε_{n+1} − ε_n)² ≲ (n + 1)^{−4+2a}. Now, we apply Theorem 2.2 and conclude that ω_{n+1}² E(θ_n − θ_α)² ≲ n^{−4+a}. For the second term, we combine the bound on a_n²(1/(n + 1) − ε_{n+1}) obtained in (52) with Theorem 2.2 and deduce that: Combining the two bounds, we conclude that: Proof of Theorem 1.1-ii). We use the key relationship introduced in [Pel00], derived from the linearization: θ_{n+1} − θ_n = −a_n f(θ_α)(θ_n − θ_α) + a_n R_n + a_n ∆M_{n+1}. Dividing by a_n and summing, we obtain: Finally, dividing by n and using an Abel transform, we verify that: The Jensen inequality applied to (54) yields: We then compute the expectation of the right-hand-side terms and remark that: while Theorem 2.2 yields: since a_n = a_1 n^{−a} with a ∈ (0, 1). The generalized Hölder inequality yields (see e.g. Lemma 4.3 of [God15]): When a_k = a_1 k^{−a}, we can verify that 1/a_k − 1/a_{k−1} ≲ k^{−1+a}. Using Theorem 2.2 again, we have (E|θ_k − θ_α|^{2p})^{1/(2p)} ≲ (a_k^p)^{1/(2p)} = √a_k ≲ k^{−a/2}. Therefore, a comparison between series and integrals implies that: Concerning the rest terms, we use |R_{k+1}| ≲ |θ_k − θ_α|² (see Proposition 2.4-iii)) and Theorem 2.2, and deduce from the generalized Hölder inequality that: We remark that (∆M_{k+1})_{k≥1} is a bounded sequence of martingale increments. We apply the recursive argument stated in Lemma A.2 of [God15]: a constant K_p exists such that: We then gather Equations (55)-(59) and obtain that: for a ∈ [1/2, 1), which concludes the proof.
Appendix C. Technical results for the super-quantile algorithm ( ϑ n ) n≥1 This paragraph gives the missing details of the proof of Proposition 3.3.
Lemma C.1. Assume that the step-size sequences satisfy a n = a 1 n −a and b n = b 1 (n + 1) −b with 1/2 < a < b ≤ 1.
We now prove the two lemmas used in the proof of Proposition 3.3.

Proof of Lemma 3.4.
Recall that U_n defined in (32) is the sum of four terms, so that: We consider each term of the previous sum separately. Recall that O_n^{(1)} = θ_n − θ_α. Then, Lemma C.1 and Theorem 2.2 yield: Concerning the second term, remark that O_n^{(2)} = Z̃_n^{(2)}, so that (22) entails: For the rest term R_n, we use R_n ≤ (L/2) |θ_n − θ_α|² and Theorem 2.2, so that: Finally, for the last term, we use |R̃_n| ≤ (‖Q″‖_∞ / (2(1 − α))) |θ_n − θ_α|². Then, Theorem 2.2 leads to: Gathering all the previous bounds and using 1/2 < a < b ≤ 1, we get: Proof of Lemma 3.5. We study the variance terms brought by ∆V_n: and consider each term separately. First term. From Lemma 2.5, E({∆M_{n+1}}²) ≤ α(1 − α) + C n^{−a/2}. Combining this with the upper bound ε_n δ_n ≲ n^{−2+2a} from Lemma C.1, we observe that: Second term. We use (28) and Theorem 2.2 to deduce that: Third term. We verify with simple algebra that: We apply the Cauchy-Schwarz inequality and obtain that: E[X_{n+1} 1_{X_{n+1} ≥ θ̄_n} 1_{X_{n+1} ≤ θ_n} | F_n] ≤ √(E[X²]) √(|F(θ_n) − F(θ̄_n)|).
In particular, we verify that: E[∆M_{n+1} ∆N_{n+1}] = α(1 − α) ϑ_α + O(a_n^{1/4}). We obtain, with simple algebra using Lemma C.1, that: To conclude, we gather the three upper bounds and deduce that: Appendix D. Technical results for the central limit theorem Proof of Proposition 4.2. Proof of i): The proof relies on Theorem 7.3 of [Bil95] and is divided into three steps.
To prove the tightness of (Z̃^{(n)})_{n≥1}, we follow Theorem 7.3 of [Bil95] and are led to study P(sup_{s∈[t,t+δ]} |Z̃_s^{(n)} − Z̃_t^{(n)}| ≥ ε). This difference is decomposed as: Step 2: Study of B^{(n)} - Tightness of (B^{(n)})_{n≥1}. We consider ε > 0, δ > 0 and s ∈ [t, t + δ]. We use the triangle inequality and verify that: For the first term of the right-hand side, we use the Chebyshev inequality and obtain: Since sup_{k≥1} E‖Z̃_k‖² < +∞ and ∑_{k=N(n,t)}^{N(n,t+δ)+1} 1/k ≤ 2δ for n large enough, we then obtain: (61) We handle the rest term in the same way and control ∑_{k=N(n,t)}^{N(n,t+δ)+1} ‖Ẽ_k‖². We apply Equation (40), the triangle inequality and the Minkowski inequality, and deduce that: We now compute the expectation of the previous terms and verify that: Using Theorem 2.2 (L^p loss on the sequence (θ_n)_{n≥1}), Theorem 1.1 (L^p loss on the sequence (θ̄_n)_{n≥1}) and Theorem 1.2 (L² loss on the sequence (ϑ̄_n)_{n≥1}), we obtain that the bracket term is uniformly bounded in k, and therefore: We then apply the Chebyshev inequality and obtain: Hence, for any ε > 0 and any η > 0, a small enough δ (of order ηε²) and a large enough n_0 exist such that: Theorem 7.3 of [Bil95] then implies that (B^{(n)})_{n≥1} is a tight sequence of continuous processes.
We now study each term separately.