Stochastic Online Convex Optimization. Application to probabilistic time series forecasting

We introduce a general framework of stochastic online convex optimization to obtain fast-rate stochastic regret bounds. We prove that algorithms such as Online Newton Step and a scale-free version of Bernstein Online Aggregation achieve best-known rates in unbounded stochastic settings. We apply our approach to calibrate parametric probabilistic forecasters of non-stationary sub-gaussian time series. Our fast-rate stochastic regret bounds are any-time valid. Our proofs combine self-bounded and Poissonian inequalities for martingales and sub-gaussian random variables, respectively, under a stochastic exp-concavity assumption.


Introduction
We introduce a stochastic version of the Online Convex Optimization (OCO) analysis of Zinkevich (2003) to calibrate sequential parametric forecasters and measure their performances in stochastic environments. Let K be a convex body of R^d, i.e. a convex, compact set with a non-empty interior, and let ℓ_t, t ≥ 1, be loss functions from K to R. In Stochastic Online Convex Optimization (SOCO) analysis the loss functions ℓ_t are random elements. We consider a filtration (F_t) of non-decreasing σ-algebras such that the sequential learning algorithm predictions (x_{t+1}) and the losses (ℓ_t) are F_t-adapted. We measure the performances of the sequential algorithm with the stochastic regret

R_T(x) = Σ_{t=1}^T (L_t(x_t) − L_t(x)), x ∈ K, (1)

where L_t(x) = E[ℓ_t(x) | F_{t−1}] denotes the conditional risk. Since the stochastic regret is random, we focus on any-time valid deviation rates holding with high probability simultaneously for every T ≥ 1. The stochastic regret coincides with the usual deterministic regret if the distributions of the loss functions ℓ_t are Dirac masses for every t ≥ 0. Thus SOCO analysis encompasses the classical deterministic OCO analysis.
However, the two analyses differ in various ways. First, the stochastic environment can improve the convexity properties of the optimization problem: the conditional risk functions often have better convexity properties than the loss functions.
arXiv:2102.00729v3 [cs.LG] 21 Apr 2023

Second, the competitors in stochastic and deterministic regret bounds are not the same: the conditional risks can measure the calibration of parametric probabilistic forecasters to the conditional distributions of the environment. Third, the maximum of the deviations of the random loss functions around the conditional risks is likely to increase with the number of iterations. The sequential algorithms should be robust to these deviations.
We use the SOCO analysis to prove that some parametric forecasters are robust to sub-gaussian stochastic environments when calibrated sequentially. Our first main result in Section 3 states that calibration using the Online Newton Step (ONS) algorithm achieves a O(log T) stochastic regret bound for any conditionally sub-gaussian sequence of random losses. The fundamental assumption is a stochastic exp-concavity condition (H2) that holds for non-convex losses and unbounded gradients. The proof uses a self-normalized martingale inequality, and a Poissonian inequality valid for conditionally sub-gaussian gradients as in Condition (H3). Our study gives insight into why second-order gradient algorithms such as ONS yield a fast-rate calibration: ONS implicitly minimizes a surrogate loss involving second-order terms.
Then we extend the deterministic expert aggregation analysis by introducing the Stochastic Online Aggregation (SOA) analysis. In SOA, the experts are stochastic predictors adapted to the filtration F_{t−1}, and the aggregation algorithm competes with the best predictor. The best existing regret bounds achieve optimal rates O(log log T + E) in any stochastic environment bounded by E > 0. However, E being the maximum of the stochastic deviations, it usually increases as O(√(log T)) and deteriorates the rate in a sub-gaussian environment.
Our second result is a stochastic regret bound with rate O((log log T)²) achieved by a scale-free version of the Bernstein Online Aggregation (BOA) algorithm: we tune the multiple learning rates such that the weights are insensitive to the multiplication of the losses by a scalar. This property is crucial in the proof to deal with stochastic losses, and the obtained regret bound improves on the existing ones in some unbounded stochastic settings.
In Section 5 we show that the SOCO analysis can be used to calibrate parametric probabilistic forecasters. We consider gaussian probabilistic forecasters of time series and logarithmic losses such that the conditional risk functions coincide with the Kullback-Leibler (KL) divergence. We interpret the stochastic regret bounds as cumulative KL bounds relative to a static optimal forecaster.
We verify Condition (H2) on parametric gaussian forecasters of a time series (y_t). Then we apply SOCO to parametric forecasters using AR-ARCH modeling to predict the conditional expectations and variances. Even though the corresponding logarithmic loss functions are not convex, the conditional risk functions are still locally stochastically exp-concave. Thus we can combine the ONS and BOA algorithms to sequentially calibrate the parameters of the gaussian probabilistic forecasters. We provide fast-rate non-asymptotic theoretical guarantees for such parametric probabilistic forecasters.
The stochastic regret bound (1) is obtained using Ville (1939)'s inequality and is any-time valid. Any-time valid sequential inference has recently been applied with success in Henzi and Ziegel (2022); Shafer et al. (2021); Waudby-Smith and Ramdas (2020) to many statistical problems such as testing, comparing forecasters and designing confidence sequences. We refer to the textbook of Shafer and Vovk (2019) and the survey paper of Ramdas et al. (2022) for an exhaustive overview. Sequentially calibrated non-parametric probabilistic forecasters are developed in Chapter 12 of Shafer and Vovk (2019) with a O(√T) regret bound in any bounded stochastic environment. Faster O(log T) regret bounds for parametric prediction of deterministic individual sequences are presented in Cesa-Bianchi and Lugosi (2006); Hazan (2016) under exp-concavity assumptions.
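For the reader's convenience, the any-time validity rests on Ville's maximal inequality for nonnegative supermartingales, which we recall here (our transcription of the classical statement):

```latex
\[
\mathbb{P}\Big(\exists\, T \ge 1 \;:\; M_T \ge \delta^{-1}\,\mathbb{E}[M_0]\Big) \;\le\; \delta,
\qquad 0 < \delta \le 1,
\]
```

for any nonnegative supermartingale (M_T): the deviation bound holds simultaneously over all horizons T, which is precisely the any-time validity used throughout the paper.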
For independent and identically distributed (iid) loss functions ℓ_t, Hazan (2016); Mahdavi et al. (2015) proved that the Online Gradient Descent (OGD) and ONS algorithms satisfy stochastic regret bounds of order O(√T) and O(log T), respectively. Expert aggregation calibrated by Squint and BOA achieves a O(log log T) stochastic regret bound under the so-called Bernstein condition in the stationary bounded setting; see Koolen et al. (2016) and Wintenberger (2017), respectively. These results have been improved by a careful tuning of the learning rate in Mhammedi et al. (2019); Orseau and Hutter (2021). All existing stochastic regret bounds have a linear dependence on the maximum of the deviations of the loss functions. They all use a fast-rate "online to batch" conversion to turn deterministic regret bounds into stochastic ones; see Mehta (2017). We adopt a different approach, introducing surrogate losses and achieving results far beyond the iid environment.

Sequential learning naturally applies to time series, as recursive algorithms update their predictions when observing new data over time. However, regret bounds with high probability are rare due to the data's temporal dependence, which prevents the use of standard exponential inequalities. For stationary β- or φ-mixing time series, Agarwal and Duchi (2012) obtained fast-rate regret bounds for the unconditional risk function E[ℓ_t]. Anava et al. (2013) obtained fast-rate regret bounds for the ONS algorithm risk for ARMA (Auto-Regressive Moving-Average) models. Their notion of stochastic regret does not coincide with ours.

Preliminaries and assumptions
We use the notation 0 = (0, …, 0)^T and 1 = (1, …, 1)^T, and the operations involving vectors are understood componentwise. In the sequel ‖·‖ is the Euclidean norm ‖·‖_2. We consider a filtration (F_t), t ≥ 0, with F_0 = {∅, Ω} by convention. The proofs of the main results are deferred to Appendix A.

Definition 1 (Stochastic online convex optimization). Consider a convex body K ⊂ R^d and an F_t-adapted sequence of random loss functions (ℓ_t) defined over K. An algorithm predicts x_t ∈ K that is F_{t−1}-measurable and incurs the random conditional risk L_t(x_t) = E_{t−1}[ℓ_t(x_t)] at each step t ≥ 1. SOCO analyses the rate of the stochastic regret (1) as a function of T ≥ 1, assuming the risk functions L_t are convex for all t ≥ 1.
The main difference with the classical OCO analysis is the use of the conditional risk functions L_t instead of the loss functions ℓ_t in the regret and in the convexity assumption. The SOCO setting extends the OCO setting.

Proposition 1. Any OCO problem is a degenerate SOCO problem.
Proof. We consider that ℓ_t has a degenerate distribution δ_{ℓ_t}, the Dirac mass at ℓ_t. It is a SOCO problem equipped with the natural filtration of the losses.

The conditional distribution of the random loss function ℓ_t may depend adversarially on x_t, …, x_1 ∈ F_{t−1}, and, as in OCO, a boundedness assumption on K is necessary to obtain regret bounds.
(H1) The diameter of K is D < ∞, so that ‖x − y‖ ≤ D, x, y ∈ K, and the loss functions ℓ_t are continuously differentiable over K a.s. with integrable gradients.
Under (H1), if the loss functions (ℓ_t) are convex, the optimal rate is O(√T) for OCO and thus for SOCO problems by an application of Proposition 1. This optimal rate is achieved in SOCO problems even if the loss functions (ℓ_t) are not convex but the risk functions (L_t) are. See Appendix B.2 for the case of OGD when the gradients ∇ℓ_t are a.s. bounded by G > 0. To obtain fast-rate o(√T) stochastic regret bounds, we assume stochastic exp-concavity.
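To make the slow-rate baseline concrete, here is a minimal Python sketch of projected OGD with the classical step size η_t = D/(G√t); the function names and the l2-ball stand-in for K are ours, not the paper's.

```python
import numpy as np

def project(x, radius):
    """Euclidean projection onto the centered l2-ball of the given radius,
    our stand-in for the convex body K."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def ogd(gradients, d, D, G):
    """Projected Online Gradient Descent with eta_t = D / (G * sqrt(t)),
    the tuning that yields the O(sqrt(T)) regret when the risks L_t are convex.

    gradients: iterable of callables x -> grad of l_t at x."""
    x = np.zeros(d)
    iterates = []
    for t, grad in enumerate(gradients, start=1):
        iterates.append(x.copy())
        eta = D / (G * np.sqrt(t))
        x = project(x - eta * grad(x), radius=D / 2.0)
    return iterates
```

On quadratic losses ℓ_t(x) = (x − y_t)² with noisy targets y_t, the averaged iterate approaches the minimizer of the conditional risks, in line with the O(√T) stochastic regret of the convex case.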
Condition (H2) with α = 0 coincides with the convexity assumption on L_t, t ≥ 1. Note that (H2) with α ≥ 0 does not imply the convexity of ℓ_t, t ≥ 1. In the iid setting, stochastic exp-concavity has been studied by Koolen et al. (2016), making explicit a condition introduced in Rigollet et al. (2008). Condition (H2) was used by Gaillard and Wintenberger (2018) over the unit ℓ_1-ball, and it implies the Bernstein condition of ? introduced for convex losses. In the deterministic setting, an application of Lemma 4.3 of Hazan (2016) shows that Condition (H2) with α = 1/2(µ ∧ 1/(GD)) follows from the µ-exp-concavity of the loss functions.

Proposition 2. Assume the loss functions are twice continuously differentiable. Then Condition (H2) implies (2). Conversely, if L_t is µ-strongly convex and there exists g > 0 such that (3) holds, then Condition (H2) holds with α = µ/g².
Proof. Inequalities (2) and (3) follow easily from a second-order Taylor expansion of L_t.
We verify Condition (H2) when calibrating parametric gaussian probabilistic forecasters in Section 5. Under exp-concavity assumptions, the optimal rate is O(log T) in OCO (Hazan, 2016) and thus in SOCO. In stochastic environments the constant α > 0 depends on the conditional distributions of the losses and is unknown in practice.
Conditionally sub-gaussian random variables are such that their Orlicz norm is bounded by a constant for every t ≥ 1. This norm is not precise enough for our purpose. We require a slightly more explicit condition involving two constants. Our assumption is a conditional version of the Bernstein condition, also related to the notion of Bernstein-Orlicz norm of van de Geer and Lederer (2013).
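We recall the ψ₂ Orlicz norm behind the sub-gaussian terminology, in the standard form (the conditional version replaces the expectation by E_{t−1}):

```latex
\[
\|Y\|_{\psi_2} \;=\; \inf\Big\{ c > 0 \;:\; \mathbb{E}\big[\exp\big(Y^2/c^2\big)\big] \le 2 \Big\}.
\]
```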
Proposition 3. Assume that the gradient ∇ℓ_t(x_t) satisfies Condition (H3).

Proof. Denote Y = ‖∇ℓ_t(x_t)‖; we conclude by definition of the Orlicz norm.
Condition (H3) is satisfied in every bounded case.

3 ONS achieves fast-rate stochastic regrets

Surrogate losses
We base our approach on an observed surrogate loss that upper-bounds the stochastic regret, using an exponential inequality for martingales from Bercu and Touati (2008) valid for unbounded gradients ∇ℓ_t, t ≥ 1.
Proposition 5. Under (H1) and (H2), for any predictable sequence (x_t) and any deterministic x in K, it holds with probability 1 − δ, 0 < δ ≤ 1, that the stochastic regret is bounded by the cumulative surrogate losses. When the distributions of ℓ_t are degenerate, the upper bound in Proposition 5 becomes deterministic: since the result is then valid with probability 1, the last term disappears letting δ ↑ 1. Forthcoming results, any-time valid with high probability in a stochastic environment, are surely valid in deterministic environments when suppressing the dependence on δ.
Following ?, we interpret ℓ̃_t as a surrogate loss. The quadratic term, in addition to the gradient term, is necessary to upper-bound the unobserved conditional risk with high probability. In stochastic environments, algorithms should minimize the cumulative surrogate loss Σ_{t=1}^T ℓ̃_t rather than the cumulative loss Σ_{t=1}^T ℓ_t. Under Condition (H2) with α > 0, this additional quadratic term is counterbalanced by the compensator. The Poissonian inequality of Proposition 6 relates both quadratic terms.

The stochastic regret analysis of ONS
The cumulative surrogate loss Σ_{t=1}^T ℓ̃_t is implicitly minimized in the ONS regret analysis of Hazan (2016). Hence the ONS algorithm achieves a fast stochastic regret bound.
Using the Sherman-Morrison formula, each step of ONS has a O(d² + P) cost, where P is the cost of the projection step. If the gradients ∇ℓ_t(x_t), t ≥ 1, satisfy Condition (H3), then the squares of their Euclidean norms ‖∇ℓ_t(x_t)‖² satisfy a Poissonian exponential inequality.

Proposition 6. Under Condition (H3) the gradients ∇ℓ_t(x_t) satisfy the Poissonian exponential inequality.

Proof. Expanding the exponential and using Condition (H3), we obtain the bound for every ηG²_{ψ2} < 1, and the desired result follows.
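For concreteness, here is a minimal Python sketch of the ONS recursion with the Sherman-Morrison rank-one update of A_t^{−1}, which gives the O(d² + P) per-step cost. The Euclidean projection below is a simplification of the generalized (A_t-norm) projection used in the analysis, and the initialization A_0 = (γD)^{−2} I_d follows our reading of the proof in Appendix A.

```python
import numpy as np

class ONS:
    """Online Newton Step with a rank-one (Sherman-Morrison) inverse update.

    Simplified sketch: we project in the Euclidean norm onto the l2-ball of
    radius D/2 standing in for K, instead of the A_t-norm projection."""

    def __init__(self, d, gamma, D):
        self.x = np.zeros(d)
        self.gamma = gamma
        self.D = D
        # A_0 = (gamma * D)^{-2} I_d, hence A_0^{-1} = (gamma * D)^2 I_d
        self.A_inv = np.eye(d) * (gamma * D) ** 2

    def step(self, grad):
        # Sherman-Morrison:
        # (A + g g^T)^{-1} = A^{-1} - A^{-1} g g^T A^{-1} / (1 + g^T A^{-1} g)
        Ag = self.A_inv @ grad
        self.A_inv -= np.outer(Ag, Ag) / (1.0 + grad @ Ag)
        # Newton-type step followed by the projection onto K
        self.x = self.x - (1.0 / self.gamma) * (self.A_inv @ grad)
        n = np.linalg.norm(self.x)
        if n > self.D / 2.0:
            self.x *= (self.D / 2.0) / n
        return self.x
```

Each call to `step` costs O(d²) plus the projection, matching the complexity claim above.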
To control the second-order terms in Proposition 5, we combine the self-bounded martingale and Poissonian inequalities.
We obtain a fast-rate stochastic regret bound for the ONS tuned with γ = α/3.
Our result extends fast-rate stochastic regret bounds for ONS far beyond existing results in the iid bounded setting.
4 BOA achieves fast-rate regret bounds in Stochastic Online Aggregation

Stochastic Online Aggregation
We consider K predictors (x_t^{(i)})_{1≤i≤K} adapted to F_{t−1}, gathered in the matrix x_t. Aggregation algorithms combine the predictors with weights π_t in the simplex Λ_K, minimizing the stochastic regret. Under Condition (H2), the loss functions π → ℓ_t(x_t π) over K = Λ_K are stochastically exp-concave with the same constant α as the original loss functions ℓ_t. Applying Proposition 5 under Condition (H2), we identify the surrogate losses (4). We analyze algorithms minimizing the sum of the surrogate losses in stochastic environments. We compare the aggregation strategy x_t π_t to π ∈ {e_i, 1 ≤ i ≤ K}, i.e., with the best predictor x_t^{(i)}, using the linear losses (π_t − π)^T ℓ_t over K = Λ_K. We call this problem, encompassing the deterministic expert aggregation problem, the Stochastic Online Aggregation (SOA) problem.

The stochastic regret for the scale-free version of BOA
The version of BOA described in Algorithm 2 differs from the original BOA algorithm in Wintenberger (2017) because of the specific tuning of the multiple learning rates η_t. The specific η_t provides a self-normalization, and the algorithm is scale-free, i.e., insensitive to a multiplicative factor on the losses.
The factor 2.2 is not arbitrary: it is chosen as a small numeric constant satisfying a relation which is crucial in the proof of Theorem 5 to propagate the self-normalization in a recursive argument. The coordinate-wise learning rate η_{t,i} is only well defined after the first non-null observation. Contrary to ONS, and thanks to the adaptive learning rates, the BOA algorithm is parameter-free, as it does not require the knowledge of α, and each step has a O(K) cost. We provide a deterministic regret bound valid for any deterministic sequence.
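The following Python sketch illustrates a BOA-type second-order aggregation with per-expert adaptive learning rates. It is only an illustration: Algorithm 2 uses the specific scale-free tuning of η_{t,i} described above (with the numeric factor 2.2), which we replace here by the classical clipped rate η_{t,i} = min(1/(2E_{t,i}), √(log K / V_{t,i})).

```python
import numpy as np

def boa(loss_matrix):
    """BOA-type aggregation of K experts over T rounds (simplified sketch).

    loss_matrix: array of shape (T, K) with the expert losses l_{t,i}.
    Returns the (T, K) array of mixture weights pi_t."""
    T, K = loss_matrix.shape
    R = np.zeros(K)           # cumulative instantaneous regrets
    V = np.full(K, 1e-12)     # cumulative squared regrets
    E = np.full(K, 1e-12)     # running maximum of |regrets| (scale proxy)
    out = np.empty((T, K))
    for t in range(T):
        eta = np.minimum(1.0 / (2.0 * E), np.sqrt(np.log(K) / V))
        z = eta * R - eta ** 2 * V          # second-order corrected score
        w = eta * np.exp(z - z.max())       # BOA-style weighting, stabilized
        w /= w.sum()
        out[t] = w
        r = w @ loss_matrix[t] - loss_matrix[t]   # r_{t,i} = pi_t^T l_t - l_{t,i}
        R += r
        V += r ** 2
        E = np.maximum(E, np.abs(r))
    return out
```

The second-order correction −η²r² implements the Bernstein-type penalization behind the fast rates: the weights concentrate on the expert with the smallest cumulative loss.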
Theorem 8. For every 1 ≤ i ≤ K, the BOA algorithm 2 achieves the deterministic regret bound (5).

The first term in the regret bound (5) replaces the term M_T^∞ in the regret bounds of Mhammedi et al. (2019); Orseau and Hutter (2021). In some unbounded stochastic settings, our regret bound (5) is better for T large; for instance, when the first predictor is iid standard gaussian and the other ones are bounded. The deterministic regret bound in Theorem 8 is assumption-free. Its first term is the square root of the sum of the additional quadratic terms in the surrogate losses (4). It may increase at the rate O(√T) but, under Condition (H2), it becomes negligible. We provide a stochastic regret bound for sequential aggregation using BOA.
Theorem 9. Assume Conditions (H1), (H2) and (H3) hold for x_t^{(i)T}∇ℓ_t(x_t π_t) a.s. for all t ≥ 1, 1 ≤ i ≤ K. The scale-free BOA algorithm 2 with initial weights π_{1,i} ≥ e^{−K} for all 1 ≤ i ≤ K satisfies a stochastic regret bound, with probability 1 − 3δ, for every T ≥ 1 and m > 0 such that P(min …) is controlled.

Aggregation problems are easier than optimization ones, and BOA achieves a faster stochastic regret bound than ONS. The rate O((log log T)²) is suboptimal in the deterministic expert aggregation setting. Condition (H3) implies that the deterministic gradients are bounded by a constant G > 0, and Condition (H2) implies exp-concavity. Optimal strategies achieve a O(G log K) deterministic regret (Cesa-Bianchi and Lugosi, 2006), among them Exponentially Weighted Aggregation; but this algorithm achieves only a O(√T) stochastic regret, as shown by Audibert (2007). The best-known aggregation algorithms in deterministic and in unbounded stochastic settings are thus different. It is an open question to find an aggregation algorithm optimal in both settings, whereas Squint and the original version of BOA achieve optimal rates in bounded deterministic and stochastic settings. The choice of the initial weights π_1 being not crucial in the latter setting, we implicitly choose uniform initial weights in the sequel.

4.3 The SOCO analysis to adapt to an unknown stochastic exp-concavity constant α > 0

We study an example of BOA-ONS dealing with the adaptation to the best stochastic exp-concavity constant α. It is crucial for improving the ONS performances in any stochastic environment where, contrary to deterministic ones, there is no way to determine the optimal α, as it depends on the conditional distributions of ℓ_t. Consider the BOA aggregation of K ≥ 1 ONS predictions with different parameters γ^{(i)} ∈ {2^{−1}, …, 2^{−K}}. The resulting BOA-ONS algorithm adapts to the optimal value of α that depends on the unknown stochastic environment. The algorithm MetaGrad of ? is also able to adapt to different rates of convergence.

Corollary 10. Under (H1), (H2) and (H3) with α ≥ 2^{−K−2}, the BOA-ONS algorithm satisfies, with probability 1 − 4δ, the stochastic regret bound.

Proof. We combine the stochastic regret bound of Theorem 9 with the inequality (11).

5 BOA-ONS for sequential prediction and probabilistic forecast of time series

Probabilistic forecasting
Observing a time series (y_t), we use the SOCO analysis to calibrate some parametric probabilistic forecasters in the sense of Chapter 12 of Shafer and Vovk (2019). In our setting sequential algorithms predict x_t and parametrize a probabilistic forecaster P_{x_t}. Given a scoring rule S, the loss at step t is ℓ_t(x_t) = S(P_{x_t}, y_t). The expected score, also denoted by S in Gneiting and Raftery (2007), is a discrepancy measure between probabilities, where P_t denotes the distribution of y_t given F_{t−1}. Whether Condition (H2) holds depends on the scoring rule S, the parametrization x → P_x, and the distribution P_t of the variable of interest y_t given F_{t−1}. If Condition (H2) is satisfied in well-specified settings P_t = P_{x_t^*} for some x_t^* ∈ K, then S is a proper scoring rule for the class {P_x ; x ∈ K} in the sense of Gneiting and Raftery (2007): S(P_y, P_t) is minimal when P_y = P_t by convexity. The scoring rule is not necessarily strictly proper, since this minimum is not unique when ∇_y S(P_y, P_t) is null in some directions y in the neighborhood of x_t^*. We provide examples of time series probabilistic forecasting calibrated using the SOCO analysis by verifying Condition (H2). We focus on the logarithmic score, assuming that P_x, P_t admit densities p_x, p_t, x ∈ K, t ≥ 1. We have S(P_y, P_t) − S(P_t, P_t) = KL(P_t, P_y), where KL is the Kullback-Leibler divergence. This scoring rule is strictly proper because S(·, P_t) is minimized at P_t only. It is likely to satisfy the stochastic exp-concavity condition (H2) locally in well-specified settings.
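To make the logarithmic-score identities concrete for gaussian forecasters, here is a short Python sketch (the function names are ours): the expected score difference E_{P_t}[S(P_x, y) − S(P_t, y)] equals KL(P_t, P_x).

```python
import math

def log_score(mu, sigma2, y):
    """Logarithmic score S(P_x, y) = -log p_x(y) of the gaussian
    forecaster N(mu, sigma2) at the observation y."""
    return 0.5 * (math.log(2.0 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)

def kl_gauss(mu_p, s2_p, mu_q, s2_q):
    """Kullback-Leibler divergence KL(N(mu_p, s2_p), N(mu_q, s2_q))."""
    return 0.5 * (math.log(s2_q / s2_p) + (s2_p + (mu_p - mu_q) ** 2) / s2_q - 1.0)
```

In particular `kl_gauss` vanishes if and only if the two gaussians coincide, illustrating strict properness of the logarithmic score within the gaussian family.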
Proposition 11. If P_t belongs to an exponential family, so that its conditional density p_t(y) is proportional to e^{x_t^{*T} T(y)} with sufficient statistic T(y) and some x_t^* ∈ K, then for the logarithmic score necessarily α ≤ 1 if Condition (H2) holds.
Proof. We apply Proposition 2, noticing that the Fisher information identity holds in the well-specified setting.
We use the logarithmic score for calibrating the first and second moments of gaussian forecasters, as recommended in Section 4.4 of Gneiting and Raftery (2007). Giraud et al. (2015) focus on the estimation of m_t = E_{t−1}[y_t], establishing fast-rate stochastic regret bounds in expectation.
We also focus on the estimation of the conditional variance, or volatility, σ_t² = Var_{t−1}(y_t) for gaussian probabilistic forecasters. To the best of our knowledge, stochastic regret bounds for sequential algorithms calibrating the volatility have not been established yet. However, the concept of volatility is important and required in many applications such as risk assessment and probabilistic forecasting in finance (McNeil et al., 2015; Shafer and Vovk, 2019). The logarithmic score is well suited to measure the performances of volatility estimators as it is robust to extreme values (Patton, 2011).
Example 2 (Estimation of the volatility).
This assumption is unrealistic when y_t is concentrated around its conditional mean m_t. Using SOCO, we can handle conditional distributions P_t with mean m_t and volatility σ_t²; see Proposition 13 for more details.
The stochastic exp-concavity condition is well preserved under linear multivariate parametrization. Thus the conditional expectation and the volatility can be expressed as linear combinations of the past observations y_{t−1}, …, y_1 or their squares y²_{t−1}, …, y²_1. We naturally obtain AR and ARCH estimations of the conditional expectation and the volatility in Sections 5.2 and 5.3, respectively. Combining both, we obtain the AR-ARCH gaussian forecaster studied in Section 5.4. The parametrization does not preserve the strict properness of the logarithmic loss function: despite the logarithmic score being strictly proper over all probability measures, it is not for the AR-ARCH models, because different linear combinations of past observations provide the same probabilistic forecaster. The stochastic exp-concavity condition (H2), which is more general than strict properness, is thus crucial.
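As an illustration of the linear AR-ARCH parametrization, the following Python sketch evaluates the logarithmic loss of a gaussian AR(p)-ARCH(q) forecaster. The exact constraint set K and the volatility lower bound cσ²/2 used in Sections 5.2-5.4 differ; the intercept `floor` below is our hypothetical stand-in for that lower bound.

```python
import numpy as np

def ar_arch_logloss(x, y_hist, y_t, p, q, floor):
    """Logarithmic loss of the gaussian forecaster N(m_t(x), sigma2_t(x)) with
    m_t(x) = sum_{j<=p} x_j * y_{t-j}  (AR part) and
    sigma2_t(x) = floor + sum_{j<=q} x_{p+j} * y_{t-j}^2  (ARCH part,
    assuming x_{p+1:p+q} >= 0 so that the volatility stays above `floor`).

    y_hist: past observations y_1, ..., y_{t-1} in chronological order."""
    past = y_hist[::-1]                          # y_{t-1}, y_{t-2}, ...
    m = x[:p] @ past[:p]                         # conditional mean
    s2 = floor + x[p:p + q] @ (past[:q] ** 2)    # conditional volatility
    return 0.5 * (np.log(2.0 * np.pi * s2) + (y_t - m) ** 2 / s2)
```

The loss is non-convex in x in general, but the corresponding conditional risks remain locally stochastically exp-concave, as used in the sequel.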
To tackle the case of ARMA models with a moving average component, we consider increasing orders p, since any invertible ARMA model admits an AR(∞) representation. For the orders p ∈ {1, …, √(log T)/log log T}, the ONS predictors m̂_t^{(p)} achieve a stochastic regret bound refining the bound obtained by Anava et al. (2013). Our bound is valid in every sub-gaussian stochastic adversarial setting where 2|m_t| ≤ D, and the time series (y_t) does not have to be bounded as in Anava et al. (2013). Moreover, our bounds are any-time valid with high probability.
The parameters (M, σ²) should be tuned to find the best compromise in the regret bound (6). However, the task is not feasible using the SOA analysis because the loss functions depend on these parameters. The solution comes from the econometrics literature, which provides better loss and risk functions by introducing the concept of volatility.
Any invertible GARCH model admits an ARCH(∞) representation. Thus we consider ARCH(q) models with increasing order q. We consider the BOA-ONS estimator σ̂_t² aggregating σ̂_t^{2,(q)}(x_t), q = 1, …, √(log T)/log log T, so that, with high probability, the cumulative divergence is bounded by the minimum over q and x of Σ_t KL(P_t, N(0, σ_t^{2,(q)}(x))) + O(q(log T + G²_{ψ2} log(δ^{−1}))). We solve positively the question raised in the conclusion of Anava et al. (2013) about the optimization of GARCH forecasters. The main restriction of our approach is the small range of the volatilities [cσ²/2, σ²], 1 < c < 2: otherwise, the risk functions are not even convex when the volatility σ_t² can be over-estimated by a factor of 2. This is not surprising since Francq and Zakoïan (2010) showed that the Quasi-Likelihood approach is inconsistent without a lower boundedness assumption on the volatilities.

Online gaussian probabilistic forecasting using BOA-ONS
We combine the ARMA and volatility prediction methods. We consider the gaussian probabilistic forecaster N(m̂_t^{(p)}(x_{1:p}), σ̂_t^{2,(q)}(x_{p+1:p+q})) with M² = σ² and x = (x_{1:p}, x_{p+1:p+q}) ∈ K = {x ∈ R^{p+q} : ‖x_{1:p}‖_1 ≤ 1, x_{p+1:p+q} ≥ 0, ‖x_{p+1:p+q}‖_1 ≤ 1 − c/2}.

Proposition 14. Assume that the distributions P_t of y_t given y_{t−1}, …, y_1 admit densities with means and volatilities satisfying E_{t−1}[…] ≤ 3σ⁴ a.s., 1 < c < 2, σ² > 0 for every t ≥ 1, and satisfy (H3). Then the gaussian forecaster N(m̂_t^{(p)}(x), σ̂_t^{2,(q)}(x)) calibrated by the ONS algorithm with γ = 3 × 2⁵/((c − 1)c⁴) achieves the stochastic regret bound.

Aggregating such predictors for 1 ≤ p, q ≤ √(log T)/log log T with BOA, we obtain a gaussian probabilistic forecast N(m̂_t, σ̂_t²) with high probability guarantees. The sequential algorithms adapt to the random environment even in misspecified settings: the forecaster approximates the parametric gaussian forecaster that is the closest to the unknown conditional distributions for the cumulative KL divergences, up to a penalty which increases as (p + q) log T. Thus the BOA-ONS forecaster automatically minimizes a Bayesian-information-type criterion, at any time and with high probability. It is comparable to a model selection procedure that would require minimizing a penalized log-likelihood at each step 1 ≤ t ≤ T. The computational cost of our recursive method is O(T((p + q)² + P)) with explicit formulae, except for the projection step of computational cost P, whereas the batch model selection has a computational cost O(T(p + q)M), where M is the computational cost of the optimization of the likelihood in AR(p)-ARCH(q) models. This cost M is prohibitive when p + q is large, and the computational gain of our recursive procedure is then important.

Sequential probabilistic forecasting using BOA-ONS
The main drawback of our BOA-ONS approach on gaussian forecasters is the restriction σ_t² ∈ [cσ²/2, σ²], 1 ≤ t ≤ T. However, because the loss and risk functions depend on this hyperparameter, it is not possible to directly aggregate volatility estimators with different σ² > 0 in a gaussian forecaster to extend the range of the volatilities.
To circumvent the issue, we can aggregate the gaussian probabilistic forecasters to obtain a probabilistic forecaster that is mixed gaussian. Consider P̂_t = (P̂_t^{(i)})_{1≤i≤K}, K weak probabilistic forecasters with densities p̂_t = (p̂_t^{(i)})_{1≤i≤K}. Consider the SOCO analysis of mixtures x^T P̂_t with K = Λ_K and ℓ_t(x) = −log(x^T p̂_t(y_t)). Similar fast-rate regret bounds were obtained by Thorey et al. (2017) for the CRPS score instead of the KL divergence. They used the Recursive Least Squares algorithm without projection, which does not constrain π_t to be in Λ_K. Contrary to our procedure, their ensemble probabilistic forecast is difficult to interpret because it does not satisfy the axioms of a density function.
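A short Python sketch of the corresponding mixture loss ℓ_t(x) = −log(x^T p̂_t(y_t)) for gaussian components (function names ours); by Jensen's inequality it never exceeds the weighted average of the individual log-losses.

```python
import numpy as np

def mixture_logloss(weights, means, sigma2s, y):
    """Logarithmic loss -log(pi^T p_t(y)) of the mixture of K gaussian
    forecasters N(means[i], sigma2s[i]) with weights pi in the simplex."""
    dens = np.exp(-0.5 * (y - means) ** 2 / sigma2s) / np.sqrt(2.0 * np.pi * sigma2s)
    return -np.log(weights @ dens)
```

Because the loss is the negative logarithm of a linear function of the weights, it is exp-concave in x over the simplex, which is what makes the SOCO analysis applicable.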

Aggregations in stochastic environments
We study the impact of stochastic deviations on the aggregation of predictors for quadratic losses. We consider 100 predictors of y_t = 0, t ≥ 1, the first one being negatively biased, −1/√t + σN_t^{(1)}, the other ones being positively biased, 1/√t + σN_t^{(i)}, where the N_t^{(i)} are iid standard gaussian random variables. Any aggregation half-weighting the first predictor does not suffer from the bias. We run 100 Monte-Carlo experiments of three different aggregation algorithms: the original version of BOA of Wintenberger (2017), the scale-free version of BOA of Algorithm 2, and the Squint algorithm of Koolen et al. (2016). The latter algorithm is not comparable as it uses beforehand the maximum of the deviations for initializing the algorithm. As expected, its performances compared with the scale-free version of BOA highly depend on the level of the stochastic deviations, outperforming it when σ = .1; see Figure 1. The original version of BOA does not manage to learn efficiently the range of small deviations and does not outperform the scale-free version of BOA in this case, because it also suffers from a small observed minimal loss m. On the opposite, for large deviations, both BOA versions achieve performances competitive with Squint, the scale-free BOA outperforming the two other algorithms when σ = 10. We illustrate the impact of the SOCO analysis on quantile predictions for weekly electricity load, with data available in the Opera package developed by Gaillard et al.
(2021). The 3 forecasters (GAM, AR, GBM) provided in the Opera package plus 2 constant forecasters, 0 and 1.5 times the maximum of weekly loads, are aggregated to predict the upper and lower quantiles of levels .5 and .95. We use the quantile loss function in 2 different sequential aggregation algorithms, the Exponentially Weighted Algorithm (EWA) and BOA, for the two levels .5 and .95. BOA aggregations provide accurate quantile predictions because BOA minimizes cumulative risks in the SOA analysis. This confirms the theoretical guarantees obtained in the paper, since it is likely that the pinball risk is strongly convex (Steinwart and Christmann, 2011). On the contrary, EWA aggregations fail to provide accurate quantile predictions because the EWA algorithm minimizes the cumulative losses, which are not exp-concave. Such visual validation of the prediction intervals is enough to show the benefit of BOA but does not constitute any evidence of its good calibration. ? analyze the asymptotic guarantees of a different sequential algorithm predicting quantiles.

Volatility estimation during the COVID crisis
We apply BOA-ONS to design 90%-prediction intervals for the S&P500 index during 2020, including the COVID crisis in March. We use the iid N(0, x) and ARCH(p) gaussian probabilistic forecasters for p = 1, …, 5. The forecasters are tuned sequentially with the ONS algorithm with γ = 1, and K = [c, ∞) × B_1(1), with c = 0 in the iid case and c = 10^{−16} in the ARCH cases. These 6 predictors of the volatility are then aggregated with BOA; see Figure 6.3. We notice that the iid forecast prediction interval is constant after some training period. The ARCH forecasts are required to predict intervals accurately during the crisis. BOA aggregations converge to the weights (0.01, 0.17, 0.09, 0.37, 0.21, 0.15) and improve the calibration of the ARCH forecasters. A slightly more advanced sequentially calibrated volatility estimator developed by Werge and Wintenberger (2022) has been used in the forecast task of the M6 financial competition by ?. Its RPS performances rank 5th out of 163 competitors, showing that such sequential calibration is competitive in probabilistic forecasting.

Conclusion and future works
In this paper, we derive fast-rate stochastic regret bounds for the ONS and BOA algorithms under stochastic exp-concavity. We alleviate the convexity assumption on the loss functions to calibrate sequentially parametric probabilistic forecasters using the logarithmic score. We achieve fast-rate stochastic regret bounds. Thus, BOA-ONS can adaptively and efficiently calibrate gaussian probabilistic forecasters for any conditionally sub-gaussian non-stationary time series. Our stochastic regret bounds are relative to a static prediction parametrized by x ∈ K for every t ≥ 1. When forecasting non-stationary time series, we should also consider competitors that evolve through time. The key Propositions 5 and 6 extend readily to such settings, called tracking optimization problems. Thus, one would like to develop SOCO analyses and algorithms in more dynamic settings. A first step in that direction is made in Haddouche et al. (2023).

References

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928-936, 2003.
A Proofs of the main results

A.1 Proof of Proposition 4
We first show that

Then we derive

Using the Cauchy-Schwarz inequality, we derive that

Then we fix G_{ψ2} = 2K_{ψ2}² G² so that Condition (H3) follows. Let us denote by µ_t and σ_t the mean and the variance of the conditionally gaussian random variable. Then, N being standard gaussian distributed, we use the homogeneity and the triangular inequality of the norm ‖·‖_{ψ2} to derive

and the desired result follows from the Cauchy-Schwarz inequality.

A.2 Proof of Proposition 5
Denoting Y_t = ∇ℓ_t(x_t)^T(x_t − x), we observe that under (H2) it holds

Moreover, from Lemma B.1 of Bercu and Touati (2008), for any random variable Y_t and any η ∈ R we have

Developing the square, we obtain

Using Young's inequality together with Jensen's, we derive

and the exponential inequality

We obtain the desired result by applying a classical martingale argument due to Ville (1939) and Freedman (1975), recalled in Appendix B.1. Indeed, we use the notation of Appendix B.1 with

where M_T = exp(∑_{t=1}^T Z_t). Considering η = −λ/2 for any λ > 0, it holds with probability 1 − δ, for any T ≥ 1,

which, combined with (8), yields the desired result.
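For the reader's convenience, the inequality of Lemma B.1 of Bercu and Touati (2008) invoked here can be written, in its conditional form, as follows (our transcription; see the original reference for the precise statement):

```latex
% Lemma B.1 of Bercu and Touati (2008), conditional form: for any
% square-integrable random variable Y_t and any \eta \in \mathbb{R},
\mathbb{E}\!\left[\exp\!\left(\eta Y_t
    - \frac{\eta^2}{2}\left(Y_t^2 + \mathbb{E}[Y_t^2 \mid \mathcal{F}_{t-1}]\right)\right)
  \,\middle|\, \mathcal{F}_{t-1}\right] \le 1 .
```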

A.3 Proof of Theorem 7
From the proof of the ONS regret bound in Hazan (2016), we obtain from the expression of the recursive steps (and without using the convexity of the loss)

Plugging this inequality into the previous bound, we obtain

Then we apply the Poissonian exponential inequality of Proposition 6 to the second-order terms. More precisely, denoting 0 ≤ Y_t = (∇ℓ_t(x_t)^T(x_t − x))² / (2(G_{ψ2}D)²), we obtain

Combined with the argument due to Freedman (1975) recalled in Appendix B.1, we derive

Thus a union bound provides

From the initialization A_0 = (1/(γD)²) I_d, we obtain the bound

We apply the Poissonian exponential inequality of Proposition 6 to the second-order terms 0 ≤ Y_t = ‖∇ℓ_t(x_t)‖² / (2G²_{ψ2}) and, combined with the argument due to Freedman (1975) and Condition (H3), we obtain

We derive that, with probability 1 − δ, it holds

The desired result follows from the specific choice of γ and a union bound.
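To fix ideas, the ONS recursion that this proof refers to can be sketched as follows. This is a hedged illustration, not the paper's exact algorithm: the feasible set K is taken to be a Euclidean ball, a plain Euclidean projection replaces the generalized A_t-norm projection of Hazan (2016), and the toy losses and all names are ours.

```python
import numpy as np

def ons_step(x, grad, A_inv, gamma=1.0, radius=1.0):
    """One Online Newton Step in the style of Hazan (2016).

    A plain Euclidean projection onto the ball B(0, radius) stands in for
    the exact A_t-norm projection onto K used in the paper's analysis."""
    # Rank-one (Sherman-Morrison) update of A^{-1} after A <- A + grad grad^T.
    Ag = A_inv @ grad
    A_inv = A_inv - np.outer(Ag, Ag) / (1.0 + grad @ Ag)
    # Newton-style step.
    y = x - (1.0 / gamma) * (A_inv @ grad)
    # Approximate projection back onto the feasible set.
    norm = np.linalg.norm(y)
    if norm > radius:
        y *= radius / norm
    return y, A_inv

# Toy usage on quadratic losses l_t(x) = ||x - x_star||^2 / 2 (gradient x - x_star).
d = 3
x_star = np.array([0.3, -0.2, 0.1])
x, A_inv = np.zeros(d), np.eye(d)  # initialization A_0 = I_d
for _ in range(200):
    x, A_inv = ons_step(x, x - x_star, A_inv)
```

On this toy quadratic problem the iterates approach x_star; maintaining A^{-1} directly via the rank-one update avoids solving a linear system at each step.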

A.4 Proof of Theorem 8
We keep the same notation and conventions as in Section 4.2. In particular, inequalities involving vectors hold coordinatewise. With no loss of generality we assume that η_{1,i} = 0 for all 1 ≤ i ≤ K. To prove the regret bound (5), we will first show the exponential inequality (12).

From (12) we derive

and, rearranging the sum, we obtain

Thus we derive from a sum-integral comparison

The learning rate satisfies the relation

and the regret bound (5) follows from the expression of log(A_T).
It remains to prove the exponential inequality (12). We use the identity

To initiate the recursion, we use the basic inequality x ≤ x^α + e^{−1}(α − 1)/α, valid for x ≥ 0 and α ≥ 1, with x = exp(−η_T L_{T−1}) and α = η_{T−1}/η_T, so that

We obtain

Then we use the expression

We use different bounds on the function ϕ : y ∈ R ↦ exp(−y/(1 + 2.2y²) − y²/(1 + 2.2y²)): namely ϕ(y) ≤ e/2 and ϕ(y) ≤ 1 − y/(1 + 2.2y²) for any y ∈ R, and ϕ(y) ≤ 1 − y if y ≤ 1/4. Distinguishing whether x_T is larger than 1/4 or not, we deduce

Using the relations η_{T−1}/η_T ≥ 1 and 1 − y/(1 + 2.2y²) > 0, y ∈ R, we upper bound the second term by

Combining it with the previous bound, we achieve

The second inequality is obtained as follows. We have

We recognize the weights, and the second term in the upper bound is proportional to π_T^T(ℓ_T − π_T^T ℓ_T 1) = 0 and thus vanishes. We obtain

We bound the exponent term, assuming with no loss of generality that if max_{2≤t≤T} x_{t,i} > 1/4 then it happens for the last iterate x_T. We have exp(−η_1 L_1) ≤ e using the relation |η_1 L_1| ≤ 1, and from the sum-integral comparison we achieve (12).

A.5 Proof of Theorem 9
From the regret bound (5), keeping the notation of (12) and applying Young's inequality, we infer that for any η > 0

Plugging this bound into (4) and identifying ℓ_t = x_t^T ∇ℓ_t(x_t π_t) and x_t = x_t π_t, we obtain

Applying once again the Poissonian inequality (10), and using that the diameter of the simplex is less than 1, we derive that with probability 1 − δ

Then we obtain

Thus, choosing λ = η = α/3 and introducing ∇ℓ_t(x̂_t) to bound log(A_T) roughly, we obtain

From the proof of Proposition 6 applied to the second-order terms 0 ≤ Y_t = ‖∇ℓ_t(x̂_t)‖²/(2G²_{ψ2}), we obtain

Finally, we obtain the desired result using a union bound.

A.6 Proof of Proposition 12
We denote y

Because the second derivatives do not depend on x, a Taylor expansion provides

The first inequality comes from the relations

Applying Theorem 7, the ONS achieves the stochastic regret bound, with high probability, against every competitor. Since the risk satisfies the relation

we obtain the desired result.
We conclude the proof by identifying the KL divergence with L t up to additive constants.

B Auxiliary results
B.1 The stopping time argument of Ville (1939) and Freedman (1975)

We recall the argument of Ville (1939) and Freedman (1975), as we apply it several times in the proofs of the paper. Consider M_T = exp(∑_{t=1}^T Z_t) for any (Z_t) adapted to a filtration (F_t) and satisfying the exponential inequality E[exp(Z_t) | F_{t−1}] ≤ 1. Then we have P(∃T ≥ 1 : ∑_{t=1}^T Z_t > log(δ^{−1})) ≤ δ for any 0 < δ < 1 by applying the following lemma.

Lemma 15. If (M_t) is adapted to (F_t), M_0 = 1 a.s. and E[M_t | F_{t−1}] ≤ M_{t−1} a.s., t ≥ 1, then, for any 0 < δ < 1, it holds P(∃T ≥ 1 : M_T > δ^{−1}) ≤ δ.
Proof. We apply the optional stopping theorem together with Markov's inequality, defining the stopping time τ = inf{t ≥ 1 : M_t > δ^{−1}}, so that P(∃t ≥ 1 : M_t > δ^{−1}) = P(τ < ∞) ≤ δ.

This simple extension of the usual iid setting to any stochastic adversarial setting could also be obtained by classical arguments such as Azuma's inequality, used in Chapter 9 of Hazan (2016). It relies on the martingale ∑_{t=1}^T (∇L_t(x_t) − ∇ℓ_t(x_t))^T(x_t − x*) and the gradient trick on L_t to remove the convexity assumption on the losses ℓ_t.
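Lemma 15 can be checked numerically. Below is a small Monte Carlo sketch; the increments Z_t = λX_t − λ²/2 with X_t iid standard gaussian are an illustrative choice satisfying E[exp(Z_t)] = 1 exactly, and all names are ours.

```python
import numpy as np

# Monte Carlo illustration of Lemma 15 (Ville's maximal inequality):
# for M_T = exp(sum Z_t) with E[exp(Z_t) | F_{t-1}] <= 1, the probability
# that M_t ever exceeds 1/delta is at most delta.  Here Z_t = lam*X_t - lam^2/2
# with X_t iid standard gaussian, so E[exp(Z_t)] = 1 exactly.
rng = np.random.default_rng(0)
lam, delta, n_paths, T = 0.5, 0.1, 2000, 500

X = rng.standard_normal((n_paths, T))
log_M = np.cumsum(lam * X - lam**2 / 2, axis=1)        # log M_t along each path
ever_exceeds = log_M.max(axis=1) > np.log(1 / delta)   # sup_t M_t > 1/delta ?
freq = ever_exceeds.mean()
# Ville's inequality predicts freq <= delta, up to Monte Carlo noise.
```

The empirical crossing frequency stays below δ = 0.1 up to sampling noise, in line with the lemma.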
O(log T) or O((log log T)²), for all T ≥ 1. The stochastic regret coincides with the one of OCO. The forecasters are aggregated with BOA in m̂_t. The obtained BOA-ONS algorithm achieves the cumulative KL-divergence bound ∑_{t=1}^T KL(P_t, N(m̂_t, σ²)) ≤ min_{1≤p≤√(log T/log log T)} min_{x∈B_1(1)} ∑_{t=1}^T KL(P_t, N(m_t, σ²)). The risk function is m-strongly convex and Condition (H2) is satisfied with α = m/M. Under Condition (H3), we can use the ONS algorithm on the simplex K = Λ_K, and we obtain a stochastic regret bound of order O(MK/m (log T + log(δ^{−1}))).

Figure 2: 90%-prediction intervals of the electricity load based on EWA (left) and BOA (right) and the same 5 forecasters.

B.2 SOCO analysis of the OGD algorithm

In this section we work under (H1) and (H2) with α = 0. Proposition 5 holds with λ > 0 = α, and the compensator term in Proposition 5 is positive. We also assume in this section that the gradients are bounded by G < ∞. A slow-rate stochastic regret bound O(GD√T) is expected, and the surrogate loss in Proposition 5 is useless. The classical Online Gradient Descent (OGD) of Zinkevich (2003),

x_{t+1} = arg min_{x∈K} ‖x − (x_t − D/(G√t) ∇ℓ_t(x_t))‖, starting from x_0 ∈ K,

satisfies a linearized regret bound on ∑_{t=1}^T ∇ℓ_t(x_t)^T(x_t − x) in any SOCO problem; see the proof in Hazan (2016), which does not use any convexity assumption. Thus we easily bound a.s. both extra quadratic terms in Proposition 5 by the same quantity (λ/2) G²D²T. Choosing λ = 2 log(δ^{−1})/(GD√T), we immediately obtain a new slow-rate stochastic regret bound for the OGD valid in any SOCO problem.

Theorem 16. Assume that (H1) holds and that sup_{x∈K} ‖∇ℓ_t(x)‖ ≤ G a.s., t ≥ 1. The OGD algorithm satisfies, with probability 1 − δ, the stochastic regret bound

for any T ≥ 1 and any x ∈ K.
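The OGD recursion with step size D/(G√t) can be sketched as follows. This is a hedged illustration under our own assumptions: K is taken to be a Euclidean ball of diameter D purely for concreteness, and the toy losses and all names are ours.

```python
import numpy as np

def ogd(grad_fn, d, D, G, T):
    """Projected Online Gradient Descent (Zinkevich, 2003) with step size
    D/(G*sqrt(t)); here K is the Euclidean ball B(0, D/2), of diameter D."""
    x = np.zeros(d)
    radius = D / 2.0
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x.copy())
        x = x - (D / (G * np.sqrt(t))) * grad_fn(x, t)
        norm = np.linalg.norm(x)
        if norm > radius:  # Euclidean projection onto B(0, D/2)
            x *= radius / norm
    return iterates

# Toy run: losses l_t(x) = |x - 0.2| in dimension 1, so gradients are bounded by G = 1.
iters = ogd(lambda x, t: np.sign(x - 0.2), d=1, D=2.0, G=1.0, T=2000)
# Average loss along the trajectory is small, as the O(GD*sqrt(T)) regret bound predicts.
avg_loss = float(np.mean([abs(v[0] - 0.2) for v in iters]))
```

The average loss along the trajectory is of order GD/√T, consistent with the slow-rate bound of Theorem 16.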