Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance

In this paper, a general stochastic optimization procedure is studied, unifying several variants of stochastic gradient descent such as, among others, the stochastic heavy ball method, the Stochastic Nesterov Accelerated Gradient algorithm (S-NAG), and the widely used Adam algorithm. The algorithm is seen as a noisy Euler discretization of a non-autonomous ordinary differential equation, recently introduced by Belotto da Silva and Gazeau, which is analyzed in depth. Assuming that the objective function is non-convex and differentiable, the stability and the almost sure convergence of the iterates to the set of critical points are established. A noteworthy special case is the convergence proof of S-NAG in a non-convex setting. Under some assumptions, the convergence rate is provided in the form of a Central Limit Theorem. Finally, the non-convergence of the algorithm to undesired critical points, such as local maxima or saddle points, is established. Here, the main ingredient is a new avoidance of traps result for non-autonomous settings, which is of independent interest.


Introduction
Given a probability space $\Xi$, an integer $d > 0$, and a function $f : \mathbb{R}^d \times \Xi \to \mathbb{R}$, consider the problem of finding a local minimum of the function $F(x) := \mathbb{E}_\xi[f(x, \xi)]$ w.r.t. $x \in \mathbb{R}^d$, where $\mathbb{E}_\xi$ represents the expectation w.r.t. the random variable $\xi$ on $\Xi$. The paper focuses on the case where $F$ is possibly non-convex. It is assumed that the function $F$ is unknown to the observer, either because the distribution of $\xi$ is unknown, or because the expectation cannot be evaluated. Instead, a sequence $(\xi_n : n \ge 1)$ of i.i.d. copies of the random variable $\xi$ is revealed online.
While the Stochastic Gradient Descent is the most classical algorithm used to solve such a problem, several other algorithms have recently become very popular. These include the Stochastic Heavy Ball (SHB), the stochastic version of Nesterov's Accelerated Gradient method (S-NAG), and the large class of so-called adaptive gradient algorithms, among which Adam [30] is perhaps the most used in practice. As opposed to the vanilla Stochastic Gradient Descent, the study of such algorithms is more involved, for three reasons. First, the update of the iterates involves a so-called momentum term, or inertia, which has the effect of "smoothing" the increment between two consecutive iterates. Second, the update equation at the time index $n$ is likely to depend on $n$, making these systems inherently non-autonomous. Third, as far as adaptive algorithms are concerned, the update also depends on some additional variable (a.k.a. the learning rate) computed online as a function of the history of the computed gradients.
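As a concrete illustration of the momentum term, the following minimal sketch (not taken from the paper; the quadratic objective, stepsizes, and coefficients are illustrative choices) runs a stochastic heavy-ball recursion on $F(x) = x^2/2$ with noisy gradients:

```python
import random

# Illustrative sketch (not the paper's algorithm): a stochastic heavy-ball
# recursion m_{n+1} = (1 - gamma*r) m_n + gamma*h*g_n, x_{n+1} = x_n - gamma*m_{n+1}
# on the toy objective F(x) = x^2/2 with additive Gaussian gradient noise.
def shb(x0, n_iter=5000, h=1.0, r=1.0, seed=0):
    rng = random.Random(seed)
    x, m = x0, 0.0
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n**0.7          # vanishing stepsizes whose sum diverges
        g = x + rng.gauss(0.0, 1.0)   # noisy gradient of F(x) = x^2/2
        m = (1 - gamma * r) * m + gamma * h * g
        x = x - gamma * m
    return x

x_final = shb(5.0)
```

The momentum variable `m` is an exponentially weighted average of the past noisy gradients, which is precisely the "smoothing" of the increments mentioned above.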
In this work, we study in a unified way the asymptotic behavior of these algorithms in the situation where F is a differentiable function which is not necessarily convex, and where the stepsize of the algorithm is decreasing.
Our starting point is a generic non-autonomous Ordinary Differential Equation (ODE) introduced by Belotto da Silva and Gazeau [9] (see also [8] for Adam), depicting the continuous-time versions of the aforementioned florilegium of algorithms. The solutions to the ODE are shown to converge to the set of critical points of F . This suggests that a general provably convergent algorithm can be obtained by means of an Euler discretization of the ODE, including possible stochastic perturbations. Special cases of our general algorithm include SHB, Adam and S-NAG. We establish the almost sure boundedness and the convergence to critical points. Under additional assumptions, we obtain convergence rates, under the form of a central limit theorem. These results are new. They extend the works of [24,8] to a general setting. In particular, we highlight the almost sure convergence result of S-NAG in a non-convex setting, which is new to the best of our knowledge.
Next, we address the question of the avoidance of "traps". In a non-convex setting, the set of critical points of a function F is generally larger than the set of local minimizers. A "trap" stands for a critical point at which the Hessian matrix of F has negative eigenvalues, namely, it is a local maximum or saddle point. We establish that the iterates cannot converge to such a point, if the noise is exciting in some directions. The result extends previous works of [24] obtained in the context of SHB. This result not only allows to study a broader class of algorithms but also significantly weakens the assumptions. In particular, [24] uses a sub-Gaussian assumption on the noise and a rather stringent assumption on the stepsizes. The main difficulty in the approach of [24] lies in the use of the classical autonomous version of Poincaré's invariant manifold theorem. The key ingredient of our proof is a general avoidance of traps result, adapted to non-autonomous settings, which we believe to be of independent interest. It extends usual avoidance of traps results to a non-autonomous setting, by making use of a non-autonomous version of Poincaré's theorem [17,31].
Paper organization. In Section 2, we introduce and study the ODEs governing our general stochastic algorithm. We establish the existence and uniqueness of the solutions, as well as the convergence to the set of critical points. In Section 3, we introduce the main algorithm. We provide sufficient conditions under which the iterates are bounded and converge to the set of critical points. A central limit theorem is stated. Section 4 introduces a general avoidance of traps result for non-autonomous settings. Next, this result is applied to the proposed algorithm. Sections 5, 6 and 7 are devoted to the proofs of the results of Sections 2, 3 and 4, respectively.
Notations. Given an integer $d \ge 1$, two vectors $x, y \in \mathbb{R}^d$, and a real $\alpha$, we denote by $x \odot y$, $x^{\odot \alpha}$, $x / y$, $|x|$, and $\sqrt{|x|}$ the vectors in $\mathbb{R}^d$ whose $i$-th coordinates are respectively given by $x_i y_i$, $x_i^\alpha$, $x_i / y_i$, $|x_i|$, $\sqrt{|x_i|}$. Inequalities of the form $x \le y$ are to be read componentwise. The standard Euclidean norm is denoted $\|\cdot\|$. The notation $M^T$ represents the transpose of a matrix $M$. For $x \in \mathbb{R}^d$ and $\rho > 0$, the notation $B(x, \rho)$ stands for the open ball of $\mathbb{R}^d$ with center $x$ and radius $\rho$. We also write $\mathbb{R}_+ = [0, \infty)$. If $z \in \mathbb{R}^d$ and $A \subset \mathbb{R}^d$, we write $\mathrm{dist}(z, A) := \inf\{\|z - z'\| : z' \in A\}$.
By $\mathbb{1}_A(x)$, we refer to the function that is equal to one if $x \in A$ and to zero elsewhere. The set of zeros of a function $h : \mathbb{R}^d \to \mathbb{R}^{d'}$ is $\mathrm{zer}\, h = \{x : h(x) = 0\}$. Let $D$ be a domain in $\mathbb{R}^d$. Given an integer $k \ge 0$, the class $C^k(D, \mathbb{R})$ is the class of $D \to \mathbb{R}$ maps such that all their partial derivatives up to the order $k$ exist and are continuous. For a function $h \in C^k(D, \mathbb{R})$ and for every $i \in \{1, \ldots, d\}$, we denote as $\partial_i^k h(x_1, \ldots, x_d)$ the $k$-th partial derivative of the function $h$ with respect to $x_i$. When $k = 1$, we just write $\partial_i h(x_1, \ldots, x_d)$. The gradient of a function $F : \mathbb{R}^d \to \mathbb{R}$ at a point $x \in \mathbb{R}^d$ is denoted as $\nabla F(x)$, and its Hessian matrix at $x$ is $\nabla^2 F(x)$ as usual. For a function $S : \mathbb{R}^d \to \mathbb{R}^d$, the notation $\nabla S(x)$ stands for the Jacobian matrix of $S$ at the point $x$.

A general ODE
Our starting point will be a non-autonomous ODE which is almost identical to the one introduced in [9] and close to the one in [8]. Let $F$ be a function in $C^1(\mathbb{R}^d, \mathbb{R})$, let $S$ be a continuous $\mathbb{R}^d \to \mathbb{R}^d$ function, let $h, r, p, q : (0, \infty) \to \mathbb{R}_+$ be four continuous functions, and let $\varepsilon > 0$. Let $v_0 \in \mathbb{R}_+^d$ and $x_0, m_0 \in \mathbb{R}^d$. Starting at $v(0) = v_0$, $m(0) = m_0$, and $x(0) = x_0$, our ODE on $\mathbb{R}_+$ with trajectories in $Z_+ := \mathbb{R}_+^d \times \mathbb{R}^d \times \mathbb{R}^d$ reads

$\dot v(t) = p(t)\, S(x(t)) - q(t)\, v(t)$
$\dot m(t) = h(t)\, \nabla F(x(t)) - r(t)\, m(t)$    (ODE-1)
$\dot x(t) = - m(t) / \sqrt{v(t) + \varepsilon}$

This ODE can be rewritten compactly in the following form. Write $z_0 = (v_0, m_0, x_0)$, and let $z(t) = (v(t), m(t), x(t)) \in Z_+$ for $t \in \mathbb{R}_+$. Let $Z := \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d$, and define the map $g : Z_+ \times (0, \infty) \to Z$ as

$g(z, t) = \big( p(t)\, S(x) - q(t)\, v,\ \ h(t)\, \nabla F(x) - r(t)\, m,\ \ -m / \sqrt{v + \varepsilon} \big)$

for $z = (v, m, x) \in Z_+$. With these notations, we can rewrite (ODE-1) as

$z(0) = z_0, \qquad \dot z(t) = g(z(t), t) \quad \text{for } t > 0.$
By setting $S(x) = \nabla F(x)^{\odot 2}$ when necessary and by properly choosing the functions $h, r, p$, and $q$, a large number of iterative algorithms used in Machine Learning can be obtained by an Euler discretization of this ODE. For instance, choosing $h(t) = r(t) = a(t, \lambda, \alpha_1)$ and $p(t) = q(t) = a(t, \lambda, \alpha_2)$ with $a(t, \lambda, \alpha) = \lambda^{-1} (1 - \exp(-\lambda \alpha)) / (1 - \exp(-\alpha t))$ and $\lambda, \alpha_1, \alpha_2 > 0$, one obtains a version of the Adam algorithm [30] (see [9] for details). To give another less specific example, if we set $p = q = 0$, then the resulting ODE covers a family of algorithms to which the well-known Heavy Ball with friction algorithm [6] belongs. For a comprehensive and more precise view of the deterministic algorithms that can be deduced from (ODE-1) by an Euler discretization, the reader is referred to [9, Table 1]. In this paper, since we are rather interested in stochastic versions of these algorithms, Eq. (ODE-1) will be the basic building block of the classical "ODE method" which is widely used in the field of stochastic approximation [11]. In order to analyze the behavior of this equation in preparation of the stochastic analysis, we need the following assumptions.
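To make the discretization concrete, here is a minimal sketch of an explicit Euler scheme for (ODE-1) with $S(x) = \nabla F(x)^{\odot 2}$, constant coefficient functions $h = r = p = q = 1$, and a one-dimensional quadratic objective. All numeric choices, including a deliberately large $\varepsilon$ that keeps the toy system well damped, are illustrative and not prescribed by the paper:

```python
import math

# Explicit Euler discretization of (ODE-1) on F(x) = x^2/2 with S(x) = (F'(x))^2.
# The coefficient functions are frozen to h = r = p = q = 1 for simplicity.
def euler_ode1(x0, eps=1.0, dt=0.01, n_steps=5000):
    v, m, x = 0.0, 0.0, x0
    for _ in range(n_steps):
        grad = x                                     # F'(x) for F(x) = x^2/2
        v, m, x = (v + dt * (grad ** 2 - v),         # v' = p S(x) - q v
                   m + dt * (grad - m),              # m' = h grad F(x) - r m
                   x - dt * m / math.sqrt(v + eps))  # x' = -m / sqrt(v + eps)
    return x

x_T = euler_ode1(3.0)
```

Replacing $\nabla F(x_n)$ by a noisy gradient $\nabla f(x_n, \xi_{n+1})$ and the fixed step `dt` by vanishing stepsizes $\gamma_{n+1}$ turns this deterministic skeleton into the stochastic algorithms studied in Section 3.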
Note that this assumption implies that the infimum $F_\star$ of $F$ is finite, and that the set $\mathrm{zer}\, \nabla F$ of zeros of $\nabla F$ is nonempty.

i) $h \in C^1((0, +\infty), \mathbb{R}_+)$, $\dot h(t) \le 0$ on $(0, +\infty)$, and the limit $h_\infty := \lim_{t \to \infty} h(t)$ is positive.
ii) $r$ and $q$ are non-increasing, and $r_\infty := \lim_{t \to \infty} r(t)$ and $q_\infty := \lim_{t \to \infty} q(t)$ are positive.
iii) $p$ converges towards $p_\infty$ as $t \to \infty$.
These assumptions are sufficient to prove the existence and the uniqueness of the solution to (ODE-1) starting at a time $t_0 > 0$. The following additional assumption extends the solution to $t_0 = 0$.

Assumption 2.5. Either $h, r, p, q \in C^1([0, +\infty), \mathbb{R}_+)$, or the following holds:

ii) The functions $h/p$, $h/(q - 2r)$, $t \mapsto t h(t)$, $t \mapsto t r(t)$, $t \mapsto t p(t)$, $t \mapsto t q(t)$ are bounded near zero.
iii) There exists $t_0 > 0$ such that for all $t < t_0$, $2 r(t) - q(t) > 0$.
iv) There exists $\delta > 0$ such that $h/r,\ p/q \in C^1([0, \delta), \mathbb{R}_+)$.
v) The initial condition $z_0 = (v_0, m_0, x_0) \in Z_+$ satisfies …

Remark 1. The functions $h, r, p, q$ corresponding to Adam satisfy these conditions. We leave the straightforward verifications to the reader. We just observe here that the function $S$ that will correspond to our stochastic algorithm in Section 3 below will satisfy Assumption 2.5-i) by an immediate application of Jensen's inequality.
The following theorem slightly generalizes the results of [9, Th. 3 and Th. 5].
Theorem 2.1. Let Assumptions 2.1 to 2.4 hold true. Consider $z_0 \in Z_+$ and $t_0 > 0$. Then, there exists a unique global solution $z : [t_0, +\infty) \to Z_+$ to (ODE-1) with initial condition $z(t_0) = z_0$. Moreover, $z([t_0, +\infty))$ is a bounded subset of $Z_+$. As $t \to +\infty$, $z(t)$ converges towards the set

$\Upsilon := \{ (v_\star, 0, x_\star) : x_\star \in \mathrm{zer}\, \nabla F,\ v_\star = q_\infty^{-1} p_\infty S(x_\star) \}. \qquad (2)$

If, additionally, Assumption 2.5 holds, then we can take $t_0 = 0$.
Remark 2. Th. 2.1 only shows the convergence of the trajectory $z(t)$ towards a set. Convergence of the trajectory towards a single point is not guaranteed when the set $\Upsilon$ is not countable.

Remark 3.
A simpler version of (ODE-1) is obtained when omitting the momentum term. It reads:

$\dot v(t) = p(t)\, S(x(t)) - q(t)\, v(t), \qquad \dot x(t) = - \nabla F(x(t)) / \sqrt{v(t) + \varepsilon}.$    (ODE-1')

This ODE encompasses the algorithms of the family of RMSProp [43], as shown in [8,9]. The approach for proving the previous theorem can be adapted to (ODE-1') with only minor modifications. In the proofs below, we will point out the particularities of (ODE-1') when necessary.
The following paragraph is devoted to a particular case of (ODE-1), which does not satisfy Assumption 2.4, and which requires a more involved treatment than (ODE-1').
The continuous-time dynamical system (ODE-1) we consider was first introduced in [9, Eq. (2.1)] with $S = \nabla F^{\odot 2}$. Th. 2.1 above is roughly the same as [9, Ths. 3 and 5], with some slight differences regarding the assumptions on the function $F$, or Assumption 2.4-iv). We point out that the main focus of [9] is to study the properties of the deterministic continuous-time dynamical system (ODE-1). In the present work, the purpose of Th. 2.1 is rather to pave the way for our analysis of the corresponding stochastic algorithms in Section 3.
Concerning Th. 2.2, the existence and the uniqueness of a global solution to (ODE-N) have been previously shown in the literature, for instance in [14, Prop. 2.1] or in [42, Th. 1]. The convergence statement in Th. 2.2 is new to the best of our knowledge. In particular, we stress that we do not make any convexity assumption on $F$. The closest result we are aware of is the one of Cabot, Engler and Gadat [14]. In [14, Prop. 2.5], it is shown that if $x(t)$ converges towards some point $\bar x$, then necessarily $\bar x$ is a critical point of $F$. Our result in Th. 2.2 strengthens this statement, by establishing that $x(t)$ actually converges to the set of critical points.

Stochastic Algorithms
In this section, we discuss the asymptotic behavior of stochastic algorithms that are noisy Euler discretizations of (ODE-1) and (ODE-N), studied in the previous section.
We first set the stage. Let $(\Xi, \mathcal{T}, \mu)$ be a probability space. Denoting as $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-algebra on $\mathbb{R}^d$, consider a $\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{T}$-measurable function $f : \mathbb{R}^d \times \Xi \to \mathbb{R}$ that satisfies the following assumption.
Assumption 3.1. The following conditions hold:

ii) For every $s \in \Xi$, the map $f(\cdot, s)$ is differentiable. Denoting as $\nabla f(x, s)$ its gradient w.r.t. $x$, the function $\nabla f(x, \cdot)$ is integrable.
Under Assumption 3.1, we can define the mapping $F : \mathbb{R}^d \to \mathbb{R}$ as $F(x) = \mathbb{E}_\xi[f(x, \xi)]$ for all $x \in \mathbb{R}^d$, where we write $\mathbb{E}_\xi \varphi(\xi) = \int \varphi(\xi)\, \mu(d\xi)$. It is easy to see that the mapping $F$ is differentiable, that $\nabla F(x) = \mathbb{E}_\xi[\nabla f(x, \xi)]$ for all $x \in \mathbb{R}^d$, and that $\nabla F$ is locally Lipschitz. Let $(\gamma_n)_{n \ge 1}$ be a sequence of positive real numbers satisfying the following assumption.

Assumption 3.2. $\gamma_{n+1} / \gamma_n \to 1$ and $\sum_n \gamma_n = +\infty$.
Define for every integer $n \ge 1$ the quantity $\tau_n := \sum_{k=1}^n \gamma_k$. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $(\xi_n : n \ge 1)$ be a sequence of i.i.d. random variables defined from $(\Omega, \mathcal{F}, \mathbb{P})$ into $(\Xi, \mathcal{T}, \mu)$ with the distribution $\mu$.
Algorithm 1 (general algorithm)
Initialization: $z_0 \in Z_+$.
for $n = 0$ to $n_{\mathrm{iter}}$ do
  $v_{n+1} = (1 - \gamma_{n+1} q_n)\, v_n + \gamma_{n+1} p_n\, \nabla f(x_n, \xi_{n+1})^{\odot 2}$
  $m_{n+1} = (1 - \gamma_{n+1} r_n)\, m_n + \gamma_{n+1} h_n\, \nabla f(x_n, \xi_{n+1})$
  $x_{n+1} = x_n - \gamma_{n+1}\, m_{n+1} / \sqrt{v_{n+1} + \varepsilon}$
end for

General algorithm
Our first algorithm is a discrete and noisy version of (ODE-1). Let $z_0 = (v_0, m_0, x_0) \in Z_+$ and $h_0, r_0, p_0, q_0 \in (0, \infty)$. Define for every $n \ge 1$: $h_n = h(\tau_n)$, $r_n = r(\tau_n)$, $p_n = p(\tau_n)$, and $q_n = q(\tau_n)$.
The algorithm is written as follows. We suppose throughout the paper that $1 - \gamma_{n+1} q_n \ge 0$ for all $n \in \mathbb{N}$. This guarantees that the quantity $\sqrt{v_n + \varepsilon}$ is always well-defined (see Algorithm 1). This mild assumption is satisfied as soon as $q_0 \le 1/\gamma_1$, since the sequence $(q_n)$ is non-increasing and the sequence of stepsizes $(\gamma_n)$ can also be supposed to be non-increasing.
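A minimal, self-contained sketch of Algorithm 1 may help fix ideas. The choices below, constant coefficients $h = r = p = q \equiv 1$, stepsizes $\gamma_n = 1/n$, and a scalar least-mean-squares objective, are illustrative assumptions, not prescriptions of the paper; note that $\gamma_{n+1} q_n \le 1$ holds, so $v_n$ stays nonnegative:

```python
import math, random

# Sketch of Algorithm 1 with constant coefficients h = r = p = q = 1 and
# stepsizes gamma_n = 1/n (so 1 - gamma_{n+1} q_n >= 0 and v_n >= 0).
# Toy problem: F(x) = E[(x - xi)^2]/2 with xi ~ N(0, 1); minimizer x* = 0.
def algorithm1(x0, n_iter=20000, eps=1.0, seed=1):
    rng = random.Random(seed)
    v, m, x = 0.0, 0.0, x0
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n
        g = x - rng.gauss(0.0, 1.0)                # noisy gradient of f(x, xi)
        v = (1 - gamma) * v + gamma * g * g        # v-update (squared gradient)
        m = (1 - gamma) * m + gamma * g            # m-update (momentum)
        x = x - gamma * m / math.sqrt(v + eps)     # x-update (adaptive step)
    return x
```

With vanishing stepsizes the iterates settle near the critical point $x_\star = 0$, in line with Th. 3.1.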
Since this algorithm makes use of the function $\nabla f(x, \xi)^{\odot 2}$, a strengthening of Assumption 3.1 is required:

Assumption 3.3. In Assumption 3.1, Conditions ii) and iii) are respectively replaced with the stronger conditions:

ii') For each $x \in \mathbb{R}^d$, the function $\nabla f(x, \cdot)^{\odot 2}$ is $\mu$-integrable.
iii') There exists a measurable map $\kappa : \mathbb{R}^d \times \Xi \to \mathbb{R}_+$ s.t. for every $x \in \mathbb{R}^d$:

a) The map $\kappa(x, \cdot)$ is $\mu$-integrable.
Under Assumption 3.3, we can also define the mapping $S : \mathbb{R}^d \to \mathbb{R}^d$ as $S(x) = \mathbb{E}_\xi[\nabla f(x, \xi)^{\odot 2}]$ for all $x \in \mathbb{R}^d$. Notice that Assumptions 2.1 and 2.3 are satisfied for $F$ and $S$.

ii) For every compact set $K \subset \mathbb{R}^d$, there exists a real $\sigma_K \neq 0$ s.t.

Remark 4.
We make the following comments regarding Assumption 3.4.
• Assumption 3.4-i) allows the use of larger stepsizes in comparison with the classical condition $\sum_n \gamma_n^2 < \infty$, which corresponds to the particular case $q = 2$.
• Recall that a random vector $X$ is said to be subgaussian if there exists a real $\sigma \neq 0$ s.t. $\mathbb{E}\, e^{\langle u, X \rangle} \le e^{\sigma^2 \|u\|^2 / 2}$ for every constant vector $u \in \mathbb{R}^d$. In Assumption 3.4-ii), the subgaussian noise offers the possibility to use a sequence of stepsizes with an even slower decay rate than in Assumption 3.4-i).
Assumption 3.5. The set $F(\{x : \nabla F(x) = 0\})$ has an empty interior.
Remark 5. Assumption 3.5 excludes a pathological behavior of the objective function $F$ at critical points. It is satisfied when $F \in C^k(\mathbb{R}^d, \mathbb{R})$ for $k \ge d$. Indeed, in this case, Sard's theorem stipulates that the Lebesgue measure of $F(\{x : \nabla F(x) = 0\})$ is zero in $\mathbb{R}$.
Theorem 3.1. Let Assumptions 2.2, 2.4, and 3.2-3.5 hold true. Assume that the random sequence $(z_n = (v_n, m_n, x_n) : n \in \mathbb{N})$ given by Algorithm 1 is bounded with probability one. Then, w.p. 1, the sequence $(z_n)$ converges towards the set $\Upsilon$ defined in Eq. (2). If, in addition, the set of critical points of the objective function $F$ is finite or countable, then w.p. 1, the sequence $(z_n)$ converges to a single point of $\Upsilon$.
We now deal with the boundedness problem of the sequence $(z_n)$. We introduce an additional assumption for this purpose.
Assumption 3.6. The following conditions hold.
Remark 6. The above stability result requires square-summable stepsizes. Showing the same boundedness result under Assumption 3.4, which allows for larger stepsizes, is a challenging problem in the general case. In such situations, the boundedness of the iterates can sometimes be ensured by ad hoc means.

Remark 7.
We can also consider the noisy discretization of (ODE-1') introduced in Remark 3 above. This algorithm reads

$v_{n+1} = (1 - \gamma_{n+1} q_n)\, v_n + \gamma_{n+1} p_n\, \nabla f(x_n, \xi_{n+1})^{\odot 2}$    (6a)
$x_{n+1} = x_n - \gamma_{n+1}\, \nabla f(x_n, \xi_{n+1}) / \sqrt{v_{n+1} + \varepsilon}$    (6b)

for $(v_0, x_0) \in \mathbb{R}_+^d \times \mathbb{R}^d$. With only minor adaptations, Th. 3.1 and Th. 3.2 can be shown to hold as well for this algorithm. We refer to the concomitant paper [23, Sec. 2.2] for the link between this algorithm and the seminal algorithms AdaGrad [22] and RMSProp [43].

Stochastic Nesterov's Accelerated Gradient (S-NAG)
S-NAG is the noisy Euler discretization of (ODE-N). Given $\alpha > 0$, it generates the sequence $(m_n, x_n)$ on $\mathbb{R}^d \times \mathbb{R}^d$ given by Algorithm 2.

ii) For every compact set $K \subset \mathbb{R}^d$, there exists a real $\sigma_K \neq 0$ s.t.
Theorem 3.3. Let Assumptions 2.2, 3.1, 3.2, 3.5 and 3.7 hold true. Assume that the random sequence $(y_n = (m_n, x_n) : n \in \mathbb{N})$ given by Algorithm 2 is bounded with probability one. Then, w.p. 1, the sequence $(y_n)$ converges towards the set $\bar\Upsilon$ defined in Eq. (3). If, in addition, the set of critical points of the objective function $F$ is finite or countable, then w.p. 1, the sequence $(y_n)$ converges to a single point of $\bar\Upsilon$.
The almost sure boundedness of the sequence $(y_n)$ is handled in what follows.
Theorem 3.4. Let Assumptions 2.2, 3.1, 3.2 and 3.6 hold. Then, the sequence $(y_n = (m_n, x_n) : n \in \mathbb{N})$ given by Algorithm 2 is bounded with probability one.
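Since Algorithm 2 is not reproduced above, the following hedged sketch assumes the natural Euler scheme of the Nesterov ODE $\dot m(t) = \nabla F(x(t)) - (\alpha/t)\, m(t)$, $\dot x(t) = -m(t)$, with noisy gradients; the toy objective, the value $\alpha = 3$, and the stepsizes $\gamma_n = n^{-0.8}$ are illustrative assumptions, not the paper's prescriptions:

```python
import random

# Hedged sketch of an S-NAG-type recursion: Euler scheme of the Nesterov ODE
#   m'(t) = grad F(x(t)) - (alpha / t) m(t),   x'(t) = -m(t),
# with time replaced by tau_n = sum of stepsizes, on F(x) = E[(x - xi)^2]/2.
def snag(x0, alpha=3.0, n_iter=20000, seed=2):
    rng = random.Random(seed)
    m, x, tau = 0.0, x0, 0.0
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n**0.8
        tau += gamma                          # tau_n plays the role of time t
        g = x - rng.gauss(0.0, 1.0)           # noisy gradient at x_n
        m = m + gamma * (g - (alpha / tau) * m)
        x = x - gamma * m
    return x
```

The vanishing friction $\alpha/\tau_n$ is what makes the recursion non-autonomous, in contrast with the exponential-memory case $h = r$ constant.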

Central Limit Theorem
In this section, we establish a conditional central limit theorem for Algorithm 1.
Assumption 3.8. Let $x_\star \in \mathrm{zer}\, \nabla F$. The following holds.
i) $F$ is twice continuously differentiable on a neighborhood of $x_\star$, and the Hessian $\nabla^2 F(x_\star)$ is positive definite.
ii) $S$ is continuously differentiable on a neighborhood of $x_\star$.
Under Assumptions 2.4-i) to iii), it follows from Eq. (5) that the sequences $(h_n)$, $(r_n)$, $(p_n)$ and $(q_n)$ of nonnegative reals converge respectively to $h_\infty$, $r_\infty$, $p_\infty$ and $q_\infty$, where $h_\infty$, $r_\infty$ and $q_\infty$ are supposed positive. Define $v_\star := q_\infty^{-1} p_\infty S(x_\star)$, and consider the matrix $V$. Let $P$ be an orthogonal matrix s.t. the spectral decomposition $P^T V P = \mathrm{diag}(\pi_1, \ldots, \pi_d)$ holds, where $\pi_1 \le \cdots \le \pi_d$ are the (positive) eigenvalues of $V$, and where $I_d$ is the $d \times d$ identity matrix. Then the matrix $H$ is Hurwitz. Indeed, it can be shown that the largest real part of the eigenvalues of $H$ coincides with $-L$.

Assumption 3.9. The sequence $(\gamma_n)$ is given by $\gamma_n = \gamma_0 / n^\alpha$ for some $\alpha \in (0, 1]$ and $\gamma_0 > 0$. Moreover, if $\alpha = 1$, we assume that $\gamma_0 > \frac{1}{2 (L \wedge q_\infty)}$.

Theorem 3.5. Let Assumptions 2.4-i) to iii), 3.3, 3.8 and 3.9 hold. Consider the iterates $z_n = (v_n, m_n, x_n)$ given by Algorithm 1. Set $\theta := 0$ if $\alpha < 1$ and $\theta := 1/(2\gamma_0)$ if $\alpha = 1$. Assume that the event $\{z_n \to z_\star\}$, where $z_\star = (v_\star, 0, x_\star)$, has a positive probability. Then, given that event, the rescaled iterates converge in distribution to $\mathcal{N}(0, \Gamma)$, a centered Gaussian distribution on $\mathbb{R}^{2d}$ with a covariance matrix $\Gamma$ given by the unique solution to a Lyapunov equation of the form $(H + \theta I_{2d})\, \Gamma + \Gamma\, (H + \theta I_{2d})^T = -Q$. In particular, given $\{z_n \to z_\star\}$, the vector $\gamma_n^{-1/2} (x_n - x_\star)$ converges in distribution to a centered Gaussian distribution with a covariance matrix $\Gamma_2$ expressed through the matrix $C := P^{-1} V$.

• The matrix $\Gamma_2$ coincides with the limiting covariance matrix associated with the iterates

$m_{n+1} = m_n + \gamma_{n+1} \big( h_\infty V \nabla f(x_n, \xi_{n+1}) - r_\infty m_n \big)$
$x_{n+1} = x_n - \gamma_{n+1} m_{n+1}.$
This procedure can be seen as a preconditioned version of the stochastic heavy ball algorithm [24], although the iterates are not implementable because of the unknown matrix $V$. Notice also that the limiting covariance $\Gamma_2$ depends on $v_\star$ but does not depend on the fluctuations of the sequence $(v_n)$.
• When $h_\infty = r_\infty$ (which is the case for Adam), we recover the expression of the asymptotic covariance matrix previously provided in [8, Section 5.3] and the remarks formulated therein.
• The assumption $r_\infty > 0$ is crucial to establish Th. 3.5. For this reason, Th. 3.5 does not generalize immediately to Algorithm 2. The study of the fluctuations of Algorithm 2 is left for future work.
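The Lyapunov equation appearing in Th. 3.5 can be solved numerically. The sketch below uses the Kronecker-product identity $\mathrm{vec}(A X + X A^T) = (I \otimes A + A \otimes I)\,\mathrm{vec}(X)$ (column-major vec); the $2 \times 2$ matrices are illustrative stand-ins for $H + \theta I_{2d}$ and the noise covariance, not quantities computed from the paper:

```python
import numpy as np

# Solve the Lyapunov equation A X + X A^T = -Q for a Hurwitz matrix A,
# via vec(A X + X A^T) = (I kron A + A kron I) vec(X), column-major vec.
def solve_lyapunov(A, Q):
    d = A.shape[0]
    I = np.eye(d)
    K = np.kron(I, A) + np.kron(A, I)
    x = np.linalg.solve(K, -Q.flatten(order="F"))
    return x.reshape((d, d), order="F")

A = np.array([[-1.0, 0.5], [0.0, -2.0]])   # Hurwitz: eigenvalues -1 and -2
Q = np.eye(2)                              # illustrative noise covariance
X = solve_lyapunov(A, Q)                   # satisfies A X + X A^T = -Q
```

Since $A$ is Hurwitz, the sums $\lambda_i + \lambda_j$ of its eigenvalues never vanish, so the linear system is invertible and the solution is unique, matching the uniqueness claim in Th. 3.5.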

Related works
In [24], Gadat, Panloup and Saadane study the SHB algorithm, which is a noisy Euler discretization of (ODE-1) in the situation where $h = r$ and $p = q = 0$ (i.e., there is no $v$ variable).
In this framework, if we set $h = r \equiv \bar r > 0$ in Algorithm 1 above, then Th. 3.1 above recovers the analogous case in [24, Th. 2.1], which is termed the exponential memory case. The other important case treated in [24] is the one where $h(t) = r(t) = \bar r / t$ for some $\bar r > 0$, referred to as the polynomial memory case. Actually, it is known that the ODE obtained for $h(t) = r(t) = \bar r / t$ and $p = q = 0$ boils down to (ODE-N) after a change of time variable (see, e.g., Lem. 5.3 below). Nevertheless, we highlight that the stochastic algorithm that stems from this ODE and that is studied in [24] is different from the S-NAG algorithm introduced above, which stems directly from (ODE-N). Hence, the convergence result of Th. 3.3 for the S-NAG algorithm we consider is not covered by the analysis of [24]. The specific case of the Adam algorithm is analyzed in [8] in both the constant and vanishing stepsize settings (see [8] for the analogues of our Ths. 3.1-3.2). Note that we deal with a more general algorithm in the present paper. Indeed, Algorithm 1 offers some freedom in the choice of the functions $h, r, p, q$ satisfying Assumption 2.4 beyond the specific case of the Adam algorithm studied in [8]. Apart from this generalization, we also emphasize some small improvements. Regarding Th. 3.1, we provide noise conditions allowing larger stepsizes (see Assumption 3.4 compared to [8, Assumption 4.2]). Concerning the stability result (Th. 3.2), we relax [8, Assumption 5.3-(iii)], which is no longer needed in the present paper (see Assumption 3.6), thanks to a modification of the discretized Lyapunov function used in the proof (see Section 6.4 compared to [8, Section 9.2]).
In most generality, the almost sure convergence result for the iterates of Algorithm 1 with vanishing stepsizes (Ths. 3.1-3.2) is new to the best of our knowledge. Moreover, while some recent results exist for S-NAG in the constant stepsize regime and for convex objective functions (see e.g. [4]), Ths. 3.3 and 3.4, which tackle the possibly non-convex setting, are also new to the best of our knowledge.
In the work [23], which was posted on the arXiv repository a few days after our submission, Gadat and Gavra study the specific case of the algorithm described in Eq. (6), encompassing both AdaGrad and RMSProp, with the possibility of using mini-batches. For this specific algorithm, the authors establish an almost sure convergence result similar to ours [23, Th. 1] for decreasing stepsizes, and derive some quantitative results bounding in expectation the gradient of the objective function along the iterations for constant stepsizes [23, Th. 2]. We highlight though that they do not consider the presence of momentum in the algorithm. Therefore, their analysis covers neither Algorithm 1 nor Algorithm 2.
In contrast to our analysis, some works in the literature explore the constant stepsize regime for some stochastic momentum methods, either for smooth [44] or weakly convex objective functions [33]. Furthermore, concerning Adam-like algorithms, several recent works control the minimum of the norms of the gradients of the objective function evaluated at the iterates of the algorithm over $N$ iterations, in expectation or with high probability [18,46,15,47,16,45,1,19,2], and establish regret bounds in the convex setting [2]. Central limit theorems similar to Th. 3.5 are established in the cases of the stochastic heavy ball algorithm with exponential memory [24, Th. 2.4] and Adam [8, Th. 5.7]. In comparison with [24], we note that our theorem recovers their result and provides a closed formula for the asymptotic covariance matrix $\Gamma_2$. Our proof of Th. 3.5 differs from the strategies adopted in [24] and [8].

Avoidance of Traps
In Th. 3.1 and Th. 3.3 above, we established the almost sure convergence of the iterates $x_n$ towards the set of critical points of the objective function $F$ for both Algorithms 1 and 2. However, the landscape of $F$ can contain what is known as "traps" for the algorithm, namely, critical points where the Hessian matrix of $F$ has negative eigenvalues, making these critical points local maxima or saddle points. In this section, we show that the convergence of the iterates to these traps does not take place if the noise is exciting in some directions.
Starting with the contributions of Pemantle [39] and Brandière and Duflo [13], the numerous so-called avoidance of traps results that can be found in the literature deal with the case where the ODE that underlies the stochastic algorithm is an autonomous ODE. Obviously, this is the case neither for (ODE-1) nor for (ODE-N). To deal with this issue, we first state a general avoidance of traps result that extends [39,13] to a non-autonomous setting, and that is of interest in its own right. We then apply this result to Algorithms 1 and 2.

A general avoidance-of-traps result in a non-autonomous setting
The notations in this subsection and in Sections 7.1-7.2 are independent from the rest of the paper. The setting of our problem is as follows. Given an integer $d > 0$ and a continuous function $b : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$, we consider a stochastic algorithm built around the non-autonomous ODE $\dot z(t) = b(z(t), t)$. Let $z_\star \in \mathbb{R}^d$, and assume that on $\mathcal{V} \times \mathbb{R}_+$, where $\mathcal{V}$ is a certain neighborhood of $z_\star$, the function $b$ can be developed as

$b(z, t) = D (z - z_\star) + e(z, t), \qquad (11)$

where $e(z_\star, \cdot) = 0$, and where the matrix $D \in \mathbb{R}^{d \times d}$ is assumed to admit a spectral factorization of the form

$D = Q \Lambda Q^{-1}, \qquad (12)$

where $\Lambda$ is a Jordan form of $D$. Given $0 \le d_- < d$ and $0 < d_+ \le d$ with $d_- + d_+ = d$, we can write $\Lambda = \mathrm{diag}(\Lambda_-, \Lambda_+)$, where the Jordan blocks that constitute $\Lambda_- \in \mathbb{R}^{d_- \times d_-}$ (respectively $\Lambda_+ \in \mathbb{R}^{d_+ \times d_+}$) are those that contain the eigenvalues $\lambda_i$ of $D$ for which $\Re \lambda_i \le 0$ (respectively $\Re \lambda_i > 0$). Since $d_+ > 0$, the point $z_\star$ is an unstable equilibrium point of the ODE $\dot z(t) = b(z(t), t)$, in the sense that the ODE solution will only be able to converge to $z_\star$ along a specific so-called invariant manifold, whose precise characterization will be given in Section 7.1 below.
We now consider a stochastic algorithm that is built around this ODE. The condition $d_+ > 0$ means that $z_\star$ is a trap that the algorithm should desirably avoid. The following theorem states that this will be the case if the noise term of the algorithm is omnidirectional enough. The idea is to show that, this being the case, the algorithm trajectories will move away from the invariant manifold mentioned above.

Theorem 4.1. Given a sequence $(\gamma_n)$ of nonnegative deterministic stepsizes such that $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < +\infty$, and a filtration $(\mathcal{F}_n)$, consider the stochastic approximation algorithm in $\mathbb{R}^d$

$z_{n+1} = z_n + \gamma_{n+1} b(z_n, \tau_n) + \gamma_{n+1} \eta_{n+1} + \gamma_{n+1} \rho_{n+1},$

where $\tau_n = \sum_{k=1}^n \gamma_k$. Assume that the sequences $(\eta_n)$ and $(\rho_n)$ are adapted to $(\mathcal{F}_n)$, and that $z_0$ is $\mathcal{F}_0$-measurable. Assume that there exists $z_\star \in \mathbb{R}^d$ such that Eq. (11) holds true on $\mathcal{V} \times \mathbb{R}_+$, where $\mathcal{V}$ is a neighborhood of $z_\star$. Consider the spectral factorization (12), and assume that $d_+ > 0$. Assume moreover that the function $e$ on the right-hand side of Eq. (11) satisfies the conditions:

ii) On $\mathcal{V} \times \mathbb{R}_+$, the partial derivatives $\partial_2^n \partial_1^k e(z, t)$ exist and are continuous for $0 \le n < 2$ and $0 \le k + n \le 2$.

Remark 9.
Assumptions i) to iv) of Th. 4.1 are related to the function $e$ defined in Eq. (11), which can be seen as a non-autonomous perturbation of the autonomous linear ODE $\dot z(t) = D (z(t) - z_\star)$. These assumptions guarantee the existence of a local (around the unstable equilibrium $z_\star$) non-autonomous invariant manifold of the non-autonomous ODE $\dot z(t) = b(z(t), t)$ with enough regularity properties, as provided by Prop. 7.1 and Prop. 7.3 below.
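Numerically, the dimensions $d_-$ and $d_+$ of the factorization (12) can be read off from the real parts of the eigenvalues of $D$; the matrix below is an illustrative example of a linearization with one contracting and one expanding direction, not a matrix taken from the paper:

```python
import numpy as np

# Split the spectrum of the linearization matrix D into the "stable" part
# (Re(lambda) <= 0, dimension d_minus) and the "unstable" part
# (Re(lambda) > 0, dimension d_plus); d_plus > 0 is what makes z* a trap.
def stable_unstable_dims(D, tol=0.0):
    eig = np.linalg.eigvals(D)
    d_plus = int(np.sum(eig.real > tol))
    return D.shape[0] - d_plus, d_plus

# Illustrative saddle-type linearization: one contracting, one expanding direction.
D = np.array([[-1.0, 0.0], [0.0, 0.5]])
d_minus, d_plus = stable_unstable_dims(D)
```

Here `d_plus == 1`, so the equilibrium in question would qualify as a trap in the sense of Th. 4.1.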

Trap avoidance of the general algorithm 1
In Th. 3.1 above, we showed that the sequence $(z_n)$ generated by Algorithm 1 converges almost surely towards the set $\Upsilon$ defined in Eq. (2). Our purpose now is to show that the traps in $\Upsilon$ (to be characterized below) are avoided by Algorithm 1 under a proper omnidirectionality assumption on the noise.
Our first task is to write Algorithm 1 in a manner compatible with the statement of Th. 4.1. The following decomposition holds for the sequence $(z_n = (v_n, m_n, x_n), n \in \mathbb{N})$ generated by this algorithm:

$z_{n+1} = z_n + \gamma_{n+1} g(z_n, \tau_n) + \gamma_{n+1} \eta_{n+1} + \gamma_{n+1} \bar\rho_{n+1},$

where $\bar\rho_{n+1}$ is a remainder term involving the quantity $\sqrt{v_{n+1} + \varepsilon}$, and where $\eta_{n+1}$ is the martingale increment with respect to the filtration $(\mathcal{F}_n)$ which is defined by Eq. (28).
Observe from Eq. (2) that each $z_\star \in \Upsilon$ is written as $z_\star = (v_\star, 0, x_\star)$, where $x_\star \in \mathrm{zer}\, \nabla F$ and $v_\star = q_\infty^{-1} p_\infty S(x_\star)$ (in particular, $x_\star$ and $z_\star$ are in a one-to-one correspondence). We need to linearize the function $g(\cdot, t)$ around $z_\star$. The following assumptions will be required.
Assumption 4.1. The functions $F$ and $S$ belong respectively to $C^3(\mathbb{R}^d, \mathbb{R})$ and $C^2(\mathbb{R}^d, \mathbb{R}^d)$.
Assumption 4.2. The functions $h, r, p, q$ belong to $C^1((0, \infty), \mathbb{R}_+)$ and have bounded derivatives on $[t_0, +\infty)$ for some $t_0 > 0$.

Lemma 4.2. Let Assumptions 2.4-i) to iii), 4.1 and 4.2 hold. Let $z_\star = (v_\star, 0, x_\star) \in \Upsilon$. Then, for every $z \in Z_+$ and every $t > 0$, the following decomposition holds true:

$g(z, t) = c(t) + D (z - z_\star) + e(z, t),$

and the function $e(z, t)$ (defined in Section 7.3.1 below for conciseness) has the same properties as its analogue in the statement of Th. 4.1.
Using this lemma, the algorithm iterate $z_{n+1}$ can be rewritten as an instance of the algorithm in the statement of Th. 4.1, namely

$z_{n+1} = z_n + \gamma_{n+1} b(z_n, \tau_n) + \gamma_{n+1} \eta_{n+1} + \gamma_{n+1} \rho_{n+1},$

where, in our present setting, $b(z, t) = g(z, t) - c(t) = D (z - z_\star) + e(z, t)$ and $\rho_n = c(\tau_{n-1}) + \bar\rho_n$. In the following assumption, we use the well-known fact that a symmetric matrix $H$ has the same inertia as $A H A^T$ for an arbitrary invertible matrix $A$.

Assumption 4.3. Assume the following conditions:

ii) The Hessian matrix $\nabla^2 F(x_\star)$ has a negative eigenvalue.
iv) Defining $\Pi_u$ as the orthogonal projector on the eigenspace of $V_2$ that is associated with the negative eigenvalues of this matrix, it holds that …

Theorem 4.3. Let $z_\star \in \Upsilon$ be such that Assumption 4.3 holds true for this $z_\star$. Then, the eigenspace associated with the eigenvalues of $D$ with positive real parts has the same dimension as the eigenspace of $\nabla^2 F(x_\star)$ associated with the negative eigenvalues of this matrix. Let $(z_n = (v_n, m_n, x_n) : n \in \mathbb{N})$ be the random sequence generated by Algorithm 1 with stepsizes satisfying $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^2 < +\infty$. Then, $\mathbb{P}([z_n \to z_\star]) = 0$.
The assumptions and the result call for some comments.
Remark 10. The definition of a trap as regards the general algorithm in the statement of Th. 4.1 is that the matrix $D$ in Eq. (11) has eigenvalues with positive real parts. Th. 4.3 states that this condition is equivalent to $\nabla^2 F(x_\star)$ having negative eigenvalues. What is more, the dimension of the invariant subspace of $D$ corresponding to the eigenvalues with positive real parts is equal to the dimension of the negative-eigenvalue subspace of $\nabla^2 F(x_\star)$. Thus, Assumption 4.3-iv) provides the "largest" subspace where the noise energy must be nonzero for the purpose of avoiding the trap.
Remark 11. Assumptions 4.2 and 4.3-i) are satisfied by many widely studied algorithms, including RMSProp and Adam.
Remark 12. The results of Th. 4.3 can be straightforwardly adapted to the case of (ODE-1′). Assumption 4.3-iv) on the noise is unchanged.
In the case of the S-NAG algorithm, the assumptions become particularly simple. We state the corresponding result separately.

Trap avoidance for S-NAG
Assumption 4.4. Let $x_\star \in \mathrm{zer}\,\nabla F$ and let the following conditions hold.
i) The Hessian matrix $\nabla^2 F(x_\star)$ has a negative eigenvalue.
where $\Pi_u$ is the orthogonal projector on the eigenspace of $\nabla^2 F(x_\star)$ associated with its negative eigenvalues. Define $y_\star = (0, x_\star)$. Let $(y_n = (m_n, x_n) : n \in \mathbb{N})$ be the random sequence generated by Algorithm 2 with stepsizes satisfying $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^2 < +\infty$. Then, $\mathbb{P}([y_n \to y_\star]) = 0$.
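As a toy illustration of Assumption 4.4-i) (an example of ours, not from the paper), consider the saddle $F(x) = \frac12(x_1^2 - x_2^2)$ at the critical point $x_\star = 0$: the Hessian has a negative eigenvalue, and the projector $\Pi_u$ onto the corresponding eigenspace is readily computed.

```python
import numpy as np

# Hessian of F(x) = 0.5*(x1**2 - x2**2) at the critical point x* = 0.
hess = np.diag([1.0, -1.0])
w, U = np.linalg.eigh(hess)

# Assumption 4.4-i): at least one negative eigenvalue, so x* is a saddle (a trap).
assert (w < 0).any()

# Orthogonal projector Pi_u onto the eigenspace of the negative eigenvalues.
neg = U[:, w < 0]
Pi_u = neg @ neg.T
assert np.allclose(Pi_u @ Pi_u, Pi_u)    # idempotent
assert np.allclose(Pi_u, Pi_u.T)         # symmetric
assert np.isclose(np.trace(Pi_u), 1.0)   # rank = number of negative eigenvalues
```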

Related works
To the best of our knowledge, all the trap-avoidance results available in the literature, starting from [39,13], concern stochastic algorithms that are discretizations of autonomous ODEs (see, e.g., [11, Sec. 9] for general Robbins-Monro algorithms and [34, Sec. 4.3] for SGD). In this line of research, a powerful class of techniques relies on Poincaré's invariant manifold theorem for an autonomous ODE in a neighborhood of some unstable equilibrium point. In our work, we extend trap-avoidance results to a non-autonomous setting by borrowing a non-autonomous version of Poincaré's theorem from the rich literature on the subject [17,31].
In [24], the authors succeeded in establishing a trap-avoidance result for their non-autonomous stochastic algorithm, which is close to our S-NAG algorithm (see the discussion at the end of Section 3.4 above), at the expense of a sub-Gaussian assumption on the noise and a rather stringent assumption on the stepsizes. The main difficulty in the approach of [24] lies in the use of the classical autonomous version of Poincaré's theorem (see [24, Remark 2.1]). This difficulty is avoided by our approach, which allows us to obtain trap-avoidance results under close-to-minimal assumptions. More recently, in the contribution of [23] discussed in Sec. 3.4, the authors establish a trap-avoidance result ([23, Th. 3]) for the algorithm described in Eq. (6), using techniques inspired by [39,11]. As previously mentioned, this recent work does not handle momentum, and hence covers neither Algorithm 1 nor Algorithm 2. Moreover, as indicated in our discussion of [24], our proof strategy is different.
Taking another point of view on trap avoidance, some recent works [32,21,28,36,37] address the problem of escaping saddle points when the algorithm is deterministic but the initialization point is random. In contrast to this line of research, our work considers a stochastic algorithm in which randomness enters at each iteration via the noisy gradients.
5 Proofs for Section 2

The arguments of the proof of this theorem that we provide here follow the approach of [9], with some small differences. Close arguments can be found in [8]. We provide the proof here for completeness and in preparation for the proofs related to the stochastic algorithms.

Existence and uniqueness
The following lemma guarantees that the term $\sqrt{v(t)+\varepsilon}$ in (ODE-1) is well-defined.
Recall that F ‹ " inf F is finite by Assumption 2.2. Of prime importance in the proof will be the energy (Lyapunov) function E : R`ˆZ`Ñ R, defined as for every h ě 0 and every z " pv, m, xq P Z`. This function is slightly different from its analogues that were used in [3,8,9]. Consider pt, zq P p0,`8qˆZ`and set z " pv, m, xq. Then, using Assumption 2.1, we can write B t Ephptq, zq`x∇ z Ephptq, zq, gpz, tqy With the help of this function, we can now establish the existence, the uniqueness and the boundedness of the solution of (ODE-1) on rt 0 , 8q for an arbitrary t 0 ą 0.
Proof. Let $t_0 > 0$ and fix $z_0 \in \mathcal{Z}_+$. On each set of the form $[t_0, t_0+A] \times \bar B(z_0, R)$, where $A, R > 0$ and $\bar B(z_0, R) \subset (-\varepsilon, \infty)^d \times \mathbb{R}^d \times \mathbb{R}^d$, we easily obtain from our assumptions that the function $g$ defined in (1) is continuous, and that $g(\cdot, t)$ is Lipschitz uniformly in $t \in [t_0, t_0+A]$. Under these conditions, Picard's theorem asserts that (ODE-1) started at $z(t_0) = z_0$ has a unique solution on a certain maximal interval $[t_0, T)$. Lem. 5.1 shows that $v(t) \ge 0$ on this interval.
Let us show that T " 8. Applying Ineq. (16) with pv, m, xq " pvptq, mptq, xptqq and using Assumption 2.4, we obtain that the function t Þ Ñ Ephptq, zptqq is decreasing on rt 0 , T q. By the coercivity of F (Assumption 2.2) and Assumption 2.4-i), we get that the trajectory txptqu is bounded. Recall the equation 9 mptq " hptq∇F pxptqq´rptqmptq. Using the continuity of the functions ∇F , h and r along with Gronwall's lemma, we get that tmptqu is bounded if T ă 8. We can show a similar result for tvptqu. Thus, tzptqu is bounded on rt 0 , T q if T ă 8 which is a contradiction, see, e.g., [26,Cor.3.2].
It remains to show that the trajectory $\{z(t)\}$ is bounded. To that end, let us apply the variation-of-constants method to the equation $\dot m(t) = h(t)\nabla F(x(t)) - r(t) m(t)$. Writing $R(t) = \int_{t_0}^t r(u)\,du$, we get that $\frac{d}{dt}\big(e^{R(t)} m(t)\big) = e^{R(t)} h(t) \nabla F(x(t))$.
Therefore, for every $t \ge t_0$, using the continuity of $\nabla F$ together with the boundedness of $x$, Assumption 2.4 and the triangle inequality, we obtain the existence of a constant $C > 0$, independent of $t$, s.t. the stated bound holds.
The same reasoning applies to $v(t)$, using the continuity of $S$ and Assumption 2.4. This completes the proof.
We can now extend this solution to $t_0 = 0$ along the approach of [9], where the detailed derivations can be found. The idea is to replace $h(t)$ with $h(\max(\eta, t))$ for some $\eta > 0$, and to do the same for $p$, $q$, and $r$. It is then easy to see that the ODE obtained after these replacements has a unique global solution on $\mathbb{R}_+$. Letting $\eta \to 0$ and using the Arzelà-Ascoli theorem along with Assumption 2.5, we obtain that (ODE-1) has a unique solution on $\mathbb{R}_+$.
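To illustrate the role of the energy function in the argument above, the following toy integration (a simplified constant-coefficient instance with $h \equiv 1$, $r \equiv 1$ and the adaptive variable $v$ removed — an assumption made purely for illustration, not the paper's general setting) checks numerically that $E(t) = F(x(t)) + \frac12\|m(t)\|^2$ is non-increasing along an Euler discretization of $\dot m = \nabla F(x) - m$, $\dot x = -m$, consistent with $\frac{dE}{dt} = -r\|m\|^2 \le 0$.

```python
import numpy as np

# Quadratic objective F(x) = 0.5*||x||^2, so grad F(x) = x.
def grad_F(x):
    return x

dt, r = 1e-3, 1.0
x, m = np.array([1.0, -2.0]), np.zeros(2)
energies = []
for _ in range(20000):
    energies.append(0.5 * x @ x + 0.5 * m @ m)   # E = F(x) + 0.5*||m||^2
    m = m + dt * (grad_F(x) - r * m)             # explicit Euler on  m' = grad F(x) - r m
    x = x - dt * m                               # x' = -m
energies = np.array(energies)

assert np.all(np.diff(energies) <= 1e-12)   # monotone decrease, up to rounding
assert energies[-1] < energies[0]
```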

Convergence
The first step in this part consists in transforming (ODE-1) into an autonomous ODE by including the time variable in the state vector. More specifically, we first append the equation $\dot u(t) = 1$ to the system, and then perform the change of variable $(z, u) \mapsto (z, s) = (z, 1/u)$, allowing the solution to lie in a compact set.
We initialize the above ODE at a time instant $t_0 > 0$. Define the functions $H, R, P, Q : \mathbb{R}_+ \to \mathbb{R}_+$ by setting $H(s) = h(1/s)$, $R(s) = r(1/s)$, $P(s) = p(1/s)$, $Q(s) = q(1/s)$ for $s > 0$, and $H(0) = h_\infty$, $R(0) = r_\infty$, $P(0) = p_\infty$, $Q(0) = q_\infty$. Our autonomous dynamical system can then be described by the following system of equations. Since the solution of the ODE $\dot s(t) = -s(t)^2$ with $s(t_0) = 1/t_0$ is $s(t) = 1/t$, the trajectory $\{s(t)\}$ is bounded. The three remaining equations are a reformulation of (ODE-1), whose trajectories have already been shown to exist and to be bounded in Lem. 5.2. In the sequel, we denote by $\Phi : \mathcal{Z}_+ \times \mathbb{R}_+ \to \mathcal{Z}_+ \times \mathbb{R}_+$ the semiflow induced by the autonomous ODE (17), i.e., for every $u = (z, s) \in \mathcal{Z}_+ \times \mathbb{R}_+$, $\Phi(u, \cdot)$ is the unique global solution to (17) initialized at $u$. Observe that the orbits of this semiflow are precompact. Moreover, the function $\Phi((z, 0), \cdot)$ is well-defined for each $z \in \mathcal{Z}_+$, since the associated solution satisfies the ODE (19) defined below, whose first three equations satisfy the hypotheses of Lem. 5.2.
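The extra state variable obeys $\dot s(t) = -s(t)^2$ with $s(t_0) = 1/t_0$, whose exact solution is $s(t) = 1/t$. As a quick numerical check (ours, not in the paper), an Euler integration reproduces this:

```python
# Euler-integrate s' = -s^2 from t0 = 1 to T = 10 and compare with the exact 1/t.
dt, t0, T = 1e-4, 1.0, 10.0
s = 1.0 / t0
t = t0
while t < T:
    s += dt * (-s * s)
    t += dt

assert abs(s - 1.0 / T) < 1e-3   # first-order Euler error, O(dt)
```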
Since $V \circ \Phi(u, \cdot)$ is non-increasing and nonnegative, we can define $V_\infty := \lim_{t\to\infty} V(\Phi(u, t))$. Let $\omega(u) := \bigcap_{s>0} \overline{\bigcup_{t \ge s} \Phi(u, t)}$ be the $\omega$-limit set of the semiflow $\Phi$ issued from $u$. Recall that $\omega(u)$ is an invariant set for the flow $\Phi(u, \cdot)$, and that $\mathrm{dist}(\Phi(u, t), \omega(u)) \to 0$ as $t \to \infty$ (see, e.g., [25, Th. 1.1.8]). In order to finish the proof of Th. 2.1, we need to make explicit the structure of $\omega(u)$.
We know from La Salle's invariance principle that $\omega(u) \subset V^{-1}(V_\infty)$, in particular by the invariance of $\omega(u)$. From ODE (17), any $y \in \omega(u)$ is of the form $y = (z, 0)$, since $s(t) \to 0$. As a consequence, $\Phi(y, \cdot)$ is a solution to the autonomous ODE (19), whose first three equations can be written in the more compact form $\dot z(t) = g_\infty(z(t))$, where $z(t) = (v(t), m(t), x(t))$ and $g_\infty$ is defined for each $z \in \mathcal{Z}_+$. Consider $y = (v, m, x, 0) \in \omega(u)$. Using Eq. (18), we obtain that $dV(\Phi(y, t))/dt = 0$ along the trajectory $(v(t), m(t), x(t), 0) = \Phi(y, t)$. As a consequence, Assumption 2.4-iv) gives $m(t) = m = 0$, and then $x(t) = x$ for some $x$ s.t. $\nabla F(x) = 0$, using ODE (19). We now turn to showing that $v(t) = v = p_\infty S(x)/q_\infty$. We have proved so far that any element $y \in \omega(u)$ has the form $y = (v, 0, x, 0)$ with $\nabla F(x) = 0$. The component $v(\cdot)$ of $\Phi(y, \cdot)$ is a solution to the ODE $\dot v(t) = p_\infty S(x) - q_\infty v(t)$, and thus has the explicit form given in Eq. (21). Fixing $x$, let $S_x$ be the corresponding section of $\omega(u)$.
As $\omega(u)$ is invariant, we have $S_x\,\omega(u) = S_x\,\Phi(\omega(u), t)$ for all $t \ge 0$. Since the set $\{\tilde v \in \mathbb{R}^d : (\tilde v, 0, x, 0) \in S_x\,\omega(u)\}$ lies in a compact set, we deduce from Eq. (21) that this set reduces to the singleton $\{p_\infty S(x)/q_\infty\}$, and in particular $v = p_\infty S(x)/q_\infty$. Therefore, the union of the $\omega$-limit sets of the semiflow $\Phi$ induced by ODE (17) coincides with the set of equilibrium points of this semiflow. The latter set corresponds to the set of points $(z, 0)$ s.t. $z \in \mathrm{zer}\, g_\infty$. It remains to notice that $\Upsilon = \mathrm{zer}\, g_\infty$ to finish the proof.
Remark 13. In line with Remark 3, the same proof works for (ODE-1′) by using the function $F - F_\star$ as a Lyapunov function. The corresponding limit set (as $t \to +\infty$) is then of the form $\{z_\infty = (\tilde v_\infty, \tilde x_\infty) \in \mathbb{R}^d \times \mathbb{R}^d : \nabla F(\tilde x_\infty) = 0,\ \tilde v_\infty = p_\infty S(\tilde x_\infty)/q_\infty\}$.
Similarly, if we set p " q " 0 in (ODE-1) and we keep what remains in Assumption 2.4, the function hptqpF pxq´F ‹ q`1 2 }m} 2 works as a Lyapunov function, and the limit set has the form tp0, xq : ∇F pxq " 0u.
Proof. The identity follows by simple differentiation. Consider a solution $(m, x)$ of (ODE-N) starting at $(m_0, x_0) \in \mathbb{R}^d \times \mathbb{R}^d$. As in Section 5.1.2, for every $t_0 > 0$, on $[t_0, +\infty)$, $(m, x, s)$ is a solution to the autonomous ODE (23) starting at $(m_0, x_0, 1/t_0)$. Denote by $\Phi_N = (\Phi_N^m, \Phi_N^x, \Phi_N^s)$ the semiflow induced by ODE (23), and by $\omega_N((m_0, x_0, 1/t_0))$ its limit set.
Lemma 5.4. For any compact set $K \subset \mathbb{R}^{2d+1}$ and any $T > 0$, the family of functions $\{\Phi(z, \cdot) : [0, T] \to \mathbb{R}^{2d+1}\}_{z \in K}$, where $\Phi$ is either $\Phi_H$ or $\Phi_N$, is relatively compact in $(C^0([0, T], \mathbb{R}^{2d+1}), \|\cdot\|_\infty)$.
Proof. The map $\Phi : \mathbb{R}^{2d+1} \times \mathbb{R}_+ \to \mathbb{R}^{2d+1}$ is continuous, hence uniformly continuous on $K \times [0, T]$. The result follows from the application of the Arzelà-Ascoli theorem to the family $\{\Phi(z, \cdot) : [0, T] \to \mathbb{R}^{2d+1}\}_{z \in K}$. Let $(m, x, 0) \in \omega_N((m_0, x_0, 1/t_0))$. There exists a sequence $(t_k)$ of nonnegative reals such that $(m, x, 0) = \lim_{k\to\infty} (m(t_k), x(t_k), 1/t_k)$. For any $T > 0$, using Lem. 5.4 and up to an extraction, the sequence of functions $\{\Phi_N((m(t_k), x(t_k), 1/t_k), \cdot)\}_k$ converges to $(\bar m, \bar x, 0)$ in $C^0([0, T], \mathbb{R}^d)$, where $(\bar m, \bar x)$ is a solution to $\dot{\bar m}(t) = \nabla F(\bar x(t))$, $\dot{\bar x}(t) = -\bar m(t)$, with $(\bar m(0), \bar x(0)) = (m, x)$. Moreover, by Lem. 5.3, using Lem. 5.4 and up to an additional extraction, we get on $C^0([0, T^2/\kappa^2], \mathbb{R}^{2d+1})$ that $\{\Phi_H((x(t_k), m(t_k), 1/t_k), \cdot)\}_k$ converges to $(u, y, 0)$, where $(u, y)$ is a solution to $\dot y(t) = 0$, $\dot u(t) = -y(t)$. Therefore, $u(t) = A + Bt$ for some $A$ and $B$ in $\mathbb{R}^d$. Suppose that $B \ne 0$. We previously proved that $x$ (and therefore $u$) is bounded by some constant $C > 0$. Let $T_1 > (C + \|A\|)/\|B\|$. Up to an extraction, $\{\Phi_H((x(t_k), m(t_k), 1/t_k), \cdot)\}_k$ converges to $u_1$ on $C^0([0, T_1], \mathbb{R}^{2d+1})$, with $u_1(t) = A_1 + B_1 t$ for some $A_1$ and $B_1$ in $\mathbb{R}^d$. By uniqueness of the limit, $A_1 = A$ and $B_1 = B$. As a consequence, $\|u_1(T_1)\| = \|A + B T_1\| > C$, and we obtain a contradiction. Hence $B = 0$.
6 Proofs for Section 3

Preliminaries
We first recall some useful definitions and results. Let $\Psi$ be any semiflow on an arbitrary metric space $(E, d)$. As in the previous section, a point $z \in E$ is called an equilibrium point of the semiflow $\Psi$ if $\Psi(z, t) = z$ for all $t \ge 0$. We denote by $\Lambda_\Psi$ the set of equilibrium points of $\Psi$. A continuous function $\mathcal{V} : E \to \mathbb{R}$ is called a Lyapunov function for the semiflow $\Psi$ if $\mathcal{V}(\Psi(z, t)) \le \mathcal{V}(z)$ for all $z \in E$ and all $t \ge 0$. It is called a strict Lyapunov function if, moreover, $\{z \in E : \forall t \ge 0,\ \mathcal{V}(\Psi(z, t)) = \mathcal{V}(z)\} = \Lambda_\Psi$. If $\mathcal{V}$ is a strict Lyapunov function for $\Psi$ and if $z \in E$ is a point s.t. $\{\Psi(z, t) : t \ge 0\}$ is relatively compact, then $\Lambda_\Psi \ne \emptyset$ and $d(\Psi(z, t), \Lambda_\Psi) \to 0$; see [25, Th. 2.1.7]. A continuous function $z : [0, +\infty) \to E$ is said to be an asymptotic pseudotrajectory (APT, [12]) for the semiflow $\Psi$ if $\lim_{t\to+\infty} \sup_{s \in [0, T]} d(z(t+s), \Psi(z(t), s)) = 0$ for every $T \in (0, +\infty)$.

Proof of Th. 3.1
Recall that $\Phi$ is the semiflow induced by the autonomous ODE (17), which is an "autonomized" version of our initial (ODE-1). In the remainder of this section, the proof is divided into two main steps: (a) we show that a certain continuous-time, linearly interpolated process constructed from the iterates of Algorithm 1 is an APT of $\Phi$; (b) we exhibit a strict Lyapunov function for the restriction of a well-chosen semiflow related to $\Phi$ to a carefully chosen compact set. We then characterize the limit set of the APT using [11, Th. 5.7] and [10, Prop. 3.2]. The sequence $(z_n)$ converges almost surely to this same limit set.
(a) APT. For every $n \ge 1$, define $\bar z_n = (v_n, m_n, x_{n-1})$ (note the shift in the index of the variable $x$). We have the decomposition $\bar z_{n+1} = \bar z_n + \gamma_{n+1} g(\bar z_n, \tau_n) + \gamma_{n+1}\eta_{n+1} + \gamma_{n+1}\varsigma_{n+1}$, where $g$ is defined in Eq. (1), $\eta_{n+1} = \big(p_n(\nabla f(x_n, \xi_{n+1})^{\odot 2} - S(x_n)),\ h_n(\nabla f(x_n, \xi_{n+1}) - \nabla F(x_n)),\ 0\big)$ is a martingale increment, and $\varsigma_{n+1} = (\varsigma^v_{n+1}, \varsigma^m_{n+1}, \varsigma^x_{n+1})$ has components $\varsigma^v_{n+1} = p_n(S(x_n) - S(x_{n-1}))$, $\varsigma^m_{n+1} = h_n(\nabla F(x_n) - \nabla F(x_{n-1}))$, and $\varsigma^x_{n+1} = (\frac{\gamma_n}{\gamma_{n+1}} - 1)\frac{m_n}{\sqrt{v_n+\varepsilon}}$. We first prove that $\varsigma_n \to 0$ a.s. by considering the components separately. The components $\varsigma^m_{n+1}$ and $\varsigma^v_{n+1}$ converge a.s. to zero by Assumptions 2.1 and 2.3, together with the boundedness of the (convergent) sequences $(p_n)$ and $(h_n)$. Indeed, since $\nabla F$ is locally Lipschitz continuous and the sequence $(z_n)$ is assumed almost surely bounded, there exists a constant $C$ s.t. $\|\nabla F(x_n) - \nabla F(x_{n-1})\| \le C\|x_n - x_{n-1}\| \le \frac{C}{\sqrt{\varepsilon}}\,\gamma_n\|m_n\|$. The same inequality holds with $\nabla F$ replaced by $S$, which is also locally Lipschitz continuous. The component $\varsigma^x_{n+1}$ also converges a.s. to zero, by observing that $\|\varsigma^x_{n+1}\| \le |1 - \frac{\gamma_n}{\gamma_{n+1}}| \cdot \|m_n\|/\sqrt{\varepsilon}$ and using Assumption 3.2 together with the a.s. boundedness of $(z_n)$. Now consider the martingale increment sequence $(\eta_n)$, adapted to $\mathcal{F}_n$. Take $\delta > 0$. Since $(z_n)$ is a.s. bounded, there is a constant $C_1 > 0$ such that $\mathbb{P}(\sup_n \|x_n\| > C_1) \le \delta$. Denoting $\bar\eta_n := \eta_n \mathbb{1}_{\|x_n\| \le C_1}$ and combining Assumption 2.4 with 3.4-i), we can show using convexity inequalities that $\sup_n \mathbb{E}\|\bar\eta_{n+1}\|^q < \infty$.
Using Eq. (30), the almost sure boundedness of the sequence $(z_n)$ and the fact that $\varsigma_n$ converges a.s. to zero, it follows from [11, Prop. 4.1, Remark 4.5] that $u(t)$ is an APT of the semiflow $\Phi$ induced by (17). Note that $z(t)$ is also an APT of the semiflow $\Phi_\infty$ induced by (20). As the trajectory of $u(t)$ is precompact, its limit set $L(u)$ is compact. Our objective now is to prove the inclusion stated in Eq. (32). In order to establish this inclusion, we study the behavior of the restriction $\Phi|_L$ of the semiflow $\Phi$ to the set $L$ (which is well-defined since $L$ is $\Phi$-invariant), where $\Phi_\infty$ is the semiflow associated with (20). In the second part of the proof, we establish Eq. (32) by combining item (a) we just proved with [11, Th. 5.7] and [11, Prop. 6.4]. In order to use the latter proposition, we prove a useful proposition in item (b).
(b) Strict Lyapunov function and convergence. For every $\delta > 0$ and every $z = (v, m, x) \in \mathcal{Z}_+$, define $W_\delta$ as stated, where, under Assumption 2.4-i), the function $E_\infty$ is defined accordingly. Proposition 6.1. Let $t_0 > 0$ and let Assumptions 2.1 to 2.4 and 3.5 hold true. Let $S$ be the limit set defined in Eq. (31). Let $\Phi_\infty|_S : S \times [t_0, +\infty) \to S$ be the restriction of the semiflow $\Phi_\infty$ to $S$, i.e., $\Phi_\infty|_S(z, t) = \Phi_\infty(z, t)$ for all $z \in S$, $t \ge t_0$. Then: iii) The set of equilibrium points of $\Phi_\infty|_S$ is equal to $\Lambda_{\Phi_\infty} \cap S$.
Proof. The first point is a consequence of the definition of $S$ and the boundedness of $z$. The second point stems from the definition of $\Phi_\infty$. Observing that $\Phi_\infty|_S$ is valued in $S$, the third point is immediate from the definition of $\Lambda_{\Phi_\infty}$. We now prove the last point. Consider $z \in S$ and write $\Phi_\infty(z, t)$ under the form $\Phi_\infty(z, t) = (v(t), m(t), x(t))$. Notice that this quantity is bounded as a function of $t$. For any map $\mathcal{W} : \mathcal{Z}_+ \to \mathbb{R}$, define for all $t \ge t_0$, $L_{\mathcal{W}}(t) := \limsup_{s\to 0} s^{-1}\big(\mathcal{W}(\Phi_\infty(z, t+s)) - \mathcal{W}(\Phi_\infty(z, t))\big)$. Introduce $G(z) := -\langle \nabla F(x), m\rangle$ and $H(z) := \|q_\infty v - p_\infty S(x)\|^2$ for every $z = (v, m, x) \in \mathcal{Z}_+$. Consider $\delta > 0$ (to be specified later on). We study $L_{W_\delta} = L_{E_\infty} + \delta L_G + \delta L_H$. Note that $\Phi_\infty(z, t) \in S \cap \mathcal{Z}_+$ for all $t \ge t_0$, by an analogue of Lem. 5.1 for $\Phi_\infty$. Thus, $t \mapsto E_\infty(\Phi_\infty(z, t))$ is differentiable at any point $t \ge t_0$ and $L_{E_\infty}(t) = \frac{d}{dt}E_\infty(\Phi_\infty(z, t))$. Using derivations similar to Ineq. (16), we obtain the corresponding bound. We now study $L_G$. For every $t \ge t_0$, $\limsup_{s\to 0} s^{-1}\big({-\langle}\nabla F(x(t+s)), m(t+s)\rangle + \langle \nabla F(x(t)), m(t)\rangle\big) \le \limsup_{s\to 0} s^{-1}\|\nabla F(x(t)) - \nabla F(x(t+s))\|\,\|m(t+s)\| - \langle \nabla F(x(t)), \dot m(t)\rangle$.
It can easily be seen that, for every $z \in S$, $t \mapsto W_\delta(\Phi_\infty(z, t))$ is Lipschitz continuous, hence absolutely continuous. Its derivative coincides almost everywhere with $L_{W_\delta}$, which is nonpositive. Thus, $W_\delta$ is a Lyapunov function for $\Phi_\infty|_S$. We now prove that this Lyapunov function is strict.

Proof of Th. 3.3
We can rewrite the iterates of Algorithm 2 as follows: $m_{n+1} = m_n + \gamma_{n+1}\big(\nabla F(x_n) - \frac{\alpha}{\tau_n} m_n\big) + \gamma_{n+1}\big(\nabla f(x_n, \xi_{n+1}) - \nabla F(x_n)\big)$ and $x_{n+1} = x_n - \gamma_{n+1} m_{n+1}$.
We prove that the sequence $(y_n = (m_n, x_n) : n \in \mathbb{N})$ of iterates of this algorithm converges almost surely to the set $\bar\Upsilon$ defined in Eq. (3), provided it is bounded with probability one. The proof follows a path similar to that of Section 5.2.
Indeed, denote by $X$ and $M$ the linearly interpolated processes constructed from the sequences $(x_n)$ and $(m_n)$ respectively, and let $s(t) = 1/t$. Recall that $\Phi_N = (\Phi_N^m, \Phi_N^x, \Phi_N^s)$ is the semiflow induced by (23). As in Section 6.2, $Z := (M, X, s)$ is an APT of (23). Let $(m, x)$ be a limit point of the sequence $(y_n)$ and let $T > 0$. Using Lem. 5.4, we can proceed as in Section 5.2 and obtain a sequence $(t_k)$ such that $(M(t_k + \cdot), X(t_k + \cdot)) \to (\bar m, \bar x)$ and $(\Phi_H^y(Z(t_k), \cdot), \Phi_H^u(Z(t_k), \cdot)) \to (y, u)$, where $(\bar m(0), \bar x(0)) = (m, x)$, and $(\bar m, \bar x)$ and $(y, u)$ are respectively solutions to (25) and (27). As at the end of Section 5.2, we obtain that $u$ and $\bar x$ are constant; therefore $m = 0$ and $\nabla F(x) = 0$, which finishes the proof.
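A toy run of the S-NAG recursion displayed above can be sketched as follows. All specifics here are our own illustrative assumptions, not the paper's: a quadratic $F(x) = \frac12\|x\|^2$ (so $\nabla F(x) = x$ and $x_\star = 0$), $\alpha = 3$, stepsizes $\gamma_n = n^{-0.7}$ (which satisfy $\sum\gamma_n = \infty$, $\sum\gamma_n^2 < \infty$), $\tau_n = \sum_{k\le n}\gamma_k$, and small i.i.d. Gaussian gradient noise. The iterates $(m_n, x_n)$ should approach $(0, x_\star)$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 3.0
x, m = np.array([1.0, -1.0]), np.zeros(2)
tau = 0.0
for n in range(1, 200001):
    gamma = n ** -0.7
    tau += gamma                                 # tau_n ~ elapsed "ODE time"
    noise = 0.1 * rng.normal(size=2)             # nabla f - nabla F, zero-mean
    m = m + gamma * (x - (alpha / tau) * m + noise)   # nabla F(x) = x
    x = x - gamma * m

# Loose check that the pair (m_n, x_n) approaches (0, 0) on this toy problem.
assert np.linalg.norm(x) < 0.5
assert np.linalg.norm(m) < 0.5
```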

Proof of Th. 3.2
The idea of the proof is to apply Robbins-Siegmund's theorem [41] to the sequence $V_n := h_{n-1}F(x_n) + P_n$ defined below (note the similarity of $V_n$ with the energy function (15)). Since $\inf F > -\infty$, we assume without loss of generality that $F \ge 0$. In this subsection, we use $\nabla f_{n+1}$ as a shorthand for $\nabla f(x_n, \xi_{n+1})$, and $C$ denotes a positive constant which may change from line to line. We write $\mathbb{E}_n = \mathbb{E}[\,\cdot \mid \mathcal{F}_n]$ for the conditional expectation w.r.t. the $\sigma$-algebra $\mathcal{F}_n$. Define $P_n := \frac12\langle D_n, m_n^{\odot 2}\rangle$ with $D_n := (v_n + \varepsilon)^{\odot -\frac12}$. Remarking that $v_{n+1} \ge (1 - \gamma_{n+1} q_n) v_n$ and using the update rule of $v_n$, we obtain, componentwise and for $n$ sufficiently large,
$$\sqrt{v_n+\varepsilon} - \sqrt{v_{n+1}+\varepsilon} = \gamma_{n+1}\,\frac{q_n v_n - p_n \nabla f_{n+1}^{\odot 2}}{\sqrt{v_n+\varepsilon} + \sqrt{v_{n+1}+\varepsilon}} \le \frac{\gamma_{n+1} q_n\, v_n}{(1+\sqrt{1-\gamma_{n+1} q_n})\sqrt{v_n+\varepsilon}} \le c_{n+1}\sqrt{v_{n+1}}, \quad \text{where } c_{n+1} := \frac{\gamma_{n+1} q_n}{\sqrt{1-\gamma_{n+1} q_n}\,\big(1+\sqrt{1-\gamma_{n+1} q_n}\big)}.$$
Using $m_{n+1}^{\odot 2} - m_n^{\odot 2} = 2 m_n \odot (m_{n+1} - m_n) + (m_{n+1} - m_n)^{\odot 2}$, and noting that $\mathbb{E}_n(m_{n+1} - m_n) = \gamma_{n+1} h_n \nabla F(x_n) - \gamma_{n+1} r_n m_n$, we get $\mathbb{E}_n \frac12\langle D_n, m_{n+1}^{\odot 2} - m_n^{\odot 2}\rangle = \gamma_{n+1} h_n \big\langle \nabla F(x_n), \frac{m_n}{\sqrt{v_n+\varepsilon}}\big\rangle - 2\gamma_{n+1} r_n P_n + \frac12\langle D_n, \mathbb{E}_n[(m_{n+1} - m_n)^{\odot 2}]\rangle$.
Using the inequality $\langle u, v\rangle \le (\|u\|^2 + \|v\|^2)/2$ and Assumption 3.6-ii), it is easy to show that $\big\langle \nabla F(x_n), \frac{m_n}{\sqrt{v_n+\varepsilon}}\big\rangle \le C(1 + F(x_n) + P_n)$. Moreover, using the componentwise inequality $(h_n \nabla f_{n+1} - r_n m_n)^{\odot 2} \le 2 h_n^2 \nabla f_{n+1}^{\odot 2} + 2 r_n^2 m_n^{\odot 2}$ along with Assumption 3.6-ii) and the boundedness of the sequences $(h_n)$, $(r_n)$ and $(\gamma_{n+1}/\gamma_n)$, we obtain $\langle D_n, \mathbb{E}_n[(m_{n+1} - m_n)^{\odot 2}]\rangle \le C\gamma_n^2(1 + F(x_n) + P_n)$.
Combining Eq. (46) and Eq. (47), we get $\mathbb{E}_n(P_{n+1} - P_n) \le \gamma_{n+1} h_n \langle \nabla F(x_n), m_n \odot D_n\rangle + C\gamma_n^2(1 + F(x_n) + P_n)$ (48). Denoting by $M$ the Lipschitz coefficient of $\nabla F$, we also have $F(x_{n+1}) \le F(x_n) - \gamma_{n+1}\langle \nabla F(x_n), m_{n+1} \odot D_{n+1}\rangle + \frac{M\gamma_{n+1}^2}{2}\|m_{n+1} \odot D_{n+1}\|^2$. Using (45) and the update rule of $m_n$, we have $\|m_{n+1} \odot D_{n+1} - m_n \odot D_n\|^2 \le C\|(m_{n+1} - m_n) \odot D_n\|^2 + C\|m_{n+1} \odot (D_{n+1} - D_n)\|^2 \le C\gamma_{n+1}^2\big(\|\nabla f_{n+1}\|^2 + \|m_n \odot D_n\|^2\big) + C\gamma_{n+1}^2\|m_{n+1} \odot D_n\|^2$. Finally, recalling that $V_n = h_{n-1}F(x_n) + P_n$ and that $(h_n)$ is decreasing, combining Eqs. (48), (49), (50) and using Assumption 3.6, we have $\mathbb{E}_n[V_{n+1}] \le V_n + \gamma_{n+1} h_n \langle \nabla F(x_n), \mathbb{E}_n[m_n \odot D_n - m_{n+1} \odot D_{n+1}]\rangle + C\gamma_n^2(1 + F(x_n) + P_n) \le (1 + C\gamma_n^2)V_n + C\gamma_n^2$, where we used the Cauchy-Schwarz inequality and the fact that $\|m_n \odot D_n\|^2 \le C P_n$. By Robbins-Siegmund's theorem [41], the sequence $(V_n)$ converges almost surely to a finite random variable $V_\infty \in \mathbb{R}_+$. Then, the coercivity of $F$ implies that $(x_n)$ is almost surely bounded.
We now establish the almost sure boundedness of $(m_n)$. Assume in the sequel that $n$ is large enough so that $1 - \gamma_{n+1} r_n \ge 0$. Consider the martingale difference sequence $\Delta_{n+1} := \nabla f_{n+1} - \nabla F(x_n)$. We decompose $m_n = \bar m_n + \tilde m_n$, where $\bar m_{n+1} = (1 - \gamma_{n+1} r_n)\bar m_n + \gamma_{n+1} h_n \nabla F(x_n)$ and $\tilde m_{n+1} = (1 - \gamma_{n+1} r_n)\tilde m_n + \gamma_{n+1} h_n \Delta_{n+1}$, setting $\tilde m_0 = 0$ and $\bar m_0 = m_0$. We prove that both $\bar m_n$ and $\tilde m_n$ are bounded. Consider the first term: $\|\bar m_{n+1}\| \le (1 - \gamma_{n+1} r_n)\|\bar m_n\| + \gamma_{n+1}\sup_k \|h_k \nabla F(x_k)\|$, where the supremum is almost surely finite by continuity of $\nabla F$. We immediately get that if $\|\bar m_n\| \ge \sup_k \|h_k \nabla F(x_k)\|/r_\infty$, then $\|\bar m_{n+1}\| \le \|\bar m_n\|$, which implies that $(\bar m_n)$ is bounded. Consider now the term $\tilde m_n$: $\mathbb{E}_n[\|\tilde m_{n+1}\|^2] = (1 - \gamma_{n+1} r_n)^2\|\tilde m_n\|^2 + \gamma_{n+1}^2 h_n^2 \mathbb{E}_n[\|\Delta_{n+1}\|^2] \le \|\tilde m_n\|^2 + \gamma_{n+1}^2 h_n^2 \mathbb{E}_n[\|\Delta_{n+1}\|^2]$. Then, the inequality $\mathbb{E}_n[\|\Delta_{n+1}\|^2] \le \mathbb{E}_n[\|\nabla f_{n+1}\|^2]$, combined with Assumption 3.4-i) and the a.s. boundedness of $(x_n)$, implies that there exists a finite random variable $C_K$ (independent of $n$) s.t. $\mathbb{E}_n[\|\nabla f_{n+1}\|^2] \le C_K$. As a consequence, since $\sum_n \gamma_{n+1}^2 < \infty$ and the sequence $(h_n)$ is bounded, we obtain that almost surely $\sum_n \gamma_{n+1}^2 h_n^2 \mathbb{E}_n[\|\Delta_{n+1}\|^2] < \infty$. Hence, we can apply the Robbins-Siegmund theorem to obtain that $\sup_n \|\tilde m_n\|^2 < \infty$ w.p. 1. Finally, it can be shown that $(v_n)$ is almost surely bounded using the same arguments, decomposing $v_n$ into $\bar v_n + \tilde v_n$ as above. Indeed, first, we have $\mathbb{E}_n[\|\tilde v_{n+1}\|^2] \le \|\tilde v_n\|^2 + \gamma_{n+1}^2 p_n^2 \mathbb{E}_n[\|\nabla f_{n+1}^{\odot 2} - S(x_n)\|^2]$. Second, it also holds that $\mathbb{E}_n[\|\nabla f_{n+1}^{\odot 2} - S(x_n)\|^2] \le \mathbb{E}_n[\|\nabla f_{n+1}^{\odot 2}\|^2] \le \mathbb{E}_n[\|\nabla f_{n+1}\|^4]$. Then, using Assumption 3.4-i) and the a.s. boundedness of $(x_n)$, there exists a finite random variable $C'_K$ (independent of $n$) s.t. $\mathbb{E}_n[\|\nabla f_{n+1}\|^4] \le C'_K$. Moreover, the sequence $(p_n)$ is bounded and $\sum_n \gamma_{n+1}^2 < \infty$.
As a consequence, it holds almost surely that $\sum_{n \ge 0} \gamma_{n+1}^2 p_n^2 \mathbb{E}_n[\|\nabla f_{n+1}^{\odot 2} - S(x_n)\|^2] < \infty$. It follows that the Robbins-Siegmund theorem can be applied to the sequence $\|\tilde v_n\|^2$, as it was to $\|\tilde m_n\|^2$, to obtain that $\sup_n \|\tilde v_n\|^2 < \infty$ w.p. 1.
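For the reader's convenience, we recall the form of the Robbins-Siegmund theorem used repeatedly in this subsection (a standard statement; the notation is ours):

```latex
\textbf{Theorem (Robbins--Siegmund).} Let $(V_n)$, $(a_n)$, $(b_n)$, $(c_n)$ be
nonnegative $(\mathcal{F}_n)$-adapted random sequences such that, almost surely,
\[
  \mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le (1 + a_n) V_n + b_n - c_n,
  \qquad \sum_n a_n < \infty, \quad \sum_n b_n < \infty .
\]
Then, almost surely, $V_n$ converges to a finite random variable $V_\infty$ and
$\sum_n c_n < \infty$. In the proof above, it is applied for instance with
$V_n = h_{n-1} F(x_n) + P_n$, $a_n = b_n = C\gamma_n^2$ and $c_n = 0$.
```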

Proof of Th. 3.4
The proof of Th. 3.2 easily adapts to Algorithm 2 by replacing $V_n$ by $\tilde V_n := F(x_n) + \frac12\|m_n\|^2$.
The boundedness of $(m_n)$ is an immediate consequence of the convergence of $\tilde V_n$.

Proof of Th. 3.5
We shall use the following result.
Theorem 6.2 (adapted from [38, Th. 7]). Let $k \ge 1$. On some probability space equipped with a filtration $\mathcal{F} = (\mathcal{F}_n)_{n \in \mathbb{N}}$, consider a sequence of random variables on $\mathbb{R}^k$ given by $Z_{n+1} = (I + \gamma_{n+1}\tilde H)Z_n + \gamma_{n+1} b_{n+1} + \sqrt{\gamma_{n+1}}\,\eta_{n+1}$ with $\mathbb{E}[\|Z_0\|^2] < \infty$, where $\tilde H$ is a $k \times k$ Hurwitz matrix, $(b_n)$ and $(\eta_n)$ are random sequences, and $\gamma_n = \gamma_0 n^{-\alpha}$ for some $\gamma_0 > 0$ and $\alpha \in (0, 1]$. Let $\Omega_0 \in \mathcal{F}_\infty$ have positive probability. Assume that the following holds almost surely on $\Omega_0$: i) $\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n] = 0$.
Then, given $\Omega_0$, $(Z_n)$ converges in distribution to the unique stationary distribution $\mu_\star$ of the generalized Ornstein-Uhlenbeck process $dX_t = \tilde H X_t\,dt + \sqrt{\Sigma}\,dB_t$, where $(B_t)$ is a standard Brownian motion and $\sqrt{\Sigma}$ is the unique positive semidefinite square root of $\Sigma$. The distribution $\mu_\star$ is the zero-mean Gaussian distribution with covariance matrix $\Gamma$ given as the solution to $\big(\tilde H + \mathbb{1}_{\alpha=1}\frac{1}{2\gamma_0} I_k\big)\Gamma + \Gamma\big(\tilde H + \mathbb{1}_{\alpha=1}\frac{1}{2\gamma_0} I_k\big)^T = -\Sigma$. Proof. The proof is identical to that of [38, Th. 7], only substituting the Moore-Penrose inverse for the inverse of the square root of $\Sigma$. Finally, the uniqueness of the stationary distribution $\mu_\star$ and its expression follow from [29, Th. 6.7, p. 357]. We define $v_n = \bar v_n + \delta_n$, where $\delta_0 = 0$, $\bar v_0 = v_0$, $\delta_{n+1} = (1 - \gamma_{n+1} q_n)\delta_n + \gamma_{n+1}(p_n - q_n q_\infty^{-1} p_\infty) S(x_n)$, and $\bar v_{n+1} = (1 - \gamma_{n+1} q_n)\bar v_n + \gamma_{n+1} q_n q_\infty^{-1} p_\infty S(x_n) + \gamma_{n+1} p_n\big(\nabla f(x_n, \xi_{n+1})^{\odot 2} - S(x_n)\big)$.
For every z " pv, m, xq P Z`and δ ě 0, we define Moreover, for every z " pv, m, xq P Z`and every n P N, we set Defining ζ n " pv n , m n , x n´1 q and recalling the definition of pη n q from Eq. (28), we have the decomposition ζ n`1 " ζ n`γn`1 g n pζ n q`γ n`1 η n`1`γn`1 r n pζ n , δ n q .
Define $z_\star := (v_\star, 0, x_\star)$. Note that $g_n(z_\star) = 0$. Evaluating the Jacobian matrix $G_n$ of $g_n$ at $z_\star$, we obtain that there exist constants $C > 0$, $\bar M > 0$ and $n_0 \in \mathbb{N}$ s.t. the stated estimate holds for all $n \ge n_0$, where $G_n$ involves the Jacobian $\nabla S$ of $S$ and the matrix $V$ defined in Eq. (8). We define $G_\infty$ accordingly. One can verify that $G_\infty$ is Hurwitz, and that the largest real part of its eigenvalues is $-L'$, where $L' := L \wedge q_\infty$ and $L$ is defined in Eq. (9). We define $\Omega^{(0)} := \{z_n \to z_\star\}$ and assume $\mathbb{P}(\Omega^{(0)}) > 0$. Using for instance [20, Lem. 4 and Lem. 5], it holds that $\delta_n(\omega) \to 0$ for every $\omega \in \Omega^{(0)}$, and since $x_n(\omega) - x_{n-1}(\omega) \to 0$ on that set, we obtain that $\Omega^{(0)} = \{\zeta_n \to z_\star\}$. Let $M \in (0, \bar M)$ be a constant whose value will be specified later on. For every $N_0 \in \mathbb{N}$, define $\Omega^{(0)}_{N_0} := \{\zeta_n \to z_\star \text{ and } \sup_{n \ge N_0}\|\zeta_n - z_\star\| \le M\}$. We seek to show that $\gamma_n^{-1/2}(\zeta_n - z_\star) \Rightarrow \nu$ given $\Omega^{(0)}$, for some Gaussian measure $\nu$, using Th. 6.2.
As $\Omega^{(0)}_{N_0} \uparrow \Omega^{(0)}$, it is sufficient to show that the latter convergence holds given $\Omega^{(0)}_{N_0}$ for every $N_0$ large enough. From now on, we consider $N_0$ fixed. We define the sequence $(\tilde\zeta_n)_{n \ge N_0}$ by $\tilde\zeta_{N_0} = \zeta_{N_0}$ and, for every $n \ge N_0$, $\tilde\zeta_{n+1} = \tilde\zeta_n + \gamma_{n+1}\tilde g_n(\tilde\zeta_n) + \gamma_{n+1}(\eta_{n+1} + r_n(\zeta_n, \delta_n))\mathbb{1}_{A_n}$, where $A_n$ is the event $A_n := \bigcap_{k=N_0}^n \{\|x_k - x_\star\| \le M\} \cap \{\|\zeta_n - z_\star\| \le M\}$ and $\tilde g_n(z) := g_n(z)\mathbb{1}_{\|z - z_\star\| \le M} - K(z - z_\star)\mathbb{1}_{\|z - z_\star\| > M}$, with $K > 0$ a large constant to be specified later on. The sequences $(\zeta_n)_{n \ge N_0}$ and $(\tilde\zeta_n)_{n \ge N_0}$ coincide on $\Omega^{(0)}_{N_0}$; thus, it is sufficient to study the weak convergence of $(\tilde\zeta_n)_{n \ge N_0}$. An estimate of $\|r_n(\zeta_n, \delta_n)\|\mathbb{1}_{A_n}$. We start by studying the sequence $(\|\delta_n\|\mathbb{1}_{A_n})$. Unfolding the update rule defining $\delta_n$ and using the fact that $(q_n)$ is a sequence of positive reals converging to $q_\infty > 0$, we obtain a bound $w_n$, for some $\beta > 0$. The sequence $(w_n)$ is deterministic and converges to zero by [20, Lem. 4]. There exists $n_1 \ge n_0$ s.t. $w_n \le M$ for $n \ge n_1$. As $v \mapsto \frac{1}{\sqrt{v+\varepsilon}}$ is Lipschitz and $\nabla F$ and $S$ are locally Lipschitz, for every $z = (v, m, x)$ and $\delta$ s.t. $\|z - z_\star\| \le M$ and $\|\delta\| \le M$, we have $\|r_n(z, \delta)\| \le C\gamma_{n+1}\|(v+\delta+\varepsilon)^{\odot -\frac12}\|\,\|m\| + C\|(v+\delta+\varepsilon)^{\odot -\frac12} - (v+\varepsilon)^{\odot -\frac12}\|\,\|m\| \le C\gamma_{n+1}\|z - z_\star\| + C\|\delta\|\,\|z - z_\star\|$.
This implies that, for every $n \ge n_1$, $\|r_n(\zeta_n, \delta_n)\|\mathbb{1}_{A_n} \le C(\gamma_{n+1} + w_n)\|\tilde\zeta_n - z_\star\|$ (52). Tightness of $\gamma_n^{-1/2}(\tilde\zeta_n - z_\star)$. We decompose $\tilde\zeta_{n+1} - z_\star = (I_{3d} + \gamma_{n+1} G_n)(\tilde\zeta_n - z_\star) + \gamma_{n+1}\big(\tilde g_n(\tilde\zeta_n) - G_n(\tilde\zeta_n - z_\star)\big)\mathbb{1}_{\|\tilde\zeta_n - z_\star\| \le M} - \gamma_{n+1}(K I_{3d} + G_n)(\tilde\zeta_n - z_\star)\mathbb{1}_{\|\tilde\zeta_n - z_\star\| > M} + \gamma_{n+1}(\eta_{n+1} + r_n(\zeta_n, \delta_n))\mathbb{1}_{A_n}$ (53). For a given $t > 0$, we write $G_\infty = B_t^{-1} G^t B_t$ for the Jordan-like decomposition of $G_\infty$, where the ones on the superdiagonal of the usual Jordan decomposition are replaced by $t$, and where $B_t$ is some invertible matrix. We define $S_n := B_t(\tilde\zeta_n - z_\star)$. Setting $G_n^{(t)} := B_t G_n B_t^{-1}$, we obtain $S_{n+1} = (I_{3d} + \gamma_{n+1} G_n^{(t)}) S_n + \gamma_{n+1} B_t\big(\tilde g_n(\tilde\zeta_n) - G_n(\tilde\zeta_n - z_\star)\big)\mathbb{1}_{\|\tilde\zeta_n - z_\star\| \le M} - \gamma_{n+1}(K I_{3d} + G_n^{(t)}) S_n \mathbb{1}_{\|\tilde\zeta_n - z_\star\| > M} + \gamma_{n+1} B_t(\eta_{n+1} + r_n(\zeta_n, \delta_n))\mathbb{1}_{A_n}$.
Moreover, E n rη n`1 s " 0 and finally, almost surely on Ω p0q N , E n rη n`1η T n`1 s converges to Therefore, the assumptions of Th. 6.2 are fulfilled for the sequenceỹ n . We obtain the desired result for the sequence pm n , x n´1 q. We now show that the same result also holds for the sequence pm n , x n q. For this purpose, observe that Then, notice that } xn´x n´1 ? γn } " ? γ n } mn ? vn`ε } ď b γn ε }m n } Ñ 0 as n Ñ 8 since it is assumed that z n Ñ z ‹ (which implies in particular that m n Ñ 0). Hence, it holds that ? γ n´1 px n´xn´1 q converges a.s. to 0. We conclude by invoking Slutsky's lemma. Proof of Eq. (10). We have the subsystem: and where Q fi Cov p∇f px ‹ , ξqq. The next step is to triangularize the matrixH in order to decouple the blocks of Γ. For every k " 1, . . . , d, set νk fi´r 8 2˘a r 2 8 {4´h 8 π k with the convention that ?´1 " ı (inspecting the characteristic polynomial of H, these are the eigenvalues of H). Set M˘fi diag pν1 ,¨¨¨, νd q and R˘fi V´1 2 P M˘P T V´1 2 . Using the identities M``M´"´r 8 I d and M`M´" h 8 diag pπ 1 ,¨¨¨, π d q, it can be checked that SetΓ fi RΓR T . Denote by pΓ i,j q i,j"1,2 the blocks ofΓ. Note thatΓ 2,2 " Γ 2,2 . By left/right multiplication of Eq. (55) respectively by R and R T , we obtain pM´`θI d qΓ 1,1`Γ1,1 pM´`θI d q "´h 2 8 C .
We now consider a non-autonomous perturbation of this ODE, represented in the basis of the columns of $Q$ as $\dot y(t) = h(y(t), t)$ with $h(y, t) = \mathrm{diag}(\Lambda_-, \Lambda_+)\,y + \varepsilon(y, t)$, where $\varepsilon : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ is a continuous function. In the sequel, we are interested in the asymptotic behavior of this equation for large $t$, and we therefore restrict our study to the interval $I = [t_0, \infty)$ for some $t_0 \ge 0$ to be fixed later. We assume that $\varepsilon(0, \cdot) = 0$ on $I$. We denote by $\phi : I \times I \times \mathbb{R}^d \to \mathbb{R}^d$ the so-called general solution of (62), defined by the fact that $\phi(\cdot, t, x)$ is the unique noncontinuable solution of (62) such that $\phi(t, t, x) = x$ for $t \in I$ and $x \in \mathbb{R}^d$, assuming this solution exists and is unique for each $(x, t) \in \mathbb{R}^d \times I$.
In the linear autonomous case provided by the ODE (60), the corresponding subspace $\mathcal{G}$ is trivially invariant, in the sense that if $(t, y) \in \mathcal{G}$, then $(s, \phi(s, t, y)) \in \mathcal{G}$ for each $s \in \mathbb{R}$. This concept can be generalized to the non-linear and non-autonomous case. We say that the $C^1$ function $w : \mathbb{R}^{d_-} \times I \to \mathbb{R}^{d_+}$ defines a global non-autonomous invariant manifold for the ODE (62) if $w(0, t) = 0$ for all $t \in I$ and, furthermore, if for each $t \in I$ and each $y_- \in \mathbb{R}^{d_-}$, writing $y = (y_-, w(y_-, t))$, the general solution $\phi(s, t, y) = (\phi_-(s, t, y), \phi_+(s, t, y))$ with $\phi_\pm(s, t, y) \in \mathbb{R}^{d_\pm}$ verifies $\phi_+(s, t, y) = w(\phi_-(s, t, y), s)$ for each $s \in I$. The non-autonomous invariant manifold is the set $\mathcal{G} := \{(t, (y_-, w(y_-, t))) : t \in I,\ y_- \in \mathbb{R}^{d_-}\}$, which obviously satisfies $(t, y) \in \mathcal{G} \Rightarrow (s, \phi(s, t, y)) \in \mathcal{G}$ for each $s \in I$. These invariant manifolds are described by the following proposition, a straightforward application of [40, Th. A.1] (see also [31, Th. 6.3 p. 106, Rem. 6.6 p. 111]). It is useful to note that, under the conditions of this proposition, the existence of the general solution $\phi$ of the ODE (62) is ensured by Picard's theorem.
Proposition 7.1. Let I " rt 0 , 8q for some t 0 ě 0. Assume that the function εpy, tq is such that εp0,¨q " 0 on I, the function εp¨, tq is continuously differentiable for each t P I, and furthermore, the Jacobian matrix B 1 εpy, tq satisfies with K " K´`K``K´K`pK´_ K`q and α´, α`chosen as in Eq. (61). Then, for each δ P p2K|ε| 1 , pα`´α´q{2q and each γ P pα´`δ, α`´δq, the set Finally, if B n 2 B k 1 ε exist and are continuous for 0 ď n ă m and 0 ď k`n ď m, then w is m-times continuously differentiable.
Let us partition the function $h(y, t)$ as
\[
h(y, t) = \begin{pmatrix} h_-(y, t) \\ h_+(y, t) \end{pmatrix}, \tag{67}
\]
where $h_\pm : \mathbb{R}^d \times I \to \mathbb{R}^{d_\pm}$, and partition similarly $y = (y_-, y_+)$ with $y_\pm \in \mathbb{R}^{d_\pm}$ and $\varepsilon = (\varepsilon_-, \varepsilon_+)$ with $\varepsilon_\pm : \mathbb{R}^d \times I \to \mathbb{R}^{d_\pm}$. With these notations, the previous proposition leads to the following lemma.
Lemma 7.2. In the setting of Prop. 7.1, for each $t$ in the interior of $I$ and each vector $y = (y_-, y_+)$ such that $y_\pm \in \mathbb{R}^{d_\pm}$ and $y_+ = w(y_-, t)$, it holds that
\[
h_+(y, t) = \partial_1 w(y_-, t) \, h_-(y, t) + \partial_2 w(y_-, t).
\]
Assume that $\alpha_-$ is small enough so that Ineq. (65) and Eq. (64) hold true with $m = 2$. Assume in addition that $\partial_2^n \partial_1^k \varepsilon$ exists and is continuous for $0 \le n < 2$ and $0 \le k + n \le 2$, and furthermore, that there exists a bounded neighborhood $V \subset \mathbb{R}^d$ of zero such that $\sup_{(y,t) \in V \times I} \| \partial_2 \varepsilon(y, t) \| < +\infty$.
Proof. By Prop. 7.1, the general solution $\varphi(s, t, y)$ of the ODE (62) can be written as $\varphi(s, t, y) = (\varphi_-(s, t, y), \varphi_+(s, t, y))$ with $\varphi_+(s, t, y) = w(\varphi_-(s, t, y), s)$ for each $s \in I$. Equating the derivatives with respect to $s$ of the two sides of this equation and taking $s = t$, we get the first equation.
By Prop. 7.1, the function $w$ is twice differentiable, and writing $g(y_-, t) = (y_-, w(y_-, t))$, we can differentiate the identity of Lem. 7.2, where, e.g., $h_+$ is a shorthand notation for $h_+(g(y_-, t), t)$ in the resulting expression. It holds from Eq. (67) and the assumptions of Prop. 7.1 that for each $(y, t) \in \mathbb{R}^d \times I$,
\[
\| \partial_1 h(y, t) \| \le \| \Lambda \| + \| \partial_1 \varepsilon(y, t) \| \le C,
\]
where the constant $C > 0$ is independent of $(y, t)$ and may change from one inequality to another in the remainder of the proof. By the mean value inequality and Prop. 7.1, we also get that $\| g(y_-, t) \| \le C \| y_- \|$. Let $V_- \subset \mathbb{R}^{d_-}$ be a small enough neighborhood of zero so that $g(y_-, t) \in V$ for each $y_- \in V_-$, which is possible by the inequality $\| g(y_-, t) \| \le C \| y_- \|$. By the assumption on $\| \partial_2 \varepsilon(y, t) \|$ in the statement of Lem. 7.2, we have
\[
\forall y_- \in V_-, \quad \| \partial_2 h(g(y_-, t), t) \| = \| \partial_2 \varepsilon(g(y_-, t), t) \| \le C.
\]
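To make the identity of Lem. 7.2 concrete, consider the toy autonomous system $\dot y_- = -y_-$, $\dot y_+ = y_+ + y_-^2$ (an illustrative example of ours, with $d_- = d_+ = 1$), for which $w(y_-) = -y_-^2/3$ defines an invariant manifold and $\partial_2 w = 0$. The sketch below checks the identity $h_+ = \partial_1 w \, h_- + \partial_2 w$ on the graph of $w$, and that a trajectory started on the graph stays (numerically) on it.

```python
import numpy as np

def w(ym):                 # invariant manifold: y_+ = w(y_-) = -y_-^2 / 3
    return -ym**2 / 3.0

def dw(ym):                # partial_1 w
    return -2.0 * ym / 3.0

def h_minus(ym, yp):       # h_-(y) = -y_-
    return -ym

def h_plus(ym, yp):        # h_+(y) = y_+ + y_-^2
    return yp + ym**2

# Lemma 7.2 identity on the graph (autonomous case, so partial_2 w = 0):
ym = np.linspace(-2.0, 2.0, 101)
gap_identity = np.max(np.abs(h_plus(ym, w(ym)) - dw(ym) * h_minus(ym, w(ym))))

# Invariance: integrate from a point on the graph, measure the distance to it.
y = np.array([1.5, w(1.5)])
dt, n = 1e-4, 20000        # Euler steps up to time 2
for _ in range(n):
    y = y + dt * np.array([h_minus(y[0], y[1]), h_plus(y[0], y[1])])
gap_invariance = abs(y[1] - w(y[0]))
print(gap_identity, gap_invariance)   # both ~ 0
```

The small residual in the second check comes only from the Euler discretization error, amplified along the unstable $y_+$ direction.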
Prop. 7.1 deals with the case where the function $\varepsilon$ is globally Lipschitz continuous. In practical cases, such a strong assumption is not necessarily verified; in particular, it is not satisfied by the ODEs we consider for our application (see the function $e$ defined in Subsec. 7.3.1 below). Nonetheless, recall that we only need the existence of a local non-autonomous invariant manifold, i.e., one defined in the vicinity of an arbitrary solution such as the trivial zero solution (since we suppose here $\varepsilon(0, \cdot) = 0$), whereas the aforementioned strong assumption provides a global non-autonomous invariant manifold. Indeed, for the avoidance of traps result we intend to show, we will only need to examine the behavior of our ODE in the neighborhood of a trap $z_\star$. Therefore, in preparation for the proof of Th. 4.1, we localize the ODE (62) in the neighborhood of zero. This is the purpose of the next proposition.

Proposition 7.3. Let $I = [t_0, +\infty)$ for some $t_0 \ge 0$ and let $h : \mathbb{R}^d \times I \to \mathbb{R}^d$ be defined as in Eq. (62). Assume that $\varepsilon(0, \cdot) = 0$ on $I$, that the function $\varepsilon(\cdot, t)$ is continuously differentiable for every $t \in I$, and that
\[
\lim_{(y, t) \to (0, +\infty)} \| \partial_1 \varepsilon(y, t) \| = 0.
\]
Then, there exist $\sigma > 0$, $t_1 > 0$, a function $\tilde\varepsilon : \mathbb{R}^d \times I_1 \to \mathbb{R}^d$, where $I_1 := [t_1, +\infty)$, and a function $\tilde h : \mathbb{R}^d \times I_1 \to \mathbb{R}^d$ defined for every $y \in \mathbb{R}^d$, $t \in I_1$ by $\tilde h(y, t) = \Lambda y + \tilde\varepsilon(y, t)$, such that $\tilde h$ and $\tilde\varepsilon$ verify the assumptions of Prop. 7.1 and, for every $(y, t) \in B(0, \sigma) \times I_1$, we have $\tilde h(y, t) = h(y, t)$ and $\tilde\varepsilon(y, t) = \varepsilon(y, t)$. Moreover, for any $\delta > 0$, we can choose $\sigma$ and $t_1$ respectively small and large enough so that $|\tilde\varepsilon|_1 \le \delta$.
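A standard way to realize such a localization (a sketch of one possible construction, not necessarily the paper's exact proof device) is to multiply $\varepsilon$ by a smooth cutoff supported near the origin: $\tilde\varepsilon(y, t) = \chi(\|y\|/\sigma)\,\varepsilon(y, t)$ with $\chi = 1$ on $[0, 1]$ and $\chi = 0$ on $[2, \infty)$, so that $\tilde h = \Lambda y + \tilde\varepsilon$ coincides with $h$ on $B(0, \sigma)$. The Python sketch below illustrates this with the non-globally-Lipschitz perturbation $\varepsilon(y, t) = e^{-t} y^2$ (componentwise), an illustrative choice of ours.

```python
import numpy as np

SIGMA = 0.5

def chi(r):
    """C^1 cutoff: 1 on [0, 1], 0 on [2, inf), cubic Hermite blend in between."""
    s = np.clip(r - 1.0, 0.0, 1.0)
    return 1.0 - s * s * (3.0 - 2.0 * s)

def eps(y, t):      # illustrative perturbation, not globally Lipschitz in y
    return np.exp(-t) * y**2

def eps_loc(y, t):  # localized version: coincides with eps on the ball B(0, sigma)
    return chi(np.linalg.norm(y) / SIGMA) * eps(y, t)

y_in, y_far, t = np.array([0.2, -0.1]), np.array([3.0, 4.0]), 3.0
same_inside = np.allclose(eps_loc(y_in, t), eps(y_in, t))  # untouched inside the ball
zero_far = np.allclose(eps_loc(y_far, t), 0.0)             # vanishes far away
print(same_inside, zero_far)
```

Since the modified perturbation is compactly supported in $y$ and $C^1$, it is globally Lipschitz, and shrinking $\sigma$ while enlarging $t_1$ makes its Lipschitz constant as small as desired.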

Proof of Th. 4.1
We shall rely on the following result of Brandière and Duflo. Recall that $(\Omega, \mathcal F, \mathbb P)$ is a probability space equipped with a filtration $(\mathcal F_n)_{n \in \mathbb N}$.

Proposition 7.4 ([13, Prop. 4]). Given a sequence $(\gamma_n)$ of deterministic nonnegative step sizes such that $\sum_k \gamma_k = +\infty$ and $\sum_k \gamma_k^2 < +\infty$, consider the $\mathbb{R}^d$-valued stochastic process $(z_n)_{n \in \mathbb N}$ given by
\[
z_{n+1} = (I + \gamma_{n+1} H_n) z_n + \gamma_{n+1} \eta_{n+1} + \gamma_{n+1} \rho_{n+1}.
\]
Assume that $z_0$ is $\mathcal F_0$-measurable and that the sequences $(\eta_n)$ and $(\rho_n)$, together with the sequence of random matrices $(H_n)$, are $(\mathcal F_n)$-adapted. Moreover, on a given event $A \in \mathcal F$, assume the following facts: i) $\sum_n \| \rho_n \|^2 < \infty$; ii) $\limsup_n \mathbb E[\| \eta_{n+1} \|^{2+a} \mid \mathcal F_n] < \infty$ for some $a > 0$, and $\mathbb E[\eta_{n+1} \mid \mathcal F_n] = 0$.
Let $H \in \mathbb{R}^{d \times d}$ be a deterministic matrix whose eigenvalues all have positive real parts. Then, $\mathbb P\big(A \cap [z_n \to 0] \cap [H_n \to H]\big) = 0$.
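The mechanism behind Prop. 7.4 can be observed numerically: when the linear part is repelling, the martingale noise keeps pushing the iterates away from zero, so that $[z_n \to 0]$ has negligible probability. The sketch below is an illustrative instance of ours (scalar case, $H_n \equiv H = 1$, $\rho_n \equiv 0$, Rademacher noise, $\gamma_n = 1/n$): it runs many independent trajectories of the recursion of Prop. 7.4 and counts how many end far from the origin.

```python
import numpy as np

rng = np.random.default_rng(42)
H, N, n_runs = 1.0, 2000, 200
finals = np.empty(n_runs)
for r in range(n_runs):
    z = 0.0                                  # start exactly at the unstable point
    etas = rng.choice([-1.0, 1.0], size=N)   # centered, bounded noise: E[eta | F_n] = 0
    for n in range(1, N + 1):
        gamma = 1.0 / n                      # sum gamma_n = inf, sum gamma_n^2 < inf
        z = (1.0 + gamma * H) * z + gamma * etas[n - 1]
    finals[r] = abs(z)
frac_escaped = np.mean(finals > 1.0)
print(frac_escaped)                          # essentially all runs end far from 0
```

In this instance the repelling factor $\prod_{m=k+1}^{N}(1 + 1/m) \approx N/k$ amplifies the early noise, so $|z_N|$ is typically of order $N$.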
We now enter the proof of Th. 4.1. Recall the expansion (11) of $b(z, t)$ near $z_\star$ and the spectral factorization (12) of the matrix $D$. To begin with, it will be convenient to make the change of variable $y = Q^{-1}(z - z_\star)$ and to set
\[
h(y, t) = Q^{-1} b(Q y + z_\star, t) = \Lambda y + \tilde e(y, t), \quad \text{with} \quad \tilde e(y, t) = Q^{-1} e(Q y + z_\star, t),
\]
in such a way that our stochastic algorithm is rewritten as
\[
y_{n+1} = y_n + \gamma_{n+1} h(y_n, \tau_n) + \gamma_{n+1} \tilde\eta_{n+1} + \gamma_{n+1} \tilde\rho_{n+1},
\]
where $\tilde\eta_n$ is as in the statement of the theorem and $\tilde\rho_n = Q^{-1} \rho_n$. Observe that the assumptions on the function $e$ in the statement of the theorem remain true for $\tilde e$, with $z_\star$ replaced by zero.
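This change of variable can be checked mechanically: assuming the factorization $D = Q \Lambda Q^{-1}$ of (12), one step of the algorithm in the $z$ variable, mapped through $y = Q^{-1}(z - z_\star)$, coincides with one step of the rewritten recursion. A small numerical sketch (all matrices and the remainder function $e$ below are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
Q = rng.standard_normal((d, d)) + 3.0 * np.eye(d)  # well-conditioned basis matrix
Qinv = np.linalg.inv(Q)
Lam = np.diag([-1.0, 0.5, 2.0])                    # Lambda from the factorization (12)
D = Q @ Lam @ Qinv
z_star = rng.standard_normal(d)

def e(z, t):                                       # illustrative remainder, e(z_star, t) = 0
    return np.exp(-t) * (z - z_star) ** 2

def b(z, t):                                       # expansion (11): b = D (z - z_star) + e
    return D @ (z - z_star) + e(z, t)

def e_tilde(y, t):
    return Qinv @ e(Q @ y + z_star, t)

def h(y, t):
    return Lam @ y + e_tilde(y, t)

# one step in z, then map to y ...
z, t, gamma = rng.standard_normal(d), 2.0, 0.1
eta, rho = rng.standard_normal(d), rng.standard_normal(d)
z_next = z + gamma * b(z, t) + gamma * eta + gamma * rho
y = Qinv @ (z - z_star)
# ... versus one step of the rewritten recursion in y
y_next = y + gamma * h(y, t) + gamma * (Qinv @ eta) + gamma * (Qinv @ rho)
gap = np.max(np.abs(Qinv @ (z_next - z_star) - y_next))
print(gap)  # ~ 0
```

The identity holds exactly because $Q^{-1} D = \Lambda Q^{-1}$; the printed gap is pure floating-point round-off.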
If the matrix $\Lambda$ has only eigenvalues with (strictly) positive real parts, i.e., $d_- = 0$, then we can apply Prop. 7.4 to the sequence $(z_n)$. Henceforth, we deal with the more complicated case where $d_- > 0$.
Apply Prop. 7.3 to $h$ to obtain $\tilde h$, with $\sigma$ and $t_1$ respectively small and large enough, and $w : \mathbb{R}^{d_-} \times I_1 \to \mathbb{R}^{d_+}$, where $I_1 := [t_1, +\infty)$. By Assumption iv) of Th. 4.1 and Prop. 7.3, we can choose $\sigma_1 \le \sigma$ such that Eq. (80) and Eq. (81) hold. Now, given $p \in \mathbb N$, let us define the event
\[
E_p = \big[ \forall n \ge p, \ \| y_n \| < \sigma_1, \ \tau_n \in I_1 \big].
\]
On $E_p$, it holds that $h(y_n, \tau_n) = \tilde h(y_n, \tau_n)$ and, for all $n \ge p$,
\[
y_{n+1} = y_n + \gamma_{n+1} h(y_n, \tau_n) + \gamma_{n+1} \tilde\eta_{n+1} + \gamma_{n+1} \tilde\rho_{n+1},
\]
which reads, in partitioned form,
\[
\begin{pmatrix} y_{n+1}^- \\ y_{n+1}^+ \end{pmatrix}
= \begin{pmatrix} y_n^- \\ y_n^+ \end{pmatrix}
+ \gamma_{n+1} \begin{pmatrix} h_-(y_n, \tau_n) \\ h_+(y_n, \tau_n) \end{pmatrix}
+ \gamma_{n+1} \begin{pmatrix} \tilde\eta_{n+1}^- \\ \tilde\eta_{n+1}^+ \end{pmatrix}
+ \gamma_{n+1} \begin{pmatrix} \tilde\rho_{n+1}^- \\ \tilde\rho_{n+1}^+ \end{pmatrix},
\]
where $h$ is partitioned as in (67), and where $\tilde\eta_n^\pm, \tilde\rho_n^\pm \in \mathbb{R}^{d_\pm}$. Note that, by Prop. 7.3 and Assumptions vi) and vii) on the sequence $(\eta_n)$, we can choose $\sigma$ and $t_1$ respectively small and large enough such that
\[
\liminf_n \mathbb E\big[ \| \tilde\eta_{n+1}^+ \|^2 \mid \mathcal F_n \big] \mathbb 1_{E_p}
- 2 \limsup_n \mathbb E\big[ \| \partial_1 w(y_n^-, \tau_n) \tilde\eta_{n+1}^- \|^2 \mid \mathcal F_n \big] \mathbb 1_{E_p} > 0.
\]
This inequality will be important at the end of our proof. Let $t$ be in the interior of $I_1$, and let $y = (y_-, y_+)$ be in a neighborhood of $0$. Make the change of variable $(y_-, y_+) \mapsto (u_-, u_+)$ with $u_- = y_-$ and $u_+ = y_+ - w(y_-, t)$, where $w$ is the function defined in the statement of Prop. 7.3, and let
\[
W(u_-, u_+, t) = h_+(y, t) - \partial_1 w(y_-, t) h_-(y, t) - \partial_2 w(y_-, t)
= h_+\big((u_-, u_+ + w(u_-, t)), t\big) - \partial_1 w(u_-, t) h_-\big((u_-, u_+ + w(u_-, t)), t\big) - \partial_2 w(u_-, t).
\]
Th. 4.1 is proven.

We now turn to the proof of Lem. 4.2. The matrix $D$ coincides with $\nabla g_\infty(z_\star)$, where the function $g_\infty$ is defined in (20); as such, its expression is immediate. Recalling that $p_\infty S(x_\star) - q_\infty v_\star = 0$, we obtain the expression of $g(z, t) - D(z - z_\star)$, and under the assumptions made, it is easy to see that the function $e(z, t)$ so defined has the properties required in the statement of Th. 4.1.

Proof of Th. 4.3
Consider the matrix $D$ defined in the statement of Lem. 4.2. The following lemma analyzes the spectrum of this matrix, focusing on its eigenvalues with positive real parts.
Lemma 7.5. Let $D$ be the matrix provided in the statement of Lem. 4.2. Each eigenvalue $\zeta$ of the matrix $D$ such that $\Re\zeta > 0$ is real, and its algebraic and geometric multiplicities are equal. Moreover, there is a one-to-one correspondence $\varphi$ between these eigenvalues and the negative eigenvalues of $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$. Let $W$ be a matrix whose rows are independent eigenvectors of $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$ that generate the eigenspace associated with the negative eigenvalues, and denote by $\beta_k < 0$ the eigenvalue associated with the row $w_k$. Then, the rows of the rank-$d_+$ matrix $A_+$, whose $k$-th row is
\[
\big[\, 0_{1 \times d}, \ w_k V^{1/2}, \ -(r_\infty + \varphi^{-1}(\beta_k)) \, w_k V^{-1/2} \,\big],
\]
generate the left eigenspace of $D$ associated with its positive eigenvalues, the row $k$ being a left eigenvector for the eigenvalue $\varphi^{-1}(\beta_k)$.
Proof. It is obvious that the block lower-triangular matrix $D$ has $d$ eigenvalues equal to $-q_\infty$, the remaining $2d$ eigenvalues being those of the sub-matrix $\tilde D$. Given $\lambda \in \mathbb C$, we obtain by standard manipulations involving determinants that
\[
\det(\tilde D - \lambda) = \det\big( \lambda (r_\infty + \lambda) + h_\infty V \nabla^2 F(x_\star) \big)
= \det\big( \lambda (r_\infty + \lambda) + h_\infty V^{1/2} \nabla^2 F(x_\star) V^{1/2} \big).
\]
Denoting by $\{\beta_k\}_{k=1}^d$ the eigenvalues of $h_\infty V^{1/2} \nabla^2 F(x_\star) V^{1/2}$, counted with multiplicities, we obtain from the last equation that the eigenvalues of $\tilde D$ are the solutions of the second-order equations
\[
\lambda^2 + r_\infty \lambda + \beta_k = 0, \qquad k = 1, \dots, d.
\]
The product of the roots of such an equation is $\beta_k$, and their sum is $-r_\infty \le 0$. Thus, denoting these roots by $\zeta_{k,1}$ and $\zeta_{k,2}$, it is easy to see that if $\beta_k \ge 0$, then $\Re\zeta_{k,1}, \Re\zeta_{k,2} \le 0$, while if $\beta_k < 0$, then both $\zeta_{k,i}$ are real and exactly one of them is positive. Thus, we have shown so far that the eigenvalues of $D$ whose real parts are positive are themselves real, and that there is a one-to-one map $\varphi$ from the set of positive eigenvalues of $D$ to the set of negative eigenvalues of $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$. Moreover, the algebraic multiplicity of the eigenvalue $\zeta > 0$ of $D$ is equal to the multiplicity of $\varphi(\zeta)$.
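These spectral claims are easy to test numerically. The sketch below uses one block structure for $\tilde D$ that is consistent with the determinant identity above, namely $\tilde D = \big[\begin{smallmatrix} -r_\infty I & h_\infty \nabla^2 F(x_\star) \\ -V & 0 \end{smallmatrix}\big]$ acted on by row vectors from the left; this block form is an illustrative assumption of ours. It checks that the eigenvalues of $\tilde D$ are exactly the roots of $\lambda^2 + r_\infty \lambda + \beta_k = 0$, and that each $\beta_k < 0$ contributes exactly one eigenvalue with positive real part, which is moreover real.

```python
import numpy as np

rng = np.random.default_rng(7)
d, r_inf, h_inf = 3, 0.9, 0.5
O, _ = np.linalg.qr(rng.standard_normal((d, d)))
Hess = O @ np.diag([1.5, -0.8, -2.0]) @ O.T   # symmetric "Hessian", mixed-sign spectrum
V = np.diag(rng.uniform(0.5, 2.0, d))         # positive diagonal matrix
Vh = np.sqrt(V)                               # V^{1/2}

# Hypothetical block form of D-tilde, consistent with
# det(Dt - lambda) = det(lambda (r_inf + lambda) + h_inf V Hess):
Dt = np.block([[-r_inf * np.eye(d), h_inf * Hess],
               [-V, np.zeros((d, d))]])

betas = np.linalg.eigvalsh(h_inf * Vh @ Hess @ Vh)
roots = np.concatenate([np.roots([1.0, r_inf, b]) for b in betas])
eigs = np.linalg.eigvals(Dt)

gap = max(np.min(np.abs(eigs - r)) for r in roots)  # each root is an eigenvalue
pos = eigs[eigs.real > 1e-9]                        # eigenvalues with positive real part
print(gap, len(pos), np.sum(betas < 0))
```

By Sylvester's law of inertia, $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$ has as many negative eigenvalues as $\nabla^2 F(x_\star)$ (here two), so two positive real eigenvalues of $\tilde D$ are expected.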
Let us now turn to the left (row) eigenvectors of $D$ corresponding to these eigenvalues. To that end, we shall solve the equation $u D = \zeta u$ with $u = [0, u_1, u_2]$, $u_1, u_2 \in \mathbb{R}^{1 \times d}$, for a given eigenvalue $\zeta > 0$ of $D$. Developing this equation, we get
\[
-r_\infty u_1 - u_2 V = \zeta u_1, \qquad h_\infty u_1 \nabla^2 F(x_\star) = \zeta u_2.
\]
If we now write $\tilde u_1 = u_1 V^{-1/2}$ and $\tilde u_2 = u_2 V^{1/2}$, this system becomes
\[
\tilde u_2 = -(r_\infty + \zeta) \tilde u_1, \qquad \tilde u_1 \big( \zeta^2 + r_\infty \zeta + h_\infty V^{1/2} \nabla^2 F(x_\star) V^{1/2} \big) = 0,
\]
which shows that $\tilde u_1$ is a left eigenvector of $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$ associated with the eigenvalue $\varphi(\zeta)$. What is more, assume that $r$ is the multiplicity of $\varphi(\zeta)$ and, without loss of generality, that the submatrix $W_{r\cdot}$ made of the first $r$ rows of $W$ generates the left eigenspace of $\varphi(\zeta)$. Then, the matrix
\[
\big[\, 0_{r \times d}, \ W_{r\cdot} V^{1/2}, \ -(r_\infty + \zeta) W_{r\cdot} V^{-1/2} \,\big]
\]
is a rank-$r$ matrix whose rows are independent left eigenvectors generating the left eigenspace of $D$ for the eigenvalue $\zeta$. In particular, the algebraic and geometric multiplicities of this eigenvalue are equal. The same argument applies to the other positive eigenvalues of $D$.
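The left-eigenvector formula can also be verified numerically, again under an illustrative block structure for $\tilde D$ consistent with the two developed equations (an assumption of ours): $\tilde D = \big[\begin{smallmatrix} -r_\infty I & h_\infty \nabla^2 F(x_\star) \\ -V & 0 \end{smallmatrix}\big]$, acted on by row vectors on the left, with the convention that $\beta = h_\infty \mu$ for $\mu$ a negative eigenvalue of $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(7)
d, r_inf, h_inf = 3, 0.9, 0.5
O, _ = np.linalg.qr(rng.standard_normal((d, d)))
Hess = O @ np.diag([1.5, -0.8, -2.0]) @ O.T   # symmetric, two negative eigenvalues
V = np.diag(rng.uniform(0.5, 2.0, d))
Vh = np.sqrt(V)                               # V^{1/2}
Vmh = np.linalg.inv(Vh)                       # V^{-1/2}

Dt = np.block([[-r_inf * np.eye(d), h_inf * Hess],
               [-V, np.zeros((d, d))]])

M = Vh @ Hess @ Vh                            # symmetric: left = right eigenvectors
mus, P = np.linalg.eigh(M)
err, n_checked = 0.0, 0
for mu, u1t in zip(mus, P.T):                 # u1t plays the role of u~_1
    if mu >= 0:
        continue
    beta = h_inf * mu                         # beta < 0
    zeta = (-r_inf + np.sqrt(r_inf**2 - 4.0 * beta)) / 2.0  # positive root
    # candidate left eigenvector u = [u~_1 V^{1/2}, -(r_inf + zeta) u~_1 V^{-1/2}]
    u = np.concatenate([u1t @ Vh, -(r_inf + zeta) * (u1t @ Vmh)])
    err = max(err, np.max(np.abs(u @ Dt - zeta * u)))
    n_checked += 1
print(err, n_checked)   # err ~ 0 for both negative eigenvalues
```

The residual is pure round-off: substituting $u$ into the two developed equations makes them hold identically, by the same algebra as in the proof.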
We now have all the elements to prove Th. 4.3. Recall Eq. (14):
\[
z_{n+1} = z_n + \gamma_{n+1} b(z_n, \tau_n) + \gamma_{n+1} \eta_{n+1} + \gamma_{n+1} \rho_{n+1},
\]
where $b(z, t) = g(z, t) - c(t)$ with $g(z, t) = D(z - z_\star) + e(z, t)$, and set $\bar\rho_n = c(\tau_{n-1}) + \rho_n$. With these same notations, we check that Assumptions i)-vi) in the statement of Th. 4.1 are satisfied. The function $e(z, t)$ satisfies Assumptions i)-iv) by Lem. 4.2. We now verify that the sequence $(\bar\rho_n)$ fulfills Assumption v). First, observe that $\sum_n \| c(\tau_n) \|^2 < \infty$ under Assumption 4.3-i). Then, we control the second term $(\rho_n)$. After straightforward derivations, one can show the existence of a positive constant $C$ (depending only on $\varepsilon$ and a neighborhood $W$ of $z_\star$) such that
\[
\| \rho_{n+1} \|^2 \mathbb 1_{z_n \in W} \le C \big( \| m_n - m_{n+1} \|^2 + \| v_{n+1} - v_n \|^2 \big) \mathbb 1_{z_n \in W}. \tag{87}
\]
Using the boundedness of the sequences $(h_n)$ and $(r_n)$ together with the update rule of $m_n$ and Assumption 4.3-iii), there exists a positive constant $C'$ independent of $n$ (which may change from one inequality to another) such that
\[
\mathbb E\big[ \| m_n - m_{n+1} \|^2 \mathbb 1_{z_n \in W} \big] \le C' \gamma_{n+1}^2. \tag{88}
\]
A similar bound holds for $\mathbb E[ \| v_n - v_{n+1} \|^2 \mathbb 1_{z_n \in W} ]$ by the same arguments. In view of Eqs. (87)-(88) and the assumption $\sum_n \gamma_{n+1}^2 < +\infty$, it holds that $\mathbb E[ \sum_n \| \rho_{n+1} \|^2 \mathbb 1_{z_n \in W} ] < +\infty$. Therefore, $\sum_n \| \rho_{n+1} \|^2 \mathbb 1_{z_n \in W} < +\infty$ a.s., which completes our verification of condition v) of Th. 4.1. For the S-NAG version of the theorem, the same arguments apply: set $c(t) = 0$ and, in Lem. 7.5, replace the matrix $V^{1/2} \nabla^2 F(x_\star) V^{1/2}$ by the Hessian $\nabla^2 F(x_\star)$.
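As an illustration of the avoidance-of-traps conclusion, the sketch below simulates a stochastic heavy-ball-type recursion (a simplified variant of ours, not the exact scheme analyzed in the paper) on $F(x, y) = (x^2 - y^2)/2$, whose only critical point is the saddle at the origin. Started at the saddle, the iterates leave it along the unstable direction in essentially every run.

```python
import numpy as np

def grad_F(z):                     # F(x, y) = (x^2 - y^2) / 2, saddle at the origin
    return np.array([z[0], -z[1]])

rng = np.random.default_rng(3)
n_runs, N, a = 30, 3000, 1.0       # a: friction coefficient (illustrative)
escaped = 0
for _ in range(n_runs):
    z = np.zeros(2)                # start exactly at the saddle
    v = np.zeros(2)                # momentum variable
    for n in range(1, N + 1):
        gamma = (n + 10.0) ** -0.7          # sum gamma = inf, sum gamma^2 < inf
        xi = rng.standard_normal(2)          # gradient noise
        v = (1.0 - a * gamma) * v - gamma * (grad_F(z) + xi)
        z = z + gamma * v
    escaped += int(abs(z[1]) > 1.0)  # left along the unstable y direction
print(escaped, "/", n_runs)
```

The stable $x$ coordinate stays of the order of the noise, while the accumulated noise in the unstable $y$ coordinate is exponentially amplified, which is exactly the repulsion mechanism exploited in the proof.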