Limits of random walks with distributionally robust transition probabilities

We consider a nonlinear random walk which, in each time step, is free to choose its own transition probability within a neighborhood (w.r.t. Wasserstein distance) of the transition probability of a fixed Lévy process. In analogy to the classical framework we show that, when passing from discrete to continuous time via a scaling limit, this nonlinear random walk gives rise to a nonlinear semigroup. We explicitly compute the generator of this semigroup and the corresponding PDE as a perturbation of the generator of the initial Lévy process.


Introduction and main results
Lévy processes are mathematically tractable and therefore often used to model certain real-world phenomena. This brings with it the task of correctly specifying or estimating the corresponding parameters, e.g., drift and variance in the case of a Brownian motion. In many situations this can only be achieved up to a certain degree of uncertainty. For this reason, Peng [16] introduced his nonlinear Brownian motion and started a systematic investigation of this object. The nonlinear Brownian motion is defined via a nonlinear PDE and, heuristically speaking, within each infinitesimal time increment it is allowed to select its parameters (drift and variance) within a given fixed set. Accordingly, a nonlinear Feynman-Kac formula makes it possible to compute the worst-case expectations of certain functions of the random process. Several works followed this parametric nonlinearization approach to Lévy processes, see e.g. Hu and Peng [13], Neufeld and Nutz [15], Denk et al. [8] and Kühn [14].
In discrete time, on the other hand, no mathematical limitations force one to restrict to parametric uncertainty, and a more natural and general nonlinearization of a given (baseline) random walk is of nonparametric nature. We start with a random walk which is the discrete-time restriction of an R^d-valued Lévy process starting in zero, whose marginal laws we denote by (µ_t)_{t≥0}. For instance, µ_t can be the normal distribution with mean 0 and variance t, in which case we end up with a Gaussian random walk.
For a fixed parameter δ ≥ 0 representing the level of freedom (or uncertainty) and n ∈ N, the nonlinear random walk with time index T = {0 = t_0 < t_1 < t_2 < · · · } ⊂ R_+ is defined as follows: in each time step from t_n to t_{n+1}, the nonlinear random walk is allowed to select its transition probability within the neighborhood of size δ(t_{n+1} − t_n) of the transition probability µ_{t_{n+1}−t_n} of our baseline random walk, where the neighborhood is taken w.r.t. the p-th Wasserstein distance W_p. This means that, conditioned on the event that the nonlinear random walk takes the value x ∈ R^d at time t_n, the worst possible expected value of an arbitrary function f ∈ C_0(R^d) at time t_{n+1} is given by

S(t_{n+1} − t_n)f(x) := sup { ∫_{R^d} f(x + y) ν(dy) : ν ∈ P_p(R^d), W_p(ν, µ_{t_{n+1}−t_n}) ≤ δ(t_{n+1} − t_n) }.

Recall here that C_0(R^d) is the set of continuous functions vanishing at infinity. Iterating this scheme, conditioned on the event that the nonlinear random walk starts in x at time 0, the worst possible expectation at time t_n ∈ T is given by

S_T(t_n)f(x) := S(t_1 − t_0) ∘ S(t_2 − t_1) ∘ · · · ∘ S(t_n − t_{n−1})f(x).

In conclusion, the corresponding processes follow the same heuristics as the nonlinear Brownian motion and can be seen as a discrete-time nonparametric reincarnation thereof.
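To get a feeling for the one-step operator, the following is a small numerical sketch of ours (not from the paper): for an empirical approximation of the baseline measure, the worst-case expectation over the Wasserstein ball is lower-bounded by optimizing over perturbations of the sample points. The function name `one_step_worst_case` and all parameters are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative sketch (ours, not from the paper): approximate the one-step
# worst-case expectation
#     S(s)f(x) = sup { E_nu[f(x + .)] : W_p(nu, mu_s) <= delta*s }
# for an empirical baseline mu_s ~ (1/N) sum_i delta_{y_i}. Restricting nu
# to perturbed empirical measures nu = (1/N) sum_i delta_{y_i + z_i} yields
# a lower bound; the Wasserstein constraint then reads
#     ((1/N) sum_i |z_i|^p)^(1/p) <= delta*s.

def one_step_worst_case(f, x, samples, radius, p=2):
    """Maximize the expectation of f over particle shifts z within the ball."""
    def neg_objective(z):
        return -np.mean(f(x + samples + z))
    budget = {"type": "ineq",
              "fun": lambda z: radius - np.mean(np.abs(z) ** p) ** (1.0 / p)}
    res = minimize(neg_objective, np.zeros(samples.shape[0]),
                   constraints=[budget])
    return -res.fun

rng = np.random.default_rng(0)
s, delta = 0.25, 1.0
samples = np.sqrt(s) * rng.standard_normal(60)  # mu_s = N(0, s), p = 2
f = lambda y: np.exp(-y ** 2)                   # a function in C_0(R)
plain = np.mean(f(0.0 + samples))               # value for delta = 0
robust = one_step_worst_case(f, 0.0, samples, radius=delta * s)
# By construction the worst case dominates the baseline expectation.
```

Already in this toy example one sees the structure of the control problem: the optimizer spends its transport budget where f has the steepest ascent.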
Regarding the computation of S_T, we encounter a recurring theme in discrete time: while definitions are mathematically simple, explicit computations are often very challenging. Here this is evident, as S and therefore S_T are the results of (iterated, nonparametric, and infinite-dimensional) control problems. In the following, we shall show that, when passing from small to infinitesimal time steps, the S_T's give rise to a nonlinear semigroup and that a computation of the limit is possible via a nonlinear PDE.
For the rest of this article we shall fix p ∈ (1, ∞) and assume that our initial Lévy process has finite p-th moment, i.e., ∫_{R^d} |x|^p µ_1(dx) < ∞. For convenience, for every n ∈ N consider the dyadic numbers T_n := 2^{−n} N_0 and set S_n(t) := S_{T_n}(t_n) ∘ S(t − t_n) for t ≥ 0, where t_n ∈ T_n is the closest dyadic number prior to t.

Proposition 1.1 (Semigroup). Both S_n and S := lim_{n→∞} S_n are well defined and the family (S (t))_{t≥0} defines a sublinear semigroup on C_0(R^d). More precisely, for every s, t ≥ 0 and x ∈ R^d, we have that
(i) S (t) maps C_0(R^d) to itself and S (t) ∘ S (s) = S (t + s),
(ii) S (t)(·)(x) : C_0(R^d) → R is sublinear, increasing, and maps zero to zero, and
(iii) S (·)f is strongly continuous, i.e., lim_{s→t} ‖S (s)f − S (t)f‖_∞ = 0 for every f ∈ C_0(R^d).

Now that the semigroup property is established, we can state our main result.
Theorem 1.2. For every f ∈ C_0(R^d), the function u(t, x) := S (t)f(x) is a viscosity solution of the equation

∂_t u(t, x) = A^µ u(t, x) + δ |∇u(t, x)|, (t, x) ∈ (0, ∞) × R^d,

with u(0, ·) = f, where A^µ is the generator of the initial Lévy process. Here ∇ denotes the spatial derivative. Moreover, the notion of viscosity solution we consider here is that of [8], and we refer to the discussion before and after Theorem 2.12 below for the definition and comments on uniqueness.
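To see where the absolute-value perturbation comes from, the following short computation (a sketch of ours, using the hard-constraint penalty of the sublinear setting) identifies it as a convex conjugate:

```latex
% Sublinear case: the penalty corresponding to the hard constraint
% W_p(\nu,\mu_t) \le \delta t is
%   \varphi(x) = 0 for 0 \le x \le \delta, \qquad \varphi(x) = +\infty else,
% whence its convex conjugate is
\varphi^*(y) = \sup_{x \ge 0}\bigl(xy - \varphi(x)\bigr)
             = \sup_{0 \le x \le \delta} xy
             = \delta y \qquad (y \ge 0),
% so the perturbation \varphi^*(|\nabla u|) appearing in the convex
% setting of Theorem 2.12 reduces to \delta\,|\nabla u|.
```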
In the following section we consider a convex generalization of the above setting: in the definition of S(s), instead of considering all ν in the δs-neighborhood of µ_s, we take into account all ν but penalize them by their distance to µ_s. In the limit this gives a convex semigroup whose generator includes a convex perturbation in ∇u (instead of the absolute value), see Theorem 2.12.

Remark 1.3. Assume that the initial Lévy process is the Brownian motion with drift γ ∈ R^d and covariance matrix Σ ∈ R^{d×d}, and set Γ := {η ∈ R^d : |η − γ| ≤ δ}. Then a quick computation shows that the PDE of Theorem 1.2 takes the form

∂_t u(t, x) = max_{η∈Γ} η∇u(t, x) + (1/2) tr(Σ D²u(t, x)),

and the resulting process is called g-Brownian motion. This example illustrates the concept that S corresponds to a nonlinear Lévy process with drift uncertainty.
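The drift-uncertainty heuristics of Remark 1.3 can be probed numerically. The following is a rough sketch of our own construction (not from the paper) in one dimension: per step, the transition law may only be shifted by a constant, which is the drift-uncertainty subcase, and dynamic programming on a grid approximates the iterated scheme from below. All names and discretization choices are ours.

```python
import numpy as np

# Rough sketch (our construction): 1d Brownian baseline; in each step the
# transition law may only be shifted by a constant c with |c| <= delta*dt,
# i.e. nu = mu_dt * delta_c, for which W_p(mu_dt, nu) = |c| <= delta*dt.
# Backward dynamic programming on a spatial grid approximates S_n(t)f(0).

def robust_walk_value(f, t=1.0, n_steps=8, delta=1.0, n_mc=2000, seed=0):
    grid = np.linspace(-6.0, 6.0, 121)
    dt = t / n_steps
    z = np.sqrt(dt) * np.random.default_rng(seed).standard_normal(n_mc)
    shifts = np.linspace(-delta * dt, delta * dt, 5)  # candidate drifts
    v = f(grid)
    for _ in range(n_steps):
        # one backward step: sup over admissible mean shifts c
        v = np.max([np.interp(grid[:, None] + c + z[None, :], grid, v)
                    .mean(axis=1) for c in shifts], axis=0)
    return np.interp(0.0, grid, v)

f = lambda x: np.exp(-x ** 2)
v_robust = robust_walk_value(f, delta=1.0)
v_plain = robust_walk_value(f, delta=0.0)
# f peaks at the origin, so the adversarial drift pushes mass toward the
# peak and v_robust dominates v_plain.
```

Since the zero shift is always admissible, the robust value dominates the classical heat-semigroup value by construction, in line with the comparison S^µ(t)f ≤ S (t)f used repeatedly below.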
Finally, let us point out that numerical computation of nonlinear PDEs like the ones resulting from Theorem 1.2 and Remark 1.3 has received a lot of attention in recent years and by now efficient methods are available, see, e.g., [4,17] and references therein.
Possible extensions and related literature. There are several natural variations of the results in this paper. For instance, one can ask which effect additional constraints on the measures ν appearing in the definition of S(t) might have. Concretely, what would happen if one allows only for those ν which (in addition to being in a Wasserstein neighborhood of µ_t) have the same mean as µ_t, or if one replaces the Wasserstein distance by its martingale version [5]. In the latter case, when changing the scaling of the radius from δt to δt², one could guess the resulting PDE in case the underlying Lévy process is the Brownian motion. However, with the exact methods of this paper, this can be made rigorous only with an (unnatural) technical twist in the definition of S(t), and understanding the full picture would require considerations beyond the scope of this paper. In a similar spirit, it would be interesting to start with transition probabilities (µ^n_t)_t which approximate (µ_t)_t (e.g., a binomial random walk which converges to a Brownian motion). A (parametric) variant of this was done by Dolinsky, Nutz, and Soner [9] for binomial random walks with freedom in the Bernoulli parameter. Relatedly, one could ask whether Donsker-type results hold, i.e., whether the family of laws of the nonlinear random walks (on the path space) has a limiting family.
Finally, let us highlight the connection to distributionally robust optimization (DRO) using the Wasserstein distance. In DRO, the basic task consists of computing inf_λ S(s)f_λ, where (f_λ)_λ is a parametrized family of functions; we refer to [3,6,11,12] for recent results and applications. Here duality arguments often help to compute the (infinite-dimensional) optimization problem appearing in the definition of S(s). In multi-step versions of DRO (e.g., time-consistent utility maximization with Markovian endowment [1]), the computation of S_n(t)f is the key element. Related multi-step versions also occur in the literature on robust Markov chains with interval probabilities (see [10,19] and references therein), and in particular on robust Markov decision processes [20,21]. As S (t)f can be seen as a proxy for S_n(t)f for large n, a natural question is whether the results in the current paper can be used as an approximation tool for these multi-step versions of DRO. This also motivates studying the speed of convergence S_n(t)f → S (t)f.
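The duality arguments alluded to above can be sketched concretely. The following is our own illustration of the well-known dual formulation from the Wasserstein-DRO literature for a discrete baseline measure; the function name and grid approximation of the inner supremum are our choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch (ours) of the standard Wasserstein-DRO duality for a discrete
# baseline mu = (1/N) sum_i delta_{y_i}:
#   sup { E_nu[f] : W_p(nu, mu) <= rho }
#     = inf_{lam >= 0} [ lam*rho^p
#                        + (1/N) sum_i sup_z ( f(z) - lam*|z - y_i|^p ) ].
# The inner supremum is approximated on a grid.

def wasserstein_dro_dual(f, y, rho, p=2):
    zgrid = np.linspace(-8.0, 8.0, 801)
    fz = f(zgrid)
    def dual(lam):
        inner = fz[None, :] - lam * np.abs(zgrid[None, :] - y[:, None]) ** p
        return lam * rho ** p + inner.max(axis=1).mean()
    res = minimize_scalar(dual, bounds=(0.0, 50.0), method="bounded")
    return dual(res.x)

rng = np.random.default_rng(1)
y = rng.standard_normal(100)          # empirical baseline measure
f = lambda z: np.exp(-z ** 2)
plain = f(y).mean()
robust = wasserstein_dro_dual(f, y, rho=0.5)
# Weak duality: the dual value upper-bounds E_nu[f] for every feasible nu,
# in particular the baseline expectation (nu = mu is feasible).
```

This one-dimensional scalar dual replaces the infinite-dimensional optimization over measures, which is exactly why duality is the workhorse of single-step DRO computations.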
For further reference, we provide the proof of the following simple observation.

Lemma 2.1. For every r > 0, it holds that lim_{t↓0} ϕ_t(r) = ∞.
Proof. By assumption, x ↦ ϕ(max{x, 0}^{1/p}) is a convex lower semicontinuous function which is not constant equal to zero. Therefore, by the Fenchel-Moreau theorem, there exist a > 0 and b ∈ R such that ϕ(x^{1/p}) ≥ ax + b for all x ≥ 0. Thus, for every given r > 0, we conclude that

ϕ_t(r) = tϕ(r/t) ≥ ta(r/t)^p + tb = a r^p t^{1−p} + tb.

As p > 1, this term converges to infinity when t converges to zero.
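As a quick sanity check (ours, not the paper's), the blow-up of ϕ_t can be observed numerically for the concrete choice ϕ(x) = x² with p = 2, for which x ↦ ϕ(x^{1/2}) = x is convex and not constant zero:

```python
# Numerical check of Lemma 2.1 for phi(x) = x**2 and p = 2:
# phi_t(r) = t * phi(r / t) = r**2 / t blows up as t -> 0.

def phi(x):
    return x ** 2

def phi_t(t, r):
    return t * phi(r / t)

values = [phi_t(2.0 ** -k, 1.0) for k in range(1, 6)]
# values doubles each time t is halved: [2.0, 4.0, 8.0, 16.0, 32.0]
```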
Directly from the definition, the operator S(t) has the following properties.

Lemma 2.3. For every t ≥ 0, the operator S(t) is a convex monotone contraction on C_0(R^d) with S(t)0 = 0. Moreover, for every f ∈ C_0(R^d), the function S(t)f has the same modulus of continuity as f.
Proof. It is clear by definition that S(t) is convex and monotone. Moreover, as inf ϕ_t = 0, it follows that S(t)0 = 0. To show that S(t) is a contraction, note that

S(t)f(x) − S(t)g(x) ≤ sup_ν ∫_{R^d} (f − g)(x + y) ν(dy) ≤ ‖f − g‖_∞

for all x ∈ R^d, and changing the role of f and g yields contractivity.
It remains to prove that S(t)f ∈ C_0(R^d). First, since f is in particular uniformly continuous, we show that S(t)f is also uniformly continuous. To that end, let ε > 0 be arbitrary and fix δ > 0 such that |f(x) − f(y)| ≤ ε for all x, y ∈ R^d with |x − y| ≤ δ. Then, for every such pair x, y, contractivity of S(t) together with the translation structure S(t)f(x) = S(t)(f(x − y + ·))(y) implies that

S(t)f(x) − S(t)f(y) ≤ ‖f(x − y + ·) − f‖_∞ ≤ ε.

Replacing the role of x and y shows that S(t)f is uniformly continuous with the same modulus of continuity as f.
Second, we prove that S(t)f is vanishing at infinity. Let ε > 0 be arbitrary; one then finds a, b > 0 such that S(t)f(x) ≤ ε + ε‖f‖_∞ for all x ∈ R^d with |x| ≥ a + b. For the reverse inequality, use that S(t)f(x) ≥ ∫_{R^d} f(x + y) µ_t(dy) for all x ∈ R^d, which follows from ϕ(0) = 0. Therefore, the same arguments as above show that S(t)f(x) ≥ −ε − ε‖f‖_∞ for all x ∈ R^d such that |x| ≥ a + b. As ε was arbitrary, the claim follows.
At this point we know that S(t) maps C_0(R^d) to itself, which allows us to define S(t) ∘ S(s), or more generally S_n as in (1.1). The following is the key result for our analysis, and allows in particular to define the limit lim_{n→∞} S_n.

Lemma 2.4. For every 0 < s < t and f ∈ C_0(R^d), we have that S(s)(S(t − s)f) ≤ S(t)f.

Proof. Fix x ∈ R^d and ε > 0. Let ν_s(db) ∈ P_p(R^d) be such that

S(s)(S(t − s)f)(x) ≤ ∫ S(t − s)f(x + b) ν_s(db) − ϕ_s(W_p(µ_s, ν_s)) + ε,

and let γ_s(da, db) ∈ P_p(R^d × R^d) be an optimal coupling between µ_s(da) and ν_s(db). Similarly, for each b ∈ R^d, let ν^b_{t−s}(de) ∈ P_p(R^d) be such that

S(t − s)f(x + b) ≤ ∫ f(x + b + e) ν^b_{t−s}(de) − ϕ_{t−s}(W_p(µ_{t−s}, ν^b_{t−s})) + ε,

and let γ^b_{t−s}(dc, de) ∈ P_p(R^d × R^d) be an optimal coupling between µ_{t−s}(dc) and ν^b_{t−s}(de). Now define the measure γ_t(dy, dz) ∈ P_p(R^d × R^d) as the image of γ_s(da, db) γ^b_{t−s}(dc, de) under the mapping (a, b, c, e) ↦ (a + c, b + e) (the mapping b ↦ ν^b_{t−s} needs to be γ_s-measurable for this expression to make sense, but this can be shown by usual measurable selection arguments). Denoting by ν_t(dz) := γ_t(R^d, dz) the second marginal of γ_t, it holds

S(s)(S(t − s)f)(x) ≤ ∫ f(x + z) ν_t(dz) − ϕ_s(W_p(µ_s, ν_s)) − ∫ ϕ_{t−s}(W_p(µ_{t−s}, ν^b_{t−s})) ν_s(db) + 2ε.

Further, γ_t(dy, dz) is a coupling between µ_t(dy) and ν_t(dz). Indeed, by definition γ_t(R^d, dz) = ν_t(dz), and γ_t(dy, R^d) = µ_t(dy) as

γ_t(B × R^d) = ∫∫ 1_B(a + c) µ_{t−s}(dc) µ_s(da) = (µ_s * µ_{t−s})(B) = µ_t(B)

for every Borel set B ⊂ R^d. Moreover, by definition of the p-th Wasserstein distance, it holds

W_p(µ_t, ν_t) ≤ (∫ |y − z|^p γ_t(dy, dz))^{1/p} ≤ (∫ |a − b|^p γ_s(da, db))^{1/p} + (∫∫ |c − e|^p γ^b_{t−s}(dc, de) ν_s(db))^{1/p},

where the second inequality follows from Minkowski's inequality. Denote by I the first term on the right-hand side and by J the second one. By definition of ϕ_t = tϕ(·/t), together with the fact that ϕ is convex and increasing, it holds

ϕ_t(W_p(µ_t, ν_t)) ≤ ϕ_t(I + J) ≤ ϕ_s(I) + ϕ_{t−s}(J).

Moreover, convexity of x ↦ ϕ(x^{1/p}) implies convexity of x ↦ ϕ_{t−s}(x^{1/p}). Therefore, by Jensen's inequality, we obtain

ϕ_{t−s}(J) ≤ ∫ ϕ_{t−s}((∫ |c − e|^p γ^b_{t−s}(dc, de))^{1/p}) ν_s(db).

Recalling the definitions of I and J and that γ_s and γ^b_{t−s} were optimal couplings, we conclude

ϕ_t(W_p(µ_t, ν_t)) ≤ ϕ_s(W_p(µ_s, ν_s)) + ∫ ϕ_{t−s}(W_p(µ_{t−s}, ν^b_{t−s})) ν_s(db).

Putting everything together, we obtain

S(t)f(x) ≥ ∫ f(x + z) ν_t(dz) − ϕ_t(W_p(µ_t, ν_t)) ≥ S(s)(S(t − s)f)(x) − 2ε.

As ε > 0 was arbitrary, this completes the proof.
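For the reader's convenience, the convexity step relating ϕ_t(I + J) to ϕ_s(I) and ϕ_{t−s}(J) in the proof above can be spelled out as follows (our rendering of the standard estimate):

```latex
% Since \varphi is convex and \varphi_t = t\,\varphi(\cdot/t), writing
% (I+J)/t as a convex combination with weights s/t and (t-s)/t gives
\varphi_t(I + J)
  = t\,\varphi\Bigl(\frac{s}{t}\cdot\frac{I}{s}
      + \frac{t-s}{t}\cdot\frac{J}{t-s}\Bigr)
  \le s\,\varphi\Bigl(\frac{I}{s}\Bigr)
      + (t-s)\,\varphi\Bigl(\frac{J}{t-s}\Bigr)
  = \varphi_s(I) + \varphi_{t-s}(J).
```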
Lemma 2.5. For all f ∈ C_0(R^d), t ≥ 0 and n ∈ N, we have that S_{n+1}(t)f ≤ S_n(t)f. Further, S_n(t) is a contraction on C_0(R^d), and S_n(t)f has the same modulus of continuity as f.

Proof. Both statements follow from Lemma 2.4 (respectively Lemma 2.3) together with an induction.
Corollary 2.6. Let t ≥ 0 and f, g ∈ C 0 (R d ) such that f ≤ g. Then, the pointwise limit S (t)f := lim n→∞ S n (t)f exists and is in fact uniform. Moreover, S (t) is a convex contraction on C 0 (R d ) such that S (t)0 = 0 and S (t)f ≤ S (t)g.
Proof. By Lemma 2.5, the sequence (S n (t)f ) n∈N is decreasing, hence the limit S (t)f = lim n→∞ S n (t)f exists pointwise. Also, the limit S (t)f is vanishing at infinity. Indeed, from the semigroup property of (S µ (t)) t≥0 , it follows that S µ (t)f ≤ S n (t)f ≤ S(t)f for all n ∈ N, and therefore S µ (t)f ≤ S (t)f ≤ S(t)f . Since by Lemma 2.3, S(t)f is vanishing at infinity, and (S µ (t)) t≥0 is a Feller semigroup, we conclude that S (t)f is vanishing at infinity. Further, by Lemma 2.5, the sequence S n (t)f is uniformly equicontinuous on every compact subset of R d , which by the Arzelà-Ascoli theorem and the fact that S (t)f is vanishing at infinity, implies that lim n→∞ S n (t)f − S (t)f ∞ = 0. Finally, by induction over n ∈ N, it follows from Lemma 2.3 that S n (t) is a convex contraction on C 0 (R d ), which satisfies S n (t)0 = 0 and S n (t)f ≤ S n (t)g. These properties remain true for the limit S (t). The proof is complete.
In the following, we shall often use that t ↦ µ_t is continuous w.r.t. W_p (at t = 0). To see that this is true, use the assumption E[|X_1|^p] < ∞ and [18, Theorem 25.18] to obtain E[sup_{t∈[0,1]} |X_t|^p] < ∞. As X has càdlàg paths, dominated convergence then yields W_p(µ_t, δ_0)^p = E[|X_t|^p] → 0 as t ↓ 0.

The next result states the strong continuity of the family (S (t))_{t≥0} at zero.

Lemma 2.7. For every f ∈ C_0(R^d), it holds that lim_{t↓0} ‖S (t)f − f‖_∞ = 0.
Proof. Let f ∈ C_0(R^d) and ε > 0. We first show an upper bound, namely that there is t_0 > 0 such that S (t)f ≤ f + 2ε for all t < t_0. As functions in C_0(R^d) are uniformly continuous, there is r > 0 such that |f(x + y) − f(x)| ≤ ε for all x ∈ R^d and all |y| ≤ r. Since by assumption lim_{t↓0} ∫_{R^d} |y|^p µ_t(dy) = 0, it follows from Lemma 2.1 that S (t)f ≤ f + 2ε for all t small enough, which yields the upper bound.
As for the lower bound, similarly as in the proof of Lemma 2.3, we make use of the fact that S^µ(t)f ≤ S (t)f. Since (S^µ(t))_{t≥0} is a Feller semigroup, it holds f − ε ≤ S^µ(t)f ≤ S (t)f for all t < t_0 for a suitable t_0 > 0. This completes the proof.
For later reference, let us point out that the exact same proof as given for Lemma 2.7 yields the following.
Corollary 2.8. It holds lim_{n→∞} ‖S(t_n)f_n − f‖_∞ = 0 for all f, f_n ∈ C_0(R^d) with lim_{n→∞} ‖f_n − f‖_∞ = 0 and all sequences (t_n)_{n∈N} with lim_{n→∞} t_n = 0.

Lemma 2.9. Let t, t_n ≥ 0 with t_n ≤ t and g, g_n ∈ C_0(R^d) for all n ∈ N. If lim_{n→∞} t_n = t and lim_{n→∞} ‖g_n − g‖_∞ = 0, then lim_{n→∞} ‖S_n(t_n)g_n − S (t)g‖_∞ = 0.
Proof. As S_n(t) = S_n(t_n) ∘ S(t − t_n) by definition, the triangle inequality implies that

‖S_n(t_n)g_n − S (t)g‖_∞ ≤ ‖S_n(t)g − S (t)g‖_∞ + ‖S_n(t_n)g − S_n(t)g‖_∞ + ‖S_n(t_n)g_n − S_n(t_n)g‖_∞.

By Corollary 2.6 the first term converges to zero as n → ∞. As for the middle term, by Lemma 2.5 we have that S_n(t_n) is a contraction, so that

‖S_n(t_n)g − S_n(t)g‖_∞ = ‖S_n(t_n)g − S_n(t_n)(S(t − t_n)g)‖_∞ ≤ ‖g − S(t − t_n)g‖_∞.

The latter converges to zero by Corollary 2.8. Again by Lemma 2.5, the last term converges to zero as n → ∞. This completes the proof.

Now, we are ready to state our first main result (the convex generalization of Proposition 1.1).
Proposition 2.10. The family (S (t))_{t≥0} is a strongly continuous, convex, monotone and normalized contraction semigroup on C_0(R^d), i.e., for every s, t ≥ 0 and f ∈ C_0(R^d), we have that
(i) S (t) : C_0(R^d) → C_0(R^d) is a convex and monotone contraction such that S (t)0 = 0,
(ii) S (t) ∘ S (s) = S (t + s), and
(iii) lim_{s→t} ‖S (s)f − S (t)f‖_∞ = 0.

Proof. In view of Corollary 2.6 and Lemma 2.7, it remains to prove the semigroup property S (t) ∘ S (s) = S (t + s). To that end, fix some f ∈ C_0(R^d) and s, t ≥ 0, and denote by s_n, t_n ∈ 2^{−n}N_0 the closest dyadic elements prior to s and t, respectively. By Lemma 2.9 (applied with g = g_n = f) we have

S (t + s)f = lim_{n→∞} S_n(t_n + s_n)f = lim_{n→∞} S_n(t_n)(S_n(s_n)f),

where the last equality follows by definition of S_n. Further, Lemma 2.9 also implies that S_n(s_n)f converges uniformly to S (s)f. Therefore, we may apply Lemma 2.9 again (with g = S (s)f, and g_n = S_n(s_n)f) and obtain

lim_{n→∞} S_n(t_n)(S_n(s_n)f) = S (t)(S (s)f).

This completes the proof.
Proposition 2.11. For every f ∈ D(A^µ) ∩ C^1_0(R^d), we have that

lim_{t↓0} (S (t)f − f)/t = A^µ f + ϕ*(|∇f|),

and the limit is uniform.

Proof. (a) We first show that

S (t)f(x) ≥ f(x) + tA^µ f(x) + tϕ*(|∇f(x)|) + o(t) (2.4)

uniformly over x ∈ R^d. To that end, let t > 0. For notational simplicity we assume that t is a dyadic number, say t = k_0 2^{−n_0} for some k_0, n_0 ∈ N; the general case (is only notationally heavier but) works analogously. Then, S_{n_0}(t) is just the composition of S(2^{−n_0}) with itself k_0 times. For every x ∈ R^d, let r = r_x ∈ R^d with |r| = 1 and a = a_x ≥ 0 be such that

r∇f(x) = |∇f(x)| and ϕ*(|∇f(x)|) = a|∇f(x)| − ϕ(a), (2.5)

where the product between elements in R^d is understood as the scalar product. Note that such r exists as | · | is its own dual norm, and such a exists as lim_{y→∞} ϕ(y)/y = ∞, which follows from the assumption that y ↦ ϕ(y^{1/p}) is convex. Moreover, since |∇f(x)| is uniformly bounded over x ∈ R^d, the same holds for a = a_x. Now, for each n ≥ n_0, set ν_{2^{−n}} := µ_{2^{−n}} * δ_{a2^{−n}r}. Then, one can compute that W_p(µ_{2^{−n}}, ν_{2^{−n}}) = a2^{−n}, and therefore

S(2^{−n})g(y) ≥ ∫ g(y + z) ν_{2^{−n}}(dz) − ϕ_{2^{−n}}(a2^{−n}) (2.6)

for all g ∈ C_0(R^d) and y ∈ R^d. Note that t = k_n 2^{−n} for k_n := k_0 2^{n−n_0}, and that the measure which results from taking the convolution of ν_{2^{−n}} with itself k_n times is equal to µ_t * δ_{atr}. As further k_n ϕ_{2^{−n}}(a2^{−n}) = k_0 2^{n−n_0} 2^{−n} ϕ(a) = tϕ(a) = ϕ_t(at), and each ϕ_{2^{−n}}(a2^{−n}) does not depend on the state variable, estimating every S(2^{−n}) which appears in the definition of S_n(t) (as the composition of S(2^{−n}) with itself k_n times) by (2.6) gives

S_n(t)f(x) ≥ ∫ f(x + y + atr) µ_t(dy) − tϕ(a)

for all n ≥ n_0. The right-hand side does not depend on n, so that the definition of S (t)f as the limit of S_n(t)f therefore implies that

S (t)f(x) − f(x) ≥ (∫ f(x + y) µ_t(dy) − f(x)) + ∫ (f(x + y + atr) − f(x + y)) µ_t(dy) − tϕ(a) =: I_1 + I_2 − tϕ(a).

By definition of the infinitesimal generator A^µ of (S^µ(t))_{t≥0} and f ∈ D(A^µ), the first term I_1 equals tA^µ f + o(t) (uniformly over x ∈ R^d). The second term I_2 is estimated by a Taylor expansion: for some (measurable) ξ = ξ(x, y) with |ξ| ≤ ta, we may write

I_2 = ∫ atr∇f(x + y + ξ) µ_t(dy) = ta|∇f(x)| + o(t)

uniformly over x ∈ R^d, where we need to justify the last step.
Indeed, this follows as in the proof of Lemma 2.7 by splitting the µ_t(dy) integral into two parts (close to zero, {|y| ≤ b}, and its complement {|y| > b}), and using uniform continuity of ar∇f(x + ·) together with the fact that µ_t({|y| > b}) → 0 and lim_{t↓0} sup_{x,y∈R^d} |ξ(x, y)| = 0, as a = a_x is bounded uniformly over x ∈ R^d. Recalling (2.5), we conclude that

S (t)f(x) − f(x) ≥ tA^µ f(x) + t(a|∇f(x)| − ϕ(a)) + o(t) = tA^µ f(x) + tϕ*(|∇f(x)|) + o(t)

uniformly over x ∈ R^d, which shows (2.4).
(b) It remains to show that

S(t)f(x) = sup { ∫ f(x + z) ν(dz) − ϕ_t(u) } ≤ f(x) + tA^µ f(x) + tϕ*(|∇f(x)|) + o(t) (2.7)

uniformly over x ∈ R^d, where the supremum is taken over all u ≥ 0 and ν ∈ P_p(R^d) with W_p(µ_t, ν) = u; as S (t)f ≤ S(t)f, this then completes the proof. Actually, for every t ≥ 0, one may restrict to those u ≥ 0 for which tϕ(u/t) ≤ ‖f‖_∞ + 1. As ϕ grows faster than linearly, this implies that there is some u_0 (independent of t) for which the latter implies u ≤ u_0 t. Now, fix 0 ≤ u ≤ u_0 t and ν ∈ P_p(R^d) with W_p(µ_t, ν) = u, and a coupling π(dy, dz) between µ_t and ν which is optimal for W_p(µ_t, ν). By Taylor's theorem,

f(x + z) = f(x + y) + (z − y)∇f(x + y + ξ)

for all x, y, z ∈ R^d, where ξ = ξ(x, y, z) is a measurable function such that |ξ| ≤ |z − y|. Hence, it follows from Hölder's inequality that

∫ f(x + z) ν(dz) ≤ ∫ f(x + y) µ_t(dy) + u (∫ |∇f(x + y + ξ)|^{p*} π(dy, dz))^{1/p*},

where p* = p/(p − 1) is the conjugate Hölder exponent of p. For every 0 ≤ u ≤ u_0 t and ν as above, it follows, again by the same arguments as in the proof of Lemma 2.7, that

(∫ |∇f(x + y + ξ)|^{p*} π(dy, dz))^{1/p*} = |∇f(x)| + o(1) for t ↓ 0,

uniformly over x ∈ R^d. Putting everything together yields

S(t)f(x) ≤ ∫ f(x + y) µ_t(dy) + sup_{0≤u≤u_0 t} ( u|∇f(x)| + u o(1) − ϕ_t(u) ) ≤ f(x) + tA^µ f(x) + tϕ*(|∇f(x)|) + o(t)

uniformly over x ∈ R^d, where the last inequality follows from the definition of the convex conjugate ϕ*. This shows (2.7) and therefore completes the proof.
This is a good place to mention the recent paper [2], in which ideas related to those of Proposition 2.11 are applied in the context of stochastic optimization. With Proposition 2.11 at our disposal, we can finally prove our main result (Theorem 1.2), or rather its convex generalization (Theorem 2.12 below).
Before doing so, let us recall the notion of viscosity solution that we use: denote by C^1_0(R^d) the space of all continuously differentiable functions f ∈ C_0(R^d) whose gradient is vanishing at infinity, and call v : (0, ∞) → C_0(R^d) a test function if it is differentiable (w.r.t. the supremum norm) and satisfies v(t) ∈ D(A^µ) ∩ C^1_0(R^d) for every t ∈ (0, ∞).
Then, following [8], we say that u is a viscosity subsolution to

∂_t u(t, x) = A^µ u(t, x) + ϕ*(|∇u(t, x)|), (t, x) ∈ (0, ∞) × R^d, (2.8)

if ∂_t v(t, x) ≤ A^µ v(t, x) + ϕ*(|∇v(t, x)|) for every test function v with u ≤ v and v(t, x) = u(t, x). Similarly, u is called a viscosity supersolution if the above holds with '≤' replaced by '≥' at both instances, and a viscosity solution if it is both a viscosity supersolution and subsolution. As a consequence of the previous result we derive the following:

Theorem 2.12. For every f ∈ C_0(R^d), the function u(t, x) := S (t)f(x) is a viscosity solution to (2.8) satisfying u(0, ·) = f.

Proof. To show that u is a viscosity subsolution, let v be a test function such that u ≤ v and v(t, x) = u(t, x) for some (t, x) ∈ (0, ∞) × R^d. Since v is differentiable at t, there exists ∂_t v(t) ∈ C_0(R^d) such that v(t + h) = v(t) + h∂_t v(t) + o(h) for |h| → 0. Similar to the proof of Lemma 2.7, it follows that

S (h)v(t − h) = S (h)v(t) − h∂_t v(t) + o(h)

for |h| → 0. Hence, for h > 0 small enough, using Proposition 2.10, we have that

v(t, x) = u(t, x) = S (h)u(t − h)(x) ≤ S (h)v(t − h)(x) = S (h)v(t)(x) − h∂_t v(t, x) + o(h).

In particular, since h^{−1}(S (h)v(t, x) − v(t, x)) → A v(t, x) uniformly over x ∈ R^d by Proposition 2.11, and v(t, x) = u(t, x) by assumption, we conclude that 0 ≤ −∂_t v(t, x) + A v(t, x). This shows that u is a viscosity subsolution of (2.8). The argument that u is a viscosity supersolution follows along the same lines, successively replacing '≤' by '≥'.
Finally, we sketch how uniqueness of the solution of (2.8) in Theorem 2.12 may be obtained. Under certain conditions on the initial Lévy process, one obtains from [13, Corollary 53] uniqueness of the viscosity solution of (2.8) by using the space C^{2,3}_b((0, ∞) × R^d) as test functions. This requires an extension of the semigroup (S (t))_{t≥0} to the space BUC(R^d) of all bounded and uniformly continuous functions, which may be achieved via monotone approximation and continuity from above of the operators S (t), t ≥ 0; see also [7, Remark 5.4]. Then, by adapting Proposition 2.11, it follows that Theorem 2.12 also holds for test functions v : (0, ∞) → BUC(R^d) which are differentiable with v(t) ∈ BUC²(R^d) for all t ≥ 0, where BUC²(R^d) denotes the space of all functions which are twice differentiable with bounded, uniformly continuous derivatives up to order 2. Once this is done, the results in [13] may be used, since C^{2,3}_b((0, ∞) × R^d) is a subset of the considered test functions, see [8, Remark 2.7].