Signature asymptotics, empirical processes, and optimal transport



Introduction

Previous work
The mathematical notion of a path captures the concept of a continuous time-ordered sequence of values. These objects, and their generalisations, occur widely throughout both pure and applied mathematics. For example, the analysis of the sample paths of a stochastic process forms a significant part of stochastic analysis, while time series analysis is an established tool in modern statistics. Abstract paths are inherently infinite-dimensional objects, and it is desirable to seek low-dimensional summaries which capture some features of interest. A mathematically principled approach to this has gained prominence in recent years and has led to several new developments in time-series analysis [19,27,26,31], machine learning [11,32], deep learning [21,22] and, more recently, kernel methods [12,25,8,23]. This approach uses the (path) signature transform which, in contrast to traditional methods based on sampling, captures the path through its effect on any smooth non-linear controlled differential system. To be more precise, let $\gamma$ be a path of finite 1-variation defined on the closed interval $[a, b] \subset \mathbb{R}$ with values in $\mathbb{R}^d$. Then, given a smooth collection of vector fields $\{V_i : i = 1, \dots, d\}$ on $\mathbb{R}^e$, we can in some circumstances write the response $\alpha : [a, b] \to \mathbb{R}^e$ of the controlled differential equation
$$d\alpha_t = \sum_{i=1}^{d} V_i(\alpha_t)\, d\gamma^i_t, \quad \text{started at } \alpha_a,$$
in terms of a convergent series of iterated integrals of $\gamma$; that is
$$\alpha_b = \alpha_a + \sum_{k=1}^{\infty} \sum_{i_1, \dots, i_k = 1}^{d} V_{i_1} \cdots V_{i_k} \mathrm{Id}(\alpha_a) \int_{a \le t_1 \le \dots \le t_k \le b} d\gamma^{i_1}_{t_1} \cdots d\gamma^{i_k}_{t_k},$$
where $\mathrm{Id}$ denotes the identity function on $\mathbb{R}^e$. Using the above as motivation, we recall that the signature $S(\gamma)$ of $\gamma$ is defined as the collection of all iterated integrals
$$S(\gamma) := \left(1, S(\gamma)^1, S(\gamma)^2, \dots\right),$$
where $S(\gamma)^k := \{S(\gamma)^{k; I}\}_{I \in J}$, where $J := \{(i_1, \dots, i_k) \mid 1 \le i_1, \dots, i_k \le d\} \subset \mathbb{N}^k$ is a set of multi-indices, and where
$$S(\gamma)^{k; i_1, \dots, i_k} := \int_{a \le t_1 \le t_2 \le \dots \le t_k \le b} d\gamma^{i_1}_{t_1} \cdots d\gamma^{i_k}_{t_k}.$$
A key theorem of Hambly and Lyons [20] states that the map $\gamma \mapsto S(\gamma)$ is one-to-one up to an equivalence relation on the space of paths called tree-like equivalence [7]. In this way, the signature offers a top-down summary of $\gamma$, giving a practical and efficient representation of the curve [26]. This approach has several pleasant theoretical and computational consequences. For example, the signature transform satisfies a universality property: any continuous function $f : J \to \mathbb{R}$ on a compact subset $J$ of signature features can be approximated arbitrarily well by a linear functional [26]. This result inspired the development of several new methods and paradigms in time-series analysis [19,27,26,31], machine learning [11,32] and, more recently, deep learning [21,22]. A notable use of the signature transform has been its application in the popular field of kernel methods, where the so-called signature kernel [23], consisting of the inner product between two signatures, was introduced. In addition to being backed by a rich theory [28,16], working with the inner product of signature features has proved to be a promising and effective approach to many tasks [12,8] and has achieved state-of-the-art performance for some of them [25]. This growing interest in the use of the signature has also brought into focus methods for recovering properties of the underlying path from the signature. Some terms in the signature are explicitly relatable to properties of the original path, e.g. the increment and the area can be recovered from the terms of order 1 and order 2 respectively. Recovering more granular information on the path demands a more sophisticated approach. A rich stream of recent work has tackled the explicit reconstruction of a path from its signature, e.g.
based on a unicity result for the signature of Brownian motion sample paths [24], one can construct a polygonal approximation to Brownian paths [24], diffusions [18], a large class of Gaussian processes [5] and even some deterministic paths [17] using the signature features only. These approaches fundamentally exploit the full signature representation (and not a truncated version of it), which may not be available in some cases. In parallel, there have been other approaches to reconstruction. In [29], the hyperbolic development of the signature is exploited to obtain an inversion scheme for piecewise linear paths. On the other hand, [30] proposed a symmetrization procedure on the signature which leads to a reconstruction algorithm in some cases. Both approaches have the advantage of being implementable.
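To make the definition of the signature concrete, the following is a minimal Python sketch that computes the truncated signature of a piecewise linear path. It is an illustration under stated assumptions rather than the method of any cited reference: it assumes NumPy, uses the standard facts that a straight segment with increment $v$ has level-$k$ term $v^{\otimes k}/k!$ and that signatures of concatenated paths multiply (Chen's identity), and the function names `segment_signature`, `chen_product` and `signature` are hypothetical.

```python
import math
import numpy as np


def segment_signature(v, depth):
    """Signature of a straight segment with increment v: level k is v^{(x)k} / k!."""
    levels = [np.array(1.0)]
    for k in range(1, depth + 1):
        # level k = (level k-1) tensor v, divided by k
        levels.append(np.multiply.outer(levels[-1], v) / k)
    return levels


def chen_product(a, b):
    """Truncated tensor product of two signatures (Chen's identity)."""
    depth = len(a) - 1
    return [
        sum(np.multiply.outer(a[i], b[k - i]) for i in range(k + 1))
        for k in range(depth + 1)
    ]


def signature(points, depth):
    """Truncated signature of the piecewise linear path through `points`."""
    points = np.asarray(points, dtype=float)
    # start from the signature of the constant path (the unit element)
    sig = segment_signature(np.zeros(points.shape[1]), depth)
    for v in np.diff(points, axis=0):
        sig = chen_product(sig, segment_signature(v, depth))
    return sig


if __name__ == "__main__":
    sig = signature([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]], depth=3)
    print("increment (level 1):", sig[1])
    print("Levy area (from level 2):", 0.5 * (sig[2][0, 1] - sig[2][1, 0]))
```

For the two-segment path above, the level-1 term is the increment and the antisymmetric part of the level-2 term is the area enclosed between the path and its chord, in line with the remark on terms of order 1 and 2.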
Another branch of investigation has been the recovery of broad features of a path using the asymptotics of its signature, or of functions of terms of its signature. The study of the latter has been an active area of research for the last 10 years [20,3,10,4,6]. Recall that if $\gamma : [a, b] \to \mathbb{R}^d$ is absolutely continuous, its length is defined to be $\int_a^b |\gamma'_t|\, dt$ and is denoted by $l$. Such a curve admits a unit-speed parametrisation, that is, a map $\rho : [0, l] \to [a, b]$ such that the path $\gamma \circ \rho$ is a unit-speed curve. Hambly and Lyons [20] initially showed that the arc-length of a unit-speed path can be recovered from the asymptotics of the norm of terms in the signature under a broad class of norms. To be more concrete, they proved that if $\gamma : [0, l] \to V$ is a continuously differentiable unit-speed curve (where $V$ is a finite dimensional Banach space), then it holds that
$$\lim_{n \to \infty} \frac{n!\, \|S(\gamma)^n\|}{l^n} = 1 \tag{6}$$
for a class of norms $\|\cdot\|$ on the tensor product $V^{\otimes n}$ which includes the projective tensor norm, but excludes the Hilbert-Schmidt norm in the case where $V$ is endowed with an inner product. For the same class of norms, this also implies the weaker statement
$$\lim_{n \to \infty} \left(n!\, \|S(\gamma)^n\|\right)^{1/n} = l. \tag{7}$$
A natural question is whether other properties of the original path can also be recovered in a similar fashion. When $V$ is an inner-product space and the Hilbert-Schmidt norm is considered, [20] also show that Eq. (7) holds, but that Eq. (6) fails to be true in general: indeed, under the assumption that $\gamma$ is three times continuously differentiable, the following result is proved,
$$\lim_{n \to \infty} \frac{n!\, \|S(\gamma)^n\|_{HS}}{l^n} = c(\gamma) := \mathbb{E}\left[\exp\left(\int_0^l (B_{0,s})^2 \langle \gamma'_s, \gamma'''_s \rangle\, ds\right)\right]^{1/2}, \tag{8}$$
where $(B_{0,s})_{s \in [0, l]}$ is a Brownian bridge which returns to zero at time $l$. It is easily seen that $c(\gamma) < 1$ unless $\gamma$ is a straight line.
More recent articles have focused on proving statements similar to Eq. (7) under the fewest possible assumptions on $\gamma$ and on the tensor norms. In the article [10] for example, it was proved that if $\gamma$ is continuous and of finite 1-variation then, under any reasonable tensor norm, the identity Eq. (9) holds, provided that the sequence $\{S(\gamma)^n : n = 1, 2, \dots\}$ does not contain an infinite subsequence of zeros.
Boedihardjo and Geng [3] strengthened this result by proving that the existence of such a subsequence is equivalent to the underlying path being tree-like, see also [9]. Taken together, these articles prove that the identity in Eq. (9) holds true for a wide class of continuous bounded variation paths. It is conjectured that for a tree-reduced path $\gamma$ the limit is exactly the length of $\gamma$, see [9]. This question remains open at the time of writing.

Contributions
In this article, we contribute to the effort of recovering the original path from its signature by laying down a novel route. We do so by explicitly relating the Hilbert-Schmidt norm of projected signatures to p-Wasserstein distances between discrete probability measures, allowing the study of the former using tools developed for the latter. These measures are characterised in terms of $\gamma$ only, through an integral equation, making explicit the contribution of the geometrical properties of $\gamma$ (such as its curvature) to the limit of the norm. To ease notation, from this point forwards, we consider unit-speed paths parameterised on $[0, 1]$ rather than $[0, l]$.
The core insight of this connection originated from the realisation that a theorem by del Barrio, Giné, and Utzet (see Theorem 2 below) can be recast to re-express the Hambly-Lyons limit $c(\gamma)$ as the limit of a 2-Wasserstein distance between empirical measures (Section 2). Formally, for a twice-continuously-differentiable unit-speed path $\gamma : [0, 1] \to \mathbb{R}^d$ that is regular enough (see exact conditions in Proposition 10), we construct and show the existence of a measure $\mu$ on $\mathbb{R}$, prescribed in terms of the following integral equation for its cumulative distribution function $F$, for some constants c, b ∈ R + . When coupled with the B-G-U Theorem, we show that
$$c(\gamma) = \lim_{n \to \infty} \mathbb{E}\left[\exp\left(-n W_2^2(\mu_n, \mu)\right)\right]^{1/2},$$
i.e. the Hambly-Lyons limit $c(\gamma)$ can be seen as the limit of 2-Wasserstein distances between $\mu$ and an empirical version $\mu_n$ of it, where $\delta$ denotes the Dirac delta distribution and where $\{U_i\}_{i \in \{1, \dots, n\}}$ is a sample of $n$ independent uniform random variables on $[0, 1]$.
In Section 3, motivated by the above insight, we derive relationships between the signature inner product $\langle S(\gamma)^n, S(\sigma)^n \rangle$ and a series of p-Wasserstein distances. As a first application, we re-derive a generalised version of the Hambly-Lyons Limit Theorem (Section 3.4) through the lens of discrete optimal transport by exploiting asymptotic results on Wasserstein distances between empirical measures [2].
Theorem 22 (Generalised Hambly-Lyons Limit Theorem). Let $\gamma : [0, 1] \to \mathbb{R}^d$ be a twice-continuously-differentiable unit-speed path such that the map $s \mapsto |\gamma''_s|$ is non-vanishing, and differentiable with bounded derivative. Then,
$$\lim_{n \to \infty} n!\, \|S(\gamma)^n\|_{HS} = c(\gamma) = \mathbb{E}\left[\exp\left(-\int_0^1 (B_{0,s})^2 |\gamma''_s|^2\, ds\right)\right]^{1/2}.$$

Section 3.5 concludes this article by presenting a way to practically compute that limit by solving a second order distributional differential equation.

The B-G-U Theorem
We now present the above-mentioned theorem by del Barrio, Giné, and Utzet.

Definition 1 ($J_p$ functional and I-function $I(t)$). Let $X$ be a non-constant random variable with law $\mu$. Suppose that $\mu$ has a density $f$ w.r.t. Lebesgue measure and let $F$ be the associated distribution function. The $J_p$ functional is defined as
$$J_p(\mu) := \int_{-\infty}^{\infty} \frac{[F(x)(1 - F(x))]^{p/2}}{f(x)^{p-1}}\, dx$$
for $p \in \mathbb{N}$. Moreover, if $F$ admits an absolutely continuous inverse on $(0, 1)$ (or equivalently, by virtue of Proposition A.17 in [2], if $\mu$ is supported on an interval, finite or not, and the absolutely continuous component of $\mu$ has on that interval an almost everywhere positive density), define the I-function $I(t)$ for almost all $t \in (0, 1)$ as
$$I(t) := \frac{1}{(F^{-1})'(t)} = f(F^{-1}(t)),$$
where the last equality exploited Proposition A.18 in [2].
For the rest of this section, we denote by $\mu_n$ the empirical measure defined as
$$\mu_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i},$$
where $X_i$, $i \in \{1, 2, \dots, n\}$, is an i.i.d. sample of random variables drawn according to $\mu$.
Theorem 2 (B-G-U Theorem; [15]). Let $\mu$ be a measure supported on $(a, b) \subset \mathbb{R}$ such that it admits a density $f$. Denote by $F$ its distribution function. Assume further that $f$ is positive and differentiable and satisfies
$$\sup_{a < x < b} \frac{F(x)(1 - F(x))\, |f'(x)|}{f(x)^2} < \infty \tag{16}$$
and that $J_2(\mu) < \infty$. Then, as $n \to \infty$,
$$n W_2^2(\mu_n, \mu) \to \int_0^1 \frac{(B_{0,t})^2}{I(t)^2}\, dt$$
weakly in $\mathbb{R}$, where $B_{0,t}$ is a Brownian bridge starting at 0 and vanishing at $t = 1$.
Remark 3 (Origin of assumption Eq. (16)). Condition Eq. (16) has been a recurrent assumption in several asymptotic results involving quantile processes and ultimately goes back to Csörgő and Révész [14]. To better understand its connection with the B-G-U Theorem 2, recall the following identity
$$n W_2^2(\mu_n, \mu) = \int_0^1 \xi_n(t)^2\, dt, \qquad \xi_n(t) := \sqrt{n}\left(F_n^{-1}(t) - F^{-1}(t)\right),$$
with $\xi_n$ denoting the quantile process and $F_n^{-1}$ the quantile function of $\mu_n$. It is shown in [15] that $\xi_n$ converges weakly in $L^2(0, 1)$ to $B_{0,t}/I(t)$. For a slightly more general object, the so-called normed sample quantile process $\rho_n$, it can be shown that requiring condition Eq. (16) allows one to asymptotically control $\rho_n$. More details can be found in [13].
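As a numerical sanity check of this type of limit, the sketch below looks at the simplest case, where $\mu$ is the uniform law on $(0, 1)$, so that the density is constant and $I \equiv 1$. It is only an illustration: it assumes NumPy, it does not verify the hypotheses of the theorem, and the names `n_w2_squared_to_uniform` and `monte_carlo_mean` are hypothetical. It uses the one-dimensional identity $W_2^2(\mu_n, \mu) = \int_0^1 (F_n^{-1}(t) - F^{-1}(t))^2\, dt$ with $F^{-1}(t) = t$, and compares the Monte Carlo mean of $n W_2^2(\mu_n, \mu)$ with $\mathbb{E}\int_0^1 (B_{0,t})^2\, dt = \int_0^1 t(1 - t)\, dt = 1/6$.

```python
import numpy as np


def n_w2_squared_to_uniform(sample):
    """Return n * W_2^2(mu_n, Uniform(0,1)), computed exactly from quantiles.

    The empirical quantile function equals the i-th order statistic on
    ((i-1)/n, i/n], so the integral of (F_n^{-1}(t) - t)^2 splits into n pieces.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    left = np.arange(n) / n
    right = np.arange(1, n + 1) / n
    # integral of (x_i - t)^2 over [left_i, right_i]
    pieces = ((x - left) ** 3 - (x - right) ** 3) / 3.0
    return n * pieces.sum()


def monte_carlo_mean(n=500, reps=400, seed=0):
    """Average of n * W_2^2(mu_n, mu) over independent uniform samples."""
    rng = np.random.default_rng(seed)
    values = [n_w2_squared_to_uniform(rng.random(n)) for _ in range(reps)]
    return float(np.mean(values))


if __name__ == "__main__":
    print("Monte Carlo mean:", monte_carlo_mean())
    print("Mean of the limiting integral:", 1.0 / 6.0)
```

The two printed numbers should be close for large `n`; the agreement is only in expectation and does not by itself test weak convergence.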
The following corollary will form a key component of our generalisation of the Hambly-Lyons result.
Corollary 4. Let $\mu$ be a measure supported on $(a, b) \subset \mathbb{R}$ which satisfies the assumptions of Theorem 2, then
$$n W_2^2(\mu_n^1, \mu_n^2) \to 2 \int_0^1 \frac{(B_{0,t})^2}{I(t)^2}\, dt$$
weakly in $\mathbb{R}$ as $n \to \infty$, where $\mu_n^1$ and $\mu_n^2$ are independent copies of empirical measures from $\mu$.

Proof. Let $\xi_n^1$ and $\xi_n^2$ denote the quantile processes of $\mu_n^1$ and $\mu_n^2$ defined in Remark 3. Since $L^2(0, 1)$ is separable, [1, Theorem 2.8] implies that $\xi_n^1 - \xi_n^2$ converges weakly in $L^2(0, 1)$ to $(B^1_{0,t} - B^2_{0,t})/I(t)$, for independent Brownian bridges $B^1$ and $B^2$. Since the difference of independent Brownian bridges is itself a Brownian bridge with twice the variance, we can apply the Continuous Mapping Theorem and Remark 3 to conclude that
$$n W_2^2(\mu_n^1, \mu_n^2) = \int_0^1 \left(\xi_n^1(t) - \xi_n^2(t)\right)^2 dt \to 2 \int_0^1 \frac{(B_{0,t})^2}{I(t)^2}\, dt.$$

The existence of a bridge between signature asymptotics and the theory of empirical processes is hinted at when one considers a special instance of the B-G-U Theorem 2. Indeed, when applied to a regular enough class of measures $\mu$ satisfying $I(t) = |\gamma''_t|^{-1}$, the right-hand side of B-G-U exactly coincides with the Hambly-Lyons limit. The following remark formalises this observation.
Then the B-G-U Theorem 2 implies that the Hambly-Lyons limit $c(\gamma)$ can be rewritten as the limit of a 2-Wasserstein distance, i.e.
$$c(\gamma) = \lim_{n \to \infty} \mathbb{E}\left[\exp\left(-n W_2^2(\mu_n, \mu)\right)\right]^{1/2}.$$
Proof. The assumptions for the B-G-U Theorem 2 are satisfied and the limit exists under the stated assumptions. If one further assumes that the assumptions of the Hambly-Lyons limit Eq. (8) hold (i.e. that $\gamma$ is of class $C^3$), then the limits coincide as $\gamma$ is unit-speed and satisfies $\langle \gamma'_t, \gamma'''_t \rangle = -|\gamma''_t|^2$.

Existence and characterisation of admissible measures
The rest of this section will focus on characterising the paths, γ, for which a measure µ exists satisfying conditions A, B, C and D of Corollary 5. First, we recast the determination of such µ into the solving of an integral equation in terms of γ only.Thereafter, we formulate assumptions on γ ensuring the existence of a solution to this integral equation, whose associated measure µ satisfies the conditions A, B, C and D of Corollary 5.
We start by rewriting condition B as an explicit condition on the cumulative distribution function F associated to the measure µ.
Remark 6 (Reformulating condition B as an integral equation for $F$). Let $\gamma : [0, 1] \to \mathbb{R}^d$ be a twice-continuously-differentiable unit-speed path with non-vanishing second derivative. If $F$ admits an absolutely continuous inverse on $(0, 1)$ (which is the case whenever condition A holds; see Proposition A.17 from [2]), then condition B holds if and only if $F$ satisfies the following integral equation,
$$F(t) = \int_a^t |\gamma''_{F(s)}|^{-1}\, ds \tag{23}$$
for all $t \in (a, b)$, with boundary conditions $\lim_{s \to a^+} F(s) = 0$, $\lim_{s \to b^-} F(s) = 1$. To see this, first suppose that conditions A and B hold, then
$$F(t) = \int_a^t f(s)\, ds = \int_a^t I(F(s))\, ds = \int_a^t |\gamma''_{F(s)}|^{-1}\, ds.$$
Assume now that condition A and Eq. (23) hold, then the Lebesgue Differentiation Theorem implies that the density function associated to $F$ can be written as $f(t) = |\gamma''_{F(t)}|^{-1}$. And so $I(t) = f(F^{-1}(t)) = |\gamma''_t|^{-1}$.

with bounded second derivative $F''$. Then,
(i) $F$ is a cumulative distribution function with support on $(a, b)$ and admits an absolutely continuous inverse $F^{-1}$ on $(0, 1)$.
(ii) $F$ fulfills conditions A, B, C and D.
Proof. That $F$ is a cumulative distribution function corresponding to some measure $\mu$ supported on $(a, b)$ follows immediately from Eq. (25) and the differentiability of $F$. By virtue of Proposition A.17 in [2], $F$ admits an absolutely continuous inverse on $(0, 1)$ if and only if $\mu$ is supported on an interval, finite or not, and the absolutely continuous component of $\mu$ has on that interval an a.e. positive density (with respect to Lebesgue measure). We show that the latter statement holds. Indeed, since $\gamma''$ is non-vanishing, the Lebesgue Differentiation Theorem implies that the density function of the absolutely continuous component of the probability measure $\mu$ associated to $F$ is positive almost everywhere and satisfies $f(t) = |\gamma''_{F(t)}|^{-1}$, implying the existence of an absolutely continuous inverse $F^{-1}$. This concludes (i).
Regarding point (ii), as $F$ is twice-differentiable, its underlying measure $\mu$ is absolutely continuous and $f$ is its density, which is positive and differentiable, implying the fulfillment of condition A. Condition B follows from Remark 6. Finally, since $|\gamma''|$ is bounded from below (as $\gamma''$ is continuous and non-vanishing) and $F''$ is bounded from above, it follows that conditions C and D are both satisfied.
The above result states that a twice-differentiable solution $F$ of the integral equation Eq. (25), with the property that $F''$ is bounded, is the cumulative distribution function of a measure $\mu$ of the kind required for the application of Corollary 5. We now determine the conditions on $\gamma$ under which the existence of such an $F$ is guaranteed. Let $\gamma$ be as in Lemma 7. Then the following observations can be made.
1. If $\gamma$ has bounded curvature, i.e. there exists a constant $c > 0$ such that $|\gamma''_s| \le \frac{1}{c}$ for all $s \in [0, 1]$ (or equivalently, if the map $s \mapsto |\gamma''_s|^{-1}$ is bounded below by $c$), then $F'(t) \ge c$, so that $F$ reaches 1 at a constant $b \in [a, a + \frac{1}{c}]$, as $F$ is monotone increasing.

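2. Additionally, if the map $s \mapsto |\gamma''_s|^{-1}$ is Lipschitz continuous, the Picard-Lindelöf Theorem ensures the existence of a unique differentiable function $F : [a, b] \to [0, 1]$ such that $F'(t) = |\gamma''_{F(t)}|^{-1}$ with $F(a) = 0$.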
3. To ensure that $F$ is twice differentiable, requiring Lipschitz continuity of $s \mapsto |\gamma''_s|^{-1}$ is not enough. Indeed, the latter assumption only implies the differentiability of $F'$ almost everywhere on any open subset of its domain by virtue of Rademacher's Theorem (which is not sufficient, as it can lead to a violation of condition A). However, if we assume the map $s \mapsto |\gamma''_s|$ to be differentiable, then $F$ is guaranteed to be twice-differentiable for all $t \in (a, b)$ by the quotient rule and the non-vanishing property of $\gamma''$.
4. As it is now assumed that $s \mapsto |\gamma''_s|^{-1}$ is differentiable, the fundamental theorem of calculus implies that the probability density function $f$ can be exactly written as $f(t) = F'(t) = |\gamma''_{F(t)}|^{-1}$. Additionally, since $\gamma''$ is non-vanishing, the curvature $|\gamma''_s|$ is bounded from below, i.e. $|\gamma''_s| \ge \frac{1}{c_-}$ for all $s \in [0, 1]$ (or equivalently, the map $s \mapsto |\gamma''_s|^{-1}$ is bounded above by $c_-$), so that the ODE Eq. (28) implies $F'(t) \le c_-$ for all $t \in (a, b)$. Assuming further that the derivative of the map $s \mapsto |\gamma''_s|$ is bounded implies the boundedness of $F''$ by the chain and quotient rules.
These observations are collected and combined with Lemma 7 in the following lemma.
Lemma 8 (Condition on $\gamma$ for existence of admissible $F$). Let $\gamma : [0, 1] \to \mathbb{R}^d$ be a twice-continuously-differentiable unit-speed path such that the map $s \mapsto |\gamma''_s|$ is non-vanishing, and is differentiable with bounded derivative. Then there exist constants $a, b \in \mathbb{R}$ and a twice-differentiable function $F : (a, b) \to [0, 1]$ such that the integral equation and boundary conditions Eq. (25) are satisfied and the conditions of Corollary 5 are also satisfied.
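The construction of $F$ can be explored numerically. The sketch below is a minimal illustration, assuming that the integral equation takes the form $F(t) = \int_a^t |\gamma''_{F(s)}|^{-1}\, ds$, i.e. $F'(t) = 1/\kappa(F(t))$ with $F(a) = 0$, where $\kappa(u) := |\gamma''_u|$ is the curvature. The function name `solve_cdf`, the step size and the choice of a classical Runge-Kutta scheme are assumptions made for illustration, and the curvature is passed in as a user-supplied function.

```python
import numpy as np


def solve_cdf(curvature, a=0.0, step=1e-3):
    """Integrate F'(t) = 1 / curvature(F(t)), F(a) = 0, until F reaches 1.

    Returns the time grid, the values of F on it, and an estimate of the
    right end point b of the support (the time at which F hits 1).
    """
    def rhs(value):
        # the curvature is only defined on [0, 1], so clip intermediate stages
        return 1.0 / curvature(min(value, 1.0))

    ts, Fs = [a], [0.0]
    while Fs[-1] < 1.0:
        F = Fs[-1]
        k1 = rhs(F)
        k2 = rhs(F + 0.5 * step * k1)
        k3 = rhs(F + 0.5 * step * k2)
        k4 = rhs(F + step * k3)
        Fs.append(F + step * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
        ts.append(ts[-1] + step)
    # linear interpolation on the last step to locate F(b) = 1
    t0, t1, F0, F1 = ts[-2], ts[-1], Fs[-2], Fs[-1]
    b = t0 + (1.0 - F0) * (t1 - t0) / (F1 - F0)
    return np.array(ts), np.array(Fs), b


if __name__ == "__main__":
    # circular arc of constant curvature 2: F is linear and mu is uniform on (a, b)
    _, _, b = solve_cdf(lambda u: 2.0)
    print("right end point b for constant curvature 2:", b)
```

Under the assumed form of the equation, a change of variables gives $b - a = \int_0^1 |\gamma''_u|\, du$, so the length of the support of $\mu$ is the total curvature of $\gamma$; for constant curvature $\kappa$ the measure $\mu$ is uniform on an interval of length $\kappa$.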
Remark 9 (Boundedness of the derivative of $|\gamma''_s|$ as a condition on the curvature). Observe that the boundedness condition on $|\gamma''_s|$ (which we recall is there to ensure the fulfillment of the boundary conditions in Eq. (25)) can be reformulated as a condition on the total curvature $T(t) := \int_0^t |\gamma''_s|\, ds$ of $\gamma$ as follows. For a $c_2 \in \mathbb{R}_+$, Taking the integral from 0 to t gives

Finally, the direct application of the existence Lemma 8 followed by Theorem 2 yields the following result.
Proposition 10 (Hambly-Lyons limit as the limit of a Wasserstein distance). Let $\gamma$ be as in Lemma 8 and $F$ the solution to the integral equation Eq. (25). Let $\mu$ be the measure associated with $F$ and let $X_1, \dots, X_n$ be a sample of $n$ i.i.d. random variables drawn from $\mu$. Let $\mu_n$ be its associated empirical measure, i.e.
$$\mu_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}.$$
Then, the Hambly-Lyons limit $c(\gamma)$ is the limit of a 2-Wasserstein distance
$$c(\gamma) = \lim_{n \to \infty} \mathbb{E}\left[\exp\left(-n W_2^2(\mu_n, \mu)\right)\right]^{1/2}.$$

3 Signature projections and Wasserstein distances

Because of the known connection between the Hambly-Lyons limit and the limit of the Hilbert-Schmidt tensor norm of projected signatures [20], the insights developed in the previous section naturally lead one to ask whether the Hilbert-Schmidt tensor norm of projected signatures can be related to Wasserstein distances. This section answers this question positively. By using this relationship, we are able to characterise a class of curves, larger than the $C^3$ class originally considered in [20], that satisfies
$$\lim_{n \to \infty} n!\, \|S(\gamma)^n\|_{HS} = c(\gamma).$$
We proceed as follows.
1. First, in Section 3.1, we prove a technical augmentation of a lemma in [20] and then derive a probabilistic representation for the inner product of two signature terms.
2. Once this is done, Section 3.2 will exploit the characterisation of the Wasserstein distances between empirical measures to relate the quantities derived in the first step to these Wasserstein distances and hence derive lower and upper bounds on $\|S(\gamma)^n\|$ in terms of the latter.
3. By leveraging the results of the previous section, we generalise the Hambly-Lyons limit Eq. ( 8) in Section 3.4 and present the proof of Theorem 22.
4. Finally, we show a practical way to compute the limit in Section 3.5 and illustrate it in a simple case.
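The second step of this outline relies on the fact that Wasserstein distances between empirical measures on the real line are explicit. As a minimal sketch, assuming that the characterisation intended is the classical order-statistics formula $W_p^p(\mu_n^X, \mu_n^Y) = \frac{1}{n} \sum_{i=1}^{n} |X_{(i)} - Y_{(i)}|^p$ for two samples of equal size and $p \ge 1$, the following Python code (NumPy assumed, function names hypothetical) implements it and compares it against a brute-force minimisation over all pairings of the atoms.

```python
import itertools
import numpy as np


def wasserstein_p_empirical(x, y, p=2):
    """W_p between two empirical measures on R with the same number of atoms.

    For p >= 1 the optimal coupling pairs the order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("both samples must have the same size")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


def wasserstein_p_bruteforce(x, y, p=2):
    """Same quantity by exhaustive search over permutations (small n only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    costs = (
        np.mean(np.abs(x - y[list(perm)]) ** p)
        for perm in itertools.permutations(range(len(y)))
    )
    return float(min(costs) ** (1.0 / p))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x, y = rng.normal(size=6), rng.normal(size=6)
    print(wasserstein_p_empirical(x, y), wasserstein_p_bruteforce(x, y))
```

The brute-force search is valid because, for uniform weights on the same number of atoms, an optimal coupling can always be taken to be a permutation; it is included only as a check and scales factorially in the sample size.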

A probabilistic expression for signature inner products
In this subsection, we generalise Lemma 3.9 in [20] to the inner product between signatures before presenting a probabilistic formula in terms of the angles between the derivatives of the two underlying curves.
This result states that the inner product between signatures of deterministic paths can be represented statistically through the mean of the product of the quantities $\langle \gamma'_{U_{(i)}}, \sigma'_{V_{(i)}} \rangle$. Observe that for unit-speed curves, the inner products $\langle \gamma'_{U_{(i)}}, \sigma'_{V_{(i)}} \rangle$ only encode the information on the angles $\Theta_i$ between the vectors $\gamma'_{U_{(i)}}$ and $\sigma'_{V_{(i)}}$, i.e. $\langle \gamma'_{U_{(i)}}, \sigma'_{V_{(i)}} \rangle = \cos \Theta_i$.
In this case, also observe that the angles $\Theta_i$ can be exactly recovered from the norm of the difference between the two above random variables, since $|\gamma'_{U_{(i)}} - \sigma'_{V_{(i)}}|^2 = 2 - 2 \cos \Theta_i$.

Proposition 13 (Inner product as a probabilistic expression). Suppose that $\gamma, \sigma : [0, 1] \to \mathbb{R}^d$ are two absolutely continuous curves such that $|\gamma'_t| = |\sigma'_t| = 1$ for almost every $t \in [0, 1]$. Then for every $n \in \mathbb{N}$ we have where $\Theta_i$ in $[0, \pi]$ is defined by $\langle \gamma'_{U_{(i)}}, \sigma'_{V_{(i)}} \rangle = \cos \Theta_i$, for $i = 1, \dots, n$, and $1_A$ denotes the indicator function on a set $A$.
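The probabilistic representation of signature inner products can be checked by simulation. The sketch below assumes the normalisation $(n!)^2 \langle S(\gamma)^n, S(\sigma)^n \rangle = \mathbb{E}\left[\prod_{i=1}^{n} \langle \gamma'_{U_{(i)}}, \sigma'_{V_{(i)}} \rangle\right]$, where $U_{(i)}$ and $V_{(i)}$ are the order statistics of two independent samples of $n$ uniform random variables on $[0, 1]$; this follows from the fact that such order statistics have density $n!$ on the simplex. It takes $\sigma = \gamma$ to be a unit-speed circular arc of curvature $\kappa$, for which $\langle \gamma'_u, \gamma'_v \rangle = \cos(\kappa(u - v))$, and compares with closed forms for $n = 1, 2$ computed from the chord and the area enclosed by the arc. NumPy is assumed and the function names are hypothetical.

```python
import numpy as np


def mc_signature_norm_sq(kappa, n, samples=200_000, seed=0):
    """Monte Carlo estimate of (n!)^2 ||S(gamma)^n||_HS^2 for a circular arc.

    gamma is the unit-speed arc of curvature kappa parameterised on [0, 1].
    """
    rng = np.random.default_rng(seed)
    u = np.sort(rng.random((samples, n)), axis=1)  # order statistics U_(i)
    v = np.sort(rng.random((samples, n)), axis=1)  # order statistics V_(i)
    return float(np.mean(np.prod(np.cos(kappa * (u - v)), axis=1)))


def exact_levels_1_and_2(kappa):
    """Closed forms of (n!)^2 ||S(gamma)^n||_HS^2 for n = 1 and n = 2."""
    chord_sq = 2.0 * (1.0 - np.cos(kappa)) / kappa**2    # |increment|^2
    area = (kappa - np.sin(kappa)) / (2.0 * kappa**2)    # area between arc and chord
    # level 2 splits into a symmetric part (increment) and an antisymmetric part (area)
    return chord_sq, chord_sq**2 + 8.0 * area**2


if __name__ == "__main__":
    kappa = 2.0
    exact1, exact2 = exact_levels_1_and_2(kappa)
    print("n = 1:", mc_signature_norm_sq(kappa, 1), "vs", exact1)
    print("n = 2:", mc_signature_norm_sq(kappa, 2), "vs", exact2)
```

The Monte Carlo estimates should agree with the closed forms up to sampling error of order `samples ** -0.5`.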

Lower bound on $\|S(\gamma)^n\|$ in terms of Wasserstein distances
We now use the probabilistic expression of the signature inner product (Proposition 13) with $\sigma = \gamma$ to derive a lower bound on $\|S(\gamma)^n\|$ in terms of Wasserstein distances.
Proposition 17 (Lower bound on $\|S(\gamma)^n\|$ in terms of Wasserstein distances). Let $\gamma$, $\mu_n^X$, $\mu_n^Y$ be as defined in Assumptions 14. Then for every $n \in \mathbb{N}$.
Proof. We use the fact that $|\gamma'_{U_{(i)}} - \gamma'_{V_{(i)}}| = |g(X_{(i)}) - g(Y_{(i)})|$, where $g := \gamma' \circ F$. The assumptions on $\gamma$ give that $g$ is once continuously differentiable and so the mean value inequality may be employed to see that $|g(X_{(i)}) - g(Y_{(i)})| \le \sup_t |g'(t)|\, |X_{(i)} - Y_{(i)}|$. Furthermore, as $F$ satisfies the integral equation of Lemma 7 we have that $|g'(t)| = |\gamma''_{F(t)}|\, F'(t) = 1$. By applying Lemma 16 we learn that Observe that $\max_{i=1,\dots,n} |X_{(i)} - Y_{(i)}| < 1$ is a strictly stronger condition than $\max_{i=1,\dots,n} \Theta_i < \frac{\pi}{2}$, and that the product in the second term on the right-hand side of Eq. (43) may be lower bounded by $-1$ by the Cauchy-Schwarz inequality. The lower bound Eq. (47) then follows by combining this observation, Eq. (43), and Eq. (50).
Remark 18. We can also multiply the sum inside the exponential term in Eq. (47) by the indicator function on the set $\{\max_{i=1,\dots,n} |X_{(i)} - Y_{(i)}| < 1\}$ without changing the random variable inside the expectation. Doing so will prove convenient in the proof of our main result, Theorem 22.

Upper bound on $\|S(\gamma)^n\|$ in terms of Wasserstein distances
Similarly to Section 3.2, we use Proposition 13 to derive an upper bound on $\|S(\gamma)^n\|$ in terms of a series of Wasserstein distances.
Proposition 19 (Upper bound on $\|S(\gamma)^n\|$ in terms of Wasserstein distances). Let $\gamma$, $\mu_n^X$, $\mu_n^Y$ be as defined in Assumptions 14. Then there exists an $0 < \varepsilon^* \le 1$ so that for every $n \in \mathbb{N}$ and $0 < \varepsilon < \varepsilon^*$ where $\varphi(u) := \sup_{|s - t| < u} |g'(t) - g'(s)|$ is the modulus of continuity of the derivative of $g := \gamma' \circ F$.
Proof. An application of the fundamental theorem of calculus and Eq. (49) gives And hence, by applying the Cauchy-Schwarz inequality to the integrand in the last line and so, in particular To ensure that the series inside the exponential in Eq. (51) is finite, we note that $\varphi$ is a modulus of continuity, and so there exists some $0 < \varepsilon^* \le 1$ for which $\varphi(\varepsilon) < 1$ for any $\varepsilon < \varepsilon^*$. Now, the product in the second term on the right-hand side of Eq. (43) may be upper bounded by 1 by the Cauchy-Schwarz inequality, so it follows from Eq. (53), Lemma 16 and Proposition 13 that Eq. (51) holds provided $\varepsilon < \varepsilon^*$.

Generalising the Hambly-Lyons Limit Theorem
Combined, Propositions 17 and 19 provide lower and upper bounds for $\|S(\gamma)^n\|$ in terms of a series of Wasserstein distances. What remains is to show that the lower bound converges to the square of the Hambly-Lyons limit as $n \to \infty$, and that the same applies to the upper bound when taking $n \to \infty$ and then $\varepsilon \to 0$. The following pair of lemmas provide the necessary results for this conclusion.
Lemma 20. Let $\mu$, $\mu_n^X$, and $\mu_n^Y$ be as in the standing assumptions, then for any $\varepsilon > 0$

Proof. Using the Mean Value and Inverse Function Theorems, we may deduce that for some $\eta_i \in [U_{(i)}, V_{(i)}]$. An application of Markov's inequality gives that P max i=1,...,n for some absolute constant $C > 0$. The first inequality utilises Eq. (55), and the second is due to Theorem 4.9 of [2]. Taking the limit as $n \to \infty$ in the above inequality concludes the proof.

Computing the Hambly-Lyons limit explicitly
Finally, we propose a way to practically compute the limit presented in Theorem 22. Using the work of Yor and Revuz on Bessel bridges [33], we relate the computation of the expected integral $c(\gamma)$ to the solving of a second order distributional differential equation.
Lemma 24 (High order term and curvature). Let $V$ be a finite dimensional inner product space. Suppose that $\gamma : [0, l] \to V$ is parameterised at unit speed and is three-times continuously differentiable (under this parameterisation). Let $\mu$ denote the finite Borel measure on $\mathbb{R}_+$ which is absolutely continuous with respect to the Lebesgue measure $\lambda = \lambda_{\mathbb{R}_+}$ with density given by $\frac{d\mu}{d\lambda}(t) = l^2 |\gamma''_{tl}|^2$

Proof. Define the unit-speed curve $\tilde{\gamma}_t := \frac{1}{l} \gamma_{lt}$ over $[0, 1]$ and let the terms in its signature be given by $(1, S(\tilde{\gamma})^1, \dots, S(\tilde{\gamma})^n, \dots)$. It suffices to prove the result under the assumption that $l = 1$ since the general result can then be recovered by noticing so that $c(\gamma) = c\left(\frac{1}{l} \gamma_{l \cdot}\right)$. We therefore assume that $l = 1$, and seek to prove that $c(\gamma) = \psi_1^{-1/4}$. To this end, we first observe that the unit-speed parameterisation of $\gamma$ gives that $\langle \gamma'_t, \gamma'''_t \rangle = -|\gamma''_t|^2$ for every $t$ in $[0, 1]$.
When used together with Eq. (8) this gives that
$$c(\gamma) = \mathbb{E}\left[\exp\left(-\int_0^1 (B_{0,s})^2 |\gamma''_s|^2\, ds\right)\right]^{1/2}.$$
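The expression $c(\gamma) = \mathbb{E}\left[\exp\left(-\int_0^1 (B_{0,s})^2 |\gamma''_s|^2\, ds\right)\right]^{1/2}$, obtained by combining Eq. (8) with the identity $\langle \gamma'_t, \gamma'''_t \rangle = -|\gamma''_t|^2$, can be evaluated by direct simulation. The sketch below is a rough illustration rather than the method of this section: it does not solve the differential equation, but estimates the expectation by Monte Carlo over discretised Brownian bridges. For a circular arc of constant curvature $\kappa$ it compares the estimate with the value $\left(\sqrt{2}\kappa / \sinh(\sqrt{2}\kappa)\right)^{1/4}$, which follows from the classical formula $\mathbb{E}\left[\exp\left(-\frac{\lambda^2}{2} \int_0^1 (B_{0,s})^2\, ds\right)\right] = (\lambda / \sinh \lambda)^{1/2}$ for the Brownian bridge. NumPy is assumed; function names, sample sizes and the time discretisation are hypothetical choices.

```python
import numpy as np


def hambly_lyons_limit_mc(curvature, paths=10_000, steps=200, seed=0):
    """Monte Carlo estimate of E[exp(-int_0^1 B_s^2 curvature(s)^2 ds)]^(1/2).

    `curvature` maps an array of times in [0, 1] to the values |gamma''_s|.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps
    t = np.linspace(0.0, 1.0, steps + 1)
    increments = rng.normal(scale=np.sqrt(dt), size=(paths, steps))
    w = np.concatenate([np.zeros((paths, 1)), np.cumsum(increments, axis=1)], axis=1)
    bridge = w - t * w[:, -1:]  # Brownian bridge B_{0,t} = W_t - t W_1
    integrand = bridge**2 * curvature(t) ** 2
    # trapezoidal rule in time, path by path
    integral = dt * (
        integrand[:, 1:-1].sum(axis=1) + 0.5 * (integrand[:, 0] + integrand[:, -1])
    )
    return float(np.sqrt(np.mean(np.exp(-integral))))


def hambly_lyons_limit_circle(kappa):
    """Closed form of c(gamma) for a unit-speed circular arc of curvature kappa."""
    lam = np.sqrt(2.0) * kappa
    return float((lam / np.sinh(lam)) ** 0.25)


if __name__ == "__main__":
    kappa = 2.0
    estimate = hambly_lyons_limit_mc(lambda s: kappa * np.ones_like(s))
    print("Monte Carlo:", estimate, "closed form:", hambly_lyons_limit_circle(kappa))
```

The closed form lies strictly below 1 for any $\kappa > 0$, consistent with the observation that $c(\gamma) < 1$ unless $\gamma$ is a straight line.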

Corollary 5 (Recovering the Hambly-Lyons limit). Let $\gamma : [0, 1] \to \mathbb{R}^d$ be a twice-continuously-differentiable unit-speed path with non-vanishing second derivative. Consider a probability measure $\mu$ on $\mathbb{R}$ supported on $(a, b)$ for some $a, b \in \mathbb{R}$ and having density $f$. Assume that the following four conditions are satisfied:
A. The density function $f$ is positive and differentiable,
B. Its associated I-function satisfies $I(t) = |\gamma''_t|^{-1}$ almost everywhere,
C. The distribution function $F$ and density function $f$ satisfy

Lemma 7. Let $\gamma : [0, 1] \to \mathbb{R}^d$ be a twice-continuously-differentiable unit-speed path with non-vanishing second derivative. Assume the existence of two constants $a < b \in \mathbb{R}$ and a twice-differentiable function $F : (a, b) \to [0, 1]$ satisfying

s | −1 is Lipschitz continuous, the Picard-Lindelöf Theorem ensures the existence of a unique differentiable function F : [a, b] → [0, 1] such that

Bi be their associated empirical probability measures. Then we have n j=1