Strong equivalence between metrics of Wasserstein type

The sliced Wasserstein and, more recently, max-sliced Wasserstein metrics have attracted abundant attention in data sciences and machine learning due to their ability to tackle the curse of dimensionality (see [10] and [6]). A question of particular importance is the strong equivalence between these projected Wasserstein metrics and the (classical) Wasserstein metric $W_p$. Recently, Paty and Cuturi have proved in [9] the strong equivalence of the max-sliced metric $\overline{W}_2$ and $W_2$. We show that the strong equivalence also holds for $p = 1$, while the sliced Wasserstein metric does not share this nice property.


Introduction
The Wasserstein metric arising in optimal transport theory defines a distance function between probability measures. In mathematical language, the Wasserstein distance of order $p \ge 1$ between probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ is defined as
$$W_p(\mu,\nu) := \Big( \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^p \, \gamma(dx, dy) \Big)^{1/p},$$
where $\Gamma(\mu,\nu)$ is the set of probability measures $\gamma$ on $\mathbb{R}^d \times \mathbb{R}^d$ having the marginal distributions $\mu$ and $\nu$. Theoretical advances in the last fifty years characterize existence, uniqueness, representation and smoothness properties of optimizers of $W_p(\mu,\nu)$ under different settings, and compute $W_p(\mu,\nu)$ by adopting tools and methods from PDE, linear programming and computational geometry, see e.g. [16], [19]; applications abound throughout most of the applied sciences, including economics, geography and biomedical sciences, see e.g. [17], [18]. Recently, the metric has attracted abundant attention in data sciences and machine learning due to its theoretical properties and applications. Paty and Cuturi showed in [14] that the max-sliced distance $\overline{W}_2$ is strongly equivalent to $W_2$. This paper aims to prove the analogous result for $p = 1$. The proof of our result is based on the dual formulation of $W_1$, the construction of a tailor-made topology $\tau$ on the space of Lipschitz functions on $\mathbb{R}^d$, and some functional analytic arguments. This is reminiscent of the universal approximation results in e.g. [5], i.e. any arbitrary Lipschitz function can be recovered from functional evaluations of projections. Although in the same spirit, our result here is different. In Theorem 3.5, we prove that there exists $C_d > 0$ such that the collection of functions of the form
$$x \mapsto \sum_{1 \le k \le n} a_k f_k(v_k \cdot x) \qquad (1.1)$$
is dense, under $\tau$, in the absorbing and convex set $\mathrm{Lip}_1(\mathbb{R}^d)$ of $1$-Lipschitz functions, where $n \in \mathbb{N}$, $a_k \ge 0$, $v_k \in S^{d-1}$, $f_k \in \mathrm{Lip}_1(\mathbb{R})$ for $1 \le k \le n$ and $\sum_{1 \le k \le n} a_k \le C_d$. Roughly speaking, any $1$-Lipschitz function on $\mathbb{R}^d$ can be approximated by a sequence of $C_d$-Lipschitz functions of the form (1.1).
We show further that the strong equivalence is not shared by the sliced Wasserstein metric using the recent results of [1], hence promoting the max-sliced metric over the sliced one.
The structure of the rest of the paper is simple. In the next section, after introducing some preliminaries we will give our main results in Theorem 2.3. Section 3, on the other hand, is devoted to the proof of these results. A technical lemma is presented in the Appendix.

Preliminaries on the Wasserstein metric
We start by reviewing the preliminary concepts and formulations needed to introduce the main results. For $p \ge 1$, let $\mathcal{P}_p(\mathbb{R}^d)$ be the set of probability measures on $\mathbb{R}^d$ admitting a finite $p$th moment, i.e. $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ if and only if $\int_{\mathbb{R}^d} |x|^p \, \mu(dx) < \infty$. Denote by $\Gamma(\mu,\nu)$ the collection of probability measures $\gamma$ on $\mathbb{R}^d \times \mathbb{R}^d$, also known as couplings or transport plans, having the marginal distributions $\mu$ and $\nu$. The Wasserstein metric of order $p$ is a distance function $W_p : \mathcal{P}_p(\mathbb{R}^d) \times \mathcal{P}_p(\mathbb{R}^d) \to \mathbb{R}_+$. It is known that $\mathcal{P}_p(\mathbb{R}^d)$ endowed with $W_p$ is a Polish space, i.e. a separable completely metrizable space, see e.g. Theorem 6.18 of [19]. In particular, an explicit expression of $W_p(\mu,\nu)$ is available for $d = 1$:
$$W_p(\mu,\nu)^p = \int_0^1 \big| F_\mu^{-1}(t) - F_\nu^{-1}(t) \big|^p \, dt, \quad \text{where } F_\mu^{-1}(t) := \inf\{x \in \mathbb{R} : \mu((-\infty, x]) > t\},$$
see e.g. Chapter 2 of [18]. We note also that $W_p$ depends on $d$, i.e. $W_p \equiv W_{p,d}$. Nevertheless, we will not emphasize this dependency in the rest and simply write $W_p$ without any danger of confusion.
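For equally weighted empirical measures on $\mathbb{R}$, the quantile formula reduces to matching order statistics. A minimal numerical sketch (the function name `wp_1d` is ours, not from the paper):

```python
import numpy as np

def wp_1d(x, y, p=1.0):
    """W_p between two equally weighted n-point measures on R:
    the quantile formula reduces to sorting both samples."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "equal weights require equal sample sizes"
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

# W_1 between {1,2,5} and {2,4,6}: (|1-2| + |2-4| + |5-6|)/3 = 4/3
print(wp_1d([5.0, 1.0, 2.0], [4.0, 2.0, 6.0], p=1.0))  # 1.333...
```

For general non-uniform discrete weights one integrates the quantile difference over $[0,1]$ directly; the sorting shortcut above only covers the equal-weight case.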

Projected Wasserstein metrics
While approaches based on the Wasserstein metric have been successful in several complex tasks, estimating the Wasserstein distance often suffers from the curse of dimensionality from the complexity/algorithmic perspective. To tackle this issue, a sliced version of the Wasserstein distance has been studied and employed, which only requires estimating distances between projected one-dimensional distributions and is therefore more efficient, see e.g. [3], [6], [13].
For $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $v \in S^{d-1}$, denote by $\mu_v$ the projection of $\mu$ along the direction $v$, i.e. the law of $v \cdot X$ for $X \sim \mu$. Hence, we may define the sliced Wasserstein metric $\underline{W}_p$ and the max-sliced Wasserstein metric $\overline{W}_p$ as follows:
$$\underline{W}_p(\mu,\nu) := \Big( \int_{S^{d-1}} W_p(\mu_v, \nu_v)^p \, \sigma(dv) \Big)^{1/p} \quad \text{and} \quad \overline{W}_p(\mu,\nu) := \sup_{v \in S^{d-1}} W_p(\mu_v, \nu_v),$$
where $\int_{S^{d-1}} \cdot \, \sigma(dv)$ denotes the surface integral over $S^{d-1}$.
(ii) $\underline{W}_p$ and $\overline{W}_p$ form two distance functions on $\mathcal{P}_p(\mathbb{R}^d)$.
Proof sketch. The map $v \mapsto W_p(\mu_v, \nu_v)$ is Lipschitz on $S^{d-1}$, which yields the Lipschitz continuity needed for the surface integral and, furthermore, the finiteness of $\underline{W}_p$; here the surface area of $S^{d-1}$ is $A_d = 2\pi^{d/2}/\Gamma(d/2)$, where $\Gamma : \mathbb{R} \to \mathbb{R}$ is the Gamma function. Then one has $\underline{W}_p(\mu,\nu) = \overline{W}_p(\mu,\nu) = 0$ if and only if $\mu_v = \nu_v$ for all $v \in S^{d-1}$. With $r := |z|$ and $v := z/r$, it follows for the characteristic functions that $\tilde{\mu}(z) = \tilde{\mu}_v(r) = \tilde{\nu}_v(r) = \tilde{\nu}(z)$, which implies $\tilde{\mu}(z) = \tilde{\nu}(z)$ for all $z \in \mathbb{R}^d$ and finally $\mu = \nu$.
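In practice both projected metrics are estimated by sampling directions. Below is a minimal Monte Carlo sketch for equally weighted empirical measures (function names ours); note that the average over sampled directions corresponds to the normalized surface integral, which differs from $\underline{W}_1$ as defined above by a constant factor, and the finite maximum only bounds $\overline{W}_1$ from below:

```python
import numpy as np

def w1_1d(x, y):
    # W_1 between equally weighted samples on R (order statistics).
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def sliced_w1(X, Y, n_dirs=1000, rng=None):
    """Monte Carlo estimates of the (normalized) sliced W_1 and of
    the max-sliced W_1 over the same sampled directions."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    V = rng.normal(size=(n_dirs, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # uniform on S^{d-1}
    proj = np.array([w1_1d(X @ v, Y @ v) for v in V])
    return proj.mean(), proj.max()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(200, 3)) + np.array([1.0, 0.0, 0.0])
sw, msw = sliced_w1(X, Y, rng=1)
print(sw, msw)  # the average over directions never exceeds the max
```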

Main results
Given the active theoretical interest in the Wasserstein metric, as well as its importance for applications in practice, the investigation of $\underline{W}_p$ and $\overline{W}_p$ is gaining popularity in machine learning, with several applications to data sciences. A question of particular importance is the equivalence between $W_p$, $\underline{W}_p$ and $\overline{W}_p$. Recently, Paty and Cuturi proved in [14] the strong equivalence of $\overline{W}_2$ and $W_2$, namely that $\overline{W}_2 \le W_2 \le C\,\overline{W}_2$ holds on $\mathcal{P}_2(\mathbb{R}^d)$ for some constant $C \ge 1$ depending only on $d$. In this paper, we show the (topological) equivalence between $W_p$, $\underline{W}_p$ and $\overline{W}_p$, as well as the strong equivalence between $W_1$ and $\overline{W}_1$, which are summarized in Theorem 2.3 below.
Theorem 2.3. (i) $W_p$, $\underline{W}_p$ and $\overline{W}_p$ are topologically equivalent, i.e.
$$\lim_{n\to\infty} W_p(\mu_n, \mu) = 0 \iff \lim_{n\to\infty} \underline{W}_p(\mu_n, \mu) = 0 \iff \lim_{n\to\infty} \overline{W}_p(\mu_n, \mu) = 0,$$
for any sequence $(\mu_n)_{n\ge1} \subset \mathcal{P}_p(\mathbb{R}^d)$ and $\mu \in \mathcal{P}_p(\mathbb{R}^d)$.
(ii) $W_1$ and $\overline{W}_1$ are strongly equivalent for all $d \ge 1$, i.e. there exists $C_d \ge 1$ such that
$$\overline{W}_1(\mu,\nu) \le W_1(\mu,\nu) \le C_d\, \overline{W}_1(\mu,\nu), \qquad (2.5)$$
for all $\mu, \nu \in \mathcal{P}_1(\mathbb{R}^d)$.
(iii) $W_1$ and $\underline{W}_1$ are not strongly equivalent for any $d \ge 2$.

Proof of Theorem 2.3 (i)
Since each projection $x \mapsto v \cdot x$ is $1$-Lipschitz, one has $W_p(\mu_v, \nu_v) \le W_p(\mu, \nu)$ for every $v \in S^{d-1}$, which yields the trivial inequalities
$$\underline{W}_p(\mu,\nu) \le A_d^{1/p}\, \overline{W}_p(\mu,\nu) \le A_d^{1/p}\, W_p(\mu,\nu),$$
where $A_d$ is defined in (2.4). Therefore, it remains to show the equivalence between $\overline{W}_p$ and $W_p$.
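These inequalities can be checked numerically on equally weighted point clouds, where $W_1$ itself is an optimal assignment problem (a sketch with our own function names; the direction average is the normalized sliced metric, so the $A_d^{1/p}$ factor drops out):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_exact(X, Y):
    """Exact W_1 between two equally weighted n-point clouds in R^d,
    computed as an optimal assignment."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    i, j = linear_sum_assignment(C)
    return C[i, j].mean()

def w1_proj(X, Y, v):
    # W_1 of the one-dimensional projections along direction v.
    return np.mean(np.abs(np.sort(X @ v) - np.sort(Y @ v)))

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 2.0
V = rng.normal(size=(500, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)
proj = np.array([w1_proj(X, Y, v) for v in V])
# normalized sliced <= max-sliced <= classical W_1
print(proj.mean(), proj.max(), w1_exact(X, Y))
```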
For the sake of presentation, we write $X \sim \mu$ to mean that the random variable $X$ is distributed according to the probability measure $\mu$, i.e. $\mathcal{L}(X) = \mu$, where $\mathcal{L}(X)$ denotes the law of $X$.
Step 1. For each $n \ge 1$, let $X_n$ be a random variable with $X_n \sim \mu_n$; by definition, $v \cdot X_n \sim \mu_{n,v}$ holds for all $v \in S^{d-1}$, and $W_p(\mu_{n,v}, \mu_v) \le \overline{W}_p(\mu_n, \mu)$ for every such $v$.¹
¹ This observation was suggested by the anonymous referee and provides a tractable scheme to approximate $C_d^*$ by solving an optimization problem over the compact subset $\mathcal{P}_1(B_1)$.
In view of the proof of Proposition 2.2, the maps $S^{d-1} \ni v \mapsto W_p(\mu_{n,v}, \mu_v)^p \in \mathbb{R}_+$ are equi-Lipschitz with a uniform Lipschitz constant $C + M_p(\mu)$, and thus $\lim_{n\to\infty} \underline{W}_p(\mu_n, \mu) = 0$.
Step 2. Consider the characteristic function of $\mu$ given by $\tilde{\mu}(z) := \mathbb{E}[e^{i z \cdot X}]$ for $X \sim \mu$, and define $\tilde{\mu}_n$ similarly for all $n \ge 1$. For every $z \in \mathbb{R}^d$, with $r := |z|$ and $v := z/r$, it holds that
$$\lim_{n\to\infty} \tilde{\mu}_n(z) = \lim_{n\to\infty} \tilde{\mu}_{n,v}(r) = \tilde{\mu}_v(r) = \tilde{\mu}(z),$$
where the second equality follows from (3.3). We conclude thus that $(\mu_n)_{n\ge1}$ converges weakly to $\mu$.
Step 3. Using the Skorokhod representation theorem, we may assume without loss of generality that the sequence (X n ) n≥1 converges almost surely. Denote by X ∼ µ its limit.
Combining with the uniform integrability of $(|X_n|^p)_{n\ge1}$, one has
$$\lim_{n\to\infty} W_p(\mu_n, \mu)^p \le \lim_{n\to\infty} \mathbb{E}\big[|X_n - X|^p\big] = 0,$$
which completes the proof.

Proof of Theorem 2.3 (ii)
Our proof is based on the dual formulation of W 1 and inspired by the proof of the universal approximation theorem.
Then it follows by Kantorovich's duality, see e.g. Remark 6.5 in [19], that
$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1(\mathbb{R}^d)} \Big\{ \int_{\mathbb{R}^d} f \, d\mu - \int_{\mathbb{R}^d} f \, d\nu \Big\}, \quad \text{for all } \mu, \nu \in \mathcal{P}_1(\mathbb{R}^d). \qquad (3.4)$$
In what follows, (3.4) will be used in the proof of Theorem 2.3. It is known from [12] that $\|\cdot\|_{\mathrm{lip}}$ defines a norm on $\mathrm{Lip}(\mathbb{R}^d)$ and that $(\mathrm{Lip}(\mathbb{R}^d), \|\cdot\|_{\mathrm{lip}})$ is a Banach space. Next we endow $\mathrm{Lip}(\mathbb{R}^d)$ with an alternative topology.
ECP 26 (2021), paper 13.
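For discrete measures, both sides of (3.4) are finite linear programs, so Kantorovich duality can be verified directly; a small sketch (setup and names ours), with the primal over couplings and the dual over the vector of values $f(x_i)$ constrained to be $1$-Lipschitz on the support:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete mu, nu on shared support points x_1..x_m in R^2.
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
mu = np.array([0.5, 0.5, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.5, 0.5])
m = len(x)
D = np.linalg.norm(x[:, None] - x[None, :], axis=2)  # ground costs

# Primal: min <D, gamma> over couplings gamma with marginals mu, nu.
A_eq = np.zeros((2 * m, m * m))
for i in range(m):
    A_eq[i, i * m:(i + 1) * m] = 1.0      # row sums equal mu
    A_eq[m + i, i::m] = 1.0               # column sums equal nu
primal = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                 bounds=(0, None), method="highs")

# Dual (Kantorovich-Rubinstein): max sum_i f_i (mu_i - nu_i)
# subject to f_i - f_j <= D_ij; linprog minimizes, so negate.
rows, rhs = [], []
for i in range(m):
    for j in range(m):
        if i != j:
            r = np.zeros(m); r[i], r[j] = 1.0, -1.0
            rows.append(r); rhs.append(D[i, j])
dual = linprog(-(mu - nu), A_ub=np.array(rows), b_ub=np.array(rhs),
               bounds=(None, None), method="highs")

print(primal.fun, -dual.fun)  # both equal W_1(mu, nu) = (1 + sqrt(5))/2
```

By LP duality the two values coincide up to solver tolerance; this is (3.4) restricted to finitely supported measures, where only the values of $f$ on the support matter.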
Set $\mathrm{Lip}(\mathbb{R}^d) := \{f : \mathbb{R}^d \to \mathbb{R} \text{ Lipschitz with } f(0) = 0\}$ and $\|f\|_{\mathrm{lip}} := \sup_{x \ne y} |f(x) - f(y)|/|x - y|$. Note in particular that every $f \in \mathrm{Lip}(\mathbb{R}^d)$ is a.e. differentiable with $\nabla f \in L^\infty(\mathbb{R}^d)^d$ and $\|\nabla f\|_\infty = \|f\|_{\mathrm{lip}}$, see e.g. Exercise 6.14 of [11]. Finally, write $L^1(\mathbb{R}^d, (1+|x|)dx)$ for the space of functions integrable with respect to $(1+|x|)dx$, and let $\tau_o$ be the collection of sets
$$\Big\{ f \in \mathrm{Lip}(\mathbb{R}^d) : \Big| \int_{\mathbb{R}^d} f u \, dx + \int_{\mathbb{R}^d} \nabla f \cdot w \, dx \Big| < \varepsilon \Big\}, \qquad (u, w) \in L^1\big(\mathbb{R}^d, (1+|x|)dx\big) \times L^1(\mathbb{R}^d)^d, \ \varepsilon > 0.$$
Then we define $\tau$ to be the topology generated by $\tau_o$. $(\mathrm{Lip}(\mathbb{R}^d), \tau)$ is a locally convex space, as the origin has a local base of absolutely convex absorbent sets, see e.g. Proposition 1.15 in Chapter 4 of [4].
Remark 3.1. (i) In view of [9], $\mathrm{Lip}(\mathbb{R}^d)$ is the dual space of some closed quotient space of $L^1(\mathbb{R}^d)^d$, and $\tau$ turns out to be the corresponding weak* topology. Hence $(\mathrm{Lip}(\mathbb{R}^d), \tau)$ is not metrizable, and thus not a Fréchet space, as this quotient space is infinite dimensional.
(ii) For any sequence $(g_n)_{n\ge1} \subset \mathrm{Lip}(\mathbb{R}^d)$ with $\lim_{n\to\infty} \|g_n\|_{\mathrm{lip}} = 0$, one has, for any $(u, w) \in L^1(\mathbb{R}^d, (1+|x|)dx) \times L^1(\mathbb{R}^d)^d$,
$$\lim_{n\to\infty} \Big( \int_{\mathbb{R}^d} g_n u \, dx + \int_{\mathbb{R}^d} \nabla g_n \cdot w \, dx \Big) = 0.$$
Therefore, the topology of $\|\cdot\|_{\mathrm{lip}}$ is strictly stronger than $\tau$.
The lemma below characterizes the space of $\tau$-continuous linear functionals on $\mathrm{Lip}(\mathbb{R}^d)$.
Let $C \subset \mathrm{Lip}_1(\mathbb{R}^d)$ be the subset of functions $f$ of the form
$$f(x) = \sum_{1 \le k \le n} a_k f_k(v_k \cdot x), \qquad n \in \mathbb{N}, \ a_k \ge 0, \ \sum_{1 \le k \le n} a_k \le 1, \ v_k \in S^{d-1}, \ f_k \in \mathrm{Lip}_1(\mathbb{R}).$$
Suppose, to the contrary, that there exists a non-null $\tau$-continuous linear functional $T$, represented by a pair $(u, w)$, vanishing on $C$. Mollifying with $\varphi_t$, one obtains
$$\int_{\mathbb{R}^d} f \, (u * \varphi_t) \, dx - \int_{\mathbb{R}^d} f \, \mathrm{div}(w * \varphi_t) \, dx = 0, \qquad (3.5)$$
where the integration by parts can be applied thanks to the convolution. Taking respectively ridge functions $f(v \cdot x)$ as test functions, one deduces $u * \varphi_t - \mathrm{div}(w * \varphi_t) \equiv 0$, and thus (3.5) holds for any $f \in \mathrm{Lip}(\mathbb{R}^d)$.
Letting $t \to 0$ and using the dominated convergence theorem for the first term and the Lebesgue-Besicovitch differentiation theorem, see e.g. page 43 of [7], for the second term, one has $T(f) = 0$ for any $f \in \mathrm{Lip}(\mathbb{R}^d)$, contradicting the fact that $T$ is not null.
(ii) Let $f \in \mathrm{Lip}(\mathbb{R}^d)$. Then there exists a net $(f_\lambda)_\lambda \subset D$ such that $f_\lambda$ converges to $f$ under $\tau$. Hence the continuous linear functionals $F_\lambda : (u, w) \mapsto \int_{\mathbb{R}^d} f_\lambda u \, dx + \int_{\mathbb{R}^d} \nabla f_\lambda \cdot w \, dx$ are pointwise bounded. By the uniform boundedness principle, it holds that $\sup_\lambda \|f_\lambda\|_{\mathrm{lip}} = \sup_\lambda \|\nabla f_\lambda\|_\infty = \sup_\lambda \|F_\lambda\| < \infty$. Thus $f \in mC$ for any $m \ge \sup_\lambda \|f_\lambda\|_{\mathrm{lip}}$.
Proposition 3.4. C is closed with respect to the norm · lip .
Proof. First, we show that the topology $\tau$ restricted to $\mathrm{Lip}_1(\mathbb{R}^d)$ is metrizable. Since $L^1(\mathbb{R}^d, (1+|x|)dx)$ and $L^1(\mathbb{R}^d)^d$ are separable, we may take two dense subsets $(u_i)_{i\ge1}$ and $(w_j)_{j\ge1}$ and define
$$\rho_{u_i, w_j}(f, g) := \Big| \int_{\mathbb{R}^d} (f - g) u_i \, dx + \int_{\mathbb{R}^d} \nabla(f - g) \cdot w_j \, dx \Big|.$$
Then, by a straightforward verification, the distance
$$\rho(f, g) := \sum_{i, j \ge 1} 2^{-i-j} \big( \rho_{u_i, w_j}(f, g) \wedge 1 \big)$$
is consistent with the topology $\tau$ restricted to $\mathrm{Lip}_1(\mathbb{R}^d)$. Second, we prove $C \subset \mathrm{Lip}_1(\mathbb{R}^d)$.
For any $f \in C$, one has a sequence $(f_n)_{n\ge1} \subset C$ converging to $f$, as $\tau$ is metrizable on $C$. For any $w \in L^1(\mathbb{R}^d)^d$, it follows that
$$\lim_{n\to\infty} \int_{\mathbb{R}^d} \nabla f_n \cdot w \, dx = \int_{\mathbb{R}^d} \nabla f \cdot w \, dx,$$
which means that $\nabla f_n$ converges to $\nabla f$ under the weak* topology of $L^\infty(\mathbb{R}^d)^d$. Note also that $(\nabla f_n)_{n\ge1}$ belongs to the unit ball $B_1^\infty$, which, in view of the Banach-Alaoglu theorem, is compact with respect to the weak* topology. Then the uniqueness of the weak* limit yields $\nabla f \in B_1^\infty$, and thus $f \in \mathrm{Lip}_1(\mathbb{R}^d)$. Hence $C \subset \mathrm{Lip}_1(\mathbb{R}^d)$. Let $f$ be in the closure of $C$ with respect to $\|\cdot\|_{\mathrm{lip}}$, and let $(f_n)_{n\ge1} \subset C$ satisfy $\lim_{n\to\infty} \|\nabla(f_n - f)\|_\infty = \lim_{n\to\infty} \|f_n - f\|_{\mathrm{lip}} = 0$, which implies in particular $\lim_{n\to\infty} \rho(f_n, f) = 0$, as $|f_n(x) - f(x)| \le \|f_n - f\|_{\mathrm{lip}} |x|$ for all $x \in \mathbb{R}^d$. For each $n \ge 1$, since $f_n \in C$, there exists $g_n \in C$ such that $\rho(g_n, f_n) \le 1/n$. Then $\lim_{n\to\infty} \rho(g_n, f) \le \lim_{n\to\infty} \big( \rho(g_n, f_n) + \rho(f_n, f) \big) = 0$, which concludes the proof.
Proof. In view of Propositions 3.3 and 3.4, $\mathrm{Lip}(\mathbb{R}^d) = \bigcup_{m \ge 1} mC$ and $mC$ is closed with respect to $\|\cdot\|_{\mathrm{lip}}$ for each $m \ge 1$. It now follows from Baire's category theorem that there must exist $m^* \ge 1$ such that $m^* C$ has non-empty interior.
where $C_d$ is the constant in Theorem 3.5. As $C$ is the collection of convex combinations of functions $f(v \cdot x)$ with $v \in S^{d-1}$ and $f \in \mathrm{Lip}_1(\mathbb{R})$, it follows that $W_1(\mu, \nu) \le C_d\, \overline{W}_1(\mu, \nu)$ whenever $\mu$ and $\nu$ admit a density. Hence, (2.5) is established in view of Lemma 3.7.
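A concrete low-dimensional instance of such a representation (our own illustration, not from the paper): in $d = 2$, since the average of $|\cos\theta|$ over $[0, 2\pi)$ is $2/\pi$, the $1$-Lipschitz function $f(x) = |x|$ equals $\pi/2$ times an average of the ridge functions $x \mapsto |v \cdot x|$:

```python
import numpy as np

# The Euclidean norm on R^2 as an average of ridge functions |v . x|:
# avg over theta of |cos(theta)| = 2/pi, so |x| = (pi/2) * avg_v |v . x|.
thetas = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
V = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # directions on S^1

x = np.array([3.0, -4.0])                 # |x| = 5
approx = (np.pi / 2.0) * np.mean(np.abs(V @ x))
print(approx)  # close to 5.0
```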
Lemma 3.7. The subset of probability measures admitting a density is dense in P 1 (R d ) under W 1 and W 1 .
Proof. Fix an arbitrary $\mu \in \mathcal{P}_1(\mathbb{R}^d)$ and take a density function $\varphi : \mathbb{R}^d \to \mathbb{R}_+$ with finite first moment, where the normalizing constant $c > 0$ is chosen such that $\int_{\mathbb{R}^d} \varphi(x) \, dx = 1$. Define the family of convolution measures $(\mu_t)_{t>0}$ by $\mu_t := \mu * \nu_t$, where the probability measure $\nu_t$ is identified by its density function $\varphi_t(x) := \varphi(x/t)/t^d$. By construction $\mu_t$ admits a density, and it remains to estimate $W_1(\mu_t, \mu)$ according to (3.4).
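The estimate behind the proof can be illustrated numerically: writing $\mu_t$ as the law of $X + tY$ with $Y \sim \nu_1$ independent of $X$, the coupling $(X, X + tY)$ gives $W_1(\mu_t, \mu) \le t\,\mathbb{E}|Y|$. A one-dimensional sketch on empirical measures (names ours):

```python
import numpy as np

def w1_1d(x, y):
    # W_1 between equally weighted samples on R (order statistics).
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
x = rng.exponential(size=500)   # sample from mu
y = rng.normal(size=500)        # mollifying noise Y ~ nu_1

# The coupling (X, X + tY) pairs each point with its mollified copy,
# so W_1(mu_t, mu) <= t * E|Y| holds at every scale t.
for t in (1.0, 0.1, 0.01):
    print(t, w1_1d(x + t * y, x), t * np.mean(np.abs(y)))
```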

Proof of Theorem 2.3 (iii)
To complete the proof of Theorem 2.3, we need an auxiliary result. Given a generic probability measure $\mu \in \mathcal{P}_p(\mathbb{R}^d)$, $\hat{\mu}_n$ is said to be its empirical measure (of order $n$) if
$$\hat{\mu}_n := \frac{1}{n} \sum_{k=1}^n \delta_{X_k},$$
where $(X_k)_{k\ge1}$ is a sequence of i.i.d. random variables such that $X_k \sim \mu$ for all $k \ge 1$. By Theorem 1 of [8], there exist $C(p, d, M_p(\mu)) > 0$ and a rate function $\chi_{p,d} : \mathbb{N} \to \mathbb{R}_+$ such that
$$\mathbb{E}\big[ W_p(\hat{\mu}_n, \mu)^p \big] \le C(p, d, M_p(\mu)) \, \chi_{p,d}(n).$$
The function $\chi_{p,d}$ is specified in [8], while we need to refer to [10] for the explicit expression of $C(p, d, M_p(\mu))$. For $d = 1$ and $p = 3$, the measure used below is a combination of Dirac masses $\delta_{G_k}(dx)$.
For any two sequences $(a_n)_{n\ge1}, (b_n)_{n\ge1} \subset \mathbb{R}_+$, we say $a_n \approx b_n$ if there exists a constant $c > 0$ such that $a_n / c \le b_n \le c\, a_n$ for all $n \ge 1$.
Then it follows by [1] that $\mathbb{E}[W_1(\hat{\mu}_n, \mu)]$ decays strictly more slowly than $n^{-1/2}$ for $d \ge 2$. On the other hand, one has by assumption $\mathbb{E}[\underline{W}_1(\hat{\mu}_n, \mu)] \approx \mathbb{E}[W_1(\hat{\mu}_n, \mu)]$, where we recall that $\mu_v$ and $\hat{\mu}_{n,v}$ are the projections of $\mu$ and $\hat{\mu}_n$ along the direction $v = (v_1, \ldots, v_d)$. Substituting $\mu_v$ and $\hat{\mu}_{n,v}$ into (3.6), one obtains an upper bound of order $n^{-1/2}$, where the second inequality follows from the one-dimensional rate; combined with (3.7), this yields a contradiction for $d \ge 2$ and concludes the proof.

Appendix
We start by recalling some elementary ingredients from functional analysis. Given a topological vector space $E$, we denote by $E^*$ its dual space, in separating duality via a bilinear form $\langle \cdot, \cdot \rangle$. The weak* convergence, denoted by $w^*$, is the convergence on $E^*$ induced by the elements of $E$, i.e. $(e_n^*)_{n\ge1} \subset E^*$ converges to $e^*$ under $w^*$ if and only if $\lim_{n\to\infty} \langle e_n^*, e \rangle = \langle e^*, e \rangle$ for all $e \in E$.
Endowed with $w^*$, the dual space of $E^*$ can be identified with $E$. In the following, we set $E = L^1(\mathbb{R}^d)^{d+1}$ and $E^* = L^\infty(\mathbb{R}^d)^{d+1}$, which are respectively the $(d+1)$-fold products of $L^1(\mathbb{R}^d)$ and $L^\infty(\mathbb{R}^d)$.
Denote by $L_{\mathrm{Lip}}(\mathbb{R}^d) := \{(f, \nabla f) : f \in \mathrm{Lip}(\mathbb{R}^d)\} \subset E^*$, which is a $w^*$-closed subspace. Then the dual space of $L_{\mathrm{Lip}}(\mathbb{R}^d)$, endowed with $w^*$, is included in the dual space of $L^1(\mathbb{R}^d)^{d+1}$, see e.g. page 129 in [4]. The proof is completed by the fact that the elements $(u, w)$ of $L^1(\mathbb{R}^d)^{d+1}$ represent all $\tau$-continuous linear functionals on $L_{\mathrm{Lip}}(\mathbb{R}^d)$.