Clustering of measures via mean measure quantization

This paper addresses the case where data come as point sets, or more generally as measures. Our goal is to build from data an embedding of these measures into a finite-dimensional Euclidean space that allows for provably efficient clustering of the source measures. The vectorization technique we propose relies on finding a compactly supported approximation of the mean measure of the generating process, which coincides with the intensity measure in the point process framework. To this aim we provide two algorithms that we prove almost minimax optimal. We assess the practical validity of our approach, first by showing that our results apply in the framework of persistence-based shape classification via the ATOL procedure described in [34]. At last, numerical experiments are carried out on simulated and real datasets, encompassing text classification and large-scale graph classification.


Introduction
This paper handles the case where we observe n i.i.d. measures X_1, . . . , X_n on R^d, rather than n i.i.d. sample points, the latter being the standard input of many machine learning algorithms. Such observations naturally arise in many situations, for instance when data are spatial point patterns: species distribution modeling [33], repartition of clusters of diseases [16], and modeling of crime repartition [36], to name a few. The framework of i.i.d. sample measures also encompasses the analysis of multi-channel time series, for instance in embankment dam anomaly detection from piezometers [21], as well as topological data analysis, where persistence diagrams naturally appear as discrete measures in R^2 [8, 10].
The objective of the paper is the following: we want to build from data an embedding of the sample measures into a finite-dimensional Euclidean space that preserves cluster structure, if any. Such an embedding, combined with standard clustering techniques, should result in efficient measure clustering procedures. It is worth noting that clustering of measures has a wide range of possible applications in the case where data come as a collection of finite point sets, for instance in ecology [33], genetics [34, 2], graph clustering [7, 20] and shape clustering [8].
Our vectorization scheme is quite simple: for a k-points vector c = (c 1 , . . . , c k ) in (R d ) k , we map each measure X i on R d into a vector v i ∈ R k that roughly encodes how much weight X i spreads around every c j . Then, in a mixture distribution framework, we provide general conditions on the k-points vectors that allow an efficient clustering based on the corresponding vectorization. Thus, efficiency of our measure clustering procedure essentially depends on our ability to find discriminative k-points vectors from sample.
Choosing a fixed grid of R d as vectorization points is possible, however the dimension of such a vectorization is constrained and can grow quite large. Another choice could be to draw vectorization points at random from the sample measures. Note however that such a method could miss discriminative areas. We choose an intermediate approach, by building a k-points approximation of a central measure of X 1 , . . . , X n . Note that finding compact representations of central tendencies of measures is an interesting problem in itself, especially when the sample measures are naturally organized around a central measure of interest, for instance in image analysis [14] or point processes modeling [33,16,36].
In [14], the central measure is defined as the Wasserstein barycenter of the distribution of measures. Namely, if we assume that X_1, . . . , X_n are i.i.d. measures on R^d drawn from X, where X is a probability distribution on the space of measures, then the central measure is defined as μ_W = arg min_ν E[W_2(X, ν)^2], where ν ranges over the space of measures and W_2 denotes the Wasserstein distance. Note that this definition only makes sense in the case where X(R^d) is constant a.s., that is, when we draw measures with the same total mass. Moreover, computing the Wasserstein barycenter of X_1, . . . , X_n in practice is too costly for large n's, even with approximating algorithms [14, 31]. To overcome these difficulties, we choose to define the central measure as the arithmetic mean of X, denoted by E(X), that assigns the weight E[X(A)] to a Borel set A. In point process theory, the mean measure is often referred to as the intensity measure of the process.
An easily computable estimator of this mean measure is the sample mean measure X̄_n = (1/n) Σ_{i=1}^n X_i. We intend to build a k-points approximation of E(X), that is, a distribution P_c supported by c = (c_1, . . . , c_k) that approximates E(X) well, based on X_1, . . . , X_n. To this aim, we introduce two algorithms (batch and mini-batch) that extend classical quantization techniques intended to solve the k-means problem [27]. In fact, these algorithms are built to solve the k-means problem for X̄_n. We prove in Section 3.1 that these algorithms provide minimax optimal estimators of a best possible k-points approximation of E(X), provided that E(X) satisfies some structural assumption. Interestingly, our results also prove optimality of the classical quantization techniques [27, 25] in the point sample case.
We assess the overall validity of our approach by giving theoretical guarantees on our vectorization and clustering process in a topological data analysis framework, where measures are persistence diagrams from different shapes. In this case, vectorization via evaluation onto a fixed grid is the technique exposed in [2], whereas our method has clear connections with the procedures described in [44, 34]. Our theoretical results are given for vectorizations via evaluations of kernel functions around each point c_j, for a general class of kernel functions that encompasses the one used in [34]. To the best of our knowledge, this provides the only theoretical guarantee on a measure-based clustering algorithm.
At last, we perform numerical experiments in Section 4 to assess the effectiveness of our method on real and synthetic data, in clustering and classification frameworks. In a nutshell, our vectorization scheme combined with standard algorithms provides state-of-the-art performances on various classification and clustering problems, at a lighter computational cost. The classification problems encompass sentiment analysis of IMDB reviews [26] as well as large-scale graph classification [35, 41]. Surprisingly, our somewhat coarse approach turns out to outperform more involved methods on several large-scale graph classification problems.
The paper is organized as follows: Section 2 introduces our general vectorization technique, and conditions that guarantee a correct clustering based on it. Section 3 gives some detail on how our vectorization points are built, by introducing the mean measure k-points quantization problem. Then, two theoretically grounded algorithms are described to solve this problem from the sample X_1, . . . , X_n. Section 3.3 investigates the special case where the measures are persistence diagrams built from samplings of different shapes, showing that all the previously exposed theoretical results apply in this framework. The numerical experiments are exposed in Section 4. Sections 5 and 6 gather the main proofs of the results. Proofs of intermediate and technical results are deferred to Appendix A.

Vectorization and clustering of measures
Throughout the paper we consider finite measures on the d-dimensional ball B(0, R) of the Euclidean space R^d, equipped with its Borel σ-algebra. Let M(R, M) denote the set of such measures with total mass smaller than M. For μ ∈ M(R, M) and f a Borel function from R^d to R, we denote by ∫ f(y) μ(dy) the integration of f with respect to μ, whenever ∫ |f(y)| μ(dy) is finite. We let X be a random variable with values in M(R, M) (equipped with the smallest σ-algebra such that ∫ f dX is measurable for any continuous and bounded f), and X_1, . . . , X_n be an i.i.d. sample drawn from X.

Vectorization of measures
We aim to provide a map v from M(R, M ) into a finite-dimensional space that preserves separation between sources. That is, if we assume that there exists (Z 1 , . . . , Z n ) ∈ [[1, L]] n a vector of (hidden) label variables such that

The intuition is the following. Let μ_1 ≠ μ_2 be two different measures on B(0, R). Then there exist x ∈ B(0, R) and r > 0 such that μ_1(B(x, r)) ≠ μ_2(B(x, r)). Thus, the vectorization that sends X_i to X_i(B(x, r)) allows to separate the sources X^(1) = δ_{μ_1} and X^(2) = δ_{μ_2}. We extend this intuition to the multi-class case. To adopt the quantization terminology, for k ∈ N* we let c_1, . . . , c_k be codepoints, and c = (c_1, . . . , c_k) ∈ (R^d)^k is called a codebook. For a given codebook c and a scale r > 0, we may represent a measure μ via the vector of weights (μ(B(c_1, r)), . . . , μ(B(c_k, r))) that encodes the mass that μ spreads around every codepoint c_j. Provided that the codepoints are discriminative (this will be discussed in the following section), separation between clusters of measures is preserved. In practice, convolution with kernels is often preferred to local masses (see, e.g., [34]). To ease computation, we restrict ourselves to the following class of kernel functions.
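As an illustration of the separation intuition above, the following minimal sketch (in Python, with illustrative names; measures are assumed to be given as weighted point clouds) computes the vector of ball masses and separates two measures that differ only around one codepoint:

```python
import numpy as np

def ball_mass_vector(points, weights, codebook, r):
    """Map a discrete measure (points, weights) to the vector of masses
    it puts in the balls B(c_j, r) around each codepoint c_j."""
    points = np.asarray(points, dtype=float)      # (m, d) support points
    weights = np.asarray(weights, dtype=float)    # (m,) masses
    codebook = np.asarray(codebook, dtype=float)  # (k, d) codepoints
    # dist[i, j] = ||x_i - c_j||
    dist = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2)
    return (weights[:, None] * (dist <= r)).sum(axis=0)  # (k,)

# Two measures that agree except around x = (1, 0): a single well-placed
# codepoint separates them.
mu1 = ([[0.0, 0.0], [1.0, 0.0]], [1.0, 1.0])
mu2 = ([[0.0, 0.0]], [2.0])
c = [[1.0, 0.0]]
v1 = ball_mass_vector(*mu1, c, r=0.5)   # mass near (1, 0): 1.0
v2 = ball_mass_vector(*mu2, c, r=0.5)   # mass near (1, 0): 0.0
```

Here a single codepoint placed in the discriminative area suffices: v1 and v2 differ in their first coordinate.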
Note that a (p, δ)-kernel is also a (q, δ)-kernel for q > p. This definition of a kernel function encompasses widely used kernels, such as the Gaussian or Laplace kernels. In particular, the function ψ(u) = exp(−u) used in [34] is a (p, 1/p)-kernel for every p ∈ N*. The 1-Lipschitz requirement is not necessary to prove that the representations of two separated measures will be well-separated. However, it is a key assumption to prove that the representations of two measures from the same cluster will remain close in R^k. From a theoretical viewpoint, a convenient kernel is ψ_0 : x ↦ (1 − ((x − 1) ∨ 0)) ∨ 0, which is a (1, 0)-kernel, thus a (p, 0)-kernel for all p ∈ N*.
From now on we assume that the kernel ψ is fixed and, for a k-points codebook c and scale factor σ, consider the vectorization v_{c,σ} : μ ↦ ( ∫ ψ(‖u − c_j‖/σ) μ(du) )_{j=1,...,k}. Note that the dimension of the vectorization depends on the cardinality of the codebook c. Guaranteeing that such a vectorization is appropriate for a clustering purpose is the aim of the following section.
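A minimal sketch of this kernel vectorization for discrete input measures (illustrative names; ψ_0 is the (1, 0)-kernel defined above):

```python
import numpy as np

def psi0(u):
    # psi_0(u) = (1 - ((u - 1) v 0)) v 0: equal to 1 on [0, 1],
    # decreasing linearly to 0 on [1, 2], and 0 beyond.
    return np.maximum(1.0 - np.maximum(u - 1.0, 0.0), 0.0)

def vectorize(points, weights, codebook, sigma, kernel=psi0):
    """v_{c,sigma}(mu)_j = integral of kernel(||u - c_j|| / sigma) dmu(u)."""
    points = np.asarray(points, float)
    weights = np.asarray(weights, float)
    codebook = np.asarray(codebook, float)
    dist = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2)
    return weights @ kernel(dist / sigma)   # shape (k,)

# A 3-point measure vectorized against a 2-point codebook.
v = vectorize([[0.0], [1.0], [10.0]], [1.0, 1.0, 1.0],
              [[0.0], [10.0]], sigma=1.0)
```

Each coordinate of v aggregates the mass that the measure spreads around the corresponding codepoint, at scale σ.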

Discriminative codebooks and clustering of measures
In this section we assume that there exists a vector (Z_1, . . . , Z_n) ∈ [[1, L]]^n of (hidden) label variables, and let M_1, . . . , M_L be such that, if Z_i = ℓ, then X_i ∈ M(R, M_ℓ). We also denote M = max_{ℓ ≤ L} M_ℓ, and investigate under which conditions the vectorization exposed in the above section provides a representation that is suitable for clustering. For a given codebook c, we introduce the following definition of (p, r, Δ)-shattering to quantify how well c allows to separate clusters.
In a nutshell, the codebook c shatters the sample if two measures from two different clusters have different masses around at least one of the codepoints of c, at scale r. Note that, for any i, j, X_i(B(c_j, r/p)) ≥ X_i({c_j}), so that a stronger definition of shattering in terms of the X_i({c_j})'s might be stated, in the particular case where X_i({c_j}) > 0. The following proposition ensures that a codebook which shatters the sample yields a vectorization into separated clusters, provided the kernel decreases fast enough.
A proof of Proposition 3 can be found in Section 5.1. This proposition sheds some light on how X_1, . . . , X_n has to be shattered with respect to the parameters of ψ. Indeed, assume that Δ = 1 (that is the case if the X_i's are integer-valued measures, such as counting processes for instance). Then, to separate clusters, one has to choose δ small enough compared to 1/M, and thus p large enough if ψ is non-increasing. Hence, the vectorization will roughly work if the support points of two different counting processes are rp-separated, for some scale r. This scale r then drives the choice of the bandwidth σ. Section 3.3 below provides an instance of shattered measures in the case where measures are persistence diagrams originating from well-separated shapes. If the requirements of Proposition 3 are fulfilled, then a standard hierarchical clustering procedure such as Single Linkage with the L∞ distance will separate the clusters at scales smaller than Δ/2. Note that, without further assumptions, this global clustering scheme can result in more than L clusters, nested into the ground-truth label classes. Now, to achieve a perfect clustering of the sample based on our vectorization scheme, we have to ensure that measures from the same cluster are not too far apart in terms of Wasserstein distance, implying in particular that they have the same total mass. This motivates the following definition. For p ∈ [[1, +∞]] and μ_1, μ_2 ∈ M(R, M) such that μ_1(R^d) = μ_2(R^d), we denote by W_p(μ_1, μ_2) the p-Wasserstein distance between μ_1 and μ_2.
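As an illustration of this clustering step, cutting a Single Linkage dendrogram in the sup-norm at height t amounts to taking the connected components of the graph that links vectorizations at ‖·‖_∞ distance at most t; a minimal self-contained sketch (illustrative names):

```python
import numpy as np

def single_linkage_cut(V, t):
    """Cluster the rows of V as the connected components of the graph
    linking points at sup-norm (L-infinity) distance <= t. This is
    exactly a Single Linkage dendrogram cut at height t."""
    V = np.asarray(V, float)
    n = len(V)
    parent = list(range(n))
    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    dist = np.abs(V[:, None, :] - V[None, :, :]).max(axis=2)  # pairwise L-inf
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= t:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

# Two groups of vectorizations separated by a sup-norm gap larger than t.
labels = single_linkage_cut([[0.0, 0.1], [0.1, 0.0],
                             [5.0, 5.1], [5.1, 5.0]], t=2.0)
```

On the separated-clusters event of Proposition 3, any cut height between the within-cluster scale and Δ/2 recovers the groups.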

Quantization of the mean measure
To find discriminative codepoints from a measure sample X_1, . . . , X_n, a naive approach could be to choose an ε-grid of B(0, R), for ε small enough. Such an approach ignores the sample, and results in quite large codebooks, with k ∼ (R/ε)^d codepoints. An opposite approach is to sample k_i points from every X_i, with k = Σ_{i=1}^n k_i, resulting in a k-points codebook that might miss some discriminative areas, especially when discriminative areas have small X_i-masses. We propose an intermediate procedure, based on quantization of the mean measure. Definition 6 below introduces the mean measure.
The empirical mean measure X̄_n may be defined via X̄_n(A) = (1/n) Σ_{i=1}^n X_i(A). In the case where the measures of interest are persistence diagrams, the mean measure defined above is the expected persistence diagram of [11]. If the sample measures are point processes, E(X) is the intensity measure of the process. It is straightforward that, if P(X ∈ M(R, M)) = 1, then both E(X) and X̄_n are (almost surely) elements of M(R, M).
For a codebook c = (c_1, . . . , c_k) ∈ B(0, R)^k, we let W_j(c) denote the Voronoi cell associated with c_j, that is, the set of points of R^d closer to c_j than to any other codepoint (with ties broken by lowest index), so that (W_1(c), . . . , W_k(c)) forms a partition of R^d, and N(c) represents the skeleton of the Voronoi diagram associated with c. To grid the support of E(X), we want to minimize the distortion F(c) = ∫ min_{j=1,...,k} ‖u − c_j‖² E(X)(du). According to [18, Corollary 3.1], since E(X) ∈ M(R, M), minimizers c* of F exist, and we let C_opt denote the set of such minimizers. As well, if M_k(R, M) denotes the subset of M(R, M) that consists of distributions supported by k points and c* ∈ C_opt, then P_{c*} is a best approximation of E(X) in M_k(R, M) in terms of the W_2 distance. Note that distortions, and thus optimal codebooks, may be defined for W_p, p ≠ 2. However, simple algorithms are available to approximate such a c* in the special case p = 2, as discussed below.
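For discrete sample measures, the distortion of a codebook with respect to the empirical mean measure X̄_n reduces to a weighted k-means cost on the pooled sample points, each point of each of the n clouds carrying mass 1/n; a minimal sketch (illustrative names, unit-mass points within each measure):

```python
import numpy as np

def distortion(clouds, codebook):
    """F_n(c) = integral of min_j ||u - c_j||^2 dX_bar_n(u), where X_bar_n
    pools the n sample point clouds, each point carrying mass 1/n."""
    codebook = np.asarray(codebook, float)
    n = len(clouds)
    total = 0.0
    for cloud in clouds:
        cloud = np.asarray(cloud, float)
        d2 = ((cloud[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        total += d2.min(axis=1).sum() / n   # assign each point to its cell
    return total

clouds = [[[0.0], [4.0]], [[0.0], [4.0]]]         # two identical 2-point measures
assert distortion(clouds, [[0.0], [4.0]]) == 0.0  # an optimal 2-point codebook
bad = distortion(clouds, [[0.0], [0.0]])          # both codepoints at 0
```

Minimizing this quantity over codebooks is exactly the k-means problem for X̄_n mentioned in the introduction.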

Batch and mini-batch algorithms
This section exposes two algorithms intended to approximate a best k-points codebook for E(X) from a measure sample X_1, . . . , X_n. These algorithms are extensions of two well-known clustering algorithms, namely Lloyd's algorithm [25] and MacQueen's algorithm [27]. We first introduce the counterpart of Lloyd's algorithm.

Algorithm 1. Batch algorithm (Lloyd)
Input: X_1, . . . , X_n and k;
# Initialization: sample a codebook c^(0) = (c^(0)_1, . . . , c^(0)_k);
for t = 0, . . . , T − 1:
  for j = 1, . . . , k:
    c^(t+1)_j = ∫_{W_j(c^(t))} u X̄_n(du) / X̄_n(W_j(c^(t)));
Output: c^(T) (codebook of the last iteration).
Note that Algorithm 1 is a batch algorithm, in the sense that every iteration needs to process the whole data set X_1, . . . , X_n. Fortunately, Theorem 9 below ensures that a limited number of iterations is required for Algorithm 1 to provide an almost optimal solution. In the sample point case, that is, when we observe n i.i.d. points X^(1)_1, . . . , X^(1)_n and X_i = δ_{X^(1)_i}, Algorithm 1 is the usual Lloyd's algorithm. In this case, the mean measure E(X) is the distribution of X^(1)_1, that is, the usual sampling distribution of the n i.i.d. points. The counterpart of MacQueen's algorithm [27] for standard k-means clustering is the following mini-batch algorithm. We let π_{B(0,R)^k} denote the projection onto B(0, R)^k.
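In this discrete setting, Algorithm 1 can be sketched as plain Lloyd iterations on the pooled sample points, each carrying the same mass (illustrative names; a crude random initialization is used here in place of repeated initializations):

```python
import numpy as np

def lloyd_mean_measure(clouds, k, T, seed=0):
    """Sketch of the batch algorithm: Lloyd iterations on the pooled
    sample points, i.e. k-means for the empirical mean measure X_bar_n."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(c, float) for c in clouds])
    c = pooled[rng.choice(len(pooled), size=k, replace=False)]  # init
    for _ in range(T):
        d2 = ((pooled[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        cell = d2.argmin(axis=1)          # Voronoi cell of each pooled point
        for j in range(k):
            if np.any(cell == j):         # leave c_j unchanged if cell is empty
                c[j] = pooled[cell == j].mean(axis=0)
    return c

clouds = [[[0.0], [10.0]], [[0.1], [9.9]], [[-0.1], [10.1]]]
c = lloyd_mean_measure(clouds, k=2, T=10)
```

On this toy sample the two codepoints converge to the two accumulation areas of the mean measure, near 0 and 10.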
Note that the split of the mini-batches into two halves is motivated by technical considerations only, which are exposed in Section 6.2. Whenever n_i = 1 for i = 1, . . . , n and the mini-batches have size 1, Algorithm 2 is a slight modification of the original MacQueen algorithm [27]. Indeed, the MacQueen algorithm takes mini-batches of size 1, and estimates the population of cell j at the t-th iteration via the number of sample points assigned to that cell so far. These modifications are motivated by Theorem 10, which guarantees near-optimality of the output of Algorithm 2, provided that the mini-batches are large enough.
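A simplified sketch in the spirit of Algorithm 2 (illustrative names; the projection π_{B(0,R)^k} and the split of each mini-batch into two halves are omitted, so this is closer to a MacQueen-style update than to Algorithm 2 itself):

```python
import numpy as np

def minibatch_quantize(clouds, k, batch_size=2, seed=0):
    """One pass over mini-batches of measures: move each codepoint toward
    the batch centroid of its Voronoi cell, with step size decreasing in t.
    (Projection onto B(0, R)^k and the batch split are omitted here.)"""
    rng = np.random.default_rng(seed)
    clouds = [np.asarray(c, float) for c in clouds]
    order = rng.permutation(len(clouds))
    c = clouds[order[0]][:k].copy()            # crude init from one measure
    t = 0
    for start in range(0, len(order), batch_size):
        t += 1
        batch = np.concatenate([clouds[i]
                                for i in order[start:start + batch_size]])
        d2 = ((batch[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        cell = d2.argmin(axis=1)
        for j in range(k):
            if np.any(cell == j):              # skip empty cells
                c[j] += (batch[cell == j].mean(axis=0) - c[j]) / (t + 1)
    return c

clouds = [[[0.0], [10.0]]] * 6
c = minibatch_quantize(clouds, k=2)
```

Each mini-batch is seen once, so a pass over the whole sample costs a single scan of the data, in contrast with the batch algorithm.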

Theoretical guarantees
We now investigate the distortions of the outputs of Algorithms 1 and 2. In what follows, F* denotes the optimal distortion achievable with k points, that is, F* = F(c*) for c* ∈ C_opt. Note that F* = 0 if and only if E(X) is supported by at most k points, in which case the quantization problem is trivial. In what follows we assume that E(X) is supported by more than k points. Basic properties of C_opt are recalled below.
We will further assume that E(X) satisfies a so-called margin condition, defined in [22, Definition 2.1] and recalled below. For any subset A of an Euclidean space, and r > 0, we denote by B(A, r) the set ∪ u∈A B(u, r).
In a nutshell, a margin condition ensures that the mean distribution E(X) is well-concentrated around k poles. For instance, finitely-supported distributions satisfy a margin condition. Following [24], a margin condition will ensure that usual k-means-type algorithms are almost optimal in terms of distortion. To the best of our knowledge, margin-like conditions are always required to guarantee convergence of Lloyd-type algorithms [37, 24].
A special interest will be paid to the sample-size dependency of the excess distortion. From this standpoint, a first negative result may be derived from [24, Proposition 7], which provides a lower bound of order 1/n on the excess distortion of any empirically designed codebook ĉ. In fact, this bound holds in the special case where X satisfies the additional assumption X = δ_{X^(1)} a.s., pertaining to the vector quantization case; thus it holds in the general case. This small result ensures that the sample-size dependency of the minimax excess distortion, over the class of distributions of measures whose mean measure satisfies a margin condition with radius r_0, is of order 1/n or greater. A first upper bound on this minimax excess distortion may be derived from Theorem 9 below, which investigates the performance of the output of Algorithm 1. We recall that, for N ∈ N*, M_N(R, M) denotes the subset of measures in M(R, M) that are supported by N points.
satisfies a margin condition with radius r 0 , and denote by R 0 = Br0 for all x > 0, where C is a constant.

A proof of Theorem 9 is given in Section 6.1. Combined with (2), Theorem 9 ensures that Algorithm 1 reaches the minimax precision rate in terms of excess distortion after O(log(n)) iterations, provided that the initialization is good enough. Note that Theorem 9 is valid for discrete measures that are supported by a uniformly bounded number of points N max . This assumption is relaxed for Algorithm 2 in Theorem 10.
In the vector quantization case, Theorem 9 might be compared with [22, Theorem 3.1] for instance. In this case, the dependency on the dimension d provided by Theorem 9 is sub-optimal. Slightly anticipating, dimension-free bounds in the mean-measure quantization case exist, for instance by considering the output of Algorithm 2.
Theorem 9 guarantees that choosing T = 2 log(n) and repeating several Lloyd algorithms starting from different initializations provides an optimal quantization scheme. From a theoretical viewpoint, for a fixed R_0, the number of trials needed to get a good initialization grows exponentially with the dimension d. In fact, Algorithm 1 seems quite sensitive to the initialization step, as is every EM-like algorithm. From a practical viewpoint, we use 10 initializations with k-means++ sampling [3] for the experiments of Section 4, leading to results that are close to those reported for Algorithm 2.
Note that combining [24, Theorem 3] or [37] with a deviation inequality for distortions such as in [22] gives an alternative proof of the optimality of Lloyd-type schemes, in the sample points case where X_i = δ_{X^(1)_i}. In addition, Theorem 9 provides an upper bound on the number of iterations needed, as well as an extension of these results to the mean measure quantization case. Its proof, which may be found in Section 6.1, relies on stochastic gradient techniques in the convex and non-smooth case. Bounds for the single-pass Algorithm 2 might be stated in the same way.
In the sample point case, the same result holds with the usual centroid update, that is, without splitting the batches.
A proof of Theorem 10 is given in Section 6.2. Note that Theorem 10 does not require the values of X to be finitely supported measures, contrary to Theorem 9. Theorem 10 entails that the resulting codebook of Algorithm 2 has an optimal distortion, up to a log(n) factor and provided that a good enough initialization is chosen. As for Algorithm 1, several initializations may be tried and the codebook with the best empirical distortion retained. In practice, the results of Section 4 are given for a single initialization at random using k-means++ sampling [3].
Note that Theorem 10 provides a bound on the expectation of the distortion. Crude deviation bounds can be obtained using for instance a bounded difference inequality (see, e.g., [6, Theorem 6.2]). In the point sample case, more refined bounds can be obtained, using for instance [22, Theorem 4]. Investigating whether this kind of bounds still holds in the measure sample case is beyond the scope of the paper. Note also that the bound on the excess distortion provided by Theorem 10 does not depend on the dimension d. This is also the case in [22, Theorem 3.1], where a dimension-free theoretical bound on the excess distortion of an empirical risk minimizer is stated in the sample points case. Interestingly, this bound also has the correct dependency on n, namely 1/n. In view of Theorems 9 and 10, providing a quantization scheme that provably achieves a dimension-free excess distortion of order 1/n in the sample measure case remains an open question.
We may now connect Theorems 9 and 10 to the results of Section 2.2.

Corollary 11.
Assume that E(X) satisfies the requirements of Theorem 9, and that c* provides a (p, r, Δ)-shattering of X_1, . . . , X_n. A proof of Corollary 11 is given in Section 6.3. Corollary 11 ensures that if c* shatters X_1, . . . , X_n well, then so does ĉ_n, where ĉ_n is built with Algorithm 1. To fully assess the relevance of our vectorization technique, it remains to prove that k-points optimal codebooks for the mean measure provide a shattering of the sample measures, with high probability. This kind of result requires more structural assumptions on the components of the mixture X. The following section describes such a situation.

Application: clustering persistence diagrams
In this section we investigate the properties of our measure-clustering scheme in a particular instance of i.i.d. measure observations. Recall that in our mixture model, conditionally on Z_i = ℓ, X_i is drawn from the ℓ-th mixture component. Our mixture of persistence diagrams model is the following. For ℓ ∈ [[1, L]], let S^(ℓ) denote a compact d_ℓ-manifold and D^(ℓ) denote the persistence diagram generated by d_{S^(ℓ)}, the distance to S^(ℓ). This persistence diagram can either be thought of as a multiset of points (set of points with integer-valued multiplicities) in the half-plane H^+ (for a general introduction to persistence diagrams the reader is referred to [5, Section 11.5]), or as a discrete measure on H^+ (see, e.g., [11] for more details on the measure point of view on persistence diagrams). For a fixed scale s > 0, we define D^(ℓ)_{≥s} as the thresholded persistence diagram of d_{S^(ℓ)}, that is, the restriction of D^(ℓ) to the points at distance at least s from the diagonal, where the multiplicities n(b, d) ∈ N* and the m_ℓ's are defined for some bandwidth h_ℓ. To sum up, conditionally on Z_i = ℓ, X_i is a thresholded persistence diagram corresponding to an N_ℓ-sampling of the shape S^(ℓ).
To retrieve the labels Z i 's from a vectorization of the X i 's, we have to assume that the persistence diagrams of the underlying shapes differ by at least one point.
where the thresholded persistence diagrams are considered as measures.
Note that if m_{ℓ_1,ℓ_2} satisfies the discrimination condition stated above, then so do the thresholded persistence diagrams D^(ℓ)_{≥s}. Next, we must ensure that optimal codebooks of the mean measure have codepoints close enough to the discrimination points. Let k ≥ k_0 + K_0(h), and let (c*_1, . . . , c*_k) denote an optimal k-points quantizer of E(X). Then, provided that N_ℓ is large enough for all ℓ, the conclusion of Proposition 14 holds. The proof of Proposition 14 is given in Section 6.4. If D̄_{≥s} denotes the mean persistence diagram Σ_{ℓ=1}^L π_ℓ D^(ℓ)_{≥s}, and D̄_{≥s} has K_0 points, then it is immediate that K_0(h) ≤ K_0. Proposition 14 ensures that the discrimination points are well enough approximated by optimal k-centers of the expected persistence diagram E(X), provided the shapes S^(ℓ) are well-enough sampled and k is large enough so that D̄_{≥s} is well-covered by k balls with radius h. Note that this is always the case if we choose k = K_0, but smaller k's are also allowed.
In turn, provided that the shapes S^(1), . . . , S^(L) are discriminable at scale s and that k is large enough, we can prove that an optimal k-points codebook c* provides a (p, r, Δ)-shattering of the sample, with high probability.

Proposition 15.
Assume that the requirements of Proposition 14 are satisfied.
A proof of Proposition 15 is given in Section 6.5. In turn, Proposition 15 can be combined with Proposition 5 and Corollary 11 to provide guarantees on the output of Algorithm 1 combined with a suitable kernel. We choose to give results for the theoretical kernel ψ_0 : x ↦ (1 − ((x − 1) ∨ 0)) ∨ 0, and for the kernel used in [34], ψ_AT : x ↦ exp(−x).

Corollary 16.
Assume that the requirements of Proposition 15 are satisfied, and that E(X) satisfies a margin condition. For short, let v_i denote the vectorization of X_i based on the output of Algorithm 1. Then, with probability larger than the stated bound, where κ and C are small enough constants, the conclusion holds for σ ∈ [r, 2r] and the following choices of p and r: • If ψ = ψ_AT, p_AT = 4M and r_AT = B̄/(32 p_AT).
A proof of Corollary 16 is given in Section 6.6. Corollary 16 can be turned into probability bounds on the exactness of the output of hierarchical clustering schemes applied to the vectorized sample. For instance, on the probability event described by Corollary 16, Single Linkage with the ‖·‖_∞ norm provides an exact clustering. The probability bound in Corollary 16 sheds some light on the quality of sampling of each shape that is required to achieve a perfect classification: roughly, for N_ℓ in Ω(log(n)), the probability of misclassification can be controlled. Note that though the key parameter B̄ is not known, in practice it can be scaled as several times the minimum distance between two points of a diagram.
Investigating whether the margin condition required by Corollary 16 is satisfied in the mixture of shapes framework is beyond the scope of the paper. This condition is needed to prove that the algorithms exposed in Section 3.1 provide provably good approximations of optimal c*'s (which in turn can be proved to be discriminative codebooks). Experimental results below assess the validity of our algorithms in practice.

Experimental results
This section investigates the numerical performances of our vectorization and clustering procedure. The vectorization step is carried out using the Atol procedure (Automatic Topologically-Oriented Learning, [34]): codebooks are computed using Algorithm 2, and the exponential kernel ψ_AT is used to construct the embedding as in (1). As a remark, Algorithm 1 combined with other kernels such as ψ_0 leads to similar results in our three experiments.
It is worth noting that our method requires little to no tuning: for all experiments we use mini-batches of size 1000 and the same calibration scheme (a random 10% of the sample measures is used for learning the space quantization). We use the same kernel ψ_AT; only the Atol budget k (the length of the vectorization) will sometimes vary across the three experiments.
Our numerical experiments encompass synthetic measure clustering, but also investigate the performance of our vectorization technique on two (supervised) classification problems: large-scale graph classification and text classification. On these problems, we achieve state-of-the-art performances.

Measure clustering
In this first experiment, we settle in the framework of Section 2.2. Namely, we consider an i.i.d. sample drawn from a mixture of measures, and evaluate our method on a clustering task. We generate a synthetic mixture of L measures. We first choose p − 1 centers shared by all mixture components on the (d − 1)-dimensional sphere of radius 10 and denote them by {S_i^d | i = 1, . . . , p − 1}; then for each mixture component ℓ another center is placed at a separate vertex of the d-dimensional unit cube (implying L ≤ 2^d), labeled Cube_ℓ|d for ℓ ∈ [[1, L]], so that this inner center is the only center that varies amongst mixture components, see instances in Figure 1. The support centers C_ℓ for the ℓ-th mixture component are given by C_ℓ = {S_i^d | i = 1, . . . , p − 1} ∪ {Cube_ℓ|d}. For every mixture component ℓ and signal level r > 0, we then make N normal simultaneous draws with variance 1 centered around every element of r × C_ℓ, resulting in a point cloud of cardinality p × N that is interpreted as a measure.
To sum up, the ℓ-th mixture component X^(ℓ) has the distribution of Σ_{c ∈ C_ℓ} Σ_{j=1}^N δ_{rc + ε_{c,j}}, where the ε's are i.i.d. standard d-dimensional Gaussian random variables. We compare with several measure clustering procedures, some of them based on alternative vectorizations. Namely, the rand procedure randomly chooses codepoints among the sample measures' points, whereas the grid procedure takes as codepoints a regular grid of [0, 10r]^d. For a fair comparison we use the same kernel ψ_AT as in Atol to vectorize from these two codebooks.
We also compare with two other standard clustering methods: the histogram procedure, that learns a regular tiling of the space and vectorizes measures by counting inside the tiles, and the W2-spectral method that is just spectral clustering based on the W 2 distance (to build the Gram matrix). Note that the histogram procedure is conceptually close to the grid procedure.
At last, for all methods the final clustering step consists in a standard k-means clustering with 100 initializations.
We designed these mixtures to exemplify problems where the data are significantly separated (by L discriminative points corresponding to hypercube vertices), but also very similar, since all components include draws from the same set of centers on the outer sphere. In this experiment r may be thought of as the strength of the signal separation, and p can be interpreted as a measure of noise. Empirically we use two different setups for the number of support centers p: p = 4 or p = 20, interpreted respectively as low and high noise situations in Figure 1. We fix L = 3, and generate measures in dimensions 2 and 5. We also set N = 25 and sample every mixture component 20 times, so that the final sample of measures consists of 60 i.i.d. finite measures, each with 25 × p points.
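For reference, a possible generator for this synthetic mixture, sketched in dimension d = 2 (illustrative names; the shared sphere centers are drawn at random here):

```python
import numpy as np

def sample_measure(label, sphere_centers, cube_vertices, r, N, rng):
    """Draw one measure from mixture component `label`: N Gaussian points
    around each scaled support center (the shared sphere centers plus the
    component's own cube-vertex center)."""
    centers = np.vstack([sphere_centers, cube_vertices[label][None, :]])
    reps = np.repeat(r * centers, N, axis=0)           # (p * N, d)
    return reps + rng.standard_normal(reps.shape)      # unit-variance noise

rng = np.random.default_rng(0)
d, p, L, N, r = 2, 4, 3, 25, 1.0
# p - 1 shared centers on the circle of radius 10, one cube vertex per label.
theta = rng.uniform(0, 2 * np.pi, size=p - 1)
sphere_centers = 10.0 * np.c_[np.cos(theta), np.sin(theta)]
cube_vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # L <= 2^d
X = [sample_measure(i % L, sphere_centers, cube_vertices, r, N, rng)
     for i in range(60)]
```

Each of the 60 measures is a point cloud of p × N = 100 points, with labels cycling through the L = 3 components.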

Fig 1. Synthetic 3-mixture of measures instance (blue, dark brown and pink) and their support centers for indication (dark green for the outer centers, yellow, green and brown for the middle centers) with either 4 (left) or 20 (right) support centers in dimension 2, r = 1. The sets of points similarly coloured constitute measures.
Regardless of the dimension, for the p = 4 setup, W2-spectral takes about 4.3 seconds on average, against 0.2 seconds for Atol, and the other methods are naturally faster. Note that in the p = 20 case, the W2-spectral clustering procedure takes too long to compute and is not reported, whereas Atol runs in 0.5 seconds on average. Each experiment is performed identically a hundred times, and the resulting average normalized mutual information (NMI) with the true labels and 95% confidence intervals are presented for each method. Two sets of experiments are performed, corresponding to the following two questions. Q1: at a given signal level r = 1, for increasing budget (that is, the size of the codebooks used in the vectorization process), how accurate are the methods at clustering 3-mixtures? Q1 aims at comparing the potential of the various methods at capturing the underlying signal with respect to budget.
Results are displayed in Figure 2. The Atol procedure produces the best results in almost all situations; namely, it is the procedure most likely to cluster the mixture exactly while requiring the smallest budget to do so. The grid procedures grid and histogram naturally suffer from the growth of the dimension (bottom panels). The oscillations in the performance of the histogram procedure, visible in the upper-right panel, may be due to the existence of a center tile (which cannot discriminate between several vertices of the cube) that our overly simple implementation allows. To draw a fair comparison, the local minima of the histogram curve are not considered representative, leading to an overall flat behavior of this method in the case d = 2, p = 20. In dimension 2 the grid procedure proves very efficient, but for measures in dimension 5 it seems always better to select codepoints at random rather than to use some instance of a regular grid. Whenever possible, W2-spectral outperforms all vectorization-based methods, at the price of a significantly larger computational cost. Note however that the performance gap with Atol is discernible for small budgets only (k ≤ 15).
Q2: for a given budget k = 32 and increasing signal level r, how accurate are the methods at clustering 3-mixtures? Results are shown in Figure 3. Atol and grid give similar results, and compare well to W2-spectral in the case p = 4. Figure 3 also confirms the intuition that the rand procedure is more sensitive to p than its competitors, resulting in significantly weaker performance when p = 20.

Large-scale graph classification
We now look at the large-scale graph classification problem. Provided that graphs may be considered as measures, our vectorization method provides an embedding into R k that may be combined with standard learning algorithms. In this framework, our vectorization procedure can be thought of as a dimensionality reduction technique for supervised learning, that is performed in an unsupervised way. Note that the notion of shattering codebook introduced in Section 2.2 is still relevant in this context, since a (p, r, Δ)-shattering codebook would lead to exact sample classification if combined with a classification tree for instance. Theoretically investigating the generalization error of such an approach is beyond the scope of the paper.
As exposed above, since Atol is an unsupervised procedure that is not specifically conceived for graphs, it is likely that other dedicated techniques would provide better performances in this graph classification problem. However, its simplicity can be a competitive advantage for large-scale applications.
There are multiple ways to interpret graphs as measures. In this section we borrow from [7]: for a diffusion time t > 0 we compute the Heat Kernel Signatures (HKS_t) for all vertices in a graph, so that each graph G(V, E) is embedded in $\mathbb{R}^{|V|}$ (see, e.g., [7, Section 2.2]). Then four graph descriptors per diffusion time may be computed using the extended persistence framework (see, e.g., [7, Section 2.1]), that is, four types of persistence diagrams (PDs). Schematically, for a graph G(V, E) the descriptors are derived as:
$$G(V, E) \;\longrightarrow\; (\mathrm{HKS}_t(v))_{v \in V} \in \mathbb{R}^{|V|} \;\longrightarrow\; \text{four extended PDs.} \qquad (4)$$
In these experiments we will show that the graph embedding strategy of (4), paired with Atol, can perform up to the state of the art on large-scale graph classification problems. We will also demonstrate that Atol is well assisted by, but not dependent on, the aforementioned TDA tools.
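The first step of the pipeline, the Heat Kernel Signature, can be sketched as follows; whether [7] uses the normalized or the combinatorial graph Laplacian is an assumption on our part.

```python
import numpy as np

def heat_kernel_signature(adj, t):
    """HKS_t for every vertex of a graph given its adjacency matrix:
    HKS_t(v) = sum_i exp(-t * lam_i) * phi_i(v)**2, where (lam_i, phi_i)
    are the eigenpairs of the normalized graph Laplacian (the Laplacian
    convention is our assumption)."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    lam, phi = np.linalg.eigh(lap)  # phi[:, i] is the i-th eigenvector
    return (np.exp(-t * lam)[None, :] * phi**2).sum(axis=1)  # shape (|V|,)

# a 4-cycle: by vertex-transitivity every vertex gets the same signature
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
hks = heat_kernel_signature(adj, t=0.1)
```

Summing the signature over the vertices recovers the heat trace $\sum_i e^{-t\lambda_i}$, the global variant mentioned below for NetLSD.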

Large-scale binary classification from [35]
Recently, [35] introduced large-scale graph datasets of social or web origin. For each dataset the associated task is binary classification. The authors perform an 80% train/test split of the data and report the mean area under the curve (AUC), along with standard errors over a hundred experiments, for the following graph embedding methods. SF from [15] is a simple graph embedding method that extracts the k lowest spectral values from the graph Laplacian and uses a standard Random Forest Classifier (RFC) for classification. NetLSD from [38] uses a more refined representation of the graph Laplacian, the heat trace signature (a global variant of the HKS) computed over 250 diffusion times, with a 1-layer neural network (NN) for classification. FGSD from [40] computes the biharmonic spectral distance of a graph and uses histogram vectorization with small binwidths, resulting in vectorizations of size ranging in [100, 1000000], with a Support Vector Machine (SVM) classifier. GeoScattering from [19] uses the graph wavelet transform to produce 125 graph embedding features, also with an SVM classifier.
We add our own results for Atol paired with (4): we use the extended persistence diagrams computed from the HKS values with diffusion times t_1 = .1, t_2 = 10 as input for Atol, and vectorize the diagrams with budget k = 10 for each diagram type and diffusion time, so that the resulting vectorization for a graph G is $v_{\mathrm{Atol}}(G) \in \mathbb{R}^{2 \times 4 \times 10}$. We then train a standard RFC (as in [15]; we use the implementation from sklearn [30] with all default parameters) on the resulting vectorized measures.
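A minimal sketch of such a vectorize-then-classify pipeline on generic planar measures is given below; the codebook fit and the Laplacian-type kernel around each codepoint are our reading of the Atol procedure of [34], and the exact kernel and scales may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def atol_vectorize(measures, k=10, seed=0):
    """Atol-style vectorization sketch: fit a k-point codebook on the
    union of all sample points, then map each measure mu to the vector
    of kernel masses around each codepoint (kernel shape and per-point
    scales are assumptions, not necessarily those of [34])."""
    all_pts = np.concatenate(measures)
    centers = KMeans(n_clusters=k, n_init=10,
                     random_state=seed).fit(all_pts).cluster_centers_
    # per-codepoint scale: distance to the nearest other codepoint
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    scale = d.min(axis=1)
    def vec(mu):
        dist = np.linalg.norm(mu[:, None, :] - centers[None, :, :], axis=-1)
        return np.exp(-dist / scale[None, :]).sum(axis=0)  # in R^k
    return np.stack([vec(m) for m in measures])

# toy input standing in for persistence diagrams: 20 planar point sets
measures = [np.random.default_rng(i).normal(size=(30, 2)) for i in range(20)]
X = atol_vectorize(measures, k=10)  # each measure -> a point of R^10
y = np.arange(20) % 2
clf = RandomForestClassifier(random_state=0).fit(X, y)  # RFC with defaults
```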
The results are shown in Table 1. Atol is close to or above the state-of-the-art on all four datasets. Most of these methods operate directly on graph Laplacians, so they are fairly comparable, both in essence and in the dimension of the embedding that is used. The most positive result, on github stargazers, improves on the best competing method by more than 6 points.

A variant of the large-scale graph classification from [34]
The graph classification tasks above were binary. [41] introduced now-popular datasets of large-scale graphs associated with multiple classes. These datasets have been tackled with top-performing graph methods, including the graph kernel methods RetGK from [42], WKPI from [43] and GNTK from [17] (combined with a graph neural network), and the aforementioned graph embedding method FGSD from [40]. PersLay from [7] is not specifically designed for graphs but uses (4) as input for a 1-NN classifier. Lastly, in [34], the authors also used (4) and the same diagrams as input data for the Atol procedure and obtained competitive results within reasonable computation times (the largest is 110 seconds, for the COLLAB dataset; see [34, Section 3.1] for all computation times).
Here we propose to bypass the topological feature extraction step in (4). Instead of topological descriptors, we use simpler graph descriptors: we compute four HKS descriptors corresponding to diffusion times t_1 = .1, t_2 = 1, t_3 = 10, t_4 = 100 for all vertices in a graph, but this time directly interpret the output as a measure embedding in dimension 4. From there we use Atol with budget k = 80. Therefore each graph G(V, E), embedded in $\mathbb{R}^{4|V|}$, is seen as an element of $\mathcal{M}(\mathbb{R}^4)$, and our measure vectorization framework is readily applicable. To sum up, we now use the point of view:
$$G(V, E) \;\longrightarrow\; \left(\mathrm{HKS}_{t_1}(v), \mathrm{HKS}_{t_2}(v), \mathrm{HKS}_{t_3}(v), \mathrm{HKS}_{t_4}(v)\right)_{v \in V} \in \mathcal{M}(\mathbb{R}^4). \qquad (5)$$
We do not claim that this particular embedding is the correct way to handle graphs in general, much to the contrary: we simply show that our methodology can handle all sorts of embedding strategies, with good results on the exposed experiments. It is important to note that the proposed transformation from graphs to measures can be computed using any type of node or edge embedding; our vectorization method could be applied in the same way as for HKS.

In Table 2 (mean accuracies and standard deviations for the large multi-class problems of [41]) we quote results and competitors from [34], and in the right column we add our own experiment with Atol. The Atol methodology works very efficiently with the direct HKS embedding as well, although the results tend to be slightly inferior. This may hint at the fact that although PDs are not essential for capturing signal from this dataset, they can be a significant addition for doing so. Overall, Atol with (5), though much lighter than Atol with (4), remains competitive for large-scale graph multi-class classification.
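The direct embedding (5) can be sketched as follows; as before, the use of the normalized Laplacian is an assumption on our part.

```python
import numpy as np

def graph_as_measure(adj, times=(0.1, 1.0, 10.0, 100.0)):
    """Sketch of the direct embedding (5): each vertex of G(V, E) is
    mapped to its vector of HKS values at four diffusion times, so that
    the graph becomes a point cloud of |V| points in R^4 (the Laplacian
    convention is assumed, not specified in the text)."""
    deg = adj.sum(axis=1)
    inv = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lam, phi = np.linalg.eigh(np.eye(len(adj)) - inv[:, None] * adj * inv)
    hks = lambda t: (np.exp(-t * lam)[None, :] * phi**2).sum(axis=1)
    return np.stack([hks(t) for t in times], axis=1)  # shape (|V|, 4)

adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # a 3-vertex path
pts = graph_as_measure(adj)  # one point of R^4 per vertex
```

The resulting |V| x 4 array is the point cloud fed to the quantization step, with budget k = 80 in the text.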

Text classification with word embedding
At last we apply our methodology in a high-dimensional framework, namely text classification. Texts are sequences of words, and if one forgets about word order, a text can be thought of as a measure on the (fairly unstructured) space of words. We use the word2vec embedding technique [29], which trains a two-layer neural network to learn a real-valued vector representation of each word in a dictionary, in such a way that distances between words reflect their semantic closeness with respect to the corpus. Therefore, for a given dimension E ∈ N, every word w is mapped to a point $v_{\mathrm{word2vec}}(w) \in \mathbb{R}^E$ of the embedding space, and we can use word embedding to interpret texts as measures in a given word embedding space. In practice we use the Large Movie Review Dataset from [26] as our text corpus, and learn the word embedding on this corpus. To ease the intuition, Figure 4 below depicts the codepoints provided by Algorithm 2, based on a 2-dimensional word embedding. Note that this word embedding step does not depend on the classification task that follows.

We now turn to the task of classification, and use the binary sentiment annotation (positive or negative reviews) from the dataset. For classifying texts with modern machine-learning techniques there are several problems to overcome: important words can occur at almost any place in a text, texts usually have different sizes so that they have to be padded in some way, etc. The measure viewpoint is a direct and simple solution.
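Schematically, the text-to-measure step looks as follows, with a tiny hand-built embedding standing in for the trained word2vec model (the vectors below are purely illustrative):

```python
import numpy as np

# toy embedding standing in for a word2vec model trained on the corpus
# (hypothetical 2-dimensional vectors, for illustration only)
emb = {"good": np.array([1.0, 0.2]),
       "bad": np.array([-1.0, 0.1]),
       "movie": np.array([0.0, 1.0])}

def text_as_measure(text, embedding):
    """A text, with word order forgotten, becomes the finite measure on
    R^E given by the multiset of its words' embedding vectors (unknown
    words are simply dropped in this sketch)."""
    words = [w for w in text.lower().split() if w in embedding]
    return np.stack([embedding[w] for w in words])

mu = text_as_measure("Good movie good", emb)  # 3 points of R^2
```

The resulting point sets, one per text, are exactly the kind of input the vectorization procedure expects, regardless of text length.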
The successful way to proceed after a word embedding step is to use some form of deep neural network. We learn a 100-dimensional word embedding using the word2vec implementation from the gensim module, then compare the following classification methods: a recurrent neural network (LSTM with 64 units) run directly, against the Atol vectorization followed by a simple dense layer with 32 units. We measure accuracies through a single 10-fold cross-validation.
The Atol vectorization with word embedding in dimension 100, 20 centers, and 1-NN classifier reaches 85.6 accuracy with .95 standard deviation. The average quantization and vectorization times are respectively 5.5 and 208.3 seconds. The recurrent neural network alternative with word embedding in dimension 100 reaches 89.3 mean accuracy with .44 standard deviation, for an average run time of about 1 hour.
Naturally, the results are overall hindered by the fact that we have treated texts as unordered collections of words, forgetting sentence structure, whereas most competitive methods use complex neural networks that analyse n-tuples of words as additional inputs. Additional precision can also be gained by using a task-dependent embedding of words, at the price of a larger computation time (see for instance the kaggle-winning algorithm [12], with precision 99% and run time 10379 seconds).
Let Ψ be a (p, δ)-kernel and σ ∈ [r, 2r]. We have On the other hand, we have that

Proof of Proposition 5
Let

Proof of Theorem 9
Throughout this section we assume that E(X) satisfies a margin condition with radius $r_0$, and that $P(X \in \mathcal{M}_{N_{\max}}(R, M)) = 1$. The proof of Theorem 9 is based on the following lemma, which ensures that every step of Algorithm 1 is, up to concentration terms, a contraction towards an optimal codebook. We recall here that $R_0 = Br_0$.

Lemma 17. Assume that $c^{(0)} \in \mathcal{B}(c^*, R_0)$. Then, with probability larger than $1 - 9e^{-c_1 n p_{\min}/M} - e^{-x}$, for $n$ large enough, we have, for every $t$,

where $D_n = \frac{CRM}{\sqrt{n}}\left(\sqrt{kd\log(k)} + \sqrt{x}\right)$ and $C$, $K$ are positive constants.
The proof of Lemma 17 is deferred to Appendix A.1.1.
Proof of Theorem 9. Equipped with Lemma 17, the proof of Theorem 9 easily follows. On the probability event of Lemma 17, using (7) we have that

Proof of Theorem 10
The proof of Theorem 10 follows that of [32, Lemma 1]. Throughout this section we assume that E(X) satisfies a margin condition with radius $r_0$. The proof of Theorem 10 is based on the following lemma. Recall that $R_0 = Br_0$. We let $c^{(t)}$ denote the output of Algorithm 2 (or its variant in the sample-point case).
The proof of Lemma 18 is deferred to Appendix A.1.2.
Proof of Theorem 10. Equipped with Lemma 18, we can prove Theorem 10 using the same method as in the proof of [32, Lemma 1]. Namely, denoting by $a_t = \|c^{(t)} - c^*\|^2$, we prove recursively that $\mathbb{E}a_t \le \frac{2G}{t}$, where $G = \frac{16kMR^2}{p_{\min}}$. The case $t = 1$ is obvious. Next, assuming that $\mathbb{E}a_t \le \frac{2G}{t}$ and using (8), since $K_1 \le \frac{1}{2}$ we get that $\mathbb{E}a_{t+1} \le \frac{2G}{t+1}$.
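For intuition, here is how such an induction closes for a recursion of the standard stochastic-approximation shape; the contraction factor $1 - \frac{2}{t+1}$ and the $\frac{G}{(t+1)^2}$ remainder are our assumed reading of (8), not the paper's exact constants.

```latex
% Assumed shape of the recursion (hypothetical constants):
%   \mathbb{E}a_{t+1} \le \bigl(1 - \tfrac{2}{t+1}\bigr)\mathbb{E}a_t + \tfrac{G}{(t+1)^2}.
% Then, given the induction hypothesis \mathbb{E}a_t \le 2G/t:
\mathbb{E}a_{t+1}
  \le \frac{t-1}{t+1}\cdot\frac{2G}{t} + \frac{G}{(t+1)^2}
  = \frac{2G}{t+1} - \frac{2G}{t(t+1)} + \frac{G}{(t+1)^2}
  \le \frac{2G}{t+1},
\qquad \text{since } \frac{1}{(t+1)^2} \le \frac{2}{t(t+1)}.
```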

Proof of Proposition 14
The proof of Proposition 14 relies on the two following lemmas. First, Lemma 19 below ensures that the persistence diagram transformation is stable with respect to the sup-norm.
Lemma 19 ([13]). If $X$ and $Y$ are compact sets of $\mathbb{R}^D$, then $d_B(D_X, D_Y) \le d_H(X, Y)$.

We let $Z$ denote the latent label variable, so that $X \mid \{Z = \ell\} \sim X^{(\ell)}$. When $A^c$ occurs, $X$ and $D^{(\ell)}_{\ge s}$ have the same number of (multiple) points. Further, we have

For $N$ large enough so that $\alpha N d \le \bar{M}$ On the other hand, let $c$ be a $k$-point codebook such that, for every $p \in [\![1, k]\!]$, Now let $\ell_0$ be such that $n^{(\ell_0)}(m_1) \ge 1$, and assume that the $N_\ell$'s are large enough so that $\alpha_\ell \le \frac{1}{2}$. It holds

hence the result.

Proof of Proposition 15
We let $M = \max_{\ell \le L} M_\ell$, and $h_0$ be such that $\bar{D}_{\ge s - h_0} = \bar{D}_{\ge s}$. Then we have

For the remainder of the proof we assume that, for $i = 1, \ldots, n$, $d_B(X_i, D^{Z_i}_{\ge s}) \le \kappa B$, which occurs with probability larger than $1 - n \max_{\ell \le L} \alpha_\ell$. Recall that on this probability event, under the assumptions of Proposition 14, $X_i$ and $D^{Z_i}_{\ge s}$ have the same number of (multiple) points. Let $i_1 \neq i_2$, and assume that $Z_{i_1} = Z_{i_2} = z$.

Proof of Corollary 16
In the case where $\Psi = \Psi_{AT}$, we have that $\Psi_{AT}$ is a $(p, 1/p)$-kernel. The requirement $1/p \le \frac{1}{4M}$ of Proposition 3 is thus satisfied for $p_{AT} = 4M$. On the other hand, choosing $r_{AT} = \frac{B}{32 p_{AT}}$ ensures that $8 r_{AT} p_{AT} \le (1/2 - \kappa)B$ and $\frac{r_{AT}}{2 p_{AT}} \ge 2\kappa B$, for $\kappa$ small enough. Thus the requirements of Proposition 15 are satisfied: $c^*$ is a $(2p_{AT}, r_{AT}, 1)$-shattering of $X_1, \ldots, X_n$. At last, using Corollary 11, $\hat{c}_n$ is a $(p_{AT}, r_{AT}, 1)$-shattering of $X_1, \ldots, X_n$ on the probability event described by Corollary 11. It remains to note that $2\kappa B \le \frac{r_{AT}}{4}$ for $\kappa$ small enough, to conclude that $X_1, \ldots, X_n$ is $\frac{r_{AT}}{4}$-concentrated on the probability event described in Proposition 15. Thus Proposition 5 applies.
where $K_1$, $K_2$ and $K_3$ are constants to be fixed later. Combining pieces and using $t + 1 \ge 1$ leads to $a_{t+1} \le a_t + \frac{a_t}{t+1}$. Thus, provided that $c^{(0)} \in \mathcal{B}(c^*, R_0)$, on $A_{\le t}$ we have $c^{(p)} \in \mathcal{B}(c^*, R_0)$ for $p \le t$. Next, let $\mathcal{F}_t$ denote the sigma-algebra corresponding to the observations of the $t$ first mini-batches $B_1, \ldots, B_t$, and $\mathbb{E}_t$ the conditional expectation with respect to $\mathcal{F}_t$. We will show that

where $K_1 < 0.5$. First, we may write, with $R_1 \le (4k^2R^2)/n^3 \le 4kMR^2/(p_{\min}(t+1)^2)$, since $p_{\min} \le M/k$ and $t + 1 \le T \le n$. Then, using (12) entails the result.

A.1.5. Proof of Lemma 24
We follow the proof of [4, Theorem 1]. Let $t > 0$, and assume that $Pf - P_nf > 2t\sqrt{Pf}$, as well as $P_n'f \ge Pf - t\sqrt{Pf} \ge 0$, where $P_n'$ denotes the empirical measure of an independent ghost sample. Let $g_a : \mathbb{R}_+ \to \mathbb{R}$ be defined as $g_a(x) = \frac{x-a}{\sqrt{x+a}}$, for $a \ge 0$. Then $g_a$ is nondecreasing on $\mathbb{R}_+$. With $a = P_nf$ and $0 \le x_2 = Pf - t\sqrt{Pf} \le P_n'f = x_1$, we have $g_a(x_2) \le g_a(x_1)$, so that
$$\frac{P_n'f - P_nf}{\sqrt{\frac{1}{2}(P_nf + P_n'f)}} \ge \frac{\sqrt{2}\, t\sqrt{Pf}}{\sqrt{P_nf + Pf - t\sqrt{Pf}}}.$$
Since $P_nf + Pf - t\sqrt{Pf} \le 2Pf$, we deduce that
$$\frac{P_n'f - P_nf}{\sqrt{\frac{1}{2}(P_nf + P_n'f)}} \ge t.$$
Thus,
$$P\left(\exists f_0 \in \mathcal{F} \,\middle|\, \frac{P_n'f_0 - P_nf_0}{\sqrt{\frac{1}{2}(P_nf_0 + P_n'f_0)}} \ge t\right) \ge P\left(\exists f_0 \in \mathcal{F} \mid Pf_0 - P_nf_0 > 2t\sqrt{Pf_0} \text{ and } P_n'f_0 \ge Pf_0 - t\sqrt{Pf_0}\right).$$
Let $A = \left\{\exists f_0 \in \mathcal{F} \mid Pf_0 - P_nf_0 > 2t\sqrt{Pf_0}\right\}$. Since the ghost sample is independent of the first one, the previous inequality may be written as
$$P\left(\exists f_0 \in \mathcal{F} \,\middle|\, \frac{P_n'f_0 - P_nf_0}{\sqrt{\frac{1}{2}(P_nf_0 + P_n'f_0)}} \ge t\right) \ge P(A) \inf_{f \in \mathcal{F}} P\left(P_n'f \ge Pf - t\sqrt{Pf}\right),$$
with, for a fixed $f \in \mathcal{F}$, using Cantelli's inequality,
$$P\left(P_n'f \ge Pf - t\sqrt{Pf}\right) \ge \frac{nt^2 Pf}{\mathrm{Var}(f) + nt^2 Pf}.$$
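The Cantelli (one-sided Chebyshev) step invoked here applies to the ghost empirical mean $P_n'f$, which has mean $Pf$ and variance $\mathrm{Var}(f)/n$:

```latex
% Cantelli: for Y with mean m and variance \sigma^2, and u > 0,
%   P(Y \le m - u) \le \sigma^2 / (\sigma^2 + u^2).
% With Y = P_n'f, m = Pf, \sigma^2 = \mathrm{Var}(f)/n and u = t\sqrt{Pf}:
P\bigl(P_n'f < Pf - t\sqrt{Pf}\bigr)
  \le \frac{\mathrm{Var}(f)/n}{\mathrm{Var}(f)/n + t^2 Pf}
  = \frac{\mathrm{Var}(f)}{\mathrm{Var}(f) + n t^2 Pf}.
```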
Since, for any $f \in \mathcal{F}$, $\|f\|_\infty \le 1$, we have $\mathrm{Var}(f) \le Pf^2 \le Pf$, thus
$$P\left(P_n'f \ge Pf - t\sqrt{Pf}\right) \ge \frac{nt^2}{1 + nt^2} \ge \frac{1}{2},$$
provided that $nt^2 \ge 1$. Thus
$$P\left(\exists f_0 \in \mathcal{F} \,\middle|\, \frac{P_n'f_0 - P_nf_0}{\sqrt{\frac{1}{2}(P_nf_0 + P_n'f_0)}} \ge t\right) \ge \frac{1}{2} P(A).$$
The other inequality proceeds from the same reasoning, considering $f$ such that $P_nf - Pf > 2t\sqrt{P_nf}$ and $P_n'f \le Pf + t\sqrt{P_nf}$, and $g_a : \mathbb{R}_+ \to \mathbb{R}$ defined by $g_a(x) = \frac{a-x}{\sqrt{a+x}}$, which is nonincreasing for $a \ge 0$. Choosing $a = P_nf$ and $0 \le x_1 = P_n'f \le Pf + t\sqrt{P_nf} = x_2$ leads to
$$\frac{P_nf - P_n'f}{\sqrt{\frac{1}{2}(P_nf + P_n'f)}} \ge t,$$
since $Pf + t\sqrt{P_nf} \le P_nf$. Using Cantelli's inequality again leads to the result.

A.2. Proof of Lemma 12
Lemma (12). Let $S$ be a compact subset of $\mathbb{R}^d$, and let $D$ denote the persistence diagram of the distance function $d_S$. For any $s > 0$, the truncated diagram consisting of the points $m = (m_1, m_2) \in D$ such that $m_2 - m_1 \ge s$ is finite.
The lemma follows from standard arguments in geometric inference and persistent homology theory.
First, the definition of the generalized gradient of $d_S$ (see [9] or [5, Section 9.2]) implies that the critical points of $d_S$ are all contained in the convex hull of $S$. As a consequence, they are all contained in the sublevel set $d_S^{-1}([0, 2\,\mathrm{diam}(S)])$. It follows from the Isotopy Lemma ([5, Theorem 9.5]) that all the sublevel sets $d_S^{-1}([0, t])$, $t > 2\,\mathrm{diam}(S)$, have the same homology. As a consequence, no point of $D$ has a coordinate larger than $2\,\mathrm{diam}(S)$, and $D$ is contained in $[0, 2\,\mathrm{diam}(S)]^2$.
Since $S$ is compact, the persistence module of the filtration defined by the sublevel sets of $d_S$ is q-tame ([10, Corollary 3.35]). Equivalently, this means that for any $b_0 < d_0$, the intersection of $D$ with the quadrant $Q_{(b_0, d_0)} = \{(b, d) : b < b_0 \text{ and } d_0 < d\}$ is finite. Noting that the intersection of $[0, 2\,\mathrm{diam}(S)]^2$ with the half-plane $\{(b, d) : d \ge b + s\}$ can be covered by a finite union of quadrants $Q_{(b, b + s/2)}$ concludes the proof of the lemma.