Fast rates for empirical vector quantization

We consider the rate of convergence of the expected loss of empirically optimal vector quantizers. Earlier results show that the mean-squared expected distortion for any fixed distribution supported on a bounded set and satisfying some regularity conditions decreases at the rate O(log n/n). We prove that this rate is actually O(1/n). Although these conditions are hard to check, we show that well-polarized distributions with continuous densities supported on a bounded set are included in the scope of this result.


Introduction
Clustering is the problem of identifying groups of similar points that are relatively far from one another, or, in other words, of partitioning the data into dissimilar groups of similar items. For a comprehensive introduction to this topic, the reader is referred to the monograph of Graf and Luschgy [8]. Isolating meaningful groups from a cloud of data is a topic of interest in many fields, from social science to biology. In fact, this issue originates in the theory of signal processing of the late 40's, where it is known as the quantization problem, or lossy data compression (see Gersho and Gray [7] for a comprehensive approach to this topic). More precisely, let $X_1, \dots, X_n$ denote $n$ independent and identically distributed random variables, drawn from a distribution $P$ over $\mathbb{R}^d$ equipped with its Euclidean norm $\|\cdot\|$, and let $Q$ denote a $k$-quantizer, that is, a map from $\mathbb{R}^d$ to $\mathbb{R}^d$ such that $\mathrm{Card}(Q(\mathbb{R}^d)) \leq k$. Let $c \in (\mathbb{R}^d)^k$ be a concatenation of $k$ $d$-dimensional vectors $c_1, \dots, c_k$. Without loss of generality we only consider quantizers of the type $x \mapsto c_i$, where $\|x - c_i\| = \min_{j=1,\dots,k} \|x - c_j\|$. The $c_i$'s are called clusters. To measure how well the quantizer $Q$ represents the source distribution, a standard criterion is the distortion
$$R(Q) = R(c) = \mathbb{E}\|X - Q(X)\|^2 = \mathbb{E}\min_{j=1,\dots,k}\|X - c_j\|^2,$$
which is well defined whenever $\mathbb{E}\|X\|^2 < \infty$. The goal here is to find a set of clusters $\hat{c}_n$, drawn from the data $X_1, \dots, X_n$, whose distortion is as close as possible to the optimal distortion $R^* = \inf_{c \in (\mathbb{R}^d)^k} R(c)$. To solve the problem, most approaches to date implement the principle of empirical error minimization in the vector quantization context. According to this principle, good clusters can be found by minimizing the empirical distortion over the training data, defined by
$$\widehat{R}_n(c) = \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|X_i - c_j\|^2.$$
The existence of such empirically optimal clusters has been established by Graf and Luschgy [8, Theorem 4.12]. Let us denote by $\hat{c}_n$ one of these vectors of empirically optimal clusters. If the training data represent the source well, $\hat{c}_n$ will hopefully perform near-optimally on the real source as well. Roughly, this means that we expect $R(\hat{c}_n) \approx R^*$. The problem of quantifying how good empirically designed clusters are, compared to the truly optimal ones, has been extensively studied; see for instance Linder [10].
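The empirical-distortion objective above is easy to state in code. The sketch below (illustrative only; the function name is ours) computes $\widehat{R}_n(c)$ for a candidate vector of clusters on a tiny hand-checkable example:

```python
import numpy as np

def empirical_distortion(X, c):
    """Empirical distortion R_n(c) = (1/n) * sum_i min_j ||X_i - c_j||^2.

    X : (n, d) array of observations; c : (k, d) array of clusters.
    """
    # Squared distances from every point to every cluster, shape (n, k).
    sq_dists = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    # Each point is quantized to its nearest cluster.
    return sq_dists.min(axis=1).mean()

# Worked example: three points on the line, two clusters.
X = np.array([[0.0], [1.0], [2.0]])
c = np.array([[0.0], [2.0]])
# Per-point distortions: min(0,4)=0, min(1,1)=1, min(4,0)=0, so the mean is 1/3.
print(empirical_distortion(X, c))  # 0.3333...
```

Note that actually minimizing this objective over all $c$ is computationally hard in general; in practice one uses local-search heuristics such as Lloyd's algorithm.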
To reach the latter goal, a standard route is to exploit the Wasserstein distance between the empirical distribution and the source distribution in order to derive upper bounds on the average distortion of empirically optimal clusters. Following this approach, Pollard [14] proved that if $\mathbb{E}\|X\|^2 < \infty$, then $R(\hat{c}_n) - R^* \to 0$ almost surely as $n \to \infty$. More recently, Linder, Lugosi and Zeger [11] and Biau, Devroye and Lugosi [3] showed that if the support of $P$ is bounded, then $\mathbb{E}(R(\hat{c}_n) - R^*) = O(1/\sqrt{n})$, using techniques borrowed from statistical learning theory. Bartlett, Linder and Lugosi [2] established that this rate is minimax over distributions supported on a finite set of points. However, faster rates can be achieved under additional assumptions. For example, it is shown by Chou [6], following a result of Pollard [16], that $R(\hat{c}_n) - R^* = O_P(1/n)$ under some regularity conditions on the source distribution. Nevertheless, this consistency result does not provide any information on how many training samples are needed to ensure that the average distortion of empirically optimal clusters is close to the optimum. Antos, Györfi and György established in [1] that $\mathbb{E}(R(\hat{c}_n) - R^*) = O(\log n/n)$ under the same conditions, paying a $\log n$ factor to obtain a non-asymptotic bound. It is worth pointing out that these conditions cannot be checked in practice, and consequently remain of a theoretical nature. Moreover, the rate of $1/n$ for the average distortion can be achieved when the source distribution is supported on a finite set of points. Consequently, an open question is whether this optimal rate can be attained for more general distributions.
In the present paper, we improve the previous results of Antos, Györfi and György [1] by getting rid of the $\log n$ factor. Besides, we express Pollard's condition in a more reader-friendly framework, involving the density of the source distribution. To this aim we use statistical learning arguments and prove that the average distortion of empirically optimal clusters decreases at the rate $O(1/n)$. To get this result we use techniques such as the localization principle, borrowed from Blanchard, Bousquet and Massart [4] or Koltchinskii [9]. The condition we offer can easily be interpreted as a margin-type condition, similar to the ones of Massart and Nedelec [13], showing a clear connection between statistical learning theory and vector quantization.
The paper is organized as follows. In Section 2 we introduce the notation and definitions of interest. In Section 3 we offer our main results. These results are discussed in Section 4 and illustrated on examples such as Gaussian mixtures or quasi-finite distributions. Finally, proofs are gathered in Section 5.

The quantization problem
Throughout the paper, $X_1, \dots, X_n$ is a sequence of independent $\mathbb{R}^d$-valued random observations with the same distribution $P$ as a generic random variable $X$. To frame the quantization problem as a statistical learning one, we first have to consider quantization as a contrast minimization issue. To this aim we introduce the following notation. Let $c = (c_1, \dots, c_k)$ be a vector of possible clusters. The contrast function $\gamma$ is defined as
$$\gamma(c, x) = \min_{j=1,\dots,k} \|x - c_j\|^2.$$
Within this framework, the risk $R(c)$ takes the form $R(Q) = R(c) = P\gamma(c, \cdot)$, where $Pf(\cdot)$ means integration of the function $f$ with respect to $P$. In the same way, if $P_n$ denotes the empirical distribution induced on $\mathbb{R}^d$ by the $n$-sample $X_1, \dots, X_n$, we can express the empirical risk $\widehat{R}_n(Q)$ as $P_n\gamma(c, \cdot)$.
Note that, within this context, an optimal $c^*$ minimizes $P\gamma(c, \cdot)$, whereas $\hat{c}_n \in \arg\min_{c \in (\mathbb{R}^d)^k} P_n\gamma(c, \cdot)$. It is worth pointing out that the existence of both $\hat{c}_n$ and $c^*$ is guaranteed by Graf and Luschgy [8, Theorem 4.12]. In the sequel we denote by $\mathcal{M}$ the set of such minimizers of the true risk $P\gamma(c, \cdot)$, so that $c^* \in \mathcal{M}$. To measure how well a vector of clusters $c$ performs compared to an optimal one, we will make use of the loss
$$\ell(c, c^*) = R(c) - R(c^*).$$
Throughout the paper we will use the following assumptions on the source distribution. Let $B(0, M)$ denote the closed ball centered at the origin, of radius $M \geq 0$.
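Before stating the assumptions, the objects introduced so far can be gathered in a single display (a restatement of the definitions above):

```latex
\gamma(c, x) = \min_{j=1,\dots,k} \|x - c_j\|^2, \qquad
R(c) = P\gamma(c, \cdot), \qquad
\widehat{R}_n(c) = P_n\gamma(c, \cdot), \qquad
\ell(c, c^*) = P\big(\gamma(c,\cdot) - \gamma(c^*,\cdot)\big).
```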
Assumption 1 (Peak Power Constraint). The distribution $P$ is such that $P(B(0, 1)) = 1$.

Note that Assumption 1 is stronger than the requirement $\mathbb{E}\|X\|^2 < \infty$, as it imposes an $L^\infty$-boundedness condition on the random variable $X$. For convenience we assume that the distribution is bounded by 1. However, it is important to note that our results hold for random variables $X$ bounded by an arbitrary $M$. We will also need the following regularity requirement, first introduced by Pollard [16].
Assumption 2 (Pollard's regularity condition). The distribution $P$ satisfies the following two conditions:
1. $P$ has a continuous density $f$ with respect to the Lebesgue measure on $\mathbb{R}^d$;
2. the Hessian matrix of $c \mapsto P\gamma(c, \cdot)$ is positive definite at every optimal vector of clusters $c^*$.
One can point out that Condition 1 of Assumption 2 does not by itself guarantee the existence of a second derivative for the expectation of the contrast function. Nevertheless, Assumption 1 and Condition 1 of Assumption 2 are enough to guarantee that the map $c \mapsto P\gamma(c, \cdot)$ is twice differentiable. Let $V_i$ be the Voronoi cell associated with $c_i$, for $i = 1, \dots, k$. In this situation, the Hessian matrix is composed of $d \times d$ blocks built from integrals of $f$ over the sets $\partial(V_i \cap V_j)$, where $\partial(V_i \cap V_j)$ denotes the possibly empty common face of $V_i$ and $V_j$, and $\sigma$ means integration with respect to the $(d-1)$-dimensional Lebesgue measure. For a proof of that statement, we refer to Pollard [16].
The proofs of these two results are both based on arguments connected with the localization principle ([12], [9]), which provides faster rates of convergence when the expectation and the variance of $\gamma(c, \cdot) - \gamma(c^*, \cdot)$ are connected. To prove his result, Pollard used conditions under which the distortion and the Euclidean distance are connected, and used chaining arguments to bound from above a term which looks like a Rademacher complexity, constrained on an area around an optimal vector of clusters. Note that Koltchinskii [9] used a similar method to apply the localization principle. On the other hand, Antos, Györfi and György exploited Pollard's condition, and used a concentration inequality based on the fact that the variance and the expectation of the distortion are connected, to get their result. Interestingly, this point of view has been developed by Blanchard, Bousquet and Massart [4] to get bounds on the classification risk of the SVM, using the localization principle. That is the approach followed in the present document.
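Schematically (an informal template rather than a precise statement), the localization principle used throughout reads: if there exists $A > 0$ such that, for every $c$,

```latex
\operatorname{Var}\big(\gamma(c,\cdot) - \gamma(c^*,\cdot)\big)
  \;\leq\; A \, P\big(\gamma(c,\cdot) - \gamma(c^*,\cdot)\big),
```

then Bernstein-type concentration applied to the class of functions localized around $c^*$ yields an excess-risk bound of order $1/n$, instead of the $1/\sqrt{n}$ rate given by a global analysis.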

Main results
We are now in a position to state our main result.

Theorem 3.1. Assume that Assumption 1 and Assumption 2 are satisfied. Then, denoting by $\hat{c}_n$ an empirical risk minimizer, we have
$$\mathbb{E}\,\ell(\hat{c}_n, c^*) \leq \frac{C_0}{n},$$
where $C_0$ is a positive constant depending on $P$, $k$ and $d$.
This result improves the previous non-asymptotic results of Antos, Györfi and György [1] and Linder, Lugosi and Zeger [11], showing that a convergence rate of $1/n$ can be achieved in expectation. To prove Theorem 3.1, the key result is a version of Talagrand's inequality due to Bousquet [5] and its application to localization, following the approach of Massart and Nedelec [13]. The main point is to connect $\operatorname{Var}(\gamma(c, \cdot) - \gamma(c^*, \cdot))$ to $P(\gamma(c, \cdot) - \gamma(c^*, \cdot))$ for all possible $c$. To be more precise, Pollard's condition involves differentiability of the distortion; therefore $\gamma(c, \cdot) - \gamma(c^*, \cdot)$ is naturally linked to $\|c - c^*\|$, the Euclidean distance between $c$ and $c^*$. However, it is noteworthy that, mimicking the proof of Antos, Györfi and György [1, Corollary 1], we have in fact:

Proposition 3.1. Suppose that Assumption 1 and Assumption 2 are satisfied. Then there exist two positive constants $A_1$ and $A_2$, depending on the distribution $P$, such that, for every $c \in (B(0,1))^k$,
$$\ell(c, c^*(c)) \geq A_1 \|c - c^*(c)\|^2 \quad (H1)$$
and
$$\operatorname{Var}\big(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)\big) \leq A_2 \|c - c^*(c)\|^2 \quad (H2).$$

When considering several possible optimal vectors of clusters, we have to choose one to be compared with our empirical vector $\hat{c}_n$. A nearest optimal vector of clusters $c^*(\hat{c}_n)$ is a natural choice. It is important to note that, for every $c \in (\mathbb{R}^d)^k$ and $c^* \in \mathcal{M}$, $\ell(c, c^*(c)) = \ell(c, c^*)$. Consequently, Theorem 3.1 holds for every possible $c^* \in \mathcal{M}$. Besides, it is easy to see, using the compactness of $B(0, 1)$, that there is only a finite set of optimal clusters $c^*$ when Assumption 1 is satisfied and the Hessian matrices $H(c^*)$ are positive definite for every possible $c^*$. This compactness argument is also the key to turn the local positiveness of $H(c^*)$ into property (H1), and the regularity of the contrast function $\gamma$ into the global property (H2). These two properties exactly match the two parts of the proof of Antos, Györfi and György [1, Corollary 1], which in turn implies Proposition 3.1. Note also that, from Proposition 3.1, we get
$$\operatorname{Var}\big(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)\big) \leq \frac{A_2}{A_1}\, \ell(c, c^*(c)).$$
This allows us to use localization techniques such as in the paper of Blanchard, Bousquet and Massart [4].
Pollard's regularity condition (Assumption 2) involves second derivatives of the distortion. Consequently, checking Assumption 2, even theoretically, remains a hard issue. We give a more easily interpretable condition, regarding the $L^\infty$-norm of the density $f$ on the boundaries of the Voronoi diagram, for the distribution to satisfy Assumption 2. We recall that $\mathcal{M}$ denotes the set of all possible optimal vectors of clusters $c^*$.

Theorem 3.2. Denote by $V^*_i$ the Voronoi cell associated with $c^*_i$ in the Voronoi diagram associated with $c^*$, by $N^*$ the union of all possible boundaries of Voronoi cells with respect to all possible optimal vectors of clusters $c^*$, and by $\Gamma$ the Gamma function. Let $B = \inf_{c^* \in \mathcal{M},\, i \neq j} \|c^*_i - c^*_j\|$. If $\|f\|_\infty$ on $N^*$ is small enough, in a sense depending on $B$, $d$ and $\Gamma$, then $P$ satisfies Assumption 2.
The proof is given in Section 5. It is important to note that, for general distributions supported on $B(0, M)$, we can state a similar theorem, involving $M^{d+1}$ in the right-hand side of the inequality in Theorem 3.2. However, a source distribution supported on $B(0, M)$ can be turned into a distribution supported on $B(0, 1)$ using a homothetic transformation. Therefore we only state results for a distribution supported on $B(0, 1)$.
This theorem emphasizes the idea that if $P$ is well concentrated around its optimal clusters, then some localization conditions can hold, and therefore this is a favorable case. The intuition behind this result is given by the extremal case where the boundaries of the Voronoi cells are empty with respect to $P$. This case is described in detail in Section 4. Moreover, the notion of a well-concentrated distribution resembles the margin-type conditions for the classification case, as described by Massart and Nedelec [13]. This confirms the intuition of an easy-to-quantize distribution when the poles are well separated.

Minimax lower bound
Let $\mathcal{P}$ denote the set of probability distributions on $B(0, 1)$. Bartlett, Linder and Lugosi [2] offered a minimax lower bound for general distributions. Consequently, for general distributions, this minimax bound matches the $O(1/\sqrt{n})$ upper bound on $\mathbb{E}\,\ell(\hat{c}_n, c^*)$ of Linder, Lugosi and Zeger [11]. There is no contradiction between Theorem 3.1 and Proposition 4.1: in Theorem 3.1, $k$ is fixed, whereas, in Proposition 4.1, $k_n$ strongly depends on $n$. Therefore, it is an interesting question to know whether such a minimax bound can be obtained when $k$ is fixed.
The proof of Proposition 4.1 follows the proof of Bartlett, Linder and Lugosi [2, Theorem 1], and it is therefore omitted in this paper.The main idea is to replace the distribution supported on 2n points proposed by these authors in Step 3 of the proof, with a distribution supported on 2n small balls satisfying Assumption 1 and Assumption 2.

Assumption 1 is necessary
The original result of Pollard [16] assumes only that $\mathbb{E}\|X\|^2 < \infty$ to get an asymptotic rate of $O_P(1/n)$. Consequently, it is an interesting question to know whether Assumption 1 can be replaced with the assumption $\mathbb{E}\|X\|^2 < \infty$ in Theorem 3.1. In fact, Assumption 1 is used to get a global localization result from a local one, through a compactness argument. This is precisely the result of Proposition 3.1, which provides us with the global argument required for applying a localization result from a local regularity condition. However, following the idea of Antos, Györfi and György [1], it is possible to suppose only that $\mathbb{E}\|X\|^2 < \infty$ and nevertheless get (H2) in Proposition 3.1, as expressed in the following result.

Proposition 4.2. Suppose that $\mathbb{E}\|X\|^2 < \infty$ and that the set of all possible optimal vectors of clusters $c^*$ is finite. Then there exists a constant $A_2$, depending on $P$, such that
$$\operatorname{Var}\big(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)\big) \leq A_2 \|c - c^*(c)\|^2.$$

A proof of Proposition 4.2 can be directly deduced from the proof of [1, Theorem 2]; consequently it is omitted in this paper. According to Proposition 4.2, we can expect to control the variance of our process, indexed by $c$ and $c^*$, with the Euclidean distance $\|c - c^*\|$, even if the support of $P$ is not contained within a ball. Unfortunately, when the distribution is not supported on a bounded set, there are cases where the term $\ell(c, c^*(c))$ cannot dominate $\|c - c^*(c)\|^2$ for all $c$, as expressed in the following counter-example.

Proposition 4.3. Let $\eta > 0$, $q(\eta) = e^{\eta/2 + 2R - 1}/3$, define the density $f$ of the distribution $P$ supported on $\mathbb{R}$ accordingly, and define $c_n = (0, n, n^2)$. Then $\ell(c_n, c^*(c_n))$ remains bounded while $\|c_n - c^*(c_n)\| \to \infty$.

One easily deduces from Proposition 4.3 that the distribution $P$ satisfies Assumption 2, but fails to satisfy (H1) in Proposition 3.1. Therefore Assumption 1 is necessary to get the result of Theorem 3.2. The intuition behind this counter-example is that two phenomena prevent $\ell(c, c^*(c))$ from being at most proportional to $\|c - c^*(c)\|^2$ when $c$ is arbitrarily far from 0. Firstly, the underlying measure "erases" the Euclidean distance in the expression of $\ell(c, c^*(c))$, which implies that $\ell(c_n, c^*(c_n))$ converges. Therefore, a suitable criterion linking $\ell(c, c^*(c))$ and $\operatorname{Var}(\gamma(c, \cdot) - \gamma(c^*(c), \cdot))$ should probably involve a weight drawn from the tail of $P$. Typically we expect such a criterion to be a function of $\|c - c^*(c)\|^2$, taking into account a tail constraint on the distribution $P$ we consider. Secondly, this example shows that, if for instance we take the 3-quantizer $c_n = (n, n^2, n^4)$, the relative loss $\ell(c_n, c^*(c_n))$ will mostly depend on the contribution of the smallest cluster $n$ as $n$ grows to infinity, whereas $\|c_n - c^*(c_n)\|^2$ essentially depends on the distance to the cluster farthest from 0, namely $n^4$.
To conclude, the Euclidean distance does not take into account the weight induced by the underlying distribution over the space. Thus, when Assumption 1 is relaxed, the dominant clusters for the Euclidean distance from $c^*$ are essentially the farthest ones. On the other hand, when integrating with respect to $P$, far-from-zero clusters lose their influence in the loss $\ell(c_n, c^*(c_n))$.

A toy example
In this subsection we intend to understand which conditions on the density $f$ can guarantee that the Hessian matrices $H$ are positive definite. To this aim we consider an extremal case, in which the probability distribution is supported on small balls scattered in $B(0, 1)$. Roughly, if the balls are small enough and far from one another, the optimal quantization points should be the centers of these balls. These are the ideas behind the following proposition.
Here $U_{|B(z_i, \rho)}$ denotes the uniform distribution over $B(z_i, \rho)$. The proof of Proposition 4.4, which is given in Section 5, is inspired by a proof of Bartlett, Linder and Lugosi [2, Step 3]. It is interesting to note that Proposition 4.4 can be extended to the situation where the underlying distribution is supported on $k$ small enough subsets. In this context, if each subset has a not too small $P$-measure, and if these subsets are far enough from one another, it can be proved in the same way that an optimal quantizer has a point in every small subset.
Let us now consider the distribution described in Proposition 4.4, with relevant values for $\rho$ and $R$. We immediately see that if $R/2 > \rho$, then every boundary of the Voronoi diagram for the optimal vector of clusters lies in a null-measured area. Thus, for this distribution, the Hessian matrix reduces to a diagonal form, which is clearly positive definite. This short example illustrates the idea behind Theorem 3.2. Namely, if the density of the distribution is not too large at the boundaries of the Voronoi diagram associated with every optimal $k$-quantizer, then the Hessian matrix $H$ will roughly behave as a positive diagonal matrix; thus Pollard's condition (Assumption 2) will be satisfied. This most favorable case is in fact derived from the special case where the distribution is supported on $k$ points. Antos, Györfi and György [1] proved that if the distribution has only a finite number of atoms, then the expected loss $\mathbb{E}\,\ell(\hat{c}_n, c^*)$ is at most $C/n$, where $C$ is a constant. Here we spread the atoms into small balls, to give a density to the distribution and match the regularity conditions.
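The toy example can be explored numerically. The sketch below is our own illustration, not the authors' construction: it runs plain Lloyd iterations with a deterministic farthest-point initialization in place of exact empirical risk minimization. It samples from three well-separated uniform balls in the plane and checks that each learned cluster lands inside one ball, so the empirical distortion is of order $\rho^2$ rather than of order $(R/2)^2$:

```python
import numpy as np

def farthest_first(X, k):
    # Deterministic farthest-point initialization: start at X[0], then
    # greedily add the data point farthest from the centers chosen so far.
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d2.argmax()])
    return np.array(centers)

def lloyd(X, c, n_iter=50):
    # Plain Lloyd iterations: nearest-cluster assignment, cell-mean update.
    c = c.copy()
    for _ in range(n_iter):
        labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(len(c)):
            if np.any(labels == j):
                c[j] = X[labels == j].mean(axis=0)
    return c

def distortion(X, c):
    return ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(1).mean()

rng = np.random.default_rng(0)
z = np.array([[0.0, 0.0], [0.8, 0.0], [0.0, 0.8]])  # ball centers, R = 0.8
rho = 0.05                                          # ball radius, rho << R/2
idx = rng.integers(0, 3, 300)
theta = rng.uniform(0, 2 * np.pi, 300)
r = rho * np.sqrt(rng.uniform(0, 1, 300))           # uniform in each disk
X = z[idx] + np.c_[r * np.cos(theta), r * np.sin(theta)]

c_hat = lloyd(X, farthest_first(X, 3))
# Since the balls are far apart, each learned cluster sits inside one ball;
# a merged cell would instead cost about (R/2)^2 = 0.16.
print(distortion(X, c_hat) < (2 * rho) ** 2)  # True
```

Because the balls are separated by much more than their diameter, the farthest-point initialization provably picks one point per ball, and the Lloyd iterations then stay within the balls.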

Quasi-Gaussian mixture example
The aim of this subsection is to apply our results to Gaussian mixtures in dimension $d = 2$. However, since the support of a Gaussian distribution is not bounded, we restrict ourselves to the "quasi-Gaussian mixture" model, defined as follows.
Let the density $f$ of the distribution $P$ be a mixture of $k$ truncated Gaussian densities, where $N_i$ denotes a normalization constant for each Gaussian component. To ensure that this model is close to the Gaussian mixture model, we assume that there exists a constant $\varepsilon \in [0, 1]$ such that, for $i = 1, \dots, k$, $N_i \geq 1 - \varepsilon$. Denote by $B = \inf_{i \neq j} \|m_i - m_j\|$ the smallest possible distance between two different means of the mixture. To avoid boundary issues we suppose that, for all $i = 1, \dots, k$, $B(m_i, B/3) \subset B(0, 1)$. For such a model, we have:

Proposition 4.5. Suppose that the ratio $p_{\min}/p_{\max}$ is large enough, in a sense depending on $\sigma$, $B$, $k$ and $\varepsilon$. Then $P$ satisfies Assumption 2.
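For concreteness, a quasi-Gaussian mixture density of the kind considered here can be written, for instance (our notation; we assume a common scale $\sigma$, as the later comparison of $\sigma$ with $B$ suggests), as

```latex
f(x) \;=\; \sum_{i=1}^{k} \frac{p_i}{2\pi\sigma^2 N_i}\,
  e^{-\|x - m_i\|^2 / (2\sigma^2)}\, \mathbf{1}_{B(0,1)}(x),
\qquad \sum_{i=1}^k p_i = 1,
```

where $N_i$ renormalizes the $i$-th truncated Gaussian so that $f$ integrates to one over $B(0, 1)$; the condition $N_i \geq 1 - \varepsilon$ then states that little Gaussian mass is lost by the truncation.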
The inequality we propose as a condition in Proposition 4.5 can be decomposed as follows. If the ratio $p_{\min}/p_{\max}$ is large enough, then the optimal vector of clusters $c^*$ is close to the vector of means of the mixture, $m = (m_1, \dots, m_k)$.
Knowing that, we can locate the boundaries of the Voronoi diagram associated with $c^*$ and apply Theorem 3.2. This leads to the second term of the maximum in Proposition 4.5. This condition can be interpreted as a condition on the polarization of the mixture. A favorable case for vector quantization seems to be when the poles of the mixture are well separated, which for Gaussian mixtures means that $\sigma$ is small compared to $B$. Proposition 4.5 quantifies how small $\sigma$ has to be compared to $B$ in order to satisfy Assumption 2, and therefore to apply Theorem 3.1 and reach an improved convergence rate of $1/n$ for the loss $\ell(\hat{c}_n, c^*)$. Notice that Proposition 4.5 can be considered as an extension of Proposition 4.4. In these two propositions a key point is to locate $c^*$, which is possible when the distribution $P$ is well polarized. The definition of a well-polarized distribution takes two similar forms in Proposition 4.4 and Proposition 4.5. In Proposition 4.4 the favorable case is when the poles are far from one another, separated by an empty area with respect to $P$, which ensures that the Hessian matrices $H(c^*)$ are positive definite (in this case they are diagonal matrices). When slightly perturbing the framework of Proposition 4.4, it is quite natural to think that the Hessian matrices $H(c^*)$ should remain positive definite. Proposition 4.5 is an illustration of this idea: the empty separation area between poles is replaced with an area where the density $f$ is small compared to its value around the poles. The condition on $\sigma$ and $B$ we offer in Proposition 4.5 gives a theoretical definition of a well-polarized distribution for quasi-Gaussian mixtures.
It is important to note that our result holds when $k$ is known and matches exactly the number of components of the mixture. When the number of clusters $k$ is larger than the number of components of the mixture, we have no general idea of where the optimal clusters can be placed. Moreover, suppose that we are able to locate the optimal vector of clusters $c^*$. As explained in the proof of Proposition 4.5, the quantity involved in Proposition 4.5 is in fact $\tilde{B} = \inf_{i \neq j} \|c^*_i - c^*_j\|$. Thus, in this case, we expect $\tilde{B}$ to be much smaller than $B$. Consequently, a condition like the one in Proposition 4.5 cannot involve the natural parameter $B$ of the mixture.

Proof of Theorem 3.1
The proof strongly relies on the localization principle and its application by Blanchard, Bousquet and Massart [4]. We start with the following definition.

Definition 5.1. Let $\Phi$ be a real-valued function. $\Phi$ is called a sub-$\alpha$ function if and only if $\Phi$ is non-decreasing and the map $x \mapsto \Phi(x)/x^\alpha$ is non-increasing.

The next theorem is an adaptation of the result of Blanchard, Bousquet and Massart [4, Theorem 6.1]. For the sake of clarity, its proof is given in Subsection 5.2.
Theorem 5.1. Let $\mathcal{F}$ be a class of bounded measurable functions. Let $K$ be a positive constant and $\Phi$ a sub-$\alpha$ function, $\alpha \in [1/2, 1[$. Then there exists a constant $C(\alpha)$ such that, if $D$ is a constant satisfying $D \leq 6KC(\alpha)$ and $r^*$ is the unique solution of the equation $\Phi(r) = r/D$, the following holds: for all $x > 0$, with probability larger than $1 - e^{-x}$, an oracle-type inequality holds uniformly over $\mathcal{F}$.

This theorem emphasizes the fact that if we are able to control the variance, and the complexity term controlled by the variance, we can get a possibly interesting oracle inequality. Obviously the main point is to find a suitable control function for the variance of the process. Here the class of interest is $\{\gamma(c, \cdot) - \gamma(c^*(c), \cdot)\,:\, c \in (B(0, 1))^k\}$. According to Section 3, the relevant control function for the variance of the process $\gamma(c, \cdot) - \gamma(c^*, \cdot)$ is proportional to $\|c - c^*\|^2$. Thus it remains to bound from above the quantity
$$\Phi(\delta) = \mathbb{E}\sup_{\|c - c^*(c)\|^2 \leq \delta} (P - P_n)\big(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)\big).$$
This is done in the following proposition.
Proposition 5.1. Suppose that $P$ satisfies Assumption 1. Then
$$\Phi(\delta) \leq C\sqrt{\frac{\delta}{n}},$$
where $C$ is a constant depending on $k$, $d$ and $P$.
We are now in a position to prove Theorem 3.1. Take $c^* = c^*(c)$, a nearest optimal vector of clusters to $c$, and use (H2) to connect $\|c - c^*(c)\|^2$ to $\ell(c, c^*(c))$. Introducing the explicit form $r^* = C^2D^2/n$, we get, with $K = 2A_1A_2$, $D = 6K$, and probability larger than $1 - e^{-x}$, an oracle inequality for $\hat{c}_n$. Observing that $P_n(\gamma(\hat{c}_n, \cdot) - \gamma(c^*(c), \cdot)) \leq 0$, and taking expectations, leads to, for all $c^* \in \mathcal{M}$,
$$\mathbb{E}\,\ell(\hat{c}_n, c^*) \leq \frac{C_0}{n}$$
for some constant $C_0 > 0$ depending only on $k$, $d$ and $P$.
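The explicit form of $r^*$ can be checked directly. With a complexity bound $\Phi(\delta) = C\sqrt{\delta/n}$, as provided by Proposition 5.1 (a sub-$\alpha$ function with $\alpha = 1/2$), the fixed-point equation $\Phi(r) = r/D$ gives

```latex
C\sqrt{\frac{r}{n}} = \frac{r}{D}
\quad\Longleftrightarrow\quad
\sqrt{r} = \frac{CD}{\sqrt{n}}
\quad\Longleftrightarrow\quad
r^* = \frac{C^2 D^2}{n},
```

which is the $1/n$ scaling driving Theorem 3.1.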

Proof of Theorem 5.1
This proof is a modification of the proof of Blanchard, Bousquet and Massart [4, Theorem 6.1]. For $r \geq 0$, denote by $\mathcal{F}_r$ the subclass of $\mathcal{F}$ with variance at most $r$ (the localized class). We start with a modified version of the so-called peeling lemma.

Lemma 5.2. Under the assumptions of Theorem 5.1, there exists a constant $C(\alpha)$, depending only on $\alpha$, such that, for all $r > 0$, the expected supremum of the empirical process over $\mathcal{F}_r$ is controlled in terms of $\Phi(r)$.

Proof of Lemma 5.2. Let $x > 1$ be a real number, and peel the class according to the variance, into shells of the form $\{f : rx^{k} \leq \operatorname{Var} f \leq rx^{k+1}\}$. Because $0 \in \operatorname{Conv}(\mathcal{F})$, the supremum over $\mathcal{F}$ is bounded by the sum of the suprema over the shells. Taking expectations on both sides, and recalling that $\Phi$ is a sub-$\alpha$ function, we may write $\Phi(rx^{k+1}) \leq x^{\alpha(k+1)}\Phi(r)$. Summing the resulting geometric series proves the result.
We are now in a position to prove Theorem 5.1. Using the inequality of Talagrand for a supremum of bounded variables offered by Bousquet [5], we have, with probability larger than $1 - e^{-x}$, a deviation bound for the localized supremum. Combining this bound with Lemma 5.2 provides, for such an $r$, an upper bound on $V_r$.
We want to find a suitable $r$ such that $r \geq r^*$ and $V_r \leq 1/K$. To this aim, using the previous upper bound on $V_r$, it suffices to take $r$ larger than both $r^*$ and a threshold depending on $3KA_1$. It remains to check that this choice is compatible with the condition on $D$. Thus we deduce that, if $D \leq 6KC(\alpha)$, this choice of $r$ gives $V_r \leq 1/K$, and, consequently, the announced inequality.
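For reference, one standard form of Bousquet's version of Talagrand's inequality, used at the start of this proof, reads as follows (stated here in a common normalization, which may differ slightly from the exact variant used above): if $\sup_{f \in \mathcal{F}}\|f\|_\infty \leq b$ and $\sup_{f \in \mathcal{F}}\operatorname{Var} f \leq \sigma^2$, then, for all $x > 0$, with probability at least $1 - e^{-x}$,

```latex
\sup_{f \in \mathcal{F}} (P - P_n)f
\;\leq\;
\mathbb{E}\sup_{f \in \mathcal{F}} (P - P_n)f
+ \sqrt{\frac{2x\big(\sigma^2 + 2b\,\mathbb{E}\sup_{f \in \mathcal{F}}(P - P_n)f\big)}{n}}
+ \frac{bx}{3n}.
```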

Proof of Proposition 5.1
Using the differentiability of $P\gamma(c, \cdot)$, we get, for any $c \in (\mathbb{R}^d)^k$ and $c^* \in \mathcal{M}$, a first-order expansion of $\gamma(c, \cdot) - \gamma(c^*, \cdot)$, where, with Pollard's [16] notation, $\Delta(c^*, \cdot)$ denotes the gradient term. Observe that, because $\mathcal{M}$ is a finite set, the remainder is under control by the dominated convergence theorem. Splitting the expectation into two parts, we obtain inequality (1), with a dominant term A and a remainder term B.

Term A: complexity of the model
Term A in inequality (1) is at first sight the dominant term in the expression of $\Phi(\delta)$. The upper bound we obtain below is rather accurate, due to the finite-dimensional Euclidean space structure. Indeed, we have to bound a scalar product when the vectors are contained in a ball; thus it is easy to see that the largest value of the product matches in fact the largest value of the coordinates of the gradient term. We recall that $\mathcal{M}$ denotes the finite set of optimal vectors of clusters. Let $x = (x_1, \dots, x_k)$ be a vector in $(\mathbb{R}^d)^k$. We denote by $x_{ir}$ the $r$-th coordinate of $x_i$, and name it the $(i, r)$-th coordinate of $x$. We may restrict attention to the $c^*$ achieving the maximum, and thus having the largest possible coordinate absolute value for $(P_n - P)(-\Delta(c^*))$. Moreover, we denote by $(j, r)$ the coordinate with this largest possible absolute value, by $\varepsilon$ the sign of the $(j, r)$-th coordinate, and set $c_{j,r,\varepsilon,c^*} = c^* + e_{j,r,\varepsilon}$, where $e_{j,r,\varepsilon}$ is the vector with $\varepsilon\sqrt{\delta}$ in its $(j, r)$ coordinate and 0 elsewhere. Therefore we can reduce the set of the $c$'s of interest to a finite set. Taking into account that, for every $c^*$ in $\mathcal{M}$, $P\Delta(c^*, \cdot) = 0$, and that, for every fixed $c$ and $c^*$, the quantity $\langle c - c^*, P_n(-\Delta(c^*, \cdot))\rangle$ is a sub-Gaussian random variable with variance factor $16\delta/n$, we get an upper bound on term A by a maximal inequality (Massart, [12, Part 6.1]). Therefore, the expected dominant term involves the complexity of the model in a way which is proportional to the square root of this complexity. In our case, the complexity is the dimension of the space of vectors of clusters.
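The maximal inequality invoked here is the standard bound for a finite maximum of sub-Gaussian variables (see Massart [12, Part 6.1]): if $Z_1, \dots, Z_N$ are centered sub-Gaussian random variables with common variance factor $v$, then

```latex
\mathbb{E}\max_{i=1,\dots,N} Z_i \;\leq\; \sqrt{2 v \log N}.
```

Applied with $v = 16\delta/n$ and $N$ the (finite) number of triplets $(j, r, \varepsilon)$ paired with elements of $\mathcal{M}$, this yields a bound proportional to $\sqrt{\delta/n}$ times the square root of the logarithm of the model complexity.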

Bound on B
To bound the second term in inequality (1), we follow the approach of Pollard [16], using complexity arguments such as Dudley's entropy integral. Let $\mathcal{F}$ be a set of functions defined on $\mathcal{X}$ with envelope $F$. Let $S$ be a finite set and $f$ a function. We denote $\|f\|_{l^2(S)} = \big(\frac{1}{n}\sum_{x \in S} f^2(x)\big)^{1/2}$, where $n = \operatorname{Card}(S)$, and by $N_{\mathcal{F}}(\varepsilon, S, \mathcal{F})$ the smallest integer $m$ such that there exist $m$ functions $\phi_1, \dots, \phi_m$ on $\mathcal{X}$ satisfying $\min_{i=1,\dots,m} \|f - \phi_i\|^2_{l^2(S)} \leq \varepsilon^2 \|F\|^2_{l^2(S)}$. According to [16] and [15, Theorem 7], for the class of functions at hand there exists $C > 0$, depending on $k$ and $d$, such that $F(x) = C(1 + \|x\|)$ is an envelope for $\mathcal{F}$. Furthermore, for this envelope, the covering numbers satisfy $N_{\mathcal{F}}(\varepsilon, S, \mathcal{F}) \leq A\varepsilon^{-W}$, where $A$ is a positive constant and $W$ depends only on the pseudo-dimension of $\mathcal{F}$, in a way which will not be described here (see the result of Pollard [15, Theorem 7]). We will use a classical chaining argument to bound term B. Let $\mathbf{c}$ denote the pair $(c, c^*) \in (B(0, 1))^k \times \mathcal{M}$. For brevity, let $f_{\mathbf{c}}$ denote the function $R(\cdot, c^*, c - c^*)$. We set $\varepsilon_0 = 1$ and $\varepsilon_j = 2^{-j}\varepsilon_0$.
For any $f_{\mathbf{c}}$, let $f_{\mathbf{c},j}$ be a function such that $\|f_{\mathbf{c}} - f_{\mathbf{c},j}\|^2_{l^2(S)} \leq \varepsilon_j^2 \|F\|^2_{l^2(S)}$ for every finite set $S$. Since Assumption 1 holds, $F$ is bounded from above by a constant $C_F$. By the dominated convergence theorem, we have $f_{\mathbf{c},j} \to f_{\mathbf{c}}$ in $L^1$ and almost surely as $j \to \infty$, which validates the chaining decomposition. Using a symmetrization inequality and introducing Rademacher random variables $\sigma_i$ ($\sigma_i = \pm 1$ with probability $1/2$), we get, for the first term, a bound involving a constant $\kappa_A$ depending on $k$, $d$ and $P$; here we used the maximal inequality, for random processes depending only on Rademacher variables, given by Massart [12, Part 6.1]. It remains to bound the second term. Using the same approach (symmetrization and a maximal inequality for Rademacher variables), we get a bound for every level $j > 0$. Comparing the resulting sum with an integral, and using the assumption on $m(\varepsilon)$, the second term can be bounded from above by $\kappa_B\sqrt{\delta/n}$, where $\kappa_B$ depends on $k$, $d$ and $P$. We are now in a position to prove Proposition 5.1: combining the two subsections above yields $\Phi(\delta) \leq C\sqrt{\delta/n}$, which concludes the proof.
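The chaining step follows the usual Dudley entropy-integral pattern (recalled here informally): summing the symmetrized levels $j = 1, 2, \dots$ and comparing the resulting sum with an integral gives a bound of the form

```latex
\mathbb{E}\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i)
\;\lesssim\;
\frac{\|F\|_{l^2(S)}}{\sqrt{n}}
\int_0^1 \sqrt{\log N_{\mathcal{F}}(\varepsilon, S, \mathcal{F})}\, d\varepsilon,
```

and the polynomial covering-number bound $N_{\mathcal{F}}(\varepsilon, S, \mathcal{F}) \leq A\varepsilon^{-W}$ makes the entropy integral finite, whence the $1/\sqrt{n}$ behavior of term B.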

Proof of Theorem 3.2
Let $x = (x_1, \dots, x_k)$ be a $k \times d$ vector and $V_1, \dots, V_k$ the Voronoi diagram associated with an optimal vector of clusters $c^*$. We state here a sufficient condition for the Hessian matrix $H(c^*)$ to be positive definite. The support of $P$ is included in $B(0, 1)$; thus we can replace $\partial(V_i \cap V_j)$ with $\partial(V_i \cap V_j) \cap B(0, 1)$ in the equations above. However, to lighten notation, we will omit this indication and implicitly assume that every set we consider is contained in $B(0, 1)$. Let $p_{i,j} = \int_{\partial(V_i \cap V_j)} f(u)\,du$ be the $(d-1)$-dimensional $P$-measure of the boundary between $V_i$ and $V_j$. Recalling that the underlying norm is the Euclidean norm, even for matrices, and using the elementary inequality $2\langle x_i, x_j\rangle \leq \|x_i\|^2 + \|x_j\|^2$ together with $B = \inf_{i \neq j}\|c^*_i - c^*_j\|$, we can bound the off-diagonal contributions in terms of the $p_{i,j}$'s. The last step is to derive bounds on the $p_{i,j}$'s from conditions on $f$. Denote $\lambda = \|f\|_\infty$; we see that $V_i$ is a regular convex set included in $B(c^*_i, 2)$. Therefore, by a direct application of Stokes' theorem, the surface of $\partial V_i$ is smaller than the surface of $S_{d-1}(c^*_i, 2)$ (the sphere of radius 2). Consequently, a small enough value of $\lambda$ is enough to ensure that the Hessian matrix $H(c^*)$ is positive definite.

Proof of Proposition 4.4
We take a distribution uniform over small balls that are far from one another. Denote by $V_i$ the Voronoi cell associated with $z_i$ in $(z_1, \dots, z_k)$. Let $Q$ be a $k$-quantizer and $Q^*$ the expected optimal quantizer, which maps $V_i$ to $z_i$ for all $i$; here $S$ and $V$ denote the unit surface and the volume of the unit ball in $\mathbb{R}^d$. Let $i$ be an integer between 1 and $k$, let $m^{in}_i$ be the number of clusters of $Q$ lying in $V_i$, and let $m^{out}_i$ be the number of images of $V_i$ sent outside $V_i$. The three situations of interest are whether $V_i$ contains zero, one, or at least two clusters of $Q$. If $m^{in}_i \geq 2$, then at least two clusters of $Q$ lie in $V_i$; therefore, there exists $j$ such that no cluster of $Q$ lies in $V_j$, so that $m^{out}_j \geq 1$. We straightforwardly deduce that the number of cells $V_i$ for which $m^{in}_i \geq 2$ is smaller than the number of cells for which $m^{out}_j \geq 1$. Taking into account all contributions of the Voronoi cells, we deduce a sufficient condition to get $R(Q) \geq R(Q^*)$.

Proof of Proposition 4.3
Using the same method as in the proof of Proposition 4.4, we prove that, for $\eta = 2$ and $R = 10$, the optimal vector of clusters is $(\frac{\eta}{2}, \frac{\eta}{2} + R, \frac{\eta}{2} + 2R)$. Thus the density $f$ vanishes on each boundary of every Voronoi cell of the optimal vector of centroids. Consequently Assumption 2 is satisfied. For $n$ large enough,
$$P\gamma(c_n, \cdot) = \frac{1}{3\eta}\int_0^{\eta} x^2\,dx + \frac{1}{3\eta}\int_R^{R+\eta} x^2\,dx + q\int_{\eta/2+2R-1}^{n/2} x^2 e^{-x}\,dx + q\int_{n/2}^{(n^2+n)/2} (x - n)^2 e^{-x}\,dx + q\int_{(n^2+n)/2}^{+\infty} (x - n^2)^2 e^{-x}\,dx.$$
Hence, by the dominated convergence theorem for the first three terms of the right-hand side, and through direct computation for the remaining terms, $P\gamma(c_n, \cdot) \to P x^2 = \mathbb{E}X^2$ as $n \to \infty$.

Proof of Proposition 4.5
We begin with a lemma which ensures that every possible optimal centroid $c^*_i$ is close to at least one mean $m_j$ of the mixture when the ratio $p_{\min}/p_{\max}$ is large enough.

Lemma 5.3. Let $c^*$ be an optimal vector of clusters. Suppose that $p_{\min}/p_{\max}$ is large enough, in a sense involving $(1 - \varepsilon)$, $B^2$ and $(1 - e^{-B^2/288\sigma^2})$. Then every mean $m_j$ of the mixture lies within distance $B/6$ of some optimal cluster.
Let $c$ be a vector of clusters such that there exists $j$ satisfying, for all $i = 1, \dots, k$, $\|m_j - c_i\| > B/6$. We will prove that $P\gamma(c, \cdot) > P\gamma(m, \cdot)$, which implies that $c \notin \mathcal{M}$. In fact we have, for all $i = 1, \dots, k$ and all $x \in B(m_j, B/12)$, $\|x - c_i\| > B/12$. This yields a lower bound on $P\gamma(c, \cdot)$. Hence we deduce that every optimal vector of clusters has a centroid at distance at most $B/6$ from every mean $m_j$ of the mixture.
Suppose that the ratio $p_{\min}/p_{\max}$ satisfies the assumption of Proposition 4.5. In particular, $p_{\min}/p_{\max}$ satisfies the assumption of Lemma 5.3.
Then we deal with the left-hand side. Let $x$ be at distance at least $B/4$ from every $m_i$.

Proposition 4.4.
Let $z_1, \dots, z_k$ be vectors in $\mathbb{R}^d$. Let $\rho$ be a positive number and $R = \inf_{i \neq j} \|z_i - z_j\|$ the smallest possible distance between these vectors. Let the distribution $P$ be defined as follows: for all $i \in \{1, \dots, k\}$,