Improved Convergence Guarantees for Learning Gaussian Mixture Models by EM and Gradient EM

We consider the problem of estimating the parameters of a Gaussian Mixture Model with K components of known weights, all with an identity covariance matrix. We make two contributions. First, at the population level, we present a sharper analysis of the local convergence of EM and gradient EM, compared to previous works. Assuming a separation of $\Omega(\sqrt{\log K})$, we prove convergence of both methods to the global optimum from an initialization region larger than those of previous works. Specifically, the initial guess of each component can be as far as (almost) half its distance to the nearest Gaussian. This is essentially the largest possible contraction region. Our second contribution consists of improved sample size requirements for accurate estimation by EM and gradient EM. In previous works, the required number of samples had a quadratic dependence on the maximal separation between the K components, and the resulting error estimate increased linearly with this maximal separation. In this manuscript we show that both quantities depend only logarithmically on the maximal separation.


INTRODUCTION
Gaussian mixture models (GMMs) are a widely used statistical model going back to Pearson [15]. In a GMM, each sample $x \in \mathbb{R}^d$ is drawn from one of $K$ components according to mixing weights $\pi_1, \ldots, \pi_K > 0$ with $\sum_{i=1}^{K} \pi_i = 1$. Each component follows a Gaussian distribution with mean $\mu_i^* \in \mathbb{R}^d$ and covariance $\Sigma_i \in \mathbb{R}^{d \times d}$. In this work, we focus on the important special case of $K$ spherical Gaussians with identity covariance matrix, with a corresponding density function
$$f(x) = \sum_{i=1}^{K} \frac{\pi_i}{(2\pi)^{d/2}} \exp\left( -\frac{\|x - \mu_i^*\|^2}{2} \right). \qquad (1)$$
For simplicity, as in [22, 23], we assume the weights $\pi_i$ are known.

* This research was partially supported by the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center. This research was also partially supported by a research grant from the Estate of Tully and Michele Plesser.
Given $n$ i.i.d. samples from the distribution (1), a fundamental problem is to estimate the mean vectors $\mu_i^*$ of the $K$ components. Beyond the number of components $K$ and the dimension $d$, the difficulty of this problem is characterized by the following key quantities: the smallest and largest separations between the cluster centers,
$$R_{\min} = \min_{i \neq j} \|\mu_i^* - \mu_j^*\|, \qquad R_{\max} = \max_{i \neq j} \|\mu_i^* - \mu_j^*\|,$$
and the minimal and maximal weights and their ratio,
$$\pi_{\min} = \min_{i \in [K]} \pi_i, \qquad \pi_{\max} = \max_{i \in [K]} \pi_i, \qquad \theta = \frac{\pi_{\max}}{\pi_{\min}}. \qquad (3)$$
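To fix ideas, the following short sketch (our own illustration, not code from the paper; the helper name `key_quantities` is ours) computes these quantities from a given set of centers and weights.

```python
import numpy as np

def key_quantities(mu_star, pi):
    """Compute pi_min, pi_max, theta and the separations R_i, R_min, R_max.

    mu_star: (K, d) array of true centers mu*_1, ..., mu*_K.
    pi:      (K,) array of mixing weights summing to one.
    """
    pi_min, pi_max = pi.min(), pi.max()
    theta = pi_max / pi_min
    # Pairwise distances R_ij = ||mu*_i - mu*_j||.
    R = np.linalg.norm(mu_star[:, None, :] - mu_star[None, :, :], axis=-1)
    np.fill_diagonal(R, np.inf)          # exclude R_ii when taking minima
    R_i = R.min(axis=1)                  # distance of mu*_i to the nearest other center
    return pi_min, pi_max, theta, R_i, R_i.min(), R[np.isfinite(R)].max()
```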
In principle, one could estimate $\mu_i^*$ by maximizing the likelihood of the observed data. However, as the log-likelihood is non-concave, this problem is computationally challenging. A popular alternative approach is based on the EM algorithm [7] and variants thereof, such as gradient EM. These iterative methods require an initial guess $(\mu_1, \ldots, \mu_K)$ of the $K$ cluster centers. Classical results show that regardless of the initial guess, the values of the likelihood function after each EM iteration are non-decreasing. Furthermore, under fairly general conditions, the EM algorithm converges to a stationary point or a local optimum [21, 19]. The success of these methods in converging to an accurate solution depends critically on the accuracy of the initial guess [10].
In this work we study the ability of the popular EM and gradient EM algorithms to accurately estimate the parameters of the GMM in (1). Two quantities of particular interest are: (i) the size of the initialization region and the minimal separation that guarantee convergence to the global optimum, namely, how small $R_{\min}$ can be and how large $\|\mu_i - \mu_i^*\|$ can be while still ensuring convergence of EM to the global optimum in the population setting; and (ii) the required sample size, and its dependence on the problem parameters, that guarantees that EM finds accurate solutions with high probability.
We make the following contributions. First, we present an improved analysis of the local convergence of EM and gradient EM at the population level, assuming an infinite number of samples. In Theorems 3.1 and 3.2 we prove their convergence under the largest possible initialization region, while requiring a separation $R_{\min} = \Omega(\sqrt{\log K})$. For example, consider the case of equal weights $\pi_i = 1/K$, and an initial guess that satisfies $\|\mu_i - \mu_i^*\| \le \lambda \min_{j \neq i} \|\mu_j^* - \mu_i^*\|$ for all $i$, with $\lambda < 1/2$. Then a separation $R_{\min} \ge C(\lambda) \sqrt{\log K}$, with an explicit $C(\lambda)$, suffices to ensure that the population EM and gradient EM algorithms converge to the true means at a linear rate.
Let us compare our results to several recent works that derived convergence guarantees for EM and gradient EM. [23] and [22] proved local convergence to the global optimum under a much larger minimal separation of $R_{\min} \ge C \sqrt{\min(d, K) \log K}$. In addition, their requirement on the initial estimates had a dependence on the maximal separation, $\|\mu_i - \mu_i^*\| \le \frac{1}{2} R_{\min} - C_1 \sqrt{\min(d, K) \log \max(R_{\max}, K^3)}$, for a universal constant $C_1$. These results were significantly improved by [13], who proved the local convergence of the EM algorithm for the more general case of spherical Gaussians with unknown weights and variances. They required a far less restrictive minimal separation $R_{\min} \ge C \sqrt{\log K}$, with a constant $C \ge 64$, and their initialization was restricted to $\lambda < \frac{1}{16}$. We should note that no particular effort was made in [13] to optimize these constants. In comparison to these works, we allow the largest possible initialization region $\lambda < \frac{1}{2}$, with no dependence on $R_{\max}$. Also, for small values $\lambda \le 1/16$, our resulting constant $C$ is roughly 6 times smaller than that of [13].
Our second contribution concerns the required sample size to ensure accurate estimation by the EM and gradient EM algorithms. Recently, [13] proved that with a number of samples $n = \widetilde\Omega(d/\pi_{\min})$, a sample-splitting variant of EM is statistically optimal. In this variant, the $n$ samples are split into $B$ distinct batches, with each EM iteration using a separate batch. In contrast, for the standard EM and gradient EM algorithms, weaker results have been established so far. Currently, the best known sample requirement for EM is $n = \widetilde\Omega(K^3 d R_{\max}^2 / R_{\min}^2)$, whereas for gradient EM it is $n = \widetilde\Omega(K^6 d R_{\max}^6 / R_{\min}^2)$. In addition, the bounds for the resulting errors increase linearly with $R_{\max}$; see [22, 23]. Note that in these two results, the required number of samples increases at least quadratically with the maximal separation between clusters, even though increasing $R_{\max}$ should make the problem easier. In Theorems 3.3 and 3.4, we prove that for an initialization region with parameter $\lambda$ strictly smaller than half, the EM and gradient EM algorithms yield accurate estimates with sample size $\widetilde\Omega(K^3 d)$. In particular, both our sample size requirements and our bounds on the errors of EM and gradient EM have only a logarithmic dependence on $R_{\max}$.
Our results on the initialization region and minimal separation stem from a careful analysis of the weights in the EM update and their effect on the estimated cluster centers. Similarly to [13], we upper bound the expectation of the $i$-th weight when the data is drawn from a different component $j \neq i$, and show that it is exponentially small in the distance between the centers of the $i$-th and $j$-th components. We make use of the fact that all Gaussians have the same covariance to reduce the expectation to one dimension and directly upper bound the resulting one-dimensional integral. This allows us to derive a sharper bound than that of [13], from which we obtain a larger contraction region for the population EM and gradient EM algorithms. Our analysis of the finite-sample behavior of EM and gradient EM follows the general strategy of [23]. Our improved results rely on tighter bounds on the sub-Gaussian norm of the weights in the EM update, which do not depend on the distance between the clusters.

Previous work
Over the past decades, several approaches to estimate the parameters of Gaussian mixture models have been proposed. In addition, many works derived theoretical guarantees for these methods, as well as information-theoretic lower bounds on the number of samples required for accurate estimation. Significant efforts were made toward understanding whether GMMs can be learned efficiently, both from a computational perspective, namely in polynomial run time, and from a statistical one, namely with a number of samples polynomial in the problem parameters.
Method-of-moments approaches [11, 14, 8] can accurately estimate the parameters of general GMMs with $R_{\min}$ arbitrarily small, at the cost of a sample complexity, and thus also a run time, that is exponential in the number of clusters. [9] showed that a method-of-moments type algorithm can recover the parameters of spherical GMMs with arbitrarily close cluster centers in polynomial time, under the additional assumption that the component centers are affinely independent. This assumption implies that $d \ge K$.
Methods based on dimensionality reduction [4, 1, 2, 12, 17] can accurately estimate the parameters of a GMM in time polynomial in the dimension and the number of clusters, under conditions on the separation of the clusters' centers. In particular, [17] proved that accurate recovery is possible with a minimal separation of $R_{\min} = \Omega(\min(K, d)^{1/4})$. In general, it is not possible to learn the parameters of a GMM with a number of samples that is polynomial in the number of clusters; see [14] for an explicit example. [16] showed that for any function $\gamma(K) = o(\sqrt{\log K})$ one can find two spherical GMMs, both with $R_{\min} = \gamma(K)$, such that no algorithm with polynomial sample complexity can distinguish between them. [16] also presented a variant of the EM algorithm that provably learns the parameters of a GMM with separation $\Omega(\sqrt{\log K})$, with polynomial sample complexity but run time exponential in the number of components.
More closely related to our manuscript are several works that studied the ability of EM and variants thereof to accurately estimate the parameters of a GMM. [5] showed that with a separation of $\Omega(d^{1/4})$, a two-round variant of EM produces accurate estimates of the cluster centers. A significant advance was made by [3], who developed new techniques to analyze the local convergence of EM for rather general latent variable models. In particular, for a GMM with $K = 2$ components of equal weights, they proved that the EM algorithm converges locally at a linear rate, provided that the distance between the components is at least some universal constant. These results were extended in [20] and [6], where a full description was given of the initialization region for which the population EM algorithm learns a mixture of any two equally weighted Gaussians. As already mentioned above, the three works directly related to ours, to which we compare in detail in Section 3, are [22], [23] and [13].

Notations
We write $X \sim \mathrm{GMM}(\mu^*, \pi)$ for a random variable with density given by Eq. (1). The distance between cluster means is denoted by $R_{ij} = \|\mu_i^* - \mu_j^*\|$, and we write $R_i = \min_{j \neq i} R_{ij}$ for the distance from $\mu_i^*$ to its nearest cluster center. The expectation of a function $f(X)$ with respect to $X$ is denoted by $E_X[f(X)]$, or, when clear from context, simply by $E[f(X)]$. For simplicity of notation, we shall write $E_i[f(X)]$ for the expectation with respect to $X \sim N(\mu_i^*, I_d)$. For a vector $v$ we denote by $\|v\|$ its Euclidean norm. For a matrix $A$, we denote its operator norm by $\|A\|_{op} = \max_{\|x\| = 1} \|Ax\|$. Finally, we denote by $\mu = (\mu_1, \ldots, \mu_K) \in \mathbb{R}^{Kd}$ the concatenation of $\mu_1, \ldots, \mu_K \in \mathbb{R}^d$.
As in previous works, we consider the following error measure for the quality of an estimate $\mu$ of the true means,
$$E(\mu) = \max_{i \in [K]} \|\mu_i - \mu_i^*\|.$$
We will see that in the population case we can restrict our analysis to the space spanned by the $K$ true cluster means and the $K$ cluster estimates. It will therefore be convenient to define $d_0 = \min(d, 2K)$. For any $0 < \lambda < \frac{1}{2}$ we define the region
$$U_\lambda = \left\{ \mu = (\mu_1, \ldots, \mu_K) \in \mathbb{R}^{Kd} \,:\, \|\mu_i - \mu_i^*\| \le \lambda R_i \text{ for all } i \in [K] \right\}.$$
For future use, we define in Eq. (5) a function $c(\lambda)$ which will play a key role in our analysis.
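As a small illustration of these definitions (our own sketch; the function names are ours), the following computes $E(\mu)$ and tests membership in $U_\lambda$.

```python
import numpy as np

def error(mu, mu_star):
    """The error measure E(mu) = max_i ||mu_i - mu*_i||."""
    return np.linalg.norm(mu - mu_star, axis=1).max()

def in_U_lambda(mu, mu_star, lam):
    """Check whether ||mu_i - mu*_i|| <= lam * R_i for all i, i.e. mu in U_lambda."""
    R = np.linalg.norm(mu_star[:, None, :] - mu_star[None, :, :], axis=-1)
    np.fill_diagonal(R, np.inf)
    R_i = R.min(axis=1)
    return bool(np.all(np.linalg.norm(mu - mu_star, axis=1) <= lam * R_i))
```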

Population and Sample EM
Given an estimate $(\mu_1, \ldots, \mu_K)$ of the $K$ centers, for any $x \in \mathbb{R}^d$ and $i \in [K]$ let
$$w_i(x, \mu) = \frac{\pi_i \exp(-\|x - \mu_i\|^2/2)}{\sum_{k=1}^{K} \pi_k \exp(-\|x - \mu_k\|^2/2)}. \qquad (6)$$
The population EM update, denoted by $\mu^+ = (\mu_1^+, \ldots, \mu_K^+)$, is given by
$$\mu_i^+ = \frac{E_X[w_i(X, \mu) X]}{E_X[w_i(X, \mu)]}. \qquad (7)$$
The population gradient EM update with a step size $s > 0$ is defined by
$$\mu_i^+ = \mu_i + s\, E_X[w_i(X, \mu)(X - \mu_i)]. \qquad (8)$$
Given an observed set of $n$ samples $X_1, \ldots, X_n \sim X$, the sample EM and sample gradient EM updates follow by replacing the expectations in (7) and (8) with their empirical counterparts. For EM, the update is
$$\mu_i^+ = \frac{\sum_{\ell=1}^{n} w_i(X_\ell, \mu) X_\ell}{\sum_{\ell=1}^{n} w_i(X_\ell, \mu)}, \qquad (9)$$
and for gradient EM,
$$\mu_i^+ = \mu_i + \frac{s}{n} \sum_{\ell=1}^{n} w_i(X_\ell, \mu)(X_\ell - \mu_i). \qquad (10)$$
In this work, we study the convergence of EM and gradient EM, both in the population setting and with a finite number of samples. In particular, we are interested in sufficient conditions on the initialization and on the separation of the GMM components that ensure convergence to accurate solutions.
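For concreteness, here is a minimal numpy sketch of the weights (6) and of the sample updates (9) and (10). It is our own illustration under the model of Eq. (1), not the authors' code; in line with the paper, the weights $\pi$ are assumed known and are not updated.

```python
import numpy as np

def weights(X, mu, pi):
    """Posterior weights w_i(x, mu) of Eq. (6), returned as an (n, K) array."""
    # ||x - mu_k||^2, expanded to avoid forming an (n, K, d) tensor.
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ mu.T + (mu**2).sum(1)[None, :]
    log_w = np.log(pi)[None, :] - 0.5 * sq
    log_w -= log_w.max(axis=1, keepdims=True)     # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def em_update(X, mu, pi):
    """One sample EM update, Eq. (9): weight-normalized means of the data."""
    w = weights(X, mu, pi)                        # shape (n, K)
    return (w.T @ X) / w.sum(axis=0)[:, None]

def gradient_em_update(X, mu, pi, s):
    """One sample gradient EM update, Eq. (10), with step size s."""
    w = weights(X, mu, pi)
    grad = (w.T @ X) / len(X) - w.mean(axis=0)[:, None] * mu
    return mu + s * grad
```

Running either update repeatedly from an initial guess $\mu \in U_\lambda$ is the iteration analyzed in the remainder of the paper.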

Population EM
As in previous works, we first study the convergence of EM in the population case and then build upon this analysis to study the finite-sample setting. Informally, our main result in this section is that for any fixed $\lambda \in (0, \frac{1}{2})$ and an initial estimate $\mu \in U_\lambda$, there exists a constant $C(\lambda)$ such that for any mixture with $R_{\min} \ge C(\lambda) \sqrt{\log \frac{1}{\pi_{\min}}}$, the estimation error of a single population EM update (7) decreases by a multiplicative factor strictly less than 1. This, in turn, implies convergence of the population EM to the global optimal solution $\mu^*$. Formally, our result is stated in the following theorem.
Theorem 3.1. Fix $\lambda \in (0, \frac{1}{2})$ and let $X \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying the separation condition (11), where $c(\lambda)$ and $\theta$ are as defined in (5) and (3), respectively. Then for any $\mu \in U_\lambda$ it holds that $E(\mu^t) \le \frac{1}{2^t} E(\mu)$, where $\mu^t$ is the $t$-th iterate of the population EM update (7) initialized at $\mu$.
We derive a similar result for gradient EM.
Theorem 3.2. Fix $\lambda \in (0, \frac{1}{2})$ and let $X \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying (11). Then for any step size $s \in \left(0, \frac{1}{\pi_{\min}}\right]$ and any $\mu \in U_\lambda$ it holds that $E(\mu^t) \le \gamma^t E(\mu)$, where $\mu^t$ is the $t$-th iterate of the population gradient EM update (8) with step size $s$ and $\gamma = 1 - \frac{3}{8} s \pi_{\min}$.

The proof of Theorem 3.1 appears in Section 4, with the technical details deferred to the appendix. The proof of Theorem 3.2 is similar and appears in full in the appendix.
It is interesting to compare Theorems 3.1 and 3.2 to several recent works, in terms of both the size of the initialization region and the requirements on the minimal separation. [22] and [23] assumed a separation $R_{\min} = \Omega(\sqrt{d_0 \log K})$ and proved local convergence of the gradient EM and of the EM algorithm, respectively, for an initialization region of the form $\|\mu_i - \mu_i^*\| \le \frac{1}{2} R_{\min} - C_1 \sqrt{\min(d, K) \log \max(R_{\max}, K^3)}$, with $C_1$ a universal constant. Recently, [13] significantly improved these works, proving convergence of population EM with a much smaller separation $R_{\min} \ge 64 \sqrt{\log(\theta K)}$. Moreover, they considered the more general and challenging case where the Gaussians may have different variances, and the EM algorithm estimates not only the Gaussian centers but also their weights and variances. However, they proved convergence only for an initialization region $U_\lambda$ with $\lambda \le \frac{1}{16}$.
Our results improve upon these works in several aspects. First, in comparison to the contraction region of [22], our theorem allows the largest possible initialization region $\|\mu_i - \mu_i^*\| < \frac{1}{2} R_i$, with no dependence on the other problem parameters $d_0$, $K$ and $R_{\max}$. This initialization region is optimal, as there exist GMMs and initializations $\mu$ with $\|\mu_i - \mu_i^*\| = \frac{1}{2} R_i$ such that the EM algorithm, even at the population level, will not converge to values that are close to the true parameters.
Second, in comparison to the result of [13], we allow $\lambda$ to be as large as $\frac{1}{2}$. Also, for $\lambda < \frac{1}{16}$, our requirement on $R_{\min}$ is nearly one order of magnitude smaller. For example, for a balanced mixture with $\pi_{\min} = \frac{1}{K}$ and an initialization region $\|\mu_i - \mu_i^*\| \le \frac{1}{16} R_i$, the right hand side of (11) leads to a separation requirement $R_{\min} \ge 10.3 \sqrt{\log K} + 6.6$, which is much smaller than $64 \sqrt{\log K}$.

We now remark on the necessity of our assumptions on the separation and initialization in Theorems 3.1 and 3.2. In general, given only a polynomial number of samples, a separation of $R_{\min} = \Omega(\sqrt{\log K})$ is necessary to accurately estimate the parameters of a GMM, regardless of the estimation method [16]. With infinitely many samples and sufficiently close initial estimates, the EM algorithm may still converge to the global optimum even with $\Omega(1)$ separation. However, to the best of our knowledge, a precise characterization of the region of attraction to the true parameters is still an open problem. Next, the separation requirement (11) in our theorems depends inversely on $c(\lambda)$. Therefore, as $\lambda \to \frac{1}{2}$ the requirement on $R_{\min}$ becomes more restrictive. Simulation results (see Figure 1b in Section 6) suggest that this dependence of the separation requirement on the initialization may be significantly relaxed. We conjecture that the EM and gradient EM algorithms converge whenever $R_{\min} \ge C \sqrt{\log \frac{K}{\pi_{\min}}}$ for a universal constant $C$ and $\|\mu_i - \mu_i^*\| < \frac{1}{2} R_i$.

Sample EM
We now present our results on the EM and gradient EM algorithms for the finite sample case.
Theorem 3.3. Fix $\lambda \in (0, \frac{1}{2})$ and $\delta \in (0, 1)$, and let $X_1, \ldots, X_n \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying (11). Suppose that $n$ is sufficiently large so that the sample size condition (12) holds. Assume an initial estimate $\mu \in U_\lambda$ and let $\mu^t$ be the $t$-th iterate of the sample EM update (9). Then with probability at least $1 - \delta$, for all iterations $t$, $\mu^t \in U_\lambda$ and the error bound (13) holds, for a suitable absolute constant $C_1$.
Theorem 3.4. Fix $\lambda \in (0, \frac{1}{2})$ and $\delta \in (0, 1)$, and let $X_1, \ldots, X_n \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying (11). Set $s \in \left(0, \frac{1}{\pi_{\min}}\right]$ and suppose that $n$ is sufficiently large so that the sample size condition (14) holds. Assume an initial estimate $\mu \in U_\lambda$ and let $\mu^t$ be the $t$-th iterate of the sample gradient EM update (10) with step size $s$. Then with probability at least $1 - \delta$, $\mu^t \in U_\lambda$ for all $t$, and the error bound (15) holds, where $\gamma = 1 - \frac{3}{8} s \pi_{\min}$ and $C_1$ is a suitable absolute constant.

The main idea in the proofs of Theorems 3.3 and 3.4 is to show the uniform convergence, inside the initialization region $U_\lambda$, of the sample update to the population update. The sample size requirements (12) and (14) are such that the resulting error of a single update of the EM and gradient EM algorithms is sufficiently small to ensure that the updated means remain in the contraction region $U_\lambda$. This, combined with the convergence of the population update, yields the required result. We outline the main steps of the proof in Section 5, with more technical details deferred to the appendix.
Let us compare Theorems 3.3 and 3.4 to previous results, in terms of the required sample size and the bounds on the estimation error. The strongest result to date, due to [13], considered a variant of the EM algorithm whereby the samples are split into $B$ separate batches, and at each iteration $t$ (with $1 \le t \le B$), the sample EM algorithm is run using only the data of the $t$-th batch. They showed that to achieve an error $E(\mu^B) \le \epsilon$, the required sample size is $\widetilde\Omega\left(\frac{d}{\pi_{\min} \epsilon^2}\right)$. The best known bounds without sample splitting were derived by [22] and by [23]; the error guarantees there, for both gradient EM and EM, have a dependence on the maximal separation $R_{\max}$. In particular, even though intuitively, as $R_{\max}$ increases, the problem should become easier, these error bounds increase linearly with $R_{\max}$, and the required sample size increases at least quadratically with $R_{\max}$. In contrast, in our two theorems above there is a dependence on $\frac{1}{1 - 2\lambda}$, which is strictly smaller than $R_{\max}$ by the separation condition (11). Thus, for $\lambda$ bounded away from $\frac{1}{2}$, there is only a logarithmic dependence on $R_{\max}$. We believe that with further effort, the dependence on $R_{\max}$ can be fully eliminated.
We note that both the minimal sample size requirements in Equations (12) and (14) and the bounds on the error in Equations (13) and (15) could probably be improved. Indeed, [13] proved that the sample-splitting variant of the EM algorithm yields accurate estimates with only $\widetilde\Omega(Kd)$ samples. Numerical results (see Section 6) suggest that the error of the classical EM algorithm depends only on $\sqrt{Kd}$.

PROOF FOR THE POPULATION EM
Our strategy is similar to [22] and [23]: we bound the error of a single update in terms of the weights $w_j(X, \mu)$; the bounds, in turn, depend on the expectations of these weights with respect to the individual Gaussian components.
Our key result on the latter expectations is the following proposition, whose proof appears in the appendix.
Proposition 4.1. Fix $\lambda \in (0, \frac{1}{2})$, suppose that $X \sim N(\mu_i^*, I_d)$, and let $R_{\min}$ satisfy condition (16), where $\theta$ is defined in (3). Then for any $\mu \in U_\lambda$ and all $j \neq i$,
$$E_i[w_j(X, \mu)] \le \theta \exp\left( -c(\lambda) R_{ij}^2 \right), \qquad (17)$$
with $c(\lambda)$ defined in (5).

This proposition shows that $E_i[w_j(X, \mu)]$ is exponentially small in the separation $R_{ij}$ and is key to proving contraction of the EM and gradient EM updates. A similar result was proven in [13]. The main differences are that they assumed a smaller region, with $\lambda < \frac{1}{16}$, and obtained a looser exponential bound $\exp(-R_{ij}^2/64)$. On the other hand, they considered a more challenging case, where the weights $\pi_i$ and the variances of the $K$ Gaussian components are unknown and are also estimated by the EM procedure.
The key idea in proving Proposition 4.1 is that for $X \sim N(\mu_i^*, I_d)$ it suffices to analyze the random variable $w_j(X, \mu)$ on the one-dimensional space spanned by $\mu_i - \mu_j$. Thus, the expectation over a $d$-dimensional random vector is reduced to the expectation of an explicit function of a univariate standard Gaussian. An immediate corollary is the following lower bound on the expectation $E_i[w_i(X, \mu)]$.

Corollary 4.1.1. Under the same conditions as in Proposition 4.1, for any $\mu \in U_\lambda$,
$$E_i[w_i(X, \mu)] \ge 1 - \theta \sum_{j \neq i} \exp\left( -c(\lambda) R_{ij}^2 \right).$$
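The exponential decay in Proposition 4.1 is easy to probe numerically. The following Monte Carlo sketch (ours, purely illustrative) estimates $E_1[w_2(X, \mu^*)]$ for a two-component mixture with equal weights at several separations; the reference curve $e^{-R^2/8}$ is a heuristic for the decay rate at $\lambda = 0$, not the precise bound of Eq. (17).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200_000
for R in [2.0, 3.0, 4.0, 5.0]:
    mu1 = np.zeros(d)
    mu2 = np.zeros(d); mu2[0] = R                  # separation R_12 = R
    X = rng.standard_normal((n, d)) + mu1          # X ~ N(mu*_1, I_d)
    # For equal weights, w_2(X, mu*) is a logistic function of the gap
    # (||X - mu_2||^2 - ||X - mu_1||^2) / 2, which only depends on X[0].
    gap = ((X - mu2)**2).sum(1) - ((X - mu1)**2).sum(1)
    w2 = 1.0 / (1.0 + np.exp(0.5 * gap))
    print(f"R = {R:.0f}: E_1[w_2] ~ {w2.mean():.2e}, exp(-R^2/8) = {np.exp(-R**2/8):.2e}")
```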
Next, note that for $X \sim \mathrm{GMM}(\mu^*, \pi)$ it holds that $E_X[w_i(X, \mu^*)] = \pi_i$. Thus, for center estimates $\mu$ close to $\mu^*$ we expect that $E_X[w_i(X, \mu)] > \frac{3}{4} \pi_i$. This intuition is made precise in the following lemma, which follows readily from Corollary 4.1.1.
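The identity $E_X[w_i(X, \mu^*)] = \pi_i$ follows in one line from the definitions (1) and (6): denoting by $\varphi$ the standard Gaussian density in $\mathbb{R}^d$, the weight times the mixture density equals the $i$-th component's contribution, so that
$$E_X[w_i(X, \mu^*)] = \int \frac{\pi_i \varphi(x - \mu_i^*)}{\sum_k \pi_k \varphi(x - \mu_k^*)}\, f(x)\, dx = \int \pi_i \varphi(x - \mu_i^*)\, dx = \pi_i.$$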
Lemma 4.2. Let $X \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying condition (19). Then for any $i \in [K]$ and any $\mu \in U_\lambda$,
$$E_X[w_i(X, \mu)] \ge \tfrac{3}{4} \pi_i. \qquad (20)$$
Next, we turn to the numerator of the EM update and to the gradient of the weights $w_i$ with respect to $\mu$, whose explicit expressions are given in Eqs. (21) and (22). For future use we introduce the related quantities $V_{i,j}(\mu, v)$ and $V_{i,i}(\mu, v)$, defined in Eqs. (23) and (24). The following lemma, proved in the appendix, provides a bound on these quantities.
Lemma 4.3. Let $\mu, v \in U_\lambda$ with $R_{\min}$ satisfying (19). Then the bounds (25) and (26) on $V_{i,i}(\mu, v)$ and $V_{i,j}(\mu, v)$ hold, where $C$ is a universal constant; for example, we can take $C = 14$.
Expressions related to $V_{i,i}$ and $V_{i,j}$ were also studied by [22]. They required a much larger separation, $R_{\min} \ge C \sqrt{d_0 \log K}$, and their resulting bounds also involved $R_{\max}$.
Remark 4.1. In proving the convergence of EM, the quantities of interest are $V_{i,j}(\mu, \mu^*)$ and $V_{i,i}(\mu, \mu^*)$, whereas for the gradient EM algorithm the relevant quantities are $V_{i,j}(\mu, \mu)$ and $V_{i,i}(\mu, \mu)$. The reason for the effective dimension $d_0 = \min(d, 2K)$ is that for $d > 2K$, in the population setting, the EM update of $\mu$ always remains in the subspace spanned by the $2K$ vectors $\mu_1, \ldots, \mu_K, \mu_1^*, \ldots, \mu_K^*$. In the case of gradient EM, one may define a potentially smaller effective dimension $d_0 = \min(d, K)$.
Last but not least, the following auxiliary lemma shows that $\mu^*$ is a fixed point of the population EM update.

Lemma 4.4. For any $i \in [K]$ it holds that $E_X[w_i(X, \mu^*)(X - \mu_i^*)] = 0$.
With all the pieces in place, we are now ready to prove Theorem 3.1.
Proof of Theorem 3.1. Consider a single EM update, as given by Eq. (7). Using Lemma 4.4, we may write its numerator as in Eq. (27). By the mean value theorem, there exists $\mu^\tau$ on the line connecting $\mu$ and $\mu^*$ for which the expansion (28) holds. Inserting the expressions (21) and (22) for the gradient of $w_i$ into Eq. (28), taking expectations, and using the definitions of $V_{i,i}$ and $V_{i,j}$, Eqs. (23) and (24), gives Eq. (29). Since $\mu^\tau \in U_\lambda$, we may apply Lemma 4.3 to bound the terms on the right hand side of (29). Furthermore, given that $x^2 e^{-t x^2}$ is monotonically decreasing for all $x > \sqrt{1/t}$ and $R_i \ge 2/c(\lambda)$, we may replace all $R_i, R_j$ in these bounds by $R_{\min}$. Next, note that condition (11) on $R_{\min}$ implies that it also satisfies the weaker condition (19). If $d_0 \ge R_{\min}^2$, then for $E(\mu^+) \le \frac{1}{2} E(\mu)$ to hold, the minimal separation must satisfy the condition in Eq. (30). In contrast, if $R_{\min}^2 \ge d_0$, we obtain the inequality (31) for $w = \frac{c(\lambda)}{2} R_{\min}^2$. Note that for $w > 1$, the function $w e^{-w}$ is monotonically decreasing. Also, consider the value $w^* = 2 \log(2U/c(\lambda))$, which is larger than 1 given the definitions of $U$ and of $c(\lambda)$. It is easy to show that $w^* \exp(-w^*) \le c(\lambda)/2U$. Hence, a sufficient condition for (31) to hold is that $w > w^*$, namely Eq. (32). It is easy to verify that $\log U + \log(4/c(\lambda)) > \log d_0$, and thus the bound of (32) is more restrictive than (30). Inserting the expression for $U$ into Eq. (32) yields the condition of the theorem, Eq. (11). Finally, to complete the proof we need to show that for all $i$, $\|\mu_i^+ - \mu_i^*\| \le \lambda R_i$. This part is proven in auxiliary Lemma A.5 in the appendix.

PROOF FOR THE SAMPLE EM
In this section we prove our results on the sample EM and gradient EM algorithms. The main idea is to show concentration results for both the denominator and the numerator of the EM update. Our strategy is similar to [23], but with several improvements. First, our result on the concentration of the denominator of the EM update, Lemma 5.1, only considers samples from the $i$-th cluster. Thus, in Lemma 5.2, we obtain a uniform lower bound for the weight $w_i$ with $n = \widetilde\Omega(Kd/\pi_{\min})$, compared to the larger $n = \widetilde\Omega(Kd/\pi_{\min}^2)$ in [23]. Second, while [23] bounded the sub-Gaussian norm of the numerator of the EM update by $C R_{\max}$, we derive in Lemma 5.3 a tighter bound, which does not depend on $R_{\max}$. This, in turn, yields a tighter concentration bound for the numerator of the EM update in Lemma 5.4.
Our first result, Lemma 5.1, asserts that with probability at least $1 - \delta$ the concentration bound (33) holds. As we saw in Lemma 4.2, the denominator in the population EM update for the $i$-th mean is lower bounded by $\frac{3}{4} \pi_i$. We use Lemma 5.1 to show that this lower bound also holds in the finite-sample case. We remark that a version of the following lemma appeared in [23], but with a larger sample size requirement of $n = \widetilde\Omega(Kd/\pi_{\min}^2)$.
Lemma 5.2. Let $X_1, \ldots, X_n \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying (19), and assume a sufficiently large sample size $n$ such that
$$\frac{n}{\log n} > \frac{C K d \log(\widetilde{C}/\delta)}{\pi_{\min}}. \qquad (34)$$
Then the event $D_i$ occurs with probability at least $1 - \frac{\delta}{2K}$.

Next, we analyze the sub-Gaussian norm of $w_i(X, \mu)(X - \mu_i^*)$. [23] bounded this quantity by $C R_{\max}$. We present an improved bound which does not depend on $R_{\max}$. For the definition of the sub-Gaussian norm $\|\cdot\|_{\psi_2}$, see the appendix.
Lemma 5.3. Suppose that $\mu \in U_\lambda$. Then for any $i \in [K]$, the sub-Gaussian norm $\|w_i(X, \mu)(X - \mu_i^*)\|_{\psi_2}$ is bounded by a constant that does not depend on $R_{\max}$.

Using Lemma 5.3, we upper bound the concentration of the numerator in the expression for the error in the sample EM update, Eq. (9).
Lemma 5.4. Suppose that $\mu \in U_\lambda$ and that $R_{\min}$ satisfies condition (36). Then, with a suitable choice of a universal constant $C$, the event $N_i$ defined in (39) occurs with probability at least $1 - \frac{\delta}{2K}$.
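To see informally why no $R_{\max}$ factor is needed, consider the following Monte Carlo sketch (ours, purely illustrative; it tracks a tail quantile rather than the actual $\psi_2$ norm). Samples from far components have $\|X - \mu_1^*\|$ of order $R_{\max}$, but their weight $w_1$ is exponentially small, so the tails of $w_1(X, \mu^*)\, \|X - \mu_1^*\|$ do not grow with $R_{\max}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, K = 5, 200_000, 3
pi = np.full(K, 1.0 / K)
for R in [5.0, 10.0, 20.0, 40.0]:
    mu = np.zeros((K, d))
    mu[1, 0], mu[2, 0] = R, 2.0 * R                # centers on a line, R_max = 2R
    labels = rng.integers(0, K, size=n)
    X = rng.standard_normal((n, d)) + mu[labels]   # X ~ GMM(mu*, pi), weights at mu = mu*
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ mu.T + (mu**2).sum(1)[None, :]
    lw = np.log(pi) - 0.5 * sq
    lw -= lw.max(axis=1, keepdims=True)
    w = np.exp(lw); w /= w.sum(axis=1, keepdims=True)
    Y = w[:, 0] * np.linalg.norm(X - mu[0], axis=1)   # w_1(X) * ||X - mu*_1||
    print(f"R_max = {2 * R:5.0f}: 99.99% quantile of Y = {np.quantile(Y, 0.9999):.2f}")
```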
With all the pieces in place, we are now ready to prove Theorem 3.3.
Proof of Theorem 3.3. Consider the error of a single update (9) of the sample EM algorithm.
Note that the requirement (11) on $R_{\min}$ is more restrictive than (19), and the sample size requirement (12) is more restrictive than (34). Thus, we may invoke Lemma 5.2 and obtain that, with probability at least $1 - \frac{\delta}{2K}$, the event $D_i$ defined in (35) occurs. Hence, the error decomposes into a fluctuation term and a population term. It follows from Theorem 3.1 that for $R_{\min}$ satisfying (11), the second (population) term is upper bounded by $\frac{1}{2} \min(E(\mu), \lambda R_i)$. We thus continue by bounding the first term. Note that our requirement (11) on the minimal separation is more restrictive than the requirement in (36). Thus, we may invoke Lemma 5.4 and obtain that, with probability at least $1 - \frac{\delta}{2K}$, the event $N_i$ defined in (39) occurs. Therefore, the first term admits a bound involving a universal constant $C$ and $\widetilde{C} = 36 K^2 R_{\max} (\sqrt{d} + 2 R_{\max})^2$. For $n$ sufficiently large so that (12) is satisfied, this term is small enough to ensure that $\mu^+ \in U_\lambda$ and that the error bound (13) holds, which completes the proof.

Simulations
We present numerical simulations of the EM algorithm and compare them to our theoretical results. The ability of the EM and gradient EM algorithms to learn Gaussian mixture models has been extensively demonstrated in simulations by various authors; see [23, 22] and references therein. Our simulations focus on several quantities that appear in Theorem 3.1.

First, we demonstrate that even in a setting with a moderately high dimension and a large number of components, a relatively low separation suffices for the EM algorithm to yield accurate estimates. Unlike [23], which presented numerical results for a 5-component GMM in $\mathbb{R}^{10}$, we consider a 64-component GMM with centers on the unit simplex in $\mathbb{R}^{64}$. We generate $5 \cdot 10^5$ samples from this GMM and consider several initializations, where each initial estimate $\mu_i$ is sampled uniformly from a sphere of radius $0.45 R_i$ around $\mu_i^*$. In Figure 1a we plot the error $E(\mu)$ as a function of the number of iterations for 12 random initializations. We see that the EM algorithm yields accurate estimates in this setting, even though the separation between the different Gaussians is small relative to the dimension and to the number of components.

Next, we explore the effect of the constant $\lambda$, such that the initial estimates satisfy $\|\mu_i - \mu_i^*\| \le \lambda R_i$, for values of $\lambda$ slightly smaller than $\frac{1}{2}$. We consider a 5-component GMM with centers on the unit simplex in $\mathbb{R}^{10}$. We generate $5 \cdot 10^5$ samples and run the EM algorithm. The initial values $\mu_1$ and $\mu_2$ are chosen on the line connecting $\mu_1^*$ and $\mu_2^*$. The other three initial values $\mu_3, \mu_4, \mu_5$ are sampled uniformly from a sphere of radius $\lambda R_i$ centered at $\mu_i^*$. As can be seen in Figure 1b, the EM algorithm yields accurate estimates for all considered values of $\lambda$ smaller than $\frac{1}{2}$, even when $\lambda = \frac{1}{2} - 10^{-5}$.
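A short script along the following lines (our reconstruction of the described setup, not the authors' code; it reuses the `weights` helper from the sketch in the Population and Sample EM section, with a reduced sample size for memory) reproduces the first experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
K = d = 64
n, lam = 200_000, 0.45            # the paper uses n = 5 * 10^5 samples
mu_star = np.eye(K)               # centers at the vertices of the unit simplex
pi = np.full(K, 1.0 / K)
R_i = np.full(K, np.sqrt(2.0))    # here all pairwise distances equal sqrt(2)

labels = rng.integers(0, K, size=n)
X = rng.standard_normal((n, d)) + mu_star[labels]

# Initialize each mu_i uniformly on the sphere of radius lam * R_i around mu*_i.
u = rng.standard_normal((K, d))
mu = mu_star + lam * R_i[:, None] * u / np.linalg.norm(u, axis=1, keepdims=True)

for t in range(30):
    w = weights(X, mu, pi)                                 # Eq. (6)
    mu = (w.T @ X) / w.sum(axis=0)[:, None]                # sample EM update, Eq. (9)
    print(t, np.linalg.norm(mu - mu_star, axis=1).max())   # error E(mu) per iteration
```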
Finally, we consider the accuracy of EM for GMMs as we increase either the number of components or the dimension. Specifically, we considered a one-dimensional GMM with $K$ equally spaced components with $R_{\min} = 10$ and varying values of $K$. We generate $5 \cdot 10^5$ samples from each GMM and run the EM algorithm for 20 iterations. Figure 1c shows the error, averaged over 25 runs, as a function of $K$. As seen in the plot, the error behaves like $\sqrt{K \log(K)/n}$, which is also the expected parametric error if all samples were labeled, in which case on average we would have $n/K$ samples from each component. These results suggest that the upper bound of Eq. (13), which depends on $\sqrt{K^3}$, may be improved. Next, we considered a sequence of GMMs in increasing dimension $\mathbb{R}^d$, where $d \in [20, 130]$. Each GMM had 5 components with centers $R e_i$ for $1 \le i \le 5$, where $R = 10$ and $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^d$. We generate $5 \cdot 10^5$ samples from each GMM and run the EM algorithm for 20 iterations. We plot the error, averaged over 25 runs, as a function of the square root of the dimension $d$. In accordance with Eq. (13), the empirical error appears to increase like $\sqrt{d}$.

A.1. Proof of Proposition 4.1

We first establish two auxiliary lemmas. Lemma A.1 asserts that the integral $I$ appearing in the proof of Proposition 4.1 below is monotonically decreasing in $B$ (part (i)) and monotonically increasing in $A$ (part (ii)).

Proof of Lemma A.1. Since the function inside the integral is monotonically decreasing in $B$, part (i) follows directly. To prove part (ii), we take the derivative with respect to $A$ and denote the function inside the resulting integral by $f(t)$. Note that $f(t) > 0$ when $t < 0$ and $f(t) < 0$ when $t > 0$. To show that the integral is positive, it thus suffices to show that $-f(t) < f(-t)$ for all $t > 0$. Some algebraic manipulations show that this condition is equivalent to
$$\left( e^{At} - e^{-At} \right) \left( \alpha^2 e^{2B} - 1 \right) > 0,$$
which is indeed satisfied for $A, t > 0$ and $\alpha > e^{-B}$.
Lemma A.2. Fix any two distinct vectors $\mu_i^*, \mu_j^* \in \mathbb{R}^d$ and $\lambda \in (0, 1/2)$. Denote by $B_d(0, r)$ the ball of radius $r$ about the origin in $\mathbb{R}^d$, define the domain $\Omega$ accordingly, and consider the two functions $A, B : \Omega \to \mathbb{R}$ of Eqs. (41) and (42). Then for any $(\xi_i, \xi_j) \in \Omega$, it holds that $A(\xi_i, \xi_j) \le A^*$ and $B(\xi_i, \xi_j) \ge B^*$, with $A^*$ and $B^*$ as given in (43) and (44).

Proof. The upper bound on $A$ follows from the triangle inequality. As for the lower bound on $B$, it is clearly attained when $\|\xi_i\|$ is maximal. Finally, the vector $\xi_j = \lambda(\mu_i^* - \mu_j^*)$ minimizes (42) regardless of the value of $\xi_i$. This yields the lower bound of (44) for $B$.
Proof of Proposition 4.1. Recall the definition of the weight $w_j(X, \mu)$ in Eq. (6). Since all the terms in the denominator are positive, we may upper bound $w_j$ by keeping only the two terms with indices $k = i$ and $k = j$. Write $X = \mu_i^* + \eta$ with $\eta \sim N(0, I_d)$, and note that by definition $\eta^\top(\mu_i - \mu_j)$ is a univariate Gaussian random variable with mean zero and variance $\|\mu_i - \mu_j\|^2$. Hence, we may write $\eta^\top(\mu_i - \mu_j) = \|\mu_i - \mu_j\| \nu$, where $\nu \sim N(0, 1)$. Defining $\bar{w}(A, B, \nu) = 1/(1 + \frac{\pi_i}{\pi_j} e^{A\nu + B})$, we therefore have $E_i[w_j(X, \mu)] \le E_\nu[\bar{w}(A, B, \nu)]$, with $A = A(\xi_i, \xi_j)$ and $B = B(\xi_i, \xi_j)$ as defined in (41) and (42), respectively. By Lemmas A.1 and A.2, we may further replace $A$ and $B$ by the bounds $A^*$ and $B^*$ of (43) and (44), and it thus suffices to upper bound the integral $I = E_\nu[\bar{w}(A^*, B^*, \nu)]$.
To upper bound the integral $I$, we split it into two parts, $I = I_1 + I_2$, based on the sign of $A^* t + B^*$.
For $I_1$, where $A^* t + B^* < 0$, we upper bound $\bar{w}(A^*, B^*, t) \le 1$. Since both $A^*$ and $B^*$ are positive, we have that $-\frac{B^*}{A^*} < 0$. We can therefore use Chernoff's bound to obtain the estimate (48). For $I_2$, where $A^* t + B^* > 0$, we upper bound the integral by ignoring the constant 1 in the denominator. We complete the square and change variables to $z = t + A^*$. Using the definitions of $A^*$ and $B^*$ in (43) and (44), the resulting exponent can be bounded for $\lambda > 0$, and applying Chernoff's bound once more yields the estimate (49). Combining the two bounds (48) and (49) yields Eq. (17).
Proof of Corollary 4.1.1. By definition, the sum of all the weights equals one. Thus, $E_i[w_i(X, \mu)] = 1 - \sum_{j \neq i} E_i[w_j(X, \mu)]$, and the corollary follows from the upper bound (17) of Proposition 4.1.

A.2. Proof of Lemma 4.2
Proof. Since $X$ is distributed as a GMM with $K$ components and $w_i(X, \mu) > 0$, the expected value is lower bounded by the contribution of the $i$-th component alone, $E_X[w_i(X, \mu)] \ge \pi_i E_i[w_i(X, \mu)]$. Since the requirement (19) on $R_{\min}$ implies (16), it follows from Corollary 4.1.1 that $E_i[w_i(X, \mu)] \ge \frac{3}{4}$, which proves the lemma.

A.3. Proof of Lemma 4.3
The proof consists of several steps. First, in Lemma A.3 we reduce the dimension to $d_0 = \min(d, 2K)$. Next, in Lemma A.4 we bound $\|E_k[(X - v_i)(X - \mu_j)^\top]\|_{op}$ in terms of $d_0$, $R_i$, $R_j$ and $R_{ij}$. We then present the proof of the lemma itself.

We first introduce some notation. For $\mu, v \in \mathbb{R}^{Kd}$ and $i, j, k \in [K]$ we define the quantities $V^k_{ij}$ and $V^k_{ii}$ in Eqs. (50) and (51). Suppose that $X \sim N(\mu_k^*, I_d)$, and let $\Gamma$ be a rotation matrix that maps the subspace spanned by $\mu_1, \ldots, \mu_K, \mu_1^*, \ldots, \mu_K^*$ into the first $d_0$ coordinates. Write $X_{d_0}$ for the first $d_0$ coordinates of $\Gamma X$ and $X_{d - d_0}$ for the remaining coordinates. We define the corresponding rotated quantities in Eqs. (52) and (53). The proof of Lemma A.3 is similar to the one in [22]; we include it for our paper to be self-contained.
Proof. We prove only the first inequality; the proof of the second is similar. Since $\Gamma$ is a rotation matrix and the last $[d - d_0]_+$ coordinates of $\Gamma \mu_i$ are zero, the quantities in (50) and (52) coincide under the rotation. Since $w_i(X, \mu) = w_i(X_{d_0}, \mu)$, we may return to the original variables, and the lemma follows.
Proof. First, for any rank-one matrix $uv^\top$ it holds that $\|uv^\top\|_{op} = \|u\| \cdot \|v\|$. Next, since $X \sim N(\mu_k^*, I_d)$, we may write $X = \mu_k^* + \eta$, where $\eta \sim N(0, I_d)$. Let $\alpha$ denote the angle between $e_1$ and $\Gamma(\mu_k^* - \mu_j)$. Applying $\Gamma$ and using the rotation invariance of the Gaussian distribution, it is easy to show that the expectation of the resulting expression is maximal when $\alpha = 0$. In this case, we can write the expectation in terms of $A = (R^*_{ik} + \eta_1)^2$, which follows a non-central $\chi^2$ distribution with one degree of freedom and non-centrality parameter $(R^*_{ik})^2$, of $C = \sum_{q \ge 2} \eta_q^2$, which follows a central $\chi^2$ distribution with $d - 1$ degrees of freedom, and of $B = (R^*_{kj} + \eta_1)^2$. Using known results on the moments of central and non-central $\chi^2$ random variables, we obtain a bound in terms of $R^*_{ik}$ and $R^*_{kj}$. Since $\mu, v \in U_\lambda$, it holds that $R^*_{ik} \le R_{ik} + \frac{1}{2} R_i$ and $R^*_{kj} \le R_{kj} + \frac{1}{2} R_j$.

Now we consider several different cases. First, for $i = j = k$ we have $R_{ik} = R_{jk} = 0$ and $R_i = R_j$, and the bound follows directly. Finally, we consider the case where $j \neq i$ but $k$ is not distinct from both $i$ and $j$; without loss of generality, $k = i$. Then $R_{ik} = 0$ and $R_{kj} = R_{ij}$. By definition, $R_j \le R_{ij}$ and $R_i \le R_{ij}$, and the bound follows in this case as well.

We are now ready to prove the lemma. For clarity, we present the proofs of Eq. (26) and of Eq. (25) in two separate parts.
Proof of Eq. (26) in Lemma 4.3. The first step is to separate the expectation over the GMM into its $K$ components. By the triangle inequality, we obtain a bound in terms of the quantities $V^k_{ij}$ defined in (50). By Lemma A.3, each of these is in turn bounded by its reduced-dimension counterpart defined in Eq. (52). We now separately analyze each of the two terms on the right hand side of (56). We start with the second term. When $k = i$, Proposition 4.1 yields an exponentially small bound; by symmetry, the same bound holds also for $k = j$.
Next we bound $V^k_{ij}$; we shall later see that it is the larger of the two quantities in (56). By the Cauchy-Schwarz inequality and Lemma A.4, there exists a universal constant $C$ such that Eq. (54) holds with dimension $d_0$. Thus, for the first term in Eq. (55) with $k = i$, Proposition 4.1 yields the bound (57). Similarly, for $k = j$ we obtain (58). Hence, in these two cases $V^k_{ij}$ is indeed the dominant term. Finally, we consider the case $k \neq i, j$; again by Proposition 4.1, we obtain the bound (59) for this case as well.

Inserting (57), (58) and (59) into (55) and summing over the components gives a bound on $V_{i,j}$. Since the function $x^2 e^{-t x^2}$ is monotonically decreasing for $x > \sqrt{1/t}$, and $R_{\min} > 2/c(\lambda)$, we may replace $R_{ij}$ by $\max(R_i, R_j)$ in the first term. Similarly, we may replace $R_{ik}$ by $R_i$ and $R_{jk}$ by $R_j$ in the second sum above. This yields Eq. (26).
Proof of Eq. (25) in Lemma 4.3. The first step is to separate the expectation over the GMM into its $K$ components. By the triangle inequality, we obtain a bound in terms of the quantities $V^k_{ii}$ defined in (51), which we now bound separately. First, by Lemma A.3, each $V^k_{ii}$ is bounded by its reduced-dimension counterpart defined in (53). We first bound $V^k_{ii}$, and we shall later see that it is the larger of the two quantities in (61). By the Cauchy-Schwarz inequality and Lemma A.4, there exists a universal constant $C$ such that Eq. (54) holds with dimension $d_0$. Thus, for $k = i$, Corollary 4.1.1 yields the bound (62). Similarly, we upper bound the second quantity on the right hand side of (61) as follows. The function $x^2 e^{-t x^2}$ is monotonically decreasing for $x > \sqrt{1/t}$. Since $R_{ik} \ge R_i \ge 4 c(\lambda)^{-1}$, we may replace $R_{ik}$ by $R_i$, which yields the bound (63). Inserting (62) and (63) into (60) and summing over all components yields Eq. (25).

A.4. Completing the Proof of Theorem 3.1
The following lemma completes the proof of the Theorem.
Lemma A.5. Let $X \sim \mathrm{GMM}(\mu^*, \pi)$ with $R_{\min}$ satisfying (11). Let $\mu^+$ be the population EM update (7). Then for every $i \in [K]$ it holds that $\|\mu_i^+ - \mu_i^*\| \le \lambda R_i$.

Proof. Our starting point is Eq. (29). We insert into it the bounds (25) and (26) on $V_{i,i}$ and $V_{i,j}$, respectively. Since $\mu \in U_\lambda$, we may replace all $\|\mu_k - \mu_k^*\|$ in the resulting expressions by $\lambda R_k$. Since $x^3 e^{-t x^2}$ is monotonically decreasing for all $x > \sqrt{3/(2t)}$, we may further replace all $R_k$ by $R_{\min}$. By Eqs. (27) and (20), it then follows that the separation condition (11) suffices to ensure that $\|\mu_i^+ - \mu_i^*\| \le \lambda R_{\min} \le \lambda R_i$.
Proof. For any $\varepsilon > 0$, let $N_i$ be an $\varepsilon$-net of $B_i$ and define $N_\varepsilon = \otimes_i N_i$. Then, for any $t > 0$, the probability of interest is bounded by the probabilities of three events $A$, $B$ and $C$. We first bound $\Pr(A)$. By Markov's inequality and the first condition of the lemma, for $\frac{3 \varepsilon L}{t} \le \frac{\delta}{2}$ we have that $\Pr(A) < \frac{\delta}{2}$. Note also that for $t$ satisfying the stated condition, $\Pr(B) = 0$. Finally, we bound the probability of $C$. Here we use the second condition of the lemma, that $\|W(X, \mu)\|_{\psi_2} \le R$. It follows from Lemma B.1 that a sub-Gaussian concentration inequality holds for any fixed $\mu^\varepsilon$, where $c$ is a universal constant. Since all the balls $B_1, \ldots, B_K \subset \mathbb{R}^d$ are of radius at most $r$, it holds that $|N_\varepsilon| \le e^{K d \log(3r/\varepsilon)}$. Hence, taking a union bound and requiring that the resulting right hand side be smaller than $\frac{\delta}{2}$ implies
$$t \ge R \sqrt{\frac{K d \log \frac{3r}{\varepsilon} + d \log 3 + \log \frac{4}{\delta}}{c n}}.$$
Setting $\varepsilon = \frac{\delta}{6 L n}$, the condition $\Pr(A) \le \frac{\delta}{2}$ implies $t > \frac{1}{n}$, which holds whenever $t$ satisfies the inequality above. This completes the proof.

Proof. The lemma will follow from Lemma B.2 by setting $X \sim N(\mu_i^*, I_d)$, $B = U_\lambda$ and $W = w_i(X, \mu)$. To this end, we show that the conditions of Lemma B.2 hold: (i) there exists $L > 1$ such that for any $\varepsilon > 0$, $|w_i(X, \mu) - w_i(X, \mu^\varepsilon)| \le L \|\mu - \mu^\varepsilon\|$; and (ii) $\|w_i(X, \mu)\|_{\psi_2}$ is bounded by a constant. The latter is clear, as $w_i$ is bounded. For the former, we use the mean value theorem: there exists a point $\bar{\mu}$ such that $|w_i(X, \mu) - w_i(X, \mu^\varepsilon)| = |\nabla w_i(X, \bar{\mu})^\top (\mu - \mu^\varepsilon)|$. Using the expressions (21) and (22) for the gradient of $w_i$ with respect to $\mu$, it follows that $w_i$ is Lipschitz with constant $L = K(\sqrt{d} + 2 R_{\max})$. The lemma follows by plugging $L = K(\sqrt{d} + 2 R_{\max})$ and $r = R_{\max}$ into (64).

B.3. Proof of Lemma 5.2
Proof. We denote by $I_i$ the set of all samples generated from the $i$-th component, and set $n_i = |I_i|$ and $\hat{\pi}_i = n_i / n$. Since $w_i(X, \mu) \ge 0$ for any $X, \mu$, we can lower bound the sum in the event $D_i$ by considering only the terms $w_i(X_\ell, \mu)$ with $\ell \in I_i$. With a suitably large constant $C$, the sample size requirement (34) implies that $n > \frac{300 \log \frac{8K}{\delta}}{\pi_{\min}}$. By the multiplicative form of the Chernoff bound for Bernoulli random variables (see, e.g., [18, Exercise 2.3.5]), for any $\delta \in (0, 1)$, with probability at least $1 - \frac{\delta}{4K}$, we have $|\hat{\pi}_i - \pi_i| \le \frac{1}{10} \pi_i$, and therefore $\hat{\pi}_i \ge \frac{9}{10} \pi_i$. Note that by Lemma 5.1, with probability at least $1 - \frac{\delta}{4K}$, Eq. (33) holds. Since $n_i \ge \frac{9}{10} n \pi_i$, we may replace $n_i$ in Eq. (33) by $n \pi_i$, increasing the relevant constants to $\tilde{c}_1 = \frac{10}{9} \tilde{c}$ and $\widetilde{C}_1 = \frac{10}{9} \widetilde{C}$. It thus follows that the deviation term is of order $\sqrt{\frac{K d \log(\widetilde{C}_1 n / \delta)}{n \pi_i}}$, which together with the sample size requirement (34) yields the lemma.
Next, we split the expectation over $\nu$ into two cases. To bound the second term, we use the expressions in Eqs. (21) and (22), with $V_{i,j}$ and $V_{i,i}$ as defined in (23) and (24), respectively. The proof then proceeds similarly to that of the original EM algorithm. First, using the bounds (25) and (26) in Lemma 4.3, for $R_{\min}$ satisfying the separation condition (11), the required contraction follows. We finish by showing that $\mu^+ \in U_\lambda$. Replacing $\|\mu_j - \mu_j^*\|$ by $\lambda R_j$ in (68), and replacing $R_k$ by $R_{\min}$ in the bounds in (25) and (26), we obtain an expression whose right hand side, under the separation condition (11), is upper bounded by $\frac{3}{8} \pi_{\min}$. Therefore, $\mu^+ \in U_\lambda$. The next lemma presents a concentration result for the sample EM update.
With the pieces in place we now prove Theorem 3.4.