Separating populations with wide data: A spectral analysis

In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of $k$ product distributions. We are interested in the case that individual features are of low average quality $\gamma$, and we want to use as few of them as possible to correctly partition the sample. We analyze a spectral technique that is able to approximately optimize the total data size--the product of the number of data points $n$ and the number of features $K$--needed to correctly perform this partitioning, as a function of $1/\gamma$ for $K>n$. Our goal is motivated by an application in clustering individuals according to their population of origin using genetic markers, when the divergence between any two of the populations is small.


Introduction
We explore a type of classification problem that arises in the context of computational biology. The problem is that we are given a small sample of size $n$, e.g., DNA of $n$ individuals (think of $n$ in the hundreds or thousands), each described by the values of $K$ features or markers, e.g., SNPs (Single Nucleotide Polymorphisms; think of $K$ as an order of magnitude larger than $n$). Our goal is to use these features to classify the individuals according to their population of origin. Features have slightly different probabilities depending on which population the individual belongs to, and are assumed to be independent of each other (i.e., our data is a small sample from a mixture of $k$ very similar product distributions). The objective we consider is to minimize the total data size $D = nK$ needed to correctly classify the individuals in the sample as a function of the "average quality" $\gamma$ of the features, under the assumption that $K > n$. Throughout the paper, we use $p_i^j$ and $\mu_i^j$ as shorthands for $p_i^{(j)}$ and $\mu_i^{(j)}$ respectively.
Statistical Model: We have $k$ probability spaces $\Omega_1, \ldots, \Omega_k$ over the set $\{0,1\}^K$. Further, the components (features) of $z \in \Omega_t$ are independent and $\Pr_{\Omega_t}[z_i = 1] = p_i^t$ ($1 \le t \le k$, $1 \le i \le K$). Hence, the probability spaces $\Omega_1, \ldots, \Omega_k$ comprise the distribution of the features for each of the $k$ populations. Moreover, the input of the algorithm consists of a collection (mixture) of $n = \sum_{t=1}^k N_t$ unlabeled samples, $N_t$ points from $\Omega_t$, and the algorithm is to determine for each data point from which of $\Omega_1, \ldots, \Omega_k$ it was chosen. In general we do not assume that $N_1, \ldots, N_k$ are revealed to the algorithm; but we do require some bounds on their relative sizes. An important parameter of the probability ensemble $\Omega_1, \ldots, \Omega_k$ is the measure of divergence $\gamma$ between any two distributions, where $K\gamma$ lower bounds $\|P_t - P_{t'}\|_2^2$ for any $t \ne t'$ and $P_t = (p_i^t)_{i=1,\ldots,K}$. Note that $\sqrt{K\gamma}$ measures the Euclidean distance between the means of any two distributions and thus represents their separation. Further, let $N = n/k$ (so if the populations were balanced we would have $N$ of each type) and assume from now on that $kN < K$. Let $D = nK$ denote the size of the data-set. In addition, let $\sigma^2 = \max_{i,t} p_i^t(1 - p_i^t)$ denote the maximum variance of any random bit.
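To make the statistical model concrete, the following is a minimal sketch that draws a sample from a mixture of $k$ product distributions over $\{0,1\}^K$. The function name `sample_mixture` and the array layout are illustrative choices, not part of the paper.

```python
import numpy as np

def sample_mixture(p, N, rng=None):
    """Draw N[t] individuals from each of k product distributions over {0,1}^K.

    p : (k, K) array with p[t, i] = Pr[feature i = 1 | population t]  (the paper's p_i^t)
    N : length-k sequence of class sizes N_1, ..., N_k
    Returns the n x K 0/1 data matrix and the hidden population labels.
    """
    rng = rng or np.random.default_rng()
    k, K = p.shape
    rows, labels = [], []
    for t in range(k):
        # each of the K features is an independent Bernoulli(p[t, i]) draw
        rows.append((rng.random((N[t], K)) < p[t]).astype(float))
        labels += [t] * N[t]
    return np.vstack(rows), np.array(labels)
```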
The biological context for this problem is the following: we are given DNA information from $n$ individuals drawn from $k$ populations of origin, and we wish to classify each individual into the correct category. DNA contains a series of markers called SNPs, each of which has two variants (alleles). Given the population of origin of an individual, the genotypes can reasonably be assumed to be generated by drawing alleles independently from the appropriate distribution. The following theorem gives a sufficient condition for a balanced ($N_1 = N_2$) input instance when $k = 2$.
Theorem 1.1 (Zhou (2006)). Assume $N_1 = N_2 = N$. If $K = \Omega\left(\frac{\ln N}{\gamma}\right)$ and $KN = \Omega\left(\frac{\ln N \log\log N}{\gamma^2}\right)$, then with probability $1 - 1/\mathrm{poly}(N)$, among all balanced cuts in the complete graph formed on the $2N$ sample individuals, the maximum-weight cut corresponds to the partition of the $2N$ individuals according to their population of origin. Here the weight of a cut is the sum of weights across all edges in the cut, and the edge weight equals the Hamming distance between the bit vectors of the two endpoints.
Variants of the above theorem, based on a model that allows two random draws from each SNP for an individual, are given in Chaudhuri et al. (2007); Zhou (2006). In particular, notice that edge weights based on the inner product of two individuals' bit vectors correspond to the sample covariance, in which case the max-cut corresponds to the correct partition with high probability Zhou (2006). Finding a max-cut is computationally intractable; hence Chaudhuri et al. (2007) give a hill-climbing algorithm that finds the correct partition for balanced input instances, but with a stronger requirement on the sizes of both $K$ and $nK$.
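As an illustration of the two edge-weight choices just discussed (Hamming distance as in Theorem 1.1, and the inner-product/covariance variant), here is a small sketch; the function name and the column-centering used for the covariance variant are our own choices, not taken from the cited papers.

```python
import numpy as np

def edge_weights(X, kind="hamming"):
    """Edge weights of the complete graph on the n sampled individuals.

    X is the n x K 0/1 data matrix.  kind="hamming" gives Hamming-distance
    weights (for 0/1 vectors, |u| + |v| - 2 u.v); kind="inner" gives the
    centered inner-product (sample covariance) variant.
    """
    if kind == "hamming":
        s = X.sum(axis=1)
        W = s[:, None] + s[None, :] - 2.0 * (X @ X.T)
    else:
        Xc = X - X.mean(axis=0)
        W = Xc @ Xc.T
    np.fill_diagonal(W, 0.0)
    return W
```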
A Spectral Approach: In this paper, we construct two simpler algorithms using spectral techniques, attempting to reproduce the conditions above. In particular, we study the requirements on the parameters of the model (namely, $\gamma$, $N$, $k$, and $K$) that allow us to classify every individual correctly and efficiently with high probability.
The two algorithms, Classify and Partition, compare as follows. Both are based on spectral methods originally developed for graph partitioning. More precisely, Theorem 1.2 is based on computing the singular vectors with the two largest singular values of the $n \times K$ input random matrix. The procedure is conceptually simple, easy to implement, and efficient in practice. For simplicity, Procedure Classify assumes the separation parameter $\gamma$ is known, in order to decide which singular vector to examine; in practice, one can just try both singular vectors, as we do in the simulations. The proof techniques for Theorem 1.2, however, are difficult to apply to the case of multiple populations, i.e., $k > 2$. Procedure Partition is based on computing a rank-$k$ approximation of the input random matrix and can cope with a mixture of a constant number of populations. It is more intricate than Classify, both to implement and to run. It does not require $\gamma$ as an input; it only requires that the constant $k$ is given. We prove the following theorems.
Theorem 1.2. Let $\omega = \min(N_1, N_2)/n$ and let $\omega_{\min}$ be a lower bound on $\omega$. Let $\gamma$ be given. Assume that $K > 2n\ln n$ and $k = 2$. Procedure Classify allows us to separate the two populations w.h.p. when $n \geq \Omega\left(\frac{\sigma^2}{\gamma \omega_{\min}\omega}\right)$, where $\sigma^2$ is the largest variance of any random bit, i.e., $\sigma^2 = \max_{i,t} p_i^t(1 - p_i^t)$. Thus if the populations are roughly balanced, then $n \geq c/\gamma$ suffices for some constant $c$. This implies that the data required is $D = nK = O(\ln n \,\sigma^4/(\gamma^2\omega^2\omega_{\min}^2))$.

Let $P_s = (p_i^s)_{i=1,\ldots,K}$ denote the mean vector of population $s$; we have $K\gamma \le \|P_1 - P_2\|_2^2$.

Theorem 1.3. There is a polynomial time algorithm Partition that satisfies the following. Suppose that $n > C_k\frac{\sigma^2}{\gamma\omega}$ for some large enough constant $C_k$, and $\omega = \Omega(1)$. Then, given the empirical $n \times K$ matrix comprising the $K$ features for each of the $n$ individuals along with the parameter $k$, Partition separates the $k$ populations correctly w.h.p.
Summary and Future Direction: Note that unlike Theorem 1.1, both Theorem 1.2 and Theorem 1.3 require a lower bound on $n$, even when $k = 2$ and the input instance is balanced. We illustrate through simulations that this does not seem to be a fundamental constraint of the spectral techniques; our experimental results show that even when $n$ is small, by increasing $K$ so that $nK = \Omega(1/\gamma^2)$, one can classify a mixture of two populations using the ideas in Procedure Classify with a success rate reaching an "oracle" curve, which is computed assuming that the distributions are known; here the success rate means the ratio between the number of correctly classified individuals and $N$. Exploring the tradeoffs between $n$ and $K$ that are sufficient for classification when the sample size $n$ is small is of both theoretical interest and practical value.
Outline of the paper: The paper is organized as follows. In Section 1.1 we discuss related work. Then, in Section 2 we describe the algorithm Classify for Theorem 1.2 and outline its analysis. Some (very) technical details of the analysis are deferred to the appendix. Section 3 deals with the algorithm Partition for Theorem 1.3. Finally, in Section 4 we report some experimental results on Classify.

Related work
In their seminal paper Pritchard et al. (2000), Pritchard, Stephens, and Donnelly presented a model-based clustering method to separate populations using genotype data. They assume that observations from each cluster are random draws from some parametric model. Inference for the parameters corresponding to each population is done jointly with inference for the cluster membership of each individual, and for the number of populations $k$ in the mixture, using Bayesian methods.
The idea of exploiting the eigenvectors corresponding to the first two eigenvalues of the adjacency matrix to partition graphs goes back to the work of Fiedler (1973), and has been used in heuristics for various NP-hard graph partitioning problems (e.g., Fjallstrom (1998)). The main difference between graph partitioning problems and the classification problem that we study is that the matrices occurring in graph partitioning are symmetric and hence diagonalizable, while our input matrix is rectangular in general. Thus, the contribution of Theorem 1.2 is to show that a conceptually simple and efficient algorithm based on singular value decompositions performs well in the framework of a fairly general probabilistic model, in which the probabilities of each of the $K$ features are allowed to vary across the $k$ populations. Indeed, the analysis of Classify requires exploring new ideas, such as the Separation Lemma and the normalization of the random matrix $X$, both for generating a large gap between the top two singular values of the expectation matrix of $X$ and for bounding the angle between the random singular vectors and their static counterparts; details are included in Section 2, with the full analysis in the full version.
Procedure Partition and its analysis build upon the spectral techniques of McSherry (2001) on graph partitioning, and an extension due to Coja-Oghlan (2006). McSherry provides a comprehensive probabilistic model and presents a spectral algorithm for solving the partitioning problem on random graphs, provided that a separation condition similar to (1.2) is satisfied. Indeed, McSherry (2001) encompasses a considerable portion of the prior work on Graph Coloring, Minimum Bisection, and finding a Maximum Clique. Moreover, McSherry's approach easily yields an algorithm that solves the classification problem studied in the present paper under assumptions similar to those of Theorem 1.3, provided that the algorithm is given the parameter $\gamma$ as an additional input; this is actually pointed out in the conclusions of McSherry (2001). In the context of graph partitioning, an algorithm that does not need the separation parameter as an input was devised in Coja-Oghlan (2006). The main difference between Partition and the algorithm presented in Coja-Oghlan (2006) is that Partition deals with the asymmetric $n \times K$ matrix of individuals/features, whereas Coja-Oghlan (2006) deals with graph partitioning (i.e., a symmetric matrix).
There are two streams of related work in the learning community. The first stream is the recent progress in learning from the point of view of clustering: given samples drawn from a mixture of well-separated Gaussians (component distributions), one aims to classify each sample according to which component distribution it comes from, as studied in Dasgupta (1999); Dasgupta and Schulman (2000); Arora and Kannan (2001); Vempala and Wang (2002); Achlioptas and McSherry (2005); Kannan et al. (2005); Dasgupta et al. (2005). This framework has been extended to more general distributions, such as log-concave distributions in Achlioptas and McSherry (2005); Kannan et al. (2005) and heavy-tailed distributions in Dasgupta et al. (2005), as well as to more than two populations. These results focus mainly on reducing the requirement on the separation between any two centers $P_1$ and $P_2$. In contrast, we focus on the sample size $D$. This is motivated by previous results Chaudhuri et al. (2007); Zhou (2006) stating that by acquiring enough attributes along the same set of dimensions from each component distribution, with high probability, we can correctly classify every individual.
While our aim is different from those results, in which $n > K$ is almost universal whereas we focus on the case $K > n$, we do have one common axis for comparison: the $\ell_2$-distance between any two centers of the distributions. In earlier works Dasgupta and Schulman (2000); Arora and Kannan (2001), the separation requirement depended on the number of dimensions of each distribution; this has recently been reduced to be independent of $K$, the dimensionality of the distribution, for certain classes of distributions Achlioptas and McSherry (2005); Kannan et al. (2005). This is comparable to our requirement in (1.2) for the discrete distributions. For example, according to Theorem 7 in Achlioptas and McSherry (2005), a certain separation between the two centers is required in order to separate a mixture of two Gaussians. Besides Gaussian and log-concave distributions, a general theorem, Theorem 6 in Achlioptas and McSherry (2005), is derived that in principle also applies to mixtures of discrete distributions. The key difficulty in applying their theorem directly to our scenario is that it relies on a concentration property of the distribution (Eq. (10) of Achlioptas and McSherry (2005)) that need not hold in our case. In addition, once the distance between any two centers is fixed (i.e., once $\gamma$ is fixed in the discrete distribution), the sample size $n$ in their algorithms is always larger than $\Omega\left(\frac{K}{\omega}\log^5 K\right)$ for log-concave distributions Achlioptas and McSherry (2005); Kannan et al. (2005) (in fact, in Theorem 3 of Kannan et al. (2005), they discard at least this many individuals in order to correctly classify the rest of the sample), and larger than $\Omega(K/\omega)$ for Gaussians Achlioptas and McSherry (2005), whereas in our case, $n < K$ always holds. Hence, our analysis allows one to obtain a clean bound on $n$ in the discrete case.
The second stream of work is under the PAC-learning framework, where, given a sample generated from some target distribution $Z$, the goal is to output a distribution $Z_1$ that is close to $Z$ in Kullback-Leibler divergence $KL(Z \| Z_1)$, where $Z$ is a mixture of product distributions over discrete domains or Gaussians Kearns et al. (1994); Freund and Mansour (1999); Cryan (1999); Cryan et al. (2002); Mossel and Roch (2005); Feldman et al. (2005, 2006). These works do not require a minimal distance between any two distributions, but they do not aim to classify every sample point correctly either, and in general they require much more data.
Our work is also related to the use of principal component analysis (PCA) in genetics Patterson et al. (2006); Price et al. (2006). The basic approach in these papers is to use the eigenvectors of a covariance matrix between samples to analyze a mixture of populations. While Patterson et al. (2006); Price et al. (2006) study the use of spectral methods empirically, the crucial point of the present work is that we prove rigorously that spectral methods succeed on a certain (simple) probabilistic model. Hence, our work can be seen as a further theoretical justification of the practical use of PCA. A difference between the present paper and Patterson et al. (2006); Price et al. (2006) is that we actually aim to assign each individual to exactly one of the populations. By contrast, Patterson et al. (2006); Price et al. (2006) just assign each individual a real "weight" for each population: essentially, the eigenvectors with the dominant eigenvalues correspond to the populations, and each individual is assigned its projection onto these dominant eigenvectors. The algorithm Classify is somewhat similar to PCA, but Partition is conceptually more involved. In addition, our experimental results show a phase transition phenomenon similar to what was observed in Patterson et al. (2006) for detecting population structure using simulated data.

A simple algorithm using singular vectors
As described in Theorem 1.2, we assume we have a mixture of two product distributions. Let $N_1, N_2$ be the number of individuals from each population class. Our goal is to correctly classify all individuals according to their distributions. Let $n = 2N = N_1 + N_2$, and refer to the case $N_1 = N_2$ as the balanced input case. For convenience, let us redefine "$K$": we assume we have $O(\log n)$ blocks of $K$ features each (so the total number of features is really $O(K\log n)$), and we assume that each set of $K$ features has divergence at least $\gamma$. (If we perform this partitioning of features into blocks randomly, then with high probability this divergence changes by only a constant factor for most blocks.) The high-level idea of the algorithm is to repeat the following procedure for each block of $K$ features: use the $K$ features to create an $n \times K$ matrix $X$, such that each row $X_i$, $i = 1, \ldots, n$, corresponds to the feature vector of one sample point across its $K$ dimensions. We then compute the top two left singular vectors $u_1, u_2$ of $X$ and use these to classify each sample. This classification induces some probability of error $f$ for each individual at each round, so we repeat the procedure for each of the $O(\log n)$ blocks and then take a majority vote over the different runs. Each round requires $K \ge n$ features, so we need $O(n\log n)$ features in total.
In more detail, we repeat the following procedure $O(\log n)$ times. Let $T = \frac{15N}{32}\sqrt{3\omega_{\min}\gamma}$, where $\omega_{\min}$ is the lower bound on the minimum weight $\min\{\frac{N_1}{2N}, \frac{N_2}{2N}\}$, which is independent of the actual instance. Let $s_1(X), s_2(X)$ be the top two singular values of $X$.
1. If $s_2(X) > T$, use $u_2$ to partition, with $0$ as the threshold.
2. Otherwise, use $u_1$ to partition, with the mixture mean $M = \frac{1}{n}\sum_{i=1}^{n} u_{1,i}$ as the threshold.
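The following sketch implements one round of the procedure just described, under the simplifying assumptions that the block $X$ has already been normalized and that the threshold $T$ is supplied by the caller (or omitted, in which case only the mixture-mean rule on $u_1$ is used, as one may also simply try both singular vectors, cf. Section 4). The function name `classify_block` is illustrative.

```python
import numpy as np

def classify_block(X, T=None):
    """One round of Procedure Classify on a (normalized) n x K block X (sketch).

    Computes the top two left singular vectors u1, u2 of X.  If a threshold T
    is supplied and s_2(X) > T, split on the sign of u2 (threshold 0);
    otherwise split u1 at its mixture mean M.  Returns a 0/1 labeling.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    u1, u2 = U[:, 0], U[:, 1]
    if T is not None and s[1] > T:
        return (u2 >= 0).astype(int)
    M = u1.mean()                      # mixture mean of u1
    return (u1 >= M).astype(int)
```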
Analysis of the Simple Algorithm: Our analysis is based on comparing entries in the top two singular vectors of the normalized random $n \times K$ matrix $X$ with those of a static matrix $\bar{X}$, where each entry $\bar{X}_{i,j} = E[X_{i,j}]$ is the expected value of the corresponding entry of $X$. Hence the first $N_1$ rows of $\bar{X}$ are identical (each equal to the normalized mean vector of $\Omega_1$), and similarly for the last $N_2$ rows. We assume the divergence is exactly $\gamma$ among the $K$ features that we have chosen, in all calculations.
The inspiration for this approach is Lemma 2.1, whose proof is built upon Theorem A.2, which is presented in a lecture note by Spielman (2002). For an $n \times K$ matrix $A$, let $s_1(A) \ge \cdots \ge s_n(A)$ denote its singular values. We denote the $n$ left and right singular vectors of $\bar{X}$ by $\bar{u}_1, \ldots, \bar{u}_n$ and $\bar{v}_1, \ldots, \bar{v}_n$.
Lemma 2.1. Let $X$ be the random $n \times K$ matrix and $\bar{X}$ its expected value matrix. Let $A = X - \bar{X}$ be the zero-mean random matrix. Let $\theta_i$ be the angle between the vectors $[u_i, v_i]$ and $[\bar{u}_i, \bar{v}_i]$, where $[u, v]$ represents the vector that is the concatenation of the two vectors $u$ and $v$. Then $\theta_1$ and $\theta_2$ are bounded in terms of $s_1(A)$ and the gap between the singular values of $\bar{X}$, as in (2.6) and (2.7).
We first bound the largest singular value $s_1(A) = s_1(X - \bar{X})$ of the matrix $A = (a_{i,j})$ with independent zero-mean entries, which equals the Euclidean operator norm $\|(a_{i,j})\| := \sup\left\{\sum_{i,j} a_{i,j}x_i y_j : \|x\|_2 = \|y\|_2 = 1\right\}$. The behavior of the largest singular value of an $n \times m$ random matrix $A$ with i.i.d. entries is well studied. Latala (2005) shows that the weakest assumption for its regular behavior is boundedness of the fourth moment of the entries, even if they are not identically distributed. Combining Theorem 2.2 of Latala (2005) with the concentration Theorem 2.3 by Meckes (2004) proves Theorem 2.4, which is what we need.
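As a quick numerical illustration (not a substitute for the proof) of the $O(\sqrt{K})$ behavior asserted in Theorem 2.4, the script below generates mean-zero Bernoulli-type matrices at the maximum-variance point $p = 1/2$ and prints the ratio $s_1/\sqrt{K}$, which should remain roughly constant as $K$ grows; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
for K in (1000, 4000, 16000):
    P = np.full((n, K), 0.5)                 # maximum-variance case, sigma^2 = 1/4
    A = (rng.random((n, K)) < P) - P         # independent mean-zero entries in [-1/2, 1/2]
    s1 = np.linalg.svd(A, compute_uv=False)[0]
    print(K, round(s1, 2), round(s1 / np.sqrt(K), 3))   # last column stays roughly constant
```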
Theorem 2.2 (Norm of Random Matrices, Latala (2005)). For any finite $n \times m$ matrix $A$ of independent mean zero r.v.'s $a_{i,j}$ we have, for an absolute constant $C$,
$$E[s_1(A)] \le C\left(\max_i \sqrt{\sum_j E\,a_{i,j}^2} + \max_j \sqrt{\sum_i E\,a_{i,j}^2} + \sqrt[4]{\sum_{i,j} E\,a_{i,j}^4}\right).$$

Theorem 2.3 (Concentration of Largest Singular Value: Bounded Range, Meckes (2004)). For any finite $n \times m$ matrix $A$ with $n \le m$, such that the entries $a_{i,j}$ are independent r.v.'s supported in an interval of length at most $D$, for all $t > 0$ the probability that $s_1(A)$ deviates from its median by more than $t$ is at most $e^{-\Omega(t^2/D^2)}$. (2.4)

Theorem 2.4 (Largest Singular Value of a Mean-zero Random Matrix). For any finite $n \times K$ matrix $A$ with $n \le K$, such that the entries $a_{i,j}$ are independent mean zero r.v.'s supported in an interval of length at most $D$ and with fourth moment upper bounded by $B$, we have w.h.p. $s_1(A) \le C_1 B^{1/4}\sqrt{K}$ for an absolute constant $C_1$. (2.5)

In order to apply Lemma 2.1 to the top two singular vectors of $X$ and $\bar{X}$ via (2.6) and (2.7), we need to first bound $|s_1(\bar{X}) - s_2(\bar{X})|$ away from zero, since otherwise the RHSs of both (2.6) and (2.7) become unbounded. We then analyze the entries of the top two singular vectors of $\bar{X}$. Let us first define the values $a, b, c$ that we use throughout the rest of the paper:
$$a = \langle \mu^{(1)}, \mu^{(1)}\rangle, \quad b = \langle \mu^{(1)}, \mu^{(2)}\rangle, \quad c = \langle \mu^{(2)}, \mu^{(2)}\rangle, \qquad (2.8)$$
where $\mu^{(1)}$ and $\mu^{(2)}$ denote the two distinct rows of the normalized expected value matrix $\bar{X}$. For the following analysis, we can assume that $a, b, c \in [K/4, K]$, given that $X$ is normalized in Procedure Classify. We first show that the normalization of $X$ described in Procedure Classify guarantees not only that $|s_1(\bar{X}) - s_2(\bar{X})| \ne 0$, but also that there is a $\Theta(\sqrt{NK})$ gap between $s_1(\bar{X})$ and $s_2(\bar{X})$ (2.9), as stated in Proposition 2.5.

Proposition 2.5. For a normalized random matrix $X$, its expected value matrix $\bar{X}$ satisfies $s_1(\bar{X}) - s_2(\bar{X}) = \Omega(\sqrt{NK})$, with a constant depending on $c_0$ from Proposition A.8.

We next state a few important results that justify Procedure Classify. Note that the left singular vectors $\bar{u}_i$ of $\bar{X}$ are of the form $[x_i, \ldots, x_i, y_i, \ldots, y_i]^T$:
$$\bar{u}_1 = [x_1, \ldots, x_1, y_1, \ldots, y_1]^T \quad \text{and} \quad \bar{u}_2 = [x_2, \ldots, x_2, y_2, \ldots, y_2]^T, \qquad (2.11)$$
where $x_i$ repeats $N_1$ times and $y_i$ repeats $N_2$ times. We first show Proposition 2.6 regarding the signs of $x_i, y_i$, $i = 1, 2$, followed by a lemma bounding the separation of $x_2, y_2$. We then state the key Separation Lemma, which allows us to conclude that at least one of the top two left singular vectors of $\bar{X}$ can be used to classify the data at each round. It can be extended to the case $k > 2$.
Proposition 2.6. Let $b$ be as defined in (2.8): when $b > 0$, the entries $x_1, y_1$ of $\bar{u}_1$ have the same sign, while $x_2, y_2$ of $\bar{u}_2$ have opposite signs.
Lemma 2.7 (stated in full in the full version) bounds the separation $|x_2 - y_2|$ in terms of $\gamma$, $N_1$, $N_2$, and $K$.

Proof. Let $\Delta := P_1 - P_2$ as in Theorem 1.2, and let $b = [1, 0, \ldots, 0, -1, 0, \ldots, 0]^T$, where the $1$ appears in the first coordinate and the $-1$ in the $(N_1+1)$-st coordinate.

In Section 2.2, we first prove a proposition regarding $a, b, c$ as defined in (2.8). We next provide the proof of Theorem 2.4 regarding the largest singular value of $X - \bar{X}$. In Section 2.4, we show that the probability of error at each round for each individual is at most $f = 1/10$, given the sample size $n$ specified in Theorem 1.2. Hence, by taking a majority vote over the different runs for each sample, our algorithm finds the correct partition with probability $1 - 1/n^2$, given that at each round we take a set of $K > n$ independent features.

Detailed analysis for the simple algorithm
Throughout the rest of the paper, we use $X$, $Y$, $H$ to represent random matrices, where $H = XX^T$ and $Y = \begin{pmatrix} 0 & X \\ X^T & 0 \end{pmatrix}$. We use $\bar{X}$, $\bar{Y}$, $\bar{H}$ to represent the corresponding static matrices. Let us substitute $a, b, c$ into $\bar{H} = \bar{X}\bar{X}^T$, whose blocks, from top to bottom and from left to right, are of sizes $N_1 \times N_1$, $N_1 \times N_2$, $N_2 \times N_1$, and $N_2 \times N_2$, with constant entries $a$, $b$, $b$, and $c$ respectively. (2.12)

Proposition 2.10. For any choices of $\mu_i^k$, $ac \ge b^2$; by the definition of $a, b, c$ this is the Cauchy-Schwarz inequality.

Remark 2.11. Both matrices $\bar{X}$ and $\bar{X}\bar{X}^T$ have rank at most two. When $ac = b^2$, $\bar{H}$ has rank 1.

Proof of Theorem 2.4
Given upper bounds on both the maximum variance and the fourth moment of any entry, we have the following corollary of Theorem 2.2.
Corollary 2.12 (Largest Singular Value: Bounded Fourth Moment, Latala (2005)). For any finite $n \times m$ (where $n \le m$) matrix of independent mean zero r.v.'s $a_{i,j}$, such that the maximum variance of any entry is at most $\sigma^2$ and each entry has fourth moment at most $B$, we have $E[s_1(A)] \le C\left(\sigma\sqrt{m} + \sigma\sqrt{n} + (nmB)^{1/4}\right)$ for an absolute constant $C$.
Remark 2.13. The requirement that $\sigma^2$ is upper bounded is not essential. The conclusion in Corollary 2.12 holds so long as the fourth moment is bounded by $B$, since $\sigma^2 \le \sqrt{B}$ for mean-zero entries.
Let $M s_1(A)$ denote the median of $s_1(A)$. Following a calculation from Meckes (2004), the difference $|E[s_1(A)] - M s_1(A)|$ is bounded by an absolute constant times $D$, where $D \le 1$ for the Bernoulli random variables that we consider. This allows us to conclude Theorem 2.4.

Correctness of classification for the simple algorithm
We now prove the correctness of our algorithm. We first show how to choose $T$ for Procedure Classify. Let $B$ denote the fourth moment bound for a single random variable in the mean zero random matrix $X - \bar{X}$; for the type of normalized Bernoulli r.v.'s that we care about, $\sqrt{B}$ is of the order of $\sigma^2$, where $\sigma^2$ is defined in Theorem 1.2.
Assume that $N\gamma$ is larger than a sufficiently large constant. Let $s_1(X - \bar{X}) \le C_0\sqrt{K}$, where $C_0 = C_1 B^{1/4}$ as defined in Theorem 2.4, and let the threshold $T$ be chosen as above, which requires that (2.16) holds. Following Lemma A.4, (A.1), (A.3), and Proposition A.1, we obtain (2.17). We have two cases.

1. When $s_2(X) \le T$: by Lemma 2.7 (with $C_{\max}$ as defined there) we obtain an upper bound on $s_2(\bar{X})$ (2.20). Thus the condition of Theorem 2.14 holds with $c_2 = \frac{1}{2}$, so long as the sample size is as large as required in Theorem 1.2.

Let us write the first singular vector $u_1$ together with its "noise" vector $\epsilon = u_1 - \bar{u}_1$. It turns out that we only need to use the mixture mean to decide on which side to put a node, i.e., to partition $j \in [2N]$ according to $u_{1,j} < M$ or $u_{1,j} \ge M$, given that $N_1/N_2$ is a constant; misclassifying any entry contributes to $\|\epsilon\|_2^2$ as in (2.21).

Theorem 2.14. With $c_1 = \frac{5 C_1 B^{1/4}}{\sqrt{2 c_0}\,\sigma}$, where $C_1$ is specified in Theorem 2.4 and $c_0$ in Proposition 2.5, we can classify the two populations using the mixture mean $M$ with error factor at most $f$ for $N_1$ and $N_2$ respectively w.h.p.
By Lemma 2.1 and Theorem 2.4, we immediately have the following claim.
Claim 2.15. For $c_1 = \frac{5 C_1 B^{1/4}}{\sqrt{2 c_0}\,\sigma}$ as chosen in Theorem 2.14, where $C_1$ appears in Theorem 2.4 and $c_0$ appears in Proposition 2.5, the bound on $\|u_1 - \bar{u}_1\|_2$ follows from Lemma 2.1 and Theorem 2.4. This allows us to conclude the claim.
We need the following lemma, the proof of which appears in Appendix B.
Lemma 2.16. Assume that $2N \le K$ and that Condition (2.23) in Theorem 2.14 holds; then the estimate used in the proof below holds.

Proof of Theorem 2.14. Recall that the top singular vectors $\bar{u}_1, \bar{u}_2$ have the form $[x, \ldots, x, y, \ldots, y]$, where $x$ repeats $N_1$ times and $y$ repeats $N_2$ times; hence, w.l.o.g., assume that $x < y$. The total number of entries from $P_1$ that go above $M$, plus the number of entries from $P_2$ that go below $M$, cannot be too large, since their total contribution is upper bounded by $\|\epsilon\|_2^2 = \|u_1 - \bar{u}_1\|_2^2$. Let $\ell_1$ be the number of misclassified entries from $P_1$, i.e., those described in (2.25); it is bounded via Lemma 2.1 as in (2.27). We next bound the number of entries from $P_2$ that go below $M$, which cannot be too large either; let $\ell_2$ be the number of misclassified entries from $P_2$. Requiring the sample size bound of Theorem 1.2 then guarantees $\ell_1 \le f N_1$ and $\ell_2 \le f N_2$, and we have satisfied all requirements.
Combining Lemma 2.1 and Corollary 2.9, we have the following claim.

Claim 2.17. Given that $s_2(\bar{X})$ is sufficiently large, $\|u_2 - \bar{u}_2\|_2$ is small.

This allows us to prove the following theorem. Let the classification error factor be the number of misclassified individuals from one group divided by the total number of individuals in that group.
The next theorem handles the case $s_2(X) > T$; here $\omega_{\min}$ is the minimum possible weight allowed by the algorithm. By requiring $n$ to be as large as specified in Theorem 1.2, we can classify the two populations using $0$ to separate the components of $u_2$, with error factor at most $f$ for both $P_1$ and $P_2$ w.h.p.
Proof. Let $\ell_1, \ell_2$ be the number of misclassified entries from $P_1$ and $P_2$ respectively; each misclassified entry contributes at least $C x_{\min}/(2N)$, and these contributions sum to at most $\|u_2 - \bar{u}_2\|_2^2$; hence, by Claim 2.17, at most an $f$ fraction of each population is misclassified.

Finally, by running Procedure Classify for $\log n$ rounds, with probability of error at each round for each individual being $f = 1/10$, where at each round we take a set of $K > n$ independent features, and by taking a majority vote over the different runs for each sample, our algorithm finds the correct partition with probability $1 - 1/n^2$.

Proof. A sample is put on the wrong side with probability at most $1/10$ at each round. Let $E_i$ be the event that sample $i$ is misclassified in more than half of the $\log n$ rounds; thus $\Pr[E_i] \le \left(\frac{1}{10}\right)^{\log n} \le 1/n^{3.32}$; hence, by a union bound, with probability $1 - 1/n^2$, none of the $2N$ individuals is misclassified.
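A sketch of the majority-vote wrapper just described follows; it assumes the one-round routine `classify_block` from the sketch in Section 2 and aligns the arbitrary orientation of each round's 0/1 labels to the first round, a detail the argument above handles implicitly.

```python
import numpy as np

def classify_majority(X_full, K_block, rounds):
    """Majority vote over `rounds` disjoint feature blocks (sketch).

    classify_block is the one-round routine sketched earlier.  Each round's
    labels are flipped, if necessary, to agree with the first round, since a
    single SVD round only recovers the partition up to renaming the two sides.
    """
    n = X_full.shape[0]
    votes = np.zeros(n)
    ref = None
    for r in range(rounds):
        lab = classify_block(X_full[:, r * K_block:(r + 1) * K_block])
        if ref is None:
            ref = lab
        if np.mean(lab == ref) < 0.5:        # align orientation with round 1
            lab = 1 - lab
        votes += lab
    return (votes > rounds / 2).astype(int)
```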

Preliminaries
Let $V = \{1, \ldots, n\}$ be the set of all $n$ individuals, and let $\psi : V \to \{1, \ldots, k\}$ be the map that assigns to each individual the population it belongs to. Set $V_t = \psi^{-1}(t)$ and $N_t = |V_t|$. Moreover, let $E = (E_{vi})_{1\le v\le n, 1\le i\le K}$ be the $n \times K$ matrix with entries $E_{vi} = p_i^{\psi(v)}$. For any $1 \le t \le k$ we let $E_{V_t} = (p_l^t)_{l=1,\ldots,K}$ be the row of $E$ corresponding to any $v \in V_t$. In addition, let $A = (a_{vi})$ denote the empirical $n \times K$ input matrix. Thus, the entries of $E$ equal the expectations of the entries of $A$.
As in Theorem 1.3, we let $n_{\min} = \min_t N_t$ and let $\Gamma = \min_{i<j}\|E_{V_i} - E_{V_j}\|_2^2$ denote the minimum squared distance between the mean vectors of two populations. Further, set $\lambda = \sqrt{K}\sigma$. Then the assumption from Theorem 1.3 can be rephrased as
$$n_{\min}\,\Gamma \ge C_k\, k\,\lambda^2, \qquad (3.1)$$
where $C_k$ signifies a sufficiently large number that depends on $k$ only (the precise value of $C_k$ will be specified implicitly in the course of the analysis). As in the previous section, by repeating the partitioning process $\log n$ times, we may restrict our attention to the problem of classifying a constant fraction of the individuals correctly. That is, it is sufficient to establish the following claim.
Claim 3.1. There is a polynomial time algorithm Partition that satisfies the following. Suppose that (3.1) is true. Then w.h.p. Partition($A$, $k$) outputs a partition $(S_1, \ldots, S_k)$ of $V$ such that there exists a permutation $\sigma$ for which all but a small constant fraction of the individuals of each class $V_t$ lie in $S_{\sigma(t)}$.

Let $X = (x_{ij})_{1\le i\le n, 1\le j\le K}$ be an $n \times K$ matrix. By $X_i$ we denote the $i$'th row $(X_{i1}, \ldots, X_{iK})$ of $X$. Moreover, we let $\|X\|$ signify the operator norm of $X$. A rank $k$ approximation of $X$ is a matrix $\hat{X}$ of rank at most $k$ such that for any $n \times K$ matrix $Y$ of rank at most $k$ we have $\|X - \hat{X}\| \le \|X - Y\|$. Given $X$, a rank $k$ approximation $\hat{X}$ can be computed as follows. Letting $\rho = \mathrm{rank}(X)$, we compute the singular value decomposition $X = \sum_{i=1}^{\rho}\lambda_i\xi_i\eta_i^T$; here $(\xi_i)_{1\le i\le\rho}$ is an orthonormal family in $\mathbb{R}^n$, $(\eta_i)_{1\le i\le\rho}$ is an orthonormal family in $\mathbb{R}^K$, and we assume that the singular values $\lambda_i$ are in decreasing order (i.e., $\lambda_1 \ge \cdots \ge \lambda_\rho$). This can be accomplished in polynomial time within any numerical precision. Then $\hat{X} = \sum_{i=1}^{\min\{k,\rho\}}\lambda_i\xi_i\eta_i^T$ is easily verified to be a rank $k$ approximation.
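A minimal sketch of the rank-$k$ approximation just described, obtained by truncating the singular value decomposition (numpy-based, with illustrative names):

```python
import numpy as np

def rank_k_approx(X, k):
    """Best rank-k approximation of X in the operator (and Frobenius) norm,
    obtained by keeping the k largest singular triples of the SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = min(k, int(np.count_nonzero(s)))
    return U[:, :r] * s[:r] @ Vt[:r, :]
```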
In addition to the operator norm, we are going to work with the Frobenius norm $\|X\|_F = \left(\sum_{i,j} x_{ij}^2\right)^{1/2}$. Although the following fact is well known, we provide its proof for completeness.
Lemma 3.2. Let $X$ be a matrix of rank at most $k$ and let $X = \sum_{i=1}^{k}\lambda_i\xi_i\eta_i^T$ be a singular value decomposition as above. Then $\|X\|_F^2 \le k\|X\|^2$.

Proof. Since $\xi_1, \ldots, \xi_k$ and $\eta_1, \ldots, \eta_k$ are orthonormal families, we have $\|X\|_F^2 = \sum_{i=1}^{k}\lambda_i^2$. This implies the assertion, because $\lambda_i \le \|X\|$ for all $1 \le i \le k$.

Description of the algorithm
Algorithm 3.3. Partition($A$, $k$)
Input: An $n \times K$ matrix $A$ and the parameter $k$.
Output: A partition $S_1, \ldots, S_k$ of $V$.

1. Compute a rank $k$ approximation $\hat{A}$ of $A$. Then, for each of the candidate values $\Gamma_j = 2^j K^{-1}$, $j = 1, \ldots, 2\log K$, perform Steps 2-4.

2. For each $v \in V$, compute the set $Q(v) = \{w \in V : \|\hat{A}_w - \hat{A}_v\|^2 \le 0.01\,\Gamma_j\}$.

3. Greedily pick pairwise disjoint "big" sets $Q_1^{(j)}, \ldots, Q_k^{(j)}$, where each $Q_i^{(j)}$ results from some $Q(v_i)$ among the elements not yet covered.

4. Partition the entire set $V$ as follows: first, let $S_i^{(j)} \supseteq Q_i^{(j)}$; then assign each element $v$ not covered by $\bigcup_i Q_i^{(j)}$ to the set whose center vector $\xi_i^{(j)}$ is closest to $\hat{A}_v$, and compute the error parameter $r_j$.

5. Let $J$ be such that $r^* = r_J$ is minimum. Return $S_1^{(J)}, \ldots, S_k^{(J)}$.
The basic idea behind Partition is to classify each individual $v \in V$ according to its row vector $\hat{A}_v$ in the rank $k$ approximation $\hat{A}$. That is, two individuals $v, w$ are deemed to belong to the same population iff $\|\hat{A}_v - \hat{A}_w\|^2 \le 0.01\Gamma$. Hence, Partition tries to determine sets $S_1, \ldots, S_k$ such that for any two $v, w$ in the same set $S_j$ the distance $\|\hat{A}_v - \hat{A}_w\|$ is small. To justify this approach, we show that $\hat{A}$ is "close" to the expectation $E$ of $A$ in the following sense.
Lemma 3.4. W.h.p. we have $\|\hat{A} - E\|_F^2 = O(k\lambda^2)$.

Proof. Since both $\hat{A}$ and $E$ have rank at most $k$, and therefore $\hat{A} - E$ has rank at most $2k$, Lemma 3.2 yields $\|\hat{A} - E\|_F^2 \le 2k\|\hat{A} - E\|^2$; moreover, $\|\hat{A} - E\| \le \|\hat{A} - A\| + \|A - E\| \le 2\|A - E\| = O(\lambda)$ w.h.p.

Observe that Lemma 3.4 implies that for most $v$ we have $\|\hat{A}_v - E_v\|^2 \le 0.001\Gamma$ (say): the number $z$ of exceptional rows satisfies $z \ll n_{\min}$, due to our assumption that $n_{\min}\Gamma \gg k\lambda^2$. Thus, most rows of $\hat{A}$ are close to the corresponding rows of the expected matrix $E$. Therefore, the separation assumption $n > C_k\frac{\sigma^2}{\gamma\omega}$ from Theorem 1.3 implies that for most pairs of elements in different classes $v \in V_i$, $w \in V_j$ the squared distance $\|\hat{A}_v - \hat{A}_w\|^2$ will be large (at least $0.99\Gamma$, say). By contrast, for most pairs $v, w \in V_i$ of elements belonging to the same class, $\|\hat{A}_v - \hat{A}_w\|^2$ will be small (at most $0.01\Gamma$, say), because both rows are close to the same row $E_{V_i}$ of $E$. As the above discussion indicates, if the algorithm were given $\Gamma$ as an input parameter, the procedure described in Steps 2-4 (with $\Gamma_j$ replaced by $\Gamma$) would yield the desired partition of $V$. The procedure described in Steps 2-4 is very similar to the spectral partitioning algorithm from McSherry (2001).
However, since $\Gamma$ is not given to the algorithm as an input parameter, Partition has to estimate $\Gamma$ on its own. (This is the new aspect here in comparison to McSherry (2001), and this fact necessitates a significantly more involved analysis.) To this end, the outer loop goes through the $2\log K$ candidate values $\Gamma_j$. These values are then used to obtain partitions in Steps 2-4. More precisely, Step 2 uses $\Gamma_j$ to compute for each $v \in V$ the set $Q(v)$ of elements $w$ such that $\|\hat{A}_w - \hat{A}_v\|^2 \le 0.01\Gamma_j$. Then, Step 3 tries to compute "big" disjoint sets $Q_i^{(j)}$, where each $Q_i^{(j)}$ results from some $Q(v_i)$. Further, Step 4 assigns every element $v$ not covered by $\bigcup_i Q_i^{(j)}$ to the set whose "center vector" $\xi_i^{(j)}$ is closest to $\hat{A}_v$. In addition, Step 4 computes an "error parameter" $r_j$. Finally, Step 5 outputs the partition that minimizes the error parameter $r_j$.
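The sketch below follows Steps 2-4 for a single candidate value $\Gamma_j$, as described above. The exact definition of the error parameter $r_j$ is not fully specified in this extract; here the total squared distance of all rows to their assigned centers is used as a stand-in, the centers $\xi_i$ are taken to be the averages of the rows in each $Q_i$, and all names are illustrative.

```python
import numpy as np

def partition_round(Ahat, k, Gamma_j):
    """Steps 2-4 of Partition for one candidate Gamma_j (sketch).

    Ahat is the rank-k approximation of A.  Steps 2-3: repeatedly take the
    largest ball Q(v) = {w : ||Ahat_w - Ahat_v||^2 <= 0.01*Gamma_j} centered at
    a not-yet-covered row.  Step 4: assign every row to the nearest center and
    record an error parameter r_j (here: total squared distance to centers).
    """
    n = Ahat.shape[0]
    sq = (Ahat ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Ahat @ Ahat.T)   # squared row distances
    free = np.ones(n, dtype=bool)
    centers = []
    for _ in range(k):
        if not free.any():
            break
        ball = (d2 <= 0.01 * Gamma_j) & free[None, :]
        counts = ball.sum(axis=1).astype(float)
        counts[~free] = -1.0                  # only uncovered rows may serve as centers
        v = int(np.argmax(counts))
        Q = np.where(ball[v])[0]
        centers.append(Ahat[Q].mean(axis=0))  # "center vector" xi_i
        free[Q] = False
    C = np.vstack(centers)
    csq = (C ** 2).sum(axis=1)
    dist = sq[:, None] + csq[None, :] - 2.0 * (Ahat @ C.T)
    assign = dist.argmin(axis=1)
    r_j = float(dist[np.arange(n), assign].sum())
    return assign, r_j
```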
Thus, we need to show that eventually picking the partition whose error term $r_j$ is minimum yields a good approximation to the ideal partition $V_1, \ldots, V_k$. The basic reason why this is true is that the error of the partition based on the "empirical" centers $\xi_i$ will be about as small as $\|\hat{A} - E\|_F^2$ (cf. Lemma 3.4). In fact, the following lemma shows that if $\Gamma_j$ is "close" to the ideal $\Gamma$, then $r_j$ will be small.

Lemma 3.5. If $\frac{1}{2}\Gamma \le \Gamma_j \le \Gamma$, then $r_j \le C_0 k^3\lambda^2$ for a certain constant $C_0 > 0$.

We defer the proof of Lemma 3.5 to Section 3.3. Furthermore, the next lemma shows that any partition such that $r_j$ is small yields a good approximation to $V_1, \ldots, V_k$.
The proof of Lemma 3.6 can be found in Section 3.4.
Proof of Claim 3.1. Since the rank $k$ approximation $\hat{A}$ can be computed in polynomial time (within any numerical precision), Partition is a polynomial time algorithm. Hence, we just need to show that $K^{-1} \le \Gamma \le K$; for then Partition will eventually try a $\Gamma_j$ such that $\frac{1}{2}\Gamma \le \Gamma_j \le \Gamma$, so that the claim follows from Lemmas 3.5 and 3.6. To see that $K^{-1} \le \Gamma \le K$, recall that we explicitly assume that $\Gamma > K^{-1}$. Furthermore, all entries of the vectors $E_{V_i}$ lie between 0 and 1, whence $\Gamma \le \max_{i<j}\|E_{V_i} - E_{V_j}\|^2 \le K$.
Hence, due to (3.4) we obtain the desired bound. The remaining task is to establish (3.2)-(3.4). We proceed by induction on $l$. Thus, let us assume that (3.2)-(3.4) hold for all $l < L$; we are to show that then (3.2)-(3.4) are true for $l = L$ as well. As a first step, we establish (3.3). To this end, consider a class $V_i$ with $i \in \pi(\{1, \ldots, L-1\})$, whence the assumption (3.1) on $\Gamma$ yields the required estimate, provided that $C_k$ is sufficiently large. Moreover, for all $v \in Z_i$ we have the analogous bound, because we are assuming that $\Gamma_j \ge \Gamma/2$. In addition, let $w \in Q_l$ for some $l < L$; our choice of $i$ ensures the bound on $v$ and $w$ in (3.8). Finally, let $v_L$ signify the element chosen by Step 3 of Partition to construct $Q_L$. As this estimate holds for all $i \in \pi(\{1, \ldots, L-1\})$, (3.3) follows. Thus, we know that $Q_L$ is "big". As a next step, we prove (3.4), i.e., we show that $Q_L$ "mainly" consists of vertices in $V_{\pi(L)}$. Due to our assumption (3.1) on $\Gamma$, (3.9) yields that $|Y| < 0.01 n_{\min}$. Consequently, (3.3) entails the desired bound. Finally, to show (3.2), we note that the required estimate holds by construction (Step 3 of Partition). Thus, (3.2) follows. In the sequel, we shall assume without loss of generality that the map $\pi$ from Lemma 3.7 is just the identity, i.e., $\pi(i) = i$ for all $i$. Bootstrapping on the estimate $\|\xi_i - E_{V_i}\|^2 \le 0.1\Gamma$ for $1 \le i \le k$ from Lemma 3.7, we derive the following stronger estimate.
Proof. Let $i \ne l$ and consider a $v \in S_i \cap V_l$. We shall establish (3.13) below. Then, by Lemma 3.7 and the triangle inequality, we obtain a bound of order $\sqrt{\Gamma}$, so that the assertion follows from the stated estimate. Finally, we prove (3.13): we conclude that $\|\hat{A}_v - \xi_l\| \ge \|\hat{A}_v - \xi_i\|$, thereby completing the proof.
Proof of Lemma 3.5.
Furthermore, by Corollary 3.9 we obtain (3.18), provided that $C_k$ is sufficiently large (cf. (3.1)). Combining (3.17) and (3.18), we obtain a bound in terms of $n_{\min}$, provided that $C_k$ is large enough. Thus, we have established the first two parts of the lemma. In addition, observe that (3.18) implies that $\pi$ is bijective (because the sets $S_1, \ldots, S_k$ are pairwise disjoint and $|V_a| \ge n_{\min}$ for all $1 \le a \le k$).
Finally, the third assertion follows from an estimate of the corresponding sum over $1 \le a, b \le k$, where we assume once more that $C_k$ is sufficiently large.

Experiments
We illustrate the effectiveness of the spectral techniques using simulations. In particular, we explore the case when we have a mixture of two populations; we show that when $NK > 1/\gamma^2$ and $K > 1/\gamma$, either the first or the second left singular vector of $X$ shows an approximately correct partitioning, meaning that the success rate is well above $1/2$. The entry-wise expected value matrix $\bar{X}$ is as follows: among $K/2$ features, $p_i^1 > p_i^2$, and for the other half, $p_i^1 < p_i^2$, such that $|p_i^1 - p_i^2| = \alpha$ for all $i$, where $\epsilon = 0.1\alpha$; hence $\gamma = \alpha^2$. We report results on balanced cases only, but we do observe that unbalanced cases show similar tradeoffs. For each population $P$, the success rate is defined as the number of individuals of $P$ that are correctly classified, i.e., those that belong to a group in which $P$ is the majority, divided by the size of the population $|P|$.
Each point on the SVD curve corresponds to an average success rate over 100 trials. Since we are interested in exploring the tradeoffs between $N$ and $K$ in all ranges (e.g., when $N \ll K$ or $N \gg K$), rather than using the threshold $T$ of Procedure Classify, which is chosen for the case that both $N, K > 1/\gamma$, to decide which singular vector to use, we try both $u_1$ and $u_2$ and use the more effective one to measure the success rate at each trial. For each data point, the distribution of $X$ is fixed across all trials, and we generate an independent $X_{2N\times K}$ for each trial to measure the success rate based on the more effective classifier between $u_1$ and $u_2$.
One can see from the plot that when $K < 1/\gamma$, i.e., when $K = 200$ and $400$, no matter how much we increase $N$, the success rate remains consistently low. Note that a success rate of $50/100$ is equivalent to total failure. In contrast, when $N$ is smaller than $1/\gamma$, as we increase $K$ we can always classify with a high success rate, where in general $NK > 1/\gamma^2$ is indeed necessary to see a high success rate. In particular, the curves for $K = 5000, 2500, 1250$ show the sharpness of the threshold behavior as the sample size $n$ increases from below $1/(K\gamma^2)$ to above. For each curve, we also compute the best possible classification one could hope to make if one knew in advance which features satisfied $p_i^1 > p_i^2$ and which satisfied $p_i^1 < p_i^2$. These are the horizontal(ish) dotted lines above each curve. The fact that the solid curves approach these information-theoretic upper bounds shows that the spectral technique is correctly using the available information.
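A sketch of the simulation just described is given below. The exact per-feature probabilities (involving the elided $\epsilon = 0.1\alpha$ term) are not reproduced; instead one concrete choice consistent with $\gamma = \alpha^2$ is used, and a simplified median-split scoring replaces the exact threshold rule, so the numbers are only indicative.

```python
import numpy as np

def simulate_success(N, K, alpha, trials=100, rng=None):
    """Average success rate of the SVD classifier on a balanced two-population mixture.

    Assumed setup (illustrative): on half the features p1 = 0.5 + alpha/2 and
    p2 = 0.5 - alpha/2, reversed on the other half, so gamma = alpha^2.
    Both u1 and u2 are tried and the better one is scored, as in the experiments.
    """
    rng = rng or np.random.default_rng()
    p1 = np.full(K, 0.5); p2 = np.full(K, 0.5)
    p1[:K // 2] += alpha / 2; p2[:K // 2] -= alpha / 2
    p1[K // 2:] -= alpha / 2; p2[K // 2:] += alpha / 2
    truth = np.r_[np.zeros(N), np.ones(N)]
    rates = []
    for _ in range(trials):
        X = np.vstack([rng.random((N, K)) < p1, rng.random((N, K)) < p2]).astype(float)
        X -= X.mean(axis=0)                          # simple column centering
        U = np.linalg.svd(X, full_matrices=False)[0]
        best = 0.5
        for u in (U[:, 0], U[:, 1]):
            lab = (u >= np.median(u)).astype(float)  # median split (simplification)
            best = max(best, np.mean(lab == truth), np.mean(lab != truth))
        rates.append(best)
    return float(np.mean(rates))
```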
(In fact both $\pm s_i(X)$ are eigenvalues of $Y$, but this is irrelevant for our purposes.)
Proof of Lemma 2.1. We first state a theorem whose statement appears in a lecture note by Spielman (2002), with a slight modification (off by a factor on the RHS). Our proof of this theorem is included here for completeness. It is known that for any real symmetric matrix, there exists a set of $n$ orthonormal eigenvectors.
Theorem A.2 (Modified Version of Spielman (2002)). Let $A$ and $M$ be two symmetric matrices, let $\lambda_1 \ge \cdots \ge \lambda_n$ and $v_1, \ldots, v_n$ be the eigenvalues and corresponding orthonormal eigenvectors of $A$, and let $\mu_1 \ge \cdots \ge \mu_n$ be the eigenvalues of $M$ with $w_1, w_2, \ldots, w_n$ the corresponding orthonormal eigenvectors of $M$, with $\theta_i = \angle(v_i, w_i)$. Then each angle $\theta_i$ is bounded in terms of $\|A - M\|$ and the eigenvalue gaps of $M$.

Let us apply Theorem A.2 to the symmetric matrices $Y$ and $\bar{Y}$ in (A.1). In particular, we only compare the first $n$ eigenvectors of $Y$ and of $\bar{Y}$. For the numerator of the RHS of (A.2), we use the bound $s_1(Y - \bar{Y}) = s_1(X - \bar{X})$.
We first prove the following claim.
Claim A.3. For any symmetric $n \times n$ matrix $A$, let $\lambda_i$, $i = 1, \ldots, n$, be the eigenvalues of $A$ with orthonormal eigenvectors $v_1, v_2, \ldots, v_n$; then for all $y \perp v_i$ the quadratic form $y^T A y$ is controlled by the remaining eigenvalues.

Proof. Let us first assume $y \perp v_i$ and write $y = \sum_{j=1, j\ne i}^{n} c_j v_j$; the claim then follows by expanding the quadratic form in this eigenbasis.

Proof of Theorem A.2. Let us construct a vector $y$ that is orthogonal to $v_i$ and apply Claim A.3.

Proof. Let $S_j$ be a subspace of dimension $j$. Recall the Courant-Fischer characterization of $\lambda_i$ for a symmetric matrix $M$: $\lambda_i = \min_{\dim(S) = N-i+1}\ \max_{x \in S, \|x\|=1} x^T M x$. (A.5) In the following, let $S^v_{N-i+1}$ be the subspace that is orthogonal to the orthonormal eigenvectors $v_1, \ldots, v_{i-1}$ of the symmetric matrix $A$. Note that this is the $(N-i+1)$-dimensional subspace that achieves the minimum of the maximum of $v^T A v$ over all unit-length vectors $v$ in the particular subspace. For the other direction, let $S^w_{N-i+1}$ be the subspace that is orthogonal to the orthonormal eigenvectors $w_1, \ldots, w_{i-1}$ of the symmetric matrix $M$. Note that this is the $(N-i+1)$-dimensional subspace that achieves the minimum of the maximum of $w^T M w$ over all unit-length vectors $w$ in the particular subspace.

A.2. Some Propositions regarding the static matrices
For the static matrices $\bar{H} = \bar{X}\bar{X}^T$ and $\bar{Y} = \begin{pmatrix} 0 & \bar{X} \\ \bar{X}^T & 0 \end{pmatrix}$, we define $\mathrm{gap}(\bar{H}) = \lambda_1(\bar{H}) - \lambda_2(\bar{H})$, and analogously for $\bar{X}$ and $\bar{Y}$.

Proof. We first show the following: $\lambda_j(\bar{H}) = s_j(\bar{X})^2$, as in Proposition A.8, and $\lambda_j(\bar{Y}) = s_j(\bar{X})$. In particular, the lower bound on $\mathrm{gap}(\bar{H})$ translates into a lower bound on $\mathrm{gap}(\bar{X})$. For the upper bound on $\mathrm{gap}(\bar{X})$, we use $\mathrm{gap}(\bar{H}) \le N_1 a + N_2 c$.

Definition A.7. For our application, we have $1 \ge p_1^k, p_2^k \ge 0$ for all $k$, and $X$ is normalized as in Procedure Classify. It is easy to see that with this normalized random matrix, $\lambda_1(\bar{H}) = \lambda_2(\bar{H})$ is not possible, given that $a, b, c \in [K/4, K]$; furthermore, $\mathrm{gap}(\bar{H}) = \Theta(NK)$, as in Proposition A.8.

Proposition A.8. Given $\bar{H} = \bar{X}\bar{X}^T$ and $a, b, c$ as in (2.8), for any expected value mean matrix $\bar{X}$, which is not necessarily normalized, $\lambda_{1,2}(\bar{H}) = \frac{N_1 a + N_2 c \pm \sqrt{(N_1 a - N_2 c)^2 + 4N_1 N_2 b^2}}{2}$. Finally, for a normalized random matrix $X$ and its $\bar{X}$, we have $\mathrm{gap}(\bar{H}) \ge c_0 NK$ with $c_0$ being a constant, and combining this with the upper bound $\mathrm{gap}(\bar{H}) \le N_1 a + N_2 c \le 2NK$ concludes that $\mathrm{gap}(\bar{H}) = \Theta(NK)$.
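For completeness, here is a short derivation sketch of the eigenvalue formula used in Proposition A.8: because the rows of $\bar{X}$ take only two distinct values, the nonzero spectrum of $\bar{H}$ reduces to that of a $2\times 2$ matrix. The display below is our reconstruction of this standard computation, not a verbatim quote of the paper.

```latex
% Nonzero spectrum of \bar{H}: eigenvectors are constant on each population block,
% so the eigenvalue problem reduces to a 2x2 matrix.
\[
\bar{H} =
\begin{pmatrix} a\,\mathbf{1}_{N_1\times N_1} & b\,\mathbf{1}_{N_1\times N_2}\\
                b\,\mathbf{1}_{N_2\times N_1} & c\,\mathbf{1}_{N_2\times N_2}\end{pmatrix},
\qquad
\lambda_{1,2}(\bar{H})
 = \lambda_{1,2}\!\begin{pmatrix} N_1 a & \sqrt{N_1 N_2}\,b\\ \sqrt{N_1 N_2}\,b & N_2 c\end{pmatrix}
 = \frac{N_1 a + N_2 c \pm \sqrt{(N_1 a - N_2 c)^2 + 4 N_1 N_2 b^2}}{2}.
\]
% Hence gap(\bar{H}) = sqrt((N_1 a - N_2 c)^2 + 4 N_1 N_2 b^2) <= N_1 a + N_2 c,
% using b^2 <= ac (Proposition 2.10).
```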
Proof of Proposition 2.6. By (A.2), $\bar{u}_1, \bar{u}_2$ are the first and second eigenvectors of $\bar{H}$, corresponding to $\lambda_1(\bar{H})$ and $\lambda_2(\bar{H})$. Let $x, y$ be the entries that correspond to $P_1, P_2$ respectively in the first or second eigenvector of $\bar{H}$. By (A.9) and (A.10), we can express the ratio $y/x$ in terms of $a$, $b$, $c$, $N_1$, $N_2$. In addition, given any $b \ne 0$, we have $\mathrm{gap}(\bar{H}) > |N_1 a - N_2 c|$ and hence $\lambda_1(\bar{H}) > \max\{N_1 a, N_2 c\} > \lambda_2(\bar{H})$. Therefore, for $b > 0$, $y/x \ge 0$ for the first eigenvector and $y/x \le 0$ for $\bar{u}_2$; for $b < 0$, it is the opposite.

Fig 1. Plots show the success rate as a function of $N$ for several values of $K$, when $\gamma = (0.04)^2$. Each point is an average over 100 trials. Horizontal lines ("oracles") indicate the information-theoretically best possible success rate for that value of $K$ (how well one could do if one knew in advance which features satisfied $p_i^1 > p_i^2$ and which satisfied $p_i^1 < p_i^2$); they are not exactly horizontal because they are also an average over 100 runs. Vertical bars indicate the value of $N$ for which $NK = 1/\gamma^2$.