Second order accurate distributed eigenvector computation for extremely large matrices

We propose a second-order accurate method to estimate the eigenvectors of extremely large matrices thereby addressing a problem of relevance to statisticians working in the analysis of very large datasets. More specifically, we show that averaging eigenvectors of randomly subsampled matrices efficiently approximates the true eigenvectors of the original matrix under certain conditions on the incoherence of the spectral decomposition. This incoherence assumption is typically milder than those made in matrix completion and allows eigenvectors to be sparse. We discuss applications to spectral methods in dimensionality reduction and information retrieval.


Introduction
Spectral methods have a long list of applications in statistics and machine learning.Beyond dimensionality reduction techniques such as PCA or CCA [And03,MKB79], they have been used in clustering [NJW02], ranking & information retrieval [PBMW98,HTF + 01,LM05] or classification for example.Computationally, one of the most attractive features of these methods is their low numerical cost, in particular on problems where the data matrix is sparse (e.g.graph clustering or information retrieval).Computing a few leading eigenvalues and eigenvectors of a matrix, using the power or Lanczos methods for example, requires performing a sequence of matrix vector products and can be processed very efficiently.This means that when the matrix is dense and has dimension n, the cost of each iteration is O(n 2 ) in both storage and flops.
However, for extremely large scale problems arising in statistics or information retrieval for example, this cost quickly becomes prohibitively high and makes spectral methods impractical.In this paper, we propose a randomized, distributed algorithm to estimate eigenvectors (and eigenvalues) which makes spectral methods tractable on very large scale matrices.We show that our method is second order accurate and illustrate its performance on a few realistic datasets.
Going back to the numerical cost of spectral methods, we see that decomposing each matrix vector product in many smaller block operations partially alleviates the complexity problem, but makes the overall process very bandwidth intensive.Decomposition techniques thus improve the granularity of iterative eigenvalue methods (i.e.require many cheaper operations instead of a single very expensive one), but at the expense of significantly higher bandwidth requirements.Here, we focus on methods that improve the granularity of large-scale eigenvalue computations while having very low bandwidth requirements, meaning that they can be fully distributed over many loosely connected machines.
The idea of using subsampling to lower the complexity of spectral methods can be traced back at least to [GMKG91,PRTV00] who described algorithms based on subsampling and random projections respectively.Explicit error estimates followed in [FKV04, DKM06,AM07] which bounded the approximation error of either elementwise or columnwise matrix subsampling procedures.On the application side, a lot of work has been focused on the Pagerank vector, and [NZJ01] in particular study its stability under perturbations of the network matrix.Similar techniques are applied to spectral clustering in [HYJT08] and both works have close connections to ours.Following the Netflix competition on collaborative filtering, a more recent stream of works [RFP07, CR08, CT09, KMO09] has also been focused on exactly reconstructing a low rank matrix from a small, single incoherent set of observations.Finally, more recent "volume sampling" results provide relative error bounds [KV09], but so far, the sampling probabilities required to obtain these improved error bounds remain combinatorially hard to compute.
Our work here is focused on the impact of subsampling on eigenvector approximations.First we seek to understand how far we can reduce the granularity of eigenvalue methods using subsampling, before reconstructing eigenvectors becomes impossible.This question was partially answered in [CT09,KMO09] for matrices with low rank, incoherent spectrum, using a single subset of matrix coefficients, after solving a convex program with high complexity.Here we make much milder assumptions on matrix incoherence.In particular, we allow some eigenvectors to be sparse (while remaining incoherent on their support) and we approximate eigenvectors using many simple operations on subsampled matrices.Under certain conditions on the sampling rate which guarantee that we remain in a perturbative setting, we show that simply averaging many approximate eigenvectors obtained by subsampling reduces approximation error by an order of magnitude.

Notation.
In what follows, we write S n the set of symmetric matrices of dimension n.For a matrix X ∈ R m×n , we write X F its Frobenius norm, X 2 its spectral norm, σ i (X) its i-th largest singular value and let X ∞ = max ij |X ij |, while Card(X) is the number of nonzero coefficients in X.We denote by X(i, j) or X ij its (i, j)-th element and by M i the i-th column of M .Here, • denotes the Hadamard (i.e entrywise) product of matrices.When x ∈ R n is a vector, we write its Euclidean norm x 2 and x ∞ its ℓ ∞ norm.We write 1 ∈ R n the vector having all entries equal to 1. Finally, κ denotes a generic constant, whose value may change from display to display.

Subsampling
We first recall the subsampling procedure in [AM07] which approximates a symmetric matrix M ∈ S n using a subset of its coefficients.The entries of M are independently sampled as where p ∈ [0, 1] is the sampling probability.Theorem 1.4 in [AM07] shows that when n is large enough holds with high probability.In what follows, we will prove a similar bound on M − S 2 using incoherence conditions on the spectral decomposition of M .

Computational benefits
Computing k leading eigenvectors and eigenvalues of a symmetric matrix of dimension n using iterative algorithms such as the power or Lanczos methods (see [GVL90, Chap.8-9] for example) only requires matrix vector products, hence can be performed in O(kn 2 ) flops when the matrix is dense.However, this cost is reduced to O(k Card(M )) flops for sparse matrices M .Because the matrix S defined in (1) has only pn 2 nonzero coefficients on average, the cost of computing k leading eigenvalues/eigenvectors of S will typically be 1/p times smaller than that of performing the same task on the full matrix M .Of course, sampling the matrix S still requires O(n 2 ) flops, but can be done in a single pass over the data and be fully distributed.In what follows, we will show that, under incoherence conditions, averaging the eigenvectors of many independently subsampled matrices produces second order accurate approximations of the original spectral decomposition.While the global computational cost of this averaging procedure may not be globally lower, it is decomposed into many much smaller computations, and is thus particularly well adapted to large clusters of simple, loosely connected machines (Amazon EC2, Hadoop, etc.). ... Cost Figure 1: Our objective here is to approximate the spectral decomposition problem of size O(n 2 ) by solving many independent problems of much smaller size.

Sparse matrix approximations
Let us write the spectral decomposition of M ∈ S n as where u i ∈ R n for i = 1, . . ., n and λ ∈ R n are the eigenvalues of M with λ 1 > . . .> λ n (we assume they are all distinct).Let α ∈ [0, 1] n , we measure the incoherence of the matrix M as Note that this definition is slightly different from that used in [CT09] because we do not seek to reconstruct the matrix M exactly, so the tail of the spectrum can be partially neglected in our case.As we will see below, the fact that we only seek an approximation also allows us to handle sparse eigenvectors.
Let us define a matrix Q ∈ S n with i.i.d.Bernoulli coefficients We can write where C is has i.i.d.entries with mean zero and variance one, defined as We can now write the sampled matrix S in (1) as and we now seek to bound the spectral norm of the residual matrix E as n goes to infinity.Naturally, if E 2 is small, S is a good approximation of M in spectral terms, because of Weyl's inequality and the Davis-Kahan sin(θ)-theorem (see [Bha97]).So our aim now is to control E 2 so we can guarantee the quality of spectral approximations of M made using the sparse matrix S which is computationally easier to work with than the dense matrix M .We now make the following key assumptions on the incoherence of the matrix M .Assumption 1.There is a sequence of vectors α (n) ∈ [0, 1] n for which as n goes to infinity, where µ is an absolute constant.
In what follows, we will drop the dependence of α on n to make the notation less cumbersome, so instead of writing α (n) we will just write α.We have the following theorem.
Theorem 1. Suppose that Assumption 1 holds.Let us call α min = min 1≤i≤n α i .Assume that p and n are such that, p < 1/2, and for a given δ > 0, α min > (log n) (δ−3)/4 and Proof.Using [HJ91, Th. 5.5.19] or the fact that uu T • C = D u CD u , where D u is a diagonal matrix with the vector u on the diagonal (remember that • 2 is a matrix norm and hence sub-multiplicative), we get Since we assume that the vector u i is sparse with Card(u i ) ≤ n α i , C α i is a principal submatrix of C with dimension n α i .Now, we show in Theorem A-1 (this is the key element of the proof -see p.17) that , and α min > (log n) (δ−3)/4 for some δ > 0. (Our proof of Theorem A-1 relies on a result of Vu [Vu07] and Talagrand's inequality.).This yields Equation (5) and concludes the proof.
The proof of the theorem makes clear that the error term coming from the sparsest eigenvector will usually dominate all the others in the residual matrix E.
In these approximation methods, we naturally want to use a small p, so that S is very sparse and the computation of its spectral decomposition is numerically cheap.The result of Theorem A-2 guarantees that the subsampling approximation works whenever p ≫ (α min log n) 4 /n α min (asymptotically, but we have in mind a very high-dimensional setting, so n will be large in practice).
A natural question is therefore whether we could use p much smaller than this.Separate computations (see Subsection A-3) indicate that C/n 1/2 2 goes to infinity if p ≤ (log n) 1−δ /n, which suggests that this subsampling approach to approximating eigenproperties of M might run into trouble if the sampling rate p gets smaller than log n/n.As a matter of fact, we could not control the quantities C α i /n α i /2 2 at this sampling rate, which is naturally problematic given the way we established the bound on E 2 .Furthermore, if the sparsest eigenvector had support disjoint from the supports of all other eigenvectors, E would be the sum of two block diagonal matrices.Hence, its operator norm would be the maximum of the operator norms of the two blocks, at least one of which having potentially very large operator norm.

Tightness
Note that, in the limit case α = 1 where the eigenvectors are fully dense and incoherent, our bound is similar to the original bound in [AM07, Theorem 1.4] or that of [KMO09, Th 1.1] (our model for M is completely different however).In fact, the bounds in (2) and (5) can be directly compared.In the fully dense case where α = 1, we have so in this limit case, the original bound in (2) is always tighter than our bound in (5).However, in the sparse incoherent case where α = 1, the ratio of the bound (2) in [AM07] over our bound (5) becomes which can be large when α min < 1.The results in [KMO09], which are focused on exact recovery of low rank incoherent matrices, do not apply when the eigenvectors are sparse (i.e.α = 1).

Approximating eigenvectors
We now study the impact of subsampling on the eigenvectors and in particular on the one associated with the largest eigenvalue.We have the following theorem.
Theorem 2. Assume that the eigenvalues of M are simple.Let us call v k ∈ R n and λ k (S) the k-th eigenpair of S, and u k ∈ R n , λ k the k-th eigenpair of M .We write R k the reduced resolvent of M associated with u k , defined as and let We also call d k the separation distance of λ k , i.e the distance from λ k to the nearest eigenvalue of Proof.From now on we focus on u k and drop the dependence on k in u k , v k , R k , ∆ k etc... when this does not create confusion.We also use the notation λ S and λ instead of λ k (S) and where γ = λ S − λ.The formula is valid as soon as and assume that ∆ has no eigenvalues equal to -1, i.e Id + ∆ is invertible.Then we have We also have by construction Ru = 0, so REu = ∆u.Hence, we can write Putting all the elements together and recalling that u 2 = 1, we get (7) from Equation (8).
Spectral methods tend to focus on eigenvectors associated with extremal eigenvalues, so let us elaborate on the meaning of Theorem 2 for the eigenvector associated with the largest eigenvalue.If we suppose that the spectral norm of the residual matrix E is smaller than half the separation distance of the largest eigenvalue, i.e the previous result (and results such as [Kat95, Theorem II.3.9])shows that we can use perturbation expansions to approximate the leading eigenvector of the subsampled matrix.Based on the bound in Equation ( 5), the condition stated in Equation (9) will be satisfied (asymptotically with high-probability) if, for some We note that assumption (9) is likely reasonable if one eigenvalue is very large compared to the others, which is a natural setting for methods such as PCA.(Note however that our result is not limited to the largest eigenvalue but actually applies to any eigenvalue of the original matrix M , λ, for which E 2 is smaller than half the distance from λ to any other eigenvalue of M .In particular, the result would apply to several separated eigenvalues.)We also note that the approximation is accurate to order j + 2.
Let us now try to make our approximation slightly more explicit.If we write R the reduced resolvent of M (associated with u 1 ), and assume that λ 1 − λ 2 stays bounded away from 0, we have in this setting, using Equation (7) with j = 1, after we account for the fact that u T Eu is an order-E 2 2 accurate approximation of λ 1 (S) − λ 1 [Kat95, Eq. 2.36 and 3.18].This approximation makes clear that a key component in the accuracy of our approximations will be the size of the vector Eu.For simplicity here, we have normalized v so that v T u = 1; a similar result holds if we set v T v = 1 instead, if for instance E 2 → 0 asymptotically.

Second order accuracy result for eigenvectors by averaging
In light of Equation (10), it is clear that v is a first order accurate approximation of u, because of the presence of the (first-order) term REu in the expansion.We now show that we can get a second order accurate approximation of the eigenvector u.Our results are based on an averaging procedure and hence are easy to implement in a distributed fashion.We have the following second-order accuracy result.
Theorem 3. Let us call u 1 the eigenvector associated with the largest eigenvalue of M , and ν 1 = v 1 / v 1 the eigenvector associated with the largest eigenvalue of S and normalized so that ν 1 = 1 and ν T 1 u 1 ≥ 0. Let us call ξ = µ/(pn α min ) 1/2 .Suppose that the assumptions of Theorem 1 are satisfied (hence ξ → 0).

Suppose also that
Then we have Practically, this means that if we average eigenvectors over many subsampled matrices (after removing indeterminacy by always making the first component positive), the residual error will be of order In other words, by averaging subsampled eigenvectors, we gain an order of accuracy (over the method that would just take one subsampled eigenvector) by canceling the effect of the first order residual term REu.
Proof.To keep notations simple, we drop the index 1 in ν and u in the proof (so ν 1 = ν and u 1 = u).In what follows, κ is a generic constant that may change from display to display.Before we start the proof per se, let us make a few remarks.First, there is a technical difficulty when trying to work directly with v, namely the fact that it appears difficult to control E (Id + ∆) −1 2 and hence to get a bound on E[ v − u ] (with the normalization v T u = 1, v could be very large; our bounds show that this can happen with only low probability but obviously E[ v ] could still be large).To go around this difficulty, we need two steps: first, we work with unit eigenvectors (so we go from v to ν), and second we need a "regularization" step and will replace v by a vector ṽε which is equal to v with high-probability and for which we can control E[ ṽε − u ].More precisely, for ε > 0, we call ṽε the vector such that Its properties are studied in Theorem A-3.We call it below the ε-regularized version of v.
We note that under the assumptions of the current theorem we have ξ d → 0, so the results of Theorem A-3 apply.In particular, as shown in the proof of that Theorem, we have M 2 ∞ /p 2 = o ξ 2 .Also, Assumption 1 (which is made in Theorem 1), means µ is fixed so ξ → 0, as pn α min → ∞.
If v is the eigenvector of S associated with its largest eigenvalue, using the fact that (v − u) T u = 0 by construction, we have Turning our attention to ṽε , we see that, since Ru = 0 by construction and R is symmetric, u T ∆ = 0, so (ṽ ε − u) T u = 0, and hence ṽε we see that β = ν as long as (Id + ∆) −1 2 ≤ 1/ε, since when this happens, v = ṽε .Now we have (note the importance of the change of normalization here, as this bound would not hold with v instead of ν).Let us now work on controlling both these quantities.For reasons that will be clear later, we now take ε = 2ξ/d.
We show in Theorem A-3 that, for some κ > 0, asymptotically Control of P (ν = β).We have (essentially) seen in the proof of Theorem 2 above that if 2 E 2 /d < 1 − ε, then (Id + ∆) −1 2 ≤ 1/ε (see also the proof of Theorem A-3).Hence Recall that we have now chosen ε = 2ξ/d.In that case, we have

Now we show the following deviation inequality in Theorem
Recall also that for n large enough 0 ≤ m E ≤ 3ξ when the conditions of Theorem 1 apply (see Theorems 1 or arguments at the end of the proof of Theorem A-1).Suppose now that n is such that indeed m E ≤ 3ξ.

Then if d
2 − 4ξ > 0, we have 3 asymptotically.Since we assumed that d ≥ ξ ln(ξ −2 ) and ξ → 0, we indeed have ξ/d → 0. Therefore, All we have to do now is to verify that the asymptotics we consider, the quantity on the right-hand side of the previous equation remains less than ξ 2 /d 2 asymptotically.Elementary algebra shows that this is equivalent to saying that We have M 2 ∞ /p 2 = o ξ 2 , so the right-hand side is going to zero.In particular, we see that when d ≥ ξ ln(ξ −2 ), as we assume, the inequality above is satisfied asymptotically.As a matter of fact, when d < exp(1), and the result comes out of the fact that , the result is obvious as the righthand side of Equation ( 12) goes to 0 asymptotically, while the left-hand side is asymptotically larger than exp(2)/2 for instance.So we have shown that under our assumptions, We can finally conclude that This result applies to all eigenvectors corresponding to eigenvalues whose isolation distance (i.e distance to the nearest eigenvalue) satisfies the separation condition (11), which is a strong version of the separation condition (9).We note that we need the strong separation condition (Equation (11)) to be able to take expectations rigorously.
Finally, we note that theoretical as well as practical considerations seem to indicate that condition (9) (and hence (11)) is quite conservative.On the theoretical side, we see with Equation ( 8) that what really matters for the quality of the approximation is the norm of the vector or its expectation.We used in our approximations the coarse bound ∆ 2 ≤ 2 R 2 E 2 , which is convenient because it does not require us to have information about the eigenvectors of ∆.However, we see that the norm of l j could be small even when R 2 E 2 is not very small, for instance if u belonged to a subspace spanned by eigenvectors of ∆ associated with eigenvalues of this matrix that are small in absolute value.So it is quite possible that our method could work in a somewhat larger range of situations than the one for which we have theoretical guarantees.This is what our simulations below seem to indicate.

Variance
The expansion in Equation (10) also allows us to approximate the variance of the first-order residual REu after subsampling.This is useful in practice because it gives us an idea of how many independent computations we need to make to essentially void the effect of the first order term in the expansion of v.In terms of distributed computing, it therefore tells us how many machines we should involve in the computation.We have the following theorem.
Theorem 4. Let u 1 be the eigenvector associated with λ 1 , the largest eigenvalue of M .Let us call Assuming w.l.o.g. that λ 1 = M 2 , this bound yields in particular where NumRank(M ) = M 2 F / M 2 2 is the numerical rank of the matrix M and is a stable relaxation of the rank, satisfying 1 ≤ NumRank(M ) ≤ Rank(M ) ≤ n (see [RV07] for a discussion).
Proof.By construction, E[E] = 0 and because E is symmetric, the u i 's form an orthonormal basis and u T 1 Eu j is the j-th coefficient of Eu 1 in this basis, so the sum of the squared coefficients is the squared norm of the vector.Hence The variance of u T 1 Eu 1 is easy to compute if we rewrite this quantity as a sum of independent random variables.Also, separate computations (see Appendix, Subsection A-4) show that Assuming w.l.o.g. that λ 1 = M 2 , we get (13).

Nonsymmetric matrices
The results described above are easily extended to nonsymmetric matrices.Here M ∈ R m×n , with m ≥ n and we write its spectral decomposition where We can adapt the definition of incoherence to and reformulate our main assumption on M as follows.
In this setting, using again [HJ91, Th. 5.5.19],we get where we have assumed that u i , v i are sparse and C α i ,β i is a n α i × m β i submatrix of C. As in (5), we can then bound the spectral norm of the residual and we have almost surely.Perturbation results similar to (10) for left and right eigenvectors are detailed in [Ste98] for example.

Numerical experiments
In this section, we study the numerical performance of the subsampling/averaging results detailed above on both artificial and realistic data matrices Dense matrices: PCA, SVD, etc.We first illustrate our results by approximating the leading eigenvector of a matrix M as the average of leading eigenvectors of subsampled matrices, for various values of the sampling probability p.To start with a naturally structured dense matrix, we form M as the covariance matrix of the 500 most active genes in the colon cancer data set in [ABN + 99].We let p vary from 10 −4 to 1 and for each p, we compute the leading eigenvector of 1000 subsampled matrices, average these vectors and normalize the result.We call u the true leading eigenvector of M and v the approximate one.We now normalize v so that v 2 = 1 (which is standard, but different from the normalization we used in our theoretical investigations where we had u T v = 1).In Figure 2, we plot u T v as a function of p together with the median of u T v sampled over all individual subsampled matrices, with dotted lines at plus and minus one standard deviation.We also record the proportion of samples where E satisfies the perturbation condition (9).
We repeat this experiment on a (nonsymmetric) term-document matrix formed using press release data from PRnewswire, to test the impact of subsampling on Latent Semantic Indexing results.Once again, we let p vary from 10 −2 to 1 and for each p, we compute the leading eigenvector of 1000 subsampled matrices, average these vectors and normalize the result.We call u the true leading eigenvector of M and v the approximate one.In Figure 3 on the left, we plot u T v as a function of p together with the median of u T v sampled over all individual subsampled matrices, with dotted lines at plus and minus one standard deviation.The matrix M is 6779 × 11171 with spectral gap σ 2 /σ 1 = 0.66.
In Figure 3 on the right, we plot the ratio of CPU time for subsampling a gene expression matrix of dimension 2000 and computing the leading eigenvector of the subsampled matrix (on a single machine), over CPU time for computing the leading eigenvector of the original matrix.Two regimes appear, one where the eigenvalue computation dominates with computation cost scaling with p, another where the sampling cost dominates and the speedup is simply the ratio between sampling time and the CPU cost of a full eigenvector computation.Of course, the principal computational benefit of subsampling is the fact that memory usage is directly proportional to p.
A key difference between the experiments of Figure 2 and those of 3 is that the leading eigenvector of the gene expression data set is much more incoherent than the leading left eigenvector of the term-document matrix, which explains part of the difference in performance.We compare both eigenvectors in Figure 4.
We then study the impact of the number of samples on precision.We use again the colon cancer data set in [ABN + 99].In Figure 5 on the left, we fix the sampling rate at p = 10 −2 and plot u T v as a function of the number of samples used in averaging.We also measure the impact of the eigenvalue gap λ 2 /λ 1 on precision.We scale the spectrum of the gene expression covariance matrix so that its first eigenvalue is λ 1 = 1 and plot the alignment u T v between the true and the normalized average of 100 subsampled eigenvectors over subsampling probabilities p ∈ [10 −2 , 1] for various values of the spectral gap λ 2 /λ 1 ∈ {0.75, 0.95, 0.99}.
Graph matrices: ranking.Here, we test the performance of the methods described above on graph matrices used in ranking algorithms such as pagerank [PBMW98] (because of its susceptibility to manipulations however, this is only one of many features used by search engines).Suppose we are given the adjacency matrix of a web graph, with where A ∈ R n×n (one such matrix is displayed in Figure 6).Whenever a node has no out-links, we link it with every other node in the graph, so that B = A + δ1 T /n, with δ i = 1 if and only if deg i = 0, where deg i is the degree of node i.We then normalize B into a stochastic matrix P g ij = B ij /deg i .The matrix P g is the transition matrix of a Markov chain on the graph modeling the behavior of a web surfer randomly clicking on links at every page.For most web graphs, this Markov chain is usually not irreducible but if we set for some c ∈ (0, 1], the Markov chain with transition matrix P will be irreducible.An additional benefit of this modification is that the spectral gap of P is at least c [HK03].The leading (Perron-Frobenius) eigenvector u of this matrix is called the Pagerank vector [PBMW98], its coefficients u i measure the stationary probability of page i being visited by a random surfer driven by the transition matrix P , hence reflect the importance of page i according to this model.
The coefficients of pagerank vectors typically follow a power law for classic values of the damping factor [PRU06,BC06] which means that the bounds in assumption 1 do not hold.Empirically however, while the distance between true and averaged eigenvectors quickly gets large, the ranking correlation (measured using Spearman's ρ [Mel07]) is surprisingly robust to subsampling.
We use two graphs from the Webgraph database [BV04], wb-cs.stanfordwhich has 9914 nodes and 36854 edges, and cnr-2000 which has 325,557 nodes and 3,216,152 edges.For each graph, we form the transition matrix P as in [GZB04] with uniform teleportation probability and set the teleportation coefficient c = 0.85.In Figure 6 we plot the wb-cs.stanfordgraph and the Pagerank vector for cnr-2000 in loglog scale.In Figure 7 we plot the ranking correlation (Spearman's ρ) between true and averaged Pagerank vector (over 1000 samples), the median value of the correlation over all subsampled matrices and the proportion of samples satisfying the perturbation condition (9), for various values of the sampling probability p.We notice that averaging very significantly improves ranking correlation, far outside the perturbation regime.

Conclusion
We have proposed a method to compute the eigenvectors of very large matrices in a distributed fashion: 1.To each node in a computer cluster of size N , we send a subsampled version S i of the matrix of interest, M .
2. Node i computes the relevant eigenvectors of S i .
3. The N eigenvectors are averaged together and normalized to produce our final estimator.
The key to the algorithm is that Step 2 is numerically cheap (because S i is very sparse), and hence can be executed fast even on small machines.Therefore a cluster or cloud of small machines could be used to approximate the eigenvectors of M , a difficult problem in general when M is extremely large.
We have shown that under carefully stated conditions, the algorithm described above will yield a secondorder accurate approximation of the eigenvectors of M .This gain in accuracy comes from the averaging step of our algorithm.We note that arguments similar to the ones we used in this paper could be made to compute second-order accurate approximations of the eigenvalues of M .(We restricted ourselves to eigenvectors here because in methods such as PCA, the eigenvectors are in some sense more important than the eigenvalues.)Our results depend on a measure of incoherence for M that we propose in this paper.They also show that subsampling will work if the sampling probability is small, but is likely to fail if that probability is too small.Finally, our simulations show that we gain significantly in accuracy by averaging subsampled eigenvectors (which suggests that our theoretical passage from first-order to second-order accuracy is also relevant in practice) and that the performance of our method seems to degrade for very incoherent matrices, a result that is also in line with our theoretical predictions.

A Appendix
A-1 On C 2 Let us consider the symmetric random matrix C with entries distributed as, for i ≥ j, We assume that C is n × n.Our aim is to show that we can control C 2 and in particular its deviation around its median.We do so by using Talagrand's inequality.
We have the following theorem.
Theorem A-1.Suppose that we observe n matrices C α i , for 1 ≤ i ≤ n with entries distributed as those of the matrix C just described.Suppose these matrices are of size n α i , where α i are positive numbers.Call α min = min 1≤i≤n α i and assume that, for some fixed δ > 0, α min > (log n) (δ−3)/4 .Suppose further that p is such that lim n→∞ (α min log n) 4 /(n α min p) = 0. Then Proof.We note that the application C → C 2 is a convex, √ 2-Lipschitz (with respect to Euclidian/Frobenius norm) function of the entries of C that are on or above the main diagonal.As a matter of fact, since • is a norm, it is convex.Furthermore, if A and B are two symmetric matrices, Now recall the consequence of Talagrand's inequality [Tal95] spelled out in [Led01], Corollary 4.10 and Equation (4.10): if F is a convex, 1-Lipschitz function (with respect to Euclidian norm) on R n , of n independent random variables (X 1 , . . ., X n ) that take value in [u, v], and if m F is a median of F (X 1 , . . ., X n ), then The random variables that are above the main diagonal of C are bounded, and take value in Therefore, calling m n the median of n −1/2 C 2 , we have, in light of Equation (A-3), Suppose now that we have a collection C α i of matrices of size n α i with entries distributed as in Equation (A-1).(Note that the matrices could be dependent.)Let us call m n α i the medians of C α i /n α i /2 2 .Then we have, by a simple union bound argument, for any k, where α min = min 1≤i≤k α i .Suppose now that k = n, p ≤ 1/2, pn α min > (log n) 1+δ , and t ≥ (log n) −δ/3 for some δ > 0.Then, t 2 p(1 − p)n α min > (log n) 1+δ/3 /2, which tends to ∞ as n → ∞.Because u n = n exp(−(log n) 1+δ/3 /16) is the general term of a converging series, we have, when p ≤ 1/2 and pn α min > (log n) 1+δ for some δ > 0, max 1≤i≤n by a simple application of the Borel-Cantelli lemma.Hence, we have Now all we have to do is control max 1≤i≤n m n α i , which is the maximum of a deterministic sequence.Recall Vu's Theorem 1.4 in [Vu07], applied to our situation where we are dealing with bounded random variables with mean 0 and variance 1: if the matrix C has entries as above and is n × n, then almost surely, for some constant κ 0 .So as soon as (log n) 4 /(pn) remains bounded, so does m n , the median of (Note that this is true because we are taking the maximum of elements of a fixed deterministic sequence that is asymptotically less than or equal to 2 + ε, for any ε and the smallest argument is going to infinity.All the work using Talagrand's inequality was done to allow us to switch from having to control the maximum of a random sequence to that of a deterministic sequence.)Now when (α min log n) 4 /(pn α min ) → 0, we have a fortiori pn α min > (log n) 1+δ when α min > (log n) (δ−3)/4 .So we conclude that when (α min log n) 4 /(pn α min ) → 0 and α min > (log n) Let us now consider the related issue of understanding the matrix E = r p M •C, where r p = (1 − p)/p, M is a deterministic matrix and C is a random matrix as above.
Theorem A-2.Suppose E = r p M • C, where C is a symmetric random matrix distributed as above, M is a deterministic matrix and r p = (1 − p)/p.Let us call m E a median of E 2 .Then we have Hence, in particular, and Proof.The crux of the proof is quite similar to that of Theorem A-1: we will rely on Talagrand's concentration inequality for convex 1-Lipschitz functions of bounded random variables.To do so let us consider the map: This map f is convex as the composition of a norm with an affine mapping.
Let us now show that it is ( √ 2 M ∞ )-Lipschitz with respect to Euclidian norm: if we denote by c (k) i,j the (i, j)-th entry of the matrix C k , we have Hence, f is indeed a ( √ 2 M ∞ )-Lipschitz function of the entries of C that are above or on the diagonal.Now the function of C we care about is g(•) = r p f (•), which is convex and √ 2 M ∞ r p -Lipschitz.Given that the entries of C are bounded, we have, as in the proof of Theorem A-1, Now using the proof of Proposition 1.9 in [Led01] (see p.12 of this book), we conclude that , and Therefore, since for a and b positive, More generally, we see, using essentially Proposition 1.10 in [Led01] and elementary properties of the Gamma function, that if the random variable F is such that for a deterministic number a F , P Applying this result with k = 3, we get In our context, using the fact that, for positive a and b, (a + b) 3 ≤ 4(a 3 + b 3 ) by convexity, we also have

A-2 Regularized eigenvector considerations
We now have the following (regularized) second order accuracy result, which is a critical component of the proof of Theorem 3, one of the main results of the paper.
Theorem A-3.Suppose that the assumptions of Theorem 1 are satisfied.We consider the approximation of u the eigenvector associated with the largest eigenvalue of M .Recall that v is the eigenvector corresponding to the leading eigenvalue of the subsampled matrix S. For ε > 0, we call ṽε the vector such that Then, for any η > 0, we have asymptotically, Suppose further that we are in an asymptotic setting where 1 Proof.Let us first show that our regularization does not change the vector we are dealing with with highprobability.ṽε = v as long as (pn α min ) 1/2 → 0 and we have according to Theorem A-2 E 2 ≤ 2 µ (pn α min ) 1/2 with high-probability, we conclude that with high-probability, ṽε = v.
Using Equation (8) with j = 1, we see that, since Recall that by construction E[E] = 0. Hence, since R is a fixed deterministic matrix and u is a deterministic vector, So, if we now use the fact that u = 1, we have Let us now show that we can control the right-hand side of the previous equation.We prove in Theorem A-2 that where m E is a median of the random variable E 2 .Our asymptotic control of E 2 in (5) gives allows us to control m E , namely, In other respects, we clearly have M ∞ ≤ n i=1 λ i u i 2 ∞ , and hence since we are in a setting where Furthermore, we prove in Theorem A-2 that At the end of Subsection 2.2, we mentioned a corollary (see below) of the following theorem: Theorem A-4.Suppose that p = (log n) 1−δ u n /n, for a fixed δ in (0, 1) and for a fixed κ, 0 < u n ≤ κ.

Suppose further that we can find
Recall that practically, this theorem suggests that if we don't sample enough the matrix M (i.e p is too small), a subsampling approximation to its eigenproperties is not likely to work.Let us now prove it.
Proof.Our strategy is to show that the largest diagonal entry of C T C/n goes to infinity.To do so, we will rely on results in random graph theory.Let us examine more closely this diagonal.Using the definition of C, we see that, if T = C T C, and d i is the number of times (1 − p)/p appears in the i-th column of C, Now {d i } is the degree sequence of an Erdös-Renyi random graph.According to [Bol01], Theorem 3.1, if k is such that n n−1 p p k (1 − p) n−1−k → ∞, then, if X k is the number of vertices with degree greater than k, lim n→∞ P (X k ≥ t) = 1 , for any t.So if we can exhibit such a k, then max d i ≥ k with probability going to 1.We now note that for small p, Hence, if our k is also such that k/pn → ∞, we will indeed have and the theorem will be proved.
• h/(pn) = v n = o (log n) by assumption.We can finally conclude that max i T (i, i)/n ≥ k 2np with probability going to 1 .
But because v n → ∞, we have k/(2np) → ∞ and the theorem is proved.
We have the following corollary to which we appealed in Subsection 2.2.
The previous corollary follows immediately from Theorem A-4, by noticing that u n is lower bounded under our assumptions and by taking v n = (log n) δ/5 .

A-4 Variance computations
We provide some details here to complement the explanations we gave in the proof of Theorem 4 in Subsection 2.6.
On E[E 2 ] Let us explain why this matrix is diagonal and compute the coefficients on the diagonal.Recall that E = (1 − p)/pM • C, where C is a random matrix whose above-diagonal elements are independent, have mean 0 and variance 1. E is naturally symmetric and we call E i its i-th column.Naturally, E 2 (i, j) = E T i E j .Suppose first that i = j.The elements of E i and E j are independent, except for E ij and E ji , which are equal.In particular, E ki and E kj are independent for all 1 ≤ k ≤ n.Recall also that E[C] = 0, so E[E] = 0. Combining all these elements, we conclude that, if i = j, Therefore E[E 2 ] is diagonal.Let us now turn our attention to computing the elements of the diagonal.This is simple since We note that this is the result we announced in the proof of Theorem 4 in Subsection 2.6.
On var(u T Eu) Rewriting this quantity as a sum of independent quantities greatly simplifies the computation.If we pursue this route, we have Because the previous expression is a sum of independent random variables, we immediately conclude that Calling w = u • u and M = M • M , we immediately recognize in the last expression the quantity as announced in the proof of Theorem 4.

Figure 2 :
Figure2: Left: Alignment u T v between the true and the normalized average of 1000 subsampled eigenvectors (blue circles), median value of u T v over all sampled matrices (solid black line), with dotted lines at plus and minus one standard deviation and proportion of samples satisfying the perturbation condition (9) (dashed red line), for various values of the sampling probability p on a gene expression covariance matrix.Right: Zoom on the the interval p ∈ [10 −2 , 1].

Figure 3 :Figure 4 :
Figure3: Left: Alignment u T v between the true and the normalized average of 1000 subsampled left eigenvectors (blue circles), median value (solid black line), dotted lines at plus and minus one standard deviation and proportion of samples satisfying condition (9) (dashed red line), for various values of the sampling probability p on a term document matrix with dimensions 6779 × 11171.Right: Speedup in computing leading eigenvectors on gene expression data, for various values of the sampling probability p.

Figure 5 :Figure 6 :
Figure5: Left: Alignment u T v between the true leading eigenvector u and the normalized average leading eigenvector versus number of samples, on the gene expression covariance matrix with subsampling probability p = 10 −2 .Right: Alignment u T v for various values of the spectral gap λ 2 /λ 1 ∈ {0.75, 0.95, 0.99}.

Figure 7 :
Figure7: Ranking correlation (Spearman's ρ) between true and averaged pagerank vector (blue circles), median value of the correlation over all subsampled matrices (solid black line), dotted lines at plus and minus one standard deviation and proportion of samples satisfying the perturbation condition (9) (dashed red line), for various values of the sampling probability p. Left: On the wb-cs.stanfordgraph.Right: On the cnr-2000 graph.