Two Proposals for Robust PCA using Semidefinite Programming

The performance of principal component analysis (PCA) suffers badly in the presence of outliers. This paper proposes two novel approaches for robust PCA based on semidefinite programming. The first method, maximum mean absolute deviation rounding (MDR), seeks directions of large spread in the data while damping the effect of outliers. The second method produces a low-leverage decomposition (LLD) of the data that attempts to form a low-rank model for the data by separating out corrupted observations. This paper also presents efficient computational methods for solving these SDPs. Numerical experiments confirm the value of these new techniques.


Introduction
Principal component analysis (PCA), proposed in 1933 by Hotelling [23], is a common technique for summarizing high-dimensional data.Principal components are designed to identify directions in which the observations vary most.As a consequence, PCA is often used to reduce the dimension of the data.
Statistics based on variance, such as principal components, are highly sensitive to outliers [43].The literature on robust statistics contains a wide variety of techniques that attempt to correct this shortcoming [25].Unfortunately, many of these approaches are based on intractable optimization problems or lack a principled foundation.
Our focus in this work is to develop new formulations for robust PCA that can be solved efficiently using convex programming algorithms.Our first proposal, which we call maximum mean absolute deviation rounding (MDR), exchanges the variance in the definition of PCA with a function less sensitive to outliers known as the mean absolute deviation.Although this formulation leads to a non-convex optimization problem, we demonstrate that it is possible to approximate the optimum by relaxing to a semidefinite program and randomly rounding the solution.This method can be viewed as a specific instance of projection-pursuit PCA [26].
Our second proposal uses a different semidefinite program to split the input data into the sum of a low-leverage matrix and a matrix of corrupted observations.We refer to this dissection as a low-leverage decomposition (LLD) of the data.This method is similar in spirit to the rank-sparsity decomposition of Chandrasekaran et al. [7].While preparing this manuscript, we learned of an independent investigation into this formulation of robust PCA by Xu et.al. [46,47].
We describe algorithms that solve these semidefinite programs efficiently, and we provide numerical experiments that confirm the effectiveness of these new techniques.We begin with a brief overview of our proposals before laying out the details in Sections 2 and 3.
1.1.The Data Model.Suppose that we have a family {x i } n i=1 of n observations in p dimensions.We form an n × p data matrix X whose rows are the observations.The observations are assumed to be centered; that is, 1 n i x i ≈ 0. While our methods do not explicitly require the data to be centered, this hypothesis allows us to interpret principal components as directions of high variance in the data.We discuss practical centering approaches in Section 5.
1.2.Maximizing the Mean Absolute Deviation.Our first method is designed to mitigate a source of sensitivity in classical principal component analysis.The top principal component v PCA is defined as a direction of maximum variance in the data: (1.1) The squared inner products in (1.1) may lead to outsized influence of outlying points because squaring a large number results in a huge number, which can drag the principal component away from the bulk of the data.We can reduce this effect by replacing the squared inner product with a measure of spread that is less sensitive.We propose the use of the absolute value of the inner product: where we have added the subscript MD to indicate that we have exchanged the variance in equation (1.1) with a measure of spread known as the mean absolute deviation (MD) [25, p. 2].This revision results in some complications.The formulation (1.1) is an eigenvector problem which can be solved efficiently.In contrast, it is NP-hard to compute v MD .Nevertheless, we develop an efficient randomized algorithm that provably computes an approximate solution to (1.2).We call this approach maximum mean absolute deviation rounding (MDR).
Our main result, Theorem 2.2, states that, for any failure probability δ > 0 and loss factor ε > 0, our algorithm produces a unit-norm vector v MDR such that The algorithm requires that we solve one semidefinite program (SDP) whose size is polynomial in the number of observations.Since SDPs are solvable in polynomial time using interior-point methods, our algorithm is tractable in principle.In practice, solving SDPs can be daunting even for moderately sized input data-say, more than 100 observations.To address this issue, we detail a technique of Burer and Monteiro [4,5] that can usually solve the SDP efficiently, and in Section 5 we provide some numerical evidence that this approach succeeds.We find additional components by greedily restricting the data to a subspace perpendicular to the previous components and solving (1.2) again.
This proposal is not without precedent.A more general formulation appears in Huber's book [24, p. 203], and it is now known as projection-pursuit PCA (PP-PCA) [26].We provide further detail on PP-PCA in Section 2.2 and discuss the history of the method in 4.1.

1.3.
A Low-Leverage Decomposition.Our second proposal stems from a different interpretation of classical principal component analysis.Instead of viewing classical principal components as directions of maximum variance, we can view them as an optimal low-rank model for the data [6].Suppose P is a matrix that solves minimize X − P F subject to rank(P ) = T.
The dominant principal components of X are given by the T right singular vectors of P corresponding with the nonzero singular values of P .
With real data, one is often faced with the situation where entire observations are corrupted.If this is the case, we would still like to recover a low-rank model.We can develop as natural formulation for identifying a low-rank model using the well-known rank sparsity [15] and group sparsity [37] heuristics.We propose to decompose the data matrix as X = P LLD + C LLD by solving the semidefinite program minimize i σ i (P ) + γ j c j 2 subject to P + C = X. (1.3) We have written σ i (P ) for the ith singular value of P and c i for the ith row of C. We view the optimal matrix P LLD as a surrogate for the low-rank approximation to the uncorrupted data, and the optimal matrix C LLD as an approximation of the corrupted data.The formulation (1.3) has an interesting property even when P LLD is not low-rank or C LLD is not row-sparse: P LLD is guaranteed to be a low-leverage set of observations in a sense we make precise in Section 3.1.As a result, we refer to X = P LLD +C LLD as a low-leverage decomposition (LLD) of the data.We define the dominant LLD components as the right singular vectors of P LLD .
This optimization problem is similar to the rank-sparsity decomposition problem proposed in [7]; see also [6].We discuss these ideas at more length in Section 4. As this manuscript was being prepared, we learned of an independent investigation of the program (1.3) for robust PCA by Xu et.al. [46,47] that provides conditions for recovery of the support of the corruption and the row-space of the uncorrupted observations.1.4.Road map.Sections 2 and 3 describe our proposals in more detail, including theoretical guarantees and practical algorithms.Section 4 offers an overview of previous work on robust PCA, while Section 5 describes numerical experiments illustrating the performance of our methods in various settings.A technical appendix contains the proofs of supporting results.1.5.Notation.We work exclusively with real numbers.The symbols P and E denote probability and expectation, respectively.We use ∂ to denote the subgradient map.
Bold capital letters denote matrices while bold lower-case letters denote vectors.We represent the ith row of a matrix A by a i and the jth entry of a vector a by a j .The adjoint of a matrix A is written A * .When referring to matrix elements, we sometimes use the notation [A] ij , and similarly for vectors we use [a] i .
We use the compact convention for the singular value decomposition (SVD) of a matrix: when A is rank r, we write its SVD as A = U ΣV * , where U and V have orthonormal columns, and Σ is a non-singular diagonal matrix whose entries are positive and are arranged in weakly decreasing order.The notation A B denotes that A − B is positive semidefinite.1.5.1.Norms.We denote the p vector norm as The Frobenius norm of a matrix is defined by A 2 F = A, A , where •, • represents the standard inner product.The Moore-Penrose pseudoinverse of a matrix A is denoted A † .
We define the p to q operator norm and its dual respectively by Au q , and B * p→q = sup Table 1 describes some of the specific operator norms used in this work.We also use the norms A 2→1 and A ∞→1 , which lack such simple descriptions; see Sections 2.3 and 2.4.The operator norm of the adjoint satisfies A * q * →p * = A p→q where p and q satisfy the conjugacy relations 1/p + 1/p * = 1 and 1/q + 1/q * = 1 with the convention 1/∞ = 0.

Maximum Mean Absolute Deviation Rounding
Our first method is based on the classical interpretation of the top principal component as the direction of maximum empirical variance in multidimensional data.It has long been recognized that the variance is highly sensitive to outliers in the data [43].The field of robust statistics has reacted by developing and analyzing robust measures of spread known as robust scales; see [25,Ch. 5] or [30,Sec. 2.5].This literature describes a generic method for determining robust principal components by replacing the variance with a robust measure of scale.Li and Chen [26] published the first investigation of this under the name projection-pursuit PCA (PP-PCA).Our proposal is a specific instance of PP-PCA with the mean absolute deviation scale (2.1).We show that this formulation is computationally intractable, but we develop an algorithm that provably approximates its solution.To our knowledge, this is the first rigorous algorithm for PP-PCA with a robust scale.
2.1.Scales.A scale is a function that measures the spread of one-dimensional data [25,Ch. 5].By far, the most common scale is the empirical standard deviation, defined 1 as where we we assume the data y is centered.Of course, the standard deviation is not the only way to measure the spread of the data.An alternative proposal [25, p. 2] is the mean absolute deviation (MD).For centered data y, the MD scale is defined as (2.1) More generally, a scale is a function S : R n → R such that S(αy) = |α| S(y).Scales are typically chosen so that they are less sensitive to outliers than the standard deviation.The robust statistics literature focuses on scales that have a positive breakdown point: the value of the scale cannot be arbitrarily corrupted by nefariously chosen observations, so long as the fraction of bad observations in the entire data set is small.Although the mean absolute deviation has a breakdown point of zero, it exhibits more efficient behavior than the standard deviation under contaminated distributions [43].
2.1.1.Scales for multivariate data.We extend the definition of scales to multivariate data by considering the scale of the data in a given direction.The projection of the rows of X onto the unit direction u is given by the product Xu.Note that if X is centered in the sense of Section 1.1, then the projection Xu is also centered by linearity.We define the scale of X in the direction u to be the scale of the projected data S(Xu).
As noted in [24], this definition is equivariant under an orthogonal change of basis: for any Q with Q * Q = I, the scale of X in the direction u is equal to the scale of XQ * in the direction Qu.

2.2.
Projection-Pursuit PCA.Classically, the top principal component is defined as the direction where the empirical standard deviation in the data is largest: std(Xv). (2.2) A natural approach for finding robust components is to replace the standard deviation in (2.2) with a robust scale S(•), so that the robust component is the direction of maximum robust scale S(Xv).
We define further robust components inductively by adding orthogonality constraints: This greedy method of constructing orthogonal components based on robust scales goes by the name projection-pursuit PCA.This scheme was originally proposed by Huber [24, p. 203], but was first studied in detail by Li and Chen [26].PP-PCA reduces to PCA when the scale is given by the standard deviation due to the variational characterization of eigenvectors by Courant and Fischer.
To implement the PP-PCA method, one only needs a method that finds the first component.We discuss how to enforce the orthogonality constraints in Section 2.6.1.
1 One usually defines scales so that they are unbiased estimates of the sample standard deviation when the data is drawn from a normal distribution.We are more interested in the direction of maximal scale rather than the value, so we can safely ignore the normalization factor.
2.3.PP-PCA with the MD Scale is NP-Hard.Finding the top principal component is an eigenvector problem that amounts to computing the direction where the norm • 2→2 is achieved.Similarly, PP-PCA with the MD scale amounts to finding a vector that achieves an operator norm.Indeed, the problem v MD = arg max v 2 =1 Xv 1 is equivalent to the problem find v MD 2 = 1 such that Xv MD 1 = X 2→1 . (2.4) Unfortunately, exchanging the 2 norm for the 1 norm leads to an NP-hard computational problem.To see this, we require the following result, which we establish in the Appendix.
Rohn [39] shows that there exists a class of well-conditioned positive matrices M such that the existence of a polynomial-time algorithm for accurately computing M ∞→1 for all M ∈ M implies P = NP.Since we can factor positive matrices M = RR * in polynomial time using, for example, a Cholesky factorization, the existence of an accurate polynomial-time algorithm that computes R 2 2→1 for any matrix R implies that P = NP.The observation that Equation (2.3) is NP-hard to solve for the specific choice S(•) = • 1 has serious implications for existing PP-PCA algorithms.The algorithms available in the literature for PP-PCA [9,11,26] are general schemes that claim to work for any choice of scale S. As a result, none of these algorithms can provide both accurate and efficient solutions to the PP-PCA problem.This issue is not merely theoretical because these algorithms tend to perform poorly in practice.We discuss this point further in Section 4.1.

2.4.
Approximating the 2 → 1 Norm using Randomized Rounding.Although it is NP-hard to compute the 2 → 1 norm, it is possible to approximate its value efficiently.This fact is a consequence of the little Grothendieck theorem [36,Sec. 5b], but the algorithm depends on ideas of Nesterov [34], a technique of Burer and Monteiro [4,5], and a new factorization step.

2.4.1.
The semidefinite relaxation of the 2 → 1 norm.Before describing our algorithm, we begin by showing how the computation of 2 → 1 operator norm can be relaxed to a semidefinite program.First, apply Fact 2.1 to change the computation of the 2 → 1 norm to the computation of the ∞ → 1 norm: The second identity above follows from the proof of Fact 2.1; see also [39,Prop. 1].Interpreting the quadratic form on the right hand side of (2.5) as a trace implies that X 2 2→1 is the optimal value of the (non-convex) program maximize trace(XX * Z) (2.6) Relaxing the rank one constraint Z = yy * to a positive-semidefinite constraint Z 0 leads to the SDP maximize trace(XX * Z) It follows that X 2→1 ≤ α , where α 2 is the optimal value of (2.7).Moreover, Grothendieck's inequality for positive-semidefinite matrices implies that where this inequality is asymptotically the best possible [2, Sec.4.2].Thus, α is within a factor of π/2 < 1.26 of the true value of the norm X 2→1 .

Algorithm 1: Maximum Mean Absolute Deviation Rounding
Input: An n × p matrix X; repetition count K. Output: A p × 1 unit-norm vector v and an optimal value α . ( Set α to be the square root of the optimal value: 2.5.The MDR Algorithm.The fact that equation (2.7) gives us a good upper bound on the value of X 2→1 is of secondary importance.We would prefer an approximation for v MD in (2.4), that is, a vector v with v 2 = 1 such that Xv ≈ X 2→1 .We accomplish this goal via a randomized procedure that rounds an optimal solution Z to (2.7) back to a vector v .The entire procedure is detailed in Algorithm 1.
The first step of the algorithm solves the SDP relaxation (2.7).In Step 2(a), we draw a random y ∈ {±1} n with E XX * y 1 = 2α 2 /π.This procedure is well understood [34].The method in Step 2(b) that we use to compute v from y is novel, and it requires a proof of correctness, which appears in the Appendix.By choosing the best random outcome, Step 3 limits the probability that our method fails to provide a reasonable approximation.
The following theorem describes the behavior of Algorithm 1.
In Theorem 2.2, it may be more natural to specify a failure probability δ > 0 and approximation loss ε = 1 − θ > 0 instead of a repetition number K. In this case, simple algebra shows that Xv 1 > (1 − ε) 2/π X 2→1 except with probability δ, so long as In particular, the choice K = 94 implies that Xv 1 > 0.75 X 2→1 with probability at least 0.999.We use the approximation ratio ρ = Xv 1 /α to measure the quality of the optimal solution in Section 5.Although Theorem 2.2 only guarantees that we can make ρ as close to 2/π > 0.79 as we desire, in practice we typically see a 0.95 approximation ratio or higher.This observation does not indicate that the analysis of the algorithm is loose; it follows directly from [2, Sec.4.2] that this bound is asymptotically tight for a class of examples as n → ∞.
2.6.Implementation of Algorithm 1.For a fixed iteration count K, the complexity of Algorithm 1 is typically dominated by Step 1.When applied to (2.10), modern interior-point methods are guaranteed to compute the optimal objective value α and optimal point Z accurately in polynomial time.The factor R is determined using a Cholesky factorization of Z .In practice, interior-point methods are very slow for large-scale problems, so we prefer an algorithm of Burer and Monteiro [5].
The algorithm of Burer and Montiero never forms the semidefinite matrix Z; rather it operates directly with the factor R. We express the objective function of (2.10) in terms of R as trace(RR * XX * ) = X * R 2 F .The constraints [Z] ii = 1 are equivalent to constraints on the rows of R of the form r i 2 = 1.
We implicitly enforce these row constraints by incorporating them into the objective function as in [4,Sec. 4.2].The resulting unconstrained, nonconvex optimization problem takes the form where N (R) denotes the operator that normalizes the rows of We then apply a conjugate gradient algorithm to maximize the unconstrained objective in (2.11).Our particular implementation uses the algorithm of Hager and Zhang [22], which we have found to work well in our experiments.We refer to our online code for the choice of parameters in this conjugate gradient algorithm [31].
This factorization technique for solving (2.10) is advantageous because it reduces the dimension of the problem.The paper [5] shows that restricting R to be an n × k matrix for k = O( √ n) suffices to solve this problem exactly.To be precise, when k = (1 + √ 9 + 8n)/2 any local minimum R ∈ R n×k of (2.11) gives a global minimum Z of (2.10) via the map Z = R R * , provided a mild technical condition2 holds.2.6.1.Orthogonal Restriction.Algorithm 1 only approximates the first principal component in (1.2).In order to approximate the kth robust principal component for k > 1, we define a new matrix X k by restricting the rows of X to the subspace perpendicular to the span of v 1 , . . ., v k−1 .Ignoring numerical stability, we can inductively define which ensures each row of X is orthogonal to the previous components v j for j < k.We then apply Algorithm 1 to the restricted matrix X k to produce the component v k .Since the output v of Algorithm 1 is a linear combination of the rows of the input matrix by Step 2(b), this iterative procedure ensures that v k is perpendicular to the previous components.
In practice, the implementation can be done using Householder reflections as in [11]; see [42] for further background on the implementation of Householder transformations.Householder reflections are more numerically stable than the naïve method (2.12).Moreover, they take full advantage of the fact that we are only searching over a p−k+1 dimensional subspace by reducing the dimension of X k to n × (p − k + 1).

2.7.
Extending the Rounding to Multiple Components.We have also attempted to extract a collection of robust components simultaneously by solving a single semidefinite program.That is, we would like to solve the problem maximize where δ ij is the Kronecker delta function.When T = 1, equation (2.13) is equivalent with (1.2).When T > 1, the restriction v i , v j = δ ij ensures that the optimum occurs at an orthogonal set of unit vectors.We can rephrase this optimization problem by the equivalent quadratically constrained quadratic program maximize The diagonal restrictions on w i ensure that w i ∈ {±1} n for each i = 1, . . ., n.The nonconvex problem (2.14) can be approximated via a semidefinite relaxation proposed in [33].The results of [41] imply that the optimal value of this relaxation is guaranteed to be larger than the optimal value of (2.13) by no more than a logarithmic factor.The rounding procedure does not produce orthogonal vectors, so we need to apply an additional orthogonalization step to achieve feasibility for (2.13).Empirically, we have found that the orthogonalization increases the objective value over the standard rounding, so it appears that there is no loss in applying this procedure.

Algorithm 2: Low-Leverage Decomposition
Input: An n × p data matrix X; desired number of principal components T .Output: A p × T matrix V with orthogonal columns.
(3) Set V to the first T columns of V , that is, set . ., p, and j = 1, . . ., T.
Unfortunately, this method does not appear to be competitive with the projection pursuit method.The vectors we find by coupling Algorithm 1 with the orthogonal pursuit of Section 2.6.1 are feasible for (2.14) and typically provide a larger objective value than rounding coupled with post-processing orthogonalization.A better rounding procedure for this type of relaxation may prove more effective than the projection-pursuit approach; this is a direction for further research.

The Low-Leverage Decomposition
Our second method is derived from the interpretation of principal component analysis as a matrix approximation problem.When the observations are drawn from a highly correlated family, the singular values of the data matrix X tend to decay rapidly.If this is the case, then the matrix X is well approximated by a low-rank matrix P .
It is rare that a large data set can be compiled without error, but it is often the case that the errors only affect a subset of the observations.We can model these errors through a multipopulation model.Suppose that the bulk of the observations is well-explained by a low-rank model while the remainder come from another population or are corrupted by measurement noise.A prudent approach to robust principal component analysis would first separate the corrupted data from the uncorrupted data before attempting to recover a low-rank model.When the corrupted rows are unknown, this task may seem daunting.
To accomplish this task, we propose a semidefinite program that decomposes the input X into two matrices: minimize (P ,C) The norm P * 2→2 is the sum of the singular values of P and is known to promote low-rank solutions [15], while C * 2→∞ is the sum of the 2 norms of the rows of C and promotes group sparsity [37].
We call the optimal matrix pair (P , C ) for the problem (3.1) the low-leverage decomposition (LLD) of X; we can interpret C as an identified corruption and P as a surrogate for the uncorrupted observations.We define our robust components as the right singular vectors of the surrogate matrix P .The detailed procedure appears in Algorithm 2. We show in Section 3.1 that our recovered data matrix P has the additional property of being a low-leverage set of observations.
The LLD formulation is related to recent proposals [6,7], and we discuss this point more in Section 4.2.
As we were preparing this manuscript, we became aware of the independent work [46,47] which also considers (3.1) for the robust PCA problem.This work shows that, under certain hypotheses, the recovered low-rank data P has the same row-space as the true data and the corrupted rows are correctly identified.
3.1.Low-Leverage by Duality.In this section, we demonstrate that (3.1) extracts a lowleverage model for the data.This result follows from duality arguments that characterize the optimum of the convex program.Lemma 3.1 (First-order optimality conditions for (3.1)).A feasible pair (P , C) is optimal for (3.1) if and only if there exists a matrix Q such that Before continuing, we introduce another fact concerning the subgradient of unitarily invariant norms.Let P = U ΣV * be the compact SVD of P .It follows from [44] that (3.3a) implies Q = U V * + W , where, in particular, U V * W = 0.
3.1.1.Leverage scores.The leverage score of the observation x i corresponding to the ith row of X is given by the number [H] ii , where H = X(X * X) † X * is the orthoprojector onto the column space of X.We refer to H as the hat matrix in accord with common statistical practice.A large leverage score tends to indicate that the corresponding observation lies outside of the bulk of the data, although it does not necessarily indicate that the point is influential in linear regression.We refer to [32,Ch. 6] for further discussion of leverage scores.
The following theorem shows that the leverage scores of our decomposition are bounded above by γ 2 , justifying the terminology low-leverage decomposition for the solution of the program (3.1).Theorem 3.2.Suppose (P , C ) is an optimal point of the program (3.1).Then the diagonal elements of the hat matrix H = P (P * P ) † P * are bounded above by γ 2 .
Proof.From the characterization of the subgradient of unitarily invariant norms [44] discussed above, we know that Q = U V * + W with U V * W * = 0. Thus, where the last equality can be easily checked using the definition of H and the SVD of P .Since the diagonal entries of a positive-semidefinite matrix are nonnegative, this relation implies [H] ii ≤ [QQ * ] ii .Recall that the 2 → ∞ operator norm is the maximum 2 row norm of the matrix.Thus relation (3.3b) of Lemma 3.1 implies that [QQ * ] ii ≤ γ 2 , which completes the proof.
We can view our proposal as a method of decomposing a data matrix X into a component with a (user-specified!) upper bound on the leverage plus an error term.Moreover, this result gives a statistical interpretation to the regularization parameter γ in (3.1).
We note that while our program guarantees a low-leverage decomposition, an assumption of suitably small leverage is a technical hypothesis in other works, e.g., [6, eq. (1.2)].
The reader should be warned that this method does not necessarily produce a low-leverage solution if we use our program to identify outlying data and then "prune" the rows.That is, suppose (P , C ) is an optimal point of (3.1) and c i = 0 for row indices i ∈ I. Then the corresponding matrix X I = P I does not necessarily have leverage scores bounded above by γ 2 .

3.2.
The Choice of γ.In this section, we study how the value of the regularization parameter γ affects the properties of the decomposition.
We begin by showing that, when γ ≥ 1, the degenerate solution (P , C ) = (X, 0) minimizes (3.1).This claim follows by explicit construction.Let U ΣV * be the compact SVD of X, and define Q = U V * .Clearly Q, X = X * 2→2 , so Q satisfies (3.3a) with P = X.By construction, the maximum singular value of Q is bounded above by one.Equivalently, QQ * I.This inequality implies [QQ * ] ii ≤ 1.Since the diagonal entries of QQ * are the squared row norms of Q, we have shown that Q 2→∞ ≤ 1 ≤ γ.This bound demonstrates that Q satisfies (3.3b) with C = 0, which certifies optimality of this degenerate solution by Lemma 3.1.
We now show that the regularization parameter γ gives an upper bound on the rank of the optimal P .It is easy to show using the SVD of P that the trace of the hat matrix H defined above is equal the rank of P .Since [H] ii ≤ γ 2 by Theorem 3.2, we must have rank(P ) = trace(H) ≤ nγ 2 . ( The rank is a positive integer, so γ < 1/ √ n implies that the optimal P is trivial.Moreover, in order to get T meaningful components in Step 2 of Algorithm 2, we require rank(P ) ≥ T .Thus, we can limit ourselves to situations where γ ∈ [ T /n, 1].Inequality (3.5) has implications for the numerical solution of (3.1).As we discuss in Section 3.3, the bulk of the computation comes from computing an SVD at each iteration.When the solution of the optimization problem has low rank, the iterates also tend to have low rank.This allows us to save significant computational effort by computing partial singular decompositions at each step.A judicious choice of γ can increase the performance of our algorithm immensely.We find that taking nγ 2 ≈ T 2 is a useful heuristic for achieving a rank-T optimal solution, so long as n T 2 .On the other hand, typical statistical data does not show true low-rank behavior even when there are no outliers.Therefore, forcing the optimal decomposition to be low rank typically results in a dense corruption C .This effect may be mitigated somewhat by another formulation we discuss briefly in Section 3.4.In practice we find that setting γ somewhat less than p/n, say γ = 0.8 p/n, provides a very good low-rank model, but it does poorly in the context of outlier identification.We discuss specific parameter choices for our experiments in Section 5.

3.3.
Computing the Low-Leverage Decomposition.Although general-purpose semidefinite programming software such as CVX [19,18] can solve small instances of (3.1) efficiently, the interior-point methods they utilize may be unable to complete even a single iteration of a largescale problem.This observation indicates that we need to use different methods for large-scale problems.
To solve (3.1), we recommend an alternating direction augmented Lagrangian algorithm analogous to the one used in [6]; see also [27].The generic form of the method is known as the Augmented Lagrangian Method of Multipliers (ALMM).The augmented Lagrangian for (3.1) with dual variable Q is given by For an initial starting point P 0 , we alternately solve P k+1 = arg min P L µ (P , C k , Q k ) and We then update the multiplier by the feasibility gap The minimizations above have an explicit form in terms of shrinkage operations [8] where RowShrink(A, ν) soft-thresholds each row a i of A: where [x] + = max{x, 0}.Similarly SpecShrink(A, ν) soft-thresholds the singular values of A SpecShrink(:, ν) : where the operator [•] + is applied element-wise.We initialize the algorithm with P 0 = 0 and set the parameter µ = np/ X * 2→∞ .We stop the algorithm when the iterates are nearly feasible, that is, The main computational difficulty when running this algorithm involves computing the spectral shrinkage operator.When the iterates P k are low rank, we can save significant computational effort by performing only partial singular value decompositions [27].We can leverage our analysis in Section 3.2 to ensure that the optimal P is low rank.Since the algorithmic iterates tend to be low-rank in this case, we can significantly improve the performance of our algorithm by choosing γ to limit the rank of the optimal solution.In practice, we have found that one should set the quantity nγ 2 somewhat larger than the desired rank of the solution, e.g., nγ 2 ≈ T 2 when we desire a rank-T solution.
3.4.Extensions for a Noisy Model.We note that there is an obvious extension of the LLD when one wants to account for an additional of noise in the model.Suppose that in addition to gross corruptions of certain observations, we would also like to model small corruptions or noise that may be spread throughout the data.
Instead of enforcing the equality X = P + C, we allow for some additional slack of the form X − P − C F ≤ η, where η is an estimate for the noise level.That is, we solve the problem minimize When η = 0, this is equivalent to our proposal (3.1) for the gross corruption model.Other loss functions are also possible.Note that the Frobenius norm remains invariant under a rotation on the right, which is a feature of (3.1) that we would like to preserve.This formulation is also studied in the independent work [46,47].It is shown there that under some technical conditions, the decomposition from (3.9) results in a decomposition where P is close to a matrix with the same row-space as the true observations, and the matrix C is close to a matrix that correctly identifies the column support of the corruption.

Previous Work
This section describes previous work on robust formulations of principal component analysis.Convex approaches to robust PCA are unusual, and, as a consequence, many other attempts at robust PCA lack rigorous algorithms.Often, proposals are put forward with a mathematical formulation and only a heuristic algorithm-or an algorithm without a clear mathematical formulation.
In Sections 4.1 and 4.2, we describe the two methods in the literature most closely related to our proposals.We then describe in detail an approach for robust PCA recommended by Maronna [29] with which we provide comparisons in Section 5. We conclude with a short overview of other robust PCA proposals that have appeared in the literature.4.1.Antecedents for MDR: Projection Pursuit PCA.Our MDR proposal is a particular instance of an approach that has come to be known as projection-pursuit PCA (PP-PCA), as we discuss in Section 2.2.The theoretical properties of PP-PCA are well understood; see for instance [12] and [11].
All of the algorithms we have found in the literature for computing PP-PCA are meant to operate with an arbitrary scale.In view of the fact that the PP-PCA problem is NP-hard, it is unsurprising that the literature appears to contain no PP-PCA algorithms with proofs of correctness and tractability.Indeed, we have been unable to find other work that recognizes that the PP-PCA problem is intractable in a rigorous sense.
The original study of Li and Chen [26] uses a Monte Carlo approach that was found to be computationally expensive.In theory, even simple Monte Carlo methods (e.g., randomly sampling the unit sphere) can produce arbitrarily good solutions to problem (2.2) with an arbitrary (continuous) scale.Given the computational hardness of the problem, it is unlikely that Monte Carlo approaches can provide guarantees of computational efficiency.
Current algorithms for PP-PCA rely on heuristics.A popular and fast algorithm for generic projection-pursuit PCA is the finite direction method (FDM) of Croux and Ruiz-Gazen [11].This technique replaces the search over the entire unit sphere v 2 = 1 with a finite search over the directions that appear among the observations: v ∈ {x 1 / x i 2 , . . ., x n / x n 2 }.The hope is that directions of large scale are likely to be well approximated by directions appearing in the data.This heuristic to performs poorly when n and p are large because it takes an extremely large number of points to cover a high-dimensional sphere.

4.2.
A convex approach.Recently, a method of Chandrasekaran et al. [7] has been adapted for robust PCA in [6].This approach attempts to decompose the data matrix into a sum of a low-rank matrix and a sparse matrix via the semidefinite program minimize The nuclear norm • * 2→2 promotes low rank and the matrix 1 norm • * 1→∞ promotes sparsity.We refer to this method as N + L1.The works [6,7] provide conditions under which N + L1 succeeds in exactly recovering a low-rank and sparse component.
This convex approach is principled in the sense that the mathematical formulation is also algorithmically tractable.On the other hand, it lacks an invariance to a change in the observation basis possessed by all other methods we discuss, including standard PCA.That is, applying a rotation U * U = I to the data X = XU does not result in a similar rotation of the decomposition due to the fact that the norm • * 1→∞ is not invariant under this transformation.One may argue that this invariance is inconsequential: in real data, the particular choice of coordinates has a meaning and outliers may occur coordinate-wise.This argument is defensible in domain specific examples, such as image data that contain specularities [6].Nevertheless, PCA is intended to locate a coordinate basis that explains data more effectively than the standard basis [23].If this is the analytical goal, basis invariance is indeed a requisite property.See Section 5.1.2for an experiment where this lack of orthogonal invariance in N + L1 appears to produce unnerving results.4.3.Spherical PCA.Another approach, known as spherical principal components (sphPCA) [28], rescales the observations to unit (Euclidean) norm and applies standard PCA to this modified data.To implement the sphPCA method, we first compute a normalized matrix X.Each row of X is the normalized version of the corresponding row of the centered data matrix X, that is x i = x i / x i 2 .Using the row-normalization operator from (2.11), we can express the normalized matrix as X = N (X).
The robust components are then defined as the standard principal components of the rescaled matrix X.Since all of the observations from the normalized matrix X have norm one, there are no large magnitude observations that exert an undue influence on the principal components.
A study by Maronna [29] shows that sphPCA enjoys good practical performance.The ease of implementation and relatively good behavior of sphPCA leads Maronna to suggest it as the default choice for robust principal component analysis.As a result, we use sphPCA as a baseline comparison for the performance of our robust methods in Section 5.

4.4.
Other proposals.Some of the earliest methods for robust PCA compute approximations of correlation or covariance matrices using robust methods.Gnanadesikan and Kettenring propose direct robust estimation of the covariance matrices through robust estimation of the individual entries [17].This may lead to counterintuitive results such as non-positive covariance matrices.An alternative approach explicitly enforces positive matrices as minimizers of a functional such as an M -estimator [14]; see also the more recent study [10].
A representative example of robust PCA from the machine learning community is the work of De La Torre and Black [13].They define the robust components as the minimum of a highly non-convex energy function and attempt to minimize this energy function using an iteratively reweighted least-squares algorithm coupled with an annealing step.No theoretical guarantees of correctness for the algorithm are provided.
Another recent approach appears in the paper [45] of Xu et al.This algorithm randomly removes observations that appear to have high influence in the current estimate of the principal components.The principal component estimate is computed from the trimmed data.Xu et al. are able to establish strong theoretical properties of their algorithm, including a high breakdown point in the high-dimensional scaling regime where n → ∞ and n/p → c > 0.

Numerical Experiments
This section provides some numerical examples comparing our proposals with standard PCA and other robust PCA methods in the literature.In Section 5.1, we look at the projection of two data sets on the top robust component.Section 5.2 repeats a multiple-component experiment of Maronna [30] with additional robust methods.Section 5.3 contains a larger experiment, where we calculate the first two components of a dense matrix with more than twenty million entries.
All of these experiments and algorithms are implemented with Matlab.Following the principle of reproducible research [3], we provide code that reproduces the exact experiments in this work [31].
5.1.Projection onto the top component.In this section, we study the robust component methods applied to two data sets.The first set is a selection of environmental factors that may affect the concentration of nitrogen dioxide around Oslo, Norway.The second example is constructed from standard iris data.In each case, we examine the spread of the data in the direction of the top robust component.x i − µ 2 . (5.1) Maronna [30,Ch. 9 ] gives a method to solve this convex problem for µ.
We project the data onto the top component for each method and compare the performance of the methods by the interquartile range (IQR), that is, the distance between the 25th and 75th percentile of the projected data.
We apply extract the dominant component from each data set using our methods (MDRand LLD), other robust methods (sphPCAand N + L1), and standard PCA.For MDR, we use K = 94 rounding trials as discussed in Section 2.4.We set the LLD weight parameter γ = 0.8 p/n.As recommended in [6], we set the N + L1 parameter λ = 1/ √ n for the first experiment.With the iris data in Section 5.1.3,we find that λ = 1/ √ n gives a trivial result: no outliers were identified by N + L1.Instead, we use the more favorable choice λ = 0.3/ √ n.
5.1.2.Norwegian nitrogen dioxide data.Our data for this experiment consists of 500 observations of eight environmental factors around Oslo, Norway, available on the Statlib archive [1].The variables include the log-concentration of nitrogen dioxide (NO 2 ) particles, the number of cars per hour, and the wind speed, as well as several additional factors useful for predicting the concentration of NO 2 particles.We calculate the top component of the data using each method.In Figure 1 we plot the projection of the data onto the direction of these components using a standard box-and-whisker plot.The whiskers extend either 1.5 times the IQR beyond the edge of the box or to the extreme data point.We consider points that lie beyond the whiskers outliers.We give the percentage of outliers and several order statistics of the data in Table 2.
Every robust method results in a larger IQR than PCA.The MDR component finds the largest IQR, and the LLD method finds the smallest IQR among the robust methods.Except for N + L1, every method identifies a direction with a relatively large number of outliers, which indicates that the data has heavy tails.
The N + L1 method is unique because it does not identify a direction of large spread outside of the middle 50% of the data.We have observed that a random change of the observation basis causes the N + L1 component to perform similarly to the LLD component.By orthogonal equivariance, the results for methods other than N + L1 are unchanged by a change in the observation basis.This indicates that the behavior of the results given by the N + L1 component is due to the lack of orthogonal equivariance.
We note that the approximation ratio for the top MDR component is near optimal at 0.978.[16] in this experiment.The data contains 60 observations from three different species of iris: Iris setosa, Iris virginica, and Iris versicolor.Each observation consists of four measurements, namely sepal length, sepal width, petal length, and petal width.Fifty of the observations come from the setosa flowers.We corrupt these observations with 5 measurements of Iris virginica and five measurements of Iris versicolor.We hope that robust principal components identify a direction of large spread in the setosa bulk of the data.
As a baseline comparison, we also calculate the dominant principal component of the setosa population without the outlying flowers.
As in Section 5.1.2,we project the data onto the direction of the dominant components.These points are plotted in Figure 2; we distinguish the bulk setosa points from the versicolor and virginica observations.We compute an approximate density of the setosa observations by convolving the projected data with a unit volume Gaussian kernel of width σ = 0.2.Table 3 gives some order statistics of the projections.The dominant component of LLD, sphPCA, and N + L1 each achieves an IQR at least 3 times that of PCA.These components do not clearly distinguish among the three populations, indicating that these methods are insensitive to the effect of the outliers.LLD and sphPCA appear the most effective in this situation; indeed, it appears that LLD and sphPCA perform as well as setosa-only PCA.
Although MDR results in the most modest IQR in the setosa among the robust methods, the IQR associated with the MDR component is 1.95 times the IQR of the setosa family along the dominant PCA component.Unlike the other robust methods, the MDR component discriminates among the three distinct populations.While it is clear that MDR does not reject the influence of the outliers, MDR balances the influence of outliers and the bulk of the data better than PCA.In this experiment the optimality ratio for MDR is 0.9975, certifying that the MDR component is essentially the direction of maximum mean deviation in the data.5.2.Regression Surface for Bus Data.In this experiment, we construct a regression surface using multiple components.A point is well described by a surface if its Euclidean distance from the surface is small.The dominant T classical principal components span a T -dimensional regression surface such that the sum of the squared distances of the observations to the plane is minimized.We would hope that robust components describe the bulk of the points better than standard components when outliers contaminate the data.We illustrate this behavior with an experiment of Maronna et al. [30, p. 214], which we augment with additional robust methods.5.2.1.Experimental setup.Our data consists of p = 18 geometric features collected from n = 218 bus silhouettes [40] that we arrange into an n×p matrix X.Following Maronna et al., we remove the 9th variable from the data and divide the columns of X by their median absolute deviation (MADN), a robust measure of scale defined as MADN(x) = median(|x − median(x)|).
We then center the observations by their Euclidean median.We compute the top three components using PCA, MDR, LLD, sphPCA, and N + L1.We take the LLD parameter γ = 0.8 n/m, the N + L1 parameter λ = 1/m, and the rounding count of MDR K = 94.
For each method, we determine the Euclidean distance from each observation to the orthogonal regression plane spanned by the dominant three components.In Figure 3, we plot the ordered distances to the robust hyperplanes against the ordered distances to the PCA hyperplane.
Since the PCA regression surface minimizes the sum of squared distances to the observations, not all of the observations can lie below the 1:1 line.However, a large number of points below the 1:1 line indicates that a robust regression surface explains the bulk of the data better than the classical surface.
5.2.2.Discussion. Figure 3 focuses on the third and fourth quantiles of the data; the first and second quantiles roughly follow the pattern apparent in the third quantile.For clarity, we omit the three most outlying points that would appear in the upper right corner of the figure.Each robust method results in a regression surfaces that explains the data better than PCA for more than 75% of the points.In the third quantile, both N + L1 and sphPCA lose their explanatory advantage over PCA.It is not until the after 95% of the data that MDR and LLD provide worse explanations than PCA.LLD is the dominating method through the latter part of the data.MDR explains the bulk of the data less effectively than the other robust methods, yet the final outlying observations are explained better by MDR than the other methods.This indicates that MDR is more sensitive to outlying points than the other robust methods, but is less sensitive to outliers than standard PCA.The optimality ratios for the first three MDR components are, respectively, 0.99999, 0.99992, and 0.97253, implying that MDR essentially succeeds in PP-PCA with the MD scale for this data.
Finally, we note that changing the N + L1 parameter to λ = 2 1/n results in performance similar to LLD. 5.3.Movielens.We finish this section with a larger example: the million-rating movielens data [21].The data consist of 6040 users rating and 3952 movies, though several movies are Table 4.Most important movies as given by the first standard principal component.The decimal numbers represent the weight each component puts on a movie.The integer to the right of the weight is the rank of the movie under the given component.replicated.The set contains just over one million ratings.Each rating is between one and five stars, and each user in the data set rated at least 20 movies.We arrange these responses into an n = 6040 by p = 3952 matrix X whose rows correspond to the users and whose columns correspond to the movies.We set unrated movies to the user's median rating, and center each user's ratings by their personal median.As with our other experiments, we center the rows by the Euclidean median, which results in a dense matrix with nearly 24 million entries.

Movie
We then compute the top two components using PCA, MDR, LLD, and sphPCA.In order to speed up processing for LLD, we set γ = 100/p.As discussed in Section 3.3, this choice of γ limits the rank of the iterates P (k) in the ALMM algorithm, which allows us to compute a partial SVD at each step.Our choice γ = 100/n results in iterates whose rank is roughly 10; the rank of the optimal point P is nine.
Each component v represents a direction in movie coordinates.The magnitude entry [v] i indicates how much v points in the direction of movie i.We use these magnitude of the entries in the components to rank the movies.We call movies with large magnitudes "important," and we call the corresponding entry of the component a movie's "importance."5.3.1.Discussion.Table 4 displays the five most important movies identified by the first standard principal component, along with the importance and rank calculated assigned to these movies by the robust components.Each method agrees that the violent mobster movie GoodFellas is the most important film.Indeed, GoodFellas, Army of Darkness, A Little Princess, and Stand by Me are ranked in the top five movies by every method.However, PCA ranks Pushing Hands much higher than the robust methods.
In Table 4, each importance has positive sign.For each method, the first component assigns very few movies a negative importance for the first component.This fact comes about because the typical user rating is positive; that is, the sum j [x i ] j is greater than zero for most users.
Table 5 displays the results for the second components.Each robust component views Mommie Dearest as the most important movie, while standard PCA relegates it to fifth place.Neither Fried Green Tomatoes nor Unforgiven are among the top five movies for the robust methods.With the second component, sphPCA takes the most dramatic shift away from PCA, with only Mommie Dearest making it into the top ten movies.
Of course, rankings are not the whole story.The signs are very consistent 3 between methods.Mommie Dearest is negative for every method considered and Fried Green Tomatoes is positive.The sign consistency indicates that these components are measuring essentially the same thing.
The magnitude of the importance are also telling.PCA assigns the smallest weight to every movie, with the exception of the second component of sphPCA.This indicates that the robust methods are willing to assign more importance to discriminating movies.

5. 1 . 1 .
Experimental setup.For these experiments, we center the data by removing the Euclidean median from each observation.The Euclidean median µ is a robust estimate of the center of the data, and is defined as µ = arg min µ n i=1

Figure 1 .
Figure 1.Projection of the Oslo NO 2 data set onto top components.The box surrounds the middle 50% of the data.The vertical line in the box is the median of the data.Each whisker extends either 1.5 times the length of the IQR or to the extreme value of the data, and the red crosses beyond the whiskers are the outlying points.The plots are ordered by decreasing IQR.

Figure 2 .Figure 3 .
Figure 2. The projections of the iris data onto the top components.The points are randomly jittered above the zero line for readability.The blue curve represents the approximate local point density of setosa.Note that LLD and sphPCA essentially provide the same projection as PCA without outliers.We sort the plots by decreasing IQR.

Table 1 .
Summary of the norms used in this work.
2→2 Maximum singular value of A Sum of the singular values of A A 2→∞ Maximum 2 row norm of A Sum of the 2 row norms of A A 1→∞ Maximum absolute entry of A Sum of the absolute entries of A

Table 2 .
Statistics for the projected NO 2 data.The last column lists the percentage of points lying outside the whiskers in Figure1.

Table 3 .
Order statistics for the projection of the setosa data onto the top components.The last column lists the number of setosa points further than 1.5 times IQR left of the 25th percentile or the right of the 75th percentile.

Table 5 .
Most important movies: second component.