Sparse Equisigned PCA: Algorithms and Performance Bounds in the Noisy Rank-1 Setting

Singular value decomposition (SVD) based principal component analysis (PCA) breaks down in the high-dimensional and limited sample size regime below a certain critical eigen-SNR that depends on the dimensionality of the system and the number of samples. Below this critical eigen-SNR, the estimates returned by the SVD are asymptotically uncorrelated with the latent principal components. We consider a setting where the left singular vector of the underlying rank-one signal matrix is assumed to be sparse and the right singular vector is assumed to be equisigned, that is, having either only nonnegative or only nonpositive entries. We consider six different algorithms for estimating the sparse principal component based on different statistical criteria and prove that by exploiting sparsity, we recover consistent estimates in the low eigen-SNR regime where the SVD fails. Our analysis reveals conditions under which a coordinate selection scheme based on a sum-type decision statistic outperforms schemes that utilize the $\ell_1$ and $\ell_2$ norm-based statistics. We derive lower bounds on the size of detectable coordinates of the principal left singular vector and utilize these lower bounds to derive lower bounds on the worst-case risk. Finally, we verify our findings with numerical simulations and illustrate the performance with a video data example, where the interest is in identifying objects.


Introduction
It is well understood that singular value decomposition (SVD) based principal component analysis (PCA) breaks down in the high-dimensional and limited sample size regime below a certain critical eigen-SNR (eigenvalue signal-to-noise ratio) that depends on the dimensionality of the system and the number of samples [17,4]. Several sparse PCA algorithms have been proposed in the literature (see [17,4,8,21,33,3]) and have been shown to successfully estimate the principal components in the low eigen-SNR regime where the SVD fails.
Prior work in this area primarily considers the Gaussian signal-plus-noise model with random effects, where the signal matrix is assumed to have sparse left singular vectors and normally distributed right singular vectors, and the noise matrix is assumed to have normally distributed i.i.d. entries. Here, we consider the setting where the left singular vector of the rank-one signal matrix is sparse and the right singular vector is assumed to be equisigned. We say that a vector is equisigned if its entries are all non-negative or all non-positive. This is motivated by applications such as diffusion imaging in MRI, where the right singular vector represents a physical quantity (e.g., intensity as the diffusion agent is absorbed by a tissue) that is non-negative; by imaging problems such as foreground-background separation in video data [25,31] and object detection in astronomy [27], where the data are naturally non-negative; and by problems in bioinformatics where the data are (non-negative) counts of genes [30]. When analyzing data that are non-negative, it is logical to take advantage of this property and to investigate how we may use this knowledge to do better than the (generic) alternatives. Additionally, we motivate the rank-1 assumption by noting that for a video with a static background, the foreground is a perturbation of a rank-1 background [22,12]. Finally, even though we do not pursue this angle here, our framework can be extended to the scenario where the signal can be viewed as a rank-1 tensor in which all but one of the representors in the Kronecker product representation of the tensor are equisigned vectors.
A natural question at this juncture is the following: how does our problem differ from that solved by Non-Negative Matrix Factorization (NNMF)? NNMF takes a given matrix $X$ and looks for non-negative matrices $F$ and $G$ such that $X = FG^T$ [15,32]. Ordinary NNMF has no sparsity constraints. We might impose such constraints, as is done in [14] and [20], but except in special cases, these solutions have no known theoretical guarantee of statistical performance. This problem partly stems from the fact that solutions to the corresponding optimization problems may not be unique. In contrast, our problem only constrains the right singular vectors, while the left singular vectors are free to take any sign. The work in [9] extends the NNMF framework to one wherein only one of the factors is non-negative; nevertheless, the rest of the constraints we impose are not included. Hence, NNMF is not an answer to the problem we consider herein.
The main contribution of this paper is a rigorous sparsistency analysis of the various algorithms that brings into focus the very-low eigen-SNR regimes where the new algorithms work and the SVD-based methods provably fail. Additionally, a major novelty of this work is the integration of FDR-controlling (False Discovery Rate) hypothesis testing into the sparse PCA problem.
Our analysis illustrates the situations where the sum-based coordinate selection scheme dramatically outperforms the $\ell_1$- and $\ell_2$-based [17,4] sparse PCA schemes. Additionally, our proposed algorithms are non-iterative, do not require the computation of the sample covariance matrix, and do not require knowledge of the sparsity level. We separate our algorithms into two groups: one where the Family-Wise Error Rate (FWER) is controlled, and another where the False Discovery Rate (FDR) is controlled. We utilize sharp tail probability bounds for relevant statistics to derive our FWER-controlling estimators [6]. For the FDR-controlling estimators, we relate the problem at hand to that of the sparse normal means problem [10].
imsart-ejs ver. 2014/10/16 file: sepca.tex date: May 24, 2019
This paper is organized as follows. In Section 3, we describe three algorithms for estimating the sparse principal component that utilize a coordinate selection scheme based on the sum, $\ell_1$, and $\ell_2$ norm-based statistics, respectively. We call our family of algorithms SEPCA, an abbreviation for Sparse Equisigned PCA. Section 4 proposes three FDR-controlling refinements of the sum- and $\ell_2$-based algorithms in Section 3 by relating coordinate detection to the sparse normal means estimation problem. In Section 5, we show how the estimation performance is governed by the size of the smallest detectable coordinate, which we analyze in Section 6 and validate using numerical simulations in Section 7. In Section 8, we provide some geometric intuition about the relative performance of three of our algorithms. We show that the sum statistic is potentially the most powerful, while the $\ell_1$ statistic is the least powerful. We provide some concluding remarks in Section 9.

Problem Formulation
Let $X \in \mathbb{R}^{p \times n}$ be a real-valued signal-plus-noise data matrix of the form
$$X = \theta u v^T + \sigma G. \tag{2.1}$$
The columns of the $p \times n$ data matrix $X$ represent $p$-dimensional observations. In (2.1), $u$ and $v$ are the left and right singular vectors of the rank-one latent signal matrix, and have entries $u_i$ and $v_j$, respectively. The entries of $G$, the noise matrix, are assumed to be i.i.d. Gaussian random variables with mean 0 and variance $1/n$. We assume that $u \in \mathbb{R}^p$ has unit norm and is sparse in the sense of small $\ell_0$ norm, with $s \ll p$ non-zero entries, where $s/n \to 0$. That is, for a set $I \subset \{1, \dots, p\}$ with $|I| = s$,
$$u_i \neq 0 \ \text{for } i \in I, \qquad u_i = 0 \ \text{for } i \in I^C, \tag{2.2}$$
where $I^C$ denotes the complement of $I$. We further assume $v \in \mathbb{R}^n$ to be of unit norm, deterministic, and equisigned. Given $X$, our goal is to recover $u$ and $v$.
Note that the $(i, k)$ entry of $X$, $X_{ik}$, is a Gaussian random variable with mean $\theta u_i v_k$ and variance $\sigma^2/n$. Moreover, it follows that
$$\mathbb{E}\left[XX^T\right] = \theta^2 u u^T + \sigma^2 I_p,$$
where $I_p$ denotes the $p \times p$ identity matrix. The quantity $(\theta/\sigma)^2$ is, for this model, the eigen-SNR (signal-to-noise ratio).
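To make the model concrete, the following is a minimal simulation sketch of a matrix of the form signal plus noise described above. The function name `simulate_rank_one`, the particular choices of $u$ (equal non-zero entries) and $v$ (normalized absolute Gaussian draws), and all numerical values are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def simulate_rank_one(p, n, s, theta, sigma, rng):
    """Draw X = theta * u v^T + sigma * G with an s-sparse u and an equisigned v."""
    u = np.zeros(p)
    u[:s] = 1.0 / np.sqrt(s)                      # s equal non-zero entries, unit norm (assumed)
    v = np.abs(rng.standard_normal(n))            # non-negative entries, hence equisigned (assumed)
    v /= np.linalg.norm(v)                        # unit norm
    G = rng.standard_normal((p, n)) / np.sqrt(n)  # i.i.d. N(0, 1/n) noise entries
    X = theta * np.outer(u, v) + sigma * G
    return X, u, v

rng = np.random.default_rng(0)
n = 200
X, u, v = simulate_rank_one(p=500, n=n, s=5, theta=3.0, sigma=1.0, rng=rng)
noise_rows = X[5:]                                # rows outside the support of u are pure noise
print(noise_rows.var() * n)                       # empirical check: should be close to sigma^2
```

Rescaling the empirical variance of the noise-only rows by $n$ gives a quick sanity check that the entries have variance $\sigma^2/n$.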

Motivation: Breakdown of PCA / SVD
From [2], we have the following result: let $\hat{u}$ be the estimate of $u$ given by the Singular Value Decomposition (SVD) of $X$, and let $p(n)/n$ have limit $c \in [0, \infty]$ as $n$ grows, with $\theta$ fixed and $\sigma = 1$. Then, with probability 1,
$$|\langle \hat{u}, u \rangle|^2 \to \begin{cases} 1 - \dfrac{c\,(1 + \theta^2)}{\theta^2(\theta^2 + c)} & \text{if } \theta > c^{1/4}, \\ 0 & \text{otherwise.} \end{cases} \tag{2.3}$$
For general $\sigma$, we replace $\theta$ by $\theta/\sigma$ in (2.3). Hence, SVD-based PCA leads to inconsistent estimates of $u$ (and also of $v$, which can be deduced from (2.3)) when the dimension $p$ is comparable to or larger than the sample size $n$. Moreover, in the low eigen-SNR regime, the estimates break down completely. The SVD does not exploit any assumed structure in $u$ and $v$. Consequently, (2.3) holds for arbitrary $u$ and $v$, including our setting where $u$ is sparse and/or $v$ is equisigned. Our goal, in what follows, is to derive consistent estimators for $u$ and $v$ that outperform the SVD by exploiting the sparsity of $u$ and the equisigned nature of $v$.
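The breakdown can be observed empirically. The sketch below compares the squared inner product between $u$ and the leading left singular vector of $X$ at a small and at a large value of $\theta$, with $\sigma = 1$ and $p/n = 2$. The helper name `svd_overlap`, the choices of $u$ and $v$, and the dimensions are assumptions made only for illustration.

```python
import numpy as np

def svd_overlap(p, n, theta, rng):
    """Squared inner product between u and the leading left singular vector of X."""
    u = np.zeros(p)
    u[0] = 1.0                                    # 1-sparse unit vector (assumed)
    v = np.ones(n) / np.sqrt(n)                   # equisigned unit vector (assumed)
    X = theta * np.outer(u, v) + rng.standard_normal((p, n)) / np.sqrt(n)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return float(U[:, 0] @ u) ** 2

rng = np.random.default_rng(1)
p, n = 400, 200                                   # p/n = 2
low = svd_overlap(p, n, theta=0.5, rng=rng)       # weak signal
high = svd_overlap(p, n, theta=4.0, rng=rng)      # strong signal
print(low, high)
```

At the weak signal level the estimate should be nearly uncorrelated with $u$, while at the strong signal level the overlap should be large but still bounded away from 1.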

Proposed Algorithms
We propose six different two-stage algorithms for estimating $u$. The first three algorithms are designed to control the family-wise error rate (FWER), that is, the probability of obtaining a false positive in the coordinate selection. The last three algorithms aim to control the false discovery rate (FDR), that is, the proportion of false discoveries (coordinate detections) among all discoveries. We defer discussion of the FDR-based algorithms to Section 4.
All of the algorithms have the same basic form given in Algorithm 1. Given $X$, we associate a test statistic $T_i$ to each row of $X$. The sparsity of $u$ implies that the majority of the rows of $X$ are purely noise, so that the majority of the $T_i$ come from the null, noise-only distribution. Hence, based on the statistics $\{T_i\}$, we perform a multiple hypothesis testing procedure and select the set $I$ of indices that are non-null. In this way, we can estimate the support of $u$, thereby isolating the rows of $X$ that contain the signal. Then, taking the SVD of this submatrix (comprised of only the selected rows of $X$) yields a better estimate of the non-zero coordinates in $u$, as well as of $v$.
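The two-stage form described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `two_stage_estimate`, the absolute row sum used as the statistic, and the threshold `tau` are hypothetical stand-ins and are not the statistics or thresholds of Table 1.

```python
import numpy as np

def two_stage_estimate(X, stat, tau):
    """Select rows whose statistic exceeds tau, then take the SVD of the selected rows."""
    p, n = X.shape
    T = np.array([stat(row) for row in X])        # one test statistic per row
    I = np.flatnonzero(T > tau)                   # estimated support of u
    u_hat, v_hat = np.zeros(p), np.zeros(n)
    if I.size > 0:
        U, _, Vt = np.linalg.svd(X[I, :], full_matrices=False)
        u_hat[I] = U[:, 0]                        # entries outside I stay at zero
        v_hat = Vt[0]
    return I, u_hat, v_hat

rng = np.random.default_rng(3)
p, n, theta, sigma = 300, 100, 3.0, 1.0
u = np.zeros(p)
u[:3] = 1.0 / np.sqrt(3.0)                        # 3-sparse unit vector (assumed)
v = np.ones(n) / np.sqrt(n)                       # equisigned unit vector (assumed)
X = theta * np.outer(u, v) + sigma * rng.standard_normal((p, n)) / np.sqrt(n)
tau = sigma * np.sqrt(4.0 * np.log(p) + 2.0)      # illustrative threshold, not the paper's
I, u_hat, v_hat = two_stage_estimate(X, stat=lambda row: abs(row.sum()), tau=tau)
print(I, abs(u_hat @ u))
```

Passing the statistic as a function keeps the selection step separate from the SVD step, so that a different row statistic or selection rule can be substituted without changing the second stage.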
We begin by discussing the FWER-controlling algorithms. The work in [17] proposed a covariance thresholding method for sparse PCA called DT-SPCA; this is equivalent to a coordinate selection scheme based on the $\ell_2$ norm-based statistic. In our terminology and with our choice of thresholds, we label it as $\ell_2$-SEPCA. We label the coordinate selection scheme based on the $\ell_1$ norm-based statistic $\ell_1$-SEPCA. Finally, the sum-SEPCA algorithm utilizes row sums of the data matrix.
We shall choose the thresholds $\tau_{n,p}$ for the coordinate selection scheme so that in the noise-only case,
$$P\left(\max_{1 \le i \le p} T_i \ge \tau_{n,p}\right) \le \frac{1}{ep},$$
where $e$ is Euler's number, the base of the natural logarithm. This choice ensures that the probability of a false positive tends to zero as $p \to \infty$. That is, the FWER is asymptotically zero and is bounded by $1/(ep)$ in the finite-dimensional case. Note that the constraint used to control the FWER is simply that the distribution of the noise is log-concave. In the Gaussian case, we obtain the specific expressions summarized in Table 1; however, with knowledge of the moments $\mathbb{E}\,T_i$ and $\mathrm{Var}\,T_i$, we can repeat our analysis and find thresholds for the $\ell_1$- and $\ell_2$-SEPCA algorithms with any log-concave noise distribution.
Table 1: Test Statistics and Thresholds for Algorithm 1
In the noise-only cases, the statistics for $\ell_2$- and $\ell_1$-SEPCA are distributed as scaled $\chi^2_n$ and sums of half-normal random variables, respectively. Both of these quantities are log-concave random variables, so we may apply the result in [19] to set the threshold $\tau_{n,p}$ in both cases.
Defining $K$ to be some absolute constant (we may use $K = e$, as in [5]), we define the constants used in these thresholds.

sum-SEPCA
From Proposition 4.4 of [7], we obtain the threshold for sum-SEPCA given in (3.3). The expression in (3.3) involves Erf, the error function; equivalently, it can be written in terms of the cumulative distribution function of a standard Gaussian random variable, $\Phi(x) = \tfrac{1}{2}\left(1 + \mathrm{Erf}(x/\sqrt{2})\right)$. Moreover, $\tau_{n,p} \le \sigma C_U \sqrt{\log p / n}$ for some constant $C_U$. For a fixed value of $p$, choosing $C_U$ no smaller than an explicit lower bound is sufficient. The choice of $1/(ep)$ is the largest bound justified by Proposition 4.4 of [7], so we have calibrated all of our algorithms to the same constant factor times $1/p$. The thresholds are summarized in Table 1.
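To illustrate what calibrating to a family-wise error of $1/(ep)$ means, the sketch below uses a plain union bound with the Gaussian tail bound $P(Z \ge z) \le e^{-z^2/2}$. It assumes one-sided tests on $p$ null statistics distributed as $N(0, \sigma^2)$; the function names are hypothetical, and the resulting threshold is a simple stand-in rather than the threshold (3.3) obtained from [7].

```python
import math

def union_bound_threshold(p, sigma=1.0):
    """Threshold t with p * exp(-t^2 / (2 sigma^2)) = 1 / (e p)."""
    return sigma * math.sqrt(4.0 * math.log(p) + 2.0)

def fwer_union_bound(p, t, sigma=1.0):
    """Union bound p * P(N(0, sigma^2) >= t) on the probability of any false positive."""
    return p * 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))

p = 1000
t = union_bound_threshold(p)
print(t, fwer_union_bound(p, t), 1.0 / (math.e * p))
```

Because the exponential tail bound is loose, the exact union bound evaluated at this threshold falls below the target $1/(ep)$; a sharper tail bound would permit a smaller threshold at the same error level.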

Controlling the False Discovery Rate
So far, we have controlled the probability of a false alarm when detecting coordinates. However, there are two relevant observations to make. First, under the Gaussian noise, rank-1, and equisigned assumptions, the vector of test statistics $\{T_i\}$ in the sum-SEPCA algorithm looks like a sparse vector plus Gaussian noise (or a vector of $\chi^2_n$-variates with varying non-centralities, in the $\ell_2$-SEPCA algorithm). Second, controlling the false discovery rate, that is, the proportion of rejected nulls that are false positives, can lead to increased detection power relative to controlling the false positive rate. We hence look at FDR-controlling tests for the sparse normal means problem.
That is, given a vector of test statistics (as before), we replace the thresholding and selection in Algorithm 1 with an FDR-controlling selection procedure.
We summarize this change in Algorithm 2. There are three procedures we consider. The first two are known as Higher Criticism, and directly extend the sum- and $\ell_2$-SEPCA algorithms [10,11]. The third is a method for detection in the sparse normal means problem that comes out of complexity-penalized estimation theory for linear inverse problems [18].

Algorithm 2 FDR-Controlling Variable Selection and Estimation Algorithm
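Require: Test statistic $T_i$ from Table 1 and a selection procedure
  Let $I$ be an empty list
  for all rows $i$ of $X$, $1 \le i \le p$ do
    Form the test statistic $T_i$ from row $i$ of $X$
  end for
  Perform an FDR-controlling selection procedure, and add the selected indices to $I$
  Take the SVD of the submatrix of $X$ formed by the rows indexed by $I$; let $\tilde{u}$ be its leading left singular vector
  For $i_k \in I$, let $\hat{u}_{i_k} = \tilde{u}_k$; the other entries of $\hat{u}$ are set to 0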

Higher Criticism
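Assume we have $p$ independent tests of the form
$$H_{0,i}: X_i \sim \mathcal{N}(0, 1) \quad \text{versus} \quad H_{1,i}: X_i \sim \mathcal{N}(\mu_i, 1), \ \mu_i > 0,$$
and assume that at most $p^{1-\beta}$ of the $p$ hypotheses are truly non-null, for some $\beta \in (1/2, 1)$. Further assume that the non-null means have magnitude
$$\mu_i = \sqrt{2 r \log p}$$
for $r \in (0, 1)$. Here, the means will correspond to the coordinate size. Note that the expected maximum of $p$ standard Gaussian random variables is upper bounded by $\sqrt{2 \log p}$, with the bound being asymptotically sharp. If we let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(p)}$ be the sorted p-values of the individual tests, we may define the Higher Criticism statistic:
$$\mathrm{HC}^* = \max_{1 \le i \le p} \frac{\sqrt{p}\,\left(i/p - p_{(i)}\right)}{\sqrt{p_{(i)}\left(1 - p_{(i)}\right)}}.$$
Rejecting the global null hypothesis (that there are no non-null coordinates) when
$$\mathrm{HC}^* \ge (1 + \epsilon)\sqrt{2 \log \log p} \tag{4.1}$$
leads to asymptotically full power when $r$ is greater than some decision boundary $\rho$. Under the global null,
$$\frac{\mathrm{HC}^*}{\sqrt{2 \log \log p}} \to 1$$
in probability as $n, p \to \infty$. The function $\rho$ depends on the sparsity index $\beta$, and as [11] indicate:
$$\rho(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ \left(1 - \sqrt{1 - \beta}\right)^2, & 3/4 < \beta < 1. \end{cases}$$
If we replace the normal distribution with a $\chi^2_n$ distribution, the same results hold for tests of the form
$$H_{0,i}: X_i \sim \chi^2_n(0) \quad \text{versus} \quad H_{1,i}: X_i \sim \chi^2_n(\delta),$$
where $\delta$ is a non-centrality parameter and we consider $r \in (0, 1)$ such that $\delta = 2 r \log p$. That is to say, we form the Higher Criticism statistic in the same manner as for the sum statistic, perform the test with the same threshold, and the form of the decision boundary $\rho$ is identical [11].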
To summarize, taking sums across the rows of $X$, we obtain a vector $y$ where $y_i = \mu_i + \sigma z_i$, with $\mu_i = (\theta u_i)\|v\|_1$: this situation is exactly that of a sparse mean vector embedded in Gaussian noise. Similarly, taking sums of squares across the rows of $X$ yields scaled $\chi^2_n$ distributed random variables, of which only a few have non-zero non-centrality parameters.
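The reduction to a sparse normal means problem can be checked numerically. The sketch below forms row sums and rescaled row sums of squares for a simulated matrix; the dimensions, the non-negative $v$, and the choice of $u$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, theta, sigma = 200, 400, 2.0, 1.0
u = np.zeros(p)
u[:4] = 0.5                                       # 4-sparse unit vector (assumed)
v = np.ones(n) / np.sqrt(n)                       # non-negative unit vector, ||v||_1 = sqrt(n)
X = theta * np.outer(u, v) + sigma * rng.standard_normal((p, n)) / np.sqrt(n)

y = X.sum(axis=1)                                 # row sums
mu = theta * u * np.abs(v).sum()                  # mu_i = theta * u_i * ||v||_1
resid_std = np.std(y - mu)                        # should be close to sigma

q = n * (X ** 2).sum(axis=1) / sigma ** 2         # rescaled row sums of squares
null_mean = q[4:].mean()                          # noise-only rows: chi-square with n d.o.f.
print(resid_std, null_mean)
```

The residual $y - \mu$ behaves like Gaussian noise of standard deviation $\sigma$, and the rescaled sums of squares on noise-only rows have mean close to $n$, as expected for central $\chi^2_n$ variates.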
As a point of interest, the test in (4.1) can be extended to (and potentially strengthened in) the case where the p tests are correlated, i.e., when the additive Gaussian noise has a non-identity covariance [13].
Remark. While Higher Criticism is typically formulated for the case of identical non-null means or parameters (all of the non-zero $\mu_i$ are identical), this constraint is not mandatory [1,13]. Indeed, the results hold without modification for the Gaussian model with non-null means of size $\mu_i = \alpha_i \sqrt{2 \log p}$, where $\alpha_i$ is a non-negative random variable with the property that $P(\alpha_i \le \sqrt{r}) = 1$ and $P(\alpha_i > \sqrt{r} - \epsilon) > 0$ for all $\epsilon > 0$ [13]. The case of a $\chi^2_n$ distribution is similar.
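As an illustration of how a Higher Criticism statistic is computed from sorted p-values, consider the sketch below, which uses the standard form $\sqrt{p}\,(i/p - p_{(i)})/\sqrt{p_{(i)}(1 - p_{(i)})}$. The function names are hypothetical, and restricting the maximum to the smallest half of the p-values (`alpha0 = 0.5`) is a common practical convention that we assume here, not a choice taken from this paper.

```python
import math

def higher_criticism(pvals, alpha0=0.5):
    """Return (HC statistic, 1-based index attaining it) over the smallest alpha0 * p p-values."""
    p = len(pvals)
    srt = sorted(pvals)
    best, best_i = -math.inf, 0
    for i in range(1, max(1, int(alpha0 * p)) + 1):
        q = min(max(srt[i - 1], 1e-300), 1.0 - 1e-16)   # guard against p-values of exactly 0 or 1
        hc_i = math.sqrt(p) * (i / p - q) / math.sqrt(q * (1.0 - q))
        if hc_i > best:
            best, best_i = hc_i, i
    return best, best_i

def one_sided_pvalues(y, sigma=1.0):
    """P-values for H0: y_i ~ N(0, sigma^2) against a positive mean."""
    return [0.5 * math.erfc(yi / (sigma * math.sqrt(2.0))) for yi in y]

null_p = [(i + 0.5) / 100 for i in range(100)]    # evenly spread p-values, as under the null
alt_p = [1e-6] * 5 + null_p[5:]                   # five very small p-values added
hc_null, _ = higher_criticism(null_p)
hc_alt, i_star = higher_criticism(alt_p)
print(hc_null, hc_alt, i_star)
```

In a coordinate selection setting, one natural use of the maximizing index is to retain the coordinates with the `i_star` smallest p-values; this usage is an assumption of the sketch.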

FDR-SEPCA
In this section, we give a summary of the algorithm for uncorrelated noise and defer the general case and details to Appendix D. We continue in the same vein as in the previous section on Higher Criticism.
We note that in the equisigned, rank-1 setting, coordinate selection is equivalent to the estimation of a sparse mean vector. Let $y_i = \mu_i + \sigma z_i$, where $i \in \{1, \dots, p\}$ and the vector $z$ of the $z_i$ is normally distributed with mean 0 and covariance $I_p$. The mean vector $\mu$ of the $\mu_i$ is assumed to be sparse; the goal is to estimate $\mu$. Taking sums across the rows of $X$, we obtain a vector $y$ where $y_i = \mu_i + \sigma z_i$, with $\mu_i = (\theta u_i)\|v\|_1$. Hence, we are in the same setting as in the previous section.
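The following penalized least squares formulation, taken from [18], yields an estimator for $\mu$:
$$\hat{\mu} = \arg\min_{\mu} \left\{ \|y - \mu\|_2^2 + \sigma^2\, \mathrm{pen}(\|\mu\|_0) \right\}, \tag{4.6}$$
where $\mathrm{pen}(k)$ is a complexity penalty on the number $k$ of non-zero coordinates, defined as in [18]. We define $\|\mu\|_0$ to be the number of non-zero coordinates of $\mu$.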
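The solution to (4.6) is given by hard thresholding. Let $|y|_{(i)}$ be the $i$-th order statistic of $|y_1|, \dots, |y_p|$; defining $k$ as the number of retained coordinates and $t_k$ as the associated threshold, the solution is to hard threshold at $t_k$.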
In this set-up, the threshold $t_k$ can be characterized explicitly; we provide a precise quantification of $t_k$ in Appendix D. Hence, by computing $t_k$ and performing hard thresholding of the row sums, we can perform coordinate selection. Once again, this procedure replaces the test statistic/thresholding in Algorithm 1.
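The selection step of a complexity-penalized estimator can be sketched as follows. The penalty used here, $\mathrm{pen}(k) = \zeta k \left(1 + \sqrt{2 \log(\nu p / k)}\right)^2$ with $\zeta > 1$ and $\nu > e$, is an assumed form chosen for illustration; the exact penalty and the threshold $t_k$ of [18] and Appendix D may differ. The function names and default constants are hypothetical.

```python
import math

def pen(k, p, zeta=1.1, nu=3.0):
    """Assumed penalty zeta * k * (1 + sqrt(2 log(nu p / k)))^2, with pen(0) = 0."""
    if k == 0:
        return 0.0
    return zeta * k * (1.0 + math.sqrt(2.0 * math.log(nu * p / k))) ** 2

def penalized_selection(y, sigma=1.0):
    """Keep the k largest |y_i|, with k minimizing residual energy plus sigma^2 * pen(k)."""
    p = len(y)
    order = sorted(range(p), key=lambda i: -abs(y[i]))   # indices by decreasing |y_i|
    tail = sum(val * val for val in y)                   # residual energy when k = 0
    best_k, best_obj = 0, tail
    for k in range(1, p + 1):
        tail -= y[order[k - 1]] ** 2                     # the k-th largest entry is now retained
        obj = tail + sigma ** 2 * pen(k, p)
        if obj < best_obj:
            best_k, best_obj = k, obj
    return sorted(order[:best_k])                        # selected coordinate indices

y = [10.0, -9.0] + [0.1 * (-1) ** i for i in range(48)]
print(penalized_selection(y))
```

Retaining the $k$ largest entries in absolute value is the same as hard thresholding at a level between the $k$-th and $(k+1)$-th order statistics, which is why the minimization reduces to a one-dimensional search over $k$.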

Estimation Error and Smallest Detectable Coordinate
As we will see, our theorems discuss the "detectability" of the coordinates $u_i$ of $u$. However, it is common in the sparse PCA literature to discuss lower bounds for the risk (estimation error) [17,4,21]. In what follows, we will show that these two notions are equivalent.
We define the $L_2$ estimation error for a principal component estimator as
$$L(u, \hat{u}) = \left\| u - \operatorname{sign}(\langle u, \hat{u} \rangle)\, \hat{u} \right\|_2^2. \tag{5.1}$$
The quantity in (5.1) is upper bounded by 2; this bound is attained when $u$ and $\hat{u}$ are unit norm and mutually orthogonal. Following [4], we want to compute a lower bound for the maximum expected loss for the $s$-sparse vectors $u$ (in the sense of $\ell_0$ sparsity), defined as
$$\sup_{u \in S^{p-1},\ \|u\|_0 \le s} \mathbb{E}\, L(u, \hat{u}), \tag{5.2}$$
where $S^{p-1}$ denotes the unit sphere in $\mathbb{R}^p$. Let $I$ be some index set of coordinates selected by an algorithm of the form given in Algorithm 1. We may take $u$ and $\hat{u}$ to be non-negative, and decompose the loss as
$$L(u, \hat{u}) = \underbrace{\sum_{i \in I} (u_i - \hat{u}_i)^2}_{\text{Estimation error from detected coordinates}} + \underbrace{\sum_{i \notin I} u_i^2}_{\text{Error from missed coordinates}}. \tag{5.3}$$
Equation (5.3) shows that the loss is lower-bounded by the sum of the squares of the missed coordinates. Indeed, it is a natural consequence of the result in [2] that if the sparsity $s$ grows slower than does $n$, and we have a consistent estimate of the support of $u$, the estimation error will asymptotically be small. Essentially, we are estimating the singular vectors of an $s \times n$ matrix instead of a $p \times n$ matrix, so that if the ratio $s/n$ has limit zero, our estimates will be consistent (see (2.3) and [2]). This suggests the following strategy for lower-bounding (5.2): we want to construct a non-trivial 'worst-case' sparse vector. That is, we want a vector $u$ that has a non-trivial loss (less than 2), is sparse (at most $s$ non-zero coordinates), and has maximal error from missed coordinates. To ensure a non-trivial loss, we set the first coordinate $u_1$ to be large, i.e., $u_1 = \sqrt{1 - r^2}$, where $r = o(1)$. To ensure sparsity, we set $u_2, \dots, u_{m+1}$ to be non-zero for some $m \le s - 1$, with the subsequent coordinates of $u$ set to 0. Then, the expected loss has the lower bound
$$\mathbb{E}\, L(u, \hat{u}) \ge \sum_{k=2}^{m+1} u_k^2\, P(\text{Not Selecting Coordinate } k), \tag{5.4}$$
since $u_1$ is detected with probability approaching 1 and $u_k$ is zero for $k > m + 1$. Now, let $u_2$ through $u_{m+1}$ all have value $r/\sqrt{m}$, so that we may simplify the lower bound to
$$\mathbb{E}\, L(u, \hat{u}) \ge r^2\, P(\text{Not Selecting Coordinate } k). \tag{5.5}$$
If coordinates of size $r/\sqrt{m}$ are not detected with a probability approaching 1, $r^2$ is a lower bound on the risk. This construction shows that specifying the sizes of coordinates that are not detected with probability approaching 1 is equivalent to specifying a worst-case risk lower bound. Consequently, in what follows we focus on the smallest detectable and largest undetectable coordinates because they directly shed light on the attainable estimation error. The details of the risk calculations and extensions to approximate sparsity are deferred to Appendix C, where we summarize our findings in Theorem 3.
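The worst-case construction can be written down directly. The sketch below builds a vector with first coordinate $\sqrt{1 - r^2}$ followed by $m$ coordinates equal to $r/\sqrt{m}$, and computes the squared mass of the coordinates that a selection misses; the function names and numerical values are illustrative assumptions.

```python
import math

def worst_case_vector(p, m, r):
    """u_1 = sqrt(1 - r^2), then m coordinates equal to r / sqrt(m), zeros elsewhere."""
    if not (1 <= m <= p - 1 and 0.0 < r < 1.0):
        raise ValueError("need 1 <= m <= p - 1 and 0 < r < 1")
    return [math.sqrt(1.0 - r * r)] + [r / math.sqrt(m)] * m + [0.0] * (p - m - 1)

def missed_mass(u, selected):
    """Sum of squared entries of u outside the selected index set."""
    sel = set(selected)
    return sum(x * x for i, x in enumerate(u) if i not in sel)

u = worst_case_vector(p=100, m=9, r=0.3)
total = sum(x * x for x in u)            # squared norm of u
missed = missed_mass(u, selected=[0])    # only the large first coordinate is detected
print(total, missed)
```

When only the first coordinate is selected, the missed mass equals $r^2$ up to rounding, which is the quantity appearing in the risk lower bound.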

Main Results
The following theorem characterizes consistent support recovery conditions. These results are the analogue of the 'sparsistency' guarantees found in the LASSO and $\ell_1$-norm minimization literature [26]. Throughout, $I$ denotes the set of coordinates selected by the coordinate selection scheme.
Theorem 1. For the model specified in (2.1) and (2.2) and the algorithms specified in Table 1, assume that $p(n), n \to \infty$, $s(n)/n \to 0$, and $\log p(n) = o(n)$. Let $\epsilon \in (0, 1)$. We have that
a. For $i \in I^c$, max
Here and $t_1$ satisfies the relation
We defer the proof to Appendix A. Theorem 1 identifies a phase transition in the ability of the algorithms to accurately estimate the support of $u$. Note that the analysis brings into sharp focus the dependence of $\beta_{\mathrm{crit}}$ on $v$ for the $\ell_1$- and sum-SEPCA algorithms, but not the $\ell_2$-SEPCA algorithm. Consequently, we can expect the algorithms to perform differently depending on the structure of the underlying $v$. It is important to note that the sparsity $s$ of $u$ is not a parameter in the thresholds and results.
It is also important to note that $\ell_2$-SEPCA and $\ell_1$-SEPCA do not rely on the equisigned character of $v$. However, it is clear that the sum-SEPCA algorithm explicitly depends on the equisigned assumption.

FDR-Based Algorithms
We may summarize the coordinate selection properties of the FDR refinements as follows:
Theorem 2. For the model specified in (2.1) and (2.2) and the three FDR-controlling algorithms summarized in Algorithm 2, assume that $p(n), n \to \infty$, $s(n)/n \to 0$, and $\log p(n) = o(n)$. Let $\epsilon \in (0, 1)$. We have that
a. For all three algorithms and $i \in I^c$,
c. For the FDR-SEPCA algorithm, uniformly over $i \in I$, ), coordinate $i$ is not selected with probability tending to 1.
Here where $\zeta > 1$, $\nu > e$, and the FDR-SEPCA algorithm detects $k$ coordinates.
We defer the proof to Appendix B.
Once again, we see that the structure of the underlying $v$ plays a role in the performance of the sum-based algorithms, but not for the $\ell_2$-based HC-$\ell_2$-SEPCA algorithm. Unlike in the FWER-controlling cases, the sparsity of $u$ plays a (small) role here, via the constant $\rho(\beta)$ for the Higher Criticism-based methods and via $k$ for FDR-SEPCA. Moreover, HC-$\ell_2$-SEPCA, like $\ell_2$-SEPCA, does not make use of the equisigned nature of $v$.

Simulations
To illustrate the relative powers of the six algorithms, we compute the theoretical limits on the sizes of detectable coordinates as a function of $n$. We use the unit-norm, equisigned $v$ given in (7.1). This choice of $v$ has a 'rise and fall' behavior, and is motivated by physical signals, e.g., chemical reactions or nerve signals in the brain. The value of $\beta_{\mathrm{crit}}$ is shown in Figure 1; for this choice of $v$, it is clear that sum-SEPCA dramatically outperforms the other SEPCA variants in terms of the size of the smallest detectable component. The FDR-SEPCA algorithm has similar performance to sum-SEPCA, and the HC-sum-SEPCA algorithm has the strongest performance.
In Figure 2, we plot the estimation error as a function of $n$ and $\theta$ for all six algorithms. We also include results for the SVD and the competing algorithms TPower [33] and ITSPCA [21]. In the simulations, we fix $p = 1000$ and vary $n$, since the dependence on $p$ in the thresholds is logarithmic, whereas that on $n$ is not. The left singular vector $u$ is chosen to be the vector with 1 in the first coordinate and 0 elsewhere. We fix the noise variance $\sigma^2$ at 1, so that $\theta^2$ is the eigen-SNR. The results should be interpreted as follows. For the particular $v$ chosen here, we expect HC-sum-SEPCA to have the lowest detectable limit, and $\ell_1$-SEPCA to have the largest. This behavior is confirmed. Moreover, the sum-based algorithms offer a slight strengthening of both ITSPCA and TPower.

Comments on the FDR-controlling procedures
The Higher Criticism for the $\chi^2_n$-variates 'pushes back' the phase transition between detecting nothing and something to a lower value of $\theta$ relative to the $\ell_2$-SEPCA algorithm, but is still less powerful than any of the sum-based algorithms. Moreover, even above the phase transition, the $\ell_2$-SEPCA algorithm may be preferable, as the error is increased by unacceptably many false positives.
The Higher Criticism procedure for the sum statistic has the lowest phase transition point and hence the highest power. Its transition is more gradual than the penalized FDR thresholding procedure and sum-SEPCA, which have roughly the same performance in this simulation.
Figure 2: The plots show the empirical estimation error for all six algorithms for the $u$ and $v$ described in (7.1). We include results from TPower, ITSPCA and the SVD for comparison.

An example where $\ell_2$-based algorithms outperform sum-based algorithms
Sum-SEPCA has a $\beta_{\mathrm{crit}}$ that depends on $v$. Looking at the form in (6.1), if $\|v\|_1$ is smaller than $n^{1/4}$, we would expect $\ell_2$-SEPCA to detect a smaller coordinate size. Vectors with smaller coordinates have a smaller $\ell_1$-norm, i.e., one that is closer to their $\ell_2$-norm. Hence, if we choose $v$ as in (7.2), we expect sum-SEPCA to have worse performance relative to $\ell_2$-SEPCA. Figures 3 and 4 confirm this expectation. The FDR refinements perform poorly. It should be noted, however, that TPower and ITSPCA retain their performance. This choice of $v$ effectively corresponds to a very small value of $n$: the majority of coordinates are tiny in size and buried beneath noise regardless of the value of $\theta$. If we 'corrected' the scenario and used a smaller $n$ and a subset of $v$, we would be in a situation closer to that given in (7.1).

A video data example
We conclude our sequence of examples with a real data study. This example is motivated by the problem of foreground-background separation in videos. Consider a grayscale video of stars twinkling against a black background [28]. Our goal is to estimate the locations of the stars: by reshaping the video, we may treat each frame as a vector and hence treat the video as a sparse matrix. Only a few locations have a star and are hence non-zero. The scale of the video pixels is between 0 and 255. We examine the top-left 72×64 pixels for 89 frames, as shown in Figure 2a. In Figure 2b, we plot the singular values of the video matrix. The first singular value stands out strongly against the rest, and at most two more singular values are well separated from the bulk. This structure suggests that our rank-1 based approach is well suited to this problem. We add Gaussian noise of variance $\sigma^2$ and study the True Positive Rates (TPR) and False Discovery Rates (FDR) across all algorithms and across different values of $\sigma$. In Figure 5, we show the results of our simulations. In terms of the TPR, everything other than the SVD has a similar performance, while the test-statistic SEPCA-based algorithms enjoy the best performance in terms of the FDR. In Figure 6, we zoom in on the top-right three stars and show how the algorithms perform as noise increases. Here, we see that the behavior alluded to in the TPR/FDR results actually occurs in the video.
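The reshaping step and the two performance measures can be sketched as follows. The array layout (frames, height, width), the interpretation of 72×64 as height by width, the synthetic star, and the function names are all assumptions made for illustration; the actual video data are not reproduced here.

```python
import numpy as np

def video_to_matrix(video):
    """Reshape a (frames, height, width) array into a (height * width) x frames matrix."""
    n_frames = video.shape[0]
    return video.reshape(n_frames, -1).T          # rows are pixels, columns are frames

def tpr_fdr(selected, truth):
    """True positive rate and false discovery proportion of a selected index set."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    tpr = tp / len(truth) if truth else 0.0
    fdr = (len(selected) - tp) / len(selected) if selected else 0.0
    return tpr, fdr

video = np.zeros((89, 72, 64))                    # synthetic stand-in for the video
video[:, 10, 20] = 200.0                          # one synthetic "star" at pixel (10, 20)
X = video_to_matrix(video)
star_row = 10 * 64 + 20                           # row index of pixel (10, 20) after reshaping
print(X.shape, tpr_fdr(selected=[star_row, 5], truth=[star_row]))
```

With this layout, each row of the matrix is the non-negative intensity of one pixel over time, so that the set of star locations plays the role of the support of $u$.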

Estimation of the Noise Variance, $\sigma^2$
In general, estimation of $\sigma^2$ may not be straightforward [23]. However, in most applications, including the video example we consider, one can obtain a relatively sparse representation of the object in a multiscale basis such as a wavelet basis. Under such circumstances, and under the assumed additive, isotropic noise model, we can easily obtain a consistent estimate of $\sigma^2$ by utilizing the inherent sparsity of the signal, especially in finer scales. This can be done, for example, by computing the variance of the wavelet coefficients in the finest scale [16].
Figure: The plots show the empirical estimation error for all six algorithms for the $u$ and $v$ described in (7.2). We include results from TPower, ITSPCA and the SVD for comparison.

One can obtain a more robust estimate by taking the median absolute deviation of the coefficients about their median and then by multiplying its square with a known scale factor (assuming normality).Alternatively, procedures such as those proposed in [23,24,29] could be employed.
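A robust estimate of this kind can be sketched as follows. We assume a Haar wavelet for the finest-scale detail coefficients and the usual Gaussian consistency constant 0.6745 for the median absolute deviation; both are assumptions of the sketch, as are the function names, and the procedures of [23,24,29] are not implemented here.

```python
import math
import statistics

def haar_finest_details(x):
    """Finest-scale Haar detail coefficients (x[2k] - x[2k+1]) / sqrt(2)."""
    return [(x[2 * k] - x[2 * k + 1]) / math.sqrt(2.0) for k in range(len(x) // 2)]

def mad_variance(coeffs):
    """Robust variance estimate (MAD / 0.6745)^2 under a Gaussian assumption."""
    med = statistics.median(coeffs)
    mad = statistics.median([abs(c - med) for c in coeffs])
    return (mad / 0.6745) ** 2

details = haar_finest_details([1.0, 1.0, 3.0, 1.0])
sigma2_hat = mad_variance([-2.0, -1.0, 0.0, 1.0, 2.0])
print(details, sigma2_hat)
```

Because the median is insensitive to a small number of large coefficients, the few detail coefficients that carry signal have little influence on the estimate.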

A geometric view: which algorithm to use?
We have stated detectability results for each algorithm in Section 6 and provided a numerical verification and comparison in Section 7. In this section, we wish to analytically compare the algorithms. In particular, we have seen that the right singular vector $v$ plays a critical role in the detectability and estimability of $u$, and we will characterize this behavior carefully.
In this section, we will use the following notational convenience: we absorb $(\theta u_i)$ into $v \in \mathbb{R}^n$, and write the detectability of coordinates in terms of $v$. That is, if $v^T$ is a row of $X$, we specify when that row is selected. Moreover, we take $\sigma = 1$ for simplicity.
There are two 'classes' of detectability: in terms of $\|v\|_1$ and in terms of $\|v\|_2$. The sum-, HC-sum-, and FDR-SEPCA algorithms select a coordinate if $\|v\|_1 = \left|\sum_k v_k\right|$ is large enough for a $v$ in the orthant with all non-negative or all non-positive coordinates. Geometrically, the vector $v$ is selected if it is 'outside' a hyperplane with a normal vector proportional to the vector of all 1s. The $\ell_1$-SEPCA algorithm is similar, as it selects a coordinate when $\|v\|_1$ is large enough, that is, if $v$ lies outside an $\ell_1$-ball of some radius. The connection between the previous three algorithms and $\ell_1$-SEPCA comes from noting that the faces of an $\ell_1$-ball are sections of hyperplanes with normal vectors proportional to a vector of ±1s. Finally, the $\ell_2$- and HC-$\ell_2$-SEPCA algorithms select a coordinate when $\|v\|_2$ is large enough, i.e., when $v$ lies outside some $\ell_2$-ball.

Our goal in this section is to derive comparisons between the six algorithms. Specifically, for a given vector $v$, which algorithm will have the greatest detection ability (we are, for the moment, only concerned with maximizing power)? Note that when $v$ has a large norm, it does not matter which algorithm is used. Questions only arise when $\|v\|_1$ or $\|v\|_2$ are relatively small and are close to the thresholds.

Intersection of a hyperplane and a hypersphere
We may think of the $\ell_1$-ball as a hyperplane when restricted to a single orthant. If a hypersphere of radius $r$ intersects a hyperplane with a normal vector proportional to the vector of all ±1s and minimum distance to the origin of $r - h$, a hyperspherical cap of height $h$ is formed: see Figure 7 for a simple illustration. Geometrically, a right triangle is formed, with hypotenuse $r$ and leg $r - h$. Hence, the angle between the center of the cap and the edge is:
$$\theta_{\lim} = \arccos\left(\frac{r - h}{r}\right).$$
It is sufficient to guarantee that the distance $r - h$ from the origin to the hyperplane does not exceed $r$ (that is, $h \ge 0$) for the hyperspherical cap to exist. Moreover, a vector $v$ has a direction contained inside the cap when the angle between $v$ and the vector of ±1 in the orthant containing $v$ is smaller than $\theta_{\lim}$. In other words, defining the angle for a vector $v$ as
$$\theta(v) = \arccos\left(\frac{\|v\|_1}{\sqrt{n}\,\|v\|_2}\right),$$
we need $\theta(v) \le \theta_{\lim}$.
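Both angles are simple to compute. The sketch below evaluates the cap half-angle from the right triangle with hypotenuse $r$ and leg $r - h$, and the angle between $v$ and the ±1 vector of its orthant, which has cosine $\|v\|_1 / (\sqrt{n}\,\|v\|_2)$. The function names and example inputs are illustrative assumptions.

```python
import math

def cap_half_angle(r, h):
    """Angle arccos((r - h) / r) between the center and the edge of a cap of height h."""
    if r <= 0.0 or not (0.0 <= h <= r):
        raise ValueError("need r > 0 and 0 <= h <= r")
    return math.acos((r - h) / r)

def angle_to_sign_vector(v):
    """Angle between v and the vector of +/-1 entries matching the signs of v."""
    l1 = sum(abs(x) for x in v)
    l2 = math.sqrt(sum(x * x for x in v))
    cos = l1 / (math.sqrt(len(v)) * l2)
    return math.acos(min(1.0, cos))               # clamp guards against rounding above 1

def in_cap(v, r, h):
    """True when the direction of v lies inside the cap."""
    return angle_to_sign_vector(v) <= cap_half_angle(r, h)

print(angle_to_sign_vector([1.0, 1.0, 1.0, 1.0]))   # dense v: aligned with the all-ones vector
print(angle_to_sign_vector([1.0, 0.0, 0.0, 0.0]))   # sparse v: far from the all-ones vector
```

A dense equisigned $v$ has a small angle and falls inside the cap, whereas a $v$ concentrated on a few coordinates has a large angle, which mirrors the dependence of the sum-based methods on $\|v\|_1$.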

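Figure 7: A hyperspherical cap of height $h$, formed by a hypersphere of radius $r$ and a hyperplane at distance $r - h$ from the origin; $\theta_{\lim}$ is the angle between the center of the cap and its edge.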
Now, we consider when HC-$\ell_2$-SEPCA is more powerful than $\ell_2$-SEPCA. The ratio of the radii is given by (8.3). If this ratio is smaller than 1, HC-$\ell_2$-SEPCA is more powerful than $\ell_2$-SEPCA. Note that (8.3) admits a simpler upper bound, so that whenever this upper bound is smaller than 1, the original ratio is smaller than 1 and HC-$\ell_2$-SEPCA is preferable to $\ell_2$-SEPCA.

Comparing the sum-based algorithms
Finally, we compare sum-, HC-sum-, and FDR-SEPCA. First, consider the ratio of the thresholds for HC-sum- and sum-SEPCA. Noting that $\rho(\beta) \le 1$, together with a bound on the remaining terms, it is clear that this ratio is always smaller than 1, so that HC-sum-SEPCA is a strict improvement on sum-SEPCA.
Next, we compute the ratio of the thresholds for FDR- and sum-SEPCA. Using the lower bound on $C_U$, we find that if $k \ge 11$ (and $p \ge k$, naturally), FDR-SEPCA is always more powerful than sum-SEPCA. For smaller values of $k$, the ratio will be smaller than 1 for sufficiently large values of $p$. Lastly, we compare FDR-SEPCA to HC-sum-SEPCA through the ratio of their thresholds (FDR to HC-sum). Because of the involvement of $\rho(\beta)$, this quantity is hard to analyze. If, in an oracle manner, FDR-SEPCA obtained $k$ correctly as $p^{1-\beta}$, we would find that this ratio is always larger than 1 for $p > 1$. That is, if $k$ assumes the correct value, HC-sum-SEPCA is more powerful than FDR-SEPCA. Alternatively, we can note that $\rho(\beta) \in (0, 1]$ and ask when the ratio is larger than 1. Based on this ratio, one can identify scenarios in which HC-sum-SEPCA is more powerful than FDR-SEPCA.
To summarize, we prefer the FDR-controlling alternatives to sum-SEPCA, but depending on the output of FDR-SEPCA, HC-sum-SEPCA may be more powerful. However, as the simulations in Section 7 revealed (see Figure 2), the number of false positives with HC-sum-SEPCA may be higher than with FDR-SEPCA.

Overall Message
We have seen that for $n$ and $p$ sufficiently large and $v$ that is sufficiently dense (in the sense of $\|v\|_1$ being large), a sum-based statistic and algorithm leads to better performance. This is expected behavior, as by using a sum-based method, we are taking advantage of the equisigned nature of $v$. Moreover, within the class of sum-based algorithms, controlling the FDR leads to greater power, as expected. It is difficult to clearly identify which of HC-sum- and FDR-SEPCA will have the greatest power, and the end result may come down to a practitioner's tolerance for false discoveries.
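To make the role of the equisigned assumption concrete, the following minimal sketch computes the noiseless mean of a row sum under an assumed model whose entries are $\theta u_i v_j$ plus zero-mean noise. The function names are hypothetical and the scaling is illustrative only. When $v$ is equisigned, the magnitude of this mean equals $|\theta u_i| \, \|v\|_1$; when the signs of $v$ are mixed, cancellation reduces it.

```python
def row_sum_signal(theta_u_i, v):
    """Mean of the i-th row sum when row i has entries theta_u_i * v_j plus zero-mean noise."""
    return theta_u_i * sum(v)

def l1_norm(v):
    """The l1 norm of v."""
    return sum(abs(x) for x in v)

# Illustration with unit-norm vectors of length 4.
v_equisigned = [0.5, 0.5, 0.5, 0.5]
v_mixed = [0.5, -0.5, 0.5, -0.5]
signal_equisigned = abs(row_sum_signal(2.0, v_equisigned))  # equals 2.0 * l1_norm(v_equisigned)
signal_mixed = abs(row_sum_signal(2.0, v_mixed))            # cancellation removes the signal
```

Both vectors have the same $\ell_1$ and $\ell_2$ norms, so only the sum-based statistic distinguishes them.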

Conclusions
We have considered the setting where the left singular vector of the rank-one signal matrix underlying the signal-plus-noise data matrix is assumed to be sparse and the right singular vector is assumed to be equisigned. We have proposed six different SEPCA algorithms for estimating the sparse principal component based on different decision statistics and provided sparsistency conditions for each. Our analysis reveals conditions where a coordinate selection scheme based on a sum-based decision statistic outperforms schemes that utilize the $\ell_1$ and $\ell_2$ decision statistics. Thereby, the proposed algorithm outperforms known schemes such as diagonal thresholded PCA [17] in terms of estimation of the singular vectors associated with the rank-1 component. We have derived lower bounds on the size of detectable coordinates of the principal left singular vector, utilized these lower bounds to derive lower bounds on the worst-case risk, and verified our findings with numerical simulations. Finally, we have discussed the results of our simulations analytically, by providing a geometric interpretation of the differences in power among the algorithms. We note that while we have stated our results for Gaussian noise with identity covariance, we can extend the FWER-controlling results to any log-concave noise distribution, and the FDR-controlling procedures to Gaussian noise with certain non-identity covariances.

A. Proof of Theorem 1
a. Note that $P(T_i \ge \tau) \le P\left(\max_{j \in I^c} T_j \ge \tau\right)$ for $i \in I^c$. Taking the maximum over the left-hand side and noting that the right-hand side has limit zero yields the result. This follows from (3.1).
b. We consider when true positives occur with probability approaching 1. We want to find the smallest coordinate $(\theta u_i)$ such that the following probability approaches 1: Note that if $(\tau_{n,p} - E T_i)$ is negative and not tending to zero as $n$ grows, and if the variance of $T_i$ decays to zero as $n$ grows, the quantity tends toward negative infinity. Hence, we will specify conditions so that $\operatorname{Var} T_i$ decays to zero as $n$ grows and then compute when a coordinate is detectable by considering when $\tau_{n,p}$ is strictly less than $E T_i$. For brevity, we omit the computations in solving $\tau_{n,p} < E T_i$ for $|\theta u_i|$ and present verifications that the variance of $T_i$ has limit 0. These results show that above the decision boundary, we have uniform detection.
In sum-SEPCA, $T_i$ is a Gaussian random variable with mean $(\theta u_i)\|v\|_1/\sqrt{n}$ and variance $\sigma^2/n$. Since $\sigma$ does not grow with $n$, $\operatorname{Var} T_i$ always decays to zero. In $\ell_2$-SEPCA, $T_i$ has Since $\sigma$ and $\theta$ are fixed, the variance always decays to 0. Let which is less than or equal to Since $\|v\|_2 = 1$, the variance of $T_i$ has limit 0. Because we cannot solve the inequality $\tau_{n,p} < E T_i$ analytically, we leave the bound in the form given previously.
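The argument in part b can be illustrated numerically in the Gaussian case. The sketch below assumes a statistic $T \sim N(m, \sigma^2/n)$ with a fixed mean $m$ and a fixed threshold $\tau$ (in the paper both may depend on $n$ and $p$); the function name is hypothetical. It evaluates $P(T \ge \tau) = 1 - \Phi\left((\tau - m)\sqrt{n}/\sigma\right)$, which tends to 1 as $n$ grows when $\tau < m$ and to 0 when $\tau > m$.

```python
import math

def detection_probability(mean, tau, sigma, n):
    """P(T >= tau) for T ~ N(mean, sigma^2 / n), computed via the complementary error function."""
    sd = sigma / math.sqrt(n)
    z = (tau - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)

# Threshold below the mean: the probability of detection increases with n.
probs_detect = [detection_probability(1.0, 0.8, 1.0, n) for n in (10, 100, 1000)]
# Threshold above the mean: the probability of detection decreases with n.
probs_miss = [detection_probability(1.0, 1.2, 1.0, n) for n in (10, 100, 1000)]
```

The gap between the threshold and the mean is held fixed here, which is the "not tending to zero" condition in the proof.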
In the proof above, note that if $(\tau_{n,p} - E T_i)$ is positive and not tending to zero as $n$ grows, the quantity in (A. Rearranging the inequality $r > \rho(\beta)$ yields Note that sum-SEPCA can detect coordinates of size However, $C_U$ is strictly larger than $\sqrt{2} + 1/(3\sqrt{2})$. Thus, using HC yields a threshold of the same order, but with a strictly smaller scaling.

B.1.2. Sum of squares: HC-$\ell_2$-SEPCA
If we sum the squares of the entries of rows of $X$, abusing notation slightly and using $N(\mu, \sigma^2)$ to indicate a Gaussian random variable with mean $\mu$ and variance $\sigma^2$, the statistic for the $i$th coordinate is of the form Assuming oracular knowledge of $\sigma$, the statistic places us in the setting of (4.5). The non-centrality parameter $\delta$ is given by Setting $\delta = 2r \log p$ and solving $r > \rho(\beta)$ yields We have that $\ell_2$-SEPCA can detect coordinates with Using HC offers a significant improvement over $\ell_2$-SEPCA. However, we also expect HC with the $\chi^2_n$ statistic to have a smaller detectable coordinate: $\|v\|_1 \le \sqrt{n}$, so that for fixed $\beta$ and $p$, the threshold in (B.2) is asymptotically larger than that in (B.4) (but potentially of the same order). This result is strange in the context of the non-FDR results. In any case, HC improves on $\ell_2$-SEPCA.
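As a rough illustration of the sum-of-squares statistic, the sketch below assumes that row $i$ has independent entries $\theta u_i v_j + s z_{ij}$ with standard normal $z_{ij}$ and a known per-entry noise standard deviation $s$; the paper's own noise scaling may differ, and all function names are hypothetical. Under this assumption the scaled statistic $\sum_j X_{ij}^2 / s^2$ follows a non-central $\chi^2_n$ distribution with non-centrality $(\theta u_i)^2 \|v\|_2^2 / s^2$, whose expectation is $n$ plus the non-centrality.

```python
import math
import random

def noncentrality(theta_u_i, v, noise_sd):
    """Non-centrality of sum_j X_ij^2 / noise_sd^2 under the assumed row model."""
    return sum((theta_u_i * vj) ** 2 for vj in v) / noise_sd ** 2

def sum_of_squares_statistic(row, noise_sd):
    """Scaled sum of squares of one row, assuming the noise level is known."""
    return sum(x * x for x in row) / noise_sd ** 2

def simulate_row(theta_u_i, v, noise_sd, rng):
    """Draw one row: signal theta_u_i * v_j plus independent Gaussian noise."""
    return [theta_u_i * vj + noise_sd * rng.gauss(0.0, 1.0) for vj in v]

# Monte Carlo check of the mean n + delta, with a fixed seed for repeatability.
n = 50
v = [1.0 / math.sqrt(n)] * n          # equisigned with unit l2 norm
rng = random.Random(0)
reps = 4000
average_statistic = sum(
    sum_of_squares_statistic(simulate_row(3.0, v, 1.0, rng), 1.0) for _ in range(reps)
) / reps
expected_mean = n + noncentrality(3.0, v, 1.0)
```

Because $\|v\|_2 = 1$, the non-centrality depends on $v$ only through its norm, which is why this statistic does not benefit from the equisigned structure.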

B.1.3. FDR-SEPCA
Recall that taking sums across the rows of $X$, we obtain a vector $y$ where $y_i = \mu_i + \sigma z_i$, with $\mu_i = (\theta u_i)\|v\|_1$. Moreover, we have noted that where $t_k$ is the level at which $y$ is thresholded. It follows that, entries of $y$ that are of size at least Relative to HC and sum-SEPCA, the gain here is found when there are many smaller coordinates of $u$ and $k$ is large.

B.2. Proofs for the Higher Criticism-Based Methods
a. From (2.8) in [13], As with Theorem 1, we omit the computation of $\beta_{\text{crit}}$, as it follows from the discussion in Section 4.1.
Choosing $\rho \le \|v\|^{-2\kappa/(1-q/2)}$ is enough. With these choices of parameters, Noting that $1/\sqrt{\rho} \ge \|v\|^{\kappa/(1-q/2)}$, choosing $\rho = \|v\|^{-2\kappa/(1-q/2)}$ leads to the smallest possible choice of coordinate. In summary: and So, for a given algorithm, it remains to choose $\varphi$ and $\kappa$ so that the worst-case risk is lower-bounded by r , and for $\ell_2$-SEPCA, $\kappa$ is irrelevant, as $\|v\|_2 = 1$. Sum-SEPCA uses $\varphi = 0$ and $\ell_2$-SEPCA uses $\varphi = 1/2$. Hence, sum-SEPCA has a risk lower-bounded by Noting that $\|v\|_2 = 1$, $\ell_2$-SEPCA has (C.14) In the $\ell_0$ case, i.e., when $u$ has no more than $s$ non-zero entries, the preceding analysis goes through with $C^q$ replaced by $s$ and $q$ set to zero.

C.1.1. FDR Algorithms
For HC-sum-SEPCA, the $\beta_{\text{crit}}$ is of the same order as that for sum-SEPCA. Similarly, for FDR-SEPCA, if $k$ is much smaller than $p$, $\beta_{\text{crit}}$ is of roughly the same order. Hence, these two algorithms have the same risk bound as sum-SEPCA. For HC-$\ell_2$-SEPCA, $\kappa = 0$ and $\varphi = 1$. The risk is therefore lower-bounded by $O\left([C^q - 1]\, n^{-(1-q/2)}\right)$. (C.15)

D. FDR-SEPCA: Further Details
Let $y_i = \mu_i + \sigma z_i$, where $i \in \{1, \dots, p\}$, the vector $z$ of the $z_i$ is normally distributed with mean 0 and covariance $\Sigma$, and $\Sigma$ satisfies Here, $\xi_0$ is the smallest eigenvalue of $\Sigma$ and $\xi_1$ is the largest. The mean vector $\mu$ of the $\mu_i$ is assumed to be sparse; the goal is to estimate $\mu$. The following penalized least squares formulation yields an estimator for $\mu$: The parameter $\beta$ may be set to 0 here, and $\nu$ is chosen to be no smaller than $e^{1/(1+2\beta)}$. We define $\|\mu\|_0$ to be the number of non-zero coordinates of $\mu$.
The solution to (D.1) is given by hard-thresholding. Let $|y|_{(i)}$ be the $i$th order statistic of $|y_i|$, namely |y| Recalling that (D.1) solves a penalized least squares problem for $\hat{y}$ close to $y$, we may discuss the statistical behavior of this estimator. The following discussion follows and reproduces that in [18]. First, note that for $\beta = 0$, the parameter $\nu$ directly controls the FDR (where a false positive corresponds to selecting a zero coordinate in $y$): a choice of ν = 2 1/ω for $\omega \in (0, 1)$ bounds the FDR at a level $\omega$.
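The link between penalized least squares and hard-thresholding can be sketched in code. The example below assumes a generic objective of the form $\|y - \mu\|_2^2 + \sigma^2 \, \mathrm{pen}(\|\mu\|_0)$; for any such complexity penalty, the minimizer keeps the $k$ largest entries of $|y|$ and zeroes the rest, where $k$ minimizes the residual sum of squares plus the penalty. The function names and the particular penalty `example_penalty` are hypothetical placeholders for illustration and are not claimed to match (D.1).

```python
import math

def penalized_hard_threshold(y, sigma, penalty):
    """Minimize ||y - mu||^2 + sigma^2 * penalty(k) over vectors mu with k non-zero entries."""
    p = len(y)
    # Indices sorted by decreasing magnitude of y.
    order = sorted(range(p), key=lambda i: abs(y[i]), reverse=True)
    # tail[k] is the sum of squares of all entries except the k largest in magnitude.
    tail = [0.0] * (p + 1)
    for k in range(p - 1, -1, -1):
        tail[k] = tail[k + 1] + y[order[k]] ** 2
    best_k = min(range(p + 1), key=lambda k: tail[k] + sigma ** 2 * penalty(k))
    mu_hat = [0.0] * p
    for i in order[:best_k]:
        mu_hat[i] = y[i]  # kept coordinates are left unshrunk (hard-thresholding)
    return mu_hat, best_k

def example_penalty(k, p, nu=math.e):
    """Hypothetical penalty of the form 2 * k * log(nu * p / k), for illustration only."""
    return 0.0 if k == 0 else 2.0 * k * math.log(nu * p / k)

y = [5.0, 0.1, -4.0, 0.2, 0.0, -0.1, 0.3, 0.05]
mu_hat, k_hat = penalized_hard_threshold(y, 1.0, lambda k: example_penalty(k, len(y)))
```

In this toy input, the two large entries are retained and the six small ones are set to zero, since keeping a third entry would cost more in penalty than it saves in residual.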
Second, the expected risk, $E\|\hat{y} - y\|_2^2$, is bounded as follows. By Proposition 4.1 in [18], The second term in (D.7) is the ideal risk, or, the infimum of the penalized least squares objective. If $y$ belongs to an $\ell_q$ ball with radius $C$ and $0 < q < 2$, and we define the ideal risk is bounded as $\sup_{y \in \mathbb{R}^p :\, \sum_i |y_i|^q \le C^q} R(y, \sigma) \le c (\log \nu) \sigma^2 r_{p,q}(C/\sigma)$, (D.9) for some $c > 0$. The supplementary results in [18] yield that $R(y, \sigma)$ is bounded by $C^2 \log \nu$, and by $C^2$ when $C \le \sqrt{1 + \log p}$. As in Appendix C, we may replace $q$ with 0 and $C^q$ with $s$ in the case of hard sparsity with $s$ non-zero coordinates. Doing so leads to the bound: (D.10) Note that we have recovered the factor of $\log(\nu p/s)$ in $\beta_{\text{crit}}$.

Fig 1: This plot shows $\beta_{\text{crit}}$ for all six algorithms for the $v$ described in (7.1).
Fig 2: The plots show the empirical estimation error for all six algorithms for the $u$ and $v$ described in (7.1). We include results from TPower, ITSPCA and the SVD for comparison.

Fig 3: This plot shows $\beta_{\text{crit}}$ for all six algorithms for the $v$ described in (7.2).
Fig 4: The plots show the empirical estimation error for all six algorithms for the $u$ and $v$ described in (7.2). We include results from TPower, ITSPCA and the SVD for comparison.
(a) The image shows the mean intensity of pixels from the top-left 72 × 64 pixels for 89 frames. White indicates the presence of a star. (b) The plot shows the singular values of the video data. The spacing suggests a low-rank-plus-noise structure.

Fig 5: The left plot shows the True Positive Rate of the various algorithms as a function of the noise level σ.The right plot shows the False Discovery Rates.

Fig 6: A zoomed-in view of the three top-right stars in the video example.White indicates a false negative (missed star), Red a false positive (a guessed pixel where there was nothing), and Blue a true positive (correctly identified pixel).

Fig 7: A spherical cap in $\mathbb{R}^2$

Table 2
Video Example Figures
2) tends to positive infinity when the variance decays to zero. Hence, modifying the proof by solving $\tau_{n,p} > E T_i$ for $|\theta u_i|$ yields when a coordinate is not detectable with probability approaching 1: i.e., when $|\theta u_i|$ is smaller than the values given in (6.1).
If $v$ is equisigned, summing across the rows of $X$ yields a normally distributed quantity with mean $(\theta u_i)\|v\|_1$ and variance $\sigma^2$. Dividing by $\sigma$ and adopting the notation of HC, we have that under the alternative hypothesis, µ i = [1]as limit zero.
b. Let $I_1 \subseteq I$ be the set of coordinates with signal larger than the detection limit ($i \in I$ such that $|\theta u_i| > \beta_{\text{crit}}(1 + \epsilon)$), and let $I_2 \subseteq I$ contain the rest of the coordinates ($i \in I$ such that $|\theta u_i| < \beta_{\text{crit}}(1 - \epsilon)$). By Theorem 1 in [1], the asymptotic power for detecting signals above the detection limit is one, and that for signals below the limit is zero. Hence, for $i \in I_1$, i∈I : |θui|<βcrit(1− ) 2 n . Sum-SEPCA misses coordinates of size O