A note on sensitivity of principal component subspaces and the efficient detection of influential observations in high dimensions

In this paper we introduce an influence measure based on a second-order expansion of the RV and GCD measures for the comparison between unperturbed and perturbed eigenvectors of a symmetric matrix estimator. Example estimators are considered to highlight how this measure complements recent influence analyses. Importantly, we also show how a sample-based version of this measure can be used to accurately and efficiently detect influential observations in practice.


Introduction
Principal component analysis (PCA) is a popular tool in multivariate statistics. However, PCA estimates may be highly influenced by certain types of observations and, as such, it is often important to locate, and perhaps subsequently treat, observations that may be potentially harmful in practice. Influence analysis of PCA estimators (see, e.g., [9; 3; 31; 11] and [14]) shows that the direction of an observation plays a crucial role in how influential it may be on the eigenvector and eigenvalue estimates. As such, influential observations may or may not be detectable using common distance measures and can therefore be difficult to locate.
Throughout we will consider PCA of a symmetric matrix where the set of all eigenvectors satisfies orthonormality conditions. Often it is the span of a subset of eigenvectors that is of primary interest rather than the individual eigenvectors themselves. Let A = [a_1, …, a_K] and B = [b_1, …, b_K] be two p × K matrices where ‖a_i‖ = 1, ‖b_i‖ = 1 (i = 1, …, K) and a_i^⊤ a_j = 0, b_i^⊤ b_j = 0 for all i ≠ j. Two common measures for the comparison between the column spaces of A and B are the RV(A, B) coefficient ([15; 27]) and the GCD(A, B) measure [32]. Let P_A = AA^⊤ and P_B = BB^⊤ denote the projection matrices onto the column spaces of A and B respectively. Then, due to orthonormality of the columns of A and B (see, e.g., [3]), we have

RV(A, B) = GCD(A, B) = trace(P_A P_B)/K.     (1)

In considering influence on eigenvector subset spans, Bénasséni [3] noted that the RV and GCD measures were insensitive to small perturbations. Bénasséni then introduced a new measure for assessing sensitivity, given as

ρ_1(A, B) = (1/K) Σ_{k=1}^K ‖a_k − P_B a_k‖     (2)

or, alternatively, ρ_2(A, B) = ρ_1(B, A). It should be noted that it is not necessarily true that ρ_2(A, B) = ρ_1(A, B) so that, unlike the RV and GCD measures, the ordering of the arguments may be important. However, by considering a small adjustment to ρ_1(A, B) that sums ‖a_k − P_B a_k‖² instead of ‖a_k − P_B a_k‖, note that

(1/K) Σ_{k=1}^K ‖a_k − P_B a_k‖² = 1 − trace(P_A P_B)/K = 1 − RV(A, B),

so that there is a strong link between Bénasséni's measure and the RV and GCD measures. The purpose of this paper is then to consider an influence measure based on the RV and GCD measures. In Section 2 we consider perturbation of the RV and GCD measures and introduce a measure for the influence analysis of eigenvector spans. We then apply this measure to some example estimators in Section 3 to show how it complements existing influence studies. Sample versions are discussed in Section 4 for the detection of highly influential observations in practice.
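The relations above are easy to verify numerically. The sketch below computes the common value of the RV and GCD measures via trace(P_A P_B)/K, together with a ρ_1-style average distance of the columns of A from their projections onto the span of B; the function names are illustrative only.

```python
import numpy as np

def rv_gcd(A, B):
    # With orthonormal columns, RV(A, B) and GCD(A, B) both reduce to
    # trace(P_A P_B) / K, where P_A = A A^T and P_B = B B^T.
    K = A.shape[1]
    P_A, P_B = A @ A.T, B @ B.T
    return np.trace(P_A @ P_B) / K

def rho1(A, B):
    # Benasseni-style sensitivity: average distance of each column of A
    # from its projection onto the column space of B.
    P_B = B @ B.T
    return np.mean([np.linalg.norm(a - P_B @ a) for a in A.T])
```

Identical spans give RV = GCD = 1 and ρ_1 = 0, and rotating the columns of A within their span leaves rv_gcd unchanged, since the measure depends on A only through P_A.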
In Section 5 we show how influential observations may be efficiently detected in practice in the context of a high-dimensional data set.

The effect of perturbation on the RV and GCD measures
Consider the contamination distribution

F_x(ε) = (1 − ε)F_{µ,Σ} + ε∆_x,     (3)

where F_{µ,Σ} is some distribution with mean µ and covariance matrix Σ, 0 < ε < 1 and ∆_x is the Dirac measure putting all of its probability mass at the contaminant x. When convenient we may also utilize z = Σ^{−1/2}(x − µ) (the standardized contaminant at F_{µ,Σ}) and the population Mahalanobis distance with respect to x at F_{µ,Σ}, given as MD_{µ,Σ}(x) = {(x − µ)^⊤ Σ^{−1} (x − µ)}^{1/2}. Let W denote a p × p statistical functional where, at an arbitrary distribution G for which it exists, W(G) is symmetric. With respect to F_x(ε) defined in (3), we are interested in perturbation of the form

W(F_x(ε)) = W(F_{µ,Σ}) + εW_1 + ε²W_2 + o(ε²),     (4)

where W_1 and W_2 are p × p symmetric matrices independent of ε. Under perturbation of the form given in (4), the first-order coefficient of ε is the influence function ([16; 17]) for W at F_{µ,Σ}, denoted IF(W, F_{µ,Σ}; x). The influence function is a useful tool for understanding the sensitivity of estimators to small perturbations. For example, if IF(W, F_{µ,Σ}; x) = 0 then, for small ε, W(F_x(ε)) ≈ W(F_{µ,Σ}) so that small perturbations with respect to x have little influence on the estimator. On the other hand, if IF(W, F_{µ,Σ}; x) is large then perturbation with respect to x is highly influential since it effects a large change on the estimator.
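To make the role of the first-order coefficient concrete, the sketch below contaminates a distribution with point mass at x and checks that the derivative of the functional at ε = 0 matches a closed-form influence function. It uses the classical covariance functional, whose influence function (x − µ)(x − µ)^⊤ − Σ appears in Section 3; the mixture moments are computed exactly rather than simulated, and the function names are our own.

```python
import numpy as np

def cov_contaminated(mu, Sigma, x, eps):
    # Exact covariance of F_x(eps) = (1 - eps) F_{mu,Sigma} + eps * Delta_x:
    # mix the first two moments of the two components, then re-centre.
    m = (1 - eps) * mu + eps * x
    second = (1 - eps) * (Sigma + np.outer(mu, mu)) + eps * np.outer(x, x)
    return second - np.outer(m, m)

def if_cov(mu, Sigma, x):
    # Closed-form influence function of the classical covariance functional.
    d = x - mu
    return np.outer(d, d) - Sigma
```

A finite difference (C_0(F_x(ε)) − Σ)/ε for small ε should then agree with if_cov to first order.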
Let |κ_1| ≥ |κ_2| ≥ ··· ≥ |κ_p| be the eigenvalues of W(F_{µ,Σ}) and let ν_1, …, ν_p denote the corresponding eigenvectors. We are interested in the effect that perturbation has on the span of a subset of the eigenvectors. Let S ⊂ {1, …, p} and let P_S = Σ_{j∈S} ν_j ν_j^⊤ denote the projection matrix onto the subspace spanned by the elements of {ν_j : j ∈ S}. Similarly, let P_S(ε) = Σ_{j∈S} ν_j(ε)ν_j(ε)^⊤ denote the perturbed equivalent at F_x(ε), where ν_1(ε), …, ν_p(ε) are the eigenvectors corresponding to the ordered absolute eigenvalues of W(F_x(ε)). Typically, S will be chosen to be {1, …, K} so that the span of the eigenvectors corresponding to the K largest absolute eigenvalues is of interest. For example, in PCA of covariance matrices, principal components with large corresponding eigenvalues are retained since they account for most of the total population variance. Let S′ denote the complement of S. The following condition will also be used.
We now consider a measure based on the expansion of a function of the RV and GCD measures under the above condition. The proof of the following theorem is given in the Appendix.
Theorem 1. Consider S, P_S, P_S(ε), κ_i (i = 1, …, p) and ν_i (i = 1, …, p) as defined previously and let

ρ_S(W, F_{µ,Σ}; x) = ε^{−2} [1 − trace{P_S P_S(ε)}/K].

Then, under the perturbation form given in (4) and Condition 1,

ρ_S(W, F_{µ,Σ}; x) = (1/K) Σ_{j∈S} Σ_{r∈S′} [ν_j^⊤ IF(W, F_{µ,Σ}; x) ν_r]² / (κ_j − κ_r)² + O(ε),

where K is the number of elements in the set S and IF(W, F_{µ,Σ}; x) is the influence function for W at F_{µ,Σ}.
From Theorem 1, if perturbation has resulted in no difference between the span of the non-perturbed and perturbed eigenvectors then ρ_S(W, F_{µ,Σ}; x) = 0 since trace{P_S P_S(ε)} = K. However, as the distance between the spans increases according to (1/K)trace{P_S P_S(ε)} (i.e., as the RV and GCD measure for the comparison between the spans approaches zero), ρ_S(W, F_{µ,Σ}; x) increases. Also, for small ε we have that ρ_S(W, F_{µ,Σ}; x) is approximately equal to the second-order coefficient of ε in the expansion of 1 − trace{P_S P_S(ε)}/K. Hence, the following influence measure will be utilized throughout,

ρ̃_S(W, F_{µ,Σ}; x) = (1/K) Σ_{j∈S} Σ_{r∈S′} [ν_j^⊤ IF(W, F_{µ,Σ}; x) ν_r]² / (κ_j − κ_r)².     (5)

In the case of K = 1, such that S = {j} for some j ∈ {1, …, p},

ρ̃_{{j}}(W, F_{µ,Σ}; x) = ‖IF(ν_j, F_{µ,Σ}; x)‖²,

where ν_j is the functional for the jth eigenvector estimator and IF(ν_j, F_{µ,Σ}; x) is the associated influence function. Hence, the measure may be used to assess influence on individual components as well as on the span of a subset of components.
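As a numerical sanity check on the second-order expansion, the sketch below perturbs a symmetric matrix W along a fixed symmetric direction Δ (standing in for the influence function) and compares the directly computed 1 − trace{P_S P_S(ε)}/K, divided by ε², against the double sum in (5). The function names and the choice S = {1, …, K} are illustrative only.

```python
import numpy as np

def rho_formula(W, Delta, K):
    # Double sum in (5): (1/K) sum_{j in S} sum_{r in S'}
    # [nu_j' Delta nu_r]^2 / (kappa_j - kappa_r)^2, with S the indices
    # of the K largest absolute eigenvalues of W.
    kappa, V = np.linalg.eigh(W)
    order = np.argsort(-np.abs(kappa))
    kappa, V = kappa[order], V[:, order]
    p = len(kappa)
    return sum((V[:, r] @ Delta @ V[:, j]) ** 2 / (kappa[j] - kappa[r]) ** 2
               for j in range(K) for r in range(K, p)) / K

def subspace_change(W, Delta, K, eps):
    # Directly computed 1 - trace{P_S P_S(eps)}/K for W(eps) = W + eps * Delta.
    def top_proj(M):
        vals, vecs = np.linalg.eigh(M)
        V = vecs[:, np.argsort(-np.abs(vals))[:K]]
        return V @ V.T
    return 1 - np.trace(top_proj(W) @ top_proj(W + eps * Delta)) / K
```

For small ε, subspace_change(W, Delta, K, eps)/eps² should be close to rho_formula(W, Delta, K), with the discrepancy shrinking at rate O(ε).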
The ρ̃_S(W, F_{µ,Σ}; x) measure provides a convenient means to understand the sensitivity of eigenvector estimators in the presence of non-unique eigenvalues. IF(ν_j, F_{µ,Σ}; x) is only known for the case of unique κ_j (see, e.g., [9] or [11]) and, as such, is not useful in all situations. For example, let J ⊂ {1, …, p} be such that κ_i = κ_j for all i, j ∈ J with i ≠ j; then the solution to IF(ν_j, F_{µ,Σ}; x) is not known for j ∈ J. However, from (5), ρ̃_S(W, F_{µ,Σ}; x) is known provided J ⊂ S or J ⊂ S′ and Condition 1 holds.

Covariance and correlation matrix estimators
Let C_0 denote the functional for the classical covariance matrix estimator where, at F_{µ,Σ}, C_0(F_{µ,Σ}) = Σ with eigenvalues λ_1 ≥ ··· ≥ λ_p and corresponding eigenvectors η_1, …, η_p. The influence function for this estimator (see, e.g., [9]) is IF(C_0, F_{µ,Σ}; x) = (x − µ)(x − µ)^⊤ − Σ so that, from (5),

ρ̃_S(C_0, F_{µ,Σ}; x) = (1/K) Σ_{j∈S} Σ_{r∈S′} [η_j^⊤(x − µ)]² [η_r^⊤(x − µ)]² / (λ_j − λ_r)².     (6)

It was also in the classical covariance matrix setting that Bénasséni's measure in (2) was introduced. This measure was based on the average sine of the angle between each of the perturbed eigenvectors and their projection onto the non-perturbed subspace. Using our notation, an influence measure associated with this coefficient can also be defined (see [3]), and it contains similar sensitivity information to the ρ̃_S(C_0, F_{µ,Σ}; x) measure.
Similarly, let R_0 denote the functional for the classical correlation matrix estimator where, at F_{µ,Σ}, R_0(F_{µ,Σ}) = Γ, and let α_1, …, α_p and γ_1, …, γ_p denote the eigenvalues and associated eigenvectors of Γ. For x_i denoting the ith element of x, µ_i denoting the ith element of µ and σ_ii denoting the ith diagonal element of Σ, let z̃ denote the vector with ith element z̃_i = (x_i − µ_i)/σ_ii^{1/2} and let D = diag(z̃_1², …, z̃_p²). The influence function for this estimator is, from [13], IF(R_0, F_{µ,Σ}; x) = z̃ z̃^⊤ − (DΓ + ΓD)/2, so that ρ̃_S(R_0, F_{µ,Σ}; x) follows directly from (5).

We will now provide an example of the form of the influence measure for eigenvector subspaces of covariance estimators. We will not only consider the measure for the classical case as shown in (6), but also with respect to two robust estimators of the covariance matrix; namely, the one-step re-weighted Minimum Covariance Determinant (RMCD) estimator, which includes an initial MCD estimation step [28] followed by a subsequent re-weighting [21], and the S-estimator ([29; 30; 12]). For simplicity, and to satisfy Fisher consistency at the non-contaminated model, we will assume F_{µ,Σ} is multivariate normal. For the RMCD estimator we choose the breakdown point of the initial MCD estimator to be α = 0.5, followed by the retention of the 97.5% mass satisfying MD²_{µ*,Σ*}(x) ≤ q_{.975}, where P(χ²_p ≤ q_{.975}) = 0.975 and µ* and Σ* are the MCD mean vector and covariance matrix. For the S-estimator we use the minimizer function associated with Tukey's biweight function (see, e.g., Example 2.2 of [20]) and α = 0.5. The influence functions for these estimators that are used in the following example can be found in [10] and [20].

Example 1. Let λ_1 = ··· = λ_K and λ_{K+1} = ··· = λ_p with λ_p < λ_1, Σ = diag(λ_1, …, λ_p), µ = 0 and S = {1, …, K}.
From (6), and since each η_j^⊤(x − µ) = λ_j^{1/2} z_j where z_j is the jth element of z, the measure depends on x only through z. Let θ denote the angle between z and P_S z (its projection onto the K-eigenvector subspace), so that cos²(θ) = z^⊤ P_S z / z^⊤ z. Then

ρ̃_S(C_0, F_{µ,Σ}; x) = λ_1 λ_p MD⁴_{µ,Σ}(x) cos²(θ) sin²(θ) / {K(λ_1 − λ_p)²}.

In Figure 1 we plot ρ̃_S(C_0, F_{µ,Σ}; x) for Example 1 and the corresponding measures for the RMCD and S-estimators as described above. As can be seen in plot (a), the classical estimator can be highly influenced by x, in particular by those x with a large MD_{µ,Σ}(x). However, the angle of x from its projection onto the subspace spanned by [η_1, …, η_K] also plays an important role. Regardless of the magnitude of MD_{µ,Σ}(x), x has zero influence when z ∈ Span{η_j : j ∈ S} or z ∈ Span{η_r : r ∈ S′}. This is also the case for the RMCD and S-estimators, though the downweighting of observations with a large MD_{µ,Σ}(x) results in reduced influence, as is seen in plots (b) and (c). Let q_ξ be the ξ × 100% percentile of the χ²_p distribution such that P(χ²_p ≤ q_ξ) = ξ. For the RMCD estimator there are discontinuities at MD²_{µ,Σ}(x) = q_α and MD²_{µ,Σ}(x) = q_δ corresponding to the rejection of observations in the initial weighting and re-weighting steps respectively. The estimator is particularly sensitive at these points since small changes in x can effect a large change in influence. From an influence perspective the S-estimator is preferred since it does not suffer from comparatively high influence, and the smooth weighting function utilized results in smooth changes in influence with respect to changes in MD_{µ,Σ}(x).
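Under the setting of Example 1, the double sum in (6) factorizes into a closed form involving only MD_{µ,Σ}(x) and the angle θ between z and P_S z. The sketch below verifies this factorization numerically; the function names are illustrative and the closed form is an algebraic simplification of (6) under Example 1's assumptions.

```python
import numpy as np

def rho_cov_example1(lam, z, K):
    # Double sum in (6) with Sigma = diag(lam) and mu = 0, so that
    # eta_j'(x - mu) = sqrt(lam_j) * z_j.
    lam, z = np.asarray(lam, float), np.asarray(z, float)
    x_proj = np.sqrt(lam) * z
    p = len(lam)
    return sum((x_proj[j] * x_proj[r]) ** 2 / (lam[j] - lam[r]) ** 2
               for j in range(K) for r in range(K, p)) / K

def rho_cov_closed(lam, z, K):
    # Factorized form: lam_1 lam_p MD^4 cos^2(theta) sin^2(theta)
    # / (K (lam_1 - lam_p)^2), with MD^2 = ||z||^2 and
    # cos^2(theta) = ||P_S z||^2 / ||z||^2.
    lam, z = np.asarray(lam, float), np.asarray(z, float)
    md2 = z @ z
    cos2 = (z[:K] @ z[:K]) / md2
    return lam[0] * lam[-1] * md2 ** 2 * cos2 * (1 - cos2) / (K * (lam[0] - lam[-1]) ** 2)
```

In particular, for z lying entirely inside Span{η_j : j ∈ S} (or entirely inside its orthogonal complement) the measure is exactly zero no matter how large MD_{µ,Σ}(x) is.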

Dimension reduction methods
In the regression setting, consider a univariate response variable Y and p-dimensional predictor vector X. If there exists a p × K matrix B such that Y ⊥⊥ X | B^⊤X then, when K < p, dimension reduction can be achieved without loss of information by replacing the p-dimensional X with the K-dimensional B^⊤X. For more information see, e.g., [7]. Let S denote the column space of B. Under appropriate conditions, methods such as Sliced Inverse Regression (SIR, [18]), Sliced Average Variance Estimates (SAVE, [6]) and Principal Hessian Directions (PHD, [19]) seek a basis for S. These methods are based on an eigen-analysis of a p × p symmetric matrix so that, provided the influence function for the symmetric matrix estimator is known, the influence measure resulting from Theorem 1 is applicable in this setting.
Influence functions for versions of SIR, SAVE and PHD that return an orthonormal basis for S have been considered, where influence on the directions of the basis is assessed with respect to Bénasséni's measure (see [23; 24] and [25]). As an example we will consider PHD since the method requires less introduction. Assuming X ∼ N_p(µ, Σ), [19] showed that the eigenvectors corresponding to non-zero eigenvalues of the average Hessian matrix, given as

H_x = Σ^{−1} E[{Y − E(Y)}(X − µ)(X − µ)^⊤] Σ^{−1},

are elements of S. For more on PHD, including alternative versions, see [19] or [8].
In this regression context, the contamination distribution becomes

F_{y,x}(ε) = (1 − ε)F + ε∆_{y,x},

where F denotes the joint distribution of (Y, X) and ∆_{y,x} has all of its probability mass at (y, x) ∈ R^{p+1}, to allow for contamination in both the response and predictor spaces. We shall assume that F_{µ,Σ} = N_p(µ, Σ) and rank(H_x) = K so that PHD is capable of finding a complete basis for S. Let H denote the functional for the usual estimator of H_x. From [25], the influence function for H at F_{µ,Σ} is expressible in terms of w = Σ^{−1}(x − µ), the OLS slope vector b_OLS at F_{µ,Σ} and the expected value E(Y) of Y at F_{µ,Σ}. We are interested in the basis estimator for S and so set S = {1, …, K}, where each λ_i (i ∈ S) is a non-zero eigenvalue with corresponding eigenvector η_i ∈ S. We also have b_OLS ∈ S (see, e.g., [4; 5] or [22]) and H_x η_r = 0 (r ∈ S′), so that substituting the influence function into (5) yields the expression for ρ̃_S(H, F_{µ,Σ}; y, x) given in (9), where w_m = η_m^⊤ w.
The unboundedness of ρ̃_S(H, F_{µ,Σ}; y, x) is clearly evident, suggesting that some observational types may be highly destructive to the estimator. In the original PHD paper by Li [19], it was noted that the method can be highly sensitive to outlying observations in the response space. The model itself imposes distributional restrictions on the predictor (i.e. normality), meaning that outliers in the predictor space may be formally identified and subsequently treated. This is not the case for the response, where outlying observations may still contain important regression information. Indeed, ρ̃_S(H, F_{µ,Σ}; y, x) increases without bound as y is moved further from E(Y), suggesting that such observational types can be highly influential.
It is also interesting to note the types of observations that are not influential. For example, suppose that µ = 0 and Σ = I_p. Then, from (9), we have that ρ̃_S(H, F_{µ,Σ}; y, x) = 0 if x is an element of S. That is, even extreme outliers may not be influential.

Sample versions
In practice it is common to consider the effect of highly influential observations on sample estimates. A limitation, however, is the difficulty and inefficiency of locating such observations in large data sets. In this section we consider sample-based versions of the influence measure for the detection of influential observations. Let x_1, …, x_n denote a random sample of size n where each x_i ∈ R^p and let F_n denote the empirical distribution of this data. Similarly, let F_{n,(i)} denote the empirical distribution of the data without the ith observation, so that

F_{n,(i)} = (1 − ε)F_n + ε∆_{x_i} with ε = −1/(n − 1).

Throughout, all reference will be to data of this form, though in the regression setting one would need to consider observational pairs consisting of a predictor and a response. Sample-based versions of the influence function for the ith observation have been employed (see, e.g., [9]) where the contaminant is x_i and ε is replaced with −1/(n − 1). For P_S and P_{S,(i)} denoting the projection matrix estimates with and without the ith observation, the true sample version of ρ_S(W, F_{µ,Σ}; x) is then

r_S(W, F_n; x_i) = (n − 1)² [1 − trace{P_S P_{S,(i)}}/K].

Computation of r_S(W, F_n; x_i) for all n observations requires estimation at each of F_n, F_{n,(1)}, …, F_{n,(n)}. Such a process can be extremely inefficient when n is large or p is large (or both) due to the time it can take to carry out an eigen-analysis n + 1 times. An approximation to r_S(W, F_n; x_i) may be computed by replacing unknown population parameters in ρ̃_S(W, F_{µ,Σ}; x) with their respective estimates at F_n. This approximate influence value is, from (5),

r̃_S(W, F_n; x_i) = (1/K) Σ_{j∈S} Σ_{r∈S′} [ν̂_j^⊤ EIF(W, F_n; x_i) ν̂_r]² / (κ̂_j − κ̂_r)²,

where EIF(W, F_n; x_i) is the empirical influence function consisting of estimates at F_n in place of population parameter values in IF(W, F_{µ,Σ}; x), and the ν̂'s and κ̂'s are the eigenvector and eigenvalue estimates at F_n. When IF(W, F_{µ,Σ}; x) exists in a closed form, i.e. in terms of x and population parameters only, r̃_S(W, F_n; x_i) may be calculated for each observation after just one eigen-analysis at F_n.
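The leave-one-out quantity and its single-eigen-analysis approximation can be contrasted directly for the classical covariance estimator. In the sketch below, the scaling (n − 1)² and the empirical influence function EIF(C_0, F_n; x_i) = (x_i − x̄)(x_i − x̄)^⊤ − S follow the construction above, while the simulated data and the agreement check are purely illustrative.

```python
import numpy as np

def loo_influence(X, K):
    # Exact sample version: r_S(i) = (n-1)^2 [1 - trace{P_S P_S,(i)}/K],
    # requiring one eigen-analysis per left-out observation.
    n = len(X)
    def top_proj(Y):
        vals, vecs = np.linalg.eigh(np.cov(Y, rowvar=False))
        V = vecs[:, np.argsort(-vals)[:K]]
        return V @ V.T
    P = top_proj(X)
    return np.array([(n - 1) ** 2 * (1 - np.trace(P @ top_proj(np.delete(X, i, 0))) / K)
                     for i in range(n)])

def approx_influence(X, K):
    # Approximation from (5) with sample estimates: only off-diagonal
    # entries of the EIF in the eigenbasis are needed, and for the
    # covariance functional these are products of principal component scores.
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(-vals)
    vals, vecs = vals[order], vecs[:, order]
    Y = (X - X.mean(axis=0)) @ vecs          # scores y_ji = eta_j'(x_i - xbar)
    p = X.shape[1]
    return np.array([sum((Y[i, j] * Y[i, r]) ** 2 / (vals[j] - vals[r]) ** 2
                         for j in range(K) for r in range(K, p)) / K
                     for i in range(len(X))])
```

The approximation needs only one eigen-analysis, and for moderate n its values should track the exact leave-one-out measure closely across observations.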
In the next section we will highlight the usefulness of this approximation in the context of computation time.

Sample principal components of the classical covariance matrix estimator: A microarray application
In this section we consider the colon tumor microarray data set [1]. For simplicity we consider the first K estimated eigenvectors of the sample covariance matrix estimate S. Computation of r_S(W, F_n; x_i) for even just a few observations can be extremely inefficient for high-dimensional data sets, and if r_S(W, F_n; x_i) is to be calculated for all observations then such an analysis may become extremely onerous. We will therefore consider efficiently approximating r_S(W, F_n; x_i) with r̃_S(W, F_n; x_i). All results were obtained using R version 2.5.1 and the R function eigen for the eigen-analysis, which utilizes LAPACK routines (see [2]) for computation. An Intel Pentium D CPU 3.60GHz with 1.99GB of RAM was used for the analysis.
The colon tumor microarray data set consists of gene expression measurements for 2000 genes corresponding to n = 62 samples. Each sample is classified as either a 'normal tissue' sample or a 'tumor tissue' sample. Each of the 40 individuals in the study has an associated 'tumor tissue' sample, and 22 of the individuals also have a 'normal tissue' sample. We consider the normalized data where each sample is standardized to have mean 0 and standard deviation 1. Although this is a subset of a larger data set consisting of 6500 genes, most statistical research has concentrated on just these 2000 genes. We chose this data set since it is often used in discriminant analysis, yet classical methods are not immediately applicable due to the singularity of S. As such, methods such as PCA may be used to initially reduce the dimension of the predictor space. For more information regarding this data set see [1].
From (6), the approximation is

r̃_S(C_0, F_n; x_i) = (1/K) Σ_{j∈S} Σ_{k∈S′} y_ji² y_ki² / (λ_j − λ_k)²,     (12)

where y_ji = η_j^⊤(x_i − x̄), x̄ is the sample mean and η_1, …, η_p are the sample eigenvectors corresponding to the sample eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_p of S. Potential efficiency problems may still exist when the number of loop repetitions is large. A total of K × (p − K) iterations is required for the computation of a single r̃_S(C_0, F_n; x_i), so that the total number of iterations required for the computation of r̃_S(C_0, F_n; x_1), …, r̃_S(C_0, F_n; x_n) is n × K × (p − K). For example, if we consider n = 62, p = 2000 and let K = 10, then the total number of iterations amounts to 1,239,380. However, this can be greatly reduced when p ≫ n by noting that, since rank(S) ≤ n − 1 gives η_k^⊤ S η_k = 0 for k > n − 1, we have y_ki = 0 for all i = 1, …, n when k = n, …, p. Hence

r̃*_S(C_0, F_n; x_i) = (1/K) Σ_{j∈S} Σ_{k=K+1}^{n−1} y_ji² y_ki² / (λ_j − λ_k)²,     (13)

which requires just K(n − 1 − K) iterations. Again, for n = 62 and p = 2000, the total number of iterations required for a choice of K = 10 is only 31,620; just 2.55% of the iterations required for (12).
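The rank argument can be checked directly: when p > n − 1 the scores on eigenvectors beyond the (n − 1)th are numerically zero, so truncating the inner sum at n − 1 leaves the measure unchanged. A small sketch (with illustrative dimensions rather than the microarray data):

```python
import numpy as np

def score_influence(Ysc, lam, K, inner):
    # Approximate influence from scores Ysc (n x p) and eigenvalues lam,
    # with the inner loop restricted to the index set `inner`.
    return np.array([sum((Ysc[i, j] * Ysc[i, k]) ** 2 / (lam[j] - lam[k]) ** 2
                         for j in range(K) for k in inner) / K
                     for i in range(len(Ysc))])

rng = np.random.default_rng(3)
n, p, K = 8, 30, 2
X = rng.normal(size=(n, p))
lam, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(-lam)
lam, vecs = lam[order], vecs[:, order]
Ysc = (X - X.mean(axis=0)) @ vecs            # scores y_ki on all p eigenvectors
full = score_influence(Ysc, lam, K, range(K, p))         # K(p - K) inner terms
reduced = score_influence(Ysc, lam, K, range(K, n - 1))  # K(n - 1 - K) inner terms
```

Since the sample covariance matrix has rank at most n − 1, lam[k] and the scores Ysc[:, k] vanish (up to rounding) for k ≥ n − 1, and full and reduced agree while the reduced version uses far fewer iterations.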
In Table 1 we provide the time in seconds taken to compute all of the r_S(W, F_n; x_i)'s and the approximations using the r̃_S(W, F_n; x_i)'s and r̃*_S(W, F_n; x_i)'s. To highlight how well the approximation can be used to detect influential observations we also include the Spearman rank correlations between the r_S(W, F_n; x_i)'s and r̃*_S(W, F_n; x_i)'s. It is immediately evident that much time can be saved when using the approximations. For example, for K = 2 the true computation cost is 2951.24 seconds compared to just 29.37 seconds (or around 1% of the time) using the r̃*_S(W, F_n; x_i)'s. The high Spearman rank correlation of 0.993 also indicates that r̃*_S(W, F_n; x_i) is an excellent indicator of influential observations. High correlations of over 0.9 are maintained for all choices of K up to and including K = 15. It is also clear that calculating the approximation based on the fewer loop iterations given in (13) is beneficial, with all computations coming in under 30 seconds, compared to computation times approaching 60 seconds for larger choices of K when (12) was utilized.
Further evidence of how close the approximation is to the true sample influence measure is provided in Figure 2. Here we plot the r_S(W, F_n; x_i)'s versus the r̃*_S(W, F_n; x_i)'s for all observations and some choices of K. For K = 1 and 3 there is little difference between the true and approximate values. For K = 6 and 12, although there are obvious differences for some observations, the approximation is still highly successful in highlighting influential observations.

Discussion
In this paper we considered sensitivity analysis of subsets of eigenvectors based on perturbation of the GCD and RV measures. Examples were provided in Section 3 to show how this analysis complements existing influence studies.
In Section 5 we highlighted how an approximate sample version of the measure can be used to efficiently detect influential observations in practice. The data set considered consisted of 2000 measurement variables for 62 individuals, so that a leave-one-out sensitivity analysis requiring repeated principal component estimation was computationally inefficient. However, the approximate version tracked the true sample measure closely when the subset of eigenvectors was not too large, and was much less time-consuming to compute, therefore offering a useful means to assess influence for large data sets.