Theoretical properties of Cook’s PFC dimension reduction algorithm for linear regression

We analyse the properties of the Principal Fitted Components (PFC) algorithm proposed by Cook. We derive theoretical properties of the resulting estimators, including sufficient conditions under which they are √n-consistent, and explain some of the simulation results given in Cook's paper. We use techniques from random matrix theory and perturbation theory. We argue that, under Cook's model at least, the PFC algorithm should outperform the Principal Components algorithm.


Introduction and notation
Cook [1] gives an extensive account of the history and controversy, dating back at least to the work of Fisher [4], surrounding the question of whether the selection of regression variables should be based only on values of the predictor, or whether the response should also be taken into account. For example, reduction by Principal Components (PC), introduced by Hotelling [7] and reviewed by, for example, Seber [12], uses only the sample covariance matrix of the predictors to select the variables, ignoring the values of the response.
Cook [1] argues that, while PC can play a useful role in regression problems, the analysis can also take account of the random response Y, and introduces the Principal Fitted Components (PFC) algorithm to do this. This algorithm performs a principal components analysis on the fitted sample covariance matrix obtained by projecting the predictor onto f_y, a function of the response. Efficient selection of variables can allow regression algorithms to operate in a space of reduced dimension, making the resulting estimates potentially more accurate.
In this paper, we analyse the properties of the PFC algorithm, in the context of a model given by Equation (5) of Cook [1], following the notation of [1] throughout. Cook converts the more familiar forward linear model Y = Xβ + ε into an equivalent inverse regression form:

Example 1.1. Consider a sample of n independent observations X_y in R^p, indexed by the responses y and generated as

X_y = µ + Γβf_y + σǫ.    (1)

Here, µ and each X_y are p × 1 matrices (column vectors), Γ is a full rank p × d matrix (with d < p), β is a d × r matrix (with d ≤ r) and f_y is an r × 1 matrix. The parameter σ gives the scale factor of ǫ, a p × 1 matrix of errors. The entries of ǫ will often (but not always) be assumed to be independent standard normals. We will estimate the span of the columns of the rank d matrix Γ, where by assumption Γ^T Γ = I_d (we can reparameterize the model to achieve this). Having estimated Γ, we can work in a space of smaller dimension, using whatever techniques are appropriate.
The f_y is a vector-valued function of the random response Y, and for simplicity we assume throughout that Σ_y f_y = 0. Further, we assume that ǫ is independent of Y, and hence of f_y. (Note that we use different symbols, ǫ and ε, for the error terms in Equation (1) and the forward regression Y = Xβ + ε. We do not claim that ε is independent of Y.) The f_y can be constructed in a variety of ways, depending on the exact form of the data. For example, Cook mentions that if the conditional mean E(X|Y = y) can be modelled by a polynomial of degree r, then it is appropriate to take f_y = (y − m_1, y² − m_2, . . ., y^r − m_r)^T, where m_u is the sample mean of the uth power of y. Alternatively, in the spirit of the Sliced Inverse Regression algorithm of Li [10], f_y can be constructed by slicing the range of Y into (r + 1) disjoint bins.
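As a concrete illustration, the inverse model (1) is straightforward to simulate. The following sketch (in Python with NumPy; the parameter values are hypothetical, chosen only to echo the setting of Example 2.5 below) generates n observations with the simplest choice f_y = y − ȳ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values (not from the paper): p = 10 predictors,
# a d = 1 dimensional subspace, r = 1 with f_y = y - ybar.
n, p, d, r = 200, 10, 1, 1
sigma, sigma_Y = 0.5, 1.0

mu = np.zeros(p)
Gamma = np.linalg.qr(rng.standard_normal((p, d)))[0]  # ensures Gamma^T Gamma = I_d
beta = np.ones((d, r))

y = rng.normal(0.0, sigma_Y, size=n)
F = (y - y.mean()).reshape(n, r)       # rows f_y^T, centred so that sum_y f_y = 0
E = rng.standard_normal((n, p))        # rows eps^T, independent standard normals

# Rows of X are X_y^T = (mu + Gamma beta f_y + sigma*eps)^T, as in Equation (1)
X = mu + F @ beta.T @ Gamma.T + sigma * E
```

The QR factorization enforces the identifiability convention Γ^T Γ = I_d, and centring y enforces the assumption Σ_y f_y = 0.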
For ease of calculation we convert the model from Example 1.1 into matrix form. We write X for the n × p centred matrix of predictors, with rows (X_y − X̄)^T (where X̄ is the sample mean of the X_y), write F for the n × r matrix with rows f_y^T, and E for the n × p error matrix with rows σǫ^T. Given a full rank matrix G, we write P_G = G(G^T G)^{−1} G^T for the matrix which projects orthogonally onto the span of the columns of G.

Definition 1.2. Cook [1] defines the fitted matrix of predictors X̂ = P_F X and proposes the PFC algorithm: an estimate Γ̂_PFC of Γ is given by the set of d eigenvectors of the fitted sample covariance matrix X̂^T X̂ which correspond to the largest eigenvalues.
Cook contrasts this with the PC algorithm of Hotelling [7], which performs the corresponding calculation for the sample covariance matrix X^T X. That is, an estimate Γ̂_PC of Γ is given by the set of d eigenvectors of X^T X corresponding to the largest eigenvalues. Note that we refer to Γ̂_PFC and Γ̂_PC as estimates of Γ for the sake of brevity; in fact, the spans of the columns of Γ̂_PFC and Γ̂_PC form estimates of the span of the columns of Γ.
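In code, both estimators reduce to an eigendecomposition of the appropriate covariance matrix. A minimal sketch (the function names are ours, not Cook's):

```python
import numpy as np

def pfc_directions(X, F, d):
    """PFC estimate of Gamma: the d top eigenvectors of the fitted sample
    covariance Xhat^T Xhat, where Xhat = P_F X (Definition 1.2)."""
    Xc = X - X.mean(axis=0)                      # centred predictors
    Fc = F - F.mean(axis=0)
    P_F = Fc @ np.linalg.solve(Fc.T @ Fc, Fc.T)  # orthogonal projection onto span(F)
    Xhat = P_F @ Xc
    vals, vecs = np.linalg.eigh(Xhat.T @ Xhat)   # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :d]

def pc_directions(X, d):
    """PC estimate of Gamma: the d top eigenvectors of the sample covariance X^T X."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)
    return vecs[:, ::-1][:, :d]
```

With a strong signal and small σ, both routines recover a basis whose span is close to that of Γ; the differences between them in noisier regimes are the subject of Section 5.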
Theorem 2.4 of this paper gives a distributional result for the accuracy of PFC estimation. This allows us to explain the simulations relating to PFC estimation in Section 5 of Cook's paper [1], and to bound confidence intervals for the accuracy of the estimation; see Theorem 3.3 below. In Section 3 we give sufficient conditions for Γ̂_PFC to be a √n-consistent estimator of Γ, in the case d = r. In Section 4, we use results from perturbation and random matrix theory to consider the more general case d ≤ r and consider the order of the errors that arise. In Theorem 4.3, we give sufficient conditions for Γ̂_PFC to be a √n-consistent estimator of Γ in this more general case. Model (1) represents a problem in dimension reduction. The PC model is given as Equation (2) of [1] as

X_y = µ + Γν_y + σǫ,    (2)

where the d × 1 vector ν_y satisfies Σ_y ν_y = 0. Hence the model (1) is a special case of (2), where the βf_y replaces ν_y. As Cook remarks, in the PFC model (1) we aim to estimate β, which contains dr parameters, whereas the PC model (2) contains the (n − 1)d parameters of ν_y. Hence, the PFC model has the attractive feature that the number of parameters to be estimated does not grow with n.
In Section 5 we argue that the PFC algorithm should perform strictly better than the PC algorithm, if the errors ǫ are normally distributed. Specifically, Lemma 5.1 shows that in this case the sample covariance matrix X^T X is the fitted covariance matrix X̂^T X̂ perturbed by random noise independent of X̂ and Γ. Hence inference about Γ using X (PC algorithm) will necessarily be less accurate than inference using X̂ (PFC algorithm). Proposition 5.4 gives bounds in certain parameter regimes which imply that the error of the PC estimate Γ̂_PC is also O_P(1/√n), approximately explaining the simulations relating to PC estimation in Section 5 of [1]. All the proofs of this paper are presented in Appendix A.
It is of course important to consider the validity of the model given by Equation (1), in order to understand the significance of these results. The key is the reversal of the conditional distribution for Y|X implied by the linear model Y = Xβ + ε to give a conditional distribution for X|(Y = y), as in Equation (2). Such a reversal is not new, arising for example in the work of Oman [11]. If the X and ε are each multivariate normal, then (X, Y) will be jointly multivariate normal, and we can move between Y = Xβ + ε and Equation (2) by parameterizing appropriately. The question of the validity of the full PFC model, Equation (1), amounts to the issue of how accurately the βf_y models ν_y. Presumably for real data there will be a trade-off between improved accuracy arising from PFC estimation and errors introduced by βf_y not adequately explaining ν_y. Implicitly, model (2) requires that the X should be random and take continuous values. Hence X should come from measurements rather than designed experiments, and the case of factor variables is excluded.
In future work, we hope to explain the performance (in the sense of Mean Squared Error) of the prediction errors shown in Figure 1(d) of Cook's paper [1], and to investigate the theoretical properties of the PFC PC algorithm described in Section 6 of [1].

Theoretical PFC performance when d = r
In this section, we analyse the performance of the PFC algorithm in the case d = r. In Lemma 2.2 we give explicit expressions for the matrices involved, and define a matrix V which gives the span of the PFC directions. In Lemma 2.3, we deduce distributional results for V, which we use to prove Theorem 2.4. Finally, we show how Theorem 2.4 explains some of the simulation results given in Section 5 of [1].
To measure the accuracy of our estimates, we consider the distribution of a normalised version of the quantity m(Γ̂, Γ) = ||(I_p − P_Γ)Γ̂|| used, for example, by Xia, Tong, Li and Zhu [15]. We have a choice of which matrix norm ||•|| to use; we shall use the Frobenius norm ||•||_F, defined by

||A||²_F = Σ_k |A^(k)|²,

where A^(k) is the kth column of A and |•| represents the vector norm. This choice of matrix norm has the attractive feature that ||A||_F = ||AP||_F for orthogonal P, so that we can make orthogonal changes of basis without affecting m(Γ̂, Γ).
Definition 2.1. For true Γ and estimated value Γ̂, we define

C(Γ̂, Γ) = ||P_Γ Γ̂||²_F / ||(I_p − P_Γ)Γ̂||²_F.

In the case d > 1, the quantity C(Γ̂, Γ) will change on rescaling columns of the matrix Γ̂ by different amounts (that is, on multiplying Γ̂ by a diagonal D not proportional to I). We do not have a priori estimates for the true scalings of the eigenvectors, so we use the scalings that arise from orthogonal transformations of the columns of a matrix V arising from matrix factorization (see Lemma 2.2). Lemma 2.3 shows that this choice has the attractive feature that the entries of V have equal variance. There may exist better scalings in the sense of reduced average angle; however, this choice already has good properties, as the theorems of this paper show.

Lemma 2.2. Under model (1) from Example 1.1, the span of the eigenvectors corresponding to the r largest eigenvalues of X̂^T X̂ is identical to the span of the columns of the p × r matrix V defined by

V = X̂^T F(F^T F)^{−1/2}.    (3)

In particular, in the case where d = r these columns define the PFC space. Note that V can be found from known information and without calculating eigenvalues, by evaluating the right-hand side of Equation (3) directly.

The proofs of all the results of this paper are given in Appendix A. Next we give distributional results for V, by conditioning on the values of F.

Lemma 2.3. Assume that the errors ǫ have mean zero and are uncorrelated with variance 1. Conditioned on the values of F, the entries of V have mean given by the entries of Γβ(F^T F)^{1/2} and variance σ², and are uncorrelated.

Lemma 2.3 implies that if the errors ǫ are independent standard normals, then the columns of V are independent multivariate normals, with means given by the columns of Γβ(F^T F)^{1/2} and with covariance matrix σ²I_p. Hence the p × p matrix X̂^T X̂ = VV^T has a non-central Wishart distribution with r degrees of freedom, scale parameter σ²I_p and non-centrality parameter Γβ(F^T F)β^T Γ^T. This allows us to deduce a distributional result for C.
Theorem 2.4. Under the model given by Equation (1), assume that the errors ǫ are independent standard normals, and consider the PFC estimator Γ̂_PFC. In the case d = r, conditional on F^T F, the term C(Γ̂_PFC, Γ) is distributed as d/(p − d) times a non-central F random variable, with degrees of freedom rd and r(p − d), and non-centrality parameter given by the scaled trace tr(β(F^T F)β^T)/σ².

The quantity C(Γ̂, Γ) measures the proportion of the magnitude of the estimate Γ̂ which lies in the span of the columns of Γ, and hence measures how good an estimate of the span of Γ is provided by Γ̂. In the case r = d = 1, this is compatible with Cook's plots of the angle Θ(Γ̂, Γ) between true Γ and estimated Γ̂, in the sense that for any Γ and Γ̂ we have C(Γ̂, Γ) = cot² Θ(Γ̂, Γ).
Section 6 of Li, Zha and Chiaromonte [9] introduces a different measure of similarity of subspaces, namely ||P_Γ̂ − P_Γ||, where ||•|| represents the operator norm.
In the case d = r = 1, this again is compatible with the angle Θ(Γ̂, Γ), in the sense that in this case ||P_Γ̂ − P_Γ|| = |sin Θ(Γ̂, Γ)|. The use of the operator norm means that the measure of [9] represents 'worst case' performance, whereas using the Frobenius norm gives 'average' performance. Our techniques do not at present give distributional results for ||P_Γ̂ − P_Γ||; however, Frobenius norms are typically easier to calculate than operator norms.
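Both accuracy measures are straightforward to compute. A minimal sketch, assuming Γ̂ and Γ are supplied with orthonormal columns:

```python
import numpy as np

def C_measure(Gamma_hat, Gamma):
    """Frobenius-norm measure of Definition 2.1:
    C = ||P_Gamma Gamma_hat||_F^2 / ||(I - P_Gamma) Gamma_hat||_F^2."""
    P = Gamma @ Gamma.T                      # P_Gamma = Gamma Gamma^T since Gamma^T Gamma = I_d
    num = np.linalg.norm(P @ Gamma_hat, 'fro') ** 2
    den = np.linalg.norm(Gamma_hat - P @ Gamma_hat, 'fro') ** 2
    return num / den

def operator_measure(Gamma_hat, Gamma):
    """Operator-norm measure ||P_hat - P|| of Li, Zha and Chiaromonte [9]."""
    P_hat = Gamma_hat @ Gamma_hat.T
    P = Gamma @ Gamma.T
    return np.linalg.norm(P_hat - P, 2)      # spectral (operator) norm
```

For d = r = 1 these reduce to cot² Θ and |sin Θ| respectively, which gives a quick sanity check on an implementation.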
Theorem 2.4 shows that the distribution of C(Γ̂_PFC, Γ) does not depend on the value of Γ itself, helping to explain Cook's remark [1, p. 10] that "the value of principal component estimators does not rest solely with the presence of collinearity". Further, using Theorem 2.4, we can better understand the simulation graphs given in Section 5 of Cook [1].
Example 2.5. Figure 1 of [1] considers the model given by Equation (1) in the case where p = 10, d = r = 1 and β = 1. Further, the errors ǫ are normally distributed and Y is normal with variance σ²_Y, so that βF^T Fβ^T ∼ σ²_Y χ²_{n−1}. In Figure 1 we simulate directly from the distribution given by Theorem 2.4 and vary the parameters n, σ and σ_Y. Based on a sample of size 50,000 for each set of parameter values, we plot the sample mean as •, along with upper and lower 5% sample quantiles (plotted as △ and +). The sample means shown here in Figures 1(a)-1(c) fit closely with those in Figures 1(a)-1(c) of [1].
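The hierarchical simulation behind such plots can be sketched directly from Theorem 2.4. In the setting of Example 2.5 (p = 10, d = r = 1, β = 1; the function name below is ours), conditional on F^T F ∼ σ²_Y χ²_{n−1}, the squared cotangent C is a ratio of a non-central χ² to an independent central χ²:

```python
import numpy as np

rng = np.random.default_rng(0)

def pfc_angle_sample(n, sigma, sigma_Y, p=10, reps=50_000):
    """Draw angles Theta(Gamma_PFC, Gamma) in degrees, for d = r = 1, beta = 1."""
    # beta F^T F beta^T ~ sigma_Y^2 chi^2_{n-1} when Y is normal and f_y = y - ybar
    ftf = sigma_Y ** 2 * rng.chisquare(n - 1, size=reps)
    # Theorem 2.4: N/sigma^2 is non-central chi^2_{rd} with parameter ftf/sigma^2,
    # D/sigma^2 is central chi^2_{r(p-d)}, and C = N/D = cot^2 Theta
    N = rng.noncentral_chisquare(1, ftf / sigma ** 2)
    D = rng.chisquare(p - 1, size=reps)
    return np.degrees(np.arctan(np.sqrt(D / N)))

angles = pfc_angle_sample(n=100, sigma=0.5, sigma_Y=1.0)
mean_angle = angles.mean()                  # compare with Figure 1(a)-(c)
lo, hi = np.quantile(angles, [0.05, 0.95])  # 5% and 95% sample quantiles
```

The random non-centrality parameter drawn in the first step is exactly the "complicated hierarchical form of mixture distribution" discussed in the next section.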

√n-consistency of PFC estimates

The quantiles plotted in Figure 1 give some idea of the spread of likely values of the angle Θ(Γ̂_PFC, Γ). Theorem 2.4 shows that the squared cotangent C(Γ̂_PFC, Γ) is a scalar multiple of a non-central F distribution with random non-centrality parameter, a complicated hierarchical form of mixture distribution, meaning that it is not trivial to give confidence intervals for the angle Θ(Γ̂_PFC, Γ) in closed form. In the case d = r we give probabilistic bounds in Theorem 3.3, for a more general class of error models than simply assuming normality. For simplicity of exposition, we restrict to the case where the distributions of ǫ are symmetric, though this is not necessary for √n-consistency to hold; see Appendix A for details. First we give two technical results, Lemmas 3.1 and 3.2, concerning the entries of the matrix V introduced in Lemma 2.2 and the matrix β(F^T F)β^T respectively.
In the case where the errors ǫ are standard normal, Lemma 3.1 simplifies, since the C_ik become independent and identically distributed normals. Thus D/σ² is central χ² with r(p − d) degrees of freedom. Similarly, N/σ² is non-central χ² with rd degrees of freedom and non-centrality parameter tr(β(F^T F)β^T)/σ², and T = 2rσ⁴, with equality holding in the bounds on Var(N) and Var(D) in Equations (4) and (5).
We will write λ_1(X) ≥ λ_2(X) ≥ . . . ≥ λ_p(X) for the ordered sequence of eigenvalues of a real symmetric p × p matrix X.

Lemma 3.2. There exists a sequence (φ_i : 1 ≤ i ≤ d) such that, given ǫ > 0 and δ > 0, there exists n* = n*(δ, ǫ) such that, with probability at least 1 − ǫ, the eigenvalues satisfy |λ_i(β(F^T F)β^T)/n − φ_i| < δ for all n ≥ n*.

Theorem 3.3. In the case where d = r, if the errors ǫ are independent and symmetric with variance 1 and finite 4th moment, then we can construct confidence intervals [Θ*_−(α), Θ*_+(α)] for the angle Θ(Γ̂_PFC, Γ) such that, for any fixed α, both endpoints are O(1/√n).

Using Theorem 3.3, we can consider the case analysed in Example 2.5 and Figure 1, where β = 1, r = d = 1, p = 10 and the errors ǫ and Y are normal. In this case, (26) and (27) below show that confidence intervals for the angle Θ(Γ̂_PFC, Γ) decay asymptotically as c/√n, c/σ_Y or cσ respectively, as the other terms are kept constant, as Figure 1 may suggest.

Random matrices and perturbation
In this section, we analyse the general case d ≤ r of the model given by Equation (1) under the assumption that the errors ǫ are Gaussian, using a perturbation argument. We first review some results from random matrix theory, which we use to deduce the order of Θ(Γ̂_PFC, Γ) in Theorem 4.3.

Proposition 4.1. Write X_{u,v} for a u × v matrix with entries that are independent standard Gaussians, and consider the largest eigenvalue λ_1(X_{u,v} X^T_{u,v}) of the Wishart matrix X_{u,v} X^T_{u,v}.
1. For a sequence of matrices X_{u_v,v}, in the regime where v → ∞ and u_v/v → γ, λ_1(X_{u_v,v} X^T_{u_v,v})/v → (1 + √γ)² almost surely.
2. In the same regime, suitably centred and scaled, λ_1 converges in distribution to F_1, where F_1 is the Tracy-Widom law of order 1.
3. For any u and v and for any t > 0,

P(λ_1(X_{u,v} X^T_{u,v}) ≥ (√u + √v + t)²) ≤ exp(−t²/2).    (8)

Part 1 of Proposition 4.1 is due to Geman [6], who does not require that the entries of the matrix X_{u,v} be Gaussian (finiteness of moments of all orders will suffice). However, Equation (8) does require Gaussian entries, which is the reason for the assumption that ǫ has Gaussian entries in the rest of this paper. Johnstone [8] proved Part 2, giving simulation results suggesting that this approximation is accurate for u_v, v as small as 5. Part 3 is implied by Theorem II.13 of Davidson and Szarek [3].
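A quick numerical check of Part 1 (a sketch with arbitrary dimensions, not taken from the paper): for a u × v standard Gaussian matrix, the largest eigenvalue of X_{u,v} X^T_{u,v} should concentrate near v(1 + √(u/v))² = (√u + √v)².

```python
import numpy as np

rng = np.random.default_rng(0)

u, v, reps = 50, 200, 20
top = np.empty(reps)
for k in range(reps):
    X = rng.standard_normal((u, v))
    # eigvalsh returns eigenvalues in ascending order; take the largest
    top[k] = np.linalg.eigvalsh(X @ X.T)[-1]

geman_limit = (np.sqrt(u) + np.sqrt(v)) ** 2   # Geman's almost-sure limit
ratio = top.mean() / geman_limit               # should be close to 1
```

The O(v^{1/3})-scale fluctuations around this limit are exactly the Tracy-Widom behaviour described in Part 2.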
We combine the random matrix results of Proposition 4.1 with a perturbation argument based on that given by Sibson [13] and used by Critchley [2] in a related context. As in Sibson [13], the perturbed eigenvectors are given in terms of generalized inverse matrices M⁺. If Mx = λx, the linear map M⁺ is defined using the property that M⁺x = λ^{−1}x when λ ≠ 0, and M⁺x = 0 when λ = 0. To motivate the proof of √n-consistency in Theorem 4.3, we change basis to the orthogonal set {b^(i)} used in the proof of Theorem 2.4. Equations (17) and (19) below mean that in this new basis, X^T X = (U + σS)(U + σS)^T. Here UU^T has a d × d block of the form β(F^T F)β^T with the remaining entries being zero, and S is a p × r matrix whose entries are independent standard Gaussians.
Hence if σ = 0 then X^T X has d positive eigenvalues with eigenvectors lying in the space spanned by the columns of Γ, and the remaining eigenvalues are zero. Using perturbation theory, we bound how large σ would have to be before one of the zero eigenvalues could become one of the d largest ones.

Definition 4.2. For σ ≠ 0, we take B = UU^T and L = σ(SU^T + US^T) + σ²SS^T, so that X^T X = B + L. We define the first level crossing event L_1. Hence, if L_1 does not occur, the d largest eigenvalues of X^T X correspond to the perturbed values of the original eigenvalues of β(F^T F)β^T. Proposition 4.1 gives probabilistic bounds on λ_1(L) − λ_p(L), allowing us to control P(L_1).
Theorem 4.3. In the case d ≤ r, if the errors ǫ are independent standard normals and the limiting matrix Φ = lim_{n→∞} (β(F^T F)β^T)/n has distinct eigenvalues, then there exist confidence intervals such that P(Θ(Γ̂_PFC, Γ) ≤ Θ*(α)) ≥ 1 − α, where for any fixed α, Θ*(α) = O(1/√n).

Theoretical performance of the PC algorithm
Finally, we discuss the PC algorithm, arguing in Lemma 5.1 that, in the case of normal errors ǫ, it will give inferior performance to the PFC algorithm. Proposition 5.4 gives bounds on the performance of the PC algorithm, explaining the simulation results of Cook [1].
Lemma 5.1. Under the model given by Equation (1), if the errors ǫ are normally distributed, then we can write X^T X = X̂^T X̂ + N, where N is independent of X̂ and Γ.
This result allows us to argue that PFC estimation should out-perform PC estimation, in several senses. Firstly, inference about a random variable X through a random variable Y is better (in the sense of Minimum Mean Squared Error) than inference about X through Y + N, if N is independent of X. This follows since the best estimates are f(Y) = E(X|Y) and g(Y + N) = E(X|Y + N) respectively. The fact that the MMSE is lower in the first case is equivalent to the fact that E g(Y + N)² ≤ E f(Y)², which follows by the conditional Jensen inequality. Similarly, the conditional entropy H(X|Y) ≤ H(X|Y + N), showing that there is less uncertainty about X on learning Y than on learning Y + N.
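A toy numerical illustration of this argument (not from the paper): for jointly Gaussian, mean-zero variables the conditional means are linear, and adding independent noise N to the observation can only increase the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 200_000
X = rng.standard_normal(m)
Y = X + 0.5 * rng.standard_normal(m)   # noisy observation of X
N = rng.standard_normal(m)             # independent extra noise

def best_linear_mse(obs, target):
    """MSE of the best linear predictor a*obs of target; this is the MMSE
    predictor in the jointly Gaussian, mean-zero case."""
    a = np.cov(obs, target)[0, 1] / obs.var()
    return np.mean((target - a * obs) ** 2)

mse_clean = best_linear_mse(Y, X)       # theoretical value 1 - 1/1.25 = 0.2
mse_noisy = best_linear_mse(Y + N, X)   # theoretical value 1 - 1/2.25 = 0.556
```

The gap between the two errors is exactly the information lost by observing Y + N rather than Y, mirroring the entropy inequality above.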
We use similar techniques to those in Section 4 to bound the PC angle Θ(Γ̂_PC, Γ). We regard the term N as a perturbation of order σ² of the fitted sample covariance matrix X̂^T X̂, and thus regard Γ̂_PC as a perturbation of Γ̂_PFC.

Definition 5.2. Writing λ_i(X) for the ith ordered eigenvalue of X, we define

M = min_{1 ≤ i ≤ r} (λ_i(X̂^T X̂) − λ_{i+1}(X̂^T X̂))    (9)

for the minimum level spacing (this includes the spacing between zero and the non-zero eigenvalues, since λ_{r+1}(X̂^T X̂) = 0). We define the second level crossing event by L_2 = {λ_1((X − X̂)^T(X − X̂)) ≥ M}.

Lemma 5.3. If the errors ǫ are independent standard normals, and if √M/σ > √n + √p, then the probability P(L_2) ≤ exp(−(√M/σ − √n − √p)²/2). If L_2 does not occur, then there are exactly r eigenvalues of X^T X larger than λ_r(X̂^T X̂), so the r largest eigenvalues must correspond to the perturbed values of the original.
We now prove bounds on the angle Θ(Γ̂_PC, Γ̂_PFC) between the PC and PFC directions, where C(Γ̂_PC, Γ̂_PFC) = cot² Θ(Γ̂_PC, Γ̂_PFC) as before. Since M and (X − X̂)^T(X − X̂) are O_P(n), this angle is again of the order O_P(1/√n).
Proposition 5.4. In the case d = r, if the errors ǫ are independent standard normals, then the conditional expectation of C(Γ̂_PC, Γ̂_PFC) given the PFC directions can be bounded in terms of M, where, as in Equation (9), M is the minimum eigenvalue spacing of X̂^T X̂.
Using Proposition 5.4, we continue to explain the simulation graphs given in Section 5 of Cook [1], considering the means of the angles Θ(Γ̂_PC, Γ). Recall that Example 2.5, based on Figure 1 of [1], considers normally distributed errors ǫ with p = 10, d = r = 1, β = 1, and varies the parameters n, σ and σ_Y. For simplicity, we replace some terms by their asymptotic limits to obtain more heuristic results. We recall that βF^T Fβ^T ∼ σ²_Y χ²_{n−1}, so that asymptotically we can take M = nσ²_Y. Similarly, we use the asymptotics given by Geman [6] and replace λ_1((X − X̂)^T(X − X̂)) by σ²n. Combining Lemma 5.3 and Proposition 5.4, writing C for arctan(∞) (so that C = π/2 radians, or 90° for the graphs plotted), gives the bound (11).
We ignore the first term (since it decays exponentially fast in n if σ²_Y > σ²). In the case r = d = 1, angles satisfy a triangle inequality: that is, if the angle between the vectors Γ̂_PFC and Γ is θ_1, and the angle between Γ̂_PC and Γ̂_PFC is θ_2, then the angle between Γ̂_PC and Γ is at most θ_1 + θ_2. This means that, conditional on Θ(Γ̂_PFC, Γ), the angle between the PC directions and the true Γ is bounded above by Θ(Γ̂_PFC, Γ) + Θ(Γ̂_PC, Γ̂_PFC), where the distribution of C = cot² Θ(Γ̂_PFC, Γ) is given in Theorem 2.4. In each graph in Figure 2, we plot four curves, as follows:
1. We simulate directly from the distributions given by Theorem 2.4 and plot the sample mean of the PFC angle Θ(Γ̂_PFC, Γ) as •, as in Figure 1.
2. Next, we plot the sample mean of the PC angle (based on direct simulation of 2500 samples from the underlying distribution) as ×.
3. We plot the bound given by the right-hand side of (11) as +, noting that it is only useful when σ²_Y > σ².
4. Finally, in the spirit of Sibson [13], we use △ to plot the angle corresponding to the leading term in σ in the power series expansion in Equation (36) of the proof of Proposition 5.4, that is, Equation (12).
Providing a more complete description of the distribution of Θ(Γ̂_PC, Γ) than that given in Proposition 5.4, and extending such results to the case d < r, would require more advanced results from random matrix theory. However, the results shown in Figure 2 give a good explanation of Figure 1 of [1], and indicate parameter regions where the PFC and PC algorithms give close or differing results. In particular, the similarity of the plots in Figure 2 given by × and △ shows that Equation (12) provides an accurate approximation to the PC performance. (In fact, (12) should approximately give an upper bound on the performance of this algorithm, because of the Jensen inequality used in Equation (37) of the proof of Proposition 5.4. However, it appears that the resulting distribution is sufficiently concentrated around its mean that the Jensen approximation is asymptotically accurate.)
Since the bound given by (11) is only valid where σ²_Y > σ², we cannot plot it for all ranges of parameters considered by Cook: not at all in Figure 2(a), and only in a small region in Figure 2(c). We provide an extra series, Figure 2(d), where σ²_Y = 2, to show the dependence of this bound on n.

Appendix A: Proofs
Proof of Lemma 2.2. As in [1], we write 1_n for an n × 1 matrix of 1s, so that if X is an n × p matrix with rows given by X^T_y then (1/n) 1_n 1^T_n X gives a matrix with rows all equal to the sample mean X̄. The assumption that Σ_y f_y = 0 means that 1^T_n F = 0, which implies that P_{1_n} and P_F represent projections onto orthogonal subspaces. Hence the matrix X defined to have rows (X_y − X̄)^T can be expressed as (I_n − (1/n) 1_n 1^T_n)X or, in terms of the quantities in Equation (1), as

X = Fβ^T Γ^T + (I_n − (1/n) 1_n 1^T_n)E.    (14)

This means that the fitted matrix of predictors satisfies X̂ = P_F X. Using Equation (14), we rewrite X̂ = FN, where N = β^T Γ^T + (F^T F)^{−1} F^T E. Any vector w orthogonal to the span of the columns of V is a 0-eigenvector of X̂^T X̂, since for such a vector Equation (15), X̂^T X̂ = VV^T, shows that X̂^T X̂ w = V(V^T w) = 0, and similarly the span of the columns of V is preserved by X̂^T X̂. Hence the r columns of V span the same space as the eigenvectors of X̂^T X̂ with the r largest eigenvalues.
Proof of Lemma 2.3. For any fixed matrix A, the entries of AE have mean E(AE) = A E(E) = 0, allowing us to deduce the mean of V. Similarly, we use the well-known fact that for any fixed A and B,

E[(AE)_{ki}(BE)_{lj}] = σ² (AB^T)_{kl} δ_{ij}.    (16)

Taking A = B = (F^T F)^{−1/2} F^T means that AB^T = I, so by Equation (16) the entries of V are uncorrelated with variance σ², and the result follows.
Proof of Theorem 2.4. In the case d = r, Lemma 2.2 tells us that we can choose to take a PFC estimate Γ̂_PFC given by an orthogonal transformation of the columns of V defined in Equation (3). Write b^(i), where i = 1, . . ., d, for the columns of Γ, which form an orthonormal set since, by assumption, Γ^T Γ = I_d. We extend this to create an orthonormal basis {b^(1), . . ., b^(p)} for the whole of R^p, and write G for the p × p matrix made up of the complete set of columns b^(i), with G^T G = I_p. We express V in this new basis: for k = 1, . . ., r we expand the kth column of V as

V^(k) = Σ_{i=1}^p A_{ik} b^(i), where A_{ik} = (b^(i))^T V^(k).    (17)

Equivalently, we write the p × r matrix A = G^T V as A = µ + C, where µ = G^T Γβ(F^T F)^{1/2} consists of a d × r block of the form β(F^T F)^{1/2} and a (p − d) × r zero block, and C = G^T E^T F(F^T F)^{−1/2}. Equation (17) and Lemma 2.3 show that A has mean µ, so that for i ≥ d + 1 the A_{ik} has mean 0, whereas for i ≤ d the A_{ik} has mean

µ_{ik} = (β(F^T F)^{1/2})_{ik}.    (18)

Using Lemma 2.3, we know that for any 1 ≤ i, j ≤ p and 1 ≤ k, l ≤ r,

Cov(A_{ik}, A_{jl}) = σ² δ_{ij} δ_{kl}.    (19)

The fact that Γ^T Γ = I_d implies that the projection P_Γ = ΓΓ^T, so that the projections of the kth column become P_Γ V^(k) = Σ_{i=1}^d A_{ik} b^(i) and (I_p − P_Γ)V^(k) = Σ_{i=d+1}^p A_{ik} b^(i). Hence, the respective Frobenius norms become

||P_Γ V||²_F = Σ_{k=1}^r Σ_{i=1}^d A²_{ik},    (20)
||(I_p − P_Γ)V||²_F = Σ_{k=1}^r Σ_{i=d+1}^p A²_{ik}.    (21)

Note that the arguments so far do not require the assumption that the errors ǫ are normal. Under this additional assumption, we deduce that the A_{ik} are normal and independent, with common variance σ². In particular, under this assumption the expressions (20) and (21) are independent of each other. Further, if the errors are normal, Equation (21) means that ||(I − P_Γ)V||²_F/σ² has a central χ² distribution with r(p − d) degrees of freedom. Similarly, Equation (20) means that ||P_Γ V||²_F/σ² has a non-central χ² distribution with rd degrees of freedom and non-centrality parameter tr(β(F^T F)β^T)/σ², and the proof is complete.
In particular, the C_{ik} have mean zero and variance σ², and Equation (18) implies that Σ_{k,i} µ²_{ik} = tr(β(F^T F)β^T). Hence we know the means of N = Σ_{k=1}^r Σ_{i=1}^d A²_{ik} and D = Σ_{k=1}^r Σ_{i=d+1}^p A²_{ik}. In this expansion, we know that E C_{ik} C²_{jk} = 0, since it can be written as a linear combination of expectations of terms in E_{ij} of degree 3. Each such term contains a term of odd degree, so independence and symmetry mean that the resulting expectation is zero. Similarly, using independence and symmetry, since only terms of the form E E⁴_{ij} and E E²_{ij} E²_{kl} make a non-zero contribution, for any k we can expand the variance terms. If the ǫ are independent with mean zero and finite 4th moment, but no longer symmetric, we can prove similar results, which allows this proof to be adapted.

Fulton [5] reviews many facts concerning the eigenvalues of sums of matrices, including the following result:

Lemma A.1 (Weyl [14]). For any real symmetric B and L, λ_i(B) + λ_p(L) ≤ λ_i(B + L) ≤ λ_i(B) + λ_1(L).

Proof of Lemma 3.2. Since F has rows which correspond to independent samples of Y, the Law of Large Numbers means that (F^T F)/n converges elementwise almost surely to a real symmetric matrix Φ. Hence, using the union bound, sup_{i,j} |(β(F^T F)β^T)_{ij}/n − Φ_{ij}| can be made arbitrarily small for all n sufficiently large with probability 1 − ǫ. This implies that the spectral norm of the difference (β(F^T F)β^T)/n − Φ can be made smaller than δ with the same probability. By Lemma A.1, this implies the stated bound on the ith eigenvalue λ_i.

Proof of Theorem 3.3. We write C(Γ̂_PFC, Γ) = N/D, in the notation of Lemma 3.1, and define the event B_{K_1,K_2,n*}. By Lemma 3.2, given any α there exist K_1, K_2, n* such that P(B_{K_1,K_2,n*}) ≥ 1 − α/3. Note that on B_{K_1,K_2,n*}, Lemma 3.1 implies that the mean EN ≥ K_1 n and Var N ≤ Td + 4K_2 n. We use Chebyshev to choose the bound N*_+. We bound an upper confidence interval by taking X_+ = 3r(p − d)σ²/α and combining the fact that D is independent of the event B_{K_1,K_2,n*} with Markov's inequality. We obtain the required bounds by equating N*_+ and X_+/tan²(Θ*_+), and using Equation (24) and
the fact that arctan(t) ≤ t. That is, we can take the resulting value of Θ*_+ and substitute the value X_+ = 3r(p − d)σ²/α given above. We can find a lower confidence interval, since for any X_−, a similar argument to Equation (25) gives a corresponding bound. By Chebyshev and Lemma 3.1, we can choose X_−, and the result follows in the same way.
Proof of Theorem 4.3. During this proof, for simplicity, we write λ_i for λ_i(UU^T), which equals λ_i(β(F^T F)β^T) if i ≤ d and zero otherwise. First, we introduce the event C_{Φ,n*}, where δ is defined in terms of the interlevel spacing as δ = min_{i≤d} (λ_i(Φ) − λ_{i+1}(Φ))/10, writing λ_{d+1}(Φ) = 0. Note that δ > 0 by assumption. We mirror the proof of Theorem 3.3 by conditioning on whether C_{Φ,n*} and L_1 occur, to obtain the decomposition in Equation (28). Lemma 3.2 means that the first term in Equation (28) is less than α/3 for n* sufficiently large. Next we bound the second term in Equation (28), using the fact that λ_1(L) − λ_p(L) ≤ 4σ(||SS^T|| ||UU^T||)^{1/2} + σ²||SS^T||, where we write ||•|| for the spectral norm. Completing the square and using Equation (8), we deduce the required tail bound; the last inequality uses the fact that S is a p × r matrix of independent standard Gaussians, so we can apply Equation (8). Conditioning on the event C_{Φ,n*} ensures that the eigenvalue spacings are of order nδ, where the choice of δ ensures that a > 0, so the second term in Equation (28) is less than α/3 for n sufficiently large. Finally, we condition on the event L_1^c ∩ C_{Φ,n*}. The key result is (as in [13]) that if e_i is a λ_i-eigenvector of UU^T, then the perturbed eigenvector f_i satisfies Equation (29). Hence, Equation (29) tells us that, conditioned on L_1^c, these perturbed eigenvectors form the PFC directions. Since we condition on the event C_{Φ,n*}, the norm satisfies

||(UU^T − λ_i I_p)⁺|| ≤ 1/min_{j≠i} |λ_j − λ_i| ≤ 10/(nδ) for all i.    (30)

Proof of Lemma 5.3. Note that Equation (34) means that (X − X̂)^T(X − X̂) = σ² X^T P_G X, where X is an n × p matrix of standard Gaussian entries independent of X̂^T X̂. We take t = √M/σ − (√n + √p) and apply the result of Davidson and Szarek, Equation (8).
Since (X − X̂)^T(X − X̂) is positive semi-definite, λ_p((X − X̂)^T(X − X̂)) ≥ 0, and since λ_i(X̂^T X̂) = 0 for i ≥ r + 1, Lemma A.1 and Equation (35) imply that

λ_i(X̂^T X̂) ≤ λ_i(X^T X) ≤ λ_i(X̂^T X̂) + λ_1((X − X̂)^T(X − X̂)).

Hence, if L_2 does not occur, that is if λ_1((X − X̂)^T(X − X̂)) < M, there are exactly r eigenvalues of X^T X larger than λ_r(X̂^T X̂), and the r largest eigenvalues must correspond to the perturbed values of the original.

Next we prove another technical lemma:
Lemma A.2. Given an n × n projection matrix P_G and an n × p matrix X with independent standard Gaussian entries, consider the matrices R = XW and S = XV, where W^T V = 0. Then E tr(RR^T P_G SS^T P_G) = tr(W^T W) tr(V^T V) rank(G).
Proof. Using the representation R = XW and Equation (16), we deduce that E(RR^T)_{ij} = δ_{ij} tr(W^T W). Similarly, E(SS^T)_{ij} = δ_{ij} tr(V^T V). The fact that W^T V = 0 makes R and S independent by Equation (16), so the result follows.

Proof of Proposition 5.4. During this proof, for simplicity, we write λ_i = λ_i(X̂^T X̂) and L = (X − X̂)^T(X − X̂), where Equation (34) means that, as before, L = σ² X^T P_G X, with P_G = I − (1/n) 1_n 1^T_n − P_F. We condition on the r PFC directions being u_i, where u_i is a λ_i-eigenvector of X̂^T X̂. As in Equation (29) and [13], if the vectors v_i satisfy the corresponding perturbation equations, then u_i + v_i are (λ_i + µ_i)-eigenvectors of X^T X, and give the r PC directions. This definition of v_i makes v_i ⊥ u_i, so that P_{u_i} Γ̂_PC,i = u_i and (I_p − P_{u_i}) Γ̂_PC,i = v_i.
As before, we know that for all i, ||(X̂^T X̂ − λ_i I_p)⁺|| ≤ 1/M. This means that, conditioned on L_2^c, the quantity ||(I_p − P_{u_i}) Γ̂_PC,i||² = ||v_i||² can be bounded. Writing U for the matrix made up of the columns u_i, and using Jensen's inequality, we obtain Equation (38). Using Lemma A.2 and recalling the fact that L = σ² X^T P_G X, we know that the numerator of Equation (38) satisfies the required bound, where R_i = X(X̂^T X̂ − λ_i I_p)⁺ and S_i = Xu_i, so the conditions of Lemma A.2 apply. Substituting into Equation (38), we deduce the result.

The quantity N = Σ_{k=1}^r Σ_{i=1}^d A²_{ik} has mean tr(β(F^T F)β^T) + rdσ², and D = Σ_{k=1}^r Σ_{i=d+1}^p A²_{ik} has mean r(p − d)σ². For i ≤ d, we write C_{ik} = Σ_{r,s} Γ_{ri} E_{sr} W_{sk}, where W = F(F^T F)^{−1/2}, and M = ΓΓ^T. We deduce the bound on Var(N) using Σ_r M_{rr} = tr(ΓΓ^T) = tr(Γ^T Γ) = d and Σ_u M_{ru} M_{ur} = M_{rr}, so the result follows. Similarly, we deduce Var(D) ≤ (p − d)T.

E tr(RR^T P_G SS^T P_G) = Σ_{i,j,k,l} E(RR^T)_{ij} (P_G)_{jk} E(SS^T)_{kl} (P_G)_{li} = tr(W^T W) tr(V^T V) Σ_{i,k} (P_G)_{ik} (P_G)_{ki} = tr(W^T W) tr(V^T V) tr(P_G^T P_G) = tr(W^T W) tr(V^T V) rank(G), as required.