Principal Component Analysis for Multivariate Extremes

The first-order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high-dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower-dimensional subspace, meaning that certain linear combinations of the components are much more likely to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite sample uniform guarantees for the reconstruction error pertaining to the estimated projection subspace. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.


Introduction
If one wants to analyze the tail behavior of an ℝ^d-valued random vector X = (X_1, …, X_d), one usually assumes that X is regularly varying (if necessary after a standardization of the marginal distributions), i.e. there exists a non-zero measure μ on ℝ^d \ {0} such that

P{X ∈ tB} / P{‖X‖ > t} → μ(B), t → ∞, (1.1)

for all μ-continuous Borel sets B that are bounded away from the origin. This definition does not depend on the choice of the norm ‖·‖ on ℝ^d, but in what follows we only consider the Euclidean norm. Convergence (1.1) may be understood as a generalization to arbitrary dimension of a heavy-tail assumption on a real-valued random variable. This mathematical framework is particularly useful in situations where the focus is on 'tail events' of the kind {X ∈ C} where the distance to the origin u = inf{‖x‖ : x ∈ C} is large. In a risk management context, the probability of such tail events is of crucial importance. If the distance u is so large that few or no data are available in the considered region C, all attempts to resort to empirical estimation are in vain. One common idea behind statistical methods based on Extreme Value Theory (EVT) is to use a small proportion of the available data (those with a comparatively large norm) to learn an estimate for μ, which may then be used for quantifying the probability of tail events.

Regular variation
A standard reference concerning the probabilistic aspects of regular variation in the setting of EVT is [28]; see also [27] for application-oriented examples. Regular variation for Borel measures on Polish spaces has since been revisited in [19] and [23]. It is well known that if convergence (1.1) holds true, then the limit measure μ is homogeneous of order −α for some α > 0. Moreover, the norm ‖X‖ is regularly varying too: P{‖X‖ > tx}/P{‖X‖ > t} → x^{−α} as t → ∞, for all x > 0.
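As a quick numerical sanity check (not part of the paper), the univariate tail-ratio consequence of regular variation can be verified by Monte Carlo for a toy model; the model, sample size and seed below are our own illustrative choices:

```python
import numpy as np

# Monte Carlo illustration of P(||X|| > tx) / P(||X|| > t) -> x^{-alpha}.
# Hypothetical toy model: a Pareto(alpha) radius times a uniform direction on
# the sphere, which is regularly varying with index alpha.
rng = np.random.default_rng(0)
n, d, alpha = 200_000, 3, 2.0

radius = (1.0 - rng.random(n)) ** (-1.0 / alpha)   # P(R > s) = s^{-alpha}, s >= 1
direction = rng.standard_normal((n, d))
direction /= np.linalg.norm(direction, axis=1, keepdims=True)
X = radius[:, None] * direction

norms = np.linalg.norm(X, axis=1)
t = np.quantile(norms, 0.99)                       # a high radial threshold
x = 2.0
ratio = np.mean(norms > t * x) / np.mean(norms > t)
print(ratio)                                       # close to x**(-alpha) = 0.25
```

Decreasing the quantile level of t trades bias (the limit is only approached as t grows) against the Monte Carlo noise caused by the shrinking number of exceedances.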
Because the limit measure is homogeneous, after a polar transformation it can be decomposed into a so-called spectral (or angular) probability measure H and an independent radial component, that is,

μ{x : ‖x‖ > r, x/‖x‖ ∈ A} = r^{−α} H(A) (1.2)

for all r > 0 and all Borel subsets A of the unit sphere. Whereas the literature on the design and the asymptotics of flexible multivariate parametric or nonparametric models for μ, or integrated versions of it, is plentiful (see e.g. [32,13,9,15,29], or [2] and the references therein), the issue of how to escape the curse of dimensionality has only recently been raised (see below). One reason for this may be that a major field of application of EVT concerns environmental, spatial extremes such as heavy rainfall, heat waves, droughts or floods. In this context, max-stable or generalized Pareto spatial models are widely used ([25,12,30]); these have built-in a priori information about the spatial dependence structure, thus reducing the effective dimension.

Dimensionality reduction for extreme values, a brief overview
For applications such as anomaly detection or network monitoring, where no particular structure is known a priori, dimension reduction suggests itself as a preliminary step before implementing any kind of statistical procedure. This subject has recently received increasing attention. If d is moderate or large, the measure μ (and hence H) will often exhibit some 'sparse' structure. For example, if the components of X with indices in some set I ⊂ {1, …, d} of size |I| ∈ {1, …, d − 1} are asymptotically independent of the remaining ones, then μ is concentrated on {x = (x_1, …, x_d) : max_{i∈I} |x_i| = 0 or max_{i∉I} |x_i| = 0}. More generally, one may consider the case where only a small number of subsets of components {I_k ⊂ {1, …, d}, k = 1, …, K} are likely to be large simultaneously, while the other components remain small. Here, 'small number' is understood relative to the 2^d − 1 non-empty possible subsets of components. This setting applies e.g. to heavy rainfall in a spatial setting (storms are usually localized, so that neighboring sites are more likely to be jointly impacted) or to shocks to different assets of a financial portfolio. [5] proposes a clustering approach combined with spherical data analysis to detect structures of this type. [16,17] propose an algorithm with moderate computational cost (linear in the dimension and the sample size) and finite sample uniform guarantees. Their error bounds are linear in d and scale as 1/√k, where k is the number of order statistics of each component that are considered extreme during the training step. A refinement of the latter framework is proposed in the yet unpublished work [34]. [6] and [7] aim at identifying subgroups of components for which the probability of a joint excess over a large quantile is not negligible compared to that of an excess by a single component. [10] use graphical models to reduce the complexity of the extremal dependence structure. In a regression context, [14] sets up a mathematical framework for tail dimension reduction suited to the case where the distribution of the target variable above high thresholds only depends on the projection of the covariates onto a lower-dimensional subspace. Consistency of k-means clustering applied to the most extreme observations of a data set has recently been proven in [20]. An overview of different concepts of sparsity in EVT can be found in [11].

Principal component analysis (PCA) and support identification
Here we focus on finding a linear subspace on which μ is (nearly) concentrated. In a classical setting, when X has finite second moments, PCA ([1]) is the method of choice to determine such supporting linear subspaces if i.i.d. random vectors X_i, 1 ≤ i ≤ n, with the same distribution as X are observed. Theoretical guarantees obtained so far concern the reconstruction error ([21,33,4,22,26]) or the approximation error for the eigenspaces of the covariance matrix ([37]), under the assumption that the sample space (or the feature space for kernel PCA) has finite diameter or that moments of sufficiently high order exist.
For the motivation of our version of PCA, it is useful to keep the following working hypothesis in mind, although it is not required for most results to hold.

Hypothesis 1. The vector space V_0 = span(supp μ) generated by the support of μ has dimension p < d.
Note that then the points (X_i/t) 1{‖X_i‖ > t} are more and more concentrated in a neighborhood of V_0 as t increases, but usually they will not lie on V_0. If the dimension p of V_0 is known, then it suggests itself to approximate V_0 by the subspace of dimension p that is 'closest' in expectation to these points.
In PCA one measures closeness by the squared Euclidean distance, which hugely alleviates the optimization problem, as one may work with orthogonal projections in the Hilbert space L². However, this approach requires finite second moments, which cannot be taken for granted in the above setting. Indeed, if α < 2 then E(‖X_i‖²) = ∞. Hence, we will instead consider re-scaled vectors Θ_i := ω(X_i) X_i, where ω : ℝ^d → (0, ∞) is a suitable scaling function. The most common choice is ω(x) = 1/‖x‖, leading to Θ_i on the unit sphere, which describes the direction of X_i, and we will focus on this re-scaling when we derive finite sample bounds on the reconstruction error (see Section 3). However, consistency results will be proved for considerably more general scaling functions; cf. Section 2.
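The effect of the re-scaling can be illustrated numerically; a minimal sketch, assuming the common choice ω(x) = 1/‖x‖ and a hypothetical Pareto-type sample with α = 1, so that raw second moments are infinite:

```python
import numpy as np

# Sketch: with alpha = 1 < 2 the sample second moment of ||X|| is huge and unstable,
# while the re-scaled vectors Theta = X / ||X|| live on the unit sphere and
# trivially satisfy E||Theta||^2 = 1.  The data model here is a toy assumption.
rng = np.random.default_rng(1)
n, d, alpha = 100_000, 5, 1.0

radius = (1.0 - rng.random(n)) ** (-1.0 / alpha)       # Pareto(1) radius
direction = np.abs(rng.standard_normal((n, d)))
direction /= np.linalg.norm(direction, axis=1, keepdims=True)
X = radius[:, None] * direction

theta = X / np.linalg.norm(X, axis=1, keepdims=True)   # omega(x) = 1/||x||
print(np.mean(np.linalg.norm(X, axis=1) ** 2))         # huge, seed-dependent
print(np.mean(np.linalg.norm(theta, axis=1) ** 2))     # exactly 1.0
```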
To the best of our knowledge, the only existing work applying PCA proper to high dimensional extremes is the paper [8]. The authors discuss a transformation mapping negative observations to small positive ones and apply PCA in the transformed space. They also use a preliminary re-scaling involving the norm of the transformed vector. They illustrate their approach with simulations and real data examples, but do not derive theoretical statistical guarantees.

Notation and risk minimization setting
To give a formal description of our method, we first introduce some notation. All random variables are defined on some probability space (X, A, P); the expectation with respect to P is denoted by E. For x ∈ ℝ^d and t > 0, let

θ_t(x) := ω(x) x 1{‖x‖ > t}. (1.4)

By P we denote the distribution of X and by P_t its conditional distribution given that ‖X‖ > t, i.e. P_t(·) = P(X ∈ · | ‖X‖ > t). Then P_∞ := μ|_{(B_1(0))^c} is the weak limit of P_t(t·) (with B_1(0) denoting the closed unit ball); cf. (1.1).
For any probability measure Q and any Q-integrable function f, we denote the expectation of f with respect to Q by Qf or Q(f). By E_t we denote the conditional expectation (with respect to P) given ‖X‖ > t, so that E_t(f(X)) = P_t(f), provided the expectations exist.
For any linear subspace V ⊂ ℝ^d, let Π_V be the orthogonal projection onto V (or the associated projection matrix), and let Π⊥_V be the orthogonal projection onto the orthogonal complement V^⊥ of V.
To apply PCA to the re-scaled vectors, we have to assume that the scaling function ω is chosen such that E(‖Θ‖²) = P(‖θ‖²) < ∞ and P_∞(‖θ‖²) < ∞, with θ(x) := ω(x)x. Note that this condition is always fulfilled if there exist β > 1 − α/2 and c > 0 such that ω(x) ≤ c‖x‖^{−β} for all x ∈ ℝ^d. For simplicity's sake, in what follows we will impose the following stronger homogeneity condition:

ω(λx) = λ^{−β} ω(x) for all λ > 0, x ∈ ℝ^d \ {0}, for some β > 1 − α/2, (1.5)

and c_ω := sup_{v ∈ S^{d−1}} ω(v) < ∞, where S^{d−1} denotes the unit sphere. Note that then ‖Θ‖ = ω(X)‖X‖ ≤ c_ω ‖X‖^{1−β}. The choice ω(x) = ‖x‖^{−β} seems natural, but different choices allow for focusing on particular aspects of the extreme value behavior; for instance, if one is only interested in the positive components of X, one may choose a scaling function that depends on the positive part of x only. Define the conditional risk at level t and the limit risk by

R_t(V) := E_t(‖Π⊥_V Θ‖²), R_∞(V) := P_∞(‖Π⊥_V θ‖²),

with minimizers V*_t and V*_∞, where the infima are taken over all linear subspaces of the specified dimension (cf. Lemma 2.5). The risk R_∞ may be interpreted as the expected reconstruction error in the limit model if the re-scaled observation Θ is replaced with its lower-dimensional approximation Π_V Θ. Since P_t(t·) → P_∞(·) weakly, one may consider V*_t as a finite threshold proxy for V*_∞, describing the best p-dimensional approximation of the re-scaled observations given that ‖X‖ exceeds a high threshold t > 0. Note that V*_t may be of interest even if Hypothesis 1 only holds approximately, in the sense that P_∞ concentrates most of its mass on a small neighborhood of a p-dimensional subspace.
It is natural to 'estimate' V*_t (and thus V_0) by a minimizer of the corresponding empirical risk. Here the threshold t must be chosen suitably, depending on the sample size; to this end, often order statistics of the norms of the observed vectors are used, and we follow this approach. Let X_{(j)} = X_{σ(j)}, where σ is a permutation of indices such that ‖X_{(1)}‖ ≥ ‖X_{(2)}‖ ≥ ⋯ ≥ ‖X_{(n)}‖. (For brevity, we suppress the dependence on n in our notation of order statistics.) For 1 ≤ k ≤ n, denote by t̂_{n,k} = ‖X_{(k+1)}‖ the empirical quantile of level 1 − k/n for ‖X‖. We define the empirical risk for the subspace V related to the k largest observations as

R̂_{n,k}(V) := (1/k) Σ_{i=1}^n ‖Π⊥_V Θ_{i,t̂_{n,k}}‖²,

where Θ_{i,t} = θ_t(X_i) in accordance with the notation introduced in (1.4). Here and throughout the paper, we suppose that the c.d.f. of ‖X‖ is continuous in the tail to avoid technicalities. Then we may assume w.l.o.g. that there are no ties, and thus exactly k observations have norm larger than t̂_{n,k}. A minimizer of R̂_{n,k}(V) among all linear subspaces of dimension p will be denoted by V̂_n = V̂^p_n. It is the main goal of the present paper to analyze the asymptotic and the finite sample behavior of the empirical risk R̂_{n,k}(V) and its minimizer V̂_n.
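The empirical risk of a candidate subspace can be computed in a few lines; a sketch assuming ω(x) = 1/‖x‖, so that the risk is the mean squared projection residual of the directions of the k largest observations (the helper name and the orthonormal-basis representation of V are our own conventions):

```python
import numpy as np

# Sketch of the empirical risk: mean squared reconstruction error of the re-scaled
# k largest observations, assuming omega(x) = 1/||x||.
def empirical_risk(X, k, V_basis):
    """X: (n, d) sample; k: number of exceedances; V_basis: (d, p) orthonormal columns."""
    norms = np.linalg.norm(X, axis=1)
    top = np.argsort(norms)[-k:]                      # indices of the k largest norms
    theta = X[top] / norms[top, None]                 # re-scaled exceedances on the sphere
    resid = theta - (theta @ V_basis) @ V_basis.T     # Pi^perp_V applied to Theta
    return np.mean(np.sum(resid ** 2, axis=1))

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 4)) * (1.0 - rng.random((1000, 1))) ** -1.0
V = np.linalg.qr(rng.standard_normal((4, 2)))[0]      # a random 2-dimensional subspace
print(empirical_risk(X, 100, V))                      # lies in [0, 1] since ||Theta|| = 1
```

With V equal to the whole space the residuals vanish, so the risk is 0; for a random subspace it is strictly positive.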

Outline
In Section 2 we first show that the minimizer of the risk R_t based on a finite threshold t converges to the minimizer of the limit risk R_∞, and thus under Hypothesis 1 to V_0, as t → ∞. Moreover, we show consistency of the empirical risk minimizer V̂_n under condition (1.5). In Section 3, we derive non-asymptotic uniform bounds on the difference between the empirical risk and its theoretical counterpart for the most important scaling ω(x) = 1/‖x‖. Furthermore, we construct uniform confidence bands for R_t(V). The results obtained in a simulation study are reported in Section 4. In particular, we explore the choice of the dimension p based on empirical risk plots and the effect of a PCA projection on estimators of probabilities expressed in terms of the spectral measure H. All proofs and technical lemmas are postponed to Section 5, while an appendix contains some details on the proof of a modification of a result by [4].

Consistency of risk minimizers
In this section, we first discuss how to calculate minimizers of the conditional risk R_t given ‖X‖ > t and of the empirical risk R̂_{n,k}. Moreover, we prove that these converge in some sense towards a minimizer of R_∞.
It is well known that a point of minimum of V ↦ E‖Π⊥_V Y‖² can be derived from the spectral analysis of the matrix of second (mixed) moments of Y:

Proposition 2.1. (i) Let λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d ≥ 0 denote the eigenvalues of E(YY^⊤) with corresponding orthogonal eigenvectors x_1, …, x_d. Then V* = span(x_1, …, x_p) minimizes E‖Π⊥_V Y‖² among all linear subspaces V of dimension p. In the case λ_p > λ_{p+1} it is the unique minimizer.
(ii) If the scaling condition (1.5) holds and λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d ≥ 0 denote the eigenvalues of Σ_t := E_t(ΘΘ^⊤) with corresponding orthogonal eigenvectors x_1, …, x_d, then V*_t = span(x_1, …, x_p) minimizes R_t(V) among all linear subspaces V of dimension p.
(iii) If the scaling condition (1.5) holds and λ_{n,1} ≥ λ_{n,2} ≥ ⋯ ≥ λ_{n,d} ≥ 0 denote the eigenvalues of Σ_{n,k} := k^{−1} Σ_{i=1}^n Θ_{i,t̂_{n,k}} Θ^⊤_{i,t̂_{n,k}} with corresponding orthogonal eigenvectors x_{n,1}, …, x_{n,d}, then V̂_n = span(x_{n,1}, …, x_{n,p}) minimizes R̂_{n,k}(V) among all linear subspaces V of dimension p.
A proof of assertion (i) can e.g. be found in [31], Theorem 5.3, where also other optimality properties of the minimizers are given. The two other assertions follow directly by an application of (i) with Y equal to Θ conditional on ‖X‖ > t, respectively to a random variable distributed according to the empirical distribution of those Θ_i for which ‖X_i‖ > t̂_{n,k}. If λ_p = λ_{p+1}, then the minimizer is not unique. With m = min{i ∈ {1, …, p} : λ_i = λ_p}, any minimizer V*_t of R_t can be represented as span(x_1, …, x_{m−1}, x̃_m, …, x̃_p), where x̃_m, …, x̃_p are orthogonal eigenvectors corresponding to the eigenvalue λ_p, and all these subspaces are minimizers. An analogous statement holds for the empirical risk.
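The eigen-characterization above can be checked numerically; a sketch, with a data model and parameter choices of our own, verifying that the risk attained by the span of the top-p eigenvectors of Σ_{n,k} equals the sum of the d − p smallest eigenvalues:

```python
import numpy as np

# Sketch: build Sigma_{n,k} from the re-scaled k largest observations, take the
# top-p eigenvectors as the estimated subspace, and check that the attained
# empirical risk equals the sum of the d - p smallest eigenvalues.
rng = np.random.default_rng(3)
n, d, p, k = 2000, 6, 2, 200

X = rng.standard_normal((n, d)) * ((1.0 - rng.random((n, 1))) ** -0.7)
norms = np.linalg.norm(X, axis=1)
idx = np.argsort(norms)[-k:]
theta = X[idx] / norms[idx, None]                # directions of the k largest points

Sigma = theta.T @ theta / k                      # empirical second-moment matrix
eigval, eigvec = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
V_hat = eigvec[:, -p:]                           # top-p eigenvectors span the minimizer

resid = theta - (theta @ V_hat) @ V_hat.T
risk_at_Vhat = np.mean(np.sum(resid ** 2, axis=1))
print(risk_at_Vhat, eigval[:-p].sum())           # the two numbers agree
```

The identity holds because the attained risk equals tr(Π⊥_{V̂} Σ_{n,k}), i.e. the total eigenvalue mass outside the chosen eigenspace.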

Asymptotic behavior of the conditional risk and its minimizer
Here we discuss the relationship between R t and R ∞ and their respective minimizers.

Proposition 2.2. Suppose that ω fulfills condition (1.5). Then, for any subspace V of ℝ^d, the suitably standardized finite threshold risk converges:

R̃_t(V) := t^{2(β−1)} R_t(V) → R_∞(V), t → ∞.

In view of Proposition 2.2, one may ask whether a minimizer of R̃_t := t^{2(β−1)} R_t (which of course is also a minimizer of R_t) converges in some sense to a minimizer of R_∞. Denote by V_p the set of all subspaces of ℝ^d of dimension p, endowed with the metric ρ(V, W) := |||Π_V − Π_W||| pertaining to the operator norm |||·||| of the projections.
It can be shown that V_p is compact w.r.t. ρ (see Lemma 5.2) and that the normalized conditional risk functions R̃_t are uniformly Lipschitz continuous (Lemma 5.3), from which the convergence of the risk minimizers follows by standard arguments.
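The metric on V_p can be made concrete; a small sketch (helper name ours) computing the operator norm distance of two projections, which for two planes meeting at principal angle φ equals sin φ:

```python
import numpy as np

# Sketch of the metric rho(V, W) = |||Pi_V - Pi_W||| on the set of p-dimensional
# subspaces: operator (spectral) norm of the difference of the projection matrices.
def rho(U, W):
    """U, W: (d, p) matrices with orthonormal columns spanning the two subspaces."""
    return np.linalg.norm(U @ U.T - W @ W.T, ord=2)

phi = 0.3                                             # principal angle between two planes
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])    # span(e1, e2) in R^3
W = np.array([[1.0, 0.0], [0.0, np.cos(phi)], [0.0, np.sin(phi)]])
print(rho(V, W))                                      # equals sin(phi) ~ 0.2955
```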

Theorem 2.4. Suppose that ω satisfies condition (1.5) and that the minimizer V*_∞ of R_∞ over V_p is unique. Then ρ(V*_t, V*_∞) → 0 as t → ∞.
Under Hypothesis 1, V_0 is the unique minimizer of R_∞ over V_p, that is, if we minimize the risk over linear subspaces with the correct dimension, as the following result shows. Hence, in this case, V*_t converges to V_0.

Lemma 2.5. Under Hypothesis 1, for any subspace V ∈ V_p, R_∞(V) = 0 implies V = V_0. Indeed, by definition of P_∞ and the homogeneity of μ, R_∞(V) = 0 means that the support of μ must be a subset of V and thus V_0 ⊂ V; since both subspaces have dimension p, they coincide.

Convergence of the empirical risk and its minimizer
We now establish analogous consistency results for the empirical risk R̂_{n,k} and its minimizer. In what follows, let F_‖X‖ be the c.d.f. of ‖X‖, F^←_‖X‖ its generalized inverse (quantile function), and define t_{n,k} := F^←_‖X‖(1 − k/n). We start with consistency of the standardized empirical risk.
The following main result of this section states the consistency of the empirical risk minimizer.
So far, we have proved weak consistency of both the standardized empirical risk and the empirical risk minimizer under mild assumptions on the scaling function ω. However, the rates of convergence may be arbitrarily slow. In the next section, we establish bounds on the recovery risk under stronger conditions.

Uniform risk bounds
Since a minimizer V̂_t of the empirical risk R̂_t (or V̂_n of R̂_{n,k}) differs from the minimizer of the true risk R_t, usually the so-called excess risk R_t(V̂_t) − inf_{V∈V_p} R_t(V) will be strictly positive. We follow the common approach in the theory of risk minimization and bound the excess risk by deriving uniform bounds on |R̂_t(V) − R_t(V)| which hold with high probability for a fixed sample size n. If these uniform bounds can be calculated from the observed data, they may also be used to construct confidence intervals for the reconstruction error R_t(V̂_t) resp. R_{t_{n,k}}(V̂_n).
As condition (1.5) does not guarantee finite moments of Θ of order greater than 2, and tight concentration inequalities are available only for sub-Gaussian distributions, we now assume that the scaling function satisfies the stronger condition ω(x) = 1/‖x‖, so that ‖Θ‖ = 1 almost surely. For classical PCA (and a kernel version thereof), [33] established uniform risk bounds for bounded random vectors Z_i, which were improved by a result of [4]: assume ‖Z_i‖ ≤ 1, and denote the empirical matrix of second (mixed) moments by Σ̂_n and the Hilbert-Schmidt norm on the space of matrices by ‖·‖_HS; then a uniform bound on the risk differences holds, with probability greater than 1 − δ, for all V ∈ V_p. One may try to derive uniform risk bounds in our extreme value setting by applying this result to the random variables Z_i := Θ_{i,t}, if one ignores the difference between N_t and its expectation nπ_t. In the case π_t = o(n^{−1/2}), however, the resulting upper bound does not even converge to 0 when it is divided by π_t, because of the second term. Hence this direct approach does not give meaningful bounds in our setting. The reason for this inconsistency is that, unlike in the classical setting, most of the random variables Z_i vanish as t increases, and the concentration inequalities used in the proofs of the aforementioned bounds are too crude in such a situation. However, we will take up the ideas of [4], with appropriate modifications, to derive much tighter uniform bounds on |R̂_{n,k}(V) − R_{t_{n,k}}(V)|. Furthermore, we will derive uniform bounds on |R̂_t(V) − R_t(V)| which hold conditionally on the number N_t of exceedances and depend only on the data. These can then be used to construct confidence bands for R_t(V).
If, for the time being, one neglects the difference between the empirical (1 − k/n)-quantile of ‖X‖ (i.e. t̂_{n,k}) and the true quantile t_{n,k}, then R̂_{n,k}(V) can be approximated by R̄_{t_{n,k}}(V), where R̄_t denotes the analogue of the empirical risk defined with the deterministic threshold t. Denote the empirical distribution of the observed random vectors X_i, 1 ≤ i ≤ n, by P_n, and recall the notation P for the distribution of X. For any threshold t > 0, the maximal difference between the approximate empirical risk R̄_t(V) and the true risk R_t(V) can be rewritten in terms of functions ϕ^±(X_1, …, X_n) of the sample. To derive uniform bounds on |R̂_{n,k}(V) − R_{t_{n,k}}(V)|, we thus first analyze the error of the approximation of R̂_{n,k} by R̄_{t_{n,k}} (Lemma 5.5) and then bound ϕ^±(X_1, …, X_n). For the latter step, we employ a version of the bounded difference inequality to obtain a concentration inequality for ϕ^±(X_1, …, X_n) (Lemma 5.7) and combine it with an upper bound on its expectation (Lemma 5.8). This approach leads to our first uniform bound on the difference between the empirical risk and its theoretical counterpart.
This bound, stated in Theorem 3.1, holds in particular with probability greater than or equal to 1 − δ.

H. Drees and A. Sabourin
Note that (3.6) also implies an upper bound on the excess risk. However, the upper bound in Theorem 3.1 cannot be calculated from the data and thus cannot directly be used to construct confidence intervals for the true reconstruction error R_{t_{n,k}}(V̂_n) or the minimal reconstruction error inf_{V∈V_p} R_{t_{n,k}}(V). Therefore, we derive data-dependent bounds directly from (a minor improvement of) the bound established by [4]. However, this result will be applied to the conditional distribution of Θ given ‖X‖ > t, and the resulting bound is to be interpreted conditionally on the number N_t of exceedances over the chosen threshold t.
In the statement about the confidence bands one may replace B_{t,δ} with a variant B̃ whose half width is more suitable for (numerical) minimization (as a function of u and v) under the corresponding constraint on δ.

Remark 3.5. The conditional approach employed in Theorem 3.3 can also be used to obtain a uniform risk bound similar to the one in Theorem 3.1. A comparison with Theorem 3.1 reveals that the new bound may be tighter if S*_{t_{n,k}} is substantially smaller than S_{t_{n,k}}. This will be the case if k/n is small and tr((E_t(ΘΘ^⊤))²) is not much smaller than E_t(‖Θ‖⁴).
So far, we have compared empirical risks with the true risk R t for finite thresholds t. A comparison with the limit risk R ∞ would require second order refinements of our basic assumption (1.1).
where the second supremum is taken over all V ∈ V_p. Therefore, bounds on the difference between empirical risks and the limit risk require additional assumptions on the spectrum of the difference Σ_t − Σ_∞ between the matrix of second moments of the re-scaled exceedances over the threshold t and the corresponding matrix in the limit model.
If one merely wants to compare the minimum risk for finite thresholds with the minimum limit risk, which equal the sums of the d − p smallest eigenvalues of Σ_t resp. Σ_∞, then somewhat weaker assumptions on the convergence of the spectrum of Σ_t to that of Σ_∞ are needed. In particular, under Hypothesis 1 the minimum limit risk vanishes.

The setting
We investigate the performance of our PCA procedure. In particular, we examine how the standard non-parametric estimator of the spectral measure (defined via (1.2)) based on the k largest observations,

Ĥ_{n,k}(·) := (1/k) Σ_{i=1}^k δ_{X_{(i)}/‖X_{(i)}‖}(·),

compares with the estimator obtained if the data is first projected onto a lower dimensional subspace using PCA:

Ĥ^{PCA}_{n,k}(·) := (1/k) Σ_{i=1}^k δ_{Π_{V̂} X_{(i)}/‖Π_{V̂} X_{(i)}‖}(·).

Here, δ_y is the Dirac measure with point mass at y and V̂ denotes the subspace picked by PCA based on the same number k of largest observations. It will turn out that sometimes it is advisable to use a smaller number k̃ for the PCA procedure; the resulting estimator of the spectral measure will be denoted by Ĥ^{PCA}_{n,k,k̃}. We simulate from different models of d-dimensional regularly varying vectors for which the spectral measure is (approximately) concentrated on a p-dimensional subspace. Since PCA is equivariant under rotations, w.l.o.g. we assume that this subspace is spanned by the first p unit vectors. The dependence between these p components is either described by a so-called Dirichlet model or by a Gumbel copula C_ϑ(x) = exp(−(Σ_{i=1}^p (−log x_i)^ϑ)^{1/ϑ}). In Example 3.6 of [32] it is described how to simulate from the former model, while data from the second model is generated using the transformation method proposed by [35]. The marginal distributions are chosen as Fréchet with tail index α. In addition, we have simulated observations from a Dirichlet model which are then rotated in the plane spanned by two randomly chosen coordinates, one among the first p coordinates, the other among the last d − p. The rotation angle is uniformly distributed on the interval [−π/10, π/10]. Note that, unlike in the first two models, Hypothesis 1 is not fulfilled here. By this means we evaluate how sensitive PCA is to moderate deviations from the ideal situation.
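The two estimators can be sketched as follows; note that the formulas are our reading of the construction (empirical measure of the directions of the k largest observations, with the PCA variant projecting onto the estimated subspace and renormalizing first), not verbatim from the paper:

```python
import numpy as np

# Sketch of the spectral estimators: mass 1/k on each direction X_(i)/||X_(i)|| of
# the k largest observations; the PCA variant first projects onto a subspace given
# by an orthonormal basis and renormalizes.  Helper names are our own.
def spectral_points(X, k, V_basis=None):
    norms = np.linalg.norm(X, axis=1)
    Z = X[np.argsort(norms)[-k:]]                    # the k largest observations
    if V_basis is not None:                          # PCA variant: project, renormalize
        Z = (Z @ V_basis) @ V_basis.T
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

rng = np.random.default_rng(4)
X = np.abs(rng.standard_normal((2000, 5))) * ((1.0 - rng.random((2000, 1))) ** -1.0)
V_hat = np.linalg.qr(rng.standard_normal((5, 2)))[0]
pts = spectral_points(X, 50, V_hat)
# e.g. estimate H({theta : theta_1 > 0.5}) by the fraction of atoms in the set
print(np.mean(pts[:, 0] > 0.5))
```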
In all cases, we add the modulus of a d-dimensional multivariate normal vector with suitable variances and constant correlation 0.2. This way, it is ensured that the support of the exceedances over high thresholds is not fully concentrated on the p-dimensional subspace. The variances are chosen equal to 10^5/d for α = 1 (i.e., if we start with unit Fréchet margins) and equal to 10/d for α = 2, so that the sparsity assumption becomes apparent for the most extreme observations, whereas large yet less extreme data points are more spread out.
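A schematic version of such a design (our own simplification: independent Fréchet components on the subspace instead of the Dirichlet or Gumbel-copula dependence, and 'modulus' read componentwise) may look as follows:

```python
import numpy as np

# Schematic simulation design: heavy-tailed mass near span(e_1, ..., e_p), plus the
# componentwise absolute value of a constant-correlation Gaussian vector, so that
# only the most extreme observations look concentrated on the subspace.
rng = np.random.default_rng(5)
n, d, p, alpha = 1000, 10, 2, 1.0

X = np.zeros((n, d))
X[:, :p] = (-np.log(rng.random((n, p)))) ** (-1.0 / alpha)   # Frechet(alpha) margins

var, corr = 1e5 / d, 0.2                  # variances and constant correlation
cov = var * ((1 - corr) * np.eye(d) + corr * np.ones((d, d)))
X += np.abs(rng.multivariate_normal(np.zeros(d), cov, size=n))
print(X.shape)
```

The constant-correlation covariance (1 − ρ)I + ρJ is positive definite for ρ = 0.2, so the Gaussian perturbation is well defined.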
In all settings, we simulate samples of size n = 1000 and examine the performance of the PCA procedure based on Θ = X/ X for the k vectors with largest norms for k ∈ {5, 10, 15, . . . , 200}. The results reported here are based on 1000 simulations in each setting.
Write x_j for the jth coordinate of a vector x. To measure the performance of the spectral estimators, we calculate the errors of the resulting estimators of four probabilities (i)-(iv) in the limit model, which can be expressed in terms of the spectral measure. The first probability is related to the c.d.f. of the mean contribution of the first p coordinates to the norm of the random vector, thus quantifying, in some sense, how strongly the norm is spread over the coordinates. Probability (ii) indicates how likely it is that the first p components are all large, while this is not true for any of the other components, given that the norm of the vector is large. Probability (iii) specifies how likely it is that the first component is extreme, given that any component is extreme. In a financial context, such probabilities are used to quantify how strongly a specific market participant is exposed to a failure of any market participant. Finally, probability (iv) concerns the minimal contribution of any coordinate to the norm; note that under Hypothesis 1 this probability equals 0. The other true values are determined by Monte Carlo simulations with a sample size of at least 10^7, unless they can easily be calculated analytically; the approximation error is always smaller than 10^{−3}. Throughout, we assume α to be known, since we are interested in the effect of the PCA procedure on the estimator of the spectral measure, which should not be confounded with the estimation error of the tail index.
We first investigate the performance of the estimators in models of moderate dimension (d = 10, p = 2), before we examine high dimensional models (d = 100, p = 5).

Moderate dimensions
Throughout this subsection, vectors of dimension d = 10 are considered whose extremes are approximately concentrated on a two-dimensional subspace. We first discuss the simulation results for the Dirichlet model with all Dirichlet parameters α_i, 1 ≤ i ≤ p, equal to 3 and unit Fréchet margins. Figure 1 shows the mean empirical risk in the left plot as a function of k for the PCA that projects onto a subspace of dimension p̃ ∈ {1, …, 10}. Since the mean empirical risk cannot be observed if one analyzes a given data set, the right plot shows the corresponding empirical risk for a single data set. The structure of both plots is very similar; essentially, the mean empirical risk curves are just a bit smoother. For this reason, in the remaining settings we will only report the mean empirical risk.
It is obvious from the risk plot that p̃ = 2 is a good choice: there is a big gap to the empirical risk for p̃ = 1, whereas the empirical risk almost vanishes for small k and p̃ = 2, and the risk decreases more regularly for values p̃ > 2, with no obvious structural breaks. The growing influence of the multivariate normal component as k increases is manifest in these plots, since the empirical risk quickly increases with k for all choices of p̃. This suggests choosing k rather small to detect the sparsity in the model, a finding which will be corroborated in the analysis of the estimator of the spectral measure below.
In Figure 2, the mean operator norm of the difference between the projection onto the true support of the limit measure μ and the projection onto the subspace of dimension 2 chosen by PCA is plotted versus k. Again it becomes obvious that for less extreme observations the approximation by a lower dimensional vector is rather poor, which leads to a larger error for the projection matrix estimated from these data. For k = 80, the norm has almost reached its maximal value. However, one should keep in mind that the operator norm measures the maximal distance between the projections of a vector y ∈ S^{d−1} onto the estimated and the true subspace, respectively. If the underlying distribution of X/‖X‖ puts little mass on vectors y for which the distance is large, the true risk corresponding to the estimated subspace may still be small.

Next we consider the estimators of the probabilities (i)-(iv), obtained by replacing the spectral measure H either with Ĥ_{n,k} or with Ĥ^{PCA}_{n,k}. Since the PCA estimator of the subspace supporting μ quickly deteriorates as k increases, we additionally consider the estimators resulting from Ĥ^{PCA}_{n,k,10}, which uses just the 10 largest observations to estimate the supporting subspace. Figure 3 displays the root mean squared errors (RMSE) of the resulting estimators as a function of k. For very small values of k, all estimators perform similarly. For probability (i) with t_{(i)} = 0.65 (leading to a true value of about 0.684), both PCA based estimators have a considerably smaller RMSE than the standard estimator for most k. In particular, the PCA based method using just the 10 largest observations to estimate the support of the spectral measure clearly outperforms both other estimators (almost) irrespective of the number of observations used for estimating the spectral measure.
For the estimation of probability (ii) (≈ 0.309), the standard non-parametric estimator performs best for k ≤ 40. The classical PCA estimator using the same number of order statistics in both steps performs better for larger values of k, and its minimum RMSE is a bit lower than that of the standard estimator. The PCA based estimator which determines the support of μ from the 10 largest observations has a very stable RMSE, but its minimum is much larger than that of both other estimators.
In case (iii) (with true value of about 0.770), the RMSE of the standard estimator and of the estimator based on Ĥ^{PCA}_{n,k,10} are very similar for k up to about 80, but the latter is remarkably insensitive to the choice of k up to 200. This feature might be useful in practical applications, where the selection of k is often tricky. In contrast, the PCA based procedure that uses the same number of largest observations in both steps is even more sensitive to this choice than the standard estimator. Similarly, the classical PCA estimator of probability (iv) strongly depends on the choice of k, while both other estimators stably have a very low error.
Next we consider the model whose extremal behavior is described by the Gumbel copula with ϑ = 2 and Fréchet marginal distributions with c.d.f. F(x) = exp(−x^{−2}), x > 0. The mean empirical risk and the mean operator norm of the difference matrix are shown in Figure 4. Overall the picture is similar to that for the Dirichlet model, but the operator norm increases more slowly with k. Based on the left plot, one will choose p̃ = 2. Figure 5 displays the RMSE of the estimators of (i)-(iv) with PCA projecting onto two-dimensional subspaces. Here t_{(i)} = 0.7 and the true values for (i)-(iv) are 0.3813, 0.083, 1/√2 and 0. The relative performance of the estimators for probability (iv) is very similar to that in the Dirichlet model. The same is true for the standard estimator of (i) and the PCA estimator that uses just the 10 largest observations for estimating the support, but now the PCA estimator that uses the same number k in both steps performs slightly worse than the standard estimator. In contrast, both PCA based estimators of probability (ii) outperform the standard estimator, while for probability (iii) all three estimators perform similarly for k up to 100, where the usual PCA based estimator starts to deteriorate. Again, for all probabilities, the RMSE of the estimator resulting from Ĥ^{PCA}_{n,k,10} is remarkably insensitive to the choice of k.

Finally, we turn to the disturbed Dirichlet model where the observations are randomly rotated by an angle of up to π/10, leading to true values for (i)-(iv) of 0.653 (with t_{(i)} = 0.65), 0.185, 0.770 and 0. The corresponding plots are shown in Figures 6 and 7. In view of the empirical risk, the choices p̃ ∈ {2, 3} seem reasonable.
Again, the PCA procedure that uses the same largest observations in both steps performs better for the larger choice of p̃, whereas the performance of the other PCA procedure improves only for (ii), while it does not change much for (iii) and deteriorates a bit for (i) and (iv). The PCA estimators perform better than the standard procedure for probability (i), and for (iii) if k is large (for classical PCA only if p̃ = 3), whereas for (ii) overall the estimators perform similarly well, with the standard procedure performing better for small values of k and the PCA estimators for larger values.
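The two-step scheme compared above (estimating the supporting subspace by PCA of the angular parts of the largest observations, then evaluating the reconstruction risk on the exceedances) can be sketched as follows. This is a minimal illustration, not the simulation design of the paper: the function names, the toy one-dimensional model and all parameter values are chosen for exposition only.

```python
import numpy as np

def angular_parts(X, k):
    """Angular parts Theta_i = X_i / ||X_i|| of the k observations
    with the largest Euclidean norm."""
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]
    return X[idx] / norms[idx, None]

def pca_subspace(X, k_pca, p):
    """First step: estimate a p-dimensional subspace by PCA of the
    angular parts of the k_pca largest observations."""
    Theta = angular_parts(X, k_pca)
    # leading eigenvectors of the empirical second-moment matrix
    _, eigvec = np.linalg.eigh(Theta.T @ Theta / k_pca)
    return eigvec[:, -p:]  # columns: orthonormal basis of the subspace

def empirical_risk(X, k, U):
    """Second step: mean squared reconstruction error of the angular
    parts of the k largest observations after projection onto span(U)."""
    Theta = angular_parts(X, k)
    residual = Theta - Theta @ U @ U.T  # component orthogonal to span(U)
    return np.mean(np.sum(residual ** 2, axis=1))

# Toy example: heavy-tailed sample whose extremes concentrate near a line.
rng = np.random.default_rng(0)
n, d = 5000, 3
radii = 1.0 + rng.pareto(2.0, size=n)          # heavy-tailed radial part
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
X = radii[:, None] * (direction + 0.05 * rng.standard_normal((n, d)))

U = pca_subspace(X, k_pca=10, p=1)             # support from the 10 largest obs.
print(empirical_risk(X, k=200, U=U))           # small reconstruction error
```

Separating k_pca from k mirrors the variant Ĥ^{PCA}_{n,k,10} discussed above, in which the subspace is determined from only the 10 largest observations while the spectral measure is estimated from the k largest.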

High dimensional models
We now compare the estimators when random vectors of dimension d = 100 are observed whose extremes are concentrated near a p = 5 dimensional subspace.
Again, we start with the Dirichlet model, for which the mean empirical risk for PCA projecting on a subspace of dimension p̃ ∈ {1, . . . , 10} is shown in the left plot of Figure 8, and the mean operator norm of the difference between the estimated and the true projection matrix in the right plot. Here the choice of an appropriate dimension based on the empirical risk plot is less obvious than in the lower dimensional setting, but one should clearly arrive at some value p̃ between 4 and 6 and choose k not much larger than 50 for estimating the support of the limit measure. Figure 9 shows the RMSE of the different estimators of the probabilities (i)–(iv) with t^{(i)} = 0.4 and true values 0.573, 0.072, 0.584 and 0, respectively. Here, we have used PCA with p̃ = 4 in the upper row, p̃ = 5 in the middle row and p̃ = 6 in the lower row. As expected, in most cases the PCA procedures perform worse when they project on too low dimensional subspaces, yet in the cases (i) and (iv) the differences are moderate. At first glance somewhat surprisingly, overall the PCA procedures exhibit a better behavior for p̃ = 6 than for the "correct" value p̃ = 5. This may be explained by the fact that the extra dimension offers the opportunity to compensate for the difference between the subspaces minimizing the true and the empirical risk, respectively. This difference is expected to be larger if the dimension of the observed vectors is large, as can also be seen from the right plot in Figure 8.
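The dimension-selection step described above, i.e. inspecting the empirical risk as a function of the candidate dimension p̃, can be sketched as follows. The sketch relies only on the standard fact that the PCA reconstruction risk for dimension p equals the sum of the discarded eigenvalues; the toy model (d = 100, extremes near a 5-dimensional subspace) and all constants are illustrative, not the paper's simulation design.

```python
import numpy as np

def risk_profile(X, k, p_max):
    """Empirical reconstruction risk of the k angular exceedances for
    candidate subspace dimensions 1, ..., p_max.  The risk for dimension
    p equals the sum of the d - p smallest eigenvalues of the empirical
    second-moment matrix of the angular parts."""
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]
    Theta = X[idx] / norms[idx, None]
    eig = np.sort(np.linalg.eigvalsh(Theta.T @ Theta / k))[::-1]  # decreasing
    return np.array([eig[p:].sum() for p in range(1, p_max + 1)])

# Toy high-dimensional example: d = 100, extremes near a p = 5 dim. subspace.
rng = np.random.default_rng(1)
n, d, p = 20000, 100, 5
B, _ = np.linalg.qr(rng.standard_normal((d, p)))   # orthonormal basis (d x p)
radii = 1.0 + rng.pareto(2.0, size=n)              # heavy-tailed radial part
latent = rng.standard_normal((n, p))
latent /= np.linalg.norm(latent, axis=1, keepdims=True)
X = radii[:, None] * (latent @ B.T + 0.01 * rng.standard_normal((n, d)))

risks = risk_profile(X, k=200, p_max=10)
# the risk drops sharply up to p = 5 and is nearly flat afterwards:
# this "elbow" is what the empirical risk plots above are used to locate
print(np.round(risks, 4))
```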
Again, the PCA based estimators for probability (i) outperform the standard procedure, but the other probabilities are more accurately estimated by the standard procedures if p̃ ≤ 5 (though all estimators of (iv) perform reasonably well). For p̃ = 6, the RMSE of both variants of PCA based estimators of (ii) are very similar, with a minimum value somewhat smaller than the minimum RMSE of the standard estimator. The performance of the standard estimator and the one based on classical PCA are almost identical for probability (iii), while the estimator with PCA based on just k = 10 largest observations is less accurate, probably because it is difficult to estimate a subspace of dimension 6 from just 10 observations. It might help to increase the number of largest observations used to estimate the supporting subspace with the dimension d, but we do not explore this idea here in order not to overload the presentation.
For the high dimensional Gumbel model with d = 100 and p = 5, by and large the findings are similar to those observed for the Dirichlet model, so that we do not show the corresponding plots. However, in this model p̃ = 4 can be ruled out by the empirical risk plot, and both PCA based estimators outperform the standard estimator of (ii).

Conclusion
To sum up, while the PCA step does not always improve the estimator of the spectral measure, for probability (i) the resulting estimators are superior to the standard estimator, and in most other cases they seem competitive if p̃ is chosen appropriately. To this end, the plot of the empirical risk is a very useful tool; this is particularly true for observations of moderate dimension. For higher dimensional data, there may be some ambiguity about the dimension of the subspace onto which the data should be projected. In case of doubt, it is advisable to choose a higher dimensional subspace, in particular for the PCA method that uses the same number of largest observations to estimate the support and to calculate the estimator of the spectral measure. The PCA estimators that determine the support based only on the 10 largest observations often exhibit a desirable insensitivity to the number of largest observations used to estimate the spectral measure, which makes them easier to apply in practice.

Proofs to Section 2
The following technical lemma comes in handy for the asymptotic analysis of the conditional risk.

Lemma 5.1. Let f : R^d → R be a measurable function that is locally bounded, P_∞-a.e. continuous and satisfies lim sup
Proof. According to (1.1), P_t(t · ) converges weakly to P_∞. Let Y_t and Y_∞ be random vectors with distribution P_t(t · ) and P_∞, respectively. Since ∫ f(x/t) P_t(dx) = E f(Y_t) and ∫ f(x) P_∞(dx) = E f(Y_∞), the assertion follows if the f(Y_t) are asymptotically uniformly integrable (see [36], Theorem 2.20).
The following two lemmas are crucial to prove convergence of the minimizers of R t .
Proof. We have to show that any sequence (V_n)_{n∈N} in V_p has a convergent subsequence. For each n, let (u_{1,n}, . . . , u_{p,n}) be an orthonormal basis of V_n, so that Π_{V_n} x = U_n U_n^⊤ x, where U_n denotes the matrix with columns u_{j,n}. The vectors (u_{j,n})_{1≤j≤p} belong to the compact set (S^{d−1})^p. Thus there exists a subsequence n′ such that u_{j,n′} → u^0_j for all 1 ≤ j ≤ p. Since ⟨u_{j,n}, u_{i,n}⟩ = δ_{i,j} for all n, we also have ⟨u^0_j, u^0_i⟩ = δ_{i,j}, and the u^0_j, j ≤ p, form an orthonormal family in R^d. Let V_0 be the space generated by the u^0_j's and denote by U_0 the matrix with these columns. Then V_0 has dimension p, i.e. V_0 ∈ V_p, and by construction Π_{V_{n′}} → Π_{V_0}, which proves the claimed compactness.
which proves the assertion.
Proof of Theorem 2.4. Suppose the assertion of the theorem were wrong. By the compactness of V_p, there then exists a sequence t_n → ∞ such that V*_{t_n} converges to some , which is strictly positive by assumption, and for sufficiently large n one may conclude a contradiction: Therefore, the assertion must be correct.
Proof of Proposition 2.6. For simplicity, we assume that F_X is continuous in the tail (so that there are no ties among the observed norms), but the proof can easily be generalized using standard techniques from the theory of regularly varying functions. First we want to replace the random threshold t̂_{n,k} with t_{n,k} in the definition of R̂_{n,k}. Since Π⊥_V is a contraction, Hölder's inequality yields where η > 0 is chosen such that (2 + η)(1 − β) < α. It is well known that t̂_{n,k}/t_{n,k} → 1 in probability. Thus there exists a sequence δ_n ↓ 0 such that P{t̂_{n,k} > (1 − δ_n)t_{n,k}} → 0. By (5.2) and the regular variation because there are exactly k exceedances of t̂_{n,k}, and either all non-vanishing differences of the indicator functions equal 1 or all equal −1, depending on whether t̂_{n,k} < t_{n,k} or t̂_{n,k} > t_{n,k}. Now, the last sum is binomially distributed with parameters n and k/n. By the central limit theorem for triangular arrays, the right hand side is of stochastic order k^{1/2}. A combination of these results shows that In view of Proposition 2.2, it thus suffices to show that in probability, with d_{n,k} := (log k) t_{n,k}. Let α* := 4(1 − β) ∨ (α + 1). Since Π⊥_V is a contraction, (1.6) implies Similarly as in the proof of Lemma 5.1, we can bound the expectation using integration by parts and Karamata's theorem: for sufficiently large n. Therefore, by the choice of d_{n,k}, which implies the convergence in probability of T_{n,1}. For the second term, we may similarly conclude from (5.2) and the definition of t_{n,k} that
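The binomial step invoked here can be made explicit. Assuming, as stated later in the proofs (cf. π_{t_{n,k}} = k/n), that t_{n,k} satisfies n P(‖X‖ > t_{n,k}) = k, a sketch of the standard variance argument is:

```latex
\sum_{i=1}^{n} \mathbf{1}\{\|X_i\| > t_{n,k}\} \sim \operatorname{Bin}(n, k/n),
\qquad
\operatorname{Var}\Bigl(\sum_{i=1}^{n} \mathbf{1}\{\|X_i\| > t_{n,k}\}\Bigr)
  = n \,\frac{k}{n}\Bigl(1-\frac{k}{n}\Bigr) \le k ,
```

so that, by Chebyshev's inequality (or the central limit theorem for triangular arrays), the centered sum is of stochastic order k^{1/2}.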
Similarly as in the analysis of the conditional risk, the following result on the equicontinuity in probability of the standardized empirical risk is central to the consistency of the empirical risk minimizer. Proof. First note that in view of (5.4), it suffices to prove the assertion with R̂_{n,k}(V) replaced by k^{−1} ∑_{i=1}^n ‖Π⊥_V Θ_i‖² 1{‖X_i‖ > t_{n,k}} and R̂_{n,k}(W) replaced by the analogous expression.
Similarly as in the proof of Lemma 5.3, we have

H. Drees and A. Sabourin
for n sufficiently large. Hence, by Markov's inequality, (5.2) and the definition of t_{n,k},

Proof of Theorem 2.7. Let R̃_{n,k} := t_{n,k}^{2(β−1)} R̂_{n,k}. Fix an arbitrary ε > 0 and let According to Lemma 5.4, there exist δ ≤ ε/2 and n_0 such that for all n ≥ n_0, with probability greater than 1 − ε/4, one has |R̃_{n,k}(V) − R̃_{n,k}(W)| ≤ η/4 for all V, W ∈ V_p such that ρ(V, W) ≤ δ. Since V_p is compact, there exists a finite cover of V_p by open balls with radius δ and centers W_1, . . . , W_m, say. By Proposition 2.6, there exists n_1 ≥ n_0 such that with probability greater than 1 − ε/2 Hence, there exists a (random) index j ∈ {1, . . . , m} such that ρ(V_n, W_j) < δ ≤ ε/2, and, on a set with probability greater than 1 − ε, By the definition of η, this implies W_j ∈ M and thus Since ε > 0 was arbitrary, this concludes the proof.

Proofs to Section 3
The proofs rely on some well-known facts about Hilbert spaces, which we recall in a version specialized to the present setting. Let (e_i)_{1≤i≤d} be an arbitrary orthonormal basis of R^d and denote by ⟨· , ·⟩ the usual inner product on R^d. Finally, for independent centered random matrices We start with a uniform bound on the difference between R̂_{n,k} and R̂^t_{n,k}, defined in (3.2).
Proof. In view of (5.3) and π_{t_{n,k}} = k/n, we have for all V ∈ V_p, By Bernstein's inequality ([24, Theorem 2.7]), it follows that and hence the assertion.
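For the reader's convenience, the classical form of Bernstein's inequality behind this step reads as follows (the exact statement and constants of [24, Theorem 2.7] may differ slightly): for independent centered random variables Y_1, …, Y_n with |Y_i| ≤ b almost surely and v := ∑_{i=1}^n E Y_i²,

```latex
\mathbb{P}\Bigl( \sum_{i=1}^{n} Y_i \ge u \Bigr)
  \le \exp\Bigl( - \frac{u^2}{2\,(v + b u / 3)} \Bigr),
  \qquad u \ge 0 .
```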
We next prove concentration inequalities for ϕ±_t(X_1, . . . , X_n) defined in (3.4), using a version of the bounded difference inequality from [24, Theorem 3.8], which we recall for convenience. For brevity's sake, in what follows we use the notation x_{i:j} := (x_i, . . . , x_j) for a subvector of (x_1, . . . , x_n). If both maxdev⁺ and v̂ are finite then, for all u ≥ 0, Lemma 5.7. Under (3.1), one has, for all u > 0, Proof. The assertion follows immediately from Theorem 5.6 applied to ϕ±_t and the following bounds: Var Next, the expectations E ϕ±_t(X_1, . . . , X_n) are bounded using arguments from [4].
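The simplest member of this family of bounded difference inequalities is McDiarmid's inequality, which we state for orientation (the version of [24, Theorem 3.8] used above is a refinement involving the maximal positive deviation maxdev⁺ and a variance proxy v̂): if X_1, …, X_n are independent and |f(x) − f(x′)| ≤ c_i whenever x and x′ differ only in the i-th coordinate, then

```latex
\mathbb{P}\bigl( f(X_{1:n}) - \mathbb{E}\, f(X_{1:n}) \ge u \bigr)
  \le \exp\Bigl( - \frac{2 u^2}{\sum_{i=1}^{n} c_i^2} \Bigr),
  \qquad u \ge 0 .
```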
Proof. Since, by (5.6), ‖Π_W x‖² = ⟨Π_W x, x⟩ = ⟨Π_W, xx^⊤⟩_HS for any linear subspace W and any x ∈ R^d, using the bilinearity of the inner product and the Cauchy–Schwarz inequality in the Hilbert–Schmidt space, we obtain Using (5.5) and taking the supremum over all V ∈ V_p and the expectation, one arrives at On the other hand, by first rewriting Π⊥ Now, by Jensen's inequality and (5.7), E‖(P_n − P)(θ_t θ_t^⊤)‖_HS ≤ (E‖(P_n − P)(θ_t θ_t^⊤)‖²_HS)^{1/2} Combining this with (5.10) and (5.11), we arrive at It remains to show that E‖Θ_t Θ_t^⊤ − E(Θ_t Θ_t^⊤)‖²_HS = π_t E_t‖Θ‖⁴ − π_t² tr(Σ_t²). From the representation of the Hilbert–Schmidt norm by the trace operator and the linearity of the latter, one may conclude by direct calculations that = tr(E(Θ_t Θ_t^⊤)²) − tr((E(Θ_t Θ_t^⊤))²) = tr(π_t E_t(ΘΘ^⊤)²) − tr((π_t E_t ΘΘ^⊤)²) = π_t E_t tr((ΘΘ^⊤)²) − π_t² tr(Σ_t²).
Hence the assertion follows from tr((ΘΘ^⊤)²) = ‖Θ‖⁴. Recall from (3.3) that the second term can be written as sup_{V∈V_p} |R̂^t_{n,k}(V) − R^t_{n,k}(V)| = (n/k) max(ϕ⁺_{t_{n,k}}(X_{1:n}), ϕ⁻_{t_{n,k}}(X_{1:n})).
Hence, Lemma 5.7, Lemma 5.8 and π_{t_{n,k}} = k/n immediately yield Combine this with the bound on the first term given by Lemma 5.5 to conclude the proof of the first assertion. Check that for u := log(4/δ)/(3k) + [ (log(4/δ)/(3k))² + (2/k)(1 + k/n) log(4/δ) ]^{1/2} both exponential expressions on the right hand side of (3.5) equal δ/4, and so the upper bound equals δ. Hence the remaining assertions follow from Proof of Theorem 3.3. Define i.i.d. random vectors Z_i whose distribution equals the conditional distribution of Θ given ‖X‖ > t. Recall that Θ^{(i)} := θ(X^{(i)}), where X^{(i)} is the vector X_j with the i-th largest norm among X_1, . . . , X_n. Then, conditionally on N_t = ℓ, the joint distribution of the empirical risk R̂_t(V) and Θ^{(1)}, . . . , Θ^{(ℓ)} equals the joint distribution of ℓ^{−1} ∑_{i=1}^{ℓ} ‖Π⊥_V Z_i‖² and the order statistics of Z_1, . . . , Z_ℓ. Therefore, the proof of Theorem 3.1 of [4] (with M = 1 and L = 2), combined with arguments given in the proof of Lemma 5.8, shows that with probability at least 1 − 2 exp(−2ℓu²) − exp(−(ℓ/2) v²/2), conditionally on N_t = ℓ, Since the proof of Theorem 3.1 of [4] is quite tersely formulated in a more abstract setting and contains a minor inaccuracy, for convenience we give more details of the proof of (5.13) in the Appendix. Similarly as in the proof of Lemma 5.8, the first assertion thus follows from ∑_{i,j=1}^{ℓ} where in the last step we have used that, on {N_t = ℓ}, the set of non-vanishing vectors Θ_{i,t} equals the set of non-vanishing random vectors Θ^{(i)}. The remaining assertions are now obvious.
Proof of Remark 3.5. The (modified) proof of Theorem 3.1 of [4] shows that with S*_t := E_t‖Θ‖⁴ − tr(Σ_t²) (cf. (A.1) and (A.2)). Observe that R̃_t(V) = N_t R̂_t(V)/(nπ_t). On the set M_t(v) := {|N_t − nπ_t| ≤ nπ_t v}, one thus has since R_t(V) ≤ 1. Moreover, for t = t_{n,k}, it is shown in the proof of Theorem 3.1 that sup_{V∈V_p} |R̂_{n,k}(V) − R̂^t_{n,k}(V)| ≤ v on the set M_{t_{n,k}} = {N_{t_{n,k}} ∈ [k(1 − v), k(1 + v)]} and that P(M^c_{t_{n,k}}) ≤ 2 exp(−kv²/(2(1 + v/3))). Hence,

Appendix: Details of the proof of (5.13)

Recall that the Z_i are i.i.d. random vectors whose distribution equals the conditional distribution of Θ given ‖X‖ > t. Let First note that for all z, z̃ ∈ B_1(0). Thus a simple version of the bounded difference inequality (see, e.g., Theorem 3.1 of [24]) gives Exactly in the same way as in the proof of Lemma 5.8 (cf. (5.12)), one obtains