$k$-means clustering of extremes

The $k$-means clustering algorithm and its variant, the spherical $k$-means clustering, are among the most important and popular methods in unsupervised learning and pattern detection. In this paper, we explore how the spherical $k$-means algorithm can be applied in the analysis of only the extremal observations from a data set. By making use of multivariate extreme value analysis we show how it can be adapted to find "prototypes" of extremal dependence and we derive a consistency result for our suggested estimator. In the special case of max-linear models we show furthermore that our procedure provides an alternative way of statistical inference for this class of models. Finally, we provide data examples which show that our method is able to find relevant patterns in extremal observations and allows us to classify extremal events.


Introduction
When looking at multivariate and in particular high-dimensional data, it is one of the most important objectives of a statistical analysis to detect structures and patterns in the observations so as to simplify their complexity. This task has led to the development of an abundance of procedures in the field of unsupervised learning, see for example Hastie et al. [2009] for an overview. Usually the analysis is focused on describing the bulk of the data and one looks for results that apply to most observations at hand. However, when extremal observations are of interest, a different approach needs to be taken. In this paper we consider the complexity reduction of extremal observations via clustering.
For a specific dataset, a naive implementation is to choose the observations with the largest norm (thereby considering them as extremal observations) and apply a clustering algorithm to them. However, this proves to be inefficient as extremal points are typically spread out in space. In the presence of heavy-tailed observations, most classical clustering algorithms would face additional problems due to possibly infinite second moments. In order to allow for robust and meaningful estimation, one should incorporate structural results about the particular kind of data at hand into the estimation procedure.
Multivariate extreme value theory (MEVT) provides us with such a framework and has useful applications in a wide range of disciplines, such as finance and climate science, see for example Fougères [2003] and Davison et al. [2012] for an overview. Most of the current parametric models and estimation methods focus on the bivariate or lower-dimensional scenario and are difficult to generalize to higher dimensions, due to either a lack of flexibility or a heavy computational load, see Davison and Huser [2015].
Recently there have been a few attempts to adapt complexity reduction to extremal dependence. One way of doing so is applying classical dimension reduction techniques to (transformed) extremal observations, in the form of principal component analysis and related covariance matrix decomposition techniques, see Haug et al. [2009] and Cooley and Thibaud [2016], or empirical basis functions, see Morris et al. [2018]. Another direction of research is aimed at dividing the parameter space into lower-dimensional subspaces by making use of the phenomenon of asymptotic independence (meaning that in extremal events there are often only a few components which are large at the same time), see Goix et al. [2017] and Chiapino et al. [2019]. Chautru [2015] identifies relevant subspaces by first reducing the dimension of projected observations by a spherical principal components procedure, then clustering the projected data using spherical $k$-means, and as a last step attributing a lower-dimensional subspace to each cluster. Finally, Bernard et al. [2013] presents a classification approach with special emphasis on a spatial decomposition of separated regional clusters for extremal events. The methodology is based on a $k$-means-like algorithm, where distances between two stations are measured by their F-madogram.
The usage of $k$-means estimation in extremes is therefore not entirely new (see also Einmahl et al. [2012], where the procedure is mentioned to produce starting values for numerical estimation algorithms), but so far it has only been applied as an intermediate step towards a specific goal and its theoretical properties have not been explored. The aim of this paper is twofold: First, we provide the theoretical background as to how a $k$-means algorithm applied to extremal observations can be constructed as a consistent estimator of theoretical extremal cluster centers, see Theorem 3.1. Second, we demonstrate that these cluster centers themselves can be seen as prototypes of directions of extremal events and the algorithm therefore provides a comprehensive, computationally fast and robust procedure to interpret observed extremes. As a side effect we demonstrate that our procedure can be seen as an alternative, consistent way of estimating relevant components of max-linear models, which recently gained popularity in applications due to their relationship to causal models for extremes as implied by directed acyclic graph (DAG) models, see Gissibl [2018], Gissibl et al. [2018a,b].
The paper is organized as follows: Section 2 provides a short background on the two main components of our method, MEVT and the spherical $k$-means algorithm. In Section 3, we present a general consistency result for the spherical $k$-means algorithm in the extremal setting and construct a non-parametric estimator for the theoretical cluster centers. The application of our procedure to the particular class of max-linear models is outlined in Section 4. Finally, we demonstrate the application and interpretation of our results by looking at three data examples in Section 5.

Multivariate extreme value theory
In order to describe the extremal behavior of a random vector, a typical assumption is that the componentwise maxima, generated from i.i.d. copies of this vector, converge jointly to a non-degenerate limit distribution after proper linear normalization. More formally, let $(X_{i1}, \ldots, X_{id})$, $i \in \mathbb{N}$, be i.i.d. copies of the random vector $X = (X_1, \ldots, X_d)$. We assume that there exist sequences of constants $a_{nj} > 0$, $b_{nj} \in \mathbb{R}$, $1 \leq j \leq d$, $n \in \mathbb{N}$, and a (in each margin non-degenerate) distribution function $G$ such that
$$\lim_{n \to \infty} P\left(\max_{i=1,\ldots,n} \frac{X_{i1} - b_{n1}}{a_{n1}} \leq x_1, \ldots, \max_{i=1,\ldots,n} \frac{X_{id} - b_{nd}}{a_{nd}} \leq x_d\right) = G(x_1, \ldots, x_d) \quad (2.1)$$
for all continuity points $(x_1, \ldots, x_d)$ of $G$. We say that $X$ is in the max-domain of attraction of the extreme value distribution $G$. A central result of extreme value theory is that this convergence can be broken down into two separate components. First, all marginal distributions $G_i$, $1 \leq i \leq d$, of $G$ have to be univariate extreme value distributions of the form
$$G_i(x) = \exp\left(-\left(1 + \gamma_i \frac{x - \mu_i}{\sigma_i}\right)^{-1/\gamma_i}\right), \quad 1 + \gamma_i \frac{x - \mu_i}{\sigma_i} > 0, \quad (2.2)$$
where for $\gamma_i = 0$ the right hand side is interpreted as $\exp(-\exp(-(x - \mu_i)/\sigma_i))$, $x \in \mathbb{R}$. The parameter $\gamma_i$, known as the extreme value index, is the most important parameter in describing the univariate extremal behavior of component $i$. Here a wealth of statistical procedures exists for univariate extremes, see de Haan and Ferreira [2007], Chapters 3 and 4, for an overview. Second, let $F_i$ be the (continuous) marginal distributions of $X_i$, $i = 1, \ldots, d$. Then the convergence in (2.1) holds if and only if the standardized vector
$$Y = (Y_1, \ldots, Y_d) \quad \text{with} \quad Y_i = \frac{1}{1 - F_i(X_i)}, \quad i = 1, \ldots, d, \quad (2.3)$$
satisfies
$$\lim_{u \to \infty} P\left(\frac{Y}{\|Y\|} \in B \,\Big|\, \|Y\| > u\right) = S(B) \quad (2.4)$$
for a probability measure $S$ on $S^{d-1}_+$, where $\|\cdot\|$ stands for an arbitrary but fixed norm, and its continuity Borel sets $B$, see Beirlant et al. [2006], Chapter 8. The measure $S$ in (2.4) thus describes the limiting behavior of the directions that we see in extremal observations from $Y$.
Furthermore, the measure $S$ is connected to the limiting behavior of maxima as described in the following. The transformed $Y$ satisfies
$$\lim_{n \to \infty} P\left(\max_{i=1,\ldots,n} \frac{Y_{i1}}{n} \leq y_1, \ldots, \max_{i=1,\ldots,n} \frac{Y_{id}}{n} \leq y_d\right) = G_0(y_1, \ldots, y_d),$$
where $G_0$ is an extreme-value distribution with standard Fréchet margins (i.e., $\gamma_i = \sigma_i = 1$, $\mu_i = 0$ in (2.2)). For $G_0$ there exists a so-called exponent measure $\nu$ such that
$$G_0(y_1, \ldots, y_d) = \exp\left(-\nu\left([0, \infty)^d \setminus [0, y_1] \times \cdots \times [0, y_d]\right)\right). \quad (2.5)$$
This exponent measure is homogeneous of degree $-1$ and there exists a constant $c > 0$ such that
$$\nu\left(\left\{x : \|x\| > y, \, \frac{x}{\|x\|} \in B\right\}\right) = c\, y^{-1} S(B) \quad (2.6)$$
for Borel sets $B \subset S^{d-1}_+$ and $y > 0$. The same measure $S$ appearing in (2.4) and (2.6) is called the spectral measure of $X$ or $Y$, and by the above it uniquely describes the dependence structure (or copula) of both exceedances and maxima. Due to the marginal standardization of the vector $Y$ there always exists a constant $c > 0$ such that
$$\int_{S^{d-1}_+} w_i \, S(dw) = c \quad (2.7)$$
for all $i = 1, \ldots, d$. In this analysis, we are less interested in the marginal extremal behavior of individual components and more concerned with the extremal dependence structure. Hence we are interested in the structure of $S$.
Note that even for a vector $X$ whose marginals are not in the domain of attraction of an extreme value distribution, it can still make sense to define (2.3) and look at (2.4) in order to describe its extremal behavior. On the other hand, there may also be situations where one does not want to standardize marginals first but treats observations as coming from a random vector $Y$ which satisfies (2.4) with a spectral measure $S$ that does not necessarily satisfy (2.7). In any case, the measure $S$ tells us about the angle or direction of an observation that is considered extreme, either on the original scale of observations or after standardization. If there exist small sets which receive relatively high probabilities under $S$, these sets can be seen as "typical" directions for an extremal event. The idea of this paper is to identify these sets without assuming a specific underlying model, thereby identifying extremal patterns in a non-parametric way.
We have until now not specified which norm we meant when writing $\|\cdot\|$. The general equivalence between convergence of multivariate maxima in (2.1) and marginal convergences together with convergence of exceedances as described in (2.4) holds for any choice of norm $\|\cdot\|$, although the particular choice will of course affect the specific form of the spectral measure $S$. Depending on the particular application, there may be different possible choices for the particular norm. The main results in Section 3 are therefore formulated in a general way that holds for any choice of norm. For our simulations in Section 4 and data examples in Section 5 we use the Euclidean norm $\|\cdot\|_2$.

k-means and spherical k-means
The $k$-means clustering procedure is a way to identify distinct groups within a population. The name was first introduced in MacQueen [1967], although the ideas behind the algorithm date back further, see Bock [2008]. The motivation is to identify cluster centers such that the distances of the observations to their nearest cluster centers are minimized. Accordingly, all observations which are closest to the same cluster center are viewed as belonging to the same group.
In the following, let $d : \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty)$ be a distance function or, more generally, a dissimilarity function in $\mathbb{R}^d$ (see Gan et al. [2007], Chapter 6). For a probability measure $P$ on $\mathcal{B}(\mathbb{R}^d)$ and a set $A = \{a_1, \ldots, a_k\}$, $a_i \in \mathbb{R}^d$ for $i = 1, \ldots, k$ and $k \in \mathbb{N}$, one can introduce the averaged distance from any observation to the closest element of $A$ as
$$W(A, P) = \int \min_{a \in A} d(x, a) \, P(dx). \quad (2.8)$$
For given $P$ and $k$, a set $A_k$ which minimizes $W(A, P)$ among all $A$ with $|A| \leq k$ can be seen as a set of theoretical cluster centers. Note that this set is not necessarily unique, in which case no clear cluster centers can be identified.
If we replace $P$ in (2.8) by its sample version $P_n$ (i.e., the measure that places mass $1/n$ on each observation $x_1, \ldots, x_n$ of a sample) and derive an accordingly optimal set $\hat{A}^n_k$, its components minimize the sum of the distances from every observation to its nearest cluster center. While the original version of $k$-means uses the Euclidean distance, several alternatives to the algorithm with different choices for $d$ have been suggested. Since our central interest is in the spectral measure and a sample from it has mass only on the unit sphere $S^{d-1}_+$, it seems natural to measure the distance between two points by their angle, which is then in fact independent of the chosen norm. Therefore we make use of the spherical $k$-means procedure of Dhillon and Modha [2001], which measures differences in terms of angular dissimilarity. We thus set
$$d_\varphi(x, y) = 1 - \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2},$$
or simply $d_\varphi(x, y) = 1 - \langle x, y \rangle$ if we restrict $x, y$ to the Euclidean unit sphere. Since the optimal cluster centers under this dissimilarity are determined by their direction only, for $P$ living on $S^{d-1}_+$ for a chosen norm, the cluster centers will be placed on $S^{d-1}_+$ by convention, so that we can interpret our cluster centers in the same way as we interpret the projections of extreme observations. Note that, in order to allow for more flexibility in the choice of a suitable distance measure, the results in Section 3 only assume that $d$ is a continuous dissimilarity function on $S^{d-1}_+ \times S^{d-1}_+$. It should be noted that finding the optimal cluster centers for a given $P$ can be an NP-hard problem (see Mahajan et al. [2012]) and the known iterative algorithms often depend crucially on the initial cluster centers, see Bradley and Fayyad [1998]. For the examples and simulations in Sections 4 and 5 we rely on the R-package skmeans by Hornik et al. [2012], which provides short run-times and stable results.
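To make the iteration concrete, here is a minimal, self-contained sketch of a Lloyd-style spherical $k$-means under the dissimilarity $d_\varphi$; the function names and the random initialization from the data points are our own illustrative choices and not part of the skmeans implementation used in the analyses:

```python
import math
import random

def d_phi(x, y):
    # angular dissimilarity 1 - <x, y> for (approximately) unit vectors
    return 1.0 - sum(a * b for a, b in zip(x, y))

def normalize(x):
    n = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / n for v in x]

def spherical_kmeans(points, k, iters=50, seed=0):
    # Lloyd-style iteration: assign each point to the center with smallest
    # angular dissimilarity, then replace each center by the normalized
    # mean direction of its cluster
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: d_phi(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = normalize([sum(c) for c in zip(*members)])
    return centers, labels
```

On well-separated directional data this recovers one center per group of directions; in practice one would use a robust implementation with multiple restarts such as skmeans.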

The main result
In this section we formally introduce our estimation procedure, which will, for a given sample, provide a set of empirical cluster centers. Each center can then be interpreted as a "dependence prototype" for a particular class of extremal events. In brief, our procedure looks as follows:
1. With the help of the empirical distribution function, transform a sample from the distribution of X into (approximately) a sample from Y as in (2.3).
2. Choose a fraction of the latter sample that only keeps the transformed observations with largest norm.
3. For the chosen subsample, project the transformed observations onto the corresponding unit sphere.
4. Apply a spherical k-means procedure to the projected observations.
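The first three steps above can be sketched as follows; the retained fraction and the use of the Euclidean norm are illustrative choices (in Section 5 we use the largest 10% or 5% of norms), and the function names are hypothetical:

```python
import math

def rank_transform(data):
    # step 1: marginal standardization via the empirical distribution function,
    # giving Y_ij = n / (n - rank_ij + 1), with ranks computed per margin
    n, d = len(data), len(data[0])
    cols = list(zip(*data))
    transformed = []
    for i in range(n):
        row = []
        for j in range(d):
            rank = sum(1 for v in cols[j] if v <= data[i][j])
            row.append(n / (n - rank + 1))
        transformed.append(row)
    return transformed

def extremal_directions(data, fraction=0.1):
    # steps 2-3: keep the transformed observations with largest Euclidean
    # norm and project them onto the unit sphere
    y = rank_transform(data)
    norms = [math.sqrt(sum(v * v for v in row)) for row in y]
    m = max(1, int(fraction * len(y)))
    idx = sorted(range(len(y)), key=lambda i: -norms[i])[:m]
    return [[v / norms[i] for v in y[i]] for i in idx]
```

The output of `extremal_directions` is the point cloud on the unit sphere to which step 4, the spherical $k$-means procedure, is then applied.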
More details about the steps can be found below. Note that steps 1.-3. in the above procedure are a way of generating a "pseudo-sample" from the spectral measure $S$ of standardized observations. But there exist many different ways of statistical inference for $S$, both for standardized and non-standardized data, from non-parametric (e.g. Einmahl et al. [2001], Einmahl and Segers [2009]), over semi-parametric (e.g. Einmahl et al. [1997]), to fully parametric procedures (e.g. Coles and Tawn [1991]). The following theorem is therefore formulated in a way such that it holds for any estimator of the spectral measure, as long as it is weakly or strongly consistent.
Theorem 3.1. Assume that $S$ is a probability measure on $\mathcal{B}(S^{d-1}_+)$ and that $S_n$, $n \in \mathbb{N}$, is a sequence of random probability measures on $\mathcal{B}(S^{d-1}_+)$ on a common probability space $(\Omega, \mathcal{A}, P)$. Furthermore, assume that $d$ is a continuous dissimilarity function on $S^{d-1}_+ \times S^{d-1}_+$. For each $S_n$ and a given value of $k \in \mathbb{N}$, denote by $A^n_k$ a random set which minimizes $W(A, S_n)$ among all sets $A$ with at most $k$ elements from $S^{d-1}_+$. Accordingly, if we replace $S_n$ by $S$, denote the optimal set by $A_k$, and assume that, for the given value of $k$, the set $A_k$ is uniquely determined.
a) If $S_n \to S$ weakly in probability, then $A^n_k \to A_k$ in probability.
b) If $S_n \to S$ weakly almost surely, then $A^n_k \to A_k$ almost surely.
Remark 3.2. In the above theorem, the convergence of sets is formally meant in the Hausdorff distance, but since all involved sets have only finitely many elements, it implies pointwise convergence of elements after a suitable reordering.
Proof of Theorem 3.1. The argumentation is very similar to the one used in Pollard [1982], and the crucial ingredients for the proof are the continuity of $d$ and the compactness of $S^{d-1}_+$. First, the assumption a) or b) implies that for a fixed set $A$ we have $W(A, S_n) \to W(A, S)$ in probability or almost surely, respectively, since $d$ is continuous. For a finite number of sets $A_1, \ldots, A_m$ with at most $k$ elements from $S^{d-1}_+$, the joint convergence
$$\max_{1 \leq i \leq m} |W(A_i, S_n) - W(A_i, S)| \to 0 \quad (3.1)$$
in the respective mode of convergence follows then immediately. Due to the continuity of $d$ and the compactness of its support, we can furthermore find for each $\delta > 0$ an $m \geq 1$ and sets $A_1, \ldots, A_m$ with at most $k$ elements from $S^{d-1}_+$ such that
$$\min_{1 \leq i \leq m} \sup_{x \in S^{d-1}_+} \left| \min_{a \in A} d(x, a) - \min_{a \in A_i} d(x, a) \right| \leq \delta$$
for all $A$ with at most $k$ elements from $S^{d-1}_+$. This in turn implies that
$$\min_{1 \leq i \leq m} |W(A, P) - W(A_i, P)| \leq \delta$$
for all probability measures $P$ on $\mathcal{B}(S^{d-1}_+)$, and therefore
$$\sup_A |W(A, S_n) - W(A, S)| \leq \max_{1 \leq i \leq m} |W(A_i, S_n) - W(A_i, S)| + 2\delta,$$
where the supremum is taken over all $A$ with at most $k$ elements from $S^{d-1}_+$. Convergence (3.1) then implies
$$\sup_A |W(A, S_n) - W(A, S)| \to 0$$
in probability or almost surely, respectively. Thereby we also have $W(A^n_k, S) \to W(A_k, S)$ for our optimal sets $A^n_k$ and $A_k$. Due to the continuity of $d$ and the compactness of its support, the assumption of a unique optimal set $A_k$ implies that for each $\delta > 0$ there exists an $\varepsilon > 0$ such that every set $A$ with at most $k$ elements and Hausdorff distance at least $\delta$ from $A_k$ satisfies $W(A, S) \geq W(A_k, S) + \varepsilon$. Combined with the convergence above, this yields $A^n_k \to A_k$ in the respective mode of convergence.
Since the idea of our approach is to detect general patterns without relying on a particular model, we choose for the rest of the analysis a straightforward non-parametric estimator of the spectral measure of standardized observations, which is a natural empirical counterpart to (2.4) and a slight modification of the estimator introduced in Einmahl et al. [2001].
Note that the spectral measure of $X$ is defined in terms of the random vector $Y$ from (2.3), but that we do not know the marginal distributions $F_1, \ldots, F_d$. A solution to this is to replace $F_j$ by the left-continuous version $\hat{F}_j$ of the empirical distribution function and to transform the observations to
$$\hat{Y}_{ij} = \frac{1}{1 - \hat{F}_j(X_{ij})}, \quad i = 1, \ldots, n, \; j = 1, \ldots, d. \quad (3.2)$$
The empirical counterpart of (2.4) then motivates the estimator
$$\hat{S}_n(B) := \frac{\sum_{i=1}^n \mathbb{1}\{\|\hat{Y}_i\| > n/l_n, \, \hat{Y}_i / \|\hat{Y}_i\| \in B\}}{\sum_{i=1}^n \mathbb{1}\{\|\hat{Y}_i\| > n/l_n\}} \quad (3.3)$$
for Borel subsets $B$ of $S^{d-1}_+$, where $l_n \in \mathbb{N}$. Note that, due to their definition, the observed components of the $\hat{Y}_i$'s will have values in $\{1, n/(n-1), \ldots, n/2, n\}$. Therefore, the value of $l_n$ should grow with $n$ in order to obtain a consistent estimator, but the growth rate should also not be too fast in order to catch only the extremal observations. Proposition 3.3 below gives sufficient assumptions on $l_n$ for the weak and strong consistency of this estimator. In a second step, for a chosen value of $l_n$, we can then apply a spherical $k$-means algorithm to $\hat{S}_n$, i.e. to the selected projections of extreme observations onto the unit sphere. The following proposition shows that the resulting estimators are consistent.
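A sketch of how the pseudo-sample underlying this estimator can be formed from already-standardized observations; the function name is hypothetical and the Euclidean norm is used for concreteness. The empirical measure of the returned projections is then passed to the spherical $k$-means step:

```python
import math

def spectral_sample(y, l_n):
    # pseudo-sample from the spectral measure: keep the standardized
    # observations with norm above the threshold n / l_n and project
    # them onto the unit sphere
    n = len(y)
    threshold = n / l_n
    out = []
    for row in y:
        norm = math.sqrt(sum(v * v for v in row))
        if norm > threshold:
            out.append([v / norm for v in row])
    return out
```

Larger values of $l_n$ lower the threshold and hence retain more (less extreme) observations, mirroring the bias-variance trade-off in the choice of $l_n$.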
Proposition 3.3. Let $X_i$, $i \in \mathbb{N}$, be i.i.d. copies of a vector $X$ such that the transformed vector $Y$ from (2.3) satisfies (2.4) with spectral measure $S$. For $S_n = \hat{S}_n$ as in (3.3), define the sets $A^n_k$ and $A_k$ as in Theorem 3.1 and assume that the set $A_k$ is uniquely determined for a given value of $k$. Then $A^n_k \to A_k$ in probability if $l_n \to \infty$ and $l_n/n \to 0$, and almost surely if in addition $l_n/\log(\log(n)) \to \infty$.
Proof. The proposition follows from Theorem 3.1 if we can verify that the sequence of random measures $S_n$ satisfies the corresponding assumptions. To see this, we argue similarly to Einmahl et al. [2001], Theorem 1. We start by looking at the estimator for the so-called stable tail dependence function
$$l(x_1, \ldots, x_d) = \nu\left([0, \infty)^d \setminus [0, 1/x_1] \times \cdots \times [0, 1/x_d]\right),$$
with $\nu$ as introduced in (2.5). For $l_n/n \to 0$, $l_n \to \infty$ the estimator converges in probability (see Huang [1992]) and for $l_n/n \to 0$, $l_n/\log(\log(n)) \to \infty$ it converges almost surely (see Qi [1997]) for all $(x_1, \ldots, x_d) \in (0, \infty)^d$. This pointwise convergence can be extended to convergence of the empirical counterparts of $\nu(c \cdot B)$ for each $c > 0$ and continuity set $B \subset [0, \infty)^d$, in the respective mode of convergence, see the proof of Theorem 1 in Einmahl et al. [2001] for details. Due to the equivalence of norms there exists a $c_0 > 0$ such that $\|x\| > 1$ implies $\|x\|_\infty > 1/c_0$ for the chosen norm. This implies the weak convergence, either in probability or almost surely, of $\hat{S}_n$ to $S$ (see again Einmahl et al. [2001], Theorem 1) and thereby that assumption a) or b), respectively, of Theorem 3.1 is satisfied.

Application to max-linear models
The idea behind the decomposition of a spectral measure into $k$ clusters is motivated by the special case where the spectral measure is clearly concentrated around $k$ different centers. An idealized example is provided by the max-linear model, where the spectral measure has mass on only $k$ different points. In the following we will demonstrate how our $k$-means procedure can be seen as an alternative way of estimating the parameters of this model class.
A max-linear model consists of $k$ different so-called factors $a_i = (a_{i1}, \ldots, a_{id}) \in [0, \infty)^d$, $i = 1, \ldots, k$, which generate the components of $X$ via
$$X_j = \max_{i=1,\ldots,k} a_{ij} Z_i, \quad j = 1, \ldots, d, \quad (4.1)$$
where $Z_1, \ldots, Z_k$ are i.i.d. random variables with the same heavy-tailed distribution. The most common choice for this distribution is a standard Fréchet distribution. Furthermore, one typically assumes that $\sum_{i=1}^k a_{ij} = 1$ for all $j = 1, \ldots, d$, such that all margins of $X$ are standard Fréchet as well.
Looking at (4.1) it is clear that the largest observations of $X$ are due to a large observation of a $Z_i$, and therefore the factors $a_i$ determine the possible directions of extremal observations. In fact, one can show that the spectral measure $S$ concentrates on the points $s_i = a_i / \|a_i\|$ with corresponding probabilities $p_i = \|a_i\| / (\sum_{j=1}^k \|a_j\|)$, $1 \leq i \leq k$. On the other hand, for each discrete spectral measure with mass concentrated on $k$ points there exists a max-linear model with $k$ factors which results in this given spectral measure, see Yuen and Stoev [2014]. It is also shown that any given dependence structure of extremes can be approximated arbitrarily well by spectral measures generated from max-linear models if one allows the number of factors to grow, see Fougères et al. [2013]. Furthermore, it was recently shown in Gissibl [2018], Gissibl et al. [2018a,b] that max-linear models also arise from a natural modeling of extremal dependence generated by a directed acyclic graph of components, thereby allowing the modeling and detection of causality in extreme events.
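As an illustration, the atoms $s_i$ and weights $p_i$ of the spectral measure can be computed directly from the factors, and observations can be simulated from (4.1); the helper names are hypothetical and the Euclidean norm is used for concreteness:

```python
import math

def spectral_atoms(factors):
    # factors: list of k vectors a_i in [0, inf)^d; the spectral measure of
    # the max-linear model puts mass p_i = ||a_i|| / sum_j ||a_j|| at the
    # point s_i = a_i / ||a_i|| on the unit sphere
    norms = [math.sqrt(sum(v * v for v in a)) for a in factors]
    total = sum(norms)
    s = [[v / n for v in a] for a, n in zip(factors, norms)]
    p = [n / total for n in norms]
    return s, p

def max_linear_sample(factors, z):
    # one observation X_j = max_i a_ij * Z_i from the factor loadings and a
    # vector z of k heavy-tailed factor realizations (e.g. standard Frechet)
    k, d = len(factors), len(factors[0])
    return [max(factors[i][j] * z[i] for i in range(k)) for j in range(d)]
```

When a single $z_i$ is very large, the simulated observation points (approximately) in the direction of $a_i$, which is exactly the clustering structure the spherical $k$-means procedure exploits.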
Parameter estimation of max-linear models has proven to be a difficult task, and previous approaches to estimate model parameters based on extremal observations had to address the fact that no spectral density exists, which excludes standard maximum likelihood procedures. Instead, Einmahl et al. [2012], Einmahl et al. [2016] and Einmahl et al. [2018] use a least squares estimator based on the stable tail dependence function to estimate parameters from extremal observations, while Yuen and Stoev [2014] construct a least squares estimator based on the joint distribution function and make use of all observations. In the following we illustrate how the $k$-means procedure serves as an alternative and effective way of inference for max-linear models.
From the above it follows that the discrete spectral measure, i.e., the points $s_1, \ldots, s_k \in S^{d-1}_+$ on which the spectral measure $S$ is concentrated and the corresponding probabilities $p_1, \ldots, p_k$, is an equivalent way of parametrizing a max-linear model. For such a spectral measure, it is clear that $W(A, S)$ as defined in (2.8) is minimized by choosing $A = \{s_1, \ldots, s_k\}$, and this $A$ is uniquely determined. For an estimator $S_n$ of $S$ which satisfies the assumptions from the previous section, the $k$-means cluster centers can therefore be seen as consistent estimators of $s_1, \ldots, s_k$. This consistency also implies that the percentages of points which are classified as belonging to cluster $i$ converge to the corresponding probability $p_i$, $1 \leq i \leq k$. Especially if one is interested in using the max-linear model as an approximation for the largest observations only, the estimation of the $s_i$'s and $p_i$'s can be seen as an alternative to the estimation of the factors $a_i$.
In order to compare the spherical $k$-means procedure with the previously mentioned approaches, we set up a small simulation study. To this end, we first generate a random parameter constellation for a max-linear model with $d$ dimensions and $k$ factors. We then estimate the factor coefficients according to Einmahl et al. [2016] and Einmahl et al. [2018], as provided by the R-package of Kiriliouk [2016], and Yuen and Stoev [2014], as provided by Yuen [2015], and derive the resulting spectral measure. We then compare the estimators for $s_1, \ldots, s_k$ and $p_1, \ldots, p_k$ as derived from the estimated parameters of the max-linear models on the one hand, and as derived directly by the $k$-means estimator $S^k_n$ on the other hand, which puts mass $\hat{p}_i$ in the cluster center $\hat{s}_i$, where
$$\hat{p}_i = S_n\left(\left\{x \in S^{d-1}_+ : d_\varphi(x, \hat{s}_i) \leq d_\varphi(x, \hat{s}_j) \text{ for all } j \neq i \right\}\right),$$
i.e. the percentage of observations that are classified as belonging to cluster $i$. The difference between the true spectral measure $S$ and the corresponding estimator is evaluated by two different criteria. For the estimated cluster centers, we first derive
$$d_s(S^k_n, S) = \min_{\pi} \sum_{i=1}^k d_\varphi(\hat{s}_{\pi(i)}, s_i),$$
where the minimum is taken over all permutations $\pi$ of $\{1, \ldots, k\}$. This can be seen as a distance measure similar to the Hausdorff distance applied to finite sets, but taking into account all distances of the matched vectors instead of only the maximal one. It gives an idea about how well the estimator identifies possible extremal directions, but does not take into account their frequencies. To this end we also look at a metric on the space of probability measures, where we use the Wasserstein metric with $p = 1$, which is defined as
$$W_1(S, S_n) = \inf_{\gamma \in \Gamma(S, S_n)} \int d_\varphi(x, y) \, \gamma(dx, dy),$$
where $\Gamma(S, S_n)$ is the set of all probability measures on $S^{d-1}_+ \times S^{d-1}_+$ with first marginal $S$ and second marginal $S_n$. We use the R-package transport, see Schuhmacher et al. [2019], for the evaluation of Wasserstein distances.
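For small $k$, a matched distance of this kind can be computed by brute force over all pairings, as in the following sketch (a stand-in for the evaluation code used in the study, whose exact implementation may differ):

```python
import itertools
import math

def d_phi(x, y):
    # angular dissimilarity between two (not necessarily unit) vectors
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    return 1.0 - sum(a * b for a, b in zip(x, y)) / (nx * ny)

def matched_distance(centers_a, centers_b):
    # minimal total angular dissimilarity over all one-to-one matchings of
    # the two center sets; brute force is feasible for the small k used here
    k = len(centers_a)
    return min(
        sum(d_phi(centers_a[i], centers_b[pi[i]]) for i in range(k))
        for pi in itertools.permutations(range(k))
    )
```

The brute-force minimum over permutations costs $O(k!)$; for larger $k$ one would switch to an assignment-problem solver.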
We first assume $k$ to be known. For each combination of $d$ and $k$ we randomly generate 100 model specifications, and for each model specification we generate 1000 observations. The random factors are generated as below, where $U_i$, $i \in \mathbb{N}$, stand for i.i.d. random variables with uniform distribution on $[0, 1]$. We only state the first $k - 1$ factors, as the last factor is always determined from the first ones by the standardization assumption.
For the method from Yuen and Stoev [2014] we use all observations; for the remaining methods we use only the 100 observations with largest norm. The grid for the estimator from Einmahl et al. [2018] includes all $d$-dimensional vectors with entries from the set $\{0, 1/3, 2/3, 1\}$ and exactly 2 non-zero entries; a finer grid would have implied very long run times. For those three estimators, the fact that there are 0's in the factors and the knowledge of their positions is not used for the estimation, so there are $d \cdot (k - 1)$ parameters to estimate. The average values of $d_s(S_n, S)$ and $W_1(S_n, S)$ over all 100 realizations for different constellations are shown below in Tables 1 and 2, with the observed standard deviations in brackets. It can be seen that the locations of the points of mass of the spectral measure are most precisely estimated by the spherical $k$-means procedure. The accuracy in estimating the spectral measure is very similar for all methods except for the CRPS-method from Yuen and Stoev [2014], which was constructed for an overall good fit but with little focus on extremes. Since the number of factors is usually not known a priori, we also look at two misspecified models, where in both cases the model was fitted as if there were $k = 3$ factors, but the true value of $k$ is 2 or 6. The random models are generated as described previously for the respective constellations of $d$ and $k$. In Table 3 we see the results for the two misspecified models as measured in the $W_1$-metric between estimated and true spectral measure. In both examples, the spherical $k$-means procedure copes better with the fact that the model is misspecified.
Regarding the numerical implementation of the estimators from Einmahl et al. [2016] and Einmahl et al. [2018], we noted in our simulations that the former depends, for larger values of $d$ and $k$, heavily on the starting parameters of the algorithm; we choose the starting values for the factor parameters from the spherical $k$-means estimator, as also suggested in Einmahl et al. [2012]. The procedure for the estimator from Einmahl et al. [2018] has very long runtimes, even with the relatively coarse grid that we use.
We conclude from the simulations that the spherical $k$-means procedure is, for the chosen examples and metrics, superior or at least competitive to (and usually faster than) methods for estimating max-linear models if one is mainly interested in the resulting spectral measure of the observations.

Data examples
In the following we apply our procedure to three different data sets with dimensions 5, 30 and 38. For all data sets we observe that in each estimated cluster center there are many components with values close to 0 and only a few which have significantly positive entries. This hints at the phenomenon of asymptotic independence. In extreme value theory, two random variables $X, Y$ with distribution functions $F_X, F_Y$ are called asymptotically independent if
$$\lim_{u \to 1} P\left(F_X(X) > u \mid F_Y(Y) > u\right) = 0$$
(assuming that the limit exists), and asymptotically dependent otherwise. Detecting the groups of random variables which share asymptotic dependencies and classifying them accordingly was the main aim of Chautru [2015]. Our analysis allows for a more gradual view on dependencies, since looking at the estimated clusters and the observed differences in cluster components gives an idea about strong and weak hints towards asymptotic dependence or independence. For groups of asymptotically dependent variables it furthermore allows us to identify patterns within those groups.
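A finite-sample version of this conditional probability can be estimated from ranks, as in the following sketch (the function name and the default threshold are illustrative):

```python
def chi_hat(x, y, u=0.95):
    # empirical version of P(F_X(X) > u | F_Y(Y) > u) using ranks;
    # values near 0 suggest asymptotic independence, values near 1
    # suggest asymptotic dependence
    n = len(x)
    rx = {i: sum(1 for v in x if v <= x[i]) / n for i in range(n)}
    ry = {i: sum(1 for v in y if v <= y[i]) / n for i in range(n)}
    exceed_y = [i for i in range(n) if ry[i] > u]
    if not exceed_y:
        return float("nan")
    return sum(1 for i in exceed_y if rx[i] > u) / len(exceed_y)
```

In practice the estimate is computed for a range of thresholds $u$ close to 1 and inspected for stability, since any fixed $u$ only approximates the limit.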

Air pollution data
We start the analysis with a dataset of relatively small dimension, namely the air pollution data which has been analyzed in Heffernan and Tawn [2004]. It consists of daily measurements of five air pollutants in the city center of Leeds (U.K.), collected between 1994 and 1998 and split up into summer and winter months, which gives 578 and 532 observations, respectively. The data is available via the R-package texmex, see Southworth et al. [2018]. We apply the four-step procedure as outlined at the beginning of Section 3, with the transformation of marginals as described in (3.2). For this data set, we use the 10% of transformed observations with largest Euclidean norm, project them onto the unit sphere and apply the spherical $k$-means procedure. For this and the other two data examples we use the command skmeans with method pclust and additional parameters nruns = 1000, maxchains = 100 from Hornik et al. [2012]. The first step is now to determine suitable values of $k$. A common way of doing this is creating a so-called "elbow plot" by plotting the minimized distance $W(A_k, S_n)$ (recall (2.8)) against the number of clusters $k$, see Figure 1 for the summer and winter data. Note that $W(A_k, S_n)$ necessarily decreases as $k$ increases. In the plot one usually looks for a $k$ such that for larger values the decrease becomes insignificant, but we note that there is no clear theoretical criterion for an optimal choice of $k$. Our goal is to use the algorithm as a tool to explore the structure in the data, and we stress that it often makes sense to look at different values of $k$ in the analysis.
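The quantities behind such an elbow plot can be sketched as follows; instead of running skmeans, this toy version picks the best centers among the data points themselves by brute force, which suffices to illustrate that the minimized $W$ decreases in $k$:

```python
import itertools

def d_phi(x, y):
    # angular dissimilarity 1 - <x, y> for unit vectors
    return 1.0 - sum(a * b for a, b in zip(x, y))

def W(A, points):
    # empirical averaged distance (2.8): mean dissimilarity of each point
    # to its nearest center in A
    return sum(min(d_phi(p, a) for a in A) for p in points) / len(points)

def elbow_curve(points, k_max):
    # for each k, choose the best set of k centers among the data points
    # themselves (a k-medoids-style stand-in for spherical k-means, feasible
    # only for small illustrative samples)
    return [
        min(W(list(A), points) for A in itertools.combinations(points, k))
        for k in range(1, k_max + 1)
    ]
```

Plotting the returned values against $k$ gives the elbow plot; the curve is non-increasing by construction, and one looks for the $k$ after which it flattens out.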
From the elbow plots of the summer and winter data we decide to pursue our analysis for both $k = 4$ and $k = 5$. The resulting cluster centers are shown in Figure 2. For $k = 5$ we see that for the summer data each cluster has exactly one large component, hinting at asymptotic independence between all air pollutants. We note that due to the marginal transformation in (3.2) and the standardization property (2.7) of the spectral measure $S$, the components of our projections on the unit sphere should all have approximately the same expected value. Therefore, each component should correspond to a large entry in at least one cluster center.
As a result, we note that for $k = 4$ there has to be at least one cluster in which at least two components are large. We can interpret cluster 4 in the summer data for $k = 4$ in the way that NO and NO$_2$ are the air pollutants which are most likely to occur at extreme levels simultaneously. Both for $k = 4$ and for $k = 5$ we can identify a tendency for particulate matter (PM$_{10}$), NO and NO$_2$ to occur jointly at extreme levels, but only in winter. This is in line with the conclusions in Heffernan and Tawn [2004], who state asymptotic independence for all components except for PM$_{10}$, NO and NO$_2$ in winter.

Financial portfolio losses
In this example, we consider the 'value-averaged' daily returns of 30 industry portfolios compiled and posted as part of the Kenneth French Data Library. The data in consideration span the years 1950-2015 with $n = 16694$ observations. This is the same dataset as analyzed in Cooley and Thibaud [2016], where dependencies in extreme losses were explored with a method related to principal component analysis. Here we attempt to recover more information using our method. Since we are interested in extremal losses, we first multiply all returns by $-1$. After that we use the same procedure as for the previous data set, but this time we only look at the transformed observations with the largest 5% of Euclidean norms. We create again an "elbow plot", see Figure 4. Here it is rather difficult to determine a concrete value for $k$, so we compare the analysis for $k = 5$ and $k = 10$.
The bottom graph of Figure 5 shows the cluster centers for k = 5. We can see that the clusters clearly separate the industries into different sectors. Cluster 1 signals the asymptotic independence of the tobacco industry from all other categories. Cluster 2 focuses on the energy and materials sectors. Cluster 3 consists of business- and IT-related industries. Cluster 4 consists of the consumer-oriented industries, and Cluster 5 encompasses the rest.
The top graph of Figure 5, which shows the time points of the extreme losses in each cluster, also provides interesting insights. The extreme losses of Cluster 1 occurred around 2000, when massive lawsuits surged against tobacco companies. For Cluster 2, since the turn of the millennium the U.S. coal and mining industries have been heavily affected by government regulations, the rise of alternative sources of energy and foreign imports, which led to many struggles in the industry. The timeline for Cluster 3 clearly indicates the dot-com bubble of the late nineties. The consumer goods of Cluster 4 were heavily affected by U.S. recessions, most prominently by the oil crisis of 1973. Cluster 5 reflects the widespread effects of the dot-com bubble and the financial crisis.
Figure 6 shows the same set of results for k = 10. There is a clear pattern in which most clusters have only one or a few large components, hinting at an overall strong level of asymptotic independence. We can clearly identify sectors that are more prone to exhibiting joint losses, and many of the connections from the analysis with k = 5 remain. Still tightly linked are the consumer sector (Cluster 5), the energy sector (Cluster 7), the business and IT sectors (Cluster 8) and a wide collection of industries linked to the financial industry (Cluster 10). The timeline plot also shows patterns similar to the previous results and clearly identifies the major events in financial market history.

Dietary intakes data
In this section we look at the dietary interview data from the 2015–2016 NHANES report, available at https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DR1TOT_I.XPT. The interview component, called "What We Eat in America", recorded the food and beverages consumed by all participants during the 24-hour period prior to the interview. The resulting dataset contains the nutrient intakes calculated from these records. We are interested in the dependence between 38 chosen nutrients at high intake levels, as high doses of some of the components can have negative health effects. See also Chautru [2015] for the analysis of a similar but smaller data set. We again derive an estimator for the spectral measure by transforming the observations with the help of the empirical distribution function and keeping the transformed observations whose Euclidean norm belongs to the largest 5%. The choice of k is again ambiguous in this data set, and the elbow plot is similar to that in Figure 4. What is clear is that as k increases, the number of cluster centers with only one large component increases, again pointing at asymptotic independence of most of the nutrients.
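The observation that more and more cluster centers carry a single large component as k grows can be quantified with a simple diagnostic. The following sketch is our own helper with an arbitrary threshold, not part of the paper's procedure; it counts how many unit-norm centers concentrate most of their squared mass on one coordinate.

```python
import numpy as np

def n_single_component_centers(centers, threshold=0.9):
    """Count unit-norm cluster centers whose largest squared entry
    exceeds `threshold`, a rough indicator that the corresponding
    cluster represents a single, asymptotically independent component."""
    return int(np.sum((centers ** 2).max(axis=1) >= threshold))
```

Tracking this count as a function of k makes the drift toward asymptotic independence described above easy to read off.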

Figure 1: The value of $W(A_n^k, S_n)$ for different values of k in the air pollution data sets. Left: summer data. Right: winter data.

Figure 2: The $k$-means clustering result on the air pollution data for k = 4. Left: summer data. Right: winter data.

Figure 3: The $k$-means clustering result on the air pollution data for k = 5. Left: summer data. Right: winter data.

Figure 4: The value of $W(A_n^k, S_n)$ for different values of k in the financial portfolio loss data.

Figure 5: The $k$-means clustering result on the financial portfolio loss data for k = 5. Top: cluster classification result vs. time. Bottom: re-normalized cluster centers.

Figure 6: The $k$-means clustering result on the financial portfolio loss data for k = 10. Top: cluster classification result vs. time. Bottom: re-normalized cluster centers.

Figure 7: The $k$-means cluster centers in the dietary intakes data for k = 15.

Figure 8: The $k$-means cluster centers in the dietary intakes data for k = 20.

Table 1: Comparison of $d_s(S_n, S)$ for different values of d and k. Mean over 100 simulations; standard deviation in brackets.
Significant clusters can nevertheless be identified for several values of k: clusters formed by carbohydrates and sugar; by vitamin B$_2$, vitamin B$_6$, vitamin B$_{12}$ and niacin; by lutein and vitamin K; by iron and vitamin B$_1$; and finally by fat together with fatty acids, which also goes hand in hand with high calorie intake.