Quantifying identifiability in independent component analysis

We are interested in consistent estimation of the mixing matrix in the ICA model, when the error distribution is close to (but different from) Gaussian. In particular, we consider $n$ independent samples from the ICA model $X = A\epsilon$, where we assume that the coordinates of $\epsilon$ are independent and identically distributed according to a contaminated Gaussian distribution, and the amount of contamination is allowed to depend on $n$. We then investigate how the ability to consistently estimate the mixing matrix depends on the amount of contamination. Our results suggest that in an asymptotic sense, if the amount of contamination decreases at rate $1/\sqrt{n}$ or faster, then the mixing matrix is only identifiable up to transpose products. These results also have implications for causal inference from linear structural equation models with near-Gaussian additive noise.


Introduction
We consider the $p$-dimensional independent component analysis (ICA) model
$$X = A\epsilon, \qquad (1.1)$$
where $A$ is a $p \times p$ mixing matrix, $\epsilon$ is a $p$-dimensional error (or source) variable with independent and nondegenerate coordinates of mean zero, and $X$ is a $p$-dimensional observational variable. Based on observations of $X$, ICA aims to identify the mixing matrix $A$ and the distribution of the error variable $\epsilon$. Theory and algorithms for ICA can be found in, e.g., [4,5,11,12,13,18]. ICA has applications in many different disciplines, including blind source separation (e.g., [6]), face recognition (e.g., [2]), medical imaging (e.g., [3,15,25]) and causal discovery using the LiNGAM method (e.g., [22,23]).
Our focus is on identifying the mixing matrix. Identifiability is an issue, since two different mixing matrices $A$ and $B$ may yield the same distribution of $X$, for example if the distribution of $\epsilon$ is multivariate Gaussian. In this case, the mixing matrix cannot be identified from $X$. In [5], it was shown that whenever at most one of the components of $\epsilon$ is Gaussian, the mixing matrix is asymptotically identifiable up to scaling and permutation of columns; see also Theorem 4 of [9].
In order to illustrate the relevance of identifying the mixing matrix in (1.1), we give an example based on causal inference. Example 1.1. Consider a two-dimensional linear structural equation model with additive noise of the form $X = CX + \epsilon$; see, e.g., [22]. We assume that the coordinates of $\epsilon$ are independent, nondegenerate and have mean zero, and assume that $C$ is strictly triangular, meaning that all entries of $C$ are zero except either $C_{12}$ or $C_{21}$. In the first case, $X_1$ is a function of $X_2$, and vice versa in the second case. In the context of linear structural equation models, identifying which row of $C$ is zero corresponds to identifying whether $X_1$ is a cause of $X_2$ or vice versa.
As $C$ is strictly triangular, $I - C$ is invertible. Letting $A = (I - C)^{-1}$, we obtain $X = A\epsilon$, where $A$ is upper or lower triangular according to whether the same holds for $C$. Thus, we have arrived at an ICA model of the form (1.1). In the case where $\epsilon$ is jointly Gaussian, it is immediate that we cannot identify $A$ from the distribution of $X$ alone. By the results of [5,22], identification of whether $A$ is upper or lower triangular from the distribution of $X$ is possible when $\epsilon$ has at most one Gaussian coordinate. Thus, in this case, we may infer causal relationships from estimation of the mixing matrix in an ICA model. However, if the distribution of $\epsilon$ is close to Gaussian, it may be expected that based on samples from the distribution of $X$, identification of $A$, and thus of the causal relationship, becomes difficult.
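Example 1.1 can be sketched numerically. The following minimal sketch (not code from the paper) assumes $p = 2$, a centred exponential noise distribution, and the hypothetical value $C_{12} = 0.8$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Strictly triangular C with C12 nonzero: X1 = C12 * X2 + eps1, X2 = eps2,
# so X2 is a cause of X1.
C = np.array([[0.0, 0.8],
              [0.0, 0.0]])

# Solving X = C X + eps gives X = (I - C)^{-1} eps = A eps, an ICA model.
A = np.linalg.inv(np.eye(2) - C)

# A inherits the upper triangular shape of C.
assert np.allclose(A, np.triu(A))

# n observations with independent, mean-zero, non-Gaussian noise
# (a centred exponential, chosen purely for illustration).
n = 1000
eps = rng.exponential(size=(2, n)) - 1.0
X = A @ eps
```

In this sketch, identifying whether the mixing matrix is upper or lower triangular amounts to identifying the causal direction.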
Motivated by the above, we study asymptotic identifiability of the mixing matrix under an asymptotic scenario where the distribution of $\epsilon$ depends on the sample size $n$ and tends to a Gaussian distribution as $n$ tends to infinity. In fact, we will consider a general nondegenerate mean zero distribution $\zeta$ and an asymptotic scenario where the distribution of the coordinates of $\epsilon$ tends to $\zeta$. Results on asymptotic identifiability for the case of a limiting Gaussian distribution then follow as a corollary.
Specifically, let $\zeta$ and $\xi$ denote nondegenerate mean zero probability distributions on $(\mathbb{R}, \mathcal{B})$ such that $\xi \neq \zeta$. Fix $p \in \mathbb{N}$ and let $A$ be a $p \times p$ matrix. Let $(\epsilon_n)$ be a sequence of $p$-dimensional variables such that for each $n$, the coordinates of $\epsilon_n = (\epsilon_{n1}, \ldots, \epsilon_{np})$ are independent and identically distributed according to the contaminated $\zeta$ distribution $P_e(\beta_n) = (1 - \beta_n)\zeta + \beta_n \xi$. Here, scalar multiplication and addition of probability measures, or more generally of signed measures, is defined in a pointwise manner as in [21]. We investigate asymptotic identifiability of the mixing matrix $A$ based on $n$ independent samples of $X = A\epsilon_n$, where $\beta_n$ is allowed to tend to zero as $n$ tends to infinity. Our results suggest that when $\beta_n \in o(1/\sqrt{n})$, asymptotic identifiability is determined solely by the properties of the limiting distribution $\zeta$ of $P_e(\beta_n)$ (Theorem 4.3). In particular, in the case where $\zeta$ is a Gaussian distribution, asymptotic identifiability becomes problematic if $\beta_n \in o(1/\sqrt{n})$ (Corollary 4.4). Finally, we prove a positive identifiability result for $\beta_n = n^{-\rho}$ with $0 < \rho < 1/2$ (Theorem 4.5).
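The asymptotic scenario above can be sampled as follows; this sketch takes $\zeta$ standard normal, $\xi$ a centred standard exponential, and a triangular $A$, all of which are illustrative choices rather than specifications from the paper:

```python
import numpy as np

def sample_contaminated(n_samples, beta, rng):
    """Draw from P_e(beta) = (1 - beta) * zeta + beta * xi, with zeta standard
    normal and xi a centred standard exponential (illustrative choices)."""
    draws = rng.standard_normal(n_samples)                # draws from zeta
    mask = rng.random(n_samples) < beta                   # contaminate w.p. beta
    draws[mask] = rng.exponential(size=mask.sum()) - 1.0  # draws from xi
    return draws

rng = np.random.default_rng(1)
n = 10_000
beta_n = 1.0 / np.sqrt(n)   # contamination vanishing at the critical rate
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])
eps_n = np.vstack([sample_contaminated(n, beta_n, rng) for _ in range(2)])
X = A @ eps_n               # n samples from the ICA model X = A eps_n
```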

Problem statement and main results
ICA can be used to estimate $A$ when the distribution of $\epsilon$ is unknown. In this case, we may think of the statistical model (1.1) as the collection of probability measures
$$\{ L_A(R) \mid A \in M(p,p),\; R \in \mathcal{P}(p) \}, \qquad (2.1)$$
where $M(p, p)$ denotes the space of $p \times p$ matrices, $L_A : \mathbb{R}^p \to \mathbb{R}^p$ is given by $L_A(x) = Ax$, $L_A(R)$ denotes the image measure of $R$ under the transformation $L_A$, and $\mathcal{B}_p$ denotes the Borel $\sigma$-algebra on $\mathbb{R}^p$. Also, $\mathcal{P}(p)$ denotes the set of product probability measures on $(\mathbb{R}^p, \mathcal{B}_p)$ with nondegenerate mean zero coordinates. With $\epsilon$ having distribution $R$, this means that the error distribution has independent nondegenerate mean zero coordinates. In other words, it is assumed that the distribution of $X$ in (1.1) is equal to $L_A(R)$ for some $A \in M(p, p)$ and $R \in \mathcal{P}(p)$. This is a semiparametric model, where $A$ is the parameter of interest and $R$ is a nuisance parameter. Asymptotic distributions of estimates of the mixing matrix in this type of set-up are derived in, e.g., [1,4,14]. The difficulty of identifying $A$ can then be appraised by considering, for example, the asymptotic variance of the estimates.
Alternatively, one can consider estimation of $A$ for a given error distribution. This is the approach we take in this paper. When $\epsilon$ has the distribution of some fixed $R \in \mathcal{P}(p)$, the statistical model (1.1) is the collection of probability measures
$$\{ L_A(R) \mid A \in M(p,p) \}. \qquad (2.2)$$
Asymptotic identifiability of $A$ in (2.2) follows from the results of [5] and [9]. In particular, if no two coordinates of $R$ are jointly Gaussian, the mixing matrix $A$ is asymptotically identifiable up to sign reversion and permutation of columns, in the sense that $L_A(R) = L_B(R)$ implies $A = B\Lambda P$ for some diagonal matrix $\Lambda$ with $\Lambda^2 = I$ and some permutation matrix $P$.
We are interested in identifiability of the mixing matrix in (2.2) when the error distributions are different from, but close to, Gaussian. Some results in this direction can be found in [19], where the authors calculated the Cramér–Rao lower bound for the model (2.2) under the assumption that the coordinates of the error distribution satisfy certain regularity conditions, such as finite variance and differentiable Lebesgue densities. These results indicate how the minimum variance of an unbiased estimator of the mixing matrix depends on the error distribution.
We consider the problem from the following, different perspective. For $p \geq 1$ and any signed measure $\mu$ on $(\mathbb{R}, \mathcal{B})$, let $\mu \otimes \mu$ denote the product measure of $\mu$ with itself, and let $\mu^{\otimes p} = \otimes_{i=1}^p \mu$ denote the $p$-fold product measure. Fix two nondegenerate mean zero probability measures $\xi$ and $\zeta$ with $\xi \neq \zeta$, and let $P_e(\beta)$ be the contaminated distribution given by $P_e(\beta) = (1 - \beta)\zeta + \beta\xi$. Also, we write $F_A$ for the cumulative distribution function of $L_A(\zeta^{\otimes p})$, and we write $F_A^\beta$ for the cumulative distribution function of $L_A(P_e(\beta)^{\otimes p})$. In Section 3, we will show that $F_A^\beta$ tends uniformly to $F_A$ at an asymptotically linear rate in $\beta$ as $\beta$ tends to zero (Theorem 3.1). As a consequence, whenever $F_A = F_B$, the distance $\|F_A^\beta - F_B^\beta\|_\infty$ tends to zero at an asymptotically linear rate as well (Corollary 3.4). In Theorem 4.3, we use this result to show that when $F_A = F_B$ and $\beta_n \in o(1/\sqrt{n})$, identifiability of the mixing matrix is determined by the properties of $F_A$ and not $F_A^{\beta_n}$. In particular, we argue in Corollary 4.4 that when $\zeta$ is a Gaussian distribution, $\beta_n \in o(1/\sqrt{n})$ and $AA^t = BB^t$, distinguishing between the candidates $A$ and $B$ for the mixing matrix becomes problematic. Finally, we prove in Theorem 4.5 that, under suitable regularity conditions, identifiability issues like in the previous results do not occur (not even when $F_A = F_B$) if convergence of the contaminated error distribution is sufficiently slow, in the sense that $\beta_n = n^{-\rho}$ for some $0 < \rho < 1/2$. All proofs are given in the appendix.

An upper asymptotic distance bound
We begin by introducing some notation. For any measure $\mu$ on $(\mathbb{R}^p, \mathcal{B}_p)$, let $|\mu|$ denote the total variation measure of $\mu$; see, e.g., [21]. With $I_x = (-\infty, x_1] \times \cdots \times (-\infty, x_p]$ for $x \in \mathbb{R}^p$, we define two norms by
$$\|\mu\|_\infty = \sup_{x \in \mathbb{R}^p} |\mu(I_x)| \quad (3.1) \qquad \text{and} \qquad \|\mu\|_{TV} = |\mu|(\mathbb{R}^p) \quad (3.2)$$
and refer to these as the uniform and the total variation norms, respectively. The uniform norm for measures is also known as the Kolmogorov norm. Note that if $P$ and $Q$ are two probability measures on $(\mathbb{R}^p, \mathcal{B}_p)$ with cumulative distribution functions $F$ and $G$, it holds that $\|P - Q\|_\infty = \|F - G\|_\infty$. Finally, we use the notation $f(s) \sim g(s)$ for $s \to s_0$ when $\lim_{s \to s_0} f(s)/g(s) = 1$. As in the previous section, let $\xi$ and $\zeta$ be two nondegenerate mean zero probability distributions on $(\mathbb{R}, \mathcal{B})$ with $\xi \neq \zeta$. We aim to bound the distance $\|F_A^\beta - F_A\|_\infty$. The following theorem is a first step towards this goal.
The proof of Theorem 3.1 exploits properties of the contaminated distributions $P_e(\beta)$ for $\beta \in (0, 1)$, in particular that $\|P_e(\beta) - \zeta\|_\infty$ is nonzero and linear in $\beta$, and that $(P_e(\beta) - \zeta)/\|P_e(\beta) - \zeta\|_\infty$ is constant in $\beta$. As Lemma 3.2 shows, only contaminated distributions have these properties. This is our main reason for working with this family of distributions. Due to the properties of contaminated distributions, Theorem 3.1 in fact also holds for norms other than the uniform norm. However, the choice of the norm is important when we wish to bound the norm of the right-hand side of (3.4). Such a bound is achieved in Lemma 3.3.
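In one dimension, the two properties above are immediate from $P_e(\beta) - \zeta = \beta(\xi - \zeta)$, so that $\|P_e(\beta) - \zeta\|_\infty = \beta\,\|\xi - \zeta\|_\infty$. A quick numerical check, taking $\zeta$ standard normal and $\xi$ a centred standard exponential (illustrative choices):

```python
import numpy as np
from math import erf, exp, sqrt

def F_zeta(x):
    """CDF of zeta, here the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def F_xi(x):
    """CDF of xi, here a standard exponential centred to mean zero."""
    return 1.0 - exp(-(x + 1.0)) if x >= -1.0 else 0.0

grid = np.linspace(-5.0, 5.0, 2001)

def kolmogorov_distance(beta):
    """||P_e(beta) - zeta||_inf approximated on the grid."""
    return max(abs((1 - beta) * F_zeta(x) + beta * F_xi(x) - F_zeta(x))
               for x in grid)

base = max(abs(F_xi(x) - F_zeta(x)) for x in grid)  # ||xi - zeta||_inf
for beta in (0.2, 0.1, 0.05):
    print(beta, kolmogorov_distance(beta) / base)   # ratio equals beta
```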
Combining Theorem 3.1 and Lemma 3.3 yields the following corollary, which we give without proof.
Then we have, for $\beta \to 0$, that $\|F_A^\beta - F_B^\beta\|_\infty$ is bounded by $4p\,\beta\,\|\xi - \zeta\|_\infty$ up to a factor $1 + o(1)$. Corollary 3.4 thus shows that in the case where $F_A = F_B$, as $\beta$ tends to zero and the error distributions $P_e(\beta)$ become closer to $\zeta$, the distance between the observational distributions $F_A^\beta$ and $F_B^\beta$ decreases asymptotically linearly in $\beta$. Heuristically, this suggests that when $F_A = F_B$ and $\beta$ is close to zero, the distributions $F_A^\beta$ and $F_B^\beta$ are hard to distinguish.

Corollary 3.4 is stated under the condition that $F_A = F_B$. For later use, we characterize the occurrence of this in the next lemma, in terms of $\zeta$, $A$ and $B$, for the case where $A$ and $B$ are invertible. Recall that a probability distribution $Q$ on $(\mathbb{R}, \mathcal{B})$ is said to be symmetric if, for every random variable $Y$ with distribution $Q$, $Y$ and $-Y$ have the same distribution. The proof of Lemma 3.5, given in the appendix, is a simple consequence of Theorem 4 of [9].
Lemma 3.5. Let $A, B \in M(p, p)$ be invertible. Then the following hold: (1) If $\zeta$ is Gaussian, then $F_A = F_B$ if and only if $AA^t = BB^t$. (2) If $\zeta$ is non-Gaussian and symmetric, then $F_A = F_B$ if and only if $A = B\Lambda P$ for some permutation matrix $P$ and a diagonal matrix $\Lambda$ satisfying $\Lambda^2 = I$. (3) If $\zeta$ is non-Gaussian and not symmetric, then $F_A = F_B$ if and only if $A = BP$ for some permutation matrix $P$.
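Statement (1) can be sanity-checked numerically: for Gaussian $\zeta$, any $B = AO$ with $O$ orthogonal satisfies $AA^t = BB^t$, and the two models then produce the same observational distribution. The matrices below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
theta = 1.1
O = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = A @ O               # B is not A up to sign flips and column permutations

# For Gaussian zeta, both models give N(0, sigma^2 A A^t), and A A^t = B B^t.
assert np.allclose(A @ A.T, B @ B.T)

# Empirical check: the sample covariances of A eps and B eps agree.
n = 200_000
cov_A = np.cov(A @ rng.standard_normal((2, n)))
cov_B = np.cov(B @ rng.standard_normal((2, n)))
print(np.max(np.abs(cov_A - cov_B)))  # small sampling error
```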

Asymptotic identifiability
We now turn to asymptotic properties of ICA models. We will need some basic facts about random fields in order to formulate our results; see [16] and [17] for an overview. Recall that a mapping $R : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is said to be symmetric if $R(x, y) = R(y, x)$ for all $x, y \in \mathbb{R}^p$, and is said to be positive semidefinite if for all $n \geq 1$ and for all $x_1, \ldots, x_n \in \mathbb{R}^p$ and $\xi_1, \ldots, \xi_n \in \mathbb{R}$, it holds that $\sum_{i=1}^n \sum_{j=1}^n \xi_i \xi_j R(x_i, x_j) \geq 0$. For any symmetric and positive semidefinite function $R : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, there exists a mean zero Gaussian random field $W$ with covariance function $R$, taking its values in $\mathbb{R}^{\mathbb{R}^p}$. In general, $W$ will not have continuous paths. For a general random field $W$, we associate with $W$ its intrinsic pseudometric $\rho$ on $\mathbb{R}^p$, given by $\rho(x, y) = (E(W(x) - W(y))^2)^{1/2}$. If the metric space $(\mathbb{R}^p, \rho)$ is separable, we say that $W$ is separable. In this case, $\|W\|_\infty = \sup_{x \in D} |W(x)|$ with probability one, for any countable subset $D$ of $\mathbb{R}^p$ which is dense under the pseudometric $\rho$. In particular, whenever the $\sigma$-algebra on the space where $W$ is defined is complete, $\|W\|_\infty$ is measurable.
The following lemma describes some important properties of a class of Gaussian fields particularly relevant to us. The result is well known, see for example [8]. For completeness, we outline a short proof in the appendix based on a strong approximation result from the literature.
Lemma 4.1. Let $F$ be a cumulative distribution function on $\mathbb{R}^p$. There exists a $p$-dimensional separable mean zero Gaussian field $W$ which has covariance function $R(x, y) = F(x \wedge y) - F(x)F(y)$, where $x \wedge y$ is the coordinatewise minimum of $x$ and $y$. With $\mathbb{Q}$ denoting the rationals, it holds that $\|W\|_\infty = \sup_{x \in \mathbb{Q}^p} |W(x)|$ and $\|W\|_\infty$ is almost surely finite.
For a fixed cumulative distribution function $F$, we refer to the Gaussian field described in Lemma 4.1 as an $F$-Gaussian field. We are now ready to formulate our results on asymptotic identifiability in ICA models. We first state a result, Theorem 4.2, concerning the classical asymptotic scenario, where the error distribution is not contaminated and does not depend on the sample size $n$. Fix a nondegenerate mean zero probability distribution $\zeta$ on $(\mathbb{R}, \mathcal{B})$ and a matrix $A \in M(p, p)$. As in the previous section, we let $F_A$ denote the cumulative distribution function of $L_A(\zeta^{\otimes p})$, corresponding to the distribution of $A\epsilon$ when $\epsilon$ is a $p$-dimensional variable with independent coordinates having distribution $\zeta$. Consider a probability space $(\Omega, \mathcal{F}, P)$ endowed with independent variables $(X_k)_{k \geq 1}$ with cumulative distribution function $F_A$. Let $\hat{F}_n^A$ be the empirical distribution function of $X_1, \ldots, X_n$. Also assume that we are given an $F_A$-Gaussian field $W$ on $(\Omega, \mathcal{F}, P)$.
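For $p = 1$, the limiting variable $\|W\|_\infty$ in this classical scenario has the well-known Kolmogorov distribution, and the normalized statistic $\sqrt{n}\,\|\hat{F}_n - F\|_\infty$ stays stochastically bounded. A small simulation (standard normal data and an arbitrary seed, purely for illustration):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def kolmogorov_statistic(sample, cdf):
    """||F_n - F||_inf for a 1-d sample; the sup is attained at the order
    statistics, so it can be computed exactly."""
    x = np.sort(sample)
    n = x.size
    F = cdf(x)
    upper = np.max(np.arange(1, n + 1) / n - F)
    lower = np.max(F - np.arange(0, n) / n)
    return max(upper, lower)

normal_cdf = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))

stats = []
for n in (500, 2000, 8000):
    # sqrt(n) * ||F_n - F||_inf stays of order one as n grows.
    stat = sqrt(n) * kolmogorov_statistic(rng.standard_normal(n), normal_cdf)
    stats.append(stat)
    print(n, round(stat, 3))
```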
The equations (4.3) and (4.4) roughly state that in the classical asymptotic scenario, $\sqrt{n}\,\|\hat{F}_n^A - F_A\|_\infty$ remains stochastically bounded, while $\sqrt{n}\,\|\hat{F}_n^A - F_B\|_\infty$ tends to infinity whenever $F_A \neq F_B$; in other words, $F_A$ can be distinguished from $F_B$ at rate $1/\sqrt{n}$. Note that Lemma 3.5 gives us conditions for $F_A = F_B$ and $F_A \neq F_B$ depending on $\zeta$.
Next, we consider an asymptotic scenario where the error distribution is contaminated and the amount of contamination depends on the sample size $n$. As in Section 3, $\xi$ and $\zeta$ are fixed nondegenerate mean zero probability measures on $(\mathbb{R}, \mathcal{B})$ with $\xi \neq \zeta$, and $F_A^\beta$ denotes the cumulative distribution function of $L_A(P_e(\beta)^{\otimes p})$, corresponding to the distribution of $A\epsilon$ where $\epsilon$ is a $p$-dimensional variable with independent coordinates having distribution $P_e(\beta)$. Consider a sequence $(\beta_n)$ in $(0, 1)$, and consider a probability space $(\Omega, \mathcal{F}, P)$ endowed with a triangular array $(X_{nk})_{1 \leq k \leq n}$ such that for each $n$, the variables $X_{n1}, \ldots, X_{nn}$ are independent with cumulative distribution function $F_A^{\beta_n}$. Let $\hat{F}_n^{A,\beta_n}$ be the empirical distribution function of $X_{n1}, \ldots, X_{nn}$. Also assume that we are given an $F_A$-Gaussian field $W$ on $(\Omega, \mathcal{F}, P)$. We are interested in the asymptotic properties of $\hat{F}_n^{A,\beta_n}$. Theorem 4.3 is our main result for this type of asymptotic scenario.
In particular, if $k = 0$ and $c$ is a continuity point of the distribution of $\|W\|_\infty$, the limit in (4.5) reduces to $P(\|W\|_\infty > c)$. Theorem 4.3 essentially shows that for the asymptotic scenario considered, the convergence of $F_A^{\beta_n}$ to $F_A$ is fast enough to ensure that the asymptotic properties of $\hat{F}_n^{A,\beta_n}$ are determined by $F_A$ instead of $F_A^{\beta_n}$. Corollary 4.4 applies this result to the case where the error distributions become close to Gaussian without being Gaussian.
Corollary 4.4. Assume that $\lim_n \sqrt{n}\beta_n = 0$. Let $A, B \in M(p, p)$ be invertible. Assume that $AA^t = BB^t$ while $A \neq B\Lambda P$ for all diagonal $\Lambda$ with $\Lambda^2 = I$ and all permutation matrices $P$. Let $\zeta$ be a nondegenerate Gaussian distribution and let $\xi$ be such that $P_e(\beta)$ is non-Gaussian for all $\beta \in (0, 1)$. Let $c$ be a point of continuity for the distribution of $\|W\|_\infty$, with $W$ an $F_A$-Gaussian field. It then holds that: (1) for each $n$, $F_A^{\beta_n} \neq F_B^{\beta_n}$; (2) $\lim_n P(\sqrt{n}\,\|\hat{F}_n^{A,\beta_n} - F_B^{\beta_n}\|_\infty > c) = P(\|W\|_\infty > c)$. Statement (1) of Corollary 4.4 shows that for any finite $n$, we are in the case where, were the error distribution not changing with $n$, it would be possible to asymptotically distinguish $F_A^{\beta_n}$ and $F_B^{\beta_n}$ at rate $1/\sqrt{n}$, as in (4.4) of the classical case. However, statement (2) shows that as $n$ increases and the error distributions become closer to a Gaussian distribution, distinguishing $F_A^{\beta_n}$ and $F_B^{\beta_n}$ at rate $1/\sqrt{n}$ is nonetheless impossible, with a limit result similar to (4.3). Note that by Lemma 3.5, having $A \neq B\Lambda P$ for all diagonal $\Lambda$ with $\Lambda^2 = I$ and all permutation matrices $P$, as in Corollary 4.4, is the minimum requirement for non-Gaussian error distributions to allow asymptotically distinguishing $F_A$ and $F_B$ in the classical scenario.
As can be seen from the proof of Theorem 4.5, the measure $L_A(P_e(\beta)^{\otimes p})$ can be written as a polynomial of degree $p$ in $\beta$, where the constant term corresponds to $F_A$ and the first-order term corresponds to $\Gamma_1(A)$, and similarly for $L_B(P_e(\beta)^{\otimes p})$. In this light, Theorem 4.5 shows that in the absence of a difference between the constant terms of $L_B(P_e(\beta)^{\otimes p})$ and $L_A(P_e(\beta)^{\otimes p})$, having different first-order terms is a sufficient criterion for asymptotic identifiability.

Discussion
We studied identifiability of the ICA model for error distributions which have independent and identically distributed coordinates following contaminated distributions. We argued in particular that for contaminated Gaussian distributions, if the level of contamination decreases at rate $1/\sqrt{n}$ or faster, then asymptotic identifiability is determined by the Gaussian limiting distribution rather than by the non-Gaussian contaminated distribution. Combining this with Lemma 3.5, we obtain that distinguishing $A$ and $B$ becomes difficult already when $AA^t = BB^t$, rather than only when $A$ and $B$ are equal up to sign reversion and permutation of columns, as one might expect for non-Gaussian error distributions. The consequence of this is that if we have $n$ observations from an ICA model with a contaminated Gaussian error distribution with contamination level on the order of $1/\sqrt{n}$ or smaller, we expect that identifying the mixing matrix will be difficult. In particular, causal inference as described in Example 1.1 (using LiNGAM) is going to be difficult in this setting.
The proof of our main theoretical result, Theorem 4.3, rests on two partial results: (1) When $(F_n)$ is a sequence of cumulative distribution functions converging uniformly to $F$, and $\hat{F}_n$ is the empirical distribution function based on $n$ independent observations of variables with cumulative distribution function $F_n$, then $\sqrt{n}(\hat{F}_n - F_n)$ converges weakly in $\ell^\infty(\mathbb{R}^p)$ to an $F$-Gaussian field (Lemma A.3). (2) When $F_A = F_B$, the distance $\|F_A^{\beta_n} - F_B^{\beta_n}\|_\infty$ is bounded by a multiple of $\beta_n$ (Corollary 3.4), so that the perturbation is negligible at rate $1/\sqrt{n}$ when $\sqrt{n}\beta_n$ tends to zero.
In Theorem 4.5, we also considered the case of slower rates of decrease in the level of contamination, namely rates $n^{-\rho}$ for $0 < \rho < 1/2$. Our results here indicate that in such asymptotic scenarios, identifiability of the mixing matrix at rate $1/\sqrt{n}$ will be possible, subject to some regularity conditions related to the $\Gamma_1$ signed measures of (4.7).
We have conducted numerical experiments to assess our results. We considered the case where $p = 2$, $\xi$ is the standard exponential distribution (centred to have mean zero), $\zeta$ is the standard normal distribution, and two matrices $A$ and $B$ depending on a fixed $\alpha \in (0, 1)$. These two matrices are related to Example 1.1. It then holds that $AA^t = BB^t$ while $A \neq B\Lambda P$ for all diagonal $\Lambda$ with $\Lambda^2 = I$ and all permutation matrices $P$. Combining Theorem 4.3 and Theorem 4.5, and writing $p(\rho)$ for the limiting rejection probability of the resulting test under $\beta_n = n^{-\rho}$, we would expect $p(\rho) = 1$ for $0 < \rho < 1/2$ and $p(\rho) = P(\|W\|_\infty > c)$ for $1/2 < \rho$. By Monte Carlo simulations, we found that $p(\rho)$ appears to be constant for $\rho > 1/2$, in accordance with Theorem 4.3. However, our results did not satisfactorily indicate $p(\rho) = 1$ for $0 < \rho < 1/2$, as Theorem 4.5 would suggest. We expect that the reason for this discrepancy is that the maximum sample size used in the simulations ($n = 5 \cdot 10^4$) is not large enough to show these asymptotics.
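A stripped-down version of such an experiment is sketched below. Since the paper's exact matrices are not reproduced above, the sketch instead uses a hypothetical pair $B = AO$ with $O$ a rotation, which retains the key property $AA^t = BB^t$; the statistic is the normalized Kolmogorov distance between the empirical distribution functions of the two models, approximated on a finite grid:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_model(M, n, beta, rng):
    """n samples of X = M eps with i.i.d. contaminated-Gaussian coordinates
    eps ~ (1 - beta) N(0, 1) + beta (Exp(1) - 1)."""
    eps = rng.standard_normal((M.shape[0], n))
    mask = rng.random(eps.shape) < beta
    eps[mask] = rng.exponential(size=mask.sum()) - 1.0
    return M @ eps

def sup_ecdf_distance(X, Y, grid):
    """Kolmogorov distance between the empirical CDFs of two 2-d samples,
    approximated over a finite grid of evaluation points."""
    d = 0.0
    for x in grid:
        Fx = np.mean(np.all(X <= x[:, None], axis=0))
        Fy = np.mean(np.all(Y <= x[:, None], axis=0))
        d = max(d, abs(Fx - Fy))
    return d

alpha, theta = 0.5, 0.7
A = np.array([[1.0, alpha], [0.0, 1.0]])
O = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = A @ O                          # hypothetical pair with A A^t = B B^t

n, rho = 4000, 0.75                # beta_n = n^{-rho} with rho > 1/2
beta_n = n ** (-rho)
X_A = sample_model(A, n, beta_n, rng)
X_B = sample_model(B, n, beta_n, rng)
grid = [2.0 * rng.standard_normal(2) for _ in range(200)]
T = np.sqrt(n) * sup_ecdf_distance(X_A, X_B, grid)
print(round(float(T), 3))
```

Repeating this over many replications and over a range of values of $\rho$ would give a Monte Carlo estimate of the rejection probability $p(\rho)$ discussed above.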
Our results also open up new research questions, such as the following. Is it possible to characterize the matrices $A$ and $B$ such that the regularity condition $\Gamma_1(A) \neq \Gamma_1(B)$ of Theorem 4.5 holds? Also, together, Theorem 4.3 and Theorem 4.5 describe the behaviour of the empirical process $\hat{F}_n^{A,\beta_n}$ for asymptotic scenarios of the form $\beta_n = n^{-\rho}$ with $\rho > 0$, in particular describing the difficulty of using $\hat{F}_n^{A,\beta_n}$ to distinguish $F_A^{\beta_n}$ and $F_B^{\beta_n}$. Is it possible to obtain finite-sample bounds instead of limiting behaviours in these results? How do Theorem 4.3 and Theorem 4.5 translate into results on the ability of practical algorithms, such as the fastICA algorithm (see [11]), to identify the correct mixing matrix? Is it possible to use similar techniques to analyze identifiability of the mixing matrix in asymptotic scenarios where $p$ tends to infinity? Do the present results extend to cases where the coordinates of the error distributions are not contaminated normal distributions, or where the coordinates are not identically distributed? Finally, besides linear SEMs with non-Gaussian noise, there are other settings where the underlying causal structure is completely identifiable, such as non-linear SEMs with almost arbitrary additive noise and linear SEMs with additive Gaussian noise of equal variance; see, e.g., [10] and [20], respectively. Can one use similar techniques to study identifiability in these models when the structural equations are close to linear or the error variances are close to equal?
In light of these open questions, our present results should be seen as a small step towards a better understanding of the identifiability of the mixing matrix in ICA for error distributions which are close to, but not, Gaussian. We hope that this paper will lead to more work in this direction.
To prove Lemma 3.3, we first present a lemma relating the uniform norm of certain measures on $(\mathbb{R}^p, \mathcal{B}_p)$ to the uniform and total variation norms of some measures on $(\mathbb{R}, \mathcal{B})$.

Lemma A.1. Let $\mu_1, \ldots, \mu_p$ be signed measures on $(\mathbb{R}, \mathcal{B})$, and let $A \in M(p, p)$. Then for any $i \in \{1, \ldots, p\}$, it holds that $\|L_A(\mu_1 \otimes \cdots \otimes \mu_p)\|_\infty \leq 2\,\|\mu_i\|_\infty \prod_{j \neq i} \|\mu_j\|_{TV}$.

Proof. For any permutation $\pi : \{1, \ldots, p\} \to \{1, \ldots, p\}$ and corresponding permutation matrix $P$, we have $L_A(\mu_1 \otimes \cdots \otimes \mu_p) = L_{AP^{-1}}(\mu_{\pi(1)} \otimes \cdots \otimes \mu_{\pi(p)})$. Hence, it suffices to consider $i = p$. Let $x \in \mathbb{R}^p$ and define $I_x = (-\infty, x_1] \times \cdots \times (-\infty, x_p]$. Then Fubini's theorem for signed measures yields
$$|L_A(\mu_1 \otimes \cdots \otimes \mu_p)(I_x)| \leq \int \cdots \int \left| \int 1_{I_x}(L_A(y)) \, \mathrm{d}\mu_p(y_p) \right| \mathrm{d}|\mu_1|(y_1) \cdots \mathrm{d}|\mu_{p-1}|(y_{p-1}), \qquad (A.11)$$
where we have also used the triangle inequality for integrals with respect to signed measures, which follows for example from Theorem 6.12 of [21]. We now analyze the innermost integral of (A.11). For fixed $y_1, \ldots, y_{p-1}$, the set $\{y_p \in \mathbb{R} \mid 1_{I_x}(L_A(y)) = 1\}$ is a finite intersection of intervals, and is therefore itself an interval $J$. This yields $|\mu_p(J)| \leq 2\|\mu_p\|_\infty$. This inequality is immediate when the interval is of the form $(-\infty, a]$ for some $a \in \mathbb{R}$. If the interval is of the form $[a, \infty)$, we have $|\mu_p([a, \infty))| = |\mu_p(\mathbb{R}) - \mu_p((-\infty, a))| \leq 2\|\mu_p\|_\infty$. Applying the triangle inequality, we therefore obtain the statement of the lemma.

Proof of Lemma 3.5. Proof of (1). With $\zeta$ Gaussian with mean zero and variance $\sigma^2$, $L_A(\zeta^{\otimes p})$ is Gaussian with mean zero and covariance matrix $\sigma^2 AA^t$, and so the result is immediate for this case.
Proof of (3). Now consider the case where $\zeta$ is not a symmetric distribution. As $L_P(\zeta^{\otimes p}) = \zeta^{\otimes p}$ holds for any permutation matrix $P$, we obtain that if $A = BP$, then $L_A(\zeta^{\otimes p}) = L_B(\zeta^{\otimes p})$ and so $F_A = F_B$, proving one implication.
Conversely, assume that $F_A = F_B$, meaning that $L_A(\zeta^{\otimes p}) = L_B(\zeta^{\otimes p})$. As $\zeta$ is nondegenerate and non-Gaussian and $A$ and $B$ are invertible, Theorem 4 of [9] shows that $A = B\Lambda P$, where $\Lambda \in M(p, p)$ is an invertible diagonal matrix and $P \in M(p, p)$ is a permutation matrix. This yields $L_{\Lambda P}(\zeta^{\otimes p}) = \zeta^{\otimes p}$. Now let $Z$ be a random variable with distribution $\zeta$. The above then yields that for all $i$, $\Lambda_{ii} Z$ and $Z$ have the same distribution. In particular, $|\Lambda_{ii}||Z|$ and $|Z|$ have the same distribution, so $P(|Z| \leq z/|\Lambda_{ii}|) = P(|Z| \leq z)$ for all $z \in \mathbb{R}$. As $Z$ is not almost surely zero, there is $z \neq 0$ such that $P(|Z| \leq z - \varepsilon) < P(|Z| \leq z + \varepsilon)$ for all $\varepsilon > 0$. This yields $|\Lambda_{ii}| = 1$. Next, let $\varphi$ denote the characteristic function of $Z$. We then have $\varphi(\Lambda_{ii}\theta) = \varphi(\theta)$ for all $\theta \in \mathbb{R}$. As $Z$ is not symmetric, there is a $\theta \in \mathbb{R}$ such that $\varphi(\theta) \neq \varphi(-\theta)$. Therefore, $\Lambda_{ii} = -1$ cannot hold, so we must have $\Lambda_{ii} = 1$. We conclude that $\Lambda$ is the identity matrix and thus $A = BP$, as required.
Proof of (2). Finally, consider a symmetric probability measure $\zeta$. It is then immediate that when $\Lambda$ and $P$ are as in the statement of the lemma, it holds that $L_{\Lambda P}(\zeta^{\otimes p}) = \zeta^{\otimes p}$ and thus $F_A = F_B$ whenever $A = B\Lambda P$. The converse implication follows as in the proof of (3).
A.2. Proofs for Section 4. Proof of Lemma 4.1. The existence of the process $W$ follows from the results cited at the beginning of Section 4. To show separability, note that for any $x \in \mathbb{R}^p$ there exists a sequence $(x_n) \subseteq \mathbb{Q}^p$ such that $F(x) = \lim_{n \to \infty} F(x_n)$. Therefore, $\mathbb{R}^p$ endowed with the intrinsic pseudometric $\rho$ of $W$ is separable, and $\mathbb{Q}^p$ is a countable dense subset. As a consequence, $\|W\|_\infty = \sup_{x \in \mathbb{Q}^p} |W(x)|$ almost surely. In particular, completing the underlying probability space, we may take $\|W\|_\infty$ to be measurable.
In order to see that $\|W\|_\infty$ is almost surely finite, note that by Theorem B of [7], there exists a probability space $(\Omega, \mathcal{F}, P)$ endowed with a sequence of variables $(X_k)$, independent and with common cumulative distribution function $F$, as well as a sequence of $p$-dimensional separable Gaussian fields $(W_k)$ with the same finite-dimensional distributions as $W$, such that, with $\hat{F}_n$ denoting the empirical distribution function of $X_1, \ldots, X_n$, the approximation error between $\sqrt{n}(\hat{F}_n - F)$ and $W_n$ satisfies an exponential tail bound with constants $C_1, C_2 > 0$. As all the $W_n$ have the same distribution, this yields in particular that $\|W_n\|_\infty$ is bounded in probability, uniformly in $n$. Letting $n$ tend to infinity, this implies $P(\|W\|_\infty < \infty) = 1$, as required.
Before proving Theorem 4.2 and Theorem 4.3, we show a result on empirical processes. Recall that for a metric space $(M, d)$, the $\varepsilon$-covering number $N(\varepsilon, M, d)$ is the minimum number of open balls of radius $\varepsilon$ required to cover $(M, d)$; see, e.g., Section 2.1.1 of [24].
proving claim (2). It is then immediate that $\rho$ is a pseudometric, proving claim (1). Next, it holds that $(\mathbb{R}^p, \rho)$ is totally bounded if and only if $N(\varepsilon, \mathbb{R}^p, \rho)$ is finite for all positive $\varepsilon$. Let $Q$ be the distribution corresponding to the cumulative distribution function $F$, and let $L^2(\mathbb{R}^p, \mathcal{B}_p, Q)$ be the space of Borel measurable functions from $\mathbb{R}^p$ to $\mathbb{R}$ which are square-integrable with respect to $Q$. Let $\|\cdot\|_{2,Q}$ denote the usual seminorm on $L^2(\mathbb{R}^p, \mathcal{B}_p, Q)$. Applying claim (2), it is immediate that $\rho(x, y) = \|1_{I_x} - 1_{I_y}\|_{2,Q}$. Combining Example 2.6.1 and Exercise 2.6.9 of [24], we find that $(1_{I_x})_{x \in \mathbb{R}^p}$ is a Vapnik–Červonenkis (VC) subgraph class with VC dimension $p + 1$. Furthermore, $(1_{I_x})_{x \in \mathbb{R}^p}$ has envelope function constant and equal to one. Therefore, Theorem 2.6.7 of [24] shows that $N(\varepsilon, (1_{I_x})_{x \in \mathbb{R}^p}, \|\cdot\|_{2,Q})$, and thus $N(\varepsilon, \mathbb{R}^p, \rho)$, is finite, and so $(\mathbb{R}^p, \rho)$ is totally bounded.
Lemma A.3. Let $(F_n)$ be a sequence of cumulative distribution functions on $\mathbb{R}^p$, and let $F$ be a cumulative distribution function on $\mathbb{R}^p$. Let $(X_{nk})_{1 \leq k \leq n}$ be a triangular array such that for each $n$, $X_{n1}, \ldots, X_{nn}$ are independent with cumulative distribution function $F_n$. Let $\hat{F}_n$ be the empirical distribution function of $X_{n1}, \ldots, X_{nn}$. If $F_n$ converges uniformly to $F$, then $\sqrt{n}(\hat{F}_n - F_n)$ converges weakly in $\ell^\infty(\mathbb{R}^p)$ to an $F$-Gaussian field.
Proof. For $x, y \in \mathbb{R}^p$ and $n \geq 1$, let $R_n(x, y) = F_n(x \wedge y) - F_n(x)F_n(y)$, and also define $R(x, y) = F(x \wedge y) - F(x)F(y)$. Let $\rho$ be the pseudometric of Lemma A.2 corresponding to the cumulative distribution function $F$. Let $Z_{nk}$ be the random field indexed by $\mathbb{R}^p$ given by $Z_{nk}(x) = 1_{I_x}(X_{nk})/\sqrt{n}$, where we as usual put $I_x = (-\infty, x_1] \times \cdots \times (-\infty, x_p]$. We will apply Theorem 2.11.1 of [24] to prove that $\sum_{k=1}^n (Z_{nk} - EZ_{nk})$, and thus $\sqrt{n}(\hat{F}_n - F_n)$, converges weakly in $\ell^\infty(\mathbb{R}^p)$. We may assume without loss of generality that all variables are defined on a product probability space as described in Section 2.11.1 of [24], and as the fields $(Z_{nk})$ can be constructed using only countably many variables, the measurability requirements in Theorem 2.11.1 of [24] can be ensured. In order to apply Theorem 2.11.1 of [24], first note that by Lemma A.2, $(\mathbb{R}^p, \rho)$ is totally bounded and so can be applied in Theorem 2.11.1 of [24]. Also, the covariance function of $\sum_{k=1}^n (Z_{nk} - EZ_{nk})$ is $R_n$, so as $F_n$ converges uniformly to $F$, $R_n$ converges uniformly to $R$. Thus, the covariance functions of $\sum_{k=1}^n (Z_{nk} - EZ_{nk})$ converge to $R$. Therefore, in order to apply Theorem 2.11.1 of [24], it only remains to confirm that the conditions of (2.11.2) in [24] hold. Fixing $\eta > 0$, we have $|Z_{nk}(x)| \leq 1/\sqrt{n}$ for all $x$, and so it is immediate that the first condition of (2.11.2) in [24] holds. Next, define $d_n^2(x, y) = \sum_{k=1}^n (Z_{nk}(x) - Z_{nk}(y))^2$. We then also have for $x, y \in \mathbb{R}^p$ that
$$d_n^2(x, y) = \frac{1}{n} \sum_{k=1}^n (1_{I_x}(X_{nk}) - 1_{I_y}(X_{nk}))^2, \qquad (A.27)$$
and therefore $E d_n(x, y)^2 = F_n(x) + F_n(y) - 2F_n(x \wedge y)$. Thus, $(x, y) \mapsto E d_n(x, y)^2$ converges uniformly to $\rho^2$ on $\mathbb{R}^p \times \mathbb{R}^p$. Therefore, we conclude that for any sequence $(\delta_n)$ of positive numbers tending to zero, it holds for all $\eta > 0$ that $\limsup_{n \to \infty} \sup_{x, y : \rho(x, y) \leq \delta_n} E d_n(x, y)^2 \leq \eta$. Hence, the second condition of (2.11.2) in [24] holds.
In order to verify the final condition of (2.11.2) in [24], first note that by (A.27), $d_n(x, y)^2 = E_{P_n}(1_{I_x} - 1_{I_y})^2$, where $E_{P_n}$ denotes integration with respect to $P_n$ and $P_n$ is the empirical measure on $(\mathbb{R}^p, \mathcal{B}_p)$ of $X_{n1}, \ldots, X_{nn}$. Thus, $d_n(x, y)$ is the $L^2(\mathbb{R}^p, \mathcal{B}_p, P_n)$ distance between the mappings $1_{I_x}$ and $1_{I_y}$, and so $d_n(x, y) \leq \sup_Q \|1_{I_x} - 1_{I_y}\|_{2,Q}$, where $\|\cdot\|_{2,Q}$ denotes the norm on $L^2(\mathbb{R}^p, \mathcal{B}_p, Q)$ and the supremum is over all probability measures $Q$ on $(\mathbb{R}^p, \mathcal{B}_p)$. Thus, the third condition of (2.11.2) in [24] is satisfied provided that for all sequences $(\delta_n)$ of positive numbers tending to zero,
$$\lim_{n \to \infty} \int_0^{\delta_n} \sqrt{\log \sup_Q N(\varepsilon, (1_{I_x})_{x \in \mathbb{R}^p}, \|\cdot\|_{2,Q})} \, \mathrm{d}\varepsilon = 0. \qquad (A.30)$$
However, Theorem 2.6.7 of [24] yields a constant $K > 0$ such that for $0 < \varepsilon < 1$, the covering number $\sup_Q N(\varepsilon, (1_{I_x})_{x \in \mathbb{R}^p}, \|\cdot\|_{2,Q})$ is bounded by a polynomial in $1/\varepsilon$ with degree and constants depending only on $p$ and $K$. As a consequence, again for $0 < \varepsilon < 1$, the integrand in (A.30) is bounded by $\sqrt{a - b \log \varepsilon}$ for suitable $a, b > 0$. By elementary calculations involving the error function $\mathrm{erf}(x) = (2/\sqrt{\pi}) \int_0^x \exp(-y^2) \, \mathrm{d}y$, the mapping $x \mapsto \sqrt{a - b \log x}$ is integrable over $[0, \eta]$ for all $0 < \eta < 1$. Thus, (A.30) holds. Recalling (A.24), Theorem 2.11.1 of [24] now shows that $\sqrt{n}(\hat{F}_n - F_n)$ converges weakly in $\ell^\infty(\mathbb{R}^p)$. By uniqueness of the finite-dimensional distributions of the limit, we find that the limit is an $F$-Gaussian field.
Proof of Theorem 4.3. By the triangle inequality, we have the inequalities
$$\sqrt{n}\,\|\hat{F}_n^{A,\beta_n} - F_A^{\beta_n}\|_\infty - \sqrt{n}\,\|F_A^{\beta_n} - F_B^{\beta_n}\|_\infty \leq \sqrt{n}\,\|\hat{F}_n^{A,\beta_n} - F_B^{\beta_n}\|_\infty \leq \sqrt{n}\,\|\hat{F}_n^{A,\beta_n} - F_A^{\beta_n}\|_\infty + \sqrt{n}\,\|F_A^{\beta_n} - F_B^{\beta_n}\|_\infty. \qquad (A.36)$$
Let $\eta > 0$. By Corollary 3.4, we can choose $N \geq 1$ such that for $n \geq N$,
$$\sqrt{n}\,\|F_B^{\beta_n} - F_A^{\beta_n}\|_\infty \leq 4p(1 + \eta)\sqrt{n}\,\beta_n\,\|\xi - \zeta\|_\infty. \qquad (A.37)$$
By our assumptions, $\lim_n \sqrt{n}\beta_n = k$. Letting $\gamma > 0$, we then find for $n$ large that $\sqrt{n}\,\|F_B^{\beta_n} - F_A^{\beta_n}\|_\infty \leq 4p(1 + \eta)(k + \gamma)\|\xi - \zeta\|_\infty$. For such $n$, the first inequality of (A.36) yields a lower bound, and the second inequality of (A.36) yields an upper bound, on $\sqrt{n}\,\|\hat{F}_n^{A,\beta_n} - F_B^{\beta_n}\|_\infty$ in terms of $\sqrt{n}\,\|\hat{F}_n^{A,\beta_n} - F_A^{\beta_n}\|_\infty$, and by similar arguments as previously, we obtain the corresponding limit superior and limit inferior bounds. Combining our results, we obtain (4.5).
Proof of Corollary 4.4. As we have assumed that $P_e(\beta_n)$ is non-Gaussian, it follows from Lemma 3.5 that $F_A^{\beta_n} \neq F_B^{\beta_n}$, since $A \neq B\Lambda P$ for all diagonal $\Lambda$ with $\Lambda^2 = I$ and all permutation matrices $P$. This shows (1). And as $AA^t = BB^t$ and $\zeta$ is Gaussian, Lemma 3.5 yields $F_A = F_B$, so Theorem 4.3 yields (2).
Proof of Theorem 4.5. Note that for any $x \in \mathbb{R}^p$, we have
$$\sqrt{n}(\hat{F}_n^{A,\beta_n}(x) - F_B^{\beta_n}(x)) = \sqrt{n}(\hat{F}_n^{A,\beta_n}(x) - F_A^{\beta_n}(x)) + \sqrt{n}(F_A^{\beta_n}(x) - F_B^{\beta_n}(x)). \qquad (A.44)$$
We first consider the case $F_A \neq F_B$. Let $x \in \mathbb{R}^p$ be such that $F_A(x) \neq F_B(x)$. Then $\lim_n F_A^{\beta_n}(x) - F_B^{\beta_n}(x) \neq 0$, so $|\sqrt{n}(F_A^{\beta_n}(x) - F_B^{\beta_n}(x))|$ tends to infinity as $n$ tends to infinity. By the central limit theorem, $\sqrt{n}(\hat{F}_n^{A,\beta_n}(x) - F_A^{\beta_n}(x))$ converges in distribution. Therefore, (A.44) yields the result.