Quantitative asymptotics of graphical projection pursuit

There is a result of Diaconis and Freedman which says that, in a limiting sense, for large collections of high-dimensional data most one-dimensional projections of the data are approximately Gaussian. This paper gives quantitative versions of that result. For a set of deterministic vectors $\{x_i\}_{i=1}^n$ in $\R^d$ with $n$ and $d$ fixed, let $\theta\in\s^{d-1}$ be a random point of the sphere and let $\mu_n^\theta$ denote the random measure which puts mass $\frac{1}{n}$ at each of the points $\inprod{x_1}{\theta},...,\inprod{x_n}{\theta}$. For a fixed bounded Lipschitz test function $f$, $Z$ a standard Gaussian random variable and $\sigma^2$ a suitable constant, an explicit bound is derived for the quantity $\ds\P[|\int f d\mu_n^\theta-\E f(\sigma Z)|>\epsilon]$. A bound is also given for $\ds\P[d_{BL}(\mu_n^\theta, N(0,\sigma^2))>\epsilon]$, where $d_{BL}$ denotes the bounded-Lipschitz distance, which yields a lower bound on the waiting time to finding a non-Gaussian projection of the $\{x_i\}$ if directions are tried independently and uniformly on $\s^{d-1}$.


Introduction
A foundational tool of data analysis is the projection of high-dimensional data to a one- or two-dimensional subspace in order to represent the data visually and, ideally, identify underlying structure. The question immediately arises: which projections are interesting? One would like to answer that those projections which exhibit structure are interesting; however, identifying which projections those are is not as straightforward as one might think. In particular, several considerations have led to the idea that one should mainly look for projections which are far from Gaussian in behavior, and that Gaussian projections do not generally exhibit interesting structure. One justification for this idea is the following result due to Persi Diaconis and David Freedman.
Theorem 1 (Diaconis-Freedman [1]). Let $x_1,\ldots,x_n$ be deterministic vectors in $\R^d$. Suppose that $n$, $d$ and the $x_i$ depend on a hidden index $\nu$, so that as $\nu$ tends to infinity, so do $n$ and $d$. Suppose that there is a $\sigma^2>0$ such that, for all $\epsilon>0$,

(1) $\ds\frac{1}{n}\,\#\big\{j\le n:\big||x_j|^2-\sigma^2 d\big|>\epsilon d\big\}\xrightarrow{\nu\to\infty}0,$

and suppose that, for all $\epsilon>0$,

(2) $\ds\frac{1}{n^2}\,\#\big\{1\le j,k\le n:|\inprod{x_j}{x_k}|>\epsilon d\big\}\xrightarrow{\nu\to\infty}0.$

Let $\theta\in\s^{d-1}$ be distributed uniformly on the sphere, and consider the random measure $\mu^\theta_\nu$ which puts mass $\frac{1}{n}$ at each of the points $\inprod{x_1}{\theta},\ldots,\inprod{x_n}{\theta}$. Then as $\nu$ tends to infinity, the measures $\mu^\theta_\nu$ tend to $N(0,\sigma^2)$ weakly in probability.
Heuristically, Theorem 1 can be interpreted as saying that, for a large number of high-dimensional data vectors, as long as they have nearly the same lengths and are nearly orthogonal, most one-dimensional projections are close to Gaussian regardless of the structure of the data. It is important to note that the conditions (1) and (2) are not too strong; in particular, even though only $d$ vectors can be exactly orthogonal in $\R^d$, the $2^d$ vertices of a unit cube centered at the origin satisfy condition (2) for "rough orthogonality".
(Research supported by an American Institute of Mathematics Five-year Fellowship.)
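The cube example can be checked directly. The following small numerical illustration (not part of the original text; the dimension, seed and tolerances are ad hoc choices) verifies that all $2^d$ vertices of the unit cube centered at the origin have exactly the same length, and that the average squared projection onto any fixed unit vector is exactly $\frac14$:

```python
# Numerical check (illustrative only) that the 2^d vertices of the unit cube
# centered at the origin are "roughly orthogonal" in the sense used above:
# all vectors have exactly the same squared length d/4, and the average
# squared projection onto any fixed unit direction theta is exactly 1/4.
import itertools
import math
import random

d = 8
vertices = [tuple(s / 2.0 for s in signs)
            for signs in itertools.product((-1, 1), repeat=d)]
n = len(vertices)  # n = 2^d

# sigma^2 is determined by (1/n) * sum |x_i|^2 = sigma^2 * d.
sigma2 = sum(sum(c * c for c in x) for x in vertices) / (n * d)
assert all(abs(sum(c * c for c in x) - d / 4.0) < 1e-12 for x in vertices)

# For any unit vector theta, (1/n) * sum <x_i, theta>^2 = 1/4 exactly,
# because the cross terms cancel when averaging over all sign patterns.
random.seed(0)
g = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(v * v for v in g))
theta = [v / norm for v in g]
avg_sq_proj = sum(sum(t * c for t, c in zip(theta, x)) ** 2
                  for x in vertices) / n

print(sigma2, avg_sq_proj)  # both equal 0.25 up to rounding
```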
A failing of the usual interpretation of Theorem 1 is that sometimes projections of data look nearly Gaussian for a reason; that is, it is not always due to the central-limit-type effect described by the theorem. Thus the question arises: is there a way to tell whether a Gaussian projection is interesting? A possible answer lies in quantifying the theorem, and then saying that a nearly Gaussian projection is interesting if it is "too close" to Gaussian to simply be the result of the phenomenon described by Theorem 1. By way of analogy, the Berry-Esséen theorem states that the rate of convergence to normal of a sum of $n$ independent, identically distributed random variables is of the order $\frac{1}{\sqrt{n}}$; if a sum of $n$ random variables converges to Gaussian significantly faster, it must be happening for some reason other than just the usual central limit theorem. In order to implement this idea, it is necessary (as with the Berry-Esséen theorem) to have a sharp quantitative version of the limit theorem in question.
A second motivation for proving a quantitative version of Theorem 1 is the application to waiting times for discovering an interesting direction on which to project data. Suppose a sequence of independent random projection directions is tried until the empirical distribution of the projected data is more than some threshold away from Gaussian (in some metric on measures), and let $N$ be the number of trials needed to find such a direction. Then one can easily give a lower bound for $\E N$ from the type of quantitative theorem proved below.
Thus the goal of this paper is to provide a quantitative version of Theorem 1 in a fixed dimension $d$ and for a fixed number of data vectors $n$. To do this, it is first necessary to replace conditions (1) and (2) with non-asymptotic conditions. The conditions we will use are the following. Let $\sigma^2$ be defined by $$\frac{1}{n}\sum_{i=1}^n|x_i|^2=\sigma^2 d,$$ and suppose that

(3) $\ds\frac{1}{n}\sum_{i=1}^n\left|\frac{|x_i|^2}{\sigma^2}-d\right|\le A$

and, for all $\theta\in\s^{d-1}$,

(4) $\ds\frac{1}{n}\sum_{i=1}^n\inprod{x_i}{\theta}^2\le B.$

For a little perspective on the restrictiveness of these conditions, note that, as for the conditions of Diaconis and Freedman, they hold for the vertices of a unit cube in $\R^d$ (with $A=0$ and $B=\frac{1}{4}$). Under these assumptions, the following theorems hold.

Theorem 2. Let $\{x_i\}_{i=1}^n\subseteq\R^d$ satisfy conditions (3) and (4) above. For a point $\theta\in\s^{d-1}$, let the measure $\mu_n^\theta$ put mass $\frac{1}{n}$ at each of the points $\inprod{x_1}{\theta},\ldots,\inprod{x_n}{\theta}$. Fix a test function $f:\R\to\R$ with $\|f\|_{BL}\le1$. Then for $Z$ a standard Gaussian random variable, $\theta$ chosen uniformly on the sphere, $\sigma$ defined as above, and $\epsilon>\max\left\{\frac{2\pi\sqrt{B}}{\sqrt{d-1}},\frac{4(A+2)}{d-1}\right\}$, $$\P\left[\left|\int f\,d\mu_n^\theta-\E f(\sigma Z)\right|>\epsilon\right]\le\sqrt{\frac{\pi}{8}}\,\exp\left(-\frac{(d-1)\epsilon^2}{32B}\right).$$

Theorem 3. Let $\{x_i\}$, $\mu_n^\theta$, $\theta$ and $\sigma$ be as in Theorem 2. Then for $\epsilon\le B$ satisfying $\epsilon>\frac{2(A+2)}{d-1}$ and $\epsilon^{5/2}>\frac{3\cdot2^6\pi B}{\sqrt{d-1}}$, $$\P\left[d_{BL}\big(\mu_n^\theta,N(0,\sigma^2)\big)>\epsilon\right]\le c_1\left(\frac{c_2(d-1)\epsilon^5}{B^2}\right)^{-3/10}\exp\left(-\frac{c_2(d-1)\epsilon^5}{B^2}\right),$$ with $c_1=48\sqrt{\pi}$, $c_2=3^{-2}2^{-16}$, and $d_{BL}$ denoting the bounded-Lipschitz distance.
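The phenomenon described by Theorem 2 can be seen numerically by sampling directions and comparing $\int f\,d\mu_n^\theta$ with $\E f(\sigma Z)$. The following Monte Carlo sketch is illustrative only: the data (random sign vectors), the test function, the seed and the tolerance are all choices made here, not taken from the paper. The test function is odd, so $\E f(\sigma Z)=0$ by symmetry:

```python
# Monte Carlo illustration (ad hoc choices throughout): for data given by n
# random sign vectors in R^d scaled by 1/2, and the odd bounded Lipschitz
# test function f(x) = max(-1, min(1, x)) (so E f(sigma Z) = 0 by symmetry),
# the integral of f against mu_n^theta stays close to 0 for most uniformly
# chosen projection directions theta.
import math
import random

random.seed(1)
d, n = 50, 500
data = [[random.choice((-0.5, 0.5)) for _ in range(d)] for _ in range(n)]

def f(x):
    return max(-1.0, min(1.0, x))

def random_direction(dim):
    g = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

deviations = []
for _ in range(200):
    theta = random_direction(d)
    integral = sum(f(sum(t * c for t, c in zip(theta, x))) for x in data) / n
    deviations.append(abs(integral - 0.0))  # E f(sigma Z) = 0 here

print(max(deviations))  # small: most projections look nearly Gaussian
```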

Remarks:
(i) It should be emphasized that the key difference between the results proved here and the result of Diaconis and Freedman is that Theorems 2 and 3 hold for fixed dimension $d$ and number of data vectors $n$; there are no limits in the statements of the theorems.

(ii) It is not necessary for $A$ and $B$ to be absolute constants; for the results above to be of interest as $d\to\infty$, it is easy to see from the statements that it is only necessary that $A=o(d)$ and $B=o(d)$ for Theorem 2, and $A=o(d)$ and $B=o(\sqrt{d})$ for Theorem 3. The reader may also be wondering where the dependence on $n$ is in the statements above; it is built into the definition of $B$. Note that, by definition, $B\ge\frac{|x_i|^2}{n}$ for each $i$; in particular, $B\ge\frac{\sigma^2d}{n}$. It is thus necessary that $n\to\infty$ as $d\to\infty$ for Theorem 2 and $n\gg\sqrt{d}$ for Theorem 3.

(iii) For Theorem 2, consider the special case that $\epsilon=C\sqrt{\frac{B}{d-1}}$ for a large constant $C$. Then the statement becomes $$\P\left[\left|\int f\,d\mu_n^\theta-\E f(\sigma Z)\right|>C\sqrt{\frac{B}{d-1}}\right]\le\sqrt{\frac{\pi}{8}}\,e^{-C^2/32};$$ that is, $\int f\,d\mu_n^\theta$ is unlikely to differ from $\E f(\sigma Z)$ by more than a large multiple of $\sqrt{\frac{B}{d-1}}$.

(iv) It is similarly useful to consider the following special case for Theorem 3. Let $C>\frac{3}{10}$, and consider the case $\epsilon=\left(\frac{C'\log(d-1)}{d-1}\right)^{1/5}$. Then the bound above becomes $$\P\left[d_{BL}\big(\mu_n^\theta,N(0,\sigma^2)\big)>\left(\frac{C'\log(d-1)}{d-1}\right)^{1/5}\right]\le\frac{C''}{(\log(d-1))^{3/10}(d-1)^C},$$ where $C'=9\cdot2^{16}CB^2$ and $C''=48\sqrt{\pi}\,C^{-3/10}$. Thus, roughly speaking, the bounded-Lipschitz distance from the random measure $\mu_n^\theta$ to the Gaussian measure with mean zero and variance $\sigma^2$ is unlikely to be more than a large multiple of $\left(\frac{B^2\log(d-1)}{d-1}\right)^{1/5}$. We make no claims of the sharpness of this result.

Theorem 3 can easily be used to give an estimate on the waiting time until a non-Gaussian direction is found, if directions are tried randomly and independently. Specifically, we have the following corollary.

Corollary 4. With the conditions and notation of Theorem 3, let $\theta_1,\theta_2,\ldots$ be independent, uniformly distributed random points of $\s^{d-1}$, and let $T_\epsilon$ denote the first $j$ for which $d_{BL}(\mu_n^{\theta_j},N(0,\sigma^2))>\epsilon$. Then $$\E T_\epsilon\ge\frac{1}{c_1}\left(\frac{c_2(d-1)\epsilon^5}{B^2}\right)^{3/10}\exp\left(\frac{c_2(d-1)\epsilon^5}{B^2}\right).$$

Proofs
This section is mainly devoted to the proofs of Theorems 2 and 3, with some additional remarks following the proofs. For the proof of Theorem 2, several auxiliary results are needed. The first is an abstract normal approximation for bounding the distance of a random variable to a Gaussian random variable in the presence of a continuous family of exchangeable pairs. The theorem is an abstraction of an idea used by Stein in [6] to bound the distance to Gaussian of the trace of a power of a random orthogonal matrix.
Theorem 5 (Meckes [4]). Suppose that $(W,W_\epsilon)$ is a family of exchangeable pairs defined on a common probability space, such that $\E W=0$ and $\E W^2=\sigma^2$. Let $\mathcal{F}$ be a $\sigma$-algebra on this space with $\sigma(W)\subseteq\mathcal{F}$. Suppose there are a function $\lambda(\epsilon)$ and random variables $E$, $E'$, measurable with respect to $\mathcal{F}$, such that, as $\epsilon\to0$,

(i) $\ds\frac{1}{\lambda(\epsilon)}\E\big[W_\epsilon-W\,\big|\,\mathcal{F}\big]\longrightarrow-W+E'$ in $L^1$;

(ii) $\ds\frac{1}{\lambda(\epsilon)}\E\big[(W_\epsilon-W)^2\,\big|\,\mathcal{F}\big]\longrightarrow2\sigma^2+E$ in $L^1$;

(iii) $\ds\frac{1}{\lambda(\epsilon)}\E\big|W_\epsilon-W\big|^3\longrightarrow0$.

Then if $Z$ is a standard normal random variable, $$d_{TV}(W,\sigma Z)\le\frac{1}{\sigma}\E|E'|+\frac{1}{2\sigma^2}\E|E|.$$

The next result gives expressions for some mixed moments of entries of a Haar-distributed orthogonal matrix. See [3], Lemma 3.3 and Theorem 1.6, for a detailed proof.

Lemma 6. Let $U=(u_{ij})_{i,j=1}^d$ be a Haar-distributed random orthogonal matrix.

(i) For all $i,j,r,s$, $$\E[u_{ij}u_{rs}]=\frac{1}{d}\,\delta_{ir}\delta_{js}.$$

(ii) For all $i,j,r,s,\alpha,\beta,\lambda,\mu$, $$\E[u_{ij}u_{rs}u_{\alpha\beta}u_{\lambda\mu}]=\frac{d+1}{d(d-1)(d+2)}\big[\delta_{ir}\delta_{\alpha\lambda}\delta_{js}\delta_{\beta\mu}+\delta_{i\alpha}\delta_{r\lambda}\delta_{j\beta}\delta_{s\mu}+\delta_{i\lambda}\delta_{r\alpha}\delta_{j\mu}\delta_{s\beta}\big]-\frac{1}{d(d-1)(d+2)}\big[\delta_{ir}\delta_{\alpha\lambda}(\delta_{j\beta}\delta_{s\mu}+\delta_{j\mu}\delta_{s\beta})+\delta_{i\alpha}\delta_{r\lambda}(\delta_{js}\delta_{\beta\mu}+\delta_{j\mu}\delta_{s\beta})+\delta_{i\lambda}\delta_{r\alpha}(\delta_{js}\delta_{\beta\mu}+\delta_{j\beta}\delta_{s\mu})\big].$$

(iii) For the matrix $Q=(q_{ij})_{i,j=1}^d$ defined by $q_{ij}:=u_{i1}u_{j2}-u_{i2}u_{j1}$, and for all $i,j,\ell,p$, $$\E[q_{ij}q_{\ell p}]=\frac{2}{d(d-1)}\big(\delta_{i\ell}\delta_{jp}-\delta_{ip}\delta_{j\ell}\big).$$

Finally, we will need to make use of the concentration of measure on the sphere, in the form of the following lemma.
Lemma 7 (Lévy, see [5]). For a function $F:\s^{d-1}\to\R$, let $M_F$ denote its median with respect to the uniform measure (that is, for $\theta$ distributed uniformly on $\s^{d-1}$, $\P[F(\theta)\le M_F]\ge\frac12$ and $\P[F(\theta)\ge M_F]\ge\frac12$), and let $L$ denote its Lipschitz constant. Then $$\P\big[|F(\theta)-M_F|>\epsilon\big]\le\sqrt{\frac{\pi}{8}}\,e^{-\frac{(d-1)\epsilon^2}{2L^2}}.$$ With these results, it is now possible to give the proof of Theorem 2.
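The concentration phenomenon behind Lévy's lemma is easy to see by simulation. The sketch below (illustrative; the coordinate function, dimension, seed and sample size are ad hoc choices made here) tests the $1$-Lipschitz function $F(\theta)=\theta_1$, whose median is $0$, against a tail bound of the form quoted above:

```python
# Monte Carlo illustration (ad hoc choices) of concentration on the sphere:
# for the 1-Lipschitz coordinate function F(theta) = theta_1 on S^(d-1),
# whose median is 0, the empirical tail P[|F| > eps] is tiny for moderate d.
import math
import random

random.seed(2)
d, trials, eps = 100, 20000, 0.3

def random_direction(dim):
    g = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

hits = sum(1 for _ in range(trials) if abs(random_direction(d)[0]) > eps)
empirical = hits / trials

# Tail bound of the form sqrt(pi/8) * exp(-(d-1) * eps^2 / 2), with L = 1.
bound = math.sqrt(math.pi / 8) * math.exp(-(d - 1) * eps * eps / 2)
print(empirical, bound)  # the empirical tail sits below the bound
```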
Proof of Theorem 2. The proof divides into two parts. First, an "annealed" version of the theorem is proved using the infinitesimal version of Stein's method given by Theorem 5. Then, for a fixed test function $f$ and $Z$ a standard Gaussian random variable, the quantity $\P\big[\big|\int f\,d\mu_n^\theta-\E f(\sigma Z)\big|>\epsilon\big]$ is bounded using the annealed theorem together with the concentration of measure phenomenon.
Let $\theta$ be a uniformly distributed random point of $\s^{d-1}\subseteq\R^d$, and let $I$ be a uniformly distributed element of $\{1,\ldots,n\}$, independent of $\theta$. Consider the random variable $W:=\inprod{\theta}{x_I}$. Then $\E W=0$ by symmetry, and since $\E\big[\inprod{\theta}{x_i}^2\big]=\frac{|x_i|^2}{d}$ for each $i$, the condition $\frac{1}{n}\sum_{i=1}^n|x_i|^2=\sigma^2d$ gives $\E W^2=\sigma^2$. Theorem 5 will be used to bound the total variation distance from $W$ to $\sigma Z$, where $Z$ is a standard Gaussian random variable.
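The identities $\E W=0$ and $\E W^2=\sigma^2$ can be confirmed numerically for an arbitrary small data set; the following check is purely illustrative (the data, seed and tolerances are ad hoc):

```python
# Monte Carlo check (illustrative) that W = <theta, x_I> has mean 0 and
# second moment sigma^2 = (1/(n*d)) * sum |x_i|^2, for theta uniform on the
# sphere and I an independent uniform index in {1, ..., n}.
import math
import random

random.seed(3)
d, n, trials = 20, 5, 50000
data = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
sigma2 = sum(sum(c * c for c in x) for x in data) / (n * d)

def random_direction(dim):
    g = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

samples = []
for _ in range(trials):
    theta = random_direction(d)
    x = data[random.randrange(n)]
    samples.append(sum(t * c for t, c in zip(theta, x)))

mean = sum(samples) / trials
second_moment = sum(w * w for w in samples) / trials
print(mean, second_moment, sigma2)  # mean near 0, second moment near sigma2
```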
The family of exchangeable pairs needed to apply the theorem is constructed as follows. For $\epsilon>0$ fixed, let $$A_\epsilon:=\begin{pmatrix}\sqrt{1-\epsilon^2}&\epsilon&\\-\epsilon&\sqrt{1-\epsilon^2}&\\&&I_{d-2}\end{pmatrix},$$ and write $\sqrt{1-\epsilon^2}=1-\frac{\epsilon^2}{2}+\delta$, where $\delta=O(\epsilon^4)$. Let $U$ be a Haar-distributed $d\times d$ random orthogonal matrix, independent of $\theta$ and $I$, and let $W_\epsilon:=\inprod{UA_\epsilon U^T\theta}{x_I}$; the pair $(W,W_\epsilon)$ is exchangeable for each $\epsilon>0$.
Let $K$ be the $d\times2$ matrix made of the first two columns of $U$, and let $C_2=\begin{pmatrix}0&1\\-1&0\end{pmatrix}$. Define $Q:=KC_2K^T$ (note that this is the same $Q$ as in part (iii) of Lemma 6). Then by the construction of $W_\epsilon$,

(5) $\ds W_\epsilon-W=\inprod{(UA_\epsilon U^T-I_d)\theta}{x_I}=\left(-\frac{\epsilon^2}{2}+\delta\right)\inprod{KK^T\theta}{x_I}+\epsilon\inprod{Q\theta}{x_I}.$

The conditions of Theorem 5 can be verified using the expressions in Lemma 6 as follows. By the lemma, $\E\big[KK^T\big]=\frac{2}{d}I_d$ and $\E Q=0$, and so it follows from (5) that $$\E\big[W_\epsilon-W\,\big|\,\theta,I\big]=\left(-\frac{\epsilon^2}{2}+\delta\right)\frac{2}{d}\,W=-\frac{\epsilon^2}{d}\,W+O(\epsilon^4).$$ Condition (i) of Theorem 5 is thus satisfied for $\lambda(\epsilon)=\frac{\epsilon^2}{d}$ and $E'=0$. For condition (ii), taking $\mathcal{F}=\sigma(\theta,I)$, Lemma 6, part (iii) yields $$\frac{1}{\lambda(\epsilon)}\E\big[(W_\epsilon-W)^2\,\big|\,\mathcal{F}\big]=d\,\E\big[\inprod{Q\theta}{x_I}^2\,\big|\,\theta,I\big]+O(\epsilon)=\frac{2}{d-1}\big(|x_I|^2-W^2\big)+O(\epsilon).$$

Condition (ii) of Theorem 5 is thus satisfied with $$E=\frac{2}{d-1}\big(|x_I|^2-W^2\big)-2\sigma^2=\frac{2}{d-1}\Big[\big(|x_I|^2-\sigma^2d\big)+\big(\sigma^2-W^2\big)\Big].$$
Condition (iii) of the theorem is trivial by (5); it follows that

(6) $\ds d_{TV}(W,\sigma Z)\le\frac{1}{2\sigma^2}\E|E|\le\frac{1}{\sigma^2(d-1)}\Big[\E\big||x_I|^2-\sigma^2d\big|+\E\big|\sigma^2-W^2\big|\Big]\le\frac{A+2}{d-1},$

using condition (3) for the first term and $\E|\sigma^2-W^2|\le\sigma^2+\E W^2=2\sigma^2$ for the second. This is the annealed statement referred to at the beginning of the proof.
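The moment identities from Lemma 6 that drive the computation above can be spot-checked by simulation in dimension $d=3$, where Haar-distributed rotations are easy to generate from uniform unit quaternions. This is an illustrative check only; the relevant even moments agree on $SO(3)$ and $O(3)$ because flipping the sign of $U$ leaves $u_{11}^2$ and $q_{12}^2$ unchanged:

```python
# Monte Carlo check (illustrative) of the d = 3 cases of the moment
# identities used above: E[u_11^2] = 1/d = 1/3, and, for Q with
# q_ij = u_i1*u_j2 - u_i2*u_j1, E[q_12^2] = 2/(d(d-1)) = 1/3.
# Haar rotations are sampled via uniform unit quaternions.
import math
import random

random.seed(4)

def haar_rotation():
    # Normalize a standard Gaussian 4-vector to a uniform unit quaternion,
    # then convert it to the corresponding 3x3 rotation matrix.
    w, x, y, z = (random.gauss(0, 1) for _ in range(4))
    s = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / s, x / s, y / s, z / s
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

trials = 100000
sum_u11_sq = 0.0
sum_q12_sq = 0.0
for _ in range(trials):
    u = haar_rotation()
    sum_u11_sq += u[0][0] ** 2
    q12 = u[0][0] * u[1][1] - u[0][1] * u[1][0]  # built from columns 1 and 2
    sum_q12_sq += q12 ** 2

avg_u11_sq = sum_u11_sq / trials
avg_q12_sq = sum_q12_sq / trials
print(avg_u11_sq, avg_q12_sq)  # both near 1/3
```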
We next use the concentration of measure on the sphere to show that, for a large measure of $\theta\in\s^{d-1}$, the random measure $\mu_n^\theta$ which puts mass $\frac{1}{n}$ at each of the $\inprod{\theta}{x_i}$ is close to the average behavior. To do this, we make use of Lévy's lemma (Lemma 7). Let $f:\R\to\R$ be such that $\|f\|_{BL}\le1$, and consider the function $F$ defined on the sphere by $$F(\theta):=\int f\,d\mu_n^\theta=\frac{1}{n}\sum_{i=1}^n f\big(\inprod{\theta}{x_i}\big).$$ In order to apply Lemma 7, it is necessary to determine the Lipschitz constant of $F$. Let $\theta,\theta'\in\s^{d-1}$. Then, using $\|f\|_{BL}\le1$ together with condition (4), $$|F(\theta)-F(\theta')|\le\frac{1}{n}\sum_{i=1}^n\big|\inprod{\theta-\theta'}{x_i}\big|\le\sqrt{\frac{1}{n}\sum_{i=1}^n\inprod{\theta-\theta'}{x_i}^2}\le\sqrt{B}\,|\theta-\theta'|;$$ thus the Lipschitz constant of $F$ is bounded by $\sqrt{B}$. It follows from Lemma 7 that

(7) $\ds\P\big[|F(\theta)-M_F|>\epsilon\big]\le\sqrt{\frac{\pi}{8}}\,e^{-\frac{(d-1)\epsilon^2}{2B}},$

where $M_F$ is the median of the function $F$. Now, if $\theta$ is a random point of $\s^{d-1}$, then $$\big|\E F(\theta)-M_F\big|\le\E\big|F(\theta)-M_F\big|=\int_0^\infty\P\big[|F(\theta)-M_F|>t\big]\,dt\le\frac{\pi\sqrt{B}}{4\sqrt{d-1}};$$ thus if $\epsilon>\frac{\pi\sqrt{B}}{\sqrt{d-1}}$, we may use concentration about the median of $F$ to obtain concentration about the mean, with only a loss in constants.
Note that for $W=\inprod{\theta}{x_I}$ as above, $\E\int f\,d\mu_n^\theta=\E F(\theta)=\E f(W)$, and so by the bound (6), $$\big|\E F(\theta)-\E f(\sigma Z)\big|\le2\,d_{TV}(W,\sigma Z)\le\frac{2(A+2)}{d-1}.$$ Putting these pieces together, if $\epsilon>\max\left\{\frac{2\pi\sqrt{B}}{\sqrt{d-1}},\frac{4(A+2)}{d-1}\right\}$, then $$\P\left[\left|\int f\,d\mu_n^\theta-\E f(\sigma Z)\right|>\epsilon\right]\le\P\left[\big|F(\theta)-\E F(\theta)\big|>\frac{\epsilon}{2}\right]\le\sqrt{\frac{\pi}{8}}\,e^{-\frac{(d-1)\epsilon^2}{32B}}.\qquad\Box$$

Proof of Theorem 3. The first two steps of the proof of Theorem 3 were essentially done already in the proof of Theorem 2. From that proof, we have that if $W=\inprod{\theta}{x_I}$ for $\theta$ distributed uniformly on $\s^{d-1}$ and $I$ independent of $\theta$ and uniformly distributed in $\{1,\ldots,n\}$, then

(8) $\ds d_{TV}(W,\sigma Z)\le\frac{A+2}{d-1}$

for $A$ as in condition (3). Furthermore, it follows from equation (7) in the proof of Theorem 2 that for $F(\theta):=\int f\,d\mu_n^\theta$ and $\epsilon>\frac{\pi\sqrt{B}}{\sqrt{d-1}}$,

(9) $\ds\P\big[|F(\theta)-\E F(\theta)|>\epsilon\big]\le\sqrt{\frac{\pi}{8}}\,e^{-\frac{(d-1)\epsilon^2}{8B}}.$

In this proof, this last statement is used together with a series of successive approximations of arbitrary bounded-Lipschitz functions, as used by Guionnet and Zeitouni [2], to obtain a bound for $\P\big[d_{BL}(\mu_n^\theta,N(0,\sigma^2))>\epsilon\big]$. By definition, $$d_{BL}\big(\mu_n^\theta,N(0,\sigma^2)\big)=\sup_{\|f\|_{BL}\le1}\left|\int f\,d\mu_n^\theta-\E f(\sigma Z)\right|.$$ First consider the subclass $\mathcal{F}_{BL,K}=\{f:\|f\|_{BL}\le1,\ \mathrm{supp}(f)\subseteq K\}$ for a compact set $K\subseteq\R$. Let $\Delta=\frac{\epsilon}{4}$; for $f\in\mathcal{F}_{BL,K}$, define the approximation $f_\Delta$ as in Guionnet and Zeitouni [2] as follows. Let $x_o=\inf K$, and let $f_\Delta$ be the piecewise-linear function agreeing with $f$ at $x_o$ whose slope is $1$ or $-1$ on each of the intervals $[x_o+i\Delta,x_o+(i+1)\Delta]$, chosen at each step so as to track $f$ as closely as possible. Note that, because $\|f\|_{BL}\le1$, it follows that $\|f-f_\Delta\|_\infty\le\Delta$, and the number of distinct functions whose linear span is used to approximate $f$ in this way is bounded by $\frac{|K|}{\Delta}$, where $|K|$ is the diameter of $K$. If $\{h_k\}_{k=1}^N$ denotes the set of functions used in the approximation $f_\Delta$ and $\epsilon_k$ their coefficients, then for $\epsilon^2>8\pi|K|\sqrt{\frac{B}{d-1}}$, $$\P\left[\sup_{f\in\mathcal{F}_{BL,K}}\left|\int f\,d\mu_n^\theta-\E\int f\,d\mu_n^\theta\right|>\epsilon\right]\le\sum_{k=1}^N\P\left[\left|\int h_k\,d\mu_n^\theta-\E\int h_k\,d\mu_n^\theta\right|>\frac{\epsilon^2}{8|K|}\right]\le\frac{4|K|}{\epsilon}\sqrt{\frac{\pi}{8}}\,e^{-\frac{(d-1)\epsilon^4}{2^9|K|^2B}}.$$ The middle expression follows from equation (9) above, and the last from the bound $N\le\frac{4|K|}{\epsilon}$. To move to the full set $\mathcal{F}_{BL}:=\{f:\|f\|_{BL}\le1\}$, we make a truncation argument. Given $f\in\mathcal{F}_{BL}$ and $M>0$, define $f^M$ to agree with $f$ on $[-M,M]$, to vanish off $[-M-1,M+1]$, and to interpolate linearly in between; note that $\|f^M\|_{BL}\le1$ and that, since $\int x^2\,d\mu_n^\theta=\frac{1}{n}\sum_{i=1}^n\inprod{\theta}{x_i}^2\le B$ for every $\theta$ and $\sigma^2\le B$, both $\mu_n^\theta$ and $N(0,\sigma^2)$ put mass at most $\frac{B}{M^2}$ outside $[-M,M]$. Choosing $M$ such that $\frac{B}{M^2}=\frac{\epsilon}{4}$ and keeping track of the constants, it follows that for $\epsilon^{5/2}>\frac{3\cdot2^6\pi B}{\sqrt{d-1}}$, and assuming that $B\ge\epsilon$, $$\P\left[\sup_{f\in\mathcal{F}_{BL}}\left|\int f\,d\mu_n^\theta-\E\int f\,d\mu_n^\theta\right|>\epsilon\right]\le c_1\left(\frac{c_2(d-1)\epsilon^5}{B^2}\right)^{-3/10}\exp\left(-\frac{c_2(d-1)\epsilon^5}{B^2}\right).$$
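The $\Delta$-approximation step can be illustrated concretely. The sketch below (an illustrative stand-in with ad hoc choices of $f$, $K$ and $\Delta$; it uses an ordinary piecewise-linear interpolant, whose slopes merely have modulus at most $1$, rather than the paper's slope-$\pm1$ construction) checks that a piecewise-linear approximation on a grid of spacing $\Delta$ is uniformly within $\Delta$ of a $1$-Lipschitz $f$:

```python
# Illustrative check of the approximation step: a piecewise-linear
# interpolant f_delta on a grid of spacing delta satisfies
# ||f - f_delta||_inf <= delta for a 1-Lipschitz f on a compact interval.
import math

def make_interpolant(f, a, b, delta):
    # Knots a, a + delta, ..., with linear interpolation in between.
    m = int(math.ceil((b - a) / delta))
    knots = [a + i * delta for i in range(m + 1)]
    vals = [f(t) for t in knots]
    def f_delta(x):
        i = min(max(int((x - a) / delta), 0), m - 1)
        t = (x - knots[i]) / delta
        return (1 - t) * vals[i] + t * vals[i + 1]
    return f_delta

f = math.cos            # a 1-Lipschitz test function
a, b, delta = 0.0, 3.0, 0.25
f_delta = make_interpolant(f, a, b, delta)

# Evaluate the approximation error on a fine grid.
pts = [a + (b - a) * k / 10000 for k in range(10001)]
err = max(abs(f(x) - f_delta(x)) for x in pts)
print(err)  # well below delta = 0.25
```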
Recall that $\E\int f\,d\mu_n^\theta=\E f(W)$ for $W=\inprod{\theta}{x_I}$, and so by the bound (8), $$\left|\E\int f\,d\mu_n^\theta-\E f(\sigma Z)\right|\le2\,d_{TV}(W,\sigma Z)\le\frac{2(A+2)}{d-1};$$ thus for $\epsilon$ bounded below as above and also satisfying $\epsilon>\frac{2(A+2)}{d-1}$, the statement of Theorem 3 follows. $\Box$

Proof of Corollary 4. The proof is essentially trivial. Note that by independence of the $\theta_j$ and Theorem 3, $$\P[T_\epsilon>m]=\prod_{j=1}^m\P\big[d_{BL}(\mu_n^{\theta_j},N(0,\sigma^2))\le\epsilon\big]\ge\big(1-p(\epsilon)\big)^m,$$ where $p(\epsilon)$ denotes the bound of Theorem 3, since $T_\epsilon>m$ if and only if $d_{BL}(\mu_n^{\theta_j},N(0,\sigma^2))\le\epsilon$ for all $1\le j\le m$. This bound can be used in the identity $\E T_\epsilon=\sum_{m=0}^\infty\P[T_\epsilon>m]$ to obtain $\E T_\epsilon\ge\sum_{m=0}^\infty(1-p(\epsilon))^m=\frac{1}{p(\epsilon)}$, which is the bound in the corollary. $\Box$
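The waiting-time argument above reduces to a geometric series. A quick numerical sanity check, with an arbitrary value of $p$ standing in for the bound of Theorem 3:

```python
# Numerical check (illustrative) of the geometric-series identity behind the
# waiting-time bound: if each independent trial succeeds with probability at
# most p, then E[T] = sum_{m>=0} P[T > m] >= sum_{m>=0} (1 - p)^m = 1/p.
import random

p = 0.01  # arbitrary stand-in for the probability bound of Theorem 3

# Geometric series: sum_{m>=0} (1 - p)^m = 1/p.
partial_sum = sum((1 - p) ** m for m in range(5000))

# Direct simulation of the waiting time, for comparison.
random.seed(5)
def waiting_time():
    t = 1
    while random.random() >= p:
        t += 1
    return t

sim_mean = sum(waiting_time() for _ in range(20000)) / 20000
print(partial_sum, sim_mean)  # both near 1/p = 100
```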
Remark: One of the features of the proofs given above is that they can be generalized to the case of k-dimensional projections of the d-dimensional data vectors {x i }, with k fixed or even growing with d. The proof of the higher-dimensional analog of Theorem 2 goes through essentially the same way. However, the analog of the proof of Theorem 3 from Theorem 2 is rather more involved in the multivariate setting and will be the subject of a future paper.