Large-Sample Properties of Blind Estimation of the Linear Discriminant Using Projection Pursuit

We study the estimation of the linear discriminant with projection pursuit, a method that is blind in the sense that it does not use the class labels in the estimation. Our viewpoint is asymptotic and, as our main contribution, we derive central limit theorems for estimators based on three different projection indices, skewness, kurtosis and their convex combination. The results show that in each case the limiting covariance matrix is proportional to that of linear discriminant analysis (LDA), an unblind estimator of the discriminant. An extensive comparative study between the asymptotic variances reveals that projection pursuit is able to achieve efficiency equal to LDA when the groups are arbitrarily well-separated and their sizes are reasonably balanced. We conclude with a real data example and a simulation study investigating the validity of the obtained asymptotic formulas for finite samples.


Introduction
Classification and clustering are two central themes in modern data analysis and can be seen, respectively, as the "unblind" and "blind" versions of the same problem. In classification, the group memberships, or labels, of the training data points are known and the objective is to use the training data to form a classification rule for future observations; standard methods include, e.g., linear discriminant analysis, support vector machines and random forests, see Friedman et al. (2001). In clustering, no labels for the data points are known, but we postulate that a reasonable grouping exists and aim to find it with, e.g., k-means clustering or spectral clustering, see Friedman et al. (2001), Von Luxburg (2007).
In this paper, we work under a clustering context and the assumption that the data admit a natural grouping but that their labels are indeed unknown to us. In their seminal work, Peña & Prieto (2001) studied in this setting the use of projection pursuit (PP), a general family of methods searching for a projection direction that maximizes the value of the so-called projection index, see, e.g., Huber (1985), Bolton & Krzanowski (2003), Bickel et al. (2018), Fischer et al. (2019) and the references therein. Namely, denoting the within-class covariance matrix by Σ and the two group means by µ_1, µ_2, Peña & Prieto (2001) established that using kurtosis as the projection index in projection pursuit allows the "blind" estimation of the projection direction θ := Σ^{-1}(µ_2 − µ_1) that is used in linear discriminant analysis to construct the optimal Bayes classifier, in the full absence of any label information. In other words, projection pursuit essentially allows conducting LDA in a blind fashion to recover the subspace that optimally separates the two groups. Afterwards, various clustering methods can be applied to the projected data to conduct efficient clustering.
While very interesting, the result of Peña & Prieto (2001) raises a natural question regarding the efficiency of the procedure. Namely, how much does one lose by not knowing the labels and relying on projection pursuit compared to using LDA to recover the same direction θ when the group memberships are known? This is the main question we study in the current paper, working, for simplicity, under the assumption of two-group normal mixtures. Our approach is asymptotic in nature and we perform the comparison through the limiting covariance matrices of the estimators in question. In particular, we show that the limiting covariance matrices of projection pursuit and LDA are proportional, allowing us to conduct the comparisons simply through the corresponding constants of proportionality.
Asymptotic results for general projection indices have been derived earlier in the context of independent component analysis (ICA), see, e.g., Ollila (2009), Dermoune & Wei (2013), Miettinen et al. (2015), Virta et al. (2016). In ICA, one assumes that the observed random p-vector x follows an independent component (IC) model, i.e., that there exists a full rank p × p matrix Γ such that Γ{x − E(x)} has independent components. The IC model is a rather wide family of distributions and, in particular, contains our model of choice, the multivariate normal mixture (this, apparently novel, result is given as Lemma B.1 in the supplementary material). This connection implies that the results of the current paper are intimately related to Virta et al. (2016), who considered (in the context of ICA) the same projection indices as we do here. However, we remark that our contributions surpass those of Virta et al. (2016) in two critical regards: 1) Virta et al. (2016) derived only the asymptotic variances of the ICA parameters (ignoring their covariances), whereas we give the full limiting distribution of the estimated projection direction. Besides completing the asymptotic story, knowledge of the full distribution is crucial with respect to the comparison of PP and LDA as it reveals that the limiting covariance matrix of PP is exactly proportional to the limiting covariance of LDA, see Theorems 1-4 later on. 2) From a technical viewpoint, the derivation of the convergence rates of the estimators was left implicit in Virta et al. (2016) and our proofs provide a rigorous treatment of this. In particular, to guarantee well-defined Taylor expansions of the objective functions, we need to establish the almost sure convergence of the projection pursuit estimates and, as far as we are aware, such results have not been given previously either in the ICA or the PP literature.
While kurtosis is the most popular choice for the projection index in projection pursuit, several alternatives are also commonly used. In particular, skewness is a somewhat standard choice, see, for example, Loperfido (2018), and was shown in Loperfido (2013) to share with kurtosis the property of being able to find the optimal projection direction without the label information. As such, we also study skewness-based projection pursuit in the current work. However, as shown by Peña & Prieto (2001), Loperfido (2013), for both kurtosis and skewness there exist particular values of the mixing proportion under which the two indices are unable to recover the optimal projection direction (for example, skewness fails to produce a consistent estimate of θ when the two groups have equal proportions). These drawbacks can be mitigated by combining both cumulants into a single projection index, in the form of a convex combination. Our results then show that, with a proper choice of weighting, a rather efficient blind competitor for the LDA-based unblind estimator can be obtained. Indeed, in the extreme case where the groups are arbitrarily well-separated and their sizes are reasonably balanced, projection pursuit is able to achieve efficiency equal to LDA. As remarked in the previous paragraph, the asymptotic properties of the hybrid index have also been studied earlier, in the context of independent component analysis, in Virta et al. (2016).
We note that despite the theoretical guarantees of projection pursuit, the most common blind method for revealing clusters is still arguably PCA, see, e.g., Jolliffe (2002). However, it is also well known that PCA does not, in general, yield a consistent estimator of the linear discriminant direction. A standard example demonstrating this is the extreme case where the within-group covariance matrix Σ is heavily concentrated on a direction orthogonal to the difference of the group means µ 2 − µ 1 . In such a case, the projections of the two group means onto the first principal component direction overlap, making clustering based on the direction impossible. Hence, due to its unreliability in estimating the linear discriminant, PCA cannot really be seen as a blind estimator of the separating direction and, as such, we do not include it in the comparisons in the current paper. However, we have still included, for completeness, equivalent asymptotic results for PCA as we state for the other methods, and these are given in Appendix A.
The rest of the manuscript is organized as follows. In Section 2 we derive the asymptotic behavior of three estimators of the linear discriminant direction: LDA and kurtosis- and skewness-based projection pursuit. A short comparison of the results is also presented. In Section 3, we give the corresponding results for projection pursuit based on a weighted combination of skewness and kurtosis and conduct a more extensive set of asymptotic comparisons between all considered methods. Simulation studies exploring both the finite-sample performance of the methods and the applicability of our asymptotic results in practice are given in Section 4, while Section 5 applies the presented methods to a real data example and compares them to PCA. Finally, we conclude with some discussion in Section 6. All proofs of the technical results are postponed to Appendix B.

Estimation of the linear discriminant
Let (Ω, F, P) be a probability space. Throughout the following, we assume that the (p + 1)-dimensional pair (x, y) obeys the following model: for 0 < α_1 < 1, µ_1, µ_2 ∈ R^p with µ_1 ≠ µ_2, and a full rank Σ ∈ R^{p×p},

P(y = j) = α_j and x | y = j ∼ N_p(µ_j, Σ), j = 1, 2, (1)

where α_2 := 1 − α_1. The marginal distribution of x is then the multivariate normal mixture α_1 N_p(µ_1, Σ) + α_2 N_p(µ_2, Σ). Under model (1), the classification of x is usually based on its projection onto the linear discriminant direction θ = Σ^{-1}(µ_2 − µ_1). This projection direction is optimal in the sense that the optimal Bayes classifier (having the minimal misclassification rate out of all classifiers) depends on the data only through the projection θᵀx, see, e.g., Mardia et al. (1995).
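As a concrete illustration, the two-group mixture model and the discriminant direction θ can be sketched as follows; this is a minimal NumPy simulation, with all parameter values chosen purely for illustration (they are not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(n, alpha1, mu1, mu2, Sigma, rng):
    """Draw (x_i, y_i) from model (1): P(y = j) = alpha_j and x | y = j ~ N_p(mu_j, Sigma)."""
    y = rng.binomial(1, 1.0 - alpha1, size=n) + 1          # labels in {1, 2}
    means = np.where((y == 1)[:, None], mu1, mu2)          # group mean of each observation
    noise = rng.multivariate_normal(np.zeros(len(mu1)), Sigma, size=n)
    return means + noise, y

# Illustrative parameters
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

theta = np.linalg.solve(Sigma, mu2 - mu1)   # discriminant direction Sigma^{-1}(mu2 - mu1)
theta_unit = theta / np.linalg.norm(theta)  # standardized target theta/||theta||
```

The unit vector `theta_unit` is the estimand shared by all estimators discussed below.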
Our objective throughout the paper is the estimation of the standardized projection direction θ/‖θ‖ (the scale of the projection direction is irrelevant, meaning that the unit length constraint is without loss of generality). As described in Section 1, we will consider two types of estimators: blind ones, which use only the random vector x (a sample from its distribution) in the estimation, and an unblind one, which bases the estimation on the full pair (x, y). The unblind method is allowed more information in the estimation and is, naturally, expected to provide a more efficient estimator, a fact that is verified by our comparisons later on.

Unblind estimation of the linear discriminant
If we have a sample (x_1, y_1), ..., (x_n, y_n) from the distribution of the full pair (x, y) available, the standard estimator of Σ^{-1}(µ_2 − µ_1) is the plug-in estimator (which is also its MLE, up to the scaling of the pooled covariance matrix) used in standard LDA. That is, denoting by x̄_{n1}, x̄_{n2} the group-wise sample means and by S_n the pooled within-group sample covariance matrix, we consider the estimator w_n := S_n^{-1}(x̄_{n2} − x̄_{n1}). Asymptotic results for LDA are very standard in the literature, see for example Anderson (2003). However, these results are usually given in the case of fixed group sizes, whereas in our model the group sizes are determined by the indicator variables y_1, ..., y_n and are, as such, random. Hence, as far as we know, the following theorem is, if not particularly groundbreaking in its conclusions, a novel one.

Theorem 1. Under model (1), we have, as n → ∞,

√n(w_n/‖w_n‖ − θ/‖θ‖) ⇝ N_p(0, Ψ_LDA),

where the limiting covariance matrix Ψ_LDA is given in terms of the projection matrix onto the orthogonal complement of θ/‖θ‖ and the scalar factor discussed below.

The form of the limiting covariance matrix in Theorem 1 is rather simple and inspection of the proof of the result reveals that the involved projection matrices onto the orthogonal complement of the direction θ/‖θ‖ are simply consequences of the standardization of the estimator to unit length. Note also that the scalar factor in front can be written as 1/(‖θ‖²β) + (θ/‖θ‖)ᵀΣ(θ/‖θ‖), where β := α_1α_2, the two summands of which have the following rough interpretations: if the groups are imbalanced, β is small, making the first summand large and inflating the asymptotic variance. Similarly, if the data exhibit a large amount of variation in the direction of the optimal discriminant, i.e., (θ/‖θ‖)ᵀΣ(θ/‖θ‖) is large, the second term increases the magnitude of the asymptotic variance.
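The plug-in LDA estimator above can be sketched in a few lines of NumPy; this is a hedged illustration (the function name and the exact pooling convention are ours), not the paper's implementation. Since only the direction matters, the scaling of S_n is irrelevant.

```python
import numpy as np

def lda_direction(x, y):
    """Unit-length plug-in LDA estimate w_n/||w_n||, with w_n = S_n^{-1}(xbar_2 - xbar_1).

    S_n is the pooled within-group sample covariance; its exact scaling does not
    affect the result because the estimate is standardized to unit length anyway.
    """
    x1, x2 = x[y == 1], x[y == 2]
    n1, n2 = len(x1), len(x2)
    S = ((n1 - 1) * np.cov(x1, rowvar=False) +
         (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, x2.mean(axis=0) - x1.mean(axis=0))
    return w / np.linalg.norm(w)
```

On simulated data from model (1), the returned vector aligns with ±θ/‖θ‖ as the sample size grows.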

Blind estimation of the linear discriminant
Kurtosis-based projection pursuit

Let δ_1 := 1/2 − 1/√12, δ_2 := 1/2 + 1/√12 and x̃ := x − E(x). The kurtosis κ : S^{p−1} → R of the projection of x on a given direction u ∈ S^{p−1} is then defined as

κ(u) := E{(uᵀx̃)⁴} / [E{(uᵀx̃)²}]².

The fact that projection pursuit based on kurtosis is Fisher consistent for the linear discriminant under normal mixtures was first shown in (Peña & Prieto 2001, Corollary 2). However, the successful use of their result in practice requires knowing something about the mixing proportion α_1. Namely, if α_1 ∈ (δ_1, δ_2), then the linear discriminant θ/‖θ‖ is found as the minimizer of κ, whereas if α_1 ∈ (0, δ_1) ∪ (δ_2, 1), then θ/‖θ‖ is found as the maximizer of κ. Naturally, as a workaround, one could in practice always search for both the minimizer and the maximizer of κ but, even in this case, it might be non-trivial to recognize the linear discriminant amongst the two. Thus, to obtain a truly blind estimator, we propose instead using the squared excess kurtosis {κ(u) − 3}² as an objective function. Indeed, the next lemma reveals that the squared excess kurtosis yields a Fisher consistent estimate of the linear discriminant, apart from the degenerate cases α_1 ∈ {δ_1, δ_2} where the excess kurtosis vanishes, without the need to choose between minimization and maximization.

Lemma 1. Given model (1), 1) if α_1 ∉ {δ_1, δ_2}, then the function u → {κ(u) − 3}² is uniquely maximized by ±θ/‖θ‖; 2) if α_1 ∈ {δ_1, δ_2}, then {κ(u) − 3}² = 0 for all u ∈ S^{p−1}.

Moving next to study the asymptotic properties of κ, let x_1, ..., x_n be a random sample from the marginal distribution of x in model (1). The sample counterpart of κ is

κ_n(u) := (1/n) Σ_{i=1}^n {uᵀ(x_i − x̄_n)}⁴ / [(1/n) Σ_{i=1}^n {uᵀ(x_i − x̄_n)}²]²,

where x̄_n denotes the sample mean. If n ≥ p, the denominator of the random function κ_n is a.s. positive, making κ_n well-defined, and an estimator for θ/‖θ‖ is then obtained as any maximizer of u → {κ_n(u) − 3}² (note that if n ≥ p, a maximizer exists almost surely due to the compactness of S^{p−1}). The following theorem shows that any sequence of such maximizers has a limiting normal distribution.
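The sample kurtosis of a projection and the squared excess kurtosis objective can be sketched directly from their definitions; this is a minimal NumPy illustration (function names are ours), not the fixed-point algorithm used later in the paper.

```python
import numpy as np

def kappa_n(u, x):
    """Sample kurtosis of the projection u'x: fourth central sample moment
    divided by the squared second central sample moment."""
    z = x @ u
    z = z - z.mean()
    return np.mean(z**4) / np.mean(z**2)**2

def kurtosis_objective(u, x):
    """Squared excess kurtosis {kappa_n(u) - 3}^2, to be maximized over the unit sphere."""
    return (kappa_n(u, x) - 3.0)**2
```

For Gaussian data every projection has kurtosis close to 3, so the objective is near zero; a non-Gaussian projection pushes it away from zero.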
Note that the need to include the "corrective" signs s_n in Theorem 2 stems from the sign-invariance of the objective function (which also causes the existence of two maximizers in Lemma 1).

Theorem 2. Given model (1), assume that α_1 ∉ {δ_1, δ_2} and let u_n be any sequence of maximizers of u → {κ_n(u) − 3}². Then, there exists a sequence of signs s_n ∈ {−1, 1} such that, as n → ∞, 1) s_n u_n → θ/‖θ‖, almost surely; 2) √n(s_n u_n − θ/‖θ‖) ⇝ N_p(0, C_κ Ψ_LDA), where Ψ_LDA is the limiting covariance matrix of Theorem 1 and the constant C_κ depends only on β and the separation of the groups.
The limiting covariance matrices in Theorems 1 and 2 are proportional, the only difference being the factor C_κ. This makes their comparison in Subsection 2.3 particularly straightforward. However, even without the formal comparisons, it is evident that the kurtosis-based estimator has a clear flaw in that it fails to be consistent for the mixing proportions α_1 ∈ {δ_1, δ_2} (for these values of α_1, we have 1 − 6β = 0 in the denominator of C_κ in Theorem 2). And even though these are only two points in the continuum (0, 1), the continuity of C_κ in α_1 outside of these points implies that the estimator is highly inefficient for values of α_1 near δ_1 or δ_2. Hence, we will next discuss an alternative estimator that is consistent when α_1 ∈ {δ_1, δ_2} (at the price of losing consistency at another point).

Skewness-based projection pursuit
To complement the kurtosis-based projection pursuit, we next consider skewness-based projection pursuit. Note that, despite depending on lower-order moments, this form of PP is less studied in the literature (see the references in Section 1).
The skewness of the projection of x on a given direction u ∈ S^{p−1} is measured by the objective function γ : S^{p−1} → R defined as

γ(u) := E{(uᵀx̃)³} / [E{(uᵀx̃)²}]^{3/2}.

Similarly to kurtosis, also with skewness it is more convenient to work with its squared value. The next lemma presents the Fisher consistency of the corresponding estimator and reveals that the mixing proportion 1/2 plays the role of the proportions δ_1, δ_2 for skewness. The reason for this is intuitively clear as, under the choice α_1 = 1/2, the normal mixture is perfectly symmetric, explaining the vanishing of the skewness. The result appeared originally as Proposition 1 in Loperfido (2013) but we give, for completeness, a proof in Appendix B.

Lemma 2. Given model (1), 1) if α_1 ≠ 1/2, then the function u → γ(u)² is uniquely maximized by ±θ/‖θ‖; 2) if α_1 = 1/2, then γ(u)² = 0 for all u ∈ S^{p−1}.
Finally, we derive in Theorem 3 below the strong consistency and limiting distribution of the corresponding sample estimator, obtained through the maximization of the square of the sample skewness, defined as

γ_n(u) := (1/n) Σ_{i=1}^n {uᵀ(x_i − x̄_n)}³ / [(1/n) Σ_{i=1}^n {uᵀ(x_i − x̄_n)}²]^{3/2}.

Theorem 3. Given model (1), assume that α_1 ≠ 1/2 and let u_n be any sequence of maximizers of u → γ_n(u)². Then, there exists a sequence of signs s_n ∈ {−1, 1} such that, as n → ∞, 1) s_n u_n → θ/‖θ‖, almost surely; 2) √n(s_n u_n − θ/‖θ‖) ⇝ N_p(0, C_γ Ψ_LDA), where Ψ_LDA is the limiting covariance matrix of Theorem 1 and C_γ depends only on β and the separation of the groups.

Interestingly, also the limiting covariance of the skewness-based estimator is proportional to that of LDA, meaning that the main object of interest in the result is the factor C_γ. These factors will be compared in the next section to make statements about the relative efficiencies of the estimators under various scenarios.
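Analogously to the kurtosis index, the sample skewness of a projection can be sketched directly from its definition; again a minimal NumPy illustration with names of our choosing.

```python
import numpy as np

def gamma_n(u, x):
    """Sample skewness of the projection u'x: third central sample moment
    divided by the second central sample moment to the power 3/2."""
    z = x @ u
    z = z - z.mean()
    return np.mean(z**3) / np.mean(z**2)**1.5

def skewness_objective(u, x):
    """Squared sample skewness gamma_n(u)^2, to be maximized over the unit sphere."""
    return gamma_n(u, x)**2
```

A Gaussian projection has skewness near zero, whereas an asymmetric projection (e.g., an exponential one, with population skewness 2) drives the objective away from zero.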

Asymptotic comparison of the three estimators
Theorems 1, 2 and 3 show that the limiting distributions of the unblind and blind estimators of θ/‖θ‖ all have proportional covariance matrices. Thus, their efficiencies may be compared simply through the corresponding constants of proportionality, which depend on the problem parameters only through the mixing proportion (via β = α_1α_2) and the degree of separation between the two groups, as measured by the (squared) Mahalanobis distance τ. The relative asymptotic efficiencies (pair-wise ratios of the constants) of the blind estimators vs. the unblind estimator (LDA) are simply C_κ^{-1} and C_γ^{-1}, where the values of the constants are given in Theorems 2 and 3. Especially the former expression is somewhat complicated for arbitrary β and τ, but both simplify greatly if we consider the case where the Mahalanobis distance τ is large, i.e., the limit τ → ∞. Figure 1 plots the relative efficiencies as a function of the first mixing proportion α_1 in the cases τ = 5, 15 and τ → ∞. The plots verify that, for any practical value of the Mahalanobis distance, LDA is always asymptotically highly superior to both blind methods. However, in the extreme case where the two groups are well-separated to an arbitrarily large degree, we see, in particular, that the kurtosis estimator is asymptotically equally efficient to LDA in the balanced case α_1 = 1/2, and also in the limits α_1 → 0 and α_1 → 1 (although, in these cases, the actual asymptotic covariance matrices themselves grow without bound).
Figure 1 also shows that, depending on the Mahalanobis distance, there is a mixing proportion around α_1 ≈ 0.30 for which κ and γ are asymptotically equally efficient. Figure 2 plots these proportions (and their mirror images on the upper half of the region (0, 1)) as a function of the Mahalanobis distance. The plot reveals that the region of mixing proportions for which κ is an asymptotically superior choice to γ (the gray inner region) gets wider as the groups get more well-separated, finally approaching the region 1/2 ± 1/√24 in the limit τ → ∞ (the two horizontal dashed lines).

Convex combination of skewness and kurtosis

Theoretical properties
Based on Figure 1, the objective functions κ and γ produce, especially for well-separated groups, fairly efficient estimators of the linear discriminant in the absence of any grouping information. However, for this, it is crucial to have at least an approximate idea of the mixing proportion α_1 in order to choose the more efficient of the two objective functions and to avoid the points where a particular estimator becomes completely inefficient (for example, if the groups are close to being balanced, one wants to use kurtosis, as skewness contains no information when α_1 = 1/2, see Lemma 2 and Figure 1). As the mixing proportion is rarely known in practice, this makes the procedure difficult to implement.
A natural way to overcome this weakness is, instead of choosing between γ and κ, to use them both simultaneously, through an objective function η : S^{p−1} → R that is a convex combination of the two squared cumulants,

η(u) = η(u; w_1) := w_1 γ(u)² + w_2 {κ(u) − 3}²,

where w_1, w_2 ≥ 0, w_1 + w_2 = 1. Naturally, the cases w_1 = 0 and w_2 = 0 simply correspond to the two individual objective functions and, hence, we will in the following assume that w_1, w_2 > 0. The next result, following straightforwardly from Lemmas 1 and 2, shows that η indeed allows the completely blind recovery of the linear discriminant regardless of the mixing proportion α_1 ∈ (0, 1).

Lemma 3. Given model (1), η is uniquely maximized by ±θ/‖θ‖.
The sample version of the hybrid objective function is η_n : S^{p−1} → R, defined as η_n(u) = η_n(u; w_1) := w_1 γ_n(u)² + w_2 {κ_n(u) − 3}², and we next give the limiting behavior of its maximizer. Unsurprisingly, the resulting limiting covariance matrix is, up to a multiplicative constant, equal to the previous ones.

Theorem 4. Given model (1) and w_1, w_2 > 0, let u_n be any sequence of maximizers of u → η_n(u). Then, there exists a sequence of signs s_n ∈ {−1, 1} such that, as n → ∞, 1) s_n u_n → θ/‖θ‖, almost surely; 2) √n(s_n u_n − θ/‖θ‖) ⇝ N_p(0, C_η Ψ_LDA), where Ψ_LDA is the limiting covariance matrix of Theorem 1.
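The hybrid objective and its maximization can be sketched as follows. The grid search over the unit circle is a deliberately crude, p = 2 toy stand-in for the fixed-point algorithms of Virta et al. (2016) used in the actual simulations; the function names and default weight are ours.

```python
import numpy as np

def eta_n(u, x, w1=0.8):
    """Hybrid sample objective eta_n(u; w1) = w1*gamma_n(u)^2 + (1 - w1)*{kappa_n(u) - 3}^2."""
    z = x @ u
    z = z - z.mean()
    s2 = np.mean(z**2)
    gamma = np.mean(z**3) / s2**1.5
    kappa = np.mean(z**4) / s2**2
    return w1 * gamma**2 + (1.0 - w1) * (kappa - 3.0)**2

def maximize_on_circle(x, w1=0.8, n_grid=720):
    """Crude grid search over the unit circle (p = 2 only); +-u give the same
    objective value, so half the circle suffices."""
    angles = np.linspace(0.0, np.pi, n_grid, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    vals = np.array([eta_n(u, x, w1) for u in dirs])
    return dirs[int(np.argmax(vals))]
```

On data from a well-separated two-group mixture, the maximizer aligns with ±θ/‖θ‖, in line with Lemma 3.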
The constant of proportionality C_η in Theorem 4 is again rather complicated but simplifies in the limit τ → ∞ to a more manageable, if not intuitive, form.

Asymptotic comparisons
We next investigate how the efficiency of the hybrid estimator compares to its competitors. Figure 3 shows the relative asymptotic efficiency of the hybrid estimator vs. LDA as a function of the mixing proportion α_1 for the same values of τ as in Figure 1 and for w_1 ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}. Note that the extreme cases w_1 = 0 and w_1 = 1 are equivalent to using the individual objective functions u → {κ(u) − 3}² and u → γ(u)², respectively. The curves show somewhat erratic behavior around the points δ_1, δ_2, where the excess kurtosis vanishes, but otherwise convey a clear message: inside the interval (δ_1, δ_2) the hybrid estimator is, in general, a superior choice over the individual estimators, whereas outside of the interval the choice w_1 = 1 (corresponding to using skewness only) is preferable over the hybrid estimator.
To obtain a "universal" value of w_1 that yields (in some sense) on average the most efficient estimator over all α_1, we compute with numerical integration the "average efficiency" A(w_1, τ) of the weight w_1 for a given value of τ as the area between the x-axis and the corresponding efficiency curve. For example, A(0, 15) is the area under the red solid curve in the middle panel of Figure 3. Figure 4 then plots the weights w_1 yielding the maximal value of A(w_1, τ) as a function of the Mahalanobis distance τ and reveals that, regardless of the separation of the groups, one should optimally consider weights only in the range w_1 ∈ (0.725, 0.825). This conclusion is rather predictable as kurtosis is based on a higher moment than skewness, meaning that the latter should be given a larger weight in order to obtain a "balanced" combination. As a further interesting observation, when τ → 0, i.e., when the mixture model approaches the multivariate normal model, the optimal weight seems to approach the value 0.8, which is the exact weighting used in the Jarque-Bera test statistic for testing normality, (n/6)γ_n² + (n/24)(κ_n − 3)² (Jarque & Bera 1980). Moreover, the same weighting was also recommended by Jones & Sibson (1987) as an approximation to an entropy-based index.

Figure 3: The relative asymptotic efficiency of the hybrid estimator vs. LDA under different Mahalanobis distances τ between the two group means. The colour and type of the lines denote the weighting parameter w_1. The solid black line titled "optimal" traces the efficiencies obtained using the optimal choice of weighting for a given pair (α_1, τ).

Figure 5: The heatmap shows, as a function of (α_1, τ), the weighting w_1 yielding the highest relative asymptotic efficiency compared to LDA. The three discontinuities correspond to the lines α_1 = δ_1, α_1 = 1/2 and α_1 = δ_2.
Whereas Figure 4 aims to obtain a single universally useful value of w_1, in the optimal situation one would always use the particular weighting yielding the highest relative asymptotic efficiency for a given combination of mixing proportion and Mahalanobis distance. The optimal weights are plotted as a function of (α_1, τ) in the heatmap of Figure 5, the most striking features of which are the discontinuities at the horizontal lines α_1 = δ_1, α_1 = 1/2 and α_1 = δ_2. These are caused by the fact that the coefficient C_η in Theorem 4 becomes a constant function of w_1 at each of these values of α_1 (where either the skewness or the excess kurtosis vanishes). Consequently, there is no unique maximizer w_1 at these points and, to emphasize their nature, we have chosen to color them in Figure 5 with the corresponding extreme color (e.g., black for α_1 = 1/2, where skewness carries no information). However, more puzzling are the differing limits when approaching the points δ_1, δ_2 from below and above. Essentially, when one approaches either of these points from inside the interval, the highest efficiency is obtained by focusing all weight on kurtosis, which seems very counter-intuitive as, in the limit, kurtosis carries no information at all. This behaviour is visualized in still more detail in Figure 6, which plots the relative asymptotic efficiency C_η^{-1} as a function of w_1 for τ = 5 and α_1 − δ_1 =: ε ∈ {0.001, 0.002, 0.005, 0.010}. Clearly, the weight achieving the maximal efficiency indeed approaches zero as ε → 0. Algebraically, it is easy to see what is happening: for τ → ∞ and β ≈ 1/6, the approximation of C_η obtained by ignoring the terms of order (1 − 6β)² and higher is monotone in w_1, with the direction of monotonicity determined by the sign of 1 − 6β, which flips when crossing δ_1 or δ_2. Also, as β → 1/6, no matter from which side, C_η converges to a constant in w_1, implying that there is no discontinuity in the efficiency value itself.
No such behaviour is observed for α_1 → 1/2 and the reason for this is that 1 − 4β ≥ 0, implying that no sign change occurs when passing the critical value α_1 = 1/2.

Figure 6: Relative asymptotic efficiencies of the hybrid estimator as a function of w_1 when τ = 5 and α_1 = δ_1 + ε, where ε ∈ {0.001, 0.002, 0.005, 0.010}. The value of the weight w_1 achieving the maximal efficiency can be seen to approach zero when ε → 0.
The efficiencies achieved by the optimal weighting are shown by the solid black line in Figure 3 and indicate that the hybrid estimator is able to reach satisfying levels of efficiency particularly when the mixing proportion lies in the interval (δ_1, δ_2). Indeed, in the limit τ → ∞, within the interval there always exists a weighting that reaches efficiency equal to LDA, as evidenced by the right-most panel of Figure 3. On the other hand, outside of the interval (δ_1, δ_2), LDA is still, even in the limit τ → ∞, a superior choice. We conjecture that the reason for this critical difference in behaviour inside and outside of the interval is that, when the value of α_1 is extreme, one of the groups is small, making the pin-pointing of the optimal direction difficult in general, but even more so for the blind methods which have no class information available. However, it is not clear why the particular points δ_1, δ_2 serve as the cut-off values for this behavior.
Finally, note that the discontinuities make the use of the optimal choice of weighting somewhat difficult in practice: if one's prior information or guess on the value of the mixing proportion α_1 is even slightly off, relying on the seemingly optimal choice can, in the worst case, lead to a relative efficiency close to zero. Moreover, recall that the previous experiments were asymptotic in nature and do not necessarily reflect the behaviour of the method under sample sizes encountered in practical situations. Hence, we suggest using a "safe" universal value of w_1, preferably falling in the interval (0.725, 0.825) identified in conjunction with Figure 4. For example, Figure 3 shows that the value w_1 = 0.80 delivers, for finite τ, performance not far behind the optimal choice for any α_1. However, if one is reasonably certain about the value of α_1 (which, optimally, is far away from δ_1 and δ_2) and has n sufficiently large, resorting to the optimal choice is, of course, also possible.

Simulations
The three projection pursuit estimators considered here have been discussed in the context of ICA in detail in Virta et al. (2016), where fixed-point algorithms for their computation are also described. For our purposes here we can use their deflation-based algorithms, as only one direction is to be extracted. Projection pursuit is notoriously prone to local optima and it is therefore known that good initial values for such algorithms are crucial. Virta et al. (2016) suggest using initial values based on a simple ICA method called FOBI (Cardoso 1989). This is also suitable in our context as the normal mixture is a sub-model of the ICA model, see Lemma B.1 in the supplementary material.
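The FOBI initialization can be sketched with NumPy as follows; this is our own minimal re-implementation for illustration, not the ICtest code used in the simulations.

```python
import numpy as np

def fobi_directions(x):
    """FOBI (Cardoso 1989), sketched in NumPy: whiten the data and eigendecompose
    the matrix E{||z||^2 z z'} of the whitened observations z. The rows of the
    returned matrix are candidate projection directions, usable as initial values
    for the PP fixed-point algorithms."""
    xc = x - x.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(xc, rowvar=False))
    S_inv_half = vecs @ np.diag(vals**-0.5) @ vecs.T      # symmetric inverse sqrt of cov
    z = xc @ S_inv_half                                   # whitened data
    B = (z * (z**2).sum(axis=1, keepdims=True)).T @ z / len(z)   # E{||z||^2 z z'}
    _, V = np.linalg.eigh(B)
    W = V.T @ S_inv_half                                  # rows: candidate directions
    return W / np.linalg.norm(W, axis=1, keepdims=True)
```

For a well-separated, clearly imbalanced mixture (whose discriminant projection has markedly non-Gaussian kurtosis), one of the candidate rows aligns with ±θ/‖θ‖, which is what makes FOBI a reasonable initializer here.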
The algorithms of Virta et al. (2016) are implemented in the R package ICtest (Nordhausen et al. 2021), which we will use in the following together with R 3.6.1 (R Core Team 2020). Further details about the software used are given in the appendix.
Let u_n be any of the estimators of θ/‖θ‖ discussed in Section 2. The accuracy of the estimator can in simulations be measured through the inner product u_nᵀθ/‖θ‖, which, by the Cauchy-Schwarz inequality, achieves the absolute value one if and only if the two vectors are parallel. In what follows, we call this inner product the "maximal similarity index" (MSI). The following lemma presents the limiting distribution of this performance measure.

Lemma 4. Let the unit length vector u_n satisfy √n(s_n u_n − θ/‖θ‖) ⇝ N_p(0, Ψ) for some sequence of signs s_n ∈ {−1, 1} and limiting covariance matrix Ψ. Then, as n → ∞,

2n(1 − s_n u_nᵀθ/‖θ‖) ⇝ zᵀΨz, (2)

where the random vector z obeys the p-variate standard normal distribution. Moreover, the expected value of the right-hand side of (2) is tr(Ψ).
Note that the sign correction in Lemma 4 can be incorporated in practice by choosing the sign s_n such that the quantity s_n u_nᵀθ/‖θ‖ is positive. In the simulations we will evaluate the performances of the methods through the left-hand side of (2). By Lemma 4, the average of this criterion over several replicates should be close to the trace of the limiting covariance matrix of the corresponding estimator, for a sample size n large enough. Hence, the simulations also serve to "verify" our asymptotic results.
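The MSI and the Lemma 4 criterion can be sketched in a few lines; the function names are ours, and the sign correction is implemented simply by taking the absolute value of the inner product.

```python
import numpy as np

def msi(u_hat, theta):
    """Maximal similarity index with sign correction: the sign s_n is chosen so
    that s_n * u' theta/||theta|| >= 0, i.e., we take the absolute inner product."""
    t = theta / np.linalg.norm(theta)
    return abs(float(u_hat @ t))

def lemma4_statistic(u_hat, theta, n):
    """Left-hand side of (2), 2n(1 - s_n u' theta/||theta||); by Lemma 4 its
    average over replicates should approach tr(Psi) for large n."""
    return 2.0 * n * (1.0 - msi(u_hat, theta))
```

A perfect estimate gives MSI 1 and statistic 0; any angular error inflates the statistic proportionally to n.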
In the following simulations, four projection pursuit (PP) directions were calculated: kurtosis-based (obtained by maximization of (κ_n − 3)²), skewness-based (obtained by maximization of γ_n²), the "safe" hybrid estimator (obtained by maximization of η_n for w_1 = 0.8) and the "optimal" hybrid estimator (obtained by maximization of η_n for the weight w_1 = w_1(α_1, τ) which maximizes the relative asymptotic efficiency of the hybrid estimator w.r.t. LDA). The sign s_n is chosen such that s_n u_nᵀθ/‖θ‖ ≥ 0. As discussed, the performances of the four PP directions u_n are evaluated using the maximal similarity index s_n u_nᵀθ/‖θ‖ from above.
The heatmaps of the MSI-values in Figure 7 show that for moderate sample sizes (n ≥ 2000), both hybrid estimators estimate the optimal LDA direction very well. It is also visible that the kurtosis- and skewness-based PP directions perform very badly when α_1 is near their corresponding discontinuity points. The hybrid estimators suffer from the same problem when α_1 is near 1/2 − 1/√12, and the hybrid estimator with the optimal w_1 performs worse when α_1 approaches 1/2 − 1/√12 from inside the interval (1/2 − 1/√12, 1/2 + 1/√12). It is important to recall that the criterion for choosing the optimal weight w_1 is an asymptotic one and thus might not perform well in small-sample settings. Furthermore, in order to calculate the optimal weight w_1 for the hybrid estimator, one needs to know both the Mahalanobis distance τ between the group means and the mixing proportion α_1, which is a rather unrealistic requirement in practice. Luckily, the hybrid estimator with the "safe" weight w_1 = 0.8 shows a very good performance in this simulation study, and is therefore recommended in cases where knowledge of α_1 and τ is lacking. Another observation based on this simulation is that for small sample sizes skewness-based PP seems to be preferable. This might be due to the fact that moments of order three are easier to estimate than moments of order four.
The corresponding heatmap of the standard deviation of the MSI can be found in the Appendix, Figure 11, and shows that for moderately large sample sizes and distances between the group means, the deviation of the MSI from its mean, which is very close to the optimal value of 1, is negligible for most values of the mixing proportion α_1. The heatmaps of the mean and standard deviation of the MSI in Figures 12 and 13 show that for large sample sizes (n = 8000, 16000, 32000) the MSI is virtually 1.
In the next simulation we confirm the theoretical results by exploiting Lemma 4. For that purpose, we select three values τ ∈ {1, 5, 10}, representing hardly, moderately and clearly separated clusters, respectively. We then simulate from a three-variate Gaussian mixture model as specified above with Σ = I_3 + 1_3, where 1_3 is the 3 × 3 matrix of ones, µ_1 = 0 and µ_2 = µ_2(τ) chosen for each τ such that the Mahalanobis distance between the means equals τ, i.e., µ_2(1) = (0.68, −0.55, 0.6)', µ_2(5) = (0.81, −2.24, −0.36)', µ_2(10) = (3.06, 1.6, −1.11)'. For α_1 = 0.1, 0.2, . . . , 0.5 and sample sizes n = 500, 1000, 2000, 4000, 8000, 16000, 32000 we then compute the means of 2n(1 − s_n u_n'θ/||θ||) based on 2000 repetitions. The question is whether these averages stabilize for the presented methods in the cases where they are expected to work. In order not to clutter the figure, we computed the trace tr(Ψ) of the corresponding matrix Ψ only for α_1 = 0.1. Figure 8 shows the results of this simulation and confirms the corresponding theoretical findings from above. The less separated the clusters are, the more difficult the estimation; for τ = 1 no stabilization is visible even for n = 32000 observations, but the more separated the two clusters are, the faster the stabilization. Also, the closer α_1 is to the critical value 1/2 − 1/√12, the worse kurtosis-based PP performs. Skewness-based PP similarly works better the more skewed the distribution is, and clearly does not work in the symmetric case. Both hybrid estimators show an excellent performance in this setting. It is also clearly visible that the empirical lines for the mixing proportion α_1 = 0.1 approach the theoretically computed dashed lines provided the groups are separated enough; we presume that for τ = 1 the line would be reached only for much larger sample sizes.
Though we show the theoretical lines only for one mixing proportion, the behaviour is similar for all others, naturally with the exception of skewness-based PP not working in the symmetric case.
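As a sanity check on the setup above, the mixture construction can be sketched in a few lines (a Python/NumPy sketch, not the R code actually used; the helper `sample_mixture` is our naming), here for τ = 5:

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.eye(3) + np.ones((3, 3))    # within-group covariance I_3 + 1_3
mu1 = np.zeros(3)
mu2 = np.array([0.81, -2.24, -0.36])   # mu_2(5) from the text
h = mu2 - mu1

# Mahalanobis distance between the group means: tau = h' Sigma^{-1} h
tau = h @ np.linalg.solve(Sigma, h)    # approximately 5 by construction

def sample_mixture(n, alpha1):
    """Draw n points from alpha1 N(mu1, Sigma) + (1 - alpha1) N(mu2, Sigma)."""
    in_group2 = rng.random(n) >= alpha1
    means = np.where(in_group2[:, None], mu2, mu1)
    return means + rng.multivariate_normal(np.zeros(3), Sigma, size=n)

x = sample_mixture(2000, 0.1)          # one replication with alpha_1 = 0.1
```

Evaluating tau for the other two mean vectors confirms that they reproduce τ = 1 and τ = 10 up to rounding of the printed coordinates.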
Principal component analysis (PCA) can also be seen as a projection method, one where the variance is maximized. PCA is arguably the most popular dimension reduction method and is often used before clustering. While skewness and kurtosis can be related to mixtures, the variance does not have the same connection to the discriminant direction as the other cumulants. A theoretical consideration of when PCA can be used to estimate the discriminant is given in Appendix A. Here we show an example where PCA fails.
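A minimal synthetic illustration of such a failure (our own sketch in Python/NumPy, complementing the real data example that follows): the groups separate along the first coordinate, but a high-variance nuisance coordinate dominates the spectrum, so the first PC is nearly orthogonal to the discriminant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Two balanced groups with means -2 and +2 along e_1 and unit within-group
# variance there; a nuisance direction e_2 carries much larger variance.
group2 = rng.random(n) < 0.5
x = np.column_stack([
    np.where(group2, 2.0, -2.0) + rng.normal(0.0, 1.0, n),  # discriminant dir.
    rng.normal(0.0, 10.0, n),                               # dominant noise dir.
])

C = np.cov(x, rowvar=False)
vals, vecs = np.linalg.eigh(C)
pc1 = vecs[:, -1]   # leading eigenvector aligns with e_2, not the discriminant
```

Here the total variance along e_1 is about 5 (between-group plus within-group) versus 100 along e_2, so the leading eigenvector ignores the group structure entirely.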

Real data example
To compare the hybrid estimator to PCA on a real data set, we consider the finance data set available in the R package Rmixmod (Langrognet et al. 2020), which consists of 889 records of companies for which, based on four numeric summary statistics, it should be decided whether the company is financially healthy or not; the true labels are provided in the data set. The scatter plot matrix is given in the Appendix as Figure 14 and shows no clear clusters. As a reference we compute LDA for the data set and then compare this supervised estimate, via the MSI estimate s_n u_n'θ_n/||θ_n||, to our hybrid estimator for different weights and to PCA. Figure 10 shows the obtained MSI values for the discussed estimators. The figure clearly shows that, as long as enough weight is given to kurtosis, hybrid-estimator-based PP clearly outperforms PCA, while the performance is poor if skewness gets too much weight. This is not surprising, as the numbers of healthy (457) and bankrupt (432) companies are almost equal. The weight of 0.8 again gives a good performance. Nevertheless, even though for most values of w_1, and especially for the suggested w_1 ∈ [0.7, 0.8], the hybrid estimators clearly outperform PCA, the achieved MSI values of around 0.5 are not that good. This performance can be explained by the low sample size and the fact that the cluster centers are not very far apart, as visualized, for example, in Figure 15 in the Appendix.
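For reference, the supervised LDA direction θ_n = S_n^{−1}(x̄_2 − x̄_1) used as the benchmark above can be sketched as follows (a Python analogue of the R computation; the function name `lda_direction` is ours):

```python
import numpy as np

def lda_direction(x, labels):
    """Supervised estimate proportional to S_n^{-1}(xbar_2 - xbar_1), where
    S_n is the pooled within-group covariance matrix. Returned unit-length;
    the direction is only identified up to sign."""
    x1, x2 = x[labels == 0], x[labels == 1]
    n1, n2 = len(x1), len(x2)
    S = ((n1 - 1) * np.cov(x1, rowvar=False)
         + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    theta = np.linalg.solve(S, x2.mean(axis=0) - x1.mean(axis=0))
    return theta / np.linalg.norm(theta)
```

In the real data comparison this supervised direction plays the role of θ_n in the MSI, against which the blind PP and PCA directions are measured.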

Discussion
In this paper, we conducted an asymptotic comparison of two popular estimators of the linear discriminant direction: LDA and projection pursuit based on skewness and kurtosis. For the latter, we proposed using a weighted combination of kurtosis and skewness as the projection index (giving the individual cumulants as special cases). Both the theoretical results and the simulations indicate that, with a suitable choice of weighting, such projection pursuit achieves reasonably good performance compared to LDA (e.g., around 15% relative asymptotic efficiency when the Mahalanobis distance between the groups is 5; see Figure 3), considering that it operates in the complete absence of any label information. Moreover, in the extreme case of balanced and infinitely well-separated groups, projection pursuit is able to reach asymptotic efficiency equal to that of LDA with an optimal choice of weighting.
The use of our optimal weighting results is made difficult in practice by the discontinuities around the mixing proportions δ_1, δ_2 observed in Section 3, see Figure 5. As such, unless one is absolutely sure that the mixing proportion does not lie in these regions, our recommendation is to use a universal choice of weighting, anything between 0.7 and 0.8 (as the weight for skewness) likely being a good choice.
At first we suspected that the discontinuities, and the surprising recommendation to favor kurtosis just outside the interval (δ_1, δ_2), might be caused by the uneven robustness properties of skewness and kurtosis in the objective function. Namely, being based on fourth moments, kurtosis is more affected by outliers than skewness (despite the standardization with second moments). Hence, we also considered using the "balanced" objective function η*, in an attempt to put skewness and kurtosis on an equal footing. However, the asymptotic properties of η* (not shown here) turn out to be essentially the same as for η, including the discontinuities, which are observed for it as well. Note also that similar discontinuous behavior was observed in Virta et al. (2016), where the normal mixture model was studied using independent component analysis.
Similarly, one could extend our considerations to many other PP indices, which are often modifications of skewness or kurtosis (see, e.g., Hou & Wentzell (2014)) or are otherwise motivated as useful in clustering or structure detection; see, for example, Cook et al. (1993), Fischer et al. (2019) and the references therein for alternative indices. These indices are, however, often computationally expensive and therefore much less popular than skewness and kurtosis.
Finally, besides projection pursuit, there also exist other blind estimators of the linear discriminant. For example, it is known that the linear discriminant can be reconstructed using invariant coordinate selection (ICS) (Tyler et al. 2009), where two scatter matrices are jointly diagonalized. Using the regular covariance matrix and the scatter matrix of fourth moments in this context, as suggested, for example, in Alashwali & Kent (2016) and Peña et al. (2010), would allow a theoretical comparison; this combination actually corresponds to FOBI, mentioned in Section 4. Another prospective line of work is the extension of our asymptotic results to mixtures of elliptical distributions, as Peña & Prieto (2001) indeed showed that projection pursuit yields a Fisher consistent estimator of the linear discriminant also in the case of general elliptical families. Similarly, another possible extension is to the case of multiple groups instead of just two, or to groups with unequal covariance matrices.

Figure 10: MSI s_n u_n'θ_n/||θ_n|| for the finance data, where u_n is the direction based on PCA and on the PP estimators for w_1 = 0, 0.1, 0.2, . . . , 0.9, 1, and θ_n is the direction based on LDA.

A Equivalent results for PCA
While PCA manages to capture the linear discriminant direction only under very specific conditions, and thus cannot reasonably be seen as a "blind" estimator of it, for completeness we still give for it in the following results equivalent to those in Sections 2 and 3. The first result, detailing the conditions required for the Fisher consistency of PCA, is qualitatively well-known in the literature (see, e.g., Section 9.1 in Jolliffe (2002)) but, as far as we know, the exact eigenvalue bound is novel.

Lemma A.1. Given model (1), the following two are equivalent: i) The vector h := µ_2 − µ_1 is an eigenvector of Σ and, denoting the corresponding eigenvalue by φ, the second-to-largest eigenvalue of Cov(x) is smaller than φ(1 + βτ), where β and τ are as in Theorem 2.
ii) The unique leading unit-length eigenvectors of Cov(x) are ±θ/||θ||.
Lemma A.1 states that for the first PC to recover the discriminant direction, it is necessary that the difference µ_2 − µ_1 between the group means be an eigenvector of Σ. However, it need not be the leading eigenvector; instead, roughly, the more well-separated the groups are (large Mahalanobis distance τ) and the more balanced they are (large β), the smaller the corresponding eigenvalue can be relative to the rest of the spectrum. Note also that in the spherical case, Σ ∝ I_p, the first part of condition i) in Lemma A.1 is trivially satisfied.
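A quick numerical check of Lemma A.1 in the spherical case (our own sketch, using Cov(x) = Σ + βhh' for the mixture covariance and arbitrary illustrative values):

```python
import numpy as np

p, sigma2, alpha1 = 4, 1.0, 0.3
beta = alpha1 * (1 - alpha1)            # beta = alpha_1 * alpha_2
h = np.array([2.0, -1.0, 0.5, 1.5])     # h = mu_2 - mu_1
Sigma = sigma2 * np.eye(p)              # spherical: h trivially an eigenvector
cov_x = Sigma + beta * np.outer(h, h)   # covariance of the two-group mixture

tau = h @ np.linalg.solve(Sigma, h)     # Mahalanobis distance h' Sigma^{-1} h
phi = sigma2                            # eigenvalue of Sigma belonging to h

vals, vecs = np.linalg.eigh(cov_x)
lead = vecs[:, -1]                      # leading unit eigenvector of Cov(x)
theta = np.linalg.solve(Sigma, h)       # discriminant, here proportional to h

alignment = abs(lead @ theta) / np.linalg.norm(theta)  # 1 up to rounding
leading_eigenvalue = vals[-1]           # equals phi * (1 + beta * tau)
```

The check confirms both claims at once: the leading PC is proportional to θ, and its eigenvalue is exactly φ(1 + βτ), the threshold appearing in the lemma.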
Asymptotic results for PCA are also well-known, see, e.g., Anderson (1963), Davis (1977), and the following theorem details the strong consistency and the limiting normality of the first PC in our particular scenario. For completeness, we provide a proof.
Theorem A.1. Given model (1), assume that condition i) (or, equivalently, ii)) holds and let u_n be any sequence of leading unit-length eigenvectors of the sample covariance matrix C_n of x_1, . . . , x_n. Then there exists a sequence of signs s_n ∈ {−1, 1} such that, as n → ∞, 1) s_n u_n → θ/||θ||, almost surely.
It is evident from part 2) of Theorem A.1 that the limiting covariance matrix of the PCA-based estimator is not proportional to the four others derived in Theorems 1, 2, 3 and 4. However, proportionality is reached in the special case where the group covariance matrix is spherical, Σ = σ²I_p for some σ² > 0. In this case h/||h|| = θ/||θ||, the Moore-Penrose pseudoinverses in Theorem A.1 equal {Σ + βhh' − φ(1 + βτ)I_p}† = −1/(β||h||²)(I_p − hh'/||h||²) and the limiting covariance matrix Ψ_PCA simplifies accordingly. Comparison with Theorem 1 now reveals that the relative asymptotic efficiency of PCA w.r.t. LDA equals τβ, showing, in particular, that in the balanced case α_1 = α_2, PCA surpasses LDA in asymptotic efficiency as soon as the Mahalanobis distance between the groups exceeds 4. Moreover, in the limit τ → ∞, PCA is infinitely more efficient than LDA regardless of the mixing proportion. This counterintuitive result is, of course, not something one should rely on in practice, as the conditions required to achieve this situation are very restrictive.
Proof of Lemma B.1. Write h := µ_2 − µ_1 and let Γ := UΣ^{−1/2}, where U is an orthogonal matrix satisfying UΣ^{−1/2}h ∝ e_1 and e_1 is the first canonical basis vector of R^p (such a U always exists as Σ has full rank and h ≠ 0). Now, ΓΣΓ' = I_p and Γh = b e_1 for some b ≠ 0. The result now follows by writing out the density function of Γ{x − E(x)} and observing that it factors into the product of the density of a univariate Gaussian mixture and the densities of p − 1 univariate Gaussians with zero means.
Proof of Theorem 1. The estimator w_n is translation invariant, meaning that we may, without loss of generality, assume that E(x) = 0; under this, the model (1) takes a simplified form. We begin by deriving asymptotic linearizations for x̄_{n2} − x̄_{n1}. Let in the following β := α_1α_2. By the LLN, ȳ_n →_p α_1 and (1/n)Σ_i y_i x_i →_p −βh. Hence, the relation ȳ_n x̄_{n1} = (1/n)Σ_i y_i x_i shows that x̄_{n1} →_p −α_2 h. We further have an expansion which, by the CLT, shows that x̄_{n1} is asymptotically normal; one can similarly show the same for x̄_{n2}. Defining a_{n1} := √n{(1/n)Σ_i y_i x_i + βh}, a_{n2} := √n{(1/n)Σ_i (1 − y_i)x_i − βh} and a_{n3} := √n(ȳ_n − α_1), the previous two expansions can be rewritten in terms of these quantities, and in combination they yield the desired linearization. We then derive a similar expansion for the pooled covariance matrix S_n. The expansion for the first group, together with the equivalent formula for the second group, yields an expression whose second and third terms can be expanded further.

Denoting then
The above in particular shows that S_n is asymptotically normal. We are now equipped to derive the limiting distribution of the optimal direction w_n = S_n^{−1}(x̄_{n2} − x̄_{n1}). Recalling that θ = Σ^{−1}h, we have, by the calculus of o_p(1) and O_p(1) sequences and by the definitions of a_{n1}, a_{n2}, a_{n3} and A_{n4} together with the CLT, the limiting covariance matrix of βΣ√n(w_n − θ), where ∆ := βλ + 1 and λ := θ'h. The covariance matrix is a sum of a total of 36 terms, which we next compute one by one. We use the notation γ := α_1³ + α_2³.
Summing the previous terms, we obtain β(1 + βλ)Σ + β²hh', from which the limiting covariance of √n(w_n − θ) follows. Finally, the Jacobian of the map θ → θ/||θ|| is (||θ||²I_p − θθ')/||θ||³ and the delta method then yields the limiting covariance matrix of the scaled direction w_n/||w_n||. Before proving results regarding the blind estimators, we establish two auxiliary lemmas.

Lemma B.2. Let A = Σ_{j=2}^p λ_j w_j w_j' + w_1 w_1'C ∈ R^{p×p}, where λ_2, . . . , λ_p ∈ R, w_1, . . . , w_p constitute an orthonormal set of vectors and C ∈ R^{p×p} is a symmetric positive definite matrix. Then A is invertible, with an explicit expression for its inverse.

Proof of Lemma B.2. Observe first that BB† = B†B = I_p − w_1w_1'. We then compute the product of the two matrices; the opposite product can be verified to equal the identity in a similar manner, proving the claim.
Proof of Theorem 2. The objective functions are translation invariant, meaning that we may, without loss of generality, assume that E(x) = 0. The strong consistency of the estimator can be shown in the usual way by establishing that the objective function is strongly uniformly convergent in the compact parameter set S^{p−1} (or, more precisely, in its subset where the sign of the estimator is fixed); see (B.3). For simplicity, we give the proof of the uniform convergence only in Theorem 3, in the context of skewness (which involves lower moments than kurtosis); similar (but lengthier) arguments can be used to show (B.3).
To show the limiting normality, note that the Lagrangian corresponding to the optimization problem is ℓ_n(u) = {κ_n(u) − 3}² + λ_n(u'u − 1), where λ_n is the Lagrangian multiplier. Using some matrix calculus, the corresponding gradient can be computed. The gradient vanishes at the (sign-adjusted) sample maximum s_n u_n, and multiplying the gradient from the left with s_n u_n' thus yields 0 = s_n u_n'∇ℓ_n(s_n u_n) = −2λ_n, showing that λ_n = 0. We next work on the level of individual probability elements ω ∈ Ω. By Lemma 1, the LLN and the strong consistency of s_n u_n, there exists a probability one set H such that s_n u_n → u_0 and κ_n(s_n u_n) − 3 → t ≠ 0 for all ω ∈ H. Thus, for each ω ∈ H, the maximizer u_n satisfies, for n large enough, the estimating equation s_{n2}(s_n u_n)m_{n3}(s_n u_n) − s_{n4}(s_n u_n)m_{n1}(s_n u_n) = 0. Using Lagrangian multipliers, we can similarly show that the population maximizer u_0 satisfies the corresponding population estimating equation. For each ω ∈ H, we have, for n large enough, the Taylor expansion in which ∇∇g_nκ(ũ_n) is the third-order tensor of second derivatives of g_nκ, the symbol × denotes the vector-by-tensor multiplication (producing a matrix) and ũ_n satisfies ||ũ_n − u_0|| ≤ ||s_n u_n − u_0||, implying that ũ_n → u_0. Multiplying the expansion by √n and using the fact that g_nκ(s_n u_n) = 0 gives (B.4). Now, the elements of ∇∇g_nκ(u) are polynomials in the sample moments of x_i and the elements of u, implying that, by the LLN, ∇∇g_nκ(ũ_n) converges to a constant and (s_n u_n − u_0) × ∇∇g_nκ(ũ_n) converges to zero, for any ω ∈ H. Moreover, by the unit lengths of s_n u_n and u_0, we have c_0 h(s_n u_n + u_0)'√n(s_n u_n − u_0) = 0, where c_0 := (1/2){3s_2(u_0)² − s_4(u_0)}||θ||(h'Σ^{−1}h)^{−1} (the inclusion of the constant c_0 simplifies things later on). Summing this with equation (B.4) gives {(s_n u_n − u_0) × ∇∇g_nκ(ũ_n) + ∇g_nκ(u_0) + c_0 h(s_n u_n + u_0)'}√n(s_n u_n − u_0) = √n g_nκ(u_0).

(B.5)
Hence, assuming further that √n g_nκ(u_0) ⇝ N_p(0, Π), the limiting distribution of s_n u_n follows by Slutsky's theorem. Thus, to complete the proof, we next derive expressions for G and Π (and show that the former indeed has full rank).
Proof of Theorem 3. The strong consistency follows as soon as we show the strong uniform consistency (B.12). By Theorem 2 and Lemma 1 in Andrews (1992), (B.12) holds if: 1) the parameter space is compact; 2) γ_n(u)² → γ(u)², a.s., for all u ∈ S^{p−1} (this holds by the LLN and the continuous mapping theorem); 3) γ² is uniformly continuous in u; and 4) γ_n² is Lipschitz continuous in the sense that |γ_n(u_1)² − γ_n(u_2)²| ≤ K_n||u_1 − u_2|| for all u_1, u_2 ∈ S^{p−1} and some random variable K_n converging almost surely to a constant.
We now verify condition 4) above. Using the notation of the proof of Theorem 2, s_{n2}(u) is, for all u ∈ S^{p−1}, lower bounded by the smallest eigenvalue of the sample covariance matrix, which, by the continuity of the eigenvalues and the positive-definiteness of the covariance matrix, converges almost surely to a positive constant. Moreover, we have a bound which converges, by the LLN, almost surely to a constant, and a similar result can be shown for |s_{n2}(u)|. Finally, using bounds involving ||x_i − x̄||³ and putting everything above together, we conclude that the Lipschitz continuity 4) holds. What remains to be verified is then condition 3), which can be shown similarly to 4) after recalling that Lipschitz continuity implies uniform continuity. Hence, the strong consistency of the estimator follows.
Summing the terms gives Π = s_2(s_2s_4 − s_2³ − s_3²)(Σ − τ^{−1}hh'), which yields the limiting covariance; simplifying the constant in front gives its final form.

Proof of Theorem 4. Again, the strong consistency follows as in Theorem 2 and we omit its proof. For the limiting distribution, we give in the following the key steps of the proof (and use the same notation as in the proofs of Theorems 2 and 3).
The matrix Cov(y_1, y_2) consists of 36 terms, computed analogously.

Proof of Lemma A.1. We first show that i) implies ii). The positive-definiteness of Σ, in conjunction with the relation Σh = φh, gives that Σ^{−1}h = φ^{−1}h. Consequently, h is an eigenvector of Cov(x) as well. Hence, ±θ/||θ|| are the unique leading unit-length eigenvectors of Cov(x) if the second-to-largest eigenvalue of Cov(x) is smaller than φ(1 + βτ), the eigenvalue corresponding to ±θ/||θ||.
Proof of Theorem A.1. The proof of the strong consistency proceeds similarly to that of Theorem 2 and we omit it.
The gradient of the Lagrangian corresponding to the extraction of the leading unit-length eigenvector of C_n is 2C_nu − 2λ_nu, where λ_n is the Lagrangian multiplier. The gradient vanishes at s_n u_n, allowing us to solve for the multiplier, λ_n = u_n'C_nu_n, by multiplying the gradient equation from the left with u_n'. Plugging the multiplier back in gives (I_p − P_n)C_n s_n u_n = 0, where P_n := u_nu_n'. This equation is equivalent to the equality A_n√n(s_n u_n − u_1) = −(I_p − P_n)√n{C_n − Cov(x)}u_1, (B.15) where A_n := (I_p − P_n)C_n − s_nφ_1(u_n'u_1)I_p + φ_1s_nu_1u_n', φ_1 = u_1'Cov(x)u_1 = ||θ||^{−2}τ(1 + βτ) and τ = h'Σ^{−1}h. Now, by the strong consistency s_n u_n → u_1, we have P_n →_p u_1u_1' =: P. Consequently, A_n →_p (I_p − P)Cov(x) − φ_1I_p + φ_1u_1u_1' = Cov(x) − φ_1I_p.
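The estimating equation (I_p − P_n)C_n s_n u_n = 0 simply states that u_n is an eigenvector of C_n; a quick numerical illustration (our own sketch, with an arbitrary covariance structure):

```python
import numpy as np

rng = np.random.default_rng(2)

# Any sample covariance matrix and its leading unit-length eigenvector
x = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])
C = np.cov(x, rowvar=False)
vals, vecs = np.linalg.eigh(C)
u = vecs[:, -1]                         # leading eigenvector, u'u = 1

P = np.outer(u, u)                      # the projector P_n = u_n u_n'
residual = (np.eye(3) - P) @ C @ u      # vanishes, since C u = lambda u
lagrange = u @ C @ u                    # multiplier lambda_n = u_n' C_n u_n
```

The residual is zero up to floating-point error and the multiplier recovers the leading eigenvalue, matching the derivation of the estimating equation.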

C Additional simulation results
In this section, we give supporting plots as supplementary material to the claims made and plots presented in the article. The simulations and the corresponding plots are done using R 3.6.1 (R Core Team 2020) together with the R packages ICtest (Nordhausen et al. 2021), mvtnorm (Genz et al. 2020), MASS (Venables & Ripley 2002), GGally (Schloerke et al. 2021), ggpubr (Kassambara 2020), dplyr (Wickham et al. 2021), tidyr (Wickham 2020) and RColorBrewer (Neuwirth 2014). Figures 11 and 13 show the standard deviation of the maximal similarity index s_n u_n'θ/||θ||, where u_n is one of the PP estimators discussed in the article, as a function of the Mahalanobis distance τ between the group means and the mixing proportion α_1, for sample sizes n ∈ {500, 1000, 2000, 4000} and n ∈ {8000, 16000, 32000}, respectively. Figure 14 shows a scatter plot matrix of the finance data set from the R package Rmixmod, where a point is colored red if the company is bankrupt and blue otherwise; the marginal densities of both groups are given on the diagonal. Figure 15 shows boxplots of the projection scores of the finance data set along the PP directions based on PCA, LDA, kurtosis, skewness and the hybrid estimator η_n(·, w_1) for w_1 = 0.1, 0.2, . . . , 0.9, for healthy and bankrupt companies.

Figure 13: Standard deviation of the MSI s_n u_n'θ/||θ|| as a function of the Mahalanobis distance τ between the group means and the mixing proportion α_1, where u_n is one of the four estimators discussed above.

Figure 15: Boxplots of the projection scores of the finance data along the directions based on PCA and LDA, as well as the PP directions obtained by maximizing (κ_n − 3)², γ_n² and η_n(·, w_1) for w_1 = 0.1, 0.2, . . . , 0.9, for healthy and bankrupt companies.