Double data piling leads to perfect classification

Abstract: Data piling refers to the phenomenon that training data vectors from each class project to a single point for classification. While this interesting phenomenon has been a key to understanding many distinctive properties of high-dimensional discrimination, the theoretical underpinning of data piling is far from properly established. In this work, high-dimensional asymptotics of data piling is investigated under a spiked covariance model, which reveals its close connection to the well-known ridged linear classifier. In particular, by projecting the ridge discriminant vector onto the subspace spanned by the leading sample principal component directions and the maximal data piling vector, we show that a negatively ridged discriminant vector can asymptotically achieve data piling of independent test data, essentially yielding a perfect classification. The second data piling direction is obtained purely from training data and shown to have a maximal property. Furthermore, asymptotic perfect classification occurs only along the second data piling direction.


Introduction
Classification of high-dimensional data has become an extremely common statistical problem. For a two-group classification, we use the high-dimension, low-sample-size (HDLSS) asymptotic regime (Hall, Marron and Neeman, 2005), in which the dimension $p$ increases while the sample size $n$ is fixed, to reveal a perhaps counter-intuitive phenomenon in the classification of HDLSS data.
Consider a situation where a classification rule is given by a linear separating hyperplane, or equivalently via the orthogonal projection of data onto its normal vector. A classical choice for such a vector is Fisher's linear discriminant vector $w_{\mathrm{FLD}}$ (Fisher, 1936), which maximizes the between-group scatter $(w^T d)^2$ while minimizing the within-group scatter $w^T S w$, where $d = \bar X_1 - \bar X_2$ and $S$ is the $p \times p$ pooled covariance matrix of rank $\min\{p, n-2\}$. While $w_{\mathrm{FLD}} = S^{-1} d$ is well-defined only for invertible $S$, that is, for $p \le n-2$, a ridge-type modification $w_\alpha$, introduced in Section 2, is available for all $p$. As we show, the limiting classification performance at a data-driven negative ridge parameter $\hat\alpha$ dominates that at any other choices of $\alpha$. This implies that an asymptotic perfect classification can happen only at $v_{\hat\alpha}$.
Note that the second data piling direction is found by allowing the ridge parameter to be negative. Recently, for over-parameterized regression models with $p > n$, Kobak, Lomond and Sanchez (2020), Wu and Xu (2020) and Tsigler and Bartlett (2020) have pointed out that negatively ridged coefficient estimators can be optimal. In particular, Tsigler and Bartlett (2020) have shown that a spiked covariance model with the number of spikes much smaller than the sample size is a key for a negative ridge parameter to be optimal in ridge regression. Our finding with the $m$-spiked model parallels this observation in a classification setting. In the linear regression context, extreme overfitting of ridgeless estimators is found to perform surprisingly well (Hastie et al., 2019; Holzmüller, 2020; Bartlett et al., 2020), a phenomenon called "double descent". The ridgeless estimator in the classification context is exactly the first data piling vector $w_{\mathrm{MDP}}$. Our results further suggest that it is possible to asymptotically interpolate test data by the second data piling vector. We also note that an asymptotic perfect classification of functional data is shown to be possible (Delaigle and Hall, 2012). There, a key condition enabling perfect classification is that the norm of the mean difference is comparable to or larger than the standard deviation of the leading principal component scores, which is analogous to our model with $\|\mu_1 - \mu_2\| = O(p^{1/2})$, $\lambda_i^{1/2} = O(p^{\beta/2})$ for $i = 1, \ldots, m$ and $\beta \le 1$. The HDLSS asymptotic regime we adopt has been used to reveal some unique characteristics of HDLSS data. Recent developments on the HDLSS asymptotic studies of various multivariate methods are surveyed by Aoshima et al. (2018).

Negatively ridged discriminant directions
For $i = 1, 2$, let $X_{i1}, \ldots, X_{in_i} \in \mathbb R^p$ denote a sample drawn from an absolutely continuous distribution on $\mathbb R^p$ with mean $\mu_i$ and common covariance matrix $\Sigma$. We assume $p > n := n_1 + n_2$. Let $d = \bar X_1 - \bar X_2$ be the vector of sample mean differences, and let $S = \sum_{i=1}^{2}\sum_{j=1}^{n_i}(X_{ij} - \bar X_i)(X_{ij} - \bar X_i)^T/n$ be the sample within-group covariance matrix. We consider a ridge-type discriminant vector: for a ridge parameter $\alpha \in (-\infty, \infty)$,
$$\tilde w_\alpha = (S + \alpha_p I_p)^{-1} d, \qquad (2.1)$$
where $\alpha_p = \alpha p$. We write $w_\alpha = \tilde w_\alpha / \|\tilde w_\alpha\|$. Allowing $\alpha_p$ to increase along $p$ will be convenient in our analysis of diverging $p$. In (2.1), we have extended the range of the ridge parameter to include negative real values. This is in contrast to the conventional range $\alpha \in [0, \infty)$ considered in Friedman (1989), Guo, Hastie and Tibshirani (2007), Lee, Ahn and Jeon (2013) and many others. The price we pay is that (2.1) is in fact ill-defined for some $\alpha \in (-\infty, 0]$ with rank-deficient $S + \alpha_p I_p$. In what follows, we precisely redefine $w_\alpha$ for $\alpha \in [-\infty, \infty]$, by filling in the ill-defined locations.
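For concreteness, the following sketch (ours, not from the paper; toy Gaussian data with hypothetical sizes) computes the unit ridge discriminant vector $w_\alpha$ of (2.1), allowing a negative ridge parameter.

```python
import numpy as np

def ridge_direction(X1, X2, alpha):
    """Unit ridge discriminant vector w_alpha of (2.1).

    X1, X2 are p x n_i data matrices (columns are observations);
    alpha may be negative, as long as S + alpha*p*I is nonsingular.
    """
    p, n = X1.shape[0], X1.shape[1] + X2.shape[1]
    d = X1.mean(axis=1) - X2.mean(axis=1)              # sample mean difference
    Xc = np.hstack([X1 - X1.mean(axis=1, keepdims=True),
                    X2 - X2.mean(axis=1, keepdims=True)])
    S = Xc @ Xc.T / n                                  # within-group covariance
    w = np.linalg.solve(S + alpha * p * np.eye(p), d)  # (S + alpha_p I)^{-1} d
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
p, n1, n2 = 500, 20, 20                                # hypothetical sizes
X1 = rng.normal(size=(p, n1))
X2 = rng.normal(loc=1 / np.sqrt(p), size=(p, n2))
w = ridge_direction(X1, X2, alpha=-0.5)                # a negatively ridged direction
```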
Let $S = \hat U_1 \hat\Lambda \hat U_1^T$ be the eigen-decomposition of $S$ such that $\hat U_1 = [\hat u_1, \ldots, \hat u_{n-2}]$ consists of the orthonormal eigenvectors, and $\hat\Lambda$ is the diagonal matrix whose $i$th entry $\hat\lambda_i$ ($i = 1, \ldots, n-2$) is the $i$th largest eigenvalue. (There are exactly $n-2$ positive eigenvalues almost surely.) Let $\hat U_2$ be a $p \times (p - n + 2)$ matrix whose columns form an orthonormal basis for the nullspace of $S$. Then, for $\alpha$ at which (2.1) is defined, we decompose $\tilde w_\alpha$ into three parts:
$$\tilde w_\alpha = \sum_{i=1}^{m} \frac{\hat u_i^T d}{\hat\lambda_i + \alpha_p}\,\hat u_i + \sum_{i=m+1}^{n-2} \frac{\hat u_i^T d}{\hat\lambda_i + \alpha_p}\,\hat u_i + \frac{1}{\alpha_p}\,\hat U_2 \hat U_2^T d. \qquad (2.2)$$
The decomposition above reveals that $w_\alpha$, as well as $\tilde w_\alpha$, is decomposed into the weighted sum of projections of the mean difference $d$ onto three orthogonal subspaces. The first two subspaces belong to the column space of $S$. We choose the first $m$ sample eigenvectors $\hat u_1, \ldots, \hat u_m$ to constitute the first subspace, in the anticipation that the role of the leading $m$ eigenvectors is distinct from that of the rest of the eigenvectors under the $m$-component model we consider. The last subspace, spanned by $\hat U_2$, is the nullspace of $S$. The set of $\alpha$ at which (2.2) is not defined is $\{0, \pm\infty\} \cup \{-\hat\lambda_i/p : i = 1, \ldots, n-2\}$. It can be seen that the last term of (2.2) is parallel to the maximal data piling direction, and that $\lim_{\alpha \to 0} w_\alpha = w_{\mathrm{MDP}}$. Similarly, we have $\lim_{\alpha \to \pm\infty} w_\alpha = d/\|d\|$ and $\lim_{\alpha \uparrow -\hat\lambda_i/p} w_\alpha = \hat u_i$, $\lim_{\alpha \downarrow -\hat\lambda_i/p} w_\alpha = -\hat u_i$ ($i = 1, \ldots, n-2$). We set $w_0 := w_{\mathrm{MDP}}$, $w_{\pm\infty} := d/\|d\|$, and $w_{-\hat\lambda_i/p} := \hat u_i$, so that $w_\alpha$ is left-continuous at $-\hat\lambda_i/p$ and continuous elsewhere. Then, the discontinuities of the parametrization path $\alpha \mapsto w_\alpha$ are exactly the $n-2$ sign flips, each of which occurs when $-\alpha_p$ crosses an eigenvalue of $S$.

It helps to visualize the modified $w_\alpha$ as a trajectory along varying $\alpha$. For this, temporarily assume a one-component model, i.e., $m = 1$. The parametrization path $\alpha \mapsto w_\alpha$ is visualized in Fig. 1 for a high-dimensional data set with $p = 5000$ and $n_1 = n_2 = 20$ (the model for this data set is described in Section 4.1). To visualize the $p$-vector $w_\alpha$ in a three-dimensional plot, we make use of the decomposition (2.2). For any $\alpha \in [-\infty, \infty]$, the ridged discriminant vector $w_\alpha$ lies in the $(n-1)$-dimensional subspace $\mathcal S_X$ spanned by $\{X_{ij} - \bar X\}$, where $\bar X = \sum_i\sum_j X_{ij}/n$, or equivalently, by the column space of $S$ and $d$. Recall that $m = 1$ in this example. The first and third terms of (2.2) are spanned by $\hat u_1$ and $w_{\mathrm{MDP}} \in \mathcal S_X$, respectively, which provide two of the axes of the figure. The remaining $n-3$ dimensions of $\mathcal S_X$, corresponding to the second term of (2.2), are collapsed to a nonnegative value. For the data set used in Fig. 1, it is apparent that, for most values of $\alpha$, $w_\alpha$ is close to the plane spanned by $\hat u_1$ and $w_{\mathrm{MDP}}$. It turns out that the subspace $\mathcal S = \mathrm{span}(\hat u_1, w_{\mathrm{MDP}})$ is the only meaningful subspace for large $p$, while the nullspace of $S$, within $\mathcal S_X$, does not possess any discriminative information. We will show this more carefully in the next subsection. For general $m \ge 1$, the subspace of interest is
$$\mathcal S = \mathrm{span}(\hat u_1, \ldots, \hat u_m, w_{\mathrm{MDP}}). \qquad (2.3)$$
Fig 1. (a) The path of $w_\alpha$, where the second coordinate $b_\alpha$ corresponds to the second term of (2.2). $w_0$ is the maximal data piling direction, and $w_{\pm\infty}$ is the mean difference direction. As $\alpha$ crosses $-\hat\lambda_1/p$, $w_\alpha$ is flipped. (b) The path of $w_\alpha$ projected onto $\mathcal S = \mathrm{span}(\hat u_1, w_{\mathrm{MDP}})$. See Appendix B for details on the second axis $b_\alpha$ and to see how the $n-3$ sign flips are suppressed for visualization.
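As a numerical sanity check (our own sketch on toy data, not from the paper), one can verify the three-term decomposition (2.2) and the limit $\lim_{\alpha\to 0} w_\alpha = w_{\mathrm{MDP}}$ directly:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2, m = 300, 10, 10, 1
n = n1 + n2
X1, X2 = rng.normal(size=(p, n1)), rng.normal(size=(p, n2)) + 0.5
d = X1.mean(1) - X2.mean(1)
Xc = np.hstack([X1 - X1.mean(1, keepdims=True), X2 - X2.mean(1, keepdims=True)])
S = Xc @ Xc.T / n

lam, U = np.linalg.eigh(S)                  # ascending; reverse to descending
lam, U = lam[::-1], U[:, ::-1]
U1, lam1, U2 = U[:, :n-2], lam[:n-2], U[:, n-2:]   # U2 spans the nullspace of S

alpha = -0.01
ap = alpha * p
w_direct = np.linalg.solve(S + ap * np.eye(p), d)
w_decomp = (U1[:, :m] @ (U1[:, :m].T @ d / (lam1[:m] + ap))     # leading term
            + U1[:, m:] @ (U1[:, m:].T @ d / (lam1[m:] + ap))   # middle term
            + U2 @ (U2.T @ d) / ap)                             # nullspace term
assert np.allclose(w_direct, w_decomp)

# The nullspace term is parallel to w_MDP, and w_alpha -> w_MDP as alpha -> 0.
w_mdp = U2 @ (U2.T @ d)
w_mdp /= np.linalg.norm(w_mdp)
w_tiny = np.linalg.solve(S + 1e-8 * np.eye(p), d)
print(abs(w_mdp @ w_tiny) / np.linalg.norm(w_tiny))  # close to 1
```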

Concentration of ridge directions in high dimensions
Key conditions are described with respect to the mean difference $\mu = \mu_1 - \mu_2$ and the principal component structure of the common covariance matrix $\Sigma$. Denote the eigen-decomposition of $\Sigma$ by $\Sigma = U\Lambda U^T$, where $U = [u_1, \ldots, u_p]$ collects the eigenvectors, and the entries of the diagonal matrix $\Lambda$ are the ordered eigenvalues $\lambda_1 \ge \cdots \ge \lambda_p$. Assumptions 1-3 play important roles in describing the high-dimensional asymptotic behaviors of $w_\alpha$ and associated classifiers.
Remark 2.1. A seemingly more relaxed model may be assumed in place of Assumption 1. For example, for some natural number $M$, we may set $1 \ge \beta_1 \ge \beta_2 \ge \cdots \ge \beta_M \ge 0$ and $\lambda_i = O(p^{\beta_i})$ for $i = 1, \ldots, M$. This model is asymptotically equivalent to the model assumed in Assumption 1 with $m = \#\{i : \beta_i = 1\}$, for the following reason: there is no difference between $\lambda_i = O(p^{\beta_i})$ with $\beta_i < 1$ and $\lambda_i = O(1)$ in the HDLSS asymptotic results we derive; see, e.g., Lemma 3.6. Note that allowing $\beta_i \in (0, 1)$ for $i > m$ would make a difference in the convergence rates of our conclusions. Since we do not explicitly state the rates of convergence, we use Assumption 1 for the sake of simplicity.
For a vector $w \in \mathbb R^p$ and a subspace $A$ of $\mathbb R^p$, we write $P_A$ and $P_A w$ for the orthogonal projection matrix and the orthogonal projection of $w$, respectively, onto $A$. Let $U_m = \mathrm{span}(u_1, \ldots, u_m)$ be the subspace formed by the first $m$ leading eigenvectors.
The $m$-component model in Assumption 1 is routinely assumed in classification problems (Qiao et al., 2010; Ahn and Marron, 2010; Yata and Aoshima, 2012) and is a special case of spiked covariance models (Hellton and Thoresen, 2017; Jung, Lee and Ahn, 2018; Ishii, Yata and Aoshima, 2019; Fan et al., 2021). The parameter $\beta \in [0, 1]$ determines the order of magnitude of the $m$ leading eigenvalues $\lambda_1, \ldots, \lambda_m$ of $\Sigma$. The asymptotic result obtained by assuming $\beta \in (0, 1)$ is equivalent to that in the simple null case $\beta = 0$ (cf. Ahn et al., 2007; Yata and Aoshima, 2020), so we pay special attention to the case $\beta = 1$. Note that the case of $\beta = 0$ (or, equivalently, $m = 0$) is easier to classify, since the Mahalanobis distance $\|\Sigma^{-1/2}\mu\|^2$ is strictly larger than that under $\beta = 1$ and $m \ge 1$. Assumption 3 introduces $k$, which controls the asymptotic portion of the mean difference $\mu$ along the $m$-dimensional principal subspace $U_m$. We do not allow $k = 1$, as in such a case the mean difference vector $\mu$ lies completely within the subspace $U_m$ of larger variance. Put differently, the quantity $1 - k^2 > 0$ is the portion of the mean difference in the subspace with relatively smaller magnitude of noise.
The class of distributions we consider is given by the following assumption on the true principal scores $z_{ij} = \Lambda^{-1/2} U^T (X_{ij} - \mu_i)$. These principal scores may not be independent of each other unless we assume normality.
Assumption 4. The elements of the $p$-vector $z_{ij}$ have uniformly bounded fourth moments, and for each $p$, $z_{ij}$ consists of the first $p$ elements of an infinite random sequence $(z_{(1)}, z_{(2)}, \ldots)_{i,j}$, which is $\rho$-mixing under some permutation.
One of our main tools of analysis is the law of large numbers applied across variables ($p \to \infty$), rather than across the sample ($n \to \infty$). For this, the dependency among the principal scores is controlled by the $\rho$-mixing condition; see Appendix A for the definition. A sequence of independent normally distributed random variables satisfies the $\rho$-mixing condition. Assumption 4 is much weaker than normality and independence, yet it enables an application of the law of large numbers, as shown in Jung and Marron (2009).
We are now ready to state our result on $w_\alpha$. We define $\mathrm{Angle}(w, A) := \cos^{-1}\{|w^T P_A w| / (\|w\| \|P_A w\|)\}$. For two unit vectors $w_1, w_2 \in \mathbb R^p$, $\mathrm{Angle}(w_1, w_2) = \cos^{-1}|w_1^T w_2|$ is the acute angle formed by the two vectors. The notation $\overset{P}{\to}$ represents convergence in probability. Throughout, we let $n$ be fixed and $p \to \infty$.

Theorem 2.1. Suppose Assumptions 1-4 hold and $\beta = 1$. Then, for any $\alpha \ne -\tau^2/n$, $\mathrm{Angle}(w_\alpha, \mathcal S) \overset{P}{\to} 0$ as $p \to \infty$.
All proofs are contained in Appendix C. Theorem 2.1 confirms that the concentration of $w_\alpha$ towards $\mathcal S$ for almost all values of $\alpha$, illustrated in Fig. 1, is bound to happen in high dimensions. The only exception is at $\alpha = -\tau^2/n$, to which all scaled eigenvalues $-\hat\lambda_i/p$ ($i = m+1, \ldots, n-2$) converge. (The asymptotic behaviors of the eigenvalues of $S$ are stated in Lemma C.2.) Heuristically, at $\alpha = \alpha_i := -\hat\lambda_i/p \approx -\tau^2/n$ for $i \ge m+1$, we have $w_{\alpha_i} = \hat u_i$, which is orthogonal to $\mathcal S$ for every $p$.
We remark that if the assumption $\beta = 1$ in Theorem 2.1 is replaced by $\beta \in [0, 1)$, then a stronger, yet less interesting, statement can be given. Specifically, if $\beta \in [0, 1)$, then the role of $\hat u_1, \ldots, \hat u_m$ is no longer different from that of the rest of the sample principal component directions, and for $\alpha \ne -\tau^2/n$, $\mathrm{Angle}(w_\alpha, w_{\mathrm{MDP}}) \overset{P}{\to} 0$ as $p \to \infty$. This can be shown by an argument used in Jung (2018). In contrast, in the $m$-component model with $\beta = 1$, $\mathrm{Angle}(w_\alpha, w_{\mathrm{MDP}})$ can be large in the limit, depending on the choice of $\alpha$, as depicted for finite $p$ in Fig. 1.

Based on the above, it will be convenient to consider a projection of $w_\alpha$ onto $\mathcal S$, to which almost all ridge directions converge in high dimensions. We propose to use
$$v_\alpha \propto \sum_{i=1}^{m} \frac{\hat u_i^T d}{\hat\lambda_i + \alpha_p}\,\hat u_i + \frac{1}{\alpha_p}\,\hat U_2 \hat U_2^T d, \qquad (2.5)$$
satisfying $\|v_\alpha\| = 1$. This $v_\alpha$ is not exactly the projection $P_{\mathcal S} w_\alpha / \|P_{\mathcal S} w_\alpha\|$, because $P_{\mathcal S} w_\alpha = 0$ at $\alpha = -\hat\lambda_i/p$ for any $i = m+1, \ldots, n-2$. The $v_\alpha$ is given by filling in the $n - m - 2$ undefined points of the scaled projections. It can be seen that $v_\alpha$ simply ignores the second term of (2.2). For the special case of $m = 1$, the trajectory of the unit vectors $v_\alpha$ over $\alpha \in [-\infty, \infty]$, which is a unit semicircle on the plane $\mathcal S$, is illustrated in Fig. 2(a).
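In code, $v_\alpha$ of (2.5) simply drops the middle term of the decomposition; below is a sketch (ours, reusing the quantities computed in the earlier snippet; the normalization convention is our reading of (2.5)):

```python
def v_direction(d, U1, lam1, U2, alpha, p, m=1):
    """v_alpha of (2.5): keep the first and third terms of (2.2), normalize."""
    ap = alpha * p
    v = (U1[:, :m] @ (U1[:, :m].T @ d / (lam1[:m] + ap))
         + U2 @ (U2.T @ d) / ap)
    return v / np.linalg.norm(v)
```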

First data piling
The data piling in two-group classification problems refers to the phenomenon that, when data are projected onto a vector $w$, the projected data pile on two points, one for each group (Ahn and Marron, 2010). This "first" data piling can be observed whenever $p > n - 2$. Specifically, for any vector $w$ in the nullspace of $S$, i.e., $w \in \mathrm{span}(\hat U_2)$ in the decomposition (2.2), data piling occurs. Among those, the maximal data piling direction $w_{\mathrm{MDP}}$ uniquely maximizes the distance between the two piling locations. For example, in Fig. 2(b) the data $\mathcal X := \{X_{ij} : i = 1, 2,\ j = 1, \ldots, n_i\}$ projected onto $w_{\mathrm{MDP}}$ are piled on two locations on the $w_{\mathrm{MDP}}$ axis, and the distance between them is the largest. Note that $w_\alpha$, $v_\alpha$ and $\mathcal S$ depend only on the data set $\mathcal X$.
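The first data piling is easy to see numerically. The sketch below (ours; toy data) builds $w_{\mathrm{MDP}}$ as the normalized projection of $d$ onto the nullspace of $S$ and confirms that each class projects to a single point.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 300, 10, 10
X1, X2 = rng.normal(size=(p, n1)), rng.normal(size=(p, n2)) + 0.5
d = X1.mean(1) - X2.mean(1)
Xc = np.hstack([X1 - X1.mean(1, keepdims=True), X2 - X2.mean(1, keepdims=True)])

U, s, _ = np.linalg.svd(Xc, full_matrices=False)
B = U[:, s > 1e-8]                   # orthonormal basis of the column space of S
w_mdp = d - B @ (B.T @ d)            # project d onto the nullspace of S
w_mdp /= np.linalg.norm(w_mdp)

# Training projections pile at exactly two points (zero within-class spread):
print(np.ptp(w_mdp @ X1), np.ptp(w_mdp @ X2))    # both ~ 0: data piling
print(w_mdp @ X1.mean(1) - w_mdp @ X2.mean(1))   # distance between the piles
```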

Second data piling
The second data piling phenomenon occurs for new, independent data $\mathcal Y$, sampled from the same model as $\mathcal X$ and independent of $\mathcal X$. One may regard $\mathcal X$ as training data and $\mathcal Y$ as test data. Figure 2(b) plots both $\mathcal X$ and $\mathcal Y$ of a one-component model, i.e., $m = 1$, projected onto $\mathcal S$. While $\mathcal Y$ projected on $w_{\mathrm{MDP}}$ does not exhibit data piling, it is interesting to observe that $P_{\mathcal S}\mathcal Y$ tend to lie on two parallel straight lines, one for each group. It turns out that the piling of $\mathcal Y$ onto these two lines occurs asymptotically as $p \to \infty$. We call this new phenomenon the second data piling, this time for independent observations. For the direction vector $v$ on $\mathcal S$ orthogonal to both lines, $P_v \mathcal Y$ exhibits data piling on two points, asymptotically.
In the following, we show that such a direction $v$ can be identified purely from the training data $\mathcal X$, even before observing the new data $\mathcal Y$. Moreover, we show that this second data piling direction is also maximal in a sense similar to the first maximal data piling.
We begin by focusing on the two lines, $l_1$ and $l_2$, in Fig. 2(b). The direction of the parallel lines has a close connection with $u_1$, the first true principal component direction. Let $u_{1,\mathcal S} = P_{\mathcal S} u_1$ be the projection of $u_1$ onto $\mathcal S$, and write $\kappa_{\mathrm{MDP}} := w_{\mathrm{MDP}}^T (\bar X_1 - \bar X_2)/\sqrt p$, which is the distance between the two piles on the first data piling direction, scaled by $\sqrt p$. The two lines are
$$l_i = \{\kappa_i w_{\mathrm{MDP}} + t\, u_{1,\mathcal S} : t \in \mathbb R\}, \quad i = 1, 2, \qquad (3.1)$$
where $\kappa_1 = (1 - \eta_1)(1 - k^2)\delta^2/\kappa_{\mathrm{MDP}}$ and $\kappa_2 = -(1 - \eta_2)(1 - k^2)\delta^2/\kappa_{\mathrm{MDP}}$, with $\eta_1 = n_1/n$ and $\eta_2 = n_2/n$. The two lines $l_1$ and $l_2$ are parallel to $u_{1,\mathcal S}$. For general cases where $m \ge 1$, the two lines (3.1) are naturally extended to two $m$-dimensional affine subspaces, $L_1$ and $L_2$. Let $u_{i,\mathcal S} = P_{\mathcal S} u_i$ be the projection of $u_i$ onto $\mathcal S$ ($i = 1, \ldots, m$), and write $U_{m,\mathcal S} = [u_{1,\mathcal S}, \ldots, u_{m,\mathcal S}]$. We define
$$L_i = \{\kappa_i w_{\mathrm{MDP}} + U_{m,\mathcal S}\, t : t \in \mathbb R^m\}, \quad i = 1, 2. \qquad (3.2)$$
We now confirm that the independent data are piled along these affine subspaces. For any observation $Y \in \mathcal Y$, we write $\pi(Y) = i$ if $Y$ belongs to group $i$.
Theorem 3.1. Suppose Assumptions 1-4 hold and $\beta = 1$. For any observation $Y \in \mathcal Y$ with $\pi(Y) = i$, writing $Y_{\mathcal S} := P_{\mathcal S} Y/\sqrt p$, we have, for $i = 1, 2$ and for any $\epsilon > 0$,
$$\lim_{p\to\infty} P\{\mathrm{dist}(Y_{\mathcal S}, L_i) > \epsilon\} = 0.$$
Theorem 3.1 reveals that exact second data piling for independent data occurs asymptotically. Even for the finite-dimensional example in Fig. 2(b), the distance between $Y_{\mathcal S}$ and $L_i$ is small enough to perceive the second data piling.
The parametrization of $L_i$ in (3.2) involves the unknown parameters $u_1, \ldots, u_m$, and it is impossible to obtain $L_i$, or the direction orthogonal to $L_i$, from $\mathcal X$ alone. Our next result shows that there exists an $\hat\alpha$, purely a function of the data $\mathcal X$, such that $v_{\hat\alpha}$ is a direction asymptotically perpendicular to $L_i$. As Theorem 3.2 shows, any HDLSS-consistent estimator of $-\tau^2/n$ is such an $\hat\alpha$. We say $\hat\theta$ is an HDLSS-consistent estimator for $\theta$ if for any $\epsilon > 0$, $\lim_{p\to\infty} P(|\hat\theta - \theta| > \epsilon) = 0$.

Theorem 3.2. Suppose Assumptions 1-4 hold and $\beta = 1$. For any HDLSS-consistent estimator $\hat\alpha$ of $-\tau^2/n$, $v_{\hat\alpha}^T u_{i,\mathcal S} \overset{P}{\to} 0$ as $p \to \infty$ for $i = 1, \ldots, m$; that is, $v_{\hat\alpha}$ is asymptotically perpendicular to $L_1$ and $L_2$.
An immediate consequence of Theorem 3.2 is that the projections of any new data $\mathcal Y$ onto $v_{\hat\alpha}$ exhibit the asymptotic second data piling; as $p \to \infty$, $P_{v_{\hat\alpha}} \mathcal Y$ piles on two distinct locations, one for each group. Note that the estimator $\hat\alpha$ of $-\tau^2/n$ is typically negative, and the second data piling direction is a projected negatively-ridged discriminant direction. Since the scaled eigenvalues of $S$ converge to $\tau^2/n$ (shown in Lemma C.2), candidates for $\hat\alpha$ include $-p^{-1}\hat\lambda_i$ for $i = m+1, \ldots, n-2$. In what follows, we fix
$$\hat\alpha = -\hat\tau^2/n, \qquad (3.3)$$
where $\hat\tau^2/n$ is an HDLSS-consistent estimator of $\tau^2/n$ built from the trailing sample eigenvalues. If $L_1$ and $L_2$ do not coincide, then a separating hyperplane orthogonal to $v_{\hat\alpha}$, passing through $\bar X$, is expected to provide a satisfactory classification of the independent data $\mathcal Y$. At the end of this subsection, we show that the $L_i$'s do not coincide in the limit, and thus perfect classification is possible. The distance between the two parallel affine subspaces is asymptotically equivalent to the distance between the two piles of $P_{v_{\hat\alpha}} \mathcal Y$. This leads to a natural question: are there other directions $v \in \mathcal S_X$ exhibiting the second data piling phenomenon? We will see that there are infinitely many such directions, but among those, $v_{\hat\alpha}$ leads to the maximal asymptotic distance between the two piles.
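A minimal sketch of the estimate (ours; averaging over the trailing eigenvalues is one natural variant of the candidates $-p^{-1}\hat\lambda_i$ named above):

```python
import numpy as np

def alpha_hat(Xc, p, n, m=1):
    """Estimate -tau^2/n from the trailing sample eigenvalues (a sketch).

    Xc is the p x n group-wise centered data matrix; the n-2 nonzero
    eigenvalues of S are computed from the n x n dual matrix Xc^T Xc / n.
    """
    lam = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1][:n - 2]
    return -np.mean(lam[m:]) / p    # average of -lam_i/p, i = m+1, ..., n-2
```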
The second data piling on a $v \in \mathcal S_X \subset \mathbb R^p$ is recognized asymptotically as $p \to \infty$. While $v_{\hat\alpha}$ is defined for any $p$, we wish to consider any sequence of directions, which may not have a simple definition applicable to every $p$, as a candidate for a second data piling direction. For this, it will be clearer to think of $v \in \mathbb R^p$ as the $p$th element of an infinite sequence of vectors. We write $\{v\}$ for such a sequence, with $v$ indicating the $p$th element of $\{v\}$. Let $\mathcal V$ be the collection of sequences of unit vectors in the sample space $\mathcal S_X$, that is,
$$\mathcal V = \{\{v\} : v \in \mathcal S_X,\ \|v\| = 1 \text{ for each } p\}.$$
Let $Y$ and $Y'$ be any two independent observations in $\mathcal Y$ from the same group. We characterize the collection of all sequences of second data piling direction vectors as
$$\mathcal A = \{\{v\} \in \mathcal V : p^{-1/2}\, v^T (Y - Y') \overset{P}{\to} 0 \text{ as } p \to \infty\}.$$
That is, a sequence $\{v\}$ is a second-data-piling-direction sequence if all observations from the same population are projected to a single point asymptotically.
In the two-class discrimination problem, there are at most two points to which the independent data are projected. To better understand the second data piling directions, we use alternative but equivalent definitions of $\mathcal A$ to show that any $\{v\} \in \mathcal A$ is asymptotically close to a sequence $\{w\}$, where each $w$ is in the direct sum of $\mathrm{span}(v_{\hat\alpha})$ and $\mathrm{span}(\hat u_{m+1}, \ldots, \hat u_{n-2})$.

Lemma 3.3. Suppose Assumptions 1-4 hold and $\beta = 1$. Let $\mathcal A$ be the collection of all sequences $\{v\} \in \mathcal V$ such that for any independent observations $Y, Y' \in \mathcal Y$ from the same group, $p^{-1/2}\, v^T(Y - Y') \overset{P}{\to} 0$ as $p \to \infty$. Then for any $\{v\} \in \mathcal A$, there exists a sequence $\{w\}$, with $w \in \mathrm{span}(v_{\hat\alpha}) \oplus \mathrm{span}(\hat u_{m+1}, \ldots, \hat u_{n-2})$ for each $p$, such that $\|v - w\| \overset{P}{\to} 0$ as $p \to \infty$.
Lemma 3.3 shows that other second data piling directions are given by "lifting" $v_{\hat\alpha}$ on $\mathcal S$ towards $\mathrm{span}(\hat u_{m+1}, \ldots, \hat u_{n-2})$. We next show that among the many second-data-piling-direction sequences in $\mathcal A$, $\{v_{\hat\alpha}\}$ maximizes the limiting distance between the two piles. For $\{w\} \in \mathcal A$, let $D(w)$ be the probability limit of the scaled distance between the two piles of $P_w \mathcal Y$, if the limit exists. The probability limit may not exist for, e.g., an oscillating sequence $\{w\}$, and the limiting distance $D(w)$ may be a random variable.

Theorem 3.4. Suppose Assumptions 1-4 hold and $\beta = 1$. Then, for any $\{w\} \in \mathcal A$ such that $D(w)$ exists, $D(w) \le D(v_{\hat\alpha})$ with probability 1, and the equality holds if and only if $\|w - v_{\hat\alpha}\| \overset{P}{\to} 0$ as $p \to \infty$. Moreover, for any independent observation $Y$, as $p \to \infty$, the projection $p^{-1/2}\, v_{\hat\alpha}^T (Y - \bar X)$ converges in probability to one of two group-specific limits, separated by $\gamma(1 - k^2)\delta^2$, where $\gamma$ is a strictly positive random variable depending only on the first $m$ principal scores of $\mathcal X$.
Consequently, we may call $\{v_{\hat\alpha}\}$ a sequence of maximal second-data-piling directions. Since $v_{\hat\alpha} \in \mathcal S$, the result above also justifies our choice of focusing on $v_\alpha$ in $\mathcal S$ rather than $w_\alpha$ in the whole sample space $\mathcal S_X$.
Theorem 3.4 also shows that the limiting maximal distance between the two piles is $\gamma(1 - k^2)\delta^2$. Since the two lines do not coincide in the limit, a perfect classification occurs. To be specific, let $\phi_\alpha(Y; \mathcal X)$ be a classification rule defined for a given $\alpha$:
$$\phi_\alpha(Y; \mathcal X) = \begin{cases} 1, & v_\alpha^T (Y - \bar X) > 0, \\ 2, & \text{otherwise}. \end{cases} \qquad (3.7)$$
A perfect classification occurs in the limit $p \to \infty$, even if the sample size of $\mathcal X$ is kept fixed, and for a negative ridge parameter $\hat\alpha < 0$.

Perfect classification at negative ridge
In this subsection, we show that perfect classification occurs only at the negatively ridged $\phi_\alpha$, i.e., at $\alpha = \hat\alpha$. Denote the limits of the correct classification rates of $\phi_\alpha$ by
$$P_i(\alpha) = \lim_{p\to\infty} P\{\phi_\alpha(Y; \mathcal X) = i \mid \pi(Y) = i\}, \quad i = 1, 2. \qquad (3.8)$$
For simplicity, we assume that the prior $\pi_1 = P\{\pi(Y) = 1\} = 1 - \pi_2$ is equal to $n_1/n$. These limits (3.8) exist, as shown in Lemma 3.6 below. There, we provide an explicit expression for $P_i(\alpha)$, from which the limiting accuracy can be evaluated.
Recall that any observation $X \in \mathbb R^p$ is represented by $X = \mu_{\pi(X)} + U \Lambda^{1/2} z_{(p)}$, where the elements of $z_{(p)} = (z_{(1)}, \ldots, z_{(p)})^T$ are the first $p$ elements of an infinite sequence; see Assumption 4. For a given $l = 1, \ldots, m$, the uncentered $l$th principal score of $X$, $u_l^T X$, depends only on $z_{(l)}$ for each and every $p$, and the almost sure limit of $p^{-1/2} u_l^T X$ exists. For $X_{ij} \in \mathcal X$, we write the $l$th limiting principal score as $x_{(l),ij}$, which satisfies $P(\lim_{p\to\infty} p^{-1/2} u_l^T X_{ij} = x_{(l),ij}) = 1$. Similarly, let $y = (y_{(1)}, \ldots, y_{(m)})^T$ collect the first $m$ limiting principal scores of an independent observation $\{Y\}$. In Lemma 3.6 below, the limiting probabilities $P_i(\alpha)$ depend on the distribution of a random variable $\xi_\alpha$, defined in (3.9) in terms of $y$ and the limiting principal scores of $\mathcal X$. Note that $\Omega$ is the within-class scatter matrix of the first $m$ limiting principal scores, so $\xi_\alpha$ in (3.9) can be considered as a generalized linear classifier.
Lemma 3.6 shows a sharp distinction between a null model with $\beta \in [0, 1)$ and the $m$-component model with $\beta = 1$. For $\beta < 1$, the within-group variance of $X_{ij}$ becomes negligible when compared to the magnitude of the true mean difference $\|\mu\|$, and thus perfect classification occurs for any reasonable classifier. In particular, when $\beta < 1$, for any $\alpha \ne -\tau^2/n$, $v_\alpha$ converges to $w_{\mathrm{MDP}}$ as $p \to \infty$. The only exception is $\alpha = -\tau^2/n$, at which $v_{-\tau^2/n}$ becomes orthogonal to $w_{\mathrm{MDP}}$ as $p \to \infty$. The result on $\beta < 1$ is consistent with the findings in Hall, Marron and Neeman (2005) and Jung (2018), where only the case $\beta = 0$ was studied. Note that in those previous works an additional condition $\delta^2 > |1/n_1 - 1/n_2|\tau^2$ is required, while here it is not. This is because we use the total mean $\bar X$ rather than the centroid mean $(\bar X_1 + \bar X_2)/2$ used in Hall, Marron and Neeman (2005) and Jung (2018).
The $m$-component model with $\beta = 1$ is more interesting, as the limiting accuracy $P(\alpha) = \pi_1 P_1(\alpha) + \pi_2 P_2(\alpha)$ depends on the distribution of the first $m$ true principal component scores $(z_{(1)}, \ldots, z_{(m)})^T$ through $\xi_\alpha$. Note that the sign of $\xi_\alpha$ depends on an $m$-dimensional classification result based on the unobservable principal component scores of $Y$ and $\mathcal X$. The positive constant $C_i$ is related to the maximal second-data-piling distance (3.6), and represents the magnitude of group separation on the nullspace of $\mathrm{span}(u_1, \ldots, u_m)$.

Theorem 3.7. Suppose Assumptions 1-4 hold, $\beta = 1$, and the regularity condition on $\xi_\alpha$ of Remark 3.1 holds. Then $P(\hat\alpha) = 1$, while $P(\alpha) < 1$ for any fixed $\alpha \ne -\tau^2/n$.
Remark 3.1. The regularity condition in Theorem 3.7 can be written more directly, in terms of $\xi_\alpha$ having its support on $(-\infty, \infty)$ for any $\alpha \ne -\tau^2/n$, and is relaxed by the following condition: for any

Numerical studies
In this section, we first numerically demonstrate the perfect classification of $\phi_{\hat\alpha}$ (3.7) via a simulation experiment (Section 4.1), and then confirm that the optimal ridge parameter $\alpha$ of the classifier $\phi_\alpha$ is indeed negative in some real data situations, including the well-known handwritten image datasets (Section 4.2) and a number of microarray datasets (Section 4.3).

A simulation experiment
We compare the classification performance of the classifier $\phi_{\hat\alpha}$ (3.7) based on $v_{\hat\alpha}$ with that of two other classifiers, defined similarly to (3.7) but using $w_{\hat\alpha}$ and $w_{\mathrm{MDP}}$ in place of $v_{\hat\alpha}$, respectively. The model we use is a one-component model satisfying Assumptions 1-4 with $\beta = 1$. Specifically, $X_{ij} \sim N_p(\mu_i, \Sigma)$, where $\Sigma$ has a compound symmetry structure, $\Sigma = \mathbf 1_p \mathbf 1_p^T + 40 I_p$, with eigenvalues $(p + 40, 40, \ldots, 40)$ and the first eigenvector $u_1 = \mathbf 1_p/\sqrt p$. Here, $\mathbf 1_p$ is the $p$-vector of ones. The first $p/2$ coordinates of the mean difference vector $\mu = \mu_1 - \mu_2$ are $(\sqrt 2 + \sqrt 3)/2$, and the rest are $(\sqrt 2 - \sqrt 3)/2$. We set $n_1 = n_2 = 20$, and $p$ varies from 100 to 10,000. For this model, $\delta^2 = 1$, $k = 1/2$, $\sigma = 1$ and $-\tau^2/n = -1$.

The correct classification rates of the three classifiers, obtained from the sample of size $n = 40$, are empirically computed using 1,000 independent observations. This is averaged over 100 repetitions to estimate the accuracy $P\{\phi(Y; \mathcal X) = \pi(Y)\}$ of each classifier $\phi$. Table 1 displays the result. Using $v_{\hat\alpha}$ not only outperforms the others, but also achieves nearly perfect classification for sufficiently large $p$. The poor performance of $w_{\hat\alpha}$ is due to the inflation of the noisy second term in (2.2).

Under the same model, Fig. 3 displays an estimate of the accuracy of $\phi_\alpha$ for $\alpha \in (-1.5, 0.5)$ and several choices of dimension. At $\alpha = 0$, $\phi_0$ corresponds to the classifier based on $w_{\mathrm{MDP}}$. For moderately large dimensions ($p = 200, 500$), the accuracy increases as $\alpha$ increases, and positive ridge parameters appear to be optimal. On the other hand, for $p$ in the thousands, the correct classification rate is highest near $\alpha = -\tau^2/n = -1$, as Theorem 3.7 postulates. For this model, near-perfect classification will occur at $\alpha = -\tau^2/n$ for larger choices of $p$.
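The experiment is straightforward to emulate; below is a condensed sketch (ours, not the authors' code; the second mean-difference constant reflects our reading of the truncated model description, and $\hat\alpha$ averages the trailing scaled eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2, m = 2000, 20, 20, 1
n = n1 + n2

def sample(mu, size):
    """Draw from N(mu, 1_p 1_p^T + 40 I_p) without forming Sigma."""
    g = rng.normal(size=(1, size))                   # score along u1 = 1_p/sqrt(p)
    return mu[:, None] + g + np.sqrt(40) * rng.normal(size=(len(mu), size))

a, b = (np.sqrt(2) + np.sqrt(3)) / 2, (np.sqrt(2) - np.sqrt(3)) / 2  # b: our reading
mu1 = np.concatenate([np.full(p // 2, a), np.full(p - p // 2, b)])
mu2 = np.zeros(p)

X1, X2 = sample(mu1, n1), sample(mu2, n2)
d = X1.mean(1) - X2.mean(1)
Xc = np.hstack([X1 - X1.mean(1, keepdims=True), X2 - X2.mean(1, keepdims=True)])

# Spectral pieces of S via the n x n dual matrix (cheap when p >> n).
lam, V = np.linalg.eigh(Xc.T @ Xc / n)
lam, V = lam[::-1][:n - 2], V[:, ::-1][:, :n - 2]
U1 = Xc @ V / np.sqrt(n * lam)                       # sample PC directions

alpha_p = -np.mean(lam[m:])                          # alpha_hat * p, ~ -p tau^2/n
u1d = U1.T @ d
null_d = d - U1 @ u1d                                # U2 U2^T d, parallel to w_MDP
v = U1[:, :m] @ (u1d[:m] / (lam[:m] + alpha_p)) + null_d / alpha_p
v /= np.linalg.norm(v)

# Classify 1,000 fresh observations per group by the sign of v^T(Y - Xbar).
Xbar = np.hstack([X1, X2]).mean(1)
sgn = np.sign(v @ (X1.mean(1) - Xbar))               # orient v toward group 1
Y1, Y2 = sample(mu1, 1000), sample(mu2, 1000)
acc = 0.5 * (np.mean(sgn * (v @ (Y1 - Xbar[:, None])) > 0)
             + np.mean(sgn * (v @ (Y2 - Xbar[:, None])) < 0))
print(f"accuracy of phi at alpha_hat, p={p}: {acc:.3f}")
```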

Handwritten character recognition examples
The original MNIST (Modified National Institute of Standards and Technology) dataset (LeCun, Cortes and Burges, 2010) contains 60,000 images of handwritten digits 0 to 9, while the EMNIST (Extended MNIST) database (Cohen et al., 2017) has 124,800 images of handwritten English letters a to z, in $28 \times 28 = 784$ pixels. Since the true classification boundaries for these data are likely to be non-linear, we employed random Fourier features (Rahimi and Recht, 2007), a popular method originally developed for kernel methods in large-scale problems. The idea of random Fourier features is to embed the original data into a high-dimensional space so that a linear inner product in the embedded space approximates a non-linear kernel function. As similarly done in Kobak, Lomond and Sanchez (2020), we obtained random Fourier features of each image corresponding to the Gaussian RBF kernel and used them as variables for linear classification. Let $x_j$ be the $1 \times 784$ vector of pixel intensity values for the $j$th image, scaled to be bounded by $-1$ and $1$. For each choice of $d = 500, 1000, 2000, 3000$, we then calculated $\exp(-i x_j Z)$, where $Z \in \mathbb R^{784 \times d}$ is a random matrix consisting of independent Gaussian random variables with mean zero and standard deviation 0.1. Taking the real and the imaginary parts of $\exp(-i x_j Z)$, we obtained $p = 2d$ features.
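A sketch of the feature construction (ours; the function and parameter names are hypothetical, but the recipe follows the description above):

```python
import numpy as np

def random_fourier_features(X, d, sd=0.1, seed=0):
    """Map images to p = 2d random Fourier features for the Gaussian RBF kernel.

    X is an (N, 784) array of pixel intensities scaled into [-1, 1].
    """
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=sd, size=(X.shape[1], d))  # frequency matrix, 784 x d
    E = np.exp(-1j * X @ Z)                         # exp(-i x_j Z) as in the text
    return np.hstack([E.real, E.imag])              # real and imaginary parts
```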
For classification, we converted the multi-category problems with 10 groups into $\binom{10}{2} = 45$ binary problems for the MNIST data and with 26 groups into $\binom{26}{2} = 325$ binary problems for the EMNIST data. To mimic the HDLSS situation, we randomly chose $n_1 = n_2 = 100$ images from each class as the training data for each binary problem, and computed $v_\alpha$ (2.5) and the corresponding classifier $\phi_\alpha$ (3.7), for each of $m = 1, 2, 3, 4$ and for a fine grid of $\alpha \in (-20, 20)$. Using the hold-out data as the test data set, the test error rates were computed for each choice of $(m, \alpha)$. This experiment was repeated ten times to report the average test error rates. For all choices of dimension $p$ and for all choices of the number of components $m$, the optimal choice of the ridge parameter turns out to be negative for a majority of the cases considered; Table 2 collects the number of cases in which a negative ridge parameter provides the smallest test error rate. For reference, Figure 4 shows the test error rates of $\phi_\alpha$ over the grid of $\alpha$ when the number of leading components is set to $m = 4$. Patterns are similar for $m = 1, 2, 3$; see Figs. D.1-D.3 in Appendix D.1. As the dimension increases, the overall error rates decrease and the fraction of "negative-ridge-optimal" cases (i.e., cases where the optimal choice of $\alpha$ is negative) increases.
We also repeated the above experiment with a larger sample size of 200 for each category, for the MNIST dataset. When the dimension $p$ is large enough, increasing the sample size from $n_i = 100$ to $n_i = 200$ makes the optimal ridge parameter $\hat\alpha_n$ shrink towards zero; see the top panels of Fig. 5. For most cases, $\hat\alpha_{100} < \hat\alpha_{200} < 0$. This makes sense since the asymptotically optimal choice of $\alpha$ is $-\tau^2/n$ (see Theorem 3.7), which also shrinks to zero as $n$ increases. We also confirm that as the sample size increases (albeit still much smaller than the dimension), the misclassification rate becomes smaller; see the bottom panels of Fig. 5.
An additional experiment suggests that the key assumptions (Assumptions 1 and 2) are in fact satisfied for our binary classification of the MNIST dataset. See Appendix D.2.

Microarray data examples
Statistical analysis of microarray gene expression data has been a prominent example of the HDLSS situation. We use eight public microarray datasets to examine whether the conclusion of Section 4.2 holds. The microarray datasets we used are listed in Table 3 along with their dimensionality and sample sizes.

Fig 4. Misclassification rates of $\phi_\alpha$ applied to two-group classifications of images of handwritten digits (MNIST) and letters (EMNIST). Each curve represents a trajectory of average test error rates from a binary classification. The minimum error rate of each curve is marked with a star.
Since the sample sizes are quite small for these datasets, we computed the leave-one-out average error rates of the classifier (3.7) for a few choices of $m$ and for a fine grid of $\alpha$. The results are visually summarized in Fig. 6. Two datasets result in zero test error rates for any choice of $\alpha$ and are thus excluded from the figure and from further discussion. Out of the remaining six cases, if we take $m = 1$, the optimal ridge parameters are negative in all cases; for $m = 2, 3$, or $4$, the number of negative-ridge-optimal cases is 5, 5 or 3, respectively, out of six.

Fig 6. Leave-one-out error rates of $\phi_\alpha$ applied to the microarray datasets listed in Table 3. Each curve represents the trajectory of leave-one-out error rates. The minimum error rate of each curve is marked with a filled circle. Results for 'lung cancer' and 'methylation' are omitted as their error rates are zero everywhere.
The axis for $\alpha$ in Fig. 6 is scaled so that the estimate $\hat\alpha = -\hat\tau^2/n$ (3.3) is $-1$. Observe from the figure that the optimal ridge parameters, when negative, typically lie in $(-\hat\tau^2/n, 0)$. This is partly due to the bias of the estimator $-\hat\tau^2/n$ (the magnitude of $\hat\tau^2/n$ is typically larger than that of $\tau^2/n$). This suggests that in practice a cross-validation over a range of ridge parameters is recommended, rather than simply using the estimate.
In the data examples above, the true number $m$ of leading principal components is unknown. While $m$ may be estimated, by, e.g., Bai and Ng (2002), Passemier and Yao (2014) or Jung, Lee and Ahn (2018), we do not pursue this here.

Discussion
In this work, we have revealed a perhaps counter-intuitive phenomenon of second data piling in the HDLSS context, and showed that a negatively ridged discriminant direction $v_{\hat\alpha}$, with $\hat\alpha < 0$, exhibits the second data piling with a maximal property. This second data piling direction $v_{\hat\alpha}$ is asymptotically orthogonal to the true leading principal component (PC) directions. Thus, by using the direction $v_{\hat\alpha}$ for classification, we effectively remove the excessive variability in the leading PC directions. This observation naturally leads to an approach of directly removing the leading PC directions (i.e., projecting the data onto the nullspace of the leading eigenspace). This approach, via an estimation of the PC directions, has already been proposed by Aoshima and Yata (2019) in a similar setting. Although both our classifier and the classifier of Aoshima and Yata (2019) aim to achieve a similar goal of classifying in the nullspace of the leading eigenspace, there has been no discussion of double data piling in Aoshima and Yata (2019). Our current work not only reveals the double data piling phenomenon but also provides an answer to the question "why is removing the leading eigenspace beneficial?"

Our analysis assumes that the true number of components $m$ is known. What happens if one chooses to use an estimate or guess $m^* \ne m$ in defining the subspace $\mathcal S$ (2.3) and the direction $v_{\hat\alpha}$ (2.5)? In fact, the second data piling does not occur for $m^* < m$. Heuristically, if $m^* < m$, then $v_{\hat\alpha}$ is not asymptotically orthogonal to the principal component directions with large variance, thus resulting in no data piling. On the other hand, the second data piling occurs for $m^* > m$, with a slower rate of convergence. Rigorously confirming these facts is rather involved, and will be addressed in our subsequent work.
It is well known that the first data piling also occurs in high-dimensional multi-category data situations. Suppose there are $K$ categories. Let $B$ be the $p \times (K-1)$ matrix with the $k$th column given by $\bar X_k - \bar X_K$, and let $S_T$ be the total covariance matrix. If $p > n - K - 1$, then the first data piling happens on the column space of $S_T^- B$, where $S_T^-$ is the Moore-Penrose inverse of $S_T$. Is there a situation under which the second (asymptotic) data piling also occurs for such multi-category classification? While we do have an answer to this question (the answer is yes), we choose to use a binary classification framework to introduce the notion of the second data piling in this work.

Ahn and Marron (2010) observed that the first maximal data piling direction for $p > n - 2$, when appended by Fisher's discriminant direction for $p \le n - 2$, exhibits the so-called double descent phenomenon. The term double descent was coined in the regression setting (Hastie et al., 2019). In the classification context, the double descent phenomenon is understood as follows: the misclassification rate initially decreases as $p$ increases from 1, increases as $p$ approaches $n - 2$, then decreases as $p$ diverges. The first descent, a "U"-shaped dip, of the misclassification rate is due to the bias-variance trade-off. The second descent occurs in the HDLSS regime. This phenomenon has been observed for a number of real data situations, but has not been fully understood theoretically. A natural question is whether the double descent phenomenon is observed for the second maximal data piling direction. Simulation studies (not reported here) using $v_{\hat\alpha}$ (well-defined for both $p \le n$ and $p > n$) indicate that the misclassification rate is monotone with respect to $p$, showing just a single descent. This observation parallels the empirical observations of Kobak, Lomond and Sanchez (2020) and Wu and Xu (2020) in the regression setting, and suggests that negatively ridged classifiers may mitigate the double descent phenomenon. The misclassification rate corresponding to $v_{\hat\alpha}$ is smaller than that using $w_{\mathrm{MDP}}$ for any $p > n$, which is expected for large $p$ by Theorem 3.7. A thorough investigation of this matter is left as a future research agenda.

Appendix A: On ρ-mixing condition
The concept of $\rho$-mixing was first proposed in Kolmogorov and Rozanov (1960); see Bradley (2005) for a detailed introduction to mixing conditions. The definition of $\rho$-mixing follows. For a $\sigma$-algebra $\mathcal A$, denote the class of square-integrable and $\mathcal A$-measurable functions by $L_2(\mathcal A)$. The $\rho$-type measure of dependency of two $\sigma$-algebras $\mathcal F$ and $\mathcal G$ is defined as
$$\rho(\mathcal F, \mathcal G) = \sup\{|\mathrm{Corr}(f, g)| : f \in L_2(\mathcal F),\ g \in L_2(\mathcal G)\}.$$
Then, for a sequence $(z_{(1)}, z_{(2)}, \ldots)$, the $\rho$-mixing coefficient is defined as
$$\rho(k) = \sup_{j \ge 1} \rho\big(\sigma(z_{(1)}, \ldots, z_{(j)}),\ \sigma(z_{(j+k)}, z_{(j+k+1)}, \ldots)\big),$$
and the sequence is said to be $\rho$-mixing if $\rho(k) \to 0$ as $k \to \infty$.

Appendix B: Additional details for the linear discriminant direction $w_\alpha$
In this section, we provide more details on the parametrization path of $w_\alpha$.

B.1. Parametrization path of $w_\alpha$ with sign flips
Figure 1 of the main article plots the parametrization path of $w_\alpha$. For its second coordinate, we use
$$b_\alpha = \|(I_p - P_{\hat u_1} - P_{w_{\mathrm{MDP}}})\, w_\alpha\|,$$
the norm of $(I_p - P_{\hat u_1} - P_{w_{\mathrm{MDP}}}) w_\alpha$, which is exactly the second term of (2.2) and lies in an $(n-3)$-dimensional subspace. Suppressing this term to a non-negative real value, however, may hide noticeable features of the parametrization of $w_\alpha$ such as sign flips. Here, we use a different choice of coordinate $c_\alpha$. As depicted in Figure B.1, the coordinate $c_\alpha$ is designed so that its sign is reversed every time $\alpha_p$ passes $-\hat\lambda_i$, which reflects the sign-flipping property of $w_\alpha$.

B.2. Projection of $w_\alpha$ into a lower-dimensional space
Another cost of suppressing the $(n-3)$-dimensional vector, the second term in (2.2), into one coordinate is that different vectors can have the same coordinates. For instance, $b_\alpha \approx 1$ whenever $\alpha_p$ reaches $-\hat\lambda_i$ for $i = 2, \ldots, n-2$. Figure B.2 illustrates the curve with only the three axes $\hat u_1$, $\hat u_2$ and $w_{\mathrm{MDP}}$; exactly two sign flips are illustrated there. A similar shape of the path can be obtained by replacing $\hat u_2$ with $\hat u_i$, $3 \le i \le n-2$.

Appendix C: Proofs

C.1. Preliminary results on principal component scores
Denote the horizontally concatenated $p \times n$ data matrix by $\mathbb X = [X_{11}, \ldots, X_{1n_1}, X_{21}, \ldots, X_{2n_2}]$. The within-group sample covariance matrix $S$ can be expressed as $S = (\mathbb X - \bar{\mathbb X})(\mathbb X - \bar{\mathbb X})^T/n$, where $\bar{\mathbb X} = [\bar X_1, \ldots, \bar X_1, \bar X_2, \ldots, \bar X_2]$. Consider the eigenvalue decomposition of $\Sigma = U\Lambda U^T$, with $\lambda_i$ the $i$th largest eigenvalue ($i = 1, \ldots, m$) and $\lambda_i = \tau_i^2$ ($i = m+1, \ldots, p$). Also, $U = [u_1, \ldots, u_p]$ is the $p \times p$ orthogonal matrix whose columns $u_i$ are the normalized eigenvectors corresponding to the $\lambda_i$. Denote the matrix of principal component scores of $\mathbb X$ by $Z$.

Proof. We begin by decomposing the quantities in (a)-(d) into two terms each. The terms $A_2$, $B_2$, $C_2$ are irrelevant to $\beta$, and can be shown to converge to 0 as $p \to \infty$. Specifically, to handle $A_2$ and $B_2$, for any fixed $l \ge 1$ let $W_i \in \mathbb R^l$ ($i = 1, \ldots, p$) be any random vectors satisfying $E(W_i) = 0$, $\mathrm{Var}(W_i) = I_l$ and $\mathrm{Cov}(W_i, W_\iota) = 0$ for any $1 \le i \ne \iota \le p$. Note that by Assumption 1, there exists $M < \infty$ such that $\tau_i \le M$ for all $i$; Chebyshev's inequality then gives the convergence. Similarly, $C_2 \to 0$ in probability as well, in which we used the fact that $z_i$ and $\zeta_i$ are independent. The term $D_2$ is irrelevant to $\beta$ as well, but does not degenerate. We use Theorem 1 of Jung and Marron (2009): if Assumptions 1-4 are satisfied, then (C.1) holds, since by Assumption 1, $\tau_i \le M$ and $\lim_{p\to\infty} \sum_{i=1}^{p} \tau_i^2/p = \tau^2$. It remains to show that $A_1$, $B_1$, $C_1$, and $D_1$ converge to their respective counterparts in the statement of the lemma. For $\beta < 1$, they all converge to 0 almost surely. For $\beta = 1$, Assumptions 3-4 guarantee that, with probability 1, they converge to the random variables given in the statement. Combining the above gives the desired result.
Let $J_m$ denote the $m \times m$ matrix of ones. The group-wise centered data matrix is $\mathbb X - \bar{\mathbb X} = \mathbb X(I_n - J)$, where $\bar{\mathbb X} = \mathbb X J = [\bar X_1, \ldots, \bar X_1, \bar X_2, \ldots, \bar X_2]$ and $J = \mathrm{diag}(J_{n_1}/n_1, J_{n_2}/n_2)$.
The empirical common principal components are given either by the eigen-decomposition of $S$ or by the singular value decomposition
$$\mathbb X - \bar{\mathbb X} = \hat U_1 D V_1^T = \sum_{i=1}^{n-2} d_i \hat u_i v_i^T,$$
where $\hat u_i$ is the $i$th sample principal component direction and $v_i$ is the vector of (normalized) sample principal component scores. Here, the $(n-2) \times (n-2)$ diagonal matrix $D$ collects the non-zero singular values $\{d_i\}_{i=1}^{n-2}$ in descending order. Note that $\hat\lambda_i = d_i^2/n$ is the $i$th largest eigenvalue of $S$. We make use of the dual matrix $S_D = (\mathbb X - \bar{\mathbb X})^T(\mathbb X - \bar{\mathbb X})/n$, which is a finite-dimensional matrix and shares all $n-2$ nonzero eigenvalues with $S$. By writing $\mathbb X - \bar{\mathbb X} = \mathbb X(I_n - J)$ and by Lemma C.1(d), the limiting distribution of $S_D$ can be obtained as $p \to \infty$.

Proof. As can be seen in (C.4), we have the stated convergence as $p \to \infty$. Here, (ii) follows immediately from (C.5), while for $\beta = 1$, we get $p^{-1} n\,\phi_i(S_D) \overset{P}{\to} \phi_i(S_0)$ for $i = 1, \ldots, n-2$ as $p \to \infty$. To show (i), we claim that for $i = 1, \ldots, m$, $\phi_i(S_0) = \phi_i(\Omega) + \tau^2$. For this, let $\lambda$ be a nonzero eigenvalue of $(I_n - J) W W^T (I_n - J)$ and $v$ its corresponding eigenvector. Then there exists $u$ satisfying $v = (I_n - J)u$, and thus $(I_n - J)(W W^T)(I_n - J)u = \lambda (I_n - J)u$. Since $\Omega = W^T (I_n - J) W$ and $(I_n - J) W W^T (I_n - J)$ share their nonzero eigenvalues, we have $\phi_i(S_0) = \phi_i(\Omega) + \tau^2$ for $i = 1, \ldots, m$. On the other hand, since $\Omega$ is of rank $m$ with probability 1, the rest of the eigenvalues of $S_0$ equal $\tau^2$.
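The dual computation is also the practical way to obtain the nonzero spectrum of $S$ when $p \gg n$; a quick numerical check (ours, with global rather than group-wise centering, so the rank is $n-1$ here):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 500, 12
Xc = rng.normal(size=(p, n))
Xc -= Xc.mean(axis=1, keepdims=True)      # toy (global) centering: rank n - 1
S_eigs = np.linalg.eigvalsh(Xc @ Xc.T / n)[::-1]    # p x p eigenproblem
SD_eigs = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]   # n x n dual eigenproblem
print(np.allclose(S_eigs[:n-1], SD_eigs[:n-1]))     # True: shared nonzero spectrum
```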
The limit of the inner product between the sample eigenvector $\hat u_i$ ($i = 1, \ldots, m$) and $d$ depends on $\beta$; for $\beta = 1$, the limit is given in Lemma C.3.

Proof.
(i) We express $\hat u_i$ in terms of $\hat v_i$ using (C.3). The random variables $n\hat\lambda_\iota/p$ and $z_i^T(I_n - J)\hat v_\iota$ are stochastically bounded since they converge in probability, as shown in Lemma C.2 and (C.6), respectively. For $0 \le \beta < 1$, since $\lambda_i/p \to 0$ for all $1 \le i \le p$, the stated limit follows. We decompose the quantity $d^T \hat U_1/\sqrt p$ into two terms. From (C.3), (C.6), and Lemmas C.1 and C.2, the limits of the two terms are given by (C.7) and (C.8), respectively. Adding (C.7) and (C.8), we get the desired results.

C.2. Proof of Theorem 2.1
Proof. Let $\beta = 1$. First, we make use of $\tilde w_\alpha$ in (2.2). The limiting angle between $w_\alpha$ and $\mathcal S$ is analyzed through the quantity in (C.9), where $P_{\mathcal S}$ is the orthogonal projection operator onto $\mathcal S$. We claim that the quantity in (C.9) converges to 1 in probability. First, the decomposition (C.10) holds by Pythagoras' theorem. Since Lemmas C.2 and C.3 give that the last term in (C.10) converges to 0 in probability for $\alpha \ne -\tau^2/n$, it remains to show that $\|P_{\mathcal S}\tilde w_\alpha\|^2/p$ is stochastically bounded. We decompose $\|P_{\mathcal S}\tilde w_\alpha\|^2/p$ into two terms. The first term converges to a certain quantity by Lemmas C.2 and C.3, and is thus stochastically bounded. To deal with the second term, note that it is controlled by $\|\hat U_2 \hat U_2^T d\|^2$. The limit of the norm of $d$ can be derived using Lemma C.1 as $p \to \infty$, while we handle the limit of $\|\hat U_1 \hat U_1^T d\|$ via (C.11) and (C.12). Combining these gives (C.13) as $p \to \infty$. We denote the limit of (C.13) by $\kappa^2$ for $\kappa > 0$. Since $\|P_{\mathcal S}\tilde w_\alpha\|^2/p$ converges in probability, the conclusion follows. We make use of the random variable $\kappa$ in the proof of Theorem 3.1.

C.3. Proof of Theorem 3.1
Proof. We continue to assume $\beta = 1$. Recall that $u_{\iota,\mathcal S} = P_{\mathcal S} u_\iota$. We begin by focusing on the inner products $u_\iota^T \hat u_i$ and $u_\iota^T w_{\mathrm{MDP}}$.
On $u_\iota^T \hat u_i$: the limit of the inner product $u_\iota^T \hat u_i$ ($\iota, i = 1, \ldots, m$) can be derived immediately from Lemma C.3(i) as $p \to \infty$, where $\Omega$ is defined in Lemma C.2.
On $u_\iota^T w_{\mathrm{MDP}}$: we claim a limit as $p \to \infty$, which is given by Assumptions 1-3. We now deal with $L_1$ and $L_2$ in (3.1), which generally exist on $\mathcal S$, where $\kappa_1 = (1 - \eta_1)(1 - k^2)\delta^2/\kappa_{\mathrm{MDP}}$ and $\kappa_2 = -(1 - \eta_2)(1 - k^2)\delta^2/\kappa_{\mathrm{MDP}}$, with $\eta_1 = n_1/n$ and $\eta_2 = n_2/n$. The distance between the two piles induced by $w_{\mathrm{MDP}}$ is handled via (C.14), which gives the limit of the second term. Also, from Lemma C.1 and (C.7), we obtain (C.17). To evaluate the limit of $p^{-1/2} w_{\mathrm{MDP}}^T (Y - \bar X)$, we decompose it into two terms. Routinely, the limit of $K_1$ can be evaluated through the expressions of $d$ and $Y - \bar X$ in terms of the principal scores and Lemma C.1.
Also, from (C.13) and Lemmas C.2 and C.3(ii), we obtain (C.25) and (C.26). By (C.24) and (C.26), the desired result is obtained. The positive term $\gamma$ depends on the first true principal scores of the training data, which are invariant to $p$. We make use of $\gamma$ in Theorem 3.5.

D.1. Additional figures
Figures D.1 to D.3 display the misclassification rates of binary classification for the MNIST and EMNIST datasets, for the cases where the number of leading components is set to $m = 1, 2, 3$. These figures are referenced in Section 4.2.

D.2. Model assumptions
For the MNIST data example, we check that Assumptions 1 and 2 are indeed satisfied. Checking these asymptotic assumptions is possible in this case for two reasons. First, dealing with varying dimensions is possible, since we can obtain random Fourier features of any dimension $p$; we experimented with $p = 1000, 2000, 4000, 6000$ in Section 4.2. Second, since the data set consists of a sufficiently large number of samples (6,000 observations on average for each category), we may regard the sample estimates as close approximations of the true parameters.
For each of the binary classification problems ($\binom{10}{2} = 45$ cases in total), and for each dimension $p$, we computed the size $\|\mu\|$ of the mean difference $\mu = \mu_1 - \mu_2$ and the eigenvalues $\lambda_i$ of the within-group covariance matrix $\Sigma$. The mean difference $\mu$ and covariance matrix $\Sigma$ are computed from the whole sample, and we treat them as if they were the true parameters. (The experiment in Section 4.2 was conducted based on a random sample of size $n = 100$ or $200$ from the whole sample.) Assumption 2 is satisfied if $\|\mu\| = O(\sqrt p)$ or, equivalently, $\|\mu\|^2 = O(p)$. In Fig. D.4, we confirm that $\|\mu\|^2$ is linear in $p$, so Assumption 2 is seemingly satisfied.

Assumption 1 implies that the first $m$ eigenvalues of $\Sigma$ grow at the rate of $p$. Moreover, the ratio of $\lambda_i$ to the total variance should be constant if the assumption is true. Define the ratio by $r_i = \lambda_i / \sum_{j=1}^{p} \lambda_j$. If $\lambda_i = O(p)$ and $r_i$ is constant over $p$, then the $i$th component may be considered a leading component (or a "spike"). In Fig. D.5, we confirm that $\lambda_i$ is indeed nearly linear in $p$ for $i = 1, 2, 3, 4$; in Fig. D.6, we confirm that the ratio $r_i$ is nearly constant for $i = 1, 2, 3, 4$. To summarize, Figs. D.4-D.6 suggest that our key assumptions are satisfied for the MNIST dataset. Therefore, the numerical results on classification in Section 4.2 should come as no surprise.
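The ratio check can be scripted directly; a sketch (ours; `spike_ratios` is a hypothetical helper, and for simplicity it centers globally rather than within groups):

```python
import numpy as np

def spike_ratios(X, k=4):
    """r_i = lambda_i / sum_j lambda_j for the top-k eigenvalues of Cov(X).

    X is a p x N matrix (rows are variables). Near-constancy of r_i across
    several feature dimensions p suggests the ith component is a spike.
    """
    lam = np.linalg.eigvalsh(np.cov(X))[::-1]
    return lam[:k] / lam.sum()
```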