Geodesic projection of the von Mises...Fisher distribution for projection pursuit of directional data

We investigate geodesic projections of von Mises–Fisher (vMF) distributed directional data. The vMF distribution for random directions on the (p−1)-dimensional unit hypersphere Sp−1 ⊂ Rp plays the role of multivariate normal distribution in directional statistics. For one-dimensional circle S1, the vMF distribution is called von Mises (vM) distribution. Projections onto geodesics are one of main ingredients of modeling and exploring directional data. We show that the projection of vMF distributed random directions onto any geodesic is approximately vM-distributed, albeit not exactly the same. In particular, the distribution of the geodesic-projected score is an infinite scale mixture of vM distributions. Approximations by vM distributions are given along various asymptotic scenarios including large and small concentrations (κ → ∞, κ → 0), high-dimensions (p → ∞), and two important cases of double-asymptotics (p, κ → ∞, κ/p → c or κ/ √ p → λ), to support our claim: geodesic projections of the vMF are approximately vM. As one of potential applications of the result, we contemplate a projection pursuit exploration of high-dimensional directional data. We show that in a high dimensional model almost all geodesic-projections of directional data are nearly vM, thus measures of non-vM-ness are a viable candidate for projection index. MSC2020 subject classifications: Primary 62H11; secondary 62E20, 62R30.


Introduction
The von Mises-Fisher (vMF) distribution, sometimes referred to as the Fishervon Mises-Langevin distribution, plays a central role in modeling and inference for the directional data (Mardia and Jupp, 2000;Ley and Verdebout, 2017). While the multivariate normal distribution is not defined on the sample space S p−1 = {x ∈ R p : x x = 1} of directions, the vMF distribution is the closest notion to the normal distribution. For a random direction x ∈ S p−1 , we write x ∼ vMF p (μ, κ) if its density function is f vMF (x) = c p (κ) exp(κx μ) with respect to the surface area measure of S p−1 , for the location parameter μ ∈ S p−1 and the concentration parameter κ ≥ 0. The normalizing constant is

S. Jung
where I ν is the modified Bessel function of the first kind of order ν. (We use the convention 0/0 = 1.) On S 1 , the vMF distribution becomes the von Mises (vM) distribution.
We consider the one-dimensional projection of directional data onto geodesics, a notion closest to the orthogonal projection onto a vector in Euclidean space. Projections onto geodesics and the evaluation of the projection scores and residuals are fundamental ingredients of modern directional statistics and, in general, statistics on manifolds; applications include dimension reduction (Fletcher et al., 2004;Jung, Dryden and Marron, 2012), regression (Fletcher, 2013;Cornea et al., 2017), classification (Pizer and Marron, 2017) and developments of parametric models (Schulz et al., 2015;Kim et al., 2019). Here, the projection onto a geodesic γ is defined intrinsically. With a distance function ρ : S p−1 × S p−1 → [0, π], the projection of x ∈ S p−1 onto γ is the point on γ closest to x, and the projection score S γ (x) ∈ (−π, π] is given by a parameterization of γ. See Section 2 for definitions of the geodesic, the metric, and the geodesic projection. In Euclidean space, the projection u z of a Gaussian random vector z ∈ R p is also Gaussian for any u ∈ S p−1 , and even almost all projections of a non-Gaussian z are also Gaussian, asymptotically; see Diaconis and Freedman (1984); Bickel, Kur and Nadler (2018). On the other hand, it is yet unknown whether the geodesic projection S γ (x) of a vMF-distributed random direction x ∈ S p−1 is also vM.
In this paper, we show that the answer to the above fundamental question is indeed negative. In particular, we show that for x ∼ vMF p (μ, κ) and for any geodesic γ, the distribution of S γ (x) is an infinite scale mixture of vM distributions with concentration parameters ranging from 0 to κ. Nevertheless, the density of S γ (x) is nearly vM-distributed, in the sense that it is close to the vM family to human eyes. The difference is not statistically significant for moderately large sample sizes.
The distribution of S γ (x) is denoted by PvM(p, κ, δ), where δ = ρ(γ, μ) ≥ 0 measures how much γ deviates from the mode of vMF p (μ, κ). In a special case where the geodesic γ passes through the mode, δ = 0 and we say the geodesic is canonical with respect to μ. For canonical geodesics γ, the density function of S γ (x) is expressed using I ν and M ν , the modified Struve function of the second kind of order ν, to facilitate approximations by vM densities. We refer to Appendix A for definitions and several useful properties of I ν and M ν .
We show that PvM(p, κ, δ) is indeed well approximated by vM distributions in various asymptotic scenarios, including large and small concentrations (κ → ∞ and κ → 0), high dimensions (p → ∞), and a double-asymptotic direction p → ∞, κ → ∞, κ/p = c ∈ (0, ∞). Various asymptotic expansions of I ν and M ν are used in the approximations. For the special asymptotic regime of p → ∞, κ → ∞, κ/ √ p = λ ∈ (0, ∞), we provide a high dimensional approximation of the distribution by the projected normal distribution (Mardia and Jupp, 2000, p. 178). The projected normal distribution is again approximated by the vM distribution. This work is developed in the course of formulating projection pursuit for directional data. Projection pursuit (Friedman and Tukey, 1974) is a well-988 S. Jung geodesic parameterized by (q, v) is for t ∈ R. Throughout, we require v = 1 so that v ∈ S p−1 and v q = 0. With the restriction v = 1, the image of R under γ, γ(R), is equal to the image of I := (−π, π] under γ. That is, γ(I) = γ(R) = {Exp q (tv) : t ∈ R}.
The geodesic distance function ρ : S p−1 × S p−1 → [0, π] between x, y ∈ S p−1 is defined as the length of the shortest path on S p−1 between the two points, and is the arc length formed by the geodesic segment joining x and y: ρ(x, y) = cos −1 (x y). For a w ∈ T x satisfying y = Exp x (w) and w ≤ π, ρ(x, y) = w . If such w is unique, then the shortest path between x and y is given by the geodesic joining x and y, denoted by Γ(x → y).
The geodesic projection and scores for an S 2 -valued sample are illustrated in Fig. 1. For any two points x 1 , x 2 , the geodesics Γ(x 1 → P γ (x 1 )) and Γ(x 2 → P γ (x 2 )) are not in general parallel.
The next lemma is on the uniqueness and computation of P γ (x) and S γ (x).
Lemma 2.1. Given x ∈ S p−1 and a geodesic γ(q, v), if either x q or x v is nonzero, then P γ (x) is uniquely given by P γ (x) = Exp q (S γ (x)v), where S γ (x) = arg min t∈R ρ(Exp q (tv), x) = atan2(x v, x q).
Otherwise, x q = x v = 0 and P γ (x) is everywhere on γ(R).
The function atan2(y, x) is the two-argument inverse tangent which takes the signs of x and y into account to determine in which quadrant (x, y) lies. In other words, S γ (x) is the polar angle of the polar coordinates for (x q, x v) in Cartesian coordinates.

Geodesic projections of vMF random directions
For x ∈ S p−1 , suppose x ∼ vMF p (μ, κ). We wish to find the distribution of the projection score S γ (x) on any given γ. While vMF plays the role of normal distribution in the areas of directional data, a natural question is whether the projected score of a vMF-distributed x is also a vM. We show that this is not the case.
The distribution of S γ (x) can be parameterized by p, κ and the deviation δ of γ from the mode μ, due to the rotational symmetry of the vMF distribution. The deviation is defined as (3.1) With respect to the mode μ of the distribution, a geodesic γ is parameterized by (q, v), where q is the point on γ(R) closest to μ, i.e., δ = ρ(μ, q), and v ∈ γ(R) is any of the two unit vectors orthogonal to q. For δ < π/2, such q is unique. The distribution of S γ (x) is invariant to the choice of v, and to the choice of q if δ = π/2. In any case, v μ = 0. Among the geodesics, we pay special attention to the geodesics passing through the mean μ, i.e., the case δ = 0. Any geodesic passing though μ can be parameterized by μ and v ∈ S p−1 satisfying v μ = 0. These geodesics γ(μ, v) will be referred to as canonical geodesics, hereafter. The distribution of S γ (x) is identical for any canonical geodesic γ, due to the rotational symmetry of the vMF distribution.
The opposite extreme is the case δ = π/2, where the geodesic γ is as far as possible from the center of the distribution.
In the following, we inspect the distribution of S γ (x), x ∼ vMF p (μ, κ). The density function of S γ (x) takes a special form, and does not coincide with any known distributions. We denote T ∼ PvM(p, κ, δ) for T = S γ (x), where δ is the 990 S. Jung deviation of γ from μ, defined in (3.1). For the canonical geodesics with δ = 0, we simply write PvM(p, κ) for PvM(p, κ, 0).

The density of PvM(p, κ, δ)
The density of T = S γ (x), with γ = γ(q, v), can be obtained by evaluating the joint density of (X, Y ) = (x q, x v), then switching to the polar coordinates (R, T ), where R cos T = X and R sin T = Y . In general, there is no closed-form expression for the density of T , but we show that T ∼ PvM(p, κ, δ) is an infinite scale mixture of vM distributions.
For any p ≥ 3, κ > 0 and δ ∈ [0, π/2), f PvM (t) is symmetric about t = 0, strictly decreasing on t > 0, and has its unique mode at t = 0. These properties are inherited from the mean-zero vM distributions. In Fig. 2, an example of the joint density of (R, T ) is displayed. If κ = 0, both vMF p (μ, 0) and PvM(p, 0, δ) are the uniform distributions on S p−1 and S 1 , respectively.
For the special case of δ = 0 (that is, when γ is a canonical geodesic), we have f PvM (t; p, κ) = f PvM (t; p, κ, 0) = 1 0 f R (r)f vM (t; κr)dr, with a simpler expression of the mixing density Even in this special case, the integral involved has no closed form solution. Fortunately, it can be represented by the values of special functions I ν and M ν . Our next result is on the exact density function of S γ (x), when γ is a canonical geodesic.
Theorem 3.2. Let x ∈ S p−1 , p ≥ 3, follow vMF p (μ, κ) for κ ≥ 0. Suppose that the geodesic γ = γ(q, v) passes through μ with q = μ. Then the density function of the projection score 3) and In Fig. 3, the density function f PvM is plotted for a few choices of the dimension-concentration pair (p, κ). There, the closest vM density is overlaid as well. The precise meaning of "closest" is defined in Section 3.2. To human eye, the family of the density f PvM is similar to the vM family.
For the opposite extreme case of δ = π/2 where γ is as far from μ as possible, all vM distributions in the mixture of (3.2) becomes the uniform distribution on (−π, π], which in turn leads that f PvM (t; p, κ, π/2) is uniform. A slightly more general statement is shown in Lemma 3.3.

Fig 3.
Examples of the probability density function f PvM (·; p, κ, 0) of the projection score Sγ (x), overlaid with the histogram of Sγ (x), obtained from n = 10,000 random sample of vMFp(μ, κ). The best approximating vM density, obtained by (3.4), shows that f PvM is close to a vM density, but is not exactly the same with any vM density.

Lemma 3.3. Let x be a random variable on
As κ increases, f PvM becomes more concentrated. On the other hand, the higher the dimension p, the more dispersed f PvM . As the deviation δ increases, the distribution of S γ (x) becomes more dispersed toward the uniform. This is graphically shown in Fig 4, in which the density of S γ (x) is obtained by the numerical integration of f (r, t) = f R (r)f vM (t; κ cos(δ)r) over r ∈ (0, 1).

Comparisons to the Jones-Pewsey distribution family
Although f PvM looks close to a vM density, it is not equal to any vM density. We show this for a more general distribution family, which includes the vM as a special case.
The Jones-Pewsey distribution (Jones and Pewsey, 2005) is a general threeparameter family of symmetric circular distributions. The density function of the Jones-Pewsey distribution, with location parameter μ, concentration parameterκ ≥ 0, and a shape parameter φ ∈ R, is given by where ω(κ, φ) is the normalizing constant. The Jones-Pewsey distribution includes the vM distribution, the Cardioid distribution, the wrapped Cauchy distribution and the circular t-distribution (Shimizu and Iida, 2002) as special cases; see Ley and Verdebout (2017) for a discussion on the Jones-Pewsey distribution.
The following result shows, in particular, that there is no vM distribution whose density function equals f PvM (t; p, κ). Theorem 3.4. For any given p ≥ 3 and κ ∈ (0, ∞), there is no Jones-Pewsey distribution whose density function equals f PvM (t; p, κ).
Note that if κ = 0 or ∞, then f PvM (t) is the uniform distribution or the point mass at 0, respectively.
While PvM(p, κ, δ) is not exactly a vM distribution, they are virtually the same for the majority of possible (p, κ). To support this claim, we test the null hypothesis of S γ (x) ∼ vM(0,κ) using a random sample of size n from PvM(p, κ, δ). We use Kuiper (1960)'s goodness-of-fit test (adapted for the vM as in Section 6.2 of Pewsey, Neuhäuser and Ruxton (2013)), andκ is given by the maximizer of the vM likelihood. We choose (p, κ, δ) = (20, 10, 0) as used in the middle panel of Fig. 3. For the sample size up to n = 1000, the power of the test is less than 6% (numerically evaluated with significance level 5%), meaning that in the usual situations where the sample size is at most hundreds one cannot distinguish PvM(p, κ) with a vM distribution.

Approximations to vM distributions
In this section, we provide various approximations of PvM(p, κ, δ) by the vM distributions, to support our claim that PvM is nearly vM. Asymptotic results in this section will also be useful in our discussion of projection pursuit, in Section 5. Since a PvM(p, κ, δ) has zero mean, the approximating vM(μ,κ) has μ = 0. We identify the concentration parameterκ as a function of (p, κ, δ).
We summarize our finding in Theorem 4.1. There, φ(·) is the density function of the standard normal distribution. We say θ follows the Cardioid distribution (centered at μ = 0) with parameter ρ ∈ (0, 1 2 ) if its density is f Cardioid (t; ρ) = Cardioid distribution for low concentration; see Kent (1978) and Mardia and Jupp (2000).
(i) (High concentration) For any fixed p ≥ 3 and for any s ∈ R, as κ → ∞, is the beta function. For any fixed p ≥ 3 and for any t ∈ (−π, π), as κ → 0, . For any fixed κ > 0 and for any t ∈ (−π, π), as p → ∞, An immediate implication of Theorem 4.1 is that for x ∼ vMF p (μ, κ), and for a canonical geodesic γ, S γ (x) ⇒ Unif(−π, π) as κ → 0 or as p → ∞. Here and throughout, the notation ⇒ stands for the convergence in distribution. This result is consistent with the high-dimensional asymptotic results in the literature. In particular, it is shown in Watson (1988) and Dryden (2005) that for a fixed κ, vMF p (μ, κ) converges to the uniform distribution on S ∞ as p → ∞. The uniform distribution on S p−1 is a special case of the vMF distribution with κ = 0, and Unif(−π, π) is a special case of the PvM distribution with κ = 0. Thus, the approximation by vM(κ l ) in the low concentration setting and the high-dimensional approximation vM(κ d ) are bound to be similar in the high dimensional situations. In fact,κ d ≈κ l for large p; for u = (p − 1)/2, Gautschi's inequality (Qi, 2010) gives

S. Jung
In the opposite case where κ → ∞, the underlying vMF p (μ, κ) converges to the point mass at μ. The rate of convergence is exactly κ −1/2 and PvM(p, κ) is approximately vM or normal. In the high-dimensional case, if the concentration parameter κ increases at the same rate as p, then Case (iv) in Theorem 4.1 leads that for λ = 2κ/p ∈ (0, ∞), as κ → ∞. We remark that if κ increases at the rate κ = √ pλ for λ ∈ (0, ∞), the distribution of S γ (x) neither diverges to the uniform nor degenerates to a point mass. This special case will be discussed in the general setting where δ ≥ 0 in Section 4.2.

The case of general geodesics
Approximation of PvM(p, κ, δ) in the general case where δ ∈ [0, π/2] is discussed in this section. Since the exact density function is not available for δ > 0, the results derived here are weaker than the corresponding results for δ = 0 in Section 4.1, in the sense that the approximation error is not evaluated in most cases. An exception is the low-concentration case κ → 0 (while other parameters are fixed), and an extension of Theorem 4.1 (ii) is given. We also inspect two different asymptotic regimes of high-dimensional cases p → ∞ and p, κ → ∞ at the same rate. It is seen that PvM(p, κ, δ) becomes either as dispersed as possible (when p → ∞ and κ is fixed) or converges to 0 with the rate κ −1/2 (when p → ∞ and κ = Θ(p)). An interesting result is obtained when we consider the case κ = Θ(p 1/2 ). As p → ∞, PvM(p, κ, δ) does not degenerate in the limit, but converges to the projected normal distribution.
In the following, we discuss the four asymptotic scenarios of approximating PvM(p, κ, δ).

Low concentration approximation
The distribution of S γ (x), PvM(p, κ, δ), is a mixture of vM(κ cos(δ)R) over R ∈ (0, 1), as given in Theorem 3.1. In the low-concentration asymptotic scenario of κ → 0, all concentration parameters of the vM converge to zero, but the mixing density f R does not degenerate. Nevertheless, a mixture of vM(0)'s is simply vM(0), which is the uniform distribution. A much stronger statement is given in Theorem 4.2, where we provide an approximation of PvM(p, κ, δ) by vM(κ), and its approximation error.
Note that the above results coincides with Theorem 4.1 (ii) for δ = 0. It is conjectured that, for δ > 0, the approximations in the other asymptotic regimes are simply those in Theorem 4.1 with κ replaced by κ cos δ. While the conjecture seems true for the case p, κ → ∞ discussed in Theorem 4.3 later, a rigorous proof has not been obtained.
In our proof of Theorem 4.2, detailed in the Appendix, we provide a uniform bound of the density function f PvM : . The functions f (−) (t) and f (+) are easier to handle as the integrals involved have closed-form expressions, and both converge to the uniform density as κ → 0.

High-dimensional approximations
We consider three asymptotic regimes: κ is fixed, κ = Θ(p) and κ = Θ( √ p) as High-dimension with κ fixed Using the joint density function of (R, S γ (x)) in Theorem 3.1 and the large-order asymptotic expression of I ν in (A.9), it can be shown that R → 0 in probability, and furthermore, where X ∼ Rayleigh(1) and U ∼ Uniform(−π, π) are independent. A proof of (4.1) is given in Appendix Section B.3. This result is not entirely new and can be derived from Watson (1988). Watson investigated the vMF distribution in high-dimensional spheres, and in particular the orthogonal projections of a vMF-distributed random directions onto a subspace. For the projection onto the subspace spanned by γ(q, v), let P = [q, v] be the orthogonal frame of the subspace. Watson (1988) showed that see also Dryden (2005). Since the polar angle S γ (x) is invariant to positive simultaneous scaling of P x = (q x, v x), Lemma 2.1 and the continuous mapping theorem give High-dimension, high-concentration with κ = Θ(p) Assume that both p and κ increases at the same rate. If the deviation δ of the geodesic γ from the 998 S. Jung mode μ of vMF p (μ, κ) is not its maximum, i.e. δ < π/2, then S γ (x) degenerates as p → ∞ and can be well approximated by either the vM or normal distribution. In the following we show the convergence of a scaled S γ (x) in the limit p, κ → ∞, In the general geodesic case where δ > 0, the distribution of S γ (x) is only available as the mixture of vM distributions vM(κ cos(δ)R). It turns out as p → ∞, the mixing random variable R concentrates at shown in Lemma B.7 in Appendix. Informally, S γ (x) is the mixture of vM(κ cos(δ)r 0 ), with weight approaching 1, and the rest with weight approaching 0, as p → ∞. This entails that S γ (x) ∼ vM(κ cos(δ)r 0 ) in the limit, but because κ increases as well, the vM is again approximated by the normal distribution. This is shown next.
Remark 1. In Theorem 4.3, we used u = (p − 4)/2 as it is the easiest form to handle, but the conclusion remains the same if u = (p − 1)/2 or p/2 is used instead.
For the special case of a canonical geodesic γ with cos δ = 1, the statement of the above theorem can be obtained from Theorem 4.1 (iv). For this special case, Theorem 4.1 (iv) is in fact a stronger statement than Theorem 4.3, as the rate of convergence Θ(p −1 ) is not available in the latter. Theorem 4.3, together with Lemma B.2, implies that if both p and κ are large, PvM(p, κ, δ) is well approximated by vM(κ) and by N (0,κ −1 ).

High-dimension, moderately large
does not degenerate when κ increases at the rate of √ p. The case p → ∞ while κ = √ pλ is investigated by Watson (1988), in which Watson showed that the orthogonal projection of x ∼ vMF p (μ, κ) onto the subspace spanned by P = [p, v] converges to a normal distribution when scaled by √ p. We show in Theorem 4.4 that the polar angle S γ (x) of the projection score p 1/2 P x converges to the distribution of polar angle of the limiting 2-variate normal distribution, called the projected normal distribution.
The projected normal distribution, sometimes referred to as angular Gaussian or offset normal distribution, is obtained by projecting a Gaussian random vector on R q on S q−1 (in our case, q = 2), and does not coincide with the vMF distribution. We write PN 2 (μ, I 2 ) for the distribution of the polar angle of N 2 (μ, I 2 ).
For each p, let γ p be a random geodesic parameterized with respect to μ p , that is, γ p = γ(q, v) and Δ p := cos −1 (q μ p ) = ρ(μ p , γ) and v μ p = 0. Assume that γ p and x p are independent, and for a constant δ The density function of PN 2 ((λ cos δ, 0) , where λ δ = λ cos δ, and Φ and φ are the distribution function and the density function of the standard normal distribution, respectively (Watson, 1983;Pukkila and Rao, 1988;Presnell, Morrison and Littell, 1998). The density f PN is unimodal, symmetric about zero, and exhibits heavier tails than the vM. Albeit the difference, the density f PN (θ; λ δ ) is well approximated by vM distributions when λ δ is low or high. In particular, for high concentration case where λ δ → ∞, we have for low concentration case where λ δ → 0, we have forκ = π/2λ δ , Note that the latter vM approximation of PN is asymptotically equivalent to the high-dimensional approximation by vM( √ π 2(p−1) κ) given in Theorem 4.1 (iii), for δ = 0. The assertions (4.2) and (4.3) are formally stated in Lemma 4.5.

Summary and comparison
In Sections 4.1 and 4.2, several approximations of PvM(p, κ, δ) are identified and justified in various asymptotic scenarios. A summary is given in Table 1.
A natural question to ask is that for a given (p, κ), which approximation works the best? We attempt to answer this question numerically. For each p = 3, . . . , 50, a fine grid of κ ∈ (0, 70], and δ ∈ {0, 30 • , 60 • , 80 • }, we compare f PvM (·; p, κ, δ) with each of the five approximations given in Table 1, and record the closest approximation to PvM(p, κ, δ) in terms of the Hellinger distance.
The result is shown in Fig. 6. There, the high p, large κ approximation (the 4th row in Table 1) is shown to be superior than the others for most cases. When δ increases, the PvM distribution becomes less concentrated, and the lowconcentration (small κ) approximation becomes better. We note that the highdimensional approximations are similar to the low-concentration approximations for most cases of (p, κ). The approximation by the projected normal is seen to be the closest roughly for the cases κ < 2p. However, the difference between the approximating projected normal density and the other approximating vM densities, especially the small κ and high p approximations, are not large (in fact thin). These claims are numerically confirmed in Appendix C, in which we provide a more careful numerical comparison on the quality of approximations.

An application to projection pursuit of directional data
Projection pursuit (Friedman and Tukey, 1974) is a technique for exploration of multivariate analysis, aiming at finding low-dimensional "interesting" projections of high-dimensional data in R p . A direction u ∈ R p is deemed interesting if the distribution of the orthogonal projection score x u is as far from Gaussian as possible (Friedman, 1987;Bickel, Kur and Nadler, 2018). Such a non-normal distribution is an evidence of potential clusters or outliers in the multivariate data.
For directional data on S p−1 , we propose to seek the geodesic γ such that the projection score S γ (x) is as non-vM as possible. As claimed in preceding sections, the projection score of a vMF-distributed random direction is approximately vM-distributed. Since any vMF distribution is rotationally symmetric about its mode, no geodesic is more interesting than any other geodesics. On the other hand, if the data exhibits an interesting mode of variation such as clusters, the projection score onto the geodesic passing through two clusters is bimodal, and strongly non-vM.
We demonstrate that a measure of non-vM-ness is useful in detecting clusters and outliers. In the following two examples, conformity to the vM distributions is measured by the p-value of the Kuiper's test with respect to the vM family, using a sample of projection scores S γi (x).
Example 1. Consider a location mixture of vMF, given by x ∼ 1 2 vMF 3 (μ 1 , κ) + 1 2 vMF 3 (μ 2 , κ), where ρ(μ 1 , μ 2 ) = π/4 and κ = 50. A sample of size n = 100 from this model is plotted in Fig. 7. While we have chosen to use p = 3 for presentational purposes, exploration of directional data for p > 3 by visualization is limited. We compare two geodesics γ 1 and γ 2 , where γ 1 passes through both μ 1 and μ 2 , and γ 2 stems from the sample Fréchet mean of data, the minimizer μ of space, then γ 1 corresponds to the Fisher's discriminant direction. The clustered nature of the population appears along the geodesic γ 1 whose projection score is strongly non-vM (small p-value); see the bottom left panel of Fig. 7. On the other hand, the projection onto γ 2 is vM-distributed, and is deemed uninteresting.
Example 2. We consider a contaminated vMF 3 (μ, 50), where 10 percent of the observations are located 90 degrees away from μ. A sample of size n = 50 is plotted in Fig. 8. Two geodesics are compared as well: γ 2 passes through the two clusters and γ 1 stems from the sample Fréchet mean of the data and is orthogonal to γ 2 . It can be checked that the outliers appear on S γ2 (x) (which is strongly non-vM), while the uninteresting γ 1 has its scores distributed not significantly different from a vM distribution. While the above two examples concern a low-dimensional situation, the departure from the vM distribution is a viable measure of interestingness in highdimensional situations. We show that under the location mixture model, similar to the examples above, almost all geodesics γ are uninteresting, i.e., the projections are approximately vM. Therefore, a geodesic γ 0 with a non-vM projection score is certainly interesting.
In particular, we use the high-dimensional setting of Theorem 4.4, and choose the geodesic γ from the uniform distribution on the set of all geodesics on S p−1 . Note that the uniformly distributed γ is given by choosing a random orthogonal 2-frame (q, v). We also consider the case where the geodesic is confined to pass through the mean of the directional data, a common practice in dimension reduction. The result is easily derived from Theorem 4.4.  , v) is chosen from the uniform distribution on the Stiefel manifold V 2,p (consisting of 2-frames in R p ). Then, as p → ∞.
The result above shows that as p → ∞ for almost all choices of γ, the corresponding projection score is approximately vM-distributed (either uniform or the projected normal). However, as demonstrated in Example 1, there certainly is an interesting direction: For the canonical geodesic γ with respect to both μ 1 and μ 2 , i.e., γ = Γ(μ 1 → μ 2 ), S γ (x) converges to a location mixture of two projected normal distributions.
Corollary 5.1 also hints a use of "non-Gaussianity" as a projection index in high dimensions. Unlike the usual vector-valued data where all uninteresting projections are Gaussian, uninteresting geodesic projections of directional data can be uniformly distributed. The uniform distribution is a special case of vM, but is strongly non-Gaussian. Thus, for directional data, non-normality of scores does not always indicate interesting patterns of data.
Remark 2. Extensions of Corollary 5.1 may be necessary in developing a projection pursuit for high-dimensional directional data. For example, instead of the 2-mixture of equal concentrations assumed in Corollary 5.1, consider a Kmixture of vMF p (μ k , λ k √ p) for K > 2 with possibly different concentration parameters λ k . If the number K of mixtures is fixed, then it can be checked that the assertion (i) of Corollary 5.1 still holds even if λ k 's are distinct. On the other hand, generalizing the assertion (ii) is tricky. With more than two components, the limiting distances from the component centers to the grand mean, d k := lim p→∞ ρ(μ, μ k ), should be treated arbitrary (as opposed to d 1 = d 2 = d in the K = 2 case). With potentially different concentration parameters λ k , we conjecture that S γ (x p ) converges to a scale mixture of PN 2 ((λ k cos d k , 0) , I 2 ). Such a situation in a low-dimensional setting was explored in Example 2. There, the location mixture S γ2 (x) is shown to be more severely non-vM compared to the scale mixture S γ1 (x).

Discussions
We point out two directions of future research topics.

On the projected vMF
We have shown that the projection score of the vMF distribution onto a geodesic is not exactly vM-distributed, but is so approximately. Whether a similar property holds for more general distributions with ellipse-like symmetry such as Kent distribution (Kent, 1982) and the scaled vM distribution (Scealy and Wood, 2019), or for a semiparametric rotationally symmetric models (see e.g. Paindaveine and Verdebout, 2020), is an interesting question. An anonymous reviewer has asked whether a property similar to Theorem 3.1 holds for semiparametric rotationally symmetric models with density at x ∈ S p−1 proportional to f (κx μ). A calculation similar to the proof of the theorem shows that if f (st) = f (s)f (t), then the distribution of T = S γ (x) is a scale mixture of f (κ cos δ cos(t)). Whether the condition on f can be relaxed has not been answered. Furthermore, consider a dimension-k subsphere A k = {P k z : z ∈ S k } ⊂ S p−1 , defined for a p × (k + 1) orthogonal frame P k . The subsphere is a multidimensional generalization of the geodesic, as A 1 is the image over a geodesic. A related question is whether the projected score in S k onto A k is approximately vMF-distributed.

On a formal development of projection pursuit for directional data
We have demonstrated that departures from vM distributions can be used as useful projection pursuit indices. A comprehensive development of projection pursuit for directional data and a thorough investigation of it are deemed beneficial. We point out two major issues that need to be addressed in the development.
• Projection index: A projection index measures the interestingness of the geodesic projection. While various measures of non-vM-ness are viable candidates, computationally simpler notions such as kurtosis may provide satisfactory exploration of data. As an instance, Alashwali and Kent (2016) primarily used the kurtosis as the projection index in their study of projection pursuit. • Structure removal or k-dimensional projection pursuit: Our discussion is limited to the dimension-1 projection onto geodesics, and there certainly is a need to detect more than one interesting geodesics or even a subsphere. A common technique to explore more than one interesting directions is structure removal (Friedman, 1987). For directional data, structure removal translates to removing the identified interesting geodesic γ. Such a task is not straightforward and invites a deeper discussion on the strategy of dimension reduction for directional data (and for data on manifolds in general). Roughly, the strategies are categorized into two. The forward dimension reduction aims to successively increase the dimension of "interesting" submanifolds. On the other hand, the backward dimension reduction seeks to successively remove "uninteresting" directions. While there are evidences that the backward approaches are more suited for manifold-valued data (Jung, Dryden and Marron, 2012;Huckemann and Eltzner, 2018;Pennec et al., 2018), whether that remains the same for projection pursuit shall be answered. See Damon and Marron (2014) for an introduction on this issue.

Appendix A: Special functions
We collect several facts on the modified Bessel and Struve functions, and their various asymptotic expansions. We refer to Sections 10 and 11 of Olver et al. (2010) and DLMF (2020) as a general reference for these special function.

A.1. Standard power series and integral representations
The modified Bessel function of the first kind of order ν can be defined by the standard power series (Olver et al., 2010, 10.25.2): The modified Struve function of the first kind of order ν can be defined by the standard power series (Olver et al., 2010, 11.2.2): The modified Struve function of the second kind of order ν is defined by

S. Jung
and can be expressed as These functions, I ν (z), L ν (z) and M ν (z), are defined for z > 0 and ν > − 1 2 . They appear in our discussion as the solution for special forms of integrals. In particular, we have (Olver et al., 2010, 10.32.2, 11.5.4): Combining the above with (A.3), we obtain a new integral representation, which is used in the proof of Theorem 3.2.
Proof of Lemma A.1. Let z > 0 and ν > − 1 2 be fixed. Write Decomposing the latter integral, we have
Next two results are on the various asymptotic ratios of the special functions and the exponential function, and are primarily used in high-concentration asymptotic arguments, e.g., in Theorem B.1.
Proof of Lemma A.2. (i). Using the first two terms of the Hankel's expansion, . The assertion (i) follows.
Part (iii) is given by the Hankel's expansion of I u (z) combined with (ii).
Proof of Lemma A.3. By (A.6), we have ( z 2 ) ν < 0 for any z ≥ 0, the statement of the lemma follows.

A.3. Large-order asymptotic expressions and related expansions
The large-order asymptotic expressions for the modified Bessel function of the first kind of order ν, I ν , and for the modified Struve function of the first kind of order ν, L ν , are known (Olver et al., 2010, 10.41.1 and 11.6.5): For argument z fixed, as ν → ∞, Note that the extra √ ν in the denominator of the approximation of L ν (z) makes L ν (z) negligible compared to I ν (z). Therefore, for the modified Struve function of the second kind M ν , we have and Also useful is the the Poincaré expansion of the gamma function (Olver et al., 2010, 5.11.3):
Proof of Theorem 3.2. The joint density function of (R, T ), where R = S γ (x), R cos T = q x and R sin T = v x, is given by Theorem 3.1, or more directly by equation (B.4) in the proof of Theorem 3.1. Setting δ = 0, we have Due to Lemma 2.1, the distribution of the projection score t = S γ (x) is obtained by marginalizing (B.5). Integration by parts gives the marginal density of t, In general, there is no closed-from expression of g(t). If |t| ∈ (π/2, π], that is, if κ cos t < 0, then the integral in (B.6) is closely related to the modified Struve function M ν . Using the integral representation of M ν in (A.6), Suppose now |t| < π/2 so that κ cos t > 0. It is shown in Lemma A.1 that the integral (B.6) is a linear combination of M ν (κ cos t) and I ν (κ cos t): The derivation is completed by combining above results with the definitions of c p (κ) and S p−3 .
Proof of Lemma 3.3. Without loss of generality, let μ = e 1 , q = e p−1 and v = e p . Then in the spherical coordinate system used in the proof of Theorem 3.2, q x = p−2 j=1 sin(θ j ) cos(φ) and v x = p−2 j=1 sin(θ j ) sin(φ), which in turn leads that φ = S γ (x). With the given spherical coordinates, the marginal density of φ is uniform for any rotationally symmetric distribution about e 1 on S p−1 , as shown in Lemma 3 of Kim, Schulz and Jung (2020).
Proof of Theorem 3.4. The Jones-Pewsey distribution has the unique mode at μ. Since f PvM has the mode at 0, we do not consider Jones-Pewsey distributions whose mode is not at 0.

B.2. Proof of Theorem 4.1
We first give a sketch of the proof of Theorem 4.1, then supply a detailed proof.
Sketch of the proof of Theorem 4.1. As the exact expression of the density f PvM (·; p, κ) consists of I u (z), M u (z) and the gamma function, inspections of these functions are necessary.
For high concentration case, the argument z of I u (z) and M u (z) are increasing. We use Hankel's large-argument expansion of the modified Bessel function to approximate I u (z) as z → ∞; see (A.7) and (A.8). We use the first two terms of the Hankel's expansion while providing a bound for the remainder. In Lemma A.2, the modified Struve function M u (z) is shown to be asymptotically negligible when compared to the exponential function e z or to I u (z) for large z. Then, |f κ (s; p, κ) − g κ (s; κ)| is evaluated using the expansions; see Theorem B.1. In addition, we give an independent proof for the approximation of the vM by the normal distribution in high concentration in Lemma B.2. The assertion (i) is given by the above two.
For the high-dimension case (iii), the order u of I u (z) and M u (z) increases. The large-order asymptotic expressions in (A.9) and (A.11) and the Poincaré expansion of the gamma function (A.12) are used to show f PvM (t; p, κ) = f Cardioid (t;κ d /2) + Θ(p −1 ) as p → ∞. Applying Lemma B.3 completes the proof of (iii).
Finally, for the high-concentration, high-dimensional case (iv), the uniform expansion of the modified Bessel function of the first kind (A.13) is useful: Uniformly for 0 < z < ∞, as ν → ∞. For example, the uniform expansion can be applied to the expression I u (κ) = I u (uλ), as both u = (p − 1)/2 and κ increase but λ = κ/u is fixed. In our application, the uniform expansion is carefully applied to expand I u− 1 2 (uz) as u → ∞. It turns out I u− 1 2 (uz) is not asymptotically equivalent to I u (uz). These issues are taken care of in Theorem B.6 where we show f κ (s; p, κ) = g κ (s;κ b ) + Θ(p −1 ). The proof for (iv) is completed by applying Lemma B.2 together with a change of variable.
In addition to the big Theta notation, we use the following. For asymptotic equivalence we say a(ν) b(ν) if lim ν→∞ a(ν)/b(ν) = 1. We also use the standard big O notation: The inverses are not in general true.

B.2.1. High concentration approximation
When the concentration parameter κ in vMF p (μ, κ) is large, the vMF density is very similar to the spherical normal distribution with variance 1/κ (Kent, 1978). Similar to this result, the geodesic-projected vMF density f PvM can be approximated by a vM density.
As κ → ∞, both vMF p (μ, κ) and f PvM (p, κ) degenerates to their respective mode. In the following, we compare the density of √ κS γ (x) with that of √ κθ, for θ ∼ vM(κ). Denoting the density functions of √ κS γ (x) and √ κθ by f κ (s) and g κ (s), respectively, we derive the following result on the similarity of f κ (s) to g κ (s).
Theorem B.1. For all p ≥ 3, and for any s ∈ R, where the definition of h(t) is given in Theorem 3.2. The scaled vM density is decomposed into g κ (s) = g 1 (κ) · g 2 (s; κ), where, for u = (p − 1)/2, Using the above factorization of density functions, It is sufficient to show that A(s; κ) = O(κ −1 ) and B(s; κ) = Θ(κ −1 ) as κ → ∞. We record the following for multiple use. Applying Hankel's large κ approximation of I 0 (κ) given in (A.7), we have, as κ → ∞, (i) A bound on A(s; κ). Since f PvM is symmetric and has the unique mode at 0, the maximum of 1 + h(s/ √ κ) is attained at s = 0. That is, we use (B.7) and the asymptotic expressions of the ratios of the modified Bessel and Struve functions, derived in Lemma A.2 (i) and (iii). We get: We also observe that, by Lemma A.2 (i), where ν = (p − 1)/2. Combining (B.8) and (B.9) shows that for any s ∈ R, bound on B(s; κ). Choose κ large enough to satisfy κ > (2s/π) 2 so that |s|/ √ κ < π/2. Then, h(s/ √ κ) has two terms (see the first line of (3.3)), and we write h(s/ Define g I to be the function satisfying g I (x) = g 2 (s; κ) for x = cos(s/ √ κ). Then, 10) The first term of (B.10) is bounded: Here, we used the fact that Mu(z) . (B.12) Note that the limit in (B.12) is easily obtained by replacing M u (z) with its power series representation (A.4). For the second term of (B.10), we use the exact Hankel's expansion of I 1 (κ) in (A.8) and write In the last asymptotic equality, we used x = cos(s/ √ κ) → 1 as κ → ∞.
Combining (B.7), (B.10), (B.11) and (B.13) gives The next lemma is on the approximation of vM distributions by the normal distribution when κ → ∞. Let φ(s) be the density function of the standard normal distribution.
Lemma B.2. For any s ∈ R, as κ → ∞, Proof of Lemma B.2. Using the Hankel's expansion (A.8), and taking the power series representation of cos(t) and exp(z), we have

B.2.2. Low concentration approximation
We first give the statement and an independent proof on the similarity of the vM distribution and Cardioid distribution when κ is small. Lemma B.3. For any t ∈ (−π, π), as κ → 0, Proof of Lemma B.3. By using the standard power series representation for the modified Bessel function (A.1), we write Our main result for the low-concentration case follows:
We now analyze f κ (s) for a given s ∈ R. Since s/ √ κ < π/2 for sufficiently large κ, it is sufficient to inspect f κ (s) = κ −1/2 C(p, κ)(1 + h 1 (s/ √ κ)), where h 1 is the expression in the first line of (3.3) in Theorem 3.2. By (B.19) and To compare g k (s; w) with (B.20), we write an alternative limiting form of g k (s; w). Using Hankel's expansion for the modified Bessel function in (A.8), and the power series representation of cos(t), we have, as both u and w tend to infinity, The proof is completed if the following holds: While the uniform expansion of I u (uλ) is the main tool in verifying (B.21) and (B.22), care is needed for (B.21). In particular, we write λ)} to make sure the order u − 1 2 of the modified Bessel function appears exactly in the argument.
A proof of (B.22), as mentioned in the proof of Theorem B.6. Continuing the proof of Theorem B.6, we treat s ∈ R as fixed. Let r u (s) = cos(s/ √ uλ). Use the uniform expansion of the modified Bessel function as u → ∞ to write where the second asymptotic equality is given by lim u→∞ r u (s) = 1. We shall use the following large u approximations: As u → ∞, Then, as needed.

Appendix C: An extended comparison of approximated vM distributions
As referenced in Section 4.3, the qualities of the vM approximations in Section 4 are further inspected in this appendix. For each triple (p, κ, δ), there are five approximations summarized in Table 1. For notational simplicity, we denote f (i) for the approximated density under the ith scenario of Table 1. The same order is used in the legends of Figs. 6 and C.1. For example, f (1) refers to the "large κ" approximation, shown in the first row of Table 1.
The quality of approximation f (i) , measured by the Hellinger distance to f := f PvM (p, κ, δ), is inspected separately over two regions of interest. In Fig. C.1, for each choice of δ ∈ {0 • , 30 • , 60 • , 80 • }, regions "A" and "B" of (κ, p) represent regions in which the f (5) (and f (4) , respectively) approximations appear to work best. These regions are numerically obtained, but roughly, region A corresponds to the cases κ < 2p, and region B to κ > 2p. For region A, the projected normal approximation f (5) is chosen as a reference, and we evaluate the difference of the quality by D(i reference) := H(f, f (reference) ) − H(f, f (i) ), over a fine grid of (κ, p) ∈ A and for all four choices of δ. If D(i reference) > 0, then the approximation of f (i) is better than that of f (reference) (and vice versa). The higher D(i reference), the better approximation of f (i) over f (reference) . In  . The right column of the figure is similarly obtained for region B, with f (4) (high p, high κ) as the reference. We observe the following: • For region A (κ < 2p), approximations given by f (2) , f (3) and f (5) are of similar quality. (Differences in Hellinger distance are at most 0.059.) • For region B (κ > 2p), the approximation by f (4) (high p, high κ) shows the best quality. The projected normal approximation by f (5) is among the worst.