Conjugate Priors and Posterior Inference for the Matrix Langevin Distribution on the Stiefel Manifold

Directional data emerge in a wide array of applications, ranging from atmospheric sciences to medical imaging. Modeling such data, however, poses unique challenges by virtue of their being constrained to non-Euclidean spaces like manifolds. Here, we present a unified Bayesian framework for inference on the Stiefel manifold using the matrix Langevin distribution. Specifically, we propose a novel family of conjugate priors and establish a number of theoretical properties relevant to statistical inference. Conjugacy enables translation of these properties to the corresponding posteriors, which we exploit to develop a posterior inference scheme. For the implementation of the posterior computation, including the posterior sampling, we adopt a novel computational procedure for evaluating the hypergeometric function of a matrix argument that appears as the normalizing constant in the relevant densities.


Introduction
Analysis of directional data is a major area of investigation in statistics. Directional data range from unit vectors in the simplest case to sets of ordered orthonormal frames in the general scenario. Since the associated sample space is non-Euclidean, standard statistical methods developed for the Euclidean space may not be appropriate to analyze such data. Additionally, it is often desirable to design statistical methods that take into consideration the underlying geometric structure of the sample space. There is a need for methodological development for a general sample space such as the Stiefel manifold (James, 1976; Chikuse, 2012) that goes beyond those techniques designed for simpler non-Euclidean spaces like the circle or the sphere. Such a novel methodology can support various emerging applications, increasingly seen in the fields of biology (Downs, 1972; Mardia and Khatri, 1977), computer science (Turaga et al., 2008; Lui and Beveridge, 2008) and astronomy (Mardia and Jupp, 2009; Lin et al., 2017), to mention but a few.
One of the most widely used probability distributions on the Stiefel manifold is the matrix Langevin distribution introduced by Downs (1972), also known as the von Mises-Fisher matrix distribution (Mardia and Jupp, 2009; Khatri and Mardia, 1977). In early work, Mardia and Khatri (1977) and Jupp and Mardia (1980) investigated properties of the matrix Langevin distribution and developed inference procedures in the frequentist setup (Chikuse, 2012). The form of the maximum likelihood estimators and the profile likelihood estimators for the related parameters can be found in Jupp and Mardia (1979); Mardia and Khatri (1977); Chikuse (1991a,b, 1998). It is not patently clear from these works whether the form of the associated asymptotic variance can be obtained directly without using bootstrap procedures. A major obstacle facing the development of efficient inference techniques for this family of distributions has been the intractability of the corresponding normalizing constant, a hypergeometric function of a matrix argument (Mardia and Jupp, 2009; Muirhead, 2009; Gross and Richards, 1989). Inference procedures have been developed exploiting approximations that are available when the argument to this function is either small or large.
Almost all the hypothesis testing procedures (Jupp and Mardia, 1979; Mardia and Khatri, 1977; Chikuse, 1991a,b, 1998) therefore depend not only on large sample asymptotic distributions but also on the specific cases when the concentration parameter is either large or small (Chikuse, 2012; Mardia and Khatri, 1977; Downs, 1972). In particular, a general one sample or two sample hypothesis testing method for the finite sample case is yet to be developed.
For any given dataset, the stipulation of large sample is comparatively easier to verify than checking whether the magnitude of the concentration is large. It may not be possible to ascertain whether the concentration is large before the parameter estimation procedure, which is then confounded by the fact that the existing parameter estimation procedures themselves require the assumption of large concentration to work correctly. Hence, from a practitioner's point of view, it is often difficult to identify whether the above-mentioned procedures are suitable for use on a particular dataset.
Although a couple of Bayesian procedures have been proposed in related fields (see references in Lin et al. (2017)), a comprehensive Bayesian analysis is yet to be developed for the matrix Langevin distribution. In a recent paper, Lin et al. (2017) have developed a Bayesian mixture model of matrix Langevin distributions for clustering on the Stiefel manifold, where they have used a prior structure that does not have conjugacy. To accomplish posterior inference, Lin et al. (2017) have used a nontrivial data augmentation strategy based on a rejection sampling technique laid out in Rao et al. (2016). It is worthwhile to note that the specific type of data augmentation has been introduced to tackle the intractability of the hypergeometric function of a matrix argument. It is well known that data augmentation procedures often suffer from slow rate of convergence (van Dyk and Meng, 2001;Hobert et al., 2011), particularly when combined with an inefficient rejection sampler. Elsewhere, Hornik and Grün (2014) have proposed a class of conjugate priors but have not presented an inference procedure for the resulting posterior distributions.
In this article, we develop a comprehensive Bayesian framework for the matrix Langevin distribution, starting with the construction of a flexible class of conjugate priors, and proceeding all the way to the design of a practicable posterior computation procedure. The difficulties arising from the intractability of the normalizing constant do not, of course, disappear with the mere adoption of a Bayesian approach. We employ nontrivial strategies to derive a unique posterior inference scheme in order to handle the intractability of the normalizing constant. A key step in the proposed posterior computation is the evaluation of the hypergeometric function of a matrix argument, which can be computed using the algorithm developed in Koev and Edelman (2006). Although general, this algorithm has certain limitations vis-à-vis measuring the precision of its output. We therefore construct a reliable and computationally efficient procedure to compute a specific case of the hypergeometric function of a matrix argument that has theoretical precision guarantees (Section 6.2). The procedure is applicable to a broad class of datasets including most, if not all, of the applications found in Downs et al. (1971); Downs (1972); Jupp and Mardia (1979, 1980); Mardia and Khatri (1977); Mardia et al. (2007); Mardia and Jupp (2009); Chikuse (1991a,b, 1998, 2003); Sei et al. (2013); Lin et al. (2017). The theoretical framework proposed in this article is applicable to all matrix arguments regardless of dimensionality. In the following two paragraphs, we summarize our contributions.
We begin by adopting a suitable representation of the hypergeometric function of a matrix argument that views it as a function of a vector argument. We explore several of its properties that are useful for subsequent theoretical development, and also adopt an alternative parametrization of the matrix Langevin distribution so that the modified representation of the hypergeometric function can be used. When viewed as an exponential family of distributions, the new parameters of the matrix Langevin distribution are not the natural parameters (Casella and Berger, 2002). Thus the construction of the conjugate prior does not directly follow from Diaconis and Ylvisaker (1979) (DY), an issue that we elaborate on in Section 3.1. We then propose two novel and reasonably large classes of conjugate priors and, based on theoretical properties of the matrix Langevin distribution and the hypergeometric function, establish their propriety. We study useful properties of the constructed class of distributions to demonstrate that the associated hyperparameters have natural interpretations. Specifically, the constructed class of distributions is characterized by two hyperparameters: one controls the location of the distribution while the other determines the scale. This interpretation not only helps us understand the nature of the class of distributions but also aids in the selection of hyperparameter settings. The constructed class of prior distributions is flexible in that one can incorporate prior knowledge via appropriate hyperparameter selection, while in the absence of prior knowledge there is a provision to specify the hyperparameters so as to construct a uniform prior. Since this uniform prior is improper, we extend our investigation to identify the conditions under which the resulting posterior is a proper probability distribution.
Following this, we discuss properties of the posterior and posterior inference. We show unimodality of the resulting posterior distributions and derive a computationally efficient expression for the posterior mode. We also demonstrate that the posterior mode is a consistent estimator of the related parameters. We develop a Gibbs sampling algorithm to sample from the resulting posterior distribution. One of the conditionals in the Gibbs sampling algorithm belongs to a novel class of distributions introduced in this article. We establish properties of this class, such as unimodality and log-concavity, and use them to derive a rejection sampler for it. We perform multiple simulations to showcase the generic nature of our framework and to report estimation efficiency for the different algorithms. We end with an application demonstrating the strength of our approach.
We should note that a significant portion of the article is devoted to establishing a number of novel properties of the hypergeometric function of matrix arguments. These properties play a key role in the rigorous development of the statistical procedures. These properties, including the exponential type upper and lower bounds for the function, may also be relevant to a broader range of scientific disciplines.
The remainder of the article is organized as follows. In Section 2, we introduce the matrix Langevin distribution defined on the Stiefel manifold and explore some of its important properties. Section 3 begins with a discussion of the inapplicability of DY's theorem, following which we present the construction of the conjugate prior for the parameters of the matrix Langevin distribution. In particular, we establish propriety of a class of posterior and prior distributions by proving the finiteness of the integral of specific density kernels. In Sections 4 and 5, we lay out the hyperparameter selection procedure and derive properties of the posterior. In Section 6 we develop the posterior inference scheme. In Sections 7 and 8, we validate the robustness of our framework with experiments using simulated datasets and demonstrate the applicability of the framework using a real dataset, respectively. Finally, in Section 9, we discuss other developments and a few possible directions for future research. Proofs of all theorems and properties of the hypergeometric function of matrix arguments are deferred to the supplementary material (Pal et al., 2019).
We use d and D interchangeably: D denotes the diagonal matrix with diagonal entries given by the vector d. We use the matrix notation D in place of d wherever needed, and the vector d otherwise.

The Matrix Langevin Distribution on the Stiefel Manifold
The Stiefel manifold, V_{n,p}, is the space of all p ordered orthonormal vectors (also known as p-frames) in R^n (Mardia and Jupp, 2009; Absil et al., 2009; Chikuse, 2012; Edelman et al., 1998; Downs, 1972) and is defined as

V_{n,p} := {X ∈ R^{n×p} : X^T X = I_p},

where R^{n×p} is the space of all n × p (p ≤ n) real-valued matrices and I_p is the p × p identity matrix. V_{n,p} is a compact Riemannian manifold of dimension np − p(p + 1)/2 (Chikuse, 2012). A topology on V_{n,p} can be induced from the topology on R^{n×p}, as V_{n,p} is a sub-manifold of R^{n×p} (Absil et al., 2009; Edelman et al., 1998). For p = n, V_{n,p} becomes identical to O(n), the orthogonal group consisting of all orthogonal n × n real-valued matrices, with the group operation being matrix multiplication. Being a compact unimodular group, O(n) has a unique Haar measure that corresponds to a uniform probability measure on O(n) (Chikuse, 2012). Also, through obvious mappings, the Haar measure on O(n) induces a normalized Haar measure on the compact manifold V_{n,p}. The normalized Haar measures on O(n) and V_{n,p} are invariant under orthogonal transformations (Chikuse, 2012). Detailed constructions of the Haar measure on V_{n,p} and its properties are described in Muirhead (2009) and Chikuse (2012). Notationally, we use μ and μ_2 to denote the normalized Haar measures on V_{n,p} and V_{p,p}, respectively.
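For concreteness, a Haar-uniform draw from V_{n,p} can be generated by orthonormalizing a Gaussian matrix via a QR decomposition with a sign correction; this is a standard construction and only a sketch (the function name and conventions are ours, not from the paper):

```python
import numpy as np

def sample_stiefel_uniform(n, p, rng=None):
    """Draw X distributed Haar-uniformly on V_{n,p}.

    Orthonormalize a Gaussian matrix by reduced QR; fixing the signs so
    that diag(R) > 0 makes the factor Q uniquely defined, which yields
    the Haar (uniform) distribution on the Stiefel manifold.
    """
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, p))
    Q, R = np.linalg.qr(Z)          # reduced QR: Q is n x p
    Q = Q * np.sign(np.diag(R))     # column-wise sign fix
    return Q

X = sample_stiefel_uniform(7, 3, rng=42)
# X has orthonormal columns: X^T X = I_p up to floating-point error.
```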
The matrix Langevin distribution (ML-distribution) is a widely used probability distribution on V_{n,p} (Mardia and Jupp, 2009; Chikuse, 2012; Lin et al., 2017). This distribution is also known as the von Mises-Fisher matrix distribution. As defined in Chikuse (2012), the probability density function of the matrix Langevin distribution (with respect to the normalized Haar measure μ on V_{n,p}), parametrized by F ∈ R^{n×p}, is

f_ML(X; F) = etr(F^T X) / 0F1(n/2, F^T F/4),

where etr(·) = exp(trace(·)) and the normalizing constant, 0F1(n/2, F^T F/4), is the hypergeometric function of order n/2 with the matrix argument F^T F/4 (Herz, 1955; James, 1964; Muirhead, 1975; Gupta and Richards, 1985; Gross and Richards, 1987, 1989; Butler and Wood, 2003; Koev and Edelman, 2006; Chikuse, 2012). In this article, we consider a different parametrization of the parameter matrix F in terms of its singular value decomposition (SVD). In particular, we subscribe to the specific form of unique SVD defined in Chikuse (2012) ((1.5.8) in Chikuse (2012)),

F = M D V^T,

where M ∈ V_{n,p}, V ∈ V_{p,p}, and D is the diagonal matrix with diagonal entries d = (d_1, . . . , d_p) ∈ S_p, the set of vectors with ordered positive entries d_1 ≥ d_2 ≥ · · · ≥ d_p > 0. Henceforth, we shall use the phrase "unique SVD" to refer to this specific form of SVD. Khatri and Mardia (1977) (page 96) show that the function 0F1(n/2, F^T F/4) depends only on the eigenvalues of the matrix F^T F, i.e.,

0F1(n/2, F^T F/4) = 0F1(n/2, D^2/4).

As a result, we reparametrize the ML density as

f_ML(X; M, d, V) = etr(V D M^T X) / 0F1(n/2, D^2/4).

This parametrization ensures identifiability of all the parameters M, d and V. With regard to interpretation, the mode of the distribution is MV^T and d represents the concentration parameter (Chikuse, 2003). For notational convenience we omit the indicator function and write the ML density as above, where it is understood that M ∈ V_{n,p}, d ∈ S_p, V ∈ V_{p,p}. The parametrization with M, d and V enables us to represent the intractable hypergeometric function of a matrix argument as a function of the vector d, the diagonal entries of D, paving a path for an efficient posterior inference procedure.
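The reparametrized density kernel and the SVD parametrization can be sketched numerically as follows; this is an illustration only (the normalizing constant is omitted as intractable, and the sign conventions that make Chikuse's SVD unique are not enforced):

```python
import numpy as np

def ml_log_kernel(X, M, d, V):
    """Log of the unnormalized ML density in the (M, d, V) parametrization:
    trace(V D M^T X).  The normalizing constant 0F1(n/2, D^2/4) is
    intractable in general and omitted here."""
    return np.trace(V @ np.diag(d) @ M.T @ X)

def svd_mdv(F):
    """SVD F = M D V^T with singular values in decreasing order.
    A sketch of the parametrization; uniqueness conventions from
    Chikuse (2012), (1.5.8), are not fully enforced."""
    M, d, Vt = np.linalg.svd(F, full_matrices=False)
    return M, d, Vt.T
```

Since trace(F^T X) = trace(V D M^T X) when F = M D V^T, the kernel agrees with the original parametrization by F.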
We note in passing that an alternative parametrization through the polar decomposition F = MK (Mardia and Jupp, 2009) may pose computational challenges, since the elliptical part K lies on a positive semi-definite cone and inference on the positive semi-definite cone is not straightforward (Hill and Waters, 1987; Bhatia, 2009; Schwartzman, 2006).

Conjugate Prior for the ML-Distribution
In the context of the exponential family of distributions, Diaconis and Ylvisaker (1979) (DY) provide a standard procedure to obtain a class of conjugate priors when the distribution is represented through its natural parametrization (Casella and Berger, 2002). Unfortunately, for the ML distribution, the DY theorem cannot be applied directly, as demonstrated next. We therefore develop, in Section 3.2, two novel classes of priors and present a detailed investigation of their properties.

Inapplicability of DY Theorem for Construction of Priors for the ML-Distribution
In order to present the arguments in this section, we introduce the notations P_θ, x_A, μ, and μ_A, drawn directly from Diaconis and Ylvisaker (1979). In brief, P_θ denotes a probability measure that is absolutely continuous with respect to an appropriate σ-finite measure μ on a convex subset of the Euclidean space R^d. In the case of the ML distribution, μ is the Haar measure defined on the Stiefel manifold. The symbol X denotes the interior of the convex hull of the support of the measure μ. As shown in Hornik and Grün (2013), X := {X ∈ R^{n×p} : ||X||_2 < 1} for the case of the ML distribution, where ||·||_2 denotes the spectral norm. The assumptions of DY require that ∫_X dP_θ(X) = 1 (see the paragraph after (2.1), page 271 in Diaconis and Ylvisaker (1979)). In the current context, P_θ is the probability measure associated with the ML distribution, which is supported on the Stiefel manifold; since every X ∈ V_{n,p} satisfies ||X||_2 = 1, the Stiefel manifold is disjoint from X. Therefore ∫_X dP_θ(X) = 0, which violates the required assumption mentioned above. Secondly, in the proof of Theorem 1 in Diaconis and Ylvisaker (1979), DY construct a probability measure restricted to a measurable set A as follows.
Considering the notation x_A = ∫ Z μ_A(dZ) for any measurable set A with μ(A) > 0, where μ_A denotes the measure μ restricted to A and renormalized, the proof of Theorem 1 in Diaconis and Ylvisaker (1979) relies on the existence of a sequence of measurable sets {A_j}_{j≥1} and corresponding points {x_{A_j}}_{j≥1} that are required to be dense in supp(μ), the support of the measure μ (see the line after (2.4) on page 272 in Diaconis and Ylvisaker (1979)). It can be shown that a similar construction in the case of the ML distribution would lead to a point x_A that does not belong to supp(μ), the Stiefel manifold. Therefore, the mentioned set of points {x_{A_j}}_{j≥1} that are dense in supp(μ) does not exist for the case of the ML distribution.
Together, the two observations make it evident that Theorem 1 in Diaconis and Ylvisaker (1979) is not applicable for constructing conjugate priors for the ML distribution. We would like to point out that the construction of the class of priors in Hornik and Grün (2013) is based on a direct application of DY, which is not entirely applicable to the ML-distribution. On the other hand, the idea of constructing a conjugate prior on the natural parameter F followed by a transformation involves the calculation of a complicated Jacobian term (Hornik and Grün, 2013). Hence the class of priors obtained via this transformation lacks interpretability of the corresponding hyperparameters.

Two Novel Classes of Conjugate Priors
Let μ denote the normalized Haar measure on V n,p , μ 2 denote the normalized Haar measure on V p,p , and μ 1 denote the Lebesgue measure on R p + . For the parameters of the ML-distribution, we define the prior density with respect to the product measure μ × μ 1 × μ 2 on the space V n,p × R p + × V p,p .
Definition 1. The probability density function of the joint conjugate prior on the parameters M, d and V for the ML distribution is proportional to

g(M, d, V; ν, Ψ) = etr(ν Ψ^T M D V^T) / [0F1(n/2, D^2/4)]^ν,    (3)

where ν > 0 and Ψ ∈ R^{n×p}. Henceforth, we refer to the joint distribution corresponding to the probability density function in Definition 1 as the joint conjugate prior distribution (JCPD). We use the terminology joint conjugate prior class (JCPC) when we use a distribution of this form as a prior for the parameters of the ML-distribution. Although the JCPC has some desirable properties (see Theorem 5 and Section 5.2), it may not be adequately flexible to incorporate prior knowledge about the parameters if the strength of prior belief is not uniform across the different parameters. For example, if a practitioner has a strong prior belief about the value of M but is not very certain about the parameters d and V, then the JCPC may not be the optimal choice. Also, the class of joint priors defined in Definition 1 corresponds to a dependent prior structure for the parameters M, d and V. However, it is customary to use an independent prior structure for the parameters of curved exponential families (Casella and Berger, 2002; Gelman et al., 2014; Khare et al., 2017). Consequently, we also develop a class of conditional conjugate priors in which we assume independent priors on the parameters M, d and V. This class of priors is flexible enough to incorporate prior knowledge about the parameters even when the strength of prior belief differs across different parameters.
It is easy to see that the conditional conjugate priors for both M and V are ML-distributions, whereas the following definition is used to construct the conditional conjugate prior for d.
Definition 2. The probability density function of the conditional conjugate prior for d with respect to the Lebesgue measure on R^p_+ is proportional to

g(d; ν, η) = exp(ν Σ_{j=1}^p η_j d_j) / [0F1(n/2, D^2/4)]^ν,    (5)

where ν > 0 and η = (η_1, . . . , η_p) ∈ R^p. Note that g(d; ν, η) is a function of n as well. However, we do not vary n anywhere in our construction, and thus we omit reference to n in the notation for g(d; ν, η).
Henceforth we use the terminology conditional conjugate prior distribution for d (CCPD) to refer to the probability distribution corresponding to the probability density function in Definition 2. We use the phrase conditional conjugate prior class (CCPC) to refer to the prior structure in which M and V follow ML distributions, d follows a CCPD, and M, d, V are assumed to be independent a priori. As per Definitions 1 and 2, the integrability of the kernels mentioned in (3) and (5) is critical to establishing the propriety of the proposed classes of priors. In light of this, Theorem 1 and Theorem 2 provide conditions on ν, Ψ and η for g(M, d, V; ν, Ψ) and g(d; ν, η), respectively, to be integrable.
The conditions mentioned in this theorem do not span all cases; we have not addressed the case where ||Ψ||_2 = 1. As far as statistical inference for practical applications is concerned, we may not have to deal with the case ||Ψ||_2 = 1, as the hyper-parameter selection procedure (see Section 4) and posterior inference (even in the case of the uniform improper prior, see Section 5.3) involve only cases with ||Ψ||_2 < 1. We therefore postpone further investigation of this case as a future research topic of theoretical interest.
Theorem 2. Let d ∈ R^p_+, η = (η_1, . . . , η_p) ∈ R^p, and let n be any integer with n ≥ p. Then, for any ν > 0,

∫_{R^p_+} g(d; ν, η) dμ_1(d) < ∞ whenever max_{1≤j≤p} η_j < 1.

We can alternatively parametrize the CCPD class of densities by the following specification of the probability density function,

g̃(d; ν, η) ∝ exp(Σ_{j=1}^p η_j d_j) / [0F1(n/2, D^2/4)]^ν,

where max_{1≤j≤p} η_j < ν. In this parametrization, if we consider the parameter choices ν = 0 and β := −η, then the resulting probability distribution corresponds to the Exponential distribution with rate parameter β.
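As a consistency check on this reduction, setting ν = 0 removes the 0F1 factor entirely and the kernel factorizes (our rendering of the claim above):

```latex
\tilde g(d;\,0,\eta)\;\propto\;\exp\Big\{\sum_{j=1}^{p}\eta_j d_j\Big\}
\;=\;\prod_{j=1}^{p} e^{-\beta_j d_j},\qquad \beta := -\eta,
```

so d_1, . . . , d_p are independent Exponential(β_j) variables. Integrability on R^p_+ requires β_j > 0 for all j, which is exactly the stated condition max_{1≤j≤p} η_j < ν evaluated at ν = 0.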

Properties of the CCPD and JCPD Class of Distributions
It is important to explore the properties of the CCPD and JCPD classes of distributions in order to use them effectively. Intuitive interpretations of the parameters ν, η, and Ψ are desirable, for example, for hyper-parameter selection. Due to conjugacy, Bayesian analysis leads to posterior distributions involving the JCPD and CCPD, and therefore it is necessary to identify features that are required to develop practicable computation schemes for posterior inference. The following four theorems establish some crucial properties of the CCPD and JCPD classes of distributions.
Notably, the mode of the distribution is characterized by the parameter η and does not depend on the parameter ν. The proof of the theorem relies on a few nontrivial properties of 0F1(n/2, D^2/4), the hypergeometric function of a matrix argument, that we have established in Section 1 of the supplementary material. It is easy to see that the function h^{-1} is well defined, as the function h is strictly increasing in each of its coordinates. Even though subsequent theoretical developments are based on the formal definition and theoretical properties of the functions h and h^{-1}, numerical computation of these functions is tricky. Their evaluation depends on reliable computation of 0F1(n/2, D^2/4) and all its partial derivatives. In Section 6.2, we provide a reliable and theoretically sound computation scheme for these functions.
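For intuition about what must be computed, the univariate special case of the hypergeometric function can be evaluated by truncating its defining series. This sketch covers only the p = 1 case under tolerances of our own choosing; the matrix-argument case requires the algorithm of Koev and Edelman (2006) or the scheme of Section 6.2:

```python
import math

def hyp0f1(c, x, tol=1e-14, max_terms=1000):
    """Truncated series 0F1(c; x) = sum_k x^k / ((c)_k k!) for x >= 0.

    Each term is obtained from the previous one via the ratio
    x / ((c + k - 1) * k), so no factorials or Pochhammer symbols are
    formed explicitly; the loop stops once terms are negligible.
    """
    term, total = 1.0, 1.0
    for k in range(1, max_terms):
        term *= x / ((c + k - 1) * k)
        total += term
        if term < tol * total:
            break
    return total
```

A convenient sanity check is the closed form 0F1(1/2; x) = cosh(2*sqrt(x)), which the series reproduces to near machine precision.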
On a related note, it is well known that log-concave densities correspond to unimodal distributions when the sample space is the entire Euclidean space (Ibragimov, 1956; Dharmadhikari and Joag-Dev, 1988; Doss and Wellner, 2016). However, the mode of the distribution may not necessarily be a single point. Part (b) of Theorem 3 asserts that the CCPD has a single-point mode. Moreover, the sample space of the CCPD is R^p_+, which encompasses only the positive orthant and not the whole of the p-dimensional Euclidean space. Hence general theories developed for R^p (or R) do not apply. In fact, when η_j ≤ 0, the density defined in Definition 2 is decreasing as a function of d_j on the set R_+, and the mode does not exist as R_+ does not contain the point 0. In all, part (b) of Theorem 3 does not immediately follow from part (a) and requires additional effort to demonstrate.
In order to introduce the notion of "concentration" for the CCPD class of distributions, we require the concept of a level set. Let the unnormalized probability density function for the CCPD class of distributions, g(x; ν, η) (see Definition 2), achieve its maximum value at m_η (part (b) of Theorem 3 ensures that m_η is a unique point), and let

S_l := {x ∈ R^p_+ : g(x; 1, η) ≥ l g(m_η; 1, η)}

be the level set of level l containing the mode m_η, where 0 ≤ l < 1. To define the level set we could have used g(x; ν_0, η) for any fixed value ν_0 > 0 instead of g(x; 1, η). However, without loss of generality, we choose ν_0 = 1.
Theorem 4. Let d_ν ∼ CCPD(·; ν, η) for a fixed η ∈ R^p, with m_η being the mode of the distribution. If P_ν(·; η) denotes the probability distribution function corresponding to d_ν, then (a) P_ν(S_l; η) is an increasing function of ν for any level set S_l with l ∈ (0, 1).

The major impediment to proving Theorem 4 arises from the intractability of the normalizing constant of the CCPD(·; ν, η) distribution. Although involved, the proof essentially uses the log convexity of 0F1(n/2, D^2/4) to get around this intractability.
From Theorem 4, it is clear that the parameter ν relates to the concentration of the probability around the mode of the distribution. Larger values of ν imply larger concentration of probability near the mode of the distribution.
Definition 3. In the context of the probability distribution CCPD (· ; η, ν), the parameters η and ν are labeled as the "modal parameter" and the "concentration parameter", respectively.
In Figure 1, we display three contour plots of the CCPD(·; ν, η) distribution with η = (0.85, 0.88). Note that the corresponding mode of the distribution is h^{-1}(0.85, 0.88) = (7, 5) for all three plots. We can observe the implication of part (b) of Theorem 3 in that the "centers" of the distributions are the same. In contrast, the "spread" of the distributions decreases as the value of the parameter ν increases, as implied by Theorem 4.
Note that the mode of the distribution is characterized by the parameter Ψ and does not depend on the parameter ν. The proof of the theorem depends crucially on a strong result, a type of rearrangement inequality proved in Kristof (1969). For the concentration characterization of the JCPD, we define level sets in the context of the JCPD distribution. Let the unnormalized probability density function for the JCPD class of distributions, g(M, d, V; ν, Ψ), achieve its maximum value at the point (M̂, d̂, V̂) (see Theorem 5), and let

A_l := {(M, d, V) ∈ V_{n,p} × R^p_+ × V_{p,p} : g(M, d, V; 1, Ψ) ≥ l g(M̂, d̂, V̂; 1, Ψ)}

be the level set of level l, for some l ∈ (0, 1). The following theorem characterizes the concentration property of the JCPD distribution.
Theorem 6. If P_ν(·; Ψ) denotes the probability distribution function corresponding to the distribution JCPD(·; ν, Ψ), then (a) P_ν(A_l; Ψ) is a strictly increasing function of ν for any level set A_l with l ∈ (0, 1); (b) for any open set O ⊆ V_{n,p} × R^p_+ × V_{p,p} containing the mode of the distribution, P_ν(O; Ψ) → 1 as ν → ∞; and (c) the conditional distribution of d, given M and V, is a CCPD distribution.
Parts (a) and (b) of the above theorem characterize the concentration, whereas part (c) relates the CCPD to the JCPD class of distributions. Part (c) also motivates the development of a sampling procedure for the JCPD distribution. The proof of part (a) of Theorem 6 is similar to that of Theorem 4. The proof of part (b) of Theorem 6 is more involved and depends on several key results, including the rearrangement inequality of Kristof (1969), the log convexity of 0F1(n/2, D^2/4), and the fact that g(h^{-1}(η); ν, η), the value of the unnormalized CCPD density at the mode, is a strictly increasing function of the parameter η.
Note that, unlike in the case of the CCPD distribution, we do not attempt to establish the log concavity of the JCPD, the reason being that the underlying probability space V_{n,p} × R^p_+ × V_{p,p} is non-convex. Nevertheless, it is evident that beyond a certain distance (based on a suitable metric on V_{n,p} × R^p_+ × V_{p,p}) the value of the density drops monotonically as one moves farther away from the center. Based on the characteristics of the parameters ν and Ψ of the JCPD class of distributions, we have the following definition.
Definition 4. The parameters Ψ and ν in the distribution JCPD are labeled the "modal" parameter and the "concentration" parameter, respectively.
Interestingly, both distributions CCPD and JCPD are parameterized by two parameters, one controlling the center and the other characterizing the probability concentration around that center. One may therefore visualize the distributions in a fashion similar to that of the multivariate Normal distribution controlled by the mean and variance parameters. This intuitive understanding can help practitioners select hyper-parameter values when conducting a Bayesian analysis with the CCPD and JCPD distributions.
Thus far we have established properties of the CCPD and JCPD that relate to basic features of these distributions. Additional properties, which are required for a Markov chain Monte Carlo (MCMC) sampling scheme, are developed in Section 5.1.

Informative Prior
We now present procedures for the selection of hyperparameter values aimed at incorporating prior beliefs about the parameters (M, d, V). Consider the scenario where a practitioner has the prior belief that the values of the parameters M, d, V are close to M_belief, d_belief, V_belief, respectively. A standard approach to incorporating this prior knowledge is to select the hyper-parameter values in such a manner that the mode of the corresponding prior distribution is located at (M_belief, d_belief, V_belief). In order to achieve this in the current context, we first compute η̃ = h(d_belief), where h(·) is defined in (2.8) in the supplementary material. Note that we always obtain a feasible η̃ for every d_belief ∈ S_p.
In the case of the CCPC class of priors, we choose η = η̃ in (3.4). Theorem 3 guarantees that the above hyperparameter specification yields a prior distribution with mode at (M_belief, d_belief, V_belief). From Theorem 4, we also see that larger values of the hyper-parameter ν lead to larger concentration of the prior probability around the mode. The hyper-parameters ξ_D and γ_D play a similar role for the ML distribution. Hence the hyper-parameters ν, ξ_D and γ_D are chosen to have larger values when the practitioner has higher confidence in the prior belief.
In the case of the JCPC class of priors, we apply Theorem 5 to construct a JCPD (see (3.2)) with mode at (M_belief, d_belief, V_belief). In particular, we set Ψ = M_belief D_η̃ (V_belief)^T, where D_η̃ is the diagonal matrix with diagonal elements η̃ = h(d_belief). Using the concentration characterization described in Theorem 6, the practitioner may choose the value of the hyper-parameter ν appropriately, with a larger value of ν implying greater confidence in the prior belief.
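The construction of Ψ from the prior beliefs can be sketched in one line of linear algebra. Here eta_tilde stands for h(d_belief), which must be computed via the scheme of Section 6.2 and is treated as given; the function below is our own illustration, not the paper's code:

```python
import numpy as np

def jcpd_modal_parameter(M_belief, eta_tilde, V_belief):
    """Psi = M_belief diag(eta_tilde) V_belief^T, so that the JCPD prior
    has its mode at (M_belief, d_belief, V_belief) with eta_tilde = h(d_belief)."""
    return M_belief @ np.diag(eta_tilde) @ V_belief.T
```

By construction, the unique SVD of Ψ recovers M_belief, η̃ and V_belief (up to the sign conventions of the unique SVD).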
It is noteworthy that for both the JCPC and CCPC class of priors, there is an intimate connection between the sample size and the interpretation of the hyper-parameter ν. As a heuristic one may envisage ν as incorporating "information" equivalent to ν many historic observations of the model.

Uniform Improper Prior
In the case where the practitioner does not have a prior belief about the parameter values, an automatic procedure for hyper-parameter selection can be helpful. In this and the next subsection, we discuss two automatic procedures to select the values of the hyper-parameters. In the absence of prior information, usage of a uniform prior is common in the literature. In the context of the current model, for the JCPC and CCPC classes of distributions, the prior on the parameters (M, d, V) is called a uniform prior if the corresponding prior density is constant, i.e., g(M, d, V; ν, Ψ) ∝ 1. Both classes of priors, JCPC and CCPC, are flexible enough to accommodate a uniform prior. For the JCPC, this can be achieved by setting ν = 0 in (3.2). Correspondingly, for the CCPC class, the uniform prior can be constructed by choosing ν = 0, ξ_D = 0 and γ_D = 0 in (3.4). Note that the resulting uniform prior is improper, as the above choices of hyper-parameters do not lead to a proper probability distribution. Hence, it is necessary to check the propriety of the resulting posterior (see Section 5.3 for more details).

Empirical Prior
Another widely used automatic method is to use the empirical information contained in the data to select appropriate values of the hyper-parameters. Let W_1, W_2, ..., W_N be independent and identically distributed samples drawn from ML(·; M, d, V). Consider the sample mean W̄ = (1/N) Σ_{i=1}^N W_i with unique SVD W̄ = M_W̄ D_W̄ V_W̄^T, and let η̂ denote the diagonal elements of D_W̄. One can set Ψ = W̄ as the hyper-parameter in the case of the JCPC prior. In the case of the CCPC class of priors, one can choose η = η̂ and, for the hyper-parameters related to M and V, apply the same procedure as discussed previously in this section. For both classes of priors, a value of ν less than or equal to 10 percent of the sample size N is recommended. In Figure 1, we display the resulting conditional distribution of d given M, V. The "center" of the distribution is located at (7, 5), and the figure also displays the "spread" of the distribution around the mode when using ν = 10, ν = 20 and ν = 35.

Properties of Posterior
The derivation of the posterior distributions for the JCPC and CCPC classes of priors is straightforward, since they were built with conjugacy in mind, which entails that the posterior distributions lie in the corresponding classes. However, inference for the resulting posterior distributions is challenging: not only are the normalizing constants intractable for both the JCPD and CCPD distributions, but the unnormalized versions of the corresponding density functions also involve the hypergeometric function 0F1(n/2, D^2/4). We first focus our attention on developing properties of the posterior distribution when using JCPC and CCPC priors. In particular, we derive explicit forms of the posterior conditionals under different prior settings, establish the linearity of the posterior modal parameters, and prove the strong consistency of the posterior mode.

Posterior Conditionals
Let W_1, W_2, ..., W_N be independent and identically distributed samples drawn from ML(·; M, d, V), with likelihood given in (5.1). First, let us assume a JCPD prior with parameters ν and Ψ. Theorem 5 not only implies that the posterior has a unique mode, but also provides an expression for the mode. Furthermore, we see that the corresponding posterior distribution is JCPD with concentration (ν + N) and posterior modal parameter Ψ_N = (ν/(ν + N)) Ψ + (N/(ν + N)) W̄. Let η_{Ψ_N} denote the diagonal elements of the diagonal matrix D_{Ψ_N}, where Ψ_N = M_N D_{Ψ_N} V_N^T is the unique SVD of Ψ_N. From Theorem 6, it follows that the full posterior conditionals for the parameters M, d, V are ML, CCPD and ML distributions, respectively.
In Section 6 we shall use these results to construct a Gibbs algorithm. A part of the Gibbs scheme would require sampling from the relevant CCPD distribution, which we propose to implement by simulating from the full conditional distribution of each of the components of d given the rest, when d ∼ CCPD (· ; ν, η). To refer to this conditional distribution in subsequent text, we have the following definition.
Definition 5. Let ν > 0, d_{(−j)} ∈ R_+^{p−1} and η ∈ R_+^p with max_{1≤j≤p} η_j < 1. A random variable is said to be distributed as CCPD_j(·; d_{(−j)}, ν, η) if its probability density function (with respect to the Lebesgue measure on R) is proportional to the CCPD(·; ν, η) kernel viewed as a function of the j-th coordinate alone. Let d = (d_1, ..., d_p) be a random vector with d ∼ CCPD(·; ν, η) for some η with max_{1≤j≤p} η_j < 1 and ν > 0, and let d_{(−j)} be the vector containing all but the j-th component of d. Then the conditional distribution of d_j given d_{(−j)} is CCPD_j(·; d_{(−j)}, ν, η). Now, since the conditional posterior of d was shown to be CCPD, the conditional posterior distribution of d_j | d_{(−j)} follows a CCPD_j distribution. In the case of a Bayesian analysis with a CCPC prior, (3.4) and (5.1) determine the corresponding posterior distribution up to proportionality, and the conditional posterior density of d given M, V and the data is again of the CCPD form.

Linearity of Posterior Modal Parameter
We observe that the posterior modal parameter is a convex combination of the prior modal parameter and the sample mean when applying the JCPC class of priors. In particular, from Section 5.1 we get Ψ_N = (ν/(ν + N)) Ψ + (N/(ν + N)) W̄. In a similar fashion, we observe from (5.3) that the modal parameter for the conditional posterior distribution of d given M, V, {W_i}_{i=1}^N is a convex combination of the prior modal parameter and an appropriate statistic of the sample mean. We should point out that the posterior linearity of the natural parameter of an exponential family distribution follows directly from Diaconis and Ylvisaker (1979). However, in our parametrization, the ML density is a curved exponential family in its parameters, and posterior linearity appears to hold for the "modal parameter".
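The convex-combination update above is simple enough to state as code. The sketch below (plain Python, with made-up 3 × 2 matrices that are not from the paper) computes Ψ_N entrywise from a prior modal parameter Ψ, a sample mean W̄, and the weights ν/(ν + N) and N/(ν + N).

```python
def posterior_modal_parameter(Psi, W_bar, nu, N):
    """Entrywise convex combination Psi_N = (nu/(nu+N)) Psi + (N/(nu+N)) W_bar."""
    w = nu / (nu + N)   # weight on the prior modal parameter
    return [[w * p + (1 - w) * s for p, s in zip(prow, srow)]
            for prow, srow in zip(Psi, W_bar)]

Psi   = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # prior modal parameter (toy)
W_bar = [[0.8, 0.1], [0.1, 0.6], [0.2, 0.1]]   # sample mean of the W_i (toy)
Psi_N = posterior_modal_parameter(Psi, W_bar, nu=10, N=90)
```

As the sample size N grows with ν fixed, the weight on the prior vanishes and Ψ_N approaches the sample mean, matching the pseudo-observation reading of ν.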

Posterior Propriety When Using Uniform Improper Prior
In the case where a uniform improper prior is used, the corresponding posterior is proportional to the likelihood (5.1). It follows from Theorem 1 that the function in (5.4) leads to a proper distribution, JCPD(·; N, W̄), if ‖W̄‖_2 < 1. The following theorem outlines the conditions under which ‖W̄‖_2 < 1.
Theorem 7. Let W_1, ..., W_N be independent and identically distributed samples from an ML distribution on the space V_{n,p}. If (a) N ≥ 2 and p < n, then ‖W̄‖_2 < 1 almost surely.
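For p = 2, the spectral-norm condition ‖W̄‖₂ < 1 can be checked in closed form, since the singular values of W̄ are the square roots of the eigenvalues of the 2 × 2 Gram matrix W̄ᵀW̄. The sketch below is an illustration with toy matrices, not code from the paper.

```python
import math

def spectral_norm_p2(W):
    """Spectral norm of an n x 2 matrix from the closed-form eigenvalues
    of its 2 x 2 Gram matrix W^T W."""
    a = sum(row[0] * row[0] for row in W)
    b = sum(row[0] * row[1] for row in W)
    c = sum(row[1] * row[1] for row in W)
    lam_max = ((a + c) + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    return math.sqrt(lam_max)

# A single orthonormal-column matrix has spectral norm exactly 1, while
# averaging two distinct frames pulls the mean strictly inside the ball.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_bar = [[(x + y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(W1, W2)]
```

The same check is what Section 8 performs on the VCG data before declaring the posteriors well defined.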

Strong Consistency of the Posterior Mode
In the case where we use a JCPD(·; ν, Ψ) prior for the Bayesian analysis of the data {W_i}_{i=1}^N, the corresponding posterior distribution is a JCPD with concentration ν + N and posterior modal parameter Ψ_N = (ν/(ν + N)) Ψ + (N/(ν + N)) W̄. The form of the function h(d) is provided in Theorem 3. The nontrivial aspect of finding the posterior mode is the computation of the function h^{-1}(d_Ψ). In our applications, we use a Newton-Raphson procedure to obtain h^{-1}(d_Ψ) numerically. We use large- and small-argument approximations of 0F1(n/2, D^2/4) (see Jupp and Mardia (1979)) to initialize the Newton-Raphson algorithm for faster convergence. Note that the success of the Newton-Raphson procedure here depends on the efficient computation of 0F1(n/2, D^2/4) and its partial derivatives. In Section 6.2, we provide a method to compute these functions reliably.
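The inversion of h by Newton-Raphson can be sketched generically. Since the actual h is built from 0F1(n/2, D^2/4) and its partial derivatives, the example below substitutes a stand-in monotone link (tanh) purely to illustrate the iteration with a numerical central-difference derivative; every name here is illustrative.

```python
import math

def newton_inverse(h, target, x0, tol=1e-10, max_iter=100):
    """Solve h(x) = target by Newton-Raphson with a central-difference
    derivative; assumes h is smooth and strictly monotone near the root."""
    x = x0
    for _ in range(max_iter):
        fx = h(x) - target
        if abs(fx) < tol:
            break
        eps = 1e-6
        dfx = (h(x + eps) - h(x - eps)) / (2 * eps)
        x -= fx / dfx
    return x

h = math.tanh                        # stand-in monotone link, NOT the paper's h
d = newton_inverse(h, 0.5, x0=1.0)   # recover d with h(d) = 0.5
```

In the paper's setting the starting point x0 would come from the large- and small-argument approximations of 0F1 mentioned above.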
The following theorem demonstrates that the mode of the posterior distribution is a strongly consistent estimator of the parameters M, d, V, where consistency is in the sense of almost sure convergence.

MCMC Sampling from the Posterior
Apart from finding the posterior mode, a wide range of statistical inference procedures, including point estimation, interval estimation (see Section 8) and statistical decision making (see Section 8), can be performed with the help of samples from the posterior distribution. For the JCPD and CCPD classes of distributions, it is neither possible to find the posterior mean estimate via integration, nor to directly generate i.i.d. samples from the distributions. We therefore develop procedures to generate MCMC samples via a Gibbs sampler, which requires the results on posterior conditionals stated in Section 5.1.
It follows from Theorem 6 and Section 5.1 that, under a JCPD prior, the conditional distribution of M given d, V and the conditional distribution of V given M, d are ML distributions, while the conditional distribution of d given M, V is CCPD. Consequently, the conditional distribution of d_j | d_{(−j)} follows a CCPD_j distribution (see Definition 5). Also, let Ψ_N = M_N D_{Ψ_N} V_N^T denote the unique SVD of the posterior modal parameter, and let η_Ψ denote the vector containing the diagonal elements of the matrix M^T Ψ_N V. Based on the above discussion, we can now describe the algorithm as follows.
Algorithm 1 Gibbs sampling algorithm to sample from posterior when using JCPC prior.
If instead we use a CCPC prior (see (3.4)) for the Bayesian analysis of the data, then the full conditional distributions of M, d, V are ML, CCPD and ML distributions, respectively. The steps involved in the Gibbs sampling Markov chain are then as follows.
Algorithm 2 Gibbs sampling algorithm to sample from posterior when using CCPC prior.
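Both algorithms share the same three-block Gibbs cycle; only the conditional samplers differ. The skeleton below shows that cycle with the conditional samplers passed in as callables; the dummy samplers used to exercise it are placeholders, not the ML or CCPD samplers of the paper.

```python
import random

def gibbs_sampler(sample_M, sample_d, sample_V, init, n_samples, burn_in):
    """Schematic Gibbs cycle for (M, d, V). The three full-conditional
    samplers (ML, CCPD and ML in the paper) are passed in as callables."""
    M, d, V = init
    draws = []
    for t in range(burn_in + n_samples):
        M = sample_M(d, V)   # M | d, V  (an ML draw in the paper)
        d = sample_d(M, V)   # d | M, V  (a CCPD draw, componentwise CCPD_j)
        V = sample_V(M, d)   # V | M, d  (an ML draw in the paper)
        if t >= burn_in:
            draws.append((M, d, V))
    return draws

# Dummy stand-in conditionals, for illustration only.
rng = random.Random(0)
noise = lambda *args: rng.gauss(0.0, 1.0)
draws = gibbs_sampler(noise, noise, noise, init=(0.0, 0.0, 0.0),
                      n_samples=100, burn_in=10)
```

Plugging in the Hoff (2009) ML sampler and the CCPD_j rejection sampler of Section 6.1 recovers Algorithms 1 and 2.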
To implement the above algorithms, we need to sample from the ML and CCPD distributions. For the former, we use the procedure developed in Hoff (2009). Sampling from CCPD_j is much more involved and is explained in detail in the next subsection. The following result provides theoretical guarantees that are useful for this specific sampler.
Even though parts (a) and (b) of the above theorem follow immediately from Theorem 3, they are included here for completeness; all of these properties play a crucial role in the construction of the sampling technique for CCPD_j. The proof of part (c) is essentially an implication of the fact that the right tail of the distribution decays at an exponential rate. Note that the ratio g_1(B; d_{(−1)}, ν, η)/g_1(m; d_{(−1)}, ν, η) mentioned in part (c) is free of the intractable normalizing constants of the distribution. Therefore, the numerical computation of the ratio is possible as long as we can compute the corresponding values of 0F1(n/2, D^2/4). Using Theorem 9, we develop an accept-reject sampling algorithm that can generate samples from CCPD_j with high acceptance probability. The detailed construction of the sampler is provided next. We conclude this section with a description of an efficient procedure for computing the constant 0F1(n/2, D^2/4).
To construct a proposal density ḡ_1(x), we employ two different strategies: one for the bounded interval (0, M_crit], and another, based on Theorem 9, to tackle the tail (M_crit, ∞) of the support of the conditional posterior distribution of d_1.
The procedure is as follows. Let δ = M_crit/N_bin, where N_bin is the total number of partitions of the interval (0, M_crit]. Consider k = ([m/δ] + 1), where [m/δ] denotes the greatest integer less than or equal to m/δ. Now define the proposal kernel ḡ_1(·) as in (6.2), where K†_{n,p,M_crit} is as defined in part (d) of Theorem 9.
For the case where M_crit tends to ∞ (see Remark 1), the constant K†_{n,p,M_crit} approaches a finite constant, whereas Γ((ν(n−1)+2)/2, M_crit ν(1 − η_1)) monotonically decreases to zero. Therefore, the positive constant q_{N_bin+1} can be made arbitrarily close to zero by choosing a suitably large value of M_crit when the values of n, p, ν, η_1 are fixed. Note that the quantities {q_j}_{j=1}^{N_bin+1} may not add up to 1; we therefore normalize them to construct the corresponding set of probabilities q̄_j, for j = 1, 2, ..., N_bin + 1. The following algorithm lists the steps involved in generating a sample from the distribution corresponding to the kernel ḡ_1(·).

Algorithm 3 Steps for the rejection sampler for CCPD_j.
1: Sample a bin index Z from the discrete distribution with probabilities q̄_1, ..., q̄_{N_bin+1},
2: if Z ≤ N_bin then
3: Sample y ∼ Uniform((Z − 1)δ, Zδ),
4: else Sample y ∼ TruncatedGamma(shape = (ν(n−1)+2)/2, rate = ν(1 − η_1), support = (M_crit, ∞)),
5: end if
6: Sample U ∼ Uniform(0, 1),
7: if U ≤ g_1(y)/ḡ_1(y) then
8: Accept y as a legitimate sample from g_1(·),
9: else Go to Step 1,
10: end if

Figure 2 shows a typical example of the function g_1(x) and the corresponding proposal ḡ_1(x). The blue curve represents the unnormalized density g_1. The black curve, together with the red curve after M_crit, constitutes the function ḡ_1 (defined in (6.2)). Note that the red curve after the point M_crit represents the last term (involving K†_{n,p,M_crit}) in the summation formula in (6.2). In Figure 2(a), the values of δ and M_crit are set such that the key components of g_1 and ḡ_1 are easy to discern. On the other hand, Figure 2(b) displays the plot of ḡ_1(x) when the recommended specifications of M_crit and δ are used.
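The envelope construction of Algorithm 3, a histogram proposal on (0, M_crit] plus an exponentially decaying tail, can be illustrated on a toy target. The sketch below applies the same scheme to the unnormalized kernel g1(x) = x e^{−x} (a stand-in for the CCPD_j kernel, chosen because the tail bound x e^{−x} ≤ e^{−x/2} is easy to verify); M_CRIT, N_BIN and the bin heights are illustrative choices, not the paper's recommended settings.

```python
import math, random

rng = random.Random(42)
g1 = lambda x: x * math.exp(-x)     # toy unnormalized target on (0, inf)
MODE, M_CRIT, N_BIN = 1.0, 8.0, 64  # analogues of m, M_crit and N_bin
DELTA = M_CRIT / N_BIN

def bin_height(j):
    """Sup of the unimodal kernel g1 over bin j: attained at an endpoint,
    or at the mode when the mode falls inside the bin."""
    l, r = j * DELTA, (j + 1) * DELTA
    cand = [g1(l), g1(r)] + ([g1(MODE)] if l <= MODE <= r else [])
    return max(cand)

heights = [bin_height(j) for j in range(N_BIN)]
# masses of the histogram bins, plus the mass of the tail bound e^{-x/2}
q = [h * DELTA for h in heights] + [2 * math.exp(-M_CRIT / 2)]
Z = sum(q)

def envelope(x):
    # dominates g1 everywhere: bin-wise sup on (0, M_crit],
    # and x e^{-x} <= e^{-x/2} on the tail
    if x > M_CRIT:
        return math.exp(-x / 2)
    return heights[min(int(x / DELTA), N_BIN - 1)]

def sample_g1():
    while True:
        u, acc, j = rng.random() * Z, 0.0, 0
        for j, qj in enumerate(q):      # pick a bin (or the tail)
            acc += qj
            if u <= acc:
                break
        if j < N_BIN:
            y = (j + rng.random()) * DELTA      # uniform within the bin
        else:
            y = M_CRIT + rng.expovariate(0.5)   # exponential tail draw
        if rng.random() <= g1(y) / envelope(y):  # accept-reject step
            return y

xs = [sample_g1() for _ in range(2000)]
```

Because the proposal density is proportional to the envelope, the accept step U ≤ g1(y)/ḡ1(y) yields exact draws from the normalized target (here a Gamma(2, 1), so the sample mean should sit near 2).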
The choice of N_bin plays a crucial role in the algorithm and must be determined before constructing the proposal density for the accept-reject algorithm. Note that N_bin and δ are interconnected: if one is specified, the value of the other is determined. We choose the parameter δ and compute the corresponding N_bin. In the case where the concentration parameter is high, a finer partition of the proposal histogram (a smaller value of δ) is required to keep the acceptance rate of the algorithm high. Based on our empirical results, we recommend selecting δ to be of the order of 1/√ν. The acceptance probability remains stable across different choices of ν when the value of δ is set accordingly (see Figure 3). The estimated acceptance probabilities used in Figure 3 were calculated based on 10000 Monte Carlo samples for each value of ν, varied from 1 to 100. The relationship among N_bin, δ and ν is presented in Table 1.
Finally, successful implementation of the sampling algorithm developed in this subsection requires the computation of 0F1(n/2, D^2/4), a key step in the computation of ḡ_1(·). In Section 6.2 we discuss the procedure that we have adopted to compute 0F1(n/2, D^2/4).

Computation of 0F1(n/2, D^2/4)
We first describe an efficient and reliable computational procedure for the function 0F1(n/2, D^2/4) when the argument matrix D is of dimension 2 × 2. The procedure is relevant to many applications considered in the field (Downs et al., 1971; Downs, 1972; Mardia, 1979, 1980; Mardia and Khatri, 1977; Mardia et al., 2007; Mardia and Jupp, 2009; Chikuse, 1991a,b, 1998, 2003; Sei et al., 2013; Lin et al., 2017). We emphasize that the computational procedure described below is applicable for analyzing data on V_{n,2} for all n ≥ 2.
Consider the representation developed in Muirhead (1975) for the hypergeometric function of a matrix argument, where D is a 2 × 2 diagonal matrix with diagonal elements d_1 > 0, d_2 > 0. From Butler and Wood (2003) (see page 361), the terms of this representation can be written via I_{c+2k−1}(·), the modified Bessel function of the first kind of order (c + 2k − 1). Hence, from (6.3) and (6.4), we obtain a bound on the truncation error, where the last inequality follows from I_{ν+1}(x)/I_ν(x) < x/(2(ν + 1)) for x > 0, ν > −1 (see page 221 in Ifantis and Siafarikas (1990)). For fixed values of d_1, d_2, we can find M such that A_M ≤ ε_1 and M^4 ≥ (d_1 d_2)/(4 ε_1) for some sufficiently small ε_1 (relative to a predetermined error bound ε). For such a choice of M, if k is any integer with k ≥ M, then the corresponding term is suitably small, where the last inequality follows from the fact that M^4 ≤ (M + (2c−1)/2)(M + 1)(M + c/2)(M + (2c+1)/2), as c > 1/2. Hence, from (6.5), we conclude that, for a given value of the matrix D and an error level ε, we can select M so that 0F1(c; D) is approximated by the truncated sum with an approximation error of at most ε.
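The truncate-and-bound strategy is easiest to see in the scalar special case p = 1, where 0F1(c; z) = Σ_k z^k/((c)_k k!) and the tail beyond a cutoff is dominated by a geometric series. The sketch below implements exactly that (it is not the paper's 2 × 2 Bessel-function representation) and checks the result against the closed forms 0F1(1/2; z) = cosh(2√z) and 0F1(3/2; z) = sinh(2√z)/(2√z).

```python
import math

def hyp0f1(c, z, tol=1e-13, max_terms=400):
    """Truncated series 0F1(c; z) = sum_k z^k / ((c)_k k!) for scalar z >= 0.
    Stops once a geometric bound on the remaining tail drops below tol --
    the same truncate-and-bound idea used for the matrix argument."""
    term, total = 1.0, 1.0
    for k in range(1, max_terms):
        term *= z / ((c + k - 1) * k)   # a_k / a_{k-1}
        total += term
        r = z / ((c + k) * (k + 1))     # ratio of the next term
        if r < 1 and term * r / (1 - r) < tol:
            break                        # geometric tail bound met
    return total

z = 2.3  # closed forms: 0F1(1/2; z) = cosh(2 sqrt(z)),
         #               0F1(3/2; z) = sinh(2 sqrt(z)) / (2 sqrt(z))
```

Once the term ratio falls below 1, the remaining terms are dominated by a geometric series, which is what makes the a-priori error guarantee possible.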
In the case where the matrix D is of dimension p × p with p > 2, we rely on the computational technique developed in Koev and Edelman (2006). Development of efficient computational schemes for the hypergeometric function of a matrix argument in general dimension is an active area of research (Gutiérrez et al., 2000; Koev and Edelman, 2006; Nagar et al., 2015; Pearson et al., 2017). In principle, the theoretical framework developed in this article, integrated with the general computation scheme specified in Koev and Edelman (2006), can handle data on V_{n,p} for arbitrary integers n ≥ p ≥ 2, but the results from the combined procedure may lack precision, as it inherits the limitations of the algorithm in Koev and Edelman (2006) (see page 835 therein). In the following remark we specify the assumptions under which the combined procedure can be applied effectively.

Remark 2. The algorithm developed in Koev and Edelman (2006) is a general procedure for computing pFq(·) for arbitrary integers p, q ≥ 0. Naturally, the algorithm applies to 0F1, which is the object of focus in the current context. Due to its generality, the computational scheme has certain limitations. In particular, it requires appropriate specification of a "tuning parameter" that cannot be determined in an automated manner. However, from an empirical exploration of the procedure, we observed that the corresponding outputs can be quite robust. In particular, the output was found to stabilize after a certain point (we call this the "stabilization point") as the value of the tuning parameter was gradually increased. For the case of p = 2, if the tuning parameter is specified to be larger than the stabilization point, the output from Koev and Edelman (2006) is very close to the true value, as determined by our arbitrary-precision algorithm. Extrapolating to p ≥ 3, we presume that the true value of the corresponding hypergeometric function will be close to the output of Koev and Edelman (2006) if the tuning parameter is set larger than the "stabilization point". As the "stabilization point" is observed to be larger for larger values of D, we can set the value of the tuning parameter to a single prespecified number for an entire analysis only if we assume that the diagonal elements of the matrix D are bounded above by a prespecified finite number. Under this assumption, we can rely on Koev and Edelman (2006) for the analysis of data on V_{n,p}, n ≥ p ≥ 3. In that case, the combination of our theoretical framework and the algorithm for the computation of the hypergeometric function from Koev and Edelman (2006) works effectively for practical applications (see the simulations in Section 7.2).

In contrast, the procedure we have developed to compute 0F1(n/2, D^2/4), though targeted towards a specific case, has a theoretical guarantee for a desired level of precision of its output. Since many statistical applications, as mentioned earlier, concern data on V_{n,2}, the computational procedure we have designed specifically for V_{n,2} has its own merit.

Simulation
To evaluate the performance of the procedures presented in the previous sections, we performed simulation experiments under two different setups. In the first, we analyzed simulated datasets in V_{n,p} where we varied n to assess its effect on the posterior estimation efficiency. Here, the value of p was fixed at 2, and the computation of 0F1(n/2, D^2/4) developed in Section 6.2 was utilized. In the second setup, we analyzed data on V_{n,p} with p = 3 and n = 5 to demonstrate the generic applicability of our framework. Here, we used the procedure in Koev and Edelman (2006) to calculate the value of 0F1(n/2, D^2/4).

Simulation Setup (p = 2)
We present results from experiments with simulated data where we varied the dimension of the Stiefel manifold, n, across a range of values. The objective of this simulation study was to see how the error rates vary with the dimension n. Specifically, we generated 3000 observations from ML distributions on V_{3,2}, V_{5,2}, V_{10,2}, and V_{15,2}; these correspond to the Stiefel manifolds with dimensions [n = 3, p = 2], [n = 5, p = 2], [n = 10, p = 2], and [n = 15, p = 2], respectively. We generated 50 datasets for each simulation setting using the algorithm from Hoff (2009). To generate each dataset, we fixed the parameters M and V at the canonical orthogonal vectors of appropriate dimension and generated the two entries of the parameter D from two independent gamma distributions.
We ran posterior inference for each of these datasets using 3000 MCMC samples, with an initial 1000 samples as burn-in. We used the posterior mean of the parameter F as the point estimate F̂, and assessed performance by computing the relative error of the estimate of F_true = M_true D_true V_true^T, defined as ‖F̂ − F_true‖ / ‖F_true‖, where ‖·‖ denotes the matrix Frobenius norm. Figure 4 shows the average relative error, with the corresponding standard deviation, for V_{3,2}, V_{5,2}, V_{10,2}, and V_{15,2} for N = 2000 (panel (a)) and N = 3000 (panel (b)). The average relative errors do not exceed 11% and 9% for N = 2000 and N = 3000, respectively, even with dimension as high as n = 15. The error rate tends to increase with the dimension n. We also investigated the relationship with the total sample size and found the error rates to decrease with larger sample sizes; for example, the reduction in average relative error for n = 5 when moving from N = 2000 to N = 3000 is around 2%. Overall, these results demonstrate the robustness of our inference procedure.
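The error metric can be computed with a few lines of code. The sketch below assumes the relative error is the Frobenius-norm ratio ‖F̂ − F_true‖/‖F_true‖ described above; the 3 × 2 matrices are illustrative placeholders, not simulation output from the paper.

```python
import math

def frobenius(A):
    return math.sqrt(sum(x * x for row in A for x in row))

def relative_error(F_hat, F_true):
    """Relative Frobenius error ||F_hat - F_true|| / ||F_true||."""
    diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(F_hat, F_true)]
    return frobenius(diff) / frobenius(F_true)

F_true = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # toy "true" F = M D V^T
F_hat  = [[2.1, 0.0], [0.0, 0.9], [0.1, 0.0]]   # toy posterior-mean estimate
err = relative_error(F_hat, F_true)
```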

Simulation Setup (p > 2)
Having demonstrated the efficiency of our method for a range of values of n with p = 2, we now present an example of a generalized simulation scenario with p > 2. Here we use the procedure in Koev and Edelman (2006) to numerically approximate the value of 0F1(n/2, D^2/4), where D is a p × p dimensional matrix with p > 2 (see Remark 2). Throughout the entire simulation, we fixed the tuning parameter required in the computation of 0F1(n/2, D^2/4) at a large prespecified value. Here we give a specific example with n = 5 and p = 3. We generated 50 datasets, each of 500 observations, using ML distributions with different parameters on V_{5,3}. We then ran posterior inference for each of these datasets using 1100 MCMC samples, with an initial 100 samples as burn-in. We used the posterior mean of the parameter F, as before, as the estimate of the true parameter F. Using the same metric, we computed the average relative error of the estimation (Figure 5). We observed that our sampling algorithm for d_i (i = 1, 2, 3) runs with a very low rejection rate. As can be seen in Figure 5, the average relative errors do not exceed 3%, demonstrating the general applicability of our framework beyond p = 2.

Application
Finally, to showcase the methodology developed in this paper, we analyzed the vectorcardiogram (VCG) dataset discussed in Downs et al. (1971). The dataset contains vectorcardiograms of 56 boys and 42 girls aged between 2 and 19 years. Individuals in the dataset are partitioned into four groups: groups 1 and 2 consist of boys aged 2-10 and 11-19 years, while groups 3 and 4 consist of girls aged 2-10 and 11-19 years. Each sample contains vectorcardiograms acquired using two different measurement systems: the Frank lead system (Frank, 1956; Downs et al., 1971) and the McFee lead system (Downs et al., 1971). Here, we restrict ourselves to groups 1 and 3 and measurements acquired using the McFee lead system. For each individual sample, we considered the pair of orthogonal vectors that provides the orientation of the "QRS loop" (Downs et al., 1971) in R^3. Each orientation in the sample is defined by a 3 × 2 matrix with orthonormal columns, i.e., an element of V_{3,2}. Additional details regarding the measurements, data structures, and data processing can be found in Downs et al. (1971).

MCMC Convergence Diagnostics
We ran several MCMC convergence diagnostics on the MCMC samples from the posterior of F = MDV^T, the natural parameter of the Matrix Langevin distribution. The parameter F uniquely identifies, and is uniquely identified by, the parameters M, D, V. Moreover, the elements of the matrices M and V are interrelated, whereas the components of F are not thus constrained; we therefore focused the diagnostics on F and studied its estimation accuracy. As notation, F_{i,j} denotes the [i, j]-th element of F. We first ran convergence diagnostics based on the potential scale reduction factor (PSRF) (Gelman et al., 1992). We ran the MCMC procedure three times with different random seeds for 10,000 MCMC iterations, with a 1000-sample burn-in. The PSRF is a weighted sum of within-chain and between-chain variances. The calculated PSRF was 1.00 with an upper confidence bound of 1.01, indicating no evidence of lack of convergence. Figure 6 shows how the PSRF changed with the iterations for all components of F. We also calculated the multivariate potential scale reduction factor (MPSRF) proposed by Brooks and Gelman (1998); the calculated MPSRF was 1.01, also confirming that there was no lack of convergence. The log-likelihood is yet another measure representative of the multi-dimensional parameters; in this case too, the calculated PSRF for the log-likelihood was 1.0 with an upper confidence bound of 1.0, indicating no evidence of lack of convergence. Finally, we calculated the Heidelberger and Welch (HW) diagnostic (Heidelberger and Welch, 1981, 1983), a test statistic based on the Cramer-von Mises statistic for the null hypothesis that the Markov chain is from a stationary distribution. This diagnostic has two parts, and the chain for F passed both the stationarity and halfwidth mean tests. This test, too, showed no evidence of lack of convergence.
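For reference, the (univariate) PSRF can be computed from first principles. The sketch below implements the standard Gelman-Rubin formula, combining the within-chain variance W and between-chain variance B into the pooled estimate ((n−1)/n)W + B/n, and exercises it on synthetic chains; it is a generic illustration, not the diagnostic code used for the paper's analysis.

```python
import random

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin) for m chains of equal
    length n. Values close to 1 indicate no evidence of non-convergence."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    var_hat = (n - 1) / n * W + B / n                         # pooled estimate
    return (var_hat / W) ** 0.5

# Three synthetic "chains" drawn from the same distribution, so the
# PSRF should sit very close to 1.
rng = random.Random(1)
chains = [[rng.gauss(0.0, 1.0) for _ in range(5000)] for _ in range(3)]
R = psrf(chains)
```

In practice such a function would be applied componentwise to the MCMC draws of each F_{i,j}.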

Parameter Estimation
We modeled the VCG dataset using ML distributions on V_{3,2}. There were 28 and 17 observations in groups 1 and 3, respectively. We assumed that each i.i.d. observation in group 1 follows an ML distribution with parameters M_group1, d_group1 and V_group1, and likewise, that i.i.d. observations in group 3 follow an ML distribution with parameters M_group3, d_group3 and V_group3. We used the uniform improper prior for estimation of the parameters of both groups (see Section 4). From (5.4), we note that the posterior distributions of (M_group1, d_group1, V_group1) and (M_group3, d_group3, V_group3) given the data are JCPD distributions determined by W̄_group1 and W̄_group3, the sample means of the observations in groups 1 and 3, respectively. We verified the spectral norm condition in Theorem 1 for the posterior distributions to be well defined; we found ‖W̄_group1‖_2 = 0.946 and ‖W̄_group3‖_2 = 0.941.
Using Theorem 3, we can infer that the above-mentioned posterior distributions have unique modes, and the posterior modes can be computed explicitly for both groups (the numerical values for group 3 are not reported here). To estimate the posterior means of the parametric functions F_group1 = M_group1 D_group1 V_group1^T and F_group3 = M_group3 D_group3 V_group3^T, we ran the MCMC-based posterior inference procedure described in Section 6 to generate MCMC samples from each of the posterior distributions.
For group 1, we obtained the posterior mean of the parametric function F_group1 = M_group1 D_group1 V_group1^T, together with the matrix SD(F̂_group1), whose entries provide the standard deviations of the corresponding entries of F̂_group1; the estimates for group 3 were obtained in the same manner. From the MCMC samples, we also estimated the posterior density of each entry of F_group1 and F_group3; Figure 9 shows the corresponding density plots.

Hypothesis Testing
Finally, we conducted a two-sample hypothesis test for comparing different data groups on the Stiefel manifold. We chose hypothesis testing as one of our demonstrations because a general two-sample test that does not rely on asymptotics, or on the concentration being very large or very small, has not been reported in the literature for data lying on the Stiefel manifold (Chikuse, 2012). The procedure described here is valid for finite sample sizes and does not require any additional assumptions on the magnitude of the parameters.
We considered the VCG dataset and carried out a test comparing data group 1 against data group 3; the competing hypotheses are formalized below via two models.
To test the hypotheses in a Bayesian model selection framework, we considered two models, Model 0 and Model 1. In Model 0, we assumed M_group1 = M_group3, d_group1 = d_group3, V_group1 = V_group3, while in Model 1 we did not impose any structural dependencies between the parameters. We assumed the prior odds between the models to be 1 and computed the Bayes factor B_01 = p(Data | Model 0)/p(Data | Model 1), where Data denotes the combined data from both groups. Since an analytic form for the Bayes factor is not available in this case, we used an MCMC-based sampling technique to estimate it. We used the empirical prior (see Section 4) with the prior concentration set at 1 percent of the corresponding sample size. We followed the procedure described in Section 6 to generate MCMC samples from each of the required posterior distributions, and used the harmonic mean estimator (HME) (Newton and Raftery, 1994) to estimate the marginal likelihoods required for computing the Bayes factor. It is well known that the HME may not perform well with improper priors; consequently, unlike in Section 8.2, where the focus is on parameter estimation, we use an informative prior for this part of the analysis. We observed that the HME estimator is stable in the current context. The estimate of log(B_01) was 51.994; hence, we conclude that there is not enough evidence to favor Model 1 over Model 0.
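The harmonic mean estimator itself is a one-liner on the log scale, provided a log-sum-exp is used for numerical stability. The sketch below estimates each log marginal likelihood from posterior log-likelihood draws and forms log B_01 as their difference; the Gaussian "log-likelihood draws" are synthetic stand-ins, not values from the VCG analysis.

```python
import math, random

def log_marginal_hme(log_liks):
    """Harmonic mean estimator of the log marginal likelihood:
    log p(D) ~ -( logsumexp(-log_liks) - log S ), computed stably."""
    S = len(log_liks)
    neg = [-ll for ll in log_liks]
    m = max(neg)
    lse = m + math.log(sum(math.exp(v - m) for v in neg))  # logsumexp
    return -(lse - math.log(S))

# Synthetic posterior log-likelihood draws for two models; the Bayes
# factor is log B_01 = log p(D | Model 0) - log p(D | Model 1).
rng = random.Random(7)
ll0 = [rng.gauss(-100.0, 0.5) for _ in range(1000)]
ll1 = [rng.gauss(-160.0, 0.5) for _ in range(1000)]
log_B01 = log_marginal_hme(ll0) - log_marginal_hme(ll1)
```

The log-sum-exp shift by the maximum keeps the sum of reciprocal likelihoods from overflowing, which is essential when log-likelihoods are large in magnitude, as here.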

Discussion and Future Directions
In this article, we have formulated a comprehensive Bayesian framework for analyzing data drawn from a ML distribution. We constructed two flexible classes of distributions, CCPD and JCPD, which can be used for constructing conjugate priors for the ML distribution. We investigated the priors in considerable detail to build insights into their nature, and to identify interpretations for their hyper-parameter settings. Finally, we explored the features of the resulting posterior distributions and developed efficient computational procedures for posterior inference. An immediate extension would be to expand the framework to mixtures of ML distributions, with applications to clustering of data on the Stiefel manifold.
On a related note, we observed that the tractability of the procedures proposed in this article depends crucially on one's capacity to compute the hypergeometric function 0F1(n/2, F^T F/4) as a function of the matrix F. We were naturally led to a modified representation of 0F1(n/2, D^2/4) (see Section 2) as a function of a vector argument d, and we explored several properties of this function that are applicable to research areas far beyond the particular problem of interest in this article. As a special note, we highlight that we designed a tractable procedure to compute the hypergeometric function of an n × 2 dimensional matrix argument. There are many applications in the literature (Jupp and Mardia, 1979; Chikuse, 1998, 2003; Lin et al., 2017) where this computational procedure for 0F1(n/2, D^2/4) can make a significant impact. The manner in which we have approached this computation is entirely novel in this area of research, and the procedure is scalable to "high-dimensional" data, such as in diffusion tensor imaging. In the near future, we plan to further explore useful analytical properties of the hypergeometric function and to extend our procedure to build reliable computational techniques for matrix arguments of dimension n × p with p ≥ 3.
Finally, there is scope for extending the newly proposed family of prior distributions to a larger class of Bayesian models involving more general densities on manifolds. The prior and posterior properties established here can also be generalized seamlessly. The coming together of state-of-the-art Bayesian methods and the topological properties of the underlying space promises to be a rich area of research.