Bayesian Inference over the Stiefel Manifold via the Givens Representation

We introduce an approach based on the Givens representation for posterior inference in statistical models with orthogonal matrix parameters, such as factor models and probabilistic principal component analysis (PPCA). We show how the Givens representation can be used to develop practical methods for transforming densities over the Stiefel manifold into densities over subsets of Euclidean space, how to deal with issues arising from the topology of the Stiefel manifold, and how to inexpensively compute the change-of-measure terms. We introduce an auxiliary parameter approach that limits the impact of topological issues. We provide both analysis of our methods and numerical examples demonstrating their effectiveness. We also discuss how our Givens representation can be used to define general classes of distributions over the space of orthogonal matrices. Finally, we demonstrate on several examples how the Givens approach performs in practice in comparison with other methods.


Introduction
Statistical models parameterized in terms of orthogonal matrices are ubiquitous, particularly in the treatment of multivariate data. This class of models includes certain multivariate time series models (Brockwell et al., 2002), factor models (Johnson and Wichern, 2004), and many probabilistic dimensionality reduction models such as Probabilistic PCA (PPCA), Exponential Family PPCA (BXPCA), mixtures of PPCA (Ghahramani et al., 1996), and Canonical Correlation Analysis (CCA) (Murphy, 2012, Chapt. 12.5). These models were traditionally used in fields such as psychology (Ford et al., 1986), but more recently have gained traction in other diverse applications including biology (Hamelryck et al., 2006), finance (Lee et al., 2007), materials science (Oh et al., 2017), and robotics (Lu and Milios, 1997). Despite their ubiquity, routine and flexible posterior inference in Bayesian models with orthogonal matrix parameters remains a challenge. Given a specified posterior density function, modern probabilistic programming frameworks such as Stan, Edward, and PyMC3 (Carpenter et al., 2016; Tran et al., 2016; Salvatier et al., 2016) can automatically generate samples from the associated posterior distribution with no further input from the user. Unfortunately, these frameworks do not currently offer support for density functions specified in terms of orthogonal matrices. While innovative methods have been introduced to handle such densities, these approaches are often specialized to specific probability distributions or require re-implementations of the underlying inference methods (Hoff, 2009; Brubaker et al., 2012; Byrne and Girolami, 2013; Holbrook et al., 2016). An appealing alternative approach to this problem is to parameterize the space of orthogonal matrices, i.e.
the Stiefel manifold, in terms of unconstrained Euclidean parameters, and then use this parameterization to transform a posterior density on the space of orthogonal matrices into a posterior density on Euclidean space. The resulting density can then be used to sample from the posterior distribution using a probabilistic programming framework such as Stan. While appealing, the transformation approach poses its own challenges related to the change in measure, topology, and parameterization. While there are many possible parameterizations of orthogonal matrices (Anderson et al., 1987; Shepard et al., 2015), we seek smooth, continuously differentiable representations that can be used readily in inference methods such as Hamiltonian Monte Carlo (HMC). It is also important that the change-of-measure adjustment term for the transformed random variables be efficiently computable. For the Stiefel manifold, the topology can also pose issues when mapping to Euclidean space. Finally, in practice, we also seek representations that have an intuitive interpretation helpful in statistical modeling and in specifying prior distributions. We introduce an approach to posterior inference for statistical models with orthogonal matrix parameters based on the Givens representation of orthogonal matrices (Shepard et al., 2015). Our approach addresses several of the concerns of using a transformation-based approach. We derive an analytic change-of-measure adjustment term that is efficient to compute in practice. We also address two topological issues that occur when using the Givens representation to transform densities. The first issue concerns multi-modality in the transformed density that is introduced by the transform; we resolve it with a parameter expansion scheme. The second issue concerns how the Givens representation pathologically transforms densities near "poles" of the Stiefel manifold; we present both theory and numerical results showing how this plays a minimal role in
practice. We also discuss how the Givens representation can be used to define new distributions over orthogonal matrices that are useful for statistical modeling. Our approach enables the application of sampling methods such as the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014) to arbitrary densities specified in terms of orthogonal matrices. It is relatively easy to implement and has the advantage that current sampling algorithms and software frameworks can be leveraged without significant modifications or re-implementations. We expect this work to help facilitate practitioners' ability to rapidly build and prototype complex probabilistic models with orthogonal matrix parameters in probabilistic frameworks such as Stan, Edward, or PyMC3.

imsart-generic ver. 2014/10/16, file: givens.tex, date: November 5, 2019

We organize the paper as follows. In Section 2, we discuss existing methods for Bayesian inference over the Stiefel manifold and the difficulty of implementing these methods in existing Bayesian software frameworks. In Section 3 we describe the Givens representation by first introducing the Givens reduction algorithm and then connecting it to a geometric perspective of the Stiefel manifold, aiming to provide approachable intuition for the transform. In Section 4 we describe how to apply the Givens representation in practice to sample from posterior densities specified in terms of orthogonal matrices. In Section 5 we present results for statistical examples demonstrating our Givens representation approach in practice and comparing it with other methods.

Related Work
While several innovative methods have been proposed for sampling distributions over the Stiefel manifold, many are either specialized in the types of distributions they can sample or require significant modifications of the underlying inference methods. This poses challenges for their adoption and use in practice, especially within current probabilistic programming frameworks used by practitioners. In Hoff (2009), a Gibbs sampling approach is introduced for densities specified in terms of orthogonal matrices when the orthogonal matrix parameter, conditioned on all other model parameters, follows a distribution in the Bingham-von Mises-Fisher family. In practice, this limits the approach to a specific class of models and may not offer the same sampling efficiency as modern algorithms such as HMC. Recently, HMC-based methods have been proposed for sampling distributions over the Stiefel manifold and handling constraints (Brubaker et al., 2012; Byrne and Girolami, 2013; Holbrook et al., 2016). The work of Brubaker et al. (2012) proposes a modified HMC algorithm which uses an update rule for constrained parameters based on the symplectic SHAKE integrator (Leimkuhler and Reich, 2004). Specifically, for unconstrained parameters, the method uses a standard leapfrog update rule. For constrained parameters, the method first takes a leapfrog step, which usually moves the parameter to a value that does not obey the constraints; it then uses Newton's method to "project" the parameter value back onto the manifold where the desired constraints are satisfied. In Byrne and Girolami (2013) and Holbrook et al.
(2016), a separate HMC update rule is introduced to deal with constrained parameters. These works use analytic results and the matrix exponential function to update the parameters in a way that guarantees the constraints remain satisfied in the embedded matrix coordinates. More precisely, they use the fact that analytic solutions for the geodesic equations on the Stiefel manifold are known in the embedded coordinates. This gives rise to their Geodesic Monte Carlo (GMC) algorithm. While this provides a mathematically elegant approach, in practice the matrix exponential and the use of separate update rules make GMC difficult to implement in general.
Unfortunately, these recently introduced HMC-based methods rely on specialized HMC update rules which can make them difficult to implement within probabilistic programming frameworks and difficult to tune. In particular, they sample distributions with orthogonal matrix parameters by using different HMC update rules for constrained and unconstrained parameters, requiring additional book-keeping in software to know which update rule to use on which parameter. Many probabilistic programming languages do not keep track of this, as they treat parameters agnostically by transforming to an unconstrained space. Without the automatic tuning of inference algorithm parameters offered in common probabilistic programming frameworks, these algorithms must be tuned manually in practice, as they usually do not provide an algorithm for choosing tuning parameters automatically. Furthermore, the specialization of these methods to HMC makes them difficult to generalize to other inference algorithms based on variational inference (VI) or optimization, which would be more straightforward with a transformation-based approach. We also mention the interesting recent work of Jauch et al. (2018), which addresses related difficulties with a transformation-based approach, but one based instead on the Cayley transform. As in our Givens representation approach, their approach also faces the topological difficulties of mapping the Stiefel manifold to Euclidean space and the challenge of efficiently computing the change-of-measure term. We present ways to grapple with these issues for the Givens representation which we expect apply generally to such methods.

The Givens Representation of Orthogonal Matrices
We discuss the Givens reduction algorithm of numerical analysis and describe the connection between the algorithm and the geometric aspects of the Stiefel manifold. We then describe the Givens representation of orthogonal matrices.

Givens Rotations and Reductions
Given any n × p matrix A, the Givens reduction algorithm is a numerical algorithm for finding the QR factorization of A, i.e. an n × p orthogonal matrix Q and an upper-triangular p × p matrix R such that A = QR. The algorithm works by successively applying a series of Givens rotation matrices so as to "zero out" the elements {A_ij : i > j} of A. These Givens rotation matrices are simply n × n matrices, R_ij(θ_ij), that take the form of an identity matrix except for the (i, i) and (j, j) positions, which are replaced by cos θ_ij, and the (i, j) and (j, i) positions, which are replaced by −sin θ_ij and sin θ_ij respectively. When applied to a vector, R_ij(θ_ij) has the effect of rotating the vector counterclockwise in the (i, j)-plane while leaving the other elements fixed. Intuitively, its inverse, R_ij^{-1}(θ_ij), has the same effect, but clockwise. Thus one can zero out the jth element, u_j, of a vector u by first using the arctan function to find the angle, θ_ij, formed in the (i, j)-plane by u_i and u_j, and then multiplying by the matrix R_ij^{-1}(θ_ij) (Figure 1, inset). In the Givens reduction algorithm, these rotation matrices are applied one by one to A in this way to zero out all elements below the (i, i) elements of the matrix. First, all elements in the first column below the first row are eliminated by successively applying the rotation matrices R_12^{-1}(θ_12), R_13^{-1}(θ_13), ..., R_1n^{-1}(θ_1n) (Figure 2). Because multiplication by R_ij(θ_ij) only affects elements i and j of a vector, once the jth element is zeroed out, the subsequent rotations R_{1,j+1}^{-1}(θ_{1,j+1}), ..., R_1n^{-1}(θ_1n) leave this change unaffected. Similarly, once the first column of A is zeroed out below the first element, the subsequent rotations, which do not involve the first element, leave the first column unaffected. The rotations R_23^{-1}(θ_23), ..., R_2n^{-1}(θ_2n) can thus be applied to zero out the second column while leaving the first column unaffected. Continuing over all p columns results in the upper-triangular matrix R_*:

    R_{p,n}^{-1}(θ_{p,n}) ⋯ R_13^{-1}(θ_13) R_12^{-1}(θ_12) A = R_*.    (3.1)

Crucially, the product of rotations, which we call Q_*^{-1}, is orthogonal, since it is simply the product of rotation matrices, which are themselves orthogonal. Thus its inverse can be applied to both sides of Equation 3.1 to obtain

    A = Q_* R_*.    (3.2)

The familiar QR form can be obtained by setting Q equal to the first p columns of Q_* and setting R equal to the first p rows of R_*. The Givens reduction is summarized in Algorithm 1.
Algorithm 1: Pseudo-code for the Givens reduction algorithm for obtaining the QR factorization of a matrix A.
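To make the reduction concrete, the following is a minimal, unoptimized Python sketch of the algorithm using plain lists (function and variable names are our own; a practical implementation would rotate only the two affected rows in place rather than form full n × n rotation matrices):

```python
import math

def givens_rotation(n, i, j, theta):
    """n x n identity except entries (i,i) and (j,j) hold cos(theta), and
    entries (i,j) and (j,i) hold -sin(theta) and sin(theta): a rotation in
    the (i, j)-plane."""
    R = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    R[i][i] = R[j][j] = c
    R[i][j], R[j][i] = -s, s
    return R

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def givens_reduction(A):
    """Return (Q_star, R_star) with A = Q_star R_star, where Q_star is an
    n x n product of Givens rotations and R_star is zero below the diagonal."""
    n, p = len(A), len(A[0])
    R_star = [row[:] for row in A]
    Q_star = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for i in range(p):
        for j in range(i + 1, n):
            # angle formed in the (i, j)-plane by the (i, i) and (j, i) entries
            theta = math.atan2(R_star[j][i], R_star[i][i])
            # the clockwise rotation R_ij^{-1}(theta) zeroes the (j, i) entry
            R_star = matmul(givens_rotation(n, i, j, -theta), R_star)
            # accumulate the forward rotations so that A = Q_star R_star
            Q_star = matmul(Q_star, givens_rotation(n, i, j, theta))
    return Q_star, R_star
```

Taking Q as the first p columns of Q_star and R as the first p rows of R_star recovers the familiar QR form.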

The Geometry of Orthogonal Matrices
The elements of the Stiefel manifold, V_{p,n}, are known as p-frames. A p-frame is an orthogonal set of p n-dimensional unit-length vectors, where p ≤ n. p-frames naturally correspond to n × p orthogonal matrices, which can be used to define the Stiefel manifold succinctly as

    V_{p,n} := {Y ∈ R^{n×p} : Y^T Y = I_p}.

Geometrically, an element of the Stiefel manifold can be pictured as a set of orthogonal, unit-length vectors that are rigidly connected to one another. A simple case is V_{1,3}, which consists of a single vector, u_1, on the unit sphere. This vector can be represented by two polar coordinates that we naturally think of as longitude and latitude, but it can also be thought of simply as subsequent rotations of the standard basis vector e_1 := (1, 0, 0)^T in the (x, y) and (x, z) planes, which we refer to as the (1, 2) and (1, 3) planes for generality. In mathematical terms, u_1 can be represented as u_1 = R_12(θ_12) R_13(θ_13) e_1 (Figure 1).
Continuing with our geometric interpretation, V_{2,3} can be pictured as a vector in V_{1,3} with a second orthogonal vector, u_2, rigidly attached to it as it moves about the unit sphere. Because this second vector is constrained to be orthogonal to the first, its position can be described by a single rotation about the first vector. Thus elements of V_{2,3} can be represented by three angles: two angles, θ_12 and θ_13, that represent how much to rotate the first vector, and a third angle, θ_23, that controls how much the second vector is rotated about the first (Figure 1). Mathematically this can be represented as the product

    [u_1, u_2] = R_12(θ_12) R_13(θ_13) R_23(θ_23) [e_1, e_2].    (3.3)

Although elements of the Stiefel manifold can be represented by n × p matrices, their inherent dimension is less than np because of the constraints that the matrices must satisfy. The first column must satisfy a single constraint: the unit-length constraint. The second column must satisfy two constraints: not only must it be unit length, but it must also be orthogonal to the first column. The third column must additionally be orthogonal to the second column, giving it a total of three constraints. Continuing in this way reveals the inherent dimensionality of the Stiefel manifold to be

    d := np − p(p + 1)/2.    (3.4)
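The constraint count above can be sketched in a couple of lines of Python (a trivial check; the function names are our own):

```python
def stiefel_dim(n, p):
    """Inherent dimension of V_{p,n}: np matrix entries minus the p unit-length
    constraints and the p(p-1)/2 pairwise orthogonality constraints."""
    return n * p - p * (p + 1) // 2

def stiefel_dim_by_columns(n, p):
    """Equivalent count: column i contributes n - i free parameters."""
    return sum(n - i for i in range(1, p + 1))
```

For V_{2,3} both give 3, matching the three angles θ_12, θ_13, θ_23 of the geometric picture above.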

Obtaining the Givens Representation
The Givens reduction applied to an orthogonal matrix gives rise to a representation of the Stiefel manifold that generalizes the intuitive geometric interpretation described above. When applied to an n × p orthogonal matrix Y, the Givens reduction yields

    R_{p,n}^{-1}(θ_{p,n}) ⋯ R_13^{-1}(θ_13) R_12^{-1}(θ_12) Y = I_{n,p},    (3.5)

where I_{n,p} is defined to be the first p columns of the n × n identity matrix, i.e. the matrix consisting of the first p standard basis vectors e_1, ..., e_p. The first n − 1 rotations transform the first column into e_1, since they zero out all elements below the first and the orthogonal rotations do not affect the length of the vector, which by hypothesis is unit length. Similarly, the next n − 2 rotations leave the length of the second column and its orthogonality to the first column intact because, again, the rotation matrices are orthogonal. Therefore, because the second column must be zero below its second element, it must be e_2 after these n − 2 rotations are applied. Continuing in this way explains the relationship in Equation 3.5. Because Y was taken to be an arbitrary orthogonal matrix, it is clear from Equation 3.5 that any orthogonal matrix Y can be factored as

    Y = R_12(θ_12) R_13(θ_13) ⋯ R_{p,n}(θ_{p,n}) I_{n,p}.    (3.6)

Thus we can consider any orthogonal matrix as a function, Y(Θ), of these angles, effectively parameterizing the Stiefel manifold and yielding the Givens representation. The Givens representation is a smooth representation with respect to the angles Θ (Shepard et al., 2015), and it lines up with the geometric insight discussed in the previous subsection. We also note that the number of angles in the Givens representation corresponds exactly to the inherent dimensionality, d, of the Stiefel manifold.
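The correspondence between an orthogonal matrix and its angles can be sketched as a round trip in Python (zero-based indices and the (i, j, θ) triple bookkeeping are our own conventions, not the paper's notation):

```python
import math

def rot_inplace(M, i, j, theta):
    """Left-multiply M (a list of rows) by R_ij(theta), touching only rows i, j."""
    c, s = math.cos(theta), math.sin(theta)
    for k in range(len(M[0])):
        ui, uj = M[i][k], M[j][k]
        M[i][k] = c * ui - s * uj
        M[j][k] = s * ui + c * uj

def to_angles(Y):
    """Givens reduction of an orthogonal n x p matrix Y: applying the inverse
    rotations in order reduces Y to I_{n,p} and records the angles."""
    n, p = len(Y), len(Y[0])
    M = [row[:] for row in Y]
    angles = []
    for i in range(p):
        for j in range(i + 1, n):
            theta = math.atan2(M[j][i], M[i][i])
            rot_inplace(M, i, j, -theta)  # zero out the (j, i) entry
            angles.append((i, j, theta))
    return angles

def from_angles(angles, n, p):
    """Rebuild Y(Theta) = R_12 R_13 ... R_{p,n} I_{n,p}."""
    Y = [[1.0 if a == b else 0.0 for b in range(p)] for a in range(n)]
    for i, j, theta in reversed(angles):  # rightmost factor acts first
        rot_inplace(Y, i, j, theta)
    return Y
```

Round-tripping an orthogonal matrix through to_angles and from_angles recovers it, provided the angles lie inside the coordinate chart discussed later.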

Using the Givens Representation to Sample Distributions Over the Stiefel Manifold
Using the Givens representation in practice to sample distributions over the Stiefel manifold requires solving several practical challenges. In addition to the standard change-of-measure term required in any transformation of a random variable, care must be taken to address certain pathologies that arise from the differing topologies of the Stiefel manifold and Euclidean space. We describe these challenges and how we overcome them in practice. We also briefly remark on how the Givens representation can be leveraged to define new and useful distributions over the Stiefel manifold. We conclude the section by describing how the computation of the Givens representation scales in theory, particularly in comparison to GMC.

Transformation of Measure Under the Givens Representation
As is usual in any transformation of random variables, care must be taken to include an extra term in the transformed density to account for the change of measure under the transformation. Formally, for a posterior density over orthogonal matrices of the form p_Y(y), the proper density over the transformed random variable takes the form p_Θ(θ) = p_Y(Y(θ)) |J(θ)|, where |J(θ)| is a change-of-measure term (Keener, 2011). Intuitively, this extra term accounts for how probability measures are distorted by the transformation (Figure 3). Usually this term is calculated by simply taking the determinant of the Jacobian of the transformation. Unfortunately, the Givens representation, Y(Θ), is a map from a space of dimension d := np − p(p + 1)/2 to a space of dimension np. Thus its Jacobian is non-square and the determinant of the Jacobian is undefined. One way to compute a change-of-measure term analogous to the Jacobian determinant is to appeal to the algebra of differential forms. Denote the product of n × n rotation matrices in the Givens representation by G, i.e.
    G := R_12(θ_12) R_13(θ_13) ⋯ R_{p,n}(θ_{p,n}),    (4.1)

and denote its jth column by G_j. Muirhead (2009) shows that the proper measure for a signed surface element of V_{p,n} is given by the absolute value of the differential form

    ⋀_{i=1}^p ⋀_{j=i+1}^n G_j^T dY_i.    (4.2)

Figure 3: Under the mapping, regions near the pole are shrunk to regions on the sphere with little area, as opposed to regions near the equator, which the transform maps to much larger areas on the sphere. Intuitively, the change-of-measure term quantifies this proportion of shrinkage in area.
Letting J_{Y_i}(Θ) be the Jacobian of the ith column of Y with respect to the angle coordinates of the Givens representation, this differential form can be written in the coordinates of the Givens representation as

    ⋀_{i=1}^p ⋀_{j=i+1}^n G_j^T J_{Y_i}(Θ) dΘ.    (4.3)

Because this is a wedge product of d d-dimensional elements, Equation 4.3 can be conveniently written as the determinant of the d × d matrix formed by stacking the blocks G_{i+1:n}^T J_{Y_i}(Θ) for i = 1, ..., p (Equation 4.4), where G_{k:l} denotes columns k through l of G. As we show in the Appendix, this term, which would otherwise be expensive to compute, can be analytically simplified to the following simple-to-compute product, whose absolute value serves as our measure adjustment term:

    ∏_{i=1}^p ∏_{j=i+1}^n cos^{j−i−1}(θ_ij).    (4.5)
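In log space, the adjustment term of Equation 4.5 costs only one pass over the angles. A sketch, assuming the angles are stored as zero-based (i, j, θ_ij) triples; note that the latitudinal angles (j = i + 1) have exponent zero and contribute nothing:

```python
import math

def log_measure_adjustment(angles):
    """Log absolute value of the Equation 4.5 product: the sum over i < j of
    (j - i - 1) * log|cos(theta_ij)|."""
    return sum((j - i - 1) * math.log(abs(math.cos(theta)))
               for i, j, theta in angles)
```

For V_{1,3} this reduces to log cos θ_13, the familiar cos(latitude) area element of the sphere.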

Implementation of Angle Coordinates
Two issues related to the topology of the Stiefel manifold arise when using the Givens representation to map densities over the Stiefel manifold to densities over Euclidean space. Let θ_12, θ_23, ..., θ_{p,p+1} range from −π to π. We refer to these specific coordinates as the latitudinal coordinates, to evoke the analogy with the simple spherical case. Similarly, let the remaining coordinates range from −π/2 to π/2; we refer to these as the longitudinal coordinates. Formally, this choice of intervals defines a coordinate chart from Euclidean space to the Stiefel manifold, i.e. the correspondence mapping between these two spaces. While this choice of intervals allows us to represent almost the entire Stiefel manifold in the Givens representation, because the topologies of the two spaces differ, certain connectedness properties of the Stiefel manifold cannot be accurately represented in the Givens representation. For example, when representing V_{1,3} in Euclidean space using the Givens representation, contiguous regions of the manifold on either side of the sliver corresponding to θ_12 = π are disconnected (Figure 4). As we show, this can lead to samples that are not representative of the correct distribution when applying sampling methods such as HMC. To address this, we introduce auxiliary parameters to the Givens representation to better represent the connectedness of the Stiefel manifold in Euclidean space.
In addition to these disconnected regions, the coordinate chart contains singularities where the measure adjustment term (Equation 4.5) approaches zero near the "poles" of the Stiefel manifold, i.e. where the longitudinal coordinates equal −π/2 or π/2. This means that a finite density over the Stiefel manifold that is transformed to a density over Euclidean space using the Givens representation will equal zero at the poles of the Stiefel manifold, regardless of whether the original density assigns zero to those points. On the sphere, this happens at the North and South poles, where the longitudinal coordinate becomes exactly −π/2 or π/2 (Figure 4). In practice, this prevents algorithms such as HMC from obtaining samples in a small region near these points even when there is positive probability mass in these regions under the original density. The reason is that the acceptance ratio used by algorithms such as HMC will always equal zero at these points for finite densities, so proposals at these points will always be rejected. Because of finite numerical precision, this also holds for points on the Stiefel manifold that are numerically near these poles. While in theory this would necessitate a change of coordinate charts near these regions, fortunately, in practice these pathological regions generally have a negligible effect. We present both theoretical analysis and empirical numerical experiments showing the minimal impact of these regions on numerical estimates of expectations obtained using the Givens representation.
Figure 4: Paths over the Stiefel manifold that cross this sliver have no equivalent representation, Θ(t), in the coordinates of the Givens representation. This can become particularly problematic when there is significant probability mass on both sides of the sliver. The grid over the sphere reveals how the Givens representation maps areas that are the same size in the Θ coordinates to smaller and smaller regions on the sphere the closer they are to the poles. Thus the measure adjustment term (Equation 4.5), which measures how the transform changes the area of these infinitesimal regions, goes to zero near the poles, making the transformed density at these points zero regardless of whether the original density assigned zero to them.

Auxiliary Parameters for Addressing Connectedness
Simply allowing the latitudinal coordinates of the Givens representation to range from −π to π leaves regions of parameter space disconnected that should otherwise be connected. For a distribution over the Stiefel manifold that is not sufficiently concentrated away from these disconnected regions, this can lead to highly non-representative samples when naively applying the Givens representation for sampling (Figure 5, upper).
To handle this multimodality introduced by reparameterizing in terms of Givens angles, we introduce for each angle parameter, θ_ij, an independent auxiliary parameter, r_ij. We then transform the density to sample over the (x_ij, y_ij)-space via the transform x_ij = r_ij cos θ_ij and y_ij = r_ij sin θ_ij. In the transformed space, the two ends of the interval are connected, producing samples that are distributed more evenly across the two disconnected regions (Figure 5, lower). Formally, we assign to r a marginal distribution with density p_r(r) so that θ and r are independent and the marginal distribution of θ is left untouched by the introduction of the auxiliary parameter. This leads to the joint density p_{θ,r}(θ, r) = p_θ(θ) p_r(r), which we then transform to the equivalent density over the unconstrained (x, y)-space by the simple change-of-variables formula between two-dimensional polar coordinates and two-dimensional Euclidean coordinates:

    p_{x,y}(x, y) = p_θ(θ(x, y)) p_r(r(x, y)) / r(x, y),  where θ(x, y) = atan2(y, x) and r(x, y) = (x² + y²)^{1/2}.

Figure 5: (Upper) 1,000 samples obtained by sampling over the angle coordinate directly. Because the two ends of the interval are disconnected in this representation, the sampler gets "stuck" in the mode corresponding to the −π side of the interval rather than the π side. (Lower) 1,000 samples from the equivalent distribution sampled over the (x, y)-space. By introducing an auxiliary coordinate, one effectively replicates the topology of a circle, "wrapping" the two ends of the interval so that the sampler avoids getting stuck in one region.
This again leaves the marginal distribution of θ unaffected; however, in the new space, paths θ(t) that cross the region of parameter space at θ = π can actually be represented. In practice, we set p_r(r) to a normal density with mean one and standard deviation 0.1. Although r_ij need not follow this particular distribution to achieve the correct marginal distribution over θ, this choice helps avoid the region of parameter space where r = 0 and the transformed density is ill-defined.
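A sketch of the auxiliary-parameter transform for a single angle, in Python (function names are ours; the Jacobian of the polar-to-Cartesian map contributes the 1/r factor):

```python
import math

def to_xy(theta, r):
    """Map an angle and its auxiliary radius to unconstrained (x, y) coordinates."""
    return r * math.cos(theta), r * math.sin(theta)

def log_p_r(r, mu=1.0, sigma=0.1):
    """Log density of the auxiliary radius: N(1, 0.1^2), as suggested above."""
    return -0.5 * ((r - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_p_xy(x, y, log_p_theta):
    """Transformed joint log density over (x, y); the marginal of theta is
    unchanged, and the polar change of variables contributes -log(r)."""
    theta, r = math.atan2(y, x), math.hypot(x, y)
    return log_p_theta(theta) + log_p_r(r) - math.log(r)
```

A sampler run over (x, y) never sees the −π/π boundary: the two ends of the interval meet smoothly on the circle r ≈ 1.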

Transformation of Densities Near the Poles
Even with the usual change-of-variables formula and the measure adjustment term (Equation 4.5), a finite density over the Stiefel manifold that is transformed to a density over Euclidean space using the Givens representation will not be completely equivalent to the original density. In particular, when any of the longitudinal angles has absolute value equal to π/2, Equation 4.5 will equal zero.
Thus the transformed density will be zero at these points even when the original density is non-zero there. Because of finite numerical precision, in practice this creates a region of the Stiefel manifold that cannot be sampled by algorithms such as HMC, despite having positive probability mass under the original distribution. Specifically, because a computed numerical density will be zero at values numerically near the poles, the acceptance ratio in HMC will always be zero at these points, so proposals in this region will always be rejected. This effectively blocks off a portion of parameter space by limiting all longitudinal angles to the region [−π/2 + ε, π/2 − ε], where ε is a small value on the order of numerical precision. While a change of coordinate chart could be utilized, we show that in practice the exceedingly small volume of this region mitigates its effect on numerical samplers.
For general n and p, the volume of this blocked-off region is O(pε²). First note that the uniform density over V_{p,n} in the Givens representation is simply a constant times the absolute value of Equation 4.5. Since this density factors into a product of independent terms, the longitudinal angles are independent of one another. The probability that at least one longitudinal angle falls within the ε-region is thus equal to the sum of the individual probabilities of each angle falling within the region, plus higher-order terms. Each of these individual probabilities is proportional to cos^{j−i−1} θ_ij, which for small ε can be bounded by ε^{j−i−1} over the interval [π/2 − ε, π/2]. Thus the probability of falling within the ε-region is bounded by a constant times the following quantity:

    Σ_{i=1}^p Σ_{j=i+2}^n ε^{j−i−1} · ε ≤ p ε² / (1 − ε).    (4.7)

Because this quantity falls off with the square of ε, even for modestly small ε the probability of a uniformly sampled point falling within the ε-region is small. Empirical results further illustrate this. For various values of n, p, and ε, we drew 100,000 samples uniformly from the Stiefel manifold by sampling the elements of an n × p matrix from a standard normal distribution and then taking the QR factorization of this matrix, a common technique for uniformly sampling the Stiefel manifold (Muirhead, 2009). We then converted these samples into their Givens representation and counted the samples that had any longitudinal angle within the region [−π/2, −π/2 + ε] or the region [π/2 − ε, π/2]. The results are closely explained by Equation 4.7. In particular, the proportion of samples that fell within this region does not change much for fixed p and increasing n. Furthermore, the proportion increases linearly with p, and it decreases quadratically with ε (Table 1).
For non-uniform distributions with a probability density p(Θ) that is finite in the ε-region, the probability of any of the longitudinal angles falling within the region can again be bounded by a constant times O(pε²). We took 100,000 samples from the von Mises-Fisher distribution over V_{1,3} with parameters µ = (0, 0, 1) and κ = 1, 10, 100, and 1000 using the simulation method of Wood (1994), as implemented in the R package Rfast. For fixed κ, the probability of a sample falling in the ε-region drops off with the square of ε, as the bound suggests. This holds true even when probability mass is highly concentrated near these regions (Table 2), although for distributions that are highly concentrated near the poles we advise users to conduct a similar sensitivity analysis for their particular case.

Table 1: The number of uniform samples out of 100,000 that fell within the ε-region for various values of n, p, and ε. Samples are taken uniformly from the Stiefel manifold using the QR factorization method. As the theoretical bound suggests, the number of samples falling in this region increases modestly for fixed p and increasing n; it increases linearly with p, and it decreases quadratically with ε. In particular, whenever ε is halved, the number of samples falling within the region decreases to about a fourth. We also note that for ε = 1e−5, the value we used for most of our experiments, the number of samples falling within the region was zero for all settings.

Table 2: The number of samples from a von Mises-Fisher distribution with µ = (0, 0, 1) and κ = 1, 10, 100, and 1000 that fell within the ε-region for various values of ε. For each value of κ, 100,000 total samples were taken. As the theoretical bound suggests, the number of samples that fall within the region decreases with the square of ε, so that even for a modestly small value of ε = 1e−5, none of the 100,000 samples fall within this region, even in the highly concentrated case (κ = 1000).
Because the probability of samples falling within the region falls with the square of ε, even for modestly small ε the distribution of derived quantities of Y(Θ) remains largely unaffected when sampling using the Givens representation. We sampled the von Mises-Fisher distribution with the same values of κ using the Givens representation in Stan, but with the longitudinal angles deliberately limited to the interval [−π/2 + ε, π/2 − ε] with ε = 0.1, 0.05, 0.025, 0.0125, and 1e−5. We then examined histograms and expectations of the principal angle, arccos(µ^T Y), which represents the angle between the sample and the direction at the pole, which lies directly in the middle of the ε-region. For large ε and large κ, the lack of samples from the ε-region is evident when compared with samples drawn using the method of Wood (1994), since the sample cannot get close enough to µ. However, for any fixed κ, as ε is decreased, the number of samples that fall within the ε-region decreases rapidly, as the bound would suggest (Figure 6). Table 3 illustrates this effect numerically using the expectation of the principal angle.

Figure 6: Histograms of the principal angle, arccos(µ^T Y), sampled under the von Mises-Fisher distribution with µ = (0, 0, 1) and κ = 1, 10, 100, and 1000 using the Givens representation in Stan with various sizes ε of the blocked-off area, and using the method of Wood (1994). For large ε and large κ, the lack of samples from the ε-region under the Givens representation is evident in the histograms. In particular, despite the large amount of mass near zero, the number of samples is never larger than a bound that is dictated by ε. As ε decreases, the bound rapidly becomes negligible because of the quadratic relationship between ε and the volume of the area. Thus for ε = 1e−5, the histograms of the Givens representation method and the Wood (1994) method become indistinguishable.
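The quadratic scaling in ε is easy to reproduce in a few lines; the sketch below (our own code, restricted to V_{1,3} for simplicity) draws uniform points on the sphere from normalized Gaussians, where the ε-region probability has the closed form 1 − cos ε ≈ ε²/2:

```python
import math
import random

def frac_in_eps_region(eps, n_samples=100_000, seed=1):
    """Monte Carlo estimate of the probability that a uniform sample from
    V_{1,3} has its longitudinal angle theta_13 within eps of +/- pi/2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        s = math.sqrt(sum(c * c for c in v))
        y1, y2, y3 = (c / s for c in v)
        # After the (1, 2)-plane rotation zeroes y2, the longitudinal angle is
        # theta_13 = atan2(y3, sqrt(y1^2 + y2^2)), which lies in [-pi/2, pi/2].
        theta13 = math.atan2(y3, math.hypot(y1, y2))
        if abs(theta13) > math.pi / 2.0 - eps:
            hits += 1
    return hits / n_samples
```

Halving ε roughly quarters the estimated probability, matching the behavior reported in Table 1.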

Specifying Distributions Using the Givens Representation
So far, we have focused on transforming densities defined in terms of the canonical orthogonal matrix coordinates, Y, into densities specified in terms of the angles, Θ, of the Givens representation. However, the angles of the Givens representation can also be used to define new distributions over the Stiefel manifold that may be useful in modeling. In fact, using the intuition described in Section 3.2, the sign and magnitude of the angle θ_ij of the Givens representation roughly correspond to the sign and magnitude of the (i, j) element of Y. Thus one can create, for example, sparse priors over the Stiefel manifold by placing sparse priors over the angles of the Givens representation.
Cron and West (2016) utilize sparsity-promoting priors over the coordinates of the Givens representation to produce a prior distribution over covariance matrices that favors sparse matrices. Specifically, they describe a model for multivariate Gaussian observations with an unknown covariance matrix. They parameterize the covariance matrix in terms of its eigen-decomposition Y ΛY^T, then parameterize the orthogonal matrix Y using the Givens representation. They then place spike-and-slab mixture priors over the angles of the Givens representation, placing significant prior mass on orthogonal matrices whose Givens angles are mostly zero and which are thus sparse in the canonical coordinates. They describe a custom reversible-jump-based method for sampling the resulting posterior distribution.

Table 3: The empirical expectation of the principal angle, arccos(µ^T Y), sampled under the von Mises Fisher distribution with µ = (0, 0, 1) and κ = 1, 10, 100, and 1000 using the Givens representation in Stan with various sizes ε of the excluded region, and using the method of Wood (1994). As ε decreases, the empirical expectation computed using the Givens representation becomes much closer to that computed via the method of Wood (1994). For small κ the expectations do not differ much even for large ε, because much less mass concentrates near the ε-regions.
Our Givens representation approach provides a routine and flexible way to sample the posterior distribution associated with this model and other more general models using common probabilistic modeling frameworks.In Section 5, we illustrate this with a sparse PPCA example motivated by Cron and West (2016) that places truncated horseshoe priors on the angles of the Givens representation.

Computational Scaling of the Givens Representation
The primary computational cost in computing the Givens representation is the series of d rotation matrix multiplications applied to I_{n,p} in Equation 3.6. Fortunately, unlike dense matrix multiplication, applying a Givens rotation to an n × p matrix only involves two vector additions of size p (Algorithm 2). Thus, since d scales on the order of np, computation of the Givens representation in aggregate scales as O(np²). In comparison, GMC involves an orthogonalization of an n × p matrix, which scales as O(np²), and a matrix exponential computation, which scales as O(p³). We note, however, that this comparison is somewhat complicated by the gradient of the log probability computation that is required by both methods. When using the Givens representation in a framework such as Stan, this gradient is computed internally using reverse-mode automatic differentiation. Meanwhile, GMC requires user-provided, analytically-known gradients of the log density. These analytically-derived gradients are typically faster to compute than applying reverse-mode automatic differentiation, but this of course limits the types of densities to which GMC can be applied.
Algorithm 2: Pseudo-code for obtaining the orthogonal matrix Y from the Givens representation, as well as appropriately adjusting the log of the posterior density.
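As a concrete illustration of the forward map in Algorithm 2, the following is a minimal NumPy sketch of building Y from the angles, applying each rotation as two length-p row updates rather than a dense n × n multiplication. The function names and the exact ordering of the rotations are our own assumptions for illustration; the paper's Stan implementation may differ in these details.

```python
import numpy as np

def apply_givens(Y, i, j, theta):
    """Left-multiply Y by the rotation R_ij(theta) in place.
    Only rows i and j change: two vector operations of length p."""
    c, s = np.cos(theta), np.sin(theta)
    ri, rj = Y[i].copy(), Y[j].copy()
    Y[i] = c * ri - s * rj
    Y[j] = s * ri + c * rj

def givens_to_matrix(thetas, n, p):
    """Map the d = np - p(p+1)/2 Givens angles to an n x p matrix
    with orthonormal columns, starting from I_{n,p}."""
    Y = np.eye(n, p)  # I_{n,p}
    pairs = [(i, j) for i in range(p) for j in range(i + 1, n)]
    # apply the rotations in the reverse of the reduction order
    for (i, j), th in zip(pairs[::-1], np.asarray(thetas)[::-1]):
        apply_givens(Y, i, j, th)
    return Y
```

Regardless of ordering conventions, the result is a product of rotations applied to I_{n,p}, so the columns of the returned Y are orthonormal by construction.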
In practice, sampling efficiency will depend on several factors, including the size of the orthogonal matrix being sampled, making it difficult to generally recommend one method over the other in terms of efficiency.For uniform sampling of the Stiefel manifold, we find that GMC scales better when p is much smaller than n, whereas the Givens representation scales better when p is large and closer to n.We present benchmarks comparing the two methods on orthogonal matrices of various sizes in Section 5.

Experiments
We demonstrate the use of the Givens representation for uniformly sampling the Stiefel manifold as well as on several statistical examples. All Givens representation experiments were conducted in Stan using the automatic warm-up and tuning options. For all Stan experiments, we ensured that there were no divergences during post-warmup sampling and that all R̂ values were 1.01 or below. The presence of divergences suggests that the sampler may not be visiting areas of the posterior distribution that contain positive mass (Betancourt and Girolami, 2015), while R̂ tests for convergence of the Markov chain to the stationary distribution (Gelman et al., 1992). All timing experiments were conducted on a 2016 MacBook Pro.

Uniform Sampling on the Stiefel Manifold
We sample uniformly from Stiefel manifolds of various sizes to assess the practical scalability of the Givens representation. We compare its sampling efficiency and R̂ values to GMC using 500 post-warmup samples from each method (Table 4). We chose the step size tuning parameter of GMC by manually trying several possible values, then selecting the value that produced the highest number of effective samples per second over 500 samples. As mentioned in Section 4.1, to uniformly sample the Stiefel manifold in the Givens representation, the change-of-measure term, Equation 4.5, must be computed as part of the density. Meanwhile, uniform sampling over the Stiefel manifold is achieved in GMC simply by using a constant density, because the method uses the canonical matrix coordinates. However, as mentioned in Section 4.4, this comes at the cost of an expensive HMC update to ensure that the updated parameter still satisfies the constraints. In practice, we find that GMC scales better as n is increased, although the approach using the Givens representation in Stan remains competitive (Figure 7). For small values of n the Givens representation approach in Stan produces more effective samples per second, while for larger values GMC scales better since the primary cost of the matrix exponential remains constant.
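For reference, assuming the change-of-measure term of Equation 4.5 takes the product-of-cosines form ∏_{i=1}^{p} ∏_{j=i+1}^{n} cos^{j−i−1}(θ_ij) (our reading, consistent with the closely related expression of Khatri and Mardia, 1977), its log can be accumulated in O(d) time alongside the rotations:

```python
import numpy as np

def log_stiefel_measure(thetas, n, p):
    """Log change-of-measure term for the Givens representation,
    assuming the product-of-cosines form of Equation 4.5.
    thetas are ordered as theta_{ij} for i = 1..p, j = i+1..n."""
    lp = 0.0
    k = 0
    for i in range(1, p + 1):
        for j in range(i + 1, n + 1):
            # each angle contributes (j - i - 1) * log cos(theta_{ij})
            lp += (j - i - 1) * np.log(np.cos(thetas[k]))
            k += 1
    return lp
```

For the sphere V_{1,3} this reduces to log cos(θ_13), the familiar latitude correction for spherical coordinates.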

Probabilistic Principal Component Analysis (PPCA)
Factor Analysis (FA) and PPCA (Tipping and Bishop, 1999) posit a probabilistic generative model where high-dimensional data is determined by a linear function of some low-dimensional latent state (Murphy, 2012, Chapt. 12). Geometrically, for a three-dimensional set of points forming a flat pancake-like cloud, the orthogonal matrix corresponding to this linear function can be thought of as a 2-frame that aligns with this cloud (Figure 8). Formally, PPCA posits the following generative process for how a sequence of high-dimensional data vectors x_1, ..., x_N ∈ R^n arises from low-dimensional latent states z_i ∈ R^p:

z_i ∼ N(0, I_p), x_i = W Λ z_i + σ ε_i, ε_i ∼ N(0, I_n). (5.1)

To ensure identifiability, W is constrained to be an orthogonal n × p matrix, while Λ is a diagonal matrix with positive, ordered elements. Because x_i is a linear transformation of a multivariate Gaussian, z_i can be integrated out of the model 5.1 to yield the simplified formulation

x_i ∼ N(0, W Λ² W^T + σ² I_n). (5.2)

Letting S = (1/N) Σ_i x_i x_i^T denote the empirical covariance matrix and C = W Λ² W^T + σ² I_n, this yields the simplified PPCA likelihood

p(x_1, ..., x_N | W, Λ, σ²) ∝ |C|^{−N/2} exp(−(N/2) tr(C^{−1} S)). (5.4)

Traditional PCA corresponds to the closed-form maximum likelihood estimator for W in the limit as σ² → 0, providing no measure of uncertainty for this point estimate. Furthermore, for more elaborate models, the analytical form of the maximum-likelihood estimator is rarely known. Sampling the posterior of a model both provides a measure of uncertainty for parameter estimates and is possible even for more elaborate models. We used the Givens representation in Stan to sample the posterior distribution of the parameters in model 5.2 from a simulated dataset with n = 50 and p = 3. For Λ and σ² we chose uniform priors over the positive real line, and for W we chose a uniform prior over the Stiefel manifold, yielding the unnormalized posterior density in the canonical coordinates, or, in the Givens representation,

p(Θ, Λ, σ² | x_1, ..., x_N) ∝ p(x_1, ..., x_N | W(Θ), Λ, σ²) ∏_{i=1}^{p} ∏_{j=i+1}^{n} cos^{j−i−1}(θ_ij). (5.6)

The latter term comes from Equation 4.5.
For the true values of the parameters we used the settings used by Jauch et al. (2018) in their experiments section: Λ² = diag(5, 3, 1.5), σ² = 1, and W drawn uniformly from V_{3,50}. We took 10,000 samples using Stan's NUTS algorithm with default settings. Table 5 shows the posterior quantiles along with R̂ and n_eff values for Λ² and σ². Like Jauch et al. (2018), we plot histograms of the principal angle

arccos(W_j^T E_j), j = 1, 2, 3, (5.7)

between the columns, W_j, of posterior draws of W and the columns, E_j, of the first three eigenvectors of Σ (Figure 9). We also plot the true values of W used in the simulation along with the 90% credible intervals, computed from posterior samples, of the marginal posterior distributions of the elements of W (Figure 10).
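The collapsed likelihood 5.4 is inexpensive to evaluate from the empirical covariance alone. Below is a sketch, up to an additive constant, with function and argument names of our own choosing:

```python
import numpy as np

def ppca_loglik(W, lam2, sigma2, S, N):
    """Log of the collapsed PPCA likelihood (5.4), up to a constant.
    W: n x p matrix with orthonormal columns
    lam2: length-p vector, the diagonal of Lambda^2
    sigma2: noise variance
    S: n x n empirical covariance matrix
    N: number of observations"""
    n = W.shape[0]
    C = W @ np.diag(lam2) @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)          # log |C|
    trace_term = np.trace(np.linalg.solve(C, S))  # tr(C^{-1} S)
    return -0.5 * N * (logdet + trace_term)
```

In a sampling context one would add to this the log prior and, in the Givens representation, the log of the change-of-measure term.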

Sparse PPCA
To illustrate the utility of placing priors over the angle parameters, Θ, of the Givens representation, we fit a PPCA model with sparse priors over Θ to simulated data generated from 5.1 with the same parameter settings as in the previous example, but with W replaced with a sparse matrix. To generate a sparse orthogonal matrix, we drew an orthogonal matrix uniformly from V_{3,50}, converted the result to the Givens representation, randomly set each angle to zero with probability 0.8, then converted the result back to the canonical representation. The result was an orthogonal V_{3,50} matrix, W, with 85% of its elements equal to zero. For our model, we used the standard PPCA likelihood 5.4 with uniform priors over Λ² and σ², but rather than placing a prior over Θ to make W uniform over the Stiefel manifold a priori, we set Θ to follow the regularized horseshoe prior of Piironen et al. (2017), a commonly used sparsity-inducing prior, with hyper-parameters set to τ_0 = 0.01, ν = 10, and s = π/4 following the guidelines of Piironen et al. (2017). We took 10,000 posterior draws in Stan using the resulting unnormalized posterior density (5.9).

Table 5: Posterior quantiles, R̂, and n_eff values for Λ² and σ² computed over 10,000 posterior draws.

Table 6 shows a posterior summary for the sparse model versus the non-sparse model corresponding to the density 5.6. While the marginal posterior distributions of Λ² and σ² are similar for both models, the sparse model expectedly results in a much sparser posterior distribution over Θ and thus W (Figure 11).
In particular, for elements of W that are truly zero, the marginal posterior distributions of the sparse model tend to concentrate much closer to zero, while for truly non-zero elements, the sparse model is able to concentrate posterior mass away from zero.
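The sparse-matrix construction described above can be sketched as follows. For brevity this draws the Givens angles directly rather than converting a uniform orthogonal draw to its Givens representation first, so it illustrates the angle-zeroing step rather than reproducing the simulation exactly; all names are our own.

```python
import numpy as np

def rotate(Y, i, j, th):
    """Apply the Givens rotation R_ij(th) to rows i and j of Y."""
    c, s = np.cos(th), np.sin(th)
    Y[[i, j]] = np.array([c * Y[i] - s * Y[j], s * Y[i] + c * Y[j]])

def sparse_stiefel_sample(n, p, p_zero=0.8, seed=0):
    """Draw Givens angles, zero each with probability p_zero, and map
    the result to a sparse n x p matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(p) for j in range(i + 1, n)]
    th = rng.uniform(-np.pi / 2, np.pi / 2, len(pairs))
    th[rng.random(len(pairs)) < p_zero] = 0.0  # sparsify the angles
    Y = np.eye(n, p)
    for (i, j), t in zip(pairs[::-1], th[::-1]):
        rotate(Y, i, j, t)
    return Y
```

Zeroed angles leave their rotation equal to the identity, which is what produces exact zeros in the canonical coordinates.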

The Network Eigenmodel
We used the Givens representation to sample from the posterior distribution of the network eigenmodel of Hoff (2009), which was also illustrated in Byrne and Girolami (2013). We compared the results obtained from using the Givens representation in Stan to results obtained from GMC. The data used in those works, originally described in Butland et al. (2005), consists of a symmetric graph matrix, Y, of dimension 270 × 270. However, for our experiments we use a smaller 230 × 230 version of the dataset, as we were unable to find access to the larger version. The version we used is freely available in the R package eigenmodel. For our GMC experiments we used the same tuning parameters as Byrne and Girolami (2013). The graph matrix encodes whether the proteins in a protein network of size n = 230 interact with one another. The probability of a connection between all combinations of proteins can be described by the lower-triangular portion of a symmetric matrix of probabilities; however, the network eigenmodel uses a much lower dimensional representation of this connectivity matrix. Specifically, given an orthogonal matrix U, a diagonal matrix Λ, and a scalar c, and letting Φ(•) represent the probit link function, the model is described as follows:

c ∼ N(0, 10²), (5.10)

with independent zero-mean Gaussian priors on the diagonal elements of Λ (5.11), and

Y_ij ∼ Bernoulli(Φ(c + [U Λ U^T]_ij)), i > j. (5.12)

For U we specified a uniform prior over the Stiefel manifold, which again in the Givens representation corresponds to a prior density that is the absolute value of the change-of-measure term. The Stan implementation using the Givens representation took approximately 300 seconds to collect 1,000 samples, 500 of which were warmup. In contrast, GMC took 812 seconds to draw the same 1,000 samples using the hyperparameter values specified in Byrne and Girolami (2013). Figure 12 compares traceplots for c, Λ, and the elements of the top row of U for the 500 post-warmup samples from each sampler. Computed R̂ and n_eff values for these parameters are shown in Table 7.
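Assuming the Bernoulli-probit form of the eigenmodel likelihood described above (our reading of 5.12), the log-likelihood over the lower triangle of the graph can be computed as follows; the helper names are our own:

```python
import numpy as np
from math import erf

def probit(x):
    """Standard normal CDF Phi(x), elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

def eigenmodel_loglik(Ygraph, U, lam, c):
    """Log-likelihood of the network eigenmodel, assuming
    Y_ij ~ Bernoulli(Phi(c + [U diag(lam) U^T]_ij)) for i > j.
    Ygraph: n x n symmetric 0/1 adjacency matrix
    U: n x p matrix with orthonormal columns, lam: length-p, c: scalar"""
    eta = c + U @ np.diag(lam) @ U.T
    P = probit(eta)
    i, j = np.tril_indices(Ygraph.shape[0], k=-1)  # strict lower triangle
    y, p = Ygraph[i, j], P[i, j]
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

Only the strict lower triangle contributes, since the graph is symmetric and self-edges are excluded.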

Discussion
We have shown how the Givens representation can be used to develop sampling methods for distributions over the space of orthogonal matrices. We developed approaches for grappling with issues posed by the Stiefel manifold topology and metric. We showed how an auxiliary parameter approach and analytic results for the density measure adjustment terms can be used to develop efficient computational methods. We also showed how our Givens representation approach can be used to specify distributions over the space of orthogonal matrices. We then demonstrated our methods in practice on several examples, making comparisons with other approaches. We expect our Givens representation methods to be applicable to a wide class of statistical models with orthogonal matrix parameters and to help further facilitate their use in modern probabilistic programming frameworks.
We derive the simplified form (Expression 4.5) of the differential form (Expression 4.2). We point out that Khatri and Mardia (1977) provide a similar expression for a slightly different representation, but do not offer a derivation. We start with the determinant of the matrix form of the change-of-measure term from Expression 4.4. In the new notation, this can be written in a block matrix form in which the block matrices above the diagonal are all zero. This can be seen by observing that the rotations in the Givens representation involving elements greater than i do not affect e_i. To simplify the diagonal block determinant terms in Expression A.6, we take advantage of the fact that the terms R_p^T · · · R_{i+1}^T have no effect on the determinant (A.8). This can be seen by first separating terms (A.10) and then noticing that R_{i+1} · · · R_p affects only the first i columns of the identity matrix (A.11). Now consider the k, l element of the (n − i) × (n − i) block matrix, which can be written as in (A.13). Since e_{i+k}^T R_i^T R_i e_i = 0, taking derivatives of both sides and applying the product rule yields (A.14). Combining (A.14) with (A.13) gives the expression (A.15) for the k, l element, whose partial derivative with respect to θ_{i,i+l} is zero when k > l. Thus it is apparent that the vector −∂_{i,i+k} R_i e_{i+k} is zero up to the ith entry and after the (i + k)th entry.

Fig 1.
Fig 1. (Inset) Givens rotations can be used to rotate a vector so as to eliminate its component in a certain direction. (Main Figure) A p-frame on the Stiefel manifold can be visualized as a set of rigidly connected orthogonal basis vectors, u_1 and u_2, shown here in black. One can move about the Stiefel manifold and describe any p-frame by simultaneously applying rotation matrices of a prescribed angle to these basis vectors. Applying the rotation matrix R_12(θ_12) corresponds to rotating the two basis vectors together in the (1, 2)-plane, which by our convention is the (x, y)-plane. Similarly, simultaneously applying R_13(θ_13) corresponds to a rotation of the 2-frame in the (1, 3) or (x, z)-plane, while R_23(θ_23) corresponds to rotating u_2 about u_1.

Fig 2.
Fig 2. The Givens reduction eliminates lower diagonal elements of an n × p matrix one column at a time. Because each rotation, R_ij(θ_ij), only affects rows i and j, previously zeroed-out elements do not change.

Fig 3.
Fig 3. Uniform sampling in the Givens representation coordinates does not necessarily lead to uniform sampling over the Stiefel manifold without the proper measure adjustment term. Under the mapping, regions near the pole are shrunk to regions on the sphere with little area, as opposed to regions near the equator, which the transform maps to much larger areas on the sphere. Intuitively, the change-of-measure term quantifies this shrinkage in area.

Fig 4.
Fig 4. The angular coordinate chart has an infinitesimal sliver of measure zero lying at θ_12 = −π = π that separates two otherwise connected parts of the sphere. Trajectories Y(t) over the Stiefel manifold that cross this sliver have no equivalent representation, Θ(t), in the coordinates of the Givens representation. This can become particularly problematic when there is significant probability mass on both sides of the sliver. The grid over the sphere reveals how the Givens representation maps areas that are the same size in the Θ coordinates to smaller and smaller regions on the sphere the closer they are to the poles. Thus the measure adjustment term (Equation 4.5), which measures how the transform changes the area of these infinitesimal regions, goes to zero near the poles, making the transformed density at these points zero regardless of whether it was assigned zero by the original density.
Fig 5. (Upper) 1,000 samples from a von Mises distribution with parameters µ = −π and κ = 5, sampled over the space θ ∈ [−π, π] using Stan. Most of the mass of the distribution is concentrated at the ends of the interval, while little mass is concentrated towards the middle. Because these two ends of the interval are disconnected in this representation, the sampler gets "stuck" in the mode corresponding to the −π side of the interval rather than the π side. (Lower) 1,000 samples from the equivalent distribution sampled over the (x, y)-space. By introducing an auxiliary coordinate, one can replicate the topology of a circle, effectively "wrapping" the two ends of the interval so that the sampler avoids getting stuck in one region.

Fig 6.
Fig 6. Histograms of the principal angle, arccos(µ^T Y), sampled under the von Mises Fisher distribution with µ = (0, 0, 1) and κ = 1, 10, 100, and 1000 using the Givens representation in Stan with various sizes ε of the excluded region, and using the method of Wood (1994). For large ε and large κ, the lack of samples from the ε-region in the former method is evident in the histograms. In particular, despite the large amount of mass near zero, the number of samples is never larger than a bound that is dictated by ε. As ε decreases, the bound rapidly becomes negligible because of the quadratic relationship between ε and the volume of the ε-region. Thus for ε = 1e−5, the histograms of the Givens representation method and the Wood (1994) method become indistinguishable.

Fig 7.
Fig 7. For small values of n, the Givens representation approach in Stan produces more effective samples per second, while for larger values GMC scales better, since the primary cost of the matrix exponential remains constant.

Fig 8.
Fig 8. Three-dimensional data generated from Equation 5.1 with p = 2, W = I_{3,2}, Λ_11 = 2, Λ_22 = 1, and σ = 1. The maximum a posteriori (MAP) estimate of PPCA corresponds to a single orthogonal matrix in the Stiefel manifold that is closest, in terms of average squared distance, to the set of points. When there are few data points relative to the size of the matrix, this point estimate can often have high variance.

Fig 9.
Fig 9. Histograms of the principal angles (in radians) between posterior samples of W and the first three eigenvectors of Σ.

Fig 10.
Fig 10. True values of W used in the simulation along with 90% credible intervals computed using draws of the posterior. Each facet corresponds to one of the three columns of W.

Fig 11.
Fig 11. True values of W used in the simulation along with 80% posterior credible intervals computed using 10,000 draws of the posterior from the sparse and non-sparse models, respectively. Compared to the non-sparse model, the posterior distribution of the sparse model places much more posterior mass close to zero for values that are truly zero, while concentrating mass away from zero for truly non-zero values.

Fig 12.
Fig 12. Traceplots of samples from the Givens representation implementation in Stan and GMC. For brevity, only the top three elements of U are shown.
For l = 1, · · · , n, we define the shorthand notation of (A.5). Thus for j > i, ∂_j Y_i = 0, and the determinant of Expression A.4 simplifies to the product of the determinants of the matrices on the diagonal, i.e.

∏_{i=1}^{p} det(G_{i+1:n}^T ∂_i Y_i). (A.6)

A.1. Simplifying Diagonal Block Terms

Let I_i denote the first i columns of the n × n identity matrix and let I_{−i} represent the last n − i columns. The term G_{i+1:n}^T in Expression A.6 can be written with zeros above the diagonal, so that det(I_{−i}^T R_i^T · · · R_1^T ∂_i Y_i) is simply the product of the diagonal elements of the matrix.

A.2. Diagonal Elements of the Block Matrices

To obtain the diagonal terms of the block matrices, we directly compute −∂_{i,i+l} e_{i+k}^T R_i^T for l = k, R_i e_i, and their inner product. Defining D_ij := ∂_ij R_ij,

−∂_{i,i+k} R_i e_{i+k} = −∂_{i,i+k} (R_{i,i+1} · · · R_{i,i+k} e_{i+k}) (A.16)
= −R_{i,i+1} · · · R_{i,i+k−1} D_{i,i+k} e_{i+k}.

Table 4: R̂ and n_eff values averaged over all elements of the matrix parameter Y.

Table 6: R̂ and n_eff values averaged over all elements of the matrix parameter Y.

Table 7: R̂ and n_eff values for the parameters in the network eigenmodel. For brevity, only three of the matrix parameters are shown.