Bayesian Estimation of Correlation Matrices of Longitudinal Data

Estimation of correlation matrices is a challenging problem due to the notorious positive-definiteness constraint and high dimensionality. Reparameterizing the Cholesky factor of a correlation matrix in terms of angles or hyperspherical coordinates, where the angles vary freely in the range [0, π), has become popular in the last two decades. However, it has not been used in Bayesian estimation of correlation matrices, perhaps due to the lack of clear statistical relevance and of suitable priors for the angles. In this paper, we show for the first time that for longitudinal data these angles are the inverse cosines of the semi-partial correlations (SPCs). This simple connection makes it possible to introduce physically meaningful selection and shrinkage priors on the angles or correlation matrices, with emphasis on selection (sparsity) and shrinking towards a longitudinal structure. Our method deals effectively with the positive-definiteness constraint in posterior computation. We compare the performance of our angle-based Bayesian estimation with some recent methods based on partial autocorrelations through simulation, and apply the method to data from a smoking cessation clinical trial.


Introduction
Covariance and correlation matrices play a fundamental role in every aspect of multivariate statistics (Anderson, 2003). Flexible modeling and estimation of correlation matrices are daunting tasks due to (1) the positive-definiteness constraint, (2) the number of unknown elements growing quadratically with the dimension and (3) the diagonal entries being fixed and equal to one. More specifically, modeling the individual elements directly requires repeatedly checking the positive-definiteness of the estimated matrix in an iterative model-fitting procedure. Many strategies involving matrix decompositions, such as the spectral and Cholesky decompositions and factor models, have been developed to circumvent the notorious positive-definiteness constraint (Chiu et al., 1996; Pinheiro and Bates, 1996; Pourahmadi, 1999; Fan et al., 2008). Unconstrained parameterizations of correlation matrices using angles and partial autocorrelations have been around for a while (Pinheiro and Bates, 1996; Joe, 2006; Rapisarda et al., 2007), where the partial autocorrelations and angles, as new parameters, vary freely in the ranges [−1, 1] (Joe, 2006) and [0, π) (Pinheiro and Bates, 1996; Pourahmadi and Wang, 2015; Tsay and Pourahmadi, 2017), respectively.
In Bayesian covariance or precision matrix estimation, the inverse-Wishart distribution and its various modifications have been widely used due to their mathematical simplicity. The separation strategy of Barnard et al. (2000) writes a covariance matrix as Σ = SRS and assumes independent priors for the diagonal matrix of standard deviations S and the correlation matrix R. Specifically, they considered two priors: the first is the marginally uniform prior for r_ij, obtained as the marginal distribution on R induced by a standard inverse-Wishart prior on Σ with specific choices of hyperparameters; the second is the jointly uniform prior of the form p(R) ∝ 1 over R_k, where R_k denotes the space of all valid correlation matrices of dimension k. In the context of graphical model selection, Kundu et al. (2019) used a regularized inverse-Wishart prior on Σ and showed its connection to shrinkage in Σ^{−1} through equivalence with a Cholesky-based regularization approach. In this article, owing to the separation strategy, we restrict our attention to the estimation of a correlation matrix for ordered data rather than the covariance matrix. We address some Bayesian modeling and inferential challenges in estimating a correlation matrix by introducing suitable priors on the angles, which go beyond the traditional use of the inverse-Wishart distribution and the marginally and jointly uniform priors of Barnard et al. (2000).
Structured correlation matrices play an important role in many applications. Liechty et al. (2004) considered various block-structured correlation matrices, such as the common correlation model, the grouped correlation model and the grouped variables model, and the priors on R were restricted to the set R_k of positive-definite correlation matrices. For example, in the common correlation model the prior on R is p(R | μ, σ²) = C(μ, σ²) ∏_{i<j} exp{−(r_ij − μ)²/(2σ²)} I{R ∈ R_k}, where C(μ, σ²) is the normalizing constant, with normal and inverse gamma distributions for μ and σ², respectively. Due to the lack of conjugacy in the prior model, they used a Metropolis-Hastings (MH) algorithm for posterior inference. However, the presence of the indicator function I{R ∈ R_k} makes the computation very expensive: the proposal density for a particular element r_ij in every iteration of the MH step must be restricted to an interval [l_ij, u_ij], where these limits are quadratic functions of the rest of the elements of R; see (Barnard et al., 2000, p. 1305) for more details and explicit formulae for l_ij and u_ij.
The challenges of dealing with the positive-definiteness and unit diagonal constraints of a correlation matrix have led to two unconstrained reparameterizations of R for ordered or longitudinal data. The reparameterization in Joe (2006) is based on the notion of partial autocorrelation function (PAC), or the correlation coefficient between two variables given the intermediate variables.
The key advantages of using PACs are avoiding the positive-definiteness constraint and providing interpretable parameters. Interestingly, PACs had appeared earlier in Kurowicka and Cooke (2003, 2006) in the context of vine graphical models. They have proved useful for elicitation of priors in Bayesian correlation estimation: Daniels and Pourahmadi (2009) constructed a flexible prior on R using independent shifted Beta priors on the PACs; Wang and Daniels (2014) developed the underlying regression models involving PACs and a triangular prior that shifts the prior weight towards a more intuitive choice; and Gaskins et al. (2014) proposed shrinkage and selection priors on the PACs for estimating the correlation matrix. In all these developments, instead of setting the full partial correlations, i.e. the entries of R^{−1}, to zero to incorporate sparsity, the goal has been to set some PACs to zero to encourage parsimony in modeling R. Although the angular parameterization has been around much longer (Pinheiro and Bates, 1996; Rapisarda et al., 2007), it has not yet been used for eliciting priors on correlation matrices.
The aim of this article is to study and deal with some of the computational challenges in Bayesian estimation of correlation matrices by using the Cholesky decomposition of R (Pinheiro and Bates, 1996; Rapisarda et al., 2007) and the ensuing angles (hyperspherical coordinates) as the new parameters, varying freely in the range [0, π). This enables us to deal effectively with the positive-definiteness constraint, resulting in faster computation of the posteriors for our proposed selection and shrinkage priors. For longitudinal data, we show that these angles are directly related to the as yet dormant notion of semi-partial correlations (SPCs). More precisely, we identify the angles as the inverse cosines of the SPCs ρ_{ji·1,2,...,j−1} between the variables y_i and y_j (i > j) conditioned on the previous variables y_1, y_2, ..., y_{j−1}; see Huber (1981), Eaves and Chang (1992) and Madar (2015). We propose natural and appealing shrinkage and selection priors on these angles and show that their performance is similar to, and in some cases better than, that of the shrinkage and selection priors on the partial autocorrelations (Gaskins et al., 2014).
The rest of this article is organized as follows. In Section 2, we discuss some preliminaries about the angular parameterization and connect the angles to the notion of semi-partial correlations. Section 3 introduces our proposed shrinkage and selection priors on the angles and develops Bayesian estimation of correlation matrices. The sampling scheme under our proposed priors is discussed in Section 4. Section 5 compares the performance of our priors to those based on the PACs through simulations. In Section 6, the priors are compared on data from a smoking cessation clinical trial. Finally, Section 7 concludes the article with a discussion.

Reparameterization of R by angles
This section describes a connection between the well-known hyperspherical coordinates (angles) of the Cholesky factor of a correlation matrix R = (r ij ) and the less familiar semi-partial correlation coefficients ρ ji:1,2,...,j−1 between the variables y i and y j (i > j) conditioned on the previous variables y 1 , y 2 , . . . , y j−1 , see Huber (1981), Eaves and Chang (1992), and Madar (2015).
For a k × k correlation matrix R with 1's on the diagonal, the Cholesky decomposition is R = BB′, where the Cholesky factor B is a lower triangular matrix. Since the rows of B are vectors of unit length, they admit the following representation involving trigonometric functions of some angles (Pinheiro and Bates, 1996; Rapisarda et al., 2007):

b_11 = 1,  b_i1 = c_i1,  b_ij = c_ij s_i1 ··· s_{i,j−1} for 1 < j < i,  b_ii = s_i1 ··· s_{i,i−1},

with c_ij = cos(θ_ij) and s_ij = sin(θ_ij), where the angles θ_ij are measured in radians, 1 ≤ j < i ≤ k. Restricting θ_ij ∈ [0, π) makes the diagonal entries of B non-negative, and hence B is unique; to it we associate a (k − 1) × (k − 1) lower triangular matrix Θ holding the k(k − 1)/2 angles. Note that the (i, j)-th element of Θ is denoted by θ_{i+1,j}, so that θ_ij corresponds to the (i, j)-th element of R; we refer to Θ as the angular matrix associated with R. For further details, properties and applications of these angles, see Creal et al. (2011), Zhang et al. (2015) and Tsay and Pourahmadi (2017).
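To make the representation above concrete, here is a small Python/NumPy sketch (function and variable names are ours) that builds B from a matrix of angles and checks that BB′ is a valid correlation matrix:

```python
import numpy as np

def chol_from_angles(theta):
    """Build the lower-triangular Cholesky factor B of a correlation
    matrix from an array whose strict lower triangle holds angles in [0, pi)."""
    k = theta.shape[0]
    B = np.zeros((k, k))
    B[0, 0] = 1.0
    for i in range(1, k):
        prod_sin = 1.0                      # running product s_i1 * ... * s_i,j-1
        for j in range(i):
            B[i, j] = np.cos(theta[i, j]) * prod_sin
            prod_sin *= np.sin(theta[i, j])
        B[i, i] = prod_sin                  # makes each row of B unit-length
    return B

rng = np.random.default_rng(0)
k = 4
theta = np.zeros((k, k))
rows, cols = np.tril_indices(k, -1)
theta[rows, cols] = rng.uniform(0.0, np.pi, size=rows.size)
B = chol_from_angles(theta)
R = B @ B.T
print(np.allclose(np.diag(R), 1.0), np.all(np.linalg.eigvalsh(R) > 0))
```

Each row of B lies on the unit sphere by construction, so BB′ automatically has unit diagonal; positive definiteness follows because B is triangular with positive diagonal entries.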
Given a correlation matrix R (symmetric and positive definite) and its Cholesky decomposition R = BB′ with entries b_ij, matching the entries on both sides it follows that

b_11 = 1,  b_i1 = r_i1,  i = 2, ..., k. (2.2)

Thus, the entries in the first columns of B and R are the same. The θ_ij's, the entries of Θ, are computed recursively via

θ_i1 = arccos(b_i1) = arccos(r_i1) for i = 2, 3, ..., k,  θ_ij = arccos(b_ij / (s_i1 ··· s_{i,j−1})) for 1 < j < i ≤ k. (2.3)

Also, given a matrix Θ with entries θ_ij ∈ [0, π), one can construct the lower triangular matrix B, and from (2.2) we obtain the simple relation r_21 = cos θ_21. The statistical meaning of θ_21 as the inverse cosine of r_21 is then fairly clear.
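The recursion (2.3) can be sketched in Python as follows (naming is ours; the clip guards against floating-point drift at the boundary of [−1, 1]):

```python
import numpy as np

def angles_from_corr(R):
    """Invert b_ij = cos(theta_ij) * s_i1 * ... * s_i,j-1 column by column."""
    B = np.linalg.cholesky(R)               # lower triangular, positive diagonal
    k = R.shape[0]
    theta = np.zeros((k, k))
    for i in range(1, k):
        prod_sin = 1.0
        for j in range(i):
            theta[i, j] = np.arccos(np.clip(B[i, j] / prod_sin, -1.0, 1.0))
            prod_sin *= np.sin(theta[i, j])
    return theta

R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
theta = angles_from_corr(R)
# first-column angles are just the inverse cosines of the first-column
# correlations, exactly as in (2.3)
print(np.allclose(theta[1:, 0], np.arccos(R[1:, 0])))
```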
The relationship between the r_ij's and θ_ij's is such that, while the two angles in the first column are tied to the individual marginal correlations as in (2.2), this is not the case for θ_32 in the second column; the situation gets more complicated for larger k. In general, the entries of the first column of Θ are just the inverse cosines of the corresponding entries of the first column of R, but moving towards the last column the expression for θ_ij becomes more complicated and hence less interpretable as a function of the entries of R.
However, for some special correlation matrices the angles are nicely structured. In the next two examples we present the form of Θ for the AR(1) and banded correlation structures, which are of special interest for longitudinal data, followed by a block common correlation structure in example (e).
(c) AR(1) correlation matrix: For an AR(1) correlation matrix specified by r, the corresponding Θ is characterized by a single angle, θ_21 = arccos(r), called the pivotal angle; the other angles are functions of it, called implied angles (Tsay and Pourahmadi, 2017). Moreover, it is instructive to note that the angles approach π/2 as the lag increases, as stated in the following proposition.

Proposition. For an AR(1) correlation matrix with |r| < 1 and fixed column index j, θ_ij → π/2 as i → ∞.

Proof. We use induction on the column index j. For j = 1, r_i1 = r^{i−1} = b_i1, which goes to 0 as i → ∞ under the assumption |r| < 1; therefore θ_i1 = arccos(b_i1) → π/2. Suppose the assertion holds for every column index m < j, i.e. θ_im → π/2, so that the product s_i1 ··· s_{i,j−1} → 1. For an AR(1) matrix, b_ij = r^{i−j}(1 − r²)^{1/2} for j ≥ 2, which goes to 0 as i → ∞, so by the recursion (2.3), θ_ij → π/2 as well.

(d) Banded correlation matrix: R is banded if r_ij = 0 for |i − j| > λ, where λ is an integer between 1 and k. The corresponding Θ is also banded, by the following proposition.
Proposition. R is banded with band size λ, 1 ≤ λ < k, if and only if Θ is banded with the same band size, i.e. θ_ij = π/2 for i − j > λ.

Proof. For the "only if" part, r_ij = 0 for |i − j| > λ, and the proof follows by induction on the column index j. Note that r_ij = Σ_{l=1}^{j} b_il b_jl for i > j (since B is a lower triangular matrix), and by the construction of B the diagonal entries are positive. For j = 1 and i − 1 > λ, r_i1 = 0 implies b_i1 = 0, which in turn implies θ_i1 = π/2. For the induction step, assume j < i and i − j > λ, and consider 0 = r_ij = b_ij b_jj + Σ_{l=1}^{j−1} b_il b_jl. From the induction hypothesis, θ_il = π/2 for l ≤ j − 1 (note that i − l > λ), so b_il = 0 and the second summand is 0. Thus we must have b_ij = 0, which implies θ_ij = π/2, and this completes the proof.
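The banded case is easy to check numerically. The sketch below uses a tridiagonal (λ = 1) correlation matrix as an illustrative example; the helper mirrors the recursion (2.3):

```python
import numpy as np

def angles_from_corr(R):
    # invert b_ij = cos(theta_ij) * s_i1 * ... * s_i,j-1 (see (2.3))
    B = np.linalg.cholesky(R)
    k = R.shape[0]
    theta = np.zeros((k, k))
    for i in range(1, k):
        prod_sin = 1.0
        for j in range(i):
            theta[i, j] = np.arccos(np.clip(B[i, j] / prod_sin, -1.0, 1.0))
            prod_sin *= np.sin(theta[i, j])
    return theta

# a tridiagonal (lambda = 1) correlation matrix of dimension 5
k, lam, r = 5, 1, 0.4
R = np.eye(k)
for i in range(k - 1):
    R[i, i + 1] = R[i + 1, i] = r
theta = angles_from_corr(R)
# outside the band, every angle collapses to pi/2, so Theta is banded too
outside = [theta[i, j] for i in range(k) for j in range(i) if i - j > lam]
print(np.allclose(outside, np.pi / 2))
```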
(e) Block common correlation matrices: As a generalization of a compound symmetric (exchangeable) correlation matrix, consider a block common correlation matrix, a blocked matrix in which the correlations within each block are equal and differ across blocks. Such matrices arise in many applications due to the presence of common (latent) factors in different groups, e.g. the aggregation of carbon dioxide sequestration storage assessment units (Blondes et al., 2013) and stock returns of different companies within the same industry (Liechty et al., 2004; Tsay and Pourahmadi, 2017). As an illustration, consider a 6 × 6 block common correlation matrix with 6 distinct correlations r_i, i = 1, 2, ..., 6, which is much smaller than 15, the number of distinct entries of a generic correlation matrix of this size. The corresponding matrix of angles Θ is completely determined by six pivotal angles, denoted (θ_21, θ_31, θ_51, θ_43, θ_53, θ_65) (Tsay and Pourahmadi, 2017), where the subscripts indicate their locations in the partitioned matrix Θ.
In general, for a k × k correlation matrix R with d common correlation blocks, the pivotal angles consist of d angles in the range [0, π), say θ_pivotal = (θ_1, θ_2, ..., θ_d)′. The other angles in Θ, called implied angles, are functions of the pivotal angles and can be obtained using an algorithm in Tsay and Pourahmadi (2017). When the blocks are known, it is simple to determine the positions of the pivotal angles. One can then place a prior only on the pivotal angles to shrink them towards different target values, reducing the dimension of the parameter space to d, and finally use the algorithm in (Tsay and Pourahmadi, 2017, p. 11) to estimate the entire correlation matrix. It is worth noting that AR(1) and banded correlation matrices often arise in the context of longitudinal data; the characterizations of Θ discussed in examples (c) and (d) are therefore useful for eliciting priors in Bayesian estimation of such matrices.

The angles and semi-partial correlations
Statistical interpretation and plausible meaning of the angles as the new parameters of a correlation matrix are of interest when eliciting priors. This task is complicated by the nonlinearity of the relationships between the correlations and angles as seen in (2.5).
Here, we use a relatively dormant formula for b_ij, stated without proof in (Cooke et al., 2011, Chapter 3), and identify the angles as the inverse cosines of the semi-partial correlations (SPCs) ρ_{ji·1,2,...,j−1} between the variables y_i and y_j (i > j) conditioned on the previous variables; see Huber (1981), Eaves and Chang (1992) and Madar (2015). Surprisingly, the simplicity of the relations between the angles and the SPCs is reminiscent of the relations in (2.2) between the entries of the first columns of Θ and R.

Theorem 1. Let R = BB′ with the associated angular matrix Θ. Then, for i > j, (a) the entries of the Cholesky factor satisfy

b_ij = ρ_{ji·1,2,...,j−1} ∏_{l=1}^{j−1} (1 − ρ²_{li·1,2,...,l−1})^{1/2}, (2.7)

and (b) the angles θ_ij are precisely the inverse cosines of the SPCs:

θ_ij = arccos(ρ_{ji·1,2,...,j−1}). (2.8)

The proof is given in Appendix A (Ghosh et al., 2020).
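Claims like (2.8) are easy to probe numerically. The sketch below compares cos(θ_ij) with the correlation between y_i and y_j after the linear effect of the preceding variables y_1, ..., y_{j−1} is regressed out, which is our reading of the SPC ρ_{ji·1,2,...,j−1} here (Python, 0-based indices, names ours):

```python
import numpy as np

def angles_from_corr(R):
    # invert b_ij = cos(theta_ij) * s_i1 * ... * s_i,j-1 (see (2.3))
    B = np.linalg.cholesky(R)
    k = R.shape[0]
    theta = np.zeros((k, k))
    for i in range(1, k):
        prod_sin = 1.0
        for j in range(i):
            theta[i, j] = np.arccos(np.clip(B[i, j] / prod_sin, -1.0, 1.0))
            prod_sin *= np.sin(theta[i, j])
    return theta

def resid_corr(R, i, j):
    """Correlation of y_i and y_j after the linear effect of the preceding
    variables y_0, ..., y_{j-1} (0-based) is removed."""
    past = list(range(j))
    keep = [j, i]
    S11 = R[np.ix_(past, past)]
    S12 = R[np.ix_(past, keep)]
    C = R[np.ix_(keep, keep)] - S12.T @ np.linalg.solve(S11, S12)
    return C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))
S = A @ A.T
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)                     # a random 4 x 4 correlation matrix
theta = angles_from_corr(R)
i, j = 3, 2
print(np.isclose(np.cos(theta[i, j]), resid_corr(R, i, j)))
```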

Distributions of the angles
The Cholesky decomposition of a correlation matrix, and hence the notions of angles and semi-partial correlations, depend on the ordering or labeling of the variables in R. Nevertheless, one may assign distributions to the angles so that the distribution of R is a power of its determinant and hence invariant to permutations of its rows and columns (Pourahmadi and Wang, 2015, Theorem 1).
Theorem 2. For a k-dimensional random correlation matrix R with the corresponding matrix of angles Θ, let the random variables in the j-th column of Θ be independent and identically distributed with density

p(θ) ∝ sin^{2α+k−j}(θ), θ ∈ [0, π), (2.9)

where α is a constant. Then
(a) the joint distribution of R is given by

p(R) = c_k(α) [det(R)]^α, (2.10)

where c_k(α) is the normalizing constant;
(c) the distribution (2.9) is symmetric about π/2, hence its mean and median are equal to π/2.
Although the proof appears in (Pourahmadi and Wang, 2015, Theorem 1), for the reader's convenience we give an outline in Appendix B.
It turns out that these distributions on the angles reduce to the jointly uniform prior of Barnard et al. (2000) on a correlation matrix for a specific value of α. Recall that the jointly uniform prior assigns a uniform distribution to the set of all valid k × k correlation matrices,

p_J(R) ∝ 1, R ∈ R_k. (2.11)

Indeed, α = 0 in (2.10) leads to p_J. As such, this prior is noninformative and not suitable for longitudinal data analysis, since higher-lag (auto)correlations tend to zero faster than those with smaller lags.
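Sampling a random correlation matrix from p(R) ∝ det(R)^α then reduces to independent draws of the angles: a density proportional to sin^m(θ) on [0, π) corresponds to cos(θ) = 2V − 1 with V ~ Beta((m+1)/2, (m+1)/2). A Python sketch (names ours; the exponent 2α + k − j is our reading of (2.9), and α = 0 gives the jointly uniform prior p_J):

```python
import numpy as np

def sample_corr(k, alpha, rng):
    """Draw R with density proportional to det(R)^alpha by sampling the
    angles column-wise from p(theta) ~ sin^(2*alpha + k - j)(theta) and
    mapping them through the Cholesky factor."""
    theta = np.zeros((k, k))
    for j in range(k - 1):                 # 0-based column j <-> column j+1 in text
        m = 2 * alpha + k - (j + 1)        # exponent of sin(theta)
        shape = (m + 1) / 2
        v = rng.beta(shape, shape, size=k - 1 - j)
        theta[j + 1:, j] = np.arccos(2.0 * v - 1.0)
    B = np.zeros((k, k))
    B[0, 0] = 1.0
    for i in range(1, k):
        prod_sin = 1.0
        for j in range(i):
            B[i, j] = np.cos(theta[i, j]) * prod_sin
            prod_sin *= np.sin(theta[i, j])
        B[i, i] = prod_sin
    return B @ B.T

rng = np.random.default_rng(2)
R = sample_corr(5, 0.0, rng)               # alpha = 0: jointly uniform prior
print(np.allclose(np.diag(R), 1.0), np.all(np.linalg.eigvalsh(R) > 0))
```

Every draw is a valid correlation matrix by construction, with no accept/reject step on positive definiteness.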

Prior specifications on angles
We propose shrinkage and selection priors on the angles for ordered data, guided by the fact that two variables far apart in time have correlation decaying to zero. It is therefore natural to expect that the semi-partial correlation between two variables y_i and y_j (i > j) in the random vector Y, given the preceding variables y_1, y_2, ..., y_{j−1}, decays to zero as the lag i − j increases. In terms of the angles, this means that the corresponding θ_ij goes to π/2, since θ_ij is related to the corresponding semi-partial correlation through the cosine function, see (2.8).
We note that in Bayesian statistics, spike-and-slab priors have been widely used in practice as selection priors, with a spike at a target value, here π/2. There is a vast literature on selection priors (Mitchell and Beauchamp, 1988; Ishwaran and Rao, 2005).

Selection prior
A way to motivate our selection prior is to note that when R is an identity matrix, all the entries of Θ are π/2. Thus, a selection prior formed as a mixture of a Dirac delta with mass at π/2 and a continuous density supported on [0, π) is capable of selecting or centering the angles at π/2. In terms of the SPCs, this amounts to encouraging the semi-partial correlation between y_i and y_j given y_1, y_2, ..., y_{j−1} to be exactly 0. Similar to the PAC framework, where the selection prior of Gaskins et al. (2014) on π_ij is constructed as a mixture of a point mass at zero and an SBeta distribution on [−1, 1], our selection prior on the angles, denoted p_{θ,SP}, assumes independent mixture distributions for the individual θ_ij's,

p_{θ_ij,SP}(θ) = (1 − η_ij) δ_{π/2}(θ) + η_ij g(θ), 1 ≤ j < i ≤ k, (3.2)

where δ_{π/2} denotes a Dirac delta with mass at π/2, g is a continuous density on [0, π), and 1 − η_ij = Pr(θ_ij = π/2). To make such priors more suitable for longitudinal data, we further parameterize η_ij = η_0|i − j|^{−γ}, so that as the lag |i − j| increases the prior in (3.2) puts more weight at π/2. Since the angle θ_ij is related to the semi-partial correlation ρ_{ji·1,2,...,j−1} in (2.8) through cos(θ_ij), this implies that for variables far apart, i.e. with greater lag |i − j|, the corresponding θ_ij's are more likely to equal π/2. We further assume a Unif(0, 1) distribution for the hyperparameter η_0 and a Gamma(a, a) distribution for the hyperparameter γ, so that γ has prior mean 1. In our simulation study, we choose a = 5 to make our results comparable to those in Gaskins et al. (2014).
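A minimal sketch of drawing one angle from such a lag-adaptive spike-and-slab mixture (Python; the uniform slab and all names are illustrative assumptions, not the paper's exact choices). We take the spike mass to grow with the lag, matching the longitudinal motivation:

```python
import numpy as np

def draw_theta_sp(lag, eta0, gamma, rng):
    """One draw from a selection prior: a spike at pi/2 whose mass
    1 - eta0 * lag**(-gamma) grows with the lag, mixed with a continuous
    slab on [0, pi) (a uniform slab here, purely for illustration)."""
    eta = eta0 * lag ** (-gamma)
    if rng.random() < 1.0 - eta:
        return np.pi / 2                   # SPC selected to be exactly zero
    return rng.uniform(0.0, np.pi)         # illustrative slab draw

rng = np.random.default_rng(3)
# larger lags put more mass exactly at pi/2
freq = {lag: np.mean([draw_theta_sp(lag, 0.9, 1.0, rng) == np.pi / 2
                      for _ in range(20000)]) for lag in (1, 4)}
print(freq[4] > freq[1])
```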

Shrinkage prior
In Bayesian covariance estimation, shrinkage priors have been used to shrink the posterior estimate towards specific structures. For example, Liechty et al. (2004) considered priors to shrink the correlation matrix to certain group-structured targets, and Wang and Pillai (2013) considered scale mixture of uniform distributions to construct shrinkage priors for covariance matrix estimation.
To define an analogue of the above shrinkage priors on the angles, we exploit the interpretation of the angles in (2.8) and assume that the semi-partial correlation ρ_{ji·1,2,...,j−1} follows an SBeta(α_ij, α_ij) distribution on [−1, 1]. By a change of variable, our shrinkage prior p_{θ_ij,SH} on θ_ij is

p_{θ_ij,SH}(θ) ∝ sin^{2α_ij−1}(θ), θ ∈ [0, π). (3.3)

Note that the form (3.3) is very similar to (2.9) except for the exponent. The mean of the distribution is π/2 [Theorem 2(c)], but the variance has no closed-form expression. Using the delta method, the variance can be approximated by η_ij = 1/(2α_ij + 1). As with the selection prior, we further parameterize η_ij = η_0|i − j|^{−γ} to make it lag-dependent, and assume a Unif(0, 1) prior on η_0 and a Gamma(a, a) prior on γ.
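Because ρ = cos(θ) follows a symmetric Beta distribution rescaled to [−1, 1], draws from (3.3) are immediate, and the delta-method variance approximation can be checked by Monte Carlo (Python sketch, names ours):

```python
import numpy as np

def draw_theta_sh(alpha, size, rng):
    """Draws from p(theta) ~ sin^(2*alpha - 1)(theta) on [0, pi):
    equivalently rho = cos(theta) follows a Beta(alpha, alpha)
    distribution rescaled to [-1, 1]."""
    rho = 2.0 * rng.beta(alpha, alpha, size=size) - 1.0
    return np.arccos(rho)

rng = np.random.default_rng(4)
alpha = 10.0
theta = draw_theta_sh(alpha, 200_000, rng)
# mean pi/2, variance roughly 1/(2*alpha + 1) by the delta method
print(abs(theta.mean() - np.pi / 2) < 0.01,
      abs(theta.var() - 1.0 / (2.0 * alpha + 1.0)) < 0.01)
```

Smaller values of 1/(2α_ij + 1), i.e. larger lags under the parameterization above, concentrate the prior more tightly around π/2.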

Sampling from posterior distribution
We assume throughout that the data y_1, y_2, ..., y_n follow a k-dimensional multivariate normal distribution with mean zero and covariance matrix R. Restricting attention to correlation matrices is natural, for example, in the analysis of the multivariate probit model, to circumvent the issue of identifiability (Chib and Greenberg, 1998). Denoting Y = [y_1, y_2, ..., y_n], the likelihood function parameterized by Θ is given by

p(Y | Θ) ∝ det(R)^{−n/2} exp{−(1/2) Σ_{i=1}^{n} y_i′ R^{−1} y_i},  R = T^{−1}(Θ),

where T denotes the transformation from R to Θ.

Updating θ ij :
Denote by Θ_{[−ij]} the matrix Θ after dropping its (i, j)-th element, and by p_{θ_ij}(θ) the prior for θ_ij. For convenience, we often drop the subscripts SP and SH from p_{θ_ij} when the context is clear. The posterior distribution of θ_ij given the others is

p(θ_ij | Θ_{[−ij]}, Y) ∝ p(Y | Θ) p_{θ_ij}(θ_ij). (4.1)

Note that the involvement of θ_ij in the likelihood makes the posterior non-conjugate.
To sample from (4.1), we introduce an auxiliary variable z_ij (Damien et al., 1999; Neal, 2003; Gaskins et al., 2014), write

p(θ_ij, z_ij | Θ_{[−ij]}, Y) ∝ I{z_ij < p(Y | Θ)} p_{θ_ij}(θ_ij),

and sample θ_ij in two steps. The first step is common to both the selection and shrinkage priors, while a notable difference appears in the second step owing to the presence of the Dirac delta function in the selection prior (3.2).

Step 1. Draw z_ij uniformly from (0, p(Y | Θ)), with the likelihood evaluated at the current value of θ_ij.
Step 2. Sample θ_ij from its prior restricted to the constraint set C = {θ : z_ij < p(Y | Θ)}. To do so, let F(θ) = P(θ_ij ≤ θ) be the cumulative distribution function of the prior, p_{θ_ij,SH} or p_{θ_ij,SP}, which is available in closed form. Draw U uniformly over the set F(C) and update θ_ij by F^{−1}(U) = inf{θ : F(θ) ≥ U}. For the shrinkage prior (3.3), F is continuous and follows from the SBeta distribution of cos θ_ij; for the selection prior (3.2), F has a jump at π/2 equal to the spike mass, which is what allows values exactly equal to π/2, i.e. zero SPCs, to be selected.
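One concrete (toy) implementation of the two-step auxiliary-variable update for a single angle under the shrinkage prior, with a stand-in Gaussian likelihood; realizing the draw on the slice by rejection is an illustrative choice, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_f(theta):
    """Unnormalized log full conditional of a single angle: a toy Gaussian
    'likelihood' term standing in for p(Y | Theta), times the shrinkage
    prior sin^(2*alpha - 1)(theta) with alpha = 3."""
    return -0.5 * ((theta - 1.0) / 0.3) ** 2 + 5.0 * np.log(np.sin(theta))

def slice_update(theta):
    # Step 1: draw the auxiliary variable z | theta ~ Uniform(0, f(theta))
    log_z = log_f(theta) + np.log(rng.random())
    # Step 2: draw theta | z from the slice {theta : f(theta) > z},
    # here by simple rejection from Uniform(0, pi)
    while True:
        prop = rng.uniform(0.0, np.pi)
        if log_f(prop) > log_z:
            return prop

theta, draws = 1.5, []
for _ in range(5000):
    theta = slice_update(theta)
    draws.append(theta)
draws = np.array(draws)
print(0.0 < draws.mean() < np.pi / 2)   # pulled below pi/2 by the toy likelihood
```

No proposal tuning or positive-definiteness check is needed; the angle moves freely in [0, π).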

Updating η 0 , γ:
For updating the parameters η_0, γ in the selection prior, we introduce dummy variables ϑ_ij = I{θ_ij = π/2}, which are Bernoulli with the spike probability in (3.2). This makes the distribution of η_0, γ depend on Θ only through the variables ϑ_ij. We then use two slice samplers to update η_0 and γ; the shrinkage prior is handled similarly.

Comparing priors on the angles and PACs
We perform a number of simulation studies to assess the performance of our selection and shrinkage priors on the angles relative to the selection and shrinkage priors of Gaskins et al. (2014) on partial autocorrelations. Since the selection prior performed better than the shrinkage prior in their simulation study, here we focus only on the selection prior and follow their simulation set-up as much as possible.
We consider four true correlation matrices, R_A (identity), R_B (AR(1)), R_C and R_D, with corresponding angular matrices; Θ_C leads to a banded correlation matrix, and the entries in the rows of Θ_D decay to π/2. For each of the four correlation matrices, we simulate 60 data sets of sample sizes n = 20 and 200 from a multivariate normal distribution with mean zero and covariance matrix equal to the chosen correlation matrix. For comparison of the risks, our competitor is p_{π;SP}, which performed best in Gaskins et al. (2014). To compute the posterior for p_{π;SP}, we also obtain Π for each R above. We run a Markov chain Monte Carlo (MCMC) sampler for 5000 iterations with a burn-in of 1000 and retain every tenth iteration, providing 500 draws for each data set. At each iteration of Θ, we retain the corresponding correlation matrix R(Θ), and at each iteration of Π we retain, in addition to R(Π), Θ(Π) = Θ(R(Π)).
The posterior estimates of Θ, R(Θ), R(Π) and Θ(Π) are obtained by averaging the retained post-burn-in samples. For each case, we gauge performance by the risk estimates with respect to the two loss functions discussed earlier, averaging the losses over the 60 replications of the simulated data.
The results are summarized in Table 1, where we note that for the identity matrix (R_A) our selection prior outperforms all its competitors, but our shrinkage prior is outperformed by the selection prior in Gaskins et al. (2014). For the AR(1) matrix (R_B), our selection prior and the selection prior of Gaskins et al. (2014) are comparable. For the banded R_C and for R_D, our selection and shrinkage priors are comparable and perform better than p_{π;SP}. The jointly uniform prior performs poorly in all of these cases.

Table 1: Risks for our selection prior (p_{θ;SP}), shrinkage prior (p_{θ;SH}) and the selection prior in Gaskins et al. (2014).
In summary, our selection and shrinkage priors show an advantage over the PAC-based priors in certain scenarios.

Computational advantages of angle parameterization
The computational challenges of using constrained priors like the jointly uniform prior p_J(R) are well known; other notable examples are the common correlation priors in Liechty et al. (2004) and priors for sparse R^{−1} (Wong et al., 2003; Pitt et al., 2006; Carter et al., 2011), which place a flat prior on the non-zero components for a given pattern of zeros. These methods usually require computing normalizing constants related to volumes of certain subsets of R_k corresponding to patterns of zeros, with the prior and posterior densities supported on constrained sets. Due to the presence of the indicator function of R_k in the prior and posterior, in the Metropolis-Hastings scheme the proposal density for updating r_ij has to be restricted to an interval [l_ij, u_ij], where these bounds are functions of the rest of the entries of R or R^{−1} (Barnard et al., 2000; Liechty et al., 2004). An unconstrained parameterization removes the tedious task of computing the normalizing constant in every update of the MCMC algorithm, and consequently posterior computation is faster.
Next, we compare the time complexity of the MCMC algorithms for the constrained prior p_J(R) on the space R_k of valid correlation matrices and for its two unconstrained reparameterizations in terms of the angles Θ and the partial autocorrelations Π. The prior on the angles,

p(θ_ij) ∝ sin^{k−j}(θ_ij), θ_ij ∈ [0, π), (5.1)

is obtained from (2.9) for α = 0, and the corresponding prior on the PACs is

p(π_ij) ∝ (1 − π_ij²)^{(k−1−(i−j))/2}, π_ij ∈ [−1, 1]; (5.2)

for more details see Gaskins et al. (2014).
We consider three settings of (n, k), namely (50, 5), (100, 10) and (500, 15), and simulate a sample of size n from a k-dimensional normal distribution with mean 0 and covariance matrix set to the identity, an AR(1) with correlation 0.4, and a general correlation matrix, respectively. For posterior sampling, we use slice sampling as in Section 4 for Θ and as in Gaskins et al. (2014) for Π, with the adjustments needed for the priors (5.1) and (5.2); for the constrained case we use a Metropolis-Hastings algorithm to sample the individual correlations r_ij from the restricted set determined by the rest of the entries of R, as in Barnard et al. (2000) and Liechty et al. (2004).
In Figure 1, we present the run times (in log seconds) needed to obtain 2000 effectively independent MCMC samples, as measured by the effectiveSize function in R, which estimates the effective sample size after adjusting for autocorrelation. As expected, the unconstrained priors significantly outperform the constrained method in every dimension with respect to execution time. The simulations were run on a 2.6 GHz Intel Core i5 processor. These numerical results are not surprising: the computational complexity of simulating from a posterior of R based on priors on the angles, or of generating general random correlation matrices (Pourahmadi and Wang, 2015), is O(k³), compared to O(k⁴) for the Joe (2006) proposal based on partial correlations and O(k³) for the Lewandowski et al. (2009) method using partial correlations defined on C-vines.

Data analysis
We analyze a data set (Gaskins et al., 2014) simulated based on the first Commit to Quit (CTQ I) study of Marcus et al. (1999), a clinical trial designed to encourage women to stop smoking. The aim of the study was to assess whether exercise increases the quit rate, since weight gain appears to be an influential factor in smoking cessation programs. The control group received an educational intervention of equal contact time. The study spans 12 weeks, and the patients were encouraged to quit smoking at week 5.
The data are provided in the form of a 281 × 9 matrix, where rows correspond to patients, columns 2-9 correspond to weeks, and the first column corresponds to treatment assignment (0 for control and 1 for exercise). For each patient, columns 2-9 record the patient's smoking status from the 5th to the 12th week after they are asked to quit smoking. With n = 281 and k = 8 (discarding the first column), we associate with the data an n × k matrix Y = (y_ij) whose entries take values −1, 0, 1, where 1 denotes success (the i-th patient not smoking in the j-th week), −1 denotes failure (still smoking in the j-th week) and 0 denotes a missing observation. Introducing latent variables y*_ij, we assume a multivariate probit model (Chib and Greenberg, 1998) in which y_ij = 1 if y*_ij > 0 and y_ij = −1 if y*_ij ≤ 0; if y_ij = 0, the sign of y*_ij represents the (unobserved) quit status for the week. Next, we assume y*_i = (y*_i1, y*_i2, ..., y*_ik)′ ~ N_k(μ_i, R) for i = 1, 2, ..., n, with μ_i parameterized as μ_i = X_i′β, where X_i is a q × k matrix of covariates and β is a q × 1 vector of regression coefficients. To circumvent the identifiability issue, the covariance matrix is restricted to be a correlation matrix. As in Gaskins et al. (2014), we consider two choices of X_i: time-varying, which specifies a different μ_it for each time within each treatment group (q = 2k), and time-constant, which gives the same μ_it across all times within a treatment group (q = 2). With this set-up, we place a flat prior on β, and the priors on R are the selection and shrinkage priors on the PACs (Gaskins et al., 2014) and on the angles Θ proposed here.
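The probit link above is easy to sketch: each latent y*_ij is a normal draw truncated to the half-line implied by the observed status. The Python sketch below is a deliberately simplified univariate version (names ours), ignoring the conditioning on the other weeks through R that the actual sampler handles via Proposition 1 of Liu et al. (2009):

```python
import numpy as np

def draw_latent(y_obs, mu, rng):
    """Draw y*_ij from N(mu, 1) truncated to the half-line implied by the
    observed status: positive if y_obs = 1, non-positive if y_obs = -1,
    unconstrained if y_obs = 0 (missing).  Simple rejection sketch."""
    if y_obs == 0:
        return mu + rng.standard_normal()
    while True:
        z = mu + rng.standard_normal()
        if (z > 0) == (y_obs == 1):
            return z

rng = np.random.default_rng(6)
draws = [draw_latent(1, 0.2, rng) for _ in range(1000)]
print(all(d > 0 for d in draws))   # all draws respect the 'not smoking' sign
```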

Posterior computation
For posterior computation, we run an MCMC chain for 12,000 iterations with a burn-in of 3000, retaining every tenth draw. Three sets of parameters appear in the posterior: the regression parameters, the latent variables and the correlation matrix.

1. Sampling β. With a flat prior on β, its full conditional given the latent variables y*_i and R is multivariate normal, from which we sample directly.
2. Sampling R. For the angle-based priors, the sampling scheme of Section 4 is used; for the PAC-based priors, the R code provided in Gaskins et al. (2014) is used. In both cases the update is based on the residuals y*_i − μ_i, i = 1, 2, ..., n.
3. Sampling the y*_i's. For sampling the latent variables, we use Proposition 1 of Liu et al. (2009), as in Gaskins et al. (2014).
For comparison we use the deviance information criterion (DIC), which does not require counting the number of model parameters, making it an effective criterion for model selection when shrinkage or sparsity is involved. DIC is defined as (Spiegelhalter et al., 2002)

DIC = Dev(β̂, R̂) + 2p_D, where Dev(β, R) = −2 Σ_{i=1}^{n} l(β, R | y_i) and p_D = E[Dev(β, R)] − Dev(β̂, R̂), (6.1)

with l denoting the log-likelihood function; the expectation is taken with respect to the posterior distribution.
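A small sketch of (6.1) on hypothetical deviance samples (Python; the numbers are made up purely for illustration):

```python
import numpy as np

def dic(dev_samples, dev_at_estimate):
    """DIC = Dev(estimate) + 2 * p_D, with the effective number of
    parameters p_D = (posterior mean deviance) - Dev(estimate)."""
    p_d = np.mean(dev_samples) - dev_at_estimate
    return dev_at_estimate + 2.0 * p_d, p_d

# toy deviances from a hypothetical MCMC run
rng = np.random.default_rng(7)
dev_samples = 2500.0 + 10.0 * rng.standard_normal(900)
dev_hat = 2490.0                       # deviance at the posterior estimate
value, p_d = dic(dev_samples, dev_hat)
print(value > dev_hat and p_d > 0)
```

Models with more effective parameters (larger p_D) are penalized even when their deviance at the estimate is comparable.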
For the CTQ data, the posterior estimate β̂ is the posterior mean; for the posterior estimate R̂ we use the posterior median for the angle-based priors and the estimate used in (Gaskins et al., 2014, p. 12) for the PAC-based priors. The numerical results for the various priors on the correlation matrix are reported in Table 2, where it can be seen that the DIC is smaller for the time-constant mean structure, in line with the findings of Gaskins et al. (2014). Note that for the time-varying mean structure the models are heavily penalized through p_D, which must account for 14 extra parameters compared to the time-constant models. Our angle-based selection prior tends to have the lowest DIC value and hence is preferred.

Discussion
We have dealt with some computational challenges in Bayesian estimation of correlation matrices by using the Cholesky decomposition of R and the ensuing angles as the new parameters, which vary freely in [0, π). This reparameterization deals effectively with the positive-definiteness constraint on a correlation matrix and results in faster computation of the posteriors. At first encounter, angles may not seem the most natural parameters in statistics. However, to our knowledge we have shown for the first time that the angles in the present context are simply the inverse cosines of the familiar semi-partial correlations; see Huber (1981), Eaves and Chang (1992) and Cooke et al. (2011). Thus, the angles are statistically meaningful, and this new connection opens up the possibility of using the wealth of distributions from directional statistics as potential priors for Bayesian analysis of correlation matrices. Through simulations and data analysis we have shown that the performance of our shrinkage and selection priors on the angles is better than or comparable to those based on the PACs in Gaskins et al. (2014).