Direct covariance matrix estimation with compositional data

Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject's gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. Through simulation studies, we demonstrate that our estimator can outperform existing estimators. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with chronic fatigue syndrome.


Introduction
High-dimensional compositional data arise in many areas of modern science. To study the human gut microbiome, for example, practitioners measure the relative abundances of various microbes using next-generation sequencing followed by alignment and normalization (Gloor et al., 2017). For each subject in a study, the resulting measurement is a p-dimensional vector which has nonnegative entries and sums to one (Huson et al., 2007; Segata et al., 2012). More generally, compositional data arise when, for example, one observes multivariate count-valued data wherein the total count in a sample is an experimental artifact. Here, we focus on compositional data which belong to the set

C^{p-1} = {(x_1, . . ., x_p)^⊤ ∈ R^p : x_j ≥ 0 for all j ∈ [p], Σ_{j=1}^p x_j = 1},

where [p] = {1, . . ., p} for a positive integer p.
To make matters concrete, let X = (X_1, . . ., X_p)^⊤ ∈ C^{p-1} be a random composition whose components correspond to the variables of interest. Letting W = (W_1, . . ., W_p)^⊤ denote the corresponding latent abundances, also known as the basis (Aitchison, 1982), we assume X_j = W_j / Σ_{k=1}^p W_k for each j ∈ [p], where each W_j ∈ (0, ∞). When characterizing the dependence between any two components of the compositional vector, the parameter of interest is often the basis covariance matrix Ω* ∈ S^p_+, where

Ω*_jk = Cov{log(W_j), log(W_k)}, (j, k) ∈ [p] × [p],

and S^p_+ denotes the set of p × p symmetric positive definite matrices. In many studies involving compositional microbiome data, practitioners are interested in modeling the interactions and dependencies between microbial abundances (Faust et al., 2012; Ma et al., 2021). For instance, one may want to assess whether two microbes tend to occur in high abundance jointly. The basis covariance matrix Ω* provides one route for addressing such questions (Jiang et al., 2019; Matchado et al., 2021; He et al., 2021), but is not straightforward to estimate from independent realizations of X because W is latent. One common approach relies on the estimation of the variation matrix Θ* (Aitchison, 2003, Chapter 4), defined elementwise by

Θ*_jk = Var{log(X_j/X_k)}, (j, k) ∈ [p] × [p].

Thus, letting ω* = diag(Ω*) ∈ R^p and 1_p = (1, 1, . . ., 1)^⊤ ∈ R^p,

Θ* = ω*1_p^⊤ + 1_p(ω*)^⊤ − 2Ω*. (1)

To define an estimator of Θ*, let x_i = (x_i1, . . ., x_ip)^⊤ ∈ C^{p-1}, i ∈ [n], denote independent realizations of X. Let also z_ijk = log(x_ij/x_ik) and z̄_jk = n^{-1} Σ_{i=1}^n z_ijk for all (j, k) ∈ [p] × [p]. The sample estimator Θ̂ is defined elementwise by

Θ̂_jk = n^{-1} Σ_{i=1}^n (z_ijk − z̄_jk)^2, (j, k) ∈ [p] × [p].

While Θ̂ is a natural estimator of Θ*, it is unclear how to use it to estimate Ω* in general because there are infinitely many Ω such that Θ̂ = ω1_p^⊤ + 1_pω^⊤ − 2Ω. Namely, the diagonal entries of Θ̂ and ω1_p^⊤ + 1_pω^⊤ − 2Ω are zero, so there are p(p − 1)/2 unique equalities but p(p + 1)/2 unknowns in Ω. However, if one assumes that many entries of Ω* are zero, then it can be estimated based on (1). Cao et al. (2019) proved that if p ≥ 5 and Ω* has fewer than (p − 1) nonzero off-diagonal entries, then no two Ω* correspond to the same Θ*. Thus, if one could assume s < p − 1 off-diagonal entries of Ω* are nonzero, one could consider the estimator

arg min_{Ω = Ω^⊤, ∥Ω_−∥_0 ≤ s} ∥ω1_p^⊤ + 1_pω^⊤ − 2Ω − Θ̂∥_F^2, ω = diag(Ω), (2)

where Ω_− denotes the matrix Ω with its diagonal entries set to zero, ∥A∥_F^2 = tr(A^⊤A) = Σ_{j,k} A_jk^2 is the squared Frobenius norm of a matrix A, and ∥A∥_0 = Σ_{j,k} 1(A_jk ≠ 0) counts the number of nonzero entries in A. In practice, assuming only s < p − 1 off-diagonal elements are nonzero is often too restrictive. Of course, (2) could also be used with s ≥ p − 1, but due to the L_0 constraint, (2) is the solution to a nonconvex optimization problem and is computationally challenging for large p.
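For concreteness, the sample variation matrix Θ̂ can be computed directly from the observed log-ratios. A minimal numpy sketch (the function name is ours, not the paper's software):

```python
import numpy as np

def sample_variation_matrix(X):
    """Elementwise sample variation matrix: entry (j, k) is the (1/n)
    sample variance of z_ijk = log(x_ij / x_ik) across observations i."""
    logX = np.log(X)                           # n x p
    # z_ijk = log(x_ij) - log(x_ik) for every ordered pair (j, k)
    Z = logX[:, :, None] - logX[:, None, :]    # n x p x p
    return Z.var(axis=0)                       # symmetric, zero diagonal

# Example: n = 20 compositions on the simplex, p = 4
rng = np.random.default_rng(0)
W = rng.lognormal(size=(20, 4))                # latent positive abundances
X = W / W.sum(axis=1, keepdims=True)           # observed compositions
Theta = sample_variation_matrix(X)
```

Because log(x_ij/x_ik) = log(w_ij/w_ik), the same matrix is obtained from the compositions and from the latent abundances, which is what makes Θ̂ estimable even though W is latent.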
In view of (2) and its limitations, and given that the L_1 norm is a convex relaxation of the L_0 norm, a natural alternative is

arg min_{Ω = Ω^⊤} ∥ω1_p^⊤ + 1_pω^⊤ − 2Ω − Θ̂∥_F^2 + λ∥Ω_−∥_1, ω = diag(Ω), (3)

where λ ≥ 0 is a tuning parameter and ∥A∥_1 = Σ_{j,k} |A_jk| for a matrix A. Cao et al. (2019) described their estimator as a "one-step approximation to (3)", but did not study (3). Appealingly, the problem in (3) can be recast as an L_1-penalized least squares problem and computed via existing algorithms. However, none of (2), (3), or the method of Cao et al. (2019) provides estimates which are guaranteed to be positive definite, or even nonnegative definite (see Section 2.2). Replacing the feasible set in (2) or (3) by S^p_+, or a subset thereof, complicates computation substantially. For example, even in the context of standard covariance matrix estimation (i.e., when the log(W_j) are observable), enforcing sparsity and positive definiteness simultaneously is challenging (Bien and Tibshirani, 2011; Rothman, 2012; Xue et al., 2012; Xu and Lange, 2022).
In many applications, one requires a basis covariance matrix estimate from multiple distinct populations. For example, in our motivating data analysis, the goal is to compare how the microbes interact in the gut of patients with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) versus controls (Giloteaux et al., 2016). To estimate the two basis covariance matrices, one could apply existing estimators to each of the populations (ME/CFS patients and controls) separately. However, sample sizes are often small relative to the dimension of the basis covariance. For example, there are only 37 control and 47 ME/CFS patients used to estimate the two 39 × 39 basis covariance matrices.

[Figure 1: Networks estimated separately for each population from the data of Giloteaux et al. (2016) using the method of Cao et al. (2019). Each node corresponds to an OTU as described in Section 7. Green edges denote positive estimated correlations, red edges denote negative estimated correlations, and the absence of an edge indicates an estimated correlation of zero. The thickness of an edge indicates the magnitude of the correlation: thicker corresponds to a larger magnitude. Panel (a) is the network estimated from control patients while panel (b) is the network estimated from patients with ME/CFS.]
A more efficient approach would estimate the two covariance matrices jointly in order to borrow information across populations. If, for instance, the basis covariances have similar sparsity patterns, exploiting this shared information across populations can substantially improve efficiency. Joint estimation is especially common in the literature on estimating sparse covariance and inverse covariance matrices from multivariate normal data collected on multiple populations (Bigot et al., 2011; Guo et al., 2011; Danaher et al., 2014; Price et al., 2015; Ma and Michailidis, 2016; Cai et al., 2016; Saegusa and Shojaie, 2016; Price et al., 2021). In the context of estimating basis covariance matrices from microbiome data, it is natural to assume the covariance matrices have similar sparsity patterns. Biologically, it is often reasonable to assume there are microbes whose abundances are uncorrelated in all the populations in a study. For example, in Section 7, when we estimate the basis covariance matrices for ME/CFS patients and controls using our method, which shares information across populations, we estimate identical sparsity patterns. In contrast, when we estimate these matrices separately using an existing method, few estimated nonzero correlations are shared between populations (Figure 1). We investigate the reliability of these estimates in Section 7.2.
In this article, we study (3) under positive definite constraints, and propose a generalization of (3) for estimating multiple covariance matrices simultaneously. We establish asymptotic error bounds for both the single and multiple population versions of our estimator, and we propose an efficient algorithm for their computation. In simulation studies and our analysis of the ME/CFS microbiome data, we demonstrate that our methods can provide more reliable estimates of the covariance matrices of interest than existing competitors.

Multiple basis covariance matrix estimation
In the remainder of this article, we let the subscript (h) denote data or population parameters from the hth population, h ∈ [H] for some H ≥ 1. For example, x_(h)i ∈ R^p is the vector with compositional data for observation i ∈ [n_(h)] in the hth population. Similarly, Ω*_(h) is the basis covariance for the hth population.
We focus on estimating Ω*_(1), . . ., Ω*_(H) using the sample variation matrices Θ̂_(1), . . ., Θ̂_(H), defined elementwise by Θ̂_(h)jk = n_(h)^{-1} Σ_{i=1}^{n_(h)} (z_(h)ijk − z̄_(h)jk)^2, where z_(h)ijk = log(x_(h)ij/x_(h)ik) and z̄_(h)jk = n_(h)^{-1} Σ_{i=1}^{n_(h)} z_(h)ijk. To describe our estimator, define Ω ∈ R^{H×p×p} as the three-way tensor whose (h, j, k)th entry is the (j, k)th entry of Ω_(h). We present a visualization of the tensor Ω in Figure 2. The mode-1 fibers, Ω_{•jk}, are vectors containing the (j, k)th entry of all the Ω_(h). Assuming sparsity patterns are shared across populations is thus equivalent to assuming Ω*_{•jk} = 0 for many pairs (j, k). Finally, let ω_(h) = diag(Ω_(h)) ∈ R^p for each h ∈ [H]. Generalizing (3) with an additional positive definiteness constraint, we propose to estimate Ω* using

arg min_{Ω ∈ R^{H×p×p}} Σ_{h=1}^H ∥ω_(h)1_p^⊤ + 1_pω_(h)^⊤ − 2Ω_(h) − Θ̂_(h)∥_F^2 + λ Σ_{h=1}^H ∥Ω_(h)−∥_1 + γ Σ_{j≠k} ∥Ω_{•jk}∥_2 subject to Ω_(h) = Ω_(h)^⊤, Ω_(h) ≽ ϵI_p, h ∈ [H], (4)

where λ ≥ 0, γ ≥ 0, and ϵ ≥ 0 are user-specified tuning parameters, ∥•∥_2 denotes the Euclidean norm of a vector, and A ≽ ϵI_p means that A − ϵI_p is positive semidefinite. The estimator (4) imposes both a lasso-type penalty on the off-diagonal entries of the Ω_(h), as well as a group lasso penalty on the mode-1 fibers of the tensor Ω. Note that if H = 1, taking either λ = 0 or γ = 0 with the other nonzero, (4) simplifies to (3) with a positive definiteness constraint. For example, if γ = 0, then (4) simplifies to the estimator

arg min_{Ω_(h) ∈ R^{p×p}} ∥ω_(h)1_p^⊤ + 1_pω_(h)^⊤ − 2Ω_(h) − Θ̂_(h)∥_F^2 + λ∥Ω_(h)−∥_1 subject to Ω_(h) = Ω_(h)^⊤, Ω_(h) ≽ ϵI_p, (5)

applied to each of the H populations separately. The estimator (5) can be seen as a convex approximation to (2), where we have replaced the L_0 constraint with an L_1 penalty, and replaced the feasible set with the closed convex set {Ω ∈ R^{p×p} : Ω = Ω^⊤, Ω ≽ ϵI_p}. The tuning parameter ϵ serves as a lower bound on the smallest eigenvalue of the solution. For this reason, we do not recommend tuning ϵ, but rather fixing it at some reasonably small quantity like 10^{-4}, as in Xue et al. (2012).
By taking γ > 0, however, (4) ties the estimators of Ω*_(1), . . ., Ω*_(H) together. For large values of the tuning parameter γ, the second penalty in (4) will require that the solution to (4) has some Ω_{•jk} = 0, i.e., that sparsity is partially shared across all H basis covariance matrix estimates. In subfigure (a) of Figure 2, we provide an example of a group of parameters, the (1, 2)th element of each Ω_(h), which are jointly penalized by the group lasso penalty. The tuning parameter γ controls whether this group is entirely zero or not, whereas the tuning parameter λ controls sparsity in the individual entries of the Ω_(h), as displayed in subfigure (b).
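To make the two penalties concrete, the following sketch (our own code, not the SpPDCC package's API) evaluates the lasso and group lasso terms of the criterion for a tensor Ω stored as an H × p × p array:

```python
import numpy as np

def joint_penalty(Omega, lam, gam):
    """Evaluate lam * sum_h ||Omega_(h)-||_1 (off-diagonal entries only)
    plus gam * sum_{j != k} ||Omega_.jk||_2 (mode-1 fiber norms)."""
    H, p, _ = Omega.shape
    off = ~np.eye(p, dtype=bool)                          # off-diagonal mask
    lasso = np.abs(Omega[:, off]).sum()                   # individual entries
    group = np.linalg.norm(Omega[:, off], axis=0).sum()   # fiber norms
    return lam * lasso + gam * group

# Toy example with H = 2 populations and p = 3 variables
Omega = np.zeros((2, 3, 3))
Omega[0, 0, 1], Omega[1, 0, 1] = 3.0, 4.0   # fiber (0, 1) has norm 5
Omega[0, 1, 2] = -1.0                       # fiber (1, 2) has norm 1
val = joint_penalty(Omega, lam=1.0, gam=2.0)  # 1 * 8 + 2 * 6 = 20
```

Note that the diagonal fibers are excluded by the mask, matching the fact that the variances are unpenalized.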
Importantly, (4) is a convex optimization problem and, as we discuss in Section 3, can be solved using first-order methods.

Positive definiteness
To understand why enforcing positive definiteness can be necessary, consider estimating a single Ω* whose off-diagonal entries are assumed to be zero. Then the problem reduces to estimation of the variances of the log abundances, and (2) admits a closed-form solution.
Proposition 1. If p ≥ 3, the solution to (2) with s = 0 (or equivalently, (3) with λ = ∞) is unique and is given by

ω̂_j = (p − 2)^{-1} { Σ_{k≠j} Θ̂_jk − (2(p − 1))^{-1} Σ_{k,l} Θ̂_kl }, j ∈ [p].

The factor 2 in the denominator is due to the double sum running over both the upper and lower triangular parts of the symmetric Θ̂. The proposition reveals variance estimates can be negative if positive definiteness is not enforced. Roughly speaking, for large p, ω̂_j will be negative if the average of the elements in Θ̂ not in the jth row or column is larger than the average of the elements in the jth row and column. It is not difficult to produce such examples. As an illustration, the following Θ̂ resulted from simulating n = 10 compositional x_i ∈ C^2 by drawing the log(W) independently from a multivariate normal distribution with mean zero and identity covariance matrix:

Θ̂ = [ 0     3.83  2.45
      3.83  0     1.24
      2.45  1.24  0    ].
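The closed form in Proposition 1 is easy to check numerically. The sketch below fits Θ̂_jk ≈ ω_j + ω_k by least squares over all j ≠ k (the problem in Proposition 1 with all off-diagonals of Ω fixed at zero) for the 3 × 3 variation matrix above; the third variance estimate comes out negative:

```python
import numpy as np

# The sample variation matrix from the illustration in the text
Theta = np.array([[0.00, 3.83, 2.45],
                  [3.83, 0.00, 1.24],
                  [2.45, 1.24, 0.00]])
p = Theta.shape[0]

# Closed-form least squares solution: for each j, take the jth row sum of
# Theta, subtract the grand (double) sum scaled by 1 / (2(p - 1)), and
# divide by p - 2
omega = (Theta.sum(axis=1) - Theta.sum() / (2 * (p - 1))) / (p - 2)
print(omega)  # -> approximately [2.52, 1.31, -0.07]; a negative "variance"
```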

Existing estimators
An alternative estimator of a single basis covariance matrix Ω* is based on the centered log-ratio covariance matrix (Aitchison, 2003), Γ*, whose (j, k)th entry is

Γ*_jk = Cov[log{X_j/g(X)}, log{X_k/g(X)}], (j, k) ∈ [p] × [p],

where g(X) = (Π_{j=1}^p X_j)^{1/p} is the geometric mean of X. Specifically, Θ* = γ*1_p^⊤ + 1_p(γ*)^⊤ − 2Γ*, where γ* = diag(Γ*). Cao et al. (2019) show there exists a unique Γ* such that Θ* admits this decomposition. Thus, by proposing a two-step procedure to get an estimate of Γ* from Θ̂, they also get an indirect estimate of Ω* that can perform well when p is large. However, their estimator is not guaranteed to be positive definite, nor is it the solution to an optimization problem amenable to analysis. Fang et al. (2015) proposed a different estimator, using that with F = I_p − 1_p1_p^⊤/p, FΩ*F = F Cov(log X)F. Thus, replacing Cov(log X) with its sample version, say Ω̂_X, a natural estimating equation is F(Ω − Ω̂_X)F = 0. To account for differing variances in each element of F(Ω* − Ω̂_X)F, they propose a weighted least squares estimator, (6), which minimizes a weighted version of ∥F(Ω − Ω̂_X)F∥_F^2 with an L_1 penalty. Although Fang et al. (2015) suggest including a positive definiteness constraint on the optimization variable Ω in (6), their computational algorithm does not enforce this constraint. Instead, if the solution to (6) is not positive definite, they estimate Ω* using its nearest positive definite matrix. This can be appropriate, but often leads to a non-sparse estimate (Sun and Vandenberghe, 2015).
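The centered log-ratio covariance can be computed directly from compositional data. A minimal sketch (the function name is ours) also illustrates why an estimate built on Γ̂ cannot itself be positive definite without adjustment:

```python
import numpy as np

def clr_covariance(X):
    """Sample covariance of the centered log-ratio transform
    clr(x) = log(x) - (1/p) * sum_j log(x_j): an estimate of Gamma*."""
    L = np.log(X)
    clr = L - L.mean(axis=1, keepdims=True)  # subtract the log geometric mean
    return np.cov(clr, rowvar=False)         # p x p, unbiased (n - 1) scaling

rng = np.random.default_rng(1)
W = rng.lognormal(size=(50, 5))              # latent abundances
X = W / W.sum(axis=1, keepdims=True)         # observed compositions
Gamma = clr_covariance(X)
# Each row of Gamma sums to zero (the clr coordinates sum to zero exactly),
# so Gamma is singular: it can never be positive definite
```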
In contrast to the methods of Cao et al. (2019) and Fang et al. (2015), our estimator is "direct" in the sense that we do not rely on estimation of intermediate quantities like Γ * , nor do we rely on post-hoc adjustments to achieve positive definiteness.
Many other estimators of Ω* exist, though we do not cover them in detail here. In general, these estimators do not enforce both positive definiteness and sparsity, and are not specifically designed to estimate multiple covariance matrices simultaneously: see Friedman and Alm (2012); Ban et al. (2015); He et al. (2021); Li et al. (2022), for example, and see Ma et al. (2021) for a comprehensive review.
As mentioned in Section 1, our work is related to the literature on jointly estimating sparse precision (inverse covariance) matrices for multiple populations (e.g., Guo et al., 2011; Danaher et al., 2014; Price et al., 2015; Saegusa and Shojaie, 2016; Price et al., 2021). However, our work is distinct in at least two important ways. First, these existing methods are largely focused on estimating precision matrices, rather than covariance matrices. Sparse precision matrix estimation and sparse covariance matrix estimation are fundamentally different tasks. Second, these methods, broadly speaking, assume the observed data are normally distributed or utilize a normal negative log-likelihood as a loss function. These methods are thus not directly applicable to covariance matrix estimation, nor to the analysis of compositional data.
Differences in motivation aside, there are some similarities between our work and that of Guo et al. (2011) and Danaher et al. (2014), for example. Danaher et al. (2014) use the same penalty as (4), but applied to precision matrices for normally distributed data. Guo et al. (2011) use a group lasso-type penalty to encourage shared sparsity patterns across populations in the same context as Danaher et al. (2014). The existing work most closely related to our own is Bigot et al. (2011), who use a group lasso penalty in the context of estimating multiple sparse covariance matrices from data observed with additive noise. One could not straightforwardly use their method for estimating basis covariance matrices from compositional data, and moreover, neither their theory nor algorithms apply to our estimator.

Proximal-proximal gradient descent algorithm
In order to solve the optimization problem to compute (4), we must address both the nondifferentiability of the objective function and the positive definiteness constraint. To do so, we use the proximal-proximal gradient descent algorithm (Davis and Yin, 2017), which allows us to handle the nondifferentiable penalty and positive definiteness constraint separately. The algorithm generalizes the well-known proximal gradient descent algorithm (Parikh and Boyd, 2014, Section 4.2) to handle problems where the objective function to be minimized is the sum of three convex functions. Specifically, supposing f and g are closed, proper, and convex functions, and ℓ is convex and differentiable with β^{-1}-Lipschitz continuous gradient for some β > 0, consider a problem of the form

minimize_{u ∈ R^d} ℓ(u) + f(u) + g(u). (7)

Further suppose there exists a solution u⋆ to (7) satisfying 0 ∈ ∇ℓ(u⋆) + ∂f(u⋆) + ∂g(u⋆), where ∂f(u) denotes the subdifferential of f at u. The proximal operator of a function f evaluated at u is

prox_{αf}(u) = arg min_{w ∈ R^d} { αf(w) + ∥u − w∥_2^2 / 2 }.

Davis and Yin (2017) show that (7) can be solved by an algorithm whose (t)th iterates are computed using the updating equations

u_g^{(t)} = prox_{αg}(v^{(t)}), u_f^{(t)} = prox_{αf}{2u_g^{(t)} − v^{(t)} − α∇ℓ(u_g^{(t)})}, v^{(t+1)} = v^{(t)} + u_f^{(t)} − u_g^{(t)},

where v^{(0)} is an arbitrary point in R^d and α ∈ (0, 2β) is fixed. Here, the superscript (t) denotes the (t)th iterate. As t → ∞, u_g^{(t)} → u⋆ and u_f^{(t)} → u⋆ (Davis and Yin, 2017). In practice, however, this algorithm can be slow to converge: fixing the step size α ∈ (0, 2β) can sometimes lead to incremental progress. Therefore, we use a modified version of the proximal-proximal gradient descent algorithm proposed by Pedregosa and Gidel (2018), which allows us to start with a step size α larger than 2β and reduce its value as needed, thus potentially accelerating the descent. The (t + 1)th iterates of the algorithm use the updating equations

u_g^{(t+1)} = prox_{αg}(v^{(t)}), u_f^{(t+1)} = prox_{αf}{2u_g^{(t+1)} − v^{(t)} − α∇ℓ(u_g^{(t+1)})}, (8)

v^{(t+1)} = v^{(t)} + u_f^{(t+1)} − u_g^{(t+1)}. (9)

At each step, after (8) is carried out, the value ℓ(u_f^{(t+1)}) is compared to the quadratic bound Q(u_f^{(t+1)}, α) = ℓ(u_g^{(t+1)}) + ⟨∇ℓ(u_g^{(t+1)}), u_f^{(t+1)} − u_g^{(t+1)}⟩ + ∥u_f^{(t+1)} − u_g^{(t+1)}∥_2^2/(2α). If ℓ(u_f^{(t+1)}) ≤ Q(u_f^{(t+1)}, α), then the algorithm proceeds to (9). If ℓ(u_f^{(t+1)}) > Q(u_f^{(t+1)}, α), then α is replaced with τα, where τ ∈ (0, 1) is a constant, and (8) is carried out again. This process is repeated until ℓ(u_f^{(t+1)}) ≤ Q(u_f^{(t+1)}, α). The efficiency of this algorithm hinges on the ability to compute the proximal operators of the functions g and f efficiently. As we will show momentarily, we can write the optimization problem from (4) as (7), and the corresponding g and f have proximal operators which can be evaluated in closed form.
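A generic sketch of the Davis and Yin (2017) updating equations may be helpful. The toy problem below is our own construction (not from the paper): minimize ½∥u − b∥₂² + λ∥u∥₁ subject to u ≥ 0, whose solution is max(b − λ, 0):

```python
import numpy as np

def prox_l1(u, t):
    """Prox of t * ||.||_1: elementwise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def prox_nonneg(u, t):
    """Prox of the indicator of {u >= 0}: projection onto the orthant."""
    return np.maximum(u, 0.0)

def davis_yin(grad_l, prox_f, prox_g, v, alpha, iters=500):
    """Three-operator splitting for minimize l(u) + f(u) + g(u),
    with l smooth and f, g prox-friendly (Davis and Yin, 2017)."""
    for _ in range(iters):
        u_g = prox_g(v, alpha)
        u_f = prox_f(2 * u_g - v - alpha * grad_l(u_g), alpha)
        v = v + u_f - u_g
    return u_g

b, lam = np.array([2.0, -1.0, 0.5]), 1.0
u = davis_yin(grad_l=lambda u: u - b,          # gradient of 0.5 * ||u - b||^2
              prox_f=lambda u, a: prox_l1(u, a * lam),
              prox_g=prox_nonneg,
              v=np.zeros(3), alpha=1.0)
# converges to max(b - lam, 0) = [1, 0, 0]
```

Here ∇ℓ is 1-Lipschitz, so the fixed step α = 1 lies in the admissible range (0, 2β) with β = 1.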

Application to proposed estimator
In order to express the problem in (4) in a form analogous to (7), we must define the corresponding ℓ, f, and g. First, let χ_ϵ : R^{p×p} → {0, ∞} be the function

χ_ϵ(Ω) = 0 if Ω = Ω^⊤ and Ω ≽ ϵI_p, and χ_ϵ(Ω) = ∞ otherwise.

Then, with

ℓ(Ω) = Σ_{h=1}^H ∥ω_(h)1_p^⊤ + 1_pω_(h)^⊤ − 2Ω_(h) − Θ̂_(h)∥_F^2, f(Ω) = λ Σ_{h=1}^H ∥Ω_(h)−∥_1 + γ Σ_{j≠k} ∥Ω_{•jk}∥_2, and g(Ω) = Σ_{h=1}^H χ_ϵ(Ω_(h)),

the problem in (4) is equivalent to minimizing ℓ(Ω) + f(Ω) + g(Ω) over Ω ∈ R^{H×p×p}. (11)

If we minimize (11) over all Ω ∈ R^{H×p×p}, the minimizer with respect to each Ω_(h) must belong to the set {Ω ∈ R^{p×p} : Ω = Ω^⊤, Ω ≽ ϵI_p}, so (11) has the form of (7). Moreover, f and g are closed, proper, and convex functions; and the function ℓ is convex and differentiable with Lipschitz continuous gradient.
First, (12), the proximal step for f, can be separated across the second and third modes of Ω since f decomposes over the off-diagonal fibers Ω_{•jk}, j ≠ k, and leaves the diagonal fibers Ω_{•jj}, j ∈ [p], unpenalized. The solution to the fiberwise problem (15) is

(1 − αγ/∥soft(Ω_{•jk}, αλ)∥_2)_+ soft(Ω_{•jk}, αλ),

where (a)_+ = max(a, 0) and soft(y, τ) = max(|y| − τ, 0)sign(y) is applied elementwise (Simon et al., 2013). The second step, (13), also has a closed form solution. In particular, (13) can be solved with respect to each Ω_(h) separately, in parallel, using that

arg min_{Ω = Ω^⊤, Ω ≽ ϵI_p} ∥Ω − Ω_(h)∥_F^2 = Σ_{j=1}^p max(ξ_(h)j, ϵ) u_(h)j u_(h)j^⊤,

where u_(h)j and ξ_(h)j are the jth eigenvector and eigenvalue of Ω_(h) (Henrion and Malick, 2012). This is the projection of Ω_(h) onto the set {Ω ∈ R^{p×p} : Ω = Ω^⊤, Ω ≽ ϵI_p}. The convergence of the algorithm follows immediately from results in Pedregosa and Gidel (2018). The specific algorithm we implement is given in Algorithm 1. Without the positive definiteness constraint (e.g., by taking ϵ = −∞), a version of this algorithm simplifies to the standard proximal gradient descent algorithm (Parikh and Boyd, 2014, Section 4.2).
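Both proximal operators admit short implementations. The sketch below (our own code, applied to a generic fiber and matrix rather than the algorithm's iterates) shows the sparse group soft-thresholding for a single mode-1 fiber and the eigenvalue-floor projection enforcing Ω ≽ ϵI_p:

```python
import numpy as np

def prox_fiber(w, alam, agam):
    """Sparse group lasso prox for one off-diagonal fiber Omega_.jk:
    soft-threshold elementwise, then group soft-threshold the fiber."""
    s = np.sign(w) * np.maximum(np.abs(w) - alam, 0.0)
    nrm = np.linalg.norm(s)
    return np.maximum(1.0 - agam / nrm, 0.0) * s if nrm > 0 else s

def project_pd(Omega, eps=1e-4):
    """Projection onto {Omega = Omega^T, Omega >= eps * I}: floor the
    eigenvalues of the symmetrized input at eps (Henrion and Malick, 2012)."""
    S = (Omega + Omega.T) / 2
    xi, U = np.linalg.eigh(S)                    # eigenvalues / eigenvectors
    return (U * np.maximum(xi, eps)) @ U.T       # rebuild with floored spectrum

A = np.array([[1.0, 2.0], [2.0, 1.0]])           # eigenvalues 3 and -1
P = project_pd(A, eps=1e-4)
# the smallest eigenvalue of P is now exactly eps
```

Each projection costs one p × p eigendecomposition, consistent with the O(p^3) cost discussed below.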
Enforcing the positive definiteness constraint on each of the Ω_(h) requires the eigendecomposition of a p × p matrix, a computation costing O(p^3) floating point operations. To reduce the computational burden imposed by this additional constraint, our software implementation first solves (4) without the positive definiteness constraint. To do so, we use accelerated proximal gradient descent (Parikh and Boyd, 2014, Section 4.3). If the solution to the unconstrained problem satisfies the constraint, then we know it is also the solution to the constrained problem. If the solution does not satisfy the constraint, we then apply the proximal-proximal gradient descent algorithm, Algorithm 1, initializing at the solution to the unconstrained problem. This scheme can significantly improve the computing time, especially when n is large relative to p.
An R package implementing our estimators, along with code for reproducing the results in Section 6, can be downloaded from https://github.com/ajmolstad/SpPDCC.
Algorithm 1 Adaptive proximal-proximal gradient descent algorithm for computing multiple covariance matrices for compositional data.
5. If the objective function value has converged, terminate the algorithm. Else, set t = t + 1 and return to step 1.

Practical considerations
To select the tuning parameters (λ, γ), we use V-fold cross-validation. Given candidate tuning parameter sets, we select (λ, γ) for (4) as the pair minimizing a cross-validation criterion, where Ω̂^{λ,γ}_{−v} is the solution to (4) with input sample variation matrices Θ̂_{−v}, which are computed using all the data outside the vth fold.
In some applications, it may be preferable to let populations with larger sample sizes have a greater effect on the objective function. To do so, we propose an alternative version of (4), defined as the argument minimizing

Σ_{h=1}^H (n_(h)/N) ∥ω_(h)1_p^⊤ + 1_pω_(h)^⊤ − 2Ω_(h) − Θ̂_(h)∥_F^2 + λ Σ_{h=1}^H ∥Ω_(h)−∥_1 + γ Σ_{j≠k} ∥Ω_{•jk}∥_2 (16)

subject to the same constraints as (4), where N = Σ_{h=1}^H n_(h). The objective function (16) accounts for the distinct sample sizes by weighting each population's contribution to the likelihood according to its relative contribution to the total sample size N. When using this estimator, we also recommend modifying the cross-validation criterion so that each population's validation error is weighted by n_(h),v, the number of samples in the vth fold from the hth population. We compare the performance of (16) to (4), among other competitors, in the Appendix.

Asymptotics for single population estimator
Though our primary focus is the multipopulation estimator (4), the estimator (5) is itself a novel and useful estimator of a single basis covariance matrix. In this section, we study its asymptotic properties. Specifically, we study Ω̂, defined as the solution to the single-population problem (17). We will require the following assumptions.
A2. (Row-wise sparsity). As n → ∞, max_j s_j/p → 0, where s_j is the number of nonzero off-diagonal entries in the jth row of Ω*.
The assumptions A1-A3 are natural in the context of high-dimensional compositional data. Assumption A2 is needed to establish the restricted strong convexity of the loss function. Cao et al. (2019) assume similar sparsity, which in their setting ensures approximate identifiability (see their Proposition 1 and the comments following it). An analogous interpretation is possible here: unidentifiable parameters often lead to loss functions that are constant in some directions, but Assumption A2 ensures the loss function is strictly convex around Ω* in the directions that matter (see Negahban et al., 2012, Section 2.4, for details). For this assumption to hold, we need p to grow with n. This is congruous with the assumptions in Cao et al. (2019), who require p → ∞ as n → ∞ for consistency and characterize this as a "blessing of dimensionality". Assumption A3 is standard in high-dimensional covariance matrix estimation.
We now state our first result concerning the asymptotic error of our estimator. The proof of this and all subsequent results can be found in the Appendix. Recall s = Σ_{j=1}^p s_j, and let φ_p denote the pth largest eigenvalue of its matrix-valued argument.
The error bound in (18) consists of two parts: the error for estimating the off-diagonals and the diagonals. The asymptotic Euclidean norm error for the diagonals is b_1{p log(p)/n}^{1/2}. The Frobenius norm error for the off-diagonals, however, cannot be disentangled from the diagonal error. Though our results would seem to suggest that ∥[Ω̂ − Ω*]_−∥_F ≤ b_1{s log(p)/n}^{1/2} with probability tending to one, we are only able to establish a bound on the combined error. We cannot isolate the asymptotic error for the off-diagonals because of the intrinsic connection between the diagonals and off-diagonals in the objective function (17). This is in contrast to some traditional covariance matrix estimators, where off-diagonals can be estimated in a way which is not dependent on the diagonals. Note that although (17) is the solution to a penalized least squares problem, we do not assume any type of restricted eigenvalue condition (Raskutti et al., 2010). Instead, in our proof we first show that Ω̂ − Ω* belongs to a restricted set, then establish a quadratic lower bound on ℓ(Ω̂) − ℓ(Ω*) − tr{∇ℓ(Ω*)^⊤(Ω̂ − Ω*)} over this set, where ℓ is the objective function from (3) with λ = 0. Our technique for establishing this bound may be applicable in other penalized least squares problems.
Direct comparison of our estimation error bound to those established in Cao et al. (2019) is not possible, as their results are given in terms of the spectral norm and under a different set of assumptions.

Asymptotics for multiple population estimator
Next, we consider the multiple population estimator (4) with λ = 0. By doing so, we are able to illustrate how our method exploits shared sparsity across the populations. Our results will apply with N = Σ_{h=1}^H n_(h) tending to infinity. To establish error bounds for this estimator, we will need a slightly different set of assumptions than in the single population case.
A7. (Alignment of n_(h), p, H, and L). As N → ∞, p → ∞ and log(p)/n_(h) → 0 for each h ∈ [H].
Assumption A4 requires that the log abundances take values in the interval [−L, L]. When W_(h)k is a normalized count, as is standard in microbiome data, this assumption requires that all counts are bounded away from zero and infinity. A positive lower bound on W_(h)k is often assumed implicitly in the analysis of compositional data. Of course, A4 is stronger than A1, but allows us to establish a concentration inequality on the Euclidean norm of the fibers of ∇ℓ(Ω*). The quantity L will appear in our asymptotic error bound, so this assumption is not so restrictive since L can be arbitrarily large.
Assumption A5 requires that the number of nonzero off-diagonal entries in any of the Ω * (h) does not grow too quickly with p. Like A2, A5 implicitly requires that p grows as N → ∞.
Assumption A6 is a requirement on how frequently, as N → ∞, we sample from each of the H populations. This assumption requires that we do not systematically undersample from any of the H populations. Our error bounds will depend on π, so we can quantify how sampling affects estimation accuracy.
We are ready to state our asymptotic error bound for (4) with λ = 0. Our bound will depend on s = Σ_{j=1}^p s_j.
The bound in Theorem 5.2 can be interpreted in a similar way as the bound in Theorem 5.1. Specifically, we cannot separate the error for estimating the off-diagonals of the Ω*_(h) from the error for estimating the diagonals. In particular, the diagonals and off-diagonals affect the error bound through their contributions to the numerator in the leftmost term of the bound: the √s comes from having to estimate the nonzero entries in s off-diagonals of the Ω*_(h), whereas the √p comes from having to estimate p diagonal entries in each Ω*_(h). Just as in Theorem 5.1, we can establish a bound specifically for the diagonals. If there exists a constant b_3 ∈ (0, ∞) such that L^4 H ≤ b_3 log(p) for N sufficiently large, which is natural since one would not expect H nor L to grow with N, then we achieve essentially the same result as in Theorem 5.1: there exists a constant b_4 ∈ (0, ∞) such that ∥ω̂ − ω*∥_F ≤ b_4{p log(p)/(πN)}^{1/2} with probability tending to one.
If each Ω * (h) had a substantial number of zeros which were not shared across all H populations, (4) should perform better with λ > 0. Specifically, we conjecture that in this case, one could replace A5 with a combination of A2 (applied to each population separately) and a relaxed version of A5, and obtain an improved error bound relative to that in Theorem 5.2 by taking λ > 0. However, proving this type of result is technically challenging, and it is unclear whether some of our intermediate results can be generalized.
Model 1. The Ω*_(h) are tridiagonal with either all positive or all negative correlations.
Model 2. The Ω*_(h) are block diagonal with each block having an AR(1) structure, where D ∈ R^{p×p} is a diagonal matrix with diagonal entries equally spaced from 3 to 1.
As competitors, we consider existing methods applied to each population separately. Specifically, we use the method of Cao et al. (2019) with adaptive soft-thresholding, COAT, and the method of Fang et al. (2015), cclasso. We also use an oracle estimator, Oracle, which is the adaptively soft-thresholded sample covariance matrix of each log(W_(h)1), . . ., log(W_(h)n_(h)). This is an oracle method in the sense that we do not have access to the underlying abundances in practice. Finally, we consider two versions of our method, SCC and SCC-H, where SCC is short for sparse compositional covariance matrices. The estimator SCC is defined in (4), whereas SCC-H is (5) with a separate tuning parameter λ chosen for each h ∈ [H]. The method SCC-H estimates the covariance matrices separately, but using a version of our criterion. Including both SCC and SCC-H serves to illustrate to what degree the improvement in performance is driven by the loss function versus the sharing of sparsity patterns across the fibers of Ω*.
To assess the performance of each method, we measure the average (over H populations and 50 independent replications) Frobenius norm error and L_1 matrix norm error of the estimated covariance matrices on the correlation scale. We use the correlation scale because cclasso was designed specifically for correlation matrix estimation. Relative performances under both Frobenius norm and L_1 matrix norm error on the covariance scale are similar and thus relegated to the Appendix.

Results
In Figure 3, we display average Frobenius norm errors (divided by p). With p = 40, the cclasso software would sometimes return undefined estimates (NA in R), so we omit comparisons in these settings. Unsurprisingly, under Model 1, SCC substantially outperforms all of the competitors, including Oracle. This illustrates the utility of exploiting shared sparsity patterns when estimating multiple covariance matrices. Notably, Oracle, COAT, and SCC-H all perform similarly in each setting under Model 1. Under Model 2, SCC outperforms all competitors, including Oracle, once p ≥ 120. Comparing the competitors which could be used in practice, SCC-H performs better than COAT and cclasso in all situations. The estimator COAT performs worse than cclasso for small p, but significantly outperforms cclasso once p ≥ 120. Finally, under Model 3, SCC and SCC-H perform similarly in every setting.
The only method to outperform SCC and SCC-H is Oracle, which cannot be used in practice.
The fact that Oracle outperforms the other methods so substantially speaks to the difficulty of this estimation problem. One should be careful drawing conclusions based on Frobenius norm error results alone, however, because our methods, SCC and SCC-H, both minimize a Frobenius norm criterion, whereas COAT does not. Thus, these results may be biased in favor of SCC and SCC-H. For this reason, we also included L_1 matrix norm results in Figure 4. Here, the L_1 matrix norm is the maximum of the L_1 vector norms of the columns of a matrix. Under Model 1 with n = 50, there appears to be little difference between the methods, other than cclasso, in terms of L_1 matrix norm. However, when n ≥ 100, SCC significantly outperforms all competitors. Under Model 2, the results more closely mirror those in Figure 3: SCC outperforms all competitors, including Oracle, when p ≥ 120. The results under Model 3, relatively speaking, are similar to those observed using Frobenius norm error. The method Oracle performs best, but among the methods which could be used in practice, SCC and SCC-H clearly outperform cclasso and COAT.
The performance of SCC and SCC-H can be partially explained by their performance in recovering the true set of nonzero off-diagonal entries. Under Model 1, SCC has nearly perfect TPR, and TNR only slightly lower than the best performing competitor. SCC-H tends to have TPR similar to COAT, but also tends to have higher TNR. A similar conclusion can be drawn under Model 2. Under Model 3, however, COAT tends to have higher TPR and similar TNR to SCC, whereas SCC-H has lower TPR and higher TNR than COAT. Note that under Model 3, Oracle performs well in part because the covariance matrices have varying diagonals.
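The support recovery metrics above can be sketched as follows; `support_rates` is an illustrative helper, computing TPR and TNR over off-diagonal entries only.

```python
def support_rates(est, truth, tol=1e-10):
    """TPR and TNR for recovering the nonzero off-diagonal support of `truth`."""
    p = len(truth)
    tp = fn = tn = fp = 0
    for i in range(p):
        for j in range(p):
            if i == j:
                continue  # diagonals are always nonzero; exclude them
            truly_nonzero = abs(truth[i][j]) > tol
            est_nonzero = abs(est[i][j]) > tol
            if truly_nonzero and est_nonzero:
                tp += 1
            elif truly_nonzero:
                fn += 1
            elif est_nonzero:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr

# Toy example: the (1,3) pair is falsely detected, the (2,3) pair is a true
# negative, and the (1,2) pair is a true positive, so TPR = 1.0 and TNR = 0.5.
truth = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
est = [[1.0, 0.4, 0.1], [0.4, 1.0, 0.0], [0.1, 0.0, 1.0]]
tpr, tnr = support_rates(est, truth)
```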
In the Appendix, we provide additional simulation study results. First, we present the results from Figures 3 and 4, but on the covariance scale. Relative performances closely mirror those in Figures 3 and 4. Second, we present results comparing SCC-H to COAT and cclasso in terms of estimating Ω*(1) under Models 1-3. Our results clearly demonstrate that SCC-H can significantly outperform both COAT and cclasso for single population basis covariance matrix estimation. Finally, we also perform additional simulation studies wherein the sample sizes are (n(1), n(2), n(3), n(4)) = (100, 75, 50, 25). We compare the same competitors as in Section 6.2, but also include (16). Under Model 1, (16) outperforms the competitors, though under Models 2 and 3, there is little difference between (4) and (16).
7 Analysis of microbiome in myalgic encephalomyelitis/chronic fatigue syndrome

Basis covariance matrix estimation
We illustrate our method by analyzing data on the gut microbiomes of patients diagnosed with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) and of controls from Giloteaux et al. (2016). To obtain the microbial profiles, Giloteaux et al. (2016) sequenced 16S rRNA genes from stool samples using Illumina MiSeq. After first filtering patients based on total reads (≥ 5000), we filter operational taxonomic units (OTUs) to only those that comprise at least 10% of total reads in one or more patients. This reduced the original 138 OTUs to p = 39 OTUs, yet led to us filtering out only 8% and 10% of each subject's total reads (on average) in controls and ME/CFS, respectively. Following Cao et al. (2019), we add 0.5 to all counts to avoid zeros before converting counts to compositions. To be clear, these counts Y(h)i are distinct from the latent abundances W(h)i, which are assumed to be independent and identically distributed for all i ∈ [n(h)]. For example, we may assume that Y(h)ij = M(h)i W(h)ij, where M(h)i is a positive random variable (Ma et al., 2021). In the Appendix, we demonstrate that our estimates do not change much when we use a pseudocount of 0.01 instead of 0.5 to handle zeros.
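The pseudocount and composition step described above can be sketched in a few lines; `to_composition` is our own illustrative name for the preprocessing applied to each subject's count vector.

```python
def to_composition(counts, pseudocount=0.5):
    """Add a pseudocount to every OTU count (to avoid zeros), then
    normalize so the resulting relative abundances sum to one."""
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    return [s / total for s in shifted]

# A subject with raw counts (0, 3, 7) becomes (0.5, 3.5, 7.5)/11.5,
# a strictly positive vector summing to one.
comp = to_composition([0, 3, 7])
```

With `pseudocount=0.01`, the zero entries shrink toward zero while the composition remains strictly positive, which is why the Appendix sensitivity check is informative.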
Our estimates of the covariance matrices, with tuning parameters chosen using ten-fold cross-validation, are displayed in Figure 5. Each node in these graphs represents a unique OTU. The nodes are colored according to the OTU's phylum, and each node's family, genus, and species are provided in the Appendix. The thickness of an edge corresponds to the strength of the association: stronger associations are represented by thicker edges. Positive and negative correlations are colored green and red, respectively, while a zero correlation is represented by the absence of an edge.
Examining the estimated covariance matrices, the majority of associations occur between two OTUs belonging to the same phylum. We also see that our method estimates the two groups' covariance matrices to have identical sparsity patterns, in sharp contrast with the estimates based on COAT, the method of Cao et al. (2019) (Figure 1). Most strong positive and negative associations are shared across the two groups. Notably, one of the eight associations whose direction differs across controls and ME/CFS is (23-3; Ruminococcus bromii-Bacteroides ovatus). Ruminococcus bromii is known to degrade resistant starch particles inaccessible to other bacteria, whereas Bacteroides ovatus digests inulin (Porter and Martens, 2016). That these two OTUs are estimated to be associated is interesting since both resistant starch and inulin are fermentable carbohydrates whose joint behavior has been of interest in past studies (Younes et al., 2001).
In addition, an insight gleaned from our estimates is that more negative associations are observed in ME/CFS patients than in controls. This coheres with the reduced diversity in the microbiome communities of ME/CFS patients observed by Giloteaux et al. (2016). Moreover, the positive associations between (25-35) and (6-17) are much stronger in ME/CFS than in controls. This too suggests reduced diversity in ME/CFS, as the OTUs labeled 6, 35, 25, and 17 all belong to the same phylum, Firmicutes.
Estimates using COAT are more difficult to interpret. First, there is a larger number of nonzero entries in both estimates, and their sparsity patterns differ substantially. In total, COAT identifies 185 associations in one population that are not present in the other. Moreover, the estimates from COAT disagree in terms of their strongest associations. For example, one of the strongest positive associations estimated in controls is between (29-13), whereas in patients with ME/CFS, COAT estimates these two OTUs to be uncorrelated.
Finally, we emphasize that our estimator (4) does not require that sparsity patterns are identical across ME/CFS and controls. Instead, the similarity of sparsity patterns is determined by the combination of tuning parameters (γ, λ), which are selected by cross-validation. Thus, in this application, it is the data which suggest that the sparsity patterns are identical.

Stability assessment
We perform a stability assessment to determine to what degree our respective estimates, displayed in Figures 1 and 5, are reliable. Following Cao et al. (2019), we generate 100 independent bootstrap samples and refit both estimators on each bootstrap sample. We say an estimated nonzero correlation is stable if it is nonzero in at least 80 of the 100 bootstrap samples. In Table 2, we report the stability of each correlation estimate.
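The bootstrap stability criterion can be sketched as follows. This is an illustrative outline, not our software: `stability` and `fit` are hypothetical names, with `fit` standing in for refitting either estimator to a resampled data set.

```python
import random

def stability(fit, data, n_boot=100, threshold=80, seed=0):
    """For each off-diagonal entry, count the bootstrap refits in which it
    is estimated nonzero; entries with count >= threshold are called stable.
    `fit` maps a list of observation vectors to a p x p matrix estimate."""
    rng = random.Random(seed)
    n = len(data)
    p = len(fit(data))
    counts = [[0] * p for _ in range(p)]
    for _ in range(n_boot):
        boot = [data[rng.randrange(n)] for _ in range(n)]  # resample subjects
        est = fit(boot)
        for i in range(p):
            for j in range(p):
                if i != j and abs(est[i][j]) > 1e-10:
                    counts[i][j] += 1
    return [[counts[i][j] >= threshold for j in range(p)] for i in range(p)]
```

With `n_boot=100` and `threshold=80`, this reproduces the "nonzero in at least 80 of the 100 bootstrap samples" rule described above.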
In the first four columns, we assess the stability of all correlations: in the rows labeled positive and negative, we report the number of correlations estimated to be positive and negative, respectively, in the estimates displayed in Figures 1 and 5. In the row labeled stability, we report the percentage of these correlations which were estimated to be nonzero in at least 80 of the 100 bootstrap samples. For example, SCC estimated 22 positive and 16 negative correlations (Figure 5a); of these 38 correlations, 89.5% were estimated to be nonzero in at least 80 bootstrap samples. For our method, in both the controls and ME/CFS basis covariance matrix estimates, almost all of the edges we estimated to be nonzero are stable. COAT, on the other hand, has lower stability in both controls and ME/CFS.
In the first row of the "shared correlations" columns of Table 2, we report the number of estimated correlations which were positive in both estimates (controls and ME/CFS), or negative in both estimates. Our method estimated that all correlations are shared, and has reasonably high stability. In particular, this column indicates that of the 40 shared correlations, 90% were estimated in at least 80 bootstrap samples. COAT has slightly lower stability for its shared correlations despite estimating fewer than half as many as our method.
Finally, the most telling result comes in the "distinct correlations" columns of Table 2. Here, we report the number of correlations which were nonzero in controls and zero in ME/CFS (D1), and the number which were zero in controls and nonzero in ME/CFS (D2). We see that SCC estimates no correlations to be distinct, whereas COAT estimates 185 correlations to be distinct. However, the stability of these distinct correlations is zero: none appeared in 80 or more of the bootstrap samples. These results suggest that the estimates provided by our method may be more reliable than those from COAT.

Discussion
In this article, we proposed a new method for estimating basis covariance matrices from compositional data. An important question about our method is whether it could provide reasonable estimates of the basis precision (inverse covariance) matrix. Though our method can provide estimates of Ω−1*(h) (since our estimates are always positive definite), these estimates will not, in general, be sparse. If a practitioner is interested in sparse precision matrix estimation, we recommend using methods specifically designed for this task, e.g., Zhang et al. (2023). To the best of our knowledge, there exist no methods for jointly estimating multiple sparse precision matrices from compositional data. This could be a fruitful direction for future research.
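To see why inverting a sparse, positive definite covariance estimate generally yields a dense precision matrix, consider a small tridiagonal example; the pure-Python `invert` below is our own illustrative helper, not part of our software.

```python
def invert(A):
    """Gauss-Jordan inversion of a small positive definite matrix
    (pivots are nonzero for positive definite inputs)."""
    p = len(A)
    # Augment A with the identity, then row-reduce to [I | A^{-1}].
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(A)]
    for col in range(p):
        pivot = aug[col][col]
        aug[col] = [x / pivot for x in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

# Sparse tridiagonal covariance: the (1,3) entry is zero...
sigma = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
omega = invert(sigma)
# ...but the (1,3) entry of the inverse is 1/4: sparsity is lost.
```

This is why a practitioner seeking sparse precision matrices should use a method designed for that target rather than inverting a sparse covariance estimate.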
There are two aspects of our data analysis which could be improved.First, the original data were counts (reads per OTU), which we converted to compositions.It has been argued that total reads per patient is an experimental artifact, and thus, microbiome sequencing data should be converted to compositions (e.g., see Gloor et al., 2017, and references therein).However, as pointed out by a referee, there is nonetheless some loss of information when we ignore total reads per patient.Ideally, an estimator could somehow make use of this additional information.Second, our method assumes that components of the observed composition are positive with probability one.In 16S rRNA sequencing (microbiome) data, however, it is common to observe many zeros.Thus, as future work, we hope to extend our method to address these two issues.

Figure 2 :
Figure 2: Visualization of (a) the fibers of Ω which are penalized by the final term in (4), (b) the organization of Ω into the Ω(h), and (c) the three-way tensor Ω.

Figure 5 :
Figure 5: Estimated correlation networks using (4) for (a) control patients and (b) ME/CFS patients. The thickness of an edge corresponds to the strength of the association: stronger associations are represented by thicker edges. Positive and negative correlations are colored green and red, respectively, while a zero correlation is represented by the absence of an edge.

Table 2 :
Stability for all correlations, shared correlations, and distinct correlations over 100 bootstrap samples. For the distinct correlation columns, D1 refers to a correlation which was nonzero in controls but zero in ME/CFS, whereas D2 refers to a correlation which was zero in controls but nonzero in ME/CFS.