Sparse permutation invariant covariance estimation

The paper proposes a method for constructing a sparse estimator for the inverse covariance (concentration) matrix in high-dimensional settings. The estimator uses a penalized normal likelihood approach and forces sparsity by using a lasso-type penalty. We establish a rate of convergence in the Frobenius norm as both data dimension $p$ and sample size $n$ are allowed to grow, and show that the rate depends explicitly on how sparse the true concentration matrix is. We also show that a correlation-based version of the method exhibits better rates in the operator norm. We also derive a fast iterative algorithm for computing the estimator, which relies on the popular Cholesky decomposition of the inverse but produces a permutation-invariant estimator. The method is compared to other estimators on simulated data and on a real data example of tumor tissue classification using gene expression data.


Introduction
Estimation of large covariance matrices, particularly in situations where the data dimension p is comparable to or larger than the sample size n, has attracted a lot of attention recently.The abundance of high-dimensional data is one reason for the interest in the problem: gene arrays, fMRI, various kinds of spectroscopy, climate studies, and many other applications often generate very high dimensions and moderate sample sizes.Another reason is the ubiquity of the covariance matrix in data analysis tools.Principal component analysis (PCA), linear and quadratic discriminant analysis (LDA and QDA), inference about the means of the components, and analysis of independence and conditional independence in graphical models all require an estimate of the covariance matrix or its inverse, also known as the precision or concentration matrix.Finally, recent advances in random matrix theory -see Johnstone (2001) for a review, and also Paul (2007) -allowed in-depth theoretical studies of the traditional estimator, the sample (empirical) covariance matrix, and showed that without regularization the sample covariance performs poorly in high dimensions.These results helped stimulate research on alternative estimators in high dimensions.
Many alternatives to the sample covariance matrix have been proposed.A large class of methods covers the situation where variables have a natural ordering, e.g., longitudinal data, time series, spatial data, or spectroscopy.The implicit regularizing assumption underlying these methods is that variables far apart in the ordering have small correlations (or partial correlations, if the object of regularization is the concentration matrix).Methods for regularizing covariance by banding or tapering have been proposed by Bickel and Levina (2004) and Furrer and Bengtsson (2007).Bickel and Levina (2008) showed consistency of banded estimators in the operator norm under mild conditions as long as (log p)/n → 0, for both banding the covariance matrix and the Cholesky factor of the inverse discussed below.
When the inverse of the covariance matrix is the primary goal and the variables are ordered, regularization is usually introduced via the modified Cholesky decomposition, Here L is a lower triangular matrix with l jj = 1 and l jj ′ = −φ jj ′ , where φ jj ′ , j ′ < j is the coefficient of X j ′ in the population regression of X j on X 1 , . . ., X j−1 , and D is a diagonal matrix with residual variances of these regressions on the diagonal.Several approaches to regularizing the Cholesky factor L have been proposed, mostly based on its regression interpretation.A k-banded estimator of L can be obtained by regressing each variable only on its closest k predecessors; Wu and Pourahmadi (2003) proposed this estimator and chose k via an AIC penalty.Bickel and Levina (2008) showed that banding the Cholesky factor produces a consistent estimator in the operator norm under weak conditions on the covariance matrix, and proposed a cross-validation scheme for picking k.Huang et al. (2006) proposed adding either an l 2 (ridge) or an l 1 (lasso) penalty on the elements of L to the normal likelihood.The lasso penalty creates zeros in L in arbitrary locations, which is more flexible than banding, but (unlike in the case of banding) the resulting estimate of the inverse may not have any zeros at all.Levina et al. (2008) proposed adaptive banding, which, by using a nested lasso penalty, allows a different k for each regression, and hence is more flexible than banding while also retaining some sparsity in the inverse.Bayesian approaches to the problem introduce zeros via priors, either in the Cholesky factor (Smith and Kohn, 2002) or in the inverse itself (Wong et al., 2003).
There are, however, many applications where an ordering of the variables is not available: genetics, for example, or social and economic studies.Methods that are invariant to variable permutations (like the covariance matrix itself) are necessary in such applications.Regularizing large covariance matrices by Steinian shrinkage of eigenvalues has been proposed early on (Haff, 1980;Dey and Srinivasan, 1985).More recently, Ledoit and Wolf (2003) proposed a way to compute an optimal linear combination of the sample covariance with the identity matrix, which also results in shrinkage of eigenvalues.Shrinkage estimators are invariant to variable permutations but they do not affect the eigenvectors of the covariance, only the eigenvalues, and it has been shown that the sample eigenvectors are also not consistent when p is large (Johnstone and Lu, 2004).Shrinking eigenvalues also does not create sparsity in any sense.Sometimes alternative estimators are available in the context of a specific application -e.g., for a factor analysis model with known factors Fan et al. (2008) develop regularized estimators for both the covariance and its inverse.
Our focus here will be on sparse estimators of the concentration matrix.Sparse concentration matrices are widely studied in the graphical models literature, since zero partial correlations imply a graph structure.The classical graphical models approach, however, is different from covariance estimation, since it normally focuses on just finding the zeros.For example, Drton and Perlman (2008) develop a multiple testing procedure for simultaneously testing hypotheses of zeros in the concentration matrix.There are also more algorithmic approaches to finding zeros in the concentration matrix, such as running a lasso regression of each variable on all the other variables (Meinshausen and Bühlmann, 2006), or the PC-algorithm (Kalisch and Bühlmann, 2007).Both have been shown to be consistent in high-dimensional settings, but none of these methods supply an estimator of the covariance matrix.In principle, once the zeros are found, a constrained maximum likelihood estimator of the covariance can be computed (Chaudhuri et al., 2007), but it is not clear what the properties of such a two-step procedure would be.
Two recent papers, d 'Aspremont et al. (2008) and Yuan and Lin (2007), take a penalized likelihood approach by applying an l 1 penalty to the entries of the concentration matrix.This results in a permutation-invariant loss function that tends to produce a sparse estimate of the inverse.Yuan and Lin (2007) used the max-det algorithm to compute the estimator, which limited their numerical results to values of p ≤ 10, and derived a fixed p, large n convergence result.d 'Aspremont et al. (2008) proposed a much faster semi-definite programming algorithm based on Nesterov's method for interior point optimization.While this paper was in review, a new very fast algorithm for the same problem was proposed by Friedman et al. (2008), which is based on the coordinate descent algorithm for the lasso (Friedman et al., 2007).
In this paper, we analyze the estimator resulting from penalizing the normal likelihood with the l 1 penalty on the entries of the concentration matrix (we will refer to this estimator as SPICE -Sparse Permutation Invariant Covariance Estimator) in the high-dimensional setting, allowing both the dimension p and the sample size n to grow.We give an explicit convergence rate in the Frobenius norm and show that the rate depends on how sparse the true concentration matrix is.For a slight modification of the method based on using the sample correlation matrix, we obtain the rate of convergence in operator norm and show that it is essentially equivalent to the rate of thresholding the covariance matrix itself obtained in Bickel and Levina (2007).We also derive our own optimization algorithm for computing the estimator, based on Cholesky decomposition and the local quadratic approximation.Unlike other estimation methods that rely on the Cholesky decomposition, our algorithm is invariant under variable permutations.Because we use the local quadratic approximation, the algorithm is equally applicable to general l q penalties on the entries of the inverse, not just l 1 .
The rest of the paper is organized as follows: Section 2 summarizes the SPICE approach in general, and presents consistency results.The Cholesky-based computational algorithm, along with a discussion of optimization issues, is presented in Section 3. Section 4 presents numerical results for SPICE and a number of other methods, for simulated data and a real example on classification of colon tumors using gene expression data.Section 5 concludes with discussion.

Analysis of the SPICE method
We assume throughout that we observe X 1 , . . ., X n , i.i.d.p-variate normal random variables with mean 0 and covariance matrix Σ 0 , and write X i = (X i1 , . . ., X ip ) T .Let Σ 0 = [σ 0ij ], and Ω 0 = Σ −1 0 be the inverse of the true covariance matrix.For any matrix M = [m ij ], we write |M | for the determinant of M , tr(M ) for the trace of M , and ϕ max (M ) and ϕ min (M ) for the largest and smallest eigenvalues, respectively.We write M + = diag(M ) for a diagonal matrix with the same diagonal as M, and M − = M − M + .In the asymptotic analysis, we will use the Frobenius matrix norm M 2 F = i,j m 2 ij , and the operator norm (also known as matrix 2-norm), M 2 = ϕ max (M M T ).We will also write | • | 1 for the l 1 norm of a vector or matrix vectorized, i.e., for a matrix It is easy to see that under the normal assumption the negative log-likelihood, up to a constant, can be written in terms of the concentration matrix as is the sample covariance matrix.We define the SPICE estimator Ωλ of the inverse covariance matrix as the minimizer of the penalized negative log-likelihood, where λ is a non-negative tuning parameter, and the minimization is taken over symmetric positive definite matrices.SPICE is identical to the lasso-type estimator proposed by Yuan and Lin (2007), and very similar to the estimator of d 'Aspremont et al. (2008) (they used |Ω| 1 rather than |Ω − | 1 in the penalty).The loss function is invariant to permutations of variables and should encourage sparsity in Ω due to the l 1 penalty applied to its off-diagonal elements.
We make the following assumptions about the true model: Note that assumption A2 guarantees that Ω 0 exists.Assumption A1 is more of a definition, since it does not stipulate anything about s (s = p(p − 1)/2 would give a full matrix).
Theorem 1.Let Ωλ be the minimizer defined by (1).Under A1, A2, A3, if The theorem can be restated, more suggestively, as The reason for the second formulation (3) is the relation of the Frobenius norm to the operator norm, Before proceeding with the proof of Theorem 1, we discuss a modification to SPICE based on using the correlation matrix.An inspection of the proof reveals that the worst part of the rate, p log p/n, comes from estimating the diagonal.This suggests that if we were to use the correlation matrix rather than the covariance matrix, we should be able to get the rate of s log p/n.Indeed, let Σ 0 = W ΓW , where Γ is the true correlation matrix, and W is the diagonal matrix of true standard deviations.Let Ŵ and Γ be the sample estimates of W and Γ, i.e., Ŵ 2 = Σ+ , Γ Then we can define a modified correlation-based estimator of the concentration matrix by It turns out that in the Frobenius norm Ω has the same rate as Ω, but for Ω we can get a convergence rate in the operator norm (matrix 2-norm).As discussed previously by Bickel andLevina (2008), El Karoui (2007) and others, the operator norm is more appropriate than the Frobenius norm for spectral analysis, e.g., PCA.It also allows for a direct comparison with banding rates obtained in Bickel and Levina (2008) and thresholding rates in Bickel and Levina (2007).
Theorem 2. Under assumptions of Theorem 1, Note.This rate is very similar to the rate for thresholding the covariance matrix obtained by Bickel and Levina (2007).They showed that under the assumption max i j |σ ij | q ≤ c 0 (p) for 0 ≤ q < 1, if the sample covariance entries are set to 0 when their absolute values fall below the threshold λ = M log p n , then the resulting estimator converges to the truth in operator norm at the rate no worse than O P c 0 (p) log p n (1−q)/2 .Since the truly sparse case corresponds to q = 0, and c 0 (p) is a bound on the number of non-zero elements in each row, and thus √ s ≍ c 0 (p), this rate coincides with ours, even though the estimator and the method of proof are very different.However, Lemma 1 below is the basis of the proof in both cases, and ultimately it is the bound (6) that gives rise to the same rate.A similar rate has been obtained for banding the covariance matrix in Bickel and Levina (2008), under an additional assumption that depends on the ordering of the variables and is not applicable here (see Bickel and Levina (2007) for a comparison between banding and thresholding rates).
In the proof, we will need a lemma of Bickel and Levina (2008) (Lemma 3) which is based on a large deviation result of Saulis and Statulevičius (1991).We state the result here for completeness.
where c 1 , c 2 and δ depend on k only.
Proof of Theorem 1.Let Note that we suppress the dependence on λ in Ω and ∆.
The main idea of the proof is as follows.Consider the set where the minimizer ∆ must be inside the sphere defined by Θ n (M ), and hence For the logarithm term in (7), doing the Taylor expansion of f (t) = log |Ω + t∆| and using the integral form of the remainder and the symmetry of ∆, Σ 0 , and ), and ∆ is ∆ vectorized to match the dimensions of the Kronecker product.
Therefore, we may write (7) as, For an index set A and a matrix Now, using symmetry again, we write To bound term I, note that the union sum inequality and Lemma 1 imply that, with probability tending to 1, max and hence term I is bounded by The second bound comes from the Cauchy-Schwartz inequality and Lemma 1: also with probability tending to 1. Now, take By (10), The first term comes from a bound on the integral which we will argue separately below.The second term is always positive, and hence we may omit it for the lower bound.Now, note that Thus we have for M sufficiently large.
It only remains to check the bound on the integral term in (10).Recall that ϕ min (M ) = min x =1 x T M x.After factoring out the norm of ∆, we have, for ∆ ∈ Θ n (M ), The first inequality uses the fact that the eigenvalues of Kronecker products of symmetric matrices are the products of the eigenvalues of their factors.Now with probability tending to 1, since ∆ ≤ ∆ F = o(1).This establishes the theorem.
As noted above, an inspection of the proof shows that p log p/n in the rate comes from estimating the diagonal.If we focus on the correlation matrix estimate Kλ in (4) instead, we can immediately obtain Corollary 1.Under assumptions of Theorem 1, Now we can use Corollary 1 to prove Theorem 2, the operator norm bound.
Proof of Theorem 2. Write where we are using the sub-multiplicative norm property AB ≤ A B (see, e.g., Golub and Van Loan (1989)).Now, W −1 and K are O(1) by assumptions A2 and A3.Lemma 1 implies that and since Ŵ Note that in the Frobenius norm, we only have Ŵ 2 −W 2 = O P ( p log p/n), and thus the Frobenius rate of Ωλ is the same as that of Ωλ .

The Cholesky-based SPICE algorithm
In this section, we develop an iterative algorithm for computing the SPICE estimator using the Cholesky decomposition; however, unlike other estimators that depend on the Cholesky decomposition, we minimize a permutation invariant objective function, and thus the estimator remains permutation invariant.We use the quadratic approximation to the absolute value, a standard tool in optimization which has been previously used in the statistics literature to handle lasso-type penalties, for example, by Fan and Li (2001) and Huang et al. (2006).In this our algorithm differs from the glasso algorithm of Friedman et al. (2008), which is based on a lasso algorithm and works directly on the absolute values.Both algorithms have computation complexity of O(p 3 ), but we acquire another small constant factor (on the order of 10) due to the additional iterations required for the quadratic approximation to converge (see more on this in Section 4).However, using the quadratic approximation allows us to write down the algorithm explicitly in general terms for an l q penalty |w ij | q with q ≥ 1, rather than only for q = 1.In particular, our algorithm is equally applicable for use with a ridge penalty (q = 2), although in that special case it simplifies even further, or with a bridge penalty (1 < q < 2) proposed by Fu (1998), which may work better for certain classes of covariances.It can also be used with SCAD (Fan and Li, 2001) or other more complicated non-convex penalties that are typically approximated by the local quadratic approximation.Even though we derive the algorithm with a general q, in this paper we only present results for q = 1.
Our goal is to minimize the objective function, where q = 1 corresponds to the computation of Ωλ in (1).For q ≥ 1, the objective function is convex in the elements of Ω and has a global minimum Ω.
Our strategy is to re-parametrize the objective (20) using the Cholesky decomposition of Ω to enforce automatic positive definiteness.Rather than using the modified Cholesky decomposition with its regression interpretation, as has been standard in the literature, we simply write where T = [t ij ] is a lower triangular matrix.We can still use the regression interpretation if needed, by writing where φ jj ′ is the coefficient of X j ′ in the regression of X j on X 1 , . . ., X j−1 , and d jj is the corresponding residual variance.
To minimize f in terms of T , we apply a cyclical coordinate descent approach and minimize f with respect to one element of T at a time.Further, we use a quadratic approximation to f , which allows us to find the minimum of the univariate functions of each parameter in closed form.The algorithm is iterated until convergence.Here we outline the main steps of the algorithm, and leave the full derivation for the Appendix.
In a slight abuse of notation, we write X for the n × p data matrix where each column has already been centered by its sample mean.The three terms in (20) can be expressed as a function of T as follows: The quadratic approximation for |u| q is shown in (25).Since the algorithm is iterative, u (k) denotes the value of u from the previous iteration, and u (k+1) is the value at current iteration.
Hunter and Li (2005) suggest replacing |u (k) | in the denominator with |u (k) |+ ǫ to avoid division by zero, and refer to this as the ǫ-perturbed quadratic approximation.This quadratic approximation to f , which we denote fǫ,k at iteration k, allows us to easily take partial derivatives with respect to each parameter in T , and provides a closed form solution for the univariate minimizer for each coordinate.
Step 2. Repeat Step 1 until convergence of T and set T (k+1) = T .
Steps 2 and 3 may seem redundant, but they are needed for two different reasons.
Step 2 is needed because we only minimize with respect to one parameter at a time, holding all other parameters fixed; and Step 3 is needed because of the quadratic approximation for |u| q .After convergence, we replace entries in Ω with smaller magnitude than ǫ with zero, using a fixed value of ǫ = 10 −8 .Another approach with virtually the same performance is to replace entries of Ω(k) with ǫ if their magnitude falls below ǫ in Step 3, and use (25) directly in the objective function in Step 1 instead of using fǫ,k .
In practice, we found that working with the correlation matrix as described in Theorem 2 is slightly better than working with the covariance matrix, although the differences are fairly small.Still, in all the numerical results we standardize the variables first and then rescale our estimate by the sample standard deviations of the variables.

Algorithm convergence
The convergence of the algorithm essentially follows from two standard results.For the inner loop cycling through individual parameters, the value of the objective function decreases at each iteration, and the objective function is differentiable everywhere.Thus the inner loop of the algorithm converges by a standard theorem on cyclical coordinate descent for smooth functions (see, e.g., Bazaraa et al. (2006), p. 367), to a stationary point ∇g(T ) = 0, where g(T ) = fǫ,k (T T T ).The function fǫ,k is convex in the original parameters ω ij , but since we reparametrized it in terms of T , the function g is not necessarily convex in T .In the next proposition we verify that this stationary point of g corresponds to the global minimum of the convex function fǫ,k .
Proposition 1.Let f ≡ fǫ,k be the original convex function f approximated by the ǫ-perturbed local quadratic approximation at iteration k, let T be a p × p lower triangular matrix, and let g(T ) = f (T T T ).Let S 0 be the unique solution to ∇ f (S) = 0, and let T 0 be a solution to ∇g(T ) = 0. Then S 0 = T T 0 T 0 .Proof of Proposition 1.Let h : T → T T T .Note that h maps all of R p(p+1)/2 (all lower triangular matrices) into a convex subset of R p(p+1)/2 (non-negative definite symmetric matrices).Denote the differential of h in the direction d ∈ R p(p+1)/2 evaluated at t 0 ∈ R p(p+1)/2 by ∇h(t 0 ) [d].Then, where T 0 and D are, respectively, t 0 and d written as p × p matrices.Now, using the chain rule and ( 26), we have where we now think of f as a function from R p(p+1)/2 to R. Since f is convex and has a unique minimizer If any diagonal elements of T 0 are 0, then T 0 is singular, and so is T T 0 T 0 , and thus g(T 0 ) = ∞, so a singular T 0 cannot be a stationary point of g.Since T 0 is lower triangular and all its diagonal elements must be non-zero, one can show by induction that T T 0 D = −(T T 0 D) T implies D = 0.For the outer loop iterating through the quadratic approximation, we can apply the argument of Hunter and Li (2005) for ǫ-perturbed local quadratic approximation obtained from general results for minorize-maximize algorithms, and conclude that as k → ∞ and ǫ → 0 the algorithm converges to the global minimum of the original convex function f in (20).In practice, we have also observed that our algorithm and glasso converge to the same solution.

Computational complexity
The computational complexity of the algorithm in terms of p is O(p 3 ), since each parameter update is at most O(p) (see (32) in the Appendix), and there are O(p 2 ) parameters.The only other algorithm for computing this estimator at the cost of O(p 3 ) is glasso of Friedman et al. (2008); the algorithms of Yuan and Lin (2007) and d 'Aspremont et al. (2008) have higher computational cost.For extensive timing comparisons of glasso and the algorithm of d 'Aspremont et al. (2008), which showed convincingly that glasso is much faster, see Friedman et al. (2008).The exact timing also depends on the implementation, platform, etc (our algorithm is implemented in C and glasso in Fortran).Actual computing times we obtained for glasso and the SPICE algorithm are shown below in Figure 1, for model Ω 2 described in Section 4.1, with values of tuning parameters chosen as described in Section 3.3.

Choice of tuning parameter
Like any other penalty-based approach, SPICE requires selecting the tuning parameter λ.In simulations, we generate a separate validation dataset, and select λ by maximizing the normal likelihood on the validation data with Ωλ estimated from the training data.Alternatively, one can use 5-fold cross-validation, which we do for the real data analysis.There is some theoretical basis for selecting the tuning parameter in this way -see Bickel and Levina (2007).

Numerical Results
In this section, we compare the performance of SPICE to the shrinkage estimator of Ledoit and Wolf (2003) and to the sample covariance matrix when applicable (p < n), using simulated and real data.We do not include any estimators that depend on variable ordering (such as banding of Bickel and Levina (2008) or the Lasso penalty on the Cholesky factor of Huang et al. (2006)), nor estimators that focus on introducing sparsity in the covariance matrix itself rather than in its inverse (such as thresholding), as they would automatically be at a disadvantage on sparse concentration matrices.The Ledoit-Wolf estimator does not introduce sparsity in the inverse either, but we use it as a benchmark for cases when p > n, since the sample covariance is not invertible.

Simulations
In simulations, we focus on comparing performance on sparse concentration matrices, with varying levels of sparsity.We consider the following four covariance models.
1. Ω 1 : AR(1), 3. Ω 3 = B+δI, where each off-diagonal entry in B is generated independently and equals 0.5 with probability α = 0.1 or 0 with probability 1 − α = 0.9.B has zeros on the diagonal, and δ is chosen so that the condition number of Ω 3 is p (keeping the diagonal constant across p would result in either loss of positive definiteness or convergence to identity for larger p). 4. Ω 4 : Same as Ω 3 except α = 0.5.
All models are sparse (see Figure 2), and are numbered in order of decreasing sparsity (or increasing s).Note that the number of non-zero entries in Ω 1 and Ω 2 is proportional to p, whereas Ω 3 and Ω 4 have the expected number of non-zero entries proportional to p 2 .
For all models, we generated n = 100 multivariate normal training observations and a separate set of 100 validation observations.We considered five different values of p, 30, 100, 200, 500 and 1000.The estimators were computed on the training data, with the tuning parameter for SPICE selected by minimizing the normal likelihood on the validation data.Using these values of the tuning parameters, we computed the estimated concentration matrix on the training data and compared it to the population concentration matrix.
We evaluate the concentration matrix estimation performance using the Kullback-Leibler loss, Note that this loss is based on Ω and does not require inversion to compute Σ, which is appropriate for a method estimating Ω.The Kullback-Leibler loss was used by Yuan and Lin (2007) and Levina et al. (2008) to assess performance of methods estimating Ω, and is obtained from the standard entropy loss of the covariance matrix (Lin and Perlman, 1985;Wu and Pourahmadi, 2003;Huang et al., 2006) by reversing the roles of Σ and Ω.
Results for the four covariance models are summarized in Table 1, which reports the average loss and the standard error over 50 replications.For Ω 1 , Ω 2 , and Ω 3 , SPICE outperforms the Ledoit-Wolf estimator for all values of p.The sample covariance performs much worse than either estimator in all cases (for  p = 30).For Ω 4 , the least sparse of the four models, the Ledoit-Wolf estimator is about the same as SPICE (sometimes a little better, sometimes a little worse).This suggests, as we would expect from our bound on the rate of convergence, that SPICE provides the biggest gains in sparse models.
To assess the performance of SPICE on recovering the sparsity structure in the inverse, we report percentages of true non-zeros estimated as non-zero (TP %) and percentages of true zeros estimated as zero (TN %) in Table 2.We also plot heatmaps of the percentage of time each element was estimated as zero out of the 50 replications in Figure 2, for p = 30 for all four models.In general, recovering the sparsity structure is easier for smaller p and for sparser models.
Finally, some example computing times: the SPICE algorithm for Ω 2 takes about 2 seconds for p = 200, 1 minute for p = 500, and 15 minutes for p = 1000 on a regular PC.Glasso and SPICE both have complexity O(p 3 ), but because of the quadratic approximation, SPICE tends to require more iterations to converge, and on average, we have observed a difference in computing times on the order of about 10 between glasso and SPICE.However, this factor does not grow with p, and SPICE computing times are still very reasonable even for large p.

Colon tumor classification example
In this section, we compare performance of covariance estimators for LDA classification of tumors using gene expression data from Alon et al. (1999).In this experiment, colon adenocarcinoma tissue samples were collected, 40 of which were tumor tissues and 22 non-tumor tissues.Tissue samples were analyzed using an Affymetrix oligonucleotide array.The data were processed, filtered, and reduced to a subset of 2,000 gene expression values with the largest minimal intensity over the 62 tissue samples.Additional information about the dataset and pre-processing can be found in Alon et al. (1999).
To assess the performance at different dimensions, we reduce the full dataset of 2,000 gene expression values by selecting p most significant genes as mea- sured by the two-sample t-statistic, for p = 50, 100, 200.Then we use linear discriminant analysis (LDA) to classify these tissues as either tumorous or nontumorous.We classify each test observation x to either class k = 0 or k = 1 using the LDA rule where πk is the proportion of class k observations in the training data, μk is the sample mean for class k on the training data, and Ω is an estimator of the inverse of the common covariance matrix on the training data computed by one of the methods under consideration.Detailed information on LDA can be found in Mardia et al. (1979).
To create training and test sets, we randomly split the data into a training set of size 42 and a testing set of size 20; following the approach used by Wang et al. (2007), we require the training set to have 27 tumor samples and 15 non-tumor samples.We repeat the split at random 100 times and measure the average classification error.The average errors with standard errors over the 100 splits are presented in Table 3.We omit the sample covariance because it is not invertible with such a small sample size, and include the naive Bayes classifier instead (where Σ is estimated by a diagonal matrix with sample variances on the diagonal).Naive Bayes has been shown to perform better than the sample covariance in high-dimensional settings (Bickel and Levina, 2004).
For an application such as classification, there are several possibilities for selecting the tuning parameter.Since we have no separate validation data available, we perform 5-fold cross-validation on the training data.One possibility (columns A in Table 3) is to continue using normal likelihood as a criterion for cross-validation, like we did in simulations.Another possibility (columns B in Table 3) is to use classification error as the cross-validation criterion, since that is the ultimate performance measure in this case.Table 3 shows that for SPICE both methods of tuning perform similarly.For reference, we also include the best error rate achievable on the test data, which is obtained by selecting the tuning parameter to minimize the classification error on the test data (columns C in Table 3).SPICE provides the best improvement over naive Bayes and Ledoit-Wolf for p = 50; for larger p, as less informative genes are added into the pool, the performance of all methods worsens.

Discussion
We have analyzed a penalized likelihood approach to estimating a sparse concentration matrix via a lasso-type penalty, and showed that its rate of convergence depends explicitly on how sparse the true matrix is.This is analogous to results for banding (Bickel and Levina, 2008), where the rate of convergence depends on how quickly the off-diagonal elements of the true covariance decay, and for thresholding (Bickel and Levina, 2007;El Karoui, 2007), where the rate also depends on how sparse the true covariance is by various definitions of sparsity.We conjecture that other structures can be similarly dealt with, and other types of penalties may show similar behavior when applied to the "right" type of structure -for example, a ridge, bridge, or other more complex penalty may work well for a model that is not truly sparse but has many small entries.A generalization of this work to other penalties has been recently completed by Lam and Fan (2007), who have also proved "sparsistency" of SPICE-type estimators.
While we assumed normality, it can be replaced by a tail condition, analogously to Bickel and Levina (2008).The use of normal likelihood is, of course, less justifiable if we do not assume normality, but it was found empirically that it still works reasonably well as a loss function even if the true distribution is not normal (Levina et al., 2008).
The Cholesky decomposition of covariance was only considered appropriate when variables are ordered, and we have shown it to be a useful tool for enforcing positive definiteness of the estimator even when variables have no natural ordering.Our optimization algorithm has complexity of O(p 3 ) and is equally applicable to general l q penalties.For simplicity, we separate the likelihood and the penalty by writing f (T ) = ℓ(T ) + P (T ).We also suppress the ǫ-perturbation in the denominator for simplicity of notation.For the likelihood part, taking the partial derivative with respect to t lc , 1 ≤ c ≤ p, c ≤ l ≤ p gives For the penalty part, using the quadratic approximation (25) gives since the only nonzero terms in (31) are those for which j ′ ≤ l and either j ′ = c or j = c.For 1 ≤ k ≤ l such that k = c, we have ∂ ∂t ℓc ω 2 ck = 2ω ck t lk , and collecting terms together we get (ω ck − t lc t lk )t lk |ω 0 ck | q−2 , then take the positive solution tcc = u + .We also can quickly update the ω ck involving t lc via ω ck = ω 0 ck + t lk ( tlc − t lc ).
where by A P ≍ B we mean A = O P (B) and B = O P (A)), we have the rate of log p/n for Ŵ −1 −W −1 .This together with Corollary 1 in turn implies that Ŵ −1 and Kλ are O P (1), and the theorem follows.

Fig 1 .
Fig 1. Computing time in seconds vs p (log-log scale) for SPICE and glasso.

Table 2
Percentage of correctly estimated non-zeros (TP %) and correctly estimated zeros (TN %) in the concentration matrix (average and SE over 50 replications) for SPICE

Table 3
Averages and SEs of classification errors in % over 100 splits.