High-dimensional sparse covariance estimation via directed acyclic graphs

We present a graph-based technique for estimating sparse covariance matrices and their inverses from high-dimensional data. The method is based on learning a directed acyclic graph (DAG) and estimating the parameters of a multivariate Gaussian distribution whose structure is encoded by the DAG. For inferring the underlying DAG we use the PC-algorithm, and for estimating the DAG-based covariance matrix and its inverse we use a Cholesky decomposition approach which provides a positive (semi-)definite sparse estimate. We present a consistency result in the high-dimensional framework, and we compare our method with the Glasso on simulated and real data.


Introduction
Estimation of covariance matrices is an important part of multivariate analysis. There are many problems with high-dimensional data where an estimate of the covariance matrix is of interest, for example in principal component analysis or classification by discriminant analysis. Application areas where such problems arise include gene microarrays, imaging and image classification, and text retrieval. In many of these applications, the primary goal is the estimation of the inverse covariance matrix Σ^{-1}, also known as the precision or concentration matrix, rather than the covariance matrix Σ itself. In low-dimensional settings with p < n, where p denotes the row- or column-dimension of Σ and n the sample size, we can obtain an estimate of Σ^{-1} by estimating and inverting the Gaussian maximum likelihood estimator Σ̂_MLE. But when p is large, inversion of this estimate is problematic and its accuracy is very poor.
Recently, two classes of high-dimensional covariance estimators have emerged: those that rely on a natural ordering among the variables, typically assuming that variables far apart in the ordering are only weakly correlated, and those which are invariant under variable permutation. Regularized estimation by banding or tapering [3,13,5], or using sparse Cholesky factors of the inverse covariance matrix relying on the natural ordering of the variables [31,14,19], belongs to the first class of covariance estimators. When there is no natural ordering among the variables, estimators should be permutation invariant with respect to indexing the variables. A popular approach to obtain a sparse permutation-invariant estimate is to add a Lasso penalty on the entries of the concentration matrix to the negative Gaussian log-likelihood [12,8,2,26]. This amounts to shrinking some of the elements of the inverse covariance matrix exactly to zero. Alternatively, the Lasso can be used for inferring an undirected conditional independence graph via node-wise regressions [24], and a covariance estimate can then be obtained using the structure of the graph. Other approaches include simple hard-thresholding of the elements of the unpenalized maximum likelihood estimator [4], with the disadvantage that the resulting estimate is not necessarily positive (semi-)definite.
The method which we present here is also invariant under permutation of the variables. The type of regularization we pursue exploits a sparse graphical model structure first and then estimates the covariance matrix and its inverse using non-regularized estimation. Because of the sparsity of the graphical model structure, the second step does not need any regularization anymore. More precisely, we use a sparsely structured Cholesky decomposition of the concentration matrix for estimation of the covariance and concentration matrix. To obtain the structure of such a Cholesky factor, we estimate a DAG (in fact, an equivalence class of DAGs). Thus, this approach enforces a completely different sparsity structure on the Cholesky factor than proposals for ordered data as in e.g. [3,13,5,9].
For a given DAG, our approach equals the iterative conditional fitting (ICF) method presented in [10,6], which reduces here to the standard technique of fitting Gaussian DAG models. Our contribution is to use an estimated DAG, i.e. an estimated equivalence class of DAGs from the PC-algorithm [27], and to analyze the method in the high-dimensional case, taking the uncertainty of structure estimation of the equivalence class of DAGs into account. We argue in this paper that, within the class of methods which are invariant under variable permutation, a graph-structured approach can be worthwhile for a range of scenarios, sometimes resulting in performance gains of up to 30-50% over shrinkage methods.
In Section 2 we give a brief overview of graph terminology and graphical models. Section 3 introduces our methodology, and we show asymptotic consistency of the method in the high-dimensional framework in Section 4. Simulations and real data examples are presented in Section 5, and we propose a robustified version of our procedure in Section 6.

Graph terminology and graphical models

Graphs
Let G = (V, E) be a graph with a set of vertices V and a set of edges E ⊆ V × V. In our context, we use V = {1, ..., p}, corresponding to random variables X_1, ..., X_p.
A graph can be directed, undirected or partially directed. An edge between two vertices i and j is called directed if it has an arrowhead: i ← j or i → j. An edge without an arrowhead is an undirected edge: i − j. A graph in which all edges are directed is called a directed graph; conversely, a graph in which no edge is directed is called an undirected graph. A graph which may contain both directed and undirected edges is called a partially directed graph. The undirected graph obtained from a (partially) directed graph G by removing all arrowheads is called the skeleton of G. Two vertices i and j are adjacent if there is any kind of edge between them. The adjacency set of a vertex i, denoted by adj(i, G), is the set of all vertices that are adjacent to i in G. A path is a sequence of vertices v_1, ..., v_k such that v_i and v_{i+1} are adjacent for each i = 1, ..., k − 1. A directed path is a path along directed edges that follows the direction of the arrows. When the first and last vertices of a directed path coincide, it is called a directed cycle. An acyclic graph is a graph that contains no directed cycles; a directed graph with no directed cycles is called a directed acyclic graph (DAG). If i → j, then i is called a parent of j and j is called a child of i. The set of parents of i in G is denoted by pa(i) and the set of children by ch(i). If there is a directed path from i to j, then i is called an ancestor of j and j a descendant of i. The set of ancestors of i is denoted by an(i), the set of descendants by de(i) and the set of non-descendants by nde(i). A v-structure in a graph G is an ordered triple of vertices (i, j, r) such that i → j and j ← r, and i and r are not adjacent in G. The vertex j is then called a collider. A path Q from i to j in a directed acyclic graph G is said to be blocked by a set S if it contains a vertex v such that either (i) v ∈ S and v is not a collider on Q; or (ii) v ∉ S, none of the descendants of v are in S, and v is a collider on Q. A path that is not blocked by S is said to be active. Two subsets A and B are said to be d-separated by S if all paths from A to B are blocked by S; in other words, there is no active path from A to B.
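The d-separation criterion above can be checked algorithmically; a common route is the moral-graph characterization: A and B are d-separated by S if and only if S separates them in the moralized graph of the ancestral set of A ∪ B ∪ S. A minimal Python sketch, with illustrative function names and a DAG represented as a map from each vertex to its set of parents:

```python
from collections import deque

def ancestors(dag, nodes):
    """All vertices with a directed path into `nodes`, plus `nodes` itself.
    `dag` maps each vertex to the set of its parents."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in dag[v]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(dag, A, B, S):
    """True iff A and B are d-separated by S in the DAG."""
    anc = ancestors(dag, set(A) | set(B) | set(S))
    # Moralize the subgraph on the ancestral set: undirected parent-child
    # edges, plus "marriages" between parents of a common child.
    und = {v: set() for v in anc}
    for v in anc:
        ps = [p for p in dag[v] if p in anc]
        for p in ps:
            und[v].add(p); und[p].add(v)
        for a in range(len(ps)):
            for b in range(a + 1, len(ps)):
                und[ps[a]].add(ps[b]); und[ps[b]].add(ps[a])
    # BFS from A avoiding S; reaching B means there is an active path.
    seen, queue = set(A), deque(A)
    while queue:
        v = queue.popleft()
        if v in B:
            return False
        for w in und[v]:
            if w not in seen and w not in S:
                seen.add(w)
                queue.append(w)
    return True
```

For the v-structure 1 → 3 ← 2, `d_separated` returns True for S = ∅ and False for S = {3}, matching the collider behavior described above.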

Graphical models and Markov properties
Graphical models form a probabilistic tool to analyze and visualize conditional dependence between random variables, encoded by the edges of a graph. Fundamental to the idea of a graphical model is, based on graph-theoretical concepts and algorithms, the notion of modularity, where a complex system is built by combining simpler parts. One can distinguish between three main classes of graphical models. Here we focus on DAG models, where all edges of the graph are directed. According to [18], a DAG model may exhibit several directed Markov properties. In the following, we present only two of them.
We use the following notation. Let P denote the distribution of (X_1, ..., X_p). For x ∈ R^p and A ⊆ V = {1, ..., p}, we write x_A = {x_j ; j ∈ A}, and analogously for the random vector X_A. Furthermore, for disjoint subsets A, B and S, we write X_A ⊥⊥ X_B | X_S for conditional independence of X_A and X_B given X_S.

Definition 2.2. [Recursive factorization property]
We say that P admits a recursive factorization according to a DAG G whenever there exist non-negative functions f_i(·|·) (i = 1, ..., p) such that P has a density f with respect to a product measure ν which factorizes as

f(x) = ∏_{i=1}^p f_i(x_i | x_pa(i)).

If the density f of P is strictly positive, as for example in the case of a multivariate Gaussian distribution, both Markov properties in Definitions 2.1 and 2.2 are equivalent. For more details see [18, pp. 46-52].
A DAG model encodes conditional independence relationships via the notion of d-separation. Several DAGs can encode the same set of conditional independence relationships. These DAGs form an equivalence class, consisting of DAGs with the same skeleton and the same v-structures. A completed partially directed acyclic graph (CPDAG) uniquely describes such an equivalence class. In fact, directed edges in the CPDAG are common to all DAGs in the equivalence class. Undirected edges in the CPDAG correspond to edges that are directed one way in some DAGs and the other way in other DAGs of the equivalence class. The absence of an edge in the CPDAG means that no DAG in the equivalence class has a corresponding edge. If all the conditional independence relationships of a distribution P, and no additional conditional independence relations, can be inferred from the graph, we say that the distribution P is faithful to the DAG G. More precisely, if P is faithful to the DAG G, then for any triple of disjoint sets A, B and S in V,

X_A ⊥⊥ X_B | X_S  ⇔  A and B are d-separated by S in G.

Note that the directed global Markov property in Definition 2.1 yields the implication from the right- to the left-hand side; the other direction is due to the faithfulness assumption.

Covariance estimation based on DAGs
Our methodology is based on two steps. We first infer the CPDAG, i.e. the equivalence class of DAGs, and we then estimate the covariance (concentration) matrix based on the CPDAG structure.
The Gaussian assumption implies that E[X_i | X_pa(i)] is linear in X_pa(i), which will be useful in the second estimation step for the concentration or covariance matrix. Moreover, it allows us to equate conditional independence with zero partial correlation, which makes estimation of the CPDAG much easier.

Estimating the covariance matrix from a DAG
We first assume that the underlying DAG is given. Using the factorization property from Definition 2.2 in Section 2.2 we have

f(x) = ∏_{i=1}^p f(x_i | x_pa(i)).

We use here and in the sequel the short-hand notation f_{i|pa(i)} = f(x_i | x_pa(i)). For data as in (1), we can then write the likelihood function as a product over the nodes i = 1, ..., p and the samples r = 1, ..., n of the factors f(X_i^(r) | X_pa(i)^(r)). Using the Gaussian assumption, this leads to the likelihood in terms of the unknown parameter Σ (or Σ^{-1}, respectively),

L(Σ; X^(1), ..., X^(n)) = ∏_{i=1}^p L_i,  L_i = ∏_{r=1}^n (2π Σ_{i|pa(i)})^{-1/2} exp( −(X_i^(r) − µ_{i|pa(i)}^(r))² / (2 Σ_{i|pa(i)}) ),
where µ_{i|pa(i)} and Σ_{i|pa(i)} are the conditional expectation and variance of X_i given the parents X_pa(i). Note that the conditional variance is a fixed quantity whereas the conditional mean depends on the variables X_pa(i). For a single random variable X_i we have, with the assumption µ_i = 0 for all i from above,

µ_{i|pa(i)} = Σ_{i,pa(i)} (Σ_{pa(i),pa(i)})^{-1} X_pa(i),  (2)
Σ_{i|pa(i)} = Σ_{i,i} − Σ_{i,pa(i)} (Σ_{pa(i),pa(i)})^{-1} Σ_{pa(i),i}.  (3)

The expressions Σ_{i,pa(i)} and Σ_{pa(i),pa(i)} are sub-matrices formed by selecting the corresponding rows and columns from the full covariance matrix Σ. For example, Σ_{i,pa(i)} is the sub-matrix (or vector) of Σ with row i and columns j ∈ pa(i). The values µ_{i|pa(i)} and Σ_{i|pa(i)} in the ith factor L_i of the likelihood are connected to regression, as described next.
Consider for each node i a regression of X_i on X_pa(i),

X_i = Σ_{j ∈ pa(i)} A_ij X_j + ε_i,

with error ε_i independent of X_pa(i) and with variance Σ_{i|pa(i)}. We can represent these p regressions in matrix notation as

X = AX + ε,  (4)

where A is a p × p matrix corresponding to the regressions and ε is the vector of the error terms. That is,

A_{i,pa(i)} = Σ_{i,pa(i)} (Σ_{pa(i),pa(i)})^{-1},  Cov(ε) = diag(Σ_{1|pa(1)}, ..., Σ_{p|pa(p)}).  (5)

Now we can easily compute Σ or Σ^{-1}, because we can write (4) as X = (I − A)^{-1} ε. We then also easily obtain

Σ = (I − A)^{-1} Cov(ε) ((I − A)^{-1})^T,  Σ^{-1} = (I − A)^T Cov(ε)^{-1} (I − A).

See [7, Chap. 3] and [21] for more details.
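The identities Σ = (I − A)^{-1} Cov(ε) (I − A)^{-T} and Σ^{-1} = (I − A)^T Cov(ε)^{-1} (I − A) can be verified numerically. A small Python/NumPy illustration (the chain DAG 1 → 2 → 3 and its coefficients are arbitrary choices for this example):

```python
import numpy as np

# Row i of A holds the regression of X_i on its parents, as in X = AX + eps.
A = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],    # X_2 = 0.8 X_1 + eps_2
              [0.0, 0.5, 0.0]])   # X_3 = 0.5 X_2 + eps_3
omega = np.diag([1.0, 0.5, 0.25])  # Cov(eps): diagonal error variances

I = np.eye(3)
# Sigma = (I - A)^{-1} Cov(eps) (I - A)^{-T}
Sigma = np.linalg.inv(I - A) @ omega @ np.linalg.inv(I - A).T
# Sigma^{-1} = (I - A)^T Cov(eps)^{-1} (I - A): sparse and positive definite
Prec = (I - A).T @ np.linalg.inv(omega) @ (I - A)

assert np.allclose(Prec, np.linalg.inv(Sigma))
```

The (1,3) entry of Σ^{-1} is exactly zero here, reflecting that nodes 1 and 3 are non-adjacent in the chain.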
Since Σ and Σ^{-1} are based on the structure of the DAG G (via the matrix A), we write Σ_G and Σ^{-1}_G. An estimator is now constructed as follows. Consider the maximum likelihood estimator

Σ̂_MLE = n^{-1} Σ_{r=1}^n X^(r) (X^(r))^T  (6)

as an "initial" estimator Σ̂_init of Σ, and use the plug-in estimators

Σ̂^{-1}_G = (I − Â)^T Ĉov(ε)^{-1} (I − Â),  Σ̂_G = (Σ̂^{-1}_G)^{-1},  (7)

where Â and Ĉov(ε) are as in (5), but with the plug-in estimates Σ̂_MLE;i,pa(i) (Σ̂_MLE;pa(i),pa(i))^{-1} (for Â) and Σ̂_MLE;i|pa(i) (for Ĉov(ε)) using formula (3). Note that the estimators in (7) are automatically positive semi-definite, having eigenvalues ≥ 0 (and positive definite assuming Σ̂_{i|pa(i)} > 0 for all i, which would fail only in very pathological cases). Furthermore, we could use another "initial" estimator than Σ̂_MLE for estimating Σ_{i,pa(i)}, Σ_{pa(i),i} and Σ_{pa(i),pa(i)}. We exploit this possibility for a robustified version, as discussed in Section 6. Finally, the estimator in (7) is implemented in the R-package ggm [6].
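A sketch of the plug-in construction in (7), in Python/NumPy with illustrative names; `parents` encodes the (estimated) DAG, and the initial covariance estimate S plays the role of Σ̂_MLE (or a robust alternative, cf. Section 6):

```python
import numpy as np

def dag_precision_estimate(S, parents):
    """Plug-in estimate of Sigma^{-1} from an initial covariance estimate S
    and the parent sets of a DAG (`parents` maps node index -> parent list)."""
    p = S.shape[0]
    A = np.zeros((p, p))
    omega = np.zeros(p)                  # conditional variances Sigma_{i|pa(i)}
    for i in range(p):
        pa = list(parents.get(i, []))
        if pa:
            # Regression coefficients S_{i,pa} S_{pa,pa}^{-1}, as in formula (3)
            coef = np.linalg.solve(S[np.ix_(pa, pa)], S[pa, i])
            A[i, pa] = coef
            omega[i] = S[i, i] - S[i, pa] @ coef
        else:
            omega[i] = S[i, i]
    M = np.eye(p) - A
    return M.T @ np.diag(1.0 / omega) @ M   # positive definite if all omega > 0
```

As a sanity check, feeding the construction the *true* Σ of a Gaussian DAG model (together with the true parent sets) recovers Σ^{-1} exactly.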

Inferring a directed acyclic graph
The conditional dependencies between X_1, ..., X_p, and hence the DAG, are usually not known. We use the PC-algorithm [27] with estimated conditional dependencies to infer the corresponding CPDAG G, i.e. the equivalence class of DAGs (inferring the true DAG itself is well known to be impossible due to identifiability problems).
Estimation of the skeleton and partial orientation of the edges are the two major parts of inferring a CPDAG. In the following we describe these two steps.

Estimating the CPDAG
In a first step, we start from a complete undirected graph. When two variables X_i and X_j are found to be conditionally independent given X_K for some set K, the edge i − j is deleted; details are given in Algorithm 1. In a second step, the edges are oriented using the conditioning sets K which made edges drop out in the first step; details are given in Algorithm 2. In the first step of the PC-algorithm, we need to estimate the conditional independence relations between X_1, ..., X_p. Under the Gaussian assumption, conditional independencies can be inferred from partial correlations: the conditional independence of X_i and X_j given X_K = {X_r ; r ∈ K}, where K ⊆ {1, ..., p}\{i, j}, is equivalent to the partial correlation of X_i and X_j given {X_r ; r ∈ K}, denoted by ρ_{i,j|K}, being equal to zero. This is an elementary property of the multivariate normal distribution, see [18, Prop. 5.2]. Hence, to obtain estimates of conditional independencies, we can use estimated partial correlations ρ̂_{i,j|K}. For testing whether an estimated partial correlation is zero, we apply Fisher's z-transform

Z(i, j | K) = (1/2) log( (1 + ρ̂_{i,j|K}) / (1 − ρ̂_{i,j|K}) ),

and reject the null hypothesis ρ_{i,j|K} = 0 when √(n − |K| − 3) |Z(i, j | K)| > Φ^{-1}(1 − α/2), where Φ is the cumulative distribution function of the standard Normal distribution and the significance level 0 < α < 1 is a tuning (threshold) parameter of the PC-algorithm described in Algorithms 1 and 2.
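The test can be sketched as follows in Python (illustrative names; the partial correlation is computed from the inverse of the covariance submatrix on {i, j} ∪ K, and the normal cdf Φ via the error function):

```python
import numpy as np
from math import log, sqrt, erf

def partial_corr(R, i, j, K):
    """Partial correlation rho_{i,j|K} from a covariance/correlation matrix R,
    via the inverse of the submatrix on {i, j} union K."""
    idx = [i, j] + list(K)
    P = np.linalg.inv(R[np.ix_(idx, idx)])
    return -P[0, 1] / sqrt(P[0, 0] * P[1, 1])

def indep_test(R, n, i, j, K, alpha=0.05):
    """Fisher z-test of rho_{i,j|K} = 0; True means independence is plausible."""
    r = partial_corr(R, i, j, K)
    z = 0.5 * log((1 + r) / (1 - r))
    stat = sqrt(n - len(K) - 3) * abs(z)
    phi = 0.5 * (1 + erf(stat / sqrt(2)))   # standard normal cdf at stat
    return 2 * (1 - phi) > alpha            # two-sided p-value vs. alpha
```

For the chain 1 → 2 → 3, ρ_{1,3|2} = 0 while the marginal correlation of X_1 and X_3 is non-zero, so the test retains the former and rejects the latter.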
Algorithm 1: The PC-algorithm for the skeleton
Input: z-transforms of estimated partial correlations, tuning parameter α
Output: Skeleton of CPDAG G, separation sets S (used later for directing the skeleton)
Form the complete undirected graph G on the vertex set {1, ..., p};
Set l = −1;
repeat
  Set l = l + 1;
  repeat
    Select a (new) ordered pair of adjacent vertices i, j with |adj(i, G)\{j}| ≥ l;
    repeat
      Choose a (new) set K ⊆ adj(i, G)\{j} with |K| = l;
      if ρ̂_{i,j|K} = 0 is accepted then
        Delete edge i − j; denote this new graph by G;
        Save K in S(i, j) and S(j, i);
      end
    until edge i − j is deleted or all K ⊆ adj(i, G)\{j} with |K| = l have been chosen;
  until all ordered pairs of adjacent vertices i, j with |adj(i, G)\{j}| ≥ l have been tested for conditional independence;
until for each ordered pair of adjacent vertices i, j: |adj(i, G)\{j}| < l;

If ρ̂_{i,j|K} = 0 is plausible, the edge i − j is deleted and K is saved in S(i, j). We call S = {S(i, j); i, j ∈ {1, ..., p}, i ≠ j} the separation sets. These sets are important for extending the estimated skeleton to a CPDAG, as described below in Algorithm 2. [23] showed that the rules in Algorithm 2 are sufficient to orient all arrows in the CPDAG; see also [25, pp. 50]. The PC-algorithm, described in Algorithms 1 and 2, yields an estimate Ĝ_CPDAG(α) of the true underlying CPDAG which depends on the tuning parameter α.
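A compact Python sketch of Algorithm 1 (illustrative, not the pcalg implementation): edges are removed as soon as some conditioning set of the current size l renders the pair conditionally independent.

```python
import itertools
import numpy as np
from math import log, sqrt, erf

def fisher_z_indep(C, n, i, j, K, alpha):
    """Fisher z-test of zero partial correlation, from covariance matrix C."""
    idx = [i, j] + list(K)
    P = np.linalg.inv(C[np.ix_(idx, idx)])
    r = -P[0, 1] / sqrt(P[0, 0] * P[1, 1])
    r = max(min(r, 0.999999), -0.999999)
    stat = sqrt(n - len(K) - 3) * abs(0.5 * log((1 + r) / (1 - r)))
    p_value = 2 * (1 - 0.5 * (1 + erf(stat / sqrt(2))))
    return p_value > alpha

def pc_skeleton(C, n, alpha=0.01):
    """Skeleton phase of the PC-algorithm, given a covariance estimate C
    based on n observations. Returns the edge set and the separation sets."""
    p = C.shape[0]
    adj = {i: set(range(p)) - {i} for i in range(p)}
    sepset = {}
    l = 0
    while any(len(adj[i] - {j}) >= l for i in adj for j in adj[i]):
        for i in range(p):
            for j in sorted(adj[i]):
                for K in itertools.combinations(sorted(adj[i] - {j}), l):
                    if fisher_z_indep(C, n, i, j, K, alpha):
                        adj[i].discard(j); adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(K)
                        break
        l += 1
    return {frozenset((i, j)) for i in adj for j in adj[i]}, sepset
```

On the population covariance of the chain 1 → 2 → 3 (used here as a stand-in for a sample covariance) the sketch recovers the two chain edges and records the middle node as the separation set of the two endpoints.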
Algorithm 2: The PC-algorithm: extending the skeleton to a CPDAG
Input: Skeleton G of the CPDAG, separation sets S
Output: CPDAG
forall pairs of nonadjacent variables i, j with common neighbor k do
  if k ∉ S(i, j) then orient i − k − j as i → k ← j;
end
repeat
  R1 Orient j − k into j → k whenever there is an arrow i → j such that i and k are nonadjacent;
  R2 Orient i − j into i → j whenever there is a chain i → k → j;
  R3 Orient i − j into i → j whenever there are two chains i − k → j and i − l → j such that k and l are nonadjacent;
until no more undirected edges can be oriented by the rules R1 to R3;

The PC-DAG covariance estimator
Having an estimate Ĝ_CPDAG(α) of the CPDAG, we pick any DAG Ĝ_DAG(α) in the equivalence class of the CPDAG. This can be done by directing undirected edges in the CPDAG at random without creating additional v-structures or cycles. The estimates for the covariance and concentration matrix are then

Σ̂_{Ĝ_DAG(α)} and Σ̂^{-1}_{Ĝ_DAG(α)}, computed as in (7),  (8)

and since the PC-algorithm for DAGs is involved, we call this the PC-DAG covariance estimator. Its only tuning parameter is the α used in the PC-algorithm. As described in Section 3.2.1, it has the interpretation of a significance level for a single test of whether a partial correlation is zero or not. The choice of the tuning parameter α can be made using cross-validation of the negative out-of-sample log-likelihood.
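The cross-validation score can be sketched as follows (Python/NumPy; the estimator being tuned is left abstract, and the score drops constants of the Gaussian log-likelihood):

```python
import numpy as np

def neg_gauss_loglik(Sigma_hat, X_val):
    """Out-of-sample negative Gaussian log-likelihood (up to constants and
    scaling) of centered validation data X_val under the estimate Sigma_hat:
    log det(Sigma_hat) + tr(Sigma_hat^{-1} S_val)."""
    S_val = X_val.T @ X_val / X_val.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    return logdet + np.trace(np.linalg.solve(Sigma_hat, S_val))

# The tuning parameter alpha would then be chosen by minimizing this score
# over a grid, e.g. alpha_grid = [0.001, 0.01, 0.05, 0.1], refitting the
# PC-DAG estimator on the training data for each candidate value.
```

With enough validation data, the true covariance scores lower than a misspecified one, which is what makes the score usable for tuning.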
We remark that the zeros in Σ̂^{-1}_{Ĝ_DAG(α)} are the same for any choice of a DAG in the estimated CPDAG Ĝ_CPDAG(α). However, the non-zero elements of the estimated matrices will be slightly different. To avoid an unusual random realization when selecting a DAG from Ĝ_CPDAG(α), we can sample many DAGs and average the corresponding estimates of Σ^{-1} or Σ.
In some cases, small modifications of the PC-DAG covariance estimator are needed; these are described in Appendix B. Estimation of a CPDAG as described in Algorithms 1 and 2 is efficiently implemented in the R-package pcalg, as described in its reference manual [17].

Consistency
We prove asymptotic consistency of the estimation method in high-dimensional settings where the number of variables p can be much larger than the sample size n. In such a framework, the model depends on n, and this is reflected notationally by the subscript n. We assume:

(A) The data are as in (1), with the distribution P_n of (X_1, ..., X_{p_n}) being multivariate normal N(0, Σ_n), Markovian as in Definition 2.1 or 2.2, and faithful to a DAG G_n.
(C) The dimension p_n = O(n^a) for some 0 ≤ a < ∞.
(D) The maximal cardinality q_n = max_{i=1,...,p_n} |adj(i, G_n)| of the adjacency sets in G_n satisfies q_n = O(n^{1−b}) for some 0 < b ≤ 1.

(E) For any i, j ∈ {1, ..., p_n}, let ρ_{n;i,j|S} denote the partial correlation between X_i and X_j given S, where S ⊆ {1, ..., p_n} \ {i, j}. These partial correlations are bounded away from one and from zero: sup_{n;i,j,S} |ρ_{n;i,j|S}| ≤ M for some M < 1, and inf{|ρ_{n;i,j|S}| ; ρ_{n;i,j|S} ≠ 0} ≥ c_n with c_n^{-1} = O(n^d) for some 0 < d < b/2, where b is as in (D).

(F) For every DAG in the equivalence class of the true underlying CPDAG (induced by the distribution in assumption (A)), the conditional variances are bounded: 0 < r ≤ Σ_{i|pa(i)} ≤ C < ∞ for all i, uniformly in n.

Assumption (C) allows the number of variables p_n to grow as an arbitrary polynomial in the sample size and reflects the high-dimensional setting. Assumption (D) is a sparseness assumption, requiring that the maximal number of neighbors per node grows at a slower rate than O(n). Assumption (F) is a regularity condition on the conditional variances. Assumption (E), in particular its second part, corresponds to the detectability of non-zero partial correlations: obviously, we cannot consistently detect non-zero partial correlations of smaller order than 1/√n. For sparse graphs with b close to 1 in (D), a value of d close to 1/2 is allowed, i.e. close to the 1/√n detection limit.
Under assumptions (A)-(E), the PC-algorithm was shown to be consistent for inferring the true underlying CPDAG [15, Th. 2]. More precisely, we denote by Ĝ_CPDAG;n(α) the estimate of the underlying CPDAG, using the PC-algorithm with tuning parameter α (Algorithms 1 and 2), and by G_CPDAG;n the true underlying CPDAG. Then, assuming (A)-(E) and choosing α_n = 2(1 − Φ(n^{1/2} c_n / 2)),

P[ Ĝ_CPDAG;n(α_n) = G_CPDAG;n ] → 1 (n → ∞).

Concerning the consistency of DAG-based estimation of the concentration matrix, we have the following new result.

Lemma 4.1. Under assumptions (A)-(D) and (F), the following holds: for any DAG G in the equivalence class of the true underlying CPDAG and using the estimator Σ̂^{-1}_G in (7),

max_{i,j} |Σ̂^{-1}_{G,n;i,j} − Σ^{-1}_{n;i,j}| = o_P(1) (n → ∞).

A proof is given in the Appendix. We then obtain the main theoretical result.

Theorem 4.1. Under assumptions (A)-(F) and using the tuning parameter α_n = 2(1 − Φ(n^{1/2} c_n / 2)) in the PC-algorithm, the following holds for the estimator in (8):

max_{i,j} |Σ̂^{-1}_{Ĝ_DAG;n(α_n);i,j} − Σ^{-1}_{n;i,j}| = o_P(1) (n → ∞).

Proof: The estimate Ĝ_DAG;n(α_n) is a DAG element of the estimated equivalence class encoded by the estimated CPDAG Ĝ_CPDAG;n(α_n). Denote this DAG by G*. Consider the event A_n = {Ĝ_CPDAG;n(α_n) = G_CPDAG;n}. On A_n, G* belongs to the true equivalence class, and the claim follows from Lemma 4.1. Since P[A_n] → 1, the proof is complete.

Simulation and real data analysis
We examine the behavior of our PC-DAG estimator using simulated and real data and compare it to the Glasso method [12,2]. The Glasso is defined as

Σ̂^{-1}_Glasso = argmin_Θ ( tr(Σ̂_MLE Θ) − log det(Θ) + λ ‖Θ‖_1 ),  (10)

where Σ̂_MLE is the empirical covariance matrix in (6), ‖Θ‖_1 = Σ_{i<j} |Θ_ij|, and the minimization is over non-negative definite matrices Θ.
All computations are done with the R-packages pcalg [17] and glasso.

Simulation study
We consider a DAG and a non-DAG model for generating the data.

DAG models
We focus on the following class of DAG models. We generate recursively

X_1 = ε_1,  X_i = Σ_{j<i} B_ij X_j + ε_i (i = 2, ..., p),

where ε_1, ..., ε_p i.i.d. ∼ N(0, 1) and B is an adjacency matrix generated as follows. We first fill the matrix B with zeros and replace every entry in the lower triangle by an independent realization of a Bernoulli(s) random variable with success probability s, where 0 < s < 1. Afterwards, we replace each entry of B having the value 1 by an independent realization of a Uniform([0.1, 1]) random variable. If i < j and B_ji ≠ 0, the corresponding DAG has a directed edge from node i to node j. The variables X_1, ..., X_p then have a multivariate Gaussian distribution with mean zero and covariance Σ, which can be computed from B. We consider this model for different settings of n, s and p, referred to as D1 to D4. The settings D1 to D4 mainly differ in the sparsity s of the generated data, which is related to the expected neighborhood size E[|adj(i, G)|] = s(p − 1) for all i. For each of these settings we estimate the covariance and the concentration matrix with both methods, our PC-DAG and the Glasso estimator.
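The generating mechanism can be sketched as follows (Python/NumPy; the specific n, p, s below are illustrative, not the D1-D4 values):

```python
import numpy as np

def simulate_dag_data(n, p, s, rng):
    """Generate n samples from the recursive DAG model X = B X + eps with
    a strictly lower-triangular weight matrix B, as in the simulation setup."""
    B = np.tril(rng.binomial(1, s, (p, p)), k=-1).astype(float)
    B *= rng.uniform(0.1, 1.0, (p, p))        # weights for the existing edges
    M_inv = np.linalg.inv(np.eye(p) - B)
    eps = rng.standard_normal((n, p))
    X = eps @ M_inv.T                          # X^(r) = (I - B)^{-1} eps^(r)
    Sigma = M_inv @ M_inv.T                    # true covariance, Cov(eps) = I
    return X, B, Sigma

rng = np.random.default_rng(1)
X, B, Sigma = simulate_dag_data(n=200, p=10, s=0.1, rng=rng)
```

The true concentration matrix is available in closed form as (I − B)^T (I − B), which is convenient for evaluating the loss functions below.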
We use two different performance measures to compare the two estimation techniques: first, the Frobenius norm of the difference between the estimated and the true matrix, ‖Σ̂ − Σ‖_F and ‖Σ̂^{-1} − Σ^{-1}‖_F; and second, the Kullback-Leibler loss

∆_KL(Σ̂^{-1}, Σ^{-1}) = tr(Σ̂^{-1} Σ) − log det(Σ̂^{-1} Σ) − p.

We sample data X^(1), ..., X^(n) i.i.d. from the DAG model described above for each value of p in settings D1-D4. Then we derive, on a separate validation data-set X^(1)*, ..., X^(n)*, the optimal value of the tuning parameter α (PC-DAG) or λ (Glasso) with respect to the negative Gaussian log-likelihood. The two performance measures are evaluated for the estimates based on the training data X^(1), ..., X^(n), with the optimal tuning parameter chosen on the validation data. All results are based on 50 independent simulation runs.
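Both performance measures are straightforward to compute (Python/NumPy sketch; names illustrative):

```python
import numpy as np

def kl_loss(Prec_hat, Sigma):
    """Kullback-Leibler loss for an estimated concentration matrix:
    Delta_KL = tr(Prec_hat Sigma) - log det(Prec_hat Sigma) - p."""
    p = Sigma.shape[0]
    M = Prec_hat @ Sigma
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

def frob_loss(Est, Truth):
    """Frobenius-norm error ||Est - Truth||_F."""
    return np.linalg.norm(Est - Truth, ord='fro')
```

The KL loss is zero exactly when the estimated concentration matrix equals Σ^{-1}, and positive otherwise.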
Figures 1 and 2 show that in the sparse settings D1 and D2, the PC-DAG estimator clearly outperforms the Glasso. Concerning the more dense settings D3 and D4, the PC-DAG method degrades only for the covariance matrix, whereas for the inverse covariance matrix Σ^{-1} the figures still show an improvement of the PC-DAG estimator over the Glasso. We also see that for a small increase of the sample size, the Glasso improves substantially less than the PC-DAG estimator. The results in terms of the Kullback-Leibler loss are summarized in Table 1.

Non-DAG models
Next we generate data from a non-DAG model proposed by [26]. The concentration matrix equals

Σ^{-1} = B + δI,

where each off-diagonal entry in B is generated independently and equals 0.5 with probability π or 0 with probability 1 − π, all diagonal entries of B are zero, and δ is chosen such that the condition number of Σ^{-1} is p. The concentration matrices which we generate from this model vary in their level of sparsity: for Σ^{-1}_(1) we take π = 0.1 and for Σ^{-1}_(2) we choose π = 0.5, i.e. Σ^{-1}_(1) is sparser than Σ^{-1}_(2). Note that the expected numbers of non-zero entries in Σ^{-1}_(1) and Σ^{-1}_(2) are proportional to p². We tune and compare the estimation methods as described in Section 5.1.1.
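A sketch of this construction (Python/NumPy; δ is available in closed form from the extreme eigenvalues of B, since cond(B + δI) = (λ_max + δ)/(λ_min + δ)):

```python
import numpy as np

def nondag_precision(p, pi, rng):
    """Concentration matrix Sigma^{-1} = B + delta*I with symmetric B whose
    off-diagonal entries are 0.5 with probability pi, and delta chosen so
    that the condition number of Sigma^{-1} equals p."""
    U = np.triu(rng.binomial(1, pi, (p, p)) * 0.5, k=1)
    B = U + U.T                                  # symmetric, zero diagonal
    lam = np.linalg.eigvalsh(B)                  # ascending eigenvalues
    delta = (lam[-1] - p * lam[0]) / (p - 1)     # solves cond(B + delta*I) = p
    return B + delta * np.eye(p)

rng = np.random.default_rng(2)
K1 = nondag_precision(20, 0.1, rng)   # the sparser setting (pi = 0.1)
K2 = nondag_precision(20, 0.5, rng)   # the denser setting (pi = 0.5)
```

Since δ solves (λ_max + δ)/(λ_min + δ) = p, the resulting matrix is always positive definite: its smallest eigenvalue is (λ_max − λ_min)/(p − 1) > 0.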
In Figures 3 and 4 we see that in the case of the dense model with π = 0.5, the two methods do not differ much (some of the differences are so small that they are invisible on the scales shown in the plots). But for the sparse model with π = 0.1, we observe that our PC-DAG estimator is better than the Glasso, in particular for the setting nD2. The results in terms of the Kullback-Leibler loss are summarized in Table

Real data
In this section we compare the two estimation methods for real data.

Isoprenoid gene pathways in Arabidopsis thaliana
We analyze the gene expression data from the isoprenoid biosynthesis pathway in Arabidopsis thaliana given in [30]. Isoprenoids comprise one of the most diverse classes of natural products and have been identified in many different organisms. In plants, isoprenoids play important roles in a variety of processes such as photosynthesis, respiration, and regulation of growth and development. The data set consists of p = 39 isoprenoid genes for which we have n = 118 gene expression patterns under various experimental conditions. As performance measure we use the 10-fold cross-validated negative Gaussian log-likelihood for centered data.
The results are described in Figure 5. We find that neither of the two methods performs substantially better than the other, and the slight superiority of the Glasso is in the order of 1% only. The marginal difference in the negative log-likelihood between the two estimation techniques may be due to the high noise level in the data.

Breast Cancer data
Next, we explore the performance on a gene expression data set from breast tumor samples. The tumor samples were selected from the Duke Breast Cancer SPORE tissue bank on the basis of several criteria; for more details on the data set see [29]. The data matrix monitors p = 7129 genes in n = 49 breast tumor samples. We only use the 100 variables having the largest sample variance.
As before, we first center the data and then compute the negative log-likelihood via 10-fold cross-validation. Figure 6 shows the result.
As for the isoprenoid gene pathway data-set, we cannot nominate a winner here. In fact, the performances are even closer than before.
Figure 5: 10-fold CV of negative log-likelihood against the logarithm of the average number of non-zero entries of the estimated concentration matrix Σ̂^{-1}. The squares stand for the Glasso and the circles for the PC-DAG estimator.

A robust PC-DAG covariance estimator
In this section we propose a robust version of the PC-DAG estimator. According to Section 3, we need an initial covariance matrix estimate Σ̂_init in order to run the PC-DAG technique. In Section 3, we used the sample covariance Σ̂_init = Σ̂_MLE from (6).
It is well known that the standard sample covariance estimator is not robust against outliers or non-Gaussian distributions.
In order to obtain a robust version of the PC-DAG method, we start with a robust estimate of Σ. We propose to use the orthogonalized Gnanadesikan-Kettenring (OGK) estimator presented in [22]. Employing the OGK estimator in the PC-algorithm, i.e. estimating partial correlations from the OGK covariance estimate, we obtain a robustified estimate of the CPDAG, see also [16], and finally a robust PC-DAG covariance estimate as in (7) and (8) by again using the OGK covariance estimator instead of Σ̂_MLE.
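The pairwise Gnanadesikan-Kettenring idea with a subsequent orthogonalization step can be sketched as follows (a simplified one-step variant in Python/NumPy, not the exact estimator of [22]; MAD is used as the robust scale):

```python
import numpy as np

def mad_scale(x):
    """MAD-based robust scale estimate, consistent for Gaussian data."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def ogk_covariance(X):
    """Simplified one-step orthogonalized Gnanadesikan-Kettenring estimator:
    pairwise robust covariances via the identity
    cov(U, V) = ( s(U+V)^2 - s(U-V)^2 ) / 4 for a robust scale s, followed
    by an eigen-orthogonalization to restore positive definiteness."""
    n, p = X.shape
    d = np.array([mad_scale(X[:, j]) for j in range(p)])
    Y = X / d                                     # robustly standardized data
    U = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            s_plus = mad_scale(Y[:, j] + Y[:, k])
            s_minus = mad_scale(Y[:, j] - Y[:, k])
            U[j, k] = U[k, j] = 0.25 * (s_plus**2 - s_minus**2)
    # Orthogonalize: project onto eigenvectors, re-estimate scales robustly.
    _, E = np.linalg.eigh(U)
    Z = Y @ E
    L = np.diag([mad_scale(Z[:, j])**2 for j in range(p)])
    cov_Y = E @ L @ E.T                           # positive definite
    return np.diag(d) @ cov_Y @ np.diag(d)
```

On clean Gaussian data the estimate behaves like an ordinary covariance estimate; its benefit shows up under contamination, where single outliers no longer dominate the pairwise covariances.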
An "ad-hoc" robustification of the Glasso method can be achieved by using in (10) the robust OGK covariance estimate instead of the sample covariance Σ̂_MLE.
We compare the standard PC-DAG, the robust PC-DAG, the standard Glasso and the robust Glasso. The results in Figure 7 show that without or with moderate outliers, the standard and robust PC-DAG estimators perform about as well as the standard and robust Glasso: this claim is based on the observation that the minimum Kullback-Leibler loss of each of the four methods is about the same, although the corresponding sparsity of the fitted concentration matrix may be very different. In the presence of more severe outliers, the robust PC-DAG technique is best, as can be seen from Figure 7 (c). In summary, the robust PC-DAG estimator is a useful addition for gaining robustness when estimating a high-dimensional concentration matrix.

Summary and Discussion
We have introduced the PC-DAG estimator, a graphical-model-based technique for estimating sparse covariance matrices and their inverses from high-dimensional data. The method is based on very different methodological concepts than shrinkage estimators. Our PC-DAG procedure is invariant under variable permutation, yields a positive definite estimate of the covariance and concentration matrix, and we have proven asymptotic consistency for sparse high-dimensional settings. An implementation of the estimator is based on the R-package pcalg [17]. We remark that, alternatively, one could construct a high-dimensional covariance estimate based on a sparse undirected conditional independence graph, which itself can be inferred from data using e.g. the node-wise Lasso procedure from [24].
We have compared our PC-DAG estimator with the Glasso [12,2] in two simulation models. For the concentration matrix, our PC-DAG approach clearly outperforms the Glasso technique for some parameter settings, with performance gains of up to 30-50%, while it keeps up with the Glasso in the rest of the considered scenarios. For the estimation of covariances, the conclusions are similar but slightly less pronounced than for inferring concentration matrices. Furthermore, we have compared the two methods on two real data-sets and found only marginal differences in performance. If the data generating mechanism is well approximated by a DAG model, the PC-DAG estimator is undoubtedly better than the shrinkage-based Glasso. However, it is very hard to know a priori how well a DAG model describes the underlying true distribution. Finally, we have presented a robustification of our PC-DAG estimator for cases where the Gaussian data is contaminated by outliers.

A Proofs

Consider for each node i the regression X_i = Σ_{j∈pa(i)} β^(i)_j X_j + ε_i with ε_i independent of X_pa(i). The corresponding OLS estimates, based on the n i.i.d. samples X^(1), ..., X^(n) as in (1), are denoted by β̂^(i)_j.

Lemma A.1. Suppose that the Gaussian assumption in (A) and assumptions (B) and (F) hold. Then, for every ǫ > 0 and all n ≥ 2(q_n + C_4), the probability

P[ sup_{i=1,...,p_n, j∈pa(i)} |β̂^(i)_j − β^(i)_j| > ǫ ]

is exponentially small in n, where C_1, C_2 > 0 are constants depending on σ² and r (see assumptions (B) and (F)), and C_3, C_4 > 0 are absolute constants appearing in the exponential bound.
Proof. The proof is analogous to that of Lemma 7.1 in [20]. For completeness, we give a detailed derivation. The union bound yields

P[ sup_{i=1,...,p_n, j∈pa(i)} |β̂^(i)_j − β^(i)_j| > ǫ ] ≤ p_n q_n sup_{i,j} P[ |β̂^(i)_j − β^(i)_j| > ǫ ].

Next we analyze sup_{i,j} P[ |β̂^(i)_j − β^(i)_j| > ǫ ] for a general ǫ > 0.
Let i ∈ {1, ..., p_n} and denote s(i, j) = pa(i)\{j}. We first consider the conditional distribution of β̂^(i)_j given X_pa(i). The variance of X_i | X_pa(i) is σ²_{i|pa(i)}, and we denote the variance of X_j | X_s(i,j) by σ²_{j|s(i,j)}. Further, we denote the sample variance of X_j by σ̂²_j, the sample variance of X_j | X_s(i,j) by σ̂²_{j|s(i,j)}, and the sample multivariate correlation coefficient between X_j and X_s(i,j) by R²_{j|s(i,j)}. We then condition on X_pa(i) = {X_{r,j} ; r = 1, ..., n, j ∈ pa(i)}. Now we apply Bernstein's inequality [28, Lemma 2.2.11], noting that χ²_{n−q_n−1} − (n − q_n − 1) can be viewed as the sum of n − q_n − 1 independent centered χ²_1 random variables. Hence, the last term admits an exponential bound with constants C′_1, C′_2 > 0 arising from moment conditions. This expression is in turn bounded above for all n such that (n − 2)/2 − q_n > C′_3, where C_3 > 0 is a constant arising from moment conditions. Because this bound holds for all X_s(i,j) with |s(i, j)| ≤ q_n, it also holds for the unconditional probability P[B^C_{j,s(i,j)}].
Lemma A.2. Suppose that the Gaussian assumption in (A) and assumptions (B) and (F) hold. Then, for every ǫ > 0, the probability

P[ sup_{i=1,...,p_n} |σ̂²_{i|pa(i)} − σ²_{i|pa(i)}| > ǫ ]

is exponentially small in n, where C > 0 is an absolute constant and r > 0 is as in assumption (F).
Proof. Using the union bound, for ǫ > 0,

P[ sup_{i=1,...,p_n} |σ̂²_{i|pa(i)} − σ²_{i|pa(i)}| > ǫ ] ≤ p_n sup_i P[ |σ̂²_{i|pa(i)} − σ²_{i|pa(i)}| > ǫ ].

For the conditional probability, we condition on X_pa(i) = {X_{r,j} ; r = 1, ..., n, j ∈ pa(i)} and bound P[ |σ̂²_{i|pa(i)} − σ²_{i|pa(i)}| > ǫ | X_pa(i) ]. Since this bound holds for all X_pa(i), it also applies to the unconditional probability.

We now use a Taylor expansion of the function x ↦ 1/x:

1/σ̂²_{i|pa(i)} − 1/σ²_{i|pa(i)} = −(σ̃²_{i|pa(i)})^{-2} (σ̂²_{i|pa(i)} − σ²_{i|pa(i)}),

where |σ̃²_{i|pa(i)} − σ²_{i|pa(i)}| ≤ |σ̂²_{i|pa(i)} − σ²_{i|pa(i)}|. Consider the set B = {sup_{i=1,...,p_n} |σ̂²_{i|pa(i)} − σ²_{i|pa(i)}| ≤ r/2} with r > 0 as in assumption (F). Then, on B, σ̃²_{i|pa(i)} ≥ r/2 for all i. The first and the second term on the right-hand side can be bounded using (20), leading to the bound in the statement of the lemma. This completes the proof of Lemma A.2.

For the proof of Lemma 4.1, by plugging these bounds into the formula Σ^{-1} = (I − A)^T Cov(ε)^{-1} (I − A) and using that the summations run over at most q_n terms (due to the sparsity of Â_{ki} and A_{ki}), we obtain

|Σ̂^{-1}_{G,n;i,j} − Σ^{-1}_{n;i,j}| ≤ C q_n δ,

where C > 0 is an absolute constant and δ is the maximal absolute difference between the entries of Â and A and between the estimated and true conditional error variances.

B Modifications of the PC-DAG covariance estimator
With finite sample size, the PC-algorithm may make some errors. One of them can produce conflicting v-structures when orienting the graph; if so, we deal with it by keeping one v-structure and discarding the others. In our implementation, the result then depends on the order of the performed independence tests. Furthermore, it may happen that the output of the PC-algorithm is an invalid CPDAG which does not describe an equivalence class of DAGs. In such a case we use the "retry" type orientation procedure implemented in the pcAlgo function of the pcalg-package; see the reference manual of the pcalg-package [17] for more information.

Definition 2.1. [Directed global Markov property]
Let A, B and S be disjoint subsets of V and G a DAG on V. If X_A ⊥⊥ X_B | X_S whenever A and B are d-separated by S in the graph G, we say that P obeys the directed global Markov property relative to the DAG G.

Figure 6: 10-fold CV of negative log-likelihood against the logarithm of the average number of non-zero entries of the estimated concentration matrix. The squares stand for the Glasso and the circles for the PC-DAG estimator.

Figure 7: Kullback-Leibler loss against the logarithm of the average number of non-zero elements of Σ^{-1} for Gaussian data (a), 10% t_3-contaminated Gaussian data (b) and 10% Cauchy-contaminated Gaussian data (c).

Proof of Lemma 4.1. Let G be a DAG from the true underlying CPDAG, i.e. from the true equivalence class. Using the union bound over the entries, it suffices to bound P[ |Σ̂^{-1}_{G,n;i,j} − Σ^{-1}_{n;i,j}| > γ ] uniformly in i and j.