Estimation of Gaussian graphs by model selection

We investigate in this paper the estimation of Gaussian graphs by model selection from a non-asymptotic point of view. We start from an n-sample of a Gaussian law P_C in R^p and focus on the disadvantageous case where n is smaller than p. To estimate the graph of conditional dependences of P_C, we introduce a collection of candidate graphs and then select one of them by minimizing a penalized empirical risk. Our main result assesses the performance of the procedure in a non-asymptotic setting. We pay special attention to the maximal degree D of the graphs that we can handle, which turns out to be roughly n/(2 log p).


Introduction
Let us consider a Gaussian law $P_C$ in $\mathbb{R}^p$ with mean 0 and positive definite covariance matrix $C$. We write $\theta$ for the matrix of the regression coefficients associated to the law $P_C$; more precisely, $\theta = \big(\theta^{(j)}_i\big)_{i,j=1,\ldots,p}$ is the $p \times p$ matrix such that $\theta^{(j)}_j = 0$ for $j = 1, \ldots, p$ and
$$\mathbb{E}\Big[X^{(j)} \,\Big|\, X^{(k)},\ k \neq j\Big] = \sum_{k \neq j} \theta^{(j)}_k X^{(k)} \quad \text{a.s.}, \qquad j \in \{1, \ldots, p\},$$
for any random vector $X = \big(X^{(1)}, \ldots, X^{(p)}\big)^T$ with law $P_C$. Our aim is to estimate the matrix $\theta$ by model selection from an $n$-sample $X_1, \ldots, X_n$ i.i.d. with law $P_C$. We will focus on the disadvantageous case where the sample size $n$ is smaller than the dimension $p$.
We henceforth call the shape of $\theta$ the set of couples of integers $(i,j)$ such that $\theta^{(j)}_i \neq 0$. The shape of $\theta$ is usually represented by a graph $g$ with $p$ labeled vertices $\{1, \ldots, p\}$, obtained by setting an edge between the vertices $i$ and $j$ when $\theta^{(j)}_i \neq 0$. This graph is well-defined since $\theta^{(j)}_i \neq 0$ if and only if $\theta^{(i)}_j \neq 0$; the latter property may be seen e.g. on the formula $\theta^{(j)}_i = -(C^{-1})_{i,j}/(C^{-1})_{j,j}$, valid for all $i \neq j$.
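Viewed computationally, this formula gives $\theta$ directly from the concentration matrix. Below is a minimal sketch in R (the function name is illustrative), assuming a positive definite covariance matrix C is given:

```r
## A minimal sketch (R): recover the regression matrix theta from a given
## positive definite covariance matrix C, via
## theta^(j)_i = -(C^{-1})_{i,j} / (C^{-1})_{j,j} for i != j.
theta_from_cov <- function(C) {
  K <- solve(C)                        # concentration matrix C^{-1}
  theta <- -sweep(K, 2, diag(K), "/")  # divide each column j by K[j,j]
  diag(theta) <- 0                     # theta^(j)_j = 0 by convention
  theta
}
```

The graph $g$ is then read off as the set of couples $(i,j)$ with a non-zero entry.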
The graph $g$ is of interest for the statistician since it depicts the conditional dependences of the variables $X^{(j)}$. Indeed, there is an edge between $i$ and $j$ if and only if $X^{(i)}$ is not independent of $X^{(j)}$ conditionally on the other variables. The objective in Gaussian graph estimation is usually to detect the graph $g$. Even if the purpose of our procedure is to estimate $\theta$ and not $g$, we propose to simultaneously estimate $g$ as follows: we associate with our estimator $\hat\theta$ of $\theta$ the graph $\hat g$ obtained by setting an edge between the vertices $i$ and $j$ when $\hat\theta^{(j)}_i$ is non-zero.
Estimation of Gaussian graphs with $n \ll p$ is currently an active field of research, motivated by applications in postgenomics. Biotechnological developments (microarrays, 2D-electrophoresis, etc.) make it possible to produce a huge amount of proteomic and transcriptomic data. One of the challenges in postgenomics is to infer from these data the regulation network of a family of genes (or proteins). The task is challenging for the statistician due to the very high-dimensional nature of the data and the small sample size. For example, microarrays measure the expression levels of a few thousand genes (typically 4000), while the sample size $n$ is no more than a few tens. Gaussian graphical modeling appears to be a valuable tool for this issue, see the papers of Kishino and Waddell [14], Dobra et al. [9], Wu and Ye [20]. The gene expression levels in the microarray are modeled by a Gaussian law $P_C$ and the regulation network of the genes is then depicted by the graph $g$ of the conditional dependences.
Various procedures have been proposed to perform graph estimation when $p > n$. Many are based on multiple testing, see for instance the papers of Schäfer and Strimmer [16], Drton and Perlman [8; 10] or Wille and Bühlmann [19]. We also mention the work of Verzelen and Villers [17] for testing in a non-asymptotic framework whether there are (or not) missing edges in a given graph. Recently, several authors have advocated taking advantage of the nice computational properties of the $\ell_1$-penalization to estimate either the graph $g$ or the concentration matrix $C^{-1}$. Meinshausen and Bühlmann [15] propose to learn the graph $g$ by regressing with the Lasso each variable against the others. Huang et al. [13] and Yuan and Lin [21] (see also Banerjee et al. [1] and Friedman et al. [11]) suggest in turn to rather estimate $C^{-1}$ by minimizing the log-likelihood of the concentration matrix penalized by the $\ell_1$-norm. The performance of these algorithms is mostly unknown: the few theoretical results are only valid under restrictive conditions on the covariance matrix and for large $n$ (asymptotic setting). In addition to these few theoretical results, Villers et al. [18] propose a numerical investigation of the validity domain of some of the above-mentioned procedures.
Our aim in this work is to investigate Gaussian graph estimation by model selection from a non-asymptotic point of view. We propose a procedure to estimate $\theta$ and assess its performance in a non-asymptotic setting. Then, we discuss the maximal degree of the graphs that we can accurately estimate and explore the performance of our estimation procedure in a small numerical study.
We will use the Mean Square Error of Prediction (MSEP) as a criterion to assess the quality of our procedure. To define this quantity, we introduce a few notations. For any $k, q \in \mathbb{N}$, we write $\|\cdot\|_{k \times q}$ for the Frobenius norm in $\mathbb{R}^{k \times q}$, namely $\|A\|^2_{k \times q} = \mathrm{Trace}(A^T A)$ for any $A \in \mathbb{R}^{k \times q}$. The MSEP of the estimator $\hat\theta$ is then
$$\mathrm{MSEP}(\hat\theta) = \mathbb{E}\Big[\big\|X_{\mathrm{new}}^T(\hat\theta - \theta)\big\|^2\Big] = \mathbb{E}\Big[\big\|C^{1/2}(\hat\theta - \theta)\big\|^2_{p \times p}\Big],$$
where $C^{1/2}$ is the positive square root of $C$ and $X_{\mathrm{new}}$ is a random vector, independent of $\hat\theta$, with distribution $P_C$. We underline that the MSEP focuses on the quality of the estimation of $\theta$ and not of $g$. In particular, we do not aim to estimate at best the "true" graph $g$, but rather to estimate at best the regression matrix $\theta$. We choose this point of view for two reasons. First, we do not believe that the matrix $\theta$ is exactly sparse in practice, in the sense that $\theta^{(j)}_i = 0$ for most $i, j \in \{1, \ldots, p\}$. Rather, we want to handle cases where the matrix $\theta$ is only approximately sparse, which means that there exists a sparse matrix $\theta^*$ which is a good approximation of $\theta$. In this case, the shape $g$ of $\theta$ may not be sparse at all; it can even be the complete graph. Our goal is then not to estimate $g$ but rather to capture the main conditional dependences, given by the shape $g^*$ of $\theta^*$. The second reason for considering the MSEP as a quality criterion is that we do not want to miss the important conditional dependences, but we do not worry too much about missing a weak one. In other words, even in the case where the shape $g$ of $\theta$ is sparse, we are interested in finding the main edges of $g$ (corresponding to strong conditional dependences) and we do not really care about missing a "weak" edge which is overwhelmed by the noise. The MSEP is a possible way to take this issue into account.
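As a side remark, the quantity $\|C^{1/2}A\|^2_{p \times p}$ can be evaluated without forming the matrix square root: any factorization $C = R^T R$ (e.g. Cholesky) gives $\mathrm{Trace}(A^T C A) = \|RA\|^2_{p \times p}$. A minimal sketch in R (the function name is illustrative):

```r
## A minimal sketch (R): the loss ||C^{1/2}(A - theta)||^2 computed through
## a Cholesky factor R with t(R) %*% R == C.
pred_loss <- function(A, theta, C) {
  R <- chol(C)                 # upper triangular Cholesky factor
  sum((R %*% (A - theta))^2)   # squared Frobenius norm
}
```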
To estimate $\theta$, we first introduce a collection $\mathcal{M}$ of graphs, which are our candidates for describing the shape $g$ of $\theta$. If we have no prior information on $g$, a possible choice for $\mathcal{M}$ is the set of all graphs with degree less than some fixed integer $D$. Then, we associate with each graph $m \in \mathcal{M}$ an estimator $\hat\theta_m$ of $\theta$ by minimizing an empirical version of the MSEP under the constraint that the shape of $\hat\theta_m$ is given by $m$; see Section 2 for the details. Finally, we select one of the candidate graphs $\hat m$ by minimizing a penalized empirical MSEP and set $\hat\theta = \hat\theta_{\hat m}$. Our main result roughly states that when the candidate graphs have a degree smaller than $n/(2\log p)$, the MSEP of $\hat\theta$ nearly achieves, up to a $\log(p)$ factor, the minimal MSEP of the collection of estimators $\{\hat\theta_m,\ m \in \mathcal{M}\}$.
It is of practical interest to know whether the condition on the degree of the candidate graphs can be avoided. This point is discussed in Section 3.1, where we emphasize that it is hopeless to try to estimate accurately graphs with a degree $D$ large compared to $n/(1 + \log(p/n))$. We also prove that the size of the penalty involved in the selection procedure is, in some sense, minimal.
The remainder of the paper is organized as follows. After introducing a few notations, we describe the estimation procedure in Section 2 and state our main results in Section 3. Section 4 is devoted to a small numerical study and Section 6 to the proofs.

A few notations
Before describing our estimation procedure, we introduce a few notations about graphs that we shall use throughout the paper.
a. Graphs
We represent the set of graphs with $p$ vertices labeled by $\{1, \ldots, p\}$ by the set $\mathcal{G}$ of all subsets $g$ of $\{1, \ldots, p\}^2$ such that $(i,j) \in g$ if and only if $(j,i) \in g$, and $(j,j) \notin g$ for all $j \in \{1, \ldots, p\}$. Indeed, to any $g \in \mathcal{G}$ we can associate a graph with $p$ vertices labeled by $\{1, \ldots, p\}$ by setting an edge between the vertices $i$ and $j$ if and only if $(i,j) \in g$. For simplicity, we henceforth call "graph" any element $g$ of $\mathcal{G}$.
For a graph $g \in \mathcal{G}$ and an integer $j \in \{1, \ldots, p\}$, we set $g_j = \{i : (i,j) \in g\}$ and denote by $|g_j|$ the cardinality of $g_j$. Finally, we define the degree of $g$ by $\deg(g) = \max\{|g_j| : j = 1, \ldots, p\}$.
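As a small illustration (in R, with a hypothetical encoding of $g$ as a two-column matrix of couples), here is how $g_j$ and $\deg(g)$ can be computed:

```r
## A small illustration (R): the graph with edges 1-2 and 2-3 on p = 3
## vertices, encoded as the symmetric set of couples (i, j).
g  <- rbind(c(1, 2), c(2, 1), c(2, 3), c(3, 2))
gj <- function(g, j) g[g[, 2] == j, 1]                  # g_j = {i : (i,j) in g}
deg_g <- max(sapply(1:3, function(j) length(gj(g, j)))) # deg(g) = 2
```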

b. Directed graphs
As before, we represent the set of directed graphs with $p$ vertices labeled by $\{1, \ldots, p\}$ by the set $\mathcal{G}^+$ of all subsets $g$ of $\{1, \ldots, p\}^2$ fulfilling $(j,j) \notin g$ for all $j \in \{1, \ldots, p\}$. More precisely, we associate with $g \in \mathcal{G}^+$ the directed graph with $p$ vertices labeled by $\{1, \ldots, p\}$ and with a directed edge from $i$ to $j$ if and only if $(i,j) \in g$.
We note that $\mathcal{G} \subset \mathcal{G}^+$ and we extend to $g \in \mathcal{G}^+$ the above definitions of $g_j$, $|g_j|$ and $\deg(g)$. Although $\mathcal{G}$ is contained in $\mathcal{G}^+$, it should be noted that the associated interpretation is different, since the graphs in $\mathcal{G}^+$ are directed, with possibly two directed edges between two vertices.

Estimation procedure
In this section, we explain our procedure to estimate $\theta$. We first introduce a collection of graphs and models, then we associate with each model an estimator, and finally we give a procedure to select one of them.

Collection of graphs and models
Our estimation procedure starts with the choice of either a collection $\mathcal{M} \subset \mathcal{G}$ of graphs or a collection $\mathcal{M} \subset \mathcal{G}^+$ of directed graphs, which are our candidates to describe the shape of $\theta$. Among the possible choices for $\mathcal{M}$ we mention four of them:
1. the set $\mathcal{M}^{\#}_D \subset \mathcal{G}$ of all graphs with at most $D$ edges,
2. the set $\mathcal{M}^{\mathrm{deg}}_D \subset \mathcal{G}$ of all graphs with degree less than $D$,
3. the set $\mathcal{M}^{\#,+}_D \subset \mathcal{G}^+$ of all directed graphs with at most $D$ directed edges,
4. the set $\mathcal{M}^{\mathrm{deg},+}_D \subset \mathcal{G}^+$ of all directed graphs with degree less than $D$.
We call degree of $\mathcal{M}$ the integer $D_{\mathcal{M}} = \max\{\deg(m) : m \in \mathcal{M}\}$ and note that the above collections of graphs have a degree bounded by $D$.
To the collection of graphs $\mathcal{M}$, we associate the following collection $\{\Theta_m,\ m \in \mathcal{M}\}$ of models to estimate $\theta$. The model $\Theta_m$ is the linear space of those matrices in $\mathbb{R}^{p \times p}$ whose shape is given by the graph $m$, namely
$$\Theta_m = \big\{A \in \mathbb{R}^{p \times p} : A^{(j)}_i = 0 \text{ for all } (i,j) \notin m\big\}.$$
As mentioned before, we know that $\theta^{(j)}_i \neq 0$ if and only if $\theta^{(i)}_j \neq 0$, so it may seem irrelevant to (possibly) introduce directed graphs instead of graphs. Nevertheless, we must keep in mind that our aim is to estimate $\theta$ at best in terms of the MSEP. In some cases, the results can be improved by using directed graphs instead of graphs, typically when for some $i, j \in \{1, \ldots, p\}$ the variances of $\hat\theta^{(j)}_i$ and $\hat\theta^{(i)}_j$ are very different. Finally, we note the following inclusions for the families of graphs mentioned above: $\mathcal{M}^{\#}_D \subset \mathcal{M}^{\mathrm{deg}}_D \subset \mathcal{M}^{\mathrm{deg},+}_D$ and $\mathcal{M}^{\#}_D \subset \mathcal{M}^{\#,+}_D \subset \mathcal{M}^{\mathrm{deg},+}_D$.

Collection of estimators
We assume henceforth that $3 \le n < p$ and that the degree $D_{\mathcal{M}}$ of $\mathcal{M}$ is upper bounded by some integer $D \le n - 2$. We start with $n$ observations $X_1, \ldots, X_n$ i.i.d. with law $P_C$ and we denote by $X$ the $n \times p$ matrix $X = [X_1, \ldots, X_n]^T$. In the following, we write $A^{(1)}, \ldots, A^{(p)}$ for the $p$ columns of a matrix $A \in \mathbb{R}^{k \times p}$.
We remind the reader that $\theta$ minimizes $A \mapsto \mathbb{E}\big[\|C^{1/2}(I - A)\|^2_{p \times p}\big]$ over the space $\Theta$ of $p \times p$ matrices with 0 on the diagonal. An empirical version of this quantity is $\|X(I - A)\|^2_{n \times p}/n$, which can also be viewed as an empirical version of the loss. In this direction, we associate with any $m \in \mathcal{M}$ an estimator $\hat\theta_m$ of $\theta$ by minimizing this empirical risk over $\Theta_m$:
$$\hat\theta_m \in \underset{A \in \Theta_m}{\mathrm{argmin}}\ \|X(I - A)\|^2_{n \times p}. \quad (1)$$
We note that the $p \times p$ matrix $\hat\theta_m$ then fulfills the equalities $X\hat\theta^{(j)}_m = \mathrm{Proj}_{V_{m_j}} X^{(j)}$ for $j = 1, \ldots, p$, where $\mathrm{Proj}_{V_{m_j}}$ denotes the orthogonal projection in $\mathbb{R}^n$ (for the usual scalar product) onto $V_{m_j} = \mathrm{span}\{X^{(i)} : i \in m_j\}$. Hence, since the covariance matrix $C$ is positive definite and $D$ is less than $n$, the minimizer of (1) is unique a.s.
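Concretely, the minimization in (1) decouples into one least-squares problem per column, each solved by an orthogonal projection. A minimal sketch in R (names are illustrative), assuming the graph m is given through the sets $m_j$:

```r
## A minimal sketch (R): the estimator theta_hat_m for a fixed graph m,
## with m represented as a list mj such that mj[[j]] = {i : (i,j) in m}.
## Column j is the least-squares regression of X^(j) on {X^(i), i in m_j}.
fit_theta_m <- function(X, mj) {
  p <- ncol(X)
  theta <- matrix(0, p, p)
  for (j in seq_len(p)) {
    S <- mj[[j]]
    if (length(S) > 0) {
      # coefficients of the projection of X^(j) onto span{X^(i), i in S}
      theta[S, j] <- qr.coef(qr(X[, S, drop = FALSE]), X[, j])
    }
  }
  theta
}
```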

Selection procedure
To estimate $\theta$, we select one of the estimators $\hat\theta_m$ by minimizing some penalized version of the empirical risk $\|X(I - \hat\theta_m)\|^2/n$. More precisely, we set $\hat\theta = \hat\theta_{\hat m}$, where $\hat m$ is any minimizer over $\mathcal{M}$ of the criterion
$$\mathrm{Crit}(m) = \sum_{j=1}^{p} \Big\|X^{(j)} - X\hat\theta^{(j)}_m\Big\|^2 \left(1 + \frac{\mathrm{pen}(|m_j|)}{n - |m_j|}\right), \quad (2)$$
with a penalty function $\mathrm{pen} : \mathbb{N} \to \mathbb{R}^+$ of the form of the penalties introduced in Baraud et al. [4]. To compute this penalty, we define for any integers $d$ and $N$ the Dkhi function by
$$\mathrm{Dkhi}(d, N, x) = \mathbb{P}\left(F_{d+2,N} \ge \frac{x}{d+2}\right) - \frac{x}{d}\,\mathbb{P}\left(F_{d,N+2} \ge \frac{(N+2)x}{Nd}\right),$$
where $F_{d,N}$ denotes a Fisher random variable with $d$ and $N$ degrees of freedom.
The function $x \mapsto \mathrm{Dkhi}(d, N, x)$ is decreasing and we write $\mathrm{EDkhi}[d, N, x]$ for its inverse; see Section 6.1 of [4] for details. Then, we fix some constant $K > 1$ and set
$$\mathrm{pen}(d) = K\,\frac{n-d}{n-d-1}\,\mathrm{EDkhi}\left[d+1,\ n-d-1,\ \left(\binom{p-1}{d}(d+1)^2\right)^{-1}\right]. \quad (3)$$
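The penalty can be computed numerically: since $x \mapsto \mathrm{Dkhi}(d,N,x)$ decreases from 1 to 0, its inverse EDkhi can be obtained by root-finding. A minimal sketch in R, assuming the Dkhi definition and the form (3) above:

```r
## A minimal numerical sketch (R) of the penalty (3); Dkhi follows the
## definition above and EDkhi is computed by root-finding.
Dkhi <- function(d, N, x) {
  pf(x / (d + 2), d + 2, N, lower.tail = FALSE) -
    (x / d) * pf((N + 2) * x / (N * d), d, N + 2, lower.tail = FALSE)
}
EDkhi <- function(d, N, q) {
  # the value x such that Dkhi(d, N, x) = q, for q in (0, 1)
  uniroot(function(x) Dkhi(d, N, x) - q, lower = 1e-10, upper = 1e10)$root
}
pen <- function(d, n, p, K = 2) {
  if (d == 0) return(0)  # EDkhi[1, n - 1, 1] = 0
  K * (n - d) / (n - d - 1) *
    EDkhi(d + 1, n - d - 1, 1 / (choose(p - 1, d) * (d + 1)^2))
}
```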

Size of the penalty
The size of the penalty $\mathrm{pen}(d)$ is roughly $2Kd\log p$ for large values of $p$. Indeed, we will work in the sequel with collections of models whose complexity grows at most like a power of $p$, which leads to this order of magnitude; see [4] for an exact bound. In Section 3.2, we show that the size of this penalty is minimal in some sense.
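A quick numerical check with the pen() sketch above illustrates this order of magnitude (values are illustrative):

```r
## Ratio of pen(d) to 2*K*d*log(p): of order one for large p
## (illustrative check with the sketch above).
n <- 50; p <- 1000; K <- 2
sapply(1:5, function(d) pen(d, n, p, K) / (2 * K * d * log(p)))
```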

Choice of the tuning parameter K
Increasing the value of $K$ decreases the size of the selected graph $\hat m$. The choice $K = 2$ gives a good control of the MSEP of $\hat\theta$, both theoretically and numerically (see Sections 3 and 4). If we want the rate of false discovery of edges to remain smaller than 5%, the choice $K = 3$ may also be appropriate.

Computational cost
The computational cost of the selection procedure appears to be very high. For example, if $\mathcal{M} = \mathcal{M}^{\mathrm{deg},+}_D$, the computational complexity of the procedure increases as $p^{D+1}$ with the dimension $p$. In a future work [12], we will propose a modified version of this procedure with a much smaller complexity.
A few additional remarks on the estimation procedure
1- The matrix $\theta$ belongs to the set
$$\Gamma = \big\{\theta \in \mathbb{R}^{p \times p} : \exists\, K \text{ positive definite such that } \theta_{i,j} = -K_{i,j}/K_{j,j} \text{ for } i \neq j\big\},$$
but the estimator $\hat\theta$ has no reason to belong to this space. To avoid this unpleasant feature, it would be natural to minimize (1) on the space $\Theta_m \cap \Gamma$ instead of $\Theta_m$. Unfortunately, we do not know how to handle this case, either theoretically or numerically. We also emphasize that the matrix $\theta$ is not assumed to be exactly sparse, so it does not belong to any of the $\{\Theta_m \cap \Gamma,\ m \in \mathcal{M}\}$ in general. In particular, it is unclear whether the MSEP of the estimator obtained by minimizing (1) on $\Theta_m \cap \Gamma$ is smaller than the MSEP of $\hat\theta_m$.
2- In the special case where $\mathcal{M} = \mathcal{M}^{\mathrm{deg},+}_D$, the minimization of (2) can be performed by minimizing the $j$-th term $\|X^{(j)} - X\hat\theta^{(j)}_m\|^2\,\big(1 + \mathrm{pen}(|m_j|)/(n - |m_j|)\big)$ independently for each $j$. This nice computational feature does not hold for the other collections of graphs introduced in Section 2.1.
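A minimal sketch in R of this column-wise search (illustrative and exhaustive, hence only feasible for small p and D; it reuses the pen() sketch above):

```r
## A minimal sketch (R): exhaustive selection of the neighborhood of
## column j among all subsets S of size at most D, using criterion (2).
select_column <- function(X, j, D, K = 2) {
  n <- nrow(X); p <- ncol(X)
  best <- list(crit = sum(X[, j]^2), S = integer(0))   # empty model
  for (d in seq_len(D)) {
    for (S in combn(setdiff(seq_len(p), j), d, simplify = FALSE)) {
      rss  <- sum(qr.resid(qr(X[, S, drop = FALSE]), X[, j])^2)
      crit <- rss * (1 + pen(d, n, p, K) / (n - d))
      if (crit < best$crit) best <- list(crit = crit, S = S)
    }
  }
  best
}
```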

The main result
The next theorem gives an upper bound on the MSEP of a slight variation $\tilde\theta$ of $\hat\theta$, defined by
$$\tilde\theta^{(j)} = \hat\theta^{(j)}\,\mathbf{1}\big\{\|\hat\theta^{(j)}\| \le \sqrt{p}\,T_n\big\} \quad \text{for all } j \in \{1, \ldots, p\}, \quad \text{with } T_n = n^{2\log n}. \quad (4)$$
We note that $\tilde\theta$ and $\hat\theta$ coincide in practice since the threshold level $T_n$ increases very fast with $n$, e.g. $T_{20} \approx 6 \cdot 10^7$.
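Applied column-wise, the truncation (4) reads as follows; a small illustrative sketch in R:

```r
## A small sketch (R): truncation (4), with T_n = n^(2 log n) (so T_20 ~ 6e7).
Tn <- function(n) n^(2 * log(n))
truncate_theta <- function(theta_hat, n) {
  p <- ncol(theta_hat)
  keep <- sqrt(colSums(theta_hat^2)) <= sqrt(p) * Tn(n)
  sweep(theta_hat, 2, as.numeric(keep), "*")  # zero out truncated columns
}
```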
In the sequel, we write $\sigma_j^2$ for the conditional variance of $X^{(j)}$ given the remaining variables $(X^{(k)},\ k \neq j)$.

Theorem 1. Assume that the degree $D_{\mathcal{M}}$ of the collection fulfills Condition (5). Then, the MSEP of the estimator $\tilde\theta$ defined by (4) is upper bounded as in (6), in terms of the $\sigma_j^2$ and a residual term.

The proof of Theorem 1 and of the next corollary is postponed to Section 6.3.
Corollary 1. Assume that $p > n \ge 3$ and that Condition (5) holds. Then, there exists some constant $C_{K,\eta}$, depending on $K$ and $\eta$ only, such that the bound (7) holds.

Corollary 1 roughly states that when the candidate graphs have a degree smaller than $n/(2\log p)$, the MSEP of $\tilde\theta$ nearly achieves, up to a $\log(p)$ factor, the minimal MSEP of the collection of estimators $\{\hat\theta_m,\ m \in \mathcal{M}\}$. In particular, if $g \in \mathcal{M}$, the MSEP of $\tilde\theta$ is upper bounded by $\log(p)$ times the MSEP of $\hat\theta_g$, which in turn is roughly upper bounded by $\deg(g) \times \|C^{1/2}(I - \theta)\|^2 \log(p)/n$.
The additional term $n^{-1}\|C^{1/2}(I - \theta)\|^2$ in (7) can be interpreted as a minimal variance for the estimation of $\theta$. This minimal variance is due to the inability of the procedure to detect with probability one whether an isolated vertex of $g$ is isolated or not. We mention that when each vertex of the graph $g$ is connected to at least one other vertex, this variance term $n^{-1}\|C^{1/2}(I - \theta)\|^2$ remains smaller than the MSEP of $\hat\theta_g$.
Below, we discuss the necessity of Condition (5) on the degree of the graphs and the size of the penalty.

Is Condition (5) avoidable?
Condition (5) requires that $D_{\mathcal{M}}$ remain small compared to $n/(2\log p)$. We may wonder whether this condition is necessary, or whether we can hope to handle graphs with a larger degree $D$. A glance at the proof of Theorem 1 shows that Condition (5) can be replaced by a weaker condition, which is satisfied under Condition (8); so we can replace Condition (5) by Condition (8) in Theorem 1. Let us check now that we cannot improve (up to a multiplicative constant) upon (8). A Pythagorean equality shows that there is no hope to control the size of $\|C^{1/2}(\theta - \hat\theta)\|^2$ if we do not have, for some $\delta \in (0,1)$, the Inequalities (9) with large probability. Under Condition (5) or (8), Lemma 1 in Section 6 ensures that these inequalities hold for any $\delta > \sqrt{\eta}$ with probability $1 - 2\exp(-n(\delta - \sqrt{\eta})^2/2)$. We emphasize next that in the simple case where $C = I$, the Inequalities (9) enforce that $n^{-1/2}X$ satisfies the so-called $\delta$-Restricted Isometry Property of order $D$ introduced by Candès and Tao [5], namely
$$(1 - \delta)\|\beta\|^2 \le \big\|n^{-1/2}X\beta\big\|^2 \le (1 + \delta)\|\beta\|^2$$
for all $\beta \in \mathbb{R}^p$ with at most $D$ non-zero components. Baraniuk et al. [2] (see also Cohen et al. [6]) have noticed that there exists some constant $c(\delta) > 0$ (depending on $\delta$ only) such that no $n \times p$ matrix can fulfill the $\delta$-Restricted Isometry Property of order $D$ if $D \ge c(\delta)\,n/(1 + \log(p/n))$. In particular, the matrix $X$ cannot satisfy the Inequalities (9) when $D \ge c(\delta)\,n/(1 + \log(p/n))$.

Can we choose a smaller penalty?
As mentioned before, under Condition (5) the penalty $\mathrm{pen}(d)$ given by (3) is approximately upper bounded by $K\big(1 + e^{\eta}\sqrt{2\log p}\big)^2(d+1)$. Similarly to Theorem 1 in Baraud et al. [4], a slight variation of the proof of Theorem 1 justifies the use of a penalty of the form $\mathrm{pen}(d) = 2Kd\log(p-1)$ with $K > 1$, as long as $D_{\mathcal{M}}$ remains small (the condition on $D_{\mathcal{M}}$ is then much stronger than Condition (5)). We underline in this section that it is not recommended to choose a smaller penalty. Indeed, the next proposition shows that choosing a penalty of the form $\mathrm{pen}(d) = 2(1-\gamma)d\log(p-1)$ for some $\gamma \in (0,1)$ leads to a strong overfitting in the simple case where $\theta = 0$, which corresponds to $C = I$.

Proposition 1. Assume that $\mathrm{pen}(d) = 2(1-\gamma)d\log(p-1)$ for some $\gamma \in (0,1)$ and $\theta = 0$. Then, there exists some constant $c(\gamma)$, made explicit in the proof, such that when $\hat m$ is selected according to (2), the selected graph overfits with large probability. In addition, a more explicit statement holds in the case where $\mathcal{M} = \mathcal{M}^{\mathrm{deg},+}_D$.

Numerical study
In this section, we carry out a small simulation study to evaluate the performance of our procedure. Our study concerns the behaviour of the estimator $\hat\theta$ when the sparsity decreases (Section 4.2) or when the number of covariates $p$ increases (Section 4.3). In this direction, we fix the sample size $n$ to 15 (a typical value in post-genomics) and run simulations for different values of $p$ and different sparsity levels. For comparison, we include the "or" variant of the procedure of Meinshausen and Bühlmann [15]. This choice is based on the numerical study of Villers et al. [18], where this procedure achieves a good trade-off between power and FDR. We write henceforth "MB" to refer to this procedure.

Simulation scheme
The graphs $g$ are sampled according to the Erdös-Rényi model: starting from a graph with $p$ vertices and no edges, we set an edge between each couple of vertices at random with probability $q$ (independently of the others). Then, we associate with a graph $g$ a positive definite matrix $K$ with shape given by $g$ as follows.
For each $(i,j) \in g$, we draw $K_{i,j} = K_{j,i}$ from the uniform distribution on $[-1, 1]$ and set the elements on the diagonal of $K$ in such a way that $K$ is diagonally dominant, and thus positive definite. Finally, we normalize $K$ to have ones on the diagonal and set $C = K^{-1}$.
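A minimal sketch in R of this simulation scheme (the dominance margin is an illustrative choice):

```r
## A minimal sketch (R): Erdos-Renyi graph, diagonally dominant K with
## that shape, normalized to unit diagonal; returns C = K^{-1}.
simulate_C <- function(p, q) {
  up <- matrix(runif(p^2) < q, p, p) & upper.tri(diag(p))  # ER edges
  K  <- matrix(0, p, p)
  K[up] <- runif(sum(up), -1, 1)
  K <- K + t(K)                     # symmetric, zero diagonal so far
  diag(K) <- rowSums(abs(K)) + 0.1  # strict diagonal dominance
  D <- diag(1 / sqrt(diag(K)))
  K <- D %*% K %*% D                # ones on the diagonal
  solve(K)                          # C = K^{-1}
}
```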
For each value of $p$ and $q$ we sample 20 graphs and covariance matrices $C$. Then, for each covariance matrix $C$, we generate 200 independent samples $(X_1, \ldots, X_{15})$ of size 15 with law $P_C$. For each sample, we estimate $\theta$ with our procedure and with the procedure of Meinshausen and Bühlmann. For our procedure, we set $\mathcal{M} = \mathcal{M}^{\mathrm{deg}}_4$ and $K = 2$ or $2.5$. For Meinshausen and Bühlmann's estimator $\hat\theta_{\mathrm{MB}}$, we set $\lambda$ according to (9) in [15] with $\alpha = 5\%$, as recommended by the authors.
On the basis of the $20 \times 200$ simulations, we evaluate the risk ratio
$$\mathrm{r.Risk} = \frac{\mathrm{MSEP}(\tilde\theta)}{\min_{m} \mathrm{MSEP}(\hat\theta_m)},$$
as well as the power and the FDR for the detection of the edges of the graph $g$. The calculations are made with R (www.r-project.org).
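For completeness, the power and FDR of edge detection can be computed as below (a small sketch in R, with graphs encoded as logical adjacency matrices):

```r
## A small sketch (R): power and FDR of edge detection, comparing the
## estimated shape ghat to the true graph g.
edge_metrics <- function(ghat, g) {
  tp <- sum(ghat & g); fp <- sum(ghat & !g)
  c(power = tp / max(sum(g), 1), fdr = fp / max(tp + fp, 1))
}
```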

Decreasing the sparsity
To investigate the behaviour of the procedure when the sparsity decreases, we fix (n, p) = (15, 10) and consider the three graph-density levels q = 10%, q = 30% and q = 33%.The results are reported in Table 1.
When $q = 10\%$, both procedures perform well. They detect on average more than 80% of the edges, with an FDR lower than 5% and a risk ratio around 2.5. We note that MB has a slightly larger risk ratio than our procedure, but also a slightly smaller FDR.
When $q$ increases above 30%, the performances of the procedures decline abruptly. They detect less than 25% of the edges on average and the risk ratio increases above 4. When $q = 30\%$ or $q = 33\%$, our procedure is more powerful than MB, with a risk ratio 33% smaller.
In this simulation study, all the candidate graphs have a degree smaller than 4. Using candidate graphs with a larger degree should not change the nature of the results: when $q = 30$ or 33%, less than 2% of the selected graphs have a degree equal to 4 and the mean degree of the selected graphs is between 1 and 2.

Increasing the number of covariates
In this section, we focus on the quality of the estimation of $\theta$ and $g$ when the number of covariates $p$ increases. We thus fix the sample size $n$ to 15 and the sparsity index $s := pq$ to 1; this index corresponds to the mean degree of a vertex in the Erdös-Rényi model. Then, we run simulations for three values of $p$, namely $p = 15$, $p = 20$ and $p = 40$ (in this last case we set $\mathcal{M} = \mathcal{M}^{\mathrm{deg}}_3$ to reduce the computational time). The results are reported in Table 2.

Table 1

Our procedure with K = 2, K = 2.5 and the MB procedure: risk ratio (r.Risk), power and FDR when n = 15, p = 10 and q = 10%, 30% and 33%.

When the number $p$ of covariates increases, the risk ratios of the procedures increase and their power decreases. Nevertheless, the performance of our procedure remains good, with a risk ratio between 3.6 and 6.5, a power close to 70% and an FDR around $5.6 \pm 1\%$. In contrast, the performance of MB declines abruptly when $p$ increases. For values of $p$ larger than or equal to 22 (not shown), the MB procedure no longer detects any edges. This phenomenon was already noticed in Villers et al. [18].

Conclusion
In this paper, we propose to estimate the matrix of regression coefficients $\theta$ by minimizing a penalized empirical risk. The resulting estimator has some nice theoretical and practical properties. From a theoretical point of view, Theorem 1 ensures that the MSEP of the estimator can be upper bounded in terms of the minimum of the MSEPs of the $\{\hat\theta_m,\ m \in \mathcal{M}\}$ in a non-asymptotic setting and with no condition on the covariance matrix $C$. From a more practical point of view, the simulations of the previous section exhibit a good behaviour of the estimator. The power and the risk of our procedure are better than those of the procedure of Meinshausen and Bühlmann, especially when $p$ increases. The downside of this better power is a slightly higher FDR of our procedure compared to that of Meinshausen and Bühlmann. If the FDR should be reduced, we recommend setting the tuning parameter $K$ to a larger value, e.g. $K = 3$.
The main drawback of our procedure is its computational cost: in practice, it cannot be used when $p$ is larger than 50. In a future work [12], we propose a modification of the procedure that enables us to handle much larger values of $p$.
Finally, we emphasize that our procedure can only estimate accurately graphs with a degree smaller than $n/(2\log p)$ and, as explained in Section 3.1, we cannot improve on this condition (up to a constant).

A concentration inequality
Lemma 1. Consider three integers $1 \le d \le n \le p$, a collection $V_1, \ldots, V_N$ of $d$-dimensional linear subspaces of $\mathbb{R}^p$ and an $n \times p$ matrix $Z$ whose coefficients are i.i.d. with standard Gaussian distribution. We define $\lambda^*_d(Z)$ accordingly. Then, for any $x \ge 0$, the deviation bound (10) holds, where $\mathcal{N}$ has a standard Gaussian distribution; similarly, for any $x \ge 0$, the bound (11) holds.

Proof. The map $Z \mapsto \sqrt{n}\,\lambda^*_d(Z)$ is 1-Lipschitz, therefore the Gaussian concentration inequality applies. To get (10), we need to bound $\mathbb{E}(\lambda^*_d(Z))$ from below. For $i = 1, \ldots, N$, we get from [7] a lower bound; hence there exist some standard Gaussian random variables $\mathcal{N}_i$ such that the corresponding inequality holds, where $(x)_+$ denotes the positive part of $x$. Starting from Jensen's inequality, we obtain for any $\lambda > 0$ a bound on $\mathbb{E}\big[\max_{i=1,\ldots,N} \mathcal{N}_i\big]$. Setting $\lambda = \sqrt{2\log N}$, we finally get the desired lower bound. This concludes the proof of (10); the proof of (11) is similar.
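For reference, the Gaussian concentration inequality invoked in the first step of the proof is the standard concentration of Lipschitz functions of Gaussian vectors:

```latex
% If F is 1-Lipschitz (Euclidean norm) and Z has i.i.d. N(0,1) entries,
% then for all x >= 0:
\[
  \mathbb{P}\bigl(F(Z) \le \mathbb{E}[F(Z)] - x\bigr) \le e^{-x^2/2},
  \qquad
  \mathbb{P}\bigl(F(Z) \ge \mathbb{E}[F(Z)] + x\bigr) \le e^{-x^2/2}.
\]
```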

Proof of Corollary 1
Corollary 1 is a direct consequence of Theorem 1 and of the three following facts.
1. The equality $\sum_{j=1}^{p} \sigma_j^2 = \|C^{1/2}(I - \theta)\|^2$ holds.
2. Proposition 4 in Baraud et al. [4] ensures that, when $D_{\mathcal{M}}$ fulfills Condition (5), there exists a constant $C(K, \eta)$, depending on $K$ and $\eta$ only, that upper bounds $\mathrm{pen}(d)$.
3. When $D_{\mathcal{M}}$ fulfills (5), the MSEP of the estimator $\hat\theta_m$ is bounded from below as announced; the latter inequality follows directly from Lemma 1.
Finally, to give an idea of the size of C(K, η), we mention the following approximate bound (for n and p large)

Proof of Theorem 1
The proof is split into two parts. First, we bound the MSEP of $\tilde\theta$ from above by an intermediate quantity; then, we bound this last term by the right-hand side of (6).
To keep formulas short, we write henceforth $\hat m$ for the selected graph and $\mathcal{M}^*_{j,D}$ for the set of those subsets $m$ of $\{1, \ldots, j-1, j+1, \ldots, p\} \times \{j\}$ with cardinality $D$. We prove in the next paragraphs, for any $j = 1, \ldots, p$, upper bounds on the error terms $E^{(j)}_1$ to $E^{(j)}_4$. The proofs of these bounds bear the same flavor as the proof of Theorem 1 in Baraud [3].
Upper bound on $E^{(j)}_2$. All we need is to bound $\mathbb{P}\big(\lambda^*_j \ge \lambda_0,\ \tilde\theta^{(j)} = 0,\ \lambda^1_j \le 3/2\big)$ from above. Writing $\lambda_-$ for the smallest eigenvalue of $C$, we have a first bound on this event. Besides, for any $m \in \mathcal{M}$, a second bound holds, with $\varepsilon^{(j)}$ distributed as a standard Gaussian random variable in $\mathbb{R}^n$. Therefore, on the event $\{\lambda^*_j \ge \lambda_0,\ \tilde\theta^{(j)} = 0,\ \lambda^1_j \le 3/2\}$, the two bounds combine and, as a consequence, control $E^{(j)}_2$.

Upper bound on $E^{(j)}_3$. We note that $n(\lambda^1_j)^2$ follows a $\chi^2$ distribution with $n$ degrees of freedom. Markov's inequality then yields the bound, and as a consequence $E^{(j)}_3$ is controlled.

Upper bound on $E^{(j)}_4$. Writing $\lambda_+$ for the largest eigenvalue of the covariance matrix $C$, we reduce the problem to the random matrix $Z = XC^{-1/2}$, an $n \times p$ matrix whose coefficients are i.i.d. with standard Gaussian distribution. Condition (5) enforces the bound required by Lemma 1, which finally controls $E^{(j)}_4$.

Conclusion. Putting together the bounds (12) to (15) gives the announced upper bound.

Let $m^*$ be an arbitrary index in $\mathcal{M}$. Starting from the inequality $\mathrm{Crit}(\hat m) \le \mathrm{Crit}(m^*)$ and following the same lines as in the proof of Theorem 2 in Baraud et al. [4], we obtain, for any $K > 1$ and any $m \in \mathcal{M}$ and $j \in \{1, \ldots, p\}$, a bound involving the sets $\mathcal{M}_j = \{m_j,\ m \in \mathcal{M}\}$. The choice (3) of the penalty ensures that the last term is upper bounded by $K\sum_{j=1}^{p} \sigma_j^2 \log(n)/n$. Combining this with the bound in expectation on $X(\theta^{(j)} - \hat\theta^{(j)}_{m^*})$ gives (17).

c. Conclusion. The bound (17) holds for any $m^*$, so combined with (16) it gives (6).

Proof of Proposition 1
The proof of Proposition 1 is based on the following lemma.

Let us consider an $n \times p$ random matrix $Z$ whose coefficients $Z^{(j)}_i$ are i.i.d. with standard Gaussian distribution, and a random variable $\varepsilon$, independent of $Z$, with standard Gaussian law in $\mathbb{R}^n$.

The proof of this lemma is technical, and at first we only give a sketch of it. For the details, we refer to Section 6.5.
Sketch of the proof of Lemma 2. According to Lemma 1, when $|s|$ is small compared to $n/\log p$, we have $\|Z\alpha\|^2 \approx n\|\alpha\|^2$ with large probability. We will prove first that, on the event $\Omega_0$, we have $\mathrm{Crit}'(s) > \mathrm{Crit}'(\hat s_{D_{n,p}})$ for any $s$ with cardinality less than $\gamma D_{n,p}/6$, and then that $\Omega_0$ has a probability bounded from below by $1 - 3p^{-1} - 2\exp(-n\gamma^2/512)$.
We write $\Delta(s) = \mathrm{Crit}'(\hat s_D) - \mathrm{Crit}'(s)$. Since we are only interested in the sign of $\Delta(s)$, we will still write $\Delta(s)$ for any positive constant times $\Delta(s)$, and we bound $\Delta(s)$ from below on $\Omega_0$. We will now bound $\mathbb{P}(\Omega_0^c)$ from above. We write $Y = Z^T\varepsilon/\|\varepsilon\|$ (with the convention that $Y = 0$ when $\varepsilon = 0$).
