Berry-Esseen bounds for estimating undirected graphs

Abstract: We consider the problem of providing nonparametric confidence guarantees, with finite sample Berry-Esseen bounds, for undirected graphs under weak assumptions. We do not assume sparsity or incoherence. We allow the dimension D to increase with the sample size n. First, we prove lower bounds showing that if we want accurate inferences under weak assumptions then D must be less than n. In that case, we show that methods based on Normal approximations and on the bootstrap lead to valid inferences, and we provide new Berry-Esseen bounds on the accuracy of the Normal approximation and the bootstrap. When the dimension is large relative to the sample size, accurate inferences for graphs under weak assumptions are not possible. Instead we propose to estimate something less demanding than the entire partial correlation graph. In particular, we consider: cluster graphs, restricted partial correlation graphs and correlation graphs.


Introduction
There are many methods for estimating undirected graphs, such as the glasso (Yuan and Lin, 2007; Friedman and Tibshirani, 2007) and sparse parallel regression (Meinshausen and Bühlmann, 2006). While these methods are very useful, they rely on strong assumptions, in particular, sparsity and some type of incoherence assumption. Most methods do not come with any confidence guarantees.
Recently, some papers, such as Liu (2013) and Ren et al. (2013), have provided confidence guarantees, but they still rely on eigenvalue conditions and sparsity. Ravikumar et al. (2011) consider graph recovery with less restrictive tail conditions on the distribution, but they still invoke sparsity and incoherence assumptions. A greedy method for discrete random variables is given in Jalali, Johnson and Ravikumar (2011). Again, an incoherence-like assumption (restricted convexity and smoothness) is invoked.
The purpose of this paper is to construct a nonparametric estimator Ĝ of an undirected graph G with confidence guarantees that does not make sparsity and incoherence assumptions. Furthermore, we provide Berry-Esseen type bounds on the coverage of the confidence intervals.
The confidence guarantee we seek is

$$\inf_{P \in \mathcal{P}} P^n\big(\hat{G} \subset G\big) \geq 1 - \alpha - r_n \qquad (1)$$

where n is the sample size, P^n denotes the distribution for n observations drawn from P, and r_n is an explicit rate. The notation Ĝ ⊂ G means that the edges of Ĝ are a subset of the edges of G. This means that, with high probability, there are no false edges. One could use other error measures, such as false discovery rates, but we shall use the guarantee given by (1). Of course, setting Ĝ to be the empty graph would trivially satisfy (1), so we also want to ensure that the estimator has non-trivial power for detecting edges. We focus at first on partial correlation graphs: a missing edge means that the corresponding partial correlation is 0. We distinguish two cases. In the first case, the dimension D can increase with n but is smaller than n. In that case we show that Gaussian asymptotic methods and bootstrap methods yield accurate confidence intervals for the partial correlations, which in turn yield confidence guarantees for the graph. The novelty here is that we provide finite sample bounds on the coverage accuracy. (We also show that, in principle, one can construct finite sample intervals, but these intervals turn out to be too conservative to be useful.) In the second case, D can be large, even larger than n. In this case it is not possible to get valid inferences for the whole graph under weak assumptions. We investigate several ways to handle this case, including cluster graphs, restricted partial correlation graphs and correlation graphs. Again we provide Berry-Esseen bounds for the methods.
Contributions. Here is a summary of our contributions:
1. We develop new Berry-Esseen bounds for the delta method and the bootstrap with increasing dimension.
2. The methods provide confidence guarantees.
3. The methods do not depend on Normality or other parametric assumptions.
4. The methods do not require sparsity or incoherence conditions.
5. The methods have valid coverage when the dimension increases with the sample size.
6. The methods are very simple and do not require any optimization.
Related Work. Our approach is similar to the methods in Liu (2013) and Ren et al. (2013). They use tests on partial correlations to estimate an undirected graph. This approach has two advantages over other methods: it eliminates the need to choose a tuning parameter (as in penalized methods like the glasso) and it provides error control for the estimated graph. However, the results in those papers assume conditions like those in most papers on penalized methods like the lasso, namely, sparsity. These conditions might be reasonable in some situations, but our goal is to estimate the graph without invoking these assumptions. In the special case of fixed dimension, our method is similar to that in Drton and Perlman (2004). Schäfer et al. (2005), building on work by Ledoit and Wolf (2004), consider a shrinkage approach to estimating graphs. They make no sparsity or incoherence assumptions. Their examples suggest that their approach can work well in high dimensions. Their method introduces a bias-validity tradeoff: heavy shrinkage biases the partial correlations but has valid asymptotics in high dimensions, while light shrinkage has low bias but compromises the validity of the asymptotics in high dimensions. Shrinkage graphs are beyond the scope of this paper, however.
Previous research for increasing but moderate dimensions includes Portnoy (1988) and Mammen (1993). Our results are very much in the spirit of those papers. However, our emphasis is on finite sample Berry-Esseen style bounds.
Outline. We start with some notation in Section 2. We discuss various assumptions in Section 3. We then establish lower bounds in Section 4. Finite sample methods are presented in Section 5. However, these do not work well in practice. Asymptotic methods for the moderate dimensional case are considered in Section 6. Specifically, we develop a delta method and a bootstrap method that accommodate increasing dimension. Recent results on high dimensional random vectors due to Chernozhukov, Chetverikov and Kato (2012, 2013) play an important role in our analysis. Methods for the high-dimensional case are considered in Section 7. In Section 8 we give some numerical experiments and some examples. Concluding remarks are in Section 9.

Notation
Let Y_1, . . ., Y_n ∈ R^D be a random sample from a distribution P. Each Y_i = (Y_i(1), . . ., Y_i(D))^T is a vector of length D. We allow D ≡ D_n to increase with n. We do not assume that the Y_i's are Gaussian. If A is a matrix, we will sometimes let A_jk denote the (j, k) element of that matrix.
Let Σ ≡ Σ(P) denote the D × D covariance matrix of Y_i and let Ω = Σ^{-1}. Let Θ = {θ_jk} be the matrix of partial correlations,

$$\theta_{jk} = \frac{-\Omega_{jk}}{\sqrt{\Omega_{jj}\,\Omega_{kk}}}.$$

Let S be the sample covariance matrix and let Θ̂_n be the matrix of sample partial correlations. Given a matrix of partial correlations Θ, let G ≡ G(P) be the undirected graph with D nodes such that there is an edge between nodes j and k if and only if θ_jk ≠ 0. Equivalently, there is an edge if and only if Ω_jk ≠ 0.
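To fix ideas, here is a minimal numpy sketch (ours, for illustration only; the function name is arbitrary) of the map from a sample to the matrix of sample partial correlations:

```python
import numpy as np

def sample_partial_correlations(Y):
    """Sample partial correlation matrix from data Y of shape (n, D).

    theta[j, k] = -Omega[j, k] / sqrt(Omega[j, j] * Omega[k, k]),
    where Omega is the inverse of the sample covariance matrix.
    Requires D < n so that S is invertible.
    """
    S = np.cov(Y, rowvar=False)          # D x D sample covariance
    Omega = np.linalg.inv(S)             # sample precision matrix
    d = np.sqrt(np.diag(Omega))
    Theta = -Omega / np.outer(d, d)      # normalize by the diagonal
    np.fill_diagonal(Theta, 1.0)         # convention: unit diagonal
    return Theta
```

An edge is then declared between j and k when the confidence interval for θ_jk excludes 0, as developed in the following sections.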
In Section 7 we consider other types of graphs. For any matrix A, let vec(A) denote the vector obtained by stacking the columns of A. We define the vectorized quantities σ = vec(Σ), s = vec(S) and θ = vec(Θ). If A is m × n then there is a unique permutation matrix K_{mn}, called the commutation matrix, such that K_{mn} vec(A) = vec(A^T). Let J denote a D × D matrix of ones. For matrices L and U with the same dimensions, we write L ≤ U to mean that L_jk ≤ U_jk for all j, k.
The Frobenius norm of A is denoted by ||A||_F = (Σ_{j,k} A_jk²)^{1/2}, the operator norm by ||A|| = sup_{||x||=1} ||Ax|| and the max norm by ||A||_max = max_{j,k} |A_jk|. We let Φ denote the cdf of a standard Normal random variable. Recall that a random vector X is sub-Gaussian with constant ζ if E[exp(u^T(X − µ))] ≤ exp(ζ²||u||²/2) for all u, where µ = E(X). The smallest and largest eigenvalues of a matrix A are denoted by λ_min(A) and λ_max(A). We write a_n ≲ b_n to mean that there is some c > 0, not depending on n, such that a_n ≤ c b_n for all large n. We often use C to denote a generic positive constant.

Assumptions
In this section we discuss the assumptions we make and we also discuss some of the commonly used assumptions that we will not use.
The Assumptions. In the case where D < n we make the following assumptions:

(A3) λ_min(T) ≥ c_0 > 0, where T is the asymptotic covariance of √n(s − σ) and is given in Equation (23). Also assume that min_j γ_jj > 0, where γ, the asymptotic variance matrix of the sample partial correlations, is given in (31).
When D > n, we first perform a dimension reduction and then we assume (A1)-(A4) on the reduced problem. We remark that the sub-Gaussian assumption is stronger than needed and is made for simplicity.
Assumptions Not Made. Now we discuss the assumptions that are commonly made for this problem, but that we will not use.
(B1) Incoherence. A typical incoherence condition is

$$\big\|\Gamma_{S^c S}\,(\Gamma_{SS})^{-1}\big\|_\infty \leq 1 - \delta \quad \text{for some } \delta > 0,$$

where Γ = Σ ⊗ Σ, S is the set of pairs with edges between them and ||•||_∞ is the maximum absolute column sum. (There are other versions of this assumption but they are similar in character.)

(B2) Sparsity. The typical sparsity assumption is that the maximum degree d of the graph is o(√n).
(B3) Donut. It is assumed that each partial correlation is either 0 or is strictly larger, in absolute value, than a constant times √(log D/n), thus forbidding a donut around the origin.
Discussion. Assumptions (A1) and (A4) will hold if Y has thin enough tails. In fact, if the random variables are bounded or if we truncate the data, then (A1) and (A4) are guaranteed to hold. Assumptions (A2) and (A3) are, strictly speaking, not needed, but without them the confidence intervals could be very large. Both (A2) and (A3) are, in principle, testable using methods like those we develop in this paper (namely, by constructing bootstrap confidence intervals for the eigenvalues).
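For instance, a percentile bootstrap interval for the smallest eigenvalue of Σ, which could be used to probe an eigenvalue condition, might look as follows (a sketch; the resampling scheme and the percentile interval are our choices, not a construction from this paper):

```python
import numpy as np

def bootstrap_min_eigenvalue_ci(Y, alpha=0.1, B=1000, rng=None):
    """Percentile bootstrap interval for the smallest eigenvalue of Cov(Y)."""
    rng = np.random.default_rng(rng)
    n = Y.shape[0]
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        S_star = np.cov(Y[idx], rowvar=False)
        stats[b] = np.linalg.eigvalsh(S_star)[0]  # smallest eigenvalue
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```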
In contrast, (B1)-(B3) are quite strong. They may be reasonable in certain specialized cases. However, for routine data analysis, we regard these assumptions with some skepticism when D > n. They serve to guarantee that many high-dimensional methods will work, but seem unrealistic in practice. Moreover, the assumptions are very fragile. The incoherence assumption is especially troubling, although Ren et al. (2013) have been able to weaken it. The donut assumption ensures that non-zero partial correlations will be detected with high probability. To the best of our knowledge, (B1)-(B3) are not testable when D > n. Nor are we aware of a single paper that has provided evidence that they hold in real applications, with the exception of applications in engineering (signal processing) where the design is generated by the user.
Our goal is to develop methods that have confidence guarantees, with Berry-Esseen bounds, and that avoid these assumptions.

Lower bounds
Constructing a graph estimator for which (1) holds is easy: simply set Ĝ to be identically equal to the empty graph. Then Ĝ will never contain false edges.
But to have a useful estimator we also want non-trivial power to detect edges; equivalently, we want confidence intervals for the partial correlations whose width shrinks with increasing sample size. In this section we find lower bounds on the width of any confidence interval for partial correlations. This reveals constraints on the dimension D as a function of the sample size n. Specifically, we show that (without sparsity) one must have D < n to get consistent confidence intervals. This is not surprising, but we could not find explicit minimax lower bounds for estimating partial correlations, so we provide them here.
The problem of estimating a partial correlation is intimately related to the problem of estimating regression coefficients. Consider the usual regression model Y = β^T X + ǫ, where ǫ ∼ N(0, σ²) and where we take the intercept to be 0 for simplicity. (Normality is assumed only in this section.) Suppose we want a confidence interval for β_1.
We will need assumptions on the covariance matrix Σ of X = (X_1, . . ., X_D). Again, since we are interested in the weak assumption case, we do not want to impose strong assumptions on Σ. In particular, we do not want to rule out the case where the covariates are highly correlated. We do, however, want Σ to be invertible. Let S denote all symmetric D × D matrices and let

$$\mathcal{S}(a, A) = \big\{\Sigma \in \mathcal{S} : a \leq \lambda_{\min}(\Sigma) \leq \lambda_{\max}(\Sigma) \leq A\big\}$$

where 0 < a ≤ A < ∞. To summarize: Y = β^T X + ǫ where ǫ ∼ N(0, σ²), and Σ = Cov(X) ∈ S(a, A). Let P be all such distributions. A set-valued function C_n is a 1 − α confidence interval for β_1 if P^n(β_1 ∈ C_n) ≥ 1 − α for all P ∈ P. Let C_n denote all 1 − α confidence intervals and let W_n be the width of C_n.
Theorem 1. Assume that D ≤ n and that α < 1/5. Then, for all large n, the minimax expected width inf_{C_n ∈ C_n} sup_{P ∈ P} E_P(W_n) is bounded below as follows.

Proof. Let us write the model in vectorized form: Y = Xβ + ǫ. The proof has two stages. First we derive a lower bound that involves conditioning on X. In the second stage, we integrate over the marginal distribution of X to get an unconditional bound on the width W_n. Let M = N(0, Σ) with λ_min(Σ) ≥ a > 0. Without loss of generality, assume Σ^{-1}_{11} = 1. Let p_0(x, y) = p_0(y|x)m(x) and p_1(x, y) = p_1(y|x)m(x), where p_0(y|x) and p_1(y|x) will be specified later. Here, m(x) is the density of M. Let P_0 and P_1 be the distributions with densities p_0 and p_1, and let P denote a generic distribution in P. For any two real numbers r_0, r_1, we have that max{r_0, r_1} ≥ (r_0 + r_1)/2. Now we fix X = x ∈ R^{n×D} and lower bound inf_{C_n} max_{P_0,P_1} E_P(W_n | X = x). Assume that x^T x is invertible; since D ≤ n, this holds with probability one. Consider Equation (16) with the matrix X taken as fixed. Multiplying each term in the equation by (x^T x)^{-1} x^T, we can rewrite the equation in terms of the least squares projection, with β_0 = (0, b, . . ., b) and β_1 = (δ, b, . . ., b), which now defines P_0 and P_1. The (conditional) Kullback-Leibler distance between p_0(y|x) and p_1(y|x) then bounds the conditional total variation distance, denoted TV(x). So far, we have a bound on W_n which depends on x. Now we turn this into an unconditional bound by integrating over the marginal distribution of X.
Thus, for large enough n, ∫ TV(x) dM(x) ≤ 2α. Integrating over dM(x) yields the unconditional lower bound, completing the proof. Now we establish the analogous upper bound.
Theorem 2. Assume that D ≤ n and that 0 < α < 1.

Proof. The least squares estimator is β̂ = (X^T X)^{-1} X^T Y. For T ∼ χ²_D, we have the stated tail bound with probability 1 − α/2. Combining the results gives the bound on |Z_j|.

Now consider estimating a partial correlation corresponding to a covariance matrix Σ as in (12).

Theorem 3. Let θ be the partial correlation between two components of W, say, W_D and W_{D−1}. Let C_n be the set of 1 − α confidence intervals for θ. Assume that D ≤ n and that α < 1/5. Then the analogous lower bound holds.

Proof. Let b > 0 be a small positive constant. Let W = (W_1, . . ., W_D). For P_0 take q = 0 and for P_1 take q = δ. So, P_0 = N(0, Σ_0) and P_1 = N(0, Σ_1), say. Then Ω_1 = Σ_1^{-1} corresponds to a complete graph while Ω_0 = Σ_0^{-1} has a missing edge. Let us write W = (Y, X) where Y = W_1 and X = (W_2, . . ., W_D). We note that the marginal distribution of X is the same under P_0 and P_1. The conditional distribution of Y given X under P_j can be written Y = β_j^T X + ǫ, where β_0 = (0, b, . . ., b) and β_1 = (δ, b, . . ., b). The rest of the proof follows the proof of Theorem 1.
We conclude that without further assumptions (namely sparsity plus incoherence) we cannot make reliable inferences unless D ≤ n.
Remark 4. These lower bounds were computed under the assumption of Normality. This is good enough to show the dependence on dimension. However, it makes the minimax lower bound optimistic. When we develop the methods, we shall not assume Normality.

A finite sample method
For completeness, we give here a finite sample confidence interval that has length O(√(D/n)). However, the intervals do not work well in practice, and we explore asymptotic methods in the following section. In this section we suppose that |Y_i(j)| ≤ B for some finite constant B. First we recall the following result from Vershynin (2010).
Theorem 5 (Vershynin, 2010). There exists c_α, depending only on B, such that

$$P\Big(\|S - \Sigma\| \geq c_\alpha \sqrt{D/n}\Big) \leq \alpha.$$

Theorem 6. Let ∆_n be a constant built from c_α√(D/n) and λ̂, where λ̂ is the smallest eigenvalue of S.
Then, with probability at least 1 − α, Θ̲ ≤ Θ ≤ Θ̄, where Θ̄ = Θ̂ + ∆_n J and Θ̲ = Θ̂ − ∆_n J, and we recall that J is a D × D matrix of ones.
Proof. By the previous result, ||S − Σ|| ≤ c_α√(D/n) with probability at least 1 − α. From Horn and Johnson (1990), page 381, this perturbation bound on S transfers to a bound on ||S^{-1} − Σ^{-1}||, which therefore also holds with probability at least 1 − α. From Lemma 3 of Harris and Drton (2012), the entries of the partial correlation matrix are perturbed accordingly, which yields the stated bounds.

Despite the apparent optimal rate, in practice the confidence intervals are gigantic. Instead, we turn to asymptotic methods.

Increasing dimension and Berry-Esseen bounds
We call the case where D is increasing with n but smaller than n the moderate dimensional case. Here we derive confidence sets for the partial correlations. We deal with the high-dimensional case D > n in the next section.
Our goal is to show the accuracy of the delta method and the bootstrap. In particular, we develop new results on the delta method for multiple non-linear statistics with increasing dimension. The state-of-the-art delta method results are the papers by Pinelis and Molzon (2013) and Chen and Shao (2007), where, in particular, the former applies to the multivariate case. Rather than adapt those results, we instead develop a slightly different approach that leverages recent developments in high dimensional statistics. This allows us to develop a simultaneous delta method and bootstrap for multiple inference with increasing dimension. Throughout this section, we assume that D < n.

Preliminary definitions and results
Recall that σ = vec(Σ) and s = vec(S). Define the map g_j by θ_j = g_j(σ); we can write θ = G(σ), where G = (g_1, . . ., g_{D²}). The finite sample variance matrix T_n of √n(s − σ) is given by Boik and Haaland (2006); see (24), where K_{(D,D)} is the commutation matrix defined in (6).

Lemma 7. For all ǫ > 0, the following four inequalities hold, where ζ is defined in (9).
Proof. Using the sub-Gaussian property, we obtain the first inequality. The second result follows from the first since ||s − σ|| ≤ D ||s − σ||_∞. The third inequality follows from a standard inequality; see Lemma 2.2 of Devroye and Lugosi (2001), for example. For the fourth inequality, note that the absolute value |q_j| of each element of q has the stated form.

Lemma 8. Let Φ be the cdf of a standard Gaussian random variable, and let A and B be random variables. Then, for every ǫ > 0,

$$\sup_z |P(A + B \leq z) - \Phi(z)| \leq \sup_z |P(A \leq z) - \Phi(z)| + P(|B| > \epsilon) + \epsilon.$$

By a similar argument, the same bound holds with A − B in place of A + B. We need the following recent results on high-dimensional random vectors due to Chernozhukov, Chetverikov and Kato (2012).
Theorem 10 (Gaussian Anti-Concentration; CCK 2013). Let Z_1, . . ., Z_k be centered, not necessarily independent, Gaussian random variables. Then

$$\sup_{x \in \mathbb{R}} P\Big(\Big|\max_{1 \leq j \leq k} Z_j - x\Big| \leq \epsilon\Big) \leq C\epsilon\sqrt{1 \vee \log(k/\epsilon)}$$

where C depends only on max_j Var(Z_j) and min_j Var(Z_j).
An immediate corollary of this result is the following.
Lemma 11. Let Z ∼ N(0, Σ) where Z ∈ R^k. There exists c > 0, depending only on max_j Σ_jj and min_j Σ_jj but not on k, such that, for every ǫ > 0,

$$\sup_{z} P\big(z \leq \max_j Z_j \leq z + \epsilon\big) \leq c\,\epsilon\sqrt{1 \vee \log(k/\epsilon)}.$$

Proof. Let Y = max_j Z_j. Then P(z ≤ Y ≤ z + ǫ) ≤ P(|Y − (z + ǫ/2)| ≤ ǫ), and the right hand side is bounded by precisely the previous anti-concentration inequality.
Remark 12. A union bound would have given a bound of order kǫ instead of ǫ√(log(k/ǫ)). Lemma 11 leads to much sharper bounds in our delta method and bootstrap bounds.
Theorem 13 (Gaussian Comparison; CCK 2013). Let Z ∼ N(0, Σ_Z) and Y ∼ N(0, Σ_Y) be centered Gaussian vectors in R^k and let ∆ = max_{j,m} |Σ_Z(j, m) − Σ_Y(j, m)|. Then

$$\sup_{z} \Big|P\Big(\max_j Z_j \leq z\Big) - P\Big(\max_j Y_j \leq z\Big)\Big| \leq C\,\Delta^{1/3}\big(1 \vee \log(k/\Delta)\big)^{2/3}$$

where C is only a function of max_j Σ_Y(j, j) and min_j Σ_Y(j, j).

Berry-Esseen bounds for high-dimensional delta method
It follows from Lemma 7 that, for large enough C, P(s ∉ B) ≤ 1/n². We assume throughout the analysis that s ∈ B, as the error this incurs is of smaller order than the rest of the error terms. Let Θ and Θ̂ be the matrix of partial correlations and the matrix of estimated partial correlations, and let θ = vec(Θ) and θ̂ = vec(Θ̂). By Taylor expansion and (25),

$$\hat\theta - \theta = L(s - \sigma) + R \qquad (30)$$

where L = dvec(G)/dσ^T, so that L is the D² × D² matrix whose j-th row is ℓ_j ≡ ℓ_j(σ) ≡ dg_j(σ)/dσ^T. Similarly, R = (R_1, . . ., R_{D²})^T, where R_j = ½ δ^T H_j δ and H_j is the Hessian of g_j, evaluated at some point between s and σ. Let Z_j = √n(θ̂_j − θ_j)/e_j be the normalized estimate, where e_j = γ(j, j)^{1/2} with γ(j, j) = ℓ_j(σ)^T T(σ) ℓ_j(σ), and T(σ) is defined in (23). The covariance matrix of Z = (Z_1, . . ., Z_{D²}) is Γ. Note that Γ_jj = 1 for all j.
Theorem 14. Let W ∼ N(0, Γ) where W ∈ R^{D²}, and let γ_n and ξ_n be defined in terms of the Hessians H_j and the gradients ℓ_j, where T_n is defined in (24). Then the distribution of max_j Z_j is approximated by that of max_j W_j up to an explicit error term, where we recall that ζ is defined in (9). Hence, if z_α ≡ −Φ^{-1}(α/D²), then the intervals θ̂_j ± z_α e_j/√n are simultaneously valid up to the same error.

Remark 15. In the above result, the dimension enters mainly through the terms γ_n and ξ_n. Except for these terms, the dependence on D is only logarithmic. We discuss these terms in Section 6.5.
Proof. By (30), √n(θ̂ − θ) is a linear term plus a remainder. For the linear term we used Theorem 9 applied to V* = γ^{-1/2}LV together with Lemma 11. Recall that s ∈ B except on a set of probability 1/n², and on this set the remainder is controlled by Lemma 7.
Using Hölder's inequality, |R_j| ≤ ½ |||H_j||| ||s − σ||²_∞. Hence, using Lemma 7, the remainder is negligible at the stated rate. The result follows by computing a similar lower bound and taking the supremum over z. For the last statement, note that W_j ∼ N(0, 1), so a union bound yields the threshold z_α.

In practice we need to use T_j = √n(θ̂_j − θ_j)/ê_j, where ê_j = ℓ_j(s)^T T(s) ℓ_j(s) ≡ U_j(s) is the estimated standard error. We have the following result for this case.
Theorem 16. Define γ_n and ξ_n as in the previous theorem and let U_j(a) = ℓ_j^T(a) T(a) ℓ_j(a). Then the Gaussian approximation holds for the studentized statistics T_j with an additional error term.

Proof. Let E = {max_j e_j/ê_j < 1 + ǫ} and F = {max_j Z_j < u/ǫ}, where ǫ = (4ρ_n/ζ) log n/(nζ²) and u = ǫ log(n). Note that e_j − ê_j = U_j(σ) − U_j(s) = (σ − s)^T U′_j, where U′_j is the gradient of U_j evaluated at some point between s and σ. Then, for 0 < ǫ ≤ 1, the difference between the studentized and the normalized statistics is controlled on E ∩ F, where A_n is defined in (33). A similar lower bound completes the proof.
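To make the studentized delta method concrete, here is a self-contained numerical sketch for a single partial correlation. It is not the paper's exact construction: the analytic gradient ℓ_j and the matrix T of (23) are replaced by a numerical gradient and an empirical fourth-moment estimate of the covariance of vec(S).

```python
import numpy as np
from statistics import NormalDist

def partial_corr_from_cov(S, j, k):
    """The map g: covariance matrix -> (j, k) partial correlation."""
    Omega = np.linalg.inv(S)
    return -Omega[j, k] / np.sqrt(Omega[j, j] * Omega[k, k])

def delta_method_ci(Y, j, k, alpha=0.1, h=1e-6):
    """Plug-in delta-method interval for theta_jk (illustrative sketch)."""
    n, D = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / n
    theta_hat = partial_corr_from_cov(S, j, k)

    # V_i = vec((Y_i - Ybar)(Y_i - Ybar)^T); its covariance estimates n * Var(vec(S))
    V = (Yc[:, :, None] * Yc[:, None, :]).reshape(n, D * D)
    T_hat = np.cov(V, rowvar=False)

    # numerical gradient of g with respect to each coordinate of vec(S)
    grad = np.zeros(D * D)
    for a in range(D):
        for b in range(D):
            E = np.zeros((D, D))
            E[a, b] = h
            grad[a * D + b] = (partial_corr_from_cov(S + E, j, k) - theta_hat) / h

    se = np.sqrt(grad @ T_hat @ grad / n)      # estimated standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)    # Normal quantile
    return theta_hat - z * se, theta_hat + z * se
```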

The bootstrap
In this section we assume that max_j |Y(j)| ≤ B for some B < ∞. This is not necessary, but it simplifies the proofs. We do not require that B be known. Let Y*_1, . . ., Y*_n be a sample from the empirical distribution P_n and let s* be the corresponding (vectorized) sample covariance. Now let θ̂* be the partial correlations computed from Y*_1, . . ., Y*_n ∼ P_n. The (un-normalized) bootstrap rectangle for θ is

$$R = \Big\{\theta : \max_j |\theta_j - \hat\theta_j| \leq \frac{Z_\alpha}{\sqrt n}\Big\}$$

where Z_α = F̂^{-1}(1 − α) and

$$\hat F(z) = P\Big(\sqrt n\, \max_j |\hat\theta^*_j - \hat\theta_j| \leq z \mid Y_1, \ldots, Y_n\Big)$$

is the bootstrap approximation to

$$F(z) = P\Big(\sqrt n\, \max_j |\hat\theta_j - \theta_j| \leq z\Big).$$

The accuracy of the coverage of the bootstrap rectangle depends on sup_z |F̂(z) − F(z)|. Let Z ∼ N(0, Γ) where Z ∈ R^{D²}. First we need the following limit theorem for the un-normalized statistics.
Theorem 17. Define γ′_n = max_j sup_{a∈B} |||H_j(a)||| and ξ′_n = max_j sup_{a∈B} ||ℓ_j(a)||_1. Then the Gaussian approximation of Theorem 14 holds for the un-normalized statistics, with γ′_n and ξ′_n in place of γ_n and ξ_n.

Proof. The proof is the same as the proof of Theorem 14 with γ′_n and ξ′_n replacing γ_n and ξ_n. Now we bound sup_z |F̂(z) − F(z)|.
This is the content of Theorem 18. We decompose sup_z |F̂(z) − F(z)| into three terms I, II and III. The previous theorem bounds I. For II, we proceed exactly as in the proof of the previous theorem but with P_n replacing P (and with Y_1, . . ., Y_n fixed); arguing as in the proof of Theorem 14, we conclude that II is of the same order. For III, we use Theorem 13, which bounds III in terms of ∆ = max_{s,t} |Γ(s, t) − Γ_n(s, t)|. Each element of Γ_n is a sample moment and the corresponding element of Γ is the population moment, and so, since P_n is sub-Gaussian, ∆ = O_P(√(log D/n)). Hence III = O_P((log D/n)^{1/6}).
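A sketch of the bootstrap rectangle (ours; it reuses sample_partial_correlations from the sketch in Section 2):

```python
import numpy as np

def bootstrap_rectangle(Y, alpha=0.1, B=500, rng=None):
    """Simultaneous bootstrap rectangle for all partial correlations.

    Z_alpha is the (1 - alpha) quantile of the bootstrap distribution of
    sqrt(n) * max_jk |theta*_jk - theta_hat_jk|; the rectangle is
    theta_hat +/- Z_alpha / sqrt(n), simultaneously over all (j, k).
    """
    rng = np.random.default_rng(rng)
    n = Y.shape[0]
    theta_hat = sample_partial_correlations(Y)
    stats = np.empty(B)
    for b in range(B):
        Y_star = Y[rng.integers(0, n, size=n)]          # resample rows
        theta_star = sample_partial_correlations(Y_star)
        stats[b] = np.sqrt(n) * np.abs(theta_star - theta_hat).max()
    half_width = np.quantile(stats, 1 - alpha) / np.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```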

A super-accurate bootstrap
Now we describe a modified approach to the bootstrap that has coverage error only O(log D/n^{1/8}), which is much more accurate than the usual bootstrap described in the last section. The idea is very simple. Let R be the 1 − α bootstrap confidence rectangle for σ described in Section 7.1. Write θ = G(σ) and define

$$T = \{G(\sigma) : \sigma \in R\}.$$

By construction, T inherits the coverage properties of R, and so we have immediately:

Corollary 19. P(θ ∈ T) ≥ 1 − α − O(log D/n^{1/8}).
The set T then defines confidence sets for each θ_j, namely the intervals [min{g_j(σ) : σ ∈ R}, max{g_j(σ) : σ ∈ R}]. We should stress that, in general, obtaining a confidence set by mapping a confidence rectangle can lead to wide intervals. However, our foremost concern in this paper is coverage accuracy.
Constructing the set T exactly can be difficult, but it is easy to get an approximation. We draw a large sample σ_1, . . ., σ_N from a uniform distribution on the rectangle R. Now let θ̲_j = min_{1≤s≤N} g_j(σ_s) and θ̄_j = max_{1≤s≤N} g_j(σ_s).
Then [θ̲_j, θ̄_j] approximates the confidence interval for θ_j. Alternatively, we take σ_1, . . ., σ_N to be the bootstrap replications that are contained in R. Note that there is no need for a multiple comparison correction, as the original confidence rectangle is a simultaneous confidence set.
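A crude Monte Carlo sketch of this approximation (the uniform sampling and the rejection of non positive definite draws are our implementation choices):

```python
import numpy as np

def map_rectangle_through_G(S_lo, S_hi, N=10000, rng=None):
    """Monte Carlo approximation of T = {G(sigma) : sigma in R}.

    Draws covariance matrices uniformly from the rectangle [S_lo, S_hi],
    maps each positive definite draw to its partial correlations, and
    records entrywise minima and maxima.
    """
    rng = np.random.default_rng(rng)
    D = S_lo.shape[0]
    lo = np.full((D, D), np.inf)
    hi = np.full((D, D), -np.inf)
    for _ in range(N):
        Sig = rng.uniform(S_lo, S_hi)       # entrywise uniform draw
        Sig = (Sig + Sig.T) / 2             # symmetrize
        if np.linalg.eigvalsh(Sig)[0] <= 0:
            continue                        # skip draws outside the PD cone
        Omega = np.linalg.inv(Sig)
        d = np.sqrt(np.diag(Omega))
        Theta = -Omega / np.outer(d, d)
        lo = np.minimum(lo, Theta)
        hi = np.maximum(hi, Theta)
    return lo, hi                           # [lo_jk, hi_jk] approximates each interval
```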

Comments on the error terms
The accuracy of the delta method depends on the dimension D mainly through the terms γ_n, ξ_n and ρ_n. Similarly, the accuracy of the (first version of the) bootstrap depends on γ′_n and ξ′_n. In this section we look at the size of these terms, focusing on γ′_n and ξ′_n. Recall that ℓ_j = dθ_j/dσ^T. Let (s, t) be such that θ_j = Θ_st, and define (J, K, M) by σ_J = Σ_ss, σ_K = Σ_tt and σ_M = Σ_st. Then ℓ_j can be written in terms of [A]_j, the j-th row of the relevant derivative matrix, and a sparse vector f_j that is 0 except for three entries. The Hessian H_j = (dℓ_1/dσ^T, . . .) can be computed using the derivative of vec(Ω ⊗ Ω); see, for example, p. 185 of Magnus and Neudecker (1988). Note that ||f_j||_0 = O(1), independent of D. The presence of this sparse vector helps to prevent the gradient and Hessian from getting too large. By direct examination of ℓ_j and H_j we see that the size of γ′_n and ξ′_n depends on how dense Ω is. In particular, when Ω is diagonally dominant, γ′_n and ξ′_n are both O(1), and in this case the error terms have size O((log D)/n^{1/8}). However, if Ω is dense, then ||ℓ_j||_1 can be of order O(D²) and |||H_j||| can be of order O(D⁴); in this case the error can be as large as D⁴/n^{1/8}. On the other hand, the bootstrap in Section 6.4 always has accuracy O((log D)/n^{1/8}), but the length of the intervals could be large when Ω is dense. And note that even in the favorable case, we still require D < n for the results to hold. (We conjecture that this can be relaxed by using shrinkage methods as in Schäfer et al. (2005).) These observations motivate the methods in the next section, which avoid direct inferences about the partial correlation graph in the high-dimensional case.
It is interesting to compare the size of the errors to other work on inference with increasing dimension. For example, Portnoy (1988) gets accuracy D^{3/2}/n for maximum likelihood estimators in exponential families and Mammen (1993) gets accuracy D²/n for the bootstrap for linear models.

Back to graphs
Finally, we can use the above methods for estimating a graph with confidence guarantees. We put an edge between j and k only if 0 is excluded from the confidence interval for θ_jk. The desired guarantee stated in (1) then holds.
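Given simultaneous intervals (lo, hi) from either the delta method or the bootstrap, the edge rule is immediate; a two-line sketch:

```python
def graph_from_intervals(lo, hi):
    """Edge (j, k) is included only if 0 is excluded from the interval for theta_jk."""
    D = lo.shape[0]
    return [(j, k) for j in range(D) for k in range(j + 1, D)
            if lo[j, k] > 0 or hi[j, k] < 0]
```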

The high dimensional case
Now we consider the case where D > n. We present three methods for dealing with the high-dimensional case.

(B1) Correlation graphs. This is a common technique in biostatistics. We connect two nodes if the confidence interval for the correlation between the two variables excludes [−ǫ, ǫ] for some threshold ǫ ∈ [0, 1]. Our contribution here is to provide confidence guarantees using the bootstrap that are valid as long as D = o(e^{n^{1/7}}). In this paper we use ǫ = 0.

(B2) Cluster graphs. We cluster the features and average the features within each cluster. These averaged features define a (dimension-reduced) set of new features. As long as the number of clusters L is o(n) we get valid inferences. In this case, each node corresponds to the average of the variables in a cluster and the edges correspond to conditional independence statements about the new features.

(B3) Restricted graphs. Define the restricted partial correlation

$$\tilde\theta_{jk} = \max_{S} \big|\theta\big(Y(j), Y(k) \mid Y(S)\big)\big|$$

where L is some fixed number, θ(Y(j), Y(k)|Y(S)) is the partial correlation between Y(j) and Y(k) given the set of variables Y(S), and S varies over all subsets of {1, . . ., D} − {j, k} of size L. These are sometimes called lower-order partial correlations. Now construct a graph based on the restricted partial correlations. Note that L = 0 gives a correlation graph and L = D − 2 gives the usual partial correlation graph. (This is similar to the idea in Castelo and Roverato, 2006.) The bootstrap leads to valid inferences requiring only D = o(e^{n^{1/7}}).
Remark 20. Following Schäfer et al. (2005), we could estimate U = (1 − λ)Σ + λT, where T is, for example, a diagonal matrix. The graph is constructed from the biased partial correlations corresponding to U^{-1}. When λ is close to 1, high-dimensional asymptotic confidence intervals have accurate coverage. Thus we have a bias-validity tradeoff. Investigating this tradeoff is quite involved, so we will examine this method elsewhere.
In this section we make assumptions (A1)-(A4), applied to the dimension-reduced problem, as discussed in Section 3.
The proofs of the results in this section are similar to those in Section 6 but they are easier, as the error terms are, by design, not dependent on dimension-sensitive quantities like γ_n and ξ_n. Because of this, we shall only present proof outlines.

Correlation graphs
The simplest approach to constructing graphs is to use correlations or covariances rather than partial correlations. Let ρ_jk denote the correlation between Y(j) and Y(k). The true graph G_ǫ connects j and k if |ρ_jk| > ǫ, where 0 ≤ ǫ ≤ 1 is some user-specified threshold. The algorithm is in Figure 1. Of course, we can use either ρ or σ; we get the same graph from either.
Theorem 21. Let r_jk denote the sample correlation between Y(j) and Y(k) and let r be the D² × 1 vector of sample correlations. Similarly, let ρ be the vector of true correlations. Define Z_α by the bootstrap equation

$$\hat F(Z_\alpha) = 1 - \alpha, \qquad \hat F(z) = P\big(\sqrt n\,\|r^* - r\|_\infty \leq z \mid Y_1, \ldots, Y_n\big),$$

where r* is the vector of correlations computed from a bootstrap sample. Then the rectangle with sides r_jk ± Z_α/√n covers ρ with probability at least 1 − α, up to the bootstrap error.

Remark 22. A very refined Berry-Esseen result for a single correlation was obtained by Pinelis and Molzon (2013).
Proof Outline. The proof is the same as the proof of Theorem 18. However, in this case, it is easy to see that γ′_n and ξ′_n are O(1), independent of D, since the gradient ℓ_j and Hessian H_j are functions only of the bivariate distribution of (Y(s), Y(t)) corresponding to the correlation.
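A sketch of the resulting procedure (ours), combining the bootstrap rectangle for the correlations with the thresholding rule (B1):

```python
import numpy as np

def correlation_graph(Y, eps=0.0, alpha=0.1, B=500, rng=None):
    """Bootstrap correlation graph: connect (j, k) when the simultaneous
    confidence interval for rho_jk excludes [-eps, eps]."""
    rng = np.random.default_rng(rng)
    n, D = Y.shape
    r_hat = np.corrcoef(Y, rowvar=False)
    stats = np.empty(B)
    for b in range(B):
        r_star = np.corrcoef(Y[rng.integers(0, n, size=n)], rowvar=False)
        stats[b] = np.sqrt(n) * np.abs(r_star - r_hat).max()
    half_width = np.quantile(stats, 1 - alpha) / np.sqrt(n)
    lo, hi = r_hat - half_width, r_hat + half_width
    return [(j, k) for j in range(D) for k in range(j + 1, D)
            if lo[j, k] > eps or hi[j, k] < -eps]
```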

Cluster graphs
The idea here is to partition the features into clusters, average the features within each cluster and then form the graph for the new derived features.If the clusters are sufficiently few, then valid inference is possible.
There are many clustering methods. Here we consider choosing a set of representative features, or prototypes, using the L-centers algorithm, which we describe below. Then we assign each feature to its nearest center. We average the features within each cluster and then find the undirected graph of these new L derived features. Let G be the graph for these new features. We estimate G using confidence intervals for the partial correlations. Note that the graph G as well as the estimated graph Ĝ are both random.
To ensure the validity of the confidence intervals, we use data splitting. We split the data randomly into two halves. The first half is used for clustering. The confidence intervals are constructed from the second half of the data.
The cluster-graph algorithm is described in Figure 2. It is assumed in the algorithm that the number of features L = o(n) is specified by the user. An improvement would be to use a data-driven approach to choosing L. We leave this to future work.
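A sketch of steps (B3)(a)-(d) of Figure 2 (the distance d(i, j) is left generic in the algorithm; one minus absolute correlation below is our choice):

```python
import numpy as np

def l_centers(X, L, rng=None):
    """Farthest-point selection of L proto-features, step (B3) of Figure 2.

    X is n x D. Returns the prototype indices and each feature's cluster label.
    """
    rng = np.random.default_rng(rng)
    D = X.shape[1]
    dist = 1 - np.abs(np.corrcoef(X, rowvar=False))  # feature-feature distances
    centers = [int(rng.integers(D))]                 # (a) random initial feature
    while len(centers) < L:                          # (b) repeat until L centers
        d_min = dist[:, centers].min(axis=1)         # i. distance to nearest center
        centers.append(int(d_min.argmax()))          # ii. farthest feature joins S
    labels = dist[:, centers].argmin(axis=1)         # (c) assign to closest center
    return centers, labels

def cluster_averages(X, labels, L):
    """(d) Average the features within each cluster to get n x L new features."""
    return np.column_stack([X[:, labels == l].mean(axis=1) for l in range(L)])
```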
The asymptotic validity of the method follows from the results in Section 6 together with the data-splitting step. Without the data-splitting step, the proofs in Section 6 would not be valid, since the feature selection process would introduce a bias. The independence introduced by the splitting thus seems critical. Whether it is possible to eliminate the data splitting is an open problem. Let us state, without proof, the validity result assuming the bootstrap is used. A similar result holds for the delta method.
Theorem 23. Let θ be the vector of partial correlations for the features selected from the first half of the data. Let R be the confidence rectangle constructed from the second half of the data. Then the coverage bound of Theorem 18 holds, where γ′_n and ξ′_n are functions of the distribution of the selected features.

Another possibility is as follows. For each (j, k), let Z_jk be a dimension reduction of the variables (Y(s) : s ≠ j, k). Then we could estimate the partial correlation of Y(j) and Y(k) given Z_jk. This would require a separate dimension reduction step for each pair (j, k).

Restricted partial correlations
Instead of building a graph from partial correlations, we can use a weaker measure of dependence. Motivated by Castelo and Roverato (2006), we define the restricted partial correlation θ̃_jk as in (B3) above. For L = 0 we get a correlation graph. For L = D − 2 we get back the usual partial correlation graph. By choosing 0 < L = o(n) we get something in between these two cases while still retaining validity of the confidence intervals. The estimate of θ̃_jk is the sample version.

Theorem 24. Define Z_α by the bootstrap equation analogous to that of Theorem 21, applied to the restricted partial correlations. Then the resulting bootstrap rectangle has valid coverage.
The proof is basically the same as the proof of Theorem 21.We remark, however, that in this case, L has to be fixed and chosen in advance.
We think that the restricted partial correlation idea is very promising, but currently we have no efficient way to compute the graph this way. To compute the restricted partial correlation we would need to do the following: for each pair (j, k) we have to search over the $\binom{D-2}{L}$ subsets and find the maximum. This is repeated for all $\binom{D}{2}$ pairs. Then the entire procedure needs to be bootstrapped. Despite the fact that the method is currently not computationally feasible, we include it because we believe that it may be possible in the future to find efficient computational approximations.
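For completeness, a naive sketch of the computation for a single pair, which makes the combinatorial cost explicit:

```python
import numpy as np
from itertools import combinations

def restricted_partial_corr(Y, j, k, L):
    """Naive restricted partial correlation for one pair (j, k): the maximum
    |partial correlation of (j, k) given S| over all subsets S of size L.
    Cost is binom(D - 2, L) matrix inversions, so this is only feasible
    for very small D and L."""
    D = Y.shape[1]
    others = [m for m in range(D) if m not in (j, k)]
    best = 0.0
    for S in combinations(others, L):
        idx = [j, k] + list(S)
        # the partial correlation of the first two coordinates given the rest
        Omega = np.linalg.inv(np.cov(Y[:, idx], rowvar=False))
        theta = -Omega[0, 1] / np.sqrt(Omega[0, 0] * Omega[1, 1])
        best = max(best, abs(theta))
    return best
```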

Experiments
In this section we illustrate the methods with some simple examples. We consider three models: a dense model, a Markov model and an SEM model.

[Figure: Bootstrap-based undirected graph for the dense model with α = .9, a = .9, n = 100 and dimensions 20, 30, 40, 50.]
[Figure: Bootstrap-based undirected graph for the Markov model with α = .9, a = .9, n = 100 and dimensions 20, 30, 40, 50.]

The purpose of the experiments is to get some intuitive sense of how much information in the original graph is captured in the dimension-reduced graph.
In each case we show results for the bootstrap. We stopped when the results became numerically unstable. Then we increased the dimension and switched to the high dimensional methods, namely, the cluster graphs, the correlation graphs and the restricted graphs. (We do not include the block graphs, which did not work well.) The results are in Figures 3, 4, 5, 6, 7 and 8.
The results for the dense model are good up to D = 50. After that, the cluster graph method is used and it clearly captures the qualitative features of the graph. Validity holds as D increases but the power starts to decrease, leading to missing edges. The cluster graph is interesting here as it obviously cannot reconstruct the Markov structure but still captures interesting qualitative features of the underlying graph. The SEM model is difficult; it is a complete graph but some edges are harder to detect. The power again falls off as D increases.

[Figure: Cluster graph for the Markov model with α = .9, a = .9, n = 100, dimensions 70, 80, 90, 100 and L = 20.]

Again we see that the cluster graph loses information but permits us to find a graph with qualitative features similar to the true graph in higher dimensions. The correlation graph for the dense and SEM models, while preserving validity, has essentially no power. More precisely, the graphical model leaves a very small imprint in the correlation matrix. For example, the covariance in the dense model is easily seen to be O(a/D). So while the inverse covariance matrix is dense, the covariance matrix has small entries.

[Figure: Correlation graph for the Markov model with α = .9, a = .9, n = 100, dimensions 70, 80, 90, 100 and L = 20.]

The correlation graph for the Markov model does contain useful information, as shown in Figure 9. Of course, there are extra edges due to the induced correlations. Nevertheless, most of the essential structure is apparent. We also considered the behavior of the correlation graph for a few other models. Figure 10 shows the correlation graph for a null model, a dense covariance matrix, a four-block model and a partial Markov chain (10 edges). In each case, n = 100 and D = 12. Figure 11 shows the same models but with D = 200. For these models the method does very well even with D > n. As mentioned earlier, the restricted partial correlation graph is so computationally intensive that it is not yet practical. We believe the method is promising, which is why we have included it in the paper, but at this point we do not have numerical experiments.
Finally, as a sanity check, we examined the coverage of the bootstrap for two models: the null model (no edges) and the Markov model. We declare an error if there is even a single wrong edge. Using α = .10 and n = 100, the error rates are well under α. Indeed, we see that the coverage is conservative, as we would expect.
We re-ran all the experiments with α = 0.05 and α = 0.01. As expected, with smaller α we tend to get slightly sparser graphs. However, the differences were quite small. The cluster graph for the SEM model showed the greatest sensitivity to the choice of α. The others were quite insensitive.

Conclusion
We have described methods for inferring graphs that use weak assumptions and that have confidence guarantees. Our methods are atavistic: we use very traditional ideas that have been swept aside in light of the newer sparsity-based approaches. We do not mean in any way to criticize sparsity-based methods, which we find fascinating. But our main message is that the older methods still have a role to play, especially if we want methods that use weaker assumptions.
There are several open problems that we will address in the future. We briefly describe a few here. First, we do not have any theory to characterize how the original graph relates to the graph of the dimension-reduced problem. It would be useful to have some general theory that shows which features of the original graph are preserved.
Perhaps the most important extension is to go beyond linear measures of dependence. Following Bergsma (2011), write Y = g(X) + ǫ_Y and Z = h(X) + ǫ_Z and define the nonparametric partial correlation to be the correlation between the residuals ǫ_Y and ǫ_Z.
Bergsma establishes bounds involving constants q_1, q_2 > 0 for this quantity. One can then extend the techniques in this paper to get confidence measures. Other problems for future development are the development of computationally efficient methods for computing the restricted partial correlation graph, and the extension of our theory to shrinkage graphs.
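A sketch of this residual-based quantity (the k-nearest-neighbor smoother for g and h, and the restriction to scalar x, are our simplifications):

```python
import numpy as np

def nonparametric_partial_corr(y, z, x, k=10):
    """Bergsma-style nonparametric partial correlation: regress y and z on x
    with a k-nearest-neighbor smoother and correlate the residuals."""
    n = len(x)
    g_hat = np.empty(n)
    h_hat = np.empty(n)
    for i in range(n):
        nn = np.argsort(np.abs(x - x[i]))[:k]   # k nearest neighbors of x[i]
        g_hat[i] = y[nn].mean()
        h_hat[i] = z[nn].mean()
    return np.corrcoef(y - g_hat, z - h_hat)[0, 1]
```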

Appendix: Alternative delta method
If one is only interested in a single partial correlation, then one can use a Taylor series together with the Berry-Esseen theorem. We provide this analysis here. At the end, we can turn this into a joint confidence set for all partial correlations using the union bound, but this leads to a larger error than our earlier analysis. So the main interest of this section is single partial correlations. Note that Var(V_i) = s²_jk. Let Z ∼ N(0, 1).

Proof. Let E = {ŝ_jk/s_jk > 1 + ǫ} and F = {T_jk > u/ǫ}, where ǫ = (ρ_n/n) log(nD²) and u = ǫ log(n). Note that ŝ_jk − s_jk = U(σ) − U(s) = (σ − s)^T Q′, where Q′ is the gradient of Q evaluated at some point between s and σ. Then, for 0 < ǫ ≤ 1, the bound follows, where H*_j is the Hessian of g_j evaluated at a point between s and s*. Since all the Y_i's are contained in the bounded rectangle B × • • • × B, it follows that under the empirical measure P_n, Y*_i is sub-Gaussian with ζ = B. It then follows that s* ∈ B except on a set of probability at most 1/n. Choosing

Figure 2: The cluster-graph algorithm.

(B1) Choose L = o(n).
(B2) Randomly split the data into two halves D_1 and D_2.
(B3) Using D_1, select L proto-features:
  (a) Choose a feature j randomly and set S = {j} and C = {1, . . ., D} − S.
  (b) Repeat until S has L elements:
    i. For each j ∈ C compute the minimum distance d_j = min_{i∈S} d(i, j).
    ii. Find j ∈ C to maximize d_j. Move j from C to S.
  (c) Form L clusters by assigning each feature to its closest center.
  (d) Average the features within each cluster.
(B4) Using D_2, construct a confidence graph for the L new features using either the delta method or the bootstrap from Section 6.
(B5) (Optional) Construct a correlation graph for the features within each cluster.