An Oracle Approach for Interaction Neighborhood Estimation in Random Fields

We consider the problem of interaction neighborhood estimation from the partial observation of a finite number of realizations of a random field. We introduce a model selection rule to choose estimators of conditional probabilities among natural candidates. Our main result is an oracle inequality satisfied by the resulting estimator. We use then this selection rule in a two-step procedure to evaluate the interacting neighborhoods. The selection rule selects a small prior set of possible interacting points and a cutting step remove from this prior set the irrelevant points. We also prove that the Ising models satisfy the assumptions of the main theorems, without restrictions on the temperature, on the structure of the interacting graph or on the range of the interactions. It provides therefore a large class of applications for our results. We give a computationally efficient procedure in these models. We finally show the practical efficiency of our approach in a simulation study.


Introduction
Graphical models, also known as random fields, are used in a variety of domains, including computer vision [4,21], image processing [9], neuroscience [19], and as a general model in spatial statistics [18].The main motivation for our work comes from neuroscience where the advancement of multichannel and optical technology enabled the scientists to study not only a unit of neurons per time, but tens to thousands of neurons simultaneously [20].The very important question now in neuroscience is to understand how the neurons in this ensemble interact with each other and how this is related to the animal behavior [8,19].This question turns out to be hard for three reasons at least.First, the experimenter has always only access to a small part of the neural system.Moreover, there is no really good model for population of neurons in spite of the good models available for single neurons.Finally, strong long range interactions exist [15].Our work tries to overcome some of these difficulties as will be shown.
A random field can be specified by a discrete set of sites G, possibly infinite, a finite alphabet of spins A, and a probability measure P on the set of configurations X (G) = A G .One of the objects of interest are the one-point specification probabilities, defined for all sites i in G and all configurations x in X (G) by a regular version of the conditional probability From a statistical point of view, two problems are of natural interest.

Interaction neighborhood identification problem (INI):
The INI problem is to identify, for all sites i in G, the minimal subset G i of G necessary to describe the specification probabilities in site i (see Sections 2 and 3 for details).G i is called the interaction neighborhood of i and the points in G i are said to interact with i. G i is not necessarily finite but only a finite subset V M ⊂ G of sites is observed.The observation set is a sample X 1:n (V M ) = (X 1 (j), ..., X n (j)) j∈VM , where (X 1 , ..., X n ) are i.i.d with common law P .The question is then to recover from X 1:n (V M ), for all i in V M , the sets

Oracle neighborhood problem (ON):
The ON problem is to identify, for all i in G, a set Ĝi = Ĝi (X 1:n (V M )), such that the estimation of the conditional probabilities P (x(i)|x(j), j ∈ G/{i}) by the empirical conditional probabilities P (x(i)|x(j), j ∈ Ĝi ) has a minimal risk (see Sections 2 and 3 for details).Ĝi is then said to satisfy an oracle inequality and it is also called oracle.We look for oracles among the subsets of V M and we consider the L ∞ -distance between conditional probabilities to measure the risk of the estimators.An oracle is in general smaller than G i because it should balance approximation properties and parsimony.
The literature has mainly been focused in the INI problem, see [3,7,10,11,17] for examples.It requires in general strong assumptions on P to be solved.For example, the ℓ 1 -penalization procedure proposed in [17] requires an incoherence assumption on the interaction neighborhoods that is very restrictive, as shown by [3].Moreover, it is assumed in [3,7,17] that G is finite and that all the sites are observed, i.e.V M = G.Csiszar and Talata [10] consider the case when G = Z d but assume a uniform bound on the cardinality of G i .The procedure proposed in [11] holds for infinite graph with each site having infinite neighborhoods, but requires that the main interactions belong to a known neighborhood of i of order O(ln n).Moreover, the result is proved in the Ising model only when the interaction is sufficiently weak.
The first goal of this paper is to show that the ON problem can be solved without any of these hypotheses.We introduce in Section 3.2 a model selection criterion to choose a model Ĝi and prove that it is an oracle in Theorem 3.2.This result does not require any assumption on the structure of the interaction neighborhoods inside or outside V M .
The second objective is to show that a selection rule provides also a useful tool to handle the INI problem.We introduce the following two steps procedure.First, we select, for all sites i in V M , a small subset Vi of V M with the model selection rule.We prove that this set contains the main interacting points inside V M with large probability.Following the idea introduced in [11], we use then a test to remove from Vi the points of (G/G i ) ∩ Vi .The new test can be applied to all neighborhoods V i that are smaller than O(ln n) and that contain the main interaction points in G i .It requires less restrictive assumptions on the interactions outside V i and on the measure P than the one of [11].For example, it works in the Ising models without restrictions on the temperature parameter.Furthermore, the two-step method let us look for the interacting points inside all the observation set V M (of order O(e n β ) for some 0 ≤ β < 1), and not only inside a prior subset V i (smaller than O(ln n)) of V M .
All the results hold under a key assumption H1 that is not classical, but that is satisfied by Ising models, see Theorem 4.5.We obtain then a large class of models, widely used in practice, where our methods are efficient.We also provide for this model a computationally efficient version of our main algorithms.
The paper is organized as follows.In Section 2, we introduce notations and assumptions used all along the paper.Section 3 gives the main results, in a general framework.Section 4 shows the application to Ising models and Section 5 presents a large simulation study where the problem of the practical calibration of some parameters is adressed.Section 6 is a discussion of the results with a comparison to existing papers.Section 7 gives the proofs of the main theorems and some technical results are recalled in an appendix in Section 9.

Notations and Main Assumptions
Let G be a discrete set of sites, possibly infinite, A = {−1, 1} be the binary alphabet of spins, and P be a probability measure on the set of configurations X (G) = A G .More generally, for all subsets V of G, let X (V ) = A V be the set of configurations on V .In what follows, the triplet (G, A, P ) will be called a random field.For all i in G, for all V ⊂ G, for all x in X (G), let x(V ) = (x(j)) j∈V and for all probability measures Q on X (V ∪ {i}), let be a regular version of the conditional probability.All along the paper, we will use the convention that, if V is a finite set, Q a probability measure on X (V ) and x is a configuration such that Q(x(V /{i})) = 0, then Q i|V (x) = 1/2.For all x in X (G) and all j in G, let x j be the configuration such that x j (k) = x(k) for all k = j and x j (j) = −x(j).We say that there is a pairwise interaction from j to i if there exists x in X (G) such that P i|G (x j ) = P i|G (x).For all subsets V of G, for all probability measures Q on X (V ), let With the above notations, there is a pairwise interaction from j to i if and only if ω G i,j (P ) > 0. Our second task in this paper is to recover the set G i of sites having a pairwise interaction with i.This definition differs in general from the one suggested in introduction.However, it is easy to check that they coincide in the Ising models defined in Section 4. Let X 1:n = (X 1 , ..., X n ) be i.i.d. with common law P .Let V M be a finite subset of G of observed sites, with cardinality M .The observation set is then ).Let P be the empirical measure on X (G) defined for all configurations x in X (G) by For all real valued functions f defined on For all subsets V of V M , the L ∞ -risk of P i|V is defined by P i|V − P i|G ∞ .This risk is naturally decomposed into two terms.From the triangular inequality, we have We call variance term the random term P i|V − P i|V ∞ and bias term the Let us finally present our general assumptions on the measure P .In the following ν and κ min are positive constants.The two first assumptions are classical and will only be used to discuss the main results.
CA:(Continuity) For all growing sequences The following last assumption is very important for the model selection criterion to work.It is satisfied for example by a generalized form of the Ising model as we will see in Section 4.
H1: For all finite subsets V of G,

Control of the variance term of the L ∞ -risk
Our first theorem provides a sharp control of the variance term of the risk of P i|V .It holds without assumption on the measure P or the finite subset V .
Theorem 3.1.Let P be a probability measure on X (G), let V be a finite subset of G.
There exists an absolut constant c 1 such that, for all δ > 1, There exists an absolut constant c 2 ≤ 400 such that, for all δ > 1, Remark: (1) implies that, The variance term goes almost surely to 0 if ν |V | << n(ln n) −1 .If in addition P satisfies CA and (V n ) n∈N * is a growing sequence of sets with limit G, the estimator P i|Vn is consistent.• (1) is only interesting theoretically, because the parameter p V − is unknown in practice.We will use (2) for our model selection algorithm.

Model Selection
We deduce from Theorem 3.1 that the risk of the estimator P i|V is bounded in the following way.For all δ > 1, for all subsets V , The risk of P i|V depends on the approximation properties of V through the bias P i|G − P i|V ∞ that is typically unknown in practice, and on the complexity of V , measured here by p V − .The aim of this section is to provide model selection procedures in order to select a subset of V M that optimizes the bound (3).In the following, we denote by G n a finite collection of subsets of V M , possibly random, and we call optimal or oracle in G n , any subset Ĝ = Ĝ(X 1:n (∪ V ∈Gn V )) in G n , possibly random, such that, We introduce the following selection rule.Let N n be an almost sure bound on the cardinality of |G n |.For all δ > 1 and for all The following theorem states that Ĝ(C, δ, G n ) is almost an oracle.Theorem 3.2.Let P be a probability measure on X (G) satisfying H1.Let G n be a finite collection of finite subsets of G, possibly random, and let N n be an almost sure bound on the cardinality of G n .For all C > c 2 , δ > 1, let Ĝδ (C) = Ĝ(C, δ, G n ) be the estimator given by (4).There exists a positive constant K = K(c 2 , C, κ min ) such that, Remarks: • Theorem 3.2 states that the risk of the estimator selected by the rule (4) is the best among the collection G n .It is the main result of the paper and we will discuss in what follows several applications.• The key idea of the proof is that, by assumption H1, we have P i|G ∞ − P i|V ∞ ≃ P i|G − P i|V ∞ , hence, our decision rule consists essentially in minimizing the sum of the bias term and the variance term of the risk, and the selected estimator is then an oracle.
• The constant c 2 derived in Theorem 3.1 is very pessimistic.Hence, Theorem 3.2 is more interesting theoretically.In the simulations of Section 5, we will calibrate C with the slope algorithm introduced in [5] and illustrate the nice properties of the resulting Ĝδ (C).
Let us go back to the ON problem.It is solved thanks to the following corollary.
) be the estimator given by (4).With probability larger than 1 − δ −1 , we have Remarks: • The complexity of the model selection algorithm for the collection G m,M is O(nM m ).This collection is used when a uniform bound m on the cardinalities of the |G i | is known.The complexity is the minimal necessary to recover the interaction graph in this problem [7].

Estimation of the interaction subgraph
Let M be an integer and let V M be a finite subset of G, with cardinality M .For all subsets Let V be a finite subset of G, we study in this section the estimators of G i given by ĜV We introduce the following function.
Ψ represents the minimal value of the bias term at a given value of the variance term.Our assumption concerns the rate of convergence of Ψ to 0.

H2(ǫ Ψ ):
There exist C Ψ > 0, α Ψ > 0 such that, for all K > 1, for all v > 0, Theorem 3.4.Let (G, A, P ) be a random field satisfying H1, H2.Let e ≤ M be an integer, let V M be a finite subset of G with cardinality M .Let δ > 1 and let and let Ĝδ (C) = Ĝ(C, δ, G n ) be the set selected by the selection rule (4).Let c > 0 and let Ĝ Ĝδ (C) i (c) be the associated set defined by (5).Let K be the constant defined in Theorem 3.2 and let c ∞ = 2 c 2 + C −1/αΨ Ψ (2K) 1−1/αΨ .We have Remark: ) contains exactly the sites that have a pairwise interaction with i of order the risk of an oracle.It provides a partial solution to the INI problem.
• Theorem 3.4 requires the extra assumption H2 compared to Theorem 3.2.
Moreover, the theoretical constant c ∞ depends on the constants κ min , C Ψ , α Ψ .
Let us conclude this section with the two steps algorithm suggested by Theorem 3.4 to estimate Estimation algorithm: Choose a model Ĝ, applying the model selection algorithm of Theorem 3.2 to the collection of all subgraphs of V M with cardinality smaller than log 2 (n).

Ising Models
The remaining of the paper is devoted to Ising models.These models are very important in statistical mechanics [12] and neuroscience [19] where they represent the interactions respectively between particles and neurons.In this section, we prove that Ising models satisfy H1, so that all our general results apply in these models.We also define effective algorithms for the ON and INI problems, adapted to this special case.

Verification of H1.
Let us recall the definition of Ising models.
In this case, T = r −1 is called the temperature parameter of the pairwise potential f .Definition 4.2.A probability measure P on X (G) is called an Ising model with potential f if, for all x ∈ X (G), a∈A e j∈G fi,j (a,x(j)) = 1 1 + e j∈G fi,j (xi(i),x(j))−fi,j (x(i),x(j)) .
The existence of a such a measure is well known [12]. Remark: • The classical Ising model has potential f defined by • One of the fundamental questions studied for this class of models is the description of conditions on potential f that guarantees uniqueness and non-uniqueness of the Ising model.Usually, high temperature implies conditions for the uniqueness of the Ising model and low temperature implies non-uniqueness [12].
, we have then It is clear that Ising models satisfy CA and NN with ν = (1 + e 2r ) −1 .
Definition 4.3.Let (G, A, P ) be an Ising model, with potential f .For all i, j in G, for all a in A, let Let us first recall some elementary facts about Ising models.
Proposition 4.4.Let (G, A, P ) be an Ising model, with potential f .For all finite subsets V of G, for all i, j in G, we have P ) ≤ e 2r (e 4r −1) 4r(1+e −2r ) 2 ω i,j (f ).The following theorem states that all of our general results apply in Ising models.
The key ingredient of the proof is the precise control of the bias term (6).
P satisfies assumption H1 i.e. there exists a constant κ min > 0 such that, for all finite subsets V of G,

A special strategy for Ising models
The model selection algorithm (4) might be computationally demanding in practice when the collection G n is too large.This is the case of the collection G log 2 (n),M used several times in Section 3, when the values of M and n are large.The purpose of this section is to show that a special strategy, computationally more attractive, can be adopted in Ising models.The idea comes from [7].Let us describe the method.
Reduction of the number of sites.Let x 1 be the configuration in X (G) such that, for all j in G, x 1 (j) = 1.
Step 1 Computation of the empirical probabilities.For all j in V M , let p(j) = P (x 1 (j)), p(i, j) = P (x 1 (i, j)).
Step 2 Reduction step.We keep the j in V M such that Let also η ms be the smallest η > 3 (2n) −1 ln(6M δ) such that the number of j kept after Step 2 is smaller than κ log 2 (n).
We denote by V (η) the set of j kept after Step 2. It is clear that the reduction algorithm has a complexity O(nM ).Remark that the values | p(i, j) − p(i) p(j)| do not depend on the configuration x 1 since the alphabet has only two letters.

Model selection algorithm. Let
Step 1 Computation of the conditional probabilities.For all V in V (η ms ), compute P i|V , and pen(V ).
Step 2 Selection Step.We choose C > c 2 and Ĝ = arg min Hence, the complexity of the model selection algorithm is O(n κ+1 ).The global complexity of the algorithm is therefore O(n κ+1 + nM ).As a comparison, the model selection algorithm for

Control of the risk of the resulting estimator
Theorem 4.6.Let (G, A, P ) be an Ising model, with potential f .Let ) 2 e 6r (e 4r − 1) .
With probability larger than 1 − δ we have that Furthermore, let us denote by With probability larger than 1 − 2δ, we have, Remarks: • The estimator of the interaction graph has better properties than the one obtained with selection and cutting procedure.The main difference is that there is no term ( p Ĝ − ) −1/2 in the rate of convergence.• The oracle inequality might be a little bit less sharp than the one obtained in (19).This is the price to pay to have a computationally efficient algorithm.• Our result holds in the Ising model.However, [7] used a similar approach in more general random fields with some additional assumptions and obtained good properties for the INI problem.

Simulation studies
In this section we illustrate results obtained in Sections 3 and 4 using simulation experiments and introduce the slope heuristic.All these simulation experiments can be reproduced by a set of MATLAB R routines that can be downloaded from www.princeton.edu/∼dtakahas/publications/LT10routines.zip.Let G = {−1, 0, 1} × {−1, 0, 1}.For the sections 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, and 5.7, we consider an Ising model on A G , with pairwise potential given by The pair of sites (i, j) where j ∈ V i is shown in Figure 1.For all these experiments, i = (0, 0).We simulated independent samples of the Ising model with increasing sample sizes n = 100k, k = 1, . . ., 100.For each sample size we have N = 100 independent replicas.

Variance term of the risk
In the following experiment we will verify Theorem 3.1 in a simulation.For each sample size we computed the normalized variance term Representation of the interacting pairs of the Ising model used in the simulation experiments.The edges between sites indicate the interacting pairs.The grey colored edges indicate the sites interacting with site (0, 0).
N different samples and obtained the average value.The result is summarized in Figure 2.

Slope heuristic
The constant c 2 derived from Theorem 3.1 is too pessimistic to be used in practice.The purpose of this section is to present a general method to design this constant.It is based on the slope heuristic, introduced in [5] and proved in several other frameworks in [1,14].We refer also to [2] for a large discussion on the practical use of this method.In order to describe it, let us introduce, for all V in G m,M , a quantity ∆ V , possibly random, measuring the complexity of the model V .The heuristic states the following facts.
1.There exists a positive constant C min such that when C < C min , the complexity of the model selected by the rule (4) is as large as possible.2. When C is slightly larger than C min the complexity of the selected model is much smaller.3. When C = 2C min then the risk of the selected model is asymptotically the one of an oracle.
The heuristic yields the following algorithm, defined for all complexity measures ∆ V .
1.For all C > 0, compute ∆ Ĝ(C) , the complexity of the model selected by the rule (4).2. Choose Cmin such that ∆ Ĝ(C) is very large for C < Cmin and much smaller for C > Cmin .

Select the final Ĝ = Ĝ(2 Cmin ).
The algorithm is based on the idea that Cmin ≃ C min and therefore that the final Ĝ, selected by 2 Cmin ∆ V is an oracle by the third point of the slope heuristic.The actual efficiency of this approach depends highly on the choice of the complexity measure ∆ V and on the practical way to choose Cmin in step 2 of the algorithm.We illustrate the dependence on ∆ V in the following experiences.
∆ V is either the cardinality of V (the dimension) or the variance estimator C(np V − ) −1/2 .Cmin is selected with the maximum jump criteria [2]: fix an increasing sequence of positive numbers C 0 , . . ., C t and define If the maximum is achieved in more than one value, take the biggest of such k.
Remark: The calculation of Cmin does not yield a significant increase of computational time compared to the evaluation of the model selection criteria for one fixed constant C. The only additional cost is due to the fact that one has to keep in the computer memory the conditional probabilities that must be computed only once.

Oracle risk compared to the risk of the estimated model
One way to verify the performance of the slope heuristic proposed in previous section is to compute the ratio With a reasonable procedure, we expect that the above quantity remains bounded.We applied the model selection procedure (4) with slope heuristic discussed above for the set {V ⊂ G \ {i} : |V | ≤ 8}.For each sample size we computed the ratio (7) for 100 different samples and we obtained the average.The result is summarized in Figure 3.

Sample size
Risk ratio Plot of the number of samples n against the average of ratio (7).Observe that the risk ratio remains bounded for both the variance (solid black) and the dimension (dashed grey) as the measure of complexity.

Discovery rate of the model selection procedure for ON problem
Another way to measure the performance of our model selection procedure is to compute the positive discovery rate and the negative discovery rate with respect to the oracle Ĝoracle .We estimated ( 8) and ( 9) and the result is summurized in Figure 4. Plot of positive and negative discovery rates with respect to the oracle against the sample size n.In solid/dashed black lines are represented the positive/negative discovery rates using the variance (V) as the complexity measure and in solid black/grey lines the positive/negative discovery rates using the dimension (D).Observe that the variance gives a better positive and negative discovery rates with respect to oracle when compared to the dimension.

Performance of the model selection procedure for INI problem
A natural question is how well the proposed model selection procedure behaves for the INI problem.Observe that the model selection procedure was designed to solve the ON problem and in principle does not necessary work for the INI problem.To investigate this question for each sample size we estimated the positive discovery rate and the negative discovery rate with respect to the interaction neighborhood V i .The result is summurized in Figure 5. Plot of positive and negative discovery rates with respect to Vi against the sample size n.In solid/dashed black lines are represented the positive/negative discovery rates using the variance (V) as the complexity measure and in solid black/grey lines the positive/negative discovery rates using the dimension (D).Observe that the variance gives higher positive discovery rates than the dimension as the measure of complexity although the negative discovery rates are the same.

Relationship between the INI and ON problems
Another interesting question is to understand what is the relationship between the INI and ON problems.Useful quantities for this are the positive discovery rate and the negative discovery rate We estimated these quantities and the results are summarized in Figure 6.Plot of positive and negative discovery rates of the oracle with respect to Vi against the sample size n.The solid black line represents the results for positive discovery rates and the dashed grey line represents the results for the negative discovery rates.Observe that in this example the oracle Ĝoracle matches the interaction neighborhood Vi quite fast.Also observe that in this example the oracle never included interactions not contained in Vi.

Select and cut procedure
Here we will show the usefulness of the two-step procedure introduced in Theorem 3.4 by an example.We consider the same independent samples used in previous experiments.We also consider i = (0, 0) and sample sizes n = 100k, k = 1, . . ., 100 with 100 independent replicas for each sample size.Let Ĝ(2 Cmin ) be the subset of G chosen by first applying the model selection procedure for the set {V ⊂ G \ {i} : |V | ≤ 8}.To choose the constant in the model selection procedure, we used the slope heuristic with variance as the complexity measure.Let Ĝ(SC) be the subset of G obtained by applying to the subset Ĝ(2 Cmin ) the cutting procedure with cv V n = 0.3(n p V − ) −1 .We first computed the average of the risk ratio for each sample size and compared them with the average of risk ratio (7).The results are summarized in Figure 7.
We also computed the positive and negative discovery rates of Ĝ(SC) and Ĝ(2 Cmin ) with respect to V i .The results are presented in Figure 8.

Risk ratio
Selection & Cut Selection Fig 7. Plot of the number of samples n against the average of risk ratio ( 12) and (7).
In solid black is represented the risk ratio for the two-step procedure and in dashed grey the risk ratio for the model selection procedure alone.Observe that the ratio of the two-step procedure remains closer to one when compared to the model selection alone.Observe that the two-step procedure has almost perfect negative discovery rates with incresing positive discovery rates.

Computationally efficient algorithm
In this section we will illustrate the performance of the strategy introduced in Section 4.  For this experiment i = 1 and |V i | = 16.We simulated independent samples of the Ising model with increasing sample sizes n = 100k, k = 1, . . ., 100.For each sample size we have N = 50 independent replicas.In this example, it is not practical to compute all candidates in collection G 8,200 whereas the algorithm introduced in Section 4.2 is very efficient.We illustrate its performance in the case where the number of sites j kept after Step 2 of the reduction step in Section 4.2 is 10.We denote the model chosen by this algorithm by Ĝefficient We estimated the probability that the selected model Ĝefficient recover the largest, and second, third, fourth, fifth largest interaction potentials.Formally, let J 1 = max{|J ij |1 j∈Vi : i, j ∈ G} and J k = max{|J ij |1 j∈Vi : i, j ∈ G \ J k−1 }, for k = 1, . . ., 5. We estimated for k = 1, . . ., 5. The result of the simulation is presented in Figure 10.By Monte Carlo simulation using a sample size of 100 000 we concluded that the considered Ising model at site i = 1 does not satisfy the incoherence condition in [17].Plot of the number of samples n against the probability that Ĝefficient includes the largest (solid black), and second (dashed black), third (solid gray), fourth (dashed gray), fifth (solid light gray) largest interaction potentials.Observe that the model selection procedure includes the sites with larger interaction potentials more often.

Discussion
We introduced a model selection procedure for interaction neighborhood estimation in partially observed random fields.We prove that the proposed rule satisfies an oracle inequality.The results hold under general assumptions, which for instance, are satisfied by a generalized form of the Ising model.Our model selection approach differs from other works [3,7,10,11,17] where only the INI problem is considered and more restrictive conditions are assumed.In particular, [3,7,17] consider the INI problem for finite random fields and assume that all the interacting sites are observed.This assumption is quite strong from practical point of view, e.g. in neuroscience, where the experimenter never has access to the whole set of neurons.Our result holds for partially observed random fields without any restriction on the range of the interactions.Csiszar and Talata [10] consider a BIC like consistent model selection procedure for homogeneous random fields in Z d .As they consider the INI problem using only one realization of the random field, their result are not immediately comparable with ours, but it is interesting to note that they assume that the range of interaction is finite.In [11], it is considered the INI problem for infinite range Ising models in Z d .The main restriction in this last work is that it is assumed that the interactions between the sites are weak ("low tempterature") and that a subset of the observed sites of size O(log(n)), where n is the sample size, must be fixed to apply the proposed procedure.Our procedure has no restriction on the strength of interaction and can be applied for example for low temperature Ising models, provided that the samples come from the same phase.Moreover, the model selection procedure can be applied in high dimension situation when the subset of observed sites is of size O(n α ), α > 0. We also introduced a two-step procedure in which the model selection rule gives us a small set of candidate sites and a cutting procedure removes from this set the irrelevant interactions.This two step procedure can be understood as a combination of a model selection and a statistical test procedure in spirit of [22].Our first simulation experiment shows that the concentration bound for the variance term of the risk in Theorem 3.1 is sharp.We propose a slope heuristic with maximal jump criteria using the variance or the dimension as a measure of complexity to choose a good constant in the model selection procedure.In our simulation experiment, we measured the performance of the slope heuristic for ON and INI problems.We observed that the variance had a better behavior as a complexity measure than the dimension because 1) the risk ratio was always smaller for the variance compared to the dimension as the measure of complexity, although both risk ratios remained bounded, 2) the estimated positive and negative discovery rates with respect to the oracle were always higher for the variance compared to the dimension as the measure of complexity, 3) also the estimated positive and negative discovery rates with respect to the interaction neighborhood were always higher for the variance compared to the dimension as the measure of complexity.Although at this point the variance seems to be a better choice for the complexity measure, a more comprehensive study must be carried to obtain a definitive conclusion and we recommend to consider both measures of complexity in practice.We addressed also in the simulation experiments the relationship between the ON and INI problems and observed that for sufficiently large sample size, both coincide.The two-step procedure introduced in this article was applied in a example where it clearly enhances the performance of the model selection procedure for both the INI and ON problems.Recently, multistep statistical procedures are gaining attention [22] although only few rigorous results exist.Our result for the two-step procedure is a contribution for this growing field.The main drawback of the proposed model selection procedure is its high computational cost which becomes prohibitive when a large number of sites are observed.We introduced a computationally efficient way to overcome this difficulty in the case of Ising model.The new procedure drastically reduces the set of models for which the model selection procedure must be applied, but still keeping the main interacting sites and a good oracle property.In the simulation experiment we show that the proposed algorithm has a good performance even when the number of the observed sites is as big as 200.It must be remarked that the Ising model considered for this experiment does not satisfy the incoherence condition [17] and therefore other computationally efficient algorithms as ℓ 1 -penalizations are not guaranteed to be consistent.Finally, we provide a set of MATLAB R routines that can be used to reproduce our experimental results and to carry further simulation and applied studies.For all x such that P (x(V /{i})) = 0, we have P (x(V /{i})) = 0, thus P i|V (x) = P i|V (x).Hence, we can only consider the configurations x such that P (x(V /{i})) > 0. Let us first provide some inequalities about conditional probabilities.
Remark: In particular, we deduce from this lemma that The first inequality follows from the fact that The second one is consequence of the first one and the fact that The proof of ( 1) is concluded thanks to the following Lemma.
Lemma 7.2.Let P be a probability measure on X (G) and let V be a finite subset of G.

Conclusion of the proof of (1)
. We deduce from Lemmas 7.1 and 7.2 that, with probability larger than 1 − δ −1 , As this result is trivial when Proof of Lemma 7.2: We apply Bousquet's version of Talagrand's inequality to the class of functions We apply Lemma 9.6 with We have Hence, ) Lemma 7.2 is then obtained with ( 14) and (15).Let us now turn to the proof of (2).Let V be a finite subspace of S. As (2) holds when p V − = n −1 , it remains to prove (2) when, for all x in X (V ), P (X (V )) > 0. This is done by the following Proposition.
There exists an absolut constant c 2 ≤ 400 such that, for all δ > 1, In particular, Proof of Proposition 7.3.Let n ≥ 2, δ > 1, c 2 = 400 and let us first remark that we only have to prove (16) on the subset X ′ n ⊂ X n of all the x in X n such that P (x(V )) ≥ c 2 2 ln(δn)n −1 .Let x in X ′ n , then we also have P (x(V /{i}) = 0. From Lemma 7.1, we have From Lemma 7.1, we also have We deduce that Hence, using the elementary inequality a ∨ b ≥ √ ab with a = P (x(V /{i})), b = P (x(V /{i})) ∨ c 2 2 ln(δn)n −1 , we deduce that .
We have obtain that, for all x in X ′ n , We apply Bousquet's version of Talagrand's inequality to the class of functions We have Hence, for all ǫ > 0, with probability larger than 1 − δ −1 , we have We apply Lemma 9.6 to the sets A x = x(V ) and the real numbers Hence, Thus, for all ǫ > 0, with probability larger than 1 − δ −1 , we have sup We take ǫ = 0.001 to conclude the proof.

Proof of Theorem 3.2:
It comes from Theorem 3.1 that, for all subsets V in G n , we have, We use a union bound to get that, Hereafter in the proof of Theorem 3.2, we denote by v and by We have proved that P (Ω) ≥ 1 − δ −1 .Let C > c 2 and denote, for short Ĝ = Ĝ(C, δ, G n ).By definition of Ĝ, for all V ∈ G n , Hence, on Ω, for all V in G n , From Assumption H1, and from the triangular inequality, , P i|G ∞ − P i|V ∞ ≤ P i|G − P i|V ∞ .Plugging these inequalities in (17), we obtain that, for all V ∈ G n , On Ω, for all V ∈ G n , we have then Hence, from Theorem 3.2, with probability larger than 1 − δ −1 , we have For all |V | > log 2 (n), there is at least one configuration in X (V ) that is not observed, hence p V − = 1/n.Therefore, for all m ≥ log 2 (n), inf Taking m = M , ( 19) yields the corollary.

Proof of Theorem 3.4:
Let Ω be the event defined in the proof of Theorem 3.2 for the collection G n and let Ĝ = Ĝδ (C).We have P (Ω c ) ≤ δ −1 and, on Ω, from Corollary 3.3, By definition of Ψ, denoting by l n = n −1 Γ M (δ), we have inf Let v * be the smallest solution of the equation vl n = Ψ(v).As Ψ is nonincreasing and v → vl n is non decreasing, we have Thus, on Ω, we have Let Ω 2 be the event defined in H2.Let r < 1 and Hence r ≥ (2C Ψ K) −1/α .Thus, on Ω * , we have By the triangular inequality, we have sup Hence, on Ω * , 7.5.Proof of Theorem 4.5: In all the proof, for all subsets V , V ′ of G such that V ∩ V ′ = ∅, for all (x, y) in X (V ) × X (V ′ ), let x(V ) ⊕ y(V ′ ) be the configuration on X (V ∪ V ′ ) such that, for all j in V x(V ) ⊕ y(V ′ )(j) = x(j) and for all j in V ′ , x(V ) ⊕ y(V ′ )(j) = y(j).
Let V be a finite subset of G and let x be a configuration on X (G).
Let us now give the following lemma, whose proof is immediate from the convexity of x → e x .
It is clear that, for all x in X (G), The upper bound comes then from the inequality P (y(j) = x(j)|x(V /{i})) ≤ 1.
For the lower bound, let, for all a in A, x a max be the configuration such that, for all j in G, g i,j (a, x a max (j)) = g a i,j and let x a min be the configuration such that, for all j in G, g i,j (a, x a min (j)) = inf b∈A g i,j (a, b).From Lemma 7.4, we have Finally, we have Using successively inequalities (20), ( 21), ( 22) and ( 23) with x = x x(i) max , we obtain sup Let us now check that P satisfies assumption H1.Let x in X (V ).Using successively inequalities ( 20), ( 21), ( 22) and ( 23) with x = x(V ) ⊕ x x(i) max (G/V ), we obtain, as in the previous proof, Taking x such that P i|V (x) = P i|V ∞ and using that P i|G (x(V )⊕x This yields the theorem thanks to inequality (6).

Appendix
In this Appendix, we recall the bound given by Bousquet [6] for the deviation of the supremum of the empirical process.
Theorem 9.1.Let X 1 , ..., X n be i.i.d.random variables valued in a measurable space (A, X ).Let F be a class of real valued functions, defined on A and bounded by b.Let v 2 = sup f ∈F P [(f − P f ) 2 ] and Z = sup f ∈F (P n − P )f .Then, for all x > 0, Let us recall some well known tools of empirical processes theory.The following result can be derived from classical chaining arguments (see for example [6]).
In order to apply Lemma 9.5 to F = F I , we compute the entropy of F I .For all i = j, since A i ∩ A j = ∅, Hence d 2,Pn (t i , t j ) = α 2 i P n (A i ) + α 2 j P n (A j ).Consider an ǫ-separated set T ǫ = {t i1 , ..., t iN } in (F I , d 2,Pn ) (see also the definition in the appendix), it comes from the previous computation that, for all k = k ′ , where p * n = sup i∈I α 2 i P n (A i ).Now, let us recall the following elementary lemma.Lemma 9.7.For all positive real numbers K, A such that K/A > e, we have x 2 dx = A ln Since K/A > e, Let us now give another simple lemma.
Applying Lemmas 9.8, 9.7, and Jensen's inequality to the right hand side of (31) we have that It is then straightforward that (30) holds.

Theorem 4 . 5 .
Let (G, A, P ) be an Ising model, with potential f .There exist two positive constants c

Fig 2 .
Fig 2. Plot of the number of samples n against √ n Pi|V i − P i|V i ∞ .The dotted line indicates the linear regression line.Observe that the regression line is essentially parallel to the abscissa.
Fig 4.Plot of positive and negative discovery rates with respect to the oracle against the sample size n.In solid/dashed black lines are represented the positive/negative discovery rates using the variance (V) as the complexity measure and in solid black/grey lines the positive/negative discovery rates using the dimension (D).Observe that the variance gives a better positive and negative discovery rates with respect to oracle when compared to the dimension.
Fig 5.  Plot of positive and negative discovery rates with respect to Vi against the sample size n.In solid/dashed black lines are represented the positive/negative discovery rates using the variance (V) as the complexity measure and in solid black/grey lines the positive/negative discovery rates using the dimension (D).Observe that the variance gives higher positive discovery rates than the dimension as the measure of complexity although the negative discovery rates are the same.
Fig 6.Plot of positive and negative discovery rates of the oracle with respect to Vi against the sample size n.The solid black line represents the results for positive discovery rates and the dashed grey line represents the results for the negative discovery rates.Observe that in this example the oracle Ĝoracle matches the interaction neighborhood Vi quite fast.Also observe that in this example the oracle never included interactions not contained in Vi.

Fig 8 .
Fig 8.  Plot of positive and negative discovery rates of Ĝ(SC) and Ĝ(2 Cmin) with respect to Vi against the sample size n.The black solid/dashed lines represent the positive/negative discovery rates of the two-step procedure.The grey solid/dashed lines represent the positive/negative discovery rates of the model selection procedure alone.Observe that the two-step procedure has almost perfect negative discovery rates with incresing positive discovery rates.
2 on the Ising model on A G , where G = {1, . . ., 200}, with pairwise potential f ij (c, d) = |J ij |1 j∈Vi cd for i, j ∈ G, c, d ∈ A, V i ⊂ G, and J ij independently generated from a Gaussian distribution with E[J ij ] = 0 and E[J 2 ij ] = 4.The pairs of sites (i, j) with j ∈ V i are represented in Figure 9.

Fig 9 .
Fig 9. Representation of the interacting sites in the Ising model described in 5.8.The positions (i, j) of the dots indicate the pair of sites (i, j) for which j ∈ Vi.
Fig 10.Plot of the number of samples n against the probability that Ĝefficient includes the largest (solid black), and second (dashed black), third (solid gray), fourth (dashed gray), fifth (solid light gray) largest interaction potentials.Observe that the model selection procedure includes the sites with larger interaction potentials more often.

Definition 9 . 2 .Definition 9 . 3 .
The covering number N (ǫ, T, d) is the minimal number of balls of radius ǫ with centers in T needed to cover T .The entropy is the logarithm of the covering number H(ǫ, T, d) = ln(N (ǫ, T, d)).An ǫ-separated subset of T is a subset {t k } of elements of T whose pairwise distance is strictly larger than ǫ.The packing number M (ǫ, T, d) is the maximum size of an ǫ-separated subset of T .Those quantities are related by the famous following lemma.Lemma 9.4.(Kolmogorov and Tikhomirov[13]) Let (T, d) be a metric space and let ǫ > 0, N (ǫ, T, d) ≤ M (ǫ, T, d) ≤ N (ǫ/2, T, d).

2
on [K/A, ∞[.The result follows.By definition, pn ≤ (α * ) 2 , hence 2α * /( p * n /2) ≥ 4 > e, we deduce from Lemma 9 Lemma 9.5.Let F be a class of functions, letd 2,Pn (t, t ′ ) = P n [(t − t ′ ) 2 ] and D n = sup t∈F P n (t 2 ) then (u, F , d 2,Pn )du .The next result was used to obtain our concentration inequalities.Lemma 9.6.Let (A i ) i∈I be a collection of sets such that, for all i, j ∈ I, A i ∩ A j = ∅ and let (α i ) i∈I be a collection of positive real numbers.Let Z I = sup t∈FI |(P n − P )t|, where F I = {t i = α i 1 Ai } and P n is the empirical measure.Let α * = sup i∈I α i , p * = sup i∈I α 2 i P (A i ).We have * √ p * .