Exact Recovery in Block Spin Ising Models at the Critical Line

We show how to exactly reconstruct the block structure at the critical line in the so-called Ising block model. This model was re-introduced by Berthet, Rigollet and Srivastava in a recent paper. There the authors show how to exactly reconstruct blocks away from the critical line, and they give an upper and a lower bound on the number of observations one needs; thereby they establish a minimax optimal rate (up to constants). Our technique relies on a combination of their methods with fluctuation results for block spin Ising models. The latter are extended to the full critical regime. We find that the number of necessary observations depends on whether the interaction parameter between the two blocks is positive or negative: in the first case, about $N \log N$ observations are required to exactly recover the block structure, while in the second $\sqrt N \log N$ observations suffice.


Introduction
In a recent paper Berthet, Rigollet and Srivastava rediscovered a block version of the Curie-Weiss-Ising model [2]. This model had been introduced earlier in the statistical physics literature, see e.g. [11], [10], [9], [6]. Extensions of these models are studied in [19] or [21]. The first of these papers uses a very general interaction structure, while the second investigates the situation in the spirit of social interaction models or statistical physics models on random graphs, see [23], [3], [7], and [18]. A related version of this model has been investigated using the method of moments in [15], [17] and [16]. The article by Berthet et al. is motivated by a considerable number of articles investigating block models in the recent past, see e.g. [1], [12], [5], [22], [4]. The model is interesting from both a probabilistic and a statistical perspective. To define the model, one starts by partitioning the set {1, . . . , N} into a set S ⊂ {1, . . . , N} with |S| = N/2 and its complement S^c. For this we need to assume that N is even; the model itself can be defined and analyzed for arbitrary block sizes (see [19], where large deviations and Central Limit Theorems are proved for a general block structure), but the statistical questions then become more delicate.
The set S induces a Hamiltonian on the binary hypercube {−1, +1}^N, N ∈ ℕ, given by
$$H_{N,\alpha,\beta,S}(\sigma) = -\frac{\beta}{2N} \sum_{i \sim j} \sigma_i \sigma_j - \frac{\alpha}{2N} \sum_{i \nsim j} \sigma_i \sigma_j. \tag{1.1}$$
Here we choose β > 0 and 0 ≤ |α| ≤ β, and we write i ∼ j if either i, j ∈ S or i, j ∈ S^c, and i ≁ j otherwise. The Hamiltonian (1.1) (or energy function), in turn, induces a Gibbs measure on {−1, +1}^N given by
$$\mu_{N,\alpha,\beta}(\sigma) = \frac{\exp(-H_{N,\alpha,\beta,S}(\sigma))}{Z_{N,\alpha,\beta}},$$
where $Z_{N,\alpha,\beta} := \sum_{\sigma \in \{-1,+1\}^N} \exp(-H_{N,\alpha,\beta,S}(\sigma))$ denotes the partition function. For α = β this is the standard Curie-Weiss model, which has been extensively studied, see e.g. [8]. In particular, it has been shown that its equilibrium measures are intrinsically related to the largest solution, m_+(β), of the equation z = tanh(βz).
More precisely, for all β ≥ 0, in the Curie-Weiss model the random variable $\frac{1}{N} \sum_{i=1}^N \sigma_i$ asymptotically concentrates in the points m_+(β) and −m_+(β). Note that m_+(β) = 0 if and only if β ≤ 1. A similar change in behavior was shown to be true for the block spin Ising model, see [2].
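Since the equation z = tanh(βz) governs the Curie-Weiss model and, via effective inverse temperatures, also the block model, a quick numerical illustration may be helpful. The following minimal Python sketch (purely illustrative; not part of [2] or of the proofs below) computes m_+(β) by fixed-point iteration and confirms that it vanishes precisely for β ≤ 1:

```python
import math

def m_plus(beta, iterations=500):
    """Largest solution of z = tanh(beta * z), via fixed-point iteration
    started at z = 1 (the iteration decreases monotonically to m_+(beta))."""
    z = 1.0
    for _ in range(iterations):
        z = math.tanh(beta * z)
    return z

# m_+(beta) = 0 for beta <= 1; a nontrivial magnetization appears for beta > 1
assert m_plus(0.8) < 1e-6
assert 0.85 < m_plus(1.5) < 0.87
```

Starting the iteration at z = 1 guarantees convergence to the largest solution, since tanh(β·) maps [m_+(β), 1] monotonically into itself.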
Theorem 1.1. In the above setting, write m = (m_1, m_2) for the vector of block magnetizations, $m_1 := \frac{2}{N} \sum_{i \in S} \sigma_i$ and $m_2 := \frac{2}{N} \sum_{i \in S^c} \sigma_i$, assume that |α| ≤ β, and denote by ρ_{N,α,β} the distribution of m under the Gibbs measure μ_{N,α,β}.
• If β + |α| ≤ 2, then ρ_{N,α,β} weakly converges to the Dirac measure in (0, 0).
• If β + |α| > 2 and α = 0, then ρ_{N,α,β} weakly converges to the mixture of Dirac measures $\frac14\left(\delta_{(m_+,m_+)} + \delta_{(m_+,-m_+)} + \delta_{(-m_+,m_+)} + \delta_{(-m_+,-m_+)}\right)$ with $m_+ := m_+(\beta/2)$.
• If β + α > 2 and α > 0, then ρ_{N,α,β} weakly converges to the mixture of Dirac measures $\frac12\left(\delta_{(m_+,m_+)} + \delta_{(-m_+,-m_+)}\right)$ with $m_+ := m_+((\beta+\alpha)/2)$.
• If β + |α| > 2 and α < 0, then ρ_{N,α,β} weakly converges to the mixture of Dirac measures $\frac12\left(\delta_{(m_+,-m_+)} + \delta_{(-m_+,m_+)}\right)$ with $m_+ := m_+((\beta-\alpha)/2)$.
Theorem 1.1 can be considered as a Law of Large Numbers for m. Central Limit Theorems for m were proved in [20]. They are also very useful for understanding the following reconstruction result, even though the authors in [2] choose a different approach. We will come back to this point once we have described the reconstruction mechanism. In the major part of their work [2], Berthet et al. consider the question whether, with a given number of observations n, one can reconstruct S exactly, and, if so, how n relates to N. One of their main findings is:
Theorem 1.2 (cf. [2, Corollary 4.6]). If the parameters α and β satisfy |α| ≤ β, α < β, and |α| + β ≠ 2, then there exist positive constants C_1 and C_2 that depend on α and β such that there is an algorithm that recovers the block structure (S, S^c) exactly with probability 1 − δ whenever the number of observations n exceeds an explicit threshold involving C_1 (a matching lower bound involves C_2).
There are two regimes of parameters excluded by Theorem 1.2. The first is α = β. While one can still prove limit theorems in this case (see [15]), it is rather obvious that it is impossible to reconstruct S in this case: the interaction simply does not differentiate between spins in the same block and spins in different blocks. Another obvious question left open by Theorem 1.2 is what happens at the critical line |α| + β = 2. The purpose of this note is to fill this gap. We will show:
Theorem 1.3. If the parameters α and β satisfy α < β, α ≠ 0, and |α| + β = 2, then the following holds true.
• If α > 0, there exists a positive constant C 3 such that there is an algorithm that recovers the block structure (S, S c ) exactly with probability 1−δ whenever n ≥ C 3 N log(N/δ).
• If α < 0, there exists a positive constant C_4 such that there is an algorithm that recovers the block structure (S, S^c) exactly with probability 1 − δ whenever n ≥ C_4 √N log(N/δ). Here, n denotes the number of observations.
Remark 1.4. Note that the number n of necessary observations indicates that the phase one enters at the phase transition point α > 0 and α + β = 2 is of a different nature than the phase one enters at α < 0 and |α| + β = 2, even though both points still belong to the high temperature regime according to Theorem 1.1. We exclude the case α = 0, because it is already covered by Theorem 1.2.
Remark 1.5. Following the proof in [2] one can see that the constants C 1 and C 3 in the above Theorems 1.2 and 1.3 depend on α and β and explode when α tends to β. We will not elaborate on this point.
The rest of this note is devoted to the proof of Theorem 1.3. To this end we will quickly recap the general reconstruction approach from [2] in Section 2. Section 3 will generalize a result on the critical fluctuations of m 1 + m 2 and m 1 − m 2 from [20]. In particular, we will treat the case of negative α which was omitted in [20]. These two ingredients will yield the proof of Theorem 1.3, which will be given in Section 4.

The strategy for block recovery
While part 2 of Theorem 1.2 could, in principle, be shown using a large deviations estimate, the first part needs more sophisticated arguments. In this section we recall how the approach proposed in [2] by Berthet et al. works. We will be a bit brief and refer the reader to [2] for proofs. Given n observations σ^(1), . . . , σ^(n), the log-likelihood function for S is given by
$$\mathcal{L}(S) = -\sum_{k=1}^n H_{N,\alpha,\beta,S}(\sigma^{(k)}) - n \log Z_{N,\alpha,\beta}.$$
Since it is easily seen that Z_{N,α,β}, as a sum over all configurations σ, is independent of S (because we know its size |S| = N/2), maximizing L(S) amounts to minimizing $\sum_{k=1}^n H_{N,\alpha,\beta,S}(\sigma^{(k)})$. On the other hand, taking the N × N matrix Q with entries Q_ij = β/N if i ∼ j and Q_ij = α/N otherwise, one readily sees that H_{N,α,β,S}(σ) = −½ Tr(σσ^T Q) (where the upper index T indicates transposition). Note that Q depends on S. Thus finding the maximum likelihood estimator for S amounts to maximizing ½ Tr(Σ̂Q), or equivalently Tr(Σ̂Q), where $\hat\Sigma = \frac1n \sum_{k=1}^n \sigma^{(k)} (\sigma^{(k)})^T$ is the empirical covariance matrix of the observations. Here, we maximize over all matrices Q that induce a bisection of {1, . . . , N} into sets S and S^c of equal size. More precisely, the N × N matrix Q = (Q_ij)_{1≤i,j≤N} has to satisfy the following properties: Q is symmetric, Q_ij ∈ {α/N, β/N} for all i, j ∈ {1, . . . , N}, and for each i = 1, . . . , N we have |{j : Q_ij = β/N}| = N/2. Now, for each fixed σ, the function Tr(σσ^T Q) is maximized by the same set S (associated to the matrix Q) no matter what α and β are, as long as α < β. The same holds true for n fixed observations σ^(1), . . . , σ^(n) and Tr(Σ̂Q).
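Two facts used above, namely that Z_{N,α,β} does not depend on the choice of S (of known size N/2) and that H_{N,α,β,S}(σ) = −½ Tr(σσ^T Q), can be checked by brute-force enumeration for a tiny N. The following pure-Python sketch (with illustrative parameter values, not taken from [2]) does exactly that:

```python
import itertools, math

def hamiltonian(sigma, S, alpha, beta):
    """H_{N,alpha,beta,S}(sigma) = -beta/(2N) sum_{i~j} s_i s_j
    - alpha/(2N) sum_{i!~j} s_i s_j (ordered pairs, diagonal included)."""
    N = len(sigma)
    return -0.5 / N * sum((beta if (i in S) == (j in S) else alpha)
                          * sigma[i] * sigma[j]
                          for i in range(N) for j in range(N))

def log_partition(N, S, alpha, beta):
    """log Z_{N,alpha,beta} by summing exp(-H) over all 2^N configurations."""
    return math.log(sum(math.exp(-hamiltonian(s, S, alpha, beta))
                        for s in itertools.product([-1, 1], repeat=N)))

N, alpha, beta = 6, -0.5, 1.2
# the partition function does not depend on which bisection defines the blocks
assert abs(log_partition(N, {0, 1, 2}, alpha, beta)
           - log_partition(N, {0, 2, 4}, alpha, beta)) < 1e-9

# H(sigma) = -1/2 Tr(sigma sigma^T Q) with Q_ij = beta/N or alpha/N
S, sigma = {0, 1, 2}, (1, -1, 1, 1, -1, -1)
Q = [[(beta if (i in S) == (j in S) else alpha) / N for j in range(N)]
     for i in range(N)]
trace_form = -0.5 * sum(sigma[i] * sigma[j] * Q[i][j]
                        for i in range(N) for j in range(N))
assert abs(trace_form - hamiltonian(sigma, S, alpha, beta)) < 1e-12
```

The S-independence of Z is exact: any permutation of {1, . . . , N} mapping one bisection to another simply relabels the configurations in the sum.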
Indeed, it is an easy matter to check that
$$\operatorname{Tr}(\sigma\sigma^T Q) = \frac{N}{8}\left[(\beta+\alpha)\big(m_1(\sigma)+m_2(\sigma)\big)^2 + (\beta-\alpha)\big(m_1(\sigma)-m_2(\sigma)\big)^2\right].$$
As σ^(1), . . . , σ^(n) are fixed, so is $m_1(\sigma^{(k)}) + m_2(\sigma^{(k)}) = \frac2N \sum_{i=1}^N \sigma_i^{(k)}$ for every k. This means we have
$$\operatorname{Tr}\big(\sigma^{(k)}(\sigma^{(k)})^T Q\big) = c_k + \frac{N}{8}(\beta-\alpha)\big(m_1(\sigma^{(k)})-m_2(\sigma^{(k)})\big)^2$$
for all k and constants c_k depending on k (of course), but not depending on the partitioning (S, S^c). Thus, given our n observations,
$$\frac{2n}{N}\operatorname{Tr}(\hat\Sigma Q) = \frac2N \sum_{k=1}^n c_k + \frac{\beta-\alpha}{4} \sum_{k=1}^n \big(m_1(\sigma^{(k)})-m_2(\sigma^{(k)})\big)^2.$$
At first glance the right hand side may appear rather tricky, because even though the observations σ^(1), . . . , σ^(n) are taken independently, the m_1(σ^(k)) are not, when S is the free variable. However, multiplying out the square and rearranging the terms, we see that
$$\frac{2n}{N}\operatorname{Tr}(\hat\Sigma Q) = K + \frac{\beta-\alpha}{N^2} \sum_{k=1}^n \Big(\sum_{i\in S} \sigma_i^{(k)} - \sum_{i\in S^c} \sigma_i^{(k)}\Big)^2$$
for some constant K depending on β and σ^(1), . . . , σ^(n) but not on the choice of S. This reveals that the minimal and maximal points of (2n/N) Tr(Σ̂Q) as a function of S do not depend on α and β as long as β ≠ α; only whether they are maxima or minima depends on whether α < β or α > β. Hence, as long as we keep α < β, we can choose any values for them we like. We can therefore also set β = N and α = −N. With this choice, Q = rr^T, where r ∈ {−1, +1}^N is given by r_i = 1 for i ∈ S and r_i = −1 for i ∈ S^c. This transforms our optimization problem into
$$\max_{R \in \mathcal R} \operatorname{Tr}(\hat\Sigma R), \qquad \mathcal R := \{rr^T : r \in \{-1,+1\}^N, \textstyle\sum_{i=1}^N r_i = 0\} \tag{2.1}$$
(cf. [2]). Obviously, each R = rr^T ∈ ℛ again induces a bisection of {1, . . . , N} into two sets S and S^c of equal size, where i ∼ j if r_i = r_j. Moreover, there is a one-to-one correspondence between the set of valid matrices Q and ℛ; e.g., given some matrix Q with the aforementioned properties, the corresponding matrix R = rr^T ∈ ℛ is given by r_i = 1 if Q_{1i} = β/N and r_i = −1 otherwise. Thus each R ∈ ℛ is an estimator for the unknown blocks. In [2] the authors now proceed in two steps. First the σ are centered, i.e. σ ∈ {±1}^N is replaced by σ̄ := Πσ with Π := Id_N − (1/N) 𝟙_N, where 𝟙_N is the N × N matrix with all entries equal to 1. Since Πr = r for every r with Σ_i r_i = 0, we have Tr[Σ̂R] = Tr[Γ̂R] for all R ∈ ℛ, where Γ̂ := ΠΣ̂Π, so the likelihood function remains the same over ℛ when we replace Σ̂ by Γ̂. The decisive step in [2] is then to embed the optimization problem (2.1) into a larger class of optimization problems.
Hence, instead of solving (2.1), the authors look for solutions of the semidefinite relaxation
$$\max_{R \in E_+} \operatorname{Tr}(\hat\Gamma R), \tag{2.2}$$
where E_+ ⊇ ℛ is a suitable convex set of positive semidefinite matrices (see [2] for its precise definition). The question is, of course, when a solution of (2.2) also provides a solution to (2.1). This is where the authors in [2] spend a considerable amount of work to show:
Theorem 2.1 ([2]). The semidefinite programming problem (2.2) has a unique maximum at R* ∈ E_+ with probability 1 − δ whenever n is large enough; the explicit condition on n involves N, δ, and the difference Z − Z′ of the intra- and inter-block spin correlations (see Section 4). In particular, the semidefinite programming solution, if it exists, recovers exactly the block structure (S, S^c).
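Before passing to the semidefinite relaxation, it is instructive to see the combinatorial problem (2.1) solved by brute force in a toy case. The sketch below is purely illustrative: it replaces the empirical matrix Σ̂ by the exact second-moment matrix E[σσ^T] (computable by enumeration for small N), uses hypothetical parameter values, and recovers the planted bisection by maximizing r^T M r over balanced sign vectors:

```python
import itertools, math

def gibbs_second_moments(N, alpha, beta, S):
    """Exact matrix M = E[sigma sigma^T] under the Gibbs measure,
    by enumerating all 2^N configurations (feasible only for tiny N)."""
    def q(i, j):  # Q_ij = beta/N if i ~ j, alpha/N otherwise
        return (beta if (i in S) == (j in S) else alpha) / N
    configs = list(itertools.product([-1, 1], repeat=N))
    weights = [math.exp(0.5 * sum(q(i, j) * s[i] * s[j]
                                  for i in range(N) for j in range(N)))
               for s in configs]  # exp(-H) with H = -1/2 Tr(sigma sigma^T Q)
    Z = sum(weights)
    return [[sum(w * s[i] * s[j] for s, w in zip(configs, weights)) / Z
             for j in range(N)] for i in range(N)]

def recover_blocks(M, N):
    """Solve (2.1): maximize r^T M r = Tr(M r r^T) over balanced sign
    vectors r; fixing r_0 = +1 removes the global sign ambiguity."""
    best, best_S = -float("inf"), None
    for rest in itertools.combinations(range(1, N), N // 2 - 1):
        T = {0, *rest}
        r = [1 if i in T else -1 for i in range(N)]
        val = sum(r[i] * M[i][j] * r[j] for i in range(N) for j in range(N))
        if val > best:
            best, best_S = val, T
    return best_S

N, alpha, beta = 8, 0.5, 1.8
S_true = {0, 1, 2, 3}
M = gibbs_second_moments(N, alpha, beta, S_true)
assert recover_blocks(M, N) == S_true  # the planted bisection is recovered
```

In this population version, the true bisection wins because the intra-block correlation strictly exceeds the inter-block one whenever α < β; [2] of course works with the empirical matrix Σ̂ and the SDP (2.2) rather than with brute force.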

A limit theorem
Theorem 2.1 obviously asks for an estimate of Z − Z′. As a matter of fact, in [2] the authors show, by comparing the distribution of (m_1, m_2) with a Gaussian distribution, that for α ≤ 0 and |α| + β > 2 the difference Z − Z′ is of constant order in N, while it is of order 1/N in all other cases with |α| + β ≠ 2. Indeed, for α > 0 and α + β < 2 this also easily follows from the Central Limit Theorem for the vector m in [20, Theorem 1.2], and it would follow from corresponding Central Limit Theorems in the other cases with |α| + β ≠ 2, if such were proven (they are quite likely true, because similar Central Limit Theorems hold for $\frac1N \sum_{i=1}^N \sigma_i$ in the Curie-Weiss model, see e.g. [14]). For |α| + β = 2 there is no such estimate for Z − Z′ in [2]. Indeed, in view of Theorem 3.1 below, the main tool in this reference does not work, because at least for α > 0 and α + β = 2 the vector √N (m_1, m_2) is not asymptotically Gaussian. Moreover, considering Theorem 6 and Remark 7 in [16], one may wonder whether an exact reconstruction of (S, S^c) on the basis of the correlations between the spins is possible at all. There, the authors show that for α > 0 and α + β = 2 the correlations E(σ_i σ_j) have, asymptotically, the same leading behavior independent of whether the sites i and j belong to the same block or to different blocks. Hence, asymptotically, the correlations E(σ_i σ_j) at α + β = 2 do not depend on whether i ∼ j or i ≁ j, nor do they depend on α and β.
Theorem 3.1. Assume that 0 < |α| < β and |α| + β = 2.
(a) If α > 0, then on the scale √N/2 the difference between m_1 and m_2, i.e. $\frac{\sqrt N}{2}(m_1 - m_2)$, is asymptotically Gaussian with mean 0 and variance 2/(2 − β + α).
(b) If α > 0, then $\frac{N^{1/4}}{2}(m_1 + m_2)$ converges in distribution to a probability measure ρ on ℝ, which is absolutely continuous with Lebesgue density
$$f(x) = K \exp\left(-\tfrac{1}{12} x^4\right),$$
where K is a normalizing constant to make ρ a probability measure.
If α < 0, the analogous statements hold with the roles of m_1 + m_2 and m_1 − m_2 interchanged: $\frac{\sqrt N}{2}(m_1 + m_2)$ is asymptotically Gaussian with mean 0 and variance −1/α, while $\frac{N^{1/4}}{2}(m_1 - m_2)$ converges in distribution to ρ.
Remark 3.2. Theorem 3.1 also shows the difference between |α| + β = 2, α > 0 and |α| + β = 2, α < 0. Indeed, according to Theorem 1.1, for |α| + β slightly larger than 2 the block magnetizations m_1 and m_2 are close together when α > 0, while they move apart for α < 0. Moreover, notice that the probability measure ρ in Theorem 3.1 is the same as the limiting distribution of the appropriately rescaled magnetization in the Curie-Weiss model at the critical temperature, see [8, Theorem V.9.5].
Proof. The proof of this theorem makes use of the so-called Hubbard-Stratonovich transform.
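For the reader's convenience, the one-dimensional version of the Gaussian identity underlying the Hubbard-Stratonovich transform reads (the proof below uses a two-dimensional analogue with covariance matrix C):

```latex
e^{t^2/2} \;=\; \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{t z - z^2/2}\, dz,
\qquad t \in \mathbb{R},
```

which follows by completing the square. In this way a quadratic term in the exponent can be traded for a linear one under an auxiliary Gaussian integral; equivalently, convolving the law of the magnetization vector with an independent Gaussian produces an absolutely continuous measure whose density can be analyzed directly.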
Introduce the random variables
$$\tilde w_1 := \frac{\sqrt N}{2}(m_1 + m_2), \quad \tilde w_2 := \frac{N^{1/4}}{2}(m_1 - m_2), \quad \hat w_1 := \frac{N^{1/4}}{2}(m_1 + m_2), \quad \hat w_2 := \frac{\sqrt N}{2}(m_1 - m_2),$$
together with the vectors w̃ = (w̃_1, w̃_2) and ŵ = (ŵ_1, ŵ_2), and set
$$\kappa_\alpha := \beta + \alpha, \qquad \eta_\alpha := \beta - \alpha.$$
Note that κ_α > 0 as well as η_α > 0 if |α| < β and |α| + β = 2; moreover, on this critical line η_α = 2 if α < 0, while κ_α = 2 if α > 0. We now begin with the case α < 0. Then κ_α < 2, which will make a difference, as we will see. Our strategy can be summarized as "linearizing" the Hamiltonian by tilting it with a suitable Gaussian random variable (a similar technique was used e.g. in [13]). To this end, let N(0, C) be a 2-dimensional Gaussian distribution with expectation 0 and a suitable covariance matrix C. Denote by ρ_{N,α,β} := μ_{N,α,β} ∘ (w̃)^{−1} the distribution of w̃ under the Gibbs measure and by χ_{N,α,β} := ρ_{N,α,β} * N(0, C) its Gaussian convolution. Computing the Lebesgue density of χ_{N,α,β} on a measurable subset A of ℝ², and using that η_α = 2 when α < 0, one arrives at a representation
$$\chi_{N,\alpha,\beta}(A) = K_3 \int_A \exp\big(-\tilde\Phi(x, y)\big)\, dx\, dy. \tag{3.1}$$
Note that in the above expression both the integral term and the constant K_3 depend on N. However, it suffices to consider the convergence of the integral: once the convergence of the integral is shown for all measurable subsets A ⊂ ℝ², and in particular for A = ℝ², this implies the convergence of K_3 to a constant. Here Φ̃ = Φ̃_N is a shorthand for the negative exponent in (3.1).
Then, by a Taylor expansion of log cosh(z) = ½z² − (1/12)z⁴ + O(z⁶) up to fourth order, we see that the y²-terms in Φ̃ cancel, and so do the xy-terms, so that in the limit only an x²-term and a y⁴-term survive, where the constant in the O(N^{−1/2})-term depends on x and y. Therefore we obtain convergence of the density of the convolution, and the convergence in the O(N^{−1/2})-term is uniform on compact subsets of ℝ². From here it becomes immediately clear that the cases κ_α = 2 and η_α = 2 have to be treated separately. We next show how to extend the convergence to non-compact sets. This is done in the spirit of [20] or [13]. To this end, for a Borel set A ⊂ ℝ² we split the integral into three parts,
$$\int_A = \int_{A \cap B(0,R)} + \int_{A \cap B(0,R)^c \cap B_{r,N}} + \int_{A \cap B_{r,N}^c}. \tag{3.3}$$
Here, for any number R > 0, we denote by B(0, R) the (2-dimensional) ball centered in 0 with radius R, and by B_{r,N} we denote a box growing with N, whose precise definition is given in (3.4). Later R will be sent to ∞ and r will be taken very small. Note, however, that we will first take the limit N → ∞ and then R → ∞. The summands in (3.3) will be called inner region, intermediate region, and outer region, respectively. They will be treated separately.
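The fourth-order Taylor step can be sanity-checked numerically; the following small Python snippet (illustrative only) verifies that the remainder of log cosh(z) after the fourth-order polynomial is positive and of size about z⁶/45, i.e. genuinely O(z⁶):

```python
import math

# Taylor expansion used in the proof:
# log cosh(z) = z^2/2 - z^4/12 + z^6/45 - ...
# so the fourth-order remainder should be positive and ~ z^6/45 for small z.
for z in (0.05, 0.1, 0.2):
    remainder = math.log(math.cosh(z)) - (z ** 2 / 2 - z ** 4 / 12)
    assert 0 < remainder < 2 * z ** 6 / 45
```

The coefficient 1/45 comes from integrating the expansion tanh(z) = z − z³/3 + 2z⁵/15 − . . . term by term.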
As noted earlier, for fixed R > 0 the densities converge uniformly on B(0, R). This already finishes the treatment of the inner region. For the outer region A ∩ B_{r,N}^c, which we will further split up into two regions, we change scales and rewrite the integral accordingly. We denote the free energy of the Curie-Weiss model at inverse temperature β by f_CW(β) and recall that f_CW(β) is non-negative. Setting r_0 accordingly, and, in accordance with (3.4), assuming that r is sufficiently small such that r < r_0, i.e. B_{r,N} ⊂ B_{r_0,N}, we split up the outer region into B_{r_0,N} \ B_{r,N} and B_{r_0,N}^c, and bound the contribution of the latter by exp(−K′_α N) for another constant K′_α > 0.
Solving ∇Φ̃ = 0 is thus equivalent to solving a fixed point equation G(x, y) = (x, y). Obviously, (0, 0) is a solution. To see that (0, 0) is indeed the only solution, we assume that G(x, y) = (x, y) for some (x, y). Note that x = 0 immediately implies y = 0. If y = 0, we have tanh(κ_α x/2) = x, which has only the solution x = 0, since κ_α/2 ≤ 1. Hence, we can assume that x ≠ 0 and y ≠ 0. By the symmetry of tanh, the first equation induced by G(x, y) = (x, y) yields, with g(z) := (tanh(z))′ = 1 − (tanh(z))² and the mean value theorem, an inequality (3.7) for some x_0 between y + κ_α x/2 and y − κ_α x/2 (recall α < 0 for the last inequality). Since 0 ≤ g(z) ≤ 1 for all z ∈ ℝ, the inequality in (3.7) can never be true, and there is no solution to G(x, y) = (x, y) with x ≠ 0, y ≠ 0. This, in turn, implies that on ℝ² the function Φ̃ has a unique minimum at (0, 0). This proves exactly the claim in (3.6); in other words, for all r > 0 there is a constant ξ(r, r_0) > 0 such that Φ̃(x, y) ≥ ξ(r, r_0) for all (x, y) ∈ B_{r_0,N} \ B_{r,N}. From here we get that the contribution of this part of the outer region is exponentially small for N sufficiently large. For the intermediate region we Taylor expand the log cosh terms for x, y ∈ B_{r,N} \ B(0, R); recall the definition of B_{r,N} given in (3.4) and note that x, y ∈ B_{r,N} implies that the arguments of the log cosh terms lie in (−2r, 2r). For z ∈ (−2r, 2r) we use the Taylor approximation and estimate the remainder, using the Lagrange form, by C|z|⁶ for some positive constant C > 0. Applying this to Φ̃ for x, y ∈ B_{r,N} \ B(0, R), and using that the terms with odd powers of x (and y) cancel, this gives the fourth-order bounds (3.9)-(3.11). Now note that for x, y ∈ B_{r,N} the sixth-order remainders are dominated by the fourth-order terms; for instance,
$$C \frac{y^6}{N^{3/2}} \le C y^4 r^3. \tag{3.12}$$
Note that the coefficients of x² resp. of y⁴ on the right hand sides of equations (3.9)-(3.12) become arbitrarily small if we take r small enough.
On the set A^r_{R,N} := A ∩ B(0, R)^c ∩ B_{r,N} we can therefore bound the exponent from below. Since κ_α < 2, there exists some ε > 0 such that for r sufficiently small the resulting x²- and y⁴-coefficients stay bounded away from 0. Hence, for such a small value of r and all N ∈ ℕ, we get an estimate (3.13) whose left hand side implicitly depends on N, while the integrand on the right hand side does not depend on N and is integrable over ℝ²; i.e., the estimate in (3.13) carries over if we take N → ∞ on both sides. Further, as the integrand on the right hand side of (3.14) is again integrable over all of ℝ², the right hand side of (3.14) vanishes for R → ∞. Combining the three considerations above shows that the only contribution to χ_{N,α,β} comes from the inner region. Thus, in particular, the distribution of w̃_2 = (N^{1/4}/2)(m_1 − m_2), convoluted with an independent Gaussian distribution with mean 0 and variance 1/√N, converges to a probability measure ρ on ℝ with density proportional to exp(−x⁴/12). Since the N(0, 1/√N)-distribution converges weakly to the Dirac measure in 0, as N → ∞, this implies that (N^{1/4}/2)(m_1 − m_2) converges to ρ in distribution as well. Moreover, the above calculation yields that the distribution of w̃_1 = (√N/2)(m_1 + m_2), again convoluted with an independent Gaussian, converges to a Gaussian distribution; hence (√N/2)(m_1 + m_2) converges in distribution to a Gaussian distribution with mean zero and variance −1/α. To analyze the second situation, i.e. α > 0 and α + β = 2, we now convolute the distribution of ŵ with respect to μ_{N,α,β,S} with a 2-dimensional normal distribution N(0, C), where this time C is chosen differently (recall that β − α > 0). Calling this convolution χ̂_{N,α,β}, i.e. χ̂_{N,α,β} := μ_{N,α,β} ∘ (ŵ)^{−1} * N(0, C), we can compute its density as above (recall that κ_α = 2 in this case). For a 2-dimensional Borel set A we obtain a representation analogous to (3.1),
$$\hat\chi_{N,\alpha,\beta}(A) = \hat K \int_A \exp\big(-\hat\Phi(x, y)\big)\, dx\, dy, \tag{3.15}$$
where again Φ̂ denotes the negative exponent (but this time C is different).
Expanding log cosh again up to fourth order, we see that now the x²-terms in the exponent cancel, and we obtain the limiting expression with an O(N^{−1/2})-term that depends on x and y but is uniformly bounded for x and y taken from a compact subset of ℝ². For non-compact sets A we note that the integrand in (3.15) equals the respective integrand in the case α < 0, if we interchange the roles of x and y and replace η_α by κ_α (and use log cosh(z) = log cosh(−z)); i.e., using the notation A′ := {(y, x) : (x, y) ∈ A}, we can transfer the estimates from the case α < 0 to χ̂_{N,α,β}(A) via A′. Again dividing the regime of integration into an inner, an intermediate and an outer region, the arguments from the case α < 0 apply verbatim (note that in the case α < 0 we did not use the explicit form of κ_α in that part of the proof, and hence we may replace κ_α by η_α). In particular, ŵ_2 = (√N/2)(m_1 − m_2), convoluted with an independent Gaussian, converges in distribution to a N(0, σ̂²)-distribution. This weak convergence is equivalent to the convergence of the characteristic functions. Computing the characteristic functions of the Gaussian distributions involved in the above convolution, we see that the characteristic function of ŵ_2 in the point t ∈ ℝ converges to that of a centered Gaussian. Therefore, ŵ_2 itself converges in distribution to a centered Gaussian. On the level of weak convergence, or convergence in distribution, this in turn implies that ŵ_1 = (N^{1/4}/2)(m_1 + m_2), convoluted with an independent Gaussian distribution with mean 0 and variance 1/√N, converges to a probability measure ρ on ℝ with density proportional to exp(−x⁴/12). Since the N(0, 1/√N)-distribution converges weakly to the Dirac measure in 0, as N → ∞, this implies that (N^{1/4}/2)(m_1 + m_2) converges to ρ in distribution as well.
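The Gaussian part of Theorem 3.1 can be illustrated numerically: grouping configurations by the number of +1 spins per block makes Gibbs expectations computable exactly for moderate N. The Python sketch below (illustrative parameters on the critical line; the tolerance band is deliberately loose, since the finite-N corrections established in the proof are of order N^{−1/2}) checks that the variance of (√N/2)(m_1 − m_2) is close to the predicted value 2/(2 − β + α) = 2 for α = 0.5, β = 1.5:

```python
from math import comb, exp

def var_diff(N, alpha, beta):
    """Variance of sqrt(N)/2 (m1 - m2) = (S1 - S2)/sqrt(N), computed exactly
    by summing over the block spin counts (k1, k2); S_i is the i-th block sum."""
    n = N // 2
    tot = acc = 0.0
    for k1 in range(n + 1):
        for k2 in range(n + 1):
            s1, s2 = 2 * k1 - n, 2 * k2 - n
            # -H as a function of the block sums S1, S2
            log_w = beta / (2 * N) * (s1 ** 2 + s2 ** 2) + alpha / N * s1 * s2
            w = comb(n, k1) * comb(n, k2) * exp(log_w)
            tot += w
            acc += w * (s1 - s2) ** 2
    return acc / tot / N

# critical line alpha + beta = 2 with alpha > 0: the Gaussian part of
# Theorem 3.1 predicts the limiting variance 2/(2 - beta + alpha) = 2 here
v = var_diff(100, alpha=0.5, beta=1.5)
assert 1.0 < v < 3.0
```

The count-based summation reduces the 2^N-fold sum to an (N/2 + 1)² grid, because the Hamiltonian depends on σ only through the two block sums.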

Proof of Theorem 1.3
We are now ready to give the proof of our central result, Theorem 1.3, which consists of a combination of the previous two sections.
Proof of Theorem 1.3. In view of Theorem 2.1, all we need to do is to estimate Z − Z′ in the critical case. Let us first assume that α > 0. Then, Theorem 3.1 shows that $\frac{1}{\sqrt N}\big(\sum_{i\in S} \sigma_i - \sum_{j\notin S} \sigma_j\big)$ converges to a normal distribution with mean 0 and variance 2/(2 − β + α). In particular,
$$\mathbb E\Big[\Big(\sum_{i\in S} \sigma_i - \sum_{j\notin S} \sigma_j\Big)^2\Big] = \frac{2}{2-\beta+\alpha}\, N\, (1 + o(1)).$$
But
$$\mathbb E\Big[\Big(\sum_{i\in S} \sigma_i - \sum_{j\notin S} \sigma_j\Big)^2\Big] = N + \Big(\frac{N^2}{2} - N\Big) Z - \frac{N^2}{2} Z',$$
where we recall that Z := E(σ_i σ_j), if i ∼ j, and Z′ := E(σ_i σ_j), if i ≁ j. Thus
$$\frac{N^2}{2}(Z - Z') = \frac{2}{2-\beta+\alpha}\, N\, (1 + o(1)) - N + N Z. \tag{4.1}$$
Note that we excluded the case α = 0, β = 2, in which case the right hand side would explode.
Note that Z is of order 1/√N. On the one hand this follows from Theorem 6 and the remark from the paper by Kirsch and Toth [16] cited at the beginning of Section 3, and on the other hand it also follows from Theorem 3.1 directly: As noted in Theorem 3.1 (b), $\frac{1}{N^{3/4}} \sum_i \sigma_i$ converges to a non-degenerate limit distribution. Since Z ≥ |Z′| as a consequence of β ≥ |α| (the case α = β being excluded, of course), this limit theorem implies that N(N−1)Z/N^{3/2} ≤ c for some constant c, which shows that Z is at most of order 1/√N. But then (4.1) shows that
Z − Z′ ∼ C/N for some constant C > 0, (4.2)
where ∼ indicates asymptotic equivalence.
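The orders of Z and Z − Z′ can likewise be explored by exact summation over block spin counts. This hypothetical mini-experiment (not taken from [2] or [16]; parameters are illustrative critical values) computes the intra- and inter-block correlations and confirms Z > Z′ > 0:

```python
from math import comb, exp

def correlations(N, alpha, beta):
    """Exact intra-block correlation Z = E[s_i s_j] (i ~ j, i != j) and
    inter-block correlation Z' (i !~ j), by summing over block spin counts."""
    n = N // 2
    tot = e12 = e11 = 0.0
    for k1 in range(n + 1):
        for k2 in range(n + 1):
            s1, s2 = 2 * k1 - n, 2 * k2 - n
            w = comb(n, k1) * comb(n, k2) * exp(
                beta / (2 * N) * (s1 ** 2 + s2 ** 2) + alpha / N * s1 * s2)
            tot += w
            e12 += w * s1 * s2      # accumulates E[S1 * S2] (unnormalized)
            e11 += w * s1 * s1      # accumulates E[S1^2]   (unnormalized)
    Z = (e11 / tot - n) / (n * (n - 1))   # from E[S1^2] = n + n(n-1) Z
    Zp = e12 / tot / (n * n)              # from E[S1 S2] = n^2 Z'
    return Z, Zp

# critical parameters alpha + beta = 2, alpha > 0: the intra-block
# correlation dominates the inter-block one, consistent with Z - Z' > 0
for N in (20, 40):
    Z, Zp = correlations(N, alpha=0.5, beta=1.5)
    assert Z > Zp > 0
```

Both correlations are extracted from the second moments of the block sums S_1, S_2, which determine the Hamiltonian completely, so no configuration-level enumeration is needed.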
Let us now turn to the other case, where α < 0 and still β + |α| = 2. Then, according to Theorem 3.1, the difference of the spins $\frac{1}{N^{3/4}}\big(\sum_{i\in S} \sigma_i - \sum_{j\notin S} \sigma_j\big)$ converges to the distribution ρ on ℝ with density proportional to exp(−x⁴/12). In particular, ρ has finite variance. Taking into account that again $\mathbb E\big(\sum_{i\in S} \sigma_i - \sum_{j\notin S} \sigma_j\big) = 0$ and following the same calculations as above, we obtain: