ESTIMATION OF HIGH-DIMENSIONAL PARTIALLY-OBSERVED DISCRETE MARKOV RANDOM FIELDS

We consider the problem of estimating the parameters of discrete Markov random fields from partially observed data in a high-dimensional setting. Using an ℓ₁-penalized pseudo-likelihood approach, we fit a misspecified model obtained by ignoring the missing-data problem. We derive an estimation error bound that highlights the effect of the misspecification. We report some simulation results that illustrate the theoretical findings.


Introduction and statement of the results
The problem of estimating high-dimensional networks has recently attracted a lot of attention in statistics and machine learning, both in the continuous case using Gaussian graphical models (Drton and Perlman (2004); Meinshausen and Buhlmann (2006); Yuan and Lin (2007); d'Aspremont et al. (2008); Bickel and Levina (2008); Rothman et al. (2008); Lam and Fan (2009)), and in the discrete case using Markov random fields (Banerjee et al. (2008); Höfling and Tibshirani (2009); Ravikumar et al. (2010); Guo et al. (2010)). This paper focuses mainly on discrete Markov random fields (MRF). Let X^(1), ..., X^(n) be n i.i.d. random variables, where X^(i) = (X^(i)_1, ..., X^(i)_p) is a p-dimensional vector of dependent random variables with joint density

    f_θ(x) = Z_θ⁻¹ exp( Σ_{1≤s<s′≤p} θ(s, s′) B(x_s, x_{s′}) ),    x ∈ X^p,    (1)

for some symmetric function B : X × X → R, where X is a finite set. The real-valued symmetric matrix θ = {θ(s, s′), 1 ≤ s < s′ ≤ p} is the network structure and is the parameter of interest.
The term Z_θ is a normalizing constant. This type of statistical model was pioneered by J. Besag (Besag (1974)) under the name auto-model. The nice feature of model (1) is that for any 1 ≤ s ≤ p, the conditional density of X_s given {X_j, j ≠ s} = x ∈ X^{p−1} is

    f_θ^(s)(u | x) = ( Z_θ^(s) )⁻¹ exp( Σ_{j≠s} θ(s, j) B(u, x_j) ),    u ∈ X,

for a normalizing constant Z_θ^(s) = Z_θ^(s)(x). Therefore, θ(s, j) = 0 implies that X_s and X_j are conditionally independent given the other variables X_k, k ∉ {s, j}. Thus estimating θ provides us with the dependence structure and the magnitude of the dependence between these variables.
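To make the conditional densities concrete, here is a minimal Python sketch (our own illustration, on an arbitrary toy network; not code from the paper) that evaluates f_θ^(s)(· | x) for a small auto-model:

    import numpy as np

    def conditional_pmf(theta, B, support, s, x):
        # Conditional pmf f^(s)_theta(u | x) of node s given the other nodes.
        # theta: (p, p) symmetric matrix with zero diagonal; B: symmetric
        # interaction function; support: the finite set X; x: values of all nodes.
        logits = np.array([
            sum(theta[s, j] * B(u, x[j]) for j in range(len(x)) if j != s)
            for u in support
        ])
        w = np.exp(logits - logits.max())   # stabilized exponentiation
        return w / w.sum()                  # normalization by Z^(s)_theta(x)

    # Toy Ising-type example: X = {0, 1}, B(x, y) = x * y.
    p = 4
    theta = np.zeros((p, p))
    theta[0, 1] = theta[1, 0] = 0.8         # edge (0, 1)
    theta[1, 2] = theta[2, 1] = -0.5        # edge (1, 2)
    x = np.array([1, 0, 1, 1])
    print(conditional_pmf(theta, lambda u, v: u * v, [0, 1], s=0, x=x))
    # depends on x only through x[1], since theta(0, 2) = theta(0, 3) = 0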
This paper focuses on the situation where the outcomes X^(i)_j take discrete values (X is a finite set), although extension to a more general setting is possible without much difficulty. A number of recent works have shown that, based on (X^(1), ..., X^(n)), the true network structure, denoted θ⋆, can be consistently estimated using a number of methods, even when the number of entries of θ⋆ is much larger than n (Höfling and Tibshirani (2009); Ravikumar et al. (2010); Guo et al. (2010)).
For computational tractability, a pseudo-likelihood approach is often preferred, even though this approach incurs a certain loss of efficiency. Working mainly with the auto-logistic model (where X = {0, 1} and B(x, y) = xy), Guo et al. (2010) show that the ℓ₂-norm estimation error of the penalized pseudo-likelihood estimator is bounded from above by τ⁻¹√(a log d / n), where a is the number of non-zero elements of θ⋆ and τ is the smallest eigenvalue of the information matrix. Ravikumar et al. (2010) obtained similar results for a one-node-at-a-time ℓ₁-penalized pseudo-likelihood estimator. Xue et al. (2012) also derived some properties of the oracle estimator with the SCAD penalty.
In many situations where network estimation is needed, the network data can be only partially observed, because certain nodes are missing from the sample. For example, in social network analysis, some close friends or siblings might not be part of the survey. As another example, in protein–protein networks, the analysis is often restricted to the specific subgroup of proteins that is believed to play a role in a given biological function. In so doing, some important proteins might be omitted from the analysis. In the Gaussian case, the distribution of the observed nodes remains Gaussian, but its conditional independence structure can be substantially altered by the missing-data problem. Chandrasekaran et al. (2012) considered this issue and studied the problem of recovering the conditional independence structure among the observed nodes (as defined in the complete-data setting). They address the issue by approximating the inverse covariance matrix of the observed nodes by a sum of a sparse matrix and a low-rank matrix. Key to their approach is the fact that the marginal distribution of the observed nodes remains Gaussian, albeit one with an altered covariance matrix. Under some regularity and identifiability conditions, these authors show that the sparse component of their model consistently estimates the covariance matrix (as defined in the complete-data setting) between the observed nodes. This paper considers the same issue for discrete MRF. Unlike the Gaussian case, discrete Markov random field distributions are not closed under marginalization. For example, if there exist r additional nodes denoted p+1, ..., p+r such that the joint distribution of (X_1, ..., X_p, X_{p+1}, ..., X_{p+r}) is an auto-model with network structure {θ(s, s′), 1 ≤ s < s′ ≤ p + r}, then the joint (marginal) distribution of (X_1, ..., X_p) is not of the form (1) in general. To take a specific example, if r = 1 and B(x, y) = B(x)B(y), then the joint (marginal) distribution of (X_1, ..., X_p) is a mixture, over the possible values i ∈ X of the missing node, of models of the form (1) augmented with the external fields θ_i(s) = B(i)θ(s, p + 1). The conditional distributions are also altered. Indeed, and keeping with the assumption r = 1, if |θ(s, p + 1)| > 0, then the conditional density of X_s given {X_ℓ, ℓ ≠ s, 1 ≤ ℓ ≤ p} depends not only on X_ℓ for all ℓ such that |θ(s, ℓ)| > 0, but also on X_k for all k such that |θ(k, p + 1)| > 0. Because the marginal distribution of the observed nodes belongs to a different family than (1), it seems unlikely that the "sparse + low-rank" approach of Chandrasekaran et al.
(2012) would be of much use in this context. We propose to ignore the missing nodes and fit the (misspecified) model (1) to the observed data. It seems plausible that the resulting estimator would still be well-behaved to the extent that the missing-data problem is limited. The goal of the paper is to formalize this idea.
We consider a large (possibly infinite) Markov random field model, where only part of the field is observed, and fit the misspecified model (1) using a penalized pseudo-likelihood approach. We study conditions under which this procedure can recover the true network parameter. We show that the ℓ₂-norm estimation error of the procedure is at most τ⁻¹√a ( √(log d / n) + b ), up to a multiplicative constant factor, where d (resp. a) is the number of possible edges (resp. the number of non-zero entries) of the true network, τ is the smallest eigenvalue of the information matrix, and the term b represents the effect of the missing nodes (see Theorem 1.2 for the exact statement). We conclude that the estimator θ̂n is robust to a small to moderate amount of missing data. We report some simulation results that are consistent with these findings. In practical situations where MRF are used, it is often unclear whether one is dealing with a partially observed field with important missing nodes. The above discussion thus stresses the need for methods of detecting the existence of missing nodes in Markov random field data. We leave this problem for future research.
The remainder of the paper is organized as follows. We define the model and estimator in Section 1.1, followed by the statement of the main result in Section 1.2. A simulation example is presented in Section 1.3. Section 2 develops the technical proofs.
1.1. The setting. Let X be a finite set; X is the sample space of the observations. Let S be a finite non-empty or countably infinite set that we assume, without any loss of generality, to be a subset of the set of integers N. The set S represents the nodes of the network. We use the notation S² def= {(s, ℓ) ∈ S × S : s < ℓ}, the set of all ordered pairs of S. More generally, if Λ is a subset of S, we denote by Λ² the set of all ordered pairs (u, v) ∈ Λ × Λ with u < v.
Let B : X × X → R be a measurable function such that B(x, y) = B(y, x) (symmetry). Throughout the paper, c₁ denotes the finite constant defined in (3), which plays the role of the variance of the interaction statistics B(X_s, x).
For a matrix θ : S × S → R and s ∈ S, the θ-neighborhood of s is the set

    ∂_θ s def= { ℓ ∈ S \ {s} : θ(s, ℓ) ≠ 0 },

and the θ-degree of node s is the (possibly infinite) quantity

    deg_θ(s) def= Σ_{ℓ∈∂_θ s} |θ(s, ℓ)|.

We denote by M(S) the space of all symmetric matrices θ : S × S → R such that deg_θ(s) < ∞ for all s ∈ S. For θ ∈ M(S), let µ_θ be the probability measure on (X^S, E^S) such that, if X = {X_s, s ∈ S} has distribution µ_θ, then the conditional distribution of X_s given {X_ℓ, ℓ ≠ s} exists and has probability mass function

    f_θ^(s)(u | x) = ( Z_θ^(s)(x) )⁻¹ exp( Σ_{ℓ≠s} θ(s, ℓ) B(u, x_ℓ) ),    u ∈ X,    (4)

for a normalizing constant Z_θ^(s)(x). This conditional distribution depends on x only through x_{∂_θ s}; as a result, we will interchangeably write f_θ^(s)(u | x_{∂_θ s}) to mean the same object. We call the process {X_s, s ∈ S} an auto-model Markov random field. We will take for granted that such distributions µ_θ exist. Obviously, this is the case if S is finite. In the case where S is infinite, it can be shown that µ_θ exists for any θ ∈ M(S). This follows for instance from Georgii (1988), Theorem 4.23 (a).
For θ⋆ ∈ M(S), let {X^(i), 1 ≤ i ≤ n} be a sequence of i.i.d. auto-model Markov random fields with distribution µ_θ⋆, defined on some probability space with probability measure P and expectation operator E. Let D be a finite subset of S with cardinality p. We assume that the random fields X^(i) are only observed over D, giving rise to observations (X_D^(1), ..., X_D^(n)), where X_D^(i) = {X_k^(i), k ∈ D}. We are interested in inferring the network parameter {θ⋆(s, ℓ), (s, ℓ) ∈ D²} from the observed data.
Let d def= p(p − 1)/2 and denote by M(D) the set of all symmetric matrices {θ(s, ℓ), (s, ℓ) ∈ D²}, which we identify with R^d. For s ∈ S, we define ∂s def= ∂_θ⋆ s, called the (true) neighborhood of s, and D_s def= D \ {s}. Since the neighborhood system {∂s, s ∈ S} is not known, we introduce the approximate conditional distributions

    f_θ^(s)(u | x_{D_s}) = ( Z_θ^(s)(x) )⁻¹ exp( Σ_{ℓ∈D_s} θ(s, ℓ) B(u, x_ℓ) ),    u ∈ X, x ∈ X^{D_s},    (5)

for θ ∈ M(D). We define the functions

    ℓ_n(θ) def= (1/n) Σ_{i=1}^n Σ_{s∈D} log f_θ^(s)( X_s^(i) | X_{D_s}^(i) ),

and, for some parameter λ_n > 0,

    Q_n(θ) def= ℓ_n(θ) − λ_n Σ_{(s,ℓ)∈D²} |θ(s, ℓ)|.

Finally, we call any element θ̂n of Argmax Q_n a penalized pseudo-likelihood estimator of θ⋆. It is useful to have some simple conditions under which Argmax Q_n is well-defined. It is easy to see that the function Q_n is strictly concave; thus if θ̂n exists, it is necessarily unique. The following result gives an easily verifiable condition under which θ̂n exists.
Proposition 1.1. Suppose that for each s ∈ D there exists a finite constant c(s) such that the bound (2) holds for all θ ∈ M(D), all u ∈ X, and all x ∈ X^{D_s}. Then θ̂n exists and is unique.
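For concreteness, here is a minimal proximal-gradient sketch (our own illustration, not the implementation used in the paper) of the optimization defining θ̂n in the auto-logistic case X = {0, 1}, B(x, y) = xy; the step size and iteration count are arbitrary choices:

    import numpy as np

    def fit_penalized_pseudolikelihood(X, lam, step=0.05, iters=2000):
        # l1-penalized pseudo-likelihood for the auto-logistic model,
        # maximized by proximal gradient ascent. X is an (n, p) binary matrix.
        n, p = X.shape
        Theta = np.zeros((p, p))                 # symmetric, zero diagonal
        for _ in range(iters):
            H = X @ Theta                        # H[i, s] = sum_l Theta[l, s] X[i, l]
            P = 1.0 / (1.0 + np.exp(-H))         # conditional mean of each node
            R = X - P                            # residuals
            G = (X.T @ R + R.T @ X) / (2.0 * n)  # symmetrized gradient of l_n
            np.fill_diagonal(G, 0.0)
            Theta = Theta + step * G             # ascent step on l_n
            Theta = np.sign(Theta) * np.maximum(np.abs(Theta) - step * lam, 0.0)
            np.fill_diagonal(Theta, 0.0)         # soft-threshold: prox of the l1 penalty
        return Theta

    # Hypothetical usage, with X_obs an (n, p) sample of the observed field:
    # lam = 0.5 * np.sqrt(np.log(p) / n)
    # theta_hat = fit_penalized_pseudolikelihood(X_obs, lam)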
1.2. Non-asymptotic estimation error bound. In this section X denotes a Markov random field with distribution µ_θ⋆. Let s ∈ D. Notice that if the entire θ⋆-neighborhood of s (that is, ∂s) is included in D, then the approximate conditional distribution (5) and the true conditional distribution (4) would be the same:

    f_θ⋆^(s)(u | x_{D_s}) = f_θ⋆^(s)(u | x_{S\{s}}),    u ∈ X.

In particular, we would have

    E( B(X_s, X_ℓ) | X_{S\{s}} ) = Σ_{u∈X} B(u, X_ℓ) f_θ⋆^(s)(u | X_{D_s}),    ℓ ∈ D_s.

This motivates the definition

    b def= sup_{s∈D} max_{ℓ∈D_s} | E[ B(X_s, X_ℓ) − Σ_{u∈X} B(u, X_ℓ) f_θ⋆^(s)(u | X_{D_s}) ] |.

The quantity b measures the effect of the missing nodes: it measures how well the misspecified conditional densities f_θ⋆^(s)(u | x_{D_s}) approximate the correct conditional densities f_θ⋆^(s)(u | x_{S\{s}}) in terms of matching the first moment of the statistics B(X_s, x_ℓ). As we will see below, the quantity b is the main effect of the misspecification on the recovery rate of the ℓ₁-penalized pseudo-likelihood estimator.

Let I def= {(s, ℓ) ∈ D² : θ⋆(s, ℓ) ≠ 0} denote the support of θ⋆, with a def= |I|, and let Δ def= {θ ∈ M(D) : θ ≠ 0, Σ_{(s,ℓ)∉I} |θ(s, ℓ)| ≤ 3 Σ_{(s,ℓ)∈I} |θ(s, ℓ)|}. For θ ∈ M(D), we introduce the semi-norm

    ‖θ‖_C² def= Σ_{s∈D} θ_s^T C^(s) θ_s.

For s ∈ D, ℓ, ℓ′ ∈ D_s, we define the random variable

    B̄_ℓ^(s) def= B(X_s, X_ℓ) − Σ_{u∈X} B(u, X_ℓ) f_θ⋆^(s)(u | X_{D_s}),

and we set C^(s)(ℓ, ℓ′) def= E( B̄_ℓ^(s) B̄_ℓ′^(s) ). The family of matrices {C^(s), s ∈ D} plays the role of information matrix. Clearly, each matrix C^(s) is symmetric and positive semi-definite, where we write θ_s def= {θ(s, ℓ), ℓ ∈ D_s}. We impose the following restricted strong convexity-type assumption.
A1 There exists τ > 0 such that

    Σ_{s∈D} θ_s^T C^(s) θ_s ≥ τ ‖θ‖₂²    for all θ ∈ Δ.

Theorem 1.2. Assume A1, and take

    λ_n ≥ 4b + 8c₁ √(log d / n)

such that 48c₁aλ_n < τ. Then

    ‖θ̂n − θ⋆‖₂ ≤ 26 √a λ_n,

with probability at least 1 − 3/d, where θ⋆ = {θ⋆(s, ℓ), (s, ℓ) ∈ D²} and, for u ∈ M(D), ‖u‖₂ def= { Σ_{(s,ℓ)∈D²} |u(s, ℓ)|² }^{1/2}.

Remark 1. Taking λ_n = 4b + 8c₁√(log d / n), and assuming that 48c₁aλ_n < τ, the bound suggests that the convergence rate of the estimator θ̂n is τ⁻¹ a^{1/2} ( c₁√(log d / n) + b ). This shows that in general the estimator is inconsistent for d fixed and n → ∞. When b = 0, we recover Theorem 1 of Guo et al. (2010), with a slight improvement on the requirement on the sample size: here the condition on n reads n ≳ τ⁻² a² log d, whereas Theorem 1 of Guo et al. (2010) imposes n ≳ τ⁻² a³ log d. Although the estimator is inconsistent, if b is small, θ̂n would still give a reasonably good estimate of θ⋆. In such cases, if in addition min_{(s,ℓ)∈I} |θ⋆(s, ℓ)| is comparatively large, one can also correctly recover the sign of {θ⋆(s, ℓ), (s, ℓ) ∈ I} by a simple hard-thresholding rule, where the sign of a vector is defined as the vector of signs. Consider the estimator θ̃n, where

    θ̃n(s, ℓ) = θ̂n(s, ℓ) if |θ̂n(s, ℓ)| > δ, and θ̃n(s, ℓ) = 0 otherwise,

for a thresholding parameter δ. Following Corollary 2 of Meinshausen and Yu (2009), the next result is a direct consequence of Theorem 1.2.

Corollary 1.3. Under the assumptions of Theorem 1.2, and assuming that δ > 26√a λ_n and min_{(s,ℓ)∈I} |θ⋆(s, ℓ)| > 26√a λ_n,

    P( sign(θ̃n) = sign(θ⋆) ) ≥ 1 − 3/d.

1.2.1. On the misspecification parameter b. The misspecification parameter b plays an important role in the results above. Clearly b is related to how much connection there is between the observed and the missing nodes. However, as the next result shows, b is controlled mainly by the strength of the connections between the observed and the missing nodes, not necessarily by the number of missing nodes.

Lemma 1.4. We have b ≤ sup_{x,y} |B(x, y)|; moreover, b admits the sharper bound (9), stated in terms of the strength of the connections between the observed and the missing nodes, where u ∧ v def= min(u, v).

Proof. The fact that b is smaller than sup_{x,y} |B(x, y)| follows directly from its definition. Recall that E denotes the expectation under the true model θ⋆. By first conditioning on {X_ℓ, ℓ ∈ S \ {s}}, and then applying the identity (24), the stated result follows easily.
Better bounds than (9) can be derived with additional assumptions, and in turn such bounds can be used to state Theorem 1.2 with different assumptions. Consider for example the case of the Ising model, where

    X = {0, 1}, and B(x, y) = xy.    (10)

For s ∈ D, we define

    deg_out(s) def= Σ_{j∈S\D} θ⋆(s, j),  and  deg_in(s) def= Σ_{j∈D_s} θ⋆(s, j),

as the strength of the connection between s and the missing nodes, and between s and the observed nodes, respectively. We will also assume that there are only non-negative interactions in the network: θ⋆(s, ℓ) ≥ 0 for all (s, ℓ) ∈ S². This assumption is mostly technical and made to simplify the analysis. Recall that c₁ is defined in (3), D ⊆ S, p = |D|, and d = p(p − 1)/2.
Corollary 1.5. Assume A1, (10), and suppose that θ⋆(s, ℓ) ≥ 0 for all (s, ℓ) ∈ S². Take λ_n = 4b + 8c₁√(log d / n). If n, D, S, and θ⋆ are such that

    nτ² ≥ 2(64²) c₁² a² log(2d),    480 c₁ a √(log d / n) < τ,

and

    sup_{s∈D} deg_out(s) e^{−deg_in(s)} ≤ (1/2) √(log d / n),    (11)

then ‖θ̂n − θ⋆‖₂ ≤ 26√a λ_n, with probability at least 1 − 3/d.

Remark 2. The result formalizes the intuition that the estimation procedure will work well when D is large and the nodes in D interact only weakly with the missing nodes. If deg_in(s) is large, deg_out(s) would need to be exponentially larger to result in a large bias. However, note that the result still does not imply that θ̂n is consistent for d fixed, since for d fixed, (11) will fail as n → ∞. Obviously this result is not of much practical use, because deg_in(s) and deg_out(s) are typically unknown. A more promising approach would be to develop misspecification tests to detect the existence of missing nodes. However, this is beyond the scope of the paper and is left for possible future research.

Proof. It suffices to show that

    b ≤ sup_{s∈D} deg_out(s) e^{−deg_in(s)};    (12)

indeed, (11) and (12) then imply 48c₁aλ_n ≤ 480 c₁ a √(log d / n) < τ, and the corollary follows from Theorem 1.2. It remains to show (12). We denote logit⁻¹(x) = e^x/(1 + e^x), and G(x) = e^x(1 + e^x)⁻² its derivative. In the case of the Ising model, the conditional means are given by

    E( X_s | X_{S\{s}} ) = logit⁻¹( Σ_{j∈∂s} θ⋆(s, j) X_j ).

Notice that G(x) ≤ e^{−|x|} for all x ∈ R. Hence, by a Taylor expansion, the gap between the true conditional mean of X_s and its counterpart based on the observed nodes only is at most Σ_{j∈∂s\D} θ⋆(s, j)X_j times an exponentially small factor. Since 1 − e^{−y} ≤ y for all y ≥ 0, and using also Jensen's inequality, the expectation of this gap can be bounded. We saw earlier that E( X_s | X_{S\{s}} ) = logit⁻¹( Σ_{j∈∂s} θ⋆(s, j) X_j ) ≥ 1/2, the interactions being non-negative, and (12) follows.
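To illustrate the role of b and of the bound (12), here is a brute-force sketch (our own illustration; it assumes the Ising specialization (10) and enumerates all configurations, so it only runs for very small fields):

    import itertools
    import numpy as np

    def exact_b(theta, obs):
        # Exact value of b for a tiny Ising model (X = {0,1}, B(x,y) = xy).
        # theta: symmetric matrix over ALL nodes; obs: the observed nodes D.
        m = theta.shape[0]
        states = np.array(list(itertools.product([0, 1], repeat=m)))
        energy = 0.5 * np.einsum('ij,ki,kj->k', theta, states, states)
        probs = np.exp(energy - energy.max())
        probs /= probs.sum()                  # exact joint law of the full field
        b = 0.0
        for s in obs:
            D_s = [l for l in obs if l != s]
            # conditional mean of X_s under the misspecified model (5)
            cond = 1.0 / (1.0 + np.exp(-(states[:, D_s] @ theta[s, D_s])))
            for l in D_s:
                bias = np.sum(probs * (states[:, s] - cond) * states[:, l])
                b = max(b, abs(bias))
        return b

    # Four observed nodes (0..3) and one missing node (4) attached to node 0.
    theta = np.zeros((5, 5))
    theta[0, 1] = theta[1, 0] = 0.8
    theta[2, 3] = theta[3, 2] = 0.6
    theta[0, 4] = theta[4, 0] = 1.0             # edge to the missing node
    b_val = exact_b(theta, obs=[0, 1, 2, 3])
    bound = theta[0, 4] * np.exp(-theta[0, 1])  # deg_out(0) * exp(-deg_in(0))
    print(b_val, bound)                         # b versus the (12)-type bound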
1.2.2. On assumption A1. A1 is a type of restricted eigenvalue assumption, similar to Assumption RE(s, c₀) of Bickel et al. (2009). This assumption is not easy to check, but following the analysis of Bickel et al. (2009), it is possible to derive sufficient conditions that give some intuition into when A1 holds. For simplicity we consider the case of product-form functions: B(x, y) = B₀(x)B₀(y). Then

    Σ_{s∈D} θ_s^T C^(s) θ_s = Σ_{s∈D} E[ v_s(X_{D_s}) ( Σ_{ℓ∈D_s} θ(s, ℓ) B₀(X_ℓ) )² ],

where v_s(x) def= E[ ( B₀(X_s) − Σ_{u∈X} B₀(u) f_θ⋆^(s)(u | x) )² | X_{D_s} = x ]. Thus, assuming that there exists a finite constant α > 0 such that

    inf_{s∈D} inf_{x∈X^{D_s}} v_s(x) ≥ α,    (13)

we have

    Σ_{s∈D} θ_s^T C^(s) θ_s ≥ α Σ_{s∈D} E[ ( Σ_{ℓ∈D_s} θ(s, ℓ) B₀(X_ℓ) )² ].

Assumption (13) is similar to Assumption 2 of Meinshausen and Buhlmann (2006), and is typically not restrictive; we show below that it holds for the auto-logistic model. The difficulty lies in dealing with the covariance matrix of the (local) fields {B₀(X_ℓ), ℓ ∈ D_s}. Following Bickel et al. (2009) (Section 4), the next result captures the intuition that the covariance matrix of {B₀(X_ℓ), ℓ ∈ D_s} is positive definite if the covariance matrix between the neighbors of s is positive definite and the covariance between neighbors of s and non-neighbors of s is weak. This bears some similarity with the dependency assumption A and the incoherence assumption B of Guo et al. (2010).
Proposition 1.6. Assume (13), and suppose that there exists ρ > 0 such that, for all u ∈ R^{|∂s|},

    inf_{s∈D} Σ_{ℓ∈∂s} Σ_{ℓ′∈∂s} u_ℓ u_ℓ′ Cov( B₀(X_ℓ), B₀(X_ℓ′) ) ≥ ρ Σ_{ℓ∈∂s} u_ℓ²,

together with the incoherence condition (14) controlling the covariances between the neighbors and the non-neighbors of s. Then, for all θ ∈ Δ, the quadratic form Σ_{s∈D} θ_s^T C^(s) θ_s is bounded below by a positive multiple of ‖θ‖₂²; in particular, A1 holds.

Proof. Using (13), ‖θ‖_C² ≥ α Σ_{s∈D} E[ ( Σ_{ℓ∈D_s} θ(s, ℓ) B₀(X_ℓ) )² ]. Expanding the square and separating the pairs of neighbors of s from the remaining pairs, and noting that Σ_{s∈D} Σ_{ℓ∈∂s∩D} θ(s, ℓ)² = 2‖θ‖₂², the result follows using (14).

1.3. Example and Monte Carlo Evidence. We consider the example of the auto-logistic model where X = {0, 1} and B(x, y) = xy. For the simulations, we consider three cases: p = 50, p = 80, and p = 100. For each setting, we consider different sample sizes n ∈ {50, 100, ..., 500}. The sparsity (the proportion of non-zero entries) of θ⋆ is set to 1%, and the non-zero entries of θ⋆ are generated uniformly from the interval [0.3, 1]. Having all the entries of θ⋆ non-negative allows us to simulate exactly (instead of using MCMC) from the Ising distribution µ_θ⋆ using the Propp–Wilson perfect sampler. For all the simulations, we set the regularization parameter to λ = 0.5√(log p / n). To quantify the amount of missing data, we use the upper bound on b established in (12).

We compare three settings. In Setting 1, there is no missing data, and the samples are generated exactly from (1). In Settings 2 and 3, we generate the sample (X_1, ..., X_{p+r}) from (1), for θ = θ⋆, and we retain only (X_1, ..., X_p); thus there are r missing nodes. In Setting 2, we use r = 2, whereas in Setting 3, we set r = 20. Table 1 shows the corresponding values of b in each setting. Regardless of the data-generation mechanism, we fit model (1) by ℓ₁-penalized pseudo-likelihood and compute the relative mean square error E‖θ̂n − θ⋆‖₂² / ‖θ⋆‖₂², estimated from K replications of the estimator (K = 10). In Figure 1, we plot E‖θ̂n − θ⋆‖₂² / ‖θ⋆‖₂² as a function of the sample size. A code sketch of this data-generating mechanism is given at the end of this subsection.
As expected, the estimation error decreases with the sample size. Also, the more missing data, the worse the estimator behaves. We also observe that in Setting 2, where r = 2, the value of b is the same in the cases p = 50 and p = 100, but the estimation error is noticeably more affected by the missing data for p = 50. This seems in agreement with the conclusion of Corollary 1.5.
We also compute the proportion of signs of θ⋆ that is correctly recovered. This is plotted in Figure 2 for p = 100. The estimator θ̃n described in Corollary 1.3 performs much better than the initial estimator θ̂n. For the computation of θ̃n, we follow Corollary 1.3 and use a threshold of δ = √a λ, where a is the number of non-zero entries of θ⋆. For large sample sizes, the sign recovery of θ̃n is perfect, even in the presence of missing nodes (the three lines are almost indistinguishable). A sketch of this thresholding rule is given below.
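The thresholding rule and the sign-recovery proportion can be sketched as follows (our own illustration; theta_hat, theta_star, and lam are hypothetical inputs, e.g. from the fitting sketch above):

    import numpy as np

    def hard_threshold(theta_hat, delta):
        # Hard-thresholded estimator of Corollary 1.3.
        return np.where(np.abs(theta_hat) > delta, theta_hat, 0.0)

    def sign_recovery(theta_tilde, theta_star):
        # Proportion of correctly recovered signs over the d = p(p-1)/2 edges.
        iu = np.triu_indices(theta_star.shape[0], k=1)
        return np.mean(np.sign(theta_tilde[iu]) == np.sign(theta_star[iu]))

    # Hypothetical usage:
    # a = np.count_nonzero(theta_star[np.triu_indices(p, k=1)])
    # theta_tilde = hard_threshold(theta_hat, np.sqrt(a) * lam)
    # print(sign_recovery(theta_tilde, theta_star))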
Notice that the penalty parameter λ = 0.5√(log p / n) varies (decreases) with n. This negatively affects the basic estimator θ̂n but does not affect θ̃n.

Table 1. Values of b in each setting of the simulation: Setting 1 (r = 0), Setting 2 (r = 2), Setting 3 (r = 20).
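The data-generating mechanism can be sketched as follows. Note that, for simplicity, this illustration uses a plain Gibbs sampler with an arbitrary burn-in, rather than the Propp–Wilson perfect sampler used in the paper; the sparsity and edge-weight choices follow Section 1.3:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_theta(p, sparsity=0.01, low=0.3, high=1.0):
        # Sparse symmetric network with non-negative entries, as in Section 1.3.
        theta = np.zeros((p, p))
        iu = np.triu_indices(p, k=1)
        mask = rng.random(len(iu[0])) < sparsity
        theta[iu[0][mask], iu[1][mask]] = rng.uniform(low, high, mask.sum())
        return theta + theta.T

    def gibbs_sample(theta, sweeps=200):
        # One (approximate) draw from the auto-logistic field mu_theta.
        p = theta.shape[0]
        x = rng.integers(0, 2, size=p)
        for _ in range(sweeps):               # burn-in length is arbitrary
            for s in range(p):
                eta = theta[s] @ x            # diagonal of theta is zero
                x[s] = rng.random() < 1.0 / (1.0 + np.exp(-eta))
        return x

    p, r = 50, 2                              # Setting 2: r = 2 missing nodes
    theta_star = make_theta(p + r)
    X_full = np.array([gibbs_sample(theta_star) for _ in range(100)])
    X_obs = X_full[:, :p]                     # retain only the observed nodes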

Proof of Theorem 1.2
We define

    U_n(θ) def= −ℓ_n(θ + θ⋆) + ℓ_n(θ⋆) + λ_n( ‖θ + θ⋆‖₁ − ‖θ⋆‖₁ ).

Clearly, U_n is strictly convex, U_n(0) = 0, and U_n is minimized at θ̂n − θ⋆. We recall that

    Δ = {θ ∈ M(D) : θ ≠ 0, ‖θ_{I^c}‖₁ ≤ 3‖θ_I‖₁},

and for r > 0 we set Δ_r def= {θ ∈ Δ : ‖θ‖₂ = r}. The next two lemmas are adaptations of Lemmas 1 and 4 from Negahban et al. (2012). We give a proof for completeness.
Lemma 2.1. On the event {‖∇ℓ_n(θ⋆)‖_∞ ≤ λ_n/2}, we have θ̂n − θ⋆ ∈ Δ.

Proof. Write −ℓ_n(θ) = −(1/n) Σ_{i=1}^n Σ_{s∈D} log f_θ^(s)( X_s^(i) | X_{D_s}^(i) ), which is a convex function of θ by virtue of Lemma 3.1. It is also not hard to see that

    U_n(θ) ≥ −‖∇ℓ_n(θ⋆)‖_∞ ‖θ‖₁ + λ_n( ‖θ + θ⋆‖₁ − ‖θ⋆‖₁ ).

Therefore, using the convexity of −ℓ_n and (15), it follows that on the above event, U_n(θ) > 0 for any θ ∉ Δ. Since U_n(0) = 0, we necessarily have U_n(θ̂n − θ⋆) ≤ 0, which implies, in view of the above bound, that θ̂n − θ⋆ ∈ Δ.

Lemma 2.2. Suppose that θ̂n − θ⋆ ∈ Δ, and that inf_{θ∈Δ_r} U_n(θ) > 0 for some r > 0. Then ‖θ̂n − θ⋆‖₂ ≤ r.

Proof. Since U_n is convex with U_n(0) = 0, we have U_n(tv) ≤ t U_n(v) for t ∈ [0, 1]. If ‖θ̂n − θ⋆‖₂ > r, then, Δ being a cone, the segment joining 0 to θ̂n − θ⋆ crosses Δ_r at a point with non-positive U_n value, contradicting inf_{θ∈Δ_r} U_n(θ) > 0.
The main idea of the proof is to show that, under the assumptions of the theorem, the event {inf_{v∈Δ_r} U_n(v) > 0 and ‖∇ℓ_n(θ⋆)‖_∞ ≤ λ_n/2} occurs with high probability, with r = r_n appropriately chosen. To make the proof easier to follow, we include the following intermediary step. For s ∈ D, ϑ ∈ R^{p−1}, and x ∈ X^{D_s}, we define the log-normalizing function

    ϑ ↦ log Σ_{u∈X} exp( Σ_{ℓ∈D_s} ϑ_ℓ B(u, x_ℓ) ).

We recall the notation θ_s = {θ(s, ℓ), ℓ ∈ D_s}.
Lemma 2.3. Consider the event

    E_n(τ) def= { ‖∇ℓ_n(θ⋆)‖_∞ ≤ λ_n/2 } ∩ { the restricted curvature bound of A1 holds for the sample objective with constant τ }.

Suppose that there exists τ > 0 such that τ > 48c₁aλ_n, and that the event E_n(τ) holds. Then ‖θ̂n − θ⋆‖₂ ≤ 26√a λ_n.

Proof. We know from Lemma 2.1 that on E_n(τ), θ̂n − θ⋆ ∈ Δ. Set r_n = 26a^{1/2}λ_n. We will show that on E_n(τ), inf_{θ∈Δ_{r_n}} U_n(θ) > 0, and use Lemma 2.2 to conclude that ‖θ̂n − θ⋆‖₂ ≤ r_n. For θ ∈ Δ, and on the event E_n(τ), U_n(θ) is bounded below by a quadratic curvature term minus the penalty contribution (17). From the expression of the approximate conditional distribution f_θ^(s)(x_s | x) given in (5), we have

    log f_θ^(s)(x_s | x) = ⟨θ_s, B(x_s, x)⟩ − log Z_θ^(s)(x),

where Z_θ^(s)(x) = Σ_{u∈X} exp( Σ_{ℓ∈D_s} θ(s, ℓ) B(u, x_ℓ) ), θ_s = {θ(s, ℓ), ℓ ∈ D_s}, and ⟨u, v⟩ here denotes the usual inner product in R^{D_s}. Fix s ∈ D and x ∈ X^{D_s}. We now apply the self-concordant bound developed in Lemma 3.2 to the function t ↦ log Z_{θ⋆+tθ}^(s)(x); the constant c in Lemma 3.2 is given here by sup_{u∈X} sup_{x,y∈X} |B(u, x) − B(u, y)|. Since Σ_{ℓ∈D_s} |θ(s, ℓ)| ≤ ‖θ‖₁, we combine the above with (18) to conclude that, for all θ ∈ Δ, the curvature term dominates, using the fact that E_n(τ) holds. This bound and (17) yield that, for θ ∈ Δ_{r_n}, U_n(θ) > 0 provided 48c₁aλ_n < τ.
We now show that the event E_n(τ) occurs with high probability.
Lemma 2.4. For any λ_n ≥ 4b + 8c₁√(log d / n), P( ‖∇ℓ_n(θ⋆)‖_∞ > λ_n/2 ) ≤ 2/d.

Proof. Set δ_n = 8c₁√(log d / n). We calculate that, for (s, ℓ) ∈ D², the component ∇ℓ_n(θ⋆)(s, ℓ) is an average of n i.i.d. bounded random variables. By the definition of b, its expectation is at most 2b in absolute value. Therefore, for each (s, ℓ) ∈ D², and by Hoeffding's inequality, P( |∇ℓ_n(θ⋆)(s, ℓ)| > 2b + δ_n/2 ) is exponentially small in nδ_n². We conclude by the union-sum inequality, given the choice δ_n = 8c₁√(log d / n). Theorem 1.2 now follows by combining Lemma 2.3 and Lemma 2.4.

Appendix
3.1. Convexity and strong convexity-type results. Let (X, A, ν) be a measure space, for some positive measure ν. Let B : X × R^p → R be such that x ↦ B(x, θ) is measurable, and define

    F(θ) def= log ∫_X e^{B(x,θ)} ν(dx).

We gather here two key results on F. We write |·| (resp. |·|₁) to denote the Euclidean norm (resp. ℓ₁-norm) of R^p. Lemma 3.2 relies on Lemma 3.3, which is taken from Bach (2010), Lemma 1.
Lemma 3.1. Suppose that the function θ ↦ B(x, θ) is convex for ν-almost all x ∈ X. Then F is convex.
Lemma 3.2. Suppose also that ν is a finite measure and that B(x, θ) = ⟨θ, ψ(x)⟩ for a measurable map ψ : X → R^p, and set

    c def= sup_{x,y∈X} | ⟨u, ψ(x)⟩ − ⟨u, ψ(y)⟩ |.

Then

    F(θ + u) − F(θ) − ⟨∇F(θ), u⟩ ≥ ( (e^{−c} + c − 1)/c² ) Var( ⟨u, ψ(X)⟩ ) ≥ Var( ⟨u, ψ(X)⟩ )/(2 + c),

where the variance is taken under the distribution µ_θ(dx) = e^{B(x,θ)} ν(dx) / ∫_X e^{B(x,θ)} ν(dx).

Proof.
Set g(t) def= F(θ + tu). For t ∈ R, consider the probability measure on X defined by

    m_t(dx) = e^{⟨θ+tu, ψ(x)⟩} ν(dx) / ∫_X e^{⟨θ+tu, ψ(x)⟩} ν(dx),

and write E_t for the expectation with respect to m_t. Clearly, for t = 0, m_t = µ_θ. Under the assumptions of the lemma, g has derivatives of any order, and we verify that

    g′(t) = E_t( ⟨u, ψ(X)⟩ ),  g″(t) = Var_t( ⟨u, ψ(X)⟩ ),  g‴(t) = E_t( ( ⟨u, ψ(X)⟩ − E_t⟨u, ψ(X)⟩ )³ ).

Therefore |g‴(t)| ≤ c g″(t). It then follows from Lemma 3.3 that

    g(1) − g(0) − g′(0) ≥ ( (e^{−c} + c − 1)/c² ) Var₀( ⟨u, ψ(X)⟩ ).
Next notice that for all x ≥ 0 we have the inequality e^{−x} + x − 1 ≥ x²/(2 + x). The result follows.
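This elementary inequality can be checked directly: set h(x) = (2 + x)(e^{−x} + x − 1) − x². Then h(0) = 0 and

    h′(x) = (e^{−x} + x − 1) + (2 + x)(1 − e^{−x}) − 2x = 1 − (1 + x)e^{−x} ≥ 0,

since e^x ≥ 1 + x; hence h(x) ≥ 0 for all x ≥ 0, which is the claimed inequality.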
Lemma 3.3 (Bach (2010), Lemma 1). Let g : R → R be three times differentiable, with g″ ≥ 0 and |g‴(t)| ≤ c g″(t) for all t, for some constant c > 0. Then, for all t ≥ 0,

    g(t) ≥ g(0) + g′(0) t + g″(0) ( e^{−ct} + ct − 1 )/c².

Proof. The proof follows essentially from Gronwall's lemma. See Bach (2010), Lemma 1, for details.
The next identity, (24), compares integrals of the form ∫ f(y) e^{g(y)} Z_g⁻¹ ν(dy) under two different pairs (f₁, g₁) and (f₂, g₂).

Proof. Under the stated assumptions, the function t ↦ ∫ f_t(y) e^{g_t(y)} Z_t⁻¹ ν(dy) is differentiable under the integral sign, and the difference

    ∫ f₂(y) e^{g₂(y)} Z_{g₂}⁻¹ ν(dy) − ∫ f₁(y) e^{g₁(y)} Z_{g₁}⁻¹ ν(dy)

is obtained by integrating this derivative over t ∈ [0, 1]. The identity (24) follows by carrying out the differentiation under the integral sign.
In particular,

    | log ∫ e^{h₂(y)} ν(dy) − log ∫ e^{h₁(y)} ν(dy) | ≤ ‖h₂ − h₁‖_∞.    (25)


Figure 1: Relative MSE versus sample size n, where star-line is Setting 1, square-line is Setting 2, triangle-line is Setting 3.