Source: Ann. Statist. Volume 38, Number 3 (2010), 1287-1319.
We consider the problem of estimating the graph associated with a binary Ising Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. The method is analyzed under high-dimensional scaling in which both the number of nodes p and the maximum neighborhood size d are allowed to grow as a function of the number of observations n. Our main results provide sufficient conditions on the triple (n, p, d) and the model parameters for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. With coherence conditions imposed on the population Fisher information matrix, we prove that consistent neighborhood selection can be obtained for sample sizes n = Ω(d^3 log p) with exponentially decaying error. When these same conditions are imposed directly on the sample matrices, we show that a reduced sample size of n = Ω(d^2 log p) suffices for the method to estimate neighborhoods consistently. Although this paper focuses on binary graphical models, we indicate how a generalization of the method would apply to general discrete Markov random fields.
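As an illustration of the neighborhood-selection idea described above, the following is a minimal sketch, not the implementation analyzed in the paper. It assumes variables coded as -1/+1, a toy zero-field chain graph, a plain proximal-gradient (ISTA) solver, and a hand-picked regularization weight `lam`; all function names are hypothetical. Combining the per-node neighborhoods with an AND rule is likewise a choice made here for the example.

```python
import math
import random
from collections import Counter


def sample_chain_ising(n, p, theta, rng):
    """Draw n exact samples from a zero-field Ising chain with coupling theta."""
    p_same = 1.0 / (1.0 + math.exp(-2.0 * theta))  # P(x[i+1] == x[i])
    samples = []
    for _ in range(n):
        x = [rng.choice((-1, 1))]
        for _ in range(p - 1):
            x.append(x[-1] if rng.random() < p_same else -x[-1])
        samples.append(tuple(x))
    return samples


def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_neighborhood(samples, r, lam, n_iter=500):
    """l1-regularized logistic regression of node r on all other nodes.

    Minimizes (1/n) * sum log(1 + exp(-2 * x_r * <w, x_rest>)) + lam * ||w||_1
    by proximal gradient descent (ISTA) and returns {node: weight}.
    """
    p = len(samples[0])
    n = len(samples)
    others = [t for t in range(p) if t != r]
    counts = Counter(samples)   # identical samples share one gradient term
    step = 1.0 / len(others)    # safe step: loss curvature is at most p - 1
    w = [0.0] * len(others)
    for _ in range(n_iter):
        grad = [0.0] * len(others)
        for x, c in counts.items():
            z = sum(wj * x[t] for wj, t in zip(w, others))
            resid = x[r] - math.tanh(z)  # x_r minus its fitted conditional mean
            for j, t in enumerate(others):
                grad[j] -= c * resid * x[t] / n
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad)]
    return dict(zip(others, w))


def estimate_graph(samples, lam, rule="and"):
    """Estimate every neighborhood, then combine them into an edge set."""
    p = len(samples[0])
    nbrs = [
        {t for t, w in fit_neighborhood(samples, r, lam).items() if w != 0.0}
        for r in range(p)
    ]
    edges = set()
    for s in range(p):
        for t in range(s + 1, p):
            a, b = t in nbrs[s], s in nbrs[t]
            if (a and b) if rule == "and" else (a or b):
                edges.add((s, t))
    return edges


if __name__ == "__main__":
    rng = random.Random(0)
    samples = sample_chain_ising(n=5000, p=4, theta=0.5, rng=rng)
    print(sorted(estimate_graph(samples, lam=0.2)))
```

The estimated neighborhood of a node is simply the support of its fitted weight vector, so the ℓ1 penalty is what turns the regression into a selection procedure. The value of `lam` here is chosen by hand for the toy problem; in practice it would need to be tuned to n and p.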