The Annals of Statistics

Statistical guarantees for the EM algorithm: From population to sample-based analysis

Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu


Abstract

The EM algorithm is a widely used tool for maximum-likelihood estimation in incomplete data problems. Existing theoretical work has focused on conditions under which the iterates or likelihood values converge, and the associated rates of convergence. Such guarantees do not distinguish whether the ultimate fixed point is a near-global optimum or a bad local optimum of the sample likelihood, nor do they relate the obtained fixed point to the global optima of the idealized population likelihood (obtained in the limit of infinite data). This paper develops a theoretical framework for quantifying when and how quickly EM-type iterates converge to a small neighborhood of a given global optimum of the population likelihood. For correctly specified models, such a characterization yields rigorous guarantees on the performance of certain two-stage estimators in which a suitable initial pilot estimator is refined with iterations of the EM algorithm. Our analysis is divided into two parts: a treatment of the EM and first-order EM algorithms at the population level, followed by results that apply to these algorithms on a finite set of samples. Our conditions allow for a characterization of the region of convergence of EM-type iterates to a given population fixed point, that is, the region of the parameter space over which convergence is guaranteed to a point within a small neighborhood of the specified population fixed point. We verify our conditions and give tight characterizations of the region of convergence for three canonical problems of interest: symmetric mixture of two Gaussians, symmetric mixture of two regressions, and linear regression with covariates missing completely at random.
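For concreteness, the following is a minimal NumPy sketch (not taken from the paper) of the sample EM operator and a first-order (gradient) EM step for the first canonical example above, the symmetric mixture of two Gaussians 0.5·N(θ*, σ²I) + 0.5·N(−θ*, σ²I). The E-step weights, M-step average and gradient step follow the standard EM derivation for this model; the data-generation settings, pilot estimate and step-size convention are illustrative assumptions.

    import numpy as np

    def sample_em_operator(theta, y, sigma=1.0):
        # E-step: posterior probability that each y_i was drawn from the +theta
        # component of the mixture 0.5*N(theta, sigma^2 I) + 0.5*N(-theta, sigma^2 I).
        w = 1.0 / (1.0 + np.exp(-2.0 * (y @ theta) / sigma ** 2))
        # M-step: M_n(theta) = (1/n) * sum_i (2*w_i - 1) * y_i
        return np.mean((2.0 * w - 1.0)[:, None] * y, axis=0)

    def gradient_em_operator(theta, y, sigma=1.0, step=1.0):
        # First-order EM: a single ascent step on Q_n(. | theta) from theta,
        # rather than a full M-step; the 1/sigma^2 factor is folded into `step`.
        return theta + step * (sample_em_operator(theta, y, sigma) - theta)

    # Two-stage recipe sketched in the abstract: a pilot estimate inside the
    # region of convergence, refined by EM iterations (all numbers illustrative).
    rng = np.random.default_rng(0)
    d, n, sigma = 2, 2000, 1.0
    theta_star = np.array([1.5, 0.0])
    signs = rng.choice([-1.0, 1.0], size=n)
    y = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))

    theta = np.array([0.8, 0.3])          # hypothetical pilot estimator
    for _ in range(50):
        theta = sample_em_operator(theta, y, sigma)
    print(theta)                          # lands near +theta_star or -theta_star

Replacing the full update with gradient_em_operator in the loop gives the first-order EM variant analyzed alongside standard EM in the paper.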

Article information

Source
Ann. Statist., Volume 45, Number 1 (2017), 77–120.

Dates
Received: September 2015
Revised: January 2016
First available in Project Euclid: 21 February 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1487667618

Digital Object Identifier
doi:10.1214/16-AOS1435

Mathematical Reviews number (MathSciNet)
MR3611487

Zentralblatt MATH identifier
1367.62052

Subjects
Primary: 62F10: Point estimation
60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]
Secondary: 90C30: Nonlinear programming

Keywords
EM algorithm; first-order EM algorithm; nonconvex optimization; maximum likelihood estimation

Citation

Balakrishnan, Sivaraman; Wainwright, Martin J.; Yu, Bin. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist. 45 (2017), no. 1, 77--120. doi:10.1214/16-AOS1435. https://projecteuclid.org/euclid.aos/1487667618



References

  • [1] Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M. and Telgarsky, M. (2012). Tensor decompositions for learning latent variable models. Preprint. Available at arXiv:1210.7559.
  • [2] Anandkumar, A., Jain, P., Netrapalli, P. and Tandon, R. (2013). Learning sparsely used overcomplete dictionaries via alternating minimization. Technical report, Microsoft Research, Redmond, WA.
  • [3] Balakrishnan, S., Wainwright, M. J. and Yu, B. (2014). Supplement to “Statistical guarantees for the EM algorithm: From population to sample-based analysis.” DOI:10.1214/16-AOS1435SUPP.
  • [4] Balan, R., Casazza, P. and Edidin, D. (2006). On signal reconstruction without phase. Appl. Comput. Harmon. Anal. 20 345–356.
  • [5] Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41 164–171.
  • [6] Beale, E. M. L. and Little, R. J. A. (1975). Missing values in multivariate analysis. J. R. Stat. Soc. Ser. B. Stat. Methodol. 37 129–145.
  • [7] Bertsekas, D. P. (1995). Nonlinear Programming. Athena Scientific, Belmont, CA.
  • [8] Bubeck, S. (2014). Theory of convex optimization for machine learning. Unpublished manuscript.
  • [9] Buldygin, V. V. and Kozachenko, Yu. V. (2000). Metric Characterization of Random Variables and Random Processes. Translations of Mathematical Monographs 188. Amer. Math. Soc., Providence, RI. Translated from the 1998 Russian original by V. Zaiats.
  • [10] Candès, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66 1241–1274.
  • [11] Celeux, G., Chauveau, D. and Diebolt, J. (1995). On stochastic versions of the EM algorithm. Technical Report No. 2514, INRIA.
  • [12] Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Comput. Statist. Data Anal. 14 315–332.
  • [13] Chaganty, A. T. and Liang, P. (2013). Spectral experts for estimating mixtures of linear regressions. Unpublished manuscript.
  • [14] Chen, Y., Yi, X. and Caramanis, C. (2013). A convex formulation for mixed regression: Near optimal rates in the face of noise. Unpublished manuscript.
  • [15] Chrétien, S. and Hero, A. O. (2008). On EM algorithms and their proximal generalizations. ESAIM Probab. Stat. 12 308–326.
  • [16] Dasgupta, S. and Schulman, L. (2007). A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J. Mach. Learn. Res. 8 203–226.
  • [17] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. Stat. Methodol. 39 1–38.
  • [18] Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics 14 174–194.
  • [19] Healy, M. and Westmacott, M. (1956). Missing values in experiments analysed on automatic computers. J. R. Stat. Soc. Ser. C. Appl. Stat. 5 203–206.
  • [20] Hero, A. O. and Fessler, J. A. (1995). Convergence in norm for alternating expectation-maximization (EM) type algorithms. Statist. Sinica 5 41–54.
  • [21] Hsu, D. and Kakade, S. M. (2012). Learning Gaussian mixture models: Moment methods and spectral decompositions. Preprint. Available at arXiv:1206.5766.
  • [22] Iturria, S. J., Carroll, R. J. and Firth, D. (1999). Polynomial regression and estimating functions in the presence of multiplicative measurement error. J. R. Stat. Soc. Ser. B. Stat. Methodol. 61 547–561.
  • [23] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
  • [24] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin.
  • [25] Liu, C. and Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81 633–648.
  • [26] Loh, P. and Wainwright, M. J. (2012). Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In ISIT 2601–2605. IEEE, Piscataway, NJ.
  • [27] Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B. Stat. Methodol. 44 226–233.
  • [28] Ma, J. and Xu, L. (2005). Asymptotic convergence properties of the EM algorithm with respect to the overlap in the mixture. Neurocomputing 68 105–129.
  • [29] McLachlan, G. and Krishnan, T. (2007). The EM Algorithm and Extensions. Wiley, New York.
  • [30] Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. J. R. Stat. Soc. Ser. B. Stat. Methodol. 51 127–138.
  • [31] Meng, X. (1994). On the rate of convergence of the ECM algorithm. Ann. Statist. 22 326–339.
  • [32] Meng, X. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80 267–278.
  • [33] Meng, X. and Rubin, D. B. (1994). On the global and componentwise rates of convergence of the EM algorithm. Linear Algebra Appl. 199 413–425.
  • [34] Neal, R. M. and Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models (M. I. Jordan, ed.) 355–368. MIT Press, Cambridge, MA.
  • [35] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA.
  • [36] Netrapalli, P., Jain, P. and Sanghavi, S. (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems 2796–2804. Curran Associates, Red Hook, NY.
  • [37] Orchard, T. and Woodbury, M. A. (1972). A missing information principle: Theory and applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Theory of Statistics 1 697–715. Univ. California Press, Berkeley, CA.
  • [38] Pearson, K. (1894). Contributions to the Mathematical Theory of Evolution. Harrison and Sons, London.
  • [39] Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26 195–239.
  • [40] Rubin, D. B. (1974). Characterizing the estimation of parameters in incomplete-data problems. J. Amer. Statist. Assoc. 69 467–474.
  • [41] Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scand. J. Stat. 1 49–58.
  • [42] Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82 528–550.
  • [43] Tseng, P. (2004). An analysis of the EM algorithm and entropy-like proximal point methods. Math. Oper. Res. 29 27–44.
  • [44] van Dyk, D. A. and Meng, X. L. (2000). Algorithms based on data augmentation: A graphical representation and comparison. In Computing Science and Statistics: Proceedings of the 31st Symposium on the Interface 230–239. Interface Foundation of North America, Fairfax Station, VA.
  • [45] van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Univ. Press, Cambridge.
  • [46] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • [47] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210–268. Cambridge Univ. Press, Cambridge.
  • [48] Wang, Z., Gu, Q., Ning, Y. and Liu, H. (2014). High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. Unpublished manuscript.
  • [49] Wei, G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithm. J. Amer. Statist. Assoc. 85 699–704.
  • [50] Wu, C.-F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103.
  • [51] Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput. 8 129–151.
  • [52] Xu, Q. and You, J. (2007). Covariate selection for linear errors-in-variables regression models. Comm. Statist. Theory Methods 36 375–386.
  • [53] Yi, X., Caramanis, C. and Sanghavi, S. (2013). Alternating minimization for mixed linear regression. Unpublished manuscript.

Supplemental materials

  • Supplement to “Statistical guarantees for the EM algorithm: From population to sample-based analysis”. The supplement [3] contains all remaining technical proofs omitted from the main text due to space constraints.