Statistical Science

The MM Alternative to EM

Tong Tong Wu and Kenneth Lange


The EM algorithm is a special case of a more general algorithm called the MM algorithm. Specific MM algorithms often have nothing to do with missing data. The first M step of an MM algorithm creates a surrogate function that is optimized in the second M step. In minimization, MM stands for majorize–minimize; in maximization, it stands for minorize–maximize. This two-step process always drives the objective function in the right direction. Construction of MM algorithms relies on recognizing and manipulating inequalities rather than calculating conditional expectations. This survey walks the reader through the construction of several specific MM algorithms. The potential of the MM algorithm in solving high-dimensional optimization and estimation problems is its most attractive feature. Our applications to random graph models, discriminant analysis and image restoration showcase this ability.
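The two-step process described in the abstract can be sketched on a toy problem. This example is not taken from the paper: it is a minimal sketch, assuming the standard quadratic majorizer |r| <= r^2/(2c) + c/2 (with equality at |r| = c), applied to the least-absolute-deviations problem whose minimizer is the sample median. The names `mm_median` and `smoothed_abs`, the starting point, and the cap `eps` are illustrative assumptions. The cap guards against division by zero and means the quantity that is guaranteed to decrease is a slightly smoothed version of the absolute-value objective.

```python
def smoothed_abs(r, eps):
    """|r| outside [-eps, eps]; a quadratic cap inside (assumption, avoids 1/0)."""
    a = abs(r)
    return a if a >= eps else r * r / (2.0 * eps) + eps / 2.0


def mm_median(xs, iters=100, eps=1e-8):
    """Majorize-minimize sketch for f(m) = sum_i |x_i - m| (smoothed near 0).

    Returns the final iterate and the objective value at each iterate.
    """
    m = sum(xs) / len(xs)  # illustrative starting point: the sample mean
    history = []
    for _ in range(iters):
        history.append(sum(smoothed_abs(x - m, eps) for x in xs))
        # First M step (majorize): each term is bounded above by a quadratic
        # that touches it at the current iterate; c_i = max(|x_i - m|, eps).
        weights = [1.0 / max(abs(x - m), eps) for x in xs]
        # Second M step (minimize): the surrogate sum_i w_i (x_i - m)^2 / 2
        # is minimized in closed form by a weighted mean.
        m = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return m, history


estimate, objective_values = mm_median([1.0, 2.0, 3.0, 4.0, 100.0])
```

Because the surrogate lies above the objective and touches it at the current iterate, minimizing the surrogate cannot increase the objective; `objective_values` records this descent. No conditional expectation appears anywhere, only an inequality.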

Article information

Statist. Sci., Volume 25, Number 4 (2010), 492-505.

First available in Project Euclid: 14 March 2011

Keywords: iterative majorization; maximum likelihood; inequalities; penalization


Wu, Tong Tong; Lange, Kenneth. The MM Alternative to EM. Statist. Sci. 25 (2010), no. 4, 492–505. doi:10.1214/08-STS264.

