Annals of Statistics

Variable selection using MM algorithms

David R. Hunter and Runze Li



Variable selection is fundamental to high-dimensional statistical modeling. Many variable selection techniques may be implemented by maximum penalized likelihood using various penalty functions. Optimizing the penalized likelihood function is often challenging because it may be nondifferentiable and/or nonconcave. This article proposes a new class of algorithms for finding a maximizer of the penalized likelihood for a broad class of penalty functions. These algorithms operate by perturbing the penalty function slightly to render it differentiable, then optimizing this differentiable function using a minorize–maximize (MM) algorithm. MM algorithms are useful extensions of the well-known class of EM algorithms, a fact that allows us to analyze the local and global convergence of the proposed algorithm using some of the techniques employed for EM algorithms. In particular, we prove that when our MM algorithms converge, they must converge to a desirable point; we also discuss conditions under which this convergence may be guaranteed. We exploit the Newton–Raphson-like aspect of these algorithms to propose a sandwich estimator for the standard errors of the estimators. Our method performs well in numerical tests.
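The perturb-then-MM idea described in the abstract can be illustrated on the smallest possible problem. The sketch below is not taken from the article: it assumes a single coefficient, a squared-error loss in place of a general likelihood, and an L1 (LASSO) penalty, and it minimizes the negative penalized objective (so the "minorize–maximize" step appears as "majorize–minimize"). The specific perturbed penalty lam*(|theta| - eps*log(1 + |theta|/eps)) and its quadratic surrogate are assumptions chosen for illustration, and the names mm_fit, perturbed_l1 and objective are hypothetical.

```python
import math


def perturbed_l1(theta, lam, eps):
    """Differentiable perturbation of lam*|theta| (assumed form, eps > 0)."""
    a = abs(theta)
    return lam * (a - eps * math.log1p(a / eps))


def objective(theta, z, lam, eps):
    """Perturbed penalized least squares for one coefficient (minimized)."""
    return 0.5 * (z - theta) ** 2 + perturbed_l1(theta, lam, eps)


def mm_fit(z, lam, eps=1e-8, tol=1e-12, max_iter=10_000):
    """Iterate the MM update; return the estimate and the objective history."""
    theta = z  # start from the unpenalized estimate
    history = [objective(theta, z, lam, eps)]
    for _ in range(max_iter):
        # The perturbed penalty is concave in theta**2, so its tangent line in
        # theta**2 lies above it. This gives the quadratic surrogate
        #   0.5*(z - t)**2 + lam * t**2 / (2*(eps + |theta|)) + const,
        # whose minimizer over t has the closed form below.
        new_theta = z / (1.0 + lam / (eps + abs(theta)))
        history.append(objective(new_theta, z, lam, eps))
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta, history


if __name__ == "__main__":
    for z in (3.0, 0.5, -2.0):
        est, hist = mm_fit(z, lam=1.0)
        print(f"z={z:+.1f}  estimate={est:+.6f}  iterations={len(hist) - 1}")
```

Each update minimizes a surrogate that touches the perturbed objective at the current iterate and lies above it elsewhere, so the recorded objective values never increase. For small eps the iterates approach the soft-thresholding value sign(z)*max(|z| - lam, 0), which shrinks small coefficients to (nearly) zero and is how the penalty performs variable selection in this toy setting.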

Article information

Ann. Statist., Volume 33, Number 4 (2005), 1617-1642.

First available in Project Euclid: 5 August 2005

Primary: 62J12: Generalized linear models 65C20: Models, numerical methods [See also 68U20]

Keywords: AIC, BIC, EM algorithm, LASSO, MM algorithm, penalized likelihood, oracle estimator, SCAD


Hunter, David R.; Li, Runze. Variable selection using MM algorithms. Ann. Statist. 33 (2005), no. 4, 1617--1642. doi:10.1214/009053605000000200.


