Electronic Journal of Statistics

Majorization-Minimization algorithms for nonsmoothly penalized objective functions

Elizabeth D. Schifano, Robert L. Strawderman, and Martin T. Wells

The use of penalization, or regularization, has become common in high-dimensional statistical analysis, where an increasingly frequent goal is to simultaneously select important variables and estimate their effects. It has been shown by several authors that these goals can be achieved by minimizing some parameter-dependent “goodness-of-fit” function (e.g., a negative loglikelihood) subject to a penalization that promotes sparsity. Penalty functions that are singular at the origin have received substantial attention, arguably beginning with the Lasso penalty [62].
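In symbols, the generic problem takes the following form (the notation here is ours, for illustration, and is not quoted from the paper): with ℓ(β) the loglikelihood (or another smooth fit criterion) and p_λ a penalty function singular at the origin,

\[
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p} \Big\{ -\ell(\beta) \;+\; \sum_{j=1}^{p} p_{\lambda}(|\beta_j|) \Big\},
\qquad \text{e.g. the Lasso penalty } p_{\lambda}(t) = \lambda t .
\]

The singularity of p_λ at zero is what forces some coordinates of the minimizer exactly to zero, so that variable selection and estimation happen in a single optimization.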

The current literature tends to focus on specific combinations of differentiable goodness-of-fit functions and penalty functions singular at the origin. One result of this combined specificity has been a proliferation in the number of computational algorithms designed to solve fairly narrow classes of optimization problems involving objective functions that are not everywhere continuously differentiable. In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. In the case of penalized regression models, the resulting algorithms employ iterated soft-thresholding, implemented componentwise, allowing for fast and stable updating that avoids the need for inverting high-dimensional matrices. We establish convergence theory under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the EM algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm’s capabilities and versatility.
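To make the abstract's description concrete, the sketch below shows the flavor of such an iteration in the simplest setting, Lasso-penalized least squares; it is an illustration under our own assumptions, not the authors' exact algorithm. Majorizing the quadratic fit term by a separable surrogate whose curvature dominates the largest eigenvalue of X'X (in the spirit of Böhning and Lindsay [9]) turns each MM update into a single componentwise soft-threshold, and a SQUAREM-type extrapolation (cf. Roland and Varadhan [53]; Varadhan and Roland [67]) accelerates the resulting fixed-point iteration.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: S_t(z)_j = sign(z_j) * max(|z_j| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def make_mm_step(X, y, lam):
    """One MM update for 0.5*||y - X b||^2 + lam*||b||_1 (illustrative only).

    The quadratic term is majorized by a separable surrogate with curvature
    L >= lambda_max(X'X), so minimizing the surrogate is a single
    componentwise soft-threshold; no p x p matrix is ever inverted.
    """
    L = np.linalg.norm(X, 2) ** 2              # largest eigenvalue of X'X
    def step(b):
        grad = X.T @ (X @ b - y)               # gradient of the smooth part
        return soft_threshold(b - grad / L, lam / L)
    return step

def squarem(step, b0, max_iter=1000, tol=1e-8):
    """SQUAREM-style acceleration of a fixed-point map (cf. [53, 67])."""
    b = b0.copy()
    for _ in range(max_iter):
        b1 = step(b)
        b2 = step(b1)
        r, v = b1 - b, (b2 - b1) - (b1 - b)
        nv = np.linalg.norm(v)
        if nv < 1e-15:                         # map is (nearly) affine here: accept b2
            return b2
        alpha = -np.linalg.norm(r) / nv        # steplength; alpha = -1 gives plain MM
        b_new = step(b - 2 * alpha * r + alpha ** 2 * v)  # stabilizing MM step
        if np.linalg.norm(b_new - b) < tol:
            return b_new
        b = b_new
    return b

# Toy usage: sparse truth with n < p; lam is an arbitrary illustrative value.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
beta = np.zeros(200); beta[:5] = 2.0
y = X @ beta + 0.1 * rng.standard_normal(50)
b_hat = squarem(make_mm_step(X, y, lam=1.0), np.zeros(200))
```

With the steplength fixed at α = -1 the accelerated scheme reduces to the plain MM iteration; the trailing `step()` call after each extrapolation is the usual SQUAREM safeguard that restores the monotone descent property on which MM convergence arguments rest.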

Article information

Electron. J. Statist., Volume 4 (2010), 1258–1299.

First available in Project Euclid: 12 November 2010

Permanent link to this document: https://projecteuclid.org/euclid.ejs/1289575960

Digital Object Identifier: doi:10.1214/10-EJS582

Primary: 65C60 (Computational problems in statistics); 62J07 (Ridge regression; shrinkage estimators)
Secondary: 62J05 (Linear regression); 62J12 (Generalized linear models)

Keywords: convex optimization; iterative soft thresholding; Lasso penalty; minimax concave penalty; non-convex optimization; smoothly clipped absolute deviation penalty


Schifano, Elizabeth D.; Strawderman, Robert L.; Wells, Martin T. Majorization-Minimization algorithms for nonsmoothly penalized objective functions. Electron. J. Statist. 4 (2010), 1258–1299. doi:10.1214/10-EJS582. https://projecteuclid.org/euclid.ejs/1289575960

References

  • [1] Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 503–511.
  • [2] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York.
  • [3] Ando, T., Suguro, M., Kobayashi, T., Seto, M. and Honda, H. (2003). Multiple fuzzy neural network system for outcome prediction and classification of 220 lymphoma patients on the basis of molecular profiling. Cancer Sci. 94 906–913.
  • [4] Annest, A., Bumgarner, R., Raftery, A. and Yeung, K. Y. (2009). Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinformatics 10 72.
  • [5] Antoniadis, A., Gijbels, I. and Nikolova, M. (2009). Penalized likelihood regression for generalized linear models with nonquadratic penalties. Ann. Inst. Statist. Math. 1–31. doi:10.1007/s10463-009-0242-4.
  • [6] Becker, M. P., Yang, I. and Lange, K. (1997). EM algorithms without missing data. Stat. Methods Med. Res. 6 38–54.
  • [7] Binder, H. and Schumacher, M. (2008). Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics 9 14.
  • [8] Binder, H. and Schumacher, M. (2009). Incorporating pathway information into boosting estimation of high-dimensional risk prediction models. BMC Bioinformatics 10 18.
  • [9] Böhning, D. and Lindsay, B. (1988). Monotonicity of Quadratic-Approximation Algorithms. Ann. Inst. Statist. Math. 40 641–663.
  • [10] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  • [11] Chrétien, S. and Hero, A. O. (2008). On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics 12 308–326.
  • [12] Clarke, F. H. (1990). Optimization and Nonsmooth Analysis. SIAM, Philadelphia.
  • [13] Combettes, P. L. and Wajs, V. R. (2005). Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Model. Simul. 4 1168–1200.
  • [14] Cox, D. R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34 187–220.
  • [15] Cox, D. R. (1975). Partial likelihood. Biometrika 62 269–276.
  • [16] Daubechies, I., Defrise, M. and De Mol, C. (2004). An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint. Commun. Pure Appl. Math. 57 1413–1457.
  • [17] De Mol, C., De Vito, E. and Rosasco, L. (2008). Elastic-Net Regularization in Learning Theory. arXiv.
  • [18] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1–38.
  • [19] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–452.
  • [20] Engler, D. and Li, Y. (2009). Survival analysis with high-dimensional covariates: an application in microarray studies. Statist. Appl. Gen. Mol. Biol. 8 Article 14.
  • [21] Fan, J. and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • [22] Fan, J., Feng, Y., Samworth, R. and Wu, Y. (2009a). SIS: Sure Independence Screening. R package version 0.2.
  • [23] Fan, J., Samworth, R. and Wu, Y. (2009b). Ultrahigh dimensional variable selection: beyond the linear model. Journal of Machine Learning Research 10 1829–1853.
  • [24] Friedman, J., Hastie, T. and Tibshirani, R. (2008). Regularization Paths for Generalized Linear Models via Coordinate Descent. Technical Report, Dept. of Statistics, Stanford University.
  • [25] Fygenson, M. and Ritov, Y. (1994). Monotone estimating equations for censored data. Ann. Statist. 22 732–746.
  • [26] Geman, D. and Reynolds, G. (1992). Constrained restoration and the recovery of discontinuities. IEEE Trans. Patt. Anal. Mach. Intell. 14 367–383.
  • [27] Gui, J. and Li, H. (2005a). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21 3001–3008.
  • [28] Gui, J. and Li, H. (2005b). Threshold Gradient Descent Method for Censored Data Regression with Applications in Pharmacogenomics. Pacific Symposium on Biocomputing 10 272–283.
  • [29] Hale, E. T., Yin, W. and Zhang, Y. (2008). Fixed-point Continuation for l1-minimization: Methodology and Convergence. SIAM J. Optim. 19 1107–1130.
  • [30] Hastie, T. and Efron, B. (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7.
  • [31] Hiriart-Urruty, J. B. and Lemaréchal, C. (1996). Convex Analysis and Minimization Algorithms I: Fundamentals. Springer.
  • [32] Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. Amer. Statist. 58 30–37.
  • [33] Hunter, D. and Li, R. (2005). Variable Selection Using MM Algorithms. Ann. Statist. 33 1617–1643.
  • [34] Johnson, B. A., Lin, D.-Y. and Zeng, D. (2008). Penalized estimating functions and variable selection in semiparametric regression problems. J. Amer. Statist. Assoc. 103 672–680.
  • [35] Johnson, L. M. and Strawderman, R. L. (2009). Induced smoothing for the semiparametric accelerated failure time model: asymptotics and extensions to clustered data. Biometrika 96 577–590.
  • [36] Kim, Y., Choi, H. and Oh, H.-S. (2008). Smoothly Clipped Absolute Deviation on High Dimensions. J. Amer. Statist. Assoc. 103 1665–1673.
  • [37] Lange, K. (1995). A Gradient Algorithm Locally Equivalent to the EM Algorithm. J. Roy. Statist. Soc. Ser. B 57 425–437.
  • [38] Lange, K. (2004). Optimization. Springer, New York.
  • [39] Lange, K., Hunter, D. and Yang, I. (2000). Optimization Transfer Using Surrogate Objective Functions. J. Comput. Graph. Statist. 9 1–20.
  • [40] Li, H. and Gui, J. (2004). Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 20 Suppl. 1, i208–i215.
  • [41] Li, H. and Luan, Y. (2005). Boosting proportional hazards models using smoothing splines, with applications to high-dimensional microarray data. Bioinformatics 21 2403–2409.
  • [42] Luenberger, D. G. and Ye, Y. (2008). Linear and Nonlinear Programming, 3rd ed. International Series in Operations Research & Management Science 116. Springer, New York.
  • [43] Ma, S. and Huang, J. (2007). Additive risk survival model with microarray data. BMC Bioinformatics 8 192.
  • [44] Martinez-Climent, J. A., Alizadeh, A. A., Segraves, R., Blesa, D., Rubio-Moscardo, F., Albertson, D. G., Garcia-Conde, J., Dyer, M. J., Levy, R., Pinkel, D. and Lossos, I. S. (2003). Transformation of follicular lymphoma to diffuse large cell lymphoma is associated with a heterogeneous set of DNA copy number and gene expression alterations. Blood 101 3109–3117.
  • [45] Mazumder, R., Friedman, J. and Hastie, T. (2009). SparseNet: Coordinate Descent with Non-Convex Penalties. Technical Report, Department of Statistics, Stanford University.
  • [46] McLachlan, G. J. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd ed. Wiley-Interscience.
  • [47] Meyer, R. R. (1976). Sufficient conditions for the convergence of monotonic mathematical programming algorithms. J. Comput. System Sci. 12 108–121.
  • [48] Meyer, R. R. (1977). A comparison of the forcing function and point-to-set mapping approaches to convergence analysis. SIAM J. Control Optimization 15 699–715.
  • [49] Nikolova, M. (2000). Local strong homogeneity of a regularized estimator. SIAM J. Appl. Math. 61 633–658.
  • [50] Ortega, J. M. and Rheinboldt, W. C. (1970). Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York.
  • [51] Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. J. Roy. Statist. Soc. Ser. B 69 659–677.
  • [52] Polak, E. (1987). On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Rev. 29 21–89.
  • [53] Roland, C. and Varadhan, R. (2005). New iterative schemes for nonlinear fixed point problems, with applications to problems with bifurcations and incomplete-data problems. Appl. Numer. Math. 55 215–226.
  • [54] Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., Giltnane, J. M., Hurt, E. M., Zhao, H., Averett, L., Yang, L., Wilson, W. H., Jaffe, E. S., Simon, R., Klausner, R. D., Powell, J., Duffey, P. L., Longo, D. L., Greiner, T. C., Weisenburger, D. D., Sanger, W. G., Dave, B. J., Lynch, J. C., Vose, J., Armitage, J. O., Montserrat, E., López-Guillermo, A., Grogan, T. M., Miller, T. P., LeBlanc, M., Ott, G., Kvaloy, S., Delabie, J., Holte, H., Krajci, P., Stokke, T. and Staudt, L. M. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med. 346 1937–1947.
  • [55] Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Ann. Statist. 35 1012–1030.
  • [56] Saeki, K., Miura, Y., Aki, D., Kurosaki, T. and Yoshimura, A. (2003). The B cell-specific major raft protein, Raftlin, is necessary for the integrity of lipid raft and BCR signal transduction. EMBO J. 22 3015–3026.
  • [57] Schifano, E. D. (2010). Topics in Penalized Estimation. PhD thesis, Cornell University.
  • [58] Sha, N., Tadesse, M. G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics 22 2262–2268.
  • [59] She, Y. (2009). Thresholding-based Iterative Selection Procedures for Model Selection and Shrinkage. Electronic Journal of Statistics 3 384–415.
  • [60] Sinisi, S. E., Neugebauer, R. and van der Laan, M. J. (2006). Cross-validated bagged prediction of survival. Statist. Appl. Gen. Mol. Biol. 5 Article 12.
  • [61] Sohn, I., Kim, J., Jung, S. H. and Park, C. (2009). Gradient lasso for Cox proportional hazards model. Bioinformatics 25 1775–1781.
  • [62] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • [63] Tibshirani, R. J. (2009). Univariate shrinkage in the Cox model for high dimensional data. Statist. Appl. Gen. Mol. Biol. 8 Article 21.
  • [64] Tseng, P. (2004). An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research 29 27–44.
  • [65] Tseng, P. and Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming B 117 387–423.
  • [66] Vaida, F. (2005). Parameter convergence for EM and MM algorithms. Statist. Sinica 15 831–840.
  • [67] Varadhan, R. and Roland, C. (2008). Simple and Globally Convergent Methods for Accelerating the Convergence of Any EM Algorithm. Scand. J. Statist. 35 335–353.
  • [68] Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103.
  • [69] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244.
  • [70] Zangwill, W. I. (1969). Nonlinear Programming: A Unified Approach. Prentice-Hall International Series in Management, Englewood Cliffs, N.J.
  • [71] Zhang, C.-H. (2008). Discussion of “One-step sparse estimates in nonconcave penalized likelihood models” by H. Zou and R. Li. Ann. Statist. 36 1553–1560.
  • [72] Zhang, C.-H. (2010). Nearly Unbiased Variable Selection Under Minimax Concave Penalty. Ann. Statist. 38 894–942.
  • [73] Zhang, D. and Zhang, M. (2007). Bayesian profiling of molecular signatures to predict event times. Theor. Biol. Med. Model. 4 3.
  • [74] Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • [75] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 301–320.
  • [76] Zou, H. and Hastie, T. (2008). elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA. R package version 1.0-5.
  • [77] Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1509–1533.
  • [78] Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Ann. Statist. 37 1733–1751.