Majorization-Minimization algorithms for nonsmoothly penalized objective functions

Abstract: The use of penalization, or regularization, has become common in high-dimensional statistical analysis, where an increasingly frequent goal is to simultaneously select important variables and estimate their effects. It has been shown by several authors that these goals can be achieved by minimizing some parameter-dependent "goodness-of-fit" function (e.g., a negative loglikelihood) subject to a penalization that promotes sparsity. Penalty functions that are singular at the origin have received substantial attention, arguably beginning with the Lasso penalty [62]. The current literature tends to focus on specific combinations of differentiable goodness-of-fit functions and penalty functions singular at the origin. One result of this combined specificity has been a proliferation in the number of computational algorithms designed to solve fairly narrow classes of optimization problems involving objective functions that are not everywhere continuously differentiable. In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. In the case of penalized regression models, the resulting algorithms employ iterated soft-thresholding, implemented componentwise, allowing for fast and stable updating that avoids the need for inverting high-dimensional matrices. We establish convergence theory under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the EM algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm's capabilities and versatility.


Introduction
Variable selection remains an important and challenging issue in statistics. Modern methods, increasingly based on the principle of penalized likelihood estimation and often used in high-dimensional regression problems, attempt to achieve this goal through an adaptive variable selection process that simultaneously permits estimation of regression effects. The literature on penalized minimization of a "goodness-of-fit" function (e.g., negative loglikelihood), with a penalty singular at the origin, has become vast and continues to proliferate, in part due to the consideration of specific combinations of goodness-of-fit and penalty functions, the associated statistical properties of resulting estimators, and the development of several combination-specific optimization algorithms [e.g., 21, 24, 51, 62, 74, 75, 78].
Our primary goal in this paper is to propose a unified optimization framework that utilizes the Majorization-Minimization (MM) algorithm [e.g., 32, 38, 39] as the primary optimization tool. The resulting class of algorithms is referred to as MIST, an acronym for Minimization by Iterative Soft Thresholding. The MM algorithm has been considered previously in solving specific classes of singularly penalized likelihood estimation problems [e.g., 16, 33, 77]; to a large extent, this work is motivated by these ideas. Important advantages of the proposed work include the exceptional versatility of the class of MIST algorithms, their associated ease of implementation and numerical stability, and the availability of a fixed point convergence theory that permits weaker assumptions than existing papers in this area. We emphasize that the focus of this paper is on the development of a stable and versatile class of algorithms applicable to a wide variety of singularly penalized estimation problems. In particular, the consideration of asymptotic and oracle properties of estimators, as well as methods for effectively choosing associated penalty parameters, are not focal points of this paper. A reasonably comprehensive treatment of these results may be found in Johnson, Lin and Zeng [34], where asymptotics and oracle properties for estimators derived from a general class of penalized estimating equations are developed in some detail.
The paper is organized as follows. In Section 2, we provide some general background on the class of MM algorithms. Section 2.2, in particular, introduces important notation and summarizes a set of useful sufficient conditions for local convergence of general MM algorithms applied to a large and interesting class of penalized optimization problems. In Section 3, we present a more specialized version of the general algorithm and show how the minimization step of the MM algorithm can be carried out using iterated soft-thresholding. In its most general form, iterated soft-thresholding is required at each minimization step. However, in the context of penalized estimation for the class of generalized linear regression models, we further show that a judicious choice of majorization function allows one to carry out this minimization step componentwise and in one iteration. Simulation results are provided in Section 4 and an application in survival analysis to Diffuse Large B Cell Lymphoma expression data [54] is presented in Section 5. We conclude with a discussion in Section 6. Proofs and other relevant results are collected in the Appendix.

Review
A Majorization-Minimization (MM) algorithm is not really a single algorithm but rather a term that more aptly describes a general principle for solving a difficult minimization problem by transferring this problem to a related surrogate function that is much easier to minimize [39]. The acronym MM, as pointed out in the rejoinder to the discussion of Lange, Hunter and Yang [39], can also stand for "Minorization-Maximization" if one desires to maximize, rather than minimize, an objective function. The Expectation-Maximization (EM) algorithm [18], originally developed in the context of missing data applications, is an important special case of the class of minorization-maximization algorithms [39]. As shown in Becker, Yang and Lange [6], the "optimization transfer" principle that underlies its construction exists independently of the missing data setting and leads to a powerful and general tool for constructing algorithms. Lange, Hunter and Yang [39] attribute one of the earliest examples of this class of algorithms to Ortega and Rheinboldt [50], as well as identify several later statistically-oriented examples of algorithms falling into this class.
Since β^(n+1) is the minimum of the surrogate function at β^(n), the MM procedure forces ξ(β) downhill at each iteration, i.e., ξ(β^(n+1)) ≤ ξ(β^(n)) for every n ≥ 0. Provided that the objective function, its surrogate and the mapping M(•) satisfy certain regularity conditions, one can also establish "convergence" of this algorithm. For example, Lange [38, Proposition 10.3.4] proves convergence of the MM iteration sequence assuming, among other things, that the objective functions ξ(β) and ξ^[S](β, α) are twice continuously differentiable; see Lange [37, Proposition 6] for related results on the generalized EM algorithm. However, weaker statements of convergence are also possible in problems that lack this degree of smoothness. For example, Lange, Hunter and Yang [39, Sec. 3] summarize conditions under which any limit point of the sequence β^(n+1) = M(β^(n)) is also a stationary point of some continuous objective function ξ(β); see Lange [38] and Lange [37] for related theoretical developments. Convergence, at least as ordinarily interpreted, can be expected provided that ξ(β) has a unique minimum; however, more generally, such "convergence" results do not necessarily imply that the MM iteration sequence itself converges to a unique limit.
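The descent property just described can be sketched in a few lines of generic code. This is only a schematic, with illustrative names: the surrogate minimizer M(α) = argmin_β ξ^[S](β, α) is assumed to be supplied by the user, and the stopping rule is one simple choice among many.

```python
import numpy as np

def mm_minimize(xi, argmin_surrogate, beta0, max_iter=500, tol=1e-8):
    """Generic MM iteration: beta_{n+1} = M(beta_n), where M(alpha)
    minimizes a surrogate that majorizes xi at alpha.

    `xi` evaluates the objective; `argmin_surrogate(alpha)` returns
    argmin_beta of the surrogate at alpha.  Both are user-supplied.
    """
    beta = np.asarray(beta0, dtype=float)
    history = [xi(beta)]
    for _ in range(max_iter):
        beta_new = argmin_surrogate(beta)
        history.append(xi(beta_new))
        # Majorization guarantees the descent property
        # xi(beta_{n+1}) <= xi(beta_n) at every iteration.
        if abs(history[-2] - history[-1]) <= tol * (1.0 + abs(history[-2])):
            return beta_new, history
        beta = beta_new
    return beta, history
```

For instance, for ξ(β) = β² the surrogate ξ^[S](β, α) = β² + (β − α)² majorizes ξ with equality at β = α, and its minimizer M(α) = α/2 drives the iterates monotonically toward zero.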
This last observation is relevant to the convergence analysis of algorithms designed to solve singularly penalized regression problems. Common and increasingly important examples lacking the differentiability requirements of Lange [38, Proposition 10.3.4] include all penalized regression problems involving the lasso penalty [62], adaptive lasso penalty [74], elastic net penalty [75], and the smoothly clipped absolute deviation (SCAD) penalty [21]. In order to properly analyze algorithmic convergence in these settings, an appropriately general theory of convergence is required. One such theory is developed in Appendix A.1 and complements existing convergence theory for the EM and MM algorithms that may be found in Wu [68], Lange [37], Lange, Hunter and Yang [39], Tseng [64], Lange [38] and Chrétien and Hero [11], among other places. Two important contributions of these results include useful refinements of existing theory for the EM and MM algorithms as well as a set of sufficient conditions that are relatively straightforward to verify for the general class of penalized optimization problems considered in the next section.

Objective functions with nondifferentiable, separable penalties
In this section, we summarize convergence results for generic MM algorithms that are intended to minimize an objective function of the form

ξ(β) = g(β) + p(β; λ) + λε‖β‖², λ > 0, ε ≥ 0, (2)

where g(β) is a continuous "goodness of fit" function (e.g., a negative loglikelihood), p(β; λ) is a continuous but non-differentiable penalty function, and ‖•‖ denotes the usual Euclidean vector norm. Further regularity conditions will be given below; as shown later, the resulting class of problems represented by (2) contains the vast majority of penalized regression problems currently under investigation in the statistics literature. It also covers numerous additional problems by expanding the class of permissible goodness-of-fit and penalty functions in a substantial way.
We assume throughout that g(β) is convex with at least one bounded local minimizer. This implies that g(β) is coercive [e.g., 38, Chapter 10]; that is, g(β) becomes unbounded as ‖β‖ → ∞. The convexity of g(β) further implies that g(β) is Lipschitz continuous on each compact subset of R^p (i.e., locally Lipschitz continuous). It follows that ∇g(β) exists for almost all β. We further assume that

p(β; λ) = Σ_{j=1}^p p(|β_j|; λ_j), (3)

where the vector λ = (λ_1^T, ..., λ_p^T)^T and λ_j denotes the block of λ associated with β_j. It is assumed that each λ_j has dimension greater than or equal to one, that all blocks have the same dimension, and that λ_{j1} = λ for each j ≥ 1. Evidently, the case where dim(λ_j) = 1 for j ≥ 1 simply corresponds to the setting in which each coefficient is penalized in exactly the same way; permitting the dimension of λ_j to exceed one allows the penalty to depend on additional parameters (e.g., weights, such as in the case of the adaptive lasso considered in Zou [74]). We are interested in problems with penalization; therefore, λ is assumed bounded and strictly positive throughout this paper. Several specific examples will be discussed below. For a bounded vector θ having λ > 0 as its first element, and the remainder of θ collecting any additional parameters used to define the penalty, the scalar function p(r; θ) is assumed to satisfy the following condition:

(P1) p(r; θ) is a continuously differentiable concave function on (0, ∞) with p(0; θ) = 0; p′(r; θ) ≥ 0 for r > 0; and p′(0+; θ) is positive and finite.

Evidently, (P1) implies that p′(r; θ) > 0 for r ∈ (0, K_θ), where K_θ > 0 may be finite or infinite. The combination of the concavity and nonnegative derivative conditions implies that the penalty increases away from the origin, but with a decreasing rate of growth that may become zero. The case where (3) is identically zero for r > 0 is ruled out by the positivity of the right derivative at the origin; similarly, the concavity assumption also rules out the possibility of a strictly convex penalty term. Neither of these restrictions is particularly problematic. Our specific interest lies in the development of algorithms for estimation problems subject to a penalty singular at the origin. Were the singularity absent, or the penalty replaced by a strictly convex term, the convexity of g(β) implies that (2) can be minimized directly using any suitable convex optimization algorithm, such as that discussed in Theorem 3.2 below.
Under the conditions specified above, the objective function (2) is not necessarily convex and may have multiple local minima. Theorem 2.1 establishes local convergence of the indicated class of MM algorithms for minimizing objective functions of the form (2). A proof is provided in Appendix A.2, where it is shown that the conditions imposed in the statement of the theorem are sufficient conditions for the application of the general MM local convergence theory summarized in Appendix A.1.

Theorem 2.1. Let g(•) and p(•; λ) satisfy the indicated assumptions. Let h(β, α) ≥ 0 be a real-valued, continuous function of β and α that is continuously differentiable in β for each α and satisfies h(β, α) = 0 when β = α. Let

ξ^[S](β, α) = g(β) + h(β, α) + q(β, α; λ) + λε‖β‖², (4)

where q(β, α; λ) = Σ_j q(|β_j|, |α_j|; λ_j) and q(r, s; θ) = p(s; θ) + p′(s; θ)(r − s) for r, s ≥ 0. Assume S, the set of stationary points for ξ(β), is both non-empty and finite, where the notion of a stationary point is defined as in Clarke [12]. Then: (ii) q(β, α; λ) − p(β; λ) ≥ 0 for all β ≠ α (possibly identically zero).
For convenience, Table 1 summarizes the various function definitions that will be used throughout the remainder of the paper.
Condition (iii) of Theorem 2.1 establishes convergence under the assumption that ξ^[S](β, α) strictly majorizes ξ(β) and has a unique minimizer in β for each α. Such a uniqueness condition is shown by Vaida [66] to ensure convergence of the EM and MM algorithms to a stationary point under more restrictive differentiability conditions. Importantly, the assumption of globally strict majorization is only a sufficient condition for convergence; this condition is only important insofar as it guarantees a strict decrease in the objective function at every iteration. It is possible to relax this condition to locally strict majorization, in which ξ^[S](β, α) majorizes ξ(β), strict majorization being necessary only in an open neighborhood containing M(α).

Table 1. Summary of function definitions.
ξ(β)                                    objective function to be minimized
g(β)                                    "goodness-of-fit" term
p(β; λ) = Σ_j p(|β_j|; λ_j)             nonsmooth penalty term
h(β, α)                                 function used to majorize g(β)
q(β, α; λ) = Σ_j q(|β_j|, |α_j|; λ_j)   function used to majorize p(β; λ)
M(α)                                    minimization map for MM algorithm

The use of the MM algorithm and selection of (4) are motivated by the results of Zou and Li [77]; we refer the reader to Remark 3.1 below for further comments in this direction. The assumptions on g(β) clearly cover the case of the linear and canonically parametrized generalized linear models upon setting g(β) = −ℓ(β), where ℓ(β) denotes the corresponding loglikelihood function.
Other prominent examples include estimation under the semiparametric Cox regression model [14] and accelerated failure time models, covered upon setting g(β) to be either the negative logarithm of the partial likelihood function [e.g., 2, Theorem VII.2.1] or the Gehan objective function [e.g., 25, 35].
The proposed penalty specification also covers the smoothly clipped absolute deviation [SCAD; e.g., 21] penalty upon setting p(r; λ_j) = p_S(r; λ, a) for each j ≥ 1, where p_S(r; λ, a) is defined as the definite integral of

λ { I(u ≤ λ) + ((aλ − u)_+ / ((a − 1)λ)) I(u > λ) }

on the interval 0 ≤ u ≤ r and some fixed value of a > 2 (e.g., a = 3.7).
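The SCAD derivative admits a direct implementation. The sketch below uses the standard form of the derivative popularized by Fan and Li [21]; the penalty itself is recovered by numerically integrating the derivative, mirroring the integral definition above (a closed form also exists).

```python
import numpy as np

def scad_deriv(r, lam, a=3.7):
    """Derivative p'_S(r; lam, a) of the SCAD penalty for r >= 0:
    equals lam for r <= lam, and (a*lam - r)_+ / (a - 1) for r > lam.
    """
    r = np.asarray(r, dtype=float)
    return np.where(r <= lam,
                    lam,
                    np.maximum(a * lam - r, 0.0) / (a - 1.0))

def scad_penalty(r, lam, a=3.7, grid=2000):
    """p_S(r; lam, a) obtained by trapezoidal integration of the
    derivative on [0, r], following the definition in the text."""
    u = np.linspace(0.0, float(r), grid)
    v = scad_deriv(u, lam, a)
    du = u[1] - u[0]
    return float(du * (v.sum() - 0.5 * (v[0] + v[1 - 2])))
```

Note that the derivative vanishes for r ≥ aλ, so the penalty flattens out at the constant value (a + 1)λ²/2, which is precisely the property that makes SCAD nearly unbiased for large coefficients.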
Remark 2.2. The SCAD and MCP penalties are not strictly concave and lead to surrogate majorizers that fail to satisfy the globally strict majorization condition in (iii) of Theorem 2.1 unless h(β, α) is strictly positive whenever β ≠ α; see Remark 3.1 for further discussion and also Theorem 3.4 below.

A simple algorithm for convex objective functions
The class of MM algorithms suggested by Theorem 2.1 provides a very general and useful framework for proposing new algorithms in penalized estimation problems, the key to which is a methodology for solving the minimization problem (1), a step repeated with each iteration of the MM algorithm. Successful application requires the construction of a suitable majorizing function that can be more easily minimized than the desired objective function. In this regard, it is helpful to note that the problem of minimizing ξ^[S](β, α) for a given α is equivalent to minimizing

m(β) + Σ_{j=1}^p p′(|α_j|; λ_j)|β_j| (10)

in β. In particular, if m(β) = g(β) + λε‖β‖² + h(β, α) is strictly convex for each α, which clearly occurs if both g(β) and h(β, α) are convex in β and at least one is strictly convex, then (10) is also strictly convex and the corresponding minimization problem has a unique solution.
Remark 3.1. For ε = 0, h(β, α) = 0 and g(β) = −ℓ(β), with ℓ(β) = Σ_{i=1}^n ℓ_i(β) and each ℓ_i(β) a twice continuously differentiable loglikelihood function, the majorizer used by the MM algorithm induced by the surrogate function (10) corresponds (up to sign) to the minorizer employed in the LLA algorithm of Zou and Li [77], an improvement on the so-called LQA algorithm proposed in Hunter and Li [33]. Zou and Li [77, Proposition 1] assert convergence of their LLA algorithm under imprecisely stated assumptions and are additionally unclear as to the nature of the convergence results actually established. For example, while Zou and Li [77, Theorem 1] demonstrate that the LLA algorithm does indeed have an ascent property, their results do not establish the convergence of the LLA solution sequence.
In contrast, Theorem 2.1 shows that strict majorization, under a few precisely stated conditions, is sufficient to ensure local convergence of the resulting MM algorithm to a stationary point of (2).In Section 3.2, it is further demonstrated how a particular choice of h(β, α) yields a strict majorizer that permits both closed form minimization and componentwise updating at each step of the MM algorithm, even in the case of penalties that fail to be strictly concave.
Numerous methods exist for minimizing a differentiable convex objective function [e.g., 10]. However, because (10) is not differentiable, such methods do not apply in the current setting. Specialized methods exist for nonsmooth problems of the form (10) in settings where g(β) has a special structure; a well-known example here is LARS [19], which can be used to efficiently solve lasso-type problems in the case where g(β) is replaced by a least squares objective function. Recently, Combettes and Wajs [13, Proposition 3.1; Theorem 3.4] proposed a general class of fixed point algorithms for minimizing f_1(h) + f_2(h), where f_i(•), i = 1, 2 are each convex and h takes values in some real Hilbert space H. Hale, Yin and Zhang [29, Theorem 4.5] specialize the results of Combettes and Wajs [13] to the case where H is some subset of R^p and f_2(h) = Σ_{j=1}^p |h_j|. The collective application of these results to the problem of minimizing (10) generates an iterated soft-thresholding procedure with an appealingly simple structure. Theorem 3.2, given below, states the algorithm along with conditions under which the algorithm is guaranteed to converge; a proof is provided in Appendix A.3. The resulting class of procedures, that is, MM algorithms in which the minimization of (10) is carried out via iterated soft-thresholding, is hereafter referred to as MIST, an acronym for (M)inimization by (I)terated (S)oft (T)hresholding. Two important and useful features of MIST include the absence of high-dimensional matrix inversion and the ability to update each individual parameter separately.

Theorem 3.2. Let p(•; λ) satisfy the assumptions of Section 2.2. Suppose m(β) in Table 1 is strictly convex with a Lipschitz continuous derivative of order L^{-1} > 0 for each bounded vector α. Then, for any such α and a constant ̟ ∈ (0, 2L), the unique minimizer of (10) can be obtained in a finite number of iterations using iterated soft-thresholding:

1. Set n = 1 and initialize the algorithm with a bounded vector b^(0).
2. Compute b^(n) = S(b^(n−1) − ̟∇m(b^(n−1)); ̟τ), where, for any vectors z and τ in R^p, S(z; τ) = Σ_{j=1}^p e_j S(z_j; τ_j), e_j denotes the jth unit vector for R^p,

S(z; τ) = sign(z)(|z| − τ)_+, z, τ ∈ R, (11)

is the univariate soft-thresholding operator, and τ = (p′(|α_1|; λ_1), ..., p′(|α_p|; λ_p))^T.
3. Stop if converged; else, set n = n + 1 and return to Step 2.

Theorem 3.4 of Combettes and Wajs [13] shows that the update in Step 2 can be generalized to

b^(n) = b^(n−1) + δ_n (S(b^(n−1) − ̟_n ∇m(b^(n−1)); ̟_n τ) − b^(n−1)),

where ̟_n ∈ (0, 2L) and δ_n ∈ (0, 1] is a suitable sequence of relaxation constants. Judicious selection of these constants, possibly updated at each step, can improve the convergence rate. In principle, the minimization algorithm of Theorem 3.2 can also be replaced with any other suitable minimization algorithm. For example, since the penalty term appearing in (10) is in fact a separable convex function of the parameters, one could instead employ the coordinate gradient descent method recently proposed in Tseng and Yun [65]. An advantage of the proposed approach is its computational simplicity; moreover, as will be seen in Section 3.2, the proposed soft-thresholding update arises naturally in a wide class of minimization problems of interest to statisticians.

Theorem 3.2 imposes the condition that the gradient of m(β) is globally L^{-1}-Lipschitz continuous. The role of this condition, also imposed in Combettes and Wajs [13, Proposition 3.1; Theorem 3.4], is to ensure that the update at each step of the proposed algorithm is a contraction, thereby guaranteeing its convergence to a fixed point; see Schifano [57] for a proof of this result. However, in view of the generality of the Contraction Mapping Theorem [e.g., 42, Theorem 10.2.1], it is possible to relax the requirements on ∇m(β) provided that one selects a suitable starting point. A useful extension is summarized in the corollary below; one may prove this result in a manner similar to Theorem 4.5 of Hale, Yin and Zhang [29]. As a reminder to the reader, the relevant optimization
problem at this stage involves a specified vector α having bounded elements.
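The inner loop of Theorem 3.2 can be sketched as follows. This is a schematic with illustrative names: `grad_m` stands in for ∇m(β), `step` for the constant ̟, `tau` for the vector of penalty derivatives, and the stopping rule is one simple realization of "Stop if converged" in Step 3.

```python
import numpy as np

def soft_threshold(z, tau):
    """Componentwise soft-thresholding S(z; tau) = sign(z) * (|z| - tau)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def iterated_soft_threshold(grad_m, tau, b0, step, max_iter=5000, tol=1e-10):
    """Inner minimization of (10) by iterated soft-thresholding:
    b <- S(b - step * grad_m(b); step * tau), as in Step 2 of Theorem 3.2."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        b_new = soft_threshold(b - step * grad_m(b), step * tau)
        if np.max(np.abs(b_new - b)) <= tol:  # Step 3: stop if converged
            return b_new
        b = b_new
    return b
```

As a sanity check, for m(b) = (1/2)‖b − y‖² (so ∇m(b) = b − y) with unit step, the iteration reaches the fixed point S(y; τ) immediately, which is the well-known closed-form solution in the orthonormal case.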
Corollary 3.3. Let p(•; λ) satisfy the assumptions of Section 2.2 and let α be given. Suppose m(β) is a strictly convex and twice continuously differentiable function of β ∈ Ω, where Ω ⊂ R^p is a convex, compact set. Then, there exists a unique minimizer β* of (10) on Ω and the algorithm of Theorem 3.2 converges to β* in a finite number of iterations provided that b^(0) ∈ Ω, λ* = max_{β∈Ω} λ_max(β) < ∞ and ̟ ∈ (0, 2/λ*), where λ_max(β) denotes the maximum eigenvalue of ∇²m(β).

Some useful insight into the form of the proposed thresholding algorithm can be gained by considering the behavior of the penalty derivative term p′(r; θ). As suggested earlier, (P1) implies that p′(r; θ) decreases from its maximum value towards zero as r moves away from the origin. For some penalties (e.g., SCAD, MCP), this derivative actually becomes zero at some finite value of r > 0, resulting in situations for which τ_j = p′(|α_j|; λ_j) = 0 for at least one j. If this occurs for component j, then the jth component of the vector S(b^(n) − ̟∇m(b^(n)); ̟τ) simply reduces to the jth component of the argument vector b^(n) − ̟∇m(b^(n)). In the extreme case where τ = 0, the proposed update reduces to b^(n+1) = b^(n) − ̟∇m(b^(n)), a steepest descent step; equivalently, the algorithm takes an inexact Newton step in which the inverse hessian matrix is replaced by ̟I_p, I_p denoting the p × p identity matrix, and with step-size chosen to ensure that this update yields a contraction. Hence, if each of the components of b^(n) − ̟∇m(b^(n)) is sufficiently large in magnitude, the proposed algorithm simply takes an inexact Newton step towards the solution; otherwise, one or more components of this vector may be thresholded. Notably, replacing ̟I_p with any diagonal matrix having bounded entries preserves the componentwise nature of the proposed algorithm; alternative strategies that both adapt the step size to each component and maintain the indicated convergence properties are worthy of further investigation.

Penalized estimation for generalized linear models
The combination of Theorems 2.1, 3.2 and Corollary 3.3 leads to a useful and stable class of algorithms with the ability to deal with a wide range of penalized regression problems. In settings where g(β) is strictly convex and twice continuously differentiable, one can safely set h(β, α) = 0 for all choices of β and α provided that p′(r; θ) in (P1) is strictly positive for r > 0; important examples of statistical estimation problems here include many commonly used linear and generalized linear regression models, semiparametric Cox regression [14], and smoothed versions of the accelerated failure time regression model [cf. 35]. The SCAD and MCP penalizations, as well as other penalties having p′(r; θ) ≥ 0 for r > 0, can also be used; however, additional care is required. In particular, and as pointed out in an earlier remark, if one sets h(β, α) = 0 for all β and α then convergence of the resulting algorithm to a stationary point is no longer guaranteed by the above results due to the resulting failure of these penalties to induce strict majorization.
The need to use an iterative algorithm for repeatedly minimizing (10) is not unusual for the class of MM algorithms. However, it turns out that for certain choices of g(β), a suitable choice of h(β, α) in Theorem 3.2 guarantees both strict majorization and permits one to compute the minimizer of (10) in closed form (i.e., in one step), resulting in a single soft-thresholding update at each iteration. Below, we demonstrate how the MIST algorithm simplifies in settings where g(β) corresponds to the negative loglikelihood function of a canonically parametrized generalized linear regression model with a uniformly bounded hessian matrix. The result applies to all penalties satisfying condition (P1), including SCAD and MCP. A proof is provided in Appendix A.4.

Theorem 3.4. Let y be N × 1 and suppose the probability distribution of y follows a generalized linear model with a canonical link and linear predictor X̃β̃, where ℓ(β̃) is the corresponding loglikelihood, η = X̃β̃ with elements η_i and E[y_i] = d′(η_i), i = 1, ..., N, for d(•) strictly convex and twice continuously differentiable. Let λ_max(β̃) denote the largest eigenvalue of −∇²ℓ(β̃) and assume that λ* = max_β̃ λ_max(β̃) < ∞.
Let the penalty function be defined as in (3) and satisfy (P1); note that β_0 remains unpenalized. Assume that the corresponding set of stationary points, defined in the sense of Clarke [12], is non-empty, finite and consists only of bounded local or global minima. Then:

2. The functions g(β̃) = −ℓ(β̃), p(β; λ) and h(β̃, α̃) satisfy the regularity conditions of Theorem 2.1; hence, an MM algorithm that uses (14) converges to a minimizer of (2).

3. For each bounded vector α̃ ∈ R^{p+1}, (a) the minimizer β̃* of ξ^[S](β̃, α̃) is unique and is given componentwise in terms of the soft-thresholding operator S(•; •) defined in (11).

In view of previous results, part 3 of Theorem 3.4 shows that the resulting MM algorithm takes a very simple form: given the current iterate β̃^(n), one updates the intercept and then updates the remaining parameters β^(n) through a single componentwise soft-thresholding step. The proposed algorithm can be easily generalized to accommodate additional regression variables (besides the intercept) not subject to penalization. The specific choice of function h(β̃, α̃) clearly serves two useful purposes: (i) it leads to componentwise soft-thresholding; and (ii) it leads to strict majorization, as is required in condition (iii) of Theorem 2.1, allowing one to establish the convergence of MIST for SCAD and other penalties having p′(r; θ) = 0 at some finite r > 0.
An important class of problems to which these results apply is the setting of penalized linear regression. Suppose that y has been centered to remove β_0 from consideration and that the problem has also been rescaled so that X, which is now N × p, satisfies the indicated conditions. Then, Theorem 3.4 applies directly with g(β) = (1/2)‖y − Xβ‖², where ̟ is defined as in Theorem 3.4 with λ* set equal to the largest eigenvalue of X′X. For the class of adaptive elastic net penalties (i.e., p(r; λ_j) = λω_j r in (3)), the resulting iterative scheme is exactly that proposed in De Mol, De Vito and Rosasco [17, pg. 17], specialized to the setting of a Euclidean parameter.
In particular, we have τ_j = λω_j and γ_j = 0 in Theorem 3.4, and the proposed update reduces to that scheme with ν = 2̟^{-1}. Setting ν = 1 and ε = 0 yields the iterative procedure proposed in Daubechies, Defrise and De Mol [16], provided X′X is scaled such that I − X′X is positive definite. The MIST algorithm extends these iterative soft-thresholding procedures to a much wider class of penalized estimation problems.
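For the penalized linear regression special case just discussed, a minimal sketch of the resulting single-soft-thresholding-step iteration is given below, specialized to the lasso penalty with ε = 0. The function name, the fixed iteration count, and the particular step-size rule are illustrative choices, not the paper's exact specification.

```python
import numpy as np

def mist_lasso(X, y, lam, n_iter=2000):
    """Schematic MIST-type iteration for g(beta) = (1/2)||y - X beta||^2
    with the lasso penalty lam * sum_j |beta_j|: each MM step is a single
    componentwise soft-thresholding update, with no matrix inversion."""
    _, p = X.shape
    # Step size w chosen inside (0, 2/lambda_max(X'X)) so the update
    # is a contraction, per Theorem 3.2 / Corollary 3.3.
    lam_max = np.linalg.eigvalsh(X.T @ X).max()
    w = 1.0 / lam_max
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta + w * X.T @ (y - X @ beta)                       # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - w * lam, 0.0)  # S(z; w*lam)
    return beta
```

With an orthonormal design this reduces, after one step, to the familiar closed-form lasso solution S(X′y; λ), matching the behavior of the Daubechies, Defrise and De Mol iteration mentioned above.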
In an interesting unpublished paper, Mazumder, Friedman and Hastie [45] propose the SparseNet algorithm, a coordinatewise descent algorithm for minimizing objective functions of the form (2) with g(β) = (1/2)‖y − Xβ‖², ε = 0 and p(β; λ) a family of penalty functions satisfying (3) and several additional regularity conditions. Their specification includes the lasso, SCAD and MCP penalties, as well as several other examples of nonconvex penalties. The full SparseNet algorithm intends to generate the solution surface as a function of the penalty parameter λ and a parameter γ indexing the penalty family (i.e., restricted to a two-dimensional grid). While the algorithm incorporates a number of useful features, solutions are found for each (λ, γ) pair using a simple coordinate descent algorithm. In the case of the lasso penalty (γ = ∞) and provided X is column-standardized, this coordinate descent algorithm is almost identical to the componentwise soft-thresholding algorithm proposed in Daubechies, Defrise and De Mol [16] (hence MIST), the primary differences stemming from the form of the iterative update (i.e., the use of a simultaneous update implemented via componentwise soft-thresholding versus cyclical application of the soft-thresholding operator). For other penalties, such as SCAD and MCP, the coordinatewise updates utilized by SparseNet rely on so-called generalized thresholding operators [cf. 59], departing more substantially from the iterated soft-thresholding procedure used in the MIST algorithm. Mazumder, Friedman and Hastie [45] provide an explicit proof of the convergence of the solution sequence obtained for a given (λ, γ) pair. The regularity conditions under which these results are obtained appear to be similarly weak to those required by Theorem 3.4 (i.e., applied to the penalized least squares problem). However, unlike MIST, it is not obvious how to extend the SparseNet algorithm to more general choices of g(β) in the absence of reparameterizations that permit
componentwise separation of parameters.
The restriction to canonical generalized linear models in Theorem 3.4 is imposed to ensure strict convexity of the negative loglikelihood. Our results are easily modified to handle non-canonical generalized linear models, provided the negative loglikelihood remains strictly convex in β and the hessian can be appropriately bounded. Interestingly, not all canonically parametrized generalized linear models satisfy the regularity conditions of Theorem 3.4.
It is easy to see that ∇ℓ(β̃), hence ∇m(β̃), is locally but not globally Lipschitz continuous; hence, it is not possible to choose a matrix C = ̟^{-1}I such that (14) everywhere majorizes ξ(β̃). Nevertheless, progress remains possible. For example, Corollary 3.3 implies that one can still use a single update of the form (17) provided that a suitable Ω, hence C and β̃^(0), can be identified.
Alternatively, using results summarized in Becker, Yang and Lange [6], one can instead majorize −ℓ(β̃) for any bounded α̃ using a separable surrogate k(β̃, α̃) = Σ_j k_j(β_j; α_j), where, for every i, the θ_ij ≥ 0 are any constants satisfying Σ_j θ_ij = 1. Of importance here is the fact that k_j(β_j; α_j) is a strictly convex function of β_j and does not depend on β_k for k ≠ j. One may now take h(β̃, α̃) in Theorem 2.1 as being equal to k(β̃, α̃) + ℓ(β̃), leading to the minimization of (18) at each step. In particular, componentwise soft-thresholding is replaced by componentwise minimization of (18), the latter using any algorithm capable of minimizing a continuous univariate convex function.
Remark 3.5. The Cox proportional hazards model [14], while not a generalized linear model, shares the essential features of the generalized linear model utilized in Theorem 3.4. In particular, the negative log partial likelihood, say g(β) = −ℓ_p(β), is (under mild regularity conditions) strictly convex, twice continuously differentiable, and has a bounded hessian [e.g., 2, 9]. Consequently, Theorem 3.4 and its proof are easily modified for this setting upon taking g(β) as indicated.

Accelerating convergence
Similarly to the EM algorithm, the stability and simplicity of the MM algorithm frequently come at the price of an increased number of iterations before convergence. Numerous methods of accelerating the EM algorithm have been proposed in the literature; see McLachlan and Krishnan [46] for a review. Recently, Varadhan and Roland [67] proposed a new method for EM called SQUAREM, obtained by "squaring" an iterative Steffensen-type (STEM) acceleration method. Both STEM and SQUAREM are structured for use with iterative mappings of the form θ_{n+1} = M(θ_n), n = 0, 1, 2, …, and hence are applicable to both the EM and MM algorithms. Specifically, writing r_n = M(θ_n) − θ_n and v_n = M(M(θ_n)) − M(θ_n) − r_n, the acceleration update for SQUAREM is given by θ_{n+1} = θ_n − 2γ_n r_n + γ_n² v_n, where Varadhan and Roland [67] suggest several steplength options, with preference for the choice γ_n = −‖r_n‖/‖v_n‖. Roland and Varadhan [53] provide a proof of local convergence for SQUAREM under restrictive conditions on the EM mapping M(θ), while Varadhan and Roland [67] outline a proof of global convergence for versions of SQUAREM that employ a back-tracking strategy. We study the effectiveness of SQUAREM applied to the simplified form of the MIST algorithm, hereafter denoted SQUAREM2, in Section 4.3.
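The SQUAREM update with the preferred steplength γ_n = −‖r_n‖/‖v_n‖ can be sketched as follows; the function name and the fixed-point-check shortcut are illustrative, and a practical implementation would add the back-tracking safeguard mentioned above:

```python
import numpy as np

def squarem_step(theta, M):
    """One SQUAREM acceleration step for a fixed-point map theta -> M(theta),
    using the steplength gamma_n = -||r_n|| / ||v_n||."""
    theta1 = M(theta)
    theta2 = M(theta1)
    r = theta1 - theta                 # r_n = M(theta_n) - theta_n
    v = (theta2 - theta1) - r          # v_n = M(M(theta_n)) - M(theta_n) - r_n
    nv = np.linalg.norm(v)
    if nv == 0.0:                      # already at (or effectively at) a fixed point
        return theta2
    gamma = -np.linalg.norm(r) / nv
    return theta - 2.0 * gamma * r + gamma ** 2 * v
```

For a one-dimensional affine map the step is exact: with M(θ) = 0.5θ + 1 (fixed point θ = 2), a single call from θ = 0 lands on the fixed point.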

Simulation results
The simulation results summarized below are intended to compare the estimates of β obtained from existing methods to those obtained using the simplified MIST algorithm of Theorem 3.4. In particular, we consider the performance of MIST for the class of penalized linear and generalized linear models, demonstrating its capability of recovering the solutions provided by existing algorithms when both algorithms are forced to use the same set of "tuning" parameters (i.e., penalty parameter(s), plus any additional parameters required to define the penalty itself). In cases where multiple local minima can arise, we further show that the MIST algorithm often tends to find solutions with lower objective function evaluations in comparison with existing algorithms, provided these algorithms utilize the same choice of starting value.

Example 1: Linear model
Let 1_m and 0_m respectively denote m-dimensional vectors of ones and zeros. Then, following Zou and Zhang [78], we generated data from the linear regression model, where T is a p-dimensional vector with intrinsic dimension q = 3[p/9], ε ∼ N(0, σ²), and x follows a p-dimensional multivariate normal distribution with zero mean and covariance matrix Σ having elements Σ_j,k = ρ^|j−k|, 1 ≤ k, j ≤ p. We considered σ ∈ {1, 3}, ρ ∈ {0.0, 0.5, 0.75} and p ∈ {35, 81} for N = 100 independent observations. Penalized least squares estimation is considered for five popular choices of penalty functions, all of which are currently implemented in the R software language: LAS, ALAS, EN, AEN, and SCAD. The LAS, ALAS, EN and AEN penalties are all convex and lead to unique solutions under mild conditions; the SCAD penalty is concave and the resulting minimization problem is, depending on the design matrix and choice of a, either convex or nonconvex [cf. 72]. The SCAD examples considered here lead to non-convex objective functions, hence may have multiple solutions. In each case, we used existing software for computing solutions subject to these penalizations and compared those results to the solutions computed using the MIST algorithm. For the MIST algorithm, ϖ was chosen to be 2/(λ* + 0.001), where λ* is the largest eigenvalue of X′X and X is appropriately scaled to match the scaling of the existing algorithm.
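The step-size constant described above can be computed directly; a minimal sketch, mirroring the text's choice ϖ = 2/(λ* + 0.001) with λ* the largest eigenvalue of X′X (the function name is illustrative):

```python
import numpy as np

def mist_stepsize(X, eps=0.001):
    """Step-size constant as described in the text: varpi = 2 / (lambda_star + eps),
    where lambda_star is the largest eigenvalue of X'X."""
    lam_star = np.linalg.eigvalsh(X.T @ X).max()
    return 2.0 / (lam_star + eps)
```

For large p, a few power iterations on X′X would give λ* more cheaply than a full eigendecomposition.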
Regarding existing methods, we respectively used the lars [30] and elasticnet [76] packages for computing solutions in the case of the LAS and EN penalties. For the ALAS and AEN penalties, we used software kindly provided by Zou and Zhang [78] that also makes use of the elasticnet package. The weights for the AEN penalty are computed using ω_j = |β̂_j^EN|^(−γ), j = 1, …, p, where β̂^EN is an EN estimator and γ is a positive constant. Using EN-based weights in the AEN fitting algorithm necessitates tuning parameter specification for both EN and AEN. As in Zou and Zhang [78], the ℓ₁ parameters λ (λ₁ in their notation) are allowed to differ, whereas the ℓ₂ parameters ε (λ₂ in their notation) are forced to be the same. Evidently, setting ε = 0 (λ₂ = 0) results in the ALAS solution. For the SCAD penalty, we considered the estimator of Kim, Choi and Oh [36] (HD), as well as the one-step SCAD (1S) and LLA estimators of Zou and Li [77]. The code for the first two methods was kindly provided by their respective authors; the LLA estimator was computed using the R package SIS. The choice a = 3.7 was used for all implementations of SCAD.
We considered finding solutions using penalties in the set Λ = {0.1, 1, 5, 10, 20, 100}. In particular, for LAS and SCAD, λ = λ₁ ∈ Λ. For EN, both λ = λ₁ ∈ Λ and λε = λ₂ ∈ Λ. For simplicity, we fixed the weights for AEN for a given λ₂ by selecting the 'best' β̂^EN among the six estimators involving λ = λ₁ ∈ Λ based on a BIC-like criterion. Likewise for ALAS, the weights were computed using the 'best' β̂^LAS among the six estimators involving λ = λ₁ ∈ Λ. The parameter γ for the ALAS and AEN penalties was respectively set to three and five for p = 35 and p = 81.
The convergence criteria used by the existing software packages were used without alteration in this simulation study. The convergence criteria used for MIST were as follows: the algorithm stopped if either (i) the normed difference of successive iterates was less than 10⁻⁶ (convergence of coefficients); or (ii) the difference of the objective function evaluated at successive iterates was less than 10⁻⁶ and the number of iterations exceeded 10⁶ (convergence of optimization). Due to the large number of comparisons and the highly intensive nature of the computations, we ran B = 100 simulations for each choice of ρ, σ, and p. We report the results for the convex penalties in Table 2 and those for the SCAD penalty in Tables 3 and 4.
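The two MIST stopping rules above can be encoded in a generic driver loop; this is a sketch with illustrative names, not the paper's implementation:

```python
import numpy as np

def run_mm(beta0, update, objective, tol=1e-6, max_iter=10**6):
    """Generic MM driver using the two stopping rules described above:
    (i) convergence of coefficients: ||beta_{k+1} - beta_k|| < tol; or
    (ii) convergence of optimization: objective change < tol once the
    iteration count exceeds max_iter."""
    beta = np.asarray(beta0, dtype=float)
    f = objective(beta)
    for it in range(1, 2 * max_iter + 1):
        beta_new = update(beta)
        f_new = objective(beta_new)
        if np.linalg.norm(beta_new - beta) < tol:      # rule (i)
            return beta_new, it
        if abs(f - f_new) < tol and it > max_iter:     # rule (ii)
            return beta_new, it
        beta, f = beta_new, f_new
    return beta, 2 * max_iter
```

Because each MM update monotonically decreases the objective, rule (ii) acts as a fallback when the iterates themselves move slowly.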
In Table 2, we summarize the average normed difference between the solution obtained using existing software and that obtained using MIST, ‖β̂_exist − β̂_mist‖, over the B = 100 simulations; in particular, we report in the two leftmost panels the maximum value of this difference, computed across all combinations of tuning parameters. These maximum differences (all of which are multiplied by 10⁵) are remarkably small for all (A)LAS and (A)EN penalties, indicating that MIST recovers the same (unique) solutions as the existing algorithms. Interestingly, the values for LAS are slightly larger than the rest, where the maximum differences all resulted from the smallest value of λ considered (λ = 0.1). In these cases, the algorithm tended to stop using the objective function criterion rather than the (stricter) coefficient criterion, resulting in slightly larger differences on average.
The results for SCAD are reported in Tables 3 (p = 35) and 4 (p = 81) and display (i) the average normed differences, multiplied by 10³, for each combination of λ, ρ, σ, p and starting value; and (ii) the proportion of simulated datasets in which the MIST solution yields a lower or equivalent evaluation of the objective function in comparison with the solution obtained using another method for the indicated choice of starting value. We remark here that the SCAD penalties used in the existing implementations are multiplied by a factor of N, i.e., p(β; λ) = Σ_{j=1}^p N p_S(|β_j|; λ, a), so the MIST implementation incorporates this factor of N as well. The results for λ = 100 are not shown, as the solution was 0_p in all cases.

Table 3: Algorithm performance in Example 1 (LM: p = 35, N = 100) for SCAD penalty. The column 'avg' is the average normed difference (×10³) between the MIST solution and the existing method's solution; 'prop' is the proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing method's solution. HD = High Dimensional SCAD [36]; 1S = one-step and LLA = local linear approximation [77]. (Numerical entries omitted.)

In comparison with the convex penalties, larger normed differences are observed, even when controlling for the use of the same starting value. Such differences are a result of two important features of the SCAD optimization problem: (i) the possible existence of several local minima; and (ii) the fact that the MIST, HD, and LLA algorithms each take a different path from a given starting value towards one of these solutions. For example, while each of the LLA, MIST, and HD algorithms involves majorization of the objective function using a lasso-type surrogate objective function, both the majorization and minimization of the resulting surrogate function are carried out differently in each case.

Table 4: Algorithm performance in Example 1 (LM: p = 81, N = 100) for SCAD penalty. The columns 'avg' and 'prop' are defined as in Table 3. HD = High Dimensional SCAD [36]; 1S = one-step and LLA = local linear approximation [77]. (Numerical entries omitted.)

In particular, the LLA algorithm, as implemented in SIS, majorizes only the penalty term and adapts the lasso code in glmpath in order to minimize the corresponding surrogate objective function at each step. The HD algorithm is similar in spirit, but instead decomposes the penalty term into a sum of a concave and a convex function and utilizes the algorithm of Rosset and Zhu [55] to minimize the corresponding surrogate objective function. The MIST algorithm instead uses the same penalty majorization as the LLA algorithm, but additionally majorizes the negative loglikelihood term in a way that permits minimization of the surrogate function in a single soft-thresholding step. Each procedure therefore takes a different path towards a solution, even when given the same starting value.
We remark here that differences must also be expected between any of LLA, HD, MIST and 1S. From an optimization perspective, the one-step estimator is the result of running just one iteration of the LLA algorithm, starting from the unpenalized least squares estimator β̂_ml [77]; hence, 1S only provides an approximate solution to the desired minimization problem. All other methods (LLA, MIST, HD) iterate until some local minimizer (or stationary point) is reached. For example, when using either β̂_ml or β̂_1S,λ as the starting value, MIST always found a solution that produced a lower evaluation of the objective function in comparison to β̂_1S,λ. However, when using the null starting value of 0_p, the one-step estimator did occasionally result in a lower objective function evaluation in cases involving smaller values of λ. This behavior is not terribly surprising; with small λ, the one-step solution should generally be close to the unpenalized least squares solution, as the objective function itself is likely to be dominated by the least squares term.
Of all the SCAD algorithms considered here, MIST and LLA tended to find the most similar solutions (i.e., have the smallest normed differences). For the cases in which the LLA solution had lower objective function evaluations, all of the MIST solutions were also LLA solutions; i.e., when starting the LLA algorithm with the MIST solution, the algorithm terminated at the starting value (i.e., the LLA solution coincides with the MIST solution). With the exception of three of these cases, starting the MIST algorithm with the LLA solution also resulted in the same behavior. The HD and MIST algorithms also generally gave similar results, with one source of difference being the respective stopping criteria used. The stopping criteria for HD, assessed in order, are as follows: (1) 'convergence of optimization': stop if the absolute value of the difference of the objective evaluated at successive iterates is less than 10⁻⁶; (2) 'convergence of penalty gradient': stop if the sum of the absolute values of the differences of the derivative of the centered penalty evaluated at successive iterates is less than 10⁻⁶; (3) 'convergence of coefficients': stop if the sum of the absolute values of the differences of successive iterates is less than 10⁻⁶; and (4) 'jump-over' criterion: stop if the objective at the previous iterate plus 10⁻⁶ is less than the objective at the current iterate. After careful analysis of the results, we assert the following:
• The MIST solution usually has the same or a lower evaluation of the objective function in comparison with HD, regardless of starting value.
• HD tends to have the most difficulty in cases of high predictor correlation, a likely result of the fact that this algorithm relies on the variance of the unpenalized least squares estimator, hence matrix inversion, to take steps towards a solution. In contrast, MIST requires no matrix inversion.
On balance, the MIST algorithm performs as well as or better than LLA and HD in locating minimizers in nearly all cases. As suggested above, variation in the solutions found can be traced to the path each algorithm takes towards a solution and to differences in stopping criteria. Remarkably, in cases when the correlation among predictors was low, the choice of starting value made little difference for MIST; either the same solution was found for all starting values or none of the starting values dominated in terms of finding lower or equivalent objective evaluations. In settings involving higher correlation, however, using either 0_p or the 1S starting values tended to result in lower evaluations of the objective function in comparison with using the unpenalized least squares solution. Similar behavior was observed for the LLA algorithm. In contrast, the choice of starting value had a much larger impact on the performance of the HD estimator; in particular, the use of 0_p as a starting value typically resulted in the lowest objective function evaluations when compared to using a non-null starting value.

Example 2: Binary logistic regression
As in Example 1, we considered the LAS, ALAS, EN, AEN, and SCAD penalties. There are at least two R packages that allow penalization using the LAS and EN penalties: glmpath [51], which handles binomial and Poisson regression using a "predictor-corrector" method, and glmnet [24], which handles binomial and multinomial regression using cyclical coordinate descent. Both methods can be tuned to find the same solutions, so for ease of presentation we only consider the results of glmnet for comparison in the tables and analysis below. The SIS package [22] permits computations with the ALAS, AEN, and SCAD penalties using modifications of the Park and Hastie [51] code. For SCAD, we compared the results of MIST to the results from the one-step (1S) algorithm [GLM version, 77], using the code provided by the authors, and the LLA algorithm as implemented in Fan et al. [22].
As before, we only considered comparing solutions that use the same combination of tuning parameters; for the present example, the set considered is Λ = {0.001, 0.01, 0.05, 0.1, 0.2, 1.00}, reflecting a need to accommodate the different scaling of the problem. The data generation scheme for this example was loosely based on the simulation study found in Friedman, Hastie and Tibshirani [24]. Binary response data were generated according to a logistic (rather than linear) regression model, where β* is a p-vector with elements β_j = 3 × (−1)^j exp(−2(j − 1)/200), j = 1, …, q, q ∈ {25, 75}, and the remaining 100 − q components set to zero. Here, x_i follows a p-dimensional multivariate normal distribution with zero mean and covariance Σ = 3⁻²P, where the correlation matrix P is such that each pair of predictors has the same population correlation ρ. We considered three such correlations, ρ ∈ {0.0, 0.5, 0.75}. For the MIST algorithm, ϖ was selected to be 2/(λ* + 0.001), where λ* is the largest eigenvalue of X′WX with W = 0.25 I_N, and the design matrix X is appropriately scaled to match the scaling of the existing algorithm.
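The choice W = 0.25 I_N above reflects the fact that the Bernoulli variance p(1 − p) never exceeds 1/4, so X′WX bounds the logistic hessian X′W(β)X uniformly in β. A minimal sketch of the implied eigenvalue bound (illustrative function name):

```python
import numpy as np

def logistic_hessian_bound(X):
    """Uniform upper bound on the largest eigenvalue of the logistic
    negative loglikelihood hessian X' W(beta) X: since the Bernoulli
    variance p(1 - p) <= 1/4 for all beta, W(beta) <= 0.25 * I and hence
    lambda_max(X' W(beta) X) <= 0.25 * lambda_max(X' X)."""
    return 0.25 * np.linalg.eigvalsh(X.T @ X).max()
```

This uniform bound is what makes a single quadratic majorizer valid for every iterate, in contrast to models whose hessians are unbounded in β.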

Table 5
Algorithm performance in Example 2 (GLM) for SCAD penalty. The column 'avg' is the average normed difference (×10³) between the MIST solution and the existing method's solution; 'prop' is the proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing method's solution. (Numerical entries omitted.)

For the B = 100 simulations, the maximum (across different tuning parameters) average normed differences of solutions between the existing and proposed methods, multiplied by 10⁵, are reported for each of the strictly convex cases in the right-most panel of Table 2. As before, these maxima are generally remarkably small, indicating that MIST recovers the same (unique) solutions as the existing algorithms. The results for SCAD are reported in Table 5, which displays the same information as in the corresponding tables from Example 1; the HD comparisons are omitted here, as the methodology and code were developed only for the case of penalized least squares. In the GLM setting, the 1S estimator is computed by applying the LARS [19] algorithm to a quadratic approximation of the negative loglikelihood function evaluated at the MLE. In contrast, both MIST and LLA utilize the exact objective function and iterate until a stationary point, usually a local minimizer, is found. As in the linear model case, LLA uses glmpath to minimize the surrogate at each step, whereas the MIST algorithm uses a single application of the soft-thresholding operator to minimize the surrogate function at each step.
In this example, the starting value carried even greater importance in comparison with the linear model setting. In particular, in the case of MIST, the combination of a 0_p starting value and a small penalty parameter led to solutions with objective function evaluations that were substantially larger in comparison with those obtained using either β̂_ml or β̂_1S,λ. Such behavior may be directly attributed to the fact that the ML and 1S starting values either minimize or nearly minimize the negative loglikelihood portion of the objective function, the dominant term in the objective function when λ is "small." In contrast, a 0_p starting value led to the best performance for "large" λ; upon reflection, this is also not very surprising, since large penalties induce greater sparsity and 0_p is the sparsest possible solution.
There were a few settings in which the 1S estimator resulted in a lower objective function evaluation in comparison with applying MIST started at β̂_ml. This reflects the fact that several local minima can exist for non-convex penalties like SCAD. In addition, and as was observed before, using the 1S solution as a starting value always led to MIST finding a solution with a lower evaluation of the objective function in comparison with that provided by the 1S solution. Regarding the use of LLA, which also requires a starting value specification, we again examined the cases for which LLA resulted in lower objective function evaluations. For these cases, all MIST solutions were LLA solutions, and all LLA solutions were MIST solutions with the exception of one. Hence, both methods find valid, if often different, solutions, a behavior that we again attribute to the differences in paths taken towards a solution.

Effectiveness of convergence acceleration
We explored the effectiveness of SQUAREM2, defined in Section 3.3, when applied to several simulated datasets taken from the previous two simulation studies. Table 6 indicates the relative reduction in elapsed time ('RRT') and numbers of MM updates, i.e., invocations of the mapping M(·), required for the original and SQUAREM2-accelerated algorithms to converge for five randomly chosen simulation datasets across the five penalty functions. The SQUAREM2 algorithm converged without difficulty in these cases and required substantially fewer MM updates than the original algorithm; the percent reduction in time was as high as 96%. We remark here that the regularity conditions imposed in Roland and Varadhan [53] and Varadhan and Roland [67], particularly the smoothness conditions, are not satisfied in this particular class of examples. Hence, while the simulation results are certainly very promising, the question of the convergence (and associated rate) of SQUAREM2 in this class of problems remains an interesting open problem.

Table 6
Acceleration from SQUAREM2 applied to the simplified MIST algorithm for five randomly selected simulation datasets. The relative reduction in elapsed time is given by 'RRT', while the numbers of MM updates are given for the original MIST implementation and the SQUAREM2 implementation in '# orig' and '# sm2', respectively. The parameter Θ in the top portion of the table collects the dimension and noise information for the linear model examples, i.e., Θ = (p, σ).

Example: Genes associated with lymphoma patient survival
Diffuse large-B-cell lymphoma (DLBCL) is an aggressive type of non-Hodgkin's lymphoma and is one of the most common forms of lymphoma occurring in adults. Rosenwald et al. [54] utilized Lymphochip DNA microarrays, specialized to include genes known to be preferentially expressed within the germinal centers of lymphoid organs, to collect and analyze gene expression data from 240 biopsy samples of DLBCL tumors. For each subject, 7399 gene expression measurements were obtained. The expression profiles, along with corresponding patient information, can be downloaded from their supplemental website http://llmpp.nih.gov/DLBCL/. Since the original profiles had some missing expression measurements, we used the dataset subsequently analyzed by Li and Gui [40], in which the missing values were estimated using a nearest neighbor approach.
During the time of follow-up, 138 patient deaths were observed, with a median death time of 2.8 years. Rosenwald et al. [54] used hierarchical clustering to group the genes into four gene-expression signatures: Proliferation (PS), which includes cell-cycle control and checkpoint genes, and DNA synthesis and replication genes; Major Histocompatibility Complex Class II (MHC), which includes genes involved in antigen presentation; Lymph Node (LNS), which includes genes encoding known markers of monocytes, macrophages, and natural killer cells; and Germinal Center B (GCB), which includes genes that are characteristic of germinal center B cells; see Alizadeh et al.
[1] for more information on gene signatures. Based on the gene clusters, they built a Cox proportional hazards model [14, 15] to predict survival outcomes in the DLBCL patients. Subsequently, this dataset has been analyzed numerous times, typically to evaluate methods related to subgroup identification and/or survival prediction [e.g., 4, 20, 27, 28, 40, 41, 63].
Here, we instead focus on the performance of two different penalties, namely SCAD and MCP, with regard to the identification of genes associated with DLBCL survival. The simulation results of Zhang [71, 72] suggest that MCP has superior selection accuracy relative to the SCAD penalty, at least in the case of a linear model. There, selection accuracy was measured as the proportion of simulation replications with correct classification of both the zero and non-zero coefficients, with MCP outperforming SCAD in all simulation settings.
To illustrate the utility and flexibility of the MIST algorithm, we reanalyzed the DLBCL data, fitting a penalized Cox regression model respectively using the SCAD and MCP penalty functions, and running these procedures in combination with the Iterative Sure Independence Screening procedure [ISIS, 23] in order to ensure that the dimension of the parameter space was maintained at a manageable level. For SCAD, we considered both the 1S and LLA estimators. The existing optimization functions provided in the SIS package for the ISIS procedure were used for the 1S estimator, whereas relevant modifications to the ISIS code were made in order to accommodate the fully iterative LLA and MCP estimators. Optimization at each step of the ISIS algorithm in the case of the MCP penalty utilized the MIST algorithm, as we are aware of no other algorithm capable of fitting the Cox regression model subject to MCP penalization. The default settings in the SIS package were used to determine the maximum number of predictors ([n/(4 log n)] = 10) and to define the additional ISIS parameters (e.g., use of the unpenalized MLE as a starting value, ranking method, tuning parameter selection) for all three analyses (1S-SCAD, LLA-SCAD, MIST-MCP). The parameter a was set to 3.7 for all analyses; hence, only the selection of λ required any tuning. Table 7 displays the 11 genes identified by at least one of the three analyses. The x's in a given column indicate the genes with non-zero coefficients resulting from the corresponding penalization. The final column provides references for genes previously linked to DLBCL in the literature. Genes belonging to the original Rosenwald et al. [54] gene expression signatures are indicated with parenthetical initials. Note that the references provided are not meant to be an exhaustive list, but instead intend to demonstrate the relevance of certain genes and/or their altered expression levels in DLBCL survival.
Interestingly, the LLA-SCAD and MIST-MCP penalizations selected the same subset of genes, having a nearly complete overlap with those selected by the 1S-SCAD penalization. The number of genes selected in each case is 10, the maximum specified by ISIS; 9 of these were shared across the three penalizations. According to an NCBI Entrez Gene search (http://www.ncbi.nlm.nih.gov/), many of these genes are biologically relevant. For example, CDK7 codes for a protein that regulates cell cycle progression and is represented in the Proliferation Signature, although reported under a different Lymphochip ID, as this gene was spotted multiple times on the array. Also members of the Proliferation Signature are SEPT1, coding for a protein involved in cytokinesis, and BUB3, coding for a mitotic checkpoint protein. DNTTIP2 regulates the transcriptional activity of DNTT, a gene for a protein expressed in a restricted population of normal and malignant pre-B and pre-T lymphocytes during early differentiation. HLA-DRA, a member of the MHC Signature, plays a central role in the immune system and is expressed in antigen presenting cells, such as B lymphocytes, dendritic cells, and macrophages. From the GCB Signature, the ESTs weakly similar to thyroxine-binding globulin precursor is highly cited. Additionally, RFTN1 plays a pivotal role in regulating B-cell antigen receptor-mediated signaling [56].
The gene AI568329, selected by all methods, is not described in the original dataset and its function is unknown. Similarly, although cited at least twice, a description for AA830781 is also unavailable. Both of these genes may nevertheless be related to lymphoma or the risk of death from lymphoma, as it is possible that these genes (and potentially others) were selected because of coexpression or correlation with other relevant genes. Interestingly, the two genes not commonly identified across the three penalizations were both cited in Martinez-Climent et al. [44], who found altered gene expression of TSC22D3 and ITGAL (both involved in a variety of immune phenomena) in one patient who initially presented with follicle center lymphoma that subsequently transformed to DLBCL.
The results of this analysis demonstrate equivalence in selection performance between MCP and LLA-SCAD for the case of the Cox proportional hazards model. Increasing the maximum number of predictors to 21 again resulted in equivalent selection performance between MCP and LLA-SCAD, with 21 predictors ultimately selected (results not shown). The 1S estimator also resulted in the selection of 21 predictors, but with increased dissimilarity between MCP/LLA-SCAD and 1S: only 13 of the 21 genes were selected by all three methods. It should be noted that Zhang [71, 72] did not use any form of iterative variable selection (e.g., ISIS) in his comparisons between SCAD and MCP for the case of the linear model; in addition, Zhang [71, 72] fixed values for both penalty parameters in his simulations and did not use a = 3.7. The use of ISIS, the methodology used for selecting λ, and the use of a = 3.7 [e.g., 23] in both the MCP and SCAD penalties may all play a role in the results summarized above.

Discussion
This paper proposed a versatile and general algorithm capable of dealing with a wide variety of nonsmoothly penalized objective functions, including but not limited to all presently popular combinations of goodness-of-fit and penalty functions. In particular, the MIST algorithm utilizes a judicious choice of majorization to generate an MM algorithm that applies soft-thresholding componentwise and which, in certain settings, allows one to minimize the majorizing function in a single iteration. We established a suitable convergence theory, as well as new results on the convergence of rather general MM algorithms. In the case of penalized least squares, our results are complementary to convergence results obtained for coordinate descent algorithms designed for use with the lasso penalty [cf. 45, 65, 69]. In general, while the minimizers obtained at each step of the MIST algorithm are not necessarily coordinatewise minima of the desired objective function, the MIST algorithm continues to drive the objective function steadily downhill, converging to a local minimizer. We further demonstrated the remarkable effectiveness of the simple SQUAREM2 acceleration procedure in these problems as a tool for accelerating the slow but steady convergence of the proposed class of algorithms. Beyond specification of the penalty parameter(s) λ, virtually no effort was expended in tuning or otherwise specializing the MIST algorithm for solving a given problem. At the expense of greater analytical work, the rate of convergence of the standard MIST algorithm can itself be improved.
The simulation results of this paper highlight the fact that nonconvex penalties tend to endow the corresponding objective function with multiple local minima. The resulting sensitivity of computational algorithms to the choice of starting value, while known, has not been especially emphasized in the literature on penalized estimation. In this regard, the computationally attractive one-step method of Zou and Li [77] provides a useful choice of starting value for fully iterative SCAD-based algorithms. In addition to being unique under mild regularity conditions, it is also easily generalized to other nonconvex penalties, such as MCP, and yields estimators with attractive asymptotic properties. Unfortunately, its utility for identifying starting values is limited to settings where N > p, because the justification for the 1S estimator relies heavily on the use of the unpenalized MLE as a starting value.
SparseNet [45] provides an interesting addition to the class of methods able to deal with nonconvex penalty functions in the case of penalized least squares. The implementation of this methodology using the MCP penalty [71, 72] appears to hold particular promise as a tool for variable selection. As indicated earlier, the core optimization procedure used within SparseNet is a form of iterated thresholding and is developed with a particular focus on the linear model. It would be interesting to explore the possibility of extending SparseNet to the class of generalized linear models and related problems (e.g., Cox regression), with one obvious approach being to replace the coordinate descent algorithm with the MIST algorithm.
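For intuition about the univariate updates underlying such coordinate schemes, the MCP minimizer has a simple closed form in the orthonormal case. The sketch below (our own naming, valid for γ > 1) implements the penalty and its thresholding rule; unlike soft-thresholding, coefficients with |z| > γλ pass through unshrunk, which is the bias-reduction property that makes MCP attractive.

```python
import math

def mcp_penalty(t, lam, gamma):
    # MCP penalty: the quadratic term relaxes the penalty toward a constant
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

def mcp_threshold(z, lam, gamma):
    # minimizer of 0.5 * (z - b)^2 + mcp_penalty(b, lam, gamma); needs gamma > 1
    az = abs(z)
    if az <= lam:
        return 0.0                      # small inputs are set exactly to zero
    if az <= gamma * lam:
        return math.copysign((az - lam) / (1.0 - 1.0 / gamma), z)
    return z                            # large inputs are left unshrunk
```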
The simulated examples in this paper only consider settings with N > p, in part to ensure that the goodness-of-fit function g(β) remains strictly convex. While the MIST algorithm has not yet been extensively tested in settings where N ≪ p, preliminary results show that the algorithm continues to find reasonable solutions when given a reasonable starting value, but tends to converge at a slower rate than in the N > p case. As suggested by Table 6, the SQUAREM 2 acceleration procedure can produce dramatic gains even as N gets close to p. The problem of tuning the algorithm, the development of acceleration procedures, and the selection of suitable starting values in problems with multiple local minima, particularly in settings where p > N but the number of "important" predictors p0 ≪ N, are left for future work. Two important and unresolved challenges in such problems include rigorously justifiable methods for determining good starting values and penalty parameter(s).
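The SQUAREM acceleration referred to above wraps any fixed-point map F, here one MM/MIST update, in a simple extrapolation. A minimal sketch of the basic Varadhan-Roland step follows (our own naming; a production implementation would guard the extrapolated point with a fallback to the plain F step whenever the objective increases):

```python
import numpy as np

def squarem_step(F, b):
    # One SQUAREM extrapolation step for a fixed-point map F.
    r = F(b) - b                  # first difference
    v = F(b + r) - (b + r) - r    # second difference
    vv = float(v @ v)
    if vv == 0.0:                 # already at a fixed point of F
        return b + r
    alpha = -np.sqrt(float(r @ r) / vv)
    return b - 2.0 * alpha * r + alpha * alpha * v
```

For an affine contraction F the step is exact, landing on the fixed point in a single extrapolation, which gives some intuition for the dramatic speedups observed when the MM map is nearly linear close to its limit.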

Appendix A
This appendix is divided into several sections. Section A.1 reviews and extends the convergence theory for the EM algorithm established in Wu [68]; the extension utilizes results of Meyer [47] to establish stronger convergence results for general MM algorithms. Section A.2 contains the proof of Theorem 2.1 and makes direct use of these results. Finally, Sections A.3 and A.4 respectively contain the proofs of Theorems 3.2 and 3.4, establishing the convergence of iterated soft-thresholding when used to minimize (10) and the convergence of the proposed class of MIST algorithms in the case of the generalized linear model.

A.1. Local convergence of MM algorithms in nonsmooth problems
Using convergence theory for algorithms derived from point-to-set maps developed by Zangwill [70], Wu [68] established some general convergence results for the EM algorithm under a range of conditions. In what follows, the key convergence result of Zangwill [70] is restated; this result, given in Theorem A.1 and adapted from Wu [68], is stated in a form convenient for use with the MM algorithm and provides for a very general (and comparatively weak) form of convergence. We then draw on stronger convergence results due to Meyer [47] in order to establish a more useful convergence theory for MM algorithms designed to minimize nondifferentiable objective functions; this result is stated in Theorem A.3. Finally, we provide a set of sufficient regularity conditions that ensure the validity of the conditions of both theorems in a wide class of statistical estimation problems.
Let ξ(β) be the real-valued function to be minimized, where β ∈ B. While the focus of this paper is on B = R^p, the development below assumes only that B is some (possibly proper) subset of R^p and that B0 ⊂ B is a compact subset. Let M : B → B denote the map (1), where ξ^[S](•, •) is any function that majorizes ξ(β) for β ∈ B. In general, M(•) is a point-to-set map, so that M(β) is a subset of B rather than a single point. We say that β̂ is a generalized fixed point of M(•) if β̂ ∈ M(β̂); we say that β̂ is a fixed point of M(•) if M(β̂) = {β̂} (i.e., a singleton). Theorem A.1 below restates Theorem A of Zangwill [70] for the case of the MM algorithm.
One may then draw the following conclusions:
M1. The sequence {β^(n), n ≥ 0} has at least one limit point in S; in addition, the set of all such limit points, say S0, satisfies S0 ⊆ S.
M2. Each limit point β̂ ∈ S0 satisfies lim n→∞ ξ(β^(n)) = ξ(β̂).
M3. Each limit point β̂ ∈ S0 is a generalized fixed point of M(•).
Remark A.2. Assumptions [Z1]-[Z3] are imposed in Wu [68]. Assumption [Z1] implies that {β^(n), n ≥ 0} is a bounded sequence, ensuring the existence of at least one limit point. Further comments on [Z2] will be made below, as it is possible to impose reasonable sufficient conditions that guarantee it holds.
Assumption [Z3] enforces the descent property at each update, as would be expected in any EM, GEM or MM algorithm. An equivalent formulation of [Z3] follows [e.g., 47, p. 114]: Z3′. For each α ∈ B and β ∈ M(α): (i) ξ(β) < ξ(α) if α ∉ M(α) (i.e., a strict decrease occurs at points α that are not generalized fixed points); (ii) ξ(β) ≤ ξ(α) if α ∈ M(α) (i.e., if α is a generalized fixed point, it is possible to observe no change in the objective function).

Conclusion [M1] means that the limit of any convergent subsequence β^(nj) lies in S; hence, the above theorem guarantees convergence of subsequences, but does not ensure global convergence of the iteration sequence. Subsequential convergence permits, for example, oscillatory behavior in the limit sequence. Meyer [47, 48] offers several refinements of Theorem A.1, strengthening the statements of convergence. His results, adapted for the MM algorithm, follow below; in particular, see Theorem 3.1, Corollary 3.2, and Theorems 3.5 and 3.6 of Meyer [47].
If β̂ is also an isolated local minimum of ξ(•) on B0, then there exists an open neighborhood Bε ⊆ B0 of β̂ such that any iteration sequence started in Bε converges to β̂. Suppose instead that [Z1]-[Z3] and [Z5] hold. Then, in addition to results [M1]-[M3] of Theorem A.1, the following conclusion can be drawn: Remark A.4. Assumption [Z4] strengthens [Z3] by imposing the condition that the iteration scheme is strictly monotonic; as such, all generalized fixed points of M(•) are also fixed points, a situation that permits stronger statements of convergence results. Assumption [Z5] imposes the somewhat weaker assumption that there exists at least one isolated fixed point of the iteration sequence; similarly to [M7], [M8] implies that the iteration converges to this point. Two further consequences of these results are (i) one may take S to be the set of fixed points of M(•); and (ii) all solutions to the minimization problem min β∈B ξ(β) are in fact fixed points of M(•), hence contained within S [47, pp. 110-111].

Conclusions [M1]-[M7] essentially mirror those in Vaida [66, Theorems 1-3], who obtains strong convergence results for the EM and MM algorithms under continuous differentiability assumptions on the objective and majorization functions and the additional condition that ξ^[S](β, α) has a unique global minimizer in β for each α ∈ S, where S is the (assumed finite) set of stationary points of ξ(β). This uniqueness condition, reflected in [Z4], provides a verifiable convergence condition that is often satisfied in statistical applications.
Sufficient conditions that ensure [Z1]-[Z4], but weaker than the conditions imposed in Vaida [66], are now provided. In particular, suppose that the objective function, its surrogate and the mapping M(•) satisfy the following regularity conditions: R1. ξ(β) is locally Lipschitz continuous for β ∈ B and there exists at least one bounded global minimizer of ξ(β) on B. Assume further that the set of stationary points S of ξ(β) is a finite set, where the notion of a stationary point is defined as in Clarke [12]. The above conditions do not imply that the objective function ξ(β) is differentiable everywhere. Condition [R1] does imply that ξ(β) is bounded below on B, that ∇ξ(β) exists for almost all β, and that the set of global minimizers of ξ(β) on B is non-empty and bounded. Conditions [R2] and [R3] imply that ξ^[S](β, α) strictly majorizes ξ(β); note, however, that ∇ξ(β*) need not exist in order for 0 ∈ ∂ξ(β*). In general, the assumption that S is finite does not mean that the gradient exists at any of these points; in view of Proposition A.8, conditions [R1]-[R5] also do not imply an equivalence between the existence of an isolated fixed point of M(•) and differentiability of M(•) at that point.

A.2. Proof of Theorem 2.1
The assumptions stated in the theorem immediately yield that ξ(β) is locally Lipschitz continuous for each bounded λ > 0; hence (i) is satisfied. In addition, the stated assumptions imply that ξ(β) is coercive, hence attains a global minimum interior to R^p.
In order to establish the convergence of the corresponding MM algorithm in (iii), it suffices to prove that the assumptions of the theorem, together with the assertions established thus far, ensure that Conditions [R1]-[R5] of Section A.1 hold.

A.4. Proof of Theorem 3.4

First, we note that the assumptions of this theorem are sufficient to ensure that ξ(β̄) is locally Lipschitz continuous on R^{p+1}. To establish (1.), note that the choice of h(β̄, α) in (13) with appropriately chosen ϖ guarantees majorization of −ℓ(β̄), since our assumptions imply that ∇²(−ℓ(β̄)) can be suitably bounded on R^{p+1} [e.g., 38, Chapter 6]. As shown earlier, penalties of the form (3) satisfying assumption (P1) can also be linearly majorized. Hence, (14) majorizes ξ(β̄). Turning to (2.), observe that (14) is a strictly convex function of β̄; this follows from the fact that Σ_{j=1}^p (τ_j|β_j| + λεβ_j²) is convex and that −ℓ(β̄) is strictly convex and twice differentiable. In addition, the function h(β̄, α) ≥ 0 is continuous in both β̄ and α, twice continuously differentiable in β̄ for each α, equals zero when β̄ = α and is strictly positive otherwise. As a result, and in combination with (1.), (14) strictly majorizes ξ(β̄). Since all conditions of Theorem 2.1 are now satisfied, the stated convergence result for the MM algorithm immediately follows.
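The role of the curvature bound ϖ can be illustrated in a case where a global bound is elementary: for logistic regression the Hessian of −ℓ is X'DX with D ⪯ I/4, so ϖ = ||X||²/4 works on all of the parameter space. The sketch below (our own naming; the Poisson example in the text requires a more careful local bound) combines this quadratic majorizer with the lasso penalty, so that each MM step is again a componentwise soft-threshold.

```python
import numpy as np

def mist_logistic_lasso(X, y, lam, n_iter=2000):
    # Majorize the negative Bernoulli loglikelihood by a quadratic whose
    # curvature w bounds the Hessian; for logistic regression the Hessian
    # is X' D X with D <= I/4, so w = ||X||_2^2 / 4 suffices globally.
    w = np.linalg.norm(X, 2) ** 2 / 4.0
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))   # fitted probabilities
        grad = X.T @ (p - y)                 # gradient of -loglikelihood
        z = b - grad / w
        b = np.sign(z) * np.maximum(np.abs(z) - lam / w, 0.0)
    return b
```

By construction each update minimizes a majorizer of the penalized objective, so the objective is nonincreasing along the iterates, mirroring the descent property established in the proof above.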
the boundary cases for r = 0 and/or s = 0 following from Hiriart-Urruty and Lemaréchal [31, Remark 4.1.2, p. 21]. In other words, p_S(r; λ, a) can be majorized by a linear function of r.
4. For example, in the classical setting of N independent Poisson observations with E[Y_i | X_i] = O_i exp{x_i^T β} for a known set of scalar (log) offsets O_1, . . ., O_N, we have (up to irrelevant constants) ℓ(β) = Σ_{i=1}^N { y_i x_i^T β − O_i exp{x_i^T β} }.
and locally Lipschitz in β for β near α. R5. M(β) is a singleton set consisting of one bounded vector for each β ∈ B.

Table 7
Genes associated with DLBCL survival with SCAD (one-step = 1S and LLA) and MCP penalizations, sorted by the gene order in the original data set. ID refers to the unique Lymphochip identification number. The x's in a given column indicate the genes identified by the corresponding penalization.