## Statistical Science

### Proximal Algorithms in Statistics and Machine Learning

#### Abstract

Proximal algorithms are useful for obtaining solutions to difficult optimization problems, especially those involving nonsmooth or composite objective functions. A proximal algorithm is one whose basic iterations involve the proximal operator of some function, whose evaluation requires solving a specific optimization problem that is typically easier than the original problem. Many familiar algorithms can be cast in this form, and this “proximal view” turns out to provide a set of broad organizing principles for many algorithms useful in statistics and machine learning. In this paper, we show how a number of recent advances in this area can inform modern statistical practice. We focus on several main themes: (1) variable splitting strategies and the augmented Lagrangian; (2) the broad utility of envelope (or variational) representations of objective functions; (3) proximal algorithms for composite objective functions; and (4) the surprisingly large number of functions for which there are closed-form solutions of proximal operators. We illustrate our methodology with regularized logistic and Poisson regression incorporating a nonconvex bridge penalty and a fused lasso penalty. We also discuss several related issues, including the convergence of nondescent algorithms, acceleration, and optimization for nonconvex functions. Finally, we provide directions for future research in this exciting area at the intersection of statistics and optimization.
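To make the abstract's central idea concrete, here is a minimal sketch (not from the paper itself) of a proximal gradient iteration for the lasso. The ℓ1 penalty has a closed-form proximal operator, the soft-thresholding rule, which is one of the closed-form proxes the paper surveys; the helper names `prox_l1` and `proximal_gradient_lasso` are illustrative choices of ours.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||x||_1, i.e. the minimizer of
    #   t * ||x||_1 + (1/2) * ||x - v||^2,
    # which has the closed-form soft-thresholding solution.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, b, lam, step, n_iter=500):
    # Proximal gradient descent for the composite objective
    #   (1/2) * ||A x - b||^2 + lam * ||x||_1,
    # alternating a gradient step on the smooth least-squares term
    # with an exact prox step on the nonsmooth penalty.
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = prox_l1(x - step * grad, step * lam)
    return x
```

The iteration converges for step sizes below the reciprocal of the Lipschitz constant of the smooth term's gradient (here, 1 over the largest eigenvalue of AᵀA); acceleration schemes such as Nesterov's method, discussed in the paper, modify only the point at which the gradient is evaluated.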

#### Article information

Source
Statist. Sci. Volume 30, Number 4 (2015), 559–581.

Dates
First available in Project Euclid: 9 December 2015

Permanent link to this document
http://projecteuclid.org/euclid.ss/1449670858

Digital Object Identifier
doi:10.1214/15-STS530

Mathematical Reviews number (MathSciNet)
MR3432841

#### Citation

Polson, Nicholas G.; Scott, James G.; Willard, Brandon T. Proximal Algorithms in Statistics and Machine Learning. Statist. Sci. 30 (2015), no. 4, 559--581. doi:10.1214/15-STS530. http://projecteuclid.org/euclid.ss/1449670858.

#### References

• Allain, M., Idier, J. and Goussard, Y. (2006). On global and local convergence of half-quadratic algorithms. IEEE Trans. Image Process. 15 1130–1142.
• Allen-Zhu, Z. and Orecchia, L. (2014). A novel, simple interpretation of Nesterov’s accelerated method as a combination of gradient and mirror descent. Preprint. Available at arXiv:1407.1537.
• Argyriou, A., Micchelli, C. A., Pontil, M., Shen, L. and Xu, Y. (2011). Efficient first order methods for linear composite regularizers. Preprint. Available at arXiv:1104.1436.
• Attouch, H. and Bolte, J. (2009). On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116 5–16.
• Attouch, H., Bolte, J. and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137 91–129.
• Attouch, H., Bolte, J., Redont, P. and Soubeyran, A. (2010). Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35 438–457.
• Beck, A. and Sabach, S. (2015). Weiszfeld’s method: Old and new results. J. Optim. Theory Appl. 164 1–40.
• Beck, A. and Teboulle, M. (2004). A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59 235–247.
• Beck, A. and Teboulle, M. (2010). Gradient-based algorithms with applications to signal recovery problems. In Convex Optimization in Signal Processing and Communications (D. P. Palomar and Y. C. Eldar, eds.) 42–88. Cambridge Univ. Press, Cambridge.
• Beck, A. and Teboulle, M. (2014). A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. 42 1–6.
• Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning 2010 1–38.
• Besag, J. (1986). On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. Ser. B 48 259–302.
• Bien, J., Taylor, J. and Tibshirani, R. (2013). A LASSO for hierarchical interactions. Ann. Statist. 41 1111–1141.
• Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge.
• Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. now Publishers, Hanover, MA.
• Brègman, L. M. (1967). A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7 200–217.
• Cevher, V., Becker, S. and Schmidt, M. (2014). Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31 32–43.
• Chambolle, A. and Pock, T. (2011). A first-order primal–dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision 40 120–145.
• Chaux, C., Combettes, P. L., Pesquet, J.-C. and Wajs, V. R. (2007). A variational formulation for frame-based inverse problems. Inverse Probl. 23 1495–1518.
• Chen, P., Huang, J. and Zhang, X. (2013). A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Probl. 29 025011, 33.
• Chen, G. and Teboulle, M. (1994). A proximal-based decomposition method for convex minimization problems. Math. Program. 64 81–101.
• Chouzenoux, E., Pesquet, J.-C. and Repetti, A. (2014). Variable metric forward–backward algorithm for minimizing the sum of a differentiable function and a convex function. J. Optim. Theory Appl. 162 107–132.
• Chrétien, S. and Hero, A. O. III (2000). Kullback proximal algorithms for maximum-likelihood estimation. IEEE Trans. Inform. Theory 46 1800–1810.
• Combettes, P. L. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering 185–212. Springer, New York.
• Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statist. Decisions 1 (supplement issue) 205–237.
• Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38.
• Duckworth, D. (2014). The big table of convergence rates. Available at https://github.com/duckworthd/duckworthd.github.com/blob/master/blog/big-table-of-convergence-rates.html.
• Esser, E., Zhang, X. and Chan, T. F. (2010). A general framework for a class of first order primal–dual algorithms for convex optimization in imaging science. SIAM J. Imaging Sci. 3 1015–1046.
• Figueiredo, M. A. T. and Nowak, R. D. (2003). An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process. 12 906–916.
• Frankel, P., Garrigos, G. and Peypouquet, J. (2015). Splitting methods with variable metric for Kurdyka–Lojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165 874–900.
• Geman, D. and Reynolds, G. (1992). Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Anal. Mach. Intell. 14 367–383.
• Geman, D. and Yang, C. (1995). Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Process. 4 932–946.
• Giselsson, P. and Boyd, S. (2014). Preconditioning in fast dual gradient methods. In Proceedings of the 53rd Conference on Decision and Control. 5040–5045. Los Angeles, CA.
• Gravel, S. and Elser, V. (2008). Divide and concur: A general approach to constraint satisfaction. Phys. Rev. E 78 036706.
• Green, P. J. (1990). On use of the EM algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. Ser. B 52 443–452.
• Green, P. J., Łatuszyński, K., Pereyra, M. and Robert, C. P. (2015). Bayesian computation: A perspective on the current state, and sampling backwards and forwards. Preprint. Available at arXiv:1502.01148.
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
• Hestenes, M. R. (1969). Multiplier and gradient methods. J. Optim. Theory Appl. 4 303–320.
• Hu, Y. H., Li, C. and Yang, X. Q. (2015). Proximal gradient algorithm for group sparse optimization.
• Komodakis, N. and Pesquet, J.-C. (2014). Playing with duality: An overview of recent primal–dual approaches for solving large-scale optimization problems. Preprint. Available at arXiv:1406.5429.
• Magnússon, S., Weeraddana, P. C., Rabbat, M. G. and Fischione, C. (2014). On the convergence of alternating direction lagrangian methods for nonconvex structured optimization problems. Preprint. Available at arXiv:1409.8033.
• Marjanovic, G. and Solo, V. (2013). On exact $\ell^{q}$ denoising. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on 6068–6072. IEEE, New York.
• Martinet, B. (1970). Brève communication. Regularisation d’inéquations variationnelles par approximations successives. ESAIM Math. Modell. Numer. Anal. 4 154–158.
• Meng, X. and Chen, H. (2011). Accelerating Nesterov’s method for strongly convex functions with Lipschitz gradient. Preprint. Available at arXiv:1109.6058.
• Micchelli, C. A., Shen, L. and Xu, Y. (2011). Proximity algorithms for image models: Denoising. Inverse Probl. 27 045009, 30.
• Micchelli, C. A., Shen, L., Xu, Y. and Zeng, X. (2013). Proximity algorithms for the L1/TV image denoising model. Adv. Comput. Math. 38 401–426.
• Nesterov, Yu. E. (1983). A method for solving the convex programming problem with convergence rate $O(1/k^{2})$. Sov. Math., Dokl. 27 372–376.
• Nikolova, M. and Ng, M. K. (2005). Analysis of half-quadratic minimization methods for signal and image recovery. SIAM J. Sci. Comput. 27 937–966 (electronic).
• Noll, D. (2014). Convergence of non-smooth descent methods using the Kurdyka–Łojasiewicz inequality. J. Optim. Theory Appl. 160 553–572.
• O’Donoghue, B. and Candès, E. (2015). Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15 715–732.
• Palmer, J., Kreutz-Delgado, K., Rao, B. D. and Wipf, D. P. (2005). Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 18 1059–1066. Vancouver, BC, Canada.
• Papa Quiroz, E. A. and Oliveira, P. R. (2009). Proximal point methods for quasiconvex and convex functions with Bregman distances on Hadamard manifolds. J. Convex Anal. 16 49–69.
• Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization 1 123–231.
• Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on 2358–2363. IEEE, New York.
• Patrinos, P., Stella, L. and Bemporad, A. (2014). Douglas–Rachford splitting: Complexity estimates and accelerated variants. Preprint. Available at arXiv:1407.6723.
• Pereyra, M. (2013). Proximal Markov chain Monte Carlo algorithms. Preprint. Available at arXiv:1306.0187.
• Polson, N. G. and Scott, J. G. (2012). Local shrinkage rules, Lévy processes and regularized regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 287–311.
• Polson, N. G. and Scott, J. G. (2015). Mixtures, envelopes, and hierarchical duality. J. Roy. Statist. Soc. Ser. B. To appear. Available at arXiv:1406.0177.
• Rockafellar, R. T. (1974). Conjugate Duality and Optimization. SIAM, Philadelphia, PA.
• Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14 877–898.
• Rockafellar, R. T. and Wets, R. J.-B. (1998). Variational Analysis. Springer, Berlin.
• Rudin, L., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Phys. D 60 259–268.
• Shor, N. Z. (1985). Minimization Methods for Nondifferentiable Functions. Springer, Berlin.
• Tansey, W., Koyejo, O., Poldrack, R. A. and Scott, J. G. (2014). False discovery rate smoothing. Technical report, Univ. Texas at Austin.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. Ann. Statist. 42 285–323.
• Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann. Statist. 39 1335–1371.
• Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 91–108.
• Von Neumann, J. (1951). Functional Operators: The Geometry of Orthogonal Spaces. Princeton Univ. Press, Princeton, NJ.
• Weiszfeld, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Math. J. 43 355–386.
• Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
• Zhang, X., Saha, A. and Vishwanathan, S. V. N. (2010). Regularized risk minimization by Nesterov’s accelerated gradient methods: Algorithmic extensions and empirical studies. Preprint. Available at arXiv:1011.0472.
• Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320.