The Annals of Statistics

Best subset selection via a modern optimization lens

Dimitris Bertsimas, Angela King, and Rahul Mazumder

Full-text: Access denied (no subscription detected) We're sorry, but we are unable to provide you with the full text of this article because we are not able to identify you as a subscriber. If you have a personal subscription to this journal, then please login. If you are already logged in, then you may need to update your profile to register your subscription. Read more about accessing full-text


In the period 1991–2015, algorithmic advances in Mixed Integer Optimization (MIO) coupled with hardware improvements have resulted in an astonishing 450 billion factor speedup in solving MIO problems. We present a MIO approach for solving the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations. We develop a discrete extension of modern first-order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than Lasso and other popularly used sparse learning procedures, in terms of achieving sparse solutions with good predictive power.

Article information

Ann. Statist. Volume 44, Number 2 (2016), 813-852.

Received: June 2014
Revised: August 2015
First available in Project Euclid: 17 March 2016

Permanent link to this document

Digital Object Identifier

Mathematical Reviews number (MathSciNet)

Zentralblatt MATH identifier

Primary: 62J05: Linear regression 62J07: Ridge regression; shrinkage estimators 62G35: Robustness
Secondary: 90C11: Mixed integer programming 90C26: Nonconvex programming, global optimization 90C27: Combinatorial optimization

Sparse linear regression best subset selection $\ell_{0}$-constrained minimization lasso least absolute deviation algorithms mixed integer programming global optimization discrete optimization


Bertsimas, Dimitris; King, Angela; Mazumder, Rahul. Best subset selection via a modern optimization lens. Ann. Statist. 44 (2016), no. 2, 813--852. doi:10.1214/15-AOS1388.

Export citation


  • Bandeira, A. S., Dobriban, E., Mixon, D. G. and Sawin, W. F. (2013). Certifying the restricted isometry property is hard. IEEE Trans. Inform. Theory 59 3448–3450.
  • Bertsimas, D., King, A. and Mazumder, R. (2015). Supplement to “Best subset selection via a modern optimization lens.” DOI:10.1214/15-AOS1388SUPP.
  • Bertsimas, D. and Mazumder, R. (2014). Least quantile regression via modern optimization. Ann. Statist. 42 2494–2525.
  • Bertsimas, D. and Shioda, R. (2009). Algorithm for cardinality-constrained quadratic optimizaiton. Comput. Optim. Appl. 43 1–22.
  • Bertsimas, D. and Weismantel, R. (2005). Optimization Over Integers. Dynamic Ideas, Belmont.
  • Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • Bienstock, D. (1996). Computational study of a family of mixed-integer quadratic programming problems. Math. Programming 74 121–140.
  • Bixby, R. E. (2012). A brief history of linear and mixed-integer programming computation. Doc. Math. Extra Volume: Optimization Stories 107–121.
  • Blumensath, T. and Davies, M. E. (2008). Iterative thresholding for sparse approximations. J. Fourier Anal. Appl. 14 629–654.
  • Blumensath, T. and Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27 265–274.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg.
  • Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris 346 589–592.
  • Candès, E. J. and Plan, Y. (2009). Near-ideal model selection by $\ell_{1}$ minimization. Ann. Statist. 37 2145–2177.
  • Candes, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory 52 5406–5425.
  • Candès, E. J., Wakin, M. B. and Boyd, S. P. (2008). Enhancing sparsity by reweighted $l_{1}$ minimization. J. Fourier Anal. Appl. 14 877–905.
  • Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
  • Dettling, M. (2004). Bagboosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
  • Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal $l_{1}$-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797–829.
  • Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via $l^{1}$ minimization. Proc. Natl. Acad. Sci. USA 100 2197–2202 (electronic).
  • Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47 2845–2862.
  • Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
  • Fan, Y. and Lv, J. (2013). Asymptotic equivalence of regularization methods in thresholded parameter space. J. Amer. Statist. Assoc. 108 1044–1061.
  • Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109–148.
  • Friedman, J. (2008). Fast sparse regression and classification. Technical report, Dept. Statistics, Stanford Univ., Stanford, CA.
  • Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat. 1 302–332.
  • Furnival, G. and Wilson, R. (1974). Regression by leaps and bounds. Technometrics 16 499–511.
  • Greenshtein, E. (2006). Best subset selection, persistence in high-dimensional statistical learning and optimization under $l_{1}$ constraint. Ann. Statist. 34 2367–2386.
  • Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
  • Gurobi, I. (2013). Optimization. Gurobi optimizer reference manual. Available at
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics. Springer, New York.
  • Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378.
  • Loh, P.-L. and Wainwright, M. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems 476–484. Curran Associates, Red Hook, NY.
  • Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37 3498–3528.
  • Mazumder, R., Friedman, J. H. and Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. J. Amer. Statist. Assoc. 106 1125–1138.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
  • Miller, A. (2002). Subset Selection in Regression, 2nd ed. Monographs on Statistics and Applied Probability 95. Chapman & Hall/CRC, Boca Raton, FL.
  • Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput. 24 227–234.
  • Nemhauser, G. (2013). Integer programming: The global impact. 2013–12–2013–04, Rome, Italy. Presented at EURO, INFORMS, Accessed. Available at
  • Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA.
  • Nesterov, Yu. (2005). Smooth minimization of non-smooth functions. Math. Program. 103 127–152.
  • Nesterov, Yu. (2013). Gradient methods for minimizing composite functions. Math. Program. 140 125–161.
  • Optimization Inc. (2015). Gurobi 6.0 performance benchmarks. Available at Accessed 5 September 2015.
  • Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_{q}$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
  • Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Ann. Inst. Statist. Math. 65 807–832.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 73 273–282.
  • Top500 Supercomputer Sites (2015). Directory page for Top500 lists. In Result for each list since June 1993. Accessed: 09-15-2015. Available at
  • Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030–1051.
  • van de Geer, S., Bühlmann, P. and Zhou, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688–749.
  • Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
  • Zhang, C.-H. (2010a). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
  • Zhang, T. (2010b). Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 11 1081–1107.
  • Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
  • Zhang, Y., Wainwright, M. and Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. Preprint. Available at arXiv:1402.1918.
  • Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statist. Sci. 27 576–593.
  • Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
  • Zheng, Z., Fan, Y. and Lv, J. (2014). High dimensional thresholded regression and shrinkage effect. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 627–649.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1509–1533.

Supplemental materials

  • Supplement to “Best subset selection via a modern optimization lens”. Supporting technical material and additional experimental results including some figures and tables are presented in the supplementary material section.