## Electronic Journal of Statistics

### Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators

#### Abstract

For the problem of high-dimensional sparse linear regression, it is known that an $\ell_{0}$-based estimator can achieve the $1/n$ “fast” rate for prediction error without any conditions on the design matrix, whereas popular polynomial-time methods only guarantee the $1/\sqrt{n}$ “slow” rate in the absence of restrictive design conditions. In this paper, we show that the slow rate is intrinsic to a broad class of M-estimators. In particular, for estimators based on minimizing a least-squares cost function together with a (possibly nonconvex) coordinate-wise separable regularizer, there is always a “bad” local optimum whose prediction error is lower bounded by a constant multiple of $1/\sqrt{n}$. For convex regularizers, this lower bound applies to all global optima. The theory covers many popular estimators, including convex $\ell_{1}$-based methods as well as M-estimators based on nonconvex regularizers such as the SCAD penalty or the MCP regularizer. In addition, we show that bad local optima are very common: a broad class of local minimization algorithms with random initialization typically converges to a bad solution.
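The estimator class in question minimizes a least-squares cost plus a coordinate-wise separable penalty. As a minimal, hypothetical sketch (not the paper's construction, and with illustrative parameter choices throughout), the snippet below runs proximal gradient descent with the $\ell_1$ penalty's soft-thresholding prox on a toy sparse regression problem; swapping in another separable penalty such as SCAD or MCP would only change the scalar prox step.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of the l1 penalty t*|.|, the simplest coordinate-separable regularizer."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_b (1/2n)||y - Xb||^2 + lam * sum_j |b_j|.

    For SCAD or MCP, only the soft_threshold call would be replaced by the
    corresponding scalar prox; the gradient step is identical.
    """
    n, d = X.shape
    # step = 1/L, with L the Lipschitz constant of the smooth part's gradient
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy sparse problem: s-sparse signal, Gaussian design, small noise.
rng = np.random.default_rng(0)
n, d, s = 200, 50, 3
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:s] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)

beta_hat = ista(X, y, lam=0.05)
# In-sample prediction error (1/n)||X(beta_hat - beta_star)||^2
pred_err = np.linalg.norm(X @ (beta_hat - beta_star)) ** 2 / n
```

On a well-conditioned random design like this one the lasso performs well; the paper's lower bound concerns carefully constructed designs for which every such coordinate-separable M-estimator has a bad (local or global) optimum.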

#### Article information

Source
Electron. J. Statist., Volume 11, Number 1 (2017), 752–799.

Dates
Received: November 2015
First available in Project Euclid: 11 March 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1489201320

Digital Object Identifier
doi:10.1214/17-EJS1233

Mathematical Reviews number (MathSciNet)
MR3622646

Zentralblatt MATH identifier
1362.62053

Subjects
Primary: 62F12: Asymptotic properties of estimators
Secondary: 62J05: Linear regression

#### Citation

Zhang, Yuchen; Wainwright, Martin J.; Jordan, Michael I. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. Electron. J. Statist. 11 (2017), no. 1, 752–799. doi:10.1214/17-EJS1233. https://projecteuclid.org/euclid.ejs/1489201320

#### References

• [1] Regularized least-squares regression using lasso or elastic net algorithms. http://www.mathworks.com/help/stats/lasso.html.
• [2] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
• [3] D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
• [4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
• [5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
• [6] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer Series in Statistics. Springer, 2011.
• [7] F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for Gaussian regression. Annals of Statistics, 35(4):1674–1697, 2007.
• [8] E. Candes and T. Tao. The Dantzig selector: statistical estimation when $p$ is much larger than $n$. Annals of Statistics, 35(6):2313–2351, 2007.
• [9] E. J. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. Annals of Statistics, 37(5A):2145–2177, 2009.
• [10] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
• [11] F. H. Clarke. Optimization and Nonsmooth Analysis. Wiley-Interscience, New York, 1983.
• [12] A. S. Dalalyan, M. Hebiri, and J. Lederer. On the prediction performance of the Lasso. Bernoulli, 23:552–581, 2017.
• [13] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
• [14] V. F. Dem’yanov and V. N. Malozemov. Introduction to Minimax. Courier Dover Publications, 1990.
• [15] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
• [16] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
• [17] R. Foygel and N. Srebro. Fast rate and optimistic rate for $\ell_1$-regularized regression. Technical report, Toyota Technological Institute, 2011. arXiv:1108.037v1.
• [18] L. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
• [19] D. Ge, Z. Wang, Y. Ye, and H. Yin. Strong NP-hardness result for regularized $l_q$-minimization problems with concave penalty functions. arXiv preprint arXiv:1501.00622, 2015.
• [20] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. GIST: General iterative shrinkage and thresholding for non-convex sparse learning. Tsinghua University, 2013. URL http://www.public.asu.edu/~jye02/Software/GIST.
• [21] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
• [22] M. I. Jordan. On statistics, computation and scalability. Bernoulli, 19:1378–1390, 2013.
• [23] K. C. Kiwiel. An aggregate subgradient method for nonsmooth convex minimization. Mathematical Programming, 27(3):320–341, 1983.
• [24] J. B. Lasserre. An explicit exact SDP relaxation for nonlinear 0-1 programs. In Integer Programming and Combinatorial Optimization, pages 293–303. Springer, 2001.
• [25] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.
• [26] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277–289, 2008.
• [27] R. Mifflin. A Modification and an Extension of Lemaréchal’s Algorithm for Nonsmooth Minimization. Springer, 1982.
• [28] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
• [29] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
• [30] A. Nemirovski. Topics in non-parametric statistics. In P. Bernard, editor, École d’été de Probabilités de Saint-Flour XXVIII, Lecture Notes in Mathematics. Springer, 2000.
• [31] M. Pilanci, M. J. Wainwright, and L. El Ghaoui. Sparse learning via Boolean relaxations. Mathematical Programming, 151:63–87, 2015.
• [32] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.
• [33] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
• [34] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv:1003.2990, 2010.
• [35] H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3:411–430, 1990.
• [36] T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99:879–898, 2012.
• [37] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
• [38] S. A. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
• [39] M. J. Wainwright. Constrained forms of statistical minimax: Computation, communication and privacy. In Proceedings of the International Congress of Mathematicians, Seoul, Korea, 2014.
• [40] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable Optimization, pages 145–173. Springer, 1975.
• [41] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.
• [42] Y. Zhang, M. J. Wainwright, and M. I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Proceedings of the Conference on Computational Learning Theory (COLT), 2014.
• [43] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.