## The Annals of Statistics

### The landscape of empirical risk for nonconvex losses

#### Abstract

Most high-dimensional estimation methods propose to minimize a cost function (the empirical risk) that is a sum of losses, one associated with each data point (each example). In this paper, we focus on the case of nonconvex losses. Classical empirical process theory implies uniform convergence of the empirical (or sample) risk to the population risk. Under additional assumptions, uniform convergence implies consistency of the resulting M-estimator, but it does not ensure that the estimator can be computed efficiently.
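As a notational sketch of the objects in this paragraph, the empirical risk, the population risk and the uniform convergence statement can be written as follows. The symbols ($z_i$ for a data point, $\theta \in \Theta$ for the parameter, $\ell$ for the loss) and the mode of convergence are assumptions made for illustration; the abstract itself does not fix them.

```latex
\widehat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; z_i),
\qquad
R(\theta) = \mathbb{E}\bigl[\ell(\theta; z)\bigr],
\qquad
\sup_{\theta \in \Theta} \bigl| \widehat{R}_n(\theta) - R(\theta) \bigr|
\xrightarrow{\;p\;} 0 \quad \text{as } n \to \infty .
```

An M-estimator in this notation is any minimizer of $\widehat{R}_n$ over $\Theta$; when $\ell$ is nonconvex in $\theta$, finding such a minimizer can be hard even though the display above holds.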

In order to capture the complexity of computing M-estimators, we study the landscape of the empirical risk, namely its stationary points and their properties. We establish uniform convergence of the gradient and Hessian of the empirical risk to their population counterparts, as soon as the number of samples becomes larger than the number of unknown parameters (modulo logarithmic factors). Consequently, good properties of the population risk carry over to the empirical risk, and we are able to establish a one-to-one correspondence between their stationary points. We demonstrate that in several problems, such as nonconvex binary classification, robust regression and Gaussian mixture models, this result implies a complete characterization of the landscape of the empirical risk, and of the convergence properties of descent algorithms.
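The gradient statement above can be illustrated numerically. The following is a minimal sketch, not the authors' experiment: it assumes a squared loss on a logistic link for binary classification, Gaussian features, and a very large sample as a stand-in for the population gradient. All function names and constants are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def empirical_gradient(theta, X, y):
    """Gradient of R_n(theta) = mean_i (y_i - sigmoid(<theta, x_i>))^2."""
    p = sigmoid(X @ theta)
    # Chain rule: d/dtheta (y - p)^2 = -2 (y - p) p (1 - p) x
    weights = -2.0 * (y - p) * p * (1.0 - p)
    return (weights[:, None] * X).mean(axis=0)


def sample(n, theta_star, rng):
    """Draw n examples: Gaussian features, Bernoulli labels (assumed model)."""
    X = rng.standard_normal((n, theta_star.size))
    y = (rng.random(n) < sigmoid(X @ theta_star)).astype(float)
    return X, y


def max_gradient_gap(n, theta_star, test_points, reference_grads, rng):
    """Largest gradient deviation over a finite set of test points."""
    X, y = sample(n, theta_star, rng)
    gaps = [
        np.linalg.norm(empirical_gradient(t, X, y) - g)
        for t, g in zip(test_points, reference_grads)
    ]
    return max(gaps)


rng = np.random.default_rng(0)
d = 5
theta_star = np.ones(d) / np.sqrt(d)

# Finite set of parameter values standing in for the supremum over theta.
test_points = rng.standard_normal((20, d))

# Proxy for the population gradient: the empirical gradient on a large sample.
X_big, y_big = sample(200_000, theta_star, rng)
reference_grads = [empirical_gradient(t, X_big, y_big) for t in test_points]

gap_small = max_gradient_gap(50, theta_star, test_points, reference_grads, rng)
gap_large = max_gradient_gap(20_000, theta_star, test_points, reference_grads, rng)

print(f"max gradient gap, n=50:    {gap_small:.4f}")
print(f"max gradient gap, n=20000: {gap_large:.4f}")
```

The maximum over a finite set of test points only approximates a uniform bound, and the comparison says nothing about the Hessian; it is meant to show the qualitative effect of the sample size growing past the number of parameters.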

We extend our analysis to the very high-dimensional setting in which the number of parameters exceeds the number of samples, and provide a characterization of the empirical risk landscape under a nearly information-theoretically minimal condition. Namely, if the number of samples exceeds the sparsity of the parameter vector (modulo logarithmic factors), then a suitable uniform convergence result holds. We apply this result to nonconvex binary classification and robust regression in very high dimensions.

#### Article information

Source
Ann. Statist., Volume 46, Number 6A (2018), 2747–2774.

Dates
Revised: August 2017
First available in Project Euclid: 7 September 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1536307232

Digital Object Identifier
doi:10.1214/17-AOS1637

Mathematical Reviews number (MathSciNet)
MR3851754

Zentralblatt MATH identifier
06968598

#### Citation

Mei, Song; Bai, Yu; Montanari, Andrea. The landscape of empirical risk for nonconvex losses. Ann. Statist. 46 (2018), no. 6A, 2747–2774. doi:10.1214/17-AOS1637. https://projecteuclid.org/euclid.aos/1536307232

#### Supplemental materials

• Supplement: Proofs and simulations. The supplement provides technical background lemmas, the proofs of all theorems, and additional numerical simulations.