## Annals of Statistics

### A new perspective on least squares under convex constraint

Sourav Chatterjee

#### Abstract

Consider the problem of estimating the mean of a Gaussian random vector when the mean vector is assumed to be in a given convex set. The most natural solution is to take the Euclidean projection of the data vector onto this convex set; in other words, to perform “least squares under a convex constraint.” Many problems in modern statistics and statistical signal processing theory are special cases of this general situation. Examples include the lasso and other high-dimensional regression techniques, function estimation problems, matrix estimation and completion, shape-restricted regression, constrained denoising, linear inverse problems, etc. This paper presents three general results about this problem, namely, (a) an exact computation of the main term in the estimation error by relating it to expected maxima of Gaussian processes (existing results only give upper bounds), (b) a theorem showing that the least squares estimator is always admissible up to a universal constant in any problem of the above kind and (c) a counterexample showing that the least squares estimator may not always be minimax rate-optimal. The result from part (a) is then used to compute the error of the least squares estimator in two examples of contemporary interest.
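As an illustration of the setup only, and not of the paper's results or proofs, the sketch below computes the Euclidean projection in one of the special cases the abstract lists, shape-restricted regression. The convex set is assumed here to be the cone of nondecreasing vectors, and the projection is computed with the standard pool-adjacent-violators algorithm. The function names `project_monotone` and `squared_error`, the sample size, and the unit noise level are all hypothetical choices made for this example.

```python
import random


def project_monotone(y):
    """Euclidean projection of y onto {theta : theta_1 <= ... <= theta_n}.

    Uses the pool-adjacent-violators algorithm: keep a stack of blocks,
    each stored as [sum, count], and merge the last two blocks whenever
    their means violate the ordering.
    """
    blocks = []
    for value in y:
        blocks.append([value, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        # Every coordinate in a block is replaced by the block mean.
        fitted.extend([total / count] * count)
    return fitted


def squared_error(estimate, truth):
    """Squared Euclidean distance between an estimate and the true mean."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    # Assumed true mean: a nondecreasing vector, so it lies in the convex set.
    mu = [i / n for i in range(n)]
    # Data vector: true mean plus independent standard Gaussian noise.
    y = [m + rng.gauss(0.0, 1.0) for m in mu]
    mu_hat = project_monotone(y)
    print("error of constrained least squares:", squared_error(mu_hat, mu))
    print("error of the raw data vector:      ", squared_error(y, mu))
```

Because the true mean lies in the convex set, the projection is never farther from it than the raw data vector is, so the first printed error should not exceed the second.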

#### Article information

Source
Ann. Statist., Volume 42, Number 6 (2014), 2340-2381.

Dates
First available in Project Euclid: 20 October 2014

Permanent link to this document
https://projecteuclid.org/euclid.aos/1413810730

Digital Object Identifier
doi:10.1214/14-AOS1254

Mathematical Reviews number (MathSciNet)
MR3269982

Zentralblatt MATH identifier
1302.62053

#### Citation

Chatterjee, Sourav. A new perspective on least squares under convex constraint. Ann. Statist. 42 (2014), no. 6, 2340–2381. doi:10.1214/14-AOS1254. https://projecteuclid.org/euclid.aos/1413810730
