Most of the non-asymptotic theoretical work in regression is carried out for the square loss, where estimators can be obtained through closed-form expressions. In this paper, we use and extend tools from the convex optimization literature, namely self-concordant functions, to provide simple extensions of theoretical results for the square loss to the logistic loss. We apply the extension techniques to logistic regression with regularization by the ℓ2-norm and regularization by the ℓ1-norm, showing that new results for binary classification through logistic regression can be easily derived from corresponding results for least-squares regression.