Annals of Statistics

Kullback–Leibler aggregation and misspecified generalized linear models

Philippe Rigollet



In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it requires neither identifiability of the parameter nor that the model on the systematic component be true. We show that this problem can be solved by constrained and/or penalized likelihood maximization, and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally, all the bounds are proved to be optimal in a minimax sense.
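The constrained likelihood maximization mentioned in the abstract can be illustrated, in the logistic special case, by the following minimal sketch: a convex combination of given candidate predictors is fitted by maximizing the Bernoulli log-likelihood over the simplex using an exponentiated-gradient (mirror descent) update. This is a generic illustration, not the paper's procedure; the function name, step size, and iteration count are hypothetical choices.

```python
import numpy as np

def kl_convex_aggregate(F, y, steps=500, lr=0.5):
    """Convex aggregation for logistic regression (illustrative sketch).

    F : (n, M) matrix whose columns are candidate linear predictors
        f_1(x_i), ..., f_M(x_i) on the n deterministic design points.
    y : binary responses in {0, 1}^n.
    Maximizes the Bernoulli log-likelihood of F @ theta over the
    simplex via exponentiated-gradient (mirror) descent.
    """
    n, M = F.shape
    theta = np.full(M, 1.0 / M)          # start at the simplex center
    for _ in range(steps):
        z = F @ theta                    # aggregated linear predictor
        p = 1.0 / (1.0 + np.exp(-z))     # logistic mean response
        grad = F.T @ (p - y) / n         # gradient of the negative log-likelihood
        theta = theta * np.exp(-lr * grad)  # multiplicative (mirror) update
        theta /= theta.sum()             # re-normalize onto the simplex
    return theta
```

On simulated data in which one candidate carries the signal and another is pure noise, the returned weights concentrate on the informative candidate, as the oracle-inequality viewpoint suggests they should.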

Article information

Ann. Statist., Volume 40, Number 2 (2012), 639–665.

First available in Project Euclid: 17 May 2012


Primary: 62G08 (nonparametric regression)
Secondary: 62J12 (generalized linear models); 68T05 (learning and adaptive systems [see also 68Q32, 91E40]); 62F11

Aggregation; regression; classification; oracle inequalities; finite sample bounds; generalized linear models; logistic regression; minimax lower bounds


Rigollet, Philippe. Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 (2012), no. 2, 639--665. doi:10.1214/11-AOS961.



  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971) 267–281. Akad. Kiadó, Budapest.
  • Alquier, P. and Lounici, K. (2011). PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat. 5 127–145.
  • Audibert, J. Y. (2008). Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. T. Roweis, eds.) 41–48. MIT Press, Cambridge, MA.
  • Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, Chichester.
  • Bartlett, P. L., Mendelson, S. and Neeman, J. (2012). $\ell_1$-regularized linear regression: Persistence and oracle inequalities. Probab. Theory Related Fields. To appear.
  • Belomestny, D. and Spokoiny, V. (2007). Spatial aggregation of local likelihood estimates with applications to classification. Ann. Statist. 35 2287–2311.
  • Boucheron, S., Bousquet, O. and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM Probab. Stat. 9 323–375.
  • Breiman, L. (1999). Prediction games and arcing algorithms. Neural Comput. 11 1493–1517.
  • Brown, L. D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes—Monograph Series 9. IMS, Hayward, CA.
  • Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Springer, Berlin. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
  • Dalalyan, A. and Salmon, J. (2011). Sharp oracle inequalities for aggregation of affine estimators. Available at arXiv:1104.3969.
  • Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
  • Ekeland, I. and Témam, R. (1999). Convex Analysis and Variational Problems. Classics in Applied Mathematics 28. SIAM, Philadelphia, PA.
  • Fahrmeir, L. and Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Statist. 13 342–368.
  • Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In International Conference on Machine Learning 148–156.
  • Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–407.
  • Greenshtein, E. (2006). Best subset selection, persistence in high-dimensional statistical learning and optimization under $\ell_1$ constraint. Ann. Statist. 34 2367–2386.
  • Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
  • Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
  • Juditsky, A., Rigollet, P. and Tsybakov, A. B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
  • Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
  • Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. Univ. California Publ. Statist. 1 277–329.
  • Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721.
  • Lecué, G. (2012). Empirical risk minimization is optimal for the convex aggregation problem. Bernoulli. To appear.
  • Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
  • Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York.
  • Lounici, K. (2007). Generalized mirror averaging and D-convex aggregation. Math. Methods Statist. 16 246–259.
  • Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin.
  • McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
  • Mease, D. and Wyner, A. (2008). Evidence contrary to the statistical view of boosting. J. Mach. Learn. Res. 9 131–156.
  • Mitchell, C. and van de Geer, S. (2009). General oracle inequalities for model selection. Electron. J. Stat. 3 176–204.
  • Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
  • Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2008). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19 1574–1609.
  • Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
  • Rigollet, P. (2012). Supplement to “Kullback–Leibler aggregation and misspecified generalized linear models.” DOI:10.1214/11-AOS961SUPP.
  • Rigollet, P. and Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Math. Methods Statist. 16 260–280.
  • Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
  • Rigollet, P. and Tsybakov, A. (2012). Sparse estimation by exponential weighting. Statist. Sci. To appear.
  • Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT (B. Schölkopf and M. K. Warmuth, eds.). Lecture Notes in Computer Science 2777 303–313. Springer, Berlin.
  • White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1–25.
  • Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87.
  • Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.

Supplemental materials

  • Supplementary material: Minimax lower bounds. Under some convexity and tail conditions, we prove minimax lower bounds for the three problems of Kullback–Leibler aggregation: model selection, linear and convex. The proof consists of three steps: first, we identify a subset of admissible estimators; then we reduce the problem to a standard problem of regression function estimation under the mean squared error criterion; and finally, we apply standard minimax lower bounds to complete the proof.
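For orientation, the minimax-optimal rates for the three aggregation problems were established for Gaussian regression by Tsybakov (2003), cited above; the paper and its supplement prove Kullback–Leibler analogues of bounds of this form. With $M$ candidate functions and sample size $n$, the classical rates read (up to constants):

```latex
% Optimal rates of aggregation in the Gaussian regression case
% (Tsybakov, 2003): model selection (MS), linear (L) and convex (C).
\[
\psi_n^{\mathrm{MS}}(M) \asymp \frac{\log M}{n}, \qquad
\psi_n^{\mathrm{L}}(M) \asymp \frac{M}{n}, \qquad
\psi_n^{\mathrm{C}}(M) \asymp
\begin{cases}
  \dfrac{M}{n}, & M \le \sqrt{n},\\[6pt]
  \sqrt{\dfrac{1}{n}\,\log\!\Bigl(1 + \dfrac{M}{\sqrt{n}}\Bigr)}, & M > \sqrt{n}.
\end{cases}
\]
```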