Statistics Surveys

Statistical models: Conventional, penalized and hierarchical likelihood

Daniel Commenges

Full-text: Open access


We give an overview of statistical models and likelihood, together with two of its variants: penalized and hierarchical likelihood. The Kullback-Leibler divergence is used repeatedly: to define the misspecification risk of a model, and to ground both the likelihood and likelihood cross-validation, which can be used to choose weights in penalized likelihood. Families of penalized likelihood estimators and particular sieves estimators are shown to be equivalent. The similarity of these likelihoods to a posteriori distributions in a Bayesian approach is considered.

Article information

Statist. Surv. Volume 3 (2009), 1-17.

First available in Project Euclid: 7 April 2009

Primary: 62-02: Research exposition (monographs, survey articles); 62C99: None of the above, but in this section
Secondary: 62A01: Foundations and philosophical topics

Keywords: Bayes estimators; cross-validation; h-likelihood; incomplete data; Kullback-Leibler risk; likelihood; penalized likelihood; sieves; statistical models


Commenges, Daniel. Statistical models: Conventional, penalized and hierarchical likelihood. Statist. Surv. 3 (2009), 1–17. doi:10.1214/08-SS039.

