Statistical models: Conventional, penalized and hierarchical likelihood

: We give an overview of statistical models and likelihood, to- gether with two of its variants: penalized and hierarchical likelihood. The Kullback-Leibler divergence is referred to repeatedly in the literature, for deﬁning the misspeciﬁcation risk of a model and for grounding the likelihood and the likelihood cross-validation, which can be used for choosing weights in penalized likelihood. Families of penalized likelihood and particular sieves estimators are shown to be equivalent. The similarity of these likelihoods with a posteriori distributions in a Bayesian approach is considered.


Introduction
Since its proposal by Fisher [16], likelihood inference has occupied a central position in statistical inference.In some situations, modified versions of the likelihood have been proposed.Marginal, conditional, profile and partial likelihoods have been proposed to get rid of nuisance parameters.Pseudo-likelihood and hierarchical likelihood may be used to circumvent numerical problems in the computation of the likelihood, that are mainly due to multiple integrals.Penalized likelihood has been proposed to introduce a smoothness a priori knowledge on functions, thus leading to smooth estimators.Several reviews have already been published, for instance [31], but it is nearly impossible in a single paper to describe with some details all the types of likelihoods that have been proposed.This paper aims at describing the conventional likelihood and two of its variants: penalized and hierarchical likelihoods.The aim of this paper is not to give the properties of the estimators obtained by maximizing these likelihoods, but rather to describe these three likelihoods together with their link to the Kullback-Leibler divergence.This interest more in the foundations rather than the properties, leads us to first develop some reflexions and definitions about statistical models and to give a slightly extended version of the Kullback-Leibler divergence.
In section 2, we recall the definition of a density and the relationship between a density in the sample space and for a random variable.In section 3, we give a slightly extended version of the Kullback-Leibler divergence (making it explicit that it also depends on a sigma-field).Section 4 gives an account of statistical models, distinguishing mere statistical families from statistical models and defining the misspecification risk.Section 5 presents the likelihood and discusses issues about its computation and the performance of the estimator of the maximum likelihood in terms of Kullback-Leibler risk.In section 6, we define the penalized likelihood and show that for a family of penalized likelihood estimators there is an identical family of sieves estimators.In section 7, we describe the hierarchical likelihood.In section 8, we briefly sketch the possible unification of these likelihoods through a Bayesian representation that allows us to consider the maximum (possibly penalized) likelihood estimators as maximum a posteriori (MAP) estimators; this question however cannot be easily settled due to the non-invariance of the MAP for reparameterization.Finally, there is a short conclusion.

Definition of a density
Consider a measurable space (S, A) and two measures µ and ν with µ absolutely continuous relatively to ν.For G a sub-σ-field of A the Radon-Nikodym derivative of µ with respect to ν on X , denoted by: dµ dν |G is the G-measurable random variable such that The Radon-Nikodym derivative is also called the density.We are interested in the case where µ is a probability measure, which we will call P 1 ; ν may also be a probability measure, P 0 .In that case we can speak of the likelihood ratio and denote it L . In order to speak of a likelihood function, we have to define a model (see section 4).Note that likelihood ratios (as Radon-Nikodym derivatives) are defined with respect to a sigma-field.The definitions and properties of these probabilistic concepts are very clearly presented in [45].For the statistician, sigma-fields represent sets of events which may be (but are not always) observed.If H and G are different sigma-fields, dP 1 dP 0 |H and dP 1 dP 0 |G are different, but if H ⊂ G the former can be expressed as a conditional expectation (given H) of the latter and we have the fundamental formula: Consider now the case where the measurable space (Ω, F ) is the sample space of an experiment.For the statistician (Ω, F ) is not any measurable space: it is a space which enables us to represent real events.We shall write in bold character a probability on (Ω, F ), for instance, P 1 .Let us define a random variable X, that is, a measurable function from (Ω, F ) to (ℜ, B).The couple (P 1 , X) induces a probability measure on (ℜ, B) defined by: P 1 X (B) = P 1 {X −1 (B)}, B ∈ B. This probability measure is called the distribution of X.If this probability measure is absolutely continuous with respect to Lebesgue (resp.counting) measure, one speaks of continuous (resp.discrete) variable.For instance, for a continuous variable we define the density dλ , where λ is Lebesgue measure on ℜ, which is the usual probability density function (p.d.f.).Note that the p.d.f.depends on both P 1 and X, while dP 1 dP 0 |X depends on X but not on a specific random variable X. Often in applied statistics one works only with distributions, but this may leave some problems unsolved.
Example 1.Consider the case where concentrations of CD4 lymphocytes are measured.Ω represents the set of physical concentrations that may happen.Let the random variables X and Y express the concentration in number of CD4 by mm 3 and by ml respectively.Thus we have Y = 10 3 X.So X and Y are different, although they are informationally equivalent.For instance the events {ω : X(ω) = 400} and {ω : Y (ω) = 400000} are the same.The densities of X and Y , for the same P 1 on (Ω, F ), are obviously different.So, if we look only at distributions, we shall have difficulties to rigorously define what a model is.

The Kullback-Leibler risk
Many problems in statistical inference can be treated from the point of view of decision theory.That is, estimators for instance are chosen as minimizing some risk function.The most important risk function is based on the Kullback-Leibler divergence.Maximum likelihood estimators, use of Akaike criterion or likelihood cross-validation can be grounded on the Kullback-Leibler divergence.Given a probability P 2 absolutely continuous with respect to a probability P 1 and X a sub-σ-field of F , the loss using P 2 in place of P 1 is the log-likelihood ratio ].This is the Kullback-Leibler risk, also called divergence [28,29], information deviation [4] or entropy [1].The different names of this quantity reflects its central position in statistical theory, being connected to several fields of the theory.Several notations have been used by different authors.Here we choose the Cencov [4] notation: ].
If X is the largest sigma-field defined on the space, then we omit it in the notation.Note that the Kullback-Leibler risk is asymmetric and hence does not define a distance between probabilities; we have to take on this fact.If X is a random variable with p.d.f.f 1 X and f 2 X under P 1 and P 2 respectively we have and the divergence of the distribution P 2 X relative to P 1 X can be written: We have that I(P 2 |P 1 ; X ) = I(P 2 X |P 1 X ), if X is the σ-field generated by X on (Ω, F ).Note that on (Ω, F ) we have to specify that we assess the divergence on X ; we might assess it on a different sigma-field and would of course obtain a different result.This provides more flexibility.In particular, we shall use this in the case of incomplete data.The observation is represented by a sigma-field O. Suppose we are interested in making inference about the true probability on X .We have complete data if our observation is O = X .With incomplete data, in the case where the mechanism leading to incomplete data is deterministic, we have O ⊂ X .In that case it will be very difficult to estimate I(P 2 |P 1 ; X ) and it will be more realistic to use ].We need this flexibility to extend Akaike's argument for the likelihood and for developing model choice criteria to situations with incomplete data.This will become important in section 5, where P 1 will be the true unknown probability (denoted P * ) and the problem will be to estimate this divergence rather than to compute it.
Example 2. Suppose we are interested in modeling the time to an event, X, and we wish to evaluate the divergence of P 2 with respect to P 1 .It is natural to compute the divergence on the sigma-field X generated by X, I(P 2 |P 1 ; X ) = I(P 2 X |P 1 X ) given by formula (3.1).Suppose that we have an observation of X under P 1 which is right-censored at a fixed time C. We observe ( X, δ) where X = min(X, C) and δ = 1 {X≤C} .Thus on {X ≤ C} we observe all the events of X but on {X > C} we observe no more events.If we represent the observation by the sigma-field O, we can say that O is generated by ( X, δ).It is clear that we have O ⊂ X .Although in theory it is still interesting to compute the divergence of P 2 with respect to P 1 on the sigma-field X it is also interesting to compute it on the observed sigma-field, which is I(P 2 |P 1 ; O).It can be proved by simple probabilistic arguments that on {X ≤ C} we have dP 1 X (C) and thus where S 1 X (.) and S 2 X (.) are the survival functions of X under P 1 and P 2 respectively.

Statistical families
We consider a subset P of the probabilities on a measurable space (S, A).We shall call such a subset a family of probabilities, and we may parameterize this family.Following [22], a parameterization can be represented by a function from a set Θ with values in P: θ → P θ .It is desirable that this function be one-to-one, a property linked to the identifiability issue which will be discussed later in this section.The parameterization associated with the family of probabilities P can be denoted Π = (P θ ; θ ∈ Θ) and we have P = {P θ ; θ ∈ Θ}.We may denote Π ∼ P. If Π 1 ∼ P and Π 2 ∼ P, Π 1 and Π 2 are two parameterizations of the same family of probabilities and we may note Π 1 ∼ Π 2 .
P is really a family of probabilities and Π a parametrized family of probabilities.We may call them statistical families if the aim of considering such families is to make statistical inference.However, a family of probability on (ℜ, B) is not sufficient to specify a statistical model (here, we do not follow [22]).A statistical model depends on the random variables chosen, as exemplified in section 2.

Statistical models
A family of probabilities on the sample space of an experiment (Ω, F ) will be called a statistical model and a parameterization of this family will be called a parameterized statistical model.Definition 1. Two parameterized statistical models Π = (P θ , θ ∈ Θ) on X and Π ′ = (P γ , γ ∈ Γ) on Y are equivalent (in the sense that they specify the same statistical model) if X = Y and they specify the same family of probability on (Ω, X ).
The pair (X, Π) of a random variable and a parameterized statistical model induces the parameterized family (of distributions) on (ℜ, B): Π X = (P θ X ; θ ∈ Θ).Conversely, the pair (X, Π X ) induces Π if X = F .In that case, we may describe the statistical model by (X, Π X ).Two different random variables X and Y induce two (generally different) parameterized families on (ℜ, B), Π X and Π Y .Conversely, one may ask whether the pairs (X, Π X ) and (Y, Π Y ) define equal or equivalent parameterized statistical models.We need the definition of "informationally equivalent" random variables (or more generally random elements).Definition 2. X and Y are informationally equivalent if the sigma-fields X and Y generated by X and Y are equal.
For sake of simplicity we have considered distributions of real random variables.The same can be said about random variables with values in ℜ d or stochastic processes that are random elements with values in a Skorohod space.Commenges and Gégout-Petit [6] gave an instance of two informationally equivalent processes.The events described by an irreversible three-state process X = (X t ), where X t takes values 0, 1, 2, can be described by a bivariate counting process N = (N 1 , N 2 ).The law of the three-state process is specified by the transition intensities α 01 , α 02 , α 12 .There is a way of expressing the intensities λ 1 and λ 2 of N 1 and N 2 such that the laws of X and N correspond to the same probability on (Ω, F ). Thus the same statistical model can be described with X or with N .

Statistical models and true probability
So-called objectivist approaches to statistical inference assume that there is a true, generally unknown, probability P * .Frequentists as well as objectivist Bayesians adopt this paradigm while subjectivist Bayesians, following De Finetti [11], reject it.We adopt the objectivist paradigm, which in our view is more suited to answer scientific issues.Statistical inference aims to approach P * or functionals of P * .Model Π is well specified if P * ∈ Π and is mis-specified otherwise.If it is well specified, then there is a θ * ∈ Θ such that P θ * = P * .If we consider a probability P θ , we may measure its divergence with respect to P * on a given sigma-field O by I(P θ |P * ; O), and we may choose θ that minimizes this divergence.We assume that there exists a value θ opt that minimizes I(P θ |P * ; O).We call I(P θopt |P * ; O) the misspecification risk of model Π.Of course, if the model is well specified, then I(P θ |P * ; O) is minimized at θ * , and the misspecification risk is null.

Definition of the likelihood
Conventionally, most statistical models assume that independently identically distributed (i.i.d.) random variables, say X i , i = 1, . . ., n, are observed.However, in case of complex observation schemes, the observed random variables become complicated.Moreover the same statistical model can be described by different random variables.For instance, in Example 2 the observed random variables are the pairs ( Xi , δ i ).However, we may also describe the observation by (δ i X i , δ i ), or in terms of counting processes by (N i u , 0 ≤ u ≤ C), where (N i u = 1 {Xi≤u} ).These three descriptions are observationally equivalent, in the sense that they correspond to the same sigma-field, say We shall adopt the description of observations in terms of sigma-fields because it is more intrinsic.We shall work with a measure space (Ω, F ) containing all events of interest.For instance the observation of subject i, O i , belongs to F .Saying that observations are i.i.d.means that the O i are independent, that there is a one-to-one correspondence between O i and O i ′ and that the restrictions of P * to O i (denoted P * Oi ) are the same.We call Ōn the global observation: Ōn = ∨ n i=1 O i .Since we do not know P * , we may in the first place reduce the search by restricting our attention to a statistical model Π and find a P θ ∈ Π close to P * , that is, one which minimizes I(P θ |P * ; O i ).We have already given a name to it, P θopt , but we cannot compute it directly because we do not know P * .The problem is that I(P θ |P * ; O i ) doubly depends on the unknown P * : (i) through the Radon-Nikodym derivative and (ii) through the expectation.
Problem (i) can be eliminated by noting that L

Ōn
. In conclusion, the maximum likelihood estimator (MLE) can be considered as an estimator that minimizes a natural estimator of the Kullback-Leibler risk.

Computation of the likelihood
Computation of the likelihood is simple in terms of the probability on the observed σ-field.The conventional way of specifying a model is in terms of a random variable and a family of distributions (X, (f θ X (.)) θ∈Θ ).Then the likelihood for observation X is simply f θ X (X).When the events of interest are represented by stochastic processes in continuous time, it is also possible to define a density and hence a likelihood function.See [15] for diffusion processes and [23] for counting processes.
Two situations make the computation of the likelihood more complex.The first is when there is incomplete observation of the events of interest.If the mechanism leading to incomplete data is random we should in principle model it.The theory of ignorable missing observation of Rubin [38] has been extended to more general mechanisms leading to incomplete data in [20].This has been developed in the stochastic process framework by Commenges and Gégout-Petit [5] (who also give some general formulas for likelihood calculus).The second situation occurs when the law is described through a conditional probability and the conditioning events are not observed.This is the framework of random effects models (see section 7.1).Although conceptually different these two situations lead to the same problem: the likelihood for subject i can be relatively easily computed for a "complete" observation G i and the likelihood for the observation O i ⊂ G i is the conditional expectation (which derives from the fundamental formula): (5.1) The conditional expectation is expressed as an integral which must be computed numerically in most cases.The only notable exception is the linear mixed effects model where the integral can be analytically computed.For examples of algorithms for non-linear mixed effects see [12] and [19].For general formulas for the likelihood of interval-censored observations of counting processes see [6].

Performance of the MLE in terms of Kullback-Leibler risk
We expect good behavior of the MLE θ when the law of large numbers can be applied and when the number of parameters is not too large.Some cases of unsatisfactory behavior of the MLE are reported for instance in [30].The properties of the MLE may not be satisfactory when the number of parameters is too large, and especially when it increases with n as in an example given by Neymann and Scott [36].In this example (X i , Y i ), i = 1, . . ., n are all independent random variables with X i and Y i both normal N (ξ i , σ 2 ).It is readily seen that not only the MLE of ξ i , i = 1 . . ., n, but also the MLE of σ 2 are inconsistent.This example is typical of problems where there are individual parameters (a ξ i for each i), so that in fact the statistical model changes with n.Such situations are better approached by random effects models.
To assess the performance of the MLE we can use a risk which is an extended version of the Kullback-Leibler risk with respect to P * : The difference with the classical Kullback-Leibler risk is that here P θ is random: so EKL(P θ , O i ) is the expectation of the Kullkack-Leibler divergence between P θ and P * .In parametric models (that is, Θ is a subset of ℜ p ), it can be shown [9,35] that where I is the information matrix and J is the variance of the score, both computed in θ opt ; here the symbol Tr means the trace.This can be nicely interpreted by saying that the risk EKL(P θ, O i ) is the sum of the misspecification risk E P * [L ] and the statistical risk 1 2 n −1 Tr(I −1 J).Note in passing that if Π is well specified we have E P * [L P * /P θ opt Oi ] = 0 and I = J, and thus

The penalized likelihood
There is a large literature on the topic: see [13,14,17,18,21,25,37,44] among others.Penalized likelihood is useful when the statistical model is too large to obtain good estimators, while conventional parametric models appear too rigid.
A simple form of the penalized log-likelihood is where J(θ) is a measure of dislike of θ and κ weights the influence of this measure on the objective function.A classical example is when θ = (α(.),β), where α(.) is a function and β is a real parameter.J(θ) can be chosen as In this case J(θ) measures the irregularity of the function α(.).The maximum penalized likelihood estimator (MpLE) θ pl κ is the value of θ which maximizes pl κ (θ).κ is often called a smoothing coefficient in the cases where J(θ) is a measure of the irregularity of a function.More generally, we will call it a metaparameter.We may generalize the penalized log-likelihood by replacing κJ(θ) by J(θ, κ), where κ could be multidimensional.When κ varies, this defines a family of estimators,(θ pl κ ; κ ≥ 0).κ may be chosen by cross-validation (see section 8).There is another way of dealing with the problem of statistical models that might be too large.This is by using the so-called sieve estimators [40].Sieves are based on a sequence of approximating spaces.For instance rather than working with a functional parameter we may restrict to spaces where the function is represented on a basis (e.g. a splines basis).Here we consider a special sieves approach where the approximating spaces may be functional spaces.Consider a family of models (P ν ) ν≥0 where: For fixed ν, the MLE θν solves the constrained maximization problem: When ν varies this defines a family of sieve estimators: ( θν ; ν ≥ 0).θν maximizes the Lagrangian L for some value of λ.The Lagrangian superficially looks like the penalized log-likelihood function, but an important difference is that here the Lagrange multiplier λ is not fixed and is a part of the solution.If the problem is convex the Karush-Kuhn-Tucker conditions are necessary and sufficient.Here these conditions are It is clear that when the observation Ōn is fixed, the function κ → J(θ pl κ ) is a monotone decreasing function.Consider the case where this function is continuous and unbounded (when κ → 0).Then for each fixed ν there exists a value, say κ ν , such that J(θ pl κν ) = ν.Note that this value depends on Ōn .Now, it is easy to see that θ pl κν satisfies the Karush-Kuhn-Tucker conditions (6.2), with λ = κ ν .Thus, if we can find the correct κ ν we can solve the constrained maximization problem by maximizing the corresponding penalized likelihood.However, the search for κ ν is not simple, and we must remember that the relationship between ν and κ ν depends on Ōn .A simpler result, deriving from the previous considerations, is: Lemma 6.1 (Penalized and sieves estimators).The families (P θ pl κ ; κ ≥ 0) and (P θν ; ν ≥ 0) are identical families of estimators.
The consequence is that since it is easier to solve the unconstrained maximization problem involved in the penalized likelihood approach, one should apply this approach in applications.On the other hand, it may be easier to develop asymptotic results for sieve estimators (because θν is a MLE) than for penalized likelihood estimators.One should be able to derive properties of penalized likelihood estimators from those of sieve estimators.

Random effects models
An important class of models arises when we define a potentially observable variable Y i for each subject, and its distribution is given conditionally on unobserved quantities.This is the classical framework of random effects models, which we have already mentioned in subsections 5.2 and 5.3.Specifically, let us consider the following model: conditionally on b i , Y i has a density f Y |b (.; θ, b i ), where θ is a vector of parameters of dimension m and b i are random effects (or parameters) of dimension K.The (Y i , b i ) are i.i.d.Typically Y i is multivariate of dimension n i .We assume that the b i have density f b (.; τ ), where τ is a parameter.Typically Y i is observed, while b i is not.This can be made more general for including the case of censored observation of Y i .
The conventional approach for estimating θ is to compute the maximum likelihood estimators.Empirical Bayes estimators of the b i can be computed in a second stage.The likelihood (for observation i) is computed by taking the conditional expectation given O i of the complete likelihood on the sigma-field including the random effect G i = O i ∨ σ(b i ).This is an application of formula (5.1).Practically the computation of this conditional expectation involves the integrals f θ Y |b (Y i |b)f b (b)db.Random effects models have been thoroughly studied in both linear [43] and non-linear [10] cases.While in the linear case computation of the above integrals is analytical, in the non-linear case it is not.The numerical computation of these multiple integrals of dimension K is a daunting task if K is larger than 2 or 3, especially if the likelihood given the random effects is not itself very easy to compute; this is the curse of dimensionality.

Hierarchical likelihood
For hierarchical generalized linear models, the hierarchical likelihood (or hlikelihood), was proposed by Lee and Nelder [32]; see also [33,34].The hlikelihood is the joint (or complete) likelihood of the observations and the (unobserved) random effects, but where the random effects are treated as parameters.The complete loglikelihood is L P θ /P 0 Ḡn where γ = (θ, b) is the set of all the "parameters".Thus, estimators (here denoted MHLE) of both θ and b can be obtained by maximizing the h-loglikelihood: Often the loglikelihood can be written L However, this formulation is not completely general, because there are interesting cases where observations of the Y i are censored.So, we prefer writing the loglikelihood as L P γ /P 0 Ōn .We note γτ = ( θτ , bτ ) the maximum h-likelihood estimators of the parameters for given τ ; the latter (meta) parameter can be estimated by profile likelihood.The main interest of this approach is that there is no need to compute multiple integrals.This problem is replaced by that of maximizing hl τ (γ) over γ.That is, the problem is now a large number of parameters that must be estimated, which this is equal to m+nK.This may be large, but special algorithms can be used for generalized linear models.
Therneau and Grambsch [41] used the same approach for fitting frailty models, calling it a penalized likelihood.It may superficially look like the penalized quasi likelihood of Breslow and Clayton [2], but this is not the same thing.There is a link with the more conventional penalized likelihood for estimating smooth functions discussed in section 6.The h-likelihood can be considered as a penalized likelihood but with two important differences relative to the conventional one: (i) the problem is parametric; (ii) the number of parameters grows with n.Commenges et al. [9] have proved that the maximum h-likelihood estimators for the fixed parameters are M-estimators [42].Thus, under some regularity conditions they have an asymptotic normal distribution.However, this asymptotic distribution is not in general centered on the true parameter values, so that the estimators are biased.In practice the bias can be negligible so that this approach can be interesting in some situations due to its relative numerical simplicity.

Akaike and likelihood cross-validation criteria
An important issue is the choice between different estimators.Two typical situations are: (i) choice of MLE's in different models; (ii) choice of MpLE's with different penalties.If we consider two models Π and Π ′ we get two estimators P θ and P γ of the probability P * , and we may wish to assess which is better.This is the "model choice" issue.A penalized likelihood function produces a family of estimators (P θ pl κ ; κ ≥ 0), and we may wish to choose the best.Here, what we call "the best" estimator is the estimator that minimizes some risk function; in both cases we can use the extended version of the Kullback-Leibler However, the difference of risks between two estimators in parametric models ∆(P θ , P γ ) = EKL(P θ ; O i ) − EKL(P γ ; O i ) can be estimated by the statistic D(P θ , P γ ) = (1/2n)(AIC(P θ ) − AIC(P γ )) and a more refined analysis of the difference of risks can be developed, as in [9].
The leave-one-out likelihood cross-validation criterion can also be considered as a possible "estimator" up to a constant of EKL [7].It is defined as:

Oi
, where Ōn|i = ∨ j =i O j and O n+1 is another i.i.d.replicate of O i .Then, we define an estimator of the difference of risks between two estimators: The advantage of LCV is that it can be used for comparing smooth estimators in nonparametric models, and in particular it can be used for choosing the penalty weight in penalized likelihood.A disadvantage is the computational burden, but a general approximation formula has been given ( [7,37]): where H L Ōn and H plκ are the Hessian of the loglikelihood and penalized loglikelihood respectively.This expression looks like an AIC criterion and there are arguments to interpret Tr[H −1 plκ H L Ōn ] as the model degree of freedom.

Link with the MAP estimator
One important issue is the relationship between the three likelihoods considered here and the Bayesian approach.The question arises because it seems that these three likelihoods can be identified with the numerator of a posteriori distributions with particular priors.Thus MLE, MpLE and MHLE could be identified with the maximum a posteriori (MAP) estimators with the corresponding priors.However, this relationship depends on the parameterization.Thus the MLE is identical to the MAP using a uniform prior for the parameters.If we change the parameterization, the uniform prior on the new parameters does not correspond in general to the uniform prior on the original parameters, as was already noticed by Fisher [16].This apparent paradox led Jeffreys to propose a prior invariant for parameterization [24], known as Jeffrey's prior.However the MAP with Jeffreys's prior is no longer identical to the MLE when Jeffreys's prior is not uniform.For instance, for the parameter of a binomial trial, Jeffreys's prior is 1/ p(1 − p).Adding the logarithm of this term to the loglikelihood shifts the maximum away from 0.5.Moreover it is questionable whether this invariance property can be identified with a non-informativeness character of this prior (for a review on the choice of priors, see [26]).
In the Bayesian paradigm, rather than considering estimators based on maximization of some expression such as the likelihood or posterior density, it is common to attempt to summarize the statistical inferences by using quantiles of the posterior distribution, such as the median, or expectations with respect to the posterior.While such estimators may be more satisfactory, they typically involve multiple integrals that are hard to compute: computations are mostly being done with the MCMC algorithm.Maximization methods have the advantage of being potentially easier in the case where multiple integrals can be avoided.There are also approximate Bayesian methods, which yield the a posteriori marginal distribution by approximating some of the multiple integrals by Laplace approximation, which in turn involves a maximization problem.Rue et al. [39] claim that this approach is much faster than the MCMC algorithm.

Conclusion
The Kolmogorov representation of a statistical experiment has to be taken seriously if we want to have a deep understanding of what a statistical model is.The Kullback-Leibler risk is underlying most of the reflexions about likelihood, as was clearly seen by Akaike [1].Finally, the link with the Bayesian approach should be explored more thoroughly than could done in this paper.The MLE and MAP estimators are the same if, in a given paramterization, the prior used for the MAP is uniform.However, this identity is not stable with respect to reparameterizations.Similar remarks hold for the link between penalized likelihood and MAP.
taking expectation under P * :I(P θ |P * ; O i ) = I(P 0 |P * ; O i ) − E P * L P θ /P 0 Oi .Minimizing I(P θ |P * ; O i ) is equivalent to maximizing E P * (L P θ /P 0Oi).We cannot compute E P * (L P θ /P 0Oi), but we can estimate it.The law of large numbers tells us that, when n → ∞: may maximize the estimator on the left hand, which is the loglikelihood L P θ /P 0 Ōn divided by n.Maximizing the loglikelihood is equivalent to maximizing the likelihood function L P θ /P 0 Ōn .The likelihood function is the function θ → L P θ /P 0