Estimation and Model Selection for Model-Based Clustering with the Conditional Classification Likelihood

The Integrated Completed Likelihood (ICL) criterion was proposed by Biernacki et al. (2000) in the model-based clustering framework to select a relevant number of classes, and has been used by statisticians in various application areas. A theoretical study of this criterion is proposed. A contrast related to the clustering objective is introduced: the conditional classification likelihood. This yields an estimator and a class of model selection criteria. The properties of these new procedures are studied and ICL is proved to be an approximation of one of these criteria. We contrast these results with the currently prevailing view of ICL, namely that it is not consistent. Moreover these results give insights into the class notion underlying ICL and feed a reflection on the class notion in clustering. General results on penalized minimum contrast criteria and on mixture models are derived, which are interesting in their own right.


Introduction
Model-based clustering is introduced in Sections 1.1 and 1.2. Our purpose is to better understand the behavior of the ICL model selection criterion of Biernacki et al. (2000), which is presented in Section 1.3.
The main topic of this work is the choice of the number of classes in a model-based clustering framework, and hence the choice of the number of components of a Gaussian mixture. The interested reader may refer to Titterington et al. (1985) or McLachlan and Peel (2000) for comprehensive studies on Gaussian mixture models. The latter also provides an overview of the approaches for assessing the number of components, and particularly of the standard and widely used penalized likelihood criteria, such as AIC (Akaike, 1973) or BIC (Schwarz, 1978).
The ICL criterion studied here is an alternative to BIC. Up to now it has mostly been presented as a penalized likelihood criterion whose penalty involves an "entropy" term. Here, however, we prove that it is actually a penalized contrast criterion, with a contrast which differs from the standard likelihood. This explains why it is neither surprising nor a drawback that ICL does not asymptotically select the "true" number of components, even when the "true" model is considered. Even for data arising from a mixture distribution, a relevant number of classes may differ from the true number of components of the mixture.
The reason why we introduce this new contrast $L_{cc}$ (Section 2.1) is not that we believe it a priori to be the best one for clustering purposes, but rather that it makes it possible to study and understand ICL theoretically. We prove (Section 4.3) that ICL is an approximation of a criterion linked to this contrast: studying ICL further then amounts to studying $L_{cc}$. The notion of class underlying ICL is proved to be a compromise between Gaussian mixture density estimation and a strictly "cluster" point of view (Section 5).
Let $X$ be a random variable in $\mathbb{R}^d$ with distribution $f^\wp \cdot \lambda$ and $X_1, \dots, X_n$ an i.i.d. sample from the same distribution. Let us denote $\mathbf{X} = (X_1, \dots, X_n)$.
All proofs are gathered in Section 6.
The Gaussian mixture models $M_K$ are studied here as parametric models. The existence of a parametrization $\varphi : \Theta_K \subset \mathbb{R}^{D_K} \to M_K$ is then assumed. It is assumed that $\Theta_K$ and $\varphi$ are "optimal", in the sense that $D_K$ is minimal. $D_K$ is the number of free parameters in the model $M_K$ and is called the dimension of $M_K$. For example, at most $(K - 1)$ mixing proportions need to be parametrized.
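As an illustration of the dimension $D_K$, the following minimal sketch counts free parameters for two common Gaussian mixture families in $\mathbb{R}^d$. The function name `mixture_dimension` and the two covariance options are hypothetical choices made for this example; the count itself relies only on the standard facts that the $K$ proportions sum to one and that a symmetric $d \times d$ covariance matrix has $d(d+1)/2$ free entries.

```python
def mixture_dimension(n_components, dim, covariance="general"):
    """Count the free parameters D_K of a Gaussian mixture in R^dim.

    Hypothetical helper: the name and the two covariance options are
    illustrative assumptions, not notation taken from the paper.
    """
    if n_components < 1 or dim < 1:
        raise ValueError("n_components and dim must be positive integers")
    proportions = n_components - 1          # the proportions sum to one
    means = n_components * dim              # one mean vector per component
    if covariance == "general":
        # a symmetric d x d matrix has d(d+1)/2 free entries
        covariances = n_components * dim * (dim + 1) // 2
    elif covariance == "diagonal":
        covariances = n_components * dim    # one variance per coordinate
    else:
        raise ValueError("covariance must be 'general' or 'diagonal'")
    return proportions + means + covariances


print(mixture_dimension(3, 2))               # general model, K = 3, d = 2
print(mixture_dimension(3, 2, "diagonal"))   # diagonal model, K = 3, d = 2
```

Constrained families (equal volumes, shared covariance matrices, and so on) would need their own counting rule.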
The parametrization need not be assumed identifiable, i.e. $\varphi$ need not be injective. Indeed our purpose is twofold: identifying a relevant number of classes to be designed, and actually designing those classes. Theorem 4.2 justifies that the first task can be achieved under a weaker "identifiability" assumption. Theorem 3.2 then guarantees that our estimator converges to the set of best parameters, any of which is as good as the others. There will be no "true parameter" assumption. The classes can finally be defined through the MAP rule (see Section 1.2). In practice, the parameters themselves are never the quantities of interest here. They only serve as a convenient notation, and this is also why we expect that the assumption about the Fisher information (see Theorem 4.2) is technical and could perhaps be avoided with other techniques. Please refer to Baudry (2009, Chapter 4) for a more comprehensive discussion of the identifiability question.

Model-Based Clustering
Although the results are stated first for much more general situations, this paper is devoted to the question of clustering through Gaussian mixture models.
The process is standard (see Fraley and Raftery, 2002):
• fit each considered mixture model;
• select a model and a number of components based on the first step;
• classify the observations through the MAP rule (recalled below) with respect to the mixture distribution fitted in the selected model.
Notably, the usual choice is made here of identifying a class with each fitted Gaussian component. The number of classes to be designed is then chosen at the second step. See for example Hennig (2010) and Baudry et al. (2010) for alternative approaches.

Let us recall the MAP classification rule. It involves the conditional probabilities of the components: $\tau_k(x; \theta)$ is the probability that $X$ arises from the $k$th component, conditionally on $X = x$, under the distribution defined by $\theta$. Let us also denote $\tau_{ik}(\theta) = \tau_k(X_i; \theta)$. The MAP classification rule then assigns $x$ to a component $k$ which maximizes $\tau_k(x; \theta)$.

Let us denote by $L$ the observed likelihood associated to $\mathbf{X}$, $L(\theta) = \prod_{i=1}^n f(X_i; \theta)$. The maximum likelihood estimator in the model $M_K$ is denoted by $\hat\theta^{MLE}_K$.
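The conditional probabilities and the MAP rule can be sketched as follows for a univariate Gaussian mixture. This is only an illustration under stated assumptions: the function names, the parametrization by lists of proportions, means and variances, and the numerical values are all hypothetical, and the posterior is computed by the usual Bayes formula (weighted component density divided by the mixture density).

```python
import math


def normal_pdf(x, mean, var):
    """Density of the univariate Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, props, means, variances):
    """Conditional probabilities tau_k(x; theta) of the components given x."""
    weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(props, means, variances)]
    total = sum(weighted)  # mixture density f(x; theta)
    return [w / total for w in weighted]


def map_label(x, props, means, variances):
    """MAP rule: index of a component maximizing tau_k(x; theta)."""
    tau = posteriors(x, props, means, variances)
    return max(range(len(tau)), key=lambda k: tau[k])


# Hypothetical two-component mixture, used only for illustration.
props, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
print(posteriors(0.0, props, means, variances))  # equidistant point
print(map_label(-2.0, props, means, variances), map_label(3.0, props, means, variances))
```

For points far in the tails the mixture density may underflow to zero; a log-sum-exp formulation would be more robust there.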

ICL
Our motivation is to better understand the ICL (Integrated Completed Likelihood) criterion. Let us introduce the classification likelihood $L_c$ associated to the complete data sample $(\mathbf{X}, \mathbf{Z})$, where $Z \in \{0, 1\}^K$ is the unobserved label of $X$ ($Z_k = 1 \Leftrightarrow X$ arises from component $k$). To mimic the derivation of the BIC criterion (Schwarz, 1978) in a clustering framework, Biernacki et al. (2000) approximate the integrated classification likelihood through a Laplace approximation. They then assume that the mode of the classification likelihood can be identified with $\hat\theta^{MLE}_K$ when $n$ is large enough, and replace the unobserved $Z_{ik}$'s by their MAP estimators under $\hat\theta^{MLE}_K$. This is questionable, notably when the components of $\hat\theta^{MLE}_K$ are not well separated. This yields the ICL criterion. McLachlan and Peel (2000) instead replace the $Z_{ik}$'s by their conditional expectations $\tau_{ik}(\hat\theta^{MLE}_K)$. Both versions of ICL appear to behave analogously, and the latter is considered from now on. ICL differs from the standard and widely used BIC criterion of Schwarz (1978) through the entropy term (see Section 2.2): with both criteria written so as to be maximized,
$$\mathrm{crit}_{ICL}(K) = \mathrm{crit}_{BIC}(K) - \mathrm{ENT}(\hat\theta^{MLE}_K; \mathbf{X}), \qquad \mathrm{ENT}(\theta; \mathbf{X}) = -\sum_{i=1}^n \sum_{k=1}^K \tau_{ik}(\theta) \log \tau_{ik}(\theta).$$

BIC is known to be consistent, in the sense that it asymptotically selects the true number of components, at least when the true distribution actually lies in one of the considered models (Keribin, 2000; Nishii, 1988). This nice property may however not suit a clustering purpose. In many applications, there is no reason to assume that the distribution conditional on the (unobserved) labels $Z$ is Gaussian. In this case BIC tends to overestimate the number of components, since several Gaussian components are needed to approximate each non-Gaussian component of the true mixture distribution $f^\wp$. The user may rather be interested in a cluster notion, as opposed to this strict component approach, which also includes a notion of separation and which is robust to non-Gaussian components. Of course, this depends on the application, and on what a class should be.
It may be of interest to separate into two different classes a group of observations for which the best fit is reached with a mixture of two Gaussian components having quite different parameters (we particularly think of the covariance matrix parameters). BIC tends to do so. But it may also be more relevant, and closer to an intuitive notion of cluster, to identify two very close, or largely overlapping, Gaussian components as a single non-Gaussian-shaped cluster (see for example Figure 3). ICL has been derived with this viewpoint. It is widely understood and explained (for instance in Biernacki et al., 2000) as the BIC criterion with a supplemental penalty, which is the entropy (Section 2.2). Since the latter penalizes models whose maximum likelihood estimator yields an uncertain MAP classification, ICL is more robust than BIC to non-Gaussian components. However, we do not think that the entropy should be considered as a penalty term, and another point of view will be developed in this paper.
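The difference between the two criteria can be made concrete with a small sketch. It assumes the maximization convention, BIC written as the maximized log-likelihood minus $\frac{D_K}{2}\log n$, and the McLachlan and Peel form of ICL, that is BIC minus the entropy at the maximum likelihood estimator. The function names and all numerical values below are hypothetical and chosen only to show how a large entropy can reverse the choice.

```python
import math


def bic(loglik, dim, n):
    """BIC in 'larger is better' form: log-likelihood minus (dim / 2) log n."""
    return loglik - 0.5 * dim * math.log(n)


def icl(loglik, entropy, dim, n):
    """ICL (conditional-expectation version, assumed here): BIC minus the entropy."""
    return bic(loglik, dim, n) - entropy


# Hypothetical fitted values for K = 2 and K = 3, with n = 100 observations.
n = 100
fits = {
    2: {"loglik": -150.0, "entropy": 1.0, "dim": 5},
    3: {"loglik": -140.0, "entropy": 25.0, "dim": 8},  # overlapping components
}
bic_values = {K: bic(f["loglik"], f["dim"], n) for K, f in fits.items()}
icl_values = {K: icl(f["loglik"], f["entropy"], f["dim"], n) for K, f in fits.items()}
print(max(bic_values, key=bic_values.get))  # K preferred by BIC
print(max(icl_values, key=icl_values.get))  # K preferred by ICL
```

In this made-up example the three-component fit has the better likelihood but a much larger entropy, so the two criteria disagree.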
The behavior of ICL has been studied through simulations and real data studies by Biernacki et al. (2000), McLachlan and Peel (2000, Section 6.11), Steele and Raftery (2010), and in several simulation studies (see Baudry, 2009, Chapter 4). Besides, several authors chose to use it, for the reasons mentioned above, in various application areas: Goutte et al. (2001) (fMRI images); Pigeau and Gelgon (2005) (automatic sorting of image collections); Hamelryck et al. (2006).

This practical interest in ICL suggests that it meets an interesting notion of cluster, corresponding to what some users expect. But no theoretical study is available. Our main motivation is to go further in this direction. This leads to considering new estimation and model selection procedures for clustering, similar to ICL but for which the underlying logic is followed through to its conclusion, from the estimation step to the model selection step, instead of introducing the MLE. It is proved that ICL is an approximation of a criterion which is consistent for a particular loss function.

A New Contrast: Conditional Classification Likelihood
The contrast minimization framework turns out to be a fruitful approach. It makes it possible to understand that ICL is not a penalized likelihood criterion, contrary to the usual point of view. It should rather be linked to another contrast: the conditional classification likelihood.

Definition, Origin
In a clustering context, the classification likelihood (see (1)) is an interesting quantity, but the labels $\mathbf{Z}$ are not observed, and we do not even assume that they exist (think of the case where several models with different numbers of components are fitted: at most one of them can correspond to the true number of classes). Besides the above-mentioned work of Biernacki et al. (2000), Biernacki and Govaert (1997), for example, already proposed to involve the classification likelihood directly to select the number of classes, by estimating the unobserved data. We propose here to consider its expectation conditional on the observed sample $\mathbf{X}$. In case there exists a true classification and a model with the true number of classes is considered, this conditional expectation can be interpreted as the quantity closest to the classification likelihood that can be considered given the available information. Let us report the following algebraic relation between $L$ and $L_c$:
$$\log L_c(\theta) = \log L(\theta) + \sum_{i=1}^n \sum_{k=1}^K Z_{ik} \log \tau_{ik}(\theta).$$
Then, denoting the conditional expectation of $\log L_c(\theta)$ by $\log L_{cc}(\theta)$ (for Conditional Classification log Likelihood),
$$\log L_{cc}(\theta) = E_\theta\big[\log L_c(\theta) \mid \mathbf{X}\big] = \log L(\theta) + \sum_{i=1}^n \sum_{k=1}^K \tau_{ik}(\theta) \log \tau_{ik}(\theta) = \log L(\theta) - \mathrm{ENT}(\theta; \mathbf{X}),$$
which is obviously linked to the clustering objective. In the following we consider $-\log L_{cc}$ as an empirical contrast to be minimized.
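The relation between $\log L$, the entropy and the conditional expectation of the classification log-likelihood can be checked numerically. The sketch below is an illustration under assumptions: it uses a univariate Gaussian mixture, the standard complete-data form in which observation $i$ contributes $\sum_k Z_{ik} \log(\text{proportion}_k \times \text{density}_k(x_i))$, and hypothetical names and data.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the univariate Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def likelihood_terms(xs, props, means, variances):
    """Return (log L, ENT, E[log L_c | X]) for a univariate Gaussian mixture."""
    log_l, entropy, cond_exp = 0.0, 0.0, 0.0
    for x in xs:
        weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(props, means, variances)]
        density = sum(weighted)                 # mixture density at x
        log_l += math.log(density)
        for w in weighted:
            t = w / density                     # conditional probability tau_ik
            if t > 0.0:
                entropy -= t * math.log(t)      # contribution to ENT
                cond_exp += t * math.log(w)     # Z_ik replaced by tau_ik
    return log_l, entropy, cond_exp


# Hypothetical data and parameters, for illustration only.
xs = [-2.3, -1.9, -0.2, 0.1, 1.8, 2.4]
log_l, entropy, cond_exp = likelihood_terms(xs, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
print(log_l - entropy, cond_exp)  # the two values agree up to rounding
```

With a single component the conditional probabilities all equal one, the entropy vanishes and the two log-likelihoods coincide.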

Entropy
$\log L_{cc}$ differs from $\log L$ through the entropy (see (3)). The behavior of the entropy is based on the properties of the function $h : t \in [0, 1] \mapsto -t \log t$ (with $h(0) = 0$). This nonnegative function (see Figure 1) takes zero value if and only if $t = 0$ or $t = 1$. It is continuous but not differentiable at 0, and in particular it is not Lipschitz over $[0, 1]$, which will be a source of difficulties in the analysis. Let us also introduce the function $h_K : (t_1, \dots, t_K) \in \Pi_K \mapsto \sum_{k=1}^K h(t_k)$. This nonnegative function (see Figure 2) takes zero value if and only if there exists $k_0 \in \{1, \dots, K\}$ such that $t_{k_0} = 1$ and $t_k = 0$ for $k \neq k_0$. It reaches its maximum value $\log K$ at $(t_1, \dots, t_K) = (\frac{1}{K}, \dots, \frac{1}{K})$ (proof in Section 6).

Now, the contribution $\mathrm{ENT}(\theta; x_i)$ of a single observation to the total entropy $\mathrm{ENT}(\theta; \mathbf{x})$ is considered. Figure 3 represents a dataset simulated from a four-component Gaussian mixture. Let $\theta$ be such that $f(\,\cdot\,; \theta) = f^\wp$. First, $\mathrm{ENT}(\theta; x_i) \approx 0$ if and only if there exists $k_0$ such that $\tau_{ik_0} \approx 1$ and $\tau_{ik} \approx 0$ for $k \neq k_0$. There is no difficulty in classifying $x_i$ in such a case (for example $x_{i_1}$). Second, $\mathrm{ENT}(\theta; x_i)$ is all the larger as $(\tau_{i1}, \dots, \tau_{iK})$ is closer to $(\frac{1}{K}, \dots, \frac{1}{K})$, i.e. as the classification through the MAP rule is more uncertain. The worst case is reached when the conditional distribution over the components $1, \dots, K$ is uniform. The observation $x_{i_2}$, for example, has about the same posterior probability $\frac{1}{2}$ of arising from each of the two components surrounding it. Its individual entropy is about $\log 2$.
In conclusion, the individual entropy measures how uncertain the assignment of the considered observation through the MAP classification rule is: the lower the entropy, the more confident the assignment. The total entropy $\mathrm{ENT}(\theta; \mathbf{x})$ sums these individual contributions and thus measures the MAP classification quality for the whole sample. Involving this quantity in a clustering study means that one expects the classification to be confident. The class notion underlying the choice of the conditional classification likelihood as a contrast is then a compromise between the fit (and then the idea of Gaussian-shaped classes), because of the likelihood term on the one hand, and the assignment confidence, because of the entropy term on the other hand (which is rather a cluster point of view).
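The properties of $h$ and $h_K$ stated above are easy to check numerically. The following minimal sketch uses hypothetical function names `h` and `h_K` that mirror the notation of this section; it is meant as an illustration of the stated properties, not as part of the analysis.

```python
import math


def h(t):
    """h(t) = -t log t on [0, 1], extended by continuity with h(0) = 0."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 0.0 if t == 0.0 else -t * math.log(t)


def h_K(ts):
    """Entropy of a vector of conditional probabilities (assumed to sum to one)."""
    return sum(h(t) for t in ts)


print(h_K([1.0, 0.0, 0.0]))             # confident assignment: zero entropy
print(h_K([1 / 3, 1 / 3, 1 / 3]))       # uniform case: log 3
print(h_K([0.5, 0.5, 0.0]))             # two equally likely components: log 2
print(h(1e-8) / 1e-8, h(1e-2) / 1e-2)   # slope h(t)/t = -log t grows as t -> 0
```

The last line illustrates why $h$ is not Lipschitz near 0: the ratio $h(t)/t = -\log t$ is unbounded as $t$ goes to 0.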

log L cc as a Contrast
See for example Massart (2007) for an introduction to contrast minimization. Let us consider the best distribution from the $L_{cc}$ point of view in a model $M_m = \{f(\,\cdot\,; \theta) : \theta \in \Theta_m\}$, namely the distribution minimizing the corresponding loss function, that is, any parameter in
$$\Theta^0_m = \operatorname{argmin}_{\theta \in \Theta_m} E_{f^\wp}\big[-\log L_{cc}(\theta)\big].$$
That this expectation exists is a very mild assumption. The nonemptiness of $\Theta^0_m$ may be guaranteed for example by assuming $\Theta_m$ to be compact.

Let $K$ be fixed and consider the minimization of the loss function at hand in the model $M_K$ (Section 1.1). First of all, remark that $\log L_{cc} = \log L$ if $K = 1$: $\Theta^0_1$ is the set of parameters of the distributions which minimize the Kullback-Leibler divergence to $f^\wp$. Now, if $K > 1$, $\theta^0_K \in \Theta^0_K$ may be close to minimizing the Kullback-Leibler divergence if the corresponding components do not overlap, since the entropy is then about zero. But if those components overlap, this is no longer the case (Example 2.1).
To completely define the loss function, and to fully understand this framework, it is necessary to consider the best element of the universe $U$. The universe $U$ must be chosen with care. There is no natural relevant choice, contrary to the density estimation framework, where the set of all densities may be chosen. First, the considered contrast is well defined in a parametric mixture setup, and not necessarily over any set of mixture densities, because the definition of the entropy term involves the definition of each component. However, this would still make it possible to consider mixtures much more general than mixtures of Gaussian components. The ideas developed in Baudry et al. (2010) may for example suggest involving mixtures whose components are themselves Gaussian mixtures. But this would not make sense: the mixture with one component which is a mixture of $K$ Gaussian components, and which then yields a single non-Gaussian-shaped class, always has a smaller $-\log L_{cc}$ value than the corresponding Gaussian mixture yielding $K$ classes. This illustrates how carefully the components involved in the study must be chosen: involving for example any mixture of Gaussian mixtures means that one considers that a class may be almost anything, and may notably contain two Gaussian-shaped clusters very far from each other! The components should in any case be chosen with respect to the corresponding cluster shape. The most natural choice is then to involve only Gaussian mixtures in the universe: $U$ may be chosen as $\bigcup_{1 \le K \le K_M} M_K$.

Example 2.1. [...] (Figure 4), with $\mu_0 \approx 0.83$ and $\sigma_0^2 \approx 0.31$. This solution is obviously not the same as the one minimizing the Kullback-Leibler divergence (see Figure 5). This illustrates that the objective with the $-\log L_{cc}$ contrast is not to recover the true distribution, even when it is available in the considered model.
The necessity of choosing a relevant model is striking in this example: this two-component model should obviously not be used for a clustering purpose, at least for datasets of large enough size.
The estimator associated to the −log L cc contrast is now considered.

Estimation: MLccE
Let us fix the number of components K and the model M K . The subscript K is omitted in the notation of this section. A new minimum contrast estimator is considered. Results are stated in a general parametric model setting with a general contrast γ and a model M with parameter space Θ ⊂ R D , and then the conditions they involve are discussed in our framework. General conditions ensuring the consistency of such an estimator are given in Theorem 3.1. They notably involve the Glivenko-Cantelli property of the class of functions {γ(θ) : θ ∈ Θ}. Sufficient conditions in terms of bracketing entropy for this property to hold are recalled and verified in the considered context in Section 3.2. Those results are also useful in the study of the model selection step (Section 4). Brought together, they provide the consistency of the estimator in Gaussian mixture models: this is Theorem 3.2.
Here and hereafter, all expectations $E$ and probabilities $P$ are taken with respect to $f^\wp \cdot \lambda$. For a general contrast $\gamma$, we write its empirical version $\gamma_n(\theta) = \frac{1}{n} \sum_{i=1}^n \gamma(\theta; X_i)$.

Definition, Consistency
The minimum contrast estimator is named MLccE (Maximum conditional classification Likelihood Estimator): it is any minimizer of the empirical contrast, $\hat\theta^{MLccE} \in \operatorname{argmin}_{\theta \in \Theta} \{-\log L_{cc}(\theta)\}$. To ensure its existence, we assume that $\Theta$ is compact. This is a heavy assumption, but it will be natural and necessary for the following results to hold. That the covariance matrices are bounded from below is a reasonable and necessary assumption in the Gaussian mixture framework: without this assumption, neither the log-likelihood nor the conditional classification likelihood would be bounded (for $K \ge 2$). Insights for choosing lower bounds on the proportions and the covariance matrices are suggested in Baudry (2009, Section 5.1). The upper bound on the covariance matrices and the compactness condition on the means, although not necessary in the standard likelihood framework, do not seem to be avoidable here (see Section 3.2). This is a consequence of the behavior of the entropy term as a component goes to zero.
The following theorem, which is directly adapted from van der Vaart (1998, Section 5.2), gives sufficient conditions for the consistency of a minimum contrast estimator $\hat\theta$. We write $\forall \theta \in \Theta$, [...]

The strong consistency holds if (A3) is replaced by an almost sure convergence (this is the case under the conditions we are to define) and if the inequality in the definition of $\hat\theta$ holds almost surely.
Assumption (A1) is the least that can be expected. It is guaranteed if the parameter space is compact.
Sketch of proof. The assumptions guarantee a convenient situation. With large probability as $n$ grows, from (A3), $\gamma_n(\theta)$ is uniformly close to $E_{f^\wp}[\gamma(\theta)]$: this holds for $\hat\theta$ and $\theta^0$. Then, from the definition of $\hat\theta$, $E_{f^\wp}[\gamma(\hat\theta)]$ cannot be much larger than $E_{f^\wp}[\gamma(\theta^0)]$, which reaches the minimal value. By (A2), this implies that $\hat\theta$ cannot be far from $\Theta^0$.
Let us apply Theorem 3.1 to Gaussian mixtures, with $\gamma = -\log L_{cc}$ and $\Theta = \Theta_K$. The two following hypotheses will be involved: [...] $H_M(\log L_{cc}, \Theta^O, 1)$ results from Lemma 3.2 and shall be discussed in Section 3.2. Under the compactness assumption, $\hat\theta^{MLccE}$ is then consistent. It is even strongly consistent if it minimizes the empirical contrast almost surely. Let us highlight that it then converges to the set of parameters minimizing the loss function, which has no reason to contain the true distribution (except for $K = 1$), even if the latter lies in $M$.

Bracketing Entropy and Glivenko-Cantelli Property
Recall that a class of functions over $\mathbb{R}^d$ is $P$-Glivenko-Cantelli, with $P$ a probability measure over $\mathbb{R}^d$, if it fulfills a uniform law of large numbers for the distribution $P$. A sufficient condition for a family $G$ to be $P$-Glivenko-Cantelli is that it is not too complex, which can be measured through its entropy with bracketing:

Definition 3.1 ($L_r(P)$-entropy with bracketing). Let $r \in \mathbb{N}^*$ and $l, u \in L_r(P)$. The bracket $[l, u]$ is the set of all functions $g \in G$ with $l \le g \le u$. $[l, u]$ is an $\varepsilon$-bracket if $\|u - l\|_r \le \varepsilon$. The bracketing number $N_{[\,]}(\varepsilon, G, L_r(P))$ is the minimum number of $\varepsilon$-brackets needed to cover $G$. The entropy with bracketing $E_{[\,]}(\varepsilon, G, L_r(P))$ of $G$ with respect to $P$ is the logarithm of $N_{[\,]}(\varepsilon, G, L_r(P))$.
It is quite natural that the behavior of all functions lying inside an $\varepsilon$-bracket can be uniformly controlled by the behavior of the endpoints of the bracket. If those endpoints belong to $L_1(P)$, they fulfill a law of large numbers, and if the number of brackets needed to cover $G$ is finite, then it is no surprise that $G$ can be proved to fulfill a uniform law of large numbers:

Theorem 3.3. Every class $G$ of measurable functions such that $E_{[\,]}(\varepsilon, G, L_1(P)) < \infty$ for every $\varepsilon > 0$ is $P$-Glivenko-Cantelli.
The reader is referred to van der Vaart (1998, Chapter 19) for accurate definitions and a proof of this result. This is a generalization of the usual Glivenko-Cantelli theorem. We shall prove that the class of functions {γ( . ; θ) : θ ∈ Θ K } has finite ε-bracketing entropy for any ε > 0 and the assumption (A3) will be ensured.
From now on, since $\Theta$ is typically assumed to be compact, it is assumed that $\Theta \subset \Theta^O \subset \mathbb{R}^D$, with $\Theta^O$ an open set over which $\gamma$ is defined and $C^1$ for $f^\wp d\lambda$-almost all $x$. This is no problem for Gaussian mixture models with $\log L_{cc}$ (or with the standard likelihood, by the way), for example with the general or diagonal model. But this requires (with the $\log L_{cc}$ contrast) the proportions to be positive. Actually, this could be avoided here, but we will need this assumption for the definition of $M$ (Hypothesis $H_M(\gamma, \Theta, r)$). As already mentioned, components going to zero must be avoided. For the same technical reason, we have to assume the mean parameters to be bounded.
Lemma 3.1 guarantees that the bracketing entropy of $\{\gamma(\,\cdot\,; \theta) : \theta \in \Theta\}$ is finite for any $\varepsilon$, if $\Theta$ is convex and bounded. The assumption about the differential of the contrast is not a difficulty in our framework, provided that non-zero lower bounds over $\Theta$ on the proportions and the covariance matrices are imposed. The lemma is written for any $\tilde\Theta$ bounded and included in $\Theta$ (which is not assumed to be bounded itself) since it will be applied locally around $\theta^0$ in Section 4.
Lemma 3.1 (Bracketing Entropy, Convex Case). Let $r \in \mathbb{N}^*$, $D \in \mathbb{N}^*$ and $\Theta \subset \mathbb{R}^D$ assumed to be convex. Let [...]

Remark that $\Theta$ does not have to be compact. The proof is a calculation which relies on the mean value theorem, hence the convexity assumption. The natural parameter space of diagonal Gaussian mixture models, with equal volumes (if $d > 1$) or not, for instance, is convex (see Examples 6.1 and 6.2, p. 26). General mixture models have a convex natural parameter space, too, since the set of positive definite matrices is convex. However, there is no reason why the parameter space $\Theta$ should be convex in general.
Lemma 3.1 can then be generalized at the price of assuming $\Theta$ to be compact, and included in an open set $\Theta^O$ such that $H_M(\gamma, \Theta^O, r)$ holds. This is no difficulty for the mixture models we consider, under the same lower bound constraints as before (since $\Theta^O$ itself can be chosen to be included in a compact subset of the set of possible parameters). The entropy is then increased by a multiplying factor $Q$, which only depends on $\Theta$ and roughly measures its "nonconvexity". Since the exponential behavior of the entropy with respect to $\varepsilon$ is of concern, this does not make the result really weaker.
Lemma 3.2 (Bracketing Entropy, Compact Case). Let $r \in \mathbb{N}^*$, $D \in \mathbb{N}^*$ and $\Theta \subset \mathbb{R}^D$ assumed to be compact. Let [...] $Q$ is a constant which depends on the geometry of $\Theta$ ($Q = 1$ if $\Theta$ is convex).
This lemma is proved by applying Lemma 3.1, since $\Theta$ is still locally convex. Since it is compact, it can be covered with a finite number $Q$ of open balls, which are convex. Lemma 3.1 then applies to the convex hull of the intersection of $\Theta$ with each one of them. The supremum of $M$ is taken over $\Theta^O$, instead of $\Theta$, to make sure that the assumptions of Lemma 3.1 are fulfilled over those entire balls, which may not be included in $\Theta$.
The result we need for Section 4 is Lemma 3.3, obtained from Lemma 3.1 by a slight modification. Since it is applied locally there, the convexity assumption is no problem. A supplementary and strong assumption $H_M(\gamma, \Theta, \infty)$ is made. This is not fulfilled in the general Gaussian mixture framework. A sufficient condition is that the support of $f^\wp$ is bounded. This is of course false for most usual distributions we may have in mind, but this is a reasonable modeling assumption: most modeled phenomena are bounded. Another sufficient condition to guarantee this assumption is that the contrast is bounded from above. This is actually not the case of the contrast $-\log L_{cc}$, but it can be imposed: replace $-\log L_{cc}$ by $(-\log L_{cc} \wedge C)$ and, provided that $C$ is large enough, this new contrast behaves like $-\log L_{cc}$. Choosing a relevant value of $C$ is a supplemental difficulty in practice, though.

Lemma 3.3 (Bracketing Entropy, Convex Case). Let $r \ge 2$, $D \in \mathbb{N}^*$ and $\Theta \subset \mathbb{R}^D$ assumed to be convex. Let [...]

Let us remark that those results are quite general. We are interested here in their application to the conditional classification likelihood, but they hold all the same in the standard likelihood framework. Maugis and Michel (2011) already provide bracketing entropy results in this framework. Our results cannot be directly compared to theirs since they consider the Hellinger distance. The dependency they get on the parameter space bounds and the variable space dimension $d$ is explicit. This is helpful to derive an oracle inequality. But they could not derive a local control of the entropy, hence an unpleasant logarithm term in the expression of the optimal penalty they get. Their results also suggest the necessity of assuming the contrast to be bounded: see the discussion after Theorem 4.2. The results we propose achieve the same rate with respect to $\varepsilon$. They depend on more opaque quantities ($\|M\|_\infty$ and $\|M\|_2$).
This notably implies, from this first step already, the assumption that the contrast is bounded over the support of the true distribution. However, one could expect to control those quantities with respect to the parameter space bounds. Moreover, besides their simplicity, these results straightforwardly yield a local control of the entropy.

Model Selection
As illustrated by Example 2.1, model selection is a crucial step. The number of classes may even be the target of the study. Anyhow, a relevant number of classes must obviously be chosen so as to design a good classification.
Model selection procedures introduced here are penalized conditional classification likelihood criteria: the selected number of classes $\hat K$ minimizes, over $1 \le K \le K_M$, the sum of $-\log L_{cc}(\hat\theta^{MLccE}_K)$ and a penalty $\mathrm{pen}(K)$. Most results are stated for a general contrast $\gamma$ and any family of models $\{M_K\}_{1 \le K \le K_M}$, and then applied to $-\log L_{cc}$ and the family of Gaussian mixture models $\{M_K\}_{1 \le K \le K_M}$ introduced in Section 1.1.
In Section 4.1, the consistency of such a model selection procedure ("identification" point of view) is proved for a class of penalties. Sufficient conditions are given in the general Theorem 4.1, which is applied in Theorem 4.2 to the framework we are interested in. The heaviest condition of Theorem 4.1, (B4), may be guaranteed under regularity and (weak) identifiability assumptions, and is discussed in Section 4.2. Our approach is inspired by the work of Massart (2007) and is the first step towards non-asymptotic results.
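A penalized criterion of this kind can be sketched as follows. This is a minimal illustration under assumptions: the criterion is taken in the unnormalized form "minimized contrast plus penalty", the BIC-like penalty $\frac{\log n}{2} D_K$ is just one possible choice, and the function name, contrast values and dimensions are all hypothetical.

```python
import math


def select_number_of_classes(min_contrast, penalty):
    """Return (K_hat, criteria) where K_hat minimizes contrast + penalty.

    min_contrast: dict mapping K to the minimized contrast value, here
                  standing for -log L_cc at the MLccE in model M_K.
    penalty:      function mapping K to pen(K).
    """
    criteria = {K: value + penalty(K) for K, value in min_contrast.items()}
    # Ties are broken in favor of the smallest K.
    k_hat = min(sorted(criteria), key=lambda K: criteria[K])
    return k_hat, criteria


# Hypothetical minimized contrasts and model dimensions, n = 200.
n = 200
min_contrast = {1: 420.0, 2: 310.0, 3: 308.0, 4: 307.5}
dims = {1: 2, 2: 5, 3: 8, 4: 11}
k_hat, criteria = select_number_of_classes(
    min_contrast, lambda K: 0.5 * math.log(n) * dims[K]
)
print(k_hat)
```

In this made-up example the contrast barely decreases beyond two components, so the penalty makes the two-component model the selected one.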

Consistent Penalized Criteria
Assume that there exists a $K_0$ from which the bias of the models is stationary: $M_{K_0}$ is the "best" model. Remark that this property should hold in most mixture frameworks, notably if the models were not constrained and hence nested. Under this assumption, a model selection procedure is expected to asymptotically recover $K_0$, i.e. to be consistent. This is an identification aim (see McQuarrie and Tsai, 1998, Chapter 1). It would be disastrous to select a model which does not (almost) minimize the bias. It is moreover assumed that the model $M_{K_0}$ contains all the interesting information (typically, the structure of the classes). Let us stress that the "true" number of components of $f^\wp$ is not directly of concern: in particular, it is not assumed that it equals $K_0$, and it is not even assumed to be defined ($f^\wp$ does not have to be a Gaussian mixture). $K_0$ is the best choice from the particular point of view introduced by using the $\log L_{cc}$ contrast, which is neither density estimation nor identification of the "true" number of components.
Assumption (B4) is the heaviest assumption. Section 4.2 is devoted to deriving sufficient conditions for it to hold. This will justify Theorem 4.2.

Theorem 4.2. $(M_K)_{1 \le K \le K_M}$ Gaussian mixture models with compact parameter space $\Theta_K$ and $\Theta^0$ [...] Then [...]

If $\Theta_K$ is convex, M and M can be defined as suprema over $\Theta_K$ instead of $\Theta^O_K$ and there is no need to introduce the sets $\Theta^O_K$. The new "identifiability" assumption (C1) is reasonable: as expected, the label switching phenomenon is no problem here. But for the identification point of view to make sense, it is necessary that a single value of the contrast function $x \mapsto \gamma(\theta; x)$ minimizes the loss. Remark that in the standard likelihood framework, this holds at least if any model contains the sample distribution, since it is the unique Kullback-Leibler divergence minimizer. Obviously, several parameter values, perhaps in different models, may represent it, besides the label switching. We do not know of any such result with the $-\log L_{cc}$ contrast and hypothesize that the assumption holds.
The assumption about the nonsingularity of $I_{\theta^0}$ is unpleasant, since it is hard to guarantee. Hopefully, it could be weakened. The result of Massart (2007) (Theorem 7.11) which inspires this one, and which is available in a standard likelihood context, does not require such an assumption, since it does not rely on the study of this link between the contrast and the parameters but on a clever choice of the involved distances (Hellinger distances), and on particular properties of the log function. However, this is a usual assumption (see Redner and Walker, 1984, or below). Moreover, Massart (2007) does not require the contrast (i.e. the likelihood) to be bounded, as we have to. Remark however that the application of his Lemma 7.23 to obtain a genuine oracle inequality involves an assumption similar to the boundedness of the contrast. It thus seems reasonable that the assumptions about M and M (the last is much milder than the former) be necessary. They are typically ensured if either the contrast is bounded or the support of $f^\wp$ is bounded.
The conditions about the penalty form are analogous to those of Nishii (1988) or Keribin (2000), which are both derived in the standard maximum likelihood framework. Like those of Keribin (2000), they can be regarded as generalizing those of Nishii (1988) when the considered models are Gaussian mixture models. Indeed, Nishii (1988) considers penalties of the form $c_n D_K$ and proves the model selection procedure to be weakly consistent if $c_n / n \to 0$ and $c_n \to \infty$. Note that Nishii (1988) assumes the parameter space to be convex. He moreover notably assumes that $\Theta^0_K = \{\theta^0_K\}$ and that the counterpart of $I_{\theta^0_K}$ is nonsingular, together with other regularity assumptions. Those results are not particularly designed for mixture models. Instead, as we do, Keribin (2000) considers general penalty forms and proves the procedure to be consistent if $\mathrm{pen}(K)/n \to 0$, $\mathrm{pen}(K) \to \infty$ and $\liminf \mathrm{pen}(K)/\mathrm{pen}(K') > 1$ if $K > K'$. These conditions are equivalent to Nishii's if $\mathrm{pen}(K) = c_n D_K$. In a general mixture model framework, she assumes the model family to be well-specified, the same notion of identifiability as we do, and a condition about $I_{\theta^0_K}$ which does not seem to be directly comparable to ours but which is roughly of the same flavor. It might be milder. Those assumptions are proved to hold with the standard likelihood contrast for Gaussian mixture models with lower bounded, spherical covariance matrices which are the same for all components, and if the means belong to a compact set. Our conditions about the penalty are a little weaker than Keribin's, but they still are quite analogous. Moreover, as compared to those results, we notably have to keep the proportions away from zero. This is necessary because the entropy term must be handled. It does not seem easy to extend the methods used by Keribin (2000) to our framework.
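The two conditions on a penalty of the form $c_n D_K$, namely $c_n / n \to 0$ and $c_n \to \infty$, can be illustrated numerically. The sketch below tabulates $c_n$ and $c_n / n$ along growing sample sizes for three hypothetical weight sequences: a BIC-like weight $\frac{1}{2} \log n$, a constant weight and a linear weight. A finite table is of course not a proof of a limit; it only shows the trend.

```python
import math


def weight_sequences(c, sizes):
    """Return the pairs (c(n), c(n) / n) along the given sample sizes."""
    return [(c(n), c(n) / n) for n in sizes]


sizes = [10 ** p for p in range(2, 9)]

# BIC-like weight: grows to infinity, but slower than n.
bic_like = weight_sequences(lambda n: 0.5 * math.log(n), sizes)
# Constant weight: c_n / n -> 0 but c_n does not diverge.
constant = weight_sequences(lambda n: 1.0, sizes)
# Linear weight: c_n diverges but c_n / n does not vanish.
linear = weight_sequences(lambda n: float(n), sizes)

for n, (c_n, ratio) in zip(sizes, bic_like):
    print(n, round(c_n, 3), ratio)
```

For penalties of this form the ratio $\mathrm{pen}(K)/\mathrm{pen}(K')$ equals $D_K / D_{K'}$, which exceeds 1 whenever the dimension increases with $K$.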
The strong version of Theorem 4.2, which would state the almost sure convergence of $\hat K$ to $K_0$, would then probably involve slightly heavier penalties, as Nishii (1988) and Keribin (2000) proved in their respective frameworks: both had to assume that $\mathrm{pen}(K)/\log\log n \to \infty$. Theorem 4.2 is a direct consequence of Theorem 4.1, Lemma 4.1 and Theorem 3.2, which can be applied under those assumptions, and of Corollary 4.2 below and the discussion about its assumptions along the lines of Section 4.2.
Sketch of proof. The proof relies on results of Massart (2007) and on the evaluation of the bracketing entropy of the class of functions at hand. Lemma 3.3 provides a local control of the entropy and hence, through Theorem 6.8 in Massart (2007), a control of the supremum of $S_n(\gamma(\theta^0) - \gamma(\theta))$ for $\|\theta - \theta^0\|_\infty^2 < \sigma$, with respect to $\sigma$. The "peeling" Lemma 4.23 in Massart (2007) then makes it possible to turn this local control into a fine global control of $\sup_{\theta \in \Theta} \frac{S_n(\gamma(\theta^0) - \gamma(\theta))}{\|\theta - \theta^0\|^2 + \beta^2}$, for any $\beta$. This control in expectation, which can be derived conditionally on any event $A$, yields a control in probability thanks to Lemma 2.4 in Massart (2007), which can be thought of as an application of Markov's inequality.
Corollary 4.1. Same assumptions as Lemma 4.2, but the convexity of $\Theta$. Besides assume that The constant involved in $O_P(1)$ depends on $D$, $M_\infty$, $M_2$ and $I_{\theta^0}$.
This is a direct consequence of Lemma 4.2: it suffices to choose $\beta$ well. The dependence of $O_P(1)$ on $D$, $M_\infty$, $M_2$ and $I_{\theta^0}$ is not a problem since we aim at deriving an asymptotic result: what matters is the order of $\|\theta - \theta^0\|_\infty^2$ with respect to $n$ when the model is fixed.
The assumption that $I_{\theta^0}$ is nonsingular plays a role analogous to that of Assumption (A2) in Theorem 3.1: it ensures that $E_{f^\wp}[\gamma(\theta)]$ cannot be close to $E_{f^\wp}[\gamma(\theta^0)]$ if $\theta$ is not close to $\theta^0$. But this stronger assumption is necessary to strengthen the conclusion: the rate of the relation between : this approach cannot be applied without this, admittedly unpleasant, assumption. Perhaps another approach (with distances involving not the parameters but directly the contrast values) might make it possible to avoid it, as Massart (2007) did in the likelihood framework.
This last corollary states conditions under which assumption (B4) of Theorem 4.1 is ensured.

A New Light on ICL
The previous section suggests links between penalized model selection criteria based on the standard likelihood on the one hand, and on the conditional classification likelihood we defined on the other hand. Indeed, penalties of the same form as those given by Nishii (1988) or Keribin (2000) for the standard likelihood are proved to be "consistent" in our framework. Therefore, by analogy with the standard likelihood framework, it is expected that penalties proportional to $D_K$ are suited to an efficiency point of view (think of AIC), and that penalties proportional to $D_K \log n$ are optimal for an identification purpose (think of BIC). This possibility of deriving an identification procedure from an efficient procedure through a $\log n$ factor is noted, for example, by Arlot (2007).
Let us then consider, by analogy with BIC, the penalized criterion The point is that we almost recover ICL (replace $\hat\theta^{\mathrm{MLE}}_K$ by $\hat\theta^{\mathrm{MLccE}}_K$ in (2)), which may then be regarded as an approximation of this $L_{cc}$-ICL criterion. The corresponding penalty is $\frac{\log n}{2} D_K$, and the derivation of $L_{cc}$-ICL illustrates that the entropy should not be considered as part of the penalty. This notably explains why ICL does not select the same number of components as BIC or any consistent criterion in the standard likelihood framework, even asymptotically. Actually, it should not be expected to do so.
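The following minimal sketch may help to see where the entropy enters. It is not the estimation procedure of this paper: the parameters are fixed by hand rather than estimated (no MLE or MLccE is computed), the mixture is one-dimensional, and the function names `mixture_terms` and `penalized_criterion` are hypothetical. It assumes the relation $\log L_{cc} = \log L - \mathrm{ENT}$ and a "smaller is better" sign convention for the criteria.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at each point of x (1D)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_terms(x, props, means, variances):
    """Return (log L, ENT, log L_cc) of a 1D Gaussian mixture at given parameters."""
    x = np.asarray(x, dtype=float)
    # weighted[i, k] = pi_k * phi(x_i; mu_k, sigma_k^2)
    weighted = np.stack(
        [p * gaussian_pdf(x, m, v) for p, m, v in zip(props, means, variances)],
        axis=1,
    )
    mix = weighted.sum(axis=1)
    log_lik = np.log(mix).sum()
    tau = weighted / mix[:, None]  # conditional membership probabilities
    # ENT = -sum_i sum_k tau_ik log tau_ik, with the convention 0 log 0 = 0
    safe_tau = np.where(tau > 0, tau, 1.0)
    ent = -(tau * np.log(safe_tau)).sum()
    return log_lik, ent, log_lik - ent

def penalized_criterion(log_contrast, dim, n):
    """-log(contrast) + (log n / 2) * dim; smaller is better (assumed convention)."""
    return -log_contrast + 0.5 * np.log(n) * dim

# Toy data: two well separated groups (illustrative values, not from the paper).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
log_lik, ent, log_lcc = mixture_terms(
    x, props=[0.5, 0.5], means=[-4.0, 4.0], variances=[1.0, 1.0]
)
dim = 5  # 1 free proportion + 2 means + 2 variances
bic_like = penalized_criterion(log_lik, dim, len(x))      # likelihood contrast
lcc_like = penalized_criterion(log_lcc, dim, len(x))      # L_cc contrast
```

At a common parameter value the two criteria differ exactly by the entropy term, which sits in the contrast and not in the penalty: `lcc_like - bic_like` equals `ent`.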
When $\hat\theta^{\mathrm{MLccE}}_K$ differs from $\hat\theta^{\mathrm{MLE}}_K$, the former provides more separated clusters. The compromise between the Gaussian component and the cluster viewpoints is achieved with $\hat\theta^{\mathrm{MLccE}}_K$ from the very estimation step. The user is provided with a solution which aims at this compromise for each number of classes $K$. However, the number of classes selected through $L_{cc}$-ICL seldom differs from the one selected by ICL in simulations (see Baudry, 2009, Chapter 4).
Finally, $L_{cc}$-ICL is quite close to ICL and makes it possible to better understand the concepts underlying ICL. ICL remains attractive though, notably because it is easier to implement than $L_{cc}$-ICL.

Discussion
Two families of criteria are distinguished in the clustering framework: it is shown that the purpose of ICL is of a different nature from that of BIC or AIC. The identification theory for the criteria based on the conditional classification likelihood is, not surprisingly, very similar to the one for the standard likelihood. A major interest of the newly introduced estimator and criteria is to better understand the ICL criterion and the underlying notion of class. This is neither a simple notion of cluster (as, for example, for the k-means procedure) nor a pure notion of "component" (as underlying the MLE/BIC approach), but a compromise between both. ICL leads to discovering classes matching a subtle combination of the notions of well separated, compact clusters and of (Gaussian) mixture components. It thus enjoys the flexibility and modeling possibilities of the model-based clustering approach, but does not break the expected notion of cluster. Better understanding the ICL criterion now means better understanding the newly involved contrast $L_{cc}$.
The choice of the mixture components involved must be handled with care in this framework, since it determines the cluster shape underlying the study. Several forms of Gaussian mixtures may be involved: for example, spherical and general models may be compared, or models with free proportions may be compared with models with equal proportions.
Besides, how the complexity of the models should be measured when several kinds of models are compared deserves further study. The dimension of the model as a parametric space works for the reported theoretical results. But we are not completely convinced that it is the finest measure of the complexity of Gaussian mixture models. As a matter of fact, this simple parametric point of view amounts to considering that all parameters play an analogous role. This is not really natural.
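To make the parametric dimension concrete, here is a minimal sketch of one common way of counting the free parameters of a Gaussian mixture. The function name `mixture_dimension` and its arguments are hypothetical, and the count assumes that covariance parameters are free for each component (no constraint shared across components).

```python
def mixture_dimension(K, d, covariance="general", free_proportions=True):
    """Number of free parameters of a K-component Gaussian mixture in R^d.

    Assumes each component has its own covariance parameters.
    """
    per_component_cov = {
        "spherical": 1,               # one variance per component
        "diagonal": d,                # d variances per component
        "general": d * (d + 1) // 2,  # full symmetric covariance matrix
    }[covariance]
    n_means = K * d
    # Proportions sum to one, hence K - 1 free values; none if they are equal.
    n_props = K - 1 if free_proportions else 0
    return n_means + K * per_component_cov + n_props

# Diagonal model with equal proportions, d = 2 and K = 2.
dim_diag_equal = mixture_dimension(2, 2, "diagonal", free_proportions=False)
# Spherical versus general models with free proportions, d = 2 and K = 3.
dim_spherical = mixture_dimension(3, 2, "spherical")
dim_general = mixture_dimension(3, 2, "general")
```

With this count, every parameter contributes one unit to the dimension, whether it is a mean, a variance or a proportion, which is exactly the "analogous role" questioned above.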
A further theoretical step would be to derive non-asymptotic results and oracle inequalities. This may give more precise insights about the best penalty shape to use, and then justify the use of the slope heuristics of Birgé and Massart (2007) (see also Baudry et al., 2011, or Baudry, 2009, for simulations and discussions on this topic).
A practical challenge is to provide efficient optimization algorithms. Some work has already been done in this direction: see Baudry et al. (2008) and Baudry (2009, Section 5.1). But these algorithms need to be improved to be more reliable, and above all to run much faster, which would obviously be a condition for a widespread practical use of the new contrast.
A possibility to make this contrast more flexible would be to assign different weights to the log likelihood and the entropy: $\log L_{cc\alpha} = \alpha \log L + (1 - \alpha)\,\mathrm{ENT}$, with $\alpha \in [0, 1]$. This would make it possible to tune how important the assignment confidence is with respect to the Gaussian fit. The difficulty would then be to choose $\alpha$. A first idea that comes to mind is to calibrate $\alpha$ from simulations of situations in which the user knows which solution is expected.
Proof of Lemma 3.1. Let ε > 0, and Θ ⊂ Θ, with Θ bounded. Let $\Theta_\varepsilon$ be a grid in Θ which "ε-covers" Θ in any dimension with step ε. $\Theta_\varepsilon$ is for example $\Theta^1_\varepsilon \times \cdots \times \Theta^D_\varepsilon$ with This is always possible since Θ is convex. For the sake of simplicity, it is assumed, without loss of generality, that $\Theta_\varepsilon \subset \Theta$. With the $\|\cdot\|_\infty$ norm, the step of the grid $\Theta_\varepsilon$ is the same as the step over each dimension, ε: And the cardinality of $\Theta_\varepsilon$ is at most Now, let $\theta_1$ and $\theta_2$ in Θ and $x \in \mathbb{R}^d$.
Example 6.1 (Diagonal Gaussian Mixture Model Parameter Space is Convex). Following Celeux and Govaert (1995), we write $[p\lambda_k B_k]$ for the model of Gaussian mixtures with diagonal covariance matrices and equal mixing proportions. To keep the notation simple, let us consider the case $d = 2$ and $K = 2$ ($d = 1$ or $K = 1$ are obviously particular cases!). A natural parametrization of this model (whose dimension is 8) is Then $[p\lambda_k B_k] = \varphi(\mathbb{R}^4 \times (\mathbb{R}_+^*)^4)$ and the parameter space $\mathbb{R}^4 \times (\mathbb{R}_+^*)^4$ is convex.
Example 6.2 (The Same Model with Equal Volumes is Convex, too...).
It is first proved that $\hat K$ does not asymptotically "underestimate" $K_0$. (B2) and (B3) ($\mathrm{pen}(K) = o_P(1)$), with large probability and for n large enough: Then crit(K) = γ n (θ K ) + pen( +ε. Then, with large probability and for n large enough,K = K. Let now K ∈ K (hence $K > K_0$). This part of the result is more involved than the first one but, at this stage, it is not more difficult to derive: all the difficulty is hidden in the strong assumption (B4). Indeed, it implies that there exists $V > 0$ such that, for n large enough and with large probability, $n\,(\gamma_n(\hat\theta_{K_0}) - \gamma_n(\hat\theta_K)) \le V$.