Clustering and variable selection for categorical multivariate data

This article investigates unsupervised classification techniques for categorical multivariate data. The study employs multivariate multinomial mixture modeling, which is a type of model particularly applicable to multilocus genotypic data. A model selection procedure is used to simultaneously select the number of components and the relevant variables. A non-asymptotic oracle inequality is obtained, leading to the proposal of a new penalized maximum likelihood criterion. The selected model proves to be asymptotically consistent under weak assumptions on the true probability underlying the observations. The main theoretical result obtained in this study suggests a penalty function defined to within a multiplicative parameter. In practice, the data-driven calibration of the penalty function is made possible by slope heuristics. Based on simulated data, this procedure is found to improve the performance of the selection procedure with respect to classical criteria such as BIC and AIC. The new criterion provides an answer to the question "Which criterion for which sample size?" Examples of real dataset applications are also provided.


Introduction
This article is concerned with the unsupervised classification of categorical multivariate data. Model-based clustering, which uses finite mixture models, is an intuitive and rigorous framework for unsupervised classification. However, there is no clear consensus on the way to gather individuals in general: on the basis of well-separated clusters, or on the basis of the components of the mixture distribution? We refer to Baudry [2009] for a general discussion on this topic. Finite mixture models are especially well suited when each class is assumed to be characterized by a set of parameters, for instance in population genetics: in this case the populations that biologists look for are characterized by their allelic frequencies and a genetic equilibrium; this corresponds to the notion of a population as a reproduction unit, or a group of individuals sharing the same genetic structure. Finite mixture models are also known in the literature as latent class models.
The observations are n independent realizations of a random vector whose number L of coordinates (variables) may be large. The individuals of the sample are clustered into a certain unknown number K of populations on the basis of the frequencies of occurrence of the possible states of each variable. It may happen that only a subset S of the variables is relevant for clustering purposes, the others being just noise. Thus, in addition to the number K of populations and the frequencies of the different states, we are also interested in the subset S, which may be significant in the interpretation of the results.
A number of clustering methods for categorical multivariate data have been proposed in recent years in the context of genomics (see [Chen et al., 2006, Corander et al., 2008, Pritchard et al., 2000]). But the problem of variable selection for clustering with such data was first addressed in [Toussile and Gassiat, 2009], where the question is treated as a model selection problem in a density estimation framework. First the components of a finite mixture distribution are identified, then the individuals are clustered into these components using the Maximum A Posteriori (MAP) rule. Using simulated data, that article shows that the variable selection procedure based on the Bayesian Information Criterion (BIC) significantly improves clustering and prediction capacities in our framework. It also gives a theoretical consistency result: when the true density P_0 underlying the observations belongs to one of the competing models, there exists a smallest model M_(K0,S0) containing P_0; further, BIC-type criteria select M_(K0,S0) with probability tending to one as the sample size n goes to infinity. This consistency approach requires large sample sizes, which may be difficult to obtain. However, knowledge of the true model, aside from the frequencies of the states, is important information for the interpretation of the results.
In the present paper we adopt an oracle approach. We do not aim at choosing the true model underlying the data, even though our procedure also performs well for that task. The criteria are rather designed to minimize some risk function of the estimated density with respect to the true density. In this context simpler models can be preferred to M_(K0,S0), in which too many parameters can yield estimators that overfit the data. In fact, there is no need to assume that P_0 belongs to one of the competing models M_(K,S).
BIC relies on a strong asymptotic assumption, and can thus require large sample sizes to reach its asymptotic behavior; in practice BIC is known to overpenalize, and therefore to select models that are too small for small or medium values of n (see [Nadif and Govaert, 1998]). On the contrary, Akaike's Information Criterion (AIC) is known to underpenalize, and to select models that are too large for large and medium values of n. We would like a criterion which combines the virtues of both AIC and BIC, and performs well for different values of n.
In this article, we propose a non asymptotic penalized criterion based on the metric entropy theory of Massart (in particular [Massart, 2007]). It leads to a non asymptotic oracle inequality, which compares the risk of the selected estimator to the risk of the estimator associated with the unknown best model (see Theorem 1 below). There exists a large literature on model selection via penalization from a non asymptotic perspective. This literature is still developing with the appearance of sophisticated probabilistic tools such as concentration and deviation inequalities (see [Massart, 2007] and the references therein). In mixture models the non asymptotic approach is very recent, the first related work being [Maugis and Michel, 2009] for the Gaussian mixture model.
However, the obtained penalty function presents drawbacks: it depends on a multiplicative constant for which sharp upper bounds are not available, and in practice it leads to an overpenalization, even worse than BIC. Therefore our theoretical result mainly suggests the shape of the penalty function, pen_n(m) = λ D_m / n, where D_m is the dimension of model m and λ an unknown parameter depending on the sample size and the complexity of the collection of models under competition, which has to be calibrated. A calibration of λ with the so-called slope heuristics has been proposed in [Birgé and Massart, 2007] in such a case. We propose a modified version of this calibration method, based on a sliding window. The resulting criterion does not require an ad-hoc choice of the penalty parameters and adapts automatically to the data. Although the full theoretical validation of slope heuristics is available only in the Gaussian homoscedastic and heteroscedastic regression frameworks [Arlot and Massart, 2009, Birgé and Massart, 2007], they have been implemented in several other frameworks (see [Lebarbier, 2002, Maugis and Michel, 2008, Verzelen, 2009, Villers, 2007] for applications in density estimation, genomics, etc.). The simulations performed in Subsection 4.3 illustrate that our criterion behaves well with respect to more classical criteria such as BIC and AIC, both to estimate the density, even when n is relatively small, and to retrieve the true model. It can be seen as a representative of the family of General Information Criteria (see for instance [Bai et al., 1999], whose criterion is less intuitive but presents some analogy with the slope heuristics).
The paper is organized as follows. Section 2 is devoted to the presentation of the mixture models framework and of the model selection paradigm. In Section 3 we state and prove our main result, the oracle inequality. Section 4 is devoted to the practical aspects of our procedure, which has been implemented in the stand-alone software MixMoGenD (Mixture Model using Genotypic Data) (see [Toussile and Gassiat, 2009]). Results on simulated experiments are also presented: we compare our proposed criterion to the classical BIC and AIC, from the points of view of both selection of the true model and density estimation. Finally, the Appendices contain several technical results used in the main analysis.

Framework
We suppose we deal with independent and identically distributed (iid) realizations of a multivariate random vector X = (X^l)_{1≤l≤L}. We consider two main settings:
1. Each X^l is a multinomial variable taking values in {1, . . ., A_l}.
2. Each X^l consists of an unordered pair {X^{l,1}, X^{l,2}} of two (possibly equal) qualitative variables taking their values in the same set {1, . . ., A_l}.
All along this article, these two settings will be referred to as Case 1 and Case 2.
In both cases, the numbers A_l of allowed states are supposed to be known and to satisfy A_l ≥ 2. The first case is a usual latent class model with various applications (psychometrics, marketing, credit scoring, genomics, etc.), while the second is more specific to genotypic data. In this context X = (X^l)_{1≤l≤L} represents the genotype of an individual at L loci of its DNA. Case 1 corresponds to haploid organisms, with a single representative of each chromosome; at any locus l a single allele X^l is measured. Case 2 corresponds to diploid organisms, with two representatives of each chromosome; at any locus l, two alleles X^{l,1} and X^{l,2} are observed together.
We consider model-based clustering, which means that the sample comes from a finite mixture of an unknown number K of populations (clusters), each being characterized by a set of frequencies of the states. Denote by Z the (unobserved) population an individual comes from. The variable Z takes its values in the set {1, . . ., K} of the labels of the different clusters. Its distribution is given by the vector π = (π_k)_{1≤k≤K}, where π_k = P(Z = k). Conditionally on Z, the variables X^1, . . ., X^L are supposed to be independent. In Case 2, the alleles X^{l,1} and X^{l,2} at the l-th locus are also supposed to be independent conditionally on Z. These two assumptions are what biologists respectively call Linkage Equilibrium (LE) and Hardy-Weinberg Equilibrium (HWE). Under these assumptions, the probability distribution of a genotype x = (x^l)_{1≤l≤L} in a population k factorizes over the loci, with α_{k,l,j} denoting the probability of state j associated with variable X^l in population k. The mixing proportions π_k and the probabilities α_{k,l,j} will be treated as parameters.
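The corresponding display equations appear to have been lost from this version; the following is a plausible reconstruction of the conditional distributions under LE and HWE, consistent with the notation above (not a verbatim copy of the original equations):

```latex
% Case 1 (one allele per locus; LE gives a product over loci):
P\bigl(X = x \mid Z = k\bigr) \;=\; \prod_{l=1}^{L} \alpha_{k,l,x^{l}} .
% Case 2 (two alleles per locus; LE + HWE, the factor 2 accounting for
% the unordered heterozygous pair x^{l} = \{x^{l,1}, x^{l,2}\}):
P\bigl(X = x \mid Z = k\bigr) \;=\; \prod_{l=1}^{L}
  \Bigl( 2 - \mathbf{1}_{x^{l,1} = x^{l,2}} \Bigr)\,
  \alpha_{k,l,x^{l,1}}\, \alpha_{k,l,x^{l,2}} .
```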
In the context of genomics, the Hardy-Weinberg and linkage equilibria are based on several simplifying assumptions that may seem unrealistic; however, they have proven useful in describing many population genetic attributes and serve as a base model in the development of more realistic models of microevolution. Further, the choice of estimators derived from the maximum likelihood estimator (MLE) responds to the wish of biologists to group the sample into clusters minimizing the Hardy-Weinberg and linkage disequilibria, and this brings some robustness to our modeling (see [Latch et al., 2006] and references therein).
Going deeper, the oracle approach emphasizes that we should often prefer simplified, misspecified models. This introduces a modeling bias in order to get more robust estimators and classifiers, and in the end a smaller estimation error. It also legitimizes the following simplification.
It may happen that the structure of interest is contained in only a subset S of the L available variables, the others being useless or even harmful for detecting a reasonable clustering into statistically different populations. For the variables in S, the frequencies of the states differ between at least two populations: we will call them clustering variables. For the other variables, the states are supposed to be identically distributed across the clusters. This approximation is theoretically justified by the oracle heuristics, which is able to take advantage of the misspecification; the simulations performed in [Toussile and Gassiat, 2009] illustrate its benefits. We denote by β_{l,j} the frequency of state j associated with variable X^l in the whole population. Obviously, S = ∅ if K = 1; otherwise S belongs to P*(L), the set of all non-empty subsets of {1, . . ., L}. Summarizing all these assumptions, we can write down the likelihood of an observation x = (x^l)_{1≤l≤L}, where θ = (π, α, β) is a multidimensional parameter. We then consider the collection of all parametric models M_(K,S) with (K, S) ∈ M := {(1, ∅)} ∪ (N\{0, 1}) × P*(L). To alleviate notation, we will often use the single index m ∈ M instead of (K, S).
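The displayed likelihood formulas are missing from this version. A plausible reconstruction, consistent with the notation above (α for the clustering variables in S, β for the others), is:

```latex
% Case 1; the Case 2 expression is analogous, with each per-locus factor
% replaced by the corresponding HWE genotype probability:
P_{(K,S,\theta)}(x) \;=\; \sum_{k=1}^{K} \pi_k
  \prod_{l \in S} \alpha_{k,l,x^{l}}
  \prod_{l \notin S} \beta_{l,x^{l}},
\qquad \theta = (\pi, \alpha, \beta).
```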
Each model M_(K,S) corresponds to a particular structural situation with K clusters and a subset S of clustering variables. Inferring K and S becomes a model selection problem in a density estimation framework. It also leads to a clustering of the data, via the estimate of the parameter θ_(K,S) and the prediction of the class z of an observation x by the MAP rule, which assigns x to the class k maximizing the conditional probability of Z = k given X = x.
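As an illustration, the MAP assignment can be sketched as follows. This is a minimal sketch for Case 1 with hypothetical array conventions (0-based states, all variables treated as clustering variables), not the MixMoGenD implementation:

```python
import numpy as np

def map_classify(x, pi, alpha):
    """MAP classification of one observation x (Case 1 sketch).

    x     : length-L sequence of observed states (0-based).
    pi    : length-K array of mixing proportions.
    alpha : (K, L, A) array, alpha[k, l, j] = P(X^l = j | Z = k).
    Returns the cluster k maximizing P(Z = k | X = x).
    """
    K = len(pi)
    # log posterior up to an additive constant:
    # log pi_k + sum_l log alpha[k, l, x_l]
    log_post = np.log(pi) + np.array(
        [sum(np.log(alpha[k, l, x[l]]) for l in range(len(x))) for k in range(K)]
    )
    return int(np.argmax(log_post))
```

The normalizing constant P(X = x) is the same for all k, so it can be dropped from the argmax.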

Model selection via penalization
A common method to solve model selection problems consists in the minimization of a penalized maximum likelihood criterion. In each model M_(K,S), consider the maximum likelihood estimator (MLE) P̂_(K,S) = P_(K,S,θ̂), which minimizes the log-likelihood contrast γ_n(P) = −(1/n) Σ_{i=1}^n ln P(X_i), where X_i describes the individual i in the sample. Then a data-driven model M_(K̂n,Ŝn) is chosen, where (K̂n, Ŝn) minimizes a penalized maximum likelihood criterion of the form crit(K, S) = γ_n(P̂_(K,S)) + pen_n(K, S), where pen_n : M → R_+ is the penalty function. Finally, the selected estimator is P̂_(K̂n,Ŝn).
The penalty function is designed to avoid overfitting. Classical penalties, such as the ones used in the AIC and BIC criteria, are based on the dimension of the model. In the following, we will refer to the number of free parameters as the dimension D_(K,S) of the model M_(K,S); the penalty functions of AIC and BIC are defined accordingly. Our work is centered on the MLE P̂_(K,S), but this estimator presents a drawback. For the sake of density estimation, we would like to use the Kullback-Leibler divergence KL as a risk function to measure the quality of an estimator. Unfortunately, when a state is not present in the sample, the MLE assigns it a zero probability. As a consequence, the Kullback risk E_{P_0}[KL(P_0, P̂_(K,S))] is infinite.
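The displayed definitions (Eq. (7)) seem to be missing here. With the contrast γ_n normalized by the sample size, the standard forms, which also match the (ln n)/2 leading term discussed after Theorem 1, would be the following reconstruction (the dimension count is a plausible tally of the free parameters, not a verbatim copy):

```latex
% AIC and BIC penalties for the normalized contrast gamma_n:
\operatorname{pen}_n^{\mathrm{AIC}}(K,S) \;=\; \frac{D_{(K,S)}}{n},
\qquad
\operatorname{pen}_n^{\mathrm{BIC}}(K,S) \;=\; \frac{\ln n}{2}\cdot\frac{D_{(K,S)}}{n},
% with free parameters: mixing weights, clustering frequencies,
% and shared frequencies for the non-clustering variables:
D_{(K,S)} \;=\; (K-1) \;+\; K \sum_{l \in S} (A_l - 1) \;+\; \sum_{l \notin S} (A_l - 1).
```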
The Hellinger distance offers an alternative to the Kullback-Leibler divergence. Consider two probability distributions P and Q, admitting respectively s and t as density functions with respect to a common σ-finite measure µ. We call Hellinger distance between P and Q the quantity h(P, Q) given by the L²(µ) distance between √s and √t. Let (K*, S*) be a minimizer in (K, S) of the Hellinger risk of the MLE. The density P̂_(K*,S*) is called the oracle for the Hellinger risk. It is not an estimator, since it depends on the true density P_0. However, it can be used as a benchmark to quantify the quality of our model selection procedure: in the simulations of Paragraph 4.3.2, we compare the Hellinger risk of the selected estimator P̂_(K̂n,Ŝn) to the oracle risk.
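For discrete distributions the Hellinger distance is a finite sum and easy to compute. A minimal sketch, assuming the convention h²(P, Q) = Σ_j (√p_j − √q_j)² (so that h² is bounded by 2, as used after Theorem 1):

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance between two discrete distributions,
    given as probability vectors: h^2(P, Q) = sum_j (sqrt(p_j) - sqrt(q_j))^2.
    With this convention, 0 <= h^2 <= 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Unlike the Kullback-Leibler divergence, this quantity stays finite when one distribution puts zero mass on a state the other charges, which is exactly the situation created by unobserved states in the MLE.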
New criteria and non asymptotic risk bounds

Main result
Our main theorem provides an oracle inequality for both Case 1 and Case 2. It links the Hellinger risk of the selected estimator to the Kullback-Leibler divergence KL between the true density and each model in the collection. Unlike KL, which is not a metric, the Hellinger distance h makes it possible to take advantage of the metric properties (metric entropy) of the models.
Theorem 1. We consider the collection M of models defined above, and a corresponding collection of ρ-MLEs (P̂_(K,S))_{(K,S)∈M}, which means that for every (K, S) ∈ M, γ_n(P̂_(K,S)) ≤ inf_{P ∈ M_(K,S)} γ_n(P) + ρ. Let A_max = sup_{1≤l≤L} A_l, and let ξ be defined as in the proof. There exist absolute constants κ and C such that, whenever pen_n(K, S) satisfies the lower bound (10) for every (K, S) ∈ M, the model M_(K̂n,Ŝn), where (K̂n, Ŝn) minimizes crit(K, S) = γ_n(P̂_(K,S)) + pen_n(K, S) over M, exists; moreover, whatever the underlying probability P_0, the Hellinger risk of P̂_(K̂n,Ŝn) is bounded by C times inf_{(K,S)∈M} {KL(P_0, M_(K,S)) + pen_n(K, S)}, up to a remainder term, where, for every (K, S) ∈ M, KL(P_0, M_(K,S)) = inf_{Q∈M_(K,S)} KL(P_0, Q).
The condition ξ < 1 is used in the proof to avoid more complicated calculations.In practice, ξ is very likely to be smaller than 1 for L not too small.
Note that as soon as n ≥ 2L, (10) simplifies. The leading term for large n is κ (ln n)/2 · D_(K,S)/n, which is a multiple of the penalty function of BIC. As a consequence, we can apply Theorem 2 from [Toussile and Gassiat, 2009]: when the underlying distribution P_0 belongs to one of the competing models, the smallest model (K_0, S_0) containing P_0 is selected with probability tending to 1 as n goes to infinity. Such a penalty is not surprising in our context: it is in fact very similar to the one obtained in [Maugis and Michel, 2009] in a Gaussian mixture framework.
Sharp estimates of κ are not available. Theorem 1 is too conservative in practice, and leads to an over-penalized criterion which is outperformed by smaller penalties. So it is mainly used to suggest the shape pen_n(K, S) = λ D_(K,S)/n of the penalty function, where λ is a parameter to be chosen depending on n and the collection M, but not on (K, S). Slope heuristics [Arlot and Massart, 2009, Birgé and Massart, 2007] can be used in practice to calibrate λ: this is done in Section 4, where we use change-point detection [Lebarbier, 2002] in relation to slope heuristics. Since h² is upper bounded by 2, the non-asymptotic feature of Theorem 1 is interesting when n is large enough with respect to D_(K,S). However, even for small values of n, the simulations performed in Subsection 4.3 show that the penalized criterion calibrated using the slope heuristics keeps behaving well.

A general tool for model selection
Theorem 1 is obtained from [Massart, 2007, Theorem 7.11]. This result deals with model selection problems by proposing penalty functions related to geometric properties of the models, namely metric entropy with bracketing for the Hellinger distance.
The framework here is the following. We consider some measurable space (A, A), and µ a σ-finite positive measure on A. A collection of models (M_m)_{m∈M} is given, where each model M_m is a set of probability density functions with respect to µ. The definition of h extends to positive functions s and t whose integrals are finite but not necessarily 1: denoting by √s the function defined by √s(x) = √(s(x)), we set h(s, t) = ‖√s − √t‖_2. Let us now recall the definition of metric entropy with bracketing. Consider some collection F of measurable functions on A, and d one of the following metrics on F: h, ‖·‖_1, or ‖·‖_2. A bracket [l, u] is the collection of all measurable functions f such that l ≤ f ≤ u. Its d-diameter is the distance d(u, l). Then, for every positive number ε, we denote by N_[·](ε, F, d) the minimal number of brackets with d-diameter not larger than ε which are needed to cover F. The d-entropy with bracketing of F is defined as the logarithm of N_[·](ε, F, d), and is denoted by H_[·](ε, F, d). We assume that for each model M_m the square root of the entropy with bracketing, ε ↦ √(H_[·](ε, M_m, h)), is integrable at 0. Let us consider some function φ_m on R_+ with the following property: (I) φ_m is nondecreasing, x ↦ φ_m(x)/x is non-increasing on (0, +∞), and for every σ ∈ R_+ and every u ∈ M_m, the entropy integral ∫_0^σ √(H_[·](ε, M_m(u, σ), h)) dε is bounded by φ_m(σ), where M_m(u, σ) = {t ∈ M_m : h(t, u) ≤ σ}. In order to avoid measurability problems, we suppose that for each m ∈ M the following separability condition is verified for M_m: (M) There exists some countable subset M′_m of M_m and a set A′ ⊂ A with µ(A′) = µ(A) such that for every t ∈ M_m, there exists some sequence (t_k)_{k≥1} of elements of M′_m such that for every x ∈ A′, ln(t_k(x)) tends to ln(t(x)) as k tends to infinity.
Theorem 2. Let X_1, . . ., X_n be iid random variables with unknown density s with respect to some positive measure µ. Let {M_m}_{m∈M} be some at most countable collection of models, each fulfilling (M). We consider a corresponding collection of ρ-MLEs (ŝ_m)_{m∈M}. Let {x_m}_{m∈M} be some family of nonnegative numbers such that Σ_{m∈M} e^{−x_m} = Σ < ∞, and for every m ∈ M, considering φ_m with property (I), define σ_m as the unique positive solution of the equation φ_m(σ) = √n σ². Let pen_n : M → R_+ and consider the penalized log-likelihood criterion crit(m) = γ_n(ŝ_m) + pen_n(m). Then, there exist some absolute constants κ and C such that, whenever pen_n(m) ≥ κ (σ_m² + x_m/n) for every m ∈ M, some random variable m̂ minimizing crit over M exists and, moreover, whatever the density s, the risk E[h²(s, ŝ_m̂)] is bounded by C (inf_{m∈M} {KL(s, M_m) + pen_n(m)} + ρ + Σ/n). In Theorem 2, σ_m² plays the role of a variance term for ŝ_m, while the weights x_m take into account the number of models m having the same dimension.

Proof of Theorem 1
In order to apply Theorem 2, we need to compute the metric entropy with bracketing of each model M (K,S) .This is done in the following result, which is proved in Appendix A.
Proposition 1 (Bracketing entropy of a model). Let η : R_+ → R_+ be the increasing convex function defined below. For any choice of K and S, M_(K,S) fulfills (M), and for any ε ∈ (0, 1) the bracketing entropy of M_(K,S) is controlled in terms of C_(K,S), a technical quantity measuring the complexity of the model M_(K,S).
In the next step we establish an expression for φ_m. All following results are proved in Appendix B.
Proposition 2. For any choice of m = (K, S), a function φ_m satisfying property (I) can be defined on (0, η(1)]. We do not define φ_m for σ bigger than η(1), to avoid more complicated expressions. This is why a condition on ξ appears in the following lemma.
Lemma 1. For all n ≥ 1 if ξ < 1, and for n > ξ²K otherwise, the solution σ_m of (12) satisfies σ_m < η(1).
From Proposition 2 we can deduce an upper bound for σ_m, by a reasoning similar to [Maugis and Michel, 2009]. First, σ_m ≤ η(1) entails η^{−1}(σ_m) ≤ 1, and we obtain a lower bound σ_m ≥ σ̲_m. This can be used to get an upper bound on σ_m. Let us now choose the weights x_m: taking them larger than nσ_m² would change the shape of the penalty in Theorem 2, so we define them accordingly. The following lemma shows that this choice is suitable. To express the penalty function we have to lower bound η^{−1}(σ̲_m). This is done in the following lemma.
Lemma 3. Using the preceding notation, η^{−1}(σ̲_m) can be bounded from below as needed. This ends the proof of Theorem 1.

In practice
In real datasets the numbers A_l of possible states of each variable X^l are not necessarily known. The numbers of observed states can be used instead. In fact, the MLE selects a density with zero weight on unobserved states.
Then, in each model M_(K,S), an approximate MLE can be computed thanks to the Expectation-Maximization (EM) algorithm (see [Dempster et al., 1977]).
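As an illustration of this step, here is a minimal EM sketch for Case 1, under simplifying assumptions made only for brevity (a common number A of states for all variables, and no split between clustering and non-clustering variables); it is not the MixMoGenD implementation:

```python
import numpy as np

def em_multinomial_mixture(X, K, A, n_iter=100, seed=0):
    """EM for a K-component mixture of products of multinomials (Case 1 sketch).

    X : (n, L) integer array, X[i, l] in {0, ..., A-1} (0-based states).
    Returns mixing proportions pi (K,) and frequencies alpha (K, L, A),
    with alpha[k, l, j] = P(X^l = j | Z = k).
    """
    rng = np.random.default_rng(seed)
    n, L = X.shape
    pi = np.full(K, 1.0 / K)
    alpha = rng.dirichlet(np.ones(A), size=(K, L))  # random start, shape (K, L, A)
    onehot = np.eye(A)[X]                           # (n, L, A) indicator coding
    for _ in range(n_iter):
        # E step: responsibilities tau[i, k] ∝ pi_k * prod_l alpha[k, l, X_il],
        # computed in log scale for numerical stability.
        log_tau = np.log(pi) + np.einsum('nla,kla->nk', onehot, np.log(alpha))
        log_tau -= log_tau.max(axis=1, keepdims=True)
        tau = np.exp(log_tau)
        tau /= tau.sum(axis=1, keepdims=True)
        # M step: weighted proportions and state frequencies
        pi = tau.mean(axis=0)
        alpha = np.einsum('nk,nla->kla', tau, onehot) + 1e-10  # smoothing keeps logs finite
        alpha /= alpha.sum(axis=2, keepdims=True)
    return pi, alpha
```

The E step computes the posterior membership probabilities; the M step re-estimates π and α as responsibility-weighted frequencies, which is the closed-form maximizer for multinomial components.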
Two other points must be settled before reaching the final estimator P̂_(K̂n,Ŝn): the choice of the penalty function, and the sub-collection of models on which the EM algorithm will be run. These two points are discussed in Subsections 4.1 and 4.2. Simulations are then presented in Subsection 4.3.

Slope heuristics and Dimension jump
Theorem 1 suggests taking a penalty function of the shape (11), defined up to a multiplicative parameter λ which has to be calibrated. Slope heuristics, as presented in [Arlot and Massart, 2009, Birgé and Massart, 2007], provide a practical method to find an optimal penalty pen_opt(m) = λ_opt D_m/n. These heuristics are based on the conjecture that there exists a minimal penalty pen_min(m) = λ_min D_m/n required for the model selection procedure to work: when the penalty is smaller than pen_min, the selected model is one of the most complex models, and the risk of the selected estimator is large. On the contrary, when the penalty is larger than pen_min, the complexity of the selected model is much smaller. The optimal penalty is then close to twice the minimal penalty: pen_opt ≈ 2 pen_min. The name "slope heuristics" comes from λ_min being the slope of the linear regression of γ_n(P̂_m) on D_m/n for a certain sub-collection of the most competitive models m. For example, on the left panel of Figure 1 below, a slope is visible for the models containing the true model M_(K0,S0). Even if this example is favorable and mainly shown for illustration purposes, it indicates that the slope heuristics are compatible with the modeling of the present work.
Instead of estimating λ_min by linear regression, another method is jump detection. Suppose we have at hand a reasonable grid λ_1 < . . . < λ_r of candidate values of λ_min, and a sub-collection M_explored of the most competitive models. Each λ_i leads to a selected model m̂_i with dimension D_{m̂_i}. If we plot D_{m̂_i} as a function of λ_i, λ_min is expected to lie at the position of the biggest jump. However, the right panel of Figure 1 illustrates an important point: in that example the biggest jump is at λ ≈ 5.1, but λ_min is around 0.9, which corresponds to several successive jumps. We propose an improved version of the dimension jump method of [Arlot and Massart, 2009], based on a sliding window: we consider at a time all jumps in a window of h ≥ 1 consecutive intervals of the grid. Algorithm 1 below describes the procedure.
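Algorithm 1 is not reproduced in this version; the following is a hedged sketch of the sliding-window jump detection described above. The pooling of h consecutive intervals and the choice of the returned grid point are our illustrative assumptions, not necessarily the exact Algorithm 1:

```python
import numpy as np

def lambda_min_sliding(lams, dims, h=3):
    """Estimate lambda_min by a sliding-window dimension-jump heuristic.

    lams : increasing grid of candidate lambda values (length r).
    dims : dims[i] = dimension of the model selected with penalty lambda_i * D/n.
    h    : window width, i.e. number of consecutive grid intervals pooled
           together so that several successive jumps count as one big drop.
    Returns the lambda located just inside the window with the largest
    pooled dimension drop.
    """
    lams = np.asarray(lams, float)
    dims = np.asarray(dims, float)
    # pooled dimension drop over each window of h consecutive intervals
    drops = dims[:-h] - dims[h:]
    i_star = int(np.argmax(drops))
    return lams[i_star + 1]
```

Once λ_min is estimated this way, the slope heuristics suggest selecting the final model with the penalty 2 · λ_min · D_m / n.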

Sub-collection of models for calibration
For a given maximum value K_max of the number of clusters, the number of models under competition is equal to 1 + (K_max − 1)(2^L − 1). Since this number is huge in most situations, it is impractical to consider all competing models for the calibration of the parameter λ. On the other hand, we need enough models to ensure that there is a clear jump in the sequence of selected dimensions. We consider the modified backward-stepwise algorithm proposed in [Toussile and Gassiat, 2009], which explores the possible cardinalities of S. It gathers the most competitive models among all possible S for a given number K of clusters and a given penalty function pen_n. It also offers the choice of adding a complementary exploration step based on a similarly modified forward strategy.
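As a quick sanity check on this count, one can enumerate the model indices (K, S) directly (an illustrative sketch, not part of the procedure):

```python
from itertools import combinations

def n_models(K_max, L):
    """Count the models M_(K,S): (1, empty set) plus every (K, S) with
    K in {2, ..., K_max} and S a non-empty subset of {1, ..., L}."""
    count = 1  # the single model (K, S) = (1, empty set)
    for K in range(2, K_max + 1):
        for r in range(1, L + 1):
            count += sum(1 for _ in combinations(range(L), r))
    return count
```

For instance, K_max = 3 and L = 10 already give 1 + 2 · (2^10 − 1) = 2047 models, which illustrates why an exhaustive exploration is impractical.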
We will refer to this algorithm as explorer(K, pen_n). Since we do not know the final penalty during the exploration step, we consider a reasonable grid 1/2 = λ_1 < . . . < λ_r = ln n containing the penalty functions associated with both AIC and BIC (7). To each value λ_i of the grid is associated a penalty function pen_{λ_i}. We launch explorer(K, pen_{λ_i}) for all values of K in {1, . . ., K_max} and all values λ_i of the above grid, and we gather the explored models in M_explored. This sub-collection plausibly contains the most competitive models, and it is then used to calibrate λ.

Numerical experiments
Our proposed procedure with a data-driven calibration of the penalty function has been implemented for Case 2 in the software MixMoGenD (Mixture Model using Genotypic Data), which already offered a selection procedure based on the asymptotic criteria BIC and AIC (see [Toussile and Gassiat, 2009]). Here, we conduct numerical experiments on simulated datasets to assess the performance of the new non asymptotic criterion with respect to BIC and AIC.
We present two experiments, both in Case 2. The first one considers the consistency of the selected model: we study how the procedure retrieves the main features of the true model as the number of individuals in the datasets increases. In the second one, we are rather interested in a validation of the model selection procedure from the oracle point of view: we compare the Hellinger risk of the selected estimator to the oracle risk.

Consistency performances
In this experiment we consider a setting with L = 10 variables of 10 states each. We chose a parameter with K_0 = 5 populations of equal probability. The frequencies of the states have been chosen such that the genetic differentiation between the populations decreases with the rank of the variable. On the first 6 variables, the populations are more clearly separated. On the next 2 variables, the populations are very poorly differentiated. On the last 2 variables, the states follow the same uniform distribution in all populations. The whole parameter is available at http://www.math.u-psud.fr/~toussile/.
We considered different values n of the sample size in [50, 900], and for each of them 10 datasets have been simulated. The results are summarized in Figure 2. The left panel gives the proportion of runs selecting a subset Ŝ_n of clustering variables containing the first 6 variables, which are the most genetically differentiated ones. The right panel gives the proportion of runs in which the number of clusters is correctly selected. In this experiment, AIC seems to be the best criterion for variable selection; however, the difference between AIC and the new criterion is not significant. It also appears that AIC estimates the number of clusters better than the other criteria for small sample sizes (around n = 100 and n = 200), but it overestimates this number from n = 500 on. On the contrary, the new criterion perfectly estimates the number of clusters for sample sizes ≥ 300. BIC performs poorly for both variable selection and classification on datasets of small size. As expected, the data-driven calibration of the penalty function globally improves the performance of the selection procedure, and it thus gives an answer to the question "Which penalty for which sample size?". It may happen that the results obtained on small sample sizes change a little from one run to another. In fact, the EM algorithm can miss the global maximum on such sample sizes, in particular in models of higher dimension. In our experiments, this is probably the case for some datasets of size n ≤ 300, since the number of free parameters in the simulated model is ≥ 310.

Oracle performances of the estimator
Since the new criterion is designed from an oracle perspective, it is interesting to compare the associated estimator to the oracle for the Hellinger risk. Recall that the oracle is the estimator associated with the model indexed by the minimizer (K*, S*) of the risk E[h²(P_0, P̂_(K,S))] over the collection of models M.
In this experiment, we consider simulated datasets with reduced variability in order to reduce the computation time. The parameter underlying the data admits L = 6 variables, 3 states for each variable, and K_0 = 3 populations with equal probability. The frequencies of the states have been chosen in such a way that the genetic differentiation between the populations is significant on the first 3 variables and very small on the 4th and 5th variables, while the states of the 6th variable follow the uniform distribution in all populations. Thus the true model is defined by K_0 = 3 and S_0 = {1, 2, 3, 4, 5}. The whole parameter is available at http://www.math.u-psud.fr/~toussile/.
We estimated the oracle using a Monte Carlo procedure on 100 simulated datasets of size 500 each, and obtained K* = 3 and S* = {1, 2, 3, 4}. The results are summarized in Figure 3 and Table 1, which reports tests comparing the Hellinger distances of the selected estimators P̂_(K̂n,Ŝn); the alternative hypothesis is that the mean Hellinger distance associated with the criterion in the first column is less than the one associated with the criterion in the first row.
The worst behavior comes from BIC, and this is no surprise, for two main reasons. First, BIC is designed to find the true model, which differs from the oracle in our experiments. Second, it is based on an asymptotic approximation and therefore requires large samples. On the contrary, compared to AIC and BIC, the new criterion with data-driven calibration of the penalty function is significantly the best in terms of Hellinger risk and of the capacity to select the oracle. Recall that both AIC and the new criterion are designed to find the oracle (see Table 1). But like BIC, AIC is based on asymptotic approximations. So the advantage of the new criterion over AIC probably comes from its non asymptotic design.

Conclusion
In this paper, we have considered model selection via penalization, performing simultaneously a variable selection and a detection of the number of populations, in the specific framework of multivariate multinomial mixtures. This then leads to a clustering in a second step. Our main result provides an oracle inequality, under the condition of some lower bound on the penalty function.
The weakness of such a result is that the associated penalized criterion is not directly usable. Nevertheless, it suggests a shape of the penalty function of the form pen_n(m) = λ D_m/n, where λ = λ(n, M) is a parameter which depends on the data and on the collection of competing models. In practice λ is calibrated via the slope heuristics.
In the simulated experiments we conducted, the new criterion with penalty calibration behaves well for density estimation as well as for the selection of the true model. It also performs well both when the number of individuals is large and when it is reasonably small. This gives an answer to the question "Which criterion for which sample size?" In the modeling we considered, the model dimension grows rapidly. In real experiments the number of individuals can be small, so other modelings with reduced dimension may be needed. We are currently working on models which cluster the populations differently for each variable, as well as on models which allocate the same probability to several states.

A Metric entropy with bracketing
We first state several results about the entropy with bracketing, which will be used to prove Proposition 1. They are mainly adapted from [Genovese and Wasserman, 2000], but several are improved or written here in a more general form. These lemmas can be seen as a toolbox to compute the metric entropy with bracketing of complex models from the metric entropy of simpler elements.
We consider a measurable space (A, A), and µ a σ-finite positive measure on A. We consider a model M, which is a set of probability density functions with respect to µ. All functions considered in the following will be positive functions in L_1(µ).
Lemma 4. Let ε > 0. Let [l, u] be a bracket in L_1(µ) with h-diameter less than ε, containing a probability density function s with respect to µ. Then the three stated inequalities hold.

Proof. The first two inequalities are immediate from l ≤ s ≤ u. For the last one, we use the triangle inequality in L_2(µ) and the definition of h.

Lemma 5 (Bracketing entropy of product densities). Let n ≥ 2, and consider a collection (A_i, A_i, µ_i)_{1≤i≤n} of measure spaces. For any 1 ≤ i ≤ n, let M_i be a collection of probability density functions on A_i fulfilling (M), and let (t_{i,k})_{k≥1} be a sequence as needed for M_i to verify (M). Then the product model fulfills (M) with the stated choice of sequence.

The next result is just Lemma 2 from [Genovese and Wasserman, 2000]:

Lemma 7 (Bracketing entropy of the simplex). Let n ≥ 2 be an integer. Let µ be the counting measure on {1, . . . , n}. We identify any probability on {1, . . . , n} with its density s ∈ S_{n−1} with respect to µ. Then the stated bound holds for any 0 < δ ≤ 1.

To deal with Case 2, we also need the metric entropy of the collection of all Hardy-Weinberg genotype distributions for a given variable.
Lemma 8 (Bracketing entropy of Hardy-Weinberg genotype distributions). Suppose that, for some variable l, there exist A_l ≥ 2 different states. Let Ω_l be the collection of all genotype distributions following the Hardy-Weinberg model (1). Then Ω_l fulfills (M), and the stated entropy bound holds for any δ > 0 and ε.

Proof. Model (1) associates a parameter α = (α_1, . . . , α_{A_l}) ∈ S_{A_l−1} with any density in Ω_l. More generally, for any α ∈ [0, 1]^{A_l}, we define a function on the set of all genotypes x = {x_1, x_2} on A_l states. Consider some δ > 0 and d_α ∈ Ω_l. Let [l, u] be some bracket containing α, with h-diameter less than δ. Then d_α belongs to the bracket [d_l, d_u]; let us compute its diameter. The sequence (α(k)) tends to α for the usual topology as k tends to infinity. Then, for any genotype x = {x_1, x_2}, ln d_α(k)(x) tends to ln d_α(x). Therefore Ω_l fulfills (M).
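For concreteness, the Hardy-Weinberg genotype distribution d_α handled in Lemma 8 can be written out explicitly. This is a sketch assuming the standard Hardy-Weinberg parameterization, where an unordered genotype {x, y} on A_l states gets probability α_x² if x = y (homozygote) and 2 α_x α_y otherwise (heterozygote):

```python
import numpy as np

def hw_density(alpha):
    """Hardy-Weinberg genotype distribution d_alpha on unordered pairs
    {x, y} of A states: alpha_x**2 for homozygotes and
    2 * alpha_x * alpha_y for heterozygotes."""
    A = len(alpha)
    d = {}
    for x in range(A):
        for y in range(x, A):
            d[(x, y)] = alpha[x] ** 2 if x == y else 2 * alpha[x] * alpha[y]
    return d

# Illustrative allele frequencies on the simplex S_{A-1}
alpha = np.array([0.5, 0.3, 0.2])
d = hw_density(alpha)
assert abs(sum(d.values()) - 1.0) < 1e-12   # a genuine probability on genotypes
```

Since d_α depends continuously on α, brackets on the parameter α transfer to brackets on the densities d_α, which is the mechanism exploited in the proof above.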
Proof of Proposition 1. We build the proof for Case 2. For Case 1 everything is similar, with a simplification: we directly have S_{A_l−1} instead of Ω_l.
Using (2), we see that a probability P_(K,S)(·|θ) is the product of a mixture density corresponding to the variables in S and a product density in ∏_{l∉S} Ω_l for the other variables. Let us call M the collection of all mixtures of K densities in ∏_{l∈S} Ω_l. We first deal with the non-clustering variables: using Lemma 5 and Lemma 8, ∏_{l∉S} Ω_l fulfills (M), and the corresponding entropy bound holds for any ε ∈ (0, 1). In the same way, we can apply Lemma 6 and get that M fulfills (M). Lemma 5, applied again to M and ∏_{l∉S} Ω_l, gives that M_(K,S) fulfills (M) for any ε ∈ (0, 1). At this point, it only remains to use Lemma 7 and to compute the constants.

B Establishing the penalty
First, we need to establish some properties of the function η.

Lemma 2. For any model M_m, with m ∈ M as above, let us set x_m = (ln 2) D_m. Then ∑_{m∈M} e^{−x_m} ≤ (3/4)^L.

Figure 1: Two ways to compute the slope, on a simulated sample of 1000 individuals, with 8 clustering variables among 10, and 5 populations. Models have been explored via the modified backward-stepwise procedure described in Subsection 4.2, with the number K of clusters varying from 1 to 10. The size of the sliding window is 0.15.

Figure 2: The left panel gives the proportion of selected models with Ŝ_n ⊇ {1, . . . , 6}, and the right panel gives the proportion of selected models with K̂_n = K_0, versus the sample size.

Figure 3: The left panel gives the boxplots, means and their 95% confidence intervals for the ratio h²(P_0, P_(K̂_n, Ŝ_n)) / h²(P_0, P_(K*, S*)); the right panel gives the percentages of selection of the estimated oracle (K*, S*). Three criteria have been used: AIC, BIC, and Cte*Dim, which denotes the new criterion with data-driven calibration of the penalty function.

Table 1: The p-values of pairwise Student tests comparing the means of h²(P_0, P_(K̂_n, Ŝ_n)) obtained with the different criteria.