Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach

In the framework of conditional density estimation, we use candidates taking the form of mixtures of Gaussian regressions with logistic weights and means depending on the covariate. We aim at estimating the number of components of this mixture, as well as the other parameters, by a penalized maximum likelihood approach. We provide a lower bound on the penalty that ensures an oracle inequality for our estimator. We perform some numerical experiments that support our theoretical analysis. AMS 2000 subject classifications: 62G08.


Framework
In classical Gaussian mixture models, the density is modeled by

s(y) = Σ_{k=1}^{K} π_{w,k} Φ_{υ_k,Σ_k}(y),

where K ∈ N \ {0} is the number of mixture components, Φ_{υ,Σ} is the Gaussian density with mean υ and covariance matrix Σ, and the π_{w,k} are the mixture weights, which can always be defined from a K-tuple w = (w_1, ..., w_K) with a logistic scheme:

π_{w,k} = exp(w_k) / Σ_{k'=1}^{K} exp(w_{k'}).
In this article, we consider such a model in which the mixture weights as well as the means can depend on a, possibly multivariate, covariate. More precisely, we observe n pairs of random variables ((X_i, Y_i))_{1≤i≤n} where the covariates X_i are independent while the Y_i are conditionally independent given the X_i. We assume that the covariates lie in some subset X of R^d and the Y_i in R^p. We want to estimate the conditional density s_0(·|x), with respect to the Lebesgue measure, of Y given X. We model this conditional density by a mixture of Gaussian regressions with varying logistic weights,

s(y|x) = Σ_{k=1}^{K} π_{w,k}(x) Φ_{υ_k(x),Σ_k}(y),

where υ = (υ_1, ..., υ_K) and w = (w_1, ..., w_K) are now K-tuples of functions chosen, respectively, in sets Υ_K and W_K. Our aim is then to estimate those functions υ_k and w_k, the covariance matrices Σ_k as well as the number of classes K, so that the error between the estimated conditional density and the true conditional density is as small as possible.
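As an illustration, the conditional density of such a mixture can be evaluated numerically. The sketch below assumes a scalar response and user-supplied weight and mean functions; it is only an illustration of the model, not part of the paper's estimation procedure.

```python
import math

def logistic_weights(scores):
    """pi_{w,k}(x) from the K weight values w_k(x), via a stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density Phi_{mean, var}(y)."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_conditional_density(y, x, w_funcs, mean_funcs, variances):
    """s(y|x) = sum_k pi_{w,k}(x) Phi_{upsilon_k(x), Sigma_k}(y), scalar case."""
    weights = logistic_weights([w(x) for w in w_funcs])
    return sum(p * gaussian_pdf(y, m(x), v)
               for p, m, v in zip(weights, mean_funcs, variances))
```

Note that the softmax is shift-invariant, which is exactly why the K-tuple w is only identified up to an additive constant.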
The classical Gaussian mixture case has been extensively studied (McLachlan and Peel, 2000). Nevertheless, theoretical properties of such a model have been less considered. In a Bayesian framework, asymptotic properties of the posterior distribution are obtained by Choi (2008), Genovese and Wasserman (2000), Van der Vaart and Wellner (1996) when the true density is assumed to be a Gaussian mixture. AIC/BIC penalization schemes are often used to select a number of clusters (see Burnham and Anderson (2002) for instance). Non-asymptotic bounds are obtained by Maugis and Michel (2011), even when the true density is not a Gaussian mixture. All these works rely heavily on a bracketing entropy analysis of the models, which will also be central in our analysis.
When there is a covariate, the most classical extension of this model is a mixture of Gaussian regressions, in which the means υ_k are now functions. It is well studied, as described in McLachlan and Peel (2000). In particular, in a Bayesian framework, Viele and Tong (2002) have used bracketing entropy bounds to prove the consistency of the posterior distribution. Models in which the proportions vary have been considered by Antoniadis et al. (2009). Using an idea of Kolaczyk et al. (2005), they have considered a model in which only the proportions depend, in a piecewise constant manner, on the covariate. Their theoretical results are nevertheless obtained under the strong assumption that the Gaussian components are exactly known. This assumption can be removed, as shown by Cohen and Le Pennec (2013). Models in which both mixture weights and means depend on the covariate are considered by Ge and Jiang (2006), but in a mixture of logistic regressions framework. They give conditions on the number of components (experts) to obtain consistency of the posterior with logistic weights. Note that similar properties are studied by Lee (2000) for neural networks.
Although natural, mixtures of Gaussian regressions with varying logistic weights seem to have been first mentioned by Jordan and Jacobs (1994). They provide an algorithm similar to ours, based on EM and Iteratively Reweighted Least Squares, for hierarchical mixtures of experts, but no theoretical analysis. Young and Hunter (2010) choose a non-parametric approach to estimate the weights, which are no longer assumed to be logistic, using kernels and cross-validation. They also provide an EM-like algorithm and some convincing simulations. This work has an extension in a series of papers (Hunter and Young, 2012), (Huang and Yao, 2012). Young (2014) considers mixtures of regressions with changepoints but constant proportions. More recently, Huang et al. (2013) have considered a nonparametric modeling for the means, the proportions as well as the variance, for which they give asymptotic properties as well as a numerical algorithm. Closer to our work, Chamroukhi et al. (2010) consider the case of a piecewise polynomial regression model with affine logistic weights. In our setting, this corresponds to a specific choice for Υ_K and W_K: a collection of piecewise polynomials and a set of affine functions. They use a variation of the EM algorithm and a BIC criterion, and provide numerical experiments to support the efficiency of their scheme.
Young (2014) provides a relevant example for our analysis. The ethanol data set of Brinkman (1981) (Figure 1(a)) shows the relationship between the equivalence ratio, a measure of the air-ethanol mix used as a spark-ignition engine fuel in a single-cylinder automobile test, and the engine's concentration of nitrogen oxide (NO) emissions for 88 tests. Using the methodology described in this paper, we obtain a conditional density modeled by a mixture of four Gaussian regressions. Using a classical maximum likelihood approach, each point of the data set can be assigned to one of the four classes, yielding the clustering of Figure 1(b). The use of logistic weights allows a soft partitioning along the NO axis while still allowing more than one regression for the same NO value. The two topmost classes seem to correspond to a single population whose behavior changes around 1.7, while the two bottom-most classes appear to correspond to two different populations with a gap around 2.6-2.9. Such a result could not have been obtained with non-varying weights.
The main contribution of our paper is a theoretical result: an oracle inequality, a non-asymptotic bound on the risk, which holds for a penalty slightly different from the one used by Chamroukhi et al. (2010).
In Section 2, we recall the penalized maximum likelihood framework, introduce the losses considered and explain the meaning of such an oracle inequality. In Section 3, we specify the models considered and their collections, state our theorem under mild assumptions on the sets Υ_K and W_K and apply this result to polynomial sets. Those results are then illustrated by some numerical experiments in Section 4. Our analysis is based on an abstract theoretical analysis of the penalized maximum likelihood approach for conditional densities conducted in Cohen and Le Pennec (2011), which relies on bracketing entropy bounds. Appendix A summarizes those results, while Appendix B contains the proofs specific to this paper, the ones concerning bracketing entropies.

Penalized maximum likelihood estimator
We will use a model selection approach and define some conditional density models S_m by specifying sets of conditional densities, taking the shape of mixtures of Gaussian regressions, through their number of classes K, a structure on the covariance matrices Σ_k and two function sets Υ_K and W_K to which belong, respectively, the K-tuple of means (υ_1, ..., υ_K) and the K-tuple of logistic weights (w_1, ..., w_K). Typically those sets are compact subsets of polynomials of low degree. Within such a conditional density set S_m, we estimate s_0 by the maximizer ŝ_m of the likelihood, or more precisely, to avoid any existence issue since the minimizer may not be unique or may not even exist, by any η-minimizer of the negative log-likelihood:

-Σ_{i=1}^{n} ln(ŝ_m(Y_i|X_i)) ≤ inf_{s_m∈S_m} ( -Σ_{i=1}^{n} ln(s_m(Y_i|X_i)) ) + η.

Assuming now we have a collection {S_m}_{m∈M} of models, for instance with different numbers of classes K or different maximum degrees for the polynomials defining Υ_K and W_K, we should choose the best model within this collection.
Using only the log-likelihood is not sufficient, since this favors models with large complexity. To balance this issue, we will define a penalty pen(m) and select the model m̂ that minimizes (or rather η'-almost minimizes) the sum of the negative log-likelihood and this penalty:

m̂ = argmin_{m∈M} ( -Σ_{i=1}^{n} ln(ŝ_m(Y_i|X_i)) + pen(m) ).
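In code, this selection step reduces to minimizing the penalized criterion over the collection. The following minimal sketch uses a generic penalty shape pen(m) = κ(c·dim(S_m) + x_m); the interface is hypothetical, not the authors' implementation.

```python
def select_model(neg_log_liks, dims, kappa, const=1.0, weights=None):
    """Return the index m minimizing  -log-likelihood(m) + pen(m),
    with pen(m) = kappa * (const * dim(S_m) + x_m)."""
    if weights is None:
        weights = [0.0] * len(dims)   # x_m terms, zero if unused
    crit = [nll + kappa * (const * d + x)
            for nll, d, x in zip(neg_log_liks, dims, weights)]
    return min(range(len(crit)), key=crit.__getitem__)
```

As expected, a small κ favors the model with the best fit, while a large κ favors low-dimensional models.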

Losses
Classically in the maximum likelihood context, the estimator loss is measured with the Kullback-Leibler divergence KL. Since we work in a conditional density framework, we use a tensorized version of it. We define the tensorized Kullback-Leibler divergence

KL^{⊗n}(s,t) = E[ (1/n) Σ_{i=1}^{n} KL(s(·|X_i), t(·|X_i)) ],

which appears naturally in this setting. Replacing t by a convex combination between s and t and dividing by ρ yields the so-called tensorized Jensen-Kullback-Leibler divergence, denoted JKL_ρ^{⊗n}, with ρ ∈ (0, 1):

JKL_ρ^{⊗n}(s,t) = (1/ρ) KL^{⊗n}(s, (1-ρ)s + ρt).

This loss is always bounded by (1/ρ) ln(1/(1-ρ)) but behaves like KL when t is close to s. This boundedness turns out to be crucial to control the loss of the penalized maximum likelihood estimate under mild assumptions on the complexity of the models and of their collection.
Furthermore, JKL_ρ^{⊗n}(s,t) ≤ KL^{⊗n}(s,t). If we let d^{2⊗n} be the tensorized extension of the squared Hellinger distance d^2, Cohen and Le Pennec (2011) prove that there is a constant C_ρ such that C_ρ d^{2⊗n}(s,t) ≤ JKL_ρ^{⊗n}(s,t). Moreover, if we assume that for any m ∈ M and any s_m ∈ S_m, s_0 dλ ≪ s_m dλ, then this comparison also holds between s_0 and the elements of the models (see Cohen and Le Pennec (2011)).
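On finite discrete distributions, the relation JKL_ρ ≤ KL and the (1/ρ) ln(1/(1-ρ)) bound can be checked numerically. This toy snippet uses plain (non-tensorized) divergences and is purely illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two finite discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jkl(p, q, rho):
    """Jensen-Kullback-Leibler: (1/rho) * KL(p, (1-rho) p + rho q)."""
    mix = [(1 - rho) * pi + rho * qi for pi, qi in zip(p, q)]
    return kl(p, mix) / rho
```

Even for distributions with disjoint supports, where kl is infinite, jkl stays bounded by (1/ρ) ln(1/(1-ρ)); this is the boundedness property exploited in the analysis.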

Oracle inequality
Our goal is now to define a penalty pen(m) which ensures that the maximum likelihood estimate in the selected model performs almost as well as the maximum likelihood estimate in the best model.More precisely, we will prove an oracle type inequality with a pen(m) chosen of the same order as the variance of the corresponding single model maximum likelihood estimate.
The name oracle type inequality means that the right-hand side is a proxy for the estimation risk of the best model within the collection. The Kullback-Leibler term inf_{s_m∈S_m} KL^{⊗n}(s_0, s_m) is a typical bias term, while pen(m)/n plays the role of the variance term. We have three sources of loss here: the constant C_1 cannot be taken equal to 1, we use a different divergence on the left and on the right, and pen(m)/n is not directly related to the variance. Under a strong assumption, namely a finite upper bound on sup_{m∈M} sup_{s_m∈S_m} ‖s_0/s_m‖_∞, the two divergences are equivalent for the conditional densities considered and thus the second issue disappears.
The first issue has a consequence as soon as s_0 does not belong to the best model, i.e. when the model is misspecified. Indeed, in that case, the corresponding modeling bias inf_{s_m∈S_m} KL^{⊗n}(s_0, s_m) may be large and the error bound does not converge, when n goes to infinity, to this bias but to C_1 times this bias. Proving such an oracle inequality with C_1 = 1 would thus be a real improvement.
To our knowledge, these first two issues have not been solved in penalized density estimation with the Kullback-Leibler loss, but only with the L2 norm or for the aggregation of a finite number of densities, as in Rigollet (2012).
Concerning the third issue, if S_m is parametric, pen(m) can be chosen approximately proportional to the dimension dim(S_m) of the model, which will be the case in our setting; pen(m)/n is then approximately proportional to dim(S_m)/n, which is the asymptotic variance in the parametric case. The right-hand side nevertheless matches the best known bound obtained for a single model within such a general framework.

Models of mixtures of Gaussian regressions
As explained in the introduction, to estimate s_0 we use candidate conditional densities of the form

s(y|x) = Σ_{k=1}^{K} π_{w,k}(x) Φ_{υ_k(x),Σ_k}(y),

where K ∈ N \ {0} is the number of mixture components, Φ_{υ,Σ} is the density of a Gaussian with mean υ and covariance matrix Σ, υ_k is a function specifying the mean of the k-th component given x, Σ_k is its covariance matrix, and the mixture weights π_{w,k} are defined from a collection of K functions w_1, ..., w_K by a logistic scheme:

π_{w,k}(x) = exp(w_k(x)) / Σ_{k'=1}^{K} exp(w_{k'}(x)).
We will estimate s_0 by conditional densities belonging to some model S_m defined by

S_m = { (x,y) ↦ Σ_{k=1}^{K} π_{w,k}(x) Φ_{υ_k(x),Σ_k}(y) | w ∈ W_K, υ ∈ Υ_K, (Σ_1, ..., Σ_K) ∈ V_K },

where W_K is a compact set of K-tuples of functions from X to R, Υ_K a compact set of K-tuples of functions from X to R^p and V_K a compact set of K-tuples of covariance matrices of size p × p. From now on, we will assume that those sets are parametric subsets of dimensions respectively dim(W_K), dim(Υ_K) and dim(V_K). The dimension dim(S_m) of the now parametric model S_m is thus

dim(S_m) = dim(W_K) + dim(Υ_K) + dim(V_K).

Before describing those sets more precisely, we recall that S_m will be taken in a model collection S = (S_m)_{m∈M}, where m ∈ M specifies a choice for each of those parameters. Within this collection, the number of components K will be chosen smaller than an arbitrary K_max, which may depend on the sample size n. The sets W_K and Υ_K will typically be chosen as tensor products of a common compact set of moderate dimension, for instance a set of polynomials of degree smaller than, respectively, d'_W and d'_Υ, whose coefficients are smaller in absolute value than, respectively, T_W and T_Υ.
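For concreteness, the dimension accounting dim(S_m) = dim(W_K) + dim(Υ_K) + dim(V_K) can be sketched in code for univariate-covariate polynomial models. The conventions below (one logistic weight fixed to 0 for identifiability, free or common covariances) are plausible bookkeeping choices for illustration, not necessarily the paper's exact count.

```python
def model_dimension(K, deg_w, deg_mean, p=1, cov_structure="free"):
    """Illustrative dimension count for a mixture of Gaussian regressions
    with polynomial logistic weights (degree deg_w) and polynomial means
    (degree deg_mean), scalar covariate, R^p-valued response."""
    dim_w = (K - 1) * (deg_w + 1)       # K weight polynomials, one fixed to 0
    dim_mean = K * p * (deg_mean + 1)   # K-tuple of R^p-valued mean polynomials
    per_cov = p * (p + 1) // 2          # one free symmetric p x p covariance
    dim_cov = per_cov if cov_structure == "common" else K * per_cov
    return dim_w + dim_mean + dim_cov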
The structure of the set V_K depends on the noise model chosen: we can assume, for instance, that the covariance is common to all regressions, that the matrices share a similar volume or diagonalization matrix, or that they are all different. More precisely, we decompose any covariance matrix Σ into L P A P', where L = |Σ|^{1/p} is a positive scalar corresponding to the volume, P is the matrix of eigenvectors of Σ and A the diagonal matrix of normalized eigenvalues of Σ. Let L_−, L_+ and λ_−, λ_+ be positive values. We define the set A(λ_−, λ_+) of diagonal matrices A such that |A| = 1 and, for all i ∈ {1, ..., p}, λ_− ≤ A_{i,i} ≤ λ_+. A set V_K is then defined by the choice, common to all components or not, of a volume L_k ∈ [L_−, L_+], an orientation P_k in SO(p), the special orthogonal group, and a normalized diagonal matrix A_k ∈ A(λ_−, λ_+). Those sets V_K correspond to the classical covariance matrix sets described by Celeux and Govaert (1995).
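The volume/orientation/shape decomposition Σ = L P A P' can be computed with a standard eigendecomposition. A minimal sketch (note that the eigenvector matrix returned below may have determinant −1 rather than lie in SO(p); flipping the sign of one column fixes this without changing the reconstruction):

```python
import numpy as np

def volume_shape_orientation(sigma):
    """Decompose a covariance matrix as sigma = L * P @ A @ P.T, with
    L = |sigma|^(1/p) the volume, P orthogonal (eigenvectors of sigma)
    and A diagonal with determinant 1 (normalized eigenvalues)."""
    p = sigma.shape[0]
    eigvals, P = np.linalg.eigh(sigma)
    L = np.linalg.det(sigma) ** (1.0 / p)
    A = np.diag(eigvals / L)   # normalized so that |A| = 1
    return L, P, A
```

This is the parameterization on which the equality constraints across components (shared volume, shared shape, shared orientation) are imposed.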

A conditional density model selection theorem
The penalty should be chosen of the same order as the estimator's complexity, which depends on an intrinsic model complexity as well as on a collection complexity.
We will bound the model complexity term using the dimension of S_m: we prove that those two terms are roughly proportional under some structural assumptions on the sets W_K and Υ_K. To obtain this result, we rely on an entropy measure of the complexity of those sets. More precisely, for any K-tuples of functions (s_1, ..., s_K) and (t_1, ..., t_K), we let

d_{sup,∞}((s_1, ..., s_K), (t_1, ..., t_K)) = max_{1≤k≤K} sup_{x∈X} ‖s_k(x) − t_k(x)‖,

and define the metric entropy of a set F_K, H_{d_{sup,∞}}(σ, F_K), as the logarithm of the minimal number of balls of radius at most σ, in the sense of d_{sup,∞}, needed to cover F_K. We will assume that the parametric dimension D of the set considered coincides with an entropy based definition, namely that there exists a constant C such that, for σ ∈ (0, √2],

H_{d_{sup,∞}}(σ, F_K) ≤ D (C + ln(1/σ)).

Assumption (DIM). There exist two constants C_W and C_Υ such that, for every set W_K and Υ_K of the models S_m in the collection S, for all σ ∈ (0, √2],

H_{d_{sup,∞}}(σ, W_K) ≤ dim(W_K) (C_W + ln(1/σ)) and H_{d_{sup,∞}}(σ, Υ_K) ≤ dim(Υ_K) (C_Υ + ln(1/σ)).

Note that one can extend our result to any compact sets for which those assumptions hold with dimensions that may differ from the usual ones. The complexity of the estimator also depends on the complexity of the collection. That is why one further needs to control the complexity of the collection as a whole through a coding type (Kraft) assumption (Barron et al., 2008).

Assumption (K)
There is a family (x_m)_{m∈M} of non-negative numbers and a real number Ξ such that Σ_{m∈M} e^{−x_m} ≤ Ξ < +∞.
We can now state our main result, a weak oracle inequality:

Theorem 1. For any collection of mixtures of Gaussian regressions models S = (S_m)_{m∈M} satisfying (K) and (DIM), there is a constant C such that, for any ρ ∈ (0, 1) and any C_1 > 1, there is a constant κ_0 depending only on ρ and C_1 such that, as soon as for every index m ∈ M, pen(m) = κ((C + ln n) dim(S_m) + x_m) with κ > κ_0, the penalized likelihood estimate ŝ_m̂, with m̂ such that

-Σ_{i=1}^{n} ln(ŝ_m̂(Y_i|X_i)) + pen(m̂) ≤ inf_{m∈M} ( -Σ_{i=1}^{n} ln(ŝ_m(Y_i|X_i)) + pen(m) ) + η',

satisfies

E[ JKL_ρ^{⊗n}(s_0, ŝ_m̂) ] ≤ C_1 inf_{m∈M} ( inf_{s_m∈S_m} KL^{⊗n}(s_0, s_m) + pen(m)/n ) + C_2 Ξ/n + (η + η')/n,

with C_2 a constant depending only on ρ and C_1. Recall that, under the assumption that sup_{m∈M} sup_{s_m∈S_m} ‖s_0/s_m‖_∞ is finite, JKL_ρ^{⊗n} can be replaced by KL^{⊗n} up to a multiplicative constant depending on ρ and the upper bound. Note that this strong assumption is nevertheless satisfied if we assume that X is compact, s_0 is compactly supported, the regression functions are uniformly bounded and there is a uniform lower bound on the eigenvalues of the covariance matrices.
As shown in the proof, the assumption on pen(m) in the previous theorem could be replaced by a milder one. It may be noticed that if (x_m)_m satisfies Assumption (K), then, for any permutation τ, (x_{τ(m)})_m satisfies this assumption too. In practice, x_m should be chosen such that the ratio 2κ x_m / pen(m) is as small as possible, so that the penalty can be seen as proportional to its two first terms. Notice that the constant C only depends on the model collection parameters, in particular on the maximal number of components K_max. As often in model selection, the collection may depend on the sample size n. If the constant C grows no faster than ln(n), the penalty shape can be kept intact and a similar result holds uniformly in n, up to a slightly larger κ_0. In particular, the apparent dependency on K_max is not an issue: K_max only appears in C through a logarithmic term, and K_max should be taken smaller than n for identifiability issues. Finally, it should be noted that the ln n term in the penalty of Theorem 1 may not be necessary, as hinted by a result of Gassiat and van Handel (2014) for one-dimensional mixtures of Gaussian distributions with the same variance.

Linear combination of bounded functions for the means and the weights
We postpone the proof of this theorem to the Appendix and focus on Assumption (DIM). This assumption is easily verified when the function sets W_K and Υ_K are defined as linear combinations of a finite set of bounded functions whose coefficients belong to a compact set. This quite general setting includes the polynomial basis when the covariates are bounded, the Fourier basis on an interval, as well as suitably renormalized wavelet dictionaries. Let d_W and d_Υ be two positive integers, let (ψ_{W,i})_{1≤i≤d_W} and (ψ_{Υ,i})_{1≤i≤d_Υ} be two collections of bounded functions from X to [−1, 1], and define the weights w_k and the means υ_j as linear combinations of these functions with bounded coefficients, the (j) in the coefficient notation α^{(j)}_r indicating the link with υ_j. We will be interested in tensor product constructions from those sets. Note that in this general case, only the functions ψ_{W,i} and ψ_{Υ,i} need to be bounded, and not the covariate X itself.
For the sake of simplicity, we focus on the bounded case and assume X = [0, 1]^d. In that case, we can use a polynomial modeling: ψ_{W,i} and ψ_{Υ,i} can be chosen as monomials x^r = x_1^{r_1} ... x_d^{r_d}. If we let d'_W and d'_Υ be two maximum (non-negative) degrees for those monomials and define the sets W_K and Υ_K accordingly, the previous Lemma becomes

Lemma 2. W_K and Υ_K satisfy Assumption (DIM), with constants C_W and C_Υ not depending on K.
To apply Theorem 1, it remains to describe a collection S = (S_m)_{m∈M} and a suitable choice for (x_m)_{m∈M}. Assume, for instance, that the models in our collection are defined by an arbitrary maximal number of components K_max, a common free structure for the covariance matrix K-tuples and common maximal degrees for the sets W_K and Υ_K. Then one can verify that dim(S_m) grows linearly with K and that the weight family (x_m = K)_{m∈M} satisfies Assumption (K) with Ξ ≤ 1/(e − 1). Theorem 1 then yields an oracle inequality with pen(m) = κ((C + ln(n)) dim(S_m) + x_m). Note that, as x_m ≪ (C + ln(n)) dim(S_m), one can obtain a similar oracle inequality with pen(m) = κ(C + ln(n)) dim(S_m) for a slightly larger κ. Finally, as explained in the proof, choosing a covariance structure from the finite collection of Celeux and Govaert (1995), or choosing the maximal degrees for the sets W_K and Υ_K among a finite family, can be handled with the same penalty but a larger constant Ξ in Assumption (K).

Numerical scheme and numerical experiment
We illustrate our theoretical result in a setting similar to the one considered by Chamroukhi et al. (2010) and on two real data sets. We observe n pairs (X_i, Y_i), with X_i in a compact interval, namely [0, 1] for the simulated data and respectively [0, 5] and [0, 17] for the first and second real data sets, and Y_i ∈ R, and look for the best estimate of the conditional density s_0(y|x) among mixtures of Gaussian regressions with logistic weights, with w ∈ W_K and υ ∈ Υ_K. We consider the simple case where W_K and Υ_K contain linear functions. We do not impose any structure on the covariance matrices. Our aim is to estimate the best number of components K as well as the model parameters. As described in more detail later, we use an EM type algorithm to estimate the model parameters for each K and select one model using the penalized approach described previously.

The procedure
As often in a model selection approach, the first step is to compute the maximum likelihood estimate for each number of components K. To this purpose, we use a numerical scheme based on the EM algorithm (Dempster et al., 1977), similar to the one used by Chamroukhi et al. (2010). The only difference with a classical EM lies in the Maximization step, since there is no closed formula for the weight optimization. We use instead a Newton type algorithm. Note that we only perform a few Newton steps (5 at most were enough in our experiments) and ensure that the likelihood does not decrease. We have noticed that there is no need to fully optimize at each step: we did not observe a better convergence, and the algorithmic cost is high. From now on, we denote this algorithm Newton-EM. Notice that the lower bound on the variance required in our theorem appears to be necessary in practice. It avoids the spurious local maximizer issue of EM algorithms, in which a class degenerates to a minimal number of points allowing a perfect Gaussian regression fit. We use a lower bound of the form C/n. Biernacki and Castellan (2011) provide a precise data-driven bound for mixtures of Gaussian regressions, involving the quantile function of the chi-squared distribution with n − 2K + 1 degrees of freedom, which is of the same order as 1/n in our case. In practice, the constant C = 10 gave good results for the simulated data.
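The E-step of such a Newton-EM, which is the part shared with classical EM, can be sketched as follows for a scalar covariate, affine weights w_k(x) = a_k x + b_k and affine means υ_k(x) = c_k x + d_k. This is an illustrative reimplementation, not the authors' code; the subsequent M-step (weighted least squares for the means and variances, Newton steps for the weights) is omitted.

```python
import math

def e_step(x, y, weight_params, mean_params, variances):
    """Posterior responsibilities t_{ik} proportional to
    pi_{w,k}(x_i) * Phi_{upsilon_k(x_i), sigma_k^2}(y_i),
    computed in log scale for numerical stability."""
    resp = []
    for xi, yi in zip(x, y):
        logs = []
        for (a, b), (c, d), v in zip(weight_params, mean_params, variances):
            w = a * xi + b                         # logistic weight score w_k(x_i)
            ll = (-0.5 * math.log(2 * math.pi * v)
                  - (yi - (c * xi + d)) ** 2 / (2 * v))
            logs.append(w + ll)                    # softmax of w absorbs normalization
        m = max(logs)
        exps = [math.exp(l - m) for l in logs]
        s = sum(exps)
        resp.append([e / s for e in exps])
    return resp
```

The variance lower bound discussed above would simply be enforced after the M-step, by clipping each updated variance at C/n.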
An even more important issue with EM algorithms is initialization, since the local minimizer obtained depends heavily on it. We observe that, while the weights w do not require special care and can simply be initialized uniformly equal to 0, the means require much more attention in order to obtain a good minimizer. We propose an initialization strategy based on short runs of Newton-EM with random initialization.
We draw K lines at random, each defined as the line going through two points (X_i, Y_i) drawn at random among the observations. We then perform a K-means clustering using the distance along the Y axis. Our Newton-EM algorithm is initialized with the regression parameters as well as the empirical variance on each of the K clusters. We then perform 3 steps of our minimization algorithm and keep, among 50 trials, the one with the largest likelihood. This winner is used as the initialization of a final Newton-EM algorithm using 10 steps.
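The random-lines part of this initialization can be sketched as follows (a hypothetical helper assuming a scalar covariate; the K-means refinement and the per-cluster variance computation are omitted):

```python
import random

def random_line_init(points, K, rng=None):
    """Draw K candidate regression lines, each through two random data
    points, then assign every point to the nearest line along the Y axis.
    Pairs with equal x (a vertical line) are redrawn."""
    rng = rng or random.Random(0)
    lines = []
    while len(lines) < K:
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                     # vertical line: redraw the pair
        slope = (y2 - y1) / (x2 - x1)
        lines.append((slope, y1 - slope * x1))
    labels = [min(range(K),
                  key=lambda k: abs(y - (lines[k][0] * x + lines[k][1])))
              for x, y in points]
    return lines, labels
```

Running this inside a loop of short Newton-EM trials and keeping the best likelihood reproduces the spirit of the strategy described above.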
We consider two other strategies: a naive one, in which the initial lines chosen at random and a common variance are used directly to initialize the Newton-EM algorithm, and a clever one, in which the observations are first normalized to have a similar variance along both the X and Y axes, a K-means on both X and Y with 5 times the number of components is then performed, and the initial lines are drawn among the regression lines of the resulting clusters containing more than 2 points.
The complexity of those procedures differs, and as stressed by Celeux and Govaert (1995), the fairest comparison is to run them for the same amount of time (5 seconds, 30 seconds, 1 minute...) and compare the obtained likelihoods. The difference between the 3 strategies is not dramatic: they yield very similar likelihoods. We nevertheless observe that the naive strategy has an important dispersion and sometimes fails to give a satisfactory answer. The comparison between the clever strategy and the regular one is more delicate, since the difference is much smaller. Following Celeux and Govaert (1995), we have chosen the regular one, which corresponds to more random initializations and thus may explore more local maxima.
Once the parameter estimates have been computed for each K, we select the model that minimizes the penalized negative log-likelihood

-Σ_{i=1}^{n} ln(ŝ_m(Y_i|X_i)) + pen(m), with pen(m) = κ dim(S_m).

Note that our theorem ensures that there exists a κ large enough for which the estimate has good properties, but it does not give an explicit value for κ. In practice, κ has to be chosen. The two most classical choices are κ = 1 and κ = (ln n)/2, which correspond to the AIC and BIC approaches, motivated by asymptotic arguments. We have used here the slope heuristic proposed by Birgé and Massart (2007) and described, for instance, in Baudry et al. (2011). This heuristic comes with two possible criteria: the jump criterion and the slope criterion. The first one consists in plotting the dimension of the selected model as a function of κ (Fig. 3), and finding κ̂ such that the dimension of the selected model is large if κ < κ̂, and reasonable otherwise. The slope heuristic then prescribes the use of κ = 2κ̂. In the second one, one computes the asymptotic slope of the log-likelihood drawn as a function of the model dimension, and penalizes the log-likelihood by twice this slope times the model dimension. With our simulated data sets, we are in the not so common situation in which the jump is strong enough for the first criterion to be used.
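The jump criterion can be sketched in a few lines: scan a grid of κ values, record the dimension of the model selected at each, locate the largest drop and return twice the κ at which it occurs. This is an illustrative implementation of the heuristic, not the authors' code; a real implementation would compute the selected dimensions from the fitted log-likelihoods.

```python
def dimension_jump(kappas, selected_dims):
    """Slope-heuristic jump criterion: find the kappa where the dimension
    of the selected model drops the most, and prescribe twice that value.
    kappas must be sorted increasingly; selected_dims[i] is the dimension
    of the model selected with penalty kappa[i] * dim(S_m)."""
    jumps = [selected_dims[i] - selected_dims[i + 1]
             for i in range(len(selected_dims) - 1)]
    i_max = max(range(len(jumps)), key=jumps.__getitem__)
    kappa_hat = kappas[i_max + 1]    # first kappa after the big jump
    return 2 * kappa_hat
```

When the jump is weak, as for the ethanol data set below, the slope criterion is the safer fallback.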

Simulated data sets
The previous procedure has been applied to two simulated data sets: one in which the true conditional density belongs to one of our models, a well-specified case, and one in which it does not, a misspecified case. In the first situation, we expect to perform almost as well as the maximum likelihood estimation in the true model. In the second situation, we expect our algorithm to automatically balance the model bias and its variance. The true conditional density is fixed in each example; the first is denoted example WS and the second example MS. For both experiments, we let X be uniformly distributed over [0, 1]. Figure 2 shows a typical realization. In both examples, we have noticed that the sample size had no significant influence on the choice of κ, and that very often 1 was in the range of possible values indicated by the jump criterion of the slope heuristic. According to this observation, we have chosen κ = 1 in both examples.
We measure performances in terms of the tensorized Kullback-Leibler divergence. Since there is no closed formula for the tensorized Kullback-Leibler divergence in the case of Gaussian mixtures, and since we know the true density, we evaluate the divergence using a Monte Carlo method. The variability of this randomized approximation has been verified to be negligible in practice.
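The Monte Carlo evaluation rests on the identity KL(s_0, ŝ) = E[ln(s_0(Y|X)/ŝ(Y|X))] when (X, Y) is drawn from the true joint distribution. A minimal sketch (the densities and the sampling scheme below are stand-ins for illustration):

```python
import math

def monte_carlo_kl(pairs, s0, s_hat):
    """Estimate E[ log( s0(y|x) / s_hat(y|x) ) ] over sampled (x, y) pairs,
    with (x, y) drawn from the true joint distribution."""
    logs = [math.log(s0(y, x) / s_hat(y, x)) for x, y in pairs]
    return sum(logs) / len(logs)
```

The standard deviation of the estimate decreases as 1/sqrt(N) with the number N of draws, which is what makes its variability negligible in practice for moderate N.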
For several numbers of mixture components and for the selected K̂, we draw in Figure 4 the box plots and the mean of the tensorized Kullback-Leibler divergence over 55 trials. The first observation is that the mean tensorized Kullback-Leibler divergence between the penalized estimator ŝ_K̂ and s_0 is smaller than the mean tensorized Kullback-Leibler divergence between ŝ_K and s_0 over K ∈ {1, ..., 20}. This is in line with the oracle type inequality of Theorem 1. Our numerical results hint that our theoretical analysis may be pessimistic. A closer inspection shows that the bias-variance trade-off differs between the two examples. Indeed, since in the first one the true density belongs to the model, the best choice is K = 2, even for large n. As shown in the histogram of Figure 5, this is almost always the model chosen by our algorithm. Observe also that the mean Kullback-Leibler divergence seems to behave like dim(S_m)/(2n) (shown by a dotted line). This is indeed the expected behavior when the true model belongs to a nested collection and corresponds to the classical AIC heuristic. In the second example, the misspecified one, the true model does not belong to the collection. The best choice for K should thus balance a model approximation error term and a variance one. We observe such a behavior in Figure 5: the larger n, the more complex the selected model and thus K. Note that the slope of the mean error also seems to grow like dim(S_m)/(2n), even though there is no theoretical guarantee of such a behavior.
Figure 6 shows the error decay as the sample size n grows. As expected in the well-specified case, example WS, we observe the decay in t/n predicted by the theory, with t some constant. The rate in the second case appears to be slower. Indeed, as the true conditional density does not belong to any model, the selected models become more and more complex as n grows, which slows the error decay. In our theoretical analysis, this can already be seen in the decay of the variance term of the oracle inequality. Indeed, if we let m_0(n) be the optimal oracle model, the one minimizing the right-hand side of the oracle inequality, the variance term is of order pen(m_0(n))/n, which is larger than 1/n as soon as dim(S_{m_0(n)}) → +∞.

Figure 6: Kullback-Leibler divergence between the true density and the computed density using (X_i, Y_i)_{i≤N}, with respect to the sample size, represented on a log-log scale. For each graph, we added a free linear least-squares regression and one with slope −1 to stress the two different behaviors.

It is well known that the decay depends on the regularity of the true conditional density. Providing a minimax analysis of the proposed estimator, as Maugis and Michel (2012) have done, would be interesting but is beyond the scope of this paper.

Ethanol data set
We now explain in more detail the result of Figure 1 for the 88-point ethanol data set of Brinkman (1981). Young (2014) proposes to estimate the density of the equivalence ratio R conditioned on the NO concentration, and to use this conditional density to cluster the data set. In our framework, this amounts to estimating the conditional density s(R|NO) with our proposed penalized estimator and to using the classical maximum likelihood approach, which assigns (NO, R) to the class arg max_k π_{w,k}(NO) Φ_{υ_k(NO),Σ_k}(R), to perform the clustering.
An important parameter of the method is the lower bound on the variance used in the estimation for a given number of classes. This is required to avoid spurious maximizers of the likelihood. Here, the value 10^{-4}, chosen by hand, yields satisfactory results.
Since we only have 88 points and roughly 5 parameters per class, the random initialization may yield classes with too few points to allow a good estimation. We have slightly modified our K-means procedure to ensure that at least 10 points are assigned to each class. In that case, we have verified that the estimated parameters of the conditional density were very stable.
Note that with this strategy, no more than 8 classes can be considered. This prevents the use of the jump criterion to calibrate the penalty, because the big jump is hard to identify. We use instead the slope criterion. Figure 7 shows that this slope is of order 1, so the slope heuristic prescribes a penalty of 2 dim(S_K), providing an estimate with 4 components.
It is worth pointing out that the maximum of the penalized likelihood is not sharp, just as in example MS of the simulated data (see Figure 5). Indeed, it is quite unlikely that the true density belongs to our model collection. There may thus be some uncertainty on the selected number of components between 4, 3 and 5. Note that AIC penalization would have led to 7 classes, while BIC would also have led to 4 classes. Our estimated penalty is nevertheless in the middle of the zone corresponding to 4, while BIC is near the boundary with 3, so we expect our choice to be more stable. In Figure 1(b) of the introduction, we have shown only the clustering with 4 classes. Figure 8 shows that the choices of 3 or 5 may make sense, even though the choice of 5 may seem slightly too complex. A common feature among all those clusterings is the change of slope in the topmost part around 1.7. This phenomenon is also visible in Young (2014), in which an explicit change point model is used; ours is only implicit and thus more versatile. To complete our study, in Figure 9 we have considered the more natural regression of NO with respect to the equivalence ratio, which has not been studied by Young (2014). Using the same methodology, we also recover 4 clusters, corresponding to a soft partitioning of the equivalence ratio values. Note that this clustering, which is easily interpretable, is very similar to the one obtained with the previous parameterization.

ChIP-chip data set
We consider here a second real data set: a chromatin immunoprecipitation (ChIP) on chip genomic data set. Chromatin immunoprecipitation (ChIP) is a procedure used to investigate proteins associated with DNA. The data set considered is the one used by Martin-Magniette et al. (2008). In this experiment, two variables are studied: DNA fragments crosslinked to a protein of interest (IP) and genomic DNA (Input). Martin-Magniette et al. (2008) model the density of log-IP conditioned on log-Input by a mixture of two Gaussian regressions with the same variance. One component corresponds to an enriched state, in which there are more proteins than expected, and the other to a normal one. They use classical proportions that do not depend on the Input. The parameters are estimated using the EM algorithm, initialized by values derived from a Principal Component Analysis of the whole data set. The best model between one and two components is selected according to the BIC criterion. For the histone modification in Arabidopsis thaliana data set, they select a two-component model similar to the one obtained with logistic weights (Figure 10).
We have first compared the constant proportions model with K = 2 to the one proposed in their conclusion, in which the proportions depend on the Input. We have used our affine logistic weight model and observed that it greatly improves the log-likelihood. The dimension of this new model is 8 while the dimension of the original model is 7, so the log-likelihood increase does not seem to be due to overfitting. We have also compared our solution to the one obtained with a constant weight K = 3 model of dimension 11. The BIC criterion selects the K = 2 solution with affine weights.
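The BIC comparison described above amounts to computing −2 × log-likelihood + dim × log n for each candidate and keeping the smallest value. A minimal sketch (the log-likelihood values and sample size are placeholders for illustration, not those of the ChIP-chip data):

```python
import math

def bic(loglik, dim, n):
    """Bayesian Information Criterion (smaller is better)."""
    return -2.0 * loglik + dim * math.log(n)

# Hypothetical log-likelihoods, chosen only to illustrate the trade-off
# between fit and dimension:
n = 10000
models = {
    "K=2, constant weights (dim 7)":        bic(-21500.0, 7, n),
    "K=2, affine logistic weights (dim 8)": bic(-21380.0, 8, n),
    "K=3, constant weights (dim 11)":       bic(-21420.0, 11, n),
}
best = min(models, key=models.get)
```

With these placeholder values, the modest dimension increase of the affine logistic weight model is more than repaid by its log-likelihood gain, mirroring the selection reported above.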
We have then tested more complex models, with K up to 20, with a penalty obtained by the slope heuristic. The models chosen are quite complex (K = 10 for constant proportions models and K = 7 for affine logistic weight models, the latter being the overall winner). Although they better explain the data from the statistical point of view, those models become hard to interpret from the biological point of view. We think this is due to the overly simple affine models used. Although no conceptual difficulty arises in using more complex function families (or going to the multivariate setting), the curse of dimensionality makes everything more complicated in practice. In particular, initialization becomes harder and harder as the dimension grows and probably requires a more clever treatment than the one proposed here. In the spirit of Cohen and Le Pennec (2013), we are currently working on a first extension: a numerical algorithm for a bivariate piecewise linear logistic weights model applied to hyperspectral image segmentation.

Discussion
We have studied a penalized maximum likelihood estimator for mixtures of Gaussian regressions with logistic weights. Our main contribution is the proof that a penalty proportional, up to a logarithmic factor in the sample size, to the dimension of the model is sufficient to obtain a non asymptotic theoretical control of the estimator's loss. This result is illustrated in the simple univariate case in which both the means and the logistic weights are linear. We study a toy model which exhibits the behavior predicted by our theoretical analysis and propose two simple applications of our methodology. We hope that our contribution helps to popularize those mixtures of Gaussian regressions by giving a theoretical foundation for model selection techniques in this area and by showing some possible interesting uses even for simple models.
Besides some important theoretical issues on the loss used and the tightness of the bounds, the major future challenge is the extension of the numerical scheme to more complex cases than univariate linear models.

Appendix A: A general conditional density model selection theorem
We summarize in this section the main result of Cohen and Le Pennec (2011) that will be our main tool to obtain the previous oracle inequality.
To any model S m , a set of conditional densities, we associate a complexity defined in terms of a specific entropy, the bracketing entropy with respect to the square root of the tensorized squared Hellinger distance d 2⊗n . Recall that a bracket [t − , t + ] is a pair of real functions such that ∀(x, y) ∈ X × Y, t − (x, y) ≤ t + (x, y), and a function s is said to belong to the bracket [t − , t + ] if ∀(x, y) ∈ X × Y, t − (x, y) ≤ s(x, y) ≤ t + (x, y). The bracketing entropy H [.],d (δ, S) of a set S is defined as the logarithm of the minimal number N [.],d (δ, S) of brackets [t − , t + ] covering S such that d(t − , t + ) ≤ δ. The main assumption on models is a property that the bracketing entropy should satisfy: Assumption (H). For every model S m in the collection S, there is a non-decreasing function φ m such that δ ↦ (1/δ) φ m (δ) is non-increasing on (0, +∞) and, for every σ ∈ R + , ∫ 0 σ √(H [.],d ⊗n (δ, S m )) dδ ≤ φ m (σ). Such an integral is often called a Dudley type integral of these bracketing entropies and is commonly used in empirical process theory (Van der Vaart and Wellner, 1996). The complexity of S m is then defined as nσ m 2 , where σ m is the unique root of (1/σ) φ m (σ) = √n σ. For technical reasons, a separability assumption, always satisfied in the setting of this paper, is also required. It is a mild condition, classical in empirical process theory (see for instance Van der Vaart and Wellner (1996)).
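The complexity nσ m 2 can be computed numerically once φ m is known: since φ m (σ)/σ is non-increasing while √n σ is increasing, the equation (1/σ) φ m (σ) = √n σ has a unique root, which bisection finds easily. A sketch, assuming for illustration a φ m of the shape δ √D (B + √(log(1/δ))) with placeholder constants B and D:

```python
import math

def sigma_m(phi, n, lo=1e-12, hi=1e6, iters=200):
    """Solve phi(sigma)/sigma = sqrt(n)*sigma, i.e. phi(sigma) = sqrt(n)*sigma**2.

    phi(sigma)/sigma is non-increasing while sqrt(n)*sigma is increasing,
    so the root is unique; we bisect on g(sigma) = phi(sigma) - sqrt(n)*sigma**2,
    which is positive below the root and negative above it.
    """
    g = lambda s: phi(s) - math.sqrt(n) * s * s
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Placeholder entropy integral bound: phi(d) = d * sqrt(D) * (B + sqrt(log(1/d))),
# an assumed shape for illustration only; B and D are not from the paper.
D, B, n = 20, 2.0, 1000
phi = lambda d: d * math.sqrt(D) * (B + math.sqrt(max(0.0, math.log(1.0 / d))))
s = sigma_m(phi, n)
complexity = n * s * s  # the model complexity n * sigma_m**2
```

With a φ m of this shape the root satisfies nσ m 2 of the order of D (up to logarithmic terms), which is why the penalty can be taken roughly proportional to the model dimension.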

Assumption (Sep)
For every model S m in the collection S, there exists some countable subset S ′ m of S m and a set Y ′ m with λ(Y\Y ′ m ) = 0 such that for every t in S m , there exists some sequence (t k ) k≥1 of elements of S ′ m such that for every x ∈ X and every y ∈ Y ′ m , t k (x, y) converges to t(x, y). The main result of Cohen and Le Pennec (2011) is a condition on the penalty pen(m) which ensures an oracle type inequality: Theorem 2. Assume we observe (X i , Y i ) with unknown conditional density s 0 . Let S = (S m ) m∈M be an at most countable conditional density model collection. Assume Assumptions (H), (Sep) and (K) hold. Let s m be an η-minimizer of the negative log-likelihood in S m . Then for any ρ ∈ (0, 1) and any C 1 > 1, there is a constant κ 0 depending only on ρ and C 1 such that, as soon as the penalty is large enough for every index m ∈ M, the oracle inequality holds. In the next section, we show how to apply this result in our mixture of Gaussian regressions setting and prove that the penalty can be chosen roughly proportional to the intrinsic dimension of the model, and thus of the order of the variance.
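The elided penalty condition combines the model complexity nσ m 2 with the weights x m of Assumption (K); schematically, and up to the precise constants of Cohen and Le Pennec (2011), it has the following shape (this is a sketch of the standard form of such conditions, not a verbatim restatement):

```latex
% Schematic shape of the penalty condition, for every m in M,
% where the x_m are the Kraft-type weights of Assumption (K):
\[
  \operatorname{pen}(m) \;\ge\; \kappa\,\bigl(n\sigma_m^2 + x_m\bigr),
  \qquad \kappa \ge \kappa_0 .
\]
```

The complexity term nσ m 2 accounts for the richness of each individual model, while the weights x m account for the richness of the whole collection.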

Appendix B: Proofs
In Appendix B.1, we give a proof of Theorem 1 relying on several bracketing entropy controls proved in Appendix B.2.

B.1. Proof of Theorem 1
We will show that Assumption (DIM) ensures that, for all δ ∈ (0, √2], the bracketing entropy of each model is controlled with a constant C common to the whole collection. Recall that Assumption (DIM) requires the existence of two constants C W and C Υ that, for every model S m in the collection S, uniformly control the entropies of the weight families W K and of the mean families Υ K ; the constant C then depends only on the constants defining V K and on C W and C Υ . If this holds, Proposition 1 yields the result.
In other words, if we can control the models' bracketing entropies with a uniform constant C, we get a suitable bound on the complexity. This result will be obtained by first decomposing the entropy term between the weights and the Gaussian components. To this end, we use the following distance over conditional densities.
Finally, for all densities s and t over Y depending on x, we set the corresponding sup-norm Hellinger distance. One can then relate the bracketing entropy of P to the entropy of W K . Lemma 4. For all δ ∈ (0, √2], the bracketing entropy of P is controlled by the entropy of a (K − 1)-dimensional family obtained from W K through an identifiability condition. For example, W ′ K = {(0, w 2 − w 1 , . . ., w K − w 1 )|w ∈ W K } can be covered using brackets of null size on the first coordinate, lowering the squared Hellinger distance between the brackets' bounds to a sum of K − 1 terms; this yields the bound on H [.],sup . To tackle the Gaussian regression part, we rely heavily on the following proposition. Proposition 2. Let κ ≥ 17/29 and γ κ = 25(κ − 1/2)/(49(1 + 2κ/5)). For any 0 < δ ≤ √2, one can construct a Hellinger bracket [t − , t + ] such that t − (x, y) ≤ Φ υ(x),Σ (y) ≤ t + (x, y). Under our assumption that there exists C Υ such that, for all δ ∈ (0, √2], the entropy of Υ K is controlled, we deduce the announced bound, whose upper bound on C is independent of the model in the collection precisely because of this hypothesis on C Υ .
We conclude that the announced entropy bound holds. Note that the constant C does not depend on the dimension dim(S m ) of the model, thanks to the hypothesis that C W is common to every model S m in the collection. Using Proposition 1, we thus deduce that Theorem 2 yields, for a collection S = (S m ) m∈M , with M = {(K, W K , Υ K , V K )|K ∈ N \ {0}, W K , Υ K , V K as previously defined} for which Assumption (K) holds, the oracle inequality of Theorem 1 as soon as the penalty condition is satisfied.

B.2. Lemma proofs
For the sake of brevity, some technical proofs are omitted here. They can be found in an extended version.

B.2.1. Bracketing entropy's decomposition
We prove here a slightly more general lemma than Lemma 3. Lemma 5. For all δ in (0, √2], the bracketing entropy of C is bounded by the sum of the bracketing entropies of P and Ψ. The proof mimics that of Lemma 7 in Cohen and Le Pennec (2011). It is possible to obtain such an inequality even if the covariate X is not bounded, using the smaller distance d ⊗n for the entropy with bracketing of C. More precisely: Lemma 6. For all δ in (0, √2], the same decomposition holds with respect to d ⊗n . But bounding such bracketing entropies for P and Ψ then becomes much more challenging.
Proof. First, we exhibit a bracket covering of C.
Since, for all x, all k and all y, π k (x) and ψ k (x, y) are non-negative, we may multiply term-by-term and sum these inequalities over k to obtain a bracket covering of C. Now, we focus on the brackets' sizes, using lemmas from Cohen and Le Pennec (2011) (namely Lemmas 11, 12 and 13). To lighten the notations, the π − k and ψ − k are supposed non-negative for all k. Following their Lemma 12, using only the Cauchy-Schwarz inequality, we bound the sup-norm term; then, using the Cauchy-Schwarz inequality again, their Lemma 11 controls the Hellinger term, and their Lemma 13 controls, for all x, the remaining term. The result follows from the fact that we have exhibited a 5δ bracket covering of C with cardinality N P N Ψ .

B.2.2. Bracketing entropy of weight's families
General case. We prove Lemma 4. Fix δ ∈ (0, √2]. By hypothesis, for any positive ǫ, an ǫ-net N of W K may be exhibited. Let w be an element of W K . There is a z belonging to the ǫ-net N such that max l z l − w l ∞ ≤ ǫ. Since, for all k in {1, . . ., K} and for any x in X , the corresponding logistic weights then remain within a controlled distance, the announced covering follows. Case: proof of Part 1 of Lemma 1. W K is a finite dimensional compact set; thanks to the result in the general case, we get the announced bound.
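The ǫ-net argument above can be made concrete: for a compact box of parameters, a uniform grid is a sup-norm net whose log-cardinality grows linearly with the dimension. A minimal sketch (the radius R, mesh ǫ, and box shape are illustrative assumptions, not the actual parameter sets W K ):

```python
import math
from itertools import product

def sup_norm_net(R, eps, dim):
    """An eps-net of the cube [-R, R]^dim for the sup-norm.

    Grid points are spaced 2*eps apart (for eps = R/m), so every point
    of the cube is within eps, coordinate-wise, of some grid point.
    The cardinality is ceil(R/eps)**dim, hence the metric entropy is
    dim * log(ceil(R/eps)): linear in the dimension, which is what
    drives the dim(S_m) term in the penalty.
    """
    m = math.ceil(R / eps)  # points per axis
    axis = [-R + (2 * j + 1) * R / m for j in range(m)]
    return list(product(axis, repeat=dim))

net = sup_norm_net(R=1.0, eps=0.25, dim=2)
entropy = 2 * math.log(math.ceil(1.0 / 0.25))  # dim * log(ceil(R/eps))
```

Halving ǫ multiplies the net size by 2^dim but only adds dim · log 2 to the entropy, which is why the Dudley-type integral of the square-root entropy stays of order √dim.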

B.2.3. Bracketing entropy of Gaussian families
General case. We rely on a general construction of Gaussian brackets, similar to Lemma 10 in Cohen and Le Pennec (2011). Admitting this proposition, we are led to construct nets over the spaces of the means, the volumes, the eigenvector matrices and the normalized eigenvalue matrices. We consider three cases: the parameter (mean, volume, matrix) is known (⋆ = 0), unknown but common to all classes (⋆ = c), or unknown and possibly different for every class (⋆ = K). For example, [ν K , L 0 , P c , A 0 ] denotes a model in which only the means are free and the eigenvector matrices are assumed to be equal and unknown.
If the means are free (⋆ = K), we construct a grid G ΥK over Υ K , which is compact.

Fig 4 .
Fig 4. Box-plot of the Kullback-Leibler divergence according to the number of mixture components. On each graph, the right-most box-plot shows this Kullback-Leibler divergence for the penalized estimator ŝ K .

Fig 5 .
Fig 5. Histograms of the selected K.
Fig 6. Kullback-Leibler divergence between the true density and the computed density using (X i , Y i ) i≤N with respect to the sample size, represented in a log-log scale. For each graph, we added a free linear least-square regression and one with slope −1 to stress the two different behaviors.

Fig 7 .
Fig 7. Slope heuristic for the ethanol data set.

Fig 8 .
Fig 8. Clustering of NO data set into K classes. The strength of the color of the regression lines corresponds to the mixture proportion.

Fig 9 .
Fig 9. Clustering of NO data set into 4 classes, considering the regression of NO with respect to the equivalence ratio.
Fig 10. Clustering of ChIP-chip data set into K classes.