Finite mixture models and model-based clustering

: Finite mixture models have a long history in statistics, hav- ing been used to model population heterogeneity, generalize distributional assumptions, and lately, for providing a convenient yet formal framework for clustering and classiﬁcation. This paper provides a detailed review into mixture models and model-based clustering. Recent trends as well as open problems in the area are also discussed.


Introduction
Let X 1 , X 2 , . . . , X n be independent, identically distributed p-dimensional observations from a distribution with probability density function where π k represents the kth mixing proportion or the probability that the observation X i belongs to the kth subpopulation with corresponding density f k (x) called the kth mixing or component density. Here, K represents the total number of components with π = (π 1 , π 2 , . . . , π K ) ′ lying in the (K − 1)-dimensional simplex, i.e. 0 ≤ π k ≤ 1 ∀ k = 1, 2, . . . , K and K k=1 π k = 1. This is the most general form of a mixture: usually f k 's are assumed to be of parametric form i.e. f k (x) ≡ f k (x; ϑ k ), where the functional form of f k (·; ·) is completely known, but for the parametrizing vector ϑ k . Thus, (1) can be written in the form We refer to f (x; ϑ) as a finite mixture model density with parameter vector ϑ, where ϑ = (π ′ , ϑ ′ 1 , ϑ ′ 2 , . . . , ϑ ′ K ) ′ . When the number of mixture components K, is also known, only ϑ has to be estimated. When K is not provided, we have to additionally estimate the number of components in the mixture. Finite mixture models made their first recorded appearance in the modern statistical literature in the nineteenth century in a paper by [95] who used it in the context of modeling outliers. A few years later, [98] used a mixture of two univariate Gaussian distributions to analyze a dataset containing ratios of forehead to body lengths for 1,000 crabs, using the method of moments (MOM) to estimate the parameters in the model. More recently, mixtures of Poisson distributions have been used in positron emission tomography to model emissions occurring in a line along each pair of electronically coupled photon-sensitive crystal detectors [114]. Poisson mixtures have also been used for document classification in the field of information retrieval [66]. Other paramedic mixtures include that of the von Mises-Fisher distributions proposed for the analysis of text and gene expressions [10], but by far the most popular mixture model is the one consisting of Gaussian components [35,44,88,98,123,124]. A heavy-tailed alternative to Gaussian mixtures is to use mixtures of t-distributions [87]. We refer to [87,113] for a comprehensive survey on the history and applications of finite mixture models. Other helpful resources on the theory, applications and developments in the field are [21,45,72,73].
Finite mixture models also provide a convenient yet formal setting for modelbased clustering. Clustering had hitherto been a difficult problem with a large number of heuristic methods in the literature. In the finite mixture modeling framework, each group is assumed to have its own distribution and corresponding probability of representation. Thus the kth group has density given by f k (x; ϑ k ) and probability of inclusion in the sample π k . Under this setup, the observations X 1 , X 2 , . . . , X n can be assumed to be a sample from (2). Mixtures of Gaussian densities are again by far the most commonly used representation in model-based clustering. We note that though the framework for the latter has evolved from finite mixture modeling, they have distinct goals: finite mixture modeling is typically associated with inference on the model and its parameters while the goal of model-based clustering is to provide a partition of the data into groups of homogeneous observations. To achieve this, model-based clustering requires an additional step -after model-fitting -that assigns each observation to different groups according to some pre-specified rule. Mixing proportions can be thought of as the prior probability that an observation originated from a specific mixing distribution. We use a Bayes rule here which allocates observations to clusters in accordance with their posterior probabilities. Thus, every observation is assigned to the group having the highest posterior probability that the observation originated from this group. This is equivalent to finding the group index corresponding to the highest value π k f k (x i ; ϑ k ), k = 1, 2, . . . , K for each observation x i , i = 1, 2, . . . , n. If there are multiple posterior probabilities equal to the maximum value and the rule is indecisive, [87] recommend using randomization to break the tie among competing clusters.
In this paper, we provide a comprehensive survey of the most important results and developments in finite mixture modeling, with special reference to model-based clustering. Section 2 provides a description of inferential methods used in the literature, along with its challenges. Methodology for simulating realizations from Gaussian mixture models of desired characteristic for the purposes of evaluating different estimation and clustering methodologies are discussed in Section 3. Section 4 provides an overview of graphical tools for the visual representation and illustration of mixtures. Section 5 provides two recent applications using mixtures of non-Gaussian distributions. Finally, Section 6 describes available software for simulating from and performing inference in mixture models while Section 7 describes a few additional topics and challenges confronting mixture models in a modern setting. The paper concludes with some discussion.

Inference in finite mixture models
Finite mixture models provide for great flexibility in fitting models with many modes, skewness and non-standard distributional characteristics. The price for this flexibility is an increase in the number of parameters with the number of components f k . Here, we survey issues in estimation and model selection with regard to finite mixture models. While there is no restriction in general that all f k , k = 1, 2, . . . , K represent the same parametric distribution, we assume in what follows that the functional form of f k is parametric and the same for all components.

Estimation in finite mixture models
As mentioned earlier, [98] provided a MOM estimator for fitting a two-component univariate Gaussian mixture. In multivariate multi-component settings however, this is rarely practical. Fortunately however, maximum likelihood (ML) estimation is possible when implemented via the expectation-maximization (EM) algorithm and is the method of choice in estimation in finite mixture models. We discuss issues related to ML estimation here.

Likelihood maximization via the EM algorithm
One practical issue related to ML estimation in finite mixture models is troublesome optimization. First, the form of the likelihood function for a sample from (2) is typically complicated and severely multi-modal, rarely lending itself to mathematical treatment and analytical closed-form solutions or numerical optimization. The standard procedure for finding the ML estimate (MLE) in almost all cases is the EM algorithm and is also applicable in complicated multiparameter situations. The EM algorithm [36,86] is, thus, the primary tool in finite mixture models and model-based clustering.
The EM algorithm is implemented by assuming that there are some missing observations, namely the group identifiers, which, in conjunction with the observed data, yield so-called complete data. The corresponding complete likelihood function usually has a much more appealing form and can be readily maximized. The EM algorithm is an iterative procedure consisting of the expectation (E) and the maximization (M) steps. At the E-step of the s-th iteration, the posterior probabilities are calculated, while the M-step maximizes the expected conditional complete loglikelihood, historically denoted as Q-function, with respect to the parameter vector ϑ: Q(ϑ; ϑ (s−1) , x 1 , x 2 , . . . , x n ). Iteration of the E-and M-steps until convergence yields, under fairly mild conditions [23,36,86,125], the ML estimatê ϑ for the original observed data. Of course, the expressions for the updated parameter vector ϑ (s) at the M-step may not necessarily be of closed-form, in which case the Q-function should be maximized numerically.
Multivariate Gaussian mixtures are not just the most popular choice in finite mixture models: they are also among the most complicated cases as pointed out by [28]. The corresponding mixture density function is given by Here, µ k is the mean vector and Σ k the dispersion matrix for the k-th component normal density given by The corresponding Q-function is The E-step consists of updating the posterior probabilities π (s) ik given the current parameter estimates ϑ (s−1) : .
The covariance matrix Σ k can have various structures: therefore, the exact formula for the EM update of Σ k can be different. Throughout this paper, we assume that Σ k is a general unstructured dispersion matrix. Here, the M-step provides the convenient closed-form solutions: While different criteria can be used for terminating EM, some criteria -such as the convergence of ϑ (s) -are too demanding when there is a large number of parameters. The most usual stopping criterion is based on when the relative increase in the likelihood function is no longer appreciable. In this context, [20] introduced the so-called Aitken's rule by using Aitken's acceleration to investigate the limiting value for the sequence of log likelihood values. Specifically, they proposed the stopping criterion |ℓ A is the Aitken accelerated estimate of the limiting value such that We refer to [20,87] for further details.

Challenges in implementation
Unbounded likelihood functions In some situations -for example in the case of Gaussian mixtures with heterogeneous dispersions -the likelihood function may be unbounded. This happens, for instance, because of singular covariance matrices being estimated as a consequence of degraded components that have only one observation, or having several identical or nearly-identical observations. Gaussian mixtures with homogeneous components, however, do not share this problem as covariance matrices are restricted in the parameter space so that it is impossible to obtain degraded components.
There are several methods proposed in the literature to address the possible unboundedness of the likelihood function. [53] suggested introducing an additional constraint on dispersions of univariate normals: i.e. assume σ −2 i σ 2 j ≥ c > 0 for any i and j. The paper showed that the global maximizer of the likelihood function defined on the restricted parameter space exists for any value of c. A generalized version of this condition was proposed for the multivariate framework by [87]. The suggested restriction is |Σ i | −1 |Σ j | ≥ c > 0 for any i and j with the only inconvenience of this approach related to the fact that the constant c has to be pre-specified and it is unclear how to choose a reasonable value. Another possibility includes defining a penalized log likelihood function [28,67] that contains a penalty term preventing the log likelihood from going to infinity by construction. [87] proposed working with unconstrained normal mixtures but also relying on the result of [62] which states that even when the likelihood function is unbounded in the parameter space, there exists a strongly consistent asymptotically efficient local maximizer in the interior of the parameter space. Therefore, it is recommended to search the best local maximum in the unconstrained parameter space and then to check that the obtained solution indeed corresponds to a local maximum and is not on its way to infinity. This check can be difficult due to the presence of so-called spurious local maxima which should be ignored. Spurious solutions represent the parameter vector lying close to the boundary of the parameter space and can be easily identified by the presence of very few points in some components or by detecting some observations lying in a lower-dimensional subspace. Detailed review of these and related issues can be found in [87].
Initialization of the EM Algorithm The EM algorithm is an iterative, strictly hill-climbing procedure whose performance can depend severely on particular starting points because the likelihood function often has numerous local maxima (see, e.g. [87]). Thus, good initialization is crucial for finding ML estimates. Many different initialization procedures have been suggested in the literature (for an overview, see [40] and [78]) but no method uniformly outperforms the others. Here, we list only the most common and better-performing strategies. A model-based hierarchical clustering approach [11] was proposed and incorporated in the R package Mclust [44] designed for Gaussian mixtures. This approach was shown to work well when the components are well-separated, but not as well in other cases [81]. The use of hierarchical clustering in initialization also limits applicability to larger datasets. Another deterministic approach [78], based on finding the most separated local modes, demonstrates good performance for low dimensions but is very time-consuming for severely multi-dimensional datasets. There are also stochastic algorithms for initialization. For instance, the emEM algorithm proposed by [17] consists of two EM stages. The first stage, called the short em, involves starting from several random points and running the EM algorithm until some lax convergence criterion is satisfied. The solution producing the highest log likelihood is chosen as a starter for the second stage, called the long EM , which runs until the usual strict convergence criteria is met. A modification of the emEM algorithm, Rnd-EM, was proposed by [78]. Here, the short em stage is replaced by choosing multiple starting points and evaluating log likelihood at these values without running any EM iterations. The best obtained solution serves as an initializer for the long EM stage. As pointed out by [41], using multiple random starting points needed for finding the global maximum can be time-consuming. Besides, there is also no assurance that the global maximum has been found. In particular, they noted that successful search for the global maximum depends not only on the number of random starts but also on both the complexity of the function being optimized and the procedure for generating the random starting points. To this end, the authors developed a probabilistic measure for assessing the adequacy of the search for the global maximum with a view to guiding decisions as to when the search can be called off. In general, however, no strategy works uniformly well in all cases [81], so the usual practice is to try, as far as practical, different strategies and then to choose the solution with the highest log likelihood value.

Variance estimation
One advantage of ML estimation is the ability to obtain (at the very least, asymptotic) dispersions of the estimated quantities. This is done by inverting the corresponding information matrix which is usually estimated in practice by the observed information matrix I(θ) = − ∂ 2 log L(ϑ) ∂ϑ∂ϑ ′ | ϑ=θ , where L(ϑ) represents the likelihood function. Clearly, the related computations involve taking double derivatives, potentially with respect to vectors and matrices, and may be very complicated for finite mixture models. A way to express the observed information matrix was proposed by [74]. The approach relies on the missing information principle and likelihood calculations for complete data. While providing some flexibility, this method still does not provide an easy way to obtain the observed information matrix. Fortunately, there exists a simple approximation for I in the case of independent, identically distributed observations. The approximation relies on computing the corresponding empirical information [87] whose approximation can be obtained by where ∇q i represents the gradient vector of the expected complete loglikelihood at the i-th observation: ∇q i ≡ ∇q i (ϑ; x i ). Then, I e (θ) can be inverted and employed as an estimated covariance matrix of the MLEθ.
As an example of variance calculations, consider the case of the multivariate Gaussian mixture with unstructured covariance matrices. The corresponding gradient vector ∇q i has form given by (see [80]) and Here, -dimensional matrix such that vec(A) = Gvech(A), where vec is an operator that stacks columns of a matrix A (converting the matrix into the vector consisting of the columns of a matrix) and vech is an operator transforming a p × p symmetric matrix into a p(p+1)

Model selection
In finite mixture models, it is usually assumed that the variables and the functional form of mixing densities is known. In the past, model selection has typically referred to the problem of choosing the optimal number of components K. An aspect of model selection that has been recently investigated is the identification of variables with more discriminating power than others in the inference. We review both these aspects in brief in this section.

Choosing the optimal number of components
There is a vast literature devoted to the issue of choosing K. We refer to [87] who provide a detailed rendering of the different approaches available to address this problem. Here, we briefly summarize existing and recent contributions. Note that most methods devoted to estimating K can broadly be divided into two categories, both based on the log likelihood function. The first group of methods is parsimony-based while the second category relies on testing procedures. The former has been more widely used and discussed in the literature than the latter which has only recently been explored more: we therefore, survey parsimonybased model selection in brief while reviewing testing-based approaches in more detail.
Parsimony-based approaches choose the K minimizing the negative log likelihood function augmented by some penalty function to reflect its complexity. Various information-based criteria such as An Information Criterion (AIC) [3], Bayes Information Criterion (BIC) [109] and their modifications such as quadratic AIC/BIC [105], the Integrated Classification Likelihood criterion (ICL) [15], Normalized Entropy Criterion (NEC) [16], Minimum Information Ratio criterion (MIR) [122], and Laplace-Empirical Criterion (LEC) [87] fall into this category. BIC is among the easily implemented methods that has been repeatedly shown to demonstrate good performance [33,80,107]. [61] showed the consistency of BIC for choosing the correct number of clusters. However, BIC tends to underestimate the number of components when sample sizes are small. On the contrary, another easily implemented criterion, the AIC, typically overestimates K substantially. While more difficult to implement, [80] show that the ICL approach performs very well in a large range of cases.
In general, the criteria-based methods are easily implemented, but share one shortcoming in that it is difficult to obtain a meaningful comparison of model fit from one situation to another. For instance, [60] view improvements in BIC of less than 2 as negligible, while differences greater than 10 are often regarded as constituting strong evidence. On other words, only reductions in the BIC of more than ten should indicate a clear improvement in the model associated with increasing number of components. It is unclear however, how this value should be calibrated in different situations with regard to n and p. This is where testingbased approaches have greater appeal, because it specifies evidence in favor of a complex model against a simpler model in terms of the universally understood p-value. Most testing-based approaches use a likelihood ratio test (LRT) or some derivation thereof. However, direct application of LRT is not possible as the parameter vector ϑ lies on the boundary of the parameter space under the null hypothesis. Thus, the regularity conditions of [32] are violated and the usual asymptotic null distribution of the LRT statistic is not valid. Some special results are available [47,52], but they mostly concern comparing one-versus two-component univariate Gaussian models. To avoid the boundary problem, [2] suggest moving the parameter vector to the interior of the parameter space by postulating a prior probability distribution on the mixing proportions. The lack of theoretical null distributions of test statistics has also stimulated the development of bootstrap-based methods [1,85]. Indeed, for the case of Gaussian mixtures with unequal variances, [39] recommended bootstrapping the LRT statistic over all other methods to avoid problems with regularity conditions. This approach was also advocated by [87] as a necessary tool for assessing pvalues (page 184). However, these methods are time-consuming to implement.
We have recently proposed and investigated a likelihood-based testing procedure [80]. To keep derivations of the null distribution of the LRT statistic tractable, we introduced an additional assumption stating that a fit of the (simpler) model under the null hypothesis H 0 implies that the alternative (and more complex) model under H a also fits the data adequately. Under H a however, only the alternative model provides a good fit. An approximate null-distribution for the LRT statistic can then be developed based on Taylor series expansion. Besides keeping derivations tractable, the additional assumption stated above also addresses concerns expressed by authors such as [105] that in the spirit of [22], every restricted model is flawed and therefore will always be rejected for some n regardless of the true model.
The testing approach provides the possibility of obtaining significance of any K * -component model vis-a-vis any K-component model (K * > K). This can be displayed via a quantitation map which is a display introduced by [80] to quantitate support for any complex model relative to a simpler model. Figure 1 represents a contour plot and the quantitation map for the two-dimensional Ruspini dataset [108]. The rows in the quantitation map represent the number of components in a simpler model while the columns stand for the number of clusters under H a . Thus, every cell produced by the intersection of a particular combination of rows and columns represents a test. The color of every cell illustrates the p-value of that particular test. The quantitation map therefore is a comprehensive tool visualizing the nature of a dataset and helping to decide on the best number of mixture components. Not surprisingly for such well-separated clusters, the quantitation map clearly suggests choosing a fourcomponent solution. We refer to [80] for further details, including the q-value quantitation map to control for the proportion of expected false discoveries, and examples of performance on simulation and standard classification datasets.

Variable selection
In many multivariate datasets, some of the variables are highly correlated with the others or just do not carry much additional information about clustering. The performance of clustering algorithms can actually be severely affected then by the presence of such variables that only serve to increase dimensionality and add redundant information. The elimination of such variables can potentially improve both estimation and clustering performance. This is an aspect of model selection that has lately received some attention in the literature. A greedy variable selection algorithm based on Bayes factors was introduced by [103]. The idea of the algorithm is to divide all variables into three groups: the first group, X (1) , contains already selected variables, the second group, X (2) , consists of variables currently under consideration for inclusion into the first group, and the last group, X (3) , consists of remaining variables that are not included or considered yet. Then, they define two competing models where z is the unobserved class information for each observation. Model M 1 implies that X (2) does not carry any clustering information in addition to that already contained in X (1) . The model M 2 , on the contrary, assumes that X (2) introduces some new information about cluster memberships after X (1) has been observed. The models M 1 and M 2 are compared using the Bayes factor, B 12 , in which potentially high-dimensional terms p(X (3) |X (2) , X (1) ) cancel providing which is estimated via BIC. The authors provide a greedy algorithm which simultaneously selects the model and K. The approach is easily implemented and shown to perform well on simulated datasets with correlated redundant variables, variables with no clustering information and on the Iris [4], crabs [26] and textures [25] datasets. [103] did not allow irrelevant variables to be independent of clustering variables, potentially leading to erroneous model choices. This shortcoming was addressed by [82]. [96] proposed an approach based on the L 1 -norm penalty for the loglikelihood function in the Gaussian mixture. They suggested using the regularized loglikelihood function penalized by the term −λ K k=1 p j=1 |µ kj |, where µ kj is the j-th coordinate of the k-th mean vector. This penalty is able to shrink some fitted means toward 0. Then, variables with all µ kj , k = 1, 2, . . . , K equal to zero are eliminated. This approach is limited by the assumption of a common diagonal covariance matrix for all components. [126] extended this approach by including a new regularization scheme that groups together multiple parameters of the same variable across clusters. Another modification of this method, suggested by [118], applies different penalty functions: for instance, the adaptive L ∞ -norm and adaptive hierarchical penalties. The authors claim that the results are better for the proposed penalties but the same assumptions about covariance matrices are required to be made. While necessary for analyzing small datasets with large numbers of variables, these limitations might be very restrictive in general. Of course, as mentioned at the beginning, this is an area of model selection in finite mixture models which has only lately received attention and is under active development.
In this section therefore, we have discussed several issues in making inferences in finite mixture models. We now discuss a scheme to simulate finite mixture model distribution with varying complexity, with a view to evaluating the performance of an algorithm under different settings.

Simulating mixture distributions for evaluating clustering algorithms
There are several clustering methods [127], but none of them uniformly outperforms the other in all cases. Thus, it is important to have tools to calibrate and characterize different algorithms. Therefore, having a procedure capable of simulating data with different levels of clustering complexity can be very helpful. This can allow for a comprehensive investigation of an algorithm's properties with regard to different situations. Several approaches have been suggested in the literature (see [111] for a detailed review). Here, we give just a brief summary, noting that almost all methods in the literature only provide simulation methods for multivariate Gaussian mixtures with different clustering complexities.
One popular algorithm proposes to generate well-separated clusters from truncated multivariate Gaussian distributions [92]. However, due to the truncation step in the algorithm, the method is incapable of simulating clusters with wide ranges of separation [111] that can be misleading [5]. Many other proposed methods [19,49,63,84,100] share similar shortcomings. An attempt to control the level of overlap between any two components using intra-class correlations was made by [5] who however admitted that it still lacked the ability to provide a "perceptually meaningful description" of overlap (see page 583). The notion of c-separation was introduced by [34] in the context of learning Gaussian mixtures. Here, two p-variate Gaussian distributions N p (µ 1 , where λ max (Σ) represents the largest eigenvalue of Σ. Thus, the level of cseparation depends on Euclidean distance between clusters, dimensionality, and the value of the largest eigenvalue from both dispersion matrices. A clear drawback here is that the orientation of the clusters is not taken into consideration and considering only the highest eigenvalue can lead to widely varying mixtures of different clustering difficulty for the same values of c [78]. Also, as very helpfully pointed to us by a reviewer, the c-separation criterion performs reasonably well for small dimensions but is too wide for larger dimensions. In such cases, the reviewer additionally pointed out that a stronger condition based on employing the principle of least distances between the means producing separate modes can be developed: such development might be possible using the geometry of high-dimensional mixtures described in [45,104]. Nevertheless, this method has been used in evaluating algorithms by [68,115,116]. A slight modification of the above is the exact-c-separation of [78] who required the equality in the above expression to hold for at least one pair of clusters. OCLUS, an algorithm capable of simulating clusters with known overlaps between pairs of clusters, pairwise overlaps, was introduced by [111]. Clusters in OCLUS are assumed to be marginally independent and no group is allowed to interact with more than two other clusters. This limits the algorithm because of its inability to simulate other types of cluster configurations.
Another recent development is the R package clusterGeneration [101] which is based on the separation index of [102]. The index is defined as the ratio of the difference between the biggest lower and smallest upper quantiles over the difference in biggest upper and smallest lower quantiles. The index attains values close to 1 for well-separated clusters and can approach −1 for clusters with high overlaps. In the original paper, the authors used the 2.5% and 97.5% quantiles. Defined in an univariate framework, this index can not be readily extended to the multivariate case, so the authors suggest finding and using the one-dimensional projection that produces the highest value of the separation index. Of course, relying on a single projection may be inadequate to summarize overlap between any two components in the mixture. Therefore, any statement made on the degree of cluster separation in multivariate case is at best partial and may even lead to erroneous conclusions. An approach that allows for simulating Gaussian finite mixture models according to pre-specified levels of average and maximum pairwise overlaps was proposed by [81]. The overlap between two mixing components is defined there as the sum of both misclassification probabilities, ω i|j and ω j|i , where and ω i|j is defined similarly. The average (ω) and maximum (ω) levels of overlap serve as surrogate measures of clustering complexity. The R package MixSim is available at CRAN and can be also employed for assessing the degree of clustering difficulty for existing classification datasets. Figure 2 shows data simulated under two levels of mixture complexity and illustrates some of the capabilities of MixSim in providing mixtures with different degrees of separation.

Graphical representation and visualization
Good visualization in cluster analysis can often be very effective and helpful for understanding the nature of analyzed datasets. Biplots [46], scatter plots and contour plots are widely used to illustrate datasets and mixture models. Contour plots such as in Figures 1a and 2 can present two-dimensional data by drawing level sets of the bivariate density through corresponding shadings or contours. For multivariate datasets, biplots ( Figure 3) representing a scatter plot of the first two principal components along with variable contributions are useful. Additionally, observations on the biplot can be plotted using color and/or character (see Figure 3b) according to their group memberships. by a variable selection procedure. As we can see, there is very clear separation in three clusters provided along the first and the second principal components. The parallel distribution plot ( Figure 4) recently developed by [81] allows for visualizing multidimensional mixtures with Gaussian components. The dispersion matrix for a Gaussian mixture with individual component covariance matrices of a general form is given by Let Γ be the matrix of orthonormal eigenvectors corresponding to Σ. Applying the rotation Γ ′ to the mixture yields the rotated mixture of (rotated) Gaussian components with corresponding mean vectors Γ ′ µ k and dispersion matrices Γ ′ Σ k Γ. Then, borrowing ideas from the parallel coordinate plots of [59,121], we plot the individual rotated means against the index of the principal component. Rotated variances are used to obtain quantiles at each principle component. Connecting these quantiles yields polygons that are shaded with varying opacity according to the probability contained between the corresponding quantiles. For mixtures with well-separated components (Figure 4a), the between-cluster variability is substantial even at higher principal components, while for poorlyseparated mixtures (Figure 4b), within-cluster variability swamps the betweencluster variability fairly soon. The corresponding procedure is incorporated in the R package MixSim (function pdplot ).

Some recent applications involving non-Gaussian mixtures
As mentioned earlier, most of the work in finite mixture modeling and modelbased clustering involves multivariate Gaussian mixtures. Recently, however, there has been some interest in mixtures of non-Gaussian distributions. In this section, we detail two applications using such distributions.

Text and time-course gene expression datasets
Cluster analysis of text and gene expression datasets is similar to that of directional data. For text data, it is common to use cosine similarity as the metric for grouping similar observations, while for time-course gene expression data, it is of interest to group similar genes according to correlation. In both cases, datasets are pre-processed to lie on the L 2 -normalized subspace, i.e. they lie on the surface of the unit sphere. Note however, that the pre-processed gene expression datasets are also orthogonal to the unit vector. A popular choice for directional distributions is the p-variate von Mises-Fisher distribution which is given by the probability density function f (x; κ, µ) = C p (κ)e κµ ′ x , where κ ≥ 0 and µ is the mean vector such that µ = 1. The support of the density is the surface of the unit sphere, x = 1. The normalizing constant C p (κ) is given by where I d (κ) represents the modified Bessel function of the first kind and order d. Then the model for the finite mixture of von Mises distributions is given by where ϑ = (π 1 , π 2 , . . . , π K , κ 1 , κ 2 , . . . , κ K , µ ′ 1 , µ ′ 2 , . . . , µ ′ K ) ′ . The E-step of the EM algorithm is conceptually the same as for any other mixture and can be obtained by (3) while the solution for the M-step can be readily derived [10]. Thus, As we can see, the first two expressions can be provided in closed form but the expression for κ (s) is implicit and numerical methods are needed for estimating κ (s) . Some heuristic methods for this are discussed in [10].
A different approach to this application was provided by [37] who contended that components may be correlated and have different variances in different coordinates. They proposed to use mixtures of transformed Gaussians. In particular, they proposed a mixture of stereographic projections of multivariate Gaussian distributions. Various shapes, orientations and skewness of clusters are attainable in this framework. The authors provide a general form of the density for the inverse stereographic projection which can be conceptually used for constructing finite mixture models based on such projections. The implementation of the EM algorithm is then straightforward, with the E-step having a similar form as before, but the M-step cannot yield closed-form expressions and heuristic search methods have to be employed. The authors also consider a possibility of addition of a noise component to deal with noisy data. Computer code is available: note also that this approach is very computer-intensive and computationally impractical to apply on text or larger gene expression datasets. Further, the suggested method was evaluated on some simulation datasets. Surprisingly, AIC was seen to perform the best in estimating the number of components. We note that the reported experiments were only on estimating the number of clusters: evaluations on clustering performance were not reported.
As mentioned earlier, the pre-processed time-course gene expression datasets are standardized to not only lie on the unit sphere but also to be orthogonal to the unit vector. This constraint is not included in either of the above formulations: it would be interesting to see how inferences change under a more accurate model.

Magnitude magnetic resonance imaging data
Datasets acquired in Magnetic Resonance Imaging (MRI) or Magnetic Resonance Angiography (MRA) are typically magnitudes of complex observations, whose real and imaginary parts are both independent univariate Gaussiandistributed realizations [120]. Thus, using Gaussian mixtures to segment these datasets is not very appropriate so [31] and [79] use a mixture of Rice distributions to characterize the MR signal at each voxel. The distribution is given by where I 0 (·) represents the modified Bessel function of the first kind of zeroth order. In the application to MR images, µ is the underlying true magnitude MR signal and σ is the noise parameter. The sample represents the observed magnitude data from n voxels with an individual observation following the mixture of Ricians given by the density function where π k represents the proportion of voxels with signal µ k and the noise parameter σ is assumed to be common for all k = 1, 2, . . . , K. For the EM algorithm, we update the posterior probabilities according to (3). At the M-step, we can obtain a closed-form expression only for the mixing proportions π (s) ik . The other equations need to be solved numerically: Refer to [79] for details on computational implementation, EM initialization, parameter and variance estimation and model selection.

Finite mixtures in biological studies and surveys
Inferring the genetic structure of populations by clustering alleles observed at multiple loci, using mixtures of products of multinomial distributions is an important application for which [27] developed a software package called FAS-TRUCT. A similar scenario arises when finding population groups from respondents to multiple-choice questions in surveys in order to tailor and market products and surveys [80]. We use the latter application to illustrate and develop the model here.
Suppose there are p questions with d j , j = 1, 2, . . . , p options for the jth question. Thus, the respondent's choice for the jth question can be modeled by a multinomial distribution f (x jr ; ρ jr | r = 1, 2, . . . , d j ) = n j dj r=1 ρ xjr jr where ρ jr is the probability that the rth option has been selected while x jr represents the actual choice made by a respondent. Note that n j = dj r=1 x jr and dj r=1 ρ jr = 1. If a respondent can choose only one answer to each question, (4) reduces to the following: Assuming independence of the multinomial random variables for the responses to each of the different questions, we are led to a setup whereby the p responses of each respondent is an observation from the finite product-of-multinomials mixture model g(x jr ; π k , ρ kjr | k = 1, . . . , K, j = 1, . . . , p, r = 1, . . . , d j ) = x ijr log ρ kjr , It is easy to see that the E-step has the form π (s) while the M-step yields updated estimates Once again, the E-and M-steps alternate until convergence. [80] use the above model to analyze the voting preferences of 100 senators in the 109th United States Congress, based on 441 votes cast. On each of the bills under consideration, senators either voted for or against the motion or did not record their votes. Thus the result was a mixture model of the products of 386 trinomial and 55 binomial distributions -the last arise from those bills for which every senator recorded an up-or-down vote. They further combined their methodology on assessing significance to come up with a three-component solution: this solution had one group of 33 Republican senators, another comprising 31 Democratic senators (including one independent senator in the Democratic caucus) and a third group consisting of the senators who are either regarded to be moderate or did not vote on many occasions. Their results matched well general opinions on the voting preferences of these senators.

Available software
Several packages are available for model-based clustering and related tasks.
Based on their applications, these packages can be divided into two groups. The first group consists of software products devoted to simulating data and finite mixture models according to some pre-specified characteristics. These packages can then be used for the evaluation of clustering algorithms or assessing clustering difficulty of existing datasets. The second group of algorithms fits data to specified models, estimates classification vectors and chooses the optimal number of components. Detailed descriptions of several older programs can be found in [54]; we provide short descriptions of the most important or recent clustering packages here.

Simulation and evaluation
• OCLUS [111] is a MatLab function allowing for the generation of overlapping clusters from different multivariate distributions (see Section 3). The authors state that the procedure is available upon email request. • clusterGeneration (formerly GenClus) [101] is an R package based on the separation index of [102] (see Section 3 for details). • MixSim [81] is an R package that manipulates misclassification probabilities of Gaussian components in order to attain the pre-specified levels of average and maximum overlap. A wide range of random multi-dimensional and multi-component mixtures can be simulated. The package can be also used for assessing misclassification probabilities and overlap of existing classification datasets. It also includes graphical capabilities for plotting parallel distribution plots (using the function pdplot, see Sections 3 and 4 for more information). • CARP [90] is an open source C package with a command-line interface available from www.mloss.org with the ability to simulate generate finite Gaussian mixture model distributions, using the same engine as MixSim, but additionally can provide an evaluation of one or more clustering algorithms.

Inference and clustering
• Mclust [44] is an R package developed in FORTRAN for multivariate Gaussian mixture models. It relies on the EM algorithm for density estimation and BIC for model selection. Model-based hierarchical clustering is also implemented in Mclust and is used to initialize the EM algorithm. Various parametrizations of the dispersion matrix Σ k are available. Its flexibility, availability and relatively frequent good performance make this package one of the most popular. • EMMIX [88] is another popular piece of software [88] developed in FORTRAN.
It is designed for fitting multivariate Gaussian and t-component mixtures ( [87]). Three initialization strategies are implemented: random starting points, k-means-based starts and hierarchical-clustering-based starts. The optimal number of mixture components is selected using a resampling test.

• MIXMOD [18] is a package written in C++ and interfaced with Matlab
and Scilab. The package can be employed for the analysis of data using multivariate Gaussian and Multinomial mixture models. Several modifications of the EM algorithm and different criteria for model selection are included in this package.

Some additional topics and challenges
There are several challenges that have at best been only partially resolved in the context of finite mixture models and model-based clustering. In this section, we provide an overview of some of these challenges and outline possible approaches to addressing them. While our discussion here is with regard to model-based clustering, we note that many of the challenges also arise with distribution-free clustering methods.

Hierarchical model-based clustering and cluster merging
A fundamental issue with finite mixture modeling is that finding the best fitting mixture is not necessarily equivalent to finding the optimal partition for a given dataset. This is not necessarily a problem when all components are wellseparated, because in that case, every component in a fitted mixture model can be associated with one cluster and this relationship yields a one-to-one correspondence. However, it may well be that from a clustering point of view, one group is better modeled using several mixture components rather than one, in which case, a one-to-one correspondence between each component and a cluster may be too restrictive. For instance, several Gaussian components are often needed to model multimodal clusters or unimodal but skewed clusters. If clusters cannot be adequately fitted using a single component, it is unclear how well a finite mixture model can serve for providing reasonable clustering inference based on the correspondence between clusters and single mixture components. This discussion emphasizes that clusters can consist of multiple mixture components. Thus, the obtained finite mixture model solution is converted into a clustered partition of the dataset by merging components. A suggested approach to implementing cluster merging is model-based hierarchical clustering, which borrows ideas from its distance-based counterpart. There are several important and challenging issues that arise. For one, it has to be decided how to relate each cluster with (perhaps more than one) components of a fitted mixture model. For instance, how distant should a component (or a group of components) be from the others to be considered a cluster distinct from another, the latter formed by another component or groups of components? Then, there is the related issue of finding the optimal number of groups and the number of mixture model components providing the best fit to the dataset. [30] introduced the notion of a mutual cluster defined as a group of points sufficiently close to each other but distant from the others and which have never been separated. The authors investigated mutual clusters specifically for the case of hierarchical clustering, however the concept can be adopted for the case of model-based hierarchical clustering. [50] considered using hierarchical clustering specifically in the context of mixture modeling. They proposed to simplify a Gaussian mixture model replacing each group of components by a single Gaussian component. This provides us with a hierarchical version of finite mixture models as every observation in the dataset is stipulated to be in the same original cluster at a coarser stage.
[56] also discusses hierarchical merging methods using concepts of unimodality and misclassification. In this context, the notion of unimodality refers to finding a partitioning of mixture components such that all clusters produced by the partition have only one mode. At the same time, however, any merging of distinguishable mixture components immediately leads to multimodal clusters. Thus, for finding a clustered partition, it is sufficient to consider all pairs of obtained individual components as two-component mixtures and investigate each pair for unimodality. If some pair of components is deemed to be unimodal, the two components are merged. The procedure continues until no further pairwise merging produces a unimodal Gaussian component. In a k-component Gaussian mixture, finding these reduced modes is achieved by analyzing the values of the density lying on the so-called ridgeline surface [104] that are given for pair of components with distributions N p (µ i , Σ i ) and N p (µ j , Σ j ), by for α ∈ (0, 1). Interestingly, the ridgeline does not depend on the mixing proportions πs. [56] also discusses some limitations of the ridgeline approach described in [104] and remarks that their result solves the modality merging problem only approximately. He also investigates several other procedures, such as a ridgeline ratio method which also relies on ridgeline modality analysis with respect to Gaussian mixtures. Other approaches to merging mixture components for clustering have also been suggested. [14] discussed a simple but attractive approach for choosing the number of clusters based on merging. At the first stage, they suggest finding the number of Gaussian components using BIC, which is a consistent and efficient criterion for choosing the number of mixture components under Gaussian distributional assumptions for each of them [61]. In cases when specification of Gaussian-distributed components is not supported by the data, a model with more Gaussian components (than clusters) is typically proposed by BIC to account for the deviation from multinormality. The authors suggest postprocessing the results using ICL in a second stage of their procedure to eliminate unnecessary components, and merging them hierarchically. In doing so, they use the fact that the ICL is a version of BIC penalized by the mean entropy. Thus, the resulting number of clusters proposed after the ICL step is implemented is always smaller than or equal to that proposed by BIC. A similar but more sophisticated idea was proposed by [64], who suggested using multi-layer mixture models where the individual clusters themselves are assumed to be well-modeled from a Gaussian mixture distribution. This is an appealing feature in modelbased clustering because, as mentioned earlier, clusters can often be modeled substantially better by representing some of them individually using a mixture rather than a single distribution. The paper provides a detailed investigation of multi-layer mixture models, with particular emphasis on choosing the optimal number of components within each cluster, but there are several unresolved issues. For example, finding the total number of clusters in the dataset is still not completely resolved. There is also room for developing and studying alternative methods for constructing clusters.

Nonparametric approaches to mixture modeling and model-based clustering
A related approach to addressing the challenges posed by the lack of complete correspondence between clustering and finite mixture modeling is to use nonparametric mixture modeling [65]. In this setup, the observations are from a mixture of densities, i.e., f (x) = K k=1 π k f k (x). The basic idea of the authors is to associate every point not with a particular mixture component (using a Bayes rule) but rather with a local maximum, or mode. In fact, clustering via mode identification is a reasonable approach as it produces geometrically meaningful results regardless of the structure of the data. The method relies on specifying kernel density functions for f k s and estimating each of them. Modal clustering is applied to the dataset upon mixture density estimation. The exact mechanism is an EM-type nonparametric algorithm called Modal EM introduced by [65] that allows finding "hilltops" of the given density. The algorithm consists of two steps that have to be repeated iteratively. In the first step, updated probabilities p k for k = 1, 2, . . . , K mixture components at the current modal estimate x (s) have to be computed: f (x (s) ) . The second step maximizes the target function K k=1 p k log f k (x) with respect to x, yielding the updated x (s+1) . Of course, it is assumed that the target function has a unique maximum.
The authors argue that it is more appropriate to associate clusters with bumps of the density and this is the cornerstone of the proposed methodology. The suggested algorithm is then extended to hierarchical clustering. The authors also introduce the Ridgeline EM algorithm devoted to finding the ridge-line between the modes of two clusters. The ridgeline for any two clusters with densities f 1 (x) and f 2 (x) is given by When f 1 (x) and f 2 (x) are Gaussian densities, we obtain the explicit solution given by (5) [104]. A different approach was provided by [93] who proposed a visualizing tool called a mode tree. This tree is constructed so that the locations of the modes are related to the bandwidths at which the density estimates are obtained. The use of the tool is illustrated by [93] to adaptively investigate multimodality in datasets. Another nonparametric development was introduced by [112] who considered graph-based estimation of a cluster tree called generalized single linkage clustering. The cluster tree is a fundamental concept and the main target of nonparametric cluster analysis. The authors mention that cluster tree, describing the modal structure of the density, can be computed exactly or approximated. They also point out that some modes in the density estimate can be spurious and, perhaps, should be disregarded. For this purpose, it is also suggested to prune the cluster tree by using an excess mass threshold.

Semi-supervised clustering
In many situations, there is interest in grouping a sample of observations, but there is some, perhaps uncertain, information available on the labels or classes of some observations. This is the topic of semi-supervised clustering which arises in a number of modern fields such as bioinformatics [117] or speech recognition [57], and consequently has attracted some recent interest. The development of "pairwise relations" [13,75,110] concerns the situation when some observations are known to belong to the same group (positive relation) or different groups (negative relation). Other approaches have involved adapting K-means [13,12] or the EM algorithm for finite mixture models [13,58,110]. We focus here on adaptations to the EM algorithm.
The EM algorithm is easily derived for the case of semi-supervised clustering: the M-step is as before. However, there is a change in the E-step in that the posterior probabilities for labeled data do not need to be updated. In fact, the posterior probability vector for the ith observation with known labels consists of K − 1 zeros and the unity in the position corresponding to the cluster from which the ith observation has been originated. The other probabilities corresponding to unlabeled data are computed as usual. In all the references for model-based semi-supervised clustering listed above, it is assumed that the classes represented in the labeled data are all the classes in the entire dataset so that K known and model selection is not an issue. Initialization is also not an issue for the group means, variances and frequencies of the labeled data can be used as starting values for the EM algorithm. However several challenges arise when the assumption of known K, or representation of all classes in the labeled dataset is not a priori tenable. In the following discussion, we assume that K 0 (out of K) classes are represented in the labeled data.
In the case of initialization, one option is to ignore the labeled information and to start the algorithm using the methods (for unsupervised clustering) discussed in Section 2.1.2. However, we can potentially improve performance by considering both labeled and unlabeled data. One intuitive suggestion is to use labeled observations for obtaining initial cluster centers. This is especially important for initialization strategies involving starting the EM algorithm at random points (emEM [17] and RndEM [78]). Initializing every cluster that has labeled data with the average of all observations known to belong to this particular cluster was suggested by [29]. The other components, having only unlabeled observations, are initialized with random starting points as in the case of unsupervised clustering. If there are K clusters and we take K starting points for running the EM algorithm, there is roughly a K! K K chance on the average to start with an initialization having one starting point in each cluster. Of course, the importance for a dataset to be well-initialized depends on specific features of the particular dataset; however, it might be crucial in some cases. If K 0 clusters have data with known labels, we need to initialize only K − K 0 clusters. This increases the chance of obtaining one point in each cluster to approximately (K−K0)! K K−K 0 . A comprehensive simulation study was provided by [29] for different numbers of clusters with labeled and unlabeled observations as well as for various levels of proportions for labeled data. Thus, labeled observations can substantially improve the performance of the EM algorithm by providing a better initialization.
For model selection, [97] extended the penalized loglikelihood-based variable selection procedure of [96] to the context of model-based semi-supervised clustering for gene expression data, but their approach was limited by the assumption of uncorrelated variables. In the general case, [29] have advocated using the quantitation map for choosing the model at desired significance and have shown excellent performance on a range of simulation and classification datasets. We refer to [29] for further details. Finally, we close our discussion here, that we have assumed that the label information is complete and certain: this may not be so: for example, the label information of an observation may be ambiguous in that it may be known to come from a specific subset of clusters, but the exact classification may be unknown. We note that the model-based framework can be easily extended here also.

Constrained clustering
Most cases considered in the clustering literature address the issue of grouping of each observation without any constraints. However, this may not always be the case. Consider, for instance, the example of two-dimensional gel electrophoresis data [94] where there are a given number of proteins and an equivalent number of protein spots (observations). In example, interest centers on assigning each observed spot to the protein. This brings in a constraint that no two spots can be assigned to the same protein. Complications then arise in the estimation of the posterior probability of the E-step where the usual formula (3) is not applicable any more. To see this, we let i = 1, 2, . . . , n represent the number of gel replication and j = 1, 2, . . . , p stand for the protein index. Also let X ij be the jth observation from the ith gel. Here, X ij is trivariate with observations on isoelectric point, molecular weight and intensity. The log likelihood function for the complete data is given by where X and Z represent n p-dimensional random variables X i and n classification vectors Z i correspondingly, and ℓ = (ℓ 1 , ℓ 2 , . . . , ℓ p ) ′ ∈ ρ(p) denotes the set of all permutations of 1, 2, . . . , p. Note that the vector Z i represents the entire classification vector for the ith gel. Then, the posterior probabilities can be obtained by evaluating which is obtained by calculating over all permutations over ρ(p). The most intuitive choice of the mixing distribution f is the multivariate Gaussian distribution. One critical restriction of this procedure is related to the fact that the posterior probabilities can be obtained this way only if the number p is not very large. Otherwise, enumerating all permutations of p elements is computationally infeasible and Markov Chain Monte Carlo (MCMC) methods [48,106] are the only recourse. These are themselves not easy to implement: [91] has borrowed ideas from the literature on conditional point process [9]. It is then possible to construct a random walk process. Thus, incorporating MCMC schemes into the E-step of the EM algorithm allows approximating the posterior probabilities to proceed with the M-step in a usual fashion. We have addressed here a very specific problem, but there are other applications where similar issues arise and need to be addressed.

Massive datasets
Automated collection methods have meant a surfeit of data in many cases. This has meant that available computational resources are not always able to handle such datasets. A simple-minded approach which involves clustering a sample of the dataset and then classifying the rest of the observations does not make use of the available riches inherent in a large dataset and may potentially miss groups with fewer representations [77] unless the number of groups is known in advance. For the latter situation, [38] used a sample to obtain an initial model, then they fit the entire dataset to get the classification vector. The well-classified observations are retained and the procedure repeated again until all observations become well-classified. [24] developed a method in which they divide all observations into three different categories: certain, uncertain and compressed observations. The latter implies that the observations are known to belong to the same group.
For the case with unknown K, [77] provided a multi-staged scheme which first clusters an initial sample. Observations in the dataset that are not in the sample but can reasonably be classified into any of these identified groups are filtered out using a likelihood ratio test. The remainder are again sampled, clustered and the procedure iterated until all cases have either been clustered or classified. Final estimates of the class probabilities and model parameters are obtained from these multi-staged groupings. Although seen to work well in a number of cases, the likelihood ratio test used to identify representativeness of the identified clusters at each stage used a homogeneous dispersion assumption. This limits applicability of the approach. Another iterative model-based approach in the same spirit was developed by [43]. Their approach first fits a sub-sample of observations with some underfitted model. Then observations in the dataset having the lowest 1% mixture density values are identified. These points potentially represent a new potential component that is poorly fit by the current mixture and model therefore a new round of EM is started with these observations in one group and representatives from the other 99% observations classified according to the (underfitted) model. If the new fitted model shows an improvement in BIC, it is preferred in place of the underfit model and a new group of observations with 1% lowest mixture density values is identified. The algorithm then proceeds, terminating only when there is no substantial BIC improvement associated with introducing an additional component. [43] illustrated performance of their algorithm on one very well-separated simulated example with fourteen clusters but our experience shows substantially poor performance with overlapping components. Indeed, we have noticed that if two clusters are located very close to each other and one of them is picked up by the underfitting model, there is a very small chance that the neighboring cluster is detected. Instead, the procedure prefers selecting points from the fringes of the selected components resulting in spurious components: consequently the additionally identified clusters do not improve BIC and the procedure terminates. Thus, we consider model-based clustering of massive datasets to be a persistent challenge.

Diagnostics
Influential and outlying observations impact performance of many model-based clustering algorithms. Identifying them has been a long-standing issue in the literature, but has received scant attention. In general, there are no approaches that we know of to identify influential observations. For the case of identifying outliers, [87] describe two distinct approaches in the literature. The first method [89] suggests creating what they called an atypicality measure that can be applied to a new or a suspicious observation with respect to all clusters to see if the observation is really atypical for all groups. The atypicality measure is computed after assigning observations to the estimated components and then using a measure such as the Mahalanobis distance. If this measure is large, we have evidence to conclude that the analyzed point is an outlier. [119] however pointed out that this approach does not provide satisfactory control over the overall significance level. Instead, they [119] proposed using a modified likelihood ratio test comparing two models. The first model is constructed with all n observations included into consideration while the second model concerns only n − 1 observations that complement the tested observation. Therefore, the modified likelihood ratio test statistic represents the ratio of the maximized likelihood function with all n observations over the maximized likelihood function with n − 1 observations included. Parametric or nonparametric bootstrap is recommended for assessing the null distribution of the obtained test statistic. The authors also suggest a modification of bootstrap which is less computationally demanding. The idea is to resample only the last, nth, observation every time. [119] demonstrated that for large datasets this approach is reasonable. We note that [119] developed their methodology for when the number of components in the finite mixture model is known. Further, they demonstrated their case in the context of semi-supervised clustering: they mention that complications such as initialization may arise when there is no labeled data in the setup. Thus, the issue of identifying outliers is at best partially resolved.

Robust and skewed mixture models
While identifying outliers as described in the previous section is important, it may sometimes be important to develop mixture models that are robust to outliers. Indeed, [87] remark that while it is usually not necessary to have precise estimates of covariance matrices in Gaussian mixtures, the presence of outliers can dramatically affect all estimates. Therefore, [99] proposed using a mixture of multivariate t-distributions instead of multivariate Gaussians. The idea is that since a t distribution has heavier tails than does a normal distribution, using t-components would have the potential for better modeling data with outlying observations. Another recent development related to modeling non-Gaussian patterns in data is that of finite mixtures of skewed distributions. One popular choice of component in this regard is the skew normal distribution of [7]. The density of the univariate skew normal distribution introduced by [6] has the form given by where φ(·) is the probability density function of a standard univariate normal distribution, while Φ(·) is the corresponding cumulative distribution function. The parameters µ and σ here have meaning similar to their counterparts for the normal distribution while λ represents the skewness parameter. The multivariate skew-normal distribution introduced by [8] generalizes the univariate case. A convenient property of this generalization is that the marginal distributions are scalar skew-normal distributions. Another class of multivariate skew-normal distributions was proposed by [51]. Finite mixtures (of univariate skew-normal distributions) were first analyzed by [71]. More recently, [70] investigated finite mixtures of multivariate skew-normal distributions and established the Eand M-steps of the EM algorithm. Recently, [69] developed methodology for performing supervised learning in these multivariate mixtures in the presence of missing information. As we can see, there has lately been a great deal of interest in the area of modeling robust and skewed mixtures.

Dependent data
In this section, we consider an approach for analyzing dependent data that are marginally distributed from the mixture model (2). Suppose that we have n observations Y = (Y 1 , Y 2 , . . . , Y n ) ′ consisting of univariate normally distributed observations following an autoregressive AR(1) model. Also assume that there are K components with means µ k , k = 1, 2, . . . , K and common (marginal) variance σ 2 . The origin of every observation, however, is not known. We again introduce missing information -group memberships, which can be given in the form of the matrix where every I ik represents the indicator function I(Y i ∈ k − th cluster). Then, the entire sample can be written in the form Y ∼ M V N (Xβ, σ 2 R(ρ)) if the class memberships of observations are known. Thus, the complete likelihood as well as Q-function can be obtained and corresponding expressions for the M-step of the EM algorithm can be derived, however expressions for the EM iterations are more complicated and involve taking derivatives of R −1 (ρ) with respect to ρ, for which closed-form expressions may not be available. As a result, while the EM algorithm can be set up and used for parameter estimation in the same way as usual, estimation becomes far more difficult. This is especially true for the case when the dependence between observations is of a form more complicated than an AR(1) structure. Model-fitting presents another challenge as does variance estimation: note also that the methods detailed in Section 2.1.3 are for independent identically distributed observations and are inapplicable for dependent data. Thus, new approaches are needed.

Conclusions
This paper provides a detailed overview of mixture models with specific reference to model-based clustering. In addition to descriptions of several existing and well-known results and methods, we provide details on simulation and evaluation of clustering algorithms as well as on graphical illustration of mixtures. Two applications involving non-Gaussian mixtures are presented. We also list some available software in the field. Finally, some additional topics such as semisupervised clustering, constrained clustering, massive datasets, diagnostics and dependent observations are presented and unresolved challenges outlined. As seen here, the field has attracted a lot of interest, but there are still many questions and issues that have to be addressed. Therefore, we hope that this survey will provide readers with a good understanding of the issues involved and spur further interest and development in this field.