{"title": "An Application of Reversible-Jump MCMC to Multivariate Spherical Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 583, "abstract": "", "full_text": "An Application of Reversible-Jump MCMC to Multivariate Spherical Gaussian Mixtures \n\nAlan D. Marrs \nSignal & Information Processing Dept. \nDefence Evaluation & Research Agency \nGt. Malvern, UK WR14 3PS \nmarrs@signal.dra.hmg.gb \n\nAbstract \n\nApplications of Gaussian mixture models occur frequently in the fields of statistics and artificial neural networks. One of the key issues arising from any mixture model application is how to estimate the optimum number of mixture components. This paper extends the Reversible-Jump Markov Chain Monte Carlo (MCMC) algorithm to the case of multivariate spherical Gaussian mixtures using a hierarchical prior model. Using this method the number of mixture components is no longer fixed but becomes a parameter of the model which we shall estimate. The Reversible-Jump MCMC algorithm is capable of moving between parameter subspaces which correspond to models with different numbers of mixture components. As a result, a sample from the full joint distribution of all unknown model parameters is generated. The technique is then demonstrated on a simulated example and a well-known vowel dataset. \n\n1 Introduction \n\nApplications of Gaussian mixture models regularly appear in the neural networks literature. One of their most common roles in the field of neural networks is in the placement of centres in a radial basis function network. In this case the basis functions are used to model the distribution of input data (x_i = [x_{i1}, x_{i2}, ..., x_{id}]^T, i = 1, ..., n), and the problem is one of mixture density estimation: \n\np(x_i) = Σ_{j=1}^{k} π_j p(x_i | θ_j), (1) \n\nwhere k is the number of mixture components, π_j the weight or mixing proportion for component j, and θ_j the component parameters (mean & variance in this case). The mixture components represent the basis functions of the neural network, and their parameters (centres & widths) may be estimated using the expectation-maximisation (EM) algorithm. \n\nOne of the key issues arising in the use of mixture models is how to estimate the number of components. This is a model selection problem: the problem of choosing the 'correct' number of components for a mixture model. This may be thought of as one of comparing two (or more) mixture models with different numbers of components, and choosing the model that is 'best' based upon some criterion. For example, we might compare a two-component model to one with a single component: \n\np(x_i) = π_1 p(x_i | θ_1) + π_2 p(x_i | θ_2) versus p(x_i) = p(x_i | θ). (2) \n\nThis may appear to be a case of testing nested hypotheses. However, it has been noted [5] that the standard frequentist hypothesis testing theory (the generalised likelihood ratio test) does not apply to this problem because the required regularity conditions do not hold. In addition, if the models being tested have 2 and 3 components respectively, they are not strictly nested. For example, we could equate any pair of components in the three-component model to the components in the two-component model, yet how do we choose which component to 'leave out'? \n\n2 Bayesian approach to Gaussian mixture models \n\nA full Bayesian analysis treats the number of mixture components as one of the parameters of the model for which we wish to find the conditional distribution. In this case we would represent the joint distribution as a hierarchical model where we may introduce prior distributions for the model parameters, i.e. 
\n\np(k, π, z, θ, X) = p(k) p(π|k) p(z|π, k) p(θ|z, π, k) p(X|θ, z, π, k), (3) \n\nwhere π = (π_j)_{j=1}^{k}, θ = (θ_j)_{j=1}^{k}, and z = (z_i)_{i=1}^{n} are allocation variables introduced by treating mixture estimation as a hidden data problem, with z_i allocating the ith observation to a particular component. A simplified version of this model can be derived by imposing further conditional independencies, leading to the following expression for the joint distribution \n\np(k, π, z, θ, X) = p(k) p(π|k) p(z|π, k) p(θ|k) p(X|θ, z). (4) \n\nIn addition, we add an extra layer to the hierarchy representing priors on the model parameters, giving the final form for the joint distribution \n\np(λ, δ, η, k, π, z, θ, X) = p(λ) p(δ) p(η) p(k|λ) p(π|k, δ) p(z|π, k) p(θ|k, η) p(X|θ, z). (5) \n\nUntil recently a full Bayesian analysis has been mathematically intractable. Model comparison was carried out by conducting an extensive search over all possible model orders, comparing Bayes factors for all possible pairs of models. What we really desire is a method which will estimate the model order along with the other model parameters. Two such methods based upon Markov Chain Monte Carlo (MCMC) techniques are reversible-jump MCMC [2] and jump-diffusion [3]. \n\nIn the following sections, we extend the reversible-jump MCMC technique to multivariate spherical Gaussian mixture models. Results are then shown for a simulated example and an example using the Peterson-Barney vowel data. \n\n3 Reversible-jump MCMC algorithm \n\nFollowing [4] we define the priors for our hierarchical model and derive a set of 5 move types for the reversible-jump MCMC sampling scheme. To simplify some of the MCMC steps we choose a prior model where the prior on the weights is Dirichlet, and the prior model for μ_j = [μ_{j1}, ..., μ_{jd}]^T and σ_j^{-2} is that they are drawn independently with normal and gamma priors, \n\nμ_j ~ N(η, Λ^{-1}), σ_j^{-2} ~ Γ(α, β), (6) \n\nwhere for the purposes of this study we follow [4] and define the hyper-parameters thus: δ = 1.0; η is set to be the mean of the data; Λ is the diagonal precision matrix for the prior on μ_j, with components a_m taken to be 1/R_m^2, where R_m is the data range in dimension m; α = 2.0; and β is some small multiple of 1/R_m^2. The moves then consist of: I: updating the weights; II: updating the parameters (μ, σ); III: updating the allocation; IV: updating the hyper-parameters; V: splitting one component into two, or combining two into one. \n\nThe first 4 moves are relatively simple to define, since the conjugate nature of the priors leads to relatively simple forms for the full conditional distribution of the desired parameter. Thus the first 4 moves are Gibbs sampling moves, and the full conditional distributions for the weights π_j, means μ_j, variances σ_j and allocation variables z_i are given by: \n\nπ | ··· ~ D(δ + n_1, ..., δ + n_k), (7) \n\nwhere n_k is the number of observations allocated to component k; \n\np(μ_j | ···) = Π_{m=1}^{d} p(μ_{jm} | ···), p(μ_{jm} | ···) ~ N( (n_j x̄_{jm} σ_j^{-2} + a_m η_m) / (n_j σ_j^{-2} + a_m), (n_j σ_j^{-2} + a_m)^{-1} ), (8) \n\nwhere we recognise that μ_j is a d-dimensional vector with components μ_{jm} (m = 1, ..., d), η_m are the components of the μ_j prior mean, and a_m represent the diagonal components of Λ; \n\np(σ_j^{-2} | ···) ~ Γ( α + n_j d / 2, β + (1/2) Σ_{i: z_i = j} ||x_i − μ_j||^2 ); (9) \n\nand \n\np(z_i = j | ···) ∝ π_j p(x_i | θ_j). (10) \n\nThe final move involves splitting/combining model components. The main criteria which need to be met when designing these moves are that they are irreducible, aperiodic, form a reversible pair and satisfy detailed balance [1]. 
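The conjugate Gibbs updates (7)-(10) for moves I-IV can be sketched in code. The sketch below is a minimal illustration under the priors described above, not the paper's implementation; the function name `gibbs_sweep` and its argument layout are assumptions, and the hyper-parameter update (move IV) is omitted.

```python
# Illustrative sketch of Gibbs moves I-III for a spherical Gaussian mixture
# with conjugate priors (Dirichlet weights, normal means, gamma precisions).
# NOT the paper's code: names and argument layout are assumed for exposition.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(X, z, mu, inv_var, weights, delta, eta, a, alpha, beta):
    """One sweep of moves I-III; move IV (hyper-parameters) is omitted."""
    n, d = X.shape
    k = len(weights)
    counts = np.bincount(z, minlength=k)

    # Move I, eq. (7): weights ~ Dirichlet(delta + n_1, ..., delta + n_k)
    weights = rng.dirichlet(delta + counts)

    # Move II, eqs (8)-(9): means and precisions from their full conditionals
    for j in range(k):
        Xj = X[z == j]
        nj = len(Xj)
        xbar = Xj.mean(axis=0) if nj else np.zeros(d)
        prec = nj * inv_var[j] + a                    # posterior precision per dim
        mean = (nj * inv_var[j] * xbar + a * eta) / prec
        mu[j] = rng.normal(mean, 1.0 / np.sqrt(prec))
        ss = 0.5 * ((Xj - mu[j]) ** 2).sum()          # half sum of squared errors
        # Gamma(shape, scale); scale = 1/rate for a rate-beta parameterisation
        inv_var[j] = rng.gamma(alpha + 0.5 * nj * d, 1.0 / (beta + ss))

    # Move III, eq. (10): p(z_i = j | ...) proportional to
    # pi_j N(x_i | mu_j, sigma_j^2 I), computed in log space for stability
    log_p = (np.log(weights)
             + 0.5 * d * np.log(inv_var)
             - 0.5 * inv_var * ((X[:, None, :] - mu[None]) ** 2).sum(-1))
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=pi) for pi in p])
    return z, mu, inv_var, weights
```

Repeating such sweeps, interleaved with the split/combine move V described next, yields the full sampler.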
The MCMC step for this move takes the form of a Metropolis-Hastings step, where a move from state y to state y' is proposed, with π(y) the target probability distribution and q_m(y, y') the proposal distribution for move m. The resulting move is then accepted with probability α_m, \n\nα_m = min{ 1, ( π(y') q_m(y', y) ) / ( π(y) q_m(y, y') ) }. (11) \n\nIn the case of a move from state y to a state y' which lies in a higher-dimensional space, the move may be implemented by drawing a vector of continuous random variables u, independent of y. The new state y' is then set using an invertible deterministic function of y and u. It can be shown [2] that the acceptance probability is then given by \n\nα_m = min{ 1, ( π(y') r_m(y') ) / ( π(y) r_m(y) q(u) ) |∂y'/∂(y, u)| }, (12) \n\nwhere r_m(y) is the probability of choosing move type m when in state y, and q(u) is the density function of u. \n\nThe initial application of the reversible-jump MCMC technique to normal mixtures [4] was limited to the univariate case. This yielded relatively simple expressions for the split/combine moves and, most importantly, the determinant of the Jacobian of the transformation from a model with k components to one with k + 1 components was simple to derive. In the more general case of multivariate normal models care must be taken in prescribing move transformations: a complicated transformation will lead to problems when the |Jacobian| for a d-dimensional model is required. \n\nFor multivariate spherical Gaussian models, we randomly choose a model component from the current k component model. 
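The generic accept/reject logic of eqs (11)-(12) for a dimension-increasing move can be sketched as follows. This is an illustration, not the paper's code: every argument (`log_target`, `draw_u`, `q_density`, `transform`, `jacobian` and the move probabilities) is a hypothetical placeholder supplied by the caller for a particular model.

```python
# Sketch of a dimension-changing Metropolis-Hastings (reversible-jump) step,
# eq. (12). All callables here are hypothetical placeholders, not an API
# from the paper: the caller supplies the target, proposal and transform.
import numpy as np

rng = np.random.default_rng(1)

def rj_up_move(y, log_target, move_prob_up, move_prob_down,
               draw_u, q_density, transform, jacobian):
    """Propose y -> y' in a higher-dimensional space; return the new state."""
    u = draw_u(rng)                  # continuous random numbers, independent of y
    y_new = transform(y, u)          # invertible deterministic map (y, u) -> y'
    # log of R = [pi(y') r_m(y')] / [pi(y) r_m(y) q(u)] * |d y' / d(y, u)|
    log_R = (log_target(y_new) - log_target(y)
             + np.log(move_prob_down) - np.log(move_prob_up)
             - np.log(q_density(u))
             + np.log(abs(jacobian(y, u))))
    # Accept with probability min(1, R), tested in log space
    return y_new if np.log(rng.uniform()) < min(0.0, log_R) else y
```

The down-move uses the reciprocal ratio, mirroring the α = min(1, 1/R) rule for combine moves described below.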
The decision is then made to split, or to combine with one of its neighbours, with probability P_{s_k} and P_{c_k} respectively (where P_{c_k} = 1 − P_{s_k}). If the choice is to combine the component, we label the chosen component z_1 and choose z_2 to be a neighbouring component i with probability ∝ 1/r_i, where r_i is the distance of component i from component z_1. The new component resulting from the combination of z_1 and z_2 is labelled z_c and its parameters are calculated from: \n\nπ_{z_c} = π_{z_1} + π_{z_2}, π_{z_c} μ_{z_c} = π_{z_1} μ_{z_1} + π_{z_2} μ_{z_2}, π_{z_c}( μ_{z_c}^T μ_{z_c} + d σ_{z_c}^2 ) = π_{z_1}( μ_{z_1}^T μ_{z_1} + d σ_{z_1}^2 ) + π_{z_2}( μ_{z_2}^T μ_{z_2} + d σ_{z_2}^2 ). (13) \n\nIf the decision is to split, the chosen component is labelled z_c and it is used to define two new model components z_1 and z_2 with weights and parameters conforming to (13). In making this transformation there are 2 + d degrees of freedom, so we need to generate 2 + d random numbers to enable the specification of the new component parameters. The random numbers are denoted u_1, u_2 = [u_{21}, ..., u_{2d}]^T and u_3. All are drawn from Beta(2,2) distributions, while the components of u_2 each have probability 0.5 of being negative. The split transformation is then defined by: \n\nπ_{z_1} = u_1 π_{z_c}, π_{z_2} = (1 − u_1) π_{z_c}, \nμ_{z_1} = μ_{z_c} − u_2 σ_{z_c} √(π_{z_2}/π_{z_1}), μ_{z_2} = μ_{z_c} + u_2 σ_{z_c} √(π_{z_1}/π_{z_2}), \nσ_{z_1}^2 = u_3 σ_{z_c}^2 π_{z_c}/π_{z_1}, σ_{z_2}^2 = (1 − u_3) σ_{z_c}^2 π_{z_c}/π_{z_2}. (14) \n\nOnce the new components have been defined it is necessary to evaluate the probability of choosing to combine component z_1 with component z_2 in this new model. \n\nHaving proposed the split/combine move, all that remains is to calculate the Metropolis-Hastings acceptance probability α, where α = min(1, R) for the split move and α = min(1, 1/R) for the combine move. 
In the case of a split move from a model with k components to one with k + 1 components, or a combine move from k + 1 to k, R is given by: \n\nR = [ Π_{i: z_i = z_1} p(x_i | θ_{z_1}) Π_{i: z_i = z_2} p(x_i | θ_{z_2}) / Π_{i: z_i = z_c} p(x_i | θ_{z_c}) ] × \n( p(k+1) / p(k) ) ( π_{z_1}^{δ−1+n_1} π_{z_2}^{δ−1+n_2} / ( π_{z_c}^{δ−1+n_1+n_2} B(δ, kδ) ) ) × \nΠ_{m=1}^{d} √(a_m / 2π) exp( −(1/2) a_m ( (μ_{z_1 m} − η_m)^2 + (μ_{z_2 m} − η_m)^2 − (μ_{z_c m} − η_m)^2 ) ) × \n( β^α / Γ(α) ) ( σ_{z_c}^2 / (σ_{z_1}^2 σ_{z_2}^2) )^{α−1} exp( −β( σ_{z_1}^{−2} + σ_{z_2}^{−2} − σ_{z_c}^{−2} ) ) × \nP_{c_{k+1}} / ( P_{s_k} P_alloc ( g_{2,2}(u_1) g_{1,1}(u_3) Π_{m=1}^{d} g_{2,2}(u_{2m}) ) ) × \nπ_{z_c} σ_{z_c}^{d+1} / ( 2((1 − u_1) u_1)^{(d+1)/2} √((1 − u_3) u_3) ), (15) \n\nwhere g_{2,2}(·) denotes a Beta(2,2) density function. The first line on the R.H.S. is due to the ratio of likelihoods for those observations assigned to the components in question, the subsequent three lines are due to the prior ratios, the fifth line is due to the proposal ratio, and the last line is due to the |Jacobian| of the transformation. The term P_alloc represents a combination of the probability of obtaining the current allocation of data to the components in question and the probability of choosing to combine components z_1 and z_2. \n\n4 Results \n\nTo assess this approach to the estimation of multivariate spherical Gaussian mixture models, we firstly consider a toy problem where 1000 bivariate samples were generated from a known 20-component mixture model. This is followed by an analysis of the Peterson-Barney vowel data set, comprising 780 samples of the measured amplitude of four formant frequencies for 10 utterances. For this mixture estimation example, we ignore the class labels and consider the straightforward density estimation problem. \n\n4.1 Simulated data \n\nThe resulting reversible-jump MCMC chain of model order can be seen in figure 1, along with the resulting histogram (after rejecting the first 2000 MCMC samples). 
The histogram shows that the maximum a posteriori value for model order is 17. The MAP estimate of the model parameters was obtained by averaging all the 17-component model samples; the estimated model is shown in figure 2 alongside the original generating model. The results are rather encouraging given the large number of model components and the relatively small number of samples. \n\nFigure 1: Reversible-jump MCMC chain and histogram of model order for simulated data. \n\nFigure 2: Example of model estimation for simulated data (left panel: generating model; right panel: MAP estimate model). \n\n4.2 Vowel data \n\nThe reversible-jump MCMC chain of model order for the Peterson-Barney vowel data example is shown in figure 3, alongside the resulting MAP model estimate. 
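The model-order post-processing used in both experiments (discard burn-in samples, histogram the sampled values of k, take the posterior mode) can be sketched as follows. The helper name and chain layout are illustrative assumptions, not code from the paper.

```python
# Sketch (illustrative, not the paper's code) of reading the MAP model order
# off a reversible-jump chain: drop burn-in, histogram the sampled k values,
# and report the posterior mode.
import numpy as np

def map_model_order(k_chain, burn_in=2000):
    """Posterior mode of the model order from a chain of sampled k values."""
    kept = np.asarray(k_chain)[burn_in:]              # reject burn-in samples
    values, counts = np.unique(kept, return_counts=True)
    return int(values[np.argmax(counts)])             # most-visited model order
```

The model parameters themselves can then be summarised by averaging the parameters of all retained samples whose model order equals this mode, as was done for the 17-component model in Section 4.1.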
\nFor ease of visualisation, the estimated model and data samples have been projected onto the first two principal components of the data. Again, the results are encouraging. \n\nFigure 3: Reversible-jump MCMC chain of model order and MAP estimate of model (projected onto first two principal components) for vowel data. \n\n5 Conclusion \n\nOne of the key problems when using Gaussian mixture models is estimation of the optimum number of components to include in the model. In this paper we extend the reversible-jump MCMC technique for estimating the parameters of Gaussian mixtures with an unknown number of components to the multivariate spherical Gaussian case. The technique is then demonstrated on a simulated data example and an example using a well-known dataset. \n\nThe attraction of this approach is that the number of mixture components is not fixed at the outset but becomes a parameter of the model. The reversible-jump MCMC approach is then capable of moving between parameter subspaces which correspond to models with different numbers of mixture components. As a result, a sample of the full joint distribution is generated, from which the posterior distribution for the number of model components can be derived. This information may then either be used to construct a Bayesian classifier or to define the centres in a radial basis function network. \n\nReferences \n\n[1] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, Eds. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1995. \n\n[2] P.J. Green. Reversible jump MCMC computation and Bayesian model determination. Biometrika, 82:711-732, 1995. \n\n[3] D.B. Phillips and A.F.M. Smith. Bayesian model comparison via jump diffusions. In W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, editors, Markov Chain Monte Carlo in Practice. Chapman and Hall, 1995. \n\n[4] S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an unknown number of components. J. Royal Stat. Soc. Series B, 59(4), 1997. \n\n[5] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985. \n\nPublished with the permission of the controller of Her Britannic Majesty's Stationery Office. \u00a9British Crown Copyright 1998 /DERA. \n", "award": [], "sourceid": 1422, "authors": [{"given_name": "Alan", "family_name": "Marrs", "institution": null}]}