Mean ﬁeld inference for the Dirichlet process mixture model

: We present a systematic study of several recently proposed methods of mean ﬁeld inference for the Dirichlet process mixture (DPM) model. These methods provide approximations to the posterior distribution and are derived using the truncated stick-breaking representation and related approaches. We investigate their use in density estimation and cluster allocation and compare to Monte-Carlo results. Further, more speciﬁc topics include the general mathematical structure of the mean ﬁeld approximation, the handling of the truncation level, the eﬀect of including a prior on the concentration parameter α of the DPM model, the relationship between the proposed variants of the mean ﬁeld approximation, and the connection to maximum a-posteriori estimation of the DPM model.


Introduction
Recently, variational approximation schemes have received growing interest as alternatives to Monte Carlo integration in computational Bayesian inference (Bishop 2006, Wainwright andJordan 2008).In particular, mean field methods are now frequently being used in a variety of contexts (Opper and Saad 2001).In the mean field approach, an approximation to a complicated posterior distribution is found within a more tractable set of trial functions by means of minimizing the Kullback-Leibler distance.Under appropriate conditions, mean field methods can offer large computational gains combined with ease of application and adequate quantitative accuracy.
However, the mean field approach also has a number of drawbacks.It can only be used efficiently in certain classes of models, and it is often difficult to assess accuracy without comparison to exact results since generally applicable performance criteria are not known.Furthermore, in many cases it requires substantial effort to find tractable sets of trial functions that improve upon the commonly used fully factorized class.
Nevertheless, in many cases the advantages of the mean field method outweigh the associated problems, and it is one of the current topics of research in this area to identify such opportunities.Such work typically focuses on well-defined, specific classes of models since it appears to be rather difficult to derive generally valid results.Very recently, a number of papers presented rather promising proposals regarding the application of mean field methods to inference in Dirichlet process mixture (DPM) models (Blei and Jordan 2006, Kurihara, Welling, and Teh 2007, Kurihara, Welling, and Vlassis 2007).In particular, using an image clustering problem as a practical example, Blei and Jordan (2006) showed that these methods are able to efficiently handle large high-dimensional data sets and may provide large computational gains compared to standard Markov chain Monte Carlo (MCMC) integration.
The purpose of this paper is to present a systematic study of these mean field approximation schemes.Several reasons motivate our interest in this problem.(i) The DPM model is very important as a generic building block for nonparametric Bayesian inference (Ferguson 1973, Antoniak 1974).Computational inference for this model has been developed thoroughly over the past 15 years on the basis of MCMC integration (see, e.g., Escobar 1988;1994, MacEachern 1994, Escobar and West 1995, MacEachern and Müller 1998, Neal 2000, Ishwaran and James 2001, Walker 2007).Nevertheless, in some situations, such as the image analysis problem mentioned above, it is still desirable to have complementary computational approaches available, so that detailed investigations of the mean field approximation are well justified.(ii) The previous studies of mean field inference for DPM models provide some assessment of the performance, e.g., by studying some large-scale classification problems or monitoring the dependence the Kullback-Leibler divergence on some parameters of the algorithm.However, a more systematic study of the basic properties of the method is still missing.It is also not clear how accurately mean field approximates the various posterior quantities calculated in the DPM model.In this work, we aim to address some of these issues.(iii) Mean field methods have already been applied to extensions of the DPM model (Teh, Kurihara, and Welling 2008) as well as to closely related mixture models such as latent Dirichlet allocation (Blei, Ng, and Jordan 2003).Some of the present results are expected to be of interest in these contexts as well.(iv) Since it constitutes a measure over probability measures, the Dirichlet process is quite different from the more standard ingredients of Bayesian probability models.It is therefore of interest to see how the mean field approximation behaves in this context.
The plan of the paper is as follows.In Sec. 2, we first summarize some basic facts about Dirichlet process mixtures.We then introduce the mean field approximation and outline its application to the DPM model in the truncated stick-breaking representation, as proposed by Blei and Jordan (2006).Section 3 discusses the general mathematical structure of the mean field solutions.The conclusions are applied to the problem of choosing the truncation level in the stick-breaking representation.We also discuss some instructive scenarios with one and two observations in more detail.Section 4 illustrates and extends the foregoing discussion by means of several representative numerical examples.We also compare to results obtained from MCMC integration.The comparison focuses on density estimation and cluster allocation.Several variants of the mean field approximation scheme that were proposed in Kurihara, Welling, and Teh (2007) are compared in Sec. 5.In Sec. 6 we study the mean field approximation after inclusion of a prior for the DPM parameter α.The relationship between the mean field approximation and maximum a-posteriori (MAP) estimation of the posterior is discussed in Sec. 7. A summary and some concluding remarks are given in Sec. 8.
As a main result of our study we find that while the mean field methods, by themselves, provide a useful tool for mixture modelling, great care is necessary when using them to calculate actual approximations to the DPM posterior.In particular regarding the description of data clustering, they display great differences to the exact results.

Dirichlet process mixtures
In the Dirichlet process mixture model (Ferguson 1973, Antoniak 1974), one assumes the observations to originate from a mixture distribution with an unknown number of components, and one places a Dirichlet process prior on the mixture distribution.More formally, the DPM model is defined by Here, G denotes a random probability measure drawn from a Dirichlet process with base distribution G 0 and parameter α.The measure G is almost surely discrete.The ith observation Y i is obtained from the distribution p(Y i |θ i ), conditional on the parameter θ i drawn from G. In this way, G can be thought of as representing a mixture distribution with an infinite number of components.
One of the main approaches to computing inferences for the DPM model makes use of the stick-breaking representation for the random measure G (Sethuraman 1994), i.e.,

G(•)
where the V j 's have a beta distribution B(1, α), δ η denotes the Dirac measure placing unit mass at η, and η j ∼ G 0 .The measure G is thus described by the collection of random variables V j and η j .Ishwaran and James (2001) made the important observation that a truncation of the stick-breaking representation at a sufficiently large K, i.e., setting v K = 1, already provides an excellent approximation to the full DPM model.For the truncated stick-breaking model, we can write down the explicit probability distribution function p(y 1:n , z 1:n , η 1:K , v 1:K−1 ) Here, y 1:n , z 1:n , η 1:K are shorthand for the sets of variables y 1 , . . ., y n etc., v 1:K−1 = v 1:(K−1) , whereas the w k (v 1:K ) := v k k−1 j=1 (1 −v j ) denote the mixture weights.The indicator variables z i describe the assignment of the ith observation to a specific mixture component and take integer values between 1 and K. Finally, I in (1) denotes the indicator function.In the form (1), the DPM is easily amenable to mean field variational methods.
The DPM model can be extended in various ways by placing priors on α and further parameters that may appear in the densities G 0 and p(Y i |θ i ).However, to work out the properties of the mean field approximation most clearly, we will only consider the basic model (1) except for Sec.6 where we discuss the effects of a prior distribution placed on the parameter α.
For later reference, we note that the marginal p(y 1:n ) can be written as (Lo 1984, Lemma 2) where the sum is over all possible partitions c = (c (1) , . . ., c (m) ) of the observations y 1:n , and c (j) denotes a particular subset (cluster) of observations.If c (j) contains the observations y 1 , . . ., y k , say, the corresponding factor is given by

Mean field inference for the Dirichlet process mixture model
The basic idea of the mean field method is to approximate a complicated probability distribution p by a simpler one for which one can compute inferences more easily.To this end, one starts from a set Q of tractable trial functions for which both the mean field procedure as well as subsequent inference are feasible.The mean field approximation q is then found from the minimization of the Kullback-Leibler (KL) divergence between the functions in Q and p, i.e., q = argmin q ′ ∈Q q ′ log q ′ p .
The quality of the result thus depends on the choice of the class Q. Detailed discussions of the mean field method can be found, e.g., in Opper and Saad (2001), Bishop (2006), Wainwright and Jordan (2008).
To simplify notation in the subsequent discussions, we will denote all mean field trial functions and their constituent factors by the letter q.Any particular function or factor is then identified and distinguished by its arguments and through the context (and further subscripts, if necessary).
Mean field inference for the DPM model was first discussed by Blei and Jordan (2006) within the context of the stick-breaking representation, and we follow their approach with some slight modifications.In particular, while Blei and Jordan (2006) use the full DPM model as target distribution, we start from the truncated posterior p(z 1:n , η 1:K , v 1:K−1 |y 1:n ) induced by (1) and then study how the mean-field approximation behaves in the limit of K → ∞.
A tractable class Q is given by the set of all distributions factorizing as q(z 1:n )q(η 1:K , v 1:K−1 ).To determine the actual mean field approximation, i.e., the minimizer of the KL distance, one can employ an iteration scheme where one alternatingly optimizes q(z 1:n ) for a previously found, fixed q(η 1:K , v 1:K−1 ) (in the sense of minimizing the KL distance to p) and vice versa.More specifically, the iteration scheme proceeds as follows.
(3) 2. Update q(η 1:K , v 1:K−1 ) according to 3. Calculate the negative free energy This quantity provides a lower bound on the marginal log p(y 1:n ) and is guaranteed to increase in each iteration step (the latter fact also provides a useful check on numerics).4. Repeat steps 1 to 3 until H no longer increases appreciably which signals the convergence of the iteration scheme.
A well-known problem of such mean-field iteration schemes is the existence of multiple fixed points corresponding to different stationary points of the KL divergence (MacKay 2003, Wainwright andJordan 2008).Commonly, one restarts the iteration several times and selects the solution with the largest H as final mean-field approximation, but it is sometimes difficult in practice to find the global optimum (Weiss 2001).
From (1) and the above update relations it follows that the optimized distributions factorize in all variables.It is thus sufficient to perform the iteration with fully factorized functions [see ( 7), ( 9), (10) below].A further important simplification arises if the base distribution G 0 (η) is chosen conjugate to the observation model p(y|η) and both are in the exponential family.In this case, the mean-field approximation q(η k ) will belong to the same class of distributions as G 0 , and the iterations can be carried out by updating a finite vector of parameters.In the following, we will be using two related types of such models.For analytical study, we consider the example of a simple location model (Escobar 1994) which describes a mixture of normals with fixed variance.The update equations adapted to this case are given below.In the numerical examples of the later sections, we will use a normal/inverse-gamma location-scale model (Escobar and West 1995) where the normal mixture components have variable variance.We expect that these models will also be very important in actual practical applications.
More specifically, the normal location model of Escobar (1994) is defined as with N the normal distribution and the variances λ 2 and σ 2 considered as fixed parameters.One can then derive the iteration relations Here, we have introduced the abbreviation π ik := q(z i = k) for the probability of assigning the ith observation to the kth mixture component which will be a key quantity in our subsequent discussion.We also define the mean occupation Relations ( 7) and ( 9), (10), respectively, follow directly from (3) and (4) using fully factorized trial functions i q(z i ) and k q(v k )q(η k ).They hold for arbitrary p(y|η) and G 0 (η).As mentioned above, the fully factorized form does not impose any limitations compared to the more general case q(z 1:n )q(η 1:K , v 1:K−1 ).
Relations ( 8) and ( 11) have been specialized to the normal location model (6).To simplify notation, we have omitted an explicit index for counting iteration cycles.To derive (8), we have made use of the relation ψ(z + 1) − ψ(z) = 1/z.Note that in relation ( 8), the right-hand side values of n k and n > k are calculated from the previous iteration step, i.e., (8) is not an implicit equation for the π ik 's.Furthermore, for k = K the final two terms in the argument of the exponential in (8) have to be replaced by ψ Note that the π ik 's succinctly characterize the converged solution since the functions q(v k ) and q(η k ) are straightforwardly derived from them.
The analytical expression for the lower bound (5) simplifies if it is calculated after the functions q(v k ) have been updated using the current π ik 's.In this case, a number of terms cancel and one obtains Equation ( 12) also holds at convergence.From the final mean-field approximation, we obtain information about clustering, i.e., the assignment of observations to different mixture components, through the function q(z 1:n ).In addition, we can calculate the predictive distribution p(y n+1 |y 1:n ) as (Blei and Jordan 2006) Note that ( 13) is normalized because of For the normal location model

Mixtures of mean field approximations
It is well known that one can sometimes find improved mean-field approximations that lie outside the originally chosen class Q by considering mixtures of trial functions that have non-overlapping support (Mézard, Parisi, and Virasoro 1987).For our purposes, a precise formulation of this statement can be given as follows.
Proposition.Let p(x) be the target distribution and q 1 (x), . . ., q m (x) a set of functions from Q with respective KL distances K 1 , . . ., K m > 0 to p.It is assumed that the functions q i have non-overlapping support, i.e., q i (x) > 0 for a given x implies that q j (x) = 0 for all j = i.The best approximation of the target distribution p(x) within the class of mixtures m i=1 w i q i (x) is given by ] is strictly smaller than all individual distances K i so that (15) always yields an improved approximation.
Proof.Setting q ′ (x) := m i=1 w i q i (x), it follows that dx q ′ (x) log q ′ (x) = i w i dx q i (x) log q i (x) + i w i log w i since the functions q i (x) have non-overlapping support.The KL distance between q ′ and p is therefore given by K 0 = i w i K i + i w i log w i .Minimization of K 0 with respect to the w i under the constraint i w i = 1 leads to (15) and the given expression for the minimum KL distance.
In the discussion of ( 15), the following points should be mentioned.(i) The condition of non-overlapping support is crucial since it allows to explicitly evaluate the entropy term dx q ′ (x) log q ′ (x) as given above.Otherwise, the calculation of the entropy becomes intractable and one has to resort to approximation schemes as described, e.g., in Jaakkola and Jordan (1998).(ii) The functions q 1 , . . ., q m may be chosen arbitrarily within the class Q as long as they fulfill the support condition, and need not be variational extrema.(iii) Since, in general, the mixture will not be a member of the original class Q of trial functions one can indeed obtain improved approximations that go beyond Q.(iv) The lower bound defined in ( 5) is related to the KL distance by H = −K + log p(y 1:n ).Equation ( 15) thus remains valid with the K i(k) replaced by −H i(k) , and the improved lower bound is given by log[ m k=1 exp(H k )].(v) As an interesting aside, the fact that the minimized KL distance has to be non-negative implies that m i=1 exp(−K i ) ≤ 1 for any choice of the non-overlapping distributions q i .In the extreme case of one q i coinciding with p, all the other KL distances K j , j = i, equal −∞.
In the context of the DPM model, we will see that the distributions corresponding to the various fixed points of the mean field iteration scheme often come very close to fulfilling the condition of non-overlapping support.This suggests to construct improved mean-field approximations by using these distributions in a mixture.Thereby, one has to neglect the small error introduced by the remaining overlap.Alternatively, one can choose functions q 1 , . . ., q m such that they strictly fulfill the overlap condition and are still close to the fixed-point distributions.

Mean field inference for the DPM model: General features
In this section, we give an overview and explanation of some of the main distinctive features that characterize the mean field approximation to the posterior in the DPM model.For clarity, the discussion will be based on the normal location model defined in (6).As demonstrated by the numerical studies of Sec. 4, the essential conclusions also apply to the more general location scale model of Escobar and West (1995).Throughout the following discussion, we will presuppose a vague prior distribution, i.e., λ ≫ σ.This condition is not only reasonable from a modelling point of view since it guarantees sufficient flexibility in describing the mixture distribution, but it also greatly simplifies the behavior of the mean field approximation.
In the following, we will first discuss the MF approximation for a single observation.The detailed study of this simple case which can largely be performed analytically is very instructive since it already reveals several important features that can also be found in more complex problems.We next turn to the case of multiple observations and discuss several aspects of the general mathematical structure of the mean field approximations.Our findings are then applied to discuss the problem of choosing the truncation level K. Finally, we return to a simple model problem with two data points in order to illustrate how the mean field approximation switches between different clusterings of data points.
Let us start, however, by briefly reviewing some basic features of the DPM mean-field approximation and the update relations ( 8)-( 11).The mean-field treatment is based on a truncation of the DPM model.The truncated model describes the mixture in terms of a finite number K of components, in contrast to the infinite mixture of the full problem.If K is chosen sufficiently large, the truncation provides an excellent approximation to the full model (Ishwaran and James 2001).
After convergence of the iteration procedure, the functions q(η k ) and q(v k ) yield the mean-field approximations to the posterior marginal distributions of the location and weight parameters η k and v k , respectively, for the kth mixture component.The functions q(z i ) describe the assignment probability of the ith observation to the different components.As we see from ( 9) and ( 11), the more observations are attributed to a particular component, the more sharply the posteriors q(η k ) and q(v k ) will be localized.The assignment probabilities π ik = q(z i = k) are determined by two factors.First of all, we have the prior contribution exp ψ(n k + 1) This factor is closely related to the last term of (12).As follows from the discussion in point (iii) of Sec.3.2, it ensures that the optimal mean field approximation (i.e., the iteration fixed point with largest H) populates mixture components with small k.
The second and more interesting contribution is derived from the data model p(y|η).It shows that an observation y i will tend to be assigned to components for which the location posterior q(η k ) overlaps with y i (term exp[−(µ k − y i ) 2 /2σ 2 ]) and, crucially, which are well localized (term exp(−ρ 2 k /2σ 2 )).Since the localization depends on the number of observations assigned to a component, the latter aspect leads to a "positive feedback" or "self-reinforcement" process that results in individual observations being assigned to a small number of components, often even only a single one.

Single observation
As mentioned above, in the case of a single observation (n = 1), the mean field approximation is amenable to a detailed analytical study.The results obtained are useful since they already anticipate some important traits of the general case.In the following discussion, we will set y 1 = 0 for simplicity.
At first, let us briefly discuss how the self-reinforcement effect described above manifests itself in this case.From (8)-( 11), it follows that the relevant relations for the iteration scheme are given by with ρ 2 k = (1/λ 2 +π 1k /σ 2 ) −1 (the relation for π 1K is slightly modified).Suppose that we start the iteration procedure from the initial condition π 1k = 1/K.After the first iteration cycle, all ρ 2 k will be equal, but due to the terms 16), the assignment probabilities π 1k will be ordered as π 11 > π 12 > • • • > π 1K , in general.In the next iteration cycle, the ρ 2 k 's thus will also become ordered in this way which causes π 11 to grow even further at the expense of the other components.Subsequent cycles will rapidly enhance this process, until the observation is almost completely assigned to the first component.The exponential function in relation ( 16) is crucial in bringing about this behavior since it "magnifies" the effect of any differences of its arguments for different k.To see that our description is consistent, we note that after convergence, we will have ρ 2 1 ≈ σ 2 , whereas ρ 2 k ≈ λ 2 , k > 1.This implies that the population of all components k > 1 will indeed be suppressed exponentially by a factor of ξ = exp(−λ 2 /2σ 2 ) since we have assumed λ ≫ σ.
Further studies reveal that π 1k ≈ I(z 1 = 1) is not the only possible fixed point of the iteration scheme, but there are also solutions of the form π 1k ≈ I(z 1 = l), l > 1, which arise if the initial conditions are chosen appropriately.Approximate analytical expressions for these solutions can be derived in terms of an expansion in ξ.From relations (8) and ( 11) it follows that the converged assignment probabilities π 1k obey the equations We can now seek solutions of the form π 1l = 1 − δ l , π 1k = δ k , k = l, with δ j ≪ 1 for all j.These are found to be given by where the neglected subsequent terms in the expansion are at least of order ξ 2 including logarithmic corrections.Since we have assumed that ξ ≪ 1, we see that the solutions ( 18)-( 21) indeed have the desired structure, since all correction terms δ j remain small.For l = K the expressions are slightly modified.We will return to some features of the solutions in Sec.3.2.The existence of multiple fixed points is a generic feature of the mean-field method.They are all local maxima, or at least stationary points, of the lower bound H.In the present case, (12) shows that the value of H decreases with growing l so that one would choose the solution for l = 1 as final mean-field result.We also note that the above expansion breaks down for growing l, as the terms exp[(l − k)/(α + 1)] will eventually become comparable to ξ −1 .However, since this typically only happens for large l, this complication is not of practical relevance.We also remark that the above derivation does not preclude the existence of other types of mean-field solutions with a different mathematical structure, but none could be found in numerical investigations.
An interesting and important aspect of relations ( 18)-( 21) concerns the parameter α.One sees that the essential structure of the solution, i.e., the compression of the observation into a single component, is not at all affected by α.The choice of α thus is hardly of relevance for the posterior distribution.It is instructive to compare this behavior to the exact solution for the DPM model.From (1), one obtains the posterior marginal for K → ∞ (and any value of y 1 ), which is of completely different character than the mean-field result.The distribution only depends on α, but not on λ or σ, and the observation is typically spread over several components.The mean-field solution thus does not yield a good description of this posterior marginal.
On the other hand, from (13) and the mean field solution for l = 1, we obtain the predictive density with κ = 2.This result is obtained after slightly simplifying the mean field solution by setting π 1k = I(k = 1) and using the expressions for q(v k ) and q(η k ) ensuing from a single iteration of the update equations.The exact relation for the predictive density can be derived from (2) and is given by expression ( 23) with κ = 1.Apart from this small difference in the weight factors of the Gaussians, the mean field and exact predictive densities are identical.
The foregoing discussion already indicates a general trend that is observed in numerical studies of more complex problems.The mean field results often provide a reasonable approximation to the predictive density, whereas there are strong deviations for cluster allocation, which is determined by the posterior marginals for z 1:n .
Interestingly, for a single observation it is possible to construct a closer relationship between the mean field approximation and the exact description.To this end, we consider a mixture of trial functions from Q as discussed in Sec.2.3.We choose the distributions q m , m = 1, 2, . .., such that q m (z i ) = I(z i = m) whereas q m (v k ) and q m (η k ) are obtained from a single iteration of ( 9) and ( 11), i.e., q m (v k ) = B(v k ; 1, α + 1) for k < m etc.The functions q m are nonoverlapping, and they provide very good approximations of the actual mean-field solutions described above as long as m does not become too large.If we now calculate the bounds H m and construct the mixture q mix , we find that q mix (z 1 ) coincides with the exact marginal ( 22) of the full model.
The above discussion has shown that for a single observation one can systematically improve the mean-field approximation by considering mixtures of different fixed-point solutions.However, further investigations indicate that for multiple observations, one cannot expect to approach the exact distribution in this way, in general.Moreover, the rapidly growing number of necessary distributions would make this method impractical.Nevertheless, one should keep in mind that the combination of non-overlapping mean-field solutions will always provide some improvement to the individual distributions and might therefore be taken into consideration under appropriate conditions.

Multiple observations
For multiple observations, the mean field approximation can only be studied numerically, in general.Nevertheless, typical solutions share several important structural properties that are already apparent in the special cases of relations ( 18)-( 21).A more formal explanation of this behavior is given in the Appendix.
(i) In the converged mean-field solutions, there is a clear distinction between occupied and unoccupied mixture components.The population of unoccupied components is exponentially small, i.e., of order O(exp(−λ 2 /2σ 2 )), and thus goes to zero as λ/σ → ∞.The population of occupied components remains finite in this limit.This allows to unambiguously identify occupied and unoccupied components.Note, however, that typically there will be a large number of distinct mean field solutions that differ in the assignment probabilities of the occupied components (as exemplified, e.g., in ( 18)-( 21) by the choice of different l's).
(ii) For fixed λ/σ, the assignment probabilities π ik do not change appreciably if the truncation level K is varied.For example, for a single observation we see explicitly from relations (18)-( 21) that K provides only an exponentially small correction to the occupied component, whereas the empty ones do not depend on K in leading order.Furthermore, as indicated by (20), the populations of the empty components following the last occupied one scale with k as exp(−k/α) at fixed λ/σ.
(iii) As long as the influence of the terms k−1 j=1 (α + n > j ) −1 in the iteration equations remains small, permutation of indices in a mean field solution leads to another (approximately) valid solution.The last term of ( 12) then shows that the lower bound H is maximized if the occupied components are assigned the lowest indices, in the order n 1 ≥ n 2 ≥ . . . .This property can usefully be exploited in cluster relabelling algorithms that accelerate the determination of the optimal mean field approximation (Kurihara, Welling, and Teh 2007).
An interpretation of the separation into occupied and empty components can be given as follows.The exact predictive distribution can be decomposed into a part where the new observation y n+1 is assigned to one of the pre-existing clusters and a further contribution for which y n+1 is placed into a new cluster of its own (see ( 2)).The functional form of this further contribution is given by p new (y n+1 ) = p(y n+1 |θ)p(θ)dθ and it has an overall relative weight of α/(α + n).As λ ≫ σ, the width of p new (y n+1 ) is of the order of λ.In plots of the predictive density for sufficiently small n, the contribution thus shows up as a broad unstructured "background" onto which the more structured and localized parts derived from the observations y 1:n are superimposed.
If we assume a mean field solution for which the first k 0 components are occupied, it follows from ( 13) and ( 14) that the unoccupied components provide a total contribution of to the predictive density.In this way, we can associate the empty components with the contribution α/(α + n)p new (y n+1 ) in the exact model in which y n+1 is placed in its own cluster.Furthermore, we see that the mean field term is always reduced in weight in comparison to the full DPM model, but in practice, the two contributions are often found to be quite similar in magnitude.

The role of the truncation level K
The structural properties of the mean field solutions summarized in Sec.3.2 are already very helpful in the discussion of the choice of the truncation level K.This issue is an important question in numerical calculations.In previous work (Blei andJordan 2006, Kurihara, Welling, andVlassis 2007), K was considered as an additional variational parameter, and various schemes for optimally choosing K were proposed on the basis of maximizing a variational lower bound.However, since the truncated DPM model tends towards the full DPM model as K → ∞ (Ishwaran and Zarepour 2002), one may wonder if the mean field approximation introduced in Sec.2.2 also approaches some well-defined limiting behavior.
That this is indeed the case is already implied by the discussion of Sec.3.2.First of all, the assignment probabilities π ik of occupied components that deter-mine all interesting properties of a particular mean field solution are insensitive to the choice of K. Second, the lower bound H for any specific solution rapidly approaches a well-defined limit for K → ∞.To see this we note that a component which is completely empty (i.e., π ik = 0 for all i) does not contribute at all to H.The combined contribution of all unoccupied components is thus of order exp(−λ 2 /σ 2 ) and remains bounded as K → ∞ because of the exp(−k/α) scaling.As a consequence, we conclude that it is not necessary to find an optimized value for K. Rather, one can eliminate this additional degree of freedom in a systematic way by considering the limit of K → ∞.
In fact, for practical purposes it is even unnecessary to calculate the asymptotic lower bounds precisely, since the differences in H between the various extrema typically are sufficiently large so that the solutions can be ranked unambiguously even with some remaining uncertainty in H.In practice, it often suffices to choose K just somewhat larger than the maximum number of occupied modes.We also note that the predictive density is insensitive to the choice of K as well.

Two observations
The case of two observations, n = 2, provides a simple way of obtaining some insight into how the mean field method switches between different clusterings of observations.Two observations can be assigned to either a single or two different clusters.In view of the results of Secs.3.1, we expect that the first situation corresponds to a mean-field solution where q(z i ) ≈ I(z i = 1), i = 1, 2, whereas for the second one we have q(z 1 ) ≈ I(z 1 = 1), q(z 2 ) ≈ I(z 2 = 2) or vice versa.Numerical studies indeed show that a converged solution close to the first type always exists, while the second type exists as long as the two observations are sufficiently far away (on the scale set by σ).In the spirit of the mean field approach, the type of clustering is determined by the solution with the larger lower bound H.To obtain some quantitative understanding of the behavior of the mean field approximation, we assume the data to be given by y 1 = y and y 2 = −y.Using (12), we then calculate the lower bounds after approximating the two mean-field solutions by q(z i ) = I(z i = 1) and q(z 1 ) = I(z 1 = 1), q(z 2 ) = I(z 2 = 2), respectively.The functions q(v k ) and q(η k ) are obtained from a single iteration of ( 9) and ( 11).The lower bounds that are calculated analytically in this way are very close to the ones obtained numerically from the actual mean field distributions.In particular, one finds that the two bounds become equal for This means that the mean-field approximation switches abruptly between assigning the observations to a single or to two clusters, respectively, when y crosses a threshold that is of the order of σ log(λ/σ).We also note that unless α is very small, the dependence on α is very weak.
In the full DPM model, one can infer from (2) that the probabilities for having one or two clusters change continuously upon varying y, i.e., there is a gradual transition in the cluster assignment.In particular, the two probabilities become equal for We note that those parts of ( 25) and ( 26) that only depend on the data likelihood parameters σ and λ are identical, whereas there are strong discrepancies in the dependence on the prior as expressed through α.We can interpret this result in the following way.Overall, (25) indicates that the mean field method displays a plausible behavior regarding clustering.One would expect that there should be two clusters as soon as y becomes comparable to σ.A similar result is also given by the full DPM model.However, the detailed comparison between the mean field and the exact treatment shows clear quantitiatve and qualitative differences, so that it is hard to consider the mean field calculation as an "approximation" to the full model.These overall conclusions are confirmed by the numerical study of more complicated situations.In the present case, one can improve the mean-field result by considering mixtures of the individual mean field solutions as in Sec.3.1.However, as mentioned above, solutions assigning the observations to different clusters do not exist if y is small compared to σ so that this approach is not always applicable.We also note that numerically we have not been able to find mean field solutions that make partial assignments of the observation to the mixture components (i.e., both π 11 and π 12 being large simultaneously, for example).In Sec. 4, we will discuss a numerical example of two partially overlapping clusters of observations that shows a very similar behavior to the simple case discussed here (see Fig. 2).

Numerical examples
In this section we will discuss a number of representative numerical examples that further illustrate the behavior of the mean field approximation beyond the general discussion of Sec. 3. The purpose of this discussion is, on the one hand, to show that the mean field method, when viewed on its own, provides "reasonable" results regarding clustering and density estimation.On the other hand, we want to point out the profound differences to the exact treatment of the DPM model, in particular regarding data clustering.
Here, IG denotes the inverse gamma distribution with density p(x; α, β) = β α x −α−1 exp(−β/x)/Γ (α) and shape and scale parameters α and β.The parameters s, T , and τ (as well as α) are considered fixed.Note that in the limit of s, T → ∞, this model reduces to the location model ( 6) with σ 2 = T /s and λ 2 = τ T /s.It is therefore convenient to reparametrize the above model in terms of σ 2 eff = T /s and λ 2 eff = τ T /s, so that s provides a measure of the deviation from the location model with the corresponding parameters.In the truncated stick-breaking approximation to the full model ( 27), one obtains the probability distribution where p(y|η, V ) is a normal distribution with mean η and variance V , p(η|V ) is normal with mean 0 and variance τ V , and p(V ) is given by the inverse gamma distribution defined in (28).The mean-field update equations are calculated using relations (3) and ( 4), but we do not reproduce them here as the explicit expressions will not be needed in the following.
In the first example we consider three well separated clusters of observations.Each cluster consists of 30 data points drawn from the normal distributions N (−0.5, 0.1 2 ), N (0, 0.01 2 ), and N (0.5, 0.04 2 ), respectively.The prior parameters are chosen as σ eff = 0.04, λ eff = 1, and shape parameter s = 1.The parameters have been selected such that the cluster widths are well within the support of the prior distribution for the variance of the mixture components.
In the mean field posterior, all data points within a cluster are assigned to the same mixture component, i.e., the approximation predicts exactly three components.This behavior is independent of α within the investigated range of α = 1 to α = 50.The predictive distribution for α = 1 is shown in Fig. 1 (black curve) together with the result of an MCMC calculation based on algorithm 8 of Neal (2000).Both results are in very good agreement.This indicates that the posterior distributions of the mixture parameters are well localized.As shown in the inset of Fig. 1, however, there are significant differences between MCMC and mean field regarding the distribution for the number of mixture components.As mentioned, mean field predicts three components for all values of α.The MCMC result, which we can expect to be quite close to the exact distribution in the DPM model, extends over several numbers of components and depends strongly on α.We also note that although the mean field iteration scheme has a number of different fixed point solutions, none of the ones found numerically has more than 4 components.It would thus not be possible to recover the exact component distribution by mixing different mean field solutions.The compression into a very small number of components can be regarded as a consequence of the "positive feedback" mechanism described at the beginning of Sec. 3.
Besides making the differences to exact results for the DPM model obvious, this example also illustrates that the mean field scheme is indeed able to distinguish between clearly separated clusters of observations.This behavior, which could be regarded as a minimum requirement for any clustering method, has been confirmed in further numerical studies, and it can also be explained directly from the mean field iteration scheme.For the further discussion, it is thus of interest to see how the mean field methods handles more intricate situations in which one would expect partial assignment of observations to more than one component.
In our second example, we therefore consider a case where the data points originate from a mixture consisting of two partially overlapping normal distributions.To study this problem more systematically, we generate two sets of observations {y (l) i }, l = 1, 2, that each contain 40 data points drawn from a N (0, 0.1 2 ) distribution.We then apply shifts of ∆y and −∆y, respectively, to the points in the two sets.From our previous discussion, it is clear that for the two cases ∆y ≪ ρ and ∆y ≫ ρ, the mean field approximation will firmly assign the data to a single and to two different mixture components, respectively.For the numerical study of the intermediate case, the parameters of the DPM model are set to α = 5, σ eff = 0.1, λ eff = 1, and s = 5.With this relatively large value of s, the variance of the mixture components in the DPM model is fixed quite strongly.
The mean field approximation shows a behavior that displays some parallels to the case of two observations discussed in Sec.3.4.For all investigated values of ∆y, there is a mean-field fixed-point solution where all observations are assigned to the same mixture component.For large ∆y, however, the optimal mean field solution, that has the largest lower bound H, populates two different components.As shown in Fig. 2(a), the lower bounds strongly depend on ∆y, and upon decreasing ∆y, the one-component solution eventually becomes dominant.Interestingly, the point at which the two bounds become equal is very well estimated by the simple relation ( 25).Soon after this crossover, the two-component solution becomes unstable and ceases to exist.
Figures 2(b) and (c) display some features of the mean field solution for ∆y = 0.15 where the two-component solution is still slightly dominant.Figure 2(b) shows the assignment probabilities π ik , k = 1, 2, for the two occupied mixture components.They are essentially determined by relation (8) if we disregard the prior variability of σ 2 .The figure demonstrates that the mean field solution indeed makes partial assignments of the central data points to the two mixture components.The predictive distribution is shown in Fig. 2(c) (black curve) together with the result from the subdominant one-component solution (blue) and the MCMC calculation (red).The two insets show the predictive density for ∆y = 0.1 and ∆y = 0.18 where there is strong dominance of the one-or the two-component mean field solution, respectively.In each case, the difference between MCMC and mean-field result is somewhat larger than in the example of Fig. 1.We also note that a simple mixing of the two mean-field solutions at ∆y = 0.15 makes the two-peak structure disappear and therefore does not produce an obvious improvement.
In order to illustrate the mean field method with a "realistic" data set, we have studied the well-known galaxy red shift data discussed in Roeder (1990).Figure 3 displays the results of corresponding calculations for parameters α = 1, σ eff = 0.707, λ eff = 7.07, and s = 4.These values were chosen based on the discussion by Escobar and West (1995) who, however, also include hyperpriors for some of the parameters.The optimal mean field approximation (in terms of the lower bound H) that could be found numerically contains three mixture components (black curve in Fig. 3) whereas the second-best has four components (blue).In contrast, in the MCMC calculation for the DPM model (red), the posterior distribution for the number of components is spread out between 6 and 9.One can see from Fig. 3 that mean field and MCMC virtually agree in the description of the predictive density for the two well-separated outer clusters at low and high velocities.However, some discrepancies arise for the central part  (Roeder 1990).Parameters for the DPM were chosen as α = 1, σ eff = 0.707, λ eff = 7.07, and s = 4. Black and blue curves show the two optimal mean field solutions which have three or four mixture components, respectively; red: MCMC calculation of DPM posterior.
Another comparison between mean field and MCMC results is provided in Fig. 5 of Sec.5.1.We have also studied multivariate examples using a straightforward generalization of normal/inverse-gamma DPM model described above with a diagonal covariance matrix.We have found the mean field results to show the same qualitative behavior as in the univariate case, e.g., there is still a clear distinction between occupied and unoccupied mixture components, individual observations remain firmly attached to a small number of mixture components, and the influence of the scaling parameter α is very small.
Large-n behavior of mean-field predictive density.Figure 4 provides an illustration of the behavior of the mean-field predictive density in the limit of large sample sizes.200 random samples each containing 300 [Fig.4(a)] and 3000 (b) observations, respectively, were drawn from the three-components normal mixture 0.3 * N (0, 0.15 2 ) + 0.4 * N (0.05, 0.04 2 ) + 0.3 * N (−0.035,0.02 2 ).For each sample, the predictive density was calculated based on the mean-field posterior for a location-scale DPM model with parameters σ eff = 0.06, λ eff = 1, s = 0.5, and α = 5.To summarize the results of these calculations, Fig. 4 shows the average over the predictive densities (red curve), together with the first and third quartiles of the distribution of predicted densities at each point (green and blue curves).
The results of this simulation indicate that for growing sample size the meanfield predictive density indeed approaches the true underlying density (as long as the latter can appropriately be expressed as a normal mixture).Note that the sample sizes necessary for obtaining a good approximation strongly depend on the details of the generating mixture, e.g., strongly overlapping normals with similar variance require many more observations than well separated mixture components.
How can the convergence behavior of the mean-field predictive density be understood?From (13), ( 14), and (24) follows that in the limit of n → ∞, the predictive distribution in the normal location model is determined by the Bayes estimates µ k and E(w k ) for the means and the weights of the occupied mixture components, only.If these quantities are estimated correctly, the predictive and the underlying true distribution will coincide since the contribution (24) of the empty components as well as the parameters ρ k go to zero.For the normal location-scale model, the Bayes estimates for the precisions (i.e., the inverse variances) have to be taken into account as well.
An argument supporting the convergence of the predictive towards the true distribution can be given as follows.First of all, the mean field approximation appears to be able to determine the number of mixture components correctly (in the example of Fig. 4, the optimal mean field approximation had two components for about 15% of the samples with size 300, whereas for the larger sample size, mean field found three components for all samples).Furthermore, a mean field approximation with K occupied components obtained in the DPM model is very similar to the mean field result for a normal mixture with a fixed number K of components.For such a model, however, Wang and Titterington (2006) have shown that the Bayes estimates converge against the maximum likelihood estimates and hence against the true mixture parameters.This explains the good convergence behavior of the mean-field predictive density in the DPM model.

Alternative mean-field approximation schemes
The mean-field approximation scheme discussed in Sec.2.2 which is based on the stick-breaking representation of the DPM model was introduced by Blei and Jordan (2006).Several variants of this method were subsequently proposed by Kurihara, Welling, and Teh (2007).In this section, we will examine one of these schemes in more detail, i.e., the substitution of the truncated stickbreaking prior by a finite-dimensional Dirichlet prior.We also briefly discuss marginalization over the weight variables v k in the stick-breaking prior and the data likelihood parameters.

Finite-dimensional Dirichlet prior
As discussed, e.g., in Ishwaran and James (2001), replacing the truncated stickbreaking prior by a finite-dimensional Dirichlet distribution also provides an excellent approximation to the full DPM model.More specifically, in this approach one uses the probability model 29) where the variables w 1:K have a Dirichlet distribution Dir(α/K, . . ., α/K).The cutoff K has again to be chosen large enough to make the approximation accurate.
In the mean field treatment of this model, we seek the best approximation within the class of distributions q(z 1:n )q(η 1:K , w 1:K ).The update equations for the iteration scheme imply that the optimal functions factorize further and are of the form q(z 1:n )q(η 1:K , w 1:K ) = n i=1 q(z i ) K k=1 q(η k )q(w 1:K ).Explicitly, the iteration steps are carried out as follows: q(w 1:K ) = Dir(a 1 , . . ., a K ) with The expression for the lower bound is given by This relation holds at convergence or at any stage of the iteration process if q(w 1:K ) has been updated following an update of q(z 1:n ).The calculation of the predictive density is outlined below.In order to compare this approach to the method using the truncated stickbreaking prior, we now discuss a representative numerical example that illustrates the main aspects.We consider a situation where data are drawn from two adjacent normal distributions and a superimposed broader distribution.In this case, some observations are found to be partially assigned to three different mixture components.This somewhat involved setup has been chosen since one would expect that potential differences between the methods should show up more readily than, e.g., in a situation where all cluster assignments are clear-cut.The data for the example were generated by drawing 40 observations each from the three distributions N (±0.03,0.01 2 ) and N (0.0, 0.12 2 ), and the parameters of the DPM model were selected as α = 5, σ eff = 0.06, λ eff = 1, and s = 0.5.Figure 5 shows the resulting predictive density distributions together with the MCMC result, as well the assignment probabilities π ik for the three mixture components that are found to be populated in the optimal mean field solutions.We note that mean field recovers the three mixture components of the original distribution and, overall, agrees well with the Monte Carlo calculation.More importantly, however, the two mean field results are found to be almost indistinguishable on the scale of Fig. 5.To further illustrate the similarity, Table 1 compares the parameters of the three occupied mixture components.In all cases, the mixture components E ηk (p(y n+1 |η k )) in the predictive distributions ( 13) and ( 37) are found to be practically of Gaussian shape.Table 1 shows the corresponding normal means µ k and standard deviations ρ k , the numbers n k = i π ik of observations assigned to a component, the weights w k of the mixture components, and the lower bound H.
The close correspondence between the solutions is not a particular feature of this example, but has been observed in a similar way in many other cases that were studied numerically, using univariate as well as multivariate data.In the discussion of this observation, the following points should be emphasized.(i) The correspondence between solutions is not restricted to the global optimum, but also applies to other local maxima in H that appear as fixed points of the mean field iteration scheme.The numerical deviations between parameters in corresponding solutions, as displayed, e.g., in Table 1, are typically very small compared to the differences to other, non-corresponding solutions.In this way, correspondence can easily be established.
(ii) The reason for the close similarities between the methods can be found in the fact that they only differ in the way the prior of the DPM model is handled.The discussion of Secs. 3 and 4 has shown that the mean field solutions are mostly determined by the data-likelihood part of the DPM model whereas the influence of the prior is rather weak.The resulting similarities are thus not surprising (see also remarks in Sec.5.2).
In spite of the overall similarity, it is nevertheless instructive to compare in more detail the use of Dirichlet and stick-breaking priors, respectively, in the mean field calculations.In order to motivate some of the general observations discussed below, we first give the result of a calculation along the lines of Sec.3.1 for the assignment probabilities in the case of a single data point.Setting y 1 = 0 for convenience, one again finds mean field solutions that assign the data point to a single mixture component, i.e., one obtains assignment probabilities π where the neglected terms are at least of order ξ 2 with ξ = exp(−λ 2 /2σ 2 ).The complete mean field solutions are obtained from these expressions using relations ( 31)-( 32).We now summarize some general features of mean field approximations with a finite dimensional Dirichlet prior.
(i) We again find a clear distinction between occupied and empty mixture components.In contrast to the stick-breaking case, however, the population of empty components does not remain finite (i.e., of order ξ) as K → ∞, but vanishes exponentially with K (see ( 35)).On the other hand, if there is more than one occupied component, the corresponding assignment probabilities appear to converge much more slowly towards their asymptotic limits (with 1/K).However, their overall variation with K is still rather small.
(ii) Permuting the component indices in a mean field solution leads to another valid solution (i.e., fixed point) of the iteration scheme (see ( 34)-( 35)).This is a consequence of the symmetry of the Dirichlet prior.All these solutions have the same lower bound H.With the stick-breaking prior, the largest bound is obtained if the populations of components are size-ordered, i.e., n 1 ≥ n 2 ≥ . . .(see ( 12)).
(iii) With a finite-dimensional Dirichlet prior, further types of mean field solutions can arise that have no equivalents in the stick-breaking approach.For example, the single observation considered above can be assigned to more than one component.If k 0 components become occupied, then one has π 1k ≈ 1/k 0 for these components.In the extreme case of k 0 = K, π 1k = 1/K is an exact solution.Nevertheless, in the numerical calculations for more complex problems we have only rarely encountered solutions that were significant in terms of their lower bound and did not seem to have a correspondence in the stick-breaking approach.
(iv) The predictive density is given by Since E w1:K (w k ) = (n k + α/K)/(α + n), the evaluation of the predictive density can show an appreciable dependence on K, in particular for smaller values of K (and large α).To eliminate this dependence, it is convenient to take the asymptotic limit Similar to the discussion of Sec.3.2, the term α(α + n) −1 p new (y n+1 ) is due to the empty components and describes the assignment of y n+1 to a cluster of its own.It agrees precisely with the corresponding term in the exact treatment.For the practical evaluation of (37), we can make use of the fact that even for rather moderate values of K, the population of empty components is exponentially small, whereas the population of occupied components depends only weakly on K.It is therefore possible to accurately calculate (37) with a small value of K already.As an example, the predictive density in Fig. 5(a) was obtained in this way and is almost indistinguishable from the stick-breaking result.(v) A rather interesting observation concerns the behavior of the lower bound H, which does not become asymptotically constant as in the stick-breaking case.
To study this aspect in more detail, let us consider a mean field solution for which the first k 0 components are occupied and all others empty.In this case, all terms in (33) will become constant as K → ∞, besides the first k 0 members of the last sum for which we find that after using that Γ (z) ≈ 1/z for z → 0. The lower bound thus diverges as −k 0 ln K.The reason for this behavior is the permutation symmetry of the Dirichlet prior.As mentioned in (ii), this symmetry gives rise to the existence of multiple equivalent mean field solutions that differ from each other only in the permutation of component indices, and the proper mean field description should take all of them into account.
For a special case, we can explicitly show how the combination of all these solutions re-establishes the constancy of the asymptotic bound.Assume again that there are k 0 occupied mixture components.For large K, the population of the empty components will be vanishingly small as mentioned in (ii).There are thus K!/(K − k 0 )! index permutations leading to distinct mean field solutions.Now assume in addition that for each occupied mixture k there is at least one observation i assigned (almost) exclusively to this component, i.e., π ik ≈ 1.Such types of mean field solutions are not uncommon, examples are shown in Figs. 1 and 2. Any two index permutations of such solutions fulfill, to a very good degree of approximation, the non-overlap condition of Sec.2.3.We can therefore calculate the lower bound H mix of an equi-weighted mixture of all permutations.Since it surpasses the bound of an individual solution by ln K!(K − k 0 )!, H mix will be asymptotically constant.We note that the approach to the limit is of order 1/K and thus much slower than in the stick-breaking case.A very similar behavior is observed for the exact calculation of the lower bound in the truncated DPM model with Dirichlet prior.
In the general case, however, the mean field solution will not fulfill the nonoverlap condition (see, e.g., Fig. 5).It is then no longer possible to calculate the entropy term of the lower bound analytically.It is thus an interesting and challenging problem to determine the asymptotic behavior of the lower bound under these circumstances.
The divergent behavior of the lower bound can be of relevance for practical calculations.The scaling with −k 0 ln K implies that for growing K mean field solutions with small k 0 will appear more and more favorable.In fact, in many cases there exists a solution that assigns all observation to the same mixture component (i.e., k 0 = 1).Such a solution will always dominate in the limit K → ∞.It is thus essential to be aware of this divergent behavior when ranking mean field solution based on their lower bound.On the other hand, one should also keep in mind that apart from the problem with the lower bound, finite-dimensional Dirichlet and truncated stick-breaking priors essentially yield the same approximations to the DPM posterior (Ishwaran and Zarepour 2002).Since the latter approach does not suffer from problems regarding the lower bound and the ranking of solutions, it might be a more convenient choice in practical calculations.
Finally, we briefly mention recent work on the latent Dirichlet allocation (LDA) model (Blei, Ng, and Jordan 2003).On a formal level, LDA is a generalization of a mixture model where the data likelihood is given by a discrete multinomial distribution and thus shares some similarity with the models described in this section.Since mean field methods are also often applied to the LDA, we expect that some of the results discussed here, e.g., regarding the structure of the solutions and the effects of the permutation symmetry of the Dirichlet prior, might be of relevance in this context as well.

Other approximation schemes
Marginalization over weight variables.As discussed by Kurihara, Welling, and Teh (2007), one might hope to improve the mean-field approximation by integrating out the v k 's of the stick-breaking representation so that the contribution of the prior weights is treated exactly.As a representative illustration of the results that are obtained from this method, we again consider the example of Sec.5.1.Figure 5 and Table 1 show the predictive density and the numerical values for the parameters of the mixture components, respectively.Again, we find a close similarity to the results from the other approaches.This correspondence between solutions has also been observed in all other numerical studies and not only applies to the globally optimal mean field result, but also to other local maxima with large H.As explained above, we attribute this behavior to the very weak influence of the prior distribution in the mean field calculations.
For corresponding solutions, the lower bound H is larger in the marginalized model, as one would expect since the prior weight variables v k have been treated exactly (a detailed study of the improvement in H due to marginalization has been presented by Mukherjee and Blei (2009) in the context of latent Dirichlet allocation).However, given the close similarity between the actual mean field distributions, as exemplified in Table 1, the increase in H does not seem to make the marginalized method preferable, in particular since the numerical computation of the update rules becomes more expensive.This can be avoided by using approximation schemes (Kurihara, Welling, and Teh 2007), but then the lower bound is no longer guaranteed to increase in each iteration step.
Marginalization over data likelihood parameters.A further variant of the mean field method that is mentioned, although not elaborated on, in Kurihara, Welling, and Teh (2007) consists in marginalizing over the mixture component parameters η k .However, in this case the computational cost of exactly evaluating the mean-field update relations is increased dramatically and it is not clear if efficient approximations can be derived.Nevertheless, it is still of interest to investigate the general behavior of this approach in the case of small samples and compare it to the other versions.The main result of our studies is that the mean field solutions obtained in this scheme do not have equivalents in the other approaches, but are clearly distinct.This confirms our conclusion that the similarity of the previous methods is due to the fact that they only differ in how the prior is handled, and that the prior only has a small effect on the resulting approximation.In particular, it is found that after marginalizing over η k the observations tend to be spread out over more mixture components; however, the inherent limitations of the factorized form for q(z 1:n ), as outlined in Sec. 8, prevent this approach from yielding a significantly improved approximation to the distribution of mixture components.

Mean field inference for the parameter α
The DPM model can be extended to include a prior distribution p(α) for the parameter α.In order to preserve conjugacy, we will in the following choose p(α) as a gamma distribution with shape and inverse scale parameters s 1 and s 2 , respectively.To derive the mean field iteration scheme for this model, we again apply a truncation as in (1).Analytical update rules are obtained if we seek the mean field approximation within the class of distributions q(z 1:n )q(η 1:K , v 1:K−1 )q(α).As in Sec.2.2, the optimized distributions factorize in all variables.More specifically, the mean field posterior for α remains a gamma distribution with update equations given by (Blei and Jordan 2006) whereas the update for the v k 's is modified to Similarly, in the update rule ( 8), α has to be replaced by E α (α), whereas the relation for q(η k ) remain unchanged.The lower bound at convergence is now given by Some numerical experiments indicate that the convergence through standard iteration for q(α) is quite slow and takes longer than for the other parameters.When a prior distribution for α is included in the model, one finds that the role of the truncation level K becomes more complex than before.First of all, (38) shows that the mean field posterior for α more and more approaches a normal distribution as K and hence the shape parameter s * 1 increase, irrespective of any other aspects of the system under study.The mean of the posterior distribution is given by µ = s * 1 s * 2 and its standard devation by ρ = µ/s * 2 .Under the assumption of unoccupied higher-order mixture components, one can show that E(α) is independent of K. Using the relations from the update equations, one has at convergence Here, we have assumed that the first k 0 components are populated, and used the relation ψ(x + 1) − ψ(x) = 1/x.The parameters α k and β k are defined in (40).Solving for E(α) yields Exact prior (blue curve) and posterior (red) distributions for the parameter α together with mean field approximation (black) at K = 20 for a sample with n = 12.Due to the smallness of the sample the exact posterior can be calculated without using Monte-Carlo integration.In this example, the variational posterior mean E(α) is smaller than the prior one (E prior (α) = 1), whereas the exact one is larger.For K → ∞, q(α) approaches a Dirac delta function at E(α).
i.e., E(α) is independent of K.Note that β k depends explicitly on E(α), and that both α k and β k depend implicitly on it through n k and n > k .The latter dependence is quite weak, however.Altogether, ( 43) is thus an implicit equation for E(α).It is also easy to see from (43) that the result for E(α) is independent of the precise choice of k 0 as long as all occupied mixture components are included in the summation (i.e., one might change k 0 → k 0 + ∆k, ∆k > 0).Since the iteration relations for the other factors in the mean field approximation depend on q(α) only through E(α), it is ensured the assignment probabilities π ik will also remain independent of K.
In the limit of K → ∞, it follows from the above discussions that s * 1 and s * 2 both go to infinity, since their ratio E(α) has to remain fixed.The approximately Gaussian posterior q(α) will thus be more and more localized around E(α), i.e., it tends towards a Dirac delta function δ(α − E(α)).
We also find that the lower bound no longer attains a constant limit for K → ∞.More specifically, it follows from (41) that H diverges as − 1 2 ln K in the limit of K → ∞.To derive this result, one uses the constancy of E(α) with K which allows to replace s * 2 by s * 1 E(α), together with Stirling's approximation, and the fact that components are occupied only up to a fixed index k 0 < K.The divergence is caused by the entropy term − q(α) ln q(α).Qualitatively, the contracting q(α) more and more deviates from the exact posterior marginal and thus leads to a larger Kullback-Leibler divergence.We note that this example shows how a seemingly minor modification of the model can lead to a rather strong change in the properties of the mean field approximation, here in the form of the behavior of the lower bound.The contraction of q(α) and the divergence of the lower bound can probably be avoided if one seeks the mean field approximation within the class of distributions q(z 1:n )q(η 1:K )q(v 1:K−1 , α).However, in this case the update rules are no longer of simple closed form but require numerical integrations.
In the numerical examples studied, the variational mean E(α) was always smaller than the exact one.In Fig. 6 we display a case where the mean-field result not only predicts a wrong shape of the posterior, but even gives a wrong direction for the shift from prior to posterior mean (i.e., smaller instead of larger value for expectation value).One thus has to be very careful when drawing conclusions about the exact behavior based on variational results.

Mean field approximation and MAP estimation
An interesting perspective on the mean-field approximation is obtained from studying its connection to maximum a-posteriori (MAP) estimation of the posterior mixture distribution.We define the MAP estimation problem as finding those values of η 1:K and v 1:K−1 (and hence the specific mixture distribution defined by them) that maximize the truncated posterior distribution p(η 1:K , v 1:K−1 |y 1:n ) for given observations y 1:n .An algorithm for determining the MAP estimate can be constructed in close analogy to the expectationmaximization (EM) method, with the assignment variables z i playing the role of the hidden variables in the EM algorithm.More specifically, a cycle in the EM-type iteration scheme for the normal location model is carried out by alternatingly calculating for given η 1:K and v 1:K−1 , and updating η 1:K and v 1:K−1 according to To simplify the following discussion, we now assume α > 1.At a fixed point of the iteration scheme, we have From this, we obtain an implicit condition for the component occupation numbers n k from the summation The solutions of the MAP iteration scheme have the following general properties: 1. Given a solution, we can construct (K − 1)! − 1 further solutions by permuting the indices of the components with k = K.To see this, consider such a permutation P that keeps K invariant, i.e., P(K) = K.From the given fixedpoint solution, we can construct the new solution starting from π . Re-evaluating π ik using (44) proves the consistency of the new solution.However, if the index K were included in the permutation as well, the consistency would no longer hold.Note that all possible (K − 1)! permutations describe the same mixture distribution defined by the (unordered) set of η k 's and their associated weights 2. All (K − 1)! solutions obtained above have the same posterior probability, which is immediately seen from the relation (49) Expression ( 49) also explains the special role of the component K mentioned above.The mixture component with the largest weight has to be assigned to label K in order to maximize the posterior probability (note that we have assumed α > 1).The assignment of the other components is arbitrary.For all the other K! − (K − 1)! ways in which the mixture distribution can be represented with w K not being the maximum weight, the posterior probability is smaller.Numerical calculations indicate that these representations are not even local maxima of the posterior.
3. The MAP equations may have multiple fixed-point solutions corresponding to different mixture distributions.However, for any set of observations y 1:n , π ik = I(k = K) (together with the ensuing relations for v 1:K−1 and η 1:K ) is always a fixed point of the MAP iteration scheme.In this case, all observations are assigned to the mixture component K.In agreement with the above discussion, π ik = I(k = l), l < K, can never be a fixed point as one easily sees from the iteration equations or from the fact that the posterior probability vanishes in this case.
4. As in mean field, the MAP iteration scheme supports solutions for which only a subset of the mixture components are occupied.The populations of the empty modes vanish exactly (i.e., π ik = 0), whereas in mean field they are only exponentially suppressed.The exact vanishing of component populations can be understood from two perspectives.On the one hand, given a MAP solution, we can immediately construct an equivalent solution with a larger number of components simply by inserting components with zero weight.Second, the implicit relation (48) for n k has the trivial solution n k = 0, k < K, and in many cases this solution is indeed assumed.into clusters.In the case of localized observations |y i | < ∼ σ and moderate α, it is often found that the term that assigns all observations to a single cluster has the largest probability whereas the individual probabilities for all other clusterings are significantly smaller.However, since there is a large number of possibilities for putting the observations into two or more clusters, the resulting distribution for the number of components can still become quite broad.Although not strictly in one-to-one correspondence, the fact that the single-cluster assignment has the largest individual probability indicates that the MAP estimate should also put all observations into a single mixture component.
Finally, we address the fact that the MAP and mean field solutions differ distinctively in how indices are assigned to the populated mixture components.As discussed above, in MAP the component with largest weight has index K.In the mean field approximation the lower bound is maximized if the populations of components are ordered by size, i.e., n 1 > n 2 > . . .(see ( 12)).
We explain this difference in behavior as follows.In the stick-breaking approach, a mixture distribution can be represented in K! different ways by appropriately choosing v k 's and η k 's.With MAP, we select a representation that maximizes posterior probability density.The mean field solution can be interpreted as providing a probability distribution in mixture space which is localized around the same actual mixture as chosen by MAP.However, it puts its weight on a different representation of the mixture, namely one with a large associated probability mass (with respect to a volume element in mixture space).Such a representation typically assigns small indices k to all components with large mixture weights.

Summary and conclusions
Recently, variational algorithms, and in particular mean field methods, have received increasing attention as possible alternatives to MCMC integration in computational Bayesian inference.In this paper, we have presented a systematic investigation of several mean field inference schemes for Dirichlet process mixture models that were proposed in Blei and Jordan (2006) and Kurihara, Welling, and Teh (2007).Our discussion focussed on the normal location and normal/inverse-gamma location-scale models which we believe to be among the most important for practical applications of mean field techniques in this context.The main results of our studies can then be summarized as follows.
(i) In typical fixed point solutions to the mean field iteration scheme, there is a clear distinction between occupied and essentially empty mixture components.The population of empty components is exponentially suppressed.Individual observations tend to be firmly attached to a relatively small number of components, often even only a single one.This behavior was explained in terms of a "self-reinforcement" effect in the iteration equations.We have discussed several structural properties of the solutions.To illustrate this discussion, explicit approximations for the case of a single observation were presented.It was also studied how the mean field approximation chooses between different clusterings of observations.As another important feature of the mean field solutions, their very weak dependence on the DPM parameter α was pointed out.
(ii) We compared different variants of the mean field scheme that are derived from the unmarginalized and marginalized truncated stick-breaking priors and the finite-dimensional Dirichlet priors, respectively.It was found that these methods lead to essentially equivalent results regarding density estimation and cluster allocation.This virtual equivalence was attributed to the very weak influence of the prior distribution in the mean field iteration scheme.From a practical point of view, the unmarginalized stick-breaking variant appears to be most convenient for numerical work.In the marginalized version, one either is faced with increased computational cost in the mean field iteration scheme or has to make use of approximation methods.For the finite-dimensional Dirichlet prior, we have pointed out the divergent asymptotic behavior of the lower bound and related this effect to the permutation symmetry of the Dirichlet prior.The divergence may cause problems when ranking mean field solutions based on their lower bound.
(iii) We have clarified the role of the truncation level K.The mean field solutions display a well-defined behavior in the limit of K → ∞.Together with the fact that only a fixed number of mixture components becomes occupied, this allows to characterize the asymptotic behavior of important quantities such as the lower bound and the predictive density.For practical calculations, it is sufficient to choose K slightly larger than the number of occupied components to calculate the asymptotic quantities.It thus seems natural to work with the asymptotic limit of the mean field approximation, rather than treating K as an additional variational parameter.This approach also keeps in line with the fact that the full DPM model arises as the asymptotic limit of the truncated model.
(iv) When a prior distribution for α is included in the model, the role of the truncation level becomes more complex.Whereas E(α) becomes constant in the limit of K → ∞, the mean field posterior for α approaches a Dirac delta function, and the lower bound H decreases without bound.This behavior appears to be an artifact of the factorization assumptions in the mean field trial functions and, on a qualitative level, is not easily reconciled with the fact that the mean field solutions without the α prior only depend very weakly this parameter.
(v) It was shown that the mean field approximation is closely related to MAP estimation of the DPM model.The MAP estimation problem can be shown to be equivalent to mean field inference under a restricted set of trial functions.The unrestricted mean field approximation can be thought of as providing a distribution over mixture space that is smeared out around the mixture singled out by the MAP estimate.
(vi) Compared with MCMC calculations, mean field results for the predictive density distribution were often found to be quite accurate.An explanation of this behavior in the limit of large sample sizes was given at the end of Sec. 4.However, there can be strong discrepancies regarding clustering and the number of posterior mixture components.
A well-known shortcoming of the mean-field method concerns the fact that it typically underestimates the variance of the target distribution (see, e.g., Bishop (2006), Sec. 10.1;MacKay (2003), Ch. 33).This limitation is due to the fact that the factorized form of the approximation cannot properly capture the correlation structure present in the full model.This problem is also present in the context of the DPM model and accounts for some of the behavior discussed in the previous sections.
As a simple illustration, consider the marginal prior distribution p(z 1 , z 2 ) in the finite-dimensional Dirichlet model of Sec.5.1 with n = 2 observations.The distribution takes only two values, depending on whether z 1 = z 2 or not, but this behavior cannot be modeled by a factorization q(z 1 )q(z 2 ).For a larger number of observations, the correlations become even more complicated.A further example is provided by the case of two well-separated clusters of observations.In this case, there is strong anticorrelation in the cluster assignment, in that observations from both clusters can be assigned to the mixture component with label k, but not at the same time (i.e., the posterior marginals p(z i = k) and p(z j = k) can both be large, but p(z i = z j = k) vanishes if i and j belong to different clusters).In factorized mean field, the observations from the two clusters always have to be assigned to different mixture components, q(z i = k) and q(z j = k) cannot be large simultaneously.
In this context, it is interesting to note that for the stick-breaking approach of Sec.2.2, the factorized form is optimal even within the large class of trial functions q(z 1:n )q(v 1:K−1 , η 1:K ) (and similarly for the finite-dimensional Dirichlet prior).An important task of future work thus consists in finding more general, tractable classes of trial functions that permit improvements to the mean field scheme.In the present study, we have already considered mixtures consisting of different fixed point solutions to the mean field iteration scheme.These mixtures were mathematically tractable because of the particular structure of the fixed point solutions.However, this approach did not lead to a substantial improvement in the quality of the approximation.
As an overall conclusion, we can state that the mean field method, when viewed on its own, produces reasonable results regarding density estimation and clustering.This can be explained by its connection to MAP estimation of mixture distributions.In view of point (vi) above, however, care is needed when it is used as an actual approximation to the exact DPM posterior.Nevertheless, since mean field appears to be a very useful and efficient way of calculating inferences for large-scale DPM problems it is very important to have a solid understanding of its properties.It is hoped that the present paper makes some contributions in this direction.We expect that some of the results may be also of interest in related contexts, such as latent Dirichlet allocation which is often treated with mean field techniques as well Blei, Ng, and Jordan (2003).
Helpful comments and advice by the associate editor and an anonymous referee are also gratefully acknowledged.

Appendix: Characteristic features of mean field solutions
In this Appendix, we give some brief arguments in order to explain the characteristic features of the mean field solutions outlined in Sec.3.2.In the limit of λ → ∞ at fixed σ, i.e., σ/λ → 0, the fixed points of the mean field iteration scheme are determined by with µ k = i π ik y i /n k .The appearance of the term −1/2n k in the exponent of (51) shows that the iteration scheme can entertain solutions with exactly vanishing population in some components, i.e., n k = 0.However, such solutions will be nonanalytic in σ/λ when we consider their variation with this parameter.Given any solution of the iteration scheme (51), consider its "pruned" version where all empty components have been discarded.The pruned version will be the solution of an iteration scheme in which only the occupied components are retained.Solutions of the pruned scheme in fact exist; e.g., for K = 1 solutions are given by π i1 = 1, and for n = 2, K = 2 they can be found graphically.
By construction, all pruned π ik will be non-vanishing.If we re-establish the λ dependence of the pruned scheme, we can expect the assignment probabilities to smoothly depend on σ 2 /λ 2 by the theorem of implicit functions.The effect of adding the empty components back into the iteration scheme at finite σ/λ can be seen from a perturbative argument.Calculating the assignment probabilities for the unoccupied components from the pruned π ik in a first iteration step shows that the former will be of order exp(−λ 2 /2σ 2 ), i.e., exponentially small compared to the pruned π ik for sufficiently small σ 2 /λ 2 .Further iteration steps will only provide negligible corrections to the result of the first iteration.The scaling with exp(−λ 2 /2σ 2 ) reflects the nonanalytic behavior of the unoccupied assignment probabilities.From the term − k−1 j=1 (α + n > j ) −1 in the exponent of (8) follows that the π ik 's for the unoccupied components scale with exp(−k/α) as soon as their index k is larger than the largest index for occupied components.The geometric scaling with k implies that the combined weight of the unoccupied components remains bounded as K → ∞ which leads to the insensitivity of all assignment probabilities with the truncation level.

Fig 2 .Fig 3 .
Fig 2. Mean field approximation for data drawn from a two-component mixture.(a) Lower bounds H of one-and two-component mean field solutions (black and blue curve, respectively) as a function of the shift ∆y applied to the mixture components.(b) Assignment probabilities π ik for two-component mean field solution at ∆y = 0.15.(c) Predictive densities at ∆y = 0.15, ∆y = 0.10 (left inset), ∆y = 0.18 (right inset).Black curves: dominant mean field solutions, red: MCMC calculation, blue: one-component mean field solution at ∆y = 0.15.

Fig 4 .
Fig 4. Large-n behavior of mean-field predictive density.200 random samples each of size n = 300 (a) and 3000 (b), respectively, were drawn from a three-component normal mixture with density shown by the black curve, and for each sample the mean-field predictive density was calculated.Red curves depict averaged predictive densities, whereas green and blue curves show first and third quartiles of the distribution of predicted densities at each point.

Fig 5 .
Fig 5. Mean field approximations for three-component mixture.(a) Predictive density calculated from truncated stick-breaking priors without and with marginalization over v k (see Sec. 5.2) (black and blue curves) and finite-dimensional Dirichlet prior (green).Red curve: MCMC calculation.Mean field results are almost indistinguishable.Parameters are α = 5, σ eff = 0.06, λ eff = 1, and s = 0.5.(b) Assignment probabilities π ik for occupied mixture components.Mean field results from the different schemes are indistinguishable on this scale.
Fig 6.Exact prior (blue curve) and posterior (red) distributions for the parameter α together with mean field approximation (black) at K = 20 for a sample with n = 12.Due to the smallness of the sample the exact posterior can be calculated without using Monte-Carlo integration.In this example, the variational posterior mean E(α) is smaller than the prior one (E prior (α) = 1), whereas the exact one is larger.For K → ∞, q(α) approaches a Dirac delta function at E(α).

Table 1
Mean field parameters of populated mixture components for the example of Fig.5.