{"title": "A Randomized Algorithm for Pairwise Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 430, "abstract": null, "full_text": "A Randomized Algorithm for Pairwise Clustering \n\nYoram Gdalyahu, Daphna Weinshall, Michael Werman \n\nInstitute of Computer Science, The Hebrew University, 91904 Jerusalem, Israel \n\n{yoram,daphna,werman}@cs.huji.ac.il \n\nAbstract \n\nWe present a stochastic clustering algorithm based on pairwise similarity of datapoints. Our method extends existing deterministic methods, including agglomerative algorithms, min-cut graph algorithms, and connected components. Thus it provides a common framework for all these methods. Our graph-based method differs from existing stochastic methods, which are based on analogy to physical systems. The stochastic nature of our method makes it more robust against noise, including accidental edges and small spurious clusters. We demonstrate the superiority of our algorithm using an example with 3 spiraling bands and a lot of noise. \n\n1 Introduction \n\nClustering algorithms can be divided into two categories: those that require a vectorial representation of the data, and those which use only a pairwise representation. In the former case, every data item must be represented as a vector in a real normed space, while in the second case only pairwise relations of similarity or dissimilarity are used. The pairwise information can be represented by a weighted graph G(V, E): the nodes V represent data items, and the positive weight w_ij of an edge (i, j) represents the amount of similarity or dissimilarity between items i and j. The graph G need not be a complete graph. In the rest of this paper w_ij represents a similarity value. \n\nA vectorial representation is very convenient when one has either an explicit or an implicit parametric model for the data. 
An implicit model means that the data distribution function is not known, but it is assumed, e.g., that every cluster is symmetrically distributed around some center. An explicit model specifically describes the shape of the distribution (e.g., Gaussian). In these cases, if a vectorial representation is available, the clustering procedure may rely on iterative estimation of means (e.g., [2, 8]). \n\nIn the absence of a vectorial representation, one can either try to embed the graph of distances in a vector space, or use a direct pairwise clustering method. The embedding problem is difficult, since it is desirable to use a representation that is both low dimensional and has low distortion of distances [6, 7, 3]. Moreover, even if such an embedding is achieved, it can help to cluster the data only if at least an implicit parametric model is valid. Hence, direct methods for pairwise clustering are of great value. \n\nOne strategy of pairwise clustering is to use a similarity threshold theta, remove edges with weight less than theta, and identify the connected components that remain as clusters. A transformation of weights may precede the thresholding^1. The physically motivated transformation in [1] uses a granular magnet model and replaces weights by \"spin correlations\". Our algorithm is similar to this model; see Section 2.4. \n\nA second pairwise clustering strategy is used by agglomerative algorithms [2], which start with the trivial partition of N points into N clusters of size one, and continue by successively merging pairs of clusters. At every step the two clusters which are most similar are merged together, until the similarity of the closest clusters is lower than some threshold. Different similarity measures between clusters distinguish between different agglomerative algorithms. 
In particular, the single linkage algorithm defines the similarity between clusters as the maximal similarity between two of their members, and the complete linkage algorithm uses the minimal value. \n\nA third strategy of pairwise clustering uses the notion of cuts in a graph. A cut (A, B) in a graph G(V, E) is a partition of V into two disjoint sets A and B. The capacity of the cut is the sum of weights of all edges that cross the cut, namely: c(A, B) = sum_{i in A, j in B} w_ij. Among all the cuts that separate two marked vertices, the minimal cut is the one which has minimal capacity. The minimal cut clustering algorithm [11] divides the graph into components using a cascade of minimal cuts^2. The normalized cut algorithm [9] uses the association of A (the sum of weights incident on A) and the association of B to normalize the capacity c(A, B). In contrast with the easy min-cut problem, the problem of finding a minimal normalized cut (Ncut) is NP-hard, but with certain approximations it reduces to a generalized eigenvalue problem [9]. \n\nOther pairwise clustering methods include techniques of nonparametric density estimation [4] and pairwise deterministic annealing [3]. However, the three categories of methods above are of special importance to us, since our current work provides a common framework for all of them. Specifically, our new algorithm may be viewed as a randomized version of an agglomerative clustering procedure, and at the same time it generalizes the minimal cut algorithm. It is also strongly related to the physically motivated granular magnet model algorithm. By showing the connection between these methods, which may seem very different at first glance, we provide a better understanding of pairwise clustering. \n\nOur method is unique in its stochastic nature while provably maintaining low complexity. 
Thus our method performs as well as the aforementioned methods in \"easy\" cases, while keeping good performance in \"difficult\" cases. In particular, it is more robust against noise and pathological configurations: (i) A minimal cut algorithm is intuitively reasonable since it optimizes so that as much of the similarity weight as possible remains within the clusters, and as little as possible is \"wasted\" between the clusters. However, it tends to fail when there is no clean separation into 2 parts, or when there are many small spurious parts due, e.g., to noise. Our stochastic approach avoids these problems and behaves more robustly. (ii) The single linkage algorithm deals well with chained data, where items in a cluster are connected by transitive relations. Unfortunately the deterministic construction of chains can be harmful in the presence of noise, where a few points can make a \"bridge\" between two large clusters and merge them together. Our algorithm inherits the ability to cluster chained data; at the same time it is robust against such noisy bridges as long as the probability to select all the edges in the bridge remains small. \n\n^1 For example, the mutual neighborhood clustering algorithm [10] substitutes the edge weight w_ij with a new weight w'_ij = m + n, where i is the m-th nearest neighbor of j and j is the n-th nearest neighbor of i. \n\n^2 The reader who is familiar with flow theory may notice that this algorithm also belongs to the first category of methods, as it is equivalent to a weight transformation followed by thresholding. The weight transformation replaces w_ij by the maximal flow between i and j. \n\n2 Stochastic pairwise clustering \n\nOur randomized clustering algorithm is constructed of two main steps: \n\n1. Stochastic partition of the similarity graph into r parts (by randomized agglomeration). 
For each partition index r (r = N ... 1): \n(a) for every pair of points, the probability that they remain in the same part is computed; \n(b) the weight of the edge between the two points is replaced by this probability; \n(c) clusters are formed using connected components and a threshold of 0.5. \nThis is described in Sections 2.1 and 2.2. \n\n2. Selection of proper r values, which reflect \"interesting\" structure in our problem. This is described in Section 2.3. \n\n2.1 The similarity transformation \n\nAt each level r, our algorithm performs a similarity transformation followed by thresholding. In introducing this process, our starting point is a generalization of the minimal cut algorithm; then we show how this generalization is obtained by the randomization of a single linkage algorithm. \n\nFirst, instead of considering only the minimal cuts, let us induce a probability distribution on the set of all cuts. We assign to each cut a probability which decreases with increasing capacity. Hence the minimal cut is the most probable cut in the graph, but it does not determine the graph partition on its own. \n\nAs a second generalization of the min-cut algorithm we consider multi-way cuts. An r-way cut is a partition of G into r connected components. The capacity of an r-way cut is the sum of weights of all edges that connect different components. In the rest of this paper we may refer to r-way cuts simply as \"cuts\". \n\nUsing the distribution induced on r-way cuts, we apply the following family of weight transformations. The weight w_ij is replaced by the probability that nodes i and j are on the same side of a random r-way cut: w_ij -> p^r_ij. This transformation is defined for every integer r between 1 and N. \n\nSince the number of cuts in a graph is exponentially large, one must ask whether p^r_ij is computable. Here the decay rate of the cut probability plays an essential role. 
The induced probability is found to decay fast enough with the capacity, hence p^r_ij is dominated by the low capacity cuts. Thus, since there exists a polynomial bound on the number of low capacity cuts in any graph [5], the problem becomes computable. \n\nThis strong property suggests a sampling scheme to estimate the pairing probabilities. Assume that a sampling tool is available, which generates cuts according to their probability. Under this condition, a sample of polynomial size is sufficient to estimate the p^r_ij's. \n\nThe sampling tool that we use is called the \"contraction algorithm\" [5]. Its discovery led to an efficient probabilistic algorithm for the minimal cut problem. It was shown that for a given r, the probability that the contraction algorithm returns the minimal r-way cut of any graph is at least N^(-2(r-1)), and it decays with increasing capacity^3. For a graph which is really made of clusters this is a rough underestimate. \n\nThe contraction algorithm can be implemented in several ways. We describe here its simplest form, which is constructed from N-1 edge contraction steps. Each edge contraction follows the procedure below: \n\n\u2022 Select edge (i, j) with probability proportional to w_ij. \n\u2022 Replace nodes i and j by a single node {ij}. \n\u2022 Let the set of edges incident on {ij} be the union of the sets of edges incident on i and j, but remove self loops formed by edges originally connecting i to j. \n\nIt is shown in [5] that each step of edge contraction can be implemented in O(N) time, hence this simple form of the contraction algorithm has complexity O(N^2). For sparse graphs an O(N log N) implementation can be shown. \n\nThe contraction algorithm as described above is a randomized version of the agglomerative single linkage procedure. 
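The three contraction steps above can be sketched in a few lines of Python. This is an illustrative implementation under our own conventions (edges keyed by `frozenset` pairs, parallel edges merged by summing their weights, which is the weighted analogue of keeping a multigraph), not the authors' code:

```python
import random

def contraction(weights, r):
    """One run of the contraction algorithm on a weighted similarity graph.

    weights: dict mapping frozenset({i, j}) -> positive similarity w_ij.
    r: stop when r super-nodes remain.
    Returns a list of r clusters (sets of original node indices).
    """
    nodes = {i for e in weights for i in e}
    clusters = {i: {i} for i in nodes}   # each node starts as a singleton
    w = dict(weights)

    while len(clusters) > r and w:
        # Select an edge with probability proportional to its weight.
        edges, wts = zip(*w.items())
        i, j = tuple(random.choices(edges, weights=wts, k=1)[0])
        # Replace nodes i and j by a single merged node (reuse label i).
        clusters[i] |= clusters.pop(j)
        # Re-attach j's edges to i; drop the self loop, sum parallel edges.
        for e in [e for e in w if j in e]:
            wt = w.pop(e)
            (k,) = e - {j}
            if k != i:
                w[frozenset({i, k})] = w.get(frozenset({i, k}), 0.0) + wt
    return list(clusters.values())
```

Because each selection favors heavy edges, tightly connected groups collapse early, so stopping at r super-nodes yields a low-capacity r-way cut with high probability.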
If the probabilistic selection rule is replaced by a greedy selection of the maximal weight edge, the single linkage algorithm is obtained. \n\nIn terms of similarity transformations, a single linkage algorithm which halts with r clusters may be associated with the transformation w_ij -> 0/1 (1 if i and j are returned in the same cluster, 0 otherwise). Our similarity transformation (p^r_ij) uses the expected value (or the average) of this binary assignment under the probabilistic relaxation of the selection rule. \n\nWe could estimate p^r_ij by repeating the contraction algorithm M times and averaging these binary indicators (a better way is described below). Using the Chernoff inequality it can be shown^4 that if M >= (2 ln 2 + 4 ln N - 2 ln delta) / (4 epsilon^2), then each p^r_ij is estimated, with probability >= 1 - delta, to within epsilon of its true value. \n\n2.2 Construction of partitions \n\nTo compute a partition at every r level, it is sufficient to know for every i-j pair which r satisfies p^r_ij = 0.5. \n\nThis is found by repeating the contraction algorithm M times. In each iteration there exists a single r at which the edge between points i-j is marked and the points are merged. Denote by r_m the level r which joins i and j at the m-th iteration (m = 1 ... M). The median r' of the sequence {r_1, r_2, ..., r_M} is the sample estimate for the level r that satisfies p^r_ij = 0.5. We use an on-line technique (not described here) to estimate the median r' using constant and small memory. \n\nHaving computed the matrix r', where the entry r'_ij is the estimator for the r that satisfies p^r_ij = 0.5, we find the connected components at a given r value after disconnecting every edge (i, j) for which r'_ij < r (so that only pairs whose pairing probability at level r exceeds 0.5 remain connected). \n\n^3 The exact decay rate is not known, but it was found experimentally to be adequate; otherwise we would ignore cuts generated with high capacity. \n\n^4 Thanks to Ido Bergman for pointing this out. 
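The construction of Section 2.2 can be sketched as follows. This is our own illustration under simplifying assumptions: a connected graph, merge levels recorded for all pairs in memory rather than with the authors' constant-memory on-line median, and pairs kept connected only when their estimated level r'_ij is at least r (so that only pairs with pairing probability above 0.5 survive):

```python
import random
from statistics import median

def merge_levels(weights, n):
    """One full agglomeration (contraction down to 1 cluster).

    Returns a dict mapping each pair (i, j), i < j, to the number of
    clusters remaining when i and j first joined the same super-node.
    Assumes the graph on nodes 0..n-1 is connected.
    """
    clusters = {i: {i} for i in range(n)}
    w = dict(weights)
    level = {}
    while len(clusters) > 1:
        edges, wts = zip(*w.items())
        a, b = tuple(random.choices(edges, weights=wts, k=1)[0])
        r = len(clusters) - 1           # cluster count after this merge
        for i in clusters[a]:
            for j in clusters[b]:
                level[(min(i, j), max(i, j))] = r
        clusters[a] |= clusters.pop(b)
        for e in [e for e in w if b in e]:
            wt = w.pop(e)
            (k,) = e - {b}
            if k != a:
                w[frozenset({a, k})] = w.get(frozenset({a, k}), 0.0) + wt
    return level

def partition_at(weights, n, r, runs=20):
    """Estimate r'_ij as the median merge level over several runs, then
    return the connected components of the pairs with r'_ij >= r."""
    samples = [merge_levels(weights, n) for _ in range(runs)]
    r_med = {p: median(s[p] for s in samples) for p in samples[0]}
    parent = list(range(n))             # union-find over surviving pairs
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), rm in r_med.items():
        if rm >= r:
            parent[find(i)] = find(j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), set()).add(i)
    return list(comps.values())
```

A weak \"bridge\" edge is selected late in almost every run, so the pairs it connects receive a low median merge level and are cut at the interesting r values.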
The connected components obtained in this way constitute the partition at level r. \n\n2.3 Hierarchical clustering \n\nWe now address the problem of choosing \"good\" r values. \n\nThe transformed weight p^r_ij has the advantage of reflecting transitive relations between data items i and j. For a selected value of r (which defines a specification level) the partition of data items into clusters is obtained by eliminating edges whose weight (p^r_ij) is less than a fixed threshold (0.5). That is: nodes are assigned to the same cluster if at level r their probability to be on the same side of a random r-way cut is larger than half. \n\nPartitions which correspond to subsequent r values might be very similar to each other, or even identical, in the sense that only a few nodes (if any) change the component to which they belong. The events of interest, therefore, are those where the variation between subsequent partitions is of the order of the size of a cluster. This typically happens when two clusters combine to form one cluster which corresponds to a higher scale (less resolution). \n\nAccordingly, using the hierarchical partition obtained in Section 2.2, we measure the variation between subsequent partitions by sum_{k=1}^{K} Delta N_k, where K is a small constant (of the order of the number of clusters) and N_k is the size of the k-th largest component of the partition. \n\n2.4 The granular magnet model \n\nOur algorithm is closely related to the successful granular magnet model recently proposed in [1]. However, the two methods effectively draw the random cuts from different distributions. In our case the distribution is data driven, imposed by the contraction algorithm. The physical model imposes the Boltzmann distribution, where a cut of capacity E is assigned a probability proportional to exp(-E/T), and T is a temperature parameter. 
\nThe probability p^T_ij measures whether nodes i and j are on the same side of a cut at temperature T (originally called the \"spin-spin correlation function\"). The magnetic model uses the similarity transformation w_ij -> p^T_ij and a threshold (0.5) to break the graph into components. However, even if identical distributions were used, p^T_ij would be inherently different from p^r_ij, since at a fixed temperature the random cuts may have different numbers of components. \n\nSuperficially, the parameter T plays a role in the magnetic model similar to that of our parameter r. But the two parameterizations are quite different. First, r is a discrete parameter while T is a continuous one. Moreover, in order to find the pairing probabilities p^T_ij for different temperatures, the stochastic process must be employed for every T value separately. On the other hand, our algorithm estimates p^r_ij for every 1 <= r <= N at once. For hard clustering (vs. soft clustering) it was shown above that even this is not necessary, since we can get a direct estimate of the r which satisfies p^r_ij = 0.5. \n\n3 Example \n\nPairwise clustering has the advantage that a vectorial representation of the data is not needed. However, graphs of distances are hard to visualize, and we therefore demonstrate our algorithm using vectorial data. In spite of having a vectorial representation, the information which is made available to the clustering algorithm includes only the matrix of pairwise Euclidean distances^5 d_ij. Since our algorithm works with similarity values and not with distances, it is necessary to invert the distances using w_ij = f(d_ij). We choose f to be similar to the function used in [1]: w_ij = exp(-d_ij^2 / a^2), where a is the average distance to the n-th nearest neighbor (we used n=10, but the results remain the same as long as a reasonable value is selected). 
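The distance-to-similarity transformation above is straightforward to write down. A minimal stdlib-only sketch (function and variable names are ours, not the paper's):

```python
import math

def similarity_from_distances(d, n_neighbor=10):
    """w_ij = exp(-d_ij^2 / a^2), where a is the average, over all points,
    of the distance to each point's n-th nearest neighbor (self excluded).

    d: square, symmetric distance matrix as a list of lists.
    """
    n = len(d)
    # Distance to the n_neighbor-th nearest neighbor of each point.
    nth = [sorted(d[i][j] for j in range(n) if j != i)[n_neighbor - 1]
           for i in range(n)]
    a = sum(nth) / n
    # Zero the diagonal: the similarity graph has no self-edges.
    return [[0.0 if i == j else math.exp(-d[i][j] ** 2 / a ** 2)
             for j in range(n)] for i in range(n)]
```

Because a is set by a local neighborhood scale rather than a global constant, distances comparable to typical nearest-neighbor spacing map to similarities of order one, while large distances decay to essentially zero.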
\n\nFigure 1: The 2000 data points (left), and the three most pronounced hierarchical levels of clustering (right). At r=353 the three spirals form one cluster (figure a). This cluster splits at r=354 into two (figures b1, b2), and into three parts at r=368 (figures c1, c2, c3). The background points form isolated clusters, usually of size 1 (not shown). \n\nFigure 1 shows 2000 data points in the Euclidean plane. In the stochastic stage of the algorithm we used only 200 iterations of graph contraction, during which we estimated for every pair i-j the value of r which satisfies p^r_ij = 0.5 (see Section 2.2). As expected, subsequent partitions are typically identical or differ only slightly from each other (Figure 2). The variation between subsequent partitions was measured using the 10 largest parts (K = 10, see Section 2.3). The results did not depend on the exact value of K, since the sum was dominated by its first terms. \n\nAt low r values (partition into a small number of components) a typical partition is composed of one giant component and a few tiny components that capture isolated noise points. The incorporation of these tiny components into the giant one produces negligible variations between subsequent partitions. At high r values all the components are small, and therefore the variation between subsequent partitions must decay. At intermediate r values a small number of sharp peaks appear. \n\nThe two highest peaks in Figure 2 are at r=354 and r=368; they mark meaningful hierarchies for the data clustering, as shown in Figure 1. We compare our results with two other methods in Figures 3 and 4. \n\n^5 The vectorial representation of the data points would not be useful even if it were available, since the parametric model is not known (see Section 1). \n\nFigure 2: The variation between subsequent partitions (see text) as a function of the number of components (r). The variation is computed for every integer r (the spacing between peaks is not due to sparse sampling). Outside the displayed range the variation vanishes. \n\nFigure 3: The best bi-partition according to the normalized cut algorithm [9]. Since the first partition breaks one of the spirals, a satisfactory solution cannot be achieved in any of the later stages. \n\nFigure 4: A three (macroscopic) clusters partition by a deterministic single linkage algorithm. The probabilistic scheme avoids the \"bridging effect\" thanks to the small probability of selecting the particular chain of edges. \n\nReferences \n\n[1] Blatt M., Wiseman S. and Domany E., \"Data clustering using a model granular magnet\", Neural Computation 9, 1805-1842, 1997. \n[2] Duda R. and Hart P., \"Pattern classification and scene analysis\", Wiley-Interscience, New York, 1973. \n[3] Hofmann T. and Buhmann J., \"Pairwise data clustering by deterministic annealing\", PAMI 19, 1-14, 1997. \n[4] Jain A. and Dubes R., \"Algorithms for clustering data\", Prentice Hall, NJ, 1988. \n[5] Karger D., \"A new approach to the minimum cut problem\", Journal of the ACM, 43(4), 1996. \n[6] Klock H. and Buhmann J., \"Data visualization by multidimensional scaling: a deterministic annealing approach\", Technical Report IAI-TR-96-8, Institut fur Informatik III, University of Bonn, October 1996. \n[7] Linial N., London E. and Rabinovich Y., \"The geometry of graphs and some of its algorithmic applications\", Combinatorica 15, 215-245, 1995. \n[8] Rose K., Gurewitz E. and Fox G., \"Constrained clustering as an optimization method\", PAMI 15, 785-794, 1993. \n[9] Shi J. and Malik J., \"Normalized cuts and image segmentation\", Proc. CVPR, 731-737, 1997. \n[10] Smith S., \"Threshold validity for mutual neighborhood clustering\", PAMI 15, 89-92, 1993. \n[11] Wu Z. and Leahy R., \"An optimal graph theoretic approach to data clustering: theory and its application to image segmentation\", PAMI 15, 1101-1113, 1993. \n"}, "award": [], "sourceid": 1593, "authors": [{"given_name": "Yoram", "family_name": "Gdalyahu", "institution": null}, {"given_name": "Daphna", "family_name": "Weinshall", "institution": null}, {"given_name": "Michael", "family_name": "Werman", "institution": null}]}