The Geomblog: clustering

Showing posts with label clustering. Show all posts

Thursday, February 25, 2016

Time to cluster !

After many blog posts, much discussion, and some contract back and forth, I'm happy to announce that +Sergei Vassilvitskii and I have signed a contract with Cambridge University Press to write a book on clustering.

"What's that?", you say. "You mean there's more than one lecture's worth of material on clustering?".

In fact there is, and we hope to produce a good many lectures' worth of it.

Information on the book will be available at http://clustering.cc. It currently forwards to my blog page collecting some of the posts we wrote, but it will eventually have its own content.

Now the painful part starts.

Wednesday, January 18, 2012

Upcoming deadline: Multiclust 2012

One of the posts queued up in the clustering series is a note on 'metaclustering', which starts with the idea that instead of looking for one good clustering, we should be trying to explore the space of clusterings obtained through different methods and pick out good solutions informed by the landscape of answers. While the concept has been around a while, the area has attracted much more interest in recent years, with two workshops on the topic at recent data mining conferences.

I'm involved with the organization of the latest incarnation, to be held in conjunction with SDM 2012 in April. If you have

unpublished original research papers (of upto 8 pages) that are not under review elsewhere, vision papers and descriptions of work-in-progress or case studies on benchmark data as short paper submissions of up to 4 pages

then you should submit it ! The deadline is Jan 25.

Tuesday, November 01, 2011

Large-Data Clustering Part I: Clusters of clusters

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

I don't know why, but I've been struggling with a series of posts on large-data clustering for a while now. Distilling out the key ideas in a way that is intuitive and yet rigorous has been trickier than I thought. But here goes ...

Let's consider a prototypical clustering algorithm: $k$-means. In a $k$-means iteration, you have to scan each point and determine its nearest cluster center. Then you have to organize all the points that share a cluster center and compute a new one by averaging. Each iteration takes $nk$ time, and this isn't even accounting for high dimensions.

This algorithm doesn't scale well to large data, needless to say. But clustering, like most data mining tasks, is needed most acutely when we're dealing with large data. So what do we do ?

In the next series of posts, I'll try to walk you through some of the key principles that get used when designing large data clustering algorithms. As always, the twin principles of sampling and streaming will be the keys to the kingdom.

Clusters can be formed by clustering clusters.

One of the ideas that makes large-data clustering tractable is that you can cluster clusters (say that three times quickly, will ya ?). $k$-means is a particularly good example of this. I can think of a cluster center as a "representative weighted point" and discard the data associated with it. This is critical, because now I can just worry about the cluster centers themselves, and repeat.

The advantage of this is locality. I can take small chunks of data and cluster them efficiently. Then I can combine the cluster reps from each chunk to get a smaller set of reps, and so on. Many of the "standard" large-data clustering methods (BIRCH, CLARA, CLARANS and the like) are based on this principle. As a concept, it's sound. But does it guarantee anything ?

The catch of course is error propagation. How do I make sure that in this repeated (hierarchical) clustering process, I didn't commit to a bad clustering early on because of locality ? It seems like I could pay a lot of error at each level, limiting my ability to group data sets into small chunks (because small chunks mean many levels).

It turns out that there are clever ways around this problem by careful merging. The first example of this is a paper from 1997 by Charikar, Chekuri, Feder and Motwani. They look at the $k$-center problem in a general metric space (minimize the maximum radius) and their technique illustrates this principle well.

Their algorithm is quite elegant, and works in phases. In each phase, they first merge cluster centers together to ensure that all resulting cluster centers are far apart. Then they insert new points into clusters as long as they don't get too big, or create new clusters. This happens as long the number of clusters stays below $k$. When it crosses $k$, the phase ends, and then the process repeats.

Note that the algorithm reads in points and processes them on the fly in a single "pass". The reason it works inspite of the repeated mergings is because of a nifty property of the $k$-center algorithm (which lies at the heart of the Gonzalez 2-approximation for $k$-center):

If you can produce $k+1$ points that are all at distance at least $d$ from each other, then the optimal $k$-clustering has diameter at least $d$.

This is because of a 'quantitative pigeonhole principle': At least two of these points must be in a single cluster, which then has diameter at least $d$.

There are some tricks needed to figure out how much you can expand each cluster when inserting, and how far apart cluster centers should be in the phase after the current one. I'll spare you the details, but they lead to an 8-approximation, which can be improved using a neat randomization trick to a $2e$-approximation.

The above trick works because we can build a witness structure that lower bounds OPT, and so the merging can be done carefully so that error doesn't propagate badly. What happens if you change the cost function to the $k$-median though, where the witness isn't as straightforward ?

An excellent source here is the work by Guha, Meyerson, Mishra, Motwani and O'Callaghan. This paper, which combines prior work on both the theory and practice of $k$-median clustering, presents the following idea:

Break the data set into pieces. Cluster each set "loosely", using many more than $k$ clusters. Weight each cluster by the number of points assigned to it. Cluster these weighted cluster centers to find the desired $k$ clusters.

The key observations they make are:

The total cost of clustering the pieces isn't too much larger than the overall clustering cost (so it gives a good lower bound). This follows from the triangle inequality.
The new "weighted" instance admits a solution that again is not too much larger than the optimal cost (is actually within a factor of 6 for any partition). This again follows from the triangle inequality.

The last piece of this puzzle is: how to cluster each set "loosely" ? They propose using a bicriterion method, whose cost is compared to the optimal $k$-median cost, but is allowed to use more medians. Such bicriterion methods are often easier to design.

Note that there's nothing stopping them from using their own algorithm recursively to cluster the weighted cluster centers. What's neat about this is that it can be adapted to a streaming setting using another standard trick.

Specifically, if you have an algorithm that breaks things into pieces and then recombines them (a two-level scheme), you can first recursively apply it to get a multi-level hierarchical scheme, and then you can implement this in a streaming setting by updating (potentially) an entire side of the computation tree whenever a new item appears. This can also be viewed as a lazy implementation of the hierarchical strategy.

Wednesday, September 07, 2011

ECML-PKDD Conference Report: Part II

My student Parasaran Raman continues his conference report from ECML-PKDD in Athens. For his first post, see here.

Christopher Bishop gave an excellent plenary talk on the first day about the need to embrace uncertainty. He talked about the Kinect technology that uses infra-red camera and sensors to measure distance of aperson from the screen. Instead of modeling the problem as a motion tracking problem, the folks at MSR started to look at the challenge as a pattern recognition problem. Each frame is now to be cold-started and understood as a picture instead of tracking the motion in a video sense. The goal now is to classify sets of pixels enclosed inside an axis-aligned rectangle. The training data is collected as a series of motion capture videos by letting a test subject wear sensors on the body a la hollywood-animation style, and is combined with some synthetic data, corrupted artificially since these are “too pure”. The training set consists of a whopping 1 million examples of body positions .

They then use a decision tree to classify the pixels into one of 31 predefined body parts by maximizing an information gain function. Another sub-goal that Bishop talked about was to identify human joints. For this, they use a kind of mean-shift technique, applying a simple Gaussian smoothing function on joints.

The second half of the talk revolved around model-based ML; where Bishop recommends constructing model equivalent for algorithms. He says and I quote “Don't look at the algorithms; look at the models”. He talked about an application where the goal is to rank players who play each other in a common environment. They have a new rating system TrueSkill, that extends the popular Elo rating system [which typically only works in a two player game like chess] to a multi-player team games. TrueSkill performs a kind of Bayesian ranking inference on the graphs by local message passing .

He also introduced Infer.NET, a framework for running Bayesian inference in graphical models.
[http://research.microsoft.com/en-us/um/cambridge/projects/infernet/]. This looks like a cool tool to play with, especially since it provides a single platform for classification and clustering problems.

The plenary talk on day two was by Rakesh Agarwal. In a talk that focused on a very interesting application, Rakesh introduced the idea of using data mining towards enriching education. The key idea is to take concepts from textbooks and map it with a relevant Wikipedia article and then induce relationship between concepts.

The goal is to enrich sections in the textbooks by providing recommendations for illustrative pictures to go with the text. Dispersion [which is a measure of how unrelated the concepts are] and syntactic complexity play a big role in deciding “which picture should go where”. Since search engines fail when given long key words, one challenge is to find key concepts in a section that one can then match up to the most similar article in Wikipedia. The underlying assumption was that for most high school textbook sections will have a associated Wikipedia article.

Rakesh showed results on applying this to a huge word corpus from NCERT, who draft the high school textbook in India for most schools. He also talked about computing an immaturity score to see if a section is well-written and not surprisingly, subjects like physics and chemistry scored over commerce and economics!

To summarize, two things that go into solving the problem of augmenting textbooks with images are “Comity” [where they identify lots of short key queries by concatenate 2 or 3 concepts important concepts at a time] and “Affinity”. Comity ransk images by computing a relevance score of each image in context of the concepts picked out from a section and affinity identifies relevant webpages and computes an importance score. Another challenge is that these techniques cannot operate one section at a time since the textbooks would probably not like the images to repeat!

Tuesday, September 06, 2011

ECML-PKDD conference report: Part I

My student Parasaran Raman is in Athens attending ECML-PKDD and associated workshops, and will be updating this blog regularly with details from the conference. His posts are edited lightly for format and to add links.

I am in Athens [+1 to location-driven-research] to attend ECML PKDD 2011 and I am at the second MultiClust workshop today. This workshop is held in conjunction with ECML PKDD 2011.

The first talk of the morning was a invited talk by Michael Houle on "Combinatorial approaches to clustering". Michael Houle began the talk with how in high dimensions, all out intuitions about space go for a toss. One example that he gave illustrated how most of the points are concentrated along the boundaries in high dimensions. The talk revolved around how similarity measures between partitions cannot be fully trusted though there have been a range of combinatorial and spatial measures that have been introduced, since each measure suffers from some kind of drawback. He also talked about constructing secondary similarity measures [“Secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure”] and how they can get around the curse of dimensionality.

One interesting question that was thrown out was "Can we do a fully automated clustering?", i.e., can we convert similarity functions into ranking on the neighbors, assume no knowledge of distribution of the data, and automatically pick 'k'. The talk moved on to approaches towards computing quality of a partition by computing a significance score for each point which depends on the correlation between a member of the set to the members in neighboring sets and other members of the same set, along with the size of cluster and the feature set. Once we know an object's significance, we can use this to change the shape of a cluster, by adding points and removing them.

Bart Goethals gave another invited talk titled “Cartification: from similarities to item set frequencies”. He talked about doing a k-nearest neighbors for all data points to find the right centers of clusters. The idea is that true 'centers' occurs in many knn baskets.

There were many interesting ideas out of the other talks at the workshop. I will highlight a few of them:

One of the first talks about “Subjectively interesting alternate clusters”, uses an information theoretical framework to find interesting alternate clusters.
Jilles Vreeken gave a very good talk about the current approaches in pattern mining and how we can use that knowledge in data mining. On this talk titled, “where pattern met subspace cluster”, he highlighted the similarities between subspace clustering and pattern mining where the goal is to find informative local structures.
The talk on “Factorial clustering” was about using latent variables to generate multiple clusterings. One question that they wanted an answer for was a way to compute the right distance between partitions and I said, "Hey ! Use ours!”.
Another talk on “Browsing clustering alternatives" outlined constructing a Merge-Split tree structure on the space of partitions and enable a user driven browsing of partitions.

My talk was the final one of the session. I talked about some recent work with Suresh Venkatasubramanian and Jeff Phillips on exploring the landscape of clusterings by sampling from the space of partitions proportional to a quality measure. It was generally good to see a lot of head-nods during the talk. A couple of people from the industry who were present agreed to the fact that there exists data that is multi-clusterable and the hard part was finding them. So presenting a landscape where you can pick from looked like a useful idea!

The day ended with a rooftop reception with a magnificent view of the Acropolis and the temple of Zeus! I had very useful dinner conversations with Michael Houle and Thomas Seidl about general problems in the area of meta-clustering and on how to compute distances metrics on partitions. We chatted about combinatorial distances and spatial distances, and secondary level distances, where one constructs a ranking on top of a existing distance between many members.

Monday, July 18, 2011

Axioms of Clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

``I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it..."

-- Justice Potter Stewart in Jacobellis v. Ohio.

What makes one clustering better than another? To answer that question we have previously assumed that a well motivated objective function was given to us. Then the situation is easy, we compute the value of the objective and decide accordingly. In practice, however, clustering is often an intermediate step in a longer chain of computations, and there is not specific motivation for one objective function over another. Debating the pros and cons of the multitude of possible objective functions can easily take hours. Instead, we can further the ``I know it when I see it'' intuition by writing down the properties that any good clustering objective should satisfy and see if this axiomatic approach guides us towards the right solution.

This approach was undertaken by Kleinberg in his work on clustering axioms. A clustering function is one that takes a set of points (and their pairwise distances) and returns a partition of the data. The function should abide by the simple axioms:

Scale-Invariance: If all distances are scaled by a constant factor, the clustering should not change. Put another way, measuring the distance in miles or kilometers (or nanometers) should not change the final clustering partition.
Richness: Depending on the pairwise distances, any partition should be possible. For example, the clustering should not always put two specific points $x$ and $y$ together, regardless of the distance metric.
Consistency: If the pairwise distance between points in a cluster decreases, while the distances between a pair of points in different clusters increases, the clustering should not change. Indeed, such a transformation makes clusters `tighter' and better separated from each other, so why should the clustering change?

Each of the axioms seems reasonable, and it is surprising that there is no clustering function that is consistent with all three axioms ! The trouble lies with the last axiom: by changing the distances we require that a clustering stay the same, but an even better clustering may emerge. For example, consider points lying in several ``obvious'' clusters, and proceed to move one of these clusters out to infinity. As we move the points further and further out, the resulting dataset appears to be two-clusterable (see the image below).

This leads to a slightly different tack at this problem. Instead of thinking about a specific clustering, or a specific partition of the data, instead we try to define an objective function, so that the optimum clustering under that objective behaves in a reasonable manner. This is the approach adapted by Ackerman and Ben-David. Let $m$ be a function measuring the quality of the clustering. As before, Scale-Invariance requires the measure $m$ to be independent of the units on the distances and Richness requires that every clustering can be the optimum (under $m$) clustering.

The Consistency axiom is the one undergoing a subtle change. Instead of requiring that the clustering stay the same if such a perturbation to the distances occurs, we only require that the score, or measure of the clustering {\em improve} under such a transformation. This breaks the counterexample above -- while the score of a 1-cluster decreases as we move the distances, the score of a 2-clustering decreases even more and surpasses the score of the 1-clustering. Indeed, the authors demonstrate a set of different clustering metrics that are consistent under this set of axioms.

-- Sergei Vassilvitskii

Monday, June 13, 2011

Clustering with outliers

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

(ed. note: I'm happy to welcome Sergei Vassilvitskii (of k-means++ and mapreduce-algos fame, among other things) to this series. Sergei has great insights on clustering, and we decided to join forces as we continue this series. Here's his first post, on clustering with outliers)

When abstracting the clustering problem, we often assume that the data is perfectly clusterable and so we only need to find the right clusters. But what if your data is not so perfect? Maybe there's background noise, or a few random points were added to your data by an adversary. Some clustering formulations, in particular k-center or k-means, are not stable -- the addition of a single point can dramatically change the optimal clustering. For the case of k-center, if a single point $x$ is added far away from all of the original data points, it will become its own center in the optimum solution, necessitating that the other points are only clustered with $k-1$ centers.

It is easy to appeal to intuition for the definition of outliers, but formally specifying when a point becomes an outlier is a non-trivial task. Consider for example a set of points generated by a single Gaussian with small variance. Now we pick one data point and slowly start moving it away from this set. We agree that originally that point is not an outlier; as we move it out to infinity, it certainly becomes one. But where exactly did it cross this threshold?

There are two ways to define clustering with outliers. First we can simply chose to ignore some fraction of the points. In the case when there are no outliers, ignoring a small (say 1%) fraction of the points should not materially affect the final clustering results. In the case that outliers are present, however, the algorithm has the choice of ignoring them in computing the final metric. Consider again the example above; although we don't know when the point we are moving becomes an outlier, in some sense we don't care because we are always looking at the optimum clustering without it.

A generalization of the above approach assigns a cost for not classifying a particular point and gives a budget for unclassified points. Thus, if we know ahead of time that some points should be classified and some can be ignored, one can guide the algorithm to focus on the points that matter. Of course assigning each point an identical weight leads exactly to the first formulation.

Algorithms

The ability to ignore a fraction of the points seems like it would make the problem easier, since the algorithm now has explicit permission to 'sweep things under the rug.' It actually makes the problem harder, since the optimum solution can also do this. Put another way, the algorithm needs to not only find the right clustering of points, but also decide which points to cluster in the first place.

There are two extreme approaches to the latter problem: one can first decide which points are outliers and then run the baseline clustering algorithm, or one can chose not to ignore any of the points and decide which once are outliers given a clustering. Unfortunately neither one of these works well -- the first scales poorly, as the number of choices for outlier points grows exponentially; the second leads to poor solutions -- as we saw before ignoring the outliers can materially change the final outcome.

The key to finding clustering algorithms that handle outliers is to determine which points will be ignored during the algorithm execution, and not before or after. For an example, consider the $k$-center problem. Recall we are trying to find a set of centers $\mathcal{C}$ and a radius $r$ so that every point is within $r$ of some point in $\mathcal{C}$. There is a lower bound that says unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find better than a 2-approximation on the radius. There are many ways to achieve a 2-approximation: we previously saw the furthest point algorithm by Gonzales: repeatedly choose the point that is furthest away from the current set $\mathcal{C}$ and add it as s center. The algorithm breaks down if there are outliers --- we will surely choose an outlier as a cluster center.

Here is an alternative algorithm for $k$-center:

Guess the optimal radius $r$.

While there are uncovered points, chose one point $p$ as a center, and mark all points within distance $2r$ as covered.

It is easy to see that in every iteration we mark as covered all points in the optimal cluster that $p$ belongs to. If after $k$ iterations there are still uncovered points remaining, it must be that the original guess of $r$ was incorrect.

To adapt this algorithm to the setting of outliers, we make two changes. First, we increase the slack on the radius from a factor of 2 to a factor of 3 (this turns out to be necessary; unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find a better than a 3-approximation to the radius). Second, instead of selecting a point arbitrarily, we select the point that has the maximum number of uncovered points within radius $r$ from it.

Thus the final algorithm is as follows:

Guess the optimal radius $r$.

Repeat $k$ times: select the point $p$ that has the largest number of uncovered points within distance $r$. Mark as covered all points within distance $3r$ of $p$.

The points that are left over in the end are precisely the outliers: they are both far away from any individual cluster center, and are not well clustered themselves (otherwise one of them would have been chosen as a potential cluster). Charikar et al show that this algorithm attains a 3-approximation: if the optimum can cover all but $\ell$ points with radius $r$, then the algorithm above will cover all but $\ell$ points with radius $3r$.

Saturday, June 11, 2011

Clustering as compression

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Numquam ponenda est pluralitas sine necessitate.
-- Plurality must never be posited without necessity -- William of Ockham.

Clustering as compression generalizes mixture model clustering, as well as distance based methods. The central idea in this view of clustering is Occam's razor. Can you express the data as concisely as possible in a lossy manner ?

The devil in the details here is the definition of "concisely". There are two different notions of bounded space that are common, but they are actually not that different.

The information-theoretic approach.

The information theoretic approach dates back to Shannon. Consider sending information across a pipe. How big on average does the pipe need to be in order to transmit the information ? One of Shannon's celebrated results is that this can be quantified by the mutual information of the channel, or the amount of shared information contained between the input and output.

Thinking of mutual information as a measure of space is quite easy once you get used to it. It's helpful to consider the boundary cases. If the channel merely outputs a random flip of the input, then the output has no information about the input and the mutual information is zero, corresponding to the idea you don't need to transmit much information through the channel.

On the other hand, if the output is a deterministic function of the input, then the mutual information of the channel equals the entropy of the source, corresponding to the idea that the channel needs to send over a complete description of the input to compute the output correctly.

What does this have to do with clustering ? Think of the channel as the process by which the input points are mapped to clusters. In particular, the channel can be encoded by the conditional probability $p(t | x)$ of assigning a point x to cluster t. Once we have that, the mutual information $I(T;X)$ between the set of clusters T and the set of point X measures how concisely T represents X.

There's an intuitive explanation for this as well. Suppose that points were assigned to exactly one cluster each. Then I(T;X) equals the entropy of T, which is just the average number of bits needed to write down a description of T.

But what about the quality of the embedding ? Given some distance function on the input and output domains, you can compute the expected error as the sum of p(t | x) p(x) d(x,t) over all x,t. The study of how this varies with I(T;X) yields what's known in information theory as rate-distortion theory, and for an appropriate choice of the distance measure $d$ leads to the well known Information Bottleneck Method.

Kolmogorov complexity.

The second approach takes the Kolmogorov view: that compactness of description can be expressed as the size of the smallest program expressing the data. In many ways, the Kolmogorov complexity of a string serves as a point-wise proxy for entropy, and the equivalent of mutual information can be defined as well (for more on this, see the note by Grunwald and Vitanyi)

While the rate-distortion framework can be extended to the Kolmogorov setting, it's a lot more technical, and is hard to explain in a short note such as this. An easier application of the Kolmogorov method is due to Cilibrasi and Vitanyi, who showed how to do clustering by defining a distance metric based on compression (roughly, as the complement of a normalized mutual information), and then doing hierarchical clustering.

In summary, the main value of the compression-based viewpoint is the idea of a space-quality tradeoff, as opposed to a more artificial k vs quality tradeoff. What's nice about the information-theoretic approach is that the units for space and quality are the same, and so you don't need ugly balancing constants to compare them.

p.s There are those who might argue that the information-theoretic view of clustering is merely a kind of mixture density estimation. In a sense, this is true: specifically, the information bottleneck method can be viewed as estimating a mixture of multinomials if we reinterpret the Kullback-Leibler divergence as the associated Bregman divergence for the multinomial family . This in turn connects to methods like LDA and probabilistic PCA. However, I'd argue that the viewpoint of compression is quite different, even though the end-result looks similar, and it's the viewpoint that's valuable here.

Monday, October 25, 2010

FOCS Day 1: Clustering

A long time ago I realized that if I took blogging duties too seriously, I'd be exhausted trying to report on talks that I attended. So I'm not going to even try. I will talk about a pair of talks on clustering though.

A core question of clustering is this: learn a mixture of $k$ $d$-dimensional Gaussians with different means and covariance matrices from samples drawn from the mixture.A long line of research, starting with a paper by Sanjoy Dasgupta back in 1999, produced variations on the following result:

If the Gaussians are "well separated" and have "nice" shapes, then a "small number" of samples suffices to determine the parameters of the system.

Here, "parameters" would be means, covariance matrix elements, and relative weights. As time went on, the separations got smaller, the niceness reduced, and the "small number" got better.

These two papers take a slightly more general tack, in different ways. They both eliminate the conditional nature of prior bounds ("if the Gaussians are well separated"), instead replacing this by an unconditional bound ("poly in a parameter capturing the separation").

The first, by Moitra and Valiant (both Ph.D students!) measures the separation between the distributions via their statistical distance, and gives a bound for learning in terms of the reciprocal of this distance. Their main insight is to project the distributions onto a few directions (i.e 1D), learn what they can, and then recurse on the data that doesn't appear to be well-separated enough to learn properly. This takes a lot of technical skill to do correctly.

The second paper, by Belkin and Sinha, takes a different approach. They pay more attention to the parameters of the distribution. This can be a problem in general because closeness in distribution does not imply closeness in parameter space. However, they show that the set of all distributions with fixed moments forms a nice algebraic variety, and that the above problem can be quantified by determining how closely this variety folds in on itself (intuitively, if the variety intersects itself, then two far away parameter sets yield the same distribution). Using some standard (but not trivial) results in algebraic geometry, they can bound the complexity of the resulting structures (so that they can estimate the number of samples needed to get to a distiinct region of the space). There's more detail involved in the estimation of parameters and moments and so on.

While both papers are impressive achievements that essentially resolve the theoretical aspects of learning mixtures of distributions, I came away not entirely sure how to compare them. On the one hand, the Moitra/Valiant paper is more explicit and is likely to be less "galactic" in nature. But it appears to be limited to Gaussians. The Sinha/Belkin paper is more general, but uses more abstract machinery that is unlikely to be useful in practice.

So what remains is stilll an effective way of learning Gaussians or other families of distributions, and also a better understanding of how these bounds relate to each other.

Friday, May 14, 2010

Choosing the number of clusters III: Phase Transitions

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

We've seen two generic methods for determining the number of clusters in a data set thus far: the elbow method, and the ROC-curve/diminishing returns method. Now I want to talk about a more "dynamic" approach to identifying the number of clusters in a data set.

These ideas center around a 'simulated annealing' view of the clustering process. Imagine, if you will, a set of highly excited atoms all bouncing around in a state in which all are indistinguishable. If you now start cooling the atoms, they will start falling into local energy traps over time. If the cooling is done on the right schedule, the atoms will find a minimum energy configuration at each temperature, all the way down to absolute zero, where each atom is in its "own state", so to speak.

What does this have to do with clustering ?

The setup works something like this. Imagine a "soft" clustering assignment of points to clusters (where each point assigns a fraction of its membership to each cluster). We can write down the distortion D associated with this clustering by computing the weighted distance of each point from the various cluster centers. We'd also like the clustering to be more or less deterministic, so we can introduce a penalty team for the level of disorder in the clustering (usually captured by the conditional entropy H associated with the assignments of points to clusters).

Now we want to minimize the distortion subject to some constraint on the entropy, so in good old Lagrangian form, we write down a function of the form

F = D - TH

where T is a Lagrange parameter and F is the overall function to be minimized.

But here's the kicker. You can think of F as the free energy of a statistical ensemble, where D represents its energy, and H represents its entropy, and T is the "temperature" of the system. The T=0 limit captures the idea that distortion is all that matters, and so each point is assigned a cluster of its own. The "T large" limit prioritizes the entropy over the distortion, encouraging us to place all points in a single cluster.

So the annealing process corresponds to minimizing F while decreasing T steadily. It turns out, using magic from the realm of statistical physics, that for any temperature T, the probability of assigning point x to cluster y is proportional to exp(-d(x,y)/T), and as T decreases, the probability of assigning a point to any cluster except the very nearest one decreases dramatically.

All of this is very fascinating, and provides a "smooth landscape" in which to understand how points get assigned to (soft) clusters. Much of the work I'm exploring right now is in trying to understand the space of soft clusterings of data. But that's a digression.

What's really interesting is that if you look at the evolution of the probabilistic assignments, you start seeing phase transitions. As T starts off high, all the points are in the same cluster. At a specific temperature T, a phase transition occurs, and the data starts splitting (based on the assignments) into more than one cluster.

How can you tell when this happens ? Here's a very elegant technique, first proposed by Kenneth Rose in his work on deterministic annealing. In the high-T regime, where every point is in the same cluster, the cluster center location can be computed by solving a simple convex optimization. As the process evolves, the positive definite matrix defining the optimization starts losing its positive definiteness, till some point when one of its eigenvalues goes to zero. This point can be computed analytically, and yields a specific temperature value at which the first clusters start to emerge.

Kenneth Rose has some nice examples illustrating this behaviour. As time goes one, more and more phase transitions start to appear, as more and more clusters start to emerge. What you end up with is a hierarchy of clusterings that end with the trivial clustering where all data lie in separate clusters.

This idea has been developed further with the information bottleneck method, which replaces both the distortion and entropy terms by terms involving the mutual information of the assignments. The free energy paradigm works the same way, although now the phase transition points can't be computed analytically (I'll have more to say about the information bottleneck method)

What's entirely weird is that the phase transitions still happen (and I want to stress this point), data sets will split up into different numbers of clusters at the same transition point ! We wrote a paper a few years ago that proposes a particular "temperature" to look for phase transitions in, and lo and behold, we were able to recover the "correct" number of clusters from planted data sets by watching the data split at this point. I'm still not sure why this happens, but it was quite interesting, and yielded one way of identifying "natural clusters" in the data.

Phase transitions and the like are the coin of the realm in bifurcation theory. While a proper understanding of this area is well beyond my skill level, there has been recent work on trying to analyze the information bottleneck from a bifurcation theory perspective, specifically by trying to isolate and classify the different kinds of critical points that emerge in a bifurcation diagram of the clusters as the temperature changes. Albert Parker of the University of Montana wrote a Ph.D thesis on this topic, and I'd love to understand his work better.

Coming up next: Clustering as compression - the information bottleneck and kolmogorov complexity.

Saturday, March 13, 2010

Choosing the number of clusters II: Diminishing Returns and the ROC curve

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

In the last post, we looked at the elbow method for determining the "right" number of clusters in data. Today, we'll look at generalizations of the elbow method, all still based on the idea of examining the quality-compression tradeoff curve.

For the purpose of discussion, let's imagine that we have a plot in which the quality of the clustering (measured anyway you please, as long as 0 means all items in one cluster, and 1 means all items in separate clusters) is measured along the y-axis, and the representation complexity (i.e the space needed) is measured on the x axis, where again 0 corresponds to all items in a single cluster (least representation cost), and 1 corresponds to all items in separate clusters.

The ideal cluster is located at (0,1): cheapest representation and perfect quality. In general, as we use more and more clusters to organize the data, we can trace out a curve that starts at (0,0) and ends at (1,1). For most sane functions, this cuve is concave, and lives above the diagonal x=y.

This curve contains lots of useful information about the behavior of the data as the clustering evolves. It's often called the ROC curve, in reference to the curve used to capture the tradeoff between false positive and false negatives in classification. But what can we glean from it ?

We can ask what the curve would look like for "unclusterable" data, or data that has no definite sense of the "right number" of clusters. Such data would look self-similar: you could keep zooming in and not find any clear groupings that stood out. It's not too hard to see that data that looked like this would have a ROC curve that hugs the diagonal x=y, because there's a relatively smooth tradeoff between the quality and compressibility (so no elbow).

Conversely, data that does appear to have a definite set of clusters would try to get closer to the (0,1) point before veering off towards (1,1). This suggests a number of criteria, each trying to quantify this deviation from unclusterability.

you could measure the area between the curve and the diagonal. This is (essentially) the AUC (area-under-curve) measure.
You could measure the closest distance between (0,1) and the curve. This also gives you a specific point on the curve, which you could then associate with the "best" clustering.
You could also find the point of diminishing returns - the point of slope 1, where the clustering starts costing more to write down, but yields less quality. I've used this as the start point of a more involved procedure for finding a good clustering - more on that later.

The ROC curve gives a slightly more general perspective on the elbow method. It's still fairly heuristicy, but at least you're trying to quantify the transition from bad clustering to good in somewhat more general terms.

Ultimately, what both the ROC curve and the elbow method itself are trying to find is some kind of crucial transition (a phase transition almost) where the data "collapses" into its right state. If this sounds like simulated annealing and bifurcation theory to you, you're on the right track. In the next part, I'll talk about annealing and phase-transition-based ideas for finding the right clustering - all fairly ingenious in their own right.

(ed. note: comments on the last post have convinced me to attempt a post on the nonparametric approaches to clustering, which all appear to involve eating ethnic food. Stay tuned...)

Sunday, March 07, 2010

Choosing the number of clusters I: The Elbow Method

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

It's time to take a brief break from the different clustering paradigms, and ponder probably THE most vexing question in all of clustering.

How do we choose k, the number of clusters ?

This topic is so annoying, that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development.

(ed. note. The Wikipedia page on this topic was written by John Meier as part of his assignment in my clustering seminar. I think he did a great job writing the page, and it was a good example of trying to contribute to the larger Wikipedia effort via classroom work)

With the exception of correlation clustering, all clustering formulations have an underconstrained optimization structure where the goal is to trade off quality for compactness of representation. Since it's always possible to go to one extreme, you always need a kind of "`regularizer"' to make a particular point on the tradeoff curve the most desirable one. The choice of $k$, the number of clusters, is one such regularizer - it fixes the complexity of the representation, and then asks you to optimize for quality.

Now one has to be careful to see whether 'choosing k' even makes sense. Case in point: mixture-model clustering. Rather than asking for a grouping of data, it asks for a classification. The distinction is this: in a classification, you usually assume that you know what your classes are ! Either they are positive and negative examples, or one of a set of groups describing intrinsic structures in the data, and so on. So it generally makes less sense to want to "`choose"' $k$ - $k$ usually arises from the nature of the domain and data.

But in general clustering, the choice of $k$ is often in the eyes of the beholder. After all, if you have three groups of objects, each of which can be further divided into three groups, is $k$ 3 or 9 ? Your answer usually depends on implicit assumptions about what it means for a clustering to be "`reasonable"' and I'll try to bring out these assumptions while reviewing different ways of determining $k$.

The Elbow Method

The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. It's pure simplicity, and for that reason alone has probably been reinvented many times over (ed. note: This is a problem peculiar to clustering; since there are many intuitively plausible ways to cluster data, it's easy to reinvent techniques, and in fact one might argue that there are very few techniques in clustering that are complex enough to be 'owned' by any inventor). The idea is this:

Start with $k=1$, and keep increasing it, measuring the cost of the optimal quality solution. If at some point the cost of the solution drops dramatically, that's the true $k$.

The intuitive argument behind the elbow method is this: you're trying to shoehorn $k$ boxes of data into many fewer groups, so by the pigeonhole principle, at least one group will contain data from two different boxes, and the cost of this group will skyrocket. When you finally find the right number of groups, every box fits perfectly, and the cost drops.

Deceptively simple, no ? It has that aspect that I've mentioned earlier - it defines the desired outcome as a transition, rather than a state. In practice of course, "`optimal quality"' becomes "`whichever clustering algorithm you like to run"', and "`drops dramatically"' becomes one of those gigantic hacks that make Principle and Rigor run away crying and hide under their bed.

The Alphabet-Soup Criteria

So can we make the elbow method a little more rigorous ? There have been a few attempts that work by changing the quantity that we look for the elbow in. A series of "`information criteria"' (AIC, BIC, DIC, and so on) attempt to measure some kind of shift in information that happens as we increase $k$, rather than merely looking at the cost of the solution.

While they are all slightly different, they basically work the same way. They create a generative model with some kind of term that measures the complexity of the model, and another term that captures the likelihood of the given clustering. Combining these in a single measure yields a function that can be optimized as $k$ changes. This is not unlike the 'facility-location' formulation of the clustering problem, where each "`facility"' (or cluster) must be paid for with an 'opening cost'. The main advantage of the information criteria though is that the quantification is based on a principled cost function (log likelihood) and the two terms quantifying the complexity and the quality of the clustering have the same units.

Coming up next: Diminishing returns, the ROC curve, and phase transitions.

Thursday, January 07, 2010

Mixture Models: Classification vs clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

We're all family.

k-means is arguably the most popular clustering algorithm (and one of the top 10 data mining algorithms to boot). It has a close relative though (some might say a generalization) in the expectation-maximization algorithm (or EM).

The EM algorithm represents yet another paradigm shift in how we think about clustering, to the point where it's arguable whether we are clustering anything at all. What the EM algorithm really does is density estimation: it assumes the data being clustered is generated by picking one of k distributions according to some probability distribution (the mixture) and then sampling from this distribution.

In other words, the definition of a cluster is not in terms of relationships betweens pairs of data points any more: it's in terms of how a collection of points behave together as a group, as representative samples from a distribution.

It's not particularly hard to see that the k-means algorithm is really a special case of EM on Gaussians(if I might be permitted an indulgence, a "tropical" limiting case of EM). In general though, the EM algorithm is an alternating optimization that finds the maximum likelihood parameters for distributions that capture the data. In the "E" step, it fixes the current parameters (the current hypothesis for the distributions) and computes the expected (log) likelihood of the data by first estimating the "cluster membership probabilities" of assigning a point to a "cluster" or distribution, and then using this to estimate the overall likelihood function. The "M" step then finds the set of parameters that maximizes the likelihood function. In k-means language, the first step figures out which clusters to assign points to, and the second step assigns new centers to each cluster.

The EM algorithm is very powerful. It's generally known that if the distributions under consideration are from an exponential family, then this method is easy to implement, because the E and M steps reduce to simple operations that have closed forms. Via the usual Bregman connection between divergence functions and distributions, it's also possible to give a purely geometric interpretation to the EM method akin to k-means, but with a different distance structure.

What's even more fascinating is the information-geometric perspective. There's a wonderful Riemannian world in which distributions are points on a manifold with coordinate systems given by their (natural) parameters, and where various information distances can be related to affine connections on said manifold. Too much to go into here, but the key insight is this: the E and M steps of the EM algorithm can be seen as projection operations in primal and dual manifolds, so the EM algorithm really looks like a primal-dual procedure. Way cool !

So is this not clustering ? For a detailed argument along these lines, you should read Hal Daume's analysis (with a nifty example). My take on this is that while density estimation is indeed different to the problem of clustering, I think of mixture-model-based clustering as just a much more "volume-based" approach to doing clustering, you expect not that point associate with each other per se, but that the ensemble of points itself satisfies some global properties.

I'll develop this idea further when I talk about information-theoretic clustering: to me, density-based clustering suggests a new abstraction for thinking about what a clustering ought to mean in the absence of even metric structure, and allows us to make pleasant detours into discrepancy, quasi-randomness and kolmogorov complexity. Stay tuned....

p.s apologies for the long hiatus.

Wednesday, October 28, 2009

Clustering interlude: Time series

In the absence of an actual post on clustering, I encourage all of you to go and read Sorelle's (or should I say ***SORELLE***'s) post on time series clustering.

Sunday, August 30, 2009

Spectral Clustering.

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

I don't like you, and I like them, but maybe if we change, we can get along.

Spectral clustering is not clustering: it's actually a metric modification method. It's another way of capturing things that are close versus things that far, using a more global method.

We saw how correlation clustering could be used to capture both the ``I don't like you'' and ``I like them'' aspects of clustering by assigning positive and negative weights to individual element pairs. However, it's not easy to come by such a labelling of pairs. More often than not, all we have is a distance function. We could threshold it, declaring all distances within some parameter r to be positive, and all distances larger to be negative, but this is completely arbitrary and depends on the choice of r.

Let's think of the two-class problem for now: we merely want to divide the points into two clusters. Suppose we were able to guess a function that assigns one label (0) to points in one cluster and another label (1) to points in the other cluster. Then we could write down the cost of the clustering (in a correlation sense) by summing up, over all pairs, the difference

$f_i - f_j$ . In fact, just to make sure it's always positive, we'll square it, and thus will sum up the function $(f_i - f_j)^2$ . We'll also assume a weight function $w_{ij}$ on the edge $ij$ , so that what we're really summing up is $\sum w_{ij}(f_i - f_j)^2$ .

This is where things start to get very interesting. It's not hard to see that this can be written in term of a variant of the graph Laplacian, treating the f values as a vector. If we now allow the f values to be reals, rather than 0 or 1, we get a classical eigenvalue problem on the Laplacian, in effect determining the second smallest eigenvalue of the Laplacian.

Once we do that, the corresponding eigenvector gives us an assignment for f. we can either merely take positive or negative values of f to group the data, or we can use the f values as a feature representation and recluster the data into two groups based on that.

This is the key insight of spectral ``clustering''. We perform a spectral decomposition of the Laplacian, and take the top k eigenvectors. Those vectors give us a k-dimensional feature represntation of the data in an inner product space, and we can now recluster. The real question is: why should this feature representation help us in any way at all ?

Here's my take on what's really going on. We know that the graph Laplacian is a discrete equivalent of the Laplace-Beltrami operator on a manifold, which in turn (via the heat equation) captures the ``flow'' of material in a manifold. Return for a second to our cut-based view of the distance function. In essence, two points are close if there are many ways of getting from one to the other (i.e the min cut between them is large) and they are far if not. This is a flow-based way of thinking about distances, and in effect, what the Laplacian does is map the original data into a new metric space in which distance is based on flow, rather than just on metric distance.

This intuition can be formalized: if we switch now to the stochastic viewpoint, in which we're doing random walks on the graph, then the commute time between two vertices turns out to be precisely the Euclidean distance (squared: thanks, dsivakumar) between the feature vectors constructed from the eigenvalues of the Laplacian.

... take a deep breath...

There are a number of different concepts swirling around here that are worth keeping track of. The cut-based viewpoint of node similarity in graphs lends itself naturally to a Laplacian-based treatment, which in turn leads us to random walks and the heat equation on manifolds. At a deeper level, the Laplacian of a graph, by virtue of how it acts on functions, tells us something very basic about the structure of the distances involved. For more on this particular aspect, you should check out the discussion 'Can you hear the shape of a drum'.

But it's very important to keep in mind that spectral clustering is not so much a clustering technique as a data transformation method. Although k-means is the clustering algorithm of choice once we get down to the feature representation, other methods could be used as well. Also, as relaxations go, this paricular one (relaxing from the hypercube to the reals), isn't that great: the ``integrality gap'' can be quite high. It turns out that this doesn't matter terribly though.

As an aside, the idea of using the graph Laplacian to define a ``stretched'' distance that looks Euclidean is not limited to this problem. The most "famous" version of this idea is the Laplacian eigenmaps work by Belkin and Niyogi in the realm of manifold learning. There's also work by Guibas, Ovsjanikov and Sun that uses similar ideas to create signatures of shapes. The key idea, once again, is that if you're on a manifold of some kind, the Laplacian gives you a handle on the shape of this manifold, as well as how to ``stretch it out'' with local coordinates to make it look flat (and therefore Euclidean). It's a useful trick to keep in your data analysis toolbox.

For further reading, Ulrich von Luxburg has a great survey on spectral clustering.

Sunday, August 02, 2009

Correlation Clustering: I don't like you, but I like them...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Whether it's k-clustering, or any kind of hierarchical clustering, the world we live in is still the world of ``I don't like you'' clustering, where the geometric landscape is defined by distances between points. Consider the following example:

There is an intuitive sense in which the first clustering is not ``real'' and the second one is: the idea of 'well-separatedness'' is a pervasive component of how a good clustering is perceived. But if we only measure distances between points, and only measure the cost of a clustering in terms of how costly each cluster is, we'll never be able to distinguish between these two examples.

What's needed is a way of declaring likes (similarity) as well as dislikes (distance), and then, critically:

penalizing similar items in different clusters AS WELL AS different items in the same cluster.

By that measure, we'd be able to distinguish the first and second clusterings, because in the first case, presumably elements close to each other that lie in different clusters will make the clustering look more expensive. This point is worth reiterating. Unless we have some way of penalizing mis-separations as well as mis-groupings, we'll always be at the mercy of the tradeoff between k and cost.

Continuing the "clustering via self-help metaphors" theme to these essays, I call this the "I don't like you, but I like them" way of modelling data.

Correlation Clustering

This is where correlation clustering enters the picture. The correlation clustering model is as follows: every pair of elements is assigned a 1 or -1, encoding a similarity or dissimilarity. The goal is to find a clustering (note: no k!) in which any pair of points in a cluster is penalized for being dissimilar, and any pair of points in two different clusters is penalized for being similar. This can be generalized to arbitrary weights: for each pair of elements you assign weights w+ and w-, with the possible caveat that w+ + w- = 1.

So now the goal is merely to minimize the cost of a clustering. The elegance of correlation clustering lies in the natural way that clusters merge or split depending on the number of similar or dissimilar pairs. Do note though that you need to have input data that can be written in that manner: thresholding a distance function will give you the desired input, but is an ad hoc way of doing it, since the thresholding is arbitrary.

There are a number of different algorithms for correlation clustering, and also a very simple one that yields a good approximation guarantee: pick a point at random, and pull in all its neighbors (all points similar to it). Repeat this process with a new unpicked point, until all points have been selected. This algorithm gives a 3-approximation for correlation clustering.

Correlation clustering is also useful as a way of combining clusterings. We'll talk about this later (ed: how many times have I said that !), but the problem of conensus clustering is to ``cluster the clusterings'', or aggregate them into a single average clustering. An easy way to see the connection is this: given a collection of clusterings of a data set, create an instance of correlation clustering with the positive weight for a pair corresponding to the fraction of clusterings that ``vote'' for that pair being in the same cluster, and the negative weight being the fraction of clusterings that ``vote'' for that pair being separated.

Thursday, July 23, 2009

Clustering: Hierarchical methods

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-center, k-median, k-means, k-medoids, .... The list is endless, but that pernicious k comes up everywhere. What can we do about it ? There are really two ways to go:

Figure out the "right" k for a problem. This is a complicated matter, and will be the topic of a later post
Don't choose: give the user a universal representation from which they can decide what k they want.

This latter formulation takes us into the world of hierarchical clustering, the topic of this post.

There are two different ways to think about hierarchical clustering: a representational view and an algorithmic view. The algorithmic view is quite simple: let's try to design a clustering via greedy operations. In the top-down view, we find a good split of the data into two parts, and then recurse on each side. In the bottom-up view, we select two clusters for merging (in the beginning, all items are in separate clusters), and merge our way up.

Perhaps the most well known of the bottom-up methods is the 'single link' clustering algorithm, in which at each step, you merge the two clusters that are closest to each other. If this sounds familiar, it should: this is merely Kruskal's MST algorithm run partially to completion.

The "single-link" method is optimistic: it creates clusters via connected components, assuming that transitivity is enough to construct a cluster. The "complete-link" approach is more pessimistic: rather than defining the distance between two clusters as their nearest pair distance, it defines it as the furthest pair distance, ensuring a clique-like structure for clusters. Merging happens as before.

Other variants that are more robust to noise average the pairwise distances to obtain the clustering distance. With the right choice of distance, these amount to comparing the centroids of clusters.

You'll notice that I didn't actually define a problem that these algorithms solve, keeping in with the grand tradition of clustering :). There is a way of defining a general optimization function that single-link clustering solves optimally. For the others though, although we can define a cost function, we can't show that they're optimal.

So much for the algorithmic view of HC. The representational view goes somewhat deeper.

I've had at least one person (you know who you are) tell me that one of the only two clustering algorithms that worked for them is hierarchical clustering. And indeed, the idea is very seductive. Rather than present the user with a single k-clustering - a snapshot, if you will, of the data - we give them a tree of merges, in which there is one leaf for each object. The central idea here, and one that bears emphasizing, beacuse it's so different to how we've thought about clustering thus far, is this:

we understand the structure of data from its transitions, rather than from its states.

What I mean is this: rather than defining a clustering as a static partitioning of the data, we watch the data ``evolve'' as we start merging points together, and use the merge history to figure out what the real structure is. There are a couple of ways of doing this: the most common way is to imagine drawing the tree from top to bottom, and then cutting it by a y-monotone curve (one that intersects any vertical line in exactly one point). This can be done to ensure we have k clusters, or it can be done at points where the merge appears to be ``stable'' (the subtree doesn't merge with any other subtrees for a long time).

A particularly effective visual metaphor for this is the dendrogram, which is very popular in evolutionary tree analysis.

The dendogram is visually appealing because it does two things: firstly, it depicts the tree of merges, permuting the nodes so that there aren't crossings. Secondly, and more importantly, it uses the lengths of edges as a visual marker to indicate the 'time of merge' for clusters. If we think of starting at time 0 and merging clusters, then a cluster that sticks around for a long time will have a tall edge connecting it to its parent, in comparison with a cluster that gets merged quickly. (as an aside, visual metaphors are possibly underrated when it comes to thinking about clustering algorithms. My firm belief is that one of the reasons soft clusterings aren't used as much as hard clusterings is because it's hard to visualize them)

Sidebar:
And this is the key operational idea behind ``clustering from the transitions'': clusters that stay unmerged longer are likely to be more interesting than clusters that get merged quickly. Till recently, this was merely a heuristic way of thinking about hierarchical clusterings. However, we now have the idea of persistence that comes from the realm of computational topology. In short, persistence is a way of quantifying topological features of a shape that persist across different scales. Persistence has been studied extensively as a rigorous way of quantifying the triviality (or non-triviality) of features in a shape, and there's a recent paper that applies persistence-like concepts to clustering as well. It's still early days for this direction, but I think it's a promising one.

Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem. And since there's no easy way to change your mind (i.e split a clustering), it's hard to reverse bad decisions. If you're so inclined (and I am nowadays), this is the problem of finding monotone paths in the merge-split lattice on partitions of [1..n], but I digress...

Ultimately, the value of hierarchical clustering comes in its ability to represent an entire history of clusterings, rather than just one, and the relative ease with which a bottom-up algorithm can be written. That, coupled with its uses where you really do want to find a tree (evolutionary or otherwise) is what makes it a popular implement in the clustering toolbag.

Friday, July 03, 2009

Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:

It might take exponential time to converge, even in the plane
It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:

Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p, a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice". The kinds of careful constructions (high dimensional, and low dimensional) needed to force super polynomial convergence times are very artificial.

A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is created. Two recent works have indicated that the key to extracting good performance from k-means is by being very careful with this initial choice.The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:

Pick the next candidate center with probability proportional to its minimum distance squared from the current set of clusters.

Papers by Arthur and Vassilvitski, and by Ostrovksy, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means. The actual results are slightly different, and give two different views of the behaviour of the algorithm:

The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

The Arthur/Vassilvitski paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (a log k approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach by Ostrovsky et al to see which work better on benchmark data sets.

Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: The twin steps of finding new centroids and assigning points to nearest neighbours have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize the Bregman distances here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.

Apart from the simplicity of the algorithm, there are some aspects that make it attractive, regardless of lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly atractive example of a technique that's often called alternating optimization.

Notes:

I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).

Next up: hierarchical methods...

Sunday, June 28, 2009

Clustering: The "I don't like you" view

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Any clustering problem starts with a data set. And you can basically tell what algorithm you're going to need by asking, 'what structure do I have on the data ?'. The bare minimum you need is some way of taking two items and returning a number that tells you how similar they are, or more usefully, how far apart they are.

Of course, if I know that A and B are a certain distance apart, and B and C are a certain distance apart, I might like to infer something about the distance between A and C. This is what the triangle inequality gives you, and so the most elementary form of clustering takes items and a metric defined on these items (humans actually DON'T think about distances in this manner, as many psychological studies have shown: more on that later on).

I find it convenient to view clustering algorithms from a pop-psychology self-help perspective, and so I'll deem algorithms that operate with a distance metric the 'I don't like you' methods. The general goal with such methods is to group the items into "clusters" where there isn't too much hatred :).

There are many ways to quantify this "lack of hatred". The one most natural to theoreticians is the 'minimize the worst-case angst' viewpoint: in other words, find k clusters such that over all clusters, the maximum pairwise distance between points in a cluster is minimized. This is the well-known k-center problem.

If you don't like the lack of robustness (one point can mess up the cost), you could instead sum up all the pairwise distances to assign the cost for a cluster. This gives a variant of what's normally considered the k-median problem.

A third way of measuring cluster cost is to sum the squares of distances, rather than the distances themselves. This gives you a cost function that looks a lot like what k-means attempts to solve.

It's often useful to define a representative of a cluster: in the k-center view of the world, the center of a cluster is a point whose maximum distance to all points in the cluster is minimum. Note that by triangle inequality, this is within a factor of two of the k-center cluster cost, so it's often convenient to use this as the definition of the cluster cost (the cluster radius, instead of diameter, if you will). In the sum-of-all-costs view, the center correspondingly can be defined as as the point whose sum of all distances to all other points is minimized. Similarly for the case of sum-of-square-costs.

Remarks on complexity:
There is an elegant 2-approximation for the k-center problem by Gonzalez. The algorithm itself is the essence of simplicity: pick any point in the space, take its furthest neighbor, take the point furthest from these two, and so on. The first k points picked in this manner yield the desired cluster centers, and a simple triangle inequality argument yields the 2-approximation.

The k-median problem is a little harder to solve. The best known bound in a general metric space is a 3+eps approximation, and it's known that you can't get a PTAS. The approximation algorithm itself is not too hard: it's a local search heuristic that gets closer and closer to 3 as you permit larger and larger sets of centers to be swapped out.

Going to Euclidean space, or what's in a name ?
General metrics come either from some kind of similarity function between pairs of elements or from the induced shortest path metric on a graph (there are other metrics one can induce from a graph, and we'll talk about them later). But what if you have more information ? There's a general principle of clustering that's worth enunciating here: more structure on the data can lead you to more targeted algorithms that will probably work better, so resist the urge to be too general.

The most natural space to consider is a Euclidean space, and it is this space that explains the names 'median' and 'mean'. It's well known that on the line, the point minimizing the sum of distances to a set of points (the 1-median) is given by the actual median of the numbers. This explains why the problem of minimizing sum of distances is called the k-median problem. Similarly, the point that minimizes the sum of squared Euclidean distance is the centroid or the 'mean', giving rise to the k-means problem.

There's a long line of results on clustering under various measures in Euclidean space. In brief, for most of the above formulations, you can get a PTAS in Euclidean space, but you end up playing whack-a-mole with the parameters n, d, epsilon, and k (that is to say, something ends up exhibiting exponential dependence, and which parameter it is depends on what you care most about).

The Voronoi trick
But there's a general principle that governs much of the algorithm design for these problems. It's the so-called Voronoi trick, and can be summarized by pointing out that you can always improve the clustering cost by making sure that each point is assigned to its nearest neighbor. Effectively this means that the clusters define a Voronoi partition, with the center as a site, and all points within a Voronoi cell being assigned to the corresponding site. First of all, this means that the number of possible outcomes reduces from k^n (the number of ways of partitioning n things into k groups) to n^{kd} (since the number of Voronoi partitions is much smaller than the number of partitions). Secondly, it suggests a simple heuristic for improvement: Take a given clustering: reassign points so that each point is assigned to its nearest cluster center, and then recompute the center for the cluster associated with all points in the (erstwhile) Voronoi cell. Rinse and repeat....

This heuristic is the basis of the k-means algorithm, and is also the basis for a heuristic for the k-median problem clumsily named the k-medoids algorithm.

Notes:

all the above algorithms have the parameter k for the number of clusters. It's obvious why: all the above cost measures can be minimized by placing each item in a cluster by itself, but the resulting answer is meaningless, and "no man is an island"
In a general metric space defined over a finite set of points, the cluster centers will by default by elements of the input. This is often a desirable property, if you want representatives to be from the input. But in a smooth space like a Euclidean space, there's no reason to limit yourselves to data points, and this creates a dichotomy between the so-called discrete and continuous variants of a clustering problem. Often, you can use triangle inequality to argue that the answers don't differ significantly in cost, but it's important to ask yourself which variant you care about. The discrete problem can often be "easier" to solve, because you can enumerate over the choices, but this "easier" can be expensive: for example, minimizing the sum of squares is easy in a (continuous) Euclidean space, but is more expensive in a (discrete) metric space.
For some unknown reason, it's common to describe clustering problems by the algorithm used to "solve" them, which causes no end of confusion. I lost count of the number of times I had to point out that k-means is an algorithm that attempts to optimize a specific cost function. This is more than just pedantry, because unless you realize the difference between a problem and a specific solution, you'll never be able to conceive of an alternate approach, and you'll never even try to abstract out what's special about your problem. To theoreticians, this might be obvious, but it isn't really obvious IRL.

Next up: everything you ever wanted to know about k-means.

p.s this is written in a much more breezy style than I'd use for any formal document, but the organization and thinking reflects how I'd like to structure the material. Comments, as always, are welcome.

Pages

Thursday, February 25, 2016

Wednesday, January 18, 2012

Tuesday, November 01, 2011

Wednesday, September 07, 2011

Tuesday, September 06, 2011

Monday, July 18, 2011

Monday, June 13, 2011

Saturday, June 11, 2011

Monday, October 25, 2010

Friday, May 14, 2010

Saturday, March 13, 2010

Sunday, March 07, 2010

Thursday, January 07, 2010

Wednesday, October 28, 2009

Sunday, August 30, 2009

Sunday, August 02, 2009

Thursday, July 23, 2009

Friday, July 03, 2009

Sunday, June 28, 2009

Disqus for The Geomblog