
Tuesday, January 25, 2011

ALENEX: Experiments with Johnson-Lindenstrauss

I'm three days behind on my postings, so I have the luxury of looking back and attempting a larger perspective.

As Dana Randall put it at the business meeting, this is the Johnson-Lindenstrauss SODA. And it seems apropos to start with our paper at ALENEX. This was work done by my student Qiushi Wang, who's applying for grad schools (Admit him ! or email me for more info!)

The Johnson-Lindenstrauss Lemma is one of the most powerful tools in the theory of metric embeddings and dimensionality reduction. Simply put, it says that given any set of $n$ points in Euclidean space, there exists a linear mapping into a roughly $O(\log n)$-dimensional Euclidean space that preserves all distances approximately.
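For concreteness, here's the standard statement (the usual textbook form, with the constant in the dimension left unspecified): for any $0 < \epsilon < 1$ and any $n$ points $x_1, \ldots, x_n$ in $\mathbb{R}^d$, there is a linear map $f : \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that $(1-\epsilon)\|x_i - x_j\|^2 \le \|f(x_i) - f(x_j)\|^2 \le (1+\epsilon)\|x_i - x_j\|^2$ for all pairs $i, j$.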

There's a long series of proofs of this lemma: all of them yield essentially the same bound on the number of dimensions and the same dependence on the error term, so the main efforts have focused on improving the running time of the mapping itself. If we're mapping from $d$ dimensions to $k$, a generic linear mapping takes time $O(kd)$ per point, and this is the time to beat.
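For reference, here is a minimal sketch of that baseline in Python/NumPy (rather than the MATLAB we used); the $1/\sqrt{k}$ scaling is the standard choice, nothing specific to our code:

    import numpy as np

    def dense_jl(X, k, seed=None):
        """Project the rows of X (n x d) down to k dimensions using a dense
        Gaussian matrix: the O(kd)-per-point baseline."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # i.i.d. N(0, 1/k) entries preserve squared norms in expectation
        G = rng.standard_normal((d, k)) / np.sqrt(k)
        return X @ G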

There are two strands of research along these lines. The first family of methods speeds up the transformation by sparsifying the projection matrix. You can make the matrix quite sparse this way, but there's a limit on what you can do: if the input vector being projected is itself quite sparse, then the resulting vector is mostly zeros, destroying its norm (and any hope of preserving distances).
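A minimal sketch of this sparse strand, in the style of the Achlioptas-type $\pm 1$ matrices (the sparsity parameter $q$ below is mine, purely for illustration; $q = 3$ recovers the classic construction):

    import numpy as np

    def sparse_jl(X, k, q=3, seed=None):
        """Sparse projection: each entry is 0 with probability 1 - 1/q,
        and +/- sqrt(q/k) with probability 1/(2q) each."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        signs = rng.choice([-1.0, 1.0], size=(d, k))
        nonzero = rng.random((d, k)) < 1.0 / q
        S = signs * nonzero * np.sqrt(q / k)
        return X @ S

Push $q$ up on a sparse input and you see exactly the failure mode described above: most coordinates of the projected vector come out zero and the norm collapses.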

The trick, which leads to the second strand of research, is to "precondition" the input. The idea is quite elegant: if you apply what is essentially a random rotation to the vector, it becomes dense w.h.p., where density intuitively means that no single coordinate is very large (we assume unit-norm vectors w.l.o.g.). Once you do this, the resulting projection matrix can be made quite sparse.

There's a catch though: you're now using two matrices instead of one, and applying the rotation naively takes $d^2$ time, which dominates the original $O(kd)$ time. The second trick you need is a special random rotation that can be applied very quickly: essentially, the Walsh-Hadamard transform. This is the core idea behind the 2006 paper by Ailon and Chazelle, and there's been much work since on improving the bounds and the preconditioner construction. A third line of work combines the two strands: sparsify (by subsampling) a special code matrix that itself has a "preconditioning" effect.
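Here's a rough sketch of the precondition-then-project idea for a single vector (not the exact Ailon-Chazelle construction: the sampling step is simplified, the Hadamard step below is the slow $O(d^2)$ matrix version rather than the fast $O(d \log d)$ transform, and it assumes $d$ is a power of two):

    import numpy as np
    from scipy.linalg import hadamard

    def preconditioned_jl(x, k, q=0.1, seed=None):
        """Randomize signs (D), mix with a Walsh-Hadamard transform (H),
        then apply a sparse Gaussian projection (P)."""
        rng = np.random.default_rng(seed)
        d = len(x)
        D = rng.choice([-1.0, 1.0], size=d)      # random sign flips
        H = hadamard(d) / np.sqrt(d)             # orthonormal Hadamard matrix
        y = H @ (D * x)                          # preconditioned: dense w.h.p.
        # sparse projection: each entry nonzero with probability q
        mask = rng.random((k, d)) < q
        P = mask * rng.standard_normal((k, d)) / np.sqrt(k * q)
        return P @ y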

But in all of this, no one has really looked at the actual behavior of these algorithms in practice. There are a number of reasons to do so: first of all, $O(\log n)$ dimensions isn't so hot if the constant is large. Secondly, the algorithm is randomized, which tends to give practitioners the heebie-jeebies. And finally, the dizzying array of algorithm options available is just plain confusing.

Our paper contains most of the details, so I'll spare you the long exposition, and summarize some of the surprising and not-so-surprising conclusions thus far:
  • The constant in the dimension of the embedding is small. It's essentially $1 \cdot \log P/\epsilon^2$, where $P$ is the number of "norm probes" you require ($P = n^2$ for all pairwise distances and $P = n$ for norms). This is good, because it means there are no hidden large constants (a quick worked number appears right after this list). 
  • The quality of all the algorithms is basically the same, and is very consistent. In other words, the fact that JL is randomized (which often causes a lot of concern in practice) is not a problem for its use (unless you're working in a distributed environment and need to share randomness, as pointed out to me by T.S. Jayram). 
  • The distortion error itself is very nicely concentrated (normally) around 1, unless you have highly clustered data, in which case the distortion distribution looks like a superposition of shifted Gaussians, one for each cluster center. 
  • Since all algorithms behave essentially the same on quality, speed is the main differentiator. Here, the 'best in class' depends heavily on what you know about the data. For dense data, you can make the projection pretty sparse (as predicted by some of the papers) and the embedding is fast. For sparse data, it turns out that at least in MATLAB, and for small dimensions, the dense methods work better (a little ironic, considering that much of the recent work was designed to deal with the sparse case). This is because of MATLAB's heavy optimization for dense matrix multiplication. 
  • Of course, your dimensionality might be too high to store a dense matrix, or you might not even know what the data profile is like. In that case, preconditioning methods like the original Ailon/Chazelle method work great, and there are only small differences between the methods as $d$ increases. 
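To put the first bullet in concrete terms: with $n = 10^4$ points and all pairwise distances ($P = n^2 = 10^8$), at $\epsilon = 0.1$ the target dimension is roughly $\ln(10^8)/0.01 \approx 1800$, independent of the original dimension. Treat this as a ballpark figure; the exact constant depends on the failure probability you're willing to tolerate.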
We're not even close to being done with our explorations: there are at least four or five new questions to explore based on the feedback we got at SODA. But it's been an illuminating experience, and I've been heartened by all the interest the community has shown in this work.

Monday, January 24, 2011

ALENEX/ANALCO II

Today, someone asked me to post something sensational just to stir up some controversy. It turns out that without realizing it, I already did it yesterday ! I was talking about the use of CPLEX to solve (very effectively) instances of 1-median over strings, and said this:
It's not the "sexiest" thing in the world to solve algorithms problems in practice by using large industrial strength packages. However, both CPLEX and SAT solvers are examples of tools that can be used in practice to solve fairly intractable problems. It still takes a lot of engineering and skill to make the heuristics work well, but it's something that we should be using as a matter of course when designing heuristics before trying to invent an algorithm from scratch.
I should have known better than to bring down the fury of the entire field of OR on my head. Michael Trick, OR blogger extraordinaire, decided to round my variables for me: read what he had to say here.  As penance, I promise to download CPLEX and encode at least one problem on it in the next year :).

I've seen Bob Sedgewick give talks a few times now, and I'm always inspired by them. This latest one was titled 'Algorithms for the masses' and was hard to summarize: it was part exhortation to do more analytic combinatorics, part discussion of a new intro CS course he and Kevin Wayne have designed, and part emphasis on using the scientific method properly to design good models for algorithm behavior and data characteristics.

The principle at the heart of this was a fitting one for this joint talk: we should do more scientific analysis of our algorithms to figure out exactly how they behave in practice, rather than relying on O() notation as a predictive and comparative tool (neither of which it is). This goes back to Dick Lipton's coinage of 'galactic' algorithms: Bob made the assertion (not wrong in my view) that most algorithms at STOC and FOCS are 'galactic', and much of the work at SODA is too.

While I agree that it's high time we stopped using O() notation as a cudgel, I think this is harder than one might think. Engineers can model the real world in various ways, and when they want to test their models, they can - well - run them on the real world. Even if I come up with a plausible model of how my algorithm works, and what the various cost functions are, I still need to hope that the data doesn't have weird characteristics that make all the results go wonky. Probably the way to see this is that even in "the real world", not knowing how a particular genetic mechanism works is as good (or bad) as not having an accurate model of the data we're testing on.

The second invited talk, by James Demmel, was a little harder for me to follow, because it was a much more technical talk about the challenges of designing linear algebra routines for future architectures. He described a machine the DoE is proposing to build, and it's likely to have 1 billion cores ! But even with that many cores, the main bottleneck is going to be communication, and the goal going forward is to design algorithms that parallelize well with minimal communication.

Or as he ended his talk:
Don't communic...

Sunday, January 23, 2011

ALENEX/ANALCO

A few quick hits from ALENEX, or SODA day 0:

  • Moraru and Anderson used Bloom filters in a nifty way to implement exact pattern matching where you have a large set of patterns and an even larger text. The idea was to do a first pass over the text after storing all the patterns in a Bloom filter. Every substring of the text that matched the filter was stored in a second Bloom filter, and in a second pass, all the patterns were run over this second filter to take care of false positives. A final "exact" pass did the trick (at this point both sets are small enough to be reasonable); a rough sketch of the idea appears after this list. They have a companion paper at NSDI (which is a pretty good networking conference) on using this for malware detection, and that's a good example of pairing nice algorithms engineering with some interesting applications. 
  • Chimani, Woste, and Böcker were looking at the 1-median problem on a Hamming space, and showed that the simple integer programming formulation actually does great in practice when you throw CPLEX at it. This was surprising to me on two levels: first, that CPLEX is actually free for academic use (who knew!), and second, that such a simple approach is so effective.

    It's not the "sexiest" thing in the world to solve algorithms problems in practice by using large industrial strength packages. However, both CPLEX and SAT solvers are examples of tools that can be used in practice to solve fairly intractable problems. It still takes a lot of engineering and skill to make the heuristics work well, but it's something that we should be using as a matter of course when designing heuristics before trying to invent an algorithm from scratch.
  • Stanton and Pinar had some experimental results (and some theory) on sampling from the space of graphs that have a prescribed joint degree distribution. While degree sequences are all the rage when trying to model various "naturally occurring" graphs like router graphs or social network graphs, there's a body of work that notes that graphs with the same degree distribution can have very different properties, and that in fact statistics on the number of edges connecting nodes of certain degrees (i.e. higher-order statistics on degrees) are even more relevant. They propose a simple Markov chain that allows them to sample from the space of all graphs having a prescribed joint degree distribution, and while they don't yet appear to have theoretical results on the convergence of this chain, it converges quickly in practice.
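As promised above, here's a rough Python sketch of the Bloom-filter idea from the Moraru-Anderson paper, under my own simplifying assumptions (all patterns have the same length, a toy Bloom filter that isn't tuned for its false-positive rate, and a final exact pass that is just a substring check):

    import hashlib

    class Bloom:
        """Toy Bloom filter, purely for illustration."""
        def __init__(self, n_bits=1 << 20, n_hashes=4):
            self.bits = bytearray(n_bits // 8)
            self.n_bits, self.n_hashes = n_bits, n_hashes

        def _positions(self, s):
            for i in range(self.n_hashes):
                h = hashlib.blake2b(s.encode(), person=bytes([i]) * 16).digest()
                yield int.from_bytes(h[:8], "big") % self.n_bits

        def add(self, s):
            for p in self._positions(s):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, s):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(s))

    def match(patterns, text):
        """Two-pass matching: filter text windows against the patterns,
        then filter patterns against the surviving windows."""
        m = len(next(iter(patterns)))            # assumes equal-length patterns
        bf_patterns = Bloom()
        for p in patterns:
            bf_patterns.add(p)
        # Pass 1: record every window of the text that hits the pattern filter
        bf_windows = Bloom()
        for i in range(len(text) - m + 1):
            w = text[i:i + m]
            if w in bf_patterns:
                bf_windows.add(w)
        # Pass 2: keep only patterns that hit the window filter, then remove
        # the remaining false positives with an exact check
        survivors = [p for p in patterns if p in bf_windows]
        return [p for p in survivors if p in text]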
Other notes: I'll be using the hashtag #soda2011 on twitter during the day. If you're tweeting from SODA (and don't want to let the NIPS tweeters show us up!), do use this hashtag as well. 
