
Wednesday, November 10, 2010

Infochimps and the scaling of dataset value

Sure, a picture is worth a thousand words, but what is a thousand words worth? How about a million? If I had a dataset of the most recent trillion words spoken by humanity (anonymized and randomized, of course!), would that be worth any more than the set of words in this blog post?

These are real questions. A Texas company called Infochimps has datasets quite similar to these, ready for you to use. Some of the datasets are free; others you have to pay for. More interesting is that if you have a dataset you think other people might be interested in, or even pay for, Infochimps will host it for you and help you find customers. (Infochimps recently announced that it had raised $1.2 million in its first round of institutional funding.)

One of the datasets you can get from Infochimps for free is the set of smileys used in tweets sent between March 2006 and November 2009. It tells you that the smiley ":)" was used 13,458,831 times, while ";-}" was used only 1,822 times.
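
If you wanted to build a tally like that yourself, the counting is simple. Here's a minimal sketch in Python; the SMILEYS list and the sample tweets are my own stand-ins, not the dataset's actual inventory:

from collections import Counter

# A hypothetical list of emoticons to look for; the real Infochimps
# dataset covers far more variants than this.
SMILEYS = [":)", ":-)", ";-}", ":(", ";)", ":D"]

def tally_smileys(tweets):
    """Count occurrences of each smiley across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        for s in SMILEYS:
            n = text.count(s)
            if n:
                counts[s] += n
    return counts

print(tally_smileys(["Great show tonight :) :)", "hmm ;-} not sure :("]))
# Counter({':)': 2, ';-}': 1, ':(': 1})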

If you're willing to fork over $300, you can get a 160MB file containing a month-by-month summary of all the hashtags, URLs and smileys used on Twitter during the same period. That dataset will tell you that during September of 2009, the hashtag #kanyeisagayfish was used 11 times while #takekanyeinstead was used 141 times.

If you're a Scrabble player, you can pay $4 for a list of the 113,809 official words, with definitions. Or you can get them free, without the definitions.

Visualization courtesy of Infochimps, Inc. (CC-BY-A)
I had a great talk with Infochimps President and Co-Founder Flip Kromer a few weeks ago before his presentation to the New York Data Visualization Meetup. I fell in love with one of the visualizations he showed in his presentation, and he's given me permission to reproduce it here. (Creative Commons Attribution License) It's derived from the same Twitter data set you can get from Infochimps, and shows networks of characters that are found in the same tweet. So if ♠ and ♣ appear in the same tweet over and over again, the two characters will have a strong connection in the network of characters.
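
The underlying bookkeeping is easy to picture. Here's a sketch of how such a co-occurrence network might be built; character_pairs is my own hypothetical helper, not Infochimps code:

from collections import Counter
from itertools import combinations

def character_pairs(tweets):
    """Count how often each pair of distinct characters shares a tweet."""
    pair_counts = Counter()
    for text in tweets:
        chars = sorted(set(text))          # each character counts once per tweet
        for a, b in combinations(chars, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

edges = character_pairs(["♠♣ vs ♣♠", "♠ alone", "♣ too"])
print(edges[("♠", "♣")])   # 1: they appear together in one tweet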

The character connection data was fed to a program called Cytoscape, which is an open source visualization program used in bioinformatics; Mike Bergman has a nice article about its use for large RDF graphs. The networks are laid out using a force-directed algorithm (which is pretty much the simplest thing you can do). Coloring is applied arbitrarily.
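
To give a feel for what a force-directed layout involves: connected nodes pull toward each other, all nodes push apart, and positions are iterated to equilibrium. Here's a sketch using networkx (its spring_layout is a Fruchterman-Reingold force-directed layout) and the hypothetical edges counter from the previous sketch; Cytoscape does the equivalent internally:

import json
import networkx as nx

# Build a weighted graph from the pair counts
G = nx.Graph()
MIN_WEIGHT = 1   # raise this to prune weak connections in a real, noisy data set
for (a, b), weight in edges.items():
    if weight >= MIN_WEIGHT:
        G.add_edge(a, b, weight=weight)

# Force-directed layout: attraction along edges, repulsion everywhere else
positions = nx.spring_layout(G, weight="weight", seed=42)

# Or hand the graph to Cytoscape in its JSON format and lay it out there
with open("chars.cyjs", "w") as f:
    json.dump(nx.cytoscape_data(G), f)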

As you might expect, the main character networks that show up are associated with languages, but there are some anomalies. For example, the katakana character ツ (tsu) sticks out. Katakana is a set of phonetic characters used in Japanese for non-Japanese words. The reason "tsu" is set apart from all the other katakana is that people use it on Twitter as a smiley.

The other anomalous character subnet is labeled "???" in the graph. A closer look reveals this to be the set of characters that look like upside-down Roman text.

Kromer has noticed that the price (or perhaps cost) of a partial data set follows a non-monotonic curve (see graphic). Small amounts of data are essentially free, but value peaks when well-chosen portions are extracted from the full data set. If we were discussing book metadata, for example, peak value might accrue to a set of the 100,000 top-selling books.

There's much less value, according to Kromer, in having a large incomplete chunk of a data set. Data for 10,000,000 books, for example, would be worth less than the 100,000-book set, because it's not complete. Complete data sets become extremely expensive, both because of the logistics involved and because completeness itself is valuable.
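
A toy model makes the shape of the curve concrete. This is purely my own illustration of Kromer's observation, with made-up numbers, not a formula he gives:

def dataset_value(fraction, peak_at=0.01, peak=100.0, trough=20.0, complete=1000.0):
    """Illustrative (made-up) value of owning `fraction` of a data set."""
    if fraction >= 1.0:
        return complete     # the complete set commands a premium
    if fraction <= peak_at:
        return peak * fraction / peak_at   # cheap slices ramp up to the curated peak
    # big-but-incomplete chunks sag from the peak toward the trough
    return peak - (peak - trough) * (fraction - peak_at) / (1.0 - peak_at)

for f in (0.0001, 0.01, 0.5, 0.9999, 1.0):
    print(f"{f:8.2%}  {dataset_value(f):8.1f}")
# value rises to 100 at the 1% curated subset, sags toward 20
# as the chunk grows, then jumps to 1000 at completeness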

This pattern seems plausible to me, but I'd like to see some clearer examples. I've previously written about having too much data, but that article looked at the effect of error rates on data collection; Kromer's curve is about utility.

For me, the most interesting thing about Infochimps is the idea that the best way to make data flow in large volumes and create new types of knowledge is to provide the right incentives for data producers through the establishment of a market. This makes a lot of sense to me; however, I'm not sure the Infochimps market has also established the incentives needed for data set maintenance. The world's most valuable and expensive data sets are the ones that change rapidly.

Kromer contrasted the Infochimps approach to that of Wolfram, whose Alpha service is produced by "putting 100 PhDs and data in a lab". He also feels that much of the work being put into the semantic web is a "crock" because its technology stack solves problems that we don't have. Humans are pretty good at extracting meaning from data, given a good visualization.

We can even recognize upside-down text.