Thursday, December 31, 2009

Sort a huge file

You are given a huge file with, say, 100 terabytes of integers. They don't fit in main memory. What strategy would you use?
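One common answer is an external merge sort: cut the file into chunks that fit in memory, sort each chunk and write it out as a sorted "run", then merge the runs. Here is a minimal sketch in Python, assuming the integers are stored one per line in a text file. The names `external_sort` and `chunk_size` and the file layout are illustrative choices, not part of the question.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(input_path, output_path, chunk_size=1_000_000):
    """Sort a text file of integers (one per line) with bounded memory."""
    run_paths = []
    # Phase 1: cut the input into sorted runs that each fit in memory.
    with open(input_path) as src:
        while True:
            chunk = [int(line) for line in islice(src, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{value}\n" for value in chunk)
            run_paths.append(path)

    # Phase 2: k-way merge of the runs. heapq.merge is lazy, so roughly
    # one pending value per run is held in memory at any time.
    runs = [open(path) for path in run_paths]
    try:
        with open(output_path, "w") as dst:
            streams = [(int(line) for line in run) for run in runs]
            dst.writelines(f"{value}\n" for value in heapq.merge(*streams))
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)
```

At a real 100 TB scale the number of runs would probably exceed what a single merge pass (or a single machine) can handle, so the merge would likely be done in several passes or spread over many machines. The sketch only shows the basic two-phase idea.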

Wednesday, December 30, 2009

Selecting the median in an array

Ok, this is a typical "easy" interview question: you are given an array of integers, select the median value. What is the optimal solution?
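A usual answer is quickselect, which finds the k-th smallest element in expected linear time without fully sorting the array. (The median-of-medians pivot rule makes the bound worst-case linear, at the cost of a larger constant.) Here is a minimal sketch, assuming a non-empty input and taking the lower median for even lengths. The function names are illustrative.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based), expected O(n) time."""
    items = list(values)  # work on a copy, leave the caller's data alone
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi]: < pivot | == pivot | > pivot
        i, lt, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[gt], items[i] = items[i], items[gt]
                gt -= 1
            else:
                i += 1
        if k < lt:
            hi = lt - 1      # answer lies in the "smaller" block
        elif k > gt:
            lo = gt + 1      # answer lies in the "larger" block
        else:
            return pivot     # k falls inside the block equal to the pivot


def median(values):
    """Lower median of a non-empty sequence."""
    return quickselect(values, (len(values) - 1) // 2)
```

For example, `median([3, 1, 4, 1, 5, 9, 2])` returns 3.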

Tuesday, December 29, 2009

A good tool for POS tagging, Named Entity recognition, and chunking

YamCha is a good SVM-based tool for POS tagging, Named Entity recognition, and chunking.

Monday, December 28, 2009

Introduction to Computational Advertising : Andrei Broder's course online

"Computational advertising is a new scientific discipline, at the intersection of information retrieval, machine learning, optimization, and microeconomics. Its central challenge is to find the best ad to present to a user engaged in a given context, such as querying a search engine ("sponsored search"), reading a web page ("content match"), watching a movie, and IM-ing. As such, Computational Advertising provides the foundations for building ad matching platforms that provide the technical infrastructure for the $20 billion industry of online advertising.

In this course we aim to provide an overview of the technology used in building online advertising platforms for various internet advertising formats."

Handouts from Andrei's course on Computational Advertising are now online.

Sunday, December 27, 2009

KD-Tree: A C++ and Boost implementation

According to Wikipedia, a kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. In this implementation, points are represented as a Boost uBLAS matrix (numPoints x dimPoints) and the kd-tree can be seen as a row permutation of the matrix. The tree is built recursively. At depth k, coordinate (k % dimPoints) of every point is examined and the median value is selected with a quickselect algorithm. Quickselect induces a natural permutation of the row indices that splits the points into two sets, which are then recursively partitioned at the next levels of the tree.

Here is the code.
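As a rough illustration of the construction described above (this is not the C++/Boost code from the post), here is a minimal Python sketch. It assumes the points are a non-empty list of equal-length tuples, and it uses a full sort per level instead of quickselect to keep the example short. The name `build_kdtree` is illustrative.

```python
def build_kdtree(points):
    """Return a row permutation that encodes a balanced kd-tree.

    For any slice [lo, hi) handled by the recursion, the middle entry is
    the splitting node; rows to its left have a smaller-or-equal value on
    the split axis, rows to its right a larger-or-equal one.
    """
    dim = len(points[0])
    perm = list(range(len(points)))

    def build(lo, hi, depth):
        if hi - lo <= 1:
            return
        axis = depth % dim
        # Sorting stands in for quickselect: simpler, but O(n log n) per level.
        perm[lo:hi] = sorted(perm[lo:hi], key=lambda row: points[row][axis])
        mid = (lo + hi) // 2
        build(lo, mid, depth + 1)       # left subtree
        build(mid + 1, hi, depth + 1)   # right subtree

    build(0, len(points), 0)
    return perm


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
perm = build_kdtree(points)
root = points[perm[len(perm) // 2]]  # the median on the first axis
```

The matrix itself is never moved: only the index array is permuted, which matches the "row permutation" view of the tree.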

Saturday, December 26, 2009

Xmas present: Machine Learning: An Algorithmic Perspective

A very useful book on Machine Learning, with a lot of examples in Python. Machine Learning: An Algorithmic Perspective is a must-have for your library.

Friday, December 25, 2009

MultiArray: a useful container.

MultiArray is a useful Boost container. Think of it as a hierarchical container which contains other containers. You can also have recursive definitions, where each level of the container can hold other MultiArrays. So a multi_array<float, 2> is a matrix of floats (dim = 2), and a multi_array<float, 3> is a cuboid of floats (dim = 3). By contrast, a multi_array<multi_array<float, 2>, 3> is an object that you cannot represent with a 'simple' matrix: it is a cuboid where each cell is itself a matrix of floats.

MultiArrays can be accessed with the traditional iterator pattern, or accessed with the conventional bracket notation.

Maybe the most useful feature of MultiArray is the ability to create views, where a subset of the underlying elements of a MultiArray is treated as though it were a separate MultiArray.

Thursday, December 24, 2009

Building an HTTP proxy for filtering content

I am quite agnostic when it comes to programming languages.

C++ is a must for any production-ready code, such as online search code (and if you use the STL and Boost you have a rich set of libraries). C# is useful for offline code, thanks to its rich built-in set of libraries. Some people love Java for this kind of work (e.g. Lucene, Hadoop, and other very good environments). Then, if you are scripting for quick-and-dirty prototyping, you may want to use Python (which is good for its strong typing) or Perl (which is good for its vast set of modules, see CPAN).

So, I need to write an HTTP proxy for filtering some content. What is my choice? In this case, Perl with the HTTP::Proxy module, which allows me to filter both headers and content.

Wednesday, December 23, 2009

A commodity vector space class

Sometimes you need to implement a vector space with dense vectors. Here is the code for a convenience class.

PS: getting the sparse version is easy if you use Boost Ublas.

Tuesday, December 22, 2009

Is this... the real 3D Mandelbrot then?

Fascinating article. Are fractals a form of search? In my opinion, yes, they are.
"The original Mandelbrot is an amazing object that has captured the public's imagination for 30 years with its cascading patterns and hypnotically colorful detail. It's known as a 'fractal' - a type of shape that yields (sometimes elaborate) detail forever, no matter how far you 'zoom' into it (think of the trunk of a tree sprouting branches, which in turn split off into smaller branches, which themselves yield twigs etc.).

It's found by following a relatively simple math formula. But in the end, it's still only 2D and flat - there's no depth, shadows, perspective, or light sourcing. What we have featured in this article is a potential 3D version of the same fractal."

Monday, December 21, 2009

Facebook is growing big .. and I mean REALLY BIG!

"Most Recent Facebook Common Stock Sale Values Company At $11 Billion", TechCrunch reported. I already reported on the Distribution of Facebook Users (~300M worldwide). Anyway, what is really impressive is that FB is generating a lot of incoming traffic for many other web sites (for instance, when you embed some content). Very similar to what search engines are doing...

Sunday, December 20, 2009

Conditional Random Fields

Conditional Random Fields are a framework for building probabilistic models to segment and label sequence data, and they are a valid alternative to HMMs. CRF++ is a solid package for working with CRFs.

Saturday, December 19, 2009

Social, Mobile and the Local search

OK, Google is going to buy Yelp. Anyway, who cares about directories anymore? I am particularly impressed by the combination of Local, Social and Mobile. Oh yes, I mean Foursquare. Check it out. I say real-time search for a reason.

Friday, December 18, 2009

SAT and hill-climbing

SAT is a classical NP-complete problem. Now, what is the best-known algorithm for building an approximate solution? WalkSAT:

Pick a random unsatisfied clause
Consider 3 moves: flipping each variable in the clause
If (any move improves the evaluation)
{
    accept the best
}
else
{
    with probability 0.5: take the least bad
    with probability 0.5: pick a random move
}

Impressive, no? And what does this algorithm remind you of?
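The pseudocode above can be turned into a small runnable sketch. The one below assumes clauses are tuples of non-zero integers in the DIMACS style (3 means "x3 is true", -3 means "x3 is false"), and that the "evaluation" is simply the number of satisfied clauses. The names `walksat`, `num_satisfied` and `max_flips` are illustrative.

```python
import random


def is_satisfied(clause, assignment):
    """A clause holds if at least one of its literals is true."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def num_satisfied(clauses, assignment):
    return sum(is_satisfied(clause, assignment) for clause in clauses)


def walksat(clauses, num_vars, max_flips=10_000, seed=None):
    """Return a satisfying assignment, or None if none was found in time."""
    rng = random.Random(seed)
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not is_satisfied(c, assignment)]
        if not unsatisfied:
            return assignment
        clause = rng.choice(unsatisfied)
        current = num_satisfied(clauses, assignment)
        # Score each candidate move: flip one variable of the chosen clause.
        scored = []
        for lit in clause:
            var = abs(lit)
            assignment[var] = not assignment[var]
            scored.append((num_satisfied(clauses, assignment), var))
            assignment[var] = not assignment[var]  # undo the trial flip
        best_score, best_var = max(scored)
        if best_score > current or rng.random() < 0.5:
            var = best_var                 # improving move, or the least bad
        else:
            var = abs(rng.choice(clause))  # random move within the clause
        assignment[var] = not assignment[var]
    return None


clauses = [(1, 2, 3), (-1, 2, 3), (1, -2, 3), (1, 2, -3), (-1, -2, 3)]
solution = walksat(clauses, num_vars=3, seed=0)
```

Returning None does not prove the formula is unsatisfiable. It only means the local search ran out of flips.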

Thursday, December 17, 2009

Elgg: an open source social platform

I do encourage you to try the Elgg platform. OK, it is written in PHP, which I don't like, but it is cool and rich, which I like. Consider that Facebook was written in PHP, after all.

Wednesday, December 16, 2009

Trackle

Have a look. I like the idea of tracking and sharing: Trackle.com

Monday, December 14, 2009

Neural Networks and backpropagation

I am once again studying NNs and back-propagation. Is there any paper out there that compares the standard gradient descent update rule (delta rule) with the one where you add a momentum term?

Any other suggestions for different update rules?
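For reference, the two rules can be written as one update: the weight change is `momentum * previous_change - lr * gradient`, and setting `momentum` to 0 gives back the plain rule. The sketch below tries both on a toy ill-conditioned quadratic rather than on a real network. The function names, learning rate and test function are assumptions chosen for illustration.

```python
def descend(grad, w, lr, momentum=0.0, steps=100):
    """Gradient descent; momentum=0.0 gives the plain update rule."""
    w = list(w)
    change = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        # New change = fraction of the previous change minus the gradient step.
        change = [momentum * c - lr * gi for c, gi in zip(change, g)]
        w = [wi + c for wi, c in zip(w, change)]
    return w


def toy_gradient(w):
    """Gradient of E(w) = 0.5 * (w[0]**2 + 25 * w[1]**2), minimum at (0, 0)."""
    return [w[0], 25.0 * w[1]]


plain = descend(toy_gradient, [1.0, 1.0], lr=0.03)
with_momentum = descend(toy_gradient, [1.0, 1.0], lr=0.03, momentum=0.9)
```

On this toy problem, after the same number of steps, the momentum run ends closer to the minimum than the plain run along the slowly converging first coordinate. That is just one hand-picked case, not a general comparison.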

Sunday, December 13, 2009

CTR and search engines

Chitika Research has released a report showing that users of Bing, the new search engine from Microsoft, click on ads 75% more often than Google users.

Good sign? Ask and AOL have a higher CTR than Bing.

Saturday, December 12, 2009

Google: a useful set of slides on building a distributed system

Designs, Lessons and Advice from Building Large Distributed Systems. The classic discussion of MapReduce and BigTable, plus an interesting overview of the upcoming Spanner.