It's always nice to see clients emerging from stealth mode and showing the fruits of their labour to the public. Our friends at http://www.similarpages.com have just done so, and I am doubly pleased as this also reflects the work that DigitalPebble did for them.
SimilarPages is an add-on for Firefox which lets you discover pages similar to the one you are currently reading. It's pretty cool and surprisingly easy to use. It's also completely free, which is a definite plus.
From a technical point of view, we helped SimilarPages adapt Nutch to their needs and deploy it on a 400-node cluster on Amazon EC2 to crawl the web. We fetched and parsed a total of more than 3 billion pages, from which we obtained 200+ million lists of similarities. The crawlDB itself contained more than 10 billion URLs.
Operating at such a scale is definitely challenging and has been a great experience. From a Nutch point of view, quite a few of the improvements and bugfixes in Nutch 1.1 come directly from the work done at SimilarPages, so thanks for that, guys!
The SimilarPages use case is actually a good example of using Nutch as a crawling platform only, i.e. without indexing the documents with Lucene or Solr. See A. Białecki's presentation at Berlin Buzzwords 2010 for more examples of this kind of use.
The similarities between URLs from the Nutch crawls are computed with bespoke Hadoop MapReduce code. More details on their approach can be found on SimilarPages' website.
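To give a flavour of what a MapReduce-style similarity computation can look like, here is a minimal in-memory sketch in Python. It is purely illustrative and is not SimilarPages' actual method: the choice of Jaccard overlap between outlink sets, the function names (`map_invert`, `map_pairs`, `shuffle`, `jaccard_similarities`) and the toy `crawl` data are all assumptions made for the example, and the `shuffle` helper merely stands in for what Hadoop does between the map and reduce phases.

```python
from collections import defaultdict
from itertools import combinations


def map_invert(page, outlinks):
    """Map step 1: emit (link, page) for every distinct outlink of a page."""
    for link in set(outlinks):
        yield link, page


def map_pairs(pages):
    """Map step 2: emit ((a, b), 1) for every pair of pages sharing a link."""
    for a, b in combinations(sorted(set(pages)), 2):
        yield (a, b), 1


def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle/sort phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def jaccard_similarities(crawl):
    """Return {(a, b): Jaccard score} for pages with at least one shared outlink."""
    # Size of each page's distinct outlink set, needed for the union size.
    sizes = {page: len(set(links)) for page, links in crawl.items()}
    # Job 1: invert the link graph (link -> pages that point to it).
    inverted = shuffle(
        kv for page, links in crawl.items() for kv in map_invert(page, links)
    )
    # Job 2: count shared links per page pair (reduce = sum of the emitted 1s).
    shared = shuffle(kv for pages in inverted.values() for kv in map_pairs(pages))
    scores = {}
    for (a, b), ones in shared.items():
        overlap = sum(ones)
        scores[(a, b)] = overlap / (sizes[a] + sizes[b] - overlap)
    return scores


# Hypothetical toy input: page -> outlinks extracted from a crawl.
crawl = {
    "a.example": ["x", "y", "z"],
    "b.example": ["y", "z"],
    "c.example": ["q"],
}
scores = jaccard_similarities(crawl)
```

On a real cluster each of the two passes would be a separate job, and the pair-generation step would need some pruning of very popular link targets, since the number of emitted pairs grows quadratically with the number of pages sharing a link.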
I feel quite proud to have contributed to this project and wish SimilarPages a long life. If you haven't done so already, give it a try!