Friday 13 August 2010

Towards Nutch 2.0

Never mind the dodgy look of the blog - I'll improve that later!

For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase, developed mainly by Doğacan Güney, which is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it now stores its data in a datastore and not in Hadoop's file-based structures. Note that we still have the distribution and replication of the data over a whole cluster, and data locality for MapReduce, but we can now also insert or modify an arbitrary entry in the table without having to read and rewrite the whole data structure, as was the case before.

Nutch uses a project named GORA as an intermediary between our code and the backend storage. There would be a lot to say about GORA but, in short, what we are trying to achieve with it is a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra, and also for SQL. The plan for GORA is to put it in the Apache Incubator or possibly make it an Apache subproject (Hadoop? HBase? Cassandra?). We'll see how it goes.
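To give a rough idea of what a common API over several stores buys us, here is a minimal sketch in Python. It is not GORA's actual API (GORA is a Java library): the names DataStore, InMemoryStore and mark_fetched are made up for illustration, and the in-memory dictionary simply stands in for a real backend such as HBase, Cassandra or SQL.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DataStore(ABC):
    """Hypothetical store-agnostic interface (NOT GORA's real API)."""

    @abstractmethod
    def get(self, key: str) -> Optional[dict]:
        """Return a copy of the row stored under key, or None."""

    @abstractmethod
    def put(self, key: str, row: dict) -> None:
        """Insert or overwrite the row stored under key."""


class InMemoryStore(DataStore):
    """Stand-in backend; a real one would talk to HBase, Cassandra or SQL."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        row = self._rows.get(key)
        return dict(row) if row is not None else None

    def put(self, key: str, row: dict) -> None:
        self._rows[key] = dict(row)


def mark_fetched(store: DataStore, url: str, status: int) -> None:
    """Update a single entry; no other row is read or rewritten."""
    row = store.get(url) or {}
    row["status"] = status
    store.put(url, row)


store = InMemoryStore()
store.put("http://example.org/a", {"status": 0})
store.put("http://example.org/b", {"status": 0})
mark_fetched(store, "http://example.org/a", 200)
```

The calling code (mark_fetched here) only sees the abstract interface, so swapping the backend should not require touching it, and updating one URL leaves every other entry alone.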

There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more: all the information about a URL (metadata, original content, extracted text, ...) is stored in a single table, which means for instance no more segments to merge or metadata to move back to the crawldb. It's all in one place!
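As an illustration of the "all in one place" idea, here is a small Python sketch in which each URL has a single row that successive steps fill in. The row layout and the names PageRow, web_table, fetch_step and parse_step are assumptions made for the example, not the actual Nutch 2.0 schema or code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PageRow:
    """Hypothetical row layout: everything known about one URL."""
    metadata: Dict[str, str] = field(default_factory=dict)
    content: Optional[bytes] = None  # original fetched bytes
    text: Optional[str] = None       # extracted text after parsing


# One table keyed by URL, standing in for the single datastore table.
web_table: Dict[str, PageRow] = {}


def fetch_step(url: str, raw: bytes) -> None:
    """Store the fetched content on the URL's own row."""
    row = web_table.setdefault(url, PageRow())
    row.content = raw
    row.metadata["fetched"] = "true"


def parse_step(url: str) -> None:
    """Add the extracted text to the same row; nothing to merge afterwards."""
    row = web_table[url]
    row.text = row.content.decode("utf-8").strip()
    row.metadata["parsed"] = "true"


fetch_step("http://example.org/", b" hello nutch ")
parse_step("http://example.org/")
```

Because both steps write to the same row, there is no separate per-batch structure to merge and no metadata to copy back elsewhere once parsing is done.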

There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search, as we now rely on Solr. Other indexing backends might be added later. Another step towards delegating functionality to external projects is the increased use of Apache Tika for parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and done a lot of code clean-up.

Nutch 2.0 is still at an early stage and we are actively working on it: testing, debugging, etc. The good news is that it is not only an architectural change but also a basis for a whole lot of new functionality (see for instance https://issues.apache.org/jira/browse/NUTCH-882).

I'll keep you posted on our progress. As usual: give it a try, get involved, join us...