Friday 13 August 2010

Towards Nutch 2.0

Never mind the dodgy look of the blog - I'll improve that later!

For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase, which has been developed mainly by Doğacan Güney and is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it now stores its data in a datastore and not in Hadoop's file-based structures. Note that we still have the distribution and replication of the data over a whole cluster, and data locality for MapReduce, but we can now also insert or modify a random entry in the table without having to read and write the whole data structure, as was the case before.
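
To make that last point concrete, here is a toy Python sketch. It is not Nutch or Hadoop code, and the function names are made up for illustration. It contrasts changing one entry in a sequential, file-like structure, where every record has to be written out again, with changing one entry in a table keyed by URL.

```python
# Toy illustration only: neither function is Nutch or Hadoop code.

def update_in_file_structure(records, url, new_status):
    """Sequential structure: changing one entry means rewriting every record.

    `records` is a list of (url, status) pairs standing in for a file that
    can only be read and written from start to finish.
    Returns the rewritten list and the number of records written.
    """
    rewritten = []
    for key, status in records:
        rewritten.append((key, new_status if key == url else status))
    return rewritten, len(rewritten)


def update_in_table(table, url, new_status):
    """Keyed table: only the row for `url` is touched.

    `table` is a dict standing in for a datastore table keyed by URL.
    Returns the number of rows written.
    """
    table[url] = new_status
    return 1


urls = [f"http://example.com/page{i}" for i in range(1000)]
records = [(u, "unfetched") for u in urls]
table = {u: "unfetched" for u in urls}

_, written_file = update_in_file_structure(records, urls[42], "fetched")
written_table = update_in_table(table, urls[42], "fetched")
print(written_file, written_table)  # 1000 1
```

The real datastores are of course distributed and far more involved, but the cost difference for a single-entry update is the same idea.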

Nutch uses a project named GORA as an intermediate layer between our code and the backend storage. There would be a lot of things to say about GORA, but to make it short, what we are trying to achieve is to make it a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra, and also for SQL. The plan for GORA is to put it in the Apache Incubator or possibly to make it an Apache subproject (Hadoop? HBase? Cassandra?). We'll see how it goes.
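
The idea of a common API over several backends can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `DataStore` interface, its `put`/`get` methods and the two backends are hypothetical and are not GORA's actual API. The point is only that the application code (`mark_fetched`) is written once and does not know which store it is talking to.

```python
import sqlite3
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical common interface; not GORA's real API."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...


class InMemoryStore(DataStore):
    """Stands in for a NoSQL backend: a plain key/value map."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)


class SqliteStore(DataStore):
    """Stands in for an SQL backend, using an in-memory SQLite database."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE rows (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key, value):
        self._db.execute(
            "INSERT OR REPLACE INTO rows (k, v) VALUES (?, ?)", (key, value)
        )

    def get(self, key):
        row = self._db.execute(
            "SELECT v FROM rows WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def mark_fetched(store, url):
    """Application code: written once, unaware of the backend behind `store`."""
    store.put(url, "fetched")
    return store.get(url)


for store in (InMemoryStore(), SqliteStore()):
    print(type(store).__name__, mark_fetched(store, "http://example.com/"))
```

Swapping the backend then becomes a matter of configuration rather than of rewriting the crawler code.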

There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more, as all the information about a URL (metadata, original content, extracted text, ...) is stored in a single table. This means, for instance, no more segments to merge or metadata to move back to the crawldb. It's all in one place!
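
As a rough picture of what "all in one place" means, here is a small Python sketch. The row layout and the field and function names are hypothetical and do not reflect the actual Nutch 2.0 schema. Each step of the crawl simply updates the same row for a given URL, so there is nothing to merge afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PageRow:
    """Hypothetical row layout; field names are illustrative, not Nutch's schema."""
    status: str = "unfetched"
    metadata: Dict[str, str] = field(default_factory=dict)
    content: Optional[bytes] = None  # original content as fetched
    text: Optional[str] = None       # text extracted by the parser


# One table keyed by URL holds everything known about each page.
table: Dict[str, PageRow] = {}


def inject(url):
    """Add a URL to the table if it is not there yet."""
    table.setdefault(url, PageRow())


def fetch(url, content, content_type):
    """Store the fetched content and some metadata in the page's own row."""
    row = table[url]
    row.content = content
    row.metadata["Content-Type"] = content_type
    row.status = "fetched"


def parse(url, text):
    """Store the extracted text in the same row."""
    row = table[url]
    row.text = text
    row.status = "parsed"


inject("http://example.com/")
fetch("http://example.com/", b"<html>Hello</html>", "text/html")
parse("http://example.com/", "Hello")
print(table["http://example.com/"])
```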

There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search, as we now rely on SOLR. Other indexing backends might be added later. Another step towards delegating functionalities to external projects is the increased use of Apache Tika for the parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and done a lot of code clean-up.

Nutch 2.0 is still at an early stage and we are actively working on it, testing, debugging etc. The good news is that it is not only an architectural change but also a basis for a whole lot of new functionalities (see for instance https://issues.apache.org/jira/browse/NUTCH-882).

I'll keep you posted on our progress. As usual: give it a try, get involved, join us...

6 comments:

  1. I have created a few segments with Nutch 1.2. Are there any tools that would let me migrate this crawl data to Nutch 2.0?
    Would it be hard to create such a tool?

  2. Hi Alexis,

    There aren't any tools for migrating to 2.0 yet, but it wouldn't be too difficult to write one.
    What you can do already is get a list of the URLs in your 1.2 crawldb and inject that into 2.0. You'd have to refetch these URLs of course, but at least you wouldn't have to rediscover them.
    Please note that 2.0 is at an early stage and has some open issues such as https://issues.apache.org/jira/browse/NUTCH-879. However, it is worth playing with it anyway (and reporting bugs if you find any).

  3. Thanks for the prompt reply!

    According to the issue, Nutch 2.0 is currently slower than the 1.2 version (too much so for me).
    I was interested in migrating the crawldb and the segments to the datastore, not only to avoid rediscovering the URLs but also to save the fetch step, which takes the most time.
    I intended to reload all the data generated by my previous generate/fetch/update iterations...

    I guess I'll stick with 1.2 for now since 2.0 is apparently still under development.

  4. Re saving the fetch step: this could definitely be done as well, but it would require writing a bit of code for converting the 1.x segments to 2.0. This would be a nice contribution BTW ;-)

    As for the speed problem, it's not so much that Nutch 2.0+MySQL is slower; the problem is that for some reason it does not get as many URLs as 1.2. It could be a problem with the MySQL backend in GORA, and it would be worth testing it with HBase instead.

    2.0 is definitely under development, but why don't you give it a try anyway? Testing and reporting issues is definitely a way of getting it up to speed.

  5. Sorry for the late answer.

    I will definitely try Nutch 2.0 with HBase as the datastore from now on.
    As regards the migration utility, I can give it a shot.


    Regarding the MySQL issue, this sounds like an I/O issue? A coworker told me MySQL does not support non-blocking I/O, whatever that means exactly; I have yet to verify it. The thread that requests data from the MySQL server might block on every single query and prevent any other query from running simultaneously, unless you use a different connection.

    Probably a way to speed up the connect, read and write operations would be to set up the MySQL database locally. But this seems pretty incompatible with the distributed nature of a Hadoop job.

    If we have to stick with a remote server, another way would be to use a pool of connections.

  6. Re MySQL: I would not jump to any conclusion too quickly before having compared with GORA+HBase first. It could easily be a problem with the way things are done in Nutch 2.0 or even with the implementation of the MySQL backend in GORA. It would be interesting to hear about your experiences with GORA+HBase versus GORA+MySQL versus Nutch 1.2.
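
The workaround mentioned in the second comment (list the URLs of a 1.2 crawldb, then inject them into 2.0) could be sketched along these lines in Python. This is only a sketch under assumptions: it supposes the crawldb has been dumped to text, for example with `readdb -dump`, and that each record in the dump starts with a header line of the form `<url><TAB>Version: <n>`. The function name and the sample data are made up; check the pattern against your own dump before relying on it.

```python
import re

# Assumed shape of a record header line in the text dump of a 1.x crawldb:
#   <url><TAB>Version: <n>
HEADER = re.compile(r"^(\S+)\tVersion: \d+\s*$")


def urls_from_dump(lines):
    """Yield the URL of each record header line; skip all other lines."""
    for line in lines:
        match = HEADER.match(line)
        if match:
            yield match.group(1)


# Made-up sample standing in for the contents of a dump file.
sample = [
    "http://example.com/\tVersion: 7\n",
    "Status: 2 (db_fetched)\n",
    "Retries since fetch: 0\n",
    "\n",
    "http://example.com/about.html\tVersion: 7\n",
    "Status: 1 (db_unfetched)\n",
]

# One URL per line, which is the layout of a plain seed list.
seed_list = "\n".join(urls_from_dump(sample))
print(seed_list)
```

The resulting list can then be saved to a file in a seed directory and injected into 2.0, at the cost of refetching everything.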
