Wednesday 13 June 2012

What's new in Nutch 1.5

Apache Nutch 1.5 was released last week. As with each release, this one contains a lot of changes, and I will comment on just a few of them.

The main change is actually not in the list of changes and has not been documented in the Wiki yet. The binary version of Nutch (apache-nutch-1.5.bin.*) now contains the local runtime only, i.e. what you get in runtime/local when compiling the sources. This should make things more straightforward for beginners, as we've seen quite a bit of confusion on the mailing lists about which configuration files should be modified (root/conf vs runtime/local/conf). The src version of Nutch is unchanged and is what you'll need if you want to run Nutch on an existing Hadoop cluster. Of course, the runtime/local directory is also generated when you build from source, so you'll be able to run Nutch in local mode as well.

In a nutshell, if you are not sure about what you're doing, want to use Nutch in local mode without a Hadoop cluster and/or do not need any custom plugins, then the binary version is what you're after. For production, I usually recommend using the distributed version on a pseudo-distributed Hadoop cluster, as the Hadoop web interfaces provide a wealth of useful information, not to mention that you can run more than one mapper or reducer and harness the full potential of your server.

Apart from the usual dependency updates (Hadoop 1.0.0, Tika 1.1), this release contains many improvements to the webgraph API, which is a better alternative to the default OPIC scoring in Nutch. In the future, it would be interesting to rely on a library such as Apache Giraph to compute the page ranks, as it would simplify the code and also make it more efficient.

As mentioned in a previous post, the Nutch user and dev lists seem to indicate an increasing number of users, which is great. This also means that we tend to see the same questions and issues coming up over and over. One such question was how to parse and index HTML metatags (see NUTCH-809), which I contributed 2 years ago. The parse-metatags plugin is now available in the distribution and the steps are documented in the Wiki. Note that the parsing of HTML metatags is not activated by default; this is something for the next release maybe.

An important and related change in Nutch 1.5 is NUTCH-1264, which provides a generic plugin for indexing metadata. It is driven by configuration only and is typically used alongside parsing plugins such as parse-metatags above. The metadata converted into fields for indexing can come from the crawldb, the parse metadata or the content metadata. More work is needed to delegate the indexing parts of existing plugins to it, and this is likely to happen in the next release.
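To make the idea of configuration-only metadata indexing more concrete, here is a minimal conceptual sketch in Python. It is not the plugin's code (Nutch itself is written in Java), and the configuration layout, the function name build_index_fields and the sample keys and values are all assumptions made up for illustration. It only shows the general principle: a configuration lists which keys to take from each of the three metadata sources, and those keys become fields in the indexed document.

```python
from typing import Dict, List

# Hypothetical configuration: which metadata keys to turn into index fields,
# grouped by source. The layout and key names are illustrative assumptions,
# not the plugin's actual property names.
CONFIG: Dict[str, List[str]] = {
    "crawldb": ["depth"],
    "parse": ["metatag.description", "metatag.keywords"],
    "content": ["Content-Language"],
}


def build_index_fields(
    config: Dict[str, List[str]],
    sources: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Copy only the configured keys from each metadata source into index fields."""
    fields: Dict[str, str] = {}
    for source_name, keys in config.items():
        metadata = sources.get(source_name, {})
        for key in keys:
            # Keys missing from the metadata are skipped silently.
            if key in metadata:
                fields[key] = metadata[key]
    return fields


# Made-up sample metadata for a single page, one dictionary per source.
sources = {
    "crawldb": {"depth": "2"},
    "parse": {
        "metatag.description": "A sample page",
        "metatag.keywords": "nutch,crawler",
    },
    "content": {"Content-Language": "en", "Content-Type": "text/html"},
}

fields = build_index_fields(CONFIG, sources)
print(fields)
```

In this sketch, Content-Type is present in the content metadata but is not listed in the configuration, so it does not end up in the index fields. Adding a new field is then a matter of editing the configuration rather than writing code.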

Again, Nutch 1.5 contains loads of improvements and you should definitely consider using it if you are on an older version. The next Nutch release will probably be 2.0, for which an RC is already available. Nutch 2.0, a.k.a. NutchGora, is a complete redesign of Nutch based on Apache Gora and uses NoSQL datastores as backends instead of relying on the Hadoop data structures. We will have more releases from the 1.x branch as well as 2.x ones, until the latter becomes stable and widely used by the community.

As usual, have a look, give it a try and contribute to Nutch if you can.