Wednesday 13 June 2012

What's new in Nutch 1.5

Apache Nutch 1.5 was released last week. As with every release, it contains a lot of changes and I will comment on just a few of them.

The main change is actually not in the official list of changes and has not been documented in the Wiki yet. The binary version of Nutch (apache-nutch-1.5.bin.*) now contains the local runtime only, i.e. what you get in runtime/local when compiling the sources. This should make things a bit more straightforward for beginners, as we've seen quite a bit of confusion on the mailing lists about which configuration files should be modified (root/conf vs runtime/local/conf). The src version of Nutch is unchanged and is what you'll need if you want to run Nutch on an existing Hadoop cluster. Of course, the runtime/local directory will also be generated from the source, so you'll still be able to run Nutch in local mode. In a nutshell, if you are not sure about what you're doing, want to use Nutch in local mode without a Hadoop cluster and/or do not need any custom plugins, then the binary version is what you're after. For production I usually recommend the distributed version on a pseudo-distributed Hadoop cluster, as the Hadoop web interfaces provide a wealth of useful information, not to mention that you can have more than one mapper or reducer and harness the full potential of your server.

Apart from the usual dependency updates (Hadoop 1.0.0, Tika 1.1), this release contains many improvements to the webgraph API, which is a better alternative to the default OPIC scoring in Nutch. In the future, it would be interesting to rely on a library such as Apache Giraph to compute the page ranks, as this would simplify the code and also make it more efficient.
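
In case you haven't used it, the webgraph scoring runs as a separate sequence of jobs after fetching. Below is a quick sketch of driving the three steps programmatically; the class names are from the org.apache.nutch.scoring.webgraph package in Nutch 1.x, but the flags and paths are illustrative only, so check the usage messages of bin/nutch webgraph, linkrank and scoreupdater on your version.

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.scoring.webgraph.LinkRank;
import org.apache.nutch.scoring.webgraph.ScoreUpdater;
import org.apache.nutch.scoring.webgraph.WebGraph;
import org.apache.nutch.util.NutchConfiguration;

public class WebGraphScoring {
  public static void main(String[] args) throws Exception {
    // 1. Build the link graph from a fetched segment (paths are illustrative)
    ToolRunner.run(NutchConfiguration.create(), new WebGraph(), new String[] {
        "-webgraphdb", "crawl/webgraphdb",
        "-segment", "crawl/segments/20120613000000" });
    // 2. Run the iterative link analysis over the graph
    ToolRunner.run(NutchConfiguration.create(), new LinkRank(), new String[] {
        "-webgraphdb", "crawl/webgraphdb" });
    // 3. Write the resulting scores back into the crawldb
    ToolRunner.run(NutchConfiguration.create(), new ScoreUpdater(), new String[] {
        "-crawldb", "crawl/crawldb",
        "-webgraphdb", "crawl/webgraphdb" });
  }
}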

As mentioned in a previous post, the Nutch user and dev lists seem to indicate an increasing number of users, which is great. It also means that we tend to see the same questions and issues coming up over and over. One such question was about how to parse and index HTML metatags (see NUTCH-809), which I had contributed two years ago. The parse-metatags plugin is now available in the distribution and the steps are documented in the Wiki. Note that the parsing of HTML metatags is not activated by default; that may change in the next release.

An important and related change in Nutch 1.5 is NUTCH-1264, which provides a generic, configuration-driven plugin for indexing metadata. It is typically used alongside parsing plugins such as parse-metatags above, as sketched below. The metadata converted into fields for indexing can come from the crawldb, the parse metadata or the content metadata. More work is needed to delegate the indexing parts of existing plugins to it; this is likely to happen in the next release.
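
To give an idea of how the two plugins fit together, here is a sketch of the configuration expressed in code; in practice you would set the equivalent properties in conf/nutch-site.xml. The property names (metatags.names for parse-metatags, index.parse.md for index-metadata, and the metatag. prefix on the parse metadata keys) are my assumptions from the documentation, so double-check them against conf/nutch-default.xml and the Wiki.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class MetatagsConfig {
  public static Configuration create() {
    Configuration conf = NutchConfiguration.create();
    // Activate both plugins on top of the default set (plugin.includes is a regex)
    conf.set("plugin.includes",
        conf.get("plugin.includes") + "|parse-metatags|index-metadata");
    // HTML metatags that parse-metatags should extract (assumed property name)
    conf.set("metatags.names", "description,keywords");
    // Parse metadata keys that index-metadata should turn into index fields;
    // parse-metatags is assumed to store them with a "metatag." prefix
    conf.set("index.parse.md", "metatag.description,metatag.keywords");
    return conf;
  }
}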

Again, Nutch 1.5 contains loads of improvements and you should definitely consider using it if you are on an older version. The next Nutch release will probably be 2.0, for which an RC is already available. Nutch 2.0, a.k.a. NutchGora, is a complete redesign of Nutch based on Apache Gora which uses NoSQL datastores as backends instead of relying on the Hadoop data structures. We will have more releases from the 1.x branch as well as 2.x ones, until the latter gets stable and widely used by the community.

As usual, have a look, give it a try and contribute to Nutch if you can.




Friday 21 October 2011

Nutch hosting and monitoring

We now provide hosting and monitoring services for Apache Nutch.

For a fixed price, we will set up, run and monitor your Nutch crawler and report on its progress. The cost of the servers is included in the offer and their hardware specs are superior to what you get from Amazon EC2, with no long-term commitment as the service is billed on a monthly basis.
The price depends on the size of the cluster as well as the complexity of the crawl.


If you use Nutch to feed documents to a search engine, we can also monitor and host your SOLR instances for you!

Monday 26 September 2011

Visualising Nutch mailing-lists traffic

The graph below shows the traffic on the Nutch dev and user mailing lists (http://mail-archives.apache.org/mod_mbox/nutch-user/ and http://mail-archives.apache.org/mod_mbox/nutch-dev/) from March 2005 to August 2011.

Traffic on Nutch mailing lists
(large size version of the graph here)

Unsurprisingly, the traffic on the two lists follows similar trends, with ups and downs, and the user list is globally more active than the dev list, apart from a period in 2005 (early Nutch development), a peak in July 2010 (discussions around Nutch 2.0 and code refactoring) and the last few months. The figures for September are not complete but seem to confirm that Nutch is definitely back to a level of activity which has not been seen in the last five years.




Wednesday 6 July 2011

Crawler-Commons 0.1 released

As announced on various mailing lists:

The initial release of crawler-commons is available from: http://code.google.com/p/crawler-commons/downloads/list


The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort. 

The current version contains resources for:
- parsing robots.txt (see the sketch below)
- parsing sitemaps
- a URL analyzer which returns top-level domains
- a simple HttpFetcher
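
To give a flavour of the API, here is a minimal example using the robots.txt component; the class and method names (SimpleRobotRulesParser, parseContent, isAllowed) are those used in the crawler-commons codebase, but the exact signatures in this 0.1 release may differ slightly.

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
  public static void main(String[] args) {
    byte[] robotsTxt = "User-agent: *\nDisallow: /private/".getBytes();
    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    // Parse the robots.txt content fetched from the given URL, for our agent name
    BaseRobotRules rules = parser.parseContent(
        "http://example.com/robots.txt", robotsTxt, "text/plain", "mycrawler");
    System.out.println(rules.isAllowed("http://example.com/private/secret.html")); // false
    System.out.println(rules.isAllowed("http://example.com/index.html"));          // true
  }
}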

This release is available on Sonatype's OSS Nexus repository [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/] and should be available on Maven Central soon.

Please send your questions, comments or suggestions to http://groups.google.com/group/crawler-commons

Doing the release was quite an interesting experience as I'd never done one before. It was the opportunity to have a closer look at ANT + Maven, at how to publish artefacts, at using Nexus, etc., which I am sure will be useful at some point (Behemoth? GORA? Nutch?).

Now that crawler-commons is released, we can start using it from Nutch and Bixo [see https://issues.apache.org/jira/browse/NUTCH-1031].

Sunday 12 June 2011

Nutch 1.3 released + BerlinBuzzwords presentation

Nutch 1.3 has been released and contains quite a few changes, some of which have been retrofitted from Nutch 2.0 in trunk.

The main modification is that Nutch now relies entirely on SOLR for indexing and searching: we removed our Lucene-based indexer as well as the search webapps (NUTCH-837). The dependencies are now managed with Apache Ivy (NUTCH-821) and we've upgraded SOLR to 3.1 and Tika to 0.9. Another important change is that we now have two separate runtime environments for the local and deployed configurations (NUTCH-843). Nutch 1.3 contains a lot more improvements and bugfixes, so if you use Nutch you should probably migrate to it.

The presentation I gave this week at BerlinBuzzwords is now available online and covers both 1.3 and 2.0, as well as giving an overview of Nutch. The conference itself was great and I met quite a few Nutch users and people planning to use it, as well as Doug Cutting, the creator of Nutch himself!

There are quite a few things planned for the next release(s) and also a large amount of work to do on the documentation which is a bit dated and patchy. Luckily some new committers have recently joined the project and seem keen to help with this.

Friday 27 May 2011

Parsing the Enron email dataset using Tika and Hadoop

In order to parse a large collection of emails, such as the Enron Email Dataset, we might choose to use Apache Hadoop, a scalable computing framework, and Apache Tika, a content analysis toolkit. This can be done easily with Behemoth, an open source platform for large scale document analysis developed by DigitalPebble. For more details of Behemoth, see the Behemoth Tutorial.

Using the August 21, 2009 version of the dataset, the first step is to use Behemoth's CorpusGenerator to create a corpus of BehemothDocuments from the Enron Dataset in HDFS. A BehemothDocument is the native object used by Behemoth. At ingest, it contains the original document, its content type and URL. After processing by a Behemoth module, it also contains the extracted text, additional metadata and annotations created about the document.

Once the dataset has been ingested, the next step is to use the Behemoth Tika module to create a Hadoop Map/Reduce job which extracts the contents of the emails and metadata about them. Using Apache Tika 0.9, 5% of the documents fail to parse correctly. However, using the latest version of Tika (a 1.0 snapshot, revision 825923), only 0.2% of the documents fail.

One way to investigate why parsing fails is to look at the user logs generated by Hadoop, which contain details of the exceptions behind the failing documents. An alternative is to write a custom reducer that sorts the exceptions thrown by Tika, using the exception stack as the key and the document URLs as the values. With Tika revision 825923, four exceptions are thrown, caused by two underlying problems: line lengths exceeding 10,000 characters, the current default limit in the Tika mail parser, and malformed dates. The first problem can be solved by increasing the maximum line length in a MimeEntityConfig object and then modifying TikaProcessor to pass it into the ParseContext.
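
To illustrate, the exception-sorting reducer could look something like the sketch below, assuming a mapper which emits a (stack trace, URL) pair for each failing document; none of these class names come from Behemoth itself.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: groups the URLs of failing documents under each
// distinct exception stack emitted by the mapper.
public class TikaExceptionReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text stackTrace, Iterable<Text> urls, Context context)
      throws IOException, InterruptedException {
    for (Text url : urls) {
      context.write(stackTrace, url);
    }
  }
}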

As for the second problem, the mail parser in Tika currently performs strict parsing, i.e. parsing a document fails as soon as parsing one of its fields fails. TIKA-667 contains a contribution which makes it possible to turn off strict parsing, so that some data can still be extracted from the emails with malformed dates. This too can be configured via MimeEntityConfig. When these changes are incorporated, all the documents are processed correctly.
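
Putting the two fixes together, the parsing side could look like the sketch below. It assumes a Tika build with the TIKA-667 contribution applied and an RFC822Parser which picks the mime4j MimeEntityConfig up from the ParseContext; the method names (setMaxLineLen, setStrictParsing) are from mime4j 0.6 and the class has moved packages between mime4j releases, so adjust as needed.

import java.io.InputStream;

import org.apache.james.mime4j.parser.MimeEntityConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mail.RFC822Parser;
import org.apache.tika.sax.BodyContentHandler;

public class LenientMailParsing {
  public static String parse(InputStream email) throws Exception {
    MimeEntityConfig config = new MimeEntityConfig();
    config.setMaxLineLen(-1);        // disable the maximum line length check
    config.setStrictParsing(false);  // tolerate malformed fields such as bad dates
    ParseContext context = new ParseContext();
    context.set(MimeEntityConfig.class, config);
    BodyContentHandler handler = new BodyContentHandler();
    new RFC822Parser().parse(email, handler, new Metadata(), context);
    return handler.toString();
  }
}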

Saturday 7 May 2011

Nutch talk at Berlin Buzzwords 2011

I'll be giving a talk on Apache Nutch at Berlin Buzzwords.

This talk will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, Lucene, SOLR, Tika or HBase. The presentation will contain examples of real-world use cases.

The second part of the presentation will focus on the latest developments in Nutch and the changes introduced by the forthcoming version 2.0.