Friday 27 May 2011

Parsing the Enron email dataset using Tika and Hadoop

In order to parse a large collection of emails, such as the Enron Email Dataset, we might choose to use Apache Hadoop, a scalable computing framework, and Apache Tika, a content analysis toolkit. This can be done easily with Behemoth, an open source platform for large scale document analysis developed by DigitalPebble. For more details of Behemoth, see the Behemoth Tutorial.

Using the August 21, 2009 version of the dataset, the first step is to use Behemoth's CorpusGenerator to create a corpus of BehemothDocuments from the Enron Dataset in HDFS. A BehemothDocument is the native object used by Behemoth. At ingest, it contains the original document, its content type and URL. After processing by a Behemoth module, it also contains the extracted text, additional metadata and annotations created about the document.
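To make the shape of such a document concrete, here is a minimal sketch in Python of a record that starts with the ingest-time fields and gains the others after processing. This is not Behemoth's actual class: the names `DocumentRecord`, `Annotation` and `is_processed`, and all field names, are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """A labelled span over the extracted text (hypothetical shape)."""
    label: str
    start: int
    end: int


@dataclass
class DocumentRecord:
    """Illustrative stand-in for a document record; not the real Behemoth API."""
    # Available at ingest: the source URL, the raw bytes and the content type.
    url: str
    content: bytes
    content_type: Optional[str] = None
    # Filled in later by a processing step.
    text: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


def is_processed(doc: DocumentRecord) -> bool:
    """A record counts as processed once text has been extracted from it."""
    return doc.text is not None
```

Keeping the raw content next to the extracted text means a later step can re-parse a document without going back to the original files.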

Once the dataset has been ingested, the next step is to use the Behemoth Tika module to create a Hadoop Map/Reduce job that extracts the contents of the emails and metadata about them. With Apache Tika 0.9, 5% of the documents fail to parse correctly. However, with the latest version of Tika (Tika-1.0-snapshot revision 825923), only 0.2% of the documents fail.

One way to investigate why parsing fails is to look at the user logs generated within Hadoop, which contain details of the exceptions thrown for the failing documents. An alternative is to write a custom reducer that groups the exceptions thrown by Tika, using the exception stack trace as the key and the URLs of the affected documents as the values. With Tika revision 825923, four exceptions are thrown, caused by two underlying problems: lines longer than 10,000 characters (the current default limit in the Tika mail parser) and malformed dates. The first problem can be solved by increasing the maximum line length in a MimeEntityConfig object and then modifying TikaProcessor to pass it into the ParseContext.
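The grouping idea behind such a reducer can be sketched outside Hadoop in a few lines of plain Python. This is only an illustration of the key/value layout (stack trace as key, document URLs as values), not the Behemoth or Hadoop API; the function names `map_failure` and `reduce_failures` are hypothetical.

```python
from collections import defaultdict


def map_failure(url, stack_trace):
    """Map step: emit a (stack trace, URL) pair for one failed document."""
    return (stack_trace, url)


def reduce_failures(pairs):
    """Reduce step: collect every URL that failed with the same stack trace.

    Returns (stack_trace, urls) tuples, most frequent failure first.
    """
    grouped = defaultdict(list)
    for stack_trace, url in pairs:
        grouped[stack_trace].append(url)
    return sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)
```

Sorting by the number of affected documents puts the most common failure at the top, which is usually the one worth fixing first.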

As for the second problem, the mail parser in Tika currently performs strict parsing: if any single field fails to parse, parsing of the whole document fails. Tika-667 contains a contribution that makes it possible to turn off strict parsing, so that some data can still be extracted from the emails with malformed dates. This can also be configured via MimeEntityConfig. When these changes are incorporated, all documents are processed correctly.
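The difference between strict and lenient parsing can be illustrated with a small Python sketch that uses only the standard library. It is not Tika's mail parser or its configuration; `parse_headers` and its `strict` flag are hypothetical names chosen for this example.

```python
from email.utils import parsedate_to_datetime


def parse_headers(headers, strict=True):
    """Parse a dict of raw header values.

    In strict mode a single unparseable field aborts the whole document.
    In lenient mode the bad field is recorded and the rest is still returned.
    """
    parsed, errors = {}, []
    for name, raw in headers.items():
        try:
            if name.lower() == "date":
                parsed[name] = parsedate_to_datetime(raw)
            else:
                parsed[name] = raw.strip()
        except (TypeError, ValueError) as exc:
            # Depending on the Python version, a malformed date raises
            # TypeError or ValueError, so both are handled here.
            if strict:
                raise ValueError(f"cannot parse field {name!r}") from exc
            errors.append(name)
    return parsed, errors
```

With `strict=False`, an email whose Date header is malformed still yields its other fields, and the list of failed field names is returned alongside them.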

Saturday 7 May 2011

Nutch talk at Berlin Buzzwords 2011

I'll be giving a talk on Apache Nutch at Berlin Buzzwords.

This talk will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, Lucene, SOLR, Tika and HBase. The presentation will include examples of real-world use cases.

The second part of the presentation will focus on the latest developments in Nutch and the changes introduced by the forthcoming version 2.0.