DigitalPebble's Blog: hadoop

Monday, 9 July 2012

Nutch 2.0 is out (at last!)

Like pretty much any 2.0 release, Nutch 2.0 marks a radical change from the 1.x branch. I've mentioned 2.0 in previous posts but let's do a bit of history first. Nutch was started by Doug Cutting (Lucene's creator) and Mike Cafarella around 2002. Then came the MapReduce paper from Google, and in 2005 MapReduce was implemented as part of Nutch before being spun off as Hadoop, a subproject of Lucene at Apache. You know what happened to Hadoop after that: open source super-stardom, millions of dollars in investment, fierce competition between commercial distributions, and a myriad of related projects (HBase, ZooKeeper, Pig, Hive, Mahout, etc.), with the emergence of new concepts such as Big Data and NoSQL in the background.

Meanwhile Nutch tagged along, following the various releases of Hadoop, but kept the same architecture. It simply relied more and more on other projects instead of implementing its own stuff, mainly Apache Tika (another offspring of Nutch) for parsing and extracting metadata from various document formats and Apache SOLR for indexing and searching documents. This made the code much lighter, easier to maintain and up to date with the functionalities provided by these projects. However, the way we stored and accessed data in Nutch remained the same as at the beginning of Hadoop, i.e. SequenceFiles and MapFiles.

Nutch 2.0 (a.k.a. NutchGora) started in earnest 2 years ago when one of our clients decided to invest in the development of a NoSQL-based version of Nutch. There had been a preliminary version called NutchBase, developed by Dogacan Guney, which was used as a basis. However, instead of relying exclusively on HBase, we decided to implement our own backend-neutral, MapReduce-friendly ORM, which is now an Apache Top Level Project known as Apache GORA and serializes data with Apache AVRO. GORA provides us with unified access to various backends (NoSQL or not), an object-to-datastore mapping mechanism and utilities for MapReduce. This means that Nutch 2.0 can run on HBase, Cassandra, Accumulo or MySQL with just a few configuration files to modify.

One major change in 2.0 is that, instead of keeping the status of the URLs (crawlDB) separate from the data for these URLs (content and text in segments) and from the webgraph (linkDB), we have a single table-like representation of the data where each entry contains everything we know about a URL, even the links that point to it or the various versions of its content (depending on the backend used). Not having separate segments is definitely good news. One of the side effects is that a fetch or parse step can be resumed.
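
To make the single-table idea more concrete, here is a minimal sketch in plain Java of what one entry might hold. The class and field names (WebPageRow, WebTable, status, inlinks and so on) are hypothetical and chosen for illustration only; they are not the actual Nutch 2.0 schema, and an in-memory map stands in for the backend.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical row: everything known about one URL lives in a single entry. */
final class WebPageRow {
    String status = "unfetched"; // in 1.x this lived in the crawlDB
    byte[] content;              // in 1.x this lived in a segment
    String text;                 // parsed text, also kept in a segment in 1.x
    // Source URL -> anchor text of the links pointing here (the linkDB in 1.x).
    final Map<String, String> inlinks = new LinkedHashMap<>();
}

/** Hypothetical table keyed by URL, standing in for the storage backend. */
final class WebTable {
    private final Map<String, WebPageRow> rows = new LinkedHashMap<>();

    /** Returns the row for a URL, creating an empty one on first access. */
    WebPageRow row(String url) {
        return rows.computeIfAbsent(url, k -> new WebPageRow());
    }

    /** Number of URLs known to the table. */
    int size() {
        return rows.size();
    }
}
```

With this layout, marking a page as fetched and recording a link to it are both operations on the same row, e.g. table.row(url).status = "fetched".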

From a technical point of view this means that Nutch is not limited to the sequential processing of Hadoop data structures but can operate at a more atomic level (GET, PUT). Most Nutch tasks are still MapReduce operations, but at least we can get the backends to filter the data and provide the MapReduce operations with only what is needed for a specific task.

The best example of this that I can think of is the update step in a Nutch crawl. Basically what this step does is to merge the information from a round of fetching with the rest of the CrawlDB, typically to change the status of the URLs we have fetched and add the new URLs we have discovered when parsing. With the 1.x branch this is done with a MapReduce operation which takes both the CrawlDB and the segment as input, reduces on the URLs and updates the status of the CrawlDatum objects in the reduce step. All good. Except that as the crawlDB gets larger and larger, the time taken by the update step gets longer and longer up to a point where it ends up being the slowest part of the crawl. Think about a billion entries in the crawlDB and a single URL to update and you'll get the picture.

There are ways of alleviating this for 1.x (e.g. generating multiple segments in one go and updating the crawlDB with all of them at the same time) but the point is that with Nutch 2.0 the cost of the equivalent operation would be proportional to the number of URLs modified, not to the size of the whole crawl dataset.
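
The difference in cost can be illustrated with a small sketch in plain Java. This is not Nutch code: the class UpdateCost and its two methods are hypothetical, plain maps stand in for the crawlDB, the segment and the backend table, and the status values are simple strings.

```java
import java.util.Map;

final class UpdateCost {
    /**
     * 1.x style: read the whole crawlDB plus the fetched entries and write a
     * new copy. Returns the number of entries read.
     */
    static long fullRewrite(Map<String, String> crawlDb,
                            Map<String, String> fetched,
                            Map<String, String> out) {
        long read = 0;
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            read++;
            out.put(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, String> e : fetched.entrySet()) {
            read++;
            out.put(e.getKey(), e.getValue()); // the fetched status overrides the old one
        }
        return read;
    }

    /**
     * 2.0 style: write only the rows that changed, addressed by key.
     * Returns the number of rows written.
     */
    static long keyedUpdate(Map<String, String> table, Map<String, String> fetched) {
        long written = 0;
        for (Map.Entry<String, String> e : fetched.entrySet()) {
            table.put(e.getKey(), e.getValue());
            written++;
        }
        return written;
    }
}
```

Both methods end up with the same statuses; the first one touches every entry in the crawlDB to get there, the second one touches only the fetched URLs.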

The change of paradigm from sequential data structures to a table-like representation is a major one for Nutch and will certainly have many positive side-effects. As this is the first release of 2.0, we can expect quite a few fixes to be needed and a massive overhaul of the documentation in the next months, but the move seems to be welcomed by the Nutch community. Of course 1.x will continue to be the trunk for as long as necessary, i.e. until 2.0 is stable and has all the functionalities that 1.x has.

BTW my slides about 2.0 from last year's Berlin Buzzwords are now here.

It is also a symbolic move: with Nutch being at the origin of many successful projects, it was about time it caught up with its famous offspring and the concepts which arose from it.

Friday, 27 May 2011

Parsing the Enron email dataset using Tika and Hadoop

In order to parse a large collection of emails, such as the Enron Email Dataset, we might choose to use Apache Hadoop, a scalable computing framework, and Apache Tika, a content analysis toolkit. This can be done easily with Behemoth, an open source platform for large scale document analysis developed by DigitalPebble. For more details of Behemoth, see the Behemoth Tutorial.

Using the August 21, 2009 version of the dataset, the first step is to use Behemoth's CorpusGenerator to create a corpus of BehemothDocuments from the Enron Dataset in HDFS. A BehemothDocument is the native object used by Behemoth. At ingest, it contains the original document, its content type and URL. After processing by a Behemoth module, it also contains the extracted text, additional metadata and annotations created about the document.

Once the dataset has been ingested, the next step is to use the Behemoth Tika module to create a Hadoop Map/Reduce job to extract the contents of the emails and metadata about them. Using Apache Tika 0.9, 5% of the documents fail to parse correctly. However, using the latest version of Tika (Tika-1.0-snapshot revision 825923), only 0.2% of the documents fail.

One way to investigate why parsing fails is to look at the user logs generated within Hadoop, which contain details of the exceptions raised for the failing documents. An alternative is to write a custom reducer that groups the exceptions thrown by Tika, with the exception stack trace used as the key and the document URLs as the values. With Tika revision 825923, four exceptions are thrown, caused by two underlying problems: lines longer than the maximum line length of 10,000 characters (the current default in the Tika mail parser) and malformed dates. The first problem can be solved by increasing the maximum line length in a MimeEntityConfig object and then modifying TikaProcessor to pass it into the ParseContext.
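
The grouping performed by such a reducer can be sketched in plain Java. This is only an illustration of the idea and not the code used with Behemoth: the class ExceptionGrouper is hypothetical, it does not use the Hadoop API, and each failure is assumed to arrive as a pair of strings (stack trace, document URL).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ExceptionGrouper {
    /**
     * Groups failures by stack trace. Each element of failures is a
     * two-element array: [0] the stack trace, [1] the document URL.
     * The result maps every distinct stack trace to the URLs that triggered it.
     */
    static Map<String, List<String>> group(List<String[]> failures) {
        Map<String, List<String>> byTrace = new TreeMap<>(); // sorted by stack trace
        for (String[] failure : failures) {
            byTrace.computeIfAbsent(failure[0], k -> new ArrayList<>()).add(failure[1]);
        }
        return byTrace;
    }
}
```

The number of keys in the result gives the number of distinct exceptions, and the size of each list shows how many documents each one affects.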

As for the second problem, currently the mail parser in Tika performs strict parsing, i.e. parsing a document fails when parsing a field fails. Tika-667 contains a contribution that makes it possible to turn off strict parsing, so some data can still be extracted from the emails with the malformed dates. This can also be configured via MimeEntityConfig. When these changes are incorporated, all documents are processed correctly.

Saturday, 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise :

strong background in NLP and Java
GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
IE, Linked Data, Ontologies
statistical approaches and machine learning
large scale computing with Hadoop
knowledge of the following technologies / tools : Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
good social and presentation skills
good spoken and written English, knowledge of other languages would be a plus
taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and our clients in the UK and Europe. Being located in or near Bristol would be a plus.

This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.

Please send your CV and cover letter before the 15th April 2011 to job@digitalpebble.com

Best regards,

Julien Nioche

Tuesday, 14 December 2010

Module management with IVY

I've just recently made some massive changes to the way we manage the code in Behemoth. Prior to that, we had a single src directory containing the various resources for using Tika, GATE, UIMA or Nutch within Behemoth. That worked fine but had a few drawbacks, mostly that we ended up with an enormous job file containing all the dependencies for all the modules. In practice most people use Behemoth with only one type of resource (e.g. UIMA or GATE, not both).

There was also a concept of Sandbox in Behemoth which I mentioned a couple of times. The idea was to allow external contributions based on Behemoth's core and keep them separated.

Before the change, Grant Ingersoll (who has been using Behemoth to parse a large number of documents with Tika) had made a contribution which made it possible to generate a jar file for the Behemoth core classes only. In his case, he wanted to be able to play with the Behemoth output without having to deal with a huge job file. The modularisation of the code allows just that but extends the principle to all the modules.

Here is how it now works. I split the code into several modules managed by Apache Ivy (by simply following the tutorials), e.g. core, uima, gate, tika, solr, etc. Most non-core modules have at least a dependency on core as well as on the external jars that they require. All modules have the same ant targets, and the main ant build script at the root of the project resolves the dependencies, compiles and tests each module. We now get a separate jar file for each module (which Grant needed for the core) but also publish these jars locally via Ivy so that the other modules can rely on them.
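
Since modules rely on the jars published by the modules they depend on, the order in which they are built matters: core has to come first. Ivy takes care of this, so the following plain Java sketch is purely illustrative. The class ModuleOrder is hypothetical and simply shows one way of deriving a valid build order from a map of module dependencies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ModuleOrder {
    /** Returns the modules ordered so that each one comes after everything it depends on. */
    static List<String> buildOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String module : deps.keySet()) {
            visit(module, deps, done, inProgress, order);
        }
        return order;
    }

    /** Depth-first visit: dependencies are appended before the module itself. */
    private static void visit(String module, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(module)) {
            return;
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Dependency cycle at " + module);
        }
        List<String> required = deps.getOrDefault(module, Collections.<String>emptyList());
        for (String dependency : required) {
            visit(dependency, deps, done, inProgress, order);
        }
        inProgress.remove(module);
        done.add(module);
        order.add(module);
    }
}
```

For a map in which tika and solr depend on core, and a custom module depends on tika and solr, core always ends up before tika and solr, and the custom module comes last.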

Building a job file is done on a per-module basis, by going into a module's root directory and calling 'ant job'. The resulting job file should then contain all the dependencies for this module and can be used in Hadoop, as usual.

This new organisation of the code is definitely cleaner, leaner and easier to maintain or extend. If, for instance, a user wants to build a process which combines the functionalities of two or more modules, it is just a matter of creating a new module with the right dependencies on the modules used (say Tika + GATE + SOLR), writing a custom Job and MapReduce class and generating a job file as described above.

The concept of sandboxes is now deprecated, as they are now modules, just like everything else. The beauty of it is that, if the Behemoth modules are published and publicly accessible, one could simply point to them in the Ivy config of a local module and build a Behemoth application with a minimal amount of code.

Isn't that just fun!

Wednesday, 10 November 2010

Gora in incubation at Apache

Great news! GORA was accepted into the Apache Incubator in September. It now has a brand new site, JIRA, wiki, subversion repository, etc. As I explained in my very first post, GORA has been developed as a part of Nutch 2.0 to provide an abstract storage layer. Think of it as an ORM that can be plugged into a number of storage backends (Cassandra, HBase, MySQL, etc.). What we also get from it is the ability to use these backends directly in Hadoop's MapReduce without having to write any custom code. Another way of looking at it is that it provides a simple and unified API over these various backends. This would, for instance, make it possible to develop a prototype using, say, MySQL as a backend and then switch to Cassandra when more scalability is needed. Since your application would be based on GORA you would not need to modify any of your code, but just the mapping schema (which is based on Apache Avro).
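
The benefit of coding against a unified API can be shown with a minimal sketch in plain Java. The names below (KeyValueStore, InMemoryStore, DocumentService) are hypothetical and are not the GORA interfaces; an in-memory map plays the role of a backend, and the point is only that the application class never refers to a concrete store.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical storage abstraction; this is not the real GORA interface. */
interface KeyValueStore<K, V> {
    V get(K key);

    void put(K key, V value);
}

/** Toy backend for the sketch; a real one would talk to MySQL, HBase, etc. */
final class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    private final Map<K, V> data = new HashMap<>();

    @Override
    public V get(K key) {
        return data.get(key);
    }

    @Override
    public void put(K key, V value) {
        data.put(key, value);
    }
}

/** Application code depends only on the interface, so the backend can be swapped. */
final class DocumentService {
    private final KeyValueStore<String, String> store;

    DocumentService(KeyValueStore<String, String> store) {
        this.store = store;
    }

    /** Reads one document, transforms it and writes it back: a single-entry operation. */
    String upperCase(String id) {
        String text = store.get(id);
        if (text == null) {
            return null; // unknown document, nothing to update
        }
        String result = text.toUpperCase(Locale.ROOT);
        store.put(id, result);
        return result;
    }
}
```

Switching backends then amounts to passing a different KeyValueStore implementation to the DocumentService constructor; the service code itself stays unchanged.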

I was thinking about using HBase in Behemoth to avoid having multiple SequenceFiles, but GORA would be a better solution as it would give us more options as to which backend to use. On top of that, we would be able to operate at an atomic level and not only in batches, i.e. process a single document from the store and put it back in the DB. Since Behemoth currently relies on the Hadoop data structures, we can only process a whole corpus and generate a new version as output. That limitation is exactly why we wanted to have GORA in Nutch: imagine you have a crawlDB of 10+ billion entries and add, say, 10M pages per fetch round. Every update step in Nutch 1.x then has to read 10,010M entries and write out between 10,000M and 10,010M. A bit wasteful, isn't it?

Assuming that we use GORA (and the AVRO schema for the Behemoth documents), we could then implement a custom Datastore in GATE to debug a Behemoth corpus or test a GATE application.

Now that GORA is in Apache-land, it will hopefully get more contributors involved and more back ends supported.

Monday, 27 September 2010

SimilarPages is out!

It's always nice to see clients emerging from stealth mode and showing the fruits of their labour to the public. Our friends at http://www.similarpages.com have just done so and I am doubly pleased as this also reflects the work that DigitalPebble did for them.

SimilarPages is an add-on for Firefox which allows you to discover pages which are similar to the one you are currently reading. It's pretty cool and surprisingly easy to use. It's also completely free which is always nice.

From a technical point of view, we helped SimilarPages adapt Nutch to their needs and deploy it on a 400-node cluster on Amazon EC2 to crawl the web. We fetched and parsed a total of more than 3 billion pages from which we obtained 200+ million lists of similarities. The crawlDB itself contained more than 10 billion URLs.

Operating at such a scale is definitely challenging and has been a great experience. From a Nutch point of view, quite a few improvements and bugfixes in Nutch 1.1 come directly from the work done at SimilarPages, so thanks for that guys!

The SimilarPages use case is actually a good example of using Nutch as a crawling platform only, i.e. not indexing the documents with Lucene or SOLR. See A. Białecki's presentation at BerlinBuzzwords 2010 for more examples on this subject.

The computation of the similarities between URLs from the Nutch crawls is made using bespoke Hadoop Map-Reduce code. More details on their approach can be found on SimilarPages' website.

I feel quite proud to have contributed to this project and wish SimilarPages a long life. If you haven't done so, give it a try!

Saturday, 28 August 2010

Behemoth talk from BerlinBuzzwords 2010

The talk I gave on Behemoth at BerlinBuzzwords has been filmed (I do not dare watch it) and is available on http://blip.tv/file/3809855.

The slides can be found on http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.