Showing posts with label gora. Show all posts
Showing posts with label gora. Show all posts

Monday 16 September 2013

NUTCH FIGHT! 1.7 vs 2.2.1

We've had releases in the Nutch 2.x branch for over a year now. As I described in a previous post, the main difference with the 1.x branch is the use of Apache Gora as a storage abstraction layer, which allows to use various flavours of NoSQL databases such as HBase, Cassandra or Accumulo as backends.

There seems to be a growing number of 2.x users, even though 1.x probably still holds the lead, and 2.x (and Gora) is being improved rapidly as a result. The venerable 1.x version has for it its reliability and a few more functionalities currently missing in 2.x, however how do they compare in terms of performance?

Procedure


We have measured the performance of Nutch 1.7 against 2.2.1 (HBase and Cassandra) using 3 million URLs from the CommonCrawl project. These URLs were obtained using the Commoncrawl module in Behemoth.

For this test, we are less interested in fetching the entire contents of the 3M crawl database, but rather how performance varies when using common Nutch commands (inject / generate / parse / update). The fetch time is less relevant here as it is mainly network-bound and is affected less by the differences in storage between both versions.

Disclaimer : It is important to note that we are not comparing Cassandra and HBase themselves, but via  their respective Gora modules and any conclusions drawn are not necessarily applicable in general.  What we are presenting here is what a user gets when using Nutch 2.x with the default configuration for these backends. As we will see later, a lot depends also on the design of Nutch 2.x itself.


Setup


Nutch 1 version: apache-nutch-1.7
Nutch 2 version: apache-nutch-2.2.1
Cassandra version: cassandra-1.2.9
Hbase version: hbase-90.4


We used a large AWS EC2 instance (available at http://aws.amazon.com/ec2/) with 7.5 GB Memory with Hadoop 1.2.0 installed with Apache Whirr. Mapreduce has been configured to allow a maximum of 2 Mappers and 2 Reducers available.

Nutch  configuration

To make the different crawls comparable in terms of the handled urls, newly-discovered links on the webpages are not added to the crawl database, but we only fetch the ones that belong to the original 3M. Furthermore, we limit the number of urls per host to 100 and the size of the fetchlist to 5K.

These parameters are set in nutch-site.xml with the following properties: 

<property>
 <name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>


<property>
<name>generate.max.count</name>
<value>100</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>


Note that the the number of urls per fetchlist can also be set in the crawl script, which we run from runtime/deploy/bin. We also removed the lines in the script which were related to indexing operations as these were less relevant for the comparison.

Results

The results can be found in the table below where the values are the averages for each step over 3 runs. The average time per iteration excludes the fetching as explained above. The steps vary a bit between Nutch 1.x and 2.x (e.g. generation done in a singe step, inlinks computed as part of the update in 2.x) but 


Task on 3M urls/ 5K per iteration as listed in Nutch 1.x/
100 urls/host
Nutch 1.7
Time (min) per Tasks in Iteration
averaged over 3
runs
Nutch 2.2.1 & Cassandra
Time (min) per Tasks in Iteration
averaged over 3
runs
Nutch 2.2.1 &
Hbase
Time (min) per Tasks in Iteration
averaged over 3
runs
inject
15.17
85.41
27.25
crawldb
2.20
-
-
Iteration:



generate:select
5.11
11.8
18.54
generate:partition
0.33
-
-
fetch
12.9
34.0
38.9
parse
2.0
7.6
13.64
crawldb:update
3.14
26.3
18.27
linkdb
0.41
-
-
linkdb-merge
1.08 (last 2 it.)
-
-
Avg. per Iteration (min.)
12
45
50
Total Time (min.)
29
130
78


As we can see from these figures, Nutch 1 beats Nutch 2 with both Cassandra (N2C below) and HBase (N2H) on all tasks and with a considerable margin. Who is to be on second place is less clear, as we shall see when looking at the different steps involved in more depth. 

Injection is obviously fastest in N1 (17 minutes), while, N2H takes almost double the amount of time for the same task (27 minutes), but this is still exceeded by N2C, where Injection takes a staggeringly 85 minutes. 

However, it makes up for it during the iterations, where on average it is about 10 minutes faster than N2H. So if we were to run the crawl with more iterations, the longer time taken for injection (as this is only done once) would take less weight in the total. Within the iteration and except for the updating step, N2C is usually faster than N2H. 

The distribution of Mappers and Reducers for each task also stays constant over all iterations with N2C, while with N2Hbase data seems to be partitioned differently in each iteration and more Mappers are required  as the crawl goes on. This results in a longer processing time as our Hadoop setup allows only up to 2 mappers to be used at the same time. Curiously, this increase in the number of mappers was for the same number of entries as input. 

The number of mappers used by N2H and N2C is the main explanation for the differences between them. To give an example, the generation step in the first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes. The latter had its input represented by 3 Mappers whereas the former required only 2 mappers. The mapping part would have certainly taken a lot less time if it had been forced into 2 mappers with a larger input (or if our cluster allowed more than 2 mappers / reducers) .

Besides the way data are stored, Nutch 1 and 2 differ by their storage strategies. Nutch 1.x has a concept of segments corresponding to a fetchlist, i.e. one round of crawling, separate from the data structure containing the status of the URLs (crawldb) whereas Nutch 2.x stores everything in a single, table-like structure. 

The implications of this is that the fetching and parsing steps of Nutch 1.x have the segments as input (i.e 5K URLs in our tests), whereas in Nutch 2.x the whole table (3M URLs) is the input. The way GORA currently operates is that all the entries are given by the backends then filtered on the client side before being submitted as input to the MapReduce Job. When the number of URLs in the table is large, a substantial amount of time is spent getting the content from the backends, filtering them as a preamble to the MapReduce job and discarding most of them in the process as they are not in the current fetchlist.

There is a JIRA issue in GORA  about filtering the content on the backend side which would certainly improve things but it does not seem to have been worked on for quite some time. 



Conclusions


Although more flexibility in terms of storage is attractive, at the moment this still seems to come at the price of a much lower performance compared to Nutch 1.x, which is also simpler to setup as it does not required to configure GORA and the backends (and the corresponding knowledge and skills).

This also has an impact on the hardware that can be used, as running HBase or Cassandra has an impact on the RAM required. We initially ran this test on a slightly dated laptop (3GB RAM) and could not get it to work successfully with HBase nor Cassandra. The same crawl with Nutch 1.7 ran fine.

Nutch 1.x also has the advantage of having been around for much longer and as a result is a lot more reliable. It also has some features currently missing from 2.x (e.g. pluggable indexing backends).

We can expect the performance in Nutch 2.x to improve a lot as GORA gets more features such as the one mentioned above.

We ran this test on a single server in pseudo distributed mode, but it would be interesting to see what happens on a properly distributed setup. 



Monday 9 July 2012

Nutch 2.0 is out (at last!)

Like pretty much any 2.0 release, Nutch 2.0 marks a radical change from the 1.x branch. I've mentioned 2.0 in previous posts but let's do a bit of history first. Nutch was initially started by Doug Cutting (Lucene's creator) and Mike Caffarella around 2002, then came the MapReduce paper from Google and in 2005 MapReduce was implemented as part of Nutch then became a sub project of Lucene at Apache. You know what happened to Hadoop after that : open source super-stardom, millions of dollars in investment, fierce competition between commercial distributions but also a myriad of related projects (HBase, ZooKeeper, Pig, Hive, Mahout etc...) with in the background the emergence of new concepts such as Big Data or NoSQL.

Meanwhile Nutch tagged along following the various releases of Hadoop but was based on the same architecture. It simply started relying on other projects more and more instead of implementing its own stuff, mainly Apache Tika (another offspring of Nutch) for parsing and extracting metadata from various document formats and Apache SOLR for indexing and searching documents. This made the code much lighter, easier to maintain and also up to date with all sorts of functionalities provided by these projects. However the way we stored and access data in Nutch remained the same since the beginning of Hadoop i.e. SequenceFiles and MapFiles.

Nutch 2.0 (a.k.a NutchGora) started in earnest 2 years ago when one of our clients decided to  invest in the development of a NoSQL-based version of Nutch. There had been a preliminary version called NutchBase developed by Dogacan Guney which was used as a basis except that instead of relying exclusively on HBase, we decided to implement our own Backend-Neutral-MapReduce-friendly-ORM which is now an Apache Top Level Project known as Apache GORA and serializes data with Apache AVRO. GORA provides us with a unified  access to various backends, NoSQL or not, an object-to-datastore mapping mechanism and utilities for MapReduce. This means that Nutch 2.0 can run on HBase, Cassandra, Accumulo or MySQL with just a few configuration files to modify.

One major change in 2.0 is that, instead of having a separation between the status of the URLs (crawlDB) separated from the data for these URLs (content and text in segments) and the webgraph (linkDB), we have a single table-like representation of the data where each entry contains everything we know about a URL, even the links that point to it or the various versions of its content (depending on the backend used). Not having separate segments is definitely good news. One of the side effects is that a fetch or parse step can be resumed. 

From a technical point of view this means that Nutch is not limited to the sequential processing of Hadoop data structures  but can operate at a more atomic level (GET, PUT). Most Nutch tasks are still MapReduce operations though but at least we can get the backends to filter the data and provide only what is needed for a specific task to the MapReduce operations.

The best example of this that I can think of is the update step in a Nutch crawl. Basically what this step does is to merge the information from a round of fetching with the rest of the CrawlDB, typically to change the status of the URLs we have fetched and add the new URLs we have discovered when parsing. With the 1.x branch this is done with a MapReduce operation which takes both the CrawlDB  and the segment as input, reduces on the URLs and updates the status of the CrawlDatum objects in the reduce step. All good. Except that as the crawlDB gets larger and larger, the time taken by the update step gets longer and longer up to a point where it ends up being the slowest part of the crawl. Think about a billion entries in the crawlDB and a single URL to update and you'll get the picture.

There are ways of alleviating this for 1.x (i.e. generate multiple segments in one go and update them with the crawldb at the same time) but the point is that with Nutch 2.0 the equivalent operation would be linear with the number of URLs modified, not the whole crawl dataset.

The change of paradigm between sequential datastructures to a table-like representation is a major change for Nutch which will certainly have many positive side-effects. Being the first release of 2.0, we can expect quite a few fixes to be needed and a massive overhaul of the documentation in the next months but the move seems to be positively welcomed by the Nutch community. Of course 1.x will continue to be the trunk for as long as necessary, i.e. until 2.0 is stable and has all the functionalities that  1.x has.

BTW my slides about 2.0 from last year's Berlin Buzzword are now here.

It is also a symbolic move, with Nutch being at the origin of many successful projects, it was about time it caught up with its famous offspring and the concepts which arose from it.

  

Wednesday 10 November 2010

Gora in incubation at Apache

Great news! GORA has been accepted in the Apache Incubator in September. It now has a brand new site, JIRA, wiki, subversion repository etc... As I explained in my very first post, GORA has been developed as a part of Nutch 2.0 to provide an abstract storage layer. Think about it as a ORM that can be plugged into a number of storage backends (Cassandra, Hbase, Mysql, etc...). What we also get from it is the ability to use these backends directly into Hadoop's MapReduce without having to write any custom code. Another way of looking at it is that it provides a simple and unified API over these various backends. This would allow for instance to develop a prototype using say, MySQL as a backend then switch to Cassandra when more scalability is needed. Since your application would be based on GORA you would not need to modify any of your code, but just the mapping schema (which is based on Apache Avro).

I was thinking about using HBase in Behemoth to avoid having multiple SequenceFiles but GORA would be a better solution as it would give us more options as to what backend to use. On top of that, we would be able to operate at an atomic level and not by batches only, i.e. process a single document from the store and put it back to the DB. Since Behemoth currently relies on the Hadoop data structures, we can only process a whole corpus and generate a new version as output, which is exactly why we wanted to have GORA in Nutch (imagine you have a 10+ billion crawlDB and add say 10M pages per fetch round - every update step in Nutch 1.x requires to read 1010M entries and write out between 1000 and 1010M; a bit wasteful isn't it? )

Assuming that we use GORA (and the AVRO schema for the Behemoth documents), we could then implement a custom Datastore in GATE to debug a Behemoth corpus or test a GATE application.

Now that GORA is in Apache-land, it will hopefully get more contributors involved and more back ends supported.

Saturday 28 August 2010

Behemoth talk from BerlinBuzzwords 2010

The talk I gave on Behemoth at BerlinBuzzwords has been filmed (I do not dare watching it) and is available on http://blip.tv/file/3809855.

The slides can be found on http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/nioche_bbuzz2010.odp

The talk contains a quick demo of GATE and mentions Tika, UIMA and of course Hadoop.

Friday 13 August 2010

Towards Nutch 2.0

Nevermind the dodgy look of the blog - I'll improve that later!

For my first post, I'd like to mention the progress we've made recently towards Apache Nutch 2.0. It is based on a branch named NutchBase which has been developed mainly by Doğacan Güney and it is now in the trunk of the SVN repository. One of the main aspects of Nutch 2.0 is that it is now storing its data in a datastore and not in Hadoop's file-based structures. Note that we still have the distribution and replication of the data over a whole cluster and data locality for MapReduce but we also have the possibility to insert or modify a random entry in the table without having to read/write the whole data structure as it was the case before.

Nutch uses a project named GORA as an intermediate between our code and the backend storage. There would be a lot of things to say on GORA but to make it short what we are trying to achieve with it is to make it a sort of common API for NoSQL stores. GORA already has implementations for HBase and Cassandra but also SQL. The plan for GORA is to put it in the Apache Incubator or possibly as an Apache subproject (Hadoop? HBase? Cassandra?). We'll see how it goes.

There are quite a few structural changes in Nutch, most notably the fact that there aren't any segments any more as all the information about a URL (metadata, original content, extracted text, ...) in stored in a single table which means for instance no more segments to merge or metadata to move back to the crawldb. It's all in one place!

There are other substantial changes in 2.0, notably the removal of the Lucene-based indexing and search as we now rely on SOLR. Other indexing backends might be added later. Another step towards delegating functionalities to external projects is the increased used of Apache Tika for the parsing. We've removed quite a few legacy parsers from Nutch and let Tika do the work for us. We've also revamped the organisation of the code and did a lot of code clean up.

Nutch 2.0 is still at an early stage and we are actively working on it, testing, debugging etc... The good news is that it is not only an architectural change but also a basis for a whole lot of new functionalities (see for instance https://issues.apache.org/jira/browse/NUTCH-882).

I'll keep you posted on our progress, as usual : give it a try, get involved, join us...