Monday, 16 September 2013

NUTCH FIGHT! 1.7 vs 2.2.1

We've had releases in the Nutch 2.x branch for over a year now. As I described in a previous post, the main difference with the 1.x branch is the use of Apache Gora as a storage abstraction layer, which allows to use various flavours of NoSQL databases such as HBase, Cassandra or Accumulo as backends.

There seems to be a growing number of 2.x users, even though 1.x probably still holds the lead, and 2.x (and Gora) is being improved rapidly as a result. The venerable 1.x version has for it its reliability and a few more functionalities currently missing in 2.x, however how do they compare in terms of performance?

Procedure


We have measured the performance of Nutch 1.7 against 2.2.1 (HBase and Cassandra) using 3 million URLs from the CommonCrawl project. These URLs were obtained using the Commoncrawl module in Behemoth.

For this test, we are less interested in fetching the entire contents of the 3M crawl database, but rather how performance varies when using common Nutch commands (inject / generate / parse / update). The fetch time is less relevant here as it is mainly network-bound and is affected less by the differences in storage between both versions.

Disclaimer : It is important to note that we are not comparing Cassandra and HBase themselves, but via  their respective Gora modules and any conclusions drawn are not necessarily applicable in general.  What we are presenting here is what a user gets when using Nutch 2.x with the default configuration for these backends. As we will see later, a lot depends also on the design of Nutch 2.x itself.


Setup


Nutch 1 version: apache-nutch-1.7
Nutch 2 version: apache-nutch-2.2.1
Cassandra version: cassandra-1.2.9
Hbase version: hbase-90.4


We used a large AWS EC2 instance (available at http://aws.amazon.com/ec2/) with 7.5 GB Memory with Hadoop 1.2.0 installed with Apache Whirr. Mapreduce has been configured to allow a maximum of 2 Mappers and 2 Reducers available.

Nutch  configuration

To make the different crawls comparable in terms of the handled urls, newly-discovered links on the webpages are not added to the crawl database, but we only fetch the ones that belong to the original 3M. Furthermore, we limit the number of urls per host to 100 and the size of the fetchlist to 5K.

These parameters are set in nutch-site.xml with the following properties: 

<property>
 <name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>


<property>
<name>generate.max.count</name>
<value>100</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>


Note that the the number of urls per fetchlist can also be set in the crawl script, which we run from runtime/deploy/bin. We also removed the lines in the script which were related to indexing operations as these were less relevant for the comparison.

Results

The results can be found in the table below where the values are the averages for each step over 3 runs. The average time per iteration excludes the fetching as explained above. The steps vary a bit between Nutch 1.x and 2.x (e.g. generation done in a singe step, inlinks computed as part of the update in 2.x) but 


Task on 3M urls/ 5K per iteration as listed in Nutch 1.x/
100 urls/host
Nutch 1.7
Time (min) per Tasks in Iteration
averaged over 3
runs
Nutch 2.2.1 & Cassandra
Time (min) per Tasks in Iteration
averaged over 3
runs
Nutch 2.2.1 &
Hbase
Time (min) per Tasks in Iteration
averaged over 3
runs
inject
15.17
85.41
27.25
crawldb
2.20
-
-
Iteration:



generate:select
5.11
11.8
18.54
generate:partition
0.33
-
-
fetch
12.9
34.0
38.9
parse
2.0
7.6
13.64
crawldb:update
3.14
26.3
18.27
linkdb
0.41
-
-
linkdb-merge
1.08 (last 2 it.)
-
-
Avg. per Iteration (min.)
12
45
50
Total Time (min.)
29
130
78


As we can see from these figures, Nutch 1 beats Nutch 2 with both Cassandra (N2C below) and HBase (N2H) on all tasks and with a considerable margin. Who is to be on second place is less clear, as we shall see when looking at the different steps involved in more depth. 

Injection is obviously fastest in N1 (17 minutes), while, N2H takes almost double the amount of time for the same task (27 minutes), but this is still exceeded by N2C, where Injection takes a staggeringly 85 minutes. 

However, it makes up for it during the iterations, where on average it is about 10 minutes faster than N2H. So if we were to run the crawl with more iterations, the longer time taken for injection (as this is only done once) would take less weight in the total. Within the iteration and except for the updating step, N2C is usually faster than N2H. 

The distribution of Mappers and Reducers for each task also stays constant over all iterations with N2C, while with N2Hbase data seems to be partitioned differently in each iteration and more Mappers are required  as the crawl goes on. This results in a longer processing time as our Hadoop setup allows only up to 2 mappers to be used at the same time. Curiously, this increase in the number of mappers was for the same number of entries as input. 

The number of mappers used by N2H and N2C is the main explanation for the differences between them. To give an example, the generation step in the first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes. The latter had its input represented by 3 Mappers whereas the former required only 2 mappers. The mapping part would have certainly taken a lot less time if it had been forced into 2 mappers with a larger input (or if our cluster allowed more than 2 mappers / reducers) .

Besides the way data are stored, Nutch 1 and 2 differ by their storage strategies. Nutch 1.x has a concept of segments corresponding to a fetchlist, i.e. one round of crawling, separate from the data structure containing the status of the URLs (crawldb) whereas Nutch 2.x stores everything in a single, table-like structure. 

The implications of this is that the fetching and parsing steps of Nutch 1.x have the segments as input (i.e 5K URLs in our tests), whereas in Nutch 2.x the whole table (3M URLs) is the input. The way GORA currently operates is that all the entries are given by the backends then filtered on the client side before being submitted as input to the MapReduce Job. When the number of URLs in the table is large, a substantial amount of time is spent getting the content from the backends, filtering them as a preamble to the MapReduce job and discarding most of them in the process as they are not in the current fetchlist.

There is a JIRA issue in GORA  about filtering the content on the backend side which would certainly improve things but it does not seem to have been worked on for quite some time. 



Conclusions


Although more flexibility in terms of storage is attractive, at the moment this still seems to come at the price of a much lower performance compared to Nutch 1.x, which is also simpler to setup as it does not required to configure GORA and the backends (and the corresponding knowledge and skills).

This also has an impact on the hardware that can be used, as running HBase or Cassandra has an impact on the RAM required. We initially ran this test on a slightly dated laptop (3GB RAM) and could not get it to work successfully with HBase nor Cassandra. The same crawl with Nutch 1.7 ran fine.

Nutch 1.x also has the advantage of having been around for much longer and as a result is a lot more reliable. It also has some features currently missing from 2.x (e.g. pluggable indexing backends).

We can expect the performance in Nutch 2.x to improve a lot as GORA gets more features such as the one mentioned above.

We ran this test on a single server in pseudo distributed mode, but it would be interesting to see what happens on a properly distributed setup. 



9 comments:

  1. This is the first comprehensive post on this topic to be published and I thank the digital pebble team for confirming what some of us suspected for quite some time now.
    Thank you very much for this Julien & Co.
    Lewis

    ReplyDelete
  2. Are you going to repeat the comparision considering the changes (filtered scan) of https://issues.apache.org/jira/browse/NUTCH-1674?

    ReplyDelete
    Replies
    1. Hi anonymous,

      Yes, have been working on this lately and will (hopefully) present the results at the next ApacheCon in November. You are welcome to help if you want to : running Nutch 1.x is easy but configuring 2.x on a distributed cluster is a real pain (incompatibilities between dependencies, not working on Hadoop 2 etc...). I can safely predict the outcome of the whole experiement : Nutch 2.x is getting faster but is (and will probably always be) behind Nutch 1.

      Delete
  3. Hi Julien,

    3 years later, do you have more confidence in Nutch 2.x now? I'm starting a nutch + elasticsearch project.

    Thanks.

    James

    ReplyDelete
  4. Hi James. There seems to be more people using it and there has probably been some improvements to the 2.x branch and GORA but I'd personally not use it and stick to 1.x. Or more likely, I'd just use StormCrawler : it is more flexible, faster, more customisable and can be used with ES. Ok, I am biased because it is my baby but quite a few of my clients have migrated from Nutch to SC and haven't looked back since

    ReplyDelete
    Replies
    1. Would be interesting to do a re-run of the benchmark. I meant to, 2 years ago but gave up after trying to set up 2.x and GORA on EC2. Would probably be easier now with Docker and could add SC to the comparison.

      Delete
    2. Thanks. I'll make the switch to 1.x if we stick with Nutch but I'll try StormCrawler first. I'm currently running Nutch 2.3, Hbase 0.94.26 and ES 1.7 but I've hit a couple of snags. I couldn't get a working setup running using the latest versions of all 3.

      Delete
    3. Great, let us know how you get on with SC. There's currently no resources for HBase but you could use Elasticsearch as a back-end; see README on ES module.

      Delete
  5. This comment has been removed by the author.

    ReplyDelete

Note: only a member of this blog may post a comment.