DigitalPebble's Blog: What's new in Storm-Crawler 0.9

Version 0.9 of Storm-Crawler has just been released. It contains many improvements and bug fixes, and we recommend that all existing users upgrade to it.

Here are the main changes:

Core

Moved to Storm 0.10.0 #229
FetcherBolt can dump content of its queues to log #45
PluggableSchedulers #245
Bugfix HTTP protocol setConnectionRequestTimeout
Sitemap: option to filter out URLs older than a certain threshold #249
New URLFilter: remove links to self #252
HTTP protocol can limit amount of content fetched #206
AbstractStatusUpdaterBolt allows proper acking of tuples #241
Improvements to URL normalization #264 #205 #120
Improvements to Robots caching #265

Elasticsearch

Spout: one instance per shard #198
AggregationSpout #237
ES-based Spouts use a static Client instance #258
ES uses NodeClient only if no address is specified #260
StatusUpdaterBolts must be able to ack/fail explicitly #216

The AggregationSpout in particular is a very useful feature. The existing ElasticSearchSpout offers very few guarantees regarding the diversity of hosts retrieved by its queries. Randomizing the results improves this, but at the cost of the memory used by ES for field caching: limiting the field cache leads to poor performance, and the eviction mechanism slows the queries considerably. The AggregationSpout, on the other hand, guarantees a good diversity of URLs by bucketing the search results per hostname (or TLD or IP, depending on the value of es.status.routing.fieldname). It is likely to be improved further once we move to Elasticsearch 2.x (see below).
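To make the bucketing idea concrete, here is a minimal sketch in plain Java. It is not the AggregationSpout code, which relies on Elasticsearch to do the grouping; the class name HostBucketing, the method bucketByHost and the maxPerHost parameter are made up for illustration. It only shows why capping the number of URLs per hostname yields a more diverse batch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: HostBucketing is a hypothetical helper, not part of Storm-Crawler.
class HostBucketing {

    /** Groups URLs by hostname and keeps at most maxPerHost URLs in each bucket. */
    static Map<String, List<String>> bucketByHost(List<String> urls, int maxPerHost) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs for which no host can be extracted
            }
            List<String> bucket = buckets.computeIfAbsent(host, h -> new ArrayList<>());
            if (bucket.size() < maxPerHost) {
                bucket.add(url); // a single large site cannot fill the whole batch
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "http://a.com/1", "http://a.com/2", "http://a.com/3", "http://b.org/x");
        System.out.println(bucketByHost(candidates, 2));
    }
}
```

With a cap of 2, the third a.com URL is left for a later batch, so b.org is not starved by the larger site.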

Both ES spout implementations benefit from the sharding mechanism introduced in #198: if the configuration specifies es.status.routing: true, the StatusUpdaterBolt directs the URLs to specific shards based on the value of partition.url.mode, i.e. all the URLs for a particular host or TLD are colocated on the same Elasticsearch shard. This means that shard sizes can become uneven, depending on the distribution of URLs in your crawl. On the other hand, it is now possible to have one Spout instance per shard and parallelise the reads from ES while preserving politeness.
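The colocation property can be sketched as follows. This is an assumption-laden illustration rather than what Elasticsearch does internally: ES applies its own hash function to the routing value, whereas the sketch uses String.hashCode, and ShardRouting, routingKey and shardFor are hypothetical names. The point is only that a key derived from the hostname always maps to the same shard.

```java
import java.net.URI;

// Illustrative only: ShardRouting is a hypothetical helper, not Storm-Crawler or ES code.
class ShardRouting {

    /** Derives the routing key from a URL; here the hostname, as when partitioning by host. */
    static String routingKey(String url) {
        return URI.create(url).getHost();
    }

    /** Maps a routing key to a shard number in [0, numShards). The hash function is an assumption. */
    static int shardFor(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 5;
        // Both URLs share a hostname, so they get the same shard number.
        System.out.println(shardFor(routingKey("http://example.com/a"), numShards));
        System.out.println(shardFor(routingKey("http://example.com/b?x=1"), numShards));
    }
}
```

Because every URL of a host lives on one shard, a spout instance reading only that shard sees all the URLs for the host, which is what allows politeness to be preserved when the reads are parallelised.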

Finally, you should see a noticeable improvement in performance now that the StatusUpdaterBolt acks/fails the tuples explicitly (#216). Previously, all tuples were automatically acked as soon as they were buffered for indexing to ES, i.e. they could be acked in the spout and removed from its internal cache even though the updates had not yet been committed to the ES index. This meant that the spout could resend the same URL down the topology after querying ES for new URLs. This was obviously quite wasteful, but luckily it is now fixed!
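The difference between acking on buffering and acking on commit can be sketched without any Storm or ES dependency. Everything here is hypothetical (DeferredAckBuffer, its batchSize parameter, and the acker callback standing in for the collector's ack call); the actual StatusUpdaterBolt is more involved. The sketch only shows the ordering: nothing is acked until the batch has been flushed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: a stand-in for a status updater that defers acks until a batch is written.
class DeferredAckBuffer {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<String> acker; // hypothetical stand-in for acking a tuple
    private final int batchSize;

    DeferredAckBuffer(int batchSize, Consumer<String> acker) {
        this.batchSize = batchSize;
        this.acker = acker;
    }

    /** Buffers an update; it is NOT acked at this point. */
    void add(String url) {
        pending.add(url);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Simulates a successful bulk write, then acks every update that was part of it. */
    void flush() {
        // A real bolt would send the bulk request here and ack only once it has succeeded.
        for (String url : pending) {
            acker.accept(url);
        }
        pending.clear();
    }

    public static void main(String[] args) {
        List<String> acked = new ArrayList<>();
        DeferredAckBuffer buffer = new DeferredAckBuffer(2, acked::add);
        buffer.add("http://a.com/1");
        System.out.println("acked after first add: " + acked.size());
        buffer.add("http://a.com/2");
        System.out.println("acked after second add: " + acked.size());
    }
}
```

Until the flush happens the spout still considers the URLs in flight, so it has no reason to emit them again.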

What's next?

We should see plenty of further improvements in the coming months, in particular an upgrade to Apache Storm 1.0 (which is due any time now) and a move to Elasticsearch 2.x (#257). Thanks to all users and contributors. Happy crawling!

2 comments:

Gagandeep Singh, 6 April 2016 at 05:28
Hi Julien,
For quite a long time I have been worried about real-time web crawling use cases. I have read about Nutch, but it needs a pretty old technology stack (Hadoop 2.2 or 2.1, whereas I am using 2.6, and Elasticsearch 1.7/1.4, whereas I am using 2.0).

Can you tell me which versions of Hadoop and Elasticsearch it uses? And when will version 1.0 of storm-crawler be released?

Julien Nioche, 6 April 2016 at 08:00
Hi, Storm-Crawler does not use Hadoop but (surprise!) Apache Storm. We are currently on ES 1.7.2 [https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/pom.xml#L21] but have a branch for 2.x which we are working on.
As for naming it 1.0, no definite plans - it's just a number, as long as we keep releasing on a regular basis I'm fine with any version number we use. We could release the next one as 1.0 to coincide with Storm 1.0 but then I don't want to be tied to the Storm version. We'll see!


Wednesday, 16 March 2016