Version 0.9 of Storm-Crawler has just been released. It contains many improvements and bugfixes, and we recommend that all existing users upgrade to it.
Here are the main changes:
Core
- Moved to Storm 0.10.0 #229
- FetcherBolt can dump content of its queues to log #45
- PluggableSchedulers #245
- Bugfix HTTP protocol setConnectionRequestTimeout
- Sitemap : option to filter out URLs older than a certain threshold #249
- New URLFilter : remove links to self #252
- HTTP protocol can limit amount of content fetched #206
- AbstractStatusUpdaterBolt allows proper acking of tuples #241
- Improvements to URL normalization #264 #205 #120
- Improvements to Robots caching #265
Elasticsearch
- Spout : one instance per shard #198
- AggregationSpout #237
- ES-based Spouts use a static Client instance #258
- ES uses NodeClient only if no address is specified #260
- StatusUpdaterBolts must be able to ack/fail explicitly #216
The AggregationSpout in particular is a very useful feature. The existing ElasticSearchSpout offers very few guarantees regarding the diversity of hosts retrieved by its queries. Randomizing the results improves this, but randomization increases the memory ES uses for field caching: limiting the field cache leads to poor performance, and the eviction mechanism slows the queries considerably. The AggregationSpout, on the other hand, guarantees a good diversity of URLs by bucketing the search results per hostname (or TLD or IP, depending on the value of es.status.routing.fieldname). It is likely to improve further once we move to Elasticsearch 2.x (see below).
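To make the bucketing idea concrete, here is a minimal in-memory sketch in Java. It is not the AggregationSpout code and does not use the Elasticsearch API: the class HostBucketSketch, the method bucketByHost and the maxPerHost parameter are hypothetical names chosen for illustration. It only shows why grouping candidate URLs per hostname and capping each group gives a more diverse batch than taking the first N results.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: HostBucketSketch is a hypothetical class, not part of Storm-Crawler.
class HostBucketSketch {

    // Groups URLs by hostname and keeps at most maxPerHost URLs in each bucket,
    // so that no single host can dominate the batch handed to the fetcher.
    static Map<String, List<String>> bucketByHost(List<String> urls, int maxPerHost) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs without a host component
            }
            List<String> bucket = buckets.computeIfAbsent(host, h -> new ArrayList<>());
            if (bucket.size() < maxPerHost) {
                bucket.add(url);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> due = Arrays.asList(
                "http://a.example.com/1",
                "http://a.example.com/2",
                "http://a.example.com/3",
                "http://b.example.com/1");
        // With a cap of 2, the third URL of a.example.com is left for a later query.
        System.out.println(bucketByHost(due, 2));
    }
}
```

In the real spout the grouping is done on the Elasticsearch side; the sketch just mirrors the effect on the results.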
Both ES spout implementations benefit from the sharding mechanism introduced in #198: if the configuration specifies es.status.routing: true, the StatusUpdaterBolt directs the URLs to specific shards based on the value of partition.url.mode, i.e. all the URLs for a particular host or TLD are colocated on the same Elasticsearch shard. One consequence is that shard sizes can become uneven, depending on the distribution of URLs in your crawl. Another is that it is now possible to have one Spout instance per shard and parallelise the reads from ES while preserving politeness.
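The routing principle can be sketched as follows. This is a simplified illustration, not the StatusUpdaterBolt implementation: ShardRoutingSketch, routingKey and shardFor are hypothetical names, the hostname is assumed as the partition key, and String.hashCode stands in for the hash function that Elasticsearch applies to the routing value internally.

```java
import java.net.URI;

// Illustrative only: ShardRoutingSketch is a hypothetical class, not part of Storm-Crawler.
class ShardRoutingSketch {

    // Derives the routing key from a URL. The hostname is assumed here;
    // a crawl partitioned differently would derive another key.
    static String routingKey(String url) {
        return URI.create(url).getHost();
    }

    // Maps a routing key to a shard number in [0, numShards).
    // hashCode() is a stand-in for the hash Elasticsearch uses internally.
    static int shardFor(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 5;
        String[] urls = {
                "http://a.example.com/page1",
                "http://a.example.com/page2",
                "http://b.example.com/index.html"
        };
        for (String url : urls) {
            String key = routingKey(url);
            System.out.println(url + " -> key " + key + " -> shard " + shardFor(key, numShards));
        }
    }
}
```

Because every URL of a given host yields the same key, all of them land on the same shard, which is what lets a single spout instance per shard enforce politeness for the hosts it owns.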
Finally, you should see a noticeable improvement in performance now that the StatusUpdaterBolt acks/fails the tuples explicitly (#216). Previously, all tuples were automatically acked as soon as they were buffered for indexing to ES, i.e. they could get acked in the spout and removed from its internal cache even though the updates were not yet committed to the ES index. This meant that the spout could resend the same URL down the topology after querying ES for new URLs. Obviously quite wasteful, but now luckily fixed!
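The difference between the two behaviours can be modelled in a few lines of plain Java. This sketch deliberately avoids the Storm and Elasticsearch APIs: DeferredAckSketch, its batchSize parameter and the ack callback are hypothetical stand-ins for the bolt, its bulk buffer and the output collector. The point it illustrates is that an update is acked only once its batch has been flushed, not when it enters the buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: DeferredAckSketch is a hypothetical model, not Storm-Crawler code.
class DeferredAckSketch {
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<String> ack; // stand-in for acking a tuple
    private final int batchSize;

    DeferredAckSketch(int batchSize, Consumer<String> ack) {
        this.batchSize = batchSize;
        this.ack = ack;
    }

    // Buffers a status update; nothing is acked at this point.
    void update(String url) {
        buffer.add(url);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // A real bolt would send the batch to the index here and fail the
    // tuples whose update was rejected. This sketch assumes success.
    void flush() {
        for (String url : buffer) {
            ack.accept(url);
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        List<String> acked = new ArrayList<>();
        DeferredAckSketch bolt = new DeferredAckSketch(2, acked::add);
        bolt.update("http://a.example.com/1");
        System.out.println("acked after first update: " + acked.size());
        bolt.update("http://a.example.com/2");
        System.out.println("acked after second update: " + acked.size());
    }
}
```

With acking deferred this way, the spout keeps a URL in its internal cache until the update is actually written, so it has no reason to emit the same URL again in the meantime.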
What's next?
We should see plenty of further improvements in the coming months, in particular an upgrade to Apache Storm 1.0 (which is due any time now) and a move to Elasticsearch 2.x (#257). Thanks to all users and contributors. Happy crawling!