Monday 31 October 2016

What's new in StormCrawler 1.2

StormCrawler 1.2 has been released today after a busy and exciting month, the highlight of which was certainly CommonCrawl's announcement of their news dataset, which is powered by StormCrawler. This helped raise the profile of the project and also brought various improvements to the WARC and Elasticsearch modules (see below). Another piece of great news was that my talk was accepted for ApacheCon BigData next month in Seville, which prompted a Q&A interview on Linux.com.

Back to the content of the release. There have been many improvements at various levels, the main one being that the WARC module was moved to the main repository [#313]. It has received many bug fixes and improvements since CommonCrawl started using it and is now stable enough to join the other external modules.

We recommend that all users upgrade their configuration to version 1.2 of StormCrawler.

Apart from minor bug fixes, the main changes in this new version are:

Core

  • Removed StatusStreamBolt [#341]
  • New Parse Filters:
    • MD5 signature [#354]
    • DomainParseFilter [#356]
  • URL Filters:
    • URL Normalisation - remove parameters where the value is a 32-bit hash [#363] (see the sketch after this list)
    • Filtering: treat path parameters as query parameters [#366]
    • BasicURLFilter to remove URLs based on path repetition and max length [#368]
  • Add metadata.discoveryDate field to enable tracking discovery rate [#360]
  • Add super class for bolts using the status stream [#353]
  • JSoup: handle redirections via meta tag [#350]
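
To illustrate the normalisation rule from #363, here is a minimal sketch of how query parameters whose value looks like a hex hash could be stripped from a URL. It is not the actual StormCrawler filter: the class name and the 8-digit minimum are assumptions made for the example.

```java
import java.util.regex.Pattern;

/**
 * Minimal sketch, not the actual StormCrawler normaliser: drops query
 * parameters whose value looks like a hex hash, in the spirit of #363.
 * The pattern (8 or more hex digits) is an assumption for illustration.
 */
public class HashParamNormalizer {

    private static final Pattern HEX_HASH = Pattern.compile("[0-9a-fA-F]{8,}");

    public static String normalize(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url;
        }
        StringBuilder out = new StringBuilder(url.substring(0, q));
        char sep = '?';
        for (String param : url.substring(q + 1).split("&")) {
            int eq = param.indexOf('=');
            String value = eq < 0 ? "" : param.substring(eq + 1);
            if (HEX_HASH.matcher(value).matches()) {
                continue; // drop the hash-valued parameter
            }
            out.append(sep).append(param);
            sep = '&';
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // the hash-like session id is dropped, the normal parameter is kept
        System.out.println(normalize(
            "http://example.com/page?id=42&sid=0a1b2c3d4e5f67890a1b2c3d4e5f6789"));
        // -> http://example.com/page?id=42
    }
}
```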

Tika

  • Upgraded to Tika 1.13 [#285]
  • Combine JSoupParser with Tika [#357] (see the sketch after this list)
  • Tika parser can now parse embedded documents [#358]
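
The combination of the two parsers can be wired up as in the hypothetical sketch below. The "tika" stream name and the hand-off of non-HTML documents on it are assumptions made for the illustration, not necessarily the exact mechanism used by StormCrawler.

```java
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.tika.ParserBolt;

public class CombinedParsersTopology {

    public static TopologyBuilder sketch() {
        TopologyBuilder builder = new TopologyBuilder();
        // spout and indexing/status bolts omitted for brevity
        builder.setBolt("fetcher", new FetcherBolt())
               .localOrShuffleGrouping("spout");
        // JSoup parses the HTML documents
        builder.setBolt("jsoup", new JSoupParserBolt())
               .localOrShuffleGrouping("fetcher");
        // assumption for illustration: anything JSoup cannot handle is
        // passed on to Tika via a dedicated "tika" stream
        builder.setBolt("tika", new ParserBolt())
               .localOrShuffleGrouping("jsoup", "tika");
        return builder;
    }
}
```
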
Elasticsearch

  • Elasticsearch upgraded to 2.4.1
  • Metadata keys with multiple values not indexed correctly in ES [#345]
  • Refactoring into AbstractSpout for ES [#348]
  • Status index - fields stored unnecessarily [#351]
  • Cache URLs post ack/fail [#349]

The latter is a substantial change to the way the Elasticsearch spouts work. All three flavours of spout hold a cache of the URLs being processed and use it to make sure that URLs returned by a query are not added twice. This worked fine but did not cater for situations where a URL sat towards the bottom of the buffer and was acked or failed shortly before the buffer was refilled from ES. In such cases, the changes to the status index had not had time to be committed to the underlying index and, as a result, the same URL was returned by the next query. This led to 10 to 15% of URLs being unnecessarily re-fetched within a short delay. What #349 does is keep URLs in the cache for an extra N seconds after they have been acked or failed, giving the changes time to be reflected in the search results (this is of course configurable via es.status.ttl.purgatory).
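
To make the idea concrete, here is a minimal sketch of such a cache, assuming a plain map keyed by URL; the class and method names are invented for the example and this is not the actual spout code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the "purgatory" idea behind #349, not the actual
 * StormCrawler code: URLs remain in the in-flight cache for an extra
 * TTL after being acked or failed, so that a query hitting the status
 * index before the update has been committed cannot re-emit them.
 */
public class InFlightCache {

    private static final long IN_FLIGHT = Long.MAX_VALUE;

    // URL -> earliest time (epoch millis) at which the entry may be evicted
    private final Map<String, Long> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public InFlightCache(long purgatorySecs) {
        // would correspond to es.status.ttl.purgatory
        this.ttlMillis = purgatorySecs * 1000L;
    }

    /** Called when a URL is emitted; returns false if it is already known. */
    public boolean add(String url) {
        return cache.putIfAbsent(url, IN_FLIGHT) == null;
    }

    /** Called on ack or fail: start the purgatory countdown instead of evicting. */
    public void release(String url) {
        cache.replace(url, System.currentTimeMillis() + ttlMillis);
    }

    /** Called before refilling the buffer from Elasticsearch. */
    public void evictExpired() {
        final long now = System.currentTimeMillis();
        cache.values().removeIf(expiry -> expiry <= now);
    }
}
```

Before adding the results of a query to its buffer, the spout can then check add(), so a URL acked or failed less than N seconds earlier is simply skipped.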

Coming next?

The releases seem to come more and more frequently. It is not yet certain what the next one will have in store, but I am sure the discussions at ApacheCon, as well as the constant stream of new users, will bring new functionality and bug fixes.

In the meantime and as usual, thanks to all contributors and users and happy crawling!