StormCrawler 1.14 was released yesterday and, as usual, contains loads of improvements and bugfixes.
You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1
This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.
Dependency upgrades
- crawler-commons 1.0 (#693)
- okhttp 3.14.0 (#692)
- guava 27.1 (#702)
- icu4j 64.1 (#702)
- httpclient 4.5.8 (#702)
- Snakeyaml 1.24 (#702)
- wiremock 2.22.0 (#702)
- rometools 1.12.0 (#702)
- Elasticsearch 7.0.0 (#708)
Core
- Track how long a spout has been without any URLs in its buffer (#685)
- Change ack mechanism for StatusUpdaterBolts (#689)
- Robots URL filter to get instructions from cache only (#700)
- Allow indexing under canonical URL if in the same domain, not just host (#703)
- /bugfix/ URLs ending with a space are fetched over and over again (#704)
- ParseFilter to normalise the mime-type of documents into simple values (#707)
- Robot rules should check the cache in case of a redirection (#709)
- /bugfix/ Fix the logic around sitemap = false (#710)
- Reduce logging of exceptions in FetcherBolt (#719)
Elasticsearch
- Asynchronous spouts (i.e. ES) can send queries after max delay since previous one ended (#683)
- StatusUpdaterBolt to load config from non-default param names (#687)
- Add a ScrollSpout to read all the documents from a shard (#688 and #690); our guest post shows how this can be used to reindex a status index.
- ES IndexerBolt: check success of batches before acking tuples (#647)
- /bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
- /bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
- /bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
- MetricsConsumer to include topology ID in metrics (#714)
WARC
Tika
- Set mimetype whitelist for Tika Parser (#712)
*********
I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and giving a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.
As usual, thanks to all contributors and users.
Happy crawling!