DigitalPebble's Blog: July 2020

Monday, 20 July 2020

Please welcome StormCrawler 2.0

Nearly 6 years after its initial release and after another 32 releases, StormCrawler has just reached version 2.0!

This is similar to what we did 4 years ago when 1.0 was released, in that the change of major version reflects the version of Apache Storm that StormCrawler is based on. This is not a major refactoring of StormCrawler in any way, although some minor changes can be found, mainly in the way the topologies are submitted. These changes are documented in the READMEs generated by our archetypes.

In terms of functionality and behavior, StormCrawler 2.0 is similar to version 1.17, which was released a few minutes ago.

I expect to keep both branches in parallel for a bit, at least until StormCrawler 2.0 has been sufficiently tested and is used by the majority of our users.

The change to Apache Storm 2 is not just a way of future-proofing StormCrawler, since version 2 is the current branch in Apache Storm. By adopting Storm 2, we also get a platform that is 100% Java, which makes debugging and possible contributions to Apache Storm itself easier. We also benefit from Storm's recent improvements, such as improved performance and a better backpressure model.

I am looking forward to getting feedback (and bugfixes) from the StormCrawler community. Please give StormCrawler 2.0 a try if you can.

Happy crawling!

What's new in StormCrawler 1.17

I have just released StormCrawler 1.17. As you can see in the list below, it contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version. However, please check the breaking changes below if you apply it to an existing crawl.

Dependency upgrades

Various dependency upgrades #808
CrawlerCommons 1.1 dependency #807
Tika 1.24.1 #797
Jackson-databind #803 #793 #798

Core

Use regular expressions for custom number of threads per queue fetcher #788
/!breaking!/ Prefix protocol metadata #789
Basic authentication for OKHTTP #792
Utility to debug / test parsefilters #794
/!breaking!/ Remove deprecated methods and fields #791
AdaptiveScheduler to set last-modified time in metadata #777 #812
/bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
/bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
/!breaking!/ Index pages with content="noindex,follow" meta tag #750
Enable extension parsing for SitemapParser #749 #815

WARC

Implement WARC spout #755 #799

Elasticsearch

/bugfix/ AggregationSpout error due to SimpleDateFormat not being thread-safe #809
/bugfix/ IndexerBolt issue causing ack failures #801
Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add

protocol.md.prefix: ""

to the configuration. Similarly, http.skip.robots has been renamed to http.robots.file.skip.
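As a rough sketch, the two configuration changes mentioned above could look like this in a crawler configuration file. It assumes the usual layout where settings sit under a top-level config: key, and the value false for the robots setting is only an illustration, so keep whatever value you previously had under the old key:

```yaml
config:
  # Empty prefix for protocol metadata, as recommended above
  # when running SC 1.17 on an existing crawl (#789).
  protocol.md.prefix: ""

  # New key name; replaces http.skip.robots.
  # The value shown here is an assumption for illustration only.
  http.robots.file.skip: false
```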

Thanks to all contributors and users! Happy crawling!

PS: something equally exciting is coming next ;-)