Monday, 20 July 2020

Please welcome StormCrawler 2.0

Nearly 6 years after its initial release and after another 32 releases, StormCrawler has just reached version 2.0! 

This is similar to what we did 4 years ago when 1.0 was released, in that the change of major version reflects the version of Apache Storm that StormCrawler is based on. This is not a major refactoring of StormCrawler in any way, although some minor changes can be found, mainly in the way the topologies are submitted. These changes are documented in the READMEs generated by our archetypes.

In terms of functionality and behavior, StormCrawler 2.0 is similar to version 1.17, which was released a few minutes ago.

I expect to keep both branches in parallel for a bit, at least until StormCrawler 2.0 has been sufficiently tested and is used by the majority of our users.

The change to Apache Storm 2 is not just a way of future-proofing StormCrawler, since version 2 is the current branch in Apache Storm. By adopting Storm 2, we also get a platform that is 100% Java, which makes debugging and possible contributions to Apache Storm itself easier, and we benefit from Storm's recent improvements such as better performance and an improved backpressure model.

I am looking forward to getting feedback (and bugfixes) from the StormCrawler community. Please give StormCrawler 2.0 a try if you can.

Happy crawling! 




What's new in StormCrawler 1.17


I have just released StormCrawler 1.17. As you can see in the list below, it contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version. However, please check the breaking changes below if you apply it to an existing crawl.

Dependency upgrades

  • Various dependency upgrades  #808
  • CrawlerCommons 1.1 dependency #807
  • Tika 1.24.1 #797
  • Jackson-databind  #803 #793 #798

Core

  • Use regular expressions for custom number of threads per queue fetcher #788
  • /!breaking!/ Prefix protocol metadata #789
  • Basic authentication for OKHTTP #792
  • Utility to debug / test parsefilters #794
  • /!breaking!/ Remove deprecated methods and fields #791
  • AdaptiveScheduler to set last-modified time in metadata #777 #812
  • /bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
  • /bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
  • /!breaking!/ Index pages with content="noindex,follow" meta tag #750
  • Enable extension parsing for SitemapParser #749 #815

WARC



Elasticsearch


  • /bugfix/ AggregationSpout error due to SimpleDateFormat not being thread safe #809
  • /bugfix/ IndexerBolt issue causing ack failures #801
  • Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add

protocol.md.prefix: ""

to the configuration. Similarly, the http.skip.robots setting has been renamed to http.robots.file.skip, so update that key if you use it.
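
As a rough sketch, the two settings mentioned above might sit together in the crawler configuration as follows. The file name crawler-conf.yaml, the top-level config: key and the value false are assumptions for illustration only; keep whatever value you previously had for http.skip.robots.

```yaml
# crawler-conf.yaml (assumed file name), fragment only
config:
  # An empty prefix keeps protocol metadata keys as they were before 1.17,
  # so documents from an existing crawl stay consistent (see #789).
  protocol.md.prefix: ""

  # Renamed in 1.17; this replaces the old http.skip.robots key.
  # The value shown is illustrative, carry over your previous setting.
  http.robots.file.skip: false
```

For a brand new crawl the first setting should not be needed, since there is no older metadata to stay compatible with.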


Thanks to all contributors and users! Happy crawling! 

PS: something equally exciting is coming next ;-)