Monday 20 July 2020

What's new in StormCrawler 1.17


I have just released StormCrawler 1.17. As the list below shows, this release contains important bug fixes and improvements, so we recommend that all users upgrade. If you apply it to an existing crawl, please check the breaking changes below first.

Dependency upgrades

  • Various dependency upgrades  #808
  • CrawlerCommons 1.1 dependency #807
  • Tika 1.24.1 #797
  • Jackson-databind  #803 #793 #798

Core

  • Use regular expressions for custom number of threads per queue fetcher #788
  • /!breaking!/ Prefix protocol metadata #789
  • Basic authentication for OKHTTP #792
  • Utility to debug / test parsefilters #794
  • /!breaking!/ Remove deprecated methods and fields #791
  • AdaptiveScheduler to set last-modified time in metadata  #777 #812
  • /bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
  • /bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
  • /!breaking!/ Index pages with content="noindex,follow" meta tag #750
  • Enable extension parsing for SitemapParser #749 #815

WARC



Elasticsearch


  • /bugfix/ AggregationSpout error due to SimpleDateFormat not being thread safe #809
  • /bugfix/ IndexerBolt issue causing ack failures #801
  • Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add

protocol.md.prefix: ""

to the configuration. Similarly, the http.skip.robots key has been renamed to http.robots.file.skip.
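As a rough sketch, the two settings could sit together in the crawler configuration file as shown here. The top-level config: key and the file layout are assumptions based on a typical crawler-conf.yaml, and the value given for http.robots.file.skip is only illustrative; carry over whatever value you previously had under http.skip.robots.

```yaml
# Sketch only: layout assumed from a typical crawler-conf.yaml
config:
  # Empty prefix so that protocol metadata keys stay as an existing crawl expects (#789)
  protocol.md.prefix: ""

  # Renamed from http.skip.robots; the value below is a placeholder, reuse your previous setting
  http.robots.file.skip: false
```

For a brand new crawl, the protocol.md.prefix line should not be needed, since the advice above is aimed at existing crawls.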


Thanks to all contributors and users! Happy crawling! 

PS: something equally exciting is coming next ;-)


