Friday, 8 September 2017

What's new in StormCrawler 1.6


Dependency updates

  • jsoup 1.10.3
  • crawler-commons 0.8

Core

  • Use ISO representation of time for modifiedtime in AdaptiveScheduler #496
  • Use ISO representation of time for discoveryDate and lastProcessedDate #477
  • Improved Charset Detection #495
  • SitemapParserBolt can be configured to use SAX parsing or not
  • SitemapParserBolt generates metrics for average processing time
  • HTTP protocol implementation based on OkHttp #484
  • Apache HTTP client can use the HEAD method on a per-URL basis #485
  • ContentFilter leaves a trace of the pattern that matched #480
  • Metadata has a new public method for getting the first non-empty value from a set of keys
  • Added ARTICLE to patterns for content filter
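
For readers unfamiliar with the ISO representation of time mentioned in the first two items above, here is a minimal sketch using only the standard java.time API. The class and method names (IsoTimestamps, toIso, fromIso) are hypothetical and are not taken from the StormCrawler code base; the exact formatter used by StormCrawler is not stated in these notes, so ISO_INSTANT is an assumption.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of StormCrawler: shows what an ISO-8601
// timestamp looks like compared with a raw epoch value.
final class IsoTimestamps {

    // Format an epoch-millisecond timestamp as an ISO-8601 instant in UTC.
    static String toIso(long epochMillis) {
        return DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(epochMillis));
    }

    // Parse an ISO-8601 instant back into epoch milliseconds.
    static long fromIso(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    public static void main(String[] args) {
        String iso = toIso(0L);
        System.out.println(iso);          // 1970-01-01T00:00:00Z
        System.out.println(fromIso(iso)); // 0
    }
}
```

A string in this form is readable in logs and sorts chronologically, which is one plausible reason to prefer it over a numeric timestamp in metadata.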

LangID

  • Can add more than one language code, based on a configurable probability threshold #481
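
The idea of keeping several language codes above a probability threshold can be sketched as follows. This is an illustration only: the class LangThreshold, the method select and the sample probabilities are hypothetical, and the actual LangID module may select and store the codes differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the StormCrawler implementation.
final class LangThreshold {

    // Keep every language code whose probability is at least the threshold.
    // The iteration order of the input map is preserved.
    static List<String> select(Map<String, Double> probabilities, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up detector output for a mixed-language page.
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("en", 0.62);
        probabilities.put("fr", 0.35);
        probabilities.put("de", 0.03);
        System.out.println(select(probabilities, 0.3)); // [en, fr]
    }
}
```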

WARC

  • Added rotation policy based on time and file size
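
A rotation policy that combines time and file size can be pictured roughly as below. This is a self-contained sketch under stated assumptions: the class TimeSizeRotation and its methods are hypothetical, it does not use the Storm or StormCrawler interfaces, and the current time is passed in explicitly to keep the example deterministic.

```java
// Hypothetical sketch, not the StormCrawler WARC implementation.
final class TimeSizeRotation {
    private final long maxBytes;
    private final long maxAgeMillis;
    private long bytesSinceRotation = 0;
    private long lastRotationMillis;

    TimeSizeRotation(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.lastRotationMillis = nowMillis;
    }

    // Record a write and report whether the file should now be rotated,
    // i.e. whether either the size limit or the age limit has been reached.
    boolean mark(long bytesWritten, long nowMillis) {
        bytesSinceRotation += bytesWritten;
        boolean tooBig = bytesSinceRotation >= maxBytes;
        boolean tooOld = nowMillis - lastRotationMillis >= maxAgeMillis;
        return tooBig || tooOld;
    }

    // Call after the file has been rotated to start counting afresh.
    void reset(long nowMillis) {
        bytesSinceRotation = 0;
        lastRotationMillis = nowMillis;
    }

    public static void main(String[] args) {
        TimeSizeRotation policy = new TimeSizeRotation(100, 1000, 0);
        System.out.println(policy.mark(60, 10));  // false: 60 bytes, 10 ms old
        System.out.println(policy.mark(50, 20));  // true: 110 bytes exceeds the size limit
        policy.reset(20);
        System.out.println(policy.mark(1, 1020)); // true: 1000 ms since the last rotation
    }
}
```

Whichever limit is hit first triggers the rotation, so small crawls still produce files at regular intervals while large ones stay within the size limit.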

ES

  • Added es.status.reset.fetchdate.after #478
  • Removed Grafana resources; they can be downloaded from the Grafana portal