Tuesday, 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


Dependency updates
Core
  • Add option to send only N bytes of text to indexers #476
  • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
  • MemorySpout to generate tuples with DISCOVERED status #529
  • OKHttp configure type of proxy #530
  • http.content.limit inconsistent default to -1 #534
  • Track time spent in the FetcherBolt queues #535
  • Increase detect.charset.maxlength default value #537
  • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
  • Use UTF-8 for input encoding of seeds (FileSpout) #542
  • Default URL filter: exclude localhost and private address spaces #543
  • URLStreamGrouping returns the taskIDs and not their index #547
WARC
  • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520
SOLR
  • Schema for status index needs date type for nextFetchDate #544
  • SOLR indexer: use field type text for content field #545
Elasticsearch
  • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
  • Move to Elasticsearch RESTAPi #539
We recommend all users to move to this version as it fixes several bugs (#541#547) and adds some great new features. In particular, the use of the REST API for Elasticsearch, which makes the module future-proof but also easier to configure, but also #535 and #543.

As usual, thanks to all contributors and users. Happy crawling!

No comments:

Post a Comment

Note: only a member of this blog may post a comment.