Wednesday 5 May 2021

What's new in StormCrawler 1.18

 
StormCrawler 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).

This version contains many bug fixes and, as usual, users are advised to upgrade. One of the most noticeable new features is a module for URLFrontier (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.

1.18 is also likely to be the last release based on Apache Storm 1.x; our 2.x branch will become master as soon as I have released 2.1.

Happy crawling and thanks to our sponsors, contributors and users!

Dependency upgrades


Core


  • FileSpout doesn't replay failed tuples? #816

  • Simplify indexer config when the metadata key is the same as the field #819 

  • HttpHeaders#formatDate fails to parse date and returns always an empty string #821

  • HTTP date formatter to follow RFC 7231 #820

  • HTTP protocol implementation: allow to configure which protocol version(s) to use #827 

  • FileSpout: ClassCastException if message ID of failed tuples is not of type byte[] #826

  • Fetcher logQueuesContent won't be called if no new tuples are getting in #838

  • Set user-agent as a one liner #846

  • Add option to completely skip text extraction #848

  • Provide option for a faster charset detection strategy #849

  • BREAKING CHANGE: Scheduler implementations return an Optional<Date> #866

  • Jsoupfilters #877

  • Add JSoup specific parse filters #847

  • need a more reliable detection of whether a document has been already parsed by Jsoup #875

  • Default setting for 'selenium.pageLoadTimeout' leads to 'InvalidArgumentException' when using Selenium  #882

  • Track time spent in DNS resolution by OKHTTP #878
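
To give an idea of what the RFC 7231 entries above (#820, #821) are about, here is a minimal sketch of formatting and parsing an HTTP date in the preferred IMF-fixdate form using java.time. This is not the code from StormCrawler's HttpHeaders class; the class and method names (HttpDateSketch, format, parse) are hypothetical and only serve the illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration only, not the StormCrawler HttpHeaders class
class HttpDateSketch {

    // IMF-fixdate as described in RFC 7231, e.g. "Sun, 06 Nov 1994 08:49:37 GMT":
    // English day and month names, two-digit day, always expressed in GMT
    private static final DateTimeFormatter IMF_FIXDATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // Formats an instant for use in headers such as If-Modified-Since
    static String format(Instant instant) {
        return IMF_FIXDATE.format(instant);
    }

    // Parses an IMF-fixdate value; throws DateTimeParseException on malformed input
    static Instant parse(String headerValue) {
        return IMF_FIXDATE.parse(headerValue, Instant::from);
    }

    public static void main(String[] args) {
        // prints "Thu, 01 Jan 1970 00:00:00 GMT"
        System.out.println(format(Instant.ofEpochSecond(0)));
    }
}
```

Pinning the locale and the zone explicitly is the point of the sketch: a formatter that depends on the JVM defaults can produce dates that servers reject.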


Archetypes


  • Archetypes use okhttp protocol #845

  • Archetypes generate topologies with Tika parsing #858

  • Add MimeTypeNormalization parse filter to topologies generated from archetypes #860


Elasticsearch


  • Can't skip text or url fields in indexing #818 

  • Elasticsearch IndexerBolt: tuples with canonical URL may not get acked #832

  • Add JUnit tests for ES module #834

  • JUnit tests for ES + tuples with canonical URL may not get acked #836

  • StatusUpdaterBolt should use timeField() to index nextFetchDate? #824

  • Add Deletion bolt to Flux version of the Elasticsearch topo from the archetype #859

  • Do not generate a nextFetchDate at all if the scheduling is set to NEVER #861
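
The scheduler-related entries (#866 under Core and #861 above) can be pictured with a small self-contained sketch: a scheduler returns an Optional<Date>, and an empty Optional stands for "never refetch", so no nextFetchDate needs to be generated. I am assuming here that this is how the two changes fit together. Everything in the block (SchedulerSketch, FixedDelayScheduler, the NEVER constant, the status strings) is a hypothetical stand-in and not the actual StormCrawler API, so check the Scheduler class in the release before adapting your own implementation.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Optional;

// Hypothetical stand-in for illustration only, not the StormCrawler Scheduler class
class SchedulerSketch {

    // Assumed convention for this sketch: a negative delay means "never refetch"
    static final int NEVER = -1;

    static class FixedDelayScheduler {
        private final int fetchedDelayMins;
        private final int errorDelayMins;

        FixedDelayScheduler(int fetchedDelayMins, int errorDelayMins) {
            this.fetchedDelayMins = fetchedDelayMins;
            this.errorDelayMins = errorDelayMins;
        }

        // Returns the next fetch date, or an empty Optional if the URL
        // should never be fetched again
        Optional<Date> schedule(String status) {
            int delay = "FETCHED".equals(status) ? fetchedDelayMins : errorDelayMins;
            if (delay == NEVER) {
                return Optional.empty();
            }
            Calendar cal = Calendar.getInstance();
            cal.add(Calendar.MINUTE, delay);
            return Optional.of(cal.getTime());
        }
    }

    // A caller (e.g. a status updater) only writes a date when one is present
    static String describe(Optional<Date> nextFetch) {
        return nextFetch.map(d -> "nextFetchDate=" + d).orElse("no nextFetchDate");
    }

    public static void main(String[] args) {
        FixedDelayScheduler scheduler = new FixedDelayScheduler(1440, NEVER);
        System.out.println(describe(scheduler.schedule("FETCHED")));
        System.out.println(describe(scheduler.schedule("ERROR")));
    }
}
```

If you maintain a custom scheduler, the practical consequence of the breaking change is that callers have to deal with the empty case explicitly instead of relying on a far-future date.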


WARC


  • WARCSpout: add metadata field "fetch.statusCode" (HTTP status code) #823

  • WARCSpout/FileSpout: ClassCastException if message ID of failed tuples is not of type byte[] #826

  • WARCSpout to add _request.time_ to metadata #831 

  • WARCSpout doesn't handle http.content.limit -1 correctly #850

  • WARCSpout: IllegalArgumentException if http.content.limit == -1 #833


URLFrontier


  • Add URLFrontier module #865 #868

  • Spout to stream incoming results instead of using a blocking call #879
