DigitalPebble's Blog: What's new in StormCrawler 1.18

StormCrawler 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).

This version contains many bugfixes; as usual, users are advised to upgrade. One of the most noticeable new features is a module for URLFrontier (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.

1.18 is also likely to be the last release based on Apache Storm 1.x; our 2.x branch will become master as soon as I have released 2.1.

Happy crawling and thanks to our sponsors, contributors and users!

Dependency upgrades

Tika 1.26 #869
ICU4J 68.2 #855
Httpclient 4.5.13 #855
Rometools 1.15.0 #855
okhttp 4.9.1 #855
SOLR 8.8.0 #855

Core

FileSpout doesn't replay failed tuples? #816
Simplify indexer config when the metadata key is the same as the field #819
HttpHeaders#formatDate fails to parse date and returns always an empty string #821
HTTP date formatter to follow RFC 7231 #820
HTTP protocol implementation: allow to configure which protocol version(s) to use #827
FileSpout: ClassCastException if message ID of failed tuples is not of type byte[] #826
Fetcher logQueuesContent won't be called if no new tuples are getting in #838
Set user-agent as a one liner #846
Add option to completely skip text extraction #848
Provide option for a faster charset detection strategy #849
BREAKING CHANGE Scheduler implementations return an Optional<Date> #866
Jsoupfilters #877
Add JSoup specific parse filters #847
Need a more reliable detection of whether a document has been already parsed by Jsoup #875
Default setting for 'selenium.pageLoadTimeout' leads to 'InvalidArgumentException' when using Selenium #882
Track time spent in DNS resolution by OKHTTP #878
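
For anyone with a custom scheduler, the breaking change in #866 is the one most likely to need code changes. Here is a minimal sketch of the idea, not StormCrawler's actual classes: the SimpleScheduler interface, the DAILY_OR_NEVER policy, the toDocument helper and the field names are all hypothetical stand-ins, and the sketch assumes that an empty Optional means "never refetch". The point is that callers no longer get a far-future placeholder date and simply skip the field when no date is returned.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SchedulerSketch {

    /** Hypothetical stand-in for a scheduler; NOT StormCrawler's actual class. */
    interface SimpleScheduler {
        // Assumption: an empty Optional means "never refetch this URL".
        Optional<Date> schedule(String status, Instant now);
    }

    /** Illustrative policy: refetch FETCHED pages after one day, never refetch the rest. */
    static final SimpleScheduler DAILY_OR_NEVER = (status, now) ->
            "FETCHED".equals(status)
                    ? Optional.of(Date.from(now.plus(Duration.ofDays(1))))
                    : Optional.empty();

    /** Builds the fields to store for a URL; nextFetchDate is left out when absent. */
    static Map<String, Object> toDocument(String url, String status, Instant now) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("url", url);
        doc.put("status", status);
        // No sentinel date: the field is only written when a date was scheduled.
        DAILY_OR_NEVER.schedule(status, now)
                .ifPresent(date -> doc.put("nextFetchDate", date));
        return doc;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2021-05-05T00:00:00Z");
        System.out.println(toDocument("https://example.com/", "FETCHED", now));
        System.out.println(toDocument("https://example.com/gone", "ERROR", now));
    }
}
```

If you maintain your own scheduler, check the signature in the 1.18 sources before porting; the sketch only illustrates the Optional-based contract.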

Archetypes

Archetypes use okhttp protocol #845
Archetypes generate topologies with Tika parsing #858
Add MimeTypeNormalization parse filter to topologies generated from archetypes #860

Elasticsearch

Can't skip text or url fields in indexing #818
Elasticsearch IndexerBolt: tuples with canonical URL may not get acked #832
Add JUnit tests for ES module #834
JUnit tests for ES + tuples with canonical URL may not get acked #836
StatusUpdaterBolt should use timeField() to index nextFetchDate? #824
Add Deletion bolt to Flux version of the Elasticsearch topo from the archetype #859
Do not generate a nextFetchDate at all if the scheduling is set to NEVER #861

WARC

WARCSpout: add metadata field "fetch.statusCode" (HTTP status code) #823
WARCSpout/FileSpout: ClassCastException if message ID of failed tuples is not of type byte[] #826
WARCSpout to add _request.time_ to metadata #831
WARCSpout doesn't handle http.content.limit -1 correctly #850
WARCSpout: IllegalArgumentException if http.content.limit == -1 #833

Urlfrontier

Add URLFrontier module #865 #868
Spout to stream incoming results instead of using a blocking call #879

Wednesday, 5 May 2021
