DigitalPebble's Blog: January 2019

Happy new year!

I have just released StormCrawler 1.13, which contains important bug fixes and some nice improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades

Tika 1.20 (#676)

Xerces 2.12.0 (#672)

Guava 27.0.1 (#672)

Elasticsearch 6.5.3 (#672)

Jackson 2.8.11.3 (14e44)

Core

FileSpout uses StringTabScheme by default (#664)

JSoupParserBolt outlink limit per page (#670)

/BUGFIX/ Date format used for HTTP if-modified-since requests must follow RFC7231 (#674)

/BUGFIX/ DeletionBolt expects Metadata from tuples (#675)

Added configurable TextExtractor to JSoupParserBolt (#678)

!BREAKING! Core Spouts should use status stream if withDiscoveredStatus is set to true (#677)
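
For the if-modified-since fix (#674) listed above, the preferred date form in RFC 7231 is the fixed-length IMF-fixdate, always expressed in GMT. The sketch below shows one way to produce it in plain Java; the class and method names are hypothetical and this is not the code used in StormCrawler itself.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class HttpDates {
    // IMF-fixdate from RFC 7231, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
    // A fixed English locale keeps day and month names independent of the JVM default,
    // and "dd" keeps the two-digit day that the format requires.
    private static final DateTimeFormatter IMF_FIXDATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // Formats the time of the last fetch as a value for the If-Modified-Since header.
    static String format(Instant lastFetched) {
        return IMF_FIXDATE.format(lastFetched);
    }

    public static void main(String[] args) {
        System.out.println("If-Modified-Since: " + format(Instant.parse("1994-11-06T08:49:37Z")));
    }
}
```

A custom pattern is used here rather than DateTimeFormatter.RFC_1123_DATE_TIME because the latter prints single-digit days without a leading zero.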

SQL

SQL IndexerBolt (#608)

Archetype

Archetype sets StormCrawler version in a property (#668)

Replace ContentFilter with TextExtractor (#678)

Apart from the changes to the core spouts (#664 and #677), the main new feature is the TextExtractor (#678) for the JSoupParserBolt. Unlike the ContentFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter, as it operates directly on the objects generated by Jsoup. The TextExtractor can restrict the text to specific elements, which avoids boilerplate and navigation elements, and it produces far cleaner text than the ContentFilter, which merges some tokens. It can also define exclusion zones, which are applied either to the restricted zones or to the whole document if no such zones were defined or found. This is useful, for instance, to remove SCRIPT or STYLE elements.
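
As an illustration, a crawler configuration using both kinds of zones might look like the sketch below. The key names are the ones I recall from the default configuration shipped with StormCrawler and should be checked against the version in use; the selectors are placeholders for whatever marks the main content on the pages being crawled.

```yaml
  # Assumed key names; verify against crawler-default.yaml for your version.
  # Text is taken from the first of these zones found in the document.
  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - ARTICLE

  # Elements whose content is dropped, within the zones above
  # or in the whole document if no zone was defined or found.
  textextractor.exclude.tags:
   - STYLE
   - SCRIPT
```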

As usual, thanks to all contributors and users, and particularly the Government of Northwest Territories in Canada who kindly donated some of the code of the TextExtractor.

Happy crawling!

Sunday, 6 January 2019

What's new in StormCrawler 1.13
