Thursday 22 November 2018

What's new in StormCrawler 1.12

The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.


As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites, and the changes below definitely help with these. There is also an important bugfix for JSOUP (#653) and various other improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades


  • JSOUP 1.11.3 (#663)
  • Elasticsearch 6.5.0  (#661)
  • Jackson and Wiremock dependencies (#640)

Core

  • Post JSON data with OKHTTP protocol via metadata (#641)
  • Selenium RemoteDriverProtocol triggered by K/V in metadata  (#642)
  • SeleniumProtocol NavigationFilters not reached in case of a redirection (#643)
  • Limit crawl to URLs found in sitemaps  (#645)
  • spout.reset.fetchdate.after based on time when query was set to NOW  (#648)
  • Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
  • redirected sitemaps don't have isSitemap=true  (#660)
  • Staggered scheduling of sitemap URLs (#657)
  • Scheduling -> round to the closest second, minute or hour (#654)
  • FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (#662)

WARC

  • WARC record format: trailing zero byte causes WARC parser to fail  (#652)

Elasticsearch

  • ES IndexerBolt track number of batch sent (#540)
  • Rename index "index" into "docs" (#649)
  • ES StatusMetricsBolt generate metrics for total number of docs (#651)

Coming next...



The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last-minute trip to Lithuania.

As usual, thanks to all contributors and users. Happy crawling!

UPDATE

There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on