Tuesday 28 November 2017

What's new in StormCrawler 1.7



Amazingly this is the 20th release of StormCrawler! Here are the main changes:

Dependency updates
  • crawler-commons 0.9 #513
Core
  • (bugfix) ParserBolts should use outlinks from parsefilters #498
  • LD_JSON parsefilter #501
  • okhttp : store request and response headers verbatim in metadata #506
  • (bugfix) okhttp protocol does not store headers in metadata #507
  • HTTP clients should handle http.accept.language and http.accept #499
  • Selenium protocol follows redirections #514
  • RemoteDriverProtocol needs multiple instances #505
  • SitemapParserBolt should force mime-type based on the clue #515
Elasticsearch
  • ES Spout : define filter query via config #502
  • Upgrade to ES 6.0 #517
We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the StormCrawler Elasticsearch module while upgrading StormCrawler core.

This version improves the processing of sitemaps, via #515 and the use of crawler-commons 0.9, in which we fixed the SAX parsing and extended its coverage. We also improved our okhttp-based protocol implementation. If your crawl is a wide one with potentially any sort of content, you should choose okhttp over the default httpclient implementation. See our comparison of protocol implementations on the wiki.
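For readers who build their topology configuration in code, here is a minimal sketch of what switching to okhttp and setting the new header options from #499 might look like. The keys http.accept and http.accept.language come from the issue title above. The keys http.protocol.implementation and https.protocol.implementation, the protocol class name, and the header values are assumptions for illustration, so check them against the wiki and your StormCrawler version. The same entries would normally live in your YAML configuration file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OkHttpConfSketch {

    // Assumed class name of the okhttp-based protocol implementation;
    // verify it against the StormCrawler version you are running.
    static final String OKHTTP_PROTOCOL =
            "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol";

    /** Builds a plain map holding the configuration entries discussed above. */
    public static Map<String, Object> build() {
        Map<String, Object> conf = new LinkedHashMap<>();
        // Assumed keys for selecting the protocol implementation per scheme.
        conf.put("http.protocol.implementation", OKHTTP_PROTOCOL);
        conf.put("https.protocol.implementation", OKHTTP_PROTOCOL);
        // Keys named in #499; the values are only examples.
        conf.put("http.accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        conf.put("http.accept.language", "en-us,en;q=0.7,*;q=0.3");
        return conf;
    }

    public static void main(String[] args) {
        // Print the entries in a YAML-like form for easy copy and paste.
        for (Map.Entry<String, Object> e : build().entrySet()) {
            System.out.println(e.getKey() + ": \"" + e.getValue() + "\"");
        }
    }
}
```

Since a Storm configuration is itself a map, entries like these can be merged into it before submitting the topology.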

Finally, if you want to extract semantic data represented in ld-json then you'll love #501.
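To give an idea of the kind of data involved, here is a rough, standalone sketch that pulls the content of script elements of type application/ld+json out of a page. It is not the implementation from #501: the class name LdJsonSketch and the regex-based approach are purely illustrative, and a real parse filter would work on the parsed DOM and a JSON parser instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LdJsonSketch {

    // Matches <script type="application/ld+json"> ... </script> and captures the body.
    // A regex is enough for a demonstration but is not robust HTML parsing.
    private static final Pattern LD_JSON = Pattern.compile(
            "<script[^>]*type\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of every ld-json script block found in the HTML. */
    public static List<String> extract(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = LD_JSON.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"application/ld+json\">"
                + "{\"@type\":\"Article\",\"headline\":\"Hello\"}"
                + "</script></head><body></body></html>";
        for (String block : extract(html)) {
            System.out.println(block);
        }
    }
}
```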

As usual, thanks to all contributors and users. Happy crawling!